Overview
• Classification is a data mining technique used to predict categorical class labels. The goal is to accurately predict the class of new records based on a model built from a training data set.
• Decision trees are a popular classification technique that uses a tree structure to represent classification rules. Nodes in the tree represent attributes, and branches represent attribute values that lead to leaf nodes, which assign a class.
• The decision tree is built using a top-down greedy algorithm that recursively partitions the training data based on attribute tests. The attribute that best splits the data is selected at each node. The tree classifies a new record by starting at the root and moving down the tree according to the record's attribute values.


Data Mining:

Concepts and Techniques

Classification: Basic Concepts


Classification: Definition
• Given a collection of records (training set)
  – Each record contains a set of attributes; one of the attributes is the class.
• Find a model for the class attribute as a function of the values of the other attributes.
• Goal: previously unseen records should be assigned a class as accurately as possible.
  – A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets; the training set is used to build the model and the test set is used to validate it (see the sketch below).
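As a concrete illustration of this workflow, here is a minimal Python sketch, assuming scikit-learn is available; the iris data set and the 1/3 test split are illustrative choices, not part of the slides.

# Minimal sketch of the train/test workflow described above (assumes scikit-learn).
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)               # records + class attribute

# Split the given data set into training and test sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=1/3, random_state=0)

# Build the model from the training set ...
model = DecisionTreeClassifier().fit(X_train, y_train)

# ... and use the held-out test set to estimate its accuracy.
print("test accuracy:", accuracy_score(y_test, model.predict(X_test)))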
Illustrating Classification Task

Training data (NAME, RANK, YEARS, TENURED) is fed to a classifier; the learned model is then applied to unseen data such as (Jeff, Professor, 4) to answer "Tenured?".

NAME     RANK            YEARS  TENURED
Tom      Assistant Prof  2      no
Merlisa  Associate Prof  7      no
George   Professor       5      yes
Joseph   Assistant Prof  7      yes
Examples of Classification Task
• Predicting tumor cells as benign or malignant
• Classifying credit card transactions as legitimate or fraudulent
• Classifying secondary structures of protein as alpha-helix, beta-sheet,
or random coil
• Categorizing news stories as finance, weather, entertainment, sports,
etc
Classification Techniques
• Decision Tree based Methods
• Rule-based Methods
• Memory-based reasoning
• Neural Networks
• Naïve Bayes and Bayesian Belief Networks
• Support Vector Machines
Decision Tree Induction: An Example

• Training data set: Buys_computer

age    income  student  credit_rating  buys_computer
<=30   high    no       fair           no
<=30   high    no       excellent      no
31…40  high    no       fair           yes
>40    medium  no       fair           yes
>40    low     yes      fair           yes
>40    low     yes      excellent      no
31…40  low     yes      excellent      yes
<=30   medium  no       fair           no
<=30   low     yes      fair           yes
>40    medium  yes      fair           yes
<=30   medium  yes      excellent      yes
31…40  medium  no       excellent      yes
31…40  high    yes      fair           yes
>40    medium  no       excellent      no

• Resulting tree:

age?
├─ <=30   → student?        (no → no,  yes → yes)
├─ 31…40  → yes
└─ >40    → credit rating?  (excellent → no,  fair → yes)
Algorithm for Decision Tree Induction
• Basic algorithm (a greedy algorithm)
  – Tree is constructed in a top-down recursive divide-and-conquer manner
  – At start, all the training examples are at the root
  – Attributes are categorical (if continuous-valued, they are discretized in advance)
  – Examples are partitioned recursively based on selected attributes
  – Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain)
• Conditions for stopping partitioning
  – All samples for a given node belong to the same class
  – There are no remaining attributes for further partitioning (majority voting is then used to label the leaf)
  – There are no samples left
Example of a Decision Tree

Training data (Refund and Marital Status are categorical, Taxable Income is continuous, Cheat is the class):

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Model (decision tree); Refund, MarSt, and TaxInc are the splitting attributes:

Refund?
├─ Yes → NO
└─ No  → MarSt?
         ├─ Married           → NO
         └─ Single, Divorced  → TaxInc?
                                ├─ < 80K → NO
                                └─ > 80K → YES
Another Example of Decision Tree

The same training data (categorical, categorical, continuous attributes; class = Cheat) also fits the following tree, which splits on MarSt first:

MarSt?
├─ Married           → NO
└─ Single, Divorced  → Refund?
                       ├─ Yes → NO
                       └─ No  → TaxInc?
                                ├─ < 80K → NO
                                └─ > 80K → YES

There could be more than one tree that fits the same data!
Decision Tree Classification Task

Induction: a tree-induction algorithm learns a model (the decision tree) from the training set. Deduction: the learned model is then applied to the test set.

Training Set:
Tid  Attrib1  Attrib2  Attrib3  Class
1    Yes      Large    125K     No
2    No       Medium   100K     No
3    No       Small    70K      No
4    Yes      Medium   120K     No
5    No       Large    95K      Yes
6    No       Medium   60K      No
7    Yes      Large    220K     No
8    No       Small    85K      Yes
9    No       Medium   75K      No
10   No       Small    90K      Yes

Test Set (classes to be predicted by applying the model):
Tid  Attrib1  Attrib2  Attrib3  Class
11   No       Small    55K      ?
12   Yes      Medium   80K      ?
13   Yes      Large    110K     ?
14   No       Small    95K      ?
15   No       Large    67K      ?
Apply Model to Test Data

Test record:
Refund  Marital Status  Taxable Income  Cheat
No      Married         80K             ?

Start from the root of the tree and follow the branch that matches each attribute value:
• Refund = No → take the "No" branch to the MarSt node
• Marital Status = Married → take the "Married" branch, which is a leaf
• The leaf predicts NO, so assign Cheat = "No"

(The TaxInc test, with branches < 80K and > 80K, is never reached for this record.)

A minimal code version of this walk-through follows.
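The sketch below is my own rendering of the tree above as nested if/else tests (not from the slides); the function name and the choice to express income in plain numbers are illustrative.

# Hand-coded version of the Refund / MarSt / TaxInc tree, applied to
# the test record (Refund=No, Married, 80K).
def classify_cheat(refund, marital_status, taxable_income):
    if refund == "Yes":
        return "No"
    if marital_status == "Married":
        return "No"
    # Single or Divorced: fall through to the Taxable Income test
    return "No" if taxable_income < 80_000 else "Yes"

print(classify_cheat("No", "Married", 80_000))   # -> "No"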
How to Specify Test Condition?
• Depends on attribute types
  – Nominal
  – Ordinal
  – Continuous
• Depends on number of ways to split
  – 2-way split
  – Multi-way split
Methods For Splitting
• A key step in building a decision tree is finding an appropriate test condition for splitting the data
• Categorical attributes
  – The test condition can be expressed as an attribute–value pair (A = v?), whose outcomes are Yes / No, or as a question about the value of an attribute (A?)
Methods For Splitting
• Continuous attributes
  – The test condition can be expressed as a binary decision (A < v?) or (A >= v?), whose outcomes are Yes / No, or as a range query whose outcomes are vi <= A <= vi+1, for i = 1, 2, …, k

Example tests on Taxable Income:
(i) Binary split:     Taxable Income > 80K?  → Yes / No
(ii) Multi-way split: Taxable Income?        → < 10K, [10K, 25K), [25K, 50K), [50K, 80K), > 80K


Splitting Based on Nominal Attributes
• Multi-way split: use as many partitions as distinct values.
  CarType → Family | Sports | Luxury
• Binary split: divides values into two subsets; need to find the optimal partitioning.
  CarType → {Sports, Luxury} | {Family}    OR    CarType → {Family, Luxury} | {Sports}
Splitting Based on Ordinal Attributes
• Multi-way split: use as many partitions as distinct values.
  Size → Small | Medium | Large
• Binary split: divides values into two subsets; need to find the optimal partitioning.
  Size → {Small, Medium} | {Large}    OR    Size → {Medium, Large} | {Small}
• What about this split?  Size → {Small, Large} | {Medium}
  (It violates the order of the attribute values, so it is normally not allowed.)
Splitting Based on Continuous Attributes
• Different ways of handling
  – Discretization to form an ordinal categorical attribute
    • Static – discretize once at the beginning
    • Dynamic – ranges can be found by equal-interval bucketing, equal-frequency bucketing (percentiles), or clustering
  – Binary decision: (A < v) or (A >= v)
    • consider all possible splits and find the best cut (see the sketch below)
    • can be more compute-intensive
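As an illustration of "consider all possible splits", a common convention (an assumption here, not stated in the slides) is to take the midpoints between consecutive sorted distinct values as the candidate cut points v.

# Candidate cut points for a binary split A < v on a continuous attribute.
income = [60, 70, 75, 85, 90, 95, 100, 120, 125, 220]   # Taxable Income (K)
vals = sorted(set(income))
cuts = [(a + b) / 2 for a, b in zip(vals, vals[1:])]
print(cuts)   # each candidate cut would then be scored with Gini or entropy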
Decision Tree Algorithms
• The basic idea behind any decision tree algorithm is as follows:
– Choose the best attribute(s) to split the remaining instances and
make that attribute a decision node
– Repeat this process recursively for each child
– Stop when:
• All the instances have the same target attribute value
• There are no more attributes
• There are no more instances
Algorithm for Decision Tree Induction
(pseudocode)
Algorithm GenDecTree(Sample S, Attlist A)
1. Create a node N
2. If all samples are of the same class C, then label N with C; terminate
3. If A is empty, then label N with the most common class C in S (majority voting); terminate
4. Select a ∈ A with the highest information gain; label N with a
5. For each value v of a:
   a. Grow a branch from N with condition a = v
   b. Let Sv be the subset of samples in S with a = v
   c. If Sv is empty, then attach a leaf labeled with the most common class in S
   d. Else attach the node generated by GenDecTree(Sv, A − {a})
A Python sketch of this pseudocode is given below.
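The following is a rough Python rendering of the pseudocode, included as an illustration rather than as the canonical algorithm; the helper names (entropy, information_gain, gen_dec_tree), the dict-based records, and the dict-based tree representation are my own assumptions. Information gain is used as the selection measure, as in step 4.

from collections import Counter
import math

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(samples, labels, attr):
    # Entropy before the split minus the weighted entropy of the partitions.
    by_value = {}
    for s, y in zip(samples, labels):
        by_value.setdefault(s[attr], []).append(y)
    n = len(labels)
    return entropy(labels) - sum(len(p) / n * entropy(p) for p in by_value.values())

def gen_dec_tree(samples, labels, attrs):
    if len(set(labels)) == 1:                      # step 2: all samples in one class
        return labels[0]
    if not attrs:                                  # step 3: no attributes left -> majority vote
        return Counter(labels).most_common(1)[0][0]
    best = max(attrs, key=lambda a: information_gain(samples, labels, a))   # step 4
    node = {best: {}}
    for v in set(s[best] for s in samples):        # step 5: one branch per observed value
        idx = [i for i, s in enumerate(samples) if s[best] == v]
        # Because we branch only on observed values, Sv is never empty here (step 5c).
        node[best][v] = gen_dec_tree([samples[i] for i in idx],
                                     [labels[i] for i in idx],
                                     [a for a in attrs if a != best])
    return node

# Toy usage with dict-valued records (attribute -> value):
data = [{"age": "<=30", "student": "no"}, {"age": "<=30", "student": "yes"},
        {"age": "31…40", "student": "no"}]
print(gen_dec_tree(data, ["no", "yes", "yes"], ["age", "student"]))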
Splitting Criterion
• There are many test conditions one could apply to partition a
collection of records into smaller subsets
• Various measures are available to determine which test condition
provides the best split
− Gini Index
− Entropy / Information Gain
− Classification Error
Attribute Selection Measure
• Gini index (IBM IntelligentMiner)
– All attributes are assumed continuous-valued
– Assume there exist several possible split values for each attribute
– May need other tools, such as clustering, to get the possible split values
– Can be modified for categorical attributes
• Information gain (ID3/C4.5)
– All attributes are assumed to be categorical
– Can be modified for continuous-valued attributes
Splitting Criterion: GINI
• To determine the best split, the Gini index is used
• Gini index for a given training set T:

    GINI(T) = 1 − Σi pi²

  where pi is the relative frequency of class i in T.
• GINI(T) is minimized (0.0) when all records in T belong to one class (the class distribution is maximally skewed), and maximized when the classes are equally distributed:

  C1: 0, C2: 6 → Gini = 0.000
  C1: 1, C2: 5 → Gini = 0.278
  C1: 2, C2: 4 → Gini = 0.444
  C1: 3, C2: 3 → Gini = 0.500
Examples for computing GINI

    GINI(T) = 1 − Σi [p(i|T)]²

C1: 0, C2: 6   P(C1) = 0/6 = 0, P(C2) = 6/6 = 1
               Gini = 1 − P(C1)² − P(C2)² = 1 − 0 − 1 = 0

C1: 1, C2: 5   P(C1) = 1/6, P(C2) = 5/6
               Gini = 1 − (1/6)² − (5/6)² = 0.278

C1: 2, C2: 4   P(C1) = 2/6, P(C2) = 4/6
               Gini = 1 − (2/6)² − (4/6)² = 0.444

C1: 3, C2: 3   P(C1) = 3/6, P(C2) = 3/6
               Gini = 1 − (3/6)² − (3/6)² = 0.5

A small function reproducing these values is sketched below.
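This short sketch (not part of the slides; the function name is my own) computes the Gini index from the class counts and reproduces the four examples above.

def gini(counts):
    # Gini index of a node given its class counts.
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

for counts in ([0, 6], [1, 5], [2, 4], [3, 3]):
    print(counts, round(gini(counts), 3))   # 0.0, 0.278, 0.444, 0.5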
Splitting Based on GINI
• Used in Classification And Regression Trees (CART), Supervised Learning In Quest (SLIQ), and Scalable Parallelizable Induction of decision Trees (SPRINT).
• Splitting criterion: minimize the Gini index of the split.
• When a node p is split into k partitions (children), the quality of the split is computed as

    GINIsplit = Σi=1..k (ni / n) · GINI(i)

  where ni = number of records at child i, and n = number of records at node p.
• The attribute providing the smallest GINIsplit is chosen to split the node.
Binary Attributes: Computing GINI Index
• Splits into two partitions
• Effect of weighing partitions:

Parent: C1 = 6, C2 = 6, Gini = 0.500

Split on B?  Yes → Node N1,  No → Node N2

        N1  N2
C1      5   1
C2      2   4

Gini(N1) = 1 − (5/7)² − (2/7)² = 0.408
Gini(N2) = 1 − (1/5)² − (4/5)² = 0.320
Gini(Children) = 7/12 × 0.408 + 5/12 × 0.320 = 0.371

(A code check of this computation is sketched below.)
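A small sketch (not from the slides) recomputing the split quality above from the children's class counts; gini() is the same helper as before.

def gini(counts):
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

children = [[5, 2], [1, 4]]                  # class counts at N1 and N2
n_total = sum(sum(c) for c in children)      # 12 records at the parent
gini_split = sum(sum(c) / n_total * gini(c) for c in children)
print(round(gini(children[0]), 3),           # 0.408
      round(gini(children[1]), 3),           # 0.32
      round(gini_split, 3))                  # 0.371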
Weather Data: Play or not Play?

Outlook   Temperature  Humidity  Windy  Play?
sunny     hot          high      false  No
sunny     hot          high      true   No
overcast  hot          high      false  Yes
rain      mild         high      false  Yes
rain      cool         normal    false  Yes
rain      cool         normal    true   No
overcast  cool         normal    true   Yes
sunny     mild         high      false  No
sunny     cool         normal    false  Yes
rain      mild         normal    false  Yes
sunny     mild         normal    true   Yes
overcast  mild         high      true   Yes
overcast  hot          normal    false  Yes
rain      mild         high      true   No

Note: Outlook is the weather forecast, no relation to the Microsoft email program.
Example Tree for "Play?"

Outlook?
├─ sunny    → Humidity?  (high → No,  normal → Yes)
├─ overcast → Yes
└─ rain     → Windy?     (true → No,  false → Yes)

• An internal node is a test on an attribute.
• A branch represents an outcome of the test.
• A leaf node represents a class label or class label distribution.
• At each node, one attribute is chosen to split the training examples into classes that are as distinct as possible.
• A new case is classified by following a matching path to a leaf node.
Which attribute to select?
A criterion for attribute selection
• Which is the best attribute?
– The one which will result in the smallest tree
– Heuristic: choose the attribute that produces the “purest” nodes
• Strategy: choose attribute that results in greatest information gain
Splitting Criteria based on INFO
• Entropy at a given node T:

    Entropy(T) = − Σi pi log pi

  – Measures homogeneity of a node.
    • Maximum (log nc) when records are equally distributed among all classes, implying least information
    • Minimum (0.0) when all records belong to one class, implying most information
  – Entropy-based computations are similar to the GINI index computations
Examples for computing Entropy

    Entropy(T) = − Σi pi log2 pi

C1: 0, C2: 6   P(C1) = 0/6 = 0, P(C2) = 6/6 = 1
               Entropy = − 0 log 0 − 1 log 1 = − 0 − 0 = 0

C1: 1, C2: 5   P(C1) = 1/6, P(C2) = 5/6
               Entropy = − (1/6) log2 (1/6) − (5/6) log2 (5/6) = 0.65

C1: 2, C2: 4   P(C1) = 2/6, P(C2) = 4/6
               Entropy = − (2/6) log2 (2/6) − (4/6) log2 (4/6) = 0.92

(A small function reproducing these values is sketched below.)
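This sketch (not from the slides; the function name is my own) computes node entropy from class counts and reproduces the examples above, treating 0·log 0 as 0.

import math

def node_entropy(counts):
    # Entropy of a node given its class counts (0 log 0 is taken as 0).
    n = sum(counts)
    return -sum((c / n) * math.log2(c / n) for c in counts if c > 0)

for counts in ([0, 6], [1, 5], [2, 4]):
    print(counts, round(node_entropy(counts), 2))   # 0.0, 0.65, 0.92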
Splitting Based on INFO
• Information Gain:

    GAINsplit = Entropy(p) − Σi=1..k (ni / n) · Entropy(i)

  Parent node p is split into k partitions; ni is the number of records in partition i
  – Measures the reduction in entropy achieved because of the split. Choose the split that achieves the most reduction (maximizes GAIN)
  – Used in ID3 and C4.5
  – Disadvantage: tends to prefer splits that result in a large number of partitions, each being small but pure.
Splitting Based on INFO
• Gain Ratio:

    GainRATIOsplit = GAINsplit / SplitINFO,   where   SplitINFO = − Σi=1..k (ni / n) log (ni / n)

  Parent node p is split into k partitions; ni is the number of records in partition i
  – Adjusts Information Gain by the entropy of the partitioning (SplitINFO). Higher-entropy partitioning (a large number of small partitions) is penalized!
  – Used in C4.5
  – Designed to overcome the disadvantage of Information Gain
Example: attribute "Outlook"
• "Outlook" = "Sunny":
    info([2,3]) = entropy(2/5, 3/5) = −(2/5) log(2/5) − (3/5) log(3/5) = 0.971 bits
• "Outlook" = "Overcast":
    info([4,0]) = entropy(1, 0) = −1 log(1) − 0 log(0) = 0 bits
    (Note: log(0) is not defined, but we evaluate 0·log(0) as zero)
• "Outlook" = "Rainy":
    info([3,2]) = entropy(3/5, 2/5) = −(3/5) log(3/5) − (2/5) log(2/5) = 0.971 bits
• Expected information for the attribute:
    info([2,3], [4,0], [3,2]) = (5/14) × 0.971 + (4/14) × 0 + (5/14) × 0.971 = 0.693 bits
Computing the information gain
• Information gain: (information before split) − (information after split)

    gain("Outlook") = info([9,5]) − info([2,3], [4,0], [3,2]) = 0.940 − 0.693 = 0.247 bits

• Information gain for the attributes from the weather data:
    gain("Outlook")     = 0.247 bits
    gain("Temperature") = 0.029 bits
    gain("Humidity")    = 0.152 bits
    gain("Windy")       = 0.048 bits

(A sketch that recomputes gain("Outlook") is given below.)
Continuing to split
• Within the "Sunny" branch:
    gain("Humidity")    = 0.971 bits
    gain("Temperature") = 0.571 bits
    gain("Windy")       = 0.020 bits

The final decision tree
• Note: not all leaves need to be pure; sometimes identical instances have different classes
• Splitting stops when the data can't be split any further
Attribute Selection: Information Gain
• Class P: buys_computer = "yes" (9 tuples);  Class N: buys_computer = "no" (5 tuples)

    Info(D) = I(9,5) = −(9/14) log2(9/14) − (5/14) log2(5/14) = 0.940

• Splitting on age (training data as in the earlier buys_computer table):

    age     pi  ni  I(pi, ni)
    <=30    2   3   0.971
    31…40   4   0   0
    >40     3   2   0.971

    Infoage(D) = (5/14) I(2,3) + (4/14) I(4,0) + (5/14) I(3,2) = 0.694

  Here (5/14) I(2,3) means that "age <= 30" covers 5 out of 14 samples, with 2 yes's and 3 no's. Hence

    Gain(age) = Info(D) − Infoage(D) = 0.246

• Similarly,
    Gain(income)        = 0.029
    Gain(student)       = 0.151
    Gain(credit_rating) = 0.048
Splitting the samples using age

age? splits the data into three partitions:

age <= 30:
    income  student  credit_rating  buys_computer
    high    no       fair           no
    high    no       excellent      no
    medium  no       fair           no
    low     yes      fair           yes
    medium  yes      excellent      yes

age 31…40 (all labeled yes):
    income  student  credit_rating  buys_computer
    high    no       fair           yes
    low     yes      excellent      yes
    medium  no       excellent      yes
    high    yes      fair           yes

age > 40:
    income  student  credit_rating  buys_computer
    medium  no       fair           yes
    low     yes      fair           yes
    low     yes      excellent      no
    medium  yes      fair           yes
    medium  no       excellent      no
Gain Ratio for Attribute Selection (C4.5)
• The information gain measure is biased towards attributes with a large number of values
• C4.5 (a successor of ID3) uses gain ratio to overcome the problem (a normalization of information gain):

    SplitInfoA(D) = − Σj=1..v (|Dj| / |D|) log2(|Dj| / |D|)

  – GainRatio(A) = Gain(A) / SplitInfo(A)
• Ex. (income has 4 "high", 6 "medium", and 4 "low" tuples):

    SplitInfoincome(D) = −(4/14) log2(4/14) − (6/14) log2(6/14) − (4/14) log2(4/14) = 1.557

  – gain_ratio(income) = 0.029 / 1.557 = 0.019
• The attribute with the maximum gain ratio is selected as the splitting attribute

(A sketch recomputing these values is given below.)
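A small check (not from the slides) recomputing SplitInfo for income from its partition sizes and the resulting gain ratio.

import math

def split_info(partition_sizes):
    # SplitInfo = - sum over partitions of (n_i / n) log2(n_i / n)
    n = sum(partition_sizes)
    return -sum((s / n) * math.log2(s / n) for s in partition_sizes)

si = split_info([4, 6, 4])          # income = high / medium / low
print(round(si, 3))                 # 1.557
print(round(0.029 / si, 3))         # gain_ratio(income) ≈ 0.019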
Extracting Classification Rules from Trees
• Represent the knowledge in the form of IF-THEN rules
• One rule is created for each path from the root to a leaf
• Each attribute-value pair along a path forms a conjunction
• The leaf node holds the class prediction
• Rules are easier for humans to understand
• Example
IF age = “<=30” AND student = “no” THEN buys_computer = “no”
IF age = “<=30” AND student = “yes” THEN buys_computer = “yes”
IF age = “31…40” THEN buys_computer = “yes”
IF age = “>40” AND credit_rating = “excellent” THEN buys_computer = “no”
IF age = “>40” AND credit_rating = “fair” THEN buys_computer = “yes”
Comparing Attribute Selection Measures
• The three measures, in general, return good results but
– Information gain:
• biased towards multivalued attributes
– Gain ratio:
• tends to prefer unbalanced splits in which one partition is
much smaller than the others
– Gini index:
• biased to multivalued attributes
• has difficulty when # of classes is large
• tends to favor tests that result in equal-sized partitions and
purity in both partitions
Pros and Cons of decision trees
• Pros
  + Reasonable training time
  + Fast application
  + Easy to interpret
  + Easy to implement
  + Can handle a large number of features
• Cons
  – Cannot handle complicated relationships between features
  – Simple decision boundaries
  – Problems with lots of missing data
Overfitting and Tree Pruning
• Overfitting: An induced tree may overfit the training data
– Too many branches, some may reflect anomalies due to noise or outliers
– Poor accuracy for unseen samples
• Two approaches to avoid overfitting
– Prepruning: Halt tree construction early—do not split a node if this
would result in the goodness measure falling below a threshold
• Difficult to choose an appropriate threshold
– Postpruning: Remove branches from a “fully grown” tree—get a
sequence of progressively pruned trees
• Use a set of data different from the training data to decide which is
the “best pruned tree”
Overfitting and Tree Pruning: Approaches to Determine the Final Tree Size
• Separate training (2/3) and testing (1/3) sets
• Use cross validation, e.g., 10-fold cross validation
• Use all the data for training
– but apply a statistical test (e.g., chi-square) to estimate whether
expanding or pruning a node may improve the entire distribution
• Use minimum description length (MDL) principle:
– halting growth of the tree when the encoding is minimized
Enhancements to Basic Decision Tree
Induction
• Allow for continuous-valued attributes
– Dynamically define new discrete-valued attributes that partition the
continuous attribute value into a discrete set of intervals
• Handle missing attribute values
– Assign the most common value of the attribute
– Assign probability to each of the possible values
• Attribute construction
– Create new attributes based on existing ones that are sparsely represented
– This reduces fragmentation, repetition, and replication
Effect of Rule Simplification
• Rules are no longer mutually exclusive
  – A record may trigger more than one rule
  – Solution?
    • Ordered rule set
    • Unordered rule set – use voting schemes
• Rules are no longer exhaustive
  – A record may not trigger any rules
  – Solution?
    • Use a default class
Ordered Rule Set
• Rules are rank-ordered according to their priority
  – An ordered rule set is known as a decision list
• When a test record is presented to the classifier
  – It is assigned to the class label of the highest-ranked rule it has triggered
  – If none of the rules fired, it is assigned to the default class

R1: (Give Birth = no) ∧ (Can Fly = yes) → Birds
R2: (Give Birth = no) ∧ (Live in Water = yes) → Fishes
R3: (Give Birth = yes) ∧ (Blood Type = warm) → Mammals
R4: (Give Birth = no) ∧ (Can Fly = no) → Reptiles
R5: (Live in Water = sometimes) → Amphibians

Name    Blood Type  Give Birth  Can Fly  Live in Water  Class
turtle  cold        no          no       sometimes      ?
Rule Ordering Schemes
• Rule-based ordering
  – Individual rules are ranked based on their quality
• Class-based ordering
  – Rules that belong to the same class appear together

Rule-based Ordering:
  (Refund=Yes) ==> No
  (Refund=No, Marital Status={Single,Divorced}, Taxable Income<80K) ==> No
  (Refund=No, Marital Status={Single,Divorced}, Taxable Income>80K) ==> Yes
  (Refund=No, Marital Status={Married}) ==> No

Class-based Ordering:
  (Refund=Yes) ==> No
  (Refund=No, Marital Status={Single,Divorced}, Taxable Income<80K) ==> No
  (Refund=No, Marital Status={Married}) ==> No
  (Refund=No, Marital Status={Single,Divorced}, Taxable Income>80K) ==> Yes
Building Classification Rules
• Direct method:
  • Extract rules directly from data
  • e.g., RIPPER, CN2, Holte's 1R
• Indirect method:
  • Extract rules from other classification models (e.g., decision trees)
  • e.g., C4.5 rules
Using IF-THEN Rules for Classification
• Represent the knowledge in the form of IF-THEN rules
  R: IF age = youth AND student = yes THEN buys_computer = yes
  – Rule antecedent/precondition vs. rule consequent
• Assessment of a rule: coverage and accuracy
  – ncovers = # of tuples covered by R
  – ncorrect = # of tuples correctly classified by R
  coverage(R) = ncovers / |D|   /* D: training data set */
  accuracy(R) = ncorrect / ncovers
  (see the sketch below)
• If more than one rule is triggered, we need conflict resolution
  – Size ordering: assign the highest priority to the triggering rule that has the "toughest" requirement (i.e., the most attribute tests)
  – Class-based ordering: decreasing order of prevalence or misclassification cost per class
  – Rule-based ordering (decision list): rules are organized into one long priority list, according to some measure of rule quality or by experts
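The sketch below illustrates coverage and accuracy as defined above for rule R; the tiny data set D is invented for illustration and is not the slides' training data.

# Rule R: IF age = youth AND student = yes THEN buys_computer = yes
D = [  # (age, student, buys_computer) -- toy training set
    ("youth", "yes", "yes"), ("youth", "no", "no"),
    ("youth", "yes", "no"),  ("senior", "yes", "yes"),
]
covered = [t for t in D if t[0] == "youth" and t[1] == "yes"]   # tuples covered by R
correct = [t for t in covered if t[2] == "yes"]                 # correctly classified by R
print("coverage(R):", len(covered) / len(D))        # n_covers / |D|      = 0.5
print("accuracy(R):", len(correct) / len(covered))  # n_correct / n_covers = 0.5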
Rule Extraction from a Decision Tree
• Rules are easier to understand than large trees
• One rule is created for each path from the root to a leaf
• Each attribute-value pair along a path forms a conjunction; the leaf holds the class prediction
• Rules are mutually exclusive and exhaustive

• Example: rule extraction from our buys_computer decision tree
  (age? → <=30: student? (no → no, yes → yes);  31..40: yes;  >40: credit rating? (excellent → no, fair → yes))

  IF age = young AND student = no THEN buys_computer = no
  IF age = young AND student = yes THEN buys_computer = yes
  IF age = mid-age THEN buys_computer = yes
  IF age = old AND credit_rating = excellent THEN buys_computer = no
  IF age = old AND credit_rating = fair THEN buys_computer = yes
Rule Induction: Sequential Covering Method
• Sequential covering algorithm: Extracts rules directly from training
data
• Typical sequential covering algorithms: FOIL, AQ, CN2, RIPPER
• Rules are learned sequentially, each for a given class Ci will cover
many tuples of Ci but none (or few) of the tuples of other classes
• Steps:
– Rules are learned one at a time
– Each time a rule is learned, the tuples covered by the rules are
removed
– Repeat the process on the remaining tuples until termination
condition, e.g., when no more training examples or when the
quality of a rule returned is below a user-specified threshold
• Compared with decision-tree induction, which learns a set of rules simultaneously
