Data Mining: Concepts and Techniques
[Figure: a classifier learned from training data is applied to unseen data, e.g. (Jeff, Professor, 4) -> Tenured?]

NAME     RANK            YEARS  TENURED
Tom      Assistant Prof  2      no
Merlisa  Associate Prof  7      no
George   Professor       5      yes
Joseph   Assistant Prof  7      yes
Examples of Classification Task
• Predicting tumor cells as benign or malignant
• Classifying credit card transactions as legitimate or fraudulent
• Classifying secondary structures of protein as alpha-helix, beta-sheet,
or random coil
• Categorizing news stories as finance, weather, entertainment, sports, etc.
Classification Techniques
• Decision Tree-based Methods
• Rule-based Methods
• Memory-based Reasoning
• Neural Networks
• Naïve Bayes and Bayesian Belief Networks
• Support Vector Machines
Decision Tree Induction: An Example

Training data set (buys_computer):

age      income  student  credit_rating  buys_computer
<=30     high    no       fair           no
<=30     high    no       excellent      no
31...40  high    no       fair           yes
>40      medium  no       fair           yes
>40      low     yes      fair           yes
>40      low     yes      excellent      no
31...40  low     yes      excellent      yes
<=30     medium  no       fair           no
<=30     low     yes      fair           yes
>40      medium  yes      fair           yes
<=30     medium  yes      excellent      yes
31...40  medium  no       excellent      yes
31...40  high    yes      fair           yes
>40      medium  no       excellent      no

Resulting tree: the root test is age?. For age <= 30 the tree tests student? (no -> no, yes -> yes); for age 31...40 the leaf is yes; for age > 40 the tree tests credit_rating? (excellent -> no, fair -> yes).
Algorithm for Decision Tree Induction
• Basic algorithm (a greedy algorithm)
– Tree is constructed in a top-down recursive divide-and-conquer manner
– At start, all the training examples are at the root
– Attributes are categorical (if continuous-valued, they are discretized in advance)
– Examples are partitioned recursively based on selected attributes
– Test attributes are selected on the basis of a heuristic or statistical measure (e.g.,
information gain)
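As a rough illustration of these bullets (a minimal ID3-style sketch, not the book's code), the following Python builds such a tree recursively; it assumes each training example is a tuple of categorical attribute values with the class label in the last position, and uses information gain as the selection measure.

from math import log2
from collections import Counter

def entropy(rows):
    """Entropy of the class distribution; the class label is the last field of each tuple."""
    n = len(rows)
    return -sum(c / n * log2(c / n) for c in Counter(r[-1] for r in rows).values())

def info_gain(rows, a):
    """Information gain of splitting rows on attribute index a."""
    parts = {}
    for r in rows:
        parts.setdefault(r[a], []).append(r)
    return entropy(rows) - sum(len(p) / len(rows) * entropy(p) for p in parts.values())

def build_tree(rows, attrs):
    """Top-down, recursive, divide-and-conquer induction over categorical attributes."""
    classes = [r[-1] for r in rows]
    if len(set(classes)) == 1:                    # all examples at this node share one class: leaf
        return classes[0]
    if not attrs:                                 # no attributes left: majority-class leaf
        return Counter(classes).most_common(1)[0][0]
    a = max(attrs, key=lambda x: info_gain(rows, x))        # heuristic attribute selection
    remaining = [x for x in attrs if x != a]
    # Returned tree: either a class label (leaf) or {attribute_index: {value: subtree}}
    return {a: {v: build_tree([r for r in rows if r[a] == v], remaining)
                for v in set(r[a] for r in rows)}}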
Training Set, Model, Test Set

[Figure: the decision tree induced from the training set (Tid, Attrib1, Attrib2, Attrib3, Class) is applied to the test set, e.g. record Tid = 11 (No, Small, 55K, Class = ?).]
Apply Model to Test Data

Test data: Refund = No, Marital Status = Married, Taxable Income = 80K, Cheat = ?

Start from the root of the tree:
  Refund? : Yes -> NO, No -> MarSt
  MarSt?  : Single or Divorced -> TaxInc, Married -> NO
  TaxInc? : < 80K -> NO, > 80K -> YES
Apply Model to Test Data (continued)

Since Refund = No, follow the No branch to MarSt; since Marital Status = Married, reach the NO leaf. Assign Cheat to "No".
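As a small sketch of the traversal just shown, the tree can be hard-coded as nested tests; the dict encoding of the test record below is assumed for illustration, and only the attribute names and values come from the slide.

# A hard-coded version of the tree above; income is given in K.
def classify(record):
    if record["Refund"] == "Yes":
        return "No"                                   # Refund = Yes -> leaf NO
    if record["Marital Status"] == "Married":
        return "No"                                   # MarSt = Married -> leaf NO
    return "No" if record["Taxable Income"] < 80 else "Yes"   # TaxInc test

test = {"Refund": "No", "Marital Status": "Married", "Taxable Income": 80}
print(classify(test))   # "No", matching the assignment Cheat = "No"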
How to Specify Test Condition?
• Depends on attribute types
– Nominal
– Ordinal
– Continuous
• Example splits:
  – Continuous (Taxable Income): a binary split such as "Taxable Income > 80K?" (Yes / No), or a multi-way split into ranges (< 10K, ..., > 80K)
  – Nominal (CarType): a multi-way split with one partition per distinct value (Family, Sports, Luxury), or a binary split that groups values, e.g. {Sports, Luxury} vs. {Family} or {Family, Luxury} vs. {Sports}
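For a nominal attribute with k distinct values there are 2^(k-1) - 1 possible binary groupings. A short sketch that enumerates them for the CarType values above (only the value names come from the slide; the helper code is illustrative):

from itertools import combinations

values = ["Family", "Sports", "Luxury"]        # CarType values from the slide

# Enumerate every way to split the values into two non-empty groups.
splits = []
for r in range(1, len(values)):
    for left in combinations(values, r):
        right = tuple(v for v in values if v not in left)
        if (right, left) not in splits:        # skip mirror-image duplicates
            splits.append((left, right))

print(len(splits) == 2 ** (len(values) - 1) - 1)   # True: 2^(k-1) - 1 binary groupings
for left, right in splits:
    print(set(left), "vs", set(right))
# e.g. {'Sports', 'Luxury'} vs {'Family'} and {'Family', 'Luxury'} vs {'Sports'} appear here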
Splitting Based on Ordinal Attributes
• Multi-way split: use as many partitions as distinct values (Size: Small, Medium, Large)
• What about the binary split {Small, Large} vs. {Medium}? It violates the order property of the attribute, since it separates Medium from both of its neighbours.
Splitting Based on Continuous Attributes
• Different ways of handling
– Discretization to form an ordinal categorical attribute
• Static – discretize once at the beginning
• Dynamic – ranges can be found by equal interval bucketing,
equal frequency bucketing (percentiles), or clustering.
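A minimal sketch contrasting the two static bucketing strategies just mentioned; the income values are made up for illustration.

# Hypothetical Taxable Income values (in K); only the attribute name comes from the slides.
incomes = [125, 100, 70, 120, 95, 60, 220, 85, 75, 90]

def equal_interval_cutpoints(values, k):
    """k-1 cut points that divide the value range into k equal-width buckets."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    return [lo + i * width for i in range(1, k)]

def equal_frequency_cutpoints(values, k):
    """Approximate percentile cut points, so each bucket holds about the same number of values."""
    s = sorted(values)
    return [s[len(s) * i // k] for i in range(1, k)]

print(equal_interval_cutpoints(incomes, 4))    # [100.0, 140.0, 180.0]
print(equal_frequency_cutpoints(incomes, 4))   # [75, 95, 120]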
Measure of Impurity: GINI

GINI(T) = 1 - Σ_j [ p(j | T) ]²   where p(j | T) is the relative frequency of class j at node T

  (C1, C2) = (0, 6)   (1, 5)   (2, 4)   (3, 3)
  Gini     =  0.000    0.278    0.444    0.500
Examples for computing GINI

GINI(T) = 1 - Σ_i [ p(i | T) ]²

  C1 = 1, C2 = 5:   P(C1) = 1/6, P(C2) = 5/6
  Gini = 1 - (1/6)² - (5/6)² = 0.278
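A few lines of Python reproduce these numbers (the gini helper is an illustrative sketch, not library code).

def gini(counts):
    """GINI(T) = 1 - sum_i p(i|T)^2 for a node with the given class counts."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

print(round(gini([1, 5]), 3))                                         # 0.278, the worked example
print([round(gini(c), 3) for c in ([0, 6], [1, 5], [2, 4], [3, 3])])  # [0.0, 0.278, 0.444, 0.5]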
Splitting on attribute B (Parent: C1 = 6, C2 = 6, Gini = 0.500):

  B = Yes -> Node N1: C1 = 5, C2 = 2
  B = No  -> Node N2: C1 = 1, C2 = 4

  Gini(N1) = 1 - (5/7)² - (2/7)² = 0.408
  Gini(N2) = 1 - (1/5)² - (4/5)² = 0.320
  Gini(Children) = 7/12 * 0.408 + 5/12 * 0.320 = 0.371
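The same kind of helper verifies the split computation, assuming class counts (5, 2) for N1 and (1, 4) for N2 as in the table.

def gini(counts):
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

n1, n2 = [5, 2], [1, 4]                       # class counts (C1, C2) in nodes N1 and N2
g1, g2 = gini(n1), gini(n2)                   # 0.408 and 0.320
weighted = (sum(n1) * g1 + sum(n2) * g2) / (sum(n1) + sum(n2))   # 7/12 * g1 + 5/12 * g2
print(round(g1, 3), round(g2, 3), round(weighted, 3))            # 0.408 0.32 0.371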
Weather Data: Play or not Play?

Outlook   Temperature  Humidity  Windy  Play?
sunny     hot          high      false  No
sunny     hot          high      true   No
overcast  hot          high      false  Yes
rain      mild         high      false  Yes
rain      cool         normal    false  Yes
rain      cool         normal    true   No
overcast  cool         normal    true   Yes
sunny     mild         high      false  No
sunny     cool         normal    false  Yes
rain      mild         normal    false  Yes
sunny     mild         normal    true   Yes
overcast  mild         high      true   Yes
overcast  hot          normal    false  Yes
rain      mild         high      true   No

Note: Outlook is the weather forecast, no relation to the Microsoft email program.
Example Tree for “Play?”
• An internal node is a test on an attribute.

Entropy(T) = - Σ_i p_i log₂ p_i

  C1 = 0, C2 = 6:   P(C1) = 0/6 = 0, P(C2) = 6/6 = 1
  Entropy = - 0 log 0 - 1 log 1 = - 0 - 0 = 0
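A small sketch of this entropy computation, relying on the convention 0 log 0 = 0 used in the example.

from math import log2

def entropy(counts):
    """Entropy(T) = -sum_i p_i log2 p_i, using the convention 0 * log(0) = 0."""
    n = sum(counts)
    return -sum(c / n * log2(c / n) for c in counts if c > 0)

print(entropy([0, 6]))   # 0.0 (may print as -0.0): a pure node has zero entropy
print(entropy([3, 3]))   # 1.0: maximal for two equally likely classes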
Information gain of a split of parent node p (with n records) into k partitions, where partition i holds n_i records:

  GAIN_split = Entropy(p) - Σ_{i=1..k} (n_i / n) · Entropy(i)

• Gain Ratio:

  GainRATIO_split = GAIN_split / SplitINFO,   where   SplitINFO = - Σ_{i=1..k} (n_i / n) log (n_i / n)
• Information gain:
(information before split) – (information after split)
• Note: not all leaves need to be pure; sometimes identical instances have
different classes
Splitting stops when data can’t be split any further
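A short sketch of GAIN_split, SplitINFO, and GainRATIO_split for a hypothetical 3-way split with parent class counts (9, 5) and partitions (2, 3), (4, 0), (3, 2), the same counts as the age split on the next slide.

from math import log2

def entropy(counts):
    n = sum(counts)
    return -sum(c / n * log2(c / n) for c in counts if c > 0)

parent = [9, 5]                       # class counts (P, N) at the parent node
parts = [[2, 3], [4, 0], [3, 2]]      # class counts in the k = 3 partitions
n = sum(parent)

gain = entropy(parent) - sum(sum(p) / n * entropy(p) for p in parts)   # GAIN_split
split_info = -sum(sum(p) / n * log2(sum(p) / n) for p in parts)        # SplitINFO
print(round(gain, 3), round(split_info, 3), round(gain / split_info, 3))
# about 0.247 1.577 0.156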
Attribute Selection: Information Gain

• Class P: buys_computer = "yes";  Class N: buys_computer = "no"
• Info(D) = I(9,5) = - (9/14) log₂(9/14) - (5/14) log₂(5/14) = 0.940
• Info_age(D) = (5/14) I(2,3) + (4/14) I(4,0) + (5/14) I(3,2) = 0.694

  age      p_i  n_i  I(p_i, n_i)
  <=30     2    3    0.971
  31...40  4    0    0
  >40      3    2    0.971

  (5/14) I(2,3) means "age <= 30" has 5 out of the 14 samples, with 2 yes'es and 3 no's. Hence

  Gain(age) = Info(D) - Info_age(D) = 0.246

• Similarly: Gain(income) = 0.029, Gain(student) = 0.151, Gain(credit_rating) = 0.048

(Training data set: the buys_computer table shown earlier.)
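These gains can be checked with a few lines of Python; the data literal transcribes the training table, while the helper functions are an illustrative sketch.

from math import log2

# Training tuples from the table: (age, income, student, credit_rating, buys_computer)
data = [
    ("<=30", "high", "no", "fair", "no"),
    ("<=30", "high", "no", "excellent", "no"),
    ("31...40", "high", "no", "fair", "yes"),
    (">40", "medium", "no", "fair", "yes"),
    (">40", "low", "yes", "fair", "yes"),
    (">40", "low", "yes", "excellent", "no"),
    ("31...40", "low", "yes", "excellent", "yes"),
    ("<=30", "medium", "no", "fair", "no"),
    ("<=30", "low", "yes", "fair", "yes"),
    (">40", "medium", "yes", "fair", "yes"),
    ("<=30", "medium", "yes", "excellent", "yes"),
    ("31...40", "medium", "no", "excellent", "yes"),
    ("31...40", "high", "yes", "fair", "yes"),
    (">40", "medium", "no", "excellent", "no"),
]
ATTRS = {"age": 0, "income": 1, "student": 2, "credit_rating": 3}

def info(rows):
    """Info(D) = -sum p_i log2 p_i over the class distribution (last field)."""
    n = len(rows)
    counts = {}
    for r in rows:
        counts[r[-1]] = counts.get(r[-1], 0) + 1
    return -sum(c / n * log2(c / n) for c in counts.values())

def gain(rows, attr):
    """Gain(attr) = Info(D) - Info_attr(D)."""
    idx, n = ATTRS[attr], len(rows)
    parts = {}
    for r in rows:
        parts.setdefault(r[idx], []).append(r)
    return info(rows) - sum(len(p) / n * info(p) for p in parts.values())

for a in ATTRS:
    print(a, round(gain(data, a), 3))
# age 0.247, income 0.029, student 0.152, credit_rating 0.048
# (the slide's 0.246 and 0.151 come from rounding Info(D) and Info_age(D) to three decimals first)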
Splitting the samples using age

The root test age? partitions the training samples into three branches: <=30, 31...40, and >40.
[Test record from the rule-based classification example: Name = turtle, Blood Type = cold, Give Birth = no, Can Fly = no, Live in Water = sometimes, Class = ?]
Rule Ordering Schemes
• Rule-based ordering
– Individual rules are ranked based on their quality
• Class-based ordering
– Rules that belong to the same class appear together
Example: under either ordering scheme, the rule list begins with (Refund=Yes) ==> No.
• Indirect Method:
  – Extract rules from other classification models (e.g., decision trees)
  – e.g., C4.5 rules
Using IF-THEN Rules for Classification
• Represent the knowledge in the form of IF-THEN rules
R: IF age = youth AND student = yes THEN buys_computer = yes
– Rule antecedent/precondition vs. rule consequent
• Assessment of a rule: coverage and accuracy (see the sketch after this list)
  – n_covers = # of tuples covered by R
  – n_correct = # of tuples correctly classified by R
  coverage(R) = n_covers / |D|   /* D: training data set */
  accuracy(R) = n_correct / n_covers
• If more than one rule is triggered, conflict resolution is needed
  – Size ordering: assign the highest priority to the triggering rule that has the "toughest" requirement (i.e., with the most attribute tests)
  – Class-based ordering: decreasing order of prevalence or misclassification cost per class
  – Rule-based ordering (decision list): rules are organized into one long priority list, according to some measure of rule quality or by experts
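A minimal sketch of coverage and accuracy for the rule R above, evaluated on a small made-up data set D (the tuples are illustrative, not from the slides).

# Hypothetical tuples (age, student, buys_computer); only the rule R comes from the slide:
# R: IF age = youth AND student = yes THEN buys_computer = yes
D = [
    ("youth", "yes", "yes"),
    ("youth", "yes", "yes"),
    ("youth", "no", "no"),
    ("middle_aged", "yes", "yes"),
    ("senior", "no", "no"),
    ("youth", "yes", "no"),
]

covered = [t for t in D if t[0] == "youth" and t[1] == "yes"]   # tuples matching the antecedent
correct = [t for t in covered if t[2] == "yes"]                 # covered and correctly classified

coverage = len(covered) / len(D)         # n_covers / |D|
accuracy = len(correct) / len(covered)   # n_correct / n_covers
print(coverage, round(accuracy, 3))      # 0.5 0.667 on this toy data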
Rule Extraction from a Decision Tree
• Rules are easier to understand than large trees
• One rule is created for each path from the root to a leaf (example: the age? tree with branches <=30, 31..40, and >40)
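A rough sketch of this path-to-rule conversion, assuming the buys_computer tree from the earlier example is encoded as a nested dict (the encoding and the helper are illustrative).

tree = {"age": {
    "<=30": {"student": {"no": "no", "yes": "yes"}},
    "31..40": "yes",
    ">40": {"credit_rating": {"excellent": "no", "fair": "yes"}},
}}

def extract_rules(node, conditions=()):
    if not isinstance(node, dict):                     # reached a leaf: emit one rule
        antecedent = " AND ".join(f"{a} = {v}" for a, v in conditions)
        return [f"IF {antecedent} THEN buys_computer = {node}"]
    (attr, branches), = node.items()
    rules = []
    for value, child in branches.items():              # one path per branch value
        rules += extract_rules(child, conditions + ((attr, value),))
    return rules

for rule in extract_rules(tree):
    print(rule)
# IF age = <=30 AND student = no THEN buys_computer = no
# ... five rules in total, one per root-to-leaf path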