Classification DMKD
Prediction
• Classification:
• predicts categorical class labels
• classifies data (constructs a model) based on the training set and the
values (class labels) in a classifying attribute and uses it in classifying
new data
• Prediction:
• models continuous-valued functions, i.e., predicts unknown or missing
values
• Typical Applications
• credit approval
• target marketing
• medical diagnosis
• treatment effectiveness analysis
• Large data sets: disk-resident rather than memory-resident
data
Supervised vs. Unsupervised Learning
• Supervised learning (classification)
• Supervision: The training data (observations,
measurements, etc.) are accompanied by labels indicating
the class of the observations
• New data is classified based on the training set
• Unsupervised learning (clustering)
• The class labels of the training data are unknown
• Given a set of measurements, observations, etc. with the
aim of establishing the existence of classes or clusters in the
data
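The contrast above can be sketched in a few lines of plain Python. This is a minimal illustration on hypothetical toy points (names and data are invented for the example): with labels we can classify a new point by its nearest labeled neighbor; without labels we can only group the points, e.g. by assigning each one to the nearer of two tentative centers.

```python
# Supervised: each training observation comes with a class label.
labeled = [((1.0, 1.2), "A"), ((0.9, 1.1), "A"),
           ((4.0, 3.9), "B"), ((4.2, 4.1), "B")]

def dist2(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def classify_1nn(x, training):
    """Classify new data by the label of its nearest labeled training point."""
    return min(training, key=lambda t: dist2(x, t[0]))[1]

# Unsupervised: the same points without labels; we can only look for
# structure, here by assigning each point to the nearer of two centers.
points = [p for p, _ in labeled]
centers = [points[0], points[2]]
clusters = [min(range(2), key=lambda i: dist2(p, centers[i])) for p in points]

print(classify_1nn((1.1, 1.0), labeled))  # "A": nearest training point is labeled A
print(clusters)                           # [0, 0, 1, 1]: two groups emerge without labels
```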
Prediction Problems: Classification vs.
Numeric Prediction
• Classification
• predicts categorical class labels (discrete or nominal)
• classifies data (constructs a model) based on the training
set and the values (class labels) in a classifying attribute and
uses it in classifying new data
• Numeric Prediction
• models continuous-valued functions, i.e., predicts unknown
or missing values
• Typical applications
• Credit/loan approval
• Medical diagnosis: whether a tumor is cancerous or benign
• Fraud detection: whether a transaction is fraudulent
• Web page categorization: which category a page belongs to
Classification—A Two-Step Process
• Model construction: describing a set of predetermined classes
• Each tuple/sample is assumed to belong to a predefined class, as
determined by the class label attribute
• The set of tuples used for model construction is the training set
• The model is represented as classification rules, decision trees, or
mathematical formulae
• Model usage: for classifying future or unknown objects
• Estimate accuracy of the model
• The known label of each test sample is compared with the model's
classification result
• Accuracy rate is the percentage of test set samples that are
correctly classified by the model
• Test set is independent of training set (otherwise overfitting)
• If the accuracy is acceptable, use the model to classify new data
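Step 2 above (estimating accuracy on an independent test set) can be sketched as follows. The model and test tuples here are hypothetical stand-ins, not from the slides; the point is just the accuracy-rate computation.

```python
def model(tuple_):
    # Toy classifier: predicts 'yes' when the first attribute exceeds 50.
    return "yes" if tuple_[0] > 50 else "no"

# Test set of (attribute vector, known label) pairs, independent of training.
test_set = [((60,), "yes"), ((40,), "no"), ((55,), "no"), ((70,), "yes")]

# Accuracy rate = percentage of test samples correctly classified.
correct = sum(1 for x, label in test_set if model(x) == label)
accuracy = correct / len(test_set)
print(accuracy)  # 3 of 4 correct -> 0.75
```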
Process (1): Model Construction

Training data is fed to a classification algorithm, which outputs a classifier (the model).

Training Data:

Student name  Maths  Physics  Chemistry  Grade
Ram           90     80       70         A
Siva          70     75       80         B
Mani          99     68       98         A
Sanjay        76     79       74         B

Learned model (classification rule):

IF maths > 80 OR physics > 80 OR chemistry > 80
THEN Grade = 'A'
ELSE Grade = 'B'
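The learned rule can be written as a function and checked against the training table it was constructed from:

```python
def predict_grade(maths, physics, chemistry):
    """The rule from the slide: any subject above 80 earns grade A."""
    if maths > 80 or physics > 80 or chemistry > 80:
        return "A"
    return "B"

training = [("Ram", 90, 80, 70, "A"), ("Siva", 70, 75, 80, "B"),
            ("Mani", 99, 68, 98, "A"), ("Sanjay", 76, 79, 74, "B")]

# The model reproduces every label in the training set.
for name, m, p, c, grade in training:
    assert predict_grade(m, p, c) == grade
```

Note that Siva's chemistry score of 80 does not trigger the rule, since the condition is strictly greater than 80.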
Process (2): Using the Model in Prediction

The learned classifier is applied first to testing data (to estimate its accuracy) and then to unseen data.
Information gained by branching on attribute A:

  Gain(A) = Info(D) - Info_A(D), where Info_A(D) = sum_{j=1..v} (|D_j| / |D|) × Info(D_j)

For the 14-sample buys_computer data:

  Info_age(D) = (5/14) I(2,3) + (4/14) I(4,0) + (5/14) I(3,2) = 0.694

I(2,3) means "age <= 30" has 5 out of 14 samples, with 2 yes's and 3 no's. Hence

  I(2,3) = -(2/5) log2(2/5) - (3/5) log2(3/5) = 0.971

  Gain(age) = Info(D) - Info_age(D) = 0.246
  Gain(income) = 0.029
  Gain(student) = 0.151
  Gain(credit_rating) = 0.048
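The gain computation for the age attribute can be reproduced with a short entropy helper (plain Python, numbers taken from the slide's 9-yes / 5-no data set):

```python
import math

def info(counts):
    """Entropy I(c1, c2, ...) of a class distribution given as counts."""
    n = sum(counts)
    return -sum(c / n * math.log2(c / n) for c in counts if c)

# Overall class distribution: 9 'yes' and 5 'no'.
info_D = info([9, 5])                                  # ~0.940

# Partition by age: <=30 -> (2 yes, 3 no), 31..40 -> (4, 0), >40 -> (3, 2).
parts = [(5, [2, 3]), (4, [4, 0]), (5, [3, 2])]
info_age = sum(n / 14 * info(c) for n, c in parts)     # ~0.694

gain_age = info_D - info_age
print(f"{gain_age:.3f}")  # 0.247 (the slide rounds this down to 0.246)
```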
[Decision tree figure: the root splits on age into <=30, 31-40, and >40; the >40 branch splits on credit_rating (fair / excellent), leading to buys-comp = yes / no leaves]
Weather  Temperature  Humidity  Wind  Play Golf
fine     hot          high      none  no
fine     hot          high      few   no
cloudy   hot          high      none  yes
rain     warm         high      none  yes
rain     cold         medium    none  yes
rain     cold         medium    few   no
cloudy   cold         medium    few   yes
fine     warm         high      none  no
fine     cold         medium    none  yes
rain     warm         medium    none  yes
fine     warm         medium    few   yes
cloudy   warm         high      few   yes
cloudy   hot          medium    none  yes
rain     warm         high      few   no
S1
gender major birth_country age_range gpa count
M Science Canada 20-25 Very_good 16
F Science Foreign 25-30 Excellent 22
M Engineering Foreign 25-30 Excellent 18
F Science Foreign 25-30 Excellent 25
M Science Canada 20-25 Excellent 21
F Engineering Canada 20-25 Excellent 18
With s1 = 120 tuples in class S1 and s2 = 130 tuples in class S2 (250 in total):

  I(s1, s2) = I(120, 130) = -(120/250) log2(120/250) - (130/250) log2(130/250) = 0.9988

For major = "Science": s11 = 84, s21 = 42, I(s11, s21) = 0.9183

  E(major) = (126/250) I(s11, s21) + (82/250) I(s12, s22) + (42/250) I(s13, s23) = 0.7873
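The two entropy values quoted in this example can be checked directly (the remaining major splits are not given on the slide, so only I(120,130) and the Science partition are verified here):

```python
import math

def info(counts):
    """Entropy of a class distribution given as raw counts."""
    n = sum(counts)
    return -sum(c / n * math.log2(c / n) for c in counts if c)

# 250 tuples split into classes s1 = 120 and s2 = 130.
i_s = info([120, 130])
print(round(i_s, 4))        # 0.9988

# For major = "Science": s11 = 84, s21 = 42.
i_science = info([84, 42])
print(round(i_science, 4))  # 0.9183
```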
Classification Is to Derive the Maximum A Posteriori
• Let D be a training set of tuples and their associated class labels,
and each tuple is represented by an n-D attribute vector X = (x1,
x2, …, xn)
• Suppose there are m classes C1, C2, …, Cm.
• Classification is to derive the maximum a posteriori probability, i.e., the
maximal P(Ci|X)
• This can be derived from Bayes’ theorem
  P(Ci|X) = P(X|Ci) P(Ci) / P(X)
• This greatly reduces the computation cost: only the class
distribution needs to be counted
Naïve Bayes Classifier: Training Dataset

Classes:
  C1: buys_computer = 'yes'
  C2: buys_computer = 'no'

Data to be classified:
  X = (age <= 30, income = medium, student = yes, credit_rating = fair)

age     income  student  credit_rating  buys_computer
<=30    high    no       fair           no
<=30    high    no       excellent      no
31..40  high    no       fair           yes
>40     medium  no       fair           yes
>40     low     yes      fair           yes
>40     low     yes      excellent      no
31..40  low     yes      excellent      yes
<=30    medium  no       fair           no
<=30    low     yes      fair           yes
>40     medium  yes      fair           yes
<=30    medium  yes      excellent      yes
31..40  medium  no       excellent      yes
31..40  high    yes      fair           yes
>40     medium  no       excellent      no
Naïve Bayes Classifier: An Example
• P(Ci): P(buys_computer = “yes”) = 9/14 = 0.643
P(buys_computer = “no”) = 5/14= 0.357
X = (age <= 30 , income = medium, student = yes, credit_rating = fair)
• Compute P(X|Ci) for each class
P(age = “<=30” | buys_computer = “yes”) = 2/9 = 0.222
P(age = “<= 30” | buys_computer = “no”) = 3/5 = 0.6
P(income = “medium” | buys_computer = “yes”) = 4/9 = 0.444
P(income = “medium” | buys_computer = “no”) = 2/5 = 0.4
P(student = “yes” | buys_computer = “yes”) = 6/9 = 0.667
P(student = “yes” | buys_computer = “no”) = 1/5 = 0.2
P(credit_rating = “fair” | buys_computer = “yes”) = 6/9 = 0.667
P(credit_rating = “fair” | buys_computer = “no”) = 2/5 = 0.4
• P(X|Ci) : P(X|buys_computer = “yes”) = 0.222 x 0.444 x 0.667 x 0.667 = 0.044
P(X|buys_computer = “no”) = 0.6 x 0.4 x 0.2 x 0.4 = 0.019
P(X|Ci)*P(Ci) : P(X|buys_computer = “yes”) * P(buys_computer = “yes”) = 0.028
P(X|buys_computer = “no”) * P(buys_computer = “no”) = 0.007
Therefore, X belongs to class (“buys_computer = yes”)
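The whole computation above can be reproduced by counting directly from the 14-tuple table (data transcribed from the slide; the function below computes the unnormalized score P(X|Ci)·P(Ci) for each class):

```python
data = [  # (age, income, student, credit_rating, buys_computer)
    ("<=30", "high", "no", "fair", "no"), ("<=30", "high", "no", "excellent", "no"),
    ("31..40", "high", "no", "fair", "yes"), (">40", "medium", "no", "fair", "yes"),
    (">40", "low", "yes", "fair", "yes"), (">40", "low", "yes", "excellent", "no"),
    ("31..40", "low", "yes", "excellent", "yes"), ("<=30", "medium", "no", "fair", "no"),
    ("<=30", "low", "yes", "fair", "yes"), (">40", "medium", "yes", "fair", "yes"),
    ("<=30", "medium", "yes", "excellent", "yes"), ("31..40", "medium", "no", "excellent", "yes"),
    ("31..40", "high", "yes", "fair", "yes"), (">40", "medium", "no", "excellent", "no"),
]
X = ("<=30", "medium", "yes", "fair")

def score(cls):
    """Unnormalized posterior P(X|Ci) * P(Ci), with naive independence."""
    rows = [r for r in data if r[4] == cls]
    p = len(rows) / len(data)                       # P(Ci)
    for i, v in enumerate(X):                       # product of P(x_i | Ci)
        p *= sum(1 for r in rows if r[i] == v) / len(rows)
    return p

p_yes, p_no = score("yes"), score("no")  # ~0.028 and ~0.007
print("yes" if p_yes > p_no else "no")   # yes: X is classified as buys_computer = yes
```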
Using IF-THEN Rules for Classification
• Represent the knowledge in the form of IF-THEN rules
R: IF age = <=30 AND student = yes THEN buys_computer = yes
• Assessment of a rule: coverage and accuracy
• ncovers = number of tuples covered by R
• ncorrect = number of tuples correctly classified by R
coverage(R) = ncovers / |D| /* D: training data set */
accuracy(R) = ncorrect / ncovers
Example. Consider the rule

  R: IF age = “<=30” AND student = no THEN buys_computer = no

Writing the rule as R: A -> B over a data set D:

  coverage(R) = |A| / |D|
  accuracy(R) = |A ∩ B| / |A|

Evaluated on the 14-tuple buys_computer table shown earlier:

  ncovers  = number of tuples covered by R (records satisfying the rule antecedent) = 3
  ncorrect = number of tuples correctly classified by R (records satisfying both antecedent and consequent) = 3
  coverage(R) = ncovers / |D| = 3/14
  accuracy(R) = ncorrect / ncovers = 3/3 (i.e., 100%)
age income student credit_rating buys_computer
<=30 high no fair no
<=30 high no excellent no
31…40 high no fair yes
>40 medium no fair yes
>40 low yes fair yes
>40 low yes excellent no
31…40 low yes excellent yes
<=30 medium no fair no
<=30 low no fair yes
>40 medium yes fair yes
<=30 medium yes excellent yes
31…40 medium no excellent yes
31…40 high yes fair yes
>40 medium no excellent no
For the same rule evaluated on the table above (in which tuple 9 has student = no):

  R: IF age = “<=30” AND student = no THEN buys_computer = no
  ncovers  = number of tuples covered by R (records satisfying the rule antecedent) = 4
  ncorrect = number of tuples correctly classified by R (records satisfying both antecedent and consequent) = 3
  coverage(R) = ncovers / |D| = 4/14
  accuracy(R) = ncorrect / ncovers = 3/4 (i.e., 75%)
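Coverage and accuracy can be computed mechanically from the table (data transcribed from the table directly above, where tuple 9 has student = no):

```python
data = [  # (age, income, student, credit_rating, buys_computer)
    ("<=30", "high", "no", "fair", "no"), ("<=30", "high", "no", "excellent", "no"),
    ("31..40", "high", "no", "fair", "yes"), (">40", "medium", "no", "fair", "yes"),
    (">40", "low", "yes", "fair", "yes"), (">40", "low", "yes", "excellent", "no"),
    ("31..40", "low", "yes", "excellent", "yes"), ("<=30", "medium", "no", "fair", "no"),
    ("<=30", "low", "no", "fair", "yes"), (">40", "medium", "yes", "fair", "yes"),
    ("<=30", "medium", "yes", "excellent", "yes"), ("31..40", "medium", "no", "excellent", "yes"),
    ("31..40", "high", "yes", "fair", "yes"), (">40", "medium", "no", "excellent", "no"),
]

# R: IF age = "<=30" AND student = no THEN buys_computer = no
antecedent = lambda r: r[0] == "<=30" and r[2] == "no"
consequent = lambda r: r[4] == "no"

covered = [r for r in data if antecedent(r)]
n_covers = len(covered)                               # 4 tuples satisfy the antecedent
n_correct = sum(1 for r in covered if consequent(r))  # 3 also satisfy the consequent
coverage = n_covers / len(data)                       # 4/14
accuracy = n_correct / n_covers                       # 3/4, i.e. 75%
```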
Rule Extraction from a Decision Tree

• Rules are easier to understand than large trees
• One rule is created for each path from the root to a leaf

[Decision tree figure: the root splits on age into <=30, 31..40, and >40; the figure marks the positive examples covered by Rule 1, Rule 2, and Rule 3, one rule per branch]
Rule Generation

• To generate a rule:

  while (true)
      find the best predicate p
      if coverage(current rule + p) > threshold
          then add p to the current rule
      else break

[Figure: general-to-specific rule growing — the candidate rule A3=1 still covers some negative examples; specializing it to A3=1 && A1=2 and then to A3=1 && A1=2 && A8=5 progressively excludes negatives]
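The growing loop above can be sketched as a greedy general-to-specific search. The data set and candidate predicates below are hypothetical, chosen so the conjuncts match the figure's A3=1 && A1=2 progression; here the quality measure guiding the choice is rule accuracy on the covered examples (real learners often use a gain measure such as FOIL-gain instead):

```python
# Hypothetical examples: attribute dicts with a '+' or '-' class.
data = [
    ({"A3": 1, "A1": 2, "A8": 5}, "+"), ({"A3": 1, "A1": 2, "A8": 5}, "+"),
    ({"A3": 1, "A1": 2, "A8": 0}, "+"), ({"A3": 1, "A1": 0, "A8": 5}, "-"),
    ({"A3": 0, "A1": 2, "A8": 5}, "-"), ({"A3": 0, "A1": 0, "A8": 0}, "-"),
]
candidates = [("A3", 1), ("A1", 2), ("A8", 5)]  # predicates of the form attr == value

def rule_accuracy(rule):
    """Fraction of examples covered by the rule that are positive."""
    covered = [c for x, c in data if all(x[a] == v for a, v in rule)]
    return (sum(1 for c in covered if c == "+") / len(covered)) if covered else 0.0

rule = []
while True:
    best = max(candidates, key=lambda p: rule_accuracy(rule + [p]))
    if rule_accuracy(rule + [best]) > rule_accuracy(rule):
        rule.append(best)  # specialize: add the best predicate
    else:
        break              # no candidate improves the rule any further

print(rule)  # [('A3', 1), ('A1', 2)]: covers only positive examples
```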
How to Learn-One-Rule?
Model Evaluation

• Metrics for Performance Evaluation
• How to evaluate the performance of a model? Start from the confusion matrix:

                         PREDICTED CLASS
                         Class=Yes   Class=No
  ACTUAL    Class=Yes    a (TP)      b (FN)
  CLASS     Class=No     c (FP)      d (TN)

  a: TP (true positive)    b: FN (false negative)
  c: FP (false positive)   d: TN (true negative)
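From the four cells a, b, c, d, summary metrics follow directly; this sketch uses invented counts for illustration:

```python
# Hypothetical confusion-matrix cells: TP, FN, FP, TN.
a, b, c, d = 40, 10, 5, 45

total = a + b + c + d
accuracy = (a + d) / total      # fraction of predictions that are correct
error_rate = (b + c) / total    # fraction that are wrong (1 - accuracy)

print(accuracy, error_rate)     # 0.85 0.15
```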
Distance measure (Euclidean):

  d(p, q) = sqrt( Σ_i (p_i − q_i)² )
Sample  Acid durability  Strength  Class  Distance to test point
Type 2  7                4         Bad    5
Type 3  3                4         Good   3
Type 4  1                4         Good   3.6

Test data: acid durability = 3, strength = 7. Among these nearest neighbors the classes are 2 Good and 1 Bad, so the majority class is Good.
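The distances and the majority vote in this example can be recomputed directly (samples as (acid durability, strength) pairs, transcribed from the table):

```python
import math
from collections import Counter

samples = [("Type 2", (7, 4), "Bad"),
           ("Type 3", (3, 4), "Good"),
           ("Type 4", (1, 4), "Good")]
test = (3, 7)  # acid durability = 3, strength = 7

def dist(p, q):
    """Euclidean distance d(p, q) = sqrt(sum_i (p_i - q_i)^2)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

distances = {name: dist(x, test) for name, x, _ in samples}
# Type 2 -> 5.0, Type 3 -> 3.0, Type 4 -> sqrt(13) ~= 3.6

# Majority vote over the neighbors' classes: 2 Good vs 1 Bad.
majority = Counter(cls for _, _, cls in samples).most_common(1)[0][0]
print(majority)  # Good
```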
Practical Issues of Classification
• Underfitting and Overfitting
• Missing Values
Underfitting and Overfitting (Example)

[Figure: example data set with 500 circular and 500 triangular data points]
Underfitting and Overfitting

• Underfitting: when the model is too simple, both training and test errors are large
• Overfitting: when the model is too complex, training error keeps falling while test error begins to rise
Overfitting due to Insufficient Examples
Lack of data points in the lower half of the diagram makes it difficult
to predict correctly the class labels of that region
- Insufficient number of training records in the region causes the
decision tree to predict the test examples using other training
records that are irrelevant to the classification task
Computing Impurity Measure

Tid  Refund  Marital Status  Taxable Income  Class
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No

(one tuple, Tid 10, has a missing Refund value)

Class counts by Refund value:

              Class=Yes  Class=No
  Refund=Yes  0          3
  Refund=No   2          4
  Refund=?    1          0

Before splitting:
  Entropy(Parent) = -0.3 log(0.3) - 0.7 log(0.7) = 0.8813

Split on Refund:
  Entropy(Children) = 0.3 (0) + 0.6 (0.9183) = 0.551
  Gain = 0.9 × (0.8813 − 0.551) = 0.2973
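The three quantities in this split can be recomputed from the class counts; note that the gain is the known-value fraction (9/10) times the entropy drop, so 0.9 × (0.8813 − 0.551) ≈ 0.297:

```python
import math

def entropy(n_yes, n_no):
    """Binary entropy of a (yes, no) class count pair."""
    counts = [c for c in (n_yes, n_no) if c]
    n = sum(counts)
    return -sum(c / n * math.log2(c / n) for c in counts)

# Parent node: 3 Yes / 7 No over all 10 tuples.
parent = entropy(3, 7)                                  # ~0.8813

# Children, over the 9 tuples whose Refund value is known:
# Refund=Yes -> (0 yes, 3 no), Refund=No -> (2 yes, 4 no).
children = 0.3 * entropy(0, 3) + 0.6 * entropy(2, 4)    # ~0.551

# Scale by the fraction of tuples with a known Refund value (9/10).
gain = 0.9 * (parent - children)
print(round(gain, 4))  # 0.2973
```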
Distribute Instances

Tid  Refund  Marital Status  Taxable Income  Class
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No

Instance with a missing value:

Tid  Refund  Marital Status  Taxable Income  Class
10   ?       Single          90K             Yes

The instance with Refund = ? is sent down both branches of the split, weighted by the fraction of training instances with each known Refund value.