Classification and Prediction: Module 3
2
Classification vs. Prediction
Classification:
– predicts categorical class labels
– classifies data (constructs a model) based on the training set
and the values (class labels) in a classifying attribute and uses it
in classifying new data
Prediction:
– models continuous-valued functions, i.e., predicts unknown or
missing values
Typical Applications
– credit approval
– target marketing
– medical diagnosis
– treatment effectiveness analysis
3
Definition
CLASSIFICATION
4
Classification in Literature
5
Classification: Formal Definition
6
Illustrating Classification Task
[Figure: the classification task. A training set of labeled records (Tid, Attrib1, Attrib2, Attrib3, Class), e.g. (3, No, Small, 70K, No), (4, Yes, Medium, 120K, No), (5, No, Large, 95K, Yes), (6, No, Medium, 60K, No), is fed to a learning algorithm (induction) to build a model. The model is then applied to a test set of unlabeled records, e.g. (11, No, Small, 55K, ?) and (15, No, Large, 67K, ?).]
7
Classification—A Two-Step
Process
Model construction: describing a set of predetermined classes
– Each tuple/sample is assumed to belong to a predefined class, as
determined by the class label attribute
– The set of tuples used for model construction: training set
– The model is represented as classification rules, decision trees, or
mathematical formulae
Model usage: for classifying future or unknown objects
– Estimate accuracy of the model
The known label of each test sample is compared with the class predicted by the model
The accuracy rate is the percentage of test-set samples that are correctly classified by the model
The test set must be independent of the training set; otherwise over-fitting will occur
8
Classification—A Two-Step
Process
Learning / Training (Model Construction)
– Using a classification algorithm, a model is built by analyzing a set of training database objects.
– The model is represented as classification rules, decision trees, etc.
Testing / Evaluation
– The model is tested on a different data set (the test data set), whose class labels are withheld from the classifier, and the classification accuracy is estimated.
9
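As an illustration of these two steps, here is a minimal Python sketch, assuming scikit-learn and its bundled Iris data (not part of the original slides):

# A minimal sketch of the two-step process (assumes scikit-learn):
# step 1 builds the model from a training set, step 2 evaluates it on a test set.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Step 1: model construction (learning/training) on labeled training data.
model = DecisionTreeClassifier(random_state=0)
model.fit(X_train, y_train)

# Step 2: model usage - predict labels for the independent test set and
# compare them with the known labels to estimate the accuracy rate.
accuracy = model.score(X_test, y_test)
print(f"accuracy on the test set: {accuracy:.2%}")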
Evaluation of Classification Systems
11
Model Construction
[Figure: a training dataset is fed to a classification algorithm, which constructs the model (classifier).]
12
Evaluate and use the Model
[Figure: evaluation phase. The classifier (e.g., the rule IF Rank = ‘professor’ OR Years > 6 THEN Dean = ‘Yes’) is applied to a test DS with known labels to compute accuracy (here 75%), and then to a future DS of unseen / unknown records.]
13
Use the Model in Prediction
[Figure: the classifier is applied to testing data and then to unknown data, e.g. (Jeff, Professor, 4): Tenured?]

NAME     RANK            YEARS  TENURED
Tom      Assistant Prof  2      no
Merlisa  Associate Prof  7      no
George   Professor       5      yes
Joseph   Assistant Prof  7      yes

14
Supervised vs. Unsupervised Learning
15
Issues regarding classification and
prediction
16
(2): Evaluating Classification Methods
Predictive accuracy
Speed and scalability
– time to construct the model
– time to use the model
Robustness
– handling noise and missing values
Scalability
– efficiency in disk-resident databases
Interpretability:
– understanding and insight provided by the model
Goodness of rules
– decision tree size
– compactness of classification rules
17
Classification Techniques
18
Classification by Decision Tree Induction
Decision tree
– A flow-chart-like tree structure
– Internal node denotes a test on an attribute
– Branch represents an outcome of the test
– Leaf nodes represent class labels or class distribution
Decision tree generation consists of two phases
– Tree construction
At start, all the training examples are at the root
Partition examples recursively based on selected attributes
– Tree pruning
Identify and remove branches that reflect noise or outliers
Use of decision tree: Classifying an unknown sample
– Test the attribute values of the sample against the decision tree
19
Example 1 : Training Dataset
[Figure: decision tree for the training dataset; the root tests age? with branches <=30, 30..40 and >40, ending in leaves labeled no and yes.]
21
Example 2 of a Decision Tree

Training data (Refund and Marital Status are categorical, Taxable Income is continuous, Cheat is the class attribute):

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

A decision tree over the splitting attributes MarSt, Refund and TaxInc:

MarSt = Married           -> NO
MarSt = Single, Divorced  -> Refund = Yes -> NO
                             Refund = No  -> TaxInc < 80K -> NO
                                             TaxInc > 80K -> YES

There could be more than one tree that fits the same data!
23
Decision Tree Classification Task
[Figure: the decision tree classification task. A tree-induction algorithm learns a decision tree model from the training set; the model is then applied to the test set records (e.g. Tid 11 and 15, whose class labels are unknown).]
24
Apply Model to Test Data
Test data: Refund = No, Marital Status = Married, Taxable Income = 80K, Cheat = ?

Start from the root of the tree:
Refund?  (Yes -> NO; No -> MarSt)
MarSt?   (Single, Divorced -> TaxInc; Married -> NO)
TaxInc?  (< 80K -> NO; > 80K -> YES)
25
Apply Model to Test Data
Test data: Refund = No, Marital Status = Married, Taxable Income = 80K, Cheat = ?

Refund = No, so follow the No branch to MarSt; MarSt = Married leads to the leaf NO.
Assign Cheat to “No”
30
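The traversal in the slides above can be written as nested conditionals. The sketch below (illustrative Python, not part of the deck) encodes the Refund / MarSt / TaxInc tree and classifies the test record:

# A minimal sketch of the decision tree from the walkthrough above,
# written as nested if/else tests; classifies the test record (No, Married, 80K).
def classify_cheat(refund, marital_status, taxable_income_k):
    if refund == "Yes":
        return "No"                      # Refund = Yes -> leaf NO
    if marital_status == "Married":
        return "No"                      # MarSt = Married -> leaf NO
    # MarSt = Single or Divorced -> test Taxable Income
    return "No" if taxable_income_k < 80 else "Yes"

print(classify_cheat("No", "Married", 80))   # -> "No" (assign Cheat to "No")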
Decision Tree Classification Task
[Figure: the decision tree classification task, repeated. The induced decision tree model is applied to the test set records (e.g. Tid 11 and 15, whose class labels are unknown).]
31
Decision Tree Induction
Many Algorithms:
– Hunt’s Algorithm (one of the earliest)
– CART
– ID3, C4.5
– SLIQ,SPRINT
32
General Structure of Hunt’s Algorithm
Let Dt be the set of training records that reach a node t (illustrated on the Refund / Marital Status / Taxable Income data).
General procedure:
– If Dt contains records that all belong to the same class yt, then t is a leaf node labeled yt
– If Dt is an empty set, then t is a leaf node labeled with the default class yd
– If Dt contains records that belong to more than one class, use an attribute test to split the data into smaller subsets, and recursively apply the procedure to each subset
33
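As a concrete sketch of this general procedure (illustrative Python, not from the slides; the attribute-choice step is deliberately simplified, whereas a real implementation would use a selection measure such as information gain or Gini):

# A minimal recursive sketch of Hunt's algorithm on categorical attributes.
from collections import Counter

def hunt(records, attributes, default_class):
    """records: list of (attribute_value_dict, class_label) pairs reaching node t."""
    if not records:                                      # Dt empty -> default-class leaf
        return ("leaf", default_class)
    labels = [label for _, label in records]
    if len(set(labels)) == 1:                            # all records share class yt
        return ("leaf", labels[0])
    if not attributes:                                   # nothing left to test -> majority leaf
        return ("leaf", Counter(labels).most_common(1)[0][0])
    attr = attributes[0]                                 # attribute test (simplified choice)
    majority = Counter(labels).most_common(1)[0][0]
    children = {}
    for value in {rec[attr] for rec, _ in records}:      # split Dt into smaller subsets
        subset = [(rec, lab) for rec, lab in records if rec[attr] == value]
        children[value] = hunt(subset, attributes[1:], majority)
    return ("node", attr, children)                      # recurse on each subset

# Tiny example on the Refund / Marital Status attributes from the slides:
data = [({"Refund": "No", "MarSt": "Married"}, "No"),
        ({"Refund": "No", "MarSt": "Single"}, "Yes"),
        ({"Refund": "Yes", "MarSt": "Single"}, "No")]
print(hunt(data, ["Refund", "MarSt"], default_class="No"))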
Example: C4.5
34
Algorithm for Decision Tree Induction
35
Attribute Selection Measure
36
37
Information Gain (ID3/C4.5)
If a set S contains p examples of class P and n examples of class N, the expected information needed to classify an arbitrary sample is

I(p, n) = -\frac{p}{p+n}\log_2\frac{p}{p+n} - \frac{n}{p+n}\log_2\frac{n}{p+n}
38
Information Gain in Decision Tree
Induction
39
Attribute Selection by Information Gain Computation
Class P: buys_computer = “yes”
Class N: buys_computer = “no”

I(p, n) = I(9, 5) = 0.940

Compute the entropy for age:

age      pi   ni   I(pi, ni)
<=30     2    3    0.971
30…40    4    0    0
>40      3    2    0.971

E(age) = (5/14) I(2,3) + (4/14) I(4,0) + (5/14) I(3,2) = 0.69

Hence
Gain(age) = I(p, n) - E(age) = 0.940 - 0.69 = 0.25
Similarly
Gain(income) = 0.029
Gain(student) = 0.151
Gain(credit_rating) = 0.048
40
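As a quick check of these numbers, the Python sketch below (not part of the original deck) recomputes I(9,5), E(age) and Gain(age); the exact values are 0.940, 0.694 and 0.246, which the slide rounds to 0.69 and 0.25.

# A minimal sketch reproducing the information-gain computation above.
import math

def info(counts):
    """Expected information (entropy) I(p, n, ...) for a list of class counts."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)

i_total = info([9, 5])                      # I(9, 5) ~= 0.940

# Per-branch class counts (pi, ni) for each value of the attribute age.
age_branches = {"<=30": (2, 3), "30...40": (4, 0), ">40": (3, 2)}
n_samples = 14

e_age = sum((p + n) / n_samples * info([p, n]) for p, n in age_branches.values())
gain_age = i_total - e_age                  # ~= 0.940 - 0.694 = 0.246

print(f"I(9,5)    = {i_total:.3f}")
print(f"E(age)    = {e_age:.3f}")
print(f"Gain(age) = {gain_age:.3f}")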
41
Gini Index (IBM IntelligentMiner)
43
Avoid Overfitting in
Classification
The generated tree may overfit the training data
– Too many branches, some may reflect anomalies due to noise
or outliers
– The result is poor accuracy for unseen samples
Two approaches to avoid overfitting
– Prepruning: halt tree construction early; do not split a node if this would result in the goodness measure falling below a threshold
Difficult to choose an appropriate threshold
– Postpruning: remove branches from a “fully grown” tree, then use a data set different from the training data to decide which pruned tree is best
44
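In tools such as scikit-learn, prepruning thresholds are exposed as tree-growing parameters. The sketch below (assuming scikit-learn and its Iris data; not part of the deck) contrasts a fully grown tree with a pre-pruned one:

# A minimal sketch (assumes scikit-learn) showing pre-pruning thresholds:
# tree growth is halted early instead of growing a full tree that may overfit.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Fully grown tree: tends to fit noise in the training data.
full_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Pre-pruned tree: stop splitting when nodes are small or the split gains little.
pruned_tree = DecisionTreeClassifier(
    max_depth=3,                 # halt construction at a fixed depth
    min_samples_split=10,        # do not split very small nodes
    min_impurity_decrease=0.01,  # require a minimum "goodness" improvement
    random_state=0,
).fit(X_train, y_train)

print("full  :", full_tree.score(X_test, y_test))
print("pruned:", pruned_tree.score(X_test, y_test))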
Approaches to Determine the Final Tree
Size
46
Classification in Large Databases
47
Scalable Decision Tree Induction Methods in
Data Mining Studies
48
Presentation of Classification Results
49
Bayesian Classification: Why?
50
Bayesian Theorem
51
Bayesian classification
54
Estimating a-posteriori probabilities
Bayes theorem:
P(C|X) = P(X|C) · P(C) / P(X)
P(X) is constant for all classes
P(C) = relative frequency of class C samples
Idea: choose the class C such that P(C|X) is maximum, i.e.
the class C such that P(X|C) · P(C) is maximum
Problem: computing P(X|C) directly is infeasible!
55
Naïve Bayesian Classification
56
Play-tennis example: estimating P(xi|C)
Training data (Outlook, Temperature, Humidity, Windy, Class; P = play, N = don’t play):

Outlook   Temperature  Humidity  Windy  Class
sunny     hot          high      false  N
sunny     hot          high      true   N
overcast  hot          high      false  P
rain      mild         high      false  P
rain      cool         normal    false  P
rain      cool         normal    true   N
overcast  cool         normal    true   P
sunny     mild         high      false  N
sunny     cool         normal    false  P
rain      mild         normal    false  P
sunny     mild         normal    true   P
overcast  mild         high      true   P
overcast  hot          normal    false  P
rain      mild         high      true   N

Class priors: P(p) = 9/14, P(n) = 5/14

outlook:      P(sunny|p) = 2/9      P(sunny|n) = 3/5
              P(overcast|p) = 4/9   P(overcast|n) = 0
              P(rain|p) = 3/9       P(rain|n) = 2/5
temperature:  P(hot|p) = 2/9        P(hot|n) = 2/5
              P(mild|p) = 4/9       P(mild|n) = 2/5
              P(cool|p) = 3/9       P(cool|n) = 1/5
humidity:     P(high|p) = 3/9       P(high|n) = 4/5
              P(normal|p) = 6/9     P(normal|n) = 2/5
windy:        P(true|p) = 3/9       P(true|n) = 3/5
              P(false|p) = 6/9      P(false|n) = 2/5

57
Play-tennis example: classifying X
An unseen sample X = (rain, hot, high, false):
P(X|p)·P(p) = P(rain|p)·P(hot|p)·P(high|p)·P(false|p)·P(p)
            = 3/9 · 2/9 · 3/9 · 6/9 · 9/14 = 0.010582
P(X|n)·P(n) = P(rain|n)·P(hot|n)·P(high|n)·P(false|n)·P(n)
            = 2/5 · 2/5 · 4/5 · 2/5 · 5/14 = 0.018286
Since P(X|n)·P(n) > P(X|p)·P(p), sample X is classified as class n (don’t play).
58
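The same computation can be scripted directly from the table of counts. The sketch below (illustrative Python, not part of the slides) estimates each P(xi|C) by relative frequency and reproduces the two scores above:

# A minimal naive Bayes sketch for the play-tennis example.
from collections import Counter, defaultdict

# (outlook, temperature, humidity, windy) -> class
data = [
    ("sunny", "hot", "high", "false", "N"), ("sunny", "hot", "high", "true", "N"),
    ("overcast", "hot", "high", "false", "P"), ("rain", "mild", "high", "false", "P"),
    ("rain", "cool", "normal", "false", "P"), ("rain", "cool", "normal", "true", "N"),
    ("overcast", "cool", "normal", "true", "P"), ("sunny", "mild", "high", "false", "N"),
    ("sunny", "cool", "normal", "false", "P"), ("rain", "mild", "normal", "false", "P"),
    ("sunny", "mild", "normal", "true", "P"), ("overcast", "mild", "high", "true", "P"),
    ("overcast", "hot", "normal", "false", "P"), ("rain", "mild", "high", "true", "N"),
]

classes = Counter(row[-1] for row in data)             # P(p) = 9/14, P(n) = 5/14
cond = defaultdict(Counter)                            # counts per (attribute index, class)
for row in data:
    for i, value in enumerate(row[:-1]):
        cond[(i, row[-1])][value] += 1

def score(x, c):
    """Unnormalized posterior P(X|c)*P(c) under the independence assumption."""
    s = classes[c] / len(data)
    for i, value in enumerate(x):
        s *= cond[(i, c)][value] / classes[c]
    return s

x = ("rain", "hot", "high", "false")
print({c: round(score(x, c), 6) for c in classes})     # N wins: 0.018286 > 0.010582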
The independence hypothesis…
59
Other Classification Methods
60
Instance-Based Methods
Instance-based learning:
– Store training examples and delay the processing (“lazy
evaluation”) until a new instance must be classified
Typical approaches
– k-nearest neighbor approach
Instances represented as points in a Euclidean space.
– Case-based reasoning
Uses symbolic representations and knowledge-based
inference
61
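The lazy-evaluation idea above can be illustrated with a short sketch (assuming scikit-learn and its Iris data; not from the slides): "training" only stores the examples, and the distance-based vote happens at query time.

# A minimal k-nearest-neighbor sketch; instances are points in Euclidean space
# and classification is delayed until a new instance must be classified.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# "Training" just stores the examples (lazy evaluation).
knn = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)

# Work happens at prediction time: find the 3 nearest stored examples and vote.
print("accuracy:", knn.score(X_test, y_test))
print("first prediction:", knn.predict(X_test[:1]))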
The k-Nearest Neighbor Algorithm
64
Remarks on Lazy vs. Eager Learning
65
Genetic Algorithms
66
Rough Set Approach
67
Rough Set based Classification (Pawlak Z )
DS = {U, A} (decision table)
A = {a1, a2, a3, a4, a5, dec} (set of attributes)
U = {x1, x2, x3, x4, x5} (set of objects)

[Figure: from the decision table (objects x1..x5, their values on a1..a5 and the decision dec), generate reducts to obtain a set of reducts (highly influenced by the number of attributes), then generate rules to obtain a set of rules, i.e. the classifier.]

Reduct: a minimal set of attributes that still represents DS.
Core: the set of attributes that exist in all reducts of DS.
68
Mining Classification Rules: An
Example
Discernibility Matrix Modulo (each cell lists the attributes that discern the corresponding pair of objects):

C2:  -              -            {a1,a3,a4,a5}  -              {a1,a3,a4,a5}
C3:  {a1,a2,a4,a5}  {a1,a4,a5}   -              {a1,a3,a4,a5}  -
C4:  -              -            {a1,a4,a5}     -              {a1,a4,a5}
C5:  -              -            {a1,a2,a4,a5}  -              {a2,a4,a5}
69
Fuzzy Set
Approaches
71
What Is Prediction?
72
Predictive Modeling in
Databases
Predictive modeling: Predict data values or construct
generalized linear models based on the database data.
One can only predict value ranges or category distributions
Method outline:
– Minimal generalization
– Attribute relevance analysis
– Generalized linear model construction
– Prediction
Determine the major factors which influence the prediction
– Data relevance analysis: uncertainty measurement, entropy
analysis, expert judgement, etc.
Multi-level prediction: drill-down and roll-up analysis
73
Regression Analysis and Log-Linear Models in Prediction
Linear regression: Y = \alpha + \beta X
– The two parameters \alpha and \beta specify the line and are to be estimated from the data at hand,
– using the least squares criterion on the known values of Y1, Y2, …, X1, X2, ….
Multiple regression: Y = b0 + b1 X1 + b2 X2
– Many nonlinear functions can be transformed into the above.
Log-linear models:
– The multi-way table of joint probabilities is approximated by a product of lower-order tables.
– Probability: p(a, b, c, d) = \alpha_{ab}\,\beta_{ac}\,\chi_{ad}\,\delta_{bcd}
74
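The least squares criterion mentioned above can be applied in a few lines. The sketch below (assuming NumPy; the sample values are illustrative, not from the slides) estimates alpha and beta and then predicts Y for an unseen X:

# A minimal least-squares sketch: estimate alpha and beta in Y = alpha + beta * X.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])            # known X values (illustrative)
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])            # known Y values (illustrative)

x_mean, y_mean = x.mean(), y.mean()
beta = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
alpha = y_mean - beta * x_mean

print(f"Y ~= {alpha:.2f} + {beta:.2f} * X")         # roughly Y = 0.14 + 1.96 * X
predicted = alpha + beta * 6.0                      # predict Y for an unseen X
print("prediction for X = 6:", round(predicted, 2))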
Locally Weighted Regression
76
Prediction: Categorical Data
77
Classification Accuracy: Estimating Error
Rates
Partition: training-and-testing
– use two independent data sets, e.g., training set (2/3) and test set (1/3)
– suitable for data sets with a large number of samples
Cross-validation
– divide the data set into k subsamples
– use k-1 subsamples as training data and one subsample as test data (k-fold cross-validation)
– suitable for data sets of moderate size
Bootstrapping (leave-one-out)
– suitable for small data sets
78
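The two most common schemes above can be compared directly in code. This is an illustrative sketch (assuming scikit-learn and its Iris data, not part of the deck):

# A minimal sketch comparing a holdout split with k-fold cross-validation.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
clf = DecisionTreeClassifier(random_state=0)

# Holdout: train on 2/3 of the data, estimate accuracy on the remaining 1/3.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=1/3, random_state=0)
holdout_acc = clf.fit(X_train, y_train).score(X_test, y_test)

# k-fold cross-validation: each sample is used for testing exactly once.
cv_scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=10)

print("holdout accuracy :", round(holdout_acc, 3))
print("10-fold accuracy :", round(cv_scores.mean(), 3))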
Classification Accuracy as efficiency
measure
Confusion Matrix
• A confusion matrix contains information about the actual and predicted classifications made by a classification system.
• The following table shows the confusion matrix for a two-class classifier.
• The entries in the confusion matrix have the following meaning:
79
Confusion Matrix for the Iris Dataset
                    Predicted
Actual     Iris 1   Iris 2   Iris 3   Accuracy %
Iris 1     14       0        0        100.0
Iris 2     1        15       3        78.95
Iris 3     0        1        11       91.67
80
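The counts in the Iris table above can be rebuilt and summarized with standard tooling. This is an illustrative sketch assuming scikit-learn (the label lists simply replay the counts from the table):

# A minimal sketch of building a confusion matrix and per-class accuracy.
import numpy as np
from sklearn.metrics import confusion_matrix

y_actual    = ["Iris 1"] * 14 + ["Iris 2"] * 19 + ["Iris 3"] * 12
y_predicted = (["Iris 1"] * 14 +                                      # Iris 1 row
               ["Iris 1"] * 1 + ["Iris 2"] * 15 + ["Iris 3"] * 3 +    # Iris 2 row
               ["Iris 2"] * 1 + ["Iris 3"] * 11)                      # Iris 3 row

labels = ["Iris 1", "Iris 2", "Iris 3"]
cm = confusion_matrix(y_actual, y_predicted, labels=labels)
print(cm)                                              # rows = actual, cols = predicted

per_class_acc = cm.diagonal() / cm.sum(axis=1)         # 1.000, 0.789, 0.917
print(dict(zip(labels, np.round(per_class_acc, 3))))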
Approaches of Evaluating Classification
Algorithms
81
Train and Test (Holdout) approach
[Figure: the dataset is passed through a random splitter into a training DS (70%) and a test DS (30%); the training DS feeds the mining task, which produces patterns, and the test DS is used for pattern evaluation.]
82
Example
Test dataset:

Name      Rank            Years  Tenured
Qasem     Assistant Prof  7      Yes
Azmi      Associate Prof  7      Yes
Hamedah   Assistant Prof  6      No
Azeem     Associate Prof  7      No
Fatimah   Associate Prof  3      No
Hasan     Professor       2      Yes
83
K-Fold Cross Validation
4-Fold Cross
Validation
84
Boosting and Bagging
85
Boosting Technique (II) — Algorithm
86
Summary
Classification is an extensively studied problem (mainly in
statistics, machine learning & neural networks)
Classification is probably one of the most widely used data
mining techniques with a lot of extensions
Scalability is still an important issue for database
applications: thus combining classification with database
techniques should be a promising topic
Research directions: classification of non-relational data,
e.g., text, spatial, multimedia, etc.
87
References (I)
C. Apte and S. Weiss. Data mining with decision trees and decision rules. Future
Generation Computer Systems, 13, 1997.
L. Breiman, J. Friedman, R. Olshen, and C. Stone. Classification and Regression Trees.
Wadsworth International Group, 1984.
P. K. Chan and S. J. Stolfo. Learning arbiter and combiner trees from partitioned data for
scaling machine learning. In Proc. 1st Int. Conf. Knowledge Discovery and Data Mining
(KDD'95), pages 39-44, Montreal, Canada, August 1995.
U. M. Fayyad. Branching on attribute values in decision tree generation. In Proc. 1994
AAAI Conf., pages 601-606, AAAI Press, 1994.
J. Gehrke, R. Ramakrishnan, and V. Ganti. Rainforest: A framework for fast decision tree
construction of large datasets. In Proc. 1998 Int. Conf. Very Large Data Bases, pages 416-
427, New York, NY, August 1998.
M. Kamber, L. Winstone, W. Gong, S. Cheng, and J. Han. Generalization and decision tree
induction: Efficient classification in data mining. In Proc. 1997 Int. Workshop Research
Issues on Data Engineering (RIDE'97), pages 111-120, Birmingham, England, April 1997.
88
References (II)
89
Thank you
90