08 Class Basic
— Chapter 8 —
5
Classification—A Two-Step Process
n Model construction: describing a set of predetermined classes
n Each tuple/sample is assumed to belong to a predefined class, as
determined by the class label attribute
n The set of tuples used for model construction is the training set
n The model is represented as classification rules, decision trees, or
mathematical formulae
n Model usage: for classifying future or unknown objects
n Estimate accuracy of the model on a test set independent of the training set
n Note: If the test set is used to select models, it is called validation (test) set
6
Process (1): Model Construction
[Figure: Training Data → Classification Algorithms → Classifier (Model)]
7
Process (2): Using the Model in Classification
[Figure: the classifier is applied first to Testing Data (to estimate accuracy), then to Unseen Data, e.g., (Jeff, Professor, 4) → Tenured? yes]

NAME     RANK            YEARS  TENURED
Tom      Assistant Prof  2      no
Merlisa  Associate Prof  7      no
George   Professor       5      yes
Joseph   Assistant Prof  7      yes
8
Chapter 8. Classification: Basic Concepts
10
Algorithm for Decision Tree Induction
n Basic algorithm (a greedy algorithm)
n Tree is constructed in a top-down recursive divide-and-
conquer manner
n At start, all the training examples are at the root
n Attributes are categorical (if continuous-valued, they are
discretized in advance)
n Examples are partitioned recursively based on selected
attributes
n Test attributes are selected on the basis of a heuristic or
statistical measure (e.g., information gain); a minimal sketch of the
recursive procedure is shown below
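n A minimal Python sketch of this recursive divide-and-conquer procedure (the list-of-dicts data layout and the pluggable select_attribute heuristic are illustrative assumptions, not part of the slides):

```python
from collections import Counter

def majority_class(rows, label):
    # Majority voting: the most common class among the remaining examples
    return Counter(r[label] for r in rows).most_common(1)[0][0]

def build_tree(rows, attributes, label, select_attribute):
    # All examples at this node belong to one class: return a leaf
    classes = {r[label] for r in rows}
    if len(classes) == 1:
        return classes.pop()
    # No attributes left to test: fall back to majority voting
    if not attributes:
        return majority_class(rows, label)
    # Greedy step: pick the attribute the heuristic scores highest
    best = select_attribute(rows, attributes, label)
    node = {best: {}}
    # Partition the examples on each value of the chosen attribute and recurse
    for value in {r[best] for r in rows}:
        subset = [r for r in rows if r[best] == value]
        rest = [a for a in attributes if a != best]
        node[best][value] = build_tree(subset, rest, label, select_attribute)
    return node
```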
12
Attribute Selection Measure:
Information Gain (ID3/C4.5)
n Select the attribute with the highest information gain
n Let pi be the probability that an arbitrary tuple in D belongs to
class Ci, estimated by |Ci, D|/|D|
n Expected information (entropy) needed to classify a tuple in D:
"
%&D(" $! = "! #! #$% & " #! !
! ='
n Information needed (after using A to split D into v partitions) to
classify D:
$$\mathrm{Info}_A(D) = \sum_{j=1}^{v} \frac{|D_j|}{|D|} \times \mathrm{Info}(D_j)$$
n Information gained by branching on attribute A:
$$\mathrm{Gain}(A) = \mathrm{Info}(D) - \mathrm{Info}_A(D)$$
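n The formulas above translate directly into code; a hand-rolled Python illustration, reusing the list-of-dicts row format assumed earlier:

```python
import math
from collections import Counter

def info(labels):
    # Info(D) = -sum_i p_i log2(p_i) over the class distribution
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def info_after_split(rows, attr, label):
    # Info_A(D) = sum_j |D_j|/|D| * Info(D_j), one partition per value of A
    total = len(rows)
    partitions = {}
    for r in rows:
        partitions.setdefault(r[attr], []).append(r[label])
    return sum(len(p) / total * info(p) for p in partitions.values())

def gain(rows, attr, label):
    # Gain(A) = Info(D) - Info_A(D)
    return info([r[label] for r in rows]) - info_after_split(rows, attr, label)
```

n Passing gain as the select_attribute heuristic in the earlier sketch gives ID3-style induction; C4.5 instead uses the gain ratio to penalize attributes with many values.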
Overfitting and Tree Pruning
n Overfitting: an induced tree may overfit the training data
n Too many branches, some may reflect anomalies due to noise or outliers
n Poor accuracy for unseen samples
23
Scalability Framework for RainForest
24
Rainforest: Training Set and Its AVC Sets
26
Presentation of Classification Results
Bayes’ Theorem: Basics
n Bayes’ Theorem: $P(H|X) = \frac{P(X|H)\,P(H)}{P(X)}$
n Let X be a data sample (“evidence”): class label is unknown
n Let H be a hypothesis that X belongs to class C
n Classification is to determine P(H|X), (i.e., posteriori probability): the
probability that the hypothesis holds given the observed data sample X
n P(H) (prior probability): the initial probability
n E.g., X will buy computer, regardless of age, income, …
n P(X) (evidence): probability that the sample data is observed
n P(X|H) (likelihood): the probability of observing the sample X, given
that the hypothesis holds
n E.g., given that X will buy computer, the probability that X is 31..40 with
medium income
32
Prediction Based on Bayes’ Theorem
n Given training data X, the posteriori probability of a hypothesis H,
P(H|X), follows Bayes’ theorem:
$$P(H|X) = \frac{P(X|H)\,P(H)}{P(X)}$$
n Predicts that X belongs to Ci iff the probability P(Ci|X) is the highest
among all classes
33
Classification Is to Derive the Maximum Posteriori
n Let D be a training set of tuples and their associated class
labels, and each tuple is represented by an n-D attribute vector
X = (x1, x2, …, xn)
n Suppose there are m classes C1, C2, …, Cm.
n Classification is to derive the maximum posteriori, i.e., the
maximal P(Ci|X)
n This can be derived from Bayes’ theorem
!"! # # !!"# !
!"# # !! = " "
" !"!!
n Since P(X) is constant for all classes, only
#"" # !! = #"! # " !#"" !
! ! !
needs to be maximized
34
Naïve Bayes Classifier
n A simplified assumption: attributes are conditionally
independent (i.e., no dependence relation between
attributes):
$$P(X|C_i) = \prod_{k=1}^{n} P(x_k|C_i) = P(x_1|C_i) \times P(x_2|C_i) \times \cdots \times P(x_n|C_i)$$
n This greatly reduces the computation cost: Only counts the
class distribution
n If Ak is categorical, P(xk|Ci) is the # of tuples in Ci having value xk
for Ak divided by |Ci, D| (# of tuples of Ci in D)
n If Ak is continuous-valued, P(xk|Ci) is usually computed based on a
Gaussian distribution with mean μ and standard deviation σ:
$$g(x, \mu, \sigma) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$
and $P(x_k|C_i) = g(x_k, \mu_{C_i}, \sigma_{C_i})$
n To avoid the zero-probability problem, apply the Laplacian correction
(add 1 to each count); the corrected probability estimates stay close to
their “uncorrected” counterparts
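n A minimal sketch of the counting scheme for categorical attributes, with the Laplacian correction folded in; continuous attributes would instead plug in the Gaussian density g(x, μ, σ) above. The function names and the dict-based rows are illustrative assumptions:

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(rows, label):
    """Estimate the prior P(Ci) and per-class value counts per attribute."""
    class_sizes = Counter(r[label] for r in rows)
    priors = {c: n / len(rows) for c, n in class_sizes.items()}
    counts = defaultdict(Counter)  # (class, attribute) -> Counter of values
    for r in rows:
        for attr, val in r.items():
            if attr != label:
                counts[(r[label], attr)][val] += 1
    return priors, counts, class_sizes

def classify(x, priors, counts, class_sizes):
    # argmax_i P(Ci) * prod_k P(x_k|Ci); log-space sums avoid underflow
    best, best_score = None, float("-inf")
    for c, prior in priors.items():
        score = math.log(prior)
        for attr, val in x.items():
            # Laplacian correction: add 1 to each count so no factor is zero
            n_values = len(counts[(c, attr)]) + 1
            p = (counts[(c, attr)][val] + 1) / (class_sizes[c] + n_values)
            score += math.log(p)
        if score > best_score:
            best, best_score = c, score
    return best
```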
38
Naïve Bayes Classifier: Comments
n Advantages
n Easy to implement
n Disadvantages
n Assumption: class conditional independence, therefore loss
of accuracy
n Practically, dependencies exist among variables that cannot be
modeled by the Naïve Bayes Classifier
n How to deal with these dependencies? Bayesian Belief Networks
(Chapter 9)
39
Chapter 8. Classification: Basic Concepts
Rule Extraction from a Decision Tree
n One rule is created for each path from the root to a leaf
n Each attribute-value pair along a path forms a conjunction: the leaf
holds the class prediction
[Figure: buys_computer decision tree — age? branches <=30 (→ student? no/yes), 31..40 (→ yes), >40 (→ credit rating? excellent/fair → no/yes)]
n Each time a rule is learned, the tuples covered by the rules are
removed
n Repeat the process on the remaining tuples until termination
condition, e.g., when no more training examples or when the
quality of a rule returned is below a user-specified threshold
n Comp. w. decision-tree induction: a decision tree learns a set of rules
simultaneously, while sequential covering learns them one at a time
43
Sequential Covering Algorithm
[Figure: positive examples progressively covered by Rule 1, Rule 2, and Rule 3]
44
Rule Generation
n To generate a rule
while (true)
    find the best predicate p
    if foil-gain(p) > threshold then add p to current rule
    else break
A3=1&&A1=2
A3=1&&A1=2
&&A8=5A3=1
Positive Negative
examples examples
45
How to Learn-One-Rule?
n Start with the most general rule possible: condition = empty
n Add new attribute tests by adopting a greedy depth-first strategy
n Pick the test that most improves the rule quality, e.g., by FOIL-gain
(see the sketch below)
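n A sketch of the greedy grow loop in Python, using FOIL-Gain = pos' × (log2(pos'/(pos'+neg')) − log2(pos/(pos+neg))) as the rule-quality measure, where pos/neg count tuples covered before adding the new test and pos'/neg' after; the (attribute, value) test format is an illustrative assumption:

```python
import math

def foil_gain(pos, neg, pos2, neg2):
    # FOIL_Gain = pos' * (log2(pos'/(pos'+neg')) - log2(pos/(pos+neg)))
    if pos2 == 0:
        return float("-inf")
    return pos2 * (math.log2(pos2 / (pos2 + neg2)) - math.log2(pos / (pos + neg)))

def covers(rule, row):
    # A rule is a conjunction of (attribute, value) tests
    return all(row[a] == v for a, v in rule)

def learn_one_rule(pos_rows, neg_rows, candidates, min_gain=0.0):
    rule = []  # start with the most general rule: empty condition
    while True:
        pos = sum(covers(rule, r) for r in pos_rows)
        neg = sum(covers(rule, r) for r in neg_rows)
        best, best_gain = None, min_gain
        for test in candidates:
            new_rule = rule + [test]
            p2 = sum(covers(new_rule, r) for r in pos_rows)
            n2 = sum(covers(new_rule, r) for r in neg_rows)
            g = foil_gain(pos, neg, p2, n2)
            if g > best_gain:
                best, best_gain = test, g
        if best is None:
            break  # no predicate improves the rule enough: stop
        rule.append(best)
        candidates = [t for t in candidates if t != best]
    return rule
```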
50
Classifier Evaluation Metrics:
Precision and Recall, and F-measures
n Precision: exactness – what % of tuples that the classifier
labeled as positive are actually positive: precision = TP / (TP + FP)
n Recall: completeness – what % of positive tuples did the classifier
label as positive: recall = TP / (TP + FN)
n F measure (F1 score): harmonic mean of precision and recall:
F = 2 × precision × recall / (precision + recall)
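n These definitions in code, as a small self-contained sketch (the TP/FP/FN counts are assumed given):

```python
def precision_recall_f1(tp, fp, fn):
    """Precision = TP/(TP+FP), Recall = TP/(TP+FN), F1 = their harmonic mean."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# e.g. 90 true positives, 10 false positives, 60 false negatives:
# precision = 0.9, recall = 0.6, F1 = 0.72
print(precision_recall_f1(90, 10, 60))
```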
51
Classifier Evaluation Metrics: Example
52
Evaluating Classifier Accuracy:
Holdout & Cross-Validation Methods
n Holdout method
n Given data is randomly partitioned into two independent sets: a
training set (e.g., 2/3) for model construction and a test set (e.g., 1/3)
for accuracy estimation
n Cross-validation (k-fold, k = 10 is most popular): randomly partition
the data into k mutually exclusive subsets of approximately equal size;
at the i-th iteration, use Di as the test set and the rest as the training
set (see the sketch below)
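n A minimal k-fold sketch; train_fn and error_fn are placeholder callables standing in for any learner and error measure:

```python
import random

def k_fold_error(rows, k, train_fn, error_fn, seed=0):
    """Estimate error by k-fold cross-validation (k = 10 is typical)."""
    rows = rows[:]
    random.Random(seed).shuffle(rows)
    folds = [rows[i::k] for i in range(k)]  # k roughly equal-sized subsets
    errors = []
    for i in range(k):
        test = folds[i]
        train = [r for j, fold in enumerate(folds) if j != i for r in fold]
        model = train_fn(train)               # fit on the other k-1 folds
        errors.append(error_fn(model, test))  # evaluate on the held-out fold
    return sum(errors) / k  # mean error over the k iterations
```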
54
Estimating Confidence Intervals:
Classifier Models M1 vs. M2
n Suppose we have 2 classifiers, M1 and M2, which one is better?
n Their mean error rates (e.g., obtained via 10-fold cross-validation) are
just estimates of error on the true population of future data cases
n What if the difference between the 2 error rates is just
attributed to chance?
55
Estimating Confidence Intervals:
Null Hypothesis
n Perform 10-fold cross-validation
n Assume samples follow a t distribution with k–1 degrees of
freedom (here, k=10)
n Use t-test (or Student’s t-test)
n Null Hypothesis: M1 & M2 are the same
n If we can reject null hypothesis, then
n we conclude that the difference between M1 & M2 is
statistically significant
n Choose the model with the lower error rate
56
Estimating Confidence Intervals: t-test
where k1 & k2 are # of cross-validation samples used for M1 & M2, resp.
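n A sketch of the paired t statistic, assuming the two error vectors come from the same k cross-validation folds; it uses the unbiased (k−1) sample variance:

```python
import math

def paired_t_statistic(err1, err2):
    """t statistic for paired k-fold CV error rates of models M1 and M2.

    err1[i] and err2[i] come from the same i-th fold, so the test is
    built on the differences d_i = err1[i] - err2[i] (k - 1 df).
    """
    k = len(err1)
    diffs = [a - b for a, b in zip(err1, err2)]
    mean = sum(diffs) / k
    var = sum((d - mean) ** 2 for d in diffs) / (k - 1)  # sample variance
    return mean / math.sqrt(var / k)

# Compare |t| against the t-distribution value for k-1 degrees of freedom
# at confidence limit z = sig/2; if |t| exceeds it, reject the null hypothesis.
```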
57
Estimating Confidence Intervals:
Table for t-distribution
n Symmetric
n Significance level,
e.g., sig = 0.05 or
5% means M1 & M2
are significantly
different for 95% of
population
n Confidence limit, z
= sig/2
58
Estimating Confidence Intervals:
Statistical Significance
n Are M1 & M2 significantly different?
n Compute t and select a significance level (e.g., sig = 5%)
n Consult the t-distribution table for k–1 degrees of freedom at
confidence limit z = sig/2
n If t > z or t < –z, t lies in the rejection region: reject the null
hypothesis that the mean error rates of M1 & M2 are the same
n Conclude: statistically significant difference between M1
& M2
n Otherwise, conclude that any difference is chance
59
Model Selection: ROC Curves
n ROC (Receiver Operating
Characteristics) curves: for visual
comparison of classification models
n Originated from signal detection theory
n Shows the trade-off between the true
positive rate and the false positive rate
n The area under the ROC curve (AUC) is a measure of the accuracy of
the model
n Rank the test tuples in decreasing order: the one that is most likely to
belong to the positive class appears at the top of the list
n The closer to the diagonal line (i.e., the closer the area is to 0.5), the
less accurate is the model
n Vertical axis represents the true positive rate; horizontal axis
represents the false positive rate
n The plot also shows a diagonal line
n A model with perfect accuracy will have an area of 1.0
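n A sketch of the ranked-list construction described above (scores and 0/1 labels are assumed given, both classes present; ties are broken arbitrarily):

```python
def roc_points(scores, labels):
    """ROC curve from a ranked list.

    scores[i]: classifier's probability that tuple i is positive;
    labels[i]: 1 for actual positive, 0 for actual negative.
    Returns (FPR, TPR) points as the decision threshold sweeps down.
    """
    ranked = sorted(zip(scores, labels), key=lambda p: -p[0])
    P = sum(labels)
    N = len(labels) - P
    tp = fp = 0
    points = [(0.0, 0.0)]
    for _, y in ranked:  # most-likely-positive tuples first
        if y == 1:
            tp += 1
        else:
            fp += 1
        points.append((fp / N, tp / P))
    return points

def auc(points):
    # Trapezoidal area under the curve: 1.0 = perfect, 0.5 = diagonal
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))
```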
60
Issues Affecting Model Selection
n Accuracy
n classifier accuracy: predicting class label
n Speed
n time to construct the model (training time)
n time to use the model (classification/prediction time)
n Robustness: handling noise and missing values
n Scalability: efficiency in disk-resident databases
n Interpretability
n understanding and insight provided by the model
n Other measures, e.g., goodness of rules, such as decision tree
size or compactness of classification rules
61
Chapter 8. Classification: Basic Concepts
n Ensemble methods
n Use a combination of models to increase accuracy
n Bagging: averaging the prediction over a collection of classifiers
n Boosting: weighted vote with a collection of classifiers
n Ensemble: combining a set of heterogeneous classifiers
63
Bagging: Bootstrap Aggregation
n Analogy: Diagnosis based on multiple doctors’ majority vote
n Training
n Given a set D of d tuples, at each iteration i, a training set Di of d tuples
is sampled with replacement from D (i.e., a bootstrap sample), and a
classifier model Mi is learned from Di
n The bagged classifier M* counts the votes and assigns the class with the
most votes to X
n Prediction: can be applied to the prediction of continuous values by taking
the average value of each prediction for a given test tuple
n Accuracy
n Often significantly better than a single classifier derived from D
Random Forest (Breiman 2001)
n Each classifier in the ensemble is a decision tree; during classification,
each tree votes and the most popular class is returned
n Two Methods to construct Random Forest:
n Forest-RI (random input selection): Randomly select, at each node, F
attributes as candidates for the split at the node. The CART methodology
is used to grow the trees to maximum size
n Forest-RC (random linear combinations): creates new attributes (or
features) that are a linear combination of the existing attributes
n Class-imbalanced data sets: typical remedies include oversampling,
undersampling, threshold-moving, and ensembles of the classifiers
introduced above
n Still difficult for class imbalance problem on multiclass tasks
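n A minimal sketch of bagging (learn_fn is a placeholder that returns a callable classifier); a random forest would additionally have learn_fn restrict each tree split to F randomly chosen attributes, as in Forest-RI:

```python
import random
from collections import Counter

def bagging_train(rows, learn_fn, rounds, seed=0):
    """Train `rounds` classifiers, each on a bootstrap sample of the data."""
    rng = random.Random(seed)
    d = len(rows)
    models = []
    for _ in range(rounds):
        # Sample d tuples with replacement: the bootstrap training set Di
        sample = [rows[rng.randrange(d)] for _ in range(d)]
        models.append(learn_fn(sample))
    return models

def bagging_classify(models, x):
    # Each model votes; the class with the most votes wins
    votes = Counter(m(x) for m in models)
    return votes.most_common(1)[0][0]
```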
68
Chapter 8. Classification: Basic Concepts
70
Summary (II)
n Significance tests and ROC curves are useful for model selection.
n There have been numerous comparisons of the different
classification methods; the matter remains a research topic
n No single method has been found to be superior over all others
for all data sets
n Issues such as accuracy, training time, robustness, scalability,
and interpretability must be considered and can involve trade-
offs, further complicating the quest for an overall superior
method
71
References (1)
n C. Apte and S. Weiss. Data mining with decision trees and decision rules. Future
Generation Computer Systems, 13, 1997
n C. M. Bishop, Neural Networks for Pattern Recognition. Oxford University Press,
1995
n L. Breiman, J. Friedman, R. Olshen, and C. Stone. Classification and Regression Trees.
Wadsworth International Group, 1984
n C. J. C. Burges. A Tutorial on Support Vector Machines for Pattern Recognition. Data
Mining and Knowledge Discovery, 2(2): 121-168, 1998
n P. K. Chan and S. J. Stolfo. Learning arbiter and combiner trees from partitioned data
for scaling machine learning. KDD'95
n H. Cheng, X. Yan, J. Han, and C.-W. Hsu, Discriminative Frequent Pattern Analysis for
Effective Classification, ICDE'07
n H. Cheng, X. Yan, J. Han, and P. S. Yu, Direct Discriminative Pattern Mining for
Effective Classification, ICDE'08
n W. Cohen. Fast effective rule induction. ICML'95
n G. Cong, K.-L. Tan, A. K. H. Tung, and X. Xu. Mining top-k covering rule groups for
gene expression data. SIGMOD'05
72
References (2)
n A. J. Dobson. An Introduction to Generalized Linear Models. Chapman & Hall, 1990.
n G. Dong and J. Li. Efficient mining of emerging patterns: Discovering trends and
differences. KDD'99.
n R. O. Duda, P. E. Hart, and D. G. Stork. Pattern Classification, 2ed. John Wiley, 2001
n U. M. Fayyad. Branching on attribute values in decision tree generation. AAAI’94.
n Y. Freund and R. E. Schapire. A decision-theoretic generalization of on-line learning and
an application to boosting. J. Computer and System Sciences, 1997.
n J. Gehrke, R. Ramakrishnan, and V. Ganti. Rainforest: A framework for fast decision tree
construction of large datasets. VLDB’98.
n J. Gehrke, V. Ganti, R. Ramakrishnan, and W.-Y. Loh, BOAT -- Optimistic Decision Tree
Construction. SIGMOD'99.
n T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning: Data
Mining, Inference, and Prediction. Springer-Verlag, 2001.
n D. Heckerman, D. Geiger, and D. M. Chickering. Learning Bayesian networks: The
combination of knowledge and statistical data. Machine Learning, 1995.
n W. Li, J. Han, and J. Pei, CMAR: Accurate and Efficient Classification Based on Multiple
Class-Association Rules, ICDM'01.
73
References (3)
n T.-S. Lim, W.-Y. Loh, and Y.-S. Shih. A comparison of prediction accuracy, complexity,
and training time of thirty-three old and new classification algorithms. Machine
Learning, 2000.
n J. Magidson. The Chaid approach to segmentation modeling: Chi-squared
automatic interaction detection. In R. P. Bagozzi, editor, Advanced Methods of
Marketing Research, Blackwell Business, 1994.
n M. Mehta, R. Agrawal, and J. Rissanen. SLIQ: A fast scalable classifier for data
mining. EDBT'96.
n T. M. Mitchell. Machine Learning. McGraw Hill, 1997.
n S. K. Murthy, Automatic Construction of Decision Trees from Data: A Multi-
Disciplinary Survey, Data Mining and Knowledge Discovery 2(4): 345-389, 1998
n J. R. Quinlan. Induction of decision trees. Machine Learning, 1:81-106, 1986.
n J. R. Quinlan and R. M. Cameron-Jones. FOIL: A midterm report. ECML’93.
n J. R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993.
n J. R. Quinlan. Bagging, boosting, and C4.5. AAAI'96.
74
References (4)
n R. Rastogi and K. Shim. Public: A decision tree classifier that integrates building and
pruning. VLDB’98.
n J. Shafer, R. Agrawal, and M. Mehta. SPRINT: A scalable parallel classifier for data
mining. VLDB’96.
n J. W. Shavlik and T. G. Dietterich. Readings in Machine Learning. Morgan Kaufmann,
1990.
n P. Tan, M. Steinbach, and V. Kumar. Introduction to Data Mining. Addison Wesley,
2005.
n S. M. Weiss and C. A. Kulikowski. Computer Systems that Learn: Classification and
Prediction Methods from Statistics, Neural Nets, Machine Learning, and Expert
Systems. Morgan Kaufmann, 1991.
n S. M. Weiss and N. Indurkhya. Predictive Data Mining. Morgan Kaufmann, 1997.
n I. H. Witten and E. Frank. Data Mining: Practical Machine Learning Tools and
Techniques, 2ed. Morgan Kaufmann, 2005.
n X. Yin and J. Han. CPAR: Classification based on predictive association rules. SDM'03
n H. Yu, J. Yang, and J. Han. Classifying large data sets using SVM with hierarchical
clusters. KDD'03.
75
CS412 Midterm Exam Statistics
n Opinion Question Answering:
n Like the style: 70.83%, dislike: 29.16%
n 80-89: 54
n 70-79: 46
n 50-59: 15
n 40-49: 2
78
Predictor Error Measures
n Measure predictor accuracy: measure how far off the predicted value is from
the actual known value
n Loss function: measures the error betw. yi and the predicted value yi’
n Absolute error: | yi – yi’|
n Squared error: (yi – yi’)2
n Test error (generalization error): the average loss over the test set
n Mean absolute error: $\frac{1}{d}\sum_{i=1}^{d}|y_i - y_i'|$; mean squared error: $\frac{1}{d}\sum_{i=1}^{d}(y_i - y_i')^2$
n Relative absolute error: $\frac{\sum_{i=1}^{d}|y_i - y_i'|}{\sum_{i=1}^{d}|y_i - \bar{y}|}$; relative squared error: $\frac{\sum_{i=1}^{d}(y_i - y_i')^2}{\sum_{i=1}^{d}(y_i - \bar{y})^2}$
n The mean squared error exaggerates the presence of outliers
n The (square) root mean squared error is popularly used instead;
similarly, the root relative squared error
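n These measures as a small Python sketch (plain lists of actual and predicted values are assumed):

```python
def error_measures(y_true, y_pred):
    """Mean absolute/squared error, RMSE, and relative error variants."""
    d = len(y_true)
    mean_y = sum(y_true) / d
    mae = sum(abs(a - p) for a, p in zip(y_true, y_pred)) / d
    mse = sum((a - p) ** 2 for a, p in zip(y_true, y_pred)) / d
    rmse = mse ** 0.5  # back on the same scale as y, softens outlier effect
    # Relative errors normalize by the error of always predicting the mean
    rae = (sum(abs(a - p) for a, p in zip(y_true, y_pred))
           / sum(abs(a - mean_y) for a in y_true))
    rse = (sum((a - p) ** 2 for a, p in zip(y_true, y_pred))
           / sum((a - mean_y) ** 2 for a in y_true))
    return {"MAE": mae, "MSE": mse, "RMSE": rmse, "RAE": rae, "RSE": rse}
```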
79
Scalable Decision Tree Induction Methods
n SLIQ (EDBT’96 — Mehta et al.): builds an index for each attribute; only
the class list and the current attribute lists reside in memory
n SPRINT (VLDB’96 — Shafer et al.): constructs an attribute-list data
structure
n PUBLIC (VLDB’98 — Rastogi & Shim): integrates tree splitting and tree
pruning: stop growing the tree earlier
n RainForest (VLDB’98 — Gehrke, Ramakrishnan & Ganti)
n Builds an AVC-list (attribute, value, class label)
80
Data Cube-Based Decision-Tree Induction
n Integration of generalization with decision-tree induction
(Kamber et al.’97)
n Classification at primitive concept levels
n E.g., precise temperature, humidity, outlook, etc.
n Low-level concepts, scattered classes, bushy classification-
trees
n Semantic interpretation problems
n Cube-based multi-level classification
n Relevance analysis at multi-levels
n Information-gain analysis with dimension + level
81