
Classification Rule Mining

Chapter 5

Classification: Basic Concepts
• Classification: Basic Concepts
• Decision Tree Induction
• Bayes Classification Methods
• Rule-Based Classification
• Model Evaluation and Selection
• Techniques to Improve Classification Accuracy: Ensemble Methods
• Summary
Prediction Problems: Classification vs. Numeric Prediction
• Classification
– predicts categorical class labels (discrete or nominal)
– constructs a model from the training set and the values (class labels) of a classifying attribute, then uses the model to classify new data
• Numeric Prediction
– models continuous-valued functions, i.e., predicts unknown or missing values
• Typical applications
– Credit/loan approval: whether to approve an application
– Medical diagnosis: whether a tumor is cancerous or benign
– Fraud detection: whether a transaction is fraudulent
– Web page categorization: which category a page belongs to

Classification—A Two-Step Process
• Model construction: describing a set of predetermined classes
– Each tuple/sample is assumed to belong to a predefined class, as determined by the class label
attribute
– The set of tuples used for model construction is the training set
– The model is represented as classification rules, decision trees, or mathematical formulae
• Model usage: for classifying future or unknown objects
– Estimate accuracy of the model
• The known label of test sample is compared with the classified result from the model
• Accuracy rate is the percentage of test set samples that are correctly classified by the model
• Test set is independent of training set (otherwise overfitting)
– If the accuracy is acceptable, use the model to classify new data
• Note: If the test set is used to select models, it is called validation (test) set

Process (1): Model Construction

Training data are fed to a classification algorithm, which produces the classifier (model).

Training data:

NAME  RANK            YEARS  TENURED
Mike  Assistant Prof  3      no
Mary  Assistant Prof  7      yes
Bill  Professor       2      yes
Jim   Associate Prof  7      yes
Dave  Assistant Prof  6      no
Anne  Associate Prof  3      no

Classifier (model):

IF rank = ‘professor’ OR years > 6 THEN tenured = ‘yes’
Process (2): Using the Model in Prediction

The classifier is applied to testing data to estimate accuracy, then to unseen data.

Testing data:

NAME     RANK            YEARS  TENURED
Tom      Assistant Prof  2      no
Merlisa  Associate Prof  7      no
George   Professor       5      yes
Joseph   Assistant Prof  7      yes

Unseen data: (Jeff, Professor, 4) → Tenured? The model predicts ‘yes’, since rank = ‘professor’.
Supervised vs. Unsupervised Learning
• Supervised learning (classification)
– Supervision: The training data (observations, measurements, etc.) are
accompanied by labels indicating the class of the observations
– New data is classified based on the training set

• Unsupervised learning (clustering)
– The class labels of the training data are unknown
– Given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data

Issues in Classification and Prediction (1): Data Preparation

• Data cleaning
– Preprocess data in order to reduce noise and handle missing values
• Relevance analysis (feature selection)
– Remove the irrelevant or redundant attributes
• Data transformation
– Generalize and/or normalize data

Issues in Classification and Prediction (2): Evaluating Classification Methods
• Predictive accuracy
• Speed and scalability
– time to construct the model
– time to use the model
• Robustness
– handling noise and missing values
• Scalability
– efficiency in disk-resident databases
• Interpretability:
– understanding and insight provided by the model
• Goodness of rules
– decision tree size
– compactness of classification rules
Classification by Decision Tree Induction
• Decision tree
– A flow-chart-like tree structure
– Internal node denotes a test on an attribute
– Branch represents an outcome of the test
– Leaf nodes represent class labels or class distribution
• Decision tree generation consists of two phases
– Tree construction
• At start, all the training examples are at the root
• Partition examples recursively based on selected attributes
– Tree pruning
• Identify and remove branches that reflect noise or outliers
• Use of decision tree: Classifying an unknown sample
– Test the attribute values of the sample against the decision tree

Decision Tree Induction: An Example
 Training data set: Buys_computer
 The data set follows an example of Quinlan’s ID3 (Playing Tennis)

age     income  student  credit_rating  buys_computer
<=30    high    no       fair           no
<=30    high    no       excellent      no
31…40   high    no       fair           yes
>40     medium  no       fair           yes
>40     low     yes      fair           yes
>40     low     yes      excellent      no
31…40   low     yes      excellent      yes
<=30    medium  no       fair           no
<=30    low     yes      fair           yes
>40     medium  yes      fair           yes
<=30    medium  yes      excellent      yes
31…40   medium  no       excellent      yes
31…40   high    yes      fair           yes
>40     medium  no       excellent      no

 Resulting tree:

                    age?
          /          |          \
      <=30        31..40         >40
        |            |             |
    student?        yes     credit rating?
     /     \                 /         \
    no     yes          excellent     fair
    |       |                |          |
    no     yes               no        yes
Algorithm for Decision Tree Induction
• Basic algorithm (a greedy algorithm)
– Tree is constructed in a top-down recursive divide-and-conquer manner
– At start, all the training examples are at the root
– Attributes are categorical (if continuous-valued, they are discretized in advance)
– Examples are partitioned recursively based on selected attributes
– Test attributes are selected on the basis of a heuristic or statistical measure (e.g.,
information gain)
• Conditions for stopping partitioning
– All samples for a given node belong to the same class
– There are no remaining attributes for further partitioning – majority voting is
employed for classifying the leaf
– There are no samples left

Attribute Selection Measure

• Information gain (ID3/C4.5)


– All attributes are assumed to be categorical
– Can be modified for continuous-valued attributes
• Gini index (IBM IntelligentMiner)
– All attributes are assumed continuous-valued
– Assume there exist several possible split values for each attribute
– May need other tools, such as clustering, to get the possible split values
– Can be modified for categorical attributes

Information Gain (ID3/C4.5)
• Select the attribute with the highest information gain
• Assume there are two classes, P and N
– Let the set of examples S contain p elements of class P and n elements of class N
– The amount of information, needed to decide if an arbitrary example in S
belongs to P or N is defined as

$$I(p, n) = -\frac{p}{p+n}\log_2\frac{p}{p+n} - \frac{n}{p+n}\log_2\frac{n}{p+n}$$

Information Gain in Decision Tree Induction
• Assume that using attribute A a set S will be partitioned into sets {S1, S2, …, Sv} (where v is the number of distinct values of A)
– If Si contains pi examples of P and ni examples of N, the entropy, or the expected
information needed to classify objects in all subtrees Si is
 p n
E ( A)   i i I ( pi , ni )
i 1 p  n

• The encoding information that would be gained by branching on A


$$Gain(A) = I(p, n) - E(A)$$

Attribute Selection: Information Gain
 Class P: buys_computer = “yes” (9 tuples)
 Class N: buys_computer = “no” (5 tuples)

$$Info(D) = I(9,5) = -\frac{9}{14}\log_2\frac{9}{14} - \frac{5}{14}\log_2\frac{5}{14} = 0.940$$

age     pi  ni  I(pi, ni)
<=30    2   3   0.971
31…40   4   0   0
>40     3   2   0.971

$$Info_{age}(D) = \frac{5}{14} I(2,3) + \frac{4}{14} I(4,0) + \frac{5}{14} I(3,2) = 0.694$$

Here $\frac{5}{14} I(2,3)$ means “age <=30” has 5 out of 14 samples, with 2 yes’es and 3 no’s. Hence

$$Gain(age) = Info(D) - Info_{age}(D) = 0.246$$

Similarly,

Gain(income) = 0.029
Gain(student) = 0.151
Gain(credit_rating) = 0.048

(all computed over the buys_computer training data shown earlier)
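These numbers can be checked with a short script. A minimal Python sketch (not part of the original slides; the age counts are hard-coded from the table above):

from math import log2

def info(p, n):
    """I(p, n): expected information to classify a set with p P-examples and n N-examples."""
    if p == 0 or n == 0:
        return 0.0
    t = p + n
    return -(p / t) * log2(p / t) - (n / t) * log2(n / t)

def gain(partitions, p, n):
    """Gain(A) = I(p, n) - E(A), where partitions = [(p_i, n_i), ...] for attribute A."""
    e = sum((pi + ni) / (p + n) * info(pi, ni) for pi, ni in partitions)
    return info(p, n) - e

# buys_computer: 9 yes / 5 no; age splits into (2,3), (4,0), (3,2)
print(round(gain([(2, 3), (4, 0), (3, 2)], 9, 5), 3))  # 0.247 (the slide's 0.246 rounds intermediate values)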
Gini Index (IBM IntelligentMiner)

• If a data set T contains examples from n classes, the gini index, gini(T), is defined as

$$gini(T) = 1 - \sum_{j=1}^{n} p_j^2$$

where pj is the relative frequency of class j in T.

• If a data set T is split into two subsets T1 and T2 with sizes N1 and N2 respectively, the gini index of the split data is defined as

$$gini_{split}(T) = \frac{N_1}{N}\, gini(T_1) + \frac{N_2}{N}\, gini(T_2)$$

• The attribute that provides the smallest gini_split(T) is chosen to split the node (need to enumerate all possible splitting points for each attribute).

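A minimal sketch of the computation. The example split is mine, not from the slides: splitting the 14-tuple buys_computer data on income in {low, medium} vs. {high} gives subsets with class counts (7 yes, 3 no) and (2 yes, 2 no):

def gini(counts):
    """gini(T) = 1 - sum_j p_j^2, for the class counts in T."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

def gini_split(counts1, counts2):
    """Weighted gini index of a binary split T -> T1, T2."""
    n1, n2 = sum(counts1), sum(counts2)
    return n1 / (n1 + n2) * gini(counts1) + n2 / (n1 + n2) * gini(counts2)

print(round(gini_split([7, 3], [2, 2]), 3))  # 0.443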
Enhancements to Basic Decision Tree Induction

• Allow for continuous-valued attributes


– Dynamically define new discrete-valued attributes that partition the
continuous attribute value into a discrete set of intervals
• Handle missing attribute values
– Assign the most common value of the attribute
– Assign probability to each of the possible values
• Attribute construction
– Create new attributes based on existing ones that are sparsely
represented
– This reduces fragmentation, repetition, and replication
Bayesian Classification: why?
• A statistical classifier: performs probabilistic prediction, i.e., predicts class
membership probabilities
• Foundation: Based on Bayes’ Theorem.
• Performance: A simple Bayesian classifier, naïve Bayesian classifier, has comparable
performance with decision tree and selected neural network classifiers
• Incremental: Each training example can incrementally increase/decrease the
probability that a hypothesis is correct — prior knowledge can be combined with
observed data
• Standard: Even when Bayesian methods are computationally intractable, they can
provide a standard of optimal decision making against which other methods can be
measured

Bayes’ Theorem: Basics
• Total probability theorem:

$$P(B) = \sum_{i=1}^{M} P(B \mid A_i)\, P(A_i)$$

• Bayes’ theorem:

$$P(H \mid X) = \frac{P(X \mid H)\, P(H)}{P(X)}$$
– Let X be a data sample (“evidence”): class label is unknown
– Let H be a hypothesis that X belongs to class C
– Classification is to determine P(H|X), (i.e., posteriori probability): the probability that the
hypothesis holds given the observed data sample X
– P(H) (prior probability): the initial probability
• E.g., X will buy computer, regardless of age, income, …
– P(X): probability that sample data is observed
– P(X|H) (likelihood): the probability of observing the sample X, given that the hypothesis holds
• E.g., Given that X will buy computer, the prob. that X is 31..40, medium
income
Prediction Based on Bayes’ Theorem

• Given training data X, the posteriori probability of a hypothesis H, P(H|X), follows Bayes’ theorem:

$$P(H \mid X) = \frac{P(X \mid H)\, P(H)}{P(X)}$$
• Informally, this can be viewed as
posteriori = likelihood x prior/evidence
• Predicts X belongs to Ci iff the probability P(Ci|X) is the highest among all the
P(Ck|X) for all the k classes
• Practical difficulty: It requires initial knowledge of many probabilities, involving
significant computational cost
Classification is to Derive the Maximum Posteriori
• Let D be a training set of tuples and their associated class labels, and each tuple is
represented by an n-D attribute vector X = (x1, x2, …, xn)
• Suppose there are m classes C1, C2, …, Cm.
• Classification is to derive the maximum posteriori, i.e., the maximal P(Ci|X)
• This can be derived from Bayes’ theorem:

$$P(C_i \mid X) = \frac{P(X \mid C_i)\, P(C_i)}{P(X)}$$

• Since P(X) is constant for all classes, only $P(X \mid C_i)\, P(C_i)$ needs to be maximized

Naïve Bayes Classifier

• A simplified assumption: attributes are conditionally independent (i.e., no dependence relation between attributes):

$$P(X \mid C_i) = P(x_1, \ldots, x_n \mid C_i) = \prod_{k=1}^{n} P(x_k \mid C_i)$$

• This greatly reduces the computation cost: Only counts the class distribution
• If Ak is categorical, P(xk|Ci) is the # of tuples in Ci having value xk for Ak divided by |Ci, D| (#
of tuples of Ci in D)

• If Ak is continuous-valued, P(xk|Ci) is usually computed based on a Gaussian distribution with mean μ and standard deviation σ:

$$g(x, \mu, \sigma) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$

and P(xk|Ci) is

$$P(x_k \mid C_i) = g(x_k, \mu_{C_i}, \sigma_{C_i})$$
Naïve Bayes Classifier: Training Dataset

Class:
C1: buys_computer = ‘yes’
C2: buys_computer = ‘no’

Training data: the buys_computer data set shown earlier (14 tuples over age, income, student, credit_rating).

Data to be classified:
X = (age <=30, income = medium, student = yes, credit_rating = fair)
Naïve Bayes Classifier: An Example
• P(Ci):
P(buys_computer = “yes”) = 9/14 = 0.643
P(buys_computer = “no”) = 5/14 = 0.357

• Compute P(X|Ci) for each class:
P(age = “<=30” | buys_computer = “yes”) = 2/9 = 0.222
P(age = “<=30” | buys_computer = “no”) = 3/5 = 0.6
P(income = “medium” | buys_computer = “yes”) = 4/9 = 0.444
P(income = “medium” | buys_computer = “no”) = 2/5 = 0.4
P(student = “yes” | buys_computer = “yes”) = 6/9 = 0.667
P(student = “yes” | buys_computer = “no”) = 1/5 = 0.2
P(credit_rating = “fair” | buys_computer = “yes”) = 6/9 = 0.667
P(credit_rating = “fair” | buys_computer = “no”) = 2/5 = 0.4

• X = (age <=30, income = medium, student = yes, credit_rating = fair)
P(X|buys_computer = “yes”) = 0.222 x 0.444 x 0.667 x 0.667 = 0.044
P(X|buys_computer = “no”) = 0.6 x 0.4 x 0.2 x 0.4 = 0.019
P(X|buys_computer = “yes”) * P(buys_computer = “yes”) = 0.044 x 0.643 = 0.028
P(X|buys_computer = “no”) * P(buys_computer = “no”) = 0.019 x 0.357 = 0.007
Therefore, X belongs to class buys_computer = “yes”.
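The whole computation fits in a few lines of Python. A minimal sketch (not from the slides) that reproduces the numbers above from the buys_computer table, without Laplacian correction:

from collections import Counter

# (age, income, student, credit_rating, buys_computer)
data = [
    ("<=30", "high", "no", "fair", "no"), ("<=30", "high", "no", "excellent", "no"),
    ("31…40", "high", "no", "fair", "yes"), (">40", "medium", "no", "fair", "yes"),
    (">40", "low", "yes", "fair", "yes"), (">40", "low", "yes", "excellent", "no"),
    ("31…40", "low", "yes", "excellent", "yes"), ("<=30", "medium", "no", "fair", "no"),
    ("<=30", "low", "yes", "fair", "yes"), (">40", "medium", "yes", "fair", "yes"),
    ("<=30", "medium", "yes", "excellent", "yes"), ("31…40", "medium", "no", "excellent", "yes"),
    ("31…40", "high", "yes", "fair", "yes"), (">40", "medium", "no", "excellent", "no"),
]
X = ("<=30", "medium", "yes", "fair")

class_counts = Counter(row[-1] for row in data)
scores = {}
for c, nc in class_counts.items():
    score = nc / len(data)                                  # prior P(Ci)
    for k, xk in enumerate(X):                              # times each P(xk|Ci)
        score *= sum(1 for r in data if r[k] == xk and r[-1] == c) / nc
    scores[c] = score
print(scores)                        # {'yes': 0.028..., 'no': 0.006...}
print(max(scores, key=scores.get))   # -> 'yes'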
Avoiding the Zero-Probability Problem
• Naïve Bayesian prediction requires each conditional prob. be non-zero. Otherwise, the
predicted prob. will be zero
P(X|Ci)=P(x1,…,xk|C) = P(x1|C)·…·P(xk|C)
• Ex. Suppose a dataset with 1000 tuples: income = low (0 tuples), income = medium (990), and income = high (10)
• Use Laplacian correction (or Laplacian estimator)
– Adding 1 to each case
Prob(income = low) = 1/1003
Prob(income = medium) = 991/1003
Prob(income = high) = 11/1003
– The “corrected” prob. estimates are close to their “uncorrected” counterparts
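A one-function sketch of the correction for this example:

def laplace(value_counts):
    """Laplacian correction: add 1 to each value's count before estimating probabilities."""
    total = sum(value_counts.values()) + len(value_counts)
    return {v: (c + 1) / total for v, c in value_counts.items()}

print(laplace({"low": 0, "medium": 990, "high": 10}))
# {'low': 1/1003, 'medium': 991/1003, 'high': 11/1003}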
Naïve Bayes Classifier: Comments
• Advantages
– Easy to implement; the independence assumption makes the computation feasible
– Good results obtained in most of the cases
• Disadvantages
– Assumption: class conditional independence, therefore loss of accuracy
– Practically, dependencies exist among variables
• E.g., hospital patient data: profile (age, family history, etc.), symptoms (fever, cough, etc.), disease (lung cancer, diabetes, etc.)
• Dependencies among these cannot be modeled by a Naïve Bayes classifier
• Solution:
– Bayesian network
– Decision tree
Bayesian Belief Networks (I)
Figure: a belief network over the variables FamilyHistory, Smoker, LungCancer, Emphysema, PositiveXRay, and Dyspnea, with FamilyHistory (FH) and Smoker (S) as the parents of LungCancer (LC).

The conditional probability table (CPT) for the variable LungCancer:

       (FH, S)  (FH, ~S)  (~FH, S)  (~FH, ~S)
LC     0.8      0.5       0.7       0.1
~LC    0.2      0.5       0.3       0.9
Bayesian Belief Networks (II)
• A Bayesian belief network allows a subset of the variables to be conditionally independent
• A graphical model of causal relationships
• Several cases of learning Bayesian belief networks
– Given both network structure and all the variables
– Given network structure but only some variables
– When the network structure is not known in advance

Rule-Based Classification: IF-THEN Rules (1)
• Represent the knowledge in the form of IF-THEN rules
R1: IF age = youth AND student = yes THEN buys_computer = yes
– Rule antecedent/precondition vs. rule consequent
• Assessment of a rule: coverage and accuracy
– ncovers = # of tuples covered by R
– ncorrect = # of tuples correctly classified by R
– coverage(R) = ncovers /|D| /* D: training data set */
– accuracy(R) = ncorrect / ncovers
• R1 covers 2 of the 14 tuples: coverage(R1) = 2/14 = 14.28% and accuracy(R1) = 2/2 = 100%.
• Example: Let X = (age = youth, income = medium, student = yes, credit_rating = fair).
• X satisfies R1, which triggers the rule, so X is assigned the class label buys_computer = yes.

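A minimal sketch of the two assessments, on an illustrative 14-tuple data set built so that exactly two tuples satisfy R1 (the tuple values are made up to match the slide's counts):

def coverage_accuracy(antecedent, consequent, data):
    """coverage(R) = n_covers / |D|; accuracy(R) = n_correct / n_covers."""
    covered = [(x, y) for x, y in data if antecedent(x)]
    n_correct = sum(1 for x, y in covered if y == consequent)
    return len(covered) / len(data), n_correct / len(covered)

data = [({"age": "youth", "student": "yes"}, "yes")] * 2 \
     + [({"age": "senior", "student": "no"}, "no")] * 12
cov, acc = coverage_accuracy(
    lambda x: x["age"] == "youth" and x["student"] == "yes", "yes", data)
print(f"coverage = {cov:.2%}, accuracy = {acc:.0%}")  # 14.29%, 100%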
Rule-Based Classification: IF-THEN Rules (2)

• If more than one rule is triggered, we need conflict resolution
– Size ordering: assign the highest priority to the triggering rule that has the “toughest” requirement (i.e., with the most attribute tests)
– Class-based ordering: decreasing order of prevalence or misclassification cost per class
– Rule-based ordering (decision list): rules are organized into one long priority list, according to some measure of rule quality or by experts

Rule Extraction from a Decision Tree

age?
 Rules are easier to understand than large trees
 One rule is created for each path from the root to <=30 31..40 >40

a leaf student? credit rating?


yes
 Each attribute-value pair along a path forms a excellent fair
no yes
conjunction: the leaf holds the class prediction
no yes no yes
 Rules are mutually exclusive and exhaustive

• Example: Rule extraction from our buys_computer decision-tree


IF age = young AND student = no THEN buys_computer = no
IF age = young AND student = yes THEN buys_computer = yes
IF age = mid-age THEN buys_computer = yes
IF age = old AND credit_rating = excellent THEN buys_computer = no
IF age = old AND credit_rating = fair THEN buys_computer = yes
Rule Induction: Sequential Covering Method
• Sequential covering algorithm: Extracts rules directly from training data
• Typical sequential covering algorithms: FOIL, AQ, CN2, RIPPER
• Rules are learned sequentially, each for a given class Ci will cover many tuples of Ci
but none (or few) of the tuples of other classes
• Steps:
– Rules are learned one at a time
– Each time a rule is learned, the tuples covered by the rules are removed
– Repeat the process on the remaining tuples until termination condition, e.g., when no more
training examples or when the quality of a rule returned is below a user-specified threshold
• Compare with decision-tree induction: learning a set of rules simultaneously

Sequential Covering Algorithm
while (enough target tuples left)
    generate a rule
    remove positive target tuples satisfying this rule

Figure: each learned rule covers a region of the positive examples; Rules 1, 2, and 3 cover successive subsets.

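A skeleton of this loop in Python (a sketch only: learn_one_rule is left abstract, and min_quality plays the role of the user-specified quality threshold):

def sequential_covering(data, target_class, learn_one_rule, min_quality=0.0):
    """Learn rules for target_class one at a time, removing covered positive tuples."""
    rules, remaining = [], list(data)
    while any(y == target_class for _, y in remaining):      # enough target tuples left
        rule, quality = learn_one_rule(remaining, target_class)
        if rule is None or quality < min_quality:            # termination condition
            break
        rules.append(rule)
        remaining = [(x, y) for x, y in remaining            # remove covered positives
                     if not (rule(x) and y == target_class)]
    return rules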
Rule Generation
• To generate a rule:

while (true)
    find the best predicate p
    if foil-gain(p) > threshold then add p to current rule
    else break

Figure: the current rule grows from A3=1 to A3=1 && A1=2 to A3=1 && A1=2 && A8=5, covering fewer negative examples at each step.
How to Learn-One-Rule?
• Start with the most general rule possible: condition = empty
• Adding new attributes by adopting a greedy depth-first strategy
– Picks the one that most improves the rule quality
• Rule-Quality measures: consider both coverage and accuracy
– FOIL-gain (in FOIL & RIPPER): assesses the information gained by extending the condition of rule R to rule R’:

$$FOIL\_Gain = pos' \times \left( \log_2 \frac{pos'}{pos' + neg'} - \log_2 \frac{pos}{pos + neg} \right)$$

• favors rules that have high accuracy and cover many positive tuples
• Rule pruning based on an independent set of test tuples:

$$FOIL\_Prune(R) = \frac{pos - neg}{pos + neg}$$

where pos and neg are the # of positive/negative tuples covered by R. If FOIL_Prune is higher for the pruned version of R, prune R.
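A minimal sketch of both measures, with made-up counts for illustration:

from math import log2

def foil_gain(pos, neg, pos2, neg2):
    """Gain from extending rule R (covering pos/neg tuples) to R' (covering pos2/neg2)."""
    if pos2 == 0:
        return float("-inf")
    return pos2 * (log2(pos2 / (pos2 + neg2)) - log2(pos / (pos + neg)))

def foil_prune(pos, neg):
    """FOIL_Prune(R) = (pos - neg) / (pos + neg), on an independent pruning set."""
    return (pos - neg) / (pos + neg)

# e.g., R covers 10+/10-; adding a predicate narrows it to 8+/2-
print(round(foil_gain(10, 10, 8, 2), 3))  # 8 * (log2(0.8) - log2(0.5)) ≈ 5.425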
The k-Nearest Neighbor Algorithm
• All instances correspond to points in the n-D space.
• The nearest neighbors are defined in terms of Euclidean distance.
• The target function could be discrete- or real-valued.
• For discrete-valued targets, k-NN returns the most common value among the k training examples nearest to xq.
• Voronoi diagram: the decision surface induced by 1-NN for a typical set of training examples.

Figure: positive (+) and negative (-) training points scattered around a query point xq.
Discussion on the k-NN Algorithm
• The k-NN algorithm for continuous-valued target functions
– Calculate the mean values of the k nearest neighbors
• Distance-weighted nearest neighbor algorithm
– Weight the contribution of each of the k neighbors according to its distance to the query point xq:

$$w \equiv \frac{1}{d(x_q, x_i)^2}$$

• giving greater weight to closer neighbors
– Similarly, for real-valued target functions
• Robust to noisy data by averaging k-nearest neighbors
• Curse of dimensionality: distance between neighbors could be dominated by
irrelevant attributes.
– To overcome it, stretch the axes or eliminate the least relevant attributes.

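A minimal distance-weighted k-NN sketch for discrete-valued targets (the toy points are mine, not from the slides):

from collections import defaultdict
from math import dist  # Euclidean distance, Python >= 3.8

def knn_predict(xq, data, k=3):
    """Distance-weighted k-NN vote with w = 1 / d(xq, xi)^2."""
    neighbors = sorted(data, key=lambda p: dist(xq, p[0]))[:k]
    votes = defaultdict(float)
    for xi, yi in neighbors:
        d = dist(xq, xi)
        votes[yi] += float("inf") if d == 0 else 1.0 / d ** 2
    return max(votes, key=votes.get)

data = [((0, 0), "-"), ((1, 0), "-"), ((3, 3), "+"), ((4, 3), "+"), ((3, 4), "+")]
print(knn_predict((3.5, 3.5), data))  # -> '+'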
What Is Prediction?
• Prediction is similar to classification
– First, construct a model
– Second, use model to predict unknown value
• Major method for prediction is regression
– Linear and multiple regression
– Non-linear regression
• Prediction is different from classification
– Classification predicts categorical class labels
– Prediction models continuous-valued functions

Predictive Modeling in Databases
• Predictive modeling: Predict data values or construct generalized linear
models based on the database data.
• One can only predict value ranges or category distributions
• Method outline:
– Minimal generalization
– Attribute relevance analysis
– Generalized linear model construction
– Prediction
• Determine the major factors which influence the prediction
– Data relevance analysis: uncertainty measurement, entropy analysis,
expert judgement, etc.
• Multi-level prediction: drill-down and roll-up analysis
Regression Analysis and Log-Linear Models in Prediction

• Linear regression: Y = α + βX
– Two parameters, α and β, specify the line and are to be estimated by using the data at hand
– using the least squares criterion on the known values of Y1, Y2, …, X1, X2, …
• Multiple regression: Y = b0 + b1X1 + b2X2
– Many nonlinear functions can be transformed into the above
• Log-linear models:
– The multi-way table of joint probabilities is approximated by a product of lower-order tables
– Probability: $p(a, b, c, d) = \alpha_{ab}\, \beta_{ac}\, \chi_{ad}\, \delta_{bcd}$

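A least-squares sketch for the simple linear case (the salary-vs-experience numbers are illustrative, not from the slides):

def linear_regression(xs, ys):
    """Least-squares estimates of alpha and beta for Y = alpha + beta * X."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    beta = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
            / sum((x - mx) ** 2 for x in xs))
    return my - beta * mx, beta

xs = [3, 8, 9, 13, 3, 6, 11, 21, 1, 16]        # years of experience
ys = [30, 57, 64, 72, 36, 43, 59, 90, 20, 83]  # salary (in $1000s)
alpha, beta = linear_regression(xs, ys)
print(round(alpha, 1), round(beta, 2))  # 23.2 3.54, i.e., Y ≈ 23.2 + 3.5 X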
Model Evaluation and Selection
• Evaluation metrics: How can we measure accuracy? Other metrics to consider?
• Use validation test set of class-labeled tuples instead of training set when
assessing accuracy
• Methods for estimating a classifier’s accuracy:
– Holdout method, random subsampling
– Cross-validation
– Bootstrap
• Comparing classifiers:
– Confidence intervals
– Cost-benefit analysis and ROC Curves

Classifier Evaluation Metrics: Confusion Matrix
Confusion Matrix:
Actual class\Predicted class C1 ¬ C1
C1 True Positives (TP) False Negatives (FN)
¬ C1 False Positives (FP) True Negatives (TN)

Example of Confusion Matrix:

Actual class\Predicted class   buy_computer = yes   buy_computer = no   Total
buy_computer = yes             6954                 46                  7000
buy_computer = no              412                  2588                3000
Total                          7366                 2634                10000

• Given m classes, an entry CM(i,j) in a confusion matrix indicates the # of tuples in class i that were labeled by the classifier as class j
• Extra rows/columns may be added to provide totals
Classifier Evaluation Metrics: Accuracy, Error Rate, Sensitivity
and Specificity
A\P   C    ¬C
C     TP   FN   P
¬C    FP   TN   N
      P’   N’   All

• Classifier accuracy, or recognition rate: percentage of test set tuples that are correctly classified
  Accuracy = (TP + TN)/All
• Error rate: 1 – accuracy, or Error rate = (FP + FN)/All

 Class imbalance problem:
 One class may be rare, e.g., fraud or HIV-positive
 Significant majority of the negative class and minority of the positive class
 Sensitivity: true positive recognition rate, Sensitivity = TP/P
 Specificity: true negative recognition rate, Specificity = TN/N
Classifier Evaluation Metrics: Precision and Recall,
and F-measures
• Precision: exactness – what % of tuples that the classifier labeled as positive are actually positive?
  Precision = TP / (TP + FP)
• Recall: completeness – what % of positive tuples did the classifier label as positive?
  Recall = TP / (TP + FN)
• Perfect score is 1.0
• Inverse relationship between precision and recall
• F measure (F1 or F-score): harmonic mean of precision and recall
  F = 2 × precision × recall / (precision + recall)
• Fβ: weighted measure of precision and recall
  Fβ = (1 + β²) × precision × recall / (β² × precision + recall)
– assigns β times as much weight to recall as to precision
Classifier Evaluation Metrics: Example

Actual class\Predicted class   cancer = yes   cancer = no   Total   Recognition (%)
cancer = yes                   90             210           300     30.00 (sensitivity)
cancer = no                    140            9560          9700    98.56 (specificity)
Total                          230            9770          10000   96.50 (accuracy)

– Precision = 90/230 = 39.13%; Recall = 90/300 = 30.00%

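All of these metrics follow from the four confusion-matrix cells. A sketch that reproduces the cancer example:

def metrics(tp, fn, fp, tn):
    """Accuracy, sensitivity/recall, specificity, precision, and F1 from TP/FN/FP/TN."""
    p, n = tp + fn, fp + tn
    precision, recall = tp / (tp + fp), tp / p
    return {
        "accuracy": (tp + tn) / (p + n),
        "sensitivity (recall)": recall,
        "specificity": tn / n,
        "precision": precision,
        "F1": 2 * precision * recall / (precision + recall),
    }

for name, value in metrics(tp=90, fn=210, fp=140, tn=9560).items():
    print(f"{name}: {value:.2%}")
# precision ≈ 39.13%, recall = 30.00%, accuracy = 96.50%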
Evaluating Classifier Accuracy: Holdout & Cross-Validation
Methods
• Holdout method
– Given data is randomly partitioned into two independent sets
• Training set (e.g., 2/3) for model construction
• Test set (e.g., 1/3) for accuracy estimation
– Random sampling: a variation of holdout
• Repeat holdout k times, accuracy = avg. of the accuracies obtained
• Cross-validation (k-fold, where k = 10 is most popular)
– Randomly partition the data into k mutually exclusive subsets, each
approximately equal size
– At i-th iteration, use Di as test set and others as training set
– Leave-one-out: k folds where k = # of tuples, for small sized data
– Stratified cross-validation: folds are stratified so that class dist. in each fold is
approx. the same as that in the initial data
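A sketch of the partitioning step for k-fold cross-validation (plain Python, unstratified):

import random

def k_fold_indices(n, k=10, seed=1):
    """Randomly partition n tuple indices into k mutually exclusive, near-equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

folds = k_fold_indices(n=14, k=7)
for i, test_fold in enumerate(folds):
    # at the i-th iteration, fold i is the test set and the rest is the training set
    train = [j for fold in folds if fold is not test_fold for j in fold]
    print(f"iteration {i}: test = {sorted(test_fold)}, |train| = {len(train)}")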
Evaluating Classifier Accuracy: Bootstrap
• Bootstrap
– Works well with small data sets
– Samples the given training tuples uniformly with replacement
• i.e., each time a tuple is selected, it is equally likely to be selected again and re-added to the
training set
• Several bootstrap methods; a common one is the .632 bootstrap
– A data set with d tuples is sampled d times, with replacement, resulting in a training set of d samples. The data tuples that did not make it into the training set end up forming the test set. About 63.2% of the original data end up in the bootstrap, and the remaining 36.8% form the test set (since (1 – 1/d)^d ≈ e^{-1} = 0.368)
– Repeat the sampling procedure k times; the overall accuracy combines, over the k iterations, 0.632 × Acc(Mi) on the test set with 0.368 × Acc(Mi) on the training set
Ensemble Methods: Increasing the Accuracy
• Ensemble methods
– Use a combination of models to increase
accuracy
– Combine a series of k learned models, M1, M2,
…, Mk, with the aim of creating an improved
model M*
• Popular ensemble methods
– Bagging: averaging the prediction over a
collection of classifiers
– Boosting: weighted vote with a collection of
classifiers
– Ensemble: combining a set of heterogeneous
classifiers
Bagging: Bootstrap Aggregation
• Analogy: Diagnosis based on multiple doctors’ majority vote
• Training
– Given a set D of d tuples, at each iteration i, a training set Di of d tuples is sampled with
replacement from D (i.e., bootstrap)
– A classifier model Mi is learned for each training set Di
• Classification: classify an unknown sample X
– Each classifier Mi returns its class prediction
– The bagged classifier M* counts the votes and assigns the class with the most votes to X
• Prediction: can be applied to the prediction of continuous values by taking the average value of each
prediction for a given test tuple
• Accuracy
– Often significantly better than a single classifier derived from D
– For noisy data: not considerably worse, more robust
– Proved improved accuracy in prediction
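A sketch of the training and voting loops (learn is any procedure that returns a classifier function; both the learner and the data format are assumptions of this sketch):

import random
from collections import Counter

def bagging_train(D, learn, k=10, seed=0):
    """Train k models, each on a bootstrap sample of D (d tuples drawn with replacement)."""
    rng = random.Random(seed)
    return [learn([rng.choice(D) for _ in range(len(D))]) for _ in range(k)]

def bagging_classify(models, x):
    """Each model votes; the class with the most votes wins."""
    return Counter(m(x) for m in models).most_common(1)[0][0]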
Boosting
• Analogy: Consult several doctors, based on a combination of weighted diagnoses—
weight assigned based on the previous diagnosis accuracy
• How does boosting work?
– Weights are assigned to each training tuple
– A series of k classifiers is iteratively learned
– After a classifier Mi is learned, the weights are updated to allow the subsequent classifier, Mi+1, to
pay more attention to the training tuples that were misclassified by Mi
– The final M* combines the votes of each individual classifier, where the weight of each classifier's
vote is a function of its accuracy
• Boosting algorithm can be extended for numeric prediction
• Comparing with bagging: Boosting tends to have greater accuracy, but it also risks
overfitting the model to misclassified data

Adaboost (Freund and Schapire, 1997)
• Given a set of d class-labeled tuples, (X1, y1), …, (Xd, yd)
• Initially, all the weights of tuples are set the same (1/d)
• Generate k classifiers in k rounds. At round i,
– Tuples from D are sampled (with replacement) to form a training set Di of the same size
– Each tuple’s chance of being selected is based on its weight
– A classification model Mi is derived from Di
– Its error rate is calculated using Di as a test set
– If a tuple is misclassified, its weight is increased, otherwise it is decreased
• Error rate: err(Xj) is the misclassification error of tuple Xj. Classifier Mi’s error rate is the weighted sum over the misclassified tuples:

$$error(M_i) = \sum_{j=1}^{d} w_j \times err(X_j)$$

• The weight of classifier Mi’s vote is

$$\log \frac{1 - error(M_i)}{error(M_i)}$$
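A sketch of the two weight computations (the example round and its 10% weighted error are made up for illustration):

from math import log

def vote_weight(error):
    """Weight of classifier Mi's vote: log((1 - error(Mi)) / error(Mi))."""
    return log((1 - error) / error)

def reweight(weights, misclassified, error):
    """Scale correctly classified tuples by error/(1 - error), then normalize,
    which effectively increases the weights of misclassified tuples."""
    new = [w if miss else w * error / (1 - error)
           for w, miss in zip(weights, misclassified)]
    total = sum(new)
    return [w / total for w in new]

w = reweight([0.1] * 10, [True] + [False] * 9, error=0.1)
print([round(x, 3) for x in w])    # [0.5, 0.056, ...]: the misclassified tuple now dominates
print(round(vote_weight(0.1), 3))  # 2.197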
Random Forest (Breiman 2001)
• Random Forest:
– Each classifier in the ensemble is a decision tree classifier and is generated using a random
selection of attributes at each node to determine the split
– During classification, each tree votes and the most popular class is returned
• Two Methods to construct Random Forest:
– Forest-RI (random input selection): Randomly select, at each node, F attributes as candidates for
the split at the node. The CART methodology is used to grow the trees to maximum size
– Forest-RC (random linear combinations): Creates new attributes (or features) that are a linear
combination of the existing attributes (reduces the correlation between individual classifiers)
• Comparable in accuracy to Adaboost, but more robust to errors and outliers
• Insensitive to the number of attributes selected for consideration at each split, and faster than
bagging or boosting

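For practical use, scikit-learn’s RandomForestClassifier implements Forest-RI-style attribute selection. A usage sketch (assumes scikit-learn is installed, and uses its bundled iris data purely for illustration):

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
# n_estimators: number of trees; max_features: F candidate attributes per split
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=0)
forest.fit(X, y)
print(forest.predict(X[:3]))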
Summary
• Classification is a form of data analysis that extracts models describing important data
classes.
• Effective and scalable methods have been developed for decision tree induction, Naive
Bayesian classification, rule-based classification, and many other classification methods.
• Evaluation metrics include: accuracy, sensitivity, specificity, precision, recall, F measure,
and Fß measure.
• Stratified k-fold cross-validation is recommended for accuracy estimation. Bagging and
boosting can be used to increase overall accuracy by learning and combining a series of
individual models.

