
Classification Rule Mining

Chapter 5

Classification: Basic Concepts
• Classification: Basic Concepts
• Decision Tree Induction
• Bayes Classification Methods
• Rule-Based Classification
• Model Evaluation and Selection
• Techniques to Improve Classification Accuracy: Ensemble Methods
• Summary
Prediction Problems: Classification vs. Numeric Prediction
• Classification
– predicts categorical class labels (discrete or nominal)
– constructs a model from the training set and the values (class labels) of a classifying attribute, then uses the model to classify new data
• Numeric Prediction
– models continuous-valued functions, i.e., predicts unknown or missing values
• Typical applications
– Credit/loan approval: whether to approve an application
– Medical diagnosis: whether a tumor is cancerous or benign
– Fraud detection: whether a transaction is fraudulent
– Web page categorization: which category a page belongs to

Classification—A Two-Step Process
• Model construction: describing a set of predetermined classes
– Each tuple/sample is assumed to belong to a predefined class, as determined by the class label
attribute
– The set of tuples used for model construction is the training set
– The model is represented as classification rules, decision trees, or mathematical formulae
• Model usage: for classifying future or unknown objects
– Estimate accuracy of the model
• The known label of test sample is compared with the classified result from the model
• Accuracy rate is the percentage of test set samples that are correctly classified by the model
• Test set is independent of training set (otherwise overfitting)
– If the accuracy is acceptable, use the model to classify new data
• Note: If the test set is used to select models, it is called validation (test) set

Process (1): Model Construction

Training data are fed to a classification algorithm, which produces the classifier (model).

Training data:

NAME  RANK            YEARS  TENURED
Mike  Assistant Prof  3      no
Mary  Assistant Prof  7      yes
Bill  Professor       2      yes
Jim   Associate Prof  7      yes
Dave  Assistant Prof  6      no
Anne  Associate Prof  3      no

Classifier (model):

IF rank = ‘professor’ OR years > 6 THEN tenured = ‘yes’
Process (2): Using the Model in Prediction

The classifier is applied to testing data to estimate accuracy, then to unseen data.

Testing data:

NAME     RANK            YEARS  TENURED
Tom      Assistant Prof  2      no
Merlisa  Associate Prof  7      no
George   Professor       5      yes
Joseph   Assistant Prof  7      yes

Unseen data: (Jeff, Professor, 4) → Tenured? The model predicts ‘yes’, since rank = ‘professor’.
Supervised vs. Unsupervised Learning
• Supervised learning (classification)
– Supervision: The training data (observations, measurements, etc.) are
accompanied by labels indicating the class of the observations
– New data is classified based on the training set

• Unsupervised learning (clustering)
– The class labels of the training data are unknown
– Given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data

Issues in Classification and Prediction (1): Data Preparation

• Data cleaning
– Preprocess data in order to reduce noise and handle missing values
• Relevance analysis (feature selection)
– Remove the irrelevant or redundant attributes
• Data transformation
– Generalize and/or normalize data

Issues in Classification and Prediction (2): Evaluating Classification Methods
• Predictive accuracy
• Speed and scalability
– time to construct the model
– time to use the model
• Robustness
– handling noise and missing values
• Scalability
– efficiency in disk-resident databases
• Interpretability:
– understanding and insight provided by the model
• Goodness of rules
– decision tree size
– compactness of classification rules
Classification by Decision Tree Induction
• Decision tree
– A flow-chart-like tree structure
– Internal node denotes a test on an attribute
– Branch represents an outcome of the test
– Leaf nodes represent class labels or class distribution
• Decision tree generation consists of two phases
– Tree construction
• At start, all the training examples are at the root
• Partition examples recursively based on selected attributes
– Tree pruning
• Identify and remove branches that reflect noise or outliers
• Use of decision tree: Classifying an unknown sample
– Test the attribute values of the sample against the decision tree

Decision Tree Induction: An Example
 Training data set: Buys_computer
 The data set follows an example of Quinlan’s ID3 (Playing Tennis)

age     income  student  credit_rating  buys_computer
<=30    high    no       fair           no
<=30    high    no       excellent      no
31…40   high    no       fair           yes
>40     medium  no       fair           yes
>40     low     yes      fair           yes
>40     low     yes      excellent      no
31…40   low     yes      excellent      yes
<=30    medium  no       fair           no
<=30    low     yes      fair           yes
>40     medium  yes      fair           yes
<=30    medium  yes      excellent      yes
31…40   medium  no       excellent      yes
31…40   high    yes      fair           yes
>40     medium  no       excellent      no

 Resulting tree:

                    age?
          /          |          \
      <=30        31..40         >40
        |            |             |
    student?        yes     credit rating?
     /     \                 /         \
    no     yes          excellent     fair
    |       |                |          |
    no     yes               no        yes
Algorithm for Decision Tree Induction
• Basic algorithm (a greedy algorithm)
– Tree is constructed in a top-down recursive divide-and-conquer manner
– At start, all the training examples are at the root
– Attributes are categorical (if continuous-valued, they are discretized in advance)
– Examples are partitioned recursively based on selected attributes
– Test attributes are selected on the basis of a heuristic or statistical measure (e.g.,
information gain)
• Conditions for stopping partitioning
– All samples for a given node belong to the same class
– There are no remaining attributes for further partitioning – majority voting is
employed for classifying the leaf
– There are no samples left

Attribute Selection Measure

• Information gain (ID3/C4.5)


– All attributes are assumed to be categorical
– Can be modified for continuous-valued attributes
• Gini index (IBM IntelligentMiner)
– All attributes are assumed continuous-valued
– Assume there exist several possible split values for each attribute
– May need other tools, such as clustering, to get the possible split values
– Can be modified for categorical attributes

Information Gain (ID3/C4.5)
• Select the attribute with the highest information gain
• Assume there are two classes, P and N
– Let the set of examples S contain p elements of class P and n elements of class N
– The amount of information, needed to decide if an arbitrary example in S
belongs to P or N is defined as

$$I(p, n) = -\frac{p}{p+n}\log_2\frac{p}{p+n} - \frac{n}{p+n}\log_2\frac{n}{p+n}$$

Information Gain in Decision Tree Induction
• Assume that using attribute A a set S will be partitioned into sets {S1, S2, …, Sv} (where v is the number of distinct values of A)
– If Si contains pi examples of P and ni examples of N, the entropy, or the expected
information needed to classify objects in all subtrees Si is
 p n
E ( A)   i i I ( pi , ni )
i 1 p  n

• The encoding information that would be gained by branching on A


$$Gain(A) = I(p, n) - E(A)$$

Attribute Selection: Information Gain
 Class P: buys_computer = “yes” (9 tuples)
 Class N: buys_computer = “no” (5 tuples)

$$Info(D) = I(9,5) = -\frac{9}{14}\log_2\frac{9}{14} - \frac{5}{14}\log_2\frac{5}{14} = 0.940$$

age     pi  ni  I(pi, ni)
<=30    2   3   0.971
31…40   4   0   0
>40     3   2   0.971

$$Info_{age}(D) = \frac{5}{14} I(2,3) + \frac{4}{14} I(4,0) + \frac{5}{14} I(3,2) = 0.694$$

Here $\frac{5}{14} I(2,3)$ means “age <=30” has 5 out of 14 samples, with 2 yes’es and 3 no’s. Hence

$$Gain(age) = Info(D) - Info_{age}(D) = 0.246$$

Similarly,

Gain(income) = 0.029
Gain(student) = 0.151
Gain(credit_rating) = 0.048

(all computed over the buys_computer training data shown earlier)
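These numbers can be checked with a short script. A minimal Python sketch (not part of the original slides; the age counts are hard-coded from the table above):

from math import log2

def info(p, n):
    """I(p, n): expected information to classify a set with p P-examples and n N-examples."""
    if p == 0 or n == 0:
        return 0.0
    t = p + n
    return -(p / t) * log2(p / t) - (n / t) * log2(n / t)

def gain(partitions, p, n):
    """Gain(A) = I(p, n) - E(A), where partitions = [(p_i, n_i), ...] for attribute A."""
    e = sum((pi + ni) / (p + n) * info(pi, ni) for pi, ni in partitions)
    return info(p, n) - e

# buys_computer: 9 yes / 5 no; age splits into (2,3), (4,0), (3,2)
print(round(gain([(2, 3), (4, 0), (3, 2)], 9, 5), 3))  # 0.247 (the slide's 0.246 rounds intermediate values)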
Gini Index (IBM IntelligentMiner)

• If a data set T contains examples from n classes, the gini index, gini(T), is defined as

$$gini(T) = 1 - \sum_{j=1}^{n} p_j^2$$

where pj is the relative frequency of class j in T.

• If a data set T is split into two subsets T1 and T2 with sizes N1 and N2 respectively, the gini index of the split data is defined as

$$gini_{split}(T) = \frac{N_1}{N}\, gini(T_1) + \frac{N_2}{N}\, gini(T_2)$$

• The attribute that provides the smallest gini_split(T) is chosen to split the node (need to enumerate all possible splitting points for each attribute).

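A minimal sketch of the computation. The example split is mine, not from the slides: splitting the 14-tuple buys_computer data on income in {low, medium} vs. {high} gives subsets with class counts (7 yes, 3 no) and (2 yes, 2 no):

def gini(counts):
    """gini(T) = 1 - sum_j p_j^2, for the class counts in T."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

def gini_split(counts1, counts2):
    """Weighted gini index of a binary split T -> T1, T2."""
    n1, n2 = sum(counts1), sum(counts2)
    return n1 / (n1 + n2) * gini(counts1) + n2 / (n1 + n2) * gini(counts2)

print(round(gini_split([7, 3], [2, 2]), 3))  # 0.443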
Enhancements to Basic Decision Tree Induction

• Allow for continuous-valued attributes


– Dynamically define new discrete-valued attributes that partition the
continuous attribute value into a discrete set of intervals
• Handle missing attribute values
– Assign the most common value of the attribute
– Assign probability to each of the possible values
• Attribute construction
– Create new attributes based on existing ones that are sparsely
represented
– This reduces fragmentation, repetition, and replication
Bayesian Classification: why?
• A statistical classifier: performs probabilistic prediction, i.e., predicts class
membership probabilities
• Foundation: Based on Bayes’ Theorem.
• Performance: A simple Bayesian classifier, naïve Bayesian classifier, has comparable
performance with decision tree and selected neural network classifiers
• Incremental: Each training example can incrementally increase/decrease the
probability that a hypothesis is correct — prior knowledge can be combined with
observed data
• Standard: Even when Bayesian methods are computationally intractable, they can
provide a standard of optimal decision making against which other methods can be
measured

Bayes’ Theorem: Basics
• Total probability theorem:

$$P(B) = \sum_{i=1}^{M} P(B \mid A_i)\, P(A_i)$$

• Bayes’ theorem:

$$P(H \mid X) = \frac{P(X \mid H)\, P(H)}{P(X)}$$
– Let X be a data sample (“evidence”): class label is unknown
– Let H be a hypothesis that X belongs to class C
– Classification is to determine P(H|X), (i.e., posteriori probability): the probability that the
hypothesis holds given the observed data sample X
– P(H) (prior probability): the initial probability
• E.g., X will buy computer, regardless of age, income, …
– P(X): probability that sample data is observed
– P(X|H) (likelihood): the probability of observing the sample X, given that the hypothesis holds
• E.g., Given that X will buy computer, the prob. that X is 31..40, medium
income
Prediction Based on Bayes’ Theorem

• Given training data X, the posteriori probability of a hypothesis H, P(H|X), follows Bayes’ theorem:

$$P(H \mid X) = \frac{P(X \mid H)\, P(H)}{P(X)}$$
• Informally, this can be viewed as
posteriori = likelihood x prior/evidence
• Predicts X belongs to Ci iff the probability P(Ci|X) is the highest among all the
P(Ck|X) for all the k classes
• Practical difficulty: It requires initial knowledge of many probabilities, involving
significant computational cost
Classification is to Derive the Maximum Posteriori
• Let D be a training set of tuples and their associated class labels, and each tuple is
represented by an n-D attribute vector X = (x1, x2, …, xn)
• Suppose there are m classes C1, C2, …, Cm.
• Classification is to derive the maximum posteriori, i.e., the maximal P(Ci|X)
• This can be derived from Bayes’ theorem:

$$P(C_i \mid X) = \frac{P(X \mid C_i)\, P(C_i)}{P(X)}$$

• Since P(X) is constant for all classes, only $P(X \mid C_i)\, P(C_i)$ needs to be maximized

Naïve Bayes Classifier

• A simplified assumption: attributes are conditionally independent (i.e., no dependence relation between attributes):

$$P(X \mid C_i) = P(x_1, \ldots, x_n \mid C_i) = \prod_{k=1}^{n} P(x_k \mid C_i)$$

• This greatly reduces the computation cost: Only counts the class distribution
• If Ak is categorical, P(xk|Ci) is the # of tuples in Ci having value xk for Ak divided by |Ci, D| (#
of tuples of Ci in D)

• If Ak is continuous-valued, P(xk|Ci) is usually computed based on a Gaussian distribution with mean μ and standard deviation σ:

$$g(x, \mu, \sigma) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$

and P(xk|Ci) is

$$P(x_k \mid C_i) = g(x_k, \mu_{C_i}, \sigma_{C_i})$$
Naïve Bayes Classifier: Training Dataset

Class:
C1: buys_computer = ‘yes’
C2: buys_computer = ‘no’

Training data: the buys_computer data set shown earlier (14 tuples over age, income, student, credit_rating).

Data to be classified:
X = (age <=30, income = medium, student = yes, credit_rating = fair)
Naïve Bayes Classifier: An Example
• P(Ci):
P(buys_computer = “yes”) = 9/14 = 0.643
P(buys_computer = “no”) = 5/14 = 0.357

• Compute P(X|Ci) for each class:
P(age = “<=30” | buys_computer = “yes”) = 2/9 = 0.222
P(age = “<=30” | buys_computer = “no”) = 3/5 = 0.6
P(income = “medium” | buys_computer = “yes”) = 4/9 = 0.444
P(income = “medium” | buys_computer = “no”) = 2/5 = 0.4
P(student = “yes” | buys_computer = “yes”) = 6/9 = 0.667
P(student = “yes” | buys_computer = “no”) = 1/5 = 0.2
P(credit_rating = “fair” | buys_computer = “yes”) = 6/9 = 0.667
P(credit_rating = “fair” | buys_computer = “no”) = 2/5 = 0.4

• X = (age <=30, income = medium, student = yes, credit_rating = fair)
P(X|buys_computer = “yes”) = 0.222 x 0.444 x 0.667 x 0.667 = 0.044
P(X|buys_computer = “no”) = 0.6 x 0.4 x 0.2 x 0.4 = 0.019
P(X|buys_computer = “yes”) * P(buys_computer = “yes”) = 0.044 x 0.643 = 0.028
P(X|buys_computer = “no”) * P(buys_computer = “no”) = 0.019 x 0.357 = 0.007
Therefore, X belongs to class buys_computer = “yes”.
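The whole computation fits in a few lines of Python. A minimal sketch (not from the slides) that reproduces the numbers above from the buys_computer table, without Laplacian correction:

from collections import Counter

# (age, income, student, credit_rating, buys_computer)
data = [
    ("<=30", "high", "no", "fair", "no"), ("<=30", "high", "no", "excellent", "no"),
    ("31…40", "high", "no", "fair", "yes"), (">40", "medium", "no", "fair", "yes"),
    (">40", "low", "yes", "fair", "yes"), (">40", "low", "yes", "excellent", "no"),
    ("31…40", "low", "yes", "excellent", "yes"), ("<=30", "medium", "no", "fair", "no"),
    ("<=30", "low", "yes", "fair", "yes"), (">40", "medium", "yes", "fair", "yes"),
    ("<=30", "medium", "yes", "excellent", "yes"), ("31…40", "medium", "no", "excellent", "yes"),
    ("31…40", "high", "yes", "fair", "yes"), (">40", "medium", "no", "excellent", "no"),
]
X = ("<=30", "medium", "yes", "fair")

class_counts = Counter(row[-1] for row in data)
scores = {}
for c, nc in class_counts.items():
    score = nc / len(data)                                  # prior P(Ci)
    for k, xk in enumerate(X):                              # times each P(xk|Ci)
        score *= sum(1 for r in data if r[k] == xk and r[-1] == c) / nc
    scores[c] = score
print(scores)                        # {'yes': 0.028..., 'no': 0.006...}
print(max(scores, key=scores.get))   # -> 'yes'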
Avoiding the Zero-Probability Problem
• Naïve Bayesian prediction requires each conditional prob. be non-zero. Otherwise, the
predicted prob. will be zero
P(X|Ci)=P(x1,…,xk|C) = P(x1|C)·…·P(xk|C)
• Ex. Suppose a dataset with 1000 tuples: income = low (0 tuples), income = medium (990), and income = high (10)
• Use Laplacian correction (or Laplacian estimator)
– Adding 1 to each case
Prob(income = low) = 1/1003
Prob(income = medium) = 991/1003
Prob(income = high) = 11/1003
– The “corrected” prob. estimates are close to their “uncorrected” counterparts
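A one-function sketch of the correction for this example:

def laplace(value_counts):
    """Laplacian correction: add 1 to each value's count before estimating probabilities."""
    total = sum(value_counts.values()) + len(value_counts)
    return {v: (c + 1) / total for v, c in value_counts.items()}

print(laplace({"low": 0, "medium": 990, "high": 10}))
# {'low': 1/1003, 'medium': 991/1003, 'high': 11/1003}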
Naïve Bayes Classifier: Comments
• Advantages
– Easy to implement; the independence assumption makes the computation feasible
– Good results obtained in most of the cases
• Disadvantages
– Assumption: class conditional independence, therefore loss of accuracy
– Practically, dependencies exist among variables
• E.g., hospital patient data: profile (age, family history, etc.), symptoms (fever, cough, etc.), disease (lung cancer, diabetes, etc.)
• Dependencies among these cannot be modeled by a Naïve Bayes classifier
• Solution:
– Bayesian network
– Decision tree
Bayesian Belief Networks (I)
Figure: a belief network over the variables FamilyHistory, Smoker, LungCancer, Emphysema, PositiveXRay, and Dyspnea, with FamilyHistory (FH) and Smoker (S) as the parents of LungCancer (LC).

The conditional probability table (CPT) for the variable LungCancer:

       (FH, S)  (FH, ~S)  (~FH, S)  (~FH, ~S)
LC     0.8      0.5       0.7       0.1
~LC    0.2      0.5       0.3       0.9
Bayesian Belief Networks (II)
• A Bayesian belief network allows a subset of the variables to be conditionally independent
• A graphical model of causal relationships
• Several cases of learning Bayesian belief networks
– Given both network structure and all the variables
– Given network structure but only some variables
– When the network structure is not known in advance

Rule-Based Classification: IF-THEN Rules (1)
• Represent the knowledge in the form of IF-THEN rules
R1: IF age = youth AND student = yes THEN buys_computer = yes
– Rule antecedent/precondition vs. rule consequent
• Assessment of a rule: coverage and accuracy
– ncovers = # of tuples covered by R
– ncorrect = # of tuples correctly classified by R
– coverage(R) = ncovers /|D| /* D: training data set */
– accuracy(R) = ncorrect / ncovers
• R1 covers 2 of the 14 tuples: coverage(R1) = 2/14 = 14.28% and accuracy(R1) = 2/2 = 100%.
• Example: Let X = (age = youth, income = medium, student = yes, credit_rating = fair).
• X satisfies R1, which triggers the rule, so X is assigned the class label buys_computer = yes.

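A minimal sketch of the two assessments, on an illustrative 14-tuple data set built so that exactly two tuples satisfy R1 (the tuple values are made up to match the slide's counts):

def coverage_accuracy(antecedent, consequent, data):
    """coverage(R) = n_covers / |D|; accuracy(R) = n_correct / n_covers."""
    covered = [(x, y) for x, y in data if antecedent(x)]
    n_correct = sum(1 for x, y in covered if y == consequent)
    return len(covered) / len(data), n_correct / len(covered)

data = [({"age": "youth", "student": "yes"}, "yes")] * 2 \
     + [({"age": "senior", "student": "no"}, "no")] * 12
cov, acc = coverage_accuracy(
    lambda x: x["age"] == "youth" and x["student"] == "yes", "yes", data)
print(f"coverage = {cov:.2%}, accuracy = {acc:.0%}")  # 14.29%, 100%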
Rule-Based Classification: IF-THEN Rules (2)

• If more than one rule is triggered, we need conflict resolution
– Size ordering: assign the highest priority to the triggering rule that has the “toughest” requirement (i.e., with the most attribute tests)
– Class-based ordering: decreasing order of prevalence or misclassification cost per class
– Rule-based ordering (decision list): rules are organized into one long priority list, according to some measure of rule quality or by experts

Rule Extraction from a Decision Tree

age?
 Rules are easier to understand than large trees
 One rule is created for each path from the root to <=30 31..40 >40

a leaf student? credit rating?


yes
 Each attribute-value pair along a path forms a excellent fair
no yes
conjunction: the leaf holds the class prediction
no yes no yes
 Rules are mutually exclusive and exhaustive

• Example: Rule extraction from our buys_computer decision-tree


IF age = young AND student = no THEN buys_computer = no
IF age = young AND student = yes THEN buys_computer = yes
IF age = mid-age THEN buys_computer = yes
IF age = old AND credit_rating = excellent THEN buys_computer = no
IF age = old AND credit_rating = fair THEN buys_computer = yes
Rule Induction: Sequential Covering Method
• Sequential covering algorithm: Extracts rules directly from training data
• Typical sequential covering algorithms: FOIL, AQ, CN2, RIPPER
• Rules are learned sequentially, each for a given class Ci will cover many tuples of Ci
but none (or few) of the tuples of other classes
• Steps:
– Rules are learned one at a time
– Each time a rule is learned, the tuples covered by the rules are removed
– Repeat the process on the remaining tuples until termination condition, e.g., when no more
training examples or when the quality of a rule returned is below a user-specified threshold
• Compare with decision-tree induction: learning a set of rules simultaneously

Sequential Covering Algorithm
while (enough target tuples left)
    generate a rule
    remove positive target tuples satisfying this rule

Figure: each learned rule covers a region of the positive examples; Rules 1, 2, and 3 cover successive subsets.

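A skeleton of this loop in Python (a sketch only: learn_one_rule is left abstract, and min_quality plays the role of the user-specified quality threshold):

def sequential_covering(data, target_class, learn_one_rule, min_quality=0.0):
    """Learn rules for target_class one at a time, removing covered positive tuples."""
    rules, remaining = [], list(data)
    while any(y == target_class for _, y in remaining):      # enough target tuples left
        rule, quality = learn_one_rule(remaining, target_class)
        if rule is None or quality < min_quality:            # termination condition
            break
        rules.append(rule)
        remaining = [(x, y) for x, y in remaining            # remove covered positives
                     if not (rule(x) and y == target_class)]
    return rules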
Rule Generation
• To generate a rule:

while (true)
    find the best predicate p
    if foil-gain(p) > threshold then add p to current rule
    else break

Figure: the current rule grows from A3=1 to A3=1 && A1=2 to A3=1 && A1=2 && A8=5, covering fewer negative examples at each step.
How to Learn-One-Rule?
• Start with the most general rule possible: condition = empty
• Adding new attributes by adopting a greedy depth-first strategy
– Picks the one that most improves the rule quality
• Rule-Quality measures: consider both coverage and accuracy
– FOIL-gain (in FOIL & RIPPER): assesses the information gained by extending the condition of rule R to rule R’:

$$FOIL\_Gain = pos' \times \left( \log_2 \frac{pos'}{pos' + neg'} - \log_2 \frac{pos}{pos + neg} \right)$$

• favors rules that have high accuracy and cover many positive tuples
• Rule pruning based on an independent set of test tuples:

$$FOIL\_Prune(R) = \frac{pos - neg}{pos + neg}$$

where pos and neg are the # of positive/negative tuples covered by R. If FOIL_Prune is higher for the pruned version of R, prune R.
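A minimal sketch of both measures, with made-up counts for illustration:

from math import log2

def foil_gain(pos, neg, pos2, neg2):
    """Gain from extending rule R (covering pos/neg tuples) to R' (covering pos2/neg2)."""
    if pos2 == 0:
        return float("-inf")
    return pos2 * (log2(pos2 / (pos2 + neg2)) - log2(pos / (pos + neg)))

def foil_prune(pos, neg):
    """FOIL_Prune(R) = (pos - neg) / (pos + neg), on an independent pruning set."""
    return (pos - neg) / (pos + neg)

# e.g., R covers 10+/10-; adding a predicate narrows it to 8+/2-
print(round(foil_gain(10, 10, 8, 2), 3))  # 8 * (log2(0.8) - log2(0.5)) ≈ 5.425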
The k-Nearest Neighbor Algorithm
• All instances correspond to points in the n-D space.
• The nearest neighbors are defined in terms of Euclidean distance.
• The target function could be discrete- or real-valued.
• For discrete-valued targets, k-NN returns the most common value among the k training examples nearest to xq.
• Voronoi diagram: the decision surface induced by 1-NN for a typical set of training examples.

Figure: positive (+) and negative (-) training points scattered around a query point xq.
Discussion on the k-NN Algorithm
• The k-NN algorithm for continuous-valued target functions
– Calculate the mean values of the k nearest neighbors
• Distance-weighted nearest neighbor algorithm
– Weight the contribution of each of the k neighbors according to its distance to the query point xq:

$$w \equiv \frac{1}{d(x_q, x_i)^2}$$

• giving greater weight to closer neighbors
– Similarly, for real-valued target functions
• Robust to noisy data by averaging k-nearest neighbors
• Curse of dimensionality: distance between neighbors could be dominated by
irrelevant attributes.
– To overcome it, stretch the axes or eliminate the least relevant attributes.

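A minimal distance-weighted k-NN sketch for discrete-valued targets (the toy points are mine, not from the slides):

from collections import defaultdict
from math import dist  # Euclidean distance, Python >= 3.8

def knn_predict(xq, data, k=3):
    """Distance-weighted k-NN vote with w = 1 / d(xq, xi)^2."""
    neighbors = sorted(data, key=lambda p: dist(xq, p[0]))[:k]
    votes = defaultdict(float)
    for xi, yi in neighbors:
        d = dist(xq, xi)
        votes[yi] += float("inf") if d == 0 else 1.0 / d ** 2
    return max(votes, key=votes.get)

data = [((0, 0), "-"), ((1, 0), "-"), ((3, 3), "+"), ((4, 3), "+"), ((3, 4), "+")]
print(knn_predict((3.5, 3.5), data))  # -> '+'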
What Is Prediction?
• Prediction is similar to classification
– First, construct a model
– Second, use model to predict unknown value
• Major method for prediction is regression
– Linear and multiple regression
– Non-linear regression
• Prediction is different from classification
– Classification predicts categorical class labels
– Prediction models continuous-valued functions

Predictive Modeling in Databases
• Predictive modeling: Predict data values or construct generalized linear
models based on the database data.
• One can only predict value ranges or category distributions
• Method outline:
– Minimal generalization
– Attribute relevance analysis
– Generalized linear model construction
– Prediction
• Determine the major factors which influence the prediction
– Data relevance analysis: uncertainty measurement, entropy analysis,
expert judgement, etc.
• Multi-level prediction: drill-down and roll-up analysis
Regression Analysis and Log-Linear Models in Prediction

• Linear regression: Y = α + βX
– Two parameters, α and β, specify the line and are to be estimated by using the data at hand
– using the least squares criterion on the known values of Y1, Y2, …, X1, X2, …
• Multiple regression: Y = b0 + b1X1 + b2X2
– Many nonlinear functions can be transformed into the above
• Log-linear models:
– The multi-way table of joint probabilities is approximated by a product of lower-order tables
– Probability: $p(a, b, c, d) = \alpha_{ab}\, \beta_{ac}\, \chi_{ad}\, \delta_{bcd}$

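A least-squares sketch for the simple linear case (the salary-vs-experience numbers are illustrative, not from the slides):

def linear_regression(xs, ys):
    """Least-squares estimates of alpha and beta for Y = alpha + beta * X."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    beta = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
            / sum((x - mx) ** 2 for x in xs))
    return my - beta * mx, beta

xs = [3, 8, 9, 13, 3, 6, 11, 21, 1, 16]        # years of experience
ys = [30, 57, 64, 72, 36, 43, 59, 90, 20, 83]  # salary (in $1000s)
alpha, beta = linear_regression(xs, ys)
print(round(alpha, 1), round(beta, 2))  # 23.2 3.54, i.e., Y ≈ 23.2 + 3.5 X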
Model Evaluation and Selection
• Evaluation metrics: How can we measure accuracy? Other metrics to consider?
• Use validation test set of class-labeled tuples instead of training set when
assessing accuracy
• Methods for estimating a classifier’s accuracy:
– Holdout method, random subsampling
– Cross-validation
– Bootstrap
• Comparing classifiers:
– Confidence intervals
– Cost-benefit analysis and ROC Curves

Classifier Evaluation Metrics: Confusion Matrix
Confusion Matrix:
Actual class\Predicted class C1 ¬ C1
C1 True Positives (TP) False Negatives (FN)
¬ C1 False Positives (FP) True Negatives (TN)

Example of Confusion Matrix:

Actual class\Predicted class   buy_computer = yes   buy_computer = no   Total
buy_computer = yes             6954                 46                  7000
buy_computer = no              412                  2588                3000
Total                          7366                 2634                10000

• Given m classes, an entry CM(i,j) in a confusion matrix indicates the # of tuples in class i that were labeled by the classifier as class j
• Extra rows/columns may be added to provide totals
Classifier Evaluation Metrics: Accuracy, Error Rate, Sensitivity
and Specificity
A\P   C    ¬C
C     TP   FN   P
¬C    FP   TN   N
      P’   N’   All

• Classifier accuracy, or recognition rate: percentage of test set tuples that are correctly classified
  Accuracy = (TP + TN)/All
• Error rate: 1 – accuracy, or Error rate = (FP + FN)/All

 Class imbalance problem:
 One class may be rare, e.g., fraud or HIV-positive
 Significant majority of the negative class and minority of the positive class
 Sensitivity: true positive recognition rate, Sensitivity = TP/P
 Specificity: true negative recognition rate, Specificity = TN/N
Classifier Evaluation Metrics: Precision and Recall,
and F-measures
• Precision: exactness – what % of tuples that the classifier labeled as positive are actually positive?
  Precision = TP / (TP + FP)
• Recall: completeness – what % of positive tuples did the classifier label as positive?
  Recall = TP / (TP + FN)
• Perfect score is 1.0
• Inverse relationship between precision and recall
• F measure (F1 or F-score): harmonic mean of precision and recall
  F = 2 × precision × recall / (precision + recall)
• Fβ: weighted measure of precision and recall
  Fβ = (1 + β²) × precision × recall / (β² × precision + recall)
– assigns β times as much weight to recall as to precision
Classifier Evaluation Metrics: Example

Actual class\Predicted class   cancer = yes   cancer = no   Total   Recognition (%)
cancer = yes                   90             210           300     30.00 (sensitivity)
cancer = no                    140            9560          9700    98.56 (specificity)
Total                          230            9770          10000   96.50 (accuracy)

– Precision = 90/230 = 39.13%; Recall = 90/300 = 30.00%

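All of these metrics follow from the four confusion-matrix cells. A sketch that reproduces the cancer example:

def metrics(tp, fn, fp, tn):
    """Accuracy, sensitivity/recall, specificity, precision, and F1 from TP/FN/FP/TN."""
    p, n = tp + fn, fp + tn
    precision, recall = tp / (tp + fp), tp / p
    return {
        "accuracy": (tp + tn) / (p + n),
        "sensitivity (recall)": recall,
        "specificity": tn / n,
        "precision": precision,
        "F1": 2 * precision * recall / (precision + recall),
    }

for name, value in metrics(tp=90, fn=210, fp=140, tn=9560).items():
    print(f"{name}: {value:.2%}")
# precision ≈ 39.13%, recall = 30.00%, accuracy = 96.50%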
Evaluating Classifier Accuracy: Holdout & Cross-Validation
Methods
• Holdout method
– Given data is randomly partitioned into two independent sets
• Training set (e.g., 2/3) for model construction
• Test set (e.g., 1/3) for accuracy estimation
– Random sampling: a variation of holdout
• Repeat holdout k times, accuracy = avg. of the accuracies obtained
• Cross-validation (k-fold, where k = 10 is most popular)
– Randomly partition the data into k mutually exclusive subsets, each
approximately equal size
– At i-th iteration, use Di as test set and others as training set
– Leave-one-out: k folds where k = # of tuples, for small sized data
– Stratified cross-validation: folds are stratified so that class dist. in each fold is
approx. the same as that in the initial data
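A sketch of the partitioning step for k-fold cross-validation (plain Python, unstratified):

import random

def k_fold_indices(n, k=10, seed=1):
    """Randomly partition n tuple indices into k mutually exclusive, near-equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

folds = k_fold_indices(n=14, k=7)
for i, test_fold in enumerate(folds):
    # at the i-th iteration, fold i is the test set and the rest is the training set
    train = [j for fold in folds if fold is not test_fold for j in fold]
    print(f"iteration {i}: test = {sorted(test_fold)}, |train| = {len(train)}")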
Evaluating Classifier Accuracy: Bootstrap
• Bootstrap
– Works well with small data sets
– Samples the given training tuples uniformly with replacement
• i.e., each time a tuple is selected, it is equally likely to be selected again and re-added to the
training set
• Several bootstrap methods; a common one is the .632 bootstrap
– A data set with d tuples is sampled d times, with replacement, resulting in a training set of d samples. The data tuples that did not make it into the training set end up forming the test set. About 63.2% of the original data end up in the bootstrap, and the remaining 36.8% form the test set (since (1 – 1/d)^d ≈ e^{-1} = 0.368)
– Repeat the sampling procedure k times; the overall accuracy combines, over the k iterations, 0.632 × Acc(Mi) on the test set with 0.368 × Acc(Mi) on the training set
Ensemble Methods: Increasing the Accuracy
• Ensemble methods
– Use a combination of models to increase
accuracy
– Combine a series of k learned models, M1, M2,
…, Mk, with the aim of creating an improved
model M*
• Popular ensemble methods
– Bagging: averaging the prediction over a
collection of classifiers
– Boosting: weighted vote with a collection of
classifiers
– Ensemble: combining a set of heterogeneous
classifiers
Bagging: Bootstrap Aggregation
• Analogy: Diagnosis based on multiple doctors’ majority vote
• Training
– Given a set D of d tuples, at each iteration i, a training set Di of d tuples is sampled with
replacement from D (i.e., bootstrap)
– A classifier model Mi is learned for each training set Di
• Classification: classify an unknown sample X
– Each classifier Mi returns its class prediction
– The bagged classifier M* counts the votes and assigns the class with the most votes to X
• Prediction: can be applied to the prediction of continuous values by taking the average value of each
prediction for a given test tuple
• Accuracy
– Often significantly better than a single classifier derived from D
– For noisy data: not considerably worse, more robust
– Proved improved accuracy in prediction
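A sketch of the training and voting loops (learn is any procedure that returns a classifier function; both the learner and the data format are assumptions of this sketch):

import random
from collections import Counter

def bagging_train(D, learn, k=10, seed=0):
    """Train k models, each on a bootstrap sample of D (d tuples drawn with replacement)."""
    rng = random.Random(seed)
    return [learn([rng.choice(D) for _ in range(len(D))]) for _ in range(k)]

def bagging_classify(models, x):
    """Each model votes; the class with the most votes wins."""
    return Counter(m(x) for m in models).most_common(1)[0][0]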
Boosting
• Analogy: Consult several doctors, based on a combination of weighted diagnoses—
weight assigned based on the previous diagnosis accuracy
• How does boosting work?
– Weights are assigned to each training tuple
– A series of k classifiers is iteratively learned
– After a classifier Mi is learned, the weights are updated to allow the subsequent classifier, Mi+1, to
pay more attention to the training tuples that were misclassified by Mi
– The final M* combines the votes of each individual classifier, where the weight of each classifier's
vote is a function of its accuracy
• Boosting algorithm can be extended for numeric prediction
• Comparing with bagging: Boosting tends to have greater accuracy, but it also risks
overfitting the model to misclassified data

Adaboost (Freund and Schapire, 1997)
• Given a set of d class-labeled tuples, (X1, y1), …, (Xd, yd)
• Initially, all the weights of tuples are set the same (1/d)
• Generate k classifiers in k rounds. At round i,
– Tuples from D are sampled (with replacement) to form a training set Di of the same size
– Each tuple’s chance of being selected is based on its weight
– A classification model Mi is derived from Di
– Its error rate is calculated using Di as a test set
– If a tuple is misclassified, its weight is increased, otherwise it is decreased
• Error rate: err(Xj) is the misclassification error of tuple Xj. Classifier Mi’s error rate is the weighted sum over the misclassified tuples:

$$error(M_i) = \sum_{j=1}^{d} w_j \times err(X_j)$$

• The weight of classifier Mi’s vote is

$$\log \frac{1 - error(M_i)}{error(M_i)}$$
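A sketch of the two weight computations (the example round and its 10% weighted error are made up for illustration):

from math import log

def vote_weight(error):
    """Weight of classifier Mi's vote: log((1 - error(Mi)) / error(Mi))."""
    return log((1 - error) / error)

def reweight(weights, misclassified, error):
    """Scale correctly classified tuples by error/(1 - error), then normalize,
    which effectively increases the weights of misclassified tuples."""
    new = [w if miss else w * error / (1 - error)
           for w, miss in zip(weights, misclassified)]
    total = sum(new)
    return [w / total for w in new]

w = reweight([0.1] * 10, [True] + [False] * 9, error=0.1)
print([round(x, 3) for x in w])    # [0.5, 0.056, ...]: the misclassified tuple now dominates
print(round(vote_weight(0.1), 3))  # 2.197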
Random Forest (Breiman 2001)
• Random Forest:
– Each classifier in the ensemble is a decision tree classifier and is generated using a random
selection of attributes at each node to determine the split
– During classification, each tree votes and the most popular class is returned
• Two Methods to construct Random Forest:
– Forest-RI (random input selection): Randomly select, at each node, F attributes as candidates for
the split at the node. The CART methodology is used to grow the trees to maximum size
– Forest-RC (random linear combinations): Creates new attributes (or features) that are a linear
combination of the existing attributes (reduces the correlation between individual classifiers)
• Comparable in accuracy to Adaboost, but more robust to errors and outliers
• Insensitive to the number of attributes selected for consideration at each split, and faster than
bagging or boosting

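For practical use, scikit-learn’s RandomForestClassifier implements Forest-RI-style attribute selection. A usage sketch (assumes scikit-learn is installed, and uses its bundled iris data purely for illustration):

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
# n_estimators: number of trees; max_features: F candidate attributes per split
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=0)
forest.fit(X, y)
print(forest.predict(X[:3]))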
Summary
• Classification is a form of data analysis that extracts models describing important data
classes.
• Effective and scalable methods have been developed for decision tree induction, Naive
Bayesian classification, rule-based classification, and many other classification methods.
• Evaluation metrics include: accuracy, sensitivity, specificity, precision, recall, F measure,
and Fß measure.
• Stratified k-fold cross-validation is recommended for accuracy estimation. Bagging and
boosting can be used to increase overall accuracy by learning and combining a series of
individual models.

