
Chapter 6. Classification and Prediction

 What is classification? What is prediction?
 Issues regarding classification and prediction
 Classification by decision tree induction
 Bayesian classification
 Rule-based classification
 Prediction
 Accuracy and error measures
 Summary


Classification vs. Prediction

 Classification
 predicts categorical class labels (discrete or nominal)
 classifies data (constructs a model) based on the training set and the values (class labels) in a classifying attribute, and uses it in classifying new data
 Prediction
 models continuous-valued functions, i.e., predicts unknown or missing values
 Typical applications
 Credit approval
 Target marketing
 Medical diagnosis
 Fraud detection


Classification—A Two-Step Process

 Model construction: describing a set of predetermined classes
 Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute
 The set of tuples used for model construction is the training set
 The model is represented as classification rules, decision trees, or mathematical formulae
 Model usage: for classifying future or unknown objects
 Estimate the accuracy of the model
 The known label of each test sample is compared with the classified result from the model
 Accuracy rate is the percentage of test set samples that are correctly classified by the model
 The test set is independent of the training set, otherwise over-fitting will occur
 If the accuracy is acceptable, use the model to classify data tuples whose class labels are not known
Process (1): Model Construction

Training data is fed to a classification algorithm, which constructs the classifier (model).

NAME  RANK            YEARS  TENURED
Mike  Assistant Prof  3      no
Mary  Assistant Prof  7      yes
Bill  Professor       2      yes
Jim   Associate Prof  7      yes
Dave  Assistant Prof  6      no
Anne  Associate Prof  3      no

Resulting model:
IF rank = 'professor' OR years > 6 THEN tenured = 'yes'
Process (2): Using the Model in Prediction

The classifier is first evaluated on testing data, then applied to unseen data.

NAME     RANK            YEARS  TENURED
Tom      Assistant Prof  2      no
Merlisa  Associate Prof  7      no
George   Professor       5      yes
Joseph   Assistant Prof  7      yes

Unseen data: (Jeff, Professor, 4) → Tenured?
Supervised vs. Unsupervised Learning

 Supervised learning (classification)
 Supervision: the training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations
 New data is classified based on the training set
 Unsupervised learning (clustering)
 The class labels of training data are unknown
 Given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data
Issues: Data Preparation

 Data cleaning
 Preprocess data in order to reduce noise and handle
missing values
 Relevance analysis (feature selection)
 Remove the irrelevant or redundant attributes
 Data transformation
 Generalize and/or normalize data



Issues: Evaluating Classification Methods

 Accuracy
 classifier accuracy: predicting the class label
 predictor accuracy: estimating the value of the predicted attribute
 Speed
 time to construct the model (training time)
 time to use the model (classification/prediction time)
 Robustness: handling noise and missing values
 Scalability: efficiency in disk-resident databases
 Interpretability
 understanding and insight provided by the model
 Other measures, e.g., goodness of rules, such as decision tree size or compactness of classification rules
Decision Tree Induction: Training Dataset

This follows an example from Quinlan's ID3.

age    income  student  credit_rating  buys_computer
<=30   high    no       fair           no
<=30   high    no       excellent      no
31…40  high    no       fair           yes
>40    medium  no       fair           yes
>40    low     yes      fair           yes
>40    low     yes      excellent      no
31…40  low     yes      excellent      yes
<=30   medium  no       fair           no
<=30   low     yes      fair           yes
>40    medium  yes      fair           yes
<=30   medium  yes      excellent      yes
31…40  medium  no       excellent      yes
31…40  high    yes      fair           yes
>40    medium  no       excellent      no


Output: A Decision Tree for "buys_computer"

age?
├── <=30   → student?
│            ├── no  → no
│            └── yes → yes
├── 31..40 → yes
└── >40    → credit rating?
             ├── excellent → no
             └── fair      → yes


Algorithm for Decision Tree Induction
 Basic algorithm (a greedy algorithm)
 Tree is constructed in a top-down recursive divide-and-conquer
manner
 At start, all the training examples are at the root
 Attributes are categorical (if continuous-valued, they are
discretized in advance)
 Examples are partitioned recursively based on selected attributes
 Test attributes are selected on the basis of a heuristic or
statistical measure (e.g., information gain)
 Conditions for stopping partitioning
 All samples for a given node belong to the same class
 There are no remaining attributes for further partitioning –
majority voting is employed for classifying the leaf
 There are no samples left
Algorithm for Decision Tree Induction

 Algorithm: Generate_decision_tree. Generates a decision tree from the given data.
 Input: the training samples, represented by discrete-valued attributes; the set of candidate attributes, attribute_list.
 Output: a decision tree.
 Method:
(1)  create a node N;
(2)  if samples are all of the same class C then
(3)      return N as a leaf node labelled with the class C;
(4)  if attribute_list is empty then
(5)      return N as a leaf node labelled with the most common class in samples; // majority voting
(6)  select test_attribute, the attribute among attribute_list with the highest information gain;
(7)  label node N with test_attribute;
(8)  for each known value ai of test_attribute: // partition the samples
(9)      grow a branch from node N for the condition test_attribute = ai;
(10)     let si be the set of samples in samples for which test_attribute = ai;
(11)     if si is empty then
(12)         attach a leaf labelled with the most common class in samples;
(13)     else attach the node returned by Generate_decision_tree(si, attribute_list − test_attribute);
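A minimal Python sketch of this recursive procedure, assuming categorical attributes and training samples given as (feature-dict, label) pairs; the helper names are illustrative, not from the slides:

import math
from collections import Counter

def entropy(labels):
    # Info(D) = -sum over classes of p_i * log2(p_i)
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def generate_decision_tree(samples, attribute_list):
    # samples: list of (features_dict, class_label); returns a nested dict or a class label
    labels = [label for _, label in samples]
    if len(set(labels)) == 1:                        # steps (2)-(3): one class only
        return labels[0]
    if not attribute_list:                           # steps (4)-(5): majority voting
        return Counter(labels).most_common(1)[0][0]

    def info_after_split(attr):                      # Info_attr(D)
        total = len(samples)
        info = 0.0
        for v in {f[attr] for f, _ in samples}:
            part = [label for f, label in samples if f[attr] == v]
            info += (len(part) / total) * entropy(part)
        return info

    best = min(attribute_list, key=info_after_split)     # step (6): max gain = min Info_A(D)
    tree = {best: {}}
    remaining = [a for a in attribute_list if a != best]
    for v in {f[best] for f, _ in samples}:          # steps (8)-(10): partition on each value
        subset = [(f, label) for f, label in samples if f[best] == v]
        tree[best][v] = generate_decision_tree(subset, remaining)   # step (13)
    return tree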


Attribute Selection Measure: Information Gain (ID3/C4.5)

 Select the attribute with the highest information gain
 Let pi be the probability that an arbitrary tuple in D belongs to class Ci, estimated by |Ci,D|/|D|
 Expected information (entropy) needed to classify a tuple in D:

    Info(D) = -\sum_{i=1}^{m} p_i \log_2(p_i)

 Information needed (after using A to split D into v partitions) to classify D:

    Info_A(D) = \sum_{j=1}^{v} \frac{|D_j|}{|D|} \times Info(D_j)

 Information gained by branching on attribute A:

    Gain(A) = Info(D) - Info_A(D)
Attribute Selection: Information Gain

 Class P: buys_computer = "yes" (9 tuples)
 Class N: buys_computer = "no" (5 tuples)

    Info(D) = I(9,5) = -(9/14) \log_2(9/14) - (5/14) \log_2(5/14) = 0.940

 Splitting the training dataset above on age gives:

    age     p_i  n_i  I(p_i, n_i)
    <=30    2    3    0.971
    31…40   4    0    0
    >40     3    2    0.971

    Info_age(D) = (5/14) I(2,3) + (4/14) I(4,0) + (5/14) I(3,2) = 0.694

 Hence:

    Gain(age) = Info(D) - Info_age(D) = 0.246
    Gain(income) = 0.029
    Gain(student) = 0.151
    Gain(credit_rating) = 0.048
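These numbers can be checked with a few lines of Python (a sketch; I(·) is the expected-information helper defined by the formula above):

import math

def I(*counts):
    # Expected information I(p, n, ...) for the class distribution given by counts
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

info_D = I(9, 5)                                                   # ≈ 0.940
info_age = (5/14) * I(2, 3) + (4/14) * I(4, 0) + (5/14) * I(3, 2)  # ≈ 0.694
print(round(info_D, 3), round(info_D - info_age, 3))
# 0.94 0.247  (0.246 with the slides' intermediate rounding)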
Gini Index (CART, IBM IntelligentMiner)

 Ex. D has 9 tuples in buys_computer = "yes" and 5 in "no":

    gini(D) = 1 - (9/14)^2 - (5/14)^2 = 0.459

 Suppose the attribute income partitions D into 10 tuples in D1: {low, medium} and 4 in D2: {high}:

    gini_{income ∈ {low,medium}}(D) = (10/14) gini(D_1) + (4/14) gini(D_2) = 0.443

 The splits {medium, high} and {low, high} give 0.450 and 0.458 respectively, so {low, medium} (equivalently {high}) is the best income split since it minimizes the Gini index
 All attributes are assumed continuous-valued
 May need other tools, e.g., clustering, to get the possible split values
 Can be modified for categorical attributes
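A quick sketch of the same arithmetic; the yes/no counts per partition (7/3 for {low, medium}, 2/2 for {high}) are tallied from the training table:

def gini(*counts):
    # gini = 1 - sum of squared class proportions
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

gini_D = gini(9, 5)                                       # ≈ 0.459
gini_income_split = (10/14) * gini(7, 3) + (4/14) * gini(2, 2)
print(round(gini_D, 3), round(gini_income_split, 3))      # 0.459 0.443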


Overfitting and Tree Pruning

 Overfitting: an induced tree may overfit the training data
 Overfitting results in decision trees that are more complex than necessary
 Too many branches, some may reflect anomalies due to noise or outliers
 Poor accuracy for unseen samples
 Two approaches to avoid overfitting (the second, post-pruning, is on the next slide)
 Prepruning: halt tree construction early—do not split a node if this would result in the goodness measure falling below a threshold
 Difficult to choose an appropriate threshold


Post-pruning

 Trim the nodes of the decision tree in a bottom-up fashion
 If the generalization error improves after trimming, replace the sub-tree by a leaf node
 The class label of the leaf node is determined from the majority class of instances in the sub-tree


Classification in Large Databases

 Classification—a classical problem extensively studied by statisticians and machine learning researchers
 Scalability: classifying data sets with millions of examples and hundreds of attributes with reasonable speed
 Why decision tree induction in data mining?
 relatively faster learning speed (than other classification methods)
 convertible to simple and easy-to-understand classification rules
 can use SQL queries for accessing databases
 comparable classification accuracy with other methods


Scalable Decision Tree Induction Methods in Data Mining Studies

 SLIQ
 builds an index for each attribute; only the class list and the current attribute list reside in memory
 handles disk-resident data sets using disk-resident attribute lists and a memory-resident class list
 a memory restriction remains when the training set is too large: when the class list becomes too large, the performance of SLIQ decreases
 SPRINT
 constructs an attribute-list data structure
 removes all memory restrictions
 designed to be easily parallelized
Scalable Decision Tree Induction Methods in Data Mining Studies (cont.)

 PUBLIC
 integrates tree splitting and tree pruning: stops growing the tree earlier
 RainForest
 separates the scalability aspects from the criteria that determine the quality of the tree
 builds an AVC-list (attribute, value, class label)
 RainForest reports a speedup over SPRINT


Bayesian Classification: Why?
 A statistical classifier: performs probabilistic prediction,
i.e., predicts class membership probabilities
 Foundation: Based on Bayes’ Theorem.
 Performance: A simple Bayesian classifier, naïve Bayesian
classifier, has comparable performance with decision tree
and selected neural network classifiers
 Incremental: Each training example can incrementally
increase/decrease the probability that a hypothesis is
correct — prior knowledge can be combined with observed
data
 Standard: Even when Bayesian methods are
computationally intractable, they can provide a standard
of optimal decision making against which other methods
can be measured
Bayesian Theorem: Basics

 Let X be a data sample ("evidence"): class label is unknown
 Let H be a hypothesis that X belongs to class C
 Classification is to determine P(H|X), the probability that the hypothesis holds given the observed data sample X
 P(H) (prior probability): the initial probability
 E.g., X will buy a computer, regardless of age, income, …
 P(X): the probability that the sample data is observed
 P(X|H) (likelihood): the probability of observing the sample X, given that the hypothesis holds
 E.g., given that X will buy a computer, the probability that X is 31..40 with medium income
Bayesian Theorem

 Given training data X, the posteriori probability of a hypothesis H, P(H|X), follows Bayes' theorem:

    P(H|X) = \frac{P(X|H) P(H)}{P(X)}

 Informally: posteriori = likelihood × prior / evidence
 Predicts that X belongs to Ci iff the probability P(Ci|X) is the highest among all the P(Ck|X) for the k classes
 Practical difficulty: requires initial knowledge of many probabilities, at significant computational cost
Towards Naïve Bayesian Classifier

 Let D be a training set of tuples and their associated class labels, with each tuple represented by an n-D attribute vector X = (x1, x2, …, xn)
 Suppose there are m classes C1, C2, …, Cm
 Classification is to derive the maximum posteriori, i.e., the maximal P(Ci|X)
 This can be derived from Bayes' theorem:

    P(C_i|X) = \frac{P(X|C_i) P(C_i)}{P(X)}

 Since P(X) is constant for all classes, only P(X|C_i) P(C_i) needs to be maximized


Derivation of Naïve Bayes Classifier

 A simplified assumption: attributes are conditionally independent (i.e., no dependence relation between attributes):

    P(X|C_i) = \prod_{k=1}^{n} P(x_k|C_i) = P(x_1|C_i) \times P(x_2|C_i) \times … \times P(x_n|C_i)

 This greatly reduces the computation cost: only counts the class distribution
 If Ak is categorical, P(xk|Ci) is the # of tuples in Ci having value xk for Ak divided by |Ci,D| (# of tuples of Ci in D)
 If Ak is continuous-valued, P(xk|Ci) is usually computed from a Gaussian distribution with mean μ and standard deviation σ:

    g(x, \mu, \sigma) = \frac{1}{\sqrt{2\pi}\,\sigma} e^{-\frac{(x-\mu)^2}{2\sigma^2}}

 and P(x_k|C_i) = g(x_k, \mu_{C_i}, \sigma_{C_i})
Naïve Bayesian Classifier: Training Dataset

Training data: the buys_computer dataset shown earlier.

Classes:
 C1: buys_computer = 'yes'
 C2: buys_computer = 'no'

Data sample to classify:
 X = (age <= 30, income = medium, student = yes, credit_rating = fair)
Naïve Bayesian Classifier: An Example

 Priors P(Ci):
    P(buys_computer = "yes") = 9/14 = 0.643
    P(buys_computer = "no") = 5/14 = 0.357

 Compute P(X|Ci) for each class:
    P(age = "<=30" | buys_computer = "yes") = 2/9 = 0.222
    P(age = "<=30" | buys_computer = "no") = 3/5 = 0.6
    P(income = "medium" | buys_computer = "yes") = 4/9 = 0.444
    P(income = "medium" | buys_computer = "no") = 2/5 = 0.4
    P(student = "yes" | buys_computer = "yes") = 6/9 = 0.667
    P(student = "yes" | buys_computer = "no") = 1/5 = 0.2
    P(credit_rating = "fair" | buys_computer = "yes") = 6/9 = 0.667
    P(credit_rating = "fair" | buys_computer = "no") = 2/5 = 0.4

 X = (age <= 30, income = medium, student = yes, credit_rating = fair)

    P(X | buys_computer = "yes") = 0.222 × 0.444 × 0.667 × 0.667 = 0.044
    P(X | buys_computer = "no") = 0.6 × 0.4 × 0.2 × 0.4 = 0.019

    P(X | buys_computer = "yes") × P(buys_computer = "yes") = 0.028
    P(X | buys_computer = "no") × P(buys_computer = "no") = 0.007

 Therefore, X belongs to class "buys_computer = yes"
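A compact sketch reproducing this calculation by counting over the training tuples (transcribed from the table; function and variable names are illustrative):

from collections import Counter

# (age, income, student, credit_rating, buys_computer) transcribed from the slides
data = [
    ("<=30","high","no","fair","no"), ("<=30","high","no","excellent","no"),
    ("31…40","high","no","fair","yes"), (">40","medium","no","fair","yes"),
    (">40","low","yes","fair","yes"), (">40","low","yes","excellent","no"),
    ("31…40","low","yes","excellent","yes"), ("<=30","medium","no","fair","no"),
    ("<=30","low","yes","fair","yes"), (">40","medium","yes","fair","yes"),
    ("<=30","medium","yes","excellent","yes"), ("31…40","medium","no","excellent","yes"),
    ("31…40","high","yes","fair","yes"), (">40","medium","no","excellent","no"),
]

def naive_bayes(x):
    # Return the class maximizing P(X|Ci) * P(Ci) under the independence assumption
    class_counts = Counter(row[-1] for row in data)
    scores = {}
    for c, n_c in class_counts.items():
        rows = [row for row in data if row[-1] == c]
        p = n_c / len(data)                        # prior P(Ci)
        for k, value in enumerate(x):              # product of P(xk|Ci)
            p *= sum(1 for row in rows if row[k] == value) / n_c
        scores[c] = p
    return max(scores, key=scores.get), scores

print(naive_bayes(("<=30", "medium", "yes", "fair")))
# ('yes', {'no': ≈0.007, 'yes': ≈0.028})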
Example 2

Outlook   Temperature  Humidity  Windy  Class
sunny     hot          high      false  N
sunny     hot          high      true   N
overcast  hot          high      false  P
rain      mild         high      false  P
rain      cool         normal    false  P
rain      cool         normal    true   N
overcast  cool         normal    true   P
sunny     mild         high      false  N
sunny     cool         normal    false  P
rain      mild         normal    false  P
sunny     mild         normal    true   P
overcast  mild         high      true   P
overcast  hot          normal    false  P
rain      mild         high      true   N

An unseen sample: X = <rain, hot, high, false>
Play-tennis example: estimating P(xi|C)

Priors: P(p) = 9/14, P(n) = 5/14

outlook:      P(sunny|p) = 2/9     P(sunny|n) = 3/5
              P(overcast|p) = 4/9  P(overcast|n) = 0
              P(rain|p) = 3/9      P(rain|n) = 2/5
temperature:  P(hot|p) = 2/9       P(hot|n) = 2/5
              P(mild|p) = 4/9      P(mild|n) = 2/5
              P(cool|p) = 3/9      P(cool|n) = 1/5
humidity:     P(high|p) = 3/9      P(high|n) = 4/5
              P(normal|p) = 6/9    P(normal|n) = 2/5
windy:        P(true|p) = 3/9      P(true|n) = 3/5
              (and hence P(false|p) = 6/9, P(false|n) = 2/5)
Play-tennis example: classifying X

 An unseen sample X = <rain, hot, high, false>

 P(X|p)·P(p) = P(rain|p)·P(hot|p)·P(high|p)·P(false|p)·P(p) = (3/9)·(2/9)·(3/9)·(6/9)·(9/14) = 0.010582

 P(X|n)·P(n) = P(rain|n)·P(hot|n)·P(high|n)·P(false|n)·P(n) = (2/5)·(2/5)·(4/5)·(2/5)·(5/14) = 0.018286

 Sample X is classified in class n (don't play)


Naïve Bayesian Classifier: Comments

 Advantages
 easy to implement
 good results obtained in most of the cases
 Disadvantages
 assumption of class-conditional independence, therefore loss of accuracy
 practically, dependencies exist among variables
 e.g., hospital patients: profile (age, family history, etc.), symptoms (fever, cough, etc.), disease (lung cancer, diabetes, etc.)
 dependencies among these cannot be modeled by the naïve Bayesian classifier
 How to deal with these dependencies? Bayesian belief networks
Bayesian Belief Networks

 A Bayesian belief network allows a subset of the variables to be conditionally independent
 A graphical model of causal relationships
 Represents dependency among the variables
 Gives a specification of the joint probability distribution
 Nodes: random variables
 Links: dependency
 Example: X and Y are the parents of Z, and Y is the parent of P; there is no dependency between Z and P; the graph has no loops or cycles
Bayesian Belief Network: An Example

Network: FamilyHistory and Smoker are the parents of LungCancer and Emphysema; LungCancer is the parent of PositiveXRay and Dyspnea.

The conditional probability table (CPT) for variable LungCancer:

        (FH, S)  (FH, ~S)  (~FH, S)  (~FH, ~S)
LC      0.8      0.5       0.7       0.1
~LC     0.2      0.5       0.3       0.9

The CPT shows the conditional probability for each possible combination of values of a node's parents: the CPT for a variable Z specifies the conditional distribution P(Z | Parents(Z)). For example,

    P(LungCancer = "yes" | FamilyHistory = "yes", Smoker = "yes") = 0.8

Derivation of the probability of a particular combination of values x1, …, xn from the CPTs:

    P(x_1, …, x_n) = \prod_{i=1}^{n} P(x_i | Parents(X_i))
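A sketch of this factorization; only the LungCancer CPT comes from the slide, so the priors for FamilyHistory and Smoker below are illustrative placeholders:

# Each node maps to (parents, CPT); the CPT maps a tuple of parent values
# to P(node = True | parents). Only the "LC" entries come from the slide.
network = {
    "FH": ((), {(): 0.1}),                       # placeholder prior, not from the slide
    "S":  ((), {(): 0.3}),                       # placeholder prior, not from the slide
    "LC": (("FH", "S"), {(True, True): 0.8, (True, False): 0.5,
                         (False, True): 0.7, (False, False): 0.1}),
}

def joint_probability(assignment):
    # P(x1, ..., xn) = product over nodes of P(xi | Parents(Xi))
    p = 1.0
    for node, (parents, cpt) in network.items():
        p_true = cpt[tuple(assignment[q] for q in parents)]
        p *= p_true if assignment[node] else 1.0 - p_true
    return p

print(joint_probability({"FH": True, "S": True, "LC": True}))  # 0.1 * 0.3 * 0.8 = 0.024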
Chapter 6. Classification and Prediction

 What is classification? What is prediction?
 Issues regarding classification and prediction
 Classification by decision tree induction
 Bayesian classification
 Rule-based classification
 Prediction
 Accuracy and error measures
 Summary


What Is Prediction?

 (Numerical) prediction is similar to classification
 construct a model
 use the model to predict a continuous or ordered value for a given input
 Prediction is different from classification
 classification predicts a categorical class label
 prediction models continuous-valued functions
 Major method for prediction: regression
 model the relationship between one or more independent or predictor variables and a dependent or response variable
 Regression analysis
 linear and multiple regression
 non-linear regression
 other regression methods: generalized linear model, Poisson regression, log-linear models, regression trees
Linear Regression

 Linear regression: involves a response variable y and a single predictor variable x:

    y = w_0 + w_1 x

 where w0 (y-intercept) and w1 (slope) are regression coefficients
 Method of least squares: estimates the best-fitting straight line:

    w_1 = \frac{\sum_{i=1}^{|D|} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{|D|} (x_i - \bar{x})^2},   w_0 = \bar{y} - w_1 \bar{x}

 Multiple linear regression: involves more than one predictor variable
 training data is of the form (X1, y1), (X2, y2), …, (X|D|, y|D|)
 ex. for 2-D data: y = w0 + w1 x1 + w2 x2
 solvable by an extension of the least-squares method
 many nonlinear functions can be transformed into the above
Regression Example

 The table shows paired data where X is the number of years of work experience of a college graduate and Y is the corresponding salary.

X (Years Experience)  Y (Salary in $1000s)
3                     30
8                     57
9                     64
13                    72
3                     36
6                     43
11                    59
21                    90
1                     20
16                    83

 Fitted line: Y = 23.6 + 3.5X
 Predict the salary for a graduate with 10 years of experience:
    Y = 23.6 + 3.5(10) = 58.6, i.e., $58,600
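A sketch of the least-squares computation on this data (helper name illustrative):

def least_squares(xs, ys):
    # w1 = sum((x - x_bar)(y - y_bar)) / sum((x - x_bar)^2);  w0 = y_bar - w1 * x_bar
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    w1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) \
         / sum((x - x_bar) ** 2 for x in xs)
    return y_bar - w1 * x_bar, w1

years  = [3, 8, 9, 13, 3, 6, 11, 21, 1, 16]
salary = [30, 57, 64, 72, 36, 43, 59, 90, 20, 83]
w0, w1 = least_squares(years, salary)
print(round(w0, 1), round(w1, 1))  # 23.2 3.5 (the slide rounds w1 to 3.5 first, giving w0 = 23.6)
print(round(w0 + 10 * w1, 1))      # 58.6, i.e., about $58,600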
Nonlinear Regression

 Some nonlinear models can be modeled by a polynomial function
 A polynomial regression model can be transformed into a linear regression model. For example,

    y = w_0 + w_1 x + w_2 x^2 + w_3 x^3

 is convertible to linear with the new variables x2 = x², x3 = x³:

    y = w_0 + w_1 x + w_2 x_2 + w_3 x_3

 Other functions, such as the power function, can also be transformed to a linear model
 Some models are intractably nonlinear (e.g., a sum of exponential terms)
 it is still possible to obtain least-squares estimates through extensive calculation on more complex formulae
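A sketch of this transformation using numpy (assumed available); the data is illustrative:

import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])   # illustrative inputs
y = 1 + 2*x - 0.5*x**2 + 0.1*x**3              # illustrative cubic response

# The new variables x2 = x^2, x3 = x^3 make the cubic a *linear* model in (x, x2, x3),
# so ordinary least squares applies to the design matrix [1, x, x2, x3].
X = np.column_stack([np.ones_like(x), x, x**2, x**3])
w, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.round(w, 3))                          # ≈ [ 1.   2.  -0.5  0.1]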
Chapter 6. Classification and Prediction

 What is classification? What is prediction?
 Issues regarding classification and prediction
 Classification by decision tree induction
 Bayesian classification
 Rule-based classification
 Prediction
 Accuracy and error measures
 Summary


Evaluating the Accuracy of a Classifier or Predictor (I)

 Holdout method
 the given data is randomly partitioned into two independent sets
 training set (e.g., 2/3) for model construction
 test set (e.g., 1/3) for accuracy estimation
 (data → training set → derive classifier → estimate accuracy on the test set)
 Random sampling: a variation of holdout
 repeat holdout k times; accuracy = avg. of the accuracies obtained
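A minimal sketch of the holdout partition (function names illustrative; evaluate stands for any train-and-score routine):

import random

def holdout_split(data, train_frac=2/3, seed=None):
    # Randomly partition data into independent training and test sets
    rows = list(data)
    random.Random(seed).shuffle(rows)
    cut = int(len(rows) * train_frac)
    return rows[:cut], rows[cut:]

# Random sampling: repeat holdout k times and average the accuracies, e.g.
# accuracies = [evaluate(*holdout_split(data, seed=i)) for i in range(k)]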
Evaluating the Accuracy of a Classifier or Predictor (II)

 Cross-validation (k-fold, where k = 10 is most popular)
 randomly partition the data into k mutually exclusive subsets D1, …, Dk, each of approximately equal size
 at the i-th iteration, use Di as the test set and the others as the training set
 the accuracy estimate = (overall number of correct classifications from the k iterations) / (total number of samples in the initial data)
 Leave-one-out: k folds where k = # of tuples, for small-sized data
 Stratified cross-validation: folds are stratified so that the class distribution in each fold is approximately the same as that in the initial data
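A sketch of k-fold cross-validation under the accuracy estimate given above; train and predict are assumed, illustrative callables:

import random

def cross_validation_accuracy(data, train, predict, k=10, seed=0):
    # data: list of (features, label); each fold Di serves once as the test set
    rows = list(data)
    random.Random(seed).shuffle(rows)
    folds = [rows[i::k] for i in range(k)]        # k mutually exclusive, near-equal subsets
    correct = 0
    for i in range(k):
        test = folds[i]
        training = [r for j, f in enumerate(folds) if j != i for r in f]
        model = train(training)
        correct += sum(predict(model, x) == y for x, y in test)
    return correct / len(rows)                    # overall correct / total samples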


Ensemble Methods: Increasing the Accuracy

 Ensemble methods
 use a combination of models to increase accuracy
 combine a series of k learned models, M1, M2, …, Mk, with the aim of creating an improved model M*
 Popular ensemble methods
 bagging: averaging the prediction over a collection of classifiers
 boosting: weighted vote with a collection of classifiers
 ensemble: combining a set of heterogeneous classifiers


Bagging: Bootstrap Aggregation

 Analogy: diagnosis based on multiple doctors' majority vote
 Training
 given a set D of d tuples, at each iteration i a training set Di of d tuples is sampled with replacement from D (i.e., bootstrap)
 a classifier model Mi is learned for each training set Di
 Classification: to classify an unknown sample X
 each classifier Mi returns its class prediction
 the bagged classifier M* counts the votes and assigns the class with the most votes to X
 Prediction: can be applied to the prediction of continuous values by taking the average of each prediction for a given test tuple
 Accuracy
 often significantly better than a single classifier derived from D
 for noisy data: not considerably worse, more robust
 proven improved accuracy in prediction
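A sketch of bagging as described above, with bootstrap samples of size d and a majority vote; train and predict are assumed, illustrative callables:

import random
from collections import Counter

def bagging(data, train, predict, x, k=25, seed=0):
    # Learn k models on bootstrap samples of D and classify x by majority vote
    rng = random.Random(seed)
    votes = []
    for _ in range(k):
        boot = [rng.choice(data) for _ in range(len(data))]  # d tuples, with replacement
        votes.append(predict(train(boot), x))
    # (For prediction of continuous values, average the k outputs instead of voting.)
    return Counter(votes).most_common(1)[0][0]               # class with the most votes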
Boosting

 Analogy: consult several doctors, based on a combination of weighted diagnoses—weights assigned based on previous diagnostic accuracy
 How boosting works:
 weights are assigned to each training tuple
 a series of k classifiers is iteratively learned
 after a classifier Mi is learned, the weights are updated to allow the subsequent classifier, Mi+1, to pay more attention to the training tuples that were misclassified by Mi
 the final M* combines the votes of each individual classifier, where the weight of each classifier's vote is a function of its accuracy
 The boosting algorithm can be extended for the prediction of continuous values
 Compared with bagging: boosting tends to achieve greater accuracy, but it also risks overfitting the model to misclassified data
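The slide does not commit to a particular boosting algorithm; the sketch below uses the standard AdaBoost update as an assumed concrete instance (labels in {-1, +1}; train must honor instance weights):

import math

def boosting(data, train, predict, k=10):
    # data: list of (x, y) with y in {-1, +1}; returns a list of (alpha, model) pairs
    n = len(data)
    w = [1.0 / n] * n                            # weights assigned to each training tuple
    ensemble = []
    for _ in range(k):
        model = train(data, w)                   # weighted learner
        err = sum(wi for wi, (x, y) in zip(w, data) if predict(model, x) != y)
        if err == 0 or err >= 0.5:               # perfect or too weak: stop early
            break
        alpha = 0.5 * math.log((1 - err) / err)  # classifier's vote weight
        # Increase weights of misclassified tuples so M_{i+1} pays more attention to them
        w = [wi * math.exp(-alpha * y * predict(model, x))
             for wi, (x, y) in zip(w, data)]
        total = sum(w)
        w = [wi / total for wi in w]
        ensemble.append((alpha, model))
    return ensemble   # final M*: sign(sum(alpha_i * predict(M_i, x)))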
Classifier Accuracy Measures and Confusion Matrix

 t_pos: true positives (e.g., "cancer" samples correctly classified as such)
 t_neg: true negatives ("not_cancer" samples correctly classified as such)
 f_pos: false positives ("not_cancer" samples incorrectly labeled as "cancer")
 f_neg: false negatives ("cancer" samples incorrectly labeled as "not_cancer")
 pos is the number of positive samples; neg is the number of negative samples

              predicted C1  predicted C2
actual C1     t_pos         f_neg
actual C2     f_pos         t_neg
Classifier Accuracy Measures

classes             buy_computer = yes  buy_computer = no  total  recognition (%)
buy_computer = yes  6954                46                 7000   99.34
buy_computer = no   412                 2588               3000   86.27
total               7366                2634               10000  95.42

 Accuracy of a classifier M, acc(M): percentage of test set tuples that are correctly classified by the model M (here (6954 + 2588)/10000 = 95.42%)
 Error rate (misclassification rate) of M = 1 − acc(M)
 Given m classes, CMi,j, an entry in a confusion matrix, indicates the # of tuples in class i that are labeled by the classifier as class j
 Alternative accuracy measures (e.g., for cancer diagnosis):

    sensitivity = t_pos/pos              /* true positive recognition rate */
    specificity = t_neg/neg              /* true negative recognition rate */
    precision = t_pos/(t_pos + f_pos)
    accuracy = sensitivity × pos/(pos + neg) + specificity × neg/(pos + neg)

 This model can also be used for cost-benefit analysis
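A sketch computing these measures from the counts in the table above:

t_pos, f_neg = 6954, 46     # actual buy_computer = yes
f_pos, t_neg = 412, 2588    # actual buy_computer = no
pos, neg = t_pos + f_neg, f_pos + t_neg

sensitivity = t_pos / pos                        # true positive recognition rate
specificity = t_neg / neg                        # true negative recognition rate
precision   = t_pos / (t_pos + f_pos)
accuracy    = sensitivity * pos / (pos + neg) + specificity * neg / (pos + neg)
print(round(sensitivity, 4), round(specificity, 4),
      round(precision, 4), round(accuracy, 4))   # 0.9934 0.8627 0.9441 0.9542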


Predictor Error Measures

 Measure predictor accuracy: how far off the predicted value is from the actual known value
 Loss function: measures the error between yi and the predicted value yi'
 absolute error: |y_i - y_i'|
 squared error: (y_i - y_i')^2
 Test error (generalization error): the average loss over the test set
 mean absolute error:

    \frac{1}{d} \sum_{i=1}^{d} |y_i - y_i'|

 mean squared error:

    \frac{1}{d} \sum_{i=1}^{d} (y_i - y_i')^2

 relative absolute error:

    \frac{\sum_{i=1}^{d} |y_i - y_i'|}{\sum_{i=1}^{d} |y_i - \bar{y}|}

 relative squared error:

    \frac{\sum_{i=1}^{d} (y_i - y_i')^2}{\sum_{i=1}^{d} (y_i - \bar{y})^2}

 The mean squared error exaggerates the presence of outliers
 Popularly used: the (square) root mean-squared error; similarly, the root relative squared error
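A sketch of these measures (plus the root mean-squared variant) for paired actual/predicted lists; the example values are illustrative:

import math

def error_measures(y_true, y_pred):
    # Returns the loss-based error measures defined above
    d = len(y_true)
    y_bar = sum(y_true) / d
    mae = sum(abs(y - yp) for y, yp in zip(y_true, y_pred)) / d
    mse = sum((y - yp) ** 2 for y, yp in zip(y_true, y_pred)) / d
    rae = sum(abs(y - yp) for y, yp in zip(y_true, y_pred)) \
        / sum(abs(y - y_bar) for y in y_true)
    rse = sum((y - yp) ** 2 for y, yp in zip(y_true, y_pred)) \
        / sum((y - y_bar) ** 2 for y in y_true)
    return {"MAE": mae, "MSE": mse, "RMSE": math.sqrt(mse), "RAE": rae, "RSE": rse}

print(error_measures([30, 57, 64], [33, 55, 60]))   # illustrative values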
