DATA MINING: CLASSIFICATION


CLASSIFICATION

CLASSIFICATION PROBLEM

Problem statement:
 Given features X1, X2,…, Xn
 Predict a label Y

Definition (Classification): the task of learning a target function f that maps each
attribute set X to one of the predefined class labels Y.
EXAMPLE
Day    Outlook   Temperature  Humidity  Wind    PlayTennis

Day1 Sunny Hot High Weak No


Day2 Sunny Hot High Strong No
Day3 Overcast Hot High Weak Yes
Day4 Rain Mild High Weak Yes
Day5 Rain Cool Normal Weak Yes
Day6 Rain Cool Normal Strong No
Day7 Overcast Cool Normal Strong Yes
Day8 Sunny Mild High Weak No
Day9 Sunny Cool Normal Weak Yes
Day10 Rain Mild Normal Weak Yes
Day11 Sunny Mild Normal Strong Yes
Day12 Overcast Mild High Strong Yes
Day13 Overcast Hot Normal Weak Yes
Day14 Rain Mild High Strong No
THINGS WE'D LIKE TO DO
 Spam Classification
 Given an email, predict whether it is spam or
not

 Medical Diagnosis
 Given a list of symptoms, predict whether a
patient has disease X or not

 Weather
 Based on temperature, humidity, etc… predict
if it will rain tomorrow
CLASSIFICATION PROBLEM

• Training data: examples of the form (d, h(d))


– where d are the data objects to classify (inputs)
– and h(d) is the correct class label for d, h(d) ∈ {1, …, K}
• Goal: given dnew, provide h(dnew) (see the sketch below)
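As an illustrative sketch only (not part of the original slides), the training pairs (d, h(d)) can be written out directly in Python; the predict_majority baseline below is a hypothetical, trivially simple h that just returns the most frequent label.

```python
from collections import Counter

# Hypothetical sketch: training examples as (d, h(d)) pairs, using the first
# five rows of the Play Tennis table above (attributes: Outlook, Temperature,
# Humidity, Wind; label: PlayTennis).
training_data = [
    (("Sunny", "Hot", "High", "Weak"), "No"),
    (("Sunny", "Hot", "High", "Strong"), "No"),
    (("Overcast", "Hot", "High", "Weak"), "Yes"),
    (("Rain", "Mild", "High", "Weak"), "Yes"),
    (("Rain", "Cool", "Normal", "Weak"), "Yes"),
]

def predict_majority(training, d_new):
    """Trivial baseline classifier: ignore d_new and return the majority label."""
    labels = [label for _, label in training]
    return Counter(labels).most_common(1)[0][0]

d_new = ("Rain", "Cool", "Normal", "Strong")
print(predict_majority(training_data, d_new))  # -> "Yes" (the majority label)
```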
CLASSIFICATION—A TWO-STEP PROCESS

 Model construction: describing a set of predetermined classes


 Each tuple/sample is assumed to belong to a predefined class, as
determined by the class label attribute
 The set of tuples used for model construction is training set
 The model is represented as classification rules, decision trees,
or mathematical formulae
 Model usage: for classifying future or unknown objects
 Estimate accuracy of the model
 The known label of test sample is compared with the
classified result from the model
 Accuracy rate is the percentage of test set samples that are
correctly classified by the model
 Test set is independent of training set (otherwise overfitting)

 If the accuracy is acceptable, use the model to classify new data


 Note: If the test set is used to select among models, it is called a validation set
PROCESS (1): MODEL CONSTRUCTION

The training data is fed to a classification algorithm, which builds the classifier (model).

NAME   RANK             YEARS   TENURED
Mike   Assistant Prof   3       no
Mary   Assistant Prof   7       yes
Bill   Professor        2       yes
Jim    Associate Prof   7       yes
Dave   Assistant Prof   6       no
Anne   Associate Prof   3       no

Learned model (classification rule):
IF rank = 'professor' OR years > 6 THEN tenured = 'yes'
PROCESS (2): USING THE MODEL IN PREDICTION

The classifier is first applied to the testing data to estimate accuracy, and then to unseen data.

Testing data:
NAME      RANK             YEARS   TENURED
Tom       Assistant Prof   2       no
Merlisa   Associate Prof   7       no
George    Professor        5       yes
Joseph    Assistant Prof   7       yes

Unseen data: (Jeff, Professor, 4) → Tenured?
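A minimal sketch (assuming the rule shown above is the learned model) of this second step, applying the classifier to the testing data and to the unseen tuple:

```python
def tenured_model(rank, years):
    """Model from the construction step: IF rank = 'professor' OR years > 6 THEN tenured = 'yes'."""
    return "yes" if rank == "Professor" or years > 6 else "no"

# Testing data: (name, rank, years, actual tenured label)
testing_data = [
    ("Tom", "Assistant Prof", 2, "no"),
    ("Merlisa", "Associate Prof", 7, "no"),
    ("George", "Professor", 5, "yes"),
    ("Joseph", "Assistant Prof", 7, "yes"),
]

correct = sum(tenured_model(r, y) == label for _, r, y, label in testing_data)
print(f"Accuracy on test set: {correct}/{len(testing_data)}")  # 3/4 correct

# Unseen tuple: (Jeff, Professor, 4)
print("Jeff tenured?", tenured_model("Professor", 4))  # -> "yes"
```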
LEARNING
 A classification technique is a systematic approach to building classification
models from an input dataset
 Each technique employs a learning algorithm to identify a model that best
fits the relationship between the attribute set and class label of the input
data.
 Formally, a computer program is said to learn from experience E with respect
to some class of tasks T and performance measure P, if its performance at
tasks in T, as measured by P, improves with experience E.
 Thus a learning system is characterized by:
 task T
 experience E, and
 performance measure P
SUPERVISED VS. UNSUPERVISED LEARNING

 Supervised learning (classification)


 Supervision: The training data (observations,
measurements, etc.) are accompanied by labels
indicating the class of the observations
 New data is classified based on the training set
 Unsupervised learning (clustering)
 The class labels of the training data are unknown
 Given a set of measurements, observations, etc. with
the aim of establishing the existence of classes or
clusters in the data

PERFORMANCE OF CLASSIFICATION
 Evaluation is based on the counts of test records correctly
and incorrectly predicted by the model
 These counts are tabulated in a table known as the confusion
matrix

                              Predicted Class
                              Class = 1    Class = 0
  Actual Class   Class = 1    f11          f10
                 Class = 0    f01          f00

 Accuracy = (Number of Correct Predictions) / (Total Number of Predictions)
          = (f11 + f00) / (f11 + f10 + f01 + f00)

 Accuracy will yield misleading results if the data set is unbalanced
  For example, if there were 95 cats and only 5 dogs in the data, a particular classifier might classify all the
observations as cats.
  The classifier would have a 100% recognition rate for the cat class but a 0% recognition rate for the dog
class, yet its overall accuracy would still be 95%.
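A small sketch of the cat/dog illustration above, showing why accuracy alone misleads on unbalanced data:

```python
# Imbalanced data: 95 cats, 5 dogs; the classifier predicts "cat" for everything.
actual = ["cat"] * 95 + ["dog"] * 5
predicted = ["cat"] * 100

correct = sum(a == p for a, p in zip(actual, predicted))
accuracy = correct / len(actual)
print(f"Accuracy = {accuracy:.2%}")  # 95.00%, yet every dog is misclassified

# Per-class recognition rates
cat_rate = sum(p == "cat" for a, p in zip(actual, predicted) if a == "cat") / 95
dog_rate = sum(p == "dog" for a, p in zip(actual, predicted) if a == "dog") / 5
print(f"Cat recognition rate = {cat_rate:.0%}, dog recognition rate = {dog_rate:.0%}")
```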
CONFUSION MATRIX
                              Predicted Class
                              Class = 1    Class = 0
  Actual Class   Class = 1    TP           FN
                 Class = 0    FP           TN

 True Positive (TP): the model predicted positive and the actual class is positive.
 True Negative (TN): the model predicted negative and the actual class is negative.
 False Positive (FP, Type 1 Error): the model predicted positive but the actual class is negative.
 False Negative (FN, Type 2 Error): the model predicted negative but the actual class is positive.
PERFORMANCE OF CLASSIFICATION CONTD..
 In addition to classification accuracy there are two other metrics for
performance evaluation
 Precision (also called positive predictive value) is the fraction of
relevant instances among the retrieved instances, i.e. out of all the
instances we predicted as positive, how many are actually
positive
 Recall (also known as sensitivity) is the fraction of relevant instances
that have been retrieved over the total number of relevant instances,
i.e. out of all the actual positive instances, how many we predicted correctly. It
should be as high as possible.

 Example: Suppose a computer program for recognizing dogs in


photographs identifies eight dogs in a picture containing 12 dogs and
some cats. Of the eight dogs identified, five actually are dogs (true
positives), while the rest are cats (false positives).

 Answer: The program's precision is 5/8 while its recall is 5/12.


PERFORMANCE OF CLASSIFICATION CONTD..
 In simple terms, high precision means that an algorithm returned
substantially more relevant results than irrelevant ones, while high
recall means that an algorithm returned most of the relevant results

                  Relevant    Nonrelevant
 Retrieved        tp          fp
 Not Retrieved    fn          tn

 Precision P = tp/(tp + fp)


 Recall R = tp/(tp + fn)
 F-Measure F = 2*R*P/(R+P)
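Applying these formulas to the dog-recognition example above (5 true positives, 3 false positives, 7 false negatives), a short sketch:

```python
def precision(tp, fp):
    return tp / (tp + fp)

def recall(tp, fn):
    return tp / (tp + fn)

def f_measure(p, r):
    return 2 * p * r / (p + r)

# Dog example: 8 identified, 5 correct (TP), 3 cats among them (FP),
# and 12 - 5 = 7 dogs missed (FN).
tp, fp, fn = 5, 3, 7
p, r = precision(tp, fp), recall(tp, fn)
print(f"Precision = {p:.3f} (5/8), Recall = {r:.3f} (5/12), F = {f_measure(p, r):.3f}")
```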
PERFORMANCE OF CLASSIFICATION CONTD..
 Consider the following confusion matrix:

 Find the accuracy, precision, and recall (as percentages)


PERFORMANCE OF CLASSIFICATION CONTD..
CLASS IMBALANCE PROBLEM
 Consider the following confusion matrix:

 Here the main class of interest is rare.


 The sensitivity and specificity measures can be used instead
for this type of situation.
 These measures are defined as follows:
    Sensitivity = TP / (TP + FN)   (true positive rate, the recognition rate of the positive class)
    Specificity = TN / (TN + FP)   (true negative rate)

 The classifier in this example has high specificity, meaning that it can accurately
recognize negative tuples.
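A brief sketch with assumed counts (the confusion matrix referred to above is not reproduced here), showing how a classifier can score high on accuracy and specificity while missing most of the rare positive class:

```python
# Assumed counts for illustration only: 30 positives (rare class), 9970 negatives.
tp, fn = 10, 20       # only 10 of the 30 positives are recognized
tn, fp = 9960, 10     # almost all negatives are recognized

sensitivity = tp / (tp + fn)   # true positive rate
specificity = tn / (tn + fp)   # true negative rate
accuracy = (tp + tn) / (tp + fn + tn + fp)
print(f"Sensitivity = {sensitivity:.1%}, Specificity = {specificity:.1%}, Accuracy = {accuracy:.1%}")
# High accuracy and specificity, but the rare class of interest is mostly missed.
```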
LEARNING MODELS
 Eager Learners - when given a set of training tuples, will
construct a generalization (i.e., classification) model
before receiving new (e.g., test) tuples to classify
 Rule-based classification
 Decision-tree induction
 Naïve Bayes classifier
 Support Vector Machine (SVM)
 Classification based on Association Rule Mining
 Artificial Neural Network

 Lazy Learner - the learner instead waits until the last


minute before doing any model construction to classify a
given test tuple
 k-nearest-neighbor classifiers (k-NN)
 Case based Reasoning classifiers
RULE – BASED CLASSIFICATION

 Represent the knowledge in the form of IF-THEN rules


 These rules are generated directly from the training data using a
sequential covering algorithm

R: IF age = youth AND student = yes THEN buys_computer = yes


 Rule antecedent/precondition vs. rule consequent

 Assessment of a rule: coverage and accuracy


 ncovers = # of tuples covered by R
 ncorrect = # of tuples correctly classified by R

coverage(R) = ncovers /|D| /* D: training data set */


accuracy(R) = ncorrect / ncovers
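A minimal sketch of computing coverage and accuracy for a rule, using a small hypothetical set of (age, student, buys_computer) tuples rather than the full training data:

```python
# Hypothetical training tuples D: (age, student, buys_computer)
D = [
    ("youth", "yes", "yes"),
    ("youth", "yes", "yes"),
    ("youth", "no", "no"),
    ("middle_aged", "yes", "yes"),
    ("senior", "no", "no"),
    ("senior", "yes", "no"),
]

# Rule R: IF age = youth AND student = yes THEN buys_computer = yes
def rule_covers(t):
    return t[0] == "youth" and t[1] == "yes"

covered = [t for t in D if rule_covers(t)]          # ncovers
correct = [t for t in covered if t[2] == "yes"]     # ncorrect

print(f"coverage(R) = {len(covered)}/{len(D)} = {len(covered) / len(D):.2f}")
print(f"accuracy(R) = {len(correct)}/{len(covered)} = {len(correct) / len(covered):.2f}")
```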

EXAMPLE

 Consider rule R1, which covers 2 of the 14 tuples. It can


correctly classify both tuples.

 Coverage (R1) = 2/14 ≈ 14.28%
 Accuracy (R1) = 2/2 = 100%
RULE – BASED CLASSIFICATION

 Let's see how we can use rule-based classification to


predict the class label of a given tuple, X, where –
X= (age = youth, income = medium, student = yes, credit rating =
fair)
 We would like to classify X according to buys computer.
 X satisfies R1, which triggers the rule, where
R1: IF age = youth AND student = yes THEN buys_computer = yes

 If R1 is the only rule satisfied, then the rule fires by returning the
class prediction for X
RULE – BASED CLASSIFICATION
 If more than one rule are triggered, need conflict resolution
 Size ordering: assign the highest priority to the triggering rule
that has the "toughest" requirement (i.e., with the most attribute
tests)

 Class-based ordering: decreasing order of prevalence or


misclassification cost per class

 Rule-based ordering (decision list): rules are organized into one


long priority list, according to some measure of rule quality or by
experts
 If there is no rule satisfied by X –
 A default rule can be set up to specify a default class, based on a training set.

 This may be the class in majority or the majority class of the tuples that were
not covered by any rule.
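A sketch of a rule-based ordering (decision list) with a default rule; the rules themselves are hypothetical, chosen only to illustrate the firing and fall-through behaviour:

```python
# Each rule: (condition function, predicted class). Rules are checked in
# priority order; the first rule whose condition is satisfied fires.
rules = [
    (lambda x: x["age"] == "youth" and x["student"] == "yes", "buys_computer = yes"),
    (lambda x: x["credit_rating"] == "excellent", "buys_computer = yes"),
]
DEFAULT_CLASS = "buys_computer = no"   # default rule: fires when nothing else matches

def classify(x):
    for condition, label in rules:
        if condition(x):
            return label          # first triggered rule fires
    return DEFAULT_CLASS          # no rule is satisfied by x

X = {"age": "youth", "income": "medium", "student": "yes", "credit_rating": "fair"}
print(classify(X))  # -> "buys_computer = yes" (the first rule fires)
```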
RULE INDUCTION: SEQUENTIAL COVERING METHOD

 Sequential covering algorithm: Extracts rules directly from


training data
 Typical sequential covering algorithms: FOIL, AQ, CN2, RIPPER
 Rules are learned sequentially, each for a given class Ci will cover
many tuples of Ci but none (or few) of the tuples of other classes
 Steps:
 Rules are learned one at a time
 Each time a rule is learned, the tuples covered by the rules are
removed
 Repeat the process on the remaining tuples until termination
condition, e.g., when no more training examples or when the
quality of a rule returned is below a user-specified threshold
 Compare with decision-tree induction, which learns a set of rules
simultaneously
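A high-level sketch of the sequential covering loop described above; Learn_One_Rule is left as a parameter, since FOIL, AQ, CN2 and RIPPER each realize it differently, and the rule objects are assumed to expose a covers() test and a quality score:

```python
def sequential_covering(D, target_class, learn_one_rule, min_quality=0.0):
    """Learn rules one at a time for target_class, removing covered tuples each round."""
    rules = []
    remaining = list(D)
    while remaining:
        rule = learn_one_rule(remaining, target_class)   # e.g., greedy growth of attribute tests
        if rule is None or rule.quality < min_quality:   # termination condition
            break
        rules.append(rule)
        # Remove the tuples covered by the new rule and repeat on the rest
        remaining = [t for t in remaining if not rule.covers(t)]
    return rules
```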

SEQUENTIAL COVERING ALGORITHM
HOW ARE RULES LEARNED?
RULE QUALITY MEASURES
 Choosing between two rules based on accuracy alone can be misleading

 Rule R1 correctly classifies 38 of the 40 tuples it covers

 Whereas, rule R2 covers only two tuples, which it
correctly classifies
 Although R2 has the higher accuracy (100% versus 95%), R1 is generally the
better rule because it covers far more tuples; rule quality measures therefore
take both accuracy and coverage into account
K-NEAREST NEIGHBOR CLASSIFIER
 Nearest-neighbor classifiers are based on learning by
analogy

 by comparing a given test tuple with training tuples that


are similar to it

 The training tuples are described by n attributes.


 In this way, all the training tuples are stored in an n-
dimensional pattern space
 When given an unknown tuple, a k-NN classifier searches
the pattern space for the k training tuples that are closest
to the unknown tuple.

 These k training tuples are the k "nearest neighbors" of


the unknown tuple.
K-NEAREST NEIGHBOR CLASSIFIER
 "Closeness" is defined in terms of a distance metric, such
as Euclidean distance
 The Euclidean distance between two points or tuples, say, X1
= (x11, x12, ..., x1n) and X2 = (x21, x22, ..., x2n), is

    dist(X1, X2) = sqrt( (x11 − x21)² + (x12 − x22)² + ... + (x1n − x2n)² )
 For k-nearest-neighbor classification, the unknown tuple


is assigned the most common class among its k-nearest
neighbors
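A minimal k-NN sketch in pure Python, using Euclidean distance and a majority vote over the k nearest training tuples (numeric attributes and hypothetical 2-D points assumed):

```python
import math
from collections import Counter

def euclidean(x1, x2):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x1, x2)))

def knn_classify(training, query, k=3):
    """training: list of (attribute_tuple, class_label); returns the majority class of the k nearest."""
    neighbors = sorted(training, key=lambda item: euclidean(item[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Illustrative 2-D points (hypothetical, not the figure from the slides)
training = [((10, 20), "blue"), ((15, 25), "blue"), ((70, 65), "red"),
            ((65, 55), "red"), ((58, 62), "red")]
print(knn_classify(training, (60, 60), k=3))   # -> "red"
```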
EXAMPLE
 Suppose the training dataset is plotted as follows (two classes, shown in blue and red):
 Now, we need to classify a new data point, shown as a black dot at (60, 60), into the blue or red
class.
 Assuming K = 3, the classifier finds the three nearest data points and assigns the majority class among them.
DISCUSSION ON THE K-NN ALGORITHM

 k-NN for real-valued prediction for a given unknown


tuple
 Returns the mean values of the k nearest neighbors
 Distance-weighted nearest neighbor algorithm
  Weight the contribution of each of the k neighbors
according to their distance to the query xq:
      w = 1 / d(xq, xi)²
  Give greater weight to closer neighbors
 Robust to noisy data by averaging k-nearest neighbors
 Curse of dimensionality: distance between neighbors
could be dominated by irrelevant attributes
 To overcome it, axes stretch or elimination of the least
relevant attributes
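A brief sketch of the distance-weighted vote described above (w = 1/d²); it could replace the plain majority vote in the earlier k-NN sketch:

```python
import math
from collections import defaultdict

def weighted_knn_classify(training, query, k=3):
    """Each of the k nearest neighbors votes with weight 1 / d(query, xi)^2."""
    def dist(x):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, query)))
    neighbors = sorted(training, key=lambda item: dist(item[0]))[:k]
    votes = defaultdict(float)
    for x, label in neighbors:
        d = dist(x)
        votes[label] += float("inf") if d == 0 else 1.0 / (d ** 2)  # closer neighbors weigh more
    return max(votes, key=votes.get)
```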
DISCUSSION ON THE K-NN ALGORITHM
 How can I determine a good value for k, the number of
neighbors?

 Starting with k = 1, we use a test set to estimate the error rate


of the classifier

 This process can be repeated each time by incrementing k to


allow for one more neighbor

 The k value that gives the minimum error rate may be


selected.
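A sketch of this k-selection loop; `classify` stands in for any k-NN classifier, for example the knn_classify sketch shown earlier:

```python
def choose_k(training, test, classify, max_k=15):
    """Try k = 1, ..., max_k; classify(training, x, k) predicts a label.
    Returns the k with the lowest error rate on the test set."""
    best_k, best_error = None, float("inf")
    for k in range(1, max_k + 1):
        errors = sum(classify(training, x, k) != label for x, label in test)
        error_rate = errors / len(test)
        if error_rate < best_error:
            best_k, best_error = k, error_rate
    return best_k, best_error
```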
EVALUATING CLASSIFIER ACCURACY:
HOLDOUT & CROSS-VALIDATION METHODS
 Holdout method
 Given data is randomly partitioned into two
independent sets
 Training set (e.g., 2/3) for model construction
 Test set (e.g., 1/3) for accuracy estimation

 Random sampling: a variation of holdout


 Repeat holdout k times, accuracy = avg. of the accuracies obtained
 Cross-validation (k-fold, where k = 10 is most popular)
 Randomly partition the data into k mutually exclusive
subsets D1, ..., Dk, each of approximately equal size
 At the i-th iteration, use Di as the test set and the
remaining subsets as the training set
 Leave-one-out: k folds where k = # of tuples, for small
sized data
 *Stratified cross-validation*: folds are stratified so
that class dist. in each fold is approx. the same as that
in the initial data
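A minimal k-fold cross-validation sketch in plain Python; `train_and_evaluate` is a placeholder for building a model on the training folds and returning its accuracy on the held-out fold:

```python
import random

def k_fold_cross_validation(data, k, train_and_evaluate, seed=0):
    """Shuffle the data, split it into k roughly equal folds, and average the fold accuracies."""
    data = list(data)
    random.Random(seed).shuffle(data)
    folds = [data[i::k] for i in range(k)]            # k mutually exclusive subsets
    accuracies = []
    for i in range(k):
        test_set = folds[i]                           # Di is the test set at iteration i
        training_set = [t for j, fold in enumerate(folds) if j != i for t in fold]
        accuracies.append(train_and_evaluate(training_set, test_set))
    return sum(accuracies) / k
```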
EVALUATING CLASSIFIER ACCURACY: BOOTSTRAP

 Bootstrap
 Works well with small data sets
 Samples the given training tuples uniformly with replacement
 i.e., each time a tuple is selected, it is equally likely to be
selected again and re-added to the training set
 Several bootstrap methods exist; a common one is the .632 bootstrap
 A data set with d tuples is sampled d times, with replacement,
resulting in a training set of d samples. The data tuples that did
not make it into the training set end up forming the test set.
About 63.2% of the original data end up in the bootstrap sample, and the
remaining 36.8% form the test set (since (1 − 1/d)^d ≈ e^(−1) ≈ 0.368)
 Repeat the sampling procedure k times; the overall accuracy of the
model is:

    Acc(M) = Σ(i=1..k) [ 0.632 × Acc(Mi)_test_set + 0.368 × Acc(Mi)_train_set ]

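A sketch of the .632 bootstrap as described above; `evaluate` is a placeholder for training a model on the first set and returning its accuracy on the second:

```python
import random

def bootstrap_632_accuracy(data, k, evaluate, seed=0):
    """Average of 0.632 * test-set accuracy + 0.368 * training-set accuracy over k rounds."""
    rng = random.Random(seed)
    data = list(data)
    d = len(data)
    total = 0.0
    for _ in range(k):
        # Sample d tuples uniformly with replacement -> bootstrap training set
        train = [data[rng.randrange(d)] for _ in range(d)]
        # Tuples never drawn (about 36.8% of the data) form the test set
        test = [t for t in data if t not in train]
        total += 0.632 * evaluate(train, test) + 0.368 * evaluate(train, train)
    return total / k
```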
ESTIMATING CONFIDENCE INTERVALS:
CLASSIFIER MODELS M1 VS. M2

 Suppose we have 2 classifiers, M1 and M2, which one


is better?
 Use 10-fold cross-validation to obtain the mean error rates err(M1) and err(M2)
 These mean error rates are just estimates of error on
the true population of future data cases
 What if the difference between the 2 error rates is
just attributed to chance?
 Use a test of statistical significance
 Obtain confidence limits for our error estimates

MODEL SELECTION: ROC CURVES

 ROC (Receiver Operating Characteristics) curves: for visual
comparison of classification models
 Originated from signal detection theory
 Shows the trade-off between the true positive rate and the false
positive rate
  The vertical axis represents the true positive rate
  The horizontal axis represents the false positive rate
  The plot also shows a diagonal line
 The area under the ROC curve is a measure of the accuracy of the model
  A model with perfect accuracy will have an area of 1.0
  The closer the curve is to the diagonal line (i.e., the closer the area is to 0.5), the
less accurate the model
 Rank the test tuples in decreasing order: the one that is most likely to
belong to the positive class appears at the top of the list
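A short sketch of how the ROC points are produced from the ranked test tuples: sweep down the list in decreasing score order, accumulating the true positive and false positive rates.

```python
def roc_points(scored):
    """scored: list of (score, actual_label) with label 1 = positive, 0 = negative.
    Returns the (FPR, TPR) points tracing the ROC curve."""
    P = sum(label for _, label in scored)     # number of positive tuples
    N = len(scored) - P                       # number of negative tuples
    points = [(0.0, 0.0)]
    tp = fp = 0
    for score, label in sorted(scored, key=lambda s: s[0], reverse=True):
        if label == 1:
            tp += 1      # tuple above the current threshold is a true positive
        else:
            fp += 1      # ... or a false positive
        points.append((fp / N, tp / P))
    return points

# Hypothetical scores: probability of the positive class assigned by some model
scored = [(0.95, 1), (0.85, 1), (0.78, 0), (0.66, 1), (0.60, 0), (0.55, 1), (0.43, 0), (0.42, 0)]
print(roc_points(scored))
```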
ISSUES AFFECTING MODEL SELECTION

 Accuracy
 classifier accuracy: predicting class label
 Speed
 time to construct the model (training time)
 time to use the model (classification/prediction time)
 Robustness: handling noise and missing values
 Scalability: efficiency in disk-resident databases
 Interpretability
 understanding and insight provided by the model
 Other measures, e.g., goodness of rules, such as decision
tree size or compactness of classification rules
