Classification
• The attribute with the maximum gain ratio is selected as the splitting
attribute.
Decision Tree: Attribute Selection Measures- Gain Ratio
• Computation of gain ratio for the attribute income.
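The slide refers to the gain-ratio computation for income without reproducing it here; the following is a minimal sketch of that computation. The per-value class counts are illustrative (they follow the classic AllElectronics example with 9 "yes" and 5 "no" tuples) and may not match the slide's own data.

```python
# Sketch of the gain-ratio computation for a discrete attribute such as income.
from math import log2

def entropy(counts):
    """Info(D) = -sum(p_i * log2(p_i)) over the class counts of a partition."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c)

# income value -> [class "yes" count, class "no" count] in that partition (illustrative)
partitions = {'low': [3, 1], 'medium': [4, 2], 'high': [2, 2]}

class_totals = [sum(p[0] for p in partitions.values()),
                sum(p[1] for p in partitions.values())]                   # [9, 5]
n = sum(class_totals)                                                     # 14 tuples

info_D = entropy(class_totals)                                            # Info(D)        ~0.940
info_income = sum(sum(p) / n * entropy(p) for p in partitions.values())  # Info_income(D) ~0.911
gain = info_D - info_income                                               # Gain(income)   ~0.029
split_info = entropy([sum(p) for p in partitions.values()])               # SplitInfo      ~1.557
gain_ratio = gain / split_info                                            # GainRatio      ~0.019
print(round(gain, 3), round(split_info, 3), round(gain_ratio, 3))
```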
• For a discrete-valued attribute, the subset that gives the minimum Gini
index for that attribute is selected as its splitting subset.
Decision Tree: Attribute Selection Measures- Gini Index
• The reduction in impurity that would be incurred by a binary split on a discrete- or continuous-valued attribute A is:
ΔGini(A) = Gini(D) − Gini_A(D)
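As a companion sketch, the Gini index and the impurity reduction for a candidate binary split can be computed as follows; the class counts are again illustrative, not taken from the slide.

```python
# Sketch of the Gini index and the impurity reduction for a binary split on A.
# Class counts are illustrative (9 "yes" / 5 "no" overall), not the slide's data.

def gini(counts):
    """Gini(D) = 1 - sum(p_i^2) over the class counts of a partition."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

D  = [9, 5]   # class counts ("yes", "no") in the full data set D
D1 = [7, 3]   # counts in D1, e.g. tuples with income in {low, medium}
D2 = [2, 2]   # counts in D2, the remaining tuples (income = high)

n = sum(D)
gini_A = sum(D1) / n * gini(D1) + sum(D2) / n * gini(D2)  # weighted Gini of the split
delta_gini = gini(D) - gini_A                             # reduction in impurity
print(round(gini(D), 3), round(gini_A, 3), round(delta_gini, 3))
```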
Decision Tree
The three measures, in general, return good results but
• Information gain:
➢biased towards multivalued attributes
• Gain ratio:
➢tends to prefer unbalanced splits in which one partition is much
smaller than the others
• Gini index:
➢biased towards multivalued attributes
➢has difficulty when # of classes is large
➢tends to favor tests that result in equal-sized partitions and purity in
both partitions
Bayes Classification
• Bayesian classifiers are statistical classifiers.
• A Bayesian classifier can predict class membership probabilities, such as
the probability that a given tuple belongs to a particular class.
• Bayesian classification is based on Bayes’ theorem.
• A simple Bayesian classifier is the naïve Bayesian classifier.
• Naïve Bayesian classifiers assume that the effect of an attribute value
on a given class is independent of the values of the other attributes.
• This greatly reduces the computational cost: only the class distribution
needs to be counted.
Naïve Bayesian Classification
• Predicting a class label using naïve Bayesian classification.
• For example, we wish to classify the tuple X = (Cibil Score = 600, Income = 25000), whose class label (Loan Approved) is unknown.
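The slides do not show the full naïve Bayes computation; below is a minimal sketch under the assumption that the same small loan-approval data set from the k-NN example that follows is used, with scikit-learn's GaussianNB modelling each continuous attribute as a class-conditional Gaussian. The variable names are illustrative.

```python
# Naive Bayes sketch on an assumed small loan-approval data set (see k-NN table below).
from sklearn.naive_bayes import GaussianNB

# Training tuples: [Cibil Score, Income] and class label Loan Approved.
X_train = [[700, 50000], [800, 40000], [750, 30000], [400, 10000], [850, 8000],
           [600, 20000], [700, 35000], [750, 100000], [500, 150000], [650, 18000]]
y_train = ['Y', 'Y', 'Y', 'N', 'Y', 'N', 'Y', 'Y', 'N', 'N']

model = GaussianNB().fit(X_train, y_train)

# Classify the unknown tuple X = (Cibil Score = 600, Income = 25000).
x_new = [[600, 25000]]
print(model.predict(x_new))         # predicted class label for X
print(model.predict_proba(x_new))   # class membership probabilities P(Ci | X)
```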
Instance-based Classification: k-Nearest-Neighbor
• We wish to classify applicant X (Cibil Score = 600, Income = 25000) using k = 3. The Euclidean distance from X to each training applicant is:

Applicant   Cibil Score   Income    Loan Approved   Euclidean Distance from X
A           700           50000     Y               25000.20
B           800           40000     Y               15001.33
C           750           30000     Y               5002.25
D           400           10000     N               15001.33
E           850           8000      Y               17001.84
F           600           20000     N               5000.00
G           700           35000     Y               10000.50
H           750           100000    Y               75000.15
I           500           150000    N               125000.04
J           650           18000     N               7000.18

• The k = 3 nearest neighbors of X are F (distance 5000.00, class N), C (5002.25, Y), and J (7000.18, N). By majority vote, X is classified as Loan Approved = N.
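A minimal sketch of this k-NN classification, assuming plain (unscaled) Euclidean distance over the two raw attributes and a simple majority vote, is shown below.

```python
# k-NN sketch for the loan-approval example (k = 3, unscaled Euclidean distance).
import math
from collections import Counter

train = {  # applicant: ((cibil, income), loan_approved)
    'A': ((700, 50000), 'Y'), 'B': ((800, 40000), 'Y'),  'C': ((750, 30000), 'Y'),
    'D': ((400, 10000), 'N'), 'E': ((850, 8000), 'Y'),   'F': ((600, 20000), 'N'),
    'G': ((700, 35000), 'Y'), 'H': ((750, 100000), 'Y'), 'I': ((500, 150000), 'N'),
    'J': ((650, 18000), 'N'),
}
x = (600, 25000)  # query tuple X

# Euclidean distance from X to every training tuple.
dist = {a: math.dist(p, x) for a, (p, _) in train.items()}
neighbors = sorted(dist, key=dist.get)[:3]          # 3 nearest: F, C, J
vote = Counter(train[a][1] for a in neighbors)      # majority vote on their labels
print(neighbors, vote.most_common(1)[0][0])         # -> ['F', 'C', 'J'] N
```

Because the two attributes are on very different scales, Income dominates the distance here; in practice the attributes would normally be normalized before computing distances.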
Metrics for Evaluating Classifier Performance
• The evaluation metrics assess how good or how accurate your classifier
is at predicting the class label of tuples.
• Use a validation (test) set of class-labeled tuples, rather than the training
set, when assessing accuracy.
Metrics for Evaluating Classifier Performance
• There are four additional terms we need to know that are the “building
blocks” used in computing many evaluation measures.
• These terms are summarized in the confusion matrix.
• True positives (TP): These refer to the positive tuples that were
correctly labeled by the classifier. Let TP be the number of true
positives.
• True negatives (TN): These are the negative tuples that were correctly
labeled by the classifier. Let TN be the number of true negatives.
Metrics for Evaluating Classifier Performance
• False positives (FP): These are the negative tuples that were incorrectly
labeled as positive.
• e.g., tuples of class buys_computer = no for which the classifier
predicted buys_computer = yes. Let FP be the number of false positives.
• False negatives (FN): These are the positive tuples that were mislabeled
as negative.
• e.g., tuples of class buys_computer = yes for which the classifier predicted buys_computer = no. Let FN be the number of false negatives.
Metrics for Evaluating Classifier Performance
• The confusion matrix is a useful tool for analyzing how well your classifier can
recognize tuples of different classes.
• TP and TN tell us when the classifier is getting things right, while FP and FN tell us when it is getting things wrong (i.e., mislabeling).
• Given m classes (where m ≥ 2), a confusion matrix is a table of at least size m by m. An entry CM(i, j) in the first m rows and m columns indicates the number of tuples of class i that were labeled by the classifier as class j.
• Good accuracy: ideally, most of the tuples would be represented along the diagonal of the confusion matrix, from entry CM(1, 1) to CM(m, m), with the rest of the entries being zero or close to zero. That is, ideally, FP and FN are around zero.
Metrics for Evaluating Classifier Performance
• The table may have additional rows or columns to provide totals.
• For example, the confusion matrix may include the totals P (the number of positive tuples) and N (the number of negative tuples), as shown in the table.
• In addition, P’ is the number of tuples that were labeled as positive (TP
+ FP).
• N’ is the number of tuples that were labeled as negative (TN + FN).
• The total number of tuples is TP + TN + FP + FN, or P + N, or P’ + N’.
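As an illustrative sketch (the labels and predictions below are made up, not from the slides), these four counts can be read directly off a confusion matrix computed with scikit-learn:

```python
# Confusion-matrix sketch with made-up labels; the positive class is 'yes'.
from sklearn.metrics import confusion_matrix

y_true = ['yes', 'yes', 'yes', 'no', 'no', 'no', 'no', 'yes']   # actual classes
y_pred = ['yes', 'no',  'yes', 'no', 'yes', 'no', 'no', 'yes']  # classifier output

# With labels=['yes', 'no'], rows are actual classes and columns are predictions:
# [[TP, FN],
#  [FP, TN]]
cm = confusion_matrix(y_true, y_pred, labels=['yes', 'no'])
tp, fn, fp, tn = cm.ravel()
print(cm)
print('TP =', tp, 'FN =', fn, 'FP =', fp, 'TN =', tn)
```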
Metrics for Evaluating Classifier Performance
• The accuracy (or recognition) of a classifier on a given test set is the
percentage of test set tuples that are correctly classified by the
classifier.
• Accuracy reflects how well the classifier recognizes tuples of the various
classes.
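In terms of the confusion-matrix counts defined earlier (the slide does not show the formula explicitly), accuracy(M) = (TP + TN) / (P + N). For instance, with the illustrative counts TP = 3, TN = 3, FP = 1, FN = 1 from the sketch above, accuracy = 6/8 = 75%.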
Metrics for Evaluating Classifier Performance
• The error rate (or misclassification rate) of a classifier M is simply 1 − accuracy(M), where accuracy(M) is the accuracy of M.
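Equivalently (again using the confusion-matrix counts, which the slide leaves implicit), error rate(M) = (FP + FN) / (P + N); with the illustrative counts above, error rate = 2/8 = 25%.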
Metrics for Evaluating Classifier Performance
Class Imbalance Problem:
• This arises when the main class of interest is rare. That is, the data set distribution reflects a significant majority of the negative class and a minority positive class.
• For example, in fraud detection applications, the class of interest (or positive class) is “fraud,” which occurs much less frequently than the negative “nonfraudulent” class.
• In medical data, there may be a rare class, such as “cancer.” Suppose that
you have trained a classifier to classify medical data tuples, where the class
label attribute is “cancer” and the possible class values are “yes” and “no.”
An accuracy rate of, say, 97% may make the classifier seem quite accurate.
• But what if only, say, 3% of the training tuples are actually cancer? Clearly,
an accuracy rate of 97% may not be acceptable—the classifier could be
correctly labeling only the noncancer tuples, for instance, and
misclassifying all the cancer tuples.
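To make these percentages concrete with assumed counts (the slide gives only the percentages): suppose the test set contains 1000 tuples, of which 30 (3%) have cancer = yes. A classifier that labels every tuple as cancer = no is correct on the 970 noncancer tuples, giving an accuracy of 970/1000 = 97%, yet it misclassifies all 30 cancer tuples.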
Metrics for Evaluating Classifier Performance
The sensitivity and specificity measures can be used to assess classifier performance in the presence of the class imbalance problem.
• Sensitivity is also referred to as the true positive (recognition) rate, i.e., the proportion of positive tuples that are correctly identified.
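In terms of the confusion-matrix counts (the formula is not shown on the slide), sensitivity = TP / P, i.e., the number of correctly labeled positive tuples divided by the total number of positive tuples.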