
Module-5 (Classification Assessment)

Classification Performance measures - Precision, Recall, Accuracy, F-Measure, Receiver Operating Characteristic Curve (ROC), Area Under Curve (AUC). Bootstrapping, Cross Validation, Ensemble methods, Bias-Variance decomposition. Case Study: Develop a classifier for face detection.

CLASSIFICATION PERFORMANCE MEASURES

Let D be the testing set comprising n points in a d-dimensional space, let {C1, C2, ..., Ck} denote the set of k class labels, and let M be a classifier. For xi ∈ D, let yi denote its true class, and let ŷi = M(xi) denote its predicted class.

Error Rate

The error rate is the fraction of incorrect predictions made by the classifier over the testing set, defined as

Error Rate = (1/n) · Σ_{i=1..n} I(ŷi ≠ yi)

where I is an indicator function that has the value 1 when its argument is true, and 0 otherwise. The error rate is an estimate of the probability of misclassification. The lower the error rate, the better the classifier.

Accuracy
The accuracy of a classifier is the fraction of correct predictions over the testing set:

Accuracy = (1/n) · Σ_{i=1..n} I(ŷi = yi) = 1 − Error Rate

Accuracy gives an estimate of the probability of a correct prediction; thus, the higher the accuracy, the better the classifier.
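Both quantities follow directly from the indicator comparison. A minimal Python sketch (the function names and sample labels here are illustrative, not from the text):

```python
def error_rate(y_true, y_pred):
    # fraction of test points whose predicted class differs from the true class
    return sum(yt != yp for yt, yp in zip(y_true, y_pred)) / len(y_true)

def accuracy(y_true, y_pred):
    # fraction of correct predictions = 1 - error rate
    return 1.0 - error_rate(y_true, y_pred)

print(error_rate([1, 0, 1, 1], [1, 1, 1, 0]), accuracy([1, 0, 1, 1], [1, 1, 1, 0]))  # 0.5 0.5
```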

Contingency Table based Measures

Let D = {D1, D2, ..., Dk} denote a partitioning of the testing points based on their true class labels, where Di = {xj ∈ D | yj = ci} and ni = |Di| is the number of points in class ci. Likewise, let R = {R1, R2, ..., Rk} denote the partitioning of the testing points based on the predicted class labels, where Ri = {xj ∈ D | ŷj = ci} and mi = |Ri| is the number of points predicted to be in class ci.
Accuracy/Precision
The class-specific accuracy or precision of the classifier M for class ci is given as the fraction of correct predictions over all points predicted to be in class ci:

prec_i = |Ri ∩ Di| / mi

where mi is the number of examples predicted as ci by classifier M. The higher the accuracy on class ci, the better the classifier. The overall precision or accuracy of the classifier is the weighted average of the class-specific accuracies:

Precision = Σ_{i=1..k} (mi / n) · prec_i

Coverage/Recall
The class-specific coverage or recall of M for class ci is the fraction of correct predictions over all points in class ci:

recall_i = |Ri ∩ Di| / ni

where ni is the number of points in class ci. The higher the coverage, the better the classifier.

F-measure

Often there is a trade-off between the precision and recall of a classifier. For example, it is easy to make recall_i = 1 by predicting all testing points to be in class ci; however, in this case prec_i will be low. On the other hand, we can make prec_i very high by predicting only a few points as ci, for instance, those predictions where M has the most confidence, but in this case recall_i will be low. Ideally, we would like both precision and recall to be high.

The class-specific F-measure tries to balance the precision and recall values by computing their harmonic mean for class ci:

F_i = (2 · prec_i · recall_i) / (prec_i + recall_i)

The higher the F_i value, the better the classifier.

The overall F-measure for the classifier M is the mean of the class-specific values:

F = (1/k) · Σ_{i=1..k} F_i

For a perfect classifier, the F-measure attains its maximum value of 1.
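The per-class computations above can be sketched in a few lines of Python. This is a minimal illustration (the function name per_class_metrics and the sample labels are ours, not from the text), counting |Ri ∩ Di|, mi and ni directly from the label lists:

```python
from collections import Counter

def per_class_metrics(y_true, y_pred, classes):
    """Sketch: per-class precision, recall and F-measure, plus the overall F-measure."""
    counts = Counter(zip(y_true, y_pred))              # counts of (true, predicted) pairs
    metrics = {}
    for c in classes:
        n_i = sum(1 for y in y_true if y == c)         # points truly in class c
        m_i = sum(1 for y in y_pred if y == c)         # points predicted as class c
        correct = counts[(c, c)]                       # |R_i ∩ D_i|
        prec = correct / m_i if m_i else 0.0
        rec = correct / n_i if n_i else 0.0
        f = 2 * prec * rec / (prec + rec) if (prec + rec) else 0.0
        metrics[c] = (prec, rec, f)
    overall_f = sum(f for _, _, f in metrics.values()) / len(classes)
    return metrics, overall_f

# Illustrative true and predicted labels
y_true = ["a", "a", "b", "b", "b", "c"]
y_pred = ["a", "b", "b", "b", "c", "c"]
print(per_class_metrics(y_true, y_pred, ["a", "b", "c"]))
```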

Binary Classification: Positive and Negative Class

When there are only k = 2 classes, we call class C1 the positive class and C2 the negative class. Consider the 2×2 confusion matrix for the two classes:

                          Predicted positive (C1)    Predicted negative (C2)
True class positive (C1)  True Positives (TP)        False Negatives (FN)
True class negative (C2)  False Positives (FP)       True Negatives (TN)

The entries of this confusion matrix are given special names, as follows:
• True Positives (TP): The number of points that the classifier correctly predicts as positive:
  TP = |{xi | ŷi = C1 and yi = C1}|

• False Positives (FP): The number of points the classifier predicts to be positive, which in fact belong to the negative class:
  FP = |{xi | ŷi = C1 and yi = C2}|

• False Negatives (FN): The number of points the classifier predicts to be in the negative class, which in fact belong to the positive class:
  FN = |{xi | ŷi = C2 and yi = C1}|

• True Negatives (TN): The number of points that the classifier correctly predicts as negative:
  TN = |{xi | ŷi = C2 and yi = C2}|

➢ ERROR RATE: The error rate for the binary classification case is given as the fraction of mistakes (or false predictions):

   Error Rate = (FP + FN) / n

➢ ACCURACY: The accuracy is the fraction of correct predictions:

   Accuracy = (TP + TN) / n

➢ CLASS-SPECIFIC PRECISION: The precision for the positive and the negative class is given as

   prec_P = TP / (TP + FP) = TP / m1        prec_N = TN / (TN + FN) = TN / m2

   where mi = |Ri| is the number of points predicted by M as having class ci.

➢ Sensitivity (True Positive Rate): The true positive rate, also called sensitivity, is the fraction of correct predictions with respect to all points in the positive class, that is, it is simply the recall for the positive class:

   TPR = sensitivity = TP / (TP + FN) = TP / n1

   where n1 is the size of the positive class.

➢ Specificity (True Negative Rate): The true negative rate, also called specificity, is simply the recall for the negative class:

   TNR = specificity = TN / (TN + FP) = TN / n2

   where n2 is the size of the negative class.

➢ False Negative Rate: The false negative rate is defined as

   FNR = FN / (TP + FN) = FN / n1 = 1 − sensitivity

➢ False Positive Rate: The false positive rate is defined as

   FPR = FP / (FP + TN) = FP / n2 = 1 − specificity

EXAMPLE:
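As an illustration with hypothetical numbers: suppose a test set of n = 100 points with TP = 40, FN = 10, FP = 5 and TN = 45, so that n1 = 50 and n2 = 50. Then accuracy = (40 + 45)/100 = 0.85, error rate = (5 + 10)/100 = 0.15, precision for the positive class = 40/45 ≈ 0.89, sensitivity (TPR) = 40/50 = 0.80, specificity (TNR) = 45/50 = 0.90, FPR = 5/50 = 0.10 and FNR = 10/50 = 0.20.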

ROC (Receiver Operating Characteristic)

• ROC curves are a useful visual tool for comparing two classification models.
• The name ROC stands for Receiver Operating Characteristic.
• ROC curves come from signal detection theory that was developed during World War II
for the analysis of radar images.
• An ROC curve shows the trade-off between the true positive rate or sensitivity (proportion
of positive tuples that are correctly identified) and the false-positive rate (proportion of
negative tuples that are incorrectly identified as positive) for a given model.
o That is, given a two-class problem, it allows us to visualize the trade-off between
the rate at which the model can accurately recognize ‘yes’ cases versus the rate at
which it mistakenly identifies ‘no’ cases as ‘yes’ for different “portions” of the test
set.
• Any increase in the true positive rate occurs at the cost of an increase in the false-positive
rate.
• The area under the ROC curve is a measure of the accuracy of the model.
• In order to plot an ROC curve for a given classification model, M, the model must be able
to return a probability or ranking for the predicted class of each test tuple. That is, we need
to rank the test tuples in decreasing order, where the one the classifier thinks is most likely
to belong to the positive or ‘yes’ class appears at the top of the list.
• The vertical axis of an ROC curve represents the true positive rate.
• The horizontal axis represents the false-positive rate.
• An ROC curve for M is plotted as follows.
o Starting at the bottom left-hand corner (where the true positive rate and false-
positive rate are both 0), we check the actual class label of the tuple at the top of the
list.
o If we have a true positive (that is, a positive tuple that was correctly classified),
then on the ROC curve, we move up and plot a point.
o If, instead, the tuple really belongs to the ‘no’ class, we have a false positive.
o On the ROC curve, we move right and plot a point.
o This process is repeated for each of the test tuples, each time moving up on the
curve for a true positive or toward the right for a false positive.
• The figure below shows the ROC curves of two classification models. The plot also shows a diagonal line where, for every true positive of such a model, we are just as likely to encounter a false positive. Thus, the closer the ROC curve of a model is to the diagonal line, the less accurate the model. If the model is really good, initially we are more likely to encounter true positives as we move down the ranked list, so the curve rises steeply from zero. Later, as we start to encounter fewer and fewer true positives and more and more false positives, the curve eases off and becomes more horizontal.
• To assess the accuracy of a model, we can measure the area under the curve. Several software packages are able to perform this calculation. The closer the area is to 0.5, the less accurate the corresponding model is; a model with perfect accuracy will have an area of 1.0.
Area Under ROC Curve (AUC)

The area under the ROC curve (AUC) can be used as a measure of classifier performance. Because the total area of the plot is 1, the AUC lies in the interval [0, 1] – the higher, the better. The AUC value is essentially the probability that the classifier will rank a random positive test instance higher than a random negative test instance.
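This ranking interpretation can be checked directly with a short sketch (the function name and scores below are illustrative): the AUC equals the fraction of (positive, negative) pairs in which the positive instance receives the higher score, counting ties as one half.

```python
def auc_by_ranking(scores_pos, scores_neg):
    """Sketch: AUC as the probability that a random positive outranks a random negative."""
    wins = 0.0
    for sp in scores_pos:
        for sn in scores_neg:
            if sp > sn:
                wins += 1.0       # positive ranked above negative
            elif sp == sn:
                wins += 0.5       # ties count as half
    return wins / (len(scores_pos) * len(scores_neg))

# Illustrative scores (higher score = more confidence in the positive class)
print(auc_by_ranking([0.9, 0.8, 0.6], [0.7, 0.4, 0.3]))  # 8/9 ≈ 0.89
```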

ROC/AUC ALGORITHM

The algorithm takes as input the testing set D and the classifier M. The first step is to predict the score S(xi) for each test point xi ∈ D. Next, we sort the (S(xi), yi) pairs, that is, the score and true class pairs, in decreasing order of the scores (line 3).

Initially, we set the positive score threshold ρ = ∞ (line 7). The foreach loop (line 8) examines each pair (S(xi), yi) in sorted order, and for each distinct value of the score it sets ρ = S(xi) and plots the point

(FPR, TPR) = (FP/n2, TP/n1)

As each test point is examined, the true and false positive counts are adjusted based on the true class yi of the test point xi. If yi = c1, we increment the true positives; otherwise, we increment the false positives (lines 15-16). At the end of the foreach loop we plot the final point of the ROC curve (line 17).

The AUC value is computed as each new point is added to the ROC plot. The algorithm maintains the previous values of the false and true positive counts, FPprev and TPprev, for the previous score threshold ρ. Given the current FP and TP values, we compute the area under the curve defined by the four points

(x1, 0), (x2, 0), (x1, y1), (x2, y2), where x1 = FPprev/n2, y1 = TPprev/n1, x2 = FP/n2 and y2 = TP/n1.

These four points define a trapezoid whenever x2 > x1 and y2 > y1; otherwise, they define a rectangle (which may be degenerate, with zero area). The function TRAPEZOID-AREA computes the area of the trapezoid, which is given as b · h, where b = |x2 − x1| is the length of the base and h = (y1 + y2)/2 is the average height of the trapezoid.
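A minimal Python sketch of this sweep, assuming a score S(xi) is available for every test point (the function and variable names are ours): it sorts the points by decreasing score, emits one ROC point per distinct threshold, and accumulates the AUC with the trapezoid rule described above.

```python
def roc_and_auc(scores, labels, positive=1):
    """Sketch: trace the ROC curve and accumulate the AUC in one sweep over score thresholds."""
    n1 = sum(1 for y in labels if y == positive)   # size of the positive class (assumed > 0)
    n2 = len(labels) - n1                          # size of the negative class (assumed > 0)
    pairs = sorted(zip(scores, labels), key=lambda p: p[0], reverse=True)

    tp = fp = tp_prev = fp_prev = 0
    rho = float("inf")                             # current positive score threshold
    roc, auc = [], 0.0

    for s, y in pairs:
        if s < rho:                                # a new distinct score value
            roc.append((fp / n2, tp / n1))         # plot the point (FPR, TPR) for threshold rho
            auc += abs(fp - fp_prev) / n2 * (tp + tp_prev) / (2 * n1)   # trapezoid area
            rho, fp_prev, tp_prev = s, fp, tp
        if y == positive:
            tp += 1                                # true positive
        else:
            fp += 1                                # false positive

    roc.append((1.0, 1.0))                         # final point: every test point predicted positive
    auc += abs(n2 - fp_prev) / n2 * (n1 + tp_prev) / (2 * n1)
    return roc, auc

# Illustrative scores and true labels (1 = positive, 0 = negative); the AUC here is 0.75
print(roc_and_auc([0.9, 0.8, 0.7, 0.6], [1, 0, 1, 0]))
```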

Evaluating the Accuracy of a Classifier or Predictor

Different methods to evaluate accuracy based on randomly sampled partitions of the given data
are:
• Holdout Method
• Random subsampling
• Cross validation
• Bootstrap method

Holdout Method and Random Subsampling

In the holdout method, the given data are randomly partitioned into two independent sets, a
training set and a test set. Typically, two-thirds of the data are allocated to the training set, and
the remaining one-third is allocated to the test set. The training set is used to derive the model,
whose accuracy is estimated with the test set. The estimate is pessimistic because only a portion
of the initial data is used to derive the model.

Random subsampling is a variation of the holdout method in which the holdout method is
repeated k times. The overall accuracy estimate is taken as the average of the accuracies
obtained from each iteration.
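A minimal sketch of the holdout method and its random subsampling variant, assuming scikit-learn is available; the data set, estimator and split fraction are illustrative choices, not prescribed by the text.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Holdout: roughly two-thirds of the data for training, one-third for testing
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=1/3, random_state=0)
model = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
print("holdout accuracy:", model.score(X_te, y_te))

# Random subsampling: repeat the holdout k times and average the accuracy estimates
k, accs = 10, []
for i in range(k):
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=1/3, random_state=i)
    accs.append(DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr).score(X_te, y_te))
print("random subsampling accuracy:", sum(accs) / k)
```
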
Cross-validation
In k-fold cross-validation, the initial data are randomly partitioned into k mutually exclusive
subsets or “folds,” D1, D2,..., Dk, each of approximately equal size. Training and testing is
performed k times. In iteration i, partition Di is reserved as the test set, and the remaining
partitions are collectively used to train the model. That is, in the first iteration, subsets D2,...,
Dk collectively serve as the training set in order to obtain a first model, which is tested on D1;
the second iteration is trained on subsets D1, D3,..., Dk and tested on D2; and so on.

Unlike the holdout and random subsampling methods above, here, each sample is used the
same number of times for training and once for testing. For classification, the accuracy estimate
is the overall number of correct classifications from the k iterations, divided by the total number
of tuples in the initial data. For prediction, the error estimate can be computed as the total loss
from the k iterations, divided by the total number of initial tuples.

Leave-one-out is a special case of k-fold cross-validation where k is set to the number of initial
tuples. That is, only one sample is “left out” at a time for the test set.
In stratified cross-validation, the folds are stratified so that the class distribution of the tuples
in each fold is approximately the same as that in the initial data. In general, stratified 10-fold
cross-validation is recommended for estimating accuracy (even if computation power allows
using more folds) due to its relatively low bias and variance.
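A sketch of stratified 10-fold cross-validation under the same assumptions (scikit-learn available, illustrative data set and estimator); the accuracy is the total number of correct classifications over the k folds divided by the number of tuples.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

correct = 0
for train_idx, test_idx in skf.split(X, y):
    # Fold i is held out for testing; the remaining folds collectively train the model
    model = DecisionTreeClassifier(random_state=0).fit(X[train_idx], y[train_idx])
    correct += (model.predict(X[test_idx]) == y[test_idx]).sum()

print("10-fold CV accuracy:", correct / len(y))
```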

Bootstrap

The bootstrap method samples the given training tuples uniformly with replacement. That is,
each time a tuple is selected, it is equally likely to be selected again and readded to the training
set. For instance, imagine a machine that randomly selects tuples for our training set. In
sampling with replacement, the machine is allowed to select the same tuple more than once.
There are several bootstrap methods.

A commonly used one is the .632 bootstrap, which works as follows. Suppose we are given a
data set of d tuples. The data set is sampled d times, with replacement, resulting in a bootstrap
sample or training set of d samples. It is very likely that some of the original data tuples will
occur more than once in this sample. The data tuples that did not make it into the training set
end up forming the test set. Suppose we were to try this out several times. As it turns out, on
average, 63.2% of the original data tuples will end up in the bootstrap, and the remaining 36.8%
will form the test set (hence, the name, .632 bootstrap.)
“Where does the figure, 63.2%, come from?” Each tuple has a probability of 1/d of being selected, so the probability of not being chosen is (1 − 1/d). We have to select d times, so the probability that a tuple will not be chosen during this whole time is (1 − 1/d)^d. If d is large, this probability approaches e^−1 ≈ 0.368. Thus, 36.8% of the tuples will not be selected for training and thereby end up in the test set, and the remaining 63.2% will form the training set.
We can repeat the sampling procedure k times, where in each iteration we use the current test set to obtain an accuracy estimate of the model obtained from the current bootstrap sample. The overall accuracy of the model is then estimated as

Acc(M) = (1/k) · Σ_{i=1..k} ( 0.632 · Acc(Mi)_test-set + 0.368 · Acc(Mi)_train-set )

where Acc(Mi)_test-set is the accuracy of the model obtained with bootstrap sample i when it is applied to test set i, and Acc(Mi)_train-set is the accuracy of the model obtained with bootstrap sample i when it is applied to the original set of data tuples. The bootstrap method works well with small data sets.
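A sketch of the .632 bootstrap estimate, assuming numpy and scikit-learn; the number of repetitions k and the estimator are illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
d, k, rng = len(y), 20, np.random.default_rng(0)

acc = 0.0
for _ in range(k):
    boot = rng.integers(0, d, size=d)               # sample d tuples with replacement
    oob = np.setdiff1d(np.arange(d), boot)          # tuples not drawn form the test set
    model = DecisionTreeClassifier(random_state=0).fit(X[boot], y[boot])
    acc_test = model.score(X[oob], y[oob])          # Acc(Mi) on test set i
    acc_train = model.score(X, y)                   # Acc(Mi) on the original data tuples
    acc += 0.632 * acc_test + 0.368 * acc_train

print(".632 bootstrap accuracy:", acc / k)
```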

Ensemble Methods—Increasing the Accuracy


The main general strategies for improving classifier and predictor accuracy are bagging and boosting.

Each of these methods combines a series of k learned models (classifiers or predictors), M1, M2, ..., Mk, with the aim of creating an improved composite model, M∗. Both bagging and boosting can be used for classification as well as prediction.
Bagging
Suppose that you are a patient and would like to have a diagnosis made based on your
symptoms. Instead of asking one doctor, you may choose to ask several. If a certain diagnosis
occurs more than any of the others, you may choose this as the final or best diagnosis. That is,
the final diagnosis is made based on a majority vote, where each doctor gets an equal vote.
Now replace each doctor by a classifier, and you have the basic idea behind bagging.
Intuitively, a majority vote made by a large group of doctors may be more reliable than a
majority vote made by a small group.
Given a set, D, of d tuples, bagging works as follows. For iteration i (i = 1, 2,..., k), a training
set, Di , of d tuples is sampled with replacement from the original set of tuples, D. The term
bagging stands for bootstrap aggregation. Each training set is a bootstrap sample. Because
sampling with replacement is used, some of the original tuples of D may not be included in Di
, whereas others may occur more than once. A classifier model, Mi , is learned for each training
set, Di . To classify an unknown tuple, X, each classifier, Mi , returns its class prediction, which
counts as one vote.
The bagged classifier, M∗, counts the votes and assigns the class with the most votes to X.
Bagging can be applied to the prediction of continuous values by taking the average value of
each prediction for a given test tuple. The bagged classifier often has significantly greater
accuracy than a single classifier derived from D, the original training data. It will not be
considerably worse and is more robust to the effects of noisy data.
The increased accuracy occurs because the composite model reduces the variance of the
individual classifiers. For prediction, it was theoretically proven that a bagged predictor will
always have improved accuracy over a single predictor derived from D.
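The sampling-and-voting procedure can be sketched directly, again assuming numpy and scikit-learn; the base learner and ensemble size are illustrative (scikit-learn's BaggingClassifier packages the same idea).

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=1/3, random_state=0)

k, d, rng = 25, len(y_tr), np.random.default_rng(0)
models = []
for _ in range(k):
    boot = rng.integers(0, d, size=d)                    # bootstrap sample D_i of size d
    models.append(DecisionTreeClassifier(random_state=0).fit(X_tr[boot], y_tr[boot]))

# Each model M_i casts one vote per test tuple; the class with the most votes wins
votes = np.stack([m.predict(X_te) for m in models])      # shape (k, n_test)
majority = np.array([np.bincount(col).argmax() for col in votes.T])
print("bagged accuracy:", (majority == y_te).mean())
```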

Boosting
As in the previous section, suppose that as a patient, you have certain symptoms. Instead of
consulting one doctor, you choose to consult several. Suppose you assign weights to the value
or worth of each doctor’s diagnosis, based on the accuracies of previous diagnoses they have
made. The final diagnosis is then a combination of the weighted diagnoses. This is the essence
behind boosting.
In boosting, weights are assigned to each training tuple. A series of k classifiers is iteratively
learned. After a classifier Mi is learned, the weights are updated to allow the subsequent
classifier, Mi+1, to “pay more attention” to the training tuples that were misclassified by Mi .
The final boosted classifier, M∗, combines the votes of each individual classifier, where the
weight of each classifier’s vote is a function of its accuracy. The boosting algorithm can be
extended for the prediction of continuous values.
AdaBoost is a popular boosting algorithm. Suppose we would like to boost the accuracy of some learning method. We are given D, a data set of d class-labeled tuples, (X1, y1), (X2, y2), ..., (Xd, yd), where yi is the class label of tuple Xi. Initially, AdaBoost assigns each training tuple an equal weight of 1/d. Generating k classifiers for the ensemble requires k rounds through the rest of the algorithm. In round i, the tuples from D are sampled to form a training set, Di, of size d. Sampling with replacement is used, so the same tuple may be selected more than once. Each tuple's chance of being selected is based on its weight.
A classifier model, Mi , is derived from the training tuples of Di . Its error is then calculated
using Di as a test set. The weights of the training tuples are then adjusted according to how
they were classified. If a tuple was incorrectly classified, its weight is increased. If a tuple was
correctly classified, its weight is decreased. A tuple’s weight reflects how hard it is to classify—
the higher the weight, the more often it has been misclassified. These weights will be used to
generate the training samples for the classifier of the next round. The basic idea is that when
we build a classifier, we want it to focus more on the misclassified tuples of the previous round.
Some classifiers may be better at classifying some “hard” tuples than others. In this way, we
build a series of classifiers that complement each other.
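As a usage sketch (assuming scikit-learn; its AdaBoostClassifier applies the AdaBoost weight updates by reweighting the tuples rather than explicitly resampling them, with decision stumps as the default base learner, and the data set here is chosen only for illustration):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=1/3, random_state=0)

# k = 50 rounds of boosting; each round pays more attention to previously misclassified tuples
booster = AdaBoostClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)
print("boosted accuracy:", booster.score(X_te, y_te))
```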
