Lec5 Classification



Intro to classification

CSCI-P 556
ZORAN TIGANJ
Reminders/Announcements

• Don’t forget the quiz deadline today.


Today: Intro to classification

• After the regression example, we will now cover a classification example.

• We will use the MNIST dataset, a set of 70,000 small images of digits handwritten by high school students and employees of the US Census Bureau. Each image is labeled with the digit it represents.

MNIST dataset
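
A minimal loading sketch, assuming scikit-learn's fetch_openml and the conventional 60,000/10,000 train/test split (the split itself is not shown on the slide):

from sklearn.datasets import fetch_openml

# Download MNIST from OpenML: 70,000 images, each 28x28 = 784 pixel values
mnist = fetch_openml('mnist_784', version=1, as_frame=False)
X, y = mnist["data"], mnist["target"]  # y holds the digit labels as strings ('0'..'9')

# Conventional split: first 60,000 images for training, last 10,000 for testing
X_train, X_test, y_train, y_test = X[:60000], X[60000:], y[:60000], y[60000:]
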
Examples of MNIST digits

Training a Binary Classifier

• Let’s simplify the problem for now and only try to identify one digit, for example the number 5.
• This “5-detector” will be an example of a binary classifier, capable of distinguishing between just two classes: 5 and not-5.
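
A sketch of the 5-detector, assuming the X_train/y_train arrays from the loading sketch above and an SGDClassifier (a linear model trained with stochastic gradient descent; the random_state value is arbitrary):

from sklearn.linear_model import SGDClassifier

# Binary labels: True for every 5, False for every other digit
y_train_5 = (y_train == '5')
y_test_5 = (y_test == '5')

# Train the binary "5 vs not-5" classifier
sgd_clf = SGDClassifier(random_state=42)
sgd_clf.fit(X_train, y_train_5)

sgd_clf.predict([X_train[0]])  # array([ True]) if it thinks the first image is a 5
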
Performance Measures

• Let’s do cross-validation.

cross_val_score performs K-fold cross-validation, which means splitting the training set into K folds (in this case, three), then making predictions and evaluating them on each fold using a model trained on the remaining folds.

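A sketch of that evaluation, reusing the sgd_clf 5-detector from the previous sketch:

from sklearn.model_selection import cross_val_score

# Three-fold cross-validation, scoring each held-out fold by accuracy;
# returns an array with one accuracy value per fold
cross_val_score(sgd_clf, X_train, y_train_5, cv=3, scoring="accuracy")
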
Performance Measures

• Well, before you get too excited, let’s look at a very dumb classifier that just classifies every single image in the “not-5” class.

• Accuracy is not a good measure here due to the skewness in the data: this demonstrates why accuracy is generally not the preferred performance measure for classifiers, especially when you are dealing with skewed datasets (i.e., when some classes are much more frequent than others).

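One way to build that baseline, sketched below (the class name Never5Classifier is just illustrative):

import numpy as np
from sklearn.base import BaseEstimator
from sklearn.model_selection import cross_val_score

class Never5Classifier(BaseEstimator):
    """Predicts 'not 5' for every single image."""
    def fit(self, X, y=None):
        return self
    def predict(self, X):
        return np.zeros((len(X),), dtype=bool)

# About 90% of the training images are not 5s, so this useless model
# still scores roughly 90% accuracy under 3-fold cross-validation
cross_val_score(Never5Classifier(), X_train, y_train_5, cv=3, scoring="accuracy")
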
Confusion Matrix

• A much better way to evaluate the performance of a classifier is to look at the confusion matrix.
• The general idea is to count the number of times instances of class A are classified as class B.
• For example, to know the number of times the classifier confused images of 5s with 3s, you would look in the fifth row and third column of the confusion matrix.

Confusion Matrix

• Each row in a confusion matrix represents an actual class, while each column represents a predicted class.
• The first row of this matrix considers non-5 images (the negative class): 53,057 of them were correctly classified as non-5s (they are called true negatives), while the remaining 1,522 were wrongly classified as 5s (false positives).
• The second row considers the images of 5s (the positive class): 1,325 were wrongly classified as non-5s (false negatives), while the remaining 4,096 were correctly classified as 5s (true positives).

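A sketch of how such a matrix can be computed (the exact counts quoted above come from one particular training run, so your numbers may differ):

from sklearn.model_selection import cross_val_predict
from sklearn.metrics import confusion_matrix

# Out-of-fold predictions: each training instance is predicted by a model
# that did not see it during training
y_train_pred = cross_val_predict(sgd_clf, X_train, y_train_5, cv=3)

confusion_matrix(y_train_5, y_train_pred)
# Layout:  [[true negatives,  false positives],
#           [false negatives, true positives ]]
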
Precision of the classifier

• The confusion matrix gives you a lot of information, but sometimes you may prefer a more concise metric.
• An interesting one to look at is the accuracy of the positive predictions; this is called the precision of the classifier:

precision = TP / (TP + FP)

• TP is the number of true positives,
• FP is the number of false positives.

Sensitivity

• Precision is typically used along with another metric named recall, also called sensitivity or the true positive rate (TPR): the ratio of positive instances that are correctly detected by the classifier:

recall = TP / (TP + FN)

• FN is the number of false negatives.

Confusion matrix - illustration

Precision and recall of 5-detector

• Scikit-Learn provides several functions to compute classifier metrics, including precision and recall:

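A sketch of those two calls, reusing the out-of-fold predictions y_train_pred from the confusion-matrix sketch:

from sklearn.metrics import precision_score, recall_score

precision_score(y_train_5, y_train_pred)  # TP / (TP + FP)
recall_score(y_train_5, y_train_pred)     # TP / (TP + FN)
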
F1 score

• It is often convenient to combine precision and recall into a single metric called the F1 score, in particular if you need a simple way to compare two classifiers.
• The F1 score is the harmonic mean of precision and recall:

F1 = 2 * precision * recall / (precision + recall)

• Whereas the regular mean treats all values equally, the harmonic mean gives much more weight to low values.
• As a result, the classifier will only get a high F1 score if both recall and precision are high.

F1 score of 5-detector
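
A sketch of the corresponding scikit-learn call, again using y_train_pred:

from sklearn.metrics import f1_score

# Harmonic mean of precision and recall for the 5-detector
f1_score(y_train_5, y_train_pred)
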
Precision/Recall Trade-off

• The F1 score favors classifiers that have similar precision and recall.
• This is not always what you want: in some contexts you mostly care about precision, and in other contexts you really care about recall.
• For example, if you trained a classifier to detect videos that are safe for kids, you would probably prefer a classifier that rejects many good videos (low recall) but keeps only safe ones (high precision).
• On the other hand, suppose you train a classifier to detect shoplifters in surveillance images: it is probably fine if your classifier has only 30% precision as long as it has 99% recall (sure, the security guards will get a few false alerts, but almost all shoplifters will get caught).
• Unfortunately, you can’t have it both ways: increasing precision reduces recall, and vice versa. This is called the precision/recall trade-off.

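A sketch of how the trade-off can be explored with decision scores and precision_recall_curve, reusing the sgd_clf 5-detector (the 0.90 precision target below is just an illustrative choice):

import numpy as np
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import precision_recall_curve

# Decision scores instead of hard predictions: higher score = more "5-like"
y_scores = cross_val_predict(sgd_clf, X_train, y_train_5, cv=3,
                             method="decision_function")

# Precision and recall for every possible decision threshold
precisions, recalls, thresholds = precision_recall_curve(y_train_5, y_scores)

# Raising the threshold increases precision but lowers recall (and vice versa);
# here we pick the lowest threshold that reaches at least 90% precision
threshold_90 = thresholds[np.argmax(precisions >= 0.90)]
y_train_pred_90 = (y_scores >= threshold_90)
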
The receiver operating characteristic (ROC) curve

• The receiver operating characteristic (ROC) curve is another common tool used with binary classifiers.
• It is very similar to the precision/recall curve, but instead of plotting precision versus recall, the ROC curve plots the true positive rate (another name for recall) against the false positive rate (FPR).
• The FPR is the ratio of negative instances that are incorrectly classified as positive. It is equal to 1 – TNR, where the true negative rate (TNR) is the ratio of negative instances that are correctly classified as negative.
• The TNR is also called specificity. Hence, the ROC curve plots sensitivity (recall) versus 1 – specificity.

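A sketch of plotting the ROC curve from the same decision scores (y_scores from the trade-off sketch above):

import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve

# FPR and TPR for every possible decision threshold
fpr, tpr, thresholds = roc_curve(y_train_5, y_scores)

plt.plot(fpr, tpr, label="SGD 5-detector")
plt.plot([0, 1], [0, 1], 'k--', label="purely random classifier")
plt.xlabel("False positive rate (1 - specificity)")
plt.ylabel("True positive rate (recall / sensitivity)")
plt.legend()
plt.show()
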
Area under the curve (AUC)

• One way to compare classifiers is to measure the area under the curve (AUC).
• A perfect classifier will have a ROC AUC equal to 1, whereas a purely random classifier will have a ROC AUC equal to 0.5.
• Scikit-Learn provides a function to compute the ROC AUC:

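A minimal sketch using roc_auc_score with the decision scores from above:

from sklearn.metrics import roc_auc_score

# 1.0 for a perfect classifier, 0.5 for a purely random one
roc_auc_score(y_train_5, y_scores)
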
Comparing different classifiers using ROC

OvR vs OvO

u Some algorithms (such as SGD classifiers, Random Forest classifiers, and


naive Bayes classifiers) are capable of handling multiple classes natively.
u Others (such as Logistic Regression or Support Vector Machine classifiers)
are strictly binary classifiers. However, there are various strategies that you
can use to perform multiclass classification with multiple binary classifiers.
u One way to create a system that can classify the digit images into 10 classes
(from 0 to 9) is to train 10 binary classifiers, one for each digit (a 0-detector, a 1-
detector, a 2-detector, and so on). This is called the one-versus-the-rest (OvR)
strategy (also called one-versus-all ).
26
OvR vs OvO

u Some algorithms (such as SGD classifiers, Random Forest classifiers, and


naive Bayes classifiers) are capable of handling multiple classes natively.
u Others (such as Logistic Regression or Support Vector Machine classifiers)
are strictly binary classifiers. However, there are various strategies that you
can use to perform multiclass classification with multiple binary classifiers.
u Another strategy is to train a binary classifier for every pair of digits: one to
distinguish 0s and 1s, another to distinguish 0s and 2s, another for 1s and 2s, and
so on. This is called the one-versus-one (OvO) strategy.
u The main advantage of OvO is that each classifier only needs to be trained on
the part of the training set for the two classes that it must distinguish.
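A sketch of forcing either strategy explicitly with scikit-learn's wrapper classes (wrapping an SVC here is just an illustrative choice; the full multiclass labels y_train are used, not the binary y_train_5):

from sklearn.multiclass import OneVsRestClassifier, OneVsOneClassifier
from sklearn.svm import SVC

# OvR: 10 binary classifiers, one per digit; the highest score wins
ovr_clf = OneVsRestClassifier(SVC())

# OvO: one binary classifier per pair of digits, 10 * 9 / 2 = 45 in total
ovo_clf = OneVsOneClassifier(SVC())

# Training on a subset keeps this sketch fast; SVC scales poorly with dataset size
ovo_clf.fit(X_train[:1000], y_train[:1000])
ovo_clf.predict([X_train[0]])
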
Multilabel Classification

• Consider a face-recognition classifier: what should it do if it recognizes several people in the same picture?
• It should attach one tag per person it recognizes. Say the classifier has been trained to recognize three faces: Alice, Bob, and Charlie.
• Then when the classifier is shown a picture of Alice and Charlie, it should output [1, 0, 1] (meaning “Alice yes, Bob no, Charlie yes”).
• Such a classification system that outputs multiple binary tags is called a multilabel classification system.

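A minimal multilabel sketch on MNIST rather than faces (the two tags here, "digit is 7 or larger" and "digit is odd", are purely illustrative; KNeighborsClassifier supports multilabel targets directly):

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Two binary tags per image: [is the digit >= 7, is the digit odd]
y_train_int = y_train.astype(np.uint8)
y_large = (y_train_int >= 7)
y_odd = (y_train_int % 2 == 1)
y_multilabel = np.c_[y_large, y_odd]

knn_clf = KNeighborsClassifier()
knn_clf.fit(X_train, y_multilabel)
knn_clf.predict([X_train[0]])  # e.g. array([[False,  True]]) for an image of a 5
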
Next time

• Linear regression, from Chapter 4 of the Hands-On Machine Learning textbook.
