
Visualizing Model Performance, Evidence and Probabilities
Ranking Instead of Classifying

Evaluating classifiers plays an important role in understanding and applying data science concepts in the business world. A classifier is a tool used in machine learning to group items or data into categories based on features learned from training data.
Class confusion is a situation where the classifier incorrectly identifies a class. For example, when classifying email as "spam" or "non-spam", class confusion occurs when emails that are not actually spam are misidentified as spam, and vice versa.
A confusion matrix is a table that describes the performance of a classifier by separating correct and incorrect decisions.

An n × n matrix with columns labeled with actual classes and rows labeled with
predicted classes.
Different symbols are usually used to distinguish the actual class from the class predicted by the model.
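As a minimal illustration (assuming Python with scikit-learn; the email labels below are made up), a confusion matrix can be computed from actual and predicted classes. Note that scikit-learn places actual classes on the rows and predicted classes on the columns, the transpose of the convention described above:

```python
from sklearn.metrics import confusion_matrix

# Hypothetical actual and predicted labels for a spam classifier
y_actual    = ["spam", "spam", "non-spam", "non-spam", "spam", "non-spam"]
y_predicted = ["spam", "non-spam", "non-spam", "spam", "spam", "non-spam"]

# Rows are actual classes, columns are predicted classes (scikit-learn convention)
cm = confusion_matrix(y_actual, y_predicted, labels=["spam", "non-spam"])
print(cm)
# [[2 1]   <- actually spam: 2 predicted spam, 1 predicted non-spam
#  [1 2]]  <- actually non-spam: 1 predicted spam, 2 predicted non-spam
```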
Unbalanced classes often occur in business applications where classifiers are used to discover unusual entities in large populations. Examples include looking for defrauded customers, checking for defective parts on an assembly line, or targeting consumers who will actually respond to an offer.

The problem arises when a rare class becomes highly unbalanced in the
distribution, for example, only appearing in 1 out of every 1000
examples. In situations like this, the use of accuracy as an evaluation
metric becomes irrelevant. In fact, if we always chose the most common
class (in this case, appearing 999 out of 1000 examples), we would get
an accuracy rate of 99.9%, but this is not useful because the model will
not successfully identify rare classes.
Accuracy is the simplest and most popular measure for
evaluating classifier performance. It measures the proportion
of correct predictions out of all predictions made. Even though
it is easy to understand, accuracy has shortcomings, especially
in unbalanced data (imbalanced datasets) where the number of
examples from one class is much greater than from other
classes.
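In terms of the confusion-matrix counts, this can be written as:

accuracy = (TP + TN) / (TP + TN + FP + FN)

where TP, TN, FP, and FN are the numbers of true positives, true negatives, false positives, and false negatives.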
The use of accuracy as a simple classification evaluation metric
does not differentiate between false positive errors and false
negative errors. Both types of errors are calculated together,
with the implicit assumption that both errors have the same
level of importance. In real-world domains, false positive and false negative errors have different consequences and often have different impacts.

Ideally, we should evaluate the costs or benefits of each decision made by the classifier. By combining all these costs and benefits, we can estimate the expected gain or loss from using the classifier.
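One common way of writing this (a sketch of the usual expected-value framework; b(·) denotes the benefit or cost attached to each cell of the confusion matrix) is:

Expected profit = p(Y,p)·b(Y,p) + p(N,p)·b(N,p) + p(N,n)·b(N,n) + p(Y,n)·b(Y,n)

where p(Y,p) is the estimated probability that a case falls into the (predicted Y, actually p) cell, and so on for the other cells.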
Profit Curves
ROC Graphs and Curves
When the threshold is lowered, cases move from row N to row Y of the confusion matrix: a case previously classified as negative is now classified as positive, so the cell counts change.

Which count changes depends on the original class of the example:
If the case is actually positive (in the "p" column), it moves up and becomes a true positive (Y,p).
If the case is actually negative (n), it becomes a false positive (Y,n).
Technically, each different threshold produces a different classifier, represented by its own confusion matrix.
ROC Graphs and Curves
Generating a ROC curve:
Algorithm
• Sort the test set by the model predictions
• Start with cutoff = max(prediction)
• Decrease the cutoff; after each step, count the number of true positives TP (positives with prediction above the cutoff) and false positives FP (negatives above the cutoff)
• Calculate the TP rate (TP/P) and the FP rate (FP/N)
• Plot the current TP/P as a function of the current FP/N (a sketch of this procedure follows below)
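A minimal Python sketch of this algorithm (the data and variable names are illustrative, not from the session; with real data one would typically use sklearn.metrics.roc_curve):

```python
import numpy as np

def roc_points(labels, scores):
    """Return (FP rate, TP rate) pairs obtained by sweeping the cutoff from high to low."""
    labels, scores = np.asarray(labels), np.asarray(scores)
    P = int((labels == 1).sum())      # total positives
    N = int((labels == 0).sum())      # total negatives

    order = np.argsort(-scores)       # sort the test set by the model predictions
    points = [(0.0, 0.0)]             # cutoff above max(prediction): nothing classified positive
    tp = fp = 0
    for i in order:
        # Lowering the cutoff just below this score moves the case from row N to row Y
        if labels[i] == 1:
            tp += 1                   # a positive moves up: true positive (Y,p)
        else:
            fp += 1                   # a negative moves up: false positive (Y,n)
        points.append((fp / N, tp / P))
    return points

# Illustrative usage with made-up scores
print(roc_points([1, 0, 1, 0, 0], [0.9, 0.8, 0.6, 0.4, 0.2]))
```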


ROC Graphs and Curves
• ROC graphs decouple classifier performance from the conditions under which the classifiers will be used
• ROC graphs are independent of the class proportions as well as of the costs and benefits
• Not the most intuitive visualization for many business stakeholders

Area Under the ROC Curve (AUC)
• The area under a classifier's curve, expressed as a fraction of the unit square
• Its value ranges from zero to one
• The AUC is useful when a single number is needed to summarize performance, or when nothing is known about the operating conditions
• A ROC curve provides more information than its area
• Equivalent to the Mann-Whitney-Wilcoxon measure
• Also equivalent to the Gini coefficient (with a minor algebraic transformation)
• Both are equivalent to the probability that a randomly chosen positive instance will be ranked ahead of a randomly chosen negative instance
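As an illustration of that last equivalence, the AUC can be computed directly from the ranking interpretation (a sketch with made-up scores; with real data one would typically call sklearn.metrics.roc_auc_score):

```python
import itertools

def auc_by_ranking(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.
    Ties count as one half, following the Mann-Whitney-Wilcoxon statistic."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p, n in itertools.product(pos, neg))
    return wins / (len(pos) * len(neg))

print(auc_by_ranking([1, 0, 1, 0, 0], [0.9, 0.8, 0.6, 0.4, 0.2]))  # 5/6 ≈ 0.833
```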
Cumulative Response Curve
Lift Curve
Let's focus back in on actually mining the data…

Which model should TelCo select in order to target customers with a special offer, prior to contract expiration?
Performance Evaluation

Training Set:
Model                   Accuracy
Classification Tree     95%
Logistic Regression     93%
𝑘-Nearest Neighbors     100%
Naïve Bayes             76%

Test Set:
Model                   Accuracy       AUC
Classification Tree     91.8% ± 0.0    0.614 ± 0.014
Logistic Regression     93.0% ± 0.1    0.574 ± 0.023
𝑘-Nearest Neighbors     93.0% ± 0.0    0.537 ± 0.015
Naïve Bayes             76.5% ± 0.6    0.632 ± 0.019
Performance Evaluation

Naïve Bayes confusion matrix:
        p            n
Y    127 (3%)     848 (18%)
N    200 (4%)     3518 (75%)

𝑘-Nearest Neighbors confusion matrix:
        p            n
Y      3 (0%)       15 (0%)
N    324 (7%)     4351 (93%)
ROC Curve
Lift Curve
Profit Curves
Agenda 2
• Introduction
• Bayes' Rule
• Applying Bayes' rule to data science
• Naive Bayes
• Advantages and Disadvantages of Naive Bayes
• Example
Introductory example
• So far: using data to draw conclusions about some unknown quantity of a data instance
• Now: analyse data instances as evidence for or against different values of the target
• Example: target online display ads to consumers based on the webpages they have visited in the past
• Run a targeted campaign for, e.g., a luxury hotel
• Target variable: will the consumer book a hotel room within one week after having seen the advertisement?
• Cookies allow for observing which consumers book rooms
Introductory example
 (…)
 A consumer is characterized by the set of websites we
have observed her to have visited previously (cookies!)
 We assume that some of these websites are more likely to be visited by good
prospects for the luxury hotel
 Problem: we do not have the resources to estimate the evidence potential for
each site manually
 Idea: use historical data to estimate both the direction and the
strength of the evidence
 Combine the evidence to estimate the
resulting likelihood of class membership
 Similar problem: spam detection
Combining evidence probabilistically
• What is the probability that, if you show an ad to a customer, that customer will book a room, given some evidence 𝐸 (such as the websites visited by that particular customer)? → 𝑃(𝐶|𝐸)
• Problem: for any particular collection of evidence 𝐸, we may not have seen enough cases, or may not have seen it at all!
• Idea: consider the different pieces of evidence separately, and then combine the evidence
Reminder: statistical (in)dependence
• If the events A and B are statistically independent, then we can compute the probability that both A and B occur as

𝑝(𝐴𝐵) = 𝑝(𝐴) ∙ 𝑝(𝐵)

• Example: rolling a fair die
• The general formula for combining probabilities, which takes care of dependencies between events, is

𝑝(𝐴𝐵) = 𝑝(𝐴) ∙ 𝑝(𝐵|𝐴)

• Given that you know A, what is the probability of B?
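For example, two rolls of a fair die are independent: p(first roll is a 6 and second roll is a 6) = 1/6 ∙ 1/6 = 1/36. Knowing the first roll tells us nothing about the second, so p(B|A) = p(B) and the general formula reduces to the independent case.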


Agenda
• Introduction
• Bayes' Rule
• Applying Bayes' rule to data science
• Naive Bayes
• Advantages and Disadvantages of Naive Bayes
• Example
Bayes' rule (2/2)
• Bayes' rule says
• that we can compute the probability of our hypothesis 𝐻 given some evidence 𝐸 by instead looking at the probability of the evidence given the hypothesis, as well as the unconditional probability of the hypothesis and the evidence.
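Written out, Bayes' rule is:

𝑝(𝐻|𝐸) = 𝑝(𝐸|𝐻) ∙ 𝑝(𝐻) / 𝑝(𝐸)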
• Example: medical diagnosis
• Hypothesis 𝐻 = measles, Evidence 𝐸 = red spots
• In order to directly estimate 𝑝(𝑚𝑒𝑎𝑠𝑙𝑒𝑠|𝑟𝑒𝑑 𝑠𝑝𝑜𝑡𝑠) we would need to think through all the different reasons a person might exhibit red spots and what proportion of them would be measles.
• Instead: 𝑝(𝐸|𝐻) is the probability that one has red spots given that one has measles, 𝑝(𝐻) is simply the probability that someone has measles, and 𝑝(𝐸) the probability that someone has red spots.

Applying Bayes' rule to data science (1/2)
• Many data mining methods are based on Bayes' rule
• Bayes' rule for classification gives the probability that the target variable 𝐶 takes on the class of interest 𝑐 after taking the evidence 𝐸 (the feature values) into account:
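In symbols (the standard classification form of Bayes' rule):

𝑝(𝐶 = 𝑐|𝐸) = 𝑝(𝐸|𝐶 = 𝑐) ∙ 𝑝(𝐶 = 𝑐) / 𝑝(𝐸)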

• 𝑝(𝐶 = 𝑐) is the 'prior' probability of the class, i.e., the probability we would assign to the class before seeing any evidence [e.g., the prevalence of c in the population = the percentage of all examples that are of class c].
• 𝑝(𝐸|𝐶 = 𝑐) is the likelihood of seeing the evidence 𝐸 given the class [the percentage of examples of class c that have 𝐸].
• 𝑝(𝐸) is the likelihood of the evidence [the occurrence of 𝐸].
Applying Bayes' rule to data science (2/2)

• (…)
• Having estimated these values, we could use 𝑝(𝐶 = 𝑐|𝑬) as an estimate of the class probability
• Alternatively, we could use the value as a score to rank instances
• Drawback: if 𝐸 is the usual vector of attribute values, we would need to know the full joint probability of the example
• This is difficult to measure
• We may never see a specific example in the training data that matches a given 𝐸 in our test data
• Make a particular assumption of independence!


Naive Bayes (1/2)
• Conditional independence: use the class of the example as the condition
• This allows for easy combination of probabilities:

𝑝(𝐴𝐵|𝐶) = 𝑝(𝐴|𝐶) ∙ 𝑝(𝐵|𝐶)

In other words: we assume that the attributes are conditionally independent given the class, i.e.
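𝑝(𝐸|𝑐) = 𝑝(𝑒1|𝑐) ∙ 𝑝(𝑒2|𝑐) ∙ ⋯ ∙ 𝑝(𝑒k|𝑐), where 𝐸 = ⟨𝑒1, 𝑒2, …, 𝑒k⟩ is the vector of feature values.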

Each of the terms p(ei|c) can be computed directly from the data (count the proportion of the time we see ei among the examples of class c).
Naive Bayes (2/2)
• Naive Bayes classifies a new example by estimating the probability that the example belongs to each class and reports the class with the highest probability
• Note that the denominator P(E) never actually has to be calculated
• We can focus on the numerator when comparing the different classes c, because the denominator is always the same
• If we need probability estimates, the probabilities will add up to one, so we can derive it from the other quantities
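A minimal sketch of this decision rule for categorical features (a hypothetical hand-rolled implementation with Laplace smoothing; in practice one would typically use a library implementation such as scikit-learn's naive_bayes module):

```python
from collections import Counter, defaultdict

def train_nb(examples, labels, alpha=1.0):
    """Estimate p(c) and the counts needed for p(e_i = v | c), with Laplace smoothing."""
    class_counts = Counter(labels)
    feat_counts = defaultdict(Counter)        # (feature index, class) -> value counts
    for x, c in zip(examples, labels):
        for i, v in enumerate(x):
            feat_counts[(i, c)][v] += 1
    return class_counts, feat_counts, alpha

def predict_nb(model, x):
    """Return the class maximizing p(c) * prod_i p(e_i | c); p(E) is never needed."""
    class_counts, feat_counts, alpha = model
    total = sum(class_counts.values())
    best_class, best_score = None, -1.0
    for c, n_c in class_counts.items():
        score = n_c / total                   # prior p(c)
        for i, v in enumerate(x):
            counts = feat_counts[(i, c)]
            # smoothed estimate of p(e_i = v | c)
            score *= (counts[v] + alpha) / (n_c + alpha * (len(counts) + 1))
        if score > best_score:
            best_class, best_score = c, score
    return best_class

# Illustrative usage: features are (visited luxury-travel site?, visited travel blog?)
X = [("yes", "yes"), ("yes", "no"), ("no", "no"), ("no", "yes"), ("no", "no")]
y = ["book", "book", "no-book", "no-book", "no-book"]
model = train_nb(X, y)
print(predict_nb(model, ("yes", "yes")))      # -> "book"
```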
(Dis)Advantages of Naive Bayes
• Naive Bayes
• is a simple classifier, although it takes all the feature evidence into account
• is very efficient in terms of storage space and computation time
• performs surprisingly well for classification
• is an "incremental learner"
• Note that the independence assumption does not hurt classification performance very much
• To some extent, the evidence gets double-counted when features are correlated
• This tends to make the probability estimates more extreme, but in the correct direction
• Don't use the probability estimates themselves!
• But ranking is ok!
Example: Naive Bayes classifier (1/5)
References

❑ Provost, F.; Fawcett, T.: Data Science for Business: Fundamental Principles of Data Mining and Data-Analytic Thinking. O'Reilly, 2013.
❑ Sharda, R.; Delen, D.; Turban, E.: Business Intelligence, Analytics, and Data Science: A Managerial Perspective, 4th Edition. Pearson, 2018.



Thank You
