Session 9 - Visualizing Model Performance, Evidence and Probabilities
Ranking Instead of Classifying
A confusion matrix is an n × n matrix with columns labeled with actual classes and rows labeled with predicted classes.
Different symbols are typically used to distinguish the actual class from the class predicted by the model.
Unbalanced classes often occur in business applications where classifiers
are used to discover unusual entities from large populations. Examples
include looking for defrauded customers, checking for defective parts in
an assembly line, or targeting consumers who will respond to an
application offer.
The problem arises when the class distribution is highly unbalanced, for example when the rare class appears in only 1 out of every 1000 examples. In such situations accuracy becomes a misleading evaluation metric: a model that always predicts the most common class (which appears in 999 out of 1000 examples) achieves 99.9% accuracy, yet it is useless because it never identifies the rare class.
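As a minimal sketch (the synthetic data, the 1-in-1000 positive rate, and the use of scikit-learn's accuracy_score are assumptions for illustration), a classifier that always predicts the majority class reaches roughly 99.9% accuracy while finding none of the rare positives:

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Synthetic, highly unbalanced labels: about 1 positive per 1000 examples (assumption).
rng = np.random.default_rng(0)
y_true = (rng.random(100_000) < 0.001).astype(int)

# "Classifier" that always predicts the majority (negative) class.
y_pred = np.zeros_like(y_true)

print(f"Accuracy: {accuracy_score(y_true, y_pred):.4f}")            # approx. 0.999
print(f"Rare positives found: {(y_pred[y_true == 1] == 1).sum()}")  # 0
```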
Accuracy is the simplest and most popular measure for evaluating classifier performance. It measures the proportion of correct predictions out of all predictions made. Although it is easy to understand, accuracy has shortcomings, especially on imbalanced datasets, where one class has far more examples than the others.
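In terms of the confusion-matrix counts used later in these slides, accuracy can be written as:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$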
Used as a simple evaluation metric, accuracy does not differentiate between false positive and false negative errors. Both types of errors are counted together, with the implicit assumption that they are equally important. In real-world domains, however, false positives and false negatives usually have different consequences and different costs.
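One standard way to make these differing consequences explicit (a textbook formulation, stated here as an illustration rather than taken from the slides) is to weight each error type by its cost and compare classifiers by expected cost instead of accuracy:

$$\text{Expected cost} = c_{FP} \cdot \frac{FP}{T} + c_{FN} \cdot \frac{FN}{T}$$

where $c_{FP}$ and $c_{FN}$ are the assumed per-error costs, $FP$ and $FN$ the error counts, and $T$ the total number of instances.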
• Decrease the cutoff step by step; after each step, count the number of true positives TP (positives with a prediction above the cutoff) and false positives FP (negatives with a prediction above the cutoff). This traces out the ROC curve (see the sketch below).
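A minimal sketch of this cutoff-sweeping procedure (the function name, variable names, and toy scores are illustrative assumptions, not from the slides):

```python
import numpy as np

def roc_points(scores, labels):
    """Sweep the cutoff from high to low and record (FP rate, TP rate) pairs."""
    order = np.argsort(-scores)          # sort instances by decreasing score
    labels = labels[order]
    P, N = labels.sum(), (1 - labels).sum()
    tp = fp = 0
    points = [(0.0, 0.0)]
    for y in labels:                     # lower the cutoff past one instance at a time
        if y == 1:
            tp += 1                      # a positive rises above the cutoff
        else:
            fp += 1                      # a negative rises above the cutoff
        points.append((fp / N, tp / P))
    return points

# Toy example: model scores and true labels (illustrative values).
scores = np.array([0.9, 0.8, 0.7, 0.55, 0.5, 0.4, 0.3, 0.2])
labels = np.array([1,   1,   0,   1,    0,   0,   1,   0])
print(roc_points(scores, labels))
```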
Test Set:

Model                  Accuracy      AUC
Classification Tree    91.8% ± 0.0   0.614 ± 0.014
Logistic Regression    93.0% ± 0.1   0.574 ± 0.023
𝑘-Nearest Neighbors    93.0% ± 0.0   0.537 ± 0.015
Naïve Bayes            76.5% ± 0.6   0.632 ± 0.019
Performance Evaluation
Naïve Bayes confusion matrix:
          p           n
Y   127 (3%)   848 (18%)
N   200 (4%)  3518 (75%)

          p           n
Y     3 (0%)     15 (0%)
N   324 (7%)  4351 (93%)
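From the Naïve Bayes matrix above, the true positive and false positive rates used for the ROC curve can be read off directly. A small sketch, assuming rows are the predicted class (Y/N) and columns the actual class (p/n), as stated earlier for the confusion matrix layout:

```python
# Naïve Bayes confusion matrix from the slide: rows = predicted (Y, N), columns = actual (p, n).
TP, FP = 127, 848
FN, TN = 200, 3518

tpr = TP / (TP + FN)   # true positive rate (recall)
fpr = FP / (FP + TN)   # false positive rate
acc = (TP + TN) / (TP + FP + FN + TN)
print(f"TPR = {tpr:.2f}, FPR = {fpr:.2f}, accuracy = {acc:.2f}")
```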
ROC Curve
Lift Curve
Profit Curves
Agenda
Introduction
Bayes' Rule
Applying Bayes' rule to data science
Naive Bayes
Advantages and Disadvantages of Naive Bayes
Example
Introductory example
So far: using data to draw conclusions about
some unknown quantity of a data instance
Now: analyse data instances as evidence
for or against different values of the target
Example: target online displays to
consumers based on webpages they
have visited in the past
Run a targeted campaign for, e.g., a luxury hotel
Target variable: will the consumer book a hotel
room within one week after having seen the
advertisement?
Cookies allow for observing which consumers
book rooms
Introductory example
(…)
A consumer is characterized by the set of websites we
have observed her to have visited previously (cookies!)
We assume that some of these websites are more likely to be visited by good
prospects for the luxury hotel
Problem: we do not have the resources to estimate the evidence potential for
each site manually
Idea: use historical data to estimate both the direction and the
strength of the evidence
Combine the evidence to estimate the
resulting likelihood of class membership
Similar problem: spam detection
Combining evidence probabilistically
What is the probability that a customer to whom we show the ad will book a room, given some evidence 𝐸 (such as the websites that particular customer has visited)? → We want 𝑃(𝐶 | 𝐸)
𝑝(𝐴𝐵) = 𝑝(𝐴) 𝑝(𝐵|𝐴)
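Since the joint probability factors symmetrically in either order, Bayes' rule follows directly (a standard one-line derivation, spelled out here):

$$p(AB) = p(A)\,p(B \mid A) = p(B)\,p(A \mid B) \quad\Rightarrow\quad p(B \mid A) = \frac{p(A \mid B)\,p(B)}{p(A)}$$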
Bayes' Rule
Bayes' rule (2/2)
Bayes' rule says that we can compute the probability of our hypothesis 𝐻 given some evidence 𝐸 by instead looking at the probability of the evidence given the hypothesis, as well as the unconditional probabilities of the hypothesis and of the evidence.
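In symbols, this is the familiar form of Bayes' rule:

$$p(H \mid E) = \frac{p(E \mid H)\,p(H)}{p(E)}$$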
Example: medical diagnosis
Hypothesis 𝐻 = measles, Evidence 𝐸 = red spots
In order to directly estimate 𝑝(𝑚𝑒𝑎𝑠𝑙𝑒𝑠|𝑟𝑒𝑑 𝑠𝑝𝑜𝑡𝑠) we would need to think
through all the different reasons a person might exhibit red spots and what
proportion of them would be measles.
Instead: 𝑝(𝐸|𝐻) is the prob. that one has red spots given that one has
measles. 𝑝(𝐻) is simply the prob. that someone has measles, and 𝑝(𝐸) that
someone has red spots.
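As a small worked example with invented numbers (illustrative only, not from the slides): suppose 𝑝(𝐸|𝐻) = 0.9, 𝑝(𝐻) = 0.001, and 𝑝(𝐸) = 0.05. Then

$$p(H \mid E) = \frac{0.9 \times 0.001}{0.05} = 0.018$$

so even strong evidence leaves the posterior probability of measles low, because the disease itself is rare.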
Applying Bayes' rule to data science (1/2)
Many data mining (DM) methods are based on Bayes' rule.
Bayes' rule for classification gives the probability that the target variable 𝐶 takes on the class of interest 𝑐 after taking the evidence 𝐸 (the feature values) into account:
(…)
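The omitted formula is presumably the class-conditional form of Bayes' rule, which in standard notation reads:

$$p(C = c \mid \mathbf{E}) = \frac{p(\mathbf{E} \mid C = c)\,p(C = c)}{p(\mathbf{E})}$$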
Having estimated these values, we could use 𝑝(𝐶 = 𝑐 | 𝑬) as an estimate of the class probability.
Alternatively, we could use these values as scores to rank instances.
Drawback: if 𝑬 is the usual vector of attribute values, we would need to know the full joint probability of the example.
This is difficult to measure: we may never see a specific example in the training data that exactly matches a given 𝑬 in our test data.
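A quick illustration of why the full joint probability is infeasible to estimate (the feature count of 50 is an arbitrary assumption): with 𝑑 binary features there are 2^𝑑 distinct evidence vectors 𝑬.

```python
d = 50                       # number of binary features (assumed for illustration)
combinations = 2 ** d        # number of distinct evidence vectors E
print(f"{combinations:,}")   # 1,125,899,906,842,624 distinct combinations to estimate
```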