Course Slides - Classification

The document discusses classification in machine learning, including binary, multi-class, and multi-label classification. It covers classification basics and examples, the machine learning ecosystem, and types of classification problems.

Classification – Fundamentals & Practical Applications

BIDA TM - Business Intelligence & Data Analysis


Course Learning Objectives

• Understand what Classification is and its applicability to many real-world scenarios
• Perform simple classification tasks using logistic regression in Excel
• Understand the implicit assumptions behind Classification techniques and algorithms
• Create classification models in Python using the statsmodels and sklearn modules
• Interpret and evaluate the performance of classification models, outputs and parameters
• Explore more advanced evaluation techniques such as PDP plots and SHAP values to expand your horizons

Classification Basics



Machine Learning Use Cases

Machine Learning can be used across a wide variety of tasks in Finance.

• The target categories should be discrete variables


• Predictions are made using one or more input variables

[Figure: two examples. Transaction Signature Analysis uses Strokes, Variation and Form to predict Genuine or Forged. Loan Default Prediction uses Credit, Income and Status to predict Default or Not Default.]



Recap: The Machine Learning Ecosystem

There are four main types of Machine Learning algorithms:

Supervised Machine Learning
…uses labelled datasets to train algorithms to classify data or predict outcomes
Types: Classification, Regression

Unsupervised Machine Learning
…uses unlabelled datasets to train algorithms to find structure in the data, such as groups or associations
Types: Clustering, Association

Reinforcement Learning
…uses reward maximization such that the algorithm determines the optimal behaviour in an environment

Deep Learning
…learns data patterns and structure from the data itself and is scalable to big data. It can be used for supervised and unsupervised learning



Types of Classification

Binary

• Classification tasks that have two class labels
• Outcomes must be ONE of the two classes
• Algorithms designed for binary classification include Logistic Regression and Support Vector Machines

Use Case | Input Variables | Output Classes
Tumour diagnosis | Variation, Texture, Contrast, Growth Rate etc | Malignant OR Benign
Customer Prediction | Purchase History, Click history, Customer Profile | Will Buy OR Will Not Buy
Email Spam Detection | Spelling Errors, Grammatical Errors, Email domain | Spam OR Not Spam



Types of Classification

Multi-Class

• Has more than two class labels
• Outcomes must be ONE of a range of classes
• Algorithms suited for multi-class problems include decision trees and random forests.

Use Case | Input Variables | Output Classes
Tumour diagnosis | Variation, Texture, Contrast, Growth Rate etc | Malignant OR Benign OR Premalignant
Customer Prediction | Purchase History, Click history, Customer Profile | Will Buy OR Will Not Buy OR Insufficient Data
Email Spam Detection | Spelling Errors, Grammatical Errors, Email domain | Spam OR Not Spam OR Unsafe



Types of Classification

Multi-label

• Has two or more class labels
• Outcome can be ONE or MORE of the class labels
• The difference between multi-label and multi-class is that each label in the former represents a different but related classification problem

Use Case | Input Variables | Output Classes (Labels)
Tumour diagnosis | Variation, Texture, Contrast, Growth Rate etc | Malignant OR / AND Benign OR / AND Premalignant
Customer Prediction | Purchase History, Click history, Customer Profile | Will Buy OR / AND Will Not Buy OR / AND Insufficient Data
Email Spam Detection | Spelling Errors, Grammatical Errors, Email domain | Spam OR / AND Not Spam OR / AND Unsafe

For example, a multi-label classifier may classify an email as both spam and unsafe, or classify the tumor as both benign and premalignant



Common Classification Use Cases

Binary
• Machinery Outage Prediction - Failure OR Not Failure
• Anomaly Detection – Fraud OR Not Fraud
• Credit Card Default - Customer Likely to Default OR Not Likely to Default

Multi-Class
• Product Classification – Red Wine OR White Wine OR Rose Wine
• News Classification of Articles – Sports OR Lifestyle OR Economy OR Current Affairs
• Facial Image Recognition – Happy OR Sad OR Angry

Multi-Label
• Social Tag Selection - #TogetherAtHome OR / AND #COVID19 OR / AND #WorkFromHome
Visualizing Classification

For simple scenarios, it can help to visualize the input variables and output classes on a chart.

[Figure: two scatter plots of emails with Malicious Content on the X axis and Grammatical Errors on the Y axis. Left panel, Two Classes: Class 1 Spam, Class 2 Not Spam. Right panel, Three Classes: Class 1 Spam, Class 2 Not Spam, Class 3 Unsafe.]

In each case, the goal is to use the input data to separate the classes as cleanly as possible.



Classification Algorithms

Throughout the course, we’ll explore the most common classification algorithms.

• Logistic Regression: uses regression principles to achieve separation between discrete classes
• Naïve Bayes
• KNN
• SVM
• Decision Trees
• Random Forest

We will look at the benefits of each and how to interpret and evaluate the outputs.



Logistic Regression



Logistic Regression

• Understand and learn the fundamental concepts of Logistic Regression
• Understand and calculate the probabilities, log odds and how these impact model interpretation
• Learn how to interpret log odds and the assumptions behind Logistic Regression
• Be comfortable with the mathematics behind Logistic Regression and how to manipulate it
• Be able to comprehend and interpret the outputs of the logistic regression algorithm for business scenarios
• Practice a basic logistic regression example in Excel and Python



Logistic Regression

Logistic regression makes a classification based on the probability of an event happening.

“What is the probability that this customer will purchase medical supplies?”

[Figure: probability gauges from 0% to 100% for three customers, showing predicted probabilities of 48%, 35% and 80% against a 50% threshold.]

Customer 1 - Prediction: Will Not Buy
Customer 2 - Prediction: Will Buy
Customer 3 - Prediction: Will Not Buy

We determine a threshold (typically 0.5 or 50%) as the cut-off between prediction classes.
In this case:
• customers below the threshold are predicted to NOT buy.
• customers above the threshold are predicted to buy.
Logistic Regression

Changing the threshold will change the prediction class of some observations.

[Figure: the same three probability gauges (48%, 35% and 80%) with the threshold moved.]

Customer 1 - Prediction: Will Not Buy
Customer 2 - Prediction: Will Buy
Customer 3 - Prediction: Will Not Buy, changing to Will Buy under the new threshold
We can use evaluation metrics to help us decide on the most appropriate threshold.
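
The effect of moving the threshold can be sketched in a few lines of Python. This is an illustration only: the helper name `classify` and the second threshold of 0.4 are hypothetical, and the mapping of the three percentages to the three customers is an assumption chosen to be consistent with the predictions on the slide.

```python
# Predicted purchase probabilities. Which customer has which value is an
# assumption, chosen to match the predictions listed on the slide.
probabilities = {"Customer 1": 0.35, "Customer 2": 0.80, "Customer 3": 0.48}


def classify(probability, threshold=0.5):
    """Return the predicted class for one probability and a given cut-off."""
    return "Will Buy" if probability > threshold else "Will Not Buy"


# Compare the default 0.5 threshold with a lower, hypothetical 0.4 threshold.
for threshold in (0.5, 0.4):
    predictions = {name: classify(p, threshold) for name, p in probabilities.items()}
    print(f"threshold={threshold}: {predictions}")
```

Only the customer whose probability lies between the two thresholds changes class, which is why the choice of threshold matters most for borderline observations.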



Visualizing Logistic Regression

Logistic Regression probabilities are estimated using one or more input variables

[Figure: Probability of Buying (Y, 0 to 1) against Customer Age (X, 30 to 55). Points at Y = 1 are customers who DID buy medical supplies; points at Y = 0 are customers who did NOT buy. The 0.5 threshold on the fitted curve separates the Predict BUY and Predict NOT BUY regions.]

Logistic Regression uses a curved line to summarize our observed data points

The logistic regression line generates probabilities between 0 and 1.



Defining the Logistic Regression Curve

• The logistic regression is based on the logistic function, also called the sigmoid function

• Mathematically this is defined as

$$y = f(L) = \frac{1}{1 + e^{-L}}$$

(Where 'e' is the base of the natural logarithm)

• The input value of L determines the output value of y.

• The logistic function outputs numbers between 0 and 1.

[Figure: S-shaped logistic curve with input (L) from -6 to 6 on the horizontal axis and output (Y) from 0 to 1.0 on the vertical axis, passing through 0.5 at L = 0.]

At input (L) 0, output (y) = 0.5. As input (L) increases, output (y) increases.
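
As a quick check of these properties, the logistic function can be evaluated directly. This is a minimal sketch using only the standard library; the function name `logistic` is our own choice.

```python
import math


def logistic(L):
    """Logistic (sigmoid) function: maps any real input L to a value between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-L))


# Evaluate at the tick marks shown on the slide's horizontal axis.
for L in (-6, -3, 0, 3, 6):
    print(L, round(logistic(L), 4))
```

The output rises steadily with L, equals 0.5 at L = 0, and approaches but never reaches 0 and 1 at the extremes.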



Defining the Logistic Regression Curve

To transform a linear regression into a logistic regression we take the linear regression equation…

$$\beta_0 + \beta_1 x$$

and substitute it for L in our Logistic function…

$$\hat{y} = \frac{1}{1 + e^{-L}}$$

$$P(Y = 1 \mid X) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x)}}$$

(where P(Y=1|X) is the probability of Y=1 given X input features)

[Figure: Target Variable (Y, 0 to 1) against Independent Variable (X, 0 to 25). The logistic curve has been transformed to fit the input data.]



Summary of Logistic Vs Linear Regression

Logistic Regression
• Provides a better fit for classifying data.
• Probabilities always sit between 0 and 1.
• More suitable for predicting a binary outcome.
• More suitable for Classification.

Linear Regression
• Provides a poor fit for classifying data.
• Probabilities are not limited to a sensible range.
• Suited to predicting a numeric quantity.

[Figure: two charts of Probability (Y, 0 to 1) against Independent Variable (X, 0 to 25), one fitted with a logistic curve and one with a straight line.]



Logistic regression assumptions

1. The dependent variable is binary, i.e. it fits into one of two clear-cut categories.

2. There should be no, or very little, multicollinearity between the predictor variables, meaning the independent variables should be independent of each other.

3. Logistic regression requires large sample sizes: the larger the sample size, the more reliable the results.

4. The independent variables should be linearly related to the log odds.

With multicollinear variables, the algorithm would be unable to separate their effects, likely causing errors.

[Figure: tumour input variables Density, Size and Mass.] Multicollinearity would happen because size and mass hold similar information.



Probability, Odds and Log Odds

• Probabilities are the chance of something happening, relative to all outcomes.

Example with ten outcomes: W W W W W W L L L L. Win: p = 60%. Lose: p = 40%.

• Odds are the chance of something happening, relative to other outcomes.

$$\text{Odds of Winning} = \frac{p(W)}{p(L)} = \frac{p(W)}{1 - p(W)} = \frac{0.6}{0.4} = 1.5$$

• Log Odds are simply the log of the odds.

$$\text{Log Odds of Winning} = \ln\left(\frac{60}{40}\right) = \ln(1.5)$$

• Log odds are easier for statistical models to work with.
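
The three quantities convert into one another directly. The sketch below reproduces the win/lose numbers above; the helper names are our own and not part of any library.

```python
import math


def odds(p):
    """Odds of an event with probability p: chance of happening vs not happening."""
    return p / (1 - p)


def log_odds(p):
    """Natural log of the odds (also called the logit)."""
    return math.log(odds(p))


def probability_from_log_odds(z):
    """Invert the log odds back to a probability using the logistic function."""
    return 1 / (1 + math.exp(-z))


p_win = 0.6
print("odds:", odds(p_win))
print("log odds:", log_odds(p_win))
print("back to probability:", probability_from_log_odds(log_odds(p_win)))
```

Converting log odds back to a probability uses the same logistic function that defines the logistic regression curve, which is why the model works naturally on the log odds scale.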



Logistic Probabilities, Odds and Log Odds
Interpreting the impact of a change in an input variable appears to be difficult at first.

$$P(Y = 1 \mid X) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x)}}$$

Here x is our input variable.

Rearranging our logistic regression equation we can reach the following:

$$\ln\left(\frac{P(Y = 1)}{1 - P(Y = 1)}\right) = \ln\left(\frac{P(Y = 1)}{P(Y = 0)}\right) = \beta_0 + \beta_1 x$$

The left-hand side is the Log Odds of being 1; the right-hand side is the Linear Inputs.
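
The intermediate steps of the rearrangement can be written out as follows, using the shorthand p = P(Y = 1 | X) and z = β0 + β1x (both introduced here only for brevity):

```latex
\begin{aligned}
p &= \frac{1}{1 + e^{-z}}, \qquad z = \beta_0 + \beta_1 x \\
1 - p &= \frac{e^{-z}}{1 + e^{-z}} \\
\frac{p}{1 - p} &= \frac{1}{e^{-z}} = e^{z} \\
\ln\left(\frac{p}{1 - p}\right) &= z = \beta_0 + \beta_1 x
\end{aligned}
```

Each line follows from the previous one by simple algebra, so the log odds are exactly the linear part of the model.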



Interpreting Coefficients

We can interpret the coefficients in two ways:

$$\ln(\text{Odds of being 1}) = \beta_0 + \beta_1 x$$
For every unit change in x, the log odds will change by β1.

$$\text{Odds of being 1} = \exp(\beta_0 + \beta_1 x)$$
For every unit change in x, the odds will be multiplied by exp(β1).

Odds are generally easier to interpret than log odds.



Interpretation Scenario

You are tasked by your company with predicting the likelihood of purchase of medical supplies (Y).

Input features:
• X1 = Customer tenure
• X2 = Purchased in the last year

Equation:

$$\text{Odds of Purchase} = \exp(\beta_0 + \beta_{\text{Tenure}}\, x_1 + \beta_{\text{Purchased}}\, x_2)$$

Coefficients:
• Customer tenure coefficient β_Tenure = 0.6
• Purchased last year coefficient β_Purchased = 0.2

Odds Interpretations:
• exp(β_Tenure) = 1.82
“Every extra year as a customer increases the odds of purchase by a factor of 1.82.”
• exp(β_Purchased) = 1.22
“Customers that purchased in the last year have 22% higher odds of buying again this year.”
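
The odds interpretations above come from exponentiating each coefficient. A minimal sketch, using the two coefficients from the scenario (the dictionary keys are our own labels, and no intercept is used because the slide does not give one):

```python
import math

# Coefficients taken from the scenario; the key names are illustrative labels.
coefficients = {"tenure": 0.6, "purchased_last_year": 0.2}

# exp(beta) is the factor by which the odds are multiplied per unit increase.
odds_multipliers = {name: math.exp(beta) for name, beta in coefficients.items()}

for name, multiplier in odds_multipliers.items():
    percent_change = (multiplier - 1) * 100
    print(f"{name}: odds x {multiplier:.2f} ({percent_change:+.0f}%)")

# Effects compound multiplicatively: two extra years of tenure multiply the
# odds by exp(0.6) twice, which is the same as exp(0.6 * 2).
two_year_multiplier = math.exp(coefficients["tenure"] * 2)
print("two extra years of tenure: odds x", round(two_year_multiplier, 2))
```

Note that these multipliers apply to the odds, not to the probability itself; the change in probability depends on where the customer starts on the curve.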



Logistic Regression in Practice

Excel
• RegressIt is a powerful Excel add-in tool that performs multivariate descriptive data analysis and linear as well as logistic regression

Python packages
• Scikit-learn provides a range of supervised and unsupervised learning algorithms
• statsmodels provides classes and functions for many different statistical models and tests



Classification Algorithms



Classification Algorithms

• Learn about the basic Classification algorithms and when and how to use them
• Understand the underlying considerations and assumptions of these algorithms
• Learn the inner workings of the algorithms and understand how they categorize observations
• Understand how to interpret the parameters, outputs and evaluation metrics for the algorithms
• Be able to apply these to real-world scenarios and learn about their applications
• Apply each classification algorithm in Python and identify the quality of results from each



Algorithms Overview

We will review five common algorithms used to build classification models:

Naïve Bayes KNN SVM

Decision Trees Random Forest



Naïve Bayes

• Naïve Bayes is a probabilistic model based on Bayes' theorem; it gives us classifications based on probabilities

• Bayes' theorem generates the probability of one event, given the probabilities of other events

Strengths
• Easy to use and good for large datasets
• Can be used to solve multi-class prediction problems

Weaknesses
• Assumes independence in features, which makes it less applicable on most real-world datasets

• Bayes' theorem states that the probability (P) of an event A happening, given that an event B occurred, is:

$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$$

Where A is the hypothesis or outcome variable, and B is the evidence or features



Naïve Bayes – Example

We want to predict the likelihood of an email being spam, given it contains a grammatical error.

Observation | Spam Email? | Grammatical Errors?
1 | Yes | Yes
2 | No | No
3 | No | Yes
4 | No | No
5 | No | No
6 | Yes | No
7 | No | No
8 | No | No
9 | No | Yes
10 | No | No
… | … | …

20 Emails | Spam | Not Spam | Total
Grammatical Errors | 3 | 4 | 7
No Grammatical Errors | 1 | 12 | 13
Total | 4 | 16 | 20
Naïve Bayes – Example

We want to predict the likelihood of an email being spam, given it contains a grammatical error.

$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$$

Using the counts from the 20-email table (3 of the 4 spam emails, and 7 of the 20 emails overall, contain grammatical errors):

• P(Error|Spam) = 3/4 or 0.75
• P(Spam) = 4/20 or 0.20
• P(Error) = 7/20 or 0.35

$$P(\text{Spam} \mid \text{Error}) = \frac{0.75 \times 0.20}{0.35} = 0.43 \text{ or 43\% likely}$$
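
The same calculation can be reproduced from the raw counts. This is a small sketch with the four cell counts of the 20-email table; the variable and key names are our own.

```python
# Cell counts from the 20-email table: (class, has grammatical errors?) -> count
counts = {
    ("spam", "error"): 3,
    ("spam", "no_error"): 1,
    ("not_spam", "error"): 4,
    ("not_spam", "no_error"): 12,
}
total = sum(counts.values())

spam_total = counts[("spam", "error")] + counts[("spam", "no_error")]
error_total = counts[("spam", "error")] + counts[("not_spam", "error")]

p_spam = spam_total / total                                # P(Spam)
p_error = error_total / total                              # P(Error)
p_error_given_spam = counts[("spam", "error")] / spam_total  # P(Error|Spam)

# Bayes' theorem: P(Spam|Error) = P(Error|Spam) * P(Spam) / P(Error)
posterior = p_error_given_spam * p_spam / p_error
print(round(posterior, 2))
```

The result equals 3/7, the share of emails with grammatical errors that are spam, which can be read directly from the first row of the table as a sanity check.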



K-Nearest Neighbours (KNN)

KNN assigns output classes based on the most similar observations in our sample space.

By similar, we mean those that have the closest input values.

[Figure: scatter plot of Two Classes (Class 1: Spam, Class 2: Not Spam) with Malicious Content Frequency (X) on the horizontal axis and Grammatical Errors Frequency (Y) on the vertical axis, plus new observations to classify.]

New observation - Which class would you likely assign it to? Not Spam!
The new observation is closer to the Not Spam observations, meaning its characteristics are more similar.

New observation - The nearest observation is Spam. However, the next 3 closest are Not Spam.
How many of the nearest observations should we choose?

Once we have optimized this number of nearest neighbours, we can visualize our decision boundary, which represents the boundary between Spam and Not Spam.



K-Nearest Neighbours - Example

Choosing the right value for K is critical

[Figure: scatter plot of Feature (X) against Feature (Y) with three classes (Class 1: Not Spam, Class 2: Unsafe, Class 3: Spam) and two points with misclassified labels, p1 and p2.]

• K=1 will simply classify the data point on the basis of its one closest neighbour
• K=1000 will not identify categories in the data at all
• Increasing K can improve prediction because each classification is based on a vote among several neighbours rather than on a single point
• An optimal value of K must be selected depending on the size of the dataset

As the misclassified point p1 (Not Spam) has a green point (Spam) closest, a value of K=1 misclassifies the Not Spam email as Spam. The same thing happens in the case of p2.

Strengths
• Simple and easy implementation (no assumptions)
• Can be used for classification and regression

Weaknesses
• Performance decreases as the number of examples and/or independent variables increases
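
The effect of K can be illustrated with a from-scratch sketch. The data points below are hypothetical, arranged so that the single nearest neighbour of the new observation is Spam while the next three are Not Spam; `knn_predict` is our own helper, not a library function.

```python
import math
from collections import Counter


def knn_predict(training_points, new_point, k):
    """Classify new_point by majority vote among its k nearest training points.

    training_points is a list of ((x, y), label) pairs.
    """
    by_distance = sorted(training_points, key=lambda item: math.dist(item[0], new_point))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


# Hypothetical emails: (malicious content frequency, grammatical error frequency)
training = [
    ((5.5, 5.0), "Spam"),      # the single closest point to the new observation
    ((4.0, 5.0), "Not Spam"),
    ((5.0, 6.2), "Not Spam"),
    ((6.3, 5.0), "Not Spam"),
    ((9.0, 9.0), "Spam"),
    ((8.0, 9.0), "Spam"),
    ((9.0, 8.0), "Spam"),
]
new_observation = (5.0, 5.0)

for k in (1, 3):
    print(k, knn_predict(training, new_observation, k))
```

With K=1 the prediction follows the one nearby Spam point; with K=3 the two Not Spam neighbours outvote it. Odd values of K are a common way to avoid tied votes in two-class problems.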



Support Vector Machines (SVM)
SVMs separate data points with a line or plane through the sample space

• In 2D sample space, separation is created using a line
• In 3D sample space, separation is created using a plane

[Figure: Class 1 and Class 2 points separated by a line in 2D and by a plane in 3D.]

SVM aims to find a plane that maximises the separation distance between data points of both classes

The algorithm defines the criterion for a boundary that is maximally far away from any data point

This distance to the closest data point from the decision surface determines the margin of the classifier

Strengths
• Uses a subset of training points in the decision function so it is memory efficient

Weaknesses
• The algorithm does not directly provide probability estimates; these are calculated using expensive techniques



Decision Trees

A decision tree is a cascading set of questions, used to incrementally separate classes and improve
predictive power.

Strengths
• Simple to understand and interpret (visualise)
• Used for classification or regression
• Requires little data preparation as it can handle both numerical and categorical data

Weaknesses
• Tends to overfit (the trained tree doesn’t generalise well to unseen data)
• Small changes in the data tend to cause big differences in the tree (instability)
• Can become expensive to compute with high dimensionality



Decision Trees - Example

• Each node represents a feature/attribute of the data set and branches represent a decision
• The decision tree starts at the root node
• We go down the tree asking true/false questions at decision nodes…
• …until the leaf node (or outcome) is reached
• Once the tree is completed, it can be used to evaluate each email by answering the questions until a leaf node is reached and a prediction is made.

[Figure: decision tree. The root node (Spam: 4, Not Spam: 16) asks "Grammatical Error?". Yes branch: Spam: 3, Not Spam: 4. No branch: Spam: 1, Not Spam: 12.]



Decision Trees - Example
• Splitting is the process of dividing a node into child nodes
• The goal is to make each of the child nodes more “pure” or homogeneous, containing more similar classes of observations
• Separating emails according to grammatical errors adds predictive value to the model

[Figure: decision tree. The root node (Spam: 4, Not Spam: 16, 20% Spam) asks "Grammatical Error?". Yes branch: Spam: 3, Not Spam: 4, 43% Spam. No branch: Spam: 1, Not Spam: 12, 7.7% Spam.]
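
The percentages in the tree, and the idea of purity, can be checked numerically. The sketch below computes the spam share of each node and, as an assumption of ours, uses Gini impurity as the purity measure; the slides do not say which measure the course uses.

```python
def spam_share(spam, not_spam):
    """Proportion of emails in a node that are spam."""
    return spam / (spam + not_spam)


def gini(spam, not_spam):
    """Gini impurity of a two-class node: 0 means pure, 0.5 means a 50/50 mix.

    Gini is an assumed purity measure, chosen here for illustration.
    """
    p = spam_share(spam, not_spam)
    return 1 - p ** 2 - (1 - p) ** 2


root = (4, 16)        # (spam, not spam) before the split
yes_branch = (3, 4)   # emails with grammatical errors
no_branch = (1, 12)   # emails without grammatical errors

for name, node in (("root", root), ("yes", yes_branch), ("no", no_branch)):
    print(name, f"{spam_share(*node):.1%} spam", f"gini={gini(*node):.3f}")

# Impurity after the split: child impurities weighted by the share of emails.
n_total = sum(root)
weighted_child_gini = (
    sum(yes_branch) / n_total * gini(*yes_branch)
    + sum(no_branch) / n_total * gini(*no_branch)
)
print("weighted child gini:", round(weighted_child_gini, 3))
```

The weighted impurity of the child nodes is lower than the impurity of the root, which is one way to quantify the statement that the split adds predictive value.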



Random Forest
A random forest is known as an ensemble model since it combines the results from multiple other models; in
this case decision trees.

Majority Voting

[Figure: three decision trees, each producing its own prediction.]
Predict 1 + Predict 0 + Predict 1 = Final Prediction: 1

Strengths
• Improved accuracy (reduces overfitting) and more powerful than decision trees
• Used in regression and classification

Weaknesses
• The complexity of the algorithm can cause it to become slow and inefficient
• Real-time predictions can be slow due to large inputs
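
Majority voting itself is a very small operation. A minimal sketch, where the list of tree predictions is simply the example from the slide and `majority_vote` is our own helper:

```python
from collections import Counter


def majority_vote(predictions):
    """Return the class predicted by the largest number of trees."""
    return Counter(predictions).most_common(1)[0][0]


tree_predictions = [1, 0, 1]  # one prediction per decision tree
final_prediction = majority_vote(tree_predictions)
print(final_prediction)
```

Using an odd number of trees avoids tied votes in a two-class problem.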



Evaluation & Interpretability



Evaluation & Interpretability

• Understand the basic outcomes of classification and how they are represented visually in a confusion matrix.
• Learn how to evaluate and compare models with evaluation metrics such as accuracy, precision and recall.
• Understand when certain metrics may be more appropriate than others.
• Learn more advanced evaluation metrics that build on basic Precision and Recall, such as F-scores.
• Understand how to interpret AUC-ROC curves and use them to compare model results.
• Implement the above model evaluation techniques in Python using SkLearn.



Model Evaluation Basics

• Evaluation metrics are important because they tell us how well the model is performing

• To set up our model evaluation, the dataset is often split into training and testing data

[Figure: the Available Sample Data divided into Training Data and Testing Data.]

• The model learns from the training data
• The testing data is used to test how well the model performs on new unseen data
• Better evaluation results give more confidence that the model can be used reliably on new real-world data
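
A train/test split can be sketched with the standard library alone. The helper name `split_train_test`, the 20% test fraction and the fixed seed are all illustrative assumptions; in practice a library routine would normally be used.

```python
import random


def split_train_test(rows, test_fraction=0.2, seed=42):
    """Shuffle the rows and hold back a fraction of them as testing data."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split repeatable
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


observations = list(range(100))  # stand-in for 100 rows of sample data
train, test = split_train_test(observations)
print(len(train), "training rows,", len(test), "testing rows")
```

Shuffling before splitting matters when the data is ordered (for example by date or by class), since otherwise the testing data may not resemble the training data.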



Confusion Matrix

The confusion matrix helps us understand the quality of our predictions.

Actual \ Prediction | Negative (0) | Positive (1)
Negative (0) | True Negative: the number of emails we correctly predicted as NOT SPAM. | False Positive: the number of emails incorrectly predicted as SPAM.
Positive (1) | False Negative: the number of emails incorrectly predicted as NOT SPAM. | True Positive: the number of emails we correctly predicted as SPAM.



Evaluation Metrics

There are four key metrics that can help summarize the observations in the confusion matrix:

Actual \ Prediction | Negative (0) | Positive (1)
Negative (0) | True Negative (TN) | False Positive (FP)
Positive (1) | False Negative (FN) | True Positive (TP)

Accuracy = (TN + TP) / Total Predictions
Describes what proportion of predictions were correct (may not always be the best indicator of performance).

Precision = TP / (TP + FP)
How good are the positive predictions? Out of those predicted positive, how many were actually positive?

Recall = TP / (TP + FN)
Describes what proportion of the actual positive cases were correctly identified.

F1 Score = 2 * [ (Precision*Recall) / (Precision+Recall)]
Provides a balance between precision and recall.



Evaluation Metrics Example

Consider an example where we have 100 emails and we use a model to predict whether an email is spam or not. Here SPAM is our positive class and NOT SPAM is our negative class.

Actual \ Prediction | Not Spam | Spam
Not Spam | True Negative: 90 | False Positive: 2
Spam | False Negative: 3 | True Positive: 5

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} = \frac{95}{90 + 3 + 2 + 5} = 0.95$$

$$\text{Precision} = \frac{TP}{TP + FP} = \frac{5}{5 + 2} = 0.71$$

$$\text{Recall} = \frac{TP}{TP + FN} = \frac{5}{5 + 3} = 0.625$$

In this situation, Precision and Recall are equally important:

• we don’t want to pass any SPAM emails as NOT SPAM (as they could be dangerous)

and at the same time:

• we would not want any NOT SPAM emails going into our SPAM box (as that email could be important to us).
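
These metrics can be computed directly from the four confusion matrix counts. A minimal sketch with our own helper name; it assumes at least one positive prediction and one actual positive, so the divisions are defined.

```python
def classification_metrics(tn, fp, fn, tp):
    """Compute accuracy, precision, recall and F1 from confusion matrix counts.

    Assumes tp + fp > 0 and tp + fn > 0; no zero-division handling is included.
    """
    total = tn + fp + fn + tp
    accuracy = (tn + tp) / total
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Counts from the 100-email example.
metrics = classification_metrics(tn=90, fp=2, fn=3, tp=5)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Working from unrounded precision and recall avoids small rounding differences when these values feed into later calculations such as the F-scores.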
Precision Vs Recall

• Accuracy works well for balanced classes (having roughly equal number of samples of every class)
• Precision is a good choice of metric when we have imbalanced classes and we want to minimize false positives.
• Recall is a good choice of metric when we have imbalanced classes and we want to minimize false
negatives.

• In tumour risk detection, we would want to reduce the number of False Negatives.
• We cannot afford to miss any malignant samples in the data.
• Recall would be the preferred metric here because it measures the proportion of actual malignant
tumours that we detected.
• Misclassifying a low risk tumour as risky is obviously not ideal, but is a secondary priority.

• Ideally, a model with both a high recall and precision score would be preferred.



Fβ –Score

Fβ is a combined evaluation metric that balances precision and recall.

By choosing a value of β, we can modify our equation to control the weight of Precision and Recall in our calculations.

$$F_\beta = \frac{1 + \beta^2}{\dfrac{\beta^2}{\text{Recall}} + \dfrac{1}{\text{Precision}}}$$

[Figure: scale of β values starting at β = 0. β < 1: focus on precision. β = 1: equal weight. β > 1: focus on recall.]

F₁ and F₂ are the most common variants of the Fβ score



Practical Example Fβ
Actual \ Prediction | Not Spam | Spam
Not Spam | True Negative: 90 | False Positive: 2
Spam | False Negative: 3 | True Positive: 5

Using Precision = 5/7 ≈ 0.714 and Recall = 5/8 = 0.625:

• We can calculate F₁ score as:

$$F_1 = \frac{1 + 1^2}{\dfrac{1^2}{0.625} + \dfrac{1}{0.714}} \approx 0.67$$

• And we can calculate F₂ as:

$$F_2 = \frac{1 + 2^2}{\dfrac{2^2}{0.625} + \dfrac{1}{0.714}} \approx 0.64$$

• F₁ gives equal importance to both Precision and Recall

• In this example, F₁ would be a relatively better choice as we prefer both Precision and Recall somewhat equally here.
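
The Fβ formula translates directly into code. The sketch below uses the unrounded precision and recall from the 100-email example; `f_beta` is our own helper name, and β = 0.5 is included only to show the precision-weighted direction.

```python
def f_beta(precision, recall, beta):
    """F-beta score: beta > 1 weights recall more, beta < 1 weights precision more."""
    beta_squared = beta ** 2
    return (1 + beta_squared) / (beta_squared / recall + 1 / precision)


precision = 5 / (5 + 2)  # TP / (TP + FP)
recall = 5 / (5 + 3)     # TP / (TP + FN)

for beta in (0.5, 1, 2):
    print(f"F{beta}: {f_beta(precision, recall, beta):.3f}")
```

Because recall is lower than precision in this example, the score falls as β increases and more weight shifts onto recall.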



Is Accuracy the Best Choice?

Now consider we have 1000 sample observations:

• Not Spam (-ve) 980 samples
• Spam (+ve) 20 samples

This is an imbalanced class problem.

Actual \ Prediction | Not Spam | Spam
Not Spam | True Negative: 950 | False Positive: 30
Spam | False Negative: 17 | True Positive: 3

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} = \frac{953}{3 + 950 + 30 + 17} = 0.953 \approx 95\%$$

However, we are failing to correctly predict the class we care about, the SPAM emails, and we can see it in the poor performance of the other metrics:

• Precision: 3/33 ≈ 9%
• Recall: 3/20 = 15%

So Accuracy may not be the best choice for all problems
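
The gap between accuracy and the other metrics is easy to reproduce. The sketch below uses the counts from this example and adds one comparison of our own: a trivial baseline that labels every email Not Spam.

```python
# Confusion matrix counts from the 1000-email example.
tn, fp, fn, tp = 950, 30, 17, 3
total = tn + fp + fn + tp

accuracy = (tp + tn) / total
precision = tp / (tp + fp)
recall = tp / (tp + fn)

# Baseline for comparison (our addition): always predict Not Spam.
# Every actual Not Spam email is then correct and every Spam email is missed.
actual_not_spam = tn + fp
baseline_accuracy = actual_not_spam / total

print(f"accuracy={accuracy:.1%} precision={precision:.1%} recall={recall:.1%}")
print(f"always-Not-Spam baseline accuracy={baseline_accuracy:.1%}")
```

The baseline reaches higher accuracy than the model without detecting a single spam email, which is why accuracy alone is misleading when classes are imbalanced.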


The ROC Curve

The Receiver Operating Characteristic (ROC) curve visualizes the model performance and is a
useful way to evaluate the results of a binary classification model.

True Positive Rate (TPR)
Describes the proportion of actual positives that we correctly identified. Same as recall.

$$TPR = \frac{\text{True Positive}}{\text{True Positive} + \text{False Negative}}$$

False Positive Rate (FPR)
Describes the proportion of actual negatives that we flagged as positive.

$$FPR = \frac{\text{False Positive}}{\text{True Negative} + \text{False Positive}}$$

[Figure: ROC chart with FPR on the horizontal axis and TPR on the vertical axis.]

Each classification model we create can be plotted as a curve on this chart.

The further the ROC curve sits from the random classifier line (towards high TPR at low FPR), the better the model is at predicting overall results.
The ROC Curve - AUC

• AUC stands for Area Under the Curve. It is calculated as the surface or area that sits underneath the curve.
• Helps summarise the ROC curve into a single number between 0 and 1, to help compare different algorithms
• Higher AUC (close to 1) means the model is better at separating classes.
• A perfect model would display a square in order to maximize the area.

[Figure: ROC chart with FPR on the horizontal axis and TPR on the vertical axis, showing the curve of a Perfect model.]
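
To make the construction concrete, the sketch below builds ROC points by sweeping the threshold over a set of predicted scores and then measures the area with the trapezoid rule. The labels and scores are made-up illustration data, and both helper functions are our own rather than library calls.

```python
def roc_points(labels, scores):
    """Return (FPR, TPR) points obtained by using each distinct score as a threshold.

    labels are 1 (positive) or 0 (negative); a score >= threshold predicts positive.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]  # a threshold above every score predicts no positives
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the curve through the given points, using the trapezoid rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Hypothetical model output: true labels and predicted probabilities.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

print("AUC:", round(auc(roc_points(labels, scores)), 3))

# A model that ranks every positive above every negative fills the whole square.
print("perfect AUC:", auc(roc_points([1, 1, 0, 0], [0.9, 0.8, 0.3, 0.1])))
```

The first model ranks one negative above one positive, so its curve dips below the perfect square and its area is slightly less than 1.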



Overfitting Vs Underfitting

Underfitting and Overfitting are terms that help us summarize the performance of a model.

Underfitting
• Is too simple for the scenario
• Does not perform well on the training data
• Oversimplifies patterns in the data
• Will not perform well on testing data

Overfitting
• Learns the training data too well (effectively learns the answers)
• Will perform very well on training data.
• But is unable to look at the bigger picture and generalize trends
• Will likely perform badly on testing (new) data



We’ve built a good working model, now what?

Often a model is the spark for follow up questions, such as:


• To what extent can we explain the reasons why our model makes the predictions that it does?

• What are the main drivers for making this class prediction?

• What are the main drivers for predicting this customer will purchase my product?

• What influences that purchase prediction positively or negatively?

• How large is the influence of each input variable?

• How can we explain the difference between one prediction and the next?



Interpretability of Machine Learning Models

• A model is interpretable if it can be understood by anyone without additional explanation.

• Interpretability helps us check that the model is reliable, fair, robust and reasonable

• Basic models, like Linear models or Naïve Bayes models, are highly interpretable and user-friendly

• A model is accurate if it can make higher quality predictions on average.

• For better accuracy in models that use real-world data we tend to require more complex models (often less interpretable) like Neural Networks.

[Figure: models plotted by Interpretability against Accuracy. From most interpretable to most accurate: Naïve Bayes, Linear Regression, Decision Trees, Logistic Regression, SVMs, Random Forests, Neural Networks.]

We must consider the extent to which we are required to dissect and provide interpretable outputs.



Interpretability and Explainability

Interpretability (What is happening?)
A model is interpretable if you can work out the simple cause and effect relationship between inputs and outputs with relatively simple logic.

Explainability (Why is it happening?)
A model is explainable if we can fully dissect each part of the model and explain its role in the decision making process.

[Figure: scale of models: Naïve Bayes, Linear Regression, Decision Trees, Logistic Regression, Random Forest, Neural Networks, Deep Neural Networks.]

White Box Models are easy to interpret and explain.
Reinforces accountability and audit.

Black Box Models make it difficult to pinpoint causality for a particular outcome.
Less helpful for accountability or audit.



Introduction to Feature Importance
• Feature importance helps determine how well each of the individual features helps predict the overall outcome.

• The higher the total importance, the more influence that feature has on what we want to predict.

• This technique tends to be used in tree-based algorithms such as Decision Trees or Random Forests but can also be used more generally.

• There are different types of feature importance depending on how we define importance e.g. number of times a feature is used to split a branch in a decision tree.

[Figure: bar chart of feature importance (across model iterations), from Low to High, showing the Range and Avg / Median for Genetics, Sedentary, Age, Smoker, Height and Weight.]
We can see genetics, age and being sedentary have high importance when predicting whether a tumor is malignant.

Strengths
• Model learns better by prioritising important features
• Training time and memory requirements are reduced

Weaknesses
• Most methods rely on approximations, except for linear models
• Most methods cannot deal with high-dimensional data



Introduction to Partial Dependence Plots (PDP)

• Partial Dependence Plots (PDP) show the marginal effect of independent feature(s) on model predictions

• They show whether the relationship between the outcome and predictor variable is linear, monotonic or more complex

• A straight, flat PDP indicates that the feature is not important.

• When predicting malignant tumours, an increase in age (>55) and genetic likelihood score (>0.5) means a higher likelihood of the tumour being malignant

[Figure: three partial dependence plots of Malignant Tumour Probability (0.0 to 1.0) against Age (20 to 100), Genetic Likelihood (0.2 to 1.0) and Height (0.2 to 1.0).]

Strengths
• The interpretation of the plots is intuitive and easy to understand

Weaknesses
• Assumes independence in the features
• Limited interpretability with more than 2 features



Introduction to SHapley Additive exPlanations (SHAP)

• SHAP values help explain the contribution each feature is making to an individual prediction.

• When we are talking about individual observations, we are talking about local explainability.

Instead of asking:

• How much is the prediction of a tumour being malignant driven by a person's age?

We would ask:

• How much is the prediction of this patient's tumour being malignant driven by the fact that she is over 65 years old?

[Figure: bar chart of SHAP Value (0.05 to 0.25) for Genetics, Sedentary, Age, Smoker, Height and Weight.]
Genetics was the highest driver behind this positive prediction.

Strengths
• Makes black-box models easy to explain to all audiences
• Allows for decomposing of each prediction by all the features

Weaknesses
• SHAP can be slow to plot
• It is possible to create intentionally misleading interpretations that can hide biases
