Course Slides - Classification
Learning objectives:
• Understand what Classification is and its applicability to many real-world scenarios
• Perform simple classification tasks using logistic regression in Excel
• Understand the implicit assumptions behind Classification techniques and algorithms
• Create classification models in Python using statsmodels and sklearn modules
• Interpret and evaluate the performance of classification models, outputs and parameters
• Explore more advanced evaluation techniques such as PDP plots and SHAP values to expand your horizons.
Classification Basics
[Examples of binary output classes: Genuine vs Forged; Default vs Not Default]
• Supervised learning uses labelled datasets to train algorithms in classifying data.
• Unsupervised learning uses unlabelled datasets to train algorithms in classifying data.
• Reinforcement learning uses reward maximization such that the algorithm determines the optimal behaviour in an environment.
• Deep learning learns data patterns and structure from the data itself and is scalable to big data. It can be used for supervised and unsupervised learning.
Multi-Label
• Has two or more class labels
• Outcome can be ONE or MORE of the class labels
• Differs from multi-class, where the outcome is exactly one of the class labels

Use Case: Tumour diagnosis
• Input Variables: Variation, Texture, Contrast, Growth Rate etc.
• Output Classes (Labels): Malignant OR/AND Benign OR/AND Premalignant
Binary
• Machinery Outage Prediction - Failure OR Not Failure
• Anomaly Detection – Fraud OR Not Fraud
• Credit Card Default - Customer Likely to Default OR Not Likely to Default
Multi-Class
• Product Classification – Red Wine OR White Wine OR Rose Wine
• News Classification of Articles – Sports OR Lifestyle OR Economy OR Current Affairs
• Facial Image Recognition – Happy OR Sad OR Angry
Multi-Label
• Social Tag Selection - #TogetherAtHome OR / AND #COVID19 OR / AND #WorkFromHome
BIDA TM - Business Intelligence & Data Analysis
Visualizing Classification
For simple scenarios, it can help to visualize the input variables and output classes on a chart.
[Charts: emails plotted by Grammatical Errors and Malicious Content, with the Spam and Not Spam classes separated]
In each case, the goal is to use the input data to separate the classes as cleanly as possible.
Throughout the course, we’ll explore the most common classification algorithms.
We will look at the benefits of each and how to interpret and evaluate the outputs.
Learning objectives:
• Understand and learn the fundamental concepts of Logistic Regression
• Understand and calculate the probabilities and log odds, and how these impact model interpretation
• Learn how to interpret log odds and the assumptions behind Logistic Regression
• Be comfortable with the mathematics behind Logistic Regression and how to manipulate it
• Be able to comprehend and interpret the outputs of the logistic regression algorithm for business scenarios
• Practice a basic logistic regression example in Excel and Python.
We determine a threshold (typically 0.5 or 50%) as the cut-off between prediction classes.
In this case: customers below the threshold are predicted to NOT buy;
customers above the threshold are predicted to buy.
Logistic Regression
Changing the threshold will change the prediction class of some observations.
[Chart: example customers with predicted probabilities of 48%, 35% and 80%, either side of the threshold]
We can use evaluation metrics to help us decide on the most appropriate threshold.
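The threshold logic above can be sketched in a few lines of Python. The 48%, 35% and 80% probabilities are the example values from the chart; the helper function itself is illustrative, not part of the course materials:

```python
def classify(probabilities, threshold=0.5):
    """Map predicted purchase probabilities to prediction classes using a cut-off."""
    return ["Buy" if p >= threshold else "Not Buy" for p in probabilities]

probs = [0.48, 0.35, 0.80]               # example predicted probabilities
print(classify(probs))                   # threshold 0.5: ['Not Buy', 'Not Buy', 'Buy']
print(classify(probs, threshold=0.4))    # lowering it flips the 48% customer to 'Buy'
```

Lowering the threshold changes the prediction class of the 48% customer, which is exactly the effect described above.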
Logistic Regression probabilities are estimated using one or more input variables.

[Chart: predicted probability of buying (Y) against Customer Age (X), ages 30–55; customers above the 0.5 threshold are predicted to BUY, those below to NOT buy. Customers who did NOT buy medical supplies sit at probability 0.]
Logistic Regression uses a curved line to summarize our observed data points
Writing L = β₀ + β₁x, the logistic (sigmoid) function is:

ŷ = 1 / (1 + e^(−L))

At input L = 0, the output ŷ = 0.5. As L increases, ŷ increases towards 1.

In terms of the input features:

P(Y = 1|X) = 1 / (1 + e^(−(β₀ + β₁x)))

(where P(Y = 1|X) is the probability of Y = 1 given X input features). The logistic curve has been transformed to fit the input data.
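The behaviour of the sigmoid is easy to verify numerically; a minimal Python sketch (the function name is our own):

```python
import math

def sigmoid(L):
    """Logistic (sigmoid) function: maps L = b0 + b1*x to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-L))

print(sigmoid(0))   # 0.5 exactly, as stated above
print(sigmoid(4))   # ~0.98: output rises towards 1 as L increases
print(sigmoid(-4))  # ~0.02: output falls towards 0 as L decreases
```

Note the symmetry: sigmoid(L) + sigmoid(−L) = 1, which is why the 0.5 threshold corresponds to L = 0.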
[Charts: logistic regression fits an S-shaped curve to the 0/1 data; linear regression fits a straight line that overshoots it]

Logistic regression provides a better fit for classifying data:
• Probabilities always sit between 0 and 1.
• More suitable for predicting a binary outcome.
• More suitable for Classification.

Linear regression is a poor fit for classifying data:
• Probabilities are not limited to a sensible range.
• Suited to predicting the number of something, rather than a class.
• Probabilities are the chance of something happening, relative to all outcomes.
Example: for outcomes W W W W W W L L L L, p(Win) = 60% and p(Lose) = 40%.
P(Y = 1|X) = 1 / (1 + e^(−(β₀ + β₁x)))    (here x is our input variable)

Rearranging:

ln( P(Y = 1) / (1 − P(Y = 1)) ) = ln( P(Y = 1) / P(Y = 0) ) = β₀ + β₁x

Log(odds of being 1) = β₀ + β₁x    Odds of being 1 = exp(β₀ + β₁x)
• For every unit change in x, the log odds will change by β₁.
• For every unit change in x, the odds will change by a factor of exp(β₁).
You are tasked by your company with predicting the likelihood of purchase of medical supplies (Y).
• X1 = Customer tenure
• X2 = Purchased in the last year

Odds of Purchase = exp(β₀ + β_Tenure·x₁ + β_Purchased·x₂)

Coefficients and Odds Interpretations:
• exp(β_Tenure) = 1.82
"Every extra year as a customer increases the odds of purchase by a factor of 1.82."
• exp(β_Purchased) = 1.22
"Customers that purchased in the last year have 22% higher odds to buy again this year."
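These interpretations can be checked with a short sketch. The β values below are hypothetical, back-solved so that exp(β) matches the 1.82 and 1.22 figures above:

```python
import math

# Hypothetical fitted coefficients, chosen so exp(beta) matches the figures above
beta_tenure = math.log(1.82)
beta_purchased = math.log(1.22)

# Each extra unit of x multiplies the odds by exp(beta)
print(round(math.exp(beta_tenure), 2))      # 1.82: one extra year of tenure
print(round(math.exp(beta_purchased), 2))   # 1.22: 22% higher odds

# Effects compound multiplicatively: two extra years multiply the odds by 1.82^2
print(round(math.exp(2 * beta_tenure), 2))  # 3.31
```

The last line shows why coefficients are additive on the log-odds scale but multiplicative on the odds scale.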
Learning objectives:
• Learn about the basic Classification algorithms and when and how to use them
• Understand the underlying considerations and assumptions of these algorithms
• Learn the inner workings of the algorithms and understand how they categorize observations
• Understand how to interpret the parameters, outputs and evaluation metrics for the algorithms
• Be able to apply these to real-world scenarios and learn about their applications
• Apply each classification algorithm in Python and identify the quality of results from each.
• Naïve Bayes is a probabilistic model based on Bayes theorem; it gives us classifications based on
probabilities
• Bayes theorem generates the probability of one event, given the probabilities of other events
Strengths
• Easy to use and good for large datasets
• Can be used to solve multi-class prediction problems

Weaknesses
• Assumes independence in features, which makes it less applicable on most real-world datasets
• Bayes Theorem states the probability (P) of an event A happening given that an event B occurred
can be given by:
P(A|B) = P(B|A) · P(A) / P(B)

where A is the hypothesis or outcome variable, and B is the evidence or features.
We want to predict the likelihood of an email being spam, given it contains a grammatical error.
Out of 20 emails, 4 are Spam and 16 are Not Spam, so:
• P(Spam) = 4/20 = 0.20
• P(Error|Spam) = 0.75
• P(Error) = 0.35
• P(Spam|Error) = P(Error|Spam) · P(Spam) / P(Error) = (0.75 × 0.20) / 0.35 = 0.43, or 43% likely
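The same calculation in Python, using the numbers from the 20-email example:

```python
# Numbers from the 20-email example above
p_spam = 4 / 20            # P(A):   4 of the 20 emails are spam
p_error_given_spam = 0.75  # P(B|A): share of spam emails containing an error
p_error = 0.35             # P(B):   share of all emails containing an error

# Bayes theorem: P(A|B) = P(B|A) * P(A) / P(B)
p_spam_given_error = p_error_given_spam * p_spam / p_error
print(round(p_spam_given_error, 2))  # 0.43, i.e. 43% likely
```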
KNN assigns output classes based on the most similar observations in our sample space.
[Chart: observations plotted in feature space — Class 1: Not Spam, Class 2: Unsafe, Class 3: Spam]

• As the misclassified point p1 (Not Spam) has a green (Spam) point closest, a value of K=1 misclassifies the Not Spam email as Spam. The same thing happens in the case of p2.
• Increasing K improves prediction: by averaging over the distances, the algorithm selects the most suitable points.
• A very large K (e.g. K=1000) will not identify categories in the data at all.
• An optimal value of K must be selected depending on the size of the dataset.

Once we have optimized this number of nearest neighbours, we can visualize our decision boundary, which represents the boundary between Spam and Not Spam.
Strengths
• Simple and easy implementation (no assumptions)
• Can be used for classification and regression

Weaknesses
• Performance decreases as the number of examples and/or independent variables increases
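In the course we apply KNN via sklearn's KNeighborsClassifier; to make the majority-vote mechanics concrete, here is a from-scratch sketch. The 2-D coordinates and labels are invented purely for illustration:

```python
from collections import Counter
import math

def knn_predict(train, query, k):
    """train: list of ((x, y), label). Classify query by majority vote of its k nearest points."""
    by_distance = sorted(train, key=lambda item: math.dist(item[0], query))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

# Made-up email features: a Spam point sits inside the Not Spam cluster (like p1 above)
emails = [((1.0, 1.0), "Not Spam"), ((1.2, 0.8), "Not Spam"), ((1.1, 1.3), "Not Spam"),
          ((3.0, 3.2), "Spam"), ((3.1, 2.9), "Spam"), ((1.4, 1.1), "Spam")]

# K=1 snaps to the single nearest neighbour (the stray Spam point);
# K=3 lets the surrounding Not Spam points outvote it.
print(knn_predict(emails, (1.3, 1.0), k=1))  # Spam
print(knn_predict(emails, (1.3, 1.0), k=3))  # Not Spam
```

This reproduces the p1 scenario: K=1 misclassifies, and a slightly larger K corrects it.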
Support Vector Machines

Strengths
• Uses a subset of training points in the decision function, so it is memory efficient

Weaknesses
• The algorithm does not directly provide probability estimates; these are calculated using expensive techniques
A decision tree is a cascading set of questions, used to incrementally separate classes and improve
predictive power.
Strengths
• Simple to understand and interpret (visualise)
• Used for classification or regression
• Requires little data preparation as it can handle both numerical and categorical data

Weaknesses
• Tends to overfit (the trained tree doesn't generalise well to unseen data)
• Small changes in data tend to cause big differences in the tree (instability)
• Can become expensive to compute with high dimensionality
[Tree: root node "Grammatical Error?" (Spam: 4, Not Spam: 16); the Yes branch leads to (Spam: 3, Not Spam: 4) and the No branch to (Spam: 1, Not Spam: 12)]

• Each node represents a feature/attribute of the data set and branches represent a decision
• The decision tree starts at the root node
• We go down the tree asking true/false questions at decision nodes…
• …until the leaf node (or outcome) is reached
• Once the tree is completed, it can be used to evaluate each email by answering the questions until a leaf node is reached and a prediction is made.
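The evaluation step can be sketched as a loop over true/false questions. The first split is the "Grammatical Error?" node from the example; the second question ("malicious link?") is invented purely for illustration:

```python
def predict(node, email):
    """Walk from the root, answering true/false questions until a leaf is reached."""
    while "leaf" not in node:
        answer = email[node["question"]]
        node = node[answer]
    return node["leaf"]

# Hypothetical tree: root split from the example; deeper split invented for illustration
tree = {"question": "grammatical_error",
        True:  {"question": "malicious_link",
                True:  {"leaf": "Spam"},
                False: {"leaf": "Not Spam"}},
        False: {"leaf": "Not Spam"}}

print(predict(tree, {"grammatical_error": True, "malicious_link": True}))   # Spam
print(predict(tree, {"grammatical_error": True, "malicious_link": False}))  # Not Spam
```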
Majority Voting
Strengths
• Improved accuracy (reduces overfitting) and more powerful than decision trees
• Used in regression and classification

Weaknesses
• The complexity of the algorithm can cause it to become slow and inefficient
• Real-time predictions can be slow due to large inputs
Learning objectives:
• Understand the basic outcomes of classification and how they are represented visually in a confusion matrix.
• Learn how to evaluate and compare models with evaluation metrics such as accuracy, precision and recall.
• Understand when certain metrics may be more appropriate than others.
• Learn more advanced evaluation metrics that build on basic Precision and Recall, such as F-scores.
• Understand how to interpret AUC-ROC curves and use them to compare model results.
• Implement the above model evaluation techniques in Python using sklearn.
• Evaluation metrics are important because they ensure that the model is performing correctly
• To set up our model evaluation, the dataset is often split into training and testing data
• The testing data is used to test how well the model performs on new unseen data
• Better evaluation results ensure that the model can be used reliably on new data in the real world
[Confusion matrix: Prediction (Negative (0) / Positive (1)) against Actual (Negative (0) / Positive (1)), giving TN, FP, FN and TP]
There are four key metrics that can help summarize the observations in the confusion matrix:
Accuracy = (TN + TP) / Total Predictions
Consider an example where we have 100 emails and we use a model to predict whether an email is
spam or not. Here SPAM emails is our positive class and NOT SPAM is our negative class.
Confusion matrix: Actual Not Spam → TN = 90, FP = 2; Actual Spam → FN = 3, TP = 5.

Accuracy = (TN + TP) / (TN + TP + FP + FN) = 95 / (90 + 5 + 2 + 3) = 0.95
Precision = TP / (TP + FP) = 5 / (5 + 2) = 0.71
Recall = TP / (TP + FN) = 5 / (5 + 3) = 0.62

For spam filtering:
• we would not want any SPAM emails reaching our inbox (as that email could be dangerous)
• we would not want any NOT SPAM emails going into our SPAM box (as that email could be important to us).
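The three metrics can be computed directly from the confusion matrix counts in the example:

```python
# Confusion matrix counts from the 100-email example
TN, FP, FN, TP = 90, 2, 3, 5

accuracy = (TN + TP) / (TN + TP + FP + FN)  # share of all predictions that are right
precision = TP / (TP + FP)                  # of predicted Spam, how many really are
recall = TP / (TP + FN)                     # of actual Spam, how many we caught

print(round(accuracy, 2))   # 0.95
print(round(precision, 2))  # 0.71
print(round(recall, 2))     # 0.62
```

In practice, sklearn.metrics (confusion_matrix, precision_score, recall_score) computes these from raw label arrays.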
Precision Vs Recall
• Accuracy works well for balanced classes (having roughly equal number of samples of every class)
• Precision is a good choice of metric when we have imbalanced classes and we want to minimize false positives.
• Recall is a good choice of metric when we have imbalanced classes and we want to minimize false
negatives.
• In tumour risk detection, we would want to reduce the number of False Negatives.
• We cannot afford to miss any malignant samples in the data.
• Recall would be the preferred metric here because it measures the proportion of actual malignant
tumours that we detected.
• Misclassifying a low risk tumour as risky is obviously not ideal, but is a secondary priority.
• Ideally, a model with both a high recall and precision score would be preferred.
By choosing any value of β, we can modify our equation to control the weight of Precision and Recall in
our calculations.
F_β = (1 + β²) / (β² / Recall + 1 / Precision)

• β = 1: Precision and Recall are weighted equally
• β < 1: focus on precision (β = 0 uses precision only)
• β > 1: focus on recall
Using the confusion matrix above (TN = 90, FP = 2, FN = 3, TP = 5), Precision = 0.71 and Recall = 0.62.

• We can calculate F₁ as:

F₁ = (1 + 1²) / (1² / 0.62 + 1 / 0.71) = 0.66

• And we can calculate F₂ as:

F₂ = (1 + 2²) / (2² / 0.62 + 1 / 0.71) ≈ 0.63

• In this example, F₁ would be a relatively better choice as we prefer both Precision and Recall somewhat equally here.
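A small helper reproduces these calculations (the function is our own sketch; sklearn.metrics.fbeta_score computes the same thing from raw labels):

```python
def f_beta(precision, recall, beta):
    """F-beta score: beta > 1 weights recall more heavily, beta < 1 weights precision."""
    return (1 + beta**2) / (beta**2 / recall + 1 / precision)

precision, recall = 0.71, 0.62
print(round(f_beta(precision, recall, beta=1), 2))  # 0.66
print(f_beta(precision, recall, beta=2))            # ~0.636, the F2 figure above
```

Note that f_beta(p, r, 1) is the familiar harmonic mean of precision and recall.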
Confusion matrix: Actual Not Spam → TN = 950, FP = 30; Actual Spam → FN = 17, TP = 3.

Accuracy = (TN + TP) / (TN + TP + FP + FN) = 953 / (950 + 3 + 30 + 17) = 0.95 = 95%

However, we are failing to correctly predict the class we care about, the SPAM emails, and we can see it in the poor performance of the other metrics:
• Precision: 3/33 = 10%
• Recall: 3/20 = 15%
The Receiver Operating Characteristic (ROC) curve visualizes the model performance and is a
useful way to evaluate the results of a binary classification model.
[Chart: ROC curve plotting True Positive Rate (TPR) against False Positive Rate (FPR)]
Underfitting and Overfitting are terms that help us summarize the performance of a model.

Underfitting
• Is too simple for the scenario
• Does not perform well on the training data
• Oversimplifies patterns in the data
• Will not perform well on testing data

Overfitting
• Learns the training data too well (effectively learns the answers)
• Will perform very well on training data
• But is unable to look at the bigger picture and generalize trends
• Will likely perform badly on testing (new) data
• What are the main drivers for making this class prediction?
• What are the main drivers for predicting this customer will purchase my product?
• How can we explain the difference between one prediction and the next?
Interpretability vs Accuracy
• Basic models, like Linear models or Naïve Bayes models, are highly interpretable and user-friendly.
• A model is accurate if it can make higher quality predictions on unseen data.
• For better accuracy in models that use real-world data, we tend to require more complex models (often less interpretable) like Neural Networks.

[Chart: trade-off from high interpretability to high accuracy — Logistic Regression, SVMs, Random Forests, Neural Networks]
We must consider the extent to which we are required to dissect and provide interpretable outputs.
Interpretability A model is interpretable if you can work out the simple cause and effect
relationship between inputs and outputs with relatively simple logic.
What is happening?
Why is it happening?
Explainability A model is explainable if we can fully dissect each part of the model and explain its role in the decision-making process.
White Box Models are easy to interpret and explain. Black Box Models make it difficult to pinpoint causality for a particular outcome.
• The higher the total importance, the more influence that feature has on what we want to predict.
• Importance can be measured in different ways, depending on how we define importance, e.g. the number of times a feature is used to split a branch in a decision tree.

[Chart: feature importances for Genetics, Age, Sedentary and Smoker — we can see genetics, age and being sedentary have high importance when predicting whether a tumor is malignant.]
Strengths
• Model learns better by prioritising important features
• Training time and memory requirements reduce

Weaknesses
• Mostly rely on approximations, other than for linear models
• Most methods cannot deal with high-dimensional data
Strengths
• The interpretation of the plots is intuitive and easy to understand

Weaknesses
• Assumes independence in the features
• Limited interpretability with more than 2 features

[Chart: partial dependence plot of the prediction against Height]
Instead of asking:
• How much is the prediction of this patient's tumour being malignant driven by the fact that she is over 65 years old?

[Chart: SHAP values decomposing one prediction across features such as Height, Weight and Genetics — Genetics was the highest driver behind this positive prediction.]
Strengths
• Makes black-box models easy to explain to all audiences
• Allows for decomposing each prediction by all the features

Weaknesses
• SHAP can be slow to plot
• It is possible to create intentionally misleading interpretations that can hide biases