Course Slides - Classification
Learning objectives:
• Understand what Classification is and its applicability to many real-world scenarios
• Perform simple classification tasks using logistic regression in Excel
• Understand the implicit assumptions behind Classification techniques and algorithms
• Create classification models in Python using statsmodels and sklearn modules
• Interpret and evaluate the performance of classification models, outputs and parameters
• Explore more advanced evaluation techniques such as PDP plots and SHAP values to expand your horizons.
Classification Basics
[Examples of binary output classes: Genuine vs Forged; Default vs Not Default]
• Supervised learning uses labelled datasets to train algorithms in classifying data.
• Unsupervised learning uses unlabelled datasets to train algorithms in classifying data.
• Reinforcement learning uses reward maximization such that the algorithm determines the optimal behaviour in an environment.
• Deep learning learns data patterns and structure from the data itself and is scalable to big data. It can be used for supervised and unsupervised learning.
Multi-Label
• Has two or more class labels
• Outcome can be ONE or MORE of the class labels
• Differs from multi-class, where the outcome is exactly one of the class labels

Use Case: Tumour diagnosis
• Input Variables: Variation, Texture, Contrast, Growth Rate etc.
• Output Classes (Labels): Malignant OR/AND Benign OR/AND Premalignant
Binary
• Machinery Outage Prediction - Failure OR Not Failure
• Anomaly Detection – Fraud OR Not Fraud
• Credit Card Default - Customer Likely to Default OR Not Likely to Default
Multi-Class
• Product Classification – Red Wine OR White Wine OR Rose Wine
• News Classification of Articles – Sports OR Lifestyle OR Economy OR Current Affairs
• Facial Image Recognition – Happy OR Sad OR Angry
Multi-Label
• Social Tag Selection - #TogetherAtHome OR / AND #COVID19 OR / AND #WorkFromHome
BIDA TM - Business Intelligence & Data Analysis
Visualizing Classification
For simple scenarios, it can help to visualize the input variables and output classes on a chart.
[Charts: emails plotted by Grammatical Errors and Malicious Content, with the Spam and Not Spam classes separated]
In each case, the goal is to use the input data to separate the classes as cleanly as possible.
Throughout the course, we’ll explore the most common classification algorithms.
We will look at the benefits of each and how to interpret and evaluate the outputs.
Learning objectives:
• Understand and learn the fundamental concepts of Logistic Regression
• Understand and calculate the probabilities and log odds, and how these impact model interpretation
• Learn how to interpret log odds and the assumptions behind Logistic Regression
• Be comfortable with the mathematics behind Logistic Regression and how to manipulate it
• Be able to comprehend and interpret the outputs of the logistic regression algorithm for business scenarios
• Practice a basic logistic regression example in Excel and Python.
We determine a threshold (typically 0.5 or 50%) as the cut-off between prediction classes.
In this case: customers below the threshold are predicted to NOT buy;
customers above the threshold are predicted to buy.
Logistic Regression
Changing the threshold will change the prediction class of some observations.
[Chart: example customers with predicted probabilities of 48%, 35% and 80%, either side of the threshold]
We can use evaluation metrics to help us decide on the most appropriate threshold.
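The threshold logic above can be sketched in a few lines of Python. The 48%, 35% and 80% probabilities are the example values from the chart; the helper function itself is illustrative, not part of the course materials:

```python
def classify(probabilities, threshold=0.5):
    """Map predicted purchase probabilities to prediction classes using a cut-off."""
    return ["Buy" if p >= threshold else "Not Buy" for p in probabilities]

probs = [0.48, 0.35, 0.80]               # example predicted probabilities
print(classify(probs))                   # threshold 0.5: ['Not Buy', 'Not Buy', 'Buy']
print(classify(probs, threshold=0.4))    # lowering it flips the 48% customer to 'Buy'
```

Lowering the threshold changes the prediction class of the 48% customer, which is exactly the effect described above.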
Logistic Regression probabilities are estimated using one or more input variables.

[Chart: predicted probability of buying (Y) against Customer Age (X), ages 30–55; customers above the 0.5 threshold are predicted to BUY, those below to NOT buy. Customers who did NOT buy medical supplies sit at probability 0.]
Logistic Regression uses a curved line to summarize our observed data points
Writing L = β₀ + β₁x, the logistic (sigmoid) function is:

ŷ = 1 / (1 + e^(−L))

At input L = 0, the output ŷ = 0.5. As L increases, ŷ increases towards 1.

In terms of the input features:

P(Y = 1|X) = 1 / (1 + e^(−(β₀ + β₁x)))

(where P(Y = 1|X) is the probability of Y = 1 given X input features). The logistic curve has been transformed to fit the input data.
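The behaviour of the sigmoid is easy to verify numerically; a minimal Python sketch (the function name is our own):

```python
import math

def sigmoid(L):
    """Logistic (sigmoid) function: maps L = b0 + b1*x to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-L))

print(sigmoid(0))   # 0.5 exactly, as stated above
print(sigmoid(4))   # ~0.98: output rises towards 1 as L increases
print(sigmoid(-4))  # ~0.02: output falls towards 0 as L decreases
```

Note the symmetry: sigmoid(L) + sigmoid(−L) = 1, which is why the 0.5 threshold corresponds to L = 0.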
[Charts: logistic regression fits an S-shaped curve to the 0/1 data; linear regression fits a straight line that overshoots it]

Logistic regression provides a better fit for classifying data:
• Probabilities always sit between 0 and 1.
• More suitable for predicting a binary outcome.
• More suitable for Classification.

Linear regression is a poor fit for classifying data:
• Probabilities are not limited to a sensible range.
• Suited to predicting the number of something, rather than a class.
• Probabilities are the chance of something happening, relative to all outcomes.
Example: for outcomes W W W W W W L L L L, p(Win) = 60% and p(Lose) = 40%.
P(Y = 1|X) = 1 / (1 + e^(−(β₀ + β₁x)))    (here x is our input variable)

Rearranging:

ln( P(Y = 1) / (1 − P(Y = 1)) ) = ln( P(Y = 1) / P(Y = 0) ) = β₀ + β₁x

Log(odds of being 1) = β₀ + β₁x    Odds of being 1 = exp(β₀ + β₁x)
• For every unit change in x, the log odds will change by β₁.
• For every unit change in x, the odds will change by a factor of exp(β₁).
You are tasked by your company with predicting the likelihood of purchase of medical supplies (Y).
• X1 = Customer tenure
• X2 = Purchased in the last year

Odds of Purchase = exp(β₀ + β_Tenure·x₁ + β_Purchased·x₂)

Coefficients and Odds Interpretations:
• exp(β_Tenure) = 1.82
"Every extra year as a customer increases the odds of purchase by a factor of 1.82."
• exp(β_Purchased) = 1.22
"Customers that purchased in the last year have 22% higher odds to buy again this year."
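These interpretations can be checked with a short sketch. The β values below are hypothetical, back-solved so that exp(β) matches the 1.82 and 1.22 figures above:

```python
import math

# Hypothetical fitted coefficients, chosen so exp(beta) matches the figures above
beta_tenure = math.log(1.82)
beta_purchased = math.log(1.22)

# Each extra unit of x multiplies the odds by exp(beta)
print(round(math.exp(beta_tenure), 2))      # 1.82: one extra year of tenure
print(round(math.exp(beta_purchased), 2))   # 1.22: 22% higher odds

# Effects compound multiplicatively: two extra years multiply the odds by 1.82^2
print(round(math.exp(2 * beta_tenure), 2))  # 3.31
```

The last line shows why coefficients are additive on the log-odds scale but multiplicative on the odds scale.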
Learning objectives:
• Learn about the basic Classification algorithms and when and how to use them
• Understand the underlying considerations and assumptions of these algorithms
• Learn the inner workings of the algorithms and understand how they categorize observations
• Understand how to interpret the parameters, outputs and evaluation metrics for the algorithms
• Be able to apply these to real-world scenarios and learn about their applications
• Apply each classification algorithm in Python and identify the quality of results from each.
• Naïve Bayes is a probabilistic model based on Bayes theorem; it gives us classifications based on
probabilities
• Bayes theorem generates the probability of one event, given the probabilities of other events
Strengths
• Easy to use and good for large datasets
• Can be used to solve multi-class prediction problems

Weaknesses
• Assumes independence in features, which makes it less applicable on most real-world datasets
• Bayes Theorem states the probability (P) of an event A happening given that an event B occurred
can be given by:
P(A|B) = P(B|A) · P(A) / P(B)

where A is the hypothesis or outcome variable, and B is the evidence or features.
We want to predict the likelihood of an email being spam, given it contains a grammatical error.
Out of 20 emails, 4 are Spam and 16 are Not Spam, so:
• P(Spam) = 4/20 = 0.20
• P(Error|Spam) = 0.75
• P(Error) = 0.35
• P(Spam|Error) = P(Error|Spam) · P(Spam) / P(Error) = (0.75 × 0.20) / 0.35 = 0.43, or 43% likely
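The same calculation in Python, using the numbers from the 20-email example:

```python
# Numbers from the 20-email example above
p_spam = 4 / 20            # P(A):   4 of the 20 emails are spam
p_error_given_spam = 0.75  # P(B|A): share of spam emails containing an error
p_error = 0.35             # P(B):   share of all emails containing an error

# Bayes theorem: P(A|B) = P(B|A) * P(A) / P(B)
p_spam_given_error = p_error_given_spam * p_spam / p_error
print(round(p_spam_given_error, 2))  # 0.43, i.e. 43% likely
```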
KNN assigns output classes based on the most similar observations in our sample space.
[Chart: observations plotted in feature space — Class 1: Not Spam, Class 2: Unsafe, Class 3: Spam]

• As the misclassified point p1 (Not Spam) has a green (Spam) point closest, a value of K=1 misclassifies the Not Spam email as Spam. The same thing happens in the case of p2.
• Increasing K improves prediction: by averaging over the distances, the algorithm selects the most suitable points.
• A very large K (e.g. K=1000) will not identify categories in the data at all.
• An optimal value of K must be selected depending on the size of the dataset.

Once we have optimized this number of nearest neighbours, we can visualize our decision boundary, which represents the boundary between Spam and Not Spam.
Strengths
• Simple and easy implementation (no assumptions)
• Can be used for classification and regression

Weaknesses
• Performance decreases as the number of examples and/or independent variables increases
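In the course we apply KNN via sklearn's KNeighborsClassifier; to make the majority-vote mechanics concrete, here is a from-scratch sketch. The 2-D coordinates and labels are invented purely for illustration:

```python
from collections import Counter
import math

def knn_predict(train, query, k):
    """train: list of ((x, y), label). Classify query by majority vote of its k nearest points."""
    by_distance = sorted(train, key=lambda item: math.dist(item[0], query))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

# Made-up email features: a Spam point sits inside the Not Spam cluster (like p1 above)
emails = [((1.0, 1.0), "Not Spam"), ((1.2, 0.8), "Not Spam"), ((1.1, 1.3), "Not Spam"),
          ((3.0, 3.2), "Spam"), ((3.1, 2.9), "Spam"), ((1.4, 1.1), "Spam")]

# K=1 snaps to the single nearest neighbour (the stray Spam point);
# K=3 lets the surrounding Not Spam points outvote it.
print(knn_predict(emails, (1.3, 1.0), k=1))  # Spam
print(knn_predict(emails, (1.3, 1.0), k=3))  # Not Spam
```

This reproduces the p1 scenario: K=1 misclassifies, and a slightly larger K corrects it.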
Support Vector Machines

Strengths
• Uses a subset of training points in the decision function, so it is memory efficient

Weaknesses
• The algorithm does not directly provide probability estimates; these are calculated using expensive techniques
A decision tree is a cascading set of questions, used to incrementally separate classes and improve
predictive power.
Strengths
• Simple to understand and interpret (visualise)
• Used for classification or regression
• Requires little data preparation as it can handle both numerical and categorical data

Weaknesses
• Tends to overfit (the trained tree doesn't generalise well to unseen data)
• Small changes in data tend to cause big differences in the tree (instability)
• Can become expensive to compute with high dimensionality
[Tree: root node "Grammatical Error?" (Spam: 4, Not Spam: 16); the Yes branch leads to (Spam: 3, Not Spam: 4) and the No branch to (Spam: 1, Not Spam: 12)]

• Each node represents a feature/attribute of the data set and branches represent a decision
• The decision tree starts at the root node
• We go down the tree asking true/false questions at decision nodes…
• …until the leaf node (or outcome) is reached
• Once the tree is completed, it can be used to evaluate each email by answering the questions until a leaf node is reached and a prediction is made.
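The evaluation step can be sketched as a loop over true/false questions. The first split is the "Grammatical Error?" node from the example; the second question ("malicious link?") is invented purely for illustration:

```python
def predict(node, email):
    """Walk from the root, answering true/false questions until a leaf is reached."""
    while "leaf" not in node:
        answer = email[node["question"]]
        node = node[answer]
    return node["leaf"]

# Hypothetical tree: root split from the example; deeper split invented for illustration
tree = {"question": "grammatical_error",
        True:  {"question": "malicious_link",
                True:  {"leaf": "Spam"},
                False: {"leaf": "Not Spam"}},
        False: {"leaf": "Not Spam"}}

print(predict(tree, {"grammatical_error": True, "malicious_link": True}))   # Spam
print(predict(tree, {"grammatical_error": True, "malicious_link": False}))  # Not Spam
```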
Majority Voting
Strengths
• Improved accuracy (reduces overfitting) and more powerful than decision trees
• Used in regression and classification

Weaknesses
• The complexity of the algorithm can cause it to become slow and inefficient
• Real-time predictions can be slow due to large inputs
Learning objectives:
• Understand the basic outcomes of classification and how they are represented visually in a confusion matrix.
• Learn how to evaluate and compare models with evaluation metrics such as accuracy, precision and recall.
• Understand when certain metrics may be more appropriate than others.
• Learn more advanced evaluation metrics that build on basic Precision and Recall, such as F-scores.
• Understand how to interpret AUC-ROC curves and use them to compare model results.
• Implement the above model evaluation techniques in Python using sklearn.
• Evaluation metrics are important because they ensure that the model is performing correctly
• To set up our model evaluation, the dataset is often split into training and testing data
• The testing data is used to test how well the model performs on new unseen data
• Better evaluation results ensure that the model can be used reliably on new data in the real world
[Confusion matrix: Prediction (Negative (0) / Positive (1)) against Actual (Negative (0) / Positive (1)), giving TN, FP, FN and TP]
There are four key metrics that can help summarize the observations in the confusion matrix:
Accuracy = (TN + TP) / Total Predictions
Consider an example where we have 100 emails and we use a model to predict whether an email is
spam or not. Here SPAM emails is our positive class and NOT SPAM is our negative class.
Confusion matrix: Actual Not Spam → TN = 90, FP = 2; Actual Spam → FN = 3, TP = 5.

Accuracy = (TN + TP) / (TN + TP + FP + FN) = 95 / (90 + 5 + 2 + 3) = 0.95
Precision = TP / (TP + FP) = 5 / (5 + 2) = 0.71
Recall = TP / (TP + FN) = 5 / (5 + 3) = 0.62

For spam filtering:
• we would not want any SPAM emails reaching our inbox (as that email could be dangerous)
• we would not want any NOT SPAM emails going into our SPAM box (as that email could be important to us).
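The three metrics can be computed directly from the confusion matrix counts in the example:

```python
# Confusion matrix counts from the 100-email example
TN, FP, FN, TP = 90, 2, 3, 5

accuracy = (TN + TP) / (TN + TP + FP + FN)  # share of all predictions that are right
precision = TP / (TP + FP)                  # of predicted Spam, how many really are
recall = TP / (TP + FN)                     # of actual Spam, how many we caught

print(round(accuracy, 2))   # 0.95
print(round(precision, 2))  # 0.71
print(round(recall, 2))     # 0.62
```

In practice, sklearn.metrics (confusion_matrix, precision_score, recall_score) computes these from raw label arrays.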
Precision Vs Recall
• Accuracy works well for balanced classes (having roughly equal number of samples of every class)
• Precision is a good choice of metric when we have imbalanced classes and we want to minimize false positives.
• Recall is a good choice of metric when we have imbalanced classes and we want to minimize false
negatives.
• In tumour risk detection, we would want to reduce the number of False Negatives.
• We cannot afford to miss any malignant samples in the data.
• Recall would be the preferred metric here because it measures the proportion of actual malignant
tumours that we detected.
• Misclassifying a low risk tumour as risky is obviously not ideal, but is a secondary priority.
• Ideally, a model with both a high recall and precision score would be preferred.
By choosing any value of β, we can modify our equation to control the weight of Precision and Recall in
our calculations.
F_β = (1 + β²) / (β² / Recall + 1 / Precision)

• β = 1: Precision and Recall are weighted equally
• β < 1: focus on precision (β = 0 uses precision only)
• β > 1: focus on recall
Using the confusion matrix above (TN = 90, FP = 2, FN = 3, TP = 5), Precision = 0.71 and Recall = 0.62.

• We can calculate F₁ as:

F₁ = (1 + 1²) / (1² / 0.62 + 1 / 0.71) = 0.66

• And we can calculate F₂ as:

F₂ = (1 + 2²) / (2² / 0.62 + 1 / 0.71) ≈ 0.63

• In this example, F₁ would be a relatively better choice as we prefer both Precision and Recall somewhat equally here.
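A small helper reproduces these calculations (the function is our own sketch; sklearn.metrics.fbeta_score computes the same thing from raw labels):

```python
def f_beta(precision, recall, beta):
    """F-beta score: beta > 1 weights recall more heavily, beta < 1 weights precision."""
    return (1 + beta**2) / (beta**2 / recall + 1 / precision)

precision, recall = 0.71, 0.62
print(round(f_beta(precision, recall, beta=1), 2))  # 0.66
print(f_beta(precision, recall, beta=2))            # ~0.636, the F2 figure above
```

Note that f_beta(p, r, 1) is the familiar harmonic mean of precision and recall.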
Confusion matrix: Actual Not Spam → TN = 950, FP = 30; Actual Spam → FN = 17, TP = 3.

Accuracy = (TN + TP) / (TN + TP + FP + FN) = 953 / (950 + 3 + 30 + 17) = 0.95 = 95%

However, we are failing to correctly predict the class we care about, the SPAM emails, and we can see it in the poor performance of the other metrics:
• Precision: 3/33 = 10%
• Recall: 3/20 = 15%
The Receiver Operating Characteristic (ROC) curve visualizes the model performance and is a
useful way to evaluate the results of a binary classification model.
[Chart: ROC curve plotting True Positive Rate (TPR) against False Positive Rate (FPR)]
Underfitting and Overfitting are terms that help us summarize the performance of a model.

Underfitting
• Is too simple for the scenario
• Does not perform well on the training data
• Oversimplifies patterns in the data
• Will not perform well on testing data

Overfitting
• Learns the training data too well (effectively learns the answers)
• Will perform very well on training data
• But is unable to look at the bigger picture and generalize trends
• Will likely perform badly on testing (new) data
• What are the main drivers for making this class prediction?
• What are the main drivers for predicting this customer will purchase my product?
• How can we explain the difference between one prediction and the next?
Interpretability vs Accuracy
• Basic models, like Linear models or Naïve Bayes models, are highly interpretable and user-friendly.
• A model is accurate if it can make higher quality predictions on unseen data.
• For better accuracy in models that use real-world data, we tend to require more complex models (often less interpretable) like Neural Networks.

[Chart: trade-off from high interpretability to high accuracy — Logistic Regression, SVMs, Random Forests, Neural Networks]
We must consider the extent to which we are required to dissect and provide interpretable outputs.
Interpretability A model is interpretable if you can work out the simple cause and effect
relationship between inputs and outputs with relatively simple logic.
What is happening?
Why is it happening?
Explainability A model is explainable if we can fully dissect each part of the model and explain its role in the decision-making process.
White Box Models are easy to interpret and explain. Black Box Models make it difficult to pinpoint causality for a particular outcome.
• The higher the total importance, the more influence that feature has on what we want to predict.
• Importance can be measured in different ways, depending on how we define importance, e.g. the number of times a feature is used to split a branch in a decision tree.

[Chart: feature importances for Genetics, Age, Sedentary and Smoker — we can see genetics, age and being sedentary have high importance when predicting whether a tumor is malignant.]
Strengths
• Model learns better by prioritising important features
• Training time and memory requirements reduce

Weaknesses
• Mostly rely on approximations, other than for linear models
• Most methods cannot deal with high-dimensional data
Strengths
• The interpretation of the plots is intuitive and easy to understand

Weaknesses
• Assumes independence in the features
• Limited interpretability with more than 2 features

[Chart: partial dependence plot of the prediction against Height]
Instead of asking:
• How much is the prediction of this patient's tumour being malignant driven by the fact that she is over 65 years old?

[Chart: SHAP values decomposing one prediction across features such as Height, Weight and Genetics — Genetics was the highest driver behind this positive prediction.]
Strengths
• Makes black-box models easy to explain to all audiences
• Allows for decomposing each prediction by all the features

Weaknesses
• SHAP can be slow to plot
• It is possible to create intentionally misleading interpretations that can hide biases