INSY446 - 4 - Classification Part 1

INSY 446 – Winter 2023

Data Mining for Business


Analytics

Session 4 – Classification Model


January 30, 2023
Dongliang Sheng
Classification Task
§ Classification is a supervised machine learning
task where the target variable is categorical
§ It is probably the most common data mining
task
§ Examples
– Banking: determine whether a mortgage application
is good
– Education: place a student into a particular track
with respect to special needs
– Medicine: diagnose whether a disease is present
– Law: determine if a will is fraudulent
– Security: identify whether a certain financial
transaction represents a terrorist threat
Classification Task
§ For example, the target variable
income_bracket may include the categories
“Low” and “High”
§ The algorithm examines relationships between
the values of the predictors and the target
values
§ Suppose we want to classify a person’s
income bracket based on the age, gender, and
occupation values of others contained in a
dataset
Subject Age Gender Occupation Income Bracket
001 47 F Software Engineer High
002 28 M Marketing Consultant High
003 35 M Unemployed Low
… … … … …
Classification Task
§ First, the classification algorithm examines the
data set values for the predictors and target
variables in the training set
§ This way, the algorithm “learns” which values
of the predictor variables are associated with
values of the target variable
§ For example, older females may be associated
with income_bracket values of “High”
§ Now that the data model is built, the algorithm
examines new records for which
income_bracket is unknown
§ According to classifications in the training set,
the algorithm classifies the new records
Classification Task
§ Most classification models start by calculating
the probability that each new observation
belongs to each class of the target variable
§ For example, the algorithm may estimate the
probability that a 59-year-old female consultant
is in the high-income bracket is 83% while the
probability that she is in the low-income
bracket is 17%
§ Then, based on a confidence threshold, the
algorithm outputs the classification
§ For instance, if the threshold is lower than or
equal to 83%, a 59-year-old female consultant
will be classified as a high-income person
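The threshold step above can be sketched in a few lines (a minimal illustration; the probabilities are the slide's example figures, and the function name is made up for illustration):

```python
def classify(prob_high_income, threshold):
    """Label an observation 'High' when its predicted probability clears the threshold."""
    return "High" if prob_high_income >= threshold else "Low"

# The 59-year-old female consultant: estimated P(High) = 0.83
print(classify(0.83, threshold=0.50))  # threshold below 83% -> High
print(classify(0.83, threshold=0.83))  # threshold equal to 83% -> High
print(classify(0.83, threshold=0.90))  # threshold above 83% -> Low
```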
Evaluating Classification Tasks
§ Obviously, the primary measure is the accuracy of the model
§ Essentially, this measure calculates the percentage of predictions that are correct
§ So, the output is only a single number (e.g., the accuracy of the model is 78.90%)
§ Is this a good measure? Can we say that a model with 99.99% accuracy is always a good model?
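To see why a high accuracy number can be misleading, consider a hypothetical, heavily imbalanced dataset (the data below is made up for illustration):

```python
# Hypothetical fraud detection: 9,999 legitimate transactions (0), 1 fraudulent (1)
y_true = [0] * 9999 + [1]

# A trivial "model" that always predicts the majority class
y_pred = [0] * 10000

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(f"Accuracy: {accuracy:.2%}")  # 99.99%, yet the model never catches the fraud case
```

The model is 99.99% accurate but useless for the task it was built for, which is why the confusion-matrix measures on the following slides matter.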
Evaluating Classification Tasks
§ First, we can extend the accuracy results by developing a confusion matrix
§ Essentially, we compare the predicted outcome with the true outcome

                    Predicted 0            Predicted 1
  Actual 0    True Negative (TN)    False Positive (FP)
  Actual 1    False Negative (FN)    True Positive (TP)
Evaluating Classification Tasks

§ For example, let's consider the following confusion matrix

                Predicted 0    Predicted 1
  Actual 0         18,132            884
  Actual 1          2,578          3,406

§ The overall error rate is (884 + 2,578) / (18,132 + 884 + 2,578 + 3,406) = 13.85% (and the accuracy rate is 86.15%)
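The arithmetic can be verified directly (cell values taken from the confusion matrix above):

```python
TN, FP = 18132, 884   # actual 0: predicted 0 / predicted 1
FN, TP = 2578, 3406   # actual 1: predicted 0 / predicted 1

total = TN + FP + FN + TP
error_rate = (FP + FN) / total
accuracy = (TN + TP) / total

print(f"Error rate: {error_rate:.2%}")  # 13.85%
print(f"Accuracy:   {accuracy:.2%}")    # 86.15%
```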
Evaluating Classification Tasks
§ Suppose we run this classification during the loan application process, where the target variable is whether the client is low-income (0) or high-income (1)
§ The false positive rate (FPR) is 884 / (18,132 + 884) = 4.65%
  "Among low-income clients, the probability that the algorithm would identify them as high-income ones is 4.65%"
§ The false negative rate (FNR) is 2,578 / (2,578 + 3,406) = 43.08%
  "Among high-income clients, the probability that the algorithm would identify them as low-income ones is 43.08%"
Evaluating Classification Tasks
§ The Precision is 3,406 / (884 + 3,406) = 79.39%
  "If the algorithm identifies a client as high-income, the probability that the algorithm is right is 79.39%"
§ The Recall is 3,406 / (3,406 + 2,578) = 56.92%
  "Among high-income clients, the probability that the algorithm would identify them as high-income ones is 56.92%"
§ Is precision or recall more important?
Evaluating Classification Tasks
§ There is an additional "modern" measure to evaluate classification models called the F1 Score

  F1 = 2 × (Precision × Recall) / (Precision + Recall)

§ Because the F1 Score considers both Precision and Recall, it can help in scenarios where it is not clear whether Precision or Recall is more important, but we want to consider both of them in the evaluation
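All of the measures from the last few slides can be recomputed from the same four confusion-matrix cells:

```python
TN, FP = 18132, 884
FN, TP = 2578, 3406

fpr = FP / (TN + FP)                                # false positive rate
fnr = FN / (FN + TP)                                # false negative rate
precision = TP / (TP + FP)
recall = TP / (TP + FN)
f1 = 2 * precision * recall / (precision + recall)

print(f"FPR: {fpr:.2%}   FNR: {fnr:.2%}")           # 4.65%   43.08%
print(f"Precision: {precision:.2%}")                 # 79.39%
print(f"Recall: {recall:.2%}")                       # 56.92%
print(f"F1: {f1:.4f}")                               # 0.6630
```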
Logistic Regression
§ Logistic Regression is one of the most popular classification algorithms (i.e., an algorithm used for classification tasks)
§ Why is linear regression not appropriate when the dependent variable is categorical?
– Binary data typically does not have a normal distribution
– The predicted value of the dependent variable can fall outside the range of 0 to 1
– Probabilities are often not linear in the predictors
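The second point is easy to demonstrate: fit ordinary least squares to a 0/1 outcome and predict outside the observed range (the tiny dataset below is made up for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical disease (0/1) by age
age = np.array([[20], [30], [40], [50], [60], [70]])
disease = np.array([0, 0, 0, 1, 1, 1])

ols = LinearRegression().fit(age, disease)

# The fitted straight line escapes the [0, 1] probability range
print(ols.predict(np.array([[10]])))  # below 0
print(ols.predict(np.array([[90]])))  # above 1
```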
Logistic Regression
§ The Logistic Regression method describes the relationship between a set of predictors and the categorical target variable
§ Generally, we use Logistic Regression when the target variable is binary
§ Example: Suppose researchers are interested in a potential relationship between patient age and the presence/absence of disease
Logistic Regression
§ The plot shows the least squares regression line (straight) and the logistic regression line (curved) for disease on age
§ Linear regression assumes a linear relationship between the variables
§ In contrast, the logistic regression line assumes a non-linear relationship between predictor and response
§ Patient 11's estimation errors (vertical lines) are shown
Logistic Regression

§ Patient 11's estimation error is greater for linear regression than for logistic regression
§ Thus, for this point, and many others, linear regression does a poorer job of estimating disease
Logistic Regression in Python

§ Similar to the case of Linear Regression, you can use either statsmodels or sklearn to perform Logistic Regression
§ Statistics Perspective
– statsmodels package
– obtain stats-related results (t-value, p-value, etc.)
§ Data Mining Perspective
– sklearn package
– results are compatible with standard sklearn functions (cross-validation, error calculation, etc.)
Example 1
statsmodels

# Load libraries
import statsmodels.api

# Load built-in dataset


from sklearn.datasets import load_iris
iris = load_iris()

# Explore the data


print(iris.keys())
print(iris.data.shape)
print(iris.feature_names)
print(iris.target_names)
print(iris.DESCR)

# Setup dependent and independent variables


y = iris.target[0:100]
X = iris.data[0:100,0:1]

# Run Regression
# (note: statsmodels.api.Logit does not add an intercept automatically;
#  use statsmodels.api.add_constant(X) to include one)
logit = statsmodels.api.Logit(y,X)
model = logit.fit()

# View results
model.summary()
Interpreting the Coefficient

§ Unlike Linear Regression, the coefficient of a predictor in Logistic Regression represents the relationship between the predictor and the log-odds of the target variable
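A common way to make a log-odds coefficient interpretable is to exponentiate it: exp(b) is the factor by which the odds of the target change for a one-unit increase in the predictor. A minimal sketch with a hypothetical coefficient value:

```python
import math

b = 0.7  # hypothetical log-odds coefficient for some predictor

odds_ratio = math.exp(b)
print(f"A one-unit increase multiplies the odds by about {odds_ratio:.2f}")  # about 2.01
```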
Example 2
sklearn

# Load libraries
from sklearn.linear_model import LogisticRegression

# Load built-in dataset


from sklearn.datasets import load_iris
iris = load_iris()

# Setup dependent and independent variables


y = iris.target[0:100]
X = iris.data[0:100,0:1]

# Run logistic regression


lr1 = LogisticRegression()
model1 = lr1.fit(X,y)

# Review results
model1.intercept_
model1.coef_
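Once the sklearn model is fitted, it can score new observations; `predict_proba` exposes the class probabilities behind the final labels (a sketch reusing the Example 2 setup; the two sepal-length values are arbitrary):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

iris = load_iris()
y = iris.target[0:100]
X = iris.data[0:100, 0:1]
model1 = LogisticRegression().fit(X, y)

# Probabilities for class 0 and class 1, then the labels (thresholded at 0.5 by default)
print(model1.predict_proba([[4.5], [6.5]]))
print(model1.predict([[4.5], [6.5]]))
```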

Example 3
Real-world data (churn.csv)

# Load libraries
from sklearn.linear_model import LogisticRegression
import pandas

# Load and construct the data
churn_df = pandas.read_csv("C:\\...\\Churn.csv")
X = churn_df.iloc[:,4:12]
y = churn_df["Churn"]

# Run the model
lr2 = LogisticRegression()
model2 = lr2.fit(X, y)

# View results
model2.intercept_
model2.coef_
Cross Validation in Classification

§ In regression tasks, we calculate the MSE to measure the model performance
§ For classification tasks, as discussed earlier, we can calculate the accuracy score to measure the model performance
§ Alternatively, we can also use other measures, including precision, recall, and the F1 score
Example 4
Cross Validation (Accuracy)

# Load libraries
from sklearn.linear_model import LogisticRegression
import pandas

# Load and construct the data
churn_df = pandas.read_csv("Churn.csv")
X = churn_df.iloc[:,4:12]
y = churn_df["Churn"]

# Split the data
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.33, random_state = 5)

# Run the model
lr = LogisticRegression()
model = lr.fit(X_train, y_train)
y_test_pred = model.predict(X_test)

# Calculate the accuracy score
from sklearn import metrics
metrics.accuracy_score(y_test, y_test_pred)

# Print the confusion matrix
metrics.confusion_matrix(y_test, y_test_pred)

# Confusion matrix with label
pandas.DataFrame(metrics.confusion_matrix(y_test, y_test_pred),
                 index=["Actual 0", "Actual 1"],
                 columns=["Predicted 0", "Predicted 1"])
Example 5
Cross Validation (Precision/Recall/F1)

# Load libraries
from sklearn.linear_model import LogisticRegression
import pandas

# Load and construct the data
churn_df = pandas.read_csv("Churn.csv")
X = churn_df.iloc[:,4:12]
y = churn_df["Churn"]

# Split the data
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.33, random_state = 5)

# Run the model
lr = LogisticRegression()
model = lr.fit(X_train, y_train)

# Using the model to predict the results based on the test dataset
y_test_pred = model.predict(X_test)

# Calculate the Precision/Recall
from sklearn import metrics
metrics.precision_score(y_test, y_test_pred)
metrics.recall_score(y_test, y_test_pred)

# Calculate the F1 score
metrics.f1_score(y_test, y_test_pred)
Example 6
Use UniversalBank.csv

# Load libraries
from sklearn.linear_model import LogisticRegression
import pandas

# Load and construct the data
df = pandas.read_csv("UniversalBank.csv")
X = df.iloc[:,1:12]
y = df["Personal Loan"]

# Split the data
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.33, random_state = 5)

# Run the model
lr = LogisticRegression()
model = lr.fit(X_train, y_train)
y_test_pred = model.predict(X_test)

# Calculate the accuracy score and confusion matrix
from sklearn import metrics
metrics.accuracy_score(y_test, y_test_pred)
metrics.confusion_matrix(y_test, y_test_pred)
Exercise #1

§ Use the cancer.csv dataset
§ Use "class" as the target variable and the other variables as predictors
§ Split the data into a test (30%) and training (70%) dataset
§ Run the logistic regression model based on the training dataset and perform cross-validation on the test dataset. Print the accuracy of your model
Exercise #2
§ Use the same dataset in #1
§ Use “class” as the target variable and
clump_thickness, bare_nuclei, mitoses as
predictors
§ Split the data into a test (30%) and training (70%)
dataset
§ Run the logistic regression model based on the
training dataset and perform cross-validation on
the test dataset. Print the accuracy of your model
§ Predict if an observation where
clump_thickness=2, bare_nuclei=2, mitoses=2 is
cancerous.
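A sketch of the final prediction step, using a made-up stand-in for cancer.csv (the real column values and the 0/1 coding of "class" depend on the actual file; replace the DataFrame with pandas.read_csv("cancer.csv")):

```python
import pandas
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data; in the exercise, load cancer.csv instead
df = pandas.DataFrame({
    "clump_thickness": [1, 2, 3, 8, 9, 10, 2, 7, 1, 9],
    "bare_nuclei":     [1, 1, 2, 9, 10, 8, 2, 10, 1, 7],
    "mitoses":         [1, 1, 1, 7, 8, 9, 1, 6, 1, 8],
    "class":           [0, 0, 0, 1, 1, 1, 0, 1, 0, 1],
})

X = df[["clump_thickness", "bare_nuclei", "mitoses"]]
y = df["class"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=5)
model = LogisticRegression().fit(X_train, y_train)

# The observation from the exercise
new_obs = pandas.DataFrame({"clump_thickness": [2], "bare_nuclei": [2], "mitoses": [2]})
print(model.predict(new_obs))
```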
