INSY446 - 4 - Classification Part 1

INSY 446 – Winter 2023

Data Mining for Business


Analytics

Session 4 – Classification Model


January 30, 2023
Dongliang Sheng
Classification Task
§ Classification is a supervised machine learning
task where the target variable is categorical
§ It is probably the most common data mining
task
§ Examples
– Banking: determine whether a mortgage application
is good
– Education: place a student into a particular track
with respect to special needs
– Medicine: diagnose whether a disease is present
– Law: determine if a will is fraudulent
– Security: identify whether a certain financial
transaction represents a terrorist threat
Classification Task
§ For example, the target variable
income_bracket may include the categories
“Low” and “High”
§ The algorithm examines relationships between
the values of the predictors and the target
values
§ Suppose we want to classify a person’s
income bracket based on the age, gender, and
occupation values of others contained in a
dataset
Subject Age Gender Occupation Income Bracket
001 47 F Software Engineer High
002 28 M Marketing Consultant High
003 35 M Unemployed Low
… … … … …
Classification Task
§ First, the classification algorithm examines the
data set values for the predictors and target
variables in the training set
§ This way, the algorithm “learns” which values
of the predictor variables are associated with
values of the target variable
§ For example, older females may be associated
with income_bracket values of “High”
§ Now that the data model is built, the algorithm
examines new records for which
income_bracket is unknown
§ According to classifications in the training set,
the algorithm classifies the new records
Classification Task
§ Most classification models start by calculating
the probability that each new observation
belongs to each class of the target variable
§ For example, the algorithm may estimate the
probability that a 59-year-old female consultant
is in the high-income bracket is 83% while the
probability that she is in the low-income
bracket is 17%
§ Then, based on a confidence threshold, the
algorithm outputs the classification
§ For instance, if the threshold is lower than or
equal to 83%, a 59-year-old female consultant
will be classified as a high-income person
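The threshold step above can be sketched in a few lines (a minimal illustration; the probabilities are the slide's example figures, and the function name is made up for illustration):

```python
def classify(prob_high_income, threshold):
    """Label an observation 'High' when its predicted probability clears the threshold."""
    return "High" if prob_high_income >= threshold else "Low"

# The 59-year-old female consultant: estimated P(High) = 0.83
print(classify(0.83, threshold=0.50))  # threshold below 83% -> High
print(classify(0.83, threshold=0.83))  # threshold equal to 83% -> High
print(classify(0.83, threshold=0.90))  # threshold above 83% -> Low
```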
Evaluating Classification Tasks
§ Obviously, the primary measure is the accuracy of the model
§ Essentially, this measure calculates the percentage of predictions that are correct
§ So, the output is only a single number (e.g., the accuracy of the model is 78.90%)
§ Is this a good measure? Can we say that a model with 99.99% accuracy is always a good model?
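To see why a high accuracy number can be misleading, consider a hypothetical, heavily imbalanced dataset (the data below is made up for illustration):

```python
# Hypothetical fraud detection: 9,999 legitimate transactions (0), 1 fraudulent (1)
y_true = [0] * 9999 + [1]

# A trivial "model" that always predicts the majority class
y_pred = [0] * 10000

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(f"Accuracy: {accuracy:.2%}")  # 99.99%, yet the model never catches the fraud case
```

The model is 99.99% accurate but useless for the task it was built for, which is why the confusion-matrix measures on the following slides matter.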
Evaluating Classification Tasks
§ First, we can extend the accuracy results by developing a confusion matrix
§ Essentially, we compare the predicted outcome with the true outcome

                    Predicted 0            Predicted 1
  Actual 0    True Negative (TN)    False Positive (FP)
  Actual 1    False Negative (FN)    True Positive (TP)
Evaluating Classification Tasks

§ For example, let's consider the following confusion matrix

                Predicted 0    Predicted 1
  Actual 0         18,132            884
  Actual 1          2,578          3,406

§ The overall error rate is (884 + 2,578) / (18,132 + 884 + 2,578 + 3,406) = 13.85% (and the accuracy rate is 86.15%)
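The arithmetic can be verified directly (cell values taken from the confusion matrix above):

```python
TN, FP = 18132, 884   # actual 0: predicted 0 / predicted 1
FN, TP = 2578, 3406   # actual 1: predicted 0 / predicted 1

total = TN + FP + FN + TP
error_rate = (FP + FN) / total
accuracy = (TN + TP) / total

print(f"Error rate: {error_rate:.2%}")  # 13.85%
print(f"Accuracy:   {accuracy:.2%}")    # 86.15%
```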
Evaluating Classification Tasks
§ Suppose we run this classification during the loan application process, where the target variable is whether the client is low-income (0) or high-income (1)
§ The false positive rate (FPR) is 884 / (18,132 + 884) = 4.65%
  "Among low-income clients, the probability that the algorithm would identify them as high-income ones is 4.65%"
§ The false negative rate (FNR) is 2,578 / (2,578 + 3,406) = 43.08%
  "Among high-income clients, the probability that the algorithm would identify them as low-income ones is 43.08%"
Evaluating Classification Tasks
§ The Precision is 3,406 / (884 + 3,406) = 79.39%
  "If the algorithm identifies a client as high-income, the probability that the algorithm is right is 79.39%"
§ The Recall is 3,406 / (3,406 + 2,578) = 56.92%
  "Among high-income clients, the probability that the algorithm would identify them as high-income ones is 56.92%"
§ Is precision or recall more important?
Evaluating Classification Tasks
§ There is an additional "modern" measure to evaluate classification models called the F1 Score

  F1 = 2 × (Precision × Recall) / (Precision + Recall)

§ Because the F1 Score considers both Precision and Recall, it can help in scenarios where it is not clear whether Precision or Recall is more important, but we want to consider both of them in the evaluation
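All of the measures from the last few slides can be recomputed from the same four confusion-matrix cells:

```python
TN, FP = 18132, 884
FN, TP = 2578, 3406

fpr = FP / (TN + FP)                                # false positive rate
fnr = FN / (FN + TP)                                # false negative rate
precision = TP / (TP + FP)
recall = TP / (TP + FN)
f1 = 2 * precision * recall / (precision + recall)

print(f"FPR: {fpr:.2%}   FNR: {fnr:.2%}")           # 4.65%   43.08%
print(f"Precision: {precision:.2%}")                 # 79.39%
print(f"Recall: {recall:.2%}")                       # 56.92%
print(f"F1: {f1:.4f}")                               # 0.6630
```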
Logistic Regression
§ Logistic Regression is one of the most popular classification algorithms (i.e., an algorithm used for classification tasks)
§ Why is linear regression not appropriate when the dependent variable is categorical?
– Binary data typically does not have a normal distribution
– The predicted value of the dependent variable can fall outside the range of 0 to 1
– Probabilities are often not linear in the predictors
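The second point is easy to demonstrate: fit ordinary least squares to a 0/1 outcome and predict outside the observed range (the tiny dataset below is made up for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical disease (0/1) by age
age = np.array([[20], [30], [40], [50], [60], [70]])
disease = np.array([0, 0, 0, 1, 1, 1])

ols = LinearRegression().fit(age, disease)

# The fitted straight line escapes the [0, 1] probability range
print(ols.predict(np.array([[10]])))  # below 0
print(ols.predict(np.array([[90]])))  # above 1
```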
Logistic Regression
§ The Logistic Regression method describes the relationship between a set of predictors and the categorical target variable
§ Generally, we use Logistic Regression when the target variable is binary
§ Example: Suppose researchers are interested in a potential relationship between patient age and the presence/absence of disease
Logistic Regression
§ The plot shows the least squares regression line (straight) and the logistic regression line (curved) for disease on age
§ Linear regression assumes a linear relationship between the variables
§ In contrast, the logistic regression line assumes a non-linear relationship between predictor and response
§ Patient 11's estimation errors (vertical lines) are shown
Logistic Regression

§ Patient 11's estimation error is greater for linear regression than for logistic regression
§ Thus, for this point, and many others, linear regression does a poorer job of estimating disease
Logistic Regression in Python

§ Similar to the case of Linear Regression, you can use either statsmodels or sklearn to perform Logistic Regression
§ Statistics Perspective
– statsmodels package
– obtain stats-related results (t-value, p-value, etc.)
§ Data Mining Perspective
– sklearn package
– results are compatible with standard sklearn functions (cross-validation, error calculation, etc.)
Example 1
statsmodels

# Load libraries
import statsmodels.api

# Load built-in dataset


from sklearn.datasets import load_iris
iris = load_iris()

# Explore the data


print(iris.keys())
print(iris.data.shape)
print(iris.feature_names)
print(iris.target_names)
print(iris.DESCR)

# Setup dependent and independent variables


y = iris.target[0:100]
X = iris.data[0:100,0:1]

# Run Regression
# (note: statsmodels.api.Logit does not add an intercept automatically;
#  use statsmodels.api.add_constant(X) to include one)
logit = statsmodels.api.Logit(y,X)
model = logit.fit()

# View results
model.summary()
Interpreting the Coefficient

§ Unlike Linear Regression, the coefficient of a predictor in Logistic Regression represents the relationship between the predictor and the log-odds of the target variable
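A common way to make a log-odds coefficient interpretable is to exponentiate it: exp(b) is the factor by which the odds of the target change for a one-unit increase in the predictor. A minimal sketch with a hypothetical coefficient value:

```python
import math

b = 0.7  # hypothetical log-odds coefficient for some predictor

odds_ratio = math.exp(b)
print(f"A one-unit increase multiplies the odds by about {odds_ratio:.2f}")  # about 2.01
```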
Example 2
sklearn

# Load libraries
from sklearn.linear_model import LogisticRegression

# Load built-in dataset


from sklearn.datasets import load_iris
iris = load_iris()

# Setup dependent and independent variables


y = iris.target[0:100]
X = iris.data[0:100,0:1]

# Run logistic regression


lr1 = LogisticRegression()
model1 = lr1.fit(X,y)

# Review results
model1.intercept_
model1.coef_
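Once the sklearn model is fitted, it can score new observations; `predict_proba` exposes the class probabilities behind the final labels (a sketch reusing the Example 2 setup; the two sepal-length values are arbitrary):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

iris = load_iris()
y = iris.target[0:100]
X = iris.data[0:100, 0:1]
model1 = LogisticRegression().fit(X, y)

# Probabilities for class 0 and class 1, then the labels (thresholded at 0.5 by default)
print(model1.predict_proba([[4.5], [6.5]]))
print(model1.predict([[4.5], [6.5]]))
```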

Example 3
Real-world data (churn.csv)

# Load libraries
from sklearn.linear_model import LogisticRegression
import pandas

# Load and construct the data
churn_df = pandas.read_csv("C:\\...\\Churn.csv")
X = churn_df.iloc[:,4:12]
y = churn_df["Churn"]

# Run the model
lr2 = LogisticRegression()
model2 = lr2.fit(X, y)

# View results
model2.intercept_
model2.coef_
Cross Validation in Classification

§ In regression tasks, we calculate the MSE to measure the model performance
§ For classification tasks, as discussed earlier, we can calculate the accuracy score to measure the model performance
§ Alternatively, we can also use other measures, including precision, recall, and the F1 score
Example 4
Cross Validation (Accuracy)

# Load libraries
from sklearn.linear_model import LogisticRegression
import pandas

# Load and construct the data
churn_df = pandas.read_csv("Churn.csv")
X = churn_df.iloc[:,4:12]
y = churn_df["Churn"]

# Split the data
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.33, random_state = 5)

# Run the model
lr = LogisticRegression()
model = lr.fit(X_train, y_train)
y_test_pred = model.predict(X_test)

# Calculate the accuracy score
from sklearn import metrics
metrics.accuracy_score(y_test, y_test_pred)

# Print the confusion matrix
metrics.confusion_matrix(y_test, y_test_pred)

# Confusion matrix with label
pandas.DataFrame(metrics.confusion_matrix(y_test, y_test_pred),
                 index=["Actual 0", "Actual 1"],
                 columns=["Predicted 0", "Predicted 1"])
Example 5
Cross Validation (Precision/Recall/F1)

# Load libraries
from sklearn.linear_model import LogisticRegression
import pandas

# Load and construct the data
churn_df = pandas.read_csv("Churn.csv")
X = churn_df.iloc[:,4:12]
y = churn_df["Churn"]

# Split the data
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.33, random_state = 5)

# Run the model
lr = LogisticRegression()
model = lr.fit(X_train, y_train)

# Using the model to predict the results based on the test dataset
y_test_pred = model.predict(X_test)

# Calculate the Precision/Recall
from sklearn import metrics
metrics.precision_score(y_test, y_test_pred)
metrics.recall_score(y_test, y_test_pred)

# Calculate the F1 score
metrics.f1_score(y_test, y_test_pred)
Example 6
Use UniversalBank.csv

# Load libraries
from sklearn.linear_model import LogisticRegression
import pandas

# Load and construct the data
df = pandas.read_csv("UniversalBank.csv")
X = df.iloc[:,1:12]
y = df["Personal Loan"]

# Split the data
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.33, random_state = 5)

# Run the model
lr = LogisticRegression()
model = lr.fit(X_train, y_train)
y_test_pred = model.predict(X_test)

# Calculate the accuracy score and confusion matrix
from sklearn import metrics
metrics.accuracy_score(y_test, y_test_pred)
metrics.confusion_matrix(y_test, y_test_pred)
Exercise #1

§ Use the cancer.csv dataset
§ Use "class" as the target variable and the other variables as predictors
§ Split the data into a test (30%) and training (70%) dataset
§ Run the logistic regression model based on the training dataset and perform cross-validation on the test dataset. Print the accuracy of your model
Exercise #2
§ Use the same dataset in #1
§ Use “class” as the target variable and
clump_thickness, bare_nuclei, mitoses as
predictors
§ Split the data into a test (30%) and training (70%)
dataset
§ Run the logistic regression model based on the
training dataset and perform cross-validation on
the test dataset. Print the accuracy of your model
§ Predict if an observation where
clump_thickness=2, bare_nuclei=2, mitoses=2 is
cancerous.
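A sketch of the final prediction step, using a made-up stand-in for cancer.csv (the real column values and the 0/1 coding of "class" depend on the actual file; replace the DataFrame with pandas.read_csv("cancer.csv")):

```python
import pandas
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data; in the exercise, load cancer.csv instead
df = pandas.DataFrame({
    "clump_thickness": [1, 2, 3, 8, 9, 10, 2, 7, 1, 9],
    "bare_nuclei":     [1, 1, 2, 9, 10, 8, 2, 10, 1, 7],
    "mitoses":         [1, 1, 1, 7, 8, 9, 1, 6, 1, 8],
    "class":           [0, 0, 0, 1, 1, 1, 0, 1, 0, 1],
})

X = df[["clump_thickness", "bare_nuclei", "mitoses"]]
y = df["class"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=5)
model = LogisticRegression().fit(X_train, y_train)

# The observation from the exercise
new_obs = pandas.DataFrame({"clump_thickness": [2], "bare_nuclei": [2], "mitoses": [2]})
print(model.predict(new_obs))
```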
