INSY446 - 4 - Classification Part 1
INSY446 - 4 - Classification Part 1
6
Evaluating Classification Tasks
§ First, we can extend the accuracy results by
developing a confusion matrix
§ Essentially, we compare the prediction of the
outcome variable and the true outcome
Predicted Category
Actual Category
0 1
True Negative False Positive
0 Predicted 0 Predicted 1
Actually 0 Actually 0
False Negative True Positive
1 Predicted 0 Predicted 1
Actually 1 Actually 1
7
Evaluating Classification Tasks
0 1
0 18,132 884
1 2,578 3,406
𝟖𝟖𝟒#𝟐𝟓𝟕𝟖
§ The overall error rate is 𝟏𝟖𝟏𝟑𝟐#𝟖𝟖𝟒#𝟐𝟓𝟕𝟖#𝟑𝟒𝟎𝟔 =
13.85% (and the accuracy rate is 86.15%)
8
Evaluating Classification Tasks
§ Suppose we run this classification during the
loan application process where the target variable
is whether the client is low-income (0) or high-
income (1).
§ The false positive rate (FPR) is 𝟏𝟖𝟏𝟑𝟐&𝟖𝟖𝟒
𝟖𝟖𝟒
= 4.65%
“Among low-income clients, the probability that
the algorithm would identify them as high-income
ones is 4.65%”
§ The false negative rate (FNR) is 𝟐𝟓𝟕𝟖&𝟑𝟒𝟎𝟔
𝟐𝟓𝟕𝟖
= 43.08%
“Among high-income clients, the probability that
the algorithm would identify them as low-income
ones is 43.08%”
9
Evaluating Classification Tasks
§ The Precision is 𝟖𝟖𝟒&𝟑𝟒𝟎𝟔
𝟑𝟒𝟎𝟔
= 79.39%
“If the algorithm identifies a client as a high-
income, the probability that the algorithm is right
is 79.39%”
§ The Recall is 𝟑𝟒𝟎𝟔&𝟐𝟓𝟕𝟖
𝟑𝟒𝟎𝟔
= 56.92%
“Among high-income clients, the probability that
the algorithm would identify them as high-income
ones is 56.92%”
§ Is precision or recall more important?
10
Evaluating Classification Tasks
§ There is an additional “modern” measure to
evaluation classification models called F1 Score
𝑷𝒓𝒆𝒄𝒊𝒔𝒊𝒐𝒏 ×𝑹𝒆𝒄𝒂𝒍𝒍
𝑭𝟏 = 𝟐×
𝑷𝒓𝒆𝒄𝒊𝒔𝒊𝒐𝒏 + 𝑹𝒆𝒄𝒂𝒍𝒍
§ Because F1 Score considers both Precision and
Recall, it could help in a scenario where it is not
clear whether Precision or Recall is more
important, but we want to consider both of them
in the evaluation
11
Logistic Regression
§ Logistic Regression is one of the most popular
classification algorithm (i.e., an algorithm used
for classification tasks)
§ Why linear regression is not appropriate when
the dependent variable is categorical?
– Binary data typically does not have a normal
distribution
– Predicted value of the dependent variable can be
beyond 0 and 1
– Probabilities are often not linear
12
Logistic Regression
§ Logistic Regression method describes
relationship between the set of predictors and
the categorical target variable
§ Generally, we use Logistic Regression when
the target variable is binary
§ Example: Suppose researchers are interested
in potential relationship between patient age
and presence/absence of disease
13
Logistic Regression
§ Plot shows least squares
regression line (straight)
and logistic regression
line (curved) for disease
on age
§ The linear regression
assumes linear relationship
between variables
§ In contrast, logistic regression line assumes
non-linear relationship between predictor and
response
§ Patient 11 estimation errors (vertical lines)
shown
14
Logistic Regression
15
Logistic Regression in Python
16
Example 1
statsmodels
# Load libraries
import statsmodels.api
# Run Regression
logit = statsmodels.api.Logit(y,X)
model = logit.fit()
# View results
model.summary()
17
Interpreting the Coefficient
18
Example 2
sklearn
# Load libraries
from sklearn.linear_model import LogisticRegression
# Review results
model1.intercept_
model1.coef_
19
Example 3
Real-world data (churn.csv)
# Load libraries
from sklearn.linear_model import LogisticRegression
import pandas
# View results
20
Cross Validation in Classification
21
Example 4
Cross Validation (Accuracy)
# Load libraries
from sklearn.linear_model import LogisticRegression
import pandas
22
Example 5
Cross Validation (Precision/Recall/F1)
# Load libraries
from sklearn.linear_model import LogisticRegression
import pandas
# Using the model to predict the results based on the test dataset
23
Example 6
Use UniversalBank.csv
# Load libraries
from sklearn.linear_model import LogisticRegression
import pandas
24
Exercise #1
25
Exercise #2
§ Use the same dataset in #1
§ Use “class” as the target variable and
clump_thickness, bare_nuclei, mitoses as
predictors
§ Split the data into a test (30%) and training (70%)
dataset
§ Run the logistic regression model based on the
training dataset and perform cross-validation on
the test dataset. Print the accuracy of your model
§ Predict if an observation where
clump_thickness=2, bare_nuclei=2, mitoses=2 is
cancerous.
26