
MODULE 3.3 SUPERVISED LEARNING: CLASSIFICATION MODELS PART 1


An Overview of Commonly Used Supervised Classification Models

Classification is a supervised machine learning method where a model tries to predict the correct label for a given input. Many classification algorithms are probability-based: the predicted outcome is the category to which the algorithm assigns the highest probability for the target variable.

Businesses can use such classification tasks to predict whether a customer is likely to purchase a product, determine if
an image contains a particular type of object, detect fraud, predict probability of default, and more.

Classification models in general are less mathematical in nature compared to regression-based algorithms like time
series models. For Module 3.3, we present a high-level overview of commonly used supervised learning classification
models. In the next module, we will implement these models in a case study and compare the model performance.

Module 3.3 will cover the following:

Part 1: Regularization
Part 2: Logistic Regression
Part 3: K-Nearest Neighbors
Part 4: Support Vector Machine
Part 5: (Bonus model, not covered in class) Classification and Regression Trees
Part 6: Grid Search and Cross Validation
Part 7: Evaluation Metrics

Part 1: Regularization

When a linear regression model contains many independent variables, their coefficients may be poorly determined, and the model will tend to fit the training data (the data used to build the model) extremely well but fit poorly to the testing data (the data used to assess how good the model is).

This is known as overfitting. A model with too many parameters may fit noise that is specific only to the training data, and because of that noise it fails to predict the output or target variable on unseen (test) data.

One popular technique to control overfitting is regularization, which adds a penalty (also called a complexity or shrinkage) term to the error or loss function to discourage the coefficients from reaching large values. In simple terms, regularization is a penalty mechanism that applies shrinkage to the model parameters (driving them closer to zero) in order to build a model with higher prediction accuracy and better interpretability. It aims to reduce the variance of the model without a substantial increase in its bias.

There are two common ways to regularize a regression model: Lasso regression and Ridge regression.

L1 Regularization or Lasso Regression

Lasso regression performs L1 regularization by adding a penalty proportional to the sum of the absolute values of the coefficients to the cost/loss function. It is written generally as

min(L) = L + λ ∑_{j=1}^{p} |β_j|,

where L = loss or cost function;
|β_j| = absolute value of coefficient j;
λ = regularization parameter that controls the amount of regularization applied – the larger the value of λ, the more features are shrunk to zero.

L1 regularization can lead to zero coefficients (i.e., some of the features are completely ignored in the evaluation of the output), resulting in a sparse model. The larger the value of λ, the more features are shrunk to zero. This can eliminate some features entirely and give us a subset of predictors, reducing model complexity.

So, Lasso regression not only helps reduce overfitting but also supports feature selection: the predictors that are not shrunk toward zero are the ones flagged as important.

Consequently, a limitation of Lasso regression is that when two or more variables are highly collinear, it tends to select one of them essentially at random, which is not good for the interpretation of the model.

L2 Regularization or Ridge Regression

Ridge regression performs L2 regularization by adding a penalty proportional to the sum of the squared coefficients to the cost/loss function. It is written generally as

min(L) = L + λ ∑_{j=1}^{p} β_j²,

where L = loss or cost function;
β_j² = square of coefficient j;
λ = regularization parameter that controls the amount of regularization applied – the larger the value of λ, the more the coefficients are shrunk toward zero.

Ridge regression puts a constraint on the coefficients. The penalty term (λ) regularizes the coefficients such that the optimization is penalized when the coefficients take large values. So, ridge regression shrinks the coefficients and helps to reduce model complexity; shrinking the coefficients leads to a lower variance and a lower error value.

A limitation of L2 regularization is that, while it decreases the complexity of a model, it does not reduce the number of variables; this is not desirable if we want the model to select important variables.

Important Points on λ

The penalty term 𝜆 is the tuning parameter used in regularization that decides how much we want to penalize the
flexibility of our model (i.e., controls the impact on bias and variance).

When λ = 0, the penalty term has no effect and min(L) reduces to the cost function of an ordinary linear regression model; hence, the fitted model will resemble a simple linear regression model.

Recall that as 𝜆 rises, it significantly reduces the value of coefficient estimates and thus reduces the variance. But after
a certain value of 𝜆, the model starts losing some important properties, giving rise to bias in the model and thus
underfitting. Therefore, we have to select the value of 𝜆 carefully.
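As a minimal sketch of how λ behaves in practice (assuming sklearn's Lasso and Ridge, where the regularization strength is called alpha, and a synthetic dataset), the following compares how many coefficients each method drives to exactly zero as the penalty grows:

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# Synthetic regression data: 10 features, only 3 of them informative
X, y = make_regression(n_samples=200, n_features=10, n_informative=3,
                       noise=10.0, random_state=42)

for alpha in [0.01, 1.0, 100.0]:           # alpha plays the role of lambda
    lasso = Lasso(alpha=alpha).fit(X, y)   # L1: can zero out coefficients
    ridge = Ridge(alpha=alpha).fit(X, y)   # L2: shrinks but rarely zeroes them
    print(f"alpha={alpha}: Lasso zero coefs={np.sum(lasso.coef_ == 0)}, "
          f"Ridge zero coefs={np.sum(ridge.coef_ == 0)}")

In general, larger alpha values lead Lasso to set more coefficients exactly to zero, while Ridge keeps all coefficients nonzero but smaller.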

Part 2: Logistic Regression

Logistic regression is one of the most widely used algorithms for classification. The logistic regression model arises
from the desire to model the probabilities of the output classes given a function that is linear in 𝑥, at the same time
ensuring that output probabilities sum up to one and remain between zero and one as we would expect from
probabilities.

If we train a linear regression model ŷ = β_0 + β_1x_1 + ⋯ + β_ix_i on examples where the observed label y is 0 or 1, we might end up predicting "probabilities" that are less than zero or greater than one, which doesn't make sense. Instead, we use a logistic regression model (or logit model), a modification of linear regression that guarantees an output probability between zero and one by applying the sigmoid function.

If the observed output y is 0 or 1, we can treat y as a realization of a random variable that takes the values one and zero with probabilities p and 1 − p, respectively. A model that satisfies the boundary requirement is the logistic equation:

y = e^(β_0 + β_1x_1 + ⋯ + β_ix_i) / (1 + e^(β_0 + β_1x_1 + ⋯ + β_ix_i))

where y is the predicted output, β_0 is the bias or intercept, and β_i is the coefficient for predictor x_i that must be learned from the training data.

Similar to linear regression, input values (𝑥) are combined linearly using weights or coefficient values to predict an
output value (𝑦). The output coming from the above equation is a probability that is transformed into a binary value
(0 or 1) to get the model prediction.

The logistic equation can be linearized by the following transformation:

logit(p) = ln(p / (1 − p)) = β_0 + β_1x_1 + ⋯ + β_ix_i

The left-hand side is termed the logit, which stands for "logistic unit"; it is also known as the log odds. On this scale the model produces values on the whole real line, and the logistic equation above transforms those values back into the 0-to-1 probability range.
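To make the mapping between the log odds and probabilities concrete, here is a small numerical sketch (the coefficient values are hypothetical, chosen purely for illustration):

import numpy as np

beta0, beta1 = -1.5, 0.8       # hypothetical intercept and coefficient
x = 2.0                        # a single predictor value

log_odds = beta0 + beta1 * x                    # linear predictor (the logit), here 0.1
p = np.exp(log_odds) / (1 + np.exp(log_odds))   # logistic equation: maps to (0, 1)

print(p)                       # predicted probability, about 0.525
print(np.log(p / (1 - p)))     # inverse transform recovers the log odds, 0.1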

In logistic regression, the cost function (the log loss, or cross-entropy) measures how strongly the predicted probabilities disagree with the true labels – roughly, how often and how confidently we predicted 1 when the true answer was 0, or vice versa.

In terms of the advantages, the logistic regression model is easy to implement, has good interpretability, and performs
very well on linearly separable classes. The output of the model is a probability, which provides more insight. Although
there may be risk of overfitting, this may be addressed using L1/L2 regularization.

In terms of disadvantages, the model may overfit when provided with large numbers of features. Logistic regression
can only learn linear functions and is less suitable to complex relationships between features and the target variable.
Also, it may not handle irrelevant features well, especially if the features are strongly correlated.

Hyperparameters

▪ Regularization (penalty in sklearn): In sklearn's LogisticRegression the default penalty is L2, but it can also be set to L1, elasticnet, or no regularization at all.

▪ Regularization strength (C in sklearn): If a penalty term is added to the model, this hyperparameter controls how strongly the coefficients are shrunk during training; in sklearn, C is the inverse of the regularization strength, so smaller values of C mean stronger regularization. Good values to try are [100, 10, 1.0, 0.1, 0.01].
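Putting the above together, a minimal sketch of fitting a regularized logistic regression with sklearn (using a synthetic dataset and the hyperparameter values listed above, purely for illustration):

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# penalty chooses L1/L2/elasticnet; C is the inverse of the regularization strength
model = LogisticRegression(penalty='l2', C=1.0, max_iter=1000)
model.fit(X, y)

print(model.predict(X[:5]))        # hard class labels (0 or 1)
print(model.predict_proba(X[:5]))  # class probabilities from the sigmoid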

Part 3: K-Nearest Neighbors

K-nearest neighbors (KNN) is considered a “lazy learner,” as there is no learning required in the model. For a new data
point, predictions are made by searching through the entire training set for the K most similar instances (the neighbors)
and summarizing the output variable for those K instances.

As an illustrative example (shown as a diagram in the original module), suppose the training data contains yellow points belonging to Class A and violet points belonging to Class B, and a red star marks a new test point to be classified. If the three nearest neighbors of the star are mostly Class B (k = 3), we predict Class B; if its six nearest neighbors are mostly Class A (k = 6), we predict Class A.
To determine which of the K instances in the training dataset are most similar to a new input, a distance measure is used. The most popular distance measure is the Euclidean distance, calculated as the square root of the sum of the squared differences between a point a and a point b across all n input attributes:

d(a, b) = √(∑_{i=1}^{n} (a_i − b_i)²)

Euclidean distance is a good distance measure to use if the input variables are similar in type.

Another widely used distance metric is the Manhattan distance, in which the distance between point a and point b is

d(a, b) = ∑_{i=1}^{n} |a_i − b_i|.

Manhattan distance is a good measure to use if the input variables are not similar in type.
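Both distances can be computed directly; a small sketch for two example points:

import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])

euclidean = np.sqrt(np.sum((a - b) ** 2))   # square root of sum of squared differences
manhattan = np.sum(np.abs(a - b))           # sum of absolute differences

print(euclidean)   # about 3.74
print(manhattan)   # 6.0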

The steps of KNN can be summarized as follows:


1. Choose the number of neighbors K and a distance metric.
2. Find the K-nearest neighbors of the sample that we want to classify.
3. Assign the class label by majority vote.

In terms of advantages, no training is involved and hence there is no learning phase. Since the algorithm requires no
training before making predictions, new data can be added seamlessly without impacting the accuracy of the
algorithm. It is intuitive and easy to understand. The model naturally handles multiclass classification and can learn
complex decision boundaries. KNN is effective if the training data is large.

In terms of the disadvantages, the choice of distance metric is not obvious and is difficult to justify in many cases. KNN performs poorly on high-dimensional datasets. It is expensive and slow to predict new instances, because the distance to every training point must be computed for each prediction. KNN is sensitive to noise in the dataset, so we need to impute missing values and remove outliers manually. Also, feature scaling (standardization or normalization) is required before applying the KNN algorithm to any dataset; otherwise, KNN may generate wrong predictions.

Hyperparameters

▪ Number of neighbors (n_neighbors in sklearn): The most important hyperparameter for KNN is the number of k
neighbors. Good values are between 1 and 20.

▪ Distance metric (metric in sklearn): It may also be interesting to test different distance metrics for choosing the
composition of the neighborhood. Good values are euclidean and manhattan.
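A minimal end-to-end sketch with sklearn's KNeighborsClassifier (synthetic data, illustrative hyperparameter values), including the feature scaling step mentioned above:

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=300, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Scale features first: KNN is distance-based and sensitive to feature scale
scaler = StandardScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

knn = KNeighborsClassifier(n_neighbors=5, metric='euclidean')
knn.fit(X_train, y_train)                 # "fitting" just stores the training set
print(knn.score(X_test, y_test))          # classification accuracy on held-out data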

Part 4: Support Vector Machine

Support Vector Machine (SVM) is a powerful machine learning algorithm used for linear or nonlinear classification,
regression, and even outlier detection tasks. SVMs work by finding the optimal hyperplane that separates data points
into different classes.
How do SVMs work? We'll illustrate SVM using a two-class problem and begin with a case in which the classes are linearly separable, meaning that a straight line can be drawn that perfectly separates the classes; the margin is the perpendicular distance from that line to the closest points of each class. In a two-dimensional space, a hyperplane is simply a line that separates the data points into two classes. In a three-dimensional space, a hyperplane is a plane that separates the data points into two classes. Similarly, in N-dimensional space, a hyperplane has (N − 1) dimensions.

The closest points to the line are called support vectors and are the only points that ultimately influence the position
of the separating line – any points that are further from the line can be moved, removed, or added with no impact on
the line.

Many such separating lines are possible and the SVM algorithm finds the one with the widest margin. A margin is the
distance between the decision boundary (hyperplane) and the closest data points from each class. SVM classifies
points by maximizing the width of a margin that separates the classes.

Why maximize the margin? The intuition behind it is that a larger margin indicates a greater degree of confidence in
the classification and that the model would be more generalizable on unseen/test data.

In practice, the data is messy and cannot be separated perfectly with a hyperplane. The constraint of maximizing the
margin of the line that separates the classes must be relaxed. This change allows some points in the training data to
violate the separating line. An additional set of coefficients is introduced that give the margin wiggle room in each
dimension.

A tuning parameter is introduced, simply called C, that defines the magnitude of the wiggle allowed across all dimensions. In this "budget" formulation, the larger the value of C, the more violations of the hyperplane are permitted. (Note that sklearn uses the inverse convention: there C multiplies the penalty on violations, so a larger C allows fewer violations and yields a smaller margin, as described under Hyperparameters below.)

In some cases, it is not possible to find a hyperplane or a linear decision boundary, and kernels are used. A kernel is a transformation of the input data that allows the SVM algorithm to process the data more easily: the original data is projected into a higher-dimensional space in which the classes become easier to separate.

In terms of advantages, SVM is fairly robust against overfitting, especially in higher dimensional space. It handles the
nonlinear relationships quite well. Also, there is no distributional requirement for the data. Outliers can be well
handled using soft margin constant C.

However, SVM can be inefficient to train and memory-intensive to run and tune. It also doesn’t perform well with
large datasets. Hyperparameters and kernels are to be carefully tuned for sufficient accuracy.

Hyperparameters

▪ Kernels (kernel in sklearn): The choice of kernel controls the manner in which the input variables will be projected. There are many kernels to choose from, but linear and RBF are the most common.
▪ Penalty (C in sklearn): The penalty parameter tells the SVM optimization how much you want to avoid
misclassifying each training example. For large values of the penalty parameter, the optimization will choose a
smaller-margin hyperplane. Good values might be a log scale from 10 to 1,000.
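A minimal sketch of fitting an SVM classifier with sklearn's SVC (synthetic data; the kernel and C values are illustrative):

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

scaler = StandardScaler().fit(X_train)     # SVMs are also sensitive to feature scale
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

# kernel controls how the inputs are projected; C trades margin width against violations
svm = SVC(kernel='rbf', C=10.0, gamma='scale')
svm.fit(X_train, y_train)
print(svm.score(X_test, y_test))           # accuracy on held-out data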

Part 5: Classification and Regression Trees

Classification And Regression Tree (CART) is a variation of the decision tree algorithm. In the most general terms, the
purpose of an analysis via tree-building algorithms is to determine a set of if–then logical (split) conditions that permit
accurate prediction or classification of cases.

Classification and regression trees (or CART or decision tree classifiers) are attractive models if we care about
interpretability. We can think of this model as breaking down our data and making a decision based on asking a series
of questions. This algorithm is the foundation of ensemble methods such as random forest and gradient boosting
method.

The model can be represented by a binary tree (or decision tree), where each internal node tests an input variable x against a split point and each leaf contains an output value y used for prediction. A simple example is a classification tree that predicts whether a person is male or female based on two inputs, height (in centimeters) and weight (in kilograms).

Regression trees and classification trees are two different types of tree model, and the term "classification and regression tree" (CART) is used to refer to either; both are built with the same decision-tree algorithm. A regression tree is used when the output variable is continuous (for example, age or income): it starts with a single root node containing all the data and keeps splitting into smaller nodes until it reaches its maximum depth, and the leaf nodes hold the predicted target values. A classification tree is used when the output variable is categorical: it predicts which class label (such as yes/no or pass/fail) a new sample will receive by looking at the characteristics of previously labelled examples in the data set. In either case, each node represents a test on a feature, each child node branches off from its parent based on a split point determined by the algorithm, and the path ultimately leads to a prediction at a leaf. For our purposes, we will only discuss the classification tree model.

Training a CART Model

For example, if we wanted to use a decision tree for predicting whether someone would buy something or not (binary
classification), then we could use a classification tree. In this case, our input variables may include age, gender, and
income level. The root node might ask 'Is the person over 21 years old?', with two subsequent nodes being ‘Yes’ and
‘No’ respectively. If yes was selected as the answer, it might lead to another question such as 'What is their annual
income?'. This process continues until all questions have been asked, leading us to reach either one of two conclusions:
purchase or no purchase.

More generally, a decision tree is a rather simple structure that consists of three kinds of elements: one Root Node, which is the starting point containing all training samples; multiple Decision Nodes, where we split the data using simple if-else decision rules; and multiple Terminal Nodes or Leaf Nodes, where we eventually assign the classes for our classification purpose.

Creating a binary tree is actually a process of dividing up the input space. A greedy approach called recursive binary splitting is used to divide the space. This is a numerical procedure in which all the values are lined up and different split points are tried and tested using a cost (loss) function. The split with the best cost (the lowest cost, since we minimize cost) is selected. All input variables and all possible split points are evaluated and chosen in a greedy manner (i.e., the very best split point is chosen each time).

Classification CART trees use the Gini cost function, which provides an indication of how pure the nodes are (i.e., how mixed the training data assigned to each node is).

The Gini index is a metric based on the sum of the squared probabilities of each class, and it is a variation of the Gini coefficient. It measures how likely a randomly chosen element would be misclassified if it were labelled randomly according to the class distribution in the node. It works on categorical targets, treating each split outcome as a "success" or "failure", and hence CART performs binary splits only; a small numerical sketch is given after the list below.

The Gini index ranges from 0 to 1:

• A value of 0 means that all the elements in a node belong to a single class (a pure node);
• A value of 1 means that the elements are randomly distributed across many classes; and
• A value of 0.5 means that the elements are evenly split between two classes.
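As referenced above, here is a minimal sketch of the computation for a single node, using a small hypothetical helper (not part of sklearn) where the Gini index is one minus the sum of squared class proportions:

import numpy as np

def gini(labels):
    # Gini index of a node: 1 minus the sum of squared class proportions
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

print(gini([0, 0, 0, 0]))   # 0.0  -> pure node, a single class
print(gini([0, 1, 0, 1]))   # 0.5  -> two classes, evenly mixed
print(gini([0, 1, 2, 3]))   # 0.75 -> elements spread across many classes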

Stopping Criterion

The recursive binary splitting procedure described in the preceding section needs to know when to stop splitting as it
works its way down the tree with the training data. The most common stopping procedure is to use a minimum count
on the number of training instances assigned to each leaf node. If the count is less than some minimum, then the split
is not accepted and the node is taken as a final leaf node.

Tree Pruning

The complexity of a decision tree is defined as the number of splits in the tree. Simpler trees are preferred as they are
faster to run and easy to understand, consume less memory during processing and storage, and are less likely to overfit
the data. Pruning can be used after learning the tree to further lift performance. The fastest and simplest pruning
method is to work through each leaf node in the tree and evaluate the effect of removing it using a test set. A leaf
node is removed only if doing so results in a drop in the overall cost function on the entire test set. The removal of
nodes can be stopped when no further improvements can be made.
Advantages and Disadvantages

In terms of advantages, CART is easy to interpret and can adapt to learn complex relationships. It requires little data
preparation, and data typically does not need to be scaled. Feature importance is built in due to the way decision
nodes are built. It performs well on large datasets. CART does not make any assumptions about the distribution of
data or the relationships between features. It can handle both categorical and numerical features without requiring
feature scaling.

In terms of disadvantages, CART decision trees are prone to overfitting, especially when they are allowed to grow too
deep. Overfitting occurs when the tree becomes too complex and fits the training data noise, resulting in poor
generalization to new, unseen data. Small changes in the training data can lead to significantly different tree
structures. This can result in instability, making the model sensitive to variations in the data.

Hyperparameters

CART has many hyperparameters. Hyperparameter tuning is crucial for optimizing the performance of CART models,
and it often involves techniques like cross-validation to find the best combination of hyperparameters for your specific
problem. Maximum depth (max_depth in sklearn) and minimum number of samples (min_samples_split in sklearn)
are often a good starting point for hyperparameter tuning since adjusting these two hyperparameters allows you to
find the right balance between model complexity and generalization for your specific problem.

▪ Maximum depth (max_depth in sklearn): This parameter controls the maximum depth or depth limit of the
decision tree. It can help prevent overfitting by limiting the tree's complexity. Good values can range from 2 to 30
depending on the number of features in the data.

▪ Minimum number of samples (min_samples_split in sklearn): It specifies the minimum number of samples
required to split an internal node. Increasing this value can prevent the tree from splitting too early, reducing the
likelihood of overfitting. A good value can range from 10 to 20 as it balances complexity and generalization.
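A minimal sketch of training a classification tree with sklearn's DecisionTreeClassifier, using the two hyperparameters above to limit complexity (synthetic data, illustrative values):

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Gini splitting criterion; depth and split-size limits help control overfitting
tree = DecisionTreeClassifier(criterion='gini', max_depth=5,
                              min_samples_split=20, random_state=0)
tree.fit(X_train, y_train)

print(tree.score(X_test, y_test))    # accuracy on held-out data
print(tree.feature_importances_)     # built-in feature importance scores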

Part 6: Grid Search and Cross Validation

Grid Search for Hyperparameter Tuning

A model hyperparameter is a characteristic of a model that is external to the model and whose value cannot be
estimated from data. The value of the hyperparameter has to be set before the learning process begins. For example,
C in Support Vector Machines, k in K-Nearest Neighbors, or the number of hidden layers in neural networks.

Grid search is used to find the optimal hyperparameters of a model which results in the most ‘accurate’ predictions.
The overall idea of the grid search is to create a grid of all possible hyperparameter combinations and train the model
using each one of them. These hyperparameters are tuned during grid search to achieve better model performance.

Due to its exhaustive search, a grid search is guaranteed to find the optimal parameter within the grid. A major
drawback is that the size of the grid grows exponentially with the addition of more parameters or more considered
values, thus costing more memory and computational time.

The GridSearchCV class in the model_selection module of the sklearn package facilitates the systematic evaluation
of all combinations of the hyperparameter values that we would like to test. The grid search provided by
GridSearchCV exhaustively generates candidates from a grid of parameter values specified with the param_grid
parameter. For instance, the following param_grid for a Support Vector Machine model specifies that two grids should
be explored: one with a linear kernel and C values in [1, 10, 100, 1000], and the second one with an RBF kernel, and
the cross-product of C values ranging in [1, 10, 100, 1000] and gamma values in [0.001, 0.0001].

param_grid = [
    {'C': [1, 10, 100, 1000], 'kernel': ['linear']},
    {'C': [1, 10, 100, 1000], 'gamma': [0.001, 0.0001], 'kernel': ['rbf']},
]
The GridSearchCV instance implements the usual estimator interface: when “fitting” it on a dataset all the possible
combinations of parameter values are evaluated and the best combination is retained.
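For example, a hedged sketch of running this grid search over an SVC on a synthetic dataset (the estimator and scoring choice are illustrative):

from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=5, random_state=0)

param_grid = [
    {'C': [1, 10, 100, 1000], 'kernel': ['linear']},
    {'C': [1, 10, 100, 1000], 'gamma': [0.001, 0.0001], 'kernel': ['rbf']},
]

# Every parameter combination is evaluated with 5-fold cross validation
search = GridSearchCV(SVC(), param_grid, cv=5, scoring='accuracy')
search.fit(X, y)

print(search.best_params_)   # the best combination found on the grid
print(search.best_score_)    # its mean cross-validated accuracy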

Cross Validation

One of the challenges of machine learning is training models that are able to generalize well to unseen data (overfitting
versus underfitting or a bias-variance trade-off). The main idea behind cross validation is to split the data one time or
several times so that each split is used once as a validation set and the remainder is used as a training set: part of the
data (the training sample) is used to train the algorithm, and the remaining part (the validation sample) is used for
estimating the risk of the algorithm. Cross validation is another technique that can be used when tuning
hyperparameters.

The most common type of cross-validation is k-fold cross-validation, where the data is randomly partitioned into k
equal-sized subsets, or folds. The model is trained on k-1 folds and tested on the remaining fold, and this process is
repeated k times, with each fold serving as the test set once. The performance of the model is then averaged over the
k iterations to obtain a more robust estimate of its performance.

As an example of cross validation, the data can be split into k = 5 folds, with a different fold held out for validation in each of the five rounds (this is shown as an image in the original module).

A potential drawback of cross validation is the computational cost, especially when paired with a grid search for
hyperparameter tuning. Cross validation can be performed in a couple of lines using the sklearn package; we will
perform cross validation in the supervised learning case studies.
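For instance, a minimal sketch of 5-fold cross validation with sklearn's cross_val_score (synthetic data and an arbitrary estimator, for illustration):

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=300, n_features=5, random_state=0)

# 5-fold cross validation: each fold serves once as the validation set
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print(scores)          # five accuracy values, one per fold
print(scores.mean())   # averaged estimate of model performance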

Part 7: Evaluation Metrics

The metrics used to evaluate the machine learning algorithms are very important. The choice of metrics to use
influences how the performance of machine learning algorithms is measured and compared. The metrics influence
both how you weight the importance of different characteristics in the results and your ultimate choice of algorithm.

The main evaluation metrics for classification are:


• Accuracy
• Precision
• Recall
• Area Under the ROC Curve (AUC)
• McFadden’s Pseudo R2 (for Logistic Regression)

For simplicity, we will mostly discuss things in terms of a binary classification problem (i.e., only two outcomes, such
as true or false); some common terms are:
• True positives (TP): Predicted positive and are actually positive
• False positives (FP): Predicted positive and are actually negative
• True negatives (TN): Predicted negative and are actually negative
• False negatives (FN): Predicted negative and are actually positive
Accuracy

Accuracy is the number of correct predictions made as a ratio of all predictions made. This is the most common
evaluation metric for classification problems and is also the most misused. It is most suitable when there are an equal
number of observations in each class (which is rarely the case) and when all predictions and the related prediction
errors are equally important, which is often not the case.

Accuracy = (True Positives + True Negatives) / Total Predictions

Precision

Precision is the percentage of true positive instances out of all instances predicted as positive; the denominator is therefore the total number of instances the model predicted as positive in the whole dataset. Precision is a good measure when the cost of false positives is high (e.g., email spam detection).

Precision = True Positives / (True Positives + False Positives)

Recall or Sensitivity

Recall (or sensitivity or true positive rate) is the percentage of positive instances out of the total actual positive
instances. Therefore, the denominator (true positive + false negative) is the actual number of positive instances present
in the dataset. Recall is a good measure when there is a high cost associated with false negatives (e.g., fraud detection).

Recall = True Positives / (True Positives + False Negatives)

Area Under ROC Curve

The ROC (receiver operating characteristic) curve plots the true positive rate against the false positive rate at different classification thresholds, and the AUC (area under the curve) represents the degree of separability. It tells how capable the model is of distinguishing between classes: the higher the AUC, the better the model is at predicting zeros as zeros and ones as ones, while an AUC of 0.5 means that the model has no class-separation capacity whatsoever.
AUC-ROC is computed from predicted probabilities (or scores) rather than hard 0/1 predictions. It should be used when the data is roughly balanced; using ROC on heavily imbalanced datasets can lead to misleading interpretations.

Confusion Matrix

A confusion matrix lays out the performance of a learning algorithm. The confusion matrix is simply a square matrix
that reports the counts of the true positive (TP), true negative (TN), false positive (FP), and false negative (FN)
predictions of a classifier.
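A minimal sketch of computing these metrics with sklearn.metrics on a small set of hypothetical predictions:

from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_score, recall_score, roc_auc_score)

y_true  = [0, 0, 1, 1, 1, 0, 1, 0]                   # actual labels
y_pred  = [0, 1, 1, 1, 0, 0, 1, 0]                   # hard 0/1 predictions
y_score = [0.2, 0.6, 0.8, 0.9, 0.4, 0.1, 0.7, 0.3]   # predicted probabilities

print(confusion_matrix(y_true, y_pred))   # rows: actual class; columns: predicted class
print(accuracy_score(y_true, y_pred))     # (TP + TN) / total predictions
print(precision_score(y_true, y_pred))    # TP / (TP + FP)
print(recall_score(y_true, y_pred))       # TP / (TP + FN)
print(roc_auc_score(y_true, y_score))     # AUC uses probabilities, not hard labels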

McFadden’s Pseudo R2 for Logistic Regression

Logistic regression models are fitted using the method of maximum likelihood (i.e., the parameter estimates are those
values which maximize the likelihood of the data which have been observed). McFadden’s R squared measure is
defined as:
R²_McFadden = 1 − log(L_c) / log(L_null),

where L_c denotes the (maximized) likelihood value from the current fitted model, and L_null denotes the corresponding value but for the null model – the model with only an intercept and no covariates.

In most empirical research one typically cannot hope to find predictors strong enough to give predicted probabilities so close to 0 or 1, so one shouldn't be surprised to obtain a value of McFadden's R² that looks low by linear-regression standards. When comparing two models on the same data, McFadden's R² will be higher for the model with the greater likelihood.
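As a hedged sketch (the dataset is synthetic, and a very weak penalty is used so that sklearn's fit approximates plain maximum likelihood), McFadden's R² can be computed directly from the two log-likelihoods:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, n_features=5, random_state=0)

# Fitted model: very weak regularization approximates plain maximum likelihood
model = LogisticRegression(C=1e6, max_iter=1000).fit(X, y)
p = np.clip(model.predict_proba(X)[:, 1], 1e-12, 1 - 1e-12)
ll_model = np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

# Null model: intercept only, i.e. a constant predicted probability
p_null = y.mean()
ll_null = np.sum(y * np.log(p_null) + (1 - y) * np.log(1 - p_null))

print(1 - ll_model / ll_null)   # McFadden's pseudo R-squared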

Strategies to Choose the Right Metric

Choose Accuracy when


▪ The cost of FP and FN are roughly equal
▪ The benefit of TP and TN are roughly equal

Choose Precision when


▪ The cost of FP is much higher than FN
▪ The benefit of TP is much higher than TN

Choose Recall when


▪ The cost of FN is much higher than FP
▪ The benefit of TN is much higher than TP

Area Under ROC Curve (AUC-ROC Curve) or Precision-Recall Curve


▪ Use AUC-ROC when dealing with a balanced (or nearly balanced) dataset
▪ Use the precision-recall curve for imbalanced datasets
