
UNIT-4

#Classification
The Classification algorithm is a Supervised Learning technique that is used to identify the
category of new observations on the basis of training data. In Classification, a program learns
from the given dataset or observations and then classifies new observations into a number of
classes or groups, such as Yes or No, 0 or 1, Spam or Not Spam, cat or dog, etc. Classes can
be called targets, labels, or categories.
Unlike regression, the output variable of Classification is a category, not a value, such as
"Green or Blue", "fruit or animal", etc. Since the Classification algorithm is a Supervised
learning technique, hence it takes labeled input data, which means it contains input with the
corresponding output.
In a classification algorithm, a discrete output function y = f(x) is learned that maps the input variable x to a discrete class label.
The best example of an ML classification algorithm is Email Spam Detector.
The main goal of the Classification algorithm is to identify the category of a given dataset,
and these algorithms are mainly used to predict the output for the categorical data.
Classification algorithms can be better understood using the below diagram. In the below
diagram, there are two classes, Class A and Class B. Within each class, the data points have
features that are similar to each other and dissimilar to those of the other class.

The algorithm which implements the classification on a dataset is known as a classifier.


There are two types of Classifications:
o Binary Classifier: If the classification problem has only two possible outcomes, it is
called a Binary Classifier.
Examples: YES or NO, MALE or FEMALE, SPAM or NOT SPAM, CAT or DOG, etc.
o Multi-class Classifier: If a classification problem has more than two outcomes, it is
called a Multi-class Classifier.
Examples: classification of types of crops, classification of types of music.
Learners in Classification Problems:
In the classification problems, there are two types of learners:
1. Lazy Learners: A lazy learner first stores the training dataset and waits until it receives
the test dataset. In the lazy learner's case, classification is done on the basis of the most
related data stored in the training dataset. It takes less time in training but more time
for predictions.
Example: K-NN algorithm, Case-based reasoning
2. Eager Learners: Eager learners develop a classification model based on a training
dataset before receiving a test dataset. Opposite to lazy learners, an eager learner
takes more time in learning and less time in prediction. Example: Decision Trees,
Naïve Bayes, ANN.
Types of ML Classification Algorithms:
Classification algorithms can be further divided into mainly two categories:
o Linear Models
o Logistic Regression
o Support Vector Machines
o Non-linear Models
o K-Nearest Neighbours
o Kernel SVM
o Naïve Bayes
o Decision Tree Classification
o Random Forest Classification
Evaluating a Classification model:
Once our model is complete, it is necessary to evaluate its performance, whether it is a
Classification or a Regression model. For evaluating a Classification model, we have the
following ways:
1. Log Loss or Cross-Entropy Loss:
o It is used for evaluating the performance of a classifier whose output is a probability
value between 0 and 1.
o For a good binary Classification model, the value of log loss should be near to 0.
o The value of log loss increases if the predicted value deviates from the actual value.
o The lower log loss represents the higher accuracy of the model.
o For binary classification, cross-entropy can be calculated as:
Log loss = -(y log(p) + (1 - y) log(1 - p))
where y = actual output and p = predicted probability.
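As an illustration, here is a minimal NumPy sketch of this formula, averaged over a batch of samples; the arrays y_true and y_prob are hypothetical values chosen for the example.

import numpy as np

def binary_log_loss(y_true, y_prob, eps=1e-15):
    # Clip probabilities so log(0) is never taken
    p = np.clip(y_prob, eps, 1 - eps)
    # Average of -(y*log(p) + (1-y)*log(1-p)) over all samples
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

y_true = np.array([1, 0, 1, 1])          # actual outputs
y_prob = np.array([0.9, 0.2, 0.8, 0.6])  # predicted probabilities
print(binary_log_loss(y_true, y_prob))   # near 0 for a good model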
2. Confusion Matrix:
o The confusion matrix provides us a matrix/table as output and describes the
performance of the model.
o It is also known as the error matrix.
o The matrix summarizes the prediction results, showing the total numbers of correct
and incorrect predictions. The matrix looks like the below table:

                     Actual Positive    Actual Negative
Predicted Positive   True Positive      False Positive
Predicted Negative   False Negative     True Negative

3. AUC-ROC curve:
o ROC curve stands for Receiver Operating Characteristics Curve and AUC stands
for Area Under the Curve.
o It is a graph that shows the performance of the classification model at different
thresholds.
o To visualize the performance of the multi-class classification model, we use the AUC-
ROC Curve.
o The ROC curve is plotted with TPR (True Positive Rate) on the Y-axis and FPR (False
Positive Rate) on the X-axis.
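The sketch below shows how a confusion matrix and the AUC could be computed in practice, assuming scikit-learn is available; the toy labels and probabilities are made up for illustration.

import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])                   # actual classes
y_prob = np.array([0.9, 0.3, 0.7, 0.4, 0.2, 0.6, 0.8, 0.1])   # predicted P(class = 1)
y_pred = (y_prob >= 0.5).astype(int)                          # apply a 0.5 threshold

# Note: scikit-learn's convention is rows = actual, columns = predicted
print(confusion_matrix(y_true, y_pred))
print("AUC:", roc_auc_score(y_true, y_prob))                  # area under the ROC curve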
Use cases of Classification Algorithms
Classification algorithms can be used in different places. Below are some popular use cases
of Classification Algorithms:
o Email Spam Detection
o Speech Recognition
o Identification of cancer tumor cells.
o Drugs Classification
o Biometric Identification, etc.
Classification Types
There are two main classification types in machine learning:
Binary Classification
In binary classification, the goal is to classify the input into one of two classes or categories.
Example – On the basis of the given health conditions of a person, we have to determine
whether the person has a certain disease or not.
Multiclass Classification
In multi-class classification, the goal is to classify the input into one of several classes or
categories. For example, on the basis of data about different species of flowers, we have to
determine which species our observation belongs to.

Binary vs Multi class classification


Other categories of classification involve:
Multi-Label Classification
In multi-label classification, the goal is to predict which of several labels a new data point
belongs to. This is different from multiclass classification, where each data point can only
belong to one class. For example, a multi-label classification algorithm could be used to
classify images of animals as belonging to one or more of the categories cat, dog, bird, or
fish.
Imbalanced Classification
In imbalanced classification, the goal is to predict whether a new data point belongs to a
minority class, even though there are many more examples of the majority class. For
example, a medical diagnosis algorithm could be used to predict whether a patient has a
rare disease, even though there are many more patients with common diseases.
Classification Algorithms
There are various types of classification algorithms. Some of them are:
Linear Classifiers
Linear models create a linear decision boundary between classes. They are simple and
computationally efficient. Some of the linear classification models are as follows:
 Logistic Regression
 Support Vector Machines having kernel = ‘linear’
 Single-layer Perceptron
 Stochastic Gradient Descent (SGD) Classifier
Non-linear Classifiers
Non-linear models create a non-linear decision boundary between classes. They can capture
more complex relationships between the input features and the target variable. Some of the
non-linear classification models are as follows:
 K-Nearest Neighbours
 Kernel SVM
 Naive Bayes
 Decision Tree Classification
 Ensemble learning classifiers:
 Random Forests,
 AdaBoost,
 Bagging Classifier,
 Voting Classifier,
 ExtraTrees Classifier
 Multi-layer Artificial Neural Networks
Learners in Classifications Algorithm
In machine learning, classification learners can also be classified as either “lazy” or “eager”
learners.
 Lazy Learners: Lazy learners, also known as instance-based learners, do not learn a
model during the training phase. Instead, they simply store the training data and use
it to classify new instances at prediction time. Training is very fast because no model
has to be built, but prediction can be slow since the computations are deferred until
prediction time. Lazy learning is also less effective in high-dimensional spaces or when
the number of training instances is large. Examples of lazy learners include k-nearest
neighbors and case-based reasoning.
 Eager Learners: Eager learners, also known as model-based learners, learn a model
from the training data during the training phase and use this model to classify new
instances at prediction time. They are more effective in high-dimensional spaces and
with large training datasets. Examples of eager learners include decision trees,
random forests, and support vector machines.
Evaluating Classification Models in Machine Learning
Evaluating a classification model is an important step in machine learning, as it helps to
assess the performance and generalization ability of the model on new, unseen data. There
are several metrics and techniques that can be used to evaluate a classification model,
depending on the specific problem and requirements. Here are some commonly used
evaluation metrics:
 Classification Accuracy: The proportion of correctly classified instances over the total
number of instances in the test set. It is a simple and intuitive metric but can be
misleading in imbalanced datasets where the majority class dominates the accuracy
score.
 Confusion matrix: A table that shows the number of true positives, true negatives,
false positives, and false negatives for each class, which can be used to calculate
various evaluation metrics.
 Precision and Recall: Precision measures the proportion of true positives over the
total number of predicted positives, while recall measures the proportion of true
positives over the total number of actual positives. These metrics are useful in
scenarios where one class is more important than the other, or when there is a
trade-off between false positives and false negatives.
 F1-Score: The harmonic mean of precision and recall, calculated as 2 x (precision x
recall) / (precision + recall). It is a useful metric for imbalanced datasets where both
precision and recall are important.
 ROC curve and AUC: The Receiver Operating Characteristic (ROC) curve is a plot of
the true positive rate (recall) against the false positive rate (1-specificity) for different
threshold values of the classifier’s decision function. The Area Under the Curve (AUC)
measures the overall performance of the classifier, with values ranging from 0.5
(random guessing) to 1 (perfect classification).
 Cross-validation: A technique that divides the data into multiple folds and trains the
model on all but one fold while testing on the held-out fold, repeating this for each
fold to obtain a more robust estimate of the model's performance.
It is important to choose the appropriate evaluation metric(s) based on the specific problem
and requirements, and to avoid overfitting by evaluating the model on independent test
data.
Characteristics of Classification
Here are the characteristics of the classification:
 Categorical Target Variable: Classification deals with predicting categorical target
variables that represent discrete classes or labels. Examples include classifying emails
as spam or not spam, predicting whether a patient has a high risk of heart disease, or
identifying image objects.
 Accuracy and Error Rates: Classification models are evaluated based on their ability
to correctly classify data points. Common metrics include accuracy, precision, recall,
and F1-score.
 Model Complexity: Classification models range from simple linear classifiers to more
complex nonlinear models. The choice of model complexity depends on the
complexity of the relationship between the input features and the target variable.
 Overfitting and Underfitting: Classification models are susceptible to overfitting and
underfitting. Overfitting occurs when the model learns the training data too well and
fails to generalize to new data; underfitting occurs when the model is too simple to
capture the underlying patterns.
#Linear Classification
Logistic Regression in Machine Learning
o Logistic regression is one of the most popular Machine Learning algorithms, which
comes under the Supervised Learning technique. It is used for predicting the
categorical dependent variable using a given set of independent variables.
o Logistic regression predicts the output of a categorical dependent variable. Therefore
the outcome must be a categorical or discrete value. It can be either Yes or No, 0 or
1, true or False, etc. but instead of giving the exact value as 0 and 1, it gives the
probabilistic values which lie between 0 and 1.
o Logistic Regression is quite similar to Linear Regression except in how they are used.
Linear Regression is used for solving regression problems, whereas Logistic
Regression is used for solving classification problems.
o In Logistic Regression, instead of fitting a regression line, we fit an "S"-shaped logistic
function, whose output is bounded by the two limiting values 0 and 1.
o The curve from the logistic function indicates the likelihood of something such as
whether the cells are cancerous or not, a mouse is obese or not based on its weight,
etc.
o Logistic Regression is a significant machine learning algorithm because it has the
ability to provide probabilities and classify new data using continuous and discrete
datasets.
o Logistic Regression can be used to classify the observations using different types of
data and can easily determine the most effective variables used for the classification.
The below image is showing the logistic function:

Note: Logistic regression uses the concept of predictive modeling like regression, which is
why it is called logistic regression; however, because it is used to classify samples, it falls
under classification algorithms.
Logistic Function (Sigmoid Function):
o The sigmoid function is a mathematical function used to map the predicted values to
probabilities.
o It maps any real value into another value within a range of 0 and 1.
o The output of logistic regression must lie between 0 and 1 and cannot go beyond
this limit, so it forms a curve like the "S" form. The S-form curve is called the
Sigmoid function or the logistic function.
o In logistic regression, we use the concept of a threshold value, which decides
between the classes 0 and 1: values above the threshold tend to 1, and values below
the threshold tend to 0.
Assumptions for Logistic Regression:
o The dependent variable must be categorical in nature.
o The independent variables should not exhibit multicollinearity.
Logistic Regression Equation:
The Logistic regression equation can be obtained from the Linear Regression equation. The
mathematical steps to get Logistic Regression equations are given below:
o We know the equation of the straight line can be written as:
y = b0 + b1x1 + b2x2 + ... + bnxn
o In Logistic Regression, y can be between 0 and 1 only, so let's divide the above
equation by (1 - y):
y / (1 - y); this ratio is 0 for y = 0 and infinity for y = 1
o But we need a range between -infinity and +infinity, so taking the logarithm of the
equation, it becomes:
log[y / (1 - y)] = b0 + b1x1 + b2x2 + ... + bnxn
The above equation is the final equation for Logistic Regression.
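To make the derivation concrete, here is a small NumPy sketch (with made-up coefficients b0 and b1) showing that applying the sigmoid to the linear part yields a probability, and that the log-odds of that probability give back the linear part:

import numpy as np

def sigmoid(t):
    # Maps any real value into the range (0, 1)
    return 1.0 / (1.0 + np.exp(-t))

b0, b1 = -1.0, 2.0                 # hypothetical coefficients
x = np.array([-2.0, 0.0, 1.0, 3.0])
log_odds = b0 + b1 * x             # the linear part: log(y / (1 - y))
y = sigmoid(log_odds)              # probabilities strictly between 0 and 1

print(y)
print(np.log(y / (1 - y)))         # recovers log_odds, confirming the identity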


Type of Logistic Regression:
On the basis of the categories, Logistic Regression can be classified into three types:
o Binomial: In binomial Logistic Regression, there can be only two possible types of the
dependent variable, such as 0 or 1, Pass or Fail, etc.
o Multinomial: In multinomial Logistic Regression, there can be 3 or more possible
unordered types of the dependent variable, such as "cat", "dog", or "sheep".
o Ordinal: In ordinal Logistic Regression, there can be 3 or more possible ordered types
of the dependent variable, such as "low", "medium", or "high".
Support Vector Machine Algorithm
Support Vector Machine or SVM is one of the most popular Supervised Learning algorithms,
which is used for Classification as well as Regression problems. However, primarily, it is used
for Classification problems in Machine Learning.
The goal of the SVM algorithm is to create the best line or decision boundary that can
segregate n-dimensional space into classes so that we can easily put the new data point in
the correct category in the future. This best decision boundary is called a hyperplane.
SVM chooses the extreme points/vectors that help in creating the hyperplane. These
extreme cases are called support vectors, and hence the algorithm is termed Support
Vector Machine. Consider the below diagram, in which there are two different categories
that are classified using a decision boundary or hyperplane:

Example: SVM can be understood with the example that we used in the KNN classifier.
Suppose we see a strange cat that also has some features of dogs; if we want a model that
can accurately identify whether it is a cat or a dog, such a model can be created using the
SVM algorithm. We will first train our model with lots of images of cats and dogs so that it
can learn their different features, and then we test it with this strange creature. The SVM
creates a decision boundary between the two classes (cat and dog) and chooses the
extreme cases (support vectors) of each class. On the basis of the support vectors, it will
classify the new example as a cat. Consider the below diagram:
SVM algorithm can be used for Face detection, image classification, text categorization, etc.
Types of SVM
SVM can be of two types:
o Linear SVM: Linear SVM is used for linearly separable data; if a dataset can be
classified into two classes by using a single straight line, then such data is termed
linearly separable data, and the classifier used is called a Linear SVM classifier.
o Non-linear SVM: Non-linear SVM is used for non-linearly separable data; if a dataset
cannot be classified by using a straight line, then such data is termed non-linear data,
and the classifier used is called a Non-linear SVM classifier.
Hyperplane and Support Vectors in the SVM algorithm:
Hyperplane: There can be multiple lines/decision boundaries to segregate the classes in n-
dimensional space, but we need to find out the best decision boundary that helps to classify
the data points. This best boundary is known as the hyperplane of SVM.
The dimensions of the hyperplane depend on the number of features present in the dataset:
if there are 2 features (as shown in the image), the hyperplane is a straight line, and if there
are 3 features, the hyperplane is a 2-dimensional plane.
We always create the hyperplane that has the maximum margin, i.e., the maximum distance
to the nearest data points of each class.
Support Vectors:
The data points or vectors that are the closest to the hyperplane and which affect the
position of the hyperplane are termed as Support Vector. Since these vectors support the
hyperplane, hence called a Support vector.
How does SVM work?
Linear SVM:
The working of the SVM algorithm can be understood by using an example. Suppose we
have a dataset that has two tags (green and blue), and the dataset has two features x1 and
x2. We want a classifier that can classify the pair(x1, x2) of coordinates in either green or
blue. Consider the below image:

As this is a 2-D space, we can easily separate these two classes just by using a straight line.
But there can be multiple lines that separate these classes. Consider the below image:

Hence, the SVM algorithm helps to find the best line or decision boundary; this best
boundary or region is called a hyperplane. The SVM algorithm finds the points of both
classes that are closest to the line. These points are called support vectors. The distance
between the support vectors and the hyperplane is called the margin, and the goal of SVM
is to maximize this margin. The hyperplane with the maximum margin is called the optimal
hyperplane.
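As a minimal sketch (assuming scikit-learn and a tiny made-up dataset), a linear SVM can be fitted and its support vectors inspected as follows:

import numpy as np
from sklearn.svm import SVC

# Hypothetical 2-D points with two tags (0 = blue, 1 = green)
X = np.array([[1, 2], [2, 3], [2, 1], [6, 5], [7, 7], [8, 6]])
y = np.array([0, 0, 0, 1, 1, 1])

# A large C approximates a hard-margin (maximum-margin) hyperplane
clf = SVC(kernel="linear", C=1e6).fit(X, y)

print(clf.support_vectors_)           # the points closest to the hyperplane
print(clf.coef_, clf.intercept_)      # w and b of the hyperplane w.x + b = 0
print(clf.predict([[3, 2], [7, 6]]))  # classify new points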

Non-Linear SVM:
If data is linearly arranged, then we can separate it by using a straight line, but for non-linear
data, we cannot draw a single straight line. Consider the below image:

So to separate these data points, we need to add one more dimension. For linear data, we
have used two dimensions x and y, so for non-linear data, we will add a third dimension z. It
can be calculated as:
z = x² + y²
By adding the third dimension, the sample space will become as below image:

So now, SVM will divide the datasets into classes in the following way. Consider the below
image:
Since we are in 3-D space, the decision boundary looks like a plane parallel to the x-axis. If
we convert it back into 2-D space with z = 1, it becomes:

Hence we get a circle of radius 1 in the case of non-linear data.
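The sketch below mimics this idea explicitly, assuming scikit-learn: circular toy data is generated, the extra feature z = x² + y² is appended, and a linear SVM then separates the classes in 3-D. In practice, SVC(kernel='rbf') or a polynomial kernel performs such a mapping implicitly via the kernel trick.

import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)

# Hypothetical non-linearly separable data: class 1 inside the unit circle
X = rng.uniform(-2, 2, size=(200, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 < 1).astype(int)

# Add the third dimension z = x^2 + y^2 described above
z = (X[:, 0] ** 2 + X[:, 1] ** 2).reshape(-1, 1)
X3 = np.hstack([X, z])

# In 3-D the classes become separable by a plane (roughly z = 1)
clf = LinearSVC(C=10.0, max_iter=10000).fit(X3, y)
print(clf.score(X3, y))  # accuracy should be close to 1.0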


#Ensemble Classifiers used in data mining
Ensemble learning helps improve machine learning results by combining several models.
This approach allows the production of better predictive performance compared to a single
model. The basic idea is to learn a set of classifiers (experts) and to allow them to vote.
Advantage: Improvement in predictive accuracy.
Disadvantage: It is difficult to interpret an ensemble of classifiers.

Why do ensembles work?


Dietterich (2002) showed that ensembles overcome three problems –
 Statistical Problem –
The Statistical Problem arises when the hypothesis space is too large for the amount
of available data. Hence, there are many hypotheses with the same accuracy on the
data and the learning algorithm chooses only one of them! There is a risk that the
accuracy of the chosen hypothesis is low on unseen data!
 Computational Problem –
The Computational Problem arises when the learning algorithm cannot guarantee
finding the best hypothesis.
 Representational Problem –
The Representational Problem arises when the hypothesis space does not contain
any good approximation of the target class(es).
Main Challenge for Developing Ensemble Models?
The main challenge is not to obtain highly accurate base models, but rather to obtain base
models which make different kinds of errors. For example, if ensembles are used for
classification, high accuracies can be accomplished if different base models misclassify
different training examples, even if the base classifier accuracy is low.
Methods for Independently Constructing Ensembles –
 Majority Vote
 Bagging and Random Forest
 Randomness Injection
 Feature-Selection Ensembles
 Error-Correcting Output Coding
Methods for Coordinated Construction of Ensembles –
 Boosting
 Stacking
Reliable Classification: Meta-Classifier Approach
Co-Training and Self-Training
Types of Ensemble Classifier –
Bagging:
Bagging (Bootstrap Aggregation) is used to reduce the variance of a decision tree. Given a
set D of d tuples, at each iteration i a training set Di of d tuples is sampled with replacement
from D (i.e., a bootstrap sample). A classifier model Mi is then learned for each training set
Di. Each classifier Mi returns its class prediction. The bagged classifier M* counts the votes
and assigns the class with the most votes to X (an unknown sample).
Implementation steps of Bagging –
1. Multiple subsets are created from the original data set with equal tuples, selecting
observations with replacement.
2. A base model is created on each of these subsets.
3. Each model is learned in parallel from each training set and independent of each
other.
4. The final predictions are determined by combining the predictions from all the
models.
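A minimal sketch of these steps, assuming scikit-learn (whose BaggingClassifier uses a decision tree as its default base model) and its built-in breast-cancer dataset:

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# 50 base models, each trained on a bootstrap sample of the training set
bag = BaggingClassifier(n_estimators=50, random_state=42)
bag.fit(X_train, y_train)
print(bag.score(X_test, y_test))  # accuracy of the majority-vote prediction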
Random Forest:
Random Forest is an extension over bagging. Each classifier in the ensemble is a decision
tree classifier and is generated using a random selection of attributes at each node to
determine the split. During classification, each tree votes and the most popular class is
returned.
Implementation steps of Random Forest –
1. Multiple subsets are created from the original data set, selecting observations
with replacement.
2. A subset of features is selected randomly, and whichever feature gives the
best split is used to split the node iteratively.
3. The tree is grown to its largest extent.
4. Repeat the above steps and prediction is given based on the aggregation of
predictions from n number of trees.
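A corresponding sketch for Random Forest, again assuming scikit-learn and its built-in iris dataset:

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Each tree sees a bootstrap sample and a random subset of features per split
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=0)
forest.fit(X_train, y_train)
print(forest.score(X_test, y_test))  # accuracy of the most-popular-class vote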

Ensemble classifiers combine the predictions of multiple base models (weak or strong
learners) to create a more robust and accurate prediction model. The main idea is that by
leveraging the strengths of different models, ensembles can reduce bias, variance, or both,
often outperforming individual models.

Why Use Ensemble Methods?


1. Improved Accuracy:
o Ensemble methods capitalize on the diversity of individual models to correct
each other's errors.
2. Reduced Overfitting:
o By averaging predictions or using majority voting, ensembles reduce the risk
of overfitting on training data.
3. Flexibility:
o They can combine models of different types (e.g., decision trees, SVMs,
neural networks).

Types of Ensemble Classifiers


1. Bagging (Bootstrap Aggregating):
o How it works:
 Trains multiple models on different random subsets of the training
data (with replacement).
 Combines predictions through averaging (for regression) or majority
voting (for classification).
o Goal: Reduce variance without increasing bias.
o Example: Random Forest.
o Key Benefit: Works well with high-variance models like decision trees.
2. Boosting:
o How it works:
 Trains models sequentially, where each model corrects the errors of
the previous one.
 Combines predictions through weighted averaging or voting.
o Goal: Reduce bias and variance.
o Example: AdaBoost, Gradient Boosting, XGBoost, LightGBM, CatBoost.
o Key Benefit: Excellent for improving performance on weak learners.
3. Stacking (Stacked Generalization):
o How it works:
 Combines predictions from multiple base models using a meta-model.
 The meta-model learns to weigh and combine the predictions of the
base models.
o Goal: Leverage the strengths of different types of models.
o Example: Using logistic regression or gradient boosting as a meta-model.
o Key Benefit: Highly flexible; can mix any type of model.
4. Voting:
o How it works:
 Combines predictions of multiple models through simple voting (for
classification) or averaging (for regression).
 Can be hard (majority voting) or soft (weighted probabilities).
o Goal: Aggregate predictions from diverse models.
o Example: Combining SVM, decision trees, and logistic regression.
o Key Benefit: Simple and effective (a code sketch of voting and stacking follows this list).
5. Bagging + Boosting Hybrid:
o Combines bagging and boosting techniques, e.g., XGBoost with subsampling.
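Here is a minimal sketch of hard voting and stacking over the same base models, assuming scikit-learn and its iris dataset; the choice of base models and meta-model is illustrative only.

from sklearn.datasets import load_iris
from sklearn.ensemble import StackingClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
base = [("svm", SVC()),
        ("tree", DecisionTreeClassifier()),
        ("lr", LogisticRegression(max_iter=1000))]

# Hard voting: each model casts one vote; the majority class wins
voter = VotingClassifier(estimators=base, voting="hard").fit(X, y)

# Stacking: a logistic-regression meta-model combines the base predictions
stack = StackingClassifier(estimators=base,
                           final_estimator=LogisticRegression()).fit(X, y)

print(voter.score(X, y), stack.score(X, y))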

Popular Ensemble Models


1. Random Forest:
o Uses bagging with decision trees.
o Combines predictions from multiple decision trees for robust performance.
2. AdaBoost:
o Sequentially adjusts model weights to focus on misclassified examples.
o Often uses decision stumps as weak learners.
3. Gradient Boosting:
o Optimizes the loss function iteratively by training new models to correct the
errors of previous models.
4. XGBoost (Extreme Gradient Boosting):
o Efficient and scalable implementation of gradient boosting with
regularization.
5. LightGBM:
o A gradient boosting method optimized for speed and memory efficiency.
6. CatBoost:
o Gradient boosting designed for categorical features with minimal
preprocessing.
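A brief sketch contrasting AdaBoost and gradient boosting, assuming scikit-learn (XGBoost, LightGBM, and CatBoost are separate third-party libraries with similar fit/predict interfaces):

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# AdaBoost: reweights misclassified samples; its default base learner is a decision stump
ada = AdaBoostClassifier(n_estimators=100, random_state=1).fit(X_train, y_train)

# Gradient boosting: each new tree is fitted to the errors of the current ensemble
gbm = GradientBoostingClassifier(n_estimators=100, random_state=1).fit(X_train, y_train)

print(ada.score(X_test, y_test), gbm.score(X_test, y_test))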

Advantages of Ensemble Classifiers


1. High Accuracy:
o Outperform single models in most cases.
2. Versatility:
o Applicable to both classification and regression problems.
3. Robustness:
o Resistant to overfitting when designed properly.

Challenges
1. Complexity:
o More computationally intensive than single models.
o Difficult to interpret.
2. Diminishing Returns:
o Over-complex ensembles might not significantly improve performance.
3. Risk of Overfitting:
o If not managed, boosting models can overfit noisy data.

Use Cases of Ensemble Classifiers


1. Healthcare:
o Disease diagnosis (e.g., cancer prediction using ensemble of random forest
and gradient boosting).
2. Finance:
o Fraud detection, risk assessment, and stock price prediction.
3. E-commerce:
o Recommendation systems (e.g., predicting user preferences).
4. Image and Text Analysis:
o Sentiment analysis, object recognition, and classification.
# Model Selection
Model selection is the process of choosing the best model from a set of candidate models
based on certain criteria. The goal is to select a model that best serves a given function, such
as prediction or parameter estimation.
Here are some model selection techniques:
 Cross-validation: A resampling technique that splits the data into groups or folds. One
fold is held out as the test set while the remaining folds form the training set, and the
model is trained and evaluated once per fold.
 Akaike Information Criterion (AIC): A probabilistic technique that measures the
quality of a statistical model for a given dataset.
 Bayesian Information Criterion (BIC): A probabilistic technique used for model
selection.
 Minimum Description Length (MDL): A probabilistic technique used for model
selection.
 Bayesian model selection (BMS): A method for determining the most likely
hypothesis about the mechanisms that generated observed data.
 Testing-based approaches: Approaches that select variables based on whether they
are significant when added or removed. Examples include backward elimination,
forward selection, and stepwise regression.
Model selection refers to the process of choosing the most appropriate machine learning
model from a set of candidates for a specific task. The goal is to find a model that balances
bias and variance, performs well on unseen data, and aligns with the problem's constraints.

Key Considerations in Model Selection


1. Nature of the Problem:
o Classification: Logistic regression, decision trees, random forests, etc.
o Regression: Linear regression, support vector regression, etc.
o Clustering: K-means, DBSCAN, hierarchical clustering.
o Time Series: ARIMA, LSTM.
2. Data Size and Quality:
o Large datasets favor complex models like neural networks.
o Small datasets require simpler models to avoid overfitting.
3. Model Interpretability:
o For explainable decisions, models like linear regression or decision trees are
preferred.
o Complex models like deep neural networks may lack interpretability.
4. Computational Resources:
o Simple models (e.g., linear regression) are computationally inexpensive.
o Complex models (e.g., deep learning) require more resources and time.
5. Domain Knowledge:
o Incorporating domain knowledge can guide the selection of features and
models.

Steps in Model Selection


1. Define the Objective:
o Understand the problem and the type of prediction required (e.g.,
classification, regression).
2. Split the Data:
o Divide the dataset into training, validation, and test sets.
3. Train Multiple Models:
o Train different algorithms on the training data.
4. Evaluate Performance:
o Use metrics to assess model performance on validation data.
5. Fine-tune the Model:
o Optimize hyperparameters for the best-performing model using techniques
like grid search or random search.
6. Select the Final Model:
o Choose the model with the best balance of metrics on validation and test
datasets.

Model Evaluation Metrics


Classification Metrics:
 Accuracy: Proportion of correct predictions.
 Precision: Focuses on the accuracy of positive predictions.
 Recall (Sensitivity): Measures the ability to identify true positives.
 F1-Score: Harmonic mean of precision and recall.
 ROC-AUC: Evaluates model performance across thresholds.
Regression Metrics:
 Mean Squared Error (MSE): Penalizes large errors.
 Mean Absolute Error (MAE): Measures average absolute errors.
 R² Score: Indicates how well the model explains variance in the data.
Other Considerations:
 Cross-validation scores to ensure generalizability.
 Confusion matrix for detailed classification errors.

Model Selection Techniques


1. Train-Test Split:
o Split data into training and testing subsets. Train on the former, evaluate on
the latter.
2. Cross-Validation:
o Use k-fold cross-validation to divide the data into k subsets, ensuring
robust evaluation.
3. Hyperparameter Tuning:
o Use techniques like grid search, random search, or Bayesian optimization to
find the best configuration for a model.
4. Model Comparison:
o Train multiple models and compare them based on evaluation metrics.
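As a sketch of these techniques together (assuming scikit-learn and its iris dataset), two candidate models are compared with 5-fold cross-validation and the better model family is then tuned with grid search:

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Compare candidate models with 5-fold cross-validation
for name, model in [("logreg", LogisticRegression(max_iter=1000)), ("svm", SVC())]:
    scores = cross_val_score(model, X, y, cv=5)
    print(name, scores.mean())

# Tune hyperparameters of the chosen model family with grid search
grid = GridSearchCV(SVC(), param_grid={"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]}, cv=5)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)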

Common Tools and Libraries for Model Selection


1. Scikit-learn:
o Provides functions for train-test splits, cross-validation, and hyperparameter
tuning.
o Example: GridSearchCV, RandomizedSearchCV.
2. MLflow:
o Tracks experiments and aids in model selection.
3. TensorFlow/Keras and PyTorch:
o Useful for deep learning model selection.
4. H2O.ai and AutoML tools:
o Automates model training and selection.
Challenges in Model Selection
1. Overfitting/Underfitting:
o A model that performs well on training data but poorly on validation data is
overfitting.
o A model that performs poorly on both is underfitting.
2. Bias-Variance Tradeoff:
o Simpler models have high bias but low variance; complex models have low
bias but high variance.
3. High-Dimensional Data:
o Requires dimensionality reduction or feature selection to avoid the curse of
dimensionality.
4. Data Imbalance:
o Skewed class distributions can bias performance metrics.
# Cross Validation
Cross validation is a technique used in machine learning to evaluate the performance of a
model on unseen data. It involves dividing the available data into multiple folds or subsets,
using one of these folds as a validation set, and training the model on the remaining folds.
This process is repeated multiple times, each time using a different fold as the validation set.
Finally, the results from each validation step are averaged to produce a more robust
estimate of the model’s performance. Cross validation is an important step in the machine
learning process and helps to ensure that the model selected for deployment is robust and
generalizes well to new data.
What is cross-validation used for?
The main purpose of cross validation is to prevent overfitting, which occurs when a model is
trained too well on the training data and performs poorly on new, unseen data. By
evaluating the model on multiple validation sets, cross validation provides a more realistic
estimate of the model’s generalization performance, i.e., its ability to perform well on new,
unseen data.
Types of Cross-Validation
There are several types of cross validation techniques, including k-fold cross validation,
leave-one-out cross validation, and Holdout validation, Stratified Cross-Validation. The
choice of technique depends on the size and nature of the data, as well as the specific
requirements of the modeling problem.
1. Holdout Validation
In Holdout Validation, we perform training on 50% of the given dataset and the remaining
50% is used for testing. It's a simple and quick way to evaluate a model. The major
drawback of this method is that, because we train on only 50% of the dataset, the remaining
50% may contain important information that the model never sees while training, i.e.,
higher bias.
2. LOOCV (Leave One Out Cross Validation)
In this method, we perform training on the whole dataset but leave out a single data point,
iterating over each data point in turn. In LOOCV, the model is trained on n-1 samples and
tested on the one omitted sample, repeating this process for each data point in the dataset.
It has advantages as well as disadvantages.
An advantage of this method is that we make use of all data points, so the estimate has low
bias.
The major drawback of this method is that it leads to higher variation in the testing
estimate, as we are testing against a single data point each time; if that data point is an
outlier, it can lead to higher variation. Another drawback is that it takes a lot of execution
time, as it iterates n times, once per data point.
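A minimal LOOCV sketch, assuming scikit-learn: one model is trained per held-out sample and the n test results are averaged.

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

X, y = load_iris(return_X_y=True)

# One fold per data point: n models are trained on n-1 samples each
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=LeaveOneOut())
print(scores.mean())  # average accuracy over all n held-out predictions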
3. Stratified Cross-Validation
It is a technique used in machine learning to ensure that each fold of the cross-validation
process maintains the same class distribution as the entire dataset. This is particularly
important when dealing with imbalanced datasets, where certain classes may be
underrepresented. In this method,
1. The dataset is divided into k folds while maintaining the proportion of classes in each
fold.
2. During each iteration, one-fold is used for testing, and the remaining folds are used
for training.
3. The process is repeated k times, with each fold serving as the test set exactly once.
Stratified Cross-Validation is essential when dealing with classification problems where
maintaining the balance of class distribution is crucial for the model to generalize well to
unseen data.
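The sketch below shows stratified splitting on a small made-up imbalanced dataset, assuming scikit-learn; each fold preserves the 4:1 class ratio.

import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical imbalanced labels: 8 samples of class 0, 2 of class 1
X = np.arange(10).reshape(-1, 1)
y = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])

skf = StratifiedKFold(n_splits=2)
for train_idx, test_idx in skf.split(X, y):
    # Each test fold keeps the 4:1 class ratio of the full dataset
    print(test_idx, y[test_idx])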
4. K-Fold Cross Validation
In K-Fold Cross Validation, we split the dataset into k subsets (known as folds), then we
perform training on k-1 of the subsets and leave one subset out for the evaluation of the
trained model. In this method, we iterate k times, with a different subset reserved for
testing each time.
Note: A commonly suggested value of k is 10; a lower value of k moves the procedure
towards a simple holdout validation, while a higher value of k approaches the LOOCV method.
Example of K Fold Cross Validation
The diagram below shows an example of the training subsets and evaluation subsets
generated in k-fold cross-validation. Here, we have total 25 instances. In first iteration we
use the first 20 percent of data for evaluation, and the remaining 80 percent for training ([1-
5] testing and [5-25] training) while in the second iteration we use the second subset of 20
percent for evaluation, and the remaining three subsets of the data for training ([5-10]

testing and [1-5 and 10-25] training), and so on.
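The same 25-instance example can be reproduced with a short scikit-learn sketch:

import numpy as np
from sklearn.model_selection import KFold

X = np.arange(25)  # 25 instances, as in the example above

kf = KFold(n_splits=5)
for i, (train_idx, test_idx) in enumerate(kf.split(X), start=1):
    # Each iteration holds out a different 20 percent of the data for evaluation
    print(f"Iteration {i}: test instances = {test_idx}")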


Comparison between cross-validation and hold out method
Advantages of train/test split:
1. It runs K times faster than K-fold cross-validation, because K-fold cross-validation
repeats the train/test split K times.
2. It is simpler to examine the detailed results of the testing process.
Advantages of cross-validation:
1. More accurate estimate of out-of-sample accuracy.
2. More “efficient” use of data as every observation is used for both training and
testing.
Advantages and Disadvantages of Cross Validation
Advantages:
1. Overcoming Overfitting: Cross validation helps to prevent overfitting by providing a
more robust estimate of the model’s performance on unseen data.
2. Model Selection: Cross validation can be used to compare different models and
select the one that performs the best on average.
3. Hyperparameter tuning: Cross validation can be used to optimize the
hyperparameters of a model, such as the regularization parameter, by selecting the
values that result in the best performance on the validation set.
4. Data Efficient: Cross validation allows the use of all the available data for both
training and validation, making it a more data-efficient method compared to
traditional validation techniques.
Disadvantages:
1. Computationally Expensive: Cross validation can be computationally expensive,
especially when the number of folds is large or when the model is complex and
requires a long time to train.
2. Time-Consuming: Cross validation can be time-consuming, especially when there are
many hyperparameters to tune or when multiple models need to be compared.
3. Bias-Variance Tradeoff: The choice of the number of folds in cross validation can
impact the bias-variance tradeoff: too few folds may result in high bias (each training
set is much smaller than the full dataset), while too many folds may result in high variance.
#Holdout
Holdout Method is the simplest sort of method to evaluate a classifier. In this method, the
data set (a collection of data items or examples) is separated into two sets, called
the Training set and Test set.
A classifier performs the function of assigning data items in a given collection to a target
category or class.
Example –
E-mails in our inbox being classified into spam and non-spam.
A classifier should be evaluated to find out its accuracy, error rate, and error estimates. This
can be done using various methods; one of the most basic methods for evaluating a
classifier is the 'Holdout Method'.
In the holdout method, the data set is partitioned such that the majority of the data belongs
to the training set and the remaining data belongs to the test set.
Example –
If there are 20 data items present, 12 are placed in training set and remaining 8 are placed in
test set.
 After partitioning the data set into two sets, the training set is used to build a
model/classifier.
 After construction of the classifier, we use the data items in the test set to measure
the accuracy, error rate, and error estimate of the model/classifier.
However, it is vital to remember two points with regard to the holdout method:
If the maximum possible number of data items is placed in the training set for the
construction of the model/classifier, the classifier's error rates and estimates will be very low
and its accuracy will be high. This is the sign of a good classifier/model.
Example –
A student 'gfg' is coached by a teacher. The teacher teaches her all possible topics which
might appear in the exam. Hence, she tends to make very few mistakes in the exam, thus
performing well.
The more training data used to construct a classifier, the better it copes with whatever data
items from the test set are used to test it.
If more data items are placed in the test set used to test the classifier built from the training
set, we can obtain a more accurate evaluation of the classifier's accuracy, error rate, and
error estimate.
Example –
A student 'gfg' is coached by a teacher. The teacher teaches her some topics which might
appear in the exam. If the student 'gfg' is then given a number of exams on the basis of this
coaching, an accurate determination of the student's weak and strong points can be made.
Similarly, if more test data are used to evaluate the constructed classifier, its error rate, error
estimate, and accuracy can be determined more accurately.
Problem:
During partitioning of the whole data set into two parts (training set and test set), if all data
items belonging to a class GFG1 are placed entirely in the test set, so that none of the data
items of class GFG1 are in the training set, then it is evident that the model/classifier built is
never trained on data items of class GFG1.
Solution:
Stratification is a technique by which the data items belonging to class GFG1 are divided
and placed into both data sets (training set and test set) proportionally, so that the
model/classifier is also trained on data items belonging to class GFG1.
Example –
Here, all four data items belonging to class GFG1 are divided equally and placed, two data
items each, into the two data sets – training set and test set.

The Holdout Method is a simple and widely used technique for evaluating the performance
of a machine learning model. It involves splitting the dataset into separate training and
testing subsets, training the model on the training set, and evaluating its performance on
the testing set.

Steps in the Holdout Method


1. Dataset Splitting:
 Divide the data into two subsets:
 Training Set: Used to train the machine learning model.
 Testing Set: Used to evaluate the model's performance on unseen
data.
 A typical split ratio is 70:30 or 80:20 (training to testing).
2. Model Training:
 Train the model on the training set.
3. Model Evaluation:
 Test the model's performance on the testing set using appropriate evaluation
metrics.
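A minimal sketch of these three steps, assuming scikit-learn and its iris dataset; the stratify argument applies the stratification idea discussed earlier so that both subsets keep the class balance.

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# 70:30 holdout split; shuffling and stratification keep the split representative
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, shuffle=True, random_state=42)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)  # train on the training set
print(model.score(X_test, y_test))                               # evaluate on unseen data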

Advantages of the Holdout Method


1. Simplicity:
 Easy to implement and understand.
2. Fast:
 Suitable for large datasets as it only requires training the model once.
3. No Data Leakage:
 Ensures that the testing data remains unseen during training, providing an
unbiased evaluation.

Disadvantages of the Holdout Method


1. High Variance:
 Performance depends on the random split of data, which might not represent
the entire dataset.
 A poor split (e.g., skewed class distribution) can lead to misleading results.
2. Inefficient Data Usage:
 Not all data is used for training, which can be limiting for small datasets.
3. Bias in Small Datasets:
 For small datasets, the testing set may not be large enough to evaluate the
model properly.
When to Use the Holdout Method
1. Large Datasets:
 For large datasets, a single split can be representative of the entire data
distribution.
2. Quick Evaluation:
 Use the holdout method for fast prototyping or initial evaluation.
3. Hyperparameter Tuning:
 Often used in combination with validation sets or cross-validation to optimize
hyperparameters.

Holdout vs. Cross-Validation

Aspect                  | Holdout Method                              | Cross-Validation
Computational Cost      | Low (single split and training)             | High (multiple splits and training)
Data Utilization        | Less efficient (unused data in each split)  | More efficient (uses entire dataset for training)
Variance                | High (depends on the split)                 | Low (averages performance across splits)
Ease of Implementation  | Easy                                        | Slightly complex
Use Case                | Quick prototyping or large datasets         | Robust model evaluation for small datasets

Holdout with Validation Set


To improve reliability, the holdout method can include a validation set:
1. Split the Dataset:
 Training Set (60%): Train the model.
 Validation Set (20%): Tune hyperparameters and evaluate during training.
 Testing Set (20%): Final evaluation on unseen data.
2. Advantages:
 Prevents overfitting during hyperparameter tuning.
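A sketch of the 60:20:20 partition using two chained splits, assuming scikit-learn:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Carve off the 20% test set first, then split the remainder 75:25,
# giving a 60:20:20 train/validation/test partition overall
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.20, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.25, random_state=0)

print(len(X_train), len(X_val), len(X_test))  # 90, 30, 30 samples for iris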

Best Practices
1. Shuffle Data:
 Shuffle data before splitting to ensure randomization and prevent patterns in
the order.
2. Stratification:
 Use stratified splitting for classification problems to maintain class balance.
3. Multiple Splits:
 Repeat the holdout process with different splits to reduce variability.
