Model Validation & Data Partition

What is Model Validation and Why is it Important?

We have all read enough articles about Machine Learning to come away with the notion that ‘Machine Learning is about making predictions.’

That is broadly true, but those predictions only appear after a series of steps such as Data Preparation, Choosing a Model, Training the Model, Parameter Tuning, and Model Validation. Only after carrying out these operations is a Machine Learning model (regression or classification) ready to make reliable predictions.

Let’s have a look below to have a better understanding.

What is Model Validation?

As the name ‘Model Validation’ suggests, the model is being subjected to some form of validation; but what does that validation involve? Let’s try to answer that.

Model validation is the process, carried out after model training, in which the trained model is evaluated against a testing data set. The testing data may or may not be drawn from the same data set from which the training set was procured.

Broadly, there are two types of model validation techniques:

 In-sample validation – the testing data comes from the same dataset that was used to build the model.

 Out-of-sample validation – the testing data comes from a new dataset that was not used to build the model.

Conclusion alert! Model validation refers to the process of confirming that the model achieves its intended purpose, i.e., how effective our model is.

But how is it achieved? Take a look below.

The ultimate goal for any machine learning model is to learn from examples in such a manner
that the model is capable of generalizing the learning to new instances which it has not yet
seen. So, when we approach a problem with a dataset in hand, it is very important that we find
the right machine learning algorithm to create our model. Every model has its own strengths
and weaknesses. For instance, some algorithms have a higher tolerance for small datasets,
while others may be good with large amounts of data. For this reason, two different models
using similar data can predict different results with different degrees of accuracy and hence
model validation is required.

The following is the typical chronology for model validation:

-Choose a machine learning algorithm.

-Choose hyperparameters for the model.

-Fit the model to the training data.

-Use the model to predict labels for new data.

Note- In machine learning, we use the term parameters for values that are learned by the algorithm during training, and hyperparameters for values that are chosen beforehand and passed to the algorithm.

The model’s accuracy score is then calculated; if it is low, we change the hyperparameter values used in the model and re-evaluate until we reach an acceptable accuracy score, as sketched below.
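As a concrete illustration of this loop, here is a minimal sketch using scikit-learn and its bundled Iris dataset; the choice of a k-nearest-neighbours classifier and the particular hyperparameter values are assumptions made purely for demonstration.

```python
# A minimal sketch of the fit -> predict -> score -> retune loop.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Steps 1-2: choose an algorithm and a hyperparameter value (n_neighbors).
model = KNeighborsClassifier(n_neighbors=5)

# Step 3: fit the model to the training data.
model.fit(X_train, y_train)

# Step 4: predict labels for data the model has not seen.
y_pred = model.predict(X_test)
print("accuracy:", accuracy_score(y_test, y_pred))

# If the score is unsatisfactory, change the hyperparameter and re-validate.
for k in (1, 3, 7, 15):
    score = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train).score(X_test, y_test)
    print(f"n_neighbors={k}: {score:.3f}")
```

Changing n_neighbors and re-scoring is exactly the "retest until the accuracy is acceptable" cycle described above.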
There are various ways of validating a model; the two best-known are cross-validation and bootstrapping, but no single validation method works in all scenarios. It is therefore important to understand the type of data we are working with.

You can read further material to learn these techniques in more depth.

Importance of Model Validation

Having had a glimpse of model validation, we can see how important a component it is of the entire model development process. Validating a machine learning model’s outputs is essential to ensure its accuracy. Training a model consumes a large amount of data, and checking the model through validation gives machine learning engineers an opportunity to improve both data quality and data quantity. Without checking and validating the model, it is not sound to rely on its predictions. In sensitive areas such as healthcare and self-driving vehicles, a mistake in, say, object detection can lead to fatalities because of wrong decisions taken by the machine in real-life situations. Validating the ML model at the training and development stage helps the model make the right predictions. Some added advantages of model validation are as follows.

 Scalability and flexibility

 Reduced costs

 Enhanced model quality

 Discovery of more errors

 Prevention of overfitting and underfitting


It is extremely important that data scientists validate machine learning models under training for accuracy and stability, ensuring that the model picks up most of the trends and patterns in the data without absorbing too much noise.

It should now be clear that building a machine learning model is not, by itself, enough to justify relying on its predictions; we need to check its accuracy and validate it to ensure the quality of its results and make it usable in real-life applications.

Data partitioning and validation

Data partitioning and validation are crucial steps in machine learning (ML) to ensure that the
developed models generalize well to new, unseen data. The typical approach involves splitting
the available data into training, validation, and test sets. Here's an overview of these concepts:

Data Partitioning:

Training Set: The largest portion of the dataset is used for training the model. The model learns
patterns and relationships from this set.
Validation Set: A smaller portion is set aside for model tuning and hyperparameter
optimization. The performance of the model on the validation set helps in making decisions
about adjustments to the model.
Test Set: Another portion is kept separate for the final evaluation of the model's performance.
This set is not used during the training or tuning phases and serves as an unbiased assessment
of the model's generalization.
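A hedged sketch of such a three-way partition, assuming scikit-learn's train_test_split, placeholder data, and an illustrative 60/20/20 ratio:

```python
# Two successive splits produce train / validation / test partitions.
import numpy as np
from sklearn.model_selection import train_test_split

X = np.random.rand(1000, 10)        # placeholder features
y = np.random.randint(0, 2, 1000)   # placeholder binary labels

# First carve off 20% as the final, untouched test set.
X_temp, X_test, y_temp, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Then split the remainder into training (75% of 80% = 60%) and validation (20%).
X_train, X_val, y_train, y_val = train_test_split(X_temp, y_temp, test_size=0.25, random_state=0)

print(len(X_train), len(X_val), len(X_test))  # 600 200 200
```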
Validation Strategies:

Holdout Method: The dataset is split into training and validation sets. This method is simple
but may lead to variability depending on the random split.
K-Fold Cross-Validation: The dataset is divided into K folds, and the model is trained K times,
each time using K-1 folds for training and the remaining fold for validation. This helps in
obtaining a more robust performance estimate.
Stratified Sampling: Ensures that the distribution of class labels in each partition is
representative of the overall dataset. It is particularly useful for imbalanced datasets.
Validation Metrics:

Common metrics include accuracy, precision, recall, and F1 score for classification problems, and Mean Squared Error (MSE) and R-squared for regression problems.
Choose metrics based on the nature of the problem and the goals of the model, as illustrated below.
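For illustration, the metrics named above can be computed with scikit-learn's metrics module; the label and value arrays below are made up for the example.

```python
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, mean_squared_error, r2_score)

# Classification: true vs. predicted class labels.
y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 0, 1, 0, 1]
print("accuracy :", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred))
print("recall   :", recall_score(y_true, y_pred))
print("F1       :", f1_score(y_true, y_pred))

# Regression: true vs. predicted continuous values.
y_true_r = [2.5, 0.0, 2.1, 7.8]
y_pred_r = [3.0, -0.1, 2.0, 7.5]
print("MSE      :", mean_squared_error(y_true_r, y_pred_r))
print("R-squared:", r2_score(y_true_r, y_pred_r))
```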
Overfitting and Underfitting:

Overfitting: Occurs when the model performs well on the training set but poorly on new,
unseen data. Regularization techniques and tuning hyperparameters can help mitigate
overfitting.
Underfitting: Occurs when the model is too simple and cannot capture the underlying patterns
in the data. Increasing model complexity or using more features may help.
Test Set Evaluation:

Once the model is trained and tuned using the training and validation sets, it is evaluated on
the test set to assess its generalization performance.
The test set performance provides an unbiased estimate of how well the model is expected to
perform on new, unseen data.
By following these practices, you can better ensure that your machine learning models
generalize well and perform reliably on real-world data. It's important to note that proper data
partitioning and validation are integral to producing models that are robust and effective in
various scenarios.

4.1 CROSS VALIDATION


To test the performance of a classifier, we need a number of training/validation set
pairs from a dataset X. To get them, if the sample X is large enough, we can randomly divide
it into several parts, then divide each part randomly into two and use one half for training and
the other half for validation. Unfortunately, datasets are rarely large enough to allow this. So,
we reuse the same data, split in different ways; this is called cross-validation.
Cross-validation is a technique to evaluate predictive models by partitioning the original
sample into a training set to train the model, and a test set to evaluate it.
During the evaluation of machine learning (ML) models, the following questions might arise:
 Is this model the best one available from the hypothesis space of the algorithm in terms of generalization error on an unknown/future data set?
 What training and testing techniques are used for the model?
 What model should be selected from the available ones?
4.2 Methods used for Cross-Validation:
4.2.1 Holdout Method
Consider training a model using an algorithm on a given dataset. Using the same training
data, you determine that the trained model has an accuracy of 95% or even 100%. What
does this mean? Can this model be used for prediction?
No. This is because your model has been trained on the given data, i.e. it knows the data
and has fitted it very well, perhaps even memorizing it. When you try to predict on a new set
of data, it will most likely give much worse accuracy, because it has never seen that data
before and thus cannot generalize well to it. To deal with such problems, hold-out
methods can be employed.
The hold-out method involves splitting the data into multiple parts and using one part
for training the model and the rest for validating and testing it. It can be used for both
model evaluation and selection.
In cases where every piece of data is used for training the model, there remains the
problem of selecting the best model from a list of possible models. Primarily, we want
to identify which model has the lowest generalization error or which model makes a
better prediction on future or unseen datasets than all of the others. There is a need to
have a mechanism that allows the model to be trained on one set of data and tested on
another set of data. This is where hold-out comes into play.
Hold-Out Method for Model Evaluation
Model evaluation using the hold-out method entails splitting the dataset into training and test
datasets, evaluating model performance, and selecting the most suitable model.
The dataset is divided into two parts: one split is held aside as a training set, and the other is
held back for testing or evaluating the model. The split percentage is chosen based on the
amount of data available. A typical split of 70–30% is used, in which 70% of the dataset is
used for training and 30% is used for testing the model.
The objective of this technique is to select the best model based on its accuracy on the testing
dataset and compare it with other models. There is, however, the possibility that the model can
be well fitted to the test data using this technique. In other words, models are trained to
improve model accuracy on test datasets based on the assumption that the test dataset
represents the population. As a result, the test error becomes an optimistic estimation of the
generalization error. Obviously, this is not what we want. Since the final model is trained to fit
well (or overfit) the test data, it won’t generalize well to unknowns or future datasets.
Follow the steps below for using the hold-out method for model evaluation:
1. Split the dataset in two (typically 70–30%; the exact percentage can vary, but the
assignment of examples to each split should be random).

2. Train the model on the training dataset using a fixed set of hyperparameters.
3. Use the hold-out test dataset to evaluate the model.

4. Use the entire dataset to train the final model so that it can generalize better on future
datasets.
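The four steps above might look as follows in scikit-learn; the breast-cancer dataset, the logistic-regression model, and the fixed hyperparameter C=1.0 are illustrative assumptions, not prescriptions.

```python
# Hold-out evaluation: split, train with fixed hyperparameters, evaluate, retrain on everything.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)

# Step 1: random 70-30 split.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

# Step 2: train with a fixed set of hyperparameters (C=1.0).
model = make_pipeline(StandardScaler(), LogisticRegression(C=1.0, max_iter=1000))
model.fit(X_train, y_train)

# Step 3: evaluate on the held-out test data.
print("hold-out accuracy:", model.score(X_test, y_test))

# Step 4: retrain the same configuration on the entire dataset for deployment.
final_model = make_pipeline(StandardScaler(), LogisticRegression(C=1.0, max_iter=1000)).fit(X, y)
```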

In this process, the dataset is split into training and test sets, and a fixed set of hyperparameters
is used to evaluate the model. There is another process in which data can also be split into three
sets, and these sets can be used to select a model or to tune hyperparameters. We will discuss
that technique next.

Hold-Out Method for Model Selection


Sometimes the model selection process is referred to as hyperparameter tuning. During the
hold-out method of selecting a model, the dataset is separated into three sets — training,
validation, and test.
Follow the steps below for using the hold-out method for model selection:
1. Divide the dataset into three parts: training dataset, validation dataset, and test dataset.
2. Now, different machine learning algorithms can be used to train different models. You
can train your classification model, for example, using logistic regression, random forest,
and XGBoost.
3. Tune the hyperparameters for models trained with different algorithms. Change the
hyperparameter settings for each algorithm mentioned in step 2 and come up with
multiple models.
4. On the validation dataset, evaluate the performance of each of these models (one per
algorithm and hyperparameter setting).
5. Choose the best-performing model from those tested on the validation dataset; it comes
with its own best hyperparameter settings. Using the example above, suppose the model
trained with XGBoost and its tuned hyperparameters is selected.
6. Finally, test the performance of the selected model on the test dataset. A sketch of this
selection loop follows.
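In the sketch below, which assumes scikit-learn, GradientBoostingClassifier stands in for XGBoost so the example stays self-contained; the dataset and candidate models are illustrative.

```python
# Hold-out model selection: fit candidates on the training split,
# compare on the validation split, score the winner once on the test split.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=0)

candidates = {
    "logistic_regression": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "random_forest": RandomForestClassifier(n_estimators=200, random_state=0),
    "gradient_boosting": GradientBoostingClassifier(random_state=0),
}

# Steps 2-5: train each candidate and compare on the validation set.
val_scores = {name: m.fit(X_train, y_train).score(X_val, y_val)
              for name, m in candidates.items()}
best_name = max(val_scores, key=val_scores.get)
print("validation scores:", val_scores, "-> selected:", best_name)

# Step 6: report the selected model's performance on the untouched test set.
print("test accuracy:", candidates[best_name].score(X_test, y_test))
```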

4.2.2 K-Fold Cross-Validation


The K-fold cross-validation approach divides the input dataset into K groups of samples of
equal size, called folds. For each learning set, the prediction function is trained on K-1 folds,
and the remaining fold is used as the test set. This is a very popular CV approach because it
is easy to understand and its output is less biased than that of other methods.
The steps for k-fold cross-validation are:
o Split the input dataset into K groups.
o For each group:
  o Take that group as the reserve or test data set.
  o Use the remaining groups as the training dataset.
  o Fit the model on the training set and evaluate its performance on the test set.
Let's take the example of 5-fold cross-validation: the dataset is grouped into 5 folds. In the
1st iteration, the first fold is reserved for testing the model and the rest are used for training.
In the 2nd iteration, the second fold is used for testing and the rest for training. The process
continues until each fold has been used once as the test fold.
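A minimal 5-fold sketch with scikit-learn's KFold and cross_val_score; the dataset and classifier are illustrative choices.

```python
# KFold generates the train/test indices for each of the 5 iterations described above.
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
kf = KFold(n_splits=5, shuffle=True, random_state=0)

scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=kf)
print("fold accuracies:", scores)
print("mean accuracy  :", scores.mean())
```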

4.2.3 Stratified k-fold cross-validation:


This technique is similar to k-fold cross-validation with a few small changes. It is based on
the concept of stratification, the process of rearranging the data so that each fold or group is
a good representative of the complete dataset. It is one of the better approaches for dealing
with bias and variance.

It can be understood with the example of housing prices, where the price of some houses
can be much higher than that of others. To handle such skewed situations, a stratified k-fold
cross-validation technique is useful.
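A sketch of stratified k-fold with scikit-learn follows; note that StratifiedKFold stratifies on class labels, so the example uses a classification dataset rather than the housing-price scenario above, which would require binning the target first.

```python
# StratifiedKFold keeps the class proportions of the full dataset in every fold.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scores = cross_val_score(model, X, y, cv=skf)
print("stratified fold accuracies:", scores)
```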
4.2.4 Leave one out cross-validation
This method is similar to leave-p-out cross-validation, but instead of leaving out p data points,
we leave out only one. In this approach, for each learning set only one data point is reserved
for testing, and the remaining data is used to train the model. The process repeats for each
data point, so for n samples we get n different training sets and n test sets. It has the
following features:
o In this approach, the bias is minimal because almost all the data points are used for training.
o The process is executed n times, so the execution time is high.
o This approach leads to high variance in the estimate of the model's effectiveness, because we
iteratively test against a single data point each time.
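A leave-one-out sketch with scikit-learn, kept to a small dataset because n model fits are required; the classifier choice is illustrative.

```python
# LeaveOneOut creates n folds, each holding out exactly one observation.
from sklearn.datasets import load_iris
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
loo = LeaveOneOut()

scores = cross_val_score(KNeighborsClassifier(), X, y, cv=loo)
print("number of folds:", len(scores))   # equals the number of samples
print("LOOCV accuracy :", scores.mean())
```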
4.3 Bias-Variance Trade-off
It is important to understand the prediction errors (bias and variance) that affect the
accuracy of any machine learning algorithm. There is a trade-off between a model's ability
to minimize bias and its ability to minimize variance, and navigating it well, for example
when selecting the value of a regularization constant, is key. A proper understanding of
these errors helps avoid overfitting and underfitting of a data set while training the algorithm.
Bias
Bias is the difference between the values predicted by the ML model and the correct values.
High bias gives a large error on both training and testing data. It is recommended that an
algorithm be low-bias in order to avoid the problem of underfitting. With high bias, the
predictions tend to follow an overly simple shape, such as a straight line, and thus do not fit
the data set accurately. Such fitting is known as underfitting. It happens when the hypothesis
is too simple or linear in nature.
Bias Variance Tradeoff
If the algorithm is too simple (a hypothesis with a linear equation), it may end up in a
high-bias, low-variance condition and thus be error-prone. If the algorithm fits a model that
is too complex (a hypothesis with a high-degree equation), it may end up in a high-variance,
low-bias condition, and it will not perform well on new entries. Between these two conditions
lies the sweet spot known as the trade-off, or bias-variance trade-off. This trade-off in
complexity is why there is a trade-off between bias and variance: an algorithm cannot be
more complex and less complex at the same time. The ideal model sits at the point where
the combined error from bias and variance is lowest.
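The trade-off can be made concrete with a small, assumed example: polynomial regression on noisy synthetic data, where a low degree underfits (high bias), a very high degree overfits (high variance), and an intermediate degree sits near the sweet spot.

```python
# Compare training vs. test error for polynomials of increasing degree.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.RandomState(0)
X = np.sort(rng.uniform(0, 1, 80))[:, None]
y = np.sin(2 * np.pi * X).ravel() + rng.normal(scale=0.2, size=80)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

for degree in (1, 4, 15):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    train_err = mean_squared_error(y_train, model.predict(X_train))
    test_err = mean_squared_error(y_test, model.predict(X_test))
    # Degree 1: high bias (both errors large). Degree 15: high variance
    # (tiny train error, large test error). Degree 4: a reasonable balance.
    print(f"degree={degree:2d}  train MSE={train_err:.3f}  test MSE={test_err:.3f}")
```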

Model Deployment:

Deploying a machine learning model is one of the most important parts of an ML pipeline
and largely determines the model's applicability and accessibility. Building the model is one
of the most challenging tasks in an ML pipeline for processing and predicting data, but
deploying it successfully is critical to convert your time and effort into real output. There are
several important aspects of model deployment that need to be considered.
Data access and query: You need to make sure that your model has easy access to the data
and can make predictions and/or retrain itself based on it. There are two main types of data
access for ML pipelines: using an API to query data stored in another service, or using
uploaded data provided through a form (HTML or another framework).
You also need to make sure that the data remains safe and that data transfer is encrypted.

Data processing and storage: How well your ML model performs in production depends on
the way you store and process your data. If a user uploads a CSV, saving and reading just the
raw CSV can be time-consuming and computationally intensive, especially if the data files
are huge. To counteract this, the data can be stored in slices or in a different format, such as
a hash table or binary tree, so that the ML model can easily access and process the data
without digging through millions of rows of a CSV.

Storage of the ML infrastructure: Your machine learning model can simply be a Python
file if you retrain the model every time, or it can be a pickle file containing a stored Python
object that is easily loaded and applied to incoming data. Most simple deployments of ML
models use a pickled version of the trained model that is loaded and then used to predict
outcomes. There are other ways to store trained models, but they are less common. Make
sure to have enough storage space for your pickle files.
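A minimal sketch of the pickle pattern described above; the file name model.pkl and the random-forest model are arbitrary choices for illustration.

```python
# Persist a trained model to disk, then reload it at serving time.
import pickle
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
model = RandomForestClassifier(random_state=0).fit(X, y)

# Save the trained model object.
with open("model.pkl", "wb") as f:
    pickle.dump(model, f)

# Later, load it back and predict on incoming data.
with open("model.pkl", "rb") as f:
    loaded = pickle.load(f)
print(loaded.predict(X[:3]))
```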

Processing infrastructure: This is a crucial part of deploying ML models. The choice of
processing infrastructure determines how the system loads the ML model and applies it to
incoming data automatically. If you are using Python, the easiest route is to build a Flask app
with predefined functions for loading the ML model and applying it to an uploaded CSV of
data points. The infrastructure can also be an API that runs every hour, queries data from a
MongoDB server (or wherever you have stored your data), looks for new data points, and
uses them to make predictions. The latter is used in more complicated applications where
prediction and retraining take time and real-time output is difficult to produce. In such cases,
the developer has to build a queue system that lets users queue their jobs for prediction and
have the results emailed to them, which brings us to the most important point.

Another important aspect of the processing infrastructure is debugging, to make sure that the
user doesn't load data that would otherwise cause problems in the model. Remember – the
ML model is just a machine and does not know how to process data structures that it has
never seen before. For example, a model expecting an integer cannot process the string "1";
the data needs to be converted into numerical form (integers or floats) before it is sent to the
ML model.

Thus, testing the infrastructure against a variety of inputs and writing code to handle all
kinds of scenarios is very important. You may ask yourself questions like 'Would adding a
space in one data point make a difference?'. Imagining the possible scenarios helps you
build systems that need less rework later; a sketch of such an endpoint follows.
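A minimal Flask sketch along these lines, assuming a pickled model saved as model.pkl and a JSON payload of the form {"features": [...]}; the route name, payload format, and error handling are illustrative, not a prescribed API.

```python
# Load the pickled model once at startup, validate incoming data, return predictions.
import pickle
from flask import Flask, request, jsonify

app = Flask(__name__)
with open("model.pkl", "rb") as f:
    model = pickle.load(f)

@app.route("/predict", methods=["POST"])
def predict():
    payload = request.get_json(force=True)
    try:
        # Guard against strings, missing fields, or stray whitespace in the input.
        features = [float(v) for v in payload["features"]]
    except (KeyError, TypeError, ValueError):
        return jsonify({"error": "expected {'features': [numbers, ...]}"}), 400
    prediction = model.predict([features])
    return jsonify({"prediction": prediction.tolist()})

if __name__ == "__main__":
    app.run()
```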

Presentation and output: You need to build a proper way to display the results of the
model. This could be as simple as an HTML page with dynamic variables that let you populate
things like accuracy, the predicted result, the error, and so on. In more complicated pipelines,
the API renders the result into a PDF or an email that is sent to the specified email address. In
other cases, it may store the result so that the user can later query it with a specified key or
job ID that was generated at the time of submission.
Logging of results: This is an underrated component of deploying ML models. One must log
key statistics and results for each run of the model to make sure that everything runs
smoothly. In some cases, one may build a simple script to look in the logs for specific errors
or problems which can then be highlighted on a monitoring dashboard. Logging can also help
you keep track of bugs or issues that may not have been addressed in the infrastructure.
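A small sketch of per-run logging with Python's standard logging module; the file name and the logged fields are assumptions for illustration.

```python
# Record key statistics (or failures) for every model run in a log file.
import logging

logging.basicConfig(
    filename="model_runs.log",
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(message)s",
)

def log_run(model_name, n_rows, accuracy=None, error=None):
    if error:
        logging.error("model=%s rows=%d failed: %s", model_name, n_rows, error)
    else:
        logging.info("model=%s rows=%d accuracy=%.4f", model_name, n_rows, accuracy)

log_run("random_forest", 1200, accuracy=0.913)
```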

Monitoring and Maintenance: Maintaining the ML model and fixing it regularly is
recommended, but may not be necessary depending on the context of the problem. In
important and legally regulated environments like loan applications, ML models need to be
monitored carefully, and any biases or drifts must be fixed quickly. In other cases, where the
model has effectively seen all of the population data and does not need retraining, such as
some biological models trained on human DNA, extensive monitoring does not make much
sense beyond watching for errors or bugs that might cause problems for users.

Obtain user feedback: Test the model with a number of users before you distribute it to the
general public. Make sure to collect feedback from users about the pain points of the model
and address them accordingly. Employing a UX/UI researcher may be worth the time and
effort if your model is very complicated.

Cloud infrastructure and compute power: Lastly, make good estimates of the memory and
compute resources used by each job. Based on that, decide on the cloud infrastructure you
would like to use for deploying your application. A Flask application runs well on a free
Heroku server but cannot handle more than 200 users at a time. Thus, if you are planning for
more concurrent users or queries, invest in AWS servers that provide good memory and
performance; this also allows you to scale the model easily and serve more users.

In the end, your ML model's success depends on many things, including the infrastructure
you develop and the infrastructure on which you deploy and run your model. A lot can go
sideways in the beginning, so keep checking your logs and your system usage to provide a
seamless service to your users.
