Model Validation & Data Partition
Most of us have read enough articles about Machine Learning that the first notion that comes
to mind is ‘Machine Learning is about making predictions.’
That is partly true, but those predictions come only after a series of steps such as
Data Preparation, Choosing a Model, Training the Model, Parameter Tuning, and Model
Validation. Only after these steps have been carried out is a Machine Learning
Model (Regression or Classification) ready to make predictions.
As the name ‘Model Validation’ suggests, the model is being validated against something,
but what is that validation all about? Let’s try to answer it.
Model validation is the process carried out after Model Training in which the trained
model is evaluated on a testing data set. The testing data may or may not be a portion of the
same data set from which the training set is drawn.
There are two types of Model Validation techniques:
In-sample validation – testing on data from the same dataset that is used to build the model.
Out-of-sample validation – testing on data from a new dataset that isn’t used to build the model.
Conclusion alert! Model validation refers to the process of confirming that the model achieves
its intended purpose, i.e., checking how effective our model is.
The ultimate goal for any machine learning model is to learn from examples in such a manner
that the model is capable of generalizing the learning to new instances which it has not yet
seen. So, when we approach a problem with a dataset in hand, it is very important that we find
the right machine learning algorithm to create our model. Every model has its own strengths
and weaknesses. For instance, some algorithms have a higher tolerance for small datasets,
while others work better with large amounts of data. For this reason, two different models
trained on similar data can produce different predictions with different degrees of accuracy,
which is why model validation is required.
Note: In machine learning, we use the term parameters to refer to values that are
learned by the algorithm during training, and hyperparameters to refer to values that are
set before training and passed to the algorithm.
Once the model has been evaluated, its accuracy score is calculated. If this score is
low, we change the values of the hyperparameters used in the model and retest until we get an
acceptable accuracy score.
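The adjust-and-retest loop can be sketched as follows. This is only an illustration under
assumptions that are not in the text above: it uses scikit-learn, the iris dataset, a
k-nearest-neighbours classifier, and a hypothetical helper named tune_n_neighbors. The scores
are computed on a separate validation split, so that the data used for the final test is not
consumed by tuning.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Assumed example data; any labelled dataset would do.
X, y = load_iris(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)


def tune_n_neighbors(candidates):
    """Return (best_k, best_score), judged by accuracy on the validation split."""
    best_k, best_score = None, -1.0
    for k in candidates:
        # n_neighbors is a hyperparameter: it is passed in, not learned.
        model = KNeighborsClassifier(n_neighbors=k)
        model.fit(X_train, y_train)
        score = model.score(X_val, y_val)  # accuracy on held-out validation data
        if score > best_score:
            best_k, best_score = k, score
    return best_k, best_score


best_k, best_score = tune_n_neighbors([1, 3, 5, 7, 9])
print(f"best n_neighbors={best_k}, validation accuracy={best_score:.3f}")
```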
There are various ways of validating a model, of which the two best known are
Cross Validation and Bootstrapping, but no single validation method works in all
scenarios. Therefore, it is important to understand the type of data we are working with.
You can read further articles to learn these techniques in more depth.
Having had a glimpse of Model Validation, we can see how important a
component it is of the entire model development process. Validating the outputs of a machine
learning model is important to ensure its accuracy. A machine learning model is trained on
a huge amount of data, and validating the model gives machine learning engineers an
opportunity to improve the quality and quantity of that data. Without checking and
validating the model, it is not right to rely on its predictions. In sensitive areas like
healthcare and self-driving vehicles, a mistake in object detection can lead to fatalities
because of wrong decisions taken by the machine in real-life predictions. Validating the ML
model at the training and development stage helps it make the right predictions.
Some added advantages of Model Validation are as follows.
It is now clear that building a machine learning model is not enough to justify relying on
its predictions. We also need to check its accuracy and validate it to ensure the
precision of the results it gives and make it usable in real-life applications.
Data partitioning and validation are crucial steps in machine learning (ML) to ensure that the
developed models generalize well to new, unseen data. The typical approach involves splitting
the available data into training, validation, and test sets. Here's an overview of these concepts:
Data Partitioning:
Training Set: The largest portion of the dataset is used for training the model. The model learns
patterns and relationships from this set.
Validation Set: A smaller portion is set aside for model tuning and hyperparameter
optimization. The performance of the model on the validation set helps in making decisions
about adjustments to the model.
Test Set: Another portion is kept separate for the final evaluation of the model's performance.
This set is not used during the training or tuning phases and serves as an unbiased assessment
of the model's generalization.
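A three-way partition of this kind can be sketched with two successive calls to
scikit-learn's train_test_split. The 60/20/20 proportions and the toy arrays are assumptions
chosen for illustration, not values prescribed by the text.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (assumed): 100 samples, 2 features, balanced binary labels.
X = np.arange(200).reshape(100, 2)
y = np.array([0, 1] * 50)

# First set aside 20% of the data as the test set.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Then take 25% of the remaining 80% as the validation set (20% of the total).
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=42, stratify=y_rest
)

print(len(X_train), len(X_val), len(X_test))  # 60 20 20
```

The test set is split off first so that nothing done later during tuning can touch it.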
Validation Strategies:
Holdout Method: The dataset is split into training and validation sets. This method is simple
but may lead to variability depending on the random split.
K-Fold Cross-Validation: The dataset is divided into K folds, and the model is trained K times,
each time using K-1 folds for training and the remaining fold for validation. This helps in
obtaining a more robust performance estimate.
Stratified Sampling: Ensures that the distribution of class labels in each partition is
representative of the overall dataset. It is particularly useful for imbalanced datasets.
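K-fold and stratified K-fold validation can be sketched with scikit-learn as below. The
choice of the iris dataset, logistic regression, and K = 5 is an assumption made for the
example.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, StratifiedKFold, cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# Plain K-fold: 5 folds, each used once for validation.
kfold = KFold(n_splits=5, shuffle=True, random_state=0)
kfold_scores = cross_val_score(model, X, y, cv=kfold)

# Stratified K-fold: every fold keeps the class proportions of the full dataset.
skfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
skfold_scores = cross_val_score(model, X, y, cv=skfold)

# Class counts inside each stratified validation fold.
fold_class_counts = [
    np.bincount(y[val_idx]).tolist() for _, val_idx in skfold.split(X, y)
]

print("K-fold mean accuracy:", kfold_scores.mean())
print("Stratified K-fold mean accuracy:", skfold_scores.mean())
print("Class counts per stratified fold:", fold_class_counts)
```

Averaging the K scores gives a steadier estimate than a single hold-out split, at the cost
of training the model K times.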
Validation Metrics:
Common metrics include accuracy, precision, recall, F1 score for classification problems, and
Mean Squared Error (MSE), R-squared for regression problems.
Choose metrics based on the nature of the problem and the goals of the model.
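The metrics named above are all available in scikit-learn's metrics module. The labels and
predictions below are small made-up lists, used only to show the calls.

```python
from sklearn.metrics import (
    accuracy_score,
    f1_score,
    mean_squared_error,
    precision_score,
    r2_score,
    recall_score,
)

# Classification: made-up true labels and predictions.
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

accuracy = accuracy_score(y_true, y_pred)    # 6 of 8 correct -> 0.75
precision = precision_score(y_true, y_pred)  # 3 TP / (3 TP + 1 FP) -> 0.75
recall = recall_score(y_true, y_pred)        # 3 TP / (3 TP + 1 FN) -> 0.75
f1 = f1_score(y_true, y_pred)                # harmonic mean of the two -> 0.75

# Regression: made-up targets and predictions.
y_true_reg = [3.0, 5.0, 7.0, 9.0]
y_pred_reg = [2.0, 5.0, 8.0, 9.0]

mse = mean_squared_error(y_true_reg, y_pred_reg)  # (1 + 0 + 1 + 0) / 4 -> 0.5
r2 = r2_score(y_true_reg, y_pred_reg)             # 1 - 2 / 20 -> 0.9

print(accuracy, precision, recall, f1, mse, r2)
```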
Overfitting and Underfitting:
Overfitting: Occurs when the model performs well on the training set but poorly on new,
unseen data. Regularization techniques and tuning hyperparameters can help mitigate
overfitting.
Underfitting: Occurs when the model is too simple and cannot capture the underlying patterns
in the data. Increasing model complexity or using more features may help.
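One way to see both problems is to compare training and test accuracy for models of
different complexity. The sketch below assumes scikit-learn, a synthetic noisy dataset, and a
decision tree whose max_depth controls complexity; train_and_test_accuracy is a hypothetical
helper.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with 20% label noise (assumed for illustration).
X, y = make_classification(
    n_samples=500, n_features=20, n_informative=5, flip_y=0.2, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)


def train_and_test_accuracy(max_depth):
    """Fit a tree of the given depth and return (train accuracy, test accuracy)."""
    tree = DecisionTreeClassifier(max_depth=max_depth, random_state=0)
    tree.fit(X_train, y_train)
    return tree.score(X_train, y_train), tree.score(X_test, y_test)


# depth 1 is a very simple model; None lets the tree grow until it fits every point.
for depth in (1, 4, None):
    train_acc, test_acc = train_and_test_accuracy(depth)
    print(f"max_depth={depth}: train={train_acc:.3f}, test={test_acc:.3f}")
```

A large gap between training and test accuracy points to overfitting, while low accuracy on
both sets points to underfitting.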
Test Set Evaluation:
Once the model is trained and tuned using the training and validation sets, it is evaluated on
the test set to assess its generalization performance.
The test set performance provides an unbiased estimate of how well the model is expected to
perform on new, unseen data.
By following these practices, you can better ensure that your machine learning models
generalize well and perform reliably on real-world data. It's important to note that proper data
partitioning and validation are integral to producing models that are robust and effective in
various scenarios.
1. Split the dataset into two parts: a training set and a hold-out test set.
2. Train the model on the training dataset using a fixed set of hyperparameters.
3. Use the hold-out test dataset to evaluate the model.
4. Use the entire dataset to train the final model so that it can generalize better on future
datasets.
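The hold-out procedure in this list can be sketched as follows. The dataset, the decision
tree, and the hyperparameter values are assumptions made for the example.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Split the data into a training set and a hold-out test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Train on the training set with a fixed set of hyperparameters.
fixed_params = {"max_depth": 3, "random_state": 0}
model = DecisionTreeClassifier(**fixed_params).fit(X_train, y_train)

# Evaluate on the hold-out test set.
holdout_accuracy = model.score(X_test, y_test)
print(f"hold-out accuracy: {holdout_accuracy:.3f}")

# Retrain on the entire dataset with the same hyperparameters for the final model.
final_model = DecisionTreeClassifier(**fixed_params).fit(X, y)
```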
In this process, the dataset is split into training and test sets, and a fixed set of hyperparameters
is used to evaluate the model. There is another process in which data can also be split into three
sets, and these sets can be used to select a model or to tune hyperparameters. We will discuss
that technique next.
This can be understood with the example of housing prices, where the price of some houses
can be much higher than that of others. To handle such situations, the stratified k-fold
cross-validation technique is useful.
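For a continuous target such as price, one common workaround is to bin the target into bands
and stratify on the bands. The sketch below assumes scikit-learn's StratifiedKFold and
made-up prices in which 10 of 100 houses are far more expensive than the rest; the threshold
of 1000 is likewise an assumption.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

rng = np.random.default_rng(0)

# Made-up prices: 90 ordinary houses and 10 very expensive ones.
prices = np.concatenate(
    [rng.uniform(100, 300, size=90), rng.uniform(1000, 2000, size=10)]
)
X = rng.normal(size=(100, 3))  # placeholder features

# StratifiedKFold needs discrete labels, so turn the price into a band (0 or 1).
price_band = (prices >= 1000).astype(int)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
expensive_per_fold = [
    int(price_band[val_idx].sum()) for _, val_idx in skf.split(X, price_band)
]
print(expensive_per_fold)
```

Each validation fold receives the same share of expensive houses, so no fold is evaluated
only on cheap ones.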
4.2.4 Leave one out cross-validation
This method is similar to leave-p-out cross-validation, but instead of p data points, only 1
data point is left out of training. In this approach, for each learning set, only one data
point is reserved for testing, and the rest of the dataset is used to train the model. This
process repeats for each data point. Hence, for n samples, we get n different training sets
and n test sets. It has the following features:
o In this approach, the bias is minimal as almost all the data points are used for training.
o The process is executed n times; hence the execution time is high.
o This approach leads to high variation in testing the effectiveness of the model, as we
iteratively check against a single data point.
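Leave-one-out cross-validation can be sketched with scikit-learn's LeaveOneOut splitter. The
dataset and the k-nearest-neighbours classifier are assumptions chosen to keep the n model
fits fast.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

loo = LeaveOneOut()
n_splits = loo.get_n_splits(X)  # one split per sample

# Each score is 1.0 or 0.0: the single held-out point is either right or wrong.
scores = cross_val_score(KNeighborsClassifier(n_neighbors=5), X, y, cv=loo)

print(f"{n_splits} fits, mean accuracy = {scores.mean():.3f}")
```

Because every individual score is either 0 or 1, only the mean over all n runs is a
meaningful estimate.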
4.3 Bias-Variance Trade off
It is important to understand prediction errors (bias and variance) when it comes to
the accuracy of any machine learning algorithm. There is a tradeoff between a model’s ability
to minimize bias and its ability to minimize variance, and understanding it helps in
selecting a good value for the regularization constant. A proper understanding of these
errors helps to avoid overfitting and underfitting a data set while training the algorithm.
Bias
Bias is the difference between the values predicted by the ML model
and the correct values. High bias gives a large error on training as well as testing
data. It is recommended that an algorithm be low-biased to avoid the problem
of underfitting. With high bias, the predictions follow a straight line and so do not fit
the data in the data set accurately. Such fitting is known as underfitting. This
happens when the hypothesis is too simple or linear in nature. Refer to the graph given
below for an example of such a situation.
Bias Variance Tradeoff
If the algorithm is too simple (a hypothesis with a linear equation), it may be in a
high-bias, low-variance condition and is thus error-prone. If the algorithm fits too complex
a hypothesis (one with a high-degree equation), it may be in a high-variance, low-bias
condition, in which the model will not perform well on new entries. Between these two
conditions lies a balance known as the Trade-off, or Bias-Variance Trade-off.
This tradeoff in complexity is why there is a tradeoff between bias and variance: an
algorithm can’t be more complex and less complex at the same time. On the graph, the
perfect tradeoff looks like this.
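The tradeoff can also be explored numerically. The sketch below is an illustration under
assumptions not taken from the text: noisy samples of a sine curve, polynomial hypotheses of
degree 1, 3, and 12 fitted with NumPy, and a hypothetical helper mse_for_degree.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)


def make_data(n):
    """Draw n noisy samples of sin(2*pi*x) on [0, 1] (assumed toy problem)."""
    x = rng.uniform(0, 1, size=n)
    y = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=n)
    return x, y


x_train, y_train = make_data(30)
x_test, y_test = make_data(200)


def mse_for_degree(degree):
    """Fit a polynomial of the given degree; return (train MSE, test MSE)."""
    poly = Polynomial.fit(x_train, y_train, deg=degree)
    train_mse = float(np.mean((poly(x_train) - y_train) ** 2))
    test_mse = float(np.mean((poly(x_test) - y_test) ** 2))
    return train_mse, test_mse


# Degree 1 is the "too simple" hypothesis; degree 12 is the "too complex" one.
for degree in (1, 3, 12):
    train_mse, test_mse = mse_for_degree(degree)
    print(f"degree={degree:2d}  train MSE={train_mse:.3f}  test MSE={test_mse:.3f}")
```

Training error can only go down as the degree grows, whereas test error typically falls at
first and then rises again once the model starts fitting noise. The degree with the lowest
test error is the practical tradeoff point.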
Model Deployment:
Deploying a machine learning model is one of the most important parts of an ML pipeline
and largely determines the applicability and accessibility of the model. Building a machine
learning model is one of the most challenging tasks in building an ML pipeline for processing
and predicting data, but deploying it successfully is critical for converting your time and
effort into real output. Several important aspects need to be considered when deploying
ML models.
Data access and query: You need to make sure that your model has easy access to
the data and is able to make predictions and/or retrain itself based on the given
data. There are two main types of data querying for ML pipelines: using an API to query
data that is stored in another service, or using uploaded data that has been provided
through a form, either in HTML or in another framework.
You need to make sure that the data remains safe and that the transfer of data is encrypted.
Data processing and storage: The performance of your ML model will depend on the way that
you store and process your data. If a user uploads a CSV, storing and reading just the raw
CSV can be time-consuming and computationally intensive, especially if the data files are
huge. To counteract this issue, the data can be stored in slices or in a different format,
such as a hash table or binary tree, so that the ML model can easily access and
process your data without having to dig through millions of rows of a CSV.
Storage of the ML infrastructure: Your machine learning model can simply be a Python
file if you retrain your model every time, or it could be a pickle file that holds a stored
Python object that can be easily loaded and used on incoming data. Most simple deployments
of ML models use a pickled version of the trained model that is loaded and then used to
predict outcomes. There are other ways to store trained ML models, but they are not as
common. Make sure to have enough storage space for your pickle files.
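Saving and reloading a trained model with pickle can be sketched as below. The model, the
dataset, and the file name model.pkl are assumptions for the example; the file is written to
a temporary directory.

```python
import pickle
import tempfile
from pathlib import Path

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000).fit(X, y)

# Hypothetical storage location; a real deployment would use a persistent path.
model_path = Path(tempfile.mkdtemp()) / "model.pkl"

# Serialize the trained model object to disk.
with model_path.open("wb") as f:
    pickle.dump(model, f)

# Later (for example, when the service starts), load it back and predict.
with model_path.open("rb") as f:
    loaded_model = pickle.load(f)

same_predictions = bool((loaded_model.predict(X) == model.predict(X)).all())
print("file size in bytes:", model_path.stat().st_size)
print("predictions identical after reload:", same_predictions)
```

Unpickling can execute arbitrary code, so pickle files should only be loaded from sources
you trust.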
Another important aspect of the processing infrastructure is input checking, to make sure
that the user doesn’t load data that would cause problems in the model. Remember: the
ML model is just a machine and does not know how to process data structures that it has
never seen before. For example, a model expecting an integer cannot process the string "1".
The data needs to be converted into numerical data (integers or floats) before it is sent
into the ML model.
Thus, testing the infrastructure against a variety of inputs and writing code to address all
kinds of scenarios is very important. You may ask yourself questions like ‘Would adding a
space in one data point make a difference?’. Imagining all the possible scenarios will help
you build systems that need less work later.
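A minimal sketch of such input checking is shown below. The function name
to_feature_vector and its rules (fixed length, everything convertible to a finite float) are
assumptions for illustration; a real service would add checks specific to its own features.

```python
import math


def to_feature_vector(raw_values, expected_length):
    """Convert raw user input into a list of floats, or raise ValueError."""
    if len(raw_values) != expected_length:
        raise ValueError(
            f"expected {expected_length} values, got {len(raw_values)}"
        )
    features = []
    for position, value in enumerate(raw_values):
        try:
            # float() accepts 1, 1.0, "1" and also " 1 " (surrounding spaces).
            number = float(value)
        except (TypeError, ValueError):
            raise ValueError(
                f"value at position {position} is not numeric: {value!r}"
            ) from None
        if not math.isfinite(number):
            raise ValueError(f"value at position {position} is not finite")
        features.append(number)
    return features


print(to_feature_vector(["1", " 2.5 ", 3], expected_length=3))  # [1.0, 2.5, 3.0]
```

Rejecting bad input with a clear error before it reaches the model is usually easier to
debug than a failure inside the prediction call.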
Presentation and output: You need to build a proper way to display the results of the
model. This could just be an HTML file with dynamic variables that allow you to populate
things like accuracy, the predicted result, the error, etc. In more complicated pipelines, the
API renders the result as a PDF or an email that is sent to the specified email address. In
other cases, it may store the result, which the user can then query with a key
or job ID that was generated at the time of submission.
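The job-ID pattern can be sketched with an in-memory store. The class name ResultStore and
its methods are hypothetical; a real deployment would keep results in a database or cache
rather than in a Python dictionary.

```python
import uuid


class ResultStore:
    """Keep prediction results in memory, keyed by a generated job ID."""

    def __init__(self):
        self._results = {}

    def submit(self, result):
        """Store a result and return the job ID the user can query later."""
        job_id = uuid.uuid4().hex
        self._results[job_id] = result
        return job_id

    def fetch(self, job_id):
        """Return the stored result, or None if the job ID is unknown."""
        return self._results.get(job_id)


store = ResultStore()
job_id = store.submit({"prediction": "setosa", "accuracy": 0.95})  # made-up values
print(job_id, store.fetch(job_id))
```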
Logging of results: This is an underrated component of deploying ML models. One must log
key statistics and results for each run of the model to make sure that everything runs
smoothly. In some cases, one may build a simple script to look in the logs for specific errors
or problems which can then be highlighted on a monitoring dashboard. Logging can also help
you keep track of bugs or issues that may not have been addressed in the infrastructure.
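Per-run logging can be sketched with Python's standard logging module. The logger name
ml_service, the wrapper predict_with_logging, and the logged fields are assumptions chosen
for the example.

```python
import logging
import time

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(name)s %(message)s",
)
logger = logging.getLogger("ml_service")  # hypothetical logger name


def predict_with_logging(model_fn, features):
    """Call model_fn(features), logging the result, latency, and any failure."""
    start = time.perf_counter()
    try:
        prediction = model_fn(features)
    except Exception:
        # logger.exception records the message together with the traceback.
        logger.exception("prediction failed n_features=%d", len(features))
        raise
    latency_ms = (time.perf_counter() - start) * 1000
    logger.info(
        "prediction=%s latency_ms=%.2f n_features=%d",
        prediction,
        latency_ms,
        len(features),
    )
    return prediction


# Stand-in for a real model: the "prediction" is just the sum of the features.
print(predict_with_logging(sum, [1.0, 2.0, 3.0]))
```

Using a consistent key=value layout in the messages makes it straightforward for a later
script to search the logs for specific errors or slow runs.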
Obtain user feedback: Test the model with a number of users before you distribute it to the
general public. Make sure to collect feedback from the users about the pain points of the
model and address them accordingly. Employing a UX/UI researcher may be worth the time
and effort if your model is very complicated.
Cloud infrastructure and compute power: Lastly, make good estimates of the amount of
memory and compute resources that are used by each job. Based on that, make a decision
about the cloud infrastructure that you’d like to use for deploying your application. A Flask
application runs well on a free Heroku server but, as a rough guide, cannot handle more than
about 200 users at a time. Thus, if you’re planning to have more users or queries at a time,
invest in AWS servers that provide enough memory and performance. This also allows you to
scale the model easily and let more users use it.
In the end, your ML model’s success depends on a lot of things, including the
infrastructure you develop and the infrastructure you deploy to run your model. A lot of
things can go sideways in the beginning, so make sure to keep checking your logs and your
system usage in order to provide a seamless service to your users.