Train and Test Datasets in Machine Learning
Machine Learning is one of the booming technologies across the world, enabling computers to turn huge amounts of data into predictions. However, these predictions depend heavily on the quality of the data: if we do not use the right data for our model, it will not produce the expected results. In machine learning projects, we generally divide the original dataset into training data and test data. We train our model on a subset of the original dataset, i.e., the training dataset, and then evaluate how well it generalizes to new or unseen data, i.e., the test set. Therefore, the train and test datasets are two key concepts in machine learning: the training dataset is used to fit the model, and the test dataset is used to evaluate the model.
In this topic, we are going to discuss train and test datasets along with the difference between them. So, let's start with an introduction to the training dataset and the test dataset in machine learning.
For example, to train a sentiment analysis model, the training data could look like the following:
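The examples below are a hypothetical illustration (not drawn from any real corpus): each training instance pairs a piece of text with a sentiment label.

```python
# Hypothetical labeled examples for a sentiment analysis model:
# each training instance is a (text, label) pair.
training_data = [
    ("This movie was fantastic, I loved it", "positive"),
    ("Absolutely terrible, a waste of time", "negative"),
    ("Great acting and a wonderful story", "positive"),
    ("I would not recommend this to anyone", "negative"),
]

for text, label in training_data:
    print(f"{label}: {text}")
```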
The type of training data that we provide to the model largely determines its accuracy and predictive ability: the better the quality of the training data, the better the performance of the model. Training data typically makes up 60% or more of the total data in an ML project.
At this stage, we can also compare the testing accuracy with the training accuracy, i.e., how accurate our model is on the test dataset versus the training dataset. If the model's accuracy on the training data is noticeably higher than on the test data, the model is said to be overfitting.
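As a sketch of this comparison (assuming scikit-learn and a synthetic dataset, both chosen here purely for illustration), the two accuracies can be computed side by side; an unconstrained decision tree is used because it tends to fit its training data almost perfectly:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic classification data, for illustration only
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# An unconstrained decision tree memorizes the training data
model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

train_acc = model.score(X_train, y_train)
test_acc = model.score(X_test, y_test)
print(f"train accuracy: {train_acc:.2f}, test accuracy: {test_acc:.2f}")
# A large gap between the two accuracies suggests overfitting
```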
To see why the split matters, suppose we train our model on one dataset and then test it on a completely unrelated dataset: the correlations between features that the model learned during training will not hold in the test data, and its measured performance will suffer. Therefore, rather than using two different datasets, it is important to split a single dataset into two parts, i.e., a train set and a test set.
In this way, we can easily evaluate the performance of our model: if it performs well on the training data but poorly on the test dataset, the model is likely overfitted.
For splitting the dataset, we can use the train_test_split function from scikit-learn.
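A typical split might look like this; the toy dataset here is purely illustrative:

```python
from sklearn.model_selection import train_test_split
import numpy as np

# Illustrative toy dataset: 10 samples with 2 features each
X = np.arange(20).reshape(10, 2)
y = np.array([0, 1] * 5)

# Hold out 20% of the data for testing;
# random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(X_train.shape)  # (8, 2)
print(X_test.shape)   # (2, 2)
```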
Explanation: In the first line of the above code, we import the train_test_split function from the sklearn.model_selection module.
A model is said to be overfitted when it performs quite well on the training dataset but does not generalize well to new or unseen data. Overfitting occurs when the model tries to cover every data point and hence starts memorizing the noise present in the data; because of this, it cannot generalize to new data, and its accuracy and efficiency degrade. Generally, a complex model has a higher chance of overfitting. There are various ways to avoid overfitting, such as using cross-validation, stopping the training early, or applying regularization.
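As an example of the first technique, cross-validation can be sketched with scikit-learn's cross_val_score; the model and synthetic dataset below are illustrative choices, not prescribed by the text:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression

# Synthetic classification data, for illustration only
X, y = make_classification(n_samples=300, n_features=8, random_state=1)

# 5-fold cross-validation: the model is trained and evaluated on
# five different train/validation splits, so a single lucky split
# cannot hide overfitting.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print(scores.mean())
```

Because the score is averaged over several held-out folds, it gives a more robust estimate of generalization than a single train/test split.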
Once the model has been trained sufficiently on the relevant training data, it is tested on the test data. We can understand the whole process of training and testing in three steps, which are as follows:
1. Feed the training data to the model.
2. The model learns patterns and relationships from this data.
3. Evaluate the model's predictions on the unseen test data.