Train and Test Datasets in Machine Learning
• Machine Learning is one of the booming technologies across the world that enables machines to learn from data and make predictions.
• However, these predictions highly depend on the quality of the data, and if we are not using
the right data for our model, then it will not generate the expected result.
• In machine learning projects, we generally divide the original dataset into training data and
test data.
• We train our model over a subset of the original dataset, i.e., the training dataset, and then
evaluate whether it can generalize well to the new or unseen dataset or test set.
• Therefore, train and test datasets are the two key concepts of machine learning, where the training dataset is used to fit the model, and the test dataset is used to evaluate it.
• In this topic, we are going to discuss train and test datasets along with the difference between both of them. So, let's start with the introduction of the training dataset.
What is Training Dataset?
• Models are required to find the patterns from the given training datasets in order to make
predictions.
• In supervised learning, the training data contains labels so that the model can be trained to predict the correct output.
• The type of training data that we provide to the model is highly responsible for the model's accuracy and prediction ability.
• It means that the better the quality of the training data, the better will be the performance of
the model.
• Training data usually makes up 60% or more of the total data for an ML project.
What is Test Dataset?
• Once we train the model with the training dataset, it's time to test the model with the test
dataset.
• This dataset evaluates the performance of the model and ensures that the model can generalize well to new or unseen data.
• The test dataset is another subset of original data, which is independent of the training
dataset.
• However, it has similar types of features and class probability distribution, and it is used as a benchmark to evaluate the model once training is complete.
• Test data is a well-organized dataset that contains data for each type of scenario for a given
problem that the model would be facing when used in the real world.
• Usually, the test dataset is approximately 20-25% of the total original data for an ML project.
• At this stage, we can also check and compare the testing accuracy with the training accuracy, that is, how accurate our model is on the test dataset compared with the training dataset (a short code sketch of this comparison is given after the splitting example below).
• If the accuracy of the model on training data is greater than that on testing data, the model is likely overfitting.
• Hence it is important to split a dataset into two parts, i.e., train and test set.
• For example, if the model performs well with the training data but does not perform well with the test dataset, then it is estimated that the model may be overfitted.
• For splitting the dataset, we can use the train_test_split function of scikit-
learn.
• The below line of code can be used to split the dataset:
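A minimal sketch of that line, assuming x holds the feature arrays and y the corresponding labels (the placeholder data below is only for illustration):

    from sklearn.model_selection import train_test_split

    # Placeholder data: x is the feature array, y holds the corresponding labels
    x = [[0], [1], [2], [3], [4], [5], [6], [7], [8], [9]]
    y = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]

    # Split into 80% training data and 20% test data; random_state fixes the shuffle seed
    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)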
• In the train_test_split() function, we have passed four arguments. The first two are the arrays of data (the features and the labels), and test_size specifies the size of the test set. The test_size may be 0.5, 0.3, or 0.2, which determines the ratio in which the data is divided into training and testing sets.
• The last parameter, random_state, sets the seed of the random number generator so that you always get the same split, and the most commonly used value for it is 42.
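As a sketch of the accuracy comparison mentioned earlier, we can fit an estimator on the training split and score it on both splits; the Iris data and LogisticRegression used here are illustrative choices, not part of the original text:

    from sklearn.datasets import load_iris
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    # Illustrative dataset, split 80:20
    x, y = load_iris(return_X_y=True)
    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)

    # Fit the model on the training subset only
    model = LogisticRegression(max_iter=1000)
    model.fit(x_train, y_train)

    # Compare training accuracy with testing accuracy; a large gap hints at overfitting
    print("Training accuracy:", model.score(x_train, y_train))
    print("Testing accuracy:", model.score(x_test, y_test))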
Overfitting and Underfitting issues
• Overfitting and underfitting are the most common problems that occur in the Machine
Learning model.
• A model can be said to be overfitted when it performs quite well with the training dataset but does not generalize well to new or unseen data.
• The issue of overfitting occurs when the model tries to cover all the data points and hence starts learning the noise present in the training data as well.
• Due to this, it can't generalize well to the new dataset, and because of this, the accuracy of the model decreases.
• There are various ways by which we can avoid overfitting in the model, such as using cross-validation, regularization, or early stopping of training (a brief cross-validation sketch is given after this list).
• Underfitting is the opposite problem: it means the model shows poor performance even with the training dataset.
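A minimal sketch of one of those remedies, cross-validation, assuming scikit-learn's cross_val_score and an illustrative depth-limited decision tree (limiting the depth acts as a simple form of regularization):

    from sklearn.datasets import load_iris
    from sklearn.model_selection import cross_val_score
    from sklearn.tree import DecisionTreeClassifier

    x, y = load_iris(return_X_y=True)

    # A depth-limited tree is less able to memorize noise in the training data
    model = DecisionTreeClassifier(max_depth=3, random_state=42)

    # 5-fold cross-validation: each fold serves once as a held-out validation set,
    # giving a more reliable estimate of how well the model generalizes
    scores = cross_val_score(model, x, y, cv=5)
    print("Accuracy per fold:", scores)
    print("Mean accuracy:", scores.mean())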
• The main difference between training data and testing data is that training data is the subset of original data that is used to train the machine learning model, whereas testing data is the subset used to evaluate the trained model.
• The training dataset is generally larger in size compared to the testing dataset.
• The general ratios of splitting train and test datasets are 80:20, 70:30, or 90:10.
• Training data is well known to the model as it is used to train the model, whereas testing data remains unseen by the model until evaluation.
• An ML algorithm builds its experience from the observations contained in the training data that is fed to it. Further, one of the great things about ML algorithms is that they can learn and improve over time as they are trained with relevant training data.
• Once the model is trained enough with the relevant training data, it is tested with the test data.
We can understand the whole process of training and testing in three steps, which are as follows (a short code sketch of these steps is given after the list):
• Feed: Firstly, we need to train the model by feeding it with training input data.
• Define: Now, the training data is tagged with the corresponding outputs (in Supervised Learning), and the model transforms the training data into feature representations such as text vectors or numerical data features.
• Test: In the last step, we test the model by feeding it with the test data/unseen
dataset. This step ensures that the model is trained efficiently and can generalize
well.
[Flowchart: the feed, define, and test steps of the training and testing process.]
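A minimal end-to-end sketch of the feed, define, and test steps, assuming a made-up toy text-classification task; the sentences, labels, and the CountVectorizer/MultinomialNB pipeline are illustrative assumptions rather than anything prescribed by the text:

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.model_selection import train_test_split
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline

    # Feed: toy training input data tagged with corresponding outputs (1 = positive, 0 = negative)
    texts = ["great product", "loved it", "terrible quality", "awful experience",
             "really good", "very bad", "excellent value", "not worth it"]
    labels = [1, 1, 0, 0, 1, 0, 1, 0]
    x_train, x_test, y_train, y_test = train_test_split(texts, labels, test_size=0.25, random_state=42)

    # Define: the pipeline transforms each text into a vector of word counts (data features)
    # and fits a classifier on those vectors
    model = make_pipeline(CountVectorizer(), MultinomialNB())
    model.fit(x_train, y_train)

    # Test: evaluate on the unseen test data to check that the model generalizes
    print("Test accuracy:", model.score(x_test, y_test))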
Traits of Quality training data
1. Relevant:
• The very first quality of training data is that it should be relevant to the problem that you are going to solve.
• It means that whatever data you are using should be relevant to the current
problem.
• For example, if you are building a model to analyze social media data, then data
should be taken from different social sites such as Twitter, Facebook, Instagram, etc.
2. Uniform:
There should always be uniformity among the features of a dataset. It means all
data for a particular problem should be taken from the same source with the same
attributes.
3. Consistency: Similar attributes in the dataset must always correspond to the same features so that the model can be trained in a better way. With a comprehensive dataset, the model will also be able to learn all the edge cases.