DL Practical 1 Train - Test - Split
import numpy as np
import pandas as pd
Pandas and NumPy are essential libraries for scientific computing in Python, including
machine learning, thanks to their intuitive syntax and high-performance array and matrix
operations.
NumPy arrays support advanced mathematical and other operations on large amounts of data.
The code in this practical imports the NumPy library as 'np' and the 'train_test_split'
function from the 'model_selection' module of the scikit-learn (sklearn) library.
The 'train_test_split' function is a utility function provided by Scikit-Learn to split a dataset
into two separate sets: a training set and a testing set. This is a common technique in machine
learning, where we want to train our model on a portion of the data and evaluate its performance
on the remaining portion.
The train-test split is used to estimate how well a machine learning algorithm performs on
data it was not trained on, which is what matters in prediction-based applications. The
procedure is fast and easy to perform, and its results let us compare different machine
learning models on the same task. By default, when neither test_size nor train_size is given,
scikit-learn assigns 25% of the data to the test set and 75% to the training set.
Data Splitting:
Scikit-learn (imported as sklearn) is one of the most widely used and robust libraries for
machine learning in Python. It provides the model_selection module, which contains the
splitter function train_test_split().
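Syntax:
train_test_split(*arrays, test_size=None, train_size=None, random_state=None, shuffle=True, stratify=None)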
Parameters:
*arrays: the inputs to split, such as lists, NumPy arrays, pandas DataFrames, or sparse
matrices, all with the same number of samples.
test_size: a float between 0.0 and 1.0 giving the proportion of the data to put in the test set,
or an int giving the absolute number of test samples. Its default value is None; if train_size is
also None, 0.25 is used.
train_size: a float between 0.0 and 1.0 giving the proportion of the data to put in the training
set, or an int giving the absolute number of training samples. Its default value is None, in
which case it is set to the complement of the test size.
random_state: controls the shuffling applied to the data before the split. Passing an int acts
as a seed and makes the split reproducible. Its default value is None.
shuffle: whether to shuffle the data before splitting. Its default value is True.
stratify: if not None, the data is split in a stratified fashion, using the given array as the class
labels. Its default value is None.
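The stratify parameter is easiest to understand with imbalanced labels. The following is a minimal sketch, assuming a hypothetical dataset of 100 samples in which 80 belong to class 0 and 20 to class 1; the data and the seed value are illustrative and are not part of the practical.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced data: 80 samples of class 0 and 20 of class 1
X = np.arange(100).reshape(-1, 1)
Y = np.array([0] * 80 + [1] * 20)

# stratify=Y asks for the same class proportions in both sets;
# random_state=0 is an arbitrary seed chosen for this sketch
x_train, x_test, y_train, y_test = train_test_split(
    X, Y, test_size=0.2, stratify=Y, random_state=0
)

train_counts = np.bincount(y_train)  # samples per class in the training set
test_counts = np.bincount(y_test)    # samples per class in the test set
print(train_counts)
print(test_counts)
```

Both sets should keep the 80:20 class ratio of the full dataset, which a purely random split does not guarantee.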
Code:
import numpy as np
from sklearn.model_selection import train_test_split
# Sample data: X holds the values 1 to 10 and Y is 5 * X
X = np.arange(1, 11)
Y = 5 * X
# This will split the data into 80% train and 20% test
x_train, x_test, y_train, y_test = train_test_split(X, Y, train_size=0.8)
print(x_train)
print(y_train)
Output:
[ 5 4 2 10 3 7 6 1]
[25 20 10 50 15 35 30 5]
Output (train_size=0.6 with shuffle=False, so the first six samples are kept in order):
[1 2 3 4 5 6]
[ 5 10 15 20 25 30]
In this example, we generate a random dataset of 100 samples with 5 input features and 1 output
label. We then use the `train_test_split` function to split the data into training and testing sets,
with 60% of the data used for training and the remaining 40% used for testing. We set
`shuffle=False` to avoid shuffling the data before splitting it. Finally, we print the shapes of the
resulting arrays to confirm that the data was split correctly.
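The example described above can be sketched as follows. This is only an illustration: the random generator, its seed, and the choice of binary labels are assumptions, since the practical does not state how the random data was produced.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Assumed data generation: 100 samples, 5 input features, 1 output label
rng = np.random.default_rng(0)       # seed chosen only for this sketch
X = rng.random((100, 5))             # input features
Y = rng.integers(0, 2, size=100)     # output labels (binary, as an assumption)

# 60% train, 40% test, without shuffling
x_train, x_test, y_train, y_test = train_test_split(
    X, Y, train_size=0.6, shuffle=False
)

# Print the shapes to confirm the split
print(x_train.shape, x_test.shape)
print(y_train.shape, y_test.shape)
```

Because shuffle=False, the training set is simply the first 60 rows of X and the test set is the last 40.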
# It performs a train-test split on the variables X and Y, with 60% of
# the data assigned to the training set and 40% to the test set.
# The data is randomly shuffled before the split, and no specific
# random state is set.
x_train, x_test, y_train, y_test = train_test_split(X, Y, train_size=0.6, random_state=None)
print(x_train)
print(y_train)
Output:
[ 1 3 4 9 10 7]
[ 5 15 20 45 50 35]
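With random_state=None the shuffle differs from run to run, so the output above will usually not be reproduced exactly. A small sketch of how a fixed seed makes the split repeatable is given below; the seed value 42 is an arbitrary choice, and X and Y are assumed to be the values 1 to 10 and 5 * X, as the printed outputs suggest.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Assumed sample data, consistent with the outputs shown in this practical
X = np.arange(1, 11)
Y = 5 * X

# Two calls with the same integer seed
first = train_test_split(X, Y, train_size=0.6, random_state=42)
second = train_test_split(X, Y, train_size=0.6, random_state=42)

# Compare the training features returned by both calls
same_split = np.array_equal(first[0], second[0])
print(same_split)
```

Fixing the seed is useful when results have to be compared across runs or shared with others.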
# It performs a train-test split on the variables X and Y, where 70% of
# the data is assigned to the training set and 30% to the test set.
# It will print the train and test data and their shapes.
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.3)
print(x_train)
print(x_train.shape)
print(x_test)
print(x_test.shape)
Output:
[1 7 6 8 5 3 4]
(7,)
[ 2 10 9]
(3,)
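Beyond printing shapes, it can be worth checking that the two sets really partition the data. The sketch below assumes the same X and Y as before (1 to 10 and 5 * X) and an arbitrary seed; the three checks are illustrative additions, not part of the original practical.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Assumed sample data, consistent with the outputs shown in this practical
X = np.arange(1, 11)
Y = 5 * X

x_train, x_test, y_train, y_test = train_test_split(
    X, Y, test_size=0.3, random_state=1  # seed chosen only for this sketch
)

# 1. No sample appears in both sets
overlap = np.intersect1d(x_train, x_test)
# 2. Together, the two sets contain every original sample
covers_all = np.array_equal(np.sort(np.concatenate([x_train, x_test])), X)
# 3. Features and labels stay paired after shuffling
aligned = np.array_equal(y_train, 5 * x_train) and np.array_equal(y_test, 5 * x_test)

print(overlap.size, covers_all, aligned)
```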
`X`: input features
`Y`: output labels
`train_size`: proportion of data used for training (default is None; when `test_size` is also
None, this works out to 0.75)
`shuffle`: whether to shuffle the data before splitting (default is True)
`random_state`: a seed value for the random number generator used to shuffle the data before
splitting (default is None)
`stratify`: if given, preserves the proportion of classes in the output labels in both the training
and testing sets (default is None)
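The default proportions can be confirmed by calling the function without test_size or train_size. The following sketch uses a hypothetical array of 100 samples so that the percentages are easy to read; the seed is arbitrary.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 100 samples, labels derived from the features
X = np.arange(100)
Y = 5 * X

# Neither test_size nor train_size is given, so the defaults apply
x_train, x_test, y_train, y_test = train_test_split(X, Y, random_state=0)

n_train, n_test = len(x_train), len(x_test)
print(n_train, n_test)
```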
Conclusion:
The train-test split is a common technique used in machine learning to evaluate the
performance of a model. This helps in assessing how well the model generalizes to unseen data.
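To illustrate how the held-out test set is used for evaluation, the sketch below fits a model on the training portion and scores it on the test portion. The choice of LinearRegression is an assumption made for illustration, since the practical does not name a model; X and Y are again assumed to be 1 to 10 and 5 * X.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Assumed sample data; scikit-learn estimators expect 2-D feature arrays
X = np.arange(1, 11).reshape(-1, 1)
Y = 5 * np.arange(1, 11)

x_train, x_test, y_train, y_test = train_test_split(
    X, Y, test_size=0.3, random_state=0  # seed chosen only for this sketch
)

# Fit on the training set only
model = LinearRegression().fit(x_train, y_train)

# score() returns the R^2 coefficient on data the model has not seen
test_score = model.score(x_test, y_test)
print(test_score)
```

Since Y is an exact linear function of X here, the test score should be very close to 1; on real data the gap between training and test scores indicates how well the model generalizes.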