
Machine Learning

LECTURE – 02

Training and Test Data


How a Supervised Learning Algorithm Works



Splitting Datasets

• To use a dataset in Machine Learning, the dataset is first split into a training set and a test set.

• The training set is used to train the model.

• The test set is used to test the accuracy of the model.

• Typically, the split is 80% training and 20% test.

• The objective is to estimate the performance of the machine learning model on new data: data not used to train the model.
It’s About Training
Slicing the Dataset

• You could imagine slicing the single data set as follows:
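The original slide shows this slicing as a figure. As a stand-in, here is a minimal Python sketch of index-based slicing; the array shapes, the shuffle, and the 80/20 ratio are illustrative assumptions, not taken from the slides:

```python
import numpy as np

# Illustrative dataset: 100 samples, 3 features each (assumed shapes).
X = np.random.rand(100, 3)              # inputs
y = np.random.randint(0, 2, size=100)   # labels

# Shuffle the indices so the slice is not biased by the original ordering.
rng = np.random.default_rng(seed=42)
indices = rng.permutation(len(X))

# Slice: first 80% of the shuffled indices for training, last 20% for testing.
split_point = int(0.8 * len(X))
train_idx, test_idx = indices[:split_point], indices[split_point:]

X_train, y_train = X[train_idx], y[train_idx]
X_test, y_test = X[test_idx], y[test_idx]

print(X_train.shape, X_test.shape)  # (80, 3) (20, 3)
```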


When not to use train-test split

• The dataset is small.
• Not enough data in the training dataset for the model to learn an effective mapping of inputs to outputs.
• Not enough data in the test set to effectively evaluate the model performance.
• The estimated performance could be overly optimistic (good) or overly pessimistic (bad).
When not to use train-test split

• Data Imbalance and Overfitting
• If the training data is heavily imbalanced, the model will predict a non-meaningful result.
• For example, if the model is a binary classifier (apple vs. pear) and nearly all the samples have the same label (e.g., apple), then the model will simply learn to predict that particular label (apple) for everything.
• This is called overfitting. To prevent it, there needs to be a fairly equal distribution of training samples for each class, or for each range of values if the label is a real value. A stratified split, sketched below, is one way to preserve the class proportions.
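One common mitigation, not shown on the slides, is a stratified split: scikit-learn's train_test_split accepts a stratify argument that keeps the class proportions roughly equal in the training and test sets. A minimal sketch, where the 90:10 apple/pear counts are made up for illustration:

```python
import numpy as np
from collections import Counter
from sklearn.model_selection import train_test_split

# Illustrative imbalanced dataset: 90 "apple" samples (0) and 10 "pear" samples (1).
X = np.random.rand(100, 3)
y = np.array([0] * 90 + [1] * 10)

# stratify=y keeps the 90:10 class ratio in both subsets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

print(Counter(y_train))  # approximately Counter({0: 72, 1: 8})
print(Counter(y_test))   # approximately Counter({0: 18, 1: 2})
```

Stratification does not add data; for very small datasets the cross-validation methods below are usually the better option.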
When not to use train-test split

• Reusing the same data for both training and testing is a bad idea, because we need to know how the method will work on data it was not trained on.
Types of train-test split

• 50:50 split

• Leave One Out Cross Validation

• K-Fold Cross validation


Dataset
50:50 split
K-Fold Cross validation
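The K-Fold slide is a figure in the original deck. As a stand-in, here is a minimal sketch of the same idea using scikit-learn's KFold; the 5-fold setting and the tiny dataset are assumptions for illustration:

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(20).reshape(10, 2)  # 10 samples, 2 features (illustrative)
y = np.arange(10)

# Split the data into k=5 folds; each fold takes a turn as the test set.
kf = KFold(n_splits=5, shuffle=True, random_state=42)
for fold, (train_idx, test_idx) in enumerate(kf.split(X)):
    print(f"fold {fold}: train={train_idx}, test={test_idx}")
```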
Leave One Out Cross Validation

• LOOCV is an extreme case of k-fold where k=n

• In leave-one-out (LOO) cross-validation, we train our machine learning model n times, where n is equal to the dataset's size.
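A minimal sketch of LOOCV using scikit-learn's LeaveOneOut; the 5-sample dataset is an assumption for illustration:

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut

X = np.arange(10).reshape(5, 2)  # n = 5 samples (illustrative)
y = np.array([0, 1, 0, 1, 0])

# LeaveOneOut is equivalent to KFold(n_splits=n): each sample is the test
# set exactly once, so the model is trained n times.
loo = LeaveOneOut()
for train_idx, test_idx in loo.split(X):
    print(f"train={train_idx}, test={test_idx}")
```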
Train-Test Split Procedure in Scikit-Learn
• The scikit-learn Python machine learning library provides an implementation of the train-test split evaluation procedure via the train_test_split() function.

• The function takes a loaded dataset as input and returns the dataset split into two subsets.
Train-Test Split Procedure in Scikit-Learn

• Ideally, you split your original dataset into input (X) and output (y) columns, then call the function passing both arrays and have them split appropriately into train and test subsets.

• For example, setting the test_size argument to 0.33 means 33 percent of the dataset will be allocated to the test set and 67 percent will be allocated to the training set, as in the sketch below.
Train-Test Split Procedure in Scikit-Learn
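A minimal sketch of the full procedure; the synthetic dataset from make_classification and the LogisticRegression model are assumptions, since the slides do not prescribe a particular estimator:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic dataset with input columns X and output column y (illustrative).
X, y = make_classification(n_samples=1000, n_features=10, random_state=1)

# Allocate 33% of the data to the test set and 67% to the training set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=1
)
print(X_train.shape, X_test.shape)  # (670, 10) (330, 10)

# Fit on the training set, evaluate on the held-out test set.
model = LogisticRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
```

Fixing random_state makes the split reproducible, so the reported accuracy can be compared across runs.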
