NB4-09 PT IV Data Augmentation and Early Stopping
Goal: Let's create a notebook where we use data augmentation and early stopping, i.e., we stop training once the model has not improved for a certain number of epochs. We will use a skin lesion dataset similar to the previous one, but with 7 classes.
1. Introduction
Data Augmentation and Early Stopping are two crucial techniques in machine
learning and deep learning that help improve the performance and generalization of
models. Here’s an explanation of their importance:
A. Data Augmentation
Definition: Data augmentation involves creating new training examples from the
existing data by applying random transformations such as rotation, translation,
flipping, scaling, noise addition, and more.
Importance:
Reduces Overfitting: the model sees many slightly different versions of each image instead of memorizing the originals.
Increases Effective Dataset Size: more varied training examples become available without collecting new data.
Improves Robustness: the model becomes less sensitive to variations in orientation, lighting, and scale at inference time.
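For instance, a minimal augmentation pipeline might look like the sketch below, built with torchvision.transforms (the specific transforms and parameter values are illustrative, not necessarily the ones used later in this notebook):

```python
import torchvision.transforms as T

# A minimal augmentation pipeline; the transforms and values are illustrative
train_transform = T.Compose([
    T.RandomHorizontalFlip(p=0.5),               # random flipping
    T.RandomRotation(degrees=15),                # random rotation
    T.RandomResizedCrop(224, scale=(0.8, 1.0)),  # random scaling and cropping
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406],      # ImageNet statistics (assumption)
                std=[0.229, 0.224, 0.225]),
])
```

Because the transforms are random, each epoch effectively presents the network with a different variant of every training image.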
B. Early Stopping
Definition: Early stopping monitors a validation metric (typically the validation loss) during training and halts training once the metric stops improving for a set number of epochs, known as the patience.
Importance:
Reduces the Need for Manual Tuning: Finding the right number of training
epochs manually can be challenging. Early stopping automates this process by
dynamically determining the best point to stop training.
Example: Suppose you're training a neural network and notice that the validation
loss starts increasing after 50 epochs, while the training loss continues to decrease.
Early stopping would halt the training process at this point, preventing the model from
overfitting to the training data.
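A minimal sketch of an early-stopping helper (the class name and the patience/min_delta parameters are our own illustration, not a library API):

```python
class EarlyStopping:
    """Stop training when the validation loss has not improved for `patience` epochs."""

    def __init__(self, patience=5, min_delta=0.0):
        self.patience = patience      # epochs to wait without improvement
        self.min_delta = min_delta    # minimum change that counts as improvement
        self.best_loss = float('inf')
        self.counter = 0
        self.should_stop = False

    def step(self, val_loss):
        if val_loss < self.best_loss - self.min_delta:
            self.best_loss = val_loss  # improvement: remember it and reset the counter
            self.counter = 0
        else:
            self.counter += 1          # no improvement this epoch
            if self.counter >= self.patience:
                self.should_stop = True
        return self.should_stop
```

In the training loop, call early_stopping.step(val_loss) once per epoch and break out of the loop when it returns True.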
nvidia-smi is a command-line utility provided by NVIDIA that helps you manage and
monitor NVIDIA GPU devices. It stands for NVIDIA System Management Interface.
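In a Colab or Jupyter cell, it can be run with a leading "!":

```
# Shows the driver version, GPU model, memory usage, and running processes
!nvidia-smi
```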
We save the root directory of the project, '/content', as HOME, since we will be navigating through the directory tree and keeping multiple projects under the same HOME. Additionally, we will keep the datasets in the 'datasets' directory, so all datasets are easily accessible from any project.
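A sketch of this step (in a fresh Colab runtime, the working directory is /content):

```python
import os

HOME = os.getcwd()   # '/content' in a fresh Colab runtime
print('HOME:', HOME)
```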
Next, we import the drive module from the google.colab library, which provides functionality for mounting Google Drive in Google Colab. Google Drive is then mounted and made available at the path /content/drive. You will be prompted to authorize access to Google Drive; once authorized, its contents are accessible from the Colab notebook.
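This is the standard Colab idiom:

```python
from google.colab import drive

# Prompts for authorization, then mounts Google Drive at /content/drive
drive.mount('/content/drive')
```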
Create the dataset directory (if it doesn't exist) where we will save the dataset used to train our CNN.
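A sketch, assuming the 'datasets' directory lives under HOME as described above:

```python
import os

dataset_dir = os.path.join(HOME, 'datasets')  # HOME was defined earlier
os.makedirs(dataset_dir, exist_ok=True)       # no error if it already exists
```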
0: 'akiec' - actinic keratoses
1: 'bcc' - basal cell carcinoma
2: 'bkl' - benign keratosis-like lesions
3: 'df' - dermatofibroma
4: 'mel' - melanoma
5: 'nv' - melanocytic nevi
6: 'vasc' - vascular lesions
Setting Up a DataLoader
These transformations are typically applied during the training process to increase the
diversity of the training data, helping to improve the generalization of the deep
learning model.
This setup ensures that the datasets are properly preprocessed and ready to be used
in training and evaluating machine learning models, particularly deep neural
networks, using PyTorch.
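As a sketch of what this setup might look like (the dataset paths, image size, and batch size are assumptions; train_transform is the augmentation pipeline sketched earlier):

```python
import torchvision.transforms as T
from torchvision import datasets
from torch.utils.data import DataLoader

# Evaluation transform: no augmentation, just resize + normalize (values illustrative)
test_transform = T.Compose([
    T.Resize((224, 224)),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

# Paths are assumptions; ImageFolder expects one subdirectory per class
train_dataset = datasets.ImageFolder('datasets/skin/train', transform=train_transform)
test_dataset = datasets.ImageFolder('datasets/skin/test', transform=test_transform)

train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=32, shuffle=False)
```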
The training set is unchanged in size because ``transform()`` transforms the data on the fly; it does not add new samples to the dataset.
Data Augmentation
The training set grows in size because ``ConcatDataset()`` concatenates the original dataset with its augmented copy, so the dataset is actually augmented.
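A sketch of this pattern (the paths are assumptions; it reuses the transforms defined above):

```python
from torch.utils.data import ConcatDataset
from torchvision import datasets

# The same images loaded twice: once plain, once with random augmentation
plain_set = datasets.ImageFolder('datasets/skin/train', transform=test_transform)
augmented_set = datasets.ImageFolder('datasets/skin/train', transform=train_transform)

full_train_set = ConcatDataset([plain_set, augmented_set])
print(len(plain_set), len(full_train_set))  # the concatenated set is twice as large
```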
Let us show one example of each class, for fun. As we've transformed the images by normalizing them, we should undo the normalization before visualizing them.
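A sketch of the un-normalization step, assuming the ImageNet statistics used above and the full_train_set defined earlier:

```python
import matplotlib.pyplot as plt
import torch

def unnormalize(img, mean=(0.485, 0.456, 0.406), std=(0.229, 0.224, 0.225)):
    """Invert Normalize: x = x * std + mean, channel by channel."""
    mean = torch.tensor(mean).view(3, 1, 1)
    std = torch.tensor(std).view(3, 1, 1)
    return (img * std + mean).clamp(0, 1)

img, label = full_train_set[0]                 # one (image, label) pair
plt.imshow(unnormalize(img).permute(1, 2, 0))  # CHW -> HWC for matplotlib
plt.title(f'class {label}')
plt.axis('off')
plt.show()
```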
Setting Hyperparameters
We are going to define some training parameters for the network, such as the batch size and the number of epochs and classes in the dataset, since they are needed by the dataloaders and the training loop. We will run only 10 epochs to check functionality. Later, we will load a model that has already been trained for 30 epochs.
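For example (only the epoch count and number of classes come from the text; the other values are assumptions):

```python
# Training settings
BATCH_SIZE = 32        # assumption
NUM_EPOCHS = 10        # 10 epochs to check functionality, per the text
NUM_CLASSES = 7        # seven skin-lesion classes
LEARNING_RATE = 1e-3   # assumption
```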
Display all images and their ground truth from a random batch
To see how the DataLoader works and how it handles the loaded data, we will select a random batch and display it, indicating each image's class label as well. That is, with dataloaders we can easily display all images in a random batch together with their ground truth.
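A sketch of how such a batch could be displayed, building on the loader and unnormalize helper sketched earlier (the 4x8 grid assumes a batch size of 32):

```python
import matplotlib.pyplot as plt

images, labels = next(iter(train_loader))        # one random batch (shuffle=True)

fig, axes = plt.subplots(4, 8, figsize=(16, 8))  # 4x8 grid assumes batch_size = 32
for ax, img, label in zip(axes.flat, images, labels):
    ax.imshow(unnormalize(img).permute(1, 2, 0))
    ax.set_title(train_dataset.classes[label], fontsize=8)
    ax.axis('off')
plt.tight_layout()
plt.show()
```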
3. Global Average Pooling: Replace the Flatten layer with a global average pooling layer to reduce the number of parameters and prevent overfitting (see the sketch after this list).
6. Data Augmentation: While not part of the model architecture, performing data
augmentation during training can improve performance.
Explanation of Changes: These changes can improve the model's generalization capability and efficiency.
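To illustrate the global-average-pooling change from item 3, here is a sketch comparing a Flatten-based classification head with a GAP-based one (the channel count of 256 and the 7x7 feature map are assumptions):

```python
import torch.nn as nn

# Flatten-based head: parameter count depends on the spatial size of the feature map
head_flatten = nn.Sequential(
    nn.Flatten(),                 # e.g. 256 * 7 * 7 = 12544 inputs to the linear layer
    nn.Linear(256 * 7 * 7, 7),
)

# GAP-based head: one value per channel, so the linear layer only needs 256 inputs
head_gap = nn.Sequential(
    nn.AdaptiveAvgPool2d(1),      # global average pooling to a 1x1 map per channel
    nn.Flatten(),
    nn.Linear(256, 7),
)
```

GAP reduces each feature map to a single value, so the final linear layer has far fewer parameters and is less prone to overfitting.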
To create directories named train1, train2, etc., each time you execute a training loop,
you can modify the code to check the number of existing training directories and then
create the next directory in sequence. Here's an example of how you could do this:
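A sketch of this pattern (the base directory name 'runs' is an assumption):

```python
import os

def next_run_dir(base='runs', prefix='train'):
    """Create and return the next trainN directory (train1, train2, ...)."""
    os.makedirs(base, exist_ok=True)
    n = 1
    while os.path.exists(os.path.join(base, f'{prefix}{n}')):
        n += 1                     # skip directories that already exist
    run_dir = os.path.join(base, f'{prefix}{n}')
    os.makedirs(run_dir)
    return run_dir

run_dir = next_run_dir()
print('Saving this run to', run_dir)
```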
7. Predictions (Inference)
Testing
3. Classification Report:
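A sketch of producing a classification report with scikit-learn, assuming model, device, test_loader, and train_dataset are defined as in the earlier steps:

```python
import torch
from sklearn.metrics import classification_report

model.eval()
y_true, y_pred = [], []
with torch.no_grad():
    for images, labels in test_loader:
        outputs = model(images.to(device))
        y_pred.extend(outputs.argmax(dim=1).cpu().tolist())  # predicted class per image
        y_true.extend(labels.tolist())

# Per-class precision, recall, and F1-score, labeled with the class names
print(classification_report(y_true, y_pred, target_names=train_dataset.classes))
```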
These changes should help you better understand your model's results and ensure that the code handles different numbers of classes correctly.