Disaster Risk Monitoring Using Satellite Imagery
02 - Efficient Model Training
In this notebook, you will learn how to train a segmentation model with the TAO
Toolkit using pre-trained ResNet-18 weights. In addition, you will learn how to export
the model for deployment.
Table of Contents
This notebook covers the below sections:
1. Introduction to the TAO Toolkit
Transfer Learning
Vision AI Pre-Trained Models Supported
TAO Toolkit Workflow
TAO Launcher, CLI (Command Line Interface), and Spec Files
Set Up Environment Variables
Exercise #1 - Explore TAO Toolkit CLI
2. U-Net Semantic Segmentation Model
Preparation for Model Training
Download Pre-Trained Model
Prepare Dataset
Model Training
Exercise #2 - Modify Dataset Config
Exercise #3 - Modify Model Config
Exercise #4 - Modify Training Config
Combine Configuration Files
Initiate Model Training
Evaluating the Model
Visualizing Model Inference
3. Model Export
TensorRT - Programmable Inference Accelerator
Export the Trained Model
Generate TensorRT Engine
Introduction to the TAO Toolkit
The NVIDIA TAO Toolkit leverages the power of transfer learning, combining pre-trained models with users' own
data to produce highly accurate computer vision models efficiently, eliminating the
need for large training runs and deep AI expertise. In addition, it also enables model
optimization for inference performance. You can learn more about the TAO Toolkit
on its product page or in its documentation.
The TAO Toolkit uses pre-trained models to accelerate the AI development process
and reduce costs associated with large scale data collection, labeling, and training
models from scratch. Transfer learning with pre-trained models can be used for
classification, object detection, and image segmentation tasks. The TAO Toolkit
offers useful features such as:
Low-code approach that requires no AI framework expertise, reducing the
barrier to entry for anyone who wants to get started building AI-based
applications
Flexible configurations that allow customization to help advanced users prototype
faster
Large catalogue of production-ready pre-trained models for common Computer
Vision (CV) tasks that can also be customized with users' own data
Easy to use interface for model optimization such as pruning and quantization-
aware training
Integration with the Triton Inference Server
Note: The TAO Toolkit comes with a set of reference scripts and configuration
specifications with default parameter values that enable developers to kick-start
training and fine-tuning. This lowers the bar, enabling users without a deep
understanding of models, deep learning expertise, or advanced coding skills to
train new models and fine-tune pre-trained ones.
Transfer Learning
In practice, it is rare and inefficient to initiate the learning task on a network with
randomly initialized weights due to factors like data scarcity (inadequate number of
training samples) or prolonged training times. One of the most common techniques
to overcome this is to use transfer learning. Transfer learning is the process of
transferring learned features from one application to another. It is a commonly used
training technique where developers use a model trained on one task and re-train to
use it on a different task. This works surprisingly well because many of the early layers in a
neural network are the same for similar tasks. For example, many of the early layers
in a convolutional neural network used for a CV model are primarily used to identify
outlines, curves, and other features in an image. The network formed by these layers
is referred to as the backbone of a more complex model. Also known as feature
extractors, these layers take an image as input and extract the feature map upon which the
rest of the network is based. The learned features from these layers can be applied
to similar tasks carrying out the same identification in other domains.
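To make the idea concrete, here is a minimal sketch of transfer learning in generic PyTorch code, assuming torchvision is available (this is an illustration only; the lab itself uses the TAO Toolkit rather than raw PyTorch):
In [ ]: # illustrative sketch only: generic PyTorch transfer learning, not the TAO workflow
import torch.nn as nn
import torchvision.models as models

# load a ResNet-18 backbone pre-trained on ImageNet and freeze its learned features
backbone = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
for param in backbone.parameters():
    param.requires_grad = False

# replace the classification head; only this new layer will be trained
backbone.fc = nn.Linear(backbone.fc.in_features, 2)  # e.g., flood vs. no-flood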
Users interact with the launcher through its Command Line Interface, which is configured
using simple Protocol Buffer specification files that include parameters such as the
dataset parameters, model parameters, and optimizer and training hyperparameters.
More information about the TAO Toolkit Launcher can be found in the TAO Docs.
The tasks can be invoked from the TAO Toolkit Launcher using the convention tao
<task_group> <task> <subtask> <args_per_subtask> , where
<args_per_subtask> are the arguments required for a given subtask. The tasks
in the containers are grouped into different task_groups, which are divided into the
following categories:
model
dataset
deploy
The tasks under model contain routines to train , evaluate , and run
inference with any of the deep neural network models supported by
TAO. The tasks under dataset contain routines to manipulate datasets, such as
augment and auto_label , while the tasks under deploy optimize and deploy models
to TensorRT.
Once the container is launched, the subtasks are run by the TAO Toolkit containers
using the appropriate hardware resources.
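For example, with model as the task group, unet as the task, and evaluate as the subtask, an evaluation command (as used later in this notebook) follows the pattern:
tao model unet evaluate -e <spec_file> -m <model_file> -o <output_dir> -k <key>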
Since the TAO Toolkit uses the launcher to pull containers, the first time running a
task may take extra time to load.
Note that users will be able to define their own export encryption key when training
from a general-purpose model. This protects proprietary IP and is used to decrypt
the .etlt model during deployment.
In [ ]: # DO NOT CHANGE THIS CELL
# set environment variables
import os
import json
%set_env KEY=my_model_key
%set_env LOCAL_PROJECT_DIR=/dli/task/tao_project
%set_env LOCAL_DATA_DIR=/dli/task/flood_data
%set_env LOCAL_SPECS_DIR=/dli/task/tao_project/spec_files
os.environ["LOCAL_EXPERIMENT_DIR"]=os.path.join(os.getenv("LOCAL_PROJECT_DIR"), "unet")
%set_env TAO_PROJECT_DIR=/workspace/tao-experiments
%set_env TAO_DATA_DIR=/workspace/tao-experiments/data
%set_env TAO_SPECS_DIR=/workspace/tao-experiments/spec_files
%set_env TAO_EXPERIMENT_DIR=/workspace/tao-experiments/unet
!mkdir $LOCAL_EXPERIMENT_DIR
The cell below maps the project directory on your local host to a workspace directory
in the TAO docker instance, so that the data and the results are mapped from in and
out of the docker. This is done by creating a .tao_mounts.json file. For more
information, please refer to the launcher instance in the user guide. Setting the
DockerOptions ensures that you don't have permission issues when writing data
into folders created by the TAO docker.
In [ ]: # DO NOT CHANGE THIS CELL
# mapping up the local directories to the TAO docker
mounts_file = os.path.expanduser("~/.tao_mounts.json")
drive_map = {
    "Mounts": [
        # Mapping the project directory
        {
            "source": os.environ["LOCAL_PROJECT_DIR"],
            "destination": "/workspace/tao-experiments"
        },
        # Mapping the specs directory
        {
            "source": os.environ["LOCAL_SPECS_DIR"],
            "destination": os.environ["TAO_SPECS_DIR"]
        },
        # Mapping the data directory (assumed mount, based on the TAO_DATA_DIR variable set above)
        {
            "source": os.environ["LOCAL_DATA_DIR"],
            "destination": os.environ["TAO_DATA_DIR"]
        }
    ],
    # DockerOptions avoids permission issues on files written by the TAO docker
    "DockerOptions": {
        "user": "{}:{}".format(os.getuid(), os.getgid())
    }
}

# write the mapping to the .tao_mounts.json file
with open(mounts_file, "w") as mfile:
    json.dump(drive_map, mfile, indent=4)
To see the usage of the different functionalities that are supported, use the -h or --
help option. For more information, see the TAO Toolkit Quick Start Guide. Here is
the sample output:
In [ ]: # DO NOT CHANGE THIS CELL
!tao model --help
With the TAO Toolkit, users can train models for object detection, classification,
segmentation, optical character recognition, facial landmark estimation, gaze
estimation, and more. In TAO's terminology, these would be the tasks, which support
subtasks such as train , prune , evaluate , export , etc. Each task/subtask
requires different combinations of configuration files to accommodate different
parameters, such as the dataset parameters, model parameters, and optimizer and
training hyperparameters. Part of what makes the TAO Toolkit so easy to use is that most
of those parameters are hidden away in the form of experiment specification files
(spec files). They are detailed in the documentation for reference. It's very helpful to
have these resources handy when working with the TAO Toolkit. In addition, there are
several utility tasks that help with handling launched commands.
Below are the tasks available in the TAO Toolkit, organized by their respective
computer vision objectives. We grayed out the tasks for Conversational AI as they are
out of scope for this course.
For this lab, we will use ResNet-18 as the architecture for the semantic segmentation
model. A residual neural network, or ResNet, is a type of convolutional neural network
used as a backbone for many computer vision tasks. The 18 refers to the number of
layers in this architecture. It should be noted that, typically, the deeper (i.e., the more
layers) a neural network is, the more time-consuming it is to compute.
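As a rough illustration of that trade-off, the sketch below, assuming torchvision is available, compares parameter counts (a crude proxy for compute cost) between ResNet-18 and the deeper ResNet-50:
In [ ]: # illustrative only: parameter counts as a rough proxy for compute cost
import torchvision.models as models

for ctor in (models.resnet18, models.resnet50):
    n_params = sum(p.numel() for p in ctor(weights=None).parameters())
    print(f"{ctor.__name__}: {n_params / 1e6:.1f}M parameters")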
We designated the model to be downloaded locally to
tao_project/unet/pretrained_resnet18 , which is mapped to
/workspace/tao-experiments/unet/pretrained_resnet18 in the TAO
container based on the mapping of LOCAL_PROJECT_DIR to TAO_PROJECT_DIR .
Looking at the local_path and transfer_id keys of the output JSON, we can
gather that the pre-trained model should be in the
tao_project/unet/pretrained_resnet18/pretrained_semantic_segmentatio
directory. When referencing paths for the TAO Toolkit, it's important to use paths
based on the TAO container; in this case, it would be /workspace/tao-
experiments/unet/pretrained_resnet18/pretrained_semantic_segmentatio
Prepare Dataset
The TAO Toolkit expects the training data for the unet subtasks to be in the format
described in the documentation. Each mask image is a single-channel image, where
every pixel is assigned an integer value that represents the segmentation class
label_id , as per the mapping provided in the dataset_config . Additionally,
each image and its corresponding label must have the same filename (before the
extension) and the same size; the image-to-label correspondence is maintained
through this filename. The data folder structure for images and masks must be in
the following format:
[Image: required data directory structure: images/train , images/val , and an optional images/test folder, with corresponding masks/train and masks/val folders]
The test folder in the above directory structure is optional; any folder can be used
for inference.
Below we will split the data into train set and validation set and copy the
images into their respective folder.
In [ ]: # DO NOT CHANGE THIS CELL
# remove existing splits
!rm -rf $LOCAL_DATA_DIR/images/train
!mkdir -p $LOCAL_DATA_DIR/images/train
!rm -rf $LOCAL_DATA_DIR/images/val
!mkdir -p $LOCAL_DATA_DIR/images/val
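As a minimal sketch of how such a split can be performed, assuming hypothetical flat source folders images/all_images and masks/all_masks (the actual folder names in the lab may differ):
In [ ]: # illustrative split sketch: the source folder names below are assumptions
import os
import shutil

data_dir = os.environ["LOCAL_DATA_DIR"]
src_images = os.path.join(data_dir, "images", "all_images")  # hypothetical source folder
src_masks = os.path.join(data_dir, "masks", "all_masks")     # hypothetical source folder

file_names = sorted(os.listdir(src_images))
split_idx = int(len(file_names) * 0.8)  # 80/20 train/val split

for subset, names in [("train", file_names[:split_idx]), ("val", file_names[split_idx:])]:
    os.makedirs(os.path.join(data_dir, "masks", subset), exist_ok=True)
    for name in names:
        shutil.copy(os.path.join(src_images, name), os.path.join(data_dir, "images", subset, name))
        shutil.copy(os.path.join(src_masks, name), os.path.join(data_dir, "masks", subset, name))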
Model Training
Training configuration is done through a training spec file, which includes options
such as which dataset to use for training, which dataset to use for validation, which
pre-trained model architecture to use, which hyperparameters to tune, and other
training options. The train , evaluate , prune , and inference subtasks for a
U-Net experiment share the same configuration file. Configuration files can be
created from scratch or modified using the templates provided in TAO Toolkit's
sample applications.
The training configuration file has the following sections:
dataset_config
model_config
training_config
The dataset_config section includes the following parameters:
dataset (str) : The input type dataset used. The currently supported
dataset is custom to the user. Open-source datasets will be added in the
future.
augment (bool) : If the input should be augmented online while training.
When using one’s own dataset to train and fine-tune a model, the dataset can be
augmented while training to introduce variations in the dataset. This is known as
online augmentation. This is very useful in training as data variation improves
the overall quality of the model and prevents overfitting. Training a deep neural
network requires large amounts of annotated data, which can be a manual and
expensive process. Furthermore, it can be difficult to estimate all the corner
cases that the network may go through. The TAO Toolkit provides spatial
augmentation (resize and flip) and color space augmentation (brightness) to
create synthetic data variations.
augmentation_config (dict) :
spatial_augmentation (dict) : Supports spatial augmentation such
as flip, zoom, and translate.
hflip_probability (float) : Probability to flip an input image
horizontally.
vflip_probability (float) : Probability to flip an input image
vertically.
crop_and_resize_prob (float) : Probability to apply crop-and-resize
augmentation to an input image.
brightness_augmentation (dict) : Configures the color space
transformation.
delta (float) : Adjusts brightness using the delta value.
input_image_type (str) : The input image type, indicating whether the input
image is grayscale or color (RGB).
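Put together, a dataset_config can look roughly like the sketch below; the paths and values here are illustrative, and the spec files under $LOCAL_SPECS_DIR remain the authoritative reference:
dataset_config {
  dataset: "custom"
  augment: true
  augmentation_config {
    spatial_augmentation {
      hflip_probability: 0.5
      vflip_probability: 0.5
      crop_and_resize_prob: 0.5
    }
    brightness_augmentation {
      delta: 0.2
    }
  }
  input_image_type: "color"
  train_images_path: "/workspace/tao-experiments/data/images/train"
  train_masks_path: "/workspace/tao-experiments/data/masks/train"
  val_images_path: "/workspace/tao-experiments/data/images/val"
  val_masks_path: "/workspace/tao-experiments/data/masks/val"
}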
Instructions:
Modify the training_config (here) section of the training configuration file
by changing the <FIXME> into acceptable values and save changes. Typically,
a higher epoch count improves model performance but takes longer to complete.
For the purpose of this exercise, we recommend starting with a low n_epochs ,
such as 5 , to allow the model to converge without taking too much time.
In [ ]: # DO NOT CHANGE THIS CELL
# read the config file
!cat $LOCAL_SPECS_DIR/resnet18/training_config.txt
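For orientation, a filled-in training_config can look roughly like the sketch below; the field names and values are illustrative, so compare them against the actual file printed above before editing:
training_config {
  batch_size: 4
  epochs: 5              # the n_epochs value referred to in the exercise
  learning_rate: 0.0001
  checkpoint_interval: 1
  loss: "cross_entropy"
}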
Note that we use the container-based path /workspace/tao-
experiments/unet/pretrained_resnet18/pretrained_semantic_segmentation
to reference the pre-trained model.
Multi-GPU support can be enabled, for those with the hardware, using the --gpus
argument. When running the training with more than one GPU, we will need to modify
the batch_size and learning_rate . In most cases, scaling down the
batch_size or scaling up the learning rate by a factor of
NUM_GPUs is a good place to start.
In [ ]: # DO NOT CHANGE THIS CELL
# remove any previous training if exists
!rm -rf $LOCAL_EXPERIMENT_DIR/resnet18
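Following the launcher convention, a single-GPU training invocation can be sketched as below; the -n model name and the pre-trained model placeholder are illustrative, so verify the exact flags with !tao model unet train --help :
In [ ]: # hedged sketch of the training invocation: verify flags before running
!tao model unet train -e $TAO_SPECS_DIR/resnet18/combined_config.txt \
                      -r $TAO_EXPERIMENT_DIR/resnet18 \
                      -m <pretrained_model_path> \
                      -n resnet18 \
                      -k $KEY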
Note: The training may take hours to complete. unet supports restarting from
checkpoints in case the training job is killed prematurely. Training from the closest
checkpoint may be resumed by simply re-running the same command.
In [ ]: # DO NOT CHANGE THIS CELL
print('Model for every epoch at checkpoint_interval mentioned in the spec file:')
print('---------------------')
!tree -a $LOCAL_EXPERIMENT_DIR/resnet18
In the evaluation command below, the -e argument indicates the spec file, the -m
argument indicates the model, the -o argument indicates where the evaluation metrics
outputs should be written, and the -k argument indicates the key to load the model.
In [ ]: # DO NOT CHANGE THIS CELL
# evaluate the model using the same validation set as training
!tao model unet evaluate -e $TAO_SPECS_DIR/resnet18/combined_config.txt \
                         -m $TAO_EXPERIMENT_DIR/resnet18/weights/resnet18.tlt \
-o $TAO_EXPERIMENT_DIR/resnet18/ \
-k $KEY
To understand how the TAO Toolkit measures accuracy of the segmentation model,
we'll have to understand two measures: recall and precision. The first measure is
focused on identifying positive cases and is called recall. We define recall as the
ability of the model to identify all true positive samples of the dataset. In
mathematical terms, recall is the ratio of true positives over true positives plus false
negatives. In other words, recall tells us, among all the test samples belonging to
the output class, how many of them are identified correctly by the model. The next
measure is called precision and is defined as the ability of the model to identify
only the relevant samples. It is the ratio of true positives over true positives plus false
positives. A well-known measure that summarizes the balance between precision and
recall is the F1-score, which is their harmonic mean.
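To make these definitions concrete, here is a small illustrative helper (not part of the lab) that computes all three measures from raw counts:
In [ ]: # illustrative only: recall, precision, and F1 from raw counts
def metrics(tp, fp, fn):
    recall = tp / (tp + fn)        # true positives among all actual positives
    precision = tp / (tp + fp)     # true positives among all predicted positives
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return recall, precision, f1

print(metrics(tp=80, fp=10, fn=20))  # (0.8, 0.888..., 0.842...)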
We can write a quick function that will help us sample random inferences. Execute
the below cells to visualize the inference.
In [ ]: # DO NOT CHANGE THIS CELL
import matplotlib.pyplot as plt

NUM_IMAGES = 10  # assumed sample count

def visualize_images(num_images):
    fig_dim = 4
    fig, ax_arr = plt.subplots(num_images, 4, figsize=[4 * fig_dim, num_images * fig_dim])
    ax_arr[0, 0].set_title('Overlay')
    ax_arr[0, 1].set_title('Input')
    ax_arr[0, 2].set_title('Inference')
    ax_arr[0, 3].set_title('Actual')
    ax_arr[0, 0].set_xticks([])
    ax_arr[0, 0].set_yticks([])
    # each row is then populated with the overlay, input, inference, and ground-truth mask images

visualize_images(NUM_IMAGES)
Model Export
Once we are satisfied with our model, we can move to deployment. unet includes
an export subtask to export and prepare a trained U-Net model for deployment.
Exporting the model decouples the training process from deployment and allows
conversion to TensorRT engines outside the TAO environment. TensorRT is a highly
optimized package that takes trained models and optimizes them for inference.
TensorRT engines are specific to each hardware configuration and should be
generated for each unique inference environment. This may be interchangeably
referred to as the .trt or .engine file. The same exported TAO model may be
used universally across training and deployment hardware. This is referred to as the
.etlt file, or encrypted TAO file.
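Following the same launcher conventions, the export step can be sketched as below; the flags mirror the evaluate command and should be verified with !tao model unet export --help :
In [ ]: # hedged sketch of the export invocation: verify flags before running
!tao model unet export -e $TAO_SPECS_DIR/resnet18/combined_config.txt \
                       -m $TAO_EXPERIMENT_DIR/resnet18/weights/resnet18.tlt \
                       -k $KEY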