Automated Deep Learning Using Neural Network Intelligence
Develop and Design PyTorch and TensorFlow Models Using Python
Ivan Gridin
Vilnius, Lithuania
About the Author
Ivan Gridin is a researcher, author, developer, and artificial intelligence expert who has worked on distributed high-load systems and implemented different machine learning approaches in practice. One of the primary areas of his research is the design and development of predictive time series models. Ivan has a solid mathematical background in random process theory, time series analysis, machine learning, reinforcement learning, neural architecture search, and optimization. He has published books on genetic algorithms and time series forecasting.
He is a loving husband and father and collector of old math books.
You can learn more about him on LinkedIn: https://www.linkedin.com/in/survex/.
About the Technical Reviewer
Andre Ye is a deep learning researcher and writer working toward making deep learning more accessible, understandable, and responsible through technical communication. He is also a cofounder at Critiq, a machine learning platform facilitating greater efficiency in the peer-review process. In his spare time, Andre enjoys keeping up with current deep learning research, reading up on history and philosophy, and playing the piano.
Introduction
Machine learning is a big part of our lives today. We can hardly imagine a world without machine learning approaches; they already permeate our daily activities. Websites, mobile applications, self-driving cars, home devices, and many other things around us use machine learning algorithms. The growth of computing power, especially of the graphics processing unit, made deep learning practical. Deep learning studies the design of deep neural networks. This approach shows impressive efficiency and has experienced explosive growth in recent years.
Not surprisingly, the number of tasks to be solved and the demand for machine learning specialists are constantly growing. At the same time, the number of routine actions that developers and data scientists execute to solve machine learning problems keeps increasing. Meanwhile, researchers have developed special techniques to save time and automate the most common machine learning tasks. These techniques formed a separate area called automated machine learning, or AutoML. This book focuses on the automated deep learning (AutoDL) area, which studies the automation of deep learning problems. AutoDL addresses the creation and design of optimal deep learning models. This approach has developed rapidly in recent years and, in some cases, can completely automate the solution of typical tasks.
This book is about implementing AutoDL methods using Microsoft Neural Network
Intelligence (NNI). NNI is a Python toolkit that contains the most common and
advanced AutoDL methods: Hyperparameter Optimization (HPO), Neural Architecture
Search (NAS), and Model Compression. NNI supports the most popular deep learning
frameworks. This book covers the NNI implementation of various AutoDL techniques
using the PyTorch and TensorFlow frameworks.
Chapter 1 focuses on automated deep learning basics and why we should put this approach into practice. We will also install NNI and examine the basic scenarios for its use. We will learn how to run simple Hello World Experiments and interact with NNI via the command line and WebUI.
In Chapter 2, we will move on to the most common AutoDL task – Hyperparameter Optimization (HPO). We will learn what Hyperparameter Optimization is, what hyperparameters are, and how to organize an NNI HPO experiment using PyTorch and TensorFlow. We will also carry out three studies that trace a historical journey to the origins of deep learning. The first one will help us determine the best LeNet model hyperparameters for the MNIST problem. The second study integrates a new dropout layer and rectified linear unit (ReLU) activation into the original LeNet model. And the third one will show how we can evolve the LeNet model into AlexNet using simple HPO techniques.
In Chapter 3, we will study NNI's main search algorithms (Tuners), which aim to solve HPO tasks. Here, we will consider the practical application and the description of the following algorithms: Evolution Tuner, Anneal Tuner, and SMBO Tuners. Chapter 3 also covers the creation of a custom Tuner and applies it to the classic Shallow AutoML problem – building an optimal pipeline using scikit-learn methods.
In Chapter 4, we will begin to research Neural Architecture Search (NAS). NAS is an approach that studies the creation and design of neural networks best suited to solving a specific problem. This chapter covers Multi-trial NAS and its main principles. We'll discuss the NNI Retiarii framework, define Model Spaces and Model Mutators, and set up experiments that construct optimal neural networks. This chapter also introduces various exploration algorithms that explore a Multi-trial NAS Model Space: Regularized Evolution, TPE Strategy, and RL Strategy. Next, we will build LeNet-based and ResNet-based Multi-trial NAS experiments to solve the CIFAR-10 problem.
In Chapter 5, we move on to One-shot NAS, one of the latest advances in AutoDL. This chapter explains how to construct a Supernet, how to design cell-based neural architectures, and how to apply the Efficient Neural Architecture Search (ENAS) and Differentiable Architecture Search (DARTS) One-shot NAS algorithms.
In Chapter 6, we will cover the important topic of model pruning. Model pruning compresses a neural network by removing redundant weights or even layers. This technique is crucial for lightweight devices when we need to save computing resources. This chapter will examine basic One-shot and iterative pruning algorithms.
Chapter 7 will focus on practical recipes for using NNI to organize robust, extensive,
and big data experiments.
This book explores practical NNI applications of AutoDL methods and also describes the theory behind them. Therefore, it can be helpful for data scientists who want to understand the ideas underlying various AutoDL techniques and algorithms.
This book requires an intermediate understanding of deep learning and knowledge of TensorFlow or PyTorch.
CHAPTER 1
Introduction to Neural Network Intelligence
The deep learning industry has grown tremendously in the past few years. Deep learning approaches have achieved outstanding results in computer vision, natural language processing, robotics, time series forecasting, and optimal control theory. However, there is no “silver bullet model” that solves all kinds of problems. Each problem and dataset needs a specific model architecture to achieve suitable performance. Machine learning models, especially deep learning models, have a lot of tunable parameters that can drastically affect the model performance: the model design, the training method, the model configuration hyperparameters, and so on. The model optimization process has to be performed for each application and even for each dataset. Data scientists and machine learning experts often spend a lot of time performing manual model optimization. This activity can be frustrating because it takes too much time and is usually based on an expert's experience and quasi-random search.
However, recent results in automated machine learning and deep learning meta-
optimization make it possible to automate the optimizing process for a specific task.
It is also possible to create a brand-new model architecture from scratch without any prior experience of solving similar problems. The Neural Network Intelligence
(NNI) toolkit provides the latest state-of-the-art techniques to solve the most challenging
automated deep learning problems. We’ll start exploring the basic NNI features in this
chapter.
Let's restate the NFL (No Free Lunch) theorem in more formal language. Say we have a set of datasets D1, D2, D3, ..., and the estimated performance of a random search algorithm R on each dataset Di equals r:

E(R; Di) = r for every dataset Di            (Statement 1)

Then for any search algorithm A and any dataset Di with estimation r + q, there is a dataset Dj with estimation r - q:

if E(A; Di) = r + q, then there exists a dataset Dj with E(A; Dj) = r - q            (Statement 2)

Statement 2 says that if algorithm A is better than the random algorithm R for dataset Di, then there is a dataset Dj for which algorithm A will be worse than the random algorithm R, and E(A; Di) + E(A; Dj) = E(R; Di) + E(R; Dj). This fact makes all algorithms equivalent if
we consider them separately from a specific dataset and task. For example, let’s say we
have an algorithm A for predicting the color of the next box by previous ones with rules
listed in Table 1-1.
And the prediction algorithm A works with 100% accuracy for dataset D1, in which
two black boxes follow two white boxes and two white boxes follow two black boxes, as
shown in Figure 1-2.
But let's examine how algorithm A works on dataset D2, in which white and black boxes alternate one after another, as shown in Figure 1-3.
Figure 1-3 demonstrates that algorithm A has 0% accuracy on dataset D2. This example illustrates that “there is no optimal algorithm for all datasets” and “there is no optimal solution for all cases.” And how does the NFL theorem influence deep learning? Each deep learning model and each dataset generates a loss function that should be minimized. If we have two deep learning models, M1 and M2, the NFL theorem implies that each of them shows good results only for certain types of problems and certain types of datasets. You cannot expect the same deep learning model to perform similarly on a different dataset, much less on a different kind of problem. So if you apply model M1 and model M2 to problem P1 on dataset D1, you may find that model M1 shows good performance in this case, while model M2 demonstrates poor performance. Figure 1-4 illustrates this point.
But we can get the opposite results if the models are applied to a different problem
and a different dataset as shown in Figure 1-5.
So, the NFL theorem tells us that we cannot expect a model to perform equally well in different cases. The slightest modification in the problem statement or change in the dataset requires additional model optimization. This fact makes AutoDL irreplaceable in preparing an effective production-ready solution. It is also worth mentioning that the set of realistic datasets is much smaller than the set of all possible datasets, which makes it possible to determine a class of most suitable algorithms for solving specific problems. Nevertheless, the NFL theorem remains true since selecting the best algorithm for all types of problems is impossible.
This approach will help update the model with the latest advances in deep learning,
enhancing the model’s performance.
Using AutoDL techniques, you can adapt the model to other datasets.
And I find this to be a fantastic direction for further research and practical applications. We have developed deep learning models and trained them using error backpropagation. Neural networks of a particular architecture can reveal the most complex dependencies and patterns. So why not take the next step and develop neural network design algorithms that will create the optimal neural network architecture for a specific task?
Reinventing the Wheel
Many machine learning experts spend a lot of time re-implementing existing methods to solve the problems described earlier. Automated machine learning techniques can save weeks or even months of development. Of course, automated deep learning cannot substitute for deep learning engineers, and human experience and intuition remain the main drivers of invention. Nevertheless, AutoDL can significantly decrease the amount of custom work needed. Automated deep learning should become a must-have tool for solving practical problems that can save significant time.
Install
NNI's minimal system requirements are Ubuntu 18.04, macOS 11, or Windows 10 21H2, and Python 3.7.
NNI can be simply installed as follows:
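pip install nni

(This pulls the latest nni package from PyPI; depending on your setup, you may need python3 -m pip instead of pip.)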
We will be using version 2.7 in this book, so I highly recommend installing version
2.7 to avoid version differences:
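pip install nni==2.7

(Pinning the version with pip's == specifier keeps the book's examples reproducible.)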
Let’s test the installation by executing “Hello World” scenario. Run the following
command (ch1/install/hello_world/config.yml file is contained in the source code):
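The command presumably takes the following form (nnictl create starts a new experiment from a configuration file):

nnictl create --config ch1/install/hello_world/config.yml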
If the installation was successful, you should see the following output:
And you can follow the link http://127.0.0.1:8080 in your browser. Figure 1-9
demonstrates NNI web user interface that we will cover in the next sections.
If everything is ok, then you can stop NNI by executing the following in the
command line:
nnictl stop
Docker
If you have any problems with the installation, you can use the Docker image that was prepared for this book. The Dockerfile in Listing 1-1 is based on the official NNI Docker image msranni/nni:v2.7 from the official Docker repository: https://hub.docker.com/r/msranni/nni/tags.
FROM msranni/nni:v2.7
# copy the book's code repository into the image
RUN mkdir /book
ADD . /book
# NNI WebUI port
EXPOSE 8080
# keep the container running so we can exec into it
ENTRYPOINT ["tail", "-f", "/dev/null"]
The autodl_nni_book Docker image contains all the necessary libraries and dependencies to run all the experiments that we will study in this book.
Let’s run the “Hello World” scenario we examined in the previous section using
docker. We start the docker container:
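A typical run command for this setup might look as follows (publishing port 8080 so that the WebUI is reachable from the host; the container name is an assumption):

docker run -d --name autodl_nni_book -p 8080:8080 autodl_nni_book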
and after that, you can access NNI WebUI via http://127.0.0.1:8080 in your browser.
The code repository for this book is in /book directory of the docker image. Therefore,
in the autodl_nni_book docker image, you can execute all commands that will concern
NNI as follows:
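For example, the “Hello World” experiment could be launched inside the running container roughly like this (the container name must match the one used in docker run):

docker exec -it autodl_nni_book nnictl create --config /book/ch1/install/hello_world/config.yml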
But in any case, the Docker setup's capabilities are limited. For flexible debugging and better interaction with NNI, I strongly recommend that you work with NNI without Docker if possible.
Tuner selects a parameter in the search space and transfers it to Trial. Trial is a
Python script that tests the model with parameters passed by Tuner and returns a metric
that estimates the model’s performance.
This search process can be depicted as shown in Figure 1-10.
After a certain number of trials, we have a sufficient number of results that estimate
the suitability of each parameter for an optimized model.
When we say that we need to optimize the black-box function, it means that we
need to find such input parameters for which the black-box function outputs the highest
value. Let’s say that we have a black-box function, which is defined by the code in
Listing 1-2.
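As a purely illustrative stand-in for Listing 1-2 (the book's actual function is different), such a module might look like this:

# ch1/bbf/black_box_function.py -- illustrative sketch only
import math

def black_box_function(x, y, z):
    # some nonlinear combination of the three inputs returning a real value
    return math.sin(x) + math.cos(y) + math.log(z) / 10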
Of course, the optimization problem for the function presented in Listing 1-2 can be solved analytically, but let's pretend that we do not know how the function acts inside the black box. All we know is that the black-box function returns a real value and receives three input parameters:

• x: a number from 1 to 100
• y: a number from 1 to 10
• z: a number from 1 to 10,000
Let's start solving our problem by defining a search space. The search space is defined in JSON format using special directives. We will define the search space in the JSON file shown in Listing 1-3.
{
"x": {"_type": "quniform", "_value": [1, 100, 1]},
"y": {"_type": "quniform", "_value": [1, 10, 1]},
"z": {"_type": "quniform", "_value": [1, 10000, 0.01]}
}
The quniform directive creates a value list from a to b with step s. So the search space defined in Listing 1-3 can be presented the following way:

• x in [1, 2, 3, …, 99, 100]
• y in [1, 2, 3, …, 9, 10]
• z in [1, 1.01, 1.02, …, 9999.99, 10000]
Note We’ll explore how to define search space in more detail in the next chapter.
import nni
from ch1.bbf.black_box_function import black_box_function

if __name__ == '__main__':
    # parameter from the search space selected by tuner
    p = nni.get_next_parameter()
    x, y, z = p['x'], p['y'], p['z']
    r = black_box_function(x, y, z)
    # returning result to NNI
    nni.report_final_result(r)
Note We’ll explore how to define trial in more detail in the next chapter.
And the last thing left for us to do is to define the configuration of our experiment,
which will look for the best input parameters for the black-box function.
trialConcurrency: 4
maxTrialNumber: 1000
searchSpaceFile: search_space.json
trialCodeDirectory: .
trialCommand: python3 trial.py
tuner:
  name: Evolution
  classArgs:
    optimize_mode: maximize
trainingService:
  platform: local
The experiment that we have defined in Listing 1-5 has the following properties:

• It runs four trials concurrently (trialConcurrency: 4).
• It executes at most 1000 trials (maxTrialNumber: 1000).
• The search space is read from search_space.json, and each trial runs python3 trial.py.
• The Evolution Tuner explores the search space, maximizing the reported metric.
• Trials are executed on the local machine (trainingService: local).
Now everything is ready to find the input parameters that maximize the black-box
function. Let’s run NNI:
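Assuming the configuration above is saved as config.yml in the experiment's directory (ch1/bbf in the book's code layout), the experiment is started with:

nnictl create --config ch1/bbf/config.yml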
And you can monitor the experiment process in the web panel:
http://127.0.0.1:8080.
After completing the experiment, you can observe the parameter that returned the
best metric on the NNI overview page: http://127.0.0.1:8080/oview.
We see that parameter (x=49, y=2, z=7024.61) is the best result of the experiment.
The function for this parameter returns 48.02, which is the maximum of all trials.
Of course, we could have obtained the same result more simply, but now, we are
introducing the basic capabilities of NNI. In the next chapters, we will see the full
strength of this tool.
Overview Page
The overview page http://127.0.0.1:8080/oview contains summary information
about a running experiment.
The upper left panel contains information about the experiment state (Figure 1-13).
The lower left panel shows the number of trials performed and the running time. The
maximum number of trials and the maximum time can be edited on the fly (Figure 1-14).
The right panel on the overview page contains a summary of the top trials
(Figure 1-15).
If you just want to run an experiment and get the best result, the overview page is all you need. But for a more detailed analysis of the experiment execution, you will need the trial details page.
This panel allows hyperparameter data mining to help you better understand the nature of the investigated black-box function. Let's dwell on this for a moment. Select the top 5% trials on the hyperparameter panel (Figure 1-18).
And we can get a lot of insights from Figure 1-18. For all top 5% trials, the following is true:

• x lies in the neighborhood of 50.
• y is even.
Based on the information we obtained here, we can perform our own simplified
search that finds the best parameter close to 48.02, which was found during the NNI
experiment. Let’s examine Listing 1-6.
import random
from ch1.bbf.black_box_function import black_box_function

seed = 0
random.seed(0)
max_ = -100
best_trial = None
for _ in range(100):
    x = random.choice([50, 49, 48])
    y = random.choice([2, 4, 6, 8, 10])
    z = round(random.uniform(1, 10000), 2)
    r = black_box_function(x, y, z)
    if r > max_:
        max_ = r
        best_trial = (x, y, z, r)
print(best_trial)
An experiment is usually a rather lengthy process that can take days or even weeks. Sometimes, there may be interesting hypotheses to test. For example, it may be necessary to run a trial with specific parameters manually. And if you don't want to wait until the end of the experiment, you can add a custom trial to the queue by clicking the “Copy” button in the trial list. You can enter your trial parameters in the pop-up window and submit the trial. Figure 1-20 demonstrates how you can submit a custom trial.
NNI offers a web panel for experiments that simplifies administration and
monitoring tasks. We will get back to it more than once in the next chapters.
Embedded NNI
Even though the capabilities of the NNI server are quite broad, NNI can also run in embedded mode. In some cases, it is more convenient to run NNI in Python embedded mode. This need may arise when it is necessary to dynamically create experiments and have more control over the experiment execution. We will use NNI in embedded mode in some examples in the next chapters.
Listing 1-7 shows an example of the execution of an experiment in embedded mode
to optimize the black-box function we examined earlier.
# Loading Packages
from pathlib import Path
from nni.experiment import Experiment

# Search Space (same contents as search_space.json from the previous section)
search_space = {
    "x": {"_type": "quniform", "_value": [1, 100, 1]},
    "y": {"_type": "quniform", "_value": [1, 10, 1]},
    "z": {"_type": "quniform", "_value": [1, 10000, 0.01]}
}

# Experiment Configuration
experiment = Experiment('local')
experiment.config.experiment_name = 'Black Box Function Optimization'
experiment.config.trial_concurrency = 4
experiment.config.max_trial_number = 1000
experiment.config.search_space = search_space
experiment.config.trial_command = 'python3 trial.py'
experiment.config.trial_code_directory = Path(__file__).parent
experiment.config.tuner.name = 'Evolution'
experiment.config.tuner.class_args['optimize_mode'] = 'maximize'
# Starting NNI
http_port = 8080
experiment.start(http_port)
# Event Loop
while True:
    if experiment.get_status() == 'DONE':
        search_data = experiment.export_data()
        search_metrics = experiment.get_job_metrics()
        input("Experiment is finished. Press any key to exit...")
        break
Listing 1-7 contains an event loop that allows you to track the progress of your
experiment automatically. Therefore, you can programmatically design experiments and
get the best solutions for a problem.
Troubleshooting
If you experience any problems or errors launching and using NNI, you can follow this
mini-guide to determine the issue.
NNI is not starting. In this case, you'll see an error message after running the nnictl create command, and this error message can help you fix the problem.
NNI is starting, but you see an ERROR badge in the overview web panel, as shown
in Figure 1-21.
In this case, check Trial logs in Trial jobs panel, as shown in Figure 1-23.
This mini-guide may make it easier to find and fix the NNI problem.
The examples in PyTorch and TensorFlow will not duplicate each other but will be close to each other. Therefore, if you are only a PyTorch user, you will not lose anything if you do not dive into the examples with TensorFlow models.
This book will use the following framework versions:
• PyTorch: 1.9.0
• Scikit-learn: 0.24.1
In any case, I recommend that you go through all the examples because their concepts can be easily ported to your favorite deep learning framework.
Summary
In this chapter, we have explored the basic NNI features. NNI is a very powerful toolkit for solving various AutoML tasks. At the beginning of this chapter, we also examined why AutoML techniques are in demand in practice. In the next chapter, we will begin exploring the application of the classic Hyperparameter Optimization (HPO) approach. We will study how HPO techniques can optimize existing architectures and create a new model design.
CHAPTER 2
Hyperparameter Optimization
Almost every deep learning model has a large number of hyperparameters. Choosing
the proper hyperparameters is one of the most common problems in AutoML. A small
change in one of the model’s hyperparameters can significantly change its performance.
Hyperparameter Optimization (HPO) is the first and most effective step in deep learning
model tuning. Due to its ubiquity, Hyperparameter Optimization is sometimes regarded
as synonymous with AutoML.
NNI provides a broad and flexible set of HPO tools. This chapter will examine various
neural network designs and how NNI can be applied to optimize their hyperparameters
for particular problems.
What Is a Hyperparameter?
Let’s start the chapter by defining what a model hyperparameter is. A deep learning
model has three types of parameters:
• Task parameters: The parameters that the task sets for you. These
parameters lie in the problem requirements, which need to be
satisfied and cannot be changed. For example, suppose we solve the
binary classification problem determining “cat or dog?” by analyzing
their pictures. In that case, we have a task parameter: 2, which
indicates the number of output classes. Or, for example, we have the
air temperature prediction problem for the next three days. Then
parameter 3 is a task parameter, and it lies in the task requirements
and cannot be changed in any way.
Let's look at an example of a Fully Connected Neural Network (or Dense Network) with three linear (or dense) layers with activation functions, a five-valued input vector, and a scalar output, which can be represented as follows in the TensorFlow framework.
We import the necessary packages:

import tensorflow as tf
from tensorflow.keras.layers import Dense
Next, we set task parameters which are task requirements. Our Fully Connected
Neural Network has to receive five-valued input vector and output a scalar value:
# Task Parameters
inp_dim = 5
out_dim = 1
Since we have three linear (or dense) layers, we can specify the output_dimension
value for two of them. The third layer has an output_dimension value of 1 because this is
a task requirement. These values are hyperparameters:
# Hyperparameters
l1_dim = 8
l2_dim = 4
# Model
model = tf.keras.Sequential(
[
Dense(l1_dim, name = 'l1',
activation = 'sigmoid', input_dim = inp_dim),
Dense(l2_dim, name = 'l2',
activation = 'relu'),
Dense(out_dim, name = 'l3'),
]
)
model.build()
Total params: 89
The model shown in Listing 2-1 has 2 hyperparameters and 89 model parameters.
Hyperparameters usually directly affect the number of model parameters. Indeed,
in Listing 2-1, the l1_dim and l2_dim hyperparameters set the dimensions of weight
matrices and bias vectors. Figure 2-1 illustrates the hyperparameter impact on the FCNN
model architecture and its parameters.
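As a quick check of that count: the first dense layer maps 5 inputs to l1_dim = 8 units and has 5 × 8 weights + 8 biases = 48 parameters; the second maps 8 to l2_dim = 4 and has 8 × 4 + 4 = 36; the output layer maps 4 to 1 and has 4 × 1 + 1 = 5. The total is 48 + 36 + 5 = 89 parameters, which matches the model summary, and changing l1_dim or l2_dim changes this count directly.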
Hyperparameters can be divided into the following types:

• Layer hyperparameter
• Training hyperparameter
• Feature hyperparameter
• Design hyperparameter
Layer Hyperparameter
Almost all layers of a deep learning model have configurable parameters. For example:

• The number of filters and the kernel size of a convolutional layer
• The output dimension of a linear (dense) layer
• The dropout rate of a dropout layer
Training Hyperparameter
The training process is an integral part of the model architecture. Each model generates
a multidimensional loss function surface. The model training process tries to find the
best local minima on the loss function surface. The training process parameters can
drastically affect trained model performance.
The most common example is learning rate tuning. Most training algorithms use gradient descent as the main idea behind model training. The gradient descent concept means that a transition vector is calculated for each point on the loss function surface, and the length of this vector is determined by the learning rate parameter. A learning rate that is too high can lead to a gradient descent explosion and a complete inability to find an acceptable local minimum on the loss function surface. At the same time, a learning rate that is too low stops the training process at too high a point on the surface and does not allow model parameters to reach a lower point. Figure 2-2 demonstrates the learning rate problem.
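A minimal stand-alone illustration of this effect (not from the book) on the one-dimensional loss f(w) = w², whose gradient is 2w:

def gradient_descent(lr, steps=20, w0=1.0):
    # plain gradient descent on f(w) = w**2
    w = w0
    for _ in range(steps):
        w = w - lr * 2 * w
    return w

print(gradient_descent(lr=1.5))   # too high: |w| doubles every step, the descent explodes
print(gradient_descent(lr=1e-4))  # too low: w barely moves from the starting point
print(gradient_descent(lr=0.1))   # reasonable: w approaches the minimum at 0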
The most common training hyperparameters are:

• Learning rate
• Batch size
Feature Hyperparameter
Feature hyperparameter affects dataset preprocessing methods. The data structure
in the input dataset can significantly improve the model’s performance, especially in
natural language processing (NLP) problems. But transformations of the input dataset
do not always improve the model’s performance, so you often have to “play” with various
feature preprocessing techniques to reach the best results.
Let’s examine a dataset that contains movie reviews data. This dataset includes the
following features:
And we are solving the classical binary classification problem, that is, we have to
determine whether the review is negative or positive. Then we can apply the following
preprocessing, which is shown in Figure 2-3.
The dataset preprocessing shown in Figure 2-3 has the following feature
hyperparameters: use Normalization, use Weekend labeling, and use Stop words removal.
Let’s describe their meanings:
• use Normalization:
The date does not carry any information for the neural network.
But an extra datetime labeling might help. For example, reviews
left on holidays can be positive more often because people are
in a good mood. Then we can use the weekend labeling method,
which will label each date if it is a holiday or a weekend. Date
series can then be converted from 2021-11-05, 2021-11-06,
2021-11-07, … to 0, 1, 1, ….
Design Hyperparameter
Design hyperparameters have a direct impact on the choice of neural network architecture. Their values control the choice of neural network layers and the connections between them.
Figure 2-4 shows design hyperparameters.
The design hyperparameter search shown in Figure 2-4 has the following design
hyperparameters: use Dropout, use MaxPool, and Activation function. And they affect the
model design the following way:
• use Dropout:
• use MaxPool:
• Activation function:
• Linear layer
• Dropout layer
• ReLU activation
• Linear layer
Search Space
Say we have determined the model’s hyperparameters, which will need to be optimized.
Next, we must define a search space for each of the hyperparameters. Determining
the search space requires some experience and intuition. You must understand that
the larger the search space, the longer the experiment, and the more difficult it is to find a suitable solution. Therefore, it is pointless to specify a huge number of values in the search space. For example, if l1 is a hyperparameter that specifies the dimension of a linear layer (tensorflow.keras.layers.Dense(l1) or torch.nn.Linear(out_features = l1)), then you don't need to set the search space for the hyperparameter to [1, 2, 3, ..., 999, 1000]. For the hyperparameter l1, values expressing powers of two (2^n) are more suitable: [4, 8, 16, ..., 256], because the representation size of linear layers matters proportionally, not additively (if l1 = 256 performed poorly, it is highly likely that l1 = 256 + 8 will show the same result). A reasonable choice of hyperparameters can significantly reduce the time of the experiment without losing its quality. Before specifying a search space, you can perform manual exploration to determine which hyperparameter values have the most impact on model performance.
The search space for the HPO problem is defined by specifying a set of possible values for each of the hyperparameters. NNI provides the following sampling strategies to define a hyperparameter search space: choice, randint, uniform, quniform, loguniform, qloguniform, normal, qnormal, lognormal, and qlognormal.
choice
{"_type": "choice", "_value": options}
Choice sampling strategy allows you to manually specify a list of values that a
hyperparameter can take. It can be a list of numbers and strings. For example:
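An illustrative entry (the names here are arbitrary examples, not from a specific listing):

"activation": {"_type": "choice", "_value": ["tanh", "sigmoid", "relu"]},
"hidden_size": {"_type": "choice", "_value": [64, 128, 256]}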
Choice sampling also supports nested search spaces. Nested choice is especially
useful when dealing with design hyperparameters. Here is the example of nested choice
sampling:
"layer1":{
"_type": "choice",
"_value": [{"_name": "Empty"},
{
"_name": "Conv", "kernel_size":
{"_type": "choice", "_value": [1, 2, 3, 5]}
},
{
"_name": "Max_pool", "pooling_size":
{"_type": "choice", "_value": [2, 3, 5]}
},
{
"_name": "Avg_pool", "pooling_size":
{"_type": "choice", "_value": [2, 3, 5]}
}
]
}
randint
{"_type": "randint", "_value": [lower, upper]}
Chooses a random integer between lower and upper.
uniform
{"_type": "uniform", "_value": [low, high]}
Chooses a random value uniformly distributed on [low, high], which can be expressed as uniform(low, high).
quniform
{"_type": "quniform", "_value": [low, high, q]}
Acts like uniform sampling but with q discretization that can be expressed as
clip(round(uniform(low, high) / q) * q, low, high). For example, for _value
specified as [1, 11, 2.5], possible values are [1, 2.5, 5, 7.5, 10, 11].
loguniform
{"_type": "loguniform", "_value": [low, high]}
Chooses random value according to loguniform distribution on [low, high] that can be
expressed as np.exp(uniform(np.log(low), np.log(high))).
qloguniform
{"_type": "qloguniform", "_value": [low, high, q]}
Acts like loguniform sampling but with q discretization that can be expressed as
clip(round(loguniform(low, high) / q) * q, low, high).
normal
{"_type": "normal", "_value": [mu, sigma]}
Chooses a random value according to the normal distribution with mean mu and standard deviation sigma, which can be expressed as normal(mu, sigma).
qnormal
{"_type": "qnormal", "_value": [mu, sigma, q]}
Acts like normal sampling but with q discretization that can be expressed as
round(normal(mu, sigma) / q) * q.
lognormal
{"_type": "lognormal", "_value": [mu, sigma]}
Chooses a random value according to the lognormal distribution, which can be expressed as exp(normal(mu, sigma)).
qlognormal
{"_type": "qlognormal", "_value": [mu, sigma, q]}
Acts like lognormal sampling but with q discretization that can be expressed as
round(exp(normal(mu, sigma)) / q) * q.
The implementation of the sampling strategies is in the nni.parameter_expressions module. You can explore search space sampling strategies manually, as shown in Listing 2-2.
import nni
from numpy.random.mtrand import RandomState
import matplotlib.pyplot as plt
space = [
    nni.quniform(0, 100, 5, RandomState(seed))
    for seed in range(20)
]
# visualize the sampled values
plt.plot(space, 'o')
plt.show()

Figure 2-5 shows the samples generated by the quniform strategy.
{
"dropout_rate":
{ "_type": "uniform", "_value": [0.1, 0.5]},
"conv_size":
{"_type": "choice", "_value": [2, 3, 5, 7]},
"layer1_hidden_size":
{"_type": "choice", "_value": [128, 512, 1024]},
"layer2_hidden_size":
{"_type": "choice", "_value": [16, 32, 64]},
"activation_function":
{"_type": "choice", "_value": ["tanh", "sigmoid", "relu"]},
"training_batch_size":
{"_type": "choice", "_value": [100, 250, 500]},
"training_learning_rate":
{"_type": "uniform", "_value": [0.0001, 0.1]}
}
Listing 2-3 demonstrates a typical search space for deep learning model:
• dropout_rate: Layer hyperparameter that defines the p parameter in
dropout layer. dropout_rate can take any value from 0.1 to 0.5.
• conv_size: Layer hyperparameter that defines the kernel size of
convolutional layer. conv_size can take any value from the list: 2,
3, 5, 7.
• layer1_hidden_size: Layer hyperparameter that defines the output
dimension of the first linear layer. layer1_hidden_size can take any
value from the list: 128, 512, 1024.
• layer2_hidden_size: Layer hyperparameter that defines the output
dimension of the second linear layer. layer2_hidden_size can take
any value from the list: 16, 32, 64.
• activation_function: Design hyperparameter that defines output
activation function. activation_function can take any value from
the list: tanh, sigmoid, relu.
• training_batch_size: Training hyperparameter that defines batch
size that will be used during training. training_batch_size can take
any value from the list: 100, 250, 500.
• training_learning_rate: Training hyperparameter that defines
learning rate. training_learning_rate can take any value from
0.0001 to 0.1.
Tuners
After defining the search space, we need to define a tuner that will explore the search
space and select trial hyperparameter combinations based on the existing results.
The tuner is set as follows in the configuration file:
tuner:
  name: <Tuner_Name>
  classArgs:
    optimize_mode: minimize
    arg1: val1
    arg2: val2
Each tuner has its own set of parameters. The only common parameter for all
tuners is optimize_mode, which marks the direction of optimization of the metric that
characterizes the model’s performance: minimize, maximize.
Table 2-2 provides the list of tuners available in NNI v2.7.
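For example, a concrete section for the built-in TPE tuner maximizing the reported metric would look roughly like this:

tuner:
  name: TPE
  classArgs:
    optimize_mode: maximize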
The Random Search Tuner requires no additional arguments:

tuner:
  name: Random
In some cases, Random Search Tuner is well suited for exploring the search
space when you need to extract information about hyperparameters’ impact on
model performance. After a random search space exploration, you can redefine
hyperparameter search space and select another tuner.
The Grid Search Tuner is configured as follows:

tuner:
  name: GridSearch

Grid Search Tuner accepts only search space variables that are generated with the choice, quniform, and randint functions.
Organizing Experiment
And so, we are all set to begin our first explorations. Let’s look at the file organization
pattern we’ll be using in this book. I also recommend that you follow the same approach.
These simple rules will help you avoid unnecessary errors when running an
experiment:
• Use a separate directory for each experiment.
trialConcurrency: 1
searchSpaceFile: search_space.json
trialCodeDirectory: .
trialCommand: python3 trial.py
tuner:
  name: GridSearch
trainingService:
  platform: local
Model class and training/testing methods are in a separate file shown in Listing 2-5.
from random import random

class DummyModel:
    def __init__(self, x, y):
        # hyperparameters passed from the NNI tuner
        self.x = x
        self.y = y
    def train(self):
        # Training here
        ...
    def test(self):
        # Test results
        return round(self.x + self.y + random() / 10, 2)
The trial script receives parameters from the NNI tuner, initializes the model, trains
it, and tests its performance. The trial script interacts with NNI using the following API
methods shown in Table 2-3.
Let’s look at the trial script pattern in Listing 2-6.
We import necessary modules:
import os
import sys
import nni
And here, we add the root directory of the code to the system path. This is done
because NNI has no idea about the structure of our code and the location of modules.
Now, we can import the classes we need from our code structure:
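The exact expression depends on how deep the trial file sits in the repository, so treat the following as a sketch; the module path of DummyModel is likewise hypothetical:

# add the repository root to the module search path so that our own packages are importable
sys.path.append(os.path.join(os.path.dirname(os.path.abspath(__file__)), '..', '..'))

from dummy_model import DummyModel  # hypothetical location of the model class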
Trial initiates the model, trains it, measures its performance, and returns the result to
NNI Tuner:
def trial(hparams):
    """
    Trial Script:
    - Initiate Model
    - Train
    - Test
    - Report
    """
    model = DummyModel(**hparams)
    model.train()
    accuracy = model.test()
    # report the final metric to NNI
    nni.report_final_result(accuracy)

if __name__ == '__main__':
    # parameters selected by the NNI tuner
    hparams = nni.get_next_parameter()
    trial(hparams)
Fine! We’ve defined different hyperparameter types, examined how to define search
spaces, studied simple tuners, and represented a pattern for creating experiments. We
are now ready to move on to real research. The following sections will examine how
HPO methods optimize the model for the specific problem and help develop a new
model design.
MNIST dataset is a set of 28×28 grayscale images. Therefore, each dataset object is a
(28, 28, 1) tensor. Let’s examine several samples of this dataset.
This may seem like a very simple task, but it is not. Recall how often you have struggled to read a digit in someone else's handwriting. Handwritten digit recognition was one of the first fundamental problems of pattern recognition.
LeNet-5 is one of the earliest neural networks used for recognizing handwritten and
machine-printed characters. The main reason behind the popularity of this model was
its straightforward architecture. It is a multilayer convolution neural network for image
classification. Abstract LeNet model design is depicted in Figure 2-11.
Our goal is to optimize the LeNet model for the handwritten digit recognition problem. The easiest place to start is layer hyperparameter optimization. Let's count the number of layer hyperparameters in the LeNet model: LeNet contains two Conv2D layers, two MaxPool2D layers, and two linear layers. Thus, we have 2×5 + 2×2 + 2×2 = 18 layer hyperparameters in the LeNet model. Let's assume that for each hyperparameter we will have a set of possible values that consists of only two elements, although, of course, many hyperparameters require more values for the flexibility of the experiment. But even this binary search space contains 2^18 = 262,144 elements. And these are just the most primitive layer hyperparameters for one of the simplest deep learning models. As the complexity of the model increases, the number of possible hyperparameters grows exponentially. Even the most advanced tuner can take a very long time to explore this search space. Therefore, we need some experience and intuition, which will allow us to select the critical hyperparameter range for each model without increasing the search space too much.
Let’s consider only the following layer hyperparameters:
• filter_size (shared by both convolutional layers, with filter_size_2 = 2 * filter_size_1)
• kernel_size (shared by both convolutional layers, with kernel_size_1 = kernel_size_2)
• out_features of the first linear layer (referred to as l1_size below)
Usually, the best values for the filter_size of the convolutional layer are presented as 2^n. Therefore, we will choose the following ones for the search space: 8, 16, 32. The kernel_size values are usually chosen in a set of 2^n + 1. And the larger the image is,
the larger kernel_size value should be chosen. The samples of the MNIST dataset are
28×28 images. These are pretty small images, so we shouldn’t choose large kernel_size
values: 2, 3, 5. The best values for l1_size are powers of two, as for filter_size. The
linear layer is applied to the tensor after the flatten layer, which means that we have to
consider the dimension of the tensor that the previous layers will produce. In the case
of the MNIST problem, we will focus on the following l1_size values: 32, 64, 128. For
MaxPool2D layer we will set the lowest possible value of pool_size, which is 2, and set
sigmoid as the activation function. Yes! That’s the function that has been used for quite a
long time as an activation function for most pattern recognition problems. We will return
to the problem of choosing an activation function in the next section.
We will use classic batch neural network training with the Adam optimizer. So let's now look at training hyperparameters. I suggest starting the study by selecting the simplest hyperparameters: batch_size and learning_rate. The best values for batch_size are expressed as 2^n and for learning_rate as 10^-n. For this case, we will choose the following: 256, 512, 1024 for batch_size and 0.01, 0.001, 0.0001 for learning_rate.
Now let’s convert the hyperparameter constraints we made earlier into NNI search
space. Listing 2-8 defines the search space for LeNet hyperparameter optimization
search space.
{
"filter_size": {
"_type": "choice", "_value": [8, 16, 32]},
"kernel_size": {
"_type": "choice", "_value": [2, 3, 5]},
"l1_size": {
"_type": "choice", "_value": [32, 64, 128]},
"batch_size": {
"_type": "choice", "_value": [256, 512, 1024]},
"learning_rate": {
"_type": "choice", "_value": [0.01, 0.001, 0.0001]}
}
Fine! In the next step, we will create TensorFlow and PyTorch implementations of the
LeNet model considering the HPO problem.
from tensorflow.keras import Model
from tensorflow.keras.layers import Conv2D, MaxPool2D, Flatten, Dense
from tensorflow.keras.optimizers import Adam

class TfLeNetModel(Model):
def __init__(self, filter_size, kernel_size, l1_size):
super().__init__()
self.conv1 = Conv2D(
filters = filter_size,
kernel_size = kernel_size,
activation = 'sigmoid'
)
self.pool1 = MaxPool2D(pool_size = 2)
self.conv2 = Conv2D(
filters = filter_size * 2,
kernel_size = kernel_size,
activation = 'sigmoid'
)
self.pool2 = MaxPool2D(pool_size = 2)
Dense stack:
self.flatten = Flatten()
self.fc1 = Dense(
units = l1_size,
activation = 'sigmoid'
)
self.fc2 = Dense(
units = 10,
activation = 'softmax'
)
LeNet is a straightforward model which passes calculation results from one layer to
another:
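The call method itself is not reproduced above; a straightforward version consistent with the layers defined in __init__ might be:

def call(self, x):
    x = self.pool1(self.conv1(x))
    x = self.pool2(self.conv2(x))
    x = self.flatten(x)
    x = self.fc1(x)
    return self.fc2(x)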
The training method uses two training hyperparameters: batch_size and learning
rate. We use Adam optimizer with categorical cross-entropy loss function:
def train(self, batch_size, learning_rate):
"""Model Training"""
(x_train, y_train), (_, _) = mnist_dataset()
self.compile(
optimizer = Adam(learning_rate = learning_rate),
loss = 'categorical_crossentropy',
metrics = ['accuracy']
)
intermediate_cb = TfNniIntermediateResult('accuracy')
self.fit(
x_train,
y_train,
batch_size = batch_size,
epochs = 10,
verbose = 0,
callbacks = [intermediate_cb]
)
And the last method we need to define is model testing. We load the test MNIST
dataset and perform the classification by measuring its accuracy:
def test(self):
"""Testing Trained Model Performance"""
(_, _), (x_test, y_test) = mnist_dataset()
loss, accuracy = self.evaluate(x_test, y_test, verbose = 0)
return accuracy
Listing 2-10. NNI trial script with TensorFlow LeNet implementation. ch2/lenet_
hpo/tf_trial.py
import os
import sys
import nni
The trial method initializes the model, trains it, tests it, and returns the NNI metric:
def trial(hparams):
model = TfLeNetModel(
filter_size = hparams['filter_size'],
kernel_size = hparams['kernel_size'],
l1_size = hparams['l1_size']
)
model.train(
batch_size = hparams['batch_size'],
learning_rate = hparams['learning_rate']
)
accuracy = model.test()
nni.report_final_result(accuracy)
And finally, we define the main entry point for the trial:
if __name__ == '__main__':
trial(hparams)
Remember that a trial script can be executed in stand-alone mode, so you can run ch2/lenet_hpo/tf_trial.py to test its execution with custom parameters.
import nni
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from sklearn.metrics import accuracy_score

class PtLeNetModel(nn.Module):
def __init__(self, filter_size, kernel_size, l1_size):
super().__init__()
This implementation will use lazy layer initialization, so we explicitly save the l1_
size hyperparameter:
self.l1_size = l1_size
self.conv1 = nn.Conv2d(
in_channels = 1,
out_channels = filter_size,
kernel_size = kernel_size
)
self.conv2 = nn.Conv2d(
in_channels = filter_size,
out_channels = filter_size * 2,
kernel_size = kernel_size
)
# Lazy fc1 Layer Initialization
self.fc1__in_features = 0
self._fc1 = None
self.fc2 = nn.Linear(l1_size, 10)
We don’t initialize the first linear layer, but we use lazy initialization. To initialize a
linear layer, we must specify an in_features value. But this is not so simple. We need
to know the dimension of the tensor, which the previous layers will produce. To do this,
sometimes, you have to do complex calculations. It is easier to calculate the dimension
of this tensor at the first call and at this moment initialize the linear layer.
@property
def fc1(self):
if self._fc1 is None:
self._fc1 = nn.Linear(
self.fc1__in_features,
self.l1_size
)
return self._fc1
LeNet is a straightforward model which passes calculation results from one layer to
another:
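The listing's forward pass is not reproduced here; a minimal sketch consistent with the lazy fc1 initialization and the sigmoid activation chosen earlier might look like this:

def forward(self, x):
    x = torch.sigmoid(self.conv1(x))
    x = F.max_pool2d(x, 2)
    x = torch.sigmoid(self.conv2(x))
    x = F.max_pool2d(x, 2)
    x = torch.flatten(x, 1)
    # on the first call, remember the flattened size so that fc1 can be lazily created
    if self._fc1 is None:
        self.fc1__in_features = x.shape[1]
    x = torch.sigmoid(self.fc1(x))
    return self.fc2(x)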
The train_model method uses the two training hyperparameters, batch_size and learning_rate. It loads the training part of the MNIST dataset and creates an Adam optimizer:

def train_model(self, batch_size, learning_rate):
# Preparing Train Dataset
(x, y), _ = mnist_dataset()
x = torch.from_numpy(x).float()
y = torch.from_numpy(y).long()
x = torch.permute(x, (0, 3, 1, 2))
dataset_size = x.shape[0]
optimizer = optim.Adam(
self.parameters(),
lr = learning_rate
)
Vanilla PyTorch does not have built-in batch training. Therefore, we manually split
the dataset into batches and perform epoch loop and batch loop:
self.train()
for epoch in range(1, 10 + 1):
# Random permutations for batch training
permutation = torch.randperm(dataset_size)
for bi in range(1, dataset_size, batch_size):
indices = permutation[bi:bi + batch_size]
batch_x, batch_y = x[indices], y[indices]
optimizer.zero_grad()
output = self(batch_x)
loss = F.cross_entropy(output, batch_y)
loss.backward()
optimizer.step()
At the end of each epoch, we calculate the model accuracy and return it to NNI as an
intermediate result:
output = self(x)
predict = output.argmax(dim = 1, keepdim = True)
accuracy = round(accuracy_score(predict, y), 4)
print(F'Epoch: {epoch}| Accuracy: {accuracy}')
# report intermediate result
nni.report_intermediate_result(accuracy)
And the last method we need to define is model testing. We load the test MNIST
dataset and perform the classification by measuring its accuracy:
def test_model(self):
self.eval()
# Preparing Test Dataset
_, (x, y) = mnist_dataset()
x = torch.from_numpy(x).float()
y = torch.from_numpy(y).long()
x = torch.permute(x, (0, 3, 1, 2))
with torch.no_grad():
output = self(x)
predict = output.argmax(dim = 1, keepdim = True)
accuracy = round(accuracy_score(predict, y), 4)
return accuracy
Well, since the implementation of PyTorch LeNet model is ready, we can implement
the NNI trial script using Listing 2-12.
We import necessary modules and pass code root directory to system path:
Listing 2-12. NNI trial script with PyTorch LeNet implementation. ch2/lenet_hpo/pt_trial.py
import os
import sys
import nni
The trial method initializes the model, trains it, tests it, and returns the NNI metric:
def trial(hparams):
model = PtLeNetModel(
filter_size = hparams['filter_size'],
kernel_size = hparams['kernel_size'],
l1_size = hparams['l1_size']
)
model.train_model(
batch_size = hparams['batch_size'],
learning_rate = hparams['learning_rate']
)
accuracy = model.test_model()
nni.report_final_result(accuracy)
And finally, we define the main entry point for the trial:
if __name__ == '__main__':
# Manual HyperParameters
hparams = {
'filter_size': 32,
'kernel_size': 3, #5,
'l1_size': 64, #1024,
'batch_size': 512, #32,
'learning_rate': 1e-2, #1e-4,
}
trial(hparams)
Remember that a trial script can be executed in stand-alone mode, so you can run ch2/lenet_hpo/pt_trial.py to test its execution with custom parameters.
trialConcurrency: 4
searchSpaceFile: search_space.json
trialCodeDirectory: .
Uncomment the PyTorch trial line to run the experiment using the PyTorch implementation:

trialCommand: python3 tf_trial.py
# trialCommand: python3 pt_trial.py
The search space contains 3^5 = 243 elements. This is a small search space, and we can use the Grid Search Tuner here:
tuner:
  name: GridSearch
trainingService:
  platform: local
Note Duration ~ 2 hours on Intel Core i7 with CUDA (GeForce GTX 1050)
After the experiment completes, the best trial reports the following hyperparameters:

• learning_rate: 0.001
• batch_size: 256
• l1_size: 64
• kernel_size: 5
• filter_size: 32
The best trial showed a 0.9885 result. And this is an acceptable outcome. We can
assume that LeNet supplied by best hyperparameters recognizes handwritten numbers
with 98.85% accuracy on test dataset.
In Figure 2-12, we can observe the top 1% of the top trials in the
hyperparameters panel.
Figure 2-12 demonstrates that all the best results have kernel_size = 5. Otherwise,
the best results have no dependencies among its hyperparameters.
After completing a study, I like to visualize the results. We already have accuracy
metric. But it will still be interesting to glance at the images that the LeNet model could
not classify correctly. Perhaps the achieved accuracy of 98.85% is the best possible
accuracy? Maybe the test dataset contains samples that cannot be correctly classified?
Listing 2-14 displays the first nine failed predictions.
We import necessary modules:
# Best Hyperparameters
hparams = {
"learning_rate": 0.001,
"batch_size": 256,
"l1_size": 64,
"kernel_size": 5,
"filter_size": 32
}
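The model-initialization step of the listing is not reproduced here; it presumably mirrors the trial script and constructs the TensorFlow LeNet model from the best layer hyperparameters:

model = TfLeNetModel(
    filter_size = hparams['filter_size'],
    kernel_size = hparams['kernel_size'],
    l1_size = hparams['l1_size']
)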
And after that, we train the model using the best training hyperparameters:
# Model Training
model.train(
batch_size = hparams['batch_size'],
learning_rate = hparams['learning_rate']
)
# MNIST Dataset
(_, _), (x_test, y_test) = mnist_dataset()
# Predictions
output = model(x_test)
y_pred = tf.argmax(output, 1)
number_of_fails_left = 9
fails = []
for i in range(len(x_test)):
if number_of_fails_left == 0:
break
if y_pred[i] != y_test[i]:
fails.append(i)
number_of_fails_left -= 1
Figure 2-13 displays LeNet failed predictions. To be honest, the samples #1, #3, #4,
#5, #6, and #8 are really difficult to classify. I don’t think that a reader would recognize
the number 2 in sample #4. Therefore, we do not need to demand 100% accuracy from
our model. But still, I think there is room for improvement.
We are going to inject these techniques (the dropout layer and the ReLU activation) into a LeNet model to improve its performance. We can do
research using HPO that will help us find the best architecture.
Let’s introduce an activation design hyperparameter responsible for choosing the
activation function. To simplify the problem, we will use a one-for-all policy. This means
that if the activation hyperparameter has a sigmoid value, then the LeNet model will
have a sigmoid function for all activations. The same is true if activation has a relu
value. Figure 2-14 presents activation hyperparameter.
Typically, a dropout layer is inserted between linear layers. But we initially do not
know whether this technique would be effective, so we have to make possible two
variants of LeNet architecture: with dropout layer and without dropout layer. To do this,
we use the use_dropout design hyperparameter. If use_dropout is 0, then the LeNet
model does not use the dropout layer, and if use_dropout is 1, then the LeNet model
uses the dropout layer. At the same time, the dropout layer will be tested using three
different p (dropout rate) values: 0.3, 0.5, and 0.7. Figure 2-15 presents use_dropout
design hyperparameter.
Each model design works well with specific layer hyperparameters. Therefore, we
also need to include layer hyperparameters in the search space. In this experiment, we
will use the same hyperparameters we used in the previous section. But we will choose
slightly different values for them:
• filter_size: 16, 32
• kernel_size: 5, 7
• l1_size: 64, 128, 256
• batch_size: 512, 1024
• learning_rate: 0.001, 0.0001
{
"activation": {
"_type": "choice", "_value": ["sigmoid", "relu"]},
"use_dropout": {
"_type": "choice",
"_value": [
{"_name": 0},
{
"_name": 1, "rate":
{"_type": "choice", "_value": [0.3, 0.5, 0.7]}
}
]
},
"filter_size": {
"_type": "choice", "_value": [16, 32]},
"kernel_size": {
"_type": "choice", "_value": [5, 7]},
"l1_size": {
"_type": "choice", "_value": [64, 128, 256]},
"batch_size": {
"_type": "choice", "_value": [512, 1024]},
"learning_rate": {
"_type": "choice", "_value": [0.001, 0.0001]}
}
And just like in the previous section, the next step is to make TensorFlow and
PyTorch implementations of the LeNet Upgrade model.
class TfLeNetUpgradeModel(Model):
def __init__(
self,
filter_size,
kernel_size,
l1_size,
activation,
use_dropout,
dropout_rate = None
):
super().__init__()
self.conv1 = Conv2D(
filters = filter_size,
kernel_size = kernel_size,
activation = activation
)
self.pool1 = MaxPool2D(pool_size = 2)
self.conv2 = Conv2D(
filters = filter_size * 2,
kernel_size = kernel_size,
activation = activation
)
self.pool2 = MaxPool2D(pool_size = 2)
self.flatten = Flatten()
self.fc1 = Dense(
units = l1_size,
activation = activation
)
if use_dropout:
self.drop = Dropout(rate = dropout_rate)
else:
self.drop = tf.identity
self.fc2 = Dense(
units = 10,
activation = 'softmax'
)
The trial method initializes the model, trains it, tests it, and returns the NNI metric:
def trial(hparams):
use_dropout = bool(hparams['use_dropout']['_name'])
model_params = {
"filter_size": hparams['filter_size'],
"kernel_size": hparams['kernel_size'],
"l1_size": hparams['l1_size'],
"activation": hparams['activation'],
"use_dropout": use_dropout
}
if use_dropout:
model_params['dropout_rate'] = hparams['use_dropout']['rate']
model = TfLeNetUpgradeModel(**model_params)
model.train(
batch_size = hparams['batch_size'],
learning_rate = hparams['learning_rate']
)
accuracy = model.test()
nni.report_final_result(accuracy)
if __name__ == '__main__':
trial(hparams)
Remember that a trial script can be executed in stand-alone mode, so you can run ch2/lenet_upgrade/tf_trial.py to test its execution with custom parameters.
class PtLeNetUpgradeModel(nn.Module):
def __init__(
self,
filter_size,
kernel_size,
l1_size,
activation,
use_dropout,
dropout_rate = None
):
super(PtLeNetUpgradeModel, self).__init__()
if use_dropout:
self.drop = nn.Dropout(p = dropout_rate)
else:
self.drop = nn.Identity()
self.conv1 = nn.Conv2d(
1,
filter_size,
kernel_size = kernel_size
)
self.conv2 = nn.Conv2d(
filter_size,
filter_size * 2,
kernel_size = kernel_size
)
# Lazy fc1 Layer Initialization
self.fc1__in_features = 0
self._fc1 = None
self.fc2 = nn.Linear(l1_size, 10)
Listing 2-19. NNI trial script with PyTorch LeNetUpgrade implementation. ch2/
lenet_upgrade/pt_trial.py
import os
import sys
import nni
The trial method initializes the model, trains it, tests it, and returns the NNI metric:
def trial(hparams):
use_dropout = bool(hparams['use_dropout']['_name'])
model_params = {
"filter_size": hparams['filter_size'],
"kernel_size": hparams['kernel_size'],
"l1_size": hparams['l1_size'],
"activation": hparams['activation'],
"use_dropout": use_dropout
}
if use_dropout:
model_params['dropout_rate'] = hparams['use_dropout']['rate']
model = PtLeNetUpgradeModel(**model_params)
model.train_model(
batch_size = hparams['batch_size'],
learning_rate = hparams['learning_rate']
)
accuracy = model.test_model()
nni.report_final_result(accuracy)
if __name__ == '__main__':
# Manual HyperParameters
hparams = {
'use_dropout': {'_name': 1, 'rate': 0.5},
'activation': 'relu',
'filter_size': 32,
'kernel_size': 3,
'l1_size': 64,
'batch_size': 512,
'learning_rate': 1e-3,
}
trial(hparams)
Remember that a trial script can be executed in stand-alone mode, so you can run
ch2/lenet_upgrade/pt_trial.py to test its execution with custom parameters.
maxTrialNumber: 300
searchSpaceFile: search_space.json
trialCodeDirectory: .
Uncomment the PyTorch trial line to run the experiment using the PyTorch implementation.
GridSearch Tuner cannot be used for search spaces that utilize nested choice, so we
pick the Random Search Tuner:
tuner:
name: Random
trainingService:
platform: local
Note Duration ~ 3 hours on Intel Core i7 with CUDA (GeForce GTX 1050)
The best trial returned the following hyperparameter combination:
• activation: relu
• use_dropout: 1
• rate: 0.5
• filter_size: 32
• kernel_size: 5
• l1_size: 256
• batch_size: 512
• learning_rate: 0.001
Figure 2-16 demonstrates that all three best hyperparameter combinations have the
following hyperparameter values:
• activation: relu
• use_dropout: 1
which can be considered solid evidence in favor of using the ReLU and dropout
techniques. Of course, this result may seem somewhat obvious, but only because
you already know about the benefits of ReLU and dropout layers. Initially, this
fact was not so obvious and required practical evidence, which we have just provided.
Finally, let’s take a look at the MNIST database samples that the upgraded LeNet
model failed to classify. These samples are presented in Figure 2-17.
I deliberately didn’t print the correct results in Figure 2-17. Take one minute, write
down your answer for each sample, and compare it with the correct answers in the
following:
Answers. #1: 9. #2: 7. #3: 7. #4: 0. #5: 2. #6: 8. #7: 2. #8: 9. #9: 8
If you haven’t made a single mistake, then I admire you! I guessed only four
numbers. If a human has difficulties recognizing some handwritten characters, then the
neural network is already close to its performance threshold. The current result of the
upgraded LeNet model we have developed is close to the best.
This section demonstrates how new deep learning techniques can be injected into
existing architecture. We defined a search space to choose the best design combination.
HPO chose an upgraded model design that significantly improved the performance of
the original model. And this is a very simple and useful technique that will allow you to
uptune your models using the latest advances in machine learning.
From LeNet to AlexNet
In this section, we will classify samples of the “humans or horses” dataset. This dataset contains 300×300 color images, that
is, (300, 300, 3) tensors. Obviously, the image of a human or a horse is more complex
than a 28×28 grayscale image of a handwritten number. And perhaps, we will need to
evolve the LeNet model that we considered earlier. We will call it LeNet Evolution model.
Let’s look at the architecture of the LeNet model again. LeNet model design can
be divided into two components: feature extraction and decision making. Indeed, the
convolution layer stack (Conv2D → Activation → MaxPool2D → Conv2D → Activation
→ MaxPool2D → Flatten) is responsible for extracting image patterns, that is, feature
extraction. At the same time, the fully connected layer stack (Linear → Activation →
Linear → SoftMax) is responsible for selecting particular patterns to classify an input
object. Figure 2-19 shows the areas of responsibility for each component.
Since human and horse images are more complex, we need to make the feature
extraction component more sophisticated. Two types of layer sequences are usually
responsible for extracting image patterns: Conv2D → Activation → MaxPool2D and
Conv2D → Activation. We can build an experiment that will inject different feature
extraction sequences to find the best model design to solve the “human and horses”
classification problem. In the following, we define three types of feature extraction layer
sequences adding none as an empty sequence:
• simple: Conv2D → Activation
• with_pool: Conv2D → Activation → MaxPool2D
• none: Identity
For example, with one particular choice of these sequences, the LeNet Evolution feature
extraction component may look the following way: Conv2D(kernel_size=5, filters=16) → Activation → MaxPool2D(pool_size=3)
→ Conv2D(kernel_size=3, filters=8) → Activation.
We can consider feature extraction sequences as the building blocks of the LeNet
Evolution model, as shown in Figure 2-20.
And so we can define the three design hyperparameters of the LeNet Evolution
model. Table 2-4 provides LeNet Evolution model design hyperparameters.
Since the feature extraction component of the LeNet Evolution model returns more
features than it did with the MNIST problem, we should also let the experiment create
a more advanced decision maker component. The decision maker component can be
improved by adding an extra linear layer. This is the easiest and most efficient way to
enhance a decision maker component. Since we demonstrated the effectiveness of the dropout
layer and ReLU activation in the previous section, they will also be used in the decision
maker component. Figure 2-21 demonstrates two variants of the decision maker
component.
The design of the decision maker component will be determined by the following
hyperparameters shown in Table 2-5.
In this experiment, we are not just looking for the best hyperparameters, but we are
trying to create a new architecture of the deep learning model based on the principles
of the original LeNet model. Here, we try not just to tune an existing model but also to
create a new one. The list of design hyperparameters determines a unique deep
learning model design.
Listing 2-21 defines the NNI search space of the LeNet Evolution model.
The first feature slot fe_slot_1 can be filled with one of these feature extraction
sequences:
• Conv2D(kernel_size, filters) → Activation → MaxPool2D(pool_size)
• Conv2D(kernel_size, filters) → Activation
• None
The second and third feature extraction slots (fe_slot_2, fe_slot_3) have the
same values set as fe_slot_1:
"fe_slot_2": {
"_type": "choice",
"_value": [
{"_name": "none"},
{
"_name": "simple",
"filters": {"_type": "choice", "_value": [8, 16, 32]},
"kernel": {"_type": "choice", "_value": [5, 7, 9, 11]}
},
{
"_name": "with_pool",
"filters": {"_type": "choice", "_value": [8, 16, 32]},
"kernel": {"_type": "choice", "_value": [5, 7, 9, 11]},
"pool_size": {"_type": "choice", "_value": [3, 5, 7]}
}
]
},
"fe_slot_3": {
"_type": "choice",
"_value": [
{"_name": "none"},
{
"_name": "simple",
"filters": {"_type": "choice", "_value": [8, 16, 32]},
"kernel": {"_type": "choice", "_value": [5, 7, 9, 11]}
},
{
"_name": "with_pool",
"filters": {"_type": "choice", "_value": [8, 16, 32]},
"kernel": {"_type": "choice", "_value": [5, 7, 9, 11]},
"pool_size": {"_type": "choice", "_value": [3, 5, 7]}
}
]
},
"l1_size": {
"_type": "choice", "_value": [512, 1024, 2048]},
"l2_size": {
"_type": "choice", "_value": [0, 512, 1024]},
"dropout_rate": {
"_type": "choice", "_value": [0.3, 0.5, 0.7]},
"learning_rate": {
"_type": "choice", "_value": [0.001, 0.0001]}
}
We have just defined a rather nontrivial search space. Let's hope that the result of the
experiment meets our expectations and that the resulting architecture solves the
humans or horses classification problem well. The next step is to make TensorFlow
and PyTorch implementations of the LeNet Evolution model.
class TfLeNetEvolution(Model):
def __init__(
self,
feat_ext_sequences,
l1_size,
l2_size,
dropout_rate
):
super().__init__()
layer_stack = []
layer_stack.append(Flatten())
layer_stack.append(
Dense(
units = l1_size,
activation = 'relu'
)
)
layer_stack.append(
Dropout(rate = dropout_rate)
)
layer_stack.append(
Dense(
units = 2,
activation = 'softmax'
)
)
self.seq = tf.keras.Sequential(layer_stack)
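The feature extraction part of the stack is built from feat_ext_sequences. A minimal sketch of how those configurations can be mapped to Keras layers (the 'simple', 'with_pool', and 'none' names come from the search space; the helper function itself and the relu activation are assumptions for illustration, not the book's code):
import tensorflow as tf
from tensorflow.keras.layers import Conv2D, MaxPool2D

def build_feature_extraction_layers(feat_ext_sequences):
    # Turn each feature extraction sequence into Conv2D / MaxPool2D layers
    layers = []
    for fe_seq in feat_ext_sequences:
        if fe_seq['type'] == 'none':
            continue  # empty slot
        layers.append(
            Conv2D(
                filters = fe_seq['filters'],
                kernel_size = fe_seq['kernel'],
                activation = 'relu'
            )
        )
        if fe_seq['type'] == 'with_pool':
            layers.append(MaxPool2D(pool_size = fe_seq['pool_size']))
    return layers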
As before, we use the Adam optimizer with cross-entropy as the loss function:
intermediate_cb = TfNniIntermediateResult('accuracy')
self.fit(
x_train,
y_train,
batch_size = batch_size,
epochs = epochs,
verbose = 0,
callbacks = [intermediate_cb]
)
Model testing:
def test(self):
(_, _), (x_test, y_test) = hoh_dataset()
loss, accuracy = self.evaluate(x_test, y_test, verbose = 0)
return accuracy
Since the TfLeNetEvolution model is done, we can implement the NNI trial script using
Listing 2-23.
We import the necessary modules and add the code root directory to the system path:
The trial method initializes the model, trains it, tests it, and returns the NNI metric:
def trial(hparams):
feat_ext_sequences = []
for k, v in hparams.items():
if k.startswith('fe_slot_'):
v['type'] = v['_name']
feat_ext_sequences.append(v)
Model initialization:
model = TfLeNetEvolution(
feat_ext_sequences = feat_ext_sequences,
l1_size = hparams['l1_size'],
l2_size = hparams['l2_size'],
dropout_rate = hparams['dropout_rate']
)
Here, we train the model for 50 epochs with a fixed batch_size = 16:
model.train(
batch_size = 16,
learning_rate = 0.001,
epochs = 50
)
accuracy = model.test()
nni.report_final_result(accuracy)
if __name__ == '__main__':
    # Manual HyperParameters
    # (fe_slot_1 .. fe_slot_3 entries defined as in the search space)
    hparams = {
        'l1_size': 1024,
'l2_size': 512,
'dropout_rate': .3,
'learning_rate': 0.001
}
trial(hparams)
Remember that a trial script can be executed in stand-alone mode, so you can run
ch2/lenet_to_alexnet/tf_trial.py to test its execution with custom parameters.
class PtLeNetEvolution(pl.LightningModule):
def __init__(
self,
feat_ext_sequences,
l1_size,
l2_size,
dropout_rate,
learning_rate
) -> None:
super().__init__()
self.lr = learning_rate
self.dropout_rate = dropout_rate
self.save_hyperparameters()
The first step is to create a layer sequence for the feature extraction component
dynamically:
fe_stack = []
in_dim = 3
for fe_seq in feat_ext_sequences:
    if fe_seq['type'] == 'none':
        continue
    fe_stack.append(
        nn.Conv2d(
            in_dim,
            fe_seq['filters'],
            kernel_size = fe_seq['kernel']
        )
    )
if fe_seq['type'] == 'with_pool':
fe_stack.append(
nn.MaxPool2d(
kernel_size = fe_seq['pool_size']
)
)
fe_stack.append(nn.ReLU())
in_dim = fe_seq['filters']
self.fe_stack = nn.Sequential(*fe_stack)
The next step is to create a layer sequence for the decision maker component:
@property
def fc1(self):
if self._fc1 is None:
self._fc1 = nn.Sequential(
nn.Linear(
self.fc1__in_features,
self.hparams['l1_size']
),
nn.ReLU(),
nn.Dropout(self.dropout_rate)
)
return self._fc1
def configure_optimizers(self):
return torch.optim.Adam(
self.parameters(),
lr = self.lr
)
The following method performs the training process on the training dataset and tests
the trained model on the test dataset:
trainer = pl.Trainer(
max_epochs = epochs,
checkpoint_callback = False
)
Model training:
trainer.fit(self, train_loader)
output = self(x_test)
predict = output.argmax(dim = 1, keepdim = True)
accuracy = round(accuracy_score(predict, y_test), 4)
return accuracy
Since the PtLeNetEvolution model is done, we can implement the NNI trial script in the same way.
We import the necessary modules and add the code root directory to the system path:
The trial method initializes the model, trains it, tests it, and returns the NNI metric:
def trial(hparams):
feat_ext_sequences = []
for k, v in hparams.items():
if k.startswith('fe_slot_'):
v['type'] = v['_name']
feat_ext_sequences.append(v)
Model initialization:
model = PtLeNetEvolution(
feat_ext_sequences = feat_ext_sequences,
l1_size = hparams['l1_size'],
l2_size = hparams['l2_size'],
dropout_rate = hparams['dropout_rate'],
learning_rate = hparams['learning_rate']
)
Next, we train the model for 50 epochs with a fixed batch_size = 16 and test it in
the same method:
accuracy = model.train_and_test_model(
batch_size = 16,
epochs = 50
)
nni.report_final_result(accuracy)
if __name__ == '__main__':
    # Manual HyperParameters
    # (fe_slot_1 and fe_slot_2 entries defined similarly)
    hparams = {
        'fe_slot_3': {
'_name': 'with_pool',
'filters': 8,
'kernel': 5,
'pool_size': 3
},
'l1_size': 1024,
'l2_size': 512,
'dropout_rate': .3,
'learning_rate': 0.001
}
trial(hparams)
Remember that a trial script can be executed in stand-alone mode, so you can run
ch2/lenet_to_alexnet/pt_trial.py to test its execution with custom parameters.
maxTrialNumber: 400
searchSpaceFile: search_space.json
trialCodeDirectory: .
Uncomment the PyTorch trial line to run the experiment using the PyTorch implementation.
GridSearch Tuner cannot be used for search spaces that utilize nested choice, so we
pick the Random Search Tuner:
tuner:
name: Random
trainingService:
platform: local
Note Duration ~ 18 hours on Intel Core i7 with CUDA (GeForce GTX 1050)
Best trial hyperparameters returned by the experiment are listed in Table 2-6.
• fe_slot_1: with_pool (filters: 32, kernel: 7, pool_size: 5)
• fe_slot_2: with_pool (filters: 8, kernel: 11, pool_size: 5)
• fe_slot_3: simple (filters: 8, kernel: 7)
• l1_size: 1024
• l2_size: 512
• dropout_rate: 0.3
• learning_rate: 0.0001
The best trial demonstrated a 0.9941 accuracy on test dataset, which is an excellent
result. We have indeed managed to build a model that distinguishes complex colored
objects with a very high degree of accuracy. This is a good development! The reader
may wonder: Why is this section called From LeNet to AlexNet? Well, it’s time to answer
it. AlexNet is the name of the convolutional neural network architecture that won the
2012 ImageNet image recognition competition (ILSVRC). AlexNet classified images into 1000 different
classes. At that time, it was a pretty advanced deep learning model. Let's now compare
three models: the LeNet model, the model we constructed in this section using HPO
techniques (LeNet Evolution), and the AlexNet model.
Figure 2-22 shows that the model we built in this section for humans and horses
classification is somewhere between the original LeNet model and the AlexNet model.
Our model shows a remarkable test result of 99.41% accuracy. But most importantly, it
was built entirely automatically with the help of HPO techniques and the NNI tool! We
did not do any complex calculations or manual analysis. We have just constructed
a flexible LeNet Evolution model whose architecture depended on the passed
hyperparameters. And as a result, we got a unique model that is fully adapted to solving
a specific task. These results confirm the promise of the HPO approach to solving deep
learning problems.
Summary
In this chapter, we started the HPO study. We studied how to create NNI experiments
and solve practical problems. We managed to optimize the original LeNet model for
handwritten digit recognition, upgrade the LeNet model using ReLU and dropout
techniques, and construct a new complex color pattern recognition model based on
the existing LeNet model. The results that we have obtained demonstrate the promise
of AutoDL. In the next chapter, we will continue to study HPO and dive into more
advanced NNI usage for Hyperparameter Optimization problems.
CHAPTER 3
Hyperparameter Optimization Under Shell
In the previous chapter, we saw that simple HPO techniques could produce very
impressive results. Hyperparameter Optimization not only optimizes a specific model
for a dataset but can even construct new architectures. But the fact is that we have used
an elementary set of tools for HPO tasks so far. Indeed, up to this point, we have only
used the primitive Random Search Tuner and Grid Search Tuner. We learned from the
previous chapter that search spaces can contain millions or even hundreds of millions of
parameter combinations. And if we had unlimited time, we could always use the Grid Search Tuner.
But unfortunately, this approach is not applicable in reality. We need Tuners that strike
a good balance between speed and quality in finding the best hyperparameters. Another
helpful technique is Early Stopping algorithms. Early Stopping algorithms analyze the
model training process based on intermediate results and decide whether to continue
training or stop it to save time.
This chapter studies various HPO Tuners and describes their basic features. We
will explore Early Stopping algorithms, which speed up the experiment by stopping
trials with an unpromising training process. We will also consider creating a custom HPO
Tuner for a particular task. This chapter will greatly enhance the practical application of
the Hyperparameter Optimization approach.
Tuners
We begin this chapter by examining the various HPO Tuners. As you remember, Tuner
receives metrics from Trial after evaluating a particular search space parameter.
Based on the existing result history of all completed Trials, Tuner decides which
hyperparameter configuration to test next. The main task of the Tuner is to find the best
configuration using as few trials as possible.
I deliberately do not give an analytical formula for this function. black_box_f1 is just
a black-box function, and we know nothing about its internal logic. In real life, black-box
functions have the following properties: we do not know their internal structure or analytical form, we can only evaluate them point by point, and each evaluation can be expensive.
All black-box functions that we will examine in this chapter satisfy these
properties. But anyway, we can cheat a little and plot this function:
if __name__ == '__main__':
scatter_plot(black_box_f1, [-10, 10], [-10, 10])
Figure 3-1 shows that the red area is where the black-box function black_box_f1
reaches its maximum values.
Note In this chapter, we will examine only problems of finding the maxima of
the black-box function f. The problem of finding the maxima of the function f is
equivalent to the problem of finding the minima of the function -f.
The choice of a Tuner is therefore of great importance. Let's take a look at
how Random Search Tuner explores the black_box_f1 function. In Listing 3-2, we implement
an embedded experiment to visualize the trial parameters that the tuner selects
during the experiment.
The search space for black_box_f1 contains all integer pairs in [-10, 10] × [-10, 10].
There are 441 elements in the search space.
search_space = {
"x": {"_type": "quniform", "_value": [-10, 10, 1]},
"y": {"_type": "quniform", "_value": [-10, 10, 1]}
}
experiment = Experiment('local')
experiment.config.experiment_name = 'Random Tuner'
experiment.config.trial_concurrency = 4
experiment.config.max_trial_number = 100
experiment.config.search_space = search_space
experiment.config.trial_command = 'python3 trial.py'
experiment.config.trial_code_directory = Path(__file__).parent
experiment.config.tuner.name = 'Random'
http_port = 8080
experiment.start(http_port)
while True:
if experiment.get_status() == 'DONE':
When the experiment is finished, we display all the trials that were created during
the experiment:
search_data = experiment.export_data()
trial_params = [trial.parameter for trial in search_data]
search_metrics = experiment.get_job_metrics()
input("Experiment is finished. Press any key to exit...")
break
Let’s examine all the trials that Random Search Tuner generated during the
experiment in Listing 3-2.
Figure 3-2 shows that the trials generated by Random Search Tuner are a simple
random scattering of dots. In some cases, a dot (trial) may successfully fall into the area
of maximum values, but in many cases, the area of maximum values is not explored
properly.
Let’s now look at the trials that the Grid Search Tuner generates by exploring the
black_box_f1 function in Listing 3-3.
experiment.config.tuner.name = 'GridSearch'
The Grid Search experiment looks much like the Random Search experiment in
Listing 3-2. We only use Grid Search Tuner here:
Figure 3-3 demonstrates the trials generated by the Grid Search Tuner.
We see that the Grid Search Tuner simply iterates through all the values in the
search space in a particular order. This approach can be helpful when dealing with a
small search space when it is possible to iterate over all the values in the search space.
Otherwise, the trials generated by Grid Search Tuner may not even get close to the area
of maximum values of the black-box function.
The main problem with Random and Grid tuners is that they don’t interact with
their trial results in any way. They do not have any “memory” that would allow them to
highlight promising areas in the search space and concentrate their search on them. We
will now begin to study tuners that have “memory” and which can explore the search
space more efficiently.
Evolution Tuner
Evolution Tuner search is based on the principles of natural evolution. It implements
two fundamental principles of evolution: selection and mutation. Evolution Tuner
initializes a population of a specific size. Each population individual represents a
particular set of parameters in the search space. Each individual has a fitness property
that indicates the Trial result. We say that individual A is better than individual B if the fitness of A is greater than the fitness of B.
Evolution Tuner takes the best individual from a random pair of individuals and
mutates it randomly by replacing the value of its parameter with another value from
the search space. After that, the mutated individual replaces the original one, and the
process is repeated again. Figure 3-4 illustrates this search principle.
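To make the selection and mutation steps concrete, here is a minimal sketch of one evolution step for a dictionary-style search space (the representation of individuals and the helper itself are assumptions for illustration, not NNI's implementation):
import random

def evolution_step(population, search_space):
    # population: list of dicts {'params': {...}, 'fitness': float}
    # search_space: dict mapping a parameter name to its list of candidate values
    a, b = random.sample(population, 2)
    parent = a if a['fitness'] > b['fitness'] else b      # selection
    mutant = {'params': dict(parent['params']), 'fitness': None}
    key = random.choice(list(search_space))               # mutation of one parameter
    mutant['params'][key] = random.choice(search_space[key])
    population[population.index(parent)] = mutant         # mutant replaces the original
    return mutant['params']                               # configuration for the next trial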
One of the big problems with Evolution Tuner is that mutation doesn’t always
improve an individual’s fitness. Mutation operation only executes a random change in
parameter values, and usually, mutation degrades an individual’s performance.
# config.yml
tuner:
name: Evolution
classArgs:
optimize_mode: maximize
population_size: 100
Evolution Tuner supports all search space types: choice, choice(nested), randint,
uniform, quniform, loguniform, qloguniform, normal, qnormal, lognormal, and
qlognormal.
Let’s take a look at Evolution Tuner in action optimizing black_box_f1 function in
Listing 3-4.
(Full code is provided in the corresponding file: ch3/tuners/evolution_tuner/
run_experiment.py.)
Setting population size:
population_size = 8
experiment.config.tuner.name = 'Evolution'
experiment.config.tuner.class_args['optimize_mode'] = 'maximize'
experiment.config.tuner.class_args['population_size'] = population_size
Unlike Grid Search Tuner and Random Search Tuner, an Evolution Tuner has
“memory.” That is why it is more interesting to analyze the progress of the search process.
We will show the history of the allocation of trial parameters in the search space by
generations:
# Event Loop
while True:
if experiment.get_status() == 'DONE':
search_data = experiment.export_data()
trial_params_chunks = [
trial_params[i:i + 25]
for i in range(0, len(trial_params), 25)
]
Figure 3-5 shows that the allocation of trial parameters is close to a random distribution.
In Figure 3-6, we see that most of the individuals are already in the red zone, which
means that the population is moving smoothly toward the highest values of the function.
And the last generation shown in Figure 3-7 has at least one individual at the top of the
red zone.
Note Not all built-in NNI tuners support random seed setting. Hence, the
experiments are not reproducible. Therefore, the results you get on your local
machine may differ from those shown in this chapter. However, the general
behavior of tuners remains the same on all machines, so the results of the same
tuner on the same search space are similar.
We can consider the Evolution Tuner as a directed random search. It is slightly better
than random search but still has many problems due to the highly random nature of this
algorithm. Evolution Tuner usually requires many trials but is often selected due to its
simplicity.
Anneal Tuner
Anneal Tuner is based on the Simulated Annealing algorithm. Simulated Annealing is a
method for solving optimization problems. The algorithm models the physical process
of heating a material and slowly lowering the temperature to decrease defects, thus
minimizing the system energy. Anneal Tuner uses randomness as part of the search
process, like the Evolution Tuner.
The annealing algorithm consists of the following steps:
1. The annealing algorithm selects a random element X in the search space, and the f(X) value is calculated.
2. A random element X’ in the neighborhood of X is selected, and f(X’) is calculated.
3. If f(X’) is greater than f(X), the algorithm moves to X’: X ← X’.
4. Otherwise, a random number r in [0, 1) is generated together with the following quantities:
• Δ = f(X) − f(X’)
• σ, the standard deviation of all values explored during the search, f(X1), ..., f(Xn), multiplied by the degradation ratio c^i, where c is a positive value lower than 1 and i is the number of the iteration: σ = c^i · std([f(X1), …, f(Xn)])
5. Next, we compare r and e^(−Δ/σ), where e is the exponential.
5a. If r < e^(−Δ/σ), then the algorithm degrades from X to X’: X ← X’. This is done hoping that it will be possible to reach a new peak in the next iteration and the transition to X’ is only an intermediate step. We can consider this as an exploration step that explores an area near X in the search space.
5b. If r ≥ e^(−Δ/σ), the algorithm stays at X.
The closer f(X) is to f(X’), the more likely an exploration step would be taken.
Steps from 2 to 5 are repeated n times. Figure 3-8 demonstrates the annealing
algorithm flow.
The essence of the annealing algorithm is to get to the “hills” of the surface of the
black-box function f and study the area of this “hill.” In some cases, the algorithm may
descend from “hills” hoping to climb to a higher one. The disadvantage of this algorithm
is that it cannot cover large distances between different “hills” of the surface of the
function f. Figure 3-9 demonstrates the annealing algorithm in action.
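A minimal sketch of the acceptance rule under the definitions above (a simplified illustration, not NNI's implementation):
import math
import random

def anneal_accept(f_current, f_candidate, sigma):
    # Maximization: a better candidate is always accepted
    if f_candidate >= f_current:
        return True
    # Otherwise "degrade" with probability e^(-delta / sigma),
    # where delta = f(X) - f(X') and sigma shrinks with each iteration
    delta = f_current - f_candidate
    return random.random() < math.exp(-delta / sigma)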
# config.yml
tuner:
name: Anneal
classArgs:
    optimize_mode: maximize
Anneal Tuner supports all search space types: choice, choice(nested), randint,
uniform, quniform, loguniform, qloguniform, normal, qnormal, lognormal, and
qlognormal.
Listing 3-5 illustrates another black-box function holder_function, based on
Holder’s function. We will use it for testing Anneal Tuner performance.
if __name__ == '__main__':
scatter_plot(holder_function, [-10, 10], [-10, 10])
Figure 3-10 shows the surface of holder_function function. This surface is much
more challenging to explore. It has many hills that are evenly spaced on the surface. The
highest peaks are in the left corners.
experiment.config.tuner.name = 'Anneal'
experiment.config.tuner.class_args['optimize_mode'] = 'maximize'
After the experiment is completed, we can analyze the progress of the search for
Anneal Tuner.
In Figure 3-11, we can see that Anneal Tuner is starting to study the “hills” in the
lower-left corner.
Figure 3-12 demonstrates that the second generation of trials is completely focused
on the two “hills” in the lower-left corner.
And as we can see in Figure 3-13, Anneal Tuner is completely concentrated on
exploring only one “hill.” Anneal Tuner found a local maximum but could not find the global
maximum in the bottom-left corner, which lies close to the solution Anneal Tuner found.
Anneal Tuner and Evolution Tuner are variants of directed random search. They
are intuitive and straightforward, but they may not always explore the search space
effectively. Let’s study more advanced tuners based on the Bayesian optimization
approach.
The next step is to create a probability function p(y|x) based on the data: (x1, f(x1)),
(x2, f(x2)), (x3, f(x3)). p(y|x) is called a “surrogate” for the objective (or black-box) function.
The surrogate function determines the probability distribution of the objective (or black-
box) function for any element xi in the search space. This means that for any xi, we can
say that with a p probability, the value f(xi) = yi lies in the (a, b) interval. This concept is
demonstrated in Figure 3-15.
Having the surrogate function p(y|x), we can extrapolate it over the whole search space.
Figure 3-16 gives a visual description of the surrogate model:
Figure 3-16. Surrogate model for three trials: (x1, f(x1)), (x2, f(x2)), (x3, f(x3))
Based on the constructed surrogate model, the SMBO algorithm makes its prediction
regarding the potential maxima of the black-box function. The next goal of the algorithm
is to find a higher value of the black-box function than the current maximum value
f(x2). The following trial parameters are determined using the Expected Improvement
function:
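One standard way to write Expected Improvement with respect to the current best value y* (here y* = y2 = f(x2)) is:
EIy*(x) = ∫ max(y − y*, 0) · p(y | x) dy
that is, the amount by which f(x) is expected to exceed the current best value under the surrogate model p(y|x).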
If we assume that f(x2) = y2, then SMBO will choose x4 as the next trial parameter if
EIy2(x) is maximal at x4. Figure 3-17 illustrates the next trial selection by the
SMBO algorithm.
After selecting x4 as the next trial value, we evaluate f(x4) and rebuild the
surrogate model with respect to the new data: (x1, f(x1)), (x2, f(x2)), (x3, f(x3)), (x4, f(x4)), as
shown in Figure 3-18.
Figure 3-18. Surrogate model for four trials: (x1, f(x1)), (x2, f(x2)), (x3, f(x3)),
(x4, f(x4))
SMBO aims to converge the surrogate model to the objective function with more
data, which these approaches do by continually updating the surrogate probability
model after each objective function evaluation. SMBO Tuners are efficient because they
choose the next parameters in an informed way.
SMBO Tuner performs the following steps in a cycle until the maximum number of
trials is reached: it builds (or updates) the surrogate model p(y|x) on the history of completed trials, selects the next trial parameters by maximizing the Expected Improvement under the surrogate model, evaluates the black-box function on these parameters, and adds the result to the history.
This is the framework for all SMBO Tuners. The only difference between them is
the p(y|x) function definition. Different SMBO Tuners have different approaches to
estimating the probability function p(y|x) based on a historical dataset. This chapter
will cover the following SMBO Tuners: Tree-Structured Parzen Estimator Tuner and
Gaussian Process Tuner.
Tree-Structured Parzen Estimator Tuner
1. At the start of the experiment, the TPE Tuner executes several trials with randomly selected parameters (controlled by the n_startup_jobs parameter described below).
2. Next, the Tuner sorts the executed trials by their values and
divides them into “good” and “bad” groups based on some
quantile - γ. The first group, “good” group, contains trials that gave
the best results and the “bad” one, all other trials.
Figure 3-19 depicts the TPE model after the first two steps.
Figure 3-20 illustrates the algorithm of selecting the next trial parameter.
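In essence, TPE models the density of parameters in the “good” group and in the “bad” group and proposes the candidate where the good-to-bad density ratio is the highest. A minimal one-dimensional sketch of this idea (a simplified illustration that assumes several completed trials in each group; it is not NNI's implementation):
import numpy as np
from scipy.stats import gaussian_kde

def tpe_suggest(history, gamma = 0.25, n_candidates = 24):
    # history: list of (x, y) pairs from completed trials (maximization)
    xs, ys = map(np.array, zip(*history))
    cut = np.quantile(ys, 1 - gamma)
    good, bad = xs[ys >= cut], xs[ys < cut]
    l, g = gaussian_kde(good), gaussian_kde(bad)    # densities of "good" and "bad" parameters
    candidates = l.resample(n_candidates).ravel()   # sample candidates from the "good" density
    return candidates[np.argmax(l(candidates) / g(candidates))]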
# config.yml
tuner:
name: TPE
classArgs:
optimize_mode: maximize
seed: 12345
tpe_args:
constant_liar_type: 'mean'
n_startup_jobs: 10
n_ei_candidates: 20
linear_forgetting: 100
prior_weight: 0
gamma: 0.5
• tpe_args.constant_liar_type:
Type: str
Default: 'best'
Determines how pending (still running) trials are treated when parameters are suggested concurrently: their unknown results are temporarily replaced with a “lie” derived from the observed results (e.g., their best or mean value).
• tpe_args.n_startup_jobs:
Type: int
Default: 20
The number of initial trials whose parameters are generated purely at random, before the TPE model is used.
• tpe_args.n_ei_candidates:
Type: int
Default: 24
The number of candidate parameters sampled when evaluating the Expected Improvement for the next trial.
• tpe_args.linear_forgetting:
Type: int
Default: 25
TPE lowers the weights of old trials. This parameter controls how
many trials it takes for an old result's weight to start to decay.
• tpe_args.prior_weight:
Type: float
Default: 1.0
Determines the weight of the prior (the original search space distribution) relative to the observed trial configurations.
• tpe_args.gamma:
Type: float
Default: 0.25
The quantile that splits the trial history into the “good” and “bad” groups.
Note TPE Tuner configuration parameters mentioned above are only valid from
NNI version 2.6. They will not work on previous versions.
TPE Tuner supports all search space types: choice, choice(nested), randint,
uniform, quniform, loguniform, qloguniform, normal, qnormal, lognormal, and
qlognormal.
We can see from Listing 3-7 how TPE Tuner performs optimizing holder_function.
(Full code is provided in the corresponding file: ch3/tuners/tpe_tuner/
run_experiment.py.)
Setting the TPE Tuner for the Experiment:
experiment.config.tuner.name = 'TPE'
experiment.config.tuner.class_args['optimize_mode'] = 'maximize'
experiment.config.tuner.class_args['seed'] = 0
experiment.config.tuner.class_args['tpe_args'] = {
'n_startup_jobs': 20,
'gamma': 0.5
}
After the experiment is completed, we can analyze the progress of the search for
TPE Tuner.
Figure 3-21 shows that the distribution of points is more like a random scattering,
which makes sense because, in the tuner setup, we specified 'n_startup_jobs': 20,
which means that the first 20 trials will be completely random.
But as we see in Figure 3-22, the TPE Tuner finds the global maxima of
holder_function, thanks to a probabilistic exploration model.
The TPE Tuner is intuitively clear, based on a solid probability ground, and has a
good balance between exploration and exploitation policy. Another nice feature is that it
supports choice(nested) search type, which can be critical in some research.
Gaussian Process Tuner
GP Tuner is an SMBO approach that uses a Gaussian Process as the surrogate model. It is configured as follows:
# config.yml
tuner:
name: GPTuner
classArgs:
optimize_mode: maximize
utility: 'ei'
kappa: 5.0
xi: 0.0
nu: 2.5
alpha: 1e-6
cold_start_num: 10
selection_num_warm_up: 100000
selection_num_starting_points: 250
• utility:
Type: str
Default: 'ei'
The utility (acquisition) function to use: 'ei' (Expected Improvement), 'ucb' (Upper Confidence Bound), or 'poi' (Probability of Improvement).
• kappa:
Type: float
Default: 5
Used by the ucb utility function. The bigger the kappa is, the more
exploratory the tuner will be.
• xi:
Type: float
Default: 0
Used by the ei and poi utility functions. The bigger the xi is, the
more exploratory the tuner will be.
• nu:
Type: float
Default: 2.5
Sets the Matern kernel. The smaller the nu is, the less smooth the
approximated function will be.
• alpha:
Type: float
Default: 1e-6
Regularization parameter of the underlying Gaussian Process; larger values assume noisier observations.
• cold_start_num:
Type: int
Default: 10
The number of random trials executed before the Gaussian Process model is used.
• selection_num_warm_up:
Type: int
Default: 1e5
The number of random points sampled to warm up the search for the maximum of the utility function.
• selection_num_starting_points:
Type: int
Default: 250
The number of starting points for the local optimization of the utility function after the warm-up.
GP Tuner supports the following search space types: choice, randint, uniform,
quniform, loguniform, and qloguniform.
GP Tuner suffers a lot from parallelization issues. If you run an experiment in
concurrency mode (i.e., trial_concurrency > 1), multiple processes simultaneously
decide on their next trial candidate based on the same historical data. Therefore, different
processes are testing the same parameters at the same time. This is a big problem with
all SMBO tuners. But TPE Tuner can get around this problem with the Constant Liar
technique, while for GP Tuner, this problem remains serious. In Figure 3-23, we can see
the Trial Metric panel for GP Tuner with trial_concurrency = 8. It shows that GP Tuner
contains chunks of the same trials, which does not speed up the process of exploring the
search space in any way.
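The idea of the Constant Liar technique is simple: while some trials are still running, the tuner pretends that they have already returned a fixed value (the “lie”, e.g., the best or mean of the observed results), so that the next concurrent suggestion is made against a history that already accounts for those pending points. A minimal sketch of this idea (an illustration, not NNI's code):
def history_with_liars(completed, pending, liar_type = 'best'):
    # completed: list of (params, result); pending: params of still running trials
    results = [r for _, r in completed]
    lie = max(results) if liar_type == 'best' else sum(results) / len(results)
    # Each pending trial temporarily "reports" the lie value
    return completed + [(params, lie) for params in pending]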
experiment.config.trial_concurrency = 1
experiment.config.tuner.name = 'GPTuner'
experiment.config.tuner.class_args['optimize_mode'] = 'maximize'
GP Tuner shows excellent results. Figure 3-24 shows the coordinates of all trials
during the experiment. GP Tuner found both global maxima and evenly explored the
entire search space.
GP Tuner aims to find a suitable solution in a small number of black-box function evaluations.
GP Tuner works with the surrogate model instead of the black-box function itself, using
the conjugate gradient method to find the candidates with the highest expected improvement.
GP Tuner shows good exploratory behavior, testing out areas that seem promising under
the current surrogate model.
And the answer is as follows: search spaces have a specific structure and
dependencies that allow certain Tuners to outperform others for many kinds of
problems. Therefore, if we know that we are optimizing a model for an image
classification problem, this can give us some insight into the search space structure.
Consequently, we can choose a Tuner that is more likely to show good results
than others.
There is a separate area of research in which scientists arrange battles of search
algorithms to determine the best one for a particular class of problems. Scientists use
benchmarks to estimate the characteristics of the search algorithm. The benchmark
algorithm evaluates the search algorithm several times for different search spaces. For
example, the benchmark pseudo-code might look like this:
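A minimal sketch of what such a benchmark could look like (run_experiment here is a hypothetical helper that runs one HPO experiment and returns the best metric it found):
def benchmark(tuners, search_spaces, n_runs = 10, max_trials = 100):
    scores = {tuner: [] for tuner in tuners}
    for space in search_spaces:
        for tuner in tuners:
            for _ in range(n_runs):
                # hypothetical helper: run one experiment and return its best metric
                best = run_experiment(tuner, space, max_trials)
                scores[tuner].append(best)
    # Aggregate the results, e.g., as an average score or average rank per tuner
    return {tuner: sum(v) / len(v) for tuner, v in scores.items()}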
Aggregated results from such a benchmark might look as follows:
• GP Tuner: 4.00
• Evolution: 4.22
• Anneal: 4.39
• TPE: 4.67
• Random: 5.33
Some benchmarks can last several days or even weeks. Therefore, it is always more
convenient to borrow the results obtained and published by other researchers. In many cases,
you can carry out a small study yourself, which will help determine the characteristics
of the search space structure. In any case, understanding the deep learning model
optimization problem and principles of the Search Tuner is very helpful in choosing the
right strategy for solving the HPO problem.
Custom Tuner
Built-in tuners are suitable for most tasks. But there are situations when you need to add
some custom logic to improve the quality of the HPO Experiment. Indeed, sometimes,
we may know specific properties of the search space that the built-in tuner does not take
into account. Also, the developer can implement their original idea and test it on real
problems. For such cases, NNI allows you to implement a Custom Tuner. Custom Tuner
can be used in an experiment and shared with other developers.
Tuner Internals
Each Tuner class should inherit nni.tuner.Tuner and implement the following
methods: __init__, update_search_space, generate_parameters, receive_trial_
result. Any Tuner can be implemented based on the self-describing sample presented
in Listing 3-9.
class CustomTunerSample(Tuner):
    # A minimal self-describing skeleton; the signatures follow nni.tuner.Tuner
    def __init__(self, optimize_mode = 'maximize'):
        self.optimize_mode = optimize_mode

    def update_search_space(self, search_space):
        # Called when the experiment starts (and whenever the search space is updated)
        self.search_space = search_space

    def generate_parameters(self, parameter_id, **kwargs):
        # Returns the hyperparameter configuration for the next trial
        return {}

    def receive_trial_result(self, parameter_id, parameters, value, **kwargs):
        # Called when a trial reports its final result
        pass
Custom Tuner is integrated into the experiment using the following config:
tuner:
codeDirectory: <path_to_tuner_dir>
className: <tuner_file_name>.<class_name>
experiment.config.tuner = CustomAlgorithmConfig()
experiment.config.tuner.code_directory = 'path_to_tuner_dir'
experiment.config.tuner.class_name = 'tuner_file_name.class_name'
experiment.config.tuner.class_args = {'arg': 'value'}
We can see in Figure 3-27 the surface of Ackley’s function. It has one highest hill and
several smaller hills nearby.
Before we check how NewEvolutionTuner solves the problem of finding the maxima
of Ackley’s function, we need to implement it in Listing 3-11.
Importing necessary modules:
import random
import numpy as np
from nni.tuner import Tuner
from nni.utils import (
OptimizeMode, extract_scalar_reward,
json2space, json2parameter,
)
class Individual:
    def __init__(self, x, y, param_id = None):
        self.x = x
        self.y = y
        self.param_id = param_id
        self.result = None

    def to_dict(self):
        return {'x': self.x, 'y': self.y}

class Population:
Then, we need to add a method that will return an individual by its param_id:
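A minimal sketch of the Population start-up helpers and of that lookup (the method names and bodies here are assumptions consistent with how the population is used below):
    def __init__(self):
        self.individuals = []

    def add(self, ind):
        self.individuals.append(ind)

    def get_individual_by_param_id(self, param_id):
        for ind in self.individuals:
            if ind.param_id == param_id:
                return ind
        return None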
At the start of the experiment, the Tuner will create N individuals who will not have
a trial number. We will call these individuals virgins. get_first_virgin method returns
the first virgin found in population:
def get_first_virgin(self):
for ind in self.individuals:
if ind.param_id is None:
return ind
return None
def get_population_with_result(self):
population_with_result = [ind for ind in self.individuals if
ind.result is not None]
return population_with_result
The next method returns the best individual from the whole population, that is, an
individual that has the highest result:
def get_best_individual(self):
sorted_population = sorted(self.get_population_with_result(),
key = lambda ind: ind.result)
return sorted_population[-1]
And here, we come to our primary evolution method replace_worst, which will
develop the population:
• Add the mutant of the best individual to the population instead of the
worst one.
    def replace_worst(self, param_id):
        # Remove the individual with the lowest result
        worst = sorted(self.get_population_with_result(),
                       key = lambda ind: ind.result)[0]
        self.individuals.remove(worst)
best = self.get_best_individual()
x = round(best.x + random.gauss(0, 1), 2)
y = round(best.y + random.gauss(0, 1), 2)
mutant = Individual(x, y, param_id)
self.individuals.append(mutant)
return mutant
We can start implementing the tuner after defining the Individual and Population
classes:
class NewEvolutionTuner(Tuner):
    def __init__(self, optimize_mode = 'maximize', population_size = 8):
        self.optimize_mode = OptimizeMode(optimize_mode)
        self.population_size = population_size
Next, Tuner initializes properties related to the search space it is working with:
self.search_space_json = None
self.random_state = None
self.population = Population()
self.space = None
When the Tuner starts, the update_search_space method is invoked. It generates the
Population of Random Individuals:
for _ in range(self.population_size):
params = json2parameter(self.search_space_json, is_rand, self.
random_state)
ind = Individual(params['x'], params['y'])
self.population.add(ind)
When the Experiment returns the Trial’s result, we save it to an individual object:
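A minimal sketch of that callback, assuming the standard receive_trial_result signature and the by-param_id lookup sketched earlier:
    def receive_trial_result(self, parameter_id, parameters, value, **kwargs):
        reward = extract_scalar_reward(value)
        ind = self.population.get_individual_by_param_id(parameter_id)
        if ind is not None:
            ind.result = reward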
Well, our NewEvolutionTuner is ready for action! We can launch the experiment
using the following config file shown in Listing 3-12.
y:
_type: "quniform"
_value: [-10, 10, 0.01]
trialConcurrency: 4
trialCodeDirectory: .
trialCommand: python3 trial.py
tuner:
codeDirectory: .
className: evolution_tuner.NewEvolutionTuner
trainingService:
platform: local
experiment.config.tuner = CustomAlgorithmConfig()
experiment.config.tuner.code_directory = Path(__file__).parent
experiment.config.tuner.class_name = 'evolution_tuner.NewEvolutionTuner'
experiment.config.tuner.class_args = {'population_size': 8}
$ python3 ch3/tuners/custom_tuner/run_experiment.py
Of course, developing a custom Tuner is not always an easy task. Still, the ability
to implement your own search algorithm and integrate it into the HPO process can greatly
improve the experiment results. In this section, we have provided an example of how you
can do this. If necessary, you can implement your own ideas based on the illustration given in
this section.
Early Stopping
Some parameters in the search space produce very low Trial results. And this is normal
because the Tuner may not always know in advance which areas of the search space
to explore and which not to explore. Tuner often tries parameters that give very low
results. The Trial itself is expensive because a lot of time is spent on it. For example,
training a neural network with a complex architecture on a large dataset can take
hours. And it would be helpful not to spend a lot of time on trials that show poor
results in their execution. NNI uses Early Stopping algorithms to solve that issue. Early
Stopping algorithms analyze the intermediate trial results and compare them with the
intermediate results of other trials. If the algorithm decides that the intermediate results
of the current Trial are too low, then it stops the Trial so as not to waste time on it.
Figure 3-30 explains Early Stopping approach. Trial 3 early stopped at step N
because intermediate results of this Trial were significantly worse than the other trials’
intermediate results at step N.
The deep learning training algorithms also have Early Stopping policies. A Training
Early Stopping policy stops model training when the model starts to degrade or there is no
improvement for a long time. Do not confuse Training Early Stopping with HPO Early
Stopping. They are not related in any way. Indeed, take a look at Figure 3-31. Training
progress is good, and there is no reason to stop the training. But if we compare the
training process with other trials, it is apparent that it is much worse, and the HPO Early
Stopping algorithm can stop this Trial.
And Figure 3-32 demonstrates the opposite situation. The training process begins
to degrade, and the Training Early Stopping algorithm terminates the training process.
In contrast, the HPO Early Stopping algorithm may consider the current Trial very
promising because its intermediate results are significantly superior compared to
other trials.
Please keep in mind, when designing a deep learning model and the HPO Experiment, that
Training Early Stopping and HPO Early Stopping are not correlated.
Median Stop
Median Stop is a straightforward, early stopping rule that stops a pending Trial after step
N if the Trial’s best objective value by step N is strictly worse than the median value of
the running averages of all completed trials’ objectives reported up to step N.
Median Stop algorithm can be implemented in Experiment with the following
experiment configuration:
assessor:
name: Medianstop
classArgs:
# number of warm up steps
start_step: 10
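To make the rule concrete, here is a minimal sketch of the decision it describes (an illustration, not NNI's implementation):
import statistics

def median_stop(trial_history, completed_histories, step, start_step = 10):
    # trial_history: intermediate results of the pending trial
    # completed_histories: intermediate-result lists of completed trials
    if step < start_step:
        return False                      # warm-up: never stop early
    best_so_far = max(trial_history[:step])
    running_avgs = [statistics.mean(h[:step])
                    for h in completed_histories if len(h) >= step]
    if not running_avgs:
        return False
    return best_so_far < statistics.median(running_avgs)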
Let's look at a synthetic problem to see the Median Stop algorithm in action. Say, we
have the following identity function, f: x → x, with a training progress containing 100
epochs (steps) and expressed by the following rule: (x / 10) · √epoch + r, where r is a random
variable on (-3, 3). We can characterize the function f as an identity function with
parabolic training progress. Listing 3-14 contains the implementation of function f.
def identity_with_parabolic_training(x):
history = [
max(round(x / 10, 2) * pow(h, .5) + random.uniform(-3, 3), 0)
for h in range(1, 101)
]
return x, history
Let’s visualize the training process of the following set of functions: f(0), f(10), f(20),
..., f(100).
if __name__ == '__main__':
    import matplotlib.pyplot as plt
    for x in range(0, 101, 10):
        _, history = identity_with_parabolic_training(x)
        plt.plot(history, label = f'f({x})')
    plt.ylabel('Intermediate Result')
    plt.xlabel('Epochs')
    plt.legend()
    plt.show()
Figure 3-33 shows various training curves. This plot illustrates that the lower training
curves are unpromising. Unpromising training can be stopped in advance according to
the Early Stopping algorithm.
Let’s launch an experiment that will use the Median Stop algorithm. The trial script
is defined in Listing 3-15.
Trial header with imported modules:
import os
import sys
from time import sleep
import nni
Executing Trial:
if __name__ == '__main__':
params = nni.get_next_parameter()
x = params['x']
    final, history = identity_with_parabolic_training(x)
    for intermediate in history:
        nni.report_intermediate_result(intermediate)
        sleep(0.1)  # simulate training time per epoch
    nni.report_final_result(final)
maxTrialNumber: 100
trialConcurrency: 8
trialCodeDirectory: .
trialCommand: python3 trial.py
tuner:
name: Random
assessor:
name: Medianstop
classArgs:
# number of warm up steps
start_step: 10
trainingService:
platform: local
After the experiment is completed, we can observe in Figure 3-34 that many trials
have the EARLY_STOPPED status, as expected.
Curve Fitting
Curve Fitting Assessor is an LPA (learning, predicting, assessing) algorithm that stops a
pending Trial at step N if the prediction of the final epoch’s performance is worse than
the best final performance in the trial history. Curve Fitting Assessor makes a prediction
about the final result of the Trial’s training and compares it with the completed ones.
This algorithm treats the Early Stopping task as a time series forecasting problem. If the
training prediction is pessimistic, then the algorithm stops the trial. Figure 3-35 explains
the Curve Fitting Early Stopping approach.
assessor:
name: Curvefitting
classArgs:
epoch_num: 20
start_step: 6
threshold: 0.95
gap: 1
We will not study the principles of the Curve Fitting Early Stopping algorithm in this book. You can refer to the official documentation (https://nni.readthedocs.io/en/v2.7/reference/hpo.html#nni.algorithms.hpo.curvefitting_assessor.CurvefittingAssessor) or review the paper “Speeding up Automatic Hyperparameter Optimization of Deep Neural Networks by Extrapolation of Learning Curves” (https://ml.informatik.uni-freiburg.de/wp-content/uploads/papers/15-IJCAI-Extrapolation_of_Learning_Curves.pdf).
However, there is a very small chance of stopping, too early, a trial that could give good
results. Usually, the training curves of deep learning models behave similarly, so if
a trial's intermediate results are significantly worse than those of other trials, you
probably should not expect anything good; it is better to stop the trial and move on to the
next one.
Let's convert the problem into stricter mathematical language. We need
to find a model M that consists of compositions of functions Fi ∈ {F} and maximizes
the value L(M, D), where L evaluates the performance of model M on dataset
D. Figure 3-38 formulates the optimal functional pipeline problem:
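In other words, the problem can be stated as:
M* = argmax L(M, D), where the maximum is taken over all models M = Fi1 ∘ Fi2 ∘ … ∘ Fik built as compositions of functions Fij ∈ {F}.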
Let’s study how this problem can be solved with NNI. As an example, I want to use
the classic AutoML task, which searches for the optimal pipeline of classical shallow
machine learning methods to solve a supervised learning problem. In this section, I
would like to pay tribute to classical machine learning, which increasingly gives way to
deep learning. This approach can be applied to any problem of searching the optimal
functional pipeline.
Problem
Let's examine a binary classification problem with the Gamma Telescope Dataset (https://archive.ics.uci.edu/ml/datasets/magic+gamma+telescope). This dataset contains
data of signals received by the telescope. The task is to discriminate signals caused by
primary gammas (signal) from the images of hadronic showers initiated by cosmic rays
in the upper atmosphere (background). This dataset contains 19020 instances and the
following columns:
9. fAlpha: Type: real. Angle of major axis with vector to origin (deg)
10. fDist: Type: real. Distance from origin to center of ellipse (mm)
Perhaps the reader understands something in these physical data, but I do not
understand anything about them. But that’s exactly what we need machine learning for –
to find patterns and dependencies where we cannot see them ourselves.
The dataset is located in ch3/ml_pipeline/data/magic04.data and is converted to a
supervised learning problem in Listing 3-17.
Importing modules:
import os
from sklearn.model_selection import train_test_split
import pandas as pd
import numpy as np
def telescope_dataset():
cd = os.path.dirname(os.path.abspath(__file__))
telescope_df = pd.read_csv(f'{cd}/data/magic04.data')
Dropping na values:
telescope_df.dropna(inplace = True)
telescope_df.columns = [
'fLength', 'fWidth', 'fSize', 'fConc', 'fConcl',
'fAsym', 'fM3Long', 'fM3Trans', 'fAlpha', 'fDist', 'class']
Shuffling dataset:
telescope_df = telescope_df.iloc[np.random.
permutation(len(telescope_df))]
telescope_df.reset_index(drop = True, inplace = True)
Class labeling:
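A minimal way to finish the labeling and the split (the 'g' → 1 mapping, the 80/20 split, and the return order are assumptions consistent with how telescope_dataset() is used later):
    # Assumed mapping: gamma ('g') -> 1, hadron ('h') -> 0
    telescope_df['class'] = (telescope_df['class'] == 'g').astype(int)
    X = telescope_df.drop(columns = ['class']).values
    y = telescope_df['class'].values
    # Assumed 80/20 split; returns X_train, y_train, X_test, y_test as used later
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2)
    return X_train, y_train, X_test, y_test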
Since we have identified the problem and prepared the dataset, we can begin to
determine the machine learning methods that our model will consist of.
Operators
Now let’s define the functions that will make up the functional pipeline. We will call
them Operators in the machine learning context. Machine learning operators for a
classification problem can be separated into three types:
• Selectors: Choose the most significant features (columns) from the dataset, removing dependent features. Usually reduces the input size. The selected features remain unchanged.
• Transformers: Transform the dataset features into new ones, for example, by scaling, normalizing, or decomposing them.
• Classifiers: Make the final class prediction based on the features they receive.
Figure 3-39 shows a machine learning pipeline: selector, transformer, and classifier.
class Operator:
    def __init__(self, name, clz, params = None):
        # name: operator name, clz: scikit-learn class, params: hyperparameter values
        self.name = name
        self.clz = clz
        self.params = params or {}
class OperatorSpace:
Selector list:
selectors = [
Operator('SelectFwe', SelectFwe, {
'alpha': arange(0, 0.05, 0.001).tolist()
}),
Operator('SelectPercentile', SelectPercentile, {
'percentile': list(range(1, 100))
}),
transformers = [
Operator('Binarizer', Binarizer, {
'threshold': arange(0.0, 1.01, 0.05).tolist()
}),
Operator('FastICA', FastICA, {
'tol': arange(0.0, 1.01, 0.05).tolist()
}),
classifiers = [
Operator('GaussianNB', GaussianNB),
Operator('BernoulliNB', BernoulliNB, {
'alpha': [0.01, 0.1, 1, 10]
}),
    @classmethod
    def get_operator_by_name(cls, name):
        for o in cls.selectors + cls.transformers + cls.classifiers:
            if o.name == name:
                return o
        return None
Search Space
Let's define the search space for the classifier. We assume that the classifier's operator
pipeline will have at most one selector, at most three transformers, and exactly one classifier.
Therefore, a pipeline can have from one to five operators. Please take a look at
Figure 3-40.
The pipeline has five cells. Each cell can be filled with some value from the
corresponding operator space. The cell can be empty if none value is selected. For
example, the following pipeline might be selected: Selector3 → none → Transformer1
→ none → Classifier2, which is equal to Selector3 → Transformer1 → Classifier2.
This search space definition is huge and challenging to construct manually, so we will
add a special class in Listing 3-19 that will create an operator search space definition
according to the NNI specification.
class SearchSpace:
Each cell has an operator type that can be filled by selector, transformer, and
classifier. The operator_search_space method creates a search space for each
cell type.
@classmethod
def operator_search_space(cls, operator_type):
"""
Search space for operator by `operator_type`
"""
        ss = []
        if operator_type != 'classifier':
            # A selector or transformer cell may also stay empty
            ss.append({'_name': 'none'})
        if operator_type == 'selector':
            operators = OperatorSpace.selectors
        elif operator_type == 'transformer':
            operators = OperatorSpace.transformers
        else:
            operators = OperatorSpace.classifiers
for o in operators:
row = {'_name': o.name}
for p_name, values in o.params.items():
row[p_name] = {"_type": "choice", "_value": values}
ss.append(row)
return ss
Next, we define a method build that constructs a search space of all cells according
to the NNI specification:
@classmethod
def build(cls):
return {
"op_1": {
"_type": "choice",
"_value": cls.operator_search_space('selector')
},
"op_2": {
"_type": "choice",
"_value": cls.operator_search_space('transformer')
},
"op_3": {
"_type": "choice",
"_value": cls.operator_search_space('transformer')
},
"op_4": {
"_type": "choice",
"_value": cls.operator_search_space('transformer')
},
"op_5": {
"_type": "choice",
"_value": cls.operator_search_space('classifier')
}
}
Even though the search space definition is quite large, we can print it out:
if __name__ == '__main__':
search_space = SearchSpace.build()
print(search_space)
Model
So, what do we have by now? We have an operator space and a search space. Let’s now
implement a model that converts the pipeline configuration into a real machine learning
classifier. Listing 3-20 introduces MlPipelineClassifier.
Importing modules:
class MlPipelineClassifier:
    def __init__(self, pipe_config):
        ops = []
for _, params in pipe_config.items():
# operator name
op_name = params.pop('_name')
op = OperatorSpace.get_operator_by_name(op_name)
ops.append((op.name, op.clz(**params)))
self.pipe = Pipeline(ops)
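Minimal train and score methods that simply delegate to the scikit-learn pipeline (a sketch consistent with how the model is used below; the original listing may differ):
    def train(self, X, y):
        self.pipe.fit(X, y)

    def score(self, X, y):
        # Mean accuracy of the fitted pipeline on (X, y)
        return self.pipe.score(X, y)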
Since the model is ready, let’s try to initialize it using the sample pipeline parameter
and apply it to the classification problem:
if __name__ == '__main__':
pipe_config = {
'op_1': {
'_name': 'SelectPercentile',
'percentile': 2
},
'op_2': {
'_name': 'none'
},
'op_3': {
'_name': 'Normalizer',
'norm': 'l1'
},
'op_4': {
'_name': 'PCA',
'svd_solver': 'randomized',
'iterated_power': 3
},
'op_5': {
'_name': 'DecisionTreeClassifier',
'criterion': "entropy",
'max_depth': 8
}
}
model = MlPipelineClassifier(pipe_config)
X_train, y_train, X_test, y_test = telescope_dataset()
model.train(X_train, y_train)
score = model.score(X_test, y_test)
print(score)
The model demonstrates 64% accuracy. This is definitely not the result we expect, so let's use HPO techniques to construct a better-performing model.
Tuner
Now everything is ready to start the experiment. But I would like to focus on the search
space that we use. The trial parameter is a sequence of operators, which may contain
empty operators, that is, S3(p3) → none → T1(p1) → none → C2(p2). But at the same
time, the following parameter exists too: S3(p3) → none → none → T1(p1) → C2(p2).
They are different parameters in the search space, but they generate the same classifier model.
Keep in mind that two identical operators with different parameters are not equal; that is, SelectFwe(alpha=0) is not equal to SelectFwe(alpha=0.05). Let's
customize Tuner by forbidding it to create parameters that will generate equivalent
models concerning the parameters already tried, that is, if we already tried parameter
P1 = SelectPercentile(percentile = 2) → none → Normalizer(norm='l1')
→ none → DecisionTreeClassifier(max_depth=8), then parameter P2 =
SelectPercentile(percentile = 2) → none → none → Normalizer(norm='l1')
→ DecisionTreeClassifier(max_depth=8) will not be passed to the Experiment, because the model generated by P2 is equal to the model generated by P1. Let's
create EvolutionShrinkTuner, which inherits EvolutionTuner and tracks all
executed pipelines forbidding passing the equal ones to the Experiment. We can see
EvolutionShrinkTuner implementation in Listing 3-21.
import json
from nni.algorithms.hpo.evolution_tuner import EvolutionTuner
class EvolutionShrinkTuner(EvolutionTuner):
self.registry = []
return params
The following is_valid method converts the parameter to its canonical form by removing none operators and checks whether it has already been tried: if it has, the method returns False; if not, it saves the canonical form and returns True.
self.registry.append(canonical_form)
return True
This simple technique introduces the concept of equivalence between the elements
of the search space and can significantly shrink the search space.
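In the constructor, self.registry = [] stores the canonical forms of already-tried pipelines; the overridden generation method returns only parameters that pass is_valid. A minimal sketch of is_valid, assuming the op_1…op_5 parameter layout defined earlier:

def is_valid(self, params):
    # canonical form: the sequence of non-empty operators, in order
    canonical = [
        cell for cell in (params[f'op_{i}'] for i in range(1, 6))
        if cell.get('_name') != 'none'
    ]
    canonical_form = json.dumps(canonical, sort_keys=True)
    if canonical_form in self.registry:
        # an equivalent pipeline has already been tried
        return False
    self.registry.append(canonical_form)
    return True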
Experiment
And now, we are finally ready to launch an experiment to find the optimal functional
pipeline for solving the AutoML problem. The trial script in Listing 3-22 initializes the
model, prepares datasets, trains the model, tests it, and returns model accuracy to NNI
Experiment.
(Full code is provided in the corresponding file: ch3/ml_pipeline/trial.py.)
def trial(hparams):
#Initializing model
model = MlPipelineClassifier(hparams)
model.train(X_train, y_train)
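The remaining steps of the trial function score the model and report the metric back to NNI; a minimal sketch, assuming the datasets are loaded at module level with the telescope_dataset helper used earlier:

    score = model.score(X_test, y_test)
    nni.report_final_result(score)

if __name__ == '__main__':
    trial(nni.get_next_parameter())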
experiment = Experiment('local')
experiment.config.experiment_name = 'AutoML Pipeline'
experiment.config.trial_concurrency = 4
experiment.config.max_trial_number = 500
experiment.config.search_space = SearchSpace.build()
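The configuration also needs the trial command and the trial code directory (these lines are omitted in the excerpt); a typical setup, assuming the trial script from Listing 3-22 is named trial.py and lives next to this file:

experiment.config.trial_command = 'python3 trial.py'
experiment.config.trial_code_directory = Path(__file__).parent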
Tuner configuration:
experiment.config.tuner = CustomAlgorithmConfig()
experiment.config.tuner.code_directory = Path(__file__).parent
experiment.config.tuner.class_name = 'evolution_shrink_tuner.EvolutionShrinkTuner'
experiment.config.tuner.class_args = {
'optimize_mode': 'maximize',
'population_size': 64
}
Launching Experiment:
http_port = 8080
experiment.start(http_port)
# Event Loop
while True:
    if experiment.get_status() == 'DONE':
        search_data = experiment.export_data()
        search_metrics = experiment.get_job_metrics()
        input("Experiment is finished. Press any key to exit...")
        break
    # avoid busy-waiting (sleep is imported from time)
    sleep(1)
The best Trial demonstrates 0.89143 accuracy and has the following parameters:
{
"op_1": {
"_name": "SelectFwe",
"alpha": 0.049
},
"op_2": {
"_name": "MinMaxScaler"
},
"op_3": {
"_name": "RobustScaler"
},
"op_4": {
"_name": "none"
},
"op_5": {
"_name": "MLPClassifier",
"alpha": 0.01,
"learning_rate_init": 0.01
}
}
In fact, the classifier we have built shows very good results, which are close to optimal. The best classifier for this dataset achieves 0.898 accuracy (“Multi-Task Architecture with Attention for Imaging Atmospheric Cherenkov Telescope Data Analysis,” www.scitepress.org/Papers/2021/102974/102974.pdf). We have just built a custom AutoML toolkit based on NNI. This approach can also be applied to any functional pipeline optimization. Similarly, you can automatically design deep learning models with a sequential layer layout. Indeed, the operator space can consist of deep learning layers, and the model can be a neural network built from a sequential pipeline of layers. Of course, we could dive deeper into applying this approach to Neural Architecture Search, but it has significant drawbacks, which we will discuss in the next section.
We need some different and special techniques to search for efficient neural network
architectures. And we will begin to explore them in the next chapter.
Summary
This chapter has taken a deep dive into Tuner internals and various black-box function
optimization algorithms. Understanding the principles of Tuners’ behavior and their
practical application can remarkably improve the design of NNI experiments and HPO
results. In the next chapter, we’ll move on to the most exciting and interesting part of our
book: Neural Architecture Search. We will study the latest techniques to find the optimal
design of neural networks for a specific task.
CHAPTER 4
Multi-trial Neural Architecture Search
And now we come to the most exciting part of this book. As we noted at the end of the last
chapter, HPO methods are pretty limited for automating the search for the optimal deep
learning models, but Neural Architecture Search (NAS) dispels these limits. This chapter
focuses on NAS, one of the most promising areas of automated deep learning. Automatic
Neural Architecture Search is increasingly important in finding appropriate deep learning
models. Recent research has proven the effectiveness of NAS and discovered models that beat manually tuned ones. NAS is a fairly young discipline in machine learning: it took shape as a separate discipline in 2018. Since then, it has made a significant breakthrough in automating the construction of neural network architectures that solve a specific problem. Much of the manual design of neural networks may soon be replaced by automated architecture search, so this area is very promising for all data scientists. NAS
produced many top computer vision architectures. Architectures like NASNet, EfficientNet,
and MobileNet are the result of automated Neural Architecture Search.
There are two types of NAS: Multi-trial and One-shot. In Multi-trial NAS, an Exploration Strategy samples models from a defined Model Space and a model evaluator evaluates each sampled model's performance, while One-shot NAS tries to find the optimal neural architecture by training and exploring a single Supernet derived from the Model Space.
This chapter is dedicated to Multi-trial NAS.
This chapter is divided into two parts: Neural Architecture Search Using Retiarii
(PyTorch) and Classic Neural Architecture Search (TensorFlow). Retiarii is a deep learning framework, developed by the NNI team, that supports exploratory training on a neural network Model Space. Retiarii is an advantageous approach that allows structuring and planning the NAS. Unfortunately, the NNI 2.7 version (which is used
in this book) only implements the Retiarii approach for the PyTorch framework. And
it would be unfair not to pay attention to the TensorFlow framework in this chapter, so
the classic methods of NAS using NNI are considered in the Classic Neural Architecture
Search (TensorFlow) part. In any case, NNI supports TensorFlow for One-shot NAS,
which we will explore in the next chapter. Therefore, TensorFlow users will be able to
take full advantage of NAS approaches.
In Figure 4-1, we see a Data Flow Graph (DFG), which contains different types of nodes with parameters. The architecture of each neural network can be represented as a Data Flow Graph. Indeed, in Figure 4-1, we could replace a rectangle node with a convolution layer with parameters: padding, stride, filter_size, etc. Neural Architecture Search explorers have no idea about the nature of each of the DFG nodes. The main task is to construct, from various deep learning layers, a DFG that forms the architecture of a neural network and solves a specific problem in the best way.
Let's start with a toy problem. Consider a fixed chain of operators:

x = 1
x ← x × 2
x ← x × 4
return x

Now suppose the chain can be mutated at several points, choosing one option per line:

x = 1
x ← sigmoid(x) or x ← tanh(x) or x ← relu(x)
x ← x × 2
x ← x + 0 or x ← x + 1 or x ← x + 2
x ← x × 4
x ← sigmoid(x) or x ← tanh(x) or x ← relu(x)
return x
The Model Space of this problem can be depicted as shown in Figure 4-2. We need to
find a DFG that maximizes output.
import os
import torch
import nni
import nni.retiarii.nn.pytorch as nn
First, we need to define the Model Space for Exploration. Model Space in the
NAS context can be considered as search space in the HPO context. Model Space is
represented by a class that contains all possible architectures to try. Each such class
must be annotated with @model_wrapper.
@model_wrapper
class DummyModel(nn.Module):
def __init__(self):
super().__init__()
# operator 1
self.op1 = nn.LayerChoice([
nn.Tanh(),
nn.Sigmoid(),
nn.ReLU()
])
Another type of mutation is ValueChoice. This mutator selects one of the values from
the list:
# addition
self.add = nn.ValueChoice([0, 1, 2])
# operator 2
self.op2 = nn.LayerChoice([
nn.Tanh(),
nn.Sigmoid(),
nn.ReLU()
])
Next, we define the evaluate method, which returns the model result:
def evaluate(model_cls):
model = model_cls()
x = torch.Tensor([1])
y = model(x)
(The code omitted here is used for model architecture visualization; we will get back to this technique later.)
nni.report_final_result(y.item())
Once we have defined the Model Space and its Evaluator, we can move on to launching the experiment with the code shown in Listing 4-2.
Importing necessary modules:
model_space = DummyModel()
evaluator = FunctionalEvaluator(evaluate)
exp_config = RetiariiExeConfig('local')
exp_config.experiment_name = 'dummy_search'
exp_config.trial_concurrency = 1
exp_config.max_trial_number = 100
exp_config.training_service.use_active_gpu = False
export_formatter = 'dict'
Launching Experiment:
exp.run(exp_config, 8080)
while True:
    sleep(1)
    if exp.get_status() == 'DONE':
        input("Experiment is finished. Press any key to exit...")
        print('Final model:')
        for model_code in exp.export_top_models(formatter=export_formatter):
            print(model_code)
        break
After the experiment is completed, we can analyze the Trial jobs panel to examine the best results.
As shown in Figure 4-3, the best model returns 16. And it has the following set of
parameters:
{
"model_1": "2",
"model_2": 2,
"model_3": "2"
}
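These values mean that both layer choices select ReLU (index 2 in the [Tanh, Sigmoid, ReLU] candidate list) and that 2 is the chosen addend. Tracing the chain from x = 1: relu(1) = 1, 1 × 2 = 2, 2 + 2 = 4, 4 × 4 = 16, relu(16) = 16, which is exactly the reported maximum.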
The preceding parameters are not self-describing, so we can use the visualization function to render the DFG, as shown in Figure 4-4.
NNI uses Netron to visualize trial models. Netron is a tiny viewer for neural networks,
deep learning, and machine learning models. Clicking the Netron button, you’ll see the
screen as shown in Figure 4-5.
Now we can declare that we have found a solution to the problem of finding a model that maximizes the value of the chain of operators. The Data Flow Graph of this model
is shown in Figure 4-6.
The purpose of this example was to demonstrate that the primary goal of a NAS approach is to find a DFG that maximizes (or minimizes) a black-box function, in the same way that HPO methods search for parameters that maximize (or minimize) a black-box function. After introducing basic NAS techniques, we can dive into more details.
Retiarii Framework
Retiarii framework is designed to separate the main logical entities of Neural
Architecture Search. This makes the NAS procedure clear and elegant. Using the Retiarii framework, the researcher can focus on each particular aspect of the investigation separately. The main components of the Retiarii framework are as follows:
• Base Model
• Mutator
• Model Space
• Evaluator
• Exploration Strategy
Base Model is the primary skeleton of a neural network. Base Model is actually a
simple deep learning model that solves some problem. Often the Base Model does
not show good performance. But it has some primary neural architecture and training
algorithm.
Mutator is a possible change that a Base Model can be subjected to. Mutator defines
the transformation of the Base Model architecture into another one. Usually, many
mutators are applied to Base Model.
Model Space is the set of all possible Base Model mutations. Each mutator generates
several variants of neural network architectures. Applying all mutators to the Base Model
defines the Model Space.
Evaluator measures the performance of a sample from the Model Space. This is a
typical algorithm for training and testing a neural network.
Exploration Strategy defines the Model Space exploration algorithm. The main
objective of the Exploration Strategy is to find the best model in the least number
of trials.
All these concepts are pretty familiar to us after studying Hyperparameter
Optimization. Table 4-1 contains the main logical entities from NAS and HPO. As you
can see, they mean almost the same thing.
Base Model
The Base Model is a starting point from which all possible architecture modifications will
be made. For example, the Base Model for NAS of the MNIST problem can be presented as shown in Listing 4-3.
Importing PyTorch modules:
import torch
import torch.nn.functional as F
import nni.retiarii.nn.pytorch as nn
Next, we have the classic LeNet model design for the digit recognition problem:
def __init__(self):
super().__init__()
self.conv1 = nn.Conv2d(1, 32, 3, 1)
self.conv2 = nn.Conv2d(32, 64, 3, 1)
self.dropout1 = nn.Dropout(0.25)
self.dropout2 = nn.Dropout(0.5)
self.fc1 = nn.Linear(9216, 128)
self.fc2 = nn.Linear(128, 10)
Mutators
Base Model is a single model. To create Model Space, we have to add Mutators to the
Base Model. Each mutator provides a way to change the Base Model. All possible
mutations applied to the Base Model form the Model Space. NNI provides the following
mutation operations: LayerChoice, ValueChoice, InputChoice, and Repeat.
LayerChoice
LayerChoice mutator forms the candidate layers for a layer placeholder. One of these
layers is tried in the exploration process. LayerChoice mutator is applied to the Base
Model the following way:
# import part
import nni.retiarii.nn.pytorch as nn
# model design
self.activation = nn.LayerChoice([
nn.ReLU(),
nn.Sigmoid(),
nn.Identity()
])
# forward
x = self.activation(x)
LayerChoice adds layer variations to the Base Model as shown in Figure 4-8.
ValueChoice
ValueChoice forms a list of values to be tried as a layer hyperparameter. ValueChoice can be used for layer hyperparameters only; it cannot be used for an arbitrary hyperparameter like batch_size or learning_rate in the evaluation process.
ValueChoice mutator is applied to the Base Model the following way:
# import part
import nni.retiarii.nn.pytorch as nn
# model design
self.drop = nn.Dropout(nn.ValueChoice([0.25, 0.5, 0.75]))
# forward
x = self.drop(x)
InputChoice
InputChoice tries different connections. It takes several tensors and chooses n_chosen
tensors from them. InputChoice mutator is applied to the Base Model the following way:
# import part
import nni.retiarii.nn.pytorch as nn
# model design
self.switch = nn.InputChoice(n_candidates = 2, n_chosen = 1)
# forward
# branch one
a = self.op_a1(x)
a = self.op_a2(a)
# branch two
b = self.op_b1(x)
b = self.op_b2(b)
# choosing connection
x = self.switch([a, b])
InputChoice is designed to find the best data flow branches in neural network
architecture. Figure 4-9 illustrates this concept.
If InputChoice picks more than one candidate tensor (i.e., n_chosen > 1), then a reduction strategy is applied: sum, mean, or concat. This is a very useful technique that allows extracting and merging several connections at the same time. InputChoice for
multiple candidates with reduction can be applied the following way:
# import part
import nni.retiarii.nn.pytorch as nn
# model design
self.mix = nn.InputChoice(n_candidates = 3, n_chosen = 2, reduction = 'sum')
# forward
# branch one
a = self.op_a1(x)
a = self.op_a2(a)
# branch two
b = self.op_b1(x)
b = self.op_b2(b)
# branch three
c = self.op_c1(x)
c = self.op_c2(c)
# choosing connection
x = self.mix([a, b, c])
The preceding code generates the search space shown in Figure 4-10.
The InputChoice mutator for multiple candidates can choose the same tensor several times. This happens when the other connections do not bring any helpful information to the neural network's performance.
# import part
import nni.retiarii.nn.pytorch as nn
# model design
self.skip_connect = nn.InputChoice(n_candidates = 2, n_chosen = 1)
# forward
x0 = x.clone()
# connection
x1 = self.op(x)
x0 = self.skip_connect([x0, None])
if x0 is not None:
# skipping connection
x1 += x0
If the first candidate (x0) is chosen, the model uses the skip connection technique; if None is chosen, it does not. Figure 4-11 demonstrates the skip connection mutation.
Let's consider a simple example. Suppose we have three pipelines:

1: x → (x × 2)
2: x → (x × 2) → (x × 3)
3: x → (x × 2) → (x × 3) → (x × 4)

You need to select the two pipelines whose sum is maximum, and you can select the same pipeline twice. This is a fairly simple task: it is intuitively clear that the last pipeline gives the maximum value. Tracing the third pipeline from x = 1: 1 → 2 → 6 → 24; thus, selecting it twice, the maximum value we can obtain is 24 + 24 = 48.
Now let's obtain the same result using NNI and the InputChoice mutator, as shown in Listing 4-4.
import nni
import nni.retiarii.nn.pytorch as nn
from nni.retiarii import model_wrapper
class ProdBlock(nn.Module):
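    # (sketch of the omitted body: the block simply multiplies its input
    #  by a fixed factor, matching the x2/x3/x4 usage below)
    def __init__(self, factor):
        super().__init__()
        self.factor = factor

    def forward(self, x):
        return x * self.factor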
@model_wrapper
class InputChoiceModelSpace(nn.Module):
def __init__(self):
super().__init__()
self.x2 = ProdBlock(2)
self.x3 = ProdBlock(3)
self.x4 = ProdBlock(4)
and InputChoice mutator that will select the best pair of pipelines:
self.mix = nn.InputChoice(
n_candidates = 3,
n_chosen = 2,
reduction = 'sum'
)
The forward method executes three different pipelines, and the InputChoice mutator selects only two of them:
First pipeline: x → (x × 2)
# Branch A
a = self.x2(x)
Second pipeline: x → (x × 2) → (x × 3)
# Branch B
b = self.x2(x)
b = self.x3(b)
Third pipeline: x → (x × 2) → (x × 3) → (x × 4)
# Branch C
c = self.x2(x)
c = self.x3(c)
c = self.x4(c)
def evaluate(model_cls):
model = model_cls()
x = 1
out = model(x)
# visualizing
onnx_dir = os.path.abspath(os.environ.get('NNI_OUTPUT_DIR', '.'))
os.makedirs(onnx_dir, exist_ok = True)
torch.onnx.export(model, x, onnx_dir + '/model.onnx')
nni.report_final_result(out)
$ python3 ch4/retiarii/common/input_choice/run_experiment.py
You can analyze the results on the WebUI detail page: http://127.0.0.1:8080/detail.
In Figure 4-12, we see 3² = 9 trials. The best trial shows 48 and has the following parameters: { "model_1_0": 2, "model_1_1": 2 }. These indices refer to the order of the candidates passed to InputChoice in forward:

return self.mix(
    [
        a, # <- 0
        b, # <- 1
        c  # <- 2
    ])

which means that the best result is achieved by selecting the third pipeline twice.
Such simple examples help us understand how mutators act before proceeding to a real NAS.
Repeat
Repeat mutator repeats some action a certain number of times. In the NAS context, the
Repeat mutator tries to determine how often to iterate the same neural network block.
For example, the ResNet neural network architecture implies a stack of Residual Blocks.
But the optimal number of Residual Blocks may depend on the specific task. Figure 4-13
shows part of the ResNet architecture.
Repeat mutator accepts a function that generates a block by its sequence number.
Here is the pattern for how the Repeat mutator can be applied:
import nni.retiarii.nn.pytorch as nn
from nni.retiarii import model_wrapper
class SomeBlock(nn.Module):
...
We set a builder function that generates a block according to its ordinal number in the stack:
def create_some_block(block_num):
# some logic here that depends on 'block_num'
return SomeBlock(block_num)
@model_wrapper
class RepeatModelSpace(nn.Module):
def __init__(self):
super().__init__()
...
self.repeat_block = nn.Repeat(
create_some_block,
depth = (1, 5) # repeat from 1 to 5 times
)
...
x = self.repeat_block(x)
...
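The AddBlock class itself is missing from this excerpt; a minimal sketch of a block compatible with the create method shown next (its behavior, adding the block's ordinal number to the input, is an assumption for illustration):

class AddBlock(nn.Module):
    def __init__(self, block_num):
        super().__init__()
        self.block_num = block_num

    def forward(self, x):
        # hypothetical behavior: add the block's ordinal number to the input
        return x + self.block_num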
@classmethod
def create(cls, block_num):
return AddBlock(block_num)
@model_wrapper
class RepeatModelSpace(nn.Module):
def __init__(self):
super().__init__()
self.repeat = nn.Repeat(
AddBlock.create,
depth = (1, 5)
)
$ python3 ch4/retiarii/common/repeat/run_experiment.py
The mutators can be loosely mapped to familiar programming constructs:
• if: InputChoice
• loop: Repeat
Later in this chapter, we will examine a Model Space construction for a real NAS task
using these mutators.
Labeling
All the mutator APIs have an optional argument label. Mutators with the same label
will share the same value. A typical example is
self.net = nn.Sequential(
    nn.Linear(10, nn.ValueChoice([32, 64, 128], label='hidden_dim')),
    nn.Linear(nn.ValueChoice([32, 64, 128], label='hidden_dim'), 3)
)
Example
Listing 4-6 demonstrates a trivial example of the Model Space applied to the image
classification network.
@model_wrapper
class Net(nn.Module):
def __init__(self):
super().__init__()
self.conv1 = nn.Conv2d(1, 32, 3, 1)
self.conv2 = nn.LayerChoice([
    nn.Conv2d(32, 64, 3, 1),
    nn.Identity()
], label = 'conv_layer')
self.dropout1 = nn.Dropout(
    nn.ValueChoice([0.25, 0.5, 0.75], label = 'dropout')
)
self.dropout2 = nn.Dropout(0.5)
feature = nn.ValueChoice(
[64, 128, 256],
label = 'hidden_size'
)
self.fc1 = nn.Linear(9216, feature)
self.fc2 = nn.Linear(feature, 10)
Evaluators
Retiarii Evaluator is a function that accepts a model class, initiates a model, trains it, tests
it, and returns a result to Experiment. Evaluator can be implemented using the following
pattern:
def evaluate(model_cls):
# Initiate model
model = model_cls()
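    # sketch of the remaining steps: train_model and test_model stand in for
    # your usual training and testing code (hypothetical helpers)
    train_model(model)
    accuracy = test_model(model)
    # report the final metric back to the Experiment
    nni.report_final_result(accuracy)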
Retiarii Evaluator is pretty close to the Trial approach we used in HPO in previous
chapters.
Exploration Strategies
NNI provides the following exploration strategies for Multi-trial NAS: Random
Strategy, Grid Search, Regularized Evolution, TPE Strategy, and RL Strategy. We are
already familiar with some of them because they implement the same approach as
corresponding HPO Tuners. But anyway, let’s briefly study each of them.
Random Strategy
Random Strategy (nni.retiarii.strategy.Random) randomly samples new models
from the Model Space. It is a simple but still effective technique to explore the Model Space. Random Search is a good first-time Exploration Strategy, and it can give you good clues when you have no idea about the dataset you are dealing with or about suitable architecture designs. Usually, Random Search is used first, and after the Model Space is refined, a more intelligent Exploration Strategy is applied.
Grid Search
Grid Search Strategy (nni.retiarii.strategy.GridSearch) samples new models from
Model Space using a Grid Search algorithm.
Regularized Evolution
Regularized Evolution Strategy (nni.retiarii.strategy.RegularizedEvolution) implements a Genetic Algorithm search with a mutation operator, using the Tournament Selection method. Regularized Evolution Strategy is close to the Evolution Tuner we studied in Chapter 3. Pseudo-code describing the Regularized Evolution algorithm is provided in the following. It has three global hyperparameters:
# HYPERPARAMETERS
POPULATION_SIZE
CANDIDATES_N
GENERATIONS_N
population = []
for _ in range(POPULATION_SIZE):
individual = generate_random_architecture()
evaluate(individual)
population.append(individual)
for _ in range(GENERATIONS_N):
At each generation, the algorithm picks CANDIDATES_N random individuals from the population:
candidates = random_sample(population, CANDIDATES_N)
From these candidates, the best one is selected (i.e., the individual that has the best
metric):
best_candidate = get_best_from(candidates)
Random mutation is performed on the best candidate (i.e., algorithm runs several
Mutators in the original model):
mutant = mutate(best_candidate)
evaluate(mutant)
replace_worst(population, mutant)
• optimize_mode:
Type: string
Default: maximize
• population_size:
Type: int
Default: 100
• cycles:
Type: int
Default: 20000
• sample_size:
Type: int
Default: 25
• mutation_prob:
Type: float
Default: 0.05
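A strategy is instantiated with these arguments and passed to the Retiarii experiment; a minimal sketch using the hyperparameters listed above:

import nni.retiarii.strategy as strategy

search_strategy = strategy.RegularizedEvolution(
    optimize_mode = 'maximize',
    population_size = 100,
    sample_size = 25,
    cycles = 20000,
    mutation_prob = 0.05
)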
TPE Strategy
TPE Strategy (nni.retiarii.strategy.TPEStrategy) is a Sequential Model-Based
Optimization approach based on Tree-structured Parzen Estimator. It acts the same way
as TPE Tuner we studied in Chapter 3.
For more details, please refer to the original paper describing the TPE approach:
https://papers.nips.cc/paper/2011/file/86e8f7ab32cfd12577bc2619bc635690-
Paper.pdf.
RL Strategy
RL Strategy (nni.retiarii.strategy.PolicyBasedRL) implements the Reinforcement
Learning approach based on policy-gradient method (Proximal Policy Optimization or
PPO). RL Strategy implements a special Recurrent Neural Network called Controller.
Controller generates various model architectures from Model Space. The Controller acts
as a stochastic policy; hence, it returns the mutation probability for each of the Mutators
in the Model Space. After each trial, the Controller updates the weights of its RNN
according to the Proximal Policy Optimization method. This approach allows exploring
the Model Space by constructing a probability distribution for each of the mutators.
Figure 4-15 demonstrates RL Strategy in action.
• max_collect:
Type: int
Default: 100
• trial_per_collect:
Type: int
Default: 20
How many trials (trajectories) the collector collects each time. After the trajectories are collected, the trainer samples a batch from the replay buffer and updates the Controller.
For more details, please refer to the original paper describing the Neural Architecture
Search with Reinforcement Learning: https://arxiv.org/pdf/1611.01578.pdf.
Experiment
And the last thing left is the Experiment. Retiarii Experiment is launched in stand-alone
(embedded) mode and contains seven steps:
• Launch Experiment
• Returning results
# Launch Experiment
exp.run(exp_config, 8081 + random.randint(0, 100))
# Returning results
print('Final model:')
for model_code in exp.export_top_models(formatter=export_formatter):
print(model_code)
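Putting the steps together, a minimal launch skeleton (a sketch, assuming the DummyModel Model Space and the evaluate function from Listings 4-1 and 4-2):

from nni.retiarii.experiment.pytorch import RetiariiExperiment, RetiariiExeConfig
from nni.retiarii.evaluator import FunctionalEvaluator
import nni.retiarii.strategy as strategy

# 1. Model Space
model_space = DummyModel()
# 2. Evaluator
evaluator = FunctionalEvaluator(evaluate)
# 3. Exploration Strategy
search_strategy = strategy.Random()
# 4. Experiment
exp = RetiariiExperiment(model_space, evaluator, [], search_strategy)
# 5. Experiment configuration
exp_config = RetiariiExeConfig('local')
exp_config.experiment_name = 'dummy_search'
exp_config.trial_concurrency = 1
exp_config.max_trial_number = 100
# 6. Launch Experiment
exp.run(exp_config, 8080)
# 7. Returning results
for model_code in exp.export_top_models(formatter='dict'):
    print(model_code)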
You can use the WebUI after running the experiment the same way we did earlier
launching HPO experiments.
The datasets used in this chapter can be prepared with the helper script:
$ python3 ch4/utils/datasets.py
Let’s try to find an appropriate deep learning model based on the LeNet approach
solving the CIFAR-10 classification problem. As we already know, the LeNet image recognition architecture can be divided into two components: the Feature Extraction Component and the Decision Maker Component. The Feature Extraction Component consists of a sequence of Feature Extraction blocks built around a convolution layer. The Decision Maker Component consists of Fully Connected blocks built around a linear layer.
The design of the Feature Extraction block can be one of the following:
• Conv → MaxPool → Activation
• Conv → Activation
Figure 4-17 demonstrates possible architecture options for the Feature Extraction
block or Feature Extraction block space.
In the same way, we can determine the possible designs for the Fully Connected block:
• Linear → Dropout → Activation
• Linear → Activation
The neural architecture we are looking for consists of Feature Extraction and Fully
Connected block sequences. Each of these blocks may have the architecture shown in
Figures 4-17 and 4-18, respectively. The LeNet NAS algorithm must find the optimal
Feature Extraction and Fully Connected block sequence lengths, as well as their
architectures. The Model Space for LeNet NAS can be drawn as depicted in Figure 4-19.
We will start implementing the CIFAR-10 LeNet NAS by defining a Feature Extraction
block in Listing 4-7.
class FeatureExtractionBlock(nn.Module):
def __init__(
self,
dim: Tuple[int, int],
kernel_size,
activation,
block_num = 0
) -> None:
super().__init__()
self.input_dim = dim[0]
self.output_dim = dim[1]
self.conv = nn.Conv2d(
in_channels = self.input_dim,
out_channels = self.output_dim,
kernel_size = kernel_size
)
self.max_pool = nn.MaxPool2d(2, 2)
self.activation = activation
The forward method implements the two design options (Conv → MaxPool → Activation and Conv → Activation) as two branches:
x = self.conv(x)
# Branch A
a = self.max_pool(x)
a = self.activation(a)
# Branch B
b = self.activation(x)
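The choice between the two branches is made by an InputChoice mutator (omitted in the excerpt); a sketch consistent with the fe_switch_* parameters reported later in this section:

# in __init__ (sketch):
#   self.switch = nn.InputChoice(n_candidates=2, n_chosen=1,
#                                label=f'fe_switch_{block_num}')
# at the end of forward (sketch):
x = self.switch([a, b])
return x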
@classmethod
def create(cls, activation, in_dimension):
def create_block(i):
params = {
'kernel_size': nn.ValueChoice(
[3, 5],
label = f'fe_kernel_size_{i}'
)
}
params['dim'] = dim
params['activation'] = activation
return FeatureExtractionBlock(**params)
return create_block
class FullyConnectedBlock(nn.Module):
def __init__(
self,
dim: Tuple[int, int],
dropout_rate,
activation,
block_num
) -> None:
super().__init__()
self.input_dim = dim[0]
self.output_dim = dim[1]
self._linear = None
The two design options (Linear → Dropout → Activation and Linear → Activation) are implemented as two branches, selected by the following InputChoice mutator:
self.switch = nn.InputChoice(
n_candidates = 2,
n_chosen = 1,
label = f'fc_switch_{block_num}'
)
# Branch A
a = self.linear(x)
a = self.dropout(a)
a = self.activation(a)
# Branch B
b = self.linear(x)
b = self.activation(b)
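The block output is then selected from the two branches by the switch defined above; a sketch of the omitted final line of forward:

# choose between Branch A and Branch B
return self.switch([a, b])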
@classmethod
def create(cls, activation, units, dropout_rate):
def create_block(i):
return FullyConnectedBlock(
dim = (units[i], units[i + 1]),
dropout_rate = dropout_rate,
activation = activation,
block_num = i
)
return create_block
@model_wrapper
class Cifar10LeNetModelSpace(nn.Module):
def __init__(self):
super().__init__()
# number of classes for CIFAR-10 dataset
self.class_num = 10
First, we define the space for the Feature Extraction sequence. All Feature Extraction
blocks will share the same activation function:
fe_activation = nn.LayerChoice(
[nn.Sigmoid(), nn.ReLU()],
label = f'fe_activation'
)
Repeat mutator will create two or three Feature Extraction blocks in a row:
self.fe = nn.Repeat(
FeatureExtractionBlock.create(fe_activation, self.input_channels),
depth = (2, 3), label = 'fe_repeat'
)
self.flat = nn.Flatten()
All Fully Connected blocks will share the same activation function:
dm_activation = nn.LayerChoice(
[nn.Sigmoid(), nn.ReLU()],
label = f'fc_activation'
)
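The layer sizes and dropout rate used by the builder below are also mutable; their definitions are omitted in the excerpt. A sketch consistent with the l1_size, l2_size, l3_size, and fc_dropout_rate parameters reported later (the candidate value lists are assumptions for illustration):

l1_size = nn.ValueChoice([64, 128, 256], label='l1_size')
l2_size = nn.ValueChoice([32, 64, 128], label='l2_size')
l3_size = nn.ValueChoice([32, 64, 128], label='l3_size')
dropout_rate = nn.ValueChoice([0.3, 0.5], label='fc_dropout_rate')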
Repeat mutator will create from one to three Fully Connected blocks in a row:
self.dm = nn.Repeat(
FullyConnectedBlock.create(
dm_activation,
[None, l1_size, l2_size, l3_size],
dropout_rate
),
depth = (1, 3), label = 'fc_repeat'
)
self.linear_final_input_dim = None
self._linear_final = None
self.log_max = nn.LogSoftmax(dim = 1)
The forward pass (mostly omitted in this excerpt) ends with the final linear layer and log-softmax:
x = self.linear_final(x)
return self.log_max(x)
Model evaluator is a classical neural network train–test algorithm. You can examine
its code here: ch4/retiarii/cifar_10_lenet/eval.py.
Fine! Since LeNet Model Space is ready, we can start the research with the code in
Listing 4-10.
Importing modules:
model_space = Cifar10LeNetModelSpace()
evaluator = FunctionalEvaluator(evaluate)
search_strategy = strategy.PolicyBasedRL(
trial_per_collect = 10,
max_collect = 200
)
Experiment configuration:
exp_config = RetiariiExeConfig('local')
exp_config.experiment_name = 'CIFAR10_LeNet_NAS'
exp_config.trial_concurrency = 1
exp_config.max_trial_number = 500
exp_config.training_service.use_active_gpu = False
export_formatter = 'dict'
Launching Experiment:
exp.run(exp_config, 8080)
Returning results:
print('Final model:')
for model_code in exp.export_top_models(formatter = export_formatter):
print(model_code)
$ python3 ch4/retiarii/cifar_10_lenet/run_cifar10_lenet_experiment.py
Note Duration ~ 20 hours on Intel Core i7 with CUDA (GeForce GTX 1050)
The best model shows 0.84 accuracy on the test dataset. It is not a bad result, but it seems there is still room for improvement. The best model has the following parameters:
{
"fe_repeat": 2,
"fe_kernel_size_0": 3,
"fe_activation": "1",
"fe_switch_0": 0,
"fe_kernel_size_1": 5,
"fe_kernel_size_2": 3,
"fc_repeat": 2,
"l1_size": 128,
"fc_dropout_rate": 0.3,
"fc_switch_0": 0,
"l2_size": 64,
"fc_switch_1": 0,
"l3_size": 64,
"fc_switch_2": 0
}
The parameter "fe_repeat": 2 means that the Feature Extraction sequence contains two blocks:

self.fe = nn.Sequential(
    FeatureExtractionBlock.create(fe_activation, self.input_channels)(0),
    FeatureExtractionBlock.create(fe_activation, self.input_channels)(1),
)
And "fe_kernel_size_0": 3 means that the first Feature Extraction block uses a 3×3 kernel:

'kernel_size': nn.ValueChoice(
    [3,  # <- this value
     5],
    label = f'fe_kernel_size_{i}'
)
Also, it is convenient to visualize the architecture using Netron in Trial details panel.
Figure 4-20 demonstrates the architecture of the best model.
The research we made in this section can be a good starting point for your own NAS
solutions. It contains basic techniques used in Multi-trial NAS. Here, we used the LeNet
model as the Base Model, but you can choose any model and any mutators that fit the
concrete problem better. But we haven’t achieved a great result in the LeNet NAS. In the
next section, we’ll try another NAS with a more sophisticated approach.
And finally, we can construct the ResNet model, which stacks a sequence of Residual Cells. The optimal length of the Residual Cell sequence depends on the dataset, and we will try to find it with ResNet NAS. Figure 4-23 presents the complete Model Space for ResNet NAS.
Listing 4-11 defines the Bottleneck block for ResNet Model Space.
class Bottleneck(nn.Module):
expansion = 4
@classmethod
def result_channels_num(cls, channels):
return channels * cls.expansion
def __init__(
self,
cell_num,
in_channels,
out_channels,
i_downsample = None,
stride = 1
):
super(Bottleneck, self).__init__()
self.conv1 = nn.Conv2d(
in_channels, out_channels,
kernel_size = 1, stride = 1, padding = 0
)
self.batch_norm1 = nn.BatchNorm2d(out_channels)
self.conv2 = nn.Conv2d(
out_channels, out_channels,
kernel_size = 3, stride = stride, padding = 1
)
self.batch_norm2 = nn.BatchNorm2d(out_channels)
self.conv3 = nn.Conv2d(
out_channels, self.result_channels_num(out_channels),
kernel_size = 1, stride = 1, padding = 0
)
self.batch_norm3 = nn.BatchNorm2d(self.result_channels_num(out_channels))
Skip connection acts the same for all blocks in Residual Cell because all InputChoice
mutators share the same label in the Residual Cell:
self.skip_connection = nn.InputChoice(
n_candidates = 2,
n_chosen = 1,
label = f'bottle_neck_{cell_num}_skip_connection'
)
self.i_downsample = i_downsample
self.stride = stride
self.relu = nn.ReLU()
# x0
identity = x.clone()
x = self.relu(self.batch_norm1(self.conv1(x)))
x = self.relu(self.batch_norm2(self.conv2(x)))
x = self.conv3(x)
x = self.batch_norm3(x)
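# (sketch of the omitted skip connection application, consistent with the
#  InputChoice defined in __init__ and the optional i_downsample module)
identity = self.skip_connection([identity, None])
if identity is not None:
    if self.i_downsample is not None:
        identity = self.i_downsample(identity)
    x += identity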
x = self.relu(x)
return x
Since we have defined the Bottleneck block, we can move to the ResNet Model Space
definition in Listing 4-12.
(Some unimportant code segments are omitted. Complete code is provided in the
corresponding file: ch4/retiarii/cifar_10_resnet/res_net_model_space.py.)
Importing modules:
@model_wrapper
class ResNetModelSpace(nn.Module):
def __init__(self):
super().__init__()
self.relu = nn.ReLU()
self.conv1 = nn.Conv2d(
in_channels = self.in_channels,
out_channels = self.channels,
kernel_size = 7,
stride = 2,
padding = 3,
bias = False
)
self.batch_norm1 = nn.BatchNorm2d(64)
Constructing Residual Cell sequence with Repeat mutator (from two to five cells):
self.res_cells = nn.Repeat(
ResNetModelSpace.residual_cell_builder(),
depth = (2, 5), label = 'res_cells_repeat'
)
x = self.res_cells(x)
x = self.avg_pool(x)
x = x.reshape(x.shape[0], -1)
x = self.relu(self.fc1(x))
x = self.fc2(x)
return x
The following method is used to construct Residual Cells for the Repeat mutator:
@classmethod
def residual_cell_builder(cls):
def create_cell(cell_num):
downsample = None
layers = []
if stride != 1 or cls.channels != Bottleneck.result_channels_num(planes):
downsample = nn.Sequential(
nn.Conv2d(
in_channels = cls.channels,
out_channels = Bottleneck.result_channels_num(planes),
kernel_size = 1,
stride = stride
),
nn.BatchNorm2d(
num_features = Bottleneck.result_channels_num(planes)
)
)
layers.append(
Bottleneck(
cell_num = cell_num,
in_channels = cls.channels,
out_channels = planes,
i_downsample = downsample,
stride = stride
)
)
cls.channels = Bottleneck.result_channels_num(planes)
)
return nn.Sequential(*layers)
return create_cell
Phew! The ResNet Model Space definition is not easy to follow if you are not familiar with ResNet yet. Anyway, don't forget that NAS treats the neural network as a Data Flow Graph and tries to find the optimal combination of nodes and connections in
the Model Space we constructed in Figure 4-23. Even if some deep learning concepts
are not familiar to you yet, try to treat the NAS as the search for the optimal subgraph in
supergraph space.
ResNet model evaluator is a classical neural network train–test algorithm. You can
examine its code here: ch4/retiarii/cifar_10_resnet/eval.py. ResNet NAS Experiment
script does not differ too much from LeNet NAS, and I don’t provide its code in the
book. Please refer to the script file: ch4/retiarii/cifar_10_resnet/run_cifar10_resnet_
experiment.py.
The experiment can be run as follows:
$ python3 ch4/retiarii/cifar_10_resnet/run_cifar10_resnet_experiment.py
Note Duration ~ 60 hours on Intel Core i7 with CUDA (GeForce GTX 1050)
The best model shows 0.957 accuracy on the test dataset. That is not perfect, but it is a much better result than LeNet NAS produced (0.84). The best neural network architecture has
the following parameters:
{
"pool_size": 2,
"res_cells_repeat": 5,
"bottle_neck_0_skip_connection": 1,
"bottle_neck_1_skip_connection": 0,
"bottle_neck_2_skip_connection": 0,
"bottle_neck_3_skip_connection": 0,
"bottle_neck_4_skip_connection": 0,
"fc1_output_dim": 512
}
These parameters mean that the best model has a sequence of five Residual Cells: the first cell does not use the skip connection technique, and the others do. This is a predictable result because the skip connection technique usually improves neural network performance.
In this section, we have achieved a great result! CIFAR-10 is a highly complex
computer vision problem, and even large, sophisticated neural networks cannot
reach high accuracy with this dataset. Table 4-2 compares the architecture we have
constructed in this section with other common architectures.
Rank   Architecture                     Accuracy (%)
79     AutoDropout                      96.8
96     Wide ResNet                      96.11
-      Multi-trial NAS ResNet Result    95.7
104    SimpleNetv1                      95.51
108    MomentumNet                      95.18
116    VGG-19 with GradInit             94.71
128    Tree+Max-Avg pooling             94
The main logical entities of Classic NAS are Base Model, Mutator, Search Space, Trial,
and Search Strategy. Let’s study each of them by applying NAS algorithm to the classical
MNIST problem.
Base Model
Every Neural Architecture Search starts with defining a Base Model. The Base Model is a
neural network that acts as a starting point for new architectures. The Base Model can be
very simple or very complex. The researcher chooses the model that is most suitable as a
baseline.
Let’s examine the classic LeNet model for the MNIST problem in Listing 4-13.
class LeNetModel(Model):
def __init__(self):
super().__init__()
self.conv1 = Conv2D(6, 3, padding = 'same', activation = 'relu')
self.pool = MaxPool2D(2)
self.conv2 = Conv2D(16, 3, padding = 'same', activation = 'relu')
self.bn = BatchNormalization()
self.gap = AveragePooling2D(2)
self.fc1 = Dense(120, activation = 'relu')
self.fc2 = Dense(84, activation = 'relu')
self.fc3 = Dense(10)
Feed forward:
x = self.conv1(x)
x = self.pool(x)
x = self.conv2(x)
x = self.pool(self.bn(x))
x = self.gap(x)
x = tf.reshape(x, [batch_size, -1])
x = self.fc1(x)
x = self.fc2(x)
x = self.fc3(x)
return x
The model described in Listing 4-13 will serve as a piece of clay for new
architectures.
Mutators
Mutator transforms the Base Model into a new one. A set of mutators allows defining a
search space for the NAS. Classic NNI NAS provides two mutators: LayerChoice and
InputChoice. LayerChoice mutator forms the candidate layers for a layer placeholder.
One of the candidates is tried in the exploration process. InputChoice tries different
connections. It takes several tensors and chooses n_chosen tensors from them.
You can learn more about LayerChoice and InputChoice mutators in the subsection
“Mutators” under the section “Neural Architecture Search Using Retiarii (PyTorch).”
We can apply LayerChoice and InputChoice mutators to the Base Model in the
following way.
class LeNetModelSpace(Model):
def __init__(self):
super().__init__()
self.conv1 = LayerChoice([
Conv2D(6, 3, padding = 'same', activation = 'relu'),
Conv2D(6, 5, padding = 'same', activation = 'relu'),
Conv2D(6, 7, padding = 'same', activation = 'relu'),
], key = 'conv1')
self.pool = LayerChoice([
MaxPool2D(2),
MaxPool2D(3)],
key = 'pool'
)
self.conv2 = LayerChoice([
Conv2D(16, 3, padding = 'same', activation = 'relu'),
Conv2D(16, 5, padding = 'same', activation = 'relu'),
Conv2D(16, 7, padding = 'same', activation = 'relu'),
], key = 'conv2')
self.conv3 = Conv2D(16, 1)
self.skip_connect = InputChoice(
n_candidates = 2,
n_chosen = 1,
key = 'skip_connect'
)
self.bn = BatchNormalization()
self.gap = AveragePooling2D(2)
self.fc1 = Dense(120, activation = 'relu')
self.fc2 = LayerChoice([
Dense(84, activation = 'relu'),
Layer()
], key = 'fc2')
self.fc3 = Dense(10)
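The call method of the Model Space (not shown here) follows the Base Model's feed forward; the only new part is the optional parallel branch selected by skip_connect. A sketch of that fragment, consistent with the parameter mapping shown at the end of this section (the exact placement of conv3 is an assumption):

# inside call(), after the second convolution block (sketch):
x0 = self.conv3(x)                    # 1x1 convolution branch
x0 = self.skip_connect([x0, None])    # keep the branch or drop it
if x0 is not None:
    x = x + x0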
Trial
NAS Trial means the same as the Trial in the HPO context. It initializes the model, trains
it, tests it, and returns model accuracy. There is only one new feature: you must use the
get_and_apply_next_architecture method from the nni.algorithms.nas.tensorflow.classic_nas package to initialize the model. Listing 4-15 provides the NAS Trial.
(Full code is provided in the corresponding file: ch4/classic/trial.py.)
net = LeNetModelSpace()
get_and_apply_next_architecture(net)
train_model(net, dataset_train, optimizer, epochs)
acc = test_model(net, dataset_test)
nni.report_final_result(acc.numpy())
Search Space
After the Trial script is defined, you should generate the search space JSON file (NNI provides the nnictl ss_gen command for this purpose). The generated search space looks as follows:
{
"conv1": {
"_type": "layer_choice",
"_value": ["0", "1", "2"]
},
"conv2": {
"_type": "layer_choice",
"_value": ["0", "1", "2"]
},
"fc2": {
"_type": "layer_choice",
"_value": ["0", "1"]
},
"pool": {
"_type": "layer_choice",
"_value": ["0", "1"]
},
"skip_connect": {
"_type": "input_choice",
"_value": {
"candidates": ["",""],
"n_chosen": 1
}
}
}
As you can see, the NNI Classic NAS implementation is pretty close to the HPO implementation. The search space file is a list of all possible neural architecture choices.
Search Strategy
Classic NAS supports the following Search Strategies:
• Random Search
• PPO Tuner
For more information about these tuners, you can refer to the subsection
“Exploration Strategies” under the section “Neural Architecture Search Using Retiarii
(PyTorch).”
Experiment
The last thing left to do is to define the experiment configuration. Experiment
configuration is defined in Listing 4-17.
experimentName: example_mnist
trialConcurrency: 1
maxTrialNum: 100
trainingServicePlatform: local
searchSpacePath: search_space.json
tuner:
builtinTunerName: PPOTuner
classArgs:
optimize_mode: maximize
trial:
command: python3 trial.py
And now, we can start the experiment by passing this configuration file to nnictl.
Note Duration ~ 2 hours on Intel Core i7 with CUDA (GeForce GTX 1050)
The experiment returns the best accuracy of 0.9923 on the test dataset, with the following parameter set:
conv1: 2
pool: 0
conv2: 0
fc2: 0
skip_connect: 0
These parameters correspond to the following choices in the Model Space:
self.conv1 = LayerChoice([
Conv2D(6, 3, ...),
Conv2D(6, 5, ...),
Conv2D(6, 7, ...), # <- 2
], key = 'conv1')
self.pool = LayerChoice([
MaxPool2D(2), # <- 0
MaxPool2D(3)],
key = 'pool'
)
self.conv2 = LayerChoice([
Conv2D(16, 3, ...), # <- 0
Conv2D(16, 5, ...),
Conv2D(16, 7, ...),
], key = 'conv2')
self.fc2 = LayerChoice([
Dense(84, activation = 'relu'), # <- 0
Layer()
], key = 'fc2')
x0 = self.skip_connect([
x0, # <- 0
None]
)
Classic NAS does not differ too much from the HPO approach. Indeed, we could build the same experiment with layer and design hyperparameter search; we used a similar trick in Chapter 2, in the section “From LeNet to AlexNet.” Neural Architecture Search has made great strides lately, and Classic NAS cannot support the newest research ideas. In any case, you can still use Classic NAS and get meaningful results when searching for new solutions.
Summary
Multi-trial Neural Architecture Search using Retiarii and classic approaches offers
clear and elegant solutions for searching robust neural architectures. Many meaningful
results could be achieved using Multi-trial NAS. But this approach has one very serious
drawback. It takes too much time. Indeed, complex models and massive datasets need
too much time to train, and the Model Space can contain millions of model samples.
Even the most advanced Exploration Strategy can take too much time to converge to
some suboptimal neural architecture. But the time problem has a solution called One-
shot NAS, and we will explore this method in the next chapter.
CHAPTER 5
One-Shot Neural Architecture Search
In the previous chapter, we explored Multi-trial Neural Architecture Search, which is a
very promising approach. And the reader might wonder why Multi-trial NAS is called
that way. Are there any other non-Multi-trial NAS approaches, and is it really possible
to search for the optimal neural network architecture in some other way without trying
it? It looks pretty natural that the only way to find the optimal solution is to try different
elements in the search space. In fact, it turns out that this is not entirely true. There is an
approach that allows you to find the best architecture by training some Supernet. And
this approach is called One-shot Neural Architecture Search. As the name “one-shot”
implies, this approach involves only one try or shot. Of course, this “shot” is much longer
than single neural network training, but nevertheless, it saves a lot of time.
In this chapter, we will study what One-shot NAS is and how to design architectures
for this approach. We will examine two popular One-shot algorithms: Efficient Neural
Architecture Search via Parameter Sharing (ENAS) and Differentiable Architecture
Search (DARTS). Of course, we will apply these algorithms to solve practical problems.
NNI 2.7 version (which is used in this book) has ENAS algorithm implementation
for the TensorFlow framework, but it doesn’t have one for the DARTS algorithm. Anyway,
ENAS algorithm is one of the most popular and efficient One-shot NAS implementations,
so TensorFlow users shouldn’t get too frustrated.
As we noted, the enormous amount of training required sometimes makes Multi-trial NAS not applicable in practice. Indeed, some Multi-trial experiments can take weeks or months on the most modern computing resources. The One-shot NAS approach has been proposed to address this weakness of Multi-trial architecture search.
The best way to introduce One-shot NAS is to provide an example. Let’s say we are
looking for the optimal architecture for the MNIST problem. And we have the Model
Space shown in Figure 5-1.
Figure 5-1 shows the Model Space with two mutable layers. Each mutable layer
has the following choices: Conv 1×1, Conv 3×3, and Conv 5×5. In a classic Multi-trial scenario, we would perform 3 × 3 = 9 trials, one for each combination of parameters, and pick the best one. The One-shot NAS approach follows another technique: we create
one Supernet that merges or reduces the output of each mutable layer and train the
resulting neural network only once. Figure 5-2 demonstrates this Supernet.
And finally, we pick the combination which demonstrated the best performance.
This combination represents the result of the One-shot NAS algorithm. For example, if a
combination (Conv 5×5, Conv 5×5) showed the best accuracy, then our target network
design is Conv 5×5 → Conv 5×5 → Linear → Linear.
Let's summarize what exactly we did during the One-shot NAS algorithm:
• We created a single Supernet that merges all candidate layers.
• We trained it once.
• We evaluated each combination of candidate layers, reusing the trained Supernet weights.
• We picked the combination with the best performance.
The main benefit we have here is that we trained the Supernet only once instead of training each of the nine candidate networks! It speeds up the whole neural architecture search dramatically, because network training is the longest part of the NAS process.
A reader may have a fair question: “But wait! We trained a single Supernet network.
All candidate layers learned to work together! But then we decided to break it into different
parts, leaving the same weights. This is nonsense!” I agree. This is a very counterintuitive
concept. Indeed, all layers were trained together, and they learned to complement and
help each other in solving the problem. Surely you can’t just throw out some layers from
a neural network searching for the best architecture. But the most fantastic thing about
One-shot NAS is that you can! There is still no sufficient mathematical basis for this
approach, but it works in practice. Let's now implement it using the example we considered earlier. In this section, we will not be using the NNI toolkit. Here,
our goal is to get an intuition about the One-shot NAS approach.
To begin with, we will make a vanilla Multi-trial NAS. Listing 5-1 (TensorFlow
implementation) and Listing 5-2 (PyTorch implementation) implement the model
shown in Figure 5-1.
We import necessary modules:
The following model accepts two parameters, kernel1 and kernel2, that define the
conv1 and conv2 layers:
class TfLeNetMultiTrialModel(Model):
The following model accepts two parameters, kernel1 and kernel2, that
define the conv1 and conv2 layers:
class PtLeNetMultiTrialModel(nn.Module):
And now, let’s execute Multi-trial NAS iterating through the various kernel_size
parameters (kernel1: [1, 3, 5], kernel2: [1, 3, 5]) using the script in Listing 5-3
(TensorFlow implementation) and Listing 5-4 (PyTorch implementation).
kernel1_choices = [1, 3, 5]
kernel2_choices = [1, 3, 5]
results = {}
for k1 in kernel1_choices:
for k2 in kernel2_choices:
# Trial
model = TfLeNetMultiTrialModel(k1, k2)
train(model)
accuracy = test(model)
results[(k1, k2)] = accuracy
Displaying results:
print('=======')
print('Results:')
for k, v in results.items():
print(f'Conv1 {k[0]}x{k[0]}, Conv2: {k[1]}x{k[1]} : {v}')
kernel1_choices = [1, 3, 5]
kernel2_choices = [1, 3, 5]
results = {}
for k1 in kernel1_choices:
for k2 in kernel2_choices:
# Trial
model = PtLeNetMultiTrialModel(k1, k2)
train_model(model)
accuracy = test_model(model)
results[(k1, k2)] = accuracy
print('=======')
print('Results:')
for k, v in results.items():
print(f'Conv1 {k[0]}x{k[0]}, Conv2: {k[1]}x{k[1]} : {v}')
The results of the Multi-trial NAS we performed are listed in Table 5-1.
According to Table 5-1, the best candidate is (Conv5×5, Conv5×5). Well, we tried
every neural design candidate and found the most suitable one for the MNIST problem.
Of course, the Multi-trial NAS we implemented earlier is quite simple, but that’s how all
Multi-trial approaches act in general.
But for now, let’s try to get the same result using the One-shot NAS approach!
First, we create the Supernet model depicted in Figure 5-2 (in Listing 5-5 [TensorFlow
implementation] and Listing 5-6 [PyTorch implementation]).
We import necessary modules:
class TfLeNetNaiveSupernet(Model):
def __init__(self):
super().__init__()
self.pool1 = MaxPool2D(pool_size = 2)
self.pool2 = MaxPool2D(pool_size = 2)
self.flatten = Flatten()
self.fc1 = Dense(128, 'relu')
self.fc2 = Dense(10, 'softmax')
The call method accepts a mask parameter, which activates a particular candidate layer in the sum merge operation. The mask parameter is not passed in training mode, so all candidates are summed:
But in evaluation mode, we pass the mask parameter and activate only the selected layers:
x = self.flatten(x)
x = self.fc1(x)
return self.fc2(x)
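The merge logic itself is omitted in the excerpt above. A minimal sketch of the masked sum for the first mutable layer, assuming the candidate convolutions are stored in a list self.conv1_candidates and the call signature is call(self, x, mask=None) (both are assumptions for illustration):

# inside call(), a sketch of the sum merge for the first mutable layer
outputs = [conv(x) for conv in self.conv1_candidates]  # Conv 1x1, 3x3, 5x5
if mask is None:
    # training mode: every candidate contributes to the sum
    x = tf.add_n(outputs)
else:
    # evaluation mode: only the candidate activated by the mask contributes
    x = tf.add_n([m * out for m, out in zip(mask[0], outputs)])
x = self.pool1(x)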
def __init__(self):
super(PtLeNetNaiveSupernet, self).__init__()
self.flat = nn.Flatten()
self.fc1 = nn.Linear(1568, 128)
self.fc2 = nn.Linear(128, 10)
But in evaluation mode, we pass the mask parameter and activate only the selected layers:
x = torch.relu(x)
x = F.max_pool2d(x, 2, 2)
x = self.flat(x)
x = torch.relu(self.fc1(x))
x = self.fc2(x)
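The PyTorch forward follows the same masked-sum pattern; a sketch for the first mutable layer, assuming the candidates live in an nn.ModuleList self.conv1_candidates (an assumption for illustration):

# inside forward(), a sketch of the sum merge for the first mutable layer
outputs = [conv(x) for conv in self.conv1_candidates]  # Conv 1x1, 3x3, 5x5
if mask is None:
    x = sum(outputs)                  # training: sum all candidates
else:
    x = sum(m * out for m, out in zip(mask[0], outputs))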
Next, we train the Supernet and evaluate different candidate layer combinations in
Listing 5-7 (TensorFlow implementation) and Listing 5-8 (PyTorch implementation).
We import necessary modules:
Initializing Supernet:
model = TfLeNetNaiveSupernet()
Training Supernet:
train(model)
_, (x, y) = mnist_dataset()
kernel1_choices = [1, 3, 5]
kernel2_choices = [1, 3, 5]
results = {}
for m1 in range(0, len(kernel1_choices)):
for m2 in range(0, len(kernel2_choices)):
# activation mask
mask = [[0, 0, 0], [0, 0, 0]]
# activating conv1 and conv2 layers
mask[0][m1] = 1
mask[1][m2] = 1
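        # (sketch of the omitted evaluation step; x and y come from
        #  mnist_dataset() above, and numpy is assumed to be imported as np)
        preds = model(x, mask=mask)
        accuracy = float(np.mean(np.argmax(preds.numpy(), axis=1) == y))
        results[(kernel1_choices[m1], kernel2_choices[m2])] = accuracy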
Displaying results:
print('=======')
print('Results:')
for k, v in results.items():
print(f'Conv1 {k[0]}x{k[0]}, Conv2: {k[1]}x{k[1]} : {v}')
Initializing Supernet:
model = PtLeNetNaiveSupernet()
Training Supernet:
train_model(model)
_, (x, y) = mnist_dataset()
x = torch.from_numpy(x).float()
y = torch.from_numpy(y).long()
x = torch.permute(x, (0, 3, 1, 2))
model.eval()
kernel1_choices = [1, 3, 5]
kernel2_choices = [1, 3, 5]
results = {}
for m1 in range(0, len(kernel1_choices)):
for m2 in range(0, len(kernel2_choices)):
# activation mask
mask = [[0, 0, 0], [0, 0, 0]]
# activating conv1 and conv2 layers
mask[0][m1] = 1
mask[1][m2] = 1
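# Hedged completion of the truncated evaluation step (PyTorch version):
with torch.no_grad():
    predictions = model(x, mask = mask)
predicted_labels = torch.argmax(predictions, dim = 1)
acc = (predicted_labels == y).float().mean().item()
k1, k2 = kernel1_choices[m1], kernel2_choices[m2]
results[(k1, k2)] = acc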
Displaying results:
print('=======')
print('Results:')
for k, v in results.items():
print(f'Conv1 {k[0]}x{k[0]}, Conv2: {k[1]}x{k[1]} : {v}')
The best neural architecture found by One-shot NAS is (Conv 5×5, Conv 5×5),
which is exactly the same as the result of Multi-trial NAS. Incredible, isn’t it? We found
the same result in a much shorter time!
Note The results presented in Table 5-2 are only needed to rank various
combinations of architectures to pick the best one. These results do not
characterize the accuracy of the corresponding combination. One-shot models are
typically only used to rank architectures in the Model Space. The best-performing
architectures are retrained from scratch after the search is completed.
Since we have an intuition about the One-shot NAS approach, we can start
implementing it using the NNI framework.
Supernet Architecture
As we saw in the previous section, one of the main concepts in One-shot NAS is the
Supernet. Supernet is a single neural network that contains all the various neural
network architectures from the defined Model Space. Supernet is trained once according
to the One-shot NAS technique, and then the optimal subnet is selected. In Multi-
trial NAS, each Data Flow Graph is tried separately. But One-shot NAS creates a single
Supernet based on all possible Data Flow Graphs in Model Space.
NNI creates the Model Space for One-shot NAS using LayerChoice and InputChoice operations. LayerChoice candidates form a special block in the Supernet. Each LayerChoice candidate transforms the input tensor, and then their output tensors are reduced according to a particular One-shot NAS algorithm. The reduce operation can be sum, mean, or any other operation that merges tensors. InputChoice candidate tensors are reduced in the Supernet in the same way as LayerChoice candidates. Figure 5-5 demonstrates a Model Space constructed with LayerChoice and InputChoice operations.
The Model Space depicted in Figure 5-5 generates a Supernet with reduce operations. Figure 5-6 demonstrates a Supernet with sum as the reduce operation.
This restriction does not exist in Multi-trial NAS since the TensorFlow and PyTorch frameworks allow layer parameters to be calculated depending on the input tensor; therefore, LayerChoice candidates could return tensors of various sizes. In the case of One-shot NAS, we must be sure that the candidates return tensors of the same size. Otherwise, the NAS algorithm fails with an error.
Let’s create our first Model Space for One-shot NAS. It will be a “Hello World”
model, which we will use to test One-shot NAS algorithms in the next section. We will
define Model Space for One-shot search using LeNet architecture variations depicted in
Figure 5-9.
The NNI implementation for TensorFlow of the Model Space depicted in Figure 5-9 is provided in Listing 5-9. The LayerChoice and InputChoice classes are implemented in the nni.nas.tensorflow.mutables package:
class TfLeNetSupernet(Model):
def __init__(self):
super().__init__()
self.conv1 = LayerChoice([
create_conv(kernel = 1, filters = 16), # 0
create_conv(kernel = 3, filters = 16), # 1
create_conv(kernel = 5, filters = 16) # 2
], key = 'conv1')
self.conv2 = LayerChoice([
create_conv(kernel = 1, filters = 32), # 0
create_conv(kernel = 3, filters = 32), # 1
create_conv(kernel = 5, filters = 32) # 2
], key = 'conv2')
self.pool = MaxPool2D(2)
self.flat = Flatten()
# branch 1
x1 = self.fc12(self.fc11(x))
# branch 2
x2 = self.fc2(x)
return x
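The lines that merge the two branches are truncated just before return x. Judging by the 'dm' key reported in the results below, the branches are merged with an InputChoice; a hedged reconstruction (the attribute name self.dm and its constructor arguments are assumptions):
# defined in __init__ (assumed): self.dm = InputChoice(n_candidates = 2, n_chosen = 1, key = 'dm')
x = self.dm([x1, x2])
return x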
The NNI implementation for PyTorch of the Model Space depicted in Figure 5-9 is provided in Listing 5-10. The LayerChoice and InputChoice classes are implemented in the nni.retiarii.nn.pytorch package:
class PtLeNetSupernet(nn.Module):
def __init__(self):
super().__init__()
self.conv1 = LayerChoice(OrderedDict(
[
('conv1x1->16', create_conv(1, 1, 16)), # 0
('conv3x3->16', create_conv(3, 1, 16)), # 1
('conv5x5->16', create_conv(5, 1, 16)), # 2
]
), label = 'conv1')
self.conv2 = LayerChoice(OrderedDict(
[
('conv1x1->32', create_conv(1, 16, 32)), # 0
('conv3x3->32', create_conv(3, 16, 32)), # 1
('conv5x5->32', create_conv(5, 16, 32)), # 2
]
), label = 'conv2')
self.act = nn.ReLU()
self.flat = nn.Flatten()
x = self.flat(x)
# branch 1
x1 = self.act(self.fc11(x))
x1 = self.act(self.fc12(x1))
# branch 2
x2 = self.act(self.fc2(x))
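# The end of the forward method is truncated here. Assuming an InputChoice
# labeled 'dm' was defined in __init__, for example
#   self.dm = InputChoice(n_candidates = 2, n_chosen = 1, label = 'dm'),
# the two branches would be merged and returned as:
return self.dm([x1, x2])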
One-Shot Algorithms
One-shot NAS is still a young area and is developing rapidly; new algorithms implementing the One-shot concept appear regularly. This section will study two of the most popular One-shot algorithms: Efficient Neural Architecture Search (ENAS) and Differentiable Architecture Search (DARTS).
The RL Controller picks a subnet and runs one or several training epochs. The core of the ENAS approach is the weight-sharing technique: the same layers in different subnets share the same weights. When the Controller picks a subnet, the subnet is not trained from scratch; its layers share weights that have already been trained. The weight-sharing concept is demonstrated in Figure 5-11.
The weight-sharing approach allows the Controller to find the best architecture in a small number of iterations and without retraining a new subnet from scratch each time. The ENAS algorithm can be demonstrated using the following pseudo-code.
We initialize a Supernet: S
And load the train and validation datasets: train_ds, val_ds
Next, the algorithm initializes the ENAS Controller (Controller(S, θ)), where θ denotes the Controller weights that help choose the optimal subnet from the Supernet S:
Ctrl = Controller(S, θ)
train_once(subnet, batch)
Ctrl.add_experience(reward)
Ctrl.self_update_with_new_experience()
Ctrl.best()
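Putting these fragments together, the whole ENAS loop can be sketched with the following Python-style pseudo-code. It is illustrative only and is not the NNI implementation; the helpers sample_subnet, evaluate, take, and load_datasets are hypothetical names used only to connect the fragments above:
S = Supernet()                       # weight-sharing Supernet
train_ds, val_ds = load_datasets()
Ctrl = Controller(S, theta)          # RL Controller with weights theta

for step in range(mutator_steps):
    subnet = Ctrl.sample_subnet()            # 1. Controller samples a subnet
    for batch in take(train_ds, child_steps):
        train_once(subnet, batch)            # 2. one-step training on shared weights
    reward = evaluate(subnet, val_ds)        # 3. validation accuracy as reward
    Ctrl.add_experience(reward)              # 4. Controller learns from the reward
    Ctrl.self_update_with_new_experience()

best_subnet = Ctrl.best()                    # converged best architecture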
In the algorithm described earlier, the Controller trains various subnets on the
training dataset and then tests them on the validation dataset. By repeating this process
many times, the Controller understands which architectures show the best accuracy
and gradually reduces the exploration process to a limited number of architectures. At
the end of the training process, the Controller converges to one or several of the best
architectures.
Let’s see how ENAS works in practice:
Next, Controller chooses the most promising subnets to train using the weight-
sharing technique and tests their accuracy on the validation dataset, as shown in
Figure 5-13.
One-shot ENAS algorithm is close to RL Strategy from Multi-trial NAS, but with one
significant difference. ENAS does not make a complete subnet training cycle but uses
weight sharing and incremental one-step training. This difference dramatically speeds
up the process of finding the best architecture.
NNI implements ENAS using the following classes:
• PyTorch: nni.retiarii.oneshot.pytorch.enas.EnasTrainer
• TensorFlow: nni.algorithms.nas.tensorflow.enas.EnasTrainer
Now let’s move on to the practical application of the ENAS algorithm for the LeNet
Model Space we defined in the previous section.
Initializing LeNetSupernet:
model = TfLeNetSupernet()
After loading the datasets, we define the loss function:
loss = SparseCategoricalCrossentropy(
from_logits = True,
reduction = Reduction.NONE
)
Defining optimizer:
optimizer = Adam()
num_epochs = 10
batch_size = 256
Initializing EnasTrainer:
trainer = enas.EnasTrainer(
model,
loss = loss,
metrics = accuracy,
reward_function = reward_accuracy,
optimizer = optimizer,
batch_size = batch_size,
num_epochs = num_epochs,
dataset_train = dataset_train,
dataset_valid = dataset_valid,
log_frequency = 10,
child_steps = 10,
mutator_steps = 30
)
trainer.train()
best = get_best_model(trainer.mutator)
print(best)
Listing 5-11 returns the following best model as the result of ENAS algorithm:
• conv1: 1 (Conv3×3)
• conv2: 2 (Conv5×5)
• dm: 0 (Linear256→Linear10)
The PyTorch version (Listing 5-12) follows the same steps. Importing modules:
import torch.nn as nn
from nni.retiarii.oneshot.pytorch.enas import EnasTrainer
from torch.optim.sgd import SGD
import ch5.datasets as datasets
from ch5.model.lenet.pt_lenet import PtLeNetSupernet
from ch5.pt_utils import accuracy, reward_accuracy
Initializing LeNetSupernet:
model = PtLeNetSupernet()
After loading the datasets, we define the loss criterion:
criterion = nn.CrossEntropyLoss()
Defining optimizer:
optimizer = SGD(
model.parameters(), 0.05,
momentum = 0.9, weight_decay = 1.0E-4
)
batch_size = 256
log_frequency = 50
num_epochs = 10
ctrl_kwargs = {"tanh_constant": 1.1}
Initializing EnasTrainer:
trainer = EnasTrainer(
model,
loss = criterion,
metrics = accuracy,
reward_function = reward_accuracy,
optimizer = optimizer,
batch_size = batch_size,
num_epochs = num_epochs,
dataset = dataset_train,
log_frequency = log_frequency,
ctrl_kwargs = ctrl_kwargs,
ctrl_steps_aggregate = 20
)
trainer.fit()
best_model = trainer.export()
print(best_model)
Listing 5-12 returns the following best model as the result of ENAS algorithm:
• conv1: 1 (Conv3×3)
• conv2: 1 (Conv3×3)
• dm: 0 (Linear256→Linear10)
ENAS is one of the first One-shot NAS algorithms that made the community rethink
the whole approach to Neural Architecture Search. But ENAS may seem complicated to
an inexperienced reader due to the complex internal algorithm and nontrivial tuning. In
the next section, we’ll study a more elegant One-shot NAS technique.
DARTS relaxes the categorical choice among candidate operations oi into a softmax-weighted mixture:

$$ o'(x) = \sum_i \frac{\exp(\alpha_i)}{\sum_j \exp(\alpha_j)}\, o_i(x) $$

This means that the DARTS algorithm creates a Supernet derived from the Model Space with sum reducing, and each candidate operation oi is weighted by a parameter αi that specifies the weight of the operation. This makes the {αi} parameter set trainable, just like the Supernet weights. At the end of Supernet training, the choices with the highest α values are chosen as the optimal subnet operations. Figure 5-15 illustrates this concept.
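To make the relaxation concrete, here is a small self-contained PyTorch sketch of a softmax-weighted mixed operation. It is illustrative only; NNI's DartsTrainer handles this machinery internally:
import torch
import torch.nn as nn
import torch.nn.functional as F

class MixedOp(nn.Module):
    def __init__(self, candidates):
        super().__init__()
        self.candidates = nn.ModuleList(candidates)
        # one architecture weight alpha_i per candidate operation
        self.alpha = nn.Parameter(torch.zeros(len(candidates)))

    def forward(self, x):
        weights = F.softmax(self.alpha, dim = 0)   # exp(alpha_i) / sum_j exp(alpha_j)
        return sum(w * op(x) for w, op in zip(weights, self.candidates))

# after training, the candidate with the highest alpha is selected:
# best_index = torch.argmax(mixed_op.alpha).item()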
During Supernet training with the DARTS algorithm, inefficient choices tend to be zeroed out, and the search converges to a single architecture, which is the search result. Figure 5-16 visualizes the DARTS algorithm applied to the LeNet Supernet: it gradually suppresses inefficient layers, revealing the best architecture.
NNI 2.7 implements DARTS only for PyTorch framework using the following class:
nni.retiarii.oneshot.pytorch.DartsTrainer.
Table 5-4 shows DartsTrainer parameters.
import torch
import torch.nn as nn
import ch5.datasets as datasets
from nni.retiarii.oneshot.pytorch import DartsTrainer
from ch5.model.lenet.pt_lenet import PtLeNetSupernet
from ch5.pt_utils import accuracy
Initializing LeNetSupernet:
model = PtLeNetSupernet()
After loading the datasets, we define the loss criterion:
criterion = nn.CrossEntropyLoss()
Defining optimizer:
optim = torch.optim.SGD(
model.parameters(), 0.025,
momentum = 0.9, weight_decay = 3.0E-4
)
num_epochs = 10
batch_size = 256
metrics = accuracy
Initializing DartsTrainer:
trainer = DartsTrainer(
model = model,
loss = criterion,
metrics = metrics,
optimizer = optim,
num_epochs = num_epochs,
dataset = dataset_train,
batch_size = batch_size,
log_frequency = 10,
unrolled = False
)
trainer.fit()
best_architecture = trainer.export()
print('Best architecture:', best_architecture)
Listing 5-13 returns the following best model as the result of the DARTS algorithm:
• conv1: conv5x5->16
• conv2: conv5x5->32
• dm: 1 (Linear10)
Next, let's apply One-shot NAS to a larger search space. We will build a GeneralSupernet from a sequence of cells, where each cell's block operation is chosen from the following candidates:
• SepConvBranch(3)
• NonSepConvBranch(3)
• SepConvBranch(5)
• NonSepConvBranch(5)
• AvgPoolBranch
• MaxPoolBranch
The implementations of these layers are not provided here, but the reader can get
details in the following source code files:
• TensorFlow: ch5/model/general/tf_ops.py
• PyTorch: ch5/model/general/pt_ops.py
Figure 5-17 depicts block operation space.
Each nth cell accepts one required input, which is transformed by the block operation, and n additional inputs. The additional inputs are optional and can be zeroed out in different subnets. The normalized sum of the block operation output and the additional inputs forms the cell output. Figure 5-18 demonstrates examples of cell spaces.
The sequence of cells forms the GeneralSupernet, and the output of each cell can
be the input of the subsequent cell. After every three cells, a FactorizedReduced layer is
inserted. In this section, we will use a GeneralSupernet with six cells. Figure 5-19 depicts
GeneralSupernet architecture.
Let’s calculate how many subnets GeneralSupernet has: 6 × (6×2) × (6×2²) × (6×2³) × (6×2⁴) × (6×2⁵) = 6⁶ × 2¹⁵ ≈ 1,500,000,000. Of course, it is not possible to efficiently explore this Model Space using a Multi-trial NAS approach. It is also possible to use the GeneralSupernet with 9, 12, or 24 cells. In this case, the number of subnets will become enormous.
There are a lot of predefined Supernets for One-shot NAS that are aimed at solving
a specific class of problems. And GeneralSupernet we defined earlier is one of the
simplest. Let’s implement GeneralSupernet and run One-shot NAS on the CIFAR-10
dataset.
class Cell(MutableScope):
self.block_op = LayerChoice([
build_conv(filters, 3, 'conv3'),
build_separable_conv(filters, 3, 'sepconv3'),
build_conv(filters, 5, 'conv5'),
build_separable_conv(filters, 5, 'sepconv5'),
build_avg_pool(filters, 'avgpool'),
build_max_pool(filters, 'maxpool'),
], key = f'op_{cell_ord}')
out = self.block_op(inputs[-1])
return self.batch_norm(out)
class GeneralSupernet(Model):
def __init__(
self,
num_cells = 6,
filters = 24,
num_classes = 10
):
super().__init__()
self.num_cells = num_cells
self.stem = Sequential([
Conv2D(filters, kernel_size = 3, padding = 'same', use_bias
= False),
BatchNormalization()
])
self.cells = []
self.pool_layers = []
self.gap = GlobalAveragePooling2D()
self.dense = Dense(num_classes)
cur = cell(prev_outputs)
prev_outputs.append(cur)
cur = self.gap(cur)
logits = self.dense(cur)
return logits
Since we have defined the GeneralSupernet, we can launch ENAS using the following script.
Initializing GeneralSupernet:
model = GeneralSupernet()
After loading the datasets, we define the loss function:
loss = SparseCategoricalCrossentropy(
from_logits = True,
reduction = Reduction.NONE
)
Declaring optimizer:
metrics = accuracy
reward_function = reward_accuracy
batch_size = 256
num_epochs = 100
Initializing EnasTrainer:
trainer = enas.EnasTrainer(
model,
loss = loss,
metrics = metrics,
reward_function = reward_function,
optimizer = optimizer,
batch_size = batch_size,
num_epochs = num_epochs,
dataset_train = dataset_train,
dataset_valid = dataset_valid
)
Launching training:
trainer.train()
Displaying results:
best = get_best_model(trainer.mutator)
print(best)
Note Duration ~ 6 hours on Intel Core i7 with CUDA (GeForce GTX 1050)
After the search was completed, the following report was returned:
• op_layer_0: 3 SepConvBranch(5)
• op_layer_1: 3 SepConvBranch(5)
• op_layer_2: 1 SepConvBranch(3)
• op_layer_3: 4 NonSepConvBranch(3)
• op_layer_4: 1 SepConvBranch(3)
• op_layer_5: 1 SepConvBranch(3)
As seen in Figure 5-20, the best architecture does not use the pooling operations, AvgPoolBranch and MaxPoolBranch, and this makes sense because the GeneralSupernet has built-in FactorizedReduced layers.
class Cell(nn.Module):
self.block_op = LayerChoice(OrderedDict([
('SepConvBranch(3)', ConvBranch(in_f, out_f, 3, 1, 1, False)),
('NonSepConvBranch(3)', ConvBranch(in_f, out_f, 3, 1, 1, True)),
('SepConvBranch(5)', ConvBranch(in_f, out_f, 5, 1, 2, False)),
('NonSepConvBranch(5)', ConvBranch(in_f, out_f, 5, 1, 2, True)),
('AvgPoolBranch', PoolBranch('avg', in_f, out_f, 3, 1, 1)),
('MaxPoolBranch', PoolBranch('max', in_f, out_f, 3, 1, 1))
]), label = f'op_{cell_ord}')
out = self.block_op(inputs[-1])
return self.batch_norm(out)
class GeneralSupernet(nn.Module):
def __init__(
self,
num_cells = 6,
out_f = 24,
in_channels = 3,
num_classes = 10
):
super().__init__()
self.num_cells = num_cells
self.pool_layers_idx = [
cell_id
for cell_id in range(1, num_cells + 1) if cell_id % 3 == 0
]
self.cells = nn.ModuleList()
self.pool_layers = nn.ModuleList()
self.gap = nn.AdaptiveAvgPool2d(1)
self.dense = nn.Linear(out_f, num_classes)
import torch
import torch.nn as nn
import ch5.datasets as datasets
from nni.retiarii.oneshot.pytorch import DartsTrainer
from ch5.model.general.pt_general import GeneralSupernet
from ch5.pt_utils import accuracy
Initializing GeneralSupernet:
model = GeneralSupernet()
After loading the datasets, we define the loss criterion:
criterion = nn.CrossEntropyLoss()
Declaring optimizer:
optim = torch.optim.SGD(
model.parameters(), 0.025,
momentum = 0.9, weight_decay = 3.0E-4
)
num_epochs = 100
batch_size = 128
accuracy_metrics = accuracy
Initializing DartsTrainer:
trainer = DartsTrainer(
model = model,
loss = criterion,
metrics = accuracy_metrics,
optimizer = optim,
num_epochs = num_epochs,
dataset = dataset_train,
batch_size = batch_size,
log_frequency = 10,
unrolled = False
)
Launching training:
trainer.fit()
Displaying results:
best_architecture = trainer.export()
print('Best architecture:', best_architecture)
Note Duration ~ 4 hours on Intel Core i7 with CUDA (GeForce GTX 1050)
After the search was completed, the following report was returned:
• op_layer_0: SepConvBranch(3)
• op_layer_1: SepConvBranch(5)
• op_layer_2: SepConvBranch(5)
• op_layer_3: SepConvBranch(5)
• op_layer_4: SepConvBranch(5)
• op_layer_5: MaxPoolBranch
The architectures obtained using ENAS (Figure 5-20) and DARTS (Figure 5-21) are similar. They tend to use the SepConvBranch(5) operation and share the Cell0 output. The ENAS best architecture and the DARTS best architecture achieve 91.2% and 92.8% accuracy, respectively. But we can further improve the accuracy if we increase the number of cells (num_cells) in the GeneralSupernet. This will make the search longer, but it will result in a more accurate target architecture. The beautiful thing is that we can use the same GeneralSupernet and One-shot algorithm for any pattern recognition problem. This gives us a universal approach to solving typical deep learning problems. One-shot NAS is definitely one of the most significant achievements of automated deep learning.
And here, we again face the problem that there is no unique approach for solving any
situation, and the No Free Lunch theorem applies here as well. But understanding how
each algorithm acts will help you make the right choice for solving a particular problem.
Summary
One-shot NAS is a very promising area of study. It allows you to find neural network
solutions in a reasonable time. Currently, One-shot algorithms can discover completely
new architectural keys to solve the most complex problems. This field is developing
rapidly and will be a handy tool in any researcher’s toolkit. In this chapter, we introduced
the basic concepts of One-shot NAS and mastered using two of its algorithms: ENAS
and DARTS. This can be a good starting point for putting One-shot NAS into practice. In
the next chapter, we will consider the important problem of model compression, which
allows you to eliminate unnecessary neural network elements without losing its accuracy.
CHAPTER 6
Model Pruning
Deep learning models have reached significant success in many real-life problems. A
lot of devices use neural networks to perform everyday tasks. However, complex neural
networks are computationally expensive. And not all devices have GPU processors
to run deep learning models. Therefore, it is helpful to apply model compression methods to reduce the model size and accelerate model performance without losing accuracy significantly. One of the main model compression techniques is model pruning. Pruning optimizes the model by eliminating some of the model weights. It can eliminate a significant portion of the model weights with negligible damage to model performance. A pruned model is lighter and faster. Pruning is a straightforward approach that can give nice model speedup results.
NNI provides a toolkit to help users to execute model pruning algorithms. NNI 2.7
version (which is used in this book) supports pruning for the PyTorch framework only.
This chapter will study several pruning algorithms and learn how to apply them in
practice.
Pruning is a great technique, but sometimes, it degrades the model. It is not always
possible to remove weights without compromising the model’s accuracy in complex
neural networks. However, model accuracy degradation can be minimal, and we’ll see
further that it is possible to compress a model by 80% with almost no accuracy decrease.
First, we define the LeNet model that we will prune:
class PtLeNetModel(nn.Module):
x = F.relu(self.conv3(x))
x = F.max_pool2d(x, 2, 2)
x = torch.relu(self.fc1(x))
x = torch.relu(self.fc2(x))
x = torch.relu(self.fc3(x))
Next, we add two helper methods that we will need in the future: count_nonzero_weights and count_total_weights. count_nonzero_weights counts the number of nonzero weights in the neural network, and count_total_weights counts the total number of neural network weights.
def count_nonzero_weights(self):
counter = 0
for params in list(self.parameters()):
counter += torch.count_nonzero(params).item()
return counter
def count_total_weights(self):
counter = 0
for params in list(self.parameters()):
counter += torch.numel(params)
return counter
Now let's prune a pre-trained LeNet model and fine-tune the result. Importing modules:
import os
import random
import matplotlib.pyplot as plt
import torch
from nni.algorithms.compression.v2.pytorch.pruning import LevelPruner
CUR_DIR = os.path.dirname(os.path.abspath(__file__))
model = PtLeNetModel()
path = f'{CUR_DIR}/../data/lenet.pth'
model.load_state_dict(torch.load(path))
original_acc = model.test_model(test_ds)
original_nzw = model.count_nonzero_weights()
Next, we prune the original model with the one-shot LevelPruner (the pruning algorithm's internals will be explained in the next section):
# Pruning Config
prune_config = [{
'sparsity': .8,
'op_types': ['default'],
}]
# LevelPruner
pruner = LevelPruner(model, prune_config)
model_pruned, _ = pruner.compress()
epochs = 10
acc_list = []
for epoch in range(1, epochs + 1):
model_pruned.train_model(epochs = 1, train_dataset = train_ds)
acc = model_pruned.test_model(test_dataset = test_ds)
acc_list.append(acc)
print(f'Pruned: Epoch {epoch}. Accuracy: {acc}.')
pruned_nzw = model_pruned.count_nonzero_weights()
plt.title(
'Fine-tuning\n'
f'Original Non-zero weights number: {original_nzw}\n'
f'Pruned Non-zero weights number: {pruned_nzw}')
plt.axhline(y = original_acc, c = "red",
label = 'Original model accuracy')
plt.plot(acc_list, label = 'Pruned model accuracy')
plt.xlabel('Retraining Epochs')
plt.ylabel('Accuracy')
plt.legend()
plt.show()
Figure 6-2 illustrates fine-tuning of the compressed model. The original trained LeNet model has 59,786 nonzero weights, while the compressed LeNet model has only 12,107 nonzero weights. At the beginning of the retraining, the compressed model is less accurate than the original model. But after 10 training epochs, the compressed model achieves the same accuracy as the original model, having only 12,107 active weights out of 59,786. Obviously, the original LeNet model was redundant and could be replaced by the pruned one.
Finally, let’s save the pruned model for future usage:
model_path = f'{CUR_DIR}/../data/lenet_pruned.pth'
mask_path = f'{CUR_DIR}/../data/mask.pth'
pruner.export_model(
model_path = model_path,
mask_path = mask_path
)
NNI provides a special wrapper, nni.compression.pytorch.ModelSpeedup, that allows loading pruned models and using them afterward. In Listing 6-3, we load the pruned LeNet model and analyze its characteristics.
(Please install the following package to run this script: torchsummary.)
Importing modules:
import os
import torch
from nni.compression.pytorch import ModelSpeedup
from torchsummary import summary
from ch6.model.pt_lenet import PtLeNetModel
CUR_DIR = os.path.dirname(os.path.abspath(__file__))
acc = model_pruned.test_model()
print(acc)
The pruned model returns 0.9916 accuracy, the same as the original unpruned model. Also, the pruned model actually shrinks its layers, deleting unnecessary weights. The pruned model is smaller than the original one; let's examine the difference between them:
model_original.load_state_dict(torch.load(model_path))
print('==== PRUNED MODEL =====')
summary(model_pruned, (1, 28, 28))
print('=========================')
As shown in Table 6-1, pruned model shrinks Conv2d-1, Conv2d-3, Linear-4, and
Linear-5 layers. And this is the primary goal of the pruning algorithm, which eliminates
redundancy from the neural network. The earlier example illustrates how we can prune
a pre-trained model using NNI. Let’s move forward and study pruning algorithms in
more detail.
One-Shot Pruners
One-shot pruning algorithms prune weights only once, based on a specific metric. Usually, the pruned weights are close to zero, and the pruner suggests that their removal will not impact the model's accuracy. One-shot pruners act the following way:
• The pruner accepts a model and selects the active weights and the weights to be pruned. As a result, the pruner returns a model and a binary mask, where 1 means an active weight and 0 means a weight to be pruned.
We can define the main steps to implement one-shot pruning algorithms using NNI the following way:
1. Load a pre-trained model
2. Initialize the pruner
3. Compress the model
4. Export the pruned model and its mask
5. Speed up the pruned model with ModelSpeedup
6. Evaluate the compressed model
Step 1. Load a pre-trained model:
model = SomeModel()
model.load_state_dict(torch.load(model_path))
Step 2. To initialize the pruner, we must specify the pruner configuration (we will study
the pruner configuration in the next section). Here is an example of pruner initialization:
prune_config = [{
'sparsity': .8,
'op_types': ['default'],
}]
pruner = LevelPruner(model, prune_config)
Step 3. The pruner compresses the original model, applying its logic to reduce the model weights:
model_pruned, masks = pruner.compress()
And that’s it. The compressed model is ready to use! Let’s create a helper script
that will apply steps 2–6 we defined earlier. Listing 6-4 applies a pruning algorithm and
returns a compressed model.
Importing modules:
import copy
import os
import torch
from nni.compression.pytorch import ModelSpeedup
from torchsummary import summary
CUR_DIR = os.path.dirname(os.path.abspath(__file__))
def oneshot_prune(
model_original,
pruner_cls,
pruner_config,
train_ds,
epochs = 10,
model_input_shape = (1, 1, 28, 28)
):
pruner_name = pruner_cls.__name__
model = copy.deepcopy(model_original)
model_path = f'{CUR_DIR}/../data/{pruner_name}_pruned.pth'
mask_path = f'{CUR_DIR}/../data/{pruner_name}_mask.pth'
pruner.export_model(
model_path = model_path,
mask_path = mask_path
)
dummy_input = torch.randn(model_input_shape)
model_pruned = model_original.__class__()
model_pruned.load_state_dict(torch.load(model_path))
speedup = ModelSpeedup(model_pruned, dummy_input, mask_path)
speedup.speedup_model()
model_pruned.eval()
return model_pruned
Fine. Since we know how to apply one-shot pruning algorithms, let’s go further and
study some of them.
Pruner Configuration
Each pruner accepts configuration, which specifies its internal logic. Pruner
configuration is a List of Dict entries, and each entry specifies a pruning strategy
applied to a specified layer set. Table 6-2 describes pruner configuration parameters.
sparsity: Specifies the sparsity for each layer in this configuration entry to be compressed. If sparsity = 0.8, then 80% of the layer weights will be pruned, and 20% will be left active.
op_types: Specifies what types of operations to compress. 'default' means following the algorithm's default setting. All supported module types for PyTorch are defined in the package file nni/compression/pytorch/default_layers.py: 'Conv1d', 'Conv2d', 'Conv3d', 'ConvTranspose1d', 'ConvTranspose2d', 'ConvTranspose3d', 'Linear', 'Bilinear', 'PReLU', 'Embedding', 'EmbeddingBag'.
op_names: Specifies the names of operations to be compressed. If this field is omitted, operations will not be filtered by it.
op_partial_names: Operation partial names to be compressed. If op_partial_names = 'fc_', then all layers matching the mask 'fc_*' will be pruned.
exclude: Default is False. If this field is True, the operations with the specified types and names will be excluded from the compression.
Here is an example of a configuration that applies different pruning strategies to different layer types:
prune_config = [
{
'sparsity': .8,
'op_types': ['Conv2d'],
},
{
'sparsity': .6,
'op_types': ['Linear'],
},
{
'op_names': ['fc3'],
'exclude': True
}
]
If you don’t want to specify a special pruning strategy for each layer type, then you
can use the following configuration:
prune_config = [
{
'sparsity': .8,
'op_types': ['default']
}
]
Level Pruner
Level pruner is a straightforward one-shot pruner. The sparsity level means the prune ratio; that is, sparsity = 0.7 means that 70% of the model weight parameters will be pruned. Level pruner sorts the weights in the specified layers by their absolute values and then masks the smallest-magnitude weights to zero until the desired sparsity level is reached.
Level pruner is applied the following way:
prune_config = [{
'sparsity': .8,
'op_types': ['default'],
}]
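Conceptually, the masking step of the level pruner can be sketched as follows. This is an illustration of the idea only, not NNI's internal implementation:
import torch

def level_mask(weight, sparsity):
    # return a 0/1 mask keeping the largest-magnitude weights
    k = int(weight.numel() * sparsity)            # number of weights to prune
    if k == 0:
        return torch.ones_like(weight)
    threshold = weight.abs().flatten().kthvalue(k).values
    return (weight.abs() > threshold).float()     # 1 = keep, 0 = prune

w = torch.randn(128, 64)
mask = level_mask(w, sparsity = 0.8)              # roughly 80% of the entries are masked out
w_pruned = w * mask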
Let's apply LevelPruner to prune the LeNet model using Listing 6-5. After importing the necessary modules, we load the original model and the datasets:
original = PtLeNetModel.load_model()
train_ds, test_ds = mnist_dataset()
We will prune Conv2d layers with 0.8 sparsity and Linear layers with 0.6 sparsity.
Also, we will exclude the final classifier linear layer fc3 from pruning.
prune_config = [
{
'sparsity': .8,
'op_types': ['Conv2d'],
},
{
'sparsity': .6,
'op_types': ['Linear'],
},
{
'op_names': ['fc3'],
'exclude': True
}
]
Defining pruner:
pruner_cls = LevelPruner
visualize_mask(mask)
Figure 6-4 shows the mask of a pruned model. We see that the mask leaves 40%
active weights for linear layers and 20% active weights for convolutional layers.
The original model has 0.991 accuracy, and the compressed one has the same 0.991
accuracy. Table 6-3 compares the architectures of original and compressed models.
Level pruner compressed the original LeNet model without decreasing its accuracy.
FPGM Pruner
FPGM (Filter Pruning via Geometric Median) Pruner is a one-shot pruner that prunes the filters closest to the geometric median of all filters in a layer, that is, the most redundant ones. For more details, please refer to the original paper “Filter Pruning via Geometric Median for Deep Convolutional Neural Networks Acceleration” (https://arxiv.org/pdf/1811.00250.pdf).
FPGM Pruner supports Conv2d, Linear as layers for pruning operation. FPGM
Pruner is applied the following way:
prune_config = [{
'sparsity': .8,
'op_types': ['Conv2d'],
}]
original = PtLeNetModel.load_model()
train_ds, test_ds = mnist_dataset()
Pruning convolutional layers of original model using FPGMPruner with 0.5 sparsity:
visualize_mask(mask)
Figure 6-5 shows the mask of the compressed model. We see that the mask leaves
50% active weights for Conv2d layers.
Original model has 0.991 accuracy, while the compressed one has close accuracy
0.9894. Table 6-4 compares the architectures of original and compressed models.
Table 6-4 shows that FPGMPruner compressed the original model very heavily with
almost no loss of accuracy.
L2Norm Pruner
L2Norm Pruner is a one-shot pruner that prunes the filters with the smallest L2 norm of their weights. It is applied the following way:
prune_config = [{
'sparsity': .8,
'op_types': ['Conv2d'],
}]
original = PtLeNetModel.load_model()
train_ds, test_ds = mnist_dataset()
visualize_mask(masks)
Figure 6-6 visualizes the mask of the compressed model. We see that the mask leaves
30% active weights for Conv2d layers.
And let’s compare the architectures of the original and compressed model:
Original model has 0.991 accuracy, while the compressed one degrades to 0.98
accuracy. Table 6-5 compares the architectures of original and compressed models.
Table 6-5 shows that L2Norm Pruner compressed the original model almost five
times, degrading from 0.991 to 0.98 accuracy.
Iterative Pruners
One-shot pruners are easy to use but have one major drawback: we must guess the optimal sparsity values in advance. We usually want to maximize model compression without decreasing accuracy significantly. The natural solution would be to iterate over several sparsity values to find the optimal one. This is exactly what iterative pruners were designed for. Iterative pruners prune weights over several iterations, controlling the pruning schedule, and include some automatic pruning algorithms. After the iterative pruning algorithm completes all iterations, it selects the best pruned model according to a specified score (which is not necessarily an accuracy score). Figure 6-7 demonstrates iterative pruning in action.
This section will examine two popular iterative tuners: linear pruner and
AGP pruner.
Linear Pruner
Linear pruner is an iterative pruner. It increases sparsity evenly from zero with each iteration. For example, if the final sparsity is set to 0.5 and the iteration number is 5, then the sparsity used in each iteration is [0.1, 0.2, 0.3, 0.4, 0.5].
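The schedule itself is trivial to compute; for example:
def linear_sparsity_schedule(final_sparsity, iterations):
    return [final_sparsity * (i + 1) / iterations for i in range(iterations)]

print(linear_sparsity_schedule(0.5, 5))   # [0.1, 0.2, 0.3, 0.4, 0.5] (up to floating-point rounding)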
pruner = LinearPruner(
model = original,
config_list = config_list,
pruning_algorithm = 'l1',
total_iteration = 4,
finetuner = finetuner,
evaluator = finetuner,
speedup = True,
dummy_input = dummy_input_tensor
)
pruner.compress()
We will detail the iterative pruner configuration parameters in the “Iterative Pruner Configuration” section.
AGP Pruner
AGP is an iterative pruner in which the sparsity is increased from an initial sparsity value s_i = 0 to a final sparsity value s_f over a span of n pruning iterations, starting at training step t_0 and with pruning frequency Δt:
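The cubic schedule from the referenced paper is

$$ s_t = s_f + \left(s_i - s_f\right)\left(1 - \frac{t - t_0}{n\,\Delta t}\right)^{3}, \qquad t \in \{t_0,\ t_0 + \Delta t,\ \ldots,\ t_0 + n\,\Delta t\}. $$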
For more details, please refer to the original paper “To prune, or not to prune: Exploring the efficacy of pruning for model compression” (https://arxiv.org/pdf/1710.01878.pdf).
pruner = AGPPruner(
model = original,
config_list = config_list,
pruning_algorithm = 'l1',
total_iteration = 4,
finetuner = finetuner,
evaluator = finetuner,
speedup = True,
dummy_input = dummy_input_tensor
)
pruner.compress()
We will detail the iterative pruner configuration parameters in the “Iterative Pruner Configuration” section.
Let's handle the following scenario: find the best compressed LeNet model that fits within a 30,000-weight model size (the original size is 59,786) using the iterative LinearPruner.
After importing the necessary modules, we load the original model and define the pruning configuration:
CUR_DIR = os.path.dirname(os.path.abspath(__file__))
original = PtLeNetModel.load_model()
config_list = [
{'sparsity': 0.85, 'op_types': ['Conv2d']},
{'sparsity': 0.4, 'op_types': ['Linear']},
{'op_names': ['fc3'], 'exclude': True} # excluding final layer
]
Now we need to define a method that calculates the model’s score. All models whose
size exceeds 30,000 will have a 0 score because we do not accept models larger than the
specified size.
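The evaluator itself is not shown above. A minimal sketch following that description, reusing the helper methods defined on PtLeNetModel earlier in this chapter, could be:
def evaluator(model):
    # reject models above the size budget with a zero score
    if model.count_total_weights() > 30000:
        return 0
    # otherwise, score the model by its test accuracy
    return model.test_model()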
pruner = LinearPruner(
model = original,
config_list = config_list,
pruning_algorithm = 'l1',
total_iteration = 10,
finetuner = lambda m: m.train_model(epochs = 1),
evaluator = evaluator,
speedup = True,
dummy_input = torch.rand(10, 1, 28, 28),
log_dir = CUR_DIR # logging results (model.pth and mask.pth is there)
)
pruner.compress()
Receiving results:
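The result-retrieval line is omitted above. In NNI 2.x, iterative pruners expose a get_best_result() method; the following is a hedged sketch that matches the variables used below, and the assumed return ordering should be checked against the installed NNI version:
# assumed ordering: (task id, compact model, masks, score, per-layer config list)
_, compressed, _, best_score, best_sparsity = pruner.get_best_result()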
Now let’s analyze results returned by LinearPruner. First, let’s display a sparsity of
each layer of the best pruned model:
print('===========')
print(f'Best accuracy: {best_score}')
print('Best Sparsity:')
for layer in best_sparsity:
print(f'{layer}')
And finally, let’s compare the original and the best pruned model:
# Displaying comparison
train_ds, test_ds = mnist_dataset()
model_comparison(original, compressed, test_ds, (1, 28, 28))
Original model has 0.991 accuracy, while the best compressed one that fits
30,000 size has 0.9913 accuracy. Table 6-8 compares the architectures of original and
compressed models.
Great! We have compressed the original LeNet model to less than half its size without losing accuracy.
Listing 6-9 handles another scenario: find the minimal compressed LeNet model that gives > 0.98 accuracy (the original accuracy is 0.991) using the iterative AGPPruner.
Listing 6-9. Minimal size above specified accuracy threshold scenario. ch6/algos/iter/agp_pruner_min_size_scr.py
Importing modules:
import os
from math import inf
import torch
from nni.algorithms.compression.v2.pytorch.pruning import AGPPruner
from ch6.algos.utils import model_comparison
from ch6.datasets import mnist_dataset
from ch6.model.pt_lenet import PtLeNetModel
CUR_DIR = os.path.dirname(os.path.abspath(__file__))
original = PtLeNetModel.load_model()
config_list = [
{'sparsity': 0.85, 'op_types': ['Conv2d']},
{'sparsity': 0.4, 'op_types': ['Linear']},
{'op_names': ['fc3'], 'exclude': True} # excluding final layer
]
Now we need to define a method that calculates the model’s score. All models
whose accuracy is lower than 0.98 will have a -inf score because we do not accept such
poor models. If the model performs better than 0.98, then its score is (-count_total_
weights):
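Again, the evaluator is omitted in the excerpt; a minimal sketch following that description:
def evaluator(model):
    acc = model.test_model()
    if acc < 0.98:
        return -inf                       # reject models below the accuracy threshold
    return -model.count_total_weights()   # among acceptable models, smaller is better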
pruner = AGPPruner(
model = original,
config_list = config_list,
pruning_algorithm = 'l2',
total_iteration = 20,
finetuner = lambda m: m.train_model(epochs = 1),
evaluator = evaluator,
speedup = True,
dummy_input = torch.rand(10, 1, 28, 28),
log_dir = CUR_DIR # logging results (model.pth and mask.pth is there)
)
pruner.compress()
Receiving results:
Now let’s analyze results returned by AGPPruner. First, let’s display a sparsity of each
layer of the best pruned model:
print('===========')
print(f'Best accuracy: {best_score}')
print('Best Sparsity:')
for layer in best_sparsity:
print(f'{layer}')
And finally, let’s compare the original and the best pruned model:
# Displaying comparison
train_ds, test_ds = mnist_dataset()
model_comparison(original, compressed, test_ds, (1, 28, 28))
Original model has 0.991 accuracy and 59,786 size, while the minimal model that
exceeds 0.98 accuracy has 0.9883 accuracy and 11,535 size. Table 6-10 compares the
architectures of original and compressed models.
Yes, the accuracy of the original model degraded from 0.991 to 0.9883, but we
compressed the original model more than five times! We made our model lightweight,
and now it is more attractive for economical usage.
In this section, we have demonstrated the practical use of iterative pruners in real-
world cases. We see that iterative pruning significantly benefits practical deep learning
deployment problems.
NNI provides a rich set of pruning algorithms:
• Slim Pruner
• Activation APoZ Rank Pruner
• ADMM Pruner
• Movement Pruner
• AMC Pruner
Summary
Pruning is an essential part of automated deep learning. It addresses the neural network complexity problem. In addition to the fact that we need an accurate neural network, we also need a lightweight neural network. We will always prefer a neural network with 1M parameters over a neural network with 10M parameters if they have the same accuracy. In this chapter, we have covered the basic principles of model pruning using NNI. Model pruning is a significant direction of neural network optimization that allows the integration of machine learning models into simple devices.
CHAPTER 7
NNI Recipes
In the previous chapters, we studied various NNI features and applications. NNI is a
very efficient automated deep learning tool that solves complex deep learning problems.
We have witnessed that many NNI experiments can last days or even weeks. Therefore,
it is crucial to organize experiments properly. Otherwise, a lot of valuable information
and efforts can be lost. On the other hand, NNI uses sophisticated mathematical search
algorithms to find the optimal solution in the shortest time in the vast search space.
Time is a precious resource. So it is also essential to speed up the NNI execution, which
will help maximize the efficiency. It is great to understand the mathematical core of
algorithms NNI implements, but it is also important to know how to use NNI effectively.
This chapter will examine patterns and recipes that can make NNI interactions much more effective. These recipes should help speed up and stabilize your experiments and make research more developer friendly.
Speed Up Trials
It is essential to speed up the Trial execution in HPO and Multi-trial NAS. The
completion of the search algorithm depends on the duration of the Trials, so Trial speed
optimization is the first thing a developer should start with. Here, we will mention basic
rules a reader should follow to construct a fast Trial.
Use the GPU. One of the most common ways to speed up neural network
computations is to use a GPU. Properly configuring the model for GPU usage is the
developer’s responsibility. If your machine has GPUs, ensure they are utilized during
NNI Trial execution.
Do not download the dataset twice. A common mistake is downloading heavy datasets without caching them on the disk. Please make sure that the downloaded dataset is cached on disk and that the Trial does not attempt to download ten gigabytes from the Internet on every run.
Use the duration panel to determine the longest-running Trials. This can help in
finding abnormally long Trials. Figure 7-1 shows NNI duration panel.
Use dry trial runs to debug trials. Each trial can be run manually as a Python script,
which can help find performance issues and bottlenecks. Try running the Trial several
times to check its performance before launching the experiment.
Start–Stop–Resume
Keep in mind that each experiment can be manually stopped and resumed after. All
experiment information is stored in the NNI output folder (path is defined by NNI_
OUTPUT_DIR environment variable, ~/nni-experiments by default). Therefore, you can
stop the experiment at any time using the following command:
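The stop command itself is not shown here. From the command line, the standard way is

nnictl stop <experiment-id>

In embedded (stand-alone) mode, you can call experiment.stop() or simply terminate the script. A stopped experiment can then be resumed by its ID: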
experiment = Experiment('local')
experiment.resume('experiment_id', port)
while True:
sleep(1)
if experiment.get_status() == 'DONE':
break
You can also use the WebUI to update the Experiment configuration and
search space.
NNI and TensorBoard
NNI can be integrated with TensorBoard. This is very practical if you want to visualize
additional Trial metrics. Let’s look at an example of integrating NNI with TensorBoard.
Make sure tensorboard is installed in your environment. Listing 7-1 illustrates a dummy
Trial implementation that writes metrics using TensorBoard format.
import os
from random import random
import nni
from torch.utils.tensorboard import SummaryWriter
Initializing SummaryWriter, writing the logs into the Trial's NNI output directory so that the WebUI can find them:
log_dir = os.path.join(os.environ.get('NNI_OUTPUT_DIR', '.'), 'tensorboard')
writer = SummaryWriter(log_dir)
if __name__ == '__main__':
p = nni.get_next_parameter()
for i in range(100):
# dummy metrics for illustration
acc = random()
loss = random()
writer.add_scalar('Accuracy', acc, i)
writer.add_scalar('Loss', loss, i)
nni.report_intermediate_result(acc)
nni.report_final_result(acc)
You can run a dummy experiment that uses the Trial from Listing 7-1 with the standard nnictl create --config <experiment config> command.
Once the experiment has started, you can go to the Trial jobs panel on the Trials detail page, select the Trials you want to analyze, and click the TensorBoard button, as shown in Figure 7-3.
After clicking the TensorBoard button, NNI starts the TensorBoard process passing
Trial log directories as its input and redirects the browser to its web page. Figure 7-4
shows TensorBoard panel with metrics we have collected during dummy trials we
defined in Listing 7-1.
NNI runs an actual TensorBoard process, so you can stop it when you’re done with it,
as shown in Figure 7-5.
Integrating TensorBoard into your NNI Experiments can help you analyze Trial results and the progress of the whole Experiment.
You can use this trick if you want to move your experiment to a more powerful server
or if you want to share the results of an experiment.
Scaling Experiments
Scaling is the most natural approach to speed up Experiment execution. You can use
multiple servers to distribute Trial jobs. NNI implements the Training Service concept.
Training Service is an environment that performs Trial jobs. We have only used the Local
Training Service in this book, which means that all calculations are done on the local
machine. But you can organize an Experiment using various Remote Training Services.
NNI 2.7 supports the following environments as Training Services:
Many search algorithms allow concurrent Trial execution, so you can horizontally
scale the experiment, significantly increasing its speed. Figure 7-7 illustrates this
concept.
Let’s look at an example configuration that uses the Remote Training Service.
Common configuration part:
trialConcurrency: 4
maxTrialNumber: 100
searchSpace:
  x:
    _type: quniform
    _value: [1, 100, 0.1]
trialCodeDirectory: .
trialCommand: python3 trial.py
tuner:
  name: Random
trainingService:
  platform: remote
  machineList:
You can apply Remote Training Service in embedded (stand-alone) NNI mode as
follows:
# Loading Packages
from nni.experiment import Experiment, RemoteConfig, RemoteMachineConfig
from pathlib import Path
nni_host_ip = '10.10.120.20'
remote_ip = '10.10.120.21'
remote_ssh_user = 'nni_user'
remote_ssh_pass = 'nni_pass'
remote_python_path = '/opt/python3/bin'
# Experiment Configuration
experiment = Experiment('remote')
experiment.config.experiment_name = 'Remote Experiment'
experiment.config.trial_concurrency = 4
experiment.config.trial_command = 'python3 trial.py'
experiment.config.trial_code_directory = Path(__file__).parent
experiment.config.max_trial_number = 1000
experiment.config.search_space = search_space
experiment.config.tuner.name = 'Random'
experiment.config.nni_manager_ip = nni_host_ip
remote_service = RemoteConfig()
remote_machine = RemoteMachineConfig()
remote_machine.host = remote_ip
remote_machine.user = remote_ssh_user
remote_machine.password = remote_ssh_pass
remote_machine.python_path = remote_python_path
remote_service.machine_list = [remote_machine]
experiment.config.training_service = remote_service
Starting NNI:
http_port = 8080
experiment.start(http_port)
while True:
if experiment.get_status() == 'DONE':
break
The remote server must have the same Python environment installed as the Experiment host server. During the experiment, NNI copies the experiment information to the remote server and executes the Trial jobs there as regular Python processes.
NNI provides rich explanations concerning Training Services. Please refer to the official documentation for more details: https://nni.readthedocs.io/.
Shared Storage
NNI scaling we considered in the previous section has one serious drawback. Training
Services return only Trial metrics (nni.report_intermediate_result and nni.report_
final_result) to Experiment host server. All Trial logs are stored on the machine they
are executed as shown in Figure 7-8.
This is not convenient because the logs are located in different places.
To solve this problem, NNI provides a Shared Storage implementation that allows
you to store all Trial logs in one place, accessible to the NNI Experiment. Figure 7-9
depicts architecture of NNI Experiment with Shared Storage.
There are two ways to implement Shared Storage in Experiment: NFS and Azure
Blob. Here is a sample configuration for NFS Shared Storage:
# Experiment Configuration
...
# Training Service Configuration
...
# Shared Storage Configuration
sharedStorage:
  storageType: NFS
  localMountPoint: ${your/local/mount/point}
  remoteMountPoint: ${your/remote/mount/point}
  nfsServer: ${nfs-server-ip}
  exportedDirectory: ${nfs/exported/directory}
  localMounted: nnimount
# Values for localMounted:
Please refer to the official documentation for more details concerning the Shared Storage implementation: https://nni.readthedocs.io/.
The One-shot NAS training process has some serious practical limitations, which make working with such an effective method much more difficult. Let's try to eliminate these limitations using the PyTorch LeNet Supernet (ch7/one_shot_nas/pt_lenet.py) and the DartsTrainer implementation as an example.
DartsTrainer accepts a user-defined method that calculates Supernet accuracy
during the training process. To visualize the training process, we can use tensorboard
logging with SummaryWriter that logs each training iteration’s accuracy. Listing 7-2
demonstrates how to visualize training progress.
(Full code is provided in the corresponding file: ch7/one_shot_nas/pt_utils.py.)
Initializing SummaryWriter:
cd = os.path.dirname(os.path.abspath(__file__))
dt = datetime.now().strftime("%Y-%m-%d_%H-%M-%S")
369
Chapter 7 NNI Recipes
tb_summary = SummaryWriter(f'{cd}/runs/{dt}')
iter_counter = 0
global iter_counter
...
Calculating results:
res = dict()
for k in topk:
correct_k = correct[:k].reshape(-1).float().sum(0)
accuracy = correct_k.mul_(1.0 / batch_size).item()
res["acc{}".format(k)] = accuracy
return res
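The only missing piece above is the actual logging call; a hedged sketch of how each computed value could be written to TensorBoard (the complete version is in ch7/one_shot_nas/pt_utils.py):
# inside the accuracy function, after filling the res dictionary:
for k, acc in res.items():
    tb_summary.add_scalar(k, acc, iter_counter)   # e.g., the 'acc1' curve in TensorBoard
iter_counter += 1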
This trick allows us to solve the main problems concerning the One-shot NAS process. Listing 7-3 demonstrates how a One-shot NAS process with checkpoint dumping can be implemented.
We use pickle (from the Python standard library) to dump the trainer's binary image:
import os
from os.path import exists
import torch
import torch.nn as nn
import ch7.datasets as datasets
from nni.retiarii.oneshot.pytorch import DartsTrainer
from ch7.one_shot_nas.pt_lenet import PtLeNetSupernet
from ch7.one_shot_nas.pt_utils import accuracy
cd = os.path.dirname(os.path.abspath(__file__))
trainer_checkpoint_path = f'{cd}/darts_trainer_checkpoint.bin'
def get_darts_trainer():
# Supernet
model = PtLeNetSupernet()
# Dataset
dataset_train, dataset_valid = datasets.get_dataset("mnist")
# Optimizer
optim = torch.optim.SGD(
model.parameters(), 0.025,
momentum = 0.9, weight_decay = 3.0E-4
)
batch_size = 256
metrics = accuracy
criterion = nn.CrossEntropyLoss()
# the DartsTrainer arguments mirror the earlier LeNet DARTS example
darts_trainer = DartsTrainer(
model = model,
loss = criterion,
metrics = metrics,
optimizer = optim,
num_epochs = 1,  # epochs per training cycle; adjusted in train_and_dump()
dataset = dataset_train,
batch_size = batch_size,
log_frequency = 10,
unrolled = False
)
return darts_trainer
The following method trains the Supernet for a specified number of epochs and dumps the trainer.
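A minimal sketch of such a method, assuming the trainer can be re-fit for a few epochs at a time and pickled directly (both are assumptions; the complete version is in ch7/one_shot_nas/darts_train_with_checkpoint.py):
import pickle

def train_and_dump(darts_trainer, epochs):
    # train for a few more epochs (assumes num_epochs can be adjusted between fit() calls)
    darts_trainer.num_epochs = epochs
    darts_trainer.fit()
    # dump the whole trainer: model weights, architecture parameters, and optimizer state
    with open(trainer_checkpoint_path, 'wb') as f:
        pickle.dump(darts_trainer, f)
    return darts_trainer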
And here is the main script that loads the trainer from binary checkpoint if necessary
and splits the whole training loop into multiple subcycles:
if __name__ == '__main__':
if exists(trainer_checkpoint_path):
for _ in range(10):
trainer = train_and_dump(trainer, epochs = 5)
To visualize the training progress, launch TensorBoard pointing at the runs directory:
tensorboard --logdir=ch7/one_shot_nas/runs/
And now, we can monitor the Supernet training progress using the following link: http://localhost:6006/#scalars. Figure 7-12 demonstrates the TensorBoard web page.
But the most important thing is that you can stop the execution of ch7/one_shot_
nas/darts_train_with_checkpoint.py script and then run it again. DartsTrainer will be
restored from the binary checkpoint file ch7/one_shot_nas/darts_trainer_checkpoint.
bin and continue Supernet training. Figure 7-13 shows that DartsTrainer continues
training from the checkpoint, not from scratch.
Summary
This chapter examined several tricks and patterns that can facilitate your user
experience. NNI is an open source developer-friendly framework, so you can implement
your own ideas and approaches in your research and experiments. With this chapter, we complete the book. I hope that you can now appreciate the effectiveness of the NNI framework and that it becomes an indispensable tool for your daily research activity.