Sherpa: Robust Hyperparameter Optimization for
Machine Learning
Lars Hertel (a,*), Julian Collado (b), Peter Sadowski (c), Jordan Ott (b), Pierre Baldi (b)
(a) Department of Statistics, Donald Bren School of Information and Computer Sciences, University of California, Irvine, Bren Hall 2019, Irvine, CA 92697-1250, USA
(b) Department of Computer Science, Donald Bren School of Information and Computer Sciences, University of California, Irvine, 3019 Donald Bren Hall, Irvine, CA 92697-3435, USA
(c) Information and Computer Science, University of Hawai'i at Mānoa, 1680 East-West Rd, Honolulu, HI 96822, USA
(*) Corresponding author. Email addresses: [email protected] (Lars Hertel), [email protected] (Julian Collado), [email protected] (Peter Sadowski), [email protected] (Jordan Ott), [email protected] (Pierre Baldi)
Abstract
Sherpa is a hyperparameter optimization library for machine learning models.
It is specifically designed for problems with computationally expensive, iterative function evaluations, such as the hyperparameter tuning of deep neural
networks. With Sherpa, scientists can quickly optimize hyperparameters using a variety of powerful and interchangeable algorithms. Sherpa can be run
on either a single machine or in parallel on a cluster. Finally, an interactive
dashboard enables users to view the progress of models as they are trained,
cancel trials, and explore which hyperparameter combinations are working
best. Sherpa empowers machine learning practitioners by automating the
more tedious aspects of model tuning. Its source code and documentation
are available at https://github.com/sherpa-ai/sherpa.
Keywords: Hyperparameter Optimization, Machine Learning, Deep Neural
Networks
1. Motivation and significance
Hyperparameters are the tuning parameters of machine learning models. Hyperparameter optimization refers to the process of choosing optimal hyperparameters for a given model. This optimization is crucial to obtain
optimal performance from the machine learning model. Since hyperparameters cannot be learned directly from the training data, their optimization is often a process of trial and error conducted manually by the researcher. This approach has two problems. First, it is time consuming and can take days or even weeks of the researcher's attention. Second, it depends on the researcher's ability to interpret results and choose good hyperparameter settings. These limitations create a strong need to automate the process. Sherpa is a software package that addresses this need.
Existing hyperparameter optimization software can be divided into Bayesian optimization software, bandit and evolutionary algorithm software, framework-specific software, and all-round software. Software implementing Bayesian optimization started with SMAC [1], Spearmint [2], and HyperOpt [3]. More recent software in this category includes GPyOpt [4], RoBO [5], DragonFly [6], Cornell-MOE [7, 8], and mlrMBO [9]. These packages provide high-quality, stand-alone Bayesian optimization implementations, often with unique twists. However, most of them do not provide infrastructure for parallel training.
As an alternative to Bayesian optimization, multi-armed bandits and evolutionary algorithms have recently become popular. HpBandSter implements Hyperband [10] and BOHB [11], Pbt implements Population Based Training [12], PyCMA implements CMA-ES [13], and TPOT [14, 15] provides hyperparameter search via genetic programming.
A number of framework-specific libraries have also been proposed. Auto-WEKA [16] and Auto-Sklearn [17] focus on WEKA [18] and Scikit-learn [19],
respectively. Furthermore, a number of packages have been proposed for the
machine learning framework Keras [20]. Hyperas, Auto-Keras [21], Talos,
Kopt, and HORD each provide hyperparameter optimization specifically for
Keras. These libraries make it easy to get started due to their tight integration with the machine learning framework. However, researchers will
inevitably run into limitations when a different machine learning framework
is needed.
Lastly, a number of implementations aim at being framework agnostic and also support multiple optimization algorithms. Table 1 shows a detailed comparison of these "all-round" packages to Sherpa. Note that we excluded Google Vizier [22] and similar frameworks from other cloud computing providers since these are not free to use.
Sherpa is already being used in a wide variety of applications such as machine learning methods [27], solid state physics [28], particle physics [29], medical image analysis [30], and cyber security [31]. Since the number of machine learning applications is growing rapidly, we can expect a growing need for hyperparameter optimization software such as Sherpa.
Software         Distributed   Visualizations   Bayesian Optimization   Evolutionary   Bandit/Early-stopping
Sherpa           Yes           Yes              Yes                     Yes            Yes
Advisor          Yes           No               Yes                     Yes            Yes
Chocolate        Yes           No               Yes                     Yes            No
Test-Tube [23]   Yes           No               No                      No             No
Ray-Tune [24]    Yes           No               No                      Yes            Yes
Optuna [25]      Yes           Yes              Yes                     No             Yes
BTB [26]         No            No               Yes                     No             Yes
Table 1: Feature comparison of hyperparameter optimization frameworks. Bayesian optimization, evolutionary, and bandit/early-stopping refer to support for hyperparameter optimization algorithms based on these methods.
2. Software Description
2.1. Hyperparameter Optimization
We begin by laying out the components of a hyperparameter optimization.
Consider the training of a machine learning model. A user has a model that
is being trained with data. Before training there are hyperparameters that
need to be set. At the end of the training we obtain an objective value.
This workflow can be illustrated via the training of a neural network. The
model is a neural network. The data are images that the neural network is
trained on. The hyperparameter setting is the number of hidden layers of
the neural network. The objective is the prediction accuracy on a hold-out
dataset obtained at the end of training.
For automated hyperparameter optimization we also need hyperparameter ranges, a results table, and a hyperparameter optimization algorithm. The
hyperparameter ranges define what values each hyperparameter is allowed to
take. The results table stores hyperparameter settings and their associated objective values. Finally, the algorithm takes the results and ranges and produces a
new suggestion for a hyperparameter setting. We refer to this suggestion as
a trial.
For the neural network example the hyperparameter range might be 1,
2, 3, or 4 hidden layers. We might have previous results showing that 1 hidden layer corresponds
to 80% accuracy and 3 to 90% accuracy. The algorithm might then produce
a new trial with 4 hidden layers. After training the neural network with 4
hidden layers we find it achieves 88% accuracy and add this to the results.
Then the next trial is suggested.
2.2. Components
We now describe how Sherpa implements the components described in
Section 2.1. Sherpa implements hyperparameter ranges as sherpa.Parameter
objects. The algorithm is implemented as a sherpa.algorithms.Algorithm
object. A list of hyperparameter ranges and an algorithm are combined to
create a sherpa.Study (Figure 1). The study stores the results. Trials are
implemented as sherpa.Trial objects.
[Figure 1 diagram: Sherpa.Study holds parameters (one or more Sherpa.Parameter), an algorithm (Sherpa.Algorithm), a lower_is_better flag, and results (a Pandas.DataFrame), and exposes the methods get_suggestion(), add_observation(), and finalize().]
Figure 1: Diagram showing Sherpa’s Study class.
Sherpa implements two user interfaces. We will refer to the two interfaces
as API mode and parallel mode.
2.3. API Mode
In API mode the user interacts with the Study object. Given a study s:
1. A new trial t is obtained by calling s.get_suggestion() or by iterating over the study (e.g. for t in s).
2. First, t.parameters is used to initialize and train a machine learning model. Then s.add_observation(t, objective=o) is called to add objective o for trial t. Invalid observations are automatically excluded from the results.
3. Finally, s.finalize(t) informs Sherpa that the model training is finished.
Interacting with the Study class is easy and requires minimal setup. The limitation of API mode is that it cannot evaluate trials in parallel.
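To make these steps concrete, the following minimal sketch runs an API-mode optimization on a toy quadratic objective in place of a real model. Only calls described above are used; the max_num_trials argument for RandomSearch is assumed here to behave as it does for GPyOpt in Section 4.

import sherpa

# Toy search space and algorithm (illustration only).
parameters = [sherpa.Continuous('x', [-5.0, 5.0])]
algorithm = sherpa.algorithms.RandomSearch(max_num_trials=10)
study = sherpa.Study(parameters=parameters,
                     algorithm=algorithm,
                     lower_is_better=True)

for trial in study:                          # 1. obtain a new trial
    x = trial.parameters['x']
    objective = (x - 1.0) ** 2               # stand-in for training a model
    study.add_observation(trial, objective=objective)  # 2. report the objective
    study.finalize(trial)                    # 3. mark the trial as finished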
2.4. Parallel Mode
In parallel mode, multiple trials can be evaluated simultaneously. The user
provides two scripts: a server script and a machine learning (ML) script.
The server script defines the hyperparameter ranges, the algorithm, the job
scheduler, and the command to execute the machine learning script. The
optimization starts by calling sherpa.optimize.
In the machine learning script the user trains the machine learning model
given some hyperparameters and adds the resulting objective value to Sherpa.
Using a sherpa.Client instance c, a trial t is obtained by calling c.get_trial(). To add observations, c.send_metrics(trial=t, objective=o) is used.
Internally, sherpa.optimize runs a loop that uses the Study class. Figure 2 illustrates the parallel-mode architecture.
1. The loop submits new trials if resources are available by submitting
a job to the scheduler. Furthermore, the new trials are added to a
database. From there they can be retrieved by the client.
2. The loop updates results by querying the database for new results.
3. Finally, the loop checks whether jobs have finished. This means resources are free again. In addition, the corresponding trials can be
finalized.
If the user's machine learning script does not submit an objective value, for example because it crashed, Sherpa continues with the next trial.
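As a compact sketch of the two scripts (the full MNIST example appears in Section 4.1.2), a parallel-mode setup might look as follows. The file name trial.py and the toy objective are placeholders, and default arguments of sherpa.optimize (e.g. the output directory) are assumed to suffice.

# server.py -- defines the search and launches parallel evaluations
import sherpa
from sherpa.schedulers import LocalScheduler

parameters = [sherpa.Continuous('x', [-5.0, 5.0])]
algorithm = sherpa.algorithms.RandomSearch(max_num_trials=10)
sherpa.optimize(parameters=parameters, algorithm=algorithm,
                scheduler=LocalScheduler(resources=[0, 1]),  # placeholder resource ids
                lower_is_better=True,
                command='python trial.py', max_concurrent=2)

# trial.py -- evaluates a single trial and reports its objective
import sherpa

client = sherpa.Client()
trial = client.get_trial()
x = trial.parameters['x']
client.send_metrics(trial=trial, iteration=1, objective=(x - 1.0) ** 2)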
[Figure 2 diagram: the user's Sherpa script passes ranges, algorithm, command, and scheduler to sherpa.optimize(). Internally, the loop (1) submits new trials as jobs via the sherpa.Scheduler and adds them to the sherpa.Database, (2) updates results in the sherpa.Study via get_suggestion()/add_observation() as new results arrive from the database, and (3) updates active trials based on job status and calls finalize(). The user's ML script runs in parallel, retrieving trials and sending objective values through the sherpa.Client.]
Figure 2: Architecture diagram for parallel hyperparameter optimization in Sherpa. The
user only interacts with Sherpa via the solid red arrows; everything else happens internally.
3. Software Functionalities
3.1. Available Hyperparameter Types
Sherpa supports four hyperparameter types:
• sherpa.Continuous
• sherpa.Discrete
• sherpa.Choice
• sherpa.Ordinal.
These correspond to a range of floats, a range of integers, an unordered
categorical variable, and an ordered categorical variable, respectively. Each
parameter has name and range arguments. For continuous and discrete variables, the range expects a list defining the lower and upper bound. For choice and ordinal variables, the range expects the list of categories.
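For illustration, the four types could be declared as follows. The batch_size parameter and its values are made up for this sketch.

import sherpa

parameters = [
    sherpa.Continuous(name='learning_rate', range=[1e-4, 1e-2]),          # floats in [1e-4, 1e-2]
    sherpa.Discrete(name='num_units', range=[32, 128]),                   # integers in [32, 128]
    sherpa.Choice(name='activation', range=['relu', 'tanh', 'sigmoid']),  # unordered categories
    sherpa.Ordinal(name='batch_size', range=[16, 32, 64]),                # ordered categories
]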
3.2. Diversity of Algorithms
Sherpa aims to help researchers at various stages in their model development. For this reason, it provides a choice of hyperparameter tuning algorithms. The following optimization algorithms are currently supported.
• sherpa.algorithms.RandomSearch:
Random Search [32] samples hyperparameter settings uniformly from
the specified ranges. It is a robust algorithm because it explores the
space uniformly. Furthermore, with the dashboard the user can draw their own inferences from the results.
• sherpa.algorithms.GridSearch:
Grid Search follows a grid over the hyperparameter space and evaluates all combinations. It is useful to systematically explore one or two
hyperparameters. It is not recommended for more than two hyperparameters.
• sherpa.algorithms.bayesian_optimization.GPyOpt:
Bayesian optimization is a model-based search. For each trial it picks
the most promising hyperparameter setting based on prior results.
Sherpa’s implementation wraps the package GPyOpt [4].
• sherpa.algorithms.successive_halving.SuccessiveHalving:
Asynchronous Successive Halving (ASHA) [33] is a hyperparameter optimization algorithm based on multi-armed bandits. It allows the efficient exploration of a large hyperparameter space. This is accomplished
by the early stopping of unpromising trials.
• sherpa.algorithms.PopulationBasedTraining:
Population-based Training (PBT) [12] is an evolutionary algorithm.
The algorithm jointly optimizes a population of models and their hyperparameters. This is achieved by adjusting hyperparameters during
training. It is particularly suited for neural network training hyperparameters such as learning rate, weight decay, or batch size.
• sherpa.algorithms.LocalSearch:
Local Search is a heuristic algorithm. It starts with a seed hyperparameter setting. During optimization it randomly perturbs one hyperparameter at a time. If a setting improves on the seed then it becomes
the new seed. This algorithm is particularly useful if the user already
has a well performing hyperparameter setting.
All implemented algorithms allow parallel evaluation and can be used with
all available parameter types. An empirical comparison of the algorithms
can be found in the documentation (https://parameter-sherpa.readthedocs.io/en/latest/algorithms/algorithms.html).
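Because the algorithms share a common interface, switching between them only changes the line that constructs the algorithm. A minimal sketch, assuming RandomSearch accepts a max_num_trials argument as GPyOpt does in Section 4:

import sherpa
import sherpa.algorithms.bayesian_optimization as bayesian_optimization

parameters = [sherpa.Continuous('learning_rate', [1e-4, 1e-2])]

# Either algorithm can be passed unchanged to sherpa.Study or sherpa.optimize.
algorithm = sherpa.algorithms.RandomSearch(max_num_trials=50)
# algorithm = bayesian_optimization.GPyOpt(max_num_trials=50)

study = sherpa.Study(parameters=parameters,
                     algorithm=algorithm,
                     lower_is_better=True)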
3.3. Accounting for Random Variation
Sherpa can account for variation via the Repeat algorithm. The objective value of a model may vary between training runs, for example due to random initialization or stochastic training. The Repeat algorithm runs each hyperparameter setting multiple times so that this variation can be taken into account when analyzing results.
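A minimal sketch of this pattern is shown below, assuming Repeat wraps a base algorithm and takes the number of repetitions via a num_times argument; the exact argument names should be checked against the Sherpa documentation.

import sherpa

# Evaluate every suggested hyperparameter setting five times.
base = sherpa.algorithms.RandomSearch(max_num_trials=20)
algorithm = sherpa.algorithms.Repeat(algorithm=base, num_times=5)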
3.4. Visualization Dashboard
Sherpa provides an interactive web-based dashboard. It allows the user to
monitor progress of the hyperparameter optimization in real time. Figure 3
shows a screenshot of the dashboard.
At the top of the dashboard is a parallel coordinates plot [34, 35]. It
allows exploration of relationships between hyperparameter settings and objective values (Figure 3 top). Each vertical axis corresponds to a hyperparameter or the objective. The axes can be brushed over to select subsets of
trials. The plot is implemented using the D3.js parallel-coordinates library
by Chang [36]. At the bottom right is a line chart showing objective values against training iteration (Figure 3 bottom right). This chart allows the user to monitor the training progress of each trial. It is also useful for analyzing whether
a trial’s training converged. At the bottom left is a table of all completed
trials (Figure 3 bottom left). Hovering over trials in the table highlights the
corresponding lines in the plots. Finally, the dashboard has a stopping button (Figure 3 top right corner). This allows the user to cancel the training
for unpromising trials.
The dashboard runs automatically during a hyperparameter optimization.
It can be accessed in a web-browser via a link provided by Sherpa. The
dashboard is useful to quickly evaluate questions such as:
• Are the selected hyperparameter ranges appropriate?
• Is training unstable for some hyperparameter settings?
• Does a particular hyperparameter have little impact on the performance
of the machine learning algorithm?
• Are the best observed hyperparameter settings consistent?
Based on these observations the user can refine the hyperparameter ranges
or choose a different algorithm, if appropriate.
Figure 3: The dashboard provides a parallel coordinates plot (top) and a table of finished
trials (bottom left). Trials in progress are shown via a progress line chart (bottom right).
The figure is best viewed as a PDF and by zooming in.
3.5. Scaling up with a Cluster
In parallel mode Sherpa evaluates multiple trials concurrently. A job scheduler is
responsible for running the user’s machine learning script. The following job
schedulers are implemented.
• The LocalScheduler evaluates parallel trials on the same computation node. This scheduler is useful for running on multiple local CPU
cores or GPUs. It has a simple resource handler for GPU allocation
(see Figure 5 for an example).
• The SGEScheduler uses Sun Grid Engine (SGE) [37]. Submission
arguments and an environment profile can be specified via arguments
to the scheduler.
• The SLURMScheduler is based on SLURM [38]. Its interface is similar to the SGEScheduler.
Concurrency between workers is handled via MongoDB, a NoSQL database program. Parallel mode requires MongoDB to be installed on the system.
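For reference, selecting a scheduler is a one-line change in the server script. The sketch below uses the LocalScheduler call from Figure 5; the SGEScheduler argument names (submit_options, environment) are assumptions based on the description above, so the Sherpa documentation should be consulted for the exact interface.

from sherpa.schedulers import LocalScheduler, SGEScheduler

# Run trials on two local GPUs (the resource list holds GPU ids, as in Figure 5).
scheduler = LocalScheduler(resources=[0, 1])

# Or submit each trial as an SGE job instead (argument names assumed):
# scheduler = SGEScheduler(submit_options='-q gpu.q -P myproject.p',
#                          environment='/home/user/.profile')

# The chosen scheduler is then passed to sherpa.optimize(..., scheduler=scheduler).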
4. Illustrative Examples
4.1. Handwritten Digits Classification with a Neural Network
The following is an example of a Sherpa hyperparameter optimization.
It uses the MNIST handwritten digits dataset [39]. A Keras neural network
is used to classify the digits. The neural network has one hidden layer and
a softmax output. The hyperparameters are the learning rate of the Adam
[40] optimizer, the number of hidden units, and the hidden layer activation
function. The search is first conducted using Sherpa’s API mode. After that
we show the same example using Sherpa’s parallel mode.
4.1.1. API Mode
Figure 4 shows the hyperparameter optimization in Sherpa’s API mode.
The script starts with imports and loading of the MNIST dataset. Next,
the hyperparameters learning_rate, num_units, and activation are defined. These refer to the Adam learning rate, the number of hidden layer units, and the hidden layer activation function, respectively. The GPyOpt algorithm is chosen as the optimization algorithm. Hyperparameter ranges and algorithm are combined via the Study. The lower_is_better flag is set to False because we maximize classification accuracy, so lower objective values are not better. After that, a for-loop iterates over the study, yielding a trial at each iteration. A Keras model is instantiated using the trial's hyperparameter settings. The Keras model is iteratively trained and evaluated via an inner for-loop. We add an observation for each iteration and call finalize after training is finished. Note that we pass the loss as context to add_observation. The context argument accepts a dictionary with any additional metrics that the user wants to record.
Code to replicate this example is available as a Jupyter notebook (https://github.com/sherpa-ai/sherpa/blob/master/examples/keras_mnist_mlp.ipynb) and on Google Colab (https://colab.research.google.com/drive/1I19R1GfKPjlgNdHlxJwNC4PitvySsdon). A video tutorial is also available on YouTube (https://youtu.be/-exnF3uv0Ws). Tutorials using the Successive Halving (https://github.com/sherpa-ai/sherpa/blob/master/examples/keras_mnist_mlp_successive_halving.ipynb) and Population Based Training (https://github.com/sherpa-ai/sherpa/blob/master/examples/keras_mnist_mlp_population_based_training.ipynb) algorithms are also available.
import sherpa
import sherpa.algorithms.bayesian_optimization as bayesian_optimization
import keras
from keras.models import Sequential
from keras.layers import Dense, Flatten
from keras.datasets import mnist
from keras.optimizers import Adam

epochs = 15
(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train, x_test = x_train/255.0, x_test/255.0

# Sherpa setup
parameters = [sherpa.Continuous('learning_rate', [1e-4, 1e-2]),
              sherpa.Discrete('num_units', [32, 128]),
              sherpa.Choice('activation',
                            ['relu', 'tanh', 'sigmoid'])]
algorithm = bayesian_optimization.GPyOpt(max_num_trials=50)
study = sherpa.Study(parameters=parameters,
                     algorithm=algorithm,
                     lower_is_better=False)

for trial in study:
    lr = trial.parameters['learning_rate']
    num_units = trial.parameters['num_units']
    act = trial.parameters['activation']

    # Create model
    model = Sequential([Flatten(input_shape=(28, 28)),
                        Dense(num_units, activation=act),
                        Dense(10, activation='softmax')])
    optimizer = Adam(lr=lr)
    model.compile(loss='sparse_categorical_crossentropy',
                  optimizer=optimizer,
                  metrics=['accuracy'])

    # Train model
    for i in range(epochs):
        model.fit(x_train, y_train)
        loss, accuracy = model.evaluate(x_test, y_test)
        study.add_observation(trial=trial, iteration=i,
                              objective=accuracy,
                              context={'loss': loss})
    study.finalize(trial=trial)
Figure 4: An example showing how to tune the hyperparameters of a neural network on
the MNIST dataset using Sherpa in API mode.
4.1.2. Parallel Mode
We now show the same hyperparameter optimization using Sherpa's parallel mode. Figure 5 (top) shows the server script. First, the hyperparameters and search algorithm are defined. This time we also define a LocalScheduler instance. Hyperparameters, algorithm, and scheduler are passed to the sherpa.optimize function. We also pass the command 'python trial.py', which indicates how to execute the user's machine learning script. Furthermore, the argument max_concurrent=2 indicates that two evaluations will run at a time. Figure 5 (bottom) shows the machine learning script. First, we set environment variables for the GPU configuration. Next we create a Client. To obtain hyperparameters we call the client's get_trial method. During training we call the client's send_metrics method, which replaces add_observation in parallel mode. Also, in parallel mode no finalize call is needed.
# Server script
import sherpa
import sherpa.algorithms.bayesian_optimization as bayesian_optimization
from sherpa.schedulers import LocalScheduler

params = [sherpa.Continuous('learning_rate', [1e-4, 1e-2]),
          sherpa.Discrete('num_units', [32, 128]),
          sherpa.Choice('activation',
                        ['relu', 'tanh', 'sigmoid'])]
alg = bayesian_optimization.GPyOpt(max_num_trials=50)
sched = LocalScheduler(resources=[0, 1])
sherpa.optimize(parameters=params, algorithm=alg,
                scheduler=sched, lower_is_better=False,
                command='python trial.py', max_concurrent=2)

# Trial script (trial.py)
import sherpa
import os

GPU_ID = os.environ['SHERPA_RESOURCE']
os.environ['CUDA_VISIBLE_DEVICES'] = GPU_ID
import keras
from keras.models import Sequential
from keras.layers import Dense, Flatten
from keras.datasets import mnist
from keras.optimizers import Adam

epochs = 15
(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train, x_test = x_train/255.0, x_test/255.0

# Sherpa client
client = sherpa.Client()
trial = client.get_trial()
lr = trial.parameters['learning_rate']
num_units = trial.parameters['num_units']
act = trial.parameters['activation']

# Create model
model = Sequential([Flatten(input_shape=(28, 28)),
                    Dense(num_units, activation=act),
                    Dense(10, activation='softmax')])
optimizer = Adam(lr=lr)
model.compile(loss='sparse_categorical_crossentropy',
              optimizer=optimizer,
              metrics=['accuracy'])

# Train model
for i in range(epochs):
    model.fit(x_train, y_train)
    loss, accuracy = model.evaluate(x_test, y_test)
    client.send_metrics(trial=trial, iteration=i,
                        objective=accuracy,
                        context={'loss': loss})
Figure 5: A code listing showing how to use Sherpa in parallel mode to tune the hyperparameters of a neural network trained on the handwritten digits dataset MNIST. The top code listing shows the server script. The bottom listing shows the trial script.
4.2. Deep learning for Cloud Resolving Models
4.2.1. Introduction
The following illustrates an example of a Sherpa hyperparameter optimization in the field of climate modeling, specifically cloud resolving models
(CRM). We apply Sherpa to optimize the deep neural network (DNN) of
Rasp et al. [41].
The input to the model is a 94-dimensional vector. Features include
temperature, humidity, meridional wind, surface pressure, incoming solar
radiation, sensible heat flux, and latent heat flux. The output of the DNN is
a 65-dimensional vector. It is composed of the sum of the CRM and radiative
heating rates, the CRM moistening rate, the net radiative fluxes at the top
of the atmosphere and surface of the earth, and the observed precipitation.
4.2.2. General Hyperparameter Optimization
Initially a random search was conducted on the following hyperparameters: batch normalization [42], dropout [43, 44], Leaky ReLU coefficient [45],
learning rate, nodes per hidden layer, number of hidden layers. The parameter ranges were chosen to encompass the parameters specified in [41]. From
the dashboard (Figure A.6) we identify that the best performing configurations have low dropout, leaky ReLU coefficients mostly around 0.3 or larger,
and learning rates mostly near 0.002. The majority of good models have 8
layers and batch normalization. However, the number of units does not seem
to have a large impact. The hyperparameter ranges and best configuration
are provided in Tables A.2 and A.3 in the appendix.
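For reference, the search space from Table A.2 can be expressed in Sherpa roughly as follows. The parameter names are illustrative, and the scale='log' argument for a log-scaled continuous range is an assumption about the Continuous parameter.

import sherpa

parameters = [
    sherpa.Choice('batch_norm', ['yes', 'no']),
    sherpa.Continuous('dropout', [0.0, 0.25]),
    sherpa.Continuous('leaky_relu', [0.0, 0.4]),
    sherpa.Continuous('learning_rate', [0.0001, 0.01], scale='log'),
    sherpa.Discrete('nodes_per_layer', [200, 300]),
    sherpa.Discrete('num_layers', [8, 10]),
]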
4.2.3. Optimization of the Learning Rate Schedule
An additional search was conducted to fine-tune the DNN training hyperparameters. Specifically, the initial learning rate and the learning rate decay
were optimized. The range of initial learning rate values was ±10−4 of the
best value from Section 4.2.2. The range of learning rate decay factors was
0.5 to 1. The learning rate gets multiplied by this factor after every epoch
to produce a new learning rate. In comparison, the model in Rasp et al.
[41] uses a decay factor of approximately 0.58. The remaining hyperparameters were set to the best configuration from Section 4.2.2. A total of 50
trials were evaluated via random search. The best initial learning rate was found to be 0.001196 and the best decay factor was found to be 0.843784. The overall optimal hyperparameter setting is shown in Table A.3 in the appendix.
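As an illustration of this schedule (not the authors' training code), an exponential per-epoch decay with the values reported above can be expressed with a standard Keras callback:

from keras.callbacks import LearningRateScheduler

initial_lr = 0.001196   # best initial learning rate from the search
decay = 0.843784        # best per-epoch decay factor from the search

# After every epoch the learning rate is multiplied by the decay factor.
lr_schedule = LearningRateScheduler(lambda epoch: initial_lr * decay ** epoch)

# Usage: model.fit(x_train, y_train, epochs=epochs, callbacks=[lr_schedule])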
4.2.4. Results
We compare the model found by Sherpa to the model from Rasp et al.
[41] via R^2 plots (Figure A.7). The R^2 plots show the coefficient of determination at different pressures and latitudes. We find that the Sherpa model consistently outperforms the comparison model. In particular, it performs well at latitudes for which the prior model fails. Figure A.7f shows that the Sherpa model's loss continues to decrease after the Rasp et al. [41]
model has converged. This is the result of the learning rate fine-tuning from
Section 4.2.3.
5. Impact
Machine learning is used to an ever larger extent in the scientific community. Nearly every machine learning application can benefit from hyperparameter optimization. The issue is that researchers often do not have a practical tool at hand. Therefore, they usually resort to manually tuning parameters. Sherpa aims to be this tool. Its goal is to require minimal learning from the user to get started. It also aims to support the user as their needs for parallel evaluation or exotic optimization algorithms grow. As shown by the references in Section 1, Sherpa is already being used by researchers to achieve improvements in a variety of domains. In addition, the software has been downloaded more than 6000 times from the PyPI Python package manager (https://pepy.tech/project/parameter-sherpa). It also has over 160 stars on the software hosting website GitHub. A GitHub star means that another user has added the software to a personal list for later reference.
6. Conclusions
Sherpa is a flexible open-source software for robust hyperparameter optimization of machine learning models. It provides the user with several
interchangeable hyperparameter optimization algorithms, each of which may
be useful at different stages of model development. Its interactive dashboard
allows the user to monitor and analyze the results of multiple hyperparameter optimization runs in real-time. It also allows the user to see patterns
in the performance of hyperparameters to judge the robustness of individual settings. Sherpa can be used on a laptop or in a distributed fashion on
a cluster. In summary, rather than a black-box that spits out one hyperparameter setting, Sherpa provides the tools that a researcher needs when
doing hyperparameter exploration and optimization for the development of
machine learning models.
7. Conflict of Interest
We wish to confirm that there are no known conflicts of interest associated
with this publication and there has been no significant financial support for
this work that could have influenced its outcome.
Acknowledgements
We would like to thank Amin Tavakoli, Christine Lee, Gregor Urban, and
Siwei Chen for helping test the software and providing useful feedback, and
Yuzo Kanomata for computing support. This material is based upon work
supported by the National Science Foundation under grant number 1633631.
We also wish to acknowledge a hardware grant from NVIDIA.
References
[1] F. Hutter, H. H. Hoos, K. Leyton-Brown, Sequential model-based optimization for general algorithm configuration, in: International Conference on Learning and Intelligent Optimization, Springer, pp. 507–523.
[2] J. Snoek, H. Larochelle, R. P. Adams, Practical bayesian optimization
of machine learning algorithms, in: Advances in neural information
processing systems, pp. 2951–2959.
[3] J. Bergstra, D. Yamins, D. D. Cox, Hyperopt: A python library for
optimizing the hyperparameters of machine learning algorithms, in:
Proceedings of the 12th Python in Science Conference, Citeseer, pp.
13–20.
[4] The GPyOpt authors, GPyOpt: A Bayesian optimization framework in Python, http://github.com/SheffieldML/GPyOpt, 2016.
[5] A. Klein, S. Falkner, N. Mansur, F. Hutter, Robo: A flexible and robust
bayesian optimization framework in python, in: NIPS 2017 Bayesian
Optimization Workshop.
[6] K. Kandasamy, K. R. Vysyaraju, W. Neiswanger, B. Paria, C. R. Collins,
J. Schneider, B. Poczos, E. P. Xing, Tuning Hyperparameters without
Grad Students: Scalable and Robust Bayesian Optimisation with Dragonfly, arXiv preprint arXiv:1903.06694 (2019).
[7] J. Wu, P. Frazier, The parallel knowledge gradient method for batch
bayesian optimization, in: Advances in Neural Information Processing
Systems, pp. 3126–3134.
[8] J. Wu, M. Poloczek, A. G. Wilson, P. I. Frazier, Bayesian optimization
with gradients, in: Advances in Neural Information Processing Systems,
pp. 5267–5278.
[9] B. Bischl, J. Richter, J. Bossek, D. Horn, J. Thomas, M. Lang, mlrMBO: A Modular Framework for Model-Based Optimization of Expensive Black-Box Functions (2017).
[10] L. Li, K. Jamieson, G. DeSalvo, A. Rostamizadeh, A. Talwalkar, Hyperband: A novel bandit-based approach to hyperparameter optimization,
The Journal of Machine Learning Research 18 (2017) 6765–6816.
[11] S. Falkner, A. Klein, F. Hutter, Bohb: Robust and efficient hyperparameter optimization at scale, arXiv preprint arXiv:1807.01774 (2018).
[12] M. Jaderberg, V. Dalibard, S. Osindero, W. M. Czarnecki, J. Donahue, A. Razavi, O. Vinyals, T. Green, I. Dunning, K. Simonyan,
et al., Population based training of neural networks, arXiv preprint
arXiv:1711.09846 (2017).
[13] C. Igel, T. Suttorp, N. Hansen, A computational efficient covariance
matrix update and a (1+ 1)-cma for evolution strategies, in: Proceedings
of the 8th annual conference on Genetic and evolutionary computation,
ACM, pp. 453–460.
[14] R. S. Olson, R. J. Urbanowicz, P. C. Andrews, N. A. Lavender, L. C.
Kidd, J. H. Moore, Applications of Evolutionary Computation: 19th
European Conference, EvoApplications 2016, Porto, Portugal, March 30
– April 1, 2016, Proceedings, Part I, Springer International Publishing,
pp. 123–137.
[15] R. S. Olson, N. Bartley, R. J. Urbanowicz, J. H. Moore, Evaluation of
a tree-based pipeline optimization tool for automating data science, in:
Proceedings of the Genetic and Evolutionary Computation Conference
2016, GECCO ’16, ACM, New York, NY, USA, 2016, pp. 485–492.
[16] L. Kotthoff, C. Thornton, H. H. Hoos, F. Hutter, K. Leyton-Brown,
Auto-weka 2.0: Automatic model selection and hyperparameter optimization in weka, The Journal of Machine Learning Research 18 (2017)
826–830.
[17] M. Feurer, A. Klein, K. Eggensperger, J. Springenberg, M. Blum,
F. Hutter, Efficient and robust automated machine learning, in: Advances in Neural Information Processing Systems, pp. 2962–2970.
[18] G. Holmes, A. Donkin, I. H. Witten, Weka: A machine learning workbench (1994).
[19] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion,
O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, et al.,
Scikit-learn: Machine learning in python, Journal of machine learning
research 12 (2011) 2825–2830.
[20] F. Chollet, et al., Keras, https://keras.io, 2015.
[21] H. Jin, Q. Song, X. Hu, Auto-keras: An efficient neural architecture
search system, in: Proceedings of the 25th ACM SIGKDD International
Conference on Knowledge Discovery & Data Mining, ACM, pp. 1946–
1956.
[22] D. Golovin, B. Solnik, S. Moitra, G. Kochanski, J. Karro, D. Sculley,
Google vizier: A service for black-box optimization, in: Proceedings of
the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, pp. 1487–1495.
[23] W. Falcon, Test tube, https://github.com/williamfalcon/test-tube, 2017.
[24] R. Liaw, E. Liang, R. Nishihara, P. Moritz, J. E. Gonzalez, I. Stoica,
Tune: A research platform for distributed model selection and training,
arXiv preprint arXiv:1807.05118 (2018).
[25] T. Akiba, S. Sano, T. Yanase, T. Ohta, M. Koyama, Optuna: A nextgeneration hyperparameter optimization framework, in: Proceedings of
the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, ACM, pp. 2623–2631.
[26] L. Gustafson, Bayesian Tuning and Bandits: An Extensible, Open Source Library for AutoML, M.Eng. thesis, Massachusetts Institute of Technology, Cambridge, MA, 2018.
[27] P. Sadowski, P. Baldi, Neural network regression with beta, dirichlet,
and dirichlet-multinomial outputs (2018).
[28] Z. Cao, Y. Dan, Z. Xiong, C. Niu, X. Li, S. Qian, J. Hu, Convolutional
neural networks for crystal material property prediction using hybrid
orbital-field matrix and magpie descriptors, Crystals 9 (2019) 191.
[29] P. Baldi, J. Bian, L. Hertel, L. Li, Improved energy reconstruction in
nova with regression convolutional neural networks, Physical Review D
99 (2019) 012011.
[30] C. Ritter, T. Wollmann, P. Bernhard, M. Gunkel, D. M. Braun, J.-Y.
Lee, J. Meiners, R. Simon, G. Sauter, H. Erfle, et al., Hyperparameter
optimization for image analysis: application to prostate tissue images
and live cell data of virus-infected cells, International journal of computer assisted radiology and surgery (2019) 1–11.
[31] Z. Langford, L. Eisenbeiser, M. Vondal, Robust signal classification
using siamese networks, in: Proceedings of the ACM Workshop on
Wireless Security and Machine Learning, ACM, pp. 1–5.
[32] J. Bergstra, Y. Bengio, Random search for hyper-parameter optimization, Journal of Machine Learning Research 13 (2012) 281–305.
[33] L. Li, K. Jamieson, A. Rostamizadeh, E. Gonina, M. Hardt, B. Recht,
A. Talwalkar, Massively parallel hyperparameter tuning, arXiv preprint
arXiv:1810.05934 (2018).
[34] A. Inselberg, B. Dimsdale, Parallel coordinates for visualizing multidimensional geometry, in: Computer Graphics 1987, Springer, 1987,
pp. 25–44.
[35] H. Hauser, F. Ledermann, H. Doleisch, Angular brushing of extended
parallel coordinates, in: Information Visualization, 2002. INFOVIS
2002. IEEE Symposium on, IEEE, pp. 127–130.
[36] K. Chang, Parallel coordinates, https://github.com/syntagmatic/parallel-coordinates, 2019.
[37] W. Gentzsch, Sun grid engine: Towards creating a compute power
grid, in: Cluster Computing and the Grid, 2001. Proceedings. First
IEEE/ACM International Symposium on, IEEE, pp. 35–36.
[38] A. B. Yoo, M. A. Jette, M. Grondona, Slurm: Simple linux utility for
resource management, in: Workshop on Job Scheduling Strategies for
Parallel Processing, Springer, pp. 44–60.
[39] L. Deng, The mnist database of handwritten digit images for machine
learning research [best of the web], IEEE Signal Processing Magazine
29 (2012) 141–142.
[40] D. P. Kingma, J. Ba, Adam: A method for stochastic optimization,
arXiv preprint arXiv:1412.6980 (2014).
[41] S. Rasp, M. S. Pritchard, P. Gentine, Deep learning to represent subgrid
processes in climate models, Proceedings of the National Academy of
Sciences 115 (2018) 9684–9689.
[42] S. Ioffe, C. Szegedy, Batch normalization: Accelerating deep network training by reducing internal covariate shift, arXiv preprint
arXiv:1502.03167 (2015).
[43] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, R. Salakhutdinov,
Dropout: a simple way to prevent neural networks from overfitting, The
Journal of Machine Learning Research 15 (2014) 1929–1958.
[44] P. Baldi, P. J. Sadowski, Understanding dropout, in: Advances in neural
information processing systems, pp. 2814–2822.
[45] F. Agostinelli, M. Hoffman, P. Sadowski, P. Baldi, Learning activation functions to improve deep neural networks, arXiv preprint
arXiv:1412.6830 (2014).
Appendix A. Deep learning for Cloud Resolving Models
Initially a random search was conducted on the hyperparameters listed
in Table A.2.
Name                          Options            Parameter Type
Batch Normalization [42]      [yes, no]          Choice
Dropout [43, 44]              [0, 0.25]          Continuous
Leaky ReLU coefficient [45]   [0 - 0.4]          Continuous
Learning Rate                 [0.0001 - 0.01]    Continuous (log)
Nodes per Layer               [200 - 300]        Discrete
Number of layers              [8 - 10]           Discrete
Table A.2: DNN Hyperparameter Search Space.
A screenshot of the Sherpa dashboard at the end of the hyperparameter optimization is shown in Figure A.6 (best viewed as a PDF and by zooming in). On the dashboard, "layer x" refers to the number of nodes in layer x. From Figure A.6 one can see that the best performing configurations have low dropout, leaky ReLU coefficients mostly around 0.3 or larger, and learning rates mostly near 0.002. The majority of good models have 8 layers and batch normalization. However, the number of units does not seem to have a large impact.
Figure A.6: Screenshot of the dashboard at the end of the initial random search. The 8
best trials were selected by brushing the Objective axis in the parallel coordinates plot.
Following the secondary search for an optimal learning rate schedule (Section 4.2.3), the hyperparameters in Table A.3 were found to be overall optimal. The optimized learning rate and schedule found by Sherpa are of considerable importance. Referencing the loss curves in Figure A.7f, one can see that the learning rate schedule used in [41] forces the learning rate to decay rapidly, causing an early plateau of the loss. The learning rate schedule discovered by Sherpa, on the other hand, allows the DNN to keep learning, further reducing the loss.
Batch Normalization       No
Dropout                   0.0
Leaky ReLU coefficient    0.3957
Learning Rate             0.001301
Learning Rate Decay       0.843784
Nodes per Layer           [299, 269, 248, 293, 251, 281, 258, 277, 209, 270]
Number of layers          10
Table A.3: Best hyperparameter configuration found by Sherpa.
Figure A.7 displays results of the optimized model as they pertain to
climate modeling metrics. These plots show R^2 values at corresponding pressures and latitudes. Larger values of R^2 indicate that the DNN is able to explain more variance in the corresponding variable. Of particular importance are regions where the Sherpa model performs well while the previously published model fails (e.g. latitudes between -25 and 25 in Figure A.7c). At all pressures and latitudes the Sherpa model outperforms
the previously published model and thereby achieves a new state of the art
for this dataset.
Figure A.7: Case study results for an optimized deep neural network applied to cloud
resolving models. Figures A.7a and A.7b show the coefficient of determination R^2 vs. pressure for convective heating rate and convective moistening rate, respectively. Figures A.7c, A.7d, and A.7e show R^2 values against latitude, and A.7f shows loss trajectories. All figures compare the optimized Sherpa model against the model developed by
Rasp et al. [41].