
Journal of Membrane Computing

https://doi.org/10.1007/s41965-019-00023-0

SURVEY PAPER

Hyperparameter optimization in learning systems


Răzvan Andonie¹

Received: 16 August 2019 / Accepted: 1 October 2019


© Springer Nature Singapore Pte Ltd. 2019

Abstract
While the training parameters of machine learning models are adapted during the training phase, the values of the hyperparameters (or meta-parameters) have to be specified before the learning phase. The goal is to find a set of hyperparameter values which gives us the best model for our data in a reasonable amount of time. We present an integrated view of methods used in hyperparameter optimization of learning systems, with an emphasis on computational complexity aspects. Our thesis is that we should solve a hyperparameter optimization problem using a combination of techniques for: optimization, search space and training time reduction. Case studies from real-world applications illustrate the practical aspects. We create the framework for a future separation between parameters and hyperparameters in adaptive P systems.

Keywords Hyperparameters · Membrane computing · Spiking neural P system · P system · Neural computing · Machine learning

The paper is based on the invited talk at the 20th Conference on Membrane Computing (CMC 20), August 5–8, 2019, Curtea de Argeş, Romania.

Electronic supplementary material The online version of this article (https://doi.org/10.1007/s41965-019-00023-0) contains supplementary material, which is available to authorized users.

* Răzvan Andonie
  [email protected]

¹ Computer Science Department, Central Washington University, Ellensburg, WA, USA

1 Introduction

The main building element within P systems is the membrane. A membrane is a discrete unit which can contain a set of objects (symbols/catalysts), a set of rules, and a set of other membranes contained within [37, 36]. Essentially, a P system is a membrane structure. To what extent can such a structure adapt to a given problem? Can it be trained in a similar way we train machine learning models or, more specifically, neural networks?

There is a recent tendency to create bridges between adaptive P systems and machine learning paradigms, motivated by the huge success of deep learning in current technologies. Some recent P systems are adaptive [13, 55], and [3]. During the Brainstorming Week on Membrane Computing, Sevilla, 2018, ideas about evolving spiking neural (SN) P systems (introduced in [27]) were discussed¹. A first attempt to use SN P systems in pattern recognition can be found in [49], where SN P systems were reported to outperform back propagation and probabilistic neural networks.

There are things we have to clarify. For instance, we should distinguish between trainable parameters and hyperparameters of a model. In the above-cited references, the plasticity of P systems refers to the parameters only. For example, in [55], a spiking neural P system with self-organization has no initially designed synapses. The synapses can be created or deleted according to the information contained in the involved neurons during the computation. In this case, the synapses are parameters, but not hyperparameters.

Many variants of SN P systems have been proposed [44]: with anti-spikes, with weights, with thresholds, with rules on synapses, with structural plasticity, with learning functions, with polarization, with white hole neurons, with astrocytes, etc. For a given problem, which of these models is the most efficient? In this case, the type of the spiking neuron used is a hyperparameter which has to be instantiated by optimization before the P system evolves and adjusts its parameters.

In supervised machine learning, in contrast to P systems, the difference between parameters and hyperparameters of models was intensively studied.

¹ https://www.gcn.us.es/files/bwmc2018-evolsnp-present.pdf


The motivation of our paper is to create the framework for a future separation between parameters and hyperparameters in adaptive P systems. We present an integrated view of methods used in hyperparameter optimization of learning systems in general, with an emphasis on computational complexity aspects. Case studies from real-world applications will illustrate the practical aspects.

As a first step, we have to define more precisely what we understand by "hyperparameter" in the context of machine learning models. Nearly all model algorithms used in machine learning have a set of tuning hyperparameters which affect how the learning algorithm fits the model to the data. These hyperparameters should be distinguished from internal model parameters, such as the weights or the rule probabilities, which are adaptively adjusted to solve a problem. For instance, the hyperparameters of neural networks typically specify the architecture of the network (number and type of layers, number and type of nodes, etc.).

The model (the framework) itself may be considered, at a meta-level, a hyperparameter. In this case, we have a list of possible models, each with its own list of hyperparameters, and all have to be optimized for a given problem. For simplicity, we will discuss here only optimization of hyperparameters within the framework of a given model. Many optimizers attempt to optimize both the choice of the model and the hyperparameters of the model (e.g., Auto-WEKA, Hyperopt-sklearn, AutoML, auto-sklearn, etc.).

The aim of hyperparameter optimization is to find the hyperparameters of a given model that return the best performance as measured on a validation set. This process can be represented in equation form as:

x* = arg min_{x ∈ ℵ} f(x),    (1)

where f(x) is an objective function to minimize (such as RMSE or error rate) evaluated on the validation set; x* is the set of hyperparameters that yields the lowest value of the score, and x can take on any value in the domain ℵ. In simple terms, we want to find the model hyperparameters that yield the best score on the validation set metric.

In more detail, Eq. (1) can be written as:

x* = arg min_{x ∈ ℵ} f(x, 𝛾*; S_validation).    (2)

Equation (2) includes an inner optimization used to find 𝛾*, the optimal value of 𝛾, for the current x value:

𝛾* = arg min_{𝛾 ∈ 𝛤} f(x, 𝛾; S_train).    (3)

S_validation and S_train denote the validation and training datasets, respectively; 𝛾 is the set of learned parameters in the domain 𝛤, obtained through minimization of the training error.

The validation process is more complex than described here. Each combination of hyperparameter values may result in a different model and Eq. (1) evaluates and compares the objective function for different models. The process of finding the best-performing model from a set of models that were produced by different hyperparameter settings is called model selection [41].

For simplicity, in Eq. (1), we referred to a "validation set", without further details. We have to evaluate the expectation of the score over an unknown distribution and we usually approximate this expectation using a three-way split, dividing the dataset into a training, validation, and test dataset. Having a training–validation pair for hyperparameter tuning and model selection allows us to keep the test set "independent" for model evaluation.

The recent interest in hyperparameter optimization is related to the importance and complexity of deep learning architectures. The existing deep learning hyperparameter optimization algorithms are computationally demanding. For example, obtaining an optimized architecture for the CIFAR-10 and ImageNet datasets required 2000 GPU days of reinforcement learning [58] or 3150 GPU days of evolution [42].

There are two inherent causes of this inefficiency. One is related to the search space, which can be a discrete domain. In its most general form, discrete optimization is NP-complete.

The second cause is that evaluating the objective function to find the score is expensive: each time we try different hyperparameters, we have to train a model on the training data, make predictions on the validation data, and then calculate the validation metric. This optimization is usually done by re-training multiple models with different combinations of hyperparameter values and evaluating their performance. We call this re-training + evaluation for one set of hyperparameter values a trial.

Basically, there are three computational complexity aspects which have to be addressed: (a) choose an efficient optimization method for Eq. (1); (b) reduce the search space; and (c) reduce the training time for each trial. Whereas choosing a good optimization is a well-defined (but difficult) task, the problem of reducing the search space and the training time is also important. This can be done, for instance, by reducing the number of hyperparameters and features, reducing the training set, and using additional objective functions.

The number of trials generally increases exponentially with the number of hyperparameters. Therefore, we would like to reduce the number of hyperparameters. For instance, hyperparameters can be ranked (and selected) based on the functional analysis of the variance of the objective function. This method is known as sensitivity analysis.
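
Before turning to these reduction techniques, the following minimal Python sketch makes Eqs. (1)–(3) and the notion of a trial concrete: each pass through the outer loop is one trial, the inner fit() plays the role of the optimization in Eq. (3), and the validation score is f(x, 𝛾*; S_validation). The dataset, the model, and the candidate values are illustrative assumptions, not part of the original paper.

# Sketch of the nested optimization in Eqs. (1)-(3). Note that the SVC "gamma"
# below is a kernel hyperparameter (part of x); the paper's learned parameters
# gamma* correspond to the SVM coefficients found by fit() on S_train.
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)
# Three-way split: training, validation, and test sets.
X_tmp, X_test, y_tmp, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_tmp, y_tmp, test_size=0.25, random_state=0)

candidates = [{"C": C, "gamma": g} for C in (0.1, 1, 10) for g in (1e-3, 1e-2)]

best_x, best_err = None, float("inf")
for x in candidates:                              # outer loop: one trial per candidate x
    model = SVC(**x).fit(X_train, y_train)        # inner optimization on S_train (Eq. 3)
    err = 1.0 - model.score(X_val, y_val)         # f(x, gamma*; S_validation) (Eq. 2)
    if err < best_err:
        best_x, best_err = x, err

print("x* =", best_x, "validation error =", best_err)
print("test accuracy of the selected model =",
      SVC(**best_x).fit(X_train, y_train).score(X_test, y_test))

The test set is touched only once, after model selection, which is exactly the role of the "independent" test set discussed above.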


Reducing the number of trials is not enough if the training time for each trial is high. In this case, we should attempt to reduce (instance selection) the training set.

Using additional objective functions can also reduce the search space for the optimal hyperparameter configuration. For instance, we may add constraints about the complexity of a neural network (such as the number of connections).

We are ready now to describe the structure of the paper. Section 2 is an overview of some standard methods for hyperparameter optimization. Section 3 presents how the search space and the training time can be reduced by instance selection, hyperparameter/feature ranking and using additional objective functions. Section 4 lists the most recent software packages for hyperparameter optimization. Section 5 presents case studies based on three well-known machine learning models. Section 6 contains final remarks and open problems.

2 Methods for hyperparameter optimization

Hyperparameter optimization should be regarded as a formal outer loop in the learning process. In the most general case, such an optimization should include a budgeting choice of how many CPU cycles are to be spent on hyperparameter exploration, and how many CPU cycles are to be spent evaluating each hyperparameter choice (i.e., by tuning the regular parameters) [9].

A simple strategy for hyperparameter optimization is a greedy approach: investigate the local neighborhood of a given hyperparameter configuration, varying one hyperparameter at a time and measuring how performance changes. The only information obtained with this analysis is how different hyperparameter values perform in the context of a single instantiation of the other hyperparameters. We cannot expect good results with this approach.

Fortunately, we do have more systematic approaches. We will review the most fundamental methods. Details can be found, for instance, in [9].

2.1 Grid search

The most commonly used hyperparameter optimization strategy is a combination of Grid search (GS) and manual tuning. There are several reasons why manual search and grid search prevail as the state of the art despite decades of research into global optimization:

• Manual optimization gives researchers some degree of insight into the optimization process.
• There is no technical overhead or barrier to manual optimization.
• Grid search is simple to implement and parallelization is trivial.
• Grid search (with access to a compute cluster) typically finds a better solution than purely manual sequential optimization (in the same amount of time).
• Grid search is reliable in low-dimensional spaces (e.g., 1-d, 2-d). For instance, grid search is relatively efficient for optimizing the two standard hyperparameters of SVMs (LibSVM does this).

GS suffers from the curse of dimensionality because the number of joint values grows exponentially with the number of hyperparameters. Therefore, GS is not recommended for the optimization of many hyperparameters.

2.2 Random search

Random search (RS) consists of drawing samples from the parameter space following a particular distribution for each of the parameters. Using the same number of trials, RS generally yields better results than GS or more complicated hyperparameter optimization methods. Especially in higher-dimensional spaces, the computation resources required by RS methods are significantly lower than for GS [31]. RS works best under the assumption that not all hyperparameters are equally important [11].

Other advantages of RS are:

– The experiment can be stopped any time and the trials form a complete experiment. The key is to define a good stopping criterion, representing a trade-off between accuracy and computation time.
– New trials can be added to an experiment without having to adjust the grid and commit to a much larger experiment.
– Every trial can be carried out asynchronously. Therefore, RS methods are relatively easy to implement on parallel computer architectures.
– If the computer carrying out a trial fails for any reason, its trial can be either abandoned or restarted without jeopardizing the optimization.

Recent attempts to optimize the RS algorithm are: Hyperband, by Li et al. [32], which speeds up RS through adaptive resource allocation and early stopping; Domhan et al. [17], who developed a probabilistic model to mimic early termination of sub-optimal candidates; and Florea et al. [20], where we introduced a dynamically computed stopping criterion for RS, reducing the number of trials without reducing the generalization performance.
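
As a usage illustration of the two strategies above (Sects. 2.1 and 2.2), the sketch below tunes the two standard SVM hyperparameters with scikit-learn's grid and random search; the dataset, the grid, and the sampling distributions are illustrative assumptions, not taken from the paper.

# Grid search vs. random search over the two standard SVM hyperparameters (C, gamma).
from scipy.stats import loguniform
from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)

# GS: exhaustive search over a fixed grid (the grid grows exponentially with the dimension).
grid = {"C": [0.1, 1, 10, 100], "gamma": [1e-4, 1e-3, 1e-2, 1e-1]}
gs = GridSearchCV(SVC(), grid, cv=3).fit(X, y)

# RS: the same budget of 16 trials, but drawn from continuous log-uniform distributions.
dists = {"C": loguniform(1e-1, 1e2), "gamma": loguniform(1e-4, 1e-1)}
rs = RandomizedSearchCV(SVC(), dists, n_iter=16, cv=3, random_state=0).fit(X, y)

print("GS best:", gs.best_params_, gs.best_score_)
print("RS best:", rs.best_params_, rs.best_score_)

With the same trial budget, RS explores 16 distinct values of each hyperparameter instead of only 4, which is the intuition behind its advantage when only some hyperparameters matter.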


To illustrate the efficiency of RS in high-dimensional spaces, we refer to the following real-world application. Using RS, we have introduced in [31] the first polynomial (in the size of the input and the number of dimensions) algorithm for finding maximal empty hyper-rectangles (holes) in data. All previous (deterministic) algorithms are exponential.

We used 5522 protein structures randomly selected from the Protein Databank, a repository of the atomic coordinates of over 100,000 protein structures that have been solved using experimental methods [56]. Proteins are three-dimensional dynamic structures that mediate virtually all cellular biological events. From the hyper-rectangles generated by our algorithm, we were able to determine which of the 39 dimensions in our data were most frequently the bounding conditions of the largest found hyper-rectangles.

Our algorithm only needs to examine a small fraction of the theoretical maximum of 6.007576 × 10^104 possible hyper-rectangles. In a second stage, we were able to extract if/then rules from the hyper-rectangle output and found several interesting relationships among the 39-dimensional data.

2.3 Derivative-free optimization: Nelder–Mead

In hyperparameter optimization, we usually encounter the following challenges: non-differentiable functions, multiple objectives, large dimensionality, mixed variables (discrete, continuous, permutation), multiple local minima (multimodal), discrete search space, etc. Derivative-free optimization refers to the solution of optimization problems using algorithms that do not require derivative information, but only objective function values.

Unlike derivative-based methods in a convex search space, derivative-free methods are not necessarily guaranteed to find the global optimum. Examples of derivative-free optimization methods are: Nelder–Mead (NMA), Simulated Annealing, Evolutionary Algorithms, and Particle Swarm Optimization (PSO).

The NMA was introduced [34] as early as 1965 and performs a search in n-dimensional space using heuristic ideas. The method uses the concept of a simplex, a structure in n-dimensional space formed by n + 1 non-coplanar points. NMA maintains a set of n + 1 test points arranged as a simplex and then modifies the simplex at each iteration using four operations: reflection, expansion, contraction, and shrinking. Each of these operations generates a new point. The sequence of operations performed in one iteration depends on the value of the objective function at the new point relative to the other points. It extrapolates the behavior of the objective function measured at each test point, to find a new test point and to replace one of the old test points with the new one, etc. If this point is better than the best current point, then it tries to move along this line. If the new point is not much better than the previous value, then the simplex shrinks towards a better point. Despite its age, the NMA search technique is still very popular.

The NMA was used in Convolutional Neural Network (CNN) optimization in [1, 2], in conjunction with a relatively small optimization dataset. It works well for objective functions that are smooth, unimodal and not too noisy. The weakness of this method is that it is not very good for problems with more than about 10 variables; above this number of variables, convergence becomes increasingly difficult.

2.4 Bayesian optimization

RS and GS pay no attention to past results and keep searching across the entire search space. However, it may happen that the optimal answer lies within a small region. In contrast, Bayesian optimization iteratively computes a posterior distribution of functions that best describes the objective function. As the number of observations grows, the posterior distribution improves, and the algorithm becomes more certain of which regions in parameter space are worth exploring and which are not. By evaluating hyperparameters that appear more promising from past results, Bayesian methods can find better model settings than RS or GS in fewer iterations. A review of Bayesian optimization can be found in [45].

Bayesian optimization keeps track of past evaluation results, which it uses to form a probabilistic model mapping hyperparameters to a probability of a score on the objective function:

P(score | hyperparameters)

This model is called a surrogate for the objective function. The surrogate probability model is iteratively updated after each evaluation of the objective function.

Following Will Koehrsen², we have the following steps:

1. Build a surrogate probability model of the objective function.
2. Find the hyperparameters that perform best on the surrogate.
3. Apply these hyperparameters to the true objective function.
4. Update the surrogate model incorporating the new results.
5. Repeat steps 2–4 until max iterations or time is reached.

Sequential model-based optimization (SMBO) methods are a formalization of Bayesian optimization. The sequential refers to running trials one after another, each time trying better hyperparameters and updating a probability model (surrogate).

² https://towardsdatascience.com/a-conceptual-explanation-of-bayesian-model-based-hyperparameter-optimization-for-machine-learning-b8172278050f
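
A minimal skeleton of the five steps above, in generic Python; the surrogate, the proposal rule, and the objective are deliberately trivial placeholders (assumptions), not the API of any particular Bayesian optimization package.

# Generic SMBO skeleton following the five steps above.
import random

def objective(params):                        # expensive in practice: train + validate a model
    return (params["lr"] - 0.01) ** 2         # toy stand-in for the validation error

def sample_params():
    return {"lr": 10 ** random.uniform(-4, -1)}

def fit_surrogate(history):
    return list(history)                      # placeholder: a real SMBO fits a GP/TPE/forest here

def propose(surrogate):
    # placeholder acquisition rule: among a few random candidates, prefer the one
    # closest to the best configuration observed so far
    candidates = [sample_params() for _ in range(20)]
    if not surrogate:
        return random.choice(candidates)
    best_seen = min(surrogate, key=lambda h: h[1])[0]
    return min(candidates, key=lambda c: abs(c["lr"] - best_seen["lr"]))

history = []                                  # (params, score) pairs observed so far
for _ in range(30):                           # step 5: repeat steps 2-4
    surrogate = fit_surrogate(history)        # steps 1/4: (re)build the surrogate
    params = propose(surrogate)               # step 2: best candidate on the surrogate
    score = objective(params)                 # step 3: evaluate the true objective
    history.append((params, score))           # step 4: record the new result

print(min(history, key=lambda h: h[1]))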


There are several variants of SMBO methods that differ in steps 3 and 4, namely, how they build a surrogate of the objective function and the criteria used to select the next hyperparameters: Gaussian Processes, Random Forest Regressions, Tree Parzen Estimators, etc.

In low-dimensional problems with numerical hyperparameters, the best available hyperparameter optimization methods use Bayesian optimization [48]. However, Bayesian optimization is restricted to problems of moderate dimension [45].

3 Computational complexity issues

There are two complexity issues related to the search process in hyperparameter optimization:

– A1. The execution time of each trial. The training phase for each trial depends on the size and the dimensionality of the training set.
– A2. The complexity of the search space itself and the number of evaluated combinations of hyperparameters.

In the case of deep learning we have both aspects: a high-dimensional search space, which, according to the curse of dimensionality principle, has to be covered with an exponentially increasing number of points (in this case, combinations of hyperparameters); and large training sets, which are typically used in deep learning.

To address these issues and reduce the search space, some standard techniques may be used:

– Reduce the training dataset based on statistical sampling (relates to A1).
– Feature selection (relates to A1).
– Hyperparameter selection: detect which hyperparameters are more important for the neural network optimization. This may reduce the search space (relates to A2).
– Beside the accuracy, use additional objective functions: number of operations, optimization time, etc. This may also reduce the search space (relates to A2). For instance, superior results were obtained by combining accuracy with visualization via a deconvolution network [1].

We will describe in the following how these techniques can be used.

3.1 Reduce the training dataset

Generally, training models with enough information is essential to achieve good performance. However, it is common that a training dataset T contains samples that may be similar to each other (that is, redundant) or noisy. This increases computation time and can be detrimental to generalization performance.

Instance selection aims to select a subset S ⊂ T, hoping that it can represent the whole training set and achieve acceptable performance. The techniques for instance selection are very similar to the ones in feature selection. For example, instance selection methods can either start with S = 𝛷 (incremental method) or S = T (decremental method). As a result, this reduces training time. A review of instance selection methods can be found in [35].

Like in feature selection, according to the strategy used for selecting instances, we can divide the instance selection methods in two groups [35]:

1. Wrapper. The selection criterion is based on the accuracy obtained by a classifier (commonly, those instances that do not contribute to the classification accuracy are discarded from the training set).
2. Filter. The selection criterion uses a selection function which is not based on a classifier.

For hyperparameter optimization, since we evaluate different ML models, we are only interested in filter instance selection.

Several instance selection techniques have been introduced. For example, a very simple technique is to select one training sample at a time such that, when added to the previous set of examples, it results in the largest decrease in a squared error estimate criterion [39]. In [18], a measure of a sample's influence on the classifier output is used to reduce the training set. Stochastic sampling algorithms were introduced in [47]. The progressive sampling method presented in [40] is an incremental method using progressively larger samples as long as model accuracy improves. Instance selection algorithms for CNN architectures were proposed in [1, 2, 28].

A related approach to instance selection is active learning, which sequentially identifies critical samples to train on. For example, Bengio et al. suggest that guiding a classifier by presenting training samples in an order of increasing difficulty can improve learning [8]. They use the following observation: "Humans and animals learn much better when the examples are not randomly presented but organized in a meaningful order which illustrates gradually more concepts, and gradually more complex ones". There are also other ways to order the training patterns (for instance, [16]). According to [8], these ordering strategies can be seen as a special form of transfer learning where the initial tasks are used to guide the learner so that it will perform better on the final task.

At the extreme, we may consider the order of the training sequence as a hyperparameter of the model. However, finding the optimal permutation based on some performance criteria is computationally not feasible, since it is in the order of the possible permutations.
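
As a simple illustration of reducing the training set by statistical sampling (the filter setting of Sect. 3.1), the sketch below draws a stratified random subsample once and reuses it for every hyperparameter trial; the dataset and the subsample size are illustrative assumptions.

# Filter-style reduction of the training set by stratified random sampling:
# the selection does not depend on any classifier, so the same reduced set S
# can be reused for all hyperparameter trials.
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)

# Keep 25% of T as S, preserving the class proportions (stratification).
X_small, _, y_small, _ = train_test_split(
    X, y, train_size=0.25, stratify=y, random_state=0)

print(len(X), "->", len(X_small), "training samples per trial")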


3.2 Reduce the number of features

Feature selection is a standard technique in machine learning [24]. By reducing the number of features we reduce training time. Feature selection is very similar to hyperparameter selection and similar techniques can be used to assess the importance of hyperparameters and features. Basically, we search for an optimal configuration of hyperparameters (or features), using trials to assess each configuration.

The search for the optimal joint combination of features and hyperparameters is computationally hard. A sequential greedy search in two stages (feature selection followed by hyperparameter selection) is simpler. We omit in this case that the selection of features and hyperparameters are not independent processes, but this may be acceptable.

The problem of performing feature selection inside or outside the cross-validation loop (the loop which is doing the fine tuning of the parameters) was studied in [43] and had very nuanced results. Ultimately, it depends on the dataset and the problem to be solved.

Some software packages (see Sect. 4) perform feature selection and hyperparameter optimization together. There are also some published results in which features + hyperparameters are optimized together for some specific models, for example, the SVM [51].

3.3 Hyperparameter selection by functional analysis of variance

If we can assess the importance of each hyperparameter, then we can also select and optimize only the most important hyperparameters of a model. This would reduce the complexity of hyperparameter optimization.

Optimally, we would like to know how all hyperparameters affect performance across all their instantiations. In most cases, such a general optimization is computationally prohibitive. A greedy approach to assess the importance of a hyperparameter is to vary one hyperparameter at a time and measure how this affects the performance of the objective function. The only information obtained with this analysis is how different hyperparameter values perform in the context of a single instantiation of the other hyperparameters.

A more elaborate analysis based on predictive models can be used to quantify the performance of a hyperparameter instantiation in the context of all instantiations of the other hyperparameters. For instance, we can use sensitivity analysis (SA) [29], described as follows. Once a network has been trained, calculate an average value for each hyperparameter. Then, holding all variables but one at a time at their average levels, vary the one input over its entire range and compute the variability produced in the net outputs. Analysis of this variability may be done for several different networks, each trained from a different weight initialization. The algorithm will then rank the hyperparameters from highest to lowest according to the mean variability produced in the output.

Three different sensitivity measures were proposed in [29], based on: output range, variance, and average gradient over all the intervals. Although the use of average values for all but one input does not capture all of the complex interactions in the input space, it does produce a rough estimate of the model's univariate sensitivity to each input variable. Obviously, we do not consider the interactions between hyperparameters.

SA measures the effects on the output of a given model when the inputs are varied through their range of values. It allows a ranking of the inputs that is based on the amount of output changes that are produced due to variations in a given input. SA can be used both for hyperparameter and feature importance assessment.

A more sophisticated approach is based on the analysis of variance (functional ANOVA) [26]. It is available as a software package (fANOVA³) able to approximate the importance of the hyperparameters of a model.

Recently introduced, the N-RReliefF algorithm [50] can also estimate the contribution of each single hyperparameter to the performance of a model. N-RReliefF was used to determine the importance of the interactions between hyperparameters on 100 data sets. The results showed that the same hyperparameters have similar importance on different data sets. This does not mean that only adjusting the most important hyperparameters and combinations is the best option in all cases. When there are enough computing resources, it is still recommended to optimize all hyperparameters [26]. However, for computationally intensive optimizations, hyperparameter selection is a reasonable option.

³ https://www.automl.org/algorithm-analysis/fanova/
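
To make the sensitivity-analysis ranking of Sect. 3.3 concrete, the sketch below varies one hyperparameter at a time around average values of the others and ranks the hyperparameters by the output-range measure; the objective function, the hyperparameter names, and their ranges are toy assumptions.

# One-at-a-time sensitivity analysis in the spirit of [29]: hold all
# hyperparameters at an "average" value, sweep one over its range, and rank
# the hyperparameters by the range of the produced objective values.
import numpy as np

def objective(lr, batch_size, dropout):            # stand-in for a validation score
    return -((np.log10(lr) + 2) ** 2) - 0.1 * dropout + 0.001 * batch_size

ranges = {"lr": np.logspace(-4, -1, 10),
          "batch_size": np.linspace(16, 256, 10),
          "dropout": np.linspace(0.0, 0.5, 10)}
averages = {name: float(np.mean(vals)) for name, vals in ranges.items()}

sensitivity = {}
for name, vals in ranges.items():
    scores = []
    for v in vals:
        args = dict(averages)
        args[name] = v                              # vary only this hyperparameter
        scores.append(objective(**args))
    sensitivity[name] = max(scores) - min(scores)   # output-range sensitivity measure

for name, s in sorted(sensitivity.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {s:.4f}")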


3.4 Use additional objective functions

The performance of the objective function f (in Eq. 1) is usually measured in terms of the accuracy of the model. Beside accuracy, we may use additional objective functions. The search could also be guided by goals like training time or memory requirements of the network.

One possibility is to add an objective function measuring the complexity of the model. Smaller complexity also means a smaller number of hyperparameters. We prefer models with small complexities. Models with lower complexity are not only faster to train; they also require smaller training sets (the curse of dimensionality) and have a smaller chance to over-fit (over-fitting increases with the number of hyperparameters).

For a CNN network, the complexity is the aggregation of the following hyperparameters' values: number of layers, number of maps, number of fully connected layers, number of neurons in each layer. In the case of a ConvNet neural network architecture implemented on mobile devices, beside accuracy, we may also consider the characteristics of the hardware implementation. Such metrics include latency and energy consumption [57].

4 Software for hyperparameter optimization

Several software libraries dedicated to hyperparameter optimization exist. Each optimization technique included in a package is called a solver. The choice of a particular solver depends on the problem to be optimized.

LIBSVM [14] and scikit-learn [38] come with their own implementation of GS, with scikit-learn also offering support for RS.

Bayesian techniques are implemented by packages like BayesianOptimization⁴, Spearmint⁵, and pyGPGO⁶.

Hyperopt-sklearn is a software project that provides automatic algorithm configuration of the scikit-learn library. It can be used for both model selection and hyperparameter optimization. Hyperopt-sklearn has the following implemented solvers [12]: RS, simulated annealing, and Tree-of-Parzen-Estimators.

Optunity⁷ is a Python library containing various optimizers for hyperparameter tuning. Optunity is currently also supported in R, MATLAB, GNU Octave and Java through Jython. It has the following solvers available: GS, RS, PSO, NMA, Covariance Matrix Adaptation Evolutionary Strategy, Tree-structured Parzen Estimator, and Sobol sequences.

Auto-WEKA [30], built on top of Weka [25], is able to perform GS, RS, and Bayesian optimization. Auto-sklearn [19] extends the idea of configuring a general machine learning framework with efficient global optimization, introduced with Auto-WEKA. It is built around scikit-learn and automatically searches for the right learning algorithm for a new machine learning dataset and optimizes its hyperparameters.

Following the idea of Auto-WEKA, several automated machine learning tools (AutoML) were recently developed. Their ultimate goal is to automate the end-to-end process of applying machine learning to real-world problems, even for people with no major expertise in this field. Ideally, such a tool will choose the optimal pipeline for a labeled input dataset: data preprocessing, feature selection/extraction, and the learning model with its optimal hyperparameters.

Some of the existing AutoML tools are⁸: MLBox, auto-sklearn, Tree-Based Pipeline Optimization Tool, H2O, AutoKeras. TransmogrifAI⁹ is an end-to-end AutoML library for structured data written in Scala that runs on top of Apache Spark. It was developed with a focus on accelerating machine learning developer productivity through machine learning automation.

Commercial cloud-based AutoML services offer highly integrated hyperparameter optimization capabilities. Some of them are offered by well-known companies:

– Google Cloud AutoML¹⁰, a suite of machine learning products that enables developers with limited machine learning expertise to train high-quality models specific to their business needs.
– Microsoft Azure ML¹¹, which can be used to streamline the building, training, and deployment of machine learning models.
– Amazon SageMaker Automatic Model Tuning¹², which can launch multiple training jobs, with different hyperparameter combinations, based on the results of completed training jobs. SageMaker uses Bayesian hyperparameter optimization.

⁴ https://github.com/fmfn/BayesianOptimization
⁵ https://github.com/HIPS/Spearmint
⁶ https://github.com/hawk31/pyGPGO
⁷ http://optunity.readthedocs.io/en/latest/
⁸ https://heartbeat.fritz.ai/automl-the-next-wave-of-machine-learning-5494baac615f
⁹ https://transmogrif.ai/
¹⁰ https://cloud.google.com/automl/
¹¹ https://docs.microsoft.com/en-us/azure/machine-learning/studio-module-reference/tune-model-hyperparameters
¹² https://aws.amazon.com/sagemaker/
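
As a usage example for this class of libraries, the sketch below runs the Tree-of-Parzen-Estimators solver of the hyperopt package (the library underlying Hyperopt-sklearn [12]) on the two standard SVM hyperparameters; the dataset, the search-space bounds, and the trial budget are assumptions, and the exact API may differ between hyperopt versions.

# Tuning C and gamma of an SVM with hyperopt's TPE solver.
import numpy as np
from hyperopt import fmin, tpe, hp
from sklearn.datasets import load_digits
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)

space = {"C": hp.loguniform("C", np.log(1e-1), np.log(1e2)),
         "gamma": hp.loguniform("gamma", np.log(1e-4), np.log(1e-1))}

def loss(params):
    # hyperopt minimizes, so return 1 minus the mean cross-validated accuracy
    return 1.0 - cross_val_score(SVC(**params), X, y, cv=3).mean()

best = fmin(fn=loss, space=space, algo=tpe.suggest, max_evals=50)
print("best hyperparameters:", best)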


5 Case studies

5.1 Hyperparameter optimization in fuzzy ARTMAP models

Hyperparameter optimization is not a new concept. For example, in 1986, Grefenstette optimized the hyperparameters of a genetic algorithm using a meta-level genetic algorithm, an intriguing concept at that time [22].

In [5], we optimized the hyperparameters of a class of Fuzzy ARTMAP neural networks, Fuzzy ARTMAP with Input Relevances (FAMR). The FAMR, introduced in [7], is a Fuzzy ARTMAP incremental learning system used for classification and probability estimation. During the learning phase, each training sample is assigned a relevance factor proportional to the importance of that sample.

Our work in [5] was related to the prediction of biological activities of HIV-1 protease inhibitory compounds when inferring from small training sets. In fact, we were forced to use a training set of only 176 samples (molecules in this case), since no other data were available. We optimized the FAMR training data relevances using a genetic algorithm. We also optimized the order of the training data presentation. These optimizations ameliorated the problem of insufficient data and we improved the generalization performance of the trained model. The computational overhead induced by these optimizations was acceptable.

In [6] we optimized not only the relevances and the order of the training data presentation but also some hyperparameters of the FAMR network. We used again a genetic algorithm. To some extent, the prediction performance improved, but the computational overhead increased significantly and we faced overfitting aspects. According to our experiments, we concluded that using a genetic algorithm for hyperparameter optimization is computationally not feasible for large training datasets.

5.2 Hyperparameter optimization in SVM models

Support vector machine (SVM) classifiers depend on several parameters and are quite sensitive to changes in any of those parameters [15]. For a Gaussian kernel, there are two hyperparameters which define an SVM model: C and 𝛾, the parameter of the Gaussian kernel.

In [20], we introduced a dynamic early stopping condition for RS hyperparameter optimization, tested for SVM classification. We significantly reduced the number of trials. The code runs on a multi-core system and has good scalability for an increasing number of cores. We will describe in the following the main results from [20], omitting details which are less relevant here.

A simplified version of a hyperparameter optimization algorithm is characterized by the objective fitness function f and the generator of samples g. The fitness function returns a classification accuracy measure of the target model. The generator g is in charge of providing the next set of values that will be used to compute the model's fitness. A hasNext method implemented by the generator offers the possibility to terminate the algorithm before the maximum number of N evaluations is reached, if some convergence criterion is satisfied.

In the particular case of RS, the generator g draws samples from the specific distribution of each of the hyperparameters to be optimized. Our goal is to reduce the computational complexity of the RS method in terms of m, the number of trials. In other words, we aim to compute fewer than N trials, without a significant impact on the value of the fitness function.

For this, we introduce a dynamic stopping criterion, included in a randomized optimization algorithm (Algorithm 1). The algorithm is a two-step optimizer. First, it iterates for a predefined number of steps n, n << N, and finds the optimal combination of hyperparameter values, temp_opt. Then, it searches for the first result better than temp_opt. The optimal result, opt, is either the first result better than temp_opt or temp_opt if N is reached.
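
A compact sketch of this two-step optimizer, in generic Python; the sample generator g and the fitness function stand in for an actual SVM trial, and the choice n = N/e anticipates the result derived just below. All of these are illustrative assumptions, not the reference implementation of Algorithm 1.

# Dynamic stopping criterion in the spirit of [20]: run n trials and record the
# best value temp_opt, then stop at the first trial that beats temp_opt
# (or after N trials in total).
import math
import random

def generate():                                 # generator g: one hyperparameter sample
    return {"C": 10 ** random.uniform(-1, 2), "gamma": 10 ** random.uniform(-4, -1)}

def fitness(params):                            # placeholder for training + validating an SVM
    return 1.0 / (1.0 + abs(math.log10(params["C"]) - 1) + abs(math.log10(params["gamma"]) + 2))

def random_search_with_stopping(N):
    n = max(1, round(N / math.e))               # length of the first phase
    temp_opt, temp_args = -float("inf"), None
    for _ in range(n):                          # phase 1: plain RS for n trials
        x = generate()
        f = fitness(x)
        if f > temp_opt:
            temp_opt, temp_args = f, x
    opt, opt_args = temp_opt, temp_args
    for _ in range(N - n):                      # phase 2: stop at the first improvement
        x = generate()
        f = fitness(x)
        if f > temp_opt:
            opt, opt_args = f, x
            break
    return opt_args, opt

print(random_search_with_stopping(N=150))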


Given a restricted computational budget, expressed by a target number of trials m, we would like to determine the optimal value for n, the value of n which maximizes the probability of obtaining opt after at most m trials, n < m < N (where m = N/2 is a reasonable target).

We determined this optimal value: n = m/e. Choosing for n a value greater than the optimal one not only increases the probability of finding the optimal hyperparameter instantiation but also increases the probability of using more trials.

Our result can be used to implement an improved version of the previous algorithm that automatically sets the value of n to N/e. For example, to maximize the chances to obtain the best value after a target maximum of 150 attempts, we must set n to 150/e (≈ 55). For a target maximum of 100 attempts, n should be 37, and so on.

We can reverse the problem: given an acceptable probability P0 to achieve the best result among the N trials, which is the optimal value for n? For the standard RS algorithm without the dynamic stopping criterion, if all trials are independent, the required number of trials needed to identify the optimum with a probability P0 is given by m = N ⋅ P0.

With our algorithm, we can compromise, accepting a probability P, P < P0, to identify gopt using fewer trials. If all N combinations are tested (when the stopping criterion opt is not activated), P has the lower bound:

P ≥ m/(eN) ⋅ ln e² = 2 P0/e ≈ 0.7357 P0.

However, the probability P to find gopt after fewer than N trials has the lower bound:

P ≥ m/(eN) = P0/e ≈ 0.3678 P0.

We used our method to optimize five SVM hyperparameters: kernel type (RBF, Polynomial or Linear, chosen with equal probability); 𝛾 (drawn from an exponential distribution with 𝜆 = 10); cost (C, drawn from an exponential distribution with 𝜆 = 10); degree (chosen with equal probability from the set {2, 3, 4, 5}); and coef0 (uniform on [0, 1]).

We ran the experiments on six of the most popular datasets from the UCI Machine Learning Repository¹³ and obtained on par accuracy values with the existing mainstream hyperparameter optimization techniques. Our algorithm terminates after a significantly reduced number of trials compared to the standard implementation of RS, which leads to an important decrease in the computational budget required for the optimization.

5.3 Hyperparameter optimization in CNN models

In [21], we introduced an improved version of the RS method, the Weighted Random Search (WRS) method. The focus of the WRS method is the optimization of the classification (prediction) performance within the same computational budget. We applied this method to CNN architecture optimization. We will describe in a simplified way the WRS algorithm from [21].

Similar to GS and RS, we make the assumption that there is no statistical correlation between the variables of the objective function (hyperparameters). The standard RS technique [10] generates a new multi-dimensional sample at each step k, with new random values for each of the sample's dimensions (features): X^k = {x_i^k}, i = 1, …, d, where x_i is generated according to a probability distribution P_i(x), i = 1, …, d, and d is the number of dimensions.

WRS is an improved version of RS, designed for hyperparameter optimization. It assigns probabilities of change p_i, i = 1, …, d, to each dimension. For each dimension i, after a certain number of steps k_i, instead of always generating a new value, we generate it with probability p_i and use the best value known so far with probability 1 − p_i.

The intuition behind the proposed algorithm is that after already fixing d_0 (1 < d_0 < d) values, each d-dimensional optimization problem reduces itself to a (d − d_0)-dimensional one. In the context of this (d − d_0)-dimensional problem, choosing a set of values that already performed well for the remaining dimensions might prove more fruitful than choosing some d − d_0 random values. To avoid getting stuck in a local optimum, instead of setting a hard boundary between choosing the best combination of values found so far or generating new random samples, we assign probabilities of change for each dimension of the search space.

WRS has two phases. In the first phase, it runs the RS for a predefined number of trials, which allows: (a) to identify the best combination of values so far; and (b) to give enough input on the importance of each dimension in the optimization process. The second phase considers the probabilities of change and generates the candidate values according to them. Between these two phases, we run one instance of fANOVA [26], to determine the importance of each dimension with respect to the objective function. Intuitively, the most important dimension (the dimension that yields the largest variation of the objective function) is the one that should change most frequently, to cover as much of the variation range as possible. For a dimension with small variation of the objective function, it might be more efficient to keep a certain temporary optimum value once this has been identified.

A step of the WRS algorithm applied to function maximization is described by Algorithm 2, whereas the entire method is detailed in Algorithm 3. F is the objective function, the value F(X) has to be computed for each argument, X^k is the best argument at iteration k, whereas N is the total number of iterations.

At each step of Algorithm 3, at least one dimension will change, hence we always choose at least one of the p_i probabilities to be equal to one. For the other probabilities, any value in (0, 1] is valid. If all values are one, then we are in the case of RS.

¹³ http://archive.ics.uci.edu/ml/index.php


Besides a way to compute the objective function, Algorithm 2 requires only the combination of values that yields the best F(X) value obtained so far and the probability of change for each dimension. The current optimal value of the objective function can be made optional, since the comparison can be done outside of Algorithm 2. Algorithm 3 coordinates the sequence of the described steps and calls Algorithm 2 in a loop, until the maximum number of trials N is reached.

The value p_i is the probability of change and k_i the minimum number of required values, for dimension i, i = 1, …, d. We proved that, regardless of the distribution used for generating x_i, if we choose k_i, i = 1, …, d, so that at least two distinct values are generated for each dimension, we have: at any step n, WRS has a greater probability than RS to find the global optimum. Therefore, given the same number of iterations, on average, WRS finds the global optimum faster than RS.

We sorted the function variables with respect to their importance (weights) and assigned their probabilities p_i accordingly: the smaller the weight of a parameter, the smaller its probability of change. Therefore, the most important parameter is the one that will always change (p_1 = 1). To compute the weight of each parameter, we run RS for a predefined number of steps, N_0 < N. On the obtained values, we applied fANOVA to estimate the importance of the hyperparameters. If w_i is the weight of the i-th parameter and w_1 is the weight of the most important one, then p_i = w_i/w_1, i = 1, …, d. We optimized the following CNN hyperparameters: the number of convolution layers, the number of fully connected layers, the number of output filters in each convolution layer, and the number of neurons in each fully connected layer.

We generated each hyperparameter according to the uniform distribution and assessed the performance of the model solely by the classification accuracy.
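
A simplified sketch of the WRS step (Algorithm 2) and the surrounding loop (Algorithm 3); the objective function, the hyperparameter ranges, and the importance weights (which [21] obtains with fANOVA) are illustrative assumptions, not the reference implementation.

# Simplified Weighted Random Search: each dimension i changes with probability
# p_i = w_i / w_1; otherwise the best known value is kept. The objective, the
# integer ranges, and the weights are placeholders for a real CNN trial.
import random

ranges = {"conv_layers": (1, 6), "fc_layers": (1, 3), "filters": (16, 128), "neurons": (32, 512)}

def objective(x):                               # placeholder for training a CNN and
    return -sum(x.values())                     # returning its validation accuracy

def wrs(weights, N=200):
    names = sorted(weights, key=weights.get, reverse=True)
    w_max = weights[names[0]]
    p = {n: weights[n] / w_max for n in names}  # p_i = w_i / w_1, hence p_1 = 1
    best = {n: random.randint(*ranges[n]) for n in names}
    best_f = objective(best)
    for _ in range(N):                          # Algorithm 3: loop over WRS steps
        x = {}                                  # Algorithm 2: build one candidate
        for n in names:
            if random.random() < p[n]:
                x[n] = random.randint(*ranges[n])   # change this dimension
            else:
                x[n] = best[n]                      # keep the best value known so far
        f = objective(x)
        if f > best_f:
            best, best_f = x, f
    return best, best_f

weights = {"conv_layers": 0.5, "fc_layers": 0.1, "filters": 0.3, "neurons": 0.2}
print(wrs(weights))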


For the same number of trials, the WRS algorithm produced significantly better results than RS on the CIFAR-10 dataset.

6 Conclusions and open problems

Determining the proper architecture design for machine learning models is a challenge because it differs for each dataset and therefore requires adjustments for each one. For most datasets, only a few of the hyperparameters really matter. However, different hyperparameters are important on different data sets. There is no mathematical method for determining the appropriate hyperparameters for a given dataset, so the selection relies on trial and error. We should use a customized combination of optimization, search space and training time reduction techniques.

The computational power of SN P systems was well studied. Several variants of these systems are known to have universal computational capability, being equivalent to Turing machines [44]. Meanwhile, as early as 1943, McCulloch and Pitts [33] asserted that neural networks are computationally universal. This was also discussed by John von Neumann in 1945 [52]. Details about the universal computational capability of neural models can be found in [4, 46].

There are many possibilities for SN P systems for applications like optimization and classification with learning ability [44]. However, up to this moment, we know of no attempt to bring problems and techniques from the neural computing area to the SN P systems area [54]. The current SN P systems with weights, like the McCulloch–Pitts neurons (introduced in 1943), are not able to adapt the weights during a learning process. SN P systems with weights were studied both in the generative and the accepting case, but not in an adaptive case. There are few attempts to use neural network learning rules for adapting the parameters of P systems: [53] uses the Widrow–Hoff rule and [23] uses the Hebbian rule to learn parameters.

To create further analogies between adaptive P systems and machine learning models, we should use more advanced parameter learning algorithms to train P systems. At a meta-level, we should then be able to optimize the hyperparameters of these models.

Acknowledgements I am deeply grateful to Dr. Gheorghe Păun for his valuable comments on a draft of this paper and for motivating me to finish it.

References

1. Albelwi, S., & Mahmood, A. (2016). Analysis of instance selection algorithms on large datasets with deep convolutional neural networks. In 2016 IEEE Long Island systems, applications and technology conference (LISAT) (pp. 1–5).
2. Albelwi, S., & Mahmood, A. (2016). Automated optimal architecture of deep convolutional neural networks for image recognition. In 2016 15th IEEE International conference on machine learning and applications (ICMLA) (pp. 53–60). https://doi.org/10.1109/ICMLA.2016.0018.
3. Aman, B., & Ciobanu, G. (2019). Adaptive P systems. Lecture Notes in Computer Science, 11399, 57–72.
4. Andonie, R. (1998). The psychological limits of neural computation. In M. Kárný, K. Warwick, & V. Kůrková (Eds.), Dealing with complexity: A neural networks approach (pp. 252–263). London: Springer.
5. Andonie, R., Fabry-Asztalos, L., Abdul-Wahid, C. B., Abdul-Wahid, S., Barker, G. I., & Magill, L. C. (2011). Fuzzy ARTMAP prediction of biological activities for potential HIV-1 protease inhibitors using a small molecular data set. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 8(1), 80–93. https://doi.org/10.1109/TCBB.2009.50.
6. Andonie, R., Fabry-Asztalos, L., Magill, L., & Abdul-Wahid, S. (2007). A new fuzzy ARTMAP approach for predicting biological activity of potential HIV-1 protease inhibitors. In 2007 IEEE International conference on bioinformatics and biomedicine (BIBM 2007) (pp. 56–61). https://doi.org/10.1109/BIBM.2007.9.
7. Andonie, R., & Sasu, L. (2006). Fuzzy ARTMAP with input relevances. IEEE Transactions on Neural Networks, 17(4), 929–941. https://doi.org/10.1109/TNN.2006.875988.
8. Bengio, Y., Louradour, J., Collobert, R., & Weston, J. (2009). Curriculum learning. In Proceedings of the 26th annual international conference on machine learning, ICML'09 (pp. 41–48). New York, NY, USA: ACM. https://doi.org/10.1145/1553374.1553380.
9. Bergstra, J., Bardenet, R., Bengio, Y., & Kégl, B. (2011). Algorithms for hyper-parameter optimization. In 25th Annual conference on neural information processing systems (NIPS 2011), advances in neural information processing systems (Vol. 24). Granada, Spain: Neural Information Processing Systems Foundation.
10. Bergstra, J., Bardenet, R., Bengio, Y., & Kégl, B. (2011). Algorithms for hyper-parameter optimization. In J. Shawe-Taylor, R. S. Zemel, P. L. Bartlett, F. C. N. Pereira, & K. Q. Weinberger (Eds.), NIPS (pp. 2546–2554). http://dblp.uni-trier.de/db/conf/nips/nips2011.html.
11. Bergstra, J., & Bengio, Y. (2012). Random search for hyper-parameter optimization. Journal of Machine Learning Research, 13, 281–305.
12. Bergstra, J., Komer, B., Eliasmith, C., Yamins, D., & Cox, D. D. (2015). Hyperopt: A Python library for model selection and hyperparameter optimization. Computational Science and Discovery, 8(1), 014008. http://stacks.iop.org/1749-4699/8/i=1/a=014008.
13. Cabarle, F. G. C., Adorna, H. N., Pérez-Jiménez, M. J., & Song, T. (2015). Spiking neural P systems with structural plasticity. Neural Computing and Applications, 26(8), 1905–1917.
14. Chang, C. C., & Lin, C. J. (2011). LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2, 27. Software retrieved from http://www.csie.ntu.edu.tw/~cjlin/libsvm.
15. Cortes, C., & Vapnik, V. (1995). Support-vector networks. Machine Learning, 20(3), 273–297. https://doi.org/10.1023/A:1022627411411.


16. Dagher, I., Georgiopoulos, M., Heileman, G. L., & Bebis, G. (1998). Ordered fuzzy ARTMAP: A fuzzy ARTMAP algorithm with a fixed order of pattern presentation. In 1998 IEEE International joint conference on neural networks proceedings. IEEE world congress on computational intelligence (Cat. No. 98CH36227) (Vol. 3, pp. 1717–1722). https://doi.org/10.1109/IJCNN.1998.687115.
17. Domhan, T., Springenberg, J. T., & Hutter, F. (2015). Speeding up automatic hyperparameter optimization of deep neural networks by extrapolation of learning curves. In Proceedings of the 24th international conference on artificial intelligence, IJCAI'15 (pp. 3460–3468). AAAI Press. http://dl.acm.org/citation.cfm?id=2832581.2832731.
18. Engelbrecht, A. P. (2001). Selective learning for multilayer feedforward neural networks. In Proceedings of the 6th international work-conference on artificial and natural neural networks: Connectionist models of neurons, learning processes and artificial intelligence—Part I, IWANN'01 (pp. 386–393). London, UK: Springer.
19. Feurer, M., Klein, A., Eggensperger, K., Springenberg, J. T., Blum, M., & Hutter, F. (2015). Efficient and robust automated machine learning. In Proceedings of the 28th international conference on neural information processing systems—Volume 2, NIPS'15 (pp. 2755–2763). Cambridge, MA, USA: MIT Press. http://dl.acm.org/citation.cfm?id=2969442.2969547.
20. Florea, A. C., & Andonie, R. (2018). A dynamic early stopping criterion for random search in SVM hyperparameter optimization. In L. Iliadis, I. Maglogiannis, & V. Plagianakos (Eds.), Artificial intelligence applications and innovations (pp. 168–180). Cham: Springer International Publishing.
21. Florea, A. C., & Andonie, R. (2019). Weighted random search for hyperparameter optimization. International Journal of Computers Communications & Control, 14(2), 154–169. https://doi.org/10.15837/ijccc.2019.2.3514.
22. Grefenstette, J. J. (1986). Optimization of control parameters for genetic algorithms. IEEE Transactions on Systems, Man, and Cybernetics, 16(1), 122–128. https://doi.org/10.1109/TSMC.1986.289288.
23. Gutiérrez-Naranjo, M. A., & Pérez-Jiménez, M. J. (2009). Hebbian learning from spiking neural P systems view. In D. W. Corne, P. Frisco, G. Păun, G. Rozenberg, & A. Salomaa (Eds.), Membrane computing (pp. 217–230). Berlin: Springer.
24. Guyon, I., & Elisseeff, A. (2003). An introduction to variable and feature selection. Journal of Machine Learning Research, 3, 1157–1182.
25. Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., & Witten, I. H. (2009). The WEKA data mining software: An update. ACM SIGKDD Explorations Newsletter, 11(1), 10–18. https://doi.org/10.1145/1656274.1656278.
26. Hutter, F., Hoos, H., & Leyton-Brown, K. (2014). An efficient approach for assessing hyperparameter importance. In Proceedings of the 31st international conference on machine learning, ICML 2014, Beijing, China, 21–26 June 2014 (pp. 754–762).
27. Ionescu, M., Păun, G., & Yokomori, T. (2006). Spiking neural P systems. Fundamenta Informaticae, 71, 279–308.
28. Kabkab, M., Alavi, A., & Chellappa, R. (2016). DCNNs on a diet: Sampling strategies for reducing the training set size. CoRR abs/1606.04232. arXiv:1606.04232.
29. Kewley, R. H., Embrechts, M. J., & Breneman, C. (2000). Data strip mining for the virtual design of pharmaceuticals with neural networks. IEEE Transactions on Neural Networks, 11(3), 668–679.
30. Kotthoff, L., Thornton, C., Hoos, H. H., Hutter, F., & Leyton-Brown, K. (2017). Auto-WEKA 2.0: Automatic model selection and hyperparameter optimization in WEKA. Journal of Machine Learning Research, 18(25), 1–5. http://jmlr.org/papers/v18/16-261.html.
31. Lemley, J., Jagodzinski, F., & Andonie, R. (2016). Big holes in big data: A Monte Carlo algorithm for detecting large hyper-rectangles in high dimensional data. In 2016 IEEE 40th annual computer software and applications conference (COMPSAC) (Vol. 1, pp. 563–571). https://doi.org/10.1109/COMPSAC.2016.73.
32. Li, L., Jamieson, K. G., DeSalvo, G., Rostamizadeh, A., & Talwalkar, A. (2016). Efficient hyperparameter optimization and infinitely many armed bandits. CoRR abs/1603.06560. arXiv:1603.06560.
33. McCulloch, W., & Pitts, W. (1943). A logical calculus of the ideas immanent in nervous activity. Bulletin of Mathematical Biophysics, 5, 115–133.
34. Nelder, J. A., & Mead, R. (1965). A simplex method for function minimization. The Computer Journal, 7, 308–313.
35. Olvera-López, J. A., Carrasco-Ochoa, J. A., Martínez-Trinidad, J. F., & Kittler, J. (2010). A review of instance selection methods. Artificial Intelligence Review, 34(2), 133–143.
36. Păun, G. (2000). Computing with membranes. Journal of Computer and System Sciences, 61(1), 108–143.
37. Păun, G., Rozenberg, G., & Salomaa, A. (2010). The Oxford handbook of membrane computing. Oxford: Oxford University Press.
38. Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., et al. (2011). Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12, 2825–2830.
39. Plutowski, M., & White, H. (1993). Selecting concise training sets from clean data. IEEE Transactions on Neural Networks, 4(2), 305–318. https://doi.org/10.1109/72.207618.
40. Provost, F., Jensen, D., & Oates, T. (1999). Efficient progressive sampling. In Proceedings of the fifth ACM SIGKDD international conference on knowledge discovery and data mining, KDD'99 (pp. 23–32). New York, NY, USA: ACM.
41. Raschka, S. (2018). Model evaluation, model selection, and algorithm selection in machine learning. CoRR abs/1811.12808. arXiv:1811.12808.
42. Real, E., Aggarwal, A., Huang, Y., & Le, Q. V. (2018). Regularized evolution for image classifier architecture search. CoRR abs/1802.01548. arXiv:1802.01548.
43. Refaeilzadeh, P., Tang, L., & Liu, H. (2007). On comparison of feature selection algorithms. In AAAI Workshop—technical report (Vol. WS-07-05, pp. 34–39).
44. Rong, H., Wu, T., Pan, L., & Zhang, G. (2018). Spiking neural P systems: Theoretical results and applications. In C. Graciani, A. Riscos-Núñez, G. Păun, G. Rozenberg, & A. Salomaa (Eds.), Enjoying natural computing: Essays dedicated to Mario de Jesús Pérez-Jiménez on the occasion of his 70th birthday (pp. 256–268). Cham: Springer International Publishing. https://doi.org/10.1007/978-3-030-00265-7_20.
45. Shahriari, B., Swersky, K., Wang, Z., Adams, R. P., & de Freitas, N. (2016). Taking the human out of the loop: A review of Bayesian optimization. Proceedings of the IEEE, 104(1), 148–175. https://doi.org/10.1109/JPROC.2015.2494218.
46. Siegelmann, H. T., & Sontag, E. D. (1992). On the computational power of neural nets. In Proceedings of the fifth annual workshop on computational learning theory, COLT'92 (pp. 440–449). New York, NY, USA: ACM. https://doi.org/10.1145/130385.130432.
47. Skalak, D. B. (1994). Prototype and feature selection by sampling and random mutation hill climbing algorithms. In Machine learning: Proceedings of the eleventh international conference (pp. 293–301). Morgan Kaufmann.
48. Snoek, J., Larochelle, H., & Adams, R. P. (2012). Practical Bayesian optimization of machine learning algorithms. In Advances in neural information processing systems 25: 26th annual conference on neural information processing systems 2012. Proceedings of a meeting held December 3–6, 2012, Lake Tahoe, Nevada, United States (pp. 2960–2968).
49. Song, T., Pan, L., Wu, T., Zheng, P., Wong, M. L. D., & Rodríguez-Patón, A. (2019). Spiking neural P systems with learning functions. IEEE Transactions on Nanobioscience, 18(2), 176–190. https://doi.org/10.1109/TNB.2019.2896981.


50. Sun, Y., Gong, H., Li, Y., & Zhang, D. (2019). Hyperparameter importance analysis based on N-RReliefF algorithm. International Journal of Computers Communications & Control, 14(4), 557–573.
51. Sunkad, Z. A., & Soujanya (2016). Feature selection and hyperparameter optimization of SVM for human activity recognition. In 2016 3rd International conference on soft computing & machine intelligence (ISCMI) (pp. 104–109). https://doi.org/10.1109/ISCMI.2016.30.
52. von Neumann, J. (1993). First draft of a report on the EDVAC. IEEE Annals of the History of Computing, 15(4), 27–75. https://doi.org/10.1109/85.238389.
53. Wang, J., & Peng, H. (2013). Adaptive fuzzy spiking neural P systems for fuzzy inference and learning. International Journal of Computer Mathematics, 90(4), 857–868. https://doi.org/10.1080/00207160.2012.743653.
54. Wang, J. J., Hoogeboom, H. J., Pan, L., Păun, G., & Pérez-Jiménez, M. J. (2010). Spiking neural P systems with weights. Neural Computation, 22, 2615–2646.
55. Wang, X., Song, T., Gong, F., & Zheng, P. (2016). On the computational power of spiking neural P systems with self-organization. Scientific Reports, 6, 27624.
56. Westbrook, J., Berman, H. M., Feng, Z., Gilliland, G., Bhat, T. N., Weissig, H., et al. (2000). The protein data bank. Nucleic Acids Research, 28, 235–242.
57. Wu, B., Dai, X., Zhang, P., Wang, Y., Sun, F., Wu, Y., Tian, Y., Vajda, P., Jia, Y., & Keutzer, K. (2018). FBNet: Hardware-aware efficient convnet design via differentiable neural architecture search. CoRR abs/1812.03443. arXiv:1812.03443.
58. Zoph, B., & Le, Q. V. (2016). Neural architecture search with reinforcement learning. CoRR abs/1611.01578. arXiv:1611.01578.

Publisher's Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Răzvan Andonie received the M.S. degree in Mathematics and Computer Science from University of Cluj-Napoca, Romania, and the Ph.D. degree from University of Bucharest, Romania. His Ph.D. advisor was Solomon Marcus, Fellow of the Romanian Academy. He is a professor of Computer Science at Central Washington University and Director of the Computational Science MS program. His current research interests are neural networks, deep learning, machine learning, cognitive computing, parallel/distributed computing, and data science. He is an IEEE and ACM senior member.
