
Journal of Membrane Computing

https://doi.org/10.1007/s41965-019-00023-0

SURVEY PAPER

Hyperparameter optimization in learning systems


Răzvan Andonie¹

Received: 16 August 2019 / Accepted: 1 October 2019


© Springer Nature Singapore Pte Ltd. 2019

Abstract
While the training parameters of machine learning models are adapted during the training phase, the values of the hyperparameters (or meta-parameters) have to be specified before the learning phase. The goal is to find a set of hyperparameter values which gives us the best model for our data in a reasonable amount of time. We present an integrated view of methods used in hyperparameter optimization of learning systems, with an emphasis on computational complexity aspects. Our thesis is that we should solve a hyperparameter optimization problem using a combination of techniques for: optimization, search space and training time reduction. Case studies from real-world applications illustrate the practical aspects. We create the framework for a future separation between parameters and hyperparameters in adaptive P systems.

Keywords Hyperparameters · Membrane computing · Spiking neural P system · P system · Neural computing · Machine learning

The paper is based on the invited talk at the 20th Conference on Membrane Computing (CMC 20), August 5–8, 2019, Curtea de Argeş, Romania.

Electronic supplementary material The online version of this article (https://doi.org/10.1007/s41965-019-00023-0) contains supplementary material, which is available to authorized users.

* Răzvan Andonie
  [email protected]

¹ Computer Science Department, Central Washington University, Ellensburg, WA, USA

1 Introduction

The main building element within P systems is the membrane. A membrane is a discrete unit which can contain a set of objects (symbols/catalysts), a set of rules, and a set of other membranes contained within [37, 36]. Essentially, a P system is a membrane structure. To what extent can such a structure adapt to a given problem? Can it be trained in a similar way we train machine learning models or, more specifically, neural networks?

There is a recent tendency to create bridges between adaptive P systems and machine learning paradigms, motivated by the huge success of deep learning in current technologies. Some recent P systems are adaptive [13, 55], and [3]. During the Brainstorming Week on Membrane Computing, Sevilla, 2018, ideas about evolving spiking neural (SN) P systems (introduced in [27]) were discussed¹. A first attempt to use SN P systems in pattern recognition can be found in [49], where SN P systems were reported to outperform back propagation and probabilistic neural networks.

There are things we have to clarify. For instance, we should distinguish between trainable parameters and hyperparameters of a model. In the above-cited references, the plasticity of P systems refers to the parameters only. For example, in [55], a spiking neural P system with self-organization has no initially designed synapses. The synapses can be created or deleted according to the information contained in the involved neurons during the computation. In this case, the synapses are parameters, but not hyperparameters.

Many variants of SN P systems have been proposed [44]: with anti-spikes, with weights, with thresholds, with rules on synapses, with structural plasticity, with learning functions, with polarization, with white hole neurons, with astrocytes, etc. For a given problem, which of these models is the most efficient? In this case, the type of the spiking neuron used is a hyperparameter which has to be instantiated by optimization before the P system evolves and adjusts its parameters.

In supervised machine learning, in contrast to P systems, the difference between parameters and hyperparameters of models was intensively studied.

¹ https://www.gcn.us.es/files/bwmc2018-evolsnp-present.pdf


The motivation of our paper is to create the framework for a future separation between parameters and hyperparameters in adaptive P systems. We present an integrated view of methods used in hyperparameter optimization of learning systems in general, with an emphasis on computational complexity aspects. Case studies from real-world applications will illustrate the practical aspects.

As a first step, we have to define more precisely what we understand by "hyperparameter" in the context of machine learning models. Nearly all model algorithms used in machine learning have a set of tuning hyperparameters which affect how the learning algorithm fits the model to the data. These hyperparameters should be distinguished from internal model parameters, such as the weights or the rule probabilities, which are adaptively adjusted to solve a problem. For instance, the hyperparameters of neural networks typically specify the architecture of the network (number and type of layers, number and type of nodes, etc.).

The model (the framework) itself may be considered, at a meta-level, a hyperparameter. In this case, we have a list of possible models, each with its own list of hyperparameters, and all have to be optimized for a given problem. For simplicity, we will discuss here only optimization of hyperparameters within the framework of a given model. Many optimizers attempt to optimize both the choice of the model and the hyperparameters of the model (e.g., Auto-WEKA, Hyperopt-sklearn, AutoML, auto-sklearn, etc.).

The aim of hyperparameter optimization is to find the hyperparameters of a given model that return the best performance as measured on a validation set. This process can be represented in equation form as:

x* = arg min_{x ∈ ℵ} f(x),    (1)

where f(x) is an objective function to minimize (such as RMSE or error rate) evaluated on the validation set; x* is the set of hyperparameters that yields the lowest value of the score, and x can take on any value in the domain ℵ. In simple terms, we want to find the model hyperparameters that yield the best score on the validation set metric.

In more detail, Eq. (1) can be written as:

x* = arg min_{x ∈ ℵ} f(x, 𝛾*; S_validation).    (2)

Equation (2) includes an inner optimization used to find 𝛾*, the optimal value of 𝛾, for the current x value:

𝛾* = arg min_{𝛾 ∈ 𝛤} f(x, 𝛾; S_train).    (3)

S_validation and S_train denote the validation and training datasets, respectively; 𝛾 is the set of learned parameters in the domain 𝛤, obtained through minimization of the training error.

The validation process is more complex than described here. Each combination of hyperparameter values may result in a different model and Eq. (1) evaluates and compares the objective function for different models. The process of finding the best-performing model from a set of models that were produced by different hyperparameter settings is called model selection [41].

For simplicity, in Eq. (1), we referred to a "validation set", without further details. We have to evaluate the expectation of the score over an unknown distribution and we usually approximate this expectation using a three-way split, dividing the dataset into a training, validation, and test dataset. Having a training–validation pair for hyperparameter tuning and model selection allows us to keep the test set "independent" for model evaluation.

The recent interest in hyperparameter optimization is related to the importance and complexity of deep learning architectures. The existing deep learning hyperparameter optimization algorithms are computationally demanding. For example, obtaining an optimized architecture for the CIFAR-10 and ImageNet datasets required 2000 GPU days of reinforcement learning [58] or 3150 GPU days of evolution [42].

There are two inherent causes of this inefficiency. One is related to the search space, which can be a discrete domain. In its most general form, discrete optimization is NP-complete.

The second cause is that evaluating the objective function to find the score is expensive: each time we try different hyperparameters, we have to train a model on the training data, make predictions on the validation data, and then calculate the validation metric. This optimization is usually done by re-training multiple models with different combinations of hyperparameter values and evaluating their performance. We call this re-training + evaluation for one set of hyperparameter values a trial.

Basically, there are three computational complexity aspects which have to be addressed: (a) choose an efficient optimization method for Eq. (1); (b) reduce the search space; and (c) reduce the training time for each trial. Whereas choosing a good optimization is a well-defined (but difficult) task, the problem of reducing the search space and the training time is also important. This can be done, for instance, by reducing the number of hyperparameters and features, reducing the training set, and using additional objective functions.

The number of trials generally increases exponentially with the number of hyperparameters. Therefore, we would like to reduce the number of hyperparameters. For instance, hyperparameters can be ranked (and selected) based on the functional analysis of the variance of the objective function. This method is known as sensitivity analysis.
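
Before turning to these reduction techniques, the following minimal Python sketch makes Eqs. (1)–(3) and the notion of a trial concrete: each pass through the outer loop is one trial, the inner fit() plays the role of the optimization in Eq. (3), and the validation score is f(x, 𝛾*; S_validation). The dataset, the model, and the candidate values are illustrative assumptions, not part of the original paper.

# Sketch of the nested optimization in Eqs. (1)-(3). Note that the SVC "gamma"
# below is a kernel hyperparameter (part of x); the paper's learned parameters
# gamma* correspond to the SVM coefficients found by fit() on S_train.
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)
# Three-way split: training, validation, and test sets.
X_tmp, X_test, y_tmp, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_tmp, y_tmp, test_size=0.25, random_state=0)

candidates = [{"C": C, "gamma": g} for C in (0.1, 1, 10) for g in (1e-3, 1e-2)]

best_x, best_err = None, float("inf")
for x in candidates:                              # outer loop: one trial per candidate x
    model = SVC(**x).fit(X_train, y_train)        # inner optimization on S_train (Eq. 3)
    err = 1.0 - model.score(X_val, y_val)         # f(x, gamma*; S_validation) (Eq. 2)
    if err < best_err:
        best_x, best_err = x, err

print("x* =", best_x, "validation error =", best_err)
print("test accuracy of the selected model =",
      SVC(**best_x).fit(X_train, y_train).score(X_test, y_test))

The test set is touched only once, after model selection, which is exactly the role of the "independent" test set discussed above.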


Reducing the number of trials is not enough if the training time for each trial is high. In this case, we should attempt to reduce (instance selection) the training set.

Using additional objective functions can also reduce the search space for the optimal hyperparameter configuration. For instance, we may add constraints about the complexity of a neural network (such as the number of connections).

We are ready now to describe the structure of the paper. Section 2 is an overview of some standard methods for hyperparameter optimization. Section 3 presents how the search space and the training time can be reduced by instance selection, hyperparameter/feature ranking and using additional objective functions. Section 4 lists the most recent software packages for hyperparameter optimization. Section 5 presents case studies based on three well-known machine learning models. Section 6 contains final remarks and open problems.

2 Methods for hyperparameter optimization

Hyperparameter optimization should be regarded as a formal outer loop in the learning process. In the most general case, such an optimization should include a budgeting choice of how many CPU cycles are to be spent on hyperparameter exploration, and how many CPU cycles are to be spent evaluating each hyperparameter choice (i.e., by tuning the regular parameters) [9].

A simple strategy for hyperparameter optimization is a greedy approach: investigate the local neighborhood of a given hyperparameter configuration, varying one hyperparameter at a time and measuring how performance changes. The only information obtained with this analysis is how different hyperparameter values perform in the context of a single instantiation of the other hyperparameters. We cannot expect good results with this approach.

Fortunately, we do have more systematic approaches. We will review the most fundamental methods. Details can be found, for instance, in [9].

2.1 Grid search

The most commonly used hyperparameter optimization strategy is a combination of Grid search (GS) and manual tuning. There are several reasons why manual search and grid search prevail as the state of the art despite decades of research into global optimization:

• Manual optimization gives researchers some degree of insight into the optimization process.
• There is no technical overhead or barrier to manual optimization.
• Grid search is simple to implement and parallelization is trivial.
• Grid search (with access to a compute cluster) typically finds a better solution than purely manual sequential optimization (in the same amount of time).
• Grid search is reliable in low-dimensional spaces (e.g., 1-d, 2-d). For instance, grid search is relatively efficient for optimizing the two standard hyperparameters of SVMs (LibSVM does this).

GS suffers from the curse of dimensionality because the number of joint values grows exponentially with the number of hyperparameters. Therefore, GS is not recommended for the optimization of many hyperparameters.

2.2 Random search

Random search (RS) consists of drawing samples from the parameter space following a particular distribution for each of the parameters. Using the same number of trials, RS generally yields better results than GS or more complicated hyperparameter optimization methods. Especially in higher-dimensional spaces, the computation resources required by RS methods are significantly lower than for GS [31]. RS works best under the assumption that not all hyperparameters are equally important [11].

Other advantages of RS are:

– The experiment can be stopped any time and the trials form a complete experiment. The key is to define a good stopping criterion, representing a trade-off between accuracy and computation time.
– New trials can be added to an experiment without having to adjust the grid and commit to a much larger experiment.
– Every trial can be carried out asynchronously. Therefore, RS methods are relatively easy to implement on parallel computer architectures.
– If the computer carrying out a trial fails for any reason, its trial can be either abandoned or restarted without jeopardizing the optimization.

Recent attempts to optimize the RS algorithm are: Hyperband, by Li et al. [32], which speeds up RS through adaptive resource allocation and early stopping; Domhan et al. [17], who developed a probabilistic model to mimic early termination of sub-optimal candidates; and Florea et al. [20], where we introduced a dynamically computed stopping criterion for RS, reducing the number of trials without reducing the generalization performance.
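
As a usage illustration of the two strategies above (Sects. 2.1 and 2.2), the sketch below tunes the two standard SVM hyperparameters with scikit-learn's grid and random search; the dataset, the grid, and the sampling distributions are illustrative assumptions, not taken from the paper.

# Grid search vs. random search over the two standard SVM hyperparameters (C, gamma).
from scipy.stats import loguniform
from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)

# GS: exhaustive search over a fixed grid (the grid grows exponentially with the dimension).
grid = {"C": [0.1, 1, 10, 100], "gamma": [1e-4, 1e-3, 1e-2, 1e-1]}
gs = GridSearchCV(SVC(), grid, cv=3).fit(X, y)

# RS: the same budget of 16 trials, but drawn from continuous log-uniform distributions.
dists = {"C": loguniform(1e-1, 1e2), "gamma": loguniform(1e-4, 1e-1)}
rs = RandomizedSearchCV(SVC(), dists, n_iter=16, cv=3, random_state=0).fit(X, y)

print("GS best:", gs.best_params_, gs.best_score_)
print("RS best:", rs.best_params_, rs.best_score_)

With the same trial budget, RS explores 16 distinct values of each hyperparameter instead of only 4, which is the intuition behind its advantage when only some hyperparameters matter.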


To illustrate the efficiency of RS in high-dimensional spaces, we refer to the following real-world application. Using RS, we have introduced in [31] the first polynomial (in the size of the input and the number of dimensions) algorithm for finding maximal empty hyper-rectangles (holes) in data. All previous (deterministic) algorithms are exponential.

We used 5522 protein structures randomly selected from the Protein Databank, a repository of the atomic coordinates of over 100,000 protein structures that have been solved using experimental methods [56]. Proteins are three-dimensional dynamic structures that mediate virtually all cellular biological events. From the hyper-rectangles generated by our algorithm, we were able to determine which of the 39 dimensions in our data were most frequently the bounding conditions of the largest found hyper-rectangles.

Our algorithm only needs to examine a small fraction of the theoretical maximum of 6.007576 × 10^104 possible hyper-rectangles. In a second stage, we were able to extract if/then rules from the hyper-rectangle output and found several interesting relationships among the 39-dimensional data.

2.3 Derivative-free optimization: Nelder–Mead

In hyperparameter optimization, we usually encounter the following challenges: non-differentiable functions, multiple objectives, large dimensionality, mixed variables (discrete, continuous, permutation), multiple local minima (multimodal), discrete search space, etc. Derivative-free optimization refers to the solution of optimization problems using algorithms that do not require derivative information, but only objective function values.

Unlike derivative-based methods in a convex search space, derivative-free methods are not necessarily guaranteed to find the global optimum. Examples of derivative-free optimization methods are: Nelder–Mead (NMA), Simulated Annealing, Evolutionary Algorithms, and Particle Swarm Optimization (PSO).

The NMA was introduced [34] as early as 1965 and performs a search in n-dimensional space using heuristic ideas. The method uses the concept of a simplex, a structure in n-dimensional space formed by n + 1 non-coplanar points. NMA maintains a set of n + 1 test points arranged as a simplex and then modifies the simplex at each iteration using four operations: reflection, expansion, contraction, and shrinking. Each of these operations generates a new point. The sequence of operations performed in one iteration depends on the value of the objective function at the new point relative to the other points. It extrapolates the behavior of the objective function measured at each test point, to find a new test point and to replace one of the old test points with the new one, etc. If this point is better than the best current point, then it tries to move along this line. If the new point is not much better than the previous value, then the simplex shrinks towards a better point. Despite its age, the NMA search technique is still very popular.

The NMA was used in Convolutional Neural Network (CNN) optimization in [1, 2], in conjunction with a relatively small optimization dataset. It works well for objective functions that are smooth, unimodal and not too noisy. The weakness of this method is that it is not very good for problems with more than about 10 variables; above this number of variables, convergence becomes increasingly difficult.

2.4 Bayesian optimization

RS and GS pay no attention to past results and keep searching across the entire search space. However, it may happen that the optimal answer lies within a small region. In contrast, Bayesian optimization iteratively computes a posterior distribution of functions that best describes the objective function. As the number of observations grows, the posterior distribution improves, and the algorithm becomes more certain of which regions in parameter space are worth exploring and which are not. By evaluating hyperparameters that appear more promising from past results, Bayesian methods can find better model settings than RS or GS in fewer iterations. A review of Bayesian optimization can be found in [45].

Bayesian optimization keeps track of past evaluation results, which it uses to form a probabilistic model mapping hyperparameters to a probability of a score on the objective function:

P(score | hyperparameters)

This model is called a surrogate for the objective function. The surrogate probability model is iteratively updated after each evaluation of the objective function.

Following Will Koehrsen², we have the following steps:

1. Build a surrogate probability model of the objective function.
2. Find the hyperparameters that perform best on the surrogate.
3. Apply these hyperparameters to the true objective function.
4. Update the surrogate model incorporating the new results.
5. Repeat steps 2–4 until max iterations or time is reached.

Sequential model-based optimization (SMBO) methods are a formalization of Bayesian optimization. The sequential refers to running trials one after another, each time trying better hyperparameters and updating a probability model (surrogate).

² https://towardsdatascience.com/a-conceptual-explanation-of-bayesian-model-based-hyperparameter-optimization-for-machine-learning-b8172278050f
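
A minimal skeleton of the five steps above, in generic Python; the surrogate, the proposal rule, and the objective are deliberately trivial placeholders (assumptions), not the API of any particular Bayesian optimization package.

# Generic SMBO skeleton following the five steps above.
import random

def objective(params):                        # expensive in practice: train + validate a model
    return (params["lr"] - 0.01) ** 2         # toy stand-in for the validation error

def sample_params():
    return {"lr": 10 ** random.uniform(-4, -1)}

def fit_surrogate(history):
    return list(history)                      # placeholder: a real SMBO fits a GP/TPE/forest here

def propose(surrogate):
    # placeholder acquisition rule: among a few random candidates, prefer the one
    # closest to the best configuration observed so far
    candidates = [sample_params() for _ in range(20)]
    if not surrogate:
        return random.choice(candidates)
    best_seen = min(surrogate, key=lambda h: h[1])[0]
    return min(candidates, key=lambda c: abs(c["lr"] - best_seen["lr"]))

history = []                                  # (params, score) pairs observed so far
for _ in range(30):                           # step 5: repeat steps 2-4
    surrogate = fit_surrogate(history)        # steps 1/4: (re)build the surrogate
    params = propose(surrogate)               # step 2: best candidate on the surrogate
    score = objective(params)                 # step 3: evaluate the true objective
    history.append((params, score))           # step 4: record the new result

print(min(history, key=lambda h: h[1]))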


There are several variants of SMBO methods that differ in steps 3 and 4, namely, how they build a surrogate of the objective function and the criteria used to select the next hyperparameters: Gaussian Processes, Random Forest Regressions, Tree Parzen Estimators, etc.

In low-dimensional problems with numerical hyperparameters, the best available hyperparameter optimization methods use Bayesian optimization [48]. However, Bayesian optimization is restricted to problems of moderate dimension [45].

3 Computational complexity issues

There are two complexity issues related to the search process in hyperparameter optimization:

– A1. The execution time of each trial. The training phase for each trial depends on the size and the dimensionality of the training set.
– A2. The complexity of the search space itself and the number of evaluated combinations of hyperparameters.

In the case of deep learning we have both aspects: a high-dimensional search space, which, according to the curse of dimensionality principle, has to be covered with an exponentially increasing number of points (in this case, combinations of hyperparameters); and large training sets, which are typically used in deep learning.

To address these issues and reduce the search space, some standard techniques may be used:

– Reduce the training dataset based on statistical sampling (relates to A1).
– Feature selection (relates to A1).
– Hyperparameter selection: detect which hyperparameters are more important for the neural network optimization. This may reduce the search space (relates to A2).
– Beside the accuracy, use additional objective functions: number of operations, optimization time, etc. This may also reduce the search space (relates to A2). For instance, superior results were obtained by combining accuracy with visualization via a deconvolution network [1].

We will describe in the following how these techniques can be used.

3.1 Reduce the training dataset

Generally, training models with enough information is essential to achieve good performance. However, it is common that a training dataset T contains samples that may be similar to each other (that is, redundant) or noisy. This increases computation time and can be detrimental to generalization performance.

Instance selection aims to select a subset S ⊂ T, hoping that it can represent the whole training set and achieve acceptable performance. The techniques for instance selection are very similar to the ones in feature selection. For example, instance selection methods can either start with S = 𝛷 (incremental method) or S = T (decremental method). As a result, this reduces training time. A review of instance selection methods can be found in [35].

Like in feature selection, according to the strategy used for selecting instances, we can divide the instance selection methods in two groups [35]:

1. Wrapper. The selection criterion is based on the accuracy obtained by a classifier (commonly, those instances that do not contribute to the classification accuracy are discarded from the training set).
2. Filter. The selection criterion uses a selection function which is not based on a classifier.

For hyperparameter optimization, since we evaluate different ML models, we are only interested in filter instance selection.

Several instance selection techniques have been introduced. For example, a very simple technique is to select one training sample at a time such that, when added to the previous set of examples, it results in the largest decrease in a squared error estimate criterion [39]. In [18], a measure of a sample's influence on the classifier output is used to reduce the training set. Stochastic sampling algorithms were introduced in [47]. The progressive sampling method presented in [40] is an incremental method using progressively larger samples as long as model accuracy improves. Instance selection algorithms for CNN architectures were proposed in [1, 2, 28].

A related approach to instance selection is active learning, which sequentially identifies critical samples to train on. For example, Bengio et al. suggest that guiding a classifier by presenting training samples in an order of increasing difficulty can improve learning [8]. They use the following observation: "Humans and animals learn much better when the examples are not randomly presented but organized in a meaningful order which illustrates gradually more concepts, and gradually more complex ones". There are also other ways to order the training patterns (for instance, [16]). According to [8], these ordering strategies can be seen as a special form of transfer learning where the initial tasks are used to guide the learner so that it will perform better on the final task.

At the extreme, we may consider the order of the training sequence as a hyperparameter of the model. However, finding the optimal permutation based on some performance criteria is computationally not feasible, since it is in the order of the possible permutations.
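
As a simple illustration of reducing the training set by statistical sampling (the filter setting of Sect. 3.1), the sketch below draws a stratified random subsample once and reuses it for every hyperparameter trial; the dataset and the subsample size are illustrative assumptions.

# Filter-style reduction of the training set by stratified random sampling:
# the selection does not depend on any classifier, so the same reduced set S
# can be reused for all hyperparameter trials.
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)

# Keep 25% of T as S, preserving the class proportions (stratification).
X_small, _, y_small, _ = train_test_split(
    X, y, train_size=0.25, stratify=y, random_state=0)

print(len(X), "->", len(X_small), "training samples per trial")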


3.2 Reduce the number of features

Feature selection is a standard technique in machine learning [24]. By reducing the number of features we reduce training time. Feature selection is very similar to hyperparameter selection and similar techniques can be used to assess the importance of hyperparameters and features. Basically, we search for an optimal configuration of hyperparameters (or features), using trials to assess each configuration.

The search for the optimal joint combination of features and hyperparameters is computationally hard. A sequential greedy search in two stages (feature selection followed by hyperparameter selection) is simpler. We omit in this case that the selection of features and hyperparameters are not independent processes, but this may be acceptable.

The problem of performing feature selection inside or outside the cross-validation loop (the loop which is doing the fine tuning of the parameters) was studied in [43] and had very nuanced results. Ultimately, it depends on the dataset and the problem to be solved.

Some software packages (see Sect. 4) perform feature selection and hyperparameter optimization together. There are also some published results in which features + hyperparameters are optimized together for some specific models, for example, the SVM [51].

3.3 Hyperparameter selection by functional analysis of variance

If we can assess the importance of each hyperparameter, then we can also select and optimize only the most important hyperparameters of a model. This would reduce the complexity of hyperparameter optimization.

Optimally, we would like to know how all hyperparameters affect performance across all their instantiations. In most cases, such a general optimization is computationally prohibitive. A greedy approach to assess the importance of a hyperparameter is to vary one hyperparameter at a time and measure how this affects the performance of the objective function. The only information obtained with this analysis is how different hyperparameter values perform in the context of a single instantiation of the other hyperparameters.

A more elaborate analysis based on predictive models can be used to quantify the performance of a hyperparameter instantiation in the context of all instantiations of the other hyperparameters. For instance, we can use sensitivity analysis (SA) [29], described as follows. Once a network has been trained, calculate an average value for each hyperparameter. Then, holding all variables but one at a time at their average levels, vary the one input over its entire range and compute the variability produced in the net outputs. Analysis of this variability may be done for several different networks, each trained from a different weight initialization. The algorithm will then rank the hyperparameters from highest to lowest according to the mean variability produced in the output.

Three different sensitivity measures were proposed in [29], based on: output range, variance, and average gradient over all the intervals. Although the use of average values for all but one input does not capture all of the complex interactions in the input space, it does produce a rough estimate of the model's univariate sensitivity to each input variable. Obviously, we do not consider the interactions between hyperparameters.

SA measures the effects on the output of a given model when the inputs are varied through their range of values. It allows a ranking of the inputs that is based on the amount of output changes that are produced due to variations in a given input. SA can be used both for hyperparameter and feature importance assessment.

A more sophisticated approach is based on the analysis of variance (functional ANOVA) [26]. It is available as a software package (fANOVA³) able to approximate the importance of the hyperparameters of a model.

Recently introduced, the N-RReliefF algorithm [50] can also estimate the contribution of each single hyperparameter to the performance of a model. N-RReliefF was used to determine the importance of the interactions between hyperparameters on 100 data sets. The results showed that the same hyperparameters have similar importance on different data sets. This does not mean that only adjusting the most important hyperparameters and combinations is the best option in all cases. When there are enough computing resources, it is still recommended to optimize all hyperparameters [26]. However, for computationally intensive optimizations, hyperparameter selection is a reasonable option.

³ https://www.automl.org/algorithm-analysis/fanova/
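
To make the sensitivity-analysis ranking of Sect. 3.3 concrete, the sketch below varies one hyperparameter at a time around average values of the others and ranks the hyperparameters by the output-range measure; the objective function, the hyperparameter names, and their ranges are toy assumptions.

# One-at-a-time sensitivity analysis in the spirit of [29]: hold all
# hyperparameters at an "average" value, sweep one over its range, and rank
# the hyperparameters by the range of the produced objective values.
import numpy as np

def objective(lr, batch_size, dropout):            # stand-in for a validation score
    return -((np.log10(lr) + 2) ** 2) - 0.1 * dropout + 0.001 * batch_size

ranges = {"lr": np.logspace(-4, -1, 10),
          "batch_size": np.linspace(16, 256, 10),
          "dropout": np.linspace(0.0, 0.5, 10)}
averages = {name: float(np.mean(vals)) for name, vals in ranges.items()}

sensitivity = {}
for name, vals in ranges.items():
    scores = []
    for v in vals:
        args = dict(averages)
        args[name] = v                              # vary only this hyperparameter
        scores.append(objective(**args))
    sensitivity[name] = max(scores) - min(scores)   # output-range sensitivity measure

for name, s in sorted(sensitivity.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {s:.4f}")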


3.4 Use additional objective functions

The performance of the objective function f (in Eq. 1) is usually measured in terms of the accuracy of the model. Beside accuracy, we may use additional objective functions. The search could also be guided by goals like training time or memory requirements of the network.

One possibility is to add an objective function measuring the complexity of the model. Smaller complexity also means a smaller number of hyperparameters. We prefer models with small complexities. Models with lower complexity are not only faster to train; they also require smaller training sets (the curse of dimensionality) and have a smaller chance to over-fit (over-fitting increases with the number of hyperparameters).

For a CNN network, the complexity is the aggregation of the following hyperparameters' values: number of layers, number of maps, number of fully connected layers, number of neurons in each layer. In the case of a ConvNet neural network architecture implemented on mobile devices, beside accuracy, we may also consider the characteristics of the hardware implementation. Such metrics include latency and energy consumption [57].

4 Software for hyperparameter optimization

Several software libraries dedicated to hyperparameter optimization exist. Each optimization technique included in a package is called a solver. The choice of a particular solver depends on the problem to be optimized.

LIBSVM [14] and scikit-learn [38] come with their own implementation of GS, with scikit-learn also offering support for RS.

Bayesian techniques are implemented by packages like BayesianOptimization⁴, Spearmint⁵, and pyGPGO⁶.

Hyperopt-sklearn is a software project that provides automatic algorithm configuration of the scikit-learn library. It can be used for both model selection and hyperparameter optimization. Hyperopt-sklearn has the following implemented solvers [12]: RS, simulated annealing, and Tree-of-Parzen-Estimators.

Optunity⁷ is a Python library containing various optimizers for hyperparameter tuning. Optunity is currently also supported in R, MATLAB, GNU Octave and Java through Jython. It has the following solvers available: GS, RS, PSO, NMA, Covariance Matrix Adaptation Evolutionary Strategy, Tree-structured Parzen Estimator, and Sobol sequences.

Auto-WEKA [30], built on top of Weka [25], is able to perform GS, RS, and Bayesian optimization. Auto-sklearn [19] extends the idea of configuring a general machine learning framework with efficient global optimization, introduced with Auto-WEKA. It is built around scikit-learn and automatically searches for the right learning algorithm for a new machine learning dataset and optimizes its hyperparameters.

Following the idea of Auto-WEKA, several automated machine learning tools (AutoML) were recently developed. Their ultimate goal is to automate the end-to-end process of applying machine learning to real-world problems, even for people with no major expertise in this field. Ideally, such a tool will choose the optimal pipeline for a labeled input dataset: data preprocessing, feature selection/extraction, and the learning model with its optimal hyperparameters.

Some of the existing AutoML tools are⁸: MLBox, auto-sklearn, Tree-Based Pipeline Optimization Tool, H2O, AutoKeras. TransmogrifAI⁹ is an end-to-end AutoML library for structured data written in Scala that runs on top of Apache Spark. It was developed with a focus on accelerating machine learning developer productivity through machine learning automation.

Commercial cloud-based AutoML services offer highly integrated hyperparameter optimization capabilities. Some of them are offered by well-known companies:

– Google Cloud AutoML¹⁰, a suite of machine learning products that enables developers with limited machine learning expertise to train high-quality models specific to their business needs.
– Microsoft Azure ML¹¹, which can be used to streamline the building, training, and deployment of machine learning models.
– Amazon SageMaker Automatic Model Tuning¹², which can launch multiple training jobs, with different hyperparameter combinations, based on the results of completed training jobs. SageMaker uses Bayesian hyperparameter optimization.

⁴ https://github.com/fmfn/BayesianOptimization
⁵ https://github.com/HIPS/Spearmint
⁶ https://github.com/hawk31/pyGPGO
⁷ http://optunity.readthedocs.io/en/latest/
⁸ https://heartbeat.fritz.ai/automl-the-next-wave-of-machine-learning-5494baac615f
⁹ https://transmogrif.ai/
¹⁰ https://cloud.google.com/automl/
¹¹ https://docs.microsoft.com/en-us/azure/machine-learning/studio-module-reference/tune-model-hyperparameters
¹² https://aws.amazon.com/sagemaker/
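
As a usage example for this class of libraries, the sketch below runs the Tree-of-Parzen-Estimators solver of the hyperopt package (the library underlying Hyperopt-sklearn [12]) on the two standard SVM hyperparameters; the dataset, the search-space bounds, and the trial budget are assumptions, and the exact API may differ between hyperopt versions.

# Tuning C and gamma of an SVM with hyperopt's TPE solver.
import numpy as np
from hyperopt import fmin, tpe, hp
from sklearn.datasets import load_digits
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)

space = {"C": hp.loguniform("C", np.log(1e-1), np.log(1e2)),
         "gamma": hp.loguniform("gamma", np.log(1e-4), np.log(1e-1))}

def loss(params):
    # hyperopt minimizes, so return 1 minus the mean cross-validated accuracy
    return 1.0 - cross_val_score(SVC(**params), X, y, cv=3).mean()

best = fmin(fn=loss, space=space, algo=tpe.suggest, max_evals=50)
print("best hyperparameters:", best)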


5 Case studies

5.1 Hyperparameter optimization in fuzzy ARTMAP models

Hyperparameter optimization is not a new concept. For example, in 1986, Grefenstette optimized the hyperparameters of a genetic algorithm using a meta-level genetic algorithm, an intriguing concept at that time [22].

In [5], we optimized the hyperparameters of a class of Fuzzy ARTMAP neural networks, Fuzzy ARTMAP with Input Relevances (FAMR). The FAMR, introduced in [7], is a Fuzzy ARTMAP incremental learning system used for classification and probability estimation. During the learning phase, each training sample is assigned a relevance factor proportional to the importance of that sample.

Our work in [5] was related to the prediction of biological activities of HIV-1 protease inhibitory compounds when inferring from small training sets. In fact, we were forced to use a training set of only 176 samples (molecules in this case), since no other data were available. We optimized the FAMR training data relevances using a genetic algorithm. We also optimized the order of the training data presentation. These optimizations ameliorated the problem of insufficient data and we improved the generalization performance of the trained model. The computational overhead induced by these optimizations was acceptable.

In [6] we optimized not only the relevances and the order of the training data presentation but also some hyperparameters of the FAMR network. We used again a genetic algorithm. To some extent, the prediction performance improved, but the computational overhead increased significantly and we faced overfitting aspects. According to our experiments, we concluded that using a genetic algorithm for hyperparameter optimization is computationally not feasible for large training datasets.

5.2 Hyperparameter optimization in SVM models

Support vector machine (SVM) classifiers depend on several parameters and are quite sensitive to changes in any of those parameters [15]. For a Gaussian kernel, there are two hyperparameters which define an SVM model: C and 𝛾, the parameter of the Gaussian kernel.

In [20], we introduced a dynamic early stopping condition for RS hyperparameter optimization, tested for SVM classification. We significantly reduced the number of trials. The code runs on a multi-core system and has good scalability for an increasing number of cores. We will describe in the following the main results from [20], omitting details which are less relevant here.

A simplified version of a hyperparameter optimization algorithm is characterized by the objective fitness function f and the generator of samples g. The fitness function returns a classification accuracy measure of the target model. The generator g is in charge of providing the next set of values that will be used to compute the model's fitness. A hasNext method implemented by the generator offers the possibility to terminate the algorithm before the maximum number of N evaluations is reached, if some convergence criterion is satisfied.

In the particular case of RS, the generator g draws samples from the specific distribution of each of the hyperparameters to be optimized. Our goal is to reduce the computational complexity of the RS method in terms of m, the number of trials. In other words, we aim to compute fewer than N trials, without a significant impact on the value of the fitness function.

For this, we introduce a dynamic stopping criterion, included in a randomized optimization algorithm (Algorithm 1). The algorithm is a two-step optimizer. First, it iterates for a predefined number of steps n, n << N, and finds the optimal combination of hyperparameter values, temp_opt. Then, it searches for the first result better than temp_opt. The optimal result, opt, is either the first result better than temp_opt or temp_opt if N is reached.
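
A compact sketch of this two-step optimizer, in generic Python; the sample generator g and the fitness function stand in for an actual SVM trial, and the choice n = N/e anticipates the result derived just below. All of these are illustrative assumptions, not the reference implementation of Algorithm 1.

# Dynamic stopping criterion in the spirit of [20]: run n trials and record the
# best value temp_opt, then stop at the first trial that beats temp_opt
# (or after N trials in total).
import math
import random

def generate():                                 # generator g: one hyperparameter sample
    return {"C": 10 ** random.uniform(-1, 2), "gamma": 10 ** random.uniform(-4, -1)}

def fitness(params):                            # placeholder for training + validating an SVM
    return 1.0 / (1.0 + abs(math.log10(params["C"]) - 1) + abs(math.log10(params["gamma"]) + 2))

def random_search_with_stopping(N):
    n = max(1, round(N / math.e))               # length of the first phase
    temp_opt, temp_args = -float("inf"), None
    for _ in range(n):                          # phase 1: plain RS for n trials
        x = generate()
        f = fitness(x)
        if f > temp_opt:
            temp_opt, temp_args = f, x
    opt, opt_args = temp_opt, temp_args
    for _ in range(N - n):                      # phase 2: stop at the first improvement
        x = generate()
        f = fitness(x)
        if f > temp_opt:
            opt, opt_args = f, x
            break
    return opt_args, opt

print(random_search_with_stopping(N=150))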


Given a restricted computational budget, expressed by a target number of trials m, we would like to determine the optimal value for n, the value of n which maximizes the probability of obtaining opt after at most m trials, n < m < N (where m = N/2 is a reasonable target).

We determined this optimal value: n = m/e. Choosing for n a value greater than the optimal one not only increases the probability of finding the optimal hyperparameter instantiation but also increases the probability of using more trials.

Our result can be used to implement an improved version of the previous algorithm that automatically sets the value of n to N/e. For example, to maximize the chances to obtain the best value after a target maximum of 150 attempts, we must set n to 150/e (≈ 55). For a target maximum of 100 attempts, n should be 37, and so on.

We can reverse the problem: given an acceptable probability P0 to achieve the best result among the N trials, which is the optimal value for n? For the standard RS algorithm without the dynamic stopping criterion, if all trials are independent, the required number of trials needed to identify the optimum with a probability P0 is given by m = N ⋅ P0.

With our algorithm, we can compromise, accepting a probability P, P < P0, to identify gopt using fewer trials. If all N combinations are tested (when the stopping criterion opt is not activated), P has the lower bound:

P ≥ m/(eN) ⋅ ln e² = 2 P0/e ≈ 0.7357 P0.

However, the probability P to find gopt after fewer than N trials has the lower bound:

P ≥ m/(eN) = P0/e ≈ 0.3678 P0.

We used our method to optimize five SVM hyperparameters: kernel type (RBF, Polynomial or Linear, chosen with equal probability); 𝛾 (drawn from an exponential distribution with 𝜆 = 10); cost (C, drawn from an exponential distribution with 𝜆 = 10); degree (chosen with equal probability from the set {2, 3, 4, 5}); and coef0 (uniform on [0, 1]).

We ran the experiments on six of the most popular datasets from the UCI Machine Learning Repository¹³ and obtained on par accuracy values with the existing mainstream hyperparameter optimization techniques. Our algorithm terminates after a significantly reduced number of trials compared to the standard implementation of RS, which leads to an important decrease in the computational budget required for the optimization.

5.3 Hyperparameter optimization in CNN models

In [21], we introduced an improved version of the RS method, the Weighted Random Search (WRS) method. The focus of the WRS method is the optimization of the classification (prediction) performance within the same computational budget. We applied this method to CNN architecture optimization. We will describe in a simplified way the WRS algorithm from [21].

Similar to GS and RS, we make the assumption that there is no statistical correlation between the variables of the objective function (hyperparameters). The standard RS technique [10] generates a new multi-dimensional sample at each step k, with new random values for each of the sample's dimensions (features): X^k = {x_i^k}, i = 1, …, d, where x_i is generated according to a probability distribution P_i(x), i = 1, …, d, and d is the number of dimensions.

WRS is an improved version of RS, designed for hyperparameter optimization. It assigns probabilities of change p_i, i = 1, …, d, to each dimension. For each dimension i, after a certain number of steps k_i, instead of always generating a new value, we generate it with probability p_i and use the best value known so far with probability 1 − p_i.

The intuition behind the proposed algorithm is that after already fixing d_0 (1 < d_0 < d) values, each d-dimensional optimization problem reduces itself to a (d − d_0)-dimensional one. In the context of this (d − d_0)-dimensional problem, choosing a set of values that already performed well for the remaining dimensions might prove more fruitful than choosing some d − d_0 random values. To avoid getting stuck in a local optimum, instead of setting a hard boundary between choosing the best combination of values found so far or generating new random samples, we assign probabilities of change for each dimension of the search space.

WRS has two phases. In the first phase, it runs the RS for a predefined number of trials, which allows: (a) to identify the best combination of values so far; and (b) to give enough input on the importance of each dimension in the optimization process. The second phase considers the probabilities of change and generates the candidate values according to them. Between these two phases, we run one instance of fANOVA [26], to determine the importance of each dimension with respect to the objective function. Intuitively, the most important dimension (the dimension that yields the largest variation of the objective function) is the one that should change most frequently, to cover as much of the variation range as possible. For a dimension with small variation of the objective function, it might be more efficient to keep a certain temporary optimum value once this has been identified.

A step of the WRS algorithm applied to function maximization is described by Algorithm 2, whereas the entire method is detailed in Algorithm 3. F is the objective function, the value F(X) has to be computed for each argument, X^k is the best argument at iteration k, whereas N is the total number of iterations.

At each step of Algorithm 3, at least one dimension will change, hence we always choose at least one of the p_i probabilities to be equal to one. For the other probabilities, any value in (0, 1] is valid. If all values are one, then we are in the case of RS.

¹³ http://archive.ics.uci.edu/ml/index.php


Besides a way to compute the objective function, Algorithm 2 requires only the combination of values that yields the best F(X) value obtained so far and the probability of change for each dimension. The current optimal value of the objective function can be made optional, since the comparison can be done outside of Algorithm 2. Algorithm 3 coordinates the sequence of the described steps and calls Algorithm 2 in a loop, until the maximum number of trials N is reached.

The value p_i is the probability of change and k_i the minimum number of required values, for dimension i, i = 1, …, d. We proved that, regardless of the distribution used for generating x_i, if we choose k_i, i = 1, …, d, so that at least two distinct values are generated for each dimension, we have: at any step n, WRS has a greater probability than RS to find the global optimum. Therefore, given the same number of iterations, on average, WRS finds the global optimum faster than RS.

We sorted the function variables with respect to their importance (weights) and assigned their probabilities p_i accordingly: the smaller the weight of a parameter, the smaller its probability of change. Therefore, the most important parameter is the one that will always change (p_1 = 1). To compute the weight of each parameter, we run RS for a predefined number of steps, N_0 < N. On the obtained values, we applied fANOVA to estimate the importance of the hyperparameters. If w_i is the weight of the i-th parameter and w_1 is the weight of the most important one, then p_i = w_i/w_1, i = 1, …, d. We optimized the following CNN hyperparameters: the number of convolution layers, the number of fully connected layers, the number of output filters in each convolution layer, and the number of neurons in each fully connected layer.

We generated each hyperparameter according to the uniform distribution and assessed the performance of the model solely by the classification accuracy.
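
A simplified sketch of the WRS step (Algorithm 2) and the surrounding loop (Algorithm 3); the objective function, the hyperparameter ranges, and the importance weights (which [21] obtains with fANOVA) are illustrative assumptions, not the reference implementation.

# Simplified Weighted Random Search: each dimension i changes with probability
# p_i = w_i / w_1; otherwise the best known value is kept. The objective, the
# integer ranges, and the weights are placeholders for a real CNN trial.
import random

ranges = {"conv_layers": (1, 6), "fc_layers": (1, 3), "filters": (16, 128), "neurons": (32, 512)}

def objective(x):                               # placeholder for training a CNN and
    return -sum(x.values())                     # returning its validation accuracy

def wrs(weights, N=200):
    names = sorted(weights, key=weights.get, reverse=True)
    w_max = weights[names[0]]
    p = {n: weights[n] / w_max for n in names}  # p_i = w_i / w_1, hence p_1 = 1
    best = {n: random.randint(*ranges[n]) for n in names}
    best_f = objective(best)
    for _ in range(N):                          # Algorithm 3: loop over WRS steps
        x = {}                                  # Algorithm 2: build one candidate
        for n in names:
            if random.random() < p[n]:
                x[n] = random.randint(*ranges[n])   # change this dimension
            else:
                x[n] = best[n]                      # keep the best value known so far
        f = objective(x)
        if f > best_f:
            best, best_f = x, f
    return best, best_f

weights = {"conv_layers": 0.5, "fc_layers": 0.1, "filters": 0.3, "neurons": 0.2}
print(wrs(weights))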


For the same number of trials, the WRS algorithm produced significantly better results than RS on the CIFAR-10 dataset.

6 Conclusions and open problems

Determining the proper architecture design for machine learning models is a challenge because it differs for each dataset and therefore requires adjustments for each one. For most datasets, only a few of the hyperparameters really matter. However, different hyperparameters are important on different data sets. There is no mathematical method for determining the appropriate hyperparameters for a given dataset, so the selection relies on trial and error. We should use a customized combination of optimization, search space and training time reduction techniques.

The computational power of SN P systems was well studied. Several variants of these systems are known to have universal computational capability, being equivalent to Turing machines [44]. Meanwhile, as early as 1943, McCulloch and Pitts [33] asserted that neural networks are computationally universal. This was also discussed by John von Neumann in 1945 [52]. Details about the universal computational capability of neural models can be found in [4, 46].

There are many possibilities for SN P systems for applications like optimization and classification with learning ability [44]. However, up to this moment, we know of no attempt to bring problems and techniques from the neural computing area to the SN P systems area [54]. The current SN P systems with weights, like the McCulloch–Pitts neurons (introduced in 1943), are not able to adapt the weights during a learning process. SN P systems with weights were studied both in the generative and the accepting case, but not in an adaptive case. There are few attempts to use neural network learning rules for adapting the parameters of P systems: [53] uses the Widrow–Hoff rule and [23] uses the Hebbian rule to learn parameters.

To create further analogies between adaptive P systems and machine learning models, we should use more advanced parameter learning algorithms to train P systems. At a meta-level, we should then be able to optimize the hyperparameters of these models.

Acknowledgements I am deeply grateful to Dr. Gheorghe Păun for his valuable comments on a draft of this paper and for motivating me to finish it.

References

1. Albelwi, S., & Mahmood, A. (2016). Analysis of instance selection algorithms on large datasets with deep convolutional neural networks. In 2016 IEEE Long Island systems, applications and technology conference (LISAT) (pp. 1–5).
2. Albelwi, S., & Mahmood, A. (2016). Automated optimal architecture of deep convolutional neural networks for image recognition. In 2016 15th IEEE International conference on machine learning and applications (ICMLA) (pp. 53–60). https://doi.org/10.1109/ICMLA.2016.0018.
3. Aman, B., & Ciobanu, G. (2019). Adaptive P systems. Lecture Notes in Computer Science, 11399, 57–72.
4. Andonie, R. (1998). The psychological limits of neural computation. In M. Kárný, K. Warwick, & V. Kůrková (Eds.), Dealing with complexity: A neural networks approach (pp. 252–263). London: Springer.
5. Andonie, R., Fabry-Asztalos, L., Abdul-Wahid, C. B., Abdul-Wahid, S., Barker, G. I., & Magill, L. C. (2011). Fuzzy ARTMAP prediction of biological activities for potential HIV-1 protease inhibitors using a small molecular data set. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 8(1), 80–93. https://doi.org/10.1109/TCBB.2009.50.
6. Andonie, R., Fabry-Asztalos, L., Magill, L., & Abdul-Wahid, S. (2007). A new fuzzy ARTMAP approach for predicting biological activity of potential HIV-1 protease inhibitors. In 2007 IEEE International conference on bioinformatics and biomedicine (BIBM 2007) (pp. 56–61). https://doi.org/10.1109/BIBM.2007.9.
7. Andonie, R., & Sasu, L. (2006). Fuzzy ARTMAP with input relevances. IEEE Transactions on Neural Networks, 17(4), 929–941. https://doi.org/10.1109/TNN.2006.875988.
8. Bengio, Y., Louradour, J., Collobert, R., & Weston, J. (2009). Curriculum learning. In Proceedings of the 26th annual international conference on machine learning, ICML'09 (pp. 41–48). New York, NY, USA: ACM. https://doi.org/10.1145/1553374.1553380.
9. Bergstra, J., Bardenet, R., Bengio, Y., & Kégl, B. (2011). Algorithms for hyper-parameter optimization. In 25th Annual conference on neural information processing systems (NIPS 2011), advances in neural information processing systems (Vol. 24). Granada, Spain: Neural Information Processing Systems Foundation.
10. Bergstra, J., Bardenet, R., Bengio, Y., & Kégl, B. (2011). Algorithms for hyper-parameter optimization. In J. Shawe-Taylor, R. S. Zemel, P. L. Bartlett, F. C. N. Pereira, & K. Q. Weinberger (Eds.), NIPS (pp. 2546–2554). http://dblp.uni-trier.de/db/conf/nips/nips2011.html.
11. Bergstra, J., & Bengio, Y. (2012). Random search for hyper-parameter optimization. Journal of Machine Learning Research, 13, 281–305.
12. Bergstra, J., Komer, B., Eliasmith, C., Yamins, D., & Cox, D. D. (2015). Hyperopt: A Python library for model selection and hyperparameter optimization. Computational Science and Discovery, 8(1), 014008. http://stacks.iop.org/1749-4699/8/i=1/a=014008.
13. Cabarle, F. G. C., Adorna, H. N., Pérez-Jiménez, M. J., & Song, T. (2015). Spiking neural P systems with structural plasticity. Neural Computing and Applications, 26(8), 1905–1917.
14. Chang, C. C., & Lin, C. J. (2011). LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2, 27. Software retrieved from http://www.csie.ntu.edu.tw/~cjlin/libsvm.
15. Cortes, C., & Vapnik, V. (1995). Support-vector networks. Machine Learning, 20(3), 273–297. https://doi.org/10.1023/A:1022627411411.


16. Dagher, I., Georgiopoulos, M., Heileman, G. L., & Bebis, G. (1998). Ordered fuzzy ARTMAP: A fuzzy ARTMAP algorithm with a fixed order of pattern presentation. In 1998 IEEE International joint conference on neural networks proceedings. IEEE world congress on computational intelligence (Cat. No. 98CH36227) (Vol. 3, pp. 1717–1722). https://doi.org/10.1109/IJCNN.1998.687115.
17. Domhan, T., Springenberg, J. T., & Hutter, F. (2015). Speeding up automatic hyperparameter optimization of deep neural networks by extrapolation of learning curves. In Proceedings of the 24th international conference on artificial intelligence, IJCAI'15 (pp. 3460–3468). AAAI Press. http://dl.acm.org/citation.cfm?id=2832581.2832731.
18. Engelbrecht, A. P. (2001). Selective learning for multilayer feedforward neural networks. In Proceedings of the 6th international work-conference on artificial and natural neural networks: Connectionist models of neurons, learning processes and artificial intelligence—Part I, IWANN'01 (pp. 386–393). London, UK: Springer.
19. Feurer, M., Klein, A., Eggensperger, K., Springenberg, J. T., Blum, M., & Hutter, F. (2015). Efficient and robust automated machine learning. In Proceedings of the 28th international conference on neural information processing systems—Volume 2, NIPS'15 (pp. 2755–2763). Cambridge, MA, USA: MIT Press. http://dl.acm.org/citation.cfm?id=2969442.2969547.
20. Florea, A. C., & Andonie, R. (2018). A dynamic early stopping criterion for random search in SVM hyperparameter optimization. In L. Iliadis, I. Maglogiannis, & V. Plagianakos (Eds.), Artificial intelligence applications and innovations (pp. 168–180). Cham: Springer International Publishing.
21. Florea, A. C., & Andonie, R. (2019). Weighted random search for hyperparameter optimization. International Journal of Computers Communications & Control, 14(2), 154–169. https://doi.org/10.15837/ijccc.2019.2.3514.
22. Grefenstette, J. J. (1986). Optimization of control parameters for genetic algorithms. IEEE Transactions on Systems, Man, and Cybernetics, 16(1), 122–128. https://doi.org/10.1109/TSMC.1986.289288.
23. Gutiérrez-Naranjo, M. A., & Pérez-Jiménez, M. J. (2009). Hebbian learning from spiking neural P systems view. In D. W. Corne, P. Frisco, G. Păun, G. Rozenberg, & A. Salomaa (Eds.), Membrane computing (pp. 217–230). Berlin: Springer.
24. Guyon, I., & Elisseeff, A. (2003). An introduction to variable and feature selection. Journal of Machine Learning Research, 3, 1157–1182.
25. Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., & Witten, I. H. (2009). The WEKA data mining software: An update. ACM SIGKDD Explorations Newsletter, 11(1), 10–18. https://doi.org/10.1145/1656274.1656278.
26. Hutter, F., Hoos, H., & Leyton-Brown, K. (2014). An efficient approach for assessing hyperparameter importance. In Proceedings of the 31st international conference on machine learning, ICML 2014, Beijing, China, 21–26 June 2014 (pp. 754–762).
27. Ionescu, M., Păun, G., & Yokomori, T. (2006). Spiking neural P systems. Fundamenta Informaticae, 71, 279–308.
28. Kabkab, M., Alavi, A., & Chellappa, R. (2016). DCNNs on a diet: Sampling strategies for reducing the training set size. CoRR abs/1606.04232. arXiv:1606.04232.
29. Kewley, R. H., Embrechts, M. J., & Breneman, C. (2000). Data strip mining for the virtual design of pharmaceuticals with neural networks. IEEE Transactions on Neural Networks, 11(3), 668–679.
30. Kotthoff, L., Thornton, C., Hoos, H. H., Hutter, F., & Leyton-Brown, K. (2017). Auto-WEKA 2.0: Automatic model selection and hyperparameter optimization in WEKA. Journal of Machine Learning Research, 18(25), 1–5. http://jmlr.org/papers/v18/16-261.html.
31. Lemley, J., Jagodzinski, F., & Andonie, R. (2016). Big holes in big data: A Monte Carlo algorithm for detecting large hyper-rectangles in high dimensional data. In 2016 IEEE 40th annual computer software and applications conference (COMPSAC) (Vol. 1, pp. 563–571). https://doi.org/10.1109/COMPSAC.2016.73.
32. Li, L., Jamieson, K. G., DeSalvo, G., Rostamizadeh, A., & Talwalkar, A. (2016). Efficient hyperparameter optimization and infinitely many armed bandits. CoRR abs/1603.06560. arXiv:1603.06560.
33. McCulloch, W., & Pitts, W. (1943). A logical calculus of the ideas immanent in nervous activity. Bulletin of Mathematical Biophysics, 5, 115–133.
34. Nelder, J. A., & Mead, R. (1965). A simplex method for function minimization. The Computer Journal, 7, 308–313.
35. Olvera-López, J. A., Carrasco-Ochoa, J. A., Martínez-Trinidad, J. F., & Kittler, J. (2010). A review of instance selection methods. Artificial Intelligence Review, 34(2), 133–143.
36. Păun, G. (2000). Computing with membranes. Journal of Computer and System Sciences, 61(1), 108–143.
37. Păun, G., Rozenberg, G., & Salomaa, A. (2010). The Oxford handbook of membrane computing. Oxford: Oxford University Press.
38. Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., et al. (2011). Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12, 2825–2830.
39. Plutowski, M., & White, H. (1993). Selecting concise training sets from clean data. IEEE Transactions on Neural Networks, 4(2), 305–318. https://doi.org/10.1109/72.207618.
40. Provost, F., Jensen, D., & Oates, T. (1999). Efficient progressive sampling. In Proceedings of the fifth ACM SIGKDD international conference on knowledge discovery and data mining, KDD'99 (pp. 23–32). New York, NY, USA: ACM.
41. Raschka, S. (2018). Model evaluation, model selection, and algorithm selection in machine learning. CoRR abs/1811.12808. arXiv:1811.12808.
42. Real, E., Aggarwal, A., Huang, Y., & Le, Q. V. (2018). Regularized evolution for image classifier architecture search. CoRR abs/1802.01548. arXiv:1802.01548.
43. Refaeilzadeh, P., Tang, L., & Liu, H. (2007). On comparison of feature selection algorithms. In AAAI Workshop—technical report (Vol. WS-07-05, pp. 34–39).
44. Rong, H., Wu, T., Pan, L., & Zhang, G. (2018). Spiking neural P systems: Theoretical results and applications. In C. Graciani, A. Riscos-Núñez, G. Păun, G. Rozenberg, & A. Salomaa (Eds.), Enjoying natural computing: Essays dedicated to Mario de Jesús Pérez-Jiménez on the occasion of his 70th birthday (pp. 256–268). Cham: Springer International Publishing. https://doi.org/10.1007/978-3-030-00265-7_20.
45. Shahriari, B., Swersky, K., Wang, Z., Adams, R. P., & de Freitas, N. (2016). Taking the human out of the loop: A review of Bayesian optimization. Proceedings of the IEEE, 104(1), 148–175. https://doi.org/10.1109/JPROC.2015.2494218.
46. Siegelmann, H. T., & Sontag, E. D. (1992). On the computational power of neural nets. In Proceedings of the fifth annual workshop on computational learning theory, COLT'92 (pp. 440–449). New York, NY, USA: ACM. https://doi.org/10.1145/130385.130432.
47. Skalak, D. B. (1994). Prototype and feature selection by sampling and random mutation hill climbing algorithms. In Machine learning: Proceedings of the eleventh international conference (pp. 293–301). Morgan Kaufmann.
48. Snoek, J., Larochelle, H., & Adams, R. P. (2012). Practical Bayesian optimization of machine learning algorithms. In Advances in neural information processing systems 25: 26th annual conference on neural information processing systems 2012. Proceedings of a meeting held December 3–6, 2012, Lake Tahoe, Nevada, United States (pp. 2960–2968).
49. Song, T., Pan, L., Wu, T., Zheng, P., Wong, M. L. D., & Rodríguez-Patón, A. (2019). Spiking neural P systems with learning functions. IEEE Transactions on Nanobioscience, 18(2), 176–190. https://doi.org/10.1109/TNB.2019.2896981.


50. Sun, Y., Gong, H., Li, Y., & Zhang, D. (2019). Hyperparameter importance analysis based on N-RReliefF algorithm. International Journal of Computers Communications & Control, 14(4), 557–573.
51. Sunkad, Z. A., & Soujanya (2016). Feature selection and hyperparameter optimization of SVM for human activity recognition. In 2016 3rd International conference on soft computing & machine intelligence (ISCMI) (pp. 104–109). https://doi.org/10.1109/ISCMI.2016.30.
52. von Neumann, J. (1993). First draft of a report on the EDVAC. IEEE Annals of the History of Computing, 15(4), 27–75. https://doi.org/10.1109/85.238389.
53. Wang, J., & Peng, H. (2013). Adaptive fuzzy spiking neural P systems for fuzzy inference and learning. International Journal of Computer Mathematics, 90(4), 857–868. https://doi.org/10.1080/00207160.2012.743653.
54. Wang, J. J., Hoogeboom, H. J., Pan, L., Păun, G., & Pérez-Jiménez, M. J. (2010). Spiking neural P systems with weights. Neural Computation, 22, 2615–2646.
55. Wang, X., Song, T., Gong, F., & Zheng, P. (2016). On the computational power of spiking neural P systems with self-organization. Scientific Reports, 6, 27624.
56. Westbrook, J., Berman, H. M., Feng, Z., Gilliland, G., Bhat, T. N., Weissig, H., et al. (2000). The protein data bank. Nucleic Acids Research, 28, 235–242.
57. Wu, B., Dai, X., Zhang, P., Wang, Y., Sun, F., Wu, Y., Tian, Y., Vajda, P., Jia, Y., & Keutzer, K. (2018). FBNet: Hardware-aware efficient convnet design via differentiable neural architecture search. CoRR abs/1812.03443. arXiv:1812.03443.
58. Zoph, B., & Le, Q. V. (2016). Neural architecture search with reinforcement learning. CoRR abs/1611.01578. arXiv:1611.01578.

Publisher's Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Răzvan Andonie received the M.S. degree in Mathematics and Computer Science from University of Cluj-Napoca, Romania, and the Ph.D. degree from University of Bucharest, Romania. His Ph.D. advisor was Solomon Marcus, Fellow of the Romanian Academy. He is a professor of Computer Science at Central Washington University and Director of the Computational Science MS program. His current research interests are neural networks, deep learning, machine learning, cognitive computing, parallel/distributed computing, and data science. He is an IEEE and ACM senior member.
