Hyperparameter Optimization of ML Algorithms
Abstract
Machine learning algorithms have been used widely in various applications
and areas. To fit a machine learning model into different problems, its hyper-
parameters must be tuned. Selecting the best hyper-parameter configuration
for machine learning models has a direct impact on the model’s performance.
It often requires deep knowledge of machine learning algorithms and appro-
priate hyper-parameter optimization techniques. Although several automatic
optimization techniques exist, they have different strengths and drawbacks
when applied to different types of problems. In this paper, optimizing the
hyper-parameters of common machine learning models is studied. We in-
troduce several state-of-the-art optimization techniques and discuss how to
apply them to machine learning algorithms. Many available libraries and
frameworks developed for hyper-parameter optimization problems are pro-
vided, and some open challenges of hyper-parameter optimization research
are also discussed in this paper. Moreover, experiments are conducted on
benchmark datasets to compare the performance of different optimization
methods and provide practical examples of hyper-parameter optimization.
This survey paper will help industrial users, data analysts, and researchers
to better develop machine learning models by identifying the proper hyper-
parameter configurations effectively.
Keywords: Hyper-parameter optimization, machine learning, Bayesian
optimization, particle swarm optimization, genetic algorithm, grid search.
(HPO) [9]. The main aim of HPO is to automate the hyper-parameter tuning
process and make it possible for users to apply machine learning models to
practical problems effectively [3]. The optimal model architecture of an ML
model is expected to be obtained after an HPO process. Some important
reasons for applying HPO techniques to ML models are as follows [6]:
proach that involves exhaustively searching a fixed domain of hyper-parameter
values. Random search (RS) [13] is another decision-theoretic
method that randomly selects hyper-parameter combinations in the search
space, given limited execution time and resources. In GS and RS, each hyper-
parameter configuration is treated independently.
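As a minimal sketch of the difference between GS and RS, the following Python code searches a two-dimensional space. The objective `validation_error`, the hyper-parameter names, and the value ranges are hypothetical stand-ins for training and validating a real model; they are not taken from the cited works.

```python
import itertools
import random

def validation_error(learning_rate, depth):
    """Hypothetical stand-in for training a model and returning its validation error."""
    return (learning_rate - 0.1) ** 2 + 0.01 * (depth - 5) ** 2

def grid_search(grid):
    """Evaluate every combination in the Cartesian product of the grid values."""
    names = list(grid)
    best = None
    for values in itertools.product(*(grid[name] for name in names)):
        config = dict(zip(names, values))
        score = validation_error(**config)
        if best is None or score < best[0]:
            best = (score, config)
    return best

def random_search(bounds, n_trials, seed=0):
    """Sample n_trials configurations independently and uniformly within the bounds."""
    rng = random.Random(seed)
    best = None
    for _ in range(n_trials):
        config = {
            "learning_rate": rng.uniform(*bounds["learning_rate"]),
            "depth": rng.randint(*bounds["depth"]),
        }
        score = validation_error(**config)
        if best is None or score < best[0]:
            best = (score, config)
    return best

# Both methods get the same budget of nine evaluations.
gs_score, gs_config = grid_search({"learning_rate": [0.01, 0.1, 1.0], "depth": [3, 5, 7]})
rs_score, rs_config = random_search({"learning_rate": (0.01, 1.0), "depth": (3, 7)}, n_trials=9)
```

In both functions, no evaluation uses the result of any other evaluation, which is what makes these methods easy to parallelize.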
Unlike GS and RS, Bayesian optimization (BO) [14] models determine the
next hyper-parameter value based on the previous results of tested hyper-
parameter values, which avoids many unnecessary evaluations; thus, BO can
detect the best hyper-parameter combination within fewer iterations than GS
and RS. To be applied to different problems, BO can model the distribution
of the objective function using different models as the surrogate function,
including Gaussian process (GP), random forest (RF), and tree-structured
Parzen estimators (TPE) models [15]. BO-RF and BO-TPE can retain the
conditionality of variables [15]. Thus, they can be used to optimize condi-
tional hyper-parameters, like the kernel type and the penalty parameter C in
a support vector machine (SVM). However, since BO models work sequen-
tially to balance the exploration of unexplored areas and the exploitation of
currently-tested regions, it is difficult to parallelize them.
Training an ML model often takes considerable time and space. Multi-fidelity
optimization algorithms have been developed to tackle problems with limited
resources, the most common being bandit-based algorithms.
Hyperband [16] is a popular bandit-based optimization technique that can be
considered an improved version of RS. It generates small versions of datasets
and allocates the same budget to each hyper-parameter combination. In each
iteration of Hyperband, poorly-performing hyper-parameter configurations
are eliminated to save time and resources.
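The elimination step can be illustrated with successive halving, the subroutine that Hyperband builds on. The sketch below is not a full Hyperband implementation: the `evaluate` function, the budget units, and the halving factor `eta` are assumptions chosen for illustration.

```python
def evaluate(config, budget):
    """Hypothetical validation loss; a larger budget gives a tighter estimate."""
    return (config - 0.3) ** 2 + 1.0 / budget

def successive_halving(configs, min_budget=1, eta=2):
    """Repeatedly keep the best 1/eta of the configurations and raise the budget."""
    survivors = list(configs)
    budget = min_budget
    while len(survivors) > 1:
        # Every surviving configuration receives the same budget in this round.
        ranked = sorted(survivors, key=lambda c: evaluate(c, budget))
        survivors = ranked[: max(1, len(ranked) // eta)]
        budget *= eta
    return survivors[0]

candidates = [i / 8 for i in range(8)]   # eight hypothetical configurations
best_candidate = successive_halving(candidates)
```

Only the configurations that survive a round are evaluated at the next, more expensive budget, which is where the savings in time and resources come from.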
Metaheuristic algorithms are a set of techniques used to solve complex,
large-search-space, and non-convex optimization problems, a class to which
HPO problems belong [17]. Among all metaheuristic methods, genetic algorithm
(GA) [18] and particle swarm optimization (PSO) [19] are the two most
prevalent for HPO problems. Genetic algorithms detect well-performing
hyper-parameter combinations in each generation and pass them to the next
generation to identify the best-performing
combination. In PSO algorithms, each particle communicates with other
particles to detect and update the current global optimum in each iteration
until the final optimum is detected. Metaheuristics can efficiently explore the
search space to detect optimal or near-optimal solutions. Hence, they are
particularly suitable for HPO problems with large configuration spaces due
to their high efficiency. For instance, a deep neural network (DNN) often
has a large configuration space with multiple hyper-parameters, including
the activation and optimizer types, the learning rate, and the drop-out rate.
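The particle communication described for PSO can be sketched as follows. This is a minimal textbook-style variant under stated assumptions: the inertia weight `w`, the acceleration coefficients `c1` and `c2`, and the two-dimensional objective `validation_error` are hypothetical choices, not values from the cited works.

```python
import random

def pso(objective, bounds, n_particles=10, n_iters=50, w=0.7, c1=1.5, c2=1.5, seed=0):
    """Minimize objective over box bounds; returns (best position, best value, history)."""
    rng = random.Random(seed)
    pos = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(n_particles)]
    vel = [[0.0] * len(bounds) for _ in range(n_particles)]
    pbest = [p[:] for p in pos]                  # each particle's own best position
    pbest_val = [objective(p) for p in pos]
    g = min(range(n_particles), key=pbest_val.__getitem__)
    gbest, gbest_val = pbest[g][:], pbest_val[g]  # best position shared by the swarm
    history = [gbest_val]
    for _ in range(n_iters):
        for i in range(n_particles):
            for d, (lo, hi) in enumerate(bounds):
                r1, r2 = rng.random(), rng.random()
                # Velocity blends inertia, the particle's memory, and the swarm's memory.
                vel[i][d] = (w * vel[i][d]
                             + c1 * r1 * (pbest[i][d] - pos[i][d])
                             + c2 * r2 * (gbest[d] - pos[i][d]))
                pos[i][d] = min(hi, max(lo, pos[i][d] + vel[i][d]))
            val = objective(pos[i])
            if val < pbest_val[i]:
                pbest[i], pbest_val[i] = pos[i][:], val
            if val < gbest_val:
                gbest, gbest_val = pos[i][:], val
        history.append(gbest_val)
    return gbest, gbest_val, history

def validation_error(config):
    """Hypothetical objective over (learning rate, drop-out rate)."""
    learning_rate, dropout = config
    return (learning_rate - 0.1) ** 2 + (dropout - 0.3) ** 2

best_config, best_error, history = pso(validation_error, bounds=[(0.0, 1.0), (0.0, 0.9)])
```

The shared `gbest` is the only channel through which particles communicate, and `history` records how the current global optimum improves over iterations.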
Although using HPO algorithms to tune the hyper-parameters of ML
models greatly improves the model performance, certain other aspects, like
their computational complexity, still have much room for improvement. On
the other hand, since different HPO models have their own advantages and
suitable problems, overviewing them is necessary for proper optimization
algorithm selection in terms of different types of ML models and problems.
This paper makes the following contributions:
1. It reviews common ML algorithms and their important hyper-parameters.
2. It analyzes common HPO techniques, including their benefits and draw-
backs, to help apply them to different ML models by appropriate algo-
rithm selection in practical problems.
3. It surveys common HPO libraries and frameworks for practical use.
4. It discusses the open challenges and research directions of the HPO
research domain.
In this survey paper, we begin with a comprehensive introduction of the
common optimization techniques used in ML hyper-parameter tuning prob-
lems. Section 2 introduces the main concepts of mathematical optimization
and hyper-parameter optimization, as well as the general HPO process. In
Section 3, we discuss the key hyper-parameters of common ML models that
need to be tuned. Section 4 covers the various state-of-the-art optimization
approaches that have been proposed for tackling HPO problems. In Section
5, we analyze different HPO methods and discuss how they can be applied
to ML algorithms. In Section 6, we provide an introduction to various public
libraries and frameworks that are developed to implement HPO. Section 7
presents and discusses the experimental results of using HPO on benchmark
datasets for HPO method comparison and practical use case demonstration.
In Section 8, we discuss several research directions and open challenges that
should be considered to improve current HPO models or develop new HPO
approaches. We conclude the paper in Section 9.
by an optimization method until the objective function value approaches a
minimum value and the accuracy rate approaches a maximum value [20].
Similarly, hyper-parameter optimization methods aim to optimize the
architecture of an ML model by identifying the optimal hyper-parameter
configurations. In this section, the main concepts of mathematical optimization and
hyper-parameter optimization for machine learning models are discussed.
To conclude, an optimization problem consists of three major compo-
nents: a set of decision variables x, an objective function f (x) to be either
minimized or maximized, and a set of constraints that allow the variables to
take on values in certain ranges (if it is a constrained optimization problem).
Therefore, the goal of optimization tasks is to obtain the set of variable val-
ues that minimize or maximize the objective function while satisfying any
applicable constraints.
Many HPO problems have certain constraints, like the feasible domain
of the number of clusters in k-means, as well as time and space constraints.
Therefore, constrained optimization techniques are widely used in HPO
problems [3].
For optimization problems, in many cases, only a local instead of a global
optimum can be obtained. For example, to obtain the minimum of a problem,
assuming D is the feasible region of a decision variable x, a global minimum
is the point x∗ ∈ D satisfying f(x∗) ≤ f(x) ∀x ∈ D, while a local minimum
is a point x∗ ∈ D in a neighborhood N satisfying f(x∗) ≤ f(x) ∀x ∈ N ∩ D
[21]. Thus, the local optimum may only be an optimum in a small range
instead of being the optimal solution in the entire feasible region.
For convex functions, any local optimum is guaranteed to be a global
optimum [22]. Consequently, a convex function has no local optimum that is
worse than the global one. Therefore, continuing to search along the direction in which the
objective function decreases can detect the global optimal value. A function
f(x) is a convex function if [22], for ∀x1, x2 ∈ X, ∀t ∈ [0, 1],
\[ f(t x_1 + (1 - t) x_2) \le t f(x_1) + (1 - t) f(x_2), \tag{4} \]
where X is the domain of decision variables, and t is a coefficient in the range
of [0,1].
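The inequality in Eq. (4) can be probed numerically. The sketch below samples t on a grid for one pair of points; the helper name `violates_convexity` and the tolerance are illustrative assumptions. A violation proves a function is not convex, whereas passing the check on a few samples does not prove convexity.

```python
import math

def violates_convexity(f, x1, x2, steps=20, tol=1e-12):
    """Return True if f(t*x1 + (1-t)*x2) > t*f(x1) + (1-t)*f(x2) for some sampled t."""
    for k in range(steps + 1):
        t = k / steps
        lhs = f(t * x1 + (1 - t) * x2)
        rhs = t * f(x1) + (1 - t) * f(x2)
        if lhs > rhs + tol:
            return True
    return False

# f(x) = x^2 is convex, so no sampled t violates Eq. (4).
square_violates = violates_convexity(lambda x: x * x, -3.0, 2.0)
# sin(x) on [0, pi] lies above the chord joining its endpoints, so it violates Eq. (4).
sine_violates = violates_convexity(math.sin, 0.0, math.pi)
```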
An optimization problem is a convex optimization problem only when the
objective function f (x) is a convex function and the feasible region C is a
convex set, denoted by [22]:
\[ \min_{x} f(x) \quad \text{subject to} \quad x \in C. \tag{5} \]
On the other hand, nonconvex functions have multiple local optima,
but only one of these optima is the global optimum. Most ML and HPO
problems are nonconvex optimization problems. Thus, inappropriate
optimization methods can often detect only a local instead of a global
optimum.
There are many traditional methods that can be used to solve optimization
problems, including gradient descent, Newton's method, conjugate
gradient, and heuristic optimization methods [20]. Gradient descent is a
commonly used optimization method that uses the negative gradient direction
as the search direction to move towards the optimum. However, gradient
descent is not guaranteed to detect the global optimum unless the objective
function is a convex function. Newton's method uses the inverse of the
Hessian matrix to obtain the optimum. Newton's method converges faster
than gradient descent, but often requires more time and larger
space than gradient descent to store and calculate the Hessian matrix.
Conjugate gradient searches along the conjugate directions constructed from the
gradient of known data points to detect the optimum. Conjugate gradient
converges faster than gradient descent, but its calculation of the
conjugate gradient is more complex. Unlike other traditional methods, heuristic
methods use empirical rules to solve the optimization problems instead of
following systematic steps to obtain the solution. Heuristic methods can
often detect the approximate global optimum within a few iterations, but
are not guaranteed to detect the global optimum [20].
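A small one-dimensional sketch illustrates two of these points: gradient descent ends in whichever local minimum is downhill from its starting point, and Newton's method needs fewer iterations near a minimum. The test function f(x) = x^4 − 2x^2, the step size, and the tolerance are assumptions chosen for illustration.

```python
def grad(x):
    """First derivative of f(x) = x**4 - 2*x**2, which has minima at x = -1 and x = 1."""
    return 4 * x ** 3 - 4 * x

def hess(x):
    """Second derivative of f; in one dimension this plays the role of the Hessian."""
    return 12 * x ** 2 - 4

def gradient_descent(x, lr=0.05, tol=1e-8, max_iter=10_000):
    """Step along the negative gradient until the gradient is close to zero."""
    for step in range(max_iter):
        g = grad(x)
        if abs(g) < tol:
            return x, step
        x -= lr * g
    return x, max_iter

def newton(x, tol=1e-8, max_iter=100):
    """Scale each step by the inverse second derivative."""
    for step in range(max_iter):
        g = grad(x)
        if abs(g) < tol:
            return x, step
        x -= g / hess(x)
    return x, max_iter

# Different starting points lead gradient descent to different local minima.
x_right, _ = gradient_descent(0.5)
x_left, _ = gradient_descent(-0.5)

# From the same start, Newton's method uses fewer iterations than gradient descent.
x_gd, gd_steps = gradient_descent(0.9)
x_newton, newton_steps = newton(0.9)
```

Newton's method is started at 0.9 because the second derivative is positive there; where it is negative, the plain Newton step shown here may move towards a maximum instead.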
the degree of the polynomial kernel function only needs to be tuned when
the kernel type is chosen to be polynomial.
In simple cases, all hyper-parameters can take unrestricted real values,
and the feasible set X of hyper-parameters can be a real-valued n-dimensional
vector space. However, in most cases, the hyper-parameters of an ML model
take on values from different domains and have different constraints,
so their optimization problems are often complex constrained optimization
problems [24]. For instance, the number of considered features in a decision
tree should be in the range of 0 to the number of features, and the number
of clusters in k-means should not be larger than the number of data points.
Additionally, categorical hyper-parameters can often only take a few specific values,
like the limited choices of the activation function and the optimizer of a
neural network. Therefore, the feasible domain of X often has a complex
structure, which increases the problem’s complexity [24].
In general, for a hyper-parameter optimization problem, the aim is to
obtain [19]:
\[ x^{*} = \arg\min_{x \in X} f(x), \tag{6} \]
where f (x) is the objective function to be minimized, such as the error rate or
the root mean squared error (RMSE); x∗ is the hyper-parameter configuration
that produces the optimum value of f (x); and a hyper-parameter x can take
any value in the search space X.
The aim of HPO is to achieve optimal or near-optimal model performance
by tuning hyper-parameters within the given budgets [3]. The mathemati-
cal expression of the function f varies, depending on the objective function
of the chosen ML algorithm and the performance metric function. Model
performance can be evaluated by various metrics, like accuracy, RMSE, F1-
score, and false alarm rate. On the other hand, in practice, time budgets are
an essential constraint for optimizing HPO models and must be considered.
It often requires a massive amount of time to optimize the objective function
of an ML model with a reasonable number of hyper-parameter configurations.
Every time a hyper-parameter value is tested, the entire ML model needs to
be retrained, and the validation set needs to be processed to generate a score
that reflects the model performance.
After selecting an ML algorithm, the main process of HPO is as follows
[10]:
1. Select the objective function and the performance metrics;
2. Select the hyper-parameters that require tuning, summarize their types,
and determine the appropriate optimization technique;
3. Train the ML model using the default hyper-parameter configuration
or common values as the baseline model;
4. Start the optimization process with a large search space as the hyper-
parameter feasible domain determined by manual testing and/or do-
main knowledge;
5. Narrow the search space based on the regions of currently tested,
well-performing hyper-parameter values, or explore new search spaces if
necessary;
6. Return the best-performing hyper-parameter configuration as the final
solution.
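The six steps above can be sketched as a coarse-to-fine search. Everything in the sketch is a hypothetical illustration: `validation_error` stands in for training and validating a real model, the single hyper-parameter `c` and its default value of 1.0 are assumed, and random search is used only because it is the simplest technique to show.

```python
import random

def validation_error(c):
    """Step 1: hypothetical objective, the validation error for hyper-parameter c."""
    return (c - 2.0) ** 2 + 0.5

def random_search(low, high, n_trials, rng):
    """Step 2: the chosen technique; returns the best (error, value) pair found."""
    samples = (rng.uniform(low, high) for _ in range(n_trials))
    return min((validation_error(c), c) for c in samples)

rng = random.Random(0)

# Step 3: baseline model using an assumed default value.
baseline = (validation_error(1.0), 1.0)

# Step 4: start with a large search space.
coarse = random_search(0.0, 10.0, n_trials=20, rng=rng)

# Step 5: narrow the search space around the best value found so far.
width = 1.0
fine = random_search(max(0.0, coarse[1] - width), min(10.0, coarse[1] + width),
                     n_trials=20, rng=rng)

# Step 6: return the best-performing configuration, including the baseline.
best_error, best_c = min(baseline, coarse, fine)
```

Keeping the baseline among the candidates guarantees that the returned configuration is never worse than the default on the validation metric.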
However, most traditional optimization techniques [25] are unsuitable
for HPO, since HPO problems are different from traditional optimization
problems in the following aspects [10]:
1. The optimization target, the objective function of ML models, is usu-
ally a non-convex and non-differentiable function. Therefore, many
traditional optimization methods designed to solve convex or differen-
tiable optimization problems are often unsuitable for HPO problems,
since these methods may return a local optimum instead of a global op-
timum. Additionally, an optimization target lacking smoothness makes
certain traditional derivative-free optimization models perform poorly
for HPO problems [26].
2. The hyper-parameters of ML models include continuous, discrete, cat-
egorical, and conditional hyper-parameters. Thus, many traditional
numerical optimization methods [27] that only aim to tackle numerical
or continuous variables are unsuitable for HPO problems.
3. It is often computationally expensive to train an ML model on a
large-scale dataset. HPO techniques sometimes use data sampling to obtain
approximate values of the objective function. Thus, effective optimiza-
tion techniques for HPO problems should be able to use these approx-
imate values. However, function evaluation time is often ignored in
many black-box optimization (BBO) models, so they often require ex-
act instead of approximate objective function values. Consequently,
many BBO algorithms are often unsuitable for HPO problems with
limited time and resource budgets.
Therefore, appropriate optimization algorithms should be applied to HPO
problems to identify optimal hyper-parameter configurations for ML models.
where N is the number of training data points, xi is the feature vector of
the i-th instance, yi is the corresponding actual output, and L is the cost
function value of each sample.
Many different loss functions exist in supervised learning algorithms, in-
cluding the square of Euclidean distance, cross-entropy, information gain, etc.
[34]. On the other hand, different ML algorithms generate different predic-
tive model architectures based on different hyper-parameter configurations,
which will be discussed in detail in this subsection.
where ‖w‖₂ is the L2-norm of the coefficient vector, and α is the regularization
strength. A larger value of α indicates a larger amount of shrinkage; thus,
the coefficients are also more robust to collinearity.
Lasso regression [38] is another linear model used to estimate sparse
coefficients, consisting of a linear model with an added L1-norm prior as the
regularization term. It aims to minimize the objective function [37]:
\[ \alpha \|w\|_1 + \sum_{i=1}^{p} (y_i - w_i \cdot x_i)^2, \tag{10} \]
where α is a constant and ‖w‖₁ is the L1-norm of the coefficient vector.
Therefore, the regularization strength α is a crucial hyper-parameter of
both ridge and lasso regression models.
Logistic regression (LR) [39] is a linear model used for classification
problems. The cost function of LR depends on the regularization method
chosen for the penalization. There are three main types of
regularization methods in LR: L1-norm, L2-norm, and elastic-net
regularization [40].
Therefore, the first hyper-parameter that needs to be tuned in LR is
the regularization method used in the penalization, ’l1’, ’l2’, ’elasticnet’ or
’none’, which is called ’penalty’ in sklearn. The coefficient, ’C’, is another
essential hyper-parameter that determines the regularization strength of the
model. In addition, the ’solver’ type, representing the optimization algorithm
type, can be set to ’newton-cg’, ’lbfgs’, ’liblinear’, ’sag’, or ’saga’ in LR. The
valid choices of ’penalty’ and ’C’ depend on the ’solver’ type, so they are conditional
hyper-parameters.
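One way to handle this conditionality is to sample the ’solver’ first and then draw ’penalty’ only from the values that solver accepts. The sketch below does not call sklearn; the compatibility table is an assumption based on the scikit-learn LogisticRegression documentation for releases that accept these string values, and newer releases may differ. The range for ’C’ and the ’l1_ratio’ entry for elastic-net are likewise illustrative.

```python
import random

# Assumed solver-to-penalty compatibility, following the scikit-learn documentation
# for versions that accept these string values; check the installed version.
SOLVER_PENALTIES = {
    "newton-cg": ["l2", "none"],
    "lbfgs": ["l2", "none"],
    "liblinear": ["l1", "l2"],
    "sag": ["l2", "none"],
    "saga": ["l1", "l2", "elasticnet", "none"],
}

def sample_lr_config(rng):
    """Draw one valid configuration from the conditional search space."""
    solver = rng.choice(sorted(SOLVER_PENALTIES))
    penalty = rng.choice(SOLVER_PENALTIES[solver])
    config = {"solver": solver, "penalty": penalty}
    if penalty != "none":
        # 'C' only matters when some regularization is applied; sample it log-uniformly.
        config["C"] = 10 ** rng.uniform(-3, 3)
    if penalty == "elasticnet":
        # The elastic-net mixing ratio is only active for this penalty.
        config["l1_ratio"] = rng.uniform(0.0, 1.0)
    return config

rng = random.Random(0)
configs = [sample_lr_config(rng) for _ in range(5)]
```

Sampling in this order never produces an invalid combination, so no evaluation budget is spent on configurations the model would reject.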
3.1.2. KNN
K-nearest neighbor (KNN) is a simple ML algorithm that is used to clas-
sify data points by calculating the distances between different data points.
In KNN, the predicted class of each test sample is set to the class to which
most of its k-nearest neighbors in the training set belong.
Assuming the training set T = {(x1 , y1 ), (x2 , y2 ), · · · , (xn , yn )}, xi is the
feature vector of an instance, and yi ∈ {c1 , c2 , · · · , cm } is the class of the
instance, i = (1, 2, · · · n), for a test instance x, its class y can be denoted by
[41]:
\[ y = \arg\max_{c_j} \sum_{x_i \in N_k(x)} I(y_i = c_j), \quad i = 1, 2, \cdots, n; \; j = 1, 2, \cdots, m, \tag{11} \]
The distance metric and the power parameter of the Minkowski metric can
also be tuned, since doing so can yield minor improvements. Lastly, the ’algorithm’
used to compute the nearest neighbors can also be chosen from a ball tree, a
k-dimensional (KD) tree, or a brute force search. Typically, the model can
determine the most appropriate algorithm itself by setting the ’algorithm’ to
’auto’ in sklearn [31].
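The majority vote in Eq. (11) and the role of the Minkowski power parameter can be sketched with a brute force search. The tiny dataset and the function names are hypothetical, and the code makes no attempt to reproduce the tree-based neighbor search that sklearn can select.

```python
from collections import Counter

def minkowski(a, b, p=2):
    """Minkowski distance; p=1 is the Manhattan distance and p=2 the Euclidean distance."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)

def knn_predict(train, x, k=3, p=2):
    """Return the most frequent class among the k nearest training points (Eq. 11)."""
    neighbors = sorted(train, key=lambda item: minkowski(item[0], x, p))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Hypothetical training set of (feature vector, class) pairs.
train = [
    ((0.0, 0.0), "a"), ((0.1, 0.2), "a"),
    ((1.0, 1.0), "b"), ((0.9, 1.1), "b"), ((1.2, 0.8), "b"),
]
pred_near_origin = knn_predict(train, (0.2, 0.1), k=3)
pred_near_one = knn_predict(train, (1.0, 0.9), k=3)
```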
3.1.3. SVM
A support vector machine (SVM) [43] is a supervised learning algorithm
that can be used for both classification and regression problems. SVM algo-
rithms are based on the concept of mapping data points from low-dimensional
into high-dimensional space to make them linearly separable; a hyperplane
is then generated as the classification boundary to partition data points [44].
Assuming there are n data points, the objective function of SVM is [45]:
\[ \arg\min_{w} \left\{ \frac{1}{n} \sum_{i=1}^{n} \max\{0, 1 - y_i f(x_i)\} + C w^{T} w \right\}, \tag{12} \]
As shown in the kernel function equations, a few other
hyper-parameters need to be tuned after a kernel type is chosen. The
hyper-parameter γ, denoted by ’gamma’ in sklearn, is a conditional hyper-parameter
of the ’kernel type’ hyper-parameter when the kernel is set to polynomial, RBF, or
sigmoid; r, specified by ’coef0’ in sklearn, is a conditional hyper-parameter
of polynomial and sigmoid kernels. Moreover, the polynomial kernel has
an additional conditional hyper-parameter d representing the ’degree’ of the
polynomial kernel function. In support vector regression (SVR) models, there
is another hyper-parameter, ’epsilon’, indicating the distance error of its
loss function [31].
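The dependence of each conditional hyper-parameter on the kernel type can be made explicit in code. The kernel forms below are the common ones documented by scikit-learn and are assumed to match the kernel equations referred to above; the dictionary `ACTIVE_HYPERPARAMETERS` is a hypothetical helper for building a conditional search space.

```python
import math

def dot(x, z):
    return sum(a * b for a, b in zip(x, z))

def linear_kernel(x, z):
    return dot(x, z)

def polynomial_kernel(x, z, gamma, coef0, degree):
    """Uses gamma, r (coef0), and d (degree)."""
    return (gamma * dot(x, z) + coef0) ** degree

def rbf_kernel(x, z, gamma):
    """Uses gamma only."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, z)))

def sigmoid_kernel(x, z, gamma, coef0):
    """Uses gamma and r (coef0)."""
    return math.tanh(gamma * dot(x, z) + coef0)

# Hyper-parameters that only need tuning once the given kernel type is chosen.
ACTIVE_HYPERPARAMETERS = {
    "linear": [],
    "poly": ["gamma", "coef0", "degree"],
    "rbf": ["gamma"],
    "sigmoid": ["gamma", "coef0"],
}

poly_value = polynomial_kernel((1.0, 2.0), (3.0, 4.0), gamma=0.5, coef0=1.0, degree=2)
rbf_value = rbf_kernel((1.0, 2.0), (1.0, 2.0), gamma=0.5)
```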
where P(y) is the probability of a value y, and P(xi|y) is the conditional
probability (likelihood) of xi given the value of y. Depending on the assumptions about
the distribution of P(xi|y), there are different types of naive Bayes
classifiers. The four main types of NB models are: Bernoulli NB, Gaussian NB,
multinomial NB, and complement NB [48].
For Gaussian NB [49], the likelihood of features is assumed to follow a
Gaussian distribution:
\[ P(x_i \mid y) = \frac{1}{\sqrt{2\pi\sigma_y^2}} \exp\left( -\frac{(x_i - \mu_y)^2}{2\sigma_y^2} \right). \tag{18} \]
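Eq. (18) translates directly into code. In this sketch the per-class mean and variance are estimated from a handful of hypothetical feature values; the function names are illustrative and are not sklearn's API.

```python
import math

def gaussian_likelihood(x, mean, var):
    """Evaluate Eq. (18) for a feature value x, class mean, and class variance."""
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)

def fit_mean_var(values):
    """Estimate the mean and the (population) variance of one feature within one class."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, var

# Hypothetical values of one feature for the samples of a single class y.
mu_y, var_y = fit_mean_var([4.0, 5.0, 6.0])
likelihood_at_mean = gaussian_likelihood(5.0, mu_y, var_y)
likelihood_far_away = gaussian_likelihood(9.0, mu_y, var_y)
```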
belonging to the class y. Based on the concept of relative frequency counting,
θy can be estimated by a smoothed version of θyi [31]:
\[ \hat{\theta}_{yi} = \frac{N_{yi} + \alpha}{N_y + \alpha n}, \tag{19} \]
where Nyi is the number of times when feature i is in a data point belonging
to class y, and Ny is the sum of all Nyi (i = 0, 1, 2, · · · , n). The smoothing
priors α ≥ 0 are used for features that are not in the learning samples. When
α = 1, it is called Laplace smoothing; when α < 1, it is called Lidstone
smoothing.
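A short sketch of Eq. (19) shows the effect of the smoothing parameter. The feature counts are hypothetical, and n is taken to be the number of features, as in the equation.

```python
def smoothed_theta(feature_counts, alpha=1.0):
    """Apply Eq. (19): (N_yi + alpha) / (N_y + alpha * n) for every feature i of a class y."""
    n = len(feature_counts)
    n_y = sum(feature_counts)
    return [(n_yi + alpha) / (n_y + alpha * n) for n_yi in feature_counts]

counts = [3, 0, 1]                               # the second feature never occurs in class y
unsmoothed = smoothed_theta(counts, alpha=0.0)   # gives that feature probability zero
laplace = smoothed_theta(counts, alpha=1.0)      # Laplace smoothing
lidstone = smoothed_theta(counts, alpha=0.5)     # Lidstone smoothing
```

With α = 0 the unseen feature receives probability zero, which would zero out the whole product of likelihoods; any α > 0 avoids this while still producing estimates that sum to one.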
Complement NB [51] is an improved version of the standard multino-
mial NB algorithm and is suitable for processing imbalanced data, while
Bernoulli NB [52] requires samples to have binary-valued feature vectors so
that the data can follow multivariate Bernoulli distributions. They both
have the additive (Laplace/Lidstone) smoothing parameter, α, as the main
hyper-parameter that needs tuning. To conclude, for naive Bayes algorithms,
users often do not need to tune hyper-parameters or only need to tune the
smoothing parameter α, which is a continuous hyper-parameter.
the best split, ’max_features’, can also be tuned as a feature selection
process. Moreover, there are several discrete hyper-parameters related to the
splitting process, which need to be tuned to achieve better performance: the
minimum number of data points to split a decision node or to obtain a leaf
node, denoted by ’min_samples_split’ and ’min_samples_leaf’, respectively;
the ’max_leaf_nodes’, indicating the maximum number of leaf nodes, and the
’min_weight_fraction_leaf’ that means the minimum weighted fraction of the
total weights, can also be tuned to improve model performance [31] [56].
Based on the concept of DT models, many decision-tree-based ensemble
algorithms have been proposed to improve model performance by combin-
ing multiple decision trees, including random forest (RF), extra trees (ET),
and extreme gradient boosting (XGBoost) models. RF [57] is an ensemble
learning method that uses the bagging method to combine multiple decision
trees. In RF, basic DTs are built on many randomly-generated subsets, and
the class receiving the majority of votes is selected as the final classification
result [58]. ET [59] is another tree-based ensemble learning method that is
similar to RF, but it uses all samples to build DTs and randomly selects the
feature sets. In addition, RF optimizes splits on DTs while ET randomly
makes the splits. XGBoost [32] is a popular tree-based ensemble model de-
signed for speed and performance improvement, which uses the boosting and
gradient descent methods to combine basic DTs. In XGBoost, the next input
sample of a new DT will be related to the results of previous DTs. XGBoost
aims to minimize the following objective function [55]:
\[ Obj = -\frac{1}{2} \sum_{j=1}^{t} \frac{G_j^2}{H_j + \lambda} + \gamma t, \tag{20} \]
where t is the number of leaves in a decision tree, G and H are the sums of
the first and second order gradient statistics of the cost function, and γ and λ
are the penalty coefficients.
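Eq. (20) can be evaluated for a given tree once the per-leaf gradient sums are known. The sketch below uses hypothetical values of G and H for a tree with two leaves; it only evaluates the formula and is not part of the XGBoost library.

```python
def xgboost_objective(G, H, lam=1.0, gamma=0.0):
    """Evaluate Eq. (20); G[j] and H[j] are the gradient sums of leaf j."""
    t = len(G)  # number of leaves
    return -0.5 * sum(g * g / (h + lam) for g, h in zip(G, H)) + gamma * t

# Hypothetical first and second order gradient sums for two leaves.
G = [2.0, -4.0]
H = [1.0, 3.0]
obj_default = xgboost_objective(G, H, lam=1.0, gamma=0.5)
obj_stronger_penalty = xgboost_objective(G, H, lam=10.0, gamma=2.0)
```

Larger values of λ shrink the contribution of each leaf and larger values of γ charge more per leaf, so both make complex trees less attractive.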
Since tree-based ensemble models are built with decision trees as base
learners, they have the same hyper-parameters as DT models, described in
this subsection. Apart from these hyper-parameters, RF, ET, and XGBoost
all have another crucial hyper-parameter to be tuned, which is the number of
decision trees to be combined, denoted by ’n_estimators’ in sklearn. XGBoost
has several additional hyper-parameters, including [60]: ’min_child_weight’,
which means the minimum sum of weights in a child node; ’subsample’ and
’colsample_bytree’, used to control the subsampling ratio of instances and
features, respectively; and four continuous hyper-parameters, ’gamma’,
’alpha’, ’lambda’, and ’learning_rate’, indicating the minimum loss reduction
for a split, the L1 and L2 regularization terms on weights, and the learning rate,
respectively.
3.1.7. Deep Learning Models
Deep learning (DL) algorithms are widely applied to various areas like
computer vision, natural language processing, and machine translation since
they have had great success solving many types of problems. DL models are
based on the theory of artificial neural networks (ANNs). Common types
of DL architectures include deep neural networks (DNNs), feedforward neu-
ral networks (FFNNs), deep belief networks (DBNs), convolutional neural
networks (CNNs), recurrent neural networks (RNNs) and many more [63].
All these DL models have similar hyper-parameters since they have similar
underlying neural network models. Compared with other ML models, DL
models benefit more from HPO since they often have many hyper-parameters
that require tuning.
The first set of hyper-parameters is related to the construction of the
structure of a DL model; hence, they are named model design hyper-parameters.
Since all neural network models have an input layer and an output layer,
the complexity of a deep learning model mainly depends on the number of
hidden layers and the number of neurons of each layer, which are two main
hyper-parameters to build DL models [64]. These two hyper-parameters
are set and tuned according to the complexity of the datasets or the prob-
lems. DL models need to have enough capacity to model objective func-
tions (or prediction tasks) while avoiding over-fitting. At the next stage,
certain function types need to be set or tuned. The first function type to
configure is the loss function type, which is chosen mainly based on the
problem type (e.g., binary cross-entropy for binary classification problems,
multi-class cross-entropy for multi-class classification problems, and RMSE for
regression problems). Another important hyper-parameter is the activation
function type used to model non-linear functions, which can be set to ’softmax’,
’rectified linear unit (ReLU)’, ’sigmoid’, ’tanh’, or ’softsign’. Lastly, the
optimizer type can be set to stochastic gradient descent (SGD), adaptive moment
estimation (Adam), root mean square propagation (RMSprop), etc. [65].
On the other hand, some other hyper-parameters are related to the
optimization and training process of DL models; hence, they are categorized as optimizer
hyper-parameters. The learning rate is one of the most important
hyper-parameters in DL models [66]. It determines the step size at each iteration,
which enables the objective function to converge. A large learning rate speeds
up the learning process, but the gradient may oscillate around the local
minimum value or even fail to converge. On the other hand, a small learning rate
converges smoothly, but will largely increase model training time by
requiring more training epochs. An appropriate learning rate enables the objective
function to converge to a global minimum in a reasonable amount of time.
Another common hyper-parameter is the drop-out rate. Drop-out is a
standard regularization method for DL models proposed to reduce over-fitting.
In drop-out, a proportion of neurons are randomly selected and removed, and
the percentage of neurons to be removed should be tuned.
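The effect of the learning rate can be seen on the simplest possible objective, f(x) = x^2, whose gradient is 2x. The three learning rates and the number of steps below are arbitrary choices for illustration, not recommended values for DL models.

```python
def run_gradient_descent(lr, x0=1.0, steps=50):
    """Minimize f(x) = x**2 from x0 with a fixed learning rate; returns the final x."""
    x = x0
    for _ in range(steps):
        x -= lr * 2 * x   # the gradient of x**2 is 2x
    return x

too_small = abs(run_gradient_descent(0.001))  # still far from the minimum at 0
moderate = abs(run_gradient_descent(0.1))     # close to the minimum
too_large = abs(run_gradient_descent(1.1))    # overshoots further on every step
```

Each update multiplies x by (1 − 2·lr), so the iterate barely moves when lr is tiny and grows without bound, with alternating sign, once that factor exceeds one in magnitude.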
Mini-batch size and the number of epochs are the other two DL hyper-
parameters that represent the number of processed samples before updating
the model, and the number of complete passes through the entire training
set, respectively [67]. Mini-batch size is affected by the resource requirements
of the training process, speed, and the number of iterations. The number
of epochs depends on the size of the training set and should be tuned by
slowly increasing its value until validation accuracy starts to decrease, which
indicates over-fitting. On the other hand, DL models often converge within
a few epochs, and the following epochs may lead to unnecessary additional
execution time and over-fitting, which can be avoided by the early stopping
method. Early stopping is a form of regularization whereby model training
stops in advance when validation accuracy does not increase after a certain
number of consecutive epochs. The number of waiting epochs, called early
stop patience, can also be tuned to reduce model training time.
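Early stopping with a patience value can be sketched independently of any DL framework. The list of validation accuracies below is hypothetical and stands in for the scores a real training loop would produce after each epoch.

```python
def train_with_early_stopping(val_accuracies, patience=3):
    """Return (best epoch, best accuracy, last epoch run) under early stopping."""
    best_acc = float("-inf")
    best_epoch = -1
    waited = 0
    for epoch, acc in enumerate(val_accuracies):
        if acc > best_acc:
            best_acc, best_epoch, waited = acc, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                # No improvement for `patience` consecutive epochs: stop training.
                return best_epoch, best_acc, epoch
    return best_epoch, best_acc, len(val_accuracies) - 1

# Hypothetical validation accuracy after each epoch.
history = [0.60, 0.70, 0.75, 0.74, 0.75, 0.73, 0.72, 0.80]
short_patience = train_with_early_stopping(history, patience=3)
long_patience = train_with_early_stopping(history, patience=5)
```

In this example a patience of 3 stops after epoch 5 and misses the later improvement, while a patience of 5 trains through all epochs, which is the trade-off between training time and final accuracy that tuning the patience controls.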
Apart from traditional DL models, transfer learning (TL) is a technology
that obtains a pre-trained model on the data in a related domain and transfers
it to other target tasks [68]. To transfer a DL model from one problem to
another problem, a certain number of top layers are frozen, and only the
remaining layers are retrained to fit the new problem. Therefore, the number
of frozen layers is a vital hyper-parameter to tune if TL is used.
3.2.1. Clustering Algorithms
In most clustering algorithms, including k-means, EM, and hierarchical
clustering, the number of clusters is the most important hyper-parameter to
tune [69].
The k-means algorithm [70] uses k prototypes, indicating the centroids
of clusters, to cluster data. In k-means algorithms, the number of clusters,
’n_clusters’, must be specified, and is determined by minimizing the sum of
squared errors [70]:
\[ \sum_{i=0}^{n_k} \min_{u_j \in C_k} (x_i - u_j)^2, \tag{21} \]
where (x1 , · · · , xn ) is the data matrix; uj , also called the centroid of the
cluster Ck , is the mean of the samples in the cluster; and nk is the number
of sample points in the cluster Ck .
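The sum of squared errors in Eq. (21) can be computed for a fixed set of centroids as follows. The points and centroids are hypothetical, and the centroids are given by hand rather than fitted, so this sketch evaluates the criterion only and does not run k-means itself.

```python
def sum_squared_errors(points, centroids):
    """Sum, over all points, of the squared distance to the nearest centroid (Eq. 21)."""
    return sum(
        min(sum((a - b) ** 2 for a, b in zip(point, centroid)) for centroid in centroids)
        for point in points
    )

# Two well-separated hypothetical groups of points.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
sse_one_cluster = sum_squared_errors(points, [(5.0, 1.0)])
sse_two_clusters = sum_squared_errors(points, [(0.0, 1.0), (10.0, 1.0)])
```

Comparing this value across candidate numbers of clusters is one common way to judge how many clusters the data supports.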
To tune k-means, ’n_clusters’ is the most crucial hyper-parameter.
Besides this, the method for centroid initialization, ’init’, could be set to
’k-means++’, ’random’ or a human-defined array, which slightly affects model
performance. In addition, ’n_init’, denoting the number of times that the
k-means algorithm will be executed with different centroid seeds, and the
’max_iter’, the maximum number of iterations in a single execution of
k-means, also have slight impacts on model performance [31].
The expectation-maximization (EM) algorithm [71] is an iterative al-
gorithm used to detect the maximum likelihood estimation of parameters.
The Gaussian mixture model is a clustering method that uses a mixture of
Gaussian distributions to model data by implementing the EM method.
Similar to k-means, its major hyper-parameter to be tuned is ’n_components’,
indicating the number of clusters or Gaussian distributions. Additionally,
different methods can be chosen to constrain the covariance of the estimated
classes in Gaussian mixture models, including ’full covariance’, ’tied’,
’diagonal’ or ’spherical’ [72]. Other hyper-parameters could also be tuned, including
’max_iter’ and ’tol’, representing the number of EM iterations to perform and
the convergence threshold, respectively [31].
Hierarchical clustering [73] methods build clusters by continuously merging
or splitting existing clusters. The hierarchy of clusters is represented
by a tree-structure; its root indicates the unique cluster gathering all samples,
and its leaves represent the clusters with only one sample [73]. In sklearn, the
class ’AgglomerativeClustering’ implements a common type of hierarchical
clustering. In agglomerative clustering, the linkage criterion, ’linkage’, determines
the distance between sets of observations and can be set to ’ward’,
’complete’, ’average’, or ’single’, indicating whether to minimize the variance of
all clusters, or use the maximum, average, or minimum distance between
every two clusters, respectively. Like other clustering methods, its main
hyper-parameter is the number of clusters, ’n_clusters’. However, ’n_clusters’
cannot be set if we choose to set the ’distance_threshold’, the linkage distance
threshold for merging clusters, since if so, ’n_clusters’ will be determined
automatically.
DBSCAN [74] is a density-based clustering method that determines the
clusters by dividing data into clusters with sufficiently high density. Unlike
other clustering models, the number of clusters does not need to be configured
before training. Instead, DBSCAN has two significant conditional
hyper-parameters: the scan radius, represented by ’eps’, and the minimum number
of considered neighbor points, represented by ’min_samples’, which together
define the cluster density [75]. DBSCAN works by starting with an unvisited
point and detecting all its neighbor points within a pre-defined distance ’eps’.
If the number of neighbor points reaches the value of ’min_samples’, this
unvisited point and all its neighbors are defined as a cluster. The procedure is
executed recursively until all data points are visited. A higher ’min_samples’
or a lower ’eps’ means that a higher density is required to form a cluster.
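The effect of the two hyper-parameters on the required density can be illustrated by counting how many points DBSCAN leaves unclustered (label -1 in sklearn). The dataset and the tested values below are assumptions for illustration only.

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Synthetic two-moons data (illustrative assumption).
X, _ = make_moons(n_samples=300, noise=0.08, random_state=0)

def noise_count(eps, min_samples):
    """Number of points that DBSCAN labels as noise (label -1)."""
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X)
    return int((labels == -1).sum())

# A larger radius lowers the density requirement, so noise cannot increase.
by_eps = [noise_count(eps, 5) for eps in (0.05, 0.10, 0.20, 0.30)]
# A larger neighbor count raises the density requirement, so noise cannot decrease.
by_min_samples = [noise_count(0.15, m) for m in (3, 5, 10, 20)]

print(by_eps, by_min_samples)
```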
data matrix can be transformed into a new space with reduced dimension-
ality. Singular value decomposition (SVD) [77] is a popular method used
to obtain the eigenvalues and eigenvectors of the covariance matrix of PCA.
Therefore, in addition to ’n_components’, the SVD solver type is another
hyper-parameter of PCA to be tuned, which can be assigned to ’auto’, ’full’,
’arpack’ or ’randomized’ [31].
Linear discriminant analysis (LDA) [78] is another common dimensionality
reduction method that projects the features onto the most discriminative
directions. Unlike PCA, which obtains the direction with the largest variance
as the principal component, LDA optimizes the feature subspace for
classification. The objective of LDA is to minimize the variance inside each
class and maximize the variance between different classes after projection.
Thus, the projection points in each class should be as close as possible, and
the distance between the center points of different classes should be as large
as possible. Similar to PCA, the number of features to be extracted,
’n_components’, should be tuned in LDA models. Additionally, the solver type of
LDA can also be set to ’svd’ for SVD, ’lsqr’ for least-squares solution, or
’eigen’ for eigenvalue decomposition [79]. LDA also has a conditional
hyper-parameter, the shrinkage parameter, ’shrinkage’, which can be set to a
float value along with ’lsqr’ and ’eigen’ solvers.
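A conditional hyper-parameter such as ’shrinkage’ can be expressed in sklearn by passing a list of grids, so that it is only combined with the solvers that support it. The following is a minimal sketch; the Iris dataset and the candidate shrinkage values are assumptions for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import GridSearchCV

X, y = load_iris(return_X_y=True)

# 'shrinkage' is only searched together with the 'lsqr' and 'eigen' solvers.
param_grid = [
    {"solver": ["svd"]},
    {"solver": ["lsqr", "eigen"], "shrinkage": [None, "auto", 0.1, 0.5]},
]

search = GridSearchCV(LinearDiscriminantAnalysis(), param_grid,
                      cv=3, scoring="accuracy").fit(X, y)
print(search.best_params_, search.best_score_)
```

The first grid contributes one configuration and the second contributes eight, so nine configurations are evaluated in total.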
inspired increased research into techniques for the automatic optimization of
hyper-parameters [80].
number of parameter combinations from the specified distribution, which
improves system efficiency by reducing the probability of wasting much time
on an unimportant small search space. Since the number of total evaluations
in RS is set to a certain value n before the optimization process starts, the
computational complexity of RS is O(n) [84]. In addition, RS can detect the
global optimum or a near-global optimum when given a sufficient budget [6].
Although RS is more efficient than GS for large search spaces, there are
still a large number of unnecessary function evaluations since it does not
exploit the previously well-performing regions [2].
To conclude, the main limitation of both RS and GS is that every
evaluation in their iterations is independent of previous evaluations; thus,
they waste a great deal of time evaluating poorly-performing areas of the
search space. This issue can be addressed by other optimization methods, such
as Bayesian optimization, which uses previous evaluation records to determine
the next evaluation [14].
for these ML algorithms to identify global optimums using gradient-based
optimization techniques.
models: BO-GP, BO-RF, BO-TPE. An alternative name for BO-RF is se-
quential model-based algorithm configuration (SMAC) [90].
4.3.1. BO-GP
Gaussian process (GP) is a standard surrogate model for objective
function modeling in BO [87]. Assuming that the function f with a mean $\mu$
and a covariance $\sigma^2$ is a realization of a GP, the predictions follow a
normal distribution [91]:
$$p(y \mid x, D) = \mathcal{N}\left(y \mid \hat{\mu}, \hat{\sigma}^2\right), \tag{22}$$
where D is the configuration space of hyper-parameters, and y = f(x) is
the evaluation result of each hyper-parameter value x. After obtaining a
set of predictions, the points to be evaluated next are then selected from the
confidence intervals generated by the BO-GP model. Each newly-tested data
point is added to the sample records, and the BO-GP model is re-built with
the new information. This procedure is repeated until termination.
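The loop described above can be sketched with sklearn's ’GaussianProcessRegressor’ as the surrogate. The toy objective, the Matern kernel, the candidate grid, and the lower-confidence-bound rule (mean minus two standard deviations) used to pick the next point are all assumptions for illustration, not the configuration of any specific BO library.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

def objective(x):
    """Hypothetical toy objective to minimize; its optimum is at x = 2."""
    return (x - 2.0) ** 2

rng = np.random.default_rng(0)
candidates = np.linspace(-5.0, 5.0, 201).reshape(-1, 1)   # search space grid
X_obs = rng.uniform(-5.0, 5.0, size=(3, 1))               # initial evaluations
y_obs = objective(X_obs).ravel()

for _ in range(12):
    # Re-build the surrogate with all records collected so far.
    gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True,
                                  alpha=1e-4).fit(X_obs, y_obs)
    mu, sigma = gp.predict(candidates, return_std=True)
    lcb = mu - 2.0 * sigma                                 # lower confidence bound
    x_next = candidates[np.argmin(lcb)].reshape(1, -1)     # most promising point
    X_obs = np.vstack([X_obs, x_next])
    y_obs = np.append(y_obs, objective(x_next).ravel())

best_x = X_obs[np.argmin(y_obs), 0]
print(best_x)
```

Each iteration refits the GP on all past evaluations, which is where the cubic cost in the number of evaluations comes from.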
Applying BO-GP to a dataset of size n has a time complexity of O(n^3)
and a space complexity of O(n^2) [92]. One main limitation of BO-GP is that
its cubic complexity in the number of instances limits the capacity for
parallelization [3]. Additionally, it is mainly used to optimize continuous
variables.
4.3.2. SMAC
Random forest (RF) is another popular surrogate function for BO to
model the objective function using an ensemble of regression trees. BO using
RF as the surrogate model is also called SMAC [90].
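Assuming that there is a Gaussian model $\mathcal{N}(y \mid \hat{\mu}, \hat{\sigma}^2)$, and $\hat{\mu}$ and $\hat{\sigma}^2$ are
the mean and variance of the regression function r(x), respectively, then [90]:
$$\hat{\mu} = \frac{1}{|B|} \sum_{r \in B} r(x), \tag{23}$$
$$\hat{\sigma}^2 = \frac{1}{|B| - 1} \sum_{r \in B} \left(r(x) - \hat{\mu}\right)^2, \tag{24}$$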
3. To maintain a low computational cost, both the minimum number of
instances considered for further split and the number of trees to grow
are set to a certain value.
4. Finally, the mean and variance for each new configuration are estimated
by RF.
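The mean and variance estimates of Eqs. (23) and (24) can be sketched with sklearn's ’RandomForestRegressor’ by collecting the prediction of every tree. This is only an illustration of the two formulas, not the SMAC implementation; the evaluation history and the forest settings are hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
# Hypothetical history: 40 evaluated configurations of two hyper-parameters.
X_hist = rng.uniform(0.0, 1.0, size=(40, 2))
y_hist = (X_hist[:, 0] - 0.3) ** 2 + (X_hist[:, 1] - 0.7) ** 2

# The number of trees and the minimum split size are fixed to keep cost low.
rf = RandomForestRegressor(n_estimators=30, min_samples_split=3,
                           random_state=0).fit(X_hist, y_hist)

X_new = rng.uniform(0.0, 1.0, size=(5, 2))                 # new configurations
per_tree = np.stack([tree.predict(X_new) for tree in rf.estimators_])
mu_hat = per_tree.mean(axis=0)                             # Eq. (23)
var_hat = per_tree.var(axis=0, ddof=1)                     # Eq. (24), divides by |B| - 1

print(mu_hat, var_hat)
```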
Compared with BO-GP, the main advantage of SMAC is its support for all
types of variables, including continuous, discrete, categorical, and conditional
hyper-parameters [91]. The time complexities of using SMAC to fit and
predict variances are O(n log n) and O(log n), respectively, which are much
lower than the complexities of BO-GP [3].
4.3.3. BO-TPE
Tree-structured Parzen estimator (TPE) [12] is another common surro-
gate model for BO. Instead of defining a predictive distribution used in BO-
GP, BO-TPE creates two density functions, l(x) and g(x), to act as the
generative models for all domain variables [3]. To apply TPE, the observa-
tion results are divided into good results and poor results by a pre-defined
percentile $y^*$, and the two sets of results are modeled by simple Parzen
windows [12]:
$$p(x \mid y, D) = \begin{cases} l(x), & \text{if } y < y^* \\ g(x), & \text{if } y > y^* \end{cases}. \tag{25}$$
After that, the expected improvement in the acquisition function is re-
flected by the ratio between the two density functions, which is used to
determine the new configurations for evaluation. The Parzen estimators are
organized in a tree structure, so the specified conditional dependencies are
retained. Therefore, TPE naturally supports specified conditional
hyper-parameters [91]. The time complexity of BO-TPE is O(n log n), which is
lower than the complexity of BO-GP [3].
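The idea behind Eq. (25) can be sketched for a single continuous hyper-parameter using SciPy's ’gaussian_kde’ as the Parzen estimator. The toy objective, the 25% threshold, and the way candidates are drawn are assumptions for illustration; the tree structure for conditional hyper-parameters is omitted, and this is not the Hyperopt implementation.

```python
import numpy as np
from scipy.stats import gaussian_kde

def objective(x):
    """Hypothetical toy objective to minimize; its optimum is at x = 2."""
    return (x - 2.0) ** 2

def tpe_suggest(xs, ys, rng, gamma=0.25, n_candidates=64, low=-5.0, high=5.0):
    y_star = np.quantile(ys, gamma)                  # threshold between good and poor
    good, poor = xs[ys < y_star], xs[ys >= y_star]
    l, g = gaussian_kde(good), gaussian_kde(poor)    # density estimates l(x) and g(x)
    # Draw candidates from l(x): pick a good point and add kernel-width noise.
    width = np.sqrt(l.covariance[0, 0])
    cand = rng.choice(good, size=n_candidates) + rng.normal(0.0, width, n_candidates)
    cand = np.clip(cand, low, high)
    ratio = l(cand) / np.maximum(g(cand), 1e-12)     # larger ratio is more promising
    return cand[np.argmax(ratio)]

rng = np.random.default_rng(0)
xs = rng.uniform(-5.0, 5.0, size=10)                 # initial random evaluations
ys = objective(xs)
for _ in range(30):
    x_next = tpe_suggest(xs, ys, rng)
    xs = np.append(xs, x_next)
    ys = np.append(ys, objective(x_next))

best_x = xs[np.argmin(ys)]
print(best_x)
```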
BO methods are effective for many HPO problems, even if the objective
function f is stochastic, non-convex, or non-continuous. However, the main
drawback of BO models is that, if they fail to achieve the balance between
exploration and exploitation, they might only reach a local optimum instead of
the global optimum. RS does not have this limitation since it does not focus on
any specific area. Additionally, it is difficult to parallelize BO models since
their intermediate results are dependent on each other [7].
4.4. Multi-fidelity Optimization Algorithms
One major issue with HPO is the long execution time, which increases
with a larger number of hyper-parameter values and larger datasets. The
execution time may be several hours, several days, or even more [93].
Multi-fidelity optimization techniques are common approaches to cope with
limited time and resources. To save time, people can use a subset
of the original dataset or a subset of the features [94]. Multi-fidelity involves
low-fidelity and high-fidelity evaluations and combines them for practical
applications [95]. In low-fidelity evaluations, a relatively small subset is eval-
uated at a low cost but with poor generalization performance. In high-fidelity
evaluations, a relatively large subset is evaluated with better generalization
performance but at a higher cost than low-fidelity evaluations. In multi-
fidelity optimization algorithms, poorly-performing configurations are dis-
carded after each round of hyper-parameter evaluation on generated subsets,
and only well-performing hyper-parameter configurations will be evaluated
on the entire training set.
Bandit-based algorithms, which are categorized as multi-fidelity optimization
algorithms, have shown success in dealing with deep learning optimization
problems [3]. Two common bandit-based techniques are successive halving [96]
and Hyperband [16].
allocated to each configuration [6]. Thus, the main concern of successive
halving is how to allocate the budget and how to determine whether to test
fewer configurations with a higher budget for each or to test more configura-
tions with a lower budget for each [2].
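The trade-off between the number of configurations and the budget per configuration can be seen in a minimal sketch of successive halving. The halving factor ’eta’, the toy loss, and the candidate learning rates are hypothetical choices made for illustration.

```python
import random

def successive_halving(configs, evaluate, min_budget, eta=2):
    """Keep the best 1/eta of the configurations each round and
    multiply the per-configuration budget by eta."""
    budget = min_budget
    survivors = list(configs)
    while len(survivors) > 1:
        ranked = sorted(survivors, key=lambda c: evaluate(c, budget))  # lower loss first
        survivors = ranked[: max(1, len(ranked) // eta)]
        budget *= eta
    return survivors[0], budget

# Hypothetical noisy loss: the noise shrinks as the budget grows.
rng = random.Random(0)

def evaluate(lr, budget):
    return (lr - 0.1) ** 2 + rng.gauss(0.0, 0.01) / budget

configs = [i / 100 for i in range(1, 33)]          # 32 candidate learning rates
best, final_budget = successive_halving(configs, evaluate, min_budget=1)
print(best, final_budget)
```

Starting with more configurations leaves less budget for each in the first rounds, so noisy low-budget scores may discard a good configuration early; this is the dilemma referred to above.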
4.4.2. Hyperband
Hyperband [16] was then proposed to solve the dilemma of successive
halving algorithms by dynamically choosing a reasonable number of
configurations. It aims to achieve a trade-off between the number of
hyper-parameter configurations (n) and their allocated budgets by dividing the
total budget (B) into n pieces and allocating these pieces to each
configuration (b = B/n). Successive halving serves as a subroutine on each set
of random configurations to eliminate the poorly-performing hyper-parameter
configurations and improve efficiency. The main steps of the Hyperband
algorithm are shown in Algorithm 1 [2].
Algorithm 1 Hyperband
Input: $b_{max}$, $b_{min}$
1: $s_{max} = \log\left(\frac{b_{max}}{b_{min}}\right)$
Firstly, the budget constraints $b_{min}$ and $b_{max}$ are determined by the
total number of data points, the minimum number of instances required to train
a sensible model, and the available budget. After that, the number of
configurations n and the budget size allocated to each configuration are
calculated based on $b_{min}$ and $b_{max}$ in steps 2-3 of Algorithm 1. The
configurations are sampled based on n and b, and then passed to the successive
halving model demonstrated in steps 4-5. The successive halving algorithm
discards the identified poorly-performing configurations and passes the
well-performing configurations on to the next iteration. This process is
repeated until the final optimal hyper-parameter configuration is identified.
By involving the successive halving searching method, Hyperband has a
computational complexity of O(n log n) [16].
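How n and b can be derived from the budget constraints is sketched below. The text above does not spell out the formulas, so the halving factor ’eta’ and the ceiling rule for n are assumptions that follow one commonly used formulation of Hyperband; other implementations may round differently.

```python
import math

def hyperband_brackets(b_min, b_max, eta=3):
    """Return one (s, n, b) tuple per bracket: bracket index, number of
    sampled configurations, and initial budget for each configuration."""
    # Small epsilon guards against floating-point error in the logarithm.
    s_max = int(math.floor(math.log(b_max / b_min, eta) + 1e-9))
    brackets = []
    for s in range(s_max, -1, -1):
        numerator = (s_max + 1) * eta ** s
        n = (numerator + s) // (s + 1)        # integer ceiling of numerator / (s + 1)
        b = b_max / eta ** s                  # budget given to each configuration
        brackets.append((s, n, b))
    return brackets

for s, n, b in hyperband_brackets(b_min=1, b_max=81):
    print(f"s={s}: n={n} configurations, b={b} budget each")
```

Brackets with large s test many configurations on a small budget, while the last bracket (s = 0) evaluates a few configurations on the full budget, which corresponds to plain random search.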
4.4.3. BOHB
Bayesian Optimization HyperBand (BOHB) [97] is a state-of-the-art HPO
technique that combines Bayesian optimization and Hyperband to incorporate
the advantages of both while avoiding their drawbacks. The original
Hyperband uses a random search to search the hyper-parameter configuration
space, which has a low efficiency. BOHB replaces the RS method with
BO to achieve both high performance and low execution time by
effectively using parallel resources to optimize all types of hyper-parameters.
In BOHB, TPE is the standard surrogate model for BO, but it uses multidi-
mensional kernel density estimators. Therefore, the complexity of BOHB is
also O(nlogn) [97].
It has been shown that BOHB outperforms many other optimization tech-
niques when tuning SVM and DL models [97]. The only limitation of BOHB
is that it requires the evaluations on subsets with small budgets to be rep-
resentative of evaluations on the entire training set; otherwise, BOHB may
have a slower convergence speed than standard BO models.
also inherit their parents’ characteristics and may involve better and worse
individuals. Better individuals will be more likely to survive and have more
capable offspring, while the worse individuals will gradually disappear. After
several generations, the individual with the best adaptability will be identi-
fied as the global optimum [99].
To apply GA to HPO problems, each chromosome or individual represents
a hyper-parameter, and its decimal value is the actual input value of the
hyper-parameter in each evaluation. Every chromosome has several genes,
which are binary digits; and then crossover and mutation operations are
performed on the genes of this chromosome. The population involves all
possible values within the initialized chromosome/parameter ranges, while
the fitness function characterizes the evaluation metrics of the parameters
[99].
Since the randomly-initialized parameter values often do not include the
optimal parameter values, several operations on the well-performing chromo-
somes, including selection, crossover, and mutation operations, must be per-
formed to identify the optimums [18]. Chromosome selection is implemented
by selecting those chromosomes with good fitness function values. To keep
the population size unchanged, the chromosomes with good fitness function
values are passed to the next generation with higher probability, where they
generate new chromosomes with the parents’ best characteristics. Chromo-
some selection ensures that good characteristics of each generation can be
passed to later generations. Crossover is used to generate new chromosomes
by exchanging a proportion of genes in different chromosomes. Mutation
operations are also used to generate new chromosomes by randomly altering
one or more genes of a chromosome. Crossover and mutation operations en-
able later generations to have different characteristics and reduce the chance
of missing some good characteristics [3].
The main procedures of GA are as follows [98]:
1. Randomly initialize the population, chromosomes, and genes represent-
ing the entire search space, hyper-parameters, and hyper-parameter
values, respectively.
2. Evaluate the performance of each individual in the current generation
by calculating the fitness function, which indicates the objective func-
tion of a ML model.
3. Perform selection, crossover, and mutation operations on the chromo-
somes to produce a new generation involving the next hyper-parameter
configurations to be evaluated.
4. Repeat steps 2 & 3 until the termination condition is met.
5. Terminate and output the optimal hyper-parameter configuration.
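The steps above can be sketched for a single integer hyper-parameter encoded as a binary chromosome. The toy fitness function, the tournament selection, the elitism step, and the rates used are hypothetical choices for illustration; in practice, the fitness would be the validation score of the ML model.

```python
import random

N_BITS = 5                        # one chromosome encodes an integer in [0, 31]

def decode(chrom):
    """Decimal value of the binary genes, i.e. the actual hyper-parameter value."""
    return int("".join(map(str, chrom)), 2)

def fitness(chrom):
    """Hypothetical objective to minimize: distance of the value from 21."""
    return abs(decode(chrom) - 21)

def genetic_algorithm(pop_size=12, generations=20, crossover_rate=0.8,
                      mutation_rate=0.05, seed=0):
    rng = random.Random(seed)
    # Step 1: random initialization of the population.
    pop = [[rng.randint(0, 1) for _ in range(N_BITS)] for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        pop.sort(key=fitness)                         # Step 2: evaluate fitness
        history.append(fitness(pop[0]))
        new_pop = [pop[0][:]]                         # elitism (an added assumption)
        while len(new_pop) < pop_size:                # Step 3: produce a new generation
            p1 = min(rng.sample(pop, 2), key=fitness) # selection by binary tournament
            p2 = min(rng.sample(pop, 2), key=fitness)
            child = p1[:]
            if rng.random() < crossover_rate:         # single-point crossover
                cut = rng.randint(1, N_BITS - 1)
                child = p1[:cut] + p2[cut:]
            child = [1 - gene if rng.random() < mutation_rate else gene
                     for gene in child]               # bit-flip mutation
            new_pop.append(child)
        pop = new_pop                                 # Step 4: repeat
    best = min(pop, key=fitness)                      # Step 5: output the best found
    return decode(best), history + [fitness(best)]

best_value, history = genetic_algorithm()
print(best_value, history[-1])
```

Because the best chromosome is always copied into the next generation, the best fitness in this sketch never gets worse from one generation to the next.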
Among the above steps, the population initialization step is an impor-
tant step of GA and PSO since it provides an initial guess of the optimal
values. Although the initialized values will be iteratively improved in the
optimization process, a suitable population initialization method can signif-
icantly improve the convergence speed and performance of POAs. A good
initial population of hyper-parameters should involve individuals that are
close to global optimums by covering the promising regions and should not
be localized to an unpromising region of the search space [100].
To generate hyper-parameter configuration candidates for the initial pop-
ulation, random initialization that simply creates the initial population with
random values in the given search space is often used in GA [101]. Thus,
GA is easily implemented and does not necessitate good initializations, be-
cause its selection, crossover, and mutation operations lower the possibility
of missing the global optimum.
Hence, it is useful when the data analyst does not have much experi-
ence determining a potential appropriate initial search space for the hyper-
parameters. The main limitation of GA is that the algorithm itself introduces
additional hyper-parameters to be configured, including the fitness function
type, population size, crossover rate, and mutation rate. Moreover, GA is a
sequential execution algorithm, making it difficult to parallelize. The time
complexity of GA is O(n^2) [102]. As a result, sometimes, GA may be
inefficient due to low convergence speed.
$$S = (S_1, S_2, \cdots, S_n), \tag{26}$$
and each particle $S_i$ is represented by a vector:
$$S_i = \langle \vec{x}_i, \vec{v}_i, \vec{p}_i \rangle, \tag{27}$$
where $\vec{x}_i$ is the current position, $\vec{v}_i$ is the current velocity, and $\vec{p}_i$ is the
known best position of the particle so far.
PSO initially generates each particle with a random position and a random
velocity. Every particle evaluates the current position and records the
position with its performance score. In the next iteration, the velocity $\vec{v}_i$ of
each particle is changed based on the previous position $\vec{p}_i$ and the current
global optimal position $\vec{p}$:
$$\vec{v}_i := \vec{v}_i + U(0, \varphi_1)(\vec{p}_i - \vec{x}_i) + U(0, \varphi_2)(\vec{p} - \vec{x}_i), \tag{28}$$
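The velocity update of Eq. (28) can be sketched in NumPy as follows. The position update (adding the new velocity to the position), the clipping of velocities and positions to the search bounds, the acceleration constants, and the toy objective are assumptions added so that the example is runnable.

```python
import numpy as np

def pso(f, dim, n_particles=20, iters=50, phi1=1.5, phi2=1.5,
        bounds=(-5.0, 5.0), seed=0):
    rng = np.random.default_rng(seed)
    lo, hi = bounds
    x = rng.uniform(lo, hi, (n_particles, dim))       # random initial positions
    v = rng.uniform(-1.0, 1.0, (n_particles, dim))    # random initial velocities
    p = x.copy()                                      # best position of each particle
    p_val = np.apply_along_axis(f, 1, x)
    g = p[np.argmin(p_val)].copy()                    # global best position
    history = [p_val.min()]
    for _ in range(iters):
        u1 = rng.uniform(0.0, phi1, x.shape)          # U(0, phi_1)
        u2 = rng.uniform(0.0, phi2, x.shape)          # U(0, phi_2)
        v = v + u1 * (p - x) + u2 * (g - x)           # Eq. (28)
        v = np.clip(v, -(hi - lo), hi - lo)           # assumed velocity limit
        x = np.clip(x + v, lo, hi)                    # assumed position update
        val = np.apply_along_axis(f, 1, x)
        improved = val < p_val
        p[improved], p_val[improved] = x[improved], val[improved]
        g = p[np.argmin(p_val)].copy()
        history.append(p_val.min())
    return g, history

# Hypothetical objective: the sphere function with its minimum at the origin.
g_best, history = pso(lambda z: float(np.sum(z ** 2)), dim=2)
print(g_best, history[-1])
```

Each particle only needs its own record and the shared global best, which is why the evaluations within one iteration can run in parallel.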
have been proposed to improve the performance of evolutionary algorithms,
like the opposition-based optimization algorithm [101] and the space trans-
formation search method [106]. Involving additional population initialization
techniques will require more execution time and resources.
results to reduce unnecessary evaluations and improve efficiency. BO-GP
mainly supports continuous and discrete hyper-parameters (by rounding them),
but does not support conditional hyper-parameters [14]; while SMAC and
BO-TPE are both able to handle categorical, discrete, continuous, and con-
ditional hyper-parameters. SMAC performs better when there are many cate-
gorical and conditional parameters, or cross-validation is used, while BO-GP
performs better for only a few continuous parameters [15]. BO-TPE pre-
serves the specified conditional relationships, so one advantage of BO-TPE
over BO-GP is its innate support for specified conditional hyper-parameters
[14].
Metaheuristic algorithms, including GA and PSO, are more complicated
than many other HPO algorithms, but often perform well for complex op-
timization problems. They support all types of hyper-parameters and are
particularly efficient for large configuration spaces, since they can obtain the
near-optimal solutions even within very few iterations. However, GA and
PSO have their own advantages and disadvantages in practical use. PSO
is able to support large-scale parallelization, and is particularly suitable for
continuous and conditional HPO problems [19]; on the other hand, GA is
executed sequentially, making it difficult to parallelize. Therefore, PSO
often executes faster than GA, especially for large configuration spaces and
large datasets. However, an appropriate population initialization is crucial
for PSO; otherwise, it may converge slowly or only identify a local instead of
a global optimum. Yet, the impact of proper population initialization is not
as significant for GA as for PSO [108]. Another limitation of GA is that it
introduces additional hyper-parameters, like its crossover and mutation rates
[18].
The strengths and limitations of the hyper-parameter optimization algo-
rithms involved in this paper are summarized in Table 1.
Table 1: The comparison of common HPO algorithms (n is the number of hyper-parameter
values and k is the number of hyper-parameters)
| HPO Method | Strengths | Limitations | Time Complexity |
|---|---|---|---|
| GS | Simple. | Time-consuming. Only efficient with categorical HPs. | O(n^k) |
| RS | More efficient than GS. Enable parallelization. | Not consider previous results. Not efficient with conditional HPs. | O(n) |
| Gradient-based models | Fast convergence speed for continuous HPs. | Only support continuous HPs. May only detect local optimums. | O(n^k) |
| BO-GP | Fast convergence speed for continuous HPs. | Poor capacity for parallelization. Not efficient with conditional HPs. | O(n^3) |
| SMAC | Efficient with all types of HPs. | Poor capacity for parallelization. | O(n log n) |
| BO-TPE | Efficient with all types of HPs. Keep conditional dependencies. | Poor capacity for parallelization. | O(n log n) |
| Hyperband | Enable parallelization. | Not efficient with conditional HPs. Require subsets with small budgets to be representative. | O(n log n) |
| BOHB | Efficient with all types of HPs. Enable parallelization. | Require subsets with small budgets to be representative. | O(n log n) |
| GA | Efficient with all types of HPs. Not require good initialization. | Poor capacity for parallelization. | O(n^2) |
| PSO | Efficient with all types of HPs. Enable parallelization. | Require proper initialization. | O(n log n) |
BOHB would be the best choice, since it has the advantages of both BO and
Hyperband [6] [97].
On the other hand, if multiple fidelities are not applicable, which means
that using the subsets of the original dataset or the subsets of original features
is misleading or too noisy to reflect the performance of the entire dataset,
BOHB may perform poorly with higher time complexity than standard BO
models; in that case, choosing other HPO algorithms would be more efficient [97].
ML algorithms can be classified by the characteristics of their hyper-
parameter configurations. Appropriate optimization algorithms can be cho-
sen to optimize the hyper-parameters based on these characteristics.
parameter needs to be tuned. For KNN, the major hyper-parameter is k,
the number of considered neighbors. The most essential hyper-parameter of
k-means, hierarchical clustering, and EM is the number of clusters. Similarly,
for dimensionality reduction algorithms, including PCA and LDA, their basic
hyper-parameter is ’n_components’, the number of features to be extracted.
In these situations, Bayesian optimization is the best choice, and the
three surrogates could be tested to find the best one. Hyperband is another
good choice, which may have a fast execution speed due to its capacity for
parallelization. In some cases, people may want to fine-tune the ML model by
considering other less important hyper-parameters, like the distance metric
of KNN and the SVD solver type of PCA; so BO-TPE, GA, or PSO could
be chosen for these situations.
5.2.4. A Large Hyper-parameter Configuration Space with Multiple Types of
Hyper-parameters
In ML, tree-based algorithms, including DT, RF, ET, and XGBoost, as
well as DL algorithms, like DNN, CNN, and RNN, are the most complex types of
ML algorithms to be tuned, since they have many hyper-parameters of
various types. For these ML models, PSO is the best choice since it
enables parallel executions to improve efficiency, particularly for DL models
that often require massive training time. Some other techniques, like GA,
BO-TPE, and SMAC can also be used, but they may cost more time than
PSO, since it is difficult to parallelize these techniques.
6.1. Sklearn
In sklearn [31], ’GridSearchCV’ can be implemented to detect the optimal
hyper-parameters using the GS algorithm. Each hyper-parameter value in
the human-defined configuration space is evaluated by the program, with
its performance evaluated using cross-validation. When all the instances in
the configuration space have been evaluated, the optimal hyper-parameter
combination in the defined search space with its performance score will be
returned.
’RandomizedSearchCV’ is also provided in sklearn to implement a RS
method. It evaluates a pre-defined number of randomly-selected hyper-
parameter values in parallel. Cross-validation is conducted to effectively
evaluate the performance of each configuration.
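A minimal usage sketch of both classes is given below for a KNN classifier with ’n_neighbors’ in the range of 1 to 20. The digits dataset bundled with sklearn is used as a small stand-in for MNIST, and the number of random samples (’n_iter’) is an arbitrary choice; both are assumptions for illustration.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = load_digits(return_X_y=True)
space = {"n_neighbors": list(range(1, 21))}

# Grid search: evaluates all 20 values with 3-fold cross-validation.
gs = GridSearchCV(KNeighborsClassifier(), space, cv=3,
                  scoring="accuracy").fit(X, y)

# Random search: evaluates only 8 randomly selected values.
rs = RandomizedSearchCV(KNeighborsClassifier(), space, n_iter=8, cv=3,
                        scoring="accuracy", random_state=0).fit(X, y)

print(gs.best_params_, gs.best_score_)
print(rs.best_params_, rs.best_score_)
```

Both classes accept an ’n_jobs’ argument to run the evaluations in parallel.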
6.2. Spearmint
Spearmint [87] is a library using Bayesian optimization with the Gaussian
process as the surrogate model. Spearmint’s primary deficiency is that it is
not very efficient for categorical and conditional hyper-parameters.
6.3. BayesOpt
Bayesian Optimization (BayesOpt) [109] is a Python library employed
to solve HPO problems using BO. BayesOpt uses a Gaussian process as its
surrogate model to approximate the objective function based on past evaluations
and utilizes an acquisition function to determine the next values.
6.4. Hyperopt
Hyperopt [110] is a HPO framework that involves RS and BO-TPE as
the optimization algorithms. Unlike some of the other libraries that only
support a single model, Hyperopt is able to use multiple instances to model
hierarchical hyper-parameters. In addition, Hyperopt is parallelizable since
it uses MongoDB as the central database to store the hyper-parameter
combinations. hyperopt-sklearn [111] and hyperas [112] are two libraries that
can apply Hyperopt to the scikit-learn and Keras libraries.
6.5. SMAC
SMAC [90][113] is another library that uses BO with random forest as the
surrogate model. It supports categorical, continuous, and discrete variables.
6.6. BOHB
The BOHB framework [97] is a combination of Bayesian optimization and
Hyperband [15]. It overcomes one limitation of Hyperband, namely that it
randomly generates the test configurations, by replacing this procedure with
BO. TPE is used as the surrogate model to store and model function evaluations.
Using BOHB to evaluate the instance can achieve a trade-off between model
performance and the current budget.
6.7. Optunity
Optunity [83] is a popular HPO framework that provides several optimiza-
tion techniques, including GS, RS, PSO, and BO-TPE. In Optunity, categori-
cal hyper-parameters are converted to discrete hyper-parameters by indexing,
and discrete hyper-parameters are processed as continuous hyper-parameters
by rounding them; as such, it supports all kinds of hyper-parameters.
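The indexing and rounding scheme can be illustrated with a small decoding function. This is a generic sketch of the idea rather than Optunity's API; the function name, the dictionary keys, and the example hyper-parameters are hypothetical.

```python
CRITERIA = ["gini", "entropy"]          # categorical choices, indexed 0 and 1

def decode(point):
    """Map a continuous search point to (criterion, n_estimators)."""
    # Categorical: round the continuous index and clamp it to a valid position.
    idx = int(round(point["criterion_idx"]))
    idx = max(0, min(len(CRITERIA) - 1, idx))
    # Discrete: round the continuous value to the nearest integer.
    n_estimators = int(round(point["n_estimators"]))
    return CRITERIA[idx], n_estimators

print(decode({"criterion_idx": 0.8, "n_estimators": 49.7}))  # ('entropy', 50)
```

An optimizer that only handles continuous variables can then search over the continuous point, while the model is always trained with the decoded values.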
6.8. Skopt
Skopt (scikit-optimize) [114] is a HPO library that is built on top of
the scikit-learn [31] library. It implements several sequential model-based
optimization models, including RS and BO-GP. The methods exhibit good
performance with a small search space and proper initialization.
6.9. GpFlowOpt
GpFlowOpt [115] is a Python library for BO using GP as the surrogate
model. It supports running BO-GP on GPU using the TensorFlow library.
Therefore, GpFlowOpt is a good choice if BO is used in deep learning models,
and GPU resources are available.
6.10. Talos
Talos [116] is a Python package designed for hyper-parameter optimiza-
tion with Keras models. Talos can be fully deployed into any Keras models
and implemented easily without learning any new syntax. Several optimiza-
tion techniques, including GS, RS, and probabilistic reduction, can be im-
plemented using Talos.
6.11. Sherpa
Sherpa [117] is a Python package used for HPO problems. It can be used
with other ML libraries, including sklearn [31], TensorFlow [118], and Keras
[33]. It supports parallel computations and has several optimization methods,
including GS, RS, BO-GP (via GPyOpt), Hyperband, and population-based
training (PBT).
6.12. Osprey
Osprey [119] is a Python library designed to optimize hyper-parameters.
Several HPO strategies are available in Osprey, including GS, RS, BO-TPE
(via Hyperopt), and BO-GP (via GPyOpt).
6.13. FAR-HO
FAR-HO [120] is a hyper-parameter optimization package that employs
gradient-based algorithms with TensorFlow. FAR-HO contains a few gradient-
based optimizers, like reverse hyper-gradient and forward hyper-gradient
methods. This library is designed to build access to the gradient-based hyper-
parameter optimizers in TensorFlow, allowing deep learning model training
and hyper-parameter optimization in GPU or other tensor-optimized com-
puting environments.
6.14. Hyperband
Hyperband [16] is a Python package for tuning hyper-parameters by Hy-
perband, a bandit-based approach. Similar to ’GridSearchCV’ and ’Random-
izedSearchCV’ in scikit-learn, there is a class named ’HyperbandSearchCV’
in Hyperband that can be combined with sklearn and used for HPO problems.
In the ’HyperbandSearchCV’ method, cross-validation is used for evaluation.
6.15. DEAP
DEAP [121] is a novel evolutionary computation package for Python that
contains several evolutionary algorithms like GA and PSO. It integrates with
parallelization mechanisms like multiprocessing, and machine learning pack-
ages like sklearn.
6.16. TPOT
TPOT [122] is a Python tool for auto-ML that uses genetic programming
to optimize ML pipelines. TPOT is built on top of sklearn, so it is easy to
implement TPOT on ML models. ’TPOTClassifier’ is its principal function,
and several additional hyper-parameters of GA must be set to fit specific
problems.
6.17. Nevergrad
Nevergrad [123] is an open-source Python library that includes a wide
range of optimizers, like fast-GA and PSO. In ML, Nevergrad can be used
to tune all types of hyper-parameters, including discrete, continuous, and
categorical hyper-parameters, by choosing different optimizers.
7. Experiments
To summarize the content of Sections 3 to 6, a comprehensive overview
of applying hyper-parameter optimization techniques to ML models is shown
in Table 2. It provides a summary of common ML algorithms, their hyper-
parameters, suitable optimization methods, and available Python libraries;
thus, data analysts and researchers can look up this table and select suitable
optimization algorithms as well as libraries for practical use.
To put theory into practice, several experiments have been conducted
based on Table 2. This section provides the experiments of applying eight
different HPO techniques to three common and representative ML algorithms
on two benchmark datasets. In the first part of this section, the experimental
setup and the main process of HPO are discussed. In the second part, the
results of utilizing different HPO methods are compared and analyzed. The
sample code of the experiments has been published in [124] to illustrate the
process of applying hyper-parameter optimization to ML models.
Table 2: A comprehensive overview of common ML models, their hyper-parameters, suit-
able optimization techniques, and available Python libraries
| ML Algorithm | Main HPs | Optional HPs | HPO methods | Libraries |
|---|---|---|---|---|
| Linear regression | - | - | - | - |
| Ridge & lasso | alpha | - | BO-GP | Skopt |
| Logistic regression | penalty, C, solver | - | BO-TPE, SMAC | Hyperopt, SMAC |
| KNN | n_neighbors | weights, p, algorithm | BOs, Hyperband | Skopt, Hyperopt, SMAC, Hyperband |
| SVM | C, kernel, epsilon (for SVR) | gamma, coef0, degree | BO-TPE, SMAC, BOHB | Hyperopt, SMAC, BOHB |
| NB | alpha | - | BO-GP | Skopt |
| DT | criterion, max_depth, min_samples_split, min_samples_leaf, max_features | splitter, min_weight_fraction_leaf, max_leaf_nodes | GA, PSO, BO-TPE, SMAC, BOHB | TPOT, Optunity, SMAC, BOHB |
| RF & ET | n_estimators, max_depth, criterion, min_samples_split, min_samples_leaf, max_features | splitter, min_weight_fraction_leaf, max_leaf_nodes | GA, PSO, BO-TPE, SMAC, BOHB | TPOT, Optunity, SMAC, BOHB |
| XGBoost | n_estimators, max_depth, learning_rate, subsample, colsample_bytree | min_child_weight, gamma, alpha, lambda | GA, PSO, BO-TPE, SMAC, BOHB | TPOT, Optunity, SMAC, BOHB |
| Voting | estimators, voting | weights | GS | sklearn |
| Bagging | base_estimator, n_estimators | max_samples, max_features | GS, BOs | sklearn, Skopt, Hyperopt, SMAC |
| AdaBoost | base_estimator, n_estimators, learning_rate | - | BO-TPE, SMAC | Hyperopt, SMAC |
| Deep learning | number of hidden layers, units per layer, loss, optimizer, activation, learning rate, dropout rate, epochs, batch size, early stop patience | number of frozen layers (if transfer learning is used) | PSO, BOHB | Optunity, BOHB |
| K-means | n_clusters | init, n_init, max_iter | BOs, Hyperband | Skopt, Hyperopt, SMAC, Hyperband |
| Hierarchical clustering | n_clusters, distance_threshold | linkage | BOs, Hyperband | Skopt, Hyperopt, SMAC, Hyperband |
| DBSCAN | eps, min_samples | - | BO-TPE, SMAC, BOHB | Hyperopt, SMAC, BOHB |
| Gaussian mixture | n_components | covariance_type, max_iter, tol | BO-GP | Skopt |
| PCA | n_components | svd_solver | BOs, Hyperband | Skopt, Hyperopt, SMAC, Hyperband |
| LDA | n_components | solver, shrinkage | BOs, Hyperband | Skopt, Hyperopt, SMAC, Hyperband |
the benchmark datasets for HPO method evaluation on data analytics prob-
lems. MNIST is a hand-written digit recognition dataset used as a multi-
classification problem, while the Boston housing dataset contains information
about the price of houses in various places in the city of Boston and can be
used as a regression dataset to predict the housing prices.
At the next stage, the ML models and their objective functions need to be
configured. In Section 5, all common ML models are divided into five
categories based on their hyper-parameter types. Among those categories,
"one discrete hyper-parameter", "a few conditional hyper-parameters", and
"a large hyper-parameter configuration space with multiple types of
hyper-parameters" are the three most common cases. Thus, three ML
algorithms, KNN, SVM, and RF, are selected as the target models to be
optimized, since their hyper-parameter types represent these three cases:
KNN has one important hyper-parameter, the number of nearest neighbors
considered for each sample; SVM has a few conditional hyper-parameters,
such as the kernel type and the penalty parameter C; RF has multiple
hyper-parameters of different types, as discussed in Section 3. Moreover,
KNN, SVM, and RF can all be applied to both classification and regression
problems.
In the next step, the performance metric and evaluation method are
configured. For each experiment on the two selected datasets, 3-fold
cross-validation is used to evaluate the involved HPO methods. The two
most commonly used performance metrics are adopted in our experiments. For
classification models, accuracy, the proportion of correctly classified
samples, is used as the performance metric; for regression models, the
mean squared error (MSE), the average squared difference between the
predicted and actual values, is used. Additionally, the computational time
(CT), the total time needed to complete an HPO process with 3-fold
cross-validation, is used as the model efficiency metric [55]. In each
experiment, the ML model with the highest accuracy or the lowest MSE is
returned together with its optimal hyper-parameter configuration.
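The evaluation protocol above can be sketched in a few lines. The following is a minimal, dependency-free illustration rather than the code used in the experiments: the helper names (`k_fold_indices`, `cross_validate`, `mean_predictor`) and the toy data are hypothetical, and the timer covers a single configuration, whereas CT in the experiments covers a whole HPO process.

```python
import time


def k_fold_indices(n_samples, k=3):
    """Split range(n_samples) into k contiguous, near-equal folds."""
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def accuracy(y_true, y_pred):
    """Proportion of correctly classified samples."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def mse(y_true, y_pred):
    """Average squared difference between predicted and actual values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def cross_validate(fit_predict, X, y, metric, k=3):
    """Return (mean metric over the k folds, elapsed time in seconds)."""
    start = time.perf_counter()
    scores = []
    for test_idx in k_fold_indices(len(X), k):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(X)) if i not in held_out]
        y_pred = fit_predict(
            [X[i] for i in train_idx],
            [y[i] for i in train_idx],
            [X[i] for i in test_idx],
        )
        scores.append(metric([y[i] for i in test_idx], y_pred))
    return sum(scores) / k, time.perf_counter() - start


def mean_predictor(X_train, y_train, X_test):
    """Toy stand-in for a regressor: always predicts the training mean."""
    mean = sum(y_train) / len(y_train)
    return [mean] * len(X_test)


# Toy regression data (made up for illustration only).
X = [[float(i)] for i in range(6)]
y = [1.0, 1.0, 3.0, 3.0, 5.0, 5.0]
cv_mse, ct = cross_validate(mean_predictor, X, y, mse, k=3)
```

An HPO method would call `cross_validate` once per candidate configuration and keep the configuration with the lowest MSE (or the highest accuracy when `accuracy` is passed as the metric).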
After that, to fairly compare different optimization algorithms and
frameworks, certain constraints should be satisfied. Firstly, we compare
different HPO methods using the same hyper-parameter configuration space.
For KNN, the only hyper-parameter to be optimized, 'n_neighbors', is set
to the same range of 1 to 20 for every optimization method. The
hyper-parameters of the SVM and RF models for classification and
regression problems are likewise set to the same configuration space for
each type of problem. The specifics of the configuration space for the ML
models are shown in Table 3. The selected hyper-parameters and their
search spaces are determined based on the concepts in Section 3, domain
knowledge, and manual testing [81]. The hyper-parameter types of each ML
algorithm are also summarized in Table 3.
Table 3: Configuration space for the hyper-parameters of tested ML models
ML Model | Hyper-parameter | Type | Search Space
RF Classifier | n_estimators | Discrete | [10,100]
RF Classifier | max_depth | Discrete | [5,50]
RF Classifier | min_samples_split | Discrete | [2,11]
RF Classifier | min_samples_leaf | Discrete | [1,11]
RF Classifier | criterion | Categorical | ['gini', 'entropy']
RF Classifier | max_features | Discrete | [1,64]
SVM Classifier | C | Continuous | [0.1,50]
SVM Classifier | kernel | Categorical | ['linear', 'poly', 'rbf', 'sigmoid']
KNN Classifier | n_neighbors | Discrete | [1,20]
RF Regressor | n_estimators | Discrete | [10,100]
RF Regressor | max_depth | Discrete | [5,50]
RF Regressor | min_samples_split | Discrete | [2,11]
RF Regressor | min_samples_leaf | Discrete | [1,11]
RF Regressor | criterion | Categorical | ['mse', 'mae']
RF Regressor | max_features | Discrete | [1,13]
SVM Regressor | C | Continuous | [0.1,50]
SVM Regressor | kernel | Categorical | ['linear', 'poly', 'rbf', 'sigmoid']
SVM Regressor | epsilon | Continuous | [0.001,1]
KNN Regressor | n_neighbors | Discrete | [1,20]
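To make the configuration space of Table 3 concrete, the sketch below encodes the RF classifier rows as a plain Python dictionary, counts how many points an exhaustive grid over it would contain, and draws random configurations from it. The encoding (tuples for integer ranges, lists for categorical choices) and the helper names are assumptions made for illustration; they are not the format of any particular HPO library.

```python
import random

# Search space of the RF classifier, transcribed from Table 3.
# ("int", lo, hi) is an inclusive integer range; a list is a categorical choice.
RF_CLASSIFIER_SPACE = {
    "n_estimators": ("int", 10, 100),
    "max_depth": ("int", 5, 50),
    "min_samples_split": ("int", 2, 11),
    "min_samples_leaf": ("int", 1, 11),
    "criterion": ["gini", "entropy"],
    "max_features": ("int", 1, 64),
}


def grid_size(space):
    """Number of configurations an exhaustive grid over the space would visit."""
    total = 1
    for domain in space.values():
        if isinstance(domain, tuple):
            _, lo, hi = domain
            total *= hi - lo + 1
        else:
            total *= len(domain)
    return total


def sample_configuration(space, rng):
    """Draw one configuration uniformly at random (the basic step of RS)."""
    config = {}
    for name, domain in space.items():
        if isinstance(domain, tuple):
            _, lo, hi = domain
            config[name] = rng.randint(lo, hi)
        else:
            config[name] = rng.choice(domain)
    return config


rng = random.Random(0)
candidates = [sample_configuration(RF_CLASSIFIER_SPACE, rng) for _ in range(50)]
n_grid_points = grid_size(RF_CLASSIFIER_SPACE)  # 58,938,880 combinations
```

The full grid holds roughly 59 million combinations, which suggests why an exhaustive search over every integer value is impractical for RF and why a grid search must use a much coarser grid in practice.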
On the other hand, to fairly compare the performance metrics of the
optimization techniques, the maximum number of iterations for all HPO
methods is set to 50 for RF and SVM model optimization and to 10 for KNN
model optimization, based on manual testing and domain knowledge.
Moreover, to reduce the impact of randomness, all experiments are repeated
ten times with different random seeds, and the results are averaged for
regression problems or decided by majority vote for classification
problems.
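One possible reading of this aggregation step is sketched below: the MSE values of the repeated runs are averaged, and for classification the value reported most often across runs is kept. The function names and the numbers are hypothetical placeholders, not results from the experiments.

```python
from collections import Counter
from statistics import mean


def aggregate_regression(mse_per_run):
    """Average the MSE obtained across the repeated runs."""
    return mean(mse_per_run)


def aggregate_classification(values_per_run):
    """Return the value reported most often across the repeated runs.

    Ties are resolved in favour of the value that appeared first.
    """
    return Counter(values_per_run).most_common(1)[0][0]


# Made-up outcomes of three repeated runs, for illustration only.
example_mse = aggregate_regression([1.0, 2.0, 3.0])
example_vote = aggregate_classification([0.93, 0.94, 0.93])
```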
In Section 4, more than ten HPO methods are introduced. In our exper-
iments, eight representative HPO approaches are selected for performance
comparison, including GS, RS, BO-GP, BO-TPE, Hyperband, BOHB, GA,
and PSO. After setting up the fair experimental environments for each HPO
method, the HPO experiments are implemented based on the steps discussed
in Section 2.2.
All experiments were conducted using Python 3.5 on a machine with
Table 4: Performance evaluation of applying HPO methods to the RF classifier on the MNIST dataset
Optimization Algorithm | Accuracy (%) | CT (s)
Default HPs | 90.65 | 0.09
GS | 93.32 | 48.62
RS | 93.38 | 16.73
BO-GP | 93.38 | 20.60
BO-TPE | 93.88 | 12.58
Hyperband | 93.38 | 8.89
BOHB | 93.38 | 9.45
GA | 93.83 | 19.19
PSO | 93.73 | 12.43
Table 5: Performance evaluation of applying HPO methods to the SVM classifier on the MNIST dataset
Optimization Algorithm | Accuracy (%) | CT (s)
Default HPs | 97.05 | 0.29
GS | 97.44 | 32.90
RS | 97.35 | 12.48
BO-GP | 97.50 | 17.56
BO-TPE | 97.44 | 3.02
Hyperband | 97.44 | 11.37
BOHB | 97.44 | 8.18
GA | 97.44 | 16.89
PSO | 97.44 | 8.33
Table 6: Performance evaluation of applying HPO methods to the KNN classifier on the MNIST dataset
Optimization Algorithm | Accuracy (%) | CT (s)
Default HPs | 96.27 | 0.24
GS | 96.22 | 7.86
RS | 96.33 | 6.44
BO-GP | 96.83 | 1.12
BO-TPE | 96.83 | 2.33
Hyperband | 96.22 | 4.54
BOHB | 97.44 | 3.84
GA | 96.83 | 2.34
PSO | 96.83 | 1.73
Table 8: Performance evaluation of applying HPO methods to the SVM regressor on the Boston-housing dataset
Optimization Algorithm | MSE | CT (s)
Default HPs | 77.43 | 0.02
GS | 67.07 | 1.33
RS | 61.40 | 0.48
BO-GP | 61.27 | 5.87
BO-TPE | 59.40 | 0.33
Hyperband | 73.44 | 0.32
BOHB | 59.67 | 0.31
GA | 60.17 | 1.12
PSO | 58.72 | 0.53
Table 9: Performance evaluation of applying HPO methods to the KNN regressor on the Boston-housing dataset
Optimization Algorithm | MSE | CT (s)
Default HPs | 81.48 | 0.004
GS | 81.53 | 0.12
RS | 80.77 | 0.11
BO-GP | 80.77 | 0.49
BO-TPE | 80.83 | 0.08
Hyperband | 80.87 | 0.10
BOHB | 80.77 | 0.09
GA | 80.77 | 0.33
PSO | 80.74 | 0.19
which have a larger search space than KNN.
The performance of BO and multi-fidelity models is much better than that
of GS and RS. The computational time of BO-GP is often higher than that of
other HPO methods due to its cubic time complexity, but it can obtain
better performance metrics for ML models with a small hyper-parameter
space, like KNN. Conversely, Hyperband is often unable to obtain the
highest accuracy or the lowest MSE, but its computational time is low
because it works on small-sized subsets. BO-TPE and BOHB often perform
better than the other methods, since they can detect the optimal or
near-optimal hyper-parameter configurations within a short computational
time.
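The trade-off noted for Hyperband, low computational time at the risk of discarding the best configuration early, can be illustrated with successive halving, the subroutine that Hyperband runs repeatedly with different starting budgets. The sketch below shows a single bracket only and is not the implementation used in the experiments; `toy_loss` is a hypothetical objective in which the budget stands for the size of the data subset, and a larger budget gives a less noisy estimate.

```python
import math
import random


def successive_halving(configs, evaluate, min_budget=1, eta=2):
    """Keep the best 1/eta of the configurations in each round while
    multiplying the budget given to each survivor by eta."""
    budget = min_budget
    survivors = list(configs)
    while len(survivors) > 1:
        # Lower loss is better; cheap evaluations rank the candidates.
        scored = sorted(survivors, key=lambda c: evaluate(c, budget))
        survivors = scored[:max(1, len(scored) // eta)]
        budget *= eta
    return survivors[0]


def toy_loss(n_neighbors, budget):
    """Hypothetical loss with its true optimum at n_neighbors = 7.

    The noise term shrinks as the budget (data subset size) grows.
    """
    noise_rng = random.Random(n_neighbors * 1000 + budget)
    return abs(n_neighbors - 7) + noise_rng.gauss(0.0, 1.0) / math.sqrt(budget)


# One bracket over the KNN range 1..20 from Table 3.
best = successive_halving(range(1, 21), toy_loss)
```

Because most configurations are only ever evaluated at the smallest budgets, the total cost stays low; the same property means a good configuration that happens to look poor on a small subset can be eliminated before it receives a larger budget.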
For the metaheuristic methods, GA and PSO, the accuracies are often higher
than those of other HPO methods for classification problems, and the MSEs
are often lower than those of other optimization techniques for regression
problems. However, their computational time is often higher than that of
BO-TPE and multi-fidelity models, especially for GA, which does not
support parallel execution.
To summarize, GS and RS are simple to implement, but they often either
fail to detect the optimal hyper-parameter configuration or cost much
computational time. BO-GP and GA also cost more computational time than
other HPO methods, but BO-GP works well on small configuration spaces,
while GA is effective for large configuration spaces. Hyperband's
computational time is low, but it cannot guarantee detection of the global
optimum. For ML models with large configuration spaces, BO-TPE, BOHB, and
PSO often work well.
Table 10: The open challenges and future directions of HPO research
Category | Challenges & Future Requirements | Brief Description
Model complexity | Costly objective function evaluations | HPO methods should reduce evaluation time on large datasets.
Model complexity | Complex search space | HPO methods should reduce execution time on high dimensionalities (large hyper-parameter search space).
Model performance | Strong anytime performance | HPO methods should be able to detect the optimal or near-optimal HPs even with a very limited budget.
Model performance | Strong final performance | HPO methods should be able to detect the global optimum when given a sufficient budget.
Model performance | Comparability | There should exist a standard set of benchmarks to fairly evaluate and compare different optimization algorithms.
Model performance | Over-fitting and generalization | The optimal HPs detected by HPO methods should have generalizability to build efficient models on unseen data.
Model performance | Randomness | HPO methods should reduce randomness on the obtained results.
Model performance | Scalability | HPO methods should be scalable to multiple libraries or platforms (e.g., distributed ML platforms).
Model performance | Continuous updating capability | HPO methods should consider their capacity to detect and update optimal HP combinations on continuously-updated data.
Depending on the scale of data, the model complexity, and available compu-
tational resources, the evaluation of each hyper-parameter configuration may
cost several minutes, hours, days, or even more [93]. Additionally, the values
of certain hyper-parameters have a direct impact on the execution time, like
the number of considered neighbors in KNN, the number of basic decision
trees in RF, and the number of hidden layers in deep neural networks [125].
To address this problem, BO models reduce the total number of evaluations
by spending time choosing the next point to evaluate instead of simply
evaluating all possible hyper-parameter configurations; however, they
still require much execution time due to their poor capacity for
parallelization. On the other hand, although multi-fidelity optimization
methods, like Hyperband, have had some success in dealing with HPO
problems under limited budgets, there are still some problems that cannot
be effectively solved by HPO due to the complexity of the models or the
scale of the datasets [6]. For example, the ImageNet [126] challenge is a
very popular problem in the image processing domain, but there has not yet
been any research or work on efficiently optimizing hyper-parameters for
the ImageNet challenge, due to its huge scale and the complexity of the
CNN models used on ImageNet.
8.2.2. Comparability of HPO Methods
To optimize the hyper-parameters of ML models, different optimization
algorithms can be applied to each ML framework. Different optimization
techniques have their own strengths and drawbacks in different cases, and
currently, there is no single optimization approach that outperforms all other
approaches when processing different datasets with various metrics and hyper-
parameter types [3]. In this paper, we have analyzed the strengths and weak-
nesses of common hyper-parameter optimization techniques based on their
principles and their performance in practical applications; but this topic
could be extended more comprehensively.
To solve this problem, a standard set of benchmarks could be designed and
agreed on by the community for a better comparison of different HPO
algorithms. For example, there is a platform called COCO (Comparing
Continuous Optimizers) [127] that provides benchmarks and analyzes common
continuous optimizers. However, to date, no reliable platform provides
benchmarks and analysis of all common hyper-parameter optimization
approaches. It would be easier for people to choose HPO algorithms in
practical applications if a platform like COCO existed for HPO. In
addition, a unified metric could also improve the comparability of
different HPO algorithms, since different metrics are currently used in
different practical problems [6].
On the other hand, based on the comparison of different HPO algorithms, a
way to further improve HPO is to combine existing models or propose new
models that retain as many benefits as possible and are more suitable for
practical problems than existing single models. For example, the BOHB
method [97] has had some success by combining Bayesian optimization and
Hyperband. In addition, future research should consider both model
performance and time budgets to develop HPO algorithms that suit real-life
applications.
One solution to reduce or avoid over-fitting is to use cross-validation to
identify a stable optimum that performs best in all or most of the
subsets, instead of a sharp optimum that only performs well on a single
validation set [6]. However, cross-validation increases the execution time
several-fold. It would be beneficial if future research produced methods
that better deal with over-fitting and improve generalization.
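The difference between a sharp and a stable optimum can be made concrete with a small example. The per-fold accuracies below are invented purely for illustration, and the two selection helpers are hypothetical names: selecting on a single validation fold picks the configuration that happens to excel there, whereas selecting on the mean over all folds prefers the configuration that is consistently good.

```python
from statistics import mean

# Hypothetical per-fold validation accuracies for two configurations.
fold_scores = {
    "sharp": [0.99, 0.80, 0.78],   # excellent on fold 0 only
    "stable": [0.93, 0.92, 0.94],  # consistently good on every fold
}


def best_on_single_fold(scores, fold=0):
    """Select the configuration with the highest score on one fold."""
    return max(scores, key=lambda name: scores[name][fold])


def best_on_average(scores):
    """Select the configuration with the highest mean score over all folds."""
    return max(scores, key=lambda name: mean(scores[name]))


single_fold_choice = best_on_single_fold(fold_scores)
cross_validated_choice = best_on_average(fold_scores)
```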
8.2.4. Randomness
There are stochastic components in the objective function of ML
algorithms; thus, in some cases, the optimal hyper-parameter configuration
might be different after each run. This randomness could be due to various
procedures of certain ML models, like neural network initialization or the
different sampled subsets of a bagging model [93], or due to certain
procedures of HPO algorithms, like the crossover and mutation operations
in GA. In addition, it is often difficult for HPO methods to identify the
global optimum, because HPO problems are mainly NP-hard. As a result of
this randomness, many existing HPO algorithms can only collect several
different near-optimal values. Thus, the existing HPO models can be
further improved to reduce the impact of randomness. One possible solution
is to run an HPO method multiple times and select the hyper-parameter
value that occurs most often as the final optimum.
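The repeated-run remedy can be sketched as follows, using random search as a stand-in for any HPO method and a hypothetical noisy objective over the KNN range of 1 to 20. All function names and the objective are assumptions made for illustration.

```python
import random
from collections import Counter


def noisy_objective(n_neighbors, rng):
    """Hypothetical noisy validation error with its minimum near n_neighbors = 5."""
    return (n_neighbors - 5) ** 2 + rng.gauss(0.0, 2.0)


def random_search(rng, budget=10):
    """One HPO run: evaluate `budget` random values and return the best one."""
    trials = [rng.randint(1, 20) for _ in range(budget)]
    return min(trials, key=lambda k: noisy_objective(k, rng))


def most_frequent_optimum(n_runs=10):
    """Repeat the HPO run with different seeds and keep the most frequent winner."""
    winners = [random_search(random.Random(seed)) for seed in range(n_runs)]
    return Counter(winners).most_common(1)[0][0], winners


final_value, winners = most_frequent_optimum()
```

Each seed may return a different winner; the vote over runs trades extra computation for a result that depends less on any single random draw.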
8.2.5. Scalability
In practice, one main limitation of many existing HPO frameworks is that
they are tightly integrated with one or a couple of machine learning
libraries, like sklearn and Keras, which restricts them to working on a
single node rather than with large data volumes [3]. To tackle large
datasets, some distributed machine learning platforms, like Apache
SystemML [129] and Spark MLlib [130], have been developed; however, only
very few HPO frameworks support distributed ML. Therefore, more research
effort is needed, and scalable HPO frameworks, such as ones supporting
distributed ML platforms, should be developed to support more libraries.
On the other hand, future practical HPO algorithms should have the
scalability to efficiently optimize hyper-parameter spaces from small to
large sizes, irrespective of whether they contain continuous, discrete,
categorical, or conditional hyper-parameters.
8.2.6. Continuous Updating Capability
In practice, many datasets are not stationary and are constantly updated
by adding new data and deleting old data. Correspondingly, the optimal
hyper-parameter values or combinations may also change as the data
changes. Currently, developing HPO methods with the capacity to
continuously tune hyper-parameter values as the data changes has not drawn
much attention, since researchers and data analysts often do not alter an
ML model after achieving a currently optimal performance [3]. However,
since the optimal hyper-parameter values change with the data, proper
approaches should be proposed to achieve continuous updating capability.
9. Conclusion
Machine learning has become the primary strategy for tackling data-related
problems and has been widely used in various applications. To apply ML
models to practical problems, their hyper-parameters need to be tuned to
fit specific datasets. However, since the scale of the data produced in
real-life applications has greatly increased, and manually tuning
hyper-parameters is extremely costly, it has become crucial to optimize
hyper-parameters by an automatic process. In this survey paper, we have
comprehensively discussed the state-of-the-art research in the domain of
hyper-parameter optimization, as well as how to apply HPO techniques to
different ML models, through both theory and practical experiments. When
applying optimization methods to ML models, the hyper-parameter types of
an ML model are the main concern for HPO method selection. To summarize,
BOHB is the recommended choice for optimizing an ML model if randomly
selected subsets are highly representative of the given dataset, since it
can efficiently optimize all types of hyper-parameters; otherwise, BO
models are recommended for small hyper-parameter configuration spaces,
while PSO is usually the best choice for large configuration spaces.
Moreover, some existing useful HPO tools and frameworks, open challenges,
and potential research directions are also provided and highlighted for
practical use and future research purposes. We hope that our survey paper
serves as a useful resource for ML users, developers, data analysts, and
researchers to use and tune ML models with proper HPO techniques and
frameworks. We also hope that it helps to enhance understanding of the
challenges that still exist within the HPO domain, and thereby further
advances HPO and ML applications in future research.
References
[1] M.I. Jordan, T.M. Mitchell, Machine learning: Trends,
perspectives, and prospects, Science 349 (2015) 255-260.
https://doi.org/10.1126/science.aaa8415.
[2] M.-A. Zöller and M. F. Huber, Benchmark and Survey of Automated
Machine Learning Frameworks, arXiv preprint arXiv:1904.12054, (2019).
https://arxiv.org/abs/1904.12054.
[3] R. E. Shawi, M. Maher, S. Sakr, Automated machine learning:
State-of-the-art and open challenges, arXiv preprint arXiv:1906.02287,
(2019). http://arxiv.org/abs/1906.02287.
[4] M. Kuhn and K. Johnson, Applied Predictive Modeling, Springer (2013)
ISBN: 9781461468493.
[5] G.I. Diaz, A. Fokoue-Nkoutche, G. Nannicini, H. Samulowitz, An
effective algorithm for hyperparameter optimization of neural networks,
IBM J. Res. Dev. 61 (2017) 120. https://doi.org/10.1147/JRD.2017.2709578.
[6] F. Hutter, L. Kotthoff, and J. Vanschoren, Eds., Automatic Machine
Learning: Methods, Systems, Challenges, Springer (2019) ISBN:
9783030053185.
[7] N. Decastro-García, Á. L. Muñoz Castañeda, D. Escudero García, and
M. V. Carriegos, Effect of the Sampling of a Dataset in the
Hyperparameter Optimization Phase over the Efficiency of a Machine
Learning Algorithm, Complexity 2019 (2019).
https://doi.org/10.1155/2019/6278908.
[8] S. Abreu, Automated Architecture Design for Deep Neural Networks,
arXiv preprint arXiv:1908.10714, (2019).
http://arxiv.org/abs/1908.10714.
[9] O. S. Steinholtz, A Comparative Study of Black-box Optimization
Algorithms for Tuning of Hyper-parameters in Deep Neural Networks, M.S.
thesis, Dept. Elect. Eng., Luleå Univ. Technol., (2018).
[10] G. Luo, A review of automatic selection methods for machine
learning algorithms and hyper-parameter values, Netw. Model. Anal. Heal.
Informatics Bioinforma. 5 (2016) 116.
https://doi.org/10.1007/s13721-016-0125-6.
[11] D. Maclaurin, D. Duvenaud, R.P. Adams, Gradient-based
Hyper-parameter Optimization through Reversible Learning, arXiv preprint
arXiv:1502.03492, (2015). http://arxiv.org/abs/1502.03492.
[21] T.M. S. Bradley, A. Hax, Applied Mathematical Programming,
Addison-Wesley, Reading, Massachusetts. (1977).
[32] T. Chen, C. Guestrin, XGBoost: a scalable tree boosting system, arXiv
preprint arXiv:1603.02754, (2016). http://arxiv.org/abs/1603.02754.
[44] L. Yang, R. Muresan, A. Al-Dweik, L.J. Hadjileontiadis,
Image-Based Visibility Estimation Algorithm for Intelligent
Transportation Systems, IEEE Access. 6 (2018) 76728-76740.
https://doi.org/10.1109/ACCESS.2018.2884225.
[46] O.S. Soliman, A.S. Mahmoud, A classification system for remote sensing
satellite images using support vector machine with non-linear kernel
functions, 2012 8th Int. Conf. Informatics Syst. INFOS 2012. (2012)
BIO-181-BIO-187.
[47] I. Rish, An empirical study of the naive Bayes classifier, IJCAI 2001
Work. Empir. methods Artif. Intell., (2001), 41-46.
[51] J.D.M. Rennie, L. Shih, J. Teevan, D.R. Karger, Tackling the poor
assumptions of Naive Bayes text classifiers, Proc. Twent. Int. Conf. Mach.
Learn. ICML (2003), 616-623.
[53] S. Rasoul, L. David, A Survey of Decision Tree Classifier Methodology,
IEEE Trans. Syst. Man. Cybern. 21 (1991) 660-674.
[60] Y. Xia, C. Liu, Y.Y. Li, N. Liu, A boosted decision tree approach using
Bayesian hyper-parameter optimization for credit scoring, Expert Syst.
Appl. 78 (2017) 225-241. https://doi.org/10.1016/j.eswa.2017.02.017.
[62] A. Moubayed, E. Aqeeli, A. Shami, Ensemble-based Feature Selection
and Classification Model for DNS Typo-squatting Detection, in: 2020
IEEE Can. Conf. Electr. Comput. Eng., 2020.
[68] D. Han, Q. Liu, W. Fan, A new image classification method using CNN
transfer learning and web data augmentation, Expert Syst. Appl. 95
(2018) 43-56. https://doi.org/10.1016/j.eswa.2017.11.028.
[71] T. K. Moon, The expectation-maximization algorithm, IEEE Signal
Process. Mag. 13 (6) (1996) 47-60.
[72] S. Brahim-Belhouari, A. Bermak, M. Shi, P.C.H. Chan, Fast and
Robust gas identification system using an integrated gas sensor
technology and Gaussian mixture models, IEEE Sens. J. 5 (2005) 1433-1444.
https://doi.org/10.1109/JSEN.2005.858926.
[73] Z. Y., K. G., Hierarchical Clustering Algorithms for Document Dataset,
Data Min. Knowl. Discov. 10 (2005) 141-168.
[74] K. Khan, S.U. Rehman, K. Aziz, S. Fong, S. Sarasvady, A.
Vishwa, DBSCAN: Past, present and future, 5th Int. Conf.
Appl. Digit. Inf. Web Technol. ICADIWT 2014. (2014) 232-238.
https://doi.org/10.1109/ICADIWT.2014.6814687.
[75] H. Zhou, P. Wang, H. Li, Research on adaptive parameters
determination in DBSCAN algorithm, J. Inf. Comput. Sci. 9 (2012) 1967-1973.
[76] J. Shlens, A Tutorial on Principal Component Analysis, arXiv preprint
arXiv:1404.1100, (2014). https://arxiv.org/abs/1404.1100.
[77] N. Halko, P. Martinsson, J. Tropp, Finding structure with randomness:
probabilistic algorithms for constructing approximate matrix
decompositions, SIAM Rev. 53 (2) (2011), pp. 217-288.
[78] M. Loog, Conditional linear discriminant analysis,
Proc. - Int. Conf. Pattern Recognit. 2 (2006) 387-390.
https://doi.org/10.1109/ICPR.2006.402.
[79] P. Howland, J. Wang, H. Park, Solving the small sample size problem in
face recognition using generalized discriminant analysis, Pattern
Recognit. 39 (2006) 277-287. https://doi.org/10.1016/j.patcog.2005.06.013.
[80] I. Ilievski, T. Akhtar, J. Feng, C.A. Shoemaker, Efficient
hyperparameter optimization of deep learning algorithms using
deterministic RBF surrogates, 31st AAAI Conf. Artif. Intell. AAAI 2017.
(2017) 822-829.
[81] M.N. Injadat, A. Moubayed, A.B. Nassif, A. Shami,
Systematic Ensemble Model Selection Approach for Educational
Data Mining, Knowledge-Based Syst. 200 (2020) 105992.
https://doi.org/10.1016/j.knosys.2020.105992.
[82] M. Injadat, A. Moubayed, A.B. Nassif, A. Shami, Multi-split Optimized
Bagging Ensemble Model Selection for Multi-class Educational Data
Mining, Springers Appl. Intell. (2020).
[83] M. Claesen, J. Simm, D. Popovic, Y. Moreau, and B. De Moor, Easy
Hyperparameter Search Using Optunity, arXiv preprint arXiv:1412.1114,
(2014). https://arxiv.org/abs/1412.1114.
[84] C. Witt, Worst-case and average-case approximations by simple
randomized search heuristics, in: Proceedings of the 22nd Annual
Symposium on Theoretical Aspects of Computer Science, STACS'05, Stuttgart,
Germany, 2005, pp. 44-56.
[85] Y. Bengio, Gradient-based optimization of hyperparameters, Neural
Comput. 12 (8) (2000) 1889-1900.
[86] H. H. Yang and S. I. Amari, Complexity Issues in Natural Gradient
Descent Method for Training Multilayer Perceptrons, Neural Comput.
10 (8) (1998) 2137-2157.
[87] J. Snoek, H. Larochelle, R. Adams, Practical Bayesian optimization of
machine learning algorithms, Adv. Neural Inf. Process. Syst. 4 (2012),
2951-2959.
[88] E. Hazan, A. Klivans, and Y. Yuan, Hyperparameter optimization:
a spectral approach, arXiv preprint arXiv:1706.00764, (2017).
https://arxiv.org/abs/1706.00764.
[89] M. Seeger, Gaussian processes for machine learning, Int. J. Neural Syst.,
14 (2004), 69-106.
[90] F. Hutter, H. H. Hoos, and K. Leyton-Brown, Sequential model-based
optimization for general algorithm configuration, Proc. LION 5, (2011)
507-523.
[91] I. Dewancker, M. McCourt, S. Clark, Bayesian Optimization Primer,
(2015) URL:
https://sigopt.com/static/pdf/SigOpt_Bayesian_Optimization_Primer.pdf
[92] J. Hensman, N. Fusi, and N. D. Lawrence, Gaussian processes
for big data, arXiv preprint arXiv:1309.6835, (2013).
https://arxiv.org/abs/1309.6835.
[93] M. Claesen and B. De Moor, Hyperparameter Search in
Machine Learning, arXiv preprint arXiv:1502.02127, (2015).
https://arxiv.org/abs/1502.02127.
[103] Y. Shi, R.C. Eberhart, Parameter Selection in Particle Swarm
Optimization, Evolutionary Programming VII, Springer (1998) 591-600.
[112] M. Pumperla, Hyperas, 2019. http://maxpumperla.com/hyperas/.
[123] J. Rapin and O. Teytaud, Nevergrad - A gradient-free optimization
platform, 2018. https://GitHub.com/FacebookResearch/Nevergrad.
[125] C.M. Bishop, Neural Networks for Pattern Recognition, Oxford
University Press (1995).
Li Yang received the B.E. degree in computer science from Wuhan
University of Science and Technology, Wuhan, China in 2016 and the MASc
degree in Engineering from University of Guelph, Guelph, Canada, 2018.
Since 2018 he has been working toward the Ph.D. degree in the Department
of Electrical and Computer Engineering, Western University, London,
Canada. His research interests include cybersecurity, machine learning,
network data analytics, and time series data analytics.