Comparative Study of Bayesian Optimization Process For The Best Machine Learning Hyperparameters
1 Introduction
Hyperparameter tuning [8] evaluates candidate combinations of hyperparameter values in order to improve a machine learning model. It is difficult to choose the best hyperparameter values manually, yet the choice of hyperparameters affects the performance of the model; the performance of the learning model therefore depends on a good choice of hyperparameters. Important techniques for tuning hyperparameters include random search, grid search, particle swarm optimization, genetic algorithms, and Bayesian optimization [8] [9] [16] [17]. Bayesian optimization is one of the best techniques for tuning hyperparameters in machine learning models.
Bayesian optimization [4] [11] [13] is a method for optimizing objective functions that are costly to evaluate, in particular for finding the global maximum of such a function. It is based on a surrogate model: the Gaussian process, the random forest, or the Tree-structured Parzen estimator (TPE), which builds two density functions (a good density and a bad density) by splitting the observations into two groups [1]. In TPE, the ratio of these two densities must be maximized in order to maximize the expected improvement of the acquisition function, which yields the next hyperparameter configuration [1]. All three surrogate models are used to approximate the objective function; however, most of the time the Gaussian process is the one used in Bayesian optimization.
To illustrate this, we use the heart disease dataset. Heart disease is one of the serious diseases that threaten human life. Machine learning (ML) algorithms play a key role in predicting heart disease from different symptoms such as age, gender, etc. The main objective is to detect the disease at an early stage, when it can still be treated, in order to save lives and reduce the mortality rate due to heart disease. In this work, we applied Bayesian optimization to find the best hyperparameters, improving the accuracy of each algorithm, namely RF, SVM, KNN, and LR, since the performance of a learning model depends on a good choice of hyperparameters. Our results show that the highest accuracies in BO-GP and BO-TPE are obtained by LR and SVM, respectively.
2 Related Work
Bayesian optimization (BO) [4] [11] [13] [17] is an efficient method consisting of two essential components, a surrogate model and an acquisition function, which determine the next hyperparameter configuration to evaluate and thereby build an approximation of a costly objective function. The surrogate models are the Tree-structured Parzen estimator (TPE) [1], the random forest [15] [17], and the Gaussian process [4] [13] [10] [7] [8]. The acquisition functions are the expected improvement (EI) [2], the probability of improvement [11], and the upper confidence bound (UCB) [11]. The acquisition function most used in Bayesian optimization is the expected improvement [11]. However, [2] shows that there is an acquisition function better than EI, namely E³I, which balances exploitation and exploration in BO.
The concept of Bayesian optimization was introduced in [11] with two experiments. The first experiment determines the global maximum of an objective function f(x, y). The second compares Bayesian optimization with random search for tuning the SVM machine learning algorithm and shows no difference in performance between the two methods. Further works [6] [8] [9] [16] [17] show that Bayesian optimization is one of the most effective hyperparameter optimization techniques for tuning hyperparameters in machine learning models.
The article [9] presents a comparative study of three HPO methods: grid search, random search, and Bayesian optimization. The comparison seeks the method that obtains the highest accuracy in the shortest simulation time. The results of [9] show that Bayesian optimization is more efficient than the other methods.
This work [6] presents a comparative analysis of various hyperparameter tuning techniques, namely Grid Search, Random Search, Bayesian Optimization, Particle Swarm Optimization (PSO), and Genetic Algorithm (GA). They are used to optimize the accuracy of
six machine learning algorithms, namely Logistic Regression (LR), Ridge Classifier (RC), Support Vector Machine Classifier (SVC), Decision Tree (DT), Random Forest (RF), and Naive Bayes (NB) classifiers. These algorithms are applied to an Arabic sentiment classification problem. The results of [6] show that, comparing each machine learning algorithm before and after hyperparameter tuning, the highest accuracy is given by SVC both before and after tuning, with the highest scores obtained when using Bayesian optimization.
where
• µ and σ are the mean and the standard deviation,
• Φ is the cumulative distribution function (CDF):

Φ(z) = ∫_{−∞}^{z} φ(t) dt,

with φ(z) = (1/√(2π)) exp(−z²/2) the probability density function (PDF) of the standard normal distribution N(0, 1).
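To make the role of Φ and φ concrete: the standard closed form of the expected improvement for maximization is EI(x) = (µ − f*) Φ(z) + σ φ(z) with z = (µ − f*)/σ, where f* is the best value observed so far. A minimal Python sketch (the variable names are our own; scipy.stats.norm provides Φ and φ):

```python
import numpy as np
from scipy.stats import norm

def expected_improvement(mu, sigma, f_best):
    # Standard EI for maximization: (mu - f_best) * CDF(z) + sigma * PDF(z).
    sigma = np.maximum(sigma, 1e-12)        # guard against division by zero
    z = (mu - f_best) / sigma
    return (mu - f_best) * norm.cdf(z) + sigma * norm.pdf(z)
```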
5 Optimization process
The Gaussian process [17] [11] is the surrogate model most commonly used in Bayesian optimization to approximate the objective function f : X → R, where X is a finite set of N points and the values of the objective function f = [f(x₁), . . . , f(x_N)] [11] are distributed according to a multivariate Gaussian distribution. Thus the Gaussian process is given by [9] [12] [14]:
f ∼ GP(µ(x), K(x, x′))
where µ is a mean vector and K is a covariance matrix. Predictions follow a normal distribution [17]:

P(y|x, D) = N(y|µ̃, σ̃²)

where D is the configuration space of the hyperparameters and y = f(x) is the result of evaluating each hyperparameter value x [17]. We assume µ(x) = 0, so the new mean and variance are [14]:

µ̃ = K(x)ᵀ K⁻¹ y,

σ̃² = K(x, x) − K(x)ᵀ K⁻¹ K(x).
This new mean and variance will be used in the acquisition function to find the next evaluation point of the true objective function f.
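The posterior update above is a few lines of linear algebra. A minimal sketch in Python, assuming a squared-exponential kernel with unit length scale and the zero prior mean µ(x) = 0 used in the text (the kernel choice and the noise term are our own illustrative assumptions):

```python
import numpy as np

def rbf_kernel(A, B, length_scale=1.0):
    # Squared-exponential covariance: k(a, b) = exp(-||a - b||^2 / (2 l^2)).
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-d2 / (2 * length_scale**2))

def gp_posterior(X_obs, y_obs, x_new, noise=1e-8):
    # mu~ = K(x)^T K^{-1} y   and   sigma~^2 = K(x, x) - K(x)^T K^{-1} K(x).
    K = rbf_kernel(X_obs, X_obs) + noise * np.eye(len(X_obs))  # jitter for stability
    k_x = rbf_kernel(X_obs, x_new)                             # K(x), shape (N, 1)
    mu = k_x.T @ np.linalg.solve(K, y_obs)
    var = rbf_kernel(x_new, x_new) - k_x.T @ np.linalg.solve(K, k_x)
    return mu.item(), var.item()

# Usage: X_obs is an (N, d) array of evaluated configurations, y_obs their scores,
# and x_new a (1, d) candidate; the returned mean and variance feed the acquisition.
```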
Bayesian optimization can also use a random forest (RF) as surrogate model; this approach is called sequential model-based algorithm configuration (SMAC) [17]. It assumes a Gaussian model N(y|µ̃, σ̃²), where µ̃ and σ̃² are the mean and variance of the regression functions r(x) of the forest B, respectively [5] [15] [17]:

µ̃ = (1/|B|) Σ_{r∈B} r(x),

σ̃² = (1/(|B|−1)) Σ_{r∈B} (r(x) − µ̃)²
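These two statistics are just the empirical mean and (unbiased) variance of the individual tree predictions. A minimal sketch with scikit-learn's RandomForestRegressor, whose fitted trees are exposed via estimators_ (the forest size in the usage note is an illustrative choice):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def rf_surrogate_stats(forest, x):
    # mu~ = (1/|B|) sum_{r in B} r(x),  sigma~^2 = (1/(|B|-1)) sum (r(x) - mu~)^2
    preds = np.array([tree.predict(x)[0] for tree in forest.estimators_])
    return preds.mean(), preds.var(ddof=1)   # ddof=1 gives the 1/(|B|-1) factor

# Usage: fit on the observed (configuration, score) pairs, then query a candidate.
# forest = RandomForestRegressor(n_estimators=50).fit(X_obs, y_obs)
# mu, var = rf_surrogate_stats(forest, x_new.reshape(1, -1))
```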
The Tree-structured Parzen estimator (TPE) is another common surrogate model for Bayesian optimization [17]. It creates a model by applying Bayes' rule to calculate p(y|x):

P(y|x) = P(x|y) P(y) / P(x)
But this method takes a different approach: whereas Bayesian optimization tries to determine p(y|x) [1], the Tree-structured Parzen estimator models p(x|y) and p(y), i.e., TPE does not model p(y|x) directly but rather p(x|y) and p(y), and the likelihood is defined as follows [1]:

P(x|y) = l(x) if y < y*,
         g(x) if y ≥ y*
where l(x) is the probability density function formed from the observed variables x whose objective function value is less than the threshold y*; l(x) thus models the density of the best observations. g(x) is the density function formed from the remaining observations, those whose objective function value is greater than or equal to the threshold y*; g(x) thus models the density of the bad observations [1]. TPE uses the following expected improvement [1]:
EI_{y*}(x) = (γ y* l(x) − l(x) ∫_{−∞}^{y*} p(y) dy) / (γ l(x) + (1 − γ) g(x)) ∝ (γ + (g(x)/l(x))(1 − γ))⁻¹
The expected improvement is therefore proportional to the ratio l(x)/g(x). The Tree-structured Parzen estimator works by drawing x values from l(x) alone, i.e., only from the x values whose scores lie below the threshold, and not from g(x); hence, to maximize the expected improvement, we must maximize this ratio [1].
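A minimal sketch of this selection step for a single hyperparameter, using Gaussian kernel-density estimates for l(x) and g(x) (scipy's gaussian_kde; the quantile γ and the number of candidates are our own illustrative choices):

```python
import numpy as np
from scipy.stats import gaussian_kde

def tpe_suggest(x_obs, y_obs, gamma=0.25, n_candidates=100, seed=0):
    # Split the observations at the gamma-quantile threshold y* (minimization);
    # assumes enough observations on both sides of the split to fit a KDE.
    y_star = np.quantile(y_obs, gamma)
    l = gaussian_kde(x_obs[y_obs < y_star])    # density of the best observations
    g = gaussian_kde(x_obs[y_obs >= y_star])   # density of the bad observations
    # Draw candidates from l(x) only, then keep the one maximizing l(x)/g(x).
    cands = l.resample(n_candidates, seed=seed)[0]
    ratio = l(cands) / np.maximum(g(cands), 1e-12)
    return cands[np.argmax(ratio)]
```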
6.1 Results
A comparative analysis between BO-GP and BO-TPE is used to determine which process gives the highest accuracy for the different ML algorithms RF, KNN, SVM, and LR, applied to heart disease prediction:
Fig. 1: Accuracy.
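As an illustration of how such a comparison can be run (a sketch, not the authors' exact setup: we assume scikit-optimize's gp_minimize for BO-GP and hyperopt's TPE for BO-TPE, tuning an SVM; a bundled scikit-learn dataset stands in for the heart disease data, and the search-space bounds are illustrative):

```python
import numpy as np
from sklearn.datasets import load_breast_cancer   # stand-in for the heart data
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC
from skopt import gp_minimize                      # BO with a GP surrogate
from skopt.space import Real
from hyperopt import fmin, tpe, hp                 # BO with a TPE surrogate

X, y = load_breast_cancer(return_X_y=True)

def objective(C, gamma):
    # Negative mean 5-fold CV accuracy (both libraries minimize).
    return -cross_val_score(SVC(C=C, gamma=gamma), X, y,
                            cv=5, scoring="accuracy").mean()

# BO-GP: scikit-optimize with a Gaussian-process surrogate.
gp_result = gp_minimize(
    lambda p: objective(*p),
    dimensions=[Real(1e-2, 1e2, prior="log-uniform", name="C"),
                Real(1e-4, 1e0, prior="log-uniform", name="gamma")],
    n_calls=30, random_state=0)
print("BO-GP best accuracy:", -gp_result.fun)

# BO-TPE: hyperopt with the Tree-structured Parzen estimator.
best = fmin(
    fn=lambda p: objective(p["C"], p["gamma"]),
    space={"C": hp.loguniform("C", np.log(1e-2), np.log(1e2)),
           "gamma": hp.loguniform("gamma", np.log(1e-4), np.log(1e0))},
    algo=tpe.suggest, max_evals=30)
print("BO-TPE best hyperparameters:", best)
```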
The machine learning algorithms RF, SVM, KNN, and LR are first evaluated without tuning. According to Table 1.1, the highest accuracy is that of LR with a score of 87.91%, while the lowest accuracy, obtained for SVM, is 57.14%. BO-GP shows that the highest accuracy, 89.01%, is obtained by LR compared to the other learning algorithms. The BO-TPE results show that the highest accuracy comes from SVM with a score of 89.01%, and the lowest accuracy comes from KNN with a score of 67.03%. The accuracy of LR in BO-GP and the accuracy of SVM in BO-TPE are thus larger than the corresponding accuracies of Table 1.1. From these results we deduce that hyperparameter tuning finds the best hyperparameters, which helped to improve the accuracy of each learning model. The following tables show the performance ranking of BO-GP and BO-TPE:
7 Conclusion
The highest accuracies we obtained with BO-GP and BO-TPE are, respectively, 89.01% for LR and 89.01% for SVM. Hyperparameter tuning thus finds the best hyperparameters, which improve the accuracy of each algorithm. Our study shows that the performance of machine learning models depends on the right choice of hyperparameters.
References
1. Bergstra, J., Bardenet, R., Bengio, Y., and Kégl, B. Algorithms for hyper-parameter optimization. Advances in neural information processing systems 24 (2011).
2. Berk, J., Nguyen, V., Gupta, S., Rana, S., and Venkatesh, S. Exploration enhanced expected improvement for bayesian optimization. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases (2018), Springer, pp. 621–637.
3. Bodin, E., Kaiser, M., Kazlauskaite, I., Dai, Z., Campbell, N., and Ek, C. H. Modulating
surrogates for bayesian optimization. In International Conference on Machine Learning (2020),
PMLR, pp. 970–979.
4. Brochu, E., Cora, V. M., and De Freitas, N. A tutorial on bayesian optimization of expensive cost functions, with application to active user modeling and hierarchical reinforcement learning. arXiv preprint arXiv:1012.2599 (2010).
5. Dewancker, I., McCourt, M., and Clark, S. Bayesian optimization for machine learning: A practical guidebook. arXiv preprint arXiv:1612.04858 (2016).
6. Elgeldawi, E., Sayed, A., Galal, A. R., and Zaki, A. M. Hyperparameter tuning for machine learning algorithms used for arabic sentiment analysis. In Informatics (2021), vol. 8, Multidisciplinary Digital Publishing Institute, p. 79.
7. Hoffman, M., Brochu, E., De Freitas, N., et al. Portfolio allocation for bayesian optimization.
In UAI (2011), Citeseer, pp. 327–336.
8. Joy, T. T., Rana, S., Gupta, S., and Venkatesh, S. Hyperparameter tuning for big data using
bayesian optimisation. In 2016 23rd International Conference on Pattern Recognition (ICPR)
(2016), IEEE, pp. 2574–2579.
9. Kim, H.-C., and Kang, M.-J. Comparison of hyper-parameter optimization methods for deep
neural networks. Journal of IKEEE 24, 4 (2020), 969–974.
10. Li, D., and Kanoulas, E. Bayesian optimization for optimizing retrieval systems. In Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining (2018), pp. 360–368.
11. Matosevic, A. On bayesian optimization and its application to hyperparameter tuning, 2018.
12. Nguyen, V., Gupta, S., Rana, S., Li, C., and Venkatesh, S. Regret for expected improvement over the best-observed value and stopping condition. In Asian Conference on Machine Learning (2017), PMLR, pp. 279–294.
13. Nomura, M., and Abe, K. A simple heuristic for bayesian optimization with a low budget. arXiv preprint arXiv:1911.07790 (2019).
14. Rasmussen, C. E., and Nickisch, H. Gaussian processes for machine learning (gpml) toolbox.
The Journal of Machine Learning Research 11 (2010), 3011–3015.
15. van Hoof, J., and Vanschoren, J. Hyperboost: Hyperparameter optimization by gradient boosting surrogate models. arXiv preprint arXiv:2101.02289 (2021).
16. Wu, J., Toscano-Palmerin, S., Frazier, P. I., and Wilson, A. G. Practical multifidelity
bayesian optimization for hyperparameter tuning. In Uncertainty in Artificial Intelligence
(2020), PMLR, pp. 788–798.
17. Yang, L., and Shami, A. On hyperparameter optimization of machine learning algorithms: Theory and practice. Neurocomputing 415 (2020), 295–316.