
Comparative study of Bayesian optimization process for the best machine learning hyperparameters

Fatima FATIH1, Zakariae EN-NAIMANI2, and Khalid HADDOUCH3

1 Laboratory LISA, ENSA, University of Sidi Mohamed Ben Abdellah, Fez, Morocco
2 Laboratory SSDIA, ENSET, University of Hassan II Casablanca, Mohammedia, Morocco

Abstract. Bayesian optimization is an important algorithm built on two essential
components, namely the surrogate model and the acquisition function, which are
used to approximate an unknown objective function. Here it serves as a hyper-
parameter tuning technique for four machine learning algorithms in order to in-
crease their performance. In this work, we applied Bayesian optimization to choose
the best hyperparameters for a set of ML algorithms, namely RF, SVM, KNN and
LR, using a heart disease dataset. In this context, we obtained the best hyper-
parameters and the corresponding accuracy for each machine learning algorithm
optimized by BO-GP and BO-TPE. The results demonstrate that the highest ac-
curacies in BO-GP and BO-TPE are 89.01% for LR and 89.01% for SVM, re-
spectively. Hyperparameter tuning thus finds the best hyperparameters, which
improves the accuracy of each algorithm.

Keywords: Bayesian optimization · Machine learning · Gaussian process · Tree-structured Parzen estimator · Hyperparameter optimization.

1 Introduction

Hyperparameter tuning [8] is used to test combinations of hyperparameters, often
chosen at random, to improve machine learning models. It is difficult to choose the
best hyperparameter values manually, yet this choice strongly affects model perfor-
mance: the performance of a learning model depends on a good choice of hyper-
parameters. Several important techniques exist for tuning hyperparameters, including
random search, grid search, particle swarm optimization, genetic algorithms and
Bayesian optimization [8] [9] [16] [17]. Bayesian optimization is one of the best
techniques for tuning hyperparameters in machine learning models.

Bayesian optimization [4] [11] [13] is a method used to optimize objective functions
that are costly to evaluate and to find the global maximum of such a function. It is based
on a surrogate such as the Gaussian process, the random forest, or the tree-structured
Parzen estimator (TPE); the latter builds two density functions (a good density and a
bad density) by splitting the observations into two groups [1]. In TPE, the ratio of these
densities must be maximized in order to maximize the expected improvement of the
acquisition function, which yields the new hyperparameter configuration [1]. All three
surrogate models are used to approximate the objective function. However, most of the
time the Gaussian process is the surrogate used in Bayesian optimization.

To illustrate this, we use the heart disease dataset. Heart disease is one of the
serious diseases that threaten human life. Machine learning algorithms play a key
role in predicting heart disease from different attributes such as age, gender, etc. The
main objective is to detect the disease in its early stages, when it can still be treated
and lives saved, in order to reduce mortality from heart disease. In this work, we
applied Bayesian optimization to find the best hyperparameters, improving the
accuracy of each algorithm, namely RF, SVM, KNN and LR, since the performance
of the learning models depends on good hyperparameters. Our results show that the
highest accuracies in BO-GP and BO-TPE are obtained by LR and SVM, respectively.

This paper is structured as follows. We present related Bayesian optimization
work in Section 2. Section 3 illustrates the two components of Bayesian optimization.
The steps by which Bayesian optimization finds the global maximum of an objective
function that is costly to evaluate are presented in Section 4. The optimization process,
which covers three surrogate functions, is explained in Section 5. In Section 6, we
present experimental results. Finally, we end with a conclusion.

2 Related Work
Bayesian optimization (BO) [4] [11] [13] [17] is an efficient method consisting of two
essential components, namely the surrogate model and the acquisition function, which
determine the next hyperparameter configurations and thereby build an approximation
of a costly-to-evaluate objective function. The surrogate models are: the tree-structured
Parzen estimator (TPE) [1], the random forest [15] [17], and the Gaussian process [4]
[13] [10] [7] [8]. The acquisition functions are the expected improvement (EI) [2], the
probability of improvement [11] and the upper confidence bound (UCB) [11]. The most
used acquisition function in Bayesian optimization is the expected improvement [11].
However, [2] shows that there is a better acquisition function than EI, namely E3I,
which balances exploitation and exploration in BO.

The concept of Bayesian optimization was introduced in [11] with two experiments.
The first experiment determines the global maximum of an objective function f(x, y).
The second experiment compares Bayesian optimization and random search on the
SVM machine learning algorithm, and shows no difference in the performance of these
two methods. Further works such as [6] [8] [9] [16] [17] show that Bayesian optimization
is one of the most effective hyperparameter optimization techniques for tuning
hyperparameters in machine learning models.

The article [9] presents a comparative study between three HPO methods: grid
search, random search and Bayesian optimization. The comparison seeks the method
that obtains the highest accuracy in the shortest simulation time. The results of [9]
show that Bayesian optimization is more efficient than the other methods.

The work [6] presents a comparative analysis of various hyperparameter tuning tech-
niques, namely Grid Search, Random Search, Bayesian Optimization, Particle Swarm
Optimization (PSO), and Genetic Algorithm (GA). They are used to optimize the accu-
racy of six machine learning algorithms, namely Logistic Regression (LR), Ridge Clas-
sifier (RC), Support Vector Machine Classifier (SVC), Decision Tree (DT), Random
Forest (RF), and Naive Bayes (NB) classifiers, applied to a sentiment classification
problem. The results of [6] show that, comparing each machine learning algorithm
before and after hyperparameter tuning, the highest accuracy was given by SVC both
before and after tuning, with the highest scores obtained when using Bayesian
optimization.

We have seen in [17] a comparative study of eight different hyperparameter
optimization methods implemented on three machine learning models (KNN, RF,
SVM) for classification and regression. First, it compares accuracy and computation
time (CT) for classification problems evaluated on the MNIST dataset. Second, it
compares MSE and computation time for a regression problem evaluated on the
Boston housing dataset. This paper shows that using the default hyperparameter
settings does not give the best model performance, so it is important to use HPO
methods to determine the best hyperparameters.

3 The components of Bayesian optimization


Bayesian optimization uses the following two important components:

3.1 Surrogate functions

The surrogate model [11] is a probabilistic model that gives a representation of the
objective function that is expensive to evaluate. We will see in Section 5 three surrogate
models, namely the Gaussian process (GP), the random forest (RF), and the tree-
structured Parzen estimator (TPE). Most of the time, however, the GP [3] is the tool
used in Bayesian optimization. These surrogate models are used to approximate the
unknown objective function and to search for its global optimum.

3.2 Acquisition functions

The acquisition function is an essential component of Bayesian optimization. Mathe-
matically, the point that maximizes the acquisition function is proposed as the sampling
point for the next iteration. The most commonly used acquisition functions in Bayesian
optimization are the following (a numerical sketch of all three is given after the list):
1. Probability of Improvement (PI)

We can define the improvement I(x) as follows:

I(x) = \max(f(x) - f(x^*),\, 0) =
\begin{cases}
f(x) - f(x^*) & \text{if } f(x) > f(x^*) \\
0 & \text{otherwise}
\end{cases}

The probability of improvement is then defined as:

PI(x) = \mathbb{P}[I(x) > 0] = \mathbb{P}[f(x) > f(x^*)]
      = \Phi\left(\frac{\mu(x) - f(x^*)}{\sigma(x)}\right)
4 F. Fatih et al.

where
• µ(x) and σ(x) are the posterior mean and standard deviation,
• Φ is the cumulative distribution function (CDF):

\Phi(z) = \int_{-\infty}^{z} \phi(t)\, dt

with \phi(z) = \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{z^2}{2}\right) the probability
density function (PDF) of the normal distribution N(0, 1).

2. Expected Improvement (EI)

The expected improvement is defined as follows:

EI(x) =
\begin{cases}
(\mu(x) - f(x^*))\, \Phi\left(\frac{\mu(x) - f(x^*)}{\sigma(x)}\right)
  + \sigma(x)\, \phi\left(\frac{\mu(x) - f(x^*)}{\sigma(x)}\right) & \text{if } \sigma(x) > 0, \\
0 & \text{if } \sigma(x) = 0,
\end{cases}

where Φ and φ are the cumulative distribution function (CDF) and the probability
density function (PDF), respectively.

3. The Upper Confidence Bound (UCB)

The upper confidence bound is defined as [11]:

UCB(x) = \mu(x) + \beta\, \sigma(x)

where β > 0 is a user-selected parameter that balances exploration and
exploitation [11].
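The three acquisition functions above reduce to a few lines of code once a surrogate supplies a posterior mean and standard deviation. Below is a minimal numerical sketch (not from the paper), assuming mu and sigma are numpy arrays over candidate points and f_best is the best value observed so far:

import numpy as np
from scipy.stats import norm

def probability_of_improvement(mu, sigma, f_best):
    # PI(x) = Phi((mu(x) - f(x*)) / sigma(x))
    return norm.cdf((mu - f_best) / sigma)

def expected_improvement(mu, sigma, f_best):
    # EI(x) = (mu - f*) Phi(z) + sigma phi(z), with z = (mu - f*) / sigma;
    # EI is defined as 0 where sigma == 0.
    ei = np.zeros_like(mu)
    mask = sigma > 0
    z = (mu[mask] - f_best) / sigma[mask]
    ei[mask] = (mu[mask] - f_best) * norm.cdf(z) + sigma[mask] * norm.pdf(z)
    return ei

def upper_confidence_bound(mu, sigma, beta=2.0):
    # UCB(x) = mu(x) + beta * sigma(x); beta trades exploration vs. exploitation.
    return mu + beta * sigma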

4 Bayesian optimization steps

To find the new hyperparameters that approximate an unknown objective function f,
Bayesian optimization proceeds through the following steps [17] (a minimal sketch of
the loop is given after the list):
1. Build a surrogate model of the objective function; the Gaussian process is used
almost all the time to approximate the true objective function.
2. Find the optimal hyperparameter values on the surrogate model. In this step, the
acquisition function is used to choose the next hyperparameters: the configuration
that maximizes the acquisition function is chosen as the next sample point.
3. Evaluate the true objective function at the new hyperparameter configuration
obtained in step 2 and record the score.
4. Update the surrogate probability model with the new result. In this step, the
surrogate is refitted to determine the mean and variance at each configuration for
the next iteration.
5. Repeat steps 2 through 4 until the maximum number of iterations is reached.
Finally, we obtain an approximation of the real objective function that allows us to find
the global maximum from the previously evaluated samples.
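Here is a minimal sketch of these five steps for a toy 1-D maximization problem, assuming a scikit-learn Gaussian process surrogate and the expected_improvement helper sketched in Section 3.2; the toy objective f stands in for an expensive black-box function:

import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

def f(x):                        # true objective (unknown in practice)
    return -(x - 2.0) ** 2

X = np.array([[0.0], [4.0]])     # initial samples
y = f(X).ravel()
candidates = np.linspace(-5, 5, 1000).reshape(-1, 1)

for _ in range(20):                                          # step 5: repeat
    gp = GaussianProcessRegressor(kernel=Matern()).fit(X, y)  # steps 1 and 4
    mu, sigma = gp.predict(candidates, return_std=True)
    acq = expected_improvement(mu, sigma, y.max())           # step 2
    x_next = candidates[np.argmax(acq)].reshape(1, -1)
    y_next = f(x_next).ravel()                               # step 3
    X, y = np.vstack([X, x_next]), np.append(y, y_next)

print("approximate maximizer:", X[np.argmax(y)])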

5 Optimization process

Bayesian optimization relies on one of the three following surrogate models:

5.1 Bayesian optimization - Gaussian process (BO-GP)

The Gaussian process [17] [11] is the surrogate model most commonly used in Bayesian
optimization to approximate the objective function f : X → R, where X is a finite set
of N points and the values of the objective function f = [f(x_1), ..., f(x_N)] [11] are
distributed according to a multivariate Gaussian distribution. Thus the Gaussian process
is given by [9] [12] [14]:

f \sim \mathcal{GP}(\mu(x), K(x, x'))

where µ is a mean vector and K is a covariance matrix. Predictions follow a normal
distribution [17]:

P(y \mid x, D) = \mathcal{N}(y \mid \tilde{\mu}, \tilde{\sigma}^2)

where D is the configuration space of the hyperparameters and y = f(x) is the result
of evaluating each hyperparameter value x [17]. Assuming µ(x) = 0, the new mean
and variance are [14]:

\tilde{\mu} = K(x)^T K^{-1} y,
\tilde{\sigma}^2 = K(x, x) - K(x)^T K^{-1} K(x).

This new mean and variance are used in the acquisition function to find the
next evaluation point of the true objective function f.
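As an illustration, the following sketch transcribes the posterior formulas above directly into numpy; the kernel function, the data names and the query interface are assumptions for the example, not an implementation from the paper:

import numpy as np

def gp_posterior(x, X_obs, y_obs, kernel):
    # kernel(a, b) is assumed to return the covariance matrix between a and b
    K = kernel(X_obs, X_obs)                   # covariance of the observations
    k_x = kernel(X_obs, x)                     # covariance between x and data
    K_inv = np.linalg.inv(K)                   # in practice, use a Cholesky solve
    mu = k_x.T @ K_inv @ y_obs                 # mu~ = K(x)^T K^{-1} y
    var = kernel(x, x) - k_x.T @ K_inv @ k_x   # sigma~^2 = K(x,x) - K(x)^T K^{-1} K(x)
    return mu, var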

5.2 Sequential model-based algorithm configuration (SMAC)

Bayesian optimization using RF as a surrogate model is also called sequential model-
based algorithm configuration (SMAC) [17]. It assumes a Gaussian model
N(y | µ̃, σ̃²), where µ̃ and σ̃² are the mean and variance of the regression functions
r(x), respectively [5] [15] [17]:

\tilde{\mu} = \frac{1}{|B|} \sum_{r \in B} r(x)

\tilde{\sigma}^2 = \frac{1}{|B| - 1} \sum_{r \in B} (r(x) - \tilde{\mu})^2

where B is the set of regression trees in the forest.
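These two statistics can be computed directly from the per-tree predictions of a fitted scikit-learn random forest; the following sketch is illustrative, with names and settings assumed for the example:

import numpy as np
from sklearn.ensemble import RandomForestRegressor

def rf_surrogate(X_obs, y_obs, x_query):
    forest = RandomForestRegressor(n_estimators=100).fit(X_obs, y_obs)
    # r(x) for each regression tree in the set B (x_query must be 2-D)
    preds = np.array([tree.predict(x_query) for tree in forest.estimators_])
    mu = preds.mean(axis=0)            # mu~  = (1/|B|) sum_{r in B} r(x)
    var = preds.var(axis=0, ddof=1)    # sigma~^2 with the 1/(|B|-1) factor
    return mu, var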

5.3 Tree-structured Parzen estimator (TPE)

The tree-structured Parzen estimator (TPE) is another common surrogate model for
Bayesian optimization [17]. It builds a model by applying Bayes' rule to calculate
p(y|x):

P(y \mid x) = \frac{P(x \mid y)\, P(y)}{P(x)}

This method takes a different approach: whereas GP-based Bayesian optimization tries
to model p(y|x) directly [1], the tree-structured Parzen estimator models p(x|y) and
p(y) instead, and the likelihood is defined as follows [1]:

P(x \mid y) =
\begin{cases}
l(x) & \text{if } y < y^* \\
g(x) & \text{if } y \geq y^*
\end{cases}

where l(x) is the probability density function formed from the observed variables x
whose objective function value is less than the threshold y*; l(x) thus models the
density of the best observations. g(x) is the density function built from the remaining
observations, whose objective function value is greater than the threshold y*; g(x)
thus models the density of the bad observations [1]. TPE uses the following expected
improvement [1]:

EI_{y^*}(x) = \frac{\gamma\, y^*\, l(x) - l(x) \int_{-\infty}^{y^*} P(y)\, dy}
                   {\gamma\, l(x) + (1 - \gamma)\, g(x)}
            \propto \left( \gamma + \frac{g(x)}{l(x)} (1 - \gamma) \right)^{-1}

The expected improvement is therefore proportional to the ratio l(x)/g(x). The tree-
structured Parzen estimator works by drawing x values from l(x), i.e. only from x
values that give scores below the threshold, and not from g(x); to maximize the expected
improvement, we must maximize this ratio [1].
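In practice, TPE is rarely implemented by hand; the hyperopt library implements the estimator described above. A minimal usage sketch follows, where the search space is illustrative and the analytic objective is a stand-in for a real (negated) validation accuracy:

from hyperopt import fmin, tpe, hp, Trials

space = {
    "C": hp.loguniform("C", -3, 3),               # e.g. an SVM-style range
    "kernel": hp.choice("kernel", ["linear", "rbf"]),
}

def objective(params):
    # Illustrative analytic stand-in: in practice this would train a model
    # with `params` and return -accuracy (hyperopt minimizes its objective).
    penalty = 0.0 if params["kernel"] == "linear" else 0.1
    return (params["C"] - 1.0) ** 2 + penalty

trials = Trials()
best = fmin(fn=objective, space=space, algo=tpe.suggest,
            max_evals=50, trials=trials)
print(best)   # note: hp.choice reports the chosen index, not the label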

6 Experiments and results

The heart disease dataset, downloaded from Kaggle, contains 76 attributes, but all
published experiments use a subset of 14 of them. The main features are detailed as
follows:
1. age: age of the patient,
2. sex: sex of the patient,
3. exang: exercise-induced angina (1 = yes, 0 = no),
4. ca: number of major vessels (0-3),
5. cp: chest pain type (1: typical angina, 2: atypical angina, 3: non-anginal pain, 4:
asymptomatic),
6. trtbps: resting blood pressure (in mm Hg),
7. chol: cholesterol in mg/dl fetched via BMI sensor,
8. fbs: fasting blood sugar > 120 mg/dl (1 = true, 0 = false),
9. rest-ecg: resting electrocardiographic results (0: normal, 1: having ST-T wave ab-
normality (T wave inversions and/or ST elevation or depression of > 0.05 mV), 2:
showing probable or definite left ventricular hypertrophy by Estes' criteria),
10. thalach: maximum heart rate achieved,
11. target: 0 = less chance of heart attack, 1 = more chance of heart attack.
Based on this dataset, we categorized patients with 1 indicating the presence of heart
disease and 0 indicating its absence. We then used 70% of the data for training
and 30% for testing the obtained model.
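A minimal sketch of this setup follows, assuming the Kaggle CSV is stored locally as heart.csv with a target column (the file name is an assumption):

import pandas as pd
from sklearn.model_selection import train_test_split

data = pd.read_csv("heart.csv")
X = data.drop(columns=["target"])
y = data["target"]                          # 1 = heart disease, 0 = no disease
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)   # 70% train / 30% test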

6.1 Results

A comparative analysis between BO-GP and BO-TPE is used to determine which
process gives the highest accuracy for the different ML algorithms (RF, KNN, SVM
and LR) applied to heart disease prediction:

Table 1: The accuracy of ML algorithms without using BO.

ML   | Accuracy | Time (s)
RF   | 0.8461   | 0.2447
KNN  | 0.6593   | 0.0210
SVM  | 0.5714   | 0.0230
LR   | 0.8791   | 0.0601

Fig. 1: Accuracy.
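A baseline of this kind can be sketched by evaluating each classifier with its default hyperparameters on the held-out split from the previous snippet; exact scores will vary with the split and library versions:

from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

models = {
    "RF": RandomForestClassifier(),
    "KNN": KNeighborsClassifier(),
    "SVM": SVC(),
    "LR": LogisticRegression(max_iter=1000),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    acc = accuracy_score(y_test, model.predict(X_test))
    print(f"{name}: {acc:.4f}")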

Table 2: Evaluation of BO-GP performance.

ML  | Best hyperparameters                                                                                                 | Precision | Recall | F1-score | Accuracy | Time (s)
RF  | criterion='entropy', max_depth=61, max_features='sqrt', min_samples_leaf=11, min_samples_split=14, n_estimators=205  | 0.8       | 0.9756 | 0.8791   | 0.8791   | 80.36
KNN | n_neighbors=17                                                                                                       | 0.8       | 0.9756 | 0.8791   | 0.8791   | 15.34
SVM | C=13.273723936150628, kernel='linear'                                                                                | 0.8       | 0.9756 | 0.8791   | 0.8791   | 169.41
LR  | C=3.131333053846714, penalty='l2', solver='liblinear'                                                                | 0.8163    | 0.9756 | 0.8888   | 0.8901   | 25.87
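A BO-GP search in the spirit of Table 2 can be sketched with scikit-optimize's BayesSearchCV, whose default surrogate is a Gaussian process; the search space below mirrors the table's SVM row, while the ranges and iteration budget are illustrative, and X_train, y_train come from the earlier snippets:

from sklearn.svm import SVC
from skopt import BayesSearchCV
from skopt.space import Real, Categorical

opt = BayesSearchCV(
    SVC(),
    {"C": Real(1e-3, 1e3, prior="log-uniform"),
     "kernel": Categorical(["linear", "rbf"])},
    n_iter=30, cv=5, random_state=42)
opt.fit(X_train, y_train)
print(opt.best_params_, opt.score(X_test, y_test))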

[Fig. 2: Accuracy. Fig. 3: Precision. Fig. 4: Recall. Fig. 5: F1-score.]



Table 3: Evaluation of BO-TPE performance.

ML  | Best hyperparameters                                                                                                         | Precision | Recall | F1-score | Accuracy | Time (s)
RF  | criterion='gini', max_depth=80.0, max_features='auto', min_samples_leaf=0.1232, min_samples_split=0.2559, n_estimators=210   | 0.7692    | 0.9756 | 0.8602   | 0.85714  | 53.283
KNN | n_neighbors=24.0                                                                                                             | 0.6078    | 0.7560 | 0.6739   | 0.6703   | 0.5642
SVM | C=6.609156, kernel='linear'                                                                                                  | 0.8297    | 0.9512 | 0.8863   | 0.8901   | 42.917
LR  | C=1.1082953781569425, penalty='l1', solver='liblinear'                                                                       | 0.8       | 0.9756 | 0.8791   | 0.8791   | 0.5868

[Fig. 6: Accuracy. Fig. 7: Precision. Fig. 8: Recall. Fig. 9: F1-score.]



6.2 Discussion of results and comparisons

The machine learning algorithms RF, SVM, KNN and LR are first evaluated without
tuning. According to Table 1, the highest accuracy is that of LR with a score of 87.91%,
while the lowest accuracy, obtained for SVM, is 57.14%.

BO-GP gives the highest accuracy for LR, 89.01%, compared to the other learning
algorithms. The BO-TPE results show that the highest accuracy is from SVM with a
score of 89.01%, and the lowest accuracy is from KNN with a score of 67.03%. The
accuracy of LR under BO-GP and the accuracy of SVM under BO-TPE are both higher
than the corresponding accuracies of Table 1. From these results we deduce that hyper-
parameter tuning finds the best parameters and thereby improves the accuracy of each
learning model. The following tables show the performance ranking of BO-GP and
BO-TPE:

       | Precision | Recall | F1-score | Accuracy
BO-GP  | 1         | 1      | 1        | 1
BO-TPE | 2         | 1      | 2        | 2
(a) RF.

       | Precision | Recall | F1-score | Accuracy
BO-GP  | 1         | 1      | 1        | 1
BO-TPE | 2         | 2      | 2        | 2
(b) KNN.

       | Precision | Recall | F1-score | Accuracy
BO-GP  | 2         | 1      | 2        | 2
BO-TPE | 1         | 2      | 1        | 1
(c) SVM.

       | Precision | Recall | F1-score | Accuracy
BO-GP  | 1         | 1      | 1        | 1
BO-TPE | 2         | 1      | 2        | 2
(d) LR.

Table 4: Performance ranking of BO-GP and BO-TPE for each machine learning algorithm.

7 Conclusion

Bayesian optimization is an efficient hyperparameter tuning technique for improving
machine learning models. Our experiments show that BO-GP and BO-TPE can find
the best hyperparameters, with the corresponding accuracy, for the machine learning
models RF, KNN, SVM and LR.

The highest accuracies we obtained with BO-GP and BO-TPE are 89.01% for LR
and 89.01% for SVM, respectively. Hyperparameter tuning thus finds the best hyper-
parameters, which improves the accuracy of each algorithm. Our study shows that the
performance of machine learning models depends on the right choice of hyper-
parameters.

References
1. Bergstra, J., Bardenet, R., Bengio, Y., and Kégl, B. Algorithms for hyper-parameter optimization. Advances in Neural Information Processing Systems 24 (2011).
2. Berk, J., Nguyen, V., Gupta, S., Rana, S., and Venkatesh, S. Exploration enhanced expected improvement for Bayesian optimization. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases (2018), Springer, pp. 621–637.
3. Bodin, E., Kaiser, M., Kazlauskaite, I., Dai, Z., Campbell, N., and Ek, C. H. Modulating surrogates for Bayesian optimization. In International Conference on Machine Learning (2020), PMLR, pp. 970–979.
4. Brochu, E., Cora, V. M., and De Freitas, N. A tutorial on Bayesian optimization of expensive cost functions, with application to active user modeling and hierarchical reinforcement learning. arXiv preprint arXiv:1012.2599 (2010).
5. Dewancker, I., McCourt, M., and Clark, S. Bayesian optimization for machine learning: A practical guidebook. arXiv preprint arXiv:1612.04858 (2016).
6. Elgeldawi, E., Sayed, A., Galal, A. R., and Zaki, A. M. Hyperparameter tuning for machine learning algorithms used for Arabic sentiment analysis. In Informatics (2021), vol. 8, Multidisciplinary Digital Publishing Institute, p. 79.
7. Hoffman, M., Brochu, E., De Freitas, N., et al. Portfolio allocation for Bayesian optimization. In UAI (2011), Citeseer, pp. 327–336.
8. Joy, T. T., Rana, S., Gupta, S., and Venkatesh, S. Hyperparameter tuning for big data using Bayesian optimisation. In 2016 23rd International Conference on Pattern Recognition (ICPR) (2016), IEEE, pp. 2574–2579.
9. Kim, H.-C., and Kang, M.-J. Comparison of hyper-parameter optimization methods for deep neural networks. Journal of IKEEE 24, 4 (2020), 969–974.
10. Li, D., and Kanoulas, E. Bayesian optimization for optimizing retrieval systems. In Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining (2018), pp. 360–368.
11. Matosevic, A. On Bayesian optimization and its application to hyperparameter tuning, 2018.
12. Nguyen, V., Gupta, S., Rana, S., Li, C., and Venkatesh, S. Regret for expected improvement over the best-observed value and stopping condition. In Asian Conference on Machine Learning (2017), PMLR, pp. 279–294.
13. Nomura, M., and Abe, K. A simple heuristic for Bayesian optimization with a low budget. arXiv preprint arXiv:1911.07790 (2019).
14. Rasmussen, C. E., and Nickisch, H. Gaussian processes for machine learning (GPML) toolbox. The Journal of Machine Learning Research 11 (2010), 3011–3015.
15. van Hoof, J., and Vanschoren, J. Hyperboost: Hyperparameter optimization by gradient boosting surrogate models. arXiv preprint arXiv:2101.02289 (2021).
16. Wu, J., Toscano-Palmerin, S., Frazier, P. I., and Wilson, A. G. Practical multi-fidelity Bayesian optimization for hyperparameter tuning. In Uncertainty in Artificial Intelligence (2020), PMLR, pp. 788–798.
17. Yang, L., and Shami, A. On hyperparameter optimization of machine learning algorithms: Theory and practice. Neurocomputing 415 (2020), 295–316.
