
2019 International Conference on Data Mining Workshops (ICDMW)

A Decision Support Framework for AutoML Systems: A Meta-Learning Approach
Salijona Dyrmishi, Radwa Elshawi and Sherif Sakr
Data Systems Group, University of Tartu, Estonia
{firstname.lastname}@ut.ee

Abstract—In general, the process of building a high-quality machine learning model is an iterative, complex and time-consuming process that involves exploring the performance of various machine learning algorithms in addition to having a good understanding of, and experience with, effectively tuning their hyper-parameters. In practice, conducting this process efficiently requires solid knowledge of and experience with the various techniques that can be employed. With the continuous and vast increase of the amount of data in our digital world, it has been acknowledged that the number of knowledgeable data scientists cannot scale to address these challenges. Thus, there is a crucial need for automating the process of Combined Algorithm Selection and Hyper-parameter tuning (CASH) in the machine learning domain. Recently, several systems (e.g., AutoWeka, AutoSKLearn, SmartML) have been introduced to tackle this challenge with the main aim of reducing the role of the human in the loop and filling the gap for non-expert machine learning users by playing the role of the data scientist.

Meta-Learning is described as the process of learning from previous experience gained during applying various learning algorithms on different types of data, and hence reducing the time needed to learn new tasks. In the context of the Automated Machine Learning (AutoML) process, one main advantage of meta-learning techniques is that they allow hand-engineered algorithms to be replaced with novel automated methods which are designed in a data-driven way. In this paper, we present a methodology and framework for using Meta-Learning techniques to develop new methods that serve as an effective decision support for the AutoML process. In particular, we use Meta-Learning techniques to answer several crucial questions for the AutoML process, including: 1) Which classifiers are expected to be the best performing on a given dataset? 2) Can we predict the training time of a classifier? 3) Which classifiers are worth investing a larger portion of the time budget in to improve their performance by tuning them? In our Meta-Learning process, we used 200 datasets with different characteristics on a wide set of meta-features. In addition, we used 30 classifiers from two popular machine learning libraries, namely, Weka and Scikit-learn. Our results and Meta-Models have been obtained in a fully automated way. The methodology and results of our framework can be easily embedded/utilized by any AutoML system.

I. INTRODUCTION

Due to the increasing success of machine learning techniques in several application domains, they have been attracting a lot of attention from the research and business communities. In practice, the rise of new technologies, computing paradigms and modern data-intensive software systems (e.g., Cloud Computing, Internet-of-Things, Social Computing, Mobile Computing), in addition to the access to the large amounts of data generated by large companies (e.g., Amazon, Google, Facebook, Twitter), has given rise to new opportunities for machine learning tasks. Over the years, several machine learning libraries (e.g., Weka, Scikit-learn) have been developed to facilitate the process of building a machine learning model. In practice, the process of building a high-quality machine learning model is an iterative, complex and time-consuming process. During this process, a data scientist is commonly challenged with a large number of choices where informed decisions need to be taken (Figure 1). For example, the data scientist needs to select among a wide range of possible algorithms, including classification or regression techniques (e.g., Support Vector Machines, Neural Networks, Bayesian Models, Decision Trees), in addition to tuning numerous hyper-parameters of the selected algorithm. In practice, the quality of such decisions significantly affects the performance of the developed machine learning models. For example, in Weka, among the existing 39 machine learning algorithms, the accuracy can significantly vary across different models: on average about 46% on 21 datasets and 94% on one dataset [1]. For common machine learning algorithms (e.g., Support Vector Machine and Random Forest), the average accuracy change is more than 20% on 14 out of the 21 datasets. Although making such decisions requires solid knowledge and expertise, in practice, users of machine learning tools are increasingly often non-experts who require off-the-shelf solutions. This situation is creating a potential data science crisis, similar to that of the software crisis [2], due to the crucial need for an increasing number of data scientists with strong knowledge and good experience so that they are able to keep up with harnessing the power of the massive amounts of data which are produced daily. In particular, it has been acknowledged that data scientists cannot scale (https://hbr.org/2015/05/data-scientists-dont-scale) and it is almost impossible to balance the number of qualified data scientists against the effort required to manually analyze the increasingly growing sizes of available data. Thus, we are witnessing a growing focus on and interest in supporting the automation of the process of building machine learning pipelines, where the presence of a human in the loop needs to be dramatically reduced, or preferably eliminated.

In the last few years, several systems (e.g., AutoWeka [1], AutoSKLearn [3], SmartML [4]) have been introduced to tackle the challenge of automating the process of Combined Algorithm Selection and Hyper-parameter tuning (CASH) in the machine learning domain [5]. These techniques have commonly formulated the problem as an optimization problem that can be solved by a wide range of techniques [6], [1], [4].
Fig. 1. Decisions of the Machine Learning Modeling Process

In general, the CASH problem is described as follows [1]: given a set of machine learning algorithms A = {A(1), A(2), ...} and a dataset D divided into disjoint training and validation sets Dtrain and Dvalidation, the goal is to find an algorithm A(i)* where A(i)* ∈ A and A(i)* is a tuned version of A(i) that achieves the highest generalization performance by training A(i)* on Dtrain and evaluating it on Dvalidation. In particular, the goal of any CASH optimization technique is defined as:

A(i)* ∈ argmin_{A(i) ∈ A} L(A(i), Dtrain, Dvalidation)

where L(A(i), Dtrain, Dvalidation) is the loss function (e.g., error rate, false positives, etc.). In practice, a main constraint for CASH optimization techniques is the time budget. In particular, the aim of the optimization algorithm is to select and tune a machine learning algorithm that can achieve (near-)optimal performance in terms of the user-defined evaluation metric (e.g., accuracy, sensitivity, specificity, F1-score) within the user-defined time budget for the search process (Figure 1).
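To make the CASH formulation above concrete, the following is a minimal illustrative sketch (not the authors' implementation) of a time-budgeted search that draws random (classifier, hyper-parameter) configurations from Scikit-learn and keeps the one with the lowest validation loss. The candidate algorithms, grids, dataset and budget are assumptions chosen only for illustration.

    import time
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import ParameterSampler, train_test_split
    from sklearn.metrics import accuracy_score
    from sklearn.datasets import load_breast_cancer

    # Hypothetical search space: each algorithm with a small hyper-parameter grid.
    search_space = [
        (RandomForestClassifier, {"n_estimators": [50, 100, 200], "max_depth": [3, 5, None]}),
        (LogisticRegression, {"C": [0.01, 0.1, 1.0, 10.0], "max_iter": [500]}),
    ]

    X, y = load_breast_cancer(return_X_y=True)
    X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

    time_budget = 30.0  # seconds (assumed)
    start, best = time.time(), (None, None, 1.0)  # (algorithm, params, validation loss)

    while time.time() - start < time_budget:
        for algo, grid in search_space:
            # Draw one random hyper-parameter configuration per iteration.
            params = list(ParameterSampler(grid, n_iter=1))[0]
            model = algo(**params).fit(X_train, y_train)
            loss = 1.0 - accuracy_score(y_val, model.predict(X_val))  # validation loss L
            if loss < best[2]:
                best = (algo.__name__, params, loss)

    print("Selected configuration:", best)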
Meta-Learning [7], [8], [9] is described as the process of learning from previous experience gained during applying various learning algorithms on different kinds of data, and hence reducing the time needed to learn new tasks. In the context of the Automated Machine Learning (AutoML) process, one main advantage of Meta-Learning techniques is that they allow replacing hand-engineered algorithms with novel automated methods which are designed in a data-driven way. In this paper, we present a methodology and framework for using Meta-Learning techniques to develop new methods that serve as an effective decision support for the AutoML process. Our Meta-Learning study uses the results of running 30 classifiers (17 classifiers from Weka and 13 from Scikit-learn), with different hyper-parameter configurations, over 200 datasets with different characteristics on a wide set of meta-features [10]. In particular, we use Meta-Learning techniques to answer the following questions:

• Which classifiers are expected to perform better on a given dataset? We present an extensive analysis of the performance of 30 classifiers. In particular, we analyze the performance of these classifiers using different evaluation metrics (e.g., Accuracy, Training Time, Testing Time). In addition, we developed a Meta-Model for predicting the best performing classifiers on a given dataset based on its meta-features.

• Can we predict the training and testing times for a classifier on a given dataset? In principle, a main goal for AutoML systems is to efficiently manage the allocated time budget for exploring a larger number of the promising iterations in the search space. In general, the more iterations explored, the higher the probability of finding the optimal CASH configuration. Thus, we developed a Meta-Model to predict and rank the classifiers based on their expected training and testing times on a given dataset. In practice, this information is crucial for effectively managing the time budget during the CASH optimization process.

• Which classifiers are more tunable? How much of the time budget should be allocated to each candidate classifier? In general, the Tunability of a classifier is measured by the magnitude of its performance variance when tuning its hyperparameters. We analyze the performance of 30 classifiers and show that their tunability can significantly vary. Our results show that it is important for AutoML systems to consider the tunability of the classifiers when managing the allocated time budget for their optimization process. In addition, we propose a scoring metric to support the decision of ranking the classifiers according to their Tunability, taking into consideration their average performance, average training time, average testing time and the time budget of the optimization process.

The remainder of this paper is organized as follows. Section II provides an overview of our Meta-Learning based framework and methodology for supporting the optimization decisions of the AutoML process. Section III provides a detailed analysis of 30 machine learning classifiers from two popular machine learning libraries, Weka and Scikit-learn. In addition, we present our Meta-Learning based approach for predicting the top performing classifiers for a given dataset. Section IV presents our Meta-Learning based approach for building Meta-Models for predicting the average accuracy, average training time and the standard deviation of expected accuracy for a given classifier. Using these Meta-Models, we present a novel scoring method for ranking the classifiers based on their tunability. In addition, we describe how to use this scoring method to effectively manage the time budget allocation among candidate classifiers. We discuss the related work in Section V before we conclude the paper in Section VI.

II. FRAMEWORK AND METHODOLOGY

Figure 2 illustrates an overview of our Meta-Learning based framework for supporting the optimization decisions of the AutoML process. In our framework, we run N classifiers, each classifier with K different hyper-parameter configurations, over M datasets. The results of all runs are stored in a knowledge base where each record represents one run of a classifier c with hyper-parameter configuration h over a dataset d. In particular, each record stores the meta-features of the dataset and the classifier information with the values of its hyper-parameter configuration, in addition to the results of the run, including the classifier performance metrics (e.g., Accuracy, Specificity, Sensitivity, F1-Score), training time and testing time. The content of the knowledge base is then analyzed for building Meta-Models and scoring functions which are then used for supporting the optimization decisions during the AutoML process.
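A minimal sketch of what one such knowledge-base record could look like, under the assumption that records are stored as flat rows; the field names and values below are illustrative, not the authors' actual schema.

    from dataclasses import dataclass, field, asdict

    @dataclass
    class KnowledgeBaseRecord:
        # Meta-features of the dataset (a subset of Table I, illustrative).
        dataset_name: str
        nr_instances: int
        nr_classes: int
        nr_features: int
        class_entropy: float
        # Classifier identity and the hyper-parameter configuration of this run.
        library: str                      # "weka" or "scikit-learn"
        classifier: str                   # e.g. "RandomForest"
        hyper_parameters: dict = field(default_factory=dict)
        # Results of the run.
        accuracy: float = 0.0
        f1_score: float = 0.0
        training_time_s: float = 0.0
        testing_time_s: float = 0.0

    # One run of one classifier/configuration over one dataset becomes one row.
    record = KnowledgeBaseRecord(
        dataset_name="credit-g", nr_instances=1000, nr_classes=2, nr_features=20,
        class_entropy=0.88, library="scikit-learn", classifier="RandomForest",
        hyper_parameters={"n_estimators": 100, "max_depth": 5},
        accuracy=0.74, f1_score=0.70, training_time_s=1.8, testing_time_s=0.05,
    )
    knowledge_base = [asdict(record)]  # e.g. later dumped to CSV or a database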
Fig. 2. An Overview of our Meta-Learning Based Framework for Supporting the AutoML Optimization Process

TABLE I
EXTRACTED META-FEATURES FOR OUR DATASETS

Meta-Feature               Description
nr_instances               Number of dataset instances
log_nr_instances           Logarithm of the number of instances
nr_classes                 Number of classes
class_entropy              Class entropy
nr_features                Number of features
log_nr_features            Logarithm of the number of features
nr_numerical_features      Number of numerical features
nr_categorical_features    Number of categorical features
ratio_num_cat              Ratio of numerical to categorical features
dataset_ratio              Ratio of the number of instances to the number of features
missing_val                Number of missing values
ratio_missing_val          Ratio of missing values to all dataset values
symbols_mean               Mean number of symbols in categorical features
symbols_sum                Sum of the numbers of symbols in categorical features
symbols_std_dev            Std. dev. of the number of symbols in categorical features
skew_min                   Minimum skewness of numerical features
skew_max                   Maximum skewness of numerical features
skew_mean                  Mean skewness of numerical features
skew_std_dev               Std. dev. of the skewness of numerical features
kurtosis_min               Minimum kurtosis of numerical features
kurtosis_max               Maximum kurtosis of numerical features
kurtosis_mean              Mean kurtosis of numerical features
kurtosis_std_dev           Std. dev. of the kurtosis of numerical features
min_prob                   Minimum class label probability
max_prob                   Maximum class label probability
mean_prob                  Mean class label probability
std_dev_prob               Standard deviation of the class label probabilities
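As a rough illustration of how such dataset meta-features can be computed, the following is a sketch using pandas and scipy; it is not the authors' extraction code, and the assumption that numerical/categorical columns can be split by dtype is an illustrative simplification.

    import numpy as np
    import pandas as pd
    from scipy.stats import entropy, kurtosis, skew

    def extract_meta_features(df: pd.DataFrame, target: str) -> dict:
        """Compute a subset of the meta-features listed in Table I for one dataset."""
        y = df[target]
        X = df.drop(columns=[target])
        num = X.select_dtypes(include=[np.number])
        cat = X.select_dtypes(exclude=[np.number])
        class_probs = y.value_counts(normalize=True).to_numpy()

        mf = {
            "nr_instances": len(df),
            "log_nr_instances": float(np.log(len(df))),
            "nr_classes": int(y.nunique()),
            "class_entropy": float(entropy(class_probs, base=2)),
            "nr_features": X.shape[1],
            "nr_numerical_features": num.shape[1],
            "nr_categorical_features": cat.shape[1],
            "dataset_ratio": len(df) / max(X.shape[1], 1),
            "missing_val": int(X.isna().sum().sum()),
            "min_prob": float(class_probs.min()),
            "max_prob": float(class_probs.max()),
        }
        if num.shape[1] > 0:
            sk = num.apply(lambda col: skew(col.dropna()))
            ku = num.apply(lambda col: kurtosis(col.dropna()))
            mf.update(skew_min=float(sk.min()), skew_max=float(sk.max()),
                      skew_mean=float(sk.mean()), kurtosis_mean=float(ku.mean()))
        return mf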

Datasets

In this study, we used 200 datasets that have been collected from the popular OpenML repository (https://www.openml.org/) [11]. The datasets represent a mix of binary (54%) and multiclass (46%) classification tasks. The sizes of the datasets vary, with the largest among them being 350 MB. For our study, we considered 28 meta-features for our datasets (Table I). The 200 datasets have been carefully selected so that they cover a wide range of values for each meta-feature. Figure 3 illustrates an overview of the distribution of values for some of our meta-features. For the sake of better presentation and readability of the figures, we have eliminated a few outliers from the graphs. The main focus of our study is on the performance of the classifiers. Thus, in order to ensure fairness in our performance comparison, we have not applied any preprocessing steps to the datasets, to avoid any bias or impact on the performance of the classifiers [12].

Fig. 3. An Overview of the Characteristics of the Meta-Features of our Experimental Datasets
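For reference, an individual OpenML dataset can be pulled into Python via Scikit-learn's fetch_openml helper; the sketch below uses dataset id 31 (credit-g) purely as an example and is not necessarily one of the 200 datasets used in this study.

    from sklearn.datasets import fetch_openml

    # Download one OpenML dataset as a pandas DataFrame (data_id is an example).
    bunch = fetch_openml(data_id=31, as_frame=True)   # "credit-g"
    df, target = bunch.frame, bunch.target

    print(df.shape)                              # (instances, features + target)
    print(target.value_counts(normalize=True))   # class label probabilities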

Classifiers with Hyper-Parameter Configurations

For building our knowledge base, we used 13 classifiers from the popular Python-based machine learning library, Scikit-learn. These classifiers are: Gaussian Process, K Neighbors, Support Vector Classifier (SVC), AdaBoost, Random Forest, Quadratic Discriminant Analysis, Gaussian Naive Bayes (GaussianNB), Gradient Boosting, Linear Discriminant Analysis, Decision Tree, Complement Naive Bayes (ComplementNB), Perceptron, and Logistic Regression. In addition, we used 17 classifiers from the popular Java-based machine learning library, Weka. These classifiers are: KStar, Naive Bayes, Random Forest, Hoeffding Tree, IBk, Decision Table, Bagging, PART, Sequential Minimal Optimization (SMO), Logic Boost, J48, Simple Logistic, Logistic Regression, AdaBoostM1, Reduced Error Pruning Tree (REPTree), Logistic Model Trees, and One Rule (OneR). In general, Weka offers a comparably wider range of classifiers than Scikit-learn; among them we have chosen the most popular ones.

For each run of a classifier over a dataset, we generated 40 combinations of hyper-parameter configurations. In particular, for each classifier, we generated a list of possible and reasonable combinations and applied, for each dataset, a random search among them [13]. The results of all runs (30 ∗ 40 ∗ 200 = 240K runs) have been stored in our knowledge base. We have made all artifacts (e.g., datasets, source code) of our study available in the project repository (https://github.com/DataSystemsGroupUT/AutoMLMetaLearn).
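A sketch of how such a fixed-size random sample of hyper-parameter configurations per classifier can be drawn with Scikit-learn's ParameterSampler; the grids below are illustrative assumptions, not the exact search spaces used in the study.

    from sklearn.model_selection import ParameterSampler

    # Illustrative hyper-parameter grids for two of the classifiers.
    grids = {
        "RandomForest": {"n_estimators": [10, 50, 100, 200, 500],
                         "max_depth": [3, 5, 10, None],
                         "max_features": ["sqrt", "log2", None]},
        "SVC": {"C": [0.001, 0.01, 0.1, 1, 10, 100, 1000],
                "kernel": ["linear", "rbf", "poly"],
                "gamma": ["scale", "auto"]},
    }

    # 40 random configurations per classifier, as in the study.
    configurations = {
        name: list(ParameterSampler(grid, n_iter=40, random_state=42))
        for name, grid in grids.items()
    }
    print(len(configurations["SVC"]), configurations["SVC"][0])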
III. WHICH CLASSIFIERS WILL PERFORM BETTER?!

Analyzing the Performance of the Classifiers

We analyzed the performance of the classifiers using the following metrics (a small aggregation sketch over the knowledge base follows the list):

• Average Accuracy: For each classifier, we computed the average of the accuracy achieved over all datasets and all hyper-parameter configurations.
• Average Training Time: For each classifier, we computed the average time of the training phase over all runs on all datasets.
• Average Testing Time: For each classifier, we computed the average time of the testing phase over all runs on all datasets.
• Average Total Time: For each classifier, we computed the average total time (training time plus testing time) over all runs on all datasets.
• Average Accuracy/Training Time: We used this metric as an indicator to analyze the trade-off between the accuracy performance of a classifier and its training time.
• Top Performer: For each classifier, we calculated the number of appearances among the top 3 performing classifiers over all datasets. In particular, for each dataset, we identified the top three performing classifiers in terms of accuracy; then, we counted the number of appearances in these top-3 lists for each classifier.
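Assuming the knowledge base is exported as a flat table with one row per run (dataset, classifier, hyper-parameters, accuracy, training/testing time), a pandas sketch of these aggregations could look as follows; the file and column names are illustrative, not the authors' artifacts.

    import pandas as pd

    runs = pd.read_csv("knowledge_base_runs.csv")   # assumed export of the knowledge base

    per_classifier = runs.groupby("classifier").agg(
        avg_accuracy=("accuracy", "mean"),
        avg_training_time=("training_time_s", "mean"),
        avg_testing_time=("testing_time_s", "mean"),
    )
    per_classifier["avg_total_time"] = (
        per_classifier["avg_training_time"] + per_classifier["avg_testing_time"]
    )
    per_classifier["accuracy_per_training_second"] = (
        per_classifier["avg_accuracy"] / per_classifier["avg_training_time"]
    )

    # Top Performer: count how often each classifier appears in a dataset's top 3 by accuracy.
    best_per_run = runs.groupby(["dataset", "classifier"])["accuracy"].max().reset_index()
    top3 = (best_per_run.sort_values("accuracy", ascending=False)
                        .groupby("dataset").head(3))
    per_classifier["top3_appearances"] = (
        top3["classifier"].value_counts().reindex(per_classifier.index, fill_value=0)
    )
    print(per_classifier.sort_values("avg_accuracy", ascending=False))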
We have conducted our experiments on 6 machines, each of which has 8 CPU cores at 2 GHz and 16 GB of memory. All machines have been running Ubuntu x86_64. Figure 4 illustrates an overall comparison of the accuracy between all Weka classifiers and all Scikit-learn classifiers. In particular, for each dataset, we have selected the best accuracy among all Weka classifiers and the best accuracy among all Scikit-learn classifiers and then compared them. In Figure 4, the Y-axis shows the difference in accuracy between both values, where positive values are for those datasets where the Scikit-learn classifiers have performed better while negative values are for those datasets where the Weka classifiers have performed better. The results show that Weka classifiers have performed better for 96 datasets, Scikit-learn classifiers performed better for 50 datasets, while the best performance of the classifiers of both libraries has been the same for 54 datasets. The results also show that the difference in accuracy between the classifiers of the two libraries can reach up to 20%. Therefore, in practice, exploring classifiers from various libraries could represent a viable option for achieving a higher accuracy for the end-user.

Fig. 4. The Performance of Weka Classifiers vs the Performance of Scikit-learn Classifiers

Figure 5 illustrates the comparisons of the accuracy performance of the different classifiers. In particular, Figure 5(a) illustrates the average accuracy performance of the classifiers of the Weka library while Figure 5(b) illustrates the average accuracy performance of the classifiers of the Scikit-learn library. The results of both figures show that the average accuracy performance of the top 9 classifiers of the Weka library, namely, Logistic Regression, IBk, Bagging, Random Forest, J48, PART, Logic Boost, Decision Table, and Naive Bayes, outperforms the best classifier of the Scikit-learn library, K Neighbors. The results also show that SMO is the classifier with the lowest average accuracy for Weka while Quadratic Discriminant Analysis is the classifier with the lowest average accuracy for Scikit-learn.

Figures 5(c) and 5(d) illustrate the trade-off analysis between the Average Classifier Accuracy and the Average Training Time for the classifiers of the Weka and Scikit-learn libraries, respectively. The results of Figure 5(c) show that for Weka, KStar and IBk significantly outperform the rest of the classifiers, mainly because their training times are very short (Figure 6(a)), while Logistic Regression comes in last place, mainly because of its long training time (Figure 6(a)). The results of Figure 5(d) show that, for the Scikit-learn library, ComplementNB outperforms all other classifiers while Gradient Boosting comes in last place. In practice, careful consideration of the trade-off between the accuracy performance of a classifier and its training time is required, especially when the time budget for the AutoML task is relatively short.

Figures 5(e) and 5(f) show the top performing classifiers for the Weka and Scikit-learn libraries, respectively. The results of these figures show that, for Weka, Bagging and Logic Boost are the top performing classifiers, while, for Scikit-learn, SVC and Logistic Regression are the top performing classifiers. The results of Figures 5(e) and 5(f) also show that Naive Bayes and Quadratic Discriminant Analysis have the lowest number of appearances among the top performing classifiers for the Weka and Scikit-learn libraries, respectively.

Fig. 5. Accuracy Performance Comparison for the Different Classifiers: (a) average classifier accuracy, Weka; (b) average classifier accuracy, Scikit-learn; (c) average classifier accuracy/training time, Weka; (d) average classifier accuracy/training time, Scikit-learn; (e) number of appearances of each classifier in the top 3 performing classifiers, Weka; (f) number of appearances of each classifier in the top 3 performing classifiers, Scikit-learn

Figure 6 illustrates the comparisons of the time performance of the different classifiers across the Weka and Scikit-learn libraries. Figure 6(a) illustrates the Average Training Time of the classifiers of the Weka library while Figure 6(b) illustrates the average training time of the classifiers of the Scikit-learn library.
For the Weka library, KStar has the smallest average training time while Logistic Model Trees has the longest average training time (Figure 6(a)). The results of Figure 6(b) show that for Scikit-learn, ComplementNB is the top performing classifier, while AdaBoost comes in last place.

Figures 6(c) and 6(d) show the Average Testing Time of the classifiers of the Weka and Scikit-learn libraries, respectively. The results show that for the Weka library, the OneR and Logistic Model Trees classifiers outperform all other classifiers while the KStar classifier has the slowest testing time (Figure 6(c)). For the Scikit-learn library, the results show that Logistic Regression and Perceptron are the top performers in terms of average testing time while the Gaussian Process classifier and K Neighbors are the slowest (Figure 6(d)).

The Average Total Time of the classifiers of the Weka and Scikit-learn libraries is shown in Figures 6(e) and 6(f), respectively. For the Weka library, OneR has the smallest average total time, while Logistic Model Trees has the largest average total time (Figure 6(e)). For the Scikit-learn library, ComplementNB is the classifier with the lowest average total time while Gradient Boosting is the classifier with the highest average total time (Figure 6(f)).
Fig. 6. Time Performance Comparison for the Different Classifiers: (a) average training time, Weka; (b) average training time, Scikit-learn; (c) average testing time, Weka; (d) average testing time, Scikit-learn; (e) average total time, Weka; (f) average total time, Scikit-learn

In general, the results of all experiments show that there is no classifier which is a clear winner for all cases, and there are always clear performance trade-offs between the classifiers in terms of their accuracy, training time and testing time. Such trade-offs need to be carefully considered in the design of the automation process. Furthermore, in practice, the optimization goals of different users can vary. Thus, an effective way to tackle this challenge is to use a scoring function that ranks the classifiers, according to the user-defined optimization goal, by fusing their performance on the metrics that are most interesting to the end-user.

TABLE II
FIVE SIMILAR CLASSIFIERS IN THE SCIKIT-LEARN AND WEKA LIBRARIES

Scikit-learn              Weka
AdaBoost                  AdaBoostM1
SVM                       SMO
KNeighborsClassifier      IBk
RandomForestClassifier    RandomForest
Logistic Regression       Logistic Regression

Comparing the Performance of Similar Classifiers in Weka and Scikit-learn
For both the Weka and Scikit-learn libraries, there exist some classifiers that are similar but have different implementations in the two libraries. In this section, we focus on comparing the performance of 5 similar classifiers that are available in both libraries (Table II), even though they may have slightly different implementations. Figure 7 illustrates the performance comparison between the five selected classifiers in both the Weka and Scikit-learn libraries. The results show that for all classifiers except Support Vector Machine, the Weka implementations outperform their Scikit-learn counterparts in terms of average accuracy (Figure 7(a)). In terms of average training time, the results show that for Logistic Regression and Random Forest, the implementations of the Scikit-learn library are faster, while for AdaBoost, K Neighbors and Support Vector Machine, the implementations of the Weka library are faster (Figure 7(b)). In terms of average testing time, the implementations of the two libraries vary by a lot (Figure 7(c)). In particular, for the Weka library, the implementations of AdaBoost and Support Vector Machine are significantly faster, while the implementations of the Scikit-learn library are significantly faster for the rest of the classifiers. In terms of the trade-off analysis between accuracy and training time, the results show that for the Random Forest classifier, the implementations of both libraries have very comparable performance. For the AdaBoost, K Neighbors and Support Vector Machine classifiers, the implementations of the Weka library perform better, while for the Logistic Regression classifier, the implementation of the Scikit-learn library performs better (Figure 7(d)).

Fig. 7. Performance Comparison for the Selected 5 Classifiers in the Scikit-learn and Weka Libraries: (a) average accuracy; (b) average training time; (c) average testing time; (d) average accuracy/training time

In general, the results of this comparison study show that the performance of the same classifier can vary a lot, on all evaluation metrics, among the implementations of different machine learning libraries. Thus, in practice, it is hard to generalize the optimization rules/mechanisms of the AutoML process over different machine learning libraries (search spaces). Instead, for any AutoML optimization process, it is crucial that any prediction models or heuristic-based optimization rules that are used consider the performance characteristics of the classifiers of the underlying machine learning library.

Meta-Model for Predicting the Best Performing Classifiers

In general, during any AutoML optimization process, a main goal is to effectively reduce the search space in order to make efficient use of the allocated time budget. In particular, it is important for the optimization process to select, or at least start with, only a few classifiers which have the highest potential to provide the best performance on the input dataset.
In order to tackle this challenge, we developed a Meta-Model for predicting the best performing classifiers on a given dataset based on its meta-features, given that we have previous knowledge of how the classifiers perform on datasets with similar meta-feature characteristics. To develop this model, for each dataset and each classifier, we selected the best accuracy achieved across its hyperparameter combinations. Then, for each dataset, we calculated the standard deviation of the classifiers' accuracy results (std) and we labeled a classifier as Class 0 (Top Performer Classifier for the Dataset) if its best accuracy for the dataset is greater than or equal to the maximum accuracy achieved by all classifiers minus c ∗ std (where c is a constant); otherwise we labeled the classifier as Class 1 (Low Performer Classifier for the Dataset). In other words, a Class 0 label reflects that the classifier has the potential to be among the best performing classifiers on the dataset, and a Class 1 label reflects that the classifier does not have the potential to be among the top performers. Using the extracted information from the knowledge base for our 200 experimental datasets, we fitted a decision tree prediction model for the top performing classifiers, using the meta-feature variables as the predictors of the model. We have used a decision tree for building our prediction model because of its good interpretability and the better ability to understand the feature importance coefficients of the model (Figure 8). For our Meta-Model, we have mainly been interested in optimizing the prediction recall of Class 0. Therefore, we considered different values of the c hyperparameter, where c = 0.3 provided the best model result. Our model has achieved the following results: Recall = 0.84 and F1 Score = 0.71.

Fig. 8. The Most Important Meta-Features for Predicting the Top Performing Classifiers for a Given Dataset
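The following is a condensed sketch of the labelling rule and Meta-Model training described above, assuming the knowledge base has been aggregated into a table with one row per (dataset, classifier) holding the best accuracy plus the dataset meta-features; file and column names are illustrative, and the decision tree settings are assumptions rather than the authors' exact configuration.

    import pandas as pd
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.model_selection import cross_val_score

    def label_top_performers(best_acc: pd.DataFrame, c: float = 0.3) -> pd.Series:
        """best_acc: columns [dataset, classifier, best_accuracy]."""
        def per_dataset(group):
            threshold = group["best_accuracy"].max() - c * group["best_accuracy"].std()
            # Class 0 = top performer, Class 1 = low performer.
            return (group["best_accuracy"] < threshold).astype(int)
        return best_acc.groupby("dataset", group_keys=False).apply(per_dataset)

    # kb: one row per (dataset, classifier) with meta-feature columns and best_accuracy.
    kb = pd.read_csv("knowledge_base_best_runs.csv")          # assumed file
    kb["label"] = label_top_performers(kb[["dataset", "classifier", "best_accuracy"]])
    kb["is_top"] = (kb["label"] == 0).astype(int)             # 1 = Class 0 (top performer)

    meta_feature_cols = ["nr_instances", "nr_classes", "nr_features", "class_entropy"]
    X = kb[meta_feature_cols]

    tree = DecisionTreeClassifier(max_depth=5, class_weight="balanced", random_state=0)
    # Recall of Class 0 (top performers) is the quantity the paper optimizes for.
    recall_top = cross_val_score(tree, X, kb["is_top"], cv=5, scoring="recall").mean()
    print("Recall for top-performer class:", recall_top)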
IV. TO TUNE OR NOT TO TUNE THE CLASSIFIER? FOR HOW LONG?

In practice, one of the significant challenges for the AutoML optimization process is: given a time budget t and a search space of N classifiers, how much time should be allocated to tuning the hyperparameters of each classifier? In general, one straightforward way to tackle this challenge is to divide the time budget equally among all classifiers, where each classifier is allocated a t/N portion of the time budget. However, in practice, it has been shown that different classifiers can significantly vary in terms of their Tunability. In particular, the variance in the magnitude of the accuracy performance of a classifier when tuning its hyperparameters can be significantly different [14]. Thus, it would be more efficient to allocate a bigger portion of the time budget to the classifiers with higher tunability while allocating a smaller portion of the time budget to the classifiers with lower tunability.

In our experiments, for each dataset and classifier, we calculated the standard deviation of the accuracy values obtained from the different hyperparameter combinations over the same dataset. Figures 9 and 10 illustrate the violin plots of the probability density of the standard deviation of the classifiers' accuracy performance for the Weka and Scikit-learn libraries, respectively. The results of the experiments show that the characteristics of the classifiers vary significantly. For example, among the Scikit-learn classifiers, the percentage of the datasets on which the classifiers had no difference between the performance of their best run and their worst run was 60% for Gaussian Process, 48% for Linear Discriminant Analysis and 21% for ComplementNB. On the other hand, the Decision Tree and Random Forest classifiers had the lowest percentage (0.5%). The percentage for the Perceptron classifier is 1%. Among the Weka classifiers, the percentage was 24% for the AdaBoostM1 classifier, 23% for Logistic Regression, 18% for OneR and 12% for Decision Table. On the other hand, the Logic Boost and IBk classifiers had the lowest percentage (2.2%). The percentage for the KStar classifier is 2.5%.

Fig. 9. Standard Deviations of the Accuracies of the Classifiers of the Weka Library

Fig. 10. Standard Deviations of the Accuracies of the Classifiers of the Scikit-learn Library

In general, an effective and accurate prediction of the tunability of a classifier, and consequently an effective allocation of the time budget to it, requires considering different factors, including the expected accuracy of the classifier on the input dataset, the expected standard deviation of the accuracy performance of the classifier when tuning its hyperparameters, and the expected training time of the classifier. In particular, let us assume that classifiers c1 and c2 have predicted accuracies (μ1, μ2), predicted standard deviations of the accuracy (σ1, σ2) and predicted training times (t1, t2), respectively. Rules for ranking the tunability (Tune(ci)) of the classifiers can be represented as follows:

• R1: μ1 > μ2 ⇒ Tune(c1) ≻ Tune(c2)
• R2: μ1 = μ2 ∧ σ1 > σ2 ⇒ Tune(c1) ≻ Tune(c2)
• R3: σ1 > σ2 ∧ μ2 > μ1 + (σ1 ∗ μ1) ⇒ Tune(c2) ≻ Tune(c1)
• R4: μ1 = μ2 ∧ t1 > t2 ⇒ Tune(c2) ≻ Tune(c1)

Utilizing the information in our knowledge base, we used AutoSKLearn, a winner of two ChaLearn AutoML challenges (http://automl.chalearn.org/), to build Meta-Models for predicting the average accuracy of a classifier (RMSE = 0.12), the standard deviation of the accuracy of the classifier (RMSE = 0.07) and the classifier training time (RMSE = 154.8). The predictors of all models are the meta-features of the input dataset. Using the developed Meta-Models, we designed a scoring function to assist in automatically deciding which classifiers are more likely to give a better result after tuning on a new dataset, given a specific time budget. In particular, for each classifier ci, we compute a tuning score si as follows:

si = μi + (T · σi) / ti²

where T represents the total time budget, μi represents the predicted average accuracy of the classifier, σi represents the predicted standard deviation of the accuracy of the classifier and ti represents the predicted average training time of the classifier. In designing this scoring function, we use μi to exploit the higher potential for achieving a higher prediction accuracy (R1), we use σi/ti to explore the higher potential of gaining an accuracy improvement from each additional run of the classifier (R2 and R3), while we use T/ti to trade off exploration and exploitation based on the expected number of runs of each classifier within the time budget (R4). The higher the tunability score si for a given classifier, the bigger the portion of the time budget that should be allocated to this classifier.
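A small sketch of how the tuning score above could be turned into a concrete budget split, assuming the three Meta-Models' predictions are already available per candidate classifier; the prediction values are made up for illustration, and the proportional split is one simple policy consistent with the rule above rather than necessarily the authors' exact allocation scheme.

    def tuning_score(mu: float, sigma: float, t: float, T: float) -> float:
        """s_i = mu_i + (T * sigma_i) / t_i^2, as defined above."""
        return mu + (T * sigma) / (t ** 2)

    def allocate_budget(predictions: dict, T: float) -> dict:
        """Split the total budget T proportionally to the tuning scores (assumed policy)."""
        scores = {name: tuning_score(p["mu"], p["sigma"], p["t"], T)
                  for name, p in predictions.items()}
        total = sum(scores.values())
        return {name: T * s / total for name, s in scores.items()}

    # Hypothetical Meta-Model outputs for three candidate classifiers.
    predictions = {
        "RandomForest":       {"mu": 0.82, "sigma": 0.03, "t": 12.0},
        "SVC":                {"mu": 0.80, "sigma": 0.06, "t": 45.0},
        "LogisticRegression": {"mu": 0.78, "sigma": 0.01, "t": 3.0},
    }
    print(allocate_budget(predictions, T=3600.0))  # seconds per classifier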
V. RELATED WORK

In the last few years, several AutoML systems have been presented in the literature [5]. For example, Auto-WEKA [1] is considered the first and pioneering machine learning automation framework. It was implemented in Java on top of Weka and applies Bayesian optimization using Sequential Model-based Algorithm Configuration (SMAC) [15] and the Tree-structured Parzen Estimator (TPE) for both algorithm selection and hyper-parameter optimization. Autosklearn [3] has been implemented on top of Scikit-learn and introduced the idea of meta-learning for the initialization of combined algorithm selection and hyperparameter tuning. It also used SMAC as a Bayesian optimization technique. In addition, ensemble methods were used to improve the performance of the output models. TPOT [16] is another AutoML framework that has been implemented on top of Scikit-learn. It is based on genetic programming, exploring many different possible pipelines of feature engineering and learning algorithms and then selecting the best one among them. Recipe [17] is another framework that follows the same optimization procedure as TPOT using genetic programming, which in turn exploits the advantages of a global search. However, it addresses the unconstrained search problem of TPOT, where resources can be spent on generating and evaluating invalid solutions, by adding a grammar that avoids the generation of invalid pipelines and can speed up the optimization process. Second, it works with a bigger search space of different model configurations than AutoSkLearn and TPOT. SmartML [4] has been introduced as the first R package for automated model building for classification tasks. It uses a meta-learning approach for algorithm selection and uses SMAC Bayesian optimization for its hyperparameter tuning. ML-Plan [18] has been proposed to tackle the composability challenge of building machine learning pipelines. In particular, it integrates a super-set of both Weka and Scikit-learn algorithms to construct a full pipeline. ML-Plan tackles the search problem of finding an optimal machine learning pipeline using a hierarchical task network algorithm, where the search space is modeled as a large tree graph in which each leaf node is considered a goal node representing a full pipeline.

In general, there is a wide set of factors that can affect an algorithm's runtime. Some of these factors are hardware-related (e.g., CPU utilization, memory consumption, number of jobs running in parallel, disk storage), in addition to the specific details of the efficiency of the algorithm's implementation. In general, most experimental evaluation and comparison studies for different types of algorithms show that there is mostly no clear winner, as there are always some trade-offs that need to be considered and optimized according to the context of the problem and the user's goals. For example, in the context of the optimization process of AutoML systems, the hyperparameter search space is typically very large and there is always the trade-off of the willingness to give up some improvements if we can get reasonable results within a predefined time budget. The threshold of this trade-off is always user-defined and varies from one user to another. Doan and Kalita [19] presented a method to predict the runtime of an algorithm on a particular dataset without real execution. For practical reasons, the scope of the study was restricted to the default values of the hyper-parameter configurations. They consider 28 classifiers from Weka and 55 datasets. The original datasets are transformed into smaller versions by dimensionality reduction. The classifiers have been run over the datasets and their runtime recorded (a total of 1540 records (runs) have been generated). The authors have evaluated several regression techniques based on RMSE. The MARS model, which resembles a tree-based approach, was found to be the best regressor for predicting the runtime. Yang et al. [20] have presented another approach that uses meta-learning with a collaborative filtering technique for model selection in AutoML and predicts the runtime using polynomial regression.
van Rijn and Hutter [21] presented an approach to study which hyperparameters are more important to tune. The main focus of this work was to derive conclusions that can be generalized and hold for any dataset. Their study included the datasets of the popular OpenML100 [22] corpus and focused only on 3 Scikit-learn classifiers, namely, Support Vector Machines, Random Forest and Adaboost. For each dataset, each of the three classifiers has been run for at least 150 runs with different hyperparameters and their performance was recorded. To study the distributions, functional ANOVA with random forests is used to get the impact on variance of every hyperparameter and to obtain empirical evidence about the hyperparameters which are more likely to yield improvements on a considerable number of datasets. To the best of our knowledge, our study is the first that considers such a large number of classifiers, 30 classifiers, from 2 different popular machine learning libraries (Weka and Scikit-learn) over a large number of datasets (200) with a wide set of meta-feature characteristics for building various Meta-Models and scoring techniques that can assist a multi-objective decision mechanism for the AutoML optimization process.

VI. CONCLUSION

Machine learning has become one of the main engines of the current era. The production pipeline of a machine learning model passes through different phases and challenges that require solid knowledge about several configuration options. However, as the scale of data produced daily keeps increasing at an exponential rate, it has become essential to automate this process and make it more usable by domain experts and non-technical users. In this paper, we presented a methodology and framework for using Meta-Learning techniques to develop new methods that serve as an effective multi-objective decision support for the AutoML process. We presented an extensive analysis of the performance characteristics of a wide set of classifiers using various evaluation metrics and analyzed various trade-offs. We presented our approach for building Meta-Models for predicting the best performing classifiers on a given dataset, predicting the training/testing times of a classifier and ranking the classifiers based on their Tunability, and we developed an effective approach for allocating the time budget of the AutoML process among candidate classifiers. Our methodology and framework remain agnostic towards the underlying machine learning library. In addition, they can be easily embedded/utilized by any AutoML system. As future work, we are planning to extend our work in several directions. For example, we are planning to extend our analysis to include the classifiers of distributed machine learning libraries (e.g., Spark ML [23], SystemML [24]). In addition, further analysis and Meta-Models can be developed. For example, building an accurate meta-model for correlating the most important hyper-parameters to tune for a classifier with the meta-features of a given dataset would be quite useful for improving the efficiency and quality of the hyper-parameter tuning process. Furthermore, a deeper analysis and understanding of the most significant meta-features for accurately representing a given dataset is quite crucial for improving the quality of the Meta-Learning process.

ACKNOWLEDGMENT

The work of Salijona Dyrmishi and Sherif Sakr is funded by the European Regional Development Funds via the Mobilitas Plus programme (grant MOBTT75). The work of Radwa Elshawi is funded by the European Regional Development Funds via the Mobilitas Plus programme (MOBJD341). The authors would like to thank Mohamed Maher for his useful comments on some of the results of this study.

REFERENCES

[1] Chris Thornton, Frank Hutter, Holger H. Hoos, and Kevin Leyton-Brown. Auto-WEKA: Combined selection and hyperparameter optimization of classification algorithms. In KDD, 2013.
[2] Brian Fitzgerald. Software crisis 2.0. Computer, 45(4), 2012.
[3] Matthias Feurer et al. Efficient and robust automated machine learning. In Advances in Neural Information Processing Systems, 2015.
[4] Mohamed Maher and Sherif Sakr. SmartML: A meta learning-based framework for automated selection and hyperparameter tuning for machine learning algorithms. In EDBT, pages 554–557, 2019.
[5] Radwa El Shawi, Mohamed Maher, and Sherif Sakr. Automated machine learning: State-of-the-art and open challenges. CoRR, abs/1906.02287, 2019.
[6] Lars Kotthoff, Chris Thornton, Holger H. Hoos, Frank Hutter, and Kevin Leyton-Brown. Auto-WEKA 2.0: Automatic model selection and hyperparameter optimization in WEKA. JMLR, 18(1), 2017.
[7] Joaquin Vanschoren. Meta-learning: A survey. CoRR, abs/1810.03548, 2018.
[8] Pavel Brazdil, Christophe Giraud-Carrier, Carlos Soares, and Ricardo Vilalta. Metalearning: Applications to Data Mining. Springer Science & Business Media, 2008.
[9] Christophe Giraud-Carrier. Metalearning: A tutorial. In Tutorial at the 7th International Conference on Machine Learning and Applications (ICMLA), San Diego, California, USA, 2008.
[10] Besim Bilalli, Alberto Abelló, and Tomas Aluja-Banet. On the predictive power of meta-features in OpenML. International Journal of Applied Mathematics and Computer Science, 27(4):697–712, 2017.
[11] Joaquin Vanschoren, Jan N. van Rijn, Bernd Bischl, and Luis Torgo. OpenML: Networked science in machine learning. SIGKDD Explorations, 15(2):49–60, 2013.
[12] Vicente García et al. On the effectiveness of preprocessing methods when dealing with different levels of class imbalance. Knowledge-Based Systems, 25(1), 2012.
[13] James Bergstra and Yoshua Bengio. Random search for hyper-parameter optimization. JMLR, 13, 2012.
[14] Philipp Probst et al. Tunability: Importance of hyperparameters of machine learning algorithms. JMLR, 20(53), 2019.
[15] Frank Hutter, Holger H. Hoos, and Kevin Leyton-Brown. Sequential model-based optimization for general algorithm configuration. In Proceedings of the 5th International Conference on Learning and Intelligent Optimization, 2011.
[16] Trang T. Le, Weixuan Fu, and Jason H. Moore. Scaling tree-based automated machine learning to biomedical big data with a dataset selector. BioRxiv, 2018.
[17] Alex G. C. de Sá et al. RECIPE: A grammar-based framework for automatically evolving classification pipelines. In European Conference on Genetic Programming, 2017.
[18] Felix Mohr, Marcel Wever, and Eyke Hüllermeier. ML-Plan: Automated machine learning via hierarchical planning. Machine Learning, 2018.
[19] Tri Doan and Jugal Kalita. Predicting run time of classification algorithms using meta-learning. International Journal of Machine Learning and Cybernetics, 8(6):1929–1943, 2017.
[20] Chengrun Yang, Yuji Akimoto, Dae Won Kim, and Madeleine Udell. Oboe: Collaborative filtering for AutoML initialization. arXiv preprint arXiv:1808.03233, 2018.
[21] Jan N. van Rijn and Frank Hutter. Hyperparameter importance across datasets. In KDD '18, 2018.
[22] Bernd Bischl et al. OpenML benchmarking suites and the OpenML100. arXiv preprint arXiv:1708.03731, 2017.
[23] Xiangrui Meng et al. MLlib: Machine learning in Apache Spark. The Journal of Machine Learning Research, 17(1):1235–1241, 2016.
[24] Matthias Boehm et al. SystemML: Declarative machine learning on Spark. Proceedings of the VLDB Endowment, 9(13):1425–1436, 2016.
