How To Select A Suitable Machine Learning Algorithm
Abstract: The increasing availability of data gathered from various sources and in several contexts is forcing practitioners to find affordable ways to manage and exploit datasets. Within this context, machine learning (ML) - which can be described as a set of algorithms that analyse and process data to extract relevant features for clustering, classification or prediction - has emerged as one of the most investigated areas providing powerful tools. Indeed, in the literature it is possible to find a considerable number of articles dealing with ML algorithms and describing their real-world applications. This considerable body of work, depicting a wide variety of algorithms and widespread applications, creates extensive knowledge on the topic. At the same time, it may also generate disorientation in the selection of the right approach. Thus, the need arises for synthesis and guidelines to drive the selection of the most suitable algorithm for a specific scope. To respond to this necessity, the authors propose a ML algorithm selection tool. As a starting point, the authors analysed several ML algorithms, investigating their scope, their characteristics, and their typical fields of application, including real examples. Based on this exploration, the authors identified two decision layers: the first concerns the nature of the learning activity (supervised, unsupervised, etc.), while the second is related to the characteristics of the ML algorithms (type of response, data size and type they can manage, etc.). Starting from a pool of algorithms, the first layer enables users to narrow this pool depending on their scope. Then, the second layer guides the final selection, fitting the users' constraints, the previously mentioned algorithm features, and the data characteristics.
Keywords: machine learning; classification; selection framework; data analysis; decision making
the characteristics of each model and that s/he computes the value for each single model under consideration. In this case, the model selection could be biased by the initial pool of algorithms selected by the user, which could not include the most suitable ones.
The definition of the validation strategy for the proposed framework requires a detailed discussion, since the problem under study is non-trivial. The validation strategy should be developed considering the characteristics of both the traditional methods and the new framework, making it possible to identify the pros and cons of each one and, in turn, to understand when one performs better than the others. For this reason, the validation of the framework is postponed to another paper.
The paper is structured as follows: Section 2 introduces the concepts of ML, discussing the main characteristics of the Supervised and Unsupervised learning approaches, and presenting a set of Supervised and Unsupervised algorithms. Section 3 deals with the definition of the framework, describing the drivers composing it and how they should be applied to datasets. Section 4 concludes the paper, discussing the main benefits and limits of the framework.
2. Machine Learning

Mishra and Gupta (2017) define ML as algorithms that, through the automatic association of events and their consequences, allow making accurate predictions based on a database of past observations.
The literature distinguishes Active Learning approaches from Passive Learning approaches. The main idea behind Active Learning is that it is a learning process where the ML algorithm is allowed to select the data from which it learns and, by that, is able to perform better with less training (Liu, 2010). On the contrary, Passive Learning refers to an ensemble of approaches, namely Supervised, Unsupervised and Semi-Supervised, which use random samples from the dataset to build models that can perform prediction, classification, clustering or other tasks depending on the dataset composition (Mishra and Gupta, 2017). The difference among the three Passive Learning approaches is based on the presence of labels in the dataset.
Labels are constituted by one or more tags that contain desirable information on the data and favour their recognition. Thus, data can be classified as labelled or unlabelled. In particular, if the algorithm is trained on the basis of labelled data, the approach is identified as “Supervised Learning”; otherwise, in the case of unlabelled data, the approach is called “Unsupervised Learning”. In addition, if the training dataset contains both labelled and unlabelled data, the approach is called “Semi-Supervised Learning”. The presented framework deals only with Supervised and Unsupervised learning approaches.
2.1 Supervised Learning

In supervised learning, the idea is to exploit the information emerging from the data distribution and from the external knowledge – the labels – to create a model able to make predictions. Supervised learning approaches can be used for regression or classification purposes. In the first case, the idea is to use past data to build a regression model able to predict the future behaviour of the dataset. In the second case, the aim of the model is to classify data into specific classes and, based on that, assign new data to the correct class. In essence, the supervised learning process aims to construct a mapping function conditioned to the provided training data set (Christiano Silva and Zhao, 2016). The following list reports the algorithms considered for classification purposes (a brief illustrative code sketch follows each description):
- Logit: this algorithm is suited for binary classifications. The logistic regression algorithm calculates the class membership probability for one of the two categories in a dataset. It is best suited for data clearly separated by a single, linear boundary (Dreiseitl and Ohno-Machado, 2002; Smola and Vishwanathan, 2008);

- Multinomial Logit: in the multinomial situation, there are categorical response variables that can assume more than two outcomes. The algorithm basically estimates the class probabilities for a multi-category response, which are then used to classify new cases into one of several outcome groups (Dreiseitl and Ohno-Machado, 2002; Smola and Vishwanathan, 2008);
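As an illustration of both variants, the following minimal sketch fits a binary and a multinomial logistic regression with scikit-learn; the synthetic data and all parameter values are assumptions made here for illustration, not part of the original study.

```python
# Minimal sketch (assumed setup): binary and multinomial logistic
# regression with scikit-learn on synthetic data.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Binary case: two classes separated by a roughly linear boundary.
X, y = make_classification(n_samples=500, n_features=4, n_classes=2,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
logit = LogisticRegression().fit(X_tr, y_tr)
print("binary accuracy:", logit.score(X_te, y_te))
print("class probabilities:", logit.predict_proba(X_te[:1]))

# Multinomial case: more than two outcome groups.
Xm, ym = make_classification(n_samples=500, n_features=4, n_classes=3,
                             n_informative=3, random_state=0)
mlogit = LogisticRegression(multi_class="multinomial").fit(Xm, ym)
print("multinomial probabilities:", mlogit.predict_proba(Xm[:1]))
```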
- Classification Trees: it is a non-parametric algorithm. For this reason, its performance is not affected by the presence of outliers. Even though it can handle a wide variety of input data, this algorithm is not suitable for high-dimensional datasets. It has a flow-chart structure, where each node represents a test and each leaf represents the response. It is easy to visualize, and results are easy to interpret (Singh et al., 2016);
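A minimal sketch of such a tree, assuming scikit-learn and the iris dataset; the depth limit is an illustrative choice:

```python
# Minimal sketch (assumed setup): a classification tree whose
# flow-chart structure can be printed and inspected node by node.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

data = load_iris()
tree = DecisionTreeClassifier(max_depth=3, random_state=0)
tree.fit(data.data, data.target)

# Each internal node is a test on a feature; each leaf is a response.
print(export_text(tree, feature_names=list(data.feature_names)))
```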
- Support Vector Machines (SVM) Classification: the algorithm aims at classifying data by finding a linear decision boundary (called hyperplane) which separates the data classes, specifically the hyperplane that has the largest margin between the two classes. For non-linear situations, the algorithm considers a loss function that penalizes the points on the wrong side of the hyperplane. Sometimes this algorithm uses a kernel to transform nonlinearly separable data into higher dimensions where a linear decision boundary can be found. The SVM is suitable for binary data, but discrete data can also be used as input. High-dimensional data can be managed easily. The algorithm performance decreases in the presence of noise (Kotsiantis et al., 2007);
- k-Nearest Neighbour: this algorithm categorizes an object depending on the classes of its nearest neighbours in the dataset. Consequently, the algorithm assumes that objects that are close to each other are similar. The algorithm can be trained using different distance metrics (e.g. Euclidean, Chebyshev, etc.). It can work with binary and discrete variables, but its performance is strongly affected by the data size and by the presence of outliers and noise (Kotsiantis et al., 2007);
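As a sketch, the distance-metric choice mentioned above maps directly onto an estimator parameter in scikit-learn; the metrics and data below are assumptions for illustration:

```python
# Minimal sketch (assumed setup): k-NN classification with two of the
# distance metrics mentioned in the text (Euclidean and Chebyshev).
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=300, n_features=5, random_state=0)

for metric in ("euclidean", "chebyshev"):
    knn = KNeighborsClassifier(n_neighbors=5, metric=metric).fit(X, y)
    print(metric, "training accuracy:", knn.score(X, y))
```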
- (Multilayer) Neural Network: this algorithm consists of a set of simple, interconnected computation units called neurons, organized into layers with different roles, called input, output and hidden layers, respectively. The number of hidden layers depends upon the model complexity. The neurons are connected via weighted links, and the way the neurons are connected defines different types of Neural Networks. A Neural Network is trained iteratively to find the right weights for the links. It best fits the modelling of highly nonlinear systems, when data are available incrementally and there is a constant need to update the model. Neural Networks can deal with noise and outliers in the dataset (Singh et al., 2016).
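A minimal sketch of such a multilayer network, assuming scikit-learn's MLPClassifier and an arbitrary two-hidden-layer topology:

```python
# Minimal sketch (assumed setup): a multilayer perceptron classifier
# with two hidden layers, trained iteratively by backpropagation.
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# hidden_layer_sizes fixes the topology; max_iter bounds the
# iterative weight updates.
mlp = MLPClassifier(hidden_layer_sizes=(20, 10), max_iter=1000,
                    random_state=0).fit(X, y)
print("training accuracy:", mlp.score(X, y))
```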
The following list reports the algorithms considered for regression purposes (again, an illustrative sketch follows each description):

- Regression Trees: unlike Classification Trees, Regression Trees can handle both categorical and continuous variables. This algorithm is suitable when data have many features interacting in complicated and nonlinear ways. It sub-divides the space into smaller regions, further partitions the sub-divisions, and assigns them to its nodes (leaves), where interactions are more manageable. As for Classification Trees, nodes are subdivided into leaf nodes, which contain the responses (Yildiz et al., 2017);
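A sketch of a regression tree follows; the noisy sine data are an assumption chosen to expose the piecewise, region-by-region predictions described above:

```python
# Minimal sketch (assumed setup): a regression tree partitions the
# input space into regions and predicts a constant value per region.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 6, 200)).reshape(-1, 1)
y = np.sin(X).ravel() + rng.normal(0, 0.1, 200)

reg_tree = DecisionTreeRegressor(max_depth=4).fit(X, y)
print("predictions on a grid:",
      reg_tree.predict(np.array([[0.5], [2.0], [4.5]])).round(2))
```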
- SVM Regression: this kind of regression is very similar to the classification algorithm, but it is designed to predict a continuous response. It does not find a hyperplane to separate data; instead, it searches for a model that deviates from the measured data by no more than a small amount, keeping the model parameters as small as possible in order to minimize the sensitivity to error. It is usually used for high-dimensional data, where there is a large number of predictor variables (Yildiz et al., 2017);
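In scikit-learn terms, the tolerated "small amount" of deviation is the epsilon parameter of epsilon-insensitive SVR; the sketch below, with assumed data and epsilon value, makes that mapping explicit:

```python
# Minimal sketch (assumed setup): epsilon-SVR tolerates deviations up
# to `epsilon` around the regression function without penalty.
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, (200, 1))
y = 0.5 * X.ravel() + rng.normal(0, 0.1, 200)

svr = SVR(kernel="linear", epsilon=0.1).fit(X, y)
print("prediction at x=1:", svr.predict([[1.0]]))
```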
- k-Nearest Neighbour: it can be used in the case of continuous data labels. The value of the parameter k influences the prediction variance: when it is small there is a high variance in prediction, while when it is high there is a large bias. The scale of the features also influences the quality of KNN regression predictions (Hidalgo et al., 2017);
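The bias-variance trade-off governed by k can be observed directly; in this sketch the data and the k values are illustrative assumptions:

```python
# Minimal sketch (assumed setup): small k follows the noise (high
# variance), large k over-smooths (high bias).
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 6, 150)).reshape(-1, 1)
y = np.sin(X).ravel() + rng.normal(0, 0.2, 150)

for k in (1, 5, 50):
    knn = KNeighborsRegressor(n_neighbors=k).fit(X, y)
    print(f"k={k:2d} training R^2:", round(knn.score(X, y), 3))
```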
- (Multilayer) Neural Network: this algorithm is similar to the one used for classification purposes. The main difference from the classification version of the Neural Network is that, while in the first case the output is constituted by a discrete value (the class), here the output is constituted by a continuous value. Neural Networks can deal with noise and outliers in the dataset (Singh et al., 2016).
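The only structural change with respect to the classification sketch above is an estimator that produces a continuous output; the regressor and data are again assumptions:

```python
# Minimal sketch (assumed setup): same multilayer architecture as in
# classification, but with a continuous output value.
from sklearn.datasets import make_regression
from sklearn.neural_network import MLPRegressor

X, y = make_regression(n_samples=500, n_features=10, noise=5.0,
                       random_state=0)
mlp = MLPRegressor(hidden_layer_sizes=(20, 10), max_iter=2000,
                   random_state=0).fit(X, y)
print("continuous prediction:", mlp.predict(X[:1]))
```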
2.2 Unsupervised Learning

In the unsupervised learning case, the main task consists in finding intrinsic data structures. The learning process, in this case, is solely guided by the data relationships, as no labels are available for the data analysis (Mitchell, 1997). The following list reports the algorithms considered for clustering purposes (an illustrative sketch follows most descriptions):

- Fuzzy C-Means (FCM): it uses fuzzy algorithms to handle and analyse data. It is useful when a data point can belong to more than one cluster. The number of clusters should be known. It is not suitable for datasets with noise or outliers but can handle large datasets (Havens et al., 2012);
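FCM is not part of scikit-learn, so the following sketch implements its two alternating update rules (fuzzy memberships, then centroids) directly in NumPy; the fuzzifier m = 2 and the synthetic data are assumptions:

```python
# Minimal sketch (assumed setup): fuzzy c-means via its two
# alternating updates; each point gets a membership in every cluster.
import numpy as np

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (100, 2)), rng.normal(3, 0.5, (100, 2))])
c, m = 2, 2.0                      # number of clusters, fuzzifier
U = rng.random((len(X), c))
U /= U.sum(axis=1, keepdims=True)  # memberships sum to 1 per point

for _ in range(50):
    W = U ** m
    centroids = (W.T @ X) / W.sum(axis=0)[:, None]    # weighted means
    d = np.linalg.norm(X[:, None, :] - centroids, axis=2) + 1e-9
    U = 1.0 / (d ** (2 / (m - 1)))                    # inverse-distance
    U /= U.sum(axis=1, keepdims=True)                 # renormalize

print("soft memberships of first point:", U[0].round(3))
```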
- Balanced Iterative Reducing and Clustering using Hierarchies (BIRCH): it is a hierarchical algorithm that takes as input a set of N data points and a desired number of clusters k. BIRCH relies on the use of clustering feature (CF) vectors to store and summarize the information about each cluster. It organizes these vectors in a CF tree, which is a height-balanced tree data structure. It is suitable for large datasets and is robust to outliers and noise. Its disadvantages include a difficulty in finding arbitrarily shaped clusters (Pitolli et al., 2017);
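scikit-learn provides a Birch estimator built on the CF tree just described; in this sketch the threshold and the synthetic blobs are assumed values:

```python
# Minimal sketch (assumed setup): BIRCH summarizes points into a CF
# tree, then reduces the leaves to the desired number of clusters.
from sklearn.cluster import Birch
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=10_000, centers=3, random_state=0)

# threshold controls the radius of each CF subcluster.
birch = Birch(threshold=0.5, n_clusters=3).fit(X)
print("cluster labels of first points:", birch.labels_[:10])
```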
- Clustering Using Representatives (CURE): it is an improvement of the BIRCH algorithm, since it makes it possible to find clusters of arbitrary shapes. CURE is also more robust with respect to outliers and scalable to large datasets. These benefits are achieved by using several representative objects for a cluster. At each iteration, the two clusters with the closest pair of representative objects are merged. A drawback is the need for user-specified parameter values, namely the number of clusters and the shrinking factor (Guha et al., 1998);
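CURE is not available in scikit-learn; one common option is the pyclustering package, whose API is sketched below under the assumption that the package is installed and exposes the parameters shown. All values are illustrative:

```python
# Minimal sketch (assumed setup): CURE via the pyclustering package;
# number of clusters, representative points and shrinking factor are
# the user-specified parameters mentioned in the text.
from pyclustering.cluster.cure import cure
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=500, noise=0.05, random_state=0)

# 2 clusters, 5 representatives per cluster, shrinking factor 0.5
# (parameter meanings follow Guha et al., 1998).
cure_instance = cure(X.tolist(), 2, number_represent_points=5,
                     compression=0.5)
cure_instance.process()
print("cluster sizes:", [len(c) for c in cure_instance.get_clusters()])
```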
- RObust Clustering using linKs (ROCK): it assumes a similarity measure between objects and defines a 'link' between two objects whose similarity exceeds a threshold. Initially, each object is assigned to a separate cluster. Then, clusters are merged repeatedly according to their closeness. This algorithm is not able to handle large datasets properly and is not robust to outliers (Guha et al., 2000);
- k-Means: it divides data into k mutually exclusive clusters. The distance from the cluster centre defines the probability of belonging to it. It fits large datasets, but its performance decreases when outliers are present in the dataset (Pham et al., 2005);
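A minimal k-means sketch with scikit-learn; the value of k and the blob data are assumptions:

```python
# Minimal sketch (assumed setup): k-means assigns every point to
# exactly one of k clusters around computed centroids.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=1_000, centers=4, random_state=0)

kmeans = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)
print("centroids:", kmeans.cluster_centers_.round(2))
```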
- k-Medoids or Partitioning Around Medoids (PAM): it is similar to k-Means. The term k represents the number of medoids to be identified and the number of clusters that are required. A medoid is an object whose average dissimilarity from the other objects in the cluster is minimal, and for this reason it is used as the representative object (the centre) of the cluster. The algorithm processes and assigns n data points to k clusters with k medoids. In contrast to the k-Means algorithm, k-Medoids uses a data point in the cluster as the cluster centre, while in k-Means the centroid is calculated and may not coincide with a data point in the cluster. This makes the k-Medoids algorithm more robust in handling noise and outliers, because it minimizes a sum of pairwise dissimilarities instead of a sum of Euclidean distances (Shafiq and Torunski, 2016).
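k-Medoids is not in core scikit-learn; the sketch below assumes the companion scikit-learn-extra package, and the distance metric is an illustrative choice:

```python
# Minimal sketch (assumed setup): PAM-style k-medoids, where each
# cluster centre is an actual data point (a medoid).
from sklearn.datasets import make_blobs
from sklearn_extra.cluster import KMedoids

X, _ = make_blobs(n_samples=500, centers=3, random_state=0)

kmedoids = KMedoids(n_clusters=3, metric="manhattan", method="pam",
                    random_state=0).fit(X)
print("medoids (actual data points):", kmedoids.cluster_centers_.round(2))
```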
Based on these definitions, the framework is proposed in the following section.
3. Framework

In the selection of a suitable ML algorithm for data analysis, many aspects should be taken into account, and most of the time selecting an algorithm only on the basis of the promised accuracy or computational speed leads to unsatisfactory results (Andreopoulos et al., 2009; Kotsiantis et al., 2007; Saxena et al., 2017). Moreover, the presence in the literature of a wide range of different algorithms can create disorientation in users not acquainted with their known strengths and weaknesses. This paper proposes a selection framework that works on two different layers, each one linked to a different aspect of the analysis, to guide the user in the selection of the ML algorithms suitable for the analysis of a specific dataset. The first layer of the ML algorithm selection is based on the presence of labels in the dataset and on the scope of the analysis (i.e. the learning activity). In this way, the user is guided towards Supervised or Unsupervised Learning algorithms. Then, in the case of a Supervised approach, the user is requested to indicate whether s/he is interested in a classification or a regression analysis; in the case of Unsupervised Learning, only clustering is proposed. In the second layer, four more drivers guide the user in the identification of proper ML algorithms. The drivers have been identified after a literature review of ML application cases. The results are reported in Table 1, which contains the list of drivers, their description and the list of papers used for their identification. It is worth noticing that some papers are associated with multiple drivers, in some cases with every driver. This underpins the importance of these drivers, strengthening the fact that they should be taken into consideration during the ML algorithm selection phase.
Figure 1 depicts the two layers of the framework and the related selection drivers, while Table 2 shows the list of algorithms classified based on those drivers. As explained earlier, by following this framework the users should be able to identify one or more ML algorithms suitable for their scope (drivers Learning and Learning Activity) and the dataset characteristics (drivers Data Type, Scalability, Robustness to Outliers/Noise and Response Type).

Figure 1: Machine Learning Algorithm Selection Framework
In order to be effective, the proposed selection framework has to be applied in a precise sequence. In particular, at the beginning, the first layer requires the user to select the learning approach and, in turn, the learning activity. Then, the second selection layer requires the identification of the data characteristics to understand which of the available algorithms should be used for the analysis. As mentioned above, the user should consider the nature of the data in order to avoid algorithms unable to deal with the dataset. Thus, knowing the dataset dimension, the user is required to indicate whether it has a limited number of features, and so can be handled by most of the algorithms, or requires specific algorithms whose performance is not affected by the number of features. Moreover, the user should consider the possibility of having outliers or noise in the dataset and clarify whether their presence is a problem for the analysis. If it is, the framework removes the algorithms not able to manage the presence of outliers and/or noise. Finally, the framework requires the user to specify the type of response required as output of the analysis.
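To make this sequence concrete, the following sketch encodes the two layers as successive filters over a small algorithm catalogue; the catalogue entries and driver values are illustrative assumptions distilled from Section 2, not a reproduction of the authors' Table 2.

```python
# Minimal sketch (assumed setup): the two-layer selection expressed as
# successive filters over an illustrative algorithm catalogue.
# Driver values are simplified readings of Section 2, not Table 2.
CATALOGUE = [
    {"name": "Logit", "learning": "supervised",
     "activity": "classification", "scalable": True, "robust": True},
    {"name": "SVM Classification", "learning": "supervised",
     "activity": "classification", "scalable": True, "robust": False},
    {"name": "k-Means", "learning": "unsupervised",
     "activity": "clustering", "scalable": True, "robust": False},
    {"name": "BIRCH", "learning": "unsupervised",
     "activity": "clustering", "scalable": True, "robust": True},
]

def select(learning, activity, need_scalability, outliers_are_a_problem):
    # Layer 1: learning approach and learning activity.
    pool = [a for a in CATALOGUE
            if a["learning"] == learning and a["activity"] == activity]
    # Layer 2: dataset characteristics narrow the pool further.
    if need_scalability:
        pool = [a for a in pool if a["scalable"]]
    if outliers_are_a_problem:
        pool = [a for a in pool if a["robust"]]
    return [a["name"] for a in pool]

# Example: a large, noisy, unlabelled dataset to be clustered.
print(select("unsupervised", "clustering", True, True))  # -> ['BIRCH']
```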
4. Conclusions

This paper presented a selection framework aimed at guiding users in the selection of one or more suitable ML algorithms to use in their analyses. The proposed selection framework does not claim to be complete with respect to the available literature on the topic, or to suggest the best algorithm to the user, due to the vast complexity characterising the ML field of research and application. Moreover, this paper deals only with the framework development and not with its validation, due to the complexity of the problem and due to space limitations.

This selection framework aims at supporting researchers who are approaching the ML field, guiding them through the selection of a set of algorithms suitable for their scopes and helping them avoid the application of algorithms which are not suitable for the data they are dealing with. The selection framework covers some of the most commonly used Supervised and Unsupervised ML algorithms. The drivers presented in the selection framework constitute a solid base for the selection of proper ML algorithms, since they are easily recognizable and do not necessitate deep analysis to be identified.

The research presented in this paper is not free from limitations and possible future improvements. First, the current pool of ML algorithms lacks Semi-Supervised and Active Learning approaches. Furthermore, more classification algorithms, as well as more regression algorithms, could be considered in the framework. Examples of these algorithms are generalized linear models, Bayesian networks, linear and quadratic discriminant analysis, Gaussian processes, etc. Also, the pool of unsupervised clustering algorithms could be extended to include new algorithms such as affinity propagation, spectral clustering, Gaussian mixtures, agglomerative clustering, etc. Moreover, the level of detail used for each type of algorithm could be improved (e.g. the variants of Classification Trees and of Neural Networks, the different kernels available for the SVM, etc.). Second, the drivers list could be extended considering other characteristics of the data, such as the training time, the required parameters, etc. Third, due to the number of algorithms currently available in the framework and the possible combinations of dataset characteristics, in some cases the framework may be unable to suggest a suitable ML algorithm. Fourth, the framework currently does not take into consideration the application of data manipulation techniques that could modify the dataset structure in the pre-processing phase and, in turn, extend the pool of algorithms to be taken into account for the analysis.

Future work will encompass multiple aspects not considered in this paper, starting from the validation strategy, which should be chosen carefully to be effective, and continuing with the selection framework extension and refinement.
References

Dreiseitl, S. and Ohno-Machado, L. (2002), "Logistic regression and artificial neural network classification models: a methodology review", Journal of Biomedical Informatics, Vol. 35 No. 5–6, pp. 352–359.

European Commission. (2016), "Digitising European Industry: Reaping the full benefits of a Digital Single Market".

Guha, S., Rastogi, R. and Shim, K. (1998), "CURE: an efficient clustering algorithm for large databases", ACM SIGMOD Record, Vol. 27 No. 2, pp. 73–84.

Guha, S., Rastogi, R. and Shim, K. (2000), "Rock: a robust clustering algorithm for categorical attributes", Information Systems, Vol. 25 No. 5, pp. 345–366.

Hannan, E.J. and Quinn, B.G. (1979), "The Determination of the Order of an Autoregression", Journal of the Royal Statistical Society. Series B (Methodological), Vol. 41 No. 2, pp. 190–195.

Havens, T.C., Bezdek, J.C., Leckie, C., Hall, L.O. and Palaniswami, M. (2012), "Fuzzy c-Means algorithms for very large data", IEEE Transactions on Fuzzy Systems, Vol. 20 No. 6, pp. 1130–1146.

Hidalgo, J.I., Colmenar, J.M., Kronberger, G., Winkler, S.M., Garnica, O. and Lanchares, J. (2017), "Data Based Prediction of Blood Glucose Concentrations Using Evolutionary Methods", Journal of Medical Systems, Vol. 41 No. 9.

Kotsiantis, S.B. (2007), "Supervised machine learning: a review of classification techniques", Informatica, Vol. 31 No. 3, pp. 249–268.

Kotsiantis, S.B., Zaharakis, I. and Pintelas, P. (2007), "Supervised machine learning: A review of classification techniques", Vol. 31, pp. 249–268.

Liu, Y. (2010), "Active learning literature survey", Computer Sciences Technical Report 1648.

Microsoft. (2017), "The Machine Learning Algorithm Cheat Sheet", available at: https://docs.microsoft.com/en-us/azure/machine-learning/studio/algorithm-choice#the-machine-learning-algorithm-cheat-sheet.

Mishra, C. and Gupta, D.L. (2017), "Deep Machine Learning and Neural Networks: An Overview", IAES International Journal of Artificial Intelligence (IJ-AI), Vol. 6 No. 2, pp. 66–73.

Mitchell, T.M. (1997), Machine Learning, McGraw-Hill Science/Engineering/Math.

Pham, D.T., Dimov, S.S. and Nguyen, C.D. (2005), "Selection of K in K-means clustering", Proceedings of the Institution of Mechanical Engineers, Part C: Journal of Mechanical Engineering Science, Vol. 219 No. 1, pp. 103–119.

Pitolli, G., Aniello, L., Laurenza, G., Querzoni, L. and Baldoni, R. (2017), "Malware family identification with BIRCH clustering", Proceedings - International Carnahan Conference on Security Technology, Vol. 2017-October.

Saxena, A., Prasad, M., Gupta, A., Bharill, N., Patel, O.P., Tiwari, A., Er, M.J., et al. (2017), "A review of clustering techniques and developments", Neurocomputing, Vol. 267, pp. 664–681.

Schwarz, G. (1978), "Estimating the Dimension of a Model", The Annals of Statistics, Vol. 6 No. 2, pp. 461–464.

Shafiq, M.O. and Torunski, E. (2016), "A Parallel K-Medoids Algorithm for Clustering based on MapReduce", 2016 15th IEEE International Conference on Machine Learning and Applications, pp. 502–507.

Singh, A., Thakur, N. and Sharma, A. (2016), "A Review of Supervised Machine Learning Algorithms", 2016 International Conference on Computing for Sustainable Global Development (INDIACom), pp. 1310–1315.

Smola, A. and Vishwanathan, S.V.N. (2008), Introduction to Machine Learning.

Teng, S.-H. (2018), "Scalable Algorithms in the Age of Big Data and Network Sciences", Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining - WSDM '18, pp. 6–7.

Yildiz, B., Bilbao, J.I. and Sproul, A.B. (2017), "A review and analysis of regression and machine learning models on commercial building electricity load forecasting", Renewable and Sustainable Energy Reviews, Vol. 73, pp. 1104–1122.