Machine Learning Algorithms, Real-World Applications and Research Directions

Abstract

In the current age of the Fourth Industrial Revolution (4IR or Industry

4.0), the digital world has a wealth of data, such as Internet of

Things (IoT) data, cybersecurity data, mobile data, business data,

social media data, health data, etc. To intelligently analyze these

data and develop the corresponding smart and automated

applications, the knowledge of artificial intelligence (AI), particularly,

machine learning (ML) is the key. Various types of machine learning

algorithms such as supervised, unsupervised, semi-supervised, and

reinforcement learning exist in the area. Besides, the deep learning,

which is part of a broader family of machine learning methods, can

intelligently analyze the data on a large scale. In this paper, we

present a comprehensive view on these machine learning algorithms

that can be applied to enhance the intelligence and the capabilities

of an application. Thus, this study’s key contribution is explaining

the principles of different machine learning techniques and their

applicability in various real-world application domains, such as

cybersecurity systems, smart cities, healthcare, e-commerce,

agriculture, and many more. We also highlight the challenges and

potential research directions based on our study. Overall, this paper

aims to serve as a reference point for both academia and industry

professionals as well as for decision-makers in various real-world

situations and application areas, particularly from the technical point

of view.
Introduction

We live in the age of data, where everything around us is connected

to a data source, and everything in our lives is digitally recorded [21,

103]. For instance, the current electronic world has a wealth of

various kinds of data, such as the Internet of Things (IoT) data,

cybersecurity data, smart city data, business data, smartphone

data, social media data, health data, COVID-19 data, and many

more. The data can be structured, semi-structured, or unstructured,

discussed briefly in Sect. “Types of Real-World Data and Machine

Learning Techniques”, which is increasing day-by-day. Extracting

insights from these data can be used to build various intelligent

applications in the relevant domains. For instance, to build a data-

driven automated and intelligent cybersecurity system, the relevant

cybersecurity data can be used [105]; to build personalized context-

aware smart mobile applications, the relevant mobile data can be

used [103], and so on. Thus, data management tools and techniques
capable of extracting insights or useful knowledge from data in a
timely and intelligent way, on which real-world applications are
based, are urgently needed.

Fig. 1
The worldwide popularity score of various types of ML algorithms

(supervised, unsupervised, semi-supervised, and reinforcement) in a

range of 0 (min) to 100 (max) over time where x-axis represents the

timestamp information and y-axis represents the corresponding

score



Artificial intelligence (AI), particularly machine learning (ML), has
grown rapidly in recent years in the context of data analysis and
computing, typically allowing applications to function in an
intelligent manner [95]. ML usually provides systems with the ability
to learn and enhance from experience automatically without being
specifically programmed and is generally referred to as the most
popular latest technologies in the fourth industrial revolution (4IR or
Industry 4.0) [103, 105]. “Industry 4.0” [114] is typically the ongoing
automation of conventional manufacturing and industrial practices,
including exploratory data processing, using new smart technologies
such as machine learning automation. Thus, to intelligently analyze
these data and to develop the corresponding real-world applications,
machine learning algorithms are the key. The learning algorithms can
be categorized into four major types: supervised, unsupervised,
semi-supervised, and reinforcement learning [75], discussed briefly
in Sect. “Types of Real-World Data and
Machine Learning Techniques”. The popularity of these approaches
to learning is increasing day-by-day, which is shown in Fig. 1, based
on data collected from Google Trends [4] over the last five years.
The x-axis of the figure indicates the specific dates, and the
corresponding popularity score, within the range of 0 (minimum) to
100 (maximum), is shown on the y-axis. According to Fig. 1, the popularity

indication values for these learning types are low in 2015 and are

increasing day by day. These statistics motivate us to study
machine learning in this paper, which can play an important role in

the real-world through Industry 4.0 automation.


In general, the effectiveness and the efficiency of a machine

learning solution depend on the nature and characteristics of data

and the performance of the learning algorithms. In the area of

machine learning algorithms, classification analysis, regression, data

clustering, feature engineering and dimensionality reduction,

association rule learning, or reinforcement learning techniques exist

to effectively build data-driven systems [41, 125]. Besides, deep

learning originated from the artificial neural network that can be

used to intelligently analyze data, which is known as part of a wider

family of machine learning approaches [96]. Thus, selecting a proper

learning algorithm that is suitable for the target application in a

particular domain is challenging. The reason is that the purpose of

different learning algorithms is different, even the outcome of

different learning algorithms in a similar category may vary

depending on the data characteristics [106]. Thus, it is important to

understand the principles of various machine learning algorithms

and their applicability to apply in various real-world application

areas, such as IoT systems, cybersecurity services, business and

recommendation systems, smart cities, healthcare and COVID-19,

context-aware systems, sustainable agriculture, and many more that

are explained briefly in Sect. “Applications of Machine Learning”.


Based on the importance and potentiality of “Machine Learning” to

analyze the data mentioned above, in this paper, we provide a

comprehensive view on various types of machine learning

algorithms that can be applied to enhance the intelligence and the

capabilities of an application. Thus, the key contribution of this study

is explaining the principles and potentiality of different machine

learning techniques, and their applicability in various real-world

application areas mentioned earlier. The purpose of this paper is,

therefore, to provide a basic guide for those academia and industry

people who want to study, research, and develop data-driven

automated and intelligent systems in the relevant areas based on

machine learning techniques.

The key contributions of this paper are listed as follows:

● To define the scope of our study by taking into account the

nature and characteristics of various types of real-world

data and the capabilities of various learning techniques.

● To provide a comprehensive view on machine learning

algorithms that can be applied to enhance the intelligence

and capabilities of a data-driven application.


● To discuss the applicability of machine learning-based

solutions in various real-world application domains.

● To highlight and summarize the potential research directions

within the scope of our study for intelligent data analysis

and services.

The rest of the paper is organized as follows. The next section

presents the types of data and machine learning algorithms in a

broader sense and defines the scope of our study. We briefly discuss

and explain different machine learning algorithms in the subsequent

section followed by which various real-world application areas based

on machine learning algorithms are discussed and summarized. In

the penultimate section, we highlight several research issues and

potential future directions, and the final section concludes this

paper.

Types of Real-World Data and Machine Learning


Techniques

Machine learning algorithms typically consume and process data to

learn the related patterns about individuals, business processes,

transactions, events, and so on. In the following, we discuss various


types of real-world data as well as categories of machine learning

algorithms.

Types of Real-World Data

Usually, the availability of data is considered as the key to construct

a machine learning model or data-driven real-world systems [103,

105]. Data can be of various forms, such as structured, semi-

structured, or unstructured [41, 72]. Besides, the “metadata” is

another type that typically represents data about the data. In the

following, we briefly discuss these types of data.

● Structured: Structured data has a well-defined structure and

conforms to a data model following a standard order; it is

highly organized, easily accessed, and readily used by an

entity or a computer program. Structured data are typically

stored in well-defined schemas such as relational databases,

i.e., in a tabular format. For instance, names, dates,

addresses, credit card numbers, stock information,

geolocation, etc., are examples of structured data.


● Unstructured: On the other hand, there is no pre-defined

format or organization for unstructured data, making it

much more difficult to capture, process, and analyze, mostly

containing text and multimedia material. For example,

sensor data, emails, blog entries, wikis, and word processing

documents, PDF files, audio files, videos, images,

presentations, web pages, and many other types of business

documents can be considered as unstructured data.

● Semi-structured: Semi-structured data are not stored in a

relational database like the structured data mentioned

above, but it does have certain organizational properties

that make it easier to analyze. HTML, XML, JSON documents,

NoSQL databases, etc., are some examples of semi-

structured data.

● Metadata: It is not the normal form of data, but “data about

data”. The primary difference between “data” and

“metadata” is that data are simply the material that can

classify, measure, or even document something relative to

an organization’s data properties. On the other hand,

metadata describes the relevant data information, giving it

more significance for data users. A basic example of a

document’s metadata might be the author, file size, date

generated by the document, keywords to define the

document, etc.

In the area of machine learning and data science, researchers use

various widely used datasets for different purposes. These are, for
example, cybersecurity datasets such as NSL-KDD [119], UNSW-

NB15 [76], ISCX’12 [1], CIC-DDoS2019 [2], Bot-IoT [59], etc.,

smartphone datasets such as phone call logs [84, 101], SMS Log

[29], mobile application usage logs [117, 137], mobile phone

notification logs [73] etc., IoT data [16, 57, 62], agriculture and e-

commerce data [120, 138], health data such as heart disease [92],

diabetes mellitus [83, 134], COVID-19 [43, 74], etc., and many more

in various application domains. The data can be in different types

discussed above, which may vary from application to application in

the real world. To analyze such data in a particular problem domain,

and to extract the insights or useful knowledge from the data for

building the real-world intelligent applications, different types of

machine learning techniques can be used according to their learning

capabilities, which is discussed in the following.

Types of Machine Learning Techniques

Machine Learning algorithms are mainly divided into four categories:

Supervised learning, Unsupervised learning, Semi-supervised

learning, and Reinforcement learning [75], as shown in Fig. 2. In the

following, we briefly discuss each type of learning technique with the

scope of their applicability to solve real-world problems.


Fig. 2

Various types of machine learning techniques



Table 1 Various types of machine learning techniques with examples


● Supervised: Supervised learning is typically the task of

machine learning to learn a function that maps an input to

an output based on sample input-output pairs [41]. It uses

labeled training data and a collection of training examples

to infer a function. Supervised learning is carried out when

certain goals are identified to be accomplished from a

certain set of inputs [105], i.e., a task-driven approach. The

most common supervised tasks are “classification” that

separates the data, and “regression” that fits the data. For

instance, predicting the class label or sentiment of a piece

of text, like a tweet or a product review, i.e., text

classification, is an example of supervised learning.

● Unsupervised: Unsupervised learning analyzes unlabeled

datasets without the need for human interference, i.e., a

data-driven process [41]. This is widely used for extracting

generative features, identifying meaningful trends and


structures, groupings in results, and exploratory purposes.

The most common unsupervised learning tasks are

clustering, density estimation, feature learning,

dimensionality reduction, finding association rules, anomaly

detection, etc.

● Semi-supervised: Semi-supervised learning can be defined

as a hybridization of the above-mentioned supervised and

unsupervised methods, as it operates on both labeled and

unlabeled data [41, 105]. Thus, it falls between learning

“without supervision” and learning “with supervision”. In the

real world, labeled data could be rare in several contexts,

and unlabeled data are numerous, where semi-supervised

learning is useful [75]. The ultimate goal of a semi-

supervised learning model is to provide a better prediction

outcome than that produced using the labeled data alone.

Some application areas where semi-supervised learning is

used include machine translation,

fraud detection, labeling data and text classification.

● Reinforcement: Reinforcement learning is a type of machine

learning algorithm that enables software agents and

machines to automatically evaluate the optimal behavior in

a particular context or environment to improve its efficiency

[52], i.e., an environment-driven approach. This type of

learning is based on reward or penalty, and its ultimate goal

is to use insights obtained from interacting with the environment to

take action that increases the reward or minimizes the risk [75].


It is a powerful tool for training AI models that can help

increase automation or optimize the operational efficiency

of sophisticated systems such as robotics, autonomous

driving tasks, and manufacturing and supply chain logistics;

however, it is not preferable for solving basic or

straightforward problems.

Thus, to build effective models in various application areas, different

types of machine learning techniques can play a significant role

according to their learning capabilities, depending on the nature of

the data discussed earlier, and the target outcome. In Table 1, we

summarize various types of machine learning techniques with

examples. In the following, we provide a comprehensive view of

machine learning algorithms that can be applied to enhance the

intelligence and capabilities of a data-driven application.

Machine Learning Tasks and Algorithms

In this section, we discuss various machine learning algorithms that

include classification analysis, regression analysis, data clustering,

association rule learning, feature engineering for dimensionality

reduction, as well as deep learning methods. A general structure of a

machine learning-based predictive model has been shown in Fig. 3,


where the model is trained from historical data in phase 1 and the

outcome is generated in phase 2 for the new test data.

Fig. 3
A general structure of a machine learning based predictive model

considering both the training and testing phase


Classification Analysis

Classification is regarded as a supervised learning method in

machine learning, referring to a problem of predictive modeling as

well, where a class label is predicted for a given example [41].

Mathematically, it learns a function (f) that maps input variables (X) to

output variables (Y), i.e., targets, labels, or categories. To predict the

class of given data points, it can be carried out on structured or

unstructured data. For example, spam detection such as “spam” and

“not spam” in email service providers can be a classification

problem. In the following, we summarize the common classification

problems.
● Binary classification: It refers to the classification tasks

having two class labels such as “true and false” or “yes and

no” [41]. In such binary classification tasks, one class could

be the normal state, while the abnormal state could be

another class. For instance, “cancer not detected” is the

normal state of a task that involves a medical test, and

“cancer detected” could be considered as the abnormal

state. Similarly, “spam” and “not spam” in the above

example of email service providers are considered as binary

classification.

● Multiclass classification: Traditionally, this refers to those

classification tasks having more than two class labels [41].

The multiclass classification does not have the principle of

normal and abnormal outcomes, unlike binary classification

tasks. Instead, examples are classified as belonging to

exactly one among a range of specified classes. For example, it can be a

multiclass classification task to classify various types of

network attacks in the NSL-KDD [119] dataset, where the

attack categories are classified into four class labels, such

as DoS (Denial of Service Attack), U2R (User to Root Attack),

R2L (Root to Local Attack), and Probing Attack.

● Multi-label classification: In machine learning, multi-label

classification is an important consideration where an

example is associated with several classes or labels. Thus, it

is a generalization of multiclass classification, where the

classes involved in the problem are hierarchically


structured, and each example may simultaneously belong to

more than one class in each hierarchical level, e.g., multi-

level text classification. For instance, Google news can be

presented under the categories of a “city name”,

“technology”, or “latest news”, etc. Multi-label classification

includes advanced machine learning algorithms that support

predicting various mutually non-exclusive classes or labels,

unlike traditional classification tasks where class labels are

mutually exclusive [82].
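Such mutually non-exclusive labels are commonly encoded as a binary indicator matrix, one column per label, so a row may contain several 1s at once. A small illustration using the news-category example above (the tag names and articles are invented for the sketch):

```python
# Multi-label targets as a binary indicator matrix: one column per label,
# and a single article (row) may carry several 1s simultaneously.
labels = ["city", "technology", "latest"]        # hypothetical tag set
articles = [
    {"technology", "latest"},   # one article, two labels at once
    {"city"},
    {"city", "latest"},
]
Y = [[1 if tag in art else 0 for tag in labels] for art in articles]
print(Y)  # → [[0, 1, 1], [1, 0, 0], [1, 0, 1]]
```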

Many classification algorithms have been proposed in the machine

learning and data science literature [41, 125]. In the following, we

summarize the most common and popular methods that are used

widely in various application areas.

● Naive Bayes (NB): The naive Bayes algorithm is based on

the Bayes’ theorem with the assumption of independence

between each pair of features [51]. It works well and can be

used for both binary and multi-class categories in many

real-world situations, such as document or text

classification, spam filtering, etc. To effectively classify the

noisy instances in the data and to construct a robust

prediction model, the NB classifier can be used [94]. The key

benefit is that, compared to more sophisticated approaches,


it needs only a small amount of training data to estimate the

necessary parameters, and does so quickly [82]. However, its

performance may suffer due to its strong assumption of

feature independence. Gaussian, Multinomial,

Complement, Bernoulli, and Categorical are the common

variants of NB classifier [82].
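Under the independence assumption, training a Gaussian naive Bayes model reduces to estimating per-class priors, feature means, and variances. A minimal NumPy sketch of this idea (the toy data and the variance-smoothing constant are our own choices, not from the paper):

```python
import numpy as np

def fit_gaussian_nb(X, y):
    """Estimate per-class priors, feature means and variances (Gaussian NB)."""
    params = {}
    for c in np.unique(y):
        Xc = X[y == c]
        params[c] = (len(Xc) / len(X),       # prior P(c)
                     Xc.mean(axis=0),        # per-feature means
                     Xc.var(axis=0) + 1e-9)  # per-feature variances (smoothed)
    return params

def predict_gaussian_nb(params, x):
    """Pick the class maximizing log P(c) + sum_i log N(x_i | mu_i, var_i)."""
    best, best_score = None, -np.inf
    for c, (prior, mu, var) in params.items():
        score = np.log(prior) - 0.5 * np.sum(
            np.log(2 * np.pi * var) + (x - mu) ** 2 / var)
        if score > best_score:
            best, best_score = c, score
    return best

# Toy 2-feature dataset: class 0 clustered near (0, 0), class 1 near (4, 4)
X = np.array([[0.1, 0.2], [0.3, 0.1], [4.0, 4.1], [3.9, 4.2]])
y = np.array([0, 0, 1, 1])
model = fit_gaussian_nb(X, y)
print(predict_gaussian_nb(model, np.array([0.2, 0.0])))  # → 0
```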

● Linear Discriminant Analysis (LDA): Linear Discriminant

Analysis (LDA) is a linear decision boundary classifier

created by fitting class conditional densities to data and

applying Bayes’ rule [51, 82]. This method is also known as

a generalization of Fisher’s linear discriminant, which

projects a given dataset into a lower-dimensional space, i.e.,

a reduction of dimensionality that minimizes the complexity

of the model or reduces the resulting model’s computational

costs. The standard LDA model usually fits each class with

a Gaussian density, assuming that all classes share the

same covariance matrix [82]. LDA is closely related to

ANOVA (analysis of variance) and regression analysis, which

seek to express one dependent variable as a linear

combination of other features or measurements.

● Logistic regression (LR): Another common probability-based

statistical model used to solve classification problems in

machine learning is logistic regression (LR) [64]. Logistic

regression typically uses a logistic function to estimate the

probabilities, which is also referred to as the mathematically

defined sigmoid function in Eq. (1). It may overfit high-

dimensional datasets, and it works well when the dataset can

be separated linearly. The regularization (L1 and L2)

techniques [82] can be used to avoid over-fitting in such

scenarios. The assumption of linearity between the

dependent and independent variables is considered as a

major drawback of Logistic Regression. It can be used for

both classification and regression problems, but it is more

commonly used for classification.

● g(z) = 1 / (1 + exp(−z)).
● (1)
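Eq. (1) is easy to check numerically; the following is a small illustration of the sigmoid's behavior, not anything prescribed by the paper:

```python
import numpy as np

def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + exp(-z)) from Eq. (1)."""
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(0.0))    # → 0.5  (the decision boundary: P = 0.5 at z = 0)
print(sigmoid(10.0))   # saturates toward 1 for large positive scores
print(sigmoid(-10.0))  # saturates toward 0 for large negative scores
```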

● K-nearest neighbors (KNN): K-Nearest Neighbors (KNN) [9] is

an “instance-based learning” or non-generalizing learning,

also known as a “lazy learning” algorithm. It does not focus

on constructing a general internal model; instead, it stores

all instances corresponding to training data in n-dimensional

space. KNN uses data and classifies new data points based

on similarity measures (e.g., Euclidean distance function)

[82]. Classification is computed from a simple majority vote

of the k nearest neighbors of each point. It is quite robust to

noisy training data, and accuracy depends on the data

quality. The biggest issue with KNN is to choose the optimal

number of neighbors to be considered. KNN can be used

both for classification as well as regression.
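The majority-vote rule described above fits in a few lines; a minimal sketch with Euclidean distance (the toy clusters and the choice k = 3 are our own):

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=3):
    """Classify x by majority vote among its k nearest training points
    (Euclidean distance), i.e., lazy learning with no trained model."""
    dists = np.linalg.norm(X_train - x, axis=1)   # distance to every point
    nearest = np.argsort(dists)[:k]               # indices of k closest
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical toy data: two well-separated clusters labeled "a" and "b"
X = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]])
y = np.array(["a", "a", "a", "b", "b", "b"])
print(knn_predict(X, y, np.array([0.5, 0.5])))  # → a
print(knn_predict(X, y, np.array([5.5, 5.5])))  # → b
```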

● Support vector machine (SVM): In machine learning, another

common technique that can be used for classification,

regression, or other tasks is a support vector machine (SVM)


[56]. In high- or infinite-dimensional space, a support vector

machine constructs a hyper-plane or set of hyper-planes.

Intuitively, the hyper-plane, which has the greatest distance

from the nearest training data points in any class, achieves

a strong separation since, in general, the greater the

margin, the lower the classifier’s generalization error. It is

effective in high-dimensional spaces and can behave

differently based on different mathematical functions known

as the kernel. Linear, polynomial, radial basis function (RBF),

sigmoid, etc., are the popular kernel functions used in SVM

classifier [82]. However, when the data set contains more

noise, such as overlapping target classes, SVM does not

perform well.

● Decision tree (DT): Decision tree (DT) [88] is a well-known

non-parametric supervised learning method. DT learning

methods are used for both the classification and regression

tasks [82]. ID3 [87], C4.5 [88], and CART [20] are well

known for DT algorithms. Moreover, recently proposed

BehavDT [100], and IntrudTree [97] by Sarker et al. are

effective in the relevant application domains, such as user

behavior analytics and cybersecurity analytics, respectively.

By sorting down the tree from the root to some leaf nodes,

as shown in Fig. 4, DT classifies the instances. Instances are

classified by checking the attribute defined by that node,

starting at the root node of the tree, and then moving down

the tree branch corresponding to the attribute value. For


splitting, the most popular criteria are “gini” for the Gini

impurity and “entropy” for the information gain that can be

expressed mathematically as [82].

● Entropy: H(x) = − Σ_{i=1}^{n} p(x_i) log₂ p(x_i)
● (2)

● Gini(E) = 1 − Σ_{i=1}^{c} p_i².
● (3)
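Both splitting criteria of Eqs. (2) and (3) can be verified on simple class distributions (the example distributions below are ours): a pure node has zero impurity under either measure, while a 50/50 split maximizes both.

```python
import math

def entropy(probs):
    """H(x) = -sum p_i * log2(p_i), Eq. (2); the 0*log(0) term is taken as 0."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def gini(probs):
    """Gini(E) = 1 - sum p_i^2, Eq. (3)."""
    return 1 - sum(p * p for p in probs)

print(entropy([1.0]), gini([1.0]))            # pure node → 0.0 0.0
print(entropy([0.5, 0.5]), gini([0.5, 0.5]))  # 50/50 split → 1.0 0.5
```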

Fig. 4
An example of a decision tree structure

Full size image

Fig. 5
An example of a random forest structure considering multiple

decision trees

Full size image


● Random forest (RF): A random forest classifier [19] is well

known as an ensemble classification technique that is used

in the field of machine learning and data science in various

application areas. This method uses “parallel ensembling”

which fits several decision tree classifiers in parallel, as

shown in Fig. 5, on different data set sub-samples and uses

majority voting or averages for the outcome or final result. It

thus minimizes the over-fitting problem and increases the

prediction accuracy and control [82]. Therefore, the RF

learning model with multiple decision trees is typically more

accurate than a single decision tree based model [106]. To

build a series of decision trees with controlled variation, it

combines bootstrap aggregation (bagging) [18] and random

feature selection [11]. It is adaptable to both classification

and regression problems and fits well for both categorical

and continuous values.
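The core of this parallel ensembling, bootstrap sampling plus majority voting, can be sketched compactly. The one-dimensional threshold "stumps" and toy data below are our own simplification: a real random forest grows full decision trees and also randomizes the features considered at each split.

```python
import numpy as np

rng = np.random.default_rng(1)

def fit_stump(x, y):
    """A tiny 1-D 'decision tree': the threshold minimizing training error."""
    best = None
    for thr in x:
        err = np.sum(np.where(x > thr, 1, 0) != y)
        if best is None or err < best[0]:
            best = (err, thr)
    return best[1]

def bagged_predict(x_train, y_train, x_new, n_trees=25):
    """Parallel ensembling: each stump is fit on a bootstrap sub-sample,
    and the final label is the majority vote across all stumps."""
    votes = 0
    for _ in range(n_trees):
        idx = rng.integers(0, len(x_train), len(x_train))  # bootstrap sample
        thr = fit_stump(x_train[idx], y_train[idx])
        votes += int(x_new > thr)
    return int(votes > n_trees / 2)

x = np.array([0.0, 1.0, 2.0, 3.0, 6.0, 7.0, 8.0, 9.0])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])
print(bagged_predict(x, y, 1.5), bagged_predict(x, y, 7.5))  # → 0 1
```

Individual bootstrap stumps may pick poor thresholds, but the majority vote smooths them out, which is exactly the variance-reduction argument made above.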

● Adaptive Boosting (AdaBoost): Adaptive Boosting

(AdaBoost) is an ensemble learning process that employs an

iterative approach to improve poor classifiers by learning

from their errors. It was developed by Yoav Freund et al.

[35] and is also known as “meta-learning”. Unlike the random

forest that uses parallel ensembling, Adaboost uses

“sequential ensembling”. It creates a powerful classifier by

combining many poorly performing classifiers to obtain a

good classifier of high accuracy. In that sense, AdaBoost is

called an adaptive classifier, as it significantly improves the

efficiency of the classifier, though in some instances it can

trigger overfitting. AdaBoost is best used to boost the

performance of decision trees, its base estimator [82], on

binary classification problems; however, it is sensitive to noisy

data and outliers.
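The sequential reweighting at the heart of AdaBoost can be sketched with one-dimensional threshold stumps. The dataset, the round count, and the stump learner below are illustrative simplifications of the scheme, not the paper's own formulation:

```python
import numpy as np

def train_adaboost(x, y, n_rounds=3):
    """Sequentially fit 1-D threshold stumps; after each round the sample
    weights of misclassified points are increased (AdaBoost-style)."""
    w = np.full(len(x), 1.0 / len(x))
    stumps = []
    for _ in range(n_rounds):
        best = None
        for thr in x:                              # candidate thresholds
            for sign in (1, -1):
                pred = sign * np.where(x > thr, 1, -1)
                err = w[pred != y].sum()           # weighted error
                if best is None or err < best[0]:
                    best = (err, thr, sign)
        err, thr, sign = best
        err = min(max(err, 1e-10), 1 - 1e-10)      # avoid log(0)
        alpha = 0.5 * np.log((1 - err) / err)      # stump weight
        pred = sign * np.where(x > thr, 1, -1)
        w = w * np.exp(-alpha * y * pred)          # boost misclassified points
        w /= w.sum()
        stumps.append((alpha, thr, sign))
    return stumps

def predict_adaboost(stumps, x):
    """Weighted vote of all stumps, i.e., sequential ensembling."""
    score = sum(a * s * np.where(x > t, 1, -1) for a, t, s in stumps)
    return np.sign(score)

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([-1, -1, -1, 1, 1, 1])
stumps = train_adaboost(x, y)
print(predict_adaboost(stumps, x))   # recovers the training labels
```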

● Extreme gradient boosting (XGBoost): Gradient Boosting,

like Random Forests [19] above, is an ensemble learning

algorithm that generates a final model based on a series of

individual models, typically decision trees. The gradient is

used to minimize the loss function, similar to how neural

networks [41] use gradient descent to optimize weights.

Extreme Gradient Boosting (XGBoost) is a form of gradient

boosting that takes more detailed approximations into

account when determining the best model [82]. It computes

second-order gradients of the loss function to minimize loss

and uses advanced regularization (L1 and L2) [82], which reduces

over-fitting, and improves model generalization and

performance. XGBoost is fast to interpret and can handle

large-sized datasets well.

● Stochastic gradient descent (SGD): Stochastic gradient


descent (SGD) [41] is an iterative method for optimizing an
objective function with appropriate smoothness properties,
where the word ‘stochastic’ refers to random probability.
This reduces the computational burden, particularly in high-
dimensional optimization problems, allowing for faster
iterations in exchange for a lower convergence rate. A
gradient is the slope of a function that calculates a
variable’s degree of change in response to another
variable’s changes. Mathematically, gradient descent operates on a
convex cost function, taking partial derivatives with respect to its
input parameters. Let α be the learning rate and J_i the cost of the
i-th training example; then Eq. (4) represents the stochastic gradient
descent weight update method at the j-th iteration. In large-scale and
sparse machine learning, SGD has been successfully applied to problems
often encountered in text classification and natural language
processing [82]. However, SGD is sensitive to feature scaling and
needs a range of hyperparameters, such as the regularization parameter
and the number of iterations.

● w_j := w_j − α ∂J_i/∂w_j.
● (4)
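Eq. (4) can be illustrated on simple linear regression with a squared-error cost, updating the weights from one randomly sampled example per step. The synthetic data, learning rate, and step count below are arbitrary choices made for the illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, 200)
y = 3.0 * x + 1.0 + rng.normal(0.0, 0.01, 200)  # underlying line y = 3x + 1

# Model y_hat = w1*x + w0 with squared-error cost J_i = (y_hat_i - y_i)^2 / 2.
# Eq. (4): each step samples one example i and moves every weight w_j
# against the partial derivative of that single example's cost J_i.
w0, w1, alpha = 0.0, 0.0, 0.1
for _ in range(20000):
    i = rng.integers(len(x))
    err = (w1 * x[i] + w0) - y[i]   # residual on the sampled example
    w1 -= alpha * err * x[i]        # dJ_i/dw1 = err * x_i
    w0 -= alpha * err               # dJ_i/dw0 = err

print(round(w1, 2), round(w0, 2))   # close to the true slope 3 and intercept 1
```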

● Rule-based classification: The term rule-based classification

can be used to refer to any classification scheme that makes

use of IF-THEN rules for class prediction. Several

classification algorithms such as Zero-R [125], One-R [47],

decision trees [87, 88], DTNB [110], Ripple Down Rule

learner (RIDOR) [125], Repeated Incremental Pruning to

Produce Error Reduction (RIPPER) [126] exist with the ability

of rule generation. The decision tree is one of the most

common rule-based classification algorithms among these

techniques because it has several advantages, such as

being easier to interpret; the ability to handle high-

dimensional data; simplicity and speed; good accuracy; and

the capability to produce rules that are clear and

understandable to humans [127, 128]. The decision tree-

based rules also provide significant accuracy in a prediction

model for unseen test cases [106]. Since the rules are easily

interpretable, these rule-based classifiers are often used to

produce descriptive models that can describe a system

including the entities and their relationships.

Fig. 6
Classification vs. regression. In classification the dotted line

represents a linear boundary that separates the two classes; in

regression, the dotted line models the linear relationship between

the two variables


Regression Analysis

Regression analysis includes several machine learning methods

that allow one to predict a continuous (y) outcome variable based on the

value of one or more (x) predictor variables [41]. The most

significant distinction between classification and regression is that

classification predicts distinct class labels, while regression

facilitates the prediction of a continuous quantity. Figure 6 shows an

example of how classification differs from regression models.

Some overlaps are often found between the two types of machine

learning algorithms. Regression models are now widely used in a


variety of fields, including financial forecasting or prediction, cost

estimation, trend analysis, marketing, time series estimation, drug

response modeling, and many more. Some of the familiar types of

regression algorithms are linear, polynomial, lasso and ridge

regression, etc., which are explained briefly in the following.

● Simple and multiple linear regression: This is one of the

most popular ML modeling techniques as well as a well-

known regression technique. In this technique, the

dependent variable is continuous, the independent

variable(s) can be continuous or discrete, and the form of

the regression line is linear. Linear regression creates a

relationship between the dependent variable (Y) and one or

more independent variables (X) (also known as regression

line) using the best fit straight line [41]. It is defined by the

following equations:

● y = a + bx + e,
● (5)

● y = a + b1x1 + b2x2 + ... + bnxn + e,
● (6)

● where a is the intercept, b is the slope of the line, and e is

the error term. This equation can be used to predict the

value of the target variable based on the given predictor

variable(s). Multiple linear regression is an extension of


simple linear regression that allows two or more predictor

variables to model a response variable, y, as a linear

function [41] defined in Eq. 6, whereas simple linear

regression has only 1 independent variable, defined in Eq. 5.
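As an illustrative sketch (not part of the original study), Eq. 5 can be fitted by ordinary least squares in a few lines of Python; the data values below are invented for demonstration:

```python
import numpy as np

# Simple linear regression y = a + b*x fitted by least squares.
# Invented data, roughly y = 2 + 3x with a little noise.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 4.9, 8.2, 10.9, 14.1])

# Design matrix with a column of ones for the intercept a (Eq. 5).
X = np.column_stack([np.ones_like(x), x])
(a, b), *_ = np.linalg.lstsq(X, y, rcond=None)

print(round(a, 2), round(b, 2))  # intercept close to 2, slope close to 3
y_pred = a + b * 5.0             # predict for a new x value
```

Multiple linear regression (Eq. 6) works the same way, with one design-matrix column per predictor variable.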

● Polynomial regression: Polynomial regression is a form of

regression analysis in which the relationship between the

independent variable x and the dependent variable y is not

linear, but is modeled as an nth degree polynomial in x [82].

The equation for polynomial regression is also derived from

the linear regression (polynomial regression of degree 1)

equation, which is defined as below:

● y = b0 + b1x + b2x^2 + b3x^3 + ... + bnx^n + e.
● (7)

● Here, y is the predicted/target output, b0, b1, ..., bn are

the regression coefficients, and x is an independent/input

variable. In simple words, we can say that if the data are

not distributed linearly but instead follow an nth degree

polynomial, then we use polynomial regression to get the

desired output.
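To make Eq. 7 concrete, the following minimal sketch (with invented, noise-free quadratic data) fits a degree-2 polynomial using numpy:

```python
import numpy as np

# Quadratic data y = 1 + 2x + 3x^2, fitted with a degree-2 polynomial
# (Eq. 7). np.polyfit returns coefficients with the highest power first.
x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
y = 1 + 2 * x + 3 * x**2

b2, b1, b0 = np.polyfit(x, y, deg=2)
print(round(b0, 2), round(b1, 2), round(b2, 2))  # recovers 1, 2 and 3
```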

● LASSO and ridge regression: LASSO and Ridge regression

are well-known, powerful techniques that are typically

used for building learning models in the presence of a large

number of features, owing to their capability to prevent

over-fitting and reduce the complexity of the model. The

LASSO (least absolute shrinkage and selection operator)

regression model uses L1 regularization technique [82] that

uses shrinkage, which penalizes “absolute value of


magnitude of coefficients” (L1 penalty). As a result, LASSO

appears to render coefficients to absolute zero. Thus, LASSO

regression aims to find the subset of predictors that

minimizes the prediction error for a quantitative response

variable. On the other hand, ridge regression uses L2

regularization [82], which is the “squared magnitude of

coefficients” (L2 penalty). Thus, ridge regression forces the

weights to be small but never sets the coefficient value to

zero, yielding a non-sparse solution. Overall, LASSO

regression is useful to obtain a subset of predictors by

eliminating less important features, and ridge regression is

useful when a data set has “multicollinearity” which refers

to the predictors that are correlated with other predictors.
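Ridge has the closed-form solution w = (X'X + alpha*I)^(-1) X'y; lasso has no closed form and is usually fitted iteratively (e.g., by coordinate descent), so only ridge is sketched below, on synthetic data, to show the L2 shrinkage effect:

```python
import numpy as np

# Ridge regression in closed form: w = (X^T X + alpha*I)^{-1} X^T y.
# Illustrates the L2 penalty shrinking weights toward (but not to) zero.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
true_w = np.array([4.0, 0.0, -2.0])
y = X @ true_w

def ridge(X, y, alpha):
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)

w_small = ridge(X, y, alpha=0.01)   # close to the unpenalized solution
w_large = ridge(X, y, alpha=100.0)  # heavily shrunk, but not exactly zero
```

As the text notes, the L2 penalty only shrinks coefficients; an L1 penalty (lasso) would drive some of them exactly to zero.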

Cluster Analysis

Cluster analysis, also known as clustering, is an unsupervised

machine learning technique for identifying and grouping related data

points in large datasets without concern for the specific outcome. It

groups a collection of objects in such a way that objects in

the same category, called a cluster, are in some sense more similar

to each other than objects in other groups [41]. It is often used as a

data analysis technique to discover interesting trends or patterns in

data, e.g., groups of consumers based on their behavior. In a broad


range of application areas, such as cybersecurity, e-commerce,

mobile data processing, health analytics, user modeling and

behavioral analytics, clustering can be used. In the following, we

briefly discuss and summarize various types of clustering methods.

● Partitioning methods: Based on the features and similarities

in the data, this clustering approach categorizes the data

into multiple groups or clusters. The data scientists or

analysts typically determine the number of clusters to

produce, either dynamically or statically, depending on the

nature of the target application. The most common clustering

algorithms based on partitioning methods are K-means [69],

K-Medoids [80], CLARA [55], etc.

● Density-based methods: To identify distinct groups or

clusters, it uses the concept that a cluster in the data space

is a contiguous region of high point density isolated from

other such clusters by contiguous regions of low point

density. Points that are not part of a cluster are considered

as noise. The typical clustering algorithms based on density

are DBSCAN [32], OPTICS [12] etc. The density-based

methods typically struggle with clusters of similar density

and high dimensionality data.


● Hierarchical-based methods: Hierarchical clustering typically

seeks to construct a hierarchy of clusters, i.e., the tree

structure. Strategies for hierarchical clustering generally fall

into two types: (i) Agglomerative—a “bottom-up” approach

in which each observation begins in its own cluster and pairs of

clusters are merged as one moves up the hierarchy, and

(ii) Divisive—a “top-down” approach in which all

observations begin in one cluster and splits are performed

recursively as one moves down the hierarchy, as shown in Fig. 7.

Our earlier proposed BOTS technique, Sarker et al. [102] is

an example of a hierarchical, particularly, bottom-up

clustering algorithm.

● Grid-based methods: To deal with massive datasets, grid-

based clustering is especially suitable. To obtain clusters,

the principle is first to summarize the dataset with a grid

representation and then to combine grid cells. STING [122],

CLIQUE [6], etc. are the standard algorithms of grid-based

clustering.

● Model-based methods: There are mainly two types of model-

based clustering algorithms: one that uses statistical

learning, and the other based on a method of neural

network learning [130]. For instance, GMM [89] is an

example of a statistical learning method, and SOM [22] [96]

is an example of a neural network learning method.

● Constraint-based methods: Constrained-based clustering is

a semi-supervised approach to data clustering that uses


constraints to incorporate domain knowledge. Application or

user-oriented constraints are incorporated to perform the

clustering. The typical algorithms of this kind of clustering

are COP K-means [121], CMWK-Means [27], etc.

Fig. 7
A graphical interpretation of the widely-used hierarchical clustering

(Bottom-up and top-down) technique


Many clustering algorithms with the ability to group data have been

proposed in the machine learning and data science literature [41,

125]. In the following, we summarize the popular methods that are

used widely in various application areas.

● K-means clustering: K-means clustering [69] is a fast, robust,

and simple algorithm that provides reliable results when

data sets are well-separated from each other. The data

points are allocated to a cluster in this algorithm in such a

way that the amount of the squared distance between the

data points and the centroid is as small as possible. In other

words, the K-means algorithm identifies the k number of


centroids and then assigns each data point to the nearest

cluster while keeping the centroids as small as possible.

Since it begins with a random selection of cluster centers,

the results can be inconsistent. Since extreme values can

easily affect a mean, the K-means clustering algorithm is

sensitive to outliers. K-medoids clustering [91] is a variant of

K-means that is more robust to noise and outliers.
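The assign-then-update loop of K-means (Lloyd's algorithm) can be sketched in plain numpy. The blob data are invented, and the initial centroids are passed in explicitly for reproducibility; real implementations use random or k-means++ initialization:

```python
import numpy as np

def kmeans(X, k, init, n_iter=100):
    """Plain Lloyd's algorithm: assign each point to its nearest centroid,
    then recompute each centroid as the mean of its assigned points."""
    centroids = np.asarray(init, dtype=float)
    for _ in range(n_iter):
        # d[i, j] = squared distance from point i to centroid j
        d = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                        else centroids[j] for j in range(k)])  # keep empty clusters
        if np.allclose(new, centroids):
            break  # assignments are stable: converged
        centroids = new
    return labels, centroids

# Two well-separated blobs: k-means should recover them.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.5, size=(10, 2)),
               rng.normal(10.0, 0.5, size=(10, 2))])
labels, centroids = kmeans(X, k=2, init=X[[0, 10]])
```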

● Mean-shift clustering: Mean-shift clustering [37] is a

nonparametric clustering technique that does not require

prior knowledge of the number of clusters or constraints on

cluster shape. Mean-shift clustering aims to discover “blobs”

in a smooth distribution or density of samples [82]. It is a

centroid-based algorithm that works by updating centroid

candidates to be the mean of the points in a given region. To

form the final set of centroids, these candidates are filtered

in a post-processing stage to remove near-duplicates.

Computer vision and image processing are examples of its

application domains. Mean shift has the

disadvantage of being computationally expensive.

Moreover, in cases of high dimension, where the number of

clusters shifts abruptly, the mean-shift algorithm does not

work well.

● DBSCAN: Density-based spatial clustering of applications

with noise (DBSCAN) [32] is a base algorithm for density-

based clustering which is widely used in data mining and

machine learning. This is known as a non-parametric


density-based clustering technique for separating high-

density clusters from low-density clusters that are used in

model building. DBSCAN’s main idea is that a point belongs

to a cluster if it is close to many points from that cluster. It

can find clusters of various shapes and sizes in a vast

volume of data that is noisy and contains outliers. DBSCAN,

unlike k-means, does not require a priori specification of the

number of clusters in the data and can find arbitrarily

shaped clusters. Although k-means is much faster than

DBSCAN, DBSCAN remains efficient at finding high-density

regions and outliers, i.e., it is robust to outliers.
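The core idea that "a point belongs to a cluster if it is close to many points from that cluster" can be sketched as follows. This is a minimal, quadratic-time illustration on invented points, not a production implementation (which would use spatial indexes):

```python
import numpy as np

def dbscan(X, eps, min_pts):
    """Minimal DBSCAN sketch: grow clusters from core points,
    mark unreachable points as noise (-1)."""
    n = len(X)
    # Pairwise distances and each point's eps-neighborhood (incl. itself).
    d = np.sqrt(((X[:, None] - X[None, :]) ** 2).sum(-1))
    neighbors = [np.where(d[i] <= eps)[0] for i in range(n)]
    labels = np.full(n, -1)
    cluster = 0
    for i in range(n):
        if labels[i] != -1 or len(neighbors[i]) < min_pts:
            continue  # already assigned, or not a core point
        stack = [i]   # grow a new cluster from core point i
        while stack:
            j = stack.pop()
            if labels[j] == -1:
                labels[j] = cluster
                if len(neighbors[j]) >= min_pts:  # expand only from cores
                    stack.extend(neighbors[j])
        cluster += 1
    return labels

# Two dense groups plus one far-away outlier (labeled -1 as noise).
X = np.array([[0, 0], [0, 0.1], [0.1, 0],
              [5, 5], [5, 5.1], [5.1, 5], [20, 20]])
labels = dbscan(X, eps=0.5, min_pts=2)
```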

● GMM clustering: Gaussian mixture models (GMMs) are often

used for data clustering, which is a distribution-based

clustering algorithm. A Gaussian mixture model is a

probabilistic model in which all the data points are produced

by a mixture of a finite number of Gaussian distributions

with unknown parameters [82]. To find the Gaussian

parameters for each cluster, an optimization algorithm

called expectation-maximization (EM) [82] can be used. EM

is an iterative method that uses a statistical model to

estimate the parameters. In contrast to k-means, Gaussian

mixture models account for uncertainty and return the

likelihood that a data point belongs to one of the k clusters.

GMM clustering is more robust than k-means and works well

even with non-linear data distributions.


● Agglomerative hierarchical clustering: The most common

method of hierarchical clustering used to group objects in

clusters based on their similarity is agglomerative

clustering. This technique uses a bottom-up approach,

where each object is first treated as a singleton cluster by

the algorithm. Following that, pairs of clusters are merged

one by one until all clusters have been merged into a single

large cluster containing all objects. The result is a

dendrogram, which is a tree-based representation of the

elements. Single linkage [115], Complete linkage [116],

BOTS [102] etc. are some examples of such techniques. The

main advantage of agglomerative hierarchical clustering

over k-means is that the tree-structure hierarchy generated

by agglomerative clustering is more informative than the

unstructured collection of flat clusters returned by k-means,

which can help to make better decisions in the relevant

application areas.
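The bottom-up merging described above can be sketched with single linkage. This naive version (invented data, quadratic cluster scan per merge) is for illustration only; practical implementations use far more efficient linkage updates:

```python
import numpy as np

def single_linkage(X, n_clusters):
    """Naive agglomerative clustering: start with one cluster per point,
    repeatedly merge the two clusters whose closest members are nearest,
    until n_clusters remain (single linkage)."""
    clusters = [[i] for i in range(len(X))]
    d = np.sqrt(((X[:, None] - X[None, :]) ** 2).sum(-1))
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Single linkage: distance of the closest pair of members.
                dist = min(d[i, j] for i in clusters[a] for j in clusters[b])
                if best is None or dist < best[0]:
                    best = (dist, a, b)
        _, a, b = best
        clusters[a].extend(clusters.pop(b))  # merge the closest pair
    return clusters

X = np.array([[0.0, 0.0], [0.2, 0.0], [5.0, 5.0], [5.2, 5.0]])
clusters = single_linkage(X, n_clusters=2)  # groups {0, 1} and {2, 3}
```

Recording the merge order and distances would yield the dendrogram mentioned above.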

Dimensionality Reduction and Feature Learning

In machine learning and data science, high-dimensional data

processing is a challenging task for both researchers and application

developers. Thus, dimensionality reduction which is an unsupervised

learning technique, is important because it leads to better human


interpretations, lower computational costs, and avoids overfitting

and redundancy by simplifying models. Both the process of feature

selection and feature extraction can be used for dimensionality

reduction. The primary distinction between the selection and

extraction of features is that the “feature selection” keeps a subset

of the original features [97], while “feature extraction” creates brand

new ones [98]. In the following, we briefly discuss these techniques.

● Feature selection: The selection of features, also known as

the selection of variables or attributes in the data, is the

process of choosing a subset of unique features (variables,

predictors) to use in building machine learning and data

science model. It decreases a model’s complexity by

eliminating the irrelevant or less important features and

allows for faster training of machine learning algorithms. A

right and optimal subset of the selected features in a

problem domain can minimize the overfitting

problem by simplifying and generalizing the model as

well as increase the model’s accuracy [97]. Thus, “feature

selection” [66, 99] is considered as one of the primary

concepts in machine learning that greatly affects the

effectiveness and efficiency of the target machine learning

model. Chi-squared test, Analysis of variance (ANOVA) test,


Pearson’s correlation coefficient, recursive feature

elimination, are some popular techniques that can be used

for feature selection.

● Feature extraction: In a machine learning-based model or

system, feature extraction techniques usually provide a

better understanding of the data, a way to improve

prediction accuracy, and to reduce computational cost or

training time. The aim of “feature extraction” [66, 99] is to

reduce the number of features in a dataset by generating

new ones from the existing ones and then discarding the

original features. The majority of the information found in

the original set of features can then be summarized using

this new reduced set of features. For instance, principal

components analysis (PCA) is often used as a

dimensionality-reduction technique to extract a lower-

dimensional space by creating brand-new components from the

existing features in a dataset [98].

Many algorithms have been proposed to reduce data dimensions in

the machine learning and data science literature [41, 125]. In the

following, we summarize the popular methods that are used widely

in various application areas.


● Variance threshold: A simple basic approach to feature

selection is the variance threshold [82]. This excludes all

features of low variance, i.e., all features whose variance

does not exceed the threshold. It eliminates all zero-

variance characteristics by default, i.e., characteristics that

have the same value in all samples. This feature selection

algorithm looks only at the features (X), not the desired

outputs (y), and can, therefore, be used for unsupervised

learning.
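A minimal sketch of this idea (invented feature matrix) in numpy:

```python
import numpy as np

# Drop features whose variance does not exceed a threshold; a constant
# column (zero variance) is always removed.
X = np.array([[1.0, 0.0, 3.0],
              [2.0, 0.0, 3.1],
              [3.0, 0.0, 2.9],
              [4.0, 0.0, 3.0]])

threshold = 0.0
keep = X.var(axis=0) > threshold   # uses only X, never y: unsupervised
X_reduced = X[:, keep]
print(X_reduced.shape)  # the constant middle column is gone
```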

● Pearson correlation: Pearson’s correlation is another method

to understand a feature’s relation to the response variable

and can be used for feature selection [99]. This method is

also used for finding the association between the features in

a dataset. The resulting value lies in [−1, 1], where −1

means perfect negative correlation, +1 means perfect

positive correlation, and 0 means that the two variables do

not have a linear correlation. If two random variables are

represented by X and Y, then the correlation coefficient

between X and Y is defined as [41]

● r(X, Y) = Σ_{i=1}^{n} (Xi − X̄)(Yi − Ȳ) / √( Σ_{i=1}^{n} (Xi − X̄)^2 · Σ_{i=1}^{n} (Yi − Ȳ)^2 ).
● (8)
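Eq. 8 can be checked directly against numpy's built-in correlation on invented data:

```python
import numpy as np

# Pearson correlation (Eq. 8) computed from its definition and via
# np.corrcoef. Y = 2X, so the correlation is a perfect +1.
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([2.0, 4.0, 6.0, 8.0, 10.0])

num = ((X - X.mean()) * (Y - Y.mean())).sum()
den = np.sqrt(((X - X.mean()) ** 2).sum() * ((Y - Y.mean()) ** 2).sum())
r = num / den

print(round(r, 4))  # perfect positive correlation
```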

● ANOVA: Analysis of variance (ANOVA) is a statistical tool

used to verify the mean values of two or more groups that

differ significantly from each other. ANOVA assumes a linear

relationship between the variables and the target and the


variables’ normal distribution. To statistically test the

equality of means, the ANOVA method utilizes F tests. For

feature selection, the resulting ‘ANOVA F value’ [82] of this

test can be used, whereby certain features independent of the

target variable can be omitted.

● Chi square: The chi-square (χ^2) statistic [82] is an estimate

of the difference between the observed and expected

frequencies of a series of events or variables. The

magnitude of χ^2 depends on the difference between the real

and observed values, the degrees of freedom, and the sample

size. The chi-square statistic is commonly used for testing

relationships between categorical variables. If Oi represents

the observed value and Ei represents the expected value, then

● χ^2 = Σ_{i=1}^{n} (Oi − Ei)^2 / Ei.
● (9)
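A direct computation of Eq. 9 on an invented fair-die example (10 expected occurrences of each face in 60 rolls):

```python
import numpy as np

# Chi-square statistic (Eq. 9): observed vs. expected counts.
observed = np.array([8, 9, 12, 11, 10, 10])
expected = np.full(6, 10.0)

chi2 = ((observed - expected) ** 2 / expected).sum()
print(round(chi2, 2))
```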

● Recursive feature elimination (RFE): Recursive Feature

Elimination (RFE) is a brute force approach to feature

selection. RFE [82] fits the model and removes the weakest

feature before it meets the specified number of features.

Features are ranked by the coefficients or feature

significance of the model. RFE aims to remove

dependencies and collinearity in the model by recursively

removing a small number of features per iteration.
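The fit-rank-drop loop can be sketched as follows. The importance measure here, absolute least-squares coefficients, is an assumption for illustration; RFE can rank features with any model's coefficients or importances:

```python
import numpy as np

def rfe(X, y, n_keep):
    """RFE sketch: fit a linear model, drop the feature with the smallest
    absolute coefficient, and repeat until n_keep features remain."""
    features = list(range(X.shape[1]))
    while len(features) > n_keep:
        w, *_ = np.linalg.lstsq(X[:, features], y, rcond=None)
        weakest = int(np.argmin(np.abs(w)))
        features.pop(weakest)  # eliminate the weakest feature
    return features

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = 5.0 * X[:, 0] - 3.0 * X[:, 2]   # only features 0 and 2 matter
selected = rfe(X, y, n_keep=2)
print(sorted(selected))             # the two informative features survive
```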


● Model-based selection: To reduce the dimensionality of the

data, linear models penalized with the L1 regularization can

be used. Least absolute shrinkage and selection operator

(Lasso) regression is a type of linear regression that has the

property of shrinking some of the coefficients to zero [82].

A feature whose coefficient shrinks to zero can therefore be

removed from the model. Thus, the penalized lasso regression

method is often used in machine learning to select a subset

of variables. Extra

Trees Classifier [82] is an example of a tree-based estimator

that can be used to compute impurity-based feature

importance, which can then be used to discard irrelevant

features.

● Principal component analysis (PCA): Principal component

analysis (PCA) is a well-known unsupervised learning

approach in the field of machine learning and data science.

PCA is a mathematical technique that transforms a set of

correlated variables into a set of uncorrelated variables

known as principal components [48, 81]. Figure 8 shows an

example of the effect of PCA on various dimensions space,

where Fig. 8a shows the original features in 3D space, and

Fig. 8b shows the created principal components PC1 and

PC2 onto a 2D plane, and 1D line with the principal

component PC1 respectively. Thus, PCA can be used as a

feature extraction technique that reduces the dimensionality

of the datasets, and to build an effective machine learning

model [98]. Technically, PCA identifies the eigenvectors

of a covariance matrix with the highest eigenvalues and then

uses those to project the data into a new

subspace of equal or fewer dimensions [82].

Fig. 8

An example of a principal component analysis (PCA) and created

principal components PC1 and PC2 in different dimension space


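The eigendecomposition view of PCA can be sketched in numpy, mirroring the 2D-to-1D projection of Fig. 8 on invented, nearly collinear data:

```python
import numpy as np

# PCA via eigendecomposition of the covariance matrix: project 2D data
# that varies almost entirely along one direction onto one component.
rng = np.random.default_rng(0)
t = rng.normal(size=100)
X = np.column_stack([t, t + 0.01 * rng.normal(size=100)])  # nearly collinear

Xc = X - X.mean(axis=0)                 # center the data
cov = np.cov(Xc, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvalues in ascending order
pc1 = eigvecs[:, -1]                    # direction of largest variance
X_1d = Xc @ pc1                         # data projected onto PC1

explained = eigvals[-1] / eigvals.sum()
print(explained > 0.99)  # one component retains almost all the variance
```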

Association Rule Learning

Association rule learning is a rule-based machine learning approach

to discover interesting relationships, “IF-THEN” statements, in large

datasets between variables [7]. One example is that “if a customer

buys a computer or laptop (an item), s/he is likely to also buy anti-

virus software (another item) at the same time”. Association rules

are employed today in many application areas, including IoT

services, medical diagnosis, usage behavior analytics, web usage

mining, smartphone applications, cybersecurity applications, and

bioinformatics. In comparison to sequence mining, association rule

learning does not usually take into account the order of things within

or across transactions. A common way of measuring the usefulness

of association rules is to use its parameters, ‘support’ and

‘confidence’, which are introduced in [7].


In the data mining literature, many association rule learning

methods have been proposed, such as logic dependent [34],

frequent pattern based [8, 49, 68], and tree-based [42]. The most

popular association rule learning algorithms are summarized below.

● AIS and SETM: AIS is the first algorithm proposed by Agrawal

et al. [7] for association rule mining. The AIS algorithm’s

main downside is that too many candidate itemsets are

generated, requiring more space and wasting a lot of effort.

This algorithm calls for too many passes over the entire

dataset to produce the rules. Another approach SETM [49]

exhibits good performance and stable behavior with

execution time; however, it suffers from the same flaw as

the AIS algorithm.

● Apriori: For generating association rules for a given dataset,

Agrawal et al. [8] proposed the Apriori, Apriori-TID, and

Apriori-Hybrid algorithms. These later algorithms outperform

the AIS and SETM mentioned above due to the Apriori

property of frequent itemset [8]. The term ‘Apriori’ usually

refers to having prior knowledge of frequent itemset

properties. Apriori uses a “bottom-up” approach, where it

generates the candidate itemsets. To reduce the search

space, Apriori uses the property “all subsets of a frequent


itemset must be frequent; and if an itemset is infrequent,

then all its supersets must also be infrequent”. Another

approach, predictive Apriori [108], can also generate rules;

however, it may produce unexpected results as it combines

both support and confidence. Apriori [8] is the most widely

applicable technique for mining association rules.
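The level-wise search and the pruning property can be sketched in pure Python. The transactions below are invented, and this minimal version recomputes supports naively rather than using Apriori's optimized counting:

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Minimal Apriori sketch: level-wise search for frequent itemsets.
    Size-k candidates are joined from frequent (k-1)-itemsets; the
    'all subsets must be frequent' property prunes the search space."""
    transactions = [set(t) for t in transactions]
    n = len(transactions)

    def support(itemset):
        return sum(itemset <= t for t in transactions) / n

    items = sorted({i for t in transactions for i in t})
    frequent = {frozenset([i]) for i in items if support({i}) >= min_support}
    result = set(frequent)
    while frequent:
        k = len(next(iter(frequent))) + 1
        # Join step: union pairs of frequent (k-1)-itemsets into k-itemsets.
        candidates = {a | b for a, b in combinations(frequent, 2)
                      if len(a | b) == k}
        # Prune step: keep only candidates that clear the support threshold.
        frequent = {c for c in candidates if support(c) >= min_support}
        result |= frequent
    return result

transactions = [["laptop", "antivirus"], ["laptop", "antivirus", "mouse"],
                ["laptop", "mouse"], ["antivirus"]]
freq = apriori(transactions, min_support=0.5)
# {"laptop", "antivirus"} is frequent: they co-occur in 2 of 4 transactions
```

Rules such as "laptop => antivirus" would then be read off the frequent itemsets whose confidence clears a second threshold.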

● ECLAT: This technique was proposed by Zaki et al. [131] and

stands for Equivalence Class Clustering and bottom-up

Lattice Traversal. ECLAT uses a depth-first search to find

frequent itemsets. In contrast to the Apriori [8] algorithm,

which represents data in a horizontal pattern, it represents

data vertically. Hence, the ECLAT algorithm is more efficient

and scalable in the area of association rule learning. This

algorithm is better suited for small and medium datasets

whereas the Apriori algorithm is used for large datasets.

● FP-Growth: Another common association rule learning

technique based on the frequent-pattern tree (FP-tree)

proposed by Han et al. [42] is Frequent Pattern Growth,

known as FP-Growth. The key difference with Apriori is that

while generating rules, the Apriori algorithm [8] generates

frequent candidate itemsets; on the other hand, the FP-

growth algorithm [42] prevents candidate generation and

thus produces a tree by the successful strategy of ‘divide

and conquer’ approach. Due to its sophistication, however,

FP-Tree is challenging to use in an interactive mining

environment [133]. Thus, the FP-Tree would not fit into


memory for massive data sets, making it challenging to

process big data as well. Another solution, RARM (Rapid

Association Rule Mining), was proposed by Das et al. [26] but

faces a related FP-tree issue [133].

● ABC-RuleMiner: A rule-based machine learning method,

recently proposed in our earlier paper by Sarker et al.

[104], discovers interesting non-redundant rules to

provide real-world intelligent services. This algorithm

effectively identifies the redundancy in associations by

taking into account the impact or precedence of the related

contextual features and discovers a set of non-redundant

association rules. This algorithm first constructs an

association generation tree (AGT), a top-down approach,

and then extracts the association rules through traversing

the tree. Thus, ABC-RuleMiner is more potent than

traditional rule-based methods in terms of both non-

redundant rule generation and intelligent decision-making,

particularly in a context-aware smart computing

environment, where human or user preferences are

involved.

Among the association rule learning techniques discussed above,

Apriori [8] is the most widely used algorithm for discovering

association rules from a given dataset [133]. The main strength of

the association learning technique is its comprehensiveness, as it

generates all associations that satisfy the user-specified constraints,

such as minimum support and confidence value. The ABC-RuleMiner


approach [104] discussed earlier could give significant results in

terms of non-redundant rule generation and intelligent decision-

making for the relevant application areas in the real world.

Reinforcement Learning

Reinforcement learning (RL) is a machine learning technique that

allows an agent to learn by trial and error in an interactive

environment using input from its actions and experiences. Unlike

supervised learning, which is based on given sample data or

examples, the RL method is based on interacting with the

environment. The problem to be solved in reinforcement learning

(RL) is defined as a Markov Decision Process (MDP) [86], i.e., all

about sequentially making decisions. An RL problem typically

includes four elements such as Agent, Environment, Rewards, and

Policy.

RL can be split roughly into Model-based and Model-free techniques.

Model-based RL is the process of inferring optimal behavior from a

model of the environment by performing actions and observing the


results, which include the next state and the immediate reward [85].

AlphaZero and AlphaGo [113] are examples of model-based

approaches. On the other hand, a model-free approach does not use

the distribution of the transition probability and the reward function

associated with MDP. Q-learning, Deep Q Network, Monte Carlo

Control, SARSA (State–Action–Reward–State–Action), etc. are some

examples of model-free algorithms [52]. The policy network, which is

required for model-based RL but not for model-free, is the key

difference between model-free and model-based learning. In the

following, we discuss the popular RL algorithms.

● Monte Carlo methods: Monte Carlo techniques, or Monte

Carlo experiments, are a wide category of computational

algorithms that rely on repeated random sampling to obtain

numerical results [52]. The underlying concept is to use

randomness to solve problems that are deterministic in

principle. Optimization, numerical integration, and making

drawings from the probability distribution are the three

problem classes where Monte Carlo techniques are most

commonly used.

● Q-learning: Q-learning is a model-free reinforcement

learning algorithm for learning the quality of behaviors that

tell an agent what action to take under what conditions [52].


It does not need a model of the environment (hence the

term “model-free”), and it can deal with stochastic

transitions and rewards without the need for adaptations.

The ‘Q’ in Q-learning usually stands for quality, as the

algorithm calculates the maximum expected rewards for a

given behavior in a given state.
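The tabular Q-learning update can be sketched on an invented toy environment: a four-state chain where moving right from state 2 reaches the goal and pays reward 1. This is an illustration of the update rule, not a general-purpose RL implementation:

```python
import numpy as np

# Tabular Q-learning on a tiny deterministic chain: states 0..3, actions
# 0 = left, 1 = right. Reaching state 3 pays reward 1 and ends the episode.
n_states, n_actions = 4, 2
alpha, gamma, n_episodes = 0.5, 0.9, 500
rng = np.random.default_rng(0)
Q = np.zeros((n_states, n_actions))

def step(s, a):
    s2 = min(s + 1, 3) if a == 1 else max(s - 1, 0)
    return s2, (1.0 if s2 == 3 else 0.0), s2 == 3  # next state, reward, done

for _ in range(n_episodes):
    s, done = 0, False
    while not done:
        # Epsilon-greedy exploration, then the Q-learning update:
        # Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
        a = rng.integers(n_actions) if rng.random() < 0.2 else int(Q[s].argmax())
        s2, r, done = step(s, a)
        Q[s, a] += alpha * (r + gamma * Q[s2].max() * (not done) - Q[s, a])
        s = s2

policy = Q.argmax(axis=1)  # greedy policy: move right toward the goal
```

The learned Q-values approach the discounted optimum (Q(2, right) near 1, Q(1, right) near 0.9, Q(0, right) near 0.81), so the greedy policy moves right in every non-terminal state.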

● Deep Q-learning: The basic working step in Deep Q-Learning

[52] is that the initial state is fed into the neural network,

which returns the Q-value of all possible actions as an

output. Q-learning works well when the setting is

reasonably simple. However, when the number of states and

actions becomes large and the environment more complex,

deep learning can be used as a function approximator.

Reinforcement learning, along with supervised and unsupervised

learning, is one of the basic machine learning paradigms. RL can be

used to solve numerous real-world problems in various fields, such

as game theory, control theory, operations analysis, information

theory, simulation-based optimization, manufacturing, supply chain

logistics, multi-agent systems, swarm intelligence, aircraft control,

robot motion control, and many more.


Artificial Neural Network and Deep Learning

Deep learning is part of a wider family of artificial neural networks

(ANN)-based machine learning approaches with representation

learning. Deep learning provides a computational architecture by

combining several processing layers, such as input, hidden, and

output layers, to learn from data [41]. The main advantage of deep

learning over traditional machine learning methods is its better

performance in several cases, particularly learning from large

datasets [105, 129]. Figure 9 shows a general performance of deep

learning over machine learning considering the increasing amount of

data. However, it may vary depending on the data characteristics

and experimental set up.

Fig. 9
Machine learning and deep learning performance in general with the

amount of data



The most common deep learning algorithms are: Multi-layer

Perceptron (MLP), Convolutional Neural Network (CNN, or ConvNet),

Long Short-Term Memory Recurrent Neural Network (LSTM-RNN)

[96]. In the following, we discuss various types of deep learning

methods that can be used to build effective data-driven models for

various purposes.

Fig. 10
A structure of an artificial neural network modeling with multiple

processing layers


● MLP: The base architecture of deep learning, which is also

known as the feed-forward artificial neural network, is called

a multilayer perceptron (MLP) [82]. A typical MLP is a fully

connected network consisting of an input layer, one or more

hidden layers, and an output layer, as shown in Fig. 10. Each

node in one layer connects to each node in the following

layer at a certain weight. MLP utilizes the “Backpropagation”

technique [41], the most “fundamental building block” in a

neural network, to adjust the weight values internally while

building the model. MLP is sensitive to feature scaling and

allows a variety of hyperparameters to be tuned, such as

the number of hidden layers, neurons, and iterations, which

can result in a computationally costly model.


● CNN or ConvNet: The convolution neural network (CNN) [65]

enhances the design of the standard ANN, consisting of

convolutional layers, pooling layers, as well as fully

connected layers, as shown in Fig. 11. As it takes

advantage of the two-dimensional (2D) structure of the

input data, it is typically broadly used in several areas such

as image and video recognition, image processing and

classification, medical image analysis, natural language

processing, etc. While CNN has a greater computational

burden, without any manual intervention, it has the

advantage of automatically detecting the important

features, and hence CNN is considered to be more powerful

than conventional ANN. A number of advanced deep

learning models based on CNN can be used in the field, such

as AlexNet [60], Xception [24], Inception [118], Visual

Geometry Group (VGG) [44], ResNet [45], etc.

● LSTM-RNN: Long short-term memory (LSTM) is an artificial

recurrent neural network (RNN) architecture used in the

area of deep learning [38]. LSTM has feedback links, unlike

normal feed-forward neural networks. LSTM networks are

well-suited for analyzing and learning sequential data, such

as classifying, processing, and predicting data based on

time series data, which differentiates it from other

conventional networks. Thus, LSTM can be used when the

data are in a sequential format, such as time, sentence,

etc., and commonly applied in the area of time-series


analysis, natural language processing, speech recognition,

etc.

Fig. 11

An example of a convolutional neural network (CNN or ConvNet)

including multiple convolution and pooling layers



In addition to these most common deep learning methods discussed

above, several other deep learning approaches [96] exist in the area

for various purposes. For instance, the self-organizing map (SOM)

[58] uses unsupervised learning to represent the high-dimensional

data by a 2D grid map, thus achieving dimensionality reduction. The

autoencoder (AE) [15] is another learning technique that is widely

used for dimensionality reduction as well as feature extraction in

unsupervised learning tasks. Restricted Boltzmann machines (RBM)

[46] can be used for dimensionality reduction, classification,

regression, collaborative filtering, feature learning, and topic

modeling. A deep belief network (DBN) is typically composed of

simple, unsupervised networks such as restricted Boltzmann

machines (RBMs) or autoencoders, and a backpropagation neural

network (BPNN) [123]. A generative adversarial network (GAN) [39]

is a form of the network for deep learning that can generate data

with characteristics close to the actual data input. Transfer learning

is currently very common because it can train deep neural networks

with comparatively low data, which is typically the re-use of a new

problem with a pre-trained model [124]. A brief discussion of these

artificial neural networks (ANN) and deep learning (DL) models is

given in our earlier paper, Sarker et al. [96].


Overall, based on the learning techniques discussed above, we can

conclude that various types of machine learning techniques, such as

classification analysis, regression, data clustering, feature selection and extraction, dimensionality reduction, association rule learning, reinforcement learning, and deep learning, can

play a significant role for various purposes according to their

capabilities. In the following section, we discuss several application

areas based on machine learning algorithms.

Applications of Machine Learning

In the current age of the Fourth Industrial Revolution (4IR), machine

learning has become popular in various application areas because of its ability to learn from past data and make intelligent decisions. In the following, we summarize and discuss ten popular

application areas of machine learning technology.

● Predictive analytics and intelligent decision-making: A major

application field of machine learning is intelligent decision-

making by data-driven predictive analytics [21, 70]. The


basis of predictive analytics is capturing and exploiting

relationships between explanatory variables and predicted

variables from previous events to predict the unknown

outcome [41]. Examples include identifying suspects or criminals after a crime has been committed, or detecting credit card fraud as it happens. In another application, machine learning algorithms can assist retailers in better understanding consumer preferences and behavior, managing inventory, avoiding out-of-stock situations, and optimizing logistics and warehousing in e-commerce.

Various machine learning algorithms such as decision trees,

support vector machines, artificial neural networks, etc.

[106, 125] are commonly used in the area. Since accurate

predictions provide insight into the unknown, they can

improve the decisions of industries, businesses, and almost

any organization, including government agencies, e-

commerce, telecommunications, banking and financial

services, healthcare, sales and marketing, transportation,

social networking, and many others.
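For instance, the credit card fraud example above can be sketched with a simple nearest-neighbour classifier over two invented transaction features; all data, features, and thresholds here are illustrative assumptions:

```python
import math

# toy labelled history: (amount in $, hour of day) -> 1 = fraud, 0 = legitimate
history = [
    ((12.0, 14), 0), ((30.0, 10), 0), ((25.0, 18), 0), ((8.0, 9), 0),
    ((980.0, 3), 1), ((1500.0, 2), 1), ((760.0, 4), 1),
]

def predict(tx, k=3):
    """Majority vote among the k most similar past transactions."""
    nearest = sorted(history, key=lambda h: math.dist(h[0], tx))[:k]
    votes = [label for _, label in nearest]
    return int(sum(votes) > k / 2)

print(predict((1100.0, 3)))  # resembles the fraudulent transactions
print(predict((20.0, 12)))   # resembles ordinary purchases
```

A real predictive-analytics pipeline would use many more features and one of the stronger models cited above (decision trees, support vector machines, neural networks), but the principle of predicting an unknown outcome from relationships in past events is the same.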

● Cybersecurity and threat intelligence: Cybersecurity is one

of the most essential areas of Industry 4.0 [114], which is

typically the practice of protecting networks, systems,

hardware, and data from digital attacks [114]. Machine

learning has become a crucial cybersecurity technology that

constantly learns by analyzing data to identify patterns,

better detect malware in encrypted traffic, find insider


threats, predict where bad neighborhoods are online, keep

people safe while browsing, or secure data in the cloud by

uncovering suspicious activity. For instance, clustering

techniques can be used to identify cyber-anomalies, policy

violations, etc. Machine learning classification models that take into account the impact of security features are useful for detecting various types of cyber-attacks or intrusions [97]. Various deep learning-based security models can also be used on large-scale security datasets [96, 129].

Moreover, security policy rules generated by association

rule learning techniques can play a significant role in building a

rule-based security system [105]. Thus, we can say that

various learning techniques discussed in Sect. “Machine Learning Tasks and Algorithms” can enable cybersecurity professionals to be more proactive in efficiently preventing threats and cyber-attacks.
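The clustering-based anomaly idea mentioned above can be sketched as a centroid-plus-distance check over invented connection features (packets per second, failed logins); every value here is an illustrative assumption:

```python
import math

# normal traffic observations: (packets/sec, failed logins per minute)
normal = [(50, 0), (55, 1), (48, 0), (60, 1), (52, 0)]

# summarize the normal behaviour cluster by its centroid
cx = sum(p for p, _ in normal) / len(normal)
cy = sum(f for _, f in normal) / len(normal)

# the cluster radius: the farthest normal observation from the centroid
radius = max(math.dist((cx, cy), obs) for obs in normal)

def is_anomaly(obs, margin=2.0):
    """Flag anything well outside the normal cluster as a potential anomaly."""
    return math.dist((cx, cy), obs) > margin * radius

print(is_anomaly((900, 40)))  # burst of traffic with many failed logins
print(is_anomaly((51, 0)))    # looks like normal behaviour
```

Real intrusion-detection systems cluster many features over large datasets, but the core idea of flagging observations far from learned normal behaviour is the same.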

● Internet of things (IoT) and smart cities: Internet of Things

(IoT) is another essential area of Industry 4.0 [114], which

turns everyday objects into smart objects by allowing them

to transmit data and automate tasks without the need for

human interaction. IoT is, therefore, considered to be the

big frontier that can enhance almost all activities in our

lives, such as smart governance, smart home, education,

communication, transportation, retail, agriculture, health

care, business, and many more [70]. Smart city is one of

IoT’s core fields of application, using technologies to


enhance city services and residents’ living experiences

[132, 135]. As machine learning utilizes experience to

recognize trends and create models that help predict future

behavior and events, it has become a crucial technology for

IoT applications [103]. For example, predicting traffic in smart cities, predicting parking availability, estimating the total energy usage of citizens for a particular period, and making context-aware and timely decisions for people are some tasks that can be solved using machine learning techniques according to the current needs of the people.
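As a small sketch of the energy-usage example, a least-squares line fitted to invented daily records predicts consumption for an unseen day (the figures and the single feature, outdoor temperature, are assumptions for illustration):

```python
# daily records: (average outdoor temperature in degrees C, energy used in kWh)
days = [(0, 42.0), (5, 36.5), (10, 30.8), (15, 25.2), (20, 19.9), (25, 14.1)]

n = len(days)
mean_t = sum(t for t, _ in days) / n
mean_e = sum(e for _, e in days) / n

# ordinary least squares: slope = cov(temp, energy) / var(temp)
slope = (sum((t - mean_t) * (e - mean_e) for t, e in days)
         / sum((t - mean_t) ** 2 for t, _ in days))
intercept = mean_e - slope * mean_t

def predict_kwh(temp):
    """Estimated daily energy usage for a given outdoor temperature."""
    return intercept + slope * temp

print(round(predict_kwh(12), 1))  # estimated usage for a 12-degree day
```

The negative slope captures the (assumed) pattern that warmer days need less heating energy; real smart-city models would combine many such sensor-derived features.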

● Traffic prediction and transportation: Transportation systems


have become a crucial component of every country’s
economic development. Nonetheless, several cities around
the world are experiencing an excessive rise in traffic
volume, resulting in serious issues such as delays, traffic
congestion, higher fuel prices, increased CO2 pollution, accidents, emergencies, and a decline in modern

society’s quality of life [40]. Thus, an intelligent transportation system that predicts future traffic is important and is an indispensable part of a smart city.

Accurate traffic prediction based on machine and deep

learning modeling can help to minimize the issues [17, 30,

31]. For example, based on the travel history and trend of

traveling through various routes, machine learning can

assist transportation companies in predicting possible issues

that may occur on specific routes and recommending that their customers take a different route. Ultimately, these

learning-based data-driven models help improve traffic flow,

increase the usage and efficiency of sustainable modes of

transportation, and limit real-world disruption by modeling

and visualizing future changes.
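The route-recommendation idea above can be sketched by averaging each route's recent travel times and suggesting the fastest one (route names and minutes are invented for illustration):

```python
# recent travel times in minutes, per candidate route (illustrative data)
travel_log = {
    "highway":   [34, 51, 48, 55, 60],   # worsening congestion
    "ring_road": [41, 40, 43, 39, 42],
    "downtown":  [45, 47, 50, 46, 49],
}

def recommend(recent=3):
    """Suggest the route with the lowest mean over the last `recent` trips."""
    avg = {r: sum(t[-recent:]) / recent for r, t in travel_log.items()}
    return min(avg, key=avg.get)

print(recommend())
```

Production traffic predictors use richer learned models (e.g., the deep learning approaches cited above) over live sensor feeds, but this captures the essential use of travel history to steer customers away from congested routes.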

● Healthcare and COVID-19 pandemic: Machine learning can

help to solve diagnostic and prognostic problems in a

variety of medical domains, such as disease prediction,

medical knowledge extraction, detecting regularities in data,

patient management, etc. [33, 77, 112]. Coronavirus

disease (COVID-19) is an infectious disease caused by a

newly discovered coronavirus, according to the World Health

Organization (WHO) [3]. Recently, learning techniques have become popular in the battle against COVID-19 [61, 63]. During the pandemic, they have been used to classify high-risk patients, predict mortality rates, and detect other anomalies [61]. They can also be used to better understand the virus’s origin, predict the outbreak, and support disease diagnosis and treatment [14, 50]. With the help of machine learning, researchers can forecast where and when COVID-19 is likely to spread, and notify those regions so that the required arrangements can be made. Deep

learning also provides exciting solutions to the problems of

medical image processing and is seen as a crucial technique

for potential applications, particularly during the COVID-19 pandemic [10, 78, 111]. Overall, machine and deep learning


techniques can help to fight the COVID-19 virus and the pandemic, as well as support intelligent clinical decision-making in the domain of healthcare.
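The high-risk patient classification mentioned above can be sketched as a simple weighted risk score over assumed clinical indicators; the weights and threshold are invented for illustration and are in no way clinically validated:

```python
# illustrative feature weights for a linear risk score (assumed, not clinical)
WEIGHTS = {"age_over_65": 2.0, "comorbidity": 1.5, "low_oxygen": 3.0}
THRESHOLD = 3.0

def risk_class(patient):
    """Sum the weights of the features a patient presents with."""
    score = sum(w for feature, w in WEIGHTS.items() if patient.get(feature))
    return "high risk" if score >= THRESHOLD else "low risk"

print(risk_class({"age_over_65": True, "low_oxygen": True}))  # score 5.0
print(risk_class({"comorbidity": True}))                      # score 1.5
```

In practice such weights are learned from labelled patient data rather than hand-set, but the sketch shows how a model maps patient features to a risk class that can support clinical decision-making.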

● E-commerce and product recommendations: Product

recommendation is one of the most well-known and widely

used applications of machine learning, and it is one of the

most prominent features of almost any e-commerce website

today. Machine learning technology can assist businesses in

analyzing their consumers’ purchasing histories and making

customized product suggestions for their next purchase

based on their behavior and preferences. E-commerce

companies, for example, can easily position product

suggestions and offers by analyzing browsing trends and

click-through rates of specific items. Using predictive

modeling based on machine learning techniques, many

online retailers, such as Amazon [71], can better manage

inventory, prevent out-of-stock situations, and optimize

logistics and warehousing. The future of sales and marketing

is the ability to capture, evaluate, and use consumer data to

provide a customized shopping experience. Furthermore,

machine learning techniques enable companies to create

packages and content that are tailored to the needs of their

customers, allowing them to maintain existing customers

while attracting new ones.
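A minimal sketch of the "customers who bought this also bought" idea counts how often products co-occur in past baskets (the basket data are invented; real recommenders use far richer signals):

```python
from collections import Counter
from itertools import combinations

# past shopping baskets (illustrative data)
baskets = [
    {"phone", "case", "charger"},
    {"phone", "case"},
    {"laptop", "mouse"},
    {"phone", "charger"},
    {"laptop", "mouse", "bag"},
]

# count how often each ordered pair of products is bought together
pair_counts = Counter()
for b in baskets:
    for x, y in combinations(sorted(b), 2):
        pair_counts[(x, y)] += 1
        pair_counts[(y, x)] += 1

def recommend(product, n=2):
    """Top-n products most often co-purchased with `product`."""
    scores = Counter({y: c for (x, y), c in pair_counts.items() if x == product})
    return [item for item, _ in scores.most_common(n)]

print(recommend("phone"))
```

This co-occurrence counting is the simplest form of the purchasing-history analysis described above; large retailers extend it with browsing trends, click-through rates, and collaborative filtering.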

● NLP and sentiment analysis: Natural language processing

(NLP) involves the reading and understanding of spoken or


written language through the medium of a computer [79,

103]. Thus, NLP helps computers, for instance, to read a

text, hear speech, interpret it, analyze sentiment, and

decide which aspects are significant, where machine

learning techniques can be used. Virtual personal assistants, chatbots, speech recognition, document description, language or machine translation, etc. are some examples of

NLP-related tasks. Sentiment Analysis [90] (also referred to

as opinion mining or emotion AI) is an NLP sub-field that

seeks to identify and extract public mood and views within a

given text through blogs, reviews, social media, forums,

news, etc. For instance, businesses and brands use

sentiment analysis to understand the social sentiment of

their brand, product, or service through social media

platforms or the web as a whole. Overall, sentiment analysis is considered a machine learning task that analyzes texts for polarity, such as “positive”, “negative”, or “neutral”, along with more fine-grained states such as very happy, happy, sad, very sad, angry, interested, or not interested.
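Such polarity classification can be sketched with a tiny lexicon-based scorer (the word lists are illustrative assumptions; practical systems learn such weights from labelled data):

```python
POSITIVE = {"good", "great", "happy", "love", "excellent"}
NEGATIVE = {"bad", "sad", "angry", "terrible", "hate"}

def polarity(text):
    """Classify a text as positive, negative, or neutral by lexicon counts."""
    words = text.lower().split()
    score = (sum(w in POSITIVE for w in words)
             - sum(w in NEGATIVE for w in words))
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(polarity("what a great phone I love it"))
print(polarity("terrible battery and bad screen"))
print(polarity("the package arrived on monday"))
```

Learned classifiers replace the fixed word lists with weights estimated from labelled reviews or social media posts, which also lets them handle negation and context that this sketch ignores.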

● Image, speech and pattern recognition: Image recognition

[36] is a well-known and widespread example of machine

learning in the real world, which can identify an object in a digital image. For instance, labeling an x-ray as cancerous or not, character recognition, face detection in an image, and tagging suggestions on social media, e.g., Facebook, are common examples of image recognition. Speech recognition


[23] is also very popular; it typically uses sound and linguistic models, e.g., in Google Assistant, Cortana, Siri, and Alexa [67], where machine learning methods are used.

Pattern recognition [13] is defined as the automated

recognition of patterns and regularities in data, e.g., image

analysis. Several machine learning techniques such as

classification, feature selection, clustering, or sequence

labeling methods are used in the area.
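Nearest-template matching over tiny binary images gives a minimal sketch of this kind of pattern recognition (the 3x3 templates and pixel values are invented for illustration):

```python
# 3x3 binary "images" of two hand-drawn patterns (illustrative data)
CROSS  = [(0, 1, 0), (1, 1, 1), (0, 1, 0)]
CORNER = [(1, 1, 0), (1, 0, 0), (0, 0, 0)]

def distance(a, b):
    """Hamming distance between two binary images of equal size."""
    return sum(pa != pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))

def classify(img):
    """Assign the label of the most similar stored template."""
    templates = {"cross": CROSS, "corner": CORNER}
    return min(templates, key=lambda name: distance(templates[name], img))

noisy = [(0, 1, 0), (1, 1, 1), (0, 0, 0)]  # a cross with one pixel flipped
print(classify(noisy))
```

Modern image recognition replaces hand-made templates with features learned by convolutional networks such as the CNN of Fig. 11, but the underlying idea of matching an input against learned patterns is the same.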

● Sustainable agriculture: Agriculture is essential to the

survival of all human activities [109]. Sustainable agriculture

practices help to improve agricultural productivity while also

reducing negative impacts on the environment [5, 25, 109].

Sustainable agriculture supply chains are knowledge-intensive and based on information, skills, technologies,

etc., where knowledge transfer encourages farmers to

enhance their decisions to adopt sustainable agriculture

practices utilizing the increasing amount of data captured

by emerging technologies, e.g., the Internet of Things (IoT),

mobile technologies and devices, etc. [5, 53, 54]. Machine

learning can be applied in various phases of sustainable

agriculture, such as in the pre-production phase, for the prediction of crop yield, soil properties, irrigation requirements, etc.; in the production phase, for weather prediction, disease detection, weed detection, soil nutrient management, livestock management, etc.; in the processing phase, for demand estimation, production planning, etc.; and in the distribution phase, for inventory management, consumer analysis, etc.
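As a sketch of the crop-yield prediction mentioned for the pre-production phase, a multi-feature least-squares fit over invented plot records can be used (rainfall and fertilizer are assumed drivers; all numbers are illustrative):

```python
import numpy as np

# illustrative plot records: (rainfall in mm, fertilizer in kg/ha) -> yield in t/ha
X = np.array([[300, 45], [400, 62], [350, 54], [500, 72], [450, 66]], dtype=float)
y = np.array([2.8, 3.6, 3.2, 4.5, 4.0])

# add an intercept column and solve the least-squares problem
A = np.c_[np.ones(len(X)), X]
coef, *_ = np.linalg.lstsq(A, y, rcond=None)

def predict_yield(rain_mm, fert_kg):
    """Predicted yield (t/ha) for a new plot."""
    return float(coef @ [1.0, rain_mm, fert_kg])

print(round(predict_yield(420, 63), 2))
```

Real agricultural models draw on many more variables (soil properties, weather forecasts, IoT sensor streams), but the regression principle of predicting yield from measured inputs is the same.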

● User behavior analytics and context-aware smartphone

applications: Context-awareness is a system’s ability to

capture knowledge about its surroundings at any moment

and modify behaviors accordingly [28, 93]. Context-aware

computing uses software and hardware to automatically

collect and interpret data for direct responses. The mobile app development environment has changed greatly with the power of AI, particularly machine learning

techniques through their learning capabilities from

contextual data [103, 136]. Thus, the developers of mobile

apps can rely on machine learning to create smart apps that

can understand human behavior, support, and entertain

users [107, 137, 140]. To build various personalized data-

driven context-aware systems, such as smart interruption

management, smart mobile recommendation, context-aware smart searching, and decision-making that intelligently assists end mobile phone users in a pervasive computing environment, machine learning techniques are applicable.

For example, context-aware association rules can be used to

build an intelligent phone call application [104]. Clustering

approaches are useful in capturing users’ diverse behavioral

activities by taking into account time-series data [102]. To predict future events in various contexts, classification methods can be used [106, 139]. Thus, various


learning techniques discussed in Sect. “Machine Learning

Tasks and Algorithms” can help to build context-aware

adaptive and smart applications according to the

preferences of the mobile phone users.
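The context-aware rule idea, e.g., for the phone call application cited as [104], can be sketched by measuring the confidence of a candidate rule over assumed interaction logs (the data and threshold are invented for illustration):

```python
# logged phone events: (context, user's action) -- illustrative data only
logs = [
    ("meeting", "reject"), ("meeting", "reject"), ("meeting", "reject"),
    ("meeting", "accept"),
    ("home", "accept"), ("home", "accept"), ("home", "reject"),
]

def confidence(context, action):
    """Confidence of the rule `context -> action`: P(action | context)."""
    matching = [a for c, a in logs if c == context]
    return sum(a == action for a in matching) / len(matching)

# adopt the rule only if it is sufficiently reliable in the user's history
rule_conf = confidence("meeting", "reject")
print(round(rule_conf, 2))
if rule_conf >= 0.7:
    print("rule adopted: in a meeting -> reject incoming calls")
```

Mining many such context-action rules, and keeping only the high-confidence ones, is the essence of building a personalized, context-aware assistant from behavioral logs.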

In addition to these application areas, machine learning-based

models can also be applied to several other domains such as

bioinformatics, cheminformatics, computer networks, DNA sequence

classification, economics and banking, robotics, advanced

engineering, and many more.

Challenges and Research Directions

Our study on machine learning algorithms for intelligent data

analysis and applications opens several research issues in the area.

Thus, in this section, we summarize and discuss the challenges

faced and the potential research opportunities and future directions.

In general, the effectiveness and the efficiency of a machine

learning-based solution depend on the nature and characteristics of

the data, and the performance of the learning algorithms. Collecting data in a relevant domain, such as cybersecurity, IoT, healthcare, or agriculture discussed in Sect. “Applications of Machine Learning”, is not straightforward, although the current cyberspace enables the production of a huge amount of data at very high frequency. Thus, collecting useful data for the target machine learning-based applications, e.g., smart city applications, and managing them properly is important for further analysis. Therefore, a more in-depth investigation of data collection methods is needed when working on real-world data. Moreover, historical data may contain many ambiguous values, missing values, outliers, and meaningless data. The machine learning algorithms discussed in Sect. “Machine Learning Tasks and Algorithms” are highly affected by data quality and availability for training, and consequently so is the resultant model. Thus, accurately cleaning and pre-processing the diverse data collected from diverse sources is a challenging task. Therefore, effectively modifying or enhancing existing pre-processing methods, or proposing new data preparation techniques, is required to use the learning algorithms effectively in the associated application domain.

To analyze the data and extract insights, there exist many machine

learning algorithms, summarized in Sect. “Machine Learning Tasks

and Algorithms”. Thus, selecting a proper learning algorithm that is


suitable for the target application is challenging. The reason is that

the outcome of different learning algorithms may vary depending on

the data characteristics [106]. Selecting the wrong learning algorithm would produce unexpected outcomes that may lead to a loss of effort, as well as of the model’s effectiveness and accuracy. In

terms of model building, the techniques discussed in Sect. “Machine

Learning Tasks and Algorithms” can directly be used to solve many

real-world issues in diverse domains, such as cybersecurity, smart

cities and healthcare summarized in Sect. “Applications of Machine

Learning”. However, hybrid learning models, e.g., ensembles of methods, the modification or enhancement of existing learning techniques, or the design of new learning methods, could be potential future work in the area.

Thus, the ultimate success of a machine learning-based solution and

corresponding applications mainly depends on both the data and the

learning algorithms. If the data are unsuitable for learning, e.g., non-representative, of poor quality, with irrelevant features, or insufficient in quantity for training, then the machine learning models may become useless or produce lower accuracy. Therefore, effectively processing the data and handling the diverse learning algorithms are important for a machine learning-based solution and, eventually, for building intelligent applications.


Conclusion

In this paper, we have conducted a comprehensive overview of

machine learning algorithms for intelligent data analysis and

applications. According to our goal, we have briefly discussed how

various types of machine learning methods can be used to build solutions to various real-world issues. A successful machine learning

model depends on both the data and the performance of the

learning algorithms. The sophisticated learning algorithms then need

to be trained through the collected real-world data and knowledge

related to the target application before the system can assist with

intelligent decision-making. We also discussed several popular

application areas based on machine learning techniques to highlight

their applicability in various real-world issues. Finally, we have

summarized and discussed the challenges faced and the potential

research opportunities and future directions in the area. The identified challenges create promising research opportunities in the field, which must be addressed with effective solutions in various application areas. Overall, we believe that our

study on machine learning-based solutions opens up a promising

direction and can be used as a reference guide for potential research

and applications for both academia and industry professionals as

well as for decision-makers, from a technical point of view.
