Association Rule
Abstract: The digital world is rife with data in this Fourth Industrial Revolution (4IR) or Industry 4.0
era. Examples of this data include cybersecurity, mobile, social media, business, Internet of Things
(IoT), and health data. The key to developing intelligent analyses of these data and correspondingly
clever and automated applications is understanding artificial intelligence (AI), and specifically
machine learning (ML). There are many different kinds of machine learning algorithms in the field,
including supervised, unsupervised, semi-supervised, and reinforcement learning. Furthermore, deep
learning, a subset of a larger class of machine learning techniques, is capable of large-scale, intelligent
data analysis. We provide a thorough analysis of various machine learning techniques in this paper,
which may be used to improve an application's intelligence and functionality. Therefore, the
fundamental contribution of this study is to provide an explanation of the principles underlying
various machine learning approaches and how they may be applied in a variety of real-world
application areas, including e-commerce, cybersecurity systems, smart cities, healthcare, and
agriculture, among many others. Based on our investigation, we also emphasise the difficulties and
future directions for research. The overall goal of this article is to provide a technical point of
reference for experts in the industry and academia, as well as for decision-makers in a variety of real-
world scenarios and application domains.
Keywords: Machine learning, Data science, Artificial intelligence, Deep learning, Intelligent
applications, Predictive analytics.
Introduction: We are living in the age of data, where nearly everything is digitally recorded [1, 3] and connected to a data source. In the modern electronic environment there is an abundance of different types of data, including cybersecurity data, Internet of Things (IoT) data, smart data, company data, social media data, smartphone data, COVID-19 data, health data, and much more. Structured, semi-structured, and unstructured data are briefly covered in Section "Real-World Data Types and Machine Learning Methodologies," and the volume of such data is
growing daily. By drawing conclusions from these data, numerous intelligent applications in the
pertinent fields can be developed. For example, the pertinent cybersecurity data can be utilized to
create a data-driven, automated, and intelligent cybersecurity system [4], the relevant mobile data can
be used to create context-aware, tailored smart mobile applications [3], and so on. Therefore, there is
an urgent need for data management tools and techniques that can quickly and intelligently extract
insights or meaningful knowledge from the data, which will serve as the foundation for real-world
applications.
In the context of data analysis and computing, artificial intelligence (AI), and machine learning (ML)
in particular, have expanded quickly in recent years, usually enabling the applications to perform in an
intelligent manner [5]. ML is widely regarded as one of the most popular recent technologies of the fourth industrial revolution (4IR or Industry 4.0) and gives systems the capacity to learn and improve from experience automatically, without being explicitly programmed [3, 4]. "Industry 4.0" [6] generally
refers to the continuous automation of traditional industrial processes and manufacturing, including
exploratory data processing, through the use of new smart technologies like machine learning
automation. Machine learning algorithms are therefore essential for the intelligent analysis of these
data and the creation of the related real-world applications. Four main types of learning algorithms
may be distinguished in this domain: supervised, unsupervised, semi-supervised, and reinforcement
learning [7].
Generally speaking, the type and characteristics of the data as well as the functionality of the learning
algorithms determine how successful and efficient a machine learning solution is. To efficiently create
data-driven systems, machine learning methods can be used in conjunction with classification
analysis, regression, data clustering, feature engineering and dimensionality reduction, association rule
learning, or reinforcement learning [8]. Furthermore, deep learning, which is part of a broader family of machine learning techniques, originates from artificial neural networks and is known to be capable of intelligent, large-scale data analysis [9]. It is therefore difficult to choose an appropriate learning algorithm that
fits the intended application in a given domain. The explanation for this is that various learning
algorithms have distinct goals, and even within the same category, the results of various learning
algorithms can differ based on the properties of the data [10]. In order to apply machine learning
algorithms in a variety of real-world application areas, including cybersecurity services, IoT systems,
business and recommendation systems, smart cities, healthcare and COVID-19, context-aware
systems, sustainable agriculture, and many more, it is crucial to understand the underlying principles
of these algorithms. These topics are briefly discussed in Section "Applications of Machine Learning."
In this paper, we present a thorough overview of the various types of machine learning algorithms that may be
implemented to improve the intelligence and capacities of an application, based on the significance
and potential of "Machine Learning" to evaluate the data indicated above. The explanation of the
concepts and potential of various machine learning approaches, as well as their relevance in the
previously listed real-world application domains, constitutes the study's primary contribution. This
paper's goal is to give academics and professionals in the field a fundamental understanding of
machine learning so they may study, research, and construct intelligent and data-driven systems in the
relevant domains.
The key contributions of this paper are listed as follows:
To establish the parameters of our research by considering the features and attributes of
diverse real-world data sets as well as the capacities of different learning approaches.
To offer a thorough understanding of machine learning methods that can be used to improve a
data-driven application's intelligence and capabilities.
To talk about how machine learning-based solutions can be used in a range of real-world
application areas.
To emphasize and enumerate the possible directions for future research on intelligent data
analysis and services that fall under the purview of our study.
This is how the remainder of the paper is structured. The next section outlines the parameters of our
investigation and provides a more comprehensive presentation of the different kinds of data and
machine learning algorithms. In the part that follows, we go over and describe several machine
learning algorithms in brief. After that, we go over and summarize a number of real-world application
areas that use machine learning algorithms. We highlight a number of research questions and possible
future paths in the penultimate section, and this study is concluded in the last section.
Real-World Data Types and Machine Learning Methodologies: Usually, machine learning
algorithms take in and analyze data to discover relevant patterns regarding people, transactions,
events, business procedures, and so forth. The different kinds of real-world data and machine learning
algorithm categories are covered in the sections that follow.
Types of Real-World Data: Generally speaking, data accessibility is seen as essential to building
real-world data-driven systems or machine learning models [3, 4]. There are many different types of
data, including unstructured, semi-structured, and structured data [11]. Furthermore, an additional type
that generally represents information about the data is called "metadata." We go over these kinds of
data briefly in the sections that follow.
Structured: It is highly ordered, easily accessible, and follows a standard order in a data model.
It can be readily used by a computer program or a person. Structured data are commonly stored in tabular formats within well-defined systems, such as relational databases. Structured data
includes things like names, dates, addresses, credit card numbers, stock information,
geolocation, and so on.
Semi-structured: Unlike the structured data previously discussed, semi-structured data is not
stored in a relational database, yet it does contain some organizational characteristics that
facilitate analysis. Semi-structured data includes things like HTML, XML, JSON documents,
NoSQL databases, and more.
Unstructured: Unstructured data, which primarily consists of text and multimedia, is
considerably harder to collect, handle, and analyze because it lacks a predetermined format or
arrangement. Unstructured data includes, but is not limited to, sensor data, emails, blog posts,
wikis, word processing documents, PDF files, audio files, videos, photos, presentations, web
pages, and many more kinds of business documents.
Metadata: Also known as "data about data," this is not a typical form of data. The main distinction between "data" and "metadata" is that data are simply the material that can classify, quantify, or document something with respect to an organization's data attributes, whereas metadata describes the pertinent data and makes them more meaningful to data users. The author, file size, creation date, and keywords used to characterize a document are a few basic examples of metadata for that document.
Researchers in the fields of data science and machine learning employ a variety of popular datasets for
diverse applications. For instance, these include cybersecurity datasets such as NSL-KDD [12] and Bot-IoT [14]; smartphone datasets such as call logs [13], SMS logs [2], logs of mobile application usage [15], and logs of mobile phone notifications [18]; IoT data [16]; data related to agriculture and e-commerce [19]; and health data such as heart disease [17] and COVID-19 [20], among many more across a variety of application domains. The data can be in any of the several categories mentioned above, and they can differ depending on the real-world application. Various
machine learning techniques can be employed based on their learning capabilities to analyze data in a
specific domain and extract valuable insights for developing intelligent applications in the real world.
These techniques are covered in the following.
Types of Machine Learning Techniques:
As seen in Fig. 1, machine learning algorithms can be broadly classified into four categories: supervised learning, unsupervised learning, semi-supervised learning, and reinforcement learning [7]. The following gives a quick overview of each kind of learning strategy and how it might be used to address real-world problems.
Supervised: Supervised learning is generally tasked with learning a function that maps an input to an output based on sample input-output pairs [21]. It uses labeled training data, consisting of a set of training examples, to infer a function. Supervised learning is a task-driven approach, used when specific goals are to be achieved from a given set of inputs [4]. The two most popular supervised tasks are "regression," which fits the data, and "classification," which separates the data. Supervised learning can be used, for example, to predict the class label or sentiment of a piece of text, such as a tweet or a product review (see the code sketch after this list).
Unsupervised: Unsupervised learning is a data-driven method that analyzes unlabeled datasets without the need for human intervention [21]. It is frequently used for exploratory purposes, grouping results, extracting generative features, and identifying meaningful patterns and structures. The most popular unsupervised learning tasks include clustering, density estimation, feature learning, dimensionality reduction, association rule discovery, and anomaly detection.
Semi-supervised: Working with both labeled and unlabeled data, semi-supervised learning is
characterized as a combination of the supervised and unsupervised approaches described above
[21, 4]. It thus lies between "without supervision" and "with supervision" learning. In
real-world scenarios, when unlabeled data are abundant and labeled data may be scarce in
certain settings, semi-supervised learning might be beneficial [7]. A semi-supervised learning
model's ultimate objective is to produce a better prediction result than one might obtain from
the model utilizing just the labeled data. Semi-supervised learning has application in machine
translation, fraud detection, labeling data, and text categorization, among other areas.
Reinforcement: Reinforcement learning, an environment-driven technique, is a kind of machine learning algorithm that enables software agents and machines to automatically evaluate the optimal behavior in a particular context or environment in order to improve their efficiency [22]. This kind of learning is based on rewards and penalties, and its ultimate objective is to use the feedback gathered from the environment to take actions that maximize reward and minimize risk [7]. Although it is not well suited to solving simple or basic problems, it is a potent tool for training AI models that can help increase automation or optimize the operational efficiency of sophisticated systems such as robotics, autonomous driving, manufacturing, and supply chain logistics.
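To make the distinction concrete, the short Python sketch below contrasts a supervised model (fit on labels) with an unsupervised one (fit without labels) using scikit-learn; the synthetic dataset and the specific model choices are illustrative assumptions, not part of the original study.

```python
# A minimal sketch contrasting supervised and unsupervised learning;
# the toy data and model choices are illustrative assumptions.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans

# Synthetic data: 200 samples, 2 informative features, 2 classes.
X, y = make_classification(n_samples=200, n_features=2, n_informative=2,
                           n_redundant=0, random_state=42)

# Supervised (task-driven): the model learns a mapping from inputs X to labels y.
clf = LogisticRegression().fit(X, y)
print(clf.predict(X[:5]))

# Unsupervised (data-driven): no labels are given; structure is inferred from X alone.
km = KMeans(n_clusters=2, n_init=10, random_state=42).fit(X)
print(km.labels_[:5])
```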
Fig. 2 A general structure of a machine learning based predictive model considering both the training and testing
phase
Binary classification: Classification assignments with two class labels, such as "true and false"
or "yes and no," are referred to as binary classification [21]. In these kinds of binary
classification problems, the normal condition may belong to one class and the pathological
state to another. For example, in a medical test, "cancer not detected" could be the normal condition, whereas "cancer detected" could be the abnormal state. Similarly, in an email filtering task, "spam" and "not spam" are the two binary class labels.
Multiclass classification: This term has historically been used to describe classification
problems with more than two class labels [21]. Unlike binary classification tasks, multiclass
classification does not follow the notion of normal and abnormal outcomes. Rather, examples are classified as belonging to one of a range of specified classes. Classifying different types of network attacks using the NSL-KDD [12] dataset, for instance, is a multiclass classification task. This dataset has four class labels for the attack categories: DoS (Denial of Service), U2R (User to Root), R2L (Remote to Local), and Probing attacks.
The literature on machine learning and data science has many proposed categorization methods [21,
8]. We provide below a summary of the most extensively used and popular techniques in a range of application domains.
Naïve Bayes (NB): The naïve Bayes algorithm relies on the assumption of independence between every pair of features and is based on Bayes' theorem. It performs effectively in both binary and multi-class settings and may be applied to a variety of real-world scenarios, such as spam filtering and document or text categorization. The NB classifier can be used to efficiently handle noisy examples in the data and to build a reliable prediction model [25]. Its main advantage is that it requires less training data than more complex methods and estimates the necessary parameters faster [23]. However, because of its strict
feature independence assumptions, its performance might be affected. NB classifiers are
commonly available in Gaussian, Multinomial, Complement, Bernoulli, and Categorical forms
[23].
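As a brief, hedged illustration, the following scikit-learn sketch fits a Gaussian naïve Bayes classifier; the iris dataset and the train/test split are assumptions chosen for demonstration only.

```python
# A minimal Gaussian naive Bayes example; dataset choice is illustrative.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Each feature is assumed conditionally independent given the class label.
nb = GaussianNB().fit(X_train, y_train)
print("Test accuracy:", nb.score(X_test, y_test))
```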
Linear Discriminant Analysis (LDA): Applying Bayes' rule and fitting class conditional
densities to data yields a linear decision boundary classifier known as linear discriminant
analysis (LDA) [23]. The technique is also known as a generalization of Fisher's linear discriminant; it projects a given dataset onto a lower-dimensional space, i.e., performs a dimensionality reduction that lowers the computational cost or reduces the complexity of
the resulting model. Assuming that every class has the same covariance matrix, the basic LDA
model typically fits each class with a Gaussian density [23]. Regression analysis and ANOVA
(analysis of variance) are closely linked to LDA since both aim to express a single dependent
variable as a linear mixture of additional features or measures.
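The sketch below shows LDA used both as a classifier and as a supervised dimensionality-reduction step, as described above; the digits dataset and the two-component projection are illustrative assumptions.

```python
# LDA as classifier and supervised dimensionality reduction; data choice is illustrative.
from sklearn.datasets import load_digits
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_digits(return_X_y=True)

lda = LinearDiscriminantAnalysis(n_components=2)
X_2d = lda.fit_transform(X, y)      # project 64-dimensional digits onto 2 axes
print(X_2d.shape)                   # (1797, 2)
print("Training accuracy:", lda.score(X, y))
```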
Logistic regression (LR): Logistic Regression (LR) is another popular statistical model with a
probabilistic foundation that is utilized in machine learning to address classification problems
[24]. Typically, logistic regression estimates class probabilities using the logistic function, also known as the sigmoid function, which is mathematically defined in Eq. 1. It is most
effective when the dataset can be divided linearly and has a tendency to overfit high-
dimensional datasets. In these kinds of situations, one can prevent over-fitting by using the
regularization (L1 and L2) approaches [23]. A key flaw in logistic regression is the assumption
of linearity between the independent and dependent variables. Although it can be applied to
regression problems as well as classification problems, classification problems are its more
frequent use.
g(z) = 1 / (1 + exp(−z)) (1)
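A small Python sketch of Eq. 1 and an L2-regularized logistic regression follows; the synthetic dataset and parameter values are assumptions for illustration.

```python
# Sigmoid of Eq. 1 plus an L2-regularized logistic regression; data is synthetic.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

def sigmoid(z):
    """g(z) = 1 / (1 + exp(-z)), mapping any real z into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(np.array([-2.0, 0.0, 2.0])))   # approx. [0.12, 0.5, 0.88]

X, y = make_classification(n_samples=300, n_features=10, random_state=1)
# penalty="l2" applies the regularization mentioned in the text.
lr = LogisticRegression(penalty="l2", C=1.0).fit(X, y)
print(lr.predict_proba(X[:3]))               # probabilities via the logistic function
```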
Support vector machine (SVM): Support vector machines (SVMs) are another popular
machine learning technology that may be applied to tasks like regression, classification, and
other tasks [27]. A support vector machine builds a hyper-plane or collection of hyper-planes
in high- or infinite-dimensional space. Intuitively, the hyper-plane with the largest distance to the nearest training data points of each class achieves a good separation, since in general the larger the margin, the lower the classifier's generalization error. It works well in high-dimensional
spaces and exhibits variable behavior depending on the kernel, a family of mathematical
functions. Some of the common kernel functions used in SVM classifiers are linear,
polynomial, radial basis function (RBF), sigmoid, etc [23]. Nevertheless, SVM does not
perform as well when the data set has additional noise, such as overlapping target classes.
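To illustrate the kernel-dependent behavior noted above, the hedged sketch below compares the listed kernels on a dataset that is not linearly separable; the moons generator and its parameters are assumptions.

```python
# Comparing common SVM kernels on non-linearly-separable toy data.
from sklearn.datasets import make_moons
from sklearn.svm import SVC

X, y = make_moons(n_samples=300, noise=0.2, random_state=0)

for kernel in ("linear", "poly", "rbf", "sigmoid"):
    clf = SVC(kernel=kernel).fit(X, y)
    print(f"{kernel:8s} training accuracy: {clf.score(X, y):.3f}")
```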
Decision tree (DT): One well-known non-parametric supervised learning technique is the
decision tree (DT) [28]. DT learning techniques are used for both classification and regression tasks [23]. ID3 [29], C4.5 [28], and CART [30] are well-recognized DT algorithms. Furthermore, the recently proposed BehavDT [31] and IntrudTree [32] by Sarker et al. work well in the pertinent application fields of user behavior analytics and cybersecurity, respectively. As illustrated in Fig. 3, DT classifies instances by sorting them down the tree from the root to some leaf node. An instance is classified by starting at the tree's root node, testing the attribute specified by that node, and moving down the branch corresponding to the attribute's value. The most widely used splitting criteria are "gini" for the Gini impurity and "entropy" for the information gain, where the entropy of a set S can be expressed as Entropy(S) = −∑ p_i log2(p_i), with p_i the proportion of instances in S belonging to class i [23].
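The following minimal sketch trains decision trees with the two splitting criteria named above; the iris dataset and depth limit are illustrative assumptions (this is a generic CART-style tree, not BehavDT or IntrudTree).

```python
# Decision trees with the "gini" and "entropy" splitting criteria.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

for criterion in ("gini", "entropy"):
    dt = DecisionTreeClassifier(criterion=criterion, max_depth=3,
                                random_state=0).fit(X, y)
    print(criterion, "training accuracy:", round(dt.score(X, y), 3))
```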
Random forest (RF): One well-known ensemble classification method in machine learning and
data science with many applications is the random forest classifier [35]. This technique
employs "parallel ensembling," which fits multiple decision tree classifiers concurrently on
various data set sub-samples, as illustrated in Fig. 4. The outcome or final result is determined
by majority voting or averages. As a result, it reduces the issue of over-fitting and improves
control and forecast accuracy [23]. As a result, RF learning models that incorporate many
decision trees tend to be more accurate than models that only use one decision tree [10]. It
combines random feature selection [33] with bootstrap aggregation (bagging) [34] to create a
sequence of decision trees with controlled variation. It is flexible and applicable to both classification and regression problems.
Fig. 4 An example of a random forest structure considering multiple decision trees
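As a short, hedged illustration of the parallel-ensembling idea, the sketch below fits a random forest and evaluates it with cross-validation; the number of trees and the synthetic dataset are assumptions.

```python
# Random forest: bagging plus random feature selection over many trees.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=20, random_state=7)

# 100 trees fitted on bootstrap sub-samples; predictions aggregated by voting.
rf = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=7)
print("5-fold CV accuracy:", cross_val_score(rf, X, y, cv=5).mean())
```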
Regression Analysis: Regression analysis comprises a number of machine learning techniques that predict a continuous (y) outcome variable from the values of one or more (x) predictor variables [41]. The primary difference between regression and classification is that regression forecasts a continuous quantity, whereas classification predicts discrete class labels. An illustration of how classification differs from regression may be found in Fig. 5. There are
frequently some similarities between the two categories of machine learning algorithms. These days,
regression models are frequently employed in many other domains, such as time series estimation,
trend analysis, cost estimation, financial forecasting or prediction, and many more.
Simple and multiple linear regression: One of the most widely used ML modeling strategies
and a well-known regression approach is simple and multiple linear regression. The dependent
variable in this technique is continuous, the independent variable or variables might be discrete
or continuous, and the regression line has a linear form. Linear regression establishes a relationship between the dependent variable (Y) and one or more independent variables (X) using a best-fit straight line, also referred to as the regression line [41]. The following equations define it:
y = a + bx + e (4)
y = a + b1x1 + b2x2 + ⋯ + bnxn + e (5)
where ‘a’ is the intercept, ‘b’ is the slope of the line, and ‘e’ is the error term. Using the provided
predictor variable(s), one can use this equation to predict the value of the target variable. While
basic linear regression only includes one independent variable, defined in Eq. 4, multiple linear
regression is an extension of simple linear regression that allows two or more predictor variables
to model a response variable, y, as a linear function [21] specified in Eq. 5.
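A brief sketch of Eqs. 4 and 5 follows; the synthetic data and the true coefficient values are assumptions used only to show that the fitted intercept and slopes are recovered.

```python
# Simple (Eq. 4) and multiple (Eq. 5) linear regression on synthetic data.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Simple linear regression: y = a + b*x + e
x = rng.uniform(0, 10, size=(100, 1))
y = 3.0 + 2.0 * x[:, 0] + rng.normal(scale=0.5, size=100)
simple = LinearRegression().fit(x, y)
print("intercept a:", simple.intercept_, "slope b:", simple.coef_)

# Multiple linear regression: y = a + b1*x1 + b2*x2 + e
X = rng.uniform(0, 10, size=(100, 2))
y2 = 1.0 + 0.5 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(scale=0.5, size=100)
multiple = LinearRegression().fit(X, y2)
print("coefficients b1, b2:", multiple.coef_)
```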
Fig. 5 Classification vs. regression. In classification the dotted line represents a linear boundary that separates the two classes; in regression, the dotted line models the linear relationship between the two variables
Polynomial regression: Polynomial regression is a type of regression analysis where the link
between the dependent variable (y) and the independent variable (x) is expressed as the
nth-degree polynomial in x, rather than as a linear relationship [23]. The polynomial regression equation is derived from the linear regression (polynomial regression of degree 1) equation and is defined as follows:
y = b0 + b1x + b2x^2 + ⋯ + bnx^n (6)
Here, y is the predicted/target output, b0, b1, ..., bn are the regression coefficients, and x is an independent/input variable. In simple terms, if the data are not distributed linearly but instead follow an nth-degree polynomial, polynomial regression is used to obtain the desired output.
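The hedged sketch below fits the polynomial equation above by expanding the input features and then applying ordinary linear regression; degree 3 and the cubic toy data are assumptions.

```python
# Polynomial regression as linear regression on expanded features [1, x, x^2, x^3].
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
x = np.linspace(-3, 3, 100).reshape(-1, 1)
y = 1 - 2 * x[:, 0] + 0.5 * x[:, 0] ** 3 + rng.normal(scale=0.5, size=100)

model = make_pipeline(PolynomialFeatures(degree=3), LinearRegression())
model.fit(x, y)
print("R^2 on training data:", model.score(x, y))
```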
Cluster Analysis: Unsupervised machine learning techniques such as cluster analysis, often called
clustering, can be used to find and group related data points in big datasets without regard to the final
result. It accomplishes this by organizing a set of items into clusters, which are groupings of related
objects that are, in some way, more similar to one another than objects in other groups [21]. It is
frequently used as a data analysis approach to find intriguing patterns or trends in data, such as
customer groupings based on behavior. Clustering has a wide range of applications, including user
modeling, health analytics, e-commerce, mobile data processing, cybersecurity, and behavioral
analytics. The following provides a quick overview and summary of several kinds of clustering methods.
Partitioning methods: This clustering methodology divides the data into several groups or clusters according to the characteristics and similarities in the data. Depending on the type of target application, data scientists or analysts usually determine the number of clusters, either statically or dynamically, for the clustering algorithms to generate.
Based on partitioning techniques, the most popular clustering algorithms include K-means
[36], K-Medoids [38], CLARA [37], etc.
Density-based methods: These make use of the idea that a cluster in the data space is a contiguous region of high point density, separated from other such clusters by contiguous regions of low point density. Points that do not belong to any cluster are treated as noise. DBSCAN [32], OPTICS [12], and other density-based clustering techniques are common. Density-based approaches generally falter when dealing with clusters of similar density or with high-dimensional data.
Hierarchical-based methods: Typically, the goal of hierarchical clustering is to create a tree-
like structure, or hierarchy, among the clusters. As illustrated in Fig. 7, there are two main categories of hierarchical clustering strategies: (i) Agglomerative, a "bottom-up" approach in which each observation starts in its own cluster and pairs of clusters are merged as one moves up the hierarchy, and (ii) Divisive, a "top-down" approach in which all observations start in one cluster and splits are performed recursively as one moves down the hierarchy. In particular, the bottom-up clustering algorithm that we previously proposed in Sarker et al. [102] is an example of a hierarchical technique.
Grid-based methods: Grid-based clustering is very useful for handling large datasets. The idea
is to use a grid representation to summarize the dataset and then combine grid cells to create
clusters. The common grid-based clustering methods are STING [42], CLIQUE [43], etc.
Model-based methods: Model-based clustering algorithms primarily fall into two categories:
those that rely on neural network learning and those that employ statistical learning [47]. As an
illustration, GMM [46] is a statistical learning approach, while SOM [45, 9] represents a neural network learning method.
Constraint-based methods: Constraint-based clustering is a semi-supervised approach to data clustering that uses constraints to incorporate domain knowledge. The clustering is formed by incorporating application- or user-oriented requirements. Typical algorithms of this type include CMWK-Means [49], COP K-means [48], etc.
In the literature on machine learning and data science, numerous clustering algorithms have been
suggested with the capability to group data [21, 8]. We provide below a summary of the widely used, popular techniques in a number of application areas.
K-means clustering: Fast, dependable, and easy to use, K-means clustering [36] yields accurate
findings when data sets are well-separated from one another. Using this approach, the data
points are assigned to clusters so that the squared distance between each data point and its centroid is as small as possible. Stated differently, the K-means method identifies k centroids and assigns each data point to the closest cluster, aiming to keep the clusters as compact as possible. The K-means clustering process is susceptible to outliers because extreme values can
quickly alter a mean. A K-means variation that is more resilient to noise and outliers is called
K-medoids clustering [50].
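A hedged k-means sketch on well-separated synthetic blobs follows; the choice of k = 3 and the data generator are assumptions.

```python
# K-means on well-separated blobs; k and the data are illustrative.
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.8, random_state=0)

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print("Centroids:\n", km.cluster_centers_)
print("First ten assignments:", km.labels_[:10])
```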
Mean-shift clustering: A nonparametric clustering method that does not require previous
knowledge of the number of clusters or constraints on cluster shape is mean-shift clustering
[51]. Finding "blobs" in a smooth distribution or sample density is the goal of mean-shift
clustering [23]. This algorithm, which is based on centroid selection, updates centroid
candidates to represent the average of the points inside a specified area. These candidates are
filtered in a post-processing step to eliminate near-duplicates, forming the final collection of
centroids. Application domains include computer vision and image processing, where cluster
analysis is used. One drawback of mean shift is its high computational cost. Moreover, the mean-shift approach performs poorly in high-dimensional settings, where the number of clusters can change abruptly.
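The short sketch below runs mean-shift, which infers the number of clusters on its own; the bandwidth-estimation step and its quantile are assumptions commonly used in practice.

```python
# Mean-shift discovers the cluster count itself; bandwidth is estimated from data.
from sklearn.datasets import make_blobs
from sklearn.cluster import MeanShift, estimate_bandwidth

X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.7, random_state=2)

bw = estimate_bandwidth(X, quantile=0.2)   # radius of the shifting window
ms = MeanShift(bandwidth=bw).fit(X)
print("Clusters found:", len(ms.cluster_centers_))
```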
DBSCAN: A foundational approach for density-based clustering that is extensively employed
in data mining and machine learning is called Density-based Spatial Clustering of Applications
with Noise (DBSCAN) [39]. It is a non-parametric, density-based clustering strategy that separates high-density clusters from low-density regions. The fundamental tenet of DBSCAN is that a point belongs to a cluster if it is
near several other points in that cluster. In a large amount of noisy, outlier-filled data, it can
identify clusters of different sizes and shapes. Unlike k-means, DBSCAN may identify clusters
of any shape and does not require an a priori determination of the number of clusters in the
data.
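As a brief illustration, the sketch below applies DBSCAN to non-convex data; the eps and min_samples values are assumptions that typically need tuning per dataset.

```python
# DBSCAN: density-based clusters of arbitrary shape, with noise labeled -1.
from sklearn.datasets import make_moons
from sklearn.cluster import DBSCAN

X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

db = DBSCAN(eps=0.2, min_samples=5).fit(X)
n_clusters = len(set(db.labels_)) - (1 if -1 in db.labels_ else 0)
print("Clusters:", n_clusters, "| noise points:", list(db.labels_).count(-1))
```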
GMM clustering: Gaussian mixture models (GMMs) are frequently employed for distribution-based data clustering. A Gaussian mixture model is a probabilistic model in which all data points are assumed to be generated from a mixture of a finite number of Gaussian distributions with unknown parameters [23]. Expectation-maximization (EM)
[23] is an optimization approach that can be used to determine the Gaussian parameters for
each cluster. EM is an iterative technique that estimates the parameters using a statistical
model. Gaussian mixture models, as opposed to k-means, take uncertainty into account and
give the probability that a given data point falls into one of the k clusters. Compared to k-
means, GMM clustering is more reliable and effective with non-linear data distributions.
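The minimal sketch below fits a GMM with EM and shows the soft (probabilistic) assignments mentioned above; the component count and blob data are assumptions.

```python
# GMM clustering fitted via EM; soft assignments distinguish it from k-means.
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=300, centers=3, random_state=3)

gmm = GaussianMixture(n_components=3, random_state=3).fit(X)
print("Hard labels:", gmm.predict(X[:10]))
print("Soft assignment of first point:", gmm.predict_proba(X[:1]).round(3))
```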
Agglomerative hierarchical clustering: Agglomerative clustering is the most widely used hierarchical clustering technique; it groups objects into clusters according to their similarity. In this bottom-up strategy, the algorithm first treats every item as a singleton cluster. Cluster pairs are then merged one at a time until all objects are contained in a single, large cluster. The finished product is a dendrogram, a tree-based representation of the elements. Examples of these techniques are Single
linkage [52], Complete linkage, BOTS [41], and so on. The primary benefit of agglomerative
hierarchical clustering over k-means is that the tree-structure hierarchy produced by
agglomerative clustering can aid in better decision-making in the applicable application areas
because it is more informative than the disorganized collection of flat clusters returned by k-
means.
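A short sketch of bottom-up agglomerative clustering follows; the linkage choice and the synthetic data are illustrative assumptions (single linkage is just one of the variants named above).

```python
# Bottom-up agglomerative clustering with single linkage.
from sklearn.datasets import make_blobs
from sklearn.cluster import AgglomerativeClustering

X, _ = make_blobs(n_samples=150, centers=3, random_state=4)

agg = AgglomerativeClustering(n_clusters=3, linkage="single").fit(X)
print("First ten cluster labels:", agg.labels_[:10])
```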
In the literature on machine learning and data science, numerous strategies have been put forth to
decrease data dimensionality [21, 8]. We provide below a summary of the widely used, popular techniques in a number of application areas.
Variance threshold: The variance threshold is a simple, fundamental method for feature selection [23]. It removes all features whose variance falls below a given threshold. By default, it removes all zero-variance features, i.e., features that have the same value across all samples. Because it considers only the (X) features and not the required (y) outputs, this feature selection approach can be used for unsupervised learning.
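The hedged sketch below demonstrates variance-threshold selection on a tiny matrix; the data values and the threshold are assumptions.

```python
# Variance threshold: drop features whose variance is at or below the threshold.
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# Column 0 is constant (zero variance) and will be dropped by default.
X = np.array([[1, 2.0, 0.1],
              [1, 1.5, 0.9],
              [1, 2.5, 0.5],
              [1, 1.0, 0.7]])

selector = VarianceThreshold(threshold=0.0)   # default: remove zero-variance features
X_reduced = selector.fit_transform(X)         # note: uses only X, no y required
print("Kept feature mask:", selector.get_support())
print("Reduced shape:", X_reduced.shape)      # (4, 2)
```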
References: