Data Mining
Concepts and Applications
Contributors
Yoosoo Oh, Seonghee Min, Andri Irfan Rifai, Setiawan Hadi, Paquita Putri Ramadhani, Julius Olufemi
Ogunleye, Yao Shan, Esma Ergüner Özkoç, P.V. Sai Charan, P. Mohan Anand, Sandeep K. Shukla, Mawande
Sikibi, Wencai Du, Weijun Li, Qun Yang, Leon Bobrowski, Farzaneh Mansoori Mooseloo, Saeid Sadeghi,
Maghsoud Amiri, Wei-Cheng Ye, Jia-Ching Wang
Individual chapters of this publication are distributed under the terms of the Creative Commons
Attribution 3.0 Unported License which permits commercial use, distribution and reproduction of
the individual chapters, provided the original author(s) and source publication are appropriately
acknowledged. If so indicated, certain images may not be included under the Creative Commons
license. In such cases users will need to obtain permission from the license holder to reproduce
the material. More details and guidelines concerning content reuse and adaptation can be found at
http://www.intechopen.com/copyright-policy.html.
Notice
Statements and opinions expressed in the chapters are those of the individual contributors and not
necessarily those of the editors or publisher. No responsibility is accepted for the accuracy of
information contained in the published chapters. The publisher assumes no responsibility for any
damage or injury to persons or property arising out of the use of any materials, instructions, methods
or ideas contained in the book.
Preface

Section 1
Concepts of Data Mining

Chapter 1
The Concept of Data Mining
by Julius Olufemi Ogunleye

Chapter 2
Use Data Mining Cleansing to Prepare Data for Strategic Decisions
by Mawande Sikibi

Chapter 3
Privacy Preserving Data Mining
by Esma Ergüner Özkoç

Chapter 4
Multilabel Classification Based on Graph Neural Networks
by Wei-Cheng Ye and Jia-Ching Wang

Chapter 5
DMAPT: Study of Data Mining and Machine Learning Techniques in Advanced Persistent Threat Attribution and Detection
by P. V. Sai Charan, P. Mohan Anand and Sandeep K. Shukla

Chapter 6
Text Classification on the Instagram Caption Using Support Vector Machine
by Setiawan Hadi and Paquita Putri Ramadhani

Chapter 7
Computing on Vertices in Data Mining
by Leon Bobrowski

Section 2
Applications of Data Mining

Chapter 8
Artificial Intelligence and Its Application in Optimization under Uncertainty
by Saeid Sadeghi, Maghsoud Amiri and Farzaneh Mansoori Mooseloo

Chapter 9
Practical Application Using the Clustering Algorithm
by Yoosoo Oh and Seonghee Min

Chapter 10
Leaching Mechanisms of Trace Elements from Coal and Host Rock Using Method of Data Mining
by Yao Shan

Chapter 11
Tourist Sentiment Mining Based on Deep Learning
by Weijun Li, Qun Yang and Wencai Du

Chapter 12
Data Mining Applied for Community Satisfaction Prediction of Rehabilitation and Reconstruction Project (Learn from Palu Disasters)
by Andri Irfan Rifai
Preface
This book discusses the concepts of data mining and presents some of the advanced
research in this field. The book provides the fundamentals, techniques, and
methods of processing big data for various applications. The chapters discuss the
concepts, applications, and research frontiers in data mining with algorithms and
implementation details for use in the real world. It includes twelve chapters divided
into two sections: “Concepts of Data Mining” and “Applications of Data Mining.” The
initial seven chapters describe the concepts of data mining, while the remaining five
chapters discuss the applications of data mining. The chapters include real-world
problems in various fields and propose methods to address them. The first chapter
introduces readers to the technologies explored in each of the subsequent chapters.
Chapter 1 provides an overview of the data mining process and its benefits and
drawbacks, as well as discusses data mining methodologies and tasks. This chapter
also discusses data mining techniques in terms of their features, benefits, drawbacks,
and application areas.
After the introductory chapter on the concepts of data mining, we look at the various
steps in the data mining process. The initial step after acquiring the data is data
cleaning, followed by data integration, data reduction, and data transformation.
The data is then analyzed and evaluated for knowledge discovery.
Chapter 2 describes the initial step of data cleaning to prepare data for strategic
decisions. As the pre-processing of data is an important step in the data mining
process, the data cleaning process helps in obtaining accurate strategic decisions.
The presence of incorrect or inconsistent data can significantly distort the results
of analyses, often negating the potential benefits of strategic decision-making
approaches. Thus, the representation and quality of data are first and foremost before
running an analysis. As such, this chapter identifies the sources of data collection to
remove errors and describes data mining cleaning and its methods.
Privacy has become a serious problem, especially in data mining applications that
involve the collection and sharing of personal data. For these reasons, the problem of
protecting privacy in the context of data mining differs from traditional data privacy
protection, as data mining can act as both a friend and foe. Chapter 3 discusses
privacy-preserving data mining and its two techniques, namely, those proposed for
input data that will be subject to data mining, and those suggested for processed
data that are the output of the data mining algorithms. This chapter also presents
attacks against the privacy of data mining applications. The chapter concludes with a
discussion of next-generation privacy-preserving data mining applications at both the
individual and organizational levels.
In the cyber world, modern-day malware is quite intelligent, with the ability to hide
its presence on the network and perform stealthy operations in the
background. Advanced persistent threat (APT) is one such kind of malware attack
on sensitive corporate and banking networks that can remain undetected for a long
time. In real-time corporate networks, identifying the presence of intruders is a
challenging task for security experts. Chapter 5 presents a study on data mining
and machine learning techniques in APT attribution and detection. In this chapter,
the authors shed light on various data mining, machine learning techniques and
frameworks used in both attribution and detection of APT malware. Additionally,
the chapter highlights gap analysis and the need for a paradigm shift in existing
techniques to deal with evolving modern APT malware.
Instagram is one of the world’s top ten most popular social networks. One of the
main purposes of Instagram is social media marketing. Chapter 6 focuses on text
classification of Instagram captions using a support vector machine (SVM). The
proposed SVM algorithm uses text classification to categorize Instagram captions
into organized groups, namely fashion, food and beverage, technology, health and
beauty, lifestyle and travel, and so on, in 66,171 post captions to classify what is
trending on the platform. The chapter uses the term frequency-inverse document
frequency (TF-IDF) method and percentage variations for data separation in this
study.
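As a rough sketch of how such a TF-IDF and SVM pipeline can be assembled, the example below uses scikit-learn on a handful of invented captions and labels; the data, categories, and library choice are illustrative assumptions, not the chapter's actual 66,171-post dataset or code.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical captions and category labels (placeholders only).
captions = [
    "new summer dress just arrived in store",
    "best burger and fries in town",
    "unboxing the latest smartphone today",
    "morning yoga and a healthy smoothie",
]
labels = ["fashion", "food and beverage", "technology", "health and beauty"]

# TF-IDF turns each caption into a weighted term vector; a linear SVM separates the classes.
model = make_pipeline(TfidfVectorizer(), LinearSVC())
model.fit(captions, labels)

print(model.predict(["where to find a good burger tonight"]))
```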
The main challenges in data mining are related to large, multi-dimensional data
sets. There is a need to develop algorithms that are precise and efficient enough to
deal with big data problems. The simplex algorithm from linear programming is an
example of a successful big data problem-solving tool.
Chapter 9 presents a practical application using a clustering algorithm: the proposed
concentration clustering algorithm for particulate matter distribution estimation
performs a K-means clustering algorithm to cluster feature data sets and find the
observatory location representing the particulate matter distribution.
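A minimal sketch of that K-means step is shown below; the feature vectors and the number of clusters are invented placeholders, not the chapter's observatory data.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical feature vectors, e.g. particulate matter statistics per observatory.
features = np.array([[35.0, 12.1], [33.5, 11.8], [80.2, 25.4],
                     [78.9, 24.9], [15.3, 5.2], [14.8, 5.6]])

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(features)
print(kmeans.labels_)           # cluster assignment for each observatory
print(kmeans.cluster_centers_)  # representative centre of each cluster
```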
Chapter 10 looks at the leaching mechanisms of trace elements from coal and host
rock using data mining. Coal and host rock, including gangue dump, are important
sources of toxic elements that have great potential to contaminate surface and ground
water. The leaching and migration of trace elements are controlled mainly by two
factors: trace elements’ occurrence and surrounding environment. The traditional
method to investigate elements’ occurrence and leaching mechanisms is based
on a geochemical method. In this chapter, data mining is applied to discover the
relationship and patterns that are concealed in the data matrix. From the geochemical
point of view, the patterns mean the occurrence and leaching mechanisms of trace
elements from coal and host rock. An unsupervised machine learning method using
principal component analysis is applied to reduce dimensions of the data matrix of
solid and liquid samples, then the re-calculated data is clustered to find its co-existing
pattern using the Gaussian mixture model.
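That two-step workflow (dimensionality reduction followed by mixture-model clustering) can be sketched roughly as follows; the random matrix merely stands in for the chapter's geochemical data matrix, and scikit-learn is an assumed tool.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 10))   # placeholder data matrix: 50 samples, 10 measured elements

# Step 1: reduce the dimensions of the data matrix with PCA.
X_reduced = PCA(n_components=3).fit_transform(X)

# Step 2: cluster the re-calculated (reduced) data with a Gaussian mixture model.
gmm = GaussianMixture(n_components=2, random_state=0).fit(X_reduced)
print(gmm.predict(X_reduced))   # co-existing pattern (cluster) assigned to each sample
```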
This book is for students, researchers, practitioners, data analysts, and business
professionals who seek information on the various data mining techniques and their
applications.
I would like to convey my gratitude to everyone who contributed to this book including
the authors of the accepted chapters. My special thanks to Publishing Process Manager,
Ms. Mia Vulovic, and other staff at IntechOpen for their support and efforts in bringing
this book to fruitful completion.
Section 1
Chapter 1
Abstract
Data mining is a technique for identifying patterns in large amounts of data and
information. Examples of data sources include databases, data centers, the internet and
other data storage formats, as well as data that is dynamically streaming into the network.
This chapter provides an overview of the data mining process, its benefits and drawbacks,
and data mining methodologies and tasks. It also discusses data mining techniques in
terms of their features, benefits, drawbacks, and application areas.
1. Introduction
systems [1, 2]. This chapter will go into the basics of data mining as well as the data
extraction techniques. Mastering this technology and its techniques will provide
significant advantages as well as a competitive edge.
Data mining has gotten a lot of attention in the information industry in recent
years because of the widespread availability of massive quantities of data and the
pressing need to transform the data into valuable information and knowledge.
Business management, quality control, and market research, as well as engineering
design and science discovery, will all benefit from the information and expertise
acquired. Governments, private corporations, large organizations, and all industries
are interested in collecting a large amount of data for business and research purposes
[3, 4]. The following are some of the reasons why data mining is so important:
• Data mining is the process of collecting vast amounts of data in order to extract
information and insights from it. The data industry is currently experiencing
rapid growth, which has resulted in increased demand for data analysts and
scientists.
• We interpret the data and then translate it into useful information using this
technique. This enables an organization to make more accurate and better deci-
sions. Data mining aids in the creation of wise business decisions, the execution
of accurate campaigns, the prediction of outcomes, and many other tasks.
• We can evaluate consumer habits and insights with the aid of data mining. This
results in a lot of growth and a data-driven business.
It’s important to remember that which data mining approach to utilize is mostly
determined by the amount of data accessible, the type of data, and the dimensions.
Figure 1.
An overview of the data mining process.

Although there are evident differences in the types of challenges that each data
mining technique is best suited for, the nature of real-world data, as well as the
complicated ways in which markets, customers, and the data that represent them keep
changing, means that the data is always altering.
favors one technique over another. Decisions are sometimes made depending on
the availability of experienced data mining analysts in one or more techniques. The
preference of one technique over the others depends more on getting good resources
and good analysts (Figure 1) [5].
2. Related works
The study in [4] offered an overview of some of the most widely used data mining
algorithms, divided into two portions, each with its own theme.
The authors discussed a variety of data mining methods so that the reader may see
how each algorithm fits into the larger picture of data mining approaches. There were
six different types of data mining algorithms presented in all. The authors noted that,
although there are a number of other algorithms and many variations of the tech-
niques that were described, one of the algorithms is almost always used in real-world
deployments of data mining systems [4].
3. Methods
1. Collecting requirements
The collection and understanding of requirements is the first step in any data
mining project. With the vendor’s business viewpoint, data mining analysts or
users determine the requirement scope.
2. Data investigation
This step entails identifying and converting data patterns using data mining
statistics. It necessitates collecting, assessing, and investigating the requirement
or project. Experts comprehend the issues and challenges and translate them into
metadata.
4. Modeling
Data experts use their best tools for this phase because it is so important in the
overall data processing. To filter the data in an acceptable way, all modeling
methods are used. Modeling and assessment are intertwined steps that must be
completed at the same time to ensure that the criteria are correct. After the final
modeling is completed, the accuracy of the final result can be checked.
5. Assessment or Evaluation
After efficient modeling, this is the filtering method. If the result is not acceptable,
it is then passed back to the model. After a satisfactory result, the requirement is
double-checked with the provider to ensure that no details are overlooked. At the
end of the process, data mining experts evaluate the entire outcome.
6. Deployment
This is the final stage in the entire process. Data is presented to vendors in the
form of spreadsheets or graphs by experts.
The following functions can be performed with data mining services [9, 10]:
• Knowledge extraction: This is the procedure for finding useful trends in data that
can be used in decision-making [11]. This is because decisions must be made on
the basis of correct/accurate data and evidence.
• Web data: Web data is notoriously difficult to mine. This is due to the essence
of the situation. Web data, for example, can be considered dynamic, meaning it
changes over time. As a result, the data mining process should be replicated at
regular intervals.
• Data pre-processing: Typically, the data gathered is stored in a data center. This
information must be pre-processed. Data mining experts should manually delete
any data that is considered unimportant during pre-processing.
• Market research, surveys, and analysis: Data mining can be used for product
research, surveys, and market research. It is possible to collect data that would be
useful in the creation of new marketing strategies and promotions.
• News: With nearly all major newspapers and news outlets sharing their news online
these days, it is easy to collect information on developments and other important
topics. It is possible to be in a better place to compete in the market this way.
• Internet research: The internet is well-known for its vast amount of knowledge.
It is obvious that it is the most important source of data. It is possible to collect
a great deal of knowledge about various businesses, consumers, and company
clients. Frauds can be detected using online resources.
• Study of competitors: It’s important to know how your competitors are doing in
the business world. It is important to understand both their strengths and weak-
nesses. Their methods of marketing and distribution can be mined, including
their methods of reducing overall costs.
Data mining and its features have many advantages. It raises the need for a data-driven
market as it is combined with analytics and big data. Some of the benefits are as follows:
3. Data mining is useful not only for making forecasts, but also for developing new
services and goods.
4. Predictive models are used in the retail sector for products and services. Better
quality and consumer insights are possible in retail stores. Historical data is used
to calculate discounts and redemption.
5. Data mining aids financial gains and alerts for banks. They create a model
based on consumer data that aids in the loan application process, among
other things.
7. Marketing firms use data mining to create data models and forecasts based on
historical data. They manage promotions, marketing strategies, and so on. This
leads to fast growth and prosperity.
8. Data mining results in the creation of new revenue sources, resulting in the
expansion of the company.
10. When competitive advantages are found, data mining can help reduce
production costs.
Understanding the types of tasks, or problems, that data mining can solve is the
best way to learn about it. At a high level, the majority of data mining tasks can be
classified as either predictive or descriptive. Predictive tasks allow you to forecast the value
of a variable based on previously collected data. Predicting when a customer will
leave a business, predicting whether a transaction is fraudulent, and recognizing
the best customers to receive direct marketing offers are all examples of predictive
tasks. Descriptive tasks, on the other hand, attempt to summarize the information.
Automatically segmenting customers based on their similarities and differences, as
well as identifying correlations between products in market-basket data, are examples
of such tasks [12].
Organizations now have more data at their disposal than they have ever had
before. However, due to the sheer volume of data, making sense of the massive
amounts of organized and unstructured data to enact organization-wide changes can
be exceedingly difficult. This problem, if not properly handled, has the potential to
reduce the value of all the data.
Data mining is the method by which businesses look for trends in data to gain
insights that are important to their needs. Both business intelligence and data science
need it. Organizations may use a variety of data mining strategies to transform raw
data into actionable insights [13]. These range from cutting-edge artificial intelligence
to the fundamentals of data planning, all of which are critical for getting the most out
of data investments.
b. Pattern Recognition
c. Classification
d. Association
e. Detection of Outliers
f. Clustering
g. Regression
h. Prediction
i. Sequential trends
j. Decision Trees
k. Statistical techniques
l. Visualization
m. Neural Networks
n. Data warehousing
b. Pattern Recognition
A basic data mining technique is pattern recognition. It entails spotting and
tracking trends or patterns in data in order to draw informed conclusions about
business outcomes. When a company notices a pattern in sales data, for example,
it has a reason to act. If it’s determined that a certain product sells better than
others for a specific demographic, a company may use this information to
develop similar goods or services, or simply better stock the original product for
this demographic [14].
c. Classification
The various attributes associated with different types of data are analyzed using
classification data mining techniques. Organizations may categorize or classify
similar data after identifying the key characteristics of these data types. This is
essential for recognizing personally identifiable information that organizations
may wish to shield or redact from records, for example.
d. Association
The statistical technique of association is a data mining technique. It denotes that
some data (or data-driven events) are linked to other data. It’s similar to the ma-
chine learning concept of co-occurrence, where the existence of one data-driven
event indicates the probability of another. Correlation and association are two
statistical concepts that are very similar. This means that data analysis reveals a
connection between two data occurrences, such as the fact that hamburger pur-
chases are often followed by French fries purchases.
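A toy sketch of such co-occurrence analysis is shown below; the transactions are invented, and the support and confidence of one rule are computed directly rather than with a dedicated association-rule library.

```python
from collections import Counter
from itertools import combinations

# Hypothetical market-basket transactions.
transactions = [
    {"hamburger", "french fries", "cola"},
    {"hamburger", "french fries"},
    {"salad", "water"},
    {"hamburger", "cola"},
]

# Count how often each pair of items appears together.
pair_counts = Counter()
for basket in transactions:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

# Support and confidence of the rule {hamburger} -> {french fries}.
n = len(transactions)
support = pair_counts[("french fries", "hamburger")] / n
confidence = support / (sum("hamburger" in b for b in transactions) / n)
print(f"support={support:.2f}, confidence={confidence:.2f}")
```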
e. Detection of Outliers
Outlier detection is used to identify the deviations in datasets. When companies
discover anomalies in their records, it becomes easier to understand why they
occur and plan for potential events in order to achieve business goals. For
example, if there is an increase in the use of transactional systems for credit
cards at a certain time of day, businesses can use this information to maximize
their income for the day by finding out the cause of it.
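One very simple way to flag such deviations is a z-score rule, sketched below on hypothetical hourly transaction counts; production systems typically use more sophisticated detectors.

```python
import numpy as np

# Hypothetical hourly counts of credit-card transactions.
hourly_counts = np.array([120, 115, 130, 125, 118, 410, 122, 119])

mean, std = hourly_counts.mean(), hourly_counts.std()
z_scores = (hourly_counts - mean) / std

# Flag hours whose volume deviates strongly from the norm.
outlier_hours = np.where(np.abs(z_scores) > 2)[0]
print(outlier_hours)  # the spike at index 5 stands out
```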
f. Clustering
Clustering is an analytics methodology that employs visual approaches to data
interpretation. Graphics are used by clustering mechanisms to demonstrate
where data distribution is in relation to various metrics. Different colors are used
in clustering techniques to represent data distribution. When it comes to cluster
analytics, graph-based methods are perfect. Users can visually see how data is
distributed and recognize patterns related to their business goals using graphs
and clustering in particular.
g. Regression
The essence of the relationship between variables in a dataset can be determined
using regression techniques. In some cases, such connections may be causal, and
in others, they may only be correlations. Regression is a simple white box tech-
nique for revealing the relationships between variables. In areas of forecasting
and data modeling, regression methods are used (Figure 2).
Figure 2.
Illustration example of linear regression on a set of data [15].
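A minimal sketch of fitting such a regression with scikit-learn is shown below; the numbers are invented for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data: advertising spend (predictor) vs. sales (response).
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([2.1, 4.3, 6.2, 8.1, 9.9])

model = LinearRegression().fit(X, y)
print(model.coef_[0], model.intercept_)  # fitted slope and intercept
print(model.predict([[6.0]]))            # forecast for an unseen value
```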
h. Prediction
One of the four branches of analytics is prediction, which is a very important
feature of data mining. Patterns observed in current or historical data are ex-
tended into the future using predictive analytics. As a result, it allows businesses
to predict what data patterns will emerge next. Using predictive analytics can
take a variety of forms. Machine learning and artificial intelligence are used in
some of the more advanced examples. Predictive analytics, on the other hand,
does not have to rely on these methods; simpler algorithms can also be used.
i. Sequential Trends
This data mining technique focuses on identifying a sequence of events. It’s
particularly useful for transactional data mining. For example, when a customer
buys a pair of shoes, this technique will show which pieces of clothing they are
more likely to buy. Understanding sequential trends may assist businesses in
recommending additional products to consumers in order to increase sales.
j. Decision trees
Decision trees are a form of predictive model that enables businesses to mine
data more effectively. A decision tree is technically a machine learning technique,
but because of its simplicity, it is more often referred to as a white box machine
learning technique. Users can see how the data inputs influence the outputs using
a decision tree. A random forest is a predictive analytics model that is created by
combining different decision tree models. Complicated random forest models are
referred to as “black box” machine learning techniques because their outputs are
not always easy to comprehend based on their inputs. However, in most cases,
this simple form of ensemble modeling is more effective than relying solely on
decision trees (Figure 3).
Figure 3.
Example of a decision tree [15].
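The contrast between a single, inspectable tree and a random-forest ensemble can be sketched as follows, using scikit-learn's built-in iris data as a stand-in for real business data.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# A single "white box" tree: the learned rules can be printed and inspected.
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)
print(export_text(tree))

# An ensemble of trees usually predicts better but is harder to interpret.
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)
print(tree.score(X_test, y_test), forest.score(X_test, y_test))
```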
k. Statistical techniques
Statistical approaches are at the heart of the majority of data mining analytics.
The various analytics models are focused on mathematical principles that pro-
duce numerical values that can be used to achieve clear business goals. In image
recognition systems, for example, neural networks use complex statistics based on
these mathematical principles.
l. Visualization
Another essential aspect of data mining is data visualization which uses sen-
sory impressions that can be seen to provide users with access to data. Today’s
data visualizations are interactive, useful for streaming data in real-time, and
distinguished by a variety of colors that show various data trends and patterns.
Dashboards are a valuable tool for uncovering data mining insights using data
visualizations. Instead of relying solely on the numerical results of mathematical
models, organizations may create dashboards based on a variety of metrics and
use visualizations to visually illustrate trends in data.
m. Neural Networks
A neural network is a type of machine learning model that is frequently used
in AI and deep learning applications. Neural networks are among the most accurate
machine learning models used today. They are named for the fact that they
have multiple layers that resemble how neurons function in the human brain.
While a neural network can be a powerful tool in data mining, companies should
exercise caution when using it because some of these neural network models
are extremely complex, making it difficult to understand how a neural network
calculated an output (Figure 4).
Figure 4.
Example of a neural network [15].
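A small sketch of training such a multi-layer network is given below, using scikit-learn's MLPClassifier on its built-in digits data purely as a stand-in example.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Two hidden layers loosely mirror the "multiple layers" idea described above.
net = MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=1000, random_state=0)
net.fit(X_train, y_train)
print(net.score(X_test, y_test))  # accuracy on held-out digits
```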
n. Data warehousing
Data warehousing used to imply storing organized data in relational database
management systems so that it could be analyzed for business intelligence,
reporting, and simple dashboarding. Cloud data centers and data warehouses
in semi-structured and unstructured data stores, such as Hadoop, are available
today. Although data warehouses have historically been used to store and analyze
historical data, many new approaches can now provide in-depth, real-time data
analysis.
With so many methods to use during data mining, it’s important to have the right
resources to get the most out of your analytics. For proper implementation, these
methods usually necessitate the use of many different tools or a tool with a broad set
of capabilities.
While organizations can use data science tools like R, Python, or KNIME for
machine learning analytics, it’s critical to use a data governance tool to ensure compli-
ance and proper data lineage. Additionally, in order to conduct analytics, companies
would need to collaborate with repositories such as cloud data stores, as well as
dashboards and data visualizations to provide business users with the knowledge
they need to comprehend analytics. All of these features are available in tools, but it’s
critical to find one or more that meet your company’s requirements [16].
4. Discussions
The development of data mining has been accelerated by cloud computing tech-
nology. Cloud systems are ideally adapted for today’s high-speed, massive amounts of
semi-structured and unstructured data that most businesses must contend with. The
elastic capabilities of the cloud will easily scale to meet these big data demands. As a
result, since the cloud can carry more data in a variety of formats, more data min-
ing techniques are needed to transform the data into insight. Advanced data mining
techniques such as AI and deep learning are now available as cloud services.
Future advancements in cloud computing would undoubtedly increase the need
for more powerful data mining software. AI and machine learning will become even
more commonplace in the next five years than they are now. The cloud is the most
suitable way to both store and process data for business value, given the exponentially
growing pace of data growth on a daily basis. As a result, data mining methods can
depend much more on the cloud than they do now.
Currently, data scientists use a variety of data mining techniques, which differ
in precision, efficiency, and the type and/or volume of data available for analy-
sis. Classical and modern data mining techniques are two types of data mining
techniques. Statistical approaches, Nearest Neighbors, Clustering, and Regression
Analysis are examples of Classical techniques, while Modern techniques include
Neural Networks, Rule Induction Systems, and Decision Trees.
1. Statistical Techniques
Advantages
Disadvantages
• The researcher is only able to draw patterns and correlations from the data and
cannot assess the validity or consider a causal theory process.
• It’s difficult to view and validate this data because it’s always secondary.
2. Nearest Neighbors
Advantages
• No training phase is required before predictions can be made.
• The model evolves continuously as new data is added.
• Selecting the first hyperparameter (the number of neighbors) can take some time, but
once that is done, the remaining parameters work well with it.
Disadvantages
• While the implementation can be simple, the efficiency (or speed of the
algorithm) decreases rapidly as the dataset grows.
• Can handle a small number of input variables, but as the number of variables
increases, the algorithm has trouble predicting the performance of new data points.
• When classifying new data, the problem of determining the optimal number of
neighbors to consider is frequently encountered.
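The sensitivity to the choice of k noted above can be explored with a short sketch like the following; scikit-learn and the iris data are assumptions used only for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# The main hyperparameter is the number of neighbours k; try a few values.
for k in (1, 3, 5, 7):
    score = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5).mean()
    print(k, round(score, 3))
```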
3. Clustering
Advantages
• Hierarchical methods enable the end user to choose from a large number of
clusters or a small number of clusters.
• Appropriate for data sets of any form and attributes of any kind.
Disadvantages
• The assumption is not completely right, and the clustering result is dependent
on the parameters of the chosen models.
4. Regression Analysis

Advantages
• Linear regression can solve some very simple problems much faster and more
easily, since prediction is simply a multiple of the predictors.
• Linear regression: the modeling process is simple, requires few calculations, and
runs quickly even when the data is large.
• Linear regression: the factor can provide insight into and interpretation of each
variable.
• Linear regression is easier to implement, evaluate, and apply than other methods.
• Multiple regression will assess the relative importance of one or more predictor
variables in determining the criterion’s significance.
Disadvantages
• Any disadvantage of using a multiple regression model is usually due to the data
used, either because there is insufficient data or because a correlation is incorrectly
assumed to imply causation.
5. Neural Networks
Advantages
• Artificial Neural Networks (ANN) will model and analyze nonlinear, complex
relationships.
• Has highly accurate statistical models that can be used to solve a wide range of
problems.
• Information is stored on the network as a whole, not in a database, and the network
will run even though a few pieces of information are missing from one location.
• Has fault tolerance, which means that contamination of one or more ANN cells
will not stop development.
Disadvantages
• Network behavior that is not explained: Even though ANN provides a sampling
solution, it does not explain why or how it works.
• Difficulty in demonstrating the problem to the network: ANNs should deal with
numerical data. Before integrating into ANN, problems must be translated into
numerical values.
6. Rule Induction
Advantages
• When dealing with a small number of rules, IF-THEN rules are easy to
understand and are meant to be the most interpretable model.
• The decision rules are just as descriptive as decision trees, but they are a lot
smaller.
• Since conditions only shift at the threshold, decision rules will withstand
monotonous input function transformations.
• IF-THEN rules produce models with few features. Only the features that are
important to the model are chosen.
• Simple rules like OneR can be used to test more complex algorithms.
Disadvantages
• IF-THEN rules are mostly concerned with classification and almost completely
neglect regression.
• Categorical features are also needed. This means that numerical features must
be discretized if they are to be included.
7. Decision Trees

Advantages
• Data is organized into distinct categories, which are therefore simpler to grasp
than points on a multidimensional hyperplane, as in linear regression. With its
nodes and edges, the tree structure has a natural visualization.
• CART validates the tree as it is built, meaning that model validation and the
discovery of the optimally general model are built into the algorithm itself.
• Decision trees score strongly on many of the features required for powerful
data mining.
Disadvantages
• Can struggle with some very simple problems where prediction is simply a
multiple of predictors.
• Lack of smoothness: small changes in the input features can have a large effect
on the predicted outcome, which is not always desirable.
• The trees are also quite unstable. A few changes to the training dataset can result in a
completely different tree, since every split depends on the split of its parent.
These methods are best applied to particular tasks in order to achieve the best
performance. Table 1 below lists the data mining tasks and the techniques that
can be used to complete them.
A business analyst’s dream is data warehousing. All of the data concerning the
organization’s actions is centralized and accessible through a single set of analytical
tools. A data warehouse system’s goal is to give decision-makers the accurate, timely
data they need to make the best decisions possible. A relational database manage-
ment system server serves as the central repository for informational data in the data
warehouse architecture. The processing of operational data is kept distinct from the
processing of data warehouse data.
The central information repository is surrounded by a number of critical com-
ponents that work together to make the overall ecosystem functional, manageable,
and available to both operational systems and end-user query and analysis tools. The
warehouse’s raw data is often derived from operational applications. Data is cleansed
and turned into an integrated structure and format when it enters the warehouse.
Conversion, summarization, filtering, and condensing of data may all be part of
the transformation process. Since the data contains a historical component, the
warehouse must be capable of holding and managing large volumes of data as well as
different data structures for the same database over time.
Table 1.
Data mining tasks and the methods used to accomplish them.
The following Table 2 lists data mining techniques and their areas of applications.
Table 2.
Data mining techniques and their areas of use.
5. Conclusion
It’s worthy of note to state that time is spent on extracting useful information
from data. As a result, in order for companies to develop quickly, it is necessary to
make accurate and timely decisions that enable them to take advantage of available
opportunities. In today’s world of technology trends, data mining is a rapidly grow-
ing industry. In order to obtain valuable and reliable information, everyone today
needs data to be used in the right way and with the right approach. Data mining can
be initiated by gaining access to the appropriate resources. Since data mining begins
immediately after data ingestion, finding data preparation tools that support the
various data structures required for data mining analytics is important. Organizations
may also want to identify data in order to use the aforementioned methods to explore
it. Modern data warehousing, as well as various predictive and machine learning/AI
techniques, are helpful in this regard.
Choosing which approach to employ, and when, is clearly one of the most difficult
aspects of implementing a data mining process. Some of the parameters that are
critical in deciding the technique to be used are determined by trial and error. There
are clear differences in the types of problems that each data mining technique is best
suited for. As a result, there is no simple rule that favors one technique over another.
Decisions are often taken based on the availability of qualified data mining analysts in
one or more techniques. The choice of a technique over the other is more dependent
on the availability of good resources and analysts.
Acknowledgements
This work was supported by the Faculty of Applied Informatics, Tomas Bata
University in Zlín, Czech Republic, under Projects IGA/CebiaTech/2021/001.
Author details

Julius Olufemi Ogunleye
© 2021 The Author(s). Licensee IntechOpen. This chapter is distributed under the terms of
the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0),
which permits unrestricted use, distribution, and reproduction in any medium, provided
the original work is properly cited.
References
[1] Software Testing Help (April 16, 2020): Data Mining Techniques: Algorithm, Methods & Top Data Mining Tools.

[2] Silhavy, P., Silhavy, R., & Prokopova, Z. (2019): Categorical variable segmentation model for software development effort estimation. IEEE Access, 7, 9618-9626.

[3] Sandeep Dhawan (2014): An Overview of Efficient Data Mining Techniques.

[4] Alex Berson et al. (2005): An Overview of Data Mining Techniques.

[5] Jiawei H. and Micheline K. (2000): Data Mining: Concepts and Techniques.

[6] ACM SIGKDD (2006-04-30), Retrieved (2014-01-27): Data Mining Curriculum.

[12] Karna H. et al. (2018): Application of data mining methods for effort estimation of software projects.

[13] Sehra S.K. et al. (2014): Analysis of Data Mining techniques for software effort estimation.

[14] Trevor H. et al. (2009): The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Archived from the original on 2009-11-10. Retrieved 2012-08-07.

[15] Ogunleye J.O. (2020): Review of Data Mining Techniques in Software Effort Estimation.

[16] Dejaeger K., et al. (2012): Data Mining Techniques for Software Effort Estimation: A Comparative Study.
Abstract
Keywords: Data, Data cleaning, Data collection, Data mining, Data preparation,
Data quality, Messy data
1. Introduction
Times have changed for production organizations that believe keeping messy
data saves their day. This messy data is in the dataset, which is stored in databases,
repositories, and data warehouses. Massive amounts of data are available on their
resources for the organization to influence their strategic decision. Data collected
from various resources is messy, and this affects the quality of the results. Data
preparation offers better data quality, which will help the organizations; yet data keeps
growing yearly, making most existing methods no longer suitable for messy data.
The growing enthusiasm of messy data on the dataset for data-driven strategic
decision-making has created the importance of preparing data for future use over the
years. The rapid growth of messy data drives new opportunities for the organization
and processing the quality of the data by cleaning and preparing data becomes essen-
tial for analysts and users. Unfortunately, if this is not handled correctly, unreliable
data could lead to misguided strategic decisions.
Data mining is no longer a new research field [1]. It aims to prepare data to
improve data quality before processing by identifying and removing errors and
inconsistencies in the dataset [2]. Data mining can pull data and prepare it to inform
an organization's strategic decisions; the data can also be prepared in advance for
specific organizational purposes.
Data mining could be added to a single application to pull anomalies within a large
dataset. Utilizing the software arranges data in the large dataset to develop efficient
organizational strategies. Data mining software provides a user-friendly interface that allows
organizational analysts and users who may not be technically advanced to execute
data preparation in data mining [3]. Putting this capability in the hands of
non-technical users allows them to respond to data quality issues quickly.
Data preparation is the feature within data mining; it has immeasurable value
working with data [4]. Utilizing the software will begin to embed within the orga-
nization. Data mining software is available on the market for an organization to use
their data in the dataset. Thus, markets are different from a decade ago due to rapid
change in the world economy and technological advancement. This technology is
popular with marketers because it allows analysts and users to make smart strategic
decisions. It enables better development of market strategies for competitive advan-
tage ahead amongst organizations. As vendors continue to introduce solutions, the
marketing strategy improves the data quality of the dataset stored in their resources.
With data mining, analysts and users can access the dataset in preparation for it to be
available for future use.
2. Objectives
This paper aims to develop a process for the data mining capability undertaken on the
dataset. The literature review considers current knowledge contributions to this topic
in support of the paper's objectives.
3. Literature review
Data preparation corrects inconsistent data in the dataset to prepare quality data
[5]. Research indicates that data preparation in data mining formulates a workflow
process covering steps to prepare data [6]. However, some research suggested that
data preparation begins with data collection to check data quality [7]. This paper
aims to demonstrate the evolution of collecting data into preparation steps to influ-
ence data quality. The paper examines the data preparation in data mining processes
through data collection.
Data mining is often described as an add-on software in checking the data quality
in the dataset by searching through the large amount of data stored in databases,
repositories, and data warehouses. The stored data is believed to be messy,
inconsistent, and full of errors; the information is unclear to analysts and users, making
it difficult to ready it for its specific purposes [8]. Overloaded data limits
analysts and users; thus, software such as data mining is developed to solve this chal-
lenge through automation.
The data mining software uses recognition technologies and statistical techniques
to clean messy data and discover the possible rule to govern data in databases,
repositories, and data warehouses. Data mining is considered a process that requires
goals and objectives to be specified [9]. Once the intended goals are set, it is necessary
to determine what data is collected or available. However, before data is used, data
preparation is performed, making data ready for its purposes.
The concept that strategic or effective decisions are based on appropriate data
is not new. Finding the correct data for strategic decisions began 30 years ago [10].
During the late 1960s, organizations created reports from production sensors into data-
bases, repositories, and data warehouses. These resources stored data that could be retrieved and
manipulated to produce constructive reports containing information to meet specific
strategic decision needs.
In the 1980s, analysts and users began to need data more frequently and in more
individualized forms. Thus, organizations started to request data from these resources. Later,
in the 1990s, analysts and users required immediate access to more detailed infor-
mation, meant to correlate with production and strategic decision processes.
This helped analysts and users extract data from databases, repositories, and
data warehouses.
The analysts and users began to realize the need for more tools to prepare data for
future uses. Additionally, the organizations recognized the accumulating amount of
data; thus, new tools were needed to prepare data before it could meet their needs. Such tools enabled
the system to search for any possible errors and inconsistencies in the dataset. Data
mining software was the first developed to help analysts and users to find quality data
from a voluminous amount of data. Because the massive volume of data keeps rapidly
growing, preparation methods are urgently needed. Therefore, data mining has
become an increasingly important research field [11].
Data cleansing is an operation within data mining software that can be performed on
the existing data to remove anomalies and prepare the data collection. It involves removing
errors and inconsistencies and transforming data into a uniform format in the dataset [12].
With the amount of data collected, manual data cleansing for preparation is impossible,
as it is time-consuming and prone to errors. The data cleansing process consists of several
stages: detecting data errors and repairing them [13]. Although it is thought
of as a tedious exercise, establishing a process and template for the data cleansing
process gives assurance that the method applied is correct. Hence, data cleansing focuses
on errors beyond small technical variations and constitutes a significant shift in data handling [14].
Data cleansing based on the knowledge of technical errors expects normal values
on the dataset. Missing values may be due to interruptions of the data flow. Hence,
predefined rules for dealing with errors and true missing and extreme values are part
of better practice. However, it is more efficient to detect the errors by actively searching
for them on the dataset in a planned way. Lack of data through data cleansing will
arise if the analysts and users do not fully understand a dataset, including skips and
filters [14].
Moore and McCabe [15] emphasized that serious strategic decision errors would
endure if the data quality were poor, leading to low data utilization efficiency.
Data cleansing follows data collection: data is thoroughly checked for errors,
and other inconsistencies are corrected for future use [16]. Although the importance
of data-handling procedures is underlined in better clinical practice and data
management guidelines, gaps in knowledge about optimal data handling methodolo-
gies and standards of quality data are still present [14].
Detecting and correcting corrupted or inaccurate records helps to obtain standard-
quality data from the dataset: the incorrect, inaccurate, or irrelevant parts of
the data are found, and coarse data is replaced, modified, or deleted [14]. In reality, data
cannot always be used as it is and needs preparation before use. Achieving higher
data quality during a data cleansing process is required to remove anoma-
lies. Thus, the data cleansing process can be defined as assessing data's correctness
and improving it. To enhance data quality further, pre-processing data mining
techniques are used to understand the data and make it more easily accessible.
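A minimal sketch of such a cleansing step (duplicate removal, missing-value removal, and format normalization) on a small, invented table using pandas might look as follows.

```python
import pandas as pd

# Hypothetical messy production records.
df = pd.DataFrame({
    "machine_id": ["M1", "M1", "M2", None, "M3"],
    "temperature": [71.2, 71.2, None, 68.0, 69.5],
    "status": ["OK", "OK", "ok", "OK", "FAIL"],
})

clean = (
    df.drop_duplicates()  # remove duplicated records
      .dropna()           # remove records with missing values
      .assign(status=lambda d: d["status"].str.upper())  # transform to a uniform format
)
print(clean)
```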
Data validation is described as the process of ensuring that data has undergone cleansing
so that it is both correct and useful. It is intended to provide a guarantee of the fitness
and consistency of data in the dataset. Failure or omission in data validation can lead
to data corruption. Catching data errors early in the dataset is important, as it helps
debug the root cause and roll back to a working state [17]. Moreover, it is important
to rely on mechanisms specific to data validation rather
than on the detection of second-order effects.
Errors are bound to happen during the data collection process, as data is seldom
100% correct. Data validation helps to minimize erroneous data in the dataset, and
data validation rules help organizations follow standards that make it efficient to
work with data. Duplicate data, however, provides challenges to many organizations.
Factors that cause data duplication include data entry from production machines and
operators capturing data. An organization needs a powerful matching solution to
overcome this challenge of duplicated records and ensure clean and usable data.
Data validation checks the accuracy and data quality of source data, usually
performed before processing the data. It can be seen as a form of data cleansing. Data
validation ensures that the data is complete (no blanks or empty values), unique
(values are not repeated), and that value ranges are consistent with expectations.
When moving and merging data, it is important to ensure that data from different
sources and repositories conform to organizational rules and do not become corrupted
due to inconsistencies in type or context. Data validation is a general term and can be
performed on any data, including data within a single application, such as Microsoft
Excel, or when merging simple data within a single data store.
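A small sketch of such completeness, uniqueness, and range checks on an invented table is shown below; the column names and rules are assumptions for illustration only.

```python
import pandas as pd

df = pd.DataFrame({
    "sensor_id": ["S1", "S2", "S2", "S4"],
    "reading":   [0.8, 1.2, None, 9.7],
})

# Completeness: no blanks or empty values.
complete = df["reading"].notna().all()
# Uniqueness: identifier values are not repeated.
unique = df["sensor_id"].is_unique
# Consistency: values fall inside the expected range.
in_range = df["reading"].dropna().between(0.0, 5.0).all()

print(complete, unique, in_range)  # all False for this messy sample
```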
The data validation process is a significant aspect of filtering the large dataset
and improving the overall process’s efficiency. However, every technique or pro-
cess consists of benefits and challenges; therefore, it is crucial to have a complete
understanding of both. Data handling can be easier if analysts and users adopt this
technique with the appropriate process; data validation can then provide the best outcome
possible for data. Data validation can be broken down into the following categories:
data completeness and data consistency.
Data completeness refers to the wholeness of the data: for data to be valid and truly
complete, there should not be any gaps or missing information. Occasionally incomplete
data is unusable, but it is often used despite the missing information, leading to costly
errors and miscalculations.
Incomplete data is usually the result of unsuccessful data collection. Completeness
denotes the degree to which all required data are available in the dataset [18]; a simple
measure of it is the percentage of missing data entries. However, the true goal of data
completeness is not to have perfectly complete data, but to ensure that the data essential
to the purpose is valid. Therefore, it is a necessary component of the data quality
framework and is closely related to validity and accuracy.
Data preparation is the process of cleaning and transforming raw data before
processing and analysis for future use. It is an important step before processing and
often involves reformatting data, correcting data, and combining data sets to enrich
data [20]. Its task is to blend, shape, clean, and consolidate data into one file or data table
to get it ready for analytics or other organizational purposes.
The data must be clean, formatted, and transformed into something digestible by
data mining software to achieve the final preparation stage. These actual processes
include a wide range of steps, such as consolidating or separating fields and columns,
changing formats, deleting unnecessary or junk data, and making corrections to data.
In this literature review, several studies have applied data preparation and data
mining to messy datasets for future use, but few studies address the quality data
check. This is the gap addressed in this paper, which aims to review the available data mining
preparation methods for messy data. The data preparation framework needs to
meet data quality criteria, using quality dimensions that include accuracy, completeness,
timeliness, and consistency [21]. The quality data check is crucial because it automates
data inspection and provides information about the number of valid, missing, and mismatched
values in each column. The result shows the data quality above each column in the
dataset. Data mining software helps remove errors and inconsistencies in the
dataset to meet the quality data check percentage [22].
For quality data checks on the dataset, it may be better to use a transformation. These
quality data checks can create data quality rules that persist in checking columnar
data against defined rules. Performing a variety of checks while transforming data
automatically shows the effect of the transformations on the overall quality of data.
This can provide various services for the organization; only with high-quality data
can the organization achieve top service [13].
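The kind of per-column quality report described above can be sketched roughly as follows; the table and the mismatch rule are invented for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "machine_id": ["M1", "M2", None, "M4"],
    "speed": ["120", "135", "fast", "128"],  # "fast" cannot be parsed as a number
})

# Per-column counts of valid and missing entries.
quality = pd.DataFrame({"valid": df.notna().sum(), "missing": df.isna().sum()})

# Mismatched values in a numeric column: entries that fail to parse as numbers.
mismatched_speed = pd.to_numeric(df["speed"], errors="coerce").isna().sum() - df["speed"].isna().sum()

print(quality)
print("mismatched speed values:", mismatched_speed)
```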
4. Methodology
history of making several sheets of steel at a high rate. It increases the data in the
dataset, not only proper data but also messy or dirty data. The company was selected
due to its nature of producing a high number of products. Therefore, it was suitable
for this research, which is dealing with data.
This section describes the findings, and the overall discussion presents the datasets
with data cleansing preparation. Three datasets were obtained from the data
repository, and Table 1 presents the Excel-file datasets before the data mining
tool was used. The machine data file contains 30,000 records, with 10% missing
values and 7 duplicate records. The alarm data file contains 45,000 records, with
25% missing values and 28 duplicated records. Finally, the sensor data file contains
100,000 records, with 45% missing values and 100 duplicated records. The files were
stored in Microsoft Excel format.
Table 1.
Raw data.
Table 2.
Data mining uses.
A data mining tool was used to perform the analysis. Table 2 shows the
importance of using data mining to remove errors and inconsistencies from records.
After the data mining tool was applied (Table 2), the machine data records decreased from
30,000 to 26,993, with no missing values or duplicated records remaining. The alarm
data records decreased from 45,000 to 33,722, with no missing values or duplicated
records. The sensor data records decreased from 100,000 to 54,900, with no missing
values or duplicated records.
The missing-values measure indicates how efficient the tool is at finding missing values
in a file. Other features assessed were whether the tool could detect duplication, illegal
values, and misspellings, merge records, the file formats supported, and ease of use.
5.1 Discussion
This paper aims to investigate data cleansing in big data. Based on the available
data cleansing methods discussed in the previous section, data cleansing for big
data needs to be improvised and improved to cope with the massive amount of data.
The traditional data cleaning method is important for developing the data cleaning
framework for big data applications. In the review of Potter, this method only focused
on solving data transformation challenges [13]. The Excel spreadsheet struggles with
problems such as duplicate record detection, and the user needs other approaches to deal
with them [27].
Data mining can require manual and automatic procedures, but this approach
focuses on duplication and missing elimination despite various data quality challenges
in the dataset. Traditional data cleansing tools tend to solve only one data quality
problem throughout the process and require human intervention to resolve data
cleansing conflicts. In the big data era, the traditional data cleansing process is no
longer acceptable as data needs to be cleansed and analyzed fast. The data is growing
more complex as it may include structured data, semi-structured data, and unstruc-
tured data. The discussed methods focus only on structured data. However, existing
methods have some limitations when working with dirty data. Data mining performs
the computations of each stage as “local” in each Excel spreadsheet, and the data
exchange is done at the stage boundaries by broadcast or hash partitioning.
6.1 Recommendation
6.2 Conclusion
Acknowledgements
First, I would like to thank God for His blessing in completing this paper. My highest gratitude goes to my mentor for guiding me throughout this paper; her patience was something I admired.
Thanks also to the seen and unseen hands that have given me direct and indirect help to finish this paper. Finally, thanks to my family, who kept encouraging me through difficult times, even when it was not fashionable to do so.
Declarations
I, Mawande Sikibi, hereby declare that this paper is wholly my work and has not
been submitted anywhere else for academic credit either by myself or another person.
Author details
Mawande Sikibi
University of Johannesburg, South Africa
© 2022 The Author(s). Licensee IntechOpen. This chapter is distributed under the terms of
the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0),
which permits unrestricted use, distribution, and reproduction in any medium, provided
the original work is properly cited.
References
Abstract
Data mining techniques provide benefits in many areas such as medicine, sports, marketing, and signal processing, as well as data and network security. However, although data mining techniques are used in security applications such as intrusion detection, biometric authentication, and fraud and malware classification, "privacy" has become a serious problem, especially in data mining applications that involve the collection and sharing of personal data. For these reasons, the problem of protecting privacy in the context of data mining differs from traditional data privacy protection, as data mining can act as both friend and foe. This chapter covers previously developed privacy preserving data mining techniques in two parts: (i) techniques proposed for the input data that will be subject to data mining and (ii) techniques suggested for the processed data (the output of the data mining algorithms). It also presents attacks against the privacy of data mining applications. The chapter concludes with a discussion of next-generation privacy-preserving data mining applications at both the individual and organizational levels.
Keywords: privacy preserving data mining, data privacy, PPDM methods, privacy attacks, anonymization
1. Introduction
Especially since the 2019 pandemic, business and education are increasingly conducted electronically over the internet, and fast, voluminous data sharing takes place under the undeniable influence of social media; unfortunately, technology works against privacy. The rapid spread of data mining techniques in areas such as medicine, sports, marketing, and signal processing has also increased interest in privacy. The important point here is to define the boundaries of the concept of privacy and to provide a clear definition. Individuals describe privacy with the phrase "keep information about me from being available to others". However, when these personal data are used in a study that is considered well intentioned, individuals are not disturbed and do not feel that their privacy is violated [1]. What is missed here is the difficulty of preventing abuse once the information is released.
Personal data is information that relates to an identified or identifiable individual. This concept has two components: the data pertain to a person, and that person can be identified. Personal data is a concept that belongs to the "ego" and covers a wide range, from names to preferences, feelings, and thoughts. An identifiable person is someone who can be identified directly or indirectly, in particular by reference to an identification number or to one or more factors specific to their
physical, physiological, mental, economic, cultural or social identity. For this reason, the loss of the individual's control over these data brings about the loss of the individual's freedom, autonomy, and privacy, in short, of the property of being oneself. The main way to ensure the use of these data without harming the privacy of individuals is to remove the identifiability of the person.
Data analysis methods, including data mining, commodify data and turn it into economic value. Ethical debates aside, it is an undeniable fact that the digital environment increases the risk of losing control of all information about one's intellectual, emotional, and situational state, in short, of losing autonomy and having the sphere of informational privacy violated. The main dilemma here is between the freedom of information flow provided by technology, with the interests and benefits that the information source provides, and the control power required by the concept of being an individual [2].
In addition, governments make legal regulations aimed at protecting personal data, covering for what purpose (historical, statistical, commercial, scientific) data is used, how it is collected, and how it should be stored. For example, the US HIPAA rules aim to protect individually identifiable health information. This is a subset of health information, including demographic information collected from an individual [3]. In Directive 95/46/EC [4], the European Parliament and the Council allow the use of personal data (i) if the data subject has explicitly given permission, or (ii) if processing is needed for a result requested by the individual. This also applies to corporate privacy issues. Privacy concerns bring corporate privacy concerns with them. However, corporate privacy and individual privacy issues are not much different from each other. The disclosure of information about an organization can be considered a potential privacy breach. In this case, generalizing to the disclosure of information about a subset of the data covers both views.
The point to note here is that, while focusing on the disclosure of data subjects, the secrets of the organizations providing the data should also be taken into account. For example, consider an academic study in which data mining is carried out on student data from more than one university. Although the methods used protect the privacy of the students, certain information that is specific to a university, and that it wants to keep confidential, may be revealed. Although the personal data owned by the organizations are secured by contracts and legal regulations, information about a subset of the combined data set may reveal the identity of the data subject. The organization that owns a data set should take part in a distributed data mining process only as long as it can prevent the disclosure of the data subjects it provides and of its own trade secrets.
In the literature, solutions that take data privacy into account have been proposed for data mining. A solution that ensures that no individual data is exposed can still publish information that describes the collection as a whole. This type of collective information is often the very purpose of data mining; however, since some results can still be identifying, various data hiding and suppression techniques have been developed to ensure that the data are not individually identified.
The concept of privacy can be examined under three headings: physical, mental-communicative, and data privacy [5]. The main subject of this study is data privacy.
Data privacy can be defined as the protection of real persons, institutions, and organizations (data subjects) that need to be protected in accordance with the law and ethical rules during the data life cycle (collecting data, processing and analyzing data, publishing and sharing data, preserving data, and re-using data) [6]. In this process, important requirements of data privacy are that the purpose for which the data will be processed, with whom it will be shared, and where it will be transferred can be controlled by the data subject at a transparent and controllable level. On the other hand, there is no exact definition of privacy; the definition can be made specific to the application.
Data controllers, who need to take privacy precautions in order to prevent data breaches, are assumed to be reliable and to have legal obligations; they store and use the data collected by digital applications using appropriate methods and share them, anonymized when necessary. Collected data are classified into four groups [7]:
• Sensitive attributes (SA): data that are private and sensitive to individuals, such as sickness and salary.
• Insensitive attributes: general, non-risky data that are not covered by the other attributes.
It is not sufficient to measure privacy with a single metric, because different definitions can be made for different applications and multiple parameters must be evaluated. The metrics proposed for PPDM [8, 9] can be examined as privacy level metrics and data quality metrics, depending on which aspect of privacy is measured. These metrics can be measured in two subgroups, evaluating the level of privacy/data quality on the input data (data criteria) and on the data mining results (result criteria). How secure the data is against disclosure is measured by the privacy level metrics [10]:
Bounded knowledge: The purpose here is to restrict the data with certain rules and prevent the disclosure of information that should remain confidential. The data can be transformed into limited data by adding noise or by generalization.
Need to know: With this metric, keeping unnecessary data out of the system prevents privacy problems that would otherwise arise. It also ensures access control (access reason and access authorization) over the data.
Protected from disclosure: In order to protect confidential data that may emerge as a result of data mining, some operations (such as checking the queries) can be performed on the results to provide privacy. Using a classification method to prevent the disclosure of data is one of the effective methods [11].
Data quality metrics: These quantify the loss of information/utility; the complexity criteria that measure the efficiency and scalability of different techniques are also evaluated within this scope.
Privacy-preserving data mining (PPDM) techniques have been developed to allow the extraction of information from data sets while preventing the disclosure of data subjects' identities or sensitive information. In addition, PPDM allows more than one researcher to collaborate on a dataset [11, 12]. PPDM can also be defined as performing data mining on data sets obtained from databases containing sensitive and confidential information in a multilateral environment, without disclosing the data of each party to the other parties [13].
In order to protect privacy in data mining, statistical and cryptography-based approaches have been proposed. The vast majority of these approaches operate on the original data to protect privacy. This leads to the natural trade-off between data quality and privacy level.
PPDM methods are being studied to perform effective data mining while guaranteeing a certain level of privacy. Several different taxonomies have been proposed for these methods. In the literature, they are classified either based on the data life cycle stages (data collection, data publishing, data distribution, and output of data mining) [10] or based on the method used (anonymization based, perturbation based, randomization based, condensation based, and cryptography based) [14].
In this study, PPDM approaches are examined with a simple taxonomy: methods applied to the input data that are subject to data mining and methods applied to the processed data (output information).
This section includes the methods suggested for the collecting, cleaning, integration, selection, and transformation phases of the input data that will be subject to data mining.
Although it varies according to the application used and the degree of trust in the institution collecting the data, it is recommended that the original values not be stored and that they be used only in the conversion process, in order to prevent disclosure of privacy. For example, data collected with sensors, which are now widely used with the internet of things, can be transformed at the collection stage, randomizing the obtained values and transforming the raw data before they are used in data mining. In this section, data perturbation, randomization, suppression, data swapping, anonymity, cryptography, and differential privacy methods are discussed.
data are preserved. For example, let A be the original data distribution and B a publicly known noise distribution independent of A; the result of randomizing A with B is C (C = A + B). Then A may be reconstructed as A = C − B. However, this reconstruction may not be successful if B has a large variance and the sample size of C is not large enough. As a solution, approaches that apply the Bayes [21] or EM [22] formula can be used. While the randomization method limits data usage to the distribution of C, it requires a lot of noise to hide outliers, because in this approach outliers are more vulnerable to attacks than values in denser regions of the data. Although this reduces the usefulness of the data for mining purposes, it may be necessary to add a large amount of noise to all records, resulting in information loss, in order to prevent it [7].
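A minimal sketch of this additive randomization idea is given below, assuming Gaussian noise with known parameters; it only illustrates how aggregate properties of A can be approximately recovered from the perturbed data C, not the Bayes/EM reconstruction mentioned above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Original (private) values A and a publicly known noise distribution B.
A = rng.normal(loc=50_000, scale=10_000, size=5_000)   # e.g. salaries
noise_std = 20_000                                      # known standard deviation of B
B = rng.normal(loc=0.0, scale=noise_std, size=A.size)

# Published randomized data C = A + B.
C = A + B

# Aggregate reconstruction: since B has zero mean and known variance,
# the mean and variance of A can be estimated from C alone.
est_mean = C.mean()
est_var = C.var() - noise_std ** 2

print(f"true mean {A.mean():.0f}, estimated mean {est_mean:.0f}")
print(f"true std  {A.std():.0f}, estimated std  {np.sqrt(max(est_var, 0)):.0f}")
```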
Randomly generated values can be added to the original data with an additive or a multiplicative method [23]. The aim is to ensure that the noise added to individual records for privacy cannot be separated out again. The multiplicative noise method is more effective than the additive noise method because it makes the original values more difficult to predict.
With the microaggregation method, all records in the data set are first arranged in a meaningful order, and the whole set is then divided into a certain number of subsets. For a specified attribute, the average of that attribute is computed within each subset, and the attribute value of every record in the subset is replaced with this average. Thus, the average value of that attribute over the entire data set does not change.
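The following is a minimal sketch of microaggregation on a single numeric attribute, assuming a fixed group size k; the data values are made up for illustration.

```python
import numpy as np

def microaggregate(values, k=3):
    """Replace each value with the mean of its size-k group after sorting."""
    values = np.asarray(values, dtype=float)
    order = np.argsort(values)                 # arrange records in a meaningful order
    result = np.empty_like(values)
    for start in range(0, len(values), k):
        group = order[start:start + k]         # consecutive subset of size k (last may be smaller)
        result[group] = values[group].mean()   # substitute the subset average
    return result

ages = [23, 25, 24, 41, 39, 40, 60, 62, 61]
print(microaggregate(ages, k=3))                           # each group of 3 shares one averaged value
print(np.mean(ages), np.mean(microaggregate(ages, k=3)))   # the overall mean is preserved
```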
Since data perturbation approaches have a negative impact on data utility and are
not resistant to attacks, they are often not preferred in utility-based data models.
2.1.2 Suppression
With this technique, private data can be easily exposed in the system as a result of data exchanges; for this reason, it is recommended to use it only in safe environments. It can be used in conjunction with other methods, such as k-anonymity, without violating privacy definitions.
2.1.4 Cryptography
Cryptography is a technique that converts plain text into cipher text using various encryption algorithms, encoding messages so that they cannot be read. It is a method of storing and transmitting data in a specific form so that only the intended persons can read and process it.
In data mining applications, cryptography-based techniques are used to protect privacy during data collection and data storage [25, 28], and they guarantee a very high level of data privacy [23]. Encryption is generally costly in terms of time and computational complexity. Hence, as the volume of data increases, the time needed to process encrypted data increases and creates a potential barrier to real-time analysis [29].
Secure multiparty computation (SMC) is a special encryption protocol in which, when there is more than one participating party, the interested parties learn nothing but the result [30, 31]. The SMC calculation must be done carefully so that it does not reveal sensitive data; note, however, that the computed result itself may still enable the parties to estimate the value of sensitive data.
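As an illustration of the SMC idea, the sketch below implements a classic additive secret-sharing "secure sum": each party splits its private value into random shares so that only the total is learned. This is a toy sketch of the general principle, not the protocol of any specific PPDM system cited here.

```python
import secrets

MODULUS = 2 ** 61 - 1  # arithmetic is done modulo a large prime

def make_shares(value, n_parties):
    """Split a private value into n additive shares that sum to it modulo MODULUS."""
    shares = [secrets.randbelow(MODULUS) for _ in range(n_parties - 1)]
    shares.append((value - sum(shares)) % MODULUS)
    return shares

# Three parties with private values (e.g. local record counts).
private_values = [120, 75, 230]
n = len(private_values)

# Each party distributes one share to every party (including itself).
all_shares = [make_shares(v, n) for v in private_values]

# Each party sums the shares it received; no single share reveals anything.
partial_sums = [sum(all_shares[p][i] for p in range(n)) % MODULUS for i in range(n)]

# Combining the partial sums reveals only the global total.
print(sum(partial_sums) % MODULUS)  # 425
```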
Many privacy transformations aim to create groups of anonymous records that are transformed in a group-specific manner. A number of techniques have been proposed for group anonymity in different studies, such as the k-anonymity, l-diversity, and t-closeness methods. A comparison of the group anonymity methods is given in Table 1.
2.1.5.1 k-anonymity
Table 1.
Group based anonymity methods.
To reduce the level of detail of the data representation, some attributes can be replaced with more general values (generalization), some data points can be eliminated, or descriptive data can be deleted (suppression). However, while k-anonymity provides protection against attacks that disclose identities, it does not protect against attacks that disclose attributes. It is also more convenient to apply to individual data than to use it directly to restrict privacy-preserving data mining results. Moreover, k-anonymity cannot fully protect the privacy of users when the sensitive values within a group are homogeneous. Providing optimum k-anonymity is a problem in the NP-hard class, and approximate solutions have been proposed to avoid the computational difficulties [33].
In the literature, different approaches derived from k-anonymity, such as k-neighborhood anonymity, k-degree anonymity, k-automorphism anonymity, k-candidate anonymity, and l-grouping, have been proposed according to the structural features of the data.
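A minimal sketch of checking k-anonymity over quasi-identifiers with pandas is shown below; the column names and records are invented for illustration.

```python
import pandas as pd

def is_k_anonymous(df, quasi_identifiers, k):
    """True if every combination of quasi-identifier values occurs at least k times."""
    return df.groupby(quasi_identifiers).size().min() >= k

records = pd.DataFrame({
    "zip":     ["4761*", "4761*", "4761*", "4790*", "4790*", "4790*"],
    "age":     ["20-29", "20-29", "20-29", "30-39", "30-39", "30-39"],
    "disease": ["flu", "cancer", "flu", "asthma", "flu", "cancer"],
})

# The generalized quasi-identifiers (zip, age) form groups of size 3, so k=3 holds.
print(is_k_anonymous(records, ["zip", "age"], k=3))   # True
print(is_k_anonymous(records, ["zip", "age"], k=4))   # False
```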
2.1.5.2 l-diversity
2.1.5.3 t-closeness
The outputs of data mining algorithms can disclose information without open
access to the original data set. Sensitive information can be accessed through studies
on the results. For this reason, data mining output must also protect privacy.
This method is examined as query inference control and query auditing. In the
query inference control, the input data or the output of the query is controlled.
In query auditing, the queries made on the outputs obtained by data mining are audited. If an audited query enables the disclosure of confidential data, the query request is denied. Although this limits data mining, it plays an active role in ensuring privacy. Query auditing can be done online or offline. In offline auditing, since the queries and query results are already known, it is evaluated whether the results violate privacy. In online auditing, since the queries are not known in advance, privacy checks are carried out simultaneously with the execution of the query. This method is examined within the scope of statistical database security.
Association rule mining is one of the most frequently used methods in data mining to reveal the nature of interesting associations between binary variables. During data mining, some rules may explicitly disclose private information about the data subject (individual or group). Unnecessary and information-leaking rules may occur in some relationships. The aim of the association rule hiding technique, first proposed by Atallah [38], is to protect privacy by hiding all sensitive rules. The weakness of this technique is that a significant number of non-sensitive rules can be hidden incorrectly [39].
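The sketch below illustrates the basic idea behind support-based rule hiding, assuming a toy transaction list and a single sensitive rule: it simply removes the consequent item from transactions that support the rule until the rule's support falls below a chosen threshold. It is a simplification of the heuristics in the literature, not Atallah's exact algorithm.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item of the itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def hide_rule(transactions, antecedent, consequent, min_support):
    """Drop the consequent from supporting transactions until support(A ∪ B) < min_support."""
    rule_items = set(antecedent) | set(consequent)
    for t in transactions:
        if support(transactions, rule_items) < min_support:
            break
        if rule_items <= t:                     # this transaction supports the sensitive rule
            t.difference_update(consequent)     # sanitize it by dropping the consequent item(s)
    return transactions

# Toy transactions; the sensitive rule is {bread} -> {beer}.
db = [set(t) for t in (["bread", "beer"], ["bread", "beer", "milk"],
                       ["bread", "milk"], ["beer"], ["bread", "beer"])]
print(support(db, {"bread", "beer"}))            # 0.6 before hiding
hide_rule(db, {"bread"}, {"beer"}, min_support=0.4)
print(support(db, {"bread", "beer"}))            # below 0.4 after hiding
```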
In this section, the common types of attacks that led to the development of the methods given above and that lead to privacy violations are summarized [6].
One such attack exploits the intuitive similarity of sensitive attribute values within anonymous groups. In this case, it is not sufficient for the sensitive attribute values merely to be different from each other in order to protect privacy [40]. This attack can be prevented by calculating the similarities of sensitive attributes within the same anonymous group and by providing solutions that place similar sensitive attribute values in different groups.
In cases where all or most of the sensitive attributes in the groups of an anonymous table are similar, the privacy of the data owners is at risk of violation. In order to prevent homogeneity attacks, it is necessary to prevent similar sensitive attributes from falling into the same group in the anonymous table, or to produce heterogeneous records by diluting the homogeneous attributes with a record duplication approach [34].
It has been shown, with theoretical and experimental methods, that inferences about privacy can be made using exchangeability concepts and de Finetti's theorem [42]. The fact that attackers do not need extensive background knowledge makes this attack attractive. An attacker can carry out an attack using machine-learning techniques on the non-sensitive attributes in the dataset.
The fact that information about which data anonymization algorithm is used in a data mining application is public is also considered a privacy vulnerability [43]. Prevention is based on the principle that changes to the data should remain at a minimum level in anonymization processes and that the data should not be overly anonymized.
Publicly releasing new versions of previously published generalized data over time enables this attack. For this reason, previously published tables should be taken into account, and new records that may cause data disclosure should not be shared [44].
4. Discussion
The fact that the digitalization process has become mandatory all over the world with the Covid-19 pandemic has accelerated data flow. It has become even more important to collect the necessary data, analyze it correctly, and derive reliable information. This situation has triggered the use of data mining methods to increase productivity and provide high-quality products and services in almost all sectors. While applying data mining methods, it is obvious that if privacy is not taken into consideration during the data life cycle, irreversible damage will occur for individuals, institutions, and organizations.
In order to increase the adoption and benefits of data mining technology, before applying PPDM techniques, "privacy" should be defined precisely, measurement metrics should be determined, and the results obtained should be evaluated with these metrics. For this reason, this study primarily focused on the definition of privacy. The term privacy is quite broad and has no standard definition, which makes measuring privacy quite challenging. Some measurement metrics are mentioned in this chapter, but metrics are usually determined per application. The lack of a standard privacy measurement metric also makes it challenging to compare and evaluate the developed PPDM techniques.
In the age of digital and online business, privacy protection needs to happen at both the individual and organizational levels. Privacy protection at the individual level depends on the person, who is influenced by religious beliefs, community norms, and culture. For this reason, the concept of personalized privacy, which allows individuals to have a certain level of control over their data, has been proposed. However, it has been observed that there are difficulties in implementing personalized privacy, as people assume that compromising their privacy for applications they consider well-intentioned will cause no damage. Therefore, in the context of personalized privacy, new solutions are required for the trade-off between privacy and utility.
To effectively protect data privacy at the organizational level [7], policy makers in organizations should support privacy-enhancing technical architectures and models to securely collect, analyze, and share data. Laws, regulations, and fundamental principles regarding privacy should be analyzed by organizations. It is necessary for organizations to include the data owners in their assessment of privacy and security practices. Data owners should be involved in the whole process of what data is collected, how it is analyzed, and for what purpose it is used. In addition, they should have the right to correct personal data in order to avoid the negative consequences of incorrect data. Organizations should employ data privacy analysts, data security scientists, and data privacy architects who can develop data mining applications securely.
From a technical point of view, methods that protect confidentiality in data
analytics are still in their infancy. Although studies continue by different scientific
communities such as cryptography, database management and data mining, an
interdisciplinary study should be conducted on PPDM. For example, the difficulties
encountered in this process should also be addressed from a legal perspective. Thus,
a better roadmap for next-generation privacy-preserving data mining design can be
developed by academic researchers and industrial practitioners.
5. Conclusion
Businesses and even governments collect data through the many digital platforms (social media, e-health, e-commerce, entertainment, e-government, etc.) they use to serve their customers and citizens. The collected data can be sensitive, and these data can be stored, analyzed and, in all probability, anonymized and shared with others. In studies where data is used at any stage of the life cycle, regardless of the purpose, it is necessary to state the privacy permission and the reason why the data should be accessed. Privacy Preserving Data Mining (PPDM) techniques are being developed to allow information to be extracted from data without disclosing sensitive information.
There is no single optimal PPDM technique for every stage of the data lifecycle. The PPDM technique to be applied varies according to the application requirements, such as the desired privacy level, data size and volume, tolerable information loss level, transaction complexity, etc., because different application areas have different rules, assumptions, and requirements regarding privacy.
In this chapter, previously proposed PPDM techniques were examined in two sections. The first section includes the methods suggested for the collecting, cleaning, integration, selection, and transformation phases of the input data that will be subject to data mining, and the second section covers the methods applied to the processed data. Finally, attacks against the privacy of data mining applications were presented.
Author details
© 2021 The Author(s). Licensee IntechOpen. This chapter is distributed under the terms of
the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0),
which permits unrestricted use, distribution, and reproduction in any medium, provided
the original work is properly cited.
References
[1] Clifton C, Kantarcioglu M, Vaidya J. Defining privacy for data mining. In: National Science Foundation Workshop on Next Generation Data Mining. 2002; Vol. 1, No. 26, p. 1.

[2] İzgi M. C. The concept of privacy in the context of personal health data. Türkiye Biyoetik Dergisi, 2014; (S 1), 1.

[3] Centers for Disease Control and Prevention. HIPAA privacy rule and public health. Guidance from CDC and the US Department of Health and Human Services. MMWR: Morbidity and Mortality Weekly Report, 2003; 52(Suppl 1), 1-17.

[4] Directive 95/46/EC of the European Parliament and of the Council on the protection of individuals with regard to the processing of personal data and on the free movement of such data. Official Journal L, 1995; 281(23/11), 0031-0050.

[5] Belsey A, Chadwick R (Eds.). Ethical Issues in Journalism and the Media. Routledge. 2002.

[6] Vural Y. Veri Mahremiyeti: Saldırılar, Korunma Ve Yeni Bir Çözüm Önerisi. Uluslararası Bilgi Güvenliği Mühendisliği Dergisi, 4(2), 21-34.

[7] Pramanik M. I, Lau R. Y, Hossain M. S, Rahoman M. M, Debnath S. K, Rashed M. G, Uddin M. Z. Privacy preserving big data analytics: A critical analysis of state-of-the-art. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 2021; 11(1), e1387.

[8] Bertino E, Lin D, Jiang W. A survey of quantification of privacy preserving data mining algorithms. In: Privacy-Preserving Data Mining. New York, NY, USA: Springer, 2008, pp. 183-205.

[9] Dua S, Du X. Data Mining and Machine Learning in Cybersecurity. Boca Raton, FL, USA: CRC Press, 2011.

[10] Mendes R, Vilela J. P. Privacy-preserving data mining: methods, metrics, and applications. IEEE Access, 2017; 5, 10562-10582.

[11] Vaidya J, Clifton C. Privacy-preserving data mining: Why, how, and when. IEEE Security & Privacy, 2004; 2(6), 19-27.

[12] Nayak G, Devi S. A survey on privacy preserving data mining: approaches and techniques. International Journal of Engineering Science and Technology, 2011; 3(3), 2127-2133.

[13] Lindell Y, Pinkas B. Privacy Preserving Data Mining. In: Proceedings of the 20th Annual International Cryptology Conference, 2000; California, USA, 36-53.

[14] Rathod S, Patel D. Survey on Privacy Preserving Data Mining Techniques. International Journal of Engineering Research & Technology (IJERT), 2020; Vol. 9, Issue 06.

[15] Hong T. P, Yang K. T, Lin C. W, Wang S. L. Evolutionary privacy-preserving data mining. In: Proceedings of the World Automation Congress 2010; pp. 1-7. IEEE.

[16] Qi X, Zong M. An overview of privacy preserving data mining. Procedia Environmental Sciences, 2011; 12, 1341-1347.

[17] Muralidhar K, Sarathy R. A theoretical basis for perturbation
Abstract
1. Introduction
Traditional supervised learning deals with the analysis of single-label data, which means that each sample is associated with a single label. However, in many real-world data mining applications, such as text classification [1, 2], scene classification [3, 4], crowd sensing/mining [5–11], and gene functional classification [12, 13], samples are associated with more than one label. From this description, we can see that the challenge of the multilabel classification task lies in its large space of potential outputs.
Basically, multilabel learning algorithms can be categorized into two groups. 1) Problem transformation methods: these take the multilabel problem and convert it into single-label problems that can easily be handled by any classifier, exploiting the relationship between labels. 2) Adapted algorithm methods: these perform multilabel classification directly rather than transforming the problem into different subsets of problems, and most of them use the Euclidean distance between samples.
The main idea of this paper is to aggregate similar samples to obtain better results.
To aggregate similar samples, we use the properties of graph neural networks (GNNs)
[14]. The main contributions of this study are as follows:
The rest of this paper is arranged as follows. Section 2 shows the taxonomy of
multilabel learning algorithms and describes their methods. Section 3 presents the
details of our proposed method. Section 4 describes the multilabel datasets, evaluation
metrics and experimental results, followed by the conclusions in Section 5.
2. Related work
In this section, we review multilabel learning algorithms. The algorithms that have
been applied to multilabel learning over the last decade are not just those mentioned in
this paper. Figure 1 summarizes the algorithms detailed in the next section.
Figure 1.
Taxonomy of multilabel learning algorithms [15].
hence cannot be parallelized. Calibrated label ranking (CAL) performs ranking via pairwise comparison of labels and has the advantage of considering the (albeit only pairwise) relationship between labels. The label powerset (LP) method treats each combination of labels occurring in a sample as a new label; it has the advantage of considering the relationship between labels, but its time complexity grows exponentially with the label sets. Random k-labelsets (RKL) are variants of LP in which each classifier is trained with a small random set of labels; they consider the relationship between labels, but they suffer from a low accuracy rate if a poor label set combination is randomly selected.
The multilabel k-nearest neighbor (MLkNN) method is derived from the tradi-
tional k-nearest neighbor algorithm. Each sample is identified with k nearest neigh-
bors in the training set, and information is obtained from these identified neighbors.
Multilabel support vector machine (ML-SVM) classification determines an optimal
hyperplane that separates observations according to their labels. A multilabel decision
tree (ML-DT) is constructed by building a decision tree, where each node corresponds
to a set of samples in the data set.
GNNs were first proposed and further elaborated in [16]. The goal of a GNN is to learn a node's representation by acquiring information through propagation. Currently, there are many deep learning tasks that need to process data
with graph structures. Convolutional neural networks (CNNs) [17] have been suc-
cessfully developed in the field of computer vision [18, 19] but are unable to process
graph structured data [20]. The method used in this paper is called a graph
convolutional network (GCN). A GCN can aggregate similar samples by propagating
neighbor information, giving it the ability to infer, and there is no need to consider the
sequence. GCNs have appeared in many top machine learning conferences and many
applications across different tasks and domains, such as manifold learning [21, 22],
computer vision [23–25], text classification [26, 27], hashing [28, 29], and
hyperspectral image classification [30, 31].
This section presents the overall flow of our proposed method, as shown in
Figure 2. The multilabel data matrix is first converted into a similarity matrix gener-
ated from a Laplacian graph. We call this a multilabel-based Laplacian graph and use
this graph as inputs to the GCN model. Each node in the output layer predicts the
probability of class membership for the label.
This section presents the proposed method. Before this, let us describe some
notational conventions. Matrices are written in boldface capital letters (e.g., X). The
transpose of a matrix is denoted as $\mathbf{X}^{\top}$. Vectors are written in boldface lowercase letters (e.g., x). For a matrix $\mathbf{X} \in \mathbb{R}^{n \times m}$, the $j$-th column and the $ij$-th entry are denoted by $\mathbf{x}_j$ and $x_{ij}$, respectively. $\mathbf{I}$ denotes the identity matrix, $\|\cdot\|_2$ is the $\ell_2$-norm, and $\mathbf{1}$ denotes a column vector with all elements equal to one.

Figure 2.
An illustration of the workflow of the proposed method. Green color represents the training model; blue color represents the test model.
Based on [32], we formally present our multilabel-based Laplacian graph. For a multilabel dataset, let $\mathbf{X} = [\mathbf{x}_1, \cdots, \mathbf{x}_n] \in \mathbb{R}^{n \times m}$ be the data matrix, with $n$ and $m$ representing the number of samples and the dimensions, respectively. $\mathbf{S} \in \mathbb{R}^{n \times n}$ is the multilabel-based Laplacian graph, and we use a sparse representation method to construct this graph as follows:

$$\min_{\mathbf{S}} \sum_{i,j=1}^{n} \left\| \mathbf{x}_i - \mathbf{x}_j \right\|_2^2 S_{ij} + \beta \sum_{i=1}^{n} \left\| \mathbf{s}_i \right\|_2^2 \quad \text{s.t.} \ \forall i:\ S_{ii} = 0,\ S_{ij} \geq 0,\ \mathbf{1}^{\top} \mathbf{s}_i = 1. \tag{1}$$
Based on [34], we fit the GCN used for single-label classification to multilabel classification. The GCN has been modified from a first-order Chebyshev approximation [35]. The ChebNet convolution of an input vector $\mathbf{x}$ with a filter $g_{\theta}$ is formulated as follows:

$$\mathbf{x} \star g_{\theta} = \theta_0 \mathbf{x} - \theta_1 \mathbf{D}^{-\frac{1}{2}} \mathbf{A} \mathbf{D}^{-\frac{1}{2}} \mathbf{x}, \tag{2}$$

where $\star$ denotes the convolution operator, $\mathbf{A}$ is the adjacency matrix and $\mathbf{D}$ is the degree matrix. By using the single parameter $\theta = \theta_0 = -\theta_1$ to avoid overfitting, Eq. (2) can be rewritten as:

$$\mathbf{x} \star g_{\theta} = \theta \left( \mathbf{I}_n + \mathbf{D}^{-\frac{1}{2}} \mathbf{A} \mathbf{D}^{-\frac{1}{2}} \right) \mathbf{x}. \tag{3}$$
Repeated use of this graph convolution operation may cause serious problems such as vanishing gradients. Therefore, $\mathbf{I}_n + \mathbf{D}^{-\frac{1}{2}} \mathbf{A} \mathbf{D}^{-\frac{1}{2}}$ in Eq. (3) is modified to $\tilde{\mathbf{D}}^{-\frac{1}{2}} \tilde{\mathbf{A}} \tilde{\mathbf{D}}^{-\frac{1}{2}}$, with $\tilde{\mathbf{A}} = \mathbf{A} + \mathbf{I}_n$ and $\tilde{D}_{ii} = \sum_j \tilde{A}_{ij}$, finally giving a layerwise propagation rule that supports multidimensional inputs as follows:

$$\mathbf{H}^{(l+1)} = \sigma\left( \tilde{\mathbf{D}}^{-\frac{1}{2}} \tilde{\mathbf{A}} \tilde{\mathbf{D}}^{-\frac{1}{2}} \mathbf{H}^{(l)} \mathbf{W}^{(l)} \right). \tag{4}$$

Here, $\mathbf{H}^{(l)}$ is the output of the activation function in the $l$-th layer of the GCN, $\mathbf{W}^{(l)}$ is a trainable weight matrix corresponding to the $l$-th layer of the GCN, and $\mathbf{H}^{(0)}$ is the data matrix. $\sigma(\cdot)$ denotes a specific activation function, such as the sigmoid activation function.

This paper considers only a two-layer GCN model as the proposed method, and we modify Eq. (4) by replacing the adjacency matrix with the multilabel-based Laplacian graph to obtain the formula of the two-layer GCN method proposed in this paper as follows:

$$\mathbf{H}^{(1)} = \sigma\left( \hat{\mathbf{D}}^{-\frac{1}{2}} \hat{\mathbf{S}} \hat{\mathbf{D}}^{-\frac{1}{2}} \mathbf{H}^{(0)} \mathbf{W}^{(0)} \right), \qquad \mathbf{H}^{(2)} = \sigma\left( \hat{\mathbf{D}}^{-\frac{1}{2}} \hat{\mathbf{S}} \hat{\mathbf{D}}^{-\frac{1}{2}} \mathbf{H}^{(1)} \mathbf{W}^{(1)} \right), \tag{5}$$

where $\hat{\mathbf{S}} = \mathbf{S} + \mathbf{I}_n$ and $\hat{D}_{ii} = \sum_j \hat{S}_{ij}$. For semi-supervised multilabel classification, we evaluate the mean square error over all labeled samples:

$$\text{Mean Square Error} = \frac{1}{t} \sum_{i=1}^{t} \left( \mathbf{H}^{(2)}_i - \mathbf{Y}_i \right)^2, \tag{6}$$

where $\mathbf{Y} \in [0, 1]^{n \times c}$ is the ground-truth label matrix with $c$ labelsets, and $t$ is the number of labeled samples.
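A minimal NumPy sketch of the two-layer propagation in Eq. (5) is given below, using a sigmoid activation and a randomly generated similarity graph S; the weights and data are random placeholders rather than the trained model used in the experiments.

```python
import numpy as np

rng = np.random.default_rng(0)

def normalize(S):
    """Compute D^{-1/2} (S + I) D^{-1/2} as used in Eq. (5)."""
    S_hat = S + np.eye(S.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(S_hat.sum(axis=1))
    return S_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

n, m, hidden, c = 6, 4, 8, 3           # samples, features, hidden units, labels
X = rng.random((n, m))                  # H^(0): data matrix
S = rng.random((n, n)); S = (S + S.T) / 2; np.fill_diagonal(S, 0)  # symmetric similarity graph
W0 = rng.normal(size=(m, hidden))       # trainable weights W^(0)
W1 = rng.normal(size=(hidden, c))       # trainable weights W^(1)

S_norm = normalize(S)
H1 = sigmoid(S_norm @ X @ W0)           # first layer of Eq. (5)
H2 = sigmoid(S_norm @ H1 @ W1)          # second layer: per-label membership probabilities
print(H2.shape)                         # (6, 3)
```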
4. Experiments
4.1 Datasets
The multilabel datasets used in this paper and their associated statistics are shown
in Table 1.
In this study, we have added probabilistic classifier chains [36], CSMLC [37] and
RethinkNet [38] as baselines for comparison. The experimental settings are as follows:
First, the multilabel datasets are preprocessed to [0,1] as inputs; 80% of the samples are used for training the models (both the multilabel learning baselines and the proposed method), and the remaining 20% of the samples are used as test sets. We also add Gaussian noise ranging from 6% to 12% to each test sample to test the robustness of the models. The overall framework is shown in Figure 2.
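A small sketch of this split-and-perturb protocol is shown below, assuming min-max scaling and noise expressed as a fraction of each feature's scale; it is a plausible reading of the setup, not the authors' exact code.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)

X = rng.random((500, 20))                       # placeholder feature matrix
Y = (rng.random((500, 5)) > 0.7).astype(float)  # placeholder multilabel targets

X = MinMaxScaler().fit_transform(X)             # preprocess features to [0, 1]
X_tr, X_te, Y_tr, Y_te = train_test_split(X, Y, test_size=0.2, random_state=0)

noise_level = 0.06                               # 6% Gaussian noise on the test samples
X_te_noisy = X_te + rng.normal(scale=noise_level, size=X_te.shape)
X_te_noisy = np.clip(X_te_noisy, 0.0, 1.0)
```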
Table 1.
Statistics of the multilabel datasets.
For deep learning, we train all models for 200 epochs using Adam [39] with a
learning rate of 0.01 and the mean square error as the loss function.
In multilabel learning, the evaluation metrics must be more rigorous than in traditional single-label learning because one sample may be associated with multiple labels. These evaluation metrics [15] are divided into three groups, as shown in Figure 3. The higher the values of the F1 score, precision, mean average precision, and recall, the better the performance. The lower the values of the Hamming loss, one-error, coverage, and ranking loss, the better the performance. We consider the Hamming loss, one-error, and mean average precision as the three major metrics.
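For reference, a minimal sketch of the Hamming loss, one of the major metrics named above, is given here with a made-up prediction matrix.

```python
import numpy as np

def hamming_loss(Y_true, Y_pred):
    """Fraction of label positions that are predicted incorrectly (lower is better)."""
    return np.mean(Y_true != Y_pred)

Y_true = np.array([[1, 0, 1],
                   [0, 1, 0]])
Y_pred = np.array([[1, 1, 1],
                   [0, 1, 0]])
print(hamming_loss(Y_true, Y_pred))   # 1 wrong position out of 6 -> 0.1666...
```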
All experiments use different combinations of training and test data to verify the
trained model and average the results after repeating the training ten times. According
to the observations in Figures 4–6, the following conclusions are reached:
Figure 3.
Taxonomy of evaluation metrics.
Figure 4.
Results of the proposed method compared with multilabel learning algorithms on the used multilabel datasets.
(a)–(c) show the results without adding Gaussian noise.
Figure 5.
Results of the proposed method compared with multilabel learning algorithms on the used multilabel datasets.
(a)–(c) show the results of adding 6% Gaussian noise.
Figure 6.
Results of the proposed method compared with multilabel learning algorithms on the used multilabel datasets.
(a)–(c) show the results of adding 12% Gaussian noise.
• Regardless of whether Gaussian noise is added to the data set, the classification results of the problem transformation methods (BR, CCs, CAL, LP and RKL) are almost always worse than those of the adapted algorithms (MLkNN, ML-SVM and ML-DT).
• We found that our method improved the Hamming loss and the mean average precision by 1.8% and 8% on average, respectively, and it maintains excellent performance even when the dataset is contaminated by noise.
5. Conclusions
Acknowledgements
This work is supported in part by the Data Science Lab, NSYSU, and in part by the
Pervasive Artificial Intelligence Research (PAIR) Lab, Taiwan, under the grant Nos.
110-2634-F-008-004 and 110-2221-E-110-046.
Author details
© 2021 The Author(s). Licensee IntechOpen. This chapter is distributed under the terms of
the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0),
which permits unrestricted use, distribution, and reproduction in any medium, provided
the original work is properly cited.
References
[29] R. Xu, C. Li, J. Yan, C. Deng, and X. Liu, "Graph convolutional network hashing for cross-modal retrieval," in Proc. International Joint Conference on Artificial Intelligence, Macao, China, 2019, Aug. 10–16, pp. 982–988.

[30] D. Hong, L. Gao, J. Yao, B. Zhang, A. Plaza, and J. Chanussot, "Graph convolutional networks for hyperspectral image classification," IEEE Transactions on Geoscience and Remote Sensing, pp. 1–13, Aug. 2020.

[31] A. Qin, Z. Shang, J. Tian, Y. Wang, T. Zhang, and Y. Y. Tang, "Spectral–spatial graph convolutional networks for semisupervised hyperspectral image classification," IEEE Geoscience and Remote Sensing Letters, vol. 16, no. 2, pp. 241–245, Feb. 2019.

[32] H. Wang, Y. Yang, and B. Liu, "GMC: Graph-based multi-view clustering," IEEE Transactions on Knowledge and Data Engineering, vol. 32, no. 6, pp. 1116–1129, Jun. 2019.

[36] K. Dembczyński, W. Cheng, and E. Hüllermeier, "Bayes optimal multilabel classification via probabilistic classifier chains," in Proc. International Conference on Machine Learning, Haifa, Israel, 2010, Jun. 21–24, pp. 279–286.

[37] K.-H. Huang and H.-T. Lin, "Cost-sensitive label embedding for multi-label classification," Machine Learning, vol. 106, no. 9–10, pp. 1725–1746, Oct. 2017.

[38] Y.-Y. Yang, Y.-A. Lin, H.-M. Chu, and H.-T. Lin, "Deep learning with a rethinking structure for multi-label classification," in Proc. Asian Conference on Machine Learning, Nagoya, Japan, 2019, Nov. 17–19, pp. 125–140.

[39] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," in Proc. International Conference for Learning Representations, San Diego, California, United States, 2015, May 07–09.
Abstract
1. Introduction
deployment techniques to stay low under the radar for a long time [2]. Finally, the gathered sensitive information is pushed in small chunks to its external control and command (C2C) servers using clever exfiltration techniques.
The APT life cycle is broadly divided into seven phases, as shown in Figure 1 [3]. In the Reconnaissance phase, the attacker chooses the target network, studies the internal network structure, and comes up with the necessary strategy (TTPs) to bypass the initial layer of defence. Reconnaissance is followed by the Initial compromise phase, where attackers exploit open vulnerabilities to get an initial foothold in the targeted network. After that, the attackers try to replicate and propagate to other machines and establish backdoors to pull more sophisticated payloads in the Establishing foothold phase. Later, in the Lateral movement phase, attackers escalate privileges to perform more sophisticated tasks and hide their traces. In this phase, attackers traverse from one network to another in search of sensitive information. After collecting the necessary data, the attackers strategically centralise the collected data on staging servers. In the Data exfiltration phase, attackers use custom encoding and encryption mechanisms to push the collected data to external control and command servers. Finally, to preserve the anonymity of the process, the attacker leaves no traces by clearing the tracks and creates a backdoor to revisit that particular organisation in the future.
APT has grown to become a global tool for cyber warfare between countries. The Carbanak APT campaign infected thousands of victims worldwide and caused nearly $1 billion in damage across the globe [4]. APT actors carried out a variety of actions in this operation, including opening fraudulent accounts and employing bogus services to obtain funds, as well as sending money to cybercriminals via the SWIFT (Society for Worldwide Interbank Financial Telecommunication) network. Similarly, in 2018, the Big Bang APT deployed a much more robust and sophisticated multi-stage malware targeting the Palestinian Authority [5]. This APT malware includes several modules
Figure 1.
APT life cycle phases.
that perform tasks ranging from obtaining a file list, capturing screenshots, rebooting the machine, and retrieving system information, to self-deletion. More recently, a supply chain attack on SolarWinds by a Russian APT group was considered one of the most sophisticated attacks. The RefreshInternals() method in the SolarWinds attack depicts the maturity of these state-sponsored APT groups in terms of malware design and payload delivery [6].
In order to deal with these kinds of state-sponsored targeted attacks, security experts consider APT attribution and detection as two key pillars. Attribution is an analysis process that explains "who" is behind a particular cyber espionage operation and "why" they have done it [7]. This process gives insights about particular APT threat actors and their targeted areas as well. Based on this preliminary information, the security community tries to detect these attacks by fixing issues at different levels of an organisation. Since APT attribution and detection have become crucial for many security firms and government agencies, and both processes require massive data pre-processing and analysis, researchers have proposed different data mining and machine learning techniques for both attribution and detection. In this paper, we discuss various data mining and machine learning techniques in both detection and attribution of APT malware. In addition, we compare different detection techniques and highlight research gaps among those techniques which need to be addressed by the security community to combat this sophisticated APT malware.
This paper is organised as follows. Section 1 details the APT overview and the phases of APT, followed by the need for data mining and ML techniques in both attribution and detection of APT malware. Section 2 describes the process of attribution and the different techniques proposed to perform APT attribution. Section 3 discusses various state-of-the-art data mining and ML techniques proposed by the research community for APT detection. Section 4 details the research gap analysis, followed by the conclusion and future scope.
APT attribution is an analysis process that reveals the identity of the threat actors and their motive through a series of steps [8]. First, security firms collect data from different victim organisations by performing forensic analysis on the respective networks and collecting different Indicators of Compromise (IOC). In general, attackers repeat this pattern in several other organisations as well. Security firms observe and analyse these repeated patterns in IOCs and TTPs together, and cluster these combinations as intrusion sets. Performing data analytics on these intrusion sets over a period of time will eventually reveal the threat actor and the motivation behind the attack, as depicted in Figure 2.
Figure 2.
Overview of APT attribution process.
2.1 DeepAPT: APT attribution using deep neural network and transfer learning
Figure 3.
2-dimensional visualisation of APT families using t-SNE algorithm [9].
Most APT attribution techniques rely heavily on analysing the malware samples used in a particular campaign. The key disadvantage of this strategy is that the same malware samples can be used in several operations. In some situations, APT groups buy malware from the dark web based on their requirements. Therefore, ML models constructed by considering only malware samples may not give efficient results in terms of APT attribution. In order to address this issue, Lior Perry et al. proposed a method named NO-DOUBT, i.e. Novel Document Based Attribution, by constructing models on threat intelligence reports with the help of Natural Language Processing (NLP) techniques [10]. In this research, the authors collected 249 threat intelligence reports of 12 different APT actors and treated APT attack attribution as a multi-class text classification problem. The proposed model consists of two main phases, as shown in Figure 4. In the training phase, labelled reports and word embeddings transform the input data into a vector representation. For generating this vector representation, the authors propose the SMOBI (Smoothed Binary Vector) algorithm, which finds cosine similarities between input words in the labelled data sets and the word embeddings to form a large n × m matrix. This vector representation and the labels are given to an ensemble xGBoost classifier to construct a known-actor model. In the deployment phase, new (unlabelled) test reports are also converted to the vector representation and given to the known-actor model to determine the probability predictions for the known classes. These probability predictions are given to a New Actor Model (a binary classifier that outputs whether the report belongs to a known APT actor or a new, unknown actor) to make the final prediction. Although this model struggles to detect the Deep Panda and APT29 actors, SMOBI-based APT attribution outperforms previous text-based APT attribution models (unigrams + bigrams and tf-idf) in terms of Accuracy, Precision and Recall.
Figure 4.
NO-DOUBT method for APT attribution [10].
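A generic sketch of text-based attribution is shown below, using TF-IDF features with a gradient-boosting classifier in the spirit of the baselines mentioned above rather than the SMOBI algorithm itself; the report snippets and actor labels are invented for illustration.

```python
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

# Toy threat-intelligence snippets with hypothetical actor labels.
reports = [
    "spearphishing lure delivers banking trojan via swift transfers",
    "fraudulent accounts and swift network abuse observed in campaign",
    "custom implant exfiltrates documents from government networks",
    "multi stage dropper targets government entities with backdoor",
]
actors = ["ActorA", "ActorA", "ActorB", "ActorB"]

model = make_pipeline(TfidfVectorizer(), GradientBoostingClassifier(random_state=0))
model.fit(reports, actors)
print(model.predict(["backdoor exfiltrates documents from a ministry network"]))
```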
Figure 5.
Cyber threat attribution framework [11].

Most APT attribution processes depend on manual analysis in victim networks and the collection of low-level indicators of compromise (forensic analysis at firewalls, tracebacks, IDS and honeypots). However, APT actors change these low-level IOCs from one organisation to another, so ML models built on this low-level IOC result in inadequate cyber intelligence systems. On the other hand, collecting high-level IOCs for each organisation is time-consuming. Such high-level IOCs are published in the form of Cyber Threat Intelligence (CTI) reports across organisations as a common practice. In 2019, Umara Noor et al. proposed a distributional semantics technique from NLP to build a cyber threat attribution framework by extracting patterns from CTI reports [11]. The proposed attribution framework is broadly divided into three phases, as depicted in Figure 5. In this experiment, the authors used a customised search engine to collect 327 unstructured CTI documents corresponding to 36 APT actors as part of the data collection phase. The CTI documents do not contain the exact keywords described in the standard taxonomy, due to varying textual definitions and choices for communicating a concept. Rather than using a simple
keyword-based search, the authors developed a semantic search method based on the statistical distributional semantic relevance technique (Latent Semantic Analysis, LSA) to retrieve relevant documents. The input CTI records are indexed using LSA. The statistically derived conceptual indices (from the LSA indexer) are searched for semantically relevant topics using the high-level IOC labels specified in MITRE ATT&CK [11]. Based on cosine similarity, the CTA-TTP correlation matrix is constructed in the CTI analytics phase. ML models are built on top of the CTA-TTP correlation matrix in the cyber threat attribution phase. Among the various classifiers, the Deep Neural Network turned out to be the best performer, with 94% attribution accuracy on test data and high precision and recall values.
Behavioural analysis of APT malware gives better insights for both APT attribution and detection. Based on this motivation, Weijie Han et al. proposed that dynamic system call information reveals the behavioural characteristics of APT malware [12]. Furthermore, the authors built an ontology model to understand the in-depth relation between the maliciousness of APT malware and its families, as depicted in Figure 6. The APTMalInsight framework mainly consists of two modules, i.e. an APT malware family classification module and a detection module. The basic concept behind the APTMalInsight framework is to profile the behavioural characteristics of APT malware. It obtains dynamic system call information from the programs to reliably detect APT malware and attribute it to the respective families. First, APT malware samples are executed to extract dynamic API calls. After extracting the API calls, the authors calculated the feature importance of each API call and built a feature vector by selecting the top N API calls from the API call sequence. ML models built on top of that feature vector output the APT attribution class for test data, as shown in Figure 7. For the experiment, the authors considered a total of 864 APT malware samples belonging to five different families. As per the experimental results, Random Forest turned out to be the best model in terms of Accuracy (98%), Precision and Recall for APT malware family attribution.
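A minimal sketch of this style of pipeline is shown below: binary presence features over a fixed vocabulary of API calls feed a Random Forest classifier. The API names, family labels, and the simple presence encoding are illustrative assumptions, not the exact feature-importance-based selection used in APTMalInsight.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical dynamic API-call traces and family labels.
traces = [
    "CreateFileW WriteFile RegSetValueExW InternetOpenA",
    "CreateFileW RegSetValueExW InternetOpenA HttpSendRequestA",
    "VirtualAlloc WriteProcessMemory CreateRemoteThread",
    "VirtualAlloc CreateRemoteThread NtQuerySystemInformation",
]
families = ["FamilyA", "FamilyA", "FamilyB", "FamilyB"]

# Binary feature vector: which API calls appear in each trace.
vectorizer = CountVectorizer(binary=True, lowercase=False, token_pattern=r"\S+")
X = vectorizer.fit_transform(traces)

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, families)
new_trace = ["VirtualAlloc WriteProcessMemory CreateRemoteThread InternetOpenA"]
print(clf.predict(vectorizer.transform(new_trace)))   # predicted APT family
```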
2.5 ATOMIC: FireEye’s framework for large scale clustering and associating APT
threat actors
Security firms like FireEye investigate many victim networks, collect IOCs, and group them together as uncategorised ("UNC") intrusion sets. Over time, these UNC sets are increasing rapidly, and security firms need to either merge them with other APT groups or assign a new group name based on manual analysis. FireEye security
Figure 6.
APT malware ontology model [12].
Figure 7.
High-level overview of APTMalInsight framework [12].
Figure 8.
Cosine similarity between different un-attributed APT groups [13].
between two different APT groups. Based on this idea, FireEye automated the whole
process of APT attribution and merging different uncategorised groups.
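A rough sketch of the underlying computation, cosine similarity between feature profiles of two intrusion sets, is shown below; the feature names and counts are invented, and the real ATOMIC system operates on far richer document-style features.

```python
import numpy as np

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical feature counts (e.g. malware families, C2C infrastructure, TTP IDs)
# observed for two uncategorised intrusion sets.
features = ["malware_x", "malware_y", "t1566_phishing", "t1059_powershell", "asn_1234"]
unc_a = np.array([5, 0, 3, 2, 1], dtype=float)
unc_b = np.array([4, 1, 2, 2, 0], dtype=float)

sim = cosine_similarity(unc_a, unc_b)
print(f"similarity = {sim:.2f}")
# A similarity above some analyst-chosen threshold would suggest merging the two sets.
if sim > 0.9:
    print("candidate for merging into a single APT group")
```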
Most APT families stay undetected for a long period and use intelligent ways to damage vulnerable hosts. When traditional malware executes, most of the events occur sequentially and leave some traces behind. These traces help modern intelligent systems like SIEM, IDS and IPS to prevent such attacks. However, APTs clean their attack traces and also avoid sequential execution of events. APTs also employ anti-VM and anti-debugging techniques to make things harder for detection systems. The difficulty of detecting APTs has drawn the attention of cybersecurity researchers to this domain. Some of the important contributions in this research area are mentioned below. A detailed comparison of the different detection techniques is presented in Table 1.
Reference | Technique | Data source | Description
[14] | RNN-LSTM and GHSOM | Network traffic flow | Deep learning stack with sequential neural networks to detect APT.
[15] | Provenance graph mining | Host audit data (Linux audit or Windows ETW) | Suspicious information flows are identified using the MITRE ATT&CK framework.
[16] | Directed graph mining and one-class SVM | SIEM event logs | Extracting attack vectors from SIEM logs.
[18] | RNN-LSTM | SIEM event logs | Identify possible event codes and their sequence to detect an APT attack in real time.
[19] | Ensemble classifier | Network traffic flow | Separate threat detection sub-module for APT life cycle phases.
[20] | Multifractal-based error minimization | Network traffic flow | Multifractal analysis to extract the hidden information of TCP connections.
[21] | Correlation analysis | Multiple data sources | Construction of an attack pyramid using multiple planes to detect APT.
[22] | J48 classifier | API log data | API calls to track process injection and privilege escalation activities.

Table 1.
Comparison of different APT detection methods.
Tero et al. [14] proposed a theoretical approach for detecting APTs by developing a stack of deep learning methods where each layer has a particular task in handling APT events. The authors consider the network payload and packet header information as features, and they fed the input to the detection stack without any data filtering mechanism. The detection stack is designed sequentially. The initial layers, i.e. layer-1 and layer-2, are used to detect known attacks and legitimate network traffic in the data flow, respectively. Layer-3 of the detection stack is employed to identify outliers that have a historical presence. It uses Recurrent Neural Network-Long Short Term Memory (RNN-LSTM) units to confirm whether an outlier has occurred historically. Layer-4 classifies the outliers into four categories, i.e. regular traffic, known attack, predicted attack and unknown outlier, using an anomaly detection method named Growing Hierarchical Self-Organising Map (GHSOM). The stack's final layer maps the anomalies (i.e. the interconnections between the outlier events) using a Graph Database (GDB). The proposed stack model is highly modular and was designed to perform dynamic detection of APT events with decent detection accuracy. However, this detection system is complex in design and results in higher time complexity when dealing with massive data inputs.
The HOLMES model of APT detection is strongly based on the principles of the APT kill chain model. The cyber kill chain model gives a high-level overview of the sequence of events in successful APT espionage, i.e. reconnaissance, command and control communication, privilege escalation, lateral movement, data exfiltration, and trace removal. In the initial step, audit data from various operating systems are converted to a common data representation format and passed as input to the proposed model. Lower-level information flows, such as process, file, memory object and network information, are extracted from the audit data. The core part of the proposed model is to map the lower-level information flows to the phases of the APT kill chain by constructing an intermediate layer. The intermediate layer is responsible for identifying the various TTPs (Tactics, Techniques, Procedures) in the low-level information flow that correlate with the respective phase of the APT life cycle. The authors considered around 200 TTP patterns based on the MITRE ATT&CK framework [15]. The TTP patterns and a noise filtering mechanism are employed in constructing a High-Level Scenario Graph (HSG), from which the APT attack can be detected with decent accuracy.
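A toy illustration of the intermediate-layer idea (not the HOLMES implementation) is sketched below: low-level audit-flow events are matched against a small, assumed rule table of TTPs, and a host whose events cover several kill-chain phases is flagged. The event fields, rules, and threshold are invented for demonstration.

```python
# Illustrative sketch only: map low-level audit events to assumed TTP / kill-chain phase labels.
TTP_RULES = {
    ("process", "spawned_by_browser"): "initial_compromise",
    ("network", "beacon_to_rare_domain"): "command_and_control",
    ("process", "token_elevation"): "privilege_escalation",
    ("file", "bulk_read_then_upload"): "data_exfiltration",
}

def kill_chain_phases(events):
    """events: iterable of (entity_type, behaviour) tuples extracted from audit data."""
    return {TTP_RULES[e] for e in events if e in TTP_RULES}

host_events = [("process", "spawned_by_browser"),
               ("process", "token_elevation"),
               ("file", "bulk_read_then_upload")]
phases = kill_chain_phases(host_events)
print(phases, "-> raise APT alert" if len(phases) >= 3 else "-> no alert")
```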
3.3 Anomaly detection in log data using graph databases and machine learning to
defend advanced persistent threats
Schindler et al. proposed an APT detection engine based on the principles of APT
kill chain phases [16]. In this work, SIEM logs were considered as the data source. Correlations are identified between the event logs and the phases of the APT kill chain. An adapted kill chain model is constructed to identify possible attack vectors from the SIEM event logs. This model is implemented at two different levels.
Level-1 deals with graph-based forensic analysis, where logs from different programs are aggregated based on timestamps to identify events within the network. A directed graph is constructed from the multiple layers of event sequences. Each event sequence reveals whether the event flow matches partial or full phases of the APT kill chain.
Level-2 helps in identifying various anomalous activities using a machine learning approach. An ML classifier is constructed alongside the graph model to make the detection more robust. The authors considered a one-class SVM as the classifier and used Windows logs, firewall logs, and file audit logs of benign system programs as its data source. This model is expected to identify all events that differ from the benign programs.
The proposed model achieved a decent accuracy score of 95.33% in detecting APT events. However, in the case of smart malware, where malicious programs mimic normal user behaviour, the proposed model tends to produce relatively many false positives.
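A minimal sketch of the level-2 idea, assuming generic numeric features extracted from benign program logs, could look as follows; it is not Schindler et al.'s implementation, and the data are placeholders.

```python
# Hedged sketch: one-class SVM trained only on benign log features; -1 marks anomalous events.
import numpy as np
from sklearn.svm import OneClassSVM
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(7)
benign_logs = rng.normal(0, 1, size=(500, 8))   # numeric features of benign program events (placeholder)
suspicious = rng.normal(4, 1, size=(5, 8))      # events deviating from the benign profile (placeholder)

model = make_pipeline(StandardScaler(), OneClassSVM(kernel="rbf", nu=0.05)).fit(benign_logs)
print(model.predict(suspicious))                # -1 = flagged as differing from benign programs
```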
3.4 A study on cyber threat prediction based on intrusion event for APT attack
detection
Yong-Ho Kim et al. [17] proposed a theoretical model for APT detection that considers intrusion detection system logs as the data source. From the IDS logs, correlation rules between various system events are identified to build an attack graph. Identifying the correlation between the intrusion detection logs helps in predicting future attacks. In the initial phase, intrusion detection logs are collected and the corresponding intrusion events are extracted. The extracted events are passed to different function blocks, each corresponding to a particular detection activity. One of the functional blocks identifies single-directional (host to C2C) and bi-directional (host to C2C, C2C to host) communication activities. Another block identifies repetitive intrusion events and combines them into a single event to optimise time and resource constraints. A correlation analysis block identifies the context of intrusion detection events and creates sequential rules based on the principles of 5W and 1H (When, Where, Why, Who, What and How). Finally, the prediction engine considers the attack scenario and tries to predict one or more events that can occur after a single intrusion event. This module uses data mining principles such as support and confidence to produce the best possible result. The time constraint is one of the practical problems with this model, as some of the functional blocks take a long time to process events. Another important aspect is that the rules of the intrusion detection system directly affect the outcome of this model.
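The support/confidence idea behind the prediction engine can be illustrated with a small sketch; the event names and sequences below are invented, and the rule form ("event A is followed by event B") is a simplification of the 5W1H rules described above.

```python
# Hedged sketch: estimate support and confidence for "A is followed by B" over event sequences.
from collections import Counter

sequences = [["scan", "exploit", "c2c_beacon"],
             ["scan", "exploit", "lateral_move"],
             ["scan", "c2c_beacon"]]

pair_counts, antecedent_counts = Counter(), Counter()
for seq in sequences:
    for a, b in zip(seq, seq[1:]):
        pair_counts[(a, b)] += 1
        antecedent_counts[a] += 1

total_transitions = sum(pair_counts.values())
for (a, b), c in pair_counts.items():
    support = c / total_transitions              # how common the transition is overall
    confidence = c / antecedent_counts[a]        # how often A is actually followed by B
    print(f"{a} -> {b}: support={support:.2f}, confidence={confidence:.2f}")
```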
3.5 APT detection using long short term memory neural networks
Charan et al. [18] proposed an APT detection engine that takes SIEM event logs as input and uses LSTM neural networks to detect successful APT espionage. The authors consider Splunk SIEM logs as the data source and stream the data to the Hadoop framework to process and obtain the event codes for every activity. Based on the APT life cycle phases, the authors listed the possible event codes and the sequences of them that lead to successful APT espionage. The core part of this work is to identify event codes occurring in a sequence, and this process requires memorising the previous state event codes. So, in the proposed model, LSTM (a variant of RNN) is considered a suitable choice for modelling these event-code sequences.
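A hedged sketch of such a sequence model is shown below, assuming integer-encoded SIEM event codes grouped into fixed-length windows; the vocabulary size, window length, layer sizes, and data are placeholders rather than the authors' configuration.

```python
# Hedged sketch: LSTM over windows of integer event codes, labelled 1 when the window matches an APT phase chain.
import numpy as np
import tensorflow as tf

vocab_size, seq_len = 500, 20                    # assumed number of distinct event codes / window size
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, 32),   # learn a dense representation per event code
    tf.keras.layers.LSTM(64),                    # memorise the previous-state event codes
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

X = np.random.randint(0, vocab_size, size=(1000, seq_len))   # placeholder event-code windows
y = np.random.randint(0, 2, size=(1000,))                    # placeholder labels
model.fit(X, y, epochs=3, batch_size=64, verbose=0)
```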
3.6 MLAPT: detection of APT attacks using machine learning and correlation
analysis
APT detection research mainly relies on the analysis of the malware payload used in the different phases of an APT attack. This kind of approach results in high false positives in the case of multi-stage malware deployment. In order to address this issue, Ghafir et al. proposed a model to detect multi-stage APT malware by using machine learning and correlation analysis (MLAPT) [19]. The MLAPT system is broadly divided into three modules, i.e. 1) a threat detection module, 2) an alert correlation module and 3) a prediction module. Initially, network traffic is passed to the threat detection module, in which the authors built several submodules to detect multi-stage attacks. The output alerts from the threat detection module are passed to the alert correlation module, which filters redundant alerts and clusters the remaining alerts based on a correlation time interval. The correlation indexing sub-module determines whether a given scenario is a full APT scenario or a sub-APT scenario based on the alert correlation score. The prediction module considers sub-APT scenarios and predicts their probability of becoming a full APT scenario. Based on that prediction, alerts are escalated to the network security team to stop the APT kill chain. The novelty of this research lies in the detection of APTs across all life cycle phases. In addition, the MLAPT system monitors and detects real-time APT attacks with a decent accuracy of 81%.
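The alert correlation step can be illustrated with a small sketch: duplicate alerts are filtered and the remaining alerts are grouped per host within an assumed correlation time window; a group spanning several phases is escalated. Field names, the window length, and the escalation threshold are assumptions, not values from [19].

```python
# Hedged sketch of alert de-duplication and time-window correlation.
from collections import defaultdict

WINDOW = 3600  # correlation time interval in seconds (assumed)
alerts = [
    {"host": "10.0.0.5", "t": 100,  "phase": "spear_phish"},
    {"host": "10.0.0.5", "t": 100,  "phase": "spear_phish"},      # redundant duplicate
    {"host": "10.0.0.5", "t": 900,  "phase": "c2c"},
    {"host": "10.0.0.5", "t": 2500, "phase": "exfiltration"},
]

unique = {(a["host"], a["t"], a["phase"]): a for a in alerts}.values()   # filter redundant alerts
clusters = defaultdict(list)
for a in sorted(unique, key=lambda a: a["t"]):
    clusters[(a["host"], a["t"] // WINDOW)].append(a["phase"])          # same host, same interval

for key, phases in clusters.items():
    label = "likely full APT scenario" if len(set(phases)) >= 3 else "sub-APT scenario"
    print(key, phases, "->", label)
```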
Detecting APT network patterns is a complex task, as APT traffic tries to mimic the behaviour of regular TCP traffic. APT malware opens and closes TCP connections to its C2C servers like any other regular legitimate connection, with minimal data transfer to stay low under the radar. Single-scale analysis does not capture the complexities of this kind of APT traffic and lowers the detection accuracy. Researchers found that current supervised ML models use Euclidean-based error minimization, which results in high false positives while detecting complex APT traffic. To address these issues, Sana Siddiqui et al. proposed an APT detection model using multi-fractal analysis to extract the hidden information of TCP connections [20]. Initially, the authors considered 30% of the dataset as labelled and computed prior correlation fractal dimension values for the normal and APT data points. Both of these computed values are loaded into memory before processing the remaining 70% unlabelled dataset. Each point in the remaining 70% of the dataset is added to both the normal and APT labelled datasets, and posterior fractal dimension values are calculated in the next step. The absolute difference between the prior and posterior values for both the regular and APT samples is calculated to determine the cluster closest to the data point. If fd_anom (the absolute difference between prior and posterior for the APT sample) ≤ fd_norm (the absolute difference between prior and posterior for the normal sample), then that data point is classified as an APT sample, and vice versa. As per the experimental observations, fractal dimension based ML models perform better in terms of accuracy (94.42%) than the Euclidean-based ML models.
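A hedged numeric sketch of this decision rule is given below. The correlation fractal dimension is estimated here from the slope of the log-log correlation integral, which is one common estimator but not necessarily the exact procedure of [20]; the features and radii are placeholders.

```python
# Hedged sketch: correlation fractal dimension via the correlation integral, plus the fd_anom <= fd_norm rule.
import numpy as np
from scipy.spatial.distance import pdist

def correlation_dimension(points, radii):
    """Slope of log C(r) vs log r, where C(r) is the fraction of point pairs closer than r."""
    d = pdist(points)
    c = np.array([max((d < r).mean(), 1e-12) for r in radii])
    slope, _ = np.polyfit(np.log(radii), np.log(c), 1)
    return slope

def classify(point, normal, apt, radii):
    fd_norm = abs(correlation_dimension(np.vstack([normal, point]), radii) -
                  correlation_dimension(normal, radii))
    fd_anom = abs(correlation_dimension(np.vstack([apt, point]), radii) -
                  correlation_dimension(apt, radii))
    return "APT" if fd_anom <= fd_norm else "normal"

rng = np.random.default_rng(0)
normal = rng.normal(0, 1, (200, 4))        # placeholder labelled TCP-flow features (30% labelled part)
apt = rng.normal(3, 0.5, (200, 4))
print(classify(rng.normal(3, 0.5, 4), normal, apt, radii=np.logspace(-1, 1, 10)))
```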
Paul Giura et al. proposed a conceptual framework known as the Attack Pyramid for APT detection [21]. In this approach, the goal of the attack (data exfiltration in most cases) is identified and placed at the top of the pyramid. Furthermore, the model identifies various planes, such as the user plane, application plane, network plane and physical plane, where the possibility of attack is maximised. With the proposed approach, one can identify the correlation between various events across different planes. In general, an APT attack spans multiple planes as the attack life cycle progresses, so it is possible to identify attack contexts that span multiple attack planes. Events from different sources, i.e. VPN logs, firewall logs, IDS logs, authentication logs and system event logs, are passed as the data source to the detection engine. From these logs, the context of the attack is identified using correlation rules. In the next step, suspicious activities are identified by matching the attack contexts against a signature database. This model requires updating the signatures at regular intervals to identify new attack contexts in real-time scenarios.
Chun-I Fan et al. [22] proposed a generalised way of detecting APTs using system call log data. The model was built on the principles of dynamic malware analysis, where API call (system call) events are passed through a detection engine. The novelty of this work lies in the approach to handling the API calls. Modern APT malware often creates child processes or injects code into a new process to evade detection. The authors created a program named "TraceHook" that monitors all code injection activities. TraceHook outputs the API counts for the executable samples (benign/malware), and a machine learning classifier is constructed on top of the obtained API count values. The proposed model monitors only six important DLLs and can be combined with other APT detection models to build a robust APT detection engine.
Identifying and stopping a particular life cycle event can break the full APT cycle and reduce the damage considerably. Based on this idea, researchers have proposed various methods to stop malicious C2C communication. Modern-day malware employs a new way to communicate with its C2C server with the help of Domain Generation Algorithms (DGA). A DGA creates a dynamic list of domain names in which a few domain names are active for a limited amount of time, so the malware communicates with a different C2C domain name for every successful communication. This practice helps smart malware to avoid detection by traditional antivirus, firewalls, and other network scanning software. Anand et al. [23] proposed a classification technique to detect character-based DGA, i.e. domain names constructed by concatenating characters in a pseudo-random manner, for example wqzdsqtuxsbht.com. In this method, the authors extracted various lexical features, such as n-grams, character frequencies, and statistical features, to build an ensemble classifier. The proposed model can detect character-based DGA domain names with a decent accuracy score of 97%. Charan et al. [24] proposed a similar technique to detect word-based DGA domain names, where domain names are constructed by concatenating two or three words from dictionaries, for example crossmentioncare.com. In their model, the authors consider lexical, statistical and network-based features to build an
ensemble classifier. A combination of the above two models can detect the C2C
communication activity with a decent accuracy.
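The lexical-feature approach can be sketched as follows; the feature set, the domain lists, and the ensemble composition are illustrative assumptions rather than the exact models of [23, 24].

```python
# Hedged sketch: simple lexical features (length, entropy, digit ratio, distinct bigrams) + a voting ensemble.
import math
from collections import Counter
import numpy as np
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, VotingClassifier

def features(domain):
    name = domain.split(".")[0]
    counts = Counter(name)
    entropy = -sum(c / len(name) * math.log2(c / len(name)) for c in counts.values())
    digits = sum(ch.isdigit() for ch in name) / len(name)
    return [len(name), entropy, digits, len(set(zip(name, name[1:])))]

legit = ["google.com", "wikipedia.org", "openai.com", "github.com"]       # placeholder benign domains
dga = ["wqzdsqtuxsbht.com", "qpxkzjvruyw.net", "zzqkfhwopd.biz", "xkt1q9vbzml.com"]
X = np.array([features(d) for d in legit + dga])
y = np.array([0] * len(legit) + [1] * len(dga))

ensemble = VotingClassifier([("rf", RandomForestClassifier(random_state=0)),
                             ("gb", GradientBoostingClassifier(random_state=0))]).fit(X, y)
print(ensemble.predict([features("crossmentioncare.com"), features("jslfqzmxv.com")]))
```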
Abstract
Instagram is one of the world's top ten most popular social networks. It is the most popular social networking platform in the United States, India, and Brazil, with over 1 billion monthly active users; each of these countries has more than 91 million Instagram users. The number of Instagram users reflects the variety of reasons and goals people have for using this social network. Social media marketing is one of those purposes, as Instagram gives businesses a place to market their products. By using text classification to categorize Instagram captions into organized groups, namely fashion, food & beverage, technology, health & beauty, and lifestyle & travel, this paper is expected to help people know the current trends on Instagram. The Support Vector Machine algorithm in this research is applied to 66,171 post captions to classify trends on Instagram. The TF-IDF (Term Frequency times Inverse Document Frequency) method and several train/test percentage splits were used in this study. The results indicate that using SVM with 70% of the dataset for training and 30% for testing produces a higher level of accuracy than the other splits.
1. Introduction
Currently, the internet and humans cannot be separated because of the large amount of information and knowledge available on the internet and its ability to facilitate access to various things. In addition to information disclosure, the internet is also used as a place to share experiences and hobbies through social media [1]. According to Wikipedia, social media is an online platform that allows individuals to easily join, share, and create content through social networks, wikis, forums, and blogs. Blogs, social networks, and wikis are the most common social media used by people worldwide. As of August 2017, Instagram was the sixth most popular social media platform, with 700 million members.
This social media platform, commonly called IG or Insta, is an image and video sharing application that allows users to upload photos and videos, apply digital
filters to photos and videos, and also share them on other social media [2]. Moreover, Instagram also has several other functions, namely:
2. Share recommendations
3. Online marketing
At first, social media was just a way for people to communicate with one another. As technology advances, social media allows people to express themselves as creators and thinkers, rather than just as observers, and these activities can easily be done using Instagram. Due to the increasingly massive use of social media, marketing through social media appears to be the best option for businesses to develop [3].
The caption in every Instagram post is one way to attract the audience's interest in buying the goods or services being traded [4]. Audiences can interact with or respond to the post. Observations show that posts get significantly different interactions depending on the content of the image and the caption. When an image is uploaded with a specific caption, especially one using a hashtag, the post can become a trend. The profile of a potential target market is related to interests through demographic, behavioral, and lifestyle segmentation. These things allow marketers to know who is paying attention to and interested in the trend. According to Shopify.co.id, there are several trending Instagram categories in 2020, namely Fashion, Food & Beverage, Technology, Health & Beauty, and Lifestyle & Travel.
We can conclude that the classification of Instagram captions plays a significant role in mapping the development of trends on the platform. By knowing the latest trends people favor, new business owners can promote their brand more easily. The trend in Instagram posts can be identified through text classification. Is the trend towards Fashion, Food & Beverage, Technology, Health & Beauty, or Lifestyle & Travel? We can find out by using the Support Vector Machine algorithm.
In Figure 1 below, the methodology used in this study is presented.
Figure 1.
Text classification.
There are two different types of datasets: training data (CSV files) and crawled JSON data used as the testing data. The training data contains 66,171 records, each consisting of a username, a caption, and a label. There are 1,894 Instagram captions obtained for testing. The retrieved caption data are processed to produce weights, which are used later during the Instagram caption classification process.
The data in Table 1 are then divided into five categories: Fashion, Food & Beverage, Technology, Health & Beauty, and Lifestyle & Travel. Table 2 shows the proportion of the amount of data in each category:
rajvegad055 | #viral #top #instatop #public #photography #editz #pose #models #look #attitude #style #bollywood #actorslife #hairstyle | 1
Table 1.
Instagram caption data.
1 | Fashion | 12,638
3 | Technology | 1,385
Table 2.
Proportion of the amount of data in each category/class.
Table 2 shows an imbalanced dataset, with a disproportionate ratio between classes. This disproportion is clearest between the Health & Beauty and Technology categories, which differ significantly in the amount of data. This imbalanced dataset will impact the prediction process for each class later: the model will tend to predict the majority class, while the minority class will be treated as noise or even ignored on some occasions. Due to that, the minority class may be misclassified more often than the majority class. In this research, the imbalanced dataset is handled by using an appropriate performance metric, the F1 score.
The text preprocessing step is the first part of text mining. In text mining, preprocessing is the act of transforming poorly formatted input into structured data that meets the demands of the process.
The preprocessing stage is presented in Figure 2. After collecting the data, the next process is text processing, which includes case folding, tokenizing, and cleaning.
Figure 2.
Text preprocessing.
Case folding is the process of converting the letters contained in the text into lowercase letters. Characters other than the letters of the A-Z alphabet are omitted. This process is carried out due to the inconsistent use of lowercase and uppercase letters in Instagram captions. Case folding aims to convert all Instagram caption data to a standard form, which usually uses lowercase letters [5]. Other characters that are not letters or numbers, such as punctuation and spaces, are treated as delimiters. The illustration is displayed in Figure 3.
Tokenizing is a process that divides the long character sequence of a text into single word units by recognizing the particular characters used as word separators [5]. Each word is separated from the others by the space character, so this tokenizing process relies on the space characters in the document to separate the words. The process is illustrated in Figure 4.
Filtering is a method that uses a stoplist (removing unnecessary words) or a wordlist (keeping crucial words) to extract the key words from the token results. Some English stopword examples are "the", "from", and "and". The purpose of stopword removal is to drop words that carry little information so that the essential words in a text remain in focus. Filtering is done by determining which terms will be used to represent a document, so that each document describes its own contents and differs from the others. This process is illustrated in Figure 5, and a small sketch of the whole preprocessing chain is given after Figure 5.
Figure 3.
Case folding process.
Figure 4.
Tokenizing process.
Figure 5.
Filtering (Stopword removal) process.
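A minimal sketch of this preprocessing chain (case folding, cleaning, tokenizing, and stopword filtering) is given below; the stopword list is a tiny placeholder rather than a complete English list.

```python
# Hedged sketch of the preprocessing steps described above.
import re

STOPWORDS = {"the", "from", "and", "a", "to", "of", "in"}   # placeholder stoplist

def preprocess(caption):
    folded = caption.lower()                          # case folding
    cleaned = re.sub(r"[^a-z\s]", " ", folded)        # drop characters outside a-z (delimiters)
    tokens = cleaned.split()                          # tokenizing on whitespace
    return [t for t in tokens if t not in STOPWORDS]  # filtering / stopword removal

print(preprocess("New OOTD from the runway!! #fashion #style https://t.co/x1"))
```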
2. TF-IDF
The next step after text processing is the TF-IDF method. At this stage, each word is assigned a weight based on how frequently it appears in the document [6]. The computation of Term Frequency (TF) and Inverse Document Frequency (IDF) is also included in this technique. The steps are as follows:
4. F1 score
The accuracy of the model on the test data is assessed with the F1 score, which is the harmonic mean of Precision and Recall, so both metrics are taken into account simultaneously [10]. Precision describes the degree of agreement between the required data and the model's predicted outputs [10]. Recall represents the percentage of success of a model in recovering the relevant information. The formula for the F1 score is as follows:
F1 Score = 2 * (precision * recall) / (precision + recall)   (5)
The F1 score calculation can be used as an evaluation standard from the predictive
classification result if there is a class imbalance in the data.
The following are the steps taken to conduct this research:
1. The data from this study are classified into several types. Each type is labeled as follows: Fashion, Food & Beverage, Technology, Health & Beauty, and Lifestyle & Travel.
2. Case folding and tokenizing are carried out in the data processing stage by applying them to every text used in the training or testing data. After that, a new document is obtained in order to proceed to the following step.
4. In the data split stage, ratios of 70:30, 60:40, 50:50, and 40:60 were used to split the data into training and testing sets.
5. This series ends with an evaluation stage in which the F1 score is used to
determine the prediction results on the training data.
This research begins by analyzing the prepared dataset to determine whether it has missing values, class imbalance, or other problems. The preprocessing stage then removes symbols, emoji, numbers, punctuation, and extra white space from the text. Filtering is also done to remove stop words. To identify the frequency of occurrence of each word in the documents, the data are then transformed into vector form, and the Term Frequency (TF) and Inverse Document Frequency (IDF) values for each token (word) are calculated. The cleaned and weighted data are then divided into training and testing groups with varying ratios. The Support Vector Machine method with the radial basis function (RBF) kernel is used to classify the Instagram caption data. The whole process ends with an evaluation of the algorithm's performance using the F1 score, which copes with the imbalanced data. The different splits of training and testing data are used to see whether the proportion of training and testing data affects the F1 score.
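The experiment can be reproduced in outline with a short sketch: TF-IDF weighting, a 70:30 split, an RBF-kernel SVM, and the weighted F1 score. The toy captions and labels below only stand in for the real 66,171-caption dataset.

```python
# Hedged end-to-end sketch of the classification experiment on placeholder captions.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import f1_score

captions = ["new dress and heels for the runway", "best burger and coffee in town",
            "unboxing the new smartphone chipset", "skincare routine for glowing skin",
            "weekend hiking trip to the mountains", "denim jacket street style lookbook"]
labels = ["fashion", "food", "technology", "beauty", "travel", "fashion"]

X = TfidfVectorizer().fit_transform(captions)                  # TF-IDF term weighting
X_tr, X_te, y_tr, y_te = train_test_split(X, labels, test_size=0.3, random_state=0)   # 70:30 split
clf = SVC(kernel="rbf").fit(X_tr, y_tr)                        # SVM with the RBF kernel
print("F1 (weighted):", f1_score(y_te, clf.predict(X_te), average="weighted"))
```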
The outcome of the analysis is as follows:
Table 3 shows the F1 scores obtained from the experiment. The F1 score is computed for distinct proportions of training and testing data from the resulting Recall and Precision values. The results show that a larger proportion of training data, compared to testing data, produces a higher F1 score than the other proportions.
Tables 4-7 show the detailed findings for the Precision, Recall, and F1 score in each category for the various proportions of training and testing data (70:30, 60:40, 50:50, and 40:60, respectively).
In Table 3 the classification results obtained with the Support Vector Machine algorithm are presented. The average F1 score is above 88%, and the largest F1 score is obtained with a training-to-testing proportion of 70:30. These results are obtained with the Radial Basis Function (RBF) kernel. This suggests that a larger amount of training data allows the model to produce better results. The F1 scores for each category with the different training and testing proportions are shown in Tables 4-7; the best results, especially in the Technology category, are obtained with the 70:30 split.
It might be interesting to split the data into 80 per cent training and 20 per cent testing and perform another experiment using that ratio. The result could give higher or lower accuracy compared with the previous experiments; however, based on the references, this depends on the method and algorithm used.
Table 3.
Comparison of precision, recall, and F 1 score for each training and testing proportion.
Table 4.
Proportion of data 70:30 for comparison of precision, recall, F 1 score.
Table 5.
Proportion of 60:40 for comparison of precision, recall, F1 score.
Table 6.
Proportion of 50:50 for comparison of precision, recall, F1 score.
Table 7.
Proportion of 40:60 for comparison of precision, recall, F1 score.
5. Conclusion
The conclusions that can be drawn from this research are as follows:
1. In this study, a very good F1 score, above 88%, was obtained using the Support Vector Machine (SVM) with the Radial Basis Function (RBF) kernel.
2. The performance of the SVM algorithm increased with the use of TF-IDF as the feature extraction method. The algorithm may react differently, i.e. not give the expected results, if there is untrained data in the data set. Data that has not been validated by experts is untrained data, and inaccuracies can sometimes result from improper labeling at the source.
Computing on Vertices
in Data Mining
Leon Bobrowski
Abstract
The main challenges in data mining are related to large, multi-dimensional data
sets. There is a need to develop algorithms that are precise and efficient enough to deal
with big data problems. The Simplex algorithm from linear programming can be seen as an example of a successful big data problem solving tool. According to the fundamental theorem of linear programming, the solution of the optimization problem can be found at one of the vertices in the parameter space. The basis exchange algorithms also search for the optimal solution among a finite number of vertices in the parameter space. Basis exchange algorithms enable the design of complex layers of classifiers or predictive models based on a small number of multivariate data vectors.
1. Introduction
Various data mining tools are proposed to extract patterns from data sets [1].
Large, multidimensional data sets impose high requirements as to the precision and
efficiency of calculations used to extract patterns (regularities) useful in practice [2].
In this context, there is still a need to develop new data mining algorithms [3]. New types of patterns are also obtained as a result of combining different types of classification or prognosis models [4].
The Simplex algorithm from linear programming is used as an effective big data
mining tool [5]. According to the basic theorem of linear programming, the solution to
the linear optimization problem with linear constraints can be found at one of the
vertices in the parameter space. Narrowing the search area to a finite number of
vertices is a source of the efficiency of the Simplex algorithm.
Basis exchange algorithms also look for an optimal solution among a finite number
of vertices in the parameter space [6]. The basis exchange algorithms are based on the
Gauss - Jordan transformation and, for this reason, are similar to the Simplex algo-
rithm. Controlling the basis exchange algorithm is related to the minimization of
convex and piecewise linear (CPL) criterion functions [7].
The perceptron and collinearity criterion functions belong to the family of CPL functions. The minimization of the perceptron criterion function allows checking the linear separability of data sets and designing piecewise linear classifiers [8].
Let us assume that each of the m objects Oj from a given database is represented by an n-dimensional feature vector xj = [xj,1, ..., xj,n]T belonging to the feature space F[n] (xj ∈ F[n]). The data set C consists of m such feature vectors xj:

C = {xj : j = 1, ..., m}   (1)

The components xj,i of the feature vector xj are numerical values (xj,i ∈ R or xj,i ∈ {0, 1}) of the individual features Xi of the j-th object Oj. In this context, each feature vector xj (xj ∈ F[n]) represents n features Xi belonging to the feature set F(n) = {X1, ..., Xn}.
The pairs {Gk+, Gk−} (k = 1, ..., K) of the learning sets Gk+ and Gk− (Gk+ ∩ Gk− = ∅) are formed from some feature vectors xj selected from the data set C (1):

Gk+ = {xj : j ∈ Jk+}, and Gk− = {xj : j ∈ Jk−}   (2)

where Jk+ and Jk− are non-empty sets of indices j of vectors xj (Jk+ ∩ Jk− = ∅). The positive learning set Gk+ is composed of mk+ feature vectors xj (j ∈ Jk+). Similarly, the negative learning set Gk− is composed of mk− feature vectors xj (j ∈ Jk−), where mk+ + mk− ≤ m.
The possibility of separating the learning sets Gk+ and Gk− (2) by a hyperplane H(wk, θk) in the feature space F[n] is investigated in pattern recognition [1]:

H(wk, θk) = {x : wkT x = θk}   (3)

where wk = [wk,1, ..., wk,n]T is the weight vector and θk is the threshold. The learning sets Gk+ and Gk− (2) are linearly separable if there exist parameters (wk, θk) such that:

(∀xj ∈ Gk+) wkT xj ≥ θk + 1 and (∀xj ∈ Gk−) wkT xj ≤ θk − 1   (4)

According to the above inequalities, all vectors xj from the learning set Gk+ (2) are located on the positive side of the hyperplane H(wk, θk) (3), and all vectors xj from the set Gk− lie on the negative side of this hyperplane.
The hyperplane H(wk, θk) (3) separates (4) the sets Gk+ and Gk− (2) with the following margin δL2(wk), based on the Euclidean (L2) norm, which is used in the Support Vector Machines (SVM) method [12]:

δL2(wk) = 2 / ||wk||L2 = 2 / (wkT wk)1/2   (5)

where ||wk||L2 = (wkT wk)1/2 is the Euclidean length of the weight vector wk.
The margin δL1(wk) with the L1 norm related to the hyperplane H(wk, θk) (3), which separates the learning sets Gk+ and Gk− (2), was determined by analogy to (5) as [11]:

δL1(wk) = 2 / ||wk||L1   (6)

where ||wk||L1 = |wk,1| + ... + |wk,n| is the L1 length of the weight vector wk.
The margins δL2(wk) (5) or δL1(wk) (6) are maximized to improve the generalization properties of linear classifiers designed from the learning sets Gk+ and Gk− (2) [7].
The following set of mk0 = mk+ + mk− linear equations can be formulated on the basis of the linear separability inequalities (4):

(∀j ∈ Jk+) xjT wk = θk + 1 and (∀j ∈ Jk−) xjT wk = θk − 1   (7)

(∀i ∈ Ik) eiT wk = 0   (8)
The parameter vertex wk = [wk,1, ..., wk,n]T can be determined by the linear Eqs. (7) and (8) if the feature vectors xj forming the learning sets Gk+ and Gk− (2) are linearly independent [7].
The feature vector xj0 (xj0 ∈ Gk+ ∪ Gk− (2)) is a linear combination of some other vectors xj(i) (j(i) ≠ j0) from the learning sets (2) if there are parameters αj0,i (αj0,i ≠ 0) such that the following relation holds:
Definition 2: Feature vectors xj making up the learning sets Gk+ and Gk− (2) are linearly independent if none of these vectors xj0 (xj0 ∈ Gk+ ∪ Gk−) can be expressed as a linear combination (9) of l (l ∈ {1, ..., m − 1}) other vectors xj(l) from the learning sets.
If the number mk0 = mk+ + mk− of elements xj of the learning sets Gk+ and Gk− (2) is smaller than the dimension n of the feature space F[n] (mk+ + mk− ≤ n), then the parameter vertex wk(θk) can be defined by the linear equations in the following matrix form [13]:

Bk wk(θk) = 1k(θk)   (10)

where

1k(θk) = [θk + 1, ..., θk + 1, θk − 1, ..., θk − 1, 0, ..., 0]T   (11)

and

Bk = [x1, ..., xmk0, ei(mk0+1), ..., ei(n)]T   (12)
The first mk+ components of the vector 1k(θk) are equal to θk + 1, the next mk− components are equal to θk − 1, and the last n − mk+ − mk− components are equal to 0. The first mk+ rows of the square matrix Bk (12) are formed by the feature vectors xj (j ∈ Jk+) from the set Gk+ (2), the next mk− rows are formed by the vectors xj (j ∈ Jk−) from the set Gk− (2), and the last n − mk+ − mk− rows are made up of the unit vectors ei (i ∈ Ik).
If the matrix Bk (12) is non-singular, then the inverse matrix Bk−1 exists:

Bk−1 = [r1, ..., rmk0, ri(mk0+1), ..., ri(n)]   (13)

In this case, the parameter vertex wk(θk) (10) can be defined by the following equation:

wk(θk) = Bk−1 1k(θk) = (θk + 1) rk+ + (θk − 1) rk−   (14)

where the vector rk+ is the sum of the first mk+ columns ri of the inverse matrix Bk−1 (13), and the vector rk− is the sum of the successive mk− columns ri of this matrix.
The last n − (mk+ + mk−) components wk,i(θk) of the vector wk(θk) = [wk,1(θk), ..., wk,n(θk)]T (14), linked to the zero components of the vector 1k(θk) (11), are equal to zero:

(∀i ∈ Ik) wk,i(θk) = 0   (15)

The conditions wk,i(θk) = 0 (15) result from the equations eiT wk(θk) = 0 (8) at the vertex wk(θk) (14).
The length ||wk(θk)||L1 of the weight vector wk(θk) (14) in the L1 norm is the sum of the mk0 = mk+ + mk− non-zero components |wk,i(θk)|:

||wk(θk)||L1 = |wk,1(θk)| + ... + |wk,n(θk)|   (16)

The length ||wk(θk)||L1 (16) of the vector wk(θk) (14) with the L1 norm is minimized to increase the margin δL1(wk(θk)) (6). The length ||wk(θk)||L1 (16) can be minimized by selecting the optimal threshold value θk* on the basis of Eq. (14).
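A small numeric sketch of Eqs. (10)-(16) is given below: the basis matrix Bk is assembled from the learning sets and unit vectors, the vertex wk(θk) is obtained by solving the linear system, and its L1 length is reported. The data and the threshold value are illustrative only.

```python
# Hedged sketch: build Bk and 1k(θk), solve for the parameter vertex wk(θk), report its L1 length.
import numpy as np

rng = np.random.default_rng(1)
n = 6                                   # dimension of the feature space F[n]
G_plus = rng.normal(size=(2, n))        # mk+ = 2 feature vectors of the positive learning set (placeholder)
G_minus = rng.normal(size=(2, n))       # mk- = 2 feature vectors of the negative learning set (placeholder)
mk = len(G_plus) + len(G_minus)

theta = 1.0                             # threshold θk (to be tuned to minimise the L1 length)
unit = np.eye(n)[mk:]                   # n - mk unit vectors ei completing the basis
B_k = np.vstack([G_plus, G_minus, unit])                    # Eq. (12)
ones_k = np.concatenate([np.full(2, theta + 1.0),           # Eq. (11)
                         np.full(2, theta - 1.0),
                         np.zeros(n - mk)])
w_k = np.linalg.solve(B_k, ones_k)                          # vertex wk(θk), Eqs. (10) and (14)
print("wk(θk) =", np.round(w_k, 3), " L1 length =", np.abs(w_k).sum())
```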
Theorem 1: The learning sets Gk+ and Gk− (2) formed by m (m ≤ n) linearly independent (9) feature vectors xj are linearly separable (4) in the feature space F[n] (xj ∈ F[n]).
Proof: If the learning sets Gk+ and Gk− (2) are formed by m linearly independent feature vectors xj, then the non-singular matrix Bk = [x1, ..., xm, ei(m+1), ..., ei(n)]T (12) containing these m vectors xj and n − m unit vectors ei (i ∈ Ik) can be defined [10]. In this case, the inverse matrix Bk−1 (13) exists and determines the vertex wk(θk) (14). The vertex equation Bk wk(θk) = 1k(θk) (10) can be reformulated for the feature vectors xj (2) as follows:

(∀xj ∈ Gk+) wk(θk)T xj = θk + 1 and (∀xj ∈ Gk−) wk(θk)T xj = θk − 1   (19)

The solution of Eqs. (19) satisfies the linear separability inequalities (4).
It is possible to enlarge the learning sets Gk+ and Gk− (2) in a way that maintains their linear separability (4).
Lemma 1: Increasing the positive learning set Gk+ (2) by a new vector xj0 (xj0 ∉ Gk+), which is a linear combination with the parameters αj0,i (9) of some feature vectors xj(l) (2) from this set (xj(l) ∈ Gk+), preserves the linear separability (4) of the learning sets if the parameters αj0,i fulfill the following condition:
The above inequality means that the linear separability conditions (4) still apply after the enlargement of the learning set Gk+ (2).
Lemma 2: Increasing the negative learning set Gk− (2) by a new vector xj0 (xj0 ∉ Gk−), which is a linear combination with the parameters αj0,i (9) of some feature vectors xj(l) (2) from this set (xj(l) ∈ Gk−), preserves the linear separability (4) of the learning sets if the parameters αj0,i fulfill the following condition:
The minimization of the perceptron criterion function allows assessing the degree of linear separability (4) of the learning sets Gk+ and Gk− (2) in different feature subspaces F[n0] (F[n0] ⊂ F[n + 1]) [6]. When defining the perceptron criterion function, it is convenient to use the following augmented feature vectors yj (yj ∈ F[n + 1]) and augmented weight vectors vk (vk ∈ Rn+1) [1]:

(∀j ∈ Jk+ (2)) yj = [xjT, 1]T, and (∀j ∈ Jk− (2)) yj = −[xjT, 1]T   (23)

and

vk = [wkT, −θk]T = [wk,1, ..., wk,n, −θk]T   (24)
The augmented vectors yj are constructed (23) on the basis of the learning sets Gk+ and Gk− (2). These learning sets are extracted from the data set C (1) according to some additional knowledge. The linear separability (4) of the learning sets Gk+ and Gk− (2) can be reformulated using the following set of m inequalities with the augmented vectors yj (23) [7]:

(∃vk) (∀j ∈ Jk+ ∪ Jk− (2)) vkT yj ≥ 1   (25)

Dual hyperplanes hjp (26) divide the parameter space Rn+1 (v ∈ Rn+1) into a finite number L of disconnected regions (convex polyhedra) Dlp (l = 1, ..., L) [7]:

Dlp = {v : (∀j ∈ Jl+) yjT v ≥ 1 and (∀j ∈ Jl−) yjT v < 1}   (27)

where Jl+ and Jl− are disjoint subsets (Jl+ ∩ Jl− = ∅) of indices j of the feature vectors xj making up the learning sets Gk+ and Gk− (2).
The perceptron penalty functions φjp(v) are defined as follows for each of the augmented feature vectors yj (23) [6]:

(∀j ∈ Jk)  φjp(v) = 1 − yjT v  if yjT v < 1;  φjp(v) = 0  if yjT v ≥ 1   (28)

The j-th penalty function φjp(v) (28) is greater than zero if and only if the weight vector v is located on the wrong side (yjT v < 1) of the j-th dual hyperplane hjp (26). The function φjp(v) (28) is linear and greater than zero as long as the parameter vector v = [vk,1, ..., vk,n+1]T remains on the wrong side of the hyperplane hjp (26). Convex and piecewise-linear (CPL) penalty functions φjp(v) (28) are used to enforce the linear separation of the learning sets Gk+ and Gk− (2).
The perceptron criterion function Φkp(v) is defined as the weighted sum of the penalty functions φjp(v) (28) [6]:

Φkp(v) = Σj αj φjp(v)   (29)

The positive parameters αj (αj > 0) can be treated as prices of the individual feature vectors xj:
where mk+ (mk−) is the number of elements xj in the learning set Gk+ (Gk−) (2).
The perceptron criterion function Φkp(v) (29) was built on the basis of the error correction algorithm, the basic algorithm in the Perceptron model of learning processes in neural networks [14].
The criterion function Φkp(v) (29) is convex and piecewise-linear (CPL) [6]. This means, among other things, that the function Φkp(v) (29) remains linear within each region Dl (27):
where the summation is performed over all vectors yj (23) fulfilling the condition yjT v < 1.
The optimal vector vk* determines the minimum value Φkp(vk*) of the criterion function Φkp(v) (29):

(∃vk*) (∀v ∈ Rn+1) Φkp(v) ≥ Φkp(vk*) ≥ 0   (32)

Since the criterion function Φkp(v) (29) is linear in each convex polyhedron Dl (27), the optimal point vk* representing the minimum Φkp(vk*) (32) can be located in a selected vertex of some polyhedron Dl0p (27). This property of the optimal vector vk* (32) follows from the fundamental theorem of linear programming [5].
It has been shown that the minimum value Φkp(vk*) (32) of the perceptron criterion function Φkp(v) (29) with the parameters αj (30) is normalized as follows [6]:

0 ≤ Φkp(vk*) ≤ 1   (33)
where λ (λ ≥ 0) is the cost level. The standard values of the cost parameters γi are equal to one ((∀i ∈ {1, ..., n}) γi = 1).
The optimal vector vk,λ* constitutes the minimum value Ψkp(vk,λ*) of the CPL criterion function Ψkp(v) (34), which is defined on the elements xj of the learning sets Gk+ and Gk− (2):
(∃vk,λ*) (∀v ∈ Rn+1) Ψkp(v) ≥ Ψkp(vk,λ*) > 0   (35)

Similarly to the case of the perceptron criterion function Φkp(v) (29), the optimal vector vk,λ* (35) can be located in a selected vertex of some polyhedron Dl0 (27). The minimum value Ψkp(vk,λ*) (35) of the criterion function Ψkp(v) (34) is used, among others, in the relaxed linear separability (RLS) method of gene subset selection [15].
The penalty functions φj(w) (36) can be related to the following dual hyperplanes hj1 in the parameter (weight) space Rn (w ∈ Rn):

(∀j = 1, ..., m) hj1 = {w : xjT w = 1}   (37)

The CPL penalty φj(w) (36) is equal to zero (φj(w) = 0) at the point w = [w1, ..., wn]T if and only if the point w is located on the dual hyperplane hj1 (37).
The collinearity criterion function Φk(w) is defined as the weighted sum of the penalty functions φj(w) (36) determined by the feature vectors xj forming the data subset Ck (Ck ⊂ C (1)):

Φk(w) = Σj∈Jk βj φj(w)   (38)

where the sum takes into account only the indices j of the set Jk = {j : xj ∈ Ck}, and the positive parameters βj (βj > 0) in the function Φk(w) (38) can be treated as the prices of the particular feature vectors xj. The standard choice of the parameter values βj is one ((∀j ∈ Jk) βj = 1.0).
The collinearity criterion function Φk(w) (38) is convex and piecewise-linear (CPL) as the sum of penalty functions φj(w) (36) of this type [9]. The vector wk* determines the minimum value Φk(wk*) of the criterion function Φk(w) (38):
Definition 3: The data subset Ck (Ck ⊂ C (1)) is collinear when all feature vectors xj from this subset are located on some hyperplane H(w, θ) = {x : wT x = θ} with θ ≠ 0.
Theorem 3: The minimum value Φk(wk*) (39) of the collinearity criterion function Φk(w) (38) defined on the feature vectors xj constituting a data subset Ck (Ck ⊂ C (1)) is equal to zero (Φk(wk*) = 0) when this subset Ck is collinear (Def. 3) [9].
Different collinear subsets Ck can be extracted from a data set C (1) with a large number m of elements xj by minimizing the collinearity criterion function Φk(w) (38) [9].
The minimum value Φk(wk*) (39) of the collinearity criterion function Φk(w) (38) can be reduced to zero by omitting some feature vectors xj from the data subset Ck (Ck ⊂ C (1)). If the minimum value Φk(wk*) (39) is greater than zero (Φk(wk*) > 0), then we can select the feature vectors xj (j ∈ Jk(wk*)) with a penalty φj(wk*) (36) greater than zero:
Omitting one feature vector xj0 (j0 ∈ Jk(wk*)) with the above property results in the following reduction of the minimum value Φk(wk*) (39):
where Φk0(wk0*) is the minimum value (39) of the collinearity criterion function Φk0(w) (38) defined on the feature vectors xj constituting the data subset Ck reduced by the vector xj0.
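The following sketch illustrates the collinearity criterion on a toy data subset, assuming the CPL penalty of Eq. (36) has the form φj(w) = |1 − xjTw|, which is zero exactly on the dual hyperplane hj1 (37); it is meant only to show how the minimum value drops to zero for a collinear subset (Theorem 3), not to reproduce the basis exchange minimization itself.

```python
# Hedged sketch of the collinearity criterion (38) under the assumed penalty |1 - xj^T w|.
import numpy as np

def collinearity_criterion(X, w, beta=None):
    beta = np.ones(len(X)) if beta is None else beta        # standard prices: βj = 1.0
    return float(np.sum(beta * np.abs(1.0 - X @ w)))

w_star = np.array([0.5, -1.0, 2.0])
X_collinear = np.array([[2.0, 0.0, 0.0],     # each row satisfies xj^T w* = 1,
                        [0.0, -1.0, 0.0],    # i.e. the subset lies on H(w*, 1)
                        [2.0, 1.0, 0.5]])
print(collinearity_criterion(X_collinear, w_star))          # ~0: the subset Ck is collinear (Def. 3)
print(collinearity_criterion(X_collinear + 0.3, w_star))    # > 0 once collinearity is broken
```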
The regularized criterion function Ψk(w) is defined as the sum of the collinearity criterion function Φk(w) (38) and some additional CPL penalty functions φi0(w) [7]:
where λ ≥ 0 is the cost level. The standard values of the cost parameters γi are equal to one ((∀i ∈ {1, ..., n}) γi = 1). The additional CPL penalty functions φi0(w) are defined below [7]:

(∀i = 1, ..., n)  φi0(w) = χi(w) = |eiT w| = −wi if wi ≤ 0;  wi if wi > 0   (43)

The functions φi0(w) (43) are related to the following dual hyperplanes hi0 in the parameter (weight) space Rn (w ∈ Rn):

(∀i = 1, ..., n) hi0 = {w : eiT w = 0} = {w : wi = 0}   (44)

The CPL penalty function φi0(w) (43) is equal to zero (φi0(w) = 0) at the point w = [w1, ..., wn]T if and only if this point is located on the dual hyperplane hi0 (44).
5. Parameter vertices
The perceptron criterion function Φkp(v) (29) and the collinearity criterion function Φk(w) (38) are convex and piecewise-linear (CPL). The minimum values of such CPL criterion functions can be located in the parameter vertices of some convex polyhedra. We consider the parameter vertices wk (wk ∈ Rn) related to the collinearity criterion function Φk(w) (38).
Definition 4: The parameter vertex wk of rank rk (rk ≤ n) in the weight space Rn (wk ∈ Rn) is the intersection point of rk hyperplanes hj1 (37) defined by linearly independent feature vectors xj (j ∈ Jk) from the data set C (1) and n − rk hyperplanes hi0 (44) defined by unit vectors ei (i ∈ Ik) [7].
The j-th dual hyperplane hj1 (37) defined by the feature vector xj (1) passes
through the k-th vertex wk if the equation wkTxj = 1 holds.
Definition 5: The k-th weight vertex wk of the rank rk is degenerate in the parameter
space Rn if the number mk of hyperplanes hj1 (37) passing through this vertex
(wkTxj = 1) is greater than the rank rk (mk > rk).
The vertex wk can be defined by the following set of n linear equations:

(∀j ∈ Jk) xjT wk = 1   (45)

and

(∀i ∈ Ik) eiT wk = 0   (46)

Eqs. (45) and (46) can be represented in the following matrix form [7]:

Bk wk = 1k   (47)

where 1k = [1, ..., 1, 0, ..., 0]T is the vector whose first rk components are equal to one and whose remaining n − rk components are equal to zero.
The square matrix Bk (47) consists of k feature vectors xj (j ∈ Jk (45)) and n − k unit vectors ei (i ∈ Ik (46)):

Bk = [x1, ..., xk, ei(k+1), ..., ei(n)]T   (48)

where the symbol ei(l) denotes the unit vector that forms the l-th row of the matrix Bk.
Since the feature vectors xj (∀j ∈ Jk(wk) (45)) making up the rk rows of the matrix Bk (48) are linearly independent, the inverse matrix Bk−1 exists:

Bk−1 = [r1, ..., rk, ri(k+1), ..., ri(n)]   (49)
The inverse matrix Bk−1 (49) can be obtained starting from the unit matrix I = [e1, ..., en]T and using the basis exchange algorithm [8].
The non-singular matrix Bk (48) is the basis of the feature space F[n] related to the vertex wk = [wk,1, ..., wk,n]T. Since the last n − rk components of the vector 1k (47) are equal to zero, the following equation holds:

wk = Bk−1 1k = r1 + ... + rk   (50)

According to Eq. (50), the weight vertex wk is the sum of the first k columns ri of the inverse matrix Bk−1 (49).
Remark 1: The n − k components wk,i of the vector wk = [wk,1, ..., wk,n]T (50) linked to the zero components of the vector 1k = [1, ..., 1, 0, ..., 0]T (47) are equal to zero:

(∀i ∈ Ik) wk,i = 0   (51)

The conditions wk,i = 0 (51) result from the equations wkT ei = 0 (46) at the vertex wk.
The fundamental theorem of linear programming shows that the minimum Φk(wk*) (39) of the CPL collinearity criterion function Φk(w) (38) can always be located in one of the vertices wk (50) [5]. The regularized criterion function Ψk(w) (42), another function of the CPL type, has the same property [7].
We can see that all the feature vectors xj (1) which define hyperplanes hj1 (37) passing through the vertex wk are located on the hyperplane H(wk, 1) = {x : wkT x = 1} (3) in the feature space F[n]. A large number mk of feature vectors xj (1) located on the hyperplane H(wk, 1) (3) form the collinear cluster C(wk) based on the vertex wk [8]:

C(wk) = {xj ∈ C (1) : wkT xj = 1}   (52)

If the vertex wk of rank rk is degenerate in the parameter space Rn, then the collinear cluster C(wk) (52) contains more than rk feature vectors xj (1).
The k-th vertex wk = [wk,1, ..., wk,n]T in the parameter space Rn (wk ∈ Rn) is linked by Eq. (47) to the non-singular matrix Bk (48). The rows of the matrix Bk (48) can form the basis of the feature space F[n]. The conditions wk,i = 0 (51) result from the equations wkT ei = 0 (46) at the vertex wk.
Each feature vector xj from the data set C (1) represents n features Xi belonging to the feature set R(n) = {X1, ..., Xn}. The k-th vertexical feature subset Rk(rk) consists of the rk features Xi that are connected to the weights wk,i different from zero (wk,i ≠ 0):

Rk(rk) = {Xi(1), ..., Xi(rk)}   (54)

The k-th vertexical subspace Fk[rk] (Fk[rk] ⊂ F[n]) contains the reduced vectors xj[rk] with rk components xj,i(l) (xj[rk] ∈ Fk[rk]) related to the weights wk,i different from zero:

(∀j ∈ {1, ..., m}) xj[rk] = [xj,i(1), ..., xj,i(rk)]T   (55)

The reduced vectors xj[rk] (55) are obtained from the feature vectors xj = [xj,1, ..., xj,n]T belonging to the data set C (1) by omitting the n − rk components xj,i related to the weights wk,i equal to zero (wk,i = 0).
We consider the optimal vertexical subspace Fk*[rk] (Fk*[rk] ⊂ F[n]) related to the reduced optimal vertex wk*[rk], which determines the minimum Φk(wk*) (39) of the collinearity criterion function Φk(w) (38). The optimal collinear cluster C(wk*[rk]) (52) is based on the optimal vertex wk*[rk] = [wk,1*, ..., wk,rk*]T with rk components wk,i* different from zero (wk,i* ≠ 0). Feature vectors xj belonging to the collinear cluster C(wk*) (52) satisfy the equations wk*[rk]T xj[rk] = 1, hence:

(∀xj ∈ C(wk*))  wk,1* xj,i(1) + ... + wk,rk* xj,i(rk) = 1   (56)

where xj,i(l) are the components of the j-th feature vector xj related to the weights wk,i different from zero (wk,i ≠ 0).
A large number mk of feature vectors xj (1) belonging to the collinear cluster C(wk*[rk]) (52) justifies the following collinear model of interaction between the selected features Xi(l), which is based on Eqs. (56) [9]:
The collinear interaction model (57) allows, inter alia, the design of the following prognostic models for each feature Xi0 from the subset Rk(rk) (54):

(∀i0 ∈ {1, ..., rk}) Xi0 = αi0,0 + αi0,1 Xi(1) + ... + αi0,rk Xi(rk)   (58)

where αi0,0 = 1 / wk,i0*, αi0,i0 = 0, and (∀i(l) ≠ i0) αi0,i(l) = wk,i(l)* / wk,i0*.
Feature Xi0 is the dependent variable in the prognostic model (58); the remaining features Xi(l) are independent variables (i(l) ≠ i0). A family of rk prognostic models (58) can be designed on the basis of one collinear interaction model (57). The models (58) have a better justification for a large number mk of feature vectors xj (1) in the collinear cluster C(wk*[rk]) (52).
The collinearity criterion function Φ(w) (38), like other convex and piecewise
linear (CPL) criterion functions, can be minimized using the basis exchange algorithm
[8]. The basis exchange algorithm aimed at minimization of the collinearity criterion
function Φ(w) (38) is described below.
According to the basis exchange algorithm, the optimal vertex wk*, which constitutes the minimum value Φk(wk*) (39) of the collinearity function Φk(w) (38), is reached after a finite number L of steps l as a result of guided movement between selected vertices wk (50) [8]:

w0 → w1 → ... → wL   (59)

The sequence of vertices wk (59) is related by (47) to the following sequence of inverse matrices Bk−1 (49):

B0−1 → B1−1 → ... → BL−1   (60)
The sequence of vertices wk(l) (59) typically starts at the vertex w0 = [0, ..., 0]T related to the identity matrix B0 = In = [e1, ..., en]T of dimension n x n [7]. The final vertex wL (59) should assure the minimum value of the collinearity criterion function Φ(w) (38):
by an exit criterion based on the gradient of the collinearity criterion function Φ(w) (38) [7]. The exit criterion allows determining the exit edge rk+1 (49) of the greatest descent of the collinearity criterion function Φ(w) (38). As a result of replacing the unit vector ei(k+1) with the feature vector xk+1, the value Φ(wk) of the collinearity function Φ(w) (38) decreases (41):
After a finite number L (L ≤ m) of steps k, the collinearity function Φ(w) (38) reaches its minimum (61) at the final vertex wL (59).
The sequence (60) of the inverse matrices Bk−1 is obtained in a multi-step process of minimizing the function Φ(w) (38). During the k-th step, the matrix Bk−1 = [x1, ..., xk−1, ei(k), ..., ei(n)]T (12) is transformed into the matrix Bk by replacing the unit vector ei(k) with the feature vector xk:

ri(l+1)(l + 1) = ri(l+1)(l) / (ri(l+1)(l)T xl+1)

and

(∀i ≠ i(l + 1)) ri(l + 1) = ri(l) − (ri(l)T xl+1 / ri(l+1)(l)T xl+1) ri(l+1)(l)   (64)

where i(l + 1) is the index of the unit vector ei(l+1) leaving the basis Bl = [x1, ..., xl, ei(l+1), ..., ei(n)]T during the l-th stage.
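A numeric sketch of this column update is given below; it assumes the standard Gauss-Jordan (basis exchange) form of Eq. (64) and verifies the result against a directly constructed basis. Variable names are illustrative.

```python
# Hedged sketch of the Gauss-Jordan column update used by the basis exchange algorithm.
import numpy as np

def exchange(B_inv, x, i):
    """Update B^-1 when the i-th basis row (a unit vector) is replaced by the feature vector x."""
    r_i = B_inv[:, i].copy()
    pivot = r_i @ x                        # pivot == 0 is the collinearity condition (65)
    if abs(pivot) < 1e-12:
        raise ValueError("x is linearly dependent on the current basis rows")
    B_new = B_inv - np.outer(r_i, B_inv.T @ x) / pivot    # update all columns ri, Eq. (64)
    B_new[:, i] = r_i / pivot                             # pivot column update
    return B_new

n = 4
B_inv = np.eye(n)                           # start from B0 = I, so B0^-1 = I
x1 = np.array([2.0, 1.0, 0.0, -1.0])
B_inv = exchange(B_inv, x1, i=0)            # basis B1 = [x1, e2, e3, e4]^T
B1 = np.vstack([x1, np.eye(n)[1:]])
print(np.allclose(B1 @ B_inv, np.eye(n)))   # True: the update indeed yields B1^-1
```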
Remark 2: The vector Gauss-Jordan transformation (64) resulting from replacing the unit vector ei(k) with the feature vector xk in the basis Bk−1 = [x1, ..., xk−1, ei(k), ..., ei(n)]T cannot be executed when the below collinearity condition is met [7]:
Similarly, the symbol xj[k] = [xj,1, ..., xj,k]T denotes the reduced vector obtained from the feature vector xj = [xj,1, ..., xj,n]T after reducing the last n − k components xj,i:

(∀j ∈ {1, ..., m}) xj[k] = [xj,1, ..., xj,k]T   (67)

Lemma 3: The collinearity condition (65) appears during the k-th step when the reduced vector xk[k] (66) is a linear combination of the reduced basis vectors xj[k] (67) with j < k:
The maximal number lmax (69) of different vertices wL(l) (59) can be large when m << n:
The choice between different final vertices wL(l) (59) can be based on the minimization of the regularized criterion function Ψ(w) (42). The regularized function Ψ(w) (42) is the sum of the collinearity function Φ(w) (38) and the weighted sum of the cost functions φi0(w) (43). If Φ(wL(l)) = 0 (38), then the value Ψ(wL(l)) of the criterion function Ψ(w) (42) at the final vertex wL(l) (59) can be given as follows:

Ψ(wL(l)) = λ Σi γi φi0(wL(l)) = λ Σi γi |wL(l),i|   (71)

where the above sums take into account only the indices i of the subset I(wL(l)) of the non-zero components wL(l),i of the final vertex wL(l) = [wL(l),1, ..., wL(l),n]T (59):

I(wL(l)) = {i : eiT wL(l) ≠ 0} = {i : wL(l),i ≠ 0}   (72)

If the final vertex wL(l) (59) is not degenerate (Def. 5), then the matrix BL(l) (48) is built from all m feature vectors xj (j ∈ {1, ..., m}) making up the data set C (1) and from n − m selected unit vectors ei (i ∈ I(wL(l)) (72)):

Bm = [x1, ..., xm, ei(m+1), ..., ei(n)]T   (73)

The problem of the constrained minimization of the regularized function Ψ(w) (71) at the vertices wL(l) (59) satisfying the conditions Φ(wL(l)) = 0 (69) can be formulated in the following way:

minl {Ψ(wL(l)) : Φ(wL(l)) = 0} = minl {Σi γi |wL(l),i| : Φ(wL(l)) = 0}   (74)
According to the above formulation, the search for the minimum of the regularized
criterion function Ψ(w) (42) is takes place at all such vertices wL(l) (59), where the
collinearity function Φ(w) (38) is equal to zero. The regularized criterion function
Ψ(w) (42) is defined as follows at the final vertices
wL(l) = [wL(l),1, … , wL(l),n]T (59), where Φ(wL(l)) = 0:
(∀wL(l))   Ψ0(wL(l)) = Σi γi |wL(l),i|      (75)
The optimal vertex wL(l)* yields the minimum value Ψ0(wL(l)*) of the CPL criterion function Ψ0(w) (75) defined on such final vertices wL(l) (59), where Φ(wL(l)) = 0 (38):
(∃wL(l)*) (∀wL(l) : Φ(wL(l)) = 0)   Ψ0(wL(l)) ≥ Ψ0(wL(l)*) > 0      (76)
The optimal vertex wL(l)* with the smallest L1 length ||wL(l)*||L1 (77) is related to the largest L1 margin δL1(wL(l)*) (6) [11]:
δL1(wL(l)*) = 2 / ||wL(l)*||L1 = 2 / (|wL(l),1*| + … + |wL(l),n*|)      (78)
The basis exchange algorithm allows solving the constrained minimization problem (74) and finding the optimal vertex wL(l)* (77) with the largest L1 margin δL1(wL(l)*).
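As a small illustration of the quantity being maximized, the L1 margin of Eq. (78) can be computed directly from a vertex; this is only a helper sketch, not part of the basis exchange algorithm itself.

```python
import numpy as np

def l1_margin(w):
    """L1 margin of a weight vertex w, as in Eq. (78): delta_L1(w) = 2 / ||w||_1."""
    return 2.0 / np.abs(w).sum()
```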
Support Vector Machines (SVM) is the most popular method for designing linear classifiers or prognostic models with large margins [12]. According to the SVM approach, the optimal linear classifier or prognostic model is defined by such an optimal weight vector w* that has a maximum margin δL2(w*) based on the Euclidean (L2) norm:
δL2(w*) = 2 / ||w*||L2 = 2 / ((w*)T w*)^(1/2)      (79)
The unit vectors ei(l) in the matrices Bm(l) (73) are exchanged to minimize the CPL
function Ψ0 (wm(l)) (75) at the final vertices wm(l) (77). The optimal basis Bm* defines
(47) the optimal vertex wm(l)* (77), which is characterized by the largest margin
δL1(wm(l)*) (78).
The vertexical feature subspace F1*[m] (F1*[m] ⊂ F[n] (1)) can be obtained on
the basis of the optimal vertex wm(l)* (77) with the largest margin δL1(wm(l)*) (78).
The vertexical subspace F1*[m] contains the reduced vectors x1,j[m] with the
dimension m [7]:
The reduced vectors x1,j[m] (80) are obtained from the feature vectors xj = [xj,1,...,
xj,n]T (xj ∈ F[n]) ignoring such components xj,i which are related to the unit vectors
ei in the optimal basis B1* = [x1,..., xm, ei(m + 1),..., ei(n)]T (73). The reduced vectors
x1,j[m] are represented by such m features Xi (Xi ∈ R1* (54)), which are not linked to
the unit vectors ei (i ∉ Im(l)*) in the basis Bm(l)* (73) representing the optimal vertex
wm(l)* (77).
R1* = {Xi(1), … , Xi(m) : i(l) ∉ Im(l)* (72)}      (81)
The m features Xi(l) belonging to the optimal subset R1* (Xi(l) ∈ R1* (81)) are related to the weights wk,l* (wk*[m] = [wk,1*, … , wk,m*]T) that are not equal to zero (wk,l* ≠ 0).
The optimal feature subset R1* (81) consists of m collinear features Xi. The
optimal vertex w1*[m] (Φ(w1*[m]) = 0 (69)) in the reduced parameter space
Rm (w1*[m] ∈ Rm) is based on these m features Xi. The reduced optimal vertex
w1*[m] with the largest margin δL1(w1*[m]) (77) is the unique solution of the
constrained optimization problem (74). Maximizing the L1 margin δL1(wl*) (78)
leads to the first reduced vertex w1*[m] = [wk,1*, … , wk,m*]T with non-zero components wk,i* (wk,i* ≠ 0).
The collinear interaction model between m collinear features Xi(l) from the optimal
subset R1*(m) (81) can be formulated as follows (57):
The prognostic models for each feature Xi0 from the subset R1* (81) may have the
following form (58):
(∀i0 ∈ {1, … , m})   Xi0 = αi0,0 + αi0,1 Xi(1) + … + αi0,m Xi(m)      (83)
where αi0,0 = 1 / wk,i0*, αi0,i0 = 0, and (∀i(l) ≠ i0) αi0,i(l) = wk,i(l)* / wk,i0*.
In the case of a data set C with a small number m (m ≪ n) of multidimensional feature vectors xj (1), the prognostic models (83) for individual features Xi0 can be weak. It is known that sets (ensembles) of weak models can have strong generalizing properties
[4]. A set of weak prognostic models (83) for a selected feature (dependent variable) Xi0
can be implemented in the complex layer of L prognostic models (83) [11].
The complex layer can be built on the basis of the sequence of L optimal vertices wl*
(77) related to m features Xi constituting the subsets Rl* (81), where l = 0, 1,..., L.
Design assumption: Each subset Rl* (81) in the sequence (84) contains the a priori selected feature (dependent variable) Xi0 and m − 1 other features (independent
variables) Xi(l). The other features Xi(l) (Xi(l) ∈ Rl*) should be different in successive
subsets Rl* (l = 0, 1,..., L).
The first optimal vertex w1* (77) in the sequence (84) is designed on the basis of m feature vectors xj (1), which are represented by all n features Xi constituting the feature set F(n) = {X1, … , Xn}. The vertex w1* (77) is found by solving the constrained optimization problem (74) according to the two-stage procedure outlined earlier. The two-stage procedure allows finding the optimal vertex w1* (77) with the
largest L1 margin δL1(w1*) (78).
The second optimal vertex w2* (77) in the sequence (84) is obtained on the basis of
m reduced feature vectors xj[n - (m - 1)] (67), which are represented by n - (m - 1)
features Xi constituting the reduced feature subset F2(n − (m − 1)):
The l-th optimal vertex wl* (77) in the sequence (84) is designed on the basis of m
reduced vectors xj[n - l(m - 1)] (67), which are represented by n - l(m - 1) features Xi
constituting the feature subset Fl(n - l(m - 1)):
The sequence (84) of L optimal vertices wl* (77) related to the subsets
Fl(n - l(m - 1)) (86) of features is characterized by decreased L1 margins δL1(wl*)
(78) [18].
The prognostic models (83) for the dependent feature (variable) Xi0 are designed
for each subset Fl(n - l(m - 1)) (86) of features Xi, where l = 0, 1,..., L (84):
The final forecast Xi0 ∧ for the dependent feature (variable) Xi0 based on the com-
plex layer of L + 1 prognostic models (88) can have the following form:
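A minimal sketch of this averaging rule, assuming equal weights over the L + 1 individual models (the exact form of Eq. (89) may differ), is:

\[
\hat{X}_{i0} \;=\; \frac{1}{L+1}\sum_{l=0}^{L} X_{i0}(l)
\]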
In accordance with Eq. (89), the final forecast Xi0∧ for the feature Xi0 results from averaging the forecasts of the L + 1 individual models Xi0(l) (88).
9. Concluding remarks
The prognostic models (88) are built by using a small number m of collinear features Xi belonging to the
optimal feature clusters Rl* (81). The optimal feature clusters Rl* (81) are formed by
the search for the largest margins δL1(wl*) (78) in the L1 norm.
The averaged prognostic models Xi0 ∧ (89) are based on the layer of L parallel
models Xi0 (l) (88). In line with the ergodic theory, averaging on a small number m of
feature vectors xj has been replaced with averaging on L collinear clusters Rl* (81) of
features Xi. Such averaging scheme should allow for a more stable extraction of gen-
eral patterns from small samples of high-dimensional feature vectors xj (1) [11].
Author details
Leon Bobrowski1,2
© 2021 The Author(s). Licensee IntechOpen. This chapter is distributed under the terms of
the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0),
which permits unrestricted use, distribution, and reproduction in any medium, provided
the original work is properly cited.
References
[1] Duda O. R., Hart P. E., and Stork D. G., Pattern Classification, J. Wiley, New York, 2001
[2] Hand D., Smyth P., and Mannila H., Principles of Data Mining, MIT Press, Cambridge, 2001
[3] Bishop C. M., Pattern Recognition and Machine Learning, Springer Verlag, 2006
[4] Kuncheva L.: Combining Pattern Classifiers: Methods and Algorithms, 2nd Edition, J. Wiley, New Jersey, 2014
[5] Simonnard M., Linear Programming, Prentice-Hall, Englewood Cliffs, New York, 1966
[6] Bobrowski L., Data Mining Based on Convex and Piecewise Linear (CPL) Criterion Functions (in Polish), Białystok University of Technology, 2005
[7] Bobrowski L., Data Exploration and Linear Separability, pp. 1-172, Lambert Academic Publishing, 2019
[8] Bobrowski L.: "Design of piecewise linear classifiers from formal neurons by some basis exchange technique", Pattern Recognition, 24(9), pp. 863-870, 1991
[9] Bobrowski L., Zabielski P.: "Models of Multiple Interactions from Collinear Patterns", pp. 153-165 in: Bioinformatics and Biomedical Engineering (IWBBIO 2018), Eds.: I. Rojas, F. Guzman, LNCS 10208, Springer Verlag, 2018
[10] Bobrowski L.: "Small Samples of Multidimensional Feature Vectors" (ICCCI 2020), pp. 87-98 in: Advances in Computational Collective Intelligence, Eds.: Hernes M. et al., Springer, 2020
[11] … Margins", pp. 29-40 in: ACIIDS 2021, Springer Verlag, 2021
[12] Boser B. E., Guyon I., Vapnik V. N.: "A training algorithm for optimal margin classifiers", in: Proceedings of the Fifth Annual Workshop on Computational Learning Theory, 5, pp. 144-152, Pittsburgh, ACM, 1992
[13] Bobrowski L., Łukaszuk T.: "Repeatable functionalities in complex layers of formal neurons", EANN 2021, Engineering Applications of Neural Networks, Springer, 2021
[14] Rosenblatt F.: Principles of Neurodynamics, Spartan Books, Washington, 1962
[15] Bobrowski L., Łukaszuk T.: "Relaxed Linear Separability (RLS) Approach to Feature (Gene) Subset Selection", pp. 103-118 in: Selected Works in Bioinformatics, Ed.: Xuhua Xia, INTECH, 2011
[16] Bobrowski L.: "Large Matrices Inversion Using the Basis Exchange Algorithm", British Journal of Mathematics & Computer Science, 21(1), pp. 1-11, 2017
[17] Petersen K.: Ergodic Theory (Cambridge Studies in Advanced Mathematics), Cambridge University Press, 1990
[18] Bobrowski L., Zabielski P.: "Feature (gene) clustering with collinearity models", ICCCI 2021 (to appear), Springer Verlag, 2021
Section 2
Applications of Data Mining

Chapter 8
Artificial Intelligence and Its Application in Optimization under Uncertainty

Abstract
1. Introduction
In this regard, this chapter reviews recent advances in data-driven optimization that
highlight the integration of mathematical programming and ML for decision-making
under uncertainty and identifies potential research opportunities. We compare data-
driven optimization performance to conventional models from optimization method-
ology. We summarize the existing research papers on data-driven optimization under
uncertainty and classify them into three categories: data-driven stochastic programming, data-driven robust optimization, and data-driven chance-constrained programming, according to their unique approaches to uncertainty modeling and distinct optimization structures. Based
on the literature survey, we identify five promising future research directions on opti-
mization under uncertainty in the era of big data and DL: (i) employment of DL in the field of data-driven optimization under uncertainty, (ii) deep data-driven models, (iii) online learning-based data-driven optimization, (iv) leveraging RL techniques for optimization, and (v) deep RL for solving NP-hard problems, and we highlight the respective research challenges and potential methodologies. We conducted an extensive literature
review on recent papers published across the premier journals between 2002 and
2020 in our field, namely, the European Journal of Operational Research, Operations
Research, Journal of Cleaner Production, Production and Operations Management,
Journal of Operations Management, Computers in Industry, and Decision Sciences. We
specifically searched for papers containing “big data”, “data-driven optimization”, “arti-
ficial intelligence”, “machine learning”, “deep learning”, and “Reinforcement learning”.
However, our research into the existing literature reveals a scarcity of research works
utilizing DL and RL in these disciplines.
The remainder of this chapter is organized as follows: Section 2 provides an introduction to the mathematical optimization method. In Section 3, a brief review of AI methods such as ML, DL, and RL is provided. Sections 4–6 present the application of different ML, DL, and RL techniques to data-driven optimization under uncertainty. Finally, the chapter ends with the conclusion, some managerial implications, and future research recommendations.
mimic “cognitive” functions that humans associate with the human mind, such as
“learning” and “problem-solving” [22]. A brief description of the three main areas of
AI, including ML, DL, and RL, is provided in the following.
In the big data and ML era, a large amount of interactive data are routinely
generated and collected in different industries. Intelligence and data-driven analysis
and decision-making have a critical role in process operations, design, and control.
The success of the DSS depends primarily on the ability to process and analyze large
amounts of data and extract relevant and useful knowledge and information from
them. In this context, the data-driven approach has gained prominence due to its
provision of insights for decision-making and easy implementation. The data-driven
optimization framework is a hybrid system that integrates AI and optimization
methods for devising a data-driven and intelligent DSS. The data-driven system applies ML techniques for uncertainty modeling. The data-driven approach can discover various database patterns without relying on prior knowledge, while also handling multiple scenarios and flexible objectives. It can also extract information and knowledge from data with speed [29, 30].
The framework of data-driven optimization under uncertainty could be consid-
ered a hybrid system that integrates the data-driven system based on ML to extract
useful and relevant information from data. The model-based system is based on
mathematical programming to derive the optimal decisions from the information
[28]. The inability of traditional optimization methods to analyze big data, as well as
recent advances in ML techniques, made data-driven optimization a promising way
to hedge against uncertainty in the era of big data and ML. Therefore, these promises
create the need for organic integration and effective interaction between ML and
mathematical programming. In existing data-driven optimization frameworks, data
serve as input to a data-driven system. After that, useful, accurate, and relevant
uncertainty information is extracted through the data-driven system and further
passed along to the model-based system based on mathematical programming for
rigorous and systematic optimization under uncertainty, using paradigms such as
robust optimization and stochastic programming.
The various ML techniques and their potentials applications in data-driven
optimization under uncertainty are presented in the following.
Stochastic programs are used where the distribution of the uncertain parameters is only observable through a finite training dataset [31]. The primary assumption in the stochastic programming approach is that the probability distribution of the uncertain parameters is known. However, such complete knowledge of the parameters' probability distribution is rarely available in practice. Instead of knowing the actual distribution of an uncertain parameter, what the decision-maker has is a set of historical or real-time uncertainty data and possibly some prior structural knowledge of the probability distribution. Also, the assumed probability distribution of uncertain parameters may deviate from their actual distribution. Moreover, relying on a single probability distribution could lead to sub-optimal solutions or even to deteriorated out-of-sample performance [32]. Motivated by these stochastic
programming weaknesses, DRO emerges as a new data-driven optimization para-
digm that hedges against the worst-case distribution in an ambiguity set [28]. DRO
paradigm integrates data-driven systems and model-based systems. A data-driven
approach is applied in the DRO model to construct an uncertainty set of probability
distributions from uncertainty data through statistical inference and big data analyt-
ics [28]. In data-driven stochastic modeling, the uncertainty is modeled via a family of
probability distributions that well capture uncertainty data on hand [28]. This set of
probability distributions is referred to as an ambiguity set. With this ambiguity set, a
model is then proposed for problem design. Finally, a solution strategy is applied for
solving the optimization problem. For example, in the literature the Wasserstein metric has been used to construct a ball in the space of (multivariate and non-discrete) probability distributions centered at the uniform distribution on the training samples, and to seek decisions that perform best in view of the worst-case distribution within this Wasserstein ball [31]. Different practical approaches, such as moment-based methods and adopted distance metrics, have been employed for constructing the ambiguity set [31, 33, 34]. DRO is an effective method to address the inexactness of probability
distributions of uncertain parameters in decision-making under uncertainty that can
be applied for optimizing supply chain activities, for planning and scheduling under
uncertainty. This way reduces the modeling difficulty for uncertain parameters. Wang
& Chen [35] proposed a two-stage DRO model considering scarce data of disasters.
A moment-based fuzzy set describes uncertain distributions of blood demand to
optimize blood inventory prepositioning and relief activities together. To regulate the risk associated with hazardous material transportation and to minimize total travel cost in the area of interest under stochasticity, Chiou [36] presented a multi-objective data-driven stochastic optimization model to determine generalized travel cost for hazmat carriers. Gao et al. [37] proposed a two-stage DRO model for better decision-making in the optimal design of shale gas supply chains under uncertainty. They
applied a data-driven approach to construct the ambiguity set based on principal
component analysis and first-order deviation functions. In another study, Ning & You [28] proposed a novel data-driven Wasserstein DRO model for biomass with
agricultural waste-to-energy network design under uncertainty. They proposed a
data-driven approach to construct the Wasserstein ambiguity set for the feedstock
price uncertainty, which is utilized to quantify their distances from the data-based
empirical distribution.
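As an illustration of how uncertainty data can feed an ambiguity set, the sketch below estimates simple empirical moments from historical samples; the parameters gamma1 and gamma2 are illustrative tuning knobs and the construction is a generic moment-based recipe, not the specific method of any of the studies cited above.

```python
import numpy as np

def moment_ambiguity_set(samples, gamma1=1.0, gamma2=1.5):
    """Build a simple moment-based ambiguity set description from uncertainty data.

    The set is meant to contain all distributions whose mean stays "close" to the
    empirical mean (scaled by gamma1) and whose second moment is bounded by gamma2
    times the empirical covariance; gamma1/gamma2 are illustrative assumptions.
    """
    mu_hat = samples.mean(axis=0)
    sigma_hat = np.cov(samples, rowvar=False)
    return {"mean": mu_hat, "cov": sigma_hat, "gamma1": gamma1, "gamma2": gamma2}

# ambiguity = moment_ambiguity_set(historical_uncertainty_data)  # hypothetical input array
```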
the intrinsic structure and complexity of uncertainty data. Furthermore, these uncer-
tainty sets are specified by a finite number of parameters, thereby limiting modeling
flexibility. Motivated by this knowledge gap, data-driven robust optimization emerges
as a powerful paradigm for addressing uncertainty in decision making.
Choosing a good uncertainty set enables robust optimization models to provide
better solutions than other approaches [5, 6]. A poor choice of the uncertainty set makes the robust optimization model overly conservative or computationally intractable. In the era of big data, many data are routinely generated and collected containing
abundant information about the distribution of uncertainties; thereby, ML tools can
construct the uncertainty sets based upon these data. Data-driven robust optimiza-
tion is a new paradigm for hedging against uncertainty in the era of big data. The ML
tools can be applied to estimate data densities with sufficient accuracy and construct
an appropriate uncertainty set based upon intelligent analysis and the use of uncer-
tainty data for modeling robust optimization problems. A desirable uncertainty set
shall have enough flexibility to adapt to the intrinsic structure behind data, thereby
characterizing the underlying distribution and facilitating the solutions.
Data-driven robust optimization could be considered a “hybrid” system that
integrates the data-driven system based on ML to construct the uncertainty set from
historical uncertainty data. The model-based system is based on the robust program-
ming model to derive the optimal decisions from the information. More specifically,
data serves as input to a data-driven system. Figure 1 presents the data-driven
optimization paradigm framework. After that, the data-driven method constructs
the uncertainty set to extract information from historical data fully. Constructing
the uncertainty sets based upon historical data can be considered as an unsupervised
learning problem from an ML perspective. So, data-driven robust optimization is a
hybrid system that utilizes ML techniques to design data-driven uncertainty sets and
develops a robust optimization problem from the data-driven set. Different effective
unsupervised learning models such as the Dirichlet process mixture model, maxi-
mum likelihood estimation, principal component analysis, regular and conservative
support vector clustering, Bayesian ML, and kernel density estimation were employed
for uncertainty constructing, which could provide powerful representations of data
distributions [38, 40, 41]. The resulting uncertainty set can offer robust solutions with a controlled level of conservatism. Furthermore, this uncertainty set is finally given to the
model-based system based on robust optimization to obtain robust solutions under
uncertainty.
The ML methods of the support vector clustering-based uncertainty set (SVCU) and the conservative support vector clustering-based uncertainty set (CSVCU) have been applied to find, as the uncertainty set, an enclosing hypersphere of minimum volume that covers all data samples as tightly as possible. Conservative support
vector clustering is the most suitable choice for obtaining robust solutions in cases
with sufficient data to construct an uncertainty set enclosing future data with a high
confidence level [42]. Furthermore, it is the most effective choice for obtaining less conservative solutions.
Figure 1.
The schematic of the data-driven optimization paradigm framework.
On the other hand, CSVCU is suitable for highly conservative
decision-makers since it is the only set that can offer robust solutions with a high
conservatism level, particularly when there is limited data [42]. A data-driven robust
optimization under correlated uncertainty was proposed to hedge against the fluctua-
tions generated from continuous production processes in an ethylene plant [43]. For capturing and enriching the valid information of uncertainties, a copula-based method is
introduced to estimate the joint probability distribution and simulate mutual scenar-
ios for uncertainties. A deterministic and data-driven robust optimization framework
was proposed for energy systems optimization under uncertainty. The uncertainty set
is constructed by support vector clustering based on real industrial data [39]. A data-
driven robust optimization was applied to design and optimize the entire wastewater-sludge-to-biodiesel supply chain [42]. They developed a conservative support vector clustering (CSVC) method to construct an uncertainty set from limited data. The
developed uncertainty set encloses the fuzzy support neighborhood of data samples,
making it practical even when the available data is limited.
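A minimal sketch of learning a data-enclosing boundary is given below; it uses scikit-learn's One-Class SVM as a readily available stand-in for the support vector clustering constructions discussed above, so the details differ from the exact methods of [39, 42].

```python
from sklearn.svm import OneClassSVM

def fit_uncertainty_boundary(U, nu=0.05):
    """Learn an enclosing boundary around the rows of U (historical uncertainty
    samples, a hypothetical m x d array); nu controls how tightly the data are wrapped."""
    return OneClassSVM(kernel="rbf", nu=nu, gamma="scale").fit(U)

# model = fit_uncertainty_boundary(U)
# inside = model.decision_function(candidate_points) >= 0  # points covered by the learned set
```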
The recent development in the data science field, AI, and ML techniques have
enabled intelligent and automated DSS and real-time analytics coupled with com-
puting power improvements. Thus, AI techniques are applied to big data sources to
extract the knowledge-based rules or identify the underlying rules and patterns by
ML techniques, to drive the systems toward set objectives. DL is an ML technique that
can extract high levels of information and knowledge from massive data volumes.
DL algorithms consist of multiple processing layers to learn representations of data
with multiple abstraction levels [26]. For example, recently, DL techniques have
been used to accurately forecast customer demand, price, and inventory, leading to the optimization of supply chain performance. An intelligent forecasting system helps to optimize performance, reduce costs, and increase sales and profit. DL techniques
can apply deep neural network architectures to solve various complex problems. The
DL paradigm requires high computing power and a large amount of data for training.
The recent advances in parallel architectures and GPUs (Graphics Processing Units)
enabled the necessary computing power required in deep neural networks (DNN).
The emergence of advanced IoT and blockchain technologies has also solved the need
for a large amount of data to learn. IoT and blockchain result in massive amounts of
streaming real-time data often referred to as “big data,” which brings new opportuni-
ties to control and manage supply chains [49]. Optimizing the parameters in DNN is
a challenging undertaking. Several optimization algorithms, such as Adam, Adagrad, and RMSprop, have been proposed to optimize the network parameters in DNNs and improve generalizability. These techniques, which stabilize the optimization, paved the way for learning deeper networks [50]. In real applications, uncertainty data exhibit
very complex and highly nonlinear characteristics. DNN can be used to uncover
useful patterns of uncertainty data for optimizing under uncertainty [28]. Deep data-
driven optimization could be considered a “hybrid” system that integrates the deep
data-driven system based on DL to forecast the uncertainty parameters. The model-
based system is based on mathematical programming to drive the optimal decisions
from predicted parameters (the deep data-driven system). In the DL-based system,
DNN has been applied to analyze features, complex interactions, and relationships
among features of a problem from samples of the dataset and learn model, which can
be used for demand, inventory, and price forecasting. Kilimci et al. [51] developed
an intelligent demand forecasting system based on the analysis and interpretation
of the historical data using different forecasting methods, including support vector
regression algorithm, time series analysis techniques, and DL models. In a study, the backpropagation (BP) network method, the recurrent neural network (RNN) method, and the Auto-Regressive Integrated Moving Average (ARIMA) model were tested to forecast the price of agricultural products [52]. Yu et al. [53] developed an online
big-data-driven forecasting model of Google trends to improve oil consumption pre-
diction. Their proposed forecasting model considers traditional econometric models
(LogR and LR) and typical AI techniques (BPNN, SVM, DT, and ELM).
Accurate automatic optimization heuristics are necessary for dealing with the
complexity and diversity of modern hardware and software. ML is a proven tech-
nique for learning such heuristics, but its success is bound by the quality of the
features used. Developers must handcraft these features through a combination of
expert domain knowledge and trial and error. This makes the quality of the final
model directly dependent on the skill and available time of the system architect.
DL techniques are a better way to build heuristics. A deep neural network can learn
heuristics over raw code entirely without using code features. The neural network
simultaneously constructs appropriate representations of the code and learns how
best to optimize, removing the need for manual feature creation. DNN can improve
the accuracy of models without the help of human experts. Generally, this approach
is a fundamental way to integrate forecast approaches into mathematical optimiza-
tion models. First, a probabilistic forecast approach for future uncertainties is given
by exploiting the advanced DL structures. Second, a model-based system based
on mathematical programming is applied to derive the optimal decisions from the
forecasting data. Comparison and evaluation of the forecasting models are significant
since DL models can have different performances depending on the properties of
the data [54, 55]. The performances of DL models differ according to the forecasting
time, training duration, target data, and simple or ensemble structure [56, 57].
In a study, Nam et al. [54, 55] applied DL-based models to forecast fluctuating
electricity demand and generation in renewable energy systems. This study compares
and evaluates DL models and conventional statistical models. The DL models include the DNN, long short-term memory, and gated recurrent unit; the conventional statistical models include multiple linear regression and seasonal autoregressive integrated moving average. In another study, the operation of a cryogenic
NGL recovery unit for the extraction of NGL has been optimized by implementing
data-driven techniques [58]. The proposed approach is based on an optimization
framework that integrates dynamic process simulations with two DL-based surrogate
models using a long short-term memory (LSTM) layout with a bidirectional recurrent
neural network (RNN) structure. Kilimci et al. [51] developed an intelligent demand
forecasting system. This improved model is based on analyzing and interpreting the
historical data using different forecasting methods, including time series analysis
techniques, support vector regression algorithm, and DL models.
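The following sketch shows the forecasting half of such a hybrid pipeline: a small LSTM network (built with TensorFlow/Keras) that maps a window of past observations to a one-step-ahead forecast, which could then be passed to a model-based optimizer. The layer sizes, window length, and variable names are illustrative assumptions, not settings taken from the studies above.

```python
import tensorflow as tf

def build_lstm_forecaster(lags: int) -> tf.keras.Model:
    """Minimal LSTM forecaster: 'lags' past values in, one step-ahead value out."""
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(lags, 1)),
        tf.keras.layers.LSTM(32),
        tf.keras.layers.Dense(1),
    ])
    model.compile(optimizer="adam", loss="mse")
    return model

# X: (num_windows, lags, 1) array of past observations, y: next values (hypothetical data)
# model = build_lstm_forecaster(lags=24); model.fit(X, y, epochs=50, verbose=0)
# forecast = model.predict(X[-1:])   # fed to the model-based optimization step
```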
Accessing a sufficient amount of data for some optimization models is a practi-
cal challenge. For example, the quality of scenario-based optimization frameworks
strongly depends on access to a sufficient amount of uncertain data. However, in
practice, the amount of uncertainty data sampled from the underlying distribution
is limited. On the other hand, acquiring a sufficient amount of uncertainty data is
extremely time-consuming and expensive in some cases, which leads to the limited
application of some approaches [59]. To deal with the practical challenge of having an insufficient amount of data, deep generative models emerge as a new paradigm to
generate synthetic uncertainty data with the aim of better decisions with insufficient
uncertainty data. DL techniques could be applied to learn the useful intrinsic patterns
from the available uncertainty data and generate synthetic uncertainty data. More
specifically, in deep generative models, the correct data distribution is mimicked
either implicitly or explicitly by the DL techniques. Then the learned distribution
is used to generate new data points referred to as synthetic data [28]. After that,
these synthetic data serve as input to an optimizing model to derive the optimal
decisions. Some of the most commonly used deep generative models are variational autoencoders and generative adversarial networks [26]. These synthetic uncertainty
data generated by the DL techniques can be potentially useful in the scenario-based
optimization model.
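A minimal skeleton of such a deep generative model is sketched below as a generator/discriminator pair in Keras; the adversarial training loop is omitted, and the layer sizes are illustrative assumptions rather than a recommended architecture.

```python
import tensorflow as tf

def make_generator(noise_dim: int, out_dim: int) -> tf.keras.Model:
    # Maps random noise to synthetic uncertainty samples.
    return tf.keras.Sequential([
        tf.keras.layers.Input(shape=(noise_dim,)),
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dense(out_dim),
    ])

def make_discriminator(in_dim: int) -> tf.keras.Model:
    # Scores how plausible a sample is relative to the real uncertainty data.
    return tf.keras.Sequential([
        tf.keras.layers.Input(shape=(in_dim,)),
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ])
```

In an adversarial training loop, the two networks would be updated alternately; the trained generator then supplies synthetic scenarios for the scenario-based optimization model.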
currently experiencing a radical shift due to the advent of DL. However, our research
into the existing literature reveals a scarcity of research utilizing DL in approximate
modeling. The introduction of DL models into an optimization formulation provides a
means to reduce the problem complexity and maintain model accuracy [60]. Recently
it has been shown that DL models in the form of neural networks with rectified linear
units can be exactly recast as a mixed-integer linear programming formulation. DL is
a method to approximate complex systems and tasks by exploiting large amounts of
data to develop rigorous mathematical models [60].
Using DNN to model real-world problems is a powerful tool, as they provide an
efficient abstraction that can be used to analyze the structure of the task at hand.
The rigorous mathematical model is developed based on neural networks modeling
complex systems and optimizing their operations in the deep data-driven model
framework. This approximate model is developed by exploiting large amounts of
data using DL techniques. Then the solving method is applied to obtain the optimal
solutions of the developed optimization model. Developing an optimal solution to the
approximate model remains challenging [60].
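The sketch below illustrates the surrogate-then-optimize pattern under simple assumptions: a small neural network (scikit-learn's MLPRegressor) is fitted to simulated input-output data and then minimized with a generic SciPy solver. The names X_sim, y_sim, and bounds are hypothetical placeholders, and the solver choice is illustrative.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor
from scipy.optimize import minimize

def optimize_over_surrogate(X_sim, y_sim, bounds):
    """Fit a neural-network surrogate to simulator data and search its minimum."""
    surrogate = MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=2000).fit(X_sim, y_sim)

    def objective(x):
        # Surrogate prediction of the (to-be-minimized) cost at design x.
        return float(surrogate.predict(x.reshape(1, -1))[0])

    x0 = np.array([(lo + hi) / 2.0 for lo, hi in bounds])
    return minimize(objective, x0, bounds=bounds, method="L-BFGS-B")
```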
Pfrommer et al. [61] utilized a stochastic genetic algorithm to optimize a com-
posite textile draping process where a neural network was utilized as a surrogate
model. Marino et al. [62] presented an approach for modeling and planning under
uncertainty using deep Bayesian neural networks (DBNNs). They use DBNNs to learn
a stochastic model of the system dynamics. Planning is addressed as an open-loop
trajectory optimization problem. In another study, DL-based surrogate modeling and
optimization were proposed for microalgal biofuel production and photobioreac-
tor design [63]. This surrogate model is built upon a few simulated results from the
physical model to learn the sophisticated hydrodynamic and biochemical kinetic
mechanisms; then adopts a hybrid stochastic optimization algorithm to explore
untested processes and find optimal solutions. Tang & Zhang [64] developed a deep
data-driven framework for modeling combustion systems and optimizing their opera-
tions. First, they developed a deep belief network to model the combustion systems.
Next, they developed a multi-objective optimization model by integrating the deep
belief network-based models, the considered operational constraints, and the control
variable constraints.
RL has transformed AI, especially after the success of Google DeepMind. This
branch of ML epitomizes a step toward building autonomous systems by understand-
ing the visual world. Deep RL is currently applied to different sorts of problems that
were previously intractable. In this subsection, the authors will analyze deep RL and its
applications in optimization.
RL is one of the ML areas recently applied to tackle complex sequential decision
problems. RL is concerned with how a software agent should choose an action to
maximize a cumulative reward. RL is considered an optimal solution in address-
ing challenges where many factors must be taken into account, like supply chain
management. For example, Q-learning is a type of RL algorithm that is applied to
tackle simple optimization problems. In this approach, the Q-value has been applied
to any state of the system. Although the classical RL algorithms guarantee an optimal policy, these algorithms cannot efficiently handle large state or action spaces. Many problems in the real world have large state and action spaces. Applying classical RL algorithms to such large problems would be nearly impossible, as these models would be costly to train. Therefore, deep RL emerges as a new method in which a DNN is used to approximate RL components such as the value function or the policy. Recently, deep Q-network (DQN) algorithms
have been used in different areas. For example, deep Q-network (DQN) algorithms
have been applied to solve supply chain optimization problems. These DQNs operate
as the decision-maker of each agent. That results in a competitive game in which each
DQN agent plays independently to minimize its own cost.
Figure 2.
The schematic of the “closed-loop” online learning-based data-driven optimization framework.
Instead, recently a unified framework has been proposed in which the agents still play independently from
one another. Still, in the training phase, this model uses a feedback scheme so that
the DQN agent learns the total cost for the whole network and, over time, learns to
minimize it.
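For reference, the tabular Q-learning update that the above discussion builds on can be written in a few lines; the learning rate and discount factor below are illustrative defaults, and the state/action encoding is left abstract.

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.95):
    """One tabular Q-learning step:
    Q(s, a) <- Q(s, a) + alpha * (r + gamma * max_a' Q(s', a') - Q(s, a))."""
    Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])
    return Q
```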
Like other types of RL techniques, multi-agent RL is a system of
agents (e.g., robots, machines, and cars) interacting within a common environ-
ment. Each agent makes a decision at each time-step and works along with the other agent(s) to
achieve a given goal. The agents are learnable units that want to learn policy on the
fly to maximize the long-term reward through the interaction with the environment.
Recently the multi-agent RL techniques have been applied to develop the supply chain
management (SCM) systems that perform optimally for each entity in the chain. A
supply chain can be defined as a network of autonomous business entities collectively
responsible for procurement, manufacturing, storing, and distribution [65]. Entities
in a supply chain have different sets of environmental constraints and objectives.
One of the biggest challenges in the development of MAS-based supply chains is designing agent policies. To address this, automatic policy design by RL has recently drawn attention. RL is considered an optimal solution in
addressing challenges where a huge number of factors must be taken into account,
like SCM. RL technique does not require datasets covering all environments, con-
straints, operations, and entity operation results. A multi-agent RL (MARL)-based
SCM system can enable agents to automatically learn policies that optimize supply chain performance with RL, under certain constraints, environments, and objectives. More specifically, the RL technique enables an agent to learn a policy by collecting the necessary data itself during trial-and-error on the content of operations [66]. All agents also simultaneously cooperate to optimize the performances of the entire supply chain. The RL technique can be applied to a certain problem when all processes concerning the problem satisfy the Markov property: the environmental change for a certain agent depends only on the previous state of the environment and the agent's action. In a multi-agent setting, however, it is generally impossible to assume the Markov property, because an agent's environmental change depends on the previous state for the agent and also on the other agents' actions.
There are two problems in developing a MARL technique for SCM: Building
Markov decision processes for a supply chain and then avoiding learning stagnation
among agents in learning processes. For solving these problems, a learning manage-
ment method with deep neural network (DNN)-weight evolution (LM-DWE) has
been applied [67]. Fuji et al. [67] developed a multi-agent RL technique to develop
a supply chain management (SCM) system that enables agents to learn policies that
optimize SC performance. They applied a learning management method with deep-
neural-network (DNN)-weight evolution (LM-DWE) in the MARL for SCM. An RL
framework-FeedRec has been used in a study to optimize long-term user engagement
[68]. They used hierarchical LSTM to design the Q-Network to model the complex
user behaviors; they also used Q Network to simulate the environment. Zhang et al.
[69] proposed a multi-agent learning (MAL) algorithm and applied it for optimizing
online resource allocation in cluster networks.
fascinating issue and often requires significant specialized knowledge and trial-and-
error. NP-hard problems are solved with exact methods, heuristic algorithms, or a
combination of them. Although exact methods provide optimal answers, they have
the limitation of performing inefficiently in time complexity. Heuristics are used to
improve computational time efficiency and provide decent or near-optimal solutions
[70]. According to the definition of Burke et al. [71], a hyper-heuristic is a searching
mechanism that aims to select or generate appropriate heuristics to solve an optimiza-
tion problem. However, the effectiveness of general heuristic algorithms is dependent
on the problem being considered, and high levels of performance often require
extensive tailoring and domain-specific knowledge. ML strategies have become a
promising route to addressing these challenges, which led to the development of
meta-algorithms to various combinatorial problems.
Solution approaches such as meta-heuristics and hyper-heuristics have been developed to tackle NP-hard combinatorial optimization problems [72]. Recently, hyper-
heuristics arise in this context as efficient methodologies for selecting or generating
(meta) heuristics to solve NP-hard optimization problems. Hyper-heuristics are
categorized into heuristic selection (methodologies to select existing heuristics) and heuristic generation (methodologies to generate new heuristics) [71]. Deep RL is a possible learning method that can
automatically solve various optimization problems [73]. Encouragingly, the deep RL method has been found to have favorable characteristics in comparison with classical methods, e.g., strong generalization ability and fast solving speed. RL methods can be used at
different levels to solve combinatorial optimization problems. They can be applied
directly to the problem, as part of a meta-heuristic, or as part of hyper-heuristics [74].
Utilizing advanced computation power with meta-heuristics algorithms and massive-
data processing techniques has successfully solved various NP-hard problems.
However, meta-heuristic approaches find good solutions but do not guarantee the determination of the global optimum. Meta-heuristics still face the trade-off between exploitation and exploration, which consists of choosing between a greedy search and
a wider exploration of the solution space.
A way to guide meta-heuristic algorithms during the search for better solutions is to generate the initial population of a genetic algorithm by using the Q-learning algorithm.
The hyper-heuristic for heuristic selection can use RL algorithms, enabling the
system to autonomously select the meta-heuristic to use in the optimization process
and the respective parameters. For example, Falcão et al. [74] proposed a hyper-
heuristic module for solving scheduling problems in manufacturing systems. The
proposed hyper-heuristic module uses an RL algorithm, which enables the system
to autonomously select the meta-heuristic to use in the optimization process and
the respective parameters. Cano-Belmán et al. [75] proposed a heuristic generation
scatter search algorithm to address a mixed-model assembly line sequencing prob-
lem. Khalil et al. (Dai et al., 2017) developed a neural combinatorial optimization
framework that utilizes neural networks and RL to tackle combinatorial optimization
problems. The developed meta-algorithm automatically learns good heuristics for a
diverse range of optimization problems over graphs. Mosadegh et al. [72] proposed
a novel hyper-simulated annealing (HSA) algorithm to tackle the NP-hard problem. They devel-
oped new mathematical models to describe a mixed-model sequencing problem with
stochastic processing times (MMSPSP). The HSA applies a Q-learning algorithm to
select appropriate heuristics through its search process [72]. The main idea is to con-
duct simulated annealing (SA)-based algorithms to find a suitable heuristic among the available ones for creating neighbor solution(s).
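A minimal sketch of RL-driven heuristic selection is shown below; it reduces the idea to a bandit-style Q-value per low-level heuristic with epsilon-greedy selection, which is a simplification of the hyper-heuristic modules cited above rather than their exact mechanism.

```python
import random

def select_heuristic(q_values, epsilon=0.2):
    """Epsilon-greedy choice of a low-level heuristic index (hyper-heuristic selection)."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda h: q_values[h])

def update_heuristic_value(q_values, h, reward, alpha=0.1):
    """Reward can be, e.g., the objective improvement achieved by heuristic h."""
    q_values[h] += alpha * (reward - q_values[h])
```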
Data-driven optimization refers to the art and science of integrating the data-
driven system based on ML to convert (big) data into relevant and useful information
and insights, and the model-based system based on mathematical programming to
derive the optimal and more accurate decisions from the information. As a direct
implication, the generic approach proposed in data-driven optimization can be uti-
lized to create an automated, data-driven, and intelligent DSS, which would increase
the quality of decisions both in terms of efficiency and effectiveness. Recent advances
in DL as a predictive model have received great attention lately. One of the distin-
guishing features of DNN is its ability to “learn” better predictions from large-scale
data than ML methods. Hence, one of the primary messages of this overview chapter
is to review the applicability of DL in improving DSS across core areas of supply chain
operations.
Much data is generated at ever-faster rates by companies and organizations [76].
Applying the advanced DL techniques for predictive analytics becomes a promising
issue for further research to improve the decision-making process. Although the
conventional data-driven optimization paradigm has made significant progress for
hedging against uncertainty, it is foreseeable that data-driven mathematical pro-
gramming frameworks would proliferate in the next few years due to the generation
Author details
Saeid Sadeghi, Maghsoud Amiri and Farzaneh Mansoori Mooseloo
© 2021 The Author(s). Licensee IntechOpen. This chapter is distributed under the terms of
the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0),
which permits unrestricted use, distribution, and reproduction in any medium, provided
the original work is properly cited.
References
[1] Biegler, L. T., & Grossmann, I. E. (2004). Retrospective on optimization. Computers & Chemical Engineering, 28(8), 1169-1192.
[2] Sakizlis, V., Perkins, J. D., & Pistikopoulos, E. N. (2004). Recent advances in optimization-based simultaneous process and control design. Computers & Chemical Engineering, 28(10), 2069-2086.
[3] Sahinidis, N. V. (2004). Optimization under uncertainty: state-of-the-art and opportunities. Computers & Chemical Engineering, 28(6-7), 971-983.
[4] Darvazeh, S. S., Vanani, I. R., & Musolu, F. M. (2020). Big data analytics and its applications in supply chain management. In New Trends in the Use of Artificial Intelligence for the Industry 4.0 (p. 175). IntechOpen.
[5] Bertsimas, D., Gupta, V., & Kallus, N. (2018a). Data-driven robust optimization. Mathematical Programming, 167(2), 235-292.
[6] Bertsimas, D., Gupta, V., & Kallus, N. (2018b). Data-driven robust optimization. Mathematical Programming, 167(2), 235-292.
[7] Grossmann, I. E., Apap, R. M., Calfa, B. A., García-Herreros, P., & Zhang, Q. (2016). Recent advances in mathematical programming techniques for the optimization of process systems under uncertainty. Computers & Chemical Engineering, 91, 3-14.
[8] Birge, J. R., & Louveaux, F. (2011). Introduction to stochastic programming. Springer Science & Business Media.
[9] Nikzad, E., Bashiri, M., & Oliveira, F. (2019). Two-stage stochastic programming approach for the medical drug inventory routing problem under uncertainty. Computers & Industrial Engineering, 128, 358-370.
[10] Quddus, M. A., Chowdhury, S., Marufuzzaman, M., Yu, F., & Bian, L. (2018). A two-stage chance-constrained stochastic programming model for a bio-fuel supply chain network. International Journal of Production Economics, 195, 27-44.
[11] Mavromatidis, G., Orehounig, K., & Carmeliet, J. (2018). Design of distributed energy systems under uncertainty: A two-stage stochastic programming approach. Applied Energy, 222, 932-950.
[12] Lima, C., Relvas, S., & Barbosa-Póvoa, A. (2018). Stochastic programming approach for the optimal tactical planning of the downstream oil supply chain. Computers & Chemical Engineering, 108, 314-336.
[13] Alipour, M., Zare, K., & Seyedi, H. (2018). A multi-follower bilevel stochastic programming approach for energy management of combined heat and power micro-grids. Energy, 149, 135-146.
[14] Ben-Tal, A., El Ghaoui, L., & Nemirovski, A. (2009). Robust optimization. Princeton University Press.
[15] Kim, J., Do Chung, B., Kang, Y., & Jeong, B. (2018). Robust optimization model for closed-loop supply chain planning under reverse logistics flow and demand uncertainty. Journal of Cleaner Production, 196, 1314-1328.
[16] Aalaei, A., & Davoudpour, H. (2017). A robust optimization model for cellular manufacturing system into supply chain management. International Journal of Production Economics, 183, 667-679.
[17] Lim, Y. F., & Wang, C. (2017). Inventory management based on target-oriented robust optimization. Management Science, 63(12), 4409-4427.
[18] Vitus, M. P., Zhou, Z., & Tomlin, C. J. (2015). Stochastic control with uncertain parameters via chance constrained control. IEEE Transactions on Automatic Control, 61(10), 2892-2905.
[19] Farina, M., Giulioni, L., & Scattolini, R. (2016). Stochastic linear model predictive control with chance constraints – a review. Journal of Process Control, 44, 53-67.
[20] Guo, Y., Baker, K., Dall'Anese, E., Hu, Z., & Summers, T. H. (2018). Data-based distributionally robust stochastic optimal power flow – Part I: Methodologies. IEEE Transactions on Power Systems, 34(2), 1483-1492.
[21] Carvalho, A., Lefévre, S., Schildbach, G., Kong, J., & Borrelli, F. (2015). Automated driving: The role of forecasts and uncertainty – A control perspective. European Journal of Control, 24, 14-32.
[22] Russell, S., & Norvig, P. (2002). Artificial intelligence: a modern approach.
[23] Ngiam, K. Y., & Khor, W. (2019). Big data and machine learning algorithms for healthcare delivery. The Lancet Oncology, 20(5), e262-e273.
[24] Helm, J. M., Swiergosz, A. M., Haeberle, H. S., Karnuta, J. M., Schaffer, J. L., Krebs, V. E., ... & Ramkumar, P. N. (2020). Machine learning and artificial intelligence: Definitions, applications, and future directions. Current Reviews in Musculoskeletal Medicine, 13(1), 69-76.
[25] Jakhar, D., & Kaur, I. (2020). Artificial intelligence, machine learning and deep learning: definitions and differences. Clinical and Experimental Dermatology, 45(1), 131-132.
[26] Goodfellow, I., Bengio, Y., Courville, A., & Bengio, Y. (2016). Deep learning (Vol. 1, No. 2). Cambridge: MIT Press.
[27] Wang, H., Wu, Y., Min, G., Xu, J., & Tang, P. (2019). Data-driven dynamic resource scheduling for network slicing: A deep reinforcement learning approach. Information Sciences, 498, 106-116.
[28] Ning, C., & You, F. (2019). Optimization under uncertainty in the era of big data and deep learning: When machine learning meets mathematical programming. Computers & Chemical Engineering, 125, 434-448.
[29] Wong, A. K. C., & Wang, Y. (2003). Pattern discovery: a data driven approach to decision support. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), 33(1), 114-124.
[30] Yang, H., Jin, Z., Wang, J., Zhao, Y., Wang, H., & Xiao, W. (2019). Data-driven stochastic scheduling for energy integrated systems. Energies, 12(12), 2317.
[31] Esfahani, P. M., & Kuhn, D. (2018). Data-driven distributionally robust optimization using the Wasserstein metric: Performance guarantees and tractable reformulations. Mathematical Programming, 171(1), 115-166.
[32] Smith, J. E., & Winkler, R. L. (2006). The optimizer's curse: Skepticism and postdecision surprise in decision analysis. Management Science, 52(3), 311-322.
[48] … programs over Wasserstein balls. arXiv preprint arXiv:1809.00210.
[49] Khan, P. W., Byun, Y. C., & Park, N. (2020). IoT-blockchain enabled optimized provenance system for food Industry 4.0 using advanced deep learning. Sensors, 20(10), 2990.
[50] Kraus, M., Feuerriegel, S., & Oztekin, A. (2020). Deep learning in business analytics and operations research: Models, applications and managerial implications. European Journal of Operational Research, 281(3), 628-641.
[51] Kilimci, Z. H., Akyuz, A. O., Uysal, M., Akyokus, S., Uysal, M. O., Atak Bulbul, B., & Ekmis, M. A. (2019). An improved demand forecasting model using deep learning approach and proposed decision integration strategy for supply chain. Complexity, 2019.
[52] Weng, Y., Wang, X., Hua, J., Wang, H., Kang, M., & Wang, F. Y. (2019). Forecasting horticultural products price using ARIMA model and neural network based on a large-scale data set collected by web crawler. IEEE Transactions on Computational Social Systems, 6(3), 547-553.
[53] Yu, L., Zhao, Y., Tang, L., & Yang, Z. (2019). Online big data-driven oil consumption forecasting with Google trends. International Journal of Forecasting, 35(1), 213-223.
[54] Nam, K., Hwangbo, S., & Yoo, C. (2020a). A deep learning-based forecasting model for renewable energy scenarios to guide sustainable energy policy: A case study of Korea. Renewable and Sustainable Energy Reviews, 122, 109725.
[55] Nam, K., Hwangbo, S., & Yoo, C. (2020b). A deep learning-based forecasting model for renewable energy scenarios to guide sustainable energy policy: A case study of Korea. Renewable and Sustainable Energy Reviews, 122, 109725.
[56] Li, Q., Loy-Benitez, J., Nam, K., Hwangbo, S., Rashidi, J., & Yoo, C. (2019). Sustainable and reliable design of reverse osmosis desalination with hybrid renewable energy systems through supply chain forecasting using recurrent neural networks. Energy, 178, 277-292.
[57] Loy-Benitez, J., Vilela, P., Li, Q., & Yoo, C. (2019). Sequential prediction of quantitative health risk assessment for the fine particulate matter in an underground facility using deep recurrent neural networks. Ecotoxicology and Environmental Safety, 169, 316-324.
[58] Zhu, W., Chebeir, J., & Romagnoli, J. A. (2020). Operation optimization of a cryogenic NGL recovery unit using deep learning based surrogate modeling. Computers & Chemical Engineering, 137, 106815.
[59] Gupta, V., & Rusmevichientong, P. (2017). Small-data, large-scale linear optimization with uncertain objectives. Management Science, 67(1), 220-241.
[60] Katz, J., Pappas, I., Avraamidou, S., & Pistikopoulos, E. N. (2020). Integrating deep learning models and multiparametric programming. Computers & Chemical Engineering, 136, 106801.
[61] Pfrommer, J., Zimmerling, C., Liu, J., Kärger, L., Henning, F., & Beyerer, J. (2018). Optimisation of manufacturing process parameters using deep neural networks as surrogate models. Procedia CIRP, 72, 426-431.
[62] Marino, D. L., & Manic, M. (2019). Modeling and planning under uncertainty using deep neural networks. IEEE Transactions on Industrial Informatics, 15(8), 4442-4454.
[63] del Rio-Chanona, E. A., Wagner, J. L., Ali, H., Fiorelli, F., Zhang, D., & Hellgardt, K. (2019). Deep learning-based surrogate modeling and optimization for microalgal biofuel production and photobioreactor design. AIChE Journal, 65(3), 915-923.
[64] Tang, Z., & Zhang, Z. (2019). The multi-objective optimization of combustion system operations based on deep data-driven models. Energy, 182, 37-47.
[65] Swaminathan, J. M., Smith, S. F., & Sadeh, N. M. (1998). Modeling supply chain dynamics: A multiagent approach. Decision Sciences, 29(3), 607-632.
[66] Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., ... & Hassabis, D. (2015). Human-level control through deep reinforcement learning. Nature, 518(7540), 529-533.
[67] Fuji, T., Ito, K., Matsumoto, K., & Yano, K. (2018, January). Deep multi-agent reinforcement learning using DNN-weight evolution to optimize supply chain performance. In Proceedings of the 51st Hawaii International Conference on System Sciences.
[68] Zou, L., Xia, L., Ding, Z., Song, J., Liu, W., & Yin, D. (2019, July). Reinforcement learning to optimize long-term user engagement in recommender systems. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (pp. 2810-2818).
[70] Dumitrescu, I., & Stützle, T. (2003, April). Combinations of local search and exact algorithms. In Workshops on Applications of Evolutionary Computation (pp. 211-223). Springer, Berlin, Heidelberg.
[71] Burke, E. K., Gendreau, M., Hyde, M., Kendall, G., Ochoa, G., Özcan, E., & Qu, R. (2013).
[72] Mosadegh, H., Ghomi, S. F., & Süer, G. A. (2020). Stochastic mixed-model assembly line sequencing problem: Mathematical modeling and Q-learning based simulated annealing hyper-heuristics. European Journal of Operational Research, 282(2), 530-544.
[73] Li, K., Zhang, T., & Wang, R. (2020). Deep reinforcement learning for multi-objective optimization. IEEE Transactions on Cybernetics.
[74] Falcão, D., Madureira, A., & Pereira, I. (2015, June). Q-learning based hyper-heuristic for scheduling system self-parameterization. In 2015 10th Iberian Conference on Information Systems and Technologies (CISTI) (pp. 1-7). IEEE.
[75] Cano-Belmán, J., Ríos-Mercado, R. Z., & Bautista, J. (2010). A scatter search based hyper-heuristic for sequencing a mixed-model assembly line. Journal of Heuristics, 16(6), 749-770.
[76] Corbett, C. J. (2018). How sustainable is big data? Production and Operations Management, 27(9), 1685-1695.
Chapter 9
Practical Application Using the Clustering Algorithm

Abstract
This chapter will survey the clustering algorithm that is unsupervised learning
among data mining and machine learning techniques. The most popular clustering
algorithm is the K-means clustering algorithm; it can represent a cluster of data. In the K-means clustering algorithm, finding an appropriate K value for the distribution of the training dataset is an essential factor. It is common to find this value experimentally. Also, the elbow method, which is a heuristic approach for determining the number of clusters, can be used. One of the present applied clustering studies is a particulate matter concentration clustering algorithm for particulate matter distribution estimation. This algorithm divides the area into cluster centers of the fine dust distribution using K-means clustering. It then finds the coordinates of the optimal point according to the
distribution of the particulate matter values. The training dataset is the latitude,
longitude of the observatory, and PM10 value obtained from the AirKorea website
provided by the Korea Environment Corporation. This study performed the K-means
clustering algorithm to cluster feature datasets. Furthermore, it showed an experiment
on the K values to represent the cluster better. It performed clustering by changing K
values from 10 to 23. Then it generated 16 labels divided into 16 cities in Korea and
compared them to the clustering result. Visualizing them on the actual map confirmed
whether the clusters of each city were evenly bound. Moreover, it figures out the
cluster center to find the observatory location representing particulate matter
distribution.
1. Introduction
This chapter introduces data mining and the clustering algorithm, an unsupervised learning technique among machine learning methods. We analyze an applied clustering study that uses air pollution concentration data, a problem that has recently attracted attention. The most popular clustering algorithm is K-means, which represents data as clusters. Finding an appropriate K value for the distribution of the training dataset is an essential step; the K value is commonly determined experimentally, and at this point the elbow technique can be used to set it.
2. Related works
In this section, we analyze related studies that predict the concentration of fine dust [7–12]. These studies use air pollution data and meteorological data together; in particular, prediction accuracy is higher when weather data such as temperature and wind speed are used rather than air pollution data alone [7]. Traditionally, studies have predicted the concentration of fine dust with machine learning methods such as linear regression or support vector regression, but these methods struggle to capture the spatiotemporal correlation [8]. Therefore, more recent work focuses on improving prediction accuracy with deep learning [9–12]. Korea has four distinct seasons governed by different air masses, so the concentration of fine dust differs significantly by season; the relationship between location and time must therefore be considered.
Joun et al. predicted the concentration of fine dust using MLR, SVR, ARIMA, and ARIMAX [11]. Their training datasets consist of air pollution data (NO2, SO2, CO, O3, PM10) and meteorological data (temperature, precipitation, wind speed). Using multiple linear regression analysis, they confirmed that time, location, NO2, CO, O3, SO2, maximum temperature, precipitation, and maximum wind speed were significant variables. They then used multiple linear regression and support vector regression to predict the fine dust distribution. The prediction accuracy was higher for the artificial neural network than for the support vector regression, and the gap was exceptionally large when the PM10 concentration rose above 100. They also performed experiments with ARIMA and ARIMAX to analyze the time factors according to location. As a result, the learning accuracy differed according to the location of the experimental data, and the accuracy was higher when the air quality and meteorological factors were used than when only the time variable was used.
Pseudocode 1.
The process of making the dataset by the day.
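The pseudocode itself is not reproduced here; as a rough illustration of the idea, the following sketch builds one file per day from an hourly AirKorea export using pandas. The file name and column names (station, datetime, latitude, longitude, PM10) are assumptions for illustration, not the authors' actual schema.

```python
import pandas as pd

# Hypothetical hourly export from AirKorea: one row per station per hour.
raw = pd.read_csv("airkorea_april_2020.csv", parse_dates=["datetime"])

# Collapse the hourly records into one row per station per day, keeping the
# station coordinates and averaging the PM10 readings.
daily = (
    raw.assign(date=raw["datetime"].dt.date)
       .groupby(["station", "date"], as_index=False)
       .agg({"latitude": "first", "longitude": "first", "PM10": "mean"})
)

# Write one dataset per day, mirroring "the process of making the dataset by the day".
for date, day_df in daily.groupby("date"):
    day_df.to_csv(f"airkorea_{date}.csv", index=False)
```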
Figure 1.
The air pollution dataset in April 2020.
K-means clustering seeks the partition S = {S1, …, Sk} that minimizes the within-cluster sum of squared distances to the cluster means μi:

$$\arg\min_{S} \sum_{i=1}^{k} \sum_{x \in S_i} \lVert x - \mu_i \rVert^{2} \tag{1}$$
We calculate the inertia value to find an appropriate K value for K-means clustering. The inertia value is the sum of the squared distances from each point to the center of the cluster it belongs to after clustering. Figure 2 shows the inertia value according to K. The optimal K value lies where the inertia has decreased rapidly and further changes become insignificant. However, it is difficult to determine the optimal K value from this graph alone. Therefore, we set the K value to 16, matching the division of the whole country into 16 provinces.
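As a concrete sketch of this procedure, the code below computes the inertia for K = 10 to 23 with scikit-learn (the range used in this chapter); `X` is assumed to be the prepared feature matrix of station coordinates and pollutant values.

```python
import numpy as np
from sklearn.cluster import KMeans

def inertia_curve(X, k_values):
    """Fit K-means for each candidate K and collect the inertia, i.e. the sum
    of squared distances of the samples to their closest cluster center."""
    inertias = []
    for k in k_values:
        km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
        inertias.append(km.inertia_)
    return np.array(inertias)

# Example (X must be an (n_stations, n_features) array):
# curve = inertia_curve(X, range(10, 24))
# The elbow is the K after which the curve flattens; here K = 16 was chosen.
```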
Pseudocode 2.
The process of performing scaling and clustering.
Pseudocode 2 is the source code that loads the April data and performs scaling and clustering. Figure 3 presents the coordinates of the center point of each cluster obtained by clustering one month of air pollution data. We use the Folium Python library to draw this map [16]. Markers of the same color belong to the same cluster among the 16 clusters for that day. Table 1 compares the 16 administrative district labels with the clustering results. For example, label 0 is the Gangwon-do area, and clusters 11, 12, and 15 contain the twelve air pollution stations in this district.
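Since the source code is not reproduced in this extract, the following is a minimal sketch of the loading, scaling, clustering, and Folium visualization steps that Pseudocode 2 and Figure 3 describe; the file name, column names, map center, and marker styling are assumptions.

```python
import folium
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Load the April dataset (hypothetical file and column names).
data = pd.read_csv("airkorea_april_2020_daily.csv")
features = data[["latitude", "longitude", "NO2", "SO2", "CO", "O3", "PM10"]]

# Scale the features and cluster the stations into K = 16 groups.
X = StandardScaler().fit_transform(features)
km = KMeans(n_clusters=16, n_init=10, random_state=0).fit(X)
data["cluster"] = km.labels_

# Draw each station on a Folium map, colored by its cluster.
m = folium.Map(location=[36.5, 127.8], zoom_start=7)  # rough center of Korea
palette = ["red", "blue", "green", "purple", "orange", "darkred", "cadetblue",
           "beige", "darkblue", "darkgreen", "lightgray", "pink", "lightblue",
           "lightgreen", "gray", "black"]
for _, row in data.iterrows():
    folium.CircleMarker(
        location=[row["latitude"], row["longitude"]],
        radius=4,
        color=palette[int(row["cluster"])],
        fill=True,
    ).add_to(m)
m.save("clusters_april_2020.html")
```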
Figure 2.
The inertia value according to the K.
Figure 3.
The coordinates of the center point of each cluster for a month.
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
Gangwon-do 3 4 5
Gyeonggi-do 22 11 38 2 7
Gyeongsangnam-do 1 6 1 9
Gyeongsangbuk-do 5 5 1 1 5 3
Gwangju 1 8
Daegu 9 3 1 1
Deajeon 6 3 1
Busan 7 4 10
Seoul 8 10 21
Ulsan 5 9 2
Incheon 3 12
Jeollanam-do 8 8 3
Jeollabuk-do 3 2 14
Jeju-do 5
Chungcheongnam-do 1 15 1 6
Chungcheongbuk-do 6 6 1 1
Table 1.
Number of stations in each cluster by administrative district.
Figure 4.
The visualization of the 16 center points on a map to divide regions.
144
Practical Application Using the Clustering Algorithm
DOI: http://dx.doi.org/10.5772/intechopen.99314
More cluster centers are distributed in the Seoul area, Incheon, and Gyeonggi-do than in other regions. This is because most of the air pollution monitoring stations in Korea are concentrated in the metropolitan area.
Figure 5 shows the results of classifying the air pollution monitoring stations by calculating the distance from each of the 16 obtained center coordinates to every station. The points on the map are the locations of the air pollution monitoring stations. In this case, we calculated the Euclidean distance using latitude and longitude.
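A minimal sketch of this nearest-center assignment; `stations` and `centers` are assumed to be arrays of latitude/longitude pairs for the monitoring stations and the 16 cluster centers, respectively.

```python
import numpy as np

def assign_to_nearest_center(stations, centers):
    """Return, for each station, the index of its closest center using the
    plain Euclidean distance on (latitude, longitude) pairs."""
    stations = np.asarray(stations, dtype=float)   # (n_stations, 2)
    centers = np.asarray(centers, dtype=float)     # (n_centers, 2)
    diffs = stations[:, None, :] - centers[None, :, :]
    dists = np.linalg.norm(diffs, axis=2)          # (n_stations, n_centers)
    return dists.argmin(axis=1)

# Example: labels = assign_to_nearest_center(station_coords, center_coords)
```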
Figure 6 visualizes the convex hull polygons formed by connecting the outermost points of the classified measurement stations with lines [17]. This method has the advantage of classifying accurately even when the points lie close to one another, because classification is performed based on the locations of the stations. However, areas without an observatory remain shaded areas in which the distribution of air pollution cannot be measured.
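A sketch of how such per-cluster convex hull polygons can be drawn with SciPy and Folium, assuming a data frame with latitude, longitude, and cluster columns as in the earlier sketch.

```python
import folium
from scipy.spatial import ConvexHull

def draw_cluster_hulls(data, out_path="convex_hulls.html"):
    """Draw one convex hull polygon per cluster of stations on a Folium map."""
    m = folium.Map(location=[36.5, 127.8], zoom_start=7)
    for _, group in data.groupby("cluster"):
        pts = group[["latitude", "longitude"]].to_numpy()
        if len(pts) < 3:                 # a 2D hull needs at least three points
            continue
        hull = ConvexHull(pts)
        boundary = pts[hull.vertices].tolist()
        folium.Polygon(locations=boundary, color="blue",
                       fill=True, fill_opacity=0.2).add_to(m)
    m.save(out_path)
```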
This chapter found the cluster center points using the locations and concentration values of the air pollution monitoring stations in order to divide air pollution areas that reflect the data distribution. The stations were classified based on the center coordinates, and the air pollution areas were divided using the convex hull polygon. However, the classified air pollution areas did not include areas without air pollution monitoring stations.
Therefore, we use the Voronoi algorithm to include areas without measurement stations [18]; it also classifies areas based on the center point of each cluster. The Voronoi algorithm obtains the line segments that bisect the distance between neighboring points and builds polygons whose vertices are the intersections of these segments. Figure 7 shows the regions divided using the Voronoi algorithm; the dots represent the centers of the classified clusters. The distance measure used in the Voronoi algorithm is the Euclidean distance. Unlike the convex hull method in Figure 6, the Voronoi algorithm divides the regions of the Korean Peninsula without leaving shaded areas.
Figure 5.
The results of classifying air pollution monitoring stations by cluster.
Figure 6.
The result using the convex hull polygon algorithm.
Figure 7.
The result using the Voronoi algorithm.
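A minimal sketch of this step with SciPy, assuming `center_coords` is the (16, 2) array of cluster-center latitude/longitude pairs; rendering of the resulting cells is left to SciPy's voronoi_plot_2d here, although the same polygons could also be drawn with Folium.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.spatial import Voronoi, voronoi_plot_2d

def voronoi_regions(center_coords):
    """Build the Voronoi diagram of the cluster centers: every location is
    assigned to its nearest center, so no shaded (unassigned) area remains."""
    return Voronoi(np.asarray(center_coords, dtype=float))

# Example usage with hypothetical center coordinates:
# vor = voronoi_regions(center_coords)
# voronoi_plot_2d(vor)   # quick rendering of the ridges and regions
# plt.show()
```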
We compare the existing administrative districts of Korea [19], the regional classification obtained with the convex hull method, and the classification obtained with the Voronoi algorithm. The existing administrative districts are defined according to the criteria in the Administrative District Practice Manual. The convex hull method divides the area according to the classified air pollution measurement stations, while the Voronoi algorithm classifies regions using distances from the cluster center points. Air pollution concentrations are not reflected in the existing administrative districts, whereas both the convex hull method and the Voronoi algorithm produce regions that reflect them. However, in the convex hull method, areas without a measuring station remain shaded, unlike the existing administrative districts and the Voronoi algorithm. Overall, the Voronoi algorithm can classify regions that reflect the air pollution concentration without leaving shaded areas.
5. Conclusion
In this chapter, we collected data from air pollution stations in Korea and used K-means clustering to illustrate data mining and machine learning algorithms. We divided air pollution areas to predict the distribution of air pollution using air pollution concentration clustering. The training dataset consists of latitude, longitude, NO2, SO2, CO, O3, PM10, and PM2.5 values from one month of air pollution data (April 2020). Using the collected dataset, we classified the air pollution monitoring stations, and based on the central coordinates of the clusters, the Korean territory was divided with the Voronoi algorithm. Finally, we confirmed that the proposed air pollution areas, unlike traditional administrative districts, can be classified in a way that reflects the distribution of air pollution. Moreover, the proposed areas can help in understanding the distribution of air pollution in shaded areas that have no air pollution stations.
References
[1] WHO. Air pollution. May 2018. Available from: https://www.who.int/news-room/fact-sheets/detail/ambient-(outdoor)-air-quality-and-health/ [Accessed: 2021-06-01]
[2] Hänninen OO. WHO Guidelines for Indoor Air Quality: Dampness and Mold. In: Fundamentals of mold growth in indoor environments and strategies for healthy living. Wageningen: Wageningen Academic Publishers; 2011. p. 277-302.
[3] World Health Organization. WHO air quality guidelines global update, report on a working group meeting, Bonn, Germany, 18–20 October 2005.
[4] Air Korea. Available from: http://www.airkorea.or.kr/ [Accessed: 2021-06-01]
[5] Min S, Oh Y. A Study of Particulate Matter Clustering for PM10 Distribution Prediction. In: Proceedings of the International Symposium on Innovation in Information Technology and Applications (2019 ISIITA); 11-13 February 2019; Okinawa. p. 53-56.
[6] Min S, Oh Y. A study of particulate matter area division using PM10 data clustering: Focusing on the case of Korean particulate matter observatory. Journal of Advanced Research in Dynamical and Control Systems. 2019;11.12:959-965. DOI: 10.5373/JARDCS/V11SP12/20193300
[7] Munir S, Habeebullah TM, Seroji AR, Morsy EA, Mohammed AM, Saud WA, Awad AH. Modeling particulate matter concentrations in Makkah, applying a statistical modeling approach. Aerosol Air Quality Research. 2013;13.3:901-910.
[8] Li X, Peng L, Hu Y, Shao J, Chi T. Deep learning architecture for air quality predictions. Environmental Science and Pollution Research. 2016;23.22:22408-22417.
[9] Freeman BS, Taylor G, Gharabaghi B, Thé J. Forecasting air quality time series using deep learning. Journal of the Air and Waste Management Association. 2018;68.8:866-886.
[10] Qi Z, Wang T, Song G, Hu W, Li X, Zhang Z. Deep air learning: Interpolation, prediction, and feature analysis of fine-grained air quality. IEEE Transactions on Knowledge and Data Engineering. 2018;30.12:2285-2297.
[11] Joun S, Choi J, Bae J. Performance Comparison of Algorithms for the Prediction of Fine Dust Concentration. In: Proceedings of Korea Software Congress 2017; 8-10 February 2019; Pyeong Chang. p. 775-777.
[12] Cho K, Jung Y, Kang C, Oh C. Conformity assessment of machine learning algorithm for particulate matter prediction. Journal of the Korea Institute of Information and Communication Engineering. 2019;23.1:20-26.
[13] AirKorea. Available from: http://www.airkorea.or.kr/ [Accessed: 2021-06-01]
[14] Kakao Map API. Available from: https://apis.map.kakao.com/ [Accessed: 2021-06-01]
[15] Scikit-learn. Available from: https://scikit-learn.org/ [Accessed: 2021-06-01]
[16] Folium Python. Available from: https://python-visualization.github.io/folium/ [Accessed: 2021-06-01]
Chapter 10
Leaching Mechanisms of Trace Elements from Coal and Host Rock Using Method of Data Mining
Abstract
Coal and host rock, including the gangue dump, are important sources of toxic elements, which have high contaminating potential for surface water and groundwater. Surface water in coal mine areas and groundwater in active or abandoned coal mines have been observed to be polluted by trace elements such as arsenic, mercury, lead, selenium, and cadmium. Understanding the leaching behavior and mechanism helps to control the pollution caused by these trace elements. The leaching and migration of the trace elements are controlled mainly by two factors: the trace elements' occurrence and the surrounding environment. The traditional way to investigate elements' occurrence and leaching mechanism is the geochemical method. In this research, a data mining approach was applied to find the relationships and patterns concealed in the data matrix; from the geochemical point of view, these patterns represent the occurrence and leaching mechanism of trace elements from coal and host rock. An unsupervised machine learning method, principal component analysis, was applied to reduce the dimensions of the data matrices of the solid and liquid samples, and the re-calculated data were then clustered with a Gaussian mixture model to find their co-existence patterns.
1. Introduction
Coal is a complex system that contains most elements of the periodic table. The origin of coal was organic matter containing virtually every element in the periodic table, mainly carbon, but also trace elements. Elements with relatively high content in coal and host rock include iron (Fe) and aluminum (Al), which usually account for 1–20% of the rock, and sodium (Na), potassium (K), calcium (Ca), and magnesium (Mg), which are usually in the range of 0.01–10% of the rock. The trace elements refer to the elements at the 10–10,000 ppm level in coal, rocks, soil, etc. A variety of chemicals are associated with coal, found either in the coal itself or in the rock layers that lie above and beneath the coal seams [1]. Some of the trace elements are of great health concern. For example, lead (Pb) accounts for most cases of pediatric heavy metal poisoning and makes it difficult for children to learn, pay attention, and succeed in school. Mercury (Hg) exposure puts newborns at risk of neurological deficits and increases cardiovascular risk in adults. Arsenic (As) can cause heavy metal poisoning in adults and does not leave the body once it enters.
Coal mining has caused global environmental concern mainly for two reasons. First, coal and host rock contain multiple kinds of toxic trace elements, some of which pose serious environmental and health issues, and most of these elements (As, Cd, Co, Cr, Cu, Mn, Ni, Pb, Se, Sn, V, and Zn) are associated with inorganic matter [2, 3]. Second, the trace elements may be released through combustion and water-rock interaction [3–9].
Coal mine water containing toxic trace elements has influenced the quality of both groundwater and surface water in China. To control the contamination by trace elements, many efforts have been made in both research and management. According to the Chinese national standard GB/T 19223-2015, coal mine water is defined as bursting water, water infiltrating from surface water, and working produced water generated during coal mining activity. With respect to pH value, total dissolved solids, and suspended matter, the water is classified into acid (pH < 6.0), neutral (pH 6.0–9.0), and alkaline (pH > 9) water; low- (<1000 mg/L), medium- (1000–6000 mg/L), and high-mineralized (>6000 mg/L) water; and low- (<50 mg/L), medium- (50–500 mg/L), and high-suspended (>500 mg/L) coal mine water, respectively. Trace elements released from the coal and rock, including selenium (Se), As, Pb, fluorine (F), Hg, etc., may contaminate surface water and groundwater, giving the coal mine water some unique characteristics. However, the releasing patterns are relatively similar among coal mine waters. In the coal-bearing seam, the primitive environment is H-rich and reducing, where some reductive minerals such as pyrite, chalcopyrite, and sphalerite are stable. When the coal and rock seams come into contact with air, the Eh value of the surrounding environment is elevated and the minerals are oxidized [10, 11]. Through this process, the pH value may decrease, accompanied by the release of metal elements into the water and high concentrations of trace metal elements in the water [12–14]. However, neutral and alkaline mine water is also common because of the dissolution of alkaline minerals such as calcite and dolomite; the net effect of these reactions determines the pH value of the coal mine water and produces a high mineralization value [12, 15].
Besides the water parameters, the occurrence of trace elements also influences their migration [16–19]. The main minerals in coal include quartz, clay, sulfur-containing minerals, and smaller amounts of feldspars and carbonates [20, 21]. As, Cr, Pb, Hg, Mo, Zn, and Sb were found to be enriched in coal compared with the continental crust [22–25], while, compared with coal, the host rock and gangue rejected on the land around coal mines can release up to 10 times more toxic elements into water [2, 26–28].
The migration behavior of trace elements is controlled by two factors: the trace element occurrence and the surrounding environment. However, the migration patterns and mechanisms of trace elements into a surrounding water body are complex and differ depending on the site investigated. Traditional methods to investigate this process are based on geochemical surveys and testing, and the information and patterns behind the data matrix are hard to identify. Along with the development of machine learning, multivariate analytical technology has been applied in several different areas of geochemical research, and this fourth paradigm of research is becoming an increasingly powerful tool for finding solutions among mass data. Multivariate analysis has been used to study water characteristics [29] and water sources [30, 31], among other problems.
3. Method
This study was carried out at the Xuzhou-Datun coal mine district, located in the northwest of Jiangsu province, eastern China (Figure 1). Xuzhou city lies in the Huanghuai plain, in the southern part of northern China. The sediment strata covering the Archean system are the Sinian, Cambrian, middle-lower Ordovician, middle-upper Carboniferous, Permian, Jurassic, Cretaceous, Tertiary, and Quaternary systems, from bottom to top. The hydrogeology cell selected for this study is isolated by a series of faults and includes the Sanhejian, Yaoqiao, and Longdong coal mines shown in Figure 1. In this area, groundwater flows from northeast to southwest.
Figure 1.
Location of the study area.
The coal seams being mined are located in the Carboniferous and Permian systems; the former includes the Benxi and Taiyuan formations and the latter includes the Shanxi and Lower-Shihezi formations, listed from bottom to top in both systems.
The Permian strata contain mostly low-sulfur gas coal and fat coal. The lower formation in the Carboniferous has a higher sulfur content than the upper layers. The mass percentage of sulfur in the Permian Shanxi formation coal seams is around 0.83% in coal seam No. 7 and 1.09% in coal seam No. 9. In coal seams No. 17 and No. 19 in the Taiyuan formation, the average sulfur content was measured to be 1.87% and 3.49%, respectively.
The two mined coal seams in the Permian system (No. 2 and No. 7) were included in this study; they are located in the middle Lower-Shihezi formation (No. 2) and the Shanxi formation (No. 7). The two formations have thicknesses of 187–302.95 m and 81.67–136.13 m, respectively. White feldspar, quartz granule-sandstone, and silicon-mudstone cementation are the main minerals in the lower Shanxi formation; in addition, siltstone, siderite, carbon-mudstone, and plant-fossil clasts can also be found. Gray mudstone, sand-mudstone, and sandstone are the major rocks in the middle Shanxi formation, with some silicon-mudstone and siderite also present.
There are six aquifers in the sediment strata of the hydrogeology cell: a grit aquifer in the Quaternary, a conglomerate rock aquifer in the Jurassic, two sandstone aquifers (one in the Lower-Shihezi formation and one above the coal seam in the Shanxi formation), and two limestone aquifers, one located in the Carboniferous Taiyuan formation (thickness of 180–200 m) and the other in the Ordovician (thickness of 600 m). These last two aquifers are the main water sources of the coal seam.
A total of 16 water samples and 28 rock/coal samples were collected from the study area. Water samples were collected in 1000 mL Nalgene bottles that had previously been acid-cleaned and rinsed twice using the water to be collected. The pe and pH of the water samples were measured in the field using a JENCO 6010 pH/ORP meter. Coal and rock samples were collected from the working area at the mine and put into plastic bags that were immediately sealed.
The major ions and physical parameters of the water samples were determined according to Chinese standard protocols at the Jiangsu Provincial Coal Geology Research Institute. The solid samples were acid digested to determine the concentration of trace elements. The concentration of trace elements in the water, coal, and rock samples was determined by ICP-MS and ICP-AES. The ICP-MS analysis was carried out at the China University of Mining and Technology using an X-Series ICP-MS (Thermo Electron Co.). An internal standard of Rh was used to determine the limit of detection (0.5 pg/mL) and the analytical deviation (less than 2%). The ICP-AES analysis was carried out at Nanjing University using a JY38S ICP-AES. The limit of detection and deviation for the analyses carried out with this equipment are 0.01 μg/mL and less than 2%, respectively.
Leaching experiments were conducted in batch mode to simulate conditions in a coal seam, where water movement is slow and dissolution reactions tend to reach equilibrium, following previous studies [44, 45]. To simulate a "closed environment" (with low pO2; see Stumm and Morgan [46] for details), the bottles were closed with rubber stoppers and samples were taken out using syringes. The pe of the solution during the experiments was determined with a JENCO 6010 pH/ORP meter.
Three subsamples were used for each sample, one per 1000 mL aliquot of deionized water, at the following pHs: 2, 5.6, 7, and 12. The flasks were sealed and shaken every 2 h for up to 10 days. The temperature was controlled at about 40°C using a water bath. Leachate solutions were collected using syringes at 2, 6, 24, and 48 h. A total of 0.5 mol/L HNO3 was added to all the samples. Leachate aliquots were titrated with
We applied the software R as the analysis tool; the packages psych and mclust were used to calculate the PCA and the Gaussian mixture (GM) model clustering results.
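The analysis itself was done in R (psych performs the rotated PCA and mclust the Gaussian mixture clustering). For readers who prefer Python, the following is a rough, functionally similar sketch with scikit-learn, under the assumptions that `X` is the samples-by-parameters matrix and that unrotated PCA loadings are an acceptable stand-in for psych's rotated components.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.mixture import GaussianMixture
from sklearn.preprocessing import StandardScaler

def pca_gmm_grouping(X, n_components=2, max_groups=6):
    """Reduce a samples-by-parameters matrix to a few principal components,
    then cluster the parameter loadings with a Gaussian mixture model,
    choosing the number of groups by the lowest BIC score."""
    Xs = StandardScaler().fit_transform(X)
    pca = PCA(n_components=n_components).fit(Xs)

    # loadings[j] holds the coordinates of parameter j on the retained PCs,
    # analogous to the RC1/RC2 loadings plotted in Figures 3-7.
    loadings = pca.components_.T * np.sqrt(pca.explained_variance_)

    best_model, best_bic = None, np.inf
    for k in range(1, max_groups + 1):
        gmm = GaussianMixture(n_components=k, random_state=0).fit(loadings)
        bic = gmm.bic(loadings)
        if bic < best_bic:
            best_model, best_bic = gmm, bic
    return loadings, best_model.predict(loadings)
```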
A total of 16 water samples were collected from the study site, including 12 coal mine waters, two surface waters, and two carbonate waters. Concentrations of the major ions are drawn in a Piper plot (Figure 2), which suggests that the carbonate water and the coal mine water belong to medium-mineralized water, while the surface water belongs to low-mineralized water. The surface water is Na-Mg-Ca-Cl−-SO42−-HCO3−-type water, the carbonate water is Na-Mg-Ca-SO42−-type water, and the coal mine water is Na-Ca-SO42−-, Na-SO42−-, or Na-HCO3−-type water. The coal mine waters showed characteristics of highly soluble minerals. The [SO42−] of most coal mine water samples was higher than the USEPA and Chinese upper limit of 250 mg/L. Besides [SO42−], the [Cl−], TDS, and hardness were also higher than the Chinese regulated limits. The combination of elevated Ca2+, Mg2+, HCO3−, and SO42− concentrations in the groundwater suggests that coupled reactions involving sulfide oxidation and carbonate dissolution largely control the solute acquisition processes in the study area [52].
Figure 2.
Piper plot of the water samples.
The PCA analysis is used to reduce the dimensions of the water data matrix; in this case, the dimensions are the water parameters. Water samples are represented by tens of conventional inorganic and organic parameters, some of which are indicators of the environment and reaction pathways, while others are redundant or collinear. The PCA method not only addresses parameter redundancy and collinearity but also reveals the principal components in the data matrix; the relationships between parameters, and between the parameters and the samples, can be shown using the parameters' loadings and the samples' scores, respectively.
In this study, the traditional PCA calculation was applied, and the principal components and the variance explained by each component were calculated. In the original table, 16 parameters were tested, and the PCA calculation produced 16 new components to represent the original parameters, which explain the variance of the samples in descending order. The first six components explained 29, 21, 17, 10, 9, and 5% of the variance, respectively. Balancing the amount of variance explained against the number of components, we chose two principal components to represent the sample data. The GM method was then used to group the ions and trace elements in the water samples, as shown in Figure 3. The parameters were clustered into four groups: group 1 includes K+ + Na+ and Cl−; group 2 includes Ca2+, Mg2+, Cl−, SO42−, TDS, and hardness; group 3 includes HCO3−, CO32−, and pH; and group 4 includes As, Hg, Se, Cd, and Pb. The samples were collected in or around the coal mine district, so the clustering result is representative, and the groups are separated from one another distinctly. The clustering result suggests that group 2 stands for the dissolution of carbonate and group 4 stands for the trace elements; the trace element contaminants can be identified from this result.
Figure 3.
Loadings of the multivariate analysis and clustering result of water samples.
4.2 Leaching mechanism of trace elements from the coal host rock
To investigate the leaching mechanism of trace elements from the coal host rock, both the rock samples and the water samples were tested. The rock samples were collected from the coal roof and then processed with a standard treatment to determine their content. The milled rock samples were mixed with deionized water in the batch experiments to observe and evaluate the leaching behavior and mechanism of the trace elements from rock to water. The major and trace element concentrations in the host rock and leachate are listed in Table 1 in Shan et al. [53]. Our hypothesis was that the occurrence and leaching mechanism of the trace elements in the solid samples were related to their concentrations in the water samples. Therefore, PCA was applied to reduce the dimensions of the rock and water sample data, and the analytical results of the solid and liquid samples are then discussed in parallel.
For the rock samples, 18 elements were tested, and the PCA method was then applied. The first two components explained 91% of all variance; therefore, these two PCs were used to represent the information in the data. For the water samples, 16 ions and trace elements were tested and the same analytical process was applied. The first two PCs explained 87% of all variance and were used to represent the information in the water samples. Using the new PCs, every parameter was assigned a loading on each new component, so the parameters of the rock and water samples can be drawn in a two-dimensional (2D) scatter diagram. Figure 4 shows the elements of the rock samples, and Figure 5 shows the ions and elements of the water samples, in 2D scatter diagrams.
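As an illustration, such a loadings scatter diagram can be produced with matplotlib from the loadings and group labels returned by the earlier sketch; the variable names here are assumptions.

```python
import matplotlib.pyplot as plt

def plot_loadings(loadings, groups, names):
    """Scatter the parameter loadings on the first two components (RC1/RC2),
    coloring each parameter by its cluster and labeling it with its name."""
    fig, ax = plt.subplots()
    ax.scatter(loadings[:, 0], loadings[:, 1], c=groups, cmap="tab10")
    for (x, y), name in zip(loadings[:, :2], names):
        ax.annotate(name, (x, y), fontsize=8)
    ax.axhline(0, color="gray", linewidth=0.5)
    ax.axvline(0, color="gray", linewidth=0.5)
    ax.set_xlabel("RC1")
    ax.set_ylabel("RC2")
    return fig

# Example: plot_loadings(loadings, groups, parameter_names); plt.show()
```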
The PCA-treated data were clustered using the expectation maximization (EM) algorithm, which can produce several candidate clustering results. By considering the BIC score and the conciseness of each clustering model, the parameters of the rock samples were clustered into three groups. The first group includes Mo, Pb, Cr, V, Ti, and Al, marked with solid circles; the second group includes Zn, Ba, Mn, Fe, Mg, As, Hg, Se, and Cd, shown as hollow squares; and the third group includes Cu, Sr, and Ca, shown as solid triangles. As mentioned before, the clustering helps to analyze the elements' occurrence in the solid samples. Cr has a high affinity for clay and ash yield in gangue [3]. Zhou et al. [2] reported a high relationship pattern in the solid and liquid samples. It is apparent that they were controlled by the dissolution of sulfur minerals. The content of sulfur minerals in the rock was not high in our samples; however, the oxidation and dissolution processes were distinct, leading to the release of toxic trace elements.
Figure 4.
Loadings of the multivariate analysis and clustering result of rock samples.
Figure 5.
Loadings of the multivariate analysis and clustering result of rock leachate.
The major and trace element concentrations in the coal and its leachate are listed in Table 1 in Shan et al. [53]. The same analytical method used for the rock was applied to the coal and coal leachate, and the PCA and clustering results for the coal and the coal leaching water are shown in Figures 6 and 7. Two principal components could explain 96% and 91% of the variance for the coal and the leachate, respectively. As Figure 6 shows, the elements of the coal are clustered into four groups: group 1 includes Mo, Pb, Cr, V, Cu, Ti, Al, Hg, and Se; group 2 includes Zn and Cd; group 3 includes Ba, Mn, Sr, Mg, and Ca; and group 4 includes Fe and As. The ions and trace elements in the coal leachate, as shown in Figure 7, were grouped into three groups: group 1 includes Al, Se, and Pb; group 2 includes Si, As, Sr, Mo, and Hg; and group 3 includes Ti, Cr, Mn, Fe, Zn, Cd, and Ba. Finkelman et al. [3] investigated the occurrence of most of the trace elements and found that 65% of Ti, 90% of Al, 75% of Cr, and 25% and 30% of Cu and Mo, respectively, are in clay minerals; little Pb and Se are in clay form; 75% and 65% of Zn and Cd are in mono-sulfide form; and 70% and 90% of As and Hg are in sulfide form. Pumure et al. [39] argued that As and Se usually occur in clay minerals. Pb was found in sulfide form, as pyrite and galena [54], and in organic form [55].
Combining the literature review and the PCA-clustering analysis, group 1 of the coal samples stands for clay affinity, groups 2 and 4 are sulfur-mineral elements, and group 3 is related to carbonate minerals. Group 2 has two elements, Zn and Cd, a result consistent with some previous studies [2, 56]. We conclude the main occurrence of the trace elements as follows: As, Hg, and Cd occur in sulfide minerals, while Pb, Cr, and Se occur in clay minerals. Zn and Cd are the primary elements in sphalerite; compared with the host rock, sphalerite is more likely to form an independent mineral in coal.
Figure 6.
Loadings of the multivariate analysis and clustering result of coal samples.
Figure 7.
Loadings of the multivariate analysis and clustering result of coal leachate.
The coal leachate clustering results were somewhat different from the analytical results of the coal itself. Compared with the rock samples, coal is a more complex matrix consisting of organic and mineral matter, the latter including crystalline minerals, non-crystalline mineraloids, and elements with non-mineral associations [55]. However, some patterns can be identified. Group 1 of the leachate includes Al, Se, and Pb, which is similar to group 1 in the coal analysis; therefore, group 1 stands for the elements that originated from clay minerals. Group 2 stands for the elements related to sulfur-bearing minerals. As and Hg had similar behavior patterns in the solid and liquid matrices, so the leaching product in the water came mainly from the dissolution of their bearing mineral, the sulfide mineral. Similar to the host rock analysis, even a low content of sulfur minerals may lead to elevated trace element concentrations.
The trace elements Se, Cr, and Pb have similar behavior patterns in the solid and liquid matrices, suggesting dissolution of their bearing minerals. According to the literature and the co-existence analysis, these elements usually occur in continental-facies minerals, such as clay minerals.
5. Conclusion
When the host rock is leached with water, As, Hg, and Se originate from the oxidation and dissolution of sulfur minerals, especially pyrite, while Cr is mainly controlled by the transformation of clay minerals. When the coal is leached with water, As and Hg show a high affinity for sulfur minerals, and Se and Cr seem to be controlled by the water-rock interaction of clay minerals. This suggests that Se exists in sulfide minerals, in clay minerals, and also in organic matter; therefore, the leaching mechanism of Se is not unique, and multiple mechanisms may control or influence its leaching behavior. Cd and Pb showed apparent differences between the solid samples and the liquid samples. The mechanism leading to this result probably involves not only the release process but also the adsorption process: these are typical metal elements that can easily be adsorbed in alkaline and neutral environments, so the released metal elements were adsorbed by clay minerals and organic matter. The migration mechanism and long-term environmental impact need further study.
Acknowledgements
The test of samples was carried out in the Jiangsu Provincial Coal Geology
Research Institute, the Analysis and Test Center of the China University of Mining
and Technology, Imperial College London. We would like to thank all of them for
their support.
Author details
Yao Shan
School of Emergency Technology and Management, North China Institute of Science
and Technology, Yanjiao, China
References
[1] Goodell J. Big Coal: The Dirty Secret Behind America's Energy Future. New York, NY: Houghton-Mifflin; 2006
[2] Zhou C, Liu G, Fang T, Sun R, Wu D. Leaching characteristic and environmental implication of rejection rocks from Huainan Coalfield, Anhui Province, China. Journal of Geochemical Exploration. 2014;143:54-61
[3] Finkelman RB, Plamer CA, Wang P. Quantification of the modes of occurrence of 42 elements in coal. International Journal of Coal Geology. 2018;185:138-160
[4] Fang WX, Wu PW, Hu RZ. Geochemical research of the impact of Se–Cu–Mo–V-bearing coal layers on the environment in Pingli County, Shaanxi Province, China. Journal of Geochemical Exploration. 2003;80:105-115
[5] Finkelman RB, Orem W, Castranova V, Tatu CA, Belin HE, Zheng B, et al. Health impacts of coal and coal use: Possible solutions. International Journal of Coal Geology. 2002;50:425-443
[6] Liu G, Yang P, Peng Z, Chou CL. Petrographic and geochemical contrasts and environmentally significant trace elements in marine-influenced coal seams, Yanzhou mining area, China. Journal of Asian Earth Sciences. 2004;23:491-506
[7] Liu G, Vassilev SV, Gao L, Zheng L, Peng Z. Mineral and chemical composition and some trace element contents in coals and coal ashes from Huaibei coal field, China. Energy Conversion and Management. 2005;46:2001-2009
[8] Querol X, Alastuey A, Zhuang X, Hower JC, Lopez-Soler A, Plana F, et al. Petrology, mineralogy and geochemistry of the Permian and Triassic coals in the Leping area, Jiangxi Province, southeast China. International Journal of Coal Geology. 2001;48:23-45
[9] Mohanty AK, Lingaswamy M, Rao G, Sankaran S. Impact of acid mine drainage and hydrogeochemical studies in a part of Rajrappa coal mining area of Ramgarh District, Jharkhand State of India. Groundwater for Sustainable Development. 2018;7:164-175
[10] Sahoo PK, Tripathy S, Panigrahi MK, Equeenuddin SM. Geochemical characterization of coal and waste rocks from a high sulfur bearing coalfield, India: Implication for acid and metal generation. Journal of Geochemical Exploration. 2014;145:135-147
[11] Zhu C, Qu S, Zhang J, Wang Y, Zhang Y. Distribution, occurrence and leaching dynamic behavior of sodium in Zhundong coal. Fuel. 2017;190:189-197
[12] Zhao F, Sun H, Liu N, Cai W, Han R, Chen B. Evaluation of static acid production potential for coal bearing formation (in Chinese). Earth Science-Journal of China University of Geosciences. 2014;39(3):350-356
[13] Cravotta CA III. Monitoring, field experiments, and geochemical modeling of Fe (II) oxidation kinetics in a stream dominated by net-alkaline coal-mine drainage, Pennsylvania, USA. Applied Geochemistry. 2015;62:96-107
[14] Cravotta CA III, Brady KBC. Priority pollutants and associated constituents in untreated and treated discharges from coal mining or processing facilities in
[29] Orakwe LC, Chukwuma EC. Multivariate analysis of ground water characteristics of Ajali sandstone formation: A case study of Udi and Nsukka LGAs of Enugu State of Nigeria. Journal of African Earth Sciences. 2017;129:668-674
[30] Matiatos I, Paraskevopoulos V, Lazogiannis K, Botsou F, Dassenakis M, Ghionis G, et al. Surface-ground water interactions and hydrogeochemical evolution in a fluvio-deltaic setting: The case study of the Pinios River delta. Journal of Hydrology. 2018;561:236-249
[31] Zhu B, Wang X, Rioual P. Multivariate indications between environment and ground water recharge in a sedimentary drainage basin in northwestern China. Journal of Hydrology. 2017;549:92-113
[32] Hwang CK, Cha JM, Kim KW, Lee HK. Application of multivariate statistical analysis and a geographic information system to trace element contamination in the Chungnam Coal Mine area, Korea. Applied Geochemistry. 2001;16:1455-1464
[33] Singh KP, Malik A, Mohan D, Sinha S. Multivariate statistical techniques for the evaluation of spatial and temporal variations in water quality of Gomti River (India)—A case study. Water Research. 2004;38(18):3980-3992
[34] Liu P, Hoth N, Drebenstedt C, Sun Y, Xu Z. Hydro-geochemical paths of multi-layer groundwater system in coal mining regions—Using multivariate statistics and geochemical modeling approaches. Science of the Total Environment. 2017;601-602:1-14
[35] Hajigholizadeh M, Melesse AM. Assortment and spatiotemporal analysis of surface water quality using cluster and discriminant analyses. Catena. 2017;151:247-258
[36] Xue D, De Baets B, Van Cleemput O, Hennessy C, Berglund M, Boeckx P. Use of a Bayesian isotope mixing model to estimate proportional contributions of multiple nitrate sources in surface water. Environmental Pollution. 2012;161:43-49
[37] Huang P, Yang Z, Wang X, Ding F. Research on Piper-PCA-Bayes-LOOCV discrimination model of water inrush source in mines. Arabian Journal of Geosciences. 2019;12(334):1-14
[38] Wang J, Li X, Cui T, Yang J. Application of distance discriminant analysis method to headstream recognition of water-bursting source. Procedia Engineering. 2011;26:374-381
[39] Pumure I, Renton JJ, Smart RB. The interstitial location of selenium and arsenic in rocks associated with coal mining using ultrasound extractions and principal component analysis (PCA). Journal of Hazardous Materials. 2011;198:151-158
[40] Lin Q, Liu E, Zhang E, Li K, Shen J. Spatial distribution, contamination and ecological risk assessment of heavy metals in surface sediments of Erhai Lake, a large eutrophic plateau lake in southwest China. Catena. 2016;145:193-203
[41] Tian HZ, Zhu CY, Gao JJ, Cheng K, Hao JM, Wang K, et al. Quantitative assessment of atmospheric emissions of toxic heavy metals from anthropogenic sources in China: Historical trend, spatial distribution, uncertainties, and control policies. Atmospheric Chemistry and Physics. 2015;15(17):10127-10147
[42] Murillo JH, Roman SR, Rojas Marin JF, Ramos AC, Jimenez SB, Gonzalez BC, et al. Chemical characterization and source apportionment of PM10 and PM2.5 in the metropolitan area of Costa Rica, Central
Abstract
Mining the sentiment of users on the internet from context plays a significant role in uncovering human emotion and in determining the exact nature of the underlying emotion in the context. The increasingly enormous amount of user-generated content (UGC) on social media and online travel platforms has led to the development of data-driven sentiment analysis (SA), and most extant SA in the domain of tourism is conducted as document-based SA (DBSA). However, unlike aspect-based SA (ABSA), DBSA cannot be used to examine which specific aspects need to be improved or to disclose the unknown dimensions that affect the overall sentiment. ABSA requires accurate identification of the aspects and the sentiment orientation in the UGC. In this book chapter, we illustrate the contribution of data mining based on deep learning to sentiment and emotion detection.
1. Introduction
Since the world has been inundated with an increasing amount of tourist data, tourism organizations and businesses should keep abreast of tourist experiences and views about their business, products, and services. Gaining insights into these areas can facilitate the development of robust strategies that enhance the tourist experience and further boost tourist loyalty and recommendations. Traditionally, businesses rely on structured quantitative approaches, for example, rating tourist satisfaction levels on a Likert scale. Although this approach is effective for proving or disproving existing hypotheses, closed-ended questions cannot reveal tourists' exact experiences of and feelings about the products or services, which hampers obtaining insights from tourists. In practice, businesses have already applied sophisticated and advanced approaches, such as text mining and sentiment analysis, to disclose the main themes and patterns hidden behind the data.
Sentiment analysis (SA) has been used to deal with unstructured data in the domain of tourism, such as texts, images, and video, to investigate the decision-making process [1], service quality [2], and destination image and reputation [3]. As for the level of sentiment analysis, most extant sentiment analysis in the domain of tourism is conducted at the document level [4–7]. Document-based sentiment analysis (DBSA) regards the whole individual review or each sentence as an independent unit and assumes there is only one topic in the review or sentence. However, this assumption is invalid, as people normally express their semantic orientation toward different aspects within a review or a sentence [8]. For example, in the sentence "we had impressive breakfast, comfortable bed and friendly and professional staff serving us", the aspects discussed are "breakfast", "bed", and "staff", and the user gives positive comments on these aspects ("impressive", "comfortable", and "friendly and professional"). Since the sentiment obtained through DBSA is at a coarse level, aspect-based sentiment analysis (ABSA) has been suggested to capture sentiment tendencies at a finer granularity.
To obtain sentiment at this finer level, ABSA has been proposed and developed over the years. ABSA normally involves three tasks: the extraction of the opinion target (also known as the "aspect term"), the detection of the aspect category, and the classification of the sentiment polarity. Traditional methods to extract aspects rely on word frequency or linguistic patterns; nevertheless, they cannot identify infrequent aspects and depend heavily on grammatical accuracy to apply the rules [9]. For the detection of sentiment polarity, supervised machine learning approaches such as Maximum Entropy (ME), Conditional Random Field (CRF), and Support Vector Machine (SVM) have been used. Although machine learning-based approaches have achieved desirable accuracy and precision, they require huge datasets and manually labeled training data, and the results cannot easily be duplicated in other fields [10]. To overcome these shortcomings, ABSA based on deep learning (DL) approaches has the advantage of automatically extracting features from data [9]. Extant studies based on DL methods in tourism have investigated tourist experiences in economy hotels [11], the identification of destination image [12], and review classification [13]. Although DL methods have been applied in tourism, ABSA in tourism is scant. Therefore, this study reviewed sentiment analysis at the aspect level conducted with DL approaches, compared the performance of DL models, and explored the model training process.
With reference to surveys of DL methods [9, 14], this study followed the framework of ABSA proposed by Liu (2011) [8] to achieve the following aims: (1) provide an overview of the studies using DL-based ABSA in tourism for researchers and practitioners; (2) provide practical guidelines including data annotation, pre-processing, and model training for potential applications of ABSA in similar areas; (3) train a model to classify sentiments with state-of-the-art DL methods and optimizers using datasets collected from TripAdvisor. This chapter is organized as follows: Section 2 reviews the cutting-edge techniques for ABSA, studies using DL for NLP tasks in tourism, and the research gap; Section 3 presents the annotation schema of the given corpus and the DL methods used in this study; Section 4 describes the details of the annotation results, the model training, and the experimental results; Section 5 provides the conclusions and future extensions.
2. Literature review
An extensive literature review of the state-of-the-art techniques for ABSA and the studies using DL in tourism is provided in this section.
To convert NLP problems into a form that computers can deal with, the texts need to be transformed into numerical values. In ML-based approaches, one-hot encoding and the count vectorizer are commonly used. One-hot encoding realizes a token-level representation of a sentence; however, its use usually results in high-dimensionality issues, which is not computationally efficient [15]. Another issue is the difficulty of extracting meaning, as this approach assumes that the words in a sentence are independent, so similarities cannot be measured by distance or cosine similarity. As for the count vectorizer, although it can convert a whole sentence into one vector, it cannot consider the sequence of the words or the context.
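A small sketch contrasting these two classical representations with scikit-learn; the example sentences are invented and only meant to show the shape of the output.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import OneHotEncoder

docs = ["the breakfast was impressive", "the bed was comfortable"]

# Count vectorizer: one count vector per sentence; word order and context are lost.
bow = CountVectorizer()
counts = bow.fit_transform(docs)        # sparse matrix, shape (2, vocabulary size)
print(bow.get_feature_names_out())
print(counts.toarray())

# One-hot encoding: one indicator vector per token; the dimensionality grows with
# the vocabulary and every word is treated as independent of all the others.
tokens = [[w] for doc in docs for w in doc.split()]
onehot = OneHotEncoder()
print(onehot.fit_transform(tokens).toarray())
```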
In DL-based approaches, by contrast, pre-trained word embeddings have been proposed [16, 17]. Word embedding, or word representation, refers to a learned representation of text in which words with similar meanings have similar representations. It has been shown that the use of word embeddings as input vectors can yield a 6–9% increase in aspect extraction [18] and 2% in the identification of sentiment polarity [19]. Pre-trained word embeddings are favored because random initialization could leave stochastic gradient descent (SGD) stuck in local minima [20]. Based on the neural network language model, a feedforward architecture combining a linear projection layer and a non-linear hidden layer could learn word vector representations together with a statistical language model [21].
Word2Vec [16] proposed the skip-gram and continuous bag-of-words (CBOW) models. By setting the window size, skip-gram predicts the context based on the given word, while CBOW predicts the word based on the context. Because word frequency is appropriate for obtaining classes in neural network language models, frequent words are assigned binary codes in Huffman trees; this practice in Word2Vec helps reduce the number of output units that need to be assessed. However, the window-based approaches of Word2Vec do not work on the co-occurrence statistics of the text and do not harness the large amount of repetition in the texts. Therefore, to capture a global representation of the words across all sentences, GloVe takes advantage of the nonzero elements in a word-word co-occurrence matrix [17].
Although the models discussed above perform well in similarity tasks and named entity recognition, they cannot cope with polysemous words. In more recent developments, Embeddings from Language Models (ELMo) [22] and Bidirectional Encoder Representations from Transformers (BERT) [23] can identify context-sensitive features in the corpus. The main difference between these two architectures is that ELMo is feature-based, while BERT is deeply bidirectional. To be specific, in ELMo the contextual representation of each token is obtained by concatenating the left-to-right and right-to-left representations. In contrast, BERT applies a masked language model (MLM) to acquire pre-trained deep bidirectional representations; the MLM randomly masks certain tokens from the input and predicts the masked tokens depending only on the context. Additionally, BERT is capable of addressing the issue of long-text dependence.
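As an aside for readers who want to see contextual representations in code, the following minimal sketch extracts token vectors from a pre-trained BERT model, assuming the Hugging Face transformers and PyTorch libraries are installed; it illustrates the idea only and is not the pipeline used later in this chapter.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

# The word "high" receives different vectors in the two sentences because BERT
# conditions every token representation on its full context.
sentences = ["the price is high", "the screen resolution is high"]
inputs = tokenizer(sentences, padding=True, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

token_vectors = outputs.last_hidden_state   # shape: (batch, tokens, hidden size)
print(token_vectors.shape)
```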
Nonetheless, researchers have combined certain features with word embeddings to produce more pertinent results. These features include Part-Of-Speech (POS) and chunk tags, and commonsense knowledge. It has been observed that aspect terms are usually nouns or noun phrases [8]. The original word embeddings of the texts are concatenated with k-dimensional binary vectors that represent the k POS or chunk tags, and the concatenated word embeddings are fed into the models (Do, Prasad, Maag, and Alsadoon, 2019 [9]). It has been shown that the use of POS tagging as input can improve the performance of aspect extraction, with gains from 1% [18, 20] to 4% [24]. Apart from POS, concepts that are closely related to affective states have been suggested as additional embeddings [25, 26]. POS focuses on the grammatical tagging of the words in a corpus, while the concepts extracted from SenticNet emphasize multi-word expressions and the dependency relations between clauses. For example, the multi-word expression "win lottery" can be related to the emotion "Arise-joy", and the single-word expression "dog" is associated with the property "Isa-pet" and the emotion "Arise-joy" [26]. After the text is parsed by SenticNet, the obtained concept-level information (properties and emotions) is embedded into deep neural sequential models. The performance of the Long Short-Term Memory (LSTM) [27] combined with SenticNet exceeded the baseline LSTM [26].
This section reviews the DL methods used for ABSA, including Convolutional
Neural Network (CNN), Recurrent Neural Network (RNN), Attention-based RNN,
and Memory Network.
2.2.1 CNN
CNN can learn to capture fixed-length expressions based on the assumption that keywords, which usually include the aspect terms, can be detected with little dependence on their positions [28]. Besides, as CNN is a non-linear model, it usually outperforms linear models and rarely relies on language rules [29]. In one study, a local feature window of five words was first created for each word in the sentence to extract the aspects; a seven-layer CNN was then tested and generated better results [29]. To capture multi-word expressions, the model proposed in [30] contained two separate convolutional layers with non-linear gates, and n-gram features were obtained by convolutional layers with multiple filters. Li et al. [13] put position information between the aspect words and the context words into the input layer of the CNN and introduced aspect-aware transformation components. Fan et al. [31] integrated the attention mechanism with a convolutional memory network; the proposed model can learn multi-word expressions in the sentence and identify long-distance dependencies.
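As a generic, simplified illustration of a convolutional text classifier of this kind (not a reproduction of any of the cited architectures), a Keras sketch follows; the vocabulary size, sequence length, and number of classes are placeholder assumptions.

```python
import tensorflow as tf

VOCAB_SIZE, MAX_LEN, EMBED_DIM, NUM_CLASSES = 20000, 100, 128, 3

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(MAX_LEN,)),
    tf.keras.layers.Embedding(VOCAB_SIZE, EMBED_DIM),
    # A window of 5 tokens, echoing the local feature window mentioned above.
    tf.keras.layers.Conv1D(filters=128, kernel_size=5, activation="relu"),
    tf.keras.layers.GlobalMaxPooling1D(),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```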
Apart from extracting the aspects alone, CNN can identify the sentiment polarity at the same time, which can be regarded as a multi-label classification or multitask problem. Researchers who treated ABSA as multi-label classification applied a probability distribution threshold to select the aspect category; the aspect vector was concatenated with the word embedding and then processed with a CNN. Xu et al. [32] combined a CNN with a non-linear CRF to extract the aspects, which were then concatenated with the word embeddings and fed into another CNN to identify the sentiment polarity. Gu et al. [33] proposed a two-level CNN that integrated aspect mapping and sentiment classification. Compared with conventional ML approaches, this approach can lessen the feature engineering work and elapsed time [9]. It should be noted that multitasking CNN does not necessarily outperform other multitasking methods [19].
RNN has been applied to both ABSA and SBSA on UGC. RNN models use a fixed-size vector to represent one sequence, which could be a sentence or a document, feeding each token into a recurrent unit. The main differences between CNN and RNN are: (1) the parameters of the different steps in an RNN are shared, so fewer parameters need to be learned; (2) since the outputs of an RNN depend on the prior steps, an RNN can identify context dependencies and is suitable for texts of different lengths [34–36].
However, the standard RNN has the prominent shortcomings of gradient explosion and vanishing, causing difficulties in training and fine-tuning the parameters during propagation [34]. LSTM and the Gated Recurrent Unit (GRU) [37] have been proposed to tackle these issues. In addition, bi-directional RNN (Bi-RNN) models have been proposed in many studies [38, 39]. The principle behind Bi-RNN is that a context-aware representation can be acquired by concatenating the backward and forward vectors: instead of the forward layer alone, a backward layer is added to learn from both the past and the future, enabling the Bi-RNN to predict using the following words as well. It has been shown that the Bi-RNN model achieves better results than LSTM on highly skewed data in the task of aspect category detection [40]. In particular, the bi-directional GRU is capable of extracting aspects and identifying the sentiment at the same time [23, 41], using Bi-LSTM-CRF and CNN to extract the aspects in sentences that have more than one sentiment target.
Another drawback of RNN is that it also encodes peripheral information, especially when fed with information-rich texts, which can further result in semantic mismatching problems. To tackle this issue, the attention mechanism was proposed to compute weights over each lower level, which are then aggregated into a weighted vector for the high-level representation [42]. In doing so, the attention mechanism can emphasize the aspects and the sentiment in the sentence. Attention-based LSTM with aspect embeddings [43], position attention-based LSTM [44], and syntactic-aware vectors [45] were used to capture the important aspects and context words. Aspect and opinion terms can be extracted by the Coupled Multi-Layer Attention Model based on GRU [46] and by the Bi-CNN with attention [47]. These frameworks require fewer engineered features compared with the use of CRF.
The development of the deep memory network in ABSA originated from the multi-hop attention mechanism, which applies an external memory to compute the influence of context words on the given aspects [36]. A multi-hop attention mechanism was set over an external memory to recognize the importance level of the context words and to infer the sentiment polarity based on the contexts. The tasks of aspect extraction and sentiment identification can be achieved simultaneously in the memory network model proposed by [13]. Li et al. [13] used the signals obtained in aspect extraction as the basis for predicting the sentiment polarity, which in turn helped identify the aspects.
Memory networks can tackle problems that cannot be addressed by the attention mechanism alone. To be specific, in certain sentences the sentiment polarity depends on the aspect and cannot be inferred from the context alone. For example, consider "the price is high" and "the screen resolution is high". Both sentences contain the word "high". When "high" is related to "price", it expresses negative sentiment, while it represents positive sentiment when "high" is related to "screen resolution". Wang et al. [48] proposed six techniques to design target-sensitive memory networks that can deal with this issue effectively.
Based on the considerations and the purpose of the study, the corpus in this study is entirely in English and includes reviews collected from casino resorts in Macao. A self-designed tool programmed in Python was implemented to acquire all the URLs, which were first stored and then used as the initial pages to crawl all the UGC belonging to each hotel. The corpus includes 61,544 reviews of 66 hotels. The length of the reviews varied greatly, from a minimum of one sentence to a maximum of 15 sentences.
In terms of the size of the corpus that requires annotation, as there is no clear guidance regarding corpus size, this study referred to Liu's work and SemEval's task. For machine-learning-based studies, it is reasonable to consider a corpus with 800–1000 aspects sufficient, while for a deep-learning-based approach we consider at least 5000 aspects in total to be acceptable. As the original data had to be annotated before further analysis, 1% of the reviews were randomly sampled from the corpus. Therefore, 600 reviews containing 5506 sentences were selected for ABSA in this study.
3.2 Annotation
Although previous works annotated corpora and performed sentiment analysis, they did not reveal the annotation principles [51, 53], and the categories are rather coarse. For example, [53] used pre-defined categories to annotate the aspects of restaurants. The categories were "Food, Service, Price, Ambience, Anecdotes, and Miscellaneous", which did not cover aspects at finer levels. In addition, the reliability and validity of the annotation scheme were not demonstrated.
As the training of the models discussed above requires the annotation of domain-specific corpora, this study referred to [54]. The design of the annotation schema calls for the identification of aspect-sentiment pairs. Specifically, A is the collection of aspects a_j (with j = 1, …, s). Then, a sentiment polarity p_k (with k = 1, …, t) is added to each aspect in the form of a tuple (a_j, p_k).
To ensure reliability and validity, Cohen's kappa and Krippendorff's alpha are introduced as Inter-Annotator Agreement (IAA) measures in this study, calculated with the agreement package in NLTK. Both indicators are used to measure (1) the agreement on the entire aspect-sentiment pair and (2) the agreement on each independent category.
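As a hedged illustration of how these scores can be obtained with NLTK's agreement module, the sketch below uses a handful of invented items and aspect-sentiment labels from two hypothetical coders; only the AnnotationTask usage reflects the actual package.

from nltk.metrics.agreement import AnnotationTask

# (coder, item, label) triples; items and aspect-sentiment labels are invented.
data = [
    ("coder1", "s1", "room#positive"),    ("coder2", "s1", "room#positive"),
    ("coder1", "s2", "service#negative"), ("coder2", "s2", "service#neutral"),
    ("coder1", "s3", "price#negative"),   ("coder2", "s3", "price#negative"),
    ("coder1", "s4", "food#positive"),    ("coder2", "s4", "food#positive"),
]

task = AnnotationTask(data=data)
print("Cohen's kappa:", task.kappa())         # pairwise kappa for the two coders
print("Krippendorff's alpha:", task.alpha())  # alpha over the same annotations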
3.3.1 LSTM

The LSTM unit proposed by [25] overcomes the gradient vanishing and exploding issues of the standard RNN. The LSTM unit consists of forget, input, and output gates, as well as a cell memory state. Instead of computing a weighted sum of the inputs and applying an activation function, as the plain recurrent unit does, the LSTM unit maintains a memory cell c_t at time t. Each LSTM unit can be computed as follows:
X = [h_{t-1}  x_t]   (1)

f_t = σ(X W_f^T + b_f)   (2)

i_t = σ(X W_i^T + b_i)   (3)

o_t = σ(X W_o^T + b_o)   (4)

c_t = f_t ⊙ c_{t-1} + i_t ⊙ tanh(X W_c^T + b_c)   (5)

h_t = o_t ⊙ tanh(c_t)   (6)
The forget gate decides the extent to which the existing memory is kept (Eq. (2)), while the extent to which new memory is added to the memory cell is controlled by the input gate (Eq. (3)). The memory cell is updated by partially forgetting the existing memory and adding new memory content (Eq. (5)). The output gate controls how much of the memory content is exposed by the unit (Eq. (4)). With these three gates, the LSTM unit can decide whether to keep the existing memory. Intuitively, if the LSTM unit detects an important feature in an input sequence at an early stage, it can easily carry this information (the existence of the feature) over a long distance, hence capturing potential long-distance dependencies.
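To make Eqs. (1)–(6) concrete, the following minimal NumPy sketch computes one LSTM step; the toy dimensions, random weights, and initialization are assumptions made purely for illustration and are not the configuration used in this chapter.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W_f, W_i, W_o, W_c, b_f, b_i, b_o, b_c):
    """One LSTM step following Eqs. (1)-(6); weight shapes are (hidden, hidden + input)."""
    X = np.concatenate([h_prev, x_t])                    # Eq. (1): X = [h_{t-1} x_t]
    f_t = sigmoid(W_f @ X + b_f)                         # Eq. (2): forget gate
    i_t = sigmoid(W_i @ X + b_i)                         # Eq. (3): input gate
    o_t = sigmoid(W_o @ X + b_o)                         # Eq. (4): output gate
    c_t = f_t * c_prev + i_t * np.tanh(W_c @ X + b_c)    # Eq. (5): memory cell update
    h_t = o_t * np.tanh(c_t)                             # Eq. (6): hidden state
    return h_t, c_t

# Toy usage with random weights (hidden size 4, input size 3)
rng = np.random.default_rng(0)
H, D = 4, 3
W = {k: rng.uniform(-0.1, 0.1, (H, H + D)) for k in "fioc"}
b = {k: np.zeros(H) for k in "fioc"}
h, c = np.zeros(H), np.zeros(H)
h, c = lstm_step(rng.normal(size=D), h, c, W["f"], W["i"], W["o"], W["c"],
                 b["f"], b["i"], b["o"], b["c"])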
3.3.2 GRU
A Gated Recurrent Unit (GRU) that adaptively remembers and forgets was proposed by [37]. The GRU has reset and update gates that modulate the flow of information inside the unit; unlike the LSTM unit, it has no separate memory cell. Each GRU can be computed as follows:
X = [h_{t-1}  x_t]   (7)

r_t = σ(X W_r^T + b_r)   (8)

z_t = σ(X W_z^T + b_z)   (9)

h_t = (1 − z_t) ⊙ h_{t-1} + z_t ⊙ tanh([r_t ⊙ h_{t-1}  x_t] W^T + b)   (10)
The reset gate filters the information from the previous hidden state, as the forget gate does in the LSTM unit (Eq. (8)); this effectively allows irrelevant information to be dropped, giving a more compact representation. On the other hand, the update gate decides how much the GRU updates its information (Eq. (9)), which is similar to the LSTM. However, the GRU has no mechanism to control the degree to which its state is exposed; it exposes the whole state each time.
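A corresponding minimal sketch of one GRU step, following Eqs. (7)–(10) and reusing the same conventions as the LSTM sketch above, might look as follows; again, the shapes and zero weights are illustrative assumptions.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, h_prev, W_r, W_z, W_h, b_r, b_z, b_h):
    """One GRU step following Eqs. (7)-(10)."""
    X = np.concatenate([h_prev, x_t])                 # Eq. (7)
    r_t = sigmoid(W_r @ X + b_r)                      # Eq. (8): reset gate
    z_t = sigmoid(W_z @ X + b_z)                      # Eq. (9): update gate
    X_tilde = np.concatenate([r_t * h_prev, x_t])     # reset-gated input to the candidate
    h_tilde = np.tanh(W_h @ X_tilde + b_h)            # candidate state
    return (1.0 - z_t) * h_prev + z_t * h_tilde       # Eq. (10)

h = gru_step(np.ones(3), np.zeros(4),
             np.zeros((4, 7)), np.zeros((4, 7)), np.zeros((4, 7)),
             np.zeros(4), np.zeros(4), np.zeros(4))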
The standard LSTM and GRU cannot detect the part of a sentence that is important for aspect-level sentiment classification. To address this issue, [43] proposed an attention mechanism that allows the model to capture the key part of a sentence when different aspects are concerned. The architecture of a gated RNN model with the attention mechanism produces an attention weight vector α and a weighted hidden representation r:
M = tanh([W_h H ; W_v v_a ⊗ e_N])   (11)

α = softmax(W_m M)   (12)

r = H α^T   (13)
where H ∈ R^(d_h × N) is the hidden matrix, d_h is the dimension of the hidden layer, and N is the length of the given sentence; v_a ∈ R^(d_a) is the aspect embedding, and e_N ∈ R^N is an N-dimensional vector of ones; ⊗ denotes the operator that repeats the aspect embedding N times so that it can be stacked with each hidden vector.
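The following NumPy sketch is one possible reading of Eqs. (11)–(13): it repeats the projected aspect embedding across all N positions, scores every hidden state, and returns the attention weights and the weighted representation r. The matrix shapes and random inputs are assumptions for illustration only.

import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def aspect_attention(H, v_a, W_h, W_v, w_m):
    """Attention over hidden states following Eqs. (11)-(13).
    H: (d_h, N) hidden matrix; v_a: (d_a,) aspect embedding."""
    N = H.shape[1]
    Va = np.tile((W_v @ v_a)[:, None], (1, N))   # W_v v_a repeated N times (the "⊗ e_N" part)
    M = np.tanh(np.vstack([W_h @ H, Va]))        # Eq. (11)
    alpha = softmax(w_m @ M)                     # Eq. (12): attention weights over the N tokens
    r = H @ alpha                                # Eq. (13): weighted hidden representation
    return alpha, r

d_h, d_a, N = 4, 3, 6
rng = np.random.default_rng(1)
alpha, r = aspect_attention(rng.normal(size=(d_h, N)), rng.normal(size=d_a),
                            rng.normal(size=(d_h, d_h)), rng.normal(size=(d_h, d_a)),
                            rng.normal(size=2 * d_h))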
In the first trial, Cohen's kappa and Krippendorff's alpha were 0.80 and 0.78, respectively, which is highly acceptable in this study since the scores measure agreement on the overall attribute and polarity. To identify the category with the largest variation between the two coders, Cohen's kappa was calculated separately for each label. The results (Table 1) indicated that polarity had the highest agreement, while the attribute showed lower agreement between the two annotators. At the end of the first trial, both coders discussed the issues they encountered while annotating the corpus and made efforts to improve the preliminary annotation schema. The problems included sentences for which it was difficult to assign aspects.
Based on the revisions of the annotation schema, the coders conducted a second trial. With the revised annotation schema, Cohen's kappa for the attribute and polarity was 0.89 and 0.91, respectively. In addition, Cohen's kappa and Krippendorff's alpha for the aspect-sentiment pair were computed at the end of the second trial, at 0.82 and 0.81 respectively, which indicates that the annotation schema in this study is valid.
Table 1.
Cohen's kappa for categories of aspect and polarity (columns: Attribute, Polarity).

The experiment was conducted on the dataset of TripAdvisor hotel reviews, which contains 5506 sentences; the numbers of positive, neutral, and negative sentiment samples are 3032, 2986, and 2725, respectively. Given a dataset, maximizing the predictive performance and training efficiency of a model requires finding the optimal network architecture and tuning the hyper-parameters. In addition, the composition of the samples can significantly affect the performance of the model. To investigate the effect of the sentiment sample fractions on model performance, four sub-datasets with 4000 sentiment samples and different sentiment fractions were resampled from the TripAdvisor hotel dataset as the train sets: one is a balanced dataset, and three are unbalanced datasets in which the positive, neutral, and negative sentiment samples dominate, respectively. In addition, it is observed that the average number of aspects in a sentence is about 1.4, and the average length of an aspect is about 8.0, which indicates that one sentence normally contains more than one aspect and an aspect contains eight characters on average. The numbers of aspects in the train and test sets are more than 850 and 320, respectively, which confirms the diversity of aspects in the dataset of TripAdvisor hotel reviews. For each train set, 20% of the reviews were selected as the validation set.
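A hedged pandas sketch of this resampling and splitting procedure is given below; the synthetic data frame, column names, and the unbalanced fractions (0.6/0.2/0.2) are assumptions for illustration, since the chapter does not state the exact proportions used.

import numpy as np
import pandas as pd

# Synthetic stand-in for the aspect-sentiment samples (column names are assumptions).
rng = np.random.default_rng(0)
samples = pd.DataFrame({
    "sentence": [f"sent_{i}" for i in range(9000)],
    "sentiment": rng.choice(["positive", "neutral", "negative"], size=9000),
})

def resample_subset(df, fractions, n_total=4000, seed=42):
    """Draw a train set of n_total samples with the requested sentiment fractions."""
    parts = [df[df["sentiment"] == label].sample(int(n_total * frac), random_state=seed)
             for label, frac in fractions.items()]
    return pd.concat(parts).sample(frac=1.0, random_state=seed)  # shuffle

balanced = resample_subset(samples, {"positive": 1/3, "neutral": 1/3, "negative": 1/3})
pos_dominated = resample_subset(samples, {"positive": 0.6, "neutral": 0.2, "negative": 0.2})

# 20% of each train set is held out as the validation set
val = balanced.sample(frac=0.2, random_state=42)
train = balanced.drop(val.index)
print(len(train), len(val))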
Attention-based gated RNN models, including LSTM and GRU, were used for ABSA. Attention-based GRU/LSTM without and with aspect embedding are referred to as AT-GRU/AT-LSTM and ATAE-GRU/ATAE-LSTM, respectively. The details of the configurations and the hyper-parameters used are summarized in Table 2. In the experiments, all word embeddings, with a dimension of 300, were initialized with GloVe [17]. The word embeddings were pre-trained on an unlabeled corpus of about 840 billion tokens. The dimensions of the hidden layer vectors and the aspect embedding are 300 and 100, respectively. The weight matrices are initialized with the uniform distribution U(−0.1, 0.1), and the bias vectors are initialized to zero. The learning rate and mini-batch size are 0.001 and 16, respectively. The best optimizer and number of epochs were selected from {SGD, Adam, AdaBelief} and {100, 300, 500}, respectively, via grid search. The parameters with the best performance on the validation set were kept, and the optimal model was used for evaluation on the test set.
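The grid search over the optimizer and the number of epochs can be organized as in the sketch below; train_and_score is a stand-in stub that returns a fake validation accuracy so the skeleton runs end to end, since the real training loop depends on the model code and is not reproduced here.

import itertools
import random

def train_and_score(optimizer, epochs, lr=0.001, batch_size=16):
    """Stand-in for the real training loop; returns a fake validation accuracy."""
    random.seed(f"{optimizer}-{epochs}")
    return random.uniform(0.6, 0.8)

best = {"accuracy": -1.0}
for optimizer, epochs in itertools.product(["SGD", "Adam", "AdaBelief"], [100, 300, 500]):
    acc = train_and_score(optimizer, epochs)
    if acc > best["accuracy"]:
        best = {"accuracy": acc, "optimizer": optimizer, "epochs": epochs}
print(best)   # configuration with the best validation accuracy is kept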
The aim of training is to minimize the cross-entropy error between the target sentiment distribution y and the predicted sentiment distribution ŷ. However, overfitting is a common issue during training. In order to avoid over-fitting, regularization procedures including L2-regularization, early stopping, and dropout were used in the experiment. L2-regularization adds the "squared magnitude" of the coefficients as a penalty term to the loss function.
Configuration        Hyper-parameter
Dropout              0.5
Mini-batch size      16

Table 2.
Details of configurations and used hyper-parameters.
loss = − Σ_i Σ_j y_i^j log ŷ_i^j + λ‖θ‖²   (15)
where i is the index of the review; j is the index of the sentiment class (the classification in this paper is three-way); λ is the L2-regularization term, which modifies the learning rule to multiplicatively shrink the parameter set on each step before performing the usual gradient update; and θ is the parameter set.
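A minimal NumPy version of the loss in Eq. (15) is sketched below; the λ value, toy targets, and predictions are invented for illustration.

import numpy as np

def l2_regularized_cross_entropy(y_true, y_pred, params, lam=1e-4):
    """Cross-entropy over all reviews and the three sentiment classes plus an
    L2 penalty on the parameter vector, following Eq. (15)."""
    ce = -np.sum(y_true * np.log(y_pred + 1e-12))   # sum over i (reviews) and j (classes)
    l2 = lam * np.sum(params ** 2)                  # lambda * ||theta||^2
    return ce + l2

# Toy example: two reviews, three-way one-hot targets, softmax-like predictions
y_true = np.array([[1, 0, 0], [0, 0, 1]], dtype=float)
y_pred = np.array([[0.7, 0.2, 0.1], [0.2, 0.3, 0.5]])
theta = np.array([0.5, -0.3, 0.8])
print(l2_regularized_cross_entropy(y_true, y_pred, theta))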
On the other hand, early stopping is a commonly used and effective way to avoid over-fitting. It reliably occurs that the training error decreases steadily over time while the validation error eventually begins to rise again. Therefore, early stopping terminates training when no parameters have improved over the best recorded validation error for a pre-specified number of iterations. Additionally, dropout is a simple way to prevent a neural network from overfitting; it refers to temporarily removing units and their connections from the network [55]. In an RNN model, dropout can be implemented on the input, output, and hidden layers. In this study, only the output layer, with a dropout ratio of 0.5, was used, followed by a linear layer that transforms the feature representation into the conditional probability distribution.
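The early-stopping criterion described here can be written as a small generic loop, sketched below with a synthetic loss curve; the patience default mirrors the value used later in this chapter, while everything else is illustrative.

def train_with_early_stopping(train_step, validate, max_epochs=500, patience=5):
    """Generic early-stopping loop: stop when validation loss has not improved
    for `patience` consecutive epochs; train_step/validate are caller-supplied."""
    best_loss, best_epoch, waited = float("inf"), 0, 0
    for epoch in range(1, max_epochs + 1):
        train_step(epoch)
        val_loss = validate(epoch)
        if val_loss < best_loss:
            best_loss, best_epoch, waited = val_loss, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                break
    return best_epoch, best_loss

# Toy usage with a synthetic validation-loss curve that starts rising after epoch 20
losses = [1.0 / e for e in range(1, 21)] + [0.05 + 0.01 * k for k in range(1, 100)]
print(train_with_early_stopping(lambda e: None, lambda e: losses[e - 1]))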
Optimizers are algorithms used to update the attributes of the neural network, such as the parameter set and learning rate, in order to reduce the loss and provide the most accurate results possible. Three optimizers, namely SGD [56], Adam [57], and AdaBelief [58], were used in the experiment to search for the best performance. The standard SGD uses a randomly selected batch of samples from the train set to compute the derivative of the loss, on which the update of the parameter set depends. The updates of standard SGD are noisy because the derivative does not always point toward the minimum; as a result, standard SGD may take longer to converge and may get stuck in local minima. In order to overcome this issue, SGD with momentum was proposed by Polyak [56] (1964), which denoises the derivative by using the previous gradient information in the current update of the parameter set. Given a loss function f(θ) to be optimized, SGD with momentum is given by:

m_t = β m_{t−1} + (1 − β) g_t

θ_{t+1} = θ_t − α m_t
where α > 0 is the learning rate; β ∈ [0, 1] is the momentum coefficient, which decides the degree to which the previous gradient contributes to the update of the parameter set; and g_t = ∇f(θ_t) is the gradient at θ_t.
Both Adam and AdaBelief are adaptive-learning-rate optimizers. Adam records the first moment of the gradient, m_t, which is similar to SGD with momentum, and the second moment of the gradient, v_t, at the same time. m_t and v_t are updated using the exponential moving average (EMA) of g_t and g_t², respectively:

m_t = β_1 m_{t−1} + (1 − β_1) g_t

v_t = β_2 v_{t−1} + (1 − β_2) g_t²

AdaBelief instead tracks s_t = β_2 s_{t−1} + (1 − β_2)(g_t − m_t)², the EMA of the squared deviation of the gradient from its EMA prediction m_t.
The update rules for the parameter set using Adam and AdaBelief are given by Eqs. (21) and (22), respectively:

θ_{t+1} = θ_t − α m_t / (√v_t + ε)   (21)

θ_{t+1} = θ_t − α m_t / (√s_t + ε)   (22)
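For illustration, the three update rules can be sketched in NumPy as follows; the step size and decay rates are typical defaults rather than the chapter's settings, and the bias-correction terms of Adam and AdaBelief are omitted for brevity.

import numpy as np

def sgd_momentum(theta, m, g, alpha=0.001, beta=0.9):
    """SGD with momentum: accumulate an EMA of gradients and step along it."""
    m = beta * m + (1 - beta) * g
    return theta - alpha * m, m

def adam_step(theta, m, v, g, alpha=0.001, b1=0.9, b2=0.999, eps=1e-8):
    """Adam update following Eq. (21) (bias correction omitted for brevity)."""
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g**2
    return theta - alpha * m / (np.sqrt(v) + eps), m, v

def adabelief_step(theta, m, s, g, alpha=0.001, b1=0.9, b2=0.999, eps=1e-8):
    """AdaBelief update following Eq. (22): s tracks the EMA of (g - m)^2."""
    m = b1 * m + (1 - b1) * g
    s = b2 * s + (1 - b2) * (g - m)**2
    return theta - alpha * m / (np.sqrt(s) + eps), m, s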
As for the confusion matrix of a multi-class classification task, accuracy is the most basic evaluation measure of classification. Accuracy represents the proportion of correct predictions of the trained model and can be calculated as:

Accuracy = (Σ_{i=1}^{C} TP_i) / N   (23)

The macro-averaged precision, recall, and F1-score are computed per class and then averaged:

MacroPrecision = (1/C) Σ_{i=1}^{C} TP_i / (TP_i + FP_i)   (24)

MacroRecall = (1/C) Σ_{i=1}^{C} TP_i / (TP_i + FN_i)   (25)

Macro-F1 = 2 × MacroPrecision × MacroRecall / (MacroPrecision + MacroRecall)   (26)

where TP_i, FP_i, and FN_i are the numbers of true positives, false positives, and false negatives for the i-th class, respectively; C is the number of classes, and N is the total number of samples.
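Eqs. (23)–(26) can be computed directly from a confusion matrix, as in the NumPy sketch below; the toy three-class counts are invented for illustration.

import numpy as np

def macro_metrics(confusion):
    """Accuracy and macro precision/recall/F1 from a C x C confusion matrix
    (rows = true class, columns = predicted class), following Eqs. (23)-(26)."""
    confusion = np.asarray(confusion, dtype=float)
    tp = np.diag(confusion)
    fp = confusion.sum(axis=0) - tp
    fn = confusion.sum(axis=1) - tp
    accuracy = tp.sum() / confusion.sum()
    precision = np.mean(tp / (tp + fp))
    recall = np.mean(tp / (tp + fn))
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# Toy 3-way (positive / neutral / negative) confusion matrix
print(macro_metrics([[50, 8, 2], [10, 35, 5], [3, 7, 40]]))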
Figure 1.
Summary of model performance.
This study computed the accuracy (A), macro precision (P), macro recall (R), and macro F1-score (F) of AT-GRU, ATAE-GRU, AT-LSTM, and ATAE-LSTM trained with various optimizers and numbers of epochs. The results show the following: (1) The attention-based models (AT-GRU and AT-LSTM) performed better than the attention-based models with aspect embedding (ATAE-GRU and ATAE-LSTM). Taking Dataset 1 as an example, the best accuracy on the test set using AT-GRU was 80.7%, while the best accuracy using ATAE-GRU was 75.3%. (2) Attention-based GRU performed better than attention-based LSTM. Taking AT-GRU and AT-LSTM as examples, the accuracy and macro F1-score of AT-GRU were higher than those of AT-LSTM for all datasets. (3) The balanced dataset (Dataset 1) achieved the best predictive performance for all models. For the unbalanced datasets, the accuracy was very close to that of the balanced dataset; however, the macro precision, recall, and F1-score were significantly lower than those of the balanced dataset, which confirmed that the balanced dataset had the best generalization and stability in this study. (4) For Dataset 3, in which the neutral sentiment samples dominated, all of the models exhibited the worst predictive performance compared with the other datasets. The candidate model for each dataset is illustrated in Figure 1. Note that the candidate model was selected according to accuracy; when the accuracies of models were similar, however, the model with the higher macro F1-score was selected as the candidate instead. Among the 16 models, AT-GRU trained with the AdaBelief optimizer and 300 epochs on Dataset 1 achieved both the highest accuracy of 80.7% and a macro F1-score of 75.0%. Figure 2 illustrates the normalized confusion matrix of the best predictive model, whose diagonal represents the precisions. The precisions of positive and negative sentiment classification were about 20% higher than that of neutral sentiment classification, which confirms the need to boost the precision of neutral sentiment classification in order to globally improve the accuracy of the model in future work.
Figure 2.
Normalized confusion matrix of model with best predictive performance.

Figure 3.
Learning history of AT-GRU using early stopping.

Early stopping was used in this research to avoid overfitting and save training time. Figure 3 illustrates the learning history of AT-GRU using early stopping in four datasets, where the training stopped when the validation loss kept increasing for 5
epochs (i.e., the "patience" equals 5 in this study). For all datasets, the validation accuracy was very close to the training accuracy during the training procedure, which confirmed that early stopping was able to effectively avoid overfitting. Experimental results of A/P/R/F were obtained by training AT-GRU and AT-LSTM with early stopping. The accuracies obtained by AT-GRU and AT-LSTM were similar. For the balanced dataset, the accuracy and macro F1-score obtained with early stopping were significantly lower than those obtained by the corresponding model without early stopping. This is probably because the training stopped at a local minimum of the loss function when the validation loss had been rising for 5 epochs. All of the optimizers used in this study aim to prevent the loss function from getting stuck in local minima and to find the global minimum; therefore, using more epochs in training was effective for obtaining the model with the best predictive performance. On the other hand, for the unbalanced datasets, the accuracy and macro F1-score obtained with early stopping were similar to those obtained by the corresponding model without early stopping, which indicated that early stopping was effective at avoiding overfitting as the loss converged quickly on the unbalanced datasets. Although early stopping is a straightforward way of avoiding overfitting and improving training efficiency, the trade-off is that the model returned for the test set may correspond to a local minimum of the loss function, especially for the balanced dataset, and a new hyper-parameter, "patience", which is sensitive to the results, is introduced.
Three optimizers were used in this study to find the best model. Figure 4 illustrates the learning history of AT-GRU in four datasets. For Adam, the gap between training and validation accuracy was the largest, which indicated the worst generalization among the three optimizers in this study, although Adam converged quickly at the very beginning except for Dataset 3. Both SGD and AdaBelief can achieve good predictive performance with good generalization; however, AdaBelief converged faster than SGD, and the best results were achieved by AdaBelief.

Figure 4.
Learning history of AT-GRU.
In this study, a hotel review dataset collected from TripAdvisor for aspect-level sentiment classification was first established. The dataset contains 5506 sentences, in which the numbers of positive, neutral, and negative sentiment samples are 3032, 2986, and 2725, respectively. In order to study the effect of the fraction of sentiment samples on model performance, four sub-datasets with various fractions of sentiment samples were resampled from the TripAdvisor hotel review dataset as the train sets. The task in this study is to determine the aspect polarity of a given review with the corresponding aspects. To achieve good predictive performance on this multi-class classification task, attention-based GRU and LSTM (AT-GRU and AT-LSTM), as well as attention-based GRU and LSTM with aspect embedding (ATAE-GRU and ATAE-LSTM), were optimized with SGD, Adam, and AdaBelief and trained with 100, 300, and 500 epochs. Conclusions from these experiments are as follows:

1. The attention-based models (AT-GRU and AT-LSTM) performed better than the attention-based models with aspect embedding (ATAE-GRU and ATAE-LSTM).

2. Attention-based GRU performed better than attention-based LSTM: the accuracy and macro F1-score of AT-GRU were higher than those of AT-LSTM for all datasets.
3. The balanced dataset achieved the best predictive performance. For the unbalanced datasets, the accuracy was very close to that of the balanced dataset; however, the macro precision, recall, and F1-score were significantly lower than those of the balanced dataset, which confirmed that the balanced dataset had the best generalization and stability in this study. For the dataset in which the neutral sentiment samples dominated, all of the models exhibited the worst predictive performance.

4. For the balanced dataset, the accuracy and macro F1-score obtained with early stopping were significantly lower than those obtained by the corresponding model without early stopping. However, for the unbalanced datasets, the accuracy and macro F1-score obtained with early stopping were similar to those obtained without early stopping, which indicates that early stopping was effective at avoiding overfitting, as the loss converged quickly on the unbalanced datasets.

5. For the optimizers, both SGD and AdaBelief can achieve good predictive performance with good generalization; however, AdaBelief converged faster than SGD, and the best results were achieved by AdaBelief.
1. Enlargement of the dataset. This study focused on hotels in Macao, annotating 5506 sentences from TripAdvisor reviews. To improve the model performance, hotels from other countries and regions can be added to the dataset.
Author details
© 2021 The Author(s). Licensee IntechOpen. This chapter is distributed under the terms of
the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0),
which permits unrestricted use, distribution, and reproduction in any medium, provided
the original work is properly cited.
References

[1] Gretzel U, Yoo KH. Use and impact of online travel reviews. In: Information and Communication Technologies in Tourism 2008. Vienna: Springer Vienna; 2008. p. 35–46.

[9] Do HH, Prasad PWC, Maag A, Alsadoon A. Deep learning for aspect-based sentiment analysis: A comparative review. Expert Syst Appl. 2019;118:272–99.

[19] Wu H, Gu Y, Sun S, Gu X. Aspect-based opinion summarization with convolutional neural networks. In: 2016 International Joint Conference on Neural Networks (IJCNN). IEEE; 2016.

[20] Liu P, Joty S, Meng H. Fine-grained opinion mining with recurrent neural networks and word embeddings. In: Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing. Stroudsburg, PA, USA: Association for Computational Linguistics; 2015.

[21] Bengio Y, Schwenk H, Senécal JS, Morin FM, Gauvain JL. Neural Probabilistic Language Models. Heidelberg: Springer; 2006.

[22] Peters M, Neumann M, Iyyer M, Gardner M, Clark C, Lee K, Zettlemoyer L. Deep contextualized word representations. In: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 2018. p. 2227–37.

[23] Devlin J, Chang M-W, Lee K, Toutanova K. BERT: Pre-training of deep bidirectional Transformers for language understanding. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT). 2019.

[26] Ma Y, Peng HY, Cambria E. Targeted aspect-based sentiment analysis via embedding commonsense knowledge into an attentive LSTM. In: The Thirty-Second AAAI Conference on Artificial Intelligence (AAAI-18). 2018. p. 5876–83.

[27] Hochreiter S, Schmidhuber J. Long short-term memory. Neural Comput. 1997;9(8):1735–80.

[28] Goldberg Y. Neural network methods for natural language processing. Synth Lect Hum Lang Technol. 2017;10(1):1–309.

[29] Poria S, Cambria E, Gelbukh A, Bisio F, Hussain A. Sentiment data flow analysis by means of dynamic linguistic patterns. IEEE Comput Intell Mag. 2015;10(4):26–36.

[30] Xue W, Li T. Aspect based sentiment analysis with gated convolutional networks. In: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Stroudsburg, PA, USA: Association for Computational Linguistics; 2018.

[32] Xu L, Lin J, Wang L, Yin C, Wang J. Deep convolutional neural network-based approach for aspect-based sentiment analysis. In: Science & Engineering Research Support Society; 2017.

[33] Gu X, Gu Y, Wu H. Cascaded convolutional neural networks for aspect-based opinion summary. Neural Process Lett. 2017;46(2):581–94.

[34] Goldberg Y. A primer on neural network models for natural language processing. J Artif Intell Res. 2016;57:345–420.

[35] Bengio Y. Deep Learning. London, England: MIT Press; 2016.

[36] Tang D, Qin B, Liu T. Aspect level sentiment classification with deep memory network. In: Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing. Stroudsburg, PA, USA: Association for Computational Linguistics; 2016.

[37] Cho K, van Merrienboer B, Gulcehre C, Bahdanau D, Bougares F, Schwenk H, et al. Learning phrase representations using RNN encoder–decoder for statistical machine translation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). Stroudsburg, PA, USA: Association for Computational Linguistics; 2014.

[38] Graves A. Supervised sequence labelling. In: Studies in Computational Intelligence. Berlin, Heidelberg: Springer; 2012.

[40] Chaudhuri A, Ghosh SK. Sentiment analysis of customer reviews using robust hierarchical bidirectional recurrent neural network. In: Advances in Intelligent Systems and Computing. Cham: Springer International Publishing; 2016. p. 249–61.

[41] Chen T, Xu R, He Y, Wang X. Improving sentiment analysis via sentence type classification using BiLSTM-CRF and CNN. Expert Syst Appl. 2017;72:221–30.

[42] Luong T, Pham H, Manning CD. Effective approaches to attention-based neural machine translation. In: Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing. Stroudsburg, PA, USA: Association for Computational Linguistics; 2015.

[43] Wang Y, Huang M, Zhu X, Zhao L. Attention-based LSTM for aspect-level sentiment classification. In: Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing. Stroudsburg, PA, USA: Association for Computational Linguistics; 2016.

[44] Zeng J, Ma X, Zhou K. Enhancing attention-based LSTM with position context for aspect-level sentiment classification. IEEE Access. 2019;7:20462–71.

[45] He R, Lee WS, Ng HT, Dahlmeier D. Effective attention modeling for aspect-level sentiment classification. In: Proceedings of the 27th International Conference on Computational Linguistics. 2018.

[47] Cheng J, Zhao S, Zhang J, King I, Zhang X, Wang H. Aspect-level sentiment classification with HEAT (HiErarchical ATtention) network. In: Proceedings of the 2017 ACM Conference on Information and Knowledge Management – CIKM '17. New York, NY, USA: ACM Press; 2017.

[48] Wang S, Mazumder S, Liu B, Zhou M, Chang Y. Target-sensitive memory networks for aspect sentiment classification. In: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Stroudsburg, PA, USA: Association for Computational Linguistics; 2018.

[49] Chang Y-C, Ku C-H, Chen C-H. Using deep learning and visual analytics to explore hotel reviews and responses. Tour Manag. 2020;80:104129.

[50] Gao J, Yao R, Lai H, Chang T-C. Sentiment analysis with CNNs built on LSTM on tourists' comments. In: 2019 IEEE Eurasia Conference on Biomedical Engineering, Healthcare and Sustainability (ECBIOS). IEEE; 2019.

[51] Pontiki M, Galanis D, Pavlopoulos J, Papageorgiou H, Androutsopoulos I, Manandhar S. SemEval-2014 Task 4: Aspect based sentiment analysis. In: Proceedings of the 8th International Workshop on Semantic Evaluation (SemEval 2014). Stroudsburg, PA, USA: Association for Computational Linguistics; 2014.

[53] Ganu G, Elhadad N, Marian A. Beyond the stars: Improving rating predictions using review text content. In: Twelfth International Workshop on the Web and Databases (WebDB). 2009. Available from: http://spidr-ursa.rutgers.edu/resources/WebDB.pdf

[54] Moreno-Ortiz A, Salles-Bernal S, Orrequia-Barea A. Design and validation of annotation schemas for aspect-based sentiment analysis in the tourism sector. Inf Technol Tour. 2019;21(4):535–57.

[55] Srivastava N, Hinton G, Krizhevsky A, Sutskever I, Salakhutdinov R. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research. 2014;15:1929–58.

[56] Polyak BT. Some methods of speeding up the convergence of iteration methods. USSR Comput Math Math Phys. 1964;4(5):1–17.

[57] Kingma DP, Ba J. Adam: A method for stochastic optimization. arXiv [cs.LG]. 2014. Available from: http://arxiv.org/abs/1412.6980

[58] Zhuang J, Tang T, Ding Y, Tatikonda S, Dvornek N, Papademetris X, et al. AdaBelief optimizer: Adapting stepsizes by the belief in observed gradients. arXiv [cs.LG]. 2020. Available from: http://arxiv.org/abs/2010.07468
Chapter 12
Abstract
Natural disasters can occur anytime and anywhere, especially in areas with high disaster risk. The earthquake, followed by a tsunami and liquefaction, that struck Palu, Indonesia, at the end of 2018 caused tremendous damage. In recent years, rehabilitation and reconstruction projects have been implemented to restore the situation and accelerate economic growth. A study is needed to determine whether the rehabilitation and reconstruction carried out over the past three years have met community satisfaction. The results of further analysis are expected to predict the level of community satisfaction for the implementation of other rehabilitation and reconstruction efforts. The method used in this paper is predictive modeling with a data mining (DM) approach. Data were collected from all rehabilitation and reconstruction activities in Palu, Sigi, and Donggala within the scope of the earthquake, tsunami, and liquefaction disasters. The analysis results show that an Artificial Neural Network (ANN) and a Support Vector Machine (SVM) with a DM approach can be used to develop a community satisfaction prediction model for the implementation of rehabilitation and reconstruction after the earthquake-tsunami and liquefaction disasters.
1. Introduction
The Palu earthquake in Indonesia on September 28, 2018, caused severe damage with a very broad impact. At the time of this writing, the atmosphere of grief and trauma of the people affected directly and indirectly has begun to fade. The earthquake exhibited a complete set of phenomena: fault movement, tsunami, landslides, and liquefaction. Simultaneous liquefaction in several locations is unique in the world. This liquefaction phenomenon has received worldwide attention because the mudflow during liquefaction devastated infrastructure and housing on a massive scale [1].
Based on their topographic, geological, and seismological conditions, Palu City and its surroundings can suffer damage due to earthquakes, including secondary disasters (tsunami, liquefaction, and cliff landslides). The earthquake in Palu on May 20, 1938, with a magnitude of 7.6 on the Richter scale, was a previous incident with many fatalities. By studying, analyzing, and estimating all the supporting factors and the potential for disasters of such magnitude, the government needs to empower all components of society. The role of stakeholders in providing ideas and recommendations is no less important, so that before and after an earthquake disaster occurs, communities are better prepared psychologically and physically to reduce the impact of the disaster [2].
After a disaster with such a significant impact, various parties immediately carried out rehabilitation and reconstruction work, one of which concerned transportation infrastructure. The rehabilitation and reconstruction works include several roads, the handling of roads affected by liquefaction (including drainage systems), the construction of retaining walls, the construction and maintenance of bridges, and the construction of access roads to permanent residences for disaster victims. The rehabilitation and reconstruction were implemented in stages, starting from recovery and trauma healing, through permanent planning, up to the overall reconstruction. The trauma healing stage is the starting point of the rehabilitation and reconstruction that is directly related to the community [3].
The implementation of rehabilitation and reconstruction after the natural disasters had not yet been completed when, in early 2020, the Palu area could not avoid the non-natural disaster that plagued the whole world, namely the COVID-19 pandemic. This condition added pressure on completing all stages of rehabilitation and reconstruction, especially on work productivity, which was directly impacted by restrictions on labor movement. The decline in performance was mainly due to limited employee interactions, together with concerns about and the potential risk of being exposed to the coronavirus. COVID-19 is transmitted by droplets shed when an infected person coughs or exhales; the released droplets then fall on nearby objects and surfaces, thereby contaminating the surrounding environment [4].
Mitigation management and natural disaster recovery are an inseparable series of activities, from planning, mitigation, trauma healing, rehabilitation, and reconstruction to the socio-cultural recovery of the community. The speed and accuracy of planning play an essential role in the success of post-disaster management. A thorough understanding and mapping are required to determine a plan that can be implemented appropriately in the field. Planning and implementation of the work must consider the latest conditions, taking into account the potential for recurring disasters. A thorough and well-targeted evaluation is required to ensure that the rehabilitation and reconstruction process runs according to the community's expectations. One possible evaluation is to measure community satisfaction at the job site. Because community satisfaction is one of the essential elements in measuring the success of rehabilitation and reconstruction, the valuable experience from this disaster can be developed into a community satisfaction prediction model. The model that is built is expected to improve the process of implementing rehabilitation and reconstruction in other activities.
2. Literature review
Apart from being famous for its wealth and natural beauty, Indonesia is also a country prone to disasters. This is because Indonesia lies in a dynamic volcanic area on the boundary of continental plates. This position also causes the shape of Indonesia's relief to vary widely, from mountains with steep slopes to gently sloping areas along very long coastlines, all of which are susceptible to landslide, flood, abrasion, and tsunami hazards. Various hydrometeorological conditions also bring threats of flooding and landslides, hurricanes or tornadoes, drought-related forest fires, and so on. Another threat is disasters caused by various technological failures.
Among the Indonesian regions with a reasonably high risk of natural disasters, Sulawesi Island is a complex area. Sulawesi is located at the meeting place of three large plates: the Indo-Australian Plate moving north, the Pacific Plate moving west, and the Eurasian Plate moving south-southeast, together with the smaller Philippine Plate. Sulawesi, a young island in Indonesia, is located where subduction and collisions are still active. Based on the existing rock blocks, the island of Sulawesi can be divided into three geological areas. The first is West Sulawesi, where Tertiary deposits and magmatic rocks are the dominant parts. The second is Central and Southeast Sulawesi, consisting mainly of rocks from the early Cretaceous era. The third is East Sulawesi, where an ophiolitic nappe covers Mesozoic and Paleozoic sedimentary rocks [5].
Palu City is one of the capital cities in Sulawesi with a high risk of disaster. Palu is also crossed by a significant fault that clearly divides the city at the surface. This fault is often referred to as the Palu-Koro fault, originally called the Fossa Sarassina fault. All geologists and geophysicists who are familiar with the Palu-Koro fault agree that this fault is active. An active fault will experience earthquakes at the same location over a certain period, and several studies show repeated earthquakes over hundreds and thousands of years [6]. These faults are thought to be the reason that the history of earthquakes in the area is quite long. The history of earthquakes in central Sulawesi has been recorded since the 19th century. Several well-recorded major earthquakes occurred in 1968 (magnitude 6.7), 1993 (magnitude 5.8), and 2005 (magnitude 6.2). Meanwhile, tsunamis occurred in 1927 in Palu Bay with a wave height of 15 m, in 1968 in Malaga with a height of 10 m, and in 1996 in Simuntu Pangalaseang with a height of 3.4 m [7].
These conditions make Palu's vulnerability to earthquakes very high. Earthquake vulnerability was studied by conducting a microtremor test in Palu City based on the magnitude 6.3 earthquake of January 23, 2005, whose epicenter was reported by the United States Geological Survey (USGS) [5]. The microtremor survey was used to estimate the distribution of strong earthquake vibrations; from the survey, the peak acceleration, velocity, and earthquake susceptibility index were obtained. From these observations, it can be concluded that Palu City has soil conditions with a shear wave velocity Vs < 300 m/s. The peak acceleration can reach more than 400 gal, resulting in significant damage to buildings. The microtremor research also found that the vulnerability index in hilly areas is low, whereas the earthquake vulnerability index in the alluvium area is very high.
Currently, soft computing methods are developed by mimicking processes found in nature, such as the brain and natural selection [18]. Soft computing techniques make it possible to process data while coping with uncertainty, imprecision, and ambiguity. In the early 1960s, a new branch of computer science began to attract the attention of many scientists. This new branch, referred to as artificial intelligence (AI), can be defined as the study of how to make computers do things that, at the moment, people do better. The AI approach has encouraged the development of soft computing in various fields, one of which is the development of data mining.
The information technology industry is developing rapidly, and the knowledge gained from data collection is proliferating. Large databases are not a problem if computer technology, with its various primary and supporting applications, can be taken advantage of. All data collected and stored in a suitable database can yield precious knowledge (for example, trend models and behavior models) that supports decision-making and optimizes action [19]. Classical statistics has limitations in analyzing large amounts of data or complex relationships between data variables. The solution to this problem is to develop computer-based data analysis tools with greater capabilities and a higher degree of automation [20]. With the development of semi-automatic approaches in various fields of science, recent decades have seen growth across disciplines such as AI, statistics, and information systems. This field is formally defined as knowledge discovery in databases (KDD); in its development, KDD has become increasingly known as DM [21].
One step in developing a community satisfaction prediction model for rehabilitation and reconstruction is processing the satisfaction data for each stage in a KDD process to form a DM prediction model. DM is a logical combination of data knowledge and statistical analysis, developed within knowledge or business processes, that uses statistical techniques, mathematics, artificial intelligence, and machine learning to extract and identify valuable information and related knowledge from large databases. The DM approach continues to be developed in various scientific fields, and in recent times the use of DM for predicting social problems has been increasing [22]. At the KDD stage, the DM algorithm is equipped with a dataset used during the learning phase to develop a data-driven model. The model can be described as the relationship between input and output, which can provide helpful information. Understanding and deepening the relevant scientific field has an essential influence on the success of designing the DM algorithm; a database is only a meaningless set of data if it is not approached with an appropriate algorithm [23]. Furthermore, Fu also noted that reviews carried out in the last few years show that DM's ability is growing in specific domains and depends on continuously developing specific algorithms. In simple cases, domain science can help identify the right features to model the data that underlie the compilation of scientific databases. Knowledge can also help design business goals that can be achieved using in-depth database analysis.
In this study, the database collects data on various satisfaction variables before, during, and after rehabilitation and reconstruction. The stages summarized in a post-disaster management system can be defined, and algorithms can be compiled to provide real information support for improving mitigation management. The development of such a system has a significant impact on the scientific development of disaster management; even if the prediction accuracy is only modest, it is still better than random guessing. The availability of a complete database can provide a better and more reliable satisfaction prediction model [24].
3. Research method
This study develops a community satisfaction prediction model with the DM approach, without any restrictive assumptions, using input data sourced from questionnaire results. The preparation of the community satisfaction prediction model with DM follows the stages and processes below. The first is cleaning and examining the data that can be used in the prediction model. The data cleaning process includes deleting inappropriate and irrelevant data from the database; it can include correcting writing errors, ensuring that the writing format remains consistent, and deleting records with incomplete data.
The second is checking the data. The first step is to make a histogram or bar chart to determine the frequency of each variable. After that, the relationships between the variables must be found. Knowing the distribution of and correlation between the existing variables helps researchers choose the proper form of the data and evaluate the model to be formed more efficiently. During data checking, discrepancies and inaccuracies can be found, so further data cleaning may be required. The level of correlation refers to the relationship between two variables. A high level of correlation indicates that the two variables are closely related: if one of the variables changes, the other variable will also change proportionally, and if the variables are continuous, they will form a line when plotted together. A low level of correlation indicates that the two variables change randomly and are not related, with most of the data falling between two extreme values. The correlation level test is presented through the correlation matrix.
The third is choosing the type of model. After considering each type of model previously studied (deterministic, probabilistic, and artificial intelligence), this research develops an AI-based model. Developing the community satisfaction model is carried out through iteration stages, changing aspects of the model to form the best model based on the available data. Model development is done by adjusting these aspects to the type of model and the available software. Several factors influence the shape of the model, among others: the basic equation, the variables used in the model, and the grouping of these variables.
The fourth is finding the parameter values. Determination of values and parameters is required in model development. In general, this step is completed using an optimization algorithm. However, for simple models (for example, a linear regression model using the least squares method), the values can be optimized manually using a spreadsheet program. The rminer library provides a complete set of options for determining the parameter values with the command > contribution.
Finally, after the parameter values are obtained and the model has been formed, the model must be evaluated. The evaluation method depends on the type of model selected. If, after evaluation, the model is not feasible, the form of the model must be changed and redeveloped; if the evaluation concludes that the model type is unsuitable for the available data, the model type itself must be reconsidered. There are several ways to evaluate statistical models. One of the initial actions that must be considered in evaluating a model is estimating the parameter values, which must be reasonable and significant.
3.3 R Tools
The hyper-parameters in SVM include the kernel parameter γ, searched over the range {2^(−15), 2^(−13), …, 2^3} by minimizing the 5-fold cross-validation error [26].
To complement the modeling with the ANN and SVM algorithms, an MR model was tested in this study as a comparison. The entire set of DM algorithms, consisting of ANN, SVM, and MR, is implemented with the R tool (R Development Core Team, 2009) and the rminer library [28]. Furthermore, before fitting the ANN, SVM, and MR models, all data are examined with standard statistics, and the outputs are then checked with the inverse transformation.
As study material, this paper uses data from the earthquake of September 28, 2018, in Palu, Sigi, and Donggala. This choice takes into account that the disaster had a reasonably broad damage impact. In general, the damage can be divided into several phenomena. One of them is damage caused by fault movement, fractures, and earthquake shaking. The fault movement is an offset in which the left side moved north and the right side shifted south. The largest shear on the right side is about 4 m, while the left side shifted north by about 3 m; this shift is visible on Google Maps. Naturally, buildings traversed by the fault suffered significant damage, as did those affected by soil fractures, which can result from the movement of faults (or reactivated faults) with a smaller offset. Earthquake shaking takes the form of both horizontal and vertical vibrations. In general, in Palu City, the damage due to shaking was not severe, except for buildings of low quality.
The second is the phenomenon of damage due to the tsunami. The impact of a tsunami results from inundation (submerged buildings) and tsunami currents (the speed or force acting to push or pull buildings). The impact of current velocity is mainly the scouring of the subgrade; if it is loose sand, the erosion rate is very high. Generally, buildings with shallow foundations fail because the scour reaches the base of the foundation. Relatively light buildings are easily carried away by the flow of water. Further damage is caused by debris, such as cars and ships, carried by the tsunami, so collisions with these objects often result in heavy damage.
Lastly is the phenomenon of damage due to liquefaction. There are four to five locations that are quite prominent and wide, namely Balaroa, Petobo, Jono Oge, Lolu village (also in Jono Oge), and Sibalaya. Although liquefaction in the form of sand boils also occurred at some spots, it was not prominent and was not recorded. In addition, landslides in the sea can occur due to liquefaction; this kind of landslide is induced by liquefaction. The landslides in Balaroa and Sibalaya were liquefaction-induced landslides. It is possible that the submarine landslides in Palu Bay, which contributed to the tsunami, had the same mechanism as in Sibalaya.
This section presents the modeling framework and procedures used to develop the ANN and SVM models. The process is similar to traditional modeling, where the goal is to estimate a set of coefficients of a particular function. The main objective of the ANN model in this study is to obtain a set of matrices that constitute abstract basic knowledge of the available data after going through the training loop. However, to use an ANN for solving real-world problems, it is necessary to design a framework that follows the characteristics of the problem. The framework design aims to define the required ANN architecture and the relationships between the components in the framework. After completing the framework design, the next stage is to design the architecture of each ANN sub-model. The ANN architectural design process is a decision-making process, which includes determining the number of layers, the number of neurons in each layer, and the variables entered into the input layer and the output layer. After completing the ANN architectural design, the design results need to be tested and validated.
In general, a neural network is made up of millions (or more) of basic structures of interconnected and integrated neurons, so that they can carry out activities regularly and continuously as needed. The imitation of a neuron in an artificial neural network is a processing element that functions like a neuron: each input signal is multiplied by the corresponding weight w, the products are summed, and the result is passed into the activation function to obtain the output signal f(a, w). Although still far from perfect, the behavior of this artificial neuron is analogous to that of the biological cell we know today. The collection of neurons is assembled into a network that functions as a computational tool; the number of neurons and the network structure differ for each problem solved.
Furthermore, the model is developed by activating the entire network in the ANN. Activating an artificial neural network means activating every neuron used in that network. Many functions can be used as activation functions, such as goniometric and hyperbolic functions, unit step functions, impulses, the sigmoid, etc. Among the commonly used functions, the sigmoid is preferred because it is considered closest to the performance of the human brain. The activation process of the algorithm during iteration can be monitored, and its movement pattern can be observed.
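A minimal sketch of this single-neuron computation with a sigmoid activation is shown below; the input values and weights are invented for illustration.

import numpy as np

def neuron(inputs, weights, bias=0.0):
    """A single artificial neuron: weighted sum of inputs passed through a sigmoid."""
    a = np.dot(inputs, weights) + bias          # weighted sum of the input signals
    return 1.0 / (1.0 + np.exp(-a))             # sigmoid activation f(a, w)

print(neuron(np.array([0.2, 0.7, 0.1]), np.array([0.5, -0.3, 0.8])))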
In contrast to the neural network strategy, which seeks any hyperplane that separates the classes, SVM tries to find the best hyperplane in the input space. The basic principle of SVM is a linear classifier, further developed to work on non-linear problems by incorporating the kernel trick in a high-dimensional feature space. This development has encouraged research in modeling to explore the potential capabilities of SVM, both theoretically and in terms of application. Currently, SVM has been successfully applied to real-world problems and, in general, provides better solutions than conventional methods.
The model built is verified using data from questionnaires collected around the rehabilitation and reconstruction projects. The questionnaire dataset includes 625 responses from 2 rehabilitation and reconstruction projects and 25 input parameters, referred to as influencing parameters in an empirical study of community satisfaction. These parameters are given a sequence code based on the pre-during-post stage as input, as shown in Table 1 below. All data were obtained based on the level of importance and level of performance of each parameter as rated by the respondents.
The data are then formed into three subsets that can be used directly for learning, testing, and validation. The database is first divided into two datasets: the first includes all the information used for learning and testing, while the second, collected from the questionnaires, is reserved for validation purposes. The dataset used for learning and testing is further divided into two subsets: one containing 80% of the data, used for learning, and one containing 20% of the data, used for testing. The validation data are statistically independent of the data used during learning and testing because the dataset is separated before the validation process. Therefore, verifying the DM model with a separate dataset can be considered a control to check the performance of the DM model. The learning process is carried out with a large number of epochs (10,000). The iteration process produces an ANN model with optimal weights between neurons.
After the learning phase is complete, model development proceeds to the test stage, which checks the effectiveness of the learning process. The test dataset becomes the DM input, and this stage reuses the learned model recorded in the DM application during training. The test process computes the resulting error rate; if the error remains within an acceptable level, the DM model is considered reasonable. Model accuracy is compared using the average MSE values obtained during the test phase, and the DM model with the lowest MSE and the highest R2 is selected. Once the learning and test process is complete, the model is verified and validated against the prepared data using the community satisfaction prediction model obtained from learning and testing; a different subset of the data was selected for model validation.
Table 1. Input code (excerpt).
No.  Code  Parameter
10   B2    Labor availability
17   C2    The current state of the road & bridge compared to the past
Result
25   CS    Community Satisfaction
influenced by the power of the data-driven model for this purpose. When the DM black box is implemented with the ANN, SVM, and MR algorithms, which involve complex mathematical expressions, the data-driven application procedure must translate the model into interpretable terms. In this case, model interpretation is carried out to obtain a measure of the input variables of the community satisfaction prediction model.
The first stage of model interpretation is to establish confidence in the ability and accuracy of the model. The community satisfaction prediction model, with community satisfaction as the leading prediction parameter, is first checked for modeling accuracy. There are several methods for evaluating predictive models; one of them uses the sum of absolute errors. The mean of the absolute errors, often referred to as the mean absolute deviation or MAD, measures forecasting accuracy by averaging the absolute values of the forecast errors. MAD is useful for analyzing and measuring the prediction error in the same units as the original data. In addition, the modeling criteria are stated in terms of the RMSE, where a smaller RMSE (closer to 0) indicates a better prediction model.
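In standard notation, with y_i the observed community satisfaction, \hat{y}_i the model prediction, and n the number of records, the two criteria are:

$$\mathrm{MAD} = \frac{1}{n}\sum_{i=1}^{n}\left|y_i - \hat{y}_i\right|, \qquad \mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}$$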
This model is structured with a 95% confidence level according to the Student's t-distribution. All DM models with the ANN, SVM, and MR algorithms are trained using 12 input variable attributes. Figure 1 shows the predictive capacity of all the trained models, comparing their performance in predicting the value of community satisfaction based on MAD, RMSE, and R2. The figure shows that the value of community satisfaction can be predicted accurately by each of the three DM models, especially by the ANN and SVM models.
Figure 1.
Performance measured.
Figure 1 shows the standard error and R2 for each model developed. The DM model with the SVM algorithm has the smallest MAD and RMSE values and the highest R2 value. The prediction models with the ANN and SVM algorithms are acceptable and can be used to calculate community satisfaction predictions because their R2 values are close to 1. The community satisfaction prediction model used in the remainder of this study is therefore the DM model with the SVM algorithm.
Another DM technique, association rule mining, can find associative rules among combinations of items. Two parameters determine the importance of an associative rule: support, the percentage of records in the database that contain the combination of attributes, and confidence, the strength of the relationship between the attributes in the rule. Following the generate-and-test paradigm, the algorithm used in this study generates candidate combinations of attributes according to specific rules and then tests them. A combination of attributes that meets the minimum support requirement is called a frequent itemset, which is then used to create rules that meet the minimum confidence requirement.
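In the usual notation, for a rule A ⇒ B over a database D of records (with T a single record), these two parameters are defined as:

$$\mathrm{support}(A \Rightarrow B) = \frac{\left|\{\,T \in D : A \cup B \subseteq T\,\}\right|}{|D|}, \qquad \mathrm{confidence}(A \Rightarrow B) = \frac{\mathrm{support}(A \cup B)}{\mathrm{support}(A)}$$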
Analysis of Figure 2 (the scatterplot of the community satisfaction values predicted by the SVM algorithm against the questionnaire results) shows that the selected variables have a significant relationship with changes in the questionnaire community satisfaction value. Figure 2a shows the scatterplot of the learning results of the SVM model, and Figure 2b shows the results of the validation stage. In the validation stage, the rminer library is used to describe and obtain the relative contribution of each input variable. The confirmed model has R2, MAD, and RMSE values at the performance validation stage as reported in Figure 1, with 20 runs performed, using the best hyperparameters for a well-fitted SVM model: ε = 0.07 ± 0.01 and γ = 0.05 ± 0.00, whereas the ANN used H = 3 ± 1 hidden nodes.
Figure 2.
Community satisfaction prediction outputs. a. Learning stage, b. validation stage.
Furthermore, the regression analysis used in the DM is interpreted. The rminer package provides a graphical interpretation tool, the REC curve, in which the error tolerance is depicted on the x-axis and the percentage of predictions falling within that tolerance is depicted on the y-axis. The resulting curve describes the error level in the form of a cumulative distribution function (CDF). The error level is defined as the difference between the predicted community satisfaction value f(x) and the actual community satisfaction at every coordinate (x, y); the error metric is mapped either as the squared residual (y − f(x))^2 or as the absolute deviation |y − f(x)|. Figure 3 shows the REC curves of the community satisfaction models built with the MR, ANN, and SVM algorithms.
Figure 3.
The regression error characteristic curve.
In Figure 3, the REC curve describes the error tolerance on the x-axis and the accuracy level of the regression function on the y-axis. The accuracy level is defined as the percentage of modeling results that fall within the specified tolerance. If the tolerance is zero, only exact predictions are counted as meeting the model requirements; if the maximum tolerance is chosen, the remaining values are also counted towards the accuracy. The REC curve makes clear that accuracy trades off against tolerance: the greater the tolerance, the higher the accuracy. Conceptually, the model that reaches the highest accuracy at the lowest tolerance has the best REC value.
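As a sketch of how such a REC curve can be drawn with rminer (continuing the placeholder objects from the earlier sketches; the call is illustrative, not the chapter's code):

```r
# Plot the regression error characteristic (REC) curve of the SVM
# predictions on the test set: x-axis = absolute error tolerance,
# y-axis = fraction of predictions within that tolerance.
P_svm <- predict(M_svm, test)
mgraph(test$CS, P_svm, graph = "REC")
```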
The REC plot depicts the three different models. It shows that the SVM model consistently achieves the highest accuracy at the smallest tolerance values. This REC curve covers the entire iteration process, with 20 runs of the SVM model using the hyperparameters mentioned in the previous section. The shape of the REC curve can change when different hyperparameters or a different number of iteration runs are used.
The DM model developed can assess the contribution of each variable and attribute used as input data in the model. In this study, the variables or attributes consist of A1-C9, and all attributes are grouped into the three dimensions pre, during, and post. A parameter vector in this DM model is chosen to emphasize that it is a function of the variables rather than a set of fixed parameters as in the parametric approach; the only condition on the variance function is that it must generate a non-negative definite variance matrix. Several methods can be used to estimate the hyperparameter values; in this DM, the value of θ is estimated using cross-validation. The hyperparameters (H and γ) are searched over H ∈ {2, 4, …, 10} and γ ∈ {2^-15, 2^-13, …, 2^3}. These values produce the most precise model with optimal run time; for further model development, other hyperparameter values can be tried.
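A hedged sketch of such a search over γ, reusing the placeholder objects from the earlier sketches, is shown below. For simplicity it scores each candidate on the test split rather than by full cross-validation, and it assumes that rminer's fit passes kernel parameters through to the underlying kernlab SVM (where γ corresponds to the RBF sigma); both are simplifying assumptions, not the chapter's procedure.

```r
# Simple grid search over the RBF kernel parameter: each candidate gamma
# is evaluated by the RMSE of the resulting SVM on the test split.
gammas <- 2^seq(-15, 3, by = 2)
rmse <- sapply(gammas, function(g) {
  M <- fit(CS ~ ., data = learn, model = "ksvm", kpar = list(sigma = g))
  mmetric(test$CS, predict(M, test), metric = "RMSE")
})
best_gamma <- gammas[which.min(rmse)]
print(best_gamma)
```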
The contribution of each attribute and dimension is expressed as its relative importance in composing the model. The contribution values found by the DM can be summarized and displayed as in Figure 4, which shows the relative importance on the x-axis for each attribute and dimension on the y-axis of the community satisfaction prediction models built with the SVM, ANN, and MR algorithms.
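A sketch of how such relative importances can be obtained with rminer's sensitivity-analysis function, again using the placeholder model and data from the earlier sketches:

```r
# 1-D sensitivity analysis of the fitted SVM: the relative importance of
# each input attribute, which can be plotted as a bar chart like Figure 4.
I <- Importance(M_svm, data = learn)
imp <- round(I$imp, 3)
names(imp) <- names(learn)
print(sort(imp, decreasing = TRUE))
```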
Based on Figure 4 below, each parameter has an almost even effect on community satisfaction in disaster management. Using the model considered the fittest, namely SVM, the most significant attributes are the comfort of the road and bridge compared to before (C4) and collaboration between local communities in reconstruction and rehabilitation (A5), followed by the access road to the residence compared to before the reconstruction and rehabilitation (C8), participation in the reconstruction and rehabilitation process (A4), and community participation in the reconstruction and rehabilitation (B7). The pre-rehabilitation and reconstruction stage is the most critical dimension affecting community satisfaction.
Figure 4.
Relative importance.
The next model analysis compiles an algorithm to select the main dimensions that affect the community satisfaction model and analyzes the supporting variables that affect the community satisfaction prediction model but are not accommodated in it. The results of the VEC analysis illustrate the influence of the main attributes that move dynamically in the SVM community satisfaction prediction model, led by information and socialization about reconstruction and rehabilitation (A1), a pre-rehabilitation and reconstruction attribute. Community satisfaction decreased with the time elapsed since the reconstruction program began (A2) and with the role of the facilitator in the reconstruction and rehabilitation process (B1); conversely, community satisfaction improved with the access road to the residence compared to before the reconstruction and rehabilitation (C8).
5. Conclusion
The modeling process with the DM approach using the SVM, ANN, and MR algorithms produces a community satisfaction prediction model with reasonably good performance. The three model algorithms are compared against the questionnaire results, and the REC curve shows the accuracy of each model used. Based on the resulting error metrics, the SVM model is considered the best model for predicting community satisfaction, achieving good consistency with a low number of iterations (20 runs). The most critical parameter in preparing the community satisfaction prediction model is the comfort of the road and bridge compared to before. Each attribute that affects the community satisfaction prediction model is successfully described by the relative importance algorithm.
Acknowledgements
The authors are grateful to the editor and reviewers for their constructive comments on an earlier version of this chapter. This research was supported by the Directorate General of Highways, and the authors would like to thank their colleagues at Universitas Internasional Batam, Indonesia.
Author details
© 2021 The Author(s). Licensee IntechOpen. This chapter is distributed under the terms of
the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0),
which permits unrestricted use, distribution, and reproduction in any medium, provided
the original work is properly cited.
Edited by Ciza Thomas
The availability of big data due to computerization and automation has generated an
urgent need for new techniques to analyze and convert big data into useful information
and knowledge. Data mining is a promising and leading-edge technology for mining
large volumes of data, looking for hidden information, and aiding knowledge discovery.
It can be used for characterization, classification, discrimination, anomaly detection,
association, clustering, trend or evolution prediction, and much more in fields such as
science, medicine, economics, engineering, computers, and even business analytics.
This book presents basic concepts, ideas, and research in data mining.
ISSN 2633-1403
978-1-83969-266-6
ISBN 978-1-83969-268-0
Published in London, UK
© 2022 IntechOpen
© your_photo / iStock