DATA ANALYSIS AND INFORMATION PROCESSING
Edited by:
Jovan Pehcevski
Arcler Press
www.arclerpress.com
Data Analysis and Information Processing
Jovan Pehcevski
Arcler Press
224 Shoreacres Road
Burlington, ON L7L 2H2
Canada
www.arclerpress.com
Email: [email protected]
This book contains information obtained from highly regarded resources. Reprinted
material sources are indicated. Copyright for individual articles remains with the
authors as indicated and published under the Creative Commons License. A wide variety
of references are listed. Reasonable efforts have been made to publish reliable data;
the views articulated in the chapters are those of the individual contributors, and not
necessarily those of the editors or publishers. The editors and publishers are not
responsible for the accuracy of the information in the published chapters or the
consequences of their use. The publisher assumes no responsibility for any damage or
grievance to persons or property arising out of the use of any materials, instructions,
methods or thoughts in the book. The editors and the publisher have attempted to trace
the copyright holders of all material reproduced in this publication and apologize to
copyright holders if permission has not been obtained. If any copyright holder has not
been acknowledged, please write to us so we may rectify this.
Notice: Registered trademarks of products or corporate names are used only for
explanation and identification, without intent to infringe.
Arcler Press publishes a wide variety of books and eBooks. For more information about
Arcler Press and its products, visit our website at www.arclerpress.com
DECLARATION
Some content or chapters in this book are open access, copyright-free published
research works, which are published under a Creative Commons License and are
indicated with the citation. We are thankful to the publishers and authors of the
content and chapters, as without them this book would not have been possible.
ABOUT THE EDITOR
List of Contributors........................................................................................xv
List of Abbreviations..................................................................................... xxi
Preface..................................................................................................xxiii
Chapter 4 Big Data Analytics for Business Intelligence in Accounting and Audit..... 87
Abstract.................................................................................................... 87
Introduction.............................................................................................. 88
Machine Learning..................................................................................... 90
Data Analytics.......................................................................................... 93
Data Visualization..................................................................................... 97
Conclusion............................................................................................... 98
Acknowledgements.................................................................................. 99
References.............................................................................................. 100
Chapter 6 Integrated Real-Time Big Data Stream Sentiment Analysis Service........ 125
Abstract.................................................................................................. 125
Introduction............................................................................................ 126
Related Works........................................................................................ 130
Architecture of Big Data Stream Analytics Framework............................. 131
Sentiment Model.................................................................................... 133
Experiments............................................................................................ 142
Conclusions............................................................................................ 146
Acknowledgements................................................................................ 147
References.............................................................................................. 148
Overview of Big Data Technology........................................................... 202
Requirements on Auditing in the Era of Big Data..................................... 203
Application of Big Data Technology in Audit Field.................................. 205
Risk Analysis of Big Data Audit............................................................... 210
Conclusion............................................................................................. 211
References.............................................................................................. 212
Chapter 12 Different Data Mining Approaches Based Medical Text Data................ 231
Abstract.................................................................................................. 231
Introduction............................................................................................ 232
Medical Text Data................................................................................... 232
Medical Text Data Mining....................................................................... 233
Discussion.............................................................................................. 246
Acknowledgments.................................................................................. 247
References.............................................................................................. 248
Chapter 14 Research on Realization of Petrophysical Data Mining
Based on Big Data Technology............................................................... 273
Abstract.................................................................................................. 273
Introduction............................................................................................ 274
Analysis of Big Data Mining of Petrophysical Data.................................. 274
Mining Based on K-Means Clustering Analysis........................................ 278
Conclusions............................................................................................ 283
Acknowledgements................................................................................ 284
References.............................................................................................. 285
Neural Network Optimization Method and its Research in
Information Processing.................................................................. 330
Neural Network Optimization Method and its Experimental
Research In Information Processing............................................... 337
Neural Network Optimization Method and its Experimental
Research Analysis in Information Processing................................. 338
Conclusions............................................................................................ 347
Acknowledgments.................................................................................. 348
References.............................................................................................. 349
Index...................................................................................................... 389
LIST OF CONTRIBUTORS
Amira Khattak
Prince Sultan University, Riyadh, Saudi Arabia
Noreen Jamil
National University of Computer and Emerging Sciences, Islamabad, Pakistan
M. Asif Naeem
National University of Computer and Emerging Sciences, Islamabad, Pakistan
Auckland University of Technology, Auckland, New Zealand
Farhaan Mirza
Auckland University of Technology, Auckland, New Zealand
Abdullah Z. Alruhaymi
Department of Electrical Engineering and Computer Science, Howard University,
Washington D.C, USA.
Charles J. Kim
Department of Electrical Engineering and Computer Science, Howard University,
Washington D.C, USA.
André Ribeiro
INESC-ID/Instituto Superior Técnico, Lisbon, Portugal
Afonso Silva
INESC-ID/Instituto Superior Técnico, Lisbon, Portugal
Jing Sun
Cancer Vaccine Center, Dana-Farber Cancer Institute, Harvard Medical School, Boston,
MA 02115, USA
Lou Chitkushev
Department of Computer Science, Metropolitan College, Boston University, Boston,
MA 02215, USA
Vladimir Brusic
Department of Computer Science, Metropolitan College, Boston University, Boston,
MA 02215, USA
Danielle Aring
Department of Electrical Engineering and Computer Science, Cleveland State
University, Cleveland, USA
Haya Smaya
Mechanical Engineering Faculty, Institute of Technology, MATE Hungarian University
of Agriculture and Life Science, Gödöllő, Hungary
Wang Zhao Shun
School of Information and Communication Engineering, University of Science and
Technology Beijing (USTB), Beijing, China
Beijing Key Laboratory of Knowledge Engineering for Material Science, Beijing, China
Guanfang Qiao
WUYIGE Certified Public Accountants LLP, Wuhan, China
Ibrahim Ba’abbad
Department of Information Technology, Faculty of Computing and Information
Technology, King Abdulaziz University, Jeddah, KSA.
Thamer Althubiti
Department of Information Technology, Faculty of Computing and Information
Technology, King Abdulaziz University, Jeddah, KSA.
Abdulmohsen Alharbi
Department of Information Technology, Faculty of Computing and Information
Technology, King Abdulaziz University, Jeddah, KSA.
Khalid Alfarsi
Department of Information Technology, Faculty of Computing and Information
Technology, King Abdulaziz University, Jeddah, KSA.
Saim Rasheed
Department of Information Technology, Faculty of Computing and Information
Technology, King Abdulaziz University, Jeddah, KSA.
Wenke Xiao
School of Medical Information Engineering, Chengdu University of Traditional Chinese
Medicine, Chengdu 611137, China
Lijia Jing
School of Pharmacy, Chengdu University of Traditional Chinese Medicine, Chengdu
611137, China
Yaxin Xu
School of Medical Information Engineering, Chengdu University of Traditional Chinese
Medicine, Chengdu 611137, China
Shichao Zheng
School of Medical Information Engineering, Chengdu University of Traditional Chinese
Medicine, Chengdu 611137, China
Yanxiong Gan
School of Medical Information Engineering, Chengdu University of Traditional Chinese
Medicine, Chengdu 611137, China
Chuanbiao Wen
School of Medical Information Engineering, Chengdu University of Traditional Chinese
Medicine, Chengdu 611137, China
Mustapha Ismail
Management Information Systems Department, Cyprus International University,
Haspolat, Lefkoşa via Mersin, Turkey
Muesser Nat
Management Information Systems Department, Cyprus International University,
Haspolat, Lefkoşa via Mersin, Turkey
Yu Ding
School of Computer Science, Yangtze University, Jingzhou, China
Key Laboratory of Exploration Technologies for Oil and Gas Resources (Yangtze
University), Ministry of Education, Wuhan, China
Rui Deng
Key Laboratory of Exploration Technologies for Oil and Gas Resources (Yangtze
University), Ministry of Education, Wuhan, China
School of Geophysics and Oil Resource, Yangtze University, Wuhan, China
Chao Zhu
The Internet and Information Center, Yangtze University, Jingzhou, China
Xiang Fu
School of Physical Education Guangdong Polytechnic Normal University, Guangzhou
510000, China
Ye Zhang
School of Physical Education Guangdong Polytechnic Normal University, Guangzhou
510000, China
Ling Qin
School of Physical Education Guangdong Polytechnic Normal University, Guangzhou
510000, China
R. Zhang
Department of Quantity Survey, School of Construction Management and Real Estate,
Chongqing University, Chongqing, China.
A. M. M. Liu
Department of Real Estate and Construction, Faculty of Architecture, The University of
Hong Kong, Hong Kong, China
I. Y. S. Chan
Department of Real Estate and Construction, Faculty of Architecture, The University of
Hong Kong, Hong Kong, China
Pin Wang
School of Mechanical and Electrical Engineering, Shenzhen Polytechnic, Shenzhen
518055, Guangdong, China
Peng Wang
Garden Center, South China Botanical Garden, Chinese Academy of Sciences,
Guangzhou 510650, Guangdong, China
En Fan
Department of Computer Science and Engineering, Shaoxing University, Shaoxing
312000, Zhejiang, China
Rick Quax
Computational Science Lab, University of Amsterdam, Amsterdam, Netherlands
Gregor Chliamovitch
Department of Computer Science, University of Geneva, Geneva, Switzerland
Alexandre Dupuis
Department of Computer Science, University of Geneva, Geneva, Switzerland
Jean-Luc Falcone
Department of Computer Science, University of Geneva, Geneva, Switzerland
Bastien Chopard
Department of Computer Science, University of Geneva, Geneva, Switzerland
Alfons G. Hoekstra
Computational Science Lab, University of Amsterdam, Amsterdam, Netherlands
ITMO University, Saint Petersburg, Russia
Peter M. A. Sloot
Computational Science Lab, University of Amsterdam, Amsterdam, Netherlands
ITMO University, Saint Petersburg, Russia
Complexity Institute, Nanyang Technological University, Singapore
PREFACE
Over the past few decades, the development of information systems in larger enterprises
was accompanied by the development of data storage technology. Initially, the
information systems of individual departments were developed independently of each
other, so that, for example, the finance department had a separate information system
from the human resources department. The so-called ‘information islands’ were created,
among which the flow of information was not established. If a company has offices in
more than one country, until recently it was the practice for each country to have a
separate information system, which was necessary due to differences in legislation,
local customs and the problem of remote customer support. Such systems often had
different data structures. The problem arose with reporting, as there was no easy way
to aggregate data from diverse information systems to get a picture of the state of the
entire enterprise.
The main task of information engineering was to merge separate information systems
into one logical unit, from which unified data can be obtained. The first step in the
unification process is to create a company model. This consists of the following steps:
defining data models, defining process models, identifying participants, and determining
the flow of information between participants and systems (data flow diagram).
The problem of unavailability of information in practice is bigger than it may seem.
Certain types of businesses, especially non-profit-oriented ones, can operate in this
way. However, a large company, which sells its main product on the market for a
month at the wrong price, due to inaccurate information obtained by the management
from a poor information system, will surely find itself in trouble. The organization’s
dependence on quality information from the business system grows with its size and
the geographical dislocation of its offices. Full automation of all business processes is
now the practical standard in some industries. Examples are airline reservation systems,
or car manufacturer systems where it is possible to start the production of car models
with accessories directly from the showroom according to the customer’s wishes. The
fashion industry, for example, must effectively follow fashion trends (analyze sales of
different clothing models by region) in order to respond quickly to changes in consumer
habits.
This edition covers different topics from data analysis and information processing,
including: data analytics methods, big data methods, data mining methods, and
information processing methods.
Section 1 focuses on data analytics methods, describing data analytics in mental
healthcare, a case study on data analytics and machine learning accuracy, a survey
from a big data perspective on data modeling and data analytics, big data analytics for
business intelligence in accounting and audit, and a knowledge-based approach for big
data analytics in immunology.
Section 2 focuses on big data methods, describing integrated real-time big data stream
sentiment analysis service, the influence of big data analytics in the industry, big data
usage in the marketing information system, a review of big data for organizations, and
application research of big data technology in audit field.
Section 3 focuses on data mining methods, describing a short review of classification
algorithms accuracy for data prediction in data mining applications, different data mining
approaches based on medical text data, the benefits and challenges of data mining in
electronic commerce, and research on realization of petrophysical data mining based on
big data technology.
Section 4 focuses on information processing methods, describing application of spatial
digital information fusion technology in information processing of national traditional
sports, the effects of quality and quantity of information processing on design
coordination performance, a neural network optimization method and its application in
information processing, and information processing features that can detect behavioral
regimes of dynamical systems.
SECTION 1:
DATA ANALYTICS METHODS
Chapter 1
Data Analytics in Mental Healthcare

Ayesha Kamran Ul Haq1, Amira Khattak2, Noreen Jamil1, M. Asif Naeem1,3, and Farhaan Mirza3

1 National University of Computer and Emerging Sciences, Islamabad, Pakistan
2 Prince Sultan University, Riyadh, Saudi Arabia
3 Auckland University of Technology, Auckland, New Zealand
ABSTRACT
Worldwide, about 700 million people are estimated to suffer from mental
illnesses. In recent years, due to the extensive growth rate in mental disorders,
it is essential to better understand the inadequate outcomes from mental
health problems. Mental health research is challenging given the perceived
limitations of ethical principles such as the protection of autonomy, consent,
threat, and damage. In this survey, we aimed to investigate studies where big
data approaches were used in mental illness and treatment. Firstly, different
types of mental illness, for instance, bipolar disorder, depression, and
Citation: Ayesha Kamran Ul Haq, Amira Khattak, Noreen Jamil, M. Asif Naeem, Farhaan Mirza, “Data Analytics in Mental Healthcare”, Scientific Programming, vol. 2020, Article ID 2024160, 9 pages, 2020. https://doi.org/10.1155/2020/2024160.
Copyright: © 2020 by Authors. This is an open access article distributed under the
Creative Commons Attribution License, which permits unrestricted use, distribution,
and reproduction in any medium, provided the original work is properly cited.
INTRODUCTION
Recently the term “big data” has become exceedingly popular all over the
world.
Over the last few years, big data has started to set foot in the healthcare
system. In this context, scientists have been working on improving public
health strategies, medical research, and the care provided to patients by
analyzing big datasets related to their health.
Data comes from different sources, such as providers (pharmacy and patient
history) and nonproviders (cell phones and internet searches). One of the
outstanding possibilities of utilizing huge data is evident in the healthcare
industry. Healthcare organizations have a large quantity of information
available to them, and a large portion of it is unstructured and clinically
applicable. The use of big data is expected to grow in the medical field,
and it will continue to pose lucrative opportunities for solutions that can
help in saving the lives of patients. Big data needs to be interpreted
correctly in order to predict future data so that the final result can be
estimated. To solve this problem, researchers are working on AI algorithms
that can analyze huge quantities of raw data and extract useful information
from it. There is a variety of AI algorithms used to predict patient disease
by observing past data. A variety of wearable sensors have been developed to
deal with both physical and social interactions practically.
Mental health of a person is measured by a high grade of affective disorder
which results in major depression and different anxiety disorders. There are
many conditions which are recognized as mental disorders including anxiety
disorder, depressive disorder, mood disorder, and personality disorder.
There are lots of mobile apps, smart devices like smartwatches, and smart
bands which increase healthcare facilities in mobile mental healthcare
systems. Personalized psychiatry also plays an important role in predicting
bipolar disorder and improving diagnosis and optimizing treatment. Most
of the smart techniques are not pursued due to lack of resources, especially
in developing countries. In Pakistan, for example, only 0.1% of the government
health budget is spent on the mental health system. There is a need
LITERATURE REVIEW
There are many mental disorders, such as bipolar disorder, depression, and
different forms of anxiety. Bauer et al. [1] conducted a paper-based survey
in which 1222 patients from 17 countries participated, to detect bipolar
disorder in adults. The survey was translated into 12 different languages,
with the limitation that it did not contain any question about technology
usage in older adults. According to Bauer et al. [1], digital treatment is
not suitable for older adults with bipolar disorder.
Researchers are also exploring an interesting and unique method of checking
the personality of a person just by looking at the way he or she uses the
mobile phone. De Montjoye [2] collected a dataset from a US research
university and created a framework that analyzed phone calls and text
messages to assess the personality of the user. Participants who made 300
calls or texts per year failed to complete the personality measures. The
authors chose an optimal sample size of 69, with mean age = 30.4, SD = 6.1,
and one missing value. Similarly, Bleidorn and Hopwood [3] adopted a
comprehensive machine learning approach to test the personality of the user
using social media and digital records. They provided nine main
recommendations for how to combine machine learning techniques to enhance
Big Five personality assessment. Focusing on minor details of the user helps
comprehend and validate the results. Digital mental health has been
revolutionized and its innovations are growing at a high rate. The National
Health Service (NHS) has recognized its importance in mental healthcare
and is looking for innovations to provide services at low cost. Hill et al. [4]
presented a study of challenges and considerations in innovations in digital
mental healthcare. They also suggested collaboration between clinicians,
industry workers, and service users so that these challenges can be overcome
and successful innovations of e-therapies and digital apps can be developed.
There are many mobile apps and smart devices, such as smartwatches, smart
bands, and smart shirts, which extend healthcare facilities in the mobile
healthcare system. A variety of wearable sensors have been developed to deal
with both physical and social interactions practically. Combining artificial
intelligence with healthcare systems takes healthcare facilities to the next
level. Dimitrov [5] conducted a systematic survey on the mobile internet of
things and devices that allow businesses to emerge, spread productivity
improvements, control costs, and improve the customer experience in a
positive way. Similarly, Monteith et al. [6] performed a paper-based survey
on clinical data mining, analyzing different data sources to obtain
psychiatry data and to prioritize opportunities for psychiatry.
One of the machine learning algorithms named artificial neural network
(ANN) is based on three-layer architecture. Kellmeyer [7] introduced
a way to secure big brain data from clinical and consumer-directed
neurotechnological devices using ANN. But this model needs to be trained
on a huge amount of data to get accurate results. Jiang et al. [8] designed
elaborated in their article that the big data available for use in automated
decision-making is doubling in size every two years. Passos et al. [16]
believed that the long-established connection between doctor and patient will
change with the establishment of big data and machine learning models. An ML
algorithm can allow an affected person to observe his fitness from time to
time and can tell the doctor about his current condition if it becomes worse.
Early consultation with the doctor could prevent any bigger loss for the
patient.
If a psychiatric disease is not predicted or handled early, it may push the
patient toward harmful acts such as suicide, as most suicide attempts are
related to mental disorders. Kessler et al. [17] proposed a meta-analysis
that focused on suicide incidence within 1 year of self-harm using machine
learning algorithms. They analyzed the past reports of suicide patients and
concluded that no prediction could be made due to the short duration of
psychiatric hospitalizations. Although a number of AI algorithms are used to
estimate patient disease by observing past data, the focus of all studies was
related to suicide prediction by setting up a threshold. Defining a threshold
is very crucial and sometimes even impossible. Cleland et al. [18] reviewed
many studies but were unable to discover principles to clarify the threshold.
The authors used a random-effects model to generate a meta-analytic ROC
curve. On the basis of correlation results, they state that depression
prevalence is a mediating factor between economic deprivation and
antidepressant prescribing.
Another side effect of mental disease is drug addiction. Early drug
prediction is possible by analyzing user data. Opioids are a severe type of
drug. Hasan et al. [19] explored the Massachusetts All Payer Claims Data (MA
APCD) dataset and examined how naïve users develop opioid use disorder. A
popular machine learning algorithm was tested to predict the risk of such
dependency in patients. Perdue et al. [20] predicted the ratio of drug
abusers by comparing Google Trends data with Monitoring the Future (MTF)
data in a well-structured study. It is concluded that Google Trends and MTF
data provided combined support for detecting drug abuse.
adults. Data was collected from 187 older adults and 1021 younger adults,
with missing observations excluded. The survey contained 39 questions and
took 20 minutes to complete. Older adults with bipolar disorder used the
internet less regularly than the younger ones. As most healthcare services
are available only online and most digital tools and devices have evolved,
the survey has the limitation that it did not contain any question about
technology usage in older adults. There is a need for proper treatment of a
disordered person. The mood of the patient is one of the parameters for
detecting his/her mental health. Table 1 describes another approach to
personality assessment using a machine learning algorithm that focused on
other aspects, like systematic fulfillment, and argued to enhance the
validity of the machine learning (ML) approach. Technological advancement in
the medical field will promote personalized treatments. A lot of work has
been done in the field of depression detection using social networks.
Table 1 (excerpt):

Hill et al. [4] | Mental disorder | Keywords: mental health, collaborative computing, and e-therapies | Techniques: (i) online CBT platform; (ii) collaborative computing | 33 | Outcomes: (i) developing a smartphone application; (ii) for mental disorder; (iii) for improving e-therapies

Kellmeyer [7] | Big brain data | Keywords: brain data, neurotechnology, big data, privacy, security, and machine learning | Techniques: (i) machine learning; (ii) consumer-directed neurotechnological devices; (iii) combining experts with a bottom-up process | 77 | Outcomes: (i) maximizing medical knowledge; (ii) enhancing the security of devices and sheltering the privacy of personal brain data

Furnham [21] | Personality disorder | Keywords: dark side, big five, facet analysis, dependence, and dutifulness | Technique: Hogan 'dark side' measure (HDS) concept of dependent personality disorder (DPD) | 34 | Outcome: all of the personality disorders are strongly negatively associated with agreeableness (a type of kind, sympathetic, and cooperative personality)
at low cost using useful information extracted by the big data tool MongoDB
and a genetic algorithm.
In Table 1, some of the techniques handle and store huge amounts of data.
Using the MongoDB tool, researchers are working to predict the mental
condition before a severe mental stage is reached. Some devices have
introduced a complete detection process to assess the present condition of
the user by analyzing his/her daily life routine. There is a need for
reasonable solutions that detect the disabling stage of a mental patient more
precisely and quickly.
Personality Disorder
Dutifulness is a type of personality disorder in which patients are
overstressed about a disease that is not actually serious. People with this
type of disorder tend to work hard to impress others. A survey was conducted
to find the relationship between normal and dutiful personalities. Other
researchers are working on the interesting and unique method of checking the
personality of a person just by looking at the way he or she uses the mobile
phone. This approach provides cost-effective and questionnaire-free
personality detection through mobile phone data, performing personality
assessment without conducting any digital survey on social media. Performing
all nine main aspects of the construct validation in real time is not easy
for researchers. This examination, like several others, has limitations: it
is just a sample, which has implications for generalization when used in a
near-real-time scenario, which may be tough for researchers.
Table 2: Side effects of mental illness and their solution through data science

Perdue et al. [20] | Drug abuse | (i) Google search history; (ii) Monitoring the Future (MTF) | Providing real-time data that may allow us to predict drug abuse tendency and respond more quickly

Hasan et al. [19] | Opioid use disorder | (i) Feature engineering; (ii) logistic regression; (iii) random forest; (iv) gradient boosting | Suppressing the increasing rate of opioid addiction using machine learning algorithms
Suicide
Suicide is very common in underdeveloped countries. According to
researchers, someone dies by suicide every 40 seconds somewhere in the
world. There are some areas of the world where mental disorder and suicide
statistics are relatively higher than in other areas.
Psychiatrists say that 90% of people who died by suicide faced a mental
disorder. Electronic medical records and big data can be used to study
suicide through machine learning algorithms. Machine learning algorithms can
be used to predict suicides in depressed persons; it is hard to estimate how
accurately they perform, but they may help a consultant pretreat patients
based on early prediction. Various studies depict the fact that a range of
factors, such as a high level of antidepressant prescribing, caused such
prevalence of illness. Some people start antidepressant medicine to overcome
mental affliction. In Table 1, Cleland et al. [18] explored three main
factors, i.e.,
Drug Abuse
People take drugs voluntarily, but most of them become addicted in order to
get rid of all their problems and feel relaxed. Adderall divinorum, Snus,
synthetic marijuana, and bath salts are the novel drugs. Opioids are a
category of drug that includes the illicit drug heroin. Hasan et al. [19]
compared four machine learning algorithms, logistic regression, random
forest, decision tree, and gradient boosting, to predict the risk of opioid
use disorder. Random forest is one of the best classification methods among
machine learning algorithms. It was found that in such situations random
forest models outperform the other three algorithms, especially for
determining the features. There is another approach to predicting drug
abusers using the search history of the user. Perdue et al. [20] predicted
the ratio of drug abusers by comparing Google Trends data with Monitoring
the Future (MTF) data in a well-structured study. It is concluded that
Google Trends and MTF data provided combined support for detecting drug
abuse.
Google Trends appears to be a particularly useful data source regarding
novel drugs, because Google is the first place where many users, especially
adults, go for information on topics with which they are unfamiliar. Google
Trends does not predict heroin abuse well; the reason may be that heroin is
uniquely dangerous compared with other drugs. According to Granka [23],
internet searches can be understood as behavioral measures of an
individual's interest in an issue. Unfortunately, this technique is not very
convenient, as drug abuse researchers are unable to predict drug abuse
successfully because of sparse data.
worked perfectly on long-term data rather than short-term data, but they used
offline data transfer instead of real time.
Although it has different sensors, garbage data being added to the sensors
is very likely. This is an application that offers on-hand record management
using mobile/tablet technology once security and privacy are confirmed. To
increase the reliability of IoT devices, there is a need to increase the
sample size with different age groups in a real-time environment to check the
validity of the experiment.
There are many technologies that produce tracking data, like smartphones,
credit cards, social media, and sensors. This paper discussed some of the
existing work to tackle such data. In Table 3, one of the approaches is a
human-made algorithm; searching for disease symptoms, visiting disease
websites, sending/receiving healthcare e-mail, and sharing health information
on social media generate this kind of data. These are some examples of
activities that play key roles in producing medical data.
catch all possible outcomes. Hadoop works on cloud computing, which helps to
accomplish different operations on distributed data in a systematic manner.
The success rate of the above approach was around 70%, but the authors
performed these tasks using two programming languages: Python code for
extracting tweets and Java to train the data, which required expert
programmers in each language. It will help doctors give more accurate
treatment for several mental disorders in less time and at low cost. In fact,
this approach provides pre-detection of depression that may prevent the
patient from facing the worst stage of mental illness.
CONCLUSIONS
Big data are being used for mental health research in many parts of the world
and for many different purposes. Data science is a rapidly evolving field that
offers many valuable applications to mental health research, examples of
which we have outlined in this perspective.
We discussed different types of mental disorders and their reasonable,
affordable, and possible solutions to enhance mental healthcare facilities.
Currently, the digital mental health revolution is advancing beyond the pace
of scientific evaluation, and it is very clear that clinical communities need
to catch up. Various smart healthcare systems and devices have been developed
that reduce the death rate of mental patients and, through early prediction,
prevent patients from engaging in harmful activities.
This paper examines different prediction methods. Various machine learning
algorithms are popular for training on data in order to predict future data.
Random forest, Naïve Bayes, and k-means clustering are popular ML algorithms.
Social media is one of the best sources of data gathering, as the mood of the
user also reveals his/her psychological behavior. In this survey, various
advances in data science and their impact on the smart healthcare system are
points of consideration. It is concluded that there is a need for a
cost-effective way to predict intellectual condition instead of acquiring
costly devices. Twitter data is utilized through the saved and live tweets
accessible via the application program interface (API). In the future,
connecting the Twitter API with Python and then applying sentiment analysis
to the 'posts,' 'liked pages,' 'followed pages,' and 'comments' of the Twitter
user will provide a cost-effective way to detect depression for target
patients.
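As an illustration of this direction (not part of the original study), the following Python sketch scores a few already-collected tweets with NLTK's VADER sentiment analyzer; the tweet texts are hypothetical, and pulling live tweets through the Twitter API is assumed to have been done separately:

```python
# Illustrative sketch: rule-based sentiment scoring of collected tweets with
# NLTK's VADER analyzer; strongly negative scores could flag users for a
# closer depression-screening follow-up.
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download("vader_lexicon", quiet=True)   # one-time lexicon download
analyzer = SentimentIntensityAnalyzer()

tweets = [                                   # hypothetical texts standing in for API results
    "I feel hopeless and can't sleep anymore",
    "Had a great walk with friends today, feeling much better",
]

for text in tweets:
    compound = analyzer.polarity_scores(text)["compound"]   # -1 (negative) to +1 (positive)
    label = "negative" if compound < -0.05 else "positive" if compound > 0.05 else "neutral"
    print(f"{label:8s} {compound:+.3f}  {text}")
```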
ACKNOWLEDGMENTS
The authors are thankful to Prince Sultan University for the financial support
towards the publication of this paper.
REFERENCES
1. R. Bauer, T. Glenn, S. Strejilevich et al., “Internet use by older adults
with bipolar disorder: international survey results,” International
Journal of Bipolar Disorders, vol. 6, no. 1, p. 20, 2018.
2. Y.-A. De Montjoye, J. Quoidbach, F. Robic, and A. Pentland,
“Predicting personality using novel mobile phone-based metrics,” in
Proceedings of the International Conference on Social Computing,
Behavioral-Cultural Modeling, and Prediction, pp. 48–55, Berlin,
Heidelberg, April 2013.
3. W. Bauer and C. J. Hopwood, “Using machine learning to advance
personality assessment and theory,” Personality and Social Psychology
Review, vol. 23, no. 2, pp. 190–203, 2019.
4. C. Hill, J. L. Martin, S. Thomson, N. Scott-Ram, H. Penfold, and C.
Creswell, “Navigating the challenges of digital health innovation:
considerations and solutions in developing online and smartphone-
application-based interventions for mental health disorders,” British
Journal of Psychiatry, vol. 211, no. 2, pp. 65–69, 2017.
5. D. V. Dimitrov, “Medical internet of things and big data in healthcare,”
Healthcare Informatics Research, vol. 22, no. 3, pp. 156–163, 2016.
6. S. Monteith, T. Glenn, J. Geddes, and M. Bauer, “Big data are coming
to psychiatry: a general introduction,” International Journal of Bipolar
Disorders, vol. 3, no. 1, p. 21, 2015.
7. P. Kellmeyer, “Big brain data: on the responsible use of brain data
from clinical and consumer-directed neurotechnological devices,”
Neuroethics, vol. 11, pp. 1–16, 2018.
8. L. Jiang, B. Gao, J. Gu et al., “Wearable long-term social sensing for
mental wellbeing,” IEEE Sensors Journal, vol. 19, no. 19, 2019.
9. S. Yang, B. Gao, L. Jiang et al., “IoT structured long-term wearable
social sensing for mental wellbeing,” IEEE Internet of Things Journal,
vol. 6, no. 2, pp. 3652–3662, 2018.
10. S. Monteith and T. Glenn, “Automated decision-making and big data:
concerns for people with mental illness,” Current Psychiatry Reports,
vol. 18, no. 12, p. 112, 2016.
11. S. Goyal, “Sentimental analysis of twitter data using text mining and
hybrid classification approach,” International Journal of Advance
Research, Ideas and Innovations in Technology, vol. 2, no. 5, pp.
2454–132X, 2016.
23. L. Granka, “Inferring the public agenda from implicit query data,”
in Proceedings of the 32nd International ACM SIGIR Conference on
Research and Development in Information Retrieval, Boston, MA,
USA, July 2009.
24. V. Sinha, 2019, https://www.quora.com/What-are-the-main-differences-between-artificial-intelligence-and-machine-learning-Is-machine-learning-a-part-of-artificial-intelligence.
25. Lenhart A., Purcell K., Smith A., Zickuhr K., Social media & mobile
internet use among teens and young adults. Millennials, Pew Internet
& American Life Project, Washington, DC, USA, 2010.
Chapter 2
Case Study on Data Analytics and Machine Learning Accuracy

Abdullah Z. Alruhaymi and Charles J. Kim

Department of Electrical Engineering and Computer Science, Howard University, Washington D.C., USA
ABSTRACT
The information gained after the data analysis is vital to implement its
outcomes to optimize processes and systems for more straightforward
problem-solving. Therefore, the first step of data analytics deals with
identifying data requirements, mainly how the data should be grouped or
labeled. For example, for data about Cybersecurity in organizations, grouping
can be done into categories such as DOS denial of services, unauthorized
access from local or remote, and surveillance and another probing. Next,
after identifying the groups, a researcher or whoever carrying out the data
analytics goes out into the field and primarily collects the data. The data
collected is then organized in an orderly fashion to enable easy analysis; we
aim to study different articles and compare performances for each algorithm
to choose the best suitable classifies.
Citation: Alruhaymi, A. and Kim, C. (2021), “Case Study on Data Analytics and Ma-
chine Learning Accuracy”. Journal of Data Analysis and Information Processing, 9,
249-270. doi: 10.4236/jdaip.2021.94015.
Copyright: © 2021 by authors and Scientific Research Publishing Inc. This work
is licensed under the Creative Commons Attribution International License (CC BY).
http://creativecommons.org/licenses/by/4.0
INTRODUCTION
Data Analytics is a branch of data science that involves the extraction of
insights from data to gain a better understanding. It entails all the techniques,
data tools, and processes involved in identifying trends and measurements
that would otherwise be lost in the enormous amount of information that is
available and constantly being generated in the world today. Grouping the
dataset into categories is an essential step of the analysis. Then, we go
ahead and clean up the data by removing any instances of duplication and
errors made during its collection.
In this step, there is also the identification of complete or incomplete
data and the implementation of the best technique to handle incomplete data.
Missing values make a dataset incomplete, which degrades machine learning
(ML) algorithms' performance and causes inaccuracy and misinterpretation.
Machine learning has emerged as a problem-solver for many existing problems.
Advancements in this field support artificial intelligence (AI) in many
applications we use daily in real life. However, traditional statistical
models and other technologies were unsuccessful in handling categorical data,
dealing with missing values, and coping with very large numbers of data
points [1]. All these reasons raise the importance of machine learning
technology. Moreover, ML plays a vital role in many
applications, e.g., cyber detection, data mining, natural language processing,
and even disease diagnostics and medicine. In all these domains, we look for
a clue by which ML offers possible solutions.
Since ML algorithms train with part of a dataset and test with the rest, and
since missingness is rarely entirely random, missing elements, especially in
the training dataset, can lead to insufficient capture of the entire
population of the complete dataset. This, in turn, leads to lower performance
on the test dataset. However, if the missing elements are replaced by
reasonably close values, the performance on the imputed dataset can be
restored to the level of the intact, complete dataset. Therefore, this
research intends to investigate the performance variation under different
numbers of missing elements and under two missingness mechanisms, missing
completely at random (MCAR) and missing at random (MAR).
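To make the first of these mechanisms concrete, the following Python sketch (an illustrative stand-in for the R workflow used in the study; the function name and the 10% rate are assumptions) deletes cells from a numeric feature table completely at random, independently of any observed or unobserved values:

```python
# Illustrative MCAR generator: every cell is deleted with the same probability,
# regardless of its own value or the values of other variables.
import numpy as np
import pandas as pd

def make_mcar(df: pd.DataFrame, missing_rate: float, seed: int = 0) -> pd.DataFrame:
    rng = np.random.default_rng(seed)
    mask = rng.random(df.shape) < missing_rate   # True marks a cell to be dropped
    return df.mask(mask)                         # dropped cells become NaN/NA

# Example: remove 10% of all cells completely at random
complete = pd.DataFrame({"duration": [0, 2, 5, 1], "src_bytes": [181, 239, 235, 219]})
incomplete = make_mcar(complete, missing_rate=0.10)
print(incomplete)
```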
RESEARCH METHODOLOGY
This article is one chapter of the whole dissertation work, and to achieve the
objectives of the dissertation, the following tasks are performed:
1) cyber-threat dataset selection
2) ML algorithms selection
3) investigation and literature review on the missingness mechanisms
of MCAR and MAR
missingness and is therefore appropriate for this research. This method fills
in the missing data values by iterating prediction models, where in each
iteration a missing value is imputed by using the other, already complete
variables to predict it.
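As an illustrative analogue (not the exact R MICE configuration used in the dissertation), scikit-learn's IterativeImputer follows the same chained-equation idea: each incomplete variable is modelled from the other variables, and the cycle is repeated for a number of iterations:

```python
# Illustrative sketch of chained-equation style imputation: each feature with
# missing entries is regressed on the remaining features, round after round.
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401  (enables the estimator)
from sklearn.impute import IterativeImputer

X_incomplete = np.array([
    [7.0, 2.0, 3.0],
    [4.0, np.nan, 6.0],
    [10.0, 5.0, np.nan],
])

imputer = IterativeImputer(max_iter=10, random_state=0)
X_imputed = imputer.fit_transform(X_incomplete)   # missing cells replaced by model predictions
print(X_imputed)
```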
Knowledge Discovery and Data Mining Tools Competition, which was held
in conjunction with KDDCUP’99 and the Fifth International Conference on
Knowledge Discovery and Data Mining. The competition task was to build
a network intrusion detector, a predictive model capable of distinguishing
between “bad” connections, called intrusions or attacks, and “good’’ normal
connections. This database contains a standard set of data to be audited,
which includes a wide variety of intrusions simulated in a military network
environment [2]. The KDDCUP’99 dataset contains around 4,900,000 single
connection vectors, every one of which includes 41 attributes and includes
categories such as attack or normal [3], with precisely one specified attack-
type of four main type of attacks:
1) Denial of service (DoS): The use of excess resources denies legit
requests from legal users on the system.
2) Remote to local (R2L): Attacker having no account gains a legal
user account on the victim’s machine by sending packets over the
networks.
3) User to root (U2R): Attacker tries to access restricted privileges
of the machine.
4) Probe: Attacks that can automatically scan a network of computers
to gather information or find any unknown vulnerabilities.
All the 41 features are also labeled into four listed types:
1) Basic features: These characteristics are derived from packet
headers without analyzing the payload.
2) Content features: These analyze the actual TCP packet payload
using domain knowledge, and encompass features such as the
number of unsuccessful login attempts.
3) Time-based traffic features: These features are created to capture
properties accruing over a 2-second temporal window.
4) Host-based traffic features: These make use of a historical window
calculated over the number of connections; thus, host-based
attributes are created to analyze attacks with a timeframe longer
than 2 seconds [4].
Most of the features are continuous variables, for which MICE uses multiple
regression to account for the uncertainty of the missing data: a standard
error is added to the linear regression to obtain stochastic regression, and
a related MICE method called predictive mean matching was used. However,
some variables (is_guest_login, flag, land, etc.) are of binary type
(1)
Below is an illustrative table of the KDDCUP'99 data with 41 features, from
the source (https://kdd.ics.uci.edu/databases/kddcup99/task.html). As shown
in Table 1, we remove two features, numbers 20 and 21, because their values
in the data are all zeros.
The dataset used for testing the proposed regression model is the KDDsubset
network intrusion cyber database; since this dataset is quite large and
causes time delays and slow execution of the R code on limited hardware, the
data is therefore cleaned.
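A minimal pandas sketch of this cleaning step is given below; the file name, the constant-column test, and the down-sampling rate are illustrative assumptions rather than the exact procedure used in the study:

```python
# Illustrative sketch: load a local copy of the KDD subset, drop constant
# (e.g. all-zero) columns such as features 20 and 21, and down-sample the rows
# so experiments run quickly on limited hardware.
import pandas as pd

df = pd.read_csv("kddcup_subset.csv")              # hypothetical local file

constant_cols = [col for col in df.columns if df[col].nunique() == 1]
df = df.drop(columns=constant_cols)                # removes the all-zero features

df_small = df.sample(frac=0.10, random_state=42)   # keep 10% of the records
print(df.shape, "->", df_small.shape)
```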
Table 1. KDDCUP'99 feature list (columns: Nr, Name, Description).
ML ALGORITHMS SELECTION
A huge amount of data is available to organizations from diverse logs,
application logs, intrusion detection systems, and other sources. Year over
year, the data volume increases significantly; data is produced by every
person every second, and the number of devices connected to the internet is
three times larger than the world population.
Article 1: Intrusion detection system using a fuzzy inference system; a Sugeno fuzzy inference system is used for the generation of fuzzy rules, together with best-first selection methods (the remaining entries in this table row are blank in the source).
Article 3: Detecting anomaly-based network intrusion using feature extraction and classification techniques. Reported accuracy per classifier, followed by the remaining values in the order given in the source table:

Decision Tree: 95.09%, 1.032, 0.003, 4649, 2702, 279, 100, 4.9, 94.34%, 97.34%
MLP: 92.46%, 20.59, 0.004, 4729, 2419, 562, 29, 7.54, 89.38%, 99.56%
KNN: 92.78%, 82.956, 13.24, 4726, 2446, 535, 23, 7.22, 89.83%, 99.52%
Linear SVM: 92.59%, 78.343, 2.11, 4723, 2434, 547, 26, 7.41, 89.62%, 99.45%
Passive Aggressive: 90.34%, 0.275, 0.001, 4701, 2282, 699, 48, 9.66, 89.62%, 99.45%
RBF SVM: 91.67%, 99.47, 2.547, 4726, 2960, 621, 23, 8.33, 89.39%, 99.52%
Random Forest: 93.62%, 1.189, 0.027, 4677, 2560, 621, 23, 6.38, 91.74%, 98.48%
AdaBoost: 93.52%, 29.556, 0.225, 4676, 2553, 428, 73, 6.48, 91.61%, 98.46%
Gaussian NB: 94.35%, 244, 0.006, 4642, 2651, 330, 107, 5.65, 93.36%, 97.75%
Multinomial NB: 91.71%, 0.429, 0.001, 4732, 2357, 624, 17, 8.29, 88.35%, 99.64%
Quadratic Discriminant Analysis: 93.23%, 1.305, 0.0019, 4677, 2530, 451, 72, 6.77, 91.20%, 84.87%
Three datasets were used in this study, one of them being the KDDCUP'99, to
compare accuracy and execution time before and after dimensionality
reduction.
Article 5 is summarized in Table 6 below. Five algorithms were used.
Article 6 is summarized in Table 7 below. Singular value decomposition (SVD)
is an eigenvalue method used to reduce a high-dimensional dataset into fewer
dimensions while retaining important information; the study uses an improved
version of the algorithm (ISVD).
Article 7 is summarized in Table 8 below. In article 7, two classifiers were
used, and J48 achieved high accuracy results.
Article 6: Using an imputed data detection method in an intrusion detection system. Reported values in the order given in the source table:

SVD: 43.73%, 45.154, 10.289, 43.67%, 56.20%, 53.33%, 43.8, 0.5
ISVD: 94.34%, 189.232, 66.72, 92.86, 95.82, 0.55
(2)
Information gain measures the reduction in entropy or surprise by
splitting a dataset according to a given value of a random variable. A larger
information gain suggests a lower entropy group or groups of samples, and
hence less surprise.
IG(Dp, f) = I(Dp) - (Nleft/N) I(Dleft) - (Nright/N) I(Dright) (3)
where:
f: feature split on
Dp: dataset of the parent node
Dleft: dataset of the left child node
Dright: dataset of the right child node
I: impurity criterion (Gini index or entropy)
N: total number of samples
Nleft: number of samples at the left child node
Nright: number of samples at the right child node [6].
Figure 7 below indicates how the Decision Tree works.
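As an illustration (not code from the study), the entropy and information-gain computation defined above can be written in Python as follows, using toy attack/normal labels:

```python
# Illustrative sketch of the split criterion defined above: impurity I is taken
# as entropy, and the gain is the parent impurity minus the weighted impurities
# of the left and right child nodes.
import numpy as np

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(parent, left, right):
    n, n_left, n_right = len(parent), len(left), len(right)
    return (entropy(parent)
            - (n_left / n) * entropy(left)
            - (n_right / n) * entropy(right))

# Toy example: a split that perfectly separates attack from normal records
parent = ["attack", "attack", "attack", "normal", "normal", "normal"]
left, right = ["attack", "attack", "attack"], ["normal", "normal", "normal"]
print(information_gain(parent, left, right))  # 1.0 bit for a perfect split
```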
2) Random Forest: [7] The random forest algorithm is a supervised
classification algorithm like the decision tree, but instead of one
tree this classifier uses multiple trees and merges them to obtain
better accuracy and prediction. In random forests, each tree in the
ensemble is built from a sample drawn with replacement from the
training set, a procedure called bagging (bootstrapping), which
improves the stability of the model. Figure 8 below shows the
mechanism of this algorithm.
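A minimal scikit-learn sketch of such a bagged tree ensemble is shown below; the synthetic dataset and parameter values are assumptions that merely stand in for the KDD attack/normal classification task:

```python
# Illustrative sketch only: a random forest built with bootstrap sampling
# (bagging), evaluated on a held-out split of a synthetic two-class dataset.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

forest = RandomForestClassifier(
    n_estimators=200,   # number of trees whose predictions are merged
    bootstrap=True,     # each tree sees a sample drawn with replacement (bagging)
    random_state=0,
)
forest.fit(X_train, y_train)
print("test accuracy:", accuracy_score(y_test, forest.predict(X_test)))
```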
K(xi, xj) = xi · xj (4)
Polynomial kernel: It is a more complex function that can be used to
distinguish between non-linear inputs. It can be represented as:
K(xi, xj) = (xi · xj + 1)^p (5)
where p is the polynomial degree. The radial basis function is a kernel
function that helps in non-linear cases, as it computes a similarity that
depends on the distance from the origin or from some point:
RBF (Gaussian):
K(xi, xj) = exp(-||xi - xj||^2 / (2σ^2)) (6)
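The three kernels can be implemented directly in Python; the sketch below uses the conventional forms shown above (the polynomial offset of +1 and the 2σ² scaling in the RBF kernel are standard choices assumed here, since the exact parametrization of the original equations is not preserved):

```python
# A small illustrative sketch (not from the source): the three kernel functions
# discussed above, with conventional parametrizations assumed for the
# polynomial offset and the RBF width sigma.
import numpy as np

def linear_kernel(x, y):
    return np.dot(x, y)                                   # simple dot product

def polynomial_kernel(x, y, p=3):
    return (np.dot(x, y) + 1) ** p                        # p is the polynomial degree

def rbf_kernel(x, y, sigma=1.0):
    return np.exp(-np.linalg.norm(x - y) ** 2 / (2 * sigma ** 2))

x, y = np.array([1.0, 2.0]), np.array([2.0, 0.5])
print(linear_kernel(x, y), polynomial_kernel(x, y), rbf_kernel(x, y))
```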
(7)
(8)
(9)
Figure 10: Confusion Matrix for attack (A) and normal (N).
On Missing Data
The more accurate the measurement, the better. Traditionally, missing values
were imputed with best guesses, or, if the number of missing values was small
and the dataset large enough, the records with missing values were simply
dropped. However, this was the case before the multivariate approach; now
this is no longer necessary.
To measure accuracy with missing data, and because all the rows and columns
contain numerical values, when we make the data incomplete with R code the
empty cells are replaced with NA. The classifiers may not work with NA
values; instead, we can substitute the mean, median, or mode of each
variable, or just a constant number such as -9999, and when we run the code
this works with a constant number.
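A small pandas sketch of these substitutions is given below (the two columns are only an illustration of KDD-style numeric features):

```python
# Illustrative sketch: replace NA cells with a per-column mean, median, or a
# constant such as -9999 so that classifiers that cannot handle NA still run.
import numpy as np
import pandas as pd

df = pd.DataFrame({"duration": [0.0, np.nan, 5.0], "src_bytes": [181.0, 239.0, np.nan]})

filled_mean = df.fillna(df.mean(numeric_only=True))       # per-variable mean
filled_median = df.fillna(df.median(numeric_only=True))   # per-variable median
filled_constant = df.fillna(-9999)                        # single constant value
print(filled_constant)
```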
On Imputed Data
We assume that with reasonable multivariate imputation procedures, the
accuracy will be close enough to the baseline accuracy of the original dataset
before we make it incomplete, then we impute for both mechanisms. The
results will be shown in chapter 5.
CONCLUSION
This paper provides a survey of different machine learning techniques for
measuring accuracy for different ML classifiers used to detect intrusions
in the KDDsubset dataset. Many algorithms have shown promising results
because they identify the attribute accurately. The best algorithms were
chosen to test our dataset and results posted in a different chapter. The
performance of four machine learning algorithms has been analyzed using
complete and incomplete versions of the KDDCUP’99 dataset. From this
investigation it can be concluded that the accuracy of these algorithms is
greatly affected when the dataset containing missing data is used to train
these algorithms and they cannot be relied upon in solving real world
problems. However, after using the multiple imputation by chained equation
(MICE) to remove the missingness in the dataset the accuracy of the four
algorithms increases exponentially and is even almost equal to that of the
original complete dataset. This is clearly indicated by the confusion matrix
in section 5 where TNR and the TPR are both close to 100% while the
FNR and FPR are both close to zero. This paper has clearly indicated that
the performance of machine learning algorithms decreases greatly when a
dataset contains missing data, but this performance can be increased by using
MICE to get rid of the missingness. Some classifiers have better accuracy
than others, so we should be careful to choose suitable algorithms for each
independent case. We conclude from the survey and observation that the chosen
classifiers work best with cybersecurity systems, while others do not and may
be more helpful in different domains. A survey of many articles provides a
beneficial opportunity for analyzing attack detection and supports improved
decision-making about which model is best to use.
ACKNOWLEDGEMENTS
The authors would like to thank the anonymous reviewers for their valuable
suggestions and notes; thanks are also extended to Scientific Research/Journal
of Data Analysis and Information Processing.
REFERENCES
1. Fatima, M. and Pasha, M. (2017) Survey of Machine Learning
Algorithms for Disease Diagnostic. Journal of Intelligent Learning
Systems and Applications, 9, 1-16. https://doi.org/10.4236/jilsa.2017.91001
2. Kim, D.S. and Park, J.S. (2003) Network-Based Intrusion Detection
with Support Vector Machines. In: International Conference on
Information Networking. Springer, Berlin, Heidelberg, 747-756.
https://doi.org/10.1007/978-3-540-45235-5_73
3. Tavallaee, M., Bagheri, E., Lu, W. and Ghorbani, A.A. (2009)
A Detailed Analysis of the KDD CUP 99 Data Set. 2009 IEEE
Symposium on Computational Intelligence for Security and Defense
Applications, Ottawa, 8-10 July 2009, 1-6. https://doi.org/10.1109/CISDA.2009.5356528
4. Sainis, N., Srivastava, D. and Singh, R. (2018) Feature Classification
and Outlier Detection to Increased Accuracy in Intrusion Detection
System. International Journal of Applied Engineering Research, 13,
7249-7255.
5. Sharma, H. and Kumar, S. (2016) A Survey on Decision Tree
Algorithms of Classification in Data Mining. International Journal of
Science and Research (IJSR), 5, 2094-2097. https://doi.org/10.21275/v5i4.NOV162954
6. Singh, S. and Gupta, P. (2014) Comparative Study ID3, Cart and C4. 5
Decision Tree Algorithm: A Survey. International Journal of Advanced
Information Science and Technology (IJAIST), 27, 97-103.
7. Chen, J., Li, K., Tang, Z., Bilal, K., Yu, S., Weng, C. and Li, K. (2016)
A Parallel Random Forest Algorithm for Big Data in a Spark Cloud
Computing Environment. IEEE Transactions on Parallel and Distributed
Systems, 28, 919-933. https://doi.org/10.1109/TPDS.2016.2603511
8. Suthaharan, S. (2016) Support Vector Machine. In: Machine Learning
Models and Algorithms for Big Data Classification. Springer, Boston,
207-235. https://doi.org/10.1007/978-1-4899-7641-3_9
9. Larhman (2018) Linear Support Vector Machines. https://en.wikipedia.org/wiki/Support-vector_machine
10. Chen, S., Webb, G.I., Liu, L. and Ma, X. (2020) A Novel Selective
Naïve Bayes Algorithm. Knowledge-Based Systems, 192, Article ID:
105361. https://doi.org/10.1016/j.knosys.2019.105361
Chapter 3
Data Modeling and Data Analytics: A Survey from a Big Data Perspective
ABSTRACT
In recent years we have witnessed tremendous growth in the volume
and availability of data. This fact results primarily from the emergence of
a multitude of sources (e.g. computers, mobile devices, sensors or social
networks) that are continuously producing either structured, semi-structured
or unstructured data. Database Management Systems and Data Warehouses
are no longer the only technologies used to store and analyze datasets, mainly
because the volume and complex structure of today's data degrade
their performance and scalability. Big Data is one of the recent challenges,
since it implies new requirements in terms of data storage, processing and
visualization. Nevertheless, properly analyzing Big Data can bring
great advantages because it allows patterns and correlations in
datasets to be discovered. Users can use this processed information to gain
deeper insights and to get business advantages. Thus, data modeling and data analytics have
Citation: Ribeiro, A. , Silva, A. and da Silva, A. (2015), “Data Modeling and Data
Analytics: A Survey from a Big Data Perspective”. Journal of Software Engineering
and Applications, 8, 617-634. doi: 10.4236/jsea.2015.812058.
Copyright: © 2015 by authors and Scientific Research Publishing Inc. This work is li-
censed under the Creative Commons Attribution International License (CC BY). http://
creativecommons.org/licenses/by/4.0
evolved so that we are able to process huge amounts of data without
compromising performance and availability, for instance by "relaxing" the
usual ACID properties. This paper provides a broad view and discussion of
the current state of this subject with a particular focus on data modeling and
data analytics, describing and clarifying the main differences between the
three main approaches with respect to these aspects, namely operational
databases, decision support databases and Big Data technologies.
INTRODUCTION
We have been witnessing exponential growth in the volume of data
produced and stored. This can be explained by the evolution of technology,
which has resulted in the proliferation of data in different formats from the
most varied domains (e.g. health care, banking, government or logistics) and
sources (e.g. sensors, social networks or mobile devices). We have witnessed a
paradigm shift from simple books to sophisticated databases that keep being
populated every second at an immensely fast rate. The Internet and social media
also contribute greatly to this trend [1] . Facebook, for
example, has an average of 4.75 billion pieces of content shared among
friends every day [2] . Traditional Relational Database Management Systems
(RDBMSs) and Data Warehouses (DWs) are designed to handle a certain
amount of data, typically structured, which is completely different from
the reality that we are facing nowadays. Businesses are generating enormous
quantities of data that are too big to be processed and analyzed by
traditional RDBMS and DW technologies, which are struggling to meet
the performance and scalability requirements.
Therefore, in recent years, a new approach that aims to mitigate
these limitations has emerged. Companies like Facebook, Google, Yahoo
and Amazon are the pioneers in creating solutions to deal with these "Big
Data" scenarios, namely by resorting to technologies like Hadoop [3] [4] and
MapReduce [5] . Big Data is a generic term used to refer to massive and
complex datasets, which are made of a variety of data structures (structured,
semi-structured and unstructured data) from a multitude of sources [6] . Big
Data can be characterized by three Vs: volume (amount of data), velocity
(speed of data in and out) and variety (kinds of data types and sources) [7] .
Still, other Vs have been added for variability, veracity and value [8] .
DATA MODELING
This section gives an in-depth look at the most popular data models used to
define and support Operational Databases, Data Warehouses and Big Data
technologies.
Databases are widely used for either personal or enterprise purposes, namely
due to their strong ACID (atomicity, consistency, isolation and
durability) guarantees and the maturity level of the Database Management
Systems (DBMSs) that support them [15] .
The data modeling process may involve the definition of three data
models (or schemas) defined at different abstraction levels, namely
Conceptual, Logical and Physical data models [15] [16] . Figure 1 shows
part of the three data models for the Academic Management System (AMS)
case study. All these models define
three entities (Person, Student and Professor) and their main relationships
(teach and supervise associations).
Conceptual Data Model. A conceptual data model is used to define,
at a very high and platform-independent level of abstraction, the entities
or concepts, which represent the data of the problem domain, and their
relationships. It leaves further details about the entities (such as their
attributes, types or primary keys) for the next steps. This model is typically
used to explore domain concepts with the stakeholders and can be omitted
or used instead of the logical data model.
Logical Data Model. A logical data model is a refinement of the previous
conceptual model. It details the domain entities and their relationships, while
still remaining at a platform-independent level. It depicts all the attributes
that characterize each entity (possibly also including its unique identifier,
the primary key) and all the relationships between the entities (possibly
including the keys identifying those relationships, the foreign keys). Despite
being independent of any DBMS, this model can easily be mapped onto a
physical data model thanks to the details it provides.
Physical Data Model. A physical data model visually represents the
structure of the data as implemented by a given class of DBMS. Therefore,
entities are represented as tables, attributes are represented as table columns
and have a given data type that can vary according to the chosen DBMS,
and the relationships between each table are identified through foreign keys.
Unlike the previous models, this model tends to be platform-specific, because
it reflects the database schema and, consequently, some platform-specific
aspects (e.g. database-specific data types or query language extensions).
Summarizing, the complexity and detail increase from a conceptual to
a physical data model. First, it is important to perceive, at a higher level of
abstraction, the data entities and their relationships using a Conceptual Data
Model. Then, the focus is on detailing those entities, without worrying about
implementation details, using a Logical Data Model. Finally, a Physical Data
Model allows one to represent how data is supported by a given DBMS [15]
[16] .
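For illustration, a minimal sketch of a physical data model for the Person, Student and Professor entities, assuming SQLite as the target DBMS and hypothetical column names, could be defined as follows:

import sqlite3

ddl = """
CREATE TABLE Person (
    person_id INTEGER PRIMARY KEY,     -- primary key
    name      TEXT NOT NULL
);
CREATE TABLE Student (
    student_id INTEGER PRIMARY KEY,
    person_id  INTEGER NOT NULL,
    country    TEXT,
    FOREIGN KEY (person_id) REFERENCES Person(person_id)  -- foreign key
);
CREATE TABLE Professor (
    professor_id INTEGER PRIMARY KEY,
    person_id    INTEGER NOT NULL,
    FOREIGN KEY (person_id) REFERENCES Person(person_id)
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(ddl)  # entities become tables, attributes become typed columns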
Operational Databases
Databases had a great boost with the popularity of the Relational Model
[17] proposed by E. F. Codd in 1970. The Relational Model overcame the
problems of its predecessor data models (namely the Hierarchical Model and
the Navigational Model [18] ). The Relational Model caused the emergence
of Relational Database Management Systems (RDBMSs), which are the most
used and popular DBMSs, as well as the definition of the Structured Query
Language (SQL) [19] as the standard language for defining and manipulating
data in RDBMSs. RDBMSs are widely used for maintaining data of daily
operations. Considering the data modeling of operational databases, there are
two main models: the Relational and the Entity-Relationship (ER) models.
Relational Model. The Relational Model is based on the mathematical
concept of relation. A relation is defined as a set (in mathematics terminology)
and is represented as a table, which is a matrix of columns and rows, holding
information about the domain entities and the relationships among them.
Each column of the table corresponds to an entity attribute and specifies
the attribute's name and its type (known as its domain). Each row of the table
(known as a tuple) corresponds to a single element of the represented domain
entity.
Figure 1: Example of three data models (at different abstraction levels) for the
Academic Management System.
In the Relational Model each row is unique and therefore a table has
an attribute or set of attributes known as the primary key, used to uniquely
identify those rows. Tables are related to each other by sharing one or
more common attributes. These attributes correspond to a primary key in the
referenced (parent) table and are known as foreign keys in the referencing
(child) table. In one-to-many relationships, the referenced table corresponds
to the entity of the “one” side of the relationship and the referencing table
corresponds to the entity of the "many" side. In many-to-many relationships,
dimensions for the Student. These dimensions are represented by sides of
the cube (Student, Country and Date). This cube is useful to execute queries
such as: the students by country enrolled for the first time in a given year.
A challenge that DWs face is the growth of data, since it affects the
number of dimensions and levels in either the star schema or the cube
hierarchies. The increasing number of dimensions over time often makes the
management of such systems impracticable; this problem becomes
even more serious when dealing with Big Data scenarios, where data is
continuously being generated [23] .
Figure 2: Example of two star schema models for the Academic Management
System.
DATA ANALYTICS
This section presents and discusses the types of operations that can be
performed over the data models described in the previous section and also
establishes comparisons between them. A complementary discussion is
provided in Section 4.
Operational Databases
Systems using operational databases are designed to handle a high number
of transactions that usually perform changes to the operational data, i.e. the
data an organization needs to assure its everyday normal operation. These
systems are called Online Transaction Processing (OLTP) systems and they
are the reason why RDBMSs are so essential nowadays. RDBMSs have
increasingly been optimized to perform well in OLTP systems, namely
providing reliable and efficient data processing [16] .
The set of operations supported by RDBMSs is derived from the
relational algebra and calculus underlying the Relational Model [15] . As
mentioned before, SQL is the standard language to perform these operations.
SQL can be divided into two parts involving different types of operations:
result table by applying the Union, Intersect and Except operations, based
on the Set Theory [15] .
For example, considering the Academic Management System, a system
manager could get a list of all students who are from G8 countries by entering
the following SQL-DML query:
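A minimal sketch of such a query, assuming a hypothetical Student table with a country column and issued here through Python's sqlite3 module, could look like the following:

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Student (student_id INTEGER PRIMARY KEY, name TEXT, country TEXT);
INSERT INTO Student VALUES (1, 'Alice', 'Canada'), (2, 'Bob', 'Brazil');
""")

g8 = ('Canada', 'France', 'Germany', 'Italy', 'Japan', 'Russia',
      'United Kingdom', 'United States')

query = """
SELECT name, country
FROM Student
WHERE country IN ({})
""".format(", ".join("?" * len(g8)))

for row in conn.execute(query, g8):
    print(row)   # only students from G8 countries are returned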
for specific queries. For instance, the user can slice and dice the cube to get
the results needed, but sometimes, with a pivot, most of those operations
can be avoided by perceiving a common structure in future queries and
pivoting the cube accordingly [23] [24] . Figure 7 (bottom-right)
shows a pivot operation where years are arranged vertically and countries
horizontally.
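A minimal sketch of this pivot, using a small hypothetical enrolment table and the pandas library purely to illustrate the operation, could look like this:

import pandas as pd

# Hypothetical enrolment facts: one row per first-time enrolment.
enrolments = pd.DataFrame({
    "year":    [2014, 2014, 2015, 2015, 2015],
    "country": ["Portugal", "Spain", "Portugal", "France", "France"],
    "student": ["s1", "s2", "s3", "s4", "s5"],
})

# Pivot: years arranged vertically, countries horizontally.
cube_view = enrolments.pivot_table(index="year", columns="country",
                                   values="student", aggfunc="count")
print(cube_view)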
The usual operations issued over the OLAP cube simply query historical
events stored in it, so a common dimension is one associated with time.
the slicer axis as the “Computer Science” value of the Academic Program
dimension. This query returns the students (by names and gender) that have
enrolled in Computer Science in the year 2015.
Pig2 and Hive3 are two frameworks used to express Big Data set analysis
tasks as MapReduce programs. Pig is suitable for data flow tasks and can produce
sequences of MapReduce programs, whereas Hive is more suitable for data
summarization, queries and analysis. Both of them use their own SQL-like
languages, Pig Latin and HiveQL, respectively [33] . These languages support
both CRUD and ETL operations.
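To make the underlying MapReduce paradigm concrete, the following minimal sketch in plain Python (not Pig or Hive themselves) shows the map and reduce phases of a word-count job:

from collections import defaultdict

def map_phase(lines):
    # Emit (word, 1) pairs, as a generated MapReduce mapper would.
    for line in lines:
        for word in line.lower().split():
            yield word, 1

def reduce_phase(pairs):
    # Aggregate the counts per key, as a reducer would.
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return dict(totals)

lines = ["big data analytics", "big data streams"]
print(reduce_phase(map_phase(lines)))   # {'big': 2, 'data': 2, ...}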
Stream processing is a paradigm where data is continuously arriving
in a stream, in real time, and is analyzed as soon as possible in order to
derive approximate results. It relies on the assumption that the potential
value of data depends on its freshness. Due to its volume, only a portion
of the stream is stored in memory [33] . The stream processing paradigm is
used in online applications that need real-time precision (e.g. dashboards of
production lines in a factory, calculation of costs depending on usage and
available resources). It is supported by Data Stream Management Systems
(DSMS) that allow performing SQL-like queries (e.g. select, join, group,
count) within a given window of data. This window establishes the period
of time (based on time) or number of events (based on length) [34] . Storm
and S4 are two examples of such systems.
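A minimal sketch of length-based and time-based windows over a stream, written in plain Python as an illustration rather than an actual DSMS such as Storm or S4, could look like the following:

import time
from collections import deque

length_window = deque(maxlen=100)   # window based on length (last 100 events)
time_window = deque()               # window based on time (last 60 seconds)
WINDOW_SECONDS = 60

def on_event(value):
    now = time.time()
    length_window.append(value)
    time_window.append((now, value))
    # Drop events that fall outside the time window.
    while time_window and now - time_window[0][0] > WINDOW_SECONDS:
        time_window.popleft()
    # An SQL-like "count" over each window.
    return len(length_window), len(time_window)

for v in range(5):
    print(on_event(v))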
DISCUSSION
In this section we compare and discuss the approaches presented in the
previous sections in terms of the two perspectives that guide this survey:
Data Modeling and Data Analytics. Each perspective defines a set of features
used to compare Operational Databases, DWs and Big Data approaches
among themselves.
Regarding the Data Modeling Perspective, Table 2 considers the
following features of analysis: (1) the data model; (2) the abstraction
level in which the data model resides, according to the abstraction levels
(Conceptual, Logical and Physical) of the database design process; (3) the
concepts or constructs that compose the data model;
(4) the concrete languages used to produce the data models and that apply
the previous concepts; (5) the modeling tools that allow specifying diagrams
using those languages and; (6) the database tools that support the data model.
Table 2 presents the values of each feature for each approach. It is possible to
verify that the majority of the data models are at a logical and physical level,
with the exception of the ER model and the OLAP cube model, which are
more abstract and defined at conceptual and logical levels. It is also possible
to verify that Big Data has more data models than the other approaches, what
can explain the work and proposals that have been conducted over the last
years, as well as the absence of a de facto data model. In terms of concepts,
again Big Data-related data models have a more variety of concepts than
the other approaches, ranging from key-value pairs or documents to nodes
and edges. Concerning concrete languages, it is concluded that every data
to the ones offered by SQL-ML and add new constructs for supporting
ETL, data stream processing (e.g. create stream, window) [34] and
MapReduce operations. It is important to note that concrete languages used
in the different approaches reside at logical and physical levels, because
they are directly used by the supporting software tools.
RELATED WORK
As mentioned in Section 1, the main goal of this paper is to present and
discuss the concepts surrounding data modeling and data analytics, and
their evolution for three representative approaches: operational databases,
decision support databases and Big Data technologies.
In our survey we have researched related works that also explore and
compare these approaches from the data modeling or data analytics point
of view.
J.H. ter Bekke provides a comparative study of the Relational,
Semantic, ER and Binary data models based on the results of an examination
session [38] . In that session participants had to create a model of a case
study, similar to the Academic Management System used in this paper. The
purpose was to discover relationships between the modeling approach in
use and the resulting quality. Therefore, this study addresses only the data
modeling topic, and more specifically considers only data models associated
with the database design process.
Several works focus on highlighting the differences between operational
databases and data warehouses. For example, R. Hou provides an analysis
of operational databases and data warehouses, distinguishing them
according to their related theory and technologies, and also establishing
common points where combining both systems can bring benefits [39] . C.
Thomsen and T.B. Pedersen compare open source ETL tools, OLAP clients
and servers, and DBMSs, in order to build a Business Intelligence (BI)
solution [40] .
P. Vassiliadis and T. Sellis conducted a survey that focuses only on OLAP
databases and compares various proposals for the logical models behind
them. They group the various proposals into just two categories: commercial
tools and academic efforts, which in turn are subcategorized into relational
model extensions and cube-oriented approaches [41] . However, unlike our
survey they do not cover the subject of Big Data technologies.
Several papers discuss the state of the art of the types of data stores,
technologies and data analytics used in Big Data scenarios [29] [30] [33]
[42] , however they do not compare them with other approaches. Recently, P.
Chandarana and M. Vijayalakshmi focus on Big Data analytics frameworks
and provide a comparative study according to their suitability [35] .
Summarizing, none of the aforementioned works provides such a
broad analysis as we do in this paper; as far as we know, we
did not find any paper that simultaneously compares operational databases,
decision support databases and Big Data technologies. Instead, they focus
on describing one or two of these approaches more thoroughly.
CONCLUSIONS
In recent years, the term Big Data has appeared to classify the huge datasets
that are continuously being produced from various sources and that are
represented in a variety of structures. Handling this kind of data represents
new challenges, because the traditional RDBMSs and DWs reveal serious
limitations in terms of performance and scalability when dealing with such
a volume and variety of data. Therefore, the ways in which data is represented
and analyzed need to be reinvented in order to be able to extract value
from it.
ACKNOWLEDGEMENTS
This work was partially supported by national funds through FCT―Fundação
para a Ciência e a Tecnologia, under the projects POSC/EIA/57642/2004,
CMUP-EPB/TIC/0053/2013, UID/CEC/50021/2013 and Data Storm
Research Line of Excellency funding (EXCL/EEI-ESS/0257/2012).
NOTES
1. https://hadoop.apache.org
2. https://pig.apache.org
3. https://hive.apache.org
4. http://cassandra.apache.org
5. https://hbase.apache.org
6. https://www.mongodb.org
7. https://drill.apache.org
REFERENCES
1. Mayer-Schonberger, V. and Cukier, K. (2014) Big Data: A Revolution
That Will Transform How We Live, Work, and Think. Houghton
Mifflin Harcourt, New York.
2. Noyes, D. (2015) The Top 20 Valuable Facebook Statistics.https://
zephoria.com/top-15-valuable-facebook-statistics
3. Shvachko, K., Kuang, H., Radia, S. and Chansler, R. (2010)
The Hadoop Distributed File System. 26th Symposium on Mass
Storage Systems and Technologies (MSST), Incline Village, 3-7 May
2010, 1-10.http://dx.doi.org/10.1109/msst.2010.5496972
4. White, T. (2012) Hadoop: The Definitive Guide. 3rd Edition, O’Reilly
Media, Inc., Sebastopol.
5. Dean, J. and Ghemawat, S. (2008) MapReduce: Simplified Data
Processing on Large Clusters. Communications of the ACM, 51, 107-113. http://dx.
doi.org/10.1145/1327452.1327492
6. Hurwitz, J., Nugent, A., Halper, F. and Kaufman, M. (2013) Big Data
for Dummies. John Wiley & Sons, Hoboken.
7. Beyer, M.A. and Laney, D. (2012) The Importance of “Big Data”: A
Definition. Gartner. https://www.gartner.com/doc/2057415
8. Duncan, A.D. (2014) Focus on the “Three Vs” of Big Data Analytics:
Variability, Veracity and Value. Gartner.https://www.gartner.com/
doc/2921417/focus-vs-big-data-analytics
9. Agrawal, D., Das, S. and El Abbadi, A. (2011) Big Data and Cloud
Computing: Current State and Future Opportunities. Proceedings
of the 14th International Conference on Extending Database
Technology, Uppsala, 21-24 March, 530-533.http://dx.doi.
org/10.1145/1951365.1951432
10. McAfee, A. and Brynjolfsson, E. (2012) Big Data: The Management
Revolution. Harvard Business Review.
11. DataStorm Project Website.http://dmir.inesc-id.pt/project/DataStorm.
12. Stahl, T., Voelter, M. and Czarnecki, K. (2006) Model-Driven Software
Development: Technology, Engineering, Management. John Wiley &
Sons, Inc., New York.
13. Schmidt, D.C. (2006) Guest Editor’s Introduction: Model-Driven
Engineering. IEEE Computer, 39, 25-31.http://dx.doi.org/10.1109/
MC.2006.58
Chapter 4
Big Data Analytics for Business Intelligence in Accounting and Audit
ABSTRACT
Big data analytics represents a promising area for the accounting and
audit professions. We examine how machine learning applications, data
analytics and data visualization software are changing the way auditors and
accountants work with their clients. We find that audit firms are keen to use
machine learning software tools to read contracts, analyze journal entries,
and assist in fraud detection. In data analytics, predictive analytical tools are
utilized by both accountants and auditors to make projections and estimates,
and to enhance business intelligence (BI). In addition, data visualization
tools are able to complement predictive analytics to help users uncover
trends in the business process. Overall, we anticipate that the technological
advances in these various fields will accelerate in the coming years. Thus,
it is imperative that accountants and auditors embrace these technological
advancements and harness these tools to their advantage.
Citation: Chu, M. and Yong, K. (2021), “Big Data Analytics for Business Intelligence
in Accounting and Audit”. Open Journal of Social Sciences, 9, 42-52. doi: 10.4236/
jss.2021.99004.
Copyright: © 2021 by authors and Scientific Research Publishing Inc. This work is li-
censed under the Creative Commons Attribution International License (CC BY). http://
creativecommons.org/licenses/by/4.0
INTRODUCTION
Big data analytics has transformed the world that we live in. Due to
technological advances, big data analytics enables new forms of business
value and enterprise risk that will have an impact on the rules, standards and
practices for the finance and accounting professions. The accounting and
audit professionals are important players in harnessing the power of big data
analytics, and they are poised to become even more vital to stakeholders in
supporting data and insight-driven enterprises.
Data analytics can enable auditors to focus on exception reporting more
efficiently by identifying outliers in risky areas of the audit process (IAASB,
2018). The advent of inexpensive computational power and storage, as well
as the progressive computerization of organizational systems, is creating a
new environment in which accountants and auditors must adapt to harness
the power of big data analytics. In other applications, data analytics can help
auditors to improve the risk assessment process, substantive procedures and
tests of controls (Lim et al., 2020). These software tools have the potential to
provide further evidence to assist with audit judgements and provide greater
insights for audit clients.
In machine learning applications, the expectation is that the algorithm
will learn from the data provided, in a manner that is similar to how a human
being learns from data. A classic application of machine learning tools is
pattern recognition. Facial recognition machine learning software has been
developed such that a machine-learning algorithm can look at pictures of
men and women and be able to identify those features that are male driven
from those that are female driven. Initially, the algorithm might misclassify
some male faces as female faces. It is thus important for the programmer
to write an algorithm that can be trained using test data to look for specific
patterns in male and female faces.
Because machine learning requires large data sets in order to train the
learning algorithms, the availability of a vast quantity of high-quality data
will expedite the process by allowing the programmer to refine the machine
learning algorithms to be able to identify pictures that contain a male
face as opposed to a female face. Gradually, the algorithm will be able to
distinguish some general characteristics of a man (e.g., a beard, certain
differences in hair styles, broad faces) from those that belong to a woman
(e.g., more feminine characteristics).
Similarly, it is envisaged that many routine accounting processes will be
handled by machine learning algorithms or robotic process automation
(RPA) tools in the near future. For example, it is possible that machine
learning algorithms can receive an invoice, match it to a purchase order,
determine the expense account to charge and the amount to be paid, and
place it in a pool of payments for a human employee to review the documents
and release them for payment to the respective vendors.
Likewise, in auditing a client, a well-designed machine learning
algorithm could make it easier to detect potentially fraudulent transactions in
a company's financial statements by training the algorithm to distinguish
transactions that have characteristics associated with
fraudulent activities from bona fide transactions. The evolution of machine
learning is thus expected to have a dramatic impact on business, and it is
expected that the accounting profession will need to adapt so as to better
understand how to utilize such technologies in modifying their ways of
working when auditing financial statements of their audit clients (Haq,
Abatemarco, & Hoops, 2020).
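A minimal sketch of this idea, with invented transaction features and labels and scikit-learn used purely for illustration, could look like the following:

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Hypothetical transaction features: amount, hour posted, round-amount flag, weekend flag.
rng = np.random.default_rng(0)
X = rng.random((1000, 4))
y = (X[:, 0] > 0.9).astype(int)   # placeholder label: "fraudulent" vs bona fide

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = RandomForestClassifier(random_state=0).fit(X_train, y_train)

# Transactions flagged as potentially fraudulent are routed to the auditor for review.
flags = model.predict(X_test)
print("transactions flagged for review:", int(flags.sum()))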
Predictive analytics is a subset of data analytics. Predictive analytics can
be viewed as helping the accountant or auditor understand the future,
providing foresight by identifying patterns in historical data. One of the
most common applications of predictive analytics in the field of accounting
is the computation of a credit score to indicate the likelihood of timely future
credit payments. This predictive analytics tool can be used to predict an
accounts receivable balance at a certain date and to estimate a collection
period for each customer.
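A minimal sketch of such a prediction, with invented invoice features and a simple regression model standing in for a production credit-scoring tool, could look like this:

import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical features: invoice amount, customer relationship age (years), past late payments.
X = np.array([[1000, 2, 0], [5000, 5, 3], [750, 1, 1], [2000, 4, 0], [3200, 3, 2]])
days_to_collect = np.array([30, 75, 45, 28, 60])   # invented historical outcomes

model = LinearRegression().fit(X, days_to_collect)

# Estimate the collection period for a new customer invoice.
print("estimated days to collect:", model.predict([[1500, 2, 1]])[0])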
Data visualization tools are becoming increasingly popular because of
the way these tools help users obtain better insights, draw conclusions and
handle large datasets (Skapoullis, 2018). For example, auditors have begun
to use visualizations as a tool to look at multiple accounts over multiple
years to detect misstatements. If an auditor is attempting to examine a
company’s accounts payable (AP) balances over the last ten years compared
to the industry average, a data visualization tool like PowerBI or Tableau
can quickly produce a graph that compares two measures against one
dimension. The measures are the quantitative data, which are the company’s
AP balances versus the industry averages. The dimension is a qualitative
categorical variable. The difference between data visualization tools and a
simple Excel graph is that this information (“sheet”) can be easily formatted
and combined with other important information (“other sheets”) to create a
dashboard where numerous sheets are compiled to provide an overall view
that shows the auditor a cohesive audit examination of misstatement risk
or anomalies in the company's AP balances. As real-time data is streamed
to update the dashboard, auditors could also examine the most current
transactions that affect AP balances, thus enabling the auditor to perform
a continuous audit. A real-time quality dashboard that provides real-time
alerts enables collaboration among the audit team on a continuous basis,
coupled with real-time supervisory review. Analytical
procedures and tests of transactions can be performed more continually, and the
auditor can investigate unusual fluctuations more promptly. The continuous
review can also help to even out the workload of the audit team as the audit
team members are kept abreast of the client’s business environment and
financial performance throughout the financial year.
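A minimal sketch of such a comparison, using matplotlib as a simple stand-in for PowerBI or Tableau and entirely invented placeholder figures, could look like the following:

import matplotlib.pyplot as plt

years = list(range(2012, 2022))
company_ap  = [1.2, 1.3, 1.5, 1.4, 1.8, 2.1, 2.0, 2.6, 3.1, 3.0]   # placeholder, $m
industry_ap = [1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7, 1.8, 1.9, 2.0]   # placeholder, $m

plt.plot(years, company_ap, marker="o", label="Company AP")
plt.plot(years, industry_ap, marker="s", label="Industry average AP")
plt.xlabel("Year")                      # the dimension (qualitative category)
plt.ylabel("AP balance ($ millions)")   # the measures (quantitative data)
plt.legend()
plt.title("AP balances vs. industry average")
plt.show()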
The next section discusses machine learning applications to aid the
audit process. Section 3 describes predictive analytics and how accountants
and auditors use these tools to generate actionable insights for companies.
Section 4 discusses data visualization and its role in the accounting and audit
profession. Section 5 concludes.
MACHINE LEARNING
to not just read a lease contract, identify key terms, determine whether it
is a capital or operating lease, but also to interpret nonstandard leases with
significant judgments (e.g., those with unusual asset retirement obligations).
This would allow auditors to review and assess larger samples—even up to
100% of the documents, spend more time on judgemental areas and provide
greater insights to audit clients, thus improving both the speed and quality
of the audit process.
Another example of machine learning technology currently used by
PricewaterhouseCoopers is Halo. Halo analyzes journal entries and can
identify potentially problematic areas, such as entries with keywords of a
questionable nature, entries from unauthorized sources, or an unusually high
number of journal entry postings just under authorized limits. Similar to
Argus, Halo allows auditors to test 100% of the journal entries and focus
only on the outliers with the highest risk, so both the speed and quality of the
testing procedures are significantly improved.
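A generic, rule-based sketch of this kind of test, written in pandas with hypothetical entries and thresholds (and not a description of Halo itself), could look like this:

import pandas as pd

APPROVAL_LIMIT = 10_000    # hypothetical authorization limit
NEAR_LIMIT_BAND = 0.95     # "just under" means within 5% of the limit

entries = pd.DataFrame({
    "entry_id": [1, 2, 3, 4],
    "amount":   [9800, 1500, 9990, 4000],
    "source":   ["erp", "manual", "manual", "erp"],
})

near_limit   = entries["amount"].between(APPROVAL_LIMIT * NEAR_LIMIT_BAND, APPROVAL_LIMIT)
unauthorized = ~entries["source"].isin(["erp"])

flagged = entries[near_limit | unauthorized]
print(flagged)    # only the high-risk outliers go to the auditor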
risk, for the purpose of assessing risk of material misstatements. Data from
various exogenous sources, such as forum posts, comments, conversations
from social media, press releases, news, and management discussion notes, can
be used to supplement traditional financial attributes to train the model to
virtually assess the inherent risk levels (Paltrinieri et al., 2019).
The use of machine learning for risk assessment can also be applied
to the assessment of going concern risk. By studying the traits of companies
that have undergone financial distress, a Probability of Default (PD) model
can be developed, with the aim of quantifying going concern risk on a timelier
basis. The predictive model requires an indicator of financial distress and a
set of indicators that leverage environmental and financial performance
scanning to produce a PD that is dynamically updated according to firm
performance (Martens et al., 2008).
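A minimal sketch of such a PD model, assuming invented distress indicators and using logistic regression as one common choice of estimator, could look like the following:

import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical indicators: leverage ratio, current ratio, cash-flow-to-debt ratio.
X = np.array([[0.9, 0.6, 0.05], [0.4, 1.8, 0.40], [0.8, 0.9, 0.10],
              [0.3, 2.1, 0.55], [0.7, 1.0, 0.12], [0.2, 2.5, 0.60]])
distressed = np.array([1, 0, 1, 0, 1, 0])   # 1 = financial distress observed

pd_model = LogisticRegression().fit(X, distressed)

# Probability of Default for a new client, refreshed as new figures arrive.
print("PD:", pd_model.predict_proba([[0.85, 0.7, 0.08]])[0, 1])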
The impact on businesses and the accounting profession will
undoubtedly be significant in the near future. The major public accounting
firms are focused on providing their customers with the expertise needed
to deploy machine learning algorithms in businesses to accelerate
and improve business decisions while lowering costs. In May 2018,
PricewaterhouseCoopers announced a joint venture with eBravia, a contract
analytics software company, to develop machine learning algorithms for
contract analysis. Those algorithms could be used to review documents
related to lease accounting and revenue recognition standards as well as
other business activities, such as mergers and acquisitions, financings, and
divestitures. In the area of advisory services, Deloitte has advised retailers
on how they can enhance customer experience by using machine learning to
target product and services based on past buying patterns. While the major
public accounting firms may have the financial resources to invest in machine
learning, small public accounting firms can leverage these technological
solutions and use pre-built machine learning algorithms to develop expertise
through their own implementations at a smaller scale.
DATA ANALYTICS
it does not necessarily help them better predict and more aggressively plan
for the future.
Finding the right solution to enable a detailed analysis of financial data
is critical in the transition from looking at historical financial data to
finding predictors that enable forward-looking business intelligence (BI). A BI
solution leverages patterns in the data. Looking at consolidated data
in an aggregate manner, rather than in a piecemeal ad hoc process from
separate information systems provides an opportunity to uncover hidden
trends and is a useful functionality for predictive analytics. For example, in
customer relationship management (CRM) systems, improved forecasting
is important in better planning for capacity peaks and troughs that directly
impact the customer experience, response time, and transaction volumes.
Many accountants are already using data analytics in their daily work.
They compute sums, averages, and percent changes to report sales results,
customer credit risk, cost per customer, and availability of inventory.
Accountants also are generally familiar with diagnostic analytics because
they perform variance analyses and use analytic dashboards to explain
historical results.
The various attempts to predict financial performance by leveraging
nonfinancial performance measures that might be good predictors are
expected to gain much traction in the
coming years. This presents a great opportunity for accountants to provide a
much more valuable role to management. Hence, accountants should further
harness the power of data analytics to effectively perform their roles.
Predictive analytics and prescriptive analytics are important because they
provide actionable insights for companies. Accountants need to increase their
competence in these areas to provide value to their organizations. Predictive
analytics integrates data from various sources (such as enterprise resource
planning, point-of-sale, and customer relationship management systems) to
predict future outcomes based on statistical relationships found in historical
data using regression-based modeling. One of the most common applications
of predictive analytics is the computation of a credit score to indicate the
likelihood of timely future credit payments. Prescriptive analytics utilizes
a combination of sophisticated optimization techniques (self-optimizing
algorithms) to make recommendations on the most favorable courses of
action to be taken.
analyze payments made after period end. This technique can relate the
subsequent payments with the delivery dates extracted from the underlying
delivery documents to ascertain if the payments relate to goods delivered
before the period end or after the period end and also determine the amount
of unrecorded liability.
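A minimal sketch of this technique, assuming hypothetical payment and delivery records held in pandas data frames, could look like the following:

import pandas as pd

PERIOD_END = pd.Timestamp("2023-12-31")

payments = pd.DataFrame({
    "doc_id":   ["D1", "D2", "D3"],
    "paid_on":  pd.to_datetime(["2024-01-10", "2024-01-15", "2024-02-01"]),
    "amount":   [5000, 1200, 800],
    "recorded": [True, False, False],   # liability recorded before period end?
})
deliveries = pd.DataFrame({
    "doc_id":       ["D1", "D2", "D3"],
    "delivered_on": pd.to_datetime(["2023-12-20", "2023-12-28", "2024-01-05"]),
})

# Relate subsequent payments to delivery dates on the underlying delivery documents.
merged = payments.merge(deliveries, on="doc_id")
unrecorded = merged[(merged["delivered_on"] <= PERIOD_END) & ~merged["recorded"]]
print("possible unrecorded liabilities:", unrecorded["amount"].sum())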
DATA VISUALIZATION
CONCLUSION
The use of automation, big data and other technological advances such as
machine learning will continue to grow in accounting and audit, producing
important business intelligence tools that provide historical, current and
predictive views of business operations in interactive data visualizations.
Business intelligence systems allow accounting professionals to make better
decisions by analyzing very large volumes of data from all lines of business,
resulting in increased productivity and accuracy and better insights to make
more informed decisions. The built-in, customizable dashboards allow for
real-time reporting and analysis, where exceptions, trends and opportunities
can be identified and transactional data drilled down for greater detail.
Analytics, artificial intelligence, and direct linkages to clients’
transaction systems can allow audits to be a continuous rather than an
annual process, and material misstatements and financial irregularities can
be detected in real time as they occur, providing near real-time assurance.
ACKNOWLEDGEMENTS
Mui Kim Chu is senior lecturer at Singapore Institute of Technology. Kevin
Ow Yong is associate professor at Singapore Institute of Technology. We
wish to thank Khin Yuya Thet for her research assistance. All errors are our
own.
REFERENCES
1. Alawadhi, A. (2015). The Application of Data Visualization in
Auditing. Rutgers, The State University of New Jersey
2. Davenport, T. H. (2016). The Power of Advanced Audit Analytics
Everywhere Analytics. Deloitte Development LLC. https://www2.
deloitte.com/content/dam/Deloitte/us/Documents/deloitte-analytics/
us-da-advanced-audit-analytics.pdf
3. Dickey, G., Blanke, S., & Seaton, L. (2019). Machine Learning in
Auditing: Current and Future Applications. The CPA Journal, 89, 16-
21.
4. Eaton, T., & Baader, M. (2018). Data Visualization Software: An
Introduction to Tableau for CPAs. The CPA Journal, 88, 50-53.
5. Haq, I., Abatemarco, M., & Hoops, J. (2020). The Development of
Machine Learning and its Implications for Public Accounting. The
CPA Journal, 90, 6-9.
6. IAASB (2018). Exploring the Growing Use of Technology in the Audit,
with a Focus on Data Analytics. International Auditing and Assurance
Standards Board.
7. Lim, J. M., Lam, T., & Wang, Z. (2020). Using Data Analytics in a
Financial Statement Audit. IS Chartered Accountant Journal.
8. Martens, D., Bruynseels, L., Baesens, B., Willekens, M., & Vanthienen,
J. (2008). Predicting Going Concern Opinion with Data Mining.
Decision Support Systems, 45, 765-777. https://doi.org/10.1016/j.
dss.2008.01.003
9. McQuilken, D. (2019). 5 Steps to Get Started with Audit Data Analytics.
AICPA. https://blog.aicpa.org/2019/05/5-steps-to-get-started-with-
audit-data-analytics.html#sthash.NSlZVigi.dpbs
10. Paltrinieri, N., Comfort, L., & Reniers, G. (2019). Learning about Risk:
Machine Learning for Risk Assessment. Safety Science, 118, 475-486.
https://doi.org/10.1016/j.ssci.2019.06.001
11. Skapoullis, C. (2018). The Need for Data Visualisation. ICAEW.
https://www.icaew.com/technical/business-and-management/strategy-
risk-and-innovation/risk-management/internal-audit-resource-centre/
the-need-for-data-visualisation
12. Tschakert, N., Kokina, J., Kozlowski, S., & Vasarhelyi, M. (2016). The
Next Frontier in Data Analytics. Journal of Accountancy, 222, 58.
13. Zabeti, S. (2019). How Audit Data Analytics Is Changing Audit. Accru.
Chapter 5
Big Data Analytics in Immunology: A Knowledge-Based Approach
Guang Lan Zhang1, Jing Sun2, Lou Chitkushev1, and Vladimir Brusic1
1 Department of Computer Science, Metropolitan College, Boston University, Boston, MA 02215, USA
2 Cancer Vaccine Center, Dana-Farber Cancer Institute, Harvard Medical School, Boston, MA 02115, USA
ABSTRACT
With the vast amount of immunological data available, immunology
research is entering the big data era. These data vary in granularity, quality,
and complexity and are stored in various formats, including publications,
technical reports, and databases. The challenge is to make the transition
from data to actionable knowledge and wisdom and bridge the knowledge
gap and application gap. We report a knowledge-based approach based on
a framework called KB-builder that facilitates data mining by enabling
Citation: Guang Lan Zhang, Jing Sun, Lou Chitkushev, Vladimir Brusic, “Big Data An-
alytics in Immunology: A Knowledge-Based Approach”, BioMed Research Internation-
al, vol. 2014, Article ID 437987, 9 pages, 2014. https://doi.org/10.1155/2014/437987.
Copyright: © 2014 by Authors. This is an open access article distributed under the
Creative Commons Attribution License, which permits unrestricted use, distribution,
and reproduction in any medium, provided the original work is properly cited.
INTRODUCTION
Data represent the lowest level of abstraction and do not have meaning
by themselves. Information is data that has been processed so that it
gives answers to simple questions, such as “what,” “where,” and “when.”
Knowledge represents the application of data and information at a higher
level of abstraction, a combination of rules, relationships, ideas, and
experiences, and gives answers to “how” or “why” questions. Wisdom
is achieved when the acquired knowledge is applied to offer solutions to
practical problems. The data, information, knowledge, and wisdom (DIKW)
hierarchy summarizes the relationships between these levels, with data at
its base and wisdom at its apex and each level of the hierarchy being an
essential precursor to the levels above (Figure 1(a)) [1, 2]. The acquisition
cost is lowest for data acquisition and highest for knowledge and wisdom
acquisition (Figure 1(b)).
Figure 1: The DIKW hierarchy. (a) The relative quantities of data, information,
knowledge, and wisdom. (b) The relative acquisition cost of the different layers.
(c) The gap between data and knowledge and (d) the gap between knowledge
and wisdom.
In immunology, for example, a newly sequenced molecular sequence
without functional annotation is a data point, information is gained by
annotating the sequence to answer questions such as which viral strain
it originates from, knowledge may be obtained by identifying immune
epitopes in the viral sequence, and the design of a peptide-based vaccine
using the epitopes represents the wisdom level. Overwhelmed by the vast
amount of immunological data, we are confronted with several challenges in
making the transition from data to actionable knowledge and wisdom and in
bridging the knowledge and application gaps. These include asking the
“right questions,” handling unstructured data, data quality control (garbage
in, garbage out), integrating data from various sources in various formats,
and developing specialized analytics tools with the capacity to handle large
volumes of data.
The human immune system is a complex system comprising the innate
immune system and the adaptive immune system. There are two branches of
adaptive immunity, humoral immunity effected by the antibodies and cell-
mediated immunity effected by the T cells of the immune system. In humoral
immunity, B cells produce antibodies for neutralization of extracellular
pathogens and their antigens that prevent the spread of infection. The
targets composed of conserved T-cell and B-cell epitopes that are broadly
cross-reactive to viral subtypes and protective of a large host population
(Figure 2).
West Nile virus, herpes zoster, pneumococcus, and the malaria parasite [15].
Systems biology aims to study the interactions between relevant molecular
components and their changes over time and enable the development of
predictive models. The advent of technological breakthroughs in the fields of
genomics, proteomics, and other “omics” is catalyzing advances in systems
immunology, a new field under the umbrella of systems biology [16] . The
synergy between systems immunology and vaccinology enables rational
vaccine design [17].
Big data describes the environment where massive data sources combine
both structured and unstructured data so that the analysis cannot be performed
using traditional database and analytical methods. Increasingly, data sources
from literature and online sources are combined with the traditional types
of data [18] for summarization of complex information, extraction of
knowledge, decision support, and predictive analytics. With the increase of
the data sources, both the knowledge and application gaps (Figures 1(c) and
1(d)) keep widening and the corresponding volumes of data and information
are rapidly increasing. We describe a knowledge-based approach that helps
reduce the knowledge and application gaps for applications in immunology
and vaccinology.
sequences, or both, and their related information are collected from various
sources. The collected data are then reformatted and organized into a
unified XML format. Module 2 (data cleaning, enrichment, and annotation
module) deals with data incompleteness, inconsistency, and ambiguities
due to the lack of submission standards in the online primary databases.
The semiautomated data cleaning is performed by domain experts to ensure
data quality, completeness, and redundancy reduction. Semiautomated data
enrichment and annotation are performed by the domain experts further
enhancing data quality. The semiautomation involves automated comparison
of new entries to the entries already processed within the KB and comparison
of terms that are entered into locally implemented dictionaries. Terms that
match the existing record annotations and dictionary terms are automatically
processed. New terms and new annotations are inspected by a curator and
if in error they are corrected, or if they represent novel annotations or
terms they are added to the knowledgebase and to the local dictionaries.
Module 3 (the import module) performs automatic import of the XML file
into the central repository. Module 4 (the basic analysis toolset) facilitates
fast integration of common analytical tools with the online antigen KB.
All our knowledgebases have the basic keyword search tools for locating
antigens and T-cell epitopes or HLA ligands. The advanced keyword search
tool was included in FLAVIdB, FLUKB, and HPVdB, where users further
restrict the search by selecting virus species, viral subtype, pathology, host
organism, viral strain type, and several other filters. Other analytical tools
include sequence similarity search enabled by basic local alignment search
tool (BLAST) [24] and color-coded multiple sequence alignment (MSA)
tool [25] on user-defined sequence sets as shown in Figure 4. Module 5 (the
specialized analysis toolset) facilitates fast integration of specialized analysis
tools designed according to the specific purpose of the knowledgebase and
the structural and functional properties of the source of the sequences. To
facilitate efficient antigenicity analysis, in every knowledgebase and within
each antigen entry, we embedded a tool that performs on-the-fly binding
prediction to 15 frequent HLA class I and class II alleles. In TANTIGEN,
an interactive visualization tool, mutation map, has been implemented to
provide a global view of all mutations reported in a tumor antigen. Figure
5 shows a screenshot of mutation map of tumor antigen epidermal growth
factor receptor (EGFR) in TANTIGEN. In TANTIGEN and HPVdB, a
T-cell epitope visualization tool has been implemented to display epitopes
in all isoforms of a tumor antigen or sequences of a HPV genotype. The
B-cell visualization tool in FLAVIdB and FLUKB displays neutralizing
of 333 peptides as T-cell epitope candidates. This set could form the basis
for a broadly neutralizing dengue virus vaccine. The results of block entropy
analysis of dengue subtypes 1–4 from FLAVIdB are shown in Figure 7.
Strains carrying motifs identical to the known neutralized or escape B-cell
epitope motifs are proposed as neutralized or escape strains, respectively.
The cross-neutralization estimation workflow provides an overview of
cross-neutralization by existing neutralizing antibodies, while the B-cell epitope
mapper workflow gives an estimate of the possible neutralizing effect of
known neutralizing antibodies against new viral strains. This knowledge-based
approach improves our understanding of antibody/antigen interactions,
facilitates mapping of the known universe of target antigens, allows the
prediction of cross-reactivity, and speeds up the design of broadly protective
influenza vaccines.
CONCLUSIONS
Big data analytics applies advanced analytic methods to data sets that
are very large and complex and that include diverse data types. These
advanced analytics methods include predictive analytics, data mining, text
mining, integrated statistics, visualization, and summarization tools. The
data sets used in our case studies are complex and the analytics is achieved
through the definition of workflows. The data explosion in our case studies is
fueled by the combinatorial complexity of the domain and the disparate
data types. The cost of analysis and computation increases exponentially
as we combine various types of data to answer research questions. We use
the in silico identification of influenza T-cell epitopes restricted by HLA
class I variants as an example. There are 300,000 influenza sequences to
be analyzed for T-cell epitopes using MHC binding prediction tools based
on artificial neural networks or support vector machines [37–40]. Based on
the DNA typing for the entire US donor registry, there are 733 HLA-A, 921
HLA-B, and 429 HLA-C variants, a total of 2083 HLA variants, observed in
the US population [41] . These alleles combine into more than 45,000 haplotypes
(combinations of HLA-A, -B, and -C) [41]. Each of these haplotypes has
different frequencies and distributions across different populations. The
in silico analysis of MHC class I restricted T-cell epitopes includes MHC
binding prediction of all overlapping peptides that are 9–11 amino acids
long. This task alone involves a systematic analysis of 300,000 sequences
that are on average 300 amino acids long. Therefore, the total number of
in silico predictions is approximately 300,000 × 300 × 3 × 2083 (the number
of sequences times the average length of each sequence times the 3 peptide
lengths times the number of observed HLA variants), or a total of about
5.6 × 10^11 calculations.
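This estimate can be checked in a few lines of code; the sketch below also shows where the factor of 3 comes from, since a sequence of length L contains L - k + 1 overlapping peptides of length k:

sequences = 300_000              # influenza sequences
avg_length = 300                 # average sequence length (amino acids)
hla_variants = 733 + 921 + 429   # HLA-A + HLA-B + HLA-C = 2083

# Approximation used above: about avg_length windows per peptide length.
approx_windows = avg_length * 3
print(sequences * approx_windows * hla_variants)   # roughly 5.6e11 predictions

# Exact count of overlapping 9-11-mers in a 300-residue sequence.
exact_windows = sum(avg_length - k + 1 for k in (9, 10, 11))   # 873
print(sequences * exact_windows * hla_variants)    # roughly 5.5e11 predictions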
Predictive models do not exist for all HLA alleles, so some analysis needs
REFERENCES
1. J. Rowley, “The wisdom hierarchy: representations of the DIKW
hierarchy,” Journal of Information Science, vol. 33, no. 2, pp. 163–
180, 2007.
2. R. Ackoff, “From data to wisdom,” Journal of Applies Systems
Analysis, vol. 16, no. 1, pp. 3–9, 1989.
3. C. Janeway, Immunobiology: The Immune System in Health and
Disease, Garland Science, New York, NY, USA, 6th edition, 2005.
4. M. H. V. van Regenmortel, “What is a B-cell epitope?” Methods in
Molecular Biology, vol. 524, pp. 3–20, 2009.
5. S. C. Meuer, S. F. Schlossman, and E. L. Reinherz, “Clonal analysis
of human cytotoxic T lymphocytes: T4+ and T8+ effector T cells
recognize products of different major histocompatibility complex
regions,” Proceedings of the National Academy of Sciences of the
United States of America, vol. 79, no. 14 I, pp. 4395–4399, 1982.
6. J. H. Wang and E. L. Reinherz, “Structural basis of T cell recognition
of peptides bound to MHC molecules,” Molecular Immunology, vol.
38, no. 14, pp. 1039–1049, 2002.
7. R. Vita, L. Zarebski, J. A. Greenbaum et al., “The immune epitope
database 2.0,” Nucleic Acids Research, vol. 38, supplement 1, pp.
D854–D862, 2009.
8. J. Robinson, J. A. Halliwell, H. McWilliam, R. Lopez, P. Parham, and
S. G. E. Marsh, “The IMGT/HLA database,” Nucleic Acids Research,
vol. 41, no. 1, pp. D1222–D1227, 2013.
9. A. Sette and J. Sidney, “Nine major HLA class I supertypes account
for the vast preponderance of HLA-A and -B polymorphism,”
Immunogenetics, vol. 50, no. 3-4, pp. 201–212, 1999.
10. O. Lund, M. Nielsen, C. Kesmir et al., “Definition of supertypes for HLA
molecules using clustering of specificity matrices,” Immunogenetics,
vol. 55, no. 12, pp. 797–810, 2004.
11. R. Rappuoli, “Reverse vaccinology,” Current Opinion in Microbiology,
vol. 3, no. 5, pp. 445–450, 2000.
12. D. C. Koboldt, K. M. Steinberg, D. E. Larson, R. K. Wilson, and E. R.
Mardis, “The next-generation sequencing revolution and its impact on
genomics,” Cell, vol. 155, no. 1, pp. 27–38, 2013.
transform,” Nucleic Acids Research, vol. 30, no. 14, pp. 3059–3066,
2002.
26. J. Sun, U. J. Kudahl, C. Simon, Z. Cao, E. L. Reinherz, and V.
Brusic, “Large-scale analysis of B-cell epitopes on influenza virus
hemagglutinin—implications for cross-reactivity of neutralizing
antibodies,” Frontiers in Immunology, vol. 5, article 38, 2014.
27. J. Sun, G. L. Zhang, L. R. Olsen, E. L. Reinherz, and V. Brusic,
“Landscape of neutralizing assessment of monoclonal antibodies
against dengue virus,” in Proceedings of the International Conference
on Bioinformatics, Computational Biology and Biomedical Informatics
(BCB ‘13), p. 836, Washington, DC, USA, 2013.
28. G. E. Crooks, G. Hon, J. Chandonia, and S. E. Brenner, “WebLogo: a
sequence logo generator,” Genome Research, vol. 14, no. 6, pp. 1188–
1190, 2004.
29. L. R. Olsen, U. J. Kudahl, C. Simon et al., “BlockLogo: visualization of
peptide and sequence motif conservation,” Journal of Immunological
Methods, vol. 400-401, pp. 37–44, 2013.
30. J. Söllner, A. Heinzel, G. Summer et al., “Concept and application of
a computational vaccinology workflow,” Immunome Research, vol. 6,
supplement 2, article S7, 2010.
31. L. R. Olsen, G. L. Zhang, E. L. Reinherz, and V. Brusic, “FLAVIdB: a
data mining system for knowledge discovery in flaviviruses with direct
applications in immunology and vaccinology,” Immunome Research,
vol. 7, no. 3, pp. 1–9, 2011.
32. G. L. Zhang, A. B. Riemer, D. B. Keskin, L. Chitkushev, E. L. Reinherz,
and V. Brusic, “HPVdb: a data mining system for knowledge discovery
in human papillomavirus with applications in T cell immunology and
vaccinology,” Database, vol. 2014, Article ID bau031, 2014.
33. A. B. Riemer, D. B. Keskin, G. Zhang et al., “A conserved E7-derived
cytotoxic T lymphocyte epitope expressed on human papillomavirus
16-transformed HLA-A2+ epithelial cancers,” Journal of Biological
Chemistry, vol. 285, no. 38, pp. 29608–29622, 2010.
34. D. B. Keskin, B. Reinhold, S. Lee et al., “Direct identification of
an HPV-16 tumor antigen from cervical cancer biopsy specimens,”
Frontiers in Immunology, vol. 2, article 75, 2011.
35. L. R. Olsen, G. L. Zhang, D. B. Keskin, E. L. Reinherz, and V. Brusic,
“Conservation analysis of dengue virus T-cell epitope-based vaccine
Chapter 6
Integrated Real-Time Big Data Stream Sentiment Analysis Service
ABSTRACT
Opinion (sentiment) analysis on big data streams from the constantly
generated text streams on social media networks to hundreds of millions
of online consumer reviews provides many organizations in every field
with opportunities to discover valuable intelligence from the massive user
generated text streams. However, traditional content analysis frameworks
are inefficient at handling the unprecedentedly large volume of unstructured
text streams and the complexity of text analysis tasks for real-time
opinion analysis on the big data streams. In this paper, we propose a parallel
real time sentiment analysis system: Social Media Data Stream Sentiment
Analysis Service (SMDSSAS) that performs multiple phases of sentiment
analysis of social media text streams effectively in real time with two fully
analytic opinion mining models to combat the scale of text data streams
Citation: Chung, S. and Aring, D. (2018), “Integrated Real-Time Big Data Stream Sen-
timent Analysis Service”. Journal of Data Analysis and Information Processing, 6, 46-
66. doi: 10.4236/jdaip.2018.62004.
Copyright: © 2018 by authors and Scientific Research Publishing Inc. This work is li-
censed under the Creative Commons Attribution International License (CC BY). http://
creativecommons.org/licenses/by/4.0
INTRODUCTION
In the era of web-based social media, user-generated content in any
form, including blogs, wikis, forums, posts, chats,
tweets, and podcasts, has become the norm for expressing people's
opinions. The amounts of data generated by individuals, businesses,
government, and research agents have undergone exponential growth. Social
networking giants such as Facebook and Twitter had 1.86 and 0.7 billion
active users as of Feb. 2018. The user-generated texts are valuable resources
to discover useful intelligence to help people in any field to make critical
decisions. Twitter has become an important platform of user generated text
streams where people express their opinions and views on new events,
new products or news. Such new events or news from announcing political
parties and candidates for elections to a popular new product release are
often followed almost instantly by a burst in Twitter volume, providing a
unique opportunity to measure the relationship between expressed public
sentiment and the new events or the new products.
Sentiment analysis can help explore how these events affect public
opinion or how public opinion affects future sales of these new products.
While traditional content analysis takes days or weeks to complete, opinion
More sophisticated sentiment analyses on the streaming data are mostly
MapReduce-based batch mode analytics. While batch mode data processing
works for sophisticated sentiment analysis on social media data are common,
only a few works propose systems that perform complex real-time sentiment
analysis on big data streams [7] [8] [9], and little work exists in which
such proposed systems have been implemented and tested with real-time data
streams.
Sentiment analysis, otherwise known as opinion mining, commonly
refers to the use of natural language processing (NLP) and text analysis
techniques to extract and quantify subjective information in a text span
[10]. NLP is a critical component in extracting useful viewpoints from
streaming data [10]. Supervised classifiers are then trained on labeled
training sets to predict sentiment. The polarity (positive or negative
opinion) of a sentence is measured with scoring algorithms that quantify the
polarity level of the opinion in the sentence. The most established NLP
method for capturing the essential meaning of a document is the bag-of-words
(or bag-of-n-grams) representation [11]. Latent Dirichlet Allocation (LDA)
[12] is another widely adopted representation. However, both representations
are limited in capturing the semantic relatedness (context) between words in
a sentence and suffer from problems such as polysemy and synonymy [13].
A recent paradigm in NLP, unsupervised text embedding methods such
as Skip-gram [14] [15] and Paragraph Vector [16] [17], which use distributed
representations for words [14] [15] and documents [16] [17], has been shown
to be effective and scalable in capturing semantic and syntactic
relationships, such as polysemy and synonymy, between words and documents.
The essential idea of these approaches comes from the distributional
hypothesis that a word is represented by its neighboring (context) words, in
that you shall know a word by the company it keeps [18]. Le and Mikolov
[16] [17] show that their method, Paragraph Vectors, can be used in
classifying movie reviews or clustering web pages. We employed a pre-trained
network with the paragraph vector model [19] in our system for
preprocessing, to identify n-grams and synonymy in our data sets.
An advanced sentiment analysis beyond polarity is aspect-based
opinion mining, which looks at other factors (aspects) to determine
sentiment polarity, such as feelings of happiness, sadness, or anger. An
example of aspect-oriented opinion mining is classifying movie reviews as
thumbs up or thumbs down, as seen in the 2004 paper and many other papers by
Pang and Lee [10] [20]. Another technique is the lexical approach to opinion
SMDSSAS for the real time sentiment analysis. The results show
that our framework can serve as an efficient and scalable tool to
extract, transform, score, and analyze opinions from user-generated
big social media text streams in real time.
RELATED WORKS
Many existing works in the related literature concentrate on topic-based
opinion mining models. In topic-based opinion mining, sentiment is
estimated from the messages related to a chosen topic of interest, such that
topic and sentiment are jointly inferred [22]. There are many works on
topic-based sentiment analysis where the models are tested in a batch
setting, as listed in the reference section. While there are many works on
topic-based models for batch processing systems, there are few works in the
literature on topic-based models for real-time sentiment analysis on
streaming data. Real-time topic sentiment analysis is imperative to meet
the strict time and space constraints of efficiently processing streaming
data [6]. Wang et al. [6] developed a system for real-time Twitter sentiment
analysis of the 2012 Presidential Election Cycle using the Twitter firehose
with a statistical sentiment model and a Naive Bayes classifier on unigram
features. A full suite of analytics was developed for monitoring the shift
in sentiment, utilizing expert-curated rules and keywords in order to gain
an accurate picture of the online political landscape in real time. However,
these works in the existing literature lacked complexity in their sentiment
analysis processes: their sentiment analysis models are based on simple
aggregations for statistical summary with only minimal, primitive language
preprocessing.
More recent research [23] [24] has proposed big data stream processing
architectures. The first work, in 2015 [23], proposed a multi-layered
Storm-based approach for applying sentiment analysis to big data streams in
real time, and the second work, in 2016 [24], proposed a big data analytics
framework (ASMF) to analyze consumer sentiments embedded in hundreds of
millions of online product reviews. Both approaches leverage probabilistic
language models, either by mimicking "document relevance" with the
probability of the document generating a user-provided query term found
within the sentiment lexicon [23], or by adapting a classical language
modeling framework to enhance the prediction of consumer sentiments [24].
However, the major limitation of their work is that neither of the proposed
frameworks has been implemented and tested in an empirical setting or in
real time.
Our sixth and final layer is the Presentation layer, which consists of a
web-based user interface.
SENTIMENT MODEL
Extracting useful viewpoints (aspects), in context and subjectivity, from
streaming data is a critical task for sentiment analysis. Classical
approaches to sentiment analysis have their own limitations in identifying
accurate contexts; for instance, for lexicon-based methods, common sentiment
lexicons may not be able to detect the context-sensitive nature of opinion
expressions. For example, while the term "small" may have a negative
polarity in a mobile phone review that refers to a "small" screen size,
the same term could have a positive polarity, as in "a small and handy
notebook" in consumer reviews about computers. In fact, the token "small"
is defined as a negative opinion word in the well-known sentiment lexicon
list OpinionFinder [28].
The sentiment models developed for SMDSSAS are based on the aspect
model [29]. Aspect-based opinion mining techniques identify and extract
personal opinions and emotions surrounding social or political events by
capturing semantically oriented content, in subjectivity and context, that
is correlated by aspects, i.e., topic words. The design of our sentiment
model is based on the assumption that positive and negative opinions can be
estimated within the context of a given topic [22]. Therefore, in generating
data for model training and testing, we employed a topic-based approach to
perform sentiment annotation and quantification on related user tweets.
The aspect model is the core of probabilistic latent semantic analysis,
a probabilistic language model for general co-occurrence data which
associates a class (topic) variable t∈T={t1,t2,⋯,tk} with each occurrence of
a word w∈W={w1,w2,⋯,wm} in a document d∈D={d1,d2,⋯,dn}. The aspect model is
a joint probability model that can be defined as selecting a document d with
probability P(d), picking a latent class (topic) t with probability P(t|d),
and generating a word (token) w with probability P(w|t).
As a result one obtains an observed pair (d,w), while the latent class
variable t is discarded. Translating this process into a joint probability
model results in the following expression:
P(d,w) = P(d)P(w|d) (1)
where
P(w|d) = ∑t∈T P(w|t)P(t|d) (2)
Essentially, to derive (2) one has to sum over the possible choices of t
that could have generated the observation.
The aspect model is based on two independence assumptions: first,
all pairs (d,w) are assumed to occur independently; this essentially
corresponds to the bag-of-words (or bag-of-n-grams) approach. Second,
a conditional independence assumption is made: conditioned on the latent
class t, words w occur independently of the specific document identity di.
Since the number of class states is smaller than the number of documents
(K≪N), t acts as a bottleneck variable in predicting w conditioned on d.
Following the likelihood principle, P(d), P(t|d), and P(w|t) can be
determined by maximization of the log-likelihood function
L = ∑d∈D ∑w∈W n(d,w) log P(d,w) (3)
where n(d,w) denotes the term frequency, i.e., the number of times w occurred
in d. An equivalent symmetric version of the model can be obtained by
inverting the conditional probability P(t|d) with Bayes' theorem, which
results in
P(d,w) = ∑t∈T P(t)P(d|t)P(w|t) (4)
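As a rough illustration of Equations (1)-(3), and not part of the original paper, the following minimal Python sketch evaluates the aspect-model joint probability and its log-likelihood for already estimated parameters; the array names and shapes below are our own assumptions for the example.

```python
import numpy as np

def plsa_log_likelihood(n_dw, p_d, p_t_given_d, p_w_given_t):
    """Log-likelihood of the aspect model, Equation (3), for fixed parameters.

    n_dw        : (N, M) term-frequency matrix n(d, w)
    p_d         : (N,)   document probabilities P(d)
    p_t_given_d : (N, K) topic mixtures P(t|d)
    p_w_given_t : (K, M) word distributions P(w|t)
    """
    p_w_given_d = p_t_given_d @ p_w_given_t     # Equation (2): sum_t P(w|t) P(t|d)
    p_dw = p_d[:, None] * p_w_given_d           # Equation (1): P(d,w) = P(d) P(w|d)
    return float(np.sum(n_dw * np.log(p_dw + 1e-12)))
```

In a full PLSA implementation these parameters would be fitted by EM to maximize the value this function returns; the sketch only shows how the three probability factors combine.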
In the Information Retrieval context, this aspect model is used to
estimate the probability that a document d is related to a query q [2].
Such probabilistic inference is used to derive a weighted vector in the
Vector Space Model (VSM), where a document d contains a user-given query q
[2], q being a phrase or a sentence treated as a set of classes (topic
words), i.e., d∩q=T={t1,t2,⋯,tk}.
(5)
where tf.idft,d is defined as the term weight wt,d of a topic word t, with
tft,d being the term frequency with which the topic word t occurs in d, and
idft being the inverse document frequency defined below with dft, the number
of documents that contain t. N is the total number of documents.
idft = log(N/dft) (6)
Then d and q are represented with weighted vectors over the common
terms. score(q,d) can be derived using the cosine similarity function to
capture the concept of document "relevance" of d with respect to q in the
context of the topic words in q. The cosine similarity function is defined as
score(q,d) = (q ⋅ d) / (|q||d|) (7)
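The tf.idf weighting and cosine relevance score described around Equations (5)-(7) can be sketched in a few lines of Python; the dictionary-based vector representation below is our own simplification for illustration, not the paper's implementation.

```python
import math
from collections import Counter

def tfidf_vector(tokens, df, n_docs):
    """w_{t,d} = tf_{t,d} * idf_t, with idf_t = log(N / df_t) as in Equations (5)-(6)."""
    tf = Counter(tokens)
    return {t: tf[t] * math.log(n_docs / df[t]) for t in tf if df.get(t)}

def score(q_vec, d_vec):
    """Equation (7): cosine similarity of the weighted query and document vectors."""
    dot = sum(q_vec[t] * d_vec.get(t, 0.0) for t in q_vec)
    norm_q = math.sqrt(sum(v * v for v in q_vec.values()))
    norm_d = math.sqrt(sum(v * v for v in d_vec.values()))
    return dot / (norm_q * norm_d) if norm_q and norm_d else 0.0
```

Here `df` is a mapping from term to document frequency and `n_docs` is N; a document is considered more relevant to q the closer its weighted vector points in the same direction as the query vector.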
Context Identification
We derive a topic set T(q) by generating the set of all related topic words
from a user-given query (topics) q={t1,t2,⋯,tk}, where q is a set of tokens.
For each token ti in q, we derive the related topic words to add to the
topic set T(q) based on the related language semantics R(ti), as follows.
(8)
where ti,tj∈T. ti.*|*.ti denotes any word concatenated with ti, ti_tj
denotes a bi-gram of ti and tj, and label_synonym(ti) is the set of labeled
synonyms of ti in the dictionary identified in WordNet [23]. For context
identification, we can choose to employ the pre-trained network with the
paragraph vector model [16] [17] in our system for preprocessing. The
paragraph vector model is more robust in identifying synonyms of a new word
that is not in the dictionary.
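A rough sketch of this context identifier, assuming NLTK's WordNet interface for the labeled synonyms, is given below; the `corpus_vocab` argument (the vocabulary scanned for the ti.*|*.ti concatenations) is a name we introduce purely for illustration.

```python
from nltk.corpus import wordnet  # requires the NLTK WordNet corpus to be installed

def topic_set(query_tokens, corpus_vocab):
    """Approximate T(q): query tokens plus concatenations (ti.*|*.ti),
    bi-grams (ti_tj) and labeled WordNet synonyms of each token."""
    related = set(query_tokens)
    for ti in query_tokens:
        related.update(w for w in corpus_vocab if ti in w and w != ti)   # ti.*|*.ti
        related.update(f"{ti}_{tj}" for tj in query_tokens if tj != ti)  # bi-grams ti_tj
        for syn in wordnet.synsets(ti):                                  # label_synonym(ti)
            related.update(lemma.lower() for lemma in syn.lemma_names())
    return related
```

A paragraph-vector model would replace or extend the WordNet lookup for out-of-dictionary words, as the text notes.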
(9)
where di is a document (message) in a tweet stream D of a given topic set T,
with 1 ≤ i ≤ n and di={w1,⋯,wm}, m being the number of words in di. Pos(wk)
= 1 if wk is a positive word and Neg(wk) = −1 if wk is a negative word.
sentiment(di) is the difference between the frequency of positive words,
denoted FreqPos(di), and the frequency of negative words, denoted
FreqNeg(di), in di, indicating an initial opinion polarity measure with
−m ≤ sentiment(di) ≤ m. The sentiment label of di is sentimenti = 1
(positive) if sentiment(di) ≥ 1, 0 (neutral) if sentiment(di) = 0, and −1
(negative) if sentiment(di) ≤ −1.
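A minimal sketch of this polarity count, assuming simple set-based positive and negative lexicons (for example, the OpinionFinder word lists), could look as follows; it is illustrative only.

```python
def sentiment(doc_tokens, pos_lexicon, neg_lexicon):
    """sentiment(d_i) = FreqPos(d_i) - FreqNeg(d_i), with a label of
    1 (positive), 0 (neutral) or -1 (negative) taken from its sign."""
    freq_pos = sum(1 for w in doc_tokens if w in pos_lexicon)
    freq_neg = sum(1 for w in doc_tokens if w in neg_lexicon)
    raw = freq_pos - freq_neg          # -m <= raw <= m for a document of m tokens
    label = 1 if raw >= 1 else (-1 if raw <= -1 else 0)
    return raw, label
```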
Then, we define w(di), a weight for the sentiment orientation of di that
measures the subjectivity of the sentiment orientation of a document. A
weighted sentiment measure for di, senti_score(di), is then defined with
w(di) and sentimenti, the sentiment label of di, as the sentiment score of
di, as follows:
(10)
(11)
where −1 ≤ w(di) ≤ 1, and α is a control parameter for learning. When α =
0, senti_score(di) = sentimenti. senti_score(di) gives more weight to a
short message with strong sentiment orientation. w(di) = 0 for neutral.
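Equations (10)-(11) are not legible in this reprint, so the sketch below encodes one plausible reading consistent with the surrounding text: w(di) is taken to be the length-normalized polarity count sentiment(di)/m, and senti_score(di) = sentimenti + α·w(di), which reduces to the plain label when α = 0 and favors short messages with strong orientation. Treat the exact combination as an assumption, not the authors' formula.

```python
def senti_score(doc_tokens, pos_lexicon, neg_lexicon, alpha=0.5):
    """Assumed reading of Equations (10)-(11): w(d_i) is the length-normalized
    polarity count; alpha = 0 reduces the score to the plain sentiment label."""
    freq_pos = sum(1 for w in doc_tokens if w in pos_lexicon)
    freq_neg = sum(1 for w in doc_tokens if w in neg_lexicon)
    raw = freq_pos - freq_neg
    label = 1 if raw >= 1 else (-1 if raw <= -1 else 0)
    m = max(len(doc_tokens), 1)
    w = raw / m                     # assumed w(d_i); -1 <= w <= 1, 0 when neutral
    return label + alpha * w        # assumed combination; equals label when alpha = 0
```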
Class Max Sentiment Extraction (CMSE): To test the performance of our
models and to predict the outcomes of events such as the 2016 Presidential
election from the user opinions extracted from tweet streams, we quantify
the level of sentiment in a data set with Class Max Sentiment Extraction
(CMSE). CMSE generates statistically relevant absolute sentiment values that
measure the overall sentiment orientation of a data set, for a given topic
set and for each sentiment polarity class, so that different data sets can
be compared. To quantify the sentiment of a data set D of a given topic set
T, we define CMSE(D(T)) as follows.
For a given topic set T, for each di∈D(T), where di contains at least one
of the topic words of interest in T in a given tweet stream D, CMSE(D(T))
returns a weighted sum of senti_score(di) over the data set D on T as follows:
(12)
(13)
(14)
where 1 ≤ i ≤ n and D(T)={d1,⋯,dn}, n being the number of documents in D(T).
CMSE measures the maximum sentiment orientation value of each
polarity class for a given topic-correlated data set D(T). It is the sum of
the weighted document sentiment scores for each sentiment class―positively
labeled di, negatively labeled di, and neutrally labeled di, respectively―in
a given data set D(T) for a user-given topic word set T. CMSE reduces to
an aggregated count of sentimenti when α = 0.
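Concretely, CMSE can be sketched as three per-class sums of the weighted document scores; the tuple-based input format below is our own choice for the example.

```python
def cmse(scored_docs):
    """Class Max Sentiment Extraction over a topic-correlated data set D(T):
    sum the weighted sentiment scores separately for the positively,
    negatively and neutrally labeled documents.

    scored_docs : iterable of (senti_score, label) pairs, label in {1, 0, -1}
    """
    docs = list(scored_docs)
    cmse_pos = sum(s for s, lbl in docs if lbl == 1)
    cmse_neg = sum(s for s, lbl in docs if lbl == -1)
    cmse_neu = sum(s for s, lbl in docs if lbl == 0)
    return cmse_pos, cmse_neg, cmse_neu
```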
CMSE indicates how strongly positive or negative the sentiment is in a data
set for a given topic word set T, where D(T) is the set of documents
(messages) in a tweet stream in which each document di∈D(T), 1 ≤ i ≤ n,
contains at least one of the topic words tj∈T={t1,⋯,tk}, 1 ≤ j ≤ k, and T is
the set of all related topic words derived from a user-given query q used as
a seed to generate T. Tj, a subset of T, is the set of topic words derived
from a given topic tj∈T. D(Tj), a subset of D(T), is the set of documents in
which each document di contains at least one of the related topic words in
the topic set Tj. Every topic word set is derived by the Context Identifier
described in Section 4.1. With the Donald Trump and Hillary Clinton example,
three topic-correlated data sets are denoted as below.
D(Tj) is a set of documents with a topic word set Tj derived from {Donald
Trump|Hillary Clinton}.
D(TRj) is a set of documents, a subset of D(Tj), with a topic word set TRj
derived from {Donald Trump}.
D(HCj) is a set of documents, a subset of D(Tj), with a topic word set
HCj derived from {Hillary Clinton}.
where m denotes the number of documents di in D(TRj) and D(HCj),
respectively. For example, CMSEpos(D(TRj)) is the maximum positive
opinion measure over the set of tweets talking about the candidate Donald
Trump.
CSOM (Class Sentiment Orientation Measure): CSOM measures the relative
ratio of the level of positive and negative sentiment orientation for a
given topic-correlated data set over the entire data set of interest.
For CSOM, we define two relative opinion measures, Semantic Orientation
(SMO) and Sentiment Orientation (STO), to quantify a polarity for a given
data set correlated with a topic set Tj. SMO indicates the relative polarity
ratio between the two polarity classes within a given topic data set. STO
indicates the ratio of the polarity of a given topic set over the entire
data set.
With our Trump and Hillary example from the 2016 Presidential Election,
the positive SMO for the data set D(TRj) with the topic word "Donald Trump"
and the negative SMO for the Hillary Clinton topic set D(HCj) can be derived
for each polarity class, and are defined as follows:
(15)
(16)
When α = 0, senti_score(di) = sentimenti, so CMSE and SMO are generated from
counts of sentimenti over the data set. Then, the Sentiment Orientation
(STO) for the topic set D(TRj) for Donald Trump and the negative STO for the
topic set D(HCj) for Hillary Clinton are defined as follows:
(17)
(18)
where Weight(TRj) and Weight(HCj) are the weights of the topics over
the entire data set, defined as follows. Therefore, STO(TRj) indicates the
weighted polarity of the topic TRj over the entire data set D(Tj), where
D(Tj)=D(TRj)∪D(HCj).
(19)
(20)
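Equations (15)-(20) are not reproduced in this reprint, so the sketch below only encodes the verbal definitions: SMO as the share of one polarity class relative to both classes within a topic data set, and STO as that share weighted by the topic's proportion of the whole data set, assuming Weight(Tj) = |D(Tj)|/|D(T)|. Both forms are assumptions, not the authors' exact formulas.

```python
def pos_smo(cmse_pos, cmse_neg):
    """Assumed positive SMO: positive share of the two polarity classes in D(T_j)."""
    total = abs(cmse_pos) + abs(cmse_neg)
    return abs(cmse_pos) / total if total else 0.0

def pos_sto(cmse_pos, cmse_neg, n_topic_docs, n_all_docs):
    """Assumed positive STO: SMO weighted by Weight(T_j) = |D(T_j)| / |D(T)|."""
    weight = n_topic_docs / n_all_docs if n_all_docs else 0.0
    return weight * pos_smo(cmse_pos, cmse_neg)
```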
(21)
where m is the number of tokens in di and −2m ≤ SentimentSubj(di) ≤ 2m. Note
that subjScale(wt) of each neutral word is not 0: we treat a strong neutral
opinion as a weak positive and a weak neutral as a weak negative by
assigning very small positive or negative weights. The sentiment of each di
is then defined by the sum of the frequency of each subjectivity group
weighted by its subjScale.
(22)
(23)
Then CMSESubj(D(T)) is the sum of the subjectivity-weighted opinion
polarities for a given topic set D(T) with di∈D(T). It can be defined with
senti_score_subj(di) as follows.
(24)
(25)
(26)
Then, we define our deterministic model ρε(Pos_Tj) as a length-normalized
sum of the subjectivity-weighted senti_score_subj(di) for a given topic Tj
with di∈D(Tj), as follows:
(27)
where D(T) is the set of documents (messages) in a tweet stream in which
each document di∈D(T), 1≤i≤n, contains one of the topic words
tj∈T={t1,⋯,tk}, 1 ≤ j ≤ k; T is the set of all related topic words derived
from the user-given topics; and Tj is the set of all topic words derived
from a given query q, as defined in Section 4.1. D(Tj), a subset of D(T), is
the set of documents in which each document di contains one of the related
topic words in the topic set Tj, and n is the number of documents di in D(Tj).
Our probabilistic model ρ, given a topic set D(T) and D(Tj), measures
the probability of the sentiment polarity of a given topic set D(Tj), where
D(Tj) is a subset of D(T). For example, the probability of a positive
opinion for Trump in D(T), denoted P(Pos_TR), is defined as follows:
(28)
ϵ is a smoothing factor [30], and we consider a strong neutral subjectivity
as a weak positivity here. Then, we define our probabilistic model
ρ(POS_TR) as
(29)
where NegativeInfo(TR) is essentially a subjectivity-weighted NegSMO(TRj),
defined as follows.
(30)
Our probabilistic model penalizes the measure with the weight of the
negative opinion in the correlated topic set D(TR) when measuring the
positive opinion of a topic over a given entire data set D(T).
(31)
(32)
In a multinomial event model, a document is an ordered sequence of word
events, representing the frequencies with which certain events have been
generated by a multinomial distribution (p1⋯pn), where pi is the probability
that event i occurs and xi is the feature vector counting the number of
times event i was observed in an instance [32]. Each document di is drawn
from a multinomial distribution over words with as many independent trials
as the length of di, yielding a "bag of words" representation for the
documents [32]. Thus the probability of a document given its class is
represented by one of k such multinomials [32].
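As an illustration of the multinomial event model (not the paper's code), a Laplace-smoothed multinomial Naive Bayes classifier over bag-of-words counts can be built with scikit-learn; the tiny labeled training set below is hypothetical.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical labeled tweets: 1 = positive, 0 = neutral, -1 = negative
train_texts = ["great debate performance", "worst policy ever", "town hall tonight"]
train_labels = [1, -1, 0]

vectorizer = CountVectorizer()          # word-event counts x_i per document
X_train = vectorizer.fit_transform(train_texts)

clf = MultinomialNB(alpha=1.0)          # multinomial (p_1 ... p_n) with Laplace smoothing
clf.fit(X_train, train_labels)

print(clf.predict(vectorizer.transform(["great town hall tonight"])))
```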
EXPERIMENTS
We applied the sentiment models discussed in Sections 4.2 and 4.3 to the
real-time Twitter stream for the following events―the 2016 US Presidential
election and the 2017 Inauguration. User opinion surrounding the political
candidates and the corresponding election policies was identified,
extracted, and measured in an effort to demonstrate SMDSSAS's accurate
critical decision-making capabilities.
A total of 74,310 topic-correlated tweets were collected, randomly chosen
on a continuous 30-second interval in an Apache Spark DStream accessing the
Twitter Streaming API, for the pre-election week of November 2016 and the
pre-election month of October, as well as the pre-inauguration week in
January. The context detector generates the set of topic words for the
following topics: Hillary Clinton, Donald Trump, and political policies. The
number of topic-correlated tweets for the candidate Donald Trump was
~53,009, while the number for the candidate Hillary Clinton was ~8510, which
is much smaller than that of Trump.
Tweets were preprocessed with a custom cleaning function to remove all
non-English characters, including the Twitter at "@" and hashtag "#" signs,
image/website URLs, punctuation ([. , ! " ']), digits ([0-9]), and
non-alphanumeric characters ($ % & ^ * () + ~), and were stored in a NoSQL
Hive database. Each topic-correlated tweet was labeled for sentiment using
the OpinionFinder subjectivity word lexicon and the subjScale(wt) defined in
Section 4.3, associating a numeric value with each word based on polarity
and subjectivity strength.
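One way such a cleaning function could look (a sketch, not the authors' exact code) is a chain of regular-expression substitutions:

```python
import re

def clean_tweet(text):
    """Strip URLs, @/# markers, punctuation, digits and other non-alphanumeric
    symbols, then normalize whitespace and case before sentiment labeling."""
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)   # image/website URLs
    text = re.sub(r"[@#]", " ", text)                    # Twitter @ and # signs
    text = re.sub(r"[^A-Za-z\s]", " ", text)             # punctuation, digits, $ % & ^ * () + ~
    return re.sub(r"\s+", " ", text).strip().lower()
```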
Figure 3: (a) Polarity comparison of two candidates: Clinton vs Trump with
CMSE and subjectivity weighted CMSE; (b) Comparison of positive sentiment
measure of two candidates with Pos_SMO and deterministic model.
To validate parallel stream data processing, we adopt the method for
evaluating big data stream classifiers proposed by Bifet (2015) [7]. The
standard K-fold cross-validation used in other works with batch methods
treats each fold of the stream independently, and therefore may miss concept
drift occurring in the data stream. To overcome this problem, we employed
the K-fold distributed cross-validation strategy [7] to validate stream
data. Assume we have K different instances of the classifier that we want to
evaluate running in parallel; the classifier does not need to be randomized.
Each time a new example arrives, it is used in the 10-fold distributed
cross-validation: the example is used for testing in one randomly selected
classifier and for training by all the others.
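The evaluation loop can be sketched as follows; the `make_classifier` factory and the `predict`/`learn` interface of the incremental classifiers are placeholders we introduce for illustration, not part of the paper or of any specific library.

```python
import random

def kfold_distributed_cv(stream, make_classifier, k=10):
    """K-fold distributed cross-validation for data streams (Bifet 2015 [7]):
    every arriving example is tested on one randomly chosen classifier
    and used as training data by all the other classifiers."""
    classifiers = [make_classifier() for _ in range(k)]
    correct, seen = [0] * k, [0] * k
    for features, label in stream:
        test_idx = random.randrange(k)
        for i, clf in enumerate(classifiers):
            if i == test_idx:
                seen[i] += 1
                correct[i] += clf.predict(features) == label
            else:
                clf.learn(features, label)      # incremental (online) update
    return [c / s if s else 0.0 for c, s in zip(correct, seen)]
```

Averaging the returned per-fold accuracies gives the figures reported for each data split below.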
Ten-fold distributed cross-validation was performed on our stream data
processing with two different data splits: 60%:40% training:test data, and
90%:10%. Average accuracy was taken for each split, for both the
deterministic and the probabilistic model. Each cross-validation was
performed with classifier optimization parameters, giving the model a range
of smoothing factors, term frequency features, and numeric values for
minimum document frequency. Figure 4 illustrates the accuracies of the
deterministic and probabilistic models. Ten-fold cross-validation on the
90%:10% split with the Deterministic model showed the highest accuracy, with
an average of 81%, and the average accuracy of the Probabilistic model was
almost comparable at 80%. In comparison with existing works, the overall
average accuracy from the cross-validation on each model shows a 1% - 22%
improvement over previous work [6] [7] [8] [9] [22] [23] [24] [29] [30].
Figure 4 below illustrates the cross-validation results of the Deterministic
and Probabilistic models.
CONCLUSIONS
The main contribution of this paper is the design and development of a
real-time big data stream analytic framework, providing a foundation for an
infrastructure for real-time sentiment analysis on big text streams. Our
framework proves to be an efficient, scalable tool to extract, score, and
analyze opinions in user-generated text streams for user-given topics in
real time or near real time. The experimental results demonstrated the
ability of our system architecture to accurately predict the outcome of the
2016 Presidential Race between candidates Hillary Clinton and Donald Trump.
The proposed fully analytic Deterministic and Probabilistic sentiment
models, coupled with the real-time streaming components, were tested on user
tweet streams captured during the pre-election month of October 2016 and the
pre-election week of November 2016. The results showed that our system was
able to correctly predict Donald Trump as the winner of the 2016
Presidential election.
Figure 4: Average cross-validation prediction accuracy on real-time
pre-election tweet streams of the 2016 presidential election for the
deterministic vs. probabilistic model.
The cross-validation results showed that the Deterministic Topic Model in
real-time processing consistently improved accuracy, with an average of 81%,
and the Probabilistic Topic Model averaged 80%, compared to the accuracies
of previous works in the literature, ranging from 59% to 80% [6] [7] [8] [9]
[22] [23] [24] [29] [30], which lacked the complexity of sentiment analysis
processing, in either batch or real-time processing.
Finally, SMDSSAS performed efficient real-time data processing and
sentiment analysis in terms of scalability. The system uses continuous
processing of a small window of the data stream (e.g., consistent processing
of a 30-second window of streaming data), in which machine learning
analytics are performed on the context stream, resulting in more accurate
predictions thanks to the system's ability to continuously apply
multi-layered, fully analytic processes with complex sentiment models to a
constant stream of data. The improved and stable model accuracies
demonstrate that our proposed framework with the two sentiment models offers
a scalable real-time sentiment analytic alternative to traditional batch
mode data analytic frameworks for big data stream analysis.
ACKNOWLEDGEMENTS
The research in this paper was partially supported by the Engineering
College of CSU under the Graduate Research grant.
REFERENCES
1. Armbrust, M., Xin, R.S., Lian, C., Huai, Y., Liu, D., Bradley, J.K.,
Meng, X., Kaftan, T., Franklin, M.J., Ghodsi, A. and Zaharia, M.
(2015) Spark SQL: Relational Data Processing in SPARK. Proceedings
of the ACM SIGMOD International Conference on Management
of Data, Melbourne, 31 May-4 June 2015, 1383-1394. https://doi.
org/10.1145/2723372.2742797
2. Sagiroglu, S. and Sinanc, D. (2013) Big Data: A Review. 2013
International Conference on Collaboration Technologies and Systems
(CTS), San Diego, 20-24 May 2013, 42-47. https://doi.org/10.1109/
CTS.2013.6567202
3. Lars, E. (2015) What’s the Best Way to Manage Big Data for Healthcare:
Batch vs. Stream Processing? Evariant Inc., Farmington.
4. Hu, M. and Liu, B. (2004) Mining and Summarizing Customer Reviews.
Proceedings of the Tenth ACM SIGKDD International Conference on
Knowledge Discovery and Data Mining, Seattle, 22-25 August 2004,
168-177.
5. Liu, B. (2010) Sentiment Analysis and Subjectivity. In: Indurkhya, N.
and Damerauthe, F.J., Eds., Handbook of Natural Language Processing,
2nd Edition, Chapman and Hall/CRC, London, 1-38.
6. Wang, H., Can, D., Kazemzadeh, A., Bar, F. and Narayanan, S.
(2012) A System for Real-Time Twitter Sentiment Analysis of 2012
U.S. Presidential Election Cycle. Proceedings of ACL 2012 System
Demonstrations, Jeju Island, 10 July 2012, 115-120.
7. Bifet, A., Maniu, S., Qian, J., Tian, G., He, C. and Fan, W. (2015)
StreamDM: Advanced Data Mining in Spark Streaming. IEEE
International Conference on Data Mining Workshop (ICDMW),
Atlantic City, 14-17 November 2015, 1608-1611.
8. Kulkarni, S., Bhagat, N., Fu, M., Kedigehalli, V., Kellogg, C., Mittal,
S., Patel, J.M., Ramasamy, K. and Taneja, S. (2015) Twitter Heron:
Stream Processing at Scale. Proceedings of the 2015 ACM SIGMOD
International Conference on Management of Data, Melbourne, 31
May-4 June 2015, 239-250. https://doi.org/10.1145/2723372.2742788
9. Nair, L.R. and Shetty, S.D. (2015) Streaming Twitter Data Analysis
Using Spark For Effective Job Search. Journal of Theoretical and
Applied Information Technology, 80, 349-353.
21. Maite, T., Brooke, J., Tofiloski, M., Voll, K. and Stede, M. (2011)
Lexicon-Based Methods for Sentiment Analysis. Computational
Linguistics, 37, 267-307. https://doi.org/10.1162/COLI_a_00049
22. O’Connor, B., Balasubramanyan, R., Routledge, B. and Smith, N.
(2010) From Tweets to Polls: Linking Text Sentiment to Public Opinion
Time Series. Proceedings of the International AAAI Conference on
Weblogs and Social Media (ICWSM 2010), Washington DC, 23-26
May 2010, 122-129.
23. Cheng, K.M.O. and Lau, R. (2015) Big Data Stream Analytics
for Near Real-Time Sentiment Analysis. Journal of Computer and
Communications, 3, 189-195. https://doi.org/10.4236/jcc.2015.35024
24. Cheng, K.M.O. and Lau, R. (2016) Parallel Sentiment Analysis with
Storm. Transactions on Computer Science and Engineering, 1-6.
25. Thusoo, A., Sarma, J.S., Jain, N., Shao, Z., Chakka, P., Zhang, N.,
Antony, S., Liu, H. and Murthy, R. (2010) Hive—A Petabyte Scale
Data Warehouse Using Hadoop. Proceedings of the International
Conference on Data Engineering, Long Beach, 1-6 March 2010, 996-
1005.
26. Manning, C., Surdeanu, A., Bauer, J., Finkel, J., Bethard, S. and
McClosky, D. (2014) The Stanford CoreNLP Natural Language
Processing Toolkit. Proceedings of 52nd Annual Meeting of the
Association for Computational Linguistics: System Demonstrations,
Baltimore, 23-24 June 2014, 55-60. https://doi.org/10.3115/v1/P14-
5010
27. Finkel, J., Grenager, T. and Manning, C. (2005) Incorporating Non-
Local Information into Information Extraction Systems by Gibbs
Sampling. Proceedings of the 43rd Annual Meeting of the Association
for Computational Linguistics (ACL 2005), Ann Arbor, 25-30 June
2005, 363-370.
28. Wilson, T., Wiebe, J. and Hoffman, P. (2005) Recognizing Contextual
Polarity in Phrase-Level Sentiment Analysis. Proceedings of the
Conference on Human Language Technology and Empirical Methods
in Natural Language Processing, Vancouver, 6-8 October 2005, 347-
354. https://doi.org/10.3115/1220575.1220619
29. Wang, S., Zhiyuan, C. and Liu, B. (2016) Mining Aspect-Specific
Opinion Using a Holistic Lifelong Topic Model. Proceedings of the
Haya Smaya
Mechanical Engineering Faculty, Institute of Technology, MATE Hungarian
University of Agriculture and Life Science, Gödöllő, Hungary
ABSTRACT
Big data has become one of the most addressed topics recently, as every
aspect of modern technological life continues to generate more and more
data. This study is dedicated to defining big data, how to analyze it, the
challenges involved, and how to distinguish between data and big data
analyses. Therefore, a comprehensive literature review has been carried out
to define and characterize big data and its analysis processes. Several
keywords (big-data, big-data analyzing, data analyzing) were used in the
scientific search engines Scopus, ScienceDirect, and Web of Science to
acquire up-to-date data from recent publications on the topic. This study
shows the viability of big data analysis and how it functions in a
fast-changing world. In addition, it focuses on the aspects that describe
and anticipate
Citation: Smaya, H. (2022), “The Influence of Big Data Analytics in the Industry”.
Open Access Library Journal, 9, 1-12. doi: 10.4236/oalib.1108383.
Copyright: © 2022 by authors and Scientific Research Publishing Inc. This work is li-
censed under the Creative Commons Attribution International License (CC BY). http://
creativecommons.org/licenses/by/4.0
INTRODUCTION
The research background is dedicated to defining big data, how to analyze
it, the challenges involved, and how to distinguish between data and big
data analyses. Therefore, a comprehensive literature review has been carried
out to define and characterize big data and its analysis processes. Several
keywords (big-data, big-data analyzing, data analyzing) were used in the
scientific search engines Scopus, ScienceDirect, and Web of Science to
acquire up-to-date data from recent publications on the topic.
The problem this paper addresses is to show the viability of big data
analysis and how it functions in a fast-changing world. In addition, it
focuses on the aspects that describe and anticipate big data analysis
behaviour. Big data is omnipresent, and there is an almost urgent need to
collect and protect whatever data is generated. In recent years, big data
has exploded in popularity, capturing the attention and investigations of
researchers all over the world. Because data is such a valuable tool, making
proper use of it may help people improve their projections, investigations,
and decisions [1]. The growth of science has driven everyone to mine and
consume large amounts of data for company, consumer, bank account, medical,
and other studies, which has resulted in privacy breaches or intrusions in
many cases [2]. The promise of data-driven decision-making is now widely
recognized, and there is growing enthusiasm for the concept of "Big Data,"
as seen by the White House's recent announcement of new financing programs
across many agencies. While Big Data's potential is real―Google, for
example, is thought to have given 54 billion dollars to the US economy in
2009―there is no broad unanimity on this [3].
It is difficult to recall a topic that received so much hype as broadly and
quickly as big data. While barely known a few years ago, big data is one of
the most discussed topics in business today across industry sectors [4].
This study is dedicated to defining the Big-data concept, assessing its
viability, and investigating the different methods of analyzing and studying
it.
Characteristics of Big-Data
Firstly, it is essential to differentiate between big data and structured
data (which is usually stored in relational database systems) based on five
parameters (Figure 2):
1-Volume 2-Variety 3-Velocity 4-Value 5-Veracity
These parameters are usually referred to as the 5V's, and they represent the
main challenges of big data management:
BIG-DATA ANALYSIS
Analyzing Process
Analyzing Steps
Data analysts, data scientists, predictive modellers, statisticians, and other
analytics experts collect, process, clean, and analyze increasing volumes of
structured transaction data, as well as other types of data not typically used
by traditional BI and analytics tools. The four steps of the data preparation
process are summarized below (Figure 4) [7]:
1) Data specialists gather information from a range of sources. It’s
usually a mix of semi-structured and unstructured information.
While each company will use different data streams, the following
are some of the most frequent sources:
• clickstream data from the internet
Analyzing Tools
To support big data analytics procedures, a variety of tools and technologies
are used [14]. The following are some of the most common technologies and
techniques used to facilitate big data analytics processes:
• Some firms are having difficulty filling the gaps due to a probable
lack of internal analytics expertise and the high cost of acquiring
professional data scientists and engineers.
CONCLUSIONS
Gradually, the business sector is relying more on data science for its
development. A tremendous amount of data is used to describe the behaviour
of complex systems, anticipate the output of processes, and evaluate this
output. Based on what we discussed in this essay, it can be stated that big
data analytics is the cutting-edge methodology in data science alongside
every other technological aspect, and studying this field comprehensively is
essential for further development.
Several methods and software tools are commercially available for analyzing
big data sets. Each of them can relate to technology, business, or social
media. Further studies using analysis software could enhance the depth of
the knowledge reported and validate the results.
REFERENCES
1. Siegfried, P. (2017) Strategische Unternehmensplanung in jungen
KMU—Probleme and Lösungsansätze. de Gruyter/Oldenbourg Verlag,
Berlin.
2. Siegfried, P. (2014) Knowledge Transfer in Service Research—Service
Engineering in Startup Companies. EUL-Verlag, Siegburg.
3. Divesh, S. (2017) Proceedings of the VLDB Endowment. Proceedings
of the VLDB Endowment, 10, 2032-2033.
4. Su, X. (2012) Introduction to Big Data. In: Opphavsrett: Forfatter
og Stiftelsen TISIP, Institutt for informatikk og e-læring ved NTNU,
Zürich, Vol. 10, Issue 12, 2269-2274.
5. Siegfried, P. (2015) Die Unternehmenserfolgsfaktoren und deren kausale
Zusammenhänge. In: Zeitschrift Ideen-und Innovationsmanagement,
Deutsches Institut für Betriebs-wirtschaft GmbH/Erich Schmidt Verlag,
Berlin, 131-137. https://doi.org/10.37307/j.2198-3151.2015.04.04
6. Gandomi, A. and Haider, M. (2015) Beyond the Hype: Big Data
Concepts, Methods, and Analytics. International Journal of
Information Management, 35, 137-144. https://doi.org/10.1016/j.
ijinfomgt.2014.10.007
7. Lembo, D. (2015) An Introduction to Big Data. In: Application of Big
Data for National Security, Elsevier, Amsterdam, 3-13. https://doi.
org/10.1016/B978-0-12-801967-2.00001-X
8. Siegfried, P. (2014) Analysis of the Service Research Studies in the
German Research Field, Performance Measurement and Management.
Publishing House of Wroclaw University of Economics, Wroclaw,
Band 345, 94-104.
9. Cheng, O. and Lau, R. (2015) Big Data Stream Analytics for Near Real-
Time Sentiment Analysis. Journal of Computer and Communications,
3, 189-195. https://doi.org/10.4236/jcc.2015.35024
10. Abu-salih, B. and Wongthongtham, P. (2014) Chapter 2. Introduction
to Big Data Technology. 1-46.
11. Sharma, S. and Mangat, V. (2015) Technology and Trends to Handle
Big Data: Survey. International Conference on Advanced Computing
and Communication Technologies, Haryana, 21-22 February 2015,
266-271. https://doi.org/10.1109/ACCT.2015.121
12. Davenport, T.H. and Dyché, J. (2013) Big Data in Big Companies.
Baylor Business Review, 32, 20-21.
13. Riahi, Y. and Riahi, S. (2018) Big Data and Big Data Analytics:
Concepts, Types and Technologies. International Journal of Research
and Engineering, 5, 524-528. https://doi.org/10.21276/ijre.2018.5.9.5
14. Verma, J.P. and Agrawal, S. (2016) Big Data Analytics: Challenges
and Applications for Text, Audio, Video, and Social Media Data.
International Journal on Soft Computing, Artificial Intelligence and
Applications, 5, 41-51. https://doi.org/10.5121/ijscai.2016.5105
15. Begoli, E. and Horey, J. (2012) Design Principles for Effective
Knowledge Discovery from Big Data. Proceedings of the 2012 Joint
Working Conference on Software Architecture and 6th European
Conference on Software Architecture, WICSA/ECSA, Helsinki, 20-24
August 2012, 215-218. https://doi.org/10.1109/WICSA-ECSA.212.32
16. Najafabadi, M.M., Villanustre, F., Khoshgoftaar, T.M., Seliya, N.,
Wald, R. and Muharemagic, E. (2015) Deep Learning Applications and
Challenges in Big Data Analytics. Journal of Big Data, 2, 1-21. https://
doi.org/10.1186/s40537-014-0007-7
17. Bätz, K. and Siegfried, P. (2021) Complexity of Culture and
Entrepreneurial Practice. International Entrepreneurship Review, 7,
61-70. https://doi.org/10.15678/IER.2021.0703.05
18. Bockhaus-Odenthal, E. and Siegfried, P. (2021) Agilität über
Unternehmensgrenzen hinaus—Agility across Boundaries, Bulletin of
Taras Shevchenko National University of Kyiv. Economics, 3, 14-24.
https://doi.org/10.17721/1728-2667.2021/216-3/2
19. Kaisler, S.H., Armour, F.J. and Espinosa, A.J. (2017) Introduction to Big
Data and Analytics: Concepts, Techniques, Methods, and Applications
Mini Track. Proceedings of the Annual Hawaii International Conference
on System Sciences, Hawaii, 4-7 January 2017, 990-992. https://doi.
org/10.24251/HICSS.2017.117
Chapter 8
Big Data Usage in the Marketing Information System
ABSTRACT
The increase in data generation, storage capacity, processing power, and
analytical capacity has created a technological phenomenon named big data
that could have a large impact on research and development. In the marketing
field, the use of big data in research can represent a deep dive into
consumer understanding. This essay discusses the uses of big data in the
marketing information system and its contribution to decision-making. It
presents a review of the main concepts, the new possibilities of use, and a
reflection on its limitations.
Citation: Salvador, A. and Ikeda, A. (2014), “Big Data Usage in the Marketing Infor-
mation System”. Journal of Data Analysis and Information Processing, 2, 77-85. doi:
10.4236/jdaip.2014.23010.
Copyright: © 2014 by authors and Scientific Research Publishing Inc. This work is li-
censed under the Creative Commons Attribution International License (CC BY). http://
creativecommons.org/licenses/by/4.0
INTRODUCTION
A solid information system is essential for obtaining relevant data for the
decision-making process in marketing. The more correct and relevant the
information, the greater the probability of success. The 1990s were known as
the decade of the network society and of transactional data analysis [1].
However, in addition to this critical data, there is a great volume of less
structured information that can be analyzed in order to find useful
information [2]. The growth in data generation, storage capacity, processing
power, and data analysis produced a technological phenomenon called big
data. This phenomenon would cause great impacts on studies and lead to the
development of solutions in different areas. In marketing, big data research
can represent the possibility of a deep understanding of consumer behavior,
through profile monitoring (geo-demographic, attitudinal, behavioral), the
statement of areas of interest and preferences, and monitoring of purchase
behavior [3] [4]. The triangulation of the available data in real time with
information previously stored and analyzed would enable the generation of
insights that would not be possible through other techniques [5].
However, in order for big data information to be used correctly by
companies, some measures are necessary, such as investment in people's
qualification and in equipment. More than that, the increase in information
access may generate ethics-related problems, such as invasion of privacy and
redlining. It may affect research as well, as in cases where information
could be used without the consent of the surveyed.
Predictive analytics refers to models that seek to predict consumer behavior
through data generated by purchase and/or consumption activities; with the
advent of big data, predictive analytics grows in importance for
understanding this behavior from the data generated in online interactions
among people. The use of predictive systems can also be controversial, as
exemplified by the case of the American chain Target, which identified the
purchase behavior of women at the early stage of pregnancy and sent a
congratulation letter to a teenage girl who had not yet informed her parents
about the pregnancy. The case generated considerable negative repercussions
and the chain suspended the action [4].
The objective of this essay is to discuss the use of big data in the context
of marketing information systems, present new possibilities resulting from
its use, and reflect on its limitations. For that, the point of view of researchers
and experts will be explored based on academic publications, which will
BIG DATA
The term big data applies to information that could not be processed using
traditional tools or processes. According to an IBM [11] report, the three
characteristics that would define big data are volume, speed and variety, as
together they would have created the need for new skills and knowledge in
order to improve the ability to handle the information (Figure 1).
The Internet and the use of social media have transferred the
power of creating content to users, greatly increasing the generation of
information on the Internet. However, this represents a small part of the
generated information. Automated sensors, such as RFID (radio-frequency
identification), multiplied the volume of collected data, and the volume
of stored data in the world is expected to jump from 800,000 petabytes
(PB) in 2000 to 35 zettabytes (ZB) in 2020. According to IBM, Twitter
would generate by itself over 7 terabytes (TB) of data a day, while some
companies would generate terabytes of data in an hour, due to its sensors
and controls. With the growth of sensors and technologies that encourage
social collaboration through portable devices, such as smartphones, the data
became more complex, due to its volume and different origins and formats,
such as files originating from automatic control, pictures, books, reviews in
communities, purchase data, electronic messages and browsing data. The
traditional idea of data speed would consider its retrieval, however, due to
the great number of sensors capturing information in real time, the concern
with the capture and information analysis speed emerges, leading, therefore,
to the concept of flow.
Figure 1. Three big data dimensions. Source: Adapted from Zikopoulos and
Eaton, 2012.
Input-Sub-Systems
Internal Reports
Internal reports became more complete and complex, involving information and
metrics generated by the company's digital properties (including websites
and fan pages), which also increase the amount of information on consumers,
reaching beyond customer profile data. With the increase of information from
different origins and in different formats, a richer internal database
becomes a research source for business, market, client, and consumer
insights, in addition to internal analysis.
Marketing Intelligence
If on one hand the volume of information originating from marketing
intelligence increases, on the other hand it is concentrated in an area with
more structured search and monitoring tools, with easier storage and
integration. Reading newspapers, magazines, and sector reports gains a new
dimension with access to global information in real time, shifting the
challenge from accessing information to selecting valuable information and
increasing, therefore, the value of digital clipping services. The
monitoring of competitors gains a new dimension, since brand changes,
whether local or global, can be easily followed up. Brand monitoring
services increase, with products such as GNPD by Mintel [13] and the Buzzz
Monitor by E.Life [14], or SCUP and Bluefin.
Marketing Research
With the growth of the Internet and the increase in virtual communities,
studying online behavior became, at the same time, an opportunity and a
necessity. Netnography draws on ethnography when proposing to study group
behavior through observation of that behavior in its natural environment. In
this regard, ethnography (and netnography) has the characteristic of
minimizing the setbacks of behavior change by not removing the object of
study from its habitat, as many other study approaches do. However, academic
publications have not reached an agreement on the technique's application
and depth of analysis [15] -[17]. Kozinets (2002, 2006) [16] [17] proposes a
deep study, in which the researcher needs to acquire great knowledge of the
object group and monitor it for long periods, while Gerbera (2008) [15] is
not clear about such a need for deep knowledge of the technique, enabling an
understanding of what could be similar to a content analysis based on
digital data. For the former, just as in ethnography, the ethical issues
become more important, as the researcher should ask for permission to
monitor the group and make their presence known; for the latter, netnography
would not require such observer presentation for publicly collected data.
The great volume of data captured from social networks could be analyzed
using netnography.
One of the research techniques that has been gaining ground in the digital
environment is content analysis, due, on one hand, to the great amount of
data available for analysis on several subjects and, on the other hand, to
the spread of free automated analysis tools, such as Many Eyes by IBM [18],
which offers cloud resources for terms, term correlation, scores, and
charts, among others. The massive volume of information in big data provides
a great increase in sample size and, in some cases, enables research on the
whole population, with "n = all" [4].
Product
From the positioning, the available information should be used to define the
product attributes, considering the value created for the consumer.
Information on consumer preferences and manifestations in communities and
forums is an input for the development and adjustment of products, as well
as for the definition of complementary services. The consumer could also
participate in the product development process by offering ideas and
evaluations in real time.
The development of innovation could also benefit from big data, both by
surveying insights from consumers and by using the information to develop
the product, or even by improving the innovation process through the use of
information, benefiting from the history of successful products, analyses of
process stages, or queries to an idea archive [23]. As an improvement to the
innovation process, studies through big data would
Distribution
In addition to browsing location in the digital environment and the
monitoring of visitor indicators such as exit rate, bounce rate, and time
per page, geolocation tools enable the monitoring of consumers' physical
location and how they commute. More than that, the market and consumer
information from big data makes it possible to assess, in a more holistic
manner, the variables that affect decisions on distribution and location [25].
Communication
Big data analysis enables the emergence of new forms of communication
research through the observation of how the audience interacts with social
networks. From this behavior analysis, new insights into their preferences
and idols [3] may emerge to define concepts and adjust details of campaign
execution. Moreover, online interaction during offline brand actions enables
the creation and follow-up of indicators, whether quantitative or
qualitative, to monitor the communication [3] [26].
The increase in information storage, processing, and availability enables
the application of the CRM concept to B2C clients, involving the activities
of gathering, processing, and analyzing information on clients, providing
insights into how and why clients shop, optimizing company processes,
facilitating client-company interaction, and offering access to the client's
information to any company.
Price
Even offline businesses will be strongly affected by the use of online price
information. Research by the Google Shopper Marketing Council [27],
LIMITATIONS
Due to the lack of a culture that cultivates the proper use of information and
to a history of high costs for storage space, a lot of historical information
was lost or simply not collected at all. A McKinsey study with retail
companies observed that the chains were not using all the potential of
the predictive systems due to the lack of: 1) historical information; 2)
information integration; and 3) minimum standardization between the
internal and external information of the chain [28] -[30] . The greater the
historical information, the greater the accuracy of the algorithm, provided
that the environment in which the system is implemented remains stable.
Biesdorf, Court and Willmott (2013) [12] highlight the challenge of
integrating information from different functional systems, legacy systems,
and information generated outside the company, including information from
the macro environment and social networks.
Not having qualified people to guide studies and handle systems and
interfaces is also a limiting factor for research [23], at least in the
short term. According to Gobble (2013) [23], a McKinsey report identifies
the need for 190,000 qualified people to work in data analysis-related posts
today. The qualification of the front line should follow the development of
user-friendly interfaces [12]. In addition to the people directly connected
to the analytics, Don Schultz (2012) [31] also highlights the need for
people with "real life" experience, able to interpret the information
generated by the algorithms: "If the basic understanding of the customer
isn't there, built into the analytical models, it really doesn't matter how
many iterations the data went through or how quickly. The output is
worthless" (SCHULTZ, 2012, p. 9).
The differentiated management of clients through CRM already faces a series
of criticisms and limitations. Regarding the application of CRM to service
marketing, its limitations lie in the fact that a reference based only on
history may not reflect the client's real potential; that the unequal
treatment of clients could generate conflicts and dissatisfaction among
clients not listed as priorities; and in ethical issues involving privacy
(improper information sharing) and differential treatment (such as
redlining). These issues also apply, in a bigger dimension, to discussions
about the use of information from big data in marketing research and its
application to clients and consumers.
Predictive models are based on the assumption that the environment where the
analysis system is implemented remains stable, which, by itself, is a
limitation to the use of information. In addition to this, and to the need
to invest in a structure or spend on outsourcing, the main limitations in
the use of big data are connected to three main factors: data shortage and
inconsistency, qualified people, and proper use of the information. The full
automation of decisions based on predictive models [5] also represents a
risk, since no matter how good a model is, it is still a binary way of
understanding a limited theoretical situation. At least for now, the
analytical models would be responsible for performing the analyses and
recommendations, but the decisions would still be the responsibility of
humans.
Nunan and Di Domenico (2013) [5] have also emphasized that people's behavior
and relationships in social networks may not accurately reflect their
offline behavior, and that the first important thing to do would be to
increase the level of understanding of the relation between online and
offline social behavior. However, if on one hand people control the content
of the information they intentionally release in social networks, on the
other hand a great amount of information is collected invisibly, compounding
their digital trail. The use of information without the awareness and
permission of the studied person involves ethics in research [15] -[17].
Figure 2 shows a suggested continuum between the information that clients
would make available wittingly and the information made available
unwittingly to predictive systems. The consideration of the ethics issues
raised by Kozinets (2006) [17] and Nunan and Di Domenico (2013) [15]
reinforces the importance of increasing clients' level of awareness
regarding the use of their information, or of ensuring the non-customization
of the analysis of information obtained unwittingly by the companies.
FINAL CONSIDERATIONS
This study discussed the use of big data in the context of the marketing
information system. What became clear is that we are still at the beginning
of a journey of understanding its possibilities and use, and we can observe the
great attention generated by the subject and the increasing ethical concern.
As proposed by Nunan and Di Domenico (2013) [5], self-governance via
ESOMAR (European Society for Opinion and Market Research) [32] is an
alternative to fight abuses and excesses and enable the good use of the
information. Nunan and Di Domenico (2013) [5] propose to include in the
current ESOMAR [32] rules the right to be forgotten (the possibility to request
deletion of history), the right to have data expire (complementing
the right to be forgotten, transaction data could also expire), and the
ownership of a social graph (an individual should be aware of the information
collected about them).
Figure 2. Suggested continuum from non-confidential information (greater awareness and consent in providing information) to confidential information (greater unawareness and non-consent in providing information).
REFERENCES
1. Chow-White, P.A. and Green, S.E. (2013) Data Mining Difference
in the Age of Big Data: Communication and the Social Shaping of
Genome Technologies from 1998 to 2007. International Journal of
Communication, 7, 556-583.
2. ORACLE: Big Data for Enterprise. http://www.oracle.com/br/
technologies/big-data/index.html
3. Paul, J. (2012) Big Data Takes Centre Ice. Marketing, 30 November
2012.
4. Vitorino, J. (2013) Social Big Data. São Paulo, 1-5. www.elife.com.br
5. Nunan, D. and Di Domenico, M. (2013) Market Research and the Ethics
of Big Data. International Journal of Market Research, 55, 2-13.
6. Cox, D. and Good, R. (1967) How to Build a Marketing Information
System. Harvard Business Review, May-June, 145-154. ftp://donnees.
admnt.usherbrooke.ca/Mar851/Lectures/IV
7. Berenson, C. (1969) Marketing Information Systems. Journal of
Marketing, 33, 16. http://dx.doi.org/10.2307/1248668
8. Chiusoli, C.L. and Ikeda, A. (2010) Sistema de Informação de
Marketing (SIM): Ferramenta de apoio com aplicações à gestão
empresarial. Atlas, São Paulo.
9. Kotler, P. (1998) Administração de marketing. 5th Edition, Atlas, São
Paulo.
10. Kotler, P. and Keller, K. (2012) Administração de marketing. 14th
Edition, Pearson Education, São Paulo.
11. Zikopoulos, P. and Eaton, C. (2012) Understanding Big Data: Analytics
for Enterprise Class Hadoop and Streaming Data. McGraw Hill, New
York, 166.
12. Biesdorf, S., Court, D. and Willmott, P. (2013) Big Data: What’s Your
Plan? McKinsey Quarterly, 40-41.
13. MINTEL. www.mintel.com
14. E. Life. www.elife.com.br
29. Bughin, J., Livingston, J. and Marwaha, S. (2011) Seizing the Potential
of “Big Data.” McKinsey …, (October). http://whispersandshouts.
typepad.com/files/using-big-data-to-drive-strategy-and-innovation.pdf
30. Manyika, J., Chui, M., Brown, B. and Bughin, J. (2011) Big Data:
The Next Frontier for Innovation, Competition, and Productivity. 146.
www.mckinsey.com/mgi
31. Schultz, D. (2012) Can Big Data Do It All? Marketing News,
November, 9.
32. ESOMAR. http://www.esomar.org/utilities/news-multimedia/video.
php?idvideo=57
33. CONAR. Conselho Nacional de Auto-regulamentação Publicitária.
http://www.conar.org.br/
34. Pindyck, R.S. and Rubinfeld, D.L. (1994) Microeconomia. Makron
Books, São Paulo.
35. Kotler, P. and Levy, S.J. (1969) Broadening the Concept of Marketing.
Journal of Marketing, 33, 10-15. http://dx.doi.org/10.2307/1248740
Chapter 9
ABSTRACT
Big data challenges current information technologies (the IT landscape) while
promising more competitive and efficient contributions to business
organizations. What big data can contribute is what organizations have
wanted for a long time. This paper presents the nature of big data
and how organizations can advance their systems with big data technologies.
By improving the efficiency and effectiveness of organizations, people can
take advantage of a more convenient life contributed by
Information Technology.
Citation: Khine, P. and Shun, W. (2017), “Big Data for Organizations: A Review”.
Journal of Computer and Communications, 5, 40-48. doi: 10.4236/jcc.2017.53005.
Copyright: © 2017 by authors and Scientific Research Publishing Inc. This work is li-
censed under the Creative Commons Attribution International License (CC BY). http://
creativecommons.org/licenses/by/4.0
INTRODUCTION
Business organizations have been using big data to improve their competitive
advantages. According to McKinsey [1], organizations that can fully
apply big data gain competitive advantages over their competitors. Facebook
users upload hundreds of terabytes of data each day, and these social media
data are used for developing more advanced analyses whose aim is to extract
more value from user data. Search engines like Google and Yahoo are
already monetizing by associating appropriate ads with user queries
(i.e., Google uses big data to give the right ads to the right user in a split
second). In applying information systems to improve their organizational
systems, most government organizations lag behind business
organizations [2]. Meanwhile, some governments have already taken initiatives
to gain the advantages of big data; for example, the Obama administration
announced an investment of more than $200 million for Big Data R&D in
scientific foundations in 2012 [3]. Today, people are living in the data age,
where data have become like oxygen, and organizations are producing more
data than they can handle, leading to the big data era.
This paper is organized as follows: Section II describes
big data definitions, big data differences and sources within data, big
data characteristics, and the databases and ELT process of big data. Section
III is mainly concerned with the relationship between big data information
systems and organizations, how a big data system should be implemented, and
big data core techniques for organizations. Section IV is the conclusion of
the paper.
Data hierarchy:
Data: any piece of raw information that is unprocessed, e.g. name, quality, sound, image, etc.
Information: data processed into a useful form that becomes information, e.g. employee information (data about an employee)
types of data, and Velocity for the different data rates required by different kinds
of systems [6].
Volume: When the scale of the data surpasses traditional stores or
techniques, this volume of data can generally be labeled as big data
volume. Depending on the type of organization, the amount of data can
vary from one place to another, from gigabytes to terabytes, petabytes, and beyond [1].
Volume is the original characteristic behind the emergence of big data.
Variety: Includes structured data defined with a specific type and structure
(e.g. string, numeric, and other data types found in most RDBMS
databases), semi-structured data which has no specific type but has
some defined structure (e.g. XML tags, location data), unstructured data
with no structure (e.g. audio, voice, etc.) whose structure has yet to
be discovered [7], and multi-structured data which combines
structured, semi-structured and unstructured features [7] [8]. Variety comes
from the complexity of data from the different information systems of the target
organization.
Velocity: Velocity means the rate of data required by the application
systems of the target organization domain. The velocity of big data
can be considered, in increasing order, as batch, near real-time, real-time and
stream [7]. The bigger the data volume, the more challenges velocity will
likely face. Velocity is one of the most difficult big data characteristics to
handle [8].
As more and more organizations try to use big data, additional V
characteristics have appeared one after another, such as value, veracity
and validity. Value means that the data retrieved from big data must support the
objectives of the target organization and should create a surplus value for the
organization [7]. Veracity should address confidentiality of the available data,
providing the required data integrity and security. Validity means that the data
must come from a valid source and be clean, because these big data will be
analyzed and the results applied in the business operations of the target
organization.
Another V of data is “viability”, or the volatility of data. Viability means
the time data need to survive, i.e., the data lifetime regardless of
the systems. Based on viability, data in organizations can be classified as
data with an unlimited lifetime and data with a limited lifetime. These data also
need to be retrieved and used at a point in time. Viability is also the reason
the volume challenge occurs in organizations.
“all contents as data” when implementing big data projects. In the digital era,
data have the power to change the world and need careful implementation.
CONCLUSION
Big data is a very wide and multi-disciplinary field which requires
collaboration among different research areas and organizations from various
sources. Big data may change the traditional ETL process into an Extract-
Load-Transform (ELT) process, as big data gives more advantages in moving
algorithms near where the data exist. Like other information systems, the
ACKNOWLEDGEMENTS
I want to express my gratitude to my supervisor, Professor Wang Zhao Shun,
for his encouragement and suggestions for improving this paper.
REFERENCES
1. Manyika, J., et al. (2011) Big Data: The Next Frontier for Innovation,
Competition, and Productivity. San Francisco, McKinsey Global
Institute, CA, USA.
2. Laudon, K.C. and Laudon, J.P. (2012) Management Information
Systems: Managing the Digital Firm. 13th Edition, Pearson Education,
US.
3. House, W. (2012) Fact Sheet: Big Data across the Federal Government.
4. Mousanif, H., Sabah, H., Douiji, Y. and Sayad, Y.O. (2014) From
Big Data to Big Projects: A Step-by-Step Roadmap. International
Conference on Future Internet of Things and Cloud, 373-378
5. Oracle Enterprise Architecture White Paper (March 2016) An Enterprise
Architect’s Guide to Big Data: Reference Architecture Overview.
6. Laney, D. (2001) 3D Data Management: Controlling Data Volume,
Velocity and Variety, Gartner Report.
7. Sagiroglu, S. and Sinanc, D. (2013) Big Data: A Review. International
Conference on Collaboration Technologies and Systems (CTS), 42-47.
8. de Roos, D., Zikopoulos, P.C., Melnyk, R.B., Brown, B. and Coss, R.
(2012) Hadoop for Dummies. John Wiley & Sons, Inc., Hoboken, New
Jersey, US.
9. Grolinger, K., Hayes, M., Higashino, W.A., L’Heureux, A., Allison,
D.S. and Capretz, M.A.M. (2014) Challenges of MapReduce in Big
Data. IEEE 10th World Congress on Services, 182-189.
10. Hurwitz, J.S., Nugent, A., Halper, F. and Kaufman, M. (2012) Big Data
for Dummies, 1st Edition, John Wiley & Sons, Inc, Hoboken, New
Jersey, US.
11. Han, J., Kamber, M. and Pei, J. (2006) Data Mining: Concepts and
Techniques. 3rd Edition, Elsevier (Singapore).
12. Data Lake. https://en.m.wikipedia.org/wiki/Data_lake
13. Hu, H., Wen, Y.G., Chua, T.-S. and Li, X.L. (2014) Toward Scalable
Systems for Big Data Analytics: A Technology Tutorial. IEEE Access,
2, 652-687. https://doi.org/10.1109/ACCESS.2014.2332453
14. Dean, J. and Ghemawat, S. (2008) MapReduce: Simplified Data
Processing on Large Clusters. Commun ACM, 107-113. https://doi.
org/10.1145/1327452.1327492
Guanfang Qiao
WUYIGE Certified Public Accountants LLP, Wuhan, China
ABSTRACT
The era of big data has brought great changes to various industries, and
the innovative application effect of big data-related technologies also shows
obvious advantages. The introduction and application of big data technology
in the audit field has also become a future development trend. Compared with
the traditional mode of audit work, the application of big data technology
can help to achieve better results, which requires promoting the adaptive
transformation and adjustment of audit work. This paper makes a brief
analysis of the application of big data technology in the audit field: it first
introduces the characteristics of big data and its technical application,
then points out the new requirements for audit work in the era of big data,
and finally discusses how to apply big data technology in the audit field,
hoping that it can serve as a reference.
INTRODUCTION
With the rapid development of information technology in today’s world, the
amount of data information is getting larger and larger, which presents the
characteristics of big data. Big data refers to a collection of data that cannot
be captured, managed and processed by conventional software tools within
a certain time. It is a massive, high-growth, diversified information asset
that requires new processing models to have greater decision-making power,
insight and process optimization capabilities. Against the background of the big
data era, all walks of life should actively adapt in order to
bring about positive change. With the development of the new social economy,
audit work faces higher requirements. The traditional audit methods
and concepts have struggled to adapt, and problems and defects easily
arise. Positive changes should therefore be made,
and the proper, scientific integration of big data technology is an effective
measure that deserves close attention. Of course, the application of
big data technology in the audit field does face considerable difficulties;
for example, the development of audit software and the establishment of
audit analysis models need to be adjusted at multiple levels in
order to give full play to the application value of big data technology.
In the era of big data, the core issue is not obtaining
massive data, but how to conduct professional analysis and
processing of that mass of information so that it can play its due role and deliver value.
It is therefore necessary to strengthen research on big data technology,
so that all fields can optimize the analysis and processing of
massive data information with its assistance and
meet their application requirements. Among the current big data technologies,
data mining, massively parallel processing databases, distributed databases,
extensible storage systems and cloud computing are commonly used. These
big data technologies can be effectively applied to the acquisition, storage,
analysis and management of massive information.
Big data has ushered in a major era of transformation, changing
our lives, work and even our thinking. More and more industries maintain
a very optimistic attitude towards the application of big data, and more and
more users are trying, or considering how, to use big data to solve their
problems and improve their business performance. With the gradual advance of
digitization, big data will become the fourth strategy that enterprises can
choose after the three traditional competitive strategies of cost leadership,
differentiation and concentration.
improvement, better explore the law of development, and then realize its
reference value in decision analysis (Gepp et al., 2018).
Based on this, the construction of large audit group has become an important
application mode. The future audit work should rely on the large audit
group to divide the organizational structure from different functions such
as leadership decision-making, data analysis and problem verification, so as
to realize the orderly promotion of the follow-up audit work. For example,
the establishment of a leading group could facilitate the implementation of
the audit plan, to achieve leadership decisions for the entire audit work. For
the analysis of massive data information, the data analysis group is required
to make full analysis of the target with the help of rich and diverse big
data technologies, so as to find clues and problems and explore rules and
relationships. However, the clues and rules discovered need to be further
analyzed by the problem verification team and verified in combination with
the actual situation, so as to complete the audit task (Castka et al., 2020).
The application of this large audit team mode can give full play to the
application value of big data technology, avoid the interference brought
by organizational factors, and become an important trend of optimization
development in the audit field in the future. Of course, in order to give full
play to the application value of the large audit team, it is often necessary
to focus on the optimization of specific audit staff to ensure that all audit
staff have a higher level of competence. Audit staff not only need to master
and apply big data-related technical means, but also need to have big data
thinking, realize the transformation under this new situation, and avoid
obstacles brought by human problems. Based on this, it is of great importance
to provide the necessary education and training for audit staff, covering
detailed explanations of big data concepts, technologies and the new
audit mode, so as to help them better adapt to the new situation.
comprehensive grasp of all audit objectives and tasks involved in the audit
industry, so as to purposefully develop the corresponding data analysis
models and special software, and make their application in subsequent audit
work more efficient and convenient. For example, the query analysis, mining
analysis and multi-dimensional analysis involved in the audit work need to
be matched with the corresponding data analysis model in order to better
improve the audit execution effect. In the development and application of
audit software, it is necessary to take into account the various functions. For
example, in addition to discovering and clarifying the defects existing in the
audit objectives, it is also necessary to reflect the risk warning function, so
as to better realize the audit function and highlight the application effect of
big data technology.
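As a purely hypothetical illustration of such a risk-warning function (the column names, threshold, and data below are invented for this sketch and are not taken from the chapter), a simple rule-based check over journal entries might look as follows:

```python
# Hypothetical sketch of a rule-based risk warning over journal entries.
# Column names and the 500,000 threshold are invented for illustration only.
import pandas as pd

entries = pd.DataFrame({
    "entry_id": [1, 2, 3, 4],
    "amount":   [1200.0, 750000.0, 98.0, 505000.0],
    "posted_on_weekend": [False, True, False, False],
})

# Flag unusually large amounts or postings made outside normal business days.
entries["risk_flag"] = (entries["amount"] > 500_000) | entries["posted_on_weekend"]
print(entries[entries["risk_flag"]])
```

In practice, such rules would only be one component of the audit analysis models discussed above, alongside query, mining, and multi-dimensional analysis.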
Cloud Audit
The application of big data technology in the auditing field is also developing
towards cloud auditing, which is one of the important manifestations of
the development of big data era. From the application of big data related
technologies, it is often closely related to cloud computing, and they are
often difficult to separate. In order to better use big data technology, it
is necessary to rely on the cloud computing mode to realize distributed
processing, cloud storage and virtualization, facilitate the
efficient use of massive data information, and solve the problems it
raises. Based on this, the application of big data technology in the audit
field should also pay attention to the construction of cloud audit platform
in the future, to better realize the optimization and implementation of
audit work. In the construction of cloud audit, it is necessary to make full
use of big data technology, intelligent technology, Internet technology
and information means to realize the orderly storage and analysis and
application of massive data information, and at the same time pay attention
to the orderly sharing of massive data information, so as to better enhance
its application value. For example, for the comprehensive analysis of the
above mentioned cross-database data information, cloud audit platform can
be used to optimize the processing. The overall analysis and processing
efficiency is higher, which can effectively meet the development trend of
the increasing difficulty of the current audit. Of course, the application of
cloud audit mode can also realize the remote storage and analysis of data
information, which obviously improves the convenience of audit work,
breaks the limitation of original audit work on location, and makes the data
CONCLUSION
In a word, the introduction and application of big data technology
has become an important development trend in the current innovative
development of audit field in China. With the introduction and application
of big data, audit work does show obvious advantages with more prominent
functions. Therefore, it is necessary to explore the integration of big data
technology in the audit field from multiple perspectives in the future, and
strive to innovate and optimize the audit concept, organizational structure,
auditors and specific technologies in order to create good conditions for
the application of big data technology. This paper mainly discusses the
transformation of big data technology to the traditional audit work mode and
its specific application. However, as the application of big data in the field
of audit is not long, the research is inevitably shallow. With the development
of global economic integration, multi-directional and multi-field data fusion
will make audit work more complex, so big data audit will be normalized
and provide better reference for decision-making.
REFERENCES
1. Alles, M., & Gray, G. L. (2016). Incorporating Big Data in Audits:
Identifying Inhibitors and a Research Agenda to Address Those
Inhibitors. International Journal of Accounting Information Systems,
22, 44-59. https://doi.org/10.1016/j.accinf.2016.07.004
2. Appelbaum, D. A., Kogan, A., & Vasarhelyi, M. A. (2018). Analytical
Procedures in External Auditing: A Comprehensive Literature Survey
and Framework for External Audit Analytics. Journal of Accounting
Literature, 40, 83-101. https://doi.org/10.1016/j.acclit.2018.01.001
3. Castka, P., Searcy, C., & Mohr, J. (2020). Technology-Enhanced
Auditing: Improving Veracity and Timeliness in Social and
Environmental Audits of Supply Chains. Journal of Cleaner Production,
258, Article ID: 120773. https://doi.org/10.1016/j.jclepro.2020.120773
4. Gepp, A., Linnenluecke, M. K., O’Neill, T. J., & Smith, T. (2018). Big
Data Techniques in Auditing Research and Practice: Current Trends
and Future Opportunities. Journal of Accounting Literature, 40, 102-
115. https://doi.org/10.1016/j.acclit.2017.05.003
5. Harris, M. K., & Williams, L. T. (2020). Audit Quality Indicators:
Perspectives from Non-Big Four Audit Firms and Small Company
Audit Committees. Advances in Accounting, 50, Article ID: 100485.
https://doi.org/10.1016/j.adiac.2020.100485
6. Ingrams, A. (2019). Public Values in the Age of Big Data: A Public
Information Perspective. Policy & Internet, 11, 128-148. https://doi.
org/10.1002/poi3.193
7. Shukla, M., & Mattar, L. (2019). Next Generation Smart Sustainable
Auditing Systems Using Big Data Analytics: Understanding the
Interaction of Critical Barriers. Computers & Industrial Engineering,
128, 1015-1026. https://doi.org/10.1016/j.cie.2018.04.055
8. Sookhak, M., Gani, A., Khan, M. K., & Buyya, R. (2017).
WITHDRAWN: Dynamic Remote Data Auditing for Securing Big
Data Storage in Cloud Computing. Information Sciences, 380, 101-
116. https://doi.org/10.1016/j.ins.2015.09.004
9. Xiao, T. S., Geng, C. X., & Yuan, C. (2020). How Audit Effort Affects
Audit Quality: An Audit Process and Audit Output Perspective. China
Journal of Accounting Research, 13, 109-127. https://doi.org/10.1016/j.
cjar.2020.02.002
SECTION 3: DATA MINING METHODS
Chapter 11
ABSTRACT
Many business applications rely on their historical data to predict their
business future. The marketing products process is one of the core processes
for the business. Customer needs give a useful piece of information that
helps to market the appropriate products at the appropriate time. Moreover,
services have recently come to be considered products. The development of education
and health services depends on historical data. Furthermore, reducing
problems and crimes on online social media networks requires a significant
source of information. Data analysts need to use an efficient classification
algorithm to predict the future of such businesses. However, dealing with a
huge quantity of data requires great time to process. Data mining involves
many useful techniques that are used to predict statistical data in a variety
of business applications. The classification technique is one of the most
widely used with a variety of algorithms. In this paper, various classification
algorithms are reviewed in terms of accuracy in different areas of data mining
applications. A comprehensive analysis is made after a detailed reading of
20 papers in the literature. This paper aims to help data analysts to choose
the most suitable classification algorithm for different business applications
including business in general, online social media networks, agriculture,
health, and education. Results show FFBPN is the most accurate algorithm
in the business domain. The Random Forest algorithm is the most accurate in
classifying online social networks (OSN) activities. Naïve Bayes algorithm
is the most accurate to classify agriculture datasets. OneR is the most
accurate algorithm to classify instances within the health domain. The C4.5
Decision Tree algorithm is the most accurate to classify students’ records to
predict degree completion time.
INTRODUCTION
Decision-makers in the business sector are always concerned about their
business future. Since data collections form the core resource of information,
digitalizing business activities helps to collect business operational data in
enormous stores known as data warehouses. These historical data can be
used by data analysts to predict the future behavior of the business. However,
dealing with a huge quantity of data requires a great deal of time to process.
Data mining (DM) is a technique that uses information technology and
statistical methods to search for potentially valuable information in a large
database that can be used to support administrative decisions. The reason
behind the importance of DM is that data can be converted into useful
information and knowledge automatically and intelligently. In addition,
enterprises use data mining to understand how companies operate and to analyze
potentially valuable information. Mined information should be protected to
prevent the disclosure of company secrets.
Different data mining concepts, functionalities, materials, and mechanisms
were described by Kaur [1]. Data mining involves the use of
sophisticated data analysis tools and techniques to find hidden
patterns and relationships that are valid in large data sets. The best-known
data mining technique is association: a pattern is discovered
based on a relationship between items in the same transaction. Clustering
is a data mining technique that creates useful groups of objects that have
comparable features using a programmed strategy. The decision tree is one
of the most common data mining techniques. One of the most difficult things
when choosing to implement a data mining framework is to know
and decide which method to use and when.
However, one of the most implemented data mining techniques in a
variety of applications is the classification technique. The classification
process needs two types of data: training data and testing data. Training
data are the data used by a data mining algorithm to learn the classification
metrics to classify the other data i.e. testing data. Many business applications
rely on their historical data to predict their business future.
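As a minimal illustration of this training/testing division (the toy dataset below is a stand-in, not data from any of the cited studies), a split can be produced as follows:

```python
# Minimal sketch: splitting historical data into training and testing sets.
# The built-in iris dataset is only a placeholder for historical business data.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42   # 70% training data, 30% testing data
)
print(X_train.shape, X_test.shape)
```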
The literature presents various problems that were solved by predicting
through data mining techniques. In business, DM techniques are used to
predict the export abilities of companies [2] . In social media applications,
the missing link problem between online social network (OSN) nodes is a
frequent problem in which a link is supposed to exist between two nodes but
is missing for some reason [3]. In the agriculture sector,
analyzing soil nutrients will prove to be a large profit to the growers through
automation and data mining [4] .
Data mining technique is used to enhance the building energy
performance through determining the target multi-family housing complex
(MFHC) for green remodeling [5] . In crime, preventing offense and force
against the human female is one of the important goals. Different data
mining techniques were used to analyze the causes of offense [6] . In the
healthcare sector, various data mining tools have been applied to a range of
diseases for detecting the infection in these diseases such as breast cancer
diagnosis, skin diseases, and blood diseases [7] .
Furthermore, data analysts in the education field have used data mining
techniques to develop learning strategies at schools and universities [8].
Another goal is to detect various styles of learner behavior and forecast
performance [9]. One more goal is to forecast a student’s salary after
graduation based on the student’s previous record and behavior during the
study [10]. In general, services are considered products.
In this paper, various classification algorithms are reviewed in terms of
accuracy in different areas of data mining applications. This paper aims to
help data analysts choose the most suitable classification algorithm for
different business applications.
METHODS IN LITERATURE
The classification technique is one of the most implemented data mining
techniques in a variety of applications. The classification process needs two
types of data: training data and testing data. Training data are the data used
by a data mining algorithm to learn the classification metrics to classify the
other data i.e. testing data. Two data sets of text articles are used and classified
into training data and testing data. Three traditional classification algorithms
are compared in terms of accuracy and execution time by Besimi et al. [11]
. K-nearest neighbor classifier (K-NN), Naïve Bayes classifier (NB), and
Centroid classifier are considered. K-NN classifier is the slowest classifier
since it uses the whole training data as a reference to classify testing data.
On the other hand, the Centroid classifier uses the average vector for each
class as a model to classify new data. Hence, the Centroid classifier is much
faster than the K-NN classifier. In terms of accuracy, the Centroid classifier
has the highest accuracy rate among the others.
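The mechanics of that comparison can be sketched with scikit-learn's NearestCentroid and KNeighborsClassifier; the tiny corpus below is invented and only illustrates the workflow, not the datasets used by Besimi et al. [11]:

```python
# Sketch: Centroid vs. k-NN text classification on a toy (invented) corpus.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestCentroid, KNeighborsClassifier

train_docs = ["election vote parliament", "goal match league",
              "minister policy reform", "tournament player score"]
train_labels = ["politics", "sports", "politics", "sports"]
test_docs = ["new policy on votes", "player wins the match"]

vec = TfidfVectorizer()
X_train = vec.fit_transform(train_docs)
X_test = vec.transform(test_docs)

# Centroid classifier: one average vector per class, so prediction is fast.
centroid = NearestCentroid().fit(X_train, train_labels)
# k-NN: compares each test document against every training document.
knn = KNeighborsClassifier(n_neighbors=3).fit(X_train, train_labels)

print("Centroid:", centroid.predict(X_test))
print("k-NN    :", knn.predict(X_test))
```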
Several data mining techniques were used to predict the export abilities
of a sample of 272 companies by Silva et al. [2] . Synthetic Minority
Oversampling Technique (SMOTE) is used to oversample unbalanced
data. The K-means method is used to group the sample into three different
clusters. The generalized Regression Neural Network (GRNN) technique
is used to minimize the error between the actual input data points in the
network and the regression predicting vector in the model. Feed Forward
Back Propagation Neural Network (FFBPN) is a technique used in machine
learning to learn the pattern of specific input/output behavior for a set of
data in a structure known as Artificial Neural Networks (ANN). Support
Vector Machine (SVM) is a classification technique used to classify a set
of data according to similarities between them. A Decision Tree (DT) is a
literature based on health care data to find the existing data mining methods
and techniques described. Many data mining tools have been applied to a
range of diseases for detecting the infection in these diseases such as breast
cancer diagnosis, skin diseases, and blood diseases. Data mining execution
has high effectiveness in this domain due to express amplification in the size
of remedial data.
Moreover, Kaur and Bawa [14] present to the medical healthcare sector
a detailed view of popular data mining techniques so that researchers
can work in a more exploratory way. Knowledge discovery in databases (KDD)
analyzes large volumes of data and turns them into meaningful information.
Data mining techniques are a boon because they help in the early
diagnosis of medical diseases with high accuracy, which saves time
and money in any effort involving computers, robots, and parallel processing.
Among all medical diseases, cardiovascular disease is the most critical. Data
mining has proved efficacious where accuracy is a major concern, and data
mining techniques have been used successfully in the treatment of
various other serious diseases which threaten lives.
As another attempt, a comparative analysis is conducted by Parsania et
al. [15] to find the best data mining classification techniques based on
healthcare data in terms of accuracy, sensitivity, precision, false-positive
rate, and f-measure. Naïve Bayes, Bayesian Network, J RIPPER (JRip),
OneRule (OneR), and PART techniques are selected to be applied over a
dataset from a health database. Results show that the PART technique is
the best in terms of precision, false-positive rate, and f-measure metrics. In
terms of accuracy, the OneR technique is the best while Bayesian Network
is the best technique in terms of sensitivity.
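The metrics used in that comparison (accuracy, sensitivity, precision, false-positive rate, and F-measure) can all be derived from a confusion matrix; the sketch below uses invented label vectors merely to show the calculations:

```python
# Sketch: computing the evaluation metrics used to compare classifiers.
# The true/predicted label vectors are invented, not from the cited study.
from sklearn.metrics import (confusion_matrix, accuracy_score,
                             precision_score, recall_score, f1_score)

y_true = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]   # 1 = condition present, 0 = absent
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("accuracy        :", accuracy_score(y_true, y_pred))
print("sensitivity     :", recall_score(y_true, y_pred))     # TP / (TP + FN)
print("precision       :", precision_score(y_true, y_pred))  # TP / (TP + FP)
print("false-pos. rate :", fp / (fp + tn))                    # FP / (FP + TN)
print("F-measure       :", f1_score(y_true, y_pred))
```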
Data mining techniques are used widely in several fields. Data analysts
in the education field used data mining techniques to develop learning
strategies at schools and universities since it serves a big chunk of society.
A corporative learning model to group learners into active learning groups
via the web was introduced by Amornsinlaphachai [8] . Artificial Neural
Network (ANN), K-Nearest Neighbor (KNN), Naive Bayes (NB), Bayesian
Belief Network (BN), RIPPER (called JRIP), ID3, and C4.5 (called J48)
classification data mining algorithms are used to predict the performance
of 474 students who study computer programming subject at Nakhon
Ratchasima Rajabhat University in Thailand. A comparison between those
algorithms is made to select the most efficient algorithm among them.
In the meantime, the data mining techniques used in this model are
K-Nearest Neighbors (K-NN), Naive Bayes (NB), Decision trees J48,
Multilayer Perceptron (MLP), and Support Vector Machines (SVM). To
determine the preferable technique for predicting future salary, a test was
conducted by entering data of students graduating from the same university
during the years 2006 to 2015. A WEKA (Waikato Environment for
Knowledge Analysis) tool was used to compare the outputs of data mining
techniques. The results showed that after comparisons work outperformed
(KNN) technique in predicting 84.69 percent for Recall, Precision, and
F-measure. The other techniques were as follows: (J48) get a percentage of
73.96 percent, (SVM) (43.71 percent), Naive Bayes (NB) (43.63 percent),
and Multilayer perceptron (MLP) (38.8 percent). A questionnaire was then
distributed to 50 current students at the university to see if the model works
to achieve its objectives. The results of the questionnaire indicate that the
proposed model increased the motivation of the students and helped them to
focus on continuing the study.
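A comparable multi-classifier comparison can also be scripted outside WEKA; the sketch below loops several scikit-learn classifiers over a single placeholder dataset and reports cross-validated F-measures (the data and settings are assumptions, not the graduates' records used in the study):

```python
# Sketch: comparing several classifiers on one dataset, in the spirit of the
# WEKA comparison described above. Dataset and parameters are placeholders.
from sklearn.datasets import load_digits
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier

X, y = load_digits(return_X_y=True)        # stand-in for student records
models = {
    "k-NN": KNeighborsClassifier(),
    "J48-like tree": DecisionTreeClassifier(criterion="entropy"),
    "SVM": SVC(),
    "Naive Bayes": GaussianNB(),
    "MLP": MLPClassifier(max_iter=1000),
}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="f1_macro")
    print(f"{name:14s} mean F-measure = {scores.mean():.3f}")
```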
Sulieman and Jayakumari [17] highlighted the importance of using
data mining for 11th grade in Oman, where the school administration system
contains many units providing comprehensive student data. The goal
is to decrease the dropout rate of students and improve school performance.
Using data mining techniques helps students to choose the appropriate
mathematics course for 11th grade in Oman. It is an opportunity to develop and
provide appropriate analysis through a method that extracts student information
from end-of-term grades to improve student performance. Knowledge
derived from data mining helps decision-makers in the field of education
make sound decisions that support the development of educational
processing. Data mining techniques were applied to the mathematics subject.
The results of the various algorithms applied to the data in the study
confirm that predictions of student choice and performance can be
obtained using data mining techniques.
Academic databases can be analyzed through a data mining
approach to gain new helpful knowledge. Wati et al. [18] predict the
degree-completion time of bachelor’s degree students by using data
mining algorithms such as the C4.5 and naive Bayes classifier algorithms.
They concentrate on the performance of ranking data mining algorithms,
especially the decision-tree-based C4.5 algorithm and the naive Bayes
classifier algorithm, based on the gain ratio used to find the nodes. The
results of predicting the degree completion time of bachelor’s degree students show that the
Politics, technology, and sports news articles are used with a total of
237 news articles. Experiments show that the Centroid classifier is the most
accurate algorithm in classifying text documents since it classifies 226
news articles correctly. The Centroid classifier calculates the average vector for
each class and uses these vectors as references to classify each new test instance.
k-NN, in contrast, needs to compare the test instance against all training
instance distances each time.
In [2] , 272 companies are taken as a study sample to be classified. Five
classification algorithms are used to classify companies into three classes:
Generalized Regression Neural Network (GRNN), Feed Forward Back
Propagation Neural Network (FFBPN), Support Vector Machine (SVM),
Decision Tree (DT), and Naïve Bayes (NB). Results show that FFBPN is
the most accurate algorithm to classify instances in the business domain
with an accuracy of 85.2 percent.
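A feed-forward back-propagation network of the kind reported as most accurate can be approximated with scikit-learn's MLPClassifier, which is trained with back-propagation; the synthetic data below merely stands in for the 272-company sample:

```python
# Sketch: a feed-forward back-propagation network (FFBPN-style) classifier.
# make_classification generates synthetic data standing in for company records.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=272, n_features=10, n_classes=3,
                           n_informative=6, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

scaler = StandardScaler().fit(X_train)
ffbpn = MLPClassifier(hidden_layer_sizes=(20,), max_iter=2000, random_state=0)
ffbpn.fit(scaler.transform(X_train), y_train)

pred = ffbpn.predict(scaler.transform(X_test))
print("accuracy:", accuracy_score(y_test, pred))
```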
Two Online Social Networks (OSN) datasets are used to compare the
performance of seven classification algorithms. The first dataset (DS1) with
High density (0.05) and the other dataset (DS2) with low-density (0.03).
The two datasets were obtained using the Facebook API tool. Each dataset
contains public information about the users such as interests, friends, and
demographics data. Classification algorithms include Support Vector
Machine (SVM), k-Nearest Neighbors (k-NN), Decision Tree (DT), Neural
Networks, Naïve Bayes (NB), Logistic Regression, and Random Forest. As
results show in [3] , the Random Forest algorithm is the most accurate in
classifying OSN activities even with a high-density OSN dataset.
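A Random Forest of the kind found most accurate for OSN activity can be sketched as follows; the feature matrix is synthetic and only mimics user-profile attributes such as interests, friends, and demographics:

```python
# Sketch: Random Forest classification of synthetic social-network user features.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Synthetic stand-in for user interests, friend counts, demographics, etc.
X, y = make_classification(n_samples=500, n_features=15, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

forest = RandomForestClassifier(n_estimators=200, random_state=1)
forest.fit(X_train, y_train)
print("accuracy:", accuracy_score(y_test, forest.predict(X_test)))
```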
A dataset of 1676 soil samples has 12 attributes that need to be classified.
J48 Decision Tree (J48 DT) and Naïve Bayes (NB) classification algorithms
are used. Results in [4] show that the NB algorithm is more accurate than J48
DT to classify agriculture datasets since it classifies 98 percent of instances
correctly.
An experiment is conducted in the health domain to classify 3163
patients’ data as mentioned in [15] . Naïve Bayes (NB), Bayesian Network
(BayesNet), J Ripper (JRip), One Rule (OneR), and PART classification
algorithms are used. Results show that OneR is the most accurate algorithm
to classify instances in the health domain with an accuracy of 99.2 percent.
Random Forest, Naïve Bayes (NB), Multilayer Perceptron (MLP),
Support Vector Machine (SVM), and J48 Decision Tree (J48 DT)
classification algorithms are used. 163 instances are used as an experimental
dataset of students’ performance. Results in [16] tell that the MLP algorithm
is the most accurate algorithm to classify students’ performance datasets
since it classifies 76.1 percent of instances correctly. 13,541 students’
profiles are used as a dataset to examine five classification algorithms.
k-Nearest Neighbors (k-NN), Naïve Bayes (NB), J48 Decision Tree (J48
DT), Multilayer Perceptron (MLP), and Support Vector Machine (SVM)
were compared in terms of accuracy. As results show in [10] , the k-NN
algorithm is the most accurate algorithm with an 84.7 percent accuracy level.
297 students’ records were used as a dataset in [18] . Two classification
algorithms are applied: C4.5 Decision Tree (C4.5 DT), and Naïve Bayes
(NB). Results tell that the C4.5 DT algorithm is more accurate than NB to
classify Students’ records since it classifies 78 percent of instances correctly.
REFERENCES
1. Harkiran, K. (2017) A Study On Data Mining Techniques And
Their Areas Of Application. International Journal of Recent Trends
in Engineering and Research, 3, 93-95. https://doi.org/10.23883/
IJRTER.2017.3393.EO7O3
2. Silva, J., Borré, J.R., Castillo, A.P.P., Castro, L. and Varela, N. (2019)
Integration of Data Mining Classification Techniques and Ensemble
Learning for Predicting the Export Potential of a Company. Procedia
Computer Science, 151, 1194-1200. https://doi.org/10.1016/j.
procs.2019.04.171
3. Sirisup, C. and Songmuang, P. (2018) Exploring Efficiency of Data
Mining Techniques for Missing Link in Online Social Network. 2018
International Joint Symposium on Artificial Intelligence and Natural
Language Processing (iSAI-NLP), Pattaya, 15-17 November 2018.
https://doi.org/10.1109/iSAI-NLP.2018.8692951
4. Chiranjeevi, M.N. and Nadagoudar, R.B. (2018) Analysis of Soil
Nutrients Using Data Mining Techniques. International Journal of
Recent Trends in Engineering and Research, 4, 103-107. https://doi.
org/10.23883/IJRTER.2018.4363.PDT1C
5. Jeong, K., Hong, T., Chae, M. and Kim, J. (2019) Development of
a Decision Support Model for Determining the Target Multi-Family
Housing Complex for Green Remodeling Using Data Mining
Techniques. Energy and Buildings, 202, Article ID: 109401. https://
doi.org/10.1016/j.enbuild.2019.109401
6. Kaur, B., Ahuja, L. and Kumar, V. (2019) Crime against Women:
Analysis and Prediction Using Data Mining Techniques. International
Conference on Machine Learning, Big Data, Cloud and Parallel
Computing (COMITCon), 14-16 February 2019, Faridabad. https://
doi.org/10.1109/COMITCon.2019.8862195
7. Mia, M.R., Hossain, S.A., Chhoton, A.C. and Chakraborty, N.R. (2018)
A Comprehensive Study of Data Mining Techniques in Health-Care,
Medical, and Bioinformatics. International Conference on Computer,
Communication, Chemical, Material and Electronic Engineering
(IC4ME2), Rajshahi, 8-9 February 2018. https://doi.org/10.1109/
IC4ME2.2018.8465626
8. Amornsinlaphachai, P. (2016) Efficiency of Data Mining Models to
Predict Academic Performance and a Cooperative Learning Model.
Wenke Xiao1, Lijia Jing2, Yaxin Xu1, Shichao Zheng1, Yanxiong Gan1, and Chuanbiao Wen1
1 School of Medical Information Engineering, Chengdu University of Traditional Chinese Medicine, Chengdu 611137, China
2 School of Pharmacy, Chengdu University of Traditional Chinese Medicine, Chengdu 611137, China
ABSTRACT
The amount of medical text data is increasing dramatically. Medical text
data record the progress of medicine and imply a large amount of medical
knowledge. As natural language, they are characterized as semistructured,
high-dimensional, and high-volume, and they cannot participate
in arithmetic operations. Therefore, how to extract useful knowledge or
information from the available data is a very important task. Various
data mining techniques can extract valuable knowledge or information
from data. In the current study, we reviewed different approaches to apply
Citation: Wenke Xiao, Lijia Jing, Yaxin Xu, Shichao Zheng, Yanxiong Gan, Chuan-
biao Wen, “Different Data Mining Approaches Based Medical Text Data”, Journal of
Healthcare Engineering, vol. 2021, Article ID 1285167, 11 pages, 2021. https://doi.
org/10.1155/2021/1285167.
Copyright: © 2021 by Authors. This is an open access article distributed under the Cre-
ative Commons Attribution License, which permits unrestricted use, distribution, and
reproduction in any medium, provided the original work is properly cited.
for medical text data mining. The advantages and shortcomings of each
technique as applied to different processes of medical text data were
analyzed. We also explored the applications of algorithms for providing
insights to the users and enabling them to use the resources for the specific
challenges in medical text data. Further, the main challenges in medical text
data mining are discussed. The findings of this paper help
researchers to choose reasonable techniques for mining medical text
data and present to them the main challenges in medical text data mining.
INTRODUCTION
The era of big data is coming, with the mass of data growing at an incredible
rate. The concept of big data was put forward for the first time at the 11th EMC
World conference in 2011 and refers to large-scale datasets that cannot be
captured, managed, or processed by common software tools. With the arrival
of the big data age, the amount of medical text data is increasing dramatically.
Analyzing this immense amount of medical text data to extract valuable
knowledge or information is useful for decision support, prevention,
diagnosis, and treatment in the medical world [1]. However, analyzing this
huge amount of multidimensional or raw data is a very complicated and time-
consuming task. Data mining has capabilities for this matter.
Data mining is a methodology for discovering novel, valuable, and
useful information, knowledge, or hidden patterns from enormous datasets by
using various statistical approaches. Data mining has many advantages
in contrast to the traditional model of transforming data into knowledge
through manual analysis and interpretation. Data mining approaches are
quicker, more favorable, time-saving, and objective. Summarizing various data
mining approaches to medical text data for clinical applications is essential
for health management and medical research.
This paper is organized in four sections. Section 2 presents the concepts
of medical text data. Section 3 includes data mining approaches and its
applications in medical text data analysis. Section 4 concludes this paper
and presents the future works.
Medical big data are the application of big data in the medical field after
the data related to human health and medicine have been stored, searched,
shared, analyzed, and presented in innovative ways [2]. Medical text data
are an important part of medical big data which are described in natural
language, cannot participate in an arithmetic operation, and are characterized
by semistructured, high-dimensional, high data volume semantics [3]. They
cannot be readily used in research owing to the lack of a fixed writing format and
their highly professional language [4]. Medical text data contain clinical data, medical
record data, medical literature data, etc., and this type of data records the
progress of medicine and implies a large amount of medical knowledge.
However, utilizing human power to extract relationships between
entities from a vast amount of medical text requires time-consuming effort.
With the development of data mining technology, using it to discover
the relationships hidden in medical text has become
a hot topic. Medical text data mining is able to assist the discovery of
medical information. In the COVID-19 research field, medical text mining
can help decision-makers to control the coronavirus outbreak by gathering and
collating basic scientific data and research literature related to
the novel coronavirus, and by predicting the population susceptible to COVID-19
pneumonia, virus variability, and potential therapeutic drugs [5–8].
Data Preparation
Medical text data include electronic medical records, medical images,
medical record parameters, laboratory results, and pharmaceutical
antiquities according to the different data sources. The different data were
selected based on the data mining task and stored in the database for further
processing.
Data Processing
The quality of data will affect the efficiency and accuracy of data mining
and the effectiveness of the final pattern. The raw medical text data contain a
large amount of fuzzy, incomplete, noisy, and redundant information. Taking
medical records as an example, the traditional paper-based medical records
have many shortcomings, such as nonstandard terms, difficult to form clinical
decision-making support, scattered information distribution, and so on. After
the emergence of electronic medical records, the medical records data are
gradually standardized [13]. However, electronic medical records, still written
in natural language, remain difficult for data mining. Therefore, it is necessary
to clean and filter the data to ensure data consistency and certainty by
removing missing, incorrect, noisy, inconsistent, or low-quality data.
Missing values in medical text data are usually handled by deletion
and interpolation. Deletion is the easiest method to handle, but some
useful information is lost. Interpolation is a method that assigns reasonable
substitution values to missing values through a specific algorithm. At
present, many algorithms have emerged in the process of data processing.
Multiple imputation, regression algorithm, and K-nearest neighbors are
often used to supplement missing values in medical text data. The detailed
algorithm information is shown in Table 1. In order to further understand
the semantic relationships of medical texts, researchers have used natural
language processing (NLP) techniques to perform entity naming, relationship
extraction, and text classification operations on medical text data with good
results [19].
Table 1: The detailed algorithm information for missing values in medical text data

Algorithm: Multiple imputation [14, 15]
Description: Estimate the value to be interpolated and add different noises to form multiple groups of optional interpolation values; select the most appropriate interpolation value according to a certain selection basis.
Use: Repeat the simulation to supplement the missing value.

Algorithm: Expectation maximization [16]
Description: Compute maximum likelihood estimates or posterior distributions with incomplete data.
Use: Supplement missing values.

Algorithm: K-nearest neighbors [17, 18]
Description: Select the K closest neighbors according to a distance metric and estimate missing data with the corresponding mode or mean.
Use: Estimate missing values with samples.
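A minimal sketch of the interpolation strategies in Table 1, using scikit-learn's imputers for the mean and K-nearest-neighbors approaches (the small matrix is hypothetical; real medical records would first need to be encoded numerically):

```python
# Sketch: filling missing values, as in Table 1, with mean and K-NN imputers.
# The matrix is a hypothetical numeric encoding of medical-record fields.
import numpy as np
from sklearn.impute import KNNImputer, SimpleImputer

X = np.array([[36.6, 120.0, np.nan],
              [37.2, np.nan, 5.4],
              [36.9, 118.0, 5.1],
              [38.1, 131.0, 6.0]])

# Mean interpolation (simplest form of value substitution).
print(SimpleImputer(strategy="mean").fit_transform(X))
# K-nearest-neighbors interpolation: estimate from the closest complete rows.
print(KNNImputer(n_neighbors=2).fit_transform(X))
```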
[Figure: structured processing of medical text, showing lexical analysis of the input string into words (nouns, verbs) against a thesaurus, and syntactic parsing based on subject-verb (SBV) dependency grammar relations.]
Data Analysis
Data analysis is applying data mining methods for extracting interesting
patterns. The model establishment is essential for knowledge discovery in
data analysis. According to the characteristics of the data, modeling and
analysis are performed. After the initial test, the model is parametrically
adjusted. The advantages and disadvantages of different models are analyzed
to choose the final optimal model. Data analysis methods for medical text
data include clustering, classification, association rules, and regression,
depending on the goal. The detailed information on these methods is shown in Table 2.
Method: Clustering
Goal: Classify similar subjects in medical texts.
Algorithm: K-means [28, 29]
Advantages: 1. Simple and fast. 2. Scalability and efficiency.
Disadvantages: 1. Large amounts of data are time-consuming. 2. More restrictions on use.

Method: Association rules
Goal: Mine frequent items and corresponding association rules from massive medical text datasets.
Algorithm: Apriori [35, 36]
Advantages: Simple and easy to implement.
Disadvantages: Low efficiency and time-consuming.
Algorithm: FP-tree [37]
Advantages: 1. Reduces the number of database scans. 2. Reduces the amount of memory space.
Disadvantages: High memory overhead.
Algorithm: FP-growth [38]
Advantages: 1. Improves the data density structure. 2. Avoids repeated scanning.
Disadvantages: Harder to achieve.

Method: Logistic regression
Goal: Analyze how variables affect results.
Algorithm: Logistic regression [39]
Advantages: 1. Visual understanding and interpretation. 2. Very sensitive to outliers.
Disadvantages: 1. Easy underfitting. 2. Cannot handle a large number of multiclass features or variables.
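The clustering row of Table 2 can be illustrated with K-means over TF-IDF vectors of short medical texts; the sentences below are invented examples, not data from the cited studies:

```python
# Sketch: K-means clustering of short (invented) medical text snippets.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

texts = ["patient reports chest pain and shortness of breath",
         "fever and persistent cough for three days",
         "chest pain radiating to the left arm",
         "high fever with sore throat and cough"]

X = TfidfVectorizer().fit_transform(texts)
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)   # similar complaints fall into the same cluster
```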
responsible for receiving external information and data. The hidden layer
is responsible for processing information and constantly adjusting the
connection properties between neurons, such as weights and feedback,
while the output layer is responsible for outputting the calculated results.
ANN differs from traditional artificial intelligence and information
processing technology: it overcomes the drawbacks of traditional
artificial intelligence based on logical symbols in processing intuitive and
unstructured information, and has the characteristics of self-adaptation, self-
organization, and real-time learning. It can complete data classification,
feature mining, and other mining tasks. Medical text data contain massive
amounts of patient health records, vital signs, and other data. ANN can
analyze the conditions of patients’ rehabilitation, find the laws in patient
data, predict the patient’s condition or rehabilitation, and help to discover
medical knowledge [41].
There are several ANN mining techniques that are used for medical text
data, such as backpropagation and factorization machine-supported neural
network (FNN). The information on ANN mining techniques is shown in
Table 3.
$E = \frac{1}{2}\sum_{k=1}^{m}\sum_{t}\left(d_{t}^{k} - y_{t}^{k}\right)^{2}$    (1)
where $m$ is the total number of samples, $k$ is the sample index, $t$ is the unit
serial number, $d_{t}^{k}$ is the desired output, and $y_{t}^{k}$ is the actual output.
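A small numeric check of this error function (with made-up desired and actual outputs, and assuming the conventional 1/2 factor shown above) can be written directly in NumPy:

```python
# Sketch: evaluating the back-propagation error E of equation (1)
# for made-up desired outputs d and actual network outputs y.
import numpy as np

d = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]])   # desired outputs: m = 3 samples, 2 output units
y = np.array([[0.9, 0.2], [0.1, 0.8], [0.7, 0.3]])   # actual outputs of the network

E = 0.5 * np.sum((d - y) ** 2)    # sum over samples k and output units t
print(E)
```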
In clinics, the judgment of disease is often determined by the integration
of multidimensional data. In the establishment of disease prediction models,
BP algorithms can not only effectively classify complex data but also have
good multifunctional mapping. The relationship between data and disease
can be found in the process of repeated iteration [46].
(2) Application Examples. Adaptive learning based on ANN can find
the law of medical development from the massive medical text data
and assist the discovery of medical knowledge. Heckerling et al.
[47] combined a neural network and genetic algorithm to predict
the prognosis of patients with urinary tract infections (as shown
in Figure 2). In this study, nine indexes (e.g., frequent micturition,
dysuria) from 212 women with urinary tract infections were
used as predictor variables for training. The relationship between
symptoms and urinalysis input data and urine culture output data
was determined using ANN. The predicted results were accurate.
Naive Bayes
Naive Bayes (NB) is a classification method based on Bayes’ theorem
[50]. The conditional independence hypothesis of the NB classification
algorithm assumes that attribute values are independent of each other
and that positions are independent of each other [51]. Attribute values being
independent of each other means there is no dependence between
terms. The position independence hypothesis means that the position of a
term in the document has no effect on the calculation of probability. However,
conditional dependence does exist among terms in medical texts, and the
location of terms in documents contributes differently to classification [52].
These two independence assumptions therefore lead
to a poorer NB estimation. Nevertheless, NB has been widely used in
medical texts because it plays an effective role in classification decision-
making.
(1) Core Algorithm: NBC4D. Naive Bayes classifier for continuous
variables using a novel method (NBC4D) is a new algorithm
based on NB. It classifies continuous variables into Naive
Bayes classes, replaces traditional distribution techniques with
alternative distribution techniques, and improves classification
accuracy by selecting appropriate distribution techniques [53].
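As a minimal sketch of NB applied to text classification (using scikit-learn's standard MultinomialNB rather than NBC4D, which targets continuous variables), the snippet below classifies a few invented clinical notes; the notes, labels, and categories are all assumptions made for illustration.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy clinical notes and labels; the records and categories are invented.
notes = [
    "frequent micturition and dysuria reported",
    "chest pain with shortness of breath",
    "burning urination, suspected urinary tract infection",
    "irregular heartbeat and chest discomfort",
]
labels = ["urology", "cardiology", "urology", "cardiology"]

# Bag-of-words features; NB then treats term occurrences as conditionally
# independent given the class (the independence assumptions discussed above).
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(notes)
clf = MultinomialNB().fit(X, labels)

print(clf.predict(vectorizer.transform(["dysuria and frequent urination"])))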
(2) Application Examples. Golpour et al. [55] used the NB algorithm to process hospital medical records and assessment scales and found that the NB model with three variables had the best performance.
Decision Tree
The decision tree is a tree structure, in which each nonleaf node represents
a test on a feature attribute, each branch represents the output of the feature
attribute on a certain value domain, and each leaf node stores a category
[56]. The process of using a decision tree to make a decision is to start from
the root node, then test the corresponding characteristic attributes of the
items to be classified, select the output branch according to its value until it
reaches the leaf node, and finally take the category stored in the leaf node as
the decision result [57]. The advantages of decision tree learning algorithms
include good interpretability induction, various types of data processing
(categorical and numerical data), white-box modeling, sound robust
performance for noise, and large dataset processing. Medical text data is
complex [58]. For instance, electronic medical record data include not only
disease characteristics but also patient age, gender, and other characteristic
data. Since the construction of decision tree starts from a single node, the
training data set is divided into several subsets according to the attributes
of the decision node, so the decision tree algorithm can deal with the data
types and general attributes at the same time, which has certain advantages
for the complexity of medical text data processing [59]. The construction
of a decision tree is mainly divided into two steps: classification attribute
selection and tree pruning. The most common algorithm is C4.5 [60].
(1) Core algorithm: C4.5. Several decision tree algorithms are
proposed such as ID3 and C4.5. The famous ID3 algorithm
proposed by Quinlan in 1986 has the advantages of clear theory,
simple method, and strong learning ability. The disadvantage is
that it is only effective for small datasets and sensitive to noise.
When the training data set increases, the decision tree may
change accordingly. When selecting test attributes, the decision
tree tends to select attributes with more values. In 1993, Quinlan
proposed the C4.5 algorithm based on the ID3 algorithm [61].
Compared with ID3, C4.5 overcomes the shortages of selecting
more attributes in information attribute selection, prunes the tree
construction process, and processes incomplete data. And it uses
the gain ratio as the selection standard of each node attribute in
the decision tree [62]. In particular, its extension, called S-C4.5-SMOTE, can not only overcome the problem of data distortion but also improve overall system performance. Its
mechanism aims to effectively reduce the amount of data without
distortion by maintaining the balance of datasets and technical
smoothness.
The attribute selection in C4.5 is based on information entropy:

\mathrm{Info}(S) = -\sum_{i=1}^{n} p(x_i)\log_2 p(x_i)    (2)

\mathrm{Info}_A(S) = \sum_{j} \frac{|S_j|}{|S|}\,\mathrm{Info}(S_j)

n is the classification number. p(x_i) represents the proportion of samples in class x_i. A is used as the feature dividing data set S into subsets S_j, and |S_j|/|S| is the proportion of the number of samples in subset S_j to the total number of samples.
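A small sketch of these quantities, with invented toy labels, is given below; the entropy function follows formula (2), and the gain ratio is the C4.5 selection criterion (information gain divided by split information).

import math
from collections import Counter

def entropy(labels):
    """Info(S) = -sum_i p(x_i) * log2 p(x_i), as in formula (2)."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def gain_ratio(feature_values, labels):
    """C4.5 selection criterion: information gain divided by split information."""
    total = len(labels)
    info_s = entropy(labels)
    info_a, split_info = 0.0, 0.0
    for v in set(feature_values):
        subset = [lab for f, lab in zip(feature_values, labels) if f == v]
        p = len(subset) / total            # |S_j| / |S|
        info_a += p * entropy(subset)
        split_info -= p * math.log2(p)
    gain = info_s - info_a
    return gain / split_info if split_info > 0 else 0.0

# Toy example: does a symptom feature split the diagnosis labels well?
print(gain_ratio(["fever", "fever", "none", "none"], ["flu", "flu", "healthy", "flu"]))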
(2) Application Examples. The decision tree algorithms can construct
specific decision trees for multiattribute datasets and get feasible
results in relative time. It can be used as a good method for data
classification in medical text data mining.
Byeon [63] used the C4.5 algorithm to develop a depression prediction
model for Korean dementia caregivers based on a secondary analysis of the
2015 Korean Community Health Survey (KCHS) results. The effective prediction rate was 70%. The overall research idea is shown in
Figure 4.
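The following is a hedged sketch of training such a prediction tree with scikit-learn; note that scikit-learn grows an optimized CART-style tree rather than true C4.5, although the entropy criterion plays the same attribute-selection role. The caregiver features and labels below are invented and do not reproduce the KCHS data.

from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Invented caregiver features: [age, hours of care per week, subjective stress 0-10]
X = [[55, 20, 7], [62, 35, 9], [48, 10, 3], [70, 40, 8], [66, 15, 2], [59, 30, 6]]
y = [1, 1, 0, 1, 0, 1]   # 1 = depression risk, 0 = no risk (toy labels)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=0)

# criterion="entropy" mirrors the information-based attribute selection of ID3/C4.5.
tree = DecisionTreeClassifier(criterion="entropy", max_depth=3).fit(X_train, y_train)
print(tree.score(X_test, y_test))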
Figure 4: C4.5 algorithm application flow.
Decision trees have also been applied to the classification of adverse drug reaction (ADR) signals (Tao Zheng et al. [65]).
Association Rules
Association rules are often sought for very large datasets, whose efficient
algorithms are highly valued. They are used to discover the correlations
from large amounts of data and reflect the dependent or related knowledge
between events and other events [66]. Medical text data contains a large
number of association data, such as the association between symptoms and
diseases and the relationship between drugs and diseases. Mining medical
text data using an association rule algorithm is conducive to discovering
the potential links in medical text data and promoting the development of
medicine. Association rules are expressions of the form X ⇒ Y. There are two key measures in the transaction database: (1) Support{X ⇒ Y}: the ratio of the number of transactions containing both X and Y to all transactions; (2) Confidence{X ⇒ Y}: the ratio of the number of transactions containing both X and Y to the number of transactions containing X.
Given a transaction data set, mining association rules means generating the association rules whose support and confidence are greater than the minimum support and minimum confidence given by users, respectively.
(1) Core Algorithm: Apriori. The apriori algorithm is the earliest and
the most classic algorithm. The iterative search method is used
to find the relationship between items in the database layer by
layer. The process consists of connection (class matrix operation)
and pruning (removing unnecessary intermediate results). In
this algorithm, the concept of item set is the set of items. A set
containing K items is a set of K items. Item set frequency is the
number of transactions that contain an item set. If an item set
satisfies the minimum support, it is called a frequent item set.
The Apriori algorithm finds the largest item sets in two steps: (1) count the occurrence frequency of single-item sets and retain those whose support is not less than the minimum support, forming the one-dimensional frequent item sets; (2) loop, joining and pruning candidate item sets, until no larger frequent item set is generated.
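A minimal pure-Python sketch of these two steps is shown below; the transactions are invented symptom/disease co-occurrences, and the minimum support value is an arbitrary choice for illustration.

from itertools import combinations

def apriori(transactions, min_support=0.5):
    """Return frequent item sets whose support >= min_support."""
    n = len(transactions)
    support = lambda items: sum(items <= t for t in transactions) / n

    # Step 1: frequent 1-item sets (occurrence frequency >= minimum support).
    current = [frozenset([i]) for i in set().union(*transactions)
               if support(frozenset([i])) >= min_support]
    frequent = {s: support(s) for s in current}

    # Step 2: loop - join k-item sets into (k+1)-item candidates, prune by support.
    k = 2
    while current:
        candidates = {a | b for a, b in combinations(current, 2) if len(a | b) == k}
        current = [c for c in candidates if support(c) >= min_support]
        frequent.update({c: support(c) for c in current})
        k += 1
    return frequent

transactions = [frozenset(t) for t in
                [{"fever", "cough", "flu"}, {"fever", "flu"},
                 {"cough", "flu"}, {"fever", "cough"}]]
print(apriori(transactions, min_support=0.5))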
(2) Application Examples. Association rules are usually a data mining approach used to explore and interpret large transactional datasets. One study used association rules to identify ischemic beats in long-term ECG recordings; in this study, the specific application process of association rules is shown in Figure 5.
(Figure 5 shows ECG data denoising and feature extraction building a feature database (TID, ITEMS), over which the algorithm iterates to find frequent item sets and their support values.)
Model Evaluation
Classifications generated by data mining models through test sets are not
necessarily optimal, which can lead to the error of test set classification.
In order to get a perfect data model, it is very important to evaluate the
model. Receiver operating characteristic (ROC) curve and area under the
curve (AUC) are common evaluation methods in medical text data mining.
The ROC curve has a y-axis of TPR (sensitivity, also called recall rate)
and an x-axis of FPR (1-specificity). The higher the TPR, the smaller the
FPR, and the higher the efficiency of the model. AUC is defined as the area
under the ROC curve, that is, AUC is the integral of ROC, and the value of
the area is at most 1. If we randomly select a positive sample and a negative sample, the probability that the classifier ranks the positive sample higher than the negative sample is the AUC value. Pourhoseingholi et al. [71] used the AUC method to evaluate prognosis models for rectal cancer patients and found that the prediction accuracy of the random forest (RF) and BN models was high.
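A small sketch of computing the ROC curve and AUC with scikit-learn is given below; the labels and classifier scores are invented for illustration only.

from sklearn.metrics import roc_auc_score, roc_curve

# Invented ground-truth labels (1 = event) and classifier scores.
y_true  = [0, 0, 1, 1, 0, 1, 0, 1]
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.5, 0.9]

fpr, tpr, thresholds = roc_curve(y_true, y_score)   # x-axis: FPR, y-axis: TPR
auc = roc_auc_score(y_true, y_score)                # area under the ROC curve

# AUC equals the probability that a randomly chosen positive sample
# receives a higher score than a randomly chosen negative sample.
print(auc)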
DISCUSSION
Data mining is useful for medical text data to extract novel and usable
information or knowledge. This paper reviewed several research works
which are done for mining medical text data based on four steps. It is
beneficial for helping the researchers to choose reasonable approaches for
mining medical text data. However, some difficulties in medical text data
mining are also considered.
First, the lack of a publicly available annotation database affects the
development of data mining to a certain extent, due to differences in medical
information records and descriptions among countries. The information components are highly heterogeneous and the data quality is not uniform. Ultimately, this creates a key obstacle: an annotation bottleneck in medical text data [72]. At present, the international standards
include ICD (International Classification of Diseases), SNOMED CT (The
Systematized Nomenclature of Human and Veterinary Medicine Clinical
Terms), CPT (Current Procedural Terminology), DRG (Diagnosis-Related
Groups), LOINC (Logical Observation Identifiers Names and Codes),
Mesh (Medical Subject Headings), MDDB (Main Drug Database), and
UMLS (Unified Medical Language System). There are few corpora in the
field of medical text. In the past ten years, natural language processing has undergone a truly revolutionary paradigm shift, and many new technologies have been applied to the extraction of natural language information. Many scholars have established corpora for particular diseases. However, medical entities are closely related to one another; a single-disease corpus cannot segment the data accurately, and keyword information is easily omitted.
Second, medical text records from different countries follow different conventions.
For example, Ayurvedic medicine, traditional Arab Islamic medicine, and
traditional Malay medicine from India, the Middle East, and Malaysia have
problems such as inconsistent treatment description, complex treatment
methods, and difficulty in statistical analysis, leading to great difficulty in
medical data mining [73]. At the same time, the information construction
of traditional medicine is insufficient. For example, the traditional North
American indigenous medical literature mainly involves clinical efficacy
evaluation and disease application, which is complicated in recording
ACKNOWLEDGMENTS
This work was supported by the National Natural Science Foundation
of China (81703825), the Sichuan Science and Technology Program
(2021YJ0254), and the Natural Science Foundation Project of the Education
Department of Sichuan Province (18ZB01869).
REFERENCES
1. R. J. Oskouei, N. M. Kor, and S. A. Maleki, “Data mining and medical
world: breast cancers’ diagnosis, treatment, prognosis and challenges
[J],” American journal of cancer research, vol. 7, no. 3, pp. 610–627,
2017.
2. Y. Zhang, S.-L. Guo, L.-N. Han, and T.-L. Li, “Application and
exploration of big data mining in clinical medicine,” Chinese Medical
Journal, vol. 129, no. 6, pp. 731–738, 2016.
3. B. Polnaszek, A. Gilmore-Bykovskyi, M. Hovanes et al., “Overcoming
the challenges of unstructured data in multisite, electronic medical
record-based abstraction,” Medical Care, vol. 54, no. 10, pp. e65–e72,
2016.
4. E. Ford, M. Oswald, L. Hassan, K. Bozentko, G. Nenadic, and J.
Cassell, “Should free-text data in electronic medical records be shared
for research? A citizens’ jury study in the UK,” Journal of Medical
Ethics, vol. 46, no. 6, pp. 367–377, 2020.
5. S. M. Ayyoubzadeh, S. M. Ayyoubzadeh, H. Zahedi, M. Ahmadi,
and S. R Niakan Kalhori, “Predicting COVID-19 incidence through
analysis of google trends data in Iran: data mining and deep learning
pilot study,” JMIR public health and surveillance, vol. 6, no. 2, Article
ID e18828, 2020.
6. X. Ren, X. X. Shao, X. X. Li et al., “Identifying potential treatments
of COVID-19 from Traditional Chinese Medicine (TCM) by using a
data-driven approach,” Journal of Ethnopharmacology, vol. 258, no.
1, Article ID 12932, 2020.
7. E. Massaad and P. Cherfan, “Social media data analytics on telehealth
during the COVID-19 pandemic,” Cureus, vol. 12, no. 4, Article ID
e7838, 2020.
8. J. Dong, H. Wu, D. Zhou et al., “Application of big data and artificial
intelligence in COVID-19 prevention, diagnosis, treatment and
management decisions in China,” Journal of Medical Systems, vol. 45,
no. 9, p. 84, 2021.
9. L. B. Moreira and A. A. Namen, “A hybrid data mining model for
diagnosis of patients with clinical suspicion of dementia [J],” Computer
Methods and Programs in Biomedicine, vol. 165, no. 1, pp. 39–49,
2018.
ABSTRACT
Huge volume of structured and unstructured data which is called big data,
nowadays, provides opportunities for companies especially those that use
electronic commerce (e-commerce). The data is collected from customer’s
internal processes, vendors, markets and business environment. This
paper presents a data mining (DM) process for e-commerce including
the three common algorithms: association, clustering and prediction. It
also highlights some of the benefits of DM to e-commerce companies in
terms of merchandise planning, sale forecasting, basket analysis, customer
relationship management and market segmentation which can be achieved
with the three data mining algorithms. The main aim of this paper is to review
the application of data mining in e-commerce by focusing on structured and
unstructured data collected thorough various resources and cloud computing
services in order to justify the importance of data mining. Moreover, this
study evaluates certain challenges of data mining like spider identification,
data transformations and making data model comprehensible to business
users. Other challenges which are supporting the slow changing dimensions
of data, making the data transformation and model building accessible to
business users are also evaluated. A clear guide to e-commerce companies
sitting on huge volume of data to easily manipulate the data for business
improvement which in return will place them highly competitive among
their competitors is also provided in this paper.
INTRODUCTION
Data mining in e-commerce is all about integrating statistics, databases
and artificial intelligence together with some subjects to form a new idea
or a new integrated technology for the purpose of better decision making.
Data mining as a whole is believed to be a good promoter of e-commerce.
Presently, applying data mining to e-commerce has become a hot topic among businesses [1]. Data mining in cloud computing is the process of extracting structured information from unstructured or semi-structured
web data sources. From business point of view, the core concept of cloud
computing is to render computing resources in form of service to the users
who need to buy whenever they are in demand [2] . The end product of
data mining creates an avenue for decision makers to be able to track their
customers’ purchasing patterns, demand trends and locations, making their
strategic decision more effective for the betterment of their business. This
can bring down the cost of inventory together with other expenses and
maximizing the overall profit of the company.
With the wide availability of the Internet, 21st century companies
highly utilize online tools and technologies for various reasons. Therefore,
today many companies buy and sell through e-commerce and the need for
developing e-commerce applications by an expert who takes responsibility
for running and maintaining the services is increasing. When businesses grow,
the required resources for e-commerce maintenance may increase more than
the level the enterprise can handle. Based on that regard, data mining can
user pays for less with pay per use models. Most e-commerce companies
welcome the idea as it eliminates the high cost of storage for large volume
of business data by keeping it in the cloud data centers. The platform also
gives opportunity to use e-commerce business applications e.g. B2B and
B2C with smaller investment. Some other advantages of cloud computing
for e-commerce include the following: cost effective, speed of operations,
scalability and security of the entire service [3] [4] .
The association between cloud computing and data mining is that the cloud is used to store the data on servers, while data mining is provided as a service in a client-server relationship; however, when information is collected in this way, ethical concerns such as privacy and individuality can be violated [5].
Considering the importance of data mining for today’s companies, this
paper discusses benefits and challenges of data mining for e-commerce
companies. Furthermore, it reviews the process of data mining in e-commerce, together with the common types of databases and cloud computing in
the field of e-commerce.
DATA MINING
Data mining is the process of discovering meaningful patterns and correlations by sifting through large amounts of data stored in repositories. There are several tools for this, which include abstractions, aggregations, summarization and characterization of data [6]. In the past decade, data mining has changed the e-commerce business. Data mining is not specific to one type of data: it can be applied to any type of information source, although algorithms and tactics may differ when applied to different kinds of data, and the challenges presented by different types of data vary. Data mining is used with many forms of databases, such as flat files, data warehouses and object-oriented databases.
This paper concentrates on relational databases. Relational database
consists of a set of tables containing either values of entity attributes or
values of attributes from entity relationship. Tables have columns and rows,
where columns represent attributes and rows represent tuples. A tuple in
relational table corresponds to either an object or a relationship between
objects and is identified by a set of attribute values representing a unique
key [6] . The most commonly used query language for relational database is
SQL, which allows to manipulate and retrieve data stored in the tables. Data
mining algorithms using relational database can be more versatile than data
mining algorithms specifically written for flat files. Data mining can benefit
from SQL for data selection, transformation and consolidation [7] .
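As a minimal sketch, using Python's built-in sqlite3 module, of how SQL can handle the selection and consolidation step before mining, the snippet below builds a tiny relational table; the table and column names are invented.

import sqlite3

# In-memory relational table of invented e-commerce orders.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer_id INTEGER, product TEXT, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                 [(1, "book", 12.5), (1, "pen", 2.0), (2, "book", 12.5), (2, "lamp", 30.0)])

# SQL does the selection/consolidation: total spend per customer,
# which can then be fed to a mining algorithm as a feature table.
rows = conn.execute(
    "SELECT customer_id, COUNT(*) AS n_items, SUM(amount) AS total "
    "FROM orders GROUP BY customer_id"
).fetchall()
print(rows)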
There are several core techniques used in data mining. The most common techniques are as follows [8] [9]:
1) Association Rules: Association rule mining is among the most
important methods of data mining. The essence of this method
is extracting interesting correlation and association among
sets of items in the transactional databases or other data pools.
Association rules are used extensively in various areas. A typical
association rule has an implication of the form A→B where A is
an item set and B is an item set that contains only a single atomic
condition [10] .
2) Clustering: This is the organisation of data into classes by grouping similar objects together. In clustering, class labels are unknown, and it is up to the clustering algorithm to discover acceptable classes. Clustering is therefore sometimes called unsupervised classification, because classification is not dictated by given class labels. Clustering is the process of grouping a set of physical or abstract objects into classes of similar objects [10]; a brief clustering sketch is given after this list.
3) Prediction: Prediction has attracted substantial attention given the possible consequences of successful forecasting in a business context. There are two types of prediction: the first is predicting unavailable data values, and the second is that, once a classification model has been formed on a training set, the class label of an object can be predicted from the attribute values of that object. Prediction more often refers to the forecast of missing numerical values [10].
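The sketch below illustrates the clustering technique for customer segmentation with scikit-learn's KMeans; the customer features, cluster count, and segment interpretation are assumptions made for illustration.

from sklearn.cluster import KMeans
import numpy as np

# Invented customer features: [orders per month, average basket value]
customers = np.array([[1, 15], [2, 18], [10, 120], [12, 110], [5, 60], [6, 55]])

# Unsupervised grouping into classes of similar customers (market segments);
# class labels are not given in advance, the algorithm discovers them.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(customers)
print(kmeans.labels_)          # segment assigned to each customer
print(kmeans.cluster_centers_)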
(Figure: the data mining process, from the data warehouse through selection of a target database, data pre-processing, pattern discovery, and analysis to knowledge.)
The last step is the start-up for applying the data mining result. The analysis lays much emphasis on the statistics and rules of the patterns used, observing them after multiple users have accessed them [14].
However, all this depends on how iterative the overall process is and on the interpretation of the visual information obtained at each substep. Therefore, in general, the data mining process iterates through the following five basic steps:
• Data selection: This step is all about identifying the kind of data
to be mined, the goals for it and the necessary tool to enable
the process. At the end of it the right input attributes and output
information in order to represent the task are chosen.
• Data transformation: This step is all about organising the data
based on the requirements by removing noise, converting one
type of data to another, normalising the data if there is need to,
and also defining the strategy to handle the missing data.
•	Data mining step per se: the transformed data is mined using any of the techniques to extract the patterns of interest; the miner can also refine the data mining method by performing the preceding steps correctly (a compact sketch chaining these steps is given after this list).
•	Result interpretation and validation: for a better understanding of the data and the synthesised knowledge, together with its validity span, robustness is checked by testing the data mining application. The information retrieved can also be evaluated by comparing it with earlier expertise in the application domain.
•	Incorporation of the discovered knowledge: this has to do with presenting the discovered knowledge to decision makers so that it can be compared with, and any conflicts checked and resolved against, earlier extracted knowledge before a newly discovered pattern is applied [15].
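A compact sketch, under assumptions about the data, of chaining the selection, transformation, and mining steps with a scikit-learn Pipeline is shown below; the attributes, labels, and model choice are invented, and the interpretation and incorporation steps remain manual.

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
import numpy as np

# Invented selected attributes (with a missing value) and target labels.
X = np.array([[25.0, 3, np.nan], [40.0, 1, 200.0], [33.0, 5, 150.0], [52.0, 2, 80.0]])
y = [0, 1, 1, 0]

pipeline = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),   # handle missing data (transformation)
    ("scale", StandardScaler()),                  # normalise the data (transformation)
    ("mine", LogisticRegression()),               # data mining step per se
])
pipeline.fit(X, y)
print(pipeline.predict([[30.0, 4, 120.0]]))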
1)	Spider Identification: Spiders are software programs that are sent out by search engines to find new information; they are also called bots or crawlers. A spider is a program that the search engine uses to request pages and download them. It may come as a surprise to some people, but what the search engine does is use a link on an existing website to find a new website and request a copy of that page to download to its server. This is what the search engines run their ranking algorithms against, and that is what shows up in the search engine results page. The challenge here is that the search engines need to download a correct copy of the website: the e-commerce website needs to be readable and visible, and the algorithm is applied to the search engine's database. Tools are needed that have mechanisms to automatically remove unwanted data, so that the data transformed into information allows the data mining algorithm to provide reliable and sensible output [22].
2)	Data Transformations: Data transformation poses a challenge for data mining tools. The data to be transformed come from two kinds of activity: first, an active and operational system from which the data warehouse is built, and second, activities that involve assigning new columns, binning data and aggregating the data. The first process needs to be modified only infrequently, that is, only when there is a change in the site; the set of transformed data, however, poses a significantly greater challenge in the data mining process [22].
3)	Scalability of Data Mining Algorithms: With Yahoo serving over 1.2 billion page views a day, the sheer amount of data raises significant scalability issues:
•	The data mining algorithms must be able to handle and process the large volume of data gathered from the website within a reasonable time, especially because many of them scale nonlinearly.
•	The models that are generated tend to be too complicated for individuals to understand and interpret [22].
REFERENCES
1. Cao, L., Li, Y. and Yu, H. (2011) Research of Data Mining in Electronic
Commerce. IEEE Computer Society, Hebei.
2. Bhagyashree, A. and Borkar, V. (2012) Data Mining in Cloud
Computing. Multi Conference (MPGINMC-2012). http://reserach.
ijcaonline.org/ncrtc/number6/mpginme1047.pdf
3. Rao, T.K.R.K., Khan, S.A., Begun, Z. and Divakar, Ch. (2013)
Mining the E-Commerce Cloud: A Survey on Emerging Relationship
between Web Mining, E-Commerce and Cloud Computing.
IEEE International Conference on Computational Intelligence
and Computing Research, Enathi, 26-28 December 2013, 1-4.
http://dx.doi.org/10.1109/iccic.2013.6724234
4. Wu, M., Zhang, H. and Li, Y. (2013) Data Mining Pattern Valuation
in Apparel Industry E-Commerce Cloud. IEEE 4th International
Conference on Software Engineering and Service Science (ICSESS),
689-690.
5. Srinniva, A., Srinivas, M.K. and Harsh, A.V.R.K. (2013) A Study on
Cloud Computing Data Mining. International Journal of Innovative
Research in Computer and Communication Engineering, 1, 1232-1237.
6. Carbone, P.L. (2000) Expanding the Meaning and Application of Data
Mining. International Conference on Systems, Man and Cybernetics,
3, 1872-1873. http://dx.doi.org/10.1109/icsmc.2000.886383
7. Barry, M.J.A. and Linoff, G.S. (2004) On Data Mining Techniques
for Marketing, Sales and Customer Relationship Management.
Indianapolis Publishing Inc., Indiana.
8. Pan, Q. (2011) Research of Data Mining Technology in Electronic
Commerce. IEEE Computer Society, Wuhan, 12-14 August 2011, 1-4.
http://dx.doi.org/10.1109/icmss.2011.5999185
9. Verma, N., Verma, A., Rishma and Madhuri (2012) Efficient and
Enhanced Data Mining Approach for Recommender System.
International Conference on Artificial Intelligence and Embedded
Systems (ICAIES2012), Singapore, 15-16 July 2012.
10. Kamba, M. and Hang, J. (2006) Data Mining Concept and Techniques.
Morgan Kaufmann Publishers, San Fransisco.
11. News Stack (2015). http://thenewstack.io/six-of-the-best-open-source-
data-mining-tools/
12. Witten, I.H. and Frank, E. (2014) The Morgan Kaufmann Series
on Data Mining Management Systems: Data Mining. 2nd Edition,
Publisher Morgan Kaufmann, San Francisco, 365-528.
13. Liu, X.Y. And Wang, P.Z. (2008) Data Mining Technology and Its
Application in Electronic Commerce. IEEE Computer Society, Dalian,
12-14 October 2008, 1-5.
14. Zeng, D.H. (2012) Advances in Computer Science and Engineering.
Springer Heidelberg, NewYork.
15. Ralph, K. and Caserta, J. (2011) The Data Warehouse ETL Toolkit:
Practical Techniques for Extraction, Cleaning, Conforming and
Delivering Data. Wiley Publishing Inc., USA.
16. Michael, L.-W. (1997) Discovering the Hidden Secrets in Your Data—
The Data Mining Approach to Information. Information Research, 3.
http://informationr.net/ir/3-2/
17. Li, H.J. and Yang, D.X. (2006) Study on Data Mining and Its Application
in E-Business. Journal of Gansu Lianhe University (Natural Science),
No. 2006, 30-33.
18. Raghavan, S.N.R. (2005) Data Mining in E-Commerce: A Survey.
Sadhana, 30, 275-289. http://dx.doi.org/10.1007/BF02706248
19. Michael, J.A.B. and Gordon, S.L. (1997) Data Mining Techniques: For
Marketing and Sales, and Customer Relationship Management. 3rd
Edition, Wiley Publishing Inc., Canada.
20. Wang, J.-C., David, C.Y. and Chris, R. (2002) Data Mining Techniques
for Customer Relationship Management. Technology in Society, 24,
483-502.
21. Christos, P., Prabhakar. R. and Jon, K. (1998) A Microeconomic View
of Data Mining. Data Mining and Knowlege Discovery, 2, 311-324.
http://dx.doi.org/10.1023/A:1009726428407
22. Yahoo (2001) Second Quarter Financial Report. Yahoo Inc., Califonia.
Chapter 14
Research on Realization of
Petrophysical Data Mining Based on Big
Data Technology
Key Laboratory of Exploration Technologies for Oil and Gas Resources (Yangtze University), Ministry of Education
ABSTRACT
This paper studied an interpretation method for realizing data mining on large-scale petrophysical data, adopting a distributed architecture, cloud computing technology and the B/S mode with reference to big data technology and data mining methods. Based on the application of K-means clustering analysis to petrophysical data mining, it elaborated the practical significance of applying big data technology in the well logging field, providing a scientific reference for logging interpretation work and broadening the application of data analysis and processing methods.
INTRODUCTION
With the increasing scale of oil exploration and the development of the engineering field, the application of high-tech logging tools is becoming more and more extensive, and the volume of structured, semi-structured and unstructured oil and gas exploration data has exploded. In this paper, petrophysical data were taken as the object; big data technology and data mining methods were used for data analysis and processing, mining effective and usable knowledge to assist routine interpretation work and to broaden scientific ways of enhancing interpretation precision. The research gives full play to the great potential of logging interpretation for the comparative study of geologic laws and for oil and gas prediction.
The rapid development of network and computer technology as well
as the large-scale use of database technology makes it possible to extract
effective information from petrophysical data in more different ways
adopted by logging interpretation. Relying on traditional database query mechanisms and mathematical statistical analysis methods, it is difficult to process large-scale data effectively. The data often contain a great deal of valuable information, but it cannot be used efficiently because the data remain isolated and cannot be transformed into useful knowledge applied to logging interpretation work. Too much useless information inevitably leads to the loss of information distance [1] and of useful knowledge, leaving the field in the "rich information and lack of knowledge" dilemma [2].
to manage the read/write requests of the file system clients for storage and processing on its nodes.
and sedimentary parameters under the guidance of phase pattern and phase
sequence. The petrophysical data contains much potential stratigraphic
information, and the lithology of the strata often leads to a certain difference
in the sampling value of the logging curve. This difference can be seen
as the common effects of many factors, such as the lithological mineral
composition, its structure and the fluid properties contained in the pores.
Because of this, a given logging physical value also indicates a particular lithology of the corresponding strata. Coupled with differences in formation period and background, the combination of the inherent physical characteristics of rock strata in different geological periods and some random noise can be used to achieve lithological and stratigraphic division.
E = \sum_{i=1}^{k}\sum_{X\in C_i}\left\|X-\mu_i\right\|^{2}    (1)

Here, X represents a data object in the data set, C_i represents the i-th cluster, and \mu_i represents the mean (centroid) of cluster C_i:

\mu_i = \frac{1}{\left|C_i\right|}\sum_{X\in C_i}X    (2)
Here, Figure 3 is taken as the example to show the clustering process of
K-means petrophysical data. In Figure 3(a), the black triangles are labeled in
two-dimensional space with two-dimensional eigenvectors as coordinates.
They can be regarded as examples reflected by two-dimensional data
(composed of the data of two logging curves), that is, primitive petrophysical
data sets in need of clustering. Three different colored boxes represent the
clustering center points (analogous to particular lithologies) given by random
initialization. Figure 3(b) shows the results of the completion of clustering,
that is, to achieve the goal of lithological division. Figure 3(c) shows the
trajectory of the centroid in the iterative process.
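A minimal sketch of the K-means clustering of Figure 3, applied to two invented logging curves used as two-dimensional eigenvectors, is given below; scikit-learn's KMeans minimises the within-cluster sum of squares of formula (1). The curve names, distributions, and cluster count are assumptions for illustration.

import numpy as np
from sklearn.cluster import KMeans

# Two invented logging curves (e.g. gamma ray and density) sampled along depth;
# each depth sample becomes a two-dimensional feature vector.
rng = np.random.default_rng(1)
gamma = np.concatenate([rng.normal(40, 5, 100), rng.normal(90, 8, 100), rng.normal(140, 6, 100)])
density = np.concatenate([rng.normal(2.2, 0.05, 100), rng.normal(2.5, 0.05, 100), rng.normal(2.7, 0.05, 100)])
samples = np.column_stack([gamma, density])

# Three clusters stand in for three lithologies; centroids are re-estimated
# iteratively, as in Figure 3(c), until the assignment stabilises.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(samples)
print(km.cluster_centers_)      # centroids mu_i of formula (2)
print(km.inertia_)              # within-cluster sum of squares, formula (1)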
Software Implementation
CONCLUSIONS
1) The advantages of distributed architecture and cloud computing
are used to improve the overall processing capacity of the system,
and in the process of large-scale petrophysical data processing,
the B/S mode is integrated to achieve data mining to combine
big data analysis and processing mechanism with conventional
interpretation. The exploratory research idea of the new method is thus put forward.
ACKNOWLEDGEMENTS
This work is supported by Yangtze University Open Fund Project of key
laboratory of exploration technologies for oil and gas resources of ministry
of education (K2016-14).
REFERENCES
1. Wang, H.C. (2006) DIT and Information. Science Press, Beijing.
2. Wang, L.W. (2008) The Summarization of Present Situation of Data
Mining Research. Library and Information, 5, 41-46.
3. Pan, H.P., Zhao, Y.G. and Niu, Y.X. (2010) The Conventional Well
Logging Database of CCSD. Chinese Journal of Engineering
Geophysics, 7, 525-528.
4. Ghemawat, S., Gobioff, H. and Leung, S.-T. (2003) The Google File
System. ACM SIGOPS Operating Systems Review, 37, 29-43. https://
doi.org/10.1145/1165389.945450
5. Sakr, S., Liu, A., Batista, D.M., et al. (2011) A Survey of Large
Scale Data Management Approaches in Cloud Environments. IEEE
Communications Surveys & Tutorials, 13, 311-336. https://doi.
org/10.1109/SURV.2011.032211.00087
6. Low, Y., Bickson, D., Gonzalez, J., et al. (2012) Distributed GraphLab:
A Framework for Machine Learning and Data Mining in the Cloud.
Proceedings of the VLDB Endowment, 5, 716-727. https://doi.
org/10.14778/2212351.2212354
7. Song, Y., Chen, H.W. and Zhang, X.H. (2007) Short Term Electric Load
Forecasting Model Integrating Multi Intelligent Computing Approach.
Computer Engineering and Application, 43, 185-188.
8. Abraham, B. and Ledolter, J. (1983) Statistical Methods for Forecasting.
John Wiley & Sons, Inc., NewJersey.
9. Farnstrom, F., Lewis, J. and Elkan, C. (2000) Scalability for Clustering
Algorithms Revisited. AcmSigkdd Explorations Newsletter, 2, 51-57.
https://doi.org/10.1145/360402.360419
10. Rose, K., Gurewitz, E. and Fox, G.C. (1990) A Deterministic Annealing
Approach to Clustering. Information Theory, 11, 373.
SECTION 4:
INFORMATION PROCESSING METHODS
Chapter 15
ABSTRACT
The rapid development of digital informatization has led to an increasing
degree of reliance on informatization in various industries. Similarly, the
development of national traditional sports is also inseparable from the
support of information technology. In order to improve the informatization
development of national traditional sports, this paper studies the fusion
process of multisource vector image data and proposes an adjustment and
merging algorithm based on topological relationship and shape correction
for the mismatched points that constitute entities with the same name. The
algorithm is based on topological relationship. The shape of the adjustment
Citation: Xiang Fu, Ye Zhang, Ling Qin, “Application of Spatial Digital Informa-
tion Fusion Technology in Information Processing of National Traditional Sports”, Mo-
bile Information Systems, vol. 2022, Article ID 4386985, 10 pages, 2022. https://doi.
org/10.1155/2022/4386985.
Copyright: © 2022 by Authors. This is an open access article distributed under the Cre-
ative Commons Attribution License, which permits unrestricted use, distribution, and
reproduction in any medium, provided the original work is properly cited.
INTRODUCTION
National traditional sports, as the carrier of China’s excellent traditional
culture, has been preserved in the state of a living fossil after changes
in the times and social development. The 16th National Congress of the
Communist Party of China regards the construction of socialist politics,
economy, and culture with Chinese characteristics as the basic program
of the primary stage of socialism and cultural prosperity as an important
symbol of comprehensive national strength. Therefore, cultural construction
will also be a focus of today’s new urbanization process. If you want to
achieve cultural development in urbanization, you need to use a medium,
and traditional national sports are a good medium. Traditional national sports are multifunctional, encompassing fitness, entertainment, and education. Moreover, their content is rich, their forms are diversified and eclectic, and they suit all ages, men, women, and children alike. It can be said that traditional national sports activities are an
indispensable part of people’s lives.
An information platform is an environment created for the construction,
application, and development of information technology, including the
development and utilization of information resources, the construction of
information networks, the promotion of information technology applications,
the development of information technology and industries, the cultivation of
information technology talents, and the formulation and improvement of
information technology policy systems [1]. The network platform has the
greatest impact and the best effect in the construction of the information
platform. Therefore, it is necessary to make full use of network technology,
communication technology, control technology, and information security
technology to build a comprehensive and nonprofit information platform for
the traditional Mongolian sports culture. The information platform includes
organizations, regulatory documents, categories, inheritors, news updates,
columns, performance videos, and protection forums. Moreover, it uses text,
pictures, videos, etc., to clearly promote and display the unique ethnicity of
RELATED WORK
By integrating “digitalization” into sports, the concept of “digital sports” can be understood in both a broad and a narrow sense [4].
sports is a new physical exercise method that combines computer information
technology with scientific physical exercise content and methods. It can help
exercisers improve their sports skills, enhance physical fitness, enrich social
leisure life, and promote the purpose of spiritual civilization construction. In
a narrow sense, digital sports is a related activity that combines traditional
best exercise plan, and form a serialized and intelligent digital sports service
system [11]. Regardless of the age of the participants, high or moderate
weight, and female or male, digital sports methods will provide them
with the most suitable activity method to help different sports enthusiasts
complete exercise and demonstrate the charm of sports [12]. Digital sports
deeply analyzes the activity habits or exercise methods of the elderly,
women, children, and other special groups and provides more suitable sports
services for every sports enthusiast [13]. Through local computing, digital
sports accurately locates and perceives the personalized and unstructured
data of different audience groups and conducts comprehensive analysis and
processing of various data in a short period of time, forming a portable mobile service for each sports group. This makes it possible to find out the real needs of more sports audiences and to put forward effective exercise suggestions that help different exercise groups reach their best exercise state [14].
the connection of the bracelet and the digital sports terminal, the public can
also see the comparison chart of the comprehensive sports data of different
participating groups more intuitively, assist the public to set personalized
sports goals, and urge each athlete to complete their own exercise volume.
In the end, every exerciser’s exercise method and exercise effect will be
improved scientifically and reasonably over time [15].
\begin{aligned} X &= (N+H)\cos B\cos L \\ Y &= (N+H)\cos B\sin L \\ Z &= \left[N\left(1-e^{2}\right)+H\right]\sin B \end{aligned}    (1)

Among them, B is the latitude of the earth, L is the longitude of the earth, H is the height of the earth, (X,Y,Z) are the rectangular coordinates in space, N = a/\sqrt{1-e^{2}\sin^{2}B} is the radius of curvature of the ellipsoid, and e = \sqrt{a^{2}-b^{2}}/a is the eccentricity of the ellipse (a and b represent the long and short radii of the ellipse, respectively) [17].
\begin{aligned} L &= \arctan\frac{Y}{X} \\ B &= \arctan\frac{Z+Ne^{2}\sin B}{\sqrt{X^{2}+Y^{2}}} \\ H &= \frac{\sqrt{X^{2}+Y^{2}}}{\cos B}-N \end{aligned}    (2)

In the iterative process, the initial value is B_{0} = \arctan\left(Z/\left((1-e^{2})\sqrt{X^{2}+Y^{2}}\right)\right). According to formula (2), B can be obtained after approximately four iterations, and then H can be obtained.
Figure 1 shows two spatial rectangular coordinate systems, O_{1}-X_{1}Y_{1}Z_{1} and O_{2}-X_{2}Y_{2}Z_{2}. The same point in the two rectangular coordinate systems has the following correspondence [18]:

\begin{bmatrix} X_{2} \\ Y_{2} \\ Z_{2} \end{bmatrix} = \begin{bmatrix} \Delta X_{0} \\ \Delta Y_{0} \\ \Delta Z_{0} \end{bmatrix} + (1+k)\,R(\varepsilon_{X},\varepsilon_{Y},\varepsilon_{Z}) \begin{bmatrix} X_{1} \\ Y_{1} \\ Z_{1} \end{bmatrix}    (3)

R(\varepsilon_{X},\varepsilon_{Y},\varepsilon_{Z}) = \begin{bmatrix} 1 & \varepsilon_{Z} & -\varepsilon_{Y} \\ -\varepsilon_{Z} & 1 & \varepsilon_{X} \\ \varepsilon_{Y} & -\varepsilon_{X} & 1 \end{bmatrix}    (4)

(\Delta X_{0}, \Delta Y_{0}, \Delta Z_{0}) are the coordinate translation parameters, (\varepsilon_{X}, \varepsilon_{Y}, \varepsilon_{Z}) are the coordinate rotation parameters, and k is the coordinate scale coefficient. In practical applications, the conversion parameters in the above two Cartesian coordinate conversion relations can be determined by using the least squares method with the coordinates of common points.
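The sketch below implements the forward conversion of formula (1) and the iterative inverse of formula (2) for a WGS-84-like ellipsoid; the ellipsoid constants, iteration count, and test point are illustrative assumptions, not values taken from the paper.

import math

A = 6378137.0                    # semi-major axis a (metres), WGS-84-like
B_AXIS = 6356752.314             # semi-minor axis b (metres)
E2 = (A**2 - B_AXIS**2) / A**2   # first eccentricity squared e^2

def blh_to_xyz(b, l, h):
    """Formula (1): geodetic (B, L, H) in radians/metres -> Cartesian (X, Y, Z)."""
    n = A / math.sqrt(1 - E2 * math.sin(b) ** 2)   # radius of curvature N
    x = (n + h) * math.cos(b) * math.cos(l)
    y = (n + h) * math.cos(b) * math.sin(l)
    z = (n * (1 - E2) + h) * math.sin(b)
    return x, y, z

def xyz_to_blh(x, y, z, iterations=4):
    """Formula (2): iterate on latitude B (about four iterations suffice), then H."""
    l = math.atan2(y, x)
    p = math.hypot(x, y)
    b = math.atan2(z, p * (1 - E2))                # initial value B_0
    for _ in range(iterations):
        n = A / math.sqrt(1 - E2 * math.sin(b) ** 2)
        b = math.atan2(z + n * E2 * math.sin(b), p)
    n = A / math.sqrt(1 - E2 * math.sin(b) ** 2)
    h = p / math.cos(b) - n
    return b, l, h

print(xyz_to_blh(*blh_to_xyz(math.radians(30.0), math.radians(114.0), 50.0)))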
In mathematics, projection refers to the establishment of a one-to-
one mapping relationship between two point sets. Image projection is to
express the graticule on the sphere of the earth onto a plane in accordance
with a certain mathematical law. A one-to-one correspondence function is
established between the digital coordinates (B,L) of a point on the ellipsoid
and the rectangular coordinates (x,y) of the corresponding point on the
image. The general projection formula can be expressed as [19]

x = F_{1}(B, L), \qquad y = F_{2}(B, L)    (5)
In the formula, (B,L) is the digitized coordinates (longitude, latitude) of
a point on the ellipsoid, and (x,y) is the rectangular coordinates of the point
projected on the plane.
The transformation of the positive solution of Gaussian projection is
as follows: given the digitized coordinates (B,L), the plane rectangular
coordinates (x,y) under the Gaussian projection are solved. The formula is
shown in
(6)
In the formula, X represents the arc length of the meridian from the equator to the latitude B.
(7)
(8)
In the formula, L0 is the longitude of the origin, and B0 is called the reference latitude.
When B0 = 0, the cylinder is tangent to the earth ellipsoid, and the radius of
the tangent cylinder is a.
The inverse solution transformation of the Mercator projection is as
follows: given the plane rectangular coordinates (x,y) under the Mercator
projection, the digitized coordinates (B,L) are calculated, and the formula
is shown in
(9)
In the formula, exp denotes the exponential function with the natural logarithm base e, and the latitude B converges quickly under iterative calculation.
For the geometric matching method of point entities, the commonly used
matching similarity index is Euclidean distance. The algorithm compares
the calculated Euclidean distance between the two with a threshold, and the
one within the threshold is determined to be an entity with the same name
or may be an entity with the same name. If multiple entities with the same
name are obtained by matching, then repeated matching can be performed
by reducing the threshold or reverse matching. If the entity with the same
name cannot be matched, it can be adjusted by appropriately increasing the
threshold until the entity with the same name is matched. The calculation
formula of the Euclidean distance between two points (x_1, y_1) and (x_2, y_2) is shown in

d = \sqrt{(x_{1}-x_{2})^{2}+(y_{1}-y_{2})^{2}}    (10)
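A minimal sketch of threshold-based point-entity matching using the Euclidean distance of formula (10) is given below; the threshold value and coordinates are invented, and the handling of multiple candidates (tightening the threshold or reverse matching) is only indicated in a comment.

import math

def match_points(points_a, points_b, threshold=5.0):
    """Pair each point in map A with same-name candidates in map B within the threshold."""
    matches = {}
    for i, (xa, ya) in enumerate(points_a):
        candidates = [j for j, (xb, yb) in enumerate(points_b)
                      if math.hypot(xa - xb, ya - yb) <= threshold]   # formula (10)
        matches[i] = candidates   # several candidates -> tighten threshold or reverse-match
    return matches

# Invented coordinates from two vector images of the same area.
image1 = [(100.0, 200.0), (150.0, 240.0)]
image2 = [(101.5, 201.0), (149.0, 243.5), (400.0, 80.0)]
print(match_points(image1, image2))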
(11)
(12)
(13)
The quantities to be solved are the correction values of the adjustment. Various error formulas, such as the displacement formula, shape formula, relative displacement formula, area formula, parallel line formula, line segment length formula, and distance formula of adjacent entities, are established according to actual application needs. Finally, the calculation is carried out according to the principle of the least squares method of indirect adjustment, and the calculation formula is shown in
(14)
(15)
In the formula, constraint_k is the limit value of the k-th factor, the adjustment of the i-th entity coordinate point enters the corresponding constraint, and n is the total number of entity coordinate points. A is the coefficient matrix of the adjustment model, and v, x, and l are the corresponding residual vector, parameter vector, and constant vector, respectively.
The adjustment and merging algorithm based on topological relations
is mainly used to adjust the geometric positions of unmatched points in
entities with the same name. The basic idea is as follows: first, the algorithm
determines that the unmatched points that need to be adjusted are affected
by the matched points with the same name. Secondly, the algorithm analyzes
and calculates the geometric position adjustment of each matched point with
the same name. Finally, the algorithm uses the weighted average method to
calculate the total geometric position adjustment of the unmatched points.
We assume that the position adjustment of an unmatched point P is affected by N matched points with the same name, Q_1, Q_2, ..., Q_N, that the distance from P to each matched point Q_i with the same name is d_i, and that the coordinate adjustment amount contributed by the matched point Q_i to the point P is \Delta_i. Then the total adjustment amount of the coordinates of point P is calculated as

\Delta_{P} = \sum_{i=1}^{N} w_{i}\,\Delta_{i}    (16)

Among them, the weight w_i is determined by the distance d_i, with closer matched points with the same name receiving larger weights and the weights summing to 1.
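A minimal sketch of the weighted-average adjustment of formula (16) follows; it assumes inverse-distance weights normalised to sum to 1 (closer matched points with the same name have more influence), which is a natural choice but not necessarily the paper's exact weighting, and all coordinates are invented.

def adjust_unmatched_point(p, matched_points, adjustments):
    """Formula (16): weighted average of the adjustments of nearby matched points.

    p: (x, y) of the unmatched point; matched_points: list of (x, y);
    adjustments: list of (dx, dy) applied to each matched point.
    Weights are assumed inverse to distance and normalised to sum to 1.
    """
    dists = [((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2) ** 0.5 for q in matched_points]
    inv = [1.0 / max(d, 1e-9) for d in dists]         # guard against zero distance
    weights = [v / sum(inv) for v in inv]
    dx = sum(w * a[0] for w, a in zip(weights, adjustments))
    dy = sum(w * a[1] for w, a in zip(weights, adjustments))
    return p[0] + dx, p[1] + dy

# Invented example: two matched points pull the unmatched point slightly north-east.
print(adjust_unmatched_point((10.0, 10.0), [(8.0, 9.0), (14.0, 12.0)], [(0.5, 0.2), (0.1, 0.4)]))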
\overline{\Delta\theta} = \frac{1}{r}\sum_{i=1}^{r}\left|\theta_{i}^{\mathrm{front}}-\theta_{i}^{\mathrm{after}}\right|    (17)

In the formula, \theta_i^{front} and \theta_i^{after}, respectively, represent the angle value before and after the adjustment of the i-th turning angle that constitutes the entity, and r represents the total number of turning angles that constitute the entity.
In order to enable the entity adjustment and merging algorithm based on topological relations to maintain the consistency of the shape of irregular entities before and after adjustment and merging, this paper uses the average angle difference as an indicator of the size of the shape change; that is, starting from the average angle difference, an adjustment and merging algorithm based on topological relations and shape correction is proposed for the points that are not successfully matched on entities with the same name. The detailed steps of the algorithm are as follows:
(1) The algorithm first calculates the point that is not matched
successfully according to the adjustment and merging algorithm
based on the topological relationship; that is, the adjusted position
coordinates are calculated by formula (16).
(2) Based on the adjustment and merging algorithm of the topological
relationship, the shape correction is performed. According to the
principle that the unmatched point on the entity with the same name should, before and after the adjustment, maintain the same angle with the two nearest matched points with the same name, the adjusted position coordinates are calculated. As in Figure
7, we assume that A1, B1, A2, and B2 are the point pairs with the
same name that are successfully matched on the entities with
the same name in vector image 1 and vector image 2, where A1
matches A2 and B1 matches B2. They are adjusted to A′, B′ after
being processed by the entity adjustment and merging algorithm.
In the figure, X is the unmatched point in vector image 1, and
the two matched points closest to X in this figure are A1 and B1,
respectively. Now, the algorithm adjusts and merges the point X
that is not successfully matched and finds its adjusted position
X' = a_{1}X_{1}' + a_{2}X_{2}'    (18)

In the formula, X_1' and X_2' are the positions obtained from the topology-based adjustment and from the shape correction, respectively, and a_1 and a_2 are the corresponding weights, whose values are determined according to the specific data, application, and experience.
Figure 8: National traditional sports training system based on spatial digital information fusion, built from CCD cameras 1–n, a video collection card, an industrial computer, and a main processor.
(A further figure shows the software development environment: the process-oriented (structured) method, the data-oriented (information engineering) method, the object-oriented (OO) method, software reuse technology, an integrated project/program support environment, and a central resource database.)
94
Information processing effect
92
90
88
86
84
82
1
5
9
13
17
21
25
29
33
37
41
45
49
53
57
61
65
Number
Digital effect
It can be seen from the above that the effect of the traditional national
sports information processing method based on digital information fusion
proposed in this article is relatively significant. On this basis, the spatial
digital processing of this method is evaluated, and the results shown in
Table 1 and Figure 12 are obtained.
Table 1: Evaluation of the spatial digital processing effect of the national traditional sports information processing method based on digital information fusion
Figure 12: Statistical diagram of the spatial digital processing effect of the national traditional sports information processing method based on digital information fusion.
From the above research, it can be seen that the national traditional
sports information processing method based on digital information fusion
proposed in this article also has a good effect in the digital processing of the
national traditional sports space.
CONCLUSION
Information technology has emerged in the field of sports, and brand-
new sports activities such as sports digitalization and sports resource
informationization have emerged. Unlike traditional online games and
e-sports, which involve finger movements and eye-moving relatively static
activities, digital sports put more emphasis on “sweating” body movements.
Moreover, it uses digital technologies such as motion capture devices and
motion sensors to transform and upgrade traditional sports to achieve
interaction and entertainment among humans, machines, and the Internet.
Digital sports will also play a particularly important role in social criticism
and cultural value orientation. This article combines digital information
fusion technology to construct the national traditional sports information
processing system and improve the development effect of national
traditional sports in the information age. The research results show that the
national traditional sports information processing method based on digital
information fusion proposed in this paper has a good effect in the digital
processing of the national traditional sports space.
REFERENCES
1. K. Aso, D. H. Hwang, and H. Koike, “Portable 3D human pose
estimation for human-human interaction using a chest-mounted
fisheye camera,” in Augmented Humans Conference 2021, pp. 116–
120, Finland, February 2021.
2. A. Bakshi, D. Sheikh, Y. Ansari, C. Sharma, and H. Naik, “Pose estimate
based yoga instructor,” International Journal of Recent Advances in
Multidisciplinary Topics, vol. 2, no. 2, pp. 70–73, 2021.
3. S. L. Colyer, M. Evans, D. P. Cosker, and A. I. Salo, “A review of
the evolution of vision-based motion analysis and the integration of
advanced computer vision methods towards developing a markerless
system,” Sports Medicine-Open, vol. 4, no. 1, pp. 1–15, 2018.
4. Q. Dang, J. Yin, B. Wang, and W. Zheng, “Deep learning based 2d
human pose estimation: a survey,” Tsinghua Science and Technology,
vol. 24, no. 6, pp. 663–676, 2019.
5. R. G. Díaz, F. Laamarti, and A. El Saddik, “DTCoach: your digital
twin coach on the edge during COVID-19 and beyond,” IEEE
Instrumentation & Measurement Magazine, vol. 24, no. 6, pp. 22–28,
2021.
6. S. Ershadi-Nasab, E. Noury, S. Kasaei, and E. Sanaei, “Multiple human
3d pose estimation from multiview images,” Multimedia Tools and
Applications, vol. 77, no. 12, pp. 15573–15601, 2018.
7. R. Gu, G. Wang, Z. Jiang, and J. N. Hwang, “Multi-person hierarchical
3d pose estimation in natural videos,” IEEE Transactions on Circuits
and Systems for Video Technology, vol. 30, no. 11, pp. 4245–4257,
2019.
8. G. Hua, L. Li, and S. Liu, “Multipath affinage stacked—hourglass
networks for human pose estimation,” Frontiers of Computer Science,
vol. 14, no. 4, pp. 1–12, 2020.
9. M. Li, Z. Zhou, and X. Liu, “Multi-person pose estimation using
bounding box constraint and LSTM,” IEEE Transactions on
Multimedia, vol. 21, no. 10, pp. 2653–2663, 2019.
10. S. Liu, Y. Li, and G. Hua, “Human pose estimation in video via
structured space learning and halfway temporal evaluation,” IEEE
Transactions on Circuits and Systems for Video Technology, vol. 29,
no. 7, pp. 2029–2038, 2019.
ABSTRACT
It is acknowledged that a lack of interdisciplinary communication amongst designers can result in poor coordination performance in building design. Viewing communication as an information processing activity, this paper aims to explore the relationship between interdisciplinary information processing (IP) and design coordination performance. Both the amount and the quality of information processing are considered. 698 project-based samples were collected by questionnaire survey from design institutes in mainland China.
INTRODUCTION
Changes in construction projects are very common and could lead to project
delays and cost overruns. Lu and Issa believe that the most frequent and
most costly changes are often related to design, such as design changes
and design errors [1]. Hence, design stage is of primary importance in
construction project life-cycle [2]. Common types of design deficiencies
include design information inconsistency (e.g. location of a specific wall
differing on the architectural and structural drawings), mismatches/physical
interference between connected components (e.g. duct dimensions in
building service drawings not matching related pass/hole dimensions in
structural drawings), and component malfunctions (e.g. designing a room’s
electrical supply to suit classroom activities, while architectural drawings
designate the room as a computer lab) [3] [4]. Based on a questionnaire
survey of 12 leading Canadian design firms, Hegazy, Khalifa and Zaneldin
report eight common problems―all of which are due to insufficient and
inadequate communication and information exchange (e.g., delay in
obtaining information, not everyone on the team getting design change
information) [5]. Communication and information exchange is termed
information processing in this paper. Information processing includes the
collection, processing and distribution of information [6], and can be either
personal or impersonal (e.g. accomplished using a program) [7] [8].
Building design is a multi-disciplinary task. The process of designing a
building is the process of integrating information from multiple disciplinary
What is the relationship between information processing amount and information processing quality? According to Chinese philosophy, the accumulation of quantitative change brings about qualitative improvement. In the context of interdisciplinary information processing in Chinese design institutes, it is hypothesized that the relationship between information processing amount and perceived information quality follows a nonlinear exponential expression of the form

\text{quality} = \alpha\left(1-\beta^{\,\text{IP amount}}\right)    (1)
Literature in the field of communication studies is reviewed here to
investigate the concept of information processing quality, for two reasons.
The first is that information processing quality should be constructed as a multi-
dimensional construct to properly investigate its rich content; however,
little research within the information processing theory literature discusses
the multiple dimensions of information processing quality, perhaps
due to the short history of information processing theory. Fortunately,
in the communication study community, researchers have deeply
discussed the content of information quality in communication [14] [15].
Usually, communication refers to communication between people; here,
communication is not limited to people talking to other people directly, but
also includes people getting information from media sources, such as online
management systems on which other people have posted information.
Information processing in design coordination includes both personal
communication, and communication through programming; in this sense,
communication and information processing are the same issue, which is the
second reason why research findings from the communication studies field
can be used.
Perceived information quality (PIQ) is a concept applied, in
communication literature, to measure information processing quality, and
refers to the extent to which an individual perceives information received
from a sender as being valuable. At the cognitive level, people choose sources
that are perceived to have a greater probability of providing information that
will be relevant, reliable, and helpful to the problem at hand―attributes
that may be summarized under the label perceived source quality [16].
METHODS
Measurement
          GD   Archi.   SE   ME   EE   PE   BIM
GD         2      5      5    2    2    2    2
Archi.    15     65     65   44   44   44   44
SE        12     66     66   38   38   38   38
ME         2     15     15    6    6    6    6
EE         1     12     12    2    2    2    2
PE         2     11     11    7    7    7    7
BIM        3     19     19    9    9    9    9
Coordination Performance
Coordination process performance refers to the extent to which the
respondent (focal unit a) has effective information processing with another
person in the design team (unit j). It is a dyadic concept, and the five-item
dyadic coordination performance scale used by Sherman and Keller [18]
is applied. The scale includes items examining: 1) the extent to which the
focal unit a had an effective working relationship with unit j; 2) the extent to
which unit j fulfilled its responsibilities to unit a; 3) the extent to which unit
a fulfilled its responsibilities to unit j; 4) the extent to which the coordination
DATA ANALYSIS
(Regression results: standard errors in parentheses; *p < 0.05, **p < 0.01, ***p < 0.001. Differences in sample size are due to missing data.)
DISCUSSION
On one hand, insufficient interdisciplinary communication will lead to
coordination failure. On the other hand, too much information processing
will lead to information overload as well as coordination cost overrun. The
challenge for cross-functional teams is to ensure the level of information
exchange amongst team members allows them to optimize their performance
[11]. In this study, it is found that information processing amount is positively related to coordination process performance; specifically, the relationship between the frequency of interdisciplinary communication in the detailed design stage and coordination process performance followed the nonlinear exponential expression performance = 3.691 (1 − 0.235^(IP amount)). Whether the finding can be generalized to regions other than Mainland China needs further study.
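As a purely numerical illustration of the reported fit (the snippet below only evaluates the coefficients quoted above and is not part of the original analysis), the predicted performance rises quickly at low communication frequencies and saturates near the asymptote of 3.691:

# Evaluate the reported saturation curve: performance = 3.691 * (1 - 0.235 ** ip_amount).
# Illustration only; the coefficients are those quoted in the Discussion above.

def predicted_performance(ip_amount: float) -> float:
    """Predicted coordination process performance for a given IP amount."""
    return 3.691 * (1.0 - 0.235 ** ip_amount)

for amount in (1, 2, 3, 5):
    print(f"IP amount = {amount}: predicted performance = {predicted_performance(amount):.3f}")
# 2.824, 3.487, 3.643, 3.688 -> diminishing returns toward 3.691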
CONCLUSION
This paper explores the relationship between interdisciplinary communication
and design coordination performance in design institutes in Mainland China.
REFERENCES
1. Lu, H. and Issa, R.R. (2005) Extended Production Integration
for Construction: A Loosely Coupled Project Model for Building
Construction. Journal of Computing in Civil Engineering, 19, 58-68.
https://doi.org/10.1061/(ASCE)0887-3801(2005)19:1(58)
2. Harpum, P. (Ed.) (2004) Design Management. John Wiley and Sons
Ltd., USA. https://doi.org/10.1002/9780470172391.ch18
3. Korman, T., Fischer, M. and Tatum, C. (2003) Knowledge and Reasoning
for MEP Coordination. Journal of Construction Engineering and
Management, 129, 627-634. https://doi.org/10.1061/(ASCE)0733-9364(2003)129:6(627)
4. Mokhtar, A.H. (2002) Coordinating and Customizing Design
Information through the Internet. Engineering Construction and
Architectural Management, 9, 222-231. https://doi.org/10.1108/
eb021217
5. Hegazy, T., Khalifa, J. and Zaneldin, E. (1998) Towards Effective
Design Coordination: A Questionnaire Survey. Canadian Journal of
Civil Engineering, 25, 595-603. https://doi.org/10.1139/l97-115
6. Tushman, M.L. and Nadler, D.A. (1978) Information Processing
as an Integrating Concept in Organizational Design. Academy of
Management Review, 613-624.
7. Dietrich, P., Kujala, J. and Artto, K. (2013) Inter-Team Coordination
Patterns and Outcomes in Multi-Team Projects. Project Management
Journal, 44, 6-19. https://doi.org/10.1002/pmj.21377
8. Van de Ven, A.H., Delbecq, A.L. and Koenig Jr., R. (1976) Determinants
of Coordination Modes within Organizations. American Sociological
Review, 322-338. https://doi.org/10.2307/2094477
9. Mathieu, J.E., Heffner, T.S., Goodwin, G.F., Salas, E. and Cannon-
Bowers, J.A. (2000) The Influence of Shared Mental Models on Team
Process and Performance. Journal of Applied Psychology, 85, 273.
https://doi.org/10.1037/0021-9010.85.2.273
10. Katz, D. and Kahn, R.L. (1978) The Social Psychology of Organizations.
11. Patrashkova, R.R. and McComb, S.A. (2004) Exploring Why More
Communication Is Not Better: Insights from a Computational
Model of Cross-Functional Teams. Journal of Engineering and
Technology Management, 21, 83-114. https://doi.org/10.1016/j.
jengtecman.2003.12.005
12. Goodman, P.S. and Leyden, D.P. (1991) Familiarity and Group
Productivity. Journal of Applied Psychology, 76, 578. https://doi.
org/10.1037/0021-9010.76.4.578
13. Boisot, M.H. (1995) Information Space. Int. Thomson Business Press.
14. Maltz, E. (2000) Is All Communication Created Equal? An Investigation
into the Effects of Communication Mode on Perceived Information
Quality. Journal of Product Innovation Management, 17, 110-127.
https://doi.org/10.1016/S0737-6782(99)00030-2
15. Thomas, S.R., Tucker, R.L. and Kelly, W.R. (1998) Critical
Communications Variables. Journal of Construction Engineering
and Management, 124, 58-66. https://doi.org/10.1061/(ASCE)0733-
9364(1998)124:1(58)
16. Choo, C.W. (2005) The Knowing Organization. Oxford University
Press. https://doi.org/10.1093/acprof:oso/9780195176780.001.0001
17. Chang, A.S. and Shen, F.-Y. (2014) Effectiveness of Coordination
Methods in Construction Projects. Journal of Management in
Engineering. https://doi.org/10.1061/(ASCE)ME.1943-5479.0000222
18. Sherman, J.D. and Keller, R.T. (2011) Suboptimal Assessment of
Interunit Task Interdependence: Modes of Integration and Information
Processing for Coordination Performance. Organization Science, 22,
245-261. https://doi.org/10.1287/orsc.1090.0506
19. Keller, K.L. and Staelin, R. (1987) Effects of Quality and Quantity of
Information on Decision Effectiveness. Journal of Consumer Research,
14, 200-213. https://doi.org/10.1086/209106
Chapter 17
Neural Network Optimization Method and Its Application in Information Processing
ABSTRACT
Neural network theory is the basis of massive information parallel processing
and large-scale parallel computing. Neural network is not only a highly
nonlinear dynamic system but also an adaptive organization system, which
can be used to describe the intelligent behavior of cognition, decision-
making, and control. The purpose of this paper is to explore the optimization
Citation: Pin Wang, Peng Wang, En Fan, “Neural Network Optimization Method and
Its Application in Information Processing”, Mathematical Problems in Engineering,
vol. 2021, Article ID 6665703, 10 pages, 2021. https://doi.org/10.1155/2021/6665703.
Copyright: © 2021 by Authors. This is an open access article distributed under the
Creative Commons Attribution License, which permits unrestricted use, distribution,
and reproduction in any medium, provided the original work is properly cited.
INTRODUCTION
In the information society, the amount of information being generated is growing ever larger [1]. To make information available in a timely manner to serve the
development of the national economy, science and technology, and defense
industry, it is necessary to collect, process, transmit, store, and make decisions
on information data. Theoretical innovation and implementation are carried
out to meet the needs of the social development situation. Therefore, neural
networks have extremely extensive research significance and application
value in information science fields such as communications, radar, sonar,
electronic measuring instruments, biomedical engineering, vibration
engineering, seismic prospecting, and image processing. This article focuses
on the study of neural network optimization methods and their applications
in intelligent information processing.
WB / f0 ≪ 1,    (1)

where WB is the signal bandwidth and f0 is the signal center frequency. A single-frequency signal with a center frequency of f0 can be used to simulate a narrowband signal. A sine signal is a typical narrowband signal, and the analog signals used in this article are all single-frequency sine signals.
(2) Array signal processing model: suppose that there is a sensor
array in the plane, in which M sensor array elements with
arbitrary directivity are arranged, and K narrowband plane waves
are distributed in this plane. The center frequencies of these plane
waves are all ω0 and the wavelength is λ, and suppose that M > K
(that is, the number of array elements is greater than the number of
incident signals). The signal output received by the k-th element
at time t is the sum of K plane waves; namely,
x_k(t) = Σ_{i=1}^{K} a_k(θ_i) s_i(t − τ_k(θ_i)),  k = 1, 2, . . . , M,    (2)

where a_k(θ_i) is the sound pressure response coefficient of element k to source i, s_i(t − τ_k(θ_i)) is the signal wavefront of source i, and τ_k(θ_i) is the time delay of element k relative to the reference element. According to the narrowband assumption, the time delay only affects the wavefront by a phase change,

s_i(t − τ_k(θ_i)) ≈ s_i(t) e^{−jω0 τ_k(θ_i)}.    (3)

Therefore, formula (2) can be rewritten as

x_k(t) = Σ_{i=1}^{K} a_k(θ_i) e^{−jω0 τ_k(θ_i)} s_i(t).    (4)

Writing the output of the M sensors in vector form, the model becomes

X(t) = Σ_{i=1}^{K} a(θ_i) s_i(t),    (5)

among them

a(θ_i) = [a_1(θ_i) e^{−jω0 τ_1(θ_i)}, . . . , a_M(θ_i) e^{−jω0 τ_M(θ_i)}]^T,    (6)

which is called the direction vector of the incoming wave direction θ_i. Let S(t) = [s_1(t), s_2(t), . . . , s_K(t)]^T and denote the measurement noise by N(t); then the above array model can be expressed as

X(t) = A(θ) S(t) + N(t),    (7)

where A(θ) = [a(θ_1), a(θ_2), . . . , a(θ_K)] is the direction matrix of the array model.
The covariance matrix of the array output is

R = E{X(t) X^H(t)},    (8)

where E{.} represents statistical expectation. Let

R_S = E{S(t) S^H(t)}    (9)

be the covariance matrix of the signals and

R_N = E{N(t) N^H(t)} = σ_n^2 I    (10)

be the covariance matrix of the noise. It is assumed that the noise received by all elements has a common variance σ_n^2, which is also the noise power. From equations (9) and (10), we can get

R = A(θ) R_S A^H(θ) + σ_n^2 I.    (11)

It can be proved that R is a nonsingular, positive definite Hermitian square matrix; that is, R^H = R. Therefore, the singular value decomposition of R can be performed to achieve diagonalization, and the eigendecomposition can be written as follows:

R = U Λ U^H,    (12)

where U is the transformation unitary matrix, so that matrix R is diagonalized into a real-valued matrix Λ = diag(λ1, λ2, . . . , λM), and the eigenvalues are ordered as follows:

λ1 ≥ λ2 ≥ · · · ≥ λK > λ(K+1) = · · · = λM = σ_n^2.    (13)

From equation (13), it can be seen that any vector orthogonal to A(θ) is an eigenvector of matrix R belonging to the eigenvalue σ_n^2.
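A minimal numerical sketch may help make this eigenvalue structure concrete. The snippet below assumes a uniform linear array with isotropic elements (a_k(θ) = 1), two illustrative sources, and spatially white noise; the array geometry, source angles, and noise power are assumptions chosen only to reproduce the structure of equations (7)-(13), not values from the paper.

import numpy as np

M, K, snapshots = 8, 2, 2000
wavelength, spacing = 1.0, 0.5
angles = np.deg2rad([10.0, 40.0])               # assumed source DOAs

# Direction matrix A (M x K): one steering vector a(theta_k) per source.
m = np.arange(M)[:, None]
A = np.exp(-2j * np.pi * spacing / wavelength * m * np.sin(angles)[None, :])

rng = np.random.default_rng(0)
S = (rng.standard_normal((K, snapshots)) + 1j * rng.standard_normal((K, snapshots))) / np.sqrt(2)
noise_power = 0.1
N = np.sqrt(noise_power / 2) * (rng.standard_normal((M, snapshots))
                                + 1j * rng.standard_normal((M, snapshots)))
X = A @ S + N                                   # array model X(t) = A S(t) + N(t), cf. (7)

R = (X @ X.conj().T) / snapshots                # sample covariance, cf. (8)-(11)
eigvals, U = np.linalg.eigh(R)                  # eigendecomposition R = U diag(lambda) U^H, cf. (12)
print(np.sort(eigvals)[::-1])                   # K large eigenvalues, M-K close to the noise power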
a designed RBF neural network are equivalent to the height fitting of this
hypersurface. It establishes an approximate hypersurface by interpolating
the input data points [10, 11].
The sensor array is equivalent to a mapping from the DOA space to the sensor array output space, that is, a mapping of the form
(14)
where K is the number of source signals, M is the number of elements of the
uniform linear array, ak is the complex amplitude of the k-th signal, α is the
initial phase, ω0 is the signal center frequency, d is the element spacing, and
c is the propagation speed of the source signal [12, 13].
When the number of information sources has been estimated as K, the function of the neural network on this problem is equivalent to the inverse problem of the above mapping, that is, the inverse mapping from the array output space back to the DOA space.
To obtain this mapping, it is necessary to establish a neural network structure
in which the preprocessed data based on the incident signal is used as the
network input, and the corresponding DOA is used as the network output
after the hidden layer activation function is applied. The whole process is
a targeted training process, and the process of fitting the mapping with the
RBF neural network is equivalent to an interpolation process.
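The interpolation view of the RBF network can be sketched as follows; the Gaussian kernel, the toy two-dimensional features, and the assumed ground-truth mapping are illustrative assumptions, not the paper's actual preprocessing of the incident signal.

import numpy as np

def rbf_fit(centers, targets, sigma):
    """Solve for the interpolation weights of a Gaussian-kernel RBF network."""
    d2 = ((centers[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    phi = np.exp(-d2 / (2 * sigma ** 2))
    return np.linalg.solve(phi, targets)

def rbf_predict(centers, weights, sigma, x):
    """Evaluate the interpolant at a new input vector x."""
    d2 = ((x[None, :] - centers) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2)) @ weights

# Toy example: interpolate a DOA value (degrees) from a 2-D feature vector.
rng = np.random.default_rng(1)
train_features = rng.random((50, 2))            # assumed preprocessed network inputs
train_doa = 40 * train_features[:, 0] - 20      # assumed ground-truth mapping
w = rbf_fit(train_features, train_doa, sigma=0.2)
print(rbf_predict(train_features, w, 0.2, np.array([0.5, 0.5])))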
(15)
The training steps of the Kohonen SOM neural network used in this
article are as follows: the first step is network initialization [14, 15].
(16)
where x = [x1, x2, . . . , xm]^T is the training sample vector of the network. Initialize the network weights w_j (j = 1, 2, . . . , K) to be the same as the partially normalized input vector e′.

The second step is to calculate the Euclidean distance between the input vector and the corresponding weight vector ω_j of each competing-layer neuron to obtain the winning neuron ω_c [16, 17]. The selection principle of the winning neuron is as follows:

‖e′ − ω_c‖ = min_j ‖e′ − ω_j‖.    (17)

The third step is to adjust the weights of the winning neuron ω_c and its neighborhood ω_j. The adjustment method is as follows:

ω_j(t + 1) = ω_j(t) + η(t) U_c(t) (e′ − ω_j(t)).    (18)

Among them, η(t) is the learning rate function, which decreases with the number of iteration steps t [18, 19]. The function U_c(t) is the neighborhood function; here it is the Gaussian function

U_c(t) = exp(−‖r_j − r_c‖² / (2σ²)),    (19)

where r_j and r_c are the positions of the neurons in the competition layer on a two-dimensional plane and σ is the smoothing factor, which is a positive constant.
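A compact sketch of these training steps is given below, assuming a square competition-layer grid, a simple norm-based input normalization, and a linearly decreasing learning rate; the grid size and scheduling choices are assumptions, since the text only states that η(t) decreases with t.

import numpy as np

grid, dim = 10, 6                               # competition layer: grid x grid neurons
rng = np.random.default_rng(0)
weights = rng.random((grid * grid, dim))        # weight vectors w_j
positions = np.array([(i, j) for i in range(grid) for j in range(grid)], float)

def train_step(x, t, n_steps, sigma=2.0, eta0=0.5):
    e = x / np.linalg.norm(x)                              # normalised input (assumed), cf. (16)
    c = np.argmin(np.linalg.norm(weights - e, axis=1))     # winning neuron, cf. (17)
    eta = eta0 * (1.0 - t / n_steps)                       # decreasing learning rate eta(t)
    d2 = ((positions - positions[c]) ** 2).sum(axis=1)
    u = np.exp(-d2 / (2 * sigma ** 2))                     # Gaussian neighbourhood U_c(t), cf. (19)
    weights[:] = weights + eta * u[:, None] * (e - weights)  # weight update, cf. (18)

n_steps = 1000
for t in range(n_steps):
    train_step(rng.random(dim), t, n_steps)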
(20)
(2) If the neuron node j is activated by more than one training sample vector, that is, n_j > 1, and the signals corresponding to these samples have direction angles θ_1, θ_2, . . . , θ_{n_j}, then the output of the corresponding node of the second layer of the grid is the average value of the direction angles of these signals [22, 23], namely,

θ_j = (1/n_j) (θ_1 + θ_2 + · · · + θ_{n_j}).    (21)
(3) If the neuron node j has never been activated by any training
sample vector, the corresponding output neuron node is regarded
as an invalid node. When this node is activated by a new input
vector, the output value is defined as the output direction angle of
the valid node closest to this node.
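The mapping from activated nodes to output angles in items (2) and (3) can be sketched as follows; the winners array (the index of the node activated by each training sample) and the node position grid are assumed inputs coming from a trained first layer.

import numpy as np

def build_output_layer(winners, train_aoa, n_nodes, positions):
    """winners[i] = index of the node activated by training sample i."""
    outputs = np.full(n_nodes, np.nan)
    for j in range(n_nodes):
        hits = train_aoa[winners == j]
        if hits.size > 0:
            outputs[j] = hits.mean()                 # item (2): average direction angle
    valid = ~np.isnan(outputs)
    for j in np.where(~valid)[0]:                    # item (3): invalid nodes copy the
        d = np.linalg.norm(positions[valid] - positions[j], axis=1)
        outputs[j] = outputs[valid][np.argmin(d)]    # output of the nearest valid node
    return outputs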
(22)
and the AOA value θ have a consistent trend [24, 25]. In other words, when
the DDOA vectors of two source signals are similar, their arrival direction
angles AOA must also be similar. Therefore, the topological orders and
distributions of DDOA vector and AOA are basically the same.
(23)
At the same time, the sum of the distances between classes can also be
found:
(24)
No Noise
In order to verify the effectiveness of the two-layer SOM neural network
established in this paper for arbitrary array conditions, we conducted a
simulation experiment of detecting the direction of acoustic signals with
arbitrary sensor arrays underwater. Assuming that the sensor array contains
4 sensors, the frequency of a single sound source signal is f = 2 kHz, the
propagation speed of the sound signal in water is c = 1500 m/s, and the
distance between two adjacent sensors is Δi = 0.375 m, which is half the wavelength. The positions of the four sensor array elements are (x1 = 0, y1 = 0), (x2 = 0.3, y2 = 0.225), (x3 = 0.5, y3 = −0.0922), and (x4 = 0.6, y4 = 0.2692). In order to obtain the training vectors, we uniformly collect 60 × 30 points from the rectangular area [−20, 20] × [0, 20] ⊂ R² as the emission positions of 1800 simulated sound source signals, from which 1800 DDOA vectors r can be calculated and input into the network as training vectors.
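The following sketch illustrates this training-data generation. It assumes that a DDOA vector collects the differences between each sensor's propagation distance and that of the first (reference) sensor, and it assumes a particular AOA angle convention; both are assumptions made for illustration rather than the paper's exact preprocessing.

import numpy as np

sensors = np.array([(0.0, 0.0), (0.3, 0.225), (0.5, -0.0922), (0.6, 0.2692)])

xs = np.linspace(-20.0, 20.0, 60)
ys = np.linspace(0.0, 20.0, 30)
sources = np.array([(x, y) for x in xs for y in ys])        # 1800 emission positions

dists = np.linalg.norm(sources[:, None, :] - sensors[None, :, :], axis=2)
ddoa = dists[:, 1:] - dists[:, :1]                          # distance differences w.r.t. sensor 1
aoa = np.degrees(np.arctan2(sources[:, 0], sources[:, 1]))  # assumed AOA convention

print(ddoa.shape, aoa.shape)    # (1800, 3) training vectors and their target angles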
Calculate the value of Rmax(x, y):
(25)
Except for the few points near the origin (0, 0), the function Rmax(x, y)
at most of the remaining points has a common upper bound, which belongs
to the second case.
Noise
In practice, the signal data collected by the sensor array is often noisy, and
the energy of noise is generally large. The signal-to-noise ratio often reaches very low values, even below 0 dB; that is, the
signal is overwhelmed by environmental noise that is much stronger than its
strength. When the signal-to-noise ratio is particularly small, people usually
perform a denoising filtering process artificially in advance to make the
filtered signal-to-noise ratio at least above 0 dB. Therefore, a good model
that can be applied to practice must be applicable to noisy environments.
simulated signals with different AOA values. Calculate the DDOA vectors
corresponding to these simulated signal emission points, and then input
these vectors as test vectors into the trained two-layer SOM neural network.
The output of the network is the corresponding AOA predicted value. The
experimental results are shown in Figure 1.
training and the corresponding AOA value as the target output of the network
training.
As shown in Table 1 and Figure 2, the average absolute value of the AOA prediction error for the noise-free signals in the simulation experiment is approximately 0.1° to 0.4° (the minimum is 0.122° and the maximum is only 0.242°), and for most of the test signals (70% ∼ 80% of them) the absolute prediction error is less than 0.1°.
(Figure: average error and Pr(err < 0.3°) of the AOA prediction plotted against distance values of 4, 8, 12, 16, 32, and 64.)
x      0      5      10     15     20     25     30     35     40
RBF    0.43  −0.05   0.08  −0.16   0.09   6.25  15.64  16.28  17.89
SOM   −0.08   0.04  −0.11   0.03   0.08   0.13   0.01   0.04   0.02
As shown in Figure 3, the two networks both use the same 20 simulated
signals as test signals. The transmission positions of the tested signals are
evenly distributed between 2 meters and 40 meters from the origin, including
the training area, that is, within 20 meters. It can be seen from Figure 4
that the prediction effect of the RBF neural network in the training area is
similar to that of the SOM neural network, but the prediction effect outside
the training area is poor, while the SOM neural network shows strong
adaptability to distance changes. This shows that the RBF neural network
will be affected by the distance factor, because its training principle is to use
the idea of interpolation to fit the mapping relationship, which makes the
error larger when the test data exceeds the training range.
(Figure 3: comparison of the SOM and RBF neural network AOA predictions.)
Figure 4: SOM neural network AOA prediction results in the case of additional Gaussian noise.
Table 3: SOM neural network AOA prediction results in the case of additional Gaussian noise, for AOA values of 0, 5, 10, 15, and 20 degrees.
It can be seen from Figure 4 that when the signal-to-noise ratio drops from 20 dB to 1 dB, the absolute error of the AOA prediction is small and the fluctuations are not large; that is, when the signal-to-noise ratio is greater than 1 dB, the prediction effect of the SOM network we established does not decrease with the decrease of the signal-to-noise ratio, and it has strong adaptability to noise.
(Figure: sound source distance (m), forecast AOA, and absolute error for each test signal.)
(Figure: real, estimated, and expected values plotted against X from 0 to 20.)
Number of categories   Genetic clustering method   K-means clustering method
4                      3.26                        4.61
5                      3.52                        4.78
6                      3.78                        4.98
7                      3.62                        5.18
8                      3.29                        4.72
(Figure: the same comparison of the genetic clustering method and the K-means clustering method, plotted for 4 to 8 categories.)
Table 7: Average absolute value of AOA prediction error under different neuron node arrangements

Node arrangement      20 × 20   25 × 25   30 × 30   35 × 35   40 × 40   45 × 45
Rectangular domain     0.72      0.61      0.57      0.48      0.55      0.62
Circle                 0.51      0.56      0.42      0.37      0.33      0.41
Figure 8: Average absolute value of AOA prediction error under different neu-
ron node arrangements.
CONCLUSIONS
In this paper, a two-layer SOM neural network is used to study the AOA
prediction problem based on DDOA vectors under arbitrary arrays in theory
and simulation experiments. This network is equivalent to a classifier: by classifying DDOA vectors it classifies the corresponding AOA values and thereby predicts the AOA. The established two-layer SOM neural network is further discussed, and the conditions under which the network can feasibly be applied for prediction are given. First, the features used for prediction are identified and formed into the input vector, and the predicted quantity is used as the output of the network.
This method is verified through simulation experiments and actual lake
experiments. From the experimental results, it can be seen that the neural
network trained in advance on simulation data can detect the direction of arrival of the source signal in noise-free, Gaussian white noise, and real noise environments, and the angle estimation performance is good. Finally, we
further compare the prediction effect of this method with the classic MUSIC
algorithm and RBF neural network method. The experimental results show
that the performance of this network is excellent and can be considered for
practice.
This paper applies SOM neural network to the estimation of the direction
of arrival of array signals. It is found through research that the DDOA vector
and AOA in the array signal have similar topological distributions. Based
on this, the SOM neural network is connected with the topological order
to establish a two-layer SOM neural network to estimate the direction of
arrival of the array signal. While the method has a theoretical basis, it also
shows high estimation accuracy in both simulation experiments and lake
water experiments.
ACKNOWLEDGMENTS
This work was supported by the National Natural Science Foundation of
China under Grant 61703280.
REFERENCES
1. X. Li, Y. Wang, and G. Liu, “Structured medical pathology data
hiding information association mining algorithm based on optimized
convolutional neural network,” IEEE access, vol. 8, no. 1, pp. 1443–
1452, 2020.
2. M. A. A. Mamun, M. A. Hannan, A. Hussain, and H. Basri,
“Theoretical model and implementation of a real time intelligent bin
status monitoring system using rule based decision algorithms,” Expert
Systems with Applications, vol. 48, pp. 76–88, 2016.
3. M. F. Hamza, H. J. Yap, and I. A. Choudhury, “Recent advances on
the use of meta-heuristic optimization algorithms to optimize the type-
2 fuzzy logic systems in intelligent control,” Neural Computing and
Applications, vol. 28, no. 5, pp. 1–21, 2015.
4. B. Tom and S. Alexei, “Conditional random fields for pattern
recognition applied to structured data,” Algorithms, vol. 8, no. 3, pp.
466–483, 2015.
5. Y. Chen, W. Zheng, W. Li, and Y. Huang, “The robustness and
sustainability of port logistics systems for emergency supplies from
overseas,” Journal of Advanced Transportation, vol. 2020, Article ID
8868533, 10 pages, 2020.
6. W. Quan, “Intelligent information processing,” Computing in Science
& Engineering, vol. 21, no. 6, pp. 4-5, 2019.
7. X. Q. Cheng, X. W. Liu, J. H. Li et al., “Data optimization of traffic video
vehicle detector based on cloud platform,” Jiaotong Yunshu Xitong
Gongcheng Yu Xinxi/Journal of Transportation Systems Engineering
and Information Technology, vol. 15, no. 2, pp. 76–80, 2015.
8. S. Wei, Z. Xiaorui, P. Srinivas et al., “A self-adaptive dynamic
recognition model for fatigue driving based on multi-source information
and two levels of fusion,” Sensors, vol. 15, no. 9, pp. 24191–24213,
2015.
9. M. Niu, S. Sun, J. Wu, and Y. Zhang, “Short-term wind speed hybrid
forecasting model based on bias correcting study and its application,”
Mathematical Problems in Engineering, vol. 2015, no. 10, 13 pages,
2015.
10. X. Song, X. Li, and W. Zhang, “Key parameters estimation and adaptive
warning strategy for rear-end collision of vehicle,” Mathematical
ABSTRACT
In dynamical systems, local interactions between dynamical units generate
correlations which are stored and transmitted throughout the system,
generating the macroscopic behavior. However a framework to quantify
exactly how these correlations are stored, transmitted, and combined
at the microscopic scale is missing. Here we propose to characterize the
notion of “information processing” based on all possible Shannon mutual
Citation: Rick Quax, Gregor Chliamovitch, Alexandre Dupuis, Jean-Luc Falcone, Bas-
tien Chopard, Alfons G. Hoekstra, Peter M. A. Sloot, “Information Processing Features
Can Detect Behavioral Regimes of Dynamical Systems”, Complexity, vol. 2018, Ar-
ticle ID 6047846, 16 pages, 2018. https://doi.org/10.1155/2018/6047846.
Copyright: © 2018 by Authors. This is an open access article distributed under the
Creative Commons Attribution License, which permits unrestricted use, distribution,
and reproduction in any medium, provided the original work is properly cited.
information quantities between a future state and all possible sets of initial
states. We apply it to the 256 elementary cellular automata (ECA), which
are the simplest possible dynamical systems exhibiting behaviors ranging
from simple to complex. Our main finding is that only a few information
features are needed for full predictability of the systemic behavior and that
the “information synergy” feature is always most predictive. Finally we
apply the idea to foreign exchange (FX) and interest-rate swap (IRS) time-
series data. We find an effective “slowing down” leading indicator in all
three markets for the 2008 financial crisis when applied to the information
features, as opposed to using the data itself directly. Our work suggests that
the proposed characterization of the local information processing of units
may be a promising direction for predicting emergent systemic behaviors.
INTRODUCTION
Emergent, complex behavior can arise from the interactions among (simple)
dynamical units. An example is the brain whose complex behavior as a
whole cannot be explained by the dynamics of a single neuron. In such a
system, each dynamical unit receives input from other (upstream) units
and then decides its next state, reflecting these correlated interactions.
This new state is then used by (downstream) neighboring units to decide
their new states and so on, eventually generating a macroscopic behavior
with systemic correlations. A quantitative framework is missing to fully
trace how correlations are stored, transmitted, and integrated, let alone to
predict whether a given system of local interactions will eventually generate
complex systemic behavior or not.
Our hypothesis is that Shannon’s information theory [1] can be used to
construct, eventually, such a framework. In this viewpoint, a unit’s new state
reflects its past interactions in the sense that it stores mutual information
about the past states of upstream neighboring units. In the next time instant a
downstream neighboring unit interacts with this state, implicitly transferring
this information and integrating it together with other information into its
new state and so on. In effect, each interaction among dynamical units is
interpreted as a Shannon communication channel and we aim to trace the
onward transmission and integration of information (synergy) through this
network of “communication channels.”
In this paper we characterize the information in a single unit’s state at
time t by enumerating its mutual information quantities with all possible sets
of initial unit states (t=0). We generate initial unit states independently for
METHODS
Notational Conventions
Constants and functions are denoted by lower-case Roman letters. Stochastic
variables are denoted by capital Roman letters. Feature vectors are denoted
by Greek letters.
system states are generated by the interacting units and not an artifact of the
initial conditions.
(1)
There are 256 possible transition rules and they are numbered 0 through 255, denoted by the rule number r. As initial state we take the fully random state so that no correlations exist already at t=0; that is, each initial cell state is independently 0 or 1 with probability 1/2, for all r and all i. The evolution of each cellular automaton is fully deterministic for a given rule, implying that the conditional probabilities in (1) can only be either 0 or 1. (This is nevertheless not a necessary condition in general.)
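For concreteness, one update step of an ECA can be sketched as follows; the lattice size, number of steps, and the choice of rule 110 are illustrative assumptions.

import numpy as np

def eca_step(state, rule):
    """Apply one synchronous update of ECA `rule` with periodic boundaries."""
    left, right = np.roll(state, 1), np.roll(state, -1)
    neighborhood = 4 * left + 2 * state + right      # value 0..7 per cell
    rule_table = (rule >> np.arange(8)) & 1          # bit k gives the output for neighborhood k
    return rule_table[neighborhood]

rng = np.random.default_rng(1)
state = rng.integers(0, 2, size=64)                  # fully random initial state
for _ in range(32):
    state = eca_step(state, rule=110)                # rule 110: complex behavior
print(state)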
(2)
(3)
The conditional variant obeys the chain rule H(A, B) = H(B) + H(A|B) and is written explicitly as

H(A|B) = −Σ_{a,b} p(a, b) log p(a|b).    (4)

This denotes the remaining entropy (uncertainty) of A given that the value for B is observed. For intuition it is easily verified that in the case of statistical independence p(a|b) = p(a), which makes H(A|B) = H(A), meaning that B contains zero information about A. At the other extreme, B = A would make H(A|B) = 0 so that I(A:B) = H(A), meaning that B contains the maximal amount of information needed to determine a unique value of A.
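These entropy and mutual information quantities can be computed directly from a joint distribution, as in the short sketch below; the example distribution is an assumption used only to illustrate the two limiting cases discussed above.

import numpy as np

def entropy(p):
    p = p[p > 0]
    return -(p * np.log2(p)).sum()

def conditional_entropy(p_ab):
    """H(A|B) = H(A,B) - H(B) for a joint distribution indexed as p[a, b]."""
    return entropy(p_ab.ravel()) - entropy(p_ab.sum(axis=0))

def mutual_information(p_ab):
    """I(A:B) = H(A) - H(A|B)."""
    return entropy(p_ab.sum(axis=1)) - conditional_entropy(p_ab)

p_ab = np.array([[0.4, 0.1],
                 [0.1, 0.4]])                      # a correlated binary pair
print(mutual_information(p_ab))                    # > 0: B carries information about A
print(mutual_information(np.full((2, 2), 0.25)))   # independence: exactly 0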
(5)
Here denotes the (ordered) power set notation for all subsets of
stochastic variables of initial cell states. (Note though that in practice not
infinitely many initial cell states are needed; for instance, for an ECA at time
t only the nearest 1+2t initial cell states are relevant.) We will refer to
as the sequence of information features of unit i at time t. The subscript
notation implies that the rule-specific (conditional) probabilities
are used to compute the mutual information. We
use the subscript i for generality to emphasize that this feature vector pertains
to each single unit (cell) in the system, even though in the specific case of
ECA this subscript could be dropped as all cells are indistinguishable.
In particular we highlight the following three types of information
features. The “memory” of unit i at time t is defined as the feature
, that is, the amount of information that the unit retains
about its own initial state. The “transfer” of information is defined as
nonlocal mutual information such as . Nonlocal
mutual information must be due to interactions because the initial states are
independent (all pairs of units have zero mutual information). Finally we
define the integration of information as “information synergy,” an active
research topic in information theory [4, 14, 16, 19, 21–23]. The information
synergy in about X0 is calculated here by the well-known whole-minus-
sum (WMS) formula . The WMS measure directly
implements the intuition of subtracting the information carried by individual
variables from the total information. However the presence of correlations
among the would be problematic for this measure, in which case it can
become negative. In this paper we prevent this by ensuring that the are
uncorrelated. In this case it fulfills various proposed axiomatizations for
synergistic information known thus far, particularly PID [14, 15] and SRV
[16].
Information synergy (or “synergy” for short) is not itself a member of
but it is fully redundant given since each of its terms is in
. Therefore we will treat synergy features as separate single features in our
results analysis while we do not add them to .
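The whole-minus-sum construction subtracts the individual mutual information terms from the information carried jointly by a set of variables. A minimal sketch, assuming a binary XOR-like joint distribution with independent inputs (chosen only because it is the textbook example of purely synergistic information), is:

import numpy as np

def entropy(p):
    p = p[p > 0]
    return -(p * np.log2(p)).sum()

def mutual_information(p_xy):
    """I(X:Y) from a joint distribution with Y as the last axis."""
    h_x = entropy(p_xy.sum(axis=-1).ravel())
    h_y = entropy(p_xy.sum(axis=tuple(range(p_xy.ndim - 1))))
    return h_x + h_y - entropy(p_xy.ravel())

# Y = X1 XOR X2, with X1, X2 independent fair coins; joint distribution p[x1, x2, y].
p = np.zeros((2, 2, 2))
for x1 in (0, 1):
    for x2 in (0, 1):
        p[x1, x2, x1 ^ x2] = 0.25

total = mutual_information(p)                                   # I(X1, X2 : Y)
individual = mutual_information(p.sum(axis=1)) + mutual_information(p.sum(axis=0))
print(total - individual)                                       # whole-minus-sum synergy = 1 bit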
Note that a normalized predictive power of, say, 0.75 does not necessarily
mean that 75% of the rules can be correctly classified. Our definition yields
merely a relative measure where 0 means zero predictive power, 1 means
perfect prediction, and intermediate values are ordered such that a higher
value implies that a more accurate classification algorithm could in principle
be constructed. The benefit of our definition based on mutual information is
that it does not depend on a specific classifier algorithm; that is, it is model-
free. Indeed, the use of mutual information as a predictor of classification
accuracy has become the de facto standard in machine learning applications
[25, 26].
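A sketch of such a model-free predictive power measure is given below; the normalization by the class entropy H(C), which maps the measure onto the 0-to-1 range described above, and the toy discretised data are assumptions for illustration.

import numpy as np
from collections import Counter

def entropy(counts):
    p = np.array(list(counts.values()), float)
    p /= p.sum()
    return -(p * np.log2(p)).sum()

def normalized_predictive_power(features, classes):
    """Assumed definition: I(feature : class) / H(class), from empirical counts."""
    h_class = entropy(Counter(classes))
    h_feat = entropy(Counter(features))
    h_joint = entropy(Counter(zip(features, classes)))
    return (h_feat + h_class - h_joint) / h_class

features = [0, 0, 1, 1, 2, 2, 2, 0]      # e.g. a discretised synergy value per rule (toy data)
classes = ["simple", "simple", "chaotic", "chaotic", "complex", "complex", "chaotic", "simple"]
print(normalized_predictive_power(features, classes))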
(6)
Their concatenation makes the extended ordered feature set, now written
in the form of stochastic variables:
(7)
The extended feature set has no additional predictive power
compared to , so for any inference task and are equivalent.
That is, the synergy features are completely redundant given
since each of its terms is a member of . The reason for adding them
separately to form is that they have a clear meaning as information
which is stored in a collection of variables while not being stored in any
individual variable. We are interested to see whether this phenomenon plays
a significant role in generating dynamical behaviors.
We define the first principal feature at time t as maximizing its individual
predictive power, quantified by a mutual information term as explained
before, as
(8)
Here, again rule number R is treated as a uniformly random stochastic
variable with which in turn makes
(9)
features whose sum will then dominate the synergy feature. We remedy
this by rescaling the sum of the memory and transfer features which are
subtracted in (6) to equal the average value of the total information (positive
term in (6)). In formula, a constant c is inserted into the WMS formula,
leading to
(10)
for a given set of initial cell states S. c is fitted such that this WMS measure
is on average 0 for all sliding windows over the dataset. This rejects the
cointegration null-hypothesis between total information and the subtracted
term at the 0.05 significance level in this dataset. This
results in the synergy feature being distributed around zero and being
independent of the sum of the other two features so that it may functionally
be used as part of the feature space for feature selection; however, the value
itself should not be trusted as quantifying precisely the notion synergy.
RESULTS
Figure 1: The 256 rules plotted in the three-dimensional space of the information features memory (M), transfer (T), and synergy (S).
Each dot corresponds to a rule r and is color-coded by its Wolfram class c(r),
namely, black for the simple homogeneous (24) and periodic behaviors (196),
green for complex behavior (10), and red for chaotic behavior (26). The trans-
parency of a point indicates its distance away from the viewer, with more trans-
parent points being farther away. A small random vector with average norm 0.02
is added to each point in order to make rules with equal information features
still visible. The gray plus signs are the projections of the 3D points on the two
visible side faces (the S-M face is occluded) for better visibility of the positions
of the points.
Figure 2: The thin black line with error bars is the "base line" predictability for n features,
obtained by randomizing the pairing of 256 information features with their class
identifier. The error bar indicates the 95% confidence interval of the distribution
of predictive power under the null-hypothesis of zero correlation between infor-
mation features and class identifier. Finally, the small black markers indicate the
predictive powers of all other information feature sets.
For the second time step (Figure 2) we again find that the most predictive
information feature is synergy. An intriguing difference however is that it is
now significantly more predictive at 0.90. This means that already at t=2 there
is a single information characteristic of dynamical behavior (i.e., synergy),
which explains the vast majority of the entropy of the behavioral
class that will eventually be exhibited. A second intriguing difference is that
the maximum predictive power of 0.98 is now achieved using only 3 out of
57 possible information features, where 4 features were needed at t=1.
Finally, for t=3 we find that only 2 information features are needed to
achieve the maximum possible predictive power of 1.0; that is, the values
for these two features uniquely identify the behavior class. Firstly this
confirms the apparent trend that fewer information features capture more of
the relevant dynamical behavior as time t increases. Secondly we find again
that synergy is the single most predictive feature. In addition, we find again
that the best secondary feature is a peculiar combination of memory and the
two longest-range transfers, as in t=2. Including the intermediate transfers
(so adding I1111111 instead of I1001001 as second feature) actually only
slightly convolutes the prediction: adding them in t=2 reduces predictive
power by 0.028, whereas in t=3 it does not reduce the predictive power at
all. In t=1 there are no intermediate transfers possible since there are only
three predecessors of a cell’s state, and apparently then it pays off to leave
out memory (which would reduce power by 0.025 if added).
One could argue that the quick separation of the points in information
space is hardly surprising because a high-dimensional space is used to
separate only a small number (256) of discrete points. To validate that the
predictive power values of the information features are indeed meaningful
we also plot the expected “base line” prediction power in each subfigure in
Figure 2 along with the 95% confidence interval. The base line is the null-
hypothesis formed by randomizing the pairing of information feature values
with class identifiers; that is, it shows the expected predictive power of having
the same number and frequencies of feature values but sampled with zero
correlation with the classification to make the separability meaningless. This
(Figure 3: clustering of the 256 rules based on their information features, shown in two panels (a) and (b); the scale bar indicates 1 nat.)
The clustering broadly agrees with the Wolfram classification, with chaotic and complex rules all lying on the same large spray. However
the agreement is far from perfect. For instance, the spray bearing chaotic
and complex rules also bears periodic rules. Note also that rules 60, 90,
105, and 150 are indistinguishable when considered from the information
processing viewpoint, even though they exhibit chaotic patterns that can be
visually distinguished from each other. On the contrary, rules 106 and 154
are very close to each other and the pattern they exhibit indeed shows some
similarities, but the former is complex while the latter is periodic.
Note that using this clustering scheme all rules converging to a uniform
pattern, but one, are close to each other in the information features space.
The remaining one, rule 168, has a transient regime which is essentially
dominated by a translation of the initial state. This unexpected behavior is
due to rare initial conditions (e.g., …110110110…) that are present in our
exact calculation with the same weight as all other initial conditions but have
a strong impact on the information processing measure. This translational
regime can be found as well in rules 2 and 130, which are classified in
the same subspray as rule 168. The similarity of any information feature
(information transfer in this case) can thus lead to rules whose behavior
differs in other respects to get classified similarly.
Figure 4: 200 time points for the FX market showing the progression of the three information features memory (M), transfer (T), and integration (S) computed with a time delay
of 1 day (similar to t=1 for ECA). The color indicates the time difference with
September 15, 2008 (big black dot), which we consider the starting point of
the 2008 crisis, from dark blue (long before) to dark red (long after) and white
at the crisis date. The data spans from January 1, 1999, to April 21, 2017; the
large green dot is the last time point also present in the IRS data in 2011. In this
information space we clearly observe signs of two attractor regimes separated
by a sudden regime shift. Mutual information is calculated using a sliding
window of w=1400 days; the 200 windows partially overlap and are placed
uniformly over the dataset, where the first and last window include the first and
last day of the dataset, respectively. The gray plus signs are the projections of
the 3D points on the visible side faces for better visibility of the positions of
the points.
Interestingly, this behavior resembles to some extent the dynamics
observed for the so-called tipping points [31] where a system is slowly
pushed to an unstable point and then “over the hill” after which it progresses
quickly “downhill” to the next attractor state. This is relevant because slow
progressions to tipping points offer a potential for developing an early-
warning signal.
Figure 5: 200 time points for the EUR and USD IRS markets showing the progression of the three information features memory (M), transfer (T), and synergy (S) computed with a time delay
of 1 day (similar to t=1 for ECA). The color indicates the time difference with
September 15, 2008 (big black dot), which we take as the starting point of the
2008 crisis, from dark blue (long before) to dark red (long after) and white
at the crisis date. The data spans more than twelve years: the EUR data from
January 12, 1998, to August 12, 2011, and the USD data from April 29, 1999,
to June 6, 2011. Mutual information is calculated using a sliding window of
w=1400 days; the 200 windows partially overlap and are placed uniformly over
the dataset, where the first and last window include the first and last day of the
dataset, respectively. The gray plus signs are the projections of the 3D points on
the visible side faces for better visibility of the positions of the points.
Yet another possible but hypothetical explanation for this is that the
IRS markets could have been (part of) a slow but steady driving factor in
the global progression to the crisis, perhaps even building up a financial
“bubble,” whereas the FX market may have been more exogenously
forced toward their regime shift from one attractor to another. Indeed, the
progression to the 2008 crisis is often explained by referring at least to
substantial losses in fixed income and equity portfolios followed by the US
subprime home loan turmoil [32], suggesting at least a central role for the
trade of risks concerning interest rates in USD. The exact sequence of events
leading to the 2008 crisis is however still debated among financial experts.
Our numerical analyses may nevertheless help to shed light on interpreting
the relative roles of different markets.
In any case, in the EUR plot we observe that a steady and fast progression
is followed as well by a short "noisy stationary" period where there seems
to be no general direction, after which a new and almost orthogonal direction
is followed after the crisis. The evolution after the crisis is much more noisy,
in the form of larger deviations around the general direction. In the USD we
do not observe a brief stationary phase before the crisis, but we do observe
larger deviations as well around the general directions, mostly sideways
from the viewpoint of this plot. The market does contain two directional
changes but these do not occur closely around the crisis point. We do not
speculate here about their possible causes.
We aim to generalize upon the idea of the variance indicator [34] which
appears the most feasible candidate for multidimensional time-series. In
contrast, computing critical slowing down involves computing correlations,
which requires a large, combinatorially increasing amount of data as the
number of dimensions grows. In short, the idea of the variance indicator is that
prior to a tipping point the stability of the current attractor decreases, leading
to larger variation in the system state, until the point where the stability
is sufficiently low such that natural variation can “tip” the system over to
a different attractor. This indicator is typically applied to one-dimensional
system states, such as species abundance in ecosystems or carbon dioxide
concentrations in climate, where the behavior in each attractor is assumed to
be (locally) stationary.
A natural generalization of variance (or standard deviation) to higher
dimensions is the average centroid distance: the average Euclidean distance
of a set of points to their average (centroid). Since the centroid distance also
increases when there is a directed general trend, which we wish to consider
as natural behavior, we divide by the distance traversed by this general trend.
The result in words is then the average centroid distance per unit length of
trend. That is, for a sequence of state vectors x_{t−ℓ+1}, . . . , x_t, in our case information features, our indicator is defined as

w_t = [ (1/ℓ) Σ_{i=t−ℓ+1}^{t} ‖x_i − x̄_t‖ ] / ‖x_t − x_{t−ℓ+1}‖,    (11)

where x̄_t is the centroid (average) of the ℓ state vectors in the window. Here, ℓ is the number of data points up to time t used in order to compute the indicator value at time t. Ideally, ℓ should typically be as low as possible in order to provide an accurate local description of the system's stability near time t, but not too low such that mostly noise effects are measured and/or the general trend cannot be distinguished effectively. To further filter out noise effects and study the indicator progression on different time scales, we use an averaging sliding window of g preceding indicator values to finally compute the indicator value at time t; that is,

W_t = (1/g) Σ_{j=0}^{g−1} w_{t−j}.    (12)
Note that using these two subsequent sliding windows (ℓ and g) is not equivalent to simply increasing ℓ by g and then not averaging. To illustrate,
(Figure 6: panels show the normalized centroid distance for the IRS (EUR and USD) and FX datasets plotted against trade days since the Lehman Brothers bankruptcy.)
Figure 6: The proposed instability indicator (12) calculated for all three datas-
ets. (a) is for small sliding window size (g=10) and (b) for a long sliding win-
dow size (g=50), showing the indicator both on short and long time scales. The
indicator is computed from the sequence of 200 information feature vectors in
Figures 4 and 5. The sliding window size (g) is illustrated by the gray bar in the
top left of the right-hand panels. The first g-1 indicator values are averages of
fewer than g preceding values. One year corresponds to about 250 trade days.
Note that the x-axes left and right are different. ℓ = 10.
Note that the w indicator values have an intuitive interpretation. A value
of 1/4 means that the multivariate time-series progresses in a perfectly
straight line with uniform spacing. At another extreme, if the points are
perfectly distributed in a symmetrically round cloud around the initial
point, then w tends to unity on average. If there is a directed trend but the
orthogonal deviations are larger than the trend vector, then .
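A minimal sketch of this indicator, assuming the reconstruction of (11) and (12) given above and purely illustrative window sizes and data, is:

import numpy as np

def centroid_distance_indicator(points):
    """Average centroid distance per unit length of trend, cf. (11)."""
    centroid = points.mean(axis=0)
    spread = np.linalg.norm(points - centroid, axis=1).mean()
    trend = np.linalg.norm(points[-1] - points[0])
    return spread / trend

def smoothed_indicator(series, ell=10, g=50):
    """Apply (11) over sliding windows of length ell, then average g values, cf. (12)."""
    raw = np.array([centroid_distance_indicator(series[t - ell:t])
                    for t in range(ell, len(series) + 1)])
    return np.array([raw[max(0, k - g + 1):k + 1].mean() for k in range(len(raw))])

# Toy example: a 3-D information-feature trajectory (memory, transfer, synergy).
rng = np.random.default_rng(0)
trend = np.linspace(0, 1, 200)[:, None] * np.array([1.0, 0.5, 0.2])
series = trend + 0.01 * rng.standard_normal((200, 3))
print(smoothed_indicator(series)[-5:])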
It is common to study instability indicators at a larger time scale in
order to detect or even predict the largest events, ignoring (smoothing out)
smaller events. In particular the hope is to find a leading indicator which
could be used to anticipate the 2008 onset of recession. We show the same
indicator but now averaged over g=50 values in Figure 6(b). Remarkably, all
three datasets show a discernible, long-term steady growth in the instability
indicator leading up and through the crisis date. For the EUR and FX curves
this growth starts around two years before the crisis; for the USD curve the
growth starts at the start of the curve. Although here the initial peak in the
EUR curve appears to even outweigh the crisis-time peak, we must note
here that this peak is subject to less smoothing as there are fewer than g
values to the left available for averaging (compare with the sliding window
size depicted in gray); with further averaging all other peaks will continue
to decrease in height (cf. Figure 6(a)), whereas this initial peak will remain
roughly at the same value for this reason.
We will now discuss two significant additional strong peaks observable
in the indicator curves: an initial peak in EUR (around August-September
2004) and a late peak in FX (mid-2016). We caution that it is hardly scientific
to reason back from observed peaks toward potential underlying causes,
especially for continually turbulent systems such as the financial markets
where events are easy to find. Nevertheless it is important to evaluate
whether the additional peaks at least potentially could indicate substantial
systemic instabilities, or whether they appear likely to be false positives.
For the EUR initial peak we refer to ECB’s Euro Money Market Study,
May 2004 report. We find that this report has indeed an exceptionally negative
sentiment compared to other years, speaking of “declining and historically
low interest rates,” an inverted yield curve, “high geopolitical tensions in
the Middle East and the associated turbulence in oil prices and financial
markets,” and “growing pessimism with regard to economic growth in the
euro area.” Also: “The ECB introduced some changes to its operational
framework which came into effect starting from the main refinancing
operation conducted on 9 March 2004.” In contrast, the subsequent report
(2006) is already much more optimistic: “After two years of slow growth,
the aggregated turnover of the euro money market expanded strongly in
the second quarter of 2006. Activity increased across all money market
segments except in cross-currency swaps, which remained fairly stable.”
We deem it at least plausible that the initial EUR indicator peak, which has
about half the height of the after-crisis peak, is a true positive and detects
indeed a period of increased systemic instability or stress.
For the more recent FX peak across 2016 we must refer to news articles
such as in Financial Times [38, 39]. Firstly there were substantial and largely
unanticipated political shifts, including Brexit (dropping Sterling by over
20%) and the election of Trump as US President. At the same time, articles
mention fears about China’s economic growth slowing down. Lastly, as
interest rates affect associated currencies: “By August, global [bond yield]
benchmarks were at all-time lows, led by the 10-year gilt yielding a paltry
0.51 per cent, while Switzerland’s entire bond market briefly traded below
zero. […] The universe of negative yielding debt had swollen to $13.4tn.”
For example, earlier in the year (January 29), the Bank of Japan unexpectedly
started to take their interest rates into the negative for the first time, affecting
the Yen. Questions toward another recession are mentioned, although also
discarded. All in all, we deem it at least plausible that the FX market’s
indicator peak in 2016 could be caused indeed by systemic instability and
stress, that is, a true positive.
All in all we deem the proposed “normalized centroid distance” instability
indicator as a high potential candidate for multivariate, nonstationary time-
series. Secondly we argue that parameterizing a market state in terms of
information features instead of the original observations (interest or exchange
rates) is useful and enables detecting growing systemic instability. However
we must caution that our financial data only contains one large-scale onset
of recession (2008), so it is difficult to provide conclusive validation that
such events are detected reliably by the proposed indicator. Future work
may include applying the indicator to different simulated systems which can
be driven toward a large-scale regime shift.
DISCUSSION
Our working assumption is that dynamical systems inherently process
information. Our leading hypothesis is that the way that information is
locally processed determines the global emergent behavior. In this article
we propose a way to quantitatively characterize the notion of information
processing and assess its predictive power of the Wolfram classification of
the eventual emergent behavior of ECA. We also make a “leap of faith”
to real (financial) time-series data and find that transforming the original
time-series to an information features time-series enables detection of the
2008 financial crisis by a simple leading indicator. Since it is known that the
original data does not permit such detection, this suggests that novel insights
may be gained even in real data of complex systems, despite not obeying
the ideal conditions of our ECA approach. This warrants a further systemic
study into this notion of information processing in different types of models
and eventually datasets.
Our formalization builds upon Shannon’s information theory, which
means that we consider an ensemble of state trajectories rather than a single
trajectory. That is, we do not quantify the information processing that occurs
during a particular, single sequence of system states (attempts to this end are pursued by Lizier et al. [40]). Rather, we consider the ensemble of all
possible state sequences along with their probabilities. One way to interpret
this is that we quantify the “expected” information processing averaged over
multiple trajectories. Another way to interpret it is that we characterize a
features may still be able to detect changes in the correlations over time,
despite not knowing what is the root cause of these changes. In fact, a
primary driver behind our approach is indeed the abstraction of physical or
mechanistic details while still capturing the emergence of different types
of behaviors. We consider our results in the financial application promising
enough to warrant further study into information processing features in
complex system models and other real datasets. Our results suggest tipping
point behavior for the FX and EUR IRS markets and a possible driving role
for the USD IRS market.
All in all we conclude that the presented information processing concept
appears indeed to be a promising direction for studying how dynamical
systems generate emergent behaviors. In this paper we present initial results
which support this. Further research may identify concrete links between
information features and various types of emergent behaviors, as well as
the relative impact of the interaction topology. Our lack of understanding
of emergent behaviors is exhibited by the ECA model: it is arguably the
simplest dynamical model possible, and the choice of local dynamics
(rule) and initial conditions fully determine the emergent behavior that is
eventually generated. Nevertheless even in this case no theory exists that
predicts the latter from the former. The information processing concept may
eventually lead to a framework for studying how correlations behave in
dynamical systems and how this leads to different emergent behaviors.
ACKNOWLEDGMENTS
Peter M. A. Sloot and Rick Quax acknowledge the financial support of
the Future and Emerging Technologies (FET) Programme within Seventh
Framework Programme (FP7) for Research of the European Commission,
under the FET-Proactive grant agreement TOPDRIM, no. FP7-ICT-318121.
All authors also acknowledge the financial support of the Future and
Emerging Technologies (FET) Programme within Seventh Framework
Programme (FP7) for Research of the European Commission, under the
FET-Proactive grant agreement Sophocles, no. FP7-ICT-317534. Peter M.
A. Sloot acknowledges the support of the Russian Scientific Foundation,
Project no. 14-21-00137.
REFERENCES
1. T. M. Cover and J. A. Thomas, Elements of Information Theory, vol. 6,
John Wiley & Sons, 1991.
2. C. G. Langton, “Computation at the edge of chaos: phase transitions
and emergent computation,” Physica D: Nonlinear Phenomena, vol.
42, no. 1–3, pp. 12–37, 1990.
3. P. Grassberger, “Toward a quantitative theory of self-generated
complexity,” International Journal of Theoretical Physics, vol. 25, no.
9, pp. 907–938, 1986.
4. J. T. Lizier, M. Prokopenko, and A. Y. Zomaya, “Information
modification and particle collisions in distributed computation,”
Chaos: An Interdisciplinary Journal of Nonlinear Science, vol. 20, no.
3, Article ID 037109, 2010.
5. J. T. Lizier, M. Prokopenko, and A. Y. Zomaya, “The information
dynamics of phase transitions in random boolean networks,” in
Proceedings of the 11th International Conference on the Simulation
and Synthesis of Living Systems: Artificial Life XI, ALIFE 2008, pp.
374–381, 2008.
6. R. D. Beer and P. L. Williams, “Information processing and dynamics
in minimally cognitive agents,” Cognitive Science, 2014.
7. E. J. Izquierdo, P. L. Williams, and R. D. Beer, “Information flow
through a model of the C. elegans klinotaxis circuit,” https://arxiv.org/
ftp/arxiv/papers/1603/1603.03552.pdf.
8. Y. Bar-Yam, D. Harmon, and Y. Bar-Yam, “Computationally tractable
pairwise complexity profile,” Complexity, vol. 18, no. 5, pp. 20–27,
2013.
9. B. Allen, B. C. Stacey, and Y. Bar-Yam, “An Information-Theoretic
Formalism for Multiscale Structure in Complex Systems,” https://
arxiv.org/abs/1409.4708v1.
10. R. Quax, A. Apolloni, and P. M. A. Sloot, “The diminishing role of
hubs in dynamical processes on complex networks,” Journal of the
Royal Society Interface, vol. 10, no. 88, 2013.
11. R. Quax, D. Kandhai, and P. M. A. Sloot, “Information dissipation as
an early-warning signal for the Lehman Brothers collapse in financial
time series,” Scientific Reports, vol. 3, article no. 1898, 2013.