Open Data
Contributors
Monika M. Wahi, Natasha Dukach, Omer Hassan Abdelrahman, Farah Jemili, Hajer Bouras, Vijayalakshmi
Kakulapati, Kannadhasan Suriyan, Nagarajan Ramalingam
Individual chapters of this publication are distributed under the terms of the Creative Commons
Attribution 3.0 Unported License which permits commercial use, distribution and reproduction of
the individual chapters, provided the original author(s) and source publication are appropriately
acknowledged. If so indicated, certain images may not be included under the Creative Commons
license. In such cases users will need to obtain permission from the license holder to reproduce
the material. More details and guidelines concerning content reuse and adaptation can be found at
http://www.intechopen.com/copyright-policy.html.
Notice
Statements and opinions expressed in the chapters are those of the individual contributors and not
necessarily those of the editors or publisher. No responsibility is accepted for the accuracy of
information contained in the published chapters. The publisher assumes no responsibility for any
damage or injury to persons or property arising out of the use of any materials, instructions, methods
or ideas contained in the book.
Open Data
Edited by Vijayalakshmi Kakulapati
p. cm.
Print ISBN 978-1-83968-315-2
Online ISBN 978-1-83968-316-9
eBook (PDF) ISBN 978-1-83968-317-6
We are IntechOpen, the world's leading publisher of Open Access books. Built by scientists, for scientists.
Delivered to 156 countries • Top 1% most cited scientists • 12.2% contributors from top 500 universities
Indexed in the Book Citation Index (Clarivate Analytics)
Preface XIII
Section 1
Fundamental Aspects of Open Data 1
Chapter 1 3
Knowledge Extraction from Open Data Repository
by Vijayalakshmi Kakulapati
Chapter 2 21
Open Government Data: Development, Practice, and Challenges
by Omer Hassan Abdelrahman
Chapter 3 37
Framework to Evaluate Level of Good Faith in Implementations
of Public Dashboards
by Monika M. Wahi and Natasha Dukach
Section 2
Case Studies of Open Data 57
Chapter 4 59
Intrusion Detection Based on Big Data Fuzzy Analytics
by Farah Jemili and Hajer Bouras
Chapter 5 73
Artificial Intelligence and IoT: Past, Present and Future
by Kannadhasan Suriyan and Nagarajan Ramalingam
Preface
This book describes how retrieved data can improve different learning qualities
of digital networking, particularly performance and reliability. The book also
describes developing artificial intelligence (AI) and machine learning or related
models, knowledge acquisition problems, and feature assessment by incorporating
data sources (blogs, search query logs, document collection) as well as interactive
data (images, videos, and their explanations, multi-channel handling data).
The search query log created by user interaction with the POD repository is a good source of knowledge. The data in the search query log is generated by users who interact with online communities. The concept is already understood through economic models in specific sectors, for example the telecom sector, where prices are appropriately designed and implemented. However, there is a significant gap between recently evolved extraction methodologies for POD repositories and their applicability across numerous organizational processes.
This chapter analyzes how researchers retrieve data from POD repositories. The
increase in the number of affluent online platforms, social media, and collabora-
tively related web resources has amplified the evolution of socio-technical systems,
resulting in domains that demonstrate both the conceptual model of the required
system approaches and the collaborative form of their participants. We analyze POD repositories at an impressive scale and retrieve information from query log data to investigate these factors' effects. This investigation aims to maximize the quality of a
POD repository from a new perspective. First, we offer a unique query recommender
system that can help consumers reduce the length of their querying operations.
The goal is to discover methods that will allow users to engage with the open data
repository quickly and with fewer requests.
This chapter focuses on the principle of open data, emphasizing Open Government Data (OGD). It discusses the context and features of OGD, examines its benefits, and identifies the perceived risks, barriers, obstacles, and challenges.
Chapter 3: “Framework to Evaluate Level of Good Faith in Implementations
of Public Dashboards”
Public dashboards (PDs) must be measured by how well they satisfy users' needs. This chapter provides a methodology for assessing the level of good faith in contracting parties' development of PDs. It begins by looking at the problems governments face when sharing data in good faith, even though OGD laws and regulations are being implemented worldwide. The chapter provides a use case in which the authors investigate a PD in their environment that appears to be adopting OGD but is not doing so in good faith, and it designs a corresponding evaluation approach.
This chapter examines how artificial intelligence (AI) can assist healthcare
practitioners in making optimal treatment decisions. The increased usage of
patient records and the development of big data analysis techniques have resulted
in reliable and efficient applications of AI in health services. As guided by proper
diagnostic inquiries, advanced AI algorithms may find critical clinical data in
massive data, assisting in treatment decisions. The chapter also discusses the
Internet of Things (IoT), a system that connects physical devices to the Internet
using near field communication (NFC) and wireless sensor networks (WSNs).
Vijayalakshmi Kakulapati
Sreenidhi Institute of Science and Technology,
Department of Information Technology,
Yamnampet, Ghatkesar, Hyderabad, Telangana, India
Section 1
Fundamental Aspects
of Open Data
Chapter 1
Knowledge Extraction from Open Data Repository

1. Introduction
Presently, people rely on social media, whose vast and diverse wealth has progressively penetrated every area of human life. Increasingly, individuals prefer to spend valuable time on social media to develop significant social and entertainment communities, and they communicate with each other so often that the interaction around them is robust. POD repository analytics is perhaps the most commonly used scientific and commercial approach for investigating the interpersonal, organizational, and corporate links of social media. The need for solid knowledge in POD analysis has lately increased with the ready availability of computational power and the rise of popular social networking platforms such as LinkedIn, Twitter, Netlog, and more.
We study the Twitter social network by analyzing the contents of tweets and the links between tweets to extract knowledge from log data. The Twitter review begins by selecting buzzwords, for example around a socio-economic problem in India, and then collecting all Twitter posts (tweets) correlated with the keywords. We mine the query logs of social networks such as Facebook and Twitter; study and address the discovery, access, and citation of POD repositories such as Twitter datasets; and consider strengthening the educational programs of current and future generations of academics specializing in such areas. This is an auspicious time for extracting useful information from social media query log data. Substantial efforts to decipher large amounts of data are steps towards integrating complete search log records into POD repository analysis. From these datasets, we extract valuable knowledge.
The search log obtained from user actions with the Public Open Data (POD) database is an excellent data collection for improving its efficiency and the effectiveness of the online community. The data in the user input logs are gathered from individuals who communicate on online platforms. Search log assessment is complicated by the variety of customers and the diversity of resources. As a result, numerous scientific articles have been written about query log analysis.
The term "data set" can also describe the data in a set of specifically related tables that correlate to a specific investigation or occurrence. Records generated by satellites testing hypotheses using devices aboard communications satellites are one instance of such a category. A data source is the standard unit for data provided in a POD repository in the open data domain. Over a quarter of a million datasets are gathered on the European Open Data portal. Alternative interpretations have been presented in this area, although there is presently no agreed definition. Various difficulties (relevant data resources, non-relational datasets, etc.) make reaching a consensus more challenging. The utilization of query logs for knowledge discovery improves the speed of POD repositories and improves the use of open data source capabilities.
We perform POD repository analysis and mining to extract valuable knowledge from query log data. We work on knowledge discovery, ML and similar approaches, challenges connected to pre-processing and model assessment for data sets (web usage log files, query logs, collections of documents), and collaborative data (images, videos and their explanations, multi-channel handling data). We summarize the fundamental results concerning query logs: analyses, procedures used to retrieve knowledge, the outstanding results, the most practical applications, and the open issues and possibilities that remain to be studied. We discuss how the retrieved knowledge can be utilized to improve different social media quality features, mainly effectiveness and efficiency.
In addition, several concurrent inquiries from multiple distinct users are addressed by business social networks. The query stream is simultaneously defined by a stop-time rate, making it impossible for the POD repository to handle massive query loads without over-sizing [1]. Prior work on web search engine query logs includes social network analysis [2] and quality approaches developed for query logs.
1.1 Motivation
1. Retrieving the collection of Twitter tweets that match one or more content keywords of the user query.
These challenges can be solved using query expansion and semantic models. In query expansion, the query is reformulated based on the vocabulary mismatch between the query and the retrieved content. Through semantic models, words similar to those of the user query are extracted.
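As an illustration of the query-expansion idea, the sketch below expands a user query with related terms drawn from a small hand-built synonym table standing in for a learned semantic model; the vocabulary, synonyms, and example query are hypothetical and not taken from this chapter.

```python
# Illustrative sketch only: toy query expansion with a hand-built synonym table
# standing in for a learned semantic model.
from typing import Dict, List

SYNONYMS: Dict[str, List[str]] = {
    "film": ["movie", "cinema"],
    "price": ["cost", "fare"],
}

def expand_query(query: str, synonyms: Dict[str, List[str]] = SYNONYMS) -> List[str]:
    """Return the original query terms plus semantically related terms."""
    expanded: List[str] = []
    for term in query.lower().split():
        expanded.append(term)
        expanded.extend(synonyms.get(term, []))
    return expanded

print(expand_query("film price"))  # ['film', 'movie', 'cinema', 'price', 'cost', 'fare']
```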
1.3 Objectives
2. Literature survey
Social media connects individuals to each other in various ways, such as web-based gaming, marking, earning, and socializing, providing effective technologies to collaborate and communicate that were unthinkable only recently. In addition, online communities contribute to corporate strategy, assist in altering economic models and sentiments, and introduce various opportunities for studying direct intervention and collaborative actions.
Several previous researchers suggested using Internet search query records to derive linguistic relations between queries or terms. The idea is that web search query logs provide knowledge via clicks, which confirm the relation between searches and the records selected by individuals. The authors relate queries and words in the gathered information based on these data. This technique has also been used to group requests from log files. The cross-referenced text is linked by similarities depending on query information, proximity editing, and hierarchical resources to identify better clusters. Such clusters are utilized to discover identical queries for querying systems.
Twitter is a social network with a massive amount of information. To perform an analysis on Twitter, a keyword-based search for possible and relevant posts is made [8], where the search keywords cover all the possible tweets of the user [9], which is a lengthy and time-consuming process. Typically, to reduce the complexity of searching the posts from a Twitter data source, user search keyword identification is performed [10] to reduce the manual effort. User search keyword extraction is developed on the target keywords instead of the general word phrase of the selected keyword. This keyword extraction process is iterative because of the user's regular interaction with the social network through web search and advertising. So there is a need for a query recommender system.
2. Innovative approaches are essential for the data analysis of numerous con-
tributing inputs, for example, contextual performance, clinical development
research findings, previous warnings, incidence history.
3. Novel approaches require decentralized systems where the related items are searched and the reliability of the similarity needs to be measured.
A mining-methods workflow contains a group of mining data and models, and most of the data operator's work is to set the parameters of the mining model used. In mining, the data is not expressed in direct form; it is hidden in the model connectors. The user provides data in indirect form and applies a model to it, generating the direct form of the data. During this process, mining techniques should distinguish between their components: the data model, the operators, and the parameters. This enables the user to design such mining for the web. There is a need to develop online data workflows through concepts and categories. Online data refers to frequent visits to one or several web pages in social networks to gather all the information the user requires by locating the web page and fetching the valid information the user desires.
Web pages are fetched in an application-specific way: the user's desired target information is retrieved through user-defined keywords using a constrained, specific web application that provides up-to-date information through Online Social Networks (OSNs). OSNs protect the knowledge of billions of active and passive web users. The rapid change in social networking sites has produced exponential growth in user information and in the rate of knowledge exchange. According to [18], two-thirds of online users browse a social network or an e-commerce website, accounting for an average of 10% of all internet utilization time. By covering such a large amount of helpful information exchange, OSNs and social media have become an excellent platform for mining techniques and research in data analysis.
The method allows data on social media user tweets about goods and commodities to be analyzed. In [19], a RES approach is recommended that gives a higher level of precision compared to the previous approaches used in the assessment of consumer tweets. Fasahte et al. [20] have presented a method to anticipate tweets by utilizing the Online Reviews dataset sorting procedures. It investigates the search engine extraction and training algorithm to collect data from the unstructured text in the available online content. In addition to the keyword-based evaluation, the data model on the Internet is connected with complex searches. These are utilized to locate tweets on various topics while maintaining the surfing data operational inside the account location. Data collection, data processing, and data sampling are the three aspects of tweet availability. Verma et al. [21] developed a dynamic analysis classification technique by implementing ML and evaluated the different variables in these learning approaches. A public repository response assessment technique [22] discussed huge data volumes on Twitter to derive the emotional state of every message. Rosenthal et al. [23] describe a user opinion mining system used to extract similar users' views from a person's view using a moderate data analysis method. Ibrahim et al. [24] established an online emotional assessment that supplied many functional tweets of interest to identify comparable personal data. The approach covers feature extraction, extensive conversion, and recognition using machine learning techniques in many tweet solutions for clustering, correlating the query response pattern, relationship rules for Twitter tweet extraction, and visualization in the Tweet API application.
Phrase retrieval from several texts is provided by [25–27], since the words should be user-specific and the searching procedure should be preserved. Given the power of conventional methods for all of these, it is possible to investigate a method that relies on recommendation and is utilized iteratively in search-engine and advertising searching.
Optimization techniques provide AI (artificial intelligence) and NLP (natural language processing) capabilities in order to deliver the necessary assessed user suggestions and interpretations in the different networks/services of social network applications [28, 29]. Interfaces such as mobile web apps permit access to material related to movies, cuisine, literature, YouTube, healthcare, and more. Films, culture, and entertainment are communal domains. Depending on the user's awareness of the material, the recommender system [30] has problems with confidentiality and protection. Thus, classic recommenders have become unavoidable for current user ratings and Twitter posts to evaluate user-generated content [31, 32].
3. Recommender system
Unlike previous works on recommendation systems, the proposed UQCRS (user-query centered recommender system) is capable of performing Twitter mining from a vast Twitter database through query logs and user tweets to understand user interaction in the tweets. UQCRS provides search-based tweet content recommendation through the discovered user-query centered content in short and long tweets to depict the intention of the user's tweets. In the proposed system, the workflow is as follows: firstly, the Twitter background knowledge is extracted for user-query-centered knowledge base integration; secondly, the UQCRS strategy is implemented on the Twitter knowledge repositories; and finally, the proposed UQCRS system is evaluated and illustrated through discussion.
The UQC recommendation system's tweet content architecture depends on the content and ratings of tweets obtained through the query search, and on the profile and database of the user's Twitter profile, which stores query information and continuously updates the entities through Twitter user recommendation, as represented in Figure 1.
With content feedback and query-centered analysis, the recommender system implemented in Figure 1 can be used within e-commerce websites to guide Twitter customers by retrieving log data and mining the tweets they themselves originate.
Figure 1.
Architecture of the UQCRS implementation.
Figure 2.
User-query pattern categorization.
The clustering and filtering of the tweets, shown in Figure 2, are performed on the tweets given by the user recommendation, which is analyzed at every instant of user interaction.
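A minimal sketch of such clustering and filtering is given below, assuming k-means over TF-IDF features as the grouping step; the tweets, the number of clusters, and the cluster kept are illustrative placeholders rather than the chapter's actual pipeline.

```python
# Illustrative sketch: cluster tweets by TF-IDF similarity, then keep only the
# cluster matching the user's current interest (here, the cluster of tweet 0).
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

tweets = [
    "basket deals in the city mall",
    "city mall opens a new basket store",
    "formal wear discount for office work",
    "office work dress code goes formal",
]

features = TfidfVectorizer().fit_transform(tweets)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(features)

wanted = labels[0]
filtered = [tweet for tweet, label in zip(tweets, labels) if label == wanted]
print(filtered)
```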
In the proposed system, each analysis is a service, and the operation of the Content Detection Model algorithm, explained in Figure 3, is described in two parts. In the first, the content phrase set, built as a final set of query phrases per query log, brings together the maximum number of ordered patterns so that each filtered design generates enough matched tweets. In the second, a query mapping is computed that shares similar tweets across consecutive query logs, exceeding the maximum ordered pattern.
Figure 3.
User Twitter tweets detection.
Figure 4.
Extracting the knowledge from query log data.
Figure 5.
User-query centered knowledgebase integration.
The user item set is associated with a number of feature vectors representing the tweet customers, with different tweet phrases assigned to the user-query content model. In the recommender content model, the decision ranking prediction compares the users and Twitter queries in the categorized user-query item set and the tweets' weights.
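The ranking step can be pictured with the following sketch, which assumes TF-IDF term weights and cosine similarity as the comparison between the user query and candidate tweets; the chapter does not publish its exact scoring, so the tweets and query below are hypothetical.

```python
# Illustrative sketch: rank candidate tweets against a user query using
# TF-IDF weights and cosine similarity as a stand-in for the decision ranking.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

tweets = [
    "new pencil set for school work",
    "formal shirt offer this weekend",
    "basket weaving workshop next month",
]
user_query = "pencil for work"

vectorizer = TfidfVectorizer()
tweet_matrix = vectorizer.fit_transform(tweets)    # tweet term weights
query_vector = vectorizer.transform([user_query])  # query in the same space

scores = cosine_similarity(query_vector, tweet_matrix).ravel()
for idx in scores.argsort()[::-1]:                 # highest similarity first
    print(f"{scores[idx]:.2f}  {tweets[idx]}")
```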
Figure 6.
The proposed user-query centered recommender system.
For the experiments, a random public-user Twitter dataset and real-time data collected using the Twitter API are used. Twitter tweets containing the keywords "basket," "pencil," "work," "enter," and "formal" from the public domain are taken, following the standard bag-of-words approach. This dataset is used for classification, with 300 documents collected in each of the public domains.
For the classification of tweets, the true positive (TP), true negative (TN), false positive (FP), and false negative (FN) counts are utilized to compare the results of the classifier under test with the investigated techniques, as illustrated in Figure 7.
The relations between TP, FP, FN, and TN are:
Figure 7.
Classification matrix model for metric analysis.
Figure 8.
Accuracy comparison of a classified tweet.
Figure 9.
Comparison of different algorithms for measured values.
c. The relation 2 × (precision × recall) / (precision + recall) is described as the F-measure, which is a combined measure of precision and recall.
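For reference, a minimal helper that computes accuracy, precision, recall, and the F-measure above from the TP, FP, FN, and TN counts is sketched below; the counts in the example call are made-up placeholders, not results reported in this chapter.

```python
# Standard confusion-matrix metrics; the example counts are placeholders.
def classification_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_measure = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f_measure": f_measure}

print(classification_metrics(tp=120, fp=15, fn=10, tn=155))
```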
Figure 8 shows the accuracy of classifying a user-query tweet in the user-defined recommender system. The highest accuracy is achieved by the proposed work, with a large number of word phrases, depending on the content of the user query, when compared with the Naive Bayes (NB) classifier [41].
Figure 9 compares the proposed system with approaches on different datasets in terms of F1-score and exact match, because phrase-based content mining and tweet analysis are performed accurately on the two other datasets of the different methods [42, 43] and the proposed one.
5. Conclusions
Also, a novel algorithm based on content detection is used to extract the tweets using the bag-of-words method. Using the tweets' knowledge weights, the proposed recommendation system avoids the dissimilar tweet-pattern identification problem. The three parameters discussed above indicate that the proposed approach produces better accuracy results than the other methods.
6. Future scope
Future work focuses on extending the DTSR system with additional user profiles, such as film playlists, community groups, social media tweets, user emotion, user posts, and featured tweets, to improve the recommended method.
Author details
Vijayalakshmi Kakulapati
Sreenidhi Institute of Science and Technology, Hyderabad, Telangana, India
© 2021 The Author(s). Licensee IntechOpen. This chapter is distributed under the terms
of the Creative Commons Attribution License (http://creativecommons.org/licenses/
by/3.0), which permits unrestricted use, distribution, and reproduction in any medium,
provided the original work is properly cited.
References

[2] Metin Turan, et al. "Automatize Document Topic and Subtopic Detection with Support of a Corpus," Social and Behavioral Sciences, Elsevier, DOI: 10.1016/j.sbspro.2015.02.373, 2015.
[3] K. D. Rosa, et al. "Topical clustering of tweets," Proceedings of the ACM SIGIR: SWSM, 2011.
[4] Y. Duan, et al. "An empirical study on learning to rank of tweets," in Proceedings of the 23rd COLING, 2010, pp. 295-303.
[5] M. Pennacchiotti, et al. "Making your interests follow you on Twitter," in Proceedings of the 21st CIKM, 2012, pp. 165-174.
[6] A. Pal, et al. "Identifying topical authorities in microblogs," in Proceedings of the 4th ACM WSMINING, ACM, 2011, pp. 45-54.
[7] J. Weng, et al. "TwitterRank: finding topic-sensitive influential Twitterers," in Proceedings of the 3rd ACM WSMINING, 2010, pp. 261-270.
[8] Turney, P. D. 2000. Learning algorithms for keyphrase extraction. Information Retrieval 2(4):303-336.
[9] Zhao, W. X., et al. 2011. Topical keyphrase extraction from Twitter. In ACL, 379-388.
[10] El-Kishky, A., et al. 2014. Scalable topical phrase mining from text corpora. VLDB 8(3):305-316.
[11] Danilevsky, M., et al. 2014. Automatic construction and ranking of topical keyphrases on collections of short documents.
[13] King, G., et al. 2014. Computer-assisted keyword and document set discovery from unstructured text. Copy at http://j.mp/1qdVqhx 456.
[14] Luke, T., et al. 2013. A framework for specific term recommendation systems. In SIGIR, 1093-1094.
[15] Bhatia, S., et al. 2011. Query suggestions in the absence of query logs. In SIGIR, 795-804.
[16] Zhang, Y., et al. 2014. Bid keyword suggestion in sponsored search based on competitiveness and relevance. Information Processing & Management 50(4):508-523.
[17] Hahm, G. J., et al. 2014. A personalized query expansion approach for engineering document retrieval. Advanced Engineering Informatics 28(4):344-359.
[18] Global Faces and Networked Places: A Nielsen report on Social Networking's New Global Footprint, March 2009. Nielsen Company.
[19] Z. Tan, et al. "An efficient similarity measure for user-based collaborative filtering recommender systems inspired by the physical resonance principle," IEEE Access, vol. 5, pp. 27211-27228, 2017.
[20] U. Fasahte, et al. "Hotel recommendation system," Imperial Journal of Interdisciplinary Research, vol. 3, no. 11, 2017.
[21] A. Verma, et al. "A hybrid genre-based recommender system for movies using genetic algorithm and kNN approach," International Journal of Innovations in Engineering and Technology, vol. 5, no. 4, pp. 48-55, 2015.
[22] H. Jazayeriy, et al. "A fast recommender system for the cold user using categorized items," Mathematical and Computational Applications, vol. 23, no. 1, p. 1, 2018.
[23] Rosenthal, S., et al. 2017. SemEval-2017 task 4: Sentiment analysis in Twitter. In Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), pp. 502-518.
[24] M. Ibrahim, et al. "Design and application of a multivariant expert system using Apache Hadoop framework," Sustainability, vol. 10, no. 11, p. 4280, 2018.
[30] Java, A., et al.: Why we twitter: understanding microblogging usage and communities. In: Proc. of the 9th WebKDD and 1st SNA-KDD 2007 Workshop on Web Mining and Social Network Analysis, pp. 56-65, 2007.
[31] Joachims, T., et al.: Accurately interpreting clickthrough data as implicit feedback. In: Proc. of the 28th Int. ACM SIGIR Conf. on Research and Development in Information Retrieval (SIGIR'05), pp. 154-161, 2005.
[32] Kwak, H., et al.: What is Twitter, a social network or a news media? In: Proc. of the 19th Int. Conf. on World Wide Web (WWW'10), pp. 591-600, 2010.
[33] Liben-Nowell, D., et al.: The link prediction problem for social networks. In: Proc. of the 12th Int. Conf. on Information and Knowledge Management (CIKM '03), pp. 556-559. ACM, New York, NY, USA, 2003.
Chapter 2
Open Government Data: Development, Practice, and Challenges

Abstract
This chapter explores the concept of open data with a focus on Open
Government Data (OGD). The chapter presents an overview of the development
and practice of Open Government Data at the international level. It also discusses
the advantages and benefits of Open Government Data. The scope and charac-
teristics of OGD, in addition to the perceived risks, obstacles and challenges are
also presented. The chapter closes with a look at the future of open data and open
government data in particular. The author adopted literature review as a method
and a tool of data collection for the purpose of writing this chapter.
1. Introduction
The concept of Open Government Data (OGD) has been heavily debated during
the last few years. It has drawn much interest and attention among researchers and
government officials worldwide. Many of the developed and developing countries
have launched open data initiatives with a view to harnessing the benefits and
advantages of open government data. This chapter is dedicated to highlighting the
various aspects of open data and open government data.
According to the Open Definition, “Open” in the context of data and content
“means anyone can freely access, use, modify, and share for any purpose”. There
are many types of data that can be open and used or re-used by the public. These
include data relating to culture, science and research, finance, statistics, weather,
and environment [1, 2].
The Open Knowledge Foundation outlined key features of openness as the
following:
• Reuse and redistribution: the data must be provided under terms permitting
reuse and redistribution, with the capability of mixing it with other datasets.
This data must be machine-readable.
• Universal participation: the data should be available for everyone to use, reuse
and redistribute without discrimination against fields of knowledge, or against
persons or groups [2].
Features of open data also include the following: data should be primary and timely, and the data must be available in non-proprietary formats and be free to use under an unrestricted license. Data should also be as accurate as possible. Although most data will not meet all of these criteria, data is only truly open if it meets most of them [3].
The earliest appearance of the term open data was in 1995. It was related to the
disclosure of geographical and environmental data in a document written by an
American agency. The scholarly community understood the benefits of open and
shareable data long before the term open data was a technical object or political
movement [4].
The Scholarly Publishing and Academic Resources Coalition (SPARC) defined
open data from a research perspective as: “Open Data is research data that is freely
available on the internet permitting any user to download, copy, analyze, re-
process, pass to software or use for any other purpose without financial, legal, or
technical barriers other than those inseparable from gaining access to the internet
itself ” [5]. SPARC stressed the benefits of open data in that it accelerates the pace of
discovery, grows the economy, helps ensure people do not miss breakthroughs, and
improves the integrity of the scientific and scholarly record.
The current concept of open data and particularly open government data (OGD)
started to become visible and popular in 2009 with a number of governments
in the developed world who announced new initiatives to open up their public
information records such as the USA, UK, and New Zealand. These initiatives were
triggered by the mandate for transparency and open government from the then
American President Barack Obama administration, thus kick starting the Open
Government Data Movement [6, 7].
To legalize the use of the published public data, open data must be licensed. This
license should permit people to freely use, transform, redistribute and republish
the data even on a commercial basis. A number of standard licenses designed
to provide consistent and broadly recognized terms of use are employed. These
licenses include: Creative Commons (CC), Open Data Commons Open Database
License (ODbL), and Open Data Commons Public Domain Dedication and License
(PDDL). Some governmental organizations and international organizations have
released their own tailored open data licenses, such as the World Bank Data License, the French Open Data License, and the UK Government Data License. Standard licenses have many
advantages over bespoke licenses, including greater recognition among users,
increased interoperability, and greater ease of compliance [8, 9].
Interest in public access to government information can be traced back to the year 1966, when the USA federal government passed the Freedom
of Information Act (FOIA). The coming of the internet and new information and
telecommunications technologies contributed to the more recent interest and
understanding of the value and benefits of government information for the sake
of transparency, collaboration and innovation [12]. Two significant consequent
developments contributed positively to the open government data; these are the
launching of data.gov in the USA in May 2009 and the data.gov.uk in the United
Kingdom (UK), in January 2010. It subsequently spread out to many other countries
around the world, as well as to international organizations, including the World
Bank and the Organization for Economic Co-operation and Development (OECD).
Moreover, the concurrent advances in the information and telecommunications
technologies also played a role in the development of open government data,
coupled with the passing of open standard laws by many countries such as Canada,
the USA, Germany and New Zealand, and the setting of policies on open data
focusing on indexing government data holdings [13, 14].
In 2015 a number of governments, civil society members, and international
experts convened with the purpose of representing an internationally-agreed set
of norms for how to publish government and other public sector organizations
data. They then formulated a set of principles called the Open Data Charter. They
introduced these principles with the following statement:
“We, the adherents to the International Open Data Charter, recognize that gov-
ernments and other public sector organizations hold vast amounts of data that may
be of interest to citizens, and that this data is an underused resource. Opening up
government data can encourage the building of more interconnected societies that
better meet the needs of our citizens and allow innovation, justice, transparency,
and prosperity to flourish, all while ensuring civic participation in public decisions
and accountability for governments…” [15].
The conveners agreed to adhere to the following set of principles concerning access and release of government and public sector data. That data should be:

i. Open by Default;

ii. Timely and Comprehensive;

iii. Accessible and Usable;

iv. Comparable and Interoperable;

v. For Improved Governance and Citizen Engagement;

vi. For Inclusive Development and Innovation.
The scope of Open government data which is made available with no restrictions
on its use, reuse, or distribution covers all data funded by public money excluding
private, security sensitive, and confidential data.
CSV
Description: CSV stands for "comma-separated values". This type of file is a simple text file where information is separated by commas. These files are usually encountered in spreadsheet software and databases. They may also use other characters, such as semicolons, to separate data. By using the CSV file format, complex data can be exported from one application to a CSV file and then imported from that CSV file into another application.
Example: Product, Size, Color, Price; Shirt, Large, White, $15; Shirt, Small, Green, $12; Trousers, Medium, Khaki, $35

JSON
Description: JSON stands for "JavaScript Object Notation". This is a text file format for storing and transporting data. By using JSON, JavaScript objects can be stored as text. The string in the example defines an object with three properties (name, age, job), and each property has a value.
Example: '{"name": "Jack", "age": 41, "job": "accountant"}'

RDF
Description: An RDF file is a document written in the Resource Description Framework (RDF) language. This language is used to represent information about resources on the web. It contains the website metadata; metadata is structured information. RDF files may include a site map, an updates log, page descriptions, and keywords.
Example: <?xml version="1.0"?> <rdf xmlns="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:s="http://description.org/schema/"> <Description about="https://www.xul.fr/Wells"> <s:author>The Invisible Man</s:author> </Description> </rdf>

XML
Description: An XML file is written in the extensible markup language. It is used to structure data for storage and transport. In an XML file, there are tags and text. The tags provide the data structure, and the text in the file is surrounded by these tags, which adhere to specific syntax guidelines. The XML format is used for sharing structured information between programs, and between computers and people, both locally and across networks.
Example: <part number="1976"> <name>Windscreen Wiper</name> <description>The Windscreen wiper automatically removes rain from your windscreen, if it should happen to splash there. It has a rubber <ref part="1977">blade</ref> which can be ordered separately if you need to replace it.</description> </part>

Table 1.
File types: definitions and examples [18–22].
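As a small illustration of moving between the open formats listed in Table 1, the sketch below re-publishes the CSV example as JSON using only the Python standard library; the snippet is illustrative and not part of the cited guidance.

```python
# Convert the Table 1 CSV example into JSON records.
import csv
import io
import json

csv_text = """Product,Size,Color,Price
Shirt,Large,White,$15
Shirt,Small,Green,$12
Trousers,Medium,Khaki,$35
"""

rows = list(csv.DictReader(io.StringIO(csv_text)))  # parse comma-separated rows
print(json.dumps(rows, indent=2))                   # emit machine-readable JSON
```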
over time [25]. Therefore, open data can lead to open government which is defined
as: “….. a multilateral, political, and social process, which includes in particular
transparent, collaborative, and participatory action by government and administra-
tion. To meet these conditions, citizens and social groups should be integrated into
political processes with the support of modern information and communication
technologies, which together should improve the effectiveness and efficiency of
governmental and administrative action” [26].
According to the principles of OGD, data must be: complete, primary, timely,
accessible, and machine-readable. It should also be non-discriminatory, non-
proprietary and License-free. Furthermore, public institutions should publish
all data they have if it would not violate security, privacy or other legitimate
restrictions [27].
The World Wide Web Consortium (W3C) outlined three steps for publish-
ing open data, which will help the public to easily find, use, cite and understand
the data:
Step 1: Publishing the data in its raw form. The data should be well-structured
to enable its use in an automated manner by the users of the data. Data may be
in XML, RDF or CSV formats. Formats used should allow the data to be seen as
well as extracted by the users.
Step 2: Creating an online catalog of the raw data, complete with documentation,
to enable users to discover published data.
Step 3: Making the data human readable as well as machine-readable [28].
Open data portals are a very important component of open data infrastructure.
They connect data publishers with data users enabling the former to deliver open
data and establish the necessary relationships for increasing transparency. Open
data portals, which are essentially data management software, contain metadata
about datasets so that these datasets could be accessed and utilized by the users.
The open data portal includes the tools which help the users to find and harvest all
relevant data from public sector databases. From the users’ perspective, features of
open data portals can be used to specify datasets they need and to request datasets [29]. Thus, open data portals play the role of an interface between government data and the citizens who use or reuse this data. Consequently, a portal should have user-friendly features such as a clean look with a search facility. The portal should
also provide information about the responsible authority which hosts the portal
written clearly and in a simple language. The portal’s contents should be organized
into categories and subcategories. It should also aim to engage citizens’ ideas and
feedback in addition to its basic function of making data available to stakeholders.
Data quality and standards, and the language settings are very important elements
in portals so that they can satisfy their users’ needs [30].
The World Wide Web Consortium’s (W3C) benchmark for publishing open gov-
ernment data and the World Bank’s technical option guide outlined the necessary
technical requirements for establishing efficient and modern OGD data centers.
These requirements include, among other things, that:
iii. Data is stored in multiple formats – both human- and machine-readable formats, such as CSV, XML, PDF, RDF, JSON, etc. – to enable users to easily access published data. It is expected that documentary data are stored in PDF, doc(x), or Excel, and geographical data in Keyhole Markup Language (KML) or equivalent alternatives [28, 31].
As for OGD portals’ content and functionality requirements, these include the
following:
• A number of Datasets.
• Data Search. A ‘search box’ feature should be available to allow users to easily
locate specific information by entering a search term.
• Availability of working social media plugins. This feature enables data users to share their experiences and suggest new datasets through comments in social media websites such as Facebook, Twitter, etc.
• Dataset should be organized for use and not only for the sake of publication.
• Should learn from the techniques used by recent commercial data market,
share knowledge to promote data use, and adapt methods that are common in
the open source software community;
• Being accessible by offering both options for big data, such as Application
Programme Interfaces (API), and options for more manual processing, such
as comma separated value files, thus ensuring a wide range of user needs
are met;
• Assess how well they are meeting users’ needs by being measurable [32].
A number of open source and commercial open data portals software exist.
Some of the more widely used open source software are the following:
i. CKAN: This is an open source data portal designed to allow publishing, sharing, and managing datasets; it has a number of functionalities for managers and end-users, such as full-text search, reporting tools, and multi-lingual support. It also provides an Application Programming Interface (API) to access the data (a query sketch is given after this list).
ii. DKAN: compared to CKAN, this software has more data-oriented features including scraping, data harvesting, visual data workflow, and advanced visualization. DKAN users are mainly government organizations and Non-Governmental Organizations (NGOs).
iii. Socrata: It has a number of powerful data management tools for database
management, data manipulation, reporting, visualization with advanced
options and customized financial analytics insights. Socrata has two licenses;
an open source license for the community edition and commercial one for
the enterprise edition.
iv. Dataverse: It is built to share and manage large data-sets. It helps its
users to collect, organize, and publish their data-sets in a collaborative
platform. Dataverse is employed around the world by Non-Governmental
Organizations (NGOs), Government organizations and research
centers [8].
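As mentioned for CKAN above, datasets on such a portal can also be queried programmatically through its Action API. The sketch below assumes a reachable CKAN portal; the portal URL and the search term are placeholders, and the exact fields returned may vary by portal version.

```python
# Illustrative sketch: search a CKAN portal's datasets via the Action API.
import requests

portal = "https://demo.ckan.org"  # placeholder portal URL
response = requests.get(
    f"{portal}/api/3/action/package_search",
    params={"q": "transport", "rows": 5},  # placeholder search term
    timeout=30,
)
response.raise_for_status()
for dataset in response.json()["result"]["results"]:
    print(dataset["name"], "-", dataset.get("title", ""))
```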
The European Data Portal published a report in the year 2020 highlighting the
best open data practices implemented by the three top performing countries of the
year 2019 assessment - Cyprus, France and Ireland. The reported practices may be
applicable to other international contexts. The practices were categorized into four
aspects relating to open data, namely, Open Data Policy, Open Data Portal, Open
Data Impact, and Open Data Quality. Table 2 shows the best practices associated
with each one of these aspects.
Open Data Policy
• Setting up of open data policy, legislation, and strategy.
• Development of an implementation plan so as to have an actionable strategy and clear responsibilities.
• Setting up of an open data liaison officers network and maintaining close contact with them.

Open Data Portal
• Inclusion of features that go further than enabling users to find datasets.
• Focusing on interaction between data providers and data re-users through discussion forums, dataset-specific feedback, and rating systems.
• Interaction with data re-users for the purpose of understanding their needs through open data events.

Open Data Quality
• Provision of manuals and technical guidelines for the purpose of responding to frequently asked questions.
• Integration of all measures, guides, and training possibilities in one platform on the portal.

Table 2.
Open data best practices [33].
Open government data has a number of economic and political implications and
benefits, particularly on the democratic aspect. They include better transparency,
citizens’ trust in the government and collaboration in governance, and economic
development. These benefits of open government data utilization can be detailed in
the following:
i. Political and social benefits. These include the following aspects: more
transparency, democratic accountability; more participation and self-
empowerment of citizens; creation of trust in government; public
engagement; equal access to data; new governmental services for citizens;
improvement of citizen services, citizens’ satisfaction, and policy-making
processes; allowing more visibility for the data provider; creation of new
insights in the public sector; and introduction of new and innovative social
services.
ii. Economic benefits. These include the following aspects: economic growth;
stimulation of innovation; contribution towards the improvement of
processes, products, and services; adding value to the economy by creating
a new sector; and availability of information for investors and companies.
When Open Data is used to produce new products or start new services, it
can increase the demand for more data causing the release of more datasets
and improvements in data quality [34].
iii. Operational and technical benefits. These include the following aspects: The
ability of reusing data; optimization of administrative processes; improve-
ment of public policies; accessing external problem-solving capacity; fair
decision-making by enabling comparison; easier discovery of and access
to data; creating new data based on combining and integrating existing
data; validation of data by external quality checks; and avoidance of data
loss [6, 23].
iii. Skills. Technical skills and knowledge about data on the part of users is
essential in order for them to be able to use open government data, such as
knowledge about statistics or programming.
Open government data faces a number of barriers and challenges that may
impede its development and implementation. Some of these barriers are related to
either the data providers or the data users, while other barriers can be attributed to
both sides. Barriers that might face either side are outlined below:
• Poor data quality. This includes lack of sufficient and accurate data and avail-
ability of obsolete and non-valid data.
There are some other barriers that are encountered by both open data publishers
and users. These include the following:
• Geospatial data has its own specific barriers resulting from the use of different
standards in relation to other types of open data. Dealing with this type of data
requires specific technical knowledge and expertise [37, 38].
• Lack of open-mindedness about the application of open data and the focus on
publishing of data regardless of its good quality or perceived value.
• Non allocation of budget for opening data because it is still a recent not fully
understood concept.
5.2 Risks
Many risks confront and may consequently impede the proper implementation
and utilization of open government data. They include the ones listed below:
• Others may profit from open data rather than the intended citizens.
To avoid and mitigate OGD challenges and risks, a number of practical solutions
can be designed to enhance the accessibility and reusability of open government
data on the legal, institutional and technical levels. These solutions include:
• Education of data users and data providers on what is technically and legally
possible so they can develop their plans within these boundaries.
• Linking of the discussion on technical and legal requirements so that the former
may not end up being difficult to implement in national jurisdictions and the
latter may not be unrealistic or unadjusted to technical developments and
practice [41].
6. Future of OGD
A seminar held by Statisticians, civil society and private sector ahead of the 48th
session of the UN Statistical Commission that took place in 2017 discussed new
trends and emerging issues in open data in light of the 2030 Agenda for Sustainable
Development. The outcome of this seminar included the following insights and
recommendations for the purpose of making the world more open to open data:
• Providing free access and use of data by open data platforms for purposes of
transparency, accountability and daily decision making;
• Ensuring that the principles of data rights and access are matched with strict
ethical and security protocols;
• Facilitating and enabling the efforts to making data more open by advanced
technologies and approaches to data architecture and management.
7. Conclusion
This chapter explored the various aspects of open government data. The chapter
opened by defining the concepts of openness, open data, and open government
data (OGD). It then proceeded to explain how OGD developed during the last
few years and highlighted the most important cornerstones of this development.
The chapter then explained the various requirements of OGD implementation and
utilization. It highlighted the practice of OGD around the world. It also explained
the role of portals in OGD implementation and utilization, outlining their vari-
ous technical and functional requirements, besides introducing a number of open
source portal software and applications. The chapter then elaborated on a number of benefits and advantages of open government data for the government and the citizens. Finally, the chapter discussed the barriers, challenges, and risks confronting open government data initiatives, and closed by highlighting some perceived future trends of open government data.
Author details
© 2021 The Author(s). Licensee IntechOpen. This chapter is distributed under the terms
of the Creative Commons Attribution License (http://creativecommons.org/licenses/
by/3.0), which permits unrestricted use, distribution, and reproduction in any medium,
provided the original work is properly cited.
References

[8] Top 16 Open source Data Portal Solutions for Open Data Publishing [Internet]. 2019. Available from: https://medevel.com/15-data-portals-opensource/ [Accessed: 2021-07-29]
[9] Open Data Essentials [Internet]. 2019. Available from: http://opendatatoolkit.worldbank.org/en/essentials.html [Accessed: 2021-07-25]
[16] Open Data Barometer - Leaders Edition [Internet]. 2018. Available from: https://opendatabarometer.org/leadersedition/report/ [Accessed: 2021-08-15]
[17] Open Data Maturity Report. Luxembourg: Publications Office of the European Union; 2020. DOI: 10.2830/619187
[18] Dave Johnson. What is a CSV file? How to open, use, and save the popular spreadsheet file in 3 different apps [Internet]. 2021. Available from: https://africa.businessinsider.com/tech-insider/what-is-a-csv-file-how-to-open-use-and-save-the-popular-spreadsheet-file-in-3/4gbqn4b [Accessed: 2021-9-12]
[19] JSON – Introduction [Internet]. 2021. Available from: https://www.w3schools.com/js/js_json_intro.asp [Accessed: 2021-9-12]
[26] Bernd W, Steven B. Open Government: Origin, Development, and Conceptual Perspectives. International Journal of Public Administration. 2015;38(5):381-396. DOI: 10.1080/01900692.2014.942735
[27] Jędrzej W. Barriers to Using Open Government Data. In: Proceedings of the 3rd International Conference on E-Commerce, E-Business and E-Government (ICEEG 2019); June 2019; Lyon, France. 2019. p. 15-20. DOI: 10.1145/3340017.3340022
Chapter 3
Framework to Evaluate Level of Good Faith in Implementations of Public Dashboards

1. Introduction
There has been a global trend for populations to increasingly hold govern-
ments accountable to open government data (OGD) standards [1]. Because of
this, governments have undertaken open data projects, such as providing public
access to government data through publicly-accessible dashboards [2, 3]. However,
government actors also may have an incentive to hide or obscure data, so there are
barriers to accessing data for public dashboards [1]. This chapter focuses on the
specific problem where governments attempt to demonstrate compliance with
OGD standards through the presentation of a public dashboard, while at the same
time, appearing to hide or obscure the data it is supposed to represent through poor
dashboard design.
Our motivation to tackle this topic comes from our own disappointing experience trying to use a public dashboard implemented as part of OGD standards.
The RCQI model spans departments and organizations, and is extremely paperwork intensive [7, 8]. Part of what causes the RCQI model to be so effort-intensive is that it measures process
outcomes. By contrast, the evaluation framework for public dashboards recom-
mended in this chapter is streamlined, and focused on achieving a design solution,
not a process solution.
Nevertheless, an optimal design solution will not be achieved without an ade-
quate design process. Therefore, it is important to consider how the public should
be involved in the process of designing public dashboards – especially those that are
publicly-funded, and therefore have obligations to respond to the public’s needs.
Figure 1.
Generic logical dashboard design process. This design process produces an alpha prototype for initial testing,
and a beta prototype for widespread field testing.
As a general trend, consumers are demanding more data transparency, and calls
are being made for governments to make data available for public oversight [1].
Likewise, there is an increasing trend toward using dashboards for empowering the
public [2, 3]. Not only do dashboards of public data provide a mechanism for public
oversight of leaders, but they also reduce information asymmetry, which refers to
the circumstance in which one party (the government) has more information than
another party (the public), thus disempowering them [2, 10, 11].
However, governments are not always keen to share the data for various reasons.
It has been argued that government agencies will be more likely to comply with
open government data (OGD) practices if they see it as an opportunity to showcase
their agency’s success [1]. However, if the agency believes the data will cast the
agency in a negative light, the agency may be less likely to be inclined toward OGD
practices. Ruijer and colleagues recommend that institutional incentives and pres-
sure be created for OGD, because governments have a natural interest in suppress-
ing data they think may be harmful to them in some way if analyzed [1].
However, data suppression is not the only method governments employ to
prevent data use and interpretation. One limitation of legal requirements for OGD
is that the agency may comply with the requirements in bad faith. During the
COVID-19 outbreak in early 2020, a state epidemiologist in Florida said she was
fired for refusing to manually falsify data behind a state dashboard [12]. Simply
reviewing the limitations of big data can reveal ways to share big data in bad faith
in a dashboard, such as visualizing too much data, visualizing incomprehensible or
inappropriate data, and not visualizing needed data [13].
For this reason, in addition to holding governments to OGD standards, govern-
ment efforts need to be evaluated as to whether or not they meet OGD standards
in good faith. The framework presented here provides guidance on how to evaluate
good vs. bad faith implementations of a public dashboard.
The evaluation framework presented has six principles on which to judge the
level of good or bad faith in a public dashboard: 1) ease of access to the underlying
data, 2) the transparency of the underlying data, 3) approach to data classification,
4) utility of comparison functions, 5) utility of navigation functions, and 6) utility
of metrics presented. These principles will be described below.
Although raw data are used for the dashboard, in the dashboarding process they
undergo many transformations to be properly visually displayed [9, 14]. This
processing can produce calculated values that are then displayed in the dashboard.
Therefore, to be transparent, the dashboard must not only facilitate access
to the underlying raw data, but also to the transformations the data underwent in
being displayed. A simple way to accomplish this kind of transparency is to use
open source tools and publish the code, along with documentation [14]. This allows
citizen data scientists an opportunity to review and evaluate the decisions made in
the dashboard display.
How data are classified in a dashboard can greatly impact the utility of the dash-
board. As an example, developers of an emergency department (ED) dashboard
that was in use for five years under beta testing found that after the ED experienced
an outbreak of Middle East Respiratory Syndrome (MERS), major structural
changes were needed to the dashboard [15]. Another paper about developing a visu-
alization of patient histories for clinicians described in detail how each entity being
displayed on the dashboard would be classified [16]. Hence, inappropriate classifi-
cations or ones deliberately made in bad faith can negatively impact data interpreta-
tion to the point that the dashboard could be incomprehensible to its users.
Dashboards are typically at least somewhat interactive, providing the user the
ability to navigate through the data display, which responds to actions by the user
[14, 18, 19]. When operating in good faith, developers often conduct extensive
usability testing to ensure that the dashboard is intuitive to use in terms of navigat-
ing through the data display, and that any interactivity is useful [15]. But when
implemented in bad faith, a dashboard could be designed to deliberately confuse
the user as to how to navigate and interpret the data in the dashboard.
Figure 2.
MA HAI public dashboard landing page. Note: “A” labels a menu of tabs that can be used for navigation to
view metrics on the various hospital-acquired infections (HAIs). In panel labeled “B”, tabs can be used to
toggle between viewing state-level metrics and hospital-level metrics. Hospitals can be selected for display using
a map labeled “C”.
(Figure 2, “A” and “B”). For the ICUs at each hospital, the report displays a set of
tables summarizing CAUTI and CLABSI rates, followed by time-series graphs.
For a set of high-risk surgical procedures, SSI rates and graphs for the hospital are
displayed. Methicillin-resistant Staphylococcus aureus (MRSA) and C. difficile
infections are serious HAIs that can be acquired in any part of the hospital and are
diagnosed using laboratory tests [27]. Rates and graphs of MRSA and C. difficile
infections are also displayed on the report.
The underlying data come from the NHSN. This is not stated on the dashboard.
Instead, there is a summary report and presentation posted alongside the dashboard
on the web site, and the analyses in these files are based on NHSN data [21]. It seems
that the DPH is using this NHSN data as a back-end to the dashboard, and that the
dashboard is an attempt to comply with OGD laws.
Because the authors are aware of the high rates of HAI in the US, and because
we both live in MA and we both are women who are cognizant that sexism in US
healthcare adds additional layers of risk to women [28], we identified that we were
in a state of information asymmetry. Specifically, we had the information need to
compare MA hospitals to choose the least risky or “lethal” one for elective surgery
or childbearing (planned procedures), but we felt this need was not met by this
OGD implementation.
In this section, we start by evaluating the existing MA DPH HAI dashboard
against our good vs. bad faith framework. Next, we propose an alternative dash-
board solution that improves the good vs. bad faith features of the implementation.
Figure 3.
Logical entity-relationship diagram for data behind dashboard. Note: The schema presented assumes four
entities: The hospital entity (primary key [PK]: HospRowID), each intensive care unit (ICU) attached to a
hospital which contains the frequency of infection and catheter days attributes to allow rate calculation (PK:
ICURowID), each procedure type attached to a hospital (to support the analysis of surgical site infection [SSI],
with PK: ProcRowID), and each other infection type at the hospital not tracked with ICUs (PK: LabID).
least likely to cause HAI for an elective procedure (e.g., childbearing), or to estab-
lish as their top choice of the local hospital should they ever need to be admitted.
This interface makes it difficult to compare HAI at different hospitals, because
metrics from more than one hospital cannot be viewed at the same time. Further,
metrics about different HAIs at the same hospital are on different panels, so within-
hospital comparisons are not facilitated. There appears to be no overall metric by
which to compare hospitals in terms of their HAI rates.
Figure 4 shows an example of the metrics reported by each hospital on the dash-
board reporting panel (“B” in Figure 2). The figure also shows one of the two tables
and one of the two figures displayed on the CAUTI tab for the selected hospital. In
all, two tables and two figures are displayed in portrait style in panel “B” (Figure 2),
and Figure 4 shows the top table and figure displayed.
Figure 4.
Dashboard metric display for each hospital. Note: To view hospital-acquired infection (HAI) rates at hospitals,
a hospital is selected (Figure 2, panel "C"), then the user selects the tab for the HAI of interest. In Figure 4,
a hospital has been identified, and a tab for catheter-associated urinary tract infection (CAUTI) has been
selected (see circle). Two tables and two figures are presented in portrait format on the reporting panel for each
hospital (Figure 2, panel "B"). Figure 4 shows the first table and figure presented ("1" and "2"); the table
reflects stratified metrics for CAUTI at each ICU at the hospital, and the graph reflects a time series of these
metrics stratified by hospital vs. state levels, and intensive care unit (ICU) vs. ward ("ward" is not defined in
the dashboard). The metrics provided in "1" are the number of infections, predicted infections, standard
infection ratios (SIRs), a confidence interval for the SIR, and an interpretation of the level. In "2", the SIR is
graphed.
In the table displayed (labeled "1" in Figure 4), the metrics presented are the number of infections, predicted
infections, standard infection ratios (SIRs), a confidence interval for the SIR, and an
interpretation of the level. The figure (labeled “2” in Figure 4) displays a time-series
graph of SIRs for the past five years. In the other table on the panel (not shown in
Figure 4), ICU-level metrics are provided for catheter-days, predicted catheter-days,
standard utilization ratios (SURs) with their confidence intervals, and an interpretation;
an analogous time-series graph of five years of SURs is also presented (not shown in
Figure 4).
SIRs and SURs are not metrics typically used by the public to understand rates
of HAI in hospitals. Risk communication about rates to the public is typically done
in the format of n per 10,000 or 100,000, depending upon the magnitude of the
rate [29]. Further, stratifying rates by ICU is confusing, as prospective patients may
not know which ICU they will be placed in. Because the hospital environment
confers the strongest risk factors for HAI (e.g., worker burnout), HAI rates will be
intra-correlated within each hospital [30]. Therefore, it is confusing to present all
of these rates stratified by ICU. Figure 4 displays only 50% of the information
available about CAUTI at one hospital. With each tab displaying similar metrics
about SSI and other infections, the experimental unit being used is so small that it
obfuscates any summary statistics or comparisons. Also, it is unclear how the
"predicted" metrics presented were calculated.
Ultimately, the design process and requirements behind this dashboard are not
known. There is no documentation as to how this dashboard was designed, and what
it is supposed to do. It appears to be an alpha prototype that was launched without a
stated a priori design process, and without any user testing or formal evaluation.
We chose to redesign the dashboard into a new alpha prototype that met require-
ments that we, as members of the public, delineated. Consistent with the good faith
principles proposed, our requirements included the following: 1) the dataset we use
should be easily downloadable by anyone using the dashboard, 2) the documenta-
tion of how the dashboard was developed should be easy to access, 3) hospitals
should present summary metrics rather than stratified ones, 4) different HAI
metrics for the same hospital should be presented together, and 5) there needs to be
a way to easily compare hospitals and choose the least risky hospital. To do this, we
first obtained the data underlying the original dashboard. Next, we analyzed it to
determine better metrics to present. We also selected open-source software to use
to redeploy an alpha prototype of a new dashboard. Finally, we conducted informal
user testing on this alpha prototype.
We chose to focus our inquiry on the data from the hospital and ICU tables, as
CAUTI and CLABSI are by far the most prevalent and deadly HAIs [23]. Therefore, we
scoped our alpha prototype to only display data from the ICU and the hospital tables
(although we make all the data we scraped available in the downloadable dataset).
This limited us to basing the dashboard on hospital- and ICU-level metrics only.
Next, we intended to present CAUTI and CLABSI frequencies and rates,
whereby the numerator would be the number of infections, and the denominator
would be the “number of patients catheterized”. We felt that the dashboard’s use of
catheter-days as the rate denominator was confusing to the public, and appeared
to attenuate the prevalence of patients having experienced a CAUTI or CLABSI.
Although "number of patients catheterized" was not available in the data, "annual
admissions" was. Since the proportion of patients admitted annually who are
catheterized probably does not vary much from hospital to hospital, we chose to use
the number of admissions as the denominator, as a proxy measurement.
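As a concrete illustration of this choice, the sketch below (Python, with hypothetical column names; not the published code) computes HAI counts per 10,000 annual admissions, using admissions as the proxy denominator.

```python
import pandas as pd

# Illustrative only: column names and values are assumptions, not the source dataset's schema.
hospitals = pd.DataFrame({
    "hospital": ["A", "B", "C"],
    "annual_admissions": [12000, 8500, 20000],
    "cauti_infections": [6, 0, 14],
    "clabsi_infections": [3, 1, 9],
})

# Rate per 10,000 admissions, using annual admissions as a proxy denominator
# for "number of patients catheterized" (which is not in the source data).
for hai in ("cauti", "clabsi"):
    hospitals[f"{hai}_rate_per_10k"] = (
        hospitals[f"{hai}_infections"] / hospitals["annual_admissions"] * 10_000
    )

print(hospitals[["hospital", "cauti_rate_per_10k", "clabsi_rate_per_10k"]])
```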
Third, we needed to develop a way of sorting hospitals as to their likelihood of
causing an HAI to allow easy comparisons by public users, so we decided to develop
an equation to predict the likelihood of an HAI at the hospital. We did this by
developing a linear regression model with hospital-level attributes as independent
variables (IVs), and CAUTI rate in 2019 as the dependent variable (DV). We chose
CAUTI over CLABSI after observing the two rates were highly correlated, and
CAUTI was more prevalent.
Table 1 describes the candidate IVs for the linear regression model. The table also
includes the source of external data that were added to the hospital data. We studied
our IVs, and found serious collinearity among several variables, so we used principal
component analysis (PCA) to help us make informed choices about parsimony [37].
The data predominantly loaded on three factors (not shown). The first factor included
all the size and utilization variables for the hospital; these were summed into a Factor 1
score. The second-factor loadings included the proportion of those aged 65 and older
and the non-urban flag (Table 1), so those were summed as Factor 2. Proportion non-
White was strongly inversely correlated with Factor 2, so it was kept for the model, and
county population did not load, so it was removed from the analysis. Factor 3 loadings
included teaching status, for-profit status, and Medicare Performance Score (MPS).
Rather than create a score, we simply chose to include the variable from Factor 3 that
led to the best model fit to represent the factor, which was MPS. Then we finalized our
linear regression model, and developed a predicted CAUTI rate (ŷ) using our model
that included the following IVs: MPS, Factor 1 score, Factor 2 score, and proportion of
non-White residents in hospital county.
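A minimal sketch of this kind of workflow, in Python with scikit-learn, is shown below; the column names, factor construction, and file path are illustrative assumptions rather than the actual analysis code.

```python
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# One row per hospital; all column names here are illustrative assumptions.
df = pd.read_csv("hospital_level_data.csv")
size_util_cols = ["beds", "icu_beds", "annual_admissions"]   # hypothetical factor 1 variables
age_rural_cols = ["prop_age_65_plus", "non_urban_flag"]      # hypothetical factor 2 variables

# Inspect the collinearity structure of the candidate IVs with PCA.
X_std = StandardScaler().fit_transform(df[size_util_cols + age_rural_cols])
print(PCA().fit(X_std).explained_variance_ratio_)

# Simple factor scores, as described in the text: sum the variables in each factor.
df["factor1"] = df[size_util_cols].sum(axis=1)
df["factor2"] = df[age_rural_cols].sum(axis=1)

# Final model: predict the 2019 CAUTI rate from MPS, the factor scores, and % non-White.
ivs = ["mps", "factor1", "factor2", "prop_non_white"]
complete = df.dropna(subset=ivs + ["cauti_rate_2019"])
model = LinearRegression().fit(complete[ivs], complete["cauti_rate_2019"])
df.loc[complete.index, "lethality_score"] = model.predict(complete[ivs])
```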
Next, we used the regression equation to calculate ŷ as a “lethality score” for each
hospital. Of the 71 hospitals in the dataset, 21 were missing MPS and 8 were missing
other data in the model. Therefore, only the 42 with complete data (IVs and DVs)
were used to develop the regression model. As a result, the lethality score was non-
sensical for some hospitals; where the residual was large, the lethality score was
replaced with the 2019 CAUTI rate. If CAUTI data were missing, it was assumed
that the hospital had no CAUTI cases, and therefore was scored as 0.
Teaching hospital status | Original dashboard | Exposure | Scraped data from dashboard
Hospital profit status | Original dashboard | Confounder | Scraped data from dashboard
Numerators for rates – infection frequencies | Original dashboard | Create outcome | Scraped data from dashboard
Table 1.
Conceptual model specification.
Once the lethality score was calculated, we chose to sort the hospitals by score, and
divide them into four categories: least probable (color-coded green), somewhat
probable (color-coded yellow), more probable (color-coded red), and most probable
(color-coded dark gray). Due to missing CAUTI information and many hospitals
having zero CAUTI cases, our data were severely skewed, so making quartiles of the
lethality score to divide the hospitals into four categories was not meaningful. To
compensate, we sorted the data by lethality score and placed the first 23 hospitals
(32%), which included all the hospitals with zero cases, in the least probable category.
We placed the next 16 (23%) in somewhat probable, the next 16 (23%) in more
probable, and the final 16 (23%) in most probable. We chose to use this classification
in the data display on the dashboard to allow for easy comparison between hospitals
of the risk of a patient contracting an HAI.
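Continuing the illustrative sketch above (not the actual code), the four color-coded categories can be assigned by sorting on the score and cutting at the group sizes just described.

```python
# Sort by lethality score and bin into the four color-coded categories
# (first 23 hospitals, then three groups of 16, assuming 71 hospitals in total).
df = df.sort_values("lethality_score").reset_index(drop=True)
labels = ["least probable", "somewhat probable", "more probable", "most probable"]
colors = {"least probable": "green", "somewhat probable": "yellow",
          "more probable": "red", "most probable": "darkgray"}

df["hai_category"] = pd.cut(df.index, bins=[-1, 22, 38, 54, len(df) - 1], labels=labels)
df["map_color"] = df["hai_category"].map(colors)
```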
the dashboard. In our newly designed dashboard, the package leaflet was used for a
base map on which we placed the hospital icons (like the original dashboard), and
add-ons were made to display other items. These add-ons were adapted from other
published code [41]. JavaScript, via the DT (data table) wrapper, was used to display
stratified ICU rates, and CSS was used for formatting.
The dashboard we developed was deployed on a server (https://natasha-dukach.
shinyapps.io/healthcare-associated-infections-reports/) and code for the dash-
board was published (https://github.com/NatashaDukach/HAI-MA-2019). When
accessing the link to the dashboard, the user initially sees a map with icons (in the
form of dots) on it indicating hospitals. The icons are color-coded according to
the lethality score described previously. Clicking on an icon will expand a bubble
reporting information about the hospital (Figure 5).
As shown in Figure 5, like the original dashboard, this one has a map for navigation.
Unlike the original, it only has two tabs: “ICU Rate Explorer” (the one shown in Figure 5),
and “Data Collection”, which provides documentation and links to original data and code
(see “A” in Figure 5). The hospital icons are placed on the map and coded according to our
color scheme (see legend in Figure 5 by “B”). This allows for easy comparison between
hospitals. When clicking on an icon for a hospital, a bubble appears that contains the fol-
lowing hospital metrics: Number of admissions, number of ICU beds, overall CAUTI rate,
and overall CLABSI rate. There is also a link on the bubble where the user can click to open
a new box that provides CAUTI and CLABSI rates stratified by ICU. Future development
plans include adding other overall rates (e.g., for SSI), and adding in data from previous
years to allow for the evaluation of trends.
Informally, members of two potential user bases were queried as to their reac-
tions to the differences between the two dashboards: members of the academic
public health space, and members of the MA public. When the dashboard redesign
Figure 5.
Alternative dashboard solution. Note: In our new version, two tabs are created (see “A”). The figure shows the
first tab titled “ICU Rate Explorer”. The second tab, titled “data collection”, has information about the design
of the dashboard and links to the original code. Each of the hospitals is indicated on the map by a color-coded
icon that can be clicked on to display a bubble. The legend by “B” displays our color-coding scheme. When
clicking on a hospital icon, hospital-level metrics are shown in a bubble, and there is a link that leads to the
display of intensive care unit (ICU)-level metrics (see “C”).
4. Application
We wanted to compare the original HAI dashboard with the one we developed
based on the good faith principles described earlier. We started by creating the
framework presented in Table 2, which describes the good faith and bad faith
characteristics of public dashboards.
Using this framework, we applied a rating system. We chose zero to represent
"neither bad faith nor good faith", −5 to represent "mostly bad faith", and +5 to
represent "mostly good faith". Then, based on our experience and available
information, we rated the original MA HAI dashboard and our alternative dashboard
solution to compare the ratings. To experiment with applying our framework
to another public dashboard, we used the information published in the article
described earlier to rate the traffic dashboard in Rio de Janeiro [2]. Our ratings
appear in Table 3.
As shown in Table 3, using Table 2 as a rubric and our rating scale, we were
able to rate each dashboard and assign a score. We were also able to document in
the comments of the table the evidence on which we based our scores. Table 3
demonstrates that this framework can be used to compare two different alternatives
of a public dashboard displaying the same data, as well as two completely different
public dashboards. The total scores show that while our redesigned prototype of the
HAI dashboard had a similar level of good faith implementation compared to the
Rio traffic dashboard (scores 26 vs. 23, respectively), the original HAI dashboard
had a very low level of good faith implementation compared to the other two
(score = −20).
5. Discussion
Transparency of underlying data
Good faith characteristics:
• For each native variable used in the dashboard, its source dataset is specified, and a link is given if available.
• For each calculated variable used in the dashboard, clear documentation is available.
• It is clear which data reflect real measurements, and which reflect simulations, imputations, or predictions.
Bad faith characteristics:
• Source datasets may be specified, but little information about the use of their variables in the dashboard is provided.
• It is not made clear which dashboard variables are calculated, and how they are calculated is also not made clear.
• It is not clear which data reflect real measurements, and which data have been simulated, imputed, or otherwise fabricated.

Data classification
Good faith characteristics:
• Data are classified in ways that are intuitive to the consumer, and results are presented according to those classifications.
Bad faith characteristics:
• Data are classified in ways that either make the development work easier for the analyst, or serve to mask negative indicators.
• Data are not grouped into classifications consumers use, making it impossible to obtain summary statistics for these classification levels.

Navigation functions
Good faith characteristics:
• Navigation functions reflect how users conceive of accessing the entities in the dashboard.
• Specifically, map navigation reflects how users conceive of their geographic locale when searching for information.
• This allows consumers to intuitively ingest and assimilate information as they interact with the dashboard.
Bad faith characteristics:
• Navigation functions reflect how public officials want consumers to navigate the entities in the dashboard.
• This forces consumers to think differently about the topic, and disrupts their ability to ingest and assimilate information.
• Map functions force the consumer to conceive of their geographic locale in an unintuitive way, making map navigation confusing.

Metrics presented
Good faith characteristics:
• Metrics are intuitive to consumers.
• Metrics are presented in such a way that they are intuitive to ingest and assimilate.
• Metrics are presented to make comparisons between entities intuitive, to support decision-making.
Bad faith characteristics:
• Metrics reflect jargon, and are unintuitive to consumers.
• Metrics require consumers to read documentation to understand.
• Metrics are presented in such a way as to be confusing, making them impossible to be used for decision-support.
Table 2.
Proposed framework for evaluation of good faith and bad faith public dashboards.
Dashboard function/characteristic: ACCESS TO UNDERLYING DATA
MA Hosp: −5. MA Hosp Alt.: 5. Rio: 5.
Comment, MA Hosp vs. MA Alt.: The original solution had no access to underlying data (except by way of PDF-style reports). Alternative solution posts data publicly for download.
Comment, MA Hosp vs. Rio: Rio dashboard uses open data from City Hall with user-generated content collected through Waze.

Dashboard function/characteristic: Transparency of Underlying Data
MA Hosp: −5. MA Hosp Alt.: 3. Rio: 0.
Comment, MA Hosp vs. MA Alt.: It was difficult to identify the source of the data in the original solution. The alternative solution uses the same data, which is from NHSN. Because NHSN itself is somewhat opaque, the final solution lacks transparency.
Comment, MA Hosp vs. Rio: Unclear from the article, but it appears that it is possible to audit Rio dashboard design if a member of a certain role (e.g., public data scientist). Not all tools used were open source.

Dashboard function/characteristic: Data Classification
MA Hosp: 0. MA Hosp Alt.: 3. Rio: 5.
Comment, MA Hosp vs. MA Alt.: In informal user testing, public users found the data classifications much more intuitive in the alternative compared to the original solution. However, formal user testing was not conducted.
Comment, MA Hosp vs. Rio: Much effort was made to classify data in Rio dashboard to make it useful for the public to make route decisions.

Dashboard function/characteristic: Comparison Functions
MA Hosp: −5. MA Hosp Alt.: 5. Rio: 5.
Comment, MA Hosp vs. MA Alt.: In informal user testing, users found the comparison function in the alternative solution useful for decision-making, and could not find a comparison function in the original solution.
Comment, MA Hosp vs. Rio: Rio dashboard was designed to allow the public to make comparisons about potential traffic routes.

Dashboard function/characteristic: Navigation Functions
MA Hosp: 0. MA Hosp Alt.: 5. Rio: 5.
Comment, MA Hosp vs. MA Alt.: In informal user testing, users reported being able to easily navigate the data and dashboard in the alternative solution, but having extreme difficulty in navigating the original solution.
Comment, MA Hosp vs. Rio: Rio dashboard for the public had a very simple, intuitive interface with images and only a few metrics critical to decision-making. This made it possible to easily navigate the dashboard display.

Dashboard function/characteristic: Metrics Presented
MA Hosp: −5. MA Hosp Alt.: 5. Rio: 3.
Comment, MA Hosp vs. MA Alt.: In informal testing, users indicated that they did not understand the metrics presented on the original solution but found the color-coding of the alternative solution intuitive for decision-making.
Comment, MA Hosp vs. Rio: The few metrics presented on the Rio dashboard were geared specifically to helping the public make route decisions based on traffic metrics. However, no formal user testing is presented.

Total: MA Hosp: −20. MA Hosp Alt.: 26. Rio: 23.
Note: MA Hosp = original hospital-acquired infection (HAI) dashboard from the Commonwealth of Massachusetts
(MA), MA Hosp Alt. = alternative solution, Rio = Rio traffic dashboard [2], and NHSN = National Healthcare
Safety Network.
Table 3.
Application of rating system.
Figure 6.
Example of visualization of framework score comparison. Note: MA Hosp = original hospital-acquired
infection (HAI) dashboard from the Commonwealth of Massachusetts (MA), MA Hosp Alt. = alternative
solution, and Rio = Rio traffic dashboard [2].
did not serve our information needs, and essentially obscured the data it was sup-
posed to present. To address this challenge, we not only redesigned the dashboard
into a new prototype, but we also tested our proposed framework for evaluating the
level of good faith in public dashboards by applying it. Using our proposed frame-
work and rubric, we evaluated the original HAI dashboard, our redesigned proto-
type, and a public dashboard on another topic presented in the scientific literature
on the level of good faith implementation. Through this exercise, we demonstrated
that the proposed framework is reasonable to use when evaluating the level of good
faith in a public dashboard.
The next step in the pursuit of holding governments accountable for meeting
OGD standards in public dashboards is to improve upon this framework and rubric
through rigorous research. As part of this research, entire groups of individuals
could be asked to score dashboards on each of these characteristics, and the results
could easily be summarized to allow an evidence-based comparison between dash-
boards. Results can be easily visualized in a dumbbell plot (using packages ggplot2,
ggalt, and tidyverse [42–44]), which we have done with our individual scores, but
could be done with summary scores (Figure 6).
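Purely as an illustration (in Python with matplotlib rather than the R packages cited above), a comparable dumbbell-style comparison of the scores can be sketched as follows; the scores are those reported in Table 3, and everything else is an assumption.

```python
import matplotlib.pyplot as plt

characteristics = ["Access to data", "Transparency", "Data classification",
                   "Comparison functions", "Navigation functions", "Metrics presented"]
ma_hosp = [-5, -5, 0, -5, 0, -5]      # original MA HAI dashboard (Table 3)
ma_alt  = [ 5,  3, 3,  5, 5,  5]      # alternative solution (Table 3)
rio     = [ 5,  0, 5,  5, 5,  3]      # Rio traffic dashboard (Table 3)

fig, ax = plt.subplots(figsize=(7, 4))
y = range(len(characteristics))
for yi, a, b, c in zip(y, ma_hosp, ma_alt, rio):
    ax.plot([min(a, b, c), max(a, b, c)], [yi, yi], color="lightgray", zorder=1)  # connector
ax.scatter(ma_hosp, y, label="MA Hosp", zorder=2)
ax.scatter(ma_alt, y, label="MA Hosp Alt.", zorder=2)
ax.scatter(rio, y, label="Rio", zorder=2)
ax.set_yticks(list(y))
ax.set_yticklabels(characteristics)
ax.set_xlabel("Good faith rating (-5 to +5)")
ax.legend()
plt.tight_layout()
plt.show()
```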
As visualized in Figure 6 and summed in Table 3, our scoring system suggested
that the alternative HAI dashboard we developed was implemented with a level of
good faith (score = 26) similar to that of the Rio traffic dashboard (score = 23), and
that the original HAI dashboard appears to not have been implemented in good
faith (score = −20),
This exercise shows that the framework and rubric developed can be used to com-
pare the level of good faith in public dashboards, and to provide evidence-based
recommendations on how governments can improve them so they meet both the
spirit and the letter of OGD requirements.
6. Conclusion
Author details
© 2022 The Author(s). Licensee IntechOpen. This chapter is distributed under the terms
of the Creative Commons Attribution License (http://creativecommons.org/licenses/
by/3.0), which permits unrestricted use, distribution, and reproduction in any medium,
provided the original work is properly cited.
References
[1] Ruijer E, Détienne F, Baker M, et al. The politics of open government data: Understanding organizational responses to pressure for more transparency. The American Review of Public Administration. 2020;50:260-274

[8] Mullins CM, Hall KC, Diffenderfer SK, et al. Development and implementation of APRN competency validation tools in four nurse-led clinics in rural east Tennessee. Journal of Doctoral Nursing Practice. 2019;12:189-195
[31] Persson E. pdftables: Programmatic Conversion of PDF Tables. 2016. Available from: https://CRAN.R-project.org/package=pdftables [Accessed: March 20, 2021]

[32] Ooms J. pdftools: Text Extraction, Rendering and Converting of PDF Documents. 2020. Available from: https://cran.r-project.org/web/packages/pdftools/index.html [Accessed: March 20, 2021]

[36] US Census Bureau. QuickFacts. The United States Census Bureau. Available from: https://www.census.gov/programs-surveys/sis/resources/data-tools/quickfacts.html [Accessed: March 20, 2021]

[40] Chang W, Cheng J, Allaire JJ, et al. shiny: Web Application Framework for R. 2018. Available from: https://CRAN.R-project.org/package=shiny [Accessed: March 12, 2019]

[41] Cheng J, Karambelkar B, Xie Y, et al. leaflet: Create Interactive Web Maps with the JavaScript 'Leaflet' Library. 2021. Available from: https://CRAN.R-project.org/package=leaflet [Accessed: September 15, 2021]

[44] Wickham H. tidyverse: Easily Install and Load the 'Tidyverse'. 2021. Available from: https://CRAN.R-project.org/package=tidyverse [Accessed: September 15, 2021]
Section 2
Chapter 4
Intrusion Detection Based on Big Data Fuzzy Analytics
Abstract
In today's world, the Intrusion Detection System (IDS) is one of the significant tools
used to improve network security by detecting attacks or abnormal data accesses.
Most existing IDSs have disadvantages such as high false alarm rates and low
detection rates. For an IDS, dealing with distributed and massive data constitutes a
challenge; dealing with imprecise data is another. This paper proposes an Intrusion
Detection System based on big data fuzzy analytics; the Fuzzy C-Means (FCM)
method is used to cluster and classify the pre-processed
training dataset. The CTU-13 and the UNSW-NB15 are used as distributed and
massive datasets to prove the feasibility of the method. The proposed system shows
high performance in terms of accuracy, precision, detection rates, and false alarms.
1. Introduction
detection, Section 4 presents the used datasets, the proposed system and its com-
ponents, Section 5 illustrates the evaluation metrics and results of the tested system,
finally, Section 6 provides conclusions and further development of future work.
2. Related work
Several works of IDS using Big Data techniques exist. Jeong et al. [4] indicate
that Hadoop can solve intrusion detection and big data issues by focusing specifi-
cally on anomalous IDSs. The experience of Lee et al. [5] with Hadoop technologies
shows good feasibility as an intrusion detection instrument because they were able
to reach up to 14 Gbps for a DDOS detector. M. Essid and F. Jemili [6] combined
the alert bases KDD99 and DARPA and eliminated their redundancy, using Hadoop
for data fusion. Besides, R. Fekih and F. Jemili [7] used Spark to
merge and remove the redundancy of the three alerts bases KDD99, DARPA and
MAWILAB. The main objective was to improve detection rates and decrease false
negatives. Terzi et al. [8] created a new approach to unsupervised anomaly detec-
tion and used it with Apache Spark on Microsoft Azure (HDInsight21) to harness
scalable processing power. The new approach was tested on CTU-13, a botnet traffic
dataset, and achieved an accuracy rate of 96%. M. Hafsa and F. Jemili [9] created a
new approach to intrusion detection. They used Apache Spark on Microsoft Azure
(HDInsight21) to analyze and process data from the MAWILAB database. Their
new approach achieved an accuracy rate of 99%. Ren et al. [10] created a new
approach to unsupervised anomaly detection using the KDD'99 base to analyze and
process the data, but achieved a low detection rate. Rustam and Zahras [11]
compared two models, one supervised (the Support Vector Machine SVM model)
and the other unsupervised (Fuzzy C-Means FCM) to analyze, process, and detect
KDD’99 database intrusions. They found that SVM achieved an average accuracy
rate of 94.43%, while FCM achieved an average accuracy rate of 95.09%. In this
work, we propose an Apache Spark-based approach to detect intrusions. The goal of
our system is to provide an efficient intrusion detection system using Big Data tools
and fuzzy inference to treat uncertainties and provide better results.
Intrusion: An intrusion is any use of a computer system for purposes other than
those intended, usually through the illegitimate acquisition of privileges. The
intruder is generally seen as a stranger to the computer system who has managed to
gain control of it, but statistics show that the most common abuses come from
internal people who already have access to the system [12].
Intrusion detection: Intrusion detection has always been a major concern in
scientific papers [13, 14]. Using security auditing mechanisms, it consists of analyzing
the collected information in search of possible attacks.
Intrusion detection system (IDS): An intrusion detection system (IDS) is used to
detect and signal anomalous activities. This system protects a system from
malicious activities coming from known or unknown sources; this process is done
automatically in order to protect the confidentiality, integrity, and availability of
systems. Cannady et al. [15] state that an IDS has two detection approaches:
anomaly-based detection and signature-based detection. IDSs are characterized by
their surveillance domain, which can cover a corporate network, multiple
machines, or applications.
CTU-13 consists of a group of thirteen scenarios, each running a specific botnet
in a real network environment. Each scenario includes a botnet pcap file,
a tagged NetFlow file, a README file with the capture timeline, and the malware
run file. The NetFlow (network flow) file is based on bidirectional flows that
provide information about the communication between a source (a client) and a
destination (a server). This dataset includes three types of traffic with different
distributions: normal, botnet (or malware), and background:
• Normal traffic: comes from normal hosts that are previously verified and very
useful to check the actual performance of machine learning algorithms.
Figure 1.
Attack categories in CTU-13.
Types | Description
Fuzzers | The attacker attempts to cause a program or network to be suspended by feeding it randomly generated data.
Analysis | It penetrates web applications via ports (port scan), web scripts (HTML files), and emails (spam).
Generic | A technique that works against every block-cipher, using a hash function to cause a collision, without regard to the configuration of the block-cipher.
Reconnaissance | It gathers information about a computer network to evade its security controls.
Shellcode | The attacker penetrates a small piece of code starting from a shell to check the compromised machine.
Worm | The attack replicates itself in order to spread to other computers. Frequently, it uses a computer network to spread itself, relying on security failures on the target computer to access it.
Table 1.
UNSW-NB15 attack types.
Figure 2.
Speed comparison chart between Spark and Hadoop.
Figure 3.
Apache Spark ecosystem.
The FCM algorithm is one of the most widely used fuzzy clustering algorithms
[23] which attempts to partition a finite collection of elements into a collection of c
fuzzy clusters with respect to some given criterion. This algorithm is based on
minimization of the following objective function:
J_m = \sum_{i=1}^{N} \sum_{j=1}^{C} u_{ij}^{m} \left\lVert x_i - c_j \right\rVert^{2}, \quad 1 \le m < \infty   (1)

where:
• ||*||: any norm expressing the similarity between any measured data and the center;
• u_ij: the degree of membership of data point x_i in cluster j;
• c_j: the center of cluster j;
• m: the fuzziness exponent;
• N, C: the number of data points and the number of clusters, respectively.
The FCM algorithm is composed of the following steps:
Figure 4.
Pseudo code of the FCM algorithm.
At each iteration, the degrees of membership are updated as

u_{ij} = \frac{1}{\sum_{k=1}^{C} \left( \frac{\lVert x_i - c_j \rVert}{\lVert x_i - c_k \rVert} \right)^{\frac{2}{m-1}}}

where c_k is the center of cluster k and the remaining symbols are as defined for Eq. (1).
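As a compact illustration of these update equations (a sketch only, not the chapter's Spark-based implementation), the core FCM iteration can be written in a few lines of Python/NumPy:

```python
import numpy as np

def fuzzy_c_means(X, C=2, m=2.0, max_iter=100, tol=1e-5, seed=0):
    """Minimal FCM: X is an (N, d) array; returns memberships U (N, C) and cluster centers."""
    rng = np.random.default_rng(seed)
    U = rng.random((X.shape[0], C))
    U /= U.sum(axis=1, keepdims=True)                   # each point's memberships sum to 1
    for _ in range(max_iter):
        Um = U ** m
        centers = (Um.T @ X) / Um.sum(axis=0)[:, None]  # weighted cluster centers
        dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2) + 1e-12
        # Membership update: u_ij = 1 / sum_k (d_ij / d_ik)^(2/(m-1))
        U_new = 1.0 / ((dist[:, :, None] / dist[:, None, :]) ** (2.0 / (m - 1.0))).sum(axis=2)
        if np.abs(U_new - U).max() < tol:
            return U_new, centers
        U = U_new
    return U, centers

# Example: two clusters (0 = normal, 1 = intrusion) on pre-processed feature vectors.
# U, centers = fuzzy_c_means(features_array, C=2)
# labels = U.argmax(axis=1)
```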
5. Proposed method
We will be using Jupyter Notebook with Apache Spark and the Python API (PySpark).
In this stage, we read CSV files and convert them to Apache Parquet format in
Microsoft Azure Blob Storage. Apache Spark supports multiple operations on data
and offers the ability to convert data to another format in just one line of code.
Developed by Twitter and Cloudera, Apache Parquet is an open-source columnar file
format optimized for query performance and minimizing I/O, offering very efficient
compression and encoding schemes [26]. Figure 6 shows the efficiency of using the
Parquet format; this format minimizes storage costs and data processing time.
Figure 5.
Diagram of proposed approach.
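A minimal PySpark sketch of this conversion step is shown below; the paths and options are placeholders rather than the exact code used.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("csv-to-parquet").getOrCreate()

# Read the raw CSV flow records and rewrite them as Parquet; the wasbs:// paths
# are placeholders for the Azure Blob Storage containers.
df = spark.read.csv("wasbs://data@account.blob.core.windows.net/raw/*.csv",
                    header=True, inferSchema=True)
df.write.mode("overwrite").parquet("wasbs://data@account.blob.core.windows.net/parquet/")
```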
Table 2 indicates the old and new size of each dataset after converting to Apache
Parquet. We notice that by converting CSV to Parquet, the storage costs are
minimized.
The feature selection phase selects the relevant attributes required for decision
making. A pre-processing phase converts the flow records into a specific format
that is acceptable to an anomaly detection algorithm [27].
Figure 6.
Apache Parquet advantages.
Table 2.
Average file size before and after converting.
with the CTU-13 dataset, we did not utilize a feature selection algorithm; we instead
selected columns that were pertinent and deleted unnecessary features (empty
columns). After the removal, we ended up with a total of 13 columns.
Using the UNSW-NB15 dataset, we addressed the feature selection problem by
applying a fusion of the Random Forest algorithm with a Decision Tree classifier.
V. Kanimozhi [28] reports that the combined fusion of these two algorithms provides
98.3% accuracy and lists the best four features as sbytes, sttl, sload, and
ct_dst_src_ltm; Figure 7 shows the graphical representation of the feature
importances and the top four features.
The goal of eliminating non-useful attributes is to bring about better performance
by the system, with better accuracy.
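For reference, ranking features by Random Forest importance can be sketched as follows in Python with scikit-learn; the file path is a placeholder, and the sketch is illustrative rather than the chapter's own pipeline (which, per the text, fused Random Forest with a Decision Tree classifier).

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

df = pd.read_csv("UNSW_NB15_training-set.csv")                  # illustrative path
X = pd.get_dummies(df.drop(columns=["label", "attack_cat"]))    # one-hot encode string columns
y = df["label"]

rf = RandomForestClassifier(n_estimators=100, random_state=42).fit(X, y)
importances = pd.Series(rf.feature_importances_, index=X.columns).sort_values(ascending=False)
print(importances.head(4))   # the cited work found sbytes, sttl, sload, ct_dst_src_ltm among the top features
```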
This redundancy elimination task involves removing duplicates (all repeated
records), which helps with attack detection as it makes the system less biased by the
existence of more frequent records. This tactic also makes computation faster, as the
system must deal with less data [29].
Before the merge of the bases, some common columns have different names from
one database to another (for example, "Label" in CTU-13 is named "attack_cat" in
UNSW-NB15). In this case, we rename these attributes and then merge our bases,
since Apache Spark offers the ability to join our databases in just one line of code.
The following listing shows the query used:
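A minimal PySpark sketch of the de-duplication, renaming, and merge steps (not the exact listing used; the DataFrame variables ctu13_df and unsw_df and the merge-by-name approach are assumptions):

```python
# Remove repeated records in each base, align column names, then merge.
ctu13 = ctu13_df.dropDuplicates()
unsw = unsw_df.withColumnRenamed("attack_cat", "Label").dropDuplicates()

# unionByName merges by column name; allowMissingColumns keeps columns present in only one base.
merged = ctu13.unionByName(unsw, allowMissingColumns=True)
```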
Using the Apache Spark Machine Learning library, we create a Machine Learn-
ing pipeline. A pipeline is a sequence of stages where each stage is either an Esti-
mator or a Transformer.
Figure 7.
Feature importance of UNSW-NB15 dataset.
In our final base, some attributes are of type string (e.g., Label, sport, proto). In
this step, we convert all attributes of type string to attributes of integer type by
using the "StringIndexer" transformer, which encodes a string column of labels into
a column of label indices. These indices are ordered by label frequencies; the most
frequent label gets index 0.
StringIndexer classifies attacks automatically into classes, assigning the same index
to attacks of the same category.
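A minimal sketch of this indexing step with the Spark ML Pipeline API (column names are illustrative):

```python
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer

string_cols = ["Label", "sport", "proto"]          # string-typed columns to encode
indexers = [StringIndexer(inputCol=c, outputCol=c + "_idx", handleInvalid="keep")
            for c in string_cols]

indexed = Pipeline(stages=indexers).fit(merged).transform(merged)
# In each indexed column, the most frequent value receives index 0.0.
```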
In our experimental work, as stated above, we use Microsoft Azure as a cloud
environment to upload and analyze the dataset with the FCM algorithm. We use
the training dataset to form and evaluate our model; the test dataset is then used to
make predictions. We choose to train our model with the FCM algorithm. The first
stage of the FCM algorithm is to initialize the input variables: the input vector
includes the dataset features, the number of clusters is 2 (1 = intrusion and
0 = normal), and the center of each cluster is calculated by taking the means of all
features in the final dataset. The use of the fuzzy C-means clustering algorithm to
classify data will generate a number of clusters, each of which contains part of the
data records [30]. The characteristics of normal and intrusion data records are
different, so they should be in different clusters, as shown in Figure 8, which
presents the data records clustering.
Figure 8.
The data records clustering.
Table 3.
Evaluation metrics.
As shown in Table 4, the total input data is 845 721 records: 243 899 records labeled
as normal and 601 822 records labeled as intrusion. After applying the FCM
algorithm, the result is 231 704 records classified as normal and 589 785 records
classified as intrusion. We then calculated the normal and intrusion classification
rate by the following equation:
The simulation results show that the classification rate is 96.4% with the FCM
algorithm, which means that the false positive rate (the rate of instances that are
falsely classified) is 0.02%.
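Assuming the standard confusion-matrix definitions, these evaluation metrics can be computed as in the following sketch (illustrative, not the chapter's code):

```python
def ids_metrics(tp, tn, fp, fn):
    """Standard IDS evaluation metrics from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)          # detection rate
    false_alarm = fp / (fp + tn)     # false positive rate
    return accuracy, precision, recall, false_alarm

print(ids_metrics(tp=90, tn=85, fp=5, fn=10))   # toy counts, for illustration only
```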
Table 4.
Evaluation metrics.
6. Discussion
It is possible to obtain a very precise system (accuracy of 99%) that is nevertheless
not very efficient, with a recall of 10%. In our work, with an accuracy of 97.2% and
a recall of 96.4%, we can say that our system is efficient. The use of the fuzzy
algorithm in this experiment gave a good result. The advantage of our system is the
fuzzy representation, which is increasingly used to deal with the missing and
inaccurate data problems that are the weakness of most classification algorithms.
In this paper, we achieved a successful distributed IDS. Using the FCM algorithm
allows us to effectively train and analyze our model after merging the datasets.
Proposing a distributed system and showing the power of Spark to combine and
handle large and heterogeneous training datasets are the main merits of our work.
In future work, we will perform our dataset analysis with another Big Data
framework that is expected to reach faster results. In addition, we will develop our
approach with other classifiers to get better results.
Author details
© 2021 The Author(s). Licensee IntechOpen. This chapter is distributed under the terms
of the Creative Commons Attribution License (http://creativecommons.org/licenses/
by/3.0), which permits unrestricted use, distribution, and reproduction in any medium,
provided the original work is properly cited.
References
Chapter 5
Artificial Intelligence and IoT: Past, Present and Future
Abstract
Artificial intelligence (AI) approaches have recently made major impacts in the
healthcare field, igniting a heated discussion over whether AI physicians would
eventually replace human doctors. Human doctors are unlikely to be replaced by
machines anytime soon, but AI may help physicians make better clinical decisions
or even replace human judgment in certain areas of healthcare (e.g., radiology).
The increased availability of healthcare data and the rapid development of
big data analysis tools have made recent productive applications of AI in health-
care possible. When driven by appropriate clinical queries, powerful AI systems
may find clinically valuable information hidden in enormous volumes of data,
which can help clinical decision making. The internet of things (IoT) is a network
of many interconnected things that may communicate with one another across a
computer network. We may get information from this global network by connect-
ing sensors to it. Thanks to the computer network, we can obtain this information
from anywhere on the planet. The internet of things (IoT) enables physical objects
to connect to the internet and create systems using various technologies such as
near-field communication (NFC) and wireless sensor networks (WSN).
Keywords: WSN, artificial intelligence, deep learning, IoT and health sector
1. Introduction
characteristics, each application has its own design concept and execution to meet
its own demands. The qualities of WSN, together with technology improvements,
give the greatest benefits to healthcare. A sensor network designed to identify
human health indicators is known as a body sensor network (BSN). Because BSN
nodes are directly attached to the human body, considerable vigilance is required.
For many days, several healthcare applications need the BSN to collect patient data
indefinitely without user intervention. Such applications must take into consider-
ation the energy constraints of sensor networks [1–15].
The challenges of WBS networks: Healthcare is a need for everyone's quality of
life in today's environment. The population of developed countries is growing at a
pace that is proportional to the government’s budget. Healthcare systems will face
challenges as a result of this. One of the most difficult challenges is making health-
care more accessible to elderly people who live alone. In general, health monitoring
is done on a check-in basis, with the patient remembering their symptoms; the
doctor performs tests and develops a diagnosis, then monitors the patient’s progress
throughout therapy. Wireless sensor network applications in healthcare provide in-
home assistance, smart nursing homes, clinical trials, and research advancement.
Let’s take a look at some of the challenges and basic features of BSNs before we get
into the medical uses of this technology. In healthcare applications, low power, lim-
ited computing, security and interference, material restrictions, resilience, continu-
ous operation, and regulatory requirements with elderly people are all issues.
Modern modelling approaches such as fuzzy logic (FL) and artificial neural
networks (ANN) are frequently employed in hydrological modelling for a number
of applications. The main benefit of these techniques is that they are not con-
strained by restrictive assumptions such as linearity, normality, or homoscedasticity
and that they provide promising and acceptable alternatives to classical stochastic
hydrological modelling in time series analysis, such as the autoregressive moving
average with exogenous inputs (ARMAX) model. When applied to hydrological
systems, however, traditional
stochastic models have many drawbacks, the most notable of which are short-time
dependence and the normality assumption, as previously mentioned. Hydrological
processes are well recognised for defying these assumptions. ANNs have been
recognised as a tool for modelling difficult nonlinear systems and are widely used
for hydrological prediction. Their applications range from forecasting hourly and
daily river stages to further FL modelling applications in: rainfall-runoff groundwa-
ter; and time series modelling [6–10]. Fuzzy neural networks (FNN) are a unique
approach for river flow prediction that blends FL and ANNs.
Because they can estimate any continuous function to any degree of accuracy,
the Mamdani and Takagi-Sugeno (TS) systems are referred to as universal
approximators. The smaller the error tolerance, the more fuzzy rules are necessary.
In practice, fuzzy models can always yield nonlinear modelling solutions when the
required number of fuzzy sets and rules are provided. In comparison to the TS
approximator, the Mamdani approximator has the benefit of being able to use both
numerical and verbal data produced from human knowledge and experience.
When nontrapezoidal/nontriangular input fuzzy sets are used, TS fuzzy systems
may be more cost-effective than Mamdani fuzzy systems in terms of input fuzzy
sets and fuzzy rules. They discovered that TS and Mamdani fuzzy systems have
comparable minimal system configurations when trapezoidal or triangular input
fuzzy sets are used. The performance of Mamdani (linguistic) and TS (clustering-
based) fuzzy models was examined in the spatial interpolation of mechanical
features of rocks. In terms of prediction performance, the clustering-based TS fuzzy
modelling technique beats the Mamdani model, according to their results. The
main purpose of this study is to develop a hybrid model for streamflow forecasting
that incorporates both a genetic algorithm and fuzzy logic. Genetic algorithms and
neural networks (NNs) were used to train the Mamdani and Takagi-Sugeno fuzzy
logic modelling systems, respectively. According to the comparison, the Mamdani
approach beats standard methods in terms of avoiding restrictive assumptions,
insight into the modelling structure, and modelling accuracy.
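As a toy illustration of the Mamdani idea (fuzzify the inputs, fire the rules with min implication, aggregate with max, and defuzzify by centroid), and not a hydrological model or any of the systems compared above, consider the following Python sketch:

```python
import numpy as np

def trimf(x, a, b, c):
    """Triangular membership function with feet at a and c and peak at b."""
    x = np.asarray(x, dtype=float)
    return np.maximum(np.minimum((x - a) / (b - a), (c - x) / (c - b)), 0.0)

# Output universe: a normalized streamflow forecast in [0, 1].
y = np.linspace(0.0, 1.0, 101)
flow_low, flow_high = trimf(y, 0.0, 0.2, 0.5), trimf(y, 0.5, 0.8, 1.0)

def mamdani_forecast(rain):
    """Toy Mamdani system with two rules: low rainfall -> low flow, high rainfall -> high flow."""
    mu_low = trimf(rain, 0.0, 0.2, 0.6)     # fuzzify the (normalized) rainfall input
    mu_high = trimf(rain, 0.4, 0.8, 1.0)
    # Min implication clips each output set; max aggregates the clipped sets.
    aggregated = np.maximum(np.minimum(mu_low, flow_low), np.minimum(mu_high, flow_high))
    return float((aggregated * y).sum() / (aggregated.sum() + 1e-12))   # centroid defuzzification

print(mamdani_forecast(0.8))   # heavier rainfall yields a higher forecast flow
```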
Health is always a major concern as the human race improves in terms of
technology. The current coronavirus outbreak, which has hurt China’s economy
to some extent, exemplifies how healthcare has become more vital. In areas
where the pandemic has spread, it is always preferable to monitor these people
using remote health monitoring equipment. As a consequence, the current solu-
tion is a health monitoring system based on the IoT. Remote patient monitoring
provides for patient monitoring outside of typical clinical settings (for example,
at home), increasing accessibility to human services offices while cutting
expenses. This project’s primary purpose is to design and build a smart patient-
health monitoring system that uses sensors to monitor patient health and the
internet to alert loved ones if there are any issues. The purpose of developing
monitoring systems is to reduce healthcare costs by reducing the number of
needed inspections. In an IoT-based framework, different consumers may be able
to see sensitive aspects of the patient's wellbeing, because the information can be
checked by visiting a website or URL. In GSM-based patient observation, the
measured parameters are communicated using GSM through SMS techniques.
In most rural areas, the medical facility would not be within walking distance of
the residents. As a consequence, the majority of people must attend doctor’s visits,
stay in hospitals, and undergo diagnostic testing procedures. Each of our bodies
uses temperature and pulse recognition to determine our overall health. The sensors
are linked to a microcontroller that monitors the state and is interfaced to a liquid
crystal display (LCD) as well as a remote connection that may send alarms. If the
framework detects any unusual changes in heart rate or body temperature, it warns
the client through IoT and also shows details of the patient's pulse and temperature
on the web in real time. An IoT-based patient wellness monitoring framework
efficiently leverages the web to monitor patient wellbeing metrics and save time in
this manner. Minor health concerns, indicated in their early stages by changes in
important parameters such as body temperature and pulse rate, can otherwise easily
be disregarded.
When a person’s health condition has developed to the point that his or her life
is in peril, they seek medical assistance, perhaps wasting money. This is crucial to
consider, especially if an epidemic spreads to a place where doctors are unavailable.
Giving patients a smart sensor that can be monitored from afar to avoid the spread
of sickness would be a realistic solution that might save many lives. Sensors monitor
physiological signs, which are transformed into electrical impulses when a patient
enters the healing centre. The basic electrical flag is then updated to an advanced
flag (computerised data), which is then stored in RFID. To transfer computerised
data to a local server, the Zigbee Protocol is employed. For this framework, Zigbee
is a good choice. In this location, there are the most cell hubs. It’s better for gadgets
that are smaller and use less energy. A nearby server sends information to the
therapeutic server through WLAN.
When the data is transmitted to the therapeutic server, it checks to see whether
the patient already has a medical record, then adds the new information to that
record and sends it to the specialist. If the patient has not had any prior treatment
records, the server creates a new ID and stores the data in its database. The IoT is
becoming more widely recognised as a feasible solution for distant value tracking,
notably in the field of health monitoring. It permits the secure cloud storage of
individual health parameter data, the decrease of hospital visits for routine checks,
and, most critically, the remote monitoring and diagnosis of sickness by any doctor.
In this research, an IoT-based health monitoring system was developed. Body
temperature, pulse rate, and room humidity and temperature were all measured by
sensors and shown on an LCD. The sensor data is then wirelessly sent to a medical
server. These data are then delivered to a smartphone with an IoT platform that
belongs to an authorised person. Based on the findings, the doctor diagnoses the
condition and the patient’s current state of health.
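As a schematic illustration of the alerting logic such systems describe (the thresholds, simulated sensor readings, and server endpoint below are all placeholders, not a description of any specific device):

```python
import json
import random
import time
import urllib.request

MEDICAL_SERVER = "http://example.org/api/vitals"      # placeholder endpoint, not a real service

def read_sensors():
    """Stand-in for real sensor drivers: returns simulated vital signs."""
    return {"body_temp_c": round(random.uniform(36.0, 39.5), 1),
            "pulse_bpm": random.randint(55, 130)}

def check_vitals(v):
    alerts = []
    if v["body_temp_c"] > 38.0:
        alerts.append("high body temperature")
    if not 60 <= v["pulse_bpm"] <= 100:
        alerts.append("abnormal pulse rate")
    return alerts

for _ in range(5):                                      # a real device would loop continuously
    vitals = read_sensors()
    req = urllib.request.Request(MEDICAL_SERVER, data=json.dumps(vitals).encode(),
                                 headers={"Content-Type": "application/json"})
    # urllib.request.urlopen(req)                       # uncomment to actually post to the server
    for alert in check_vitals(vitals):
        print(f"ALERT: {alert} -> notify doctor/caregiver {vitals}")
    time.sleep(1)
```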
The advantages of AI have been extensively researched in the medical literature.
Using complicated algorithms, AI can ‘learn’ characteristics from a large quantity
of healthcare data and then apply the results to clinical practise. It might potentially
include learning and self-correcting capabilities to improve accuracy as input
changes. AI systems that give up-to-date medical information from journals, text-
books, and clinical practises may support physicians in providing proper patient
care. In addition, an AI system might help to reduce diagnostic and therapeutic
errors, which are inevitable in human clinical practise. Furthermore, an AI system
extracts important data from a large patient population to assist in the generation of
real-time health risk warnings and prediction findings [11–15].
In this chapter, we look at the current level of AI in healthcare and predict its
future. First, we will go over four crucial factors from a medical researcher's
perspective: (1) justifications for AI use in healthcare, and (2) the sorts of data that
AI systems have examined. AI devices may be classified into two classes, according
to the previous description. The first category includes machine learning (ML)
approaches that analyse structured data such as imaging, genomics, and EP data.
ML algorithms are used in medical applications to cluster people's traits or forecast
the probability of sickness consequences. The second category includes natural
language processing (NLP) tools, which extract information from unstructured
data such as clinical notes and medical journals to supplement and enrich organised
medical data. Texts are converted into machine-readable structured data, which
may then be analysed using ML algorithms.
Lung- and heart-related ailments are at the top of the list of health-related
problems/complications. Wireless technology, which is a relatively new concept,
may be used to track one’s health. Wireless health monitoring systems make use
of wearable sensors, portable remote health systems, wireless communications,
and expert systems, among other technologies. Life is valuable, even a single life
is valuable, but people are dying due to the lack of health facilities, sickness
awareness, and sufficient access to healthcare systems. In all conditions, the
IoT assists in the identification of diseases and the treatment of patients. In IoT
healthcare systems, there are wireless systems in which different applications and
sensors are linked to patients, information is gathered, and the information is
communicated to a doctor or specialist through an expert system. Medical devices
for the Internet of Things (MD-IoT) are connected to the Internet and use sensors,
actuators, and other communication devices to monitor patient health. The expert
system uses these devices to transfer patient data and information to a secure
cloud-based platform, where it is stored and analysed.
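To make the device-to-cloud path concrete, here is a minimal server-side sketch (using Flask as one possible choice, not the platform described above) that accepts JSON readings from a device and keeps them for later analysis; the route name and the in-memory store are assumptions for illustration only.

from flask import Flask, jsonify, request

app = Flask(__name__)
readings = []  # stand-in for a secure cloud database


@app.route("/api/readings", methods=["POST"])
def store_reading():
    # Accept one JSON reading from a device and keep it for later analysis.
    reading = request.get_json(force=True)
    readings.append(reading)
    return jsonify({"stored": len(readings)}), 201


if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)

A production platform would add authentication, encryption, and persistent storage before exposing such an endpoint.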
Telemedicine is the practice of providing care when the clinician and the patient are not physically present with each other; it is defined as 'the delivery of healthcare services at a distance'. Telemedicine provides a variety of benefits, but it also has notable disadvantages. Providers, payers, and politicians
all recognise the difficulty of navigating some grey zones. While the sector will
rapidly develop over the next decade, it will also provide practical and technologi-
cal challenges. In certain circumstances, the IoT is the most trustworthy and cost-effective alternative, and the connections between different devices and interactive communication systems also need further formal examination. By communicating information to healthcare teams such as doctors, nurses, and specialists, IoT and mobile technologies make it easier to monitor a patient's health. Professionals would benefit from using the store-and-forward method to collect and save patient data that can be accessed at any time.
A smart healthcare system is a piece of technology that enables patients to be
treated while also improving their overall quality of life. The smart health concept
incorporates the e-health concept, which emphasises a number of technologies
such as electronic record management, smart home services, and intelligent and
medically connected items. Sensors, smart devices, and expert systems all help
to create a smart healthcare system. Healthcare facilities are a major concern in today's world, especially in developing countries where rural areas lack access to high-quality hospitals and medical experts. Artificial intelligence has benefited health in the same way it has benefited other aspects of life. The IoT is expanding its capabilities in many areas, including smart cities and smart healthcare. The IoT is now being used in healthcare for remote monitoring and real-time health systems. Controlling and preventing catastrophic events, such as the coronavirus disease (COVID-19) pandemic that ravaged the world in 2020, may be done via IoT technologies without imposing severe restrictions on people and enterprises. COVID-19 causes respiratory symptoms, as SARS did in 2003, but appears to be more contagious. One way to restrict viral transmission until a vaccine is developed is to keep a close eye on physical (or social) distancing.
Improved surveillance, healthcare, and transportation networks will make it
less probable for contagious diseases to spread. An IoT system combined with
artificial intelligence (AI) may give the following advantages when considering
a pandemic: (1) utilising surveillance and image recognition technologies to
enhance public security, (2) using drones for supply, transportation, or disinfec-
tion, and (3) leveraging AI-powered apps and platforms to monitor and limit
people’s access to public places.
In healthcare, an IoT system is often made up of a number of sensors that are
all connected to a computer and allow real-time monitoring of the environment or
patients. AI-assisted sensors might be employed in the case of a pandemic to help
predict whether or not people are sick based on symptoms like body temperature,
coughing patterns, and blood oxygen levels. The ability to monitor people’s loca-
tions is another useful function. During an outbreak of severe disease, tracking the
distance between people may provide vital information. Using technologies like
Bluetooth, we can get a good estimate of how much distance people maintain when
walking in public places. This information might be used to target people who are
not physically separated by a specified distance, such as 2 m, to stop the virus from
spreading further. To prevent the abuse of personal information, security and data
management must be addressed throughout the development of such platforms.
Following a pandemic, governments may try to use these platforms and data for
long-term monitoring to control and monitor people’s behaviour.
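Such distance estimates are commonly derived from Bluetooth received signal strength (RSSI) with a log-distance path-loss model. The Python sketch below shows that calculation; the calibration constants (measured power at 1 m and the path-loss exponent) are assumed values that would need per-device calibration in practice.

def estimate_distance_m(rssi_dbm, measured_power_dbm=-59.0, path_loss_exponent=2.0):
    # Log-distance path-loss model: d = 10 ** ((P_1m - RSSI) / (10 * n)).
    return 10 ** ((measured_power_dbm - rssi_dbm) / (10 * path_loss_exponent))


def too_close(rssi_dbm, threshold_m=2.0):
    # Flag a contact whose estimated distance falls below the 2 m guideline.
    return estimate_distance_m(rssi_dbm) < threshold_m


print(too_close(-62.0))  # about 1.4 m with the assumed constants, so True

Because RSSI is noisy, real contact-tracing systems average many samples and tune these constants per device model.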
One of the problems with traditional medical diagnosis is its inaccuracy and
imprecision, which has resulted in the deaths of thousands of people. The develop-
ment of various algorithms, models, and technologies to ensure accuracy and preci-
sion has considerably reduced the number of people who die every day in hospitals,
and fuzzy logic, a branch of artificial intelligence, is one of these technologies.
Medical diagnostic processes are carried out with the use of computer-assisted
technologies, which are growing more common by the day. These systems are
based on AI and are designed to diagnose as well as recommend treatments based
on symptoms. Many decision support systems (DSSs) have been developed in the
medical field, such as Aaphelp, Internist I, Mycin, Emycin, Casnet/Glaucoma, Pip,
Dxplain, Quick Medical Reference, Isabel, Refiner Series System, and PMA, to
assist medical practitioners in their decisions for diagnosis and treatment of various
diseases.
The medical diagnostic system (MDS) is an expert system of this kind, used to diagnose various ailments. Fuzzy logic was chosen as the AI tool in the proposed system because it is one of the most efficient qualitative computational approaches, and it has proved to be one of the most effective techniques for bringing clarity to the medical field. Fuzzy medical applications include CADIAG, MILORD,
DOCTORMOON, TxDENT, MedFrame/CADIAG-IV, FuzzyTempToxopert, and
MDSS, to name a few.
AI can also help determine which antigens should be included in a vaccine, allowing the immune system to learn and prepare for specific
antigens. Antigens already found in pathogens that may be related to antigens for a
new infection may also be recognised by AI, speeding up the process even further.
AI is assisting in the development of vaccines by simplifying the comprehension of viral protein structures and by helping clinical professionals sift through hundreds of relevant study findings faster than would otherwise be achievable. The ability to understand the structure of a virus may aid in the creation of effective vaccines.
AI and soft computing (SC) play a crucial role in healthcare medical diagnostics. Doctors today cannot advance without the help of technology, and this digital advancement will be incomplete if AI and SC are not included. AI is a set of techniques for constructing intelligent machines, while SC is a collection of computational methods for perceiving and learning from real-world information that helps computers realise AI. Consequently, if human reasoning can be expressed using AI and SC techniques, a computer can perform as well as a person. In the healthcare sector, this technological development is being used for long-term medical diagnosis. John McCarthy, who coined the term, defined AI as 'the science and engineering of making intelligent machines, especially intelligent computer programs'. Artificial intelligence systems are computer programmes that can mimic human cognitive processes.
In the early phases of AI, philosophy, potential, demonstrations, dreams, and imagination all played a part, and the field developed in response to a variety of competing needs, possibilities, and interests. In a range of fields, including healthcare, AI combined with analytics (AIA) is becoming increasingly widely employed. Medicine was one of the most successful applications of analytics, and it is now a promising AI application sector. As early as the mid-twentieth century, clinical applications were designed and provided to physicians to assist them in their practice. Among these applications are clinical decision support systems, automated surgery, patient monitoring and assistance, healthcare administration, and others. The current methodologies are mostly focused on knowledge discovery via data and ML, ontologies and semantics, and reasoning, as we will see in the next sections. In this chapter, we look at how AI has advanced in healthcare over the last 5 years.
Data mining, ontologies, semantic reasoning, and ontology-extended clinical
recommendations, clinical decision support systems, smart homes, and medical
big data will be the focus of the examination. The artificial intelligence topics covered in our study were not chosen at random; indeed, they have attracted strong interest in medicine in recent years. Data mining methods are used in learning and prediction, as well as in image and speech processing and anything involving emotion and sentiment. Because of their support for reasoning and their use as a means of learning, sharing, reuse, and integration, ontologies have gained momentum in medicine. Clinical decision support systems that help improve the quality of treatment in clinical practice draw on both disciplines. They
are also used in smart homes to help those with cognitive impairments with daily
tasks. Big data in medicine is becoming increasingly common, and its application in
analytics is unavoidable.
Electricity engineers formerly concentrated their efforts on the production
and transmission levels, with the distribution system receiving less attention.
Engineers have only recently been provided with the tools necessary to cope with
the computational burden of distribution systems to undertake realistic modelling
and simulation. The majority of primary distribution systems are built in a radial configuration, with each load point supplied from one end. The radial system is the simplest and most widely used and allows effective coordination of its protective
systems. Fuzzy set theory has been developed and used in a range of engineering
and non-engineering domains where the evaluation of actions and observations is
'fuzzy' in the sense that no clear boundaries exist. Fuzzy set theory allows imprecise evaluations and observations to be represented and then used to describe and solve problems.
Applying fuzzy set theory to distribution system analysis allows professional judgement and prior knowledge to inform distribution system planning, design, and operations. Future computer technology will be considerably more advanced than anything we can envision today. The IoT is one of the most cutting-edge technologies, with IoT-enabled things all around us. With the help of RFID (radio frequency identification) and sensors, these devices will create their own environment in which everything is managed and transmitted over the Internet. The enormous amount
of data created will be recorded, analysed, and presented in a timely, seamless, and
understandable way. Cloud computing will provide us with virtual infrastructure
for visualisation platforms and utility computing, enabling us to integrate device
monitoring, storage, client delivery, analytics tools, and visualisation in one place.
Cloud computing, which will provide an end-to-end solution, will allow users
and businesses to access applications on-demand from anywhere. One of the
most important IoT applications is in the field of healthcare. We designed a health
monitoring device using current low-cost sensors to monitor human health parameters such as heart rate, temperature, and air quality, and a fuzzy logic approach was used to interpret the readings. Lotfi Zadeh first introduced the concept of fuzzy logic in 1965.
Fuzzy logic is a kind of multivalued logic with truth values ranging from 0 to 1.
Fuzzy logic deals with the concept of partial truth, in which the truth value varies
from completely false to completely true. The fuzzy logic technique includes fuzzification, inference, and defuzzification. The sensors capture crisp input data, which membership functions convert into fuzzy input sets described by linguistic terms and linguistic variables. Inference is then performed by applying IF-THEN rules, and defuzzification uses the output membership functions to convert the fuzzy output back into a crisp output.
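The following minimal Python sketch illustrates the fuzzification and rule-evaluation steps just described, assuming triangular membership functions over body temperature and two illustrative IF-THEN rules; the linguistic labels and breakpoints are assumptions, not values taken from the system described here.

def triangular(x, a, b, c):
    # Triangular membership function with feet at a and c and peak at b.
    if x <= a or x >= c:
        return 0.0
    if x <= b:
        return (x - a) / (b - a)
    return (c - x) / (c - b)


def fuzzify_temperature(temp_c):
    # Map a crisp temperature reading onto linguistic terms (degrees in [0, 1]).
    return {
        "normal": triangular(temp_c, 35.5, 36.8, 37.6),
        "fever": triangular(temp_c, 37.2, 38.5, 41.0),
    }


def evaluate_rules(temp_c):
    # IF temperature IS fever THEN risk IS high; IF temperature IS normal THEN risk IS low.
    degrees = fuzzify_temperature(temp_c)
    return {"risk_low": degrees["normal"], "risk_high": degrees["fever"]}


print(evaluate_rules(38.0))  # risk_high is about 0.62 here, risk_low is 0.0

The rule strengths produced here feed the inference and defuzzification steps discussed below.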
Vital signs are the four most important markers that reveal the condition of the
body’s vital functions. These measurements are used to assess a person’s general
physical well-being, detect probable diseases, and monitor healing progress. The
fuzzy inference system is a computer framework that makes choices based on fuzzy
set theory, fuzzy if-then logic, and fuzzy reasoning. Over the last decade, fuzzy
set theory has advanced in many directions, with applications in taxonomy, topol-
ogy, linguistics, automata theory, logic, control theory, game theory, information
theory, psychology, pattern recognition, medicine, law, decision analysis, system
theory, and information retrieval, to name a few. A fuzzy inference requires three
parts: a membership function generation circuit that calculates the goodness of
fit between an input value and the membership function of an antecedent part, a
minimum value operation circuit that finds an inference result for each rule, and
a maximum value operation circuit that integrates a plurality of inference results.
When these components are combined into a system, the system can do inference.
For each externally supplied input value, the membership function generation circuit creates one membership value. The decision-making logic of the fuzzy
inference machine is crucial, and it may be the system’s most adaptive component.
The fuzzification interface corresponds to our sensory organs (e.g., eye, ear), the
de-fuzzification interface to our action organs (e.g., arms, feet, etc.), the fuzzy
rule base to our memory, and the fuzzy inference machine to our thought process
when a fuzzy system is compared to a human controller. It is called a fuzzy expert
system when an expert system uses fuzzy data to reason. It is important to know what makes up a fuzzy expert system: it consists of a fuzzy knowledge base (based on fuzzy rules), an inference engine, a working-memory subsystem, an explanation subsystem, a natural language interface, and a knowledge acquisition subsystem.
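To complete the picture, the sketch below follows the min-max scheme described earlier (clip each rule's output set by its firing strength with a minimum, aggregate the rules with a maximum) and finishes with a simple centroid defuzzification; the output sets, universe, and rule strengths are illustrative assumptions that continue the toy temperature example above.

def min_max_inference(rule_strengths, output_sets, universe):
    # Clip each rule's output set by its firing strength (min), then aggregate (max).
    aggregated = []
    for y in universe:
        clipped = [min(strength, output_sets[label](y))
                   for label, strength in rule_strengths.items()]
        aggregated.append(max(clipped))
    return aggregated


def centroid(universe, aggregated):
    # Defuzzify the aggregated fuzzy output to a crisp value (centre of gravity).
    total = sum(aggregated)
    return sum(y * mu for y, mu in zip(universe, aggregated)) / total if total else 0.0


# Illustrative output sets over a 0-100 "risk" score.
output_sets = {
    "risk_low": lambda y: max(0.0, 1.0 - y / 50.0),
    "risk_high": lambda y: max(0.0, (y - 50.0) / 50.0),
}
universe = list(range(0, 101, 5))
strengths = {"risk_low": 0.0, "risk_high": 0.62}  # e.g. from the rule step above
print(centroid(universe, min_max_inference(strengths, output_sets, universe)))
# prints a crisp risk score of roughly 83 with these illustrative numbers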