Graph Databases: Applications on Social Media Analytics and Smart Cities
Editor
Christos Tjortjis
Dean, School of Science and Technology
International Hellenic University
Greece
A SCIENCE PUBLISHERS BOOK
First edition published 2024
by CRC Press
2385 NW Executive Center Drive, Suite 320, Boca Raton FL 33431
Reasonable efforts have been made to publish reliable data and information, but the author and
publisher cannot assume responsibility for the validity of all materials or the consequences of
their use. The authors and publishers have attempted to trace the copyright holders of all material
reproduced in this publication and apologize to copyright holders if permission to publish in this
form has not been obtained. If any copyright material has not been acknowledged please write
and let us know so we may rectify in any future reprint.
Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced,
transmitted, or utilized in any form by any electronic, mechanical, or other means, now known or
hereafter invented, including photocopying, microfilming, and recording, or in any information
storage or retrieval system, without written permission from the publishers.
For permission to photocopy or use material electronically from this work, access www.copyright.com or
contact the Copyright Clearance Center, Inc. (CCC), 222 Rosewood Drive, Danvers, MA 01923,
978-750-8400. For works that are not available on CCC please contact [email protected]
Trademark notice: Product or corporate names may be trademarks or registered trademarks and are
used only for identification and explanation without intent to infringe.
DOI: 10.1201/9781003183532
The idea for the book came about after fruitful discussions with members of the
Data Mining and Analytics research group, stemming from frustrations with
conventional database management systems in our research on Data Mining,
Social Media Analytics, and Smart Cities, as well as from aspirations to make
better use of complex, heterogeneous big data in our day-to-day research. The
need was confirmed in discussions with colleagues across the globe, and by
surveying the state of the art, so we happily embarked on the challenge of
putting together a high-quality collection of complementary yet coherent chapters,
telling the story of the ever-increasing rate of graph database usage, especially
in the context of social media and smart cities. Meanwhile, our planet was taken
aback by the pandemic storm. Plans were disrupted, priorities changed,
and attention was diverted towards more pressing matters. Yet the need for
social media analytics, coupled with smart city applications including
healthcare and facilitated by graph databases, emerged even stronger.
The editor is grateful to all the authors who weathered the storm for their
contributions, and to the editorial team for their support throughout the compilation
of this book. I hope that the selected chapters offer a firm foundation, as well as new
knowledge and ideas, helping readers understand, use, and improve applications of
graph databases in the areas of Social Media Analytics, Smart Cities, and beyond.
Preface iii
Introduction vii
Index 179
Introduction
Christos Tjortjis
Dean of the School of Science and Technology, International Hellenic University
email: [email protected]
With Facebook having 2.9 billion active users, YouTube following with 2.5
billion, Instagram with 1.5 billion, TikTok with 1 billion, and Twitter with 430
million, the amount of data published daily is enormous. In 2022 it was estimated
that 500 million tweets were published daily, 1 billion posts were made daily across
Facebook apps, there were 17 billion location-tagged posts on Facebook,
and 350 million photos were uploaded daily, with the accumulated number of uploaded
photos reaching 350 billion. Every minute, 500 hours of new video content is uploaded
to YouTube, meaning that 82.2 years of video content is uploaded daily. On
Instagram, 95 million photos and videos are uploaded daily. Gathering
such rich data, often called “the digital gold rush”, processing it, and
retrieving information from it are vital.
Graph databases have gained increasing popularity recently, disrupting areas
traditionally dominated by conventional relational, SQL-based databases, as well
as domains requiring the extra capabilities afforded by graphs. This book is a
timely effort to capture the state of the art of Graph databases and their applications
in domains, such as Social Media analysis and Smart Cities.
This practical book aims to combine various advanced tools, technologies,
and techniques to aid understanding and better utilize the power of Social
Media Analytics, Data Mining, and Graph Databases. The book strives to
support students, researchers, developers, and end users involved with Data
Science and Graph Databases in mastering the notions, concepts, techniques, and
tools necessary to extract data from social media or smart cities and to facilitate
information acquisition, management, and prediction.
The contents of the book take the interested reader on a tour starting
with a detailed comparison of relational SQL databases with NoSQL and Graph
Databases, reviewing their popularity, with a focus on Neo4j.
Chapter 1 details the characteristics and reviews the pros and cons of relational
and NoSQL databases, assessing and explaining the increasing popularity of the
latter, in particular when it comes to Neo4j. The chapter includes a categorization
of NoSQL Database Management Systems (DBMS) into i) Column, ii) Document,
iii) Key-value, iv) Graph, and v) Time Series. Neo4j use cases and related scientific
research are detailed, and the chapter concludes with an insightful discussion. It
is essential reading for anyone not familiar with the related concepts
before engaging with the following chapters.
Next, two surveys review the state of play regarding graph databases and
social media data. The former emphasises analytics from a link prediction
perspective and the latter focuses on knowledge extraction from social media data
stored in Neo4j. Graph databases can manage highly connected data originating
from social media, as they are suitable for storing, searching, and retrieving data
that are rich in relationships.
Chapter 2 reviews the literature on graph databases and software libraries
suitable for performing common social network analytics tasks. It proposes a
taxonomy of graph database approaches for social network analytics based on
the available algorithms and the provided means of storing, importing, exporting,
and querying data, as well as the ability to deal with big social graphs, and the
corresponding CPU and memory usage. Various graph technologies are evaluated
by experiments related to the link prediction problem on datasets of diverse sizes.
Chapter 3 introduces novel capabilities for knowledge extraction by surveying
Neo4j usage for social media. It highlights the importance of transitioning from
SQL to NoSQL databases and proposes a categorization of Neo4j use cases in
social media. The relevant literature is reviewed across various domains,
such as recommendation systems, marketing, learning applications, healthcare
analytics, influence detection, and fake news.
The theme is further developed by two more chapters: one elaborating on
combining multiple social networks on a single graph, and another on YouTube
child influencers and the relevant community detection.
Chapter 4 makes the case for combining multiple social networks into a single
graph, since a single user often maintains several accounts across a variety of social
media platforms. This combination brings forward the potential for improved
recommendations, enhancing user experience. It studies actual data from
thousands of users on nine social networks (Twitter, Instagram, Flickr, Meetup,
LinkedIn, Pinterest, Reddit, Foursquare, and YouTube). Node similarity methods
were developed, and node matching success was increased. In addition, a new
alignment method for multiple social networks is proposed. Success rates are
measured, and a broad user profile covering more than one social network is
created.
Chapter 5 investigates data collection and analysis about child Influencers and
their follower communities on YouTube to detect overlapping communities and
understand the socioeconomic impact of child influencers in different cultures.
It presents an approach to data collection and storage using the graph database
ArangoDB, and to analysis with overlapping community detection algorithms
such as SLPA, CliZZ, and LEMON. With the open-source WebOCD framework,
community detection revealed that communities form around child influencer
channels with similar topics, and that there is a potential divide between family
channel communities and singular child influencer channel communities. The
network collected contains 72,577 channels and 2,025,879 edges with 388
confirmed child influencers. The collection scripts, the software, and the data set
in the database are available freely for further use in education and research.
The smart city theme is investigated in three chapters: first, a comprehensive
literature survey on using graph databases to manage smart city linked data;
then two case studies, one emphasising energy load forecasting using graph
databases and Machine Learning, and another on digital health applications
that utilise a graph-based data model.
Chapter 6 provides a detailed literature survey integrating the concepts of Smart
City Linked Data with Graph Databases and social media. Based on the concept
of a smart city as a complex linked system producing vast amounts of data,
and carrying many connections, it capitalises on the opportunities for efficient
organization and management of such complex networks provided by Graph
databases, given their high performance, flexibility, and agility. The insights
gained through a detailed and critical review and synthesis of the related work
show that graph databases are suitable for all layers of smart city applications.
These relate to social systems including people, commerce, culture, and policies,
surfacing as user-generated content in social media. Graph databases are an efficient tool for
managing the high density and interconnectivity that characterizes smart cities.
Chapter 7 ventures further into the domain of smart cities focusing on the case of
Energy Load Forecasting (ELF) using Neo4j, the leading NoSQL Graph database,
and Machine Learning. It proposes and evaluates a method for integrating multiple
approaches for executing ELF tests on historical building data. The experiments
use data at 15-minute resolution for one-step-ahead time series forecasts
and present accuracy comparisons. The chapter provides guidelines for deriving
correct insights for energy demand prediction and proposes useful extensions for
future work.
Finally, Chapter 8 concludes with an interesting Graph-Based Data Model for
Digital Health Applications in the context of smart cities. A key challenge for
modern smart cities is the generation of large volumes of heterogeneous data to
be integrated and managed to support the discovery of complex relationships in
1
From Relational to NoSQL Databases – Comparison and Popularity: Graph Databases and the Neo4j Use Cases
1.1 Introduction
In Codd We Trust. Published on March 6th, 1972, the paper with the title “Relational
Completeness of Data Base Sublanguages” [1] written by Edgar Frank “Ted” Codd
(19 August 1923–18 April 2003), an Oxford-educated mathematician working for
IBM, was one of the most seminal and ground-breaking IT publications of the
20th century. The abstract of the publication starts with: “In the near future, we can
expect a great variety of languages to be proposed for interrogating and updating
data bases. This paper attempts to provide a theoretical basis which may be used
to determine how complete a selection capability is provided in a proposed data […]”
| Advantages | Explanation |
| --- | --- |
| Non-relational | Not tuple-based; no joins and other RD features |
| Schema-less | No strict/fixed structure |
| Data are replicated to multiple nodes and can be partitioned | Down nodes are simply replaced, and there is no single point of failure |
| Horizontally scalable | Cheap, simple to set up (open source), vast write performance, and quick key-value access |
| Provides a wide range of data models | Supports new/modern datatypes and models |
| Database administrators are not required | No direct management and supervision required |
| Fewer hardware failures | NoSQL DBaaS providers such as Riak and Cassandra are designed to deal with equipment failures |
| Faster, more efficient, and flexible | Simple, scalable, efficient, multifunctional |
| Has evolved at a very high pace | Fast growth driven by leading IT companies |
| Less time writing queries | More time comprehending answers; more condensed and functional queries |
| Less time debugging queries | More time spent developing the next piece of code; elevated overall code quality |
| Code is easier to read | Faster ramp-up for new project members; improved maintainability and troubleshooting |
| Handles the growth of big data, i.e., high data velocity, variety, volume, and complexity | Data is constantly available; true spatial transparency; transactional capabilities for the modern era; adaptable data architecture; high-performance, highly intelligent architecture |
| Handles huge volumes of fast-changing structured, semi-structured, and unstructured user-generated data | Quick schema iteration, agile sprints, and frequent code pushes; support for object-oriented programming languages that are simple to comprehend and use in a short amount of time; a globally distributed scale-out architecture that is neither costly nor monolithic; schema-agnostic, scalable, fast, and highly available; handles enormous amounts of data with ease and at lower cost |

| Disadvantages | Explanation |
| --- | --- |
| Immature | Needs additional time to reach the consistency, sustainability, and maturity of RDs |
| No standard query language | Compared to RDs’ SQL |
| Some NoSQL DBs are not ACID compliant | Atomicity, consistency, isolation, and durability offer stability |
| No standard interface; maintenance is difficult | Some DBs do not offer a GUI yet |
1.2.3 Categorization
According to the literature, there are five main/dominant categories of NoSQL
DBMS: (1) Column, (2) Document, (3) Key-value, (4) Graph, and (5) Time Series.
In “NoSQL Databases List by Hosting Data” [9] it is mentioned that there are 15
categories; the five aforementioned ones and ten more that were characterized
as Soft NoSQL Systems (6 to 15): (6) Multimodel DB, (7) Multivalue DB, (8)
Multidimensional DB, (9) Event Sourcing, (10) XML DB, (11) Grid & Cloud
DB Solutions, (12) Object DB, (13) Scientific and Specialized DBs, (14) Other
NoSQL related DB, and finally (15) Unresolved and uncategorized [10], whilst
the db-engines.com website has 15 categories in total, too, to be discussed in the
following section.
Next, the five most popular categories are analyzed as described on the db-engines
website:
(1) Key-value stores: Based on Amazon’s Dynamo paper [11], “they are
considered to be the simplest NoSQL DBMS since they can only store pairs of
keys and values, as well as retrieve values when a key is known”.
(2) Wide column stores: First introduced in Google’s BigTable paper [12],
they are “also called extensible record stores, store data in records with an
ability to hold very large numbers of dynamic columns. Since the column names
as well as the record keys are not fixed, and since a record can have billions of
columns, wide column stores can be seen as two-dimensional key-value stores”.
(3) Graph databases: Also called graph-oriented DBMS, they represent data
in graph structures as nodes and edges, which are relationships between
nodes. “They allow easy processing of data in that form, and simple
calculation of specific properties of the graph, such as the number of steps
needed to get from one node to another node”.
3 https://db-engines.com/en/article/Graph+DBMS
4 https://db-engines.com/en/article/Document+Stores
5 https://db-engines.com/en/article/Time+Series+DBMS
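To make the first and third categories concrete, here is a minimal pure-Python sketch (illustrative only, not any particular DBMS's API): a key-value store reduces to put/get by key, while a graph store makes path questions, such as the number of steps between two nodes, easy to answer.

```python
from collections import deque

# A key-value store is essentially a dictionary: store pairs, fetch by key.
kv = {}
kv["user:1"] = {"name": "Alice"}   # put
profile = kv["user:1"]             # get, possible only when the key is known

# A graph store keeps nodes and the relationships between them, making
# path questions natural to express and cheap to answer.
edges = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}

def steps(src, dst):
    # Breadth-first search: number of hops on a shortest path src -> dst.
    queue, seen = deque([(src, 0)]), {src}
    while queue:
        node, dist = queue.popleft()
        if node == dst:
            return dist
        for nxt in edges[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None

hops = steps("A", "D")  # shortest path A -> B -> D (or A -> C -> D): 2 steps
```

A relational engine would need recursive joins to answer the same question; in a graph model the relationship is a first-class citizen.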
1.3 Popularity
In this section the popularity of the Relational and the NoSQL databases is
discussed, based on the ranking of an excellent website (db-engines.com). The
creators of db-engines.com have accumulated hundreds of databases and they
have developed a methodology where all these databases are ranked based on
their popularity
1.3.1 DB-Engines
The creators of db-engines claim that the platform “is an initiative to collect and
present information on database management systems (DBMS). In addition to
established relational DBMS, systems and concepts of the growing NoSQL area
are emphasized. The DB-Engines Ranking is a list of DBMS ranked by their
current popularity. The list is updated monthly. The most important properties
of numerous systems are shown in the overview of database management
systems”. Each system’s attributes may be examined by the user, and they can
be compared side by side. This topic’s words and concepts are discussed in the
database encyclopedia. Recent DB-Engines news, citations and major events are
also highlighted on the website.
1.3.2 Methodology
The platform’s creators established a system for computing DBMS scores termed
‘DB-Engines Ranking’, which is a list of DBMS rated by their current popularity.
They use the following parameters to assess a system’s popularity6:
• Number of mentions of the system on websites, measured by the number of
results in Google and Bing search queries. To count only relevant
results, they search for the system name together with the term database,
e.g., ‘Oracle’ and ‘database’.
• General interest in the system, measured by the frequency of searches in
Google Trends.
• Frequency of technical discussions about the system, using the number of
related questions and of interested users on the well-known IT Q&A sites
Stack Overflow and DBA Stack Exchange.
6 https://db-engines.com/en/ranking_definition
• Number of job offers in which the system is referenced, using the number
of offers on the leading job search engines Indeed and Simply Hired.
• Number of profiles in professional networks in which the system is
referenced, as determined by data from the most prominent professional
network, LinkedIn.
• Relevance in social networks, where they count the number of tweets in
which the system is mentioned.
They calculate the popularity value of a system “by standardizing and
averaging of the individual parameters. These mathematical transformations are
made in a way so that the distance of the individual systems is preserved. That
means, when system A has twice as large a value in the DB-Engines Ranking as
system B, then it is twice as popular when averaged over the individual evaluation
criteria”.
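The ratio-preserving aggregation described above behaves like averaging on a logarithmic scale: a geometric mean keeps the relative distances between systems intact. A toy sketch with made-up numbers (the geometric mean is an illustrative assumption, not DB-Engines' published formula):

```python
import math

# Hypothetical raw popularity measurements for three systems across
# several data sources (search hits, Q&A questions, job offers, ...).
raw = {
    "SystemA": [2000, 400, 120],
    "SystemB": [1000, 200, 60],
    "SystemC": [500, 100, 30],
}

def popularity(scores):
    # Geometric mean of the per-source values: averaging on a log scale,
    # so a system with twice the value in every source ends up with
    # exactly twice the final score.
    return math.exp(sum(math.log(s) for s in scores) / len(scores))

ranking = {name: popularity(vals) for name, vals in raw.items()}
# SystemA's raw values are exactly 2x SystemB's in every source,
# so its aggregated score is exactly 2x as well.
```

A plain arithmetic mean would not preserve these ratios once the sources sit on very different scales, which is why a log-scale aggregation is a natural fit for the property the ranking claims.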
To remove the impact of changing volumes of the data sources
themselves, the popularity score is a relative number that should only be
interpreted in comparison with other systems. Note that the DB-Engines
Ranking does not take into account the number of installed systems or their
use within IT systems.
The website provides information on 432 database systems (https://db-engines.com/en/systems),
which are examined and divided into 15 categories.
However, only 381 databases are ranked in accordance with the aforementioned
methodology. The number and percentage of databases per category is shown
in Table 1.2.
It is evident that the sum of the total number of databases exceeds 381, since
there are databases that belong to more than one category.
The Top-10 relational databases and the Top-5 Key-Value, Document, Wide-
Column, Graph, and Time-Series databases are depicted in Tables 1.5-1.10,
respectively. The tables show the percentage change in popularity over the last
5 years. It should be mentioned that some DBMSs are included in more than
one category.
From Tables 1.5-1.10, where the change in popularity has also been noted,
we conclude that of the top-20 databases, 15 demonstrated an increase in their
popularity and the remaining 5, four of which were relational and one a Wide-
Column DB, a decrease. The biggest increases in the past 5 years were observed for
(a) MariaDB (+136.69%), (b) PostgreSQL (+84.3%), and (c) Splunk (+71.75%),
whilst the biggest decreases were observed for (a) Microsoft SQL Server (–22.23%),
(b) MySQL (–12.25%), and (c) Cassandra (–11.23%).
* TimescaleDB was released in December 2017; therefore its score and % change
are calculated from that month.
Figure 1.6. Query execution time comparison (PostgreSQL, MongoDB, and Neo4j).
1.5 Neo4j
According to www.neo4j.com, Neo4j is “The Fastest Path to Graph and
it gives developers and data scientists the most trusted and advanced tools to
quickly build today’s intelligent applications and machine learning workflows.
Available as a fully managed cloud service or self-hosted”. As mentioned on the
db-engines.com Neo4j System Properties page16, Neo4j supports a range of programming
languages, including .Net, Clojure, Elixir, Go, Groovy, Haskell, Java, JavaScript, Perl,
PHP, Python, Ruby, and Scala.
Neo4j’s key competitive advantages are:
• 1000x Performance at Unlimited Scale
• Unmatched Hardware Efficiency
• Platform Agnostic
• Developer Productivity - Declarative Language
• Agility - Flexible Schema
• First Graph ML for Enterprise
Figure 1.8 and Table 1.13, which depict the ranking of Neo4j from
November 2012 until December 2021, show that Neo4j has been rising constantly,
demonstrating an impressive 614% increase between December 2012 and December 2021.
It is also worth mentioning that every year Neo4j demonstrates an increase in its
popularity compared to the previous year. Over the last six years the increase varies
from 5.16% (December 2017) up to 17.63% (December 2018).
16 https://db-engines.com/en/system/Neo4j
17 https://db-engines.com/en/system/Neo4j
18 https://neo4j.com/use-cases/
The most popular use cases, as defined by the Neo4j developers, are:
1. Fraud detection and analytics: Real-time analysis of data connections is
critical for identifying fraud rings and other sophisticated schemes before
the fraudsters and criminals do long-term damage.
2. Network and database infrastructure monitoring for IT operations: Graph
databases are fundamentally better suited than RDBMS for making
sense of the complex interdependencies that are key to network and IT
infrastructure management.
3. Recommendation engine and product recommendation system: Graph-
powered recommendation engines assist businesses in personalizing
products, information and services by utilizing a large number of real-time
connections.
4. Master data management: Organize and manage master data using a
flexible and schema-free graph database format to gain real-time
insights and a 360° view of customers.
5. Social media and social network graphs: When employing a graph
database to power a social network application, you may easily utilize
social connections or infer associations based on activity.
6. Identity and access management: When utilizing a graph database for
identity and access management, you may quickly and efficiently track
people, assets, relationships, and authorizations.
7. Retail: For retailers, Neo4j supports real-time product recommendation
engines, customer experience personalisation, and supply-chain
management.
8. Telecommunications: Neo4j assists the telecom sector in managing
complex interdependencies in telecommunications, IT infrastructure, and
other dense networks.
9. Government: Governments utilize Neo4j to combat crime, prevent
terrorism, enhance fiscal responsibility, increase efficiency, and provide
transparency to their citizens.
10. Data privacy, risk and compliance: Regulatory compliance that is quick and
effective (GDPR, CCPA, BCBS 239, FRTB and more). Neo4j assists in the
management of enterprise risk while harnessing linked data to improve
business intelligence.
11. Artificial intelligence and analytics: Artificial Intelligence is expected to
lead the next wave of technology disruption in practically every sector.
12. Life sciences: Pharmaceutical, chemical, and biotech businesses are
adopting Neo4j to examine data in ways that were previously impossible
without graphs.
13. Financial services: Context is essential in everything from risk management
to securities advice. Top banks throughout the world are utilizing Neo4j to
address their linked data concerns.
14. Graph data science: Businesses now confront tremendously complex issues
and possibilities that necessitate more adaptable, intelligent approaches. That’s
why Neo4j built the first enterprise graph framework for data scientists: to
enhance forecasts, which lead to better decisions and innovation.
15. Supply chain management: Graph technology is critical for optimizing
product movement, identifying weaknesses, and increasing overall supply
chain resilience. Transparency-One, Caterpillar, and other companies are
using supply chain graph technology to secure business continuity.
16. Knowledge graph: Knowledge graphs guarantee that search results are
contextually appropriate to the needs of the user. Knowledge graphs are
used by organizations like NASA, AstraZeneca, NBC News, and Lyft to
contextualize a range of data forms and formats.
managed to increase the trust between users. This research can also be classified
as a “Social media and social network graphs” use case scenario.
Bakkal et al. [27] suggested a unique carpooling matching approach. The
authors modeled trajectories using the Neo4j spatial and TimeTree libraries, then
performed temporal and locational filtering stages, and ultimately evaluated the
efficiency and efficacy of the proposed system. The suggested approach showed
available drivers to hitchhikers using Google Maps and presented them in the
form of a graph using a Geolife trajectory dataset of 182 individuals and 17,621
trajectories. According to the assessment, the system was effective, and it could
assist carpooling by saving users money on gasoline, tolls, and time wasted on the
road, among other things.
Another research attempt [28] proposes a recommendation system for
impacted items based on association rule mining and Neo4j. Using real-world
consumer feedback data from Amazon, the authors created a model to determine
the effect of one product on another for better and faster decision making.
According to the findings, the model outperforms the Apriori algorithm, one
of the most common approaches to association rule mining: it has lower time
complexity and saves space, whereas the Apriori methodology has a much higher
runtime complexity than the Neo4j model. This paper also falls under the category
of a “Retail” use case scenario.
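The underlying idea of association rule mining, which both the Apriori algorithm and the Neo4j model in [28] implement, can be sketched in a few lines (the toy transactions below are invented, not the Amazon data used by the authors):

```python
# Toy transactions: which products were bought together in one basket.
transactions = [
    {"laptop", "mouse"},
    {"laptop", "mouse", "keyboard"},
    {"mouse", "keyboard"},
    {"laptop", "keyboard"},
]

def rule_confidence(antecedent, consequent):
    # confidence(A -> B) = support({A, B}) / support({A}):
    # the fraction of baskets containing A that also contain B.
    both = sum(1 for t in transactions if antecedent in t and consequent in t)
    ante = sum(1 for t in transactions if antecedent in t)
    return both / ante

# How strongly does buying a laptop "impact" buying a mouse?
conf = rule_confidence("laptop", "mouse")  # 2 of the 3 laptop baskets contain a mouse
```

Apriori derives such rules by repeatedly scanning the transaction set for frequent itemsets; a graph model can instead materialise product-to-product links once and read the counts off the stored relationships, which is the source of the runtime advantage the authors report.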
Dharmawan and Samo [33] proposed a book recommendation system combining
BibTeX book metadata and Neo4j. With the aid of Cypher, by inputting criteria
such as the author’s name or the book’s type, the user receives book
recommendations based on those criteria. They performed various queries
with both SQL and Cypher, and the results were the same. However, Neo4j queries
took around 130 milliseconds on average to execute for the author criteria
and 124 milliseconds for the book-type criteria, approximately 8.5
and 10 times slower, respectively, than the same queries executed in SQL. Since
differences at this scale, in the range of a tenth of a second, do not noticeably
affect the user, the authors concluded that for storing book recommendations
a graph database is more flexible than a relational database, and that graph
databases are more efficient in preparing data before querying and more
flexible for a book recommendation system. This paper can also
be classified as a “Retail” use case.
Konno et al. [41] proposed a goods recommendation system based on retail
knowledge in a Neo4j graph database combined with an inference mechanism
implemented in the Java Expert System Shell (Jess). They presented a two-layer
knowledge graph database: a concept layer, “the resulting graph representation
transferred from an ontology representation”, and an instance layer, “the instance
data associated with concept nodes”. Using RFM (Recency, Frequency,
Monetary) analysis, “they extracted customer behavior characteristics and
classified customers into five groups according to RFM segments and a list of
recommended products was created for each segment based on the RFM value
associated to each customer”. Finally, they evaluated the time efficiency of
answering queries on retail data and the novelty of the system’s recommendations,
and concluded that both were “reasonably good”. This study can also be
classified as a “Retail” use case.
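The RFM grouping quoted above can be illustrated with a minimal Python sketch (the customer data, the cut-offs, and the two-level scoring are invented for illustration; the paper classifies customers into five RFM segments):

```python
from datetime import date

# Hypothetical per-customer purchase summaries: date of last purchase,
# number of purchases, and total amount spent.
customers = {
    "alice": {"last": date(2024, 6, 1), "freq": 12, "monetary": 900.0},
    "bob":   {"last": date(2023, 11, 5), "freq": 2,  "monetary": 80.0},
}

def rfm_segment(c, today=date(2024, 6, 30)):
    # Score each dimension 1 (low) or 2 (high) against simple cut-offs;
    # real RFM analysis typically scores each dimension by quintile.
    r = 2 if (today - c["last"]).days <= 90 else 1   # Recency
    f = 2 if c["freq"] >= 5 else 1                   # Frequency
    m = 2 if c["monetary"] >= 500 else 1             # Monetary
    return (r, f, m)

segments = {name: rfm_segment(c) for name, c in customers.items()}
```

In a knowledge-graph setting each customer node would then be linked to its segment node, and the per-segment product lists read off those relationships.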
Finally, the authors in [44] exploited Neo4j to create a content-based filtering
recommendation system for abstract search. They designed a recommendation
system that returns related documents based on content search over report
papers. They employed Neo4j, and a document-keyword graph was developed
to depict the relationship between each document and its characteristics; it was
used to filter keyword co-occurrence documents in order to limit the search
space as much as feasible. The model’s efficiency was examined, and the
findings showed that it achieved an accuracy of 0.77.
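The keyword co-occurrence filtering step can be illustrated with a small Python sketch (the documents and keywords are invented; the actual system stores this document-keyword graph in Neo4j):

```python
# Hypothetical document-keyword "graph" stored as adjacency sets:
# each document maps to the keywords extracted from it.
doc_keywords = {
    "doc1": {"graph", "database", "neo4j"},
    "doc2": {"graph", "social", "network"},
    "doc3": {"relational", "sql"},
}

def candidates(query_keywords):
    # Keep only documents sharing at least one keyword with the query,
    # shrinking the search space before any expensive similarity scoring.
    q = set(query_keywords)
    return {doc for doc, kws in doc_keywords.items() if q & kws}

hits = candidates({"graph", "neo4j"})  # doc3 shares no keyword and is pruned
```

Only the surviving candidates then need full content-based scoring, which is how the co-occurrence filter limits the search space.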
In [42] a study of the efficient storage performance of oilfield ontology in Neo4j
was conducted. The authors designed mapping rules from ontology files to
the Neo4j database, and via a two-tier index architecture that included
object and triad indexing, they managed to keep data-loading time low
and were able to match different patterns for accurate retrieval.
Their evaluation results were very promising, showing that Neo4j can
reduce the required storage space to a great extent, since their method could “save
13.04% of the storage space and improve retrieval efficiency by more than 30
times compared with the methods of relational databases”.
In [39] the authors used Neo4j for mining protein graphs, adding another study
on the usefulness of applying Neo4j in bioinformatics and in the health sciences
in general. The problem they addressed is protein-protein
interface (PPI) identification, in which, according to the authors, “the goal of the
PPI identification task is, given a protein structure, to identify amino acids which
are responsible for binding of the structure to other proteins”. They retrieved
data from the Protein Data Bank (PDB) and showed a method for transforming
and migrating these data into Neo4j as a set of independent protein graphs. The
resulting graph database contains about 14 million labeled nodes and 38 million
edges, and after querying the graph database they concluded that “using Neo4j is a
viable option for specific, rather small, subgraph query types”, but at the same time
querying a large number of edges led to performance limitations.
Another study on proteins was undertaken by Johnpaul and Mathew [40],
who presented a Cypher-query-based NoSQL data mining approach on protein
datasets using the Neo4j graph database. The research investigated the usage of
Cypher queries on a proteome-protein dataset of 20,000 nodes and 100,000 edges,
with an average degree of five per node. The assessment was based on queries
with a varying total number of nodes and relationships between them. The research
concluded that NoSQL queries are more capable of conducting data retrieval
without the restrictions of RDBMSs and that they offer better storage for
unstructured data.
Stark et al. [43] presented a migraine drug recommendation system based
on Neo4j, called BetterChoice. Their proposed system used simulated data for
100,000 patients to help physicians gain more transparency about which drug fits
a migraine patient best, considering his/her individual features. They evaluated the
system by examining whether the recommended drugs best suited the patients’ needs, and
they concluded that “the proposed system works as intended as only drugs with
highest relevance scores and no interactions with the patient’s diseases, drugs or
pregnancy were recommended”.
1.6 Discussion
This chapter compares the RDs with the NoSQL DBs, cites their main advantages
and disadvantages and investigates their popularity, based on a methodology
introduced by the db-engines website. In this platform 432 database management
systems are analyzed, and 381 of them are distributed in 15 categories and ranked
based on their popularity. After statistical analysis, it was concluded that the
Relational Databases continue to dominate the DBMS domain, as approximately
40% of the analyzed DBMS belong to this category. The categories that follow are
Key-Value stores with 16.8%, Document stores with 13.91%, Time Series DBMS
with 10.24% and Graph DBMS with 9.45%. The remaining 10 categories claim
less than 6% of the total number of analyzed DBMS.
Table 1.14. Distribution of papers per use-case category. Columns: Reference no., Publication year, Comparison – Performance (#2), Networks (#3), Recommendation, Data management (#4), Social media (#5), Retail (#7), Government (#9), Life sciences (#12).
However, the gap in popularity between RDBMS and non-RDBMS has been narrowing, as the top-3 RDBMS are becoming less popular, whereas the non-RDBMS are becoming more popular every year. Then, graph databases, and more
specifically Neo4j were presented. By examining the literature concerning Neo4j
papers that were retrieved from Google Scholar by searching for the term “Neo4j
use cases”, without applying a date filter and by selecting them based on the order
of the search results, it was evident that Neo4j is quite a popular subject of
research and study. Table 1.14 shows the distribution of papers per use-case
category (introduced in section “Neo4j Use Cases and scientific research”).
Out of the 28 papers, 7 (25%) focused on: (i) evaluating Neo4j’s features and/
or comparing it with other DBMS, (ii) recommendation systems, and (iii) utilizing
it with social media, followed by 6 (21.4%) which focused on life sciences use-
cases. Since several papers were included in more than one use-case category,
summing up percentages exceeds 100%.
Neo4j was released in 2007; however, research on Neo4j is recent, as shown
by the papers’ publication years. The total number of papers since 2017 was 21 (75%).
More specifically, 10 were published in 2017, 6 in 2018, 3 in 2019, and 1 each in
2020 and 2021. No papers published before 2013 were analyzed.
In Table 1.15, the total number of citations, along with the average number of
citations per year, is depicted for each article.
The total number of citations (citations by other papers, according to Google
Scholar) for the 28 papers was 809, although 350 of them (43.26%) belonged
to a single paper [26]. The average number of citations per year was found by
dividing each paper’s citations by the years since the paper was published (2021 –
publication year + 1); on this basis, the 28 papers average 131.14 citations
per year.
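The per-year normalisation described above can be sketched in a few lines of Python (an illustrative sketch; the function name is ours, the figures are the chapter’s):

```python
def avg_citations_per_year(citations: int, pub_year: int, current_year: int = 2021) -> float:
    # Years the paper has been citable, counting the publication year itself
    # (2021 - publication year + 1), as defined in the text.
    return citations / (current_year - pub_year + 1)

# Miller's paper [26] (published 2013, 350 citations) averages ~38.9 citations
# per year, matching the "approximately 39 per year" reported below.
miller_rate = avg_citations_per_year(350, 2013)
```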
From Relational to NoSQL Databases – Comparison and Popularity... 31
Table 1.16. Average citations per year, and per year per paper
The most popular ones are [26] (“Graph database applications and concepts
with Neo4j”), by Miller with 350 citations and approximately 39 per year,
followed by [24] (“Time-varying social networks in a graph database: a Neo4j
use case”), by Cattuto et al. with 78 citations and around 9 per year, followed by
[23] (“Graph Databases Comparison: AllegroGraph, ArangoDB, InfiniteGraph,
Neo4J, and OrientDB”), by Fernandes & Bernardino, with 62 citations and an
average of 15.5 per year. It should be noted that 2 of the 3 papers focused on comparing
the performance of Graph Databases.
Finally, the most popular use-cases were also analyzed. The threshold was
four, meaning that only the categories with at least four papers were examined,
and these were: (i) comparison, (ii) recommendation systems, (iii) social media,
(iv) retail, and (v) life sciences. The total average citations per year and the
average per year and per paper are shown in Table 1.16.
This chapter can serve as a bibliographic survey, as information and statistics
about 381 DBMSs, along with the most popular ones per category (more
specifically relational, key-value, wide-column, document-based, graph and
time-series DBMSs), are presented and discussed.
Furthermore, it can be used to promote the use of NoSQL, Graph Databases
and mainly Neo4j. It compares Relational Databases and NoSQL Databases and
focuses mainly on the numerous advantages of the NoSQL Databases. Their
characteristics and their categorization were also studied. Several studies that were
analyzed echo the belief that NoSQL Databases, and mainly Graph Databases,
outperform the Relational Databases. This conclusion is also backed up by the
fact that NoSQL Databases demonstrate a continuous increase in their popularity,
as studied and calculated by db-engines.com, and at the same time by the trust
placed in them by a large number of industry giants. The db-engines platform’s
methodology, its categorization of the 381 DBMSs, its popularity measurements,
the RDBMS vs. NoSQL popularity, and the most popular Relational, Key-Value,
Document, Wide-Column, Graph and Time-Series databases were presented.
Studies about the comparison and the performance of RD and NoSQL
databases and the use of NoSQL databases were also discussed. Then a more
extensive study of Graph Databases, and mainly on the leading one, Neo4j was
performed. This study of Neo4j also can serve as a literature review since it offers
the analysis of 28 papers related to Neo4j use cases, retrieved from Google Scholar,
along with data about the category (out of 16 categories that were introduced on
the study) of use-case they investigated and statistics about their popularity by the
research community.
Because of the authors’ daily engagement with Neo4j, this chapter could be
considered a proselytism for Neo4j, and its list of awards can back it up: 2021 and
2022 Data Breakthrough Award Winner; KMWorld 100 Companies That Matter
in Knowledge Management 2020–2022; DSA Awards 2021; the insideBIGDATA
IMPACT 50 List for Q4 2020, Q1–Q4 2021 and Q1 2022; 2021 Datanami Readers’
Choice Awards; the 2021 SaaS Awards; The Coolest Database System Companies
of the 2021 Big Data 100; Digital.com Best Database Management Software
of 2021; finalist in the 2020–21 Cloud Awards; Best Graph Database in the
2020 DBTA Readers’ Choice Awards; KMWorld AI 50: The Companies
Empowering Intelligent Knowledge Management; 13th Annual Ventana Digital
Innovation Award finalist (Data); The Coolest Database System Companies of the
2020 Big Data 100; 2020 Data Breakthrough Award for Overall Open Source Data
Solution Provider of the Year; the 2019 SD Times 100 (Database and Database
Management); Bloor Mutable Award 2019; and the 2019 Big Data 100.
References
[1] Codd, E.F. 1972. Relational Completeness of Data Base Sublanguages (pp. 65-98).
IBM Corporation.
[2] Seed Scientific. 2021, October 28. How Much Data Is Created Every Day? [27 Powerful Stats]. Retrieved November 1, 2021, from https://seedscientific.com/how-much-data-is-created-every-day/.
[3] 21 Big Data Statistics & Predictions on the Future of Big Data, June 2018 [online]. Available: https://www.newgenapps.com/blog/big-data-statisticspredictions-on-the-future-of-big-data.
[4] NoSQL. 2020, August 1. Retrieved August 4, 2021, from https://en.wikipedia.org/
wiki/NoSQL.
[5] Moniruzzaman, A.B.M. and S.A. Hossain. 2013. Nosql database: New era of
databases for big data analytics-classification, characteristics and comparison. arXiv
preprint arXiv:1307.0191.
[6] Rousidis, D., P. Koukaras and C. Tjortjis. 2020, December. Examination of NoSQL
transition and data mining capabilities. In: Research Conference on Metadata and
Semantics Research (pp. 110–115). Springer, Cham.
[7] Vaghani, R. 2018, December 17. Use of NoSQL in Industry. Retrieved August 5,
2021, from https://www.geeksforgeeks.org/use-of-nosql-in-industry.
[8] Nayak, A., A. Poriya and D. Poojary. 2013. Type of NOSQL databases and its
comparison with relational databases. International Journal of Applied Information
Systems, 5(4): 16–19.
[9] NoSQL Databases List by Hosting Data – Updated 2020. (2020, July 03). Retrieved
August 5, 2021, from https://hostingdata.co.uk/nosql-database/.
[10] Zollmann, J. 2012. NoSQL databases. Retrieved from Software Engineering Research Group: http://www.webcitation.org/6hA9zoqRd.
[28] Sen, S., A. Mehta, R. Ganguli and S. Sen. 2021. Recommendation of influenced
products using association rule mining: Neo4j as a case study. SN Computer Science,
2(2): 1–17.
[29] Allen, D., A. Hodler, M. Hunger, M. Knobloch, W. Lyon, M. Needham and H. Voigt.
2019. Understanding trolls with efficient analytics of large graphs in Neo4j. BTW
2019.
[30] Summer, G., T. Kelder, K. Ono, M. Radonjic, S. Heymans and B. Demchak. 2015.
cyNeo4j: Connecting Neo4j and Cytoscape. Bioinformatics, 31(23): 3868–3869.
[31] Comyn-Wattiau, I. and J. Akoka. 2017, December. Model driven reverse engineering
of NoSQL property graph databases: The case of Neo4j. In: 2017 IEEE International
Conference on Big Data (Big Data) (pp. 453–458). IEEE.
[32] Soni, D., T. Ghanem, B. Gomaa and J. Schommer. 2019, June. Leveraging Twitter
and Neo4j to study the public use of opioids in the USA. In: Proceedings of the 2nd
Joint International Workshop on Graph Data Management Experiences & Systems
(GRADES) and Network Data Analytics (NDA) (pp. 1–5).
[33] Dharmawan, I.N.P.W. and R. Sarno. 2017, October. Book recommendation using
Neo4j graph database in BibTeX book metadata. In: 2017 3rd International Conference
on Science in Information Technology (ICSITech) (pp. 47–52). IEEE.
[34] Dietze, F., J. Karoff, A.C. Valdez, M. Ziefle, C. Greven and U. Schroeder. 2016,
August. An open-source object-graph-mapping framework for Neo4j and Scala:
Renesca. In: International Conference on Availability, Reliability, and Security (pp.
204–218). Springer, Cham.
[35] Drakopoulos, G. 2016, July. Tensor fusion of social structural and functional analytics
over Neo4j. In: 2016 7th International Conference on Information, Intelligence,
Systems & Applications (IISA) (pp. 1–6). IEEE.
[36] Drakopoulos, G., A. Kanavos, P. Mylonas and S. Sioutas. 2017. Defining and
evaluating Twitter influence metrics: A higher-order approach in Neo4j. Social
Network Analysis and Mining, 7(1): 1–14.
[37] Hölsch, J. and M. Grossniklaus. 2016. An algebra and equivalences to transform graph
patterns in Neo4j. In: EDBT/ICDT 2016 Workshops: EDBT Workshop on Querying
Graph Structured Data (GraphQ).
[38] Hölsch, J., T. Schmidt and M. Grossniklaus (2017). On the performance of analytical
and pattern matching graph queries in Neo4j and a relational database. In: EDBT/
ICDT 2017 Joint Conference: 6th International Workshop on Querying Graph
Structured Data (GraphQ).
[39] Hoksza, D. and J. Jelínek. 2015, September. Using Neo4j for mining protein graphs:
A case study. In: 2015 26th International Workshop on Database and Expert Systems
Applications (DEXA) (pp. 230–234). IEEE.
[40] Johnpaul, C.I. and T. Mathew. 2017, January. A Cypher query based NoSQL data
mining on protein datasets using Neo4j graph database. In: 2017 4th International
Conference on Advanced Computing and Communication Systems (ICACCS) (pp.
1–6). IEEE.
[41] Konno, T., R. Huang, T. Ban and C. Huang. 2017, August. Goods recommendation
based on retail knowledge in a Neo4j graph database combined with an inference
mechanism implemented in jess. In: 2017 IEEE SmartWorld, Ubiquitous
Intelligence & Computing, Advanced & Trusted Computed, Scalable Computing
& Communications, Cloud & Big Data Computing, Internet of People and Smart
City Innovation (SmartWorld/SCALCOM/UIC/ATC/CBDCom/IOP/SCI) (pp. 1–8).
IEEE.
[42] Gong, F., Y. Ma, W. Gong, X. Li, C. Li and X. Yuan. 2018. Neo4j graph database
realizes efficient storage performance of oilfield ontology. PloS One, 13(11):
e0207595.
[43] Stark, B., C. Knahl, M. Aydin, M. Samarah and K.O. Elish. 2017, September.
Betterchoice: A migraine drug recommendation system based on Neo4j. In: 2017
2nd IEEE International Conference on Computational Intelligence and Applications
(ICCIA) (pp. 382–386). IEEE.
[44] Wita, R., K. Bubphachuen and J. Chawachat 2017, November. Content-based filtering
recommendation in abstract search using Neo4j. In: 2017 21st International Computer
Science and Engineering Conference (ICSEC) (pp. 1–5). IEEE.
CHAPTER
2
A Comparative Survey of Graph
Databases and Software for Social
Network Analytics: The Link
Prediction Perspective
2.1 Introduction
Social Network Analysis (SNA) is of great organic value to businesses and
society. It encompasses techniques and methods for analyzing the continuous flow
of information in offline networks (e.g. networks of employees in labor markets
and networks of collaborators in product markets) and online networks (e.g.
Facebook posts, Twitter feeds, and Google maps check-ins) in order to identify
patterns of dissemination of information or locate nodes and edges of interest for
an analyst.
The importance of graphs for social media analysis was first highlighted
by Jacob Moreno. In particular, in his book entitled “Who shall survive?”, he
attempted to portray the entire population of New York with a network form of
relationships [1]. Social networks, like many other real-world phenomena, can
be modeled using graphs. A graph is a collection of nodes and edges between
them. In the case of social networks, the nodes correspond to people (or groups
of people), while the edges represent social or human relationships between the
people (or groups).
Consequently, graph processing has become an important part of many social
network analytics applications. As social networks evolve and grow in size,
the respective social graphs also grow and contain up to trillions of edges. In
many cases, these graphs change dynamically over time and are enriched with
various metadata. All these increased requirements gave rise to graph database
systems, which are specialized in storing, processing, and analyzing huge graphs.
Graph database systems utilize graph formats to represent data, including nodes,
edges, and their properties. The majority of them (e.g. Neo4j, GraphDB, FlockDB,
InfiniteGraph) support graph storage and querying. They also support different
database models, which either represent data as directional graphs (e.g. as node
and edge collections) or as general graph structures that allow data manipulation
through a predefined library of graph algorithms.
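As a rough illustration of the property-graph model these systems share, nodes and edges each carry a label or relationship type plus an open set of key-value properties. A minimal sketch in Python (the types and names are ours, not any particular database’s API):

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    label: str                          # e.g. "Member"
    props: dict = field(default_factory=dict)

@dataclass
class Edge:
    src: Node
    dst: Node
    rel: str                            # relationship type, e.g. "Friend"
    props: dict = field(default_factory=dict)

# A two-node social graph: alice -[:Friend]-> bob, with a property on the edge.
alice = Node("Member", {"name": "alice"})
bob = Node("Member", {"name": "bob"})
friendship = Edge(alice, bob, "Friend", {"since": 2020})
```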
In this chapter, we provide an overview of the graph-based solutions for
social network analysis (SNA). The main contributions of this work are:
• We provide the research community with a comprehensive taxonomy
of graph databases and software for graph analysis and mining for social
network analytics.
• We perform a comparative evaluation of a selected list of graph mining
solutions, aiming to highlight the advantages and disadvantages of each
approach.
• We provide the research community with a guideline for choosing the
appropriate software solution for each task, taking into account the size and
other properties of the dataset under consideration.
The remainder of this chapter is organized as follows. Section 2.2 provides
the taxonomy of the graph-based approaches for SNA. Section 2.3 describes a
simple SNA task, namely link prediction, and shows how it could be tackled
using different popular tools in each category. Section 2.4 performs a comparative
evaluation. Using datasets of increasing size, we evaluate the performance and
challenge the limits of each solution. Finally, Section 2.5 discusses the advantages
and disadvantages of each technology, guides the reader to the selection of the
most appropriate tool for each task, and provides directions for future work in
the field.
with more advanced algorithmic capabilities are utilized when the capabilities or
performance of a graph database is not adequate.
In order to provide a better understanding of the available solutions in the
subsections that follow, we discuss in more depth the various graph database
technologies and the software that can be employed for graph-based analysis of
social network data. Table 2.1 provides the main software solutions of each type.
4 http://neo4j.org/
5 https://jena.apache.org
6 http://rdf4j.org
7 https://memgraph.com/
8 https://oss.redislabs.com/redisgraph/
9 https://igraph.org/
10 https://jgrapht.org/
11 https://github.com/google/guava
12 https://networkx.org/
13 https://webgraph.di.unimi.it/
14 https://azure.microsoft.com/en-us/services/cosmos-db/
15 https://www.orientdb.org/
edge to appear between the two vertices in the future. Essentially, what this metric
does is “close triangles” between vertices that are not directly connected, which in
the social network context assumes that a friend of a mutual friend is likely to
become our friend in the future.
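The common-neighbors score can be written directly over an adjacency-set representation of the network (an illustrative sketch; the toy graph and names are ours):

```python
def common_neighbors(neighbors: dict, x, y) -> int:
    # Number of shared neighbors of x and y -- the simplest link-prediction
    # score; each shared neighbor is a triangle waiting to be closed.
    return len(neighbors[x] & neighbors[y])

# Toy undirected friendship graph as adjacency sets.
graph = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c"},
}
# "a" and "d" are not directly connected but share the neighbor "c".
```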
Another algorithm is that of Adamic & Adar [31], which was introduced
for predicting links on social networks (Facebook, Twitter, etc.). The algorithm
also considers the common neighbors of two vertices but, instead of their
simple count, it calculates the sum of the inverse logarithm of the degree of each
common neighbor. Thus, it manages not to reject the scenarios in which low-degree
nodes play an important role in predicting links. The formula for the Adamic-Adar
index of a potential edge (x, y) is given in Eq. 2.1, where N(u) denotes the set of
neighbors of node u.
A(x, y) = Σ_{u ∈ N(x) ∩ N(y)} 1 / log |N(u)|    (2.1)
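Eq. 2.1 translates almost directly into code. A minimal sketch over an adjacency-set representation (the names are ours; note that a common neighbor of degree 1 would make log |N(u)| zero, so this sketch skips such neighbors to avoid division by zero):

```python
from math import log

def adamic_adar(neighbors: dict, x, y) -> float:
    # Eq. 2.1: sum 1 / log|N(u)| over the common neighbors u of x and y.
    # Degree-1 common neighbors (log 1 = 0) are skipped.
    return sum(1.0 / log(len(neighbors[u]))
               for u in neighbors[x] & neighbors[y]
               if len(neighbors[u]) > 1)

graph = {
    "a": {"b", "c"},
    "b": {"a", "c", "d"},
    "c": {"a", "b", "d"},
    "d": {"b", "c"},
}
# Candidate edge (a, d): common neighbors b and c, each of degree 3,
# so the score is 2 / log(3).
score = adamic_adar(graph, "a", "d")
```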
embeddings, and algorithms for link prediction and pathfinding. Support for all
these algorithms is included as a separate library, which is highly coupled with the
core system and is aware of the underlying storage implementation. In order to
process big graphs that do not fit in the main memory, Neo4j supports partitioning
them over multiple servers and aggregating the results of each node. For such
purposes, Neo4j provides an integration with the Pregel API for custom Pregel
computations.
Since Neo4j contains native support for the Adamic-Adar score, we keep
almost all computation inside the database. After loading the graph in the DB, we
use a Python-based driver program that executes the necessary steps. Finding the
vertices with high degree is performed using the following query:
MATCH (m1:Member)-[:Friend]->(m2:Member)
WITH m1, count(m2) as degree, collect(m2) as friends
WHERE degree >= {min_degree} RETURN m1, friends
Then, after building the candidate pairs in the Python driver and splitting
them into independent lists depending on the number of cores of the machine, we
execute the following query for each candidate pair:
MATCH (m1:Member {name: ’%s’})
MATCH (m2:Member {name: ’%s’})
RETURN gds.alpha.linkprediction.adamicAdar(m1, m2,
{relationshipQuery:’Friend’, orientation:’NATURAL’})
MATCH (m1:Member)-[:Friend]->(m2:Member)
WITH m1, count(m2) as degree, collect(m2) as friends
WHERE degree >= {min_degree} RETURN m1, degree, friends
The computation of all candidate pairs and their split into multiple separate
lists for parallel processing is performed, as previously, in the Python driver
program. The final computation of Adamic-Adar is performed using Python. One
complication which arises here, due to the nature of the Adamic-Adar score, is the
need for degrees of vertices that are not in candidate pairs. When such vertices are
encountered, an additional query for vertex v is performed and its result is cached
in case it is needed a second time.
MATCH (m1:Member {name: {v}})-[:Friend]->(m2:Member)
WITH m1, count(m2) as degree RETURN m1, degree
The final aggregation of top-k results is performed in Python just like in the
Neo4j case.
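The driver-side logic described above (pairing up the high-degree vertices returned by the degree query, skipping pairs that are already connected, and splitting the work into one list per core) can be sketched as follows. This is our illustrative reconstruction, not the chapter’s actual driver code:

```python
from itertools import combinations

def build_candidate_pairs(high_degree_friends: dict) -> list:
    """Pair up high-degree vertices, skipping pairs already joined by an edge.

    `high_degree_friends` maps each vertex returned by the degree query
    to its set of friends.
    """
    return [(u, v)
            for u, v in combinations(sorted(high_degree_friends), 2)
            if v not in high_degree_friends[u]]

def split_for_workers(pairs: list, n_workers: int) -> list:
    # Round-robin split of the candidate pairs into independent per-core lists.
    return [pairs[i::n_workers] for i in range(n_workers)]

members = {"a": {"b"}, "b": {"a", "c"}, "c": {"b"}, "d": set()}
pairs = build_candidate_pairs(members)   # only pairs not already connected
chunks = split_for_workers(pairs, 2)
```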
2.4.1 Datasets
In order to compare the various approaches, we used four datasets, which are
available from the Stanford Network Analysis Project (SNAP)16. Each dataset has
a different number of nodes and edges. In particular:
• wiki-Vote17: Wikipedia is a free encyclopedia written collaboratively by
volunteers around the world. A small part of Wikipedia contributors are
administrators, who are users with access to additional technical features
that aid in maintenance. In order for a user to become an administrator, a
Request for adminship (RfA) is issued and the Wikipedia community via a
public discussion or a vote decides who to promote to adminship. Using the
latest complete dump of Wikipedia page edit history (from January 3, 2008)
we extracted all administrator elections and voting history data. This gave
us 2,794 elections with 103,663 total votes and 7,066 users participating in
the elections (either casting a vote or being voted on). Out of these 1,235
elections resulted in a successful promotion, while 1,559 elections did not
result in the promotion. About half of the votes in the dataset are from
existing admins, while the other half comes from ordinary Wikipedia users.
The network contains all the Wikipedia voting data from the inception of
Wikipedia till January 2008. Nodes in the network represent Wikipedia
users and a directed edge from node i to node j represents that user i voted
on user j.
• soc-Epinions118: This is a who-trust-whom online social network of a
general consumer review site Epinions.com. Members of the site can decide
whether to “trust” each other. All the trust relationships interact and form the
Web of Trust which is then combined with review ratings to determine which
reviews are shown to the user.
• ego-Twitter19: This dataset consists of ‘circles’ (or ‘lists’) from Twitter.
Twitter data was crawled from public sources. The dataset includes node
features (profiles), circles, and ego networks. Data is also available from
Facebook and Google+.
• soc-LiveJournal120: LiveJournal is a free online community with almost 10
million members; a significant fraction of these members are highly active.
For example, roughly 300,000 update their content in any given 24-hour
period. LiveJournal allows members to maintain journals, individual and
group blogs, and it allows people to declare which other members can have
access to them.
16 http://snap.stanford.edu/index.html
17 http://snap.stanford.edu/data/wiki-Vote.html
18 http://snap.stanford.edu/data/soc-Epinions1.html
19 http://snap.stanford.edu/data/ego-Twitter.html
20 http://snap.stanford.edu/data/soc-LiveJournal1.html
‘# nodes’ and ‘# edges’ correspond to the number of nodes and edges of the original
dataset. ‘# nodes 100’, ‘# nodes 250’ and ‘# nodes 500’ correspond to the number of
nodes that have a degree greater or equal than 100, 250 and 500 respectively
Table 2.3. Performance results, on all datasets using a minimum degree of 100
Table 2.4. Performance results, on all datasets using a minimum degree of 250
Table 2.5. Performance results, on all datasets using a minimum degree of 500
do not compress data, and thus need to load a large amount of data from disk to
memory during the query execution, which introduces significant overhead
in query execution. This is the reason that Neo4j ran out of memory for both
LiveJournal subsets that used a minimum degree of 100 or 250. JGraphT had
similar memory issues with the same dataset and the nodes with a minimum
degree of 100, but ran successfully on the fewer nodes that had a minimum degree
of 250. RedisGraph outperformed both solutions in terms of memory usage in
all cases.
An increasing minimum-degree threshold selected fewer and fewer nodes
to apply the algorithm and thus sped up the execution in all cases. The effect
is insignificant for small datasets (e.g. wiki-Vote), but brought an important
decrease in memory usage and execution time for larger datasets. For example,
on the LiveJournal dataset, when examining nodes with degree higher than 500
(i.e., 2,691 nodes), JGraphT used 75% of the memory it needed for examining
nodes with degree higher than 250 (i.e., 15,107 nodes).
A comparison of the solutions in terms of execution time shows that the
tasks were executed quite fast for small datasets in all systems. However, even on
the smallest dataset, Neo4j was 10 times slower than RedisGraph and JGraphT,
whereas on larger graphs its execution time was even worse. In general, JGraphT
was faster than RedisGraph on the same graphs, most probably due to a more
efficient implementation of the algorithm. When the threshold on minimum
degree was high and the resulting nodes were few, RedisGraph was faster than
JGraphT (e.g., in Table 2.5), mainly because it handled more efficiently the
storage of the compressed graph in memory.
In order to provide a better visual comparison of how the algorithmic
implementations performed in each system, in terms of execution time for an
increasing number of nodes (depending on the dataset and the minimum degree
threshold), we draw the respective plots for each degree threshold. The results are
shown in Figures 2.1 to 2.3.
Figure 2.1. Execution times for an increasing graph size (using min. degree = 100).
Figure 2.2. Execution times for an increasing graph size (using min. degree = 250).
Figure 2.3. Execution times for an increasing graph size (using min. degree = 500).
the graph database side, whereas many specialized graph processing and analysis
software exists that implement algorithms for supporting such tasks in memory.
The experimental evaluation results demonstrated the superiority of
specialized software for SNA, against graph databases in terms of applying
specific algorithms on networks that fit (in compressed format or not) in main
memory. Disk-based graph databases are still the preferable solution for custom
queries that target knowledge extraction from the networks. Nice compromises
already exist [36], where the results of Cypher queries from Neo4j can be converted
on the fly into JGraphT graphs in order to execute more sophisticated algorithms.
Graph database software was more consistent in the implementation of
database operations and features, such as transaction management and querying,
but provided fewer graph analysis functionalities and algorithms. On the other
hand, the specialized software for social network analysis is mostly oriented
toward implementing a wide range of algorithms and optimizing their execution
performance. Despite their improved performance in
algorithmic tasks, their scalability was limited by the memory available in the
system, which in turn restricted the graph size that they could handle.
The social network analysis task examined in this chapter (i.e. link prediction)
is a typical task that can be efficiently handled by an in-memory algorithm, but can
also be implemented in an offline (i.e. disk-based) setup using a graph database
and its query engine. This allowed us to re-write the task for the different types
of software solutions and perform an end-to-end comparison, using various graph
sizes and complexities and the respective query or algorithmic workloads. With
this task at hand, we were able to directly compare the solutions in terms of memory
and time performance and to test their limits. The comparison showed the superiority
Appendix
Acronyms used in the chapter are shown in Table 2.6.
Table 2.6. Acronyms used in the chapter
References
[1] Moreno, J.L. 1934. Who shall survive? A new approach to the problem of human
interrelations. Nervous and Mental Disease Publishing Co.
[2] Angles, R. and C. Gutierrez. 2008. Survey of graph database models. ACM Computing
Surveys (CSUR), 40(1): 1–39.
[3] Kaliyar, R.K. 2015. Graph databases: A survey. In: International Conference on
Computing, Communication & Automation (pp. 785–790). IEEE.
[4] Besta, M., E. Peter, R. Gerstenberger, M. Fischer, M. Podstawski, C. Barthels,
G. Alonso and T. Hoefler. 2019. Demystifying graph databases: Analysis and
taxonomy of data organization, system designs, and graph queries. arXiv preprint
arXiv:1910.09017.
[5] Tabassum, S., F.S.F. Pereira, S. Fernandes and J. Gama. 2018. Social network
analysis: An overview. Wiley Interdisciplinary Reviews: Data Mining and Knowledge
Discovery, 8(5): e1256.
[6] Wood, P.T. 2012. Query languages for graph databases. ACM Sigmod Record, 41(1):
50–60.
[7] Francis, N., A. Green, P. Guagliardo, L. Libkin, T. Lindaaker, V. Marsault, S.
Plantikow, M. Rydberg, P. Selmer and A. Taylor. 2018. Cypher: An evolving query
language for property graphs. In: Proceedings of the 2018 International Conference
on Management of Data (pp. 1433–1445).
[8] Guia, J., V. Goncalves Soares and J. Bernardino. 2017. Graph databases: Neo4j
analysis. In: ICEIS (1) (pp. 351–356).
[9] Fernandes, D. and J. Bernardino. 2018. Graph databases comparison: Allegrograph,
Arangodb, Infinitegraph, Neo4j, and Orientdb. In: DATA (pp. 373–380).
[10] Kang, U., C.E. Tsourakakis and C. Faloutsos. 2011. Pegasus: Mining peta-scale
graphs. Knowledge and Information Systems, 27(2): 303–325.
[11] Martella, C., R. Shaposhnik, D. Logothetis and S. Harenberg. 2015. Practical graph
analytics with apache giraph, volume 1. Springer.
[12] Siddique, K., Z. Akhtar, E.J. Yoon, Y.-S. Jeong, D. Dasgupta and Y. Kim. 2016.
Apache hama: An emerging bulk synchronous parallel computing framework for big
data applications. IEEE Access, 4: 8879–8887.
[13] Malewicz, G., M.H. Austern, A.J.C. Bik, J.C. Dehnert, I. Horn, N. Leiser and G.
Czajkowski. 2010. Pregel: A system for large-scale graph processing. In: Proceedings
of the 2010 ACM SIGMOD International Conference on Management of Data (pp.
135–146).
[14] Castellana, V.G., A. Morari, J. Weaver, A. Tumeo, D. Haglin, O. Villa and J. Feo.
2015. In-memory graph databases for web-scale data. Computer, 48(3): 24–35.
[15] Da Silva, M.D. and H.L. Tavares. 2015. Redis Essentials. Packt Publishing Ltd.
[16] Davis, T.A. 2019. Algorithm 1000: Suitesparse: Graphblas: Graph algorithms in the
language of sparse linear algebra. ACM Transactions on Mathematical Software
(TOMS), 45(4): 1–25.
[17] Csardi, G. and T. Nepusz. 2006. The igraph software package for complex network
research. InterJournal, Complex Systems, 1695(5): 1–9.
[18] Siek, J.G., L-Q. Lee and A. Lumsdaine. 2001. Boost Graph Library: User Guide and
Reference Manual, The Pearson Education.
[19] O’Madadhain, J., D. Fisher, S. White and Y. Boey. 2003. The JUNG (java universal
network/graph) framework. University of California, Irvine, California.
[20] Michail, D., J. Kinable, B. Naveh and J.V. Sichi. 2020. Jgrapht—A java library for
graph data structures and algorithms. ACM Transactions on Mathematical Software
(TOMS), 46(2): 1–29.
[21] Bejeck, B. 2013. Getting Started with Google Guava. Packt Publishing Ltd.
[22] Hagberg, A., P. Swart and D.S. Chult. 2008. Exploring network structure, dynamics,
and function using networkx. Technical Report, Los Alamos National Lab. (LANL),
Los Alamos, NM (United States).
[23] Staudt, C.L., A. Sazonovs and H. Meyerhenke. 2016. Networkit: A tool suite for large-
scale complex network analysis. Network Science, 4(4): 508–530.
[24] Leskovec, J. and R. Sosič. 2016. Snap: A general-purpose network analysis and graph-
mining library. ACM Transactions on Intelligent Systems and Technology (TIST),
8(1): 1.
A Comparative Survey of Graph Databases and Software for Social... 55
[25] Boldi, P. and S. Vigna. 2004. The webgraph framework I: Compression techniques.
In: Proceedings of the 13th International Conference on World Wide Web (pp. 595–
602).
[26] Paz, J.R.G. 2018. Introduction to Azure Cosmos DB. In: Microsoft Azure Cosmos DB
Revealed (pp. 1–23). Springer.
[27] Yan, D., Y. Tian and J. Cheng. 2017. Systems for Big Graph Analytics. Springer.
[28] Yan, D., Y. Bu, Y. Tian and A. Deshpande. 2017. Big graph analytics platforms.
Foundations and Trends in Databases, 7(1-2): 1–195.
[29] Pokorný, J. 2015. Graph databases: Their power and limitations. In: IFIP International
Conference on Computer Information Systems and Industrial Management (pp. 58–
69). Springer.
[30] Zhou, T., L. Lü and Y.-C. Zhang. 2009. Predicting missing links via local information.
The European Physical Journal B, 71(4): 623–630.
[31] Adamic, L.A. and E. Adar. 2003. Friends and neighbors on the web. Social Networks,
25(3): 211–230.
[32] Liben-Nowell, D. and J. Kleinberg. 2007. The link-prediction problem for social
networks. Journal of the American Society for Information Science and Technology,
58(7): 1019–1031.
[33] Yuliansyah, H., Z.A. Othman and A.A. Bakar. 2020. Taxonomy of link prediction for
social network analysis: A review. IEEE Access.
[34] Buluc, A., T. Mattson, S. McMillan, J. Moreira and C. Yang. 2017. The Graphblas C
API specification. Tech. Rep, 88. GraphBLAS.org
[35] Szárnyas, G., D.A. Bader, T.A. Davis, J. Kitchen, T.G. Mattson, S. McMillan and E.
Welch. 2021. Lagraph: Linear algebra, network analysis libraries, and the study of
graph algorithms. arXiv preprint arXiv:2104.01661.
[36] JGraphT Neo4j client. 2021. https://github.com/murygin/jgrapht-neo4j-client.
Accessed: 2021-06-01.
CHAPTER
3
A Survey on Neo4j Use Cases
in Social Media: Exposing New
Capabilities for Knowledge Extraction
Neo4j is the leading NoSQL graph database. In the past few years, it has gained
popularity, with ever more businesses and enterprises using it. In addition, many
universities have incorporated teaching Neo4j within their syllabi. Meanwhile,
many organizations have been compelled to move to NoSQL databases because of
the limitations of relational databases, such as their cumbersomeness and their
inability to keep pace with exploding data volumes and the incorporation
of new and diverse data types. This chapter provides a literature survey and
analyzes major use cases and applications of Neo4j, focusing on the domain
of social media. At the same time, it categorizes them according to identified
context. The categorization includes topic detection/extraction, recommendation
systems, branding/marketing, learning environments, healthcare analytics,
influence detection, fake/controversial information, information modeling,
environment/disaster management, profiling/criminality, attractions/tourism, and
metrics/altmetrics. For each category, representative examples are showcased,
highlighting the necessity and benefits of using a NoSQL database such as Neo4j
for various applications.
3.1 Introduction
Neo4j1 is a graph database designed from the ground up to leverage both data and data
connections. Neo4j links data as it is stored, allowing queries never seen before, at
1 https://neo4j.com/
58 Graph Databases: Applications on Social Media Analytics and Smart Cities
database with a persistent Java engine that allows structures to be saved as graphs
rather than tuples.
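The idea of persisting structures as graphs rather than tuples can be illustrated with a minimal Python sketch. This is a toy model for illustration only, not Neo4j's actual storage engine: the point is that relationships are stored alongside the nodes, so a traversal follows stored links directly instead of joining tables at query time.

```python
# Toy property-graph model (illustrative only, not Neo4j's engine):
# nodes carry properties, and each node keeps its outgoing relationships,
# so a one-hop traversal is a direct lookup rather than a join.

class PropertyGraph:
    def __init__(self):
        self.nodes = {}          # node id -> properties dict
        self.adjacency = {}      # node id -> list of (rel_type, target id)

    def add_node(self, node_id, **props):
        self.nodes[node_id] = props
        self.adjacency.setdefault(node_id, [])

    def add_rel(self, src, rel_type, dst):
        self.adjacency[src].append((rel_type, dst))

    def neighbors(self, node_id, rel_type):
        # Follow the stored links directly; no join is computed.
        return [dst for t, dst in self.adjacency[node_id] if t == rel_type]

g = PropertyGraph()
g.add_node("alice", name="Alice")
g.add_node("bob", name="Bob")
g.add_node("carol", name="Carol")
g.add_rel("alice", "FOLLOWS", "bob")
g.add_rel("alice", "FOLLOWS", "carol")

print(g.neighbors("alice", "FOLLOWS"))  # ['bob', 'carol']
```

In Cypher, the equivalent one-hop lookup is a single pattern match, e.g. `MATCH (a:User {name: 'Alice'})-[:FOLLOWS]->(b) RETURN b`.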
Neo4j was first released in 2007 and is offered in three editions:
Community, Government, and Enterprise. At the time of writing, the Community
Edition is the freely available edition, which includes all fundamental capabilities,
while the Enterprise Edition grants one-month trial access to the full edition of Neo4j.
The Government Edition is an update to the Enterprise edition that focuses
on government services. The primary distinctions between Community and
Enterprise/Government Editions include a sophisticated monitoring system,
increased database scalability, robust management of database locks, the presence
of online backup, a high-performance level of memory cache, and more [2].
The remainder of this chapter is structured as follows. Section 2 reports on
the state-of-the-art data analytics in the domain of SM that utilize Neo4j. Section 3
analyzes and outlines the resulting categorization of the state-of-the-art, which is
the main goal of this work. Finally, Section 4 discusses the context of this chapter,
the implications of such a taxonomy, as well as future directions resulting from
the conducted survey on the state-of-the-art Neo4j use cases in SM.
are utilized. More specifically, MongoDB and Neo4j are combined. Such a system
is validated by three different datasets reporting on higher precision and recall
when compared with other state-of-the-art systems [3].
3 https://qgis.org/en/site/
metric of interest and occurrence of results related to the initial search query.
Sentiment analysis (SA) is also utilized to report on the positive/negative impact
of text and its respective terms on the webpage. A final rank is applied to the
retrieved web pages and SN pages. The goal is to identify the appropriate online
location for advertising the brand yielding a positive impact and high affinity with
the ideal content. This approach constitutes a targeted brand analysis that enables
benefits for both customers and advertisement agencies [7].
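The final-rank idea, combining a popularity metric with a sentiment score, can be sketched as follows. The lexicon, weights, and page data are invented for illustration and are not taken from [7]:

```python
# Hedged sketch of combining a popularity metric with lexicon-based
# sentiment to rank candidate advertising locations. All terms, weights,
# and pages below are hypothetical.

POSITIVE = {"great", "love", "reliable"}
NEGATIVE = {"broken", "poor", "scam"}

def sentiment(text):
    # Net count of positive minus negative lexicon hits.
    words = text.lower().split()
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

def final_rank(pages):
    # pages: list of (name, popularity, text); higher combined score ranks first.
    scored = [(name, pop + sentiment(text)) for name, pop, text in pages]
    return sorted(scored, key=lambda x: x[1], reverse=True)

pages = [
    ("fan-blog", 3, "love this brand great and reliable"),
    ("complaints-forum", 5, "broken product poor support scam"),
]
print(final_rank(pages))  # fan-blog outranks the more popular complaints page
```

Despite its lower raw popularity, the page with positive affinity to the brand ranks first, which is the behaviour the targeted brand analysis aims for.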
Online market research strategies involve the dissemination of IT product
information to retrieve customer input. Yet, SM customer responses tend to be
unstructured and very large in volume. An automated approach for customer
opinion extraction would wield great benefits for the effectiveness of business
ads. An intelligent method measures inter-profile causality, value structures,
and user attitudes from replies on SM platforms such as YouTube, drawing on
media/information richness theory to capture agility and information-richness
features. On this basis, a deep SA approach is proposed that appears to
outperform other legacy approaches [8].
Customer review websites and SM are great sources for mining product
feature data. Most design methodologies for product feature extraction assume that
the product features expressed by customers are clearly stated and comprehended
and can be mined directly. However, that is rarely the case. A novel
inference model automatically assigns the most probable explicit product feature
that the customer desires, based on an implicit preference expression.
The algorithm adjusts its inference functionality by hypothesizing and
utilizing ground truth. This approach is validated through its statistical evaluation
in a case study of smartphone product features utilizing a Twitter dataset [9].
Opioid usage
Neo4j is used to collect and handle tweets involving at least one opioid-related
term. Opioid (mis)use is a growing public health problem in the United States,
hence the purpose of this project is to offer healthcare professionals data on
public opioid usage and the geographical distribution of opioid-related tweets.
The findings can be used to tailor public health initiatives to communities in high-
use regions. Among the results are: (i) During the analysis period, California,
particularly the Sacramento area, had the largest number of users (921 people)
who sent 2,397 opioid-related tweets. (ii) When compared to the total number of
tweets in the state, North Carolina has the greatest proportion (17%) of opioid-
related tweets. (iii) The greatest opioid-related user group on Twitter has 42
members, and the most often discussed topic in this group is the negative impacts
of Percocet and Tylenol [12].
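The kind of aggregation behind findings (i) and (ii) can be sketched in a few lines. The tweet records and opioid term list below are invented for illustration and are not the data of [12]:

```python
# Illustrative aggregation of opioid-related tweets per state: count
# matching tweets and compute each state's proportion. Records and the
# term list are hypothetical stand-ins for the collected Twitter data.
from collections import Counter

OPIOID_TERMS = {"percocet", "oxycodone", "fentanyl"}  # hypothetical subset

tweets = [
    {"user": "u1", "state": "CA", "text": "prescribed Percocet again"},
    {"user": "u2", "state": "CA", "text": "fentanyl crisis in the news"},
    {"user": "u3", "state": "NC", "text": "oxycodone side effects"},
    {"user": "u4", "state": "NC", "text": "great weather today"},
]

def opioid_related(text):
    # A tweet counts if it contains at least one opioid-related term.
    return any(term in text.lower() for term in OPIOID_TERMS)

related = [t for t in tweets if opioid_related(t["text"])]
by_state = Counter(t["state"] for t in related)
total_by_state = Counter(t["state"] for t in tweets)
proportion = {s: by_state[s] / total_by_state[s] for s in by_state}

print(by_state)     # opioid-related tweet counts per state
print(proportion)   # share of each state's tweets that are opioid-related
```

In the Neo4j-backed system, the same counts would fall out of the stored tweet/user/location graph; here plain dictionaries stand in for that store.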
The opioid pandemic poses a significant public health risk to worldwide
communities. Due to limited research and current monitoring limitations, the
Nature-deficit disorder
The body of evidence supporting the notion that restorative surroundings, green
fitness, and nature-based activities improve human health is growing. Nature-
deficit disorder, a media phrase used to reflect the negative consequences of
people’s estrangement from nature, has yet to be legally recognized as a clinical
diagnosis. SM, such as Twitter, with its potential to collect BD on public opinion,
provides a platform for investigating and disseminating information about the
nature-deficit disorder and other nature–health topics. An analysis has been
conducted on over 175,000 tweets using SA to determine if they are favorable,
neutral, or negative, and then the influence on distribution was mapped.
SA was utilized to analyze the consequences of events in SNs, to examine
perceptions about products and services, and comprehend various elements of
communication in Web-based communities. A comparison between nature-deficit-
disorder “hashtags” and more specific nature hashtags was used to provide
recommendations for enhanced distribution of public health information through
message phrasing adjustments. Twitter can increase knowledge of the natural
environment’s influence on human health [15].
HIV
Since the 1990s, the number of new HIV cases and outbreaks has remained
steady at around 50,000 each year. To increase epidemic containment, public health
interventions aimed at reducing HIV spread should be created. Online SNs
and their real-time communication capacities are evolving as fresh venues for
epidemiological research. Recent research has shown that utilizing Twitter to
study HIV epidemiology is feasible. The utilization of publicly accessible data
from Twitter as an indication of HIV risk is presented as a novel way of identifying
HIV at-risk groups. Existing methodologies are improved by providing a new
infrastructure for collecting, classifying, querying, and visualizing data, and
by demonstrating the possibility of classifying HIV at-risk groups in the San Diego
area at a finer degree of data resolution [16].
The HIV epidemic is still a major public health issue. According to recent
statistics, preventive initiatives do not reach a large number of persons in
susceptible communities. To allow evidence-based prevention, researchers are
investigating novel ways for identifying HIV at-risk groups, including the use
of Twitter tweets as potential markers of HIV risk. A study on SN analysis and
machine learning demonstrated the viability of utilizing tweets to monitor HIV-
related threats at the demographic, regional, and SN levels. This methodology,
though, exposes moral dilemmas in three areas: (i) data collection and analysis,
(ii) risk assessment using imprecise probabilistic techniques, and (iii) data-driven
intervention. An examination and debate of ethics are offered based on two
years of experience with doctors and local HIV populations in San Diego,
California [17].
platform was tested. The suggested strategy appears to be promising for mining
prominent nodes in large SNs, based on experimental findings [19].
Fake users
With the rapid increase in the number of Web users, SN platforms have been
among the most important forms of communication all over the world. There
are several notable participants in this industry, such as Facebook, Twitter, and
YouTube. Most SN platforms include some type of measure that may be used to
characterize a user’s popularity, like the number of followers on Twitter, likes on
Facebook, and so on.
Yet, it has been seen in past years that many users seek to influence their
popularity by using false accounts. A certain technique discovers all false
followers in a social graph network using attributes related to the centrality of all
nodes in the graph and training a classifier using a subset of the data. Employing
solely centrality measurements, the proposed approach detected false followers
with a high degree of accuracy. The suggested technique is general in nature and
may be applied regardless of the SN platform at hand [20].
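The core idea, centrality features feeding a fake-follower classifier, can be sketched as follows. The graph and threshold here are invented; the paper trains a classifier on a labeled subset, for which a simple in-degree threshold stands in:

```python
# Hedged sketch of centrality-based fake-follower detection: compute a
# per-node centrality feature (in-degree) and flag suspicious accounts.
# Graph, threshold, and the threshold rule itself are illustrative
# substitutes for the trained classifier described in [20].

followers = {  # account -> accounts it follows (directed edges)
    "real1": ["celebrity", "real2"],
    "real2": ["celebrity", "real1"],
    "fake1": ["celebrity"],
    "fake2": ["celebrity"],
    "celebrity": [],
}

def in_degree(graph):
    # Count incoming edges for every node.
    deg = {n: 0 for n in graph}
    for targets in graph.values():
        for dst in targets:
            deg[dst] += 1
    return deg

def flag_fakes(graph, threshold=1):
    # Accounts that follow others but are themselves followed by nobody
    # (in-degree below the threshold) are flagged as likely fake.
    deg = in_degree(graph)
    return sorted(n for n in graph if graph[n] and deg[n] < threshold)

print(flag_fakes(followers))  # ['fake1', 'fake2']
```

In the surveyed work the centrality measures are computed by Neo4j over the full social graph; only the classification step is caricatured here.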
The Ethical Journalism Network defines disinformation as deliberately
manufactured and publicized material meant to misguide and confuse others.
It tries to manipulate the uneducated into believing lies and has a bad societal
impact. Fake SM users are seen as famous, and they distribute false information
by making it appear genuine. The goal of this research project is to increase the
accuracy in detecting fake users [20]. By utilizing several centrality measures
supplied by the Neo4j graph database and two additional datasets a classification
technique (Random Forest) was incorporated to detect fake users on SM [21].
Several SM platforms that permit content distribution and SN interactions
have risen in popularity. Although certain SM platforms enable distinguishing
group tags or channel tagging using ‘@’ or ‘#’, there has been a lack of user
personalization (a user must recollect a person by their SM name on sign-up).
An alternative approach is presented, in which users may store and later look up
their friends by the names they use to identify them in real life, i.e. nicknames.
Furthermore, the suggested approach may be utilized in chat systems for
identifying individuals and tagging friends, similar to how ‘@’ is used on SM
platforms such as Facebook and Twitter [22].
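The nickname mechanism amounts to a per-user mapping from real-life names to platform handles. A minimal sketch, with all names and handles invented:

```python
# Illustrative nickname resolution for tagging in a chat system: each user
# keeps a personal mapping from real-life nicknames to platform handles.
# All nicknames and handles below are hypothetical.

address_book = {"grandma": "@margaret_w_1948", "coach": "@j_torres_fitness"}

def resolve(nickname):
    # Case-insensitive lookup of the stored handle, or None if unknown.
    return address_book.get(nickname.lower())

def tag(message, nickname):
    handle = resolve(nickname)
    # Fall back to the raw nickname when no handle is stored.
    return message.replace(f"@{nickname}", handle or f"@{nickname}")

print(tag("thanks @coach for the plan!", "coach"))
```

The same lookup could naturally live as a `KNOWS_AS` relationship between user nodes in a graph database, which is what makes the approach a fit for the systems surveyed here.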
Pathogenic SM profiles, like terrorist supporter accounts and fake media
publishers, have the potential to propagate disinformation virally. Early
identification of pathogenic accounts is critical since they are likely to be major
sources of spreading illicit information. The causal inference approach was used
Fake news
Comments and information posted on the internet and other SM platforms impact
public opinion about prospective treatments for detecting and healing illnesses.
The spread of these is similar to the spread of fake news regarding other vital
areas, such as the environment. To validate the suggested technique, SM networks
were employed as a testing ground. Twitter users’ behavior was examined using
an algorithm. A dynamic knowledge graph technique was also incorporated to
describe Twitter and other open-source data such as web pages. Furthermore,
a real example of how the corresponding graph structure of tweets connected
to World Environment Day 2019 was utilized to construct a heuristic analysis.
Methodological recommendations showed how this system may enable the
automating of operations for the development of an automated algorithm for the
identification of fake health news on the web [25].
The analysis of enormous graph data sets has become a significant tool
for comprehending and affecting the world. The utilization of graph DBMSs
in various use cases such as cancer research exemplifies how analyzing graph-
structured data may help find crucial but hidden connections. In this context, an
example demonstrated how GA might help cast light on the functioning of SM
troll networks, such as those seen on Twitter.
GA can efficiently assist businesses in uncovering patterns and structures
inside linked data. That way it allows them to make more accurate forecasts and
make faster choices. This necessitates effective GA that is well-integrated with
graph data management. Such an environment is provided by Neo4j. It offers
transactional and analytical processing of graph data, as well as data management
and analytics capabilities. The Neo4j graph algorithms are the primary ingredients
for GA. These algorithms are effectively implemented as parallelized versions
of typical graph algorithms, tuned for the Neo4j graph database. The design and
integration of Neo4j graph algorithms were discussed and its capabilities were
displayed with a Twitter Troll analysis showcasing its performance with a few
massive graph tests [26].
CRF model recognized the topics and wrote the entities and their types to a table.
Then the entities and their links, as well as the structure of the resource description
framework, have been merged to create an earthquake knowledge graph on
Neo4j [33].
Information has become more accessible and faster than before since more
and more people use the web and SM as sources of news. The growth of the
internet and SM has also given a voice to a much larger audience. As a result, each
user has the opportunity to function as an active correspondent, generating a large
amount of data on live occurrences. Twitter data was gathered and analyzed to
detect, evaluate, and present instances of social turmoil in three countries, namely
India, Pakistan, and Bangladesh [34].
system administrators and give them a forewarning about the attacker’s skills
and purpose.
System administrators must prevent, repel, and identify cyber intrusions, as
well as adjust in the aftermath of successful attacks. Advanced warnings allow
system administrators to focus on certain assault component types, time periods,
and targets, allowing them to be more effective. A widespread denial-of-service
assault and website defacement can be mitigated by observing SM and private
group chats. Even when an assault is successful, security officials can briefly
halt certain foreign traffic to the affected sites. There were teams prepared to react
to attacks and repair or recover websites. Monitoring SM networks is a useful
way for detecting hostile cyber conversations, but analysts presently lack the
necessary automated tools [37].
A visual analytics system called Matisse was presented for exploring
worldwide tendencies in textual information streams with special applicability to
SM. It was promised to enable real-time situational awareness through the use of
various services. Yet, interactive analysis of such semi-structured textual material
is difficult due to the high throughput and velocity of data. This system provided
(i) sophisticated data streaming management, (ii) automatically generated
sentiment/emotion analytics, (iii) inferential temporal, geospatial, and term-
frequency visualizations, and (iv) an adaptable interaction scheme that enables
multiple data views. The evaluation took place with a real use case, sampling 1%
of total recorded Twitter data during the week of the Boston Marathon bombings.
The suggested system also contained modules for data analytics and associations
based on implicit user networks in Neo4j [38].
the origin of inspirations and its significance in the institution’s connections with
its tourists. The tools gathered and analyzed Twitter data relating to two incidents:
(i) by working with museum specialists, the utility of discovering expressions of
inspiration in Tweets was investigated, and (ii) an assessment utilizing annotated
material yielded an F-measure of 0.46, showing that SM may be a viable data
source [40].
There has been much debate on how museums may provide more value to
tourists than can be simply captured using instrumental measures such as
attendance data. Questionnaires or interviews can give more in-depth information
about the influence of museum activities. Yet, they can be unpleasant and time-
consuming, and they only offer snapshots of visitor attitudes at certain periods
in time. The Epiphany Project studied the viability of applying computational
social science approaches to detect evidence of museum inspiration in tourist
SM. Inspiration is described in a way that is useful to museum activity, as are
the stakeholders that could benefit, their needs, and the vision for the system’s
design [41].
SM users may post text, create stories, be co-creators of these stories, and
engage in group message sharing. Users with a large number of message exchanges
have a substantial, selective effect on the information delivered across the SM.
Businesses and organizations should consider segmenting SM web pages and
linking user opinions with structural indicators. Sections of a web page containing
valuable information should be recognized, and an intelligent wrapping system
based on clustering and statistics could be suggested to do so automatically. The
goal is to gather information on business or organization services/goods based on
real-time comments provided by users on SM. Experimentation on Facebook with
posts for a hotel booking website named Booking.com took place. Implications
include the formation of a web user community with common interests that are
related to a product/service. As a result, comment responses on SM can be
utilized in tourism and other activities, increasing trust and improving loyalty
and trustworthiness [42].
3.4 Conclusion
This chapter highlights the importance of transitioning from SQL to NoSQL
databases by investigating use cases of Neo4j in the SM domain. A categorization
is conceived, based on the context of the state-of-the-art. The intent is to highlight
the importance and benefits of NoSQL in contrast to relational databases, while
presenting its broad applicability in various modern data management
applications.
Various topics of application have been examined; yet some factors should also
be considered when weighing migration to a NoSQL database, or which NoSQL
database is the most appropriate to use. Indicatively, this section
also discusses the comparison of PostgreSQL with Neo4j and MongoDB with
Neo4j in two distinct use cases.
To make the most of the large amount of information accessible in today’s
BD environment, new analytical skills must be developed. SQL databases, such as
PostgreSQL, have typically been favoured, with graph databases, such as Neo4j,
limited to the analysis of SNs and transportation data. The MIMIC-III patient
database is used as a case study for a comparison between PostgreSQL (which
uses SQL) and Neo4j (which uses Cypher). While Neo4j takes longer to set up, its
queries are less complicated and perform faster compared to PostgreSQL queries.
As a result, while PostgreSQL is a solid database, Neo4j should be considered a
viable solution for gathering and processing health data [45].
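The reported difference in query complexity can be caricatured in a few lines: the same one-to-many lookup done relational-style over tuples versus as a traversal over stored adjacency. The patient and admission records below are invented, not MIMIC-III data:

```python
# Toy contrast between a relational-style join over tuples and a
# graph-style traversal over stored adjacency. All records are invented.

# Relational style: flat tuples, joined at query time.
patients = [(1, "Ann"), (2, "Ben")]                       # (patient_id, name)
admissions = [(10, 1, "2019-01-02"),                      # (adm_id, patient_id, date)
              (11, 1, "2019-03-05"),
              (12, 2, "2019-02-01")]

def admissions_for_sql_style(name):
    # Two passes: find the patient id, then join against admissions.
    pid = next(p_id for p_id, p_name in patients if p_name == name)
    return sorted(a_id for a_id, a_pid, _ in admissions if a_pid == pid)

# Graph style: the relationship is stored with the node, so the "query"
# is a direct lookup, much like MATCH (p:Patient)-[:HAS_ADMISSION]->(a).
graph = {"Ann": [10, 11], "Ben": [12]}

def admissions_for_graph_style(name):
    return sorted(graph[name])

print(admissions_for_sql_style("Ann"), admissions_for_graph_style("Ann"))
```

Both return the same answer; the graph formulation simply has less query machinery, which mirrors the finding that Cypher queries on MIMIC-III were less complicated and faster than their PostgreSQL counterparts.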
Furthermore, in the smartphone era, geospatial data is vital for building citizen-
centric services for long-term societal evolution, such as smart city development,
disaster management services, and identifying critical infrastructures such as
schools, hospitals, train stations, and banks. People are producing geo-tagged data
on numerous SM websites such as Facebook, Twitter, and others, which may be
categorized as BD due to the provision of the three key BD attributes: volume,
variety, and velocity. Multiple sources generate vast volumes of heterogeneous
data that cannot easily be classified.
Moreover, data-driven applications demand high throughput, and it is quite
tough to manage such a big volume of data. Instead of employing a
traditional relational database management system, this geotagged data should be
maintained using BD management techniques such as NoSQL. As a result, in the
context of geographical information systems and all other use cases, it is critical
to select the suitable graph database to be used. In [46], for instance, the authors
evaluated the performance of MongoDB and Neo4j querying geotagged data to
make a more educated selection.
3.4.1 Contribution
This chapter’s goal is to expose the reader to graph databases, notably Neo4j and
its SM-related applications. It begins by briefly discussing graph databases and
Neo4j. It then covers literature on Neo4j use in SM for different applications.
References
[1] Robinson, I., J. Webber and E. Eifrem. 2013. Graph Databases. O’Reilly Media.
Cambridge, USA.
[2] Guia, J., V.G. Soares and J. Bernardino. 2017. Graph databases: Neo4j analysis. In:
ICEIS (1), pp. 351–356.
[3] Asgari-Chenaghlu, M., M.-R. Feizi-Derakhshi, M.-A. Balafar and C. Motamed.
2020. Topicbert: A transformer transfer learning based memory-graph approach for
multimodal streaming social media topic detection. arXiv Prepr. arXiv2008.06877.
[4] Sen, S., A. Mehta, R. Ganguli and S. Sen. 2021. Recommendation of influenced
products using association rule mining: Neo4j as a case study. SN Comput. Sci., 2(2):
1–17.
[5] Konno, T., R. Huang, T. Ban and C. Huang. 2017. Goods recommendation based on
retail knowledge in a Neo4j graph database combined with an inference mechanism
implemented in jess. In: 2017 IEEE SmartWorld, Ubiquitous Intelligence & Computing,
Advanced & Trusted Computed, Scalable Computing & Communications, Cloud &
Big Data Computing, Internet of People and Smart City Innovation (SmartWorld/
SCALCOM/UIC/ATC/CBDCom/IOP/SCI), pp. 1–8.
[6] Dubey, M.D.R. and S.R.R. Naik. 2019. An Integrated Recommendation System Using
Graph Database and QGIS. Int. Research Journal of Engineering and Technology
(IRJET), 6(6).
[7] Aggrawal, N., A. Ahluwalia, P. Khurana and A. Arora. 2017. Brand analysis
framework for online marketing: Ranking web pages and analyzing popularity of
brands on social media. Soc. Netw. Anal. Min., 7(1): 21.
[8] Jang, H.-J., J. Sim, Y. Lee and O. Kwon. 2013. Deep sentiment analysis: Mining
the causality between personality-value-attitude for analyzing business ads in social
media. Expert Syst. Appl., 40(18): 7492–7503.
[9] Tuarob, S. and C.S. Tucker. 2015. A product feature inference model for mining implicit
customer preferences within large scale social media networks. In: International
Design Engineering Technical Conferences and Computers and Information in
Engineering Conference, 2015, vol. 57052, p. V01BT02A002.
[10] Stanescu, L., V. Dan and M. Brezovan. 2016. Social learning environment based on
social learning graphs formalism. In: 2016 20th International Conference on System
Theory, Control and Computing (ICSTCC), 2016, pp. 818–823.
[11] Fiaidhi, J. 2020. Envisioning insight-driven learning based on thick data analytics
with focus on healthcare. IEEE Access, 8: 114998–115004.
[12] Soni, D., T. Ghanem, B. Gomaa and J. Schommer. 2019. Leveraging Twitter and
Neo4j to study the public use of opioids in the USA. In: Proceedings of the 2nd
Joint International Workshop on Graph Data Management Experiences & Systems
(GRADES) and Network Data Analytics (NDA), 2019, pp. 1–5.
[13] Zhao, F., P. Skums, A. Zelikovsky, L. Sevigny, H. Swahn, M. Strasser, Y. Huang
and Y. Wu. 2020. Computational approaches to detect illicit drug ads and find
vendor communities within social media platforms. IEEE/ACM Trans. Comput. Biol.
Bioinform.
[14] Celesti, A., A. Buzachis, A. Galletta, G. Fiumara, M. Fazio and M. Villari. 2018.
Analysis of a NoSQL graph DBMS for a hospital social network. In: 2018 IEEE
Symposium on Computers and Communications (ISCC), 2018, pp. 1298–1303.
[15] Palomino, M., T. Taylor, A. Göker, J. Isaacs and S. Warber. 2016. The online
dissemination of nature–health concepts: Lessons from sentiment analysis of social
media relating to ‘nature-deficit disorder’. Int. J. Environ. Res. Public Health, 13(1):
142.
[16] Thangarajan, N., N. Green, A. Gupta, S. Little and N. Weibel. 2015. Analyzing social
media to characterize local HIV at-risk populations. In: Proceedings of the Conference
on Wireless Health, 2015, pp. 1–8.
[17] Weibel, N., P. Desai, L. Saul, A. Gupta and S. Little. 2017. HIV risk on Twitter:
The ethical dimension of social media evidence-based prevention for vulnerable
populations. In: Proceedings of 50th Int. Conf. on System Sciences, pp. 1775-1784.
[18] Joshi, P. and S. Mohammed. 2020. Identifying social media influencers using graph
based analytics. Int. J. of Advanced Research in Big Data Management System, 4(1):
35–44.
[19] El Bacha, R. and T.T. Zin. 2017. A Markov Chain Approach to big data ranking
systems. In: 2017 IEEE 6th Global Conference on Consumer Electronics (GCCE),
2017, pp. 1–2.
[20] Mehrotra, A., M. Sarreddy and S. Singh. 2016. Detection of fake Twitter followers
using graph centrality measures. In: 2016 2nd International Conference on
Contemporary Computing and Informatics (IC3I), 2016, pp. 499–504.
[21] Zhao, Y. 2020. Detecting Fake Users on Social Media with Neo4j and Random Forest
Classifier. University of Victoria, Department of Human & Social Development,
http://hdl.handle.net/1828/11809
[22] Aggarwal, A. 2016. Enhancing social media experience by usage of user-defined
nicknames as additional identifiers for online interaction. J. Comput. Sci. Appl., 4(1):
1–8.
[23] Shaabani, E., A.S. Mobarakeh, H. Alvari and P. Shakarian. 2019. An end-to-end
framework to identify pathogenic social media accounts on Twitter. In: 2019 2nd
International Conference on Data Intelligence and Security (ICDIS), 2019, pp. 128–
135.
[24] Tatbul, N., T.J. Lee, S. Zdonik, M. Alam and J. Gottschlich. 2018. Precision and recall
for time series. Mar. 2018. Accessed: Dec. 09, 2021. [Online]. Available: http://arxiv.org/abs/1803.03639.
[25] Lara-Navarra, P., H. Falciani, E.A. Sánchez-Pérez and A. Ferrer-Sapena. 2020.
Information management in healthcare and environment: Towards an automatic
system for fake news detection. Int. J. Environ. Res. Public Health, 17(3): 1066.
[26] Allen, D., Hodler, E.A., Hunger, M., Knobloch, M., Lyon, W., Needham, M. and
Voigt, H. 2019. Understanding trolls with efficient analytics of large graphs in neo4j.
BTW 2019.
[27] Filippov, A., V. Moshkin and N. Yarushkina. 2019. Development of a software for the
semantic analysis of social media content. In: International Conference on Information
Technologies, 2019, pp. 421–432.
[28] Xie, T., Y. Yang, Q. Li, X. Liu and H. Wang. 2019. Knowledge graph construction for
intelligent analysis of social networking user opinion. In: International Conference on
e-Business Engineering, 2019, pp. 236–247.
[29] Ferreira, D.R.G. 2014. Using Neo4J geospatial data storage and integration.
Master’s Dissertation. University of Madeira Digital Library, http://hdl.handle.
net/10400.13/1034
[30] Wagenpfeil, S., F. Engel, P.M. Kevitt and M. Hemmje. 2021. Ai-based semantic
multimedia indexing and retrieval for social media on smartphones. Information,
12(1): 43.
[31] Chen, J., S. Wang and B. Stantic. 2017. Connecting social media data with observed
hybrid data for environment monitoring. In: International Symposium on Intelligent
and Distributed Computing, 2017, pp. 125–135.
[32] Grolinger, K., M.A.M. Capretz, E. Mezghani and E. Exposito. 2013. Knowledge as
a service framework for disaster data management. In: 2013 Workshops on Enabling
Technologies: Infrastructure for Collaborative Enterprises, 2013, pp. 313–318.
[33] Sun, X., Qi, L., Sun, H., Li, W., Zhong, C., Huang, Y. and Wang, P. 2020. Earthquake
knowledge graph constructing based on social intercourse using BiLSTM-CRF. In:
IOP Conference Series: Earth and Environmental Science, 2020, vol. 428(1), p.
12080.
[34] Clark, T. and D. Joshi. 2019. Detecting areas of social unrest through natural language
processing on social media. J. Comput. Sci. Coll., 35(4): 68–73.
[35] Ward. 2016. Using social media activity to identify personality characteristics of
Navy personnel. Naval Postgraduate School Monterey United States.
78 Graph Databases: Applications on Social Media Analytics and Smart Cities
[36] Maguerra, S., A. Boulmakoul, L. Karim and H. Badir. 2018. Scalable solution for
profiling potential cyber-criminals in Twitter. In: Proceedings of the Big Data &
Applications 12th Edition of the Conference on Advances of Decisional Systems.
Marrakech, Morocco, 2018, pp. 2–3.
[37] Campbell, J.J.P., A.C. Mensch, G. Zeno, W.M. Campbell, R.P. Lippmann and D.J.
Weller-Fahy. 2015. Finding malicious cyber discussions in social media. MIT Lincoln
Laboratory Lexington United States.
[38] Steed, C.A., J. Beaver, P.L. Bogen II, M. Drouhard and J. Pyle. 2015. Text stream
trend analysis using multiscale visual analytics with applications to social media
systems. In: Conf. ACM IUI Workshop on Visual Text Analytics, Atlanta, GA, USA.
[39] Becheru, A., C. Bădică and M. Antonie. 2015. “Towards social data analytics for
smart tourism: A network science perspective. In: Workshop on Social Media and the
Web of Linked Data, 2015, pp. 35–48.
[40] Gerrard, D., M. Sykora and T. Jackson. 2017. Social media analytics in museums:
Extracting expressions of inspiration. Museum Manag. Curatorsh., 32(3): pp. 232–
250.
[41] Gerrard, D., T. Jackson and A. O’Brien. 2014. The epiphany project: Discovering the
intrinsic value of museums by analysing social media. Museums Web, 2014.
[42] Ntalianis, K., A. Kavoura, P. Tomaras and A. Drigas. 2015. Non-gatekeeping on social
media: A reputation monitoring approach and its application in tourism services. J.
Tour. Serv., 6(10).
[43] Timilsina, M., W. Khawaja, B. Davis, M. Taylor and C. Hayes. 2017. Social impact
assessment of scientist from mainstream news and weblogs. Soc. Netw. Anal. Min.,
7(1): 1–15.
[44] Drakopoulos, G., A. Kanavos, P. Mylonas and S. Sioutas. 2017. Defining and
evaluating Twitter influence metrics: A higher-order approach in Neo4j. Soc. Netw.
Anal. Min., 7(1): 1–14.
[45] Stothers, J.A.M. and A. Nguyen. 2020. Can Neo4j Replace PostgreSQL in Healthcare?
AMIA Summits Transl. Sci. Proc., vol. 2020, p. 646.
[46] Sharma, M., V.D. Sharma and M.M. Bundele. 2018. Performance analysis of RDBMS
and no SQL databases: PostgreSQL, MongoDB and Neo4j. In: 2018 3rd International
Conference and Workshops on Recent Advances and Innovations in Engineering
(ICRAIE), 2018, pp. 1–5.
CHAPTER
4
Combining and Working with Multiple
Social Networks on a Single Graph
Internet users use social networks for different purposes and share different kinds of data on each. When the same person holds accounts on several social networks, combining that user's data into a single graph can improve recommendation systems and the user experience. In this study, data on thousands of users from nine different social networks were collected and combined into a single graph. Anchors were created between nodes in graphs with different attributes using previously proposed node alignment and node similarity methods. The node similarity methods were extended to multiple social networks, increasing node matching success and making considerably more successful recommendation systems possible. In addition, a new alignment method for multiple social networks is proposed in this study. The success rates of the proposed methods were measured on real data collected from the social networks. Graphs from multiple social networks were merged into a single graph, and a broad user profile spanning more than one social network was created for each user.
4.1 Introduction
Social networks have been among the most significant technological innovations to enter the online world since the early 2000s, driven by the widespread use of the Internet and the increasing penetration of mobile devices. In social networks, users can create their own pages, visit the pages of the people they follow, and see their posts. Social networks are thus channels that allow users to produce content. In this
80 Graph Databases: Applications on Social Media Analytics and Smart Cities
model, the users can publish their material, create friend lists, follow lists, and
interact with other users.
People use different social networks according to their needs and purposes. For example, social networks like Instagram and Flickr focus on sharing photos, while others, called microblogs, carry posts limited to a specific number of characters. Each user can use one or more of these social networks as they wish. As users constantly generate data across a growing number of networks, the amount of data available on the Internet has increased significantly. Among the tens of microblogging services worldwide, Twitter has close to 2 million users, and nearly 500 million tweets are sent every day [31]. Facebook, one of the largest social networks, has more than 1.5 billion users and has significantly changed the social networking experience; it now has more users than most countries have inhabitants [26]. On Instagram, a Facebook company and another popular photo-sharing network, users have shared over 50 billion photos so far.
Social networks are represented by graphs with different features because users share different kinds of information on each. Therefore, when the graphs of different social networks are compared, few similarities are found; the graphs are almost entirely different.
With the increasing popularity of social networks, mapping users across social networks has recently become an important issue in academia and industry. In theory, cross-platform discovery allows a bird's-eye view of a user's behavior across all social networks. However, almost all studies that use social network data focus on a few specific social networks [18, 28]. Using limited data from a small number of social networks makes it difficult to assess the success of the proposed methods and prevents accurate results.
Identifying and matching the accounts of the same users across different social networks creates a graph with very detailed information about each user, on which data mining methods can work more successfully. Because users share different information on different social networks, one network may contain unique data not found in the others. By combining graphs, links and information that could not be obtained before become available, and thus more successful recommendations can be presented to users. The proposed method combines a user's information from different social networks into a single node in a single graph.
The chapter continues as follows. Section 4.1.1 reviews node similarity and topological alignment methods and presents previous studies in the literature. The proposed methods are described in detail in Section 4.3.1. Section 4.4.1 presents the datasets drawn from the selected social networks and the test results. Finally, Section 4.5 concludes the study.
4.1.1 Background
Combining different social networks has become an increasingly popular topic in recent years. Studies in this area generally fall into two groups: matching nodes by the similarity of their attributes, and aligning nodes according to the connections between nodes in the graph. Node similarity algorithms have therefore been a frequently studied subject.
Researchers generally do not need to represent more than one social network because they work with only one; data representation with multiple graphs is therefore rarely used. Each graph representing a social network is independent and often very different from the others. Even when several social networks are modeled as graphs, each graph is represented as a separate layer (and the same nodes in different graphs may be anchored to each other). A structure with three different graphs and the relationships between them is shown in Figure 4.1. Some users use all three social networks, some use only two, and some only one. In the network at the top, relationships run through groups; in the middle network, relationships are bidirectional; in the network at the bottom, relationships are one-way. In social networks with such different features, accounts belonging to the same user can be matched and connection points created.
Proper selection of attributes is vital for most studies on social networks. Attributes can be selected directly from the data presented in social networks or derived from it. For example, a user's registration date or number of friends is a direct attribute, whereas a language attribute inferred from the user's posts is a derived attribute. It is often preferable to use attributes as a set, specifying rule trees over more than one attribute, rather than using them alone. When computing the similarity between two nodes, tens of attributes can be used and analyzed, and it may not always be possible to match attribute values exactly. The information contained in users' actions or profiles is therefore often related through tagging. Some tagging examples are shown in Figure 4.2.
Social network users actively use free-text data entry fields, so they can enter the same information in different ways. As a result, it is often not possible to find similarities using classical exact text-matching methods; instead, methods that find partial similarities are needed for almost every field. Tagging users' interests and finding common locations are further problems that standard methods cannot solve successfully. Detecting and tagging the terms frequently used in a user's posts is critical for determining the user's area of interest, and extracting this information with an approach other than labeling is hardly possible given the data structure. Correct labeling is a factor that directly affects the success rate. Figure 4.2 illustrates how different users apply different tags and the relationships between the tag cloud and the tags.
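Since exact string matching fails on free-text fields, partial-similarity measures are needed. As a small illustrative sketch (using Python's standard `difflib`, not the study's own method; `partial_match` is an assumed helper name):

```python
from difflib import SequenceMatcher

def partial_match(a: str, b: str) -> float:
    """Return a similarity ratio in [0, 1] that tolerates partial overlap,
    unlike an exact-equality check."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()

# The same location entered in two different ways still scores highly,
# while unrelated values score low.
print(partial_match("New York City", "new york"))  # high, though not 1.0
print(partial_match("New York City", "Tokyo"))     # low
```

A ratio threshold can then decide whether two free-text fields refer to the same underlying value.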
As a result, the combined data can be used by all recommendation algorithms to increase their success rate.
There are many methods for analyzing the similarity of nodes in a network. Three of the most popular are the Adamic-Adar, Jaccard, and Common Neighbors methods [23, 32]. Adamic-Adar [1] has been used in many previous studies to predict new connections in social networks and has been reported to perform well on complex networks. Both Jaccard similarity and Adamic-Adar were originally developed to find the similarity between two web pages; since web pages contain unstructured data, like the social network data in the nodes, these methods were also used in this study. Another algorithm used in the study is Friend of a Friend (FOAF). Also known as the Common Neighborhood method [3], it is based on the idea that users with many common neighbors are more likely to connect in the future. Apart from these, there are other similarity criteria, such as those of Liben-Nowell and Kleinberg [6]. Some of these metrics focus on the length of the paths between a pair of nodes to predict new connections in a graph; similar nodes can be found by calculating the shortest paths between nodes [10]. The Random Walk with Restart (RWR) algorithm [8] searches for the shortest path by walking randomly on graphs using a Markov chain model. Like RWR, Fouss et al. [13] developed a model to find the most similar node pairs in a graph. SimRank [15], on the other hand, is based on the proposition that 'people related to similar people are similar', derived from the structural characteristics of the network.
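The three neighbor-based scores can be sketched in a few lines over adjacency sets; the toy graph `g` and the function names are illustrative, not from the study:

```python
import math

def common_neighbors(g, x, y):
    """Number of shared neighbors (the FOAF / Common Neighborhood score)."""
    return len(g[x] & g[y])

def jaccard(g, x, y):
    """Shared neighbors normalized by the size of the combined neighborhood."""
    union = g[x] | g[y]
    return len(g[x] & g[y]) / len(union) if union else 0.0

def adamic_adar(g, x, y):
    """Shared neighbors weighted inversely by how common each neighbor is."""
    return sum(1.0 / math.log(len(g[z])) for z in g[x] & g[y] if len(g[z]) > 1)

# A toy undirected graph as adjacency sets.
g = {
    "a": {"b", "c", "d"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"a", "c"},
}
print(common_neighbors(g, "b", "d"))  # 2: both know a and c
```

Adamic-Adar down-weights popular shared neighbors, which is why it tends to outperform raw common-neighbor counts on skewed-degree social graphs.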
Finally, some methods use attributes beyond the graph structure, such as messages between users, user ratings, co-authored posts, and common tags. Table 4.1 lists the names and some features of the algorithms used for this purpose; only the node similarity algorithms mentioned in this section are included. In addition, unlike existing studies, a vector-based similarity algorithm was also used in this study.
Most of the above methods operate on the link structure of a single social network. In this study, some of them were used but modified for multiple graphs. In their proposed method for finding similarity, Adamic and Adar [1, 2] identified four sources of information about a user's website: page text, in-links, out-links, and mailing lists. Three data groups were used in the present study, namely user-to-user connections, user-related data, and shares; however, since the connections between users are bidirectional, the number of groups can be considered four.
N-grams are frequently used in computational science and probability theory to find n contiguous sequences in a given text; for written texts, the n-gram elements are usually words [7]. Web-based unstructured systems generally contain text, so n-grams are popular for measuring similarity or proximity in these texts.
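A character n-gram similarity of the kind described can be sketched as follows (illustrative helper names; the study does not specify its exact implementation):

```python
def ngrams(text, n=2):
    """Set of n contiguous character sequences in the text."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def ngram_similarity(a, b, n=2):
    """Jaccard similarity over character n-grams; tolerant of small edits."""
    ga, gb = ngrams(a.lower(), n), ngrams(b.lower(), n)
    return len(ga & gb) / len(ga | gb) if ga | gb else 0.0

# Usernames that differ by a single character still score well above zero.
print(ngram_similarity("john_doe", "johndoe"))
```

Word-level n-grams work the same way with token tuples instead of character slices.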
Evaluating the similarity between two nodes in a graph topology is a longstanding problem. No single similarity measure can quickly capture the similarity of data based on user behavior, and applying such methods directly to large graphs brings further problems. An effective and scalable link-based similarity model is therefore needed.
SimRank is a simple, intuitive, general-purpose similarity quantification method that can be applied to graphs. It analyzes the structural context in which objects occur, scoring object-to-object similarity based on the objects' relationships with other objects. For this reason, SimRank can also be used in areas where partial relationships between objects are required.
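A minimal sketch of the SimRank iteration, assuming a small directed graph given as in-neighbor sets (the function and toy graph are illustrative, not the study's implementation):

```python
def simrank(g, c=0.8, iters=10):
    """Iteratively compute SimRank scores: two nodes are similar if their
    in-neighbors are similar. g maps node -> set of in-neighbors."""
    nodes = list(g)
    sim = {(a, b): 1.0 if a == b else 0.0 for a in nodes for b in nodes}
    for _ in range(iters):
        new = {}
        for a in nodes:
            for b in nodes:
                if a == b:
                    new[(a, b)] = 1.0
                elif g[a] and g[b]:
                    total = sum(sim[(i, j)] for i in g[a] for j in g[b])
                    new[(a, b)] = c * total / (len(g[a]) * len(g[b]))
                else:
                    new[(a, b)] = 0.0
        sim = new
    return sim

# Toy graph: u and v are both pointed to by w, so they become similar.
g = {"u": {"w"}, "v": {"w"}, "w": set()}
sim = simrank(g)
print(round(sim[("u", "v")], 2))  # 0.8: their only in-neighbors coincide
```

The naive all-pairs iteration is O(n² · d²) per pass, which is why the chapter stresses the need for scalable variants on large graphs.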
For recommendation systems, the known similarities between items and the similarities between users may be close to each other. FOAF is thus a rational method for describing people, their activities, and their relationships with other people and objects. FOAF profiles can be used, for example, to find all people living in Europe and to list the most recognizable people among their friends. Each profile must have a unique identifier used to identify these relationships. A similarity matrix is used between feature pairs; similar nodes can also be found by using machine learning algorithms with the vector space model.
Using the vector space model to find matching nodes across social networks provides many advantages. Since the vector space model supports many similarity algorithms, especially cosine similarity, it is suitable for detecting small changes in values and for letter-based analysis, rather than just computing an intersection set.
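As a hedged sketch, cosine similarity over character-frequency vectors (an assumed, illustrative profile representation, not necessarily the study's exact one) shows how small spelling changes lower the score gradually instead of failing an exact match:

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse attribute vectors (dicts)."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def char_vector(s):
    """Letter-frequency vector of a string, for letter-based analysis."""
    vec = {}
    for ch in s.lower():
        vec[ch] = vec.get(ch, 0) + 1
    return vec

# Nearly identical usernames keep a high score despite the extra character.
print(round(cosine(char_vector("jane.doe"), char_vector("janedoe")), 3))
```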
Figure 4.3. Vector-based user profile and links (Liu et al. 2016).
4.3.1 Methodology
This part of the study first discusses anchor node methods; the anchor node method, especially as applied to social networks, is described in detail. The section also explains how the node similarity and topological similarity algorithms are used. This study proposes an improved method based on the anchor node approach. Although anchor nodes have been used in social networks before, this study proposes a method that applies them to many more social networks. While previous studies using anchors focused only on node alignment, this study also considers feature similarity.
Anchor nodes are matched with other candidate anchor nodes in the graph. The proposed methods aim to align the networks and anchor the same users across networks, as shown in Figure 4.4. However, the networks may not all contain connections between the same users, and some nodes may be missing or redundant.
An interaction score calculation process is proposed to compute scores between nodes, using more than ten attributes across nine different social networks. Based on network-specific tests, the interaction intensity score is normalized and bounded between 1 and 2. Topological distance calculations take into account the lengths of the paths between nodes, so the weighted topological calculations also include the scores of the routes between nodes. In social networks, not every connected user is equally close; in the weighted methods, a connection's weight is high where the users' relationship is intense. In our study, link weights were measured by the number of common friends between the profiles represented by the two nodes, and the mutual friend score was likewise normalized to between 1 and 2. The formulas used to determine the interaction density between two nodes are formulas 4.1, 4.2, and 4.3: formula 4.1 gives the path-based distance between two nodes, formula 4.2 the similarity over common attributes, and formula 4.3 the interaction score after normalization.
U(x, y) = Σ_{L ∈ All links} (Used link total points / Used link count)    (4.1)

Ep(x, y) = Σ_{Interaction ways} b(x, y) · U(x, y) · (dT / sd)    (4.2)

Interaction Point = (Ep(x, y) − Ep(x, y)_min) / (Ep(x, y)_max − Ep(x, y)_min) + 1    (4.3)
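Formula 4.3's min-max normalization into the [1, 2] range can be sketched as follows; the `normalize_interaction` helper and the toy scores are illustrative assumptions:

```python
def normalize_interaction(scores):
    """Min-max normalize raw interaction scores Ep(x, y) into [1, 2],
    mirroring formula 4.3: (Ep - Ep_min) / (Ep_max - Ep_min) + 1."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {pair: 1.0 for pair in scores}
    return {pair: (ep - lo) / (hi - lo) + 1 for pair, ep in scores.items()}

raw = {("a", "b"): 4.0, ("a", "c"): 10.0, ("b", "c"): 7.0}
print(normalize_interaction(raw))
# ("a", "b") -> 1.0, ("a", "c") -> 2.0, ("b", "c") -> 1.5
```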
A filter is applied for the small number of users who interact unusually frequently and would otherwise distort the average; these users are excluded from the average calculation. In addition, so as not to disturb the normalization, the score between a pair of nodes with such unusual behavior may not exceed ten times the average. Because usage patterns of social networks change over the years, the average for each social network has to be recalculated at regular intervals.
Attribute Method
User name Cosine Similarity Method
User name N-Gram Similarity Method
Location Euclidean Location Similarity Method
Popularity Intersection Method
Using language SimRank
Active Time Zone Intersection Method
Interest Tags Word N-Gram Similarity Method
Personal Statement TF-IDF
In this study, each attribute was scored out of 1 point, so all feature similarity methods were scaled to range from 0 to 1. Setting a single threshold value for all social networks proved inappropriate, because some social networks provide much more detailed information for certain attributes, which lowers the similarity rate. The most common example is location: Instagram and Twitter yield city-level locations, while Meetup and Foursquare yield neighborhood-level locations. The similarity threshold was therefore determined separately for each social network. Separate tests were designed for this purpose: nodes scoring just above a low threshold were checked manually, and the threshold for each network was set by maximizing the F-measure. The calculation is then performed between each node and all candidate nodes; node pairs scoring above the threshold are matched as anchor nodes and combined.
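The per-network threshold search described above (manually verified ground truth, F-measure maximized) might be sketched like this; the function name and the toy data are illustrative, not from the study:

```python
def best_threshold(scored_pairs, true_matches):
    """Sweep candidate thresholds and keep the one maximizing F-measure.
    scored_pairs: {(node_a, node_b): similarity}; true_matches: set of pairs."""
    best_t, best_f = 0.0, -1.0
    for t in sorted(set(scored_pairs.values())):
        predicted = {p for p, s in scored_pairs.items() if s >= t}
        tp = len(predicted & true_matches)
        fp = len(predicted - true_matches)
        fn = len(true_matches - predicted)
        if tp == 0:
            continue
        prec, rec = tp / (tp + fp), tp / (tp + fn)
        f = 2 * prec * rec / (prec + rec)
        if f > best_f:
            best_t, best_f = t, f
    return best_t, best_f

scores = {("u1", "v1"): 0.9, ("u2", "v2"): 0.7, ("u3", "v9"): 0.6, ("u4", "v4"): 0.3}
truth = {("u1", "v1"), ("u2", "v2"), ("u4", "v4")}
t, f = best_threshold(scores, truth)
```

Running the sweep once per social network gives each network its own threshold, as the chapter prescribes.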
The method can be applied whenever more than two social networks are combined; with only two, applying it would create unnecessary complexity, since a two-dimensional coordinate system is sufficient. Applying the proposed method to all graphs in all collected data would require a tremendous amount of processing time, and if the threshold is set much smaller than its optimal value, the method may enter an infinite loop. The proposed method therefore has to contain many assumptions, filters, and rules. Some of these rules are: the nodes to be aligned can be at most one level away from the primary anchor nodes; up to 100 nodes can be aligned in a single iteration; and nodes to be aligned must have more than twice the average interaction amount of all users connected to the anchor node. With these rules, the data set is reduced, and only candidate nodes that can actually match are considered.
4.4.1 Results
In this study, nine different social networks were selected to test the proposed method. The data was collected with a program developed by the authors, which both crawled data with a spider and used the APIs offered by the social networks; it covers the years 2018 to 2020. To apply the proposed method, a limited but relevant data set was chosen: the profiles of researchers and students at the university where the study was conducted, together with their friends up to the second level. The social networks are Twitter (T), Instagram (I), Flickr (F), Meetup (M), LinkedIn (L), Pinterest (P), Reddit (R), Foursquare (F), and YouTube (Y). Besides selecting the social networks and collecting the data, it is very important for the success of the study to determine which attribute to use in each social network, how that attribute is represented in the other networks, and how the attributes are matched. Table 4.3 presents the attributes used and their availability in the social networks: 'D' means derived, '+' that the data is available, and '-' that it is not; 'T' stands for Text, 'P' for Photo, 'O' for Geo, 'V' for Video, 'OWC' for One-Way Connection, 'TWC' for Two-Way Connection, and 'G' for Group.
All social networks have usernames that users use to introduce themselves. In Meetup and Foursquare, which are geo-based social networks, the success rates of location determination are very high. In all other networks, a location may not be obtained because it is inferred from content; for such cases, a confidence value was applied to location determination. If the candidate locations score below this value
Table 4.3. Attributes used in the social networks and their availability
Feature/SN T I F M L P R F Y
Main Feature T P P G T P T O V
User name (Real) + + + + + + + + -
User name (Register) + + + + + + + + +
Location + + + + + D D + D
Popularity + + D - + - - + +
Language D D D D D D D D D
Active Time Zone + + + + + - + - +
Connection Type OWC OWC OWC G TWC - - OWC -
Interest Tags D D + + D + D D D
Other Social Network Connection + - + + - - + - -
Date of Registration + - + - - - + - -
or if two locations score very close in points but are geographically far apart, no location was assigned to that user in the relevant social network. In most social networks, popularity is determined from the numbers of followers and followees. For Meetup, Pinterest, and Reddit these numbers are uncertain, so all users whose popularity could not be determined were treated as average. The languages used were determined from the content shared on each network and, where available, from the users' self-descriptions. The Active Time Zone gives the time of day during which the user is active on the social network: the 24-hour day was divided into six equal parts, and the part in which the user was most active was determined. The interaction types of the social networks are presented in Table 4.4.
Table 4.4. Interaction type in social networks
Feature/SN T I F M L P R F Y
Like + + + - + + + +
Retweet – Repost + T - - - T - - -
Comment + + + - + + + - +
Mention + + + - - + - - +
Favorite + - - + - - + - -
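The Active Time Zone computation described above, splitting the 24-hour day into six equal four-hour parts, can be sketched as follows; `active_part_of_day` is an illustrative name, not from the chapter:

```python
def active_part_of_day(post_hours):
    """Split the 24-hour day into six equal 4-hour parts and return the
    part (0-5) in which the user posts most often."""
    counts = [0] * 6
    for h in post_hours:
        counts[h // 4] += 1  # hour 0-3 -> part 0, ..., hour 20-23 -> part 5
    return counts.index(max(counts))

# Posting hours clustered in the evening fall in part 5 (20:00-24:00).
print(active_part_of_day([21, 22, 23, 9, 20, 22]))  # 5
```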
Data from the social networks are kept in a relational database; when a calculation is made, the data is converted into graphs with dynamically linked lists on an in-memory data grid platform. After correcting and connecting the graphs, the calculations were performed. All tests were evaluated on four metrics: Precision (P), Accuracy (A), Recall (R), and F-Measure (F). The four evaluation metrics applied are given in formula 4.5.
P = tp / (tp + fp),   R = tp / (tp + fn),   A = (tp + tn) / (tp + tn + fp + fn),   Fm = 2PR / (P + R)    (4.5)
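A sketch of formula 4.5 in code; the confusion-matrix counts below are arbitrary illustrative values (chosen so the scores land near the first row of the comparison table), not the study's actual counts:

```python
def evaluate(tp, fp, tn, fn):
    """Precision, Recall, Accuracy and F-measure as in formula 4.5."""
    p = tp / (tp + fp)
    r = tp / (tp + fn)
    a = (tp + tn) / (tp + tn + fp + fn)
    f = 2 * p * r / (p + r)
    return p, r, a, f

# Illustrative counts only.
p, r, a, f = evaluate(tp=78, fp=5, tn=10, fn=22)
```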
Method name A P R F
Our Topological Alignment Method 0.79 0.94 0.78 0.85
Liu – Cheung [22] 0.60 0.85 0.62 0.71
Hydra [24] 0.60 0.71 0.67 0.69
Wu-Chien [33] 0.60 0.78 0.58 0.67
Although this test is among the more reliable in terms of success rate, which nodes are removed affects its outcome. In this test, anchor removal begins with anchors at the second degree or higher from the same node; the node-finding process is then more successful, since the anchors that directly affect the node are not removed. The results of this test with different anchor removal options are shown in Table 4.6.
According to Table 4.6, success rates increase when only a few anchors are removed. This is because, after all the nodes have been matched with a very high success rate, a removed anchor does not disturb the general structure, and the method can easily find the missing anchor.
Collecting limited data from two different parts of the world does not give fully accurate results about the study's success. The obvious remedy would be to collect all the data in each network, but this is impossible due to data security and the API limitations of social networks. For this reason, the data set was selected by applying filters to find profiles with similar properties, so that the datasets drawn from different social networks were close to each other. The nature of the data used may therefore have positively or negatively affected the success rates.
Although the exact boundaries of confidentiality and usage agreements for bot-based data extraction are not well defined, only the profiles of volunteer users and open profiles were used. While the data of volunteer users was being collected, consent declarations were obtained describing the purpose of the data, how long it would be kept, and what would be done with it after use. All collected data was anonymized.
4.5 Conclusion
This study proposed novel and inclusive methods for node similarity and alignment. With these methods, social networks with different attributes and properties, each represented by a different graph, were transformed into interconnected graphs. In addition, a new method for node alignment across multiple social networks was proposed. The study showed that data from multiple social networks can be combined into a single graph through links, which increases the applicability of data mining algorithms across networks. This new graph, built from data from different social networks, paves the way for more successful recommendation algorithms and for alternative data mining algorithms.
The data collected from the nine social networks were evaluated in different categories, and analyses and results were presented, including feature analysis and comparison, which features can be matched, and the compatibility of each social network with the others.
A side contribution of the study is the opportunity to gather information about users who have profiles in more than one social network onto a single node/graph. This enriched profile also covers the characteristics of the user's connections in the different networks. Thus, the requesting user can benefit more comprehensively from the services social networks provide, and problems such as cold start can be mitigated. The success of the proposed methods was observed in more than 500 tests on different social networks, each measured with four criteria frequently used in social networking studies and presented in summary.
The tests also revealed that the proposed method is affected by the choice of starting point and by changes in the network, so the result can vary with the starting point. This is a factor that reduces confidence in the proposed method. In addition, the proposed method has high computational complexity and needs far more memory than most similar algorithms.
Combining and Working with Multiple Social Networks on a Single Graph 95
Few comparable studies exist against which this work can be measured.
Furthermore, the methods and datasets used by previous researchers differ from
those of the method proposed here. For these reasons, a thorough comparison of
the success of the study could not be made.
A more advanced model could be proposed in future studies using features not
covered here. A real-time method could also be developed using parallel and
distributed processing and high-performance computing. In addition, higher success
could be achieved by using a self-feeding neural network for node similarity. The
change in users' usage habits and personal interests over the years is a sociological
phenomenon; taking the evolution of social networks into account by adding a
time metric could therefore further increase the method's success.
References
[1] Adamic, L. and E. Adar. 2005. How to search a social network. Social Networks,
27: 187–203.
[2] Adamic, L.A. and E. Adar. 2003. Friends and neighbors on the Web. Soc Networks,
25: 211–230.
[3] Aleman-Meza, B., M. Nagarajan, C. Ramakrishnan, L. Ding, P. Kolari, A.P. Sheth,
I.B. Arpinar, A. Joshi, and T. Finin. 2006. Semantic analytics on social networks:
Experiences in addressing the problem of conflict of interest detection. In: Proceedings
of the 15th International Conference on World Wide Web.
[4] Antonellis, I., H. Garcia-Molina and C.C. Chang. 2008. Simrank++: Query rewriting
through link analysis of the click graph. Proc. VLDB Endow.
[5] Berlingerio, M., D. Koutra, T. Eliassi-Rad and C. Faloutsos. 2012. NetSimile: A
scalable approach to size-independent network similarity. ArXiv abs/1209.2684.
[6] Blondel, V.D., A. Gajardo, M. Heymans, P. Senellart and P. Van Dooren. 2004. A
measure of similarity between graph vertices: Applications to synonym extraction and
web searching. SIAM Review, 46(4): 647–666.
[7] Broder, A.Z. 1997. Syntactic clustering of the Web. Comput. Networks.
[8] Cai, B., H. Wang, H. Zheng and H. Wang. 2011. An improved random walk based
clustering algorithm for community detection in complex networks. In: Conference
Proceedings – IEEE International Conference on Systems, Man and Cybernetics, pp.
2162–2167.
[9] Cheng, J., X. Su, H. Yang, L. Li, J. Zhang, S. Zhao and X. Chen. 2019. Neighbor
similarity based agglomerative method for community detection in networks.
Complexity, 2019: 1–16.
[10] Cormen, T.H., C.E. Leiserson and R.L. Rivest. 2001. Introduction to Algorithms.
Second Edition, MIT Press.
[11] Faisal, F.E., H. Zhao and T. Milenkovic. 2015. Global network alignment in the context
of aging. Proceedings of the 5th ACM Conference on Bioinformatics, Computational
Biology, and Health Informatics, pp. 580–580.
96 Graph Databases: Applications on Social Media Analytics and Smart Cities
[28] Shelke, S. and V. Attar. 2019. Source detection of rumor in social network – A review.
Online Soc Networks Media, 9: 30–42.
[29] Singh, R., J. Xu and B. Berger. 2007. Pairwise global alignment of protein interaction
networks by matching neighborhood topology. In: Lecture Notes in Computer Science
(Including Subseries Lecture Notes in Artificial Intelligence and Lecture Notes in
Bioinformatics).
[30] Sun, Y., J. Crawford, J. Tang and T.M. Milenković. 2014. Simultaneous optimization
of both node and edge conservation in network alignment via WAVE. In: Algorithms
in Bioinformatics: 15th International Workshop, WABI 2014, 16–39. Springer Berlin
Heidelberg.
[31] Twitter – Statistics & Facts | Statista [WWW Document] (n.d.). URL https://www.
statista.com/topics/737/twitter/#topicHeader__wrapper (accessed 3.3.22).
[32] Wang, J. and Y. Dong. 2020. Measurement of text similarity: A survey. Information,
11: 421.
[33] Wu, S.H., H.H. Chien, K.H. Lin and P.S. Yu. 2014. Learning the consistent behavior of
common users for target node prediction across social networks. In: 31st International
Conference on Machine Learning, ICML 2014.
[34] Yang, S.J.H., J. Zhang and I.Y.L. Chen. 2007. Web 2.0 services for identifying
communities of practice through social networks. In: Proceedings – 2007 IEEE
International Conference on Services Computing, SCC 2007, pp. 130–137.
[35] Yu, W., X. Lin, W. Zhang, J. Pei and J.A. McCann. 2019. SimRank*: Effective and
scalable pairwise similarity search based on graph topology. VLDB J, 28: 401–426.
[36] Zhang, J. and P.S. Yu. 2015. Integrated anchor and social link predictions across social
networks. In: IJCAI International Joint Conference on Artificial Intelligence.
[37] Zhang, M., H. Hu, Z. He, L. Gao and L. Sun. 2015. A comprehensive structural-based
similarity measure in directed graphs. Neurocomputing, 167: 147–157.
[38] Zhang, X., H. Yu, C. Zhang and X. Liu. 2008. An Improved Weighted HITS
Algorithm Based on Similarity and Popularity, pp. 477–480. Institute of Electrical
and Electronics Engineers (IEEE).
[39] Zhao, P., J. Han and Y. Sun. 2009. P-Rank: A comprehensive structural similarity
measure over information networks. In: International Conference on Information and
Knowledge Management, Proceedings.
[40] Zhou, Y., Y. Deng, J. Xie and L.T. Yang. 2018. EPAS: A sampling based similarity
identification algorithm for the cloud. IEEE Trans Cloud Comput, 6: 720–733.
CHAPTER
5
Child Influencers on YouTube: From
Collection to Overlapping Community
Detection
Influencers are gaining a growing impact on the social life and economic decisions
of the web population. A special group are child influencers, carrying responsibility
and popularity unusual for their young age. However, little research data exists
regarding child influencers and their follower communities. Furthermore, social
networks continue to restrict independent data collection, making the collection
of data sets for analysis difficult. This chapter presents an exemplary approach
that starts with the collection of a large amount of data from YouTube while
respecting its access restrictions. Automatic scripts targeting child influencers
and their communities on YouTube were created, and the collected data was stored
in the graph database ArangoDB. The data was then analyzed with overlapping
community detection algorithms and centrality measures to better understand the
social and economic impact of child influencers in different cultures. Using the
authors' open source WebOCD framework, community detection revealed that
family channels and single child influencer channels each form big communities,
with a divide between the two. The network collected during the research contains
72,577 channels and 2,025,879 edges, with 388 confirmed child influencers. The
collection scripts, the software and the data set in the database are freely available
for further use in education and research.
5.1 Introduction
As social networks continue to grow, the impact of influencers on people's lives
increases. Influencers are people with a strong presence within social networks like
YouTube, Instagram or Twitter who are using their reputation, measured in followers,
to advertise for products and lifestyles [4]. Marketing campaigns have recently made
heavy use of influencers, which has raised a lot of interest in algorithms for identifying
actual and potential influencers in social network data sets, as well as in measures for
the impact of influencers on other persons in social networks [10, 22]. Still, socio-economic
research and analysis on the one hand and advances in mathematical analysis and
algorithmic support of influence maximization [5, 28] on the other are only starting
to grow together.
With this, non-adult or child influencers are also gaining importance, making them
an interesting target for economic analysis, but also for social debate, as their young
age raises concerns within the public. From the perspective of marketing, child
influencers may reach markets already lost to advertising in traditional media like
print and television. From the perspective of their parents, they generate anything
from additional family income up to a fortune. At the same time, laws for child
protection and against child labor remain in place.
However, child influencers have rarely been examined so far, as little recent
social network data exists to base such an examination on. One reason for this may
be that the phenomenon is quite new and evidence is known mostly on an anecdotal
basis. Moreover, child influencers are spread over a large number of social networks,
making it difficult to collect and analyze data in all of them. Further challenges can
be attributed to social networks restricting access to their application programming
interfaces (APIs) and prohibiting data scraping, making collections more difficult
to build. Communities of child influencers provide a means to deduce their actual
impact range as well as the unifying interests of the audience members themselves.
For the understanding of child influencer communities, it is also important to identify
overlaps between different communities, e.g. to reveal hidden marketing activities
or to identify the work of social bots [24, 29]. To this end, influencers are first
identified and then analyzed with overlapping community detection algorithms; the
algorithms applied in this chapter are CliZZ, SLPA and LEMON [12, 13, 26].
Graph databases are a good companion for social media
analytics, in general. They provide direct storage of graph data ideally suited for social
networks with a lot of augmentation tools for later analysis and visualization. Graph
databases like Neo4J and ArangoDB offer a set of social media analytics tools like
centrality measures and (overlapping) community detection algorithms. In principle,
a limited set of analysis tasks can already be performed in the graph database itself,
but there is still a larger need for further support, which currently makes it necessary
to work with external tools. Yet, there are also some challenges in using graph
databases for this research. Many graph databases provide import and export tools
for relational data formats, e.g. CSV, as well as standard formats for graph data, e.g.
JSON, but overall support for researchers is still limited. Writing collection scripts
is cumbersome and not well supported in many graph databases. In this chapter, an
exemplary way to extract graph data from the social network YouTube by means
of its API is shown. Because of the abundance of data, preliminary filtering is applied
to obtain more child influencer channels. The data is stored in an ArangoDB database
to preserve its graph-like nature and allow for easier access. For the analysis, the
authors' Web-based overlapping community detection framework WebOCD was used.
This chapter provides the following contributions.
• A discussion of social network analysis tools in graph databases with a special
focus on (overlapping) community detection algorithms for influencer analysis.
The special domain of child influencers is introduced and related work is
presented.
• A practical study of collecting, storing, analyzing, and visualizing child
influencer data from YouTube with special respect to the graph database
ArangoDB, which serves as an example for modern multi-model databases.
• A detailed analysis of the child influencer data set collected from YouTube,
mostly performed with the authors' open source overlapping community
detection framework for the Web, WebOCD.
The rest of the chapter is organized as follows. In the next section, the state-of-
the-art in graph databases for social network analysis and in child influencer research
is described, as well as the overlapping community detection algorithms used in the
chapter. This is followed by Section 5.3 on the methodology used for collecting,
storing, analyzing and visualizing the data set. After this, the results are presented
and discussed, and the chapter concludes with a summary and an outlook on further work.
5.2 Background
5.2.1 Graph databases
Even though SQL databases still enjoy a lot of use, specialized NoSQL graph
databases have already taken a noticeable share in popularity*; see Figure 5.1. At
the time of this publication, they furthermore continue to gain importance†, as
shown in Figure 5.2. In the case of graph databases and social networks, this can
be attributed to the graph-like qualities of such networks: often, an account can be
represented by a node with additional information linked to other accounts through
certain relationships, e.g. being friends with someone. Using graph databases can
therefore require less adaptation to the structure of the social network. They can
also come with performance advantages, e.g. the graph database Neo4j was shown
to support storage up to the petabyte level and to offer optimized information search
[9], which may be highly relevant for larger networks. An important argument for
researchers is that networks like Facebook and YouTube already offer APIs with
graph-like access, and that graph databases support the often-needed graph traversal
as well as a set of algorithms for graph analysis. Through their NoSQL characteristics,
object structures can also change, which may become necessary if a network, for
example, introduces new attributes (e.g. birthdays) or removes certain ones from
public availability. Like most SQL databases, graph databases also provide web
interfaces for users.
* https://db-engines.com/en/ranking_trend
† https://db-engines.com/en/ranking_categories
Figure 5.1. A ranking of overall popularity for various SQL and NoSQL databases,
including ArangoDB and Neo4j, from 2013 to February 2022 taken from the db-engines
website. It is implied that graph databases have reached a noticeable share of
popularity, with MySQL databases however still being the most popular.
Figure 5.2. A ranking of growth in popularity for SQL and different NoSQL
database formats from 2013 to February 2022, taken from the db-engines website.
The ranking indicates a strong rise in popularity for graph databases.
In the following, the graph analysis algorithms offered by the graph databases Neo4j
and ArangoDB are described in more detail. Neo4j provides a set of centrality
measures such as PageRank [3], betweenness and degree centrality, and community
detection in the form of the Louvain method [2], as well as label propagation and
local clustering coefficient algorithms. ArangoDB also offers PageRank, betweenness
and additionally closeness centrality, as well as label propagation and SLPA [26] for
community detection. These algorithms can provide helpful initial insights and can
be heavily optimized for parallel processing thanks to per-vertex computation through
architectures like Pregel [16], added to ArangoDB by [8] (although it previously
offered them through plain AQL functions [15]). It is argued in this approach, with
‡ https://www.facebook.com/apps/site_scraping_tos_terms.php
§ https://www.tiktok.com/legal/tik-tok-developer-terms-of-service?lang=en
¶ https://www.facebook.com/help/instagram/581066165581870/
** https://github.com/benedekrozemberczki/awesome-community-detection
†† https://socnetv.org
‡‡ https://github.com/cytoscape/cy-community-detection
§§ https://networkit.github.io
¶¶ https://github.com/shobrook/communities
CliZZ
The CliZZ algorithm [12] locally expands communities, which also allows it to work
on larger graphs. It identifies influential nodes and forms communities around them.
For this, the algorithm defines its own measure, called leadership:
f(i) = Σ_{j=1, d_min(i,j) ≤ 3δ/2} e^(−d_min(i,j)/δ)
Here, δ is a chosen constant and d_min(i, j) is the shortest distance between two
nodes, where i is the leader candidate. Leadership thus captures how many nodes
can be reached within as small a distance as possible from a candidate, up to a
maximum distance of 3δ/2. For a leader node i it then holds that f(i) > f(j) for
all other nodes j that the candidate can reach. For every non-leader node i, a vector x^i
signifying community affiliation, with l entries, is initialized with randomly generated
normalized entries. From there, the membership matrix is calculated through a random-
walk-like process:
x_l^i(t+1) = (1 / (deg(i)+1)) · ( x_l^i(t) + Σ_j a_{i,j} x_l^j(t) ),   t ∈ [0, k]
with a_{i,j} being an entry in the graph's adjacency matrix. This is repeated up to a
specified k times, or until the infinity norm of the difference between the updated
membership matrix and the previous one is smaller than a specified precision factor p.
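The membership update can be sketched in a few lines. The following is an illustrative pure-Python reimplementation based on the description above, not the WebOCD code; the toy graph, the leader choice and all parameter values are made up for the example.

```python
def membership_update(adj, x, leaders, iterations=500, precision=1e-6):
    """Random-walk-like membership update as described for CliZZ.
    adj: adjacency matrix (lists of 0/1); x: one membership vector per node,
    with one entry per leader; leaders keep their one-hot vectors fixed."""
    n, l = len(x), len(x[0])
    deg = [sum(row) for row in adj]
    for _ in range(iterations):
        new_x = [list(row) for row in x]
        for i in range(n):
            if i in leaders:
                continue  # leader membership vectors stay fixed
            for c in range(l):
                s = x[i][c] + sum(adj[i][j] * x[j][c] for j in range(n))
                new_x[i][c] = s / (deg[i] + 1)
        # stop once the infinity norm of the change drops below the precision
        diff = max(abs(new_x[i][c] - x[i][c])
                   for i in range(n) for c in range(l))
        x = new_x
        if diff < precision:
            break
    return x

# Toy graph: two triangles joined by a bridge; leaders are node 0 and node 5.
adj = [[0, 1, 1, 0, 0, 0],
       [1, 0, 1, 0, 0, 0],
       [1, 1, 0, 1, 0, 0],
       [0, 0, 1, 0, 1, 1],
       [0, 0, 0, 1, 0, 1],
       [0, 0, 0, 1, 1, 0]]
x = [[1.0, 0.0]] + [[0.5, 0.5] for _ in range(4)] + [[0.0, 1.0]]
result = membership_update(adj, x, leaders={0, 5})
```

With fixed leader vectors, the iteration converges to a harmonic assignment: nodes in the first triangle end up leaning towards leader 0's community and nodes in the second towards leader 5's.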
LEMON
The final algorithm makes use of spectral clustering and is called LEMON [13].
It was chosen to detect communities centered around one or a few specific influencers
without looking at the whole graph. A spectra defines an l-dimensional subspace
that captures the closeness of the nodes of the community sub-graph. Initial seed
sets must be provided by one's own means.
Starting from the seed set, a sub-graph S is defined as its ground truth community
and iteratively expanded. The procedure initially takes a specified k steps of random
walk from the nodes of the seed set, resulting in a probability vector p0. From there,
a local spectra is generated in multiple steps.
Initially, Ās, a derivation from the normalized adjacency matrix As of the sub-graph
for the community, is calculated:

Ās = Ds^(−1/2) · (As + I) · Ds^(−1/2)

where Ds is the diagonal degree matrix of the sub-graph, Ds^(−1/2) is obtained by
taking the square root of all entries of Ds and then taking the inverse, and I is the
identity matrix.
The second step is to compute a Krylov matrix K_{l+1}(Ās, p0), i.e. a basis of l
normalized eigenvectors orthogonal to each other.
The third step is a k-loop: deriving an orthonormal basis of eigenvectors of the
matrix yields the first spectra V_{0,l}. By multiplying Ās with the spectra and taking
its orthonormal basis again, the next spectra V_{1,l} is created, and so on.
The result is the local spectra V_{k,l}, consisting of l random-walk probability
vectors. With it, the following linear problem is solved:

min ||y||₁
s.t. y = V_{k,l} x,
     y ≥ 0,
     y(S) ≥ 1
The solution vector y is a sparse vector in which each element signifies the
likelihood of the corresponding node being in the target community. From there, the
global conductance for this iteration is calculated and stored. If this global conductance
increases for the first time after a local minimum, the algorithm stops, as it has found
a reasonably accurate community; otherwise, it repeats. According to the authors, the
algorithm does not scale with graph size but with community size. This can make it
a lot faster than others in application.
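The first step of the spectra construction, the normalized matrix Ās = Ds^(−1/2)(As + I)Ds^(−1/2), can be illustrated on a toy sub-graph. This is a pure-Python sketch, not the LEMON implementation, and it assumes every node of the sub-graph has at least one neighbor (otherwise the degree inverse is undefined).

```python
import math

def normalized_adjacency(adj):
    """Compute A_bar = D^(-1/2) * (A + I) * D^(-1/2) for a sub-graph,
    as in the first step of the LEMON spectra construction."""
    n = len(adj)
    deg = [sum(row) for row in adj]                 # diagonal entries of D
    d_inv_sqrt = [1.0 / math.sqrt(d) for d in deg]  # sqrt entries, then invert
    a_plus_i = [[adj[i][j] + (1 if i == j else 0) for j in range(n)]
                for i in range(n)]
    return [[d_inv_sqrt[i] * a_plus_i[i][j] * d_inv_sqrt[j] for j in range(n)]
            for i in range(n)]

# Toy sub-graph: a triangle, where every node has degree 2,
# so every entry of A_bar becomes (1/sqrt(2)) * 1 * (1/sqrt(2)) = 0.5.
a_bar = normalized_adjacency([[0, 1, 1], [1, 0, 1], [1, 1, 0]])
```

Note that some formulations normalize by the degrees of A + I instead; the sketch follows the text above, which uses the degree matrix of the sub-graph itself.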
5.3 Methodology
For this approach, an ArangoDB database was first set up to store the collected
information. Collection itself was handled through a NodeJS script that periodically
collected channels and links, collecting less information for channels below a
5,000-subscriber threshold, filtering for potential child influencer candidates through
regular expressions, and relaying the collected data to the database in the process. A
channel's subscriptions/favorites/comments led the script to other channels. With the
help of WebOCD, centrality calculations were run to verify the possible child influencers
found by the script, followed by overlapping community detection through CliZZ, SLPA and LEMON.
5.3.1 Storage
Method description
As YouTube's data can be considered graph-like, with channels being nodes linked
by subscriptions, comments, etc., ArangoDB was chosen (in addition to its previous
affiliation with the chair) with a single-node setup, which was considered sufficient
for the collected amount. The structure of the objects in the database is largely
oriented on how objects are structured in the YouTube Data API, which already
returns JSON objects. Although the database is schema-less, sticking to a specific
structure can be beneficial for analysis and can speed up traversal, as ArangoDB
generates implicit schemas from its documents. A collection was created for channels
and one for each link type; to allow for graph representation, traversal and export,
the latter collections were labeled as edge collections.
The channel document is given in Figure 5.3. It can either contain just the statistics
and contentDetails parts or, additionally (if subscriberCount is greater than 5,000), the
channel snippet, status and topicDetails. The statistics contain the number of comments,
views and videos, and furthermore the number of subscribers. The contentDetails
contain a list of the channel's public playlist ids, and the snippet contains the channel
title, description and creation date. The status part finally holds information about
the channel's special properties. The edge documents are represented in Figure 5.4.
Comments contain their text, subscriptions their publishing date, and favorites the id
of the favorited video. The direction of an edge is understood here as pointing from
the channel the action was done to towards the one that carried the action out.
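To make the two document variants concrete, the following sketch builds them as plain Python dicts. The part names mirror the YouTube Data API parts named above (statistics, contentDetails, snippet, status, topicDetails); the helper functions and the exact attribute selection are our own illustration, not the original script's code.

```python
THRESHOLD = 5000  # subscriber threshold described in the text

def make_channel_doc(channel_id, statistics, content_details,
                     snippet=None, status=None, topic_details=None):
    """Channel documents come in two variants: small channels keep only
    statistics and contentDetails; channels above the subscriber threshold
    additionally keep snippet, status and topicDetails."""
    doc = {"_key": channel_id,            # ArangoDB document key
           "statistics": statistics,
           "contentDetails": content_details}
    if int(statistics.get("subscriberCount", 0)) > THRESHOLD:
        doc["snippet"] = snippet          # title, description, creation date
        doc["status"] = status            # special channel properties
        doc["topicDetails"] = topic_details
    return doc

def make_comment_edge(actor_id, target_id, text, video_id):
    """Edges point from the channel the action was done to (_from)
    towards the channel that carried the action out (_to)."""
    return {"_from": f"channels/{target_id}",
            "_to": f"channels/{actor_id}",
            "text": text,
            "videoId": video_id}

small = make_channel_doc("c1", {"subscriberCount": "120"}, {"playlists": []})
big = make_channel_doc("c2", {"subscriberCount": "80000"}, {"playlists": []},
                       snippet={"title": "Some Channel"}, status={},
                       topic_details={})
edge = make_comment_edge("c1", "c2", "Nice video!", "vid123")
```

Keeping both variants in one channel collection is what later forces the analysis code to check whether the richer attributes exist, as noted in the limitations below.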
Limitations
While sticking closely to the object format of the YouTube Data API arguably makes
collection and storage easier, objects may become more verbose than needed. For
instance, keeping separate parts, e.g. snippet, in every larger channel object may become
problematic for bigger data sets, and leaving attributes in the less refined returned state
may make them more difficult to understand. Finally, having two separate channel
objects not only means less available information for all channels below 5,000
subscribers, it also requires accounting for two different types of objects when using
the information, as a channel's description is not guaranteed to exist.
Figure 5.3. The channel object containing snippet, content, statistics, topic and status
information. The basic structure is similar to the channel object in the YouTube API;
a channel with below 5,000 subscribers will just hold statistics and contentDetails.
Figure 5.4. The comment, favorite and subscription objects: While the last just holds
its publishing date, the first two are related to a video and hold the corresponding id
as an attribute. Favorites additionally hold the video's tags, while comment objects
contain the comment text.
5.3.2 Collection
Method description
YouTube offers an API for developers to access specific information. Like most
networks, it has introduced (daily) quotas for requests made to the API, making the
fetching of friends or connected groups of people difficult. Likes were left out of
the collection as they are set to private by default and thus likely to be few. Collection
itself was realized via NodeJS. While the full source code is available on GitHub***,
a pseudo-code representation is given in Figure 5.5. NodeJS was chosen due to the
availability of libraries supporting client-server communication; furthermore,
ArangoDB as well as YouTube offer libraries for the language. Due to the quota set
by Google, the collectible information per day had a set limit, and a high collection
speed, as it could be achieved with lower-level code, was not an objective, as the
script would idle most of the time.
*** https://github.com/MaxKissgen/ytdapi_fetch
Figure 5.5. The scheduler from the script in pseudo code: The top channel from the queue
is chosen, starting with a set of seed channel ids. From there, a reg-ex check is made whether
the channel can be an influencer and then a potential child influencer. Only then is additional
channel information collected, along with YouTube comments. Otherwise, favorites are
collected and the channel is moved to the unlikelyChildQueue if it is a potential influencer.
Subscriptions are fetched for every type of channel.
The script maintains two queues with channel ids: the channelQueue is the initial
location for newly found channels and contains the seed channels at the start, while
the unlikelyChildQueue is a backup of channels with over 5,000 subscribers that are
unlikely to be child influencers.
The main loop starts with the scheduler taking a channel id from the
channelQueue, or, if that is empty, from the unlikelyChildQueue, and fetching its
statistics part. If the subscriberCount is greater than 5,000, the channel is likely an
influencer and the snippet is fetched. With it, two reg-ex checks are then made, first
to see if it is a music channel or an official presence of a personality from another
platform/medium, and then whether it is a potential child influencer. If the first is
true, the channel gets discarded, as such channels were found to cultivate big
audiences during testing and may act as a sink. Otherwise, the favorites of the
channel are fetched and it is moved into the unlikelyChildQueue. If the second is
true, comments for that channel are fetched alongside its subscriptions, and it is then
deleted from the queue. Only English- and German-speaking child influencers were
considered in this approach; the phrases and words were gathered from an initial
sample, and the reg-exes check for mentions of parents and immediate family, of
favorite activities such as toys and pretend play, and, as an important point, of the
child's own age. These expressions only represent the sample and not the child
influencer context as a whole; achieving the latter was considered out of scope.
Channels below 5,000 subscribers only retain the statistics part and are deleted
from the queue afterwards. Of the comments, subscriptions and favorites, not all
are fetched, as loading any additional page (of 50 objects) increases the quota usage.
The script is set up to consider additional mentions of channels in the queue and saves
the id of the last retrieved page for a channel. For each iteration, up to 250 comments,
100 subscriptions and only the first 50 favorites (as getting favorites is costly due to
having to fetch the related video first) of a channel are fetched. Channels and the
different edge objects are sent to the connected ArangoDB database after processing,
leaving their JSON structure mostly intact. At any point during the collection, the
quota set by YouTube can run out, and the scheduler will then idle until the next day,
Pacific time, when the quota is reset.
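The main loop just described can be condensed into the following sketch. The fetch functions are stubs standing in for YouTube Data API calls, and the regular expressions are toy placeholders, far simpler than the ones used in the study.

```python
import re
from collections import deque

SUBSCRIBER_THRESHOLD = 5000
# Placeholder patterns -- the actual expressions were gathered from an
# initial sample of channels and are considerably more elaborate.
SINK_RE = re.compile(r"official|music|vevo", re.IGNORECASE)
CHILD_RE = re.compile(r"years old|pretend play|my mom|my dad", re.IGNORECASE)

def classify_channels(seed_ids, fetch_statistics, fetch_snippet):
    """One pass of the scheduler: classify each channel as in Figure 5.5."""
    results = {}
    channel_queue = deque(seed_ids)
    while channel_queue:
        cid = channel_queue.popleft()
        stats = fetch_statistics(cid)
        if int(stats["subscriberCount"]) <= SUBSCRIBER_THRESHOLD:
            results[cid] = "below_threshold"    # keep statistics only
            continue
        snippet = fetch_snippet(cid)
        text = snippet["title"] + " " + snippet["description"]
        if SINK_RE.search(text):
            results[cid] = "discarded_sink"     # music/official channel
        elif CHILD_RE.search(text):
            results[cid] = "child_candidate"    # fetch comments + subscriptions
        else:
            # fetch favorites; the real script additionally re-queues these
            # channels in the unlikelyChildQueue for further result pages
            results[cid] = "unlikely_child"
    return results

DATA = {
    "kid":  ({"subscriberCount": "12000"},
             {"title": "Emma, 7 years old", "description": "toys and pretend play"}),
    "band": ({"subscriberCount": "900000"},
             {"title": "SomeBand Official Music", "description": ""}),
    "tiny": ({"subscriberCount": "120"}, None),
    "vlog": ({"subscriberCount": "40000"},
             {"title": "Daily travel vlog", "description": ""}),
}
res = classify_channels(DATA, lambda c: DATA[c][0], lambda c: DATA[c][1])
```

The sketch omits quota handling and result paging, which is where most of the real script's complexity lies.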
Limitations
The script implements snowball sampling [7] in that collected objects deliver the next
ones to be examined, and thus likely much more than one other channel each. This
results in a certain bias towards specific types of content (while allowing to potentially
find more child influencers) and in an exponentially increasing sample size, making the
script unable to run for an indefinite amount of time. The regular expressions also work
toward that bias, and even though bigger, non-child-influencer channels are potential
sinks, they still provide important and connecting parts of a network that are ignored
with this method. Furthermore, fetching only some links for some channels, and never
all of a certain type, can skew the apparent strength of bonds between them and others.
This, and the option for users to make their subscriptions etc. private/unavailable, may
cause some connections to be missed completely.
5.3.3 Analysis
Method description
After exporting the resulting data of the query as an XGMML file and uploading it
to WebOCD, a selection of centrality calculations are run to determine whether the
possible child influencers from the collection could be considered as such. WebOCD was
Limitations
While the selection criteria can provide a good way to define which channels can be
considered child influencers and which cannot, it is debatable whether they can cope
with all possible channels, possibly leaving out or wrongly including child influencers
with irregular but prominent appearances of children, and they cannot give an objective
definition. Implications of the community detection results on a small graph will likely
carry less weight for the bigger graph, let alone for the whole of YouTube. A greater
variation of algorithms, including non-overlapping ones, might furthermore deliver
more refined results.
5.4 Results
Upon termination of the collection script, 72,577 channels had been collected,
along with 2,025,879 edges: 1,176,617 subscriptions, 750,688 comments and
98,574 favorites. Of the collected channels, 12,627 were considered possible
influencers, i.e. they had over 5,000 subscribers, while the remaining objects had
reduced information due to being below that threshold. The complete results can
be found on GitHub†††.
5.4.1 Influencers
Discussion
The collection script found 1,335 channels for the English and 302 channels for the
German regular expression. Most of the English child influencer channels disclose
their location as the United States of America; further notable countries include Great
Britain, the Philippines and Australia. Other countries occur fewer than three times.
The German child influencer channels, with the exception of one Austrian and one
Swiss channel, were all from Germany.
The centrality detection results showed the probable child influencer channels
scoring high in centrality rank; for the top channels by Borda count of the centralities,
refer to Figure 5.6. In addition, channels added early ranked rather high in comparison
to others. Among the top channels, many are entertainment oriented, e.g. MrBeast and
Guava Juice. However, especially among the top 5 channels, there are family channels
or ones that feature children, such as The McClure Family, which were added early in
the collection.
††† https://github.com/MaxKissgen/UCISNC_thesis_results
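The Borda count used to combine the individual centrality rankings can be sketched as follows; the channel names and per-centrality rankings here are made up for illustration.

```python
def borda_count(rankings):
    """Combine several rankings (each ordered best-first) into one:
    in a ranking of n items, the item at position p earns n-1-p points,
    and items are sorted by their total points."""
    scores = {}
    for ranking in rankings:
        n = len(ranking)
        for pos, item in enumerate(ranking):
            scores[item] = scores.get(item, 0) + (n - 1 - pos)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical rankings of four channels under three centrality measures
degree    = ["ChannelA", "ChannelB", "ChannelC", "ChannelD"]
closeness = ["ChannelB", "ChannelA", "ChannelC", "ChannelD"]
pagerank  = ["ChannelA", "ChannelC", "ChannelB", "ChannelD"]
combined = borda_count([degree, closeness, pagerank])
```

Here ChannelA collects 3 + 2 + 3 = 8 points and tops the combined ranking, even though it is not first in every individual one; this robustness to single outlier rankings is the usual motivation for Borda aggregation.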
Out of the channels above 5,000 subscribers, child influencers were then selected
according to the criteria mentioned in Section 5.3. After this further selection,
388 channels remained, which are considered child influencers in this approach:
347 channels for the English regular expression and 41 for the German one.
Although most of the channels indeed speak the corresponding language, some are
included that speak a different language yet make use of English or German wording
respectively.
Comparing the number of possible child influencers found with the number
remaining after manual filtering, the pre-selection through regular expressions can
be considered a reasonable fit; however, the difference is large enough to suggest
that the expressions' accuracy can be improved, while the smaller number of German
channels found can additionally hint at more seed channels and a more verbose
regular expression being needed. That the script selected them as possible influencers
at all can be seen as accurate given the centrality results.
Limitations
The small number of child influencers possibly makes them less representative of
the child influencer spectrum on YouTube as a whole. This sentiment extends to the
number of possible child influencer channels found by the script. That there are just
English- and German-speaking channels makes the results likely to leave out
important findings, as parts of the world speaking different languages or highly
differing dialects are not considered. Even though the centrality rank results support
them being influencers, the apparent higher ranking of early added channels may have
distorted results in their favor, and with the centralities used, channels with very strong
influence through their content on however few other users are likely less favored than
those simply having a lot of connections and paths to others.
112 Graph Databases: Applications on Social Media Analytics and Smart Cities
Figure 5.7. The biggest connected component of the subgraph, with SLPA results on the left and CLiZZ on the right: The Piper Rockelle community is portrayed in red/dark-blue at the top left, consisting of other children doing pranks/challenges. The McClure Family community is in light-blue/red at the bottom right, with family channels. The German Mamiseelen community contains most German channels and is in the upper left corner in green/dark-green.
time, even though the influencer channel was. For Piper Rockelle, the algorithm was run with a minimum community size of 1,000 and a maximum size of 5,000. Alongside members similar to those in the previous results, a high number of family channels was now featured as well, raising the number to 1,611 significant members. Child influencer channels with a single child as the content producer, doing challenges and in part singing, however, had higher affiliation values. In addition, among the included family channels, more concentrated on challenges or skits in comparison to the McClure community, which featured mostly family vlog channels. For the German child influencer community run for Mamiseelen, the algorithm was run with a minimum size of 10 and a maximum size of 2,100. The community featured this time includes most German influencer channels, with 111 significant members. With lower affiliation, Family Fun and Team Tapia were included as well.
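For orientation, the SLPA family of algorithms can be sketched as speaker-listener label propagation with a post-processing threshold: every node keeps a memory of heard labels, and labels that occur often enough in a node's memory define its (possibly overlapping) community memberships. This is a minimal, generic sketch under those assumptions, not WebOCD's implementation; the toy graph, iteration count, and threshold are illustrative.

```python
import random
from collections import Counter, defaultdict

def slpa(adj, iterations=20, threshold=0.2, seed=42):
    """Minimal Speaker-Listener Label Propagation (SLPA) sketch.
    adj: undirected graph as {node: set(neighbors)}.
    Returns overlapping communities as {label: set(members)}."""
    rng = random.Random(seed)
    memory = {v: [v] for v in adj}  # each node starts with its own label
    for _ in range(iterations):
        for listener in rng.sample(list(adj), len(adj)):
            if not adj[listener]:
                continue
            # speaker rule: each neighbor utters a random label from memory
            heard = [rng.choice(memory[speaker]) for speaker in adj[listener]]
            # listener rule: adopt the most frequent label heard
            memory[listener].append(Counter(heard).most_common(1)[0][0])
    # post-processing: a label whose share of a node's memory reaches the
    # threshold makes that node a member of the label's community
    communities = defaultdict(set)
    for node, labels in memory.items():
        counts = Counter(labels)
        for label, count in counts.items():
            if count / len(labels) >= threshold:
                communities[label].add(node)
    return dict(communities)

# Two small clusters sharing node "c", so "c" may belong to both:
adj = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d", "e"},
       "d": {"c", "e"}, "e": {"c", "d"}}
communities = slpa(adj)
```

Minimum and maximum community sizes, as used in the runs above, would then simply filter the returned communities by `len(members)`.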
The results suggest that communities form around certain topic types of influencer channels, such as family-vlog ones, more than just around big channels, even though bigger influencers (in terms of direct followers) can still form their own communities or impact others. In consequence, it is implied that users who consume content from a child influencer channel centered around a specific topic are likely to consume content from, or be related to users who consume similar content from, other child influencer channels as well. A further finding is that, for English channels, there is a divide of interest between people who follow family channels and those who follow singular child influencers, such as Piper Rockelle, while this seems not to be the case for German channels.
Limitations
Overall, the implications of the community detection for YouTube as a whole are, as mentioned in Section 5.3, limited by the small number of community members and nodes compared to the complete graph. This is especially the case for the smaller communities, as the lack of members places high importance on the ones that do exist. This possibly resulted in nodes appearing isolated that are not isolated overall, and in less accurate communities. With regard to the collection process, it must further be said that the concentration on just child influencer channels may hide links between their followers, or between the channels themselves, through other types of channels, such as sport- or music-themed ones; together with the quota-reserving measures taken in Section 5.3, this possibly distorted the resulting community structures.
5.5 Conclusion
Social network research and graph databases are indeed good companions. This chapter demonstrated that one can use a graph database like ArangoDB to collect, store, analyze, and visualize a data set from a social network like YouTube. However, considerable effort is required, such as writing the collection scripts and utilizing an overlapping community detection framework as part of a not yet integrated tool chain. One can argue that graph databases should focus on storing graph data, but, similar to relational databases, tighter integration of processing tools should at least be discussed.
Also demonstrated was the use case of child influencer community detection as a means of interdisciplinary research between socio-economic research and computer science in the emerging field of computational social science [1, 11]. The work may lay the ground and serve as an example for further collaborative research in this area. Although the use case was described and discussed from a more practical and technical perspective, the impact on research tools usable by non-computer scientists is already obvious. Managing data collection with parameterized scripts allows researchers to address new collections with regular expressions for search terms. Storing data in managed databases frees researchers from handling versions of Excel sheets and plain data files. Computational tools configured and executed through a user-friendly Web interface spare researchers from installing and managing tools on their own computers. They can share and discuss results openly with colleagues worldwide, simply by exchanging URLs if they like.
Last but not least, some interesting results on child influencer communities were computed and visualized, indicating that communities form around child influencer channels with similar topics and that there is a potential divide between family channel communities and singular child influencer channel communities. Child influencers are still an emerging phenomenon and, as the numbers show, likely more so in the US than in Europe. One can clearly identify influencers in child influencer communities, but with lower numbers of members, results can become inaccurate.
This leads to a number of challenges in using graph databases as tools for social network analysis. Automation and integration are key for many labor-intensive and complicated steps in the research procedures. The more automation is possible, the more researchers will be able to use computational tools based on graph databases. Sharing and discussing results demands more platforms for storing and exchanging results among researchers. Open data and data infrastructures at the national and international level are key. Ideally, social network providers would offer better support for both research automation and result sharing. Since this is not very likely to happen, researchers must find ways to overcome the current lack of support. Open source for graph databases and the supporting infrastructure is a must in the future. ArangoDB is open source and has a lively community. WebOCD is still a research and teaching tool with a much more limited number of developers, but hopefully more researchers and developers will support WebOCD or similar platforms.
References
[1] Alvarez, R.M. (Ed.). 2016. Computational Social Science: Discovery and Prediction. Analytical Methods for Social Research. Cambridge University Press, Cambridge, United Kingdom, first edition.
[2] Blondel, V.D., J.-L. Guillaume, R. Lambiotte and E. Lefebvre. 2008. Fast unfolding of communities in large networks. J. Stat. Mech., 2008(10): P10008.
[3] Brin, S. and L. Page. 1998. The anatomy of a large-scale hypertextual web search engine. Computer Networks and ISDN Systems, pp. 107–117.
[4] Brown, D. and N. Hayes. 2008. Influencer Marketing. Routledge.
[5] Chen, W., Y. Wang and S. Yang. 2009. Efficient influence maximization in social networks. In: Elder, J. (Ed.). Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, p. 199. ACM, New York, NY.
[6] de Veirman, M., S. de Jans, E. van den Abeele and L. Hudders. 2020. Unravelling
the power of social media influencers: A qualitative study on teenage influencers
as commercial content creators on social media. In: Goanta, C. and Ranchordás,
S. (Eds.). The Regulation of Social Media Influencers. pp. 126–166. Edward Elgar
Publishing.
[7] Goodman, L.A. 1961. Snowball sampling. The Annals of Mathematical Statistics,
32(1): 148–170.
[8] Grätzer, S. 2017. Implementing Pregel for a Multi Model Database. Master's thesis, RWTH Aachen University.
[9] Guia, J., V. Goncalves Soares and J. Bernardino. 2017. Graph databases: Neo4j
analysis. In: Proceedings of the 19th International Conference on Enterprise
Information Systems, pp. 351–356. SCITEPRESS – Science and Technology
Publications.
[10] Kiss, C. and M. Bichler. 2008. Identification of influencers – Measuring influence in
customer networks. Decision Support Systems, 46(1): 233–253.
[11] Lazer, D., A. Pentland, L.A. Adamic, S. Aral, A.-L. Barabási, D. Brewer, N.A.
Christakis, N.S. Contractor, J. Fowler, M. Gutmann, T. Jebara, G. King, M. Macy,
D. Roy and M. van Alstyne. 2009. Social science: Computational Social Science.
Science, 323(5915): 721–723. New York, N.Y.
[12] Li, H.-J., J. Zhang, Z.-P. Liu, L. Chen and X.-S. Zhang. 2012. Identifying overlapping
communities in social networks using multiscale local information expansion. The
European Physical Journal B, 85(6).
[13] Li, Y., K. He, K. Kloster, D. Bindel and J. Hopcroft. 2018. Local spectral clustering
for overlapping community detection. ACM Transactions on Knowledge Discovery
from Data, 12(2): 1–27.
[14] López-Villafranca, P. and S. Olmedo-Salar. 2019. Menores en YouTube, ¿ocio o negocio? Análisis de casos en España y EUA. El Profesional de la Información, 28(5).
[15] Dohmen, L. 2012. Algorithms for Large Networks in the NoSQL Database ArangoDB.
Bachelor’s thesis, RWTH Aachen, Aachen.
[16] Malewicz, G., M.H. Austern, A.J.C. Bik, J.C. Dehnert, I. Horn, N. Leiser and G.
Czajkowski. 2010. Pregel: A system for large-scale graph processing. In: Elmagarmid,
A.K. and Agrawal, D. (Eds.). Proceedings of the ACM SIGMOD International
Conference on Management of Data (SIGMOD 2010), SIGMOD ’10, pp. 135–146.
[17] Gorshkova, N., L. Robaina-Calderín and J.D. Martín-Santana. 2020. Native advertising: Ethical aspects of kid influencers on YouTube. In: Proceedings of ETHICOMP 2020, pp. 169–171.
[18] Shahabi Sani, N., M. Manthouri and F. Farivar. 2020. A multi-objective ant colony
optimization algorithm for community detection in complex networks. Journal of
Ambient Intelligence and Humanized Computing, 11(1): 5–21.
[19] Shahriari, M., S. Krott and R. Klamma. 2015. WebOCD. In: Proceedings of the 15th International Conference on Knowledge Technologies and Data-driven Business, pp. 1–4. ACM.
6
Managing Smart City Linked Data
with Graph Databases: An Integrative
Literature Review
The smart city paradigm is about a better life for citizens based on advancements
in information and communication technologies, like the Internet of Things, social
media, and big data mining. A smart city is a linked system with a high degree of
complexity that produces a vast amount of data, carrying a very large number of
connections. Graph databases yield new opportunities for the efficient organization
and management of such complex networks. They introduce advantages in
performance, flexibility, and agility in contrast to traditional, relational databases.
Managing smart city-linked data with graph databases is a fast-emerging and
dynamic topic, highlighting the need for an integrative literature review. This work
aims to review, critique, and synthesize research attempts integrating the concepts
of smart cities, social media, and graph databases. The insights gained through
a detailed and critical review of the related work show that graph databases are
suitable for all layers of smart city applications. These relate to social systems including people, commerce, culture, and policies, appearing as user-generated content (discussions and topics) in social media. Graph databases are an efficient
tool for managing the high density and interconnectivity that characterizes
smart cities.
6.1 Introduction
Cities are critical for our future, as most people live there. They are complex
systems that have been constructed through a long process of many small actions
[1]. Thus, the city operation is challenging, and new methods emerge to manage
and gain insights from massive amounts of generated data. The smart city concept
aims to make inhabitants’ lives more sustainable, friendly, green, and secure. It relies on the explosive growth of Information and Communication Technologies (ICT), driven by the advancement of revolutionary technologies like the Internet of Things (IoT), social media, and big data mining, which are considered the e-bricks
for smart city development. Citizens post opinions on social media and participate
in online discussions generating new opportunities for improving government
services while boosting information distribution and reachability. A smart city is a
social and technical structure, carrying a very large number of connections. Social
media may contain unstructured data regarding any smart city application layer.
For example, users may discuss optimal resource management issues related to
water or electricity supply.
Graph theory emerges as the default way of organizing the complex
networks that city services need and possibly the only realistic way of capturing
all this density and interconnectivity [2]. When dealing with linked data, one
compelling reason to choose a graph database over a relational database is the
sheer performance boost. Another advantage of graph databases is that they allow
schemas to emerge in tandem with the growing understanding of the domain, as
they are naturally additive. This means that they are flexible in adding new kinds of nodes and relationships to an existing structure, without disrupting existing queries. Furthermore, the schema-free nature of the graph data model makes it possible to evolve an application in a controlled manner. Thus, for smart city data, graph databases are ideal for modeling, storing, and querying hierarchies, information, and linkages.
Smart city data management with graph databases is an emerging topic that
would benefit from a holistic conceptualization and synthesis of the literature.
The topic is relatively new and has not yet undergone a comprehensive review
of the literature. Each source of urban data varies in scale, speed, quality, format,
and, most importantly, semantics and reflects a specific aspect of the city [3].
Knowledge graphs accommodate linked city data regardless of type, schema, or any other traditional concern [4]. What is fascinating about smart city applications
is that all these graphs are interconnected. For example, the devices and users
may be represented on a map, depicting the device, citizen, and location graphs
in a single view. Many independent smart city applications can be found by
reviewing the related literature, but there is an absence of holistic, synthetic, and
integrated proposals. This work aims to synthesize new knowledge and probe
future practices on the topic by weaving together ideas from the literature into a
unique integrated model.
The rest of the chapter is structured as follows: The second section lays out
a conceptual framework for smart cities, social media, knowledge graphs, and
graph database research. The third section presents the materials and methods
used in the current study. The fourth section outlines an overview of the related
literature in the field. The next section composes the emerging ideas found in
the literature into an integrative framework for managing smart city-linked data
with graph databases. Finally, in the last section, conclusions and future study
directions are discussed.
6.2 Background
6.2.1 Smart cities
As Townsend states: “cities accelerate time by compressing space, and let
us do more with less of both” [1]. As cities grow, the need to change the way
they operate becomes more and more vital. Never before has there been such
an opportunity to do it [5]. In recent years, the question of how ICT might be
used to improve a city’s performance and promote efficiency, effectiveness, and
competition has arisen [6]. The smart city concept was proposed by International
Business Machines Corporation (IBM) in 2008 [7] as the potential solution to the
challenges posed by urbanization [8].
D’Aniello et al. [9] consider the smart city as an evolving, dynamic system
with a dual purpose: supporting decision-making and enriching city domain
knowledge. The smart city strives to offer citizens a promising quality of life by
merging technology, urban infrastructures, and services to dramatically enhance
function efficiency and answer residents’ requests via resource optimization [10].
The smart city operation is based on six pillars: Smart People, Smart Economy,
Smart Mobility, Smart Living, Smart Governance, and Smart Environment [11]–
[13] as illustrated in Figure 6.1. Li et al. [14] argue that only by considering all
of these factors equally can a smart city accomplish social fairness, economic
development, and sustainability.
The smart city is based on a 3D geospatial architecture that allows for real-
time sensing, measurement, and data transmission of stationary and moving
objects [14]. Large amounts of data generated can be converted into valuable
information that will guide decision-making through proper management and
processing. D’Aniello et al. [9] represent the smart city as an adaptive system that operates in three phases, as shown in Figure 6.2. In the first phase, the ubiquitous sensor network (hard sensing) and social media interactions (soft sensing) capture real-time data. These data produce streams, which are processed in the second phase to transform them into valuable information that may be used to make decisions. Knowledge leads to actions in the city during the third phase.
The overall architecture of a smart city is based on the three-layered
hierarchical model of the IoT paradigm, as illustrated in Figure 6.3. Townsend [1] states that these three layers allow renovating governments by design, transforming the way they work internally and together with citizens and other stakeholders. The three layers of the smart city structure are:
1. The “instrumentation” layer: Data from the urban environment and
citizen social interaction are collected in this layer. Smart devices with
distributed data collection are in charge of gathering and storing data from
the monitored region locally. Images, video, music, temperature, humidity,
pressure, and other types of information can all be collected. For capturing
and delivering data streams to a base station or a sink node for processing,
the distributed sensor grid is integrated into municipal infrastructure. Data
routing and transmission to the service-oriented middleware layer are
handled by wired and wireless network devices. Many studies [7], [14]–[16] consider this layer as two separate layers: the sensors layer and the network layer.
2. Data storage, real-time processing, and analysis are handled by the service-
oriented middleware layer. To promote interoperability across otherwise
incompatible technologies, service-oriented architecture often uses widely recognized Internet standards such as REST or Web Services [17]. Data can be preprocessed with very low latency at edge gateways near smart devices, and then aggregated and processed in data centers using cloud computing and machine learning capabilities [18].
3. The intelligent application layer provides end-users with a user interface
that is efficient, interactive, and comprehensive, allowing intelligent
services for many city domains to be provided. Tables, charts, and handlers should be used to provide information to the user in a clear and comprehensible manner, allowing them to experiment and create scenarios.
The application layer should offer the user all the tools they need to make
informed decisions and plan their city.
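The data path through these three layers can be sketched end to end: raw readings from the instrumentation layer, edge-style filtering and aggregation in the middleware layer, and a human-readable summary in the application layer. All function names, value ranges, and readings below are hypothetical, chosen only to make the flow concrete.

```python
# Toy sketch of the three-layer smart city flow. All names and values
# are invented for illustration.

def instrumentation_layer():
    """Layer 1: raw sensor readings (hard-coded temperatures in °C)."""
    return [{"sensor": "t1", "value": 21.4}, {"sensor": "t1", "value": 21.6},
            {"sensor": "t1", "value": 95.0},  # an outlier the edge should drop
            {"sensor": "t2", "value": 19.8}]

def middleware_layer(readings, lo=-30.0, hi=60.0):
    """Layer 2: edge-style preprocessing (range filter) and aggregation."""
    valid = [r for r in readings if lo <= r["value"] <= hi]
    by_sensor = {}
    for r in valid:
        by_sensor.setdefault(r["sensor"], []).append(r["value"])
    return {s: sum(v) / len(v) for s, v in by_sensor.items()}

def application_layer(aggregates):
    """Layer 3: present a comprehensible summary to the end user."""
    return [f"{sensor}: mean {mean:.1f} °C"
            for sensor, mean in sorted(aggregates.items())]

summary = application_layer(middleware_layer(instrumentation_layer()))
```

The point of the separation is that each layer can evolve independently: the edge filter can move closer to the devices, and the application layer never sees raw streams.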
grown and evolved into an efficient tool in the field of applied mathematics. Today
it finds many applications in various scientific areas, such as biology, chemistry,
sociology, engineering, computer science, and operations research.
A knowledge graph is a specific type of graph made of “entity-relation-
entity” triples and the value pairs of entities and their associated attributes [27].
It connects the fragmented knowledge by representing a semantic network that
can effectively support the contextual understanding of a domain. Knowledge
graphs were originally proposed by Google in 2012 as a way of improving search
procedures. With a knowledge graph, the intricate relationships between the
real-world entities can be inferred out of existing facts [28] and complex related
information can be represented in a structured way [29]. As Qamar et al. [30]
state, having all the data semantically structured not only displays the domain’s
structure in a comprehensible way but also allows academics to work on the data
without exerting additional effort. Knowledge graphs’ holistic understanding can
assist in extracting insights from existing data, improving predictions, and driving
automation and process optimization [26].
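Such "entity-relation-entity" triples, and a one-step inference over them, can be sketched as follows. The city entities and relations are invented for illustration; a real knowledge graph would of course use a triple store or graph database rather than a Python list.

```python
# A knowledge graph as "entity-relation-entity" triples, plus a minimal
# pattern query and a one-step inference. Entities are illustrative only.
triples = [
    ("sensor_17", "locatedAt", "MainStreet"),
    ("sensor_17", "measures", "AirQuality"),
    ("MainStreet", "partOf", "DowntownDistrict"),
    ("BusLine4", "serves", "MainStreet"),
]

def query(triples, subject=None, relation=None, obj=None):
    """Return triples matching the given pattern (None = wildcard)."""
    return [(s, r, o) for (s, r, o) in triples
            if (subject is None or s == subject)
            and (relation is None or r == relation)
            and (obj is None or o == obj)]

# "What is known about MainStreet as an object?"
about_main = query(triples, obj="MainStreet")

# One-step inference: anything located at a street that is part of a
# district is (transitively) located in that district.
inferred = [(s, "locatedIn", d)
            for (s, _, street) in query(triples, relation="locatedAt")
            for (_, _, d) in query(triples, subject=street, relation="partOf")]
```

The `inferred` list illustrates the claim above: relationships between real-world entities can be derived from existing facts rather than stored explicitly.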
The benefits and applications of knowledge graphs for smart cities have not been fully explored yet [31]. Decision-makers must deal with the complexity and
variety of municipal data by utilizing technologies that are aware of the smart
city pillars’ interdependencies [32]. The use of relational tables or key-value pairs
is very common in traditional smart city applications. It is difficult to retrieve
qualitative information from these databases, because of the high volume and
fragmentation of the stored data [33]. There is a need to make data smarter. This may be achieved by combining data and knowledge at a large scale, which is why knowledge graphs are developed [26].
Kurteva and Fensel [31] argue that knowledge graphs can provide the
level of traceability, transparency, and interpretability that is needed for the
implementation of smart city holistic management as they are widely used in
knowledge-driven tasks, such as information retrieval, question-answering,
intelligent dialogue systems, personalized recommendations, and visualization
[27, 34]. City data are expected to be understood more intuitively and better serve
the decision-making process with the use of a visual knowledge graph. There is
a direct correspondence between the three layers of smart city architecture and
the knowledge graphs taxonomy that Aasman [4] proposed. Aasman classifies
knowledge graphs into three categories:
• Internal operations knowledge graph: It focuses on integrating an
organization’s human resources, materials, facilities, and projects. This
knowledge graph yields a competitive advantage to the organization by
improving its self-knowledge.
• Intermediary products and services knowledge graph: It is dedicated to the
enterprise industry, line of business, and operation sector and is meant to
improve services.
• External customer knowledge graph: This graph integrates data coming from
various organization silos. It focuses on the relationships between these pieces
of information, offering the typical 360-degree view of an organization’s
customers.
An ontology describes the top-level structure of a knowledge graph. It is a
taxonomy scheme that identifies the classes in a domain and the relationships
between them. An ontology can be constructed from a knowledge graph for
system analysis and level of domain knowledge detection [26, 29]. Komninos et
al. [23] argue that the impact of smart city applications depends primarily on their
ontology, and secondarily on technology and programming features.
Tidke et al. [24] state that the identification of influential nodes in a social
network can act as a source in making public decisions because their opinions
give insights to urban governance bodies. For this purpose, they present novel
approaches to identify and rank influential nodes for the smart city topic based
on a timestamp. Their approach differs from most state-of-the-art approaches
because it can be stretched to test different heterogeneous features on diverse data
sets and applications.
In [48], a novel model for autonomous citizen profiling, capable of resolving intelligent queries, is proposed. The researchers have used sousveillance and
surveillance devices to provide the information that is represented as a knowledge
graph for inferring and matching relationships among the entities. Tsitseklis et
al. [49] suggested a community detection approach to analyze complex networks
that use hyperbolic network embedding to obtain communities in very large
graphs. The system is based on a graph database for storing information as well as
supporting crucial computations.
Wang et al. [50] proposed a heterogeneous graph embedding framework for a location-based social network. As location-based social networks are heterogeneous by nature, containing various types of nodes, i.e., users, Points Of Interest (POIs), and various relationships between different nodes, they argue that their framework can extract and represent useful information for
smart cities. D’Onofrio et al. [51] investigated how a fuzzy reasoning process
with multiple components (such as natural language processing and information
retrieval) could improve information processing in urban systems. They modeled
city data by fuzzy cognitive maps stored in graph databases.
The following papers on IoT applications for city transportation were
identified during the review of the relevant literature. Wirawan et al. [39] proposed a multimodal transportation database design using the graph data model. In [52]
a graph-based framework that helps to identify the impact of breakage on very
large road networks is presented. The authors analyze vulnerability using global
efficiency, a metric that has been widely used to characterize the overall resilience
of several city networks such as power grids and transportation networks. Then,
to identify the most vulnerable nodes inside the network, they use betweenness
centrality, a fundamental metric of centrality to identify topological criticalities.
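Global efficiency, the metric mentioned above, is commonly defined as the average of the inverse shortest-path lengths over all ordered node pairs, with unreachable pairs contributing zero. A sketch on a toy road network (invented topology) shows how removing a bridge node lowers the efficiency:

```python
from collections import deque

def global_efficiency(adj):
    """Average of 1/d(u, v) over all ordered node pairs, where d is the
    unweighted shortest-path length; unreachable pairs contribute 0."""
    nodes = list(adj)
    n = len(nodes)
    if n < 2:
        return 0.0
    total = 0.0
    for source in nodes:
        dist = {source: 0}
        queue = deque([source])
        while queue:  # breadth-first search from source
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(1.0 / d for d in dist.values() if d > 0)
    return total / (n * (n - 1))

# A small road network: node "b" is a bridge whose loss disconnects "a".
road = {"a": {"b"}, "b": {"a", "c"}, "c": {"b", "d"}, "d": {"c"}}
eff_before = global_efficiency(road)

# Simulate breakage of node "b":
broken = {"a": set(), "c": {"d"}, "d": {"c"}}
eff_after = global_efficiency(broken)
```

Ranking nodes by how much their removal drops this value is one simple way to surface the topological criticalities that betweenness centrality also identifies.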
Vela et al. [53] aim to discover the strengths or weaknesses of the public
transport information provided and the services offered by utilizing the Neo4j graph
database. In their study, they propose a novel method to design a graph-oriented
database in which to store accessible routes generated by mobile application
users. Tan et al. [29] created a representation learning model (named TransD) to
perform knowledge reasoning using the existing knowledge in knowledge graphs,
which can discover the implicit relationship between traffic entities, such as the
relationship between the POIs and the road traffic state. The objective of their
study was to construct the domain ontology with the four elements of “people–
vehicle–road–environment” in traffic as the core, and related to traffic subjects,
travel behavior, traffic facilities, traffic tools, and other entities.
Bellini et al. [54] suggested a system for the ingestion of public and private
data for smart cities, including road graphs, services available on the roads, traffic
sensors, and other relevant features. The system can handle enormous amounts
of data from a variety of sources, including both open data from government
agencies and private data.
A wide range of studies is also found on the topic of energy management
in cities with the help of IoT and graph database technologies. Huang et al. [34]
used an AI-enhanced semi-automated labeling system to construct a knowledge
graph model for facilitating the grid management and search functions with the
help of Neo4j.
Kovacevic et al. [55] elaborate on four different approaches which aim to
tighten access control to preserve privacy in smart grids. They also employed the
Neo4j graph database for storing and querying the power distribution network
model in their study. Graph-based models can also be used for handling big data generated from surveillance applications in wireless multimedia sensor networks, as the authors did in [37]. In their study, big sensor data is stored in a well-defined graph database for simulating multimedia wireless sensor networks and running several complex experimental queries.
Gorawski and Grochla [45] proposed a graph database schema for representing the linear infrastructure of the city, e.g., railways, waterworks, canals, gas pipelines, heat pipelines, electric power lines, cable ducting, or roads, identifying the real connections between these linear infrastructure objects and connecting them with devices, meters, and sensors.
Le-Phuoc et al. [56] created a knowledge graph to pave the way toward building a “real-time search engine for the Internet of Things”, which they call the Graph of Things (GoT) and which aims at enabling a deeper understanding of the data generated by the connected things of the world around us. Palaiokrassas et al. [38] built a Neo4j graph database for storing data collected from smart city sensors (about air pollution concentrations, temperature, and relative humidity) as well as city open data sources, and provided recommendations for citizens.
D’Orazio et al. [57] show how developing techniques for querying and evolving
graph-modeled datasets based on user-defined constraints can be applied to
effectively create knowledge from urban data with automated mechanisms that
guarantee data consistency.
To solve the problem of multi-source spatiotemporal data analysis in
heterogeneous city networks, Zhao et al. [58] proposed a general framework
via knowledge graph embedding for multi-source spatiotemporal data analysis
tasks. They then used link prediction and cluster analysis tasks to mine the
network structure and semantic knowledge. Ali et al. [59] proposed a particularly
customized to a smart city environment Semantic Knowledge-Based Graph
(SKBG) model as a solution that overcomes the basic limitations of conventional
ontology-based approaches. Their model interlinks heterogeneous data to find
meaning, concepts, and patterns in smart city data.
network comprises the bus network, the subway network, and the railway
network [77]. These networks are not constructed by only one type of
technology, but several, and this fact makes their management a particularly
big challenge [25].
• The social layer: It may also contain multiple networks, for example, family,
friendship, political, business, innovation, academic networks, etc. Mention,
however, should be made of social media e.g., Facebook or Twitter, which
have become widespread in recent years. Social media have emerged as an
important source of information about what is happening in the city [78] as
many posts have an associated location. This can lead to a strong correlation
between the behavior of social media users and urban spaces [50].
The complexity of city networks and the ambiguities related to social aspects
are crucial challenges to the issue of city operation and design [79]. The examination
of all the city’s components that affect the city and the investigation of the related
data flows between these components is vital to create a representative model of
the city’s structure [69]. The most effective way of representing these networks
is as a graph, where city components are modeled as nodes and the relationships
between components are modeled as edges. In addition, each node has to be
located at specific geographical coordinates, to ensure that related components
are placed at the same location [77]. For example, a streetlight equipped with a
Wi-Fi hotspot must be placed near a bus station.
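The mapping just described can be sketched in a few lines of Python; the component names, coordinates, and relationship type below are illustrative, not drawn from a real deployment:

```python
# Minimal sketch: city components as nodes with coordinates,
# relationships between components as typed edges.
# All names and coordinates below are illustrative only.

nodes = {
    "bus_station_12": {"type": "BusStation", "lat": 40.6401, "lon": 22.9444},
    "streetlight_7":  {"type": "Streetlight", "lat": 40.6402, "lon": 22.9446,
                       "wifi_hotspot": True},
}

# Each edge is (source, relationship type, destination).
edges = [
    ("streetlight_7", "NEAR", "bus_station_12"),
]

def neighbours(node_id, rel_type):
    """Return ids of nodes connected to node_id via rel_type edges."""
    return [dst for src, rel, dst in edges
            if src == node_id and rel == rel_type]

print(neighbours("streetlight_7", "NEAR"))  # ['bus_station_12']
```

A graph database would store the same structure natively and index it for traversal; the point here is only the node/edge/coordinate shape of the model.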
The greatest challenge in integrating smart city applications is interoperability,
which arises from data heterogeneity. The hardware and software components
that are used in smart city applications are heterogeneous and usually delivered by
Managing Smart City Linked Data with Graph Databases... 133
Angles and Gutierrez [82] define the graph data model as “a model in which
the data structures for the schema and/or instances are modeled as a directed,
possibly labeled, graph, or generalizations of the graph data structure, where data
manipulation is expressed by graph-oriented operations and type constructors,
and appropriate integrity constraints can be defined over the graph structure”.
Küçükkeçeci and Yazıcı [37] consider graph data modeling a natural way to
depict reality. Robinson et al. [36] underline the fact that graph data models not
only represent how things are linked but also provide querying capabilities so as
to generate new knowledge. They argue that graph models and graph queries are
just two sides of the same coin.
Needham and Hodler [41] compile a list of graph analytics and algorithms
employed for answering questions in graph databases. Some types of algorithms
and their related usage are included in Table 6.1. The use cases mentioned in the
table are indicative as the possibilities are unlimited.
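As one concrete illustration of the pathfinding family of algorithms such a compilation typically includes, a breadth-first search returns the fewest-hops route over an unweighted network; the toy stop names below are hypothetical:

```python
from collections import deque

# Toy transport network as adjacency lists over hypothetical stops.
graph = {
    "A": ["B", "C"],
    "B": ["A", "D"],
    "C": ["A", "D"],
    "D": ["B", "C", "E"],
    "E": ["D"],
}

def shortest_path(graph, start, goal):
    """Breadth-first search: fewest-hops path in an unweighted graph."""
    queue = deque([[start]])
    visited = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        for nxt in graph.get(node, []):
            if nxt not in visited:
                visited.add(nxt)
                queue.append(path + [nxt])
    return None  # goal unreachable

print(shortest_path(graph, "A", "E"))  # ['A', 'B', 'D', 'E']
```

In a graph database such traversals run over the stored graph directly; this sketch only shows the algorithmic idea behind a routing use case.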
As we have seen, graph databases can communicate and cooperate effectively
with other data technologies, due to their openness and flexibility. However, there
is also the issue of interaction with people. The massive volume, high speed, and
variety, as well as the unstructured nature of smart city data, cause an intractable
problem of making sense of it all.
Managing the entire data ecosystem needs a comprehensive user interface
that intuitively provides information. The visual representation of the information
has to be dynamic and interactive, and it should be understandable and usable, as
it should be delivered to the ordinary citizen and not only to experts. In addition,
the user must have the ability to pose queries without any coding knowledge.
A graphical user interface that meets the above requirements provides a better
understanding of the entire data ecosystem, from the production of a data point to
its consumption, and supports effective decision-making [26]. Graph databases
by nature have an advantage in data visualization [71]. In contrast to relational
databases, whose data must be processed and re-shaped into a graph structure
before it can be visualized, graph databases require no extra overhead or
configuration, as the graph is the native topology of their technology [71].
Graph database visualization mechanisms enable users
to intuitively point and click, drag and drop, and query data to obtain information
in near real-time and get answers to questions they would not otherwise think
to ask [4].
Table 6.1. Types of graph algorithms and related smart city use cases
References
[1] Townsend, A. 2013. Smart Cities: Big Data, Civic Hackers, and the Quest for a New
Utopia. New York, NY: W.W. Norton & Company.
[2] Eifrem, E. 2018. The role of graph in our smart city future. Global, pp. 7–8.
[3] Psyllidis, A. 2015. Ontology-based data integration from heterogeneous urban
systems: A knowledge representation framework for smart cities. In: CUPUM
2015 – 14th International Conference on Computers in Urban Planning and Urban
Management, July 2015, p. 240.
[4] Aasman, J. 2017. Transmuting information to knowledge with an enterprise knowledge
graph. IT Prof., 19(6): 44–51, doi: 10.1109/mitp.2017.4241469.
[5] Goldsmith, S. and S. Crawford. 2014. The Responsive City: Engaging Communities
Through Data-Smart Governance. San Francisco, CA, USA: Jossey-Bass.
[6] Charalabidis, Y., C. Alexopoulos, N. Vogiatzis and D. Kolokotronis. 2019. A
360-degree model for prioritizing smart cities initiatives, with the participation of
municipality officials, citizens and experts. In: M.P.R. Bolivar and L.A. Munoz (Eds.).
E-Participation in Smart Cities: Technologies and Models of Governance for Citizen
Engagement. pp. 123–153. Cham, Switzerland: Springer International Publishing.
[7] Li, D.R., J.J. Cao and Y. Yao. 2015. Big data in smart cities. Sci. China Inf. Sci.,
58(10): 1–12, doi: 10.1007/s11432-015-5396-5.
[8] Ejaz, W. and A. Anpalagan. 2019. Internet of Things for Smart Cities: Technologies,
Big Data and Security. Cham, Switzerland: Springer Nature.
[9] D’Aniello, G., M. Gaeta and F. Orciuoli. 2018. An approach based on semantic stream
reasoning to support decision processes in smart cities. Telemat. Informatics, 35(1):
68–81, doi: 10.1016/j.tele.2017.09.019.
[10] Kousis, A. and C. Tjortjis. 2021. Data Mining Algorithms for Smart Cities: A
Bibliometric Analysis. Algorithms, 14(8): 242, doi: 10.3390/a14080242.
[11] Pieroni, A., N. Scarpato, L. Di Nunzio, F. Fallucchi and M. Raso. 2018. Smarter city:
Smart energy grid based on blockchain technology. Int. J. Adv. Sci. Eng. Inf. Technol.,
8(1): 298–306, doi: 10.18517/ijaseit.8.1.4954.
[12] Khan, M.S., M. Woo, K. Nam and P.K. Chathoth. 2017. Smart city and smart tourism:
A case of Dubai. Sustain., 9(12), doi: 10.3390/su9122279.
[13] Kumar Kar, A., S.Z. Mustafa, M. Gupta, V. Ilavarsan and Y. Dwivedi. 2017.
Understanding smart cities: Inputs for research and practice. In: A. Kumar Kar, M.P.
Gupta, V. Ilavarsan and Y. Dwivedi (Eds.). Advances in Smart Cities: Smarter People,
Governance, and Solutions. pp. 1–8. Boca Raton, FL: CRC Press.
[14] Li, D., J. Shan, Z. Shao, X. Zhou and Y. Yao. 2013. Geomatics for smart cities –
Concept, key techniques, and applications. Geo-Spatial Inf. Sci., 16(1): 13–24, doi:
10.1080/10095020.2013.772803.
[15] Moreno, M.V., F. Terroso-Sáenz, A. González-Vidal, M. Valdés-Vela, A. Skarmeta,
M.A. Zamora and V. Chang. 2017. Applicability of big data techniques to smart
cities deployments, IEEE Trans. Ind. Informatics, 13(2): 800–809, doi: 10.1109/
TII.2016.2605581.
[16] Massana, J., C. Pous, L. Burgas, J. Melendez and J. Colomer. 2017. Identifying
services for short-term load forecasting using data driven models in a Smart City
platform. Sustain. Cities Soc., 28: 108–117, doi: 10.1016/j.scs.2016.09.001.
[17] Wlodarczak, P. 2017. Smart Cities – Enabling technologies for future living.
In: A. Karakitsiou, A. Migdalas, S. Rassia and P. Pardalos (Eds.). City Networks:
Collaboration and Planning for Health and Sustainability. pp. 1–16. Cham,
Switzerland: Springer International Publishing.
[18] Mystakidis, A. and C. Tjortjis. 2020. Big Data Mining for Smart Cities: Predicting
Traffic Congestion using Classification. doi: 10.1109/IISA50023.2020.9284399.
[19] Giatsoglou, M., D. Chatzakou, V. Gkatziaki, A. Vakali and L. Anthopoulos. 2016.
CityPulse: A platform prototype for smart city social data mining. J. Knowl. Econ.,
7(2): 344–372, doi: 10.1007/s13132-016-0370-z.
[20] Christakis, N. and J. Fowler. 2011. Connected: The Amazing Power of Social
Networks and How They Shape Our Lives. London, UK: Harper Press.
[21] Koukaras, P., C. Tjortjis and D. Rousidis. 2019. Social Media Types: Introducing a
Data Driven Taxonomy, vol. 102(1). Springer Vienna.
[22] Rousidis, D., P. Koukaras and C. Tjortjis. 2019. Social Media Prediction: A Literature
Review, vol. 79(9–10). Multimedia Tools and Applications.
[23] Komninos, N., C. Bratsas, C. Kakderi and P. Tsarchopoulos. 2015. Smart City
Ontologies: Improving the effectiveness of smart city applications. J. Smart Cities,
1(1): 31–46, doi: 10.18063/jsc.2015.01.001.
[24] Tidke, B.A., R. Mehta, D. Rana, D. Mittal and P. Suthar. 2021. A social network based
approach to identify and rank influential nodes for smart city. Kybernetes, 50(2): 568–
587, doi: 10.1108/K-09-2019-0637.
[25] CITO Research. 2020. Six Essential Skills for Mastering the Internet of Connected
Things. Retrieved April 26, 2022 from http://www.citoresearch.com
[26] Barrasa, J., A.E. Hodler and J. Webber. 2021. Knowledge Graphs: Data in Context for
Responsive Businesses. Sebastopol, CA: O’Reilly Media.
[27] Liu, J., H. Ning, Y. Bai and T. Yan. 2020. The study on the index system of the smart
city based on knowledge map. J. Phys. Conf. Ser., 1656(1), doi: 10.1088/1742-6596/1656/1/012015.
[28] Fensel, A. 2020. Keynote: Building smart cities with knowledge graphs. 2019
Int. Conf. Comput. Control. Informatics its Appl., 19: 1–1, doi: 10.1109/
ic3ina48034.2019.8949613.
[29] Tan, J., Q. Qiu, W. Guo and T. Li. 2021. Research on the construction of a knowledge
graph and knowledge reasoning model in the field of urban traffic. Sustain., 13(6), doi:
10.3390/su13063191.
[30] Qamar, T., N.Z. Bawany, S. Javed and S. Amber. 2019. Smart City Services
Ontology (SCSO): Semantic Modeling of Smart City Applications. In: Proc. 2019
7th Int. Conf. Digit. Inf. Process. Commun. ICDIPC, pp. 52–56, doi: 10.1109/
ICDIPC.2019.8723785.
[31] Kurteva, A. and A. Fensel. 2021. Enabling interpretability in smart cities with
knowledge graphs: Towards a better modelling of consent. IEEE Smart Cities, https://
smartcities.ieee.org/newsletter/june-2021/enabling-interpretability-in-smart-cities
with-knowledge-graphs-towards-a-better-modelling-of-consent (accessed Aug. 17,
2021).
[32] De Nicola, A. and M.L. Villani. 2021. Smart city ontologies and their applications: A
systematic literature review. Sustain., 13(10), doi: 10.3390/su13105578.
[33] Huang, H., Y. Chen, B. Lou, Z. Hongzhou, J. Wu and K. Yan. 2019. Constructing
knowledge graph from big data of smart grids. Proc. 10th Int. Conf. Inf. Technol.
Med. Educ. ITME 2019, pp. 637–641, doi: 10.1109/ITME.2019.00147.
[34] Huang, H., Z. Hong, H. Zhou, J. Wu and N. Jin. 2020. Knowledge graph construction
and application of power grid equipment. Math. Probl. Eng., vol. January 2020, no.
2018, doi: 10.1155/2020/8269082.
[35] Zhang, Z.J. 2017. Graph databases for knowledge management. IT Prof., 19(6): 26–
32, doi: 10.1109/MITP.2017.4241463.
[36] Robinson, I., J. Webber and E. Eifrem. 2015. Graph Databases: New Opportunities for
Connected Data, 2nd ed. Sebastopol, CA, USA: O’Reilly.
[37] Küçükkeçeci, C. and A. Yazıcı. 2018. Big Data Model Simulation on a graph database
for surveillance in wireless multimedia sensor networks. Big Data Res., 11: 33–43,
doi: 10.1016/j.bdr.2017.09.003.
[38] Palaiokrassas, G., V. Charlaftis, A. Litke and T. Varvarigou. 2017. Recommendation
service for big data applications in smart cities. In: International Conference on High
Performance Computing and Simulation, HPCS 2017, pp. 217–223, doi: 10.1109/
HPCS.2017.41.
[39] Wirawan, P.W., D. Er Riyanto, D.M.K. Nugraheni and Y. Yasmin. 2019. Graph
database schema for multimodal transportation in Semarang. J. Inf. Syst. Eng. Bus.
Intell., 5(2): 163–170, doi: 10.20473/jisebi.5.2.163-170.
[40] Webber, J. 2021. The Top 10 Use Cases of Graph Database Technology. [Online].
Available: https://neo4j.com/resources/top-use-cases-graph-databases-thanks/?aliId=
eyJpIjoiWlA2MFdod08yeW14bmt0XC8iLCJ0IjoiTEg3aFVUdlwvWHQ3dDgwV1d
jQUdYUVE9PSJ9.
[41] Needham, M. and A.E. Hodler. 2019. Graph Algorithms: Practical Examples in
Apache Spark & Neo4j. Sebastopol, CA, USA: O’Reilly.
[42] Ding, P., Y. Cheng, W. Lu, H. Huang and X. Du. 2019. Which Category is Better:
Benchmarking the RDBMSs and GDBMSs. In: Web and Big Data: 3rd International
Joint Conference, APWeb-WAIM 2019 Proceedings, Part II, vol. 2, pp. 191–206,
[Online]. Available: http://link.springer.com/10.1007/978-3-030-26075-0.
[58] Zhao, L., H. Deng, L. Qiu, S. Li, Z. Hou, H. Sun and Y. Chen. 2020. Urban multi-source
spatio-temporal data analysis aware knowledge graph embedding. Symmetry
(Basel), 12(2): 1–18, doi: 10.3390/sym12020199.
[59] Ali, S., G. Wang, K. Fatima and P. Liu. 2019. Semantic knowledge based graph model
in smart cities. Communications in Computer and Information Science: Smart City
and Informatization, 7th International Conference, iSCI 2019, vol. 1122, pp. 268–278,
doi: 10.1007/978-981-15-1301-5_22.
[60] Consoli, S., M. Mongiovic, A.G. Nuzzolese, S. Peroni, V. Presutti, D.R. Recupero
and D. Spampinato. 2015. A smart city data model based on semantics best practice
and principles. WWW 2015 Companion – Proceedings of the 24th International
Conference on World Wide Web, pp. 1395–1400, doi: 10.1145/2740908.2742133.
[61] Maduako, I. and M. Wachowicz. 2019. A space-time varying graph for modelling
places and events in a network. Int. J. Geogr. Inf. Sci., 33(10): 1915–1935, doi:
10.1080/13658816.2019.1603386.
[62] Amaxilatis, D., G. Mylonas, E. Theodoridis, L. Diez and K. Deligiannidou. 2020.
LearningCity: Knowledge generation for smart cities. In: F. Al-Turjman (Ed.). Smart
Cities Performability, Cognition & Security. pp. 17–41. Cham, Switzerland: Springer.
[63] Schoonenberg, W.C.H., I.S. Khayal and A.M. Farid. 2018. A Hetero-functional Graph
Theory for Modeling Interdependent Smart City Infrastructure. Cham, Switzerland:
Springer.
[64] Yao, Z., C. Nagel, F. Kunde, G. Hudra, P. Willkomm, A. Donaubauer, T. Adolphi
and T.H. Kolbe. 2018. 3DCityDB – A 3D geodatabase solution for the management,
analysis, and visualization of semantic 3D city models based on CityGML. Open
Geospatial Data, Softw. Stand., 3(5): 1–26, doi: 10.1186/s40965-018-0046-7.
[65] Fernandez, F., A. Sanchez, J.F. Velez and B. Moreno. 2020. The augmented space of a
smart city. International Conference on Systems, Signals, and Image Processing, vol.
July 2020, pp. 465–470, doi: 10.1109/IWSSIP48289.2020.9145247.
[66] Přibyl, P., O. Přibyl, M. Svítek and A. Janota. 2020. Smart city design based on
an ontological knowledge system. Commun. Comput. Inf. Sci., vol. 1289 CCIS, no.
October 2020, pp. 152–164, doi: 10.1007/978-3-030-59270-7_12.
[67] Štěpánek, P. and M. Ge. 2018. Validation and extension of the Smart City ontology.
ICEIS 2018 – Proc. 20th Int. Conf. Enterp. Inf. Syst., vol. 2, no. Iceis 2018, pp. 406–
413, 2018, doi: 10.5220/0006818304060413.
[68] Qamar, T. and N. Bawany. 2020. A cyber security ontology for Smart City. Int. J. Inf.
Technol. Secur., 12(3): 63–74.
[69] Smirnova, O. and T. Popovich. 2019. Ontology-based model of a Smart City. Real
Corp 2019: Is this the Real World? Perfect Smart Cities vs. Real Emotional Cities,
2019, pp. 533–540.
[70] Amghar, S., S. Cherdal and S. Mouline. 2018. Which NoSQL database for IoT
applications? 2018 Int. Conf. Sel. Top. Mob. Wirel. Networking, MoWNeT 2018, pp.
131–137, doi: 10.1109/MoWNet.2018.8428922.
[71] Effendi, S.B., B. van der Merwe and W.T. Balke. 2020. Suitability of graph database
technology for the analysis of spatio-temporal data. Futur. Internet, 12(5): 1–31,
2020, doi: 10.3390/FI12050078.
[72] Brugnara, M., M. Lissandrini and Y. Velegrakis. 2016. Graph Databases for Smart
Cities. IEEE Smart Cities Initiative. Retrieved September 19, 2021 from https://
event.unitn.it/smartcities-trento/Trento_WP_Brugnara2.pdf
[73] Almabdy, S. 2018. Comparative analysis of relational and graph databases for social
networks. 1st Int. Conf. Comput. Appl. Inf. Secur. ICCAIS 2018, doi: 10.1109/
CAIS.2018.8441982.
[74] Desai, M., R.G. Mehta and D.P. Rana. 2018. Issues and challenges in big graph
modelling for Smart City: An extensive survey. Int. J. Comput. Intell. IoT, 1(1): 44–
50.
[75] Breunig, M., P.E. Bradley, M. Jahn, P. Kuper, N. Mazroob, N. Rösch, M. Al-Doori, E.
Stefanakis and M. Jadidi. 2020. Geospatial data management research: Progress and
future directions. ISPRS Int. J. Geo-Information, 9(2), doi: 10.3390/ijgi9020095.
[76] Guo, D. and E. Onstein. 2020. State-of-the-art geospatial information processing in
NoSQL databases. ISPRS Int. J. Geo-Information, 9(5), doi: 10.3390/ijgi9050331.
[77] Ameer, F., M.K. Hanif, R. Talib, M.U. Sarwar, Z. Khan, K. Zulfiqar and A. Riasat.
2019. Techniques, tools and applications of graph analytic. Int. J. Adv. Comput. Sci.
Appl., 10(4): 354–363, doi: 10.14569/ijacsa.2019.0100443.
[78] Alomari, E. and R. Mehmood. 2018. Analysis of tweets in Arabic language for
detection of road traffic conditions. Lect. Notes Inst. Comput. Sci. Soc. Telecommun.
Eng. LNICST, 224: 98–110, doi: 10.1007/978-3-319-94180-6_12.
[79] Raghothama, J., E. Moustaid, V. Magal Shreenath and S. Meijer. 2017. Bridging
borders: Integrating data analytics, modeling, simulation, and gaming for
interdisciplinary assessment of health aspects in city networks. In: A. Karakitsioy et
al. (Eds.). City Networks. pp. 137–155. Cham, Switzerland: Springer International
Publishing.
[80] Fontana, F. 2017. City networking in urban strategic planning. In: Karakitsiou et al.
(Eds.). City Networks. pp. 17–38. Cham, Switzerland: Springer.
[81] Kivelä, M., A. Arenas, M. Barthelemy, J.P. Gleeson, Y. Moreno and M.A. Porter.
2014. Multilayer networks. J. Complex Networks, 2(3): 203–271, doi: 10.1093/
comnet/cnu016.
[82] Angles, R. and C. Gutierrez. 2008. Survey of graph database models. ACM Comput.
Surv., 40(1): 1–39, doi: 10.1145/1322432.1322433.
CHAPTER
7
Graph Databases in Smart City
Applications – Using Neo4j and
Machine Learning for Energy Load
Forecasting
The smart city (SC) approach aims to enhance populations’ lives through
developments in knowledge and connectivity systems such as traffic congestion
management, Energy Management Systems (EMS), Internet of Things or social
media. An SC is a highly sophisticated connected system that generates a large
quantity of data and demands a large number of connections. Graph databases
(GDB) provide new possibilities for organizing and managing such complicated
networks. Machine learning (ML) is positively influencing headline-grabbing
apps like self-driving vehicles and virtual assistants, but it is also improving
performance and lowering costs for everyday tasks like web chat and customer
service, recommendation systems, fraud detection, and energy forecasting. Most
of these ML data challenges may be addressed by utilizing GDB. Because graphs
are founded on the concept of linking and traversing connections, they are an
obvious option for data integration. Graphs can also be used to supplement raw
data. Each column in typical tabular data represents one “feature” that the ML
system may exploit. Each form of link is an extra feature in a graph. Furthermore,
simple graph structures such as causal chains, loops, and forks can be regarded
as features in and of themselves. Also, power system engineers have examined
several parallel computing technologies based on relational database structures to
increase the processing efficiency of EMS applications, but have yet to produce
sufficiently fast results, whereas graph processing achieves this. Considering Neo4j as
the leading NoSQL GDB, the goal of this work is to create and test a method
for energy load forecasting (ELF) combining ML and GDB. More specifically,
this research integrates multiple approaches for executing ELF tests on historical
building data. The experiment produces data resolution for 15 minutes as one step
ahead, while disclosing accuracy issues.
7.1 Introduction
For a long time, municipal administration depended on a variety of information
sources, including financial analyses, surveys, and ad-hoc research, to track the
growth of the city and suggest potential growth opportunities. Cities are now
outfitted with a plethora of observation equipment that offers real feedback
on everything that is going on, thanks to inexpensive IoT devices, enhanced network
connectivity, and breakthroughs in data collection and data analytics. These new
streams of data enable authorities to respond strategically and quickly to enhance
existing procedures or prevent undesirable situations. The volume, variety, and
velocity of data gathered by Smart Cities (SC) are so great that it is referred to as
Big Data [1].
There are various SC applications, like healthcare systems [2], traffic
congestion management [3], Energy Management Systems (EMS) [4], energy load
or generation forecasting [45, 46], waste [5] and water management [18], pollution
and air quality monitoring [19] or social media [6]. One of the most important and
common aspects of an SC infrastructure that is related to Energy Management
Systems with a large variety of publications is Energy Load Forecasting (ELF).
ELF is important in power system architecture and control. It supports power
providers in estimating energy consumption and planning for anticipated energy
requirements. Moreover, it assists transmission and distribution system operators
in regulating and balancing upcoming power generation to satisfy energy needs
effectively. As a result, even though it can be a challenging task given the
complexity of present-day energy networks, ELF has garnered a great deal of interest
in recent decades. There is a variety of research related to ELF and EMS with
Graph databases (GDB) [7, 8].
One of the most common and well-known GDB is Neo4j1, a GDB that was
developed from the ground up to employ both data and connections. Neo4j
connects information when it is saved, enabling nearly unprecedented queries at
previously unseen rates. Unlike traditional databases, which arrange data in rows,
columns, and tables, Neo4j has a flexible structure defined by documented links
between data items.
This research analyzes ELF using Machine Learning (ML) time series
regression, along with GDB. This entails using conventional evaluation measures,
i.e., performance metrics based mostly on statistics, to evaluate the prediction
performance of each model. R-squared (R2), Mean Absolute Error (MAE), Mean
Squared Error (MSE), Root Mean Squared Error (RMSE), and Coefficient of
1 https://neo4j.com/
Variation of the Root Mean Squared Error (CVRMSE) are the metrics used. For the
initial storage, descriptive analysis, and preprocessing, a Neo4j GDB was employed.
The prediction algorithms used in this research can be categorized as ML
approaches, such as Light Gradient Boosting Machine (LGBM), Random Forest
(RF), Artificial Neural Networks (ANN), Extreme Gradient Boosting (XGB), and
others. A sliding window method is also used for five distinct prediction models.
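These metrics can be computed directly from predictions; a minimal sketch with made-up values (not the chapter's pilot data), taking CVRMSE as RMSE divided by the mean of the observed series:

```python
import math

def regression_metrics(y_true, y_pred):
    """R2, MAE, MSE, RMSE, and CVRMSE (RMSE over the mean of y_true)."""
    n = len(y_true)
    mean_y = sum(y_true) / n
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    mae = sum(abs(t - p) for t, p in zip(y_true, y_pred)) / n
    mse = ss_res / n
    rmse = math.sqrt(mse)
    return {
        "R2": 1 - ss_res / ss_tot,
        "MAE": mae,
        "MSE": mse,
        "RMSE": rmse,
        "CVRMSE": rmse / mean_y,
    }

# Illustrative load values (kWh) only.
y_true = [0.5, 0.7, 0.6, 0.8]
y_pred = [0.45, 0.75, 0.6, 0.7]
m = regression_metrics(y_true, y_pred)
print(m)
```

In practice library implementations (e.g. in an ML toolkit) would be used; the point is only how the five metrics relate to the residuals and to each other.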
In Neo4j, each data record, or node, has direct ties to all of the nodes to
which it is related. Because it is built on this simple yet successful optimization,
Neo4j performs queries in densely linked data faster and with more depth than
other databases.
The remainder of this work has the following structure: Section 2 covers
the necessity for reliable ELF, as well as the state-of-the-art and commonly used
methods that provide the necessary context. The principles used for time-series
ELF are described in Section 3, along with the proposed method. Section 4
presents the results of the experiments using pilot data. Finally, Section 5 brings
the chapter to a close with final remarks, conclusions, and future work.
7.2 Background
7.2.1 Graph databases
GDB extend the capabilities of network and relational databases by
providing graph-oriented operations, implementing specific improvements, and
providing indexes. Several similar frameworks have lately been built. Despite
having comparable qualities, they are produced using diverse technologies
and approaches.
The fundamental unit of a GDB is given as G = (V, E), where V symbolizes
entities and E symbolizes relationships [9]. Singular entity graphs are composed
of the same class of nodes, while multiple entity graphs are composed of various
types of nodes [9, 10, 11]. Graphs are widely utilized in a variety of real systems
due to their adaptable architecture, self-explanatory nature, diverse modeling,
domain universality, schema-free configuration, fast associative functions, and
implicit storing of relations [12, 13, 14].
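A hedged sketch of the G = (V, E) formulation as a multi-entity property graph in plain Python; the labels and properties below are illustrative, not taken from the chapter's dataset:

```python
# Toy property graph: V holds labeled nodes with properties,
# E holds typed relationships between node ids.
# Entity names and values are illustrative smart-city examples.
V = {
    1: {"label": "Building", "props": {"name": "Smart Home"}},
    2: {"label": "Sensor", "props": {"kind": "energy_meter"}},
    3: {"label": "Measurement", "props": {"kwh": 0.73}},
}
E = [
    (2, "INSTALLED_IN", 1),
    (3, "RECORDED_BY", 2),
]

def match(label):
    """Return node ids carrying a given label
    (the multi-entity case: several node types in one graph)."""
    return [nid for nid, node in V.items() if node["label"] == label]

print(match("Sensor"))  # [2]
```

A label-based lookup like this is what a GDB query language expresses declaratively; here it is spelled out only to make the V/E structure concrete.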
A graph is simply a group of vertices and edges, or a series of nodes and the
relationships that link them. Graphs describe entities as nodes and the connections
between those entities as links [22]. Properties, which are key-value pairs, are
supported by both nodes and relationships. The information is preserved as a graph
in a GDB, and the records are referred to as nodes. Nodes are linked together via
relationships. A label is a term that groups nodes together. Neo4j is among the
most widespread GDB. The question is how to query a graph database;
Neo4j, like any GDB, leverages a sophisticated scientific idea from graph
theory as a robust and effective engine for querying data. This is known as graph
2 https://neo4j.com/blog/neo4j-3-0-massive-scale-developer-productivity/. Accessed: 2018.05.14.
VSTLF: Very short-term load forecasting is focused on load demand up to the
upcoming hour. It is primarily employed for real-time energy framework safety
analytics and energy instrument management [34].
STLF: Short-term load forecasting covers load predictions from one hour to one
day or one week ahead. STLF is the most useful of the four techniques of load
demand forecasting, and can help with systems’ energy dispatching and
control [35].
MTLF: The purpose of medium-term load forecasting is to predict energy
consumption in the coming weeks or months. It is used in the maintenance and
operation of power systems [36].
LTLF: Finally, long-term power load forecasting (LTLF) is a method for designing
long-term power systems. It is primarily used to anticipate energy consumption
for the next year or years [37].
ELF typically draws on three distinct types of inputs. Examples include
periodic input factors, such as load changes induced by air heating and cooling
(i.e., quarter of the year, period, weekend, etc.), and prior demand data
(hourly loads for the previous hour, the previous day, and the same day of
the previous week). The anticipated load for every hour of each day, the weekly,
monthly, or daily power demand, and the daily peak power demand can all be
generated by ELF [38].
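The periodic and prior-demand inputs above translate directly into model features; a minimal sketch over a hypothetical hourly series (standard library only; the load values are made up):

```python
from datetime import datetime, timedelta

# Hypothetical hourly load values (kWh); in practice these come from meters.
start = datetime(2021, 1, 8)  # an arbitrary Friday at 00:00
load = [0.4, 0.5, 0.6, 0.7, 0.8, 0.9] * 40  # 240 hours of toy data

def make_features(load, start, t):
    """Features for hour t: calendar inputs plus prior-demand lags
    (previous hour, same hour previous day, same hour previous week)."""
    ts = start + timedelta(hours=t)
    return {
        "hour": ts.hour,
        "weekday": ts.weekday(),          # 0 = Monday
        "is_weekend": ts.weekday() >= 5,
        "lag_1h": load[t - 1],
        "lag_1d": load[t - 24],
        "lag_1w": load[t - 168],
    }

# Features are only defined once a full week of history exists (t >= 168).
print(make_features(load, start, 170))
```

A forecasting model is then trained on rows of such features against the observed load, exactly the "periodic inputs plus prior demand" recipe described above.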
Furthermore, the author in [39] proposes a network-based deep belief ensemble
technique for cooling demand forecasting for an air conditioning unit. More
precisely, a layered Restricted Boltzmann Machine with a Logistic Regression
algorithm as the output is employed, with the coefficient of determination
(R2 or R-squared), Mean Absolute Error (MAE), Root Mean Square Error
(RMSE), Mean Absolute Percentage Error (MAPE), and Coefficient of Variation
of the Root Mean Squared Error (CVRMSE) measured as evaluators.
ANNs [40], tree-based models [41] and Support Vector Machines (SVM)
[42] are presently among the most often utilized electricity forecasting methods.
SVMs are touted as the most accurate solution for estimating power use
according to [43], with levels of complexity and accuracy equivalent to deep
neural networks. When compared to ANNs, SVMs have several disadvantages,
such as sluggish running speed. This is an issue, particularly with large-scale
efforts. Unfortunately, this is a typical issue with moderately realistic models that
need a large amount of computer memory and processing or computation time.
Moreover, tree-based algorithms could also be among the most efficient models
[41], due to their behavior around zero-inflated data sets [44]. Finally, statistical
regression methods are a common alternative for forecasting electricity demand.
These models are beneficial for determining the significance of prospective model
inputs, but they struggle with short-term forecasts because of their high level of
inaccuracy [21].
7.3 Methodology
This section details the methods for executing ELF, including the pilot dataset
description, preprocessing steps, analysis using Neo4j, and the forecasting
technique used, which includes a sliding window design, a common technique for
time series regression [2, 20].
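The sliding window design turns a univariate series into supervised (X, y) pairs for one-step-ahead regression; a minimal sketch with illustrative readings (not the pilot dataset):

```python
def sliding_window(series, window):
    """Split a series into (window -> next value) training pairs
    for one-step-ahead regression."""
    X, y = [], []
    for i in range(len(series) - window):
        X.append(series[i:i + window])   # the last `window` observations
        y.append(series[i + window])     # the value one step ahead
    return X, y

# Six 15-minute load readings (kWh), illustrative only.
series = [0.61, 0.64, 0.60, 0.58, 0.63, 0.66]
X, y = sliding_window(series, window=3)
print(X[0], "->", y[0])  # [0.61, 0.64, 0.6] -> 0.58
```

Each (X, y) pair then feeds any of the regression models named earlier; the window length is a tuning choice, not fixed by the method.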
3 https://smarthome.iti.gr/news.html?id=27
4 www.certh.gr
5 www.iti.gr
6 https://www.visualcrossing.com/
three years show that there are daily aggregated measures that have similar values
among two or three years.
Figure 7.2 visualizes the temperature with the year as a center, showing the
green bullets as daily aggregated values. Besides the central connection, there is
only one extra aggregated connection, showing that only one daily aggregated
temperature value is equal between two years.
Furthermore, Figure 7.3 illustrates the wind speed parameter centered on
the year parameter. As can be seen, there are no connections between
the years other than the central one. This means that no daily aggregated
wind speed values (light green bullets) are equal between two years. Similarly,
Figure 7.4 presents the relative humidity parameter with each year as a center,
illustrating the beige bullets as the daily aggregated values. In this case, there are
several other connections between the years, besides the central one. This means
that several daily aggregated relative humidity values are equal between two years.
Finally, Figure 7.5 shows the graph between energy load and year. Orange
bullets show daily aggregated values, while the blue circle at the center represents
the year. As with relative humidity, there are also numerous energy load daily
aggregated values that are equal between two years.
7.5 Results
This section describes the outcomes of the experiments based on several
evaluation metrics, namely MAE, R2, MSE, RMSE, and CVRMSE. The results are
illustrated in Table 7.1.
Table 7.1. Algorithmic results per metric, for one step ahead load forecasting
Model R2 (0-1) MAE (kWh) MSE (kWh2) RMSE (kWh) CVRMSE (0-1)
XGB 0.8538 0.1309 0.0328 0.1811 0.2911
LGBM 0.9332 0.0733 0.0150 0.1224 0.1968
CB 0.9183 0.0874 0.0183 0.1354 0.2177
KR 0.9259 0.0809 0.0166 0.1289 0.2072
BR 0.9256 0.0809 0.0167 0.1292 0.2077
GB 0.9318 0.0730 0.0153 0.1236 0.1988
RF 0.9029 0.1024 0.0218 0.1476 0.2373
DT 0.7475 0.1461 0.0566 0.2380 0.3825
MLP 0.9332 0.0785 0.0062 0.0785 0.1968
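Because RMSE is the square root of MSE and CVRMSE is RMSE divided by the mean observed load, the published rows can be cross-checked for internal consistency; a quick check on two rows (values copied from Table 7.1):

```python
import math

# (MSE, RMSE, CVRMSE) triples copied from Table 7.1.
rows = {
    "LGBM": (0.0150, 0.1224, 0.1968),
    "XGB":  (0.0328, 0.1811, 0.2911),
}

for name, (mse, rmse, cvrmse) in rows.items():
    # RMSE should equal sqrt(MSE) up to rounding in the published table.
    assert abs(math.sqrt(mse) - rmse) < 5e-3, name
    # CVRMSE = RMSE / mean(y), so RMSE / CVRMSE recovers the mean load.
    print(name, "implied mean load ~", round(rmse / cvrmse, 3))
```

The implied mean load comes out the same for both rows, as expected, since CVRMSE normalizes every model by the same observed series.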
From the R2 point of view, several models achieved high scores, greater than
0.9. More specifically, LGBM and MLP shared the best R2 (0.9332), with
GB being very close (0.9318). KR, BR, CB and RF also performed very well,
above 0.9, with scores of 0.9259, 0.9256, 0.9183 and 0.9029 respectively.
Finally, XGB and DT achieved lower R2 scores, below 0.9 (0.8538 and 0.7475).
As far as MAE is concerned, GB and LGBM had almost identical errors
(0.0730 and 0.0733 kWh), with MLP performing very close (0.0785 kWh).
Furthermore, KR, BR, and CB had MAE lower than 0.1 kWh (0.0809, 0.0809 and
0.0874), performing very well. Finally, RF, XGB, and DT had MAE greater than
0.1 kWh (0.1024, 0.1309 and 0.1461), scoring the highest errors.
Regarding MSE, RMSE and CVRMSE, MLP was the most efficient model,
scoring 0.0062, 0.0785 and 0.1968 respectively. LGBM was the 2nd most accurate,
reaching 0.0150, 0.1224, and 0.1968, with GB scoring almost the same: 0.0153,
0.1236, and 0.1988. Moreover, KR (0.0166, 0.1289, 0.2072), BR (0.0167, 0.1292,
0.2077) and CB (0.0183, 0.1354, 0.2177) also performed very well. Finally, as with
R2 and MAE, RF, XGB, and DT were outperformed, returning worse scores.
Overall, the findings revealed that LGBM, GB, and MLP were the most
accurate models. More specifically, all three models were almost comparable,
with the LGBM offering slightly better results for R2, MLP significantly better
scores for the squared error-based metrics (MSE, RMSE, CVRMSE) and the GB
providing better results for MAE.
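The chapter reports these metrics without their formulas. For readers who wish to reproduce the comparison, a minimal pure-Python sketch of the five metrics might look as follows; the function name is illustrative, and we assume, as is common in energy forecasting, that CVRMSE is the RMSE normalized by the mean observed load:

```python
import math

def regression_metrics(y_true, y_pred):
    """Evaluation metrics reported in Table 7.1 (illustrative sketch).

    Assumption: CVRMSE is taken as RMSE divided by the mean observed load,
    a common convention in energy forecasting.
    """
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n          # mean absolute error
    mse = sum(e * e for e in errors) / n           # mean squared error
    rmse = math.sqrt(mse)                          # root mean squared error
    mean_true = sum(y_true) / n
    cvrmse = rmse / mean_true                      # coefficient of variation of RMSE
    ss_res = sum(e * e for e in errors)
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1.0 - ss_res / ss_tot                     # coefficient of determination
    return {"R2": r2, "MAE": mae, "MSE": mse, "RMSE": rmse, "CVRMSE": cvrmse}
```

Applied to held-out daily load values and model predictions, such a function yields exactly the quantities tabulated above.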
7.6 Conclusions
This research analyzes a time series forecasting case by offering a unique
technique for one-step-ahead ELF that combines GDB technologies with ML
and DL. It provides a clear guideline for developing correct insights into energy
demand, and it also involves Neo4j for data integration, yielding findings that
can be compared against other state-of-the-art approaches.
Graph Databases in Smart City Applications – Using Neo4j... 153
The results and the detailed comparison indicated that the most accurate
models were LGBM, GB and MLP, with each one performing better on a
specific metric (LGBM for R2, GB for MAE and MLP for the squared
error-based metrics MSE, RMSE and CVRMSE).
7.7 Limitations
The following considerations account for this study's limitations.
Data granularity and quality are two common issues in this research, as in
any effort to make accurate forecasts. Data should be complete, up-to-date,
and widely available. Their quantity, while dependent on the kind of algorithm
used, may present a balancing challenge: whereas large data sets may be
advantageous for model training, time and space complexity might be a
significant constraint. Model overfitting and outlier detection are two
additional, related issues.
Ethical considerations
All information was gathered either from open sources or by subscription and
is free to use.
Acknowledgements
We would like to thank CERTH for the use of their open-access database.
Conflict of Interest
We do not have any conflicts of interest other than with staff working at the
International Hellenic University.
References
[1] Brugnara, M., M. Lissandrini and Y. Velegrakis. 2022. Graph databases for smart
cities. IEEE Smart Cities, University of Trento, Aalborg University.
[2] Mystakidis, A., N. Stasinos, A. Kousis, V. Sarlis, P. Koukaras, D. Rousidis, I.
Kotsiopoulos and C. Tjortjis. 2021. Predicting covid-19 icu needs using deep learning,
xgboost and random forest regression with the sliding window technique. IEEE Smart
Cities, July 2021 Newsletter.
[3] Mystakidis, A. and C. Tjortjis. 2020. Big Data Mining for Smart Cities: Predicting
Traffic Congestion using Classification. 2020 11th International Conference on
Information, Intelligence, Systems and Applications (IISA), pp. 1–8, doi: 10.1109/
IISA50023.2020.9284399.
[4] Christantonis, K. and C. Tjortjis. 2019. Data Mining for Smart Cities: Predicting
Electricity Consumption by Classification. 2019 10th International Conference on
Information, Intelligence, Systems and Applications (IISA), pp. 1–7, doi: 10.1109/
IISA.2019.8900731.
[5] Anagnostopoulos, T., A. Zaslavsky, K. Kolomvatsos, A. Medvedev, P. Amirian,
J. Morley and S. Hadjiefthymiades. 2017. Challenges and opportunities of waste
management in IoT-enabled smart cities: A survey. IEEE Transactions on Sustainable
Computing, 2(3): 275–289, 1 July-Sept. 2017, doi: 10.1109/TSUSC.2017.2691049
[6] Rousidis, D., P. Koukaras and C. Tjortjis. 2020. Social media prediction: A literature
review. Multimedia Tools and Applications, 79: 1–33. doi: 10.1007/s11042-019-
08291-9.
[7] Perçuku, A., D. Minkovska and L. Stoyanova. 2018. Big data and time series use in
short term load forecasting in power transmission system. Procedia Computer Science,
141: 167–174, ISSN 1877-0509, https://doi.org/10.1016/j.procs.2018.10.163.
[8] Huang, H., Z. Hong, H. Zhou, J. Wu and N. Jin. 2020. Knowledge graph construction
and application of power grid equipment. Math. Probl. Eng., vol. 2020, no. January
2018, doi: 10.1155/2020/8269082
[9] Agrawal, S. and A. Patel. 2016. A Study On Graph Storage Database Of NoSQL.
International Journal on Soft Computing, Artificial Intelligence and Applications
(IJSCAI), 5(1): 33–39.
[10] Patel, A. and J. Dharwa. 2017. Graph Data: The Next Frontier in Big Data Modeling
for Various Domains. Indian Journal of Science and Technology, 10(21): 1–7.
[11] Ma, S., J.J. Li, C. Hu, X.X. Lin and J. Huai. 2016. Big graph search: Challenges and
techniques. Frontiers of Computer Science, 10(3): 387–398.
[12] Singh, D.K. and R. Patgiri. 2016. Big Graph: Tools, Techniques, Issues, Challenges
and Future Directions. In: 6th Int. Conf. on Advances in Computing and Information
Technology (ACITY 2016), pp. 119–128. Chennai, India.
[13] Petkova, T. 2016. Why Graph Databases Make a Better Home for Interconnected
Data than the Relational Databases. (2016). https://ontotext.com/graph-databases-
interconected-data-relationaldatabases/
[14] Kaliyar, R. 2015. Graph Databases: A Survey. In: Int. Conf. on Computing,
Communication and Automation (ICCCA2015), pp. 785–790.
[15] Robinson, I., J. Webber and E. Eifrem. 2015. Graph Databases New Opportunities for
Connected Data, pp. 171–211.
[16] Zyglakis, L., S. Zikos, K. Kitsikoudis, A.D. Bintoudi, A.C. Tsolakis, D. Ioannidis and
D. Tzovaras. 2020. Greek Smart House Nanogrid Dataset (Version 1.0.0) [Data set].
Zenodo. https://doi.org/10.5281/zenodo.4246525
[17] Gholamy, A., V. Kreinovich and O. Kosheleva. 2018. Why 70/30 or 80/20 relation
between training and testing sets: a pedagogical explanation. Departmental Technical
Reports (CS). 1209. https://scholarworks.utep.edu/cs techrep/1209.
[18] Ramos, M., A. McNabola, P. López-Jiménez and M. Pérez-Sánchez. 2020. Smart
water management towards future water sustainable networks. Water, 12(1): 58.
https://doi.org/10.3390/w12010058
[19] Toma, C., A. Alexandru, M. Popa and A. Zamfiroiu. 2019. IoT solution for smart
cities’ pollution monitoring and the security challenges. Sensors, 19(15): 3401.
https://doi.org/10.3390/s19153401
[20] Lee, C., C. Lin and M. Chen. 2001. Sliding-Window Filtering: An Efficient Algorithm
for Incremental Mining. In: Proceedings of the Tenth International Conference on
Information and Knowledge Management. pp. 263–270.
[21] Alfares, H.K. and M. Nazeeruddin. 2002. Electric load forecasting: Literature survey
and classification of methods. Internat. J. Systems Sci., 33(1): 23–34.
[22] Salama, F., H. Rafat and M. El-Zawy. 2012. General-graph and inverse-graph. Applied
Mathematics, 3(4): 346–349. doi: 10.4236/am.2012.34053.
[23] Vukotic, A. and N. Watt. 2015. Neo4j in Action, pp. 1–79. ISBN: 9781617290763.
[24] https://neo4j.com/blog/neo4j-3-0-massive-scale-developer-productivity/. Accessed:
2018.05.14.
[25] Kantarci, B., K.G. Carr and C.D. Pearsall. 2017. SONATA: Social Network Assisted
Trustworthiness Assurance in Smart City Crowdsensing. In: The Internet of Things:
Breakthroughs in Research and Practice. pp. 278–299. Hershey, PA, IGI Global.
[26] Rodriguez, J.A., F.J. Fernadez and P. Arboleya. 2018. Study of the Architecture of a
Smart City. Proceedings, vol. 2, pp. 1–5.
[27] Bertot, J.C. and H. Choi. 2013. Big data and e-government: Issues, policies, and
recommendations. In: Proceedings of the 14th Annual International Conference on
Digital Government Research, pp. 1–10. ACM, New York.
[28] West, D.M. 2012. Big Data for Education: Data Mining, Data Analytics, and Web
Dashboards. Gov. Stud. Brook. US Reuters.
[29] U.S. Department of Energy. 2019. Report on Smart Grid/Department of Energy.
Available at https://www.energy.gov/oe/articles/2018-smart-grid-system-report,
Retrieved Sep. 29.
[30] Christantonis, K., C. Tjortjis, A. Manos, D. Filippidou and E. Christelis. 2020. Smart
cities data classification for electricity consumption & traffic prediction. Automatics
& Software Enginery, 31(1): 49–69.
[31] International Energy Agency C. 2020. Electricity information overview (2020) URL:
https://www.iea.org/reports/electricity-information-overview
[32] Alkhathami, M. 2015. Introduction to electric load forecasting methods. J Adv Electr
Comput Eng, 2(1): 1–12.
[33] Koukaras, P., N. Bezas, P. Gkaidatzis, D. Ioannidis, D. Tzovaras and
C. Tjortjis. 2021. Introducing a novel approach in one-step ahead energy load
forecasting. Sustainable Computing: Informatics and Systems, 32: 100616.
[34] Ahmad, A., N. Javaid, A. Mateen, M. Awais and Z.A. Khan. 2019. Short-term load
forecasting in smart grids: An intelligent modular approach. Energies, 12(1): 164,
10.3390/en12010164.
[35] Zhang, J., Y.M. Wei, D. Li, Z. Tan and J. Zhou. 2018. Short term electricity
load forecasting using a hybrid model. Energy, 158: 774–781, 10.1016/j.
energy.2018.06.012.
[36] Kuo, P.-H. and C.-J. Huang. 2018. A high precision artificial neural networks model
for short-term energy load forecasting. Energies, 11(1): 213, 10.3390/en11010213.
[37] Raza, M.Q. and A. Khosravi. 2015. A review on artificial intelligence-based load
demand forecasting techniques for smart grid and buildings. Renew. Sustain. Energy
Rev., 50: 1352–1372, 10.1016/j.rser.2015.04.065.
[38] Singh, A., K. Ibraheem, S. Khatoon, M. Muazzam and D.K. Chaturvedi. 2012. Load
forecasting techniques and methodologies: A review. In: 2012 2nd International
Conference on Power, Control and Embedded Systems. pp. 1–10.
[39] Fu, G. 2018. Deep belief network-based ensemble approach for cooling load
forecasting of air-conditioning system. Energy, 148: 269–282.
[40] Amasyali, K. and N.M. El-Gohary. 2018. A review of data-driven building energy
consumption prediction studies. Renew. Sustain. Energy Rev., 81.
[41] Moon, J., Z. Shin, S. Rho and E. Hwang. 2021. A comparative analysis of tree-
based models for day-ahead solar irradiance forecasting. 2021 International
Conference on Platform Technology and Service (PlatCon), pp. 1–6, doi: 10.1109/
PlatCon53246.2021.9680748.
[42] Khan, R.A., C.L. Dewangan, S.C. Srivastava and S. Chakrabarti. 2018. Short term
load forecasting using SVM models. Power India Int. Conf., 8.
[43] Hao, H. and F. Magoules. 2012. A review on the prediction of building energy
consumption. Renew. Sustain. Energy Rev., 16(6): 3586–3592, 10.1016/j.
rser.2012.02.049
[44] Lee, S.K. and S. Jin. 2006. Decision tree approaches for zero-inflated count data.
Journal of Applied Statistics, 33(8): 853–865.
[45] Mystakidis, A., E. Ntozi, K. Afentoulis, P. Koukaras, P. Gkaidatzis, D. Ioannidis, C.
Tjortjis and D. Tzovaras. 2023. Energy generation forecasting: Elevating performance
with machine and deep learning. Computing, https://doi.org/10.1007/s00607-023-0
164-y
[46] Mystakidis, A., E. Ntozi, K. Afentoulis, P. Koukaras, G. Giannopoulos, N. Bezas, P.A.
Gkaidatzis, D. Ioannidis, C. Tjortjis and D. Tzovaras. 2022. One step ahead energy
load forecasting: A multi-model approach utilizing machine and deep learning. In:
2022 57th International Universities Power Engineering Conference (UPEC), pp.
1–6, https://doi.org/10.1109/UPEC55022.2022.9917790
CHAPTER 8
A Graph-Based Data Model for Digital Health Applications
The era of big data constantly introduces promising benefits for the healthcare
industry, but simultaneously raises challenging problems in data processing
and management. The flood of automated systems in our smart cities generates
large volumes of data to be integrated for valuable information, which can also
be applied in the healthcare context. Medical databases offer the capabilities
to address these issues and to integrate, manage and analyze the data for
gaining deep insights. This holds especially for modern graph databases, which
support the discovery of complex relationships in the ever-growing networks of
heterogeneous medical data. Traditional relational databases, in contrast, are
usually not capable of representing data networks in tables or revealing such
relationships, but are still widely used as data management systems. This chapter
discusses a methodology for transferring a relational database to a graph
database by mapping the relational schema to a graph schema. To this end, a
relational schema graph is constructed for the relational database and
transformed in multiple steps. The approach is demonstrated for the example of
a graph-based medical information system using a dashboard on top of a Neo4j
database system to visualize, explore and analyse the stored data.
8.1 Introduction
Smart Cities have the aim of improving living conditions for their citizens by
means of a variety of digitized service offerings and a large-scale analysis of
citizen-related data. A fundamental pillar for smart cities is the area of digital
health due to the following reasons:
• Health and well-being of citizens are the major objectives of smart cities.
Collecting the personalized health data of citizens is hence the foundation for
analyzing living conditions or identifying adverse environments in cities.
• Health care is a main economic factor in cities and digital health can improve
the timely delivery of health care services, as well as facilitate, optimize and
modernize health care management procedures. In this sense, digital health
can also relieve the burden involved with an aging society as well as shortages
of qualified healthcare professionals.
• Availability of easy-to-use, secure and reliable digital health and telemedicine
applications can improve the diagnosis and treatment of patients, in particular
in regions where comprehensive health services are not regularly available or
where patients have difficulties reaching the medical service center.
• Hospitals, as major units in smart cities, rely heavily on medical information
systems in order to manage everyday processes.
• Wearables (like smart watches) and smartphones are nowadays widely
used to collect health-related data on an individualized basis, leading to a
sharp increase in health-related data that can be applied to other aspects of
smart cities (for example, analyzing smart mobility aspects by assessing the
usage of bikes by individuals).
From a data management perspective, digital health systems provide
the potential to integrate information and communication in healthcare more
efficiently with the goal of sustainable benefits for all individuals.
8.1.1 Contributions
We propose a methodology for transforming a relational database schema into
a graph data schema. The proposed method is general, intuitive and easy to
apply. It constructs a schema graph from the relational schema and transforms
it step-by-step into a graph model to be applied to a graph database management
system. As an example, a medical application (and an implementation based on
Neo4j and NeoDash – a Neo4j Dashboard Builder [5]) for storing and visualizing
electronic health records, diagnostics and biomedical data is discussed. This
example illustrates the different transformation steps of the schema graph from
the relational to the graph database schema in a real-world application.
8.2 Background
We can define digital health to span a variety of applications and information
systems that collect individualized person-related data on a large scale. Those
information systems need a flexible data management solution in the backend that
can incorporate different kinds of input data (formats) and at the same time provide
an easily extensible data model that also allows for a user-friendly visualisation
of information. Moreover, an optimized (graph-based) data structure can build a
reliable and efficient backbone for an AI-based evaluation of large digital health
data sets.
As compared to tabular data storage, graph structures may often be better
suited for medical data management. Looking in more detail at biomedical data
one can observe that many of these data are inherently graph-structured; an
overview of graph databases in the biomedical domain is provided in [10]. These
graph structures manifest at the low level of biochemistry as well as the high level
of disease and patient management:
• For example, on the low level of biochemical processes, gene regulation
can be expressed in a network where different components interact: proteins
interact with other proteins [8] and genes interact with genes [9]; such networks
also underpin the analysis and visualisation of dynamic gene regulatory
networks [12].
• On the high level of electronic health records, the data of individual patients
can be linked via their symptoms to other patients with similar symptoms or
to supporting documentation on similar diseases [6].
• Unified vocabularies (thesauri, ontologies, taxonomies), that are also graph-
shaped, have to be involved to enable matchings beyond simple string
equality: semantic similarity between terms (for example, disease categories
like “ulna fracture” and “broken arm”) will produce more relevant and
accurate cohorts [2].
• Moreover, during the run of this project we fuse different notions of similarity
for patients (different similarities for different data types) into one global
similarity in a so-called similarity fusion network [11], which is again a graph
structure.
• Another graph approach is the one of disease networks [1, 3]; by observing
deviations of a disease network from a healthy state, diseases and their causes
can be identified.
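The fusion of per-data-type similarities mentioned above can be conveyed with a toy sketch. This is not the actual similarity fusion network method of [11], which fuses the networks iteratively; a simple weighted average of patient-by-patient similarity matrices only illustrates the basic idea of merging several views into one global similarity:

```python
def fuse_similarities(sim_matrices, weights=None):
    """Naive similarity fusion (illustrative sketch only).

    Combines several n-by-n patient similarity matrices, one per data type,
    into one global matrix by a weighted average. Equal weights by default.
    """
    k = len(sim_matrices)
    if weights is None:
        weights = [1.0 / k] * k
    n = len(sim_matrices[0])
    fused = [[0.0] * n for _ in range(n)]
    for w, sim in zip(weights, sim_matrices):
        for i in range(n):
            for j in range(n):
                fused[i][j] += w * sim[i][j]
    return fused
```

For two patients whose genetic similarity is 0.2 and whose clinical similarity is 0.6, the equally weighted global similarity would be 0.4; the cited fusion method instead propagates similarities through the networks before combining them.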
7. Replace sources ns (i.e. nodes without incoming edges) that are connected
to exactly two entity nodes nE1 and nE2, and their two edges (ns, nE1) and
(ns, nE2), by an undirected edge e = {nE1, nE2} connecting nE1 and nE2 directly.
• Case 1: If ns has no other edges, no further actions are required.
• Case 2: If ns has an edge to a (merged) sink nq, add the attribute(s)
represented by nq as properties to e and remove nq, too.
• Case 3: If ns has an edge to a hub nh with only one other edge to an entity
node nE containing only one additional attribute a next to the identifying
attribute(s), add a as a property to e and remove nh and nE from the schema
graph.
If none of the above cases is applicable, no merge is performed, but the
directed edges (ns, nE) to any entity node nE are transformed into undirected ones.
8. Resolve FK relations by edges:
• Case 1: Replace FK relations indicated by hubs nh with one incoming edge
(nE1, nh) from an entity node nE1 and one outgoing edge (nh, nE2) to an
entity node nE2 by an undirected edge {nE1, nE2}.
• Case 2: If the FK relation is a source ns labeled “Ri.a” with outgoing edges
to an entity node nE and to a (merged) sink nq, first merge ns and nq (with
all attributes except the FK attribute) into an entity node nE′ labeled “Ri”.
Then connect the entity nodes nE and nE′ with an undirected edge {nE, nE′}.
9. Transform any node n, which is not an entity yet, into an entity node, and each
directed edge (na, nb) into an undirected edge {na, nb} between the same
nodes na and nb.
edge e = (na, nb) to the corresponding PK node nb of Rj(Xj). It should be noted
that we do not consider composite FKs here, as they can be handled analogously
to the composite PKs by collapsing the respective attribute nodes into a single FK
node. The introduced edges mirror the logical connections between the entities
and are formed based on the join conditions.
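As a rough illustration of this construction, the following sketch builds such a schema graph from a toy relational schema. The data structures and naming (e.g. "Ri.PK" for collapsed composite keys) are illustrative assumptions, not the chapter's implementation:

```python
def build_schema_graph(relations, fks):
    """Construct a directed relational schema graph (illustrative sketch).

    `relations` maps a relation name to (pk_attrs, non_pk_attrs);
    `fks` maps an FK attribute node label "Ri.a" to the referenced relation Rj.
    Nodes are labeled "Relation.attribute"; a composite PK collapses into
    a single node "Relation.PK".
    """
    nodes, edges = set(), set()
    for rel, (pk_attrs, attrs) in relations.items():
        # Composite PKs are collapsed into one PK node.
        pk_node = f"{rel}.PK" if len(pk_attrs) > 1 else f"{rel}.{pk_attrs[0]}"
        nodes.add(pk_node)
        for a in attrs:
            attr_node = f"{rel}.{a}"
            nodes.add(attr_node)
            edges.add((pk_node, attr_node))      # PK node -> attribute node
    for fk_attr, referenced in fks.items():
        pk_attrs, _ = relations[referenced]
        target = (f"{referenced}.PK" if len(pk_attrs) > 1
                  else f"{referenced}.{pk_attrs[0]}")
        edges.add((fk_attr, target))             # FK attribute node -> PK node
    return nodes, edges
```

For instance, an Order relation with an FK attribute patient_id referencing Patient would yield a directed edge from "Order.patient_id" to "Patient.id", mirroring the join condition.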
(a) Relations with PKs denoted by underlines and arrows indicating FK
dependencies. (b) Constructed schema graph RG for the relations in (a) with PK
nodes (purple), FK attribute nodes (yellow) and FK relationships (dashed arrows).
Figure 8.1. Schema graph construction example.
8.3.3 Schema graph transformation
Furthermore, the schema graph RG is compressed by merging the sinks ns ∈ N
(i.e. nodes without any outgoing edges) that are all connected to the same PK
node np and only have one incoming edge (np, ns). The respective sink nodes ns
are merged into one node labeled “Ri.attributes”. This aggregation of
the non-key attribute nodes is the first step towards modeling entities in the design
of the final graph database schema obtained from RG. To create the first entities,
all hubs nh ∈ N of some attribute a ∈ Xi, that are PK nodes of Ri with only one
outgoing edge (nh, ns) to a (merged) sink ns representing attribute(s) b ∈ Xj, are
merged with the corresponding sink ns into one new node n labeled Ri. The new
node n then has the set of attributes {a, b} (or {a} ∪ b if b denotes a set of
attributes from a merged sink ns). This ensures that the constructed node n acts as
an entity modeling the relation Ri together with the entity-specific attributes and,
hence, the graph data model becomes more intuitive and less inflated.
The next step is to resolve associative relations Rw that model the one- or
many-to-many relationships between two other relations Ri and Rj. To this end,
source nodes ns with two outgoing edges e1 = (ns, nE1) and e2 = (ns, nE2) to two
entity nodes nE1 and nE2 are processed. These entity nodes make up the
fundamental units, i.e. the real-world entities, and have a label and attributes.
The node ns and the edges e1 and e2 are removed from RG and all the information
(attributes) of ns is incorporated into an undirected edge e = {nE1, nE2} between
nE1 and nE2. If e1 and e2 were the only two edges attached to ns, no further
processing is required (case 7.1). In the case of one additional outgoing edge
(ns, nq) to a (merged) sink nq, the attribute set represented by nq is added as the
attribute set of the new undirected edge e (case 7.2). Also, if ns has an outgoing
edge (ns, nh) to a hub node nh with only one other edge (nh, nE) to an entity node
nE, where nE only contains a single additional attribute b next to the identifying
attribute(s) a, b is added as a property to e and nh and nE are removed from RG
(case 7.3). However, if there are more edges originating from ns than described
above, or if the additionally connected hub nh from the latter case does not match
the constraints, i.e., none of the cases apply, no replacement of ns, e1 and e2 is
possible. In this case, ns becomes an entity as well and the edges e1 and e2 are
made undirected.
As a further processing step towards the final graph data model, the remaining
FK relations in the notion of the relational schema are also transformed into
edges modeling relationships between the corresponding entities. The FK relations
are either indicated by hubs nh with exactly one incoming edge (nE1, nh) from
an entity node nE1 and one outgoing edge (nh, nE2) to an entity node nE2, or by
sources ns linking via (ns, nE) to an entity node nE and via (ns, nq) to a (merged)
sink nq. In the first case, such a hub nh (and its edges) is simply replaced by an
undirected edge e = {nE1, nE2} directly connecting nE1 and nE2. The sources ns
(case 8.2) are treated differently: before an undirected edge between the entity
node nE and the next entity node nE′ can be established, a second entity node has
to be created first. This entity results from merging the source ns with the (merged)
sink nq into an entity node nE′ with the attributes from ns and nq. The foreign
identifier from the attributes of ns is omitted in this merge, as the new edge
incorporates this relation inherently. With the new entity node nE′, the relationship
in terms of an edge {nE, nE′} between the nodes nE and nE′ is established.
To finalize the graph data model, the remaining non-entity nodes are
transformed into entity nodes, as the relationships cannot be resolved further. Any
directed edge (na, nb) in RG between any two nodes na and nb is replaced by an
undirected edge {na, nb}. The final graph data model is then captured by the fully
transformed schema graph RG, which now consists of only relationships (edges)
and entities (nodes) with properties and labels.
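The simplest of these transformations, step 7 case 1 (replacing an associative source node by a direct edge between two entities), can be sketched as follows. The representation choices (directed edges as pairs, undirected edges as frozensets) are illustrative assumptions, not the chapter's implementation:

```python
def resolve_associative_sources(nodes, edges, entity_nodes):
    """Step 7, case 1 (illustrative sketch): replace each source node that
    connects exactly two entity nodes by a direct undirected edge between
    those entities.

    `edges` holds directed edges as (a, b) pairs; the returned undirected
    edges are frozensets of their two endpoints.
    """
    undirected = set()
    for ns in list(nodes):
        incoming = [(a, b) for a, b in edges if b == ns]
        outgoing = [(a, b) for a, b in edges if a == ns]
        targets = [b for _, b in outgoing]
        # A source has no incoming edges; case 1 requires exactly two
        # outgoing edges, both ending in entity nodes.
        if (not incoming and len(outgoing) == 2
                and all(t in entity_nodes for t in targets)):
            nodes.remove(ns)
            for e in outgoing:
                edges.remove(e)
            undirected.add(frozenset(targets))
    return nodes, edges, undirected
```

Applied to the medical schema later in the chapter, a source like the PK node of an associative DiagnosisPatient relation would disappear in favor of a direct Patient–Diagnosis edge; cases 2 and 3 would additionally move the attributes of the removed nodes onto that edge.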
(a) Transformed schema graph with entity nodes (magenta), PK nodes (purple),
FK nodes (yellow) and FK relationships (dashed arrows). (b) Final schema graph
consisting of connected entities A, B and D.
Figure 8.2. Schema graph transformation example.
(a) Ternary associative relationship with PKs denoted by underlines and arrows
indicating FKs. (b) Transformed schema graph for the ternary associative
relationship between R, S and T.
Figure 8.3. Schema graph transformation counter-example: n-ary associative relationship.
Figure 8.4. Relational schema of the medical database. The tables represent the relations
(PK bold) with attribute names and types. The lines between the relations indicate FK
dependencies.
diagnostic purposes, for instance, relate the patients and their materials to the
important, decisive analytical results, which in turn can be used by doctors to
personalize the treatment or by researchers to gain knowledge.
The application of the proposed schema transformation approach leads to
a first relational schema graph RG = 〈N, E〉 with (merged) attribute nodes, PK
nodes, and PK-to-attribute & FK-to-PK edges. It is obtained by applying the
transformation steps 1-5. For the sake of better visualisation, the schema graph
is split into two parts RG1 ∪ RG2 (Figures 8.5 (a) and 8.5 (b)). The PK nodes
of the graph are visualised with a thick outline and source and sink nodes are
colored green and red, respectively. The arrows between the nodes indicate the
directed edges for PK-attribute or FK-PK connections according to the schema
graph construction rules. The first partition RG1 shows the schema graph of
the Patient, Family, Project, Diagnosis and DiagnosisAddition
relations, which all connect via associative relations. The second partition RG2
consists of the other relations involving the patient material and the analytic
process. Both partitions RG1 and RG2 share the PK node of the Patient relation
labeled “Patient.id”, which would be the central node in a non-partitioned
illustration, and the corresponding attribute node “Patient.attributes”.
The further transformations of the nodes N and edges E towards the final
graph data schema create the first entities merging the PK hub nodes (step 6)
with the corresponding (merged) attribute nodes. This leads to the entities, which
aggregate the entity-specific properties for patients, projects, families, diagnoses
(a) Schema graph partition RG1. (b) Schema graph partition RG2.
Figure 8.5. Partitioned schema graph RG = RG1 ∪ RG2 with PK nodes (thick outline),
sink nodes (green) and source nodes (red) highlighted.
(additions), orders, analyses (masters), results and materials as shown in Figure
8.6 (left). Then, the source nodes connecting to two entity nodes are replaced
by a direct edge between the entity nodes to model the relationship in the graph
data schema, i.e., the relationships “InProject” and “InFamily” are added. It is
possible according to transformation step 7.3 to replace the source node labeled
“DiagnosisPatient.PK”, as well as the attribute node and the FK dependency to
the “DiagnosisAddition” entity, by an edge labeled “HasDiagnosis”. This edge
also carries the attribute diagnosis_date of the former DiagnosisPatient table
and the attribute DiagnosisAddition.description, as this is the only non-PK
attribute and, thus, can be compressed into the new relationship. This leads to
the final schema graph partition RG1 as depicted on the right side of Figure 8.6.
Figure 8.6. Transformation of RG1: The left side shows a snapshot of the graph after
creating entity nodes by merging hubs. The right side displays RG1 after additionally
applying the transformation steps 7-9.
The same procedure as for RG1 is also carried out for the partition RG2.
Originating from the version shown in Figure 8.5 (b), the entities are again
generated by merging hubs and their (merged) attribute nodes. This yields the
entities of patients (the same entity as in RG1), orders, analyses (master), results
and materials (numbers) (left side in Figure 8.7). For these entities, relationships are
established by replacing the sources labeled “OrderPatient.PK”,
“MaterialPatient.PK” and “ResultAnalysis.PK” by edges (steps 7.1 and 7.2)
labeled “HasOrder”, “HasMaterial” and “HasResult”, respectively. After
introducing these edges, the
updated schema graph partition RG2 (right side in Figure 8.7) still contains several
FK dependencies and non-entity nodes, e.g., the FK attribute node “Analysis.
order id” connecting the performed analyses and their requests, that are about to
be resolved.
Figure 8.7. Transformation of RG2: The left side shows the creation of the entities
according to transformation step 6. The result of replacing hubs by edges to connect
entities (step 7) is visualised on the right side.
Figure 8.8. Transformation of RG2 continued: The final schema after the application of
the last transformations (steps 8 and 9) is illustrated on the left side. The right side depicts
some manual, use case-specific adaptations of the final schema.
the new entity and the analysis entity. Hence, the second partition of RG is
composed as demonstrated by the graph on the left side of Figure 8.8.
Dashboard builder
A dashboard is then built with NeoDash, a dashboard builder for Neo4j databases.
This tool uses straightforward Cypher queries entered in a built-in editor to extract
information from the data graph for visualization in different charts, e.g., bar or
graph charts. The dashboard itself is composed of two main components, pages
and reports. Each dashboard contains one or more pages that enable a conceptual
separation of contents. The separation supports different perspectives of the
same data. Each of the pages contains multiple reports that render a chunk of
information obtained as the result of a single Cypher query against the database.
For example, the report can show a distribution over the data in a bar chart or
plot a subgraph of the data. The dashboard implemented in a NeoDash instance is
deployed as a user-friendly web application.
We implemented three pages in our dashboard that visualise information on
different levels of granularity. One page deals with the full cohort of pediatric
ALL cases and gives a summarising overview of the cohort and its statistics. The
second page concentrates on the analysis of a subgroup of patients based on the
selection of a certain genetic fusion, and allows the user to explore the subgroup’s
relationships and information relevant for leukemia research, such as karyotype or
aneuploidy. The last page visualises the characteristics and healthcare data
for an individual target patient and suggests similar patients. In the following
subsections, we demonstrate the dashboard pages as implemented in the public
demo version of Graph4Med1. The demo tool uses random and synthetic patient
data generated with Synthea2 for the sake of privacy.
Cohort page
This dashboard page gives the user a broad overview of the cohort of the extracted
ALL patients (Figure 8.9). The first information is the size of the demo cohort
given by the number at the top left report that is fed by a simple query returning
the count of patient nodes. The page comprises a report rendering the distribution
of the age of the patients in a bar plot which is additionally grouped by patients’
1 http://graph4med.cs.uni-frankfurt.de/
2 https://synthetichealth.github.io/synthea/
A Graph-Based Data Model for Digital Health Applications 171
Figure 8.9. Cohort page with summarizing statistics on the patient cohort: The cohort
size (top left), age distribution (top right), age at ALL-diagnosis distribution (middle) and
mostfrequent non-ALL diagnoses (bottom). The colors of the bars indicate male (M,
orange) and female (F, green) patients.
gender (top right). The grouping is indicated by the stacked colors of each of the
bars and the gender attribute is only binary for the sake of simplicity. The second
bar plot in the middle of the page represents the gender-grouped distribution of
the age at which the ALL diagnosis was made. The last bar plot on the page’s
bottom illustrates the gender-grouped distribution of the most frequent diagnoses
next to ALL throughout the cohort, e.g., AML (Acute Myeloid Leukemia) or
(non)-pediatric MDS (Myelodysplastic Syndrome).
As the Cypher queries are written down in the built-in query editor, it is
possible with only minor effort to extend these queries, such that the plots can
be interactively changed by the user. For example, the user could then choose
whether to display the absolute or relative frequencies for each of the age values
or alternative groupings could be seamlessly applied and displayed, e.g., grouping
by minimal residual disease instead of gender.
MATCH (n:Patient)
WITH n, duration.between(date(n.dob), date()) AS age
RETURN age.years AS AgeInYears,
       n.gender AS Gender,
       COUNT(DISTINCT n) AS Count
ORDER BY AgeInYears
This Cypher query populates the “Patient age distribution” report (Figure
8.9) with patient counts grouped implicitly by age and gender. The dashboard
can then simply render the bar chart with the data returned from the query. For
example, the user can switch from the absolute to the relative frequency by adding
MATCH (:Patient)
WITH COUNT(*) AS cnt
...
RETURN ..., 1.0 * COUNT(DISTINCT n)/cnt AS Frequency
ORDER BY age.years
to the above query. The SQL equivalent of the unmodified query is very
similar for this use case:
SELECT datediff(year, p.dob, getdate()) AS AgeInYears,
p.gender AS Gender,
COUNT(DISTINCT p.id) AS Count
FROM Patient p
GROUP BY datediff(year, p.dob, getdate()),
p.gender
ORDER BY AgeInYears
In contrast to the Cypher query, the grouping for the age and gender of the
patients has to be declared explicitly in SQL. The explicit grouping and the double
function call of datediff(year, p.dob, getdate()) for computing the age
in years in SELECT and GROUP BY make the SQL query less comprehensible.
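The grouping logic shared by both queries can be sketched in plain Python. This is only an illustrative, in-memory stand-in: the patient records, field names and dates below are hypothetical and do not come from the Graph4Med database.

```python
from collections import Counter
from datetime import date

# Hypothetical in-memory stand-in for the Patient nodes/rows.
patients = [
    {"dob": date(2010, 5, 1), "gender": "M"},
    {"dob": date(2010, 8, 20), "gender": "F"},
    {"dob": date(2015, 3, 2), "gender": "F"},
]

def age_in_years(dob, today=None):
    """Age in whole years, analogous to duration.between(...) / datediff(...)."""
    today = today or date.today()
    # Subtract one year if the birthday has not yet occurred this year.
    return today.year - dob.year - ((today.month, today.day) < (dob.month, dob.day))

# Counts grouped by (age, gender), analogous to the implicit grouping
# over the non-aggregated items in the Cypher RETURN clause.
counts = Counter((age_in_years(p["dob"]), p["gender"]) for p in patients)

# Relative frequencies, analogous to the extended query that divides
# each group count by the total cohort size.
total = len(patients)
freqs = {key: cnt / total for key, cnt in counts.items()}
```

The Cypher version groups implicitly by every non-aggregated return item, which is why no explicit `GROUP BY` clause appears there, whereas both the Python sketch and the SQL query must spell the grouping keys out.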
Subgroup page
The dashboard page for analysing subgroups of patients is shown in Figure 8.10.
The top row of reports gives an overview of the whole cohort with statistics on
the number of detected fusions per patient (top left) and the ten most frequently
detected fusions plus the frequency of occurrence of hyper-/hypodiploidy (top
mid). To ease the selection of a fusion (mid left) by the user, the top right report
gives a table of fusions and their alternative names. Upon the selection of a
fusion, the affected reports are reloaded and the Cypher queries are re-evaluated
under the instantiation of the included variable with the selected fusion value.
For the subgroup of patients, i.e., those in which the selected fusion was detected,
the dashboard page contains age distribution plots (center and mid right), an
interactive subgraph revealing the relationships between patients and fusions
(bottom left), as well as a tabular overview with analytical information (bottom
right), e.g., chromosomes or aneuploidy type.
Figure 8.10. Subgroup page showing analytic information on a subgroup of patients based
on a selected fusion. The top row shows summarising statistics for the whole cohort and
a table with (groups of) fusions and their alternative names. Age distributions (middle row)
as well as a graph-based (bottom left) and a tabular overview (bottom right) are shown for
the subgroup of patients with the selected fusion (middle left).
MATCH (p:Patient)-[:HasOrder]-(:Order)-[:HasAnalysis]
-(:Analysis)-[:HasFusion]-(f:Fusion),
(p)-[d:HasDiagnosis]-(diag:Diagnosis)
WHERE f.name = $neodash_fusion_name
AND diag.name = "ALL"
WITH p, f, d, diag
OPTIONAL MATCH (p)-[:HasOrder]-(:Order)-[:HasAnalysis]-
(a2:ArrayCGHAnalysis)
The Cypher query above fetches the data for the tabular report summarising
the analytical information for the patients of the subgroup (the projection is left
out for the sake of simplicity). The variable neodash_fusion_name is
the parameter for the selection of a fusion by the user and is replaced by the
selected fusion name via the selection report shown at the left of the middle row
of Figure 8.10.
SELECT ...
FROM Patient p, OrderPatient, Order o, Analysis a,
DynamicField df, DiagnosisPatient dp, Diagnosis diag
LEFT JOIN OrderPatient op2 ON p.id = op2.patient_id
LEFT JOIN Analysis a2 ON op2.order_id = a2.order_id
... joins aneuploidy/karyotype information ...
WHERE ... join conditions ...
AND df.x = neodash_fusion_name
AND diag.name = 'ALL'
AND ... selection analysis types ...
The sketched SQL query to obtain the same results as the previous Cypher
query is relatively long, due to the explicit listing of the join conditions of all
the involved tables. The join conditions to link the relations were omitted here
for better readability. Due to the associative relations like OrderPatient or
DiagnosisPatient, the joining conditions inflate the query structure massively.
This example query demonstrates the strength of the Cypher query language
with the much more intuitive and concise path notation, instead of the lengthy
SQL statements involving multiple associative relations and the corresponding
join conditions.
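Why path patterns stay concise can be illustrated with a toy traversal over an in-memory adjacency list. The graph below is hypothetical (it only mimics the Patient-Order-Analysis-Fusion path, not the actual Graph4Med schema or data), but it shows how following relationships by type needs no explicit join conditions:

```python
# Hypothetical adjacency list standing in for the graph database.
# Nodes are (label, id) pairs; edges carry a relationship type.
graph = {
    ("Patient", "p1"): [("HasOrder", ("Order", "o1"))],
    ("Order", "o1"): [("HasAnalysis", ("Analysis", "a1"))],
    ("Analysis", "a1"): [("HasFusion", ("Fusion", "BCR::ABL1"))],
    ("Fusion", "BCR::ABL1"): [],
}

def follow(node, *rel_types):
    """Follow a chain of relationship types from a start node and return
    all reachable end nodes, analogous to a Cypher path pattern."""
    frontier = [node]
    for rel in rel_types:
        frontier = [dst for n in frontier
                    for (r, dst) in graph.get(n, []) if r == rel]
    return frontier

fusions = follow(("Patient", "p1"), "HasOrder", "HasAnalysis", "HasFusion")
# fusions == [("Fusion", "BCR::ABL1")]
```

Each hop is resolved by the stored adjacency itself, which mirrors how a graph database matches a path pattern; a relational engine instead has to reconstruct every hop through a join over an associative table.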
The modeling of the analysis subtypes by the introduced sublabels in the graph
data model also makes it more convenient to formulate queries where it is directly
visible whether the query targets single or multiple analysis types, e.g., (p)-
[:HasOrder]-(:Order)-[:HasAnalysis]-(a2:ArrayCGHAnalysis).
The SQL query, in contrast, needs to include (based on the relational schema)
selections for the corresponding analysis type for each of the joined Analysis table
instances (also omitted here).
Patient page
The last page of the Graph4Med dashboard enables the user to navigate the
individual case of a selected patient. From investigations on a subgroup of
patients with a certain fusion, the user might find interesting cases and explore
them separately on this dashboard page. The first report at the top left, as shown
in Figure 8.11, performs the selection of a patient by id. With the selection of
an individual, the dashboard shows a tabular overview of the patient's diagnostic
information in the form of the executed analyses and their results (top right). A
subgraph comprising the patient’s data in a comprehensive and interactive report
is also generated (bottom left).
For further exploration, the similarity between the target patient and all other
patients is calculated and the most similar ones are shown in a graph (bottom
right). The similarity is computed as the Jaccard similarity
J(A, B) = |A ∩ B| / |A ∪ B|
between the set of fusions, aneuploidy type and diagnoses of the target patient
and the other patients. The color and thickness of the “SimilarTo” relationships
pointing to the other patients mirror the similarity, i.e., thick and dark green
arrows correspond to a high similarity whereas thin and light green ones refer to
a smaller degree of similarity.
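The similarity computation can be sketched as follows. The feature sets below are hypothetical examples of the fusions, aneuploidy type and diagnoses attributed to two patients; the actual dashboard evaluates this inside its Cypher queries.

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard similarity |A ∩ B| / |A ∪ B| between two feature sets."""
    if not a and not b:
        return 0.0  # convention here: two empty profiles count as dissimilar
    return len(a & b) / len(a | b)

# Hypothetical feature sets: fusions, aneuploidy type and diagnoses.
target = {"ETV6::RUNX1", "hyperdiploid", "ALL"}
other = {"ETV6::RUNX1", "ALL", "AML"}

similarity = jaccard(target, other)  # 2 shared / 4 total = 0.5
```

A score near 1 would be rendered as a thick, dark green "SimilarTo" arrow, while a score near 0 would appear thin and light green.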
Figure 8.11. Patient dashboard page displaying information about the individual case of a
single patient. Upon selection (top left), a table with analysis data (top right) and a graph
with various data related to the target patient (bottom left), e.g., materials, analyses and
diagnoses, are created. Similar patients are linked in the graph on the bottom right, where
color and thickness of the “SimilarTo” relationship indicate the degree of similarity.
MATCH (p:Patient)
WHERE p.patient_id = target_patient_id
OPTIONAL MATCH (p)-[ho:HasOrder]-(o)-[ha:HasAnalysis]-(a)
OPTIONAL MATCH (a)-[hr:HasResult]-(r:Result)
...
RETURN *
SELECT *
FROM Patient p
JOIN OrderPatient op ON p.id = op.patient_id
JOIN Order o ON op.order_id = o.id
LEFT JOIN Analysis a ON op.order_id = a.order_id
LEFT JOIN ResultAnalysis ra ON a.id = ra.analysis_id
LEFT JOIN Result r ON ra.result_id = r.id
...
WHERE p.id = target_patient_id
The equivalent SQL query is again much longer due to the associative
relations instead of the path expressions and, thus, less comprehensible. These
two queries also demonstrate the strength of the projection in a graph data model
in comparison to the projections in a relational data model. The SQL query returns
a table with many columns, potentially providing so much information that it
becomes challenging for the user to grasp it all when viewing
the records in the table. The Cypher query, in contrast, returns a subgraph
indicating the relationships between the patient and the other entities, such as
samples or diagnoses. The subgraph can be inspected more intuitively and lets
the user get a better understanding of the relationships that would be hidden in
tabular data.
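The contrast between the two result shapes can be made concrete with a toy example (the identifiers and values are hypothetical): a relational projection repeats an entity's attributes on every joined row, whereas a subgraph projection returns each entity once and makes the relationships explicit.

```python
# Hypothetical flat result of the SQL join: the patient repeats on every row.
rows = [
    {"patient": "p1", "order": "o1", "analysis": "a1", "result": "r1"},
    {"patient": "p1", "order": "o1", "analysis": "a1", "result": "r2"},
    {"patient": "p1", "order": "o1", "analysis": "a2", "result": "r3"},
]

# Hypothetical subgraph result of the Cypher query: each entity appears
# exactly once, and the structure lives in the edges.
subgraph = {
    "nodes": {"p1", "o1", "a1", "a2", "r1", "r2", "r3"},
    "edges": {
        ("p1", "HasOrder", "o1"),
        ("o1", "HasAnalysis", "a1"),
        ("o1", "HasAnalysis", "a2"),
        ("a1", "HasResult", "r1"),
        ("a1", "HasResult", "r2"),
        ("a2", "HasResult", "r3"),
    },
}
```

Both carry the same information, but in the tabular form the branching structure (one order, two analyses, three results) has to be reverse-engineered from repeated column values, while the subgraph exposes it directly.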
8.5 Conclusion
We proposed a general, intuitive and simple methodology for transforming
a relational database schema into a graph data schema. The methodology was
demonstrated for a medical application (and an implementation based on Neo4j
and NeoDash – Neo4j Dashboard Builder [5]) for storing, visualizing and
analyzing electronic health records, diagnostics and biomedical data. The different
transformation steps of the schema graph were shown to obtain the graph database
schema from the former relational schema. The benefits of the graph data model
like more comprehensible queries and powerful visualisations in comparison to
the relational model were also discussed in the context of the tool Graph4Med.
References
[1] Barabási, A.-L., N. Gulbahce and J. Loscalzo. 2011. Network medicine: A network-
based approach to human disease. Nature Reviews Genetics, 12(1): 56–68.
[2] Girardi, D., S. Wartner, G. Halmerbauer, M. Ehrenmüller, H. Kosorus and S. Dreiseitl.
2016. Using concept hierarchies to improve calculation of patient similarity. Journal
of Biomedical Informatics, 63: 66–73.
[3] Goh, K.-I., M.E. Cusick, D. Valle, B. Childs, M. Vidal and A.-L. Barabási. 2007. The
human disease network. Proceedings of the National Academy of Sciences, 104(21):
8685–8690.
[4] Iacobucci, I. and C.G. Mullighan. 2017. Genetic basis of acute lymphoblastic
leukemia. Journal of Clinical Oncology, 35(9): 975.
[5] de Jong, N. 2022. NeoDash – Neo4j Dashboard Builder. Retrieved 21 Jan 2022, from
https://github.com/nielsdejong/neodash
[6] Pai, S. and G.D. Bader. 2018. Patient similarity networks for precision medicine.
Journal of Molecular Biology, 430(18): 2924–2938.
[7] Edwards, R. 2019. Neomodel documentation. Retrieved 14 Jan 2022, from
https://neomodel.readthedocs.io/en/latest/
[8] Rual, J.-F., K. Venkatesan, T. Hao, T. Hirozane-Kishikawa, A. Dricot and N. Li. 2005.
Towards a proteome-scale map of the human protein–protein interaction network.
Nature, 437(7062): 1173–1178.
[9] Tian, X.W. and J.S. Lim. 2015. Interactive naive bayesian network: A new approach
of constructing gene-gene interaction network for cancer classification. Bio-medical
Materials and Engineering, 26(s1): S1929–S1936.
[10] Timón-Reina, S., M. Rincón and R. Martínez-Tomás. 2021. An overview of graph
databases and their applications in the biomedical domain. Database, 2021.
[11] Wang, B., A.M. Mezlini, F. Demir, M. Fiume, Z. Tu, M. Brudno and A. Goldenberg.
2014. Similarity network fusion for aggregating data types on a genomic scale. Nature
Methods, 11(3): 333–337.
[12] Wiese, L., C. Wangmo, L. Steuernagel, A.O. Schmitt and M. Gültas. 2018.
Construction and visualization of dynamic biological networks: Benchmarking the
Neo4j graph database. In: International Conference on Data Integration in the Life
Sciences (pp. 33–43).