Performance Comparison of Graph Database and Relational Database

Technical Report · May 2023


DOI: 10.13140/RG.2.2.27380.32641


All content following this page was uploaded by Cajetan Rodrigues on 13 May 2023.



Performance Comparison of Graph Database
and Relational Database

Mit Jain, Ashish Khanchandani, Cajetan Rodrigues

Computer Science Department
San Jose State University
San Jose, USA

Abstract—We aim to present a comprehensive comparison between a graph database, Neo4j, and a relational database, MySQL, focusing on their performance on different types of queries. Graph databases use graph structures (nodes, edges, and properties) to represent data, while relational databases employ tables and relationships between them. This study evaluates the performance of Neo4j and MySQL in terms of query execution time by examining representative queries from four categories: selection/search, recursion, aggregation, and pattern matching. Real-world data from CareerVillage was used for the experiment. The results show that Neo4j outperforms MySQL in most cases, particularly in pattern matching and recursive queries. However, MySQL has advantages in terms of data consistency and transactional support.

Keywords—Databases, Neo4j, NoSQL, Graph Databases, Relational Databases

I. INTRODUCTION

Graph databases revolutionized the way data is stored and processed. By representing data as nodes and edges in a graph, they enable us to uncover insights that would be impossible to detect, or would require complex and expensive join operations, with traditional relational databases. They allow us to efficiently navigate through vast and intricate networks of data, making them invaluable tools for applications ranging from e-commerce to scientific research. This research is motivated by the comparison of MySQL and graph databases, and by the question of which database is suited to which scenarios.

Graph databases are particularly useful for applications that deal with complex and interconnected data, such as social networks, recommendation engines, and fraud detection systems. They provide a more natural and intuitive way to represent data than relational databases, especially when dealing with unstructured or semi-structured data. Graph databases can also handle large amounts of data and scale horizontally, making them suitable for applications with a high volume of data.

One of the main reasons for the popularity of graph databases is their ability to perform complex queries quickly and efficiently. Neo4j, for example, provides the query language Cypher, which allows users to search for patterns and relationships within the data. This makes it easy to perform tasks such as pathfinding, recommendation generation, and fraud detection.

Relational databases are one of the most widely used types of databases, popular for their ability to store and manage large amounts of data in an organized and efficient manner. They represent data in a tabular form, with each table consisting of rows and columns, where each row represents a record and each column represents a specific attribute of that record. Relational databases are based on the principles of relational algebra and are designed to enforce data integrity and consistency.

One of the main reasons for the popularity of relational databases is their ability to handle complex data relationships. By organizing data into tables and establishing relationships between them, relational databases make it easier to perform complex queries and analysis. They also provide a standardized language for querying and manipulating data.

We aim to determine the difference between a traditional RDBMS and a graph-based NoSQL database. We execute a comprehensive comparison by using a dataset and querying the same data in both schemas across different categories. To facilitate a
comparison between a graph-based NoSQL database and a traditional relational database management system (RDBMS), we employ Neo4j as the representative for the graph-based NoSQL database category and MySQL as the exemplar for the traditional RDBMS category. We compare the performance of various operations such as search, pattern matching, recursion, and aggregation. Neo4j is touted to be one of the best graph-based systems in the industry, well known for its execution speed and the benefits that come with having a graph structure with nodes and edges to model the data effectively. MySQL is a very popular and widely used RDBMS. The purpose is to show which is better and how significant a difference the choice of database makes.

This paper is structured as follows: In Section 2, we conduct a survey of previous studies that are relevant to performance comparison between MySQL and Neo4j. Section 3 outlines the dataset used to assess the performance of these two database systems, specifically comparing graph databases (Neo4j) with relational databases (MySQL). Section 4 details the Neo4j test environment. Section 5 presents the MySQL test environment. Section 6 presents the implementation and comparison of the SQL and Neo4j queries. Section 7 outlines the performance strategies used and the comparative analysis. Section 8 showcases the performance results. Finally, Section 9 concludes the paper and provides a discussion of the findings.

II. RELATED WORK

Various studies have compared the performance of MySQL and Neo4j graph databases for different types of queries and datasets. Some studies have found that Neo4j performs better than MySQL in terms of query speed, while others have found that MySQL is faster and more memory efficient. The types of queries tested include selection, aggregation, recursion, and pattern matching. The studies also explore the use of graph databases in various domains, such as social network analysis, web-based applications, IoT data management, and Customer Relationship Management (CRM) systems. Overall, the studies suggest that the performance of graph databases is better than that of conventional databases for certain types of queries and datasets.

In [1] and [2], the authors draw comparisons between a graph-based and a relational database and highlight the advantages and disadvantages of both. In [2], the authors highlight the use of graphs for tracking relationships and origins from the perspective of data provenance, and discuss how relational databases excel at managing structured data and enforcing integrity constraints.

[3] offers a broad and detailed view of various graph database models, such as property graphs, RDF graphs, and hypergraphs, among others. It also describes the strengths and weaknesses of different approaches for managing graph data.

[4] concludes that while relational databases excel at managing structured data and enforcing data integrity constraints, graph databases are more effective at handling unstructured and semi-structured data with complex relationships.

In [5], the authors evaluate the performance and scalability of both database models using various metrics, including response time, throughput, and CPU usage. The authors found that the non-relational database model performed better in terms of response time and scalability, while the relational database model performed better in terms of data consistency and availability. On graph databases and how to query them, [6] offers a comprehensive overview of query languages for graph databases. The authors describe the features of modern graph query languages, such as Cypher, Gremlin, and SPARQL, and provide examples of how to use these languages to perform different types of queries, thereby providing a solid foundation for understanding how to query and manipulate graph data in different contexts. [7-10] take a deeper look into the performance of graph databases on different datasets and focus on their performance on aggregation and recursive queries.

III. DATASET

A. Collection of Dataset

The CareerVillage dataset provides a valuable resource for researchers interested in studying career guidance and counseling. In this research paper, we use the dataset to compare the performance of MySQL and Neo4j, two popular database management systems. Specifically, we analyze how these systems perform when querying and processing the dataset's information, which includes questions asked by students, answers provided by professionals, and demographic data of both students and professionals. Our evaluation criteria focus on four query groups: selection, recursion, aggregation, and pattern matching. These query groups represent common types of queries that are used to analyze large, complex datasets. By evaluating the performance of MySQL and Neo4j on these query groups, we hope to gain insights into the strengths and
limitations of each database management system. Ultimately, our research aims to provide guidance to researchers and practitioners in selecting the most appropriate database management system for analyzing similar datasets.

B. Dataset representation

The dataset is a collection of csv files provided by careervillage.org, a website that hosts a community of students and professionals where students post questions and professionals offer advice in the form of answers to posted questions. The csv files collectively are a collection of tables. These tables contain a subset of the data stored by the actual database of CareerVillage. The total size of the dataset is 436.59 MB and it has 15 files.

Understanding what each file represents is crucial to making sense of the data. The answers.csv file contains the answers that are posted by registered professionals in response to students' questions. Answers can only be posted by professionals. The comments.csv file contains comments made on answers or questions. Comments can be posted by anyone. The emails.csv file contains information about the marketing/subscription emails sent. The 'frequency_level' of an email is a label which has an implicit frequency indicating the number of times such emails are sent. The group_memberships.csv file tracks user group memberships, with any user being allowed to join any group.

The groups.csv file contains information about each group, but the group names have been left off for privacy reasons. The matches.csv file links questions included in emails, with each row containing information on the email's ID. The professionals.csv file contains information about the site's volunteers, who are referred to as professionals. The questions.csv file contains the questions posted by students. The school_memberships.csv file tracks user memberships in schools, with a similar structure to group_memberships.csv. Only students are allowed to be part of school groups.

Lastly, the students.csv file contains information about the site's students, who are the reason CareerVillage.org exists. The tag_questions.csv file tracks hashtag-to-question pairings, while the tag_users.csv file shows which hashtags each user follows. Finally, the tags.csv file contains the name of each tag, and the question_scores.csv and answer_scores.csv files contain the number of "hearts" received by each question and answer, respectively.

IV. NEO4J TEST ENVIRONMENT

Fig. 1. UML Sequence diagram for Neo4j implementation

A. Loading dataset onto Neo4j

Before we begin loading the dataset, here are some prerequisites:
• Installed instance of Neo4j Desktop
• Installation of python (Python version 3.9.13 was used in our implementation)
• Python library: neo4j

We analyzed the dataset and found that some files contained entity information and some files held relationships between the entities. We created nodes corresponding to each entity. We also created nodes to represent the relationships between two nodes. In the early stages of loading the dataset into Neo4j, we manually wrote queries for loading each node. We also manually wrote queries to create relationships between the nodes. Since we were testing it as a team, we soon realized that we needed something dynamic. Hence, we employed an automated way of loading the dataset and creating nodes and relationships in a single script run. We created two python scripts: the first reads the information from the csv files and creates nodes in the database, and the second creates relationships between the nodes. We automated the execution of both scripts by writing a bash script that creates the nodes first. The following is a step-by-step execution of the script.
1. Begin by invoking the bash script initialize_neo4j.sh.
2. In the bash shell, execute the script called execute.sh.
3. The execute.sh script executes the first Python script called load_nodes.py, which reads the data from the CSV files and creates the corresponding nodes in the Neo4j database.
4. Once the load_nodes.py script completes its execution, the setup.sh script executes the second Python script called create_relationships.py.
5. The create_relationships.py script reads the data from the CSV files and creates the relationships between the nodes created in the previous step.
6. End the script.

Thus, the bash script executes both Python scripts sequentially, where the first script loads the nodes into the Neo4j database, and the second script creates relationships between the nodes. This approach allowed us to automate the entire process of loading data into the Neo4j database and creating relationships between the nodes using a single command.

B. Visualising the Neo4j Schema

After loading the dataset onto the Neo4j database, we used the Cypher language to write data profiling queries to visualize the created nodes and relationships.

Cypher is the query language used in Neo4j, a popular graph database management system. It is a declarative, pattern-matching language that is specifically designed for querying and manipulating graph data. With Cypher, users can express complex queries in a concise and readable syntax that is easy to understand and maintain.

Cypher provides a range of expressive syntax for filtering, aggregating, and transforming data stored in a graph. It also supports several advanced features such as pattern matching, traversals, path finding, and spatial operations. Cypher queries are constructed using ASCII art-like patterns, which makes them easy to read and understand.

Overall, Cypher is a powerful and flexible language that enables users to query and manipulate graph data in Neo4j quickly and easily. It is a key component of the Neo4j ecosystem, and is widely used by developers, data analysts, and data scientists to build graph-based applications and solve complex data problems.

The following are some of the data profiling queries we used to visualize the created nodes and relationships.

1. Count all nodes.

The following query counts all the nodes loaded on the Neo4j DBMS.

Fig. 2. Query to count all nodes

2. Count all relationships.

The following query counts all the possible relationships between a pair of nodes.

Fig. 3. Query to count all relationships

3. List count of each node

The following query counts the number of nodes present in each entity or label.

Fig. 4. Query to list count of each entity

TABLE I
DISPLAYING THE COUNT OF EACH ENTITY

Node                  Count
Matches               4316275
Emails                1850101
Tag Users             136663
Tag Questions         76553
Answers               51123
Students              30971
Professionals         28152
Questions             23931
Tags                  16269
Comments              14966
School Memberships    5638
Group Memberships     1038
Groups                49
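The queries behind Figs. 2 to 4 are not reproduced in the text. As a rough sketch only, the three counts described above are commonly written in Cypher as shown below; these strings are an assumption and not necessarily the authors' exact queries. The connection parameters passed to the hypothetical helper `run_profiling` are placeholders, and the helper uses the official `neo4j` Python driver listed in the prerequisites.

```python
# Commonly used Cypher for the three profiling counts. These are assumptions,
# not necessarily the exact queries shown in Figs. 2-4 of the paper.
PROFILING_QUERIES = {
    "count_nodes": "MATCH (n) RETURN count(n) AS total",
    "count_relationships": "MATCH ()-[r]->() RETURN count(r) AS total",
    "count_per_label": (
        "MATCH (n) RETURN labels(n) AS label, count(*) AS total "
        "ORDER BY total DESC"
    ),
}


def run_profiling(uri, user, password):
    """Run every profiling query and return {query name: list of row dicts}.

    Hypothetical helper: uri/user/password are placeholders for a local
    Neo4j instance, e.g. "bolt://localhost:7687".
    """
    # Imported lazily so the query strings can be inspected without the driver.
    from neo4j import GraphDatabase

    driver = GraphDatabase.driver(uri, auth=(user, password))
    try:
        with driver.session() as session:
            return {
                name: [record.data() for record in session.run(query)]
                for name, query in PROFILING_QUERIES.items()
            }
    finally:
        driver.close()
```

Run against the loaded database, the per-label query would be expected to return one row per label, which is the shape of the data in Table I.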
4. Visualize all nodes and relationships.

The following is the representation of the nodes and the relationships between the nodes.

Fig. 5. Query to visualise the schema

Fig. 6. Neo4j database schema visualization

V. MYSQL TEST ENVIRONMENT

Fig. 7. UML Sequence for SQL implementation

A. Loading dataset in MySQL

Before we begin loading the dataset, here are some prerequisites:
• Installed instance of MySQL server (MySQL server community edition 8.0.32 was used in our implementation)
• Installation of python (Python version 3.9.13 was used in our implementation)
• Python library: mysql-connector-python

We load the dataset into tables in two phases. First, we create the database and the necessary tables. In doing so, we define the schemas and implement all the required constraints, indexes, and relationships. Second, we read data from the csv files, construct SQL queries, and execute these queries. Phase 1 is implemented by the files 'createDB.py' and 'createTables.py'. Phase 2 is implemented by 'loadData.py'. All these python scripts are stored in the 'load-data-into-sql' subdirectory. To avoid the hassle of executing these python scripts manually, a shell script called 'initialize_sql.sh' is provided in the same directory; it automates the complete data loading process and executes both phases.

B. Visualizing the MySQL schema

After the successful completion of dataset loading, we will have the following tables in the database:

TABLE II
DISPLAYING THE NUMBER OF TUPLES IN EACH TABLE

Table                 No. of tuples
answer_scores         51107
answers               51123
comments              14966
emails                1850101
group_memberships     1038
groups_               49
professionals         28152
question_scores       23928
questions             23931
school_memberships    5638
students              30971
tag_questions         76553
tag_users             136663
tags                  16269

Essentially, each csv file present in the dataset is loaded into a separate table. We visualize our database using an Entity-Relationship (ER) diagram which captures the relationships between various entities. Fig. 8 below shows the ER diagram.

Fig. 8. Entity-Relationship diagram
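The two-phase loading described in Section V-A (create the schema, then read csv rows and execute generated INSERT statements) can be sketched as follows. This is a minimal illustration, not the authors' 'loadData.py': it uses Python's built-in `sqlite3` as a stand-in for MySQL so that it runs without a server, the helper names `build_insert` and `load_csv` are hypothetical, and the two sample rows are made up. With mysql-connector-python the placeholder would be `%s` rather than `?`.

```python
import csv
import io
import sqlite3


def build_insert(table, columns):
    """Build a parameterized INSERT statement for the given table."""
    placeholders = ", ".join("?" for _ in columns)  # MySQL connector: %s
    return f"INSERT INTO {table} ({', '.join(columns)}) VALUES ({placeholders})"


def load_csv(conn, table, csv_text):
    """Insert every data row of a CSV document (with header) into `table`."""
    reader = csv.reader(io.StringIO(csv_text))
    columns = next(reader)  # the header row supplies the column names
    rows = [tuple(row) for row in reader]
    conn.executemany(build_insert(table, columns), rows)
    conn.commit()
    return len(rows)


# Phase 1: create the schema (illustrative subset: only the tags table).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tags (tags_tag_id INTEGER NOT NULL, tags_tag_name TEXT)")

# Phase 2: read CSV data and execute the generated INSERT statements.
sample = "tags_tag_id,tags_tag_name\n1,college\n2,engineering\n"  # made-up rows
loaded = load_csv(conn, "tags", sample)
print(loaded)  # 2
```

Parameterized statements keep the csv values out of the SQL text, which avoids quoting problems in free-text columns such as question and answer bodies.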

VI. IMPLEMENTATION (SQL & NEO4J)

For the sake of simplicity, we visualize the results for one query in each category. For MySQL, we show the execution of the EXPLAIN ANALYZE command. For its Neo4j counterpart, we show the graph visualization.

Fig. 9. Flowchart showing overall implementation

1. Selection

Q1: Looking for professionals in a specific tag?

SQL
SELECT p.* FROM professionals p JOIN
tag_users tu ON p.professionals_id =
tu.tag_users_user_id JOIN tags t ON
tu.tag_users_tag_id = t.tags_tag_id
WHERE tags_tag_name = 'college';
Cypher
MATCH (p:Professionals)-[]->(t:Tags)
WHERE t.tags_tag_name='college'
RETURN p,t
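To make the join path in Q1 concrete, the SQL above can be tried on a toy dataset. The sketch below is only an illustration: it uses the built-in `sqlite3` module instead of MySQL, keeps just the columns Q1 touches, and the three professionals and their tag subscriptions are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE professionals (professionals_id TEXT PRIMARY KEY);
CREATE TABLE tags (tags_tag_id INTEGER PRIMARY KEY, tags_tag_name TEXT);
CREATE TABLE tag_users (tag_users_tag_id INTEGER, tag_users_user_id TEXT);
-- Invented sample rows: p1 and p3 follow 'college', p2 and p3 'engineering'.
INSERT INTO professionals VALUES ('p1'), ('p2'), ('p3');
INSERT INTO tags VALUES (1, 'college'), (2, 'engineering');
INSERT INTO tag_users VALUES (1, 'p1'), (2, 'p2'), (1, 'p3'), (2, 'p3');
""")

# Same join structure as Q1, with the tag name passed as a parameter.
Q1 = """
SELECT p.* FROM professionals p
JOIN tag_users tu ON p.professionals_id = tu.tag_users_user_id
JOIN tags t ON tu.tag_users_tag_id = t.tags_tag_id
WHERE t.tags_tag_name = ?
ORDER BY p.professionals_id
"""

followers = [row[0] for row in conn.execute(Q1, ("college",))]
print(followers)  # ['p1', 'p3']
```

The relational version needs the tag_users link table and two joins to connect a professional to a tag, whereas the Cypher pattern expresses the same connection as a single relationship hop.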

Fig. 10. Query to find professionals in a specific tag

Fig. 11. EXPLAIN ANALYZE on SQL Query

Q2: Looking for students in a specific group and interested in a specific tag?

SQL
SELECT * FROM students s
JOIN group_memberships gm ON students_id = gm.group_memberships_user_id
JOIN groups_ g ON g.groups_id = gm.group_memberships_group_id
JOIN tag_users tu ON tu.tag_users_user_id = s.students_id
JOIN tags t ON t.tags_tag_id = tu.tag_users_tag_id
WHERE t.tags_tag_name = 'college' AND g.groups_group_type = 'youth program';
Cypher
MATCH (t:Tags)<-[:HAS_TAG]-(s:Students)-[:MEMBER_IN]->(b)
WHERE t.tags_tag_name='college'
AND b.groups_group_type='youth program'
RETURN s,t,b

Q3: Looking for all emails received by a particular professional?

SQL
SELECT * FROM professionals p
JOIN emails e ON p.professionals_id = e.emails_recipient_id
WHERE p.professionals_id = '0079e89bf1544926b98310e81315b9f1';
Cypher
MATCH (p:Professionals{professionals_id: '0079e89bf1544926b98310e81315b9f1'})-[:GOT_EMAIL]->(e:Emails)
RETURN e

2. Recursion

Q4: Looking for the questions with answers recursively many times?

SQL
WITH RECURSIVE answer_replies AS (
    SELECT answers_id, answers_author_id, answers_question_id,
           answers_date_added, answers_body
    FROM answers
    WHERE answers_question_id IS NOT NULL
    UNION ALL
    SELECT a.answers_id, a.answers_author_id, a.answers_question_id,
           a.answers_date_added, a.answers_body
    FROM answers a
    INNER JOIN answer_replies ar ON ar.answers_id = a.answers_question_id
)
SELECT * FROM answer_replies ar
LEFT JOIN questions q ON ar.answers_question_id = q.questions_id;
Cypher
MATCH (q:Questions)<-[:IS_REPLY_TO*1..]-(a:Answers)
RETURN q,a

Fig. 12. Query to find questions with answers recursively many times
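Recursive queries of this kind can be bounded in depth, which is what a Cypher variable-length pattern such as `*1..2` expresses. The sketch below shows a depth-limited recursive CTE on a toy reply chain. It is an illustration under stated assumptions: it runs on the built-in `sqlite3` module rather than MySQL, the four answers are invented, and, unlike Q4, the anchor member starts only from answers that point directly at a question so that the levels are easy to read.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE questions (questions_id TEXT PRIMARY KEY);
CREATE TABLE answers (answers_id TEXT PRIMARY KEY, answers_question_id TEXT);
-- Invented chain: a1 and a4 answer q1, a2 replies to a1, a3 replies to a2.
INSERT INTO questions VALUES ('q1');
INSERT INTO answers VALUES ('a1', 'q1'), ('a4', 'q1'), ('a2', 'a1'), ('a3', 'a2');
""")

DEPTH_LIMITED = """
WITH RECURSIVE answer_replies(level, answers_id, answers_question_id) AS (
    -- Anchor member: direct answers to a question are level 1.
    SELECT 1, a.answers_id, a.answers_question_id
    FROM answers a JOIN questions q ON a.answers_question_id = q.questions_id
    UNION ALL
    -- Recursive member: follow reply links until max_depth is reached.
    SELECT ar.level + 1, a.answers_id, a.answers_question_id
    FROM answers a JOIN answer_replies ar ON ar.answers_id = a.answers_question_id
    WHERE ar.level < :max_depth
)
SELECT level, answers_id FROM answer_replies ORDER BY level, answers_id
"""


def replies_up_to(max_depth):
    """Return (level, answers_id) pairs reachable within max_depth hops."""
    return conn.execute(DEPTH_LIMITED, {"max_depth": max_depth}).fetchall()


print(replies_up_to(2))  # [(1, 'a1'), (1, 'a4'), (2, 'a2')]
print(replies_up_to(3))  # [(1, 'a1'), (1, 'a4'), (2, 'a2'), (3, 'a3')]
```

Each extra level adds another self-join of the answers table in the relational plan, while the graph version only widens the hop range of one pattern.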
Fig. 13. EXPLAIN ANALYZE SQL Command

Q5: Looking for questions with answers recursively twice?

SQL
WITH RECURSIVE answer_replies AS (
    SELECT 1 AS level, answers_id, answers_author_id, answers_question_id,
           answers_date_added, answers_body
    FROM answers
    WHERE answers_question_id IS NOT NULL
    UNION ALL
    SELECT level+1, a.answers_id, a.answers_author_id, a.answers_question_id,
           a.answers_date_added, a.answers_body
    FROM answers a
    INNER JOIN answer_replies ar ON ar.answers_id = a.answers_question_id
    WHERE level <= 2
)
SELECT * FROM answer_replies ar
LEFT JOIN questions q ON ar.answers_question_id = q.questions_id;
Cypher
MATCH (q:Questions)<-[:IS_REPLY_TO*1..2]-(a:Answers)
RETURN q,a

Q6: Looking for questions with answers recursively 3 times?

SQL
WITH RECURSIVE answer_replies AS (
    SELECT 1 AS level, answers_id, answers_author_id, answers_question_id,
           answers_date_added, answers_body
    FROM answers
    WHERE answers_question_id IS NOT NULL
    UNION ALL
    SELECT level+1, a.answers_id, a.answers_author_id, a.answers_question_id,
           a.answers_date_added, a.answers_body
    FROM answers a
    INNER JOIN answer_replies ar ON ar.answers_id = a.answers_question_id
    WHERE level <= 3
)
SELECT * FROM answer_replies ar
LEFT JOIN questions q ON ar.answers_question_id = q.questions_id;
Cypher
MATCH (q:Questions)<-[:IS_REPLY_TO*1..3]-(a:Answers)
RETURN q,a

3. Aggregation

Q7: Count the number of professionals who answered the questions.

SQL
SELECT count(professionals_id) FROM professionals p
JOIN answers a ON p.professionals_id = a.answers_author_id;
Cypher
MATCH (p:Professionals)-[]->(a:Answers)
RETURN count(p)

Fig. 14. Cypher query to count the number of professionals who answered the question

Fig. 15. EXPLAIN ANALYZE SQL Command

Q8: Count the number of professionals of a specific tag.

SQL
SELECT count(*) FROM (
    SELECT DISTINCT p.* FROM professionals p
    JOIN tag_users tu ON p.professionals_id = tu.tag_users_user_id
    JOIN tags t ON tu.tag_users_tag_id = t.tags_tag_id
    WHERE t.tags_tag_name = 'college'
) AS temp;
Cypher
MATCH (p:Professionals)-[:HAS_TAG]->(t:Tags)
WHERE t.tags_tag_name='college'
RETURN count(p)

Q9: Which tag has the most professionals?

SQL
SELECT tags.tags_tag_id, tags_tag_name,
       COUNT(p.professionals_id) AS number_of_professionals
FROM tags
JOIN tag_users tu ON tags.tags_tag_id = tu.tag_users_tag_id
JOIN professionals p ON p.professionals_id = tu.tag_users_user_id
GROUP BY tags.tags_tag_id, tags_tag_name
ORDER BY COUNT(p.professionals_id) DESC
LIMIT 1;
Cypher
MATCH (p:Professionals)-[:HAS_TAG]->(t:Tags)
RETURN t.tags_tag_name AS TagName, COUNT(p) ORDER BY COUNT(p) DESC
LIMIT 1

4. Pattern Match

Q10: Looking for the question answered in tags?

SQL
SELECT q.questions_id, t.tags_tag_id, a.answers_id FROM tags t
JOIN tag_questions tq ON t.tags_tag_id = tq.tag_questions_tag_id
JOIN questions q ON tq.tag_questions_question_id = q.questions_id
JOIN answers a ON a.answers_question_id = questions_id;
Cypher
MATCH (a:Answers)-[]->(q:Questions)-[]->(t:Tags)
RETURN a,q,t

Fig. 16. Cypher query to find question answered in tags

Fig. 17. EXPLAIN ANALYZE SQL Command

Q11: Looking for students and professionals with the same group?

SQL
SELECT g.groups_id, professionals_id, students_id FROM groups_ g
JOIN group_memberships gm ON g.groups_id = gm.group_memberships_group_id
JOIN (
    SELECT group_memberships_group_id AS group_id, professionals_id
    FROM professionals p
    JOIN group_memberships gm1 ON gm1.group_memberships_user_id = p.professionals_id
) pg ON pg.group_id = gm.group_memberships_group_id
JOIN (
    SELECT group_memberships_group_id AS group_id, students_id
    FROM students s
    JOIN group_memberships gm2 ON s.students_id = gm2.group_memberships_user_id
) sg ON sg.group_id = gm.group_memberships_group_id;
Cypher
MATCH (p:Professionals)-[]->(g:Groups)<-[]-(s:Students)
RETURN p, g, s
Q12: Looking for students and experts in the same tag?

SQL
SELECT pt.tags_id, st.students_id, pt.professionals_id FROM tags t
JOIN tag_users tu ON t.tags_tag_id = tu.tag_users_tag_id
JOIN (
    SELECT u.tag_users_tag_id AS tags_id, professionals_id
    FROM professionals p
    JOIN tag_users u ON p.professionals_id = u.tag_users_user_id
) pt ON pt.tags_id = t.tags_tag_id
JOIN (
    SELECT u.tag_users_tag_id AS tags_id, students_id
    FROM students s
    JOIN tag_users u ON s.students_id = u.tag_users_user_id
) st ON st.tags_id = t.tags_tag_id
LIMIT 100000;
Cypher
MATCH (p:Professionals)-[]->(t:Tags)<-[]-(s:Students)
RETURN p, t, s LIMIT 100000

VII. PERFORMANCE EVALUATIONS

A. Neo4j Performance Strategy

Now that we have our data modelled and set up, we use the Neo4j Browser client in the Neo4j Desktop app to run the Cypher queries discussed in the previous section. Neo4j provides a lot of functionality out of the box, including the EXPLAIN and PROFILE keywords.

The 'explain' command is used to show the execution plan of a Cypher query. It provides information on how the query will be executed, such as which indexes will be used, which operations will be performed, and the estimated number of rows that will be processed. The command used is:

Fig. 18. Cypher query using Explain command in Neo4j.

Fig. 19. Cypher query using Explain command in Neo4j.

The 'profile' command provides more detailed information than the explain command. It provides information on the execution plan, as well as additional statistics on how the query was executed, such as the number of database hits, the number of rows processed at each stage, and the total processing time. The command used is:

Fig. 20. Cypher query using PROFILE command in Neo4j.

The profile command is more useful than the explain command when optimizing queries because it provides more detailed information about the performance of the query. By examining the statistics provided by the PROFILE command, developers can identify performance bottlenecks and optimize their queries. Figures 21 and 22 show the more detailed execution plan, and at the bottom of Figure 22 we can also see the execution time taken for the query to complete. We use the same procedure for all 12 queries: we run each query 3 times and take the average of the runtimes as the value for comparison against the relational execution times.
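The measurement procedure above (prefix the query with the profiling keyword, run it three times, average the reported runtimes) can be sketched as follows. This is an illustrative helper set, not the authors' tooling: the function names are hypothetical, the sample runtimes are invented, and in the paper the times are read from the PROFILE and EXPLAIN ANALYZE output rather than from a client-side timer, so `average_runtime_ms` is only a fallback for setups where no engine-reported time is available.

```python
import statistics
import time


def profiled(cypher):
    """Prefix a Cypher query with PROFILE so Neo4j reports execution statistics."""
    return "PROFILE " + cypher


def explain_analyzed(sql):
    """Prefix a SQL query with EXPLAIN ANALYZE (available in MySQL 8.0.18+)."""
    return "EXPLAIN ANALYZE " + sql


def mean_of_recorded(times_ms):
    """Average runtimes (in ms) that were read off the engine's own output."""
    return statistics.mean(times_ms)


def average_runtime_ms(run_query, repeats=3):
    """Fallback: time run_query() on the client and return the mean in ms."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        run_query()
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.mean(samples)


# Invented example: three recorded runtimes for one query, in milliseconds.
print(mean_of_recorded([2, 3, 4]))  # 3
```

Client-side timing includes driver and rendering overhead, so it would be expected to overstate the engine time compared with the figures reported by PROFILE or EXPLAIN ANALYZE.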
Fig. 21. Visualising the PROFILE command

Fig. 22. Profiling command shown on Query

B. MySQL Performance Strategy

Once the data loading process is complete, we use the 'mysql' command-line client tool to query our database and test that the data is loaded, so that we can go ahead with performance evaluation. To analyze the execution of a query and find its exact execution time, we use the EXPLAIN ANALYZE command.

This command is essential as it gives the complete breakdown of how the query was executed (join strategies used, intermediate and final result sizes, estimated cost, actual execution times at all steps, etc.) and exactly how much time was spent on each aspect of the query. This can help identify bottlenecks and optimize performance where needed. Additionally, it ignores the time required to render query results on the client side, which is important for calculating an execution time that purely reflects the capability of the MySQL DB engine. For demonstrative purposes, consider the same query that is used in the previous subsection, i.e., query Q4. Fig. 23 shows the output of executing the EXPLAIN ANALYZE command on Q4.

Fig. 23. EXPLAIN ANALYZE command shown in Query Q4

Fig. 24 below shows a zoomed-in version of Fig. 23, allowing us to see the granular query execution details captured by the EXPLAIN ANALYZE command.

Fig. 24. Zoomed-in version of Fig. 23

We run the EXPLAIN ANALYZE command on each of the 12 queries three times to find the average execution time of each query, so that it can be used for performance comparison with Neo4j in the next section.

VIII. PERFORMANCE RESULTS

Our evaluation criteria mainly look at the following query groups:

• Selection/Search
• Recursive/Related
• Aggregation
• Pattern Matching

For the scope of this comparison, we focus on one of the most practical parameters to judge the performance of a query: execution time. We compare the execution times across the four categories, and we ensure that there are at least 3 queries per category.

The hardware configuration used was an Apple MacBook Pro with the M1 Pro Apple Silicon chip, coupled with 16 GB RAM and running the latest version of macOS Ventura. We have three machines of the same configuration, each running a local instance of both databases, i.e., Neo4j Desktop and MySQL, so that we can later record and average the results and ensure there was no
swaying of results due to any other external factors related to our local systems.

We then recorded the time it took for both databases to do the job and present the results in the table below:

TABLE III
PERFORMANCE COMPARISON BETWEEN NEO4J & MYSQL

Category            Query   Neo4j   MySQL
Selection           Q1      2ms     31ms
                    Q2      8ms     323ms
                    Q3      32ms    438ms
Recursive           Q4      2ms     757ms
                    Q5      2ms     290ms
                    Q6      3ms     305ms
Aggregation         Q7      43ms    146ms
                    Q8      18ms    40ms
                    Q9      62ms    290ms
Pattern Matching    Q10     5ms     360ms
                    Q11     10ms    455ms
                    Q12     1ms     68ms

The table above shows the average values obtained by executing the same query three times sequentially, as the execution time varies between runs due to factors such as background processes on the system, caching, and other operating system processes. Therefore, we chose to execute these queries 3 times across three different systems and then took an average across all the recordings to give us a fair relative estimate of the performance difference between the two databases.

IX. CONCLUSION

In this study, we have provided an overview of queries categorized into four groups: selection/search, recursion, aggregation, and pattern matching. Furthermore, we have conducted a comparison between Neo4j (a representative of graph databases) and MySQL (a representative of relational databases) in terms of their data query performance. Our findings demonstrate that the graph database outperforms the relational database by up to 146 times when querying complex and large datasets. As our future work, we plan to extend our tests to other datasets in various sectors, such as banking, the stock market, and ERP, and evaluate other aspects of system performance, such as memory usage, power consumption, and implementation complexities.

Looking ahead, it would be exciting to take this comparison one step further and include metrics such as how much memory tradeoff there is in storing duplicated data in a NoSQL system, and its scalability and cost consequences with respect to the gain in performance we obtain by following the graph structure. Similarly, there can be further work and analysis based on the use cases, size of data, and various other parameters.

REFERENCES

[1] Thi-Thu-Trang Do, Thai-Bao Mai-Hoang, Van-Quyet Nguyen, and Quyet-Thang Huynh. Query-based performance comparison of graph database and relational database. In Proceedings of the 11th International Symposium on Information and Communication Technology, pages 375–381, 2022.
[2] Chad Vicknair, Michael Macias, Zhendong Zhao, Xiaofei Nan, Yixin Chen, and Dawn Wilkins. 2010. A comparison of a graph database and a relational database: a data provenance perspective. In Proceedings of the 48th Annual Southeast Regional Conference (ACM SE '10).
[3] Renzo Angles and Claudio Gutierrez. Survey of graph database models. ACM Comput. Surv., 40(1), Feb. 2008.
[4] Shalini Batra and Charu Tyagi. 2012. Comparative analysis of relational and graph databases. International Journal of Soft Computing and Engineering (IJSCE).
[5] Cornelia Gyorödi, Robert Gyorödi, and Roxana Sotoc. 2015. A comparative study of relational and non-relational database models in a Web-based application. International Journal of Advanced Computer Science and Applications 6, 11 (2015).
[6] R. Angles, M. Arenas, P. Barceló, A. Hogan, J. Reutter, and D. Vrgoč, "Foundations of Modern Query Languages for Graph Databases," ACM Comput. Surv., vol. 50, no. 5, Art. no. 68, Sep. 2017, doi: 10.1145/3104031.
[7] P. Kotiranta, M. Junkkari, and J. Nummenmaa, "Performance of Graph and Relational Databases in Complex Queries," Applied Sciences, vol. 12, no. 13, Art. no. 6490, Jul. 2022, doi: 10.3390/app12136490.
[8] W. Khan, W. Shahzad, et al., "Predictive Performance Comparison Analysis of Relational & NoSQL Graph Databases," International Journal of Advanced Computer Science and Applications, vol. 8, no. 5, pp. 73-79, 2017, doi: 10.14569/IJACSA.2017.080510.
[9] L. Jachiet, P. Genevès, N. Gesbert, and N. Layaïda, "On the Optimization of Recursive Relational Queries: Application to Graph Queries," in Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data, 2020, pp. 681-697, doi: 10.1145/3318464.3380594.
[10] J. Hölsch, T. Schmidt, and M. Grossniklaus, "On the performance of analytical and pattern matching graph queries in neo4j and a relational database," in EDBT/ICDT 2017 Joint Conference: 6th International Workshop on Querying Graph Structured Data (GraphQ), 2017, pp. 15-22, doi: 10.1145/3035918.3035930.
