University Research Graph Database
Abstract: In general, research-related data are modeled using a relational database optimized for transaction processing. In many cases, this solution is effective and efficient enough to answer basic queries and simple reporting requirements. However, when users request a more in-depth, more expansive, multi-perspective, and sometimes more abstract analysis, the relational database struggles to provide answers. This study proposes a research graph database implemented in neo4j to address these problems. The database consists of a core model and an extension model. The core model represents data related to scientific articles and is loaded with real data scraped from Google Scholar. The extension model represents research and community engagement activities carried out by researchers and is loaded manually. The database enables the university to analyze researchers' individual performance and their collaborative performance with fellow researchers inside and outside the university. The study concludes that the research graph database implementation answers such questions more efficiently than the relational database implementation.

Keywords: graph database, research data, efficient data analysis, multi-perspective, neo4j

I. INTRODUCTION

Organizations, including universities, face challenges in achieving data integration. Typical problems found in universities are fragmented data and flawed or incomplete data model designs, managed separately by different systems across the organization. The situation is often made worse by a lack of documentation. Report generation is also a long, painful process. At the same time, analysis activities (e.g., SWOT analysis) are limited in depth and breadth, mostly because of the lack of data integrity and connectivity.

In line with the continuing increase in government demands on universities for higher quality, especially in research and community engagement, where new quality measures are in effect, universities need to build better solutions to meet those requirements. They must look carefully at their current systems and data models and develop new ideas to keep up with a dynamic environment and answer the organization's challenging future questions.

In recent years, graph databases have attracted increasing attention and have been researched extensively for many purposes, such as (i) analysis of film data and of the relationship between lightning and transmission line failure rates [1][2]; (ii) recommendation in e-commerce, recipes, books, movies, and retail [3]–[7]; (iii) information retrieval and discovery [8][9]; and (iv) data integration [10]. They have also gained popularity in industry, with several prominent organizations, including NASA (National Aeronautics and Space Administration), known for using the approach [11].

This research aims to produce a graph of researchers and their publications that knowledge workers can query to analyze the network of authors, organizations, keywords, research interests, and other entities. The resulting graph is intended to be easy to update, extend, and scale up by inserting more nodes and adding relations to domain knowledge to support semantic searches.

II. METHODOLOGY

Many search engines for academic literature are available on the Internet today, such as Google Scholar, Google Books, Microsoft Academic, ResearchGate, Science.gov, RefSeek, ERIC (Educational Resources Information Center), WorldWideScience, iSeek, VLRC (Virtual Learning Resources Center), BASE (Bielefeld Academic Search Engine), PubMed, and others. Several studies favor Google Scholar over the others, reporting that it is widely used globally due to its high recall, free access, and broad coverage [12][13]. Among the initiatives that rely on Google Scholar is a platform named SINTA (Science and Technology Index), currently hosted by Ristekbrin (Research and Technology Ministry / National Research and Innovation Body) of Indonesia. SINTA uses data from Google Scholar [14] and other sources to calculate the capabilities of researchers, institutions, and journals in Indonesia. The results are then made freely available to the public. An advantage of Google Scholar worth highlighting is that it offers more up-to-date academic references. There are also APIs (Application Programming Interfaces) for various programming platforms, built by developers, that researchers, practitioners, and the general public can use to connect to Google Scholar and retrieve data freely for academic and research purposes. Therefore, it fits the requirements of this study.

This research used Google Scholar as the primary source of data related to scientific articles to produce a graph database. The secondary data are research activity data retrieved from various sources, including university research databases, spreadsheets, and the Internet. Fig. 1 depicts the steps taken in this study in detail.
Authorized licensed use limited to: Corporacion Universitaria de la Costa. Downloaded on October 19,2024 at 00:07:02 UTC from IEEE Xplore. Restrictions apply.
As an initial step, the study identified the data available in and offered by Google Scholar through direct observation of https://scholar.google.com and a literature study. Next, questions were generated based on the entities and the possible relations between them. The initial list of questions was never intended to be final or fixed, but dynamic. Nevertheless, we assessed every question along the dimensions of 4W (What, Where, When, Who) and 1H (How) to see whether the collected data could answer it.

The following step was designing the graph data model, consisting of nodes, labels, relationships, and property keys, based on the previous steps and implementing it on the neo4j platform. Graph queries, written in Cypher, were then formulated to answer the questions and test the model.

Table I shows all node label names, and Table II displays the relationships between nodes.

TABLE I. NODE LABEL

No  Node Label       Description
1   Affiliation      It represents a researcher's affiliation, such as a university or a research institution, as usually stated in a scientific article.
2   Article          It represents scientific articles or reports written by Researchers.
3   DbIndexer        It describes indexing database services for scientific articles provided by specific organizations, such as Scopus and Web of Science.
4   IntelPropRight   It typifies any intellectual property right owned by any Researcher.
5   Organization     It represents any organization of any kind that researchers affiliate with, receive project grants from, and have their intellectual property approved by.
6   Person           It depicts any person in general, who can be a researcher or hold any other role or job.
7   Publisher        It portrays a particular type of organization that offers publication products.
8   Pubname          It illustrates any publication product offered by a Publisher.
9   QualityMeasure   It represents any quality measurement of a publication product offered by an indexing database service, provided either directly by the database or indirectly by a third party.
10  RegisterOffice   A particular type of organization that provides intellectual property rights services, from application to award.
11  ResearchProject  Any project, of either the research or the community engagement type, self-funded or granted by an individual or a group of organizations.
12  Researcher       It represents a particular type of Person who works either partially or entirely as a researcher.
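To make the four building blocks of the data model (nodes, labels, relationships, and property keys) concrete, the following is a minimal in-memory sketch in Python. It is not the neo4j implementation used in this study. The class and method names (PropertyGraph, add_node, relate, by_label, outgoing) and the sample values are hypothetical, and it assumes that a Researcher node also carries the Person label and that a Publisher node also carries the Organization label, which Table I suggests but does not state. The OWNS relationship type is taken from the loading queries reported in this paper.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    labels: set   # node labels, e.g. {"Person", "Researcher"}
    props: dict   # property keys mapped to their values


@dataclass
class Relationship:
    start: Node
    rel_type: str  # relationship type, e.g. "OWNS"
    end: Node
    props: dict = field(default_factory=dict)


class PropertyGraph:
    """Tiny in-memory stand-in for a labelled property graph (not neo4j)."""

    def __init__(self):
        self.nodes = []
        self.relationships = []

    def add_node(self, *labels, **props):
        node = Node(set(labels), props)
        self.nodes.append(node)
        return node

    def relate(self, start, rel_type, end, **props):
        rel = Relationship(start, rel_type, end, props)
        self.relationships.append(rel)
        return rel

    def by_label(self, label):
        """Return every node that carries the given label."""
        return [n for n in self.nodes if label in n.labels]

    def outgoing(self, node, rel_type):
        """Return the end nodes of rel_type relationships leaving node."""
        return [r.end for r in self.relationships
                if r.start is node and r.rel_type == rel_type]


graph = PropertyGraph()
# Assumption: specialised labels are stacked on top of the general ones.
researcher = graph.add_node("Person", "Researcher", name="Researcher A")
publisher = graph.add_node("Organization", "Publisher", name="Publisher P")
right = graph.add_node("IntelPropRight", title="Sample right")
graph.relate(researcher, "OWNS", right)

print([n.props["name"] for n in graph.by_label("Person")])
print([n.props["title"] for n in graph.outgoing(researcher, "OWNS")])
```

Stacking labels in this way lets one node answer both a general question (all Person nodes) and a specific one (all Researcher nodes) without duplicating data.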
TABLE III. EXAMPLE OF MANUAL DATA LOADING

No 1
Query:
    MATCH (a:Researcher), (b:IntelPropRight)
    WHERE a.name STARTS WITH 'Moha'
      AND a.name ENDS WITH 'Afandi'
      AND b.title CONTAINS 'Mobile Business Intelli'
    CREATE (a)-[c:OWNS]->(b)
    RETURN a, b, c
Remarks: Three similar Researcher nodes own the same Intellectual Property Right, so the nodes have to be connected manually.

No 2
Query:
    MATCH (a:Researcher), (b:DbIndexer)
    WHERE a.name STARTS WITH 'Moha'
      AND a.name ENDS WITH 'Afandi'
      AND b.name = 'Scopus'
    CREATE (a)-[c:REG_AUTH_IN {reg_id: '57193856616', name_of_id: 'Scopus Author ID'}]->(b)
    RETURN a, b
Remarks: The Scopus ID is not available in Google Scholar, so we had to insert it manually.

A. Query Execution For Case 1

Fig. 2 provides an answer for the first test scenario.
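The first loading query in Table III matches two node sets and creates one relationship for every matching pair, which is why all similar Researcher nodes end up connected to the same intellectual property right. The following Python sketch imitates that pairing on plain dictionaries. It is only an illustration under assumptions: the function name load_owns and all sample names and titles are hypothetical, and the code does not call neo4j.

```python
from itertools import product

# Hypothetical sample data: three near-duplicate Researcher entries plus
# one unrelated entry, and a single matching intellectual property right.
researchers = [
    {"name": "Moha A Afandi"},
    {"name": "Moha B Afandi"},
    {"name": "Moha C Afandi"},
    {"name": "Another Researcher"},
]
ip_rights = [
    {"title": "Mobile Business Intelligence (sample title)"},
    {"title": "Unrelated sample title"},
]


def load_owns(researchers, ip_rights):
    """Return one OWNS triple per matching (researcher, right) pair."""
    matched_researchers = [
        r for r in researchers
        if r["name"].startswith("Moha") and r["name"].endswith("Afandi")
    ]
    matched_rights = [
        p for p in ip_rights if "Mobile Business Intelli" in p["title"]
    ]
    # Every combination of the two filtered sets gets its own relationship.
    return [
        (r["name"], "OWNS", p["title"])
        for r, p in product(matched_researchers, matched_rights)
    ]


relationships = load_owns(researchers, ip_rights)
for rel in relationships:
    print(rel)
```

With the sample data, three researcher entries and one right pass the filters, so three OWNS triples are produced.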
Fig. 4. Cypher query and result for case 2

Fig. 5. Corresponding SQL query for case 2

C. Query Execution For Case 3

Fig. 6. Cypher query and results for case 3

For case 3, the question expects one or more relations to exist between organizations in one way or another, whether directly or indirectly. A direct relation (one link away) would be something like Organization X HAS_MOU_WITH Organization Y. An indirect connection, on the other hand, would likely look like Organization X EMPLOYS a Person P who WRITES an Article which is CO_WRITE by a Person P2 who WORKS for Organization Y. In this case, there are four relations between Organization X and Organization Y. The answer in Cypher is as natural as it can be, as depicted in Fig. 6.

As demonstrated in cases 1, 2, and 3, the graph database and its queries answer inquiries efficiently compared to the relational database. These simple experiments are in line with previous studies that compared a graph database (neo4j) with other database systems [10][15].

The results also reveal that the graph queries formulated are very close to the natural language in which the multi-perspective inquiries are stated. For university managers or leaders, who are often asked in the middle of meetings to provide ad hoc information about their universities, the capabilities of graph queries are likely to be beneficial.

In addition to the reporting and analytical features supported by relational databases and queries, which most Executive Information Systems (EIS) currently have, organizations may need to develop graphical and analytical capabilities backed by graph databases and queries.

Previous studies [16], [17] added chatbot capabilities to existing EIS so that executives can view and obtain information about their universities through voice and text commands. The problem with those systems is that they know nothing beyond the specific knowledge embedded in them. The commands are limited because they and their associated SQL queries must be predefined. As a result, those systems have not addressed ad hoc, random requests for information.

The graph database and queries have great potential to overcome that problem. They can produce intuitive results even for blind questions that seem hard to comprehend and look meaningless, as shown in case 1 and case 2. The node-relationship-node pattern will arguably become the key to interpreting random requests for information, as long as we can map things to nodes in the database. Implementation will remain a challenge, however, especially when integrating data from different databases, often legacy systems, for which documentation is scarce or does not exist.

IV. CONCLUSION

The research graph database presented in this study has been shown to answer the inquiries efficiently compared to the relational database and to provide visually engaging insight.

However, the research reported in this paper is still at an early stage and is limited. In future work, the number of use cases covered, the variety of data integrated, and the analytical capabilities need to be increased and further explored to realize the full potential of graph databases for the benefit of universities.

ACKNOWLEDGMENT

The authors would like to thank Lembaga Penelitian dan Pengabdian Masyarakat – Universitas Pembangunan Nasional Veteran Jawa Timur for funding this research.
REFERENCES

[1] H. Lu, Z. Hong, and M. Shi, “Analysis of film data based on Neo4j,” in Proceedings - 16th IEEE/ACIS International Conference on Computer and Information Science, ICIS 2017, 2017.
[2] Y. Ma, Z. Wu, L. Guan, B. Zhou, and R. Li, “Study on the relationship between transmission line failure rate and lightning information based on Neo4j,” in POWERCON 2014 - 2014 International Conference on Power System Technology: Towards Green, Efficient and Smart Power System, Proceedings, 2014.
[3] S. Shaikh, S. Rathi, and P. Janrao, “Recommendation system in e-commerce websites: a graph based approached,” in Proceedings - 7th IEEE International Advanced Computing Conference, IACC 2017, 2017.
[4] V. Bajaj, R. B. Panda, C. Dabas, and P. Kaur, “Graph database for recipe recommendations,” in 2018 7th International Conference on Reliability, Infocom Technologies and Optimization: Trends and Future Directions, ICRITO 2018, 2018.
[5] I. N. P. W. Dharmawan and R. Sarno, “Book recommendation using Neo4j graph database in BibTeX book metadata,” in Proceeding - 2017 3rd International Conference on Science in Information Technology: Theory and Application of IT for Education, Industry and Society in Big Data Era, ICSITech 2017, 2017.
[6] N. Yi, C. Li, X. Feng, and M. Shi, “Design and implementation of movie recommender system based on graph database,” in Proceedings - 2017 14th Web Information Systems and Applications Conference, WISA 2017, 2018.
[7] T. Konno, R. Huang, T. Ban, and C. Huang, “Goods recommendation based on retail knowledge in a Neo4j graph database combined with an inference mechanism implemented in jess,” in 2017 IEEE SmartWorld Ubiquitous Intelligence and Computing, Advanced and Trusted Computed, Scalable Computing and Communications, Cloud and Big Data Computing, Internet of People and Smart City Innovation, 2018.
[8] Y. Zhu, E. Yan, and I. Y. Song, “The use of a graph-based system to improve bibliographic information retrieval: System design, implementation, and evaluation,” J. Assoc. Inf. Sci. Technol., 2017.
[9] D. Hristovski, A. Kastrin, D. Dinevski, and T. C. Rindflesch, “Constructing a graph database for semantic literature-based discovery,” in Studies in Health Technology and Informatics, 2015.
[10] B.-H. Yoon, S.-K. Kim, and S.-Y. Kim, “Use of graph database for the integration of heterogeneous biological data,” Genomics Inform., vol. 15, no. 1, p. 19, 2017.
[11] B. M. Sasaki, “The 5-minute interview: David Meza, Chief Knowledge Architect, NASA,” 2016.
[12] M. Boeker, W. Vach, and E. Motschall, “Google Scholar as replacement for systematic literature searches: good relative recall and precision are not enough,” BMC Med. Res. Methodol., 2013.
[13] E. D. López-Cózar, N. Robinson-García, and D. Torres-Salinas, “The google scholar experiment: How to index false papers and manipulate bibliometric indicators,” J. Assoc. Inf. Sci. Technol., 2014.
[14] A. S. Ahmar et al., “Lecturers’ understanding on indexing databases of SINTA, DOAJ, Google Scholar, SCOPUS, and Web of Science: A study of Indonesians,” J. Phys. Conf. Ser., vol. 954, pp. 0–17, 2018.
[15] M. Sharma, V. D. Sharma, and M. M. Bundele, “Performance analysis of RDBMS and No SQL databases: PostgreSQL, MongoDB and Neo4j,” in 3rd International Conference and Workshops on Recent Advances and Innovations in Engineering, ICRAIE 2018, 2018.
[16] M. I. Afandi, E. D. Wahyuni, and S. Mukaromah, “Mobile business intelligence assistant (m-BELA) for higher education executives,” 2019 4th Int. Conf. Inf. Technol. Inf. Syst. Electr. Eng. ICITISEE 2019.
[17] M. I. Afandi and E. D. Wahyuni, “Prototype of voice commanded university executive business intelligence assistant (BELA),” vol. 1, no. ICST, pp. 4–8, 2018.