
University of Magdeburg

School of Computer Science

Databases and Software Engineering

Master’s Thesis

Design and Implementation of a NoSQL Database for


Decision Support in R&D Management

Author:

Prem Sagar Jeevangekar

April 23, 2019

Advisors:

Prof. Dr. rer. nat. habil. Gunter Saake


Dipl.-Inf. Wolfram Fenske
Department of Computer Science

Dr. rer. nat. Matthias Plaue


MAPEGY GmbH
Jeevangekar, Prem Sagar:
Design and Implementation of a NoSQL Database for Decision Support in R&D
Management
Master’s Thesis, University of Magdeburg, 2019.
Abstract
The concept of a database was introduced in the early 1960s, and the relational database came into the picture in the early 1970s. Relational databases have had a great impact on data storage ever since. However, the exponential increase of data in the modern database world has led to the development of more efficient database technologies. When working with structured datasets, relational databases are reliable and efficient, but they lose efficiency when huge amounts of unstructured data are produced by real-world applications. To overcome the problems faced by relational databases, companies started looking for more reliable, flexible, highly scalable, and high-performance databases. In the early 2000s, NoSQL databases were introduced and gained huge popularity within a short period of time.
The main aim of this thesis is to design and implement a NoSQL database (MongoDB) and investigate its performance in terms of query execution speed. The thesis provides an overview of one relational database (PostgreSQL) and investigates the performance of a NoSQL database (MongoDB).
The current database is based on PostgreSQL and has performance issues: whenever large, complex queries are performed, the performance of the database is low due to the large number of join operations. To investigate the performance of a NoSQL database, the document-oriented MongoDB database, one of the most popular NoSQL databases, is used.
To compare the performance of MongoDB, the data from PostgreSQL is migrated to the MongoDB database. The data migration procedure is explained in detail in this thesis: the data is extracted from the PostgreSQL database and imported into MongoDB. By evaluating the performance of both databases, it is possible to decide which database is the better fit to provide high performance for the given data.
The evaluation of the two databases can support decision making for R&D management.
Acknowledgement
This thesis would not have been accomplished without the support and guidance of

Prof. Dr. rer. nat. habil. Gunter Saake    Dr. rer. nat. Matthias Plaue
Dipl.-Inf. Wolfram Fenske
from Otto von Guericke University Magdeburg and MAPEGY GmbH, Berlin.

I would like to thank the Computer Science department for providing me the opportunity to learn professional skills by encouraging me to participate in projects and guiding me in my learning stages.
I would like to thank my parents and friends who supported and encouraged me throughout the years of my education. This accomplishment would not have been achieved without their prayers and blessings.
Finally, I would like to thank God for giving me the strength to reach my goal.
Thank you
Contents
List of Figures viii

List of Tables ix

List of Code Listings xi

1 Introduction 1
1.1 Goal of the Thesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2 Readers Guide . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3

2 Technical Background 5
2.1 Decision Support in R&D Management . . . . . . . . . . . . . . . . . 5
2.2 Relational Database Management System . . . . . . . . . . . . . . . . 6
2.2.1 RDBMS Architecture . . . . . . . . . . . . . . . . . . . . . . . 8
2.2.2 Database Queries . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.2.3 PostgreSQL . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.3 NoSQL Databases . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.3.1 NoSQL Database Characteristics . . . . . . . . . . . . . . . . 12
2.3.2 Classification of NoSQL Databases . . . . . . . . . . . . . . . 13
2.3.3 MongoDB . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.3.3.1 MongoDB Architecture . . . . . . . . . . . . . . . . 18
2.3.3.2 Schema Design . . . . . . . . . . . . . . . . . . . . . 18
2.3.3.3 Query Model . . . . . . . . . . . . . . . . . . . . . . 19

3 Requirements and Concept 23


3.1 NoSQL Database Selection . . . . . . . . . . . . . . . . . . . . . . . . 26
3.2 A Framework for Decision Support in R&D Management . . . . . . . 29
3.3 Data Migration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.3.1 Embedding in MongoDB . . . . . . . . . . . . . . . . . . . . . 30
3.3.2 Data Migration Phases . . . . . . . . . . . . . . . . . . . . . . 31
3.4 MongoDB Query Model . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.4.1 Query Language . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.4.1.1 Aggregation Pipeline . . . . . . . . . . . . . . . . . . 35
3.4.1.2 Text Indexing . . . . . . . . . . . . . . . . . . . . . . 38

4 Design and Implementation 40


4.1 Design Process . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
4.2 MongoDB Core Process . . . . . . . . . . . . . . . . . . . . . . . . . 41
4.3 Tools Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
4.3.1 NoSQLBooster . . . . . . . . . . . . . . . . . . . . . . . . . . 42
4.3.2 R framework SHINY . . . . . . . . . . . . . . . . . . . . . . . 43
4.4 MongoDB Query Optimization . . . . . . . . . . . . . . . . . . . . . 49

5 Evaluation 53
5.1 Evaluation Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
5.1.1 Machine Used . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
5.1.2 Data Characteristics . . . . . . . . . . . . . . . . . . . . . . . . 54
5.2 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
5.2.1 Data Migration . . . . . . . . . . . . . . . . . . . . . . . . . . 54
5.2.2 Experiment Queries . . . . . . . . . . . . . . . . . . . . . . . . 56
5.3 Comparison between PostgreSQL and MongoDB . . . . . . . . . . . . 60
5.3.1 Impact of the Size of Datasets . . . . . . . . . . . . . . . . . . 65
5.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65

6 Related Work 67

7 Conclusion and Future Work 69


7.1 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
7.2 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
7.3 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70

A Code Listings 71
A.1 R Framework SHINY . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
A.2 SQL and MongoDB Queries . . . . . . . . . . . . . . . . . . . . . . . 74

Bibliography 83
List of Figures

2.1 Decision support for R&D management . . . . . . . . . . . . . . . . . 6


2.2 Codd’s relational model [2589] . . . . . . . . . . . . . . . . . . . . . . 7
2.3 RDBMS schema model . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.4 Three-level architecture of relational databases [RU15] . . . . . . . . 8
2.5 PostgreSQL Schema . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.6 Key value stores data model . . . . . . . . . . . . . . . . . . . . . . 14
2.7 Wide column stores architecture . . . . . . . . . . . . . . . . . . . . . 14
2.8 Graph data model . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.9 SQL & NoSQL database classification . . . . . . . . . . . . . . . . . . 16
2.10 MongoDB document store . . . . . . . . . . . . . . . . . . . . . . . . 17
2.11 MongoDB nexus architecture . . . . . . . . . . . . . . . . . . . . . . 18
2.12 MongoDB denormalized approach example . . . . . . . . . . . . . . . 19
2.13 MongoDB normalized approach example . . . . . . . . . . . . . . . . 19

3.1 PostgreSQL database model . . . . . . . . . . . . . . . . . . . . . . . 24


3.2 Data model of PostgreSQL database containing required data . . . . 25
3.3 Functional and non-functional requirements and techniques of MongoDB 28
3.4 Framework to support R&D management . . . . . . . . . . . . . . . . 29
3.5 Embedded data collection structure . . . . . . . . . . . . . . . . . . . 30
3.6 Embedded document structure . . . . . . . . . . . . . . . . . . . . . . 31
3.7 Data migration process . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.8 Implementation of data migration process . . . . . . . . . . . . . . . 33
3.9 Aggregation pipeline . . . . . . . . . . . . . . . . . . . . . . . . . . . 36

4.1 Design process . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41


4.2 NoSQLBooster Graphical User Interface . . . . . . . . . . . . . . . . 43
4.3 R Framework Shiny Application . . . . . . . . . . . . . . . . . . . . . 44
4.4 Output table in Shiny R web application . . . . . . . . . . . . . . . . 48

4.5 Query execution process in SHINY. . . . . . . . . . . . . . . . . . . . 48


4.6 Creating a single text index using NoSQLBooster . . . . . . . . . . . 49

5.1 Performance comparison on 10,000 records . . . . . . . . . . . . . . . 62

5.2 Performance comparison on 50,000 records . . . . . . . . . . . . . . . 62
5.3 Performance comparison on 100,000 records . . . . . . . . . . . . . . 63
5.4 Performance comparison on 1,000,000 records . . . . . . . . . . . . . 63
List of Tables

2.1 CAP theorem [MH13] . . . . . . . . . . . . . . . . . . . . . . . . . . 13


2.2 Terminology difference in MongoDB and RDBMS . . . . . . . . . . . 17

3.1 Top ten most popular databases [1] . . . . . . . . . . . . . . . . . . . 27

4.1 MongoDB version and NoSQLBooster Specifications . . . . . . . . . . 42

5.1 Used Machine Specifications . . . . . . . . . . . . . . . . . . . . . . . 53


5.2 Statistics of Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
5.3 Data Migration results . . . . . . . . . . . . . . . . . . . . . . . . . . 56
5.4 Queries for performance comparison from PostgreSQL and MongoDB 61
5.5 PostgreSQL Query results in seconds . . . . . . . . . . . . . . . . . . 64
5.6 MongoDB Query results in seconds . . . . . . . . . . . . . . . . . . . 65
5.7 Output of the results returned for each query . . . . . . . . . . . . . 65
List of Code Listings
3.1 Finding total number of scientific publications and patents for the
institutions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
3.2 Finding data duplication using aggregation . . . . . . . . . . . . . . . 32
3.3 MongoDB embedded document sample . . . . . . . . . . . . . . . . . 34
3.4 MongoDB query structure and example query . . . . . . . . . . . . . 35
3.5 MongoDB aggregation example query . . . . . . . . . . . . . . . . . . 36
3.6 MongoDB aggregation example query output . . . . . . . . . . . . . . 36
3.7 MongoDB aggregation example query with multiple stages . . . . . . 37
3.8 MongoDB aggregation example output . . . . . . . . . . . . . . . . . 37
3.9 MongoDB Text Indexing . . . . . . . . . . . . . . . . . . . . . . . . . 38
3.10 MongoDB Text Indexing . . . . . . . . . . . . . . . . . . . . . . . . . 39
4.1 Code for developing a Shiny R application user interface . . . . . . . 45
4.2 Code for developing a Shiny R application server side . . . . . . . . . 45
4.3 MongoDB Aggregation Stages for performance optimizations . . . . . 50
5.1 MongoDB Data Migration . . . . . . . . . . . . . . . . . . . . . . . . 55
5.2 Importing data into PostgreSQL database . . . . . . . . . . . . . . . 55
5.3 Importing data into PostgreSQL database . . . . . . . . . . . . . . . 56
5.4 MongoDB query execution . . . . . . . . . . . . . . . . . . . . . . . . 56
5.5 Aggregation query execution . . . . . . . . . . . . . . . . . . . . . . . 57
5.6 MongoDB query . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
5.7 MongoDB Ranking number of documents . . . . . . . . . . . . . . . . 59
A.1 Query covering some field of interest and get a list of relevant documents 71
A.2 Query covering some field of interest and get a list of organizations
ranked by number of patents, scientific publications matching the query 72
A.3 Query for an organization and get a list of collaborators, i.e., organizations with common documents; rank them by number of common patents and number of common scientific publications, at the user interface (the server side is shown in Listing A.4 on page 73) . . . . . . 73

A.4 Query for an organization and get a list of collaborators, i.e., organi-
zations with common documents; rank them by number of common
patents, number of common scientific publications at server side . . . 73
A.5 PostgreSQL query example 1 . . . . . . . . . . . . . . . . . . . . . . 74
A.6 PostgreSQL query example 2 . . . . . . . . . . . . . . . . . . . . . . 75
A.7 PostgreSQL query example 3 . . . . . . . . . . . . . . . . . . . . . . 75
A.8 PostgreSQL query example 4 . . . . . . . . . . . . . . . . . . . . . . 75
A.9 PostgreSQL query example 5 . . . . . . . . . . . . . . . . . . . . . . 75
A.10 PostgreSQL query example 6 . . . . . . . . . . . . . . . . . . . . . . 76
A.11 PostgreSQL query example 7 . . . . . . . . . . . . . . . . . . . . . . 76
A.12 PostgreSQL query example 8 . . . . . . . . . . . . . . . . . . . . . . 76
A.13 MongoDB query example 1 . . . . . . . . . . . . . . . . . . . . . . . 77
A.14 MongoDB query example 2 . . . . . . . . . . . . . . . . . . . . . . . 77
A.15 MongoDB query example 3 . . . . . . . . . . . . . . . . . . . . . . . 77
A.16 MongoDB query example 4 . . . . . . . . . . . . . . . . . . . . . . . 77
A.17 MongoDB query example 5 . . . . . . . . . . . . . . . . . . . . . . . 77
A.18 MongoDB query example 6 . . . . . . . . . . . . . . . . . . . . . . . 78
A.19 MongoDB query example 7 . . . . . . . . . . . . . . . . . . . . . . . 78
A.20 MongoDB query example 8 . . . . . . . . . . . . . . . . . . . . . . . 79
1. Introduction

In recent times, the number of internet users has been growing steadily, which leads to exponential growth of data: around 2.5 quintillion (10^18) bytes of data are created every day 1. For the past four decades, relational databases have had exclusive control over data storage, but some applications suffer from performance and scalability issues, so companies are looking for more reliable database systems. In this context, NoSQL databases were developed.
The thesis work was done at MAPEGY GmbH, a company that provides data-driven decision support in the fields of life sciences, energy, information & communication systems, industry, and finance & insurance, and that also provides data products for its customers. Research and Development (R&D) management plays a key role in supervising and managing the research departments of MAPEGY's customers. The primary objective of Research and Development is the development of new technology by applying creative ideas to improve knowledge based on patents, scientific publications, and market updates 2. It always focuses on vision and strategy from various perspectives, such as the financial perspective, the customer perspective, the internal business perspective, and the innovation & learning perspective. This thesis mainly focuses on the innovation & learning perspective, which includes project evaluation ideas for new projects [KvDB99]. R&D management helps to gain knowledge that can be used for practical implementation in the future.
MAPEGY GmbH uses a database based on PostgreSQL, which is a relational database and has some limitations. In this thesis, the data is taken from the company's PostgreSQL data warehouse, which contains huge datasets of patents, scientific publications, organizations, and so on. A detailed description of the data is given in Chapter 3 on page 23.
1 https://www.ibm.com/blogs/insights-on-business/consumer-products/2-5-quintillion-bytes-of-data-created-every-day-how-does-cpg-retail-manage-it/
2 https://www.civilserviceindia.com/subject/Management/notes/r-and-d-management.html

In spite of the fact that relational databases are the most common and are consistently good at storing and retrieving data, they have limitations in dealing with certain kinds of data. To deal with the data, a relational database needs a predefined schema for the data normalization process [KYLTC12]. One of the important limitations is that building relationships between entities is complex: very large queries over interconnected tables require JOIN operations to fetch the relevant information. Such cases have long response times, which makes querying costly and affects performance.
This thesis investigates the performance, in terms of query execution time, of PostgreSQL and one of the NoSQL databases. Although NoSQL databases were only introduced in the early 2000s, they have shown their ability to work with large unstructured and unrelated data. The main reason for their popularity is that they do not require a strict schema structure and provide high performance for large datasets. Unlike relational databases, NoSQL databases rely on denormalization, which means the data can be retrieved faster because no JOIN operations are involved. To assess this in terms of performance, one of the NoSQL databases is selected, designed, and implemented in order to decide whether it is more efficient than the PostgreSQL database. The resulting evaluation of the two databases helps in decision support for R&D management.
NoSQL is not a single technology. There are dozens of NoSQL databases offering various options; they are mainly categorized as key-value stores (example: Redis), wide-column stores (example: HBase), document-oriented databases (example: MongoDB), and graph databases (example: Neo4j) [HJ11b]. Choosing the right database is a key decision for a company, and R&D management plays an important role in making such decisions. One of the most popular document-oriented databases, MongoDB, is selected, because MongoDB provides the abilities of a relational database along with high flexibility, high scalability, and high performance.
The thesis contains information about the decision making involved in selecting a database and describes a process that IT managers or engineers can follow to select a database system for their requirements. The results of the thesis help to understand the advantages and disadvantages of the PostgreSQL and MongoDB databases. The data migration and query capabilities of MongoDB are evaluated in terms of performance by comparing them to the PostgreSQL database. Chapter 3 on page 23 and Chapter 4 on page 40 discuss the selection of the database and its implementation in detail.

1.1 Goal of the Thesis


The main aim of the thesis is to investigate the performance of a NoSQL database by selecting and designing the database, implementing queries, and comparing its performance to the database currently used in the company (PostgreSQL). Based on the results of the thesis, why and how one database has advantages over the other is discussed. In the thesis, the document-oriented MongoDB database is used for the practical implementation; it is compared with the PostgreSQL database to investigate whether MongoDB can overcome the limitations of PostgreSQL.
We aim to reach the following goals:

1. Migrating PostgreSQL database to the MongoDB database.

(a) Designing a data model which includes at least the following entities: scientific publications, patents, and essential metadata, which must contain the titles of the documents and the organizations (companies, research institutions); the list of organizations is extracted from the data warehouse and loaded into the MongoDB server.

2. To examine the querying capabilities of MongoDB, such as:

(a) Enter a query and retrieve the information related to a particular field of interest.
(b) Enter a query covering some field of interest and get all patents and scientific publications.
(c) Enter a query covering some field of interest and get a list of organizations and experts projected by document type 'PATENT' matching the query.
(d) Enter a query for an organization and get a list of organizations and experts, ranking them by the number of patents and the number of scientific publications.

We build an interactive prototype web application using the R framework "shiny" for querying data from the MongoDB database 3. By examining the query response times of both databases, their performance is evaluated.

1.2 Readers Guide


The thesis is structured as follows:

1. Chapter: Introduction
The first chapter gives a brief introduction to decision support for R&D management, the limitations of the company's currently used technology (PostgreSQL), and an introduction to NoSQL databases. The second part states the goal of the thesis, followed by the structure of the thesis.

2. Chapter: Technical Background


This chapter provides basic technical knowledge of decision support for R&D management and of RDBMSs. Furthermore, it covers the basics of the schema design and query models of the PostgreSQL and MongoDB databases.

3. Chapter: Requirements and Concept


The concept of the thesis is discussed in this chapter. The basic requirements, the limitations of the PostgreSQL database, the NoSQL database selection, the data migration process, and the MongoDB query model are the key points discussed.
3 http://www.rstudio.com/shiny/

4. Chapter: Design and Implementation


This chapter covers the design and implementation tasks. The data migration, the tools selected for the implementation, and MongoDB query optimization are discussed.

5. Chapter: Evaluation
In this chapter, the evaluation setup is discussed: the machine used, the datasets, and the experiments implemented. Then the query performance of PostgreSQL and MongoDB is compared and the results are evaluated.

6. Chapter: Related Work


In this chapter, we discuss related research papers, their approaches, and how they relate to this thesis.

7. Chapter: Conclusion and Future Work


This chapter provides a summary of the thesis; its conclusion and future work are discussed.
2. Technical Background
This chapter provides a detailed explanation of NoSQL databases. It is divided into multiple sections and provides a basic understanding of decision support in R&D management and of the fundamentals of relational and non-relational (NoSQL) databases.

2.1 Decision Support in R&D Management


To support the R&D decision-making process, information is collected from patent documents, scientific publications, and market resources. The decision-making process for selecting a database requires detailed technical knowledge of the database and its features; the decision is based on comparing its efficiency with that of other databases. This section gives a brief introduction to decision support in R&D management.
The innovation process generates technological ideas; these ideas are implemented and transformed into a new business model that makes profits for the company and creates a lead in the marketplace. A tremendous amount of research is carried out to develop new products, technologies, and processes, and R&D management plays a vital role in the growth of companies [VGPC+17]. For some years, efforts have been made to build better decision support for R&D management in companies. Most of these efforts concentrate on building decision models and improving decision-making methods. To improve the decision-making process, determined attempts have been made by researchers to use computer-based decision-making methods to support R&D activity [TML+02]. Figure 2.1 depicts the decision-making process for R&D activity and the resulting decision support for R&D management. The strategy for developing a new project should be designed to achieve the goal. For decision support in R&D management, the following factors should be considered.

Figure 2.1: Decision support for R&D management

• The scope of market opportunities is taken into account while designing the project.

• The new project is planned considering resources like cost, time, and the project development environment.

• Sufficient knowledge to face the technical difficulties in the project is needed.

• A lack of technical knowledge hinders the development of a new project, so missing knowledge should be identified and acquired.

The development of new projects requires a good understanding of current market needs. Knowledge of design and implementation is necessary in order to investigate the project efficiently according to market needs.
Decisions are taken by comparing the new technology to the existing technology. In this thesis, a relational and a non-relational database are compared in terms of database query performance (query execution speed). This comparison helps in making a decision when selecting a database.

2.2 Relational Database Management System


The relational database model was first introduced by an IBM employee named Edgar F. Codd [Cod82]. In an RDBMS, data is stored in tables. Key data integrity features of an RDBMS include primary key and foreign key constraints. Figure 2.2 shows how tuples (rows) and attributes (columns) form a table (relation), the basic building block of the relational model. Figure 2.3 illustrates the relations between tables using primary and foreign key constraints: consider an employee table with primary and foreign key indications, interlinked with other tables. Schema normalization is an important factor in designing a relational database schema. With normalized data, queries contain joins, which makes query operations complex, especially for large queries retrieving data from multiple tables 1.

Figure 2.2: Codd’s relational model [2589]

Figure 2.3: RDBMS schema model2

Figure 2.3 gives an example of a relational schema model. The tables contain information about employee details. The data is normalized, and related information is stored in different tables connected using primary and foreign key constraints. For instance, suppose the annual salary of an employee and the city of the employee are needed: John is an employee living in Berlin, and his annual salary is 120,000 Euros. The information about the employee is stored in different tables and can be retrieved at any time using a JOIN operation. This is a simple query that needs only a few JOIN operations. In a real-world scenario, there are large numbers of interconnected tables, and a complex query using JOINs is needed to retrieve the required information. In such cases, due to the large number of JOINs, the execution speed decreases, resulting in a performance problem.

1 https://www.objectivity.com/is-your-database-schema-too-complex/
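As a minimal sketch, such a query could look as follows (the table and column names are assumptions for illustration, loosely following the schema in Figure 2.3):

SELECT e.employee_name, e.annual_salary, c.city_name
FROM employee e
JOIN city c ON e.city_id = c.city_id
WHERE e.employee_name = 'John';

Each additional JOIN adds work for the query planner, which is why deeply normalized schemas pay a price on large, interconnected queries.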

2.2.1 RDBMS Architecture


In a relational database, it is important to understand how the data is structured, stored, and authorized. An RDBMS supports the standard three-level architecture with external, conceptual, and internal levels. As shown in Figure 2.4, the relational database model delivers both physical and logical data independence, which separates the levels of the three-level architecture.

Figure 2.4: Three-level architecture of relational databases [RU15]

The conceptual level, also known as the data model, describes the structure of the database for the end users; its schema represents the content of the whole database and the relationships among the data. The internal level, also known as the physical level, defines how the data is stored in the database and the layout of the records in files 3. The external level allows a particular group of authorized users to view the data in the database. With physical data independence, the internal schema can be changed without affecting the conceptual level, for example to improve performance [RU15]. Logical data independence allows changing the conceptual schema without affecting the external level [RU15]. There are numerous RDBMSs, such as Microsoft SQL Server, Oracle Database, MySQL, IBM DB2, PostgreSQL, and many more 4. In this thesis, the PostgreSQL database is used for the performance evaluation.
2 https://www.objectivity.com/is-your-database-schema-too-complex/
3 http://ecomputernotes.com/fundamental/what-is-a-database/data-independence
4 https://www.keycdn.com/blog/popular-databases

An RDBMS provides a useful feature called the ACID (Atomicity, Consistency, Isolation, and Durability) properties, which provide assurance of reliable transactions [BD04]. The PostgreSQL database provides these transaction properties, and transactions are controlled using transaction control operations (the commands BEGIN, COMMIT, END, and ROLLBACK). These commands cannot be used when dropping or creating new tables, as those statements are automatically committed in the PostgreSQL database.
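As a minimal sketch of such transaction control (the accounts table is an assumption for illustration):

BEGIN;
UPDATE accounts SET balance = balance - 100 WHERE id = 1;
UPDATE accounts SET balance = balance + 100 WHERE id = 2;
COMMIT;   -- or ROLLBACK; to undo both updates as one unit

Either both updates become visible or neither does, which is the atomicity guarantee of ACID.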

2.2.2 Database Queries


Data operations are done using queries. To retrieve the required information from tables, we use different SQL statements. The query mechanisms and the different syntaxes used to scan and fetch the data are described below.
A conjunctive statement has the syntax SELECT ... FROM ... WHERE and is used to display the required data from a table. Functional queries further comprise statements to create, manage, and control fields, used for data manipulation, database management, and access control, respectively. OLTP (online transaction processing) query execution is very fast. A DSS (decision support system) is used in the case of complex queries that retrieve data from a large database; due to the complexity of the queries, DSS is costly in terms of execution time and system resources.
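For illustration, a simple conjunctive query against the hypothetical employees table from above:

SELECT employee_name, employee_salary
FROM employees
WHERE employee_salary > 100000;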
Most relational databases are based on transactions. A transaction is a logical unit used to manage the data with operations such as update, delete, read, and create. The transaction characteristics are the ACID properties, which provide accuracy, data integrity, and completeness 5.
Systems are called OLTP systems when their database structures are designed with a focus on transactional processing. On the database level, the transactional operations aim at fast and powerful queries against the database; INSERT, UPDATE, and DELETE are the most used OLTP commands. OLTP is also used in interactive mechanisms like web services. For example, consider a banking transaction: there are many customers using their accounts, and the system must execute the operations performed. Even in the case of many concurrent transactions (that is, multiple users accessing the system at the same time), the operations must be completed effectively 6.
A DSS is a complex creation that integrates various components. For example, a business company's data has numerous components such as products, customers, orders, timestamps, regions, and many more. If such a company needs to extract information from the aggregated business data, the company's analysts must first model the data accordingly, and then analyze and load the data from the various data sources into the data warehouse. The queries are often complex and involve aggregation functions. For such complex queries, the execution speed is low; it can be improved by creating indexes [CDG01].

2.2.3 PostgreSQL
PostgreSQL is an open-source relational database. It aims to provide high performance, robustness, and reliability to its clients 7. PostgreSQL stores data in tables
5 https://iteritory.com/acid-properties-in-transactions-dbms/
6 https://francois-encrenaz.net/what-is-a-dbms-a-RDBMS-OLAP-and-oltp/
7 https://www.postgresql.org/

and generally accesses the data using the SQL language. Since PostgreSQL is a relational database, it requires a predefined data structure based on the application requirements. Related data is stored in different tables and accessed using the JOIN operation. PostgreSQL supports not only system-defined data types but also user-defined data types, index types, and procedural languages, which users can employ according to their requirements.

Schema Design
Every database must have at least one schema. A schema covers the content listed in Figure 2.5.

Figure 2.5: PostgreSQL Schema 8

Schema design helps in organizing and identifying a wide range of data in a finely-grained structure, providing a unique namespace. Whenever a new database is created, it must have at least one schema. Schemas are used for many different purposes: for example, for authorization control (when people use the environment simultaneously, one can create rules to access the database schema based on individual roles), for organizing database objects, for maintaining third-party SQL code, and for efficient performance 9.
Tables: Tables are created with the CREATE TABLE command, specifying the name of the table and the column names with their data types 10.
Range: A range is a data type usually used for selecting a range of values 11.
View: Suppose we need information that combines two different tables, but we do not want to write the query every time. For such situations, we use a view: we create the view once and then refer to it in place of the query. Consider tables with city and employee data; the example below shows how such a view is created 12.
8 https://hub.packtpub.com/overview-postgresql/
9 https://hub.packtpub.com/overview-PostgreSQL/
10 http://www.postgresql.org/docs/9.4/static/release-9-2.html
11 http://www.postgresql.org/docs/9.4/static/release-9-2.html
12 http://www.postgresql.org/docs/9.4/static/release-9-4.html

CREATE VIEW myview AS
SELECT city, employee_name, employee_salary, employee_address
FROM employees, city
WHERE city = name;

The view can then be queried with a simple command:

SELECT * FROM myview;

Functions: PostgreSQL provides various functions built from combinations of declarations, expressions, and statements. It also has built-in functions for system-defined data types. Functions can be used for different purposes: for instance, developers use functions as an interface for a programming language to conceal the data model, to perform complex logical queries, and for many other requirements. There are different possibilities for accessing the data in PostgreSQL 13. PostgreSQL allows triggers to be written in multiple procedural languages, such as the default system-defined language PL/pgSQL and C, C++, Python, or R in the case of user-defined APIs [Dou05].
Type: It is important to provide the right data type for each column in a table. For large datasets, changing a data type is costly, so to make the table efficient we must pick an appropriate data type from the start. Some of the most used data types are the numeric types, character types, and date and time types 14.
Every relational database supports default data types like text, integers, and Booleans, and also arrays of values with system-defined data types. When designing a schema, the table ensures the allocation of a data type to each column. A database provides a finite number of data types, and if it does not permit the definition of new data types, this decreases the flexibility of the data model [Mak15].
In PostgreSQL, these restrictions were eased somewhat by the introduction of the JSON data type in release 9.2 15 and its further development; the JSONB data type was introduced in version 9.4 16. JSON and JSONB enable storing unstructured data in a PostgreSQL database. The major difference between them is that for the JSON type the data is stored as an exact copy of the JSON input text, whereas JSONB data is represented in a binary format, i.e., binary code rather than UTF-8 or ASCII [Mak15].
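As a minimal sketch (the table and field names are assumptions for illustration), a JSONB column can be created and queried like this:

CREATE TABLE documents (id serial PRIMARY KEY, meta jsonb);
INSERT INTO documents (meta) VALUES ('{"title": "Patent A", "lang": "en"}');
SELECT meta->>'title' FROM documents WHERE meta->>'lang' = 'en';

The ->> operator extracts a JSON field as text, so schemaless attributes can be filtered without a predefined column for each of them.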
Indexes: Indexes are used to improve the performance of the database by scanning only the relevant data pages. An index works just like the index of a book: by following it, we reach the data containing the required term much faster, so it is easier to fetch the data using an index than to scan the whole book. The PostgreSQL database supports different types of indexes, from the simple B-tree index to geospatial GiST (Generalized Search Tree) indexes 17:

1. B-tree index (balanced tree): This is the default index type when no other type is given with the CREATE INDEX command. The data is indexed so that both sides of the tree remain almost equal. B-tree indexes are used with operators like =, <, >, <=, >= whenever a column is involved in a comparison.

2. Hash index: A hash index handles equality predicates. It is generally used less, because it does not help much in the PostgreSQL database.

3. GiST index: GiST is an infrastructure within which many different indexing strategies can be implemented. GiST indexes are generally used for geometric data types and also support full-text search.

4. GIN (Generalized Inverted Index): GIN is useful for complex structures, for instance arrays, and for full-text search.

5. BRIN (Block Range Index): It helps to arrange the data systematically by storing the lowest and highest values of each block. Partial indexes, multicolumn indexes, and unique indexes are some other kinds of indexes supported by the PostgreSQL database; a short example follows this list.

13 https://hub.packtpub.com/overview-PostgreSQL/
14 https://hub.packtpub.com/overview-postgresql/
15 http://www.postgresql.org/docs/9.4/static/release-9-2.html
16 http://www.postgresql.org/docs/9.4/static/release-9-2.html
17 http://www.postgresql.org/docs/9.1/indexes-types.html
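As a minimal sketch, again using the hypothetical documents table from above:

CREATE INDEX idx_documents_title ON documents ((meta->>'title'));  -- default B-tree index, here on an expression
CREATE INDEX idx_documents_meta ON documents USING GIN (meta);     -- GIN index, suited to JSONB containment queries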

The RDBMS architecture and database query techniques were explained above; the queries help to understand how data is retrieved from the database.
The PostgreSQL database is a relational database that also supports the unstructured data types JSON and JSONB. To retrieve the data, SQL (Structured Query Language) is used, and PostgreSQL supports full-text search and phrase search using text indexing. The data model helps to organize huge datasets into a proper structure with a unique namespace. In the case of large complex queries, the database needs to perform JOIN operations, which is costly and slows down the execution.
For such complex operations, NoSQL databases work effectively: because of their flexible schema structure, NoSQL databases do not require any JOIN operations. So, when a flexible schema is needed, NoSQL databases are the best fit.

2.3 NoSQL Databases


NoSQL databases are designed for large-scale datasets. First introduced in the early 2000s, NoSQL databases are well known as non-relational databases that do not store data in tables. NoSQL technology was developed for distributed storage of large datasets, for parallel computing across different servers, and to overcome the limitations of RDBMSs [MH13]. To interact with the data, every database has its own unique language. Big companies like Amazon, Facebook, and Google adopt this type of environment to cope with highly scalable data and massive read and write requests [HJ11a].

2.3.1 NoSQL Database Characteristics


With immensely growing data, it is important to find a new solution as an alternative to the existing ACID databases. A new solution was designed under the BASE model (basically available, soft state, eventual consistency). Scaling out the ACID properties creates a conflict between different aspects of availability in a distributed environment that is not fully solvable, resulting in the CAP theorem [Sim12].

Databases                        Consistency  Availability  Partition tolerance
RDBMS (MySQL, PostgreSQL, etc.)       +            +                -
Cassandra, CouchDB, etc.              -            +                +
MongoDB, HBase, etc.                  +            -                +

Table 2.1: CAP theorem [MH13]

Table 2.1 depicts that every database can achieve only two of these three characteristics in a distributed network.
Consistency: every user sees the same data, regardless of updates such as deletes, creates, and so on.
Availability: the data should be available to users in the network even if one of the nodes is down.
Partition tolerance: a partition-tolerant system can sustain operation even when communication within the network fails; it should maintain its characteristics and continue its operations.
Some NoSQL databases concentrate on availability and partition tolerance. This leads to the BASE model (Basically Available, Soft state, Eventually consistent), an alternative to the ACID properties. BASE has no transaction properties; it provides high availability and is simple and fast, but it has only weak consistency.

2.3.2 Classification of NoSQL Databases


NoSQL databases are categorized into four types: key-value stores, wide-column stores, document-oriented databases, and graph databases.

Key Value Stores


Key-value stores are one of the popular NoSQL database types. Contrary to some relational databases, they support elasticity and scalability and are easy to manage. The database stores keys in an alphanumeric format; the values are text strings or complex sets of strings, arrays, or lists, held in standalone tables. Data search is performed against keys, not values, so the key is always unique [MH13]. Figure 2.6 on the following page shows an example.
Different implementations of key-value stores are Dynamo, Redis, Riak, Voldemort, and BerkeleyDB [MH13]. Key-value databases are best suited for fast data access via keys and for scalable value retrieval in applications such as looking up product information, managing user sessions, and so on. There are several advantages to using key-value databases: they provide flexible data modeling.
18 https://medium.baqend.com/nosql-databases-a-survey-and-decision-guidance-ea7823a822d

Figure 2.6: Key value stores data model 18

Contrary to a traditional RDBMS, data retrieval does not involve fetching data from columns and rows. The architecture delivers high performance, is simple to operate, and handles a massive amount of load. Key-value databases also have some drawbacks: they have no common query language across the different implementations, and they mostly support only simple database operations [Moh16].

Wide Column Stores

Wide-column stores are also known as extensible record stores. The data is stored in tables that contain many rows, each with a unique row key. Consider a single row as

Figure 2.7: Wide column stores architecture 19

19 https://studio3t.com/whats-new/nosql-database-types

shown in Figure 2.7 on the preceding page: the first column is the row key (the unique identifier of a row), and each column has a column name that uniquely identifies it within the row. In other words, a wide-column store is a two-dimensional key-value store. The primary usage of wide-column (or column-family) stores is distributed data storage, large-scale data processing such as sorting, parsing, conversion between code values (for instance hexadecimal, binary, decimal), algorithm processing, batch-oriented processing, and predictive analytics. Wide-column databases generally deliver high performance when querying data and provide strong scalability. Many organizations, such as Spotify and Facebook, and Google's Bigtable model use wide-column stores [MH13].

Graph Database
Graph databases were first introduced in the early 1990s. At that time, the database community was turning to semi-structured data (in the 1990s, graph databases had no relation to semi-structured data), and the existing technologies were effective for most application requirements [AG08]. Today, graph databases are gaining interest for managing the relationships within massive datasets by internally connecting entities. Ideal use-case scenarios for graph databases are traversing social networks, pattern detection in forensic investigations, and biological networks [DS12]. The graph represents the connections between objects and illustrates the relationships between them [DS12]. Querying in a graph database is done by traversal [HR15]. Figure 2.8 is a simple example showing the

Figure 2.8: Graph data model 20

graph model, in which every node and every relationship has properties. Relationships are stated by dereferencing pointers, so a query can be executed with only one index lookup. This approach provides higher performance than relational databases, where retrieving information from multiple tables means executing the query via foreign keys, with multiple lookups on indexes. Maintaining performance even when the data volume increases is important for every database. A graph database offers higher performance when querying interconnected data than relational or other NoSQL databases do. Unlike an
20 https://blog.octo.com/en/graph-databases-an-overview

RDBMS, the performance of a graph database remains stable even when the datasets grow massively [HR15]. Popular graph databases are Neo4j 21, InfiniteGraph 22, and FlockDB 23.

Document Oriented Database (DODB)


DODBs store JSON or JSON-like documents in which data is encapsulated as key-value pairs. The key is a unique identifier, which also serves as the unique ID within a database collection; this ID identifies the document explicitly within the collection. Complex data can be handled effectively, and for work in a big data environment a DODB is the best fit unless there is a specific database for a specific purpose, such as a graph database [KSM17]. Since DODBs have a flexible schema, it is possible to modify or change it at any time. They provide high scalability and efficient performance and do not depend on any data format. Some of the most popular DODBs are MongoDB 24, CouchDB 25, and Riak 26. Figure 2.9 describes the unique characteristics of the different data models;

Data Model          Performance  Scalability  Flexibility  Complexity  Functionality       Examples
Key Value           High         High         High         -           -                   Redis
Wide Column         High         High         Moderate     Moderate    Minimal             HBase, Cassandra
Document Oriented   High         Variable     High         Low         Low                 MongoDB, CouchDB
Graph               Variable     Variable     High         Variable    Graph theory        Neo4j, InfiniteGraph
Relational          Variable     Variable     Low          Moderate    Relational algebra  MySQL, PostgreSQL

Figure 2.9: SQL & NoSQL database classification 27

there are many NoSQL paradigms in today's data world. Most NoSQL databases are open source or low cost and can be used for non-commercial or commercial purposes. For the practical implementation of this research, we use
21 https://neo4j.com/
22 https://www.objectivity.com/products/infinitegraph
23 https://blog.twitter.com/engineering/en_us/a/2010/introducing-flockdb.html
24 https://www.mongodb.com
25 http://couchdb.apache.org/
26 http://docs.basho.com/
27 https://www.slideshare.net/bscofield/nosql-codemash-2010

the MongoDB database and compare its performance with the company's database (i.e., PostgreSQL).

2.3.3 MongoDB
MongoDB is an open-source document-oriented database. Documents are grouped together in collections, as shown in Figure 2.10.

Figure 2.10: MongoDB document store28

RDBMS      MongoDB
Database   Database
Tables     Collection
Rows       Documents
Indexing   Indexing
Joins      Embedded documents or lookups

Table 2.2: Terminology difference in MongoDB and RDBMS 29

A collection is similar to a table in a relational database: a relational database contains rows and columns in a table, whereas MongoDB contains documents in a collection. When designing the database schema, consider Table 2.2, which shows the difference between relational and MongoDB terminology. The documents stored in a collection are BSON documents (binary-encoded JSON), consisting of key-value pairs. The MongoDB data model is flexible: the data from different tables can be embedded in a single collection, which increases query performance. MongoDB has its own query
28 https://www.slideshare.net/mongodb/mongodb-schema-design-practical-applications-and-implications
29 https://www.slideshare.net/mongodb/mongodb-schema-design-practical-applications-and-implications

language, which makes it easy to retrieve information from the database. Indexing, aggregation, and map-reduce are some of the powerful query features of MongoDB.

2.3.3.1 MongoDB Architecture


Relational databases have served the data world for decades and show a very mature approach to dealing with data. The design of MongoDB combines the proven abilities of the RDBMS with the newly introduced features of NoSQL databases. Figure 2.11 depicts the architecture blending RDBMS and MongoDB (NoSQL technology) key features. In the MongoDB database, the data is stored as JSON documents. Due to its rich query capabilities, MongoDB provides the abilities of relational databases along with high performance, flexibility, and scalability.

Figure 2.11: MongoDB nexus architecture 30

2.3.3.2 Schema Design


Before designing a schema, it is important to know the technical terminology of MongoDB (Section 2.3.3). The design of an efficient data model is based on the application requirements. The data model can be designed by embedding the data in a single collection or by referencing the data from different collections; both approaches can also be applied together, depending on the application requirements.
The embedded method is generally similar to the denormalization model. Figure 2.12 shows an embedded document: in the example, the contact and address data are embedded into a single document. The model can express one-to-one and one-to-many relationships within embedded documents. The main advantage of embedding is that the database retrieves the information without the use of a JOIN operation; as a result, there is an increase in performance 31.
30 https://www.mongodb.com/white-papers
31 https://www.mongodb.com

Figure 2.12: MongoDB denormalized approach example31

Figure 2.13: MongoDB normalized approach example 31

The referencing data model is also known as the normalized data model. Figure 2.13 shows an example: the data is retrieved by referencing (using the unique object ID). This model is best suited for designing models for huge datasets 31.
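As a minimal sketch (the field names and values are assumptions for illustration), the two approaches might look like this:

// embedded: the address lives inside the user document
{ "_id": 1, "name": "John", "address": { "city": "Berlin", "zip": "10115" } }

// referenced: the user document points into a separate addresses collection
{ "_id": 1, "name": "John", "address_id": 100 }
{ "_id": 100, "city": "Berlin", "zip": "10115" }

The embedded form is read with one query; the referenced form avoids duplication when many documents share the same address.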
2.3.3.3 Query Model
Some important aspects of the query model are discussed in this section. MongoDB supports the mongo shell, Python, Ruby, R, Scala, C#, and many more programming languages for querying data from the MongoDB database 31.
Query options
MongoDB has its own query language to perform operations like finding the number of documents, matching, ranking, and projecting. MongoDB also supports range queries and other expressions like $gte (greater than or equal), $lt (less than), and $ne (not equal). To match elements inside an array against several criteria, the $elemMatch operator is used. The operators are case sensitive 31.
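For illustration, a query using such operators might look like this (the collection and field names are assumptions):

db.documents.find({ year: { $gte: 2010 }, authors: { $elemMatch: { country: "DE" } } })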

Indexing
Indexing is an important feature of MongoDB. Indexes provide efficient results when querying the data in a collection. Indexing is done on a specific field, on multiple fields, or on the whole collection. When we query data without an index, MongoDB must inspect every document in the collection, which lowers the execution speed. With an index, MongoDB restricts the number of documents to examine in the collection. When a collection is created, MongoDB creates a unique index (the default _id index), which prevents the insertion of two documents with the same _id value. MongoDB integrates various index types, namely text search indexes, single-field indexes, and compound indexes; partial and unique are some of the properties of indexes 32.
By default, MongoDB creates the default index on the _id field; a single-field index is created on any one field. Extending single-field indexing, a compound index is created by indexing two or more fields of a collection, as sketched below.
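A minimal sketch of both kinds (the field names are assumptions):

db.documents.createIndex({ year: 1 })               // single-field index, ascending
db.documents.createIndex({ year: 1, country: -1 })  // compound index over two fields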

Text indexing
Text indexing is performed on string content fields, on a specific column or on multiple columns of a database. MongoDB supports partial and full-text search. Text indexes are applied to a single field, to multiple fields, or via the wildcard specifier (i.e., $**) for indexing every field that contains text content in a collection. There are certain limitations of text indexing. For instance, suppose we create a single text index on a field for text queries:

db.<collection>.createIndex({ "topic" : "text" })

The index is created. When we then try to create a compound text index on the same collection, as shown below, MongoDB throws an error stating that a text search index already exists.

db.<collection>.createIndex({ "topic" : "text", "abstract" : "text" })

So, if we want to create another text index on the same collection, we must first drop the previously existing text index

db.<collection>.dropIndex("topic_text")

and then create the new one, depending on the requirement.

db.<collection>.createIndex({ "topic" : "text", "abstract" : "text" })

So, when creating a text index, it should be noted that there are certain rules for performing indexing on a column in a database.
32 https://docs.mongodb.com/manual/indexes/

Full Text Search Support


MongoDB's full-text search provides several features:

1. Stop words: words like a, an, the, at, etc., are filtered out of the content.

2. Stemming: words like standing and happening are reduced to their root words stand and happen, respectively.

3. Scoring: ranking the most relevant words depending on the search result.

With MongoDB text indexing, it is possible to search text in one field (single-field text index), in multiple fields (compound text index), or in all the text of a collection using a wildcard specifier index 32. MongoDB also supports phrase search: it fetches the documents that contain relevant information based on the given phrase.
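For illustration, assuming the text index on topic created above, a word search and a phrase search might look like this:

db.<collection>.find({ $text: { $search: "database" } })
db.<collection>.find({ $text: { $search: "\"NoSQL database\"" } })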

Aggregation
Query operations can be performed with different methods in the MongoDB database:

1. Aggregation pipeline.

2. Map-reduce.

3. Other aggregation functions.

In the aggregation pipeline, the data undergoes a set of operations in multiple pipeline stages, such as $match, $project, and $group, that process the data and return the result. Using the pipeline provides efficient data operations; the aggregation pipeline is the preferred method for aggregating data in MongoDB 32.
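A minimal sketch of such a pipeline (the collection and field names are assumptions):

db.documents.aggregate([
  { $match: { doc_type: "PATENT" } },              // keep only patents
  { $group: { _id: "$country", n: { $sum: 1 } } }, // count patents per country
  { $sort: { n: -1 } }                             // rank countries by count
])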
Map-reduce is a data processing model for running queries over large amounts of data and returning a result. In the map phase, MongoDB emits key-value pairs; if the same key has multiple values, the reduce phase is applied to aggregate them, and the results are stored in a collection. MongoDB map-reduce functions run as JavaScript, which provides high flexibility [32].
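A minimal map-reduce sketch in the mongo shell, counting documents per doc_type in the doc collection (an illustrative example, not the thesis's production query):

db.doc.mapReduce(
    function () { emit(this.doc_type, 1); },               // map: emit key-value pairs
    function (key, values) { return Array.sum(values); },  // reduce: aggregate values per key
    { out: "doc_type_counts" }                             // results are stored in a collection
);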
MongoDB provides some other functions such as distinct, count, and limit [32], illustrated below. MongoDB also provides other important administration and maintenance features such as configuration, scalability, persistence, sharding, availability, and security.
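A brief sketch of these helper functions on the doc collection:

db.doc.distinct("country_code")                 // unique values of a field
db.doc.count({ doc_type: "PATENT" })            // number of matching documents
db.doc.find({ doc_type: "PATENT" }).limit(5)    // cap the result set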
In this chapter, a brief introduction to decision support for R&D management was given. The R&D activity gives us basic knowledge of the factors involved in developing a new project. The relational database details were then explained: the data model is designed using a strict schema structure, and the standard SQL language is used to query the data. PostgreSQL is one of the most advanced relational databases in today's world; its schema and query model, along with the types of indexes, were discussed.

NoSQL databases play a key role in managing unstructured data. They are mainly classified into four types, each with unique features; the usage of these databases depends on the application requirements. Furthermore, the rich document-oriented MongoDB database was explained. MongoDB has the capabilities of a relational database along with additional features such as flexibility, scalability, and high performance. The data is stored as JSON documents, which are represented in BSON format. There are two data modeling techniques in MongoDB, the embedded and the referencing data model, each with its own purpose; the two approaches can also be combined to develop a database, especially for large data sets. MongoDB has its own query language to retrieve relevant information from the database, and MongoDB indexing helps minimize the data scanned, which results in high performance.
3. Requirements and Concept
This chapter discusses the structure of the data in the PostgreSQL database server, its limitations, the applicability of NoSQL technology for migrating data from PostgreSQL to a NoSQL database, and the investigation of query performance on the data taken from PostgreSQL. The investigation concludes which of the two databases (PostgreSQL or the NoSQL database) is the better fit in terms of query performance.
The data used for this thesis is taken from Mapegy's PostgreSQL database server. The information in the database provides data-driven decision support to R&D managers. The database provides updates of data containing information on organizations and expert profiles, offering innovation insights for R&D managers, investors, and business analysts. The decisions focus on factors such as investment decisions, technology decisions, and decision making based on resources, cost, and time; this thesis mainly focuses on technology-oriented decision support. The database contains huge data sets covering patents, science, documents, organizations, experts, metadata, and so on, with information about date and time, title, country, language, and many more attributes. Furthermore, it also records the time and date when records were last updated. For such a huge data set, the hierarchical structure is modeled in different tables with primary keys, connected using link tables. The data is derived from millions of research papers, articles, patent offices, social media, and many other innovation-related platforms.
The database is designed with various tables (Figure 3.1) of the following types:

1. Entity tables contain the basic information of entities.

2. Link tables contain information about the connections between the entity tables.

3. KPI tables (key performance indicator) provide indicators to score, e.g., assets. KPIs are used to evaluate a company's success rate in achieving business-oriented goals.

4. Stat tables contain statistics such as document counts; they monitor the number of records present in the database and the number of deleted records.

5. Trend tables provide the number of records per year (or other time interval) in the database.

Figure 3.1: PostgreSQL database model

The data used in this thesis is taken from three different tables: the entity_docs table, the entity_players table, and the link_players_docs table. The chosen data sets provide information about patents and scientific publications from millions of sources. The data is used by R&D managers to select relevant information depending on the requirement.

• entity_docs: This table holds the patents and scientific publications. It also provides information about each document, such as its title, abstract, metadata, date and time inserted, last update, and so on.
• entity_players: This table provides information about institutions and expert profiles, including the address of the institution or expert, the country code they belong to, global positioning, and many more attributes.

• link_players_docs: This table provides the connection between the two tables above. Figure 3.2 shows the relationship between the tables, including the number of columns and the connections between them.

Figure 3.2: Data model of PostgreSQL database containing required data

The PostgreSQL database has a query performance issue here. The database contains various tables, which are normalized and linked together using primary and foreign key constraints. When a complex query requiring many JOIN operations is executed to collect relevant information from multiple tables, the JOINs take time to retrieve the information, which decreases query performance. For a better explanation, consider the complex query in Listing 3.1.
The query works as follows:

• select the columns to be displayed;

• filter by doc_type (document type), counting SCIENCE documents as nb_science and PATENT documents as nb_patent;

• count the SCIENCE and PATENT documents in the data_warehouse schema (PostgreSQL database server) by connecting the entity_docs table and the entity_players table via the link_players_docs table for a given phrase;

• group the result by player_id;

select
    z.player_id,
    z.player_type,
    z.player_sub_type,
    z.player_name,
    z.country_code,
    z.address,
    count(*) filter (where doc_type = 'SCIENCE') as nb_science,
    count(*) filter (where doc_type = 'PATENT') as nb_patent
from data_warehouse.link_players_docs x
join data_warehouse.entity_docs y
    on x.doc_id = y.doc_id and y.doc_type in ('PATENT', 'SCIENCE')
join data_warehouse.entity_players z
    on z.player_id = x.player_id and player_type = 'INSTITUTION'
where tsv_full_text @@ phraseto_tsquery('video game console')
group by z.player_id
order by nb_patent desc
limit 500;

Listing 3.1: Finding total number of scientific publications and patents for the
institutions

• sort the result in descending order of nb_patent;

• limit the query output to 500 rows.

The query retrieves all INSTITUTIONS that have published scientific publications and patents matching the given phrase ('video game console'). Listing 3.1 shows that retrieving data from multiple tables requires JOINs; this process is time-consuming and decreases performance. To find an alternative solution, one of the NoSQL databases is selected: we design the data model, migrate the data from the PostgreSQL database, and implement the different queries mentioned in Section 1.1 of the introduction chapter. By comparing the query performance of the existing database and the NoSQL database, we decide which of the two (PostgreSQL or the NoSQL database) provides higher performance.

3.1 NoSQL Database Selection


One of the NoSQL databases is selected for implementing queries on the data extracted from the PostgreSQL database. The performance evaluation between the existing and the selected database supports the decision for selecting a database; the evaluation is carried out on the data extracted from PostgreSQL. NoSQL databases have unique properties and differ in consistency, performance, scalability, and reliability [HELD11]. As discussed in Section 2.3.2, NoSQL databases are classified into four different categories.

Rank (2019)  DBMS                  Model            Score (2019)
1            Oracle                RDBMS            1268.84
2            MySQL                 RDBMS            1154.27
3            Microsoft SQL Server  RDBMS            1040.26
4            PostgreSQL            RDBMS            466.11
5            MongoDB               Document store   387.18
6            IBM DB2               RDBMS            179.85
7            Redis                 Key-value store  149.01
8            Elasticsearch         Search engine    143.44
9            Microsoft Access      RDBMS            141.62
10           SQLite                RDBMS            126.80

Table 3.1: Top ten most popular databases [1]

The criteria for selecting the NoSQL database in this thesis are based on important factors such as high performance, marketability, reliability, and open-source support. The most widely used databases are ranked by the DB-Engines ranking, which is based on different aspects such as how widely a database is used, the frequency of job offers, and technical discussions [1]. For instance, Table 3.1 shows the ten most popular databases according to the DB-Engines 2019 ranking; it is clear that RDBMSs are still the most popular databases. As discussed for Listing 3.1, performing a large query that retrieves information from multiple tables decreases query performance. In such scenarios, NoSQL databases are useful.
This chapter explains the requirements on the NoSQL database for working with the data extracted from the PostgreSQL database. We consider the set of features that should be integrated into the database; the NoSQL database is chosen such that it matches the features of the PostgreSQL database. The chosen database should have the following important features:

1. The chosen database must be available at any time.

2. It should be flexible enough to store complex data.

3. An important part of the database is the analysis of stored data, so the database needs to support many data analytics features. The NoSQL database should support query executions similar to the company's database.

4. Easy query operations that effectively increase performance.

5. Text indexing support.

6. Documents in JSON format.

[1] https://db-engines.com/en/ranking

There are various rich document-oriented databases; MongoDB and CouchDB are the most widely used [1]. The query execution speed of MongoDB is faster compared to CouchDB [Bha16]. CouchDB uses an elegant map-reduce syntax for querying data [2], whereas MongoDB has its own query language, which is easy to learn for people with SQL knowledge; additionally, it provides a map-reduce function [3]. MongoDB supports rich document-oriented full-text search and provides a very flexible schema (data model), so the data can be represented simply and queried efficiently without any join operations. Figure 3.3 shows the functional requirements, non-functional requirements, and techniques of MongoDB; the techniques connect the functional and non-functional system properties that MongoDB supports. Scanning queries, filtering, full-text search, analytics, and conditional writes are functional requirements. Among these, queries, indexing, and analytics are our primary requirements for implementing the data in the MongoDB database. The query capabilities of MongoDB guarantee consistency: data can be retrieved using a unique ID, and secondary indexes are employed to perform queries efficiently. The indexes reduce the number of scanned items, which provides high query performance. Full-text search is performed using text indexing in the database.
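A small mongo shell sketch of the two access paths just described; the field names and values are illustrative assumptions:

db.doc.find({ doc_id: 191987233 })          // retrieval via a unique identifier
db.doc.createIndex({ country_code: 1 })     // secondary index on a frequently queried field
db.doc.find({ country_code: "JP" })         // now served by an index scan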

Figure 3.3: Functional requirements, non-functional requirements, and techniques of MongoDB


[2] https://couchdb.apache.org/
[3] https://www.mongodb.com/white-papers

3.2 A Framework for Decision Support in R&D Management
This thesis investigates the query performance of the PostgreSQL and MongoDB databases. We implement a NoSQL database framework to support decisions in R&D management. The framework aims to address the limitations of the current system and to analyze the performance of the proposed approach. The framework is categorized into four sections for decision support in R&D management (Figure 3.4). The idea for the framework design is taken from an article [SKC16].

Figure 3.4: Framework to support R&D management

1. Storage layer: In this layer, the unstructured data is imported into MongoDB. The data in MongoDB is stored in BSON (Binary JavaScript Object Notation) format [SKC16].

2. Analysis layer: For this thesis, the queries used for data analysis in MongoDB need features such as indexing, full-text search, sorting, matching, grouping documents, projecting fields, and so on [SKC16].

3. Application layer: We used the R framework Shiny for an interactive web application. Using the R programming language, the data is queried and the result is displayed in the web application [SKC16].

3.3 Data Migration
The PostgreSQL database server contains various huge data sets. According to the requirements, data migration is carried out from the entity_docs table, which contains information related to patents, scientific publications, and metadata; the entity_players table, which provides information about organizations, experts, and research institutions; and the link_players_docs table, which contains information about the connections between the tables. The schema in Figure 3.2 shows the number of columns and the connections between the tables.
Designing a MongoDB data model differs from designing a PostgreSQL data model; MongoDB does not need a schema design for the data modeling in this work. There are two design patterns for a MongoDB data model. The first option is migrating every database table into a separate collection in the target database, which is a pragmatic approach. The second option is embedding multiple tables into a single collection; embedding the related data as documents is simple. In this thesis, the data extracted from the PostgreSQL database is embedded into a single collection in the MongoDB database, since the extracted data consists of related information that can be embedded into one collection. The advantage of embedding the documents into a collection is that references and lookups are avoided; an embedded collection is faster in query execution, which results in high performance. However, due to the lack of joins, there is data redundancy, which results in high memory usage.

3.3.1 Embedding in MongoDB
For a better understanding of embedding in MongoDB, consider an example: the relational database in Figure 3.5 is designed with proper normalization. Data normalization decreases data redundancy, increases data integrity, and lowers space consumption [KGK14]. MongoDB, however, does not support joins, so the data is denormalized. In Figure 3.5, Table 1 and Table 2 are connected with primary and foreign keys, whereas Table 3 shows the denormalization: the joins are replaced by merging the two tables. Table 4 shows the structure of the embedded document. Embedding is similar to denormalization; in an embedded collection, the foreign key relationship is embedded as an array of documents, as the following sketch shows.
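A hedged sketch of this idea with illustrative values: the rows that the link table would otherwise join are folded into an array of sub-documents inside the parent document.

{
    "player_id": 113654298,
    "player_name": "Example Corp",       // illustrative value
    "docs": [                            // former foreign-key rows, now embedded
        { "doc_id": 191987233, "doc_type": "PATENT",  "title": "HEATER" },
        { "doc_id": 191987234, "doc_type": "SCIENCE", "title": "..." }
    ]
}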

Figure 3.5: Embedded data collection structure



Figure 3.6: Embedded document structure

3.3.2 Data Migration Phases
There are three phases in the data migration process, namely planning, migration, and data validation [MD18].

• Planning: In the planning phase, the hardware specification, data configuration, and features of the target database (MongoDB) are examined. Proper planning helps decrease inherent risks such as missing data, data duplication, data inconsistency, and quality issues.

• Data migration: In this phase, the data is migrated by extracting it from the source database (PostgreSQL) and importing it into the target database (MongoDB) using the mongo shell.

• Data validation: the process of measuring data quality after the migration. It is tested by simply querying the number of documents (rows and columns) in the target database (MongoDB) and comparing it to the source database (PostgreSQL). If the counts are the same in both the source (PostgreSQL) and the target (MongoDB), the data migration is successful; a sketch of this check follows the list.
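A minimal sketch of the validation check in the mongo shell; the corresponding PostgreSQL count is obtained with the analogous SQL statement (names as used in this thesis):

// MongoDB side: total number of migrated documents
db.doc.countDocuments({})
// PostgreSQL side, for comparison (run in psql):
//   SELECT count(*) FROM data_warehouse.entity_docs;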

For the migration, the data has to be extracted and restructured from the PostgreSQL database, transformed, and loaded into the MongoDB database. Figure 3.7 shows how the data migration is carried out from the PostgreSQL database to MongoDB. The data is initially copied from the PostgreSQL database server in CSV (comma-separated values) format, transformed, imported into MongoDB, and validated. The CSV data is directly imported using the mongo shell; this type of data is known as pass-through data. The data is then checked for duplicates by querying it with the aggregation in Listing 3.2; if duplicates are present, they are deleted. The data migration is designed such that the data extracted from PostgreSQL matches the data in MongoDB, i.e., the MongoDB database contains the same amount of information, with the number of rows equal to the source database (PostgreSQL).

db.doc.aggregate([
    // Group all doc_id's together
    { $group: {
        _id: { doc_id: "$doc_id" },
        // Collect the documents that share the same doc_id and count the group members
        duplicate: { $addToSet: "$_id" },
        count: { $sum: 1 }
    } },
    // Keep only the groups with more than one document
    { $match: {
        count: { "$gt": 1 }
    } },
    // Sort in descending order
    { $sort: {
        count: -1
    } }
]);

Listing 3.2: Finding data duplication using aggregation
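The deletion step itself is not shown above; a hedged sketch of one way to remove the duplicates found by Listing 3.2 in the mongo shell is:

db.doc.aggregate([
    { $group: { _id: "$doc_id", dups: { $addToSet: "$_id" }, count: { $sum: 1 } } },
    { $match: { count: { $gt: 1 } } }
]).forEach(function (group) {
    group.dups.shift();                           // keep one document per doc_id
    db.doc.remove({ _id: { $in: group.dups } });  // delete the remaining duplicates
});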

Figure 3.7: Data migration process

The implementation of the data migration is explained in Figure 3.8, and the migration is carried out following the procedure shown there. The data model focuses specifically on entity_docs, entity_players, and link_players_docs from the PostgreSQL data warehouse. A sample document after embedding the three tables is shown in Listing 3.3.

Figure 3.8: Implementation of data migration process

The data is extracted from the PostgreSQL database as a CSV file and imported into MongoDB using the mongo shell. After the migration, the data is checked for duplicates, and duplicate documents are removed from the MongoDB database. MongoDB provides a flexible schema structure: the data is modeled as an embedded collection, denormalized, and merged into a single collection. The data migration is carried out in three phases, namely planning, data migration, and data validation; the procedure involved was discussed in this section.

3.4 MongoDB Query Model
In this section, the query model, its syntax, and the indexes used for applying the business logic are discussed.

3.4.1 Query Language
In MongoDB, queries are expressed as JSON objects. The prototype is built with R programming as a user-friendly, interactive web application using the R framework Shiny. Initially, the data is imported using the mongo shell, an interactive interface based entirely on JavaScript. The queries used and the time taken to execute them are discussed in the implementation chapter.
The database structure involves the following basic components for querying the data.

{
    "_id": ObjectId("5bccf454219bda137b689c3f"),
    "player_id": 113654298,
    "doc_id": 191987233,
    "player_doc_link_type": "{APPLICANT}",
    "last_update": "2018-02-14 14:18:25.685435",
    "pos": "",
    "score_player_doc": 1,
    "date_inserted": "2018-02-14 14:18:25.685435",
    "doc_source": "PATSTAT",
    "doc_source_id": 449257868,
    "title": "HEATER",
    "country_code": "JP",
    "doc_timestamp": "2014-06-25 00:00:00",
    "language_code": "EN",
    "web_link": "http://worldwide.espacenet.com/searchResults?query=JP20140130564",
    "image_link": "",
    "publisher": "patent office (JP)",
    "series": "",
    "doc_type": "PATENT",
    "doc_sub_type": "",
    "tsv_title": "'heater':1",

    ....
}

Listing 3.3: MongoDB embedded document sample

• A collection name, which indicates where the desired documents reside in the database. First, the data of a single table (entity_docs) is migrated from PostgreSQL into a MongoDB collection (main_data). Second, the migration is performed by embedding the documents into a single collection of 23.7 million documents (entity_docs). The data is migrated in both forms to compare the query execution speed of the MongoDB and PostgreSQL databases: queries are performed on the single table, which involves no joins (entity_docs), and on the interlinked tables embedded as a single collection in the MongoDB database, in order to investigate the execution speed of queries without and with JOIN operations, respectively.

• A querying method, which specifies how the data is investigated or retrieved from the MongoDB database. The aggregation pipeline is used for querying the documents; its detailed explanation follows in Section 3.4.1.1.

Query Structure
MongoDB has different query methods: the find() method, the aggregation pipeline, and map-reduce. The MongoDB query structure is explained with an example in Listing 3.4, which illustrates the structure with a sample query that fetches 1000 documents

whose title field is HEATER. The query is prefixed with db, followed by the collection name on which the query is applied. The query command find() is used to find the documents fulfilling the desired output.

# structure of a query
db.<collection_name>.<query_command>(<query_document>)
    .projection(<projection_document>)
    .limit(<number>)

# example
db.data.find({ "title": "HEATER" })
    .projection({})
    .limit(1000)

Listing 3.4: MongoDB query structure and example query

3.4.1.1 Aggregation Pipeline
In this section, the working procedure of the aggregation pipeline is discussed using an example query; the aggregation pipeline is discussed because it is used for developing the queries in this thesis.
In MongoDB, the aggregation pipeline is the preferred method for huge data collections or large sets of collections. It has a built-in query optimizer, makes data processing across stages easy, and gives optimal results [4].
Speed and consistency of data access and retrieval are important factors for evaluating the performance of a database. In the aggregation pipeline, the computation of the result is executed at each stage of the operation; the result of each stage is evaluated and passed on. Execution in the aggregation pipeline is based on the proper ordering of the operators and on data dependencies.
The aggregation pipeline produces its output by retrieving the required data from a collection. Figure 3.9 shows the process of data aggregation: when data is requested by an aggregation query, MongoDB identifies the requested data including the aggregation operators, analyzes the request, and executes it in different stages depending on the query. The transformed output data is then returned following the aggregation procedure.
To understand the aggregation pipeline better, consider a collection called doc that contains many documents; a sample document is shown in Listing 3.3. The sample query in Listing 3.5 explains how a simple aggregation using a $group operation is performed: passing through the aggregation pipeline, the query counts the total numbers of patents and scientific publications. The result of the query is shown in Listing 3.6, where the query has run through a $group stage.
[4] https://www.qualiero.com/community/mongodb/mongodb-theorie/mongodb-aggregation-pipeline-in-5-minuten.html

Figure 3.9: Aggregation pipeline

db.doc.aggregate([
    { "$group": { "_id": "$doc_type", "number_records": { "$sum": 1 } } }
]);

Listing 3.5: MongoDB aggregation example query

{
    "_id": "PATENT",
    "number_records": 798666
},
{
    "_id": "SCIENCE",
    "number_records": 201334
}

Listing 3.6: MongoDB aggregation example query output

The aggregation pipeline runs through different stages and transforms the documents as they pass through the pipeline. Now consider the example with multiple stages in Listing 3.7. The aggregation pipeline there is an array of expressions, where every expression is a stage; the stage operator indicates the operation performed in that stage. The aggregation pipeline processes the documents through the pipeline, and each stage references the output of the previous stages.

1. First stage: The pipeline starts with a $match operator, which finds all documents related to the particular field of interest. The pipeline matches the given text "video" via $match (text search is only possible after text indexing; the text indexing process is discussed in Section 3.4.1.2). After the text search, the pipeline passes through another $match operator in the same stage, which finds all patent documents.

2. Second stage: From the output of the $match stages, the pipeline passes to the $group stage. The total number of experts (player_name) and their countries of origin are grouped using the $group operator; the accumulator $sum generates the total number of records for the given query.

3. Third stage: $project is used to project the required columns. The fields to project are selected with the $project operator applied to the output of the first and second stages.

4. Fourth stage: The output documents are sorted in descending order of the number of records using the $sort operator.

db.doc.aggregate([
    { "$match": { "$text": { "$search": "VIDEO" } } },
    { "$match": { "doc_type": "PATENT" } },
    { "$group": { "_id": { "player_name": "$player_name",
                           "country_code": "$country_code" },
                  "number_records": { "$sum": 1 }
    } },
    { "$project": { "player_name": "$player_name", "country_code": "$country_code",
                    "number_records": "$number_records", "_id": 1 } },
    { "$sort": { "number_records": -1 } }
]);

Listing 3.7: MongoDB aggregation example query with multiple stages

The order of the aggregation stages is important for query performance. The output of the above aggregation is shown in Listing 3.8.
#1
{
    "_id": {
        "player_name": "KIM, YEONG TAEG",
        "country_code": "KR"
    },
    "number_records": 17
},
#2
{
    "_id": {
        "player_name": "Joo, Young Ho",
        "country_code": "KR"
    },
    "number_records": 17
},
#3
{
    "_id": {
        "player_name": "Mathew, Manu",
        "country_code": "US"
    },
    "number_records": 12
},
#4
{
    "_id": {
        "player_name": "Mathew, Manu",
        "country_code": "KR"
    },
    "number_records": 10
},
.....
Listing 3.8: MongoDB aggregation example output

3.4.1.2 Text Indexing
A text index enables text search on string fields, helping the user to fetch relevant information for a given text. In this thesis, text indexing is used for text search to fetch relevant information on organizations or experts and rank them by their numbers of patents and scientific publications. For instance, suppose a client wants to know which company has the most scientific publications and patents for the topic HEATERS. The text search on the text-indexed field is given as HEATERS in the aggregation pipeline along with some aggregation operations; the query then finds the relevant information and ranks the numbers of scientific publications and patents per company for the given text search.
A text index is created by specifying the keyword text, similar to a regular index. In Listing 3.9, a text index is created on the single field title of a MongoDB collection. To use the text index on the title field, the $text operator is applied; an example of giving some text to search for and retrieving the data is shown in Listing 3.10. Indexing is also possible on multiple fields, which is known as compound indexing.
There are some restrictions on using text search:

• Only one $text expression is allowed per query.

• Text search on views is not allowed.

• In an aggregation pipeline, the stage that includes $text must be the first stage of the pipeline. This increases query performance by first narrowing the documents to the given text, which reduces the scanning time for the other pipeline operators.

• $text does not support the operators $or and $not in a single expression.

db.getCollection("doc").createIndex({ "title": "text" })

Listing 3.9: MongoDB Text Indexing

In this chapter, the factors for selecting the MongoDB database were discussed. The database schema is designed as an embedded collection built from multiple tables.

# Giving text input "VIDEO" and projecting the required fields

db.doc.aggregate([
    { "$match": { "$text": { "$search": "VIDEO" } } },
    { "$project": { "player_name": "$player_name", "country_code": "$country_code",
                    "number_records": "$number_records", "_id": 1 } },
    { "$sort": { "number_records": -1 } }
]);

# Output of the above query
#1
{
    "_id": ObjectId("5bed58c61cd32101ed438472"),
    "player_name": "Cossart, Christophe",
    "country_code": "FR"
},
#2
{
    "_id": ObjectId("5bed59a51cd32101ed4985d6"),
    "player_name": "Noack, Andreas",
    "country_code": "DE"
},
#3
{
    "_id": ObjectId("5bed5b731cd32101ed4f3653"),
    "player_name": "Kanders, Michael",
    "country_code": "DE"
},
#4
{
    "_id": ObjectId("5bed58cb1cd32101ed43af23"),
    "player_name": "Pathe, Christian",
    "country_code": "FR"
},
......
Listing 3.10: MongoDB Text Indexing

This brings the advantage of storing related information in one place; having no joins can also increase query performance. However, this approach has a drawback: data duplication. Duplicate data is removed from the database by checking for redundancy using the aggregation pipeline. The data in the MongoDB database is stored in BSON format. MongoDB has its own query language: find() is used for simple queries, and there are also the aggregation pipeline and map-reduce. In this thesis, the aggregation pipeline is used, as it suits large complex queries and such queries are easy to develop with it. The aggregation pipeline proceeds through stages, where the output of one stage is the input of the next. The stages and some of the operators used in this thesis were explained with example queries, and the usage of text indexing and its restrictions when querying data with text search were discussed.
4. Design and Implementation
In Chapter 3, we discussed the data warehouse structure of the PostgreSQL database, the data selection, the concept of data migration, the MongoDB data model, and the concept of the aggregation pipeline in MongoDB using simple and complex queries.
This chapter provides the detailed design and implementation procedure; major steps such as tool selection and the data analysis procedure are described. We first discuss the design process and the core processes of the MongoDB database, because it is important to know the basic operations of MongoDB; these operations are needed to interact with the MongoDB server, which runs on JavaScript. Then we discuss the tool selection for interacting with the data in the database. To show the query output and performance, we use a user-friendly interactive web interface built with the R framework Shiny. Lastly, we provide the queries used and their results for evaluation purposes.

4.1 Design Process
The design process is basically the application design, which defines the interfaces and their behavior. The process begins with the data import into the MongoDB database. After the data migration, all components, from querying the data to developing an interactive web application for data retrieval, are described. Figure 4.1 describes the structure of the whole system running our applications. The database involves a three-layer process:

1. Storage layer

2. Processing layer

3. Management layer

The storage layer consists of the data in a collection. The processing layer operates on the storage layer; all aggregations, indexing, and query executions happen in this layer. Finally, the management layer is a high-level layer consisting of the developed application, which mediates between the database and the database users. Figure 4.1 explains all three layers.
MongoDB stores the data in BSON format in a collection. All query operations are done in NoSQLBooster, because it is a smart GUI that shows users the query execution details: the time taken to retrieve the data, the number of records retrieved, and the query execution process. Based on the queries, an interactive web application for business users is developed using the R framework Shiny.

Figure 4.1: Design process

4.2 MongoDB Core Process
The core processes play an important role in connecting to and interacting with the data in the MongoDB database, and they are discussed in this section. In this thesis, the core processes are used to import data and to interact with the data via the mongo shell.
The mongoimport tool is used to import data from a CSV file into MongoDB. Before the migration, the server must be started; mongod is used to start the server, and no database operations can be performed without starting it. mongo is used to interact with the data in the MongoDB database through the mongo shell. The core process of MongoDB is an essential part of interacting with the database; it also involves the mongoimport and mongoexport tools, the GridFS (file storage) tools, and diagnostic tools.
The main components of the core process are described below [1]; a combined usage sketch follows the list.
[1] https://docs.mongodb.com/manual/reference/program/

1. mongod: The mongod command is one of the main components of the core processes. It starts the server, manages requests, and supports data format management. It also provides core options such as version, configuration file, verbosity, port number, and many more.

2. mongo: The mongo command provides an interactive JavaScript shell to communicate with the server. With mongo operations, developers and administrators can manage the data, run queries, and generate reports for their business.

3. mongoimport: The mongoimport tool imports data from different file formats such as JSON, CSV, and TSV (tab-delimited files).
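A hedged sketch of how these three components work together on the command line; the paths, database, and file names are illustrative:

mongod --dbpath /data/db --port 27017     # start the server
mongo --port 27017                        # open the interactive JavaScript shell
mongoimport --db datasample --collection doc --type csv --file data.csv --headerline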

4.3 Tools Selection
In this section, we discuss the tools required for implementing the MongoDB queries and for building the interactive web application, along with their functionality.

4.3.1 NoSQLBooster
There are numerous MongoDB management tools available [2]. These tools help to interact with the data in the database through a smart GUI (graphical user interface), which makes productivity and management tasks easy for developers and administrators. In this thesis, NoSQLBooster is selected for the query evaluation in the MongoDB database (Figure 4.2).

Version          NoSQLBooster 4.7.0
MongoDB version  4.0
Machine type     Windows 64-bit
Size             36.8 MB
Downloaded on    13.08.2018

Table 4.1: MongoDB version and NoSQLBooster specifications

NoSQLBooster [4] offers built-in language services and a vast number of built-in snippets that help in writing queries. Whenever you start writing a query script, it pops up suggestions as you type.
Features of NoSQLBooster:

1. Query functions can be performed in SQL on the MongoDB database, including JOINs, expressions, aggregations, and functions.

2. NoSQLBooster has a unique feature known as chaining syntax: it is possible to retrieve data with a combination of SQL and the MongoDB query language, and this even works with the aggregation framework.

3. It explains the query execution plan in a hierarchical structure that is easy to read and understand.

4. It provides a visual query builder feature.
[2] https://www.guru99.com/top-20-mongodb-tools.html
[4] https://www.nosqlbooster.com/

Figure 4.2: NoSQLBooster Graphical User Interface


Many other features are described on the NoSQLBooster website [5].

Figure 4.2 shows the user interface of NoSQLBooster. It supports multiple ways of retrieving data, such as the visual query builder, SQL or the MongoDB query language, and queries through programming languages, namely R, Python, C++, Java, and many more. It also shows the time taken to retrieve the data and the number of documents retrieved from the database. The data can be viewed as tables, as a tree view, or in JSON format.
NoSQLBooster helps to develop queries using its query interface. The output of a query provides the information stated in Figure 4.2; the execution time and the number of documents retrieved help to investigate the query performance of the database and compare it with the PostgreSQL database.

4.3.2 R Framework Shiny
The prototype is developed using the R framework Shiny. The technical details and the process of developing the Shiny application for the MongoDB query implementation are discussed in this section. R is a programming language built for graphics and statistical computing [6]. Shiny is an R framework that helps to develop user-friendly, interactive web applications; it is built into RStudio [7]. It is an R package, and the Shiny application here is developed using R programming. There are two major parts of the Shiny framework.
[5] https://www.nosqlbooster.com/
[6] https://www.r-project.org/about.html
[7] https://www.rstudio.com/shiny/

First, the frontend development provides the user interface. It enables the user to interact with the data even without any programming knowledge; the user can retrieve data just by a text search and a few simple clicks on dialogue boxes and tabs. Second, the backend development, where the queries are implemented in R programming.
Figure 4.3 shows the Shiny interface, which has four windows. Window 1 is the code editor, where the R code is written (ui.R and server.R). Window 2 is the console, where the developed program is executed. Window 3 is the global environment, which shows details such as the history of commands used, the connection with the MongoDB database, and many more functions [8]. Window 4 contains information about the various R packages and libraries for connecting to and interacting with the data.
In this thesis, the mongolite and jsonlite libraries are used for developing the interactive web application for MongoDB; the libraries are available in the Shiny R package list.
To work with the MongoDB data, we first need to connect to the MongoDB database. Before communicating, the server must be started using the mongod command, as described in Section 4.2.

Figure 4.3: R Framework Shiny Application

Consider the code used for developing the web application in Listing 4.1 and Listing 4.2. The steps involved in developing the application are discussed below.
[8] https://www.rstudio.com/shiny/

library(shiny)
library(mongolite)
library(jsonlite)

limit <- 10L

# Define UI for the MongoDB text search application
ui <- fluidPage(
    # Application title
    titlePanel("MongoDB text search data"),
    sidebarLayout(
        sidebarPanel(
            textInput("query_id", "Title text", ""),
            selectInput("doc_id", "document", choices = c("PATENT", "SCIENCE")),
            actionButton("act", "output")
        ),
        # Show the MongoDB text search output in the main panel
        mainPanel(
            tabsetPanel(
                tabPanel("INSTITUTE", dataTableOutput("table1")),
                tabPanel("EXPERT", dataTableOutput("table2"))
            )
        )
    )
)

Listing 4.1: Code for developing a Shiny R application user interface

For developing a web application with Shiny, the mongolite and jsonlite packages must be installed; they connect the Shiny interface to MongoDB and allow interaction with the data.
In Listing 4.1, the user interface is defined. The page is titled "MongoDB text search data". The sidebar panel provides a text field for the text search and a selector for the document type of the given text search; the output button triggers the query. The main panel provides two tabs for retrieving data related to institutions or expert profiles, and the output of the query is displayed there as a table.
# Defining the server-side function
server <- function(input, output) {
    # Connecting to the MongoDB server
    mdb <- mongo(collection = "doc", db = "datasample",
                 url = "mongodb://localhost:27017/?socketTimeoutMS=1200000")

    # Reactivity
    INSTITUTION <- eventReactive(input$act, {
        # Text indexing
        mdb$index(toJSON(list("title" = "text"), auto_unbox = TRUE))

        # Applying the query
        q <- paste0('[ { "$match": { "$text": { "$search": "', input$query_id, '" } } },
            { "$match": { "doc_type": "', input$doc_id, '" } },
            { "$match": { "player_type": "INSTITUTION" } },
            { "$group": {
                "_id": { "doc_type": "$doc_type" },
                "number_records": { "$sum": 1 },
                "player_name": { "$first": "$player_name" },
                "title": { "$first": "$title" },
                "player_type": { "$first": "$player_type" },
                "country_code": { "$first": "$country_code" }
            } },
            { "$sort": { "number_records": -1 } },
            { "$limit": 10 }
        ]')

        jsonlite::validate(q)
        query <- mdb$aggregate(q, '{ "allowDiskUse": true }')
        query
    })

    # Reactivity
    EXPERT <- eventReactive(input$act, {
        # Applying the query
        q <- paste0('[ { "$match": { "$text": { "$search": "', input$query_id, '" } } },
            { "$match": { "doc_type": "', input$doc_id, '" } },
            { "$match": { "player_type": "EXPERT" } },
            { "$group": {
                "_id": { "doc_type": "$doc_type", "title": "$title",
                         "player_name": "$player_name", "player_type": "EXPERT",
                         "country_code": "$country_code" },
                "number_records": { "$sum": 1 }
            } },
            { "$sort": { "number_records": -1 } },
            { "$limit": 500 }
        ]')

        jsonlite::validate(q)
        query <- mdb$aggregate(q, '{ "allowDiskUse": true }')
        query
    })

    # Defining the outputs as tables
    output$table1 <- renderDataTable({
        INSTITUTION()
    })
    output$table2 <- renderDataTable({
        EXPERT()
    })
}

shinyApp(ui = ui, server = server)

Listing 4.2: Code for developing a Shiny R application server side

On the backend (Listing 4.2), the input and output query operations are developed. Initially, Shiny is connected to the MongoDB database. Then the eventReactive function is used; it makes the application respond to calls issued by the user through the user interface. For the text search, a text index is created on the title field, and the aggregation query is built; the aggregation process was explained in Section 3.4.1.1. Finally, the resulting data is displayed as a table in the application (Figure 4.4).

For instance, if the user needs information related to a particular field of interest, the query scans the related documents in the collection and returns the results. The query process from the user's perspective is displayed in Figure 4.5.

The user enters the text query in the text field and selects the document type. The number of organizations or experts for the selected document type is computed and displayed to the user. With the web application, the user can fetch information according to the requirements. For instance, if the user needs the number of scientific publications per organization for a particular field of interest, the user selects the scientific documents and enters the text in the search field; the output then displays the number of scientific publications for every organization matching the given text.

Figure 4.4: Output table in Shiny R web application

Figure 4.5: Query execution process in SHINY.



4.4 MongoDB Query Optimization
Performance optimization affects the execution speed when the data reaches its peak limits or when large complex queries are used. To improve performance, it is important to follow a proper aggregation pipeline optimization technique.
Creating an index on a single field of a collection reduces the scanning time and makes data retrieval faster (Figure 4.6).

Figure 4.6: Creating a single text index using NoSQLBooster

After indexing, the aggregation pipeline itself is optimized. Each stage passes through the pipeline, and proper ordering of the aggregation stages reduces the execution time; projecting only the required fields instead of the whole collection increases the speed of the pipeline operation.
When complex aggregation queries are applied to the database, large queries lower the execution speed. To improve performance, proper aggregation pipeline optimization is required.
The query in Listing 4.3 is used to explain the optimization technique. It counts the PATENT documents and fetches the list of organizations, experts, and country codes matching a given text; the execution process of each stage is shown.
The execution plan in Listing 4.3 shows how each stage passes through the pipeline. The pipeline first performs the text search given on the text-indexed field and then matches the documents whose doc_type is PATENT. The filtered output then enters the $group operator: with reference to the output of the match stages, the documents are grouped by player_name and country_code, and the number of records is computed. The pipeline then projects the fields described in the group stage, sorts the number of records in descending order, and finishes after limiting the documents.

# Giving text input "VIDEO" and projecting the required fields

db.doc.aggregate([
    { "$match": { "$text": { "$search": "VIDEO" } } },
    { "$match": { "doc_type": "PATENT" } },
    { "$group": { "_id": { "player_name": "$player_name",
                           "country_code": "$country_code" },
                  "number_records": { "$sum": 1 }
    } },
    { "$project": { "player_name": "$player_name", "country_code": "$country_code",
                    "number_records": "$number_records", "_id": 1 } },
    { "$sort": { "number_records": -1 } },
    { "$limit": 500 }
], { explain: true });

# Pipeline explanation of the above query
{
    "stages": [
        {
            "$cursor": {
                "query": {
                    "$and": [
                        { "$text": { "$search": "VIDEO" } },
                        { "doc_type": "PATENT" }
                    ]
                },
                "fields": {
                    "country_code": 1,
                    "player_name": 1,
                    "_id": 0
                },
                "queryPlanner": {
                    "plannerVersion": 1,
                    "namespace": "sample.doc",
                    "indexFilterSet": false,
                    "parsedQuery": {
                        "$and": [
                            { "doc_type": { "$eq": "PATENT" } },
                            { "$text": { "$search": "VIDEO",
                                         "$language": "english",
                                         "$caseSensitive": false,
                                         "$diacriticSensitive": false } }
                        ]
                    },
                    "winningPlan": {
                        "stage": "FETCH",
                        "filter": { "doc_type": { "$eq": "PATENT" } },
                        "inputStage": {
                            "stage": "TEXT",
                            "indexPrefix": {},
                            "indexName": "title",
                            "parsedTextQuery": {
                                "terms": [ "video" ],
                                "negatedTerms": [],
                                "phrases": [],
                                "negatedPhrases": []
                            },
                            "textIndexVersion": 3,
                            "inputStage": {
                                "stage": "TEXT_MATCH",
                                "inputStage": {
                                    "stage": "FETCH",
                                    "inputStage": {
                                        "stage": "OR",
                                        "inputStage": {
                                            "stage": "IXSCAN",
                                            "keyPattern": { "_fts": "text", "_ftsx": 1 },
                                            "indexName": "title",
                                            "isMultiKey": true,
                                            "isUnique": false,
                                            "isSparse": false,
                                            "isPartial": false,
                                            "indexVersion": 2,
                                            "direction": "backward",
                                            "indexBounds": {}
                                        }
                                    }
                                }
                            }
                        }
                    },
                    "rejectedPlans": []
                }
            }
        },
        {
            "$group": {
                "_id": {
                    "player_name": "$player_name",
                    "country_code": "$country_code"
                },
                "number_records": { "$sum": { "$const": 1 } }
            }
        },
        {
            "$project": {
                "_id": true,
                "player_name": "$player_name",
                "country_code": "$country_code",
                "number_records": "$number_records"
            }
        },
        {
            "$sort": {
                "sortKey": { "number_records": -1 },
                "limit": NumberLong("500")
            }
        }
    ],
    "ok": 1
}

Listing 4.3: MongoDB Aggregation Stages for performance optimizations

The $match stage should always come first in the pipeline, as it filters the PATENT documents while scanning the collection; if the query involves a text search, the $text match must be placed on the first line. After the pipeline passes through the matching stages, grouping the documents and projecting only the required fields reduces the execution time, which leads to an increase in performance.
In this chapter, the design of the process flow, the MongoDB core process, and the implementation process were described. Furthermore, we developed an easy implementation approach for querying data using different tools. The tools used in the implementation, the mongo shell (for data import), NoSQLBooster (for evaluating performance), and the R framework Shiny (for developing the prototype), made data retrieval easy, and the web application helps users interact with the data even without programming knowledge. The interactive web application is best suited for people like business managers, clients, and R&D managers.
5. Evaluation
The implementation tools and procedures were discussed in the previous chapter. This chapter focuses entirely on the evaluation; it includes the results of the data migration, the query performance in terms of query execution speed, and their comparison at the end. To do so, we first migrate the data from the PostgreSQL database to the MongoDB database and then compare the query performance of the MongoDB and PostgreSQL databases.

5.1 Evaluation Setup
To properly judge the query performance of the MongoDB implementation, we need to compare it to the PostgreSQL database. Since the primary focus is on query performance, our evaluation is based on query execution time. The results provide the information for deciding which database to select for the given data.

5.1.1 Machine Used
For implementing the data migration, writing the queries for data retrieval, and designing the web application for the MongoDB database, a local machine is used; its detailed description is shown in Table 5.1.
To compare the results, the PostgreSQL database is also set up with the same data as the MongoDB database, and the PostgreSQL implementation is done on the same machine.

Machine      Windows 10
Processor    Intel Core i5-5200U
CPU          2.20 GHz
System type  64-bit
RAM          8 GB

Table 5.1: Used machine specifications



5.1.2 Data Characteristics
We use the data from the Mapegy GmbH company database to compare the performance of the MongoDB and PostgreSQL systems. We migrated multiple data sets of different sizes to examine the query performance at each size. Since it is difficult to test the performance with a large amount of data on a local machine, we migrated the same data set in smaller subsets. We also migrated a data set extracted from a single table, the entity_docs table (see Figure 3.2), from the PostgreSQL database. Both the single table and the merged tables (entity_docs, link_players_docs, entity_players) are migrated (see Figure 3.2). After the migration, we compare the query performance for the single table and the embedded table in the PostgreSQL database and the MongoDB database.
Table 5.2 provides an overview of the tables. In the experiments, we start with the migration of a large table that contains the patents and scientific publications for the lists of organizations and experts of various countries. The data containing the required information is migrated both as a single table (entity_docs) and from multiple tables (entity_docs, link_players_docs, entity_players). To investigate the query performance, the data is migrated to the database in different sizes.

No. of records   MongoDB embedded table   MongoDB single table
10,000           13 MB                    5 MB
50,000           210 MB                   53.3 MB
100,000          420 MB                   110 MB
1,000,000        3.2 GB                   1.40 GB
23,734,000       106.2 GB                 84.6 GB

Table 5.2: Statistics of Datasets

To examine the query capabilities of MongoDB, we perform queries on a single collection and on an embedded collection. The data in both is collected from the PostgreSQL database.
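For illustration, the two collection layouts differ roughly as sketched below. The field values are taken from the query outputs shown later in this chapter; the player fields of the embedded document are hypothetical placeholders.

// Single collection: one document per row of entity_docs.
{ "doc_id": 29715188, "doc_type": "PATENT", "title": "One touch voice memo" }

// Embedded collection: the player and link information from
// link_player_docs and entity_players is merged into the document,
// so no join is needed at query time.
{ "doc_id": 29715188, "doc_type": "PATENT", "title": "One touch voice memo",
  "player_id": 104364024, "player_name": "<player>", "player_type": "EXPERT" }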

5.2 Experiments
In this section, we discuss the experiments implemented on MongoDB and show their results. Apart from our predefined queries, we perform several other experiments to compare the performance of MongoDB and PostgreSQL with simple and complex queries.

5.2.1 Data Migration


The data is extracted from the PostgreSQL database into a CSV file. The time taken for extracting the files from PostgreSQL into CSV is shown in Table 5.3. From the results, we observe that the time taken for extracting the data is greater than the time taken for importing the data into the MongoDB database. The mongoimport tool performs the import efficiently and delivers efficient results for datasets of different sizes.

The data is migrated using the MongoDB shell interface with the command shown in Listing 5.1. At first, the path of the locally installed MongoDB server is specified. Then, the mongoimport tool is invoked with the name of the database, the collection name, the type of file (CSV), the path where the file is stored, and the --headerline flag, which tells the server that the CSV file has a header.

# syntax for data migration using mongoimport in the mongo shell
<mongoshell>: mongoimport --db <databaseName> --collection <collectionName>
    --type CSV --file <filepath> --headerline

# data migration using mongoimport in the mongo shell
C:\Programme\mongodb\Server\4.0\bin> mongoimport --db datadocuments
    --collection data --type csv --file E:\mongodb\data-55124365.csv --headerline

Listing 5.1: MongoDB Data Migration

We run the experiments with tables of varying size. Initially, we migrated a large embedded table with 23.7 million records. Due to its size, it is difficult to run the queries efficiently on the local machine. In order to still investigate the query performance, further tables with the same columns but fewer records were migrated. Finally, a single table (entity_docs) with 100,000 records was migrated from the PostgreSQL database. This single table helps to investigate the execution speed without joins.

To compare the query performance of both databases (PostgreSQL and MongoDB), the data extracted from the PostgreSQL tables is first imported into a PostgreSQL database on the local machine in normalized form. The data is imported into PostgreSQL from the CSV (comma-separated values) file using the COPY command. Before importing the data, the table is created with the CREATE TABLE command, listing the column names and their data types. After creating the table, the data is imported with the command shown in Listing 5.2. The same procedure is followed for the other tables used in the thesis (entity_players, link_player_docs).

-- CSV data importing syntax
COPY <databaseName.tableName> FROM <filepath> DELIMITER '<type of delimiter>' CSV HEADER;

-- data imported using the SQL shell for a single table
COPY datawarehouse.entity_docs FROM 'C:\data\db\mongocsv\entity_docs.csv'
    DELIMITER ',' CSV HEADER;

Listing 5.2: Importing data into PostgreSQL database

According to the tasks defined in Section 1.1, we need to use text search to fetch the relevant information from the database. Therefore, the tables are indexed using GIN to provide text search indexing. A detailed explanation of indexing is given in Section 2.2.3. The GIN index is created on the text field title, see Listing 5.3.

CREATE INDEX fts_idx ON
    public.entity_docs
    USING gin (tsv_title)
TABLESPACE pg_default;

Listing 5.3: Creating a GIN index on the title field in PostgreSQL

The time taken to import the data into the MongoDB and PostgreSQL databases is shown in Table 5.3.

No. of records   Type of table    PostgreSQL (CSV)       MongoDB (JSON)
                                  data extraction time   data importing time
23.7 million     embedded table   13200.245 sec          9000.192 sec
1 million        embedded table   550.235 sec            37.149 sec
1 million        single table     280.685 sec            18.281 sec
100 thousand     embedded table   49.568 sec             19.214 sec
100 thousand     single table     19.817 sec             10.214 sec
50 thousand      embedded table   29.648 sec             14.214 sec
50 thousand      single table     24.568 sec             12.214 sec
10 thousand      embedded table   15.687 sec             9.214 sec
10 thousand      single table     12.549 sec             8.214 sec

Table 5.3: Data Migration results

5.2.2 Experiment Queries


After importing the data into the MongoDB database, queries are executed for the tasks described in Section 1.1. These tasks were selected because they contain both simple and complex queries, which helps in evaluating the query performance of the MongoDB database. All the experiments shown in this section are also implemented in the R framework SHINY; the corresponding listings are collected in Appendix A. For evaluating the query performance, several further queries are implemented; the details are listed in Table 5.4. Queries 1, 3, 5, and 6 of Table 5.4 are similar to the tasks defined in Section 1.1, but queries 3, 5, and 6 use a different text search.
Retrieve information of all patents:
The query in Listing 5.4 is a simple query using the db.collection.find() function. It retrieves the information related to patent documents: Listing 5.4 finds all patents and projects the fields player_id, doc_type, doc_id, title, and country_code. The query retrieves 68,125 patents from 100,000 documents.
// simple query finding PATENT documents in the doc_type field
db.mergeddocs.find({
    doc_type: "PATENT"
}).projection({
    player_id: 1,
    doc_type: 1, doc_id: 1, title: 1, country_code: 1
})

// output of the query
{
    "_id": ObjectId("5be1b70674042f029c0de816"),
    "player_id": 104364024,
    "doc_id": 29715188,
    "title": "One touch voice memo",
    "country_code": "US",
    "doc_type": "PATENT"
},
{
    "_id": ObjectId("5be1b70674042f029c0de817"),
    "player_id": 104364024,
    "doc_id": 28942127,
    "title": "Multiplexing VoIP streams for conferencing and selective playback of audio streams",
    "country_code": "US",
    "doc_type": "PATENT"
},
......

Listing 5.4: MongoDB query execution

The tasks defined in Section 1.1 involve text search for retrieving the data from the MongoDB database. In Section 3.4.1.2, text indexing is performed (Listing 3.9) on every collection that is used in the thesis. An index on the _id field is present by default in every collection. In the aggregation pipeline, we use text search (with the focus on the document title) to retrieve the relevant patents and scientific publications of all organizations and experts. Therefore, a text index is created on the single field title that is required for the text search.
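Such a text index can be created in the mongo shell as sketched below; the collection name singletable matches Listing 5.5, while the version actually used in the thesis is the one from Listing 3.9.

// Create a text index on the title field; a collection can have
// at most one text index.
db.singletable.createIndex({ "title": "text" });

// The default index on _id is always present; all indexes can be
// listed with:
db.singletable.getIndexes();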
Retrieve all the patents and scientific publications whose title contains the word 'complex':
After creating the text index on the title field, the query is executed as a text search on the MongoDB database. It retrieves all patents and scientific publications whose title contains the word 'complex'. The query returns all 736 documents related to the word and projects six columns, as shown in Listing 5.5.
// text search input on the title column
db.singletable.aggregate([
    { "$match": { "$text": { "$search": "complex" } } },
    { "$project": { "title": "$title", "country_code": "$country_code",
        "doc_type": "$doc_type", "date_inserted": "$date_inserted",
        "doc_source": "$doc_source", "_id": 1 } }
]);

// output of the text search
{
    "_id": ObjectId("5c8155b5f04e5827aad72722"),
    "doc_sub_type": "{ARTICLE}",
    "country_code": "",
    "doc_type": "SCIENCE",
    "date_inserted": "2017-01-10 09:41:01.498321",
    "doc_source": "CROSSREF"
},
{
    "_id": ObjectId("5c8155b1f04e5827aad6dee8"),
    "doc_sub_type": "{ARTICLE}",
    "country_code": "",
    "doc_type": "SCIENCE",
    "date_inserted": "2018-12-20 08:59:47.251552",
    "doc_source": "CROSSREF"
},
....

Listing 5.5: Aggregation query execution

Retrieve all the patents related to VIDEO and list all the organizations and experts:
The query in Listing 5.6 is implemented as an aggregation pipeline. It retrieves all documents (230 documents) whose title contains the word VIDEO. The result gives the patents and scientific publications of the organizations and experts from various countries.
// text search input on the title column, with doc_type PATENT and
// players of type INSTITUTION and EXPERT
db.mergeddocs.aggregate([
    { "$match": { "$text": { "$search": "VIDEO" } } },
    { "$match": { "doc_type": "PATENT" } },
    { "$project": { "player_name": "$player_name", "player_type": "$player_type",
        "doc_type": "$doc_type", "country_code": "$country_code",
        "number_records": "$number_records", "_id": 1 } },
    { "$sort": { "number_records": -1 } }
]);

// ranking output
{
    "_id": ObjectId("5c7d939dcd1c8a2a36311159"),
    "player_name": "L'azou, Yves",
    "player_type": "EXPERT",
    "doc_type": "PATENT",
    "country_code": "FR"
},
{
    "_id": ObjectId("5c7d93a9cd1c8a2a3631bf6c"),
    "player_name": "Cosson, Laurent",
    "player_type": "INSTITUTION",
    "doc_type": "PATENT",
    "country_code": "FR"
},
...

Listing 5.6: MongoDB query

Retrieve all the organizations and experts related to 'SERVICE' and rank them by number of patents:
In the query in Listing 5.7, the aggregation pipeline passes through several stages. The query ranks the numbers of patents and scientific publications in descending order for all organizations and experts. The output projects all patents and scientific publications of the organizations and experts from various countries whose title contains the word 'SERVICE'.
// text search input on the title column
// player_type can be selected as EXPERT or INSTITUTION; the players are
// ranked by number of PATENT and number of SCIENCE documents

db.mergeddocs.aggregate([
    { "$match": { "$text": { "$search": "SERVICE" } } },
    // one of the two following stages is used per run:
    { "$match": { "doc_type": "SCIENCE" } },
    { "$match": { "doc_type": "PATENT" } },
    {
        // group to list the organizations and rank them by number of
        // patents and scientific publications
        "$group": { "_id": { "player_type": "$player_type",
                             "country_code": "$country_code" },
                    "number_records": { "$sum": 1 }
        }
    },
    { "$project": { "player_type": "$player_type", "player_name": "$player_name",
        "country_code": "$country_code", "number_records": "$number_records",
        "_id": 1 } },
    { "$sort": { "number_records": -1 } },
    { "$limit": 100000 }
]);

// output of the given query
{
    "_id": {
        "player_type": "EXPERT",
        "player_name": "Eidloth, Rainer",
        "country_code": "DE",
        "doc_type": "PATENT",
        "doc_source": "PATSTAT",
        "doc_source_id": 339191463
    },
    "number_records": 67
},
{
    "_id": {
        "player_type": "EXPERT",
        "player_name": "Artault, Alexandre",
        "country_code": "FR",
        "doc_type": "PATENT",
        "doc_source": "PATSTAT",
        "doc_source_id": 334075617
    },
    "number_records": 28
},
.....

Listing 5.7: MongoDB Ranking number of documents

5.3 Comparison between PostgreSQL and MongoDB
For the performance evaluation, the queries are executed on the PostgreSQL and MongoDB databases, using tables with 10,000, 50,000, 100,000, and 1,000,000 records in each database. In PostgreSQL, the queries are written in standard SQL; in MongoDB, they are written in the MongoDB query language, with find() for simple queries and the aggregation pipeline for complex ones. Table 5.4 lists the queries performed together with the projected fields and the code listing references for each query. To get reliable results, the query execution time is calculated as the average over 10 runs; the average was determined with the same procedure on both databases. The data in the databases is used exclusively for this thesis, which means that no other data operations run on it in the meantime. The results of these queries are shown in Table 5.5, Table 5.6, and Table 5.7.
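The averaging can be scripted directly in the mongo shell. The helper below is a hypothetical sketch and not part of the thesis tooling; it illustrates the procedure for query 3 (text search on the single table).

// Run a query several times and report the mean wall-clock time in seconds.
function avgQueryTime(runs) {
    var total = 0;
    for (var i = 0; i < runs; i++) {
        var start = Date.now();
        db.documents.aggregate([
            { "$match": { "$text": { "$search": "Motion" } } }
        ]).toArray();              // force full materialization of the result
        total += Date.now() - start;
    }
    return total / runs / 1000;   // milliseconds -> seconds
}
avgQueryTime(10);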
In Figure 5.1, Figure 5.2, Figure 5.3, and Figure 5.4, the query response times of the queries listed in Table 5.4 are compared for the PostgreSQL and MongoDB databases. The query number from Table 5.4 is shown on the x-axis, and the query response time (in seconds) is measured on the y-axis. MongoDB and PostgreSQL are drawn in different colours for an easy visual comparison of the two databases.
Queries 1, 2, and 3 are executed on the single table (collection) and retrieve the data in almost the same time on both systems, with a slight advantage for PostgreSQL on all datasets (Figure 5.1, Figure 5.2, Figure 5.3, and Figure 5.4). Query 1 always takes longer than queries 2 and 3 because it retrieves all the data from the table (collection) by scanning every row. Query 2 is faster than query 1 because it only scans for scientific publications and retrieves the data related to them. Query 3, which performs a text search on the single table, retrieves the data faster than queries 1 and 2 because it only scans the documents in which the given text is present. This reduces the scanning time, resulting in faster query execution.
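This behaviour can be verified with MongoDB's explain facility: when the text index is used, the winning plan contains a TEXT stage and the number of examined documents stays close to the number of matching documents. A minimal sketch:

// Inspect the execution plan of the text search query.
var stats = db.singletable
    .find({ "$text": { "$search": "Motion" } })
    .explain("executionStats");

printjson(stats.queryPlanner.winningPlan);         // TEXT stage => index scan
print(stats.executionStats.totalDocsExamined);     // documents scanned
print(stats.executionStats.executionTimeMillis);   // server-side time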

No.   Query                                                      Projected rows   SQL code       MongoDB code
1     Get all data from a single table                           19               Listing A.5    Listing A.13
2     Get all the data from a single table related               19               Listing A.6    Listing A.14
      to scientific publications
3     Text search in a single table                              19               Listing A.7    Listing A.15
4     Get all the organizations and experts where                59               Listing A.8    Listing A.16
      scientific publication is selected
5     Get a list of organizations for patents and                6                Listing A.9    Listing A.17
      scientific publications related to the word
      'SERVICE' in the title field
6     Retrieve all organizations related to 'SERVICE'            6                Listing A.10   Listing A.18
      and rank them by number of patents and number
      of scientific publications
7     Get all the players where the type of connection           59               Listing A.11   Listing A.19
      is INVENTOR
8     Retrieve all startup companies related to the word         7                Listing A.12   Listing A.20
      'behaviour' in the title to find the total number
      of scientific publications for a type of organization

Table 5.4: Queries for performance comparison from PostgreSQL and MongoDB

Figure 5.1: Performance comparison on 10,000 records

Figure 5.2: Performance comparison on 50,000 records



Figure 5.3: Performance comparison on 100,000 records

Figure 5.4: Performance comparison on 1,000,000 records



For queries 4 and 7, the PostgreSQL database needs a JOIN operation to retrieve all the patents and scientific publications for all the organizations and experts. In the MongoDB database, the data is embedded into a single collection, so no joins are involved and all the data is retrieved faster from the embedded collection. The execution speed of MongoDB is up to 50% higher than that of the PostgreSQL database.
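Had the data been kept in separate collections instead, the same retrieval would need a $lookup stage per joined table, which mirrors the relational JOIN and its cost. The following sketch assumes hypothetical collections docs and players that mirror the PostgreSQL tables; it shows what the embedding avoids.

// Join-style retrieval over two separate collections via $lookup;
// the embedded collection answers the same question with a plain find().
db.docs.aggregate([
    { "$lookup": {
        "from": "players",            // second collection to join
        "localField": "player_id",    // join key in docs
        "foreignField": "player_id",  // join key in players
        "as": "player"                // joined documents land here
    } },
    { "$unwind": "$player" },
    { "$match": { "doc_type": "PATENT" } }
]);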

For queries 5 and 6, both databases execute the query quickly because of text indexing, which limits the number of documents to search: the query scans only the documents related to the given text. For the datasets with 10,000 and 50,000 rows, the difference between the two databases is small. In Figure 5.3, the data retrieval in MongoDB is up to 50% faster than in PostgreSQL. In Figure 5.4, the performance of query 5 differs only slightly, whereas for query 6 there is a large difference in query performance between the databases. This is because the queries are executed on the local machine, whose behaviour is not fully reliable for such a huge dataset. Nevertheless, the measurements clearly show that the MongoDB database is faster than the PostgreSQL database.

For query 8, MongoDB performs better than the PostgreSQL database. In PostgreSQL, the query requires join operations, and scanning all the tables degrades the query performance. In the MongoDB aggregation pipeline, the query is executed in stages, where the output of each stage is the input of the next. Properly planning the aggregation pipeline yields an efficient execution; this way, even the complex query is executed quickly.

From these results, it is evident that MongoDB is faster than the PostgreSQL database for the given queries.

Query number   10000   50000   100000   1000000
1              1.864   2.994   3.891    11.249
2              1.142   2.137   3.917    9.392
3              0.533   0.804   1.156    5.632
4              4.677   9.123   11.213   23.192
5              0.452   0.522   1.231    4.961
6              0.562   0.458   0.981    3.798
7              4.639   9.324   10.793   16.459
8              0.283   0.543   1.124    4.191

Table 5.5: PostgreSQL Query results in seconds

For each dataset, the number of rows returned by each query in the PostgreSQL and MongoDB databases is shown in Table 5.7. Since both databases support stemming, a query should return the same number of rows on both systems. The matching row counts confirm that the datasets used in the PostgreSQL and MongoDB databases are the same.

Query number   10000   50000   100000   1000000
1              1.641   2.861   3.591    9.213
2              0.981   1.986   3.429    8.124
3              0.462   0.791   0.961    4.294
4              2.193   4.563   7.579    12.129
5              0.348   0.235   0.543    4.921
6              0.361   0.296   0.512    0.764
7              3.612   4.831   6.593    8.213
8              0.194   0.192   0.784    0.981

Table 5.6: MongoDB Query results in seconds

Query   10000       50000        100000        1000000
1       10000       50000        100000        1000000
2       9920        30919        51875         621756
3       21          130          218           1749
4       1054        26244        50875         621756
5       8           268          941           24590
6       SCIENCE-7   SCIENCE-65   SCIENCE-800   SCIENCE-18241
        PATENT-1    PATENT-14    PATENT-141    PATENT-1749
7       840         20703        78409         798491
8       4           252          281           595

Table 5.7: Number of rows returned for each query

5.3.1 Impact of the Size of Datasets


The databases are implemented on the local machine, which has a limited amount of RAM (8 GB) and permanent storage (1 TB HDD). For the datasets with 100,000 and 1,000,000 records, the machine cannot deliver fully accurate measurements. For the large dataset (1,000,000 records), the lack of RAM decreases the execution speed of both databases.

5.4 Discussion
The decision for a particular NoSQL database depends on the requirements. For instance, if a company needs to manage the relationships within huge datasets, graph databases are a better fit, for example for cooperation networks between organizations or between the authors of scientific publications.

In this thesis, we used the MongoDB database. One of its main advantages is the flexible schema: the data model is easy to maintain and modify in an ever-changing environment.

In Section 5.3, the query performance of the databases was evaluated. On the single table, the query performance of PostgreSQL and MongoDB is almost the same. For the embedded collection (multiple tables in the PostgreSQL database), MongoDB shows a clear dominance on the given datasets (Figure 5.1, Figure 5.2, Figure 5.3, and Figure 5.4). However, the results may vary for data that contains several millions of records.
To work with several million records, MongoDB provides high scalability: the data is sharded over multiple machines, which facilitates working with large datasets.[1] In the case of PostgreSQL, there is no native sharding technique for distributing the data across the nodes of a cluster.[1] Other horizontal scaling techniques, such as manual sharding or a sharding framework, can lead to the loss of important relational abilities, data integrity, and ACID properties.[1]

[1] https://www.mongodb.com/compare/mongodb-postgresql
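For reference, sharding a collection in MongoDB uses built-in commands. The sketch below assumes a sharded cluster reached through mongos; the database name and shard key are only illustrative.

// Enable sharding for the database and distribute the collection
// across the shards using a hashed shard key.
sh.enableSharding("datadocuments");
sh.shardCollection("datadocuments.data", { "doc_id": "hashed" });
sh.status();    // inspect the resulting shard distribution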
6. Related Work
“Study the past if you would define the future.”
-Confucius

In this chapter, we discuss related work that implements similar MongoDB database operations and that also describes the performance of the MongoDB database.
Tim Mohring investigates the Tinnitus database project [Moh16]. The Tinnitus database contains information such as patient symptoms and ideal treatment methods. It is based on the relational MySQL database, which has some disadvantages that can lead to unacceptable errors in case of misuse. The author examines different NoSQL databases to overcome the problems of the MySQL database and concludes that the performance of a document-oriented database is high when data must be retrieved that would require joins over multiple tables in a relational database. He used MongoDB for the practical implementation and evaluated the results in terms of performance and schema validation. Due to the flexible schema, the easy query capabilities, and the lack of joins, the query performance of MongoDB is higher than that of the relational database [Moh16]. In his work, he concluded that MongoDB is superior for queries that would involve JOIN operations in the MySQL database. Similarly, in this thesis we implemented various complex queries in the MongoDB database and compared the resulting performance with the query performance of the company's PostgreSQL database; the query performance of MongoDB is higher than that of the PostgreSQL database.
Ningthoujam et al. [NCP+14] designed a MongoDB data model for ethnomedicinal plant data. In the article, the authors describe the data modeling patterns of MongoDB. There are two options for designing a data model in MongoDB: embedding, or referencing through a unique identifier. With embedding, the related data is merged into a single collection; with referencing, the collections are connected through a unique identifier. The authors used both options, depending on the desired data representation. The datasets are imported with the mongoimport tool and tested for scalability, flexibility, extensibility, and query execution speed. The authors conclude that the ultimate decisions on a MongoDB data model are based on the access pattern: the MongoDB queries depend on the data usage pattern. Indexing limits the number of scanned documents, and using indexes when developing a query increases the query execution speed, resulting in high performance [NCP+14]. In our approach, we implemented the similar idea of embedding the tables into a single collection and used the aggregation pipeline with text indexing. The text index fetches the related documents, which reduces the scanning time and results in high query performance.
Parker et al. [PPV13] compared NoSQL and SQL databases. The authors implemented different NoSQL databases and compared key-value operations, namely storage, read, write, and delete. The operations are executed on Microsoft SQL Server Express, MongoDB, CouchDB, RavenDB, Cassandra, Hypertable, and Couchbase. They evaluated the results of all operations for 100,000 records and performed data retrieval queries for 10, 50, 100, 1,000, 10,000, and 100,000 records. Comparing the query response times of all databases, they concluded that Couchbase and MongoDB are the fastest in retrieving data for the given datasets. In our work, we migrated datasets of 23.7 million, 1 million, and 100 thousand records and compared the query performance with the PostgreSQL database; MongoDB is superior in fetching data from the large datasets compared to PostgreSQL.
Chickerur et al. [CGK15] used an airline database with 1,050,000 records for comparing a relational database with the MongoDB database. Initially, the authors migrated the data from the relational database to MongoDB. The queries were then executed on the MongoDB database and their performance was compared with the MySQL database. They implemented various query operations and concluded that MongoDB provides efficient performance for big data applications compared to the MySQL database. Similarly, in our thesis, we migrated the data from the PostgreSQL database to MongoDB and executed various queries after the migration. The MongoDB database provides efficient query performance for the big datasets extracted from the PostgreSQL database.
7. Conclusion and Future Work
“In three words I can sum up everything I’ve learned about data: it goes on.”
-Robert Frost

7.1 Summary
In this thesis, we compared the performance of the MongoDB database with the PostgreSQL database. To evaluate the performance, the data was migrated to the MongoDB database.

We first provided an overview of the existing PostgreSQL database, showed its structure, and addressed the issues that arise in using it. The PostgreSQL database contains multiple tables. The tables hold the information on patents and scientific publications of the organizations and experts from different countries, taken from different data sources.

To retrieve the relevant data from the tables, JOIN operations are used. The use of multiple JOINs, especially in complex queries, lowers the query execution speed and thus the performance. To overcome these issues, one of the NoSQL databases was selected.

The MongoDB database was selected because it supports all the required characteristics of PostgreSQL. Initially, the data was migrated from the PostgreSQL database to MongoDB. The data model of MongoDB is flexible and easy to design. For evaluating the performance, the aggregation pipeline was used, and the operations involved in the aggregation pipeline were discussed.

As a prototype, a user-friendly interactive web application was developed using the R framework SHINY.

For our experiments, we used four different sizes of datasets and implemented all the queries on the local machine (Windows 10).

7.2 Conclusion
We examined the query performance on small and large datasets. The results of this work are practically relevant for developers who use such databases. We checked the results of each query in the PostgreSQL and MongoDB databases. We also provided a query optimization technique that helps users build an aggregation pipeline in a way that reduces the query execution time. The results are provided for different queries, ranging from simple to large complex queries, on a single table as well as on multiple tables.

• Queries 1, 2, and 3 are performed on the single table (see Table 5.4), where no join operations are required. For a single table, the query performance of the PostgreSQL database is almost the same as that of the MongoDB database.

• For queries 5, 6, 7, and 8 (see Table 5.4), multiple tables are denormalized into an embedded collection in the MongoDB database. In PostgreSQL, queries that involve multiple tables take a long response time. MongoDB shows a clear dominance for complex queries that would involve JOIN operations.

• For complex queries where join operations are performed, the PostgreSQL query performance is up to 50% lower than that of the MongoDB database.

Our experimental results make clear that MongoDB dominates in query execution speed both on the single collection and on the embedded collection. This shows that the MongoDB database provides higher performance than the PostgreSQL database. In conclusion, the thesis demonstrates that the use of a NoSQL database (MongoDB) is beneficial, especially for large complex queries where JOIN operations are involved. However, the size of the database can impact the performance.

7.3 Future Work


In MongoDB text search, text indexing supports features such as filtering stop words (it, a, an, the, and many more), scoring, and stemming (reducing words such as standing and stood to the same base, stand). However, MongoDB does not yet support text search based on synonyms. The development of a full-text search over synonyms or similar words will be an interesting task for the future.
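Until such support exists, one workaround is to expand the search string manually with known synonyms, because $search treats space-separated terms as a logical OR. A sketch with hand-picked terms:

// Poor man's synonym search: OR over manually listed synonyms.
db.documents.find({ "$text": { "$search": "car automobile vehicle" } });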
Regarding MongoDB indexing, a collection can hold only one text index, so before a new text index can be added, the existing one must be dropped, which costs time. It would be interesting if MongoDB removed this restriction in the future.
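Concretely, extending a text index to a second field currently requires dropping and rebuilding it, as the sketch below shows; title_text is the default name MongoDB assigns to an index on the title field, and the abstract field is hypothetical.

// A collection allows only one text index, so it must be dropped
// before a wider one can be created.
db.documents.dropIndex("title_text");
db.documents.createIndex({ "title": "text", "abstract": "text" });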
This thesis evaluated the results for data of up to 1 million records. A possible future benchmark would evaluate the query performance for huge datasets (greater than 1 million rows).
A. Code Listings
This chapter contains the queries used in SQL and in MongoDB. The code for developing the SHINY application is listed first, followed by the SQL and MongoDB query listings.

A.1 R Framework SHINY


# shiny = (ui.R, server.R)
# libraries used for connecting to the MongoDB database
library(shiny)
library(mongolite)
library(jsonlite)

# Define UI for the MongoDB text search application
ui <- fluidPage(
    # Application title
    titlePanel("Mongodb text search Data"),
    sidebarLayout(
        sidebarPanel(
            textInput("query_id", "Title text", ""),
            selectInput("doc_id", "document", choices = c("PATENT", "SCIENCE"))
        ),
        # Show the MongoDB text search output in the main panel
        mainPanel(
            dataTableOutput("mydata")
        ))
)

server <- function(input, output) {
    mdt <- mongo(collection = "data", db = "datadocuments",
                 url = "mongodb://localhost:27017")
    titletext <- reactive({
        mdt$index(toJSON(list("title" = "text"), auto_unbox = TRUE))
        q <- paste0('[ { "$match": { "$text": { "$search": "', input$query_id, '" } } },
                       { "$match": { "doc_type": "', input$doc_id, '" } },
                       { "$group":
                           { "_id": { "player_name": "$player_name" },
                             "number_records": { "$sum": 1 }
                           }
                       },
                       { "$sort": { "number_records": -1 } },
                       { "$limit": 10 }
                     ]')
        jsonlite::validate(q)
        query <- mdt$aggregate(q)
    })
    output$mydata <- renderDataTable({
        titletext()
    })
}
shinyApp(ui = ui, server = server)

Listing A.1: Query covering some field of interest and get a list of relevant documents

# shiny = (ui.R, server.R)
# libraries used for connecting to the MongoDB database
library(shiny)
library(mongolite)
library(jsonlite)

# Define UI for the MongoDB text search application
ui <- fluidPage(
    # Application title
    titlePanel("Mongodb Data"),
    sidebarLayout(
        sidebarPanel(
            textInput("title_id", "Title text", "")
        ),
        # Show the MongoDB text search output in the main panel
        mainPanel(
            dataTableOutput("mydata")
        ))
)

server <- function(input, output) {
    mon <- mongo(collection = "documents", db = "entitydocuments",
                 url = "mongodb://localhost:27017")
    titlesearchresult <- reactive({
        # defining the MongoDB index
        mon$index(toJSON(list("title" = "text"), auto_unbox = TRUE))
        text <- input$title_id
        # text search output
        mon$find(toJSON(list("$text" = list("$search" = text)),
                        auto_unbox = TRUE))
    })
    output$mydata <- renderDataTable({
        titlesearchresult()
    })
}
shinyApp(ui = ui, server = server)

Listing A.2: Query covering some field of interest and get a list of organizations ranked by number of patents, scientific publications matching the query

# shiny = (ui.R, server.R)
# libraries used for connecting to the MongoDB database
library(shiny)
library(mongolite)
library(jsonlite)

# Define UI for the MongoDB text search application
ui <- fluidPage(
    # Application title
    titlePanel("Mongodb text search Data"),
    sidebarLayout(
        sidebarPanel(
            textInput("query_id", "Title text", ""),
            selectInput("doc_id", "document", choices = c("PATENT", "SCIENCE")),
            actionButton("act", "output")
        ),
        # Show the MongoDB text search output in the main panel
        mainPanel(
            tabsetPanel(
                tabPanel("INSTITUTE", dataTableOutput('table1')),
                tabPanel("EXPERT", dataTableOutput('table2'))
            )
        ))
)

Listing A.3: Query for an organization to get a list of collaborators, i.e., organizations with common documents, ranked by number of common patents and number of common scientific publications (user interface); the server side is shown in Listing A.4

server <- function(input, output) {
    mdt <- mongo(collection = "data", db = "datadocuments",
                 url = "mongodb://localhost:27017")
    INSTITUTION <- eventReactive(input$act, {
        mdt$index(toJSON(list("title" = "text"), auto_unbox = TRUE))
        q <- paste0('[ { "$match": { "$text": { "$search": "', input$query_id, '" } } },
                       { "$match": { "doc_type": "', input$doc_id, '" } },
                       { "$match": { "player_type": "INSTITUTION" } },
                       { "$project": { "player_name": 1, "title": 1,
                                       "player_type": 1, "country_code": 1 } },
                       { "$group":
                           { "_id": { "player_name": "$player_name" },
                             "number_records": { "$sum": 1 },
                             "player_name": { "$first": "$player_name" },
                             "player_type": { "$first": "$player_type" },
                             "country_code": { "$first": "$country_code" }
                           } },
                       { "$sort": { "number_records": -1 } },
                       { "$limit": 10 }
                     ]')
        jsonlite::validate(q)
        query <- mdt$aggregate(q)
    })
    EXPERT <- eventReactive(input$act, {
        mdt$index(toJSON(list("title" = "text"), auto_unbox = TRUE))
        q <- paste0('[ { "$match": { "$text": { "$search": "', input$query_id, '" } } },
                       { "$match": { "doc_type": "', input$doc_id, '" } },
                       { "$match": { "player_type": "EXPERT" } },
                       { "$project": { "player_name": 1, "title": 1,
                                       "player_type": 1, "country_code": 1 } },
                       { "$group":
                           { "_id": { "player_name": "$player_name" },
                             "number_records": { "$sum": 1 },
                             "player_name": { "$first": "$player_name" },
                             "player_type": { "$first": "$player_type" },
                             "country_code": { "$first": "$country_code" }
                           } },
                       { "$sort": { "number_records": -1 } },
                       { "$limit": 10 }
                     ]')
        jsonlite::validate(q)
        query <- mdt$aggregate(q)
    })
    output$table1 <- renderDataTable({
        INSTITUTION()
    })
    output$table2 <- renderDataTable({
        EXPERT()
    })
}
shinyApp(ui = ui, server = server)

Listing A.4: Query for an organization to get a list of collaborators, i.e., organizations with common documents, ranked by number of common patents and number of common scientific publications (server side)

A.2 SQL and MongoDB Queries


-- Data selection from a single table
SELECT * FROM public.entity_docs;

Listing A.5: PostgreSQL query example 1



-- Data selection from a single table
SELECT *
FROM public.entity_docs
WHERE doc_type IN ('SCIENCE');

Listing A.6: PostgreSQL query example 2

-- Data selection from a single table
SELECT *
FROM public.entity_docs
WHERE tsv_title @@ to_tsquery('Motion');

Listing A.7: PostgreSQL query example 3

-- Data selection from multiple tables
SELECT *
FROM public.link_player_doc x
JOIN public.entity_docs y ON x.doc_id = y.doc_id AND y.doc_type IN ('SCIENCE')
JOIN public.entity_player z ON z.player_id = x.player_id;

Listing A.8: PostgreSQL query example 4

-- Data selection from multiple tables
SELECT
    z.player_id,
    z.player_type,
    z.player_sub_type,
    z.player_name,
    z.country_code,
    z.address
FROM public.link_player_doc x
JOIN public.entity_docs y ON x.doc_id = y.doc_id AND y.doc_type IN ('PATENT', 'SCIENCE')
JOIN public.entity_player z ON z.player_id = x.player_id AND player_type = 'INSTITUTION'
WHERE tsv_fulltext @@ to_tsquery('service')
GROUP BY z.player_id;

Listing A.9: PostgreSQL query example 5



-- Data selection from multiple tables
SELECT
    z.player_id,
    z.player_type,
    z.player_sub_type,
    z.player_name,
    z.country_code,
    z.address,
    count(*) FILTER (WHERE doc_type = 'SCIENCE') AS nb_science,
    count(*) FILTER (WHERE doc_type = 'PATENT') AS nb_patent
    -- y.doc_id,
    -- y.doc_type,
    -- y.title
FROM public.link_player_doc x
JOIN public.entity_docs y ON x.doc_id = y.doc_id AND y.doc_type IN ('PATENT', 'SCIENCE')
JOIN public.entity_player z ON z.player_id = x.player_id AND player_type = 'INSTITUTION'
WHERE tsv_fulltext @@ to_tsquery('Motion')
GROUP BY z.player_id;  -- grouping required by the aggregates above

Listing A.10: PostgreSQL query example 6

-- Data selection from multiple tables
SELECT *
FROM public.link_player_doc x
JOIN public.entity_docs y ON x.doc_id = y.doc_id AND y.doc_type IN ('PATENT', 'SCIENCE')
JOIN public.entity_player z ON z.player_id = x.player_id
WHERE x.player_doc_link_type IN ('{INVENTOR}');

Listing A.11: PostgreSQL query example 7

-- Data selection from multiple tables
SELECT
    z.player_sub_type,
    z.player_name,
    z.player_type,
    y.doc_source,
    z.date_inserted,
    y.meta,
    y.country_code
FROM public.link_player_doc x
JOIN public.entity_docs y ON x.doc_id = y.doc_id AND y.doc_type IN ('SCIENCE')
JOIN public.entity_player z ON z.player_id = x.player_id AND z.player_sub_type = '{STARTUP,COMPANY}'
WHERE tsv_fulltext @@ to_tsquery('behavior');

Listing A.12: PostgreSQL query example 8



// Data selection from a single table
db.documents.find({})
    .projection({})
    .limit(1000000)

Listing A.13: MongoDB query example 1

db.doctable.find({ 'doc_type': 'SCIENCE' })
    .projection({})
    .limit(1000000)

Listing A.14: MongoDB query example 2

// Data selection from a single table
db.documents.aggregate([
    { "$match":
        { "$text":
            { "$search": "Motion" } }
    }
]);

Listing A.15: MongoDB query example 3

// Data selection from multiple tables
db.documents.aggregate([
    { "$match":
        { "$text":
            { "$search": "Science" } }
    }
]);

Listing A.16: MongoDB query example 4

// Data selection from multiple tables
db.embedded.aggregate([
    { "$match": { "$text": { "$search": "SERVICE" } } },
    { "$group": { "_id": { "player_name": "$player_name",
                           "player_type": "INSTITUTION",
                           "country_code": "$country_code" },
                  "number_records": { "$sum": 1 }
    } },
    { "$project": { "player_name": "$player_name", "country_code": "$country_code",
                    "number_records": "$number_records", "_id": 1 } },
    { "$sort": { "number_records": -1 } },
    { "$limit": 1000000 }
]);

Listing A.17: MongoDB query example 5



// Data selection from multiple tables
db.embedded.aggregate([
    { "$match":
        { "$text":
            { "$search": "MOTION" } } },
    // one of the two following stages is used per run:
    { "$match": { "doc_type": "PATENT" } },   // for doc_type = PATENT
    { "$match": { "doc_type": "SCIENCE" } },  // for doc_type = SCIENCE
    { "$group": { "_id": { "player_name": "$player_name",
                           "player_type": "INSTITUTION",
                           "country_code": "$country_code" },
                  "number_records": { "$sum": 1 }
    } },
    { "$project": { "player_name": "$player_name",
                    "country_code": "$country_code",
                    "number_records": "$number_records",
                    "_id": 1 } },
    { "$sort": { "number_records": -1 } },
    { "$limit": 100000 }
]);

Listing A.18: MongoDB query example 6

// Data selection from multiple tables
db.embedded.find(
    { 'player_doc_link_type': '{INVENTOR}' })
    .projection({})
    .sort({ _id: -1 })
    .limit(1000000)

Listing A.19: MongoDB query example 7



// Data selection from multiple tables
db.embedded.aggregate([
    { "$match":
        { "$text": { "$search": "behaviour" } } },
    { "$match": { "doc_type": "SCIENCE" } },
    { "$group":
        { "_id":
            { "player_sub_type": "{STARTUP,COMPANY}",
              "player_name": "$player_name",
              "player_type": "INSTITUTION",
              "doc_source": "$doc_source",
              "date_inserted": "$date_inserted",
              "meta": "$meta",
              "country_code": "$country_code" },
          "number_records": { "$sum": 1 }
    } },
    { "$project": { "player_name": "$player_name",
                    "player_type": "INSTITUTION",
                    "doc_type": "SCIENCE",
                    "date_inserted": "$date_inserted",
                    "country_code": "$country_code",
                    "number_records": "$number_records",
                    "_id": 1 } },
    { "$sort": { "number_records": -1 } },
    { "$limit": 100000 }
]);

Listing A.20: MongoDB query example 8


Bibliography
[2589] Relational database: A practical foundation for productivity. In John Mylopolous and Michael Brodie, editors, Readings in Artificial Intelligence and Databases, pages 60–68. Morgan Kaufmann, San Francisco (CA), 1989. (cited on Page vii and 7)

[AG08] Renzo Angles and Claudio Gutiérrez. An overview of graph databases. ACM Comput. Surv., 40(1):1:1–1:39, 2008. (cited on Page 15)

[BD04] Paul Beynon-Davies. Transaction Management, pages 403–417. Macmillan Education UK, London, 2004. (cited on Page 9)

[Bha16] Niteshwar Datt Bhardwaj. Comparative study of CouchDB and MongoDB – NoSQL document oriented databases. International Journal of Computer Applications, 136(3):35–46, February 2016. (cited on Page 28)

[CDG01] Surajit Chaudhuri, Umeshwar Dayal, and Venkatesh Ganti. Database technology for decision support systems. IEEE Computer, 34(12):48–55, 2001. (cited on Page 9)

[CGK15] S. Chickerur, A. Goudar, and A. Kinnerkar. Comparison of relational database with document-oriented database (MongoDB) for big data applications. In 2015 8th International Conference on Advanced Software Engineering & Its Applications (ASEA), pages 41–47, Nov 2015. (cited on Page 68)

[Cod82] E. F. Codd. Relational database: A practical foundation for productivity. Commun. ACM, 25(2):109–117, 1982. (cited on Page 6)

[Dou05] Korry Douglas and Susan Douglas. PostgreSQL: The comprehensive guide to building, programming, and administering PostgreSQL databases. Sams, 2005. (cited on Page 11)

[DS12] Darshana Shimpi and Sangita Chaudhari. Survey of graph database models. In International Conference on Recent Trends in Information Technology and Computer Science (ICRTITCS 2012), proceedings published in International Journal of Computer Applications (IJCA), ISSN 0975-8887, 2012. (cited on Page 15)

[HELD11] Jing Han, Haihong E, Guan Le, and Jian Du. Survey on NoSQL database. In 2011 6th International Conference on Pervasive Computing and Applications, pages 363–366, Oct 2011. (cited on Page 26)

[HJ11a] Robin Hecht and Stefan Jablonski. NoSQL evaluation: A use case oriented survey. In 2011 International Conference on Cloud and Service Computing, Hong Kong, pages 336–341, 2011. (cited on Page 12)

[HJ11b] Robin Hecht and Stefan Jablonski. NoSQL evaluation: A use case oriented survey. In 2011 International Conference on Cloud and Service Computing, CSC 2011, Hong Kong, pages 336–341, December 12-14, 2011. (cited on Page 2)

[HR15] Harsha R. Vyawahare and P. P. Karde. An overview on graph database model. International Journal of Innovative Research in Computer and Communication Engineering, 3, 2015. (cited on Page 15 and 16)

[KGK14] A. Kanade, A. Gopal, and S. Kanade. A study of normalization and embedding in MongoDB. In 2014 IEEE International Advance Computing Conference (IACC), pages 416–421, Feb 2014. (cited on Page 30)

[KSM17] K. B. S. Kumar, Srividya, and S. Mohanavalli. A performance comparison of document oriented NoSQL databases. In 2017 International Conference on Computer, Communication and Signal Processing (ICCCSP), pages 1–6, Jan 2017. (cited on Page 16)

[KvDB99] Inge C. Kerssens-van Drongelen and Jan Bilderbeek. R&D performance measurement: more than choosing a set of metrics. R&D Management, 29(1):35–46, 1999. (cited on Page 1)

[KYLTC12] Ken Ka-Yin Lee, Wai-Choi Tang, and Kup-Sze Choi. Alternatives to relational database: Comparison of NoSQL and XML approaches for clinical data storage. Computer Methods and Programs in Biomedicine, 110:99–110, 11 2012. (cited on Page 2)

[Mak15] Dmitri Maksimov. Performance comparison of MongoDB and PostgreSQL with JSON types. Master's thesis, 2015. (cited on Page 11)

[MD18] Ajeet Ghodeswar, Trupti Shah, Amruta Mhatre, and Santosh Dodamani. A comparative study of data migration techniques. IOSR Journal of Engineering (IOSRJEN), 9:77–82, 2018. (cited on Page 31)

[MH13] A B M Moniruzzaman and Syed Hossain. NoSQL database: New era of databases for big data analytics - classification, characteristics and comparison. Int J Database Theor Appl, 6, 06 2013. (cited on Page ix, 12, 13, and 15)

[Moh16] Tim Mohring. Design and implementation of a NoSQL-concept for an international and multicentral clinical database. Master's thesis, 2016. (cited on Page 14 and 67)

[NCP+14] Sanjoy Singh Ningthoujam, Manabendra Dutta Choudhury, Kumar Singh Potsangbam, Pankaj Chetia, Lutfun Nahar, Satyajit D. Sarker, Norazah Basare, and Anupam Das Talukdar. NoSQL data model for semi-automatic integration of ethnomedicinal plant data from multiple sources. wileyonlinelibrary.com/journal/pca, April 2014. (cited on Page 67 and 68)

[PPV13] Zachary Parker, Scott Poe, and Susan V. Vrbsky. Comparing NoSQL MongoDB to an SQL DB. In Proceedings of the 51st ACM Southeast Conference, ACMSE '13, pages 5:1–5:6, New York, NY, USA, 2013. ACM. (cited on Page 68)

[RU15] Catherine M. Ricardo and Susan D. Urban. Databases Illuminated. Jones and Bartlett Publishers, Inc., USA, 3rd edition, 2015. (cited on Page vii and 8)

[Sim12] Salomé Simon. Brewer's CAP Theorem, 2012. (cited on Page 13)

[SKC16] W. Seo, N. Kim, and S. Choi. Big data framework for analyzing patents to support strategic R&D planning. In 2016 IEEE 14th Intl Conf on Dependable, Autonomic and Secure Computing, 14th Intl Conf on Pervasive Intelligence and Computing, 2nd Intl Conf on Big Data Intelligence and Computing and Cyber Science and Technology Congress (DASC/PiCom/DataCom/CyberSciTech), pages 746–753, Aug 2016. (cited on Page 29)

[TML+02] Qijia Tian, Jian Ma, Cleve J. Liang, Ron Chi-Wai Kwok, Ou Liu, and Quan Zhang. An organizational decision support approach to R&D project selection. In 35th Hawaii International Conference on System Sciences (HICSS-35 2002), CD-ROM/Abstracts Proceedings, 7-10 January 2002, Big Island, HI, USA, page 251, 2002. (cited on Page 5)

[VGPC+17] Jose M. Vicente-Gomila, Anna Palli, Begoña Calle, Miguel A. Artacho, and Sara Jimenez. Discovering shifts in competitive strategies in probiotics, accelerated with techmining. Scientometrics, 111(3):1907–1923, June 2017. (cited on Page 5)
I hereby declare that I have written this thesis independently and have used no sources or aids other than those stated.

Magdeburg, April 23, 2019
