Design and Implementation of A NoSQL Database
Databases and Software Engineering
Master’s Thesis
Author:
Advisors: Prof. Dr. rer. nat. habil. Gunter Saake (Otto von Guericke University Magdeburg)
          Dipl.-Inf. Wolfram Fenske (Otto von Guericke University Magdeburg)
          Dr. rer. nat. Matthias Plaue (MAPEGY GmbH, Berlin)
Contents

List of Tables

1 Introduction
  1.1 Goal of the Thesis
  1.2 Reader's Guide
2 Technical Background
  2.1 Decision Support in R&D Management
  2.2 Relational Database Management System
    2.2.1 RDBMS Architecture
    2.2.2 Database Queries
    2.2.3 PostgreSQL
  2.3 NoSQL Databases
    2.3.1 NoSQL Database Characteristics
    2.3.2 Classification of NoSQL Databases
    2.3.3 MongoDB
      2.3.3.1 MongoDB Architecture
      2.3.3.2 Schema Design
      2.3.3.3 Query Model
5 Evaluation
  5.1 Evaluation Setup
    5.1.1 Machine Used
    5.1.2 Data Characteristics
  5.2 Experiments
    5.2.1 Data Migration
    5.2.2 Experiment Queries
  5.3 Comparison between PostgreSQL and MongoDB
    5.3.1 Impact of the Size of Datasets
  5.4 Discussion
6 Related Work
A Code Listings
  A.1 R Framework SHINY
  A.2 SQL and MongoDB Queries
Bibliography
List of Figures
A.4 Query for an organization and get a list of collaborators, i.e., organizations with common documents; rank them by number of common patents and number of common scientific publications at server side
A.5 PostgreSQL query example 1
A.6 PostgreSQL query example 2
A.7 PostgreSQL query example 3
A.8 PostgreSQL query example 4
A.9 PostgreSQL query example 5
A.10 PostgreSQL query example 6
A.11 PostgreSQL query example 7
A.12 PostgreSQL query example 8
A.13 MongoDB query example 1
A.14 MongoDB query example 2
A.15 MongoDB query example 3
A.16 MongoDB query example 4
A.17 MongoDB query example 5
A.18 MongoDB query example 6
A.19 MongoDB query example 7
A.20 MongoDB query example 8
1. Introduction
The number of internet users is growing steadily, and with it the amount of data grows exponentially: internet usage produces about 2.5 quintillion (10^18) bytes of data every day [1]. For the past four decades, relational databases have had exclusive control over data storage, but some applications suffer from performance and scalability problems with them. Companies are therefore looking for more suitable database systems, and it is in this context that NoSQL databases were developed.
The thesis work was done at MAPEGY GmbH, a company that provides data-driven decision support in the fields of life sciences, energy, information & communication systems, industry, and finance & insurance, and that also provides data products for its customers. Research and development (R&D) management plays a key role in supervising and managing the research departments of MAPEGY's customers. The primary objective of R&D is the development of new technology by applying creative ideas to improve knowledge based on patents, scientific publications, and market updates [2]. It focuses on vision and strategy from various perspectives, such as the financial, customer, internal business, and innovation & learning perspectives. This thesis mainly focuses on the innovation & learning perspective, which includes evaluating ideas for new projects [KvDB99]. R&D management helps to gain knowledge that can be put to practical use in the future.
MAPEGY GmbH uses a PostgreSQL-based relational database, which has some limitations. In this thesis, the data is taken from the company's PostgreSQL data warehouse, which contains huge datasets of patents, scientific publications, organizations, and so on. A detailed description of the data is given in Chapter 3.
Although relational databases are the most common databases and are consistently good at storing and retrieving data, they have limitations in dealing with the data. A relational database needs a predefined schema for the data normalization process [KYLTC12]. One important limitation is that building relationships between entities is complex: large queries over interconnected tables require JOIN operations to fetch the relevant information. Such queries have long response times, which makes querying costly and degrades performance.

[1] https://www.ibm.com/blogs/insights-on-business/consumer-products/2-5-quintillion-bytes-of-data-created-every-day-how-does-cpg-retail-manage-it/
[2] https://www.civilserviceindia.com/subject/Management/notes/r-and-d-management.html
This thesis investigates the performance, in terms of query execution time, of PostgreSQL versus one of the NoSQL databases. Although NoSQL databases were only introduced in the early 2000s, they have proven their ability to work with large unstructured and unrelated data. The main reasons for their popularity are that they do not require a strict schema structure and that they provide high performance for large datasets. Unlike relational databases, NoSQL databases rely on denormalization, which means data can be retrieved faster because no JOIN operations are involved. One of the NoSQL databases is therefore designed and implemented in order to decide whether it is more efficient than the PostgreSQL database. The resulting evaluation of the two databases supports decision making for R&D management.
NoSQL is not a single technology. There are dozens of NoSQL databases offering various options; they are mainly categorized as key-value stores (example: Redis), wide-column stores (example: HBase), document-oriented databases (example: MongoDB), and graph databases (example: Neo4j) [HJ11b]. Choosing the right database is a key decision for a company, and R&D management plays an important role in making it. Here, MongoDB, one of the most popular document-oriented databases, is selected because it provides the abilities of a relational database along with high flexibility, high scalability, and high performance.
The thesis provides information for IT managers and engineers on deciding on and selecting a database system for their requirements. Its results help in understanding the respective advantages and disadvantages of the PostgreSQL and MongoDB databases. The data migration and query capabilities of MongoDB are evaluated in terms of performance by comparing them to the PostgreSQL database. Chapter 3 and Chapter 4 discuss the selection of the database and its implementation in detail.
(a) Designing a data model which includes at least the following entities: scientific publications, patents, essential metadata (which must contain the titles of the documents), and the organizations (companies, research institutions); the list of organizations is extracted from the data warehouse and loaded into the MongoDB server.
(a) Enter a query and retrieve the information related to a particular field of interest.
(b) Enter a query covering some field of interest and get all patents and scientific publications.
(c) Enter a query covering some field of interest and get a list of organizations and experts, projected by document type 'PATENT', matching the query.
(d) Enter a query for an organization and get a list of organizations and experts, ranked by the number of patents and the number of scientific publications.
1. Chapter: Introduction
The first chapter gives a brief introduction to decision support for R&D management, along with the limitations of the company's currently used technology (PostgreSQL), and introduces NoSQL databases. The second part states the goal of the thesis, followed by the structure of the thesis.
5. Chapter: Evaluation
This chapter discusses the evaluation setup: the machine used, the datasets used, and the experiments performed. Then the query performance of PostgreSQL and MongoDB is compared and the results are evaluated.
• The scope of market opportunities is taken into account while designing the
project.
(Figure: R&D activity decision flow — market opportunities and the company's resources & constraints lead to a new project; if the project poses technical difficulties and the available knowledge is insufficient, the missing knowledge is identified and addressed through an R&D activity, otherwise no R&D activity is needed.)
• The new project is planned considering resources such as cost, time, and the project development environment.
• A lack of technical knowledge hinders the development of a new project, so missing knowledge should be identified and acquired.
The development of new projects requires a good understanding of current market needs, and knowledge of design and implementation is necessary to carry the project out efficiently according to those needs.
Decisions are made by comparing new technology to existing technology. In this thesis, relational and non-relational databases are compared in terms of database query performance (query execution speed), which helps in making the decision between the databases.
The basic building block of the relational model is the table (relation), formed by the combination of tuples (rows) and attributes (columns). For instance, Figure 2.3 illustrates the relations between tables using primary and foreign key constraints: an employee table with primary and foreign keys is interlinked with other tables. Schema normalization is an important factor in designing a relational database schema. With normalized data, queries contain joins, which makes query operations complex, especially for large queries retrieving data from multiple tables [1].
Figure 2.3 gives an example of a relational schema model. The tables contain information about employees. The data is normalized, and the related information is stored in different tables that are connected using primary and foreign key constraints. Suppose, for instance, that the annual salary and the city of an employee are needed: John is an employee living in Berlin, and his annual salary is 120,000 Euros. The information about the employee is stored in different tables and can be retrieved at any time using JOIN operations. This is a simple query that needs only a small number of JOINs. In a real-world scenario, however, there are large numbers of interconnected tables, and retrieving the required information takes a complex query with many JOINs. In such cases the execution speed decreases, resulting in a performance problem.

[1] https://www.objectivity.com/is-your-database-schema-too-complex/
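As an illustration, a query of this kind might look as follows (a minimal sketch; the employee and city table and column names are assumptions, since the schema of Figure 2.3 is not reproduced here):

-- annual salary and city of the employee John, joined across two tables
SELECT e.name, e.annual_salary, c.city_name
FROM employee e
JOIN city c ON c.city_id = e.city_id
WHERE e.name = 'John';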
The conceptual level, also known as the data model, describes the structure of the database for the end users: the schema represents the content of the whole database and the relationships among the data. The internal level, also known as the physical level, defines how the data is stored in the database and the layout of the records in files [3]. The external level allows a particular group of authorized users to view the data in the database. Physical data independence makes it possible to change the internal schema, for example to improve performance, without affecting the conceptual level [RU15]; logical data independence makes it possible to change the conceptual schema without affecting the external level [RU15]. There are numerous RDBMSs, such as Microsoft SQL Server, Oracle Database, MySQL, IBM DB2, PostgreSQL, and many more [4]. In this thesis, the PostgreSQL database is used for the performance evaluation.
[2] https://www.objectivity.com/is-your-database-schema-too-complex/
[3] http://ecomputernotes.com/fundamental/what-is-a-database/data-independence
[4] https://www.keycdn.com/blog/popular-databases
2.2.3 PostgreSQL
PostgreSQL is an open source relational database that aims to provide high performance, robustness, and reliability to its clients [7]. PostgreSQL stores data in tables, and the data is generally accessed using the SQL language. Since PostgreSQL is a relational database, it requires a predefined data structure based on the application's requirements; related data stored in different tables is accessed using JOIN operations. PostgreSQL supports not only system-defined data types but also user-defined data types, index types, and procedural languages that users can employ according to their requirements.

[5] https://iteritory.com/acid-properties-in-transactions-dbms/
[6] https://francois-encrenaz.net/what-is-a-dbms-a-RDBMS-OLAP-and-oltp/
[7] https://www.postgresql.org/
Schema Design
Every database must have at least one schema. A schema covers the content listed in Figure 2.5.
Figure 2.5: PostgreSQL schema [8]
Schema design helps in organizing a wide range of data into a finely grained structure and provides a unique namespace. Schemas are used for many different purposes: for example, for authorization control (when many people use the environment simultaneously, rules can be created to grant access to a database schema based on individual roles), for organizing database objects, for maintaining third-party SQL code, and for efficient performance [9].
Tables: Tables are created with the CREATE TABLE command by specifying the name of the table and the column names with their data types [10].
Range: A range is a data type usually used for selecting a range of values [11].
View: Suppose we need information that combines two different tables, but we do not want to write the full query every time. For such situations we create a view and then refer to the view instead of the query. Consider tables with city and employee data; the example below shows how such a view is created [12].
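A minimal sketch of such a view (the table and column names are assumptions for illustration):

-- a view combining employee and city information
CREATE VIEW myview AS
SELECT e.name, e.annual_salary, c.city_name
FROM employee e
JOIN city c ON c.city_id = e.city_id;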
The view can then be queried with a simple command:
SELECT * FROM myview;
[8] https://hub.packtpub.com/overview-postgresql/
[9] https://hub.packtpub.com/overview-PostgreSQL/
[10] http://www.postgresql.org/docs/9.4/static/release-9-2.html
[11] http://www.postgresql.org/docs/9.4/static/release-9-2.html
[12] http://www.postgresql.org/docs/9.4/static/release-9-4.html
PostgreSQL supports several index types [17]:

1. B-tree index (balanced tree): The default index type when no other type is specified with the CREATE INDEX command. The data is indexed so that both sides of the tree stay almost equally balanced. B-tree indexes use the operators =, <, >, <=, and >= whenever a column is involved in a comparison.

[13] https://hub.packtpub.com/overview-PostgreSQL/
[14] https://hub.packtpub.com/overview-postgresql/
[15] http://www.postgresql.org/docs/9.4/static/release-9-2.html
[16] http://www.postgresql.org/docs/9.4/static/release-9-4.html
[17] http://www.postgresql.org/docs/9.1/indexes-types.html
2. Hash index: Hash indexes handle equality predicates. They are generally less used because they offer little benefit in the PostgreSQL database.

3. GiST index: GiST is an infrastructure within which many different indexing strategies can be implemented. GiST indexes are generally used for geometric data types and also support full-text search.

4. GIN (generalized inverted index): GIN is useful for complex structures, for instance arrays, and for full-text search.

5. BRIN (block range index): BRIN arranges the data systematically by storing the lowest and highest values of each block. Partial indexing, multicolumn indexing, and unique indexing are some other kinds of indexing supported by the PostgreSQL database.
This section explained the RDBMS architecture and database query techniques; the queries help in understanding how data is retrieved from the database. PostgreSQL is a relational database that also supports the unstructured data types JSON and JSONB. SQL (Structured Query Language) is used to retrieve the data, and PostgreSQL supports full-text search and phrase search using text indexing. The data model helps in organizing huge datasets into a proper structure with a unique namespace. For large complex queries, however, the database needs to perform JOIN operations, which are costly and slow down the execution time.
For such complex operations, NoSQL databases work effectively: thanks to their flexible schema structure, NoSQL databases do not require any JOIN operations. So, when a flexible schema is needed, NoSQL databases are the best fit.
Figure 2.6: Key-value store data model [18]
Contrary to a traditional RDBMS, data retrieval does not involve fetching data from columns and rows. The architecture provides high performance, is simple to operate, and handles a massive amount of load. Key-value databases also have drawbacks: they have no common query language (each implementation differs) and they mostly support only simple data models [Moh16].
Wide column stores are also known as extensible record stores. The data is stored in tables which contain many rows, each with a unique row key. Consider the single row shown in Figure 2.7: the first column is the row key (the unique identifier of a row), and each column name uniquely identifies a value within the row. In other words, a wide column store is a two-dimensional key-value store. The primary usage of wide column (column family) stores is distributed data storage, large-scale data processing (such as sorting, parsing, conversion between code values — for instance hexadecimal, binary, decimal — and algorithm processing), batch-oriented processing, and predictive analytics. Wide column databases generally offer high performance when querying the data and provide strong scalability. Many organizations use wide column stores, for example Spotify and Facebook, as does Google's BigTable model [MH13].

Figure 2.7: Wide column store architecture [19]

[19] https://studio3t.com/whats-new/nosql-database-types
Graph Database
Graph databases were first introduced in the early 1990s. At that time the database community was turning to semi-structured data (although the graph databases of the 1990s had no relation to semi-structured data), and the existing technologies were effective for most application requirements [AG08]. Today, graph databases are gaining interest for managing the relationships within massive datasets by connecting entities internally. Ideal use cases for graph databases are traversing social networks, pattern detection in forensic investigations, and biological networks [DS12]. The graph represents the connections between objects and illustrates the relationships between them [DS12]; querying in a graph database is a traversal [HR15]. Figure 2.8 is a simple example showing a graph model in which every node and every relationship has properties. A relationship is followed by dereferencing pointers, so a query can execute with only one index lookup. This approach provides higher performance than relational databases, where retrieving information from multiple tables means executing the query via foreign keys with multiple index lookups. Maintaining performance as the data volume grows is important for every database; for interconnected data, graph databases outperform relational and other NoSQL databases, and unlike an RDBMS, the performance of a graph database remains stable even when the datasets grow massively [HR15]. Popular graph databases are Neo4j [21], InfiniteGraph [22], and FlockDB [23].

Figure 2.8: Graph data model [20]

[20] https://blog.octo.com/en/graph-databases-an-overview
Data model   Example databases      Performance  Scalability  Flexibility  Complexity  Functionality
Wide column  HBase, Cassandra       High         High         Moderate     Moderate    Minimal
Document     MongoDB, CouchDB       High         Variable     High         Low         Low
Graph        Neo4j, InfiniteGraph   Variable     Variable     High         Variable    Graph theory
Relational   MySQL, PostgreSQL      Variable     Variable     Low          Moderate    Relational algebra

Figure 2.9: SQL & NoSQL database classification [27]
As Figure 2.9 indicates, there are many NoSQL paradigms in today's data world. Most NoSQL databases are open source or low cost and can be used for non-commercial or commercial purposes. For the practical implementation of this research, we use the MongoDB database and compare its performance with the company's database (i.e., PostgreSQL).

[21] https://neo4j.com/
[22] https://www.objectivity.com/products/infinitegraph
[23] https://blog.twitter.com/engineering/en_us/a/2010/introducing-flockdb.html
[24] https://www.mongodb.com
[25] http://couchdb.apache.org/
[26] http://docs.basho.com/
[27] https://www.slideshare.net/bscofield/nosql-codemash-2010
2.3.3 MongoDB
MongoDB is an open source document-oriented database; documents are grouped together in collections, as shown in Figure 2.10. Table 2.2 relates MongoDB terminology to that of an RDBMS.
RDBMS      MongoDB
Database   Database
Tables     Collections
Rows       Documents
Indexing   Indexing
Joins      Embedded documents or lookups

Table 2.2: Terminology differences between MongoDB and an RDBMS [29]
MongoDB has its own query language, which makes it easy to retrieve information from the database. Indexing, aggregation, and map-reduce are some of the powerful query characteristics of MongoDB.
Figure 2.11: MongoDB Nexus architecture [30]

Figure 2.13: MongoDB normalized approach example [31]
The referencing data model is also known as the normalized data model; Figure 2.13 shows an example. In this data model, the data is retrieved by referencing (using the unique object id). This model is best suited for huge data sets [31].
2.3.3.3 Query Model
This section discusses some important aspects of the query model. MongoDB supports the mongo shell as well as Python, Ruby, R, Scala, C#, and many more programming languages for querying data in the MongoDB database [31].
Query options
MongoDB has its own query language to perform operations like counting documents, matching, ranking, and projecting. It also supports range queries and other expressions such as $gte (greater than or equal), $lt (less than), and $ne (not equal). To match documents by elements inside an array, the $elemMatch operator is used. The operators are case sensitive [31].
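For illustration, a find() query combining such operators might look as follows (a sketch against the thesis's doc collection; the filter values are made up):

// patents inserted from 2014 onwards, excluding Japanese filings
db.doc.find({
  doc_type: "PATENT",
  country_code: { $ne: "JP" },
  doc_timestamp: { $gte: "2014-01-01 00:00:00" }
})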
Indexing
Indexing is an important feature of MongoDB: indexes make querying the data in a collection efficient. Without an index, MongoDB must inspect all documents in a collection, which lowers the execution speed; with an index, MongoDB restricts the number of documents it must examine. When a collection is created, MongoDB creates a unique default index (the _id index), which prevents the insertion of documents with identical values for that field. MongoDB integrates various index types, namely text search indexes, single-field indexes, and compound indexes; partial and unique are some of the properties an index can have [32].
By default, MongoDB creates an index on the _id field, and a single-field index can be created on any one field. Extending single-field indexing, a compound index is created by indexing two or more fields of a collection.
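A sketch of both variants in the mongo shell (the field names are taken from the thesis's data model; the combination chosen for the compound index is an assumption):

// single-field index on the document type
db.doc.createIndex({ doc_type: 1 })

// compound index on document type and country code
db.doc.createIndex({ doc_type: 1, country_code: 1 })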
Text indexing
A text index is created on string content in a specific field or in multiple fields of a database. MongoDB supports partial and full-text search. Text indexes can be applied to a single field, to multiple fields, or via the wildcard specifier ($**), which indexes every field containing text content in a collection. There are certain limitations to text indexing. For instance, suppose we create a single text index on a field for text queries.
The index is created and used for text queries. But when we then run a compound text indexing command on the same collection, as sketched below, MongoDB throws an error stating that a text search index already exists. So, if we want to create another text index on the same collection, we must drop the previously existing one.
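A sketch of this restriction in the mongo shell (the second, compound text index over title and abstract is an assumed example):

// first text index — succeeds
db.doc.createIndex({ title: "text" })

// second (compound) text index on the same collection — fails,
// since only one text index per collection is allowed
db.doc.createIndex({ title: "text", abstract: "text" })

// the existing text index must be dropped before a new one can be created
db.doc.dropIndex("title_text")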
So, when creating a text index, it should be noted that certain rules apply to indexing a column in the database.
[32] https://docs.mongodb.com/manual/indexes/
1. Stop words: Words like a, an, the, at, etc. are filtered out of the indexed content.
With MongoDB text indexing, it is possible to search text in one field (single-field text index), in multiple fields (compound text index), or in all the text of a collection (wildcard specifier index) [32]. MongoDB also makes phrase search possible: it fetches the documents that contain relevant information based on the given phrase.
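A phrase search is expressed by quoting the phrase inside the $text search string; a minimal sketch against the thesis's collection:

// documents whose indexed text contains the exact phrase "video game console"
db.doc.find({ $text: { $search: "\"video game console\"" } })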
Aggregation
Query operations can be performed with different methods in the MongoDB database:
1. Aggregation pipeline
2. Map-reduce
NoSQL databases play a key role in managing unstructured data. They are mainly classified into four types, each with its own unique features; which one to use depends on the application's requirements. Furthermore, the rich document-oriented MongoDB database was explained. MongoDB has the abilities of a relational database along with additional features such as flexibility, scalability, and high performance. The data is stored as JSON documents, which are represented in BSON format. There are two possible data modeling techniques in MongoDB, the embedded and the referencing data model, each with its own purpose; the two approaches can also be combined, especially for large data sets. MongoDB has its own query language to retrieve the relevant information from the database, and MongoDB indexing helps to minimize the data scanned, which results in high performance.
3. Requirements and Concept
This chapter discusses the structure of the data in the PostgreSQL database server and its limitations, the applicability of NoSQL technology for migrating the data from PostgreSQL to a NoSQL database, and the investigation of query performance on the data taken from the PostgreSQL database. The investigation concludes which of the two databases (PostgreSQL or the NoSQL database) is the better fit in terms of query performance.
The data used for this thesis is taken from MAPEGY's PostgreSQL database server. The information in the database provides data-driven decision support to R&D managers: it is continuously updated and contains information on organizations and expert profiles, providing innovation insights for R&D managers, investors, and business analysts. The decisions supported include investment decisions, technology decisions, and decisions based on resources, cost, and time; this thesis mainly focuses on technology-oriented decision support. The database contains huge datasets of patents, scientific publications, documents, organizations, experts, metadata, and so on, including date and time, title, country, language, and much more, as well as the time and date at which records were last updated. For this huge dataset, the hierarchical structure is modeled in different tables with primary keys, connected via link tables. The data is derived from millions of research papers, articles, patent offices, social media, and many other innovation-related platforms.
The database is designed with various tables (Figure 3.1) of the following types:
1. entity tables that contain the actual entities, such as documents and players.
2. link tables that contain information about the connections between entity tables.
3. KPI tables (key performance indicators) that provide indicators to score, e.g., assets. KPIs are used to evaluate a company's success in achieving business-oriented goals.
4. stat tables that contain statistics such as document counts; they monitor the number of records present in the database and the number of deleted records.
5. trend tables that provide the number of records per year (or other time interval) in the database.
The data used in this thesis is taken from three tables: the entity_docs table, the entity_players table, and the link_player_docs table. The chosen datasets provide information on patents and scientific publications from millions of sources. The data is used by R&D managers to select the relevant information depending on their requirements.
• entity_docs: This table holds the patents and scientific publications. It also provides information about each document, such as the title, abstract, metadata, time and date of insertion, last update, and so on.
• entity_players: This table provides information about institutions and expert profiles, including the address of each institution or expert, their country code, global position, and much more.
• link_players_docs: This table provides the connection between the two tables above. Figure 3.2 shows the relationships between the tables: the number of columns in each table and how the tables are connected.
The PostgreSQL database has a query performance issue. The database contains various normalized tables linked together by primary and foreign key constraints. When a complex query requiring many JOIN operations is performed to collect the relevant information from multiple tables, the JOINs take time to retrieve the information, which decreases the query performance. As an illustration, consider the complex query in Listing 3.1; the query syntax is explained as follows.
select
    z.player_id,
    z.player_type,
    z.player_sub_type,
    z.player_name,
    z.country_code,
    z.address,
    count(*) filter (where doc_type = 'SCIENCE') as nb_science,
    count(*) filter (where doc_type = 'PATENT') as nb_patent,
    y.doc_id,
    y.doc_type,
    y.title
from data_warehouse.link_players_docs x
join data_warehouse.entity_docs y on x.doc_id = y.doc_id and y.doc_type in ('PATENT', 'SCIENCE')
join data_warehouse.entity_players z on z.player_id = x.player_id and player_type = 'INSTITUTION'
where tsv_full_text @@ phraseto_tsquery('video game console')
group by z.player_id
order by nb_patent desc
limit 500;

Listing 3.1: Finding the total number of scientific publications and patents for the institutions
Executing the query retrieves all INSTITUTIONS that have published scientific publications and patents matching the given phrase ('video game console'). Listing 3.1 shows that retrieving data from multiple tables requires JOINs; this process is time consuming and decreases the performance. To find an alternative solution, one of the NoSQL databases is selected: a data model is designed, the data is migrated from the PostgreSQL database, and the queries mentioned in Section 1.1 of the introduction are implemented. By investigating the query performance of the existing database versus the NoSQL database, it is decided which of the two (PostgreSQL or the NoSQL database) provides higher performance.
As discussed, NoSQL databases are classified into four different categories. The criteria for selecting the NoSQL database in this thesis are based on important factors like high performance, marketability, reliability, and open source support. The most widely used databases are ranked in the DB-Engines ranking, which is based on aspects such as how widely a database is used, its frequency in job offers, and technical discussions [1]. For instance, Table 3.1 shows the ten most popular databases according to the DB-Engines 2019 ranking; it is clear that RDBMSs are still the most popular databases. But as Listing 3.1 showed, a large query that retrieves information from multiple tables suffers in query performance, and in such scenarios NoSQL databases are useful.
This chapter explains the requirements the NoSQL database must meet to work with the data extracted from the PostgreSQL database. We consider the set of features that should be integrated into the database: the NoSQL database is chosen such that it matches the features of the PostgreSQL database. The chosen database should have the following important features:
3. An important part of the database is the analysis of stored data, so the database needs to support many data analytics features. The NoSQL database should support query executions similar to those of the company's database.
[1] https://db-engines.com/en/ranking
There are various rich document-oriented databases; MongoDB and CouchDB are the most widely used [1]. The query execution speed of MongoDB is faster compared to CouchDB [Bha16]. CouchDB uses an elegant map-reduce syntax for querying the data [2], whereas MongoDB has its own query language, which is easy to learn for people with SQL knowledge, and additionally provides a map-reduce function [3]. MongoDB supports rich document-oriented full-text search and provides a very flexible schema (data model), so data is represented in a simple way and can be queried efficiently without any join operations. Figure 3.3 shows the functional and non-functional requirements and the techniques of MongoDB; the techniques connect the functional and non-functional system properties that MongoDB supports. Scanning queries, filtering, full-text search, analytics, and conditional writes are functional requirements; among these, queries, indexing, and analytics are our primary requirements for implementing the data in the MongoDB database. The query capabilities of MongoDB guarantee consistency: data is retrieved using a unique ID, and secondary indexes are employed to perform queries efficiently. The indexes reduce the number of scanned items, which provides high query performance. Full-text search is performed using text indexing in the database.
2. Analysis layer: For this thesis, the queries used for data analysis in MongoDB need features such as indexing, full-text search, sorting, matching, grouping documents, projection, and so on [SKC16].
The entity_players table provides information about organizations, experts, and research institutions. The link_player_docs table contains information about the connections between the tables. The schema in Figure 3.2 contains information about the number of columns and the connections between the tables.
Designing a MongoDB data model differs from designing a PostgreSQL data model; in this work, MongoDB does not need a schema design for data modeling. There are two design patterns for a MongoDB data model. The first option is migrating every database table into a separate collection in the target database, which is the pragmatic approach. The second option is embedding multiple tables into a single collection; embedding the related data as documents is simple. In this thesis, the data extracted from the PostgreSQL database is embedded into a single collection in the MongoDB database, since the extracted data consists of related information that can be embedded into one collection. The advantage of embedding the documents into one collection is that it avoids references and lookups; the embedded collection is faster in query execution, which results in high performance. However, due to the lack of joins there is data redundancy, which results in higher memory usage.
• Data migration: In this phase, the data is migrated by extracting it from the source database (PostgreSQL) and importing it into the target database (MongoDB) using the mongo shell.
• Data validation: The process of measuring the data quality after the migration. It is tested by simply counting the number of documents (rows and columns) in the target database (MongoDB) and comparing it to the source database (PostgreSQL): if the counts are the same in both the source (PostgreSQL) and the target (MongoDB), the data migration was successful. A sketch of such a check follows below.
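A minimal sketch of this validation step (the collection and table names follow the thesis's data model):

// number of migrated documents in MongoDB
db.doc.count()

// compared against the source table on the PostgreSQL side:
//   SELECT count(*) FROM data_warehouse.entity_docs;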
For the migration, the data is extracted and restructured from the PostgreSQL database, transformed, and loaded into the MongoDB database. Figure 3.7 shows the data migration process from the PostgreSQL database to MongoDB. The data is initially copied from the PostgreSQL database server in CSV (comma-separated values) format, transformed, imported into MongoDB, and validated. The CSV data is imported directly using the mongo shell; such data is known as pass-through data. The data is then checked for duplicates by querying it with the aggregation in Listing 3.2, and any duplicates found are deleted. The data migration is designed such that the data extracted from PostgreSQL matches the data in MongoDB, i.e., the MongoDB database contains the same amount of information, with the same number of rows as the source database (PostgreSQL).
db.doc.aggregate([
  // group all doc_id's together
  { $group: {
      _id: { doc_id: "$doc_id" },
      // collect documents that have the same doc_id and count the members of each group
      duplicate: { $addToSet: "$_id" },
      count: { $sum: 1 }
  } },
  // keep only the groups with more than one member
  { $match: {
      count: { "$gt": 1 }
  } },
  // sort in descending order
  { $sort: {
      count: -1
  } }
]);
The data is extracted from the PostgreSQL database as a CSV file and imported into MongoDB using the mongo shell. After the migration, the data is checked for duplicates, and duplicate documents are removed from the MongoDB database. MongoDB provides a flexible schema structure: the data is modeled as an embedded collection, denormalized and merged into a single collection. The data migration is carried out in three phases: planning, data migration, and data validation. The procedure involved in the data migration was discussed in this section.
{
  "_id": ObjectId("5bccf454219bda137b689c3f"),
  "player_id": 113654298,
  "doc_id": 191987233,
  "player_doc_link_type": "{APPLICANT}",
  "last_update": "2018-02-14 14:18:25.685435",
  "pos": "",
  "score_player_doc": 1,
  "date_inserted": "2018-02-14 14:18:25.685435",
  "doc_source": "PATSTAT",
  "doc_source_id": 449257868,
  "title": "HEATER",
  "country_code": "JP",
  "doc_timestamp": "2014-06-25 00:00:00",
  "language_code": "EN",
  "web_link": "http://worldwide.espacenet.com/searchResults?query=JP20140130564",
  "image_link": "",
  "publisher": "patent office (JP)",
  "series": "",
  "doc_type": "PATENT",
  "doc_sub_type": "",
  "tsv_title": "'heater':1",
  ....
}
• A collection name, which indicates where the desired documents are located in the database. First, the data of a single table (entity_docs) is migrated from PostgreSQL into a MongoDB collection (main_data). Second, the migration is performed by embedding the documents into a single collection of 23.7 million documents (entity_docs). The data is migrated in order to compare the query execution speed of the MongoDB and PostgreSQL databases: queries are performed both on a single table that involves no joins (entity_docs) and on interlinked tables that are embedded as a single collection in the MongoDB database. Both a single table and an embedded collection are migrated so as to investigate the speed of query executions involving no JOINs and JOINs, respectively.
• A querying method, which specifies how the data is investigated or retrieved from the MongoDB database. The aggregation pipeline is used for querying the documents; it is explained in detail in Section 3.4.1.1.
Query Structure
MongoDB has different query methods: queries using the find() method, the aggregation pipeline, and map-reduce. The MongoDB query structure is explained with an example in Listing 3.4, which illustrates the structure with a sample query that fetches up to 1000 documents whose title field is HEATER. The query is prefixed with db followed by the name of the collection to which the query is applied; the query command find() is used to find the documents fulfilling the desired output.
# structure of a query
db.Collection_name.Querycommand(Querydocument)
  .Projectiondocument()
  .Limitdocument()

# example
db.data.find({ 'title': 'HEATER' })
  .projection({})
  .limit(1000)
db.doc.aggregate([
  { "$group": { "_id": "$doc_type", "number_records": { "$sum": 1 } } }
]);

{
  "_id": "PATENT",
  "number_records": 798666
},
{
  "_id": "SCIENCE",
  "number_records": 201334
}
The aggregation pipeline consists of stages and transforms the documents as they pass through the pipeline. Now consider an example with multiple stages, Listing 3.7: the aggregation pipeline there is an array of expressions, and every expression is a stage. The stage operator states which operation is performed in that stage. The aggregation pipeline processes the documents through the pipeline, and each stage references the output of the preceding stages.
1. First stage: The pipeline starts with the $match operator, which finds all documents related to the particular field of interest. The pipeline executes the given text search ("video") via $match (a text search is only possible after text indexing; the text indexing process is discussed in Section 3.4.1.2). After the text search, the documents pass through the same stage with another $match operator, which finds all the patent documents.
2. Second stage: From the output of the $match stages, the pipeline passes into the $group stage. The experts (player_name) and their countries of origin are grouped using the $group operator, and the accumulator $sum generates the total number of records for the given query.
3. Third stage: $project is used for projecting the required columns; the fields to project from the output of the first and second stages are selected with the $project operator.
4. Fourth stage: The documents of the output are sorted in descending order of their number of records using the $sort operator. A sketch of the whole pipeline follows below.
  "number_records": 12
},
# 4
{
  "_id": {
    "player_name": "Mathew, Manu",
    "country_code": "KR"
  },
  "number_records": 10
},
.....
• In an aggregation pipeline, the stage that includes $text must come first in the pipeline. This increases query performance: the documents are scanned for the given text first, which reduces the scanning time for the other pipeline operators.
• $text does not support the operators $or and $not in a single expression.
db.getCollection("doc").createIndex({ "title": "text" })
In this chapter, the factors for selecting the MongoDB database were discussed. The database schema is designed as a single collection embedding multiple tables. This brings the advantage of storing related information in one place, and having no joins can increase query performance. However, the approach has a drawback: data duplication. The duplicate data is removed from the database after checking for redundancy with an aggregation pipeline. The data in the MongoDB database is stored in BSON format. MongoDB has its own query language: find() is used for simple queries, and there are the aggregation pipeline and map-reduce. For this thesis the aggregation pipeline is used, since it suits large complex queries and queries are easy to develop with it. The aggregation pipeline proceeds through stages, where the output of each stage is the input of the next. The stages and some of the operators used in the thesis were explained with an example query, and the use of text indexing and its restrictions when querying data with text search were discussed.
4. Design and Implementation
In Chapter 3, we discussed the data warehouse structure of the PostgreSQL database, the data selection, the concept of data migration, the MongoDB data model, and the concept of the aggregation pipeline in MongoDB using simple and complex queries.
This chapter provides the detailed design and implementation procedure. Major steps like tool selection and data analysis procedures are described. We first discuss the design process and the core processes of the MongoDB database, because it is important to know the basic operations of the MongoDB database; these operations are needed to interact with the MongoDB server, which runs on JavaScript. Then we discuss the tools selected to interact with the data in the database. To show the query output and performance, we use a user-friendly interactive web interface, the R framework Shiny. Lastly, we provide the queries used and their results for evaluation purposes.
1. Storage layer
2. Processing layer
3. Management layer
The storage layer consists of the data in the collections. The processing layer operates on the storage layer: all aggregations, indexing, and query executions happen in this layer. Finally, the management layer is the high-level part; it consists of the developed application that coordinates between the database and the database users. Figure 4.1 illustrates all three layers.
MongoDB stores the data of a collection in BSON format. All query operations are done in NoSQLBooster, a smart GUI that lets users inspect query execution details: the time taken to retrieve data, the number of records retrieved, and the query execution process. Based on these queries, an interactive web application for business users is developed using the R framework Shiny.
1. mongod: The mongod command is one of the main components of the core processes. It starts the server, manages requests, and also supports data format management. It provides core options such as the version, a configuration file, verbosity, the port number, and many more, as sketched below.
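For example, the server can be started from the command line (the data directory path is an assumption):

mongod --dbpath /data/db --port 27017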
[1] https://docs.mongodb.com/manual/reference/program/
4.3.1 NoSQLBooster
There are numerous MongoDB management tools available in the data world [2]. These tools help users interact with the data in the database through a smart GUI (graphical user interface), which makes productivity and management tasks easy for developers and administrators.
In this thesis, NoSQLBooster is selected for query evaluation in the MongoDB database (Figure 4.2).
Version            NoSQLBooster 4.7.0
MongoDB version    4.0
Machine type       Windows 64-bit
Size               36.8 MB
Downloaded on      13.08.2018

Table 4.1: MongoDB version and NoSQLBooster specifications [3]
NoSQLBooster offers built-in language services and a vast number of built-in snippets that help in writing queries [4]; whenever query script writing starts, suggestions pop up as you type.
Features of NoSQLBooster:
1. Query functions can be written in SQL for the MongoDB database, including JOINs, expressions, aggregations, and functions.
The first part is the frontend development, which provides the user interface. It enables the user to interact with the data even without any programming knowledge: the user can retrieve data just by text search and some simple clicks on dialogue boxes and tabs. The second part is the backend development, where the queries are implemented in R.
Figure 4.3 shows the Shiny interface, which has four windows. Window 1 is the code editor in which the R programs (ui.R & server.R) are written. Window 2 is the console where the developed program is executed. Window 3 is the global environment, which shows the history of the commands used, the connection to the MongoDB database, and many more details [8]. Window 4 contains information about the various R packages and libraries for connecting to and interacting with the data.
In this thesis, the mongolite and jsonlite libraries are used to develop the interactive web application for MongoDB; both libraries are available by default in the Shiny R package list. To work with the MongoDB data, we first need to connect to the MongoDB database. Before communicating, the server must be started using the mongod command described in Section 4.2. Consider the queries used for developing the web application in Listing 4.1 and Listing 4.2; the steps involved in developing the application are discussed below.

[8] http://www.rstudio.com/shiny/
library(shiny)
library(mongolite)
library(jsonlite)
limit <- 10L

# Define the UI for the MongoDB text search application
ui <- fluidPage(
  # Application title
  titlePanel("Mongodb text search Data"),
  sidebarLayout(
    sidebarPanel(
      textInput("query_id", "Title text", ""),
      selectInput("doc_id", "document", choices = c("PATENT", "SCIENCE")),
      actionButton("act", "output")
    ),
    # Show the MongoDB text search output in the main panel
    mainPanel(
      tabsetPanel(
        tabPanel("INSTITUTE", dataTableOutput('table1')),
        tabPanel("EXPERT", dataTableOutput('table2'))
      )
    )
  )
)
To develop a web application using Shiny, the mongolite and jsonlite packages must be installed; they connect the Shiny interface with MongoDB and allow interaction with the data.
In Listing 4.1, the user interface is defined. The page is titled "Mongodb text search Data". In the sidebar panel, a text field for the search text and a selector for the document type are defined, and an output button triggers the query. In the main panel, two tabs are provided for retrieving the data related to institutions or expert profiles; the output of the query is displayed there as a table.
# defining the server-side function
server <- function(input, output) {
  # connecting to the MongoDB server
  mdb <- mongo(collection = "doc", db = "datasample",
               url = "mongodb://localhost:27017/?socketTimeoutMS=1200000")

  # Reactivity
  INSTITUTION <- eventReactive(input$act, {

    # Text indexing
    mdb$index(toJSON(list("title" = "text"), auto_unbox = TRUE))

    # Applying the query
    q <- paste0('[ { "$match": { "$text": { "$search": "', input$query_id, '" } } },
      { "$match": { "doc_type": "', input$doc_id, '" } },
      { "$match": { "player_type": "INSTITUTION" } },
      { "$group": {
          "_id": { "doc_type": "$doc_type" },
          "number_records": { "$sum": 1 },
          "player_name": { "$first": "$player_name" },
          "title": { "$first": "$title" },
          "player_type": { "$first": "$player_type" },
          "country_code": { "$first": "$country_code" }
      } },
      { "$sort": { "number_records": -1 } },
      { "$limit": 10 }
    ]')

    jsonlite::validate(q)
    query <- mdb$aggregate(q, '{ "allowDiskUse": true }')
  })

  # Reactivity
  EXPERT <- eventReactive(input$act, {

    # Applying the query
    q <- paste0('[ { "$match": { "$text": { "$search": "', input$query_id, '" } } },
      { "$match": { "doc_type": "', input$doc_id, '" } },
      { "$match": { "player_type": "EXPERT" } },
      { "$group": {
          "_id": { "doc_type": "$doc_type", "title": "$title",
                   "player_name": "$player_name", "player_type": "EXPERT",
                   "country_code": "$country_code" },
          "number_records": { "$sum": 1 }
      } },
      { "$sort": { "number_records": -1 } },
      { "$limit": 500 }
    ]')

    jsonlite::validate(q)
    query <- mdb$aggregate(q, '{ "allowDiskUse": true }')
  })

  # defining the outputs as tables
  output$table1 <- renderDataTable({
    INSTITUTION()
  })
  output$table2 <- renderDataTable({
    EXPERT()
  })
}
shinyApp(ui = ui, server = server)
At the backend (Listing 4.2 on page 45), the input, output, and query operations are implemented. Initially, SHINY is connected to the MongoDB database. Then the eventReactive function is used; this reactive command makes the application respond to calls the user triggers through the user interface. For the text search on the database, a text index is created on the title field, and the aggregation query is built. The aggregation process is explained in Section 3.4.1.1 on page 35. Finally, the output is displayed as a table in the application (Figure 4.4 on the following page).
For instance, suppose the user needs information related to a particular field of interest. The query scans the related documents in a collection and returns the results. The process of query execution by a user is displayed in Figure 4.5 on the next page.
The user enters the text query in the text field and selects the document type. The number of organizations or experts for the selected document type is computed and displayed to the user. With the web application, the user can fetch information according to the requirements. For instance, suppose the user needs the number of scientific publications per organization for a particular field of interest. The user selects the scientific documents of the organization by entering the text in the text search field. The output of the query displays the number of scientific publications for every organization matching the given text.
To keep queries fast, aggregation pipeline optimization is performed. Each stage passes its output to the next stage of the pipeline, so properly ordered stages in an aggregation reduce the execution time. Projecting only the required fields instead of the whole collection increases the speed of the pipeline operation.
When complex aggregation queries are applied to the database, large pipelines lower the execution speed. To improve the performance, proper aggregation pipeline optimization is required.
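To make this ordering rule concrete, the following sketch (a hedged illustration, not one of the thesis listings) contrasts two pipelines over the mergeddocs collection and the field names used in this chapter; only the stage order differs, and the exact savings will depend on the data.
// Less efficient: $group runs over every document in the collection,
// and the doc_type filter is applied only afterwards.
db.mergeddocs.aggregate([
  { "$group" : { "_id" : { "player_name" : "$player_name", "doc_type" : "$doc_type" },
      "number_records" : { "$sum" : 1 } } },
  { "$match" : { "_id.doc_type" : "PATENT" } }
]);

// More efficient: $match first shrinks the input of every later stage,
// and $project passes on only the field that $group actually needs.
db.mergeddocs.aggregate([
  { "$match" : { "doc_type" : "PATENT" } },
  { "$project" : { "player_name" : 1 } },
  { "$group" : { "_id" : "$player_name", "number_records" : { "$sum" : 1 } } }
]);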
The query in Listing 4.3 on the next page is used to explain the optimization technique. It counts the PATENT documents and fetches the list of organizations, experts, and country codes that match the query for a given text. The execution process of each stage is shown.
The execution plan in Listing 4.3 on the following page shows how each stage passes through the pipeline. The pipeline first matches the text search, which runs against the text-indexed field, and then filters the documents whose doc type is PATENT. The filtered output of the matching stages enters the group operator. Based on the output of the match operator, the documents are grouped by player name and country code, and the number of records is computed. Then the fields described in the group stage are projected, the number of records is sorted in descending order, and the pipeline finishes after limiting the documents.
          "$diacriticSensitive" : false
        }
      }
    ]
  },
  "winningPlan" : {
    "stage" : "FETCH",
    "filter" : {
      "doc_type" : {
        "$eq" : "PATENT"
      }
    },
    "inputStage" : {
      "stage" : "TEXT",
      "indexPrefix" : {

      },
      "indexName" : "title",
      "parsedTextQuery" : {
        "terms" : [
          "video"
        ],
        "negatedTerms" : [ ],
        "phrases" : [ ],
        "negatedPhrases" : [ ]
      },
      "textIndexVersion" : 3,
      "inputStage" : {
        "stage" : "TEXT_MATCH",
        "inputStage" : {
          "stage" : "FETCH",
          "inputStage" : {
            "stage" : "OR",
            "inputStage" : {
              "stage" : "IXSCAN",
              "keyPattern" : {
                "_fts" : "text",
                "_ftsx" : 1
              },
              "indexName" : "title",
              "isMultiKey" : true,
              "isUnique" : false,
              "isSparse" : false,
              "isPartial" : false,
              "indexVersion" : 2,
              "direction" : "backward",
              "indexBounds" : {

              }
            }
          }
        }
      }
    }
  },
  "rejectedPlans" : [ ]
}
},
{
  "$group" : {
    "_id" : {
      "player_name" : "$player_name",
      "country_code" : "$country_code"
    },
    "number_records" : {
      "$sum" : {
        "$const" : 1
      }
    }
  }
},
{
  "$project" : {
    "_id" : true,
    "player_name" : "$player_name",
    "country_code" : "$country_code",
    "number_records" : "$number_records"
  }
},
{
  "$sort" : {
    "sortKey" : {
      "number_records" : -1
    },
    "limit" : NumberLong("500")
  }
}
],
"ok" : 1
}
The $match stage should always come first, as it filters the PATENT documents early instead of letting later stages scan the whole collection. If the query involves a text search, the text search must be queried in the first stage. After the pipeline passes through the matching stages, grouping the documents and projecting only the required fields reduces the execution time, which leads to an increase in performance.
In this chapter, the design of the process flow, the MongoDB core process, and the implementation process were described. Furthermore, we developed a simple approach to query data using different tools. The tools used in the implementation are the mongo shell (for data import), NoSQLBooster (for evaluating the performance), and the R framework SHINY (for developing a prototype). They made data retrieval easy, and the web application helps users interact with the data even without any programming knowledge. The interactive web application is well suited for people such as business managers, clients, and R&D managers.
5. Evaluation
Different implementation tools and procedures were discussed in the previous chapter. This chapter focuses entirely on the evaluation; it includes the results of the data migration, the query performance in terms of query execution speed, and a comparison at the end. To do so, we first migrate the data from the PostgreSQL database to the MongoDB database. Then, we compare the query performance of the MongoDB and PostgreSQL databases.
5.2 Experiments
In this section, we discuss the experiments implemented on MongoDB and show their results. Apart from our predefined queries, we perform several other experiments to compare the performance of MongoDB and PostgreSQL with simple and complex queries.
The data is migrated using the MongoDB shell interface. The migration is performed using the command shown in Listing 5.1. At first, the path of the locally installed MongoDB server is specified. Then, the mongoimport tool is invoked with the name of the database, the collection name, the type of file (CSV), the path where the file is stored, and the headerline option, which tells the server that the CSV file has a header.
# syntax for data migration using mongoimport in the mongo shell
@<mongoshell>: mongoimport --db <databaseName> --collection <collectionName> --type CSV --file <filepath> --headerline

# Data migration using mongoimport in the mongo shell
C:\Programme\mongodb\Server\4.0\bin> mongoimport --db datadocuments --collection data --type csv --file E:\mongodb\data-55124365.csv --headerline
We run the experiments with tables that vary in size. Initially, we migrated a large embedded table with 23.7 million records. Due to its size, it is difficult to run the queries efficiently on the local machine. So, in order to investigate the query performance, another table with the same columns but fewer records is migrated. Finally, a single table (entity docs) from the PostgreSQL database with 100,000 records is migrated. This single table helps to investigate the execution speed without joins.
To compare the query performance of both databases (PostgreSQL and MongoDB), the data extracted from the PostgreSQL tables is first imported into a PostgreSQL database on the local machine in normalized form. The data is imported into the PostgreSQL database from CSV (comma-separated values) files using the COPY command. Before the import, the table is created with the CREATE TABLE command, listing the column names and their data types. After creating the table, the data is imported using the command shown in Listing 5.2. The same procedure is followed to import the other tables used in the thesis (entity players, link players docs).
-- CSV data importing syntax
COPY <databaseName.tableName> FROM <filepath> DELIMITER <'type of delimiter'> CSV HEADER;

-- Data imported using the SQL shell for a single table.
COPY datawarehouse.entity_docs FROM 'C:\data\db\mongocsv\entity_docs.csv' DELIMITER ',' CSV HEADER;
According to the tasks defined in Section 1.1 on page 2, we need to use text search to fetch the relevant information from the database. Therefore, the tables are indexed using GIN to provide text search indexing. A detailed explanation of indexing is given in Section 2.2.3 on page 9. The GIN index is created on the text field title, see Listing 5.3 on the next page.
CREATE INDEX fts_idx ON
  public.entity_docs
  USING gin (tsv_title)
  TABLESPACE pg_default;
The time taken to import the data into the MongoDB and PostgreSQL databases is shown in Table 5.3.
  player_id : 1,
  doc_type : 1, doc_id : 1, title : 1, country_code : 1
})
// output of the query
{
  "_id" : ObjectId("5be1b70674042f029c0de816"),
  "player_id" : 104364024,
  "doc_id" : 29715188,
  "title" : "One touch voice memo",
  "country_code" : "US",
  "doc_type" : "PATENT"
},

{
  "_id" : ObjectId("5be1b70674042f029c0de817"),
  "player_id" : 104364024,
  "doc_id" : 28942127,
  "title" : "Multiplexing VoIP streams for conferencing and selective playback of audio streams",
  "country_code" : "US",
  "doc_type" : "PATENT"
},
......
The tasks defined in Section 1.1 involve text search for retrieving data from the MongoDB database. In Section 3.4.1.2, text indexing is performed (Listing 3.9) on every collection used in the thesis. An index on the id field is present by default in every collection. In the aggregation pipeline, we use text search (with the focus on the document title) for retrieving the relevant data on patents and scientific publications of all organizations and experts. Therefore, a text index is created on the single field title that is required for the text search.
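As a brief sketch (using the collection name that appears in Listing 5.5 below), such an index can be created and inspected in the mongo shell as follows; note that MongoDB allows at most one text index per collection.
// Create the single-field text index on title.
db.single_table.createIndex({ "title" : "text" });

// The default _id index and the new text index can be verified with:
db.single_table.getIndexes();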
Retrieve all the patents and scientific publications whose title contains the word 'complex':
After creating the text index on the title field, the query is executed using text search on the MongoDB database. The query retrieves all the patents and scientific publications whose title contains the word 'complex'. It returns all the documents related to the word complex (736 documents) and projects six columns, as shown in Listing 5.5.
// Text search input on the title column
db.single_table.aggregate([
  { "$match" : { "$text" : { "$search" : "complex" } } },
  { "$project" : { "title" : "$title", "country_code" : "$country_code",
      "doc_type" : "$doc_type",
      "date_inserted" : "$date_inserted", "doc_source" : "$doc_source",
      "_id" : 1 } },
]);
// Output of the text search
{
  "_id" : ObjectId("5c8155b5f04e5827aad72722"),
  "doc_sub_type" : "{ARTICLE}",
  "country_code" : "",
  "doc_type" : "SCIENCE",
  "date_inserted" : "2017-01-10 09:41:01.498321",
  "doc_source" : "CROSSREF"
},

{
  "_id" : ObjectId("5c8155b1f04e5827aad6dee8"),
  "doc_sub_type" : "{ARTICLE}",
  "country_code" : "",
  "doc_type" : "SCIENCE",
  "date_inserted" : "2018-12-20 08:59:47.251552",
  "doc_source" : "CROSSREF"
},
....
Retrieve all the patents related to VIDEO and list all the organizations and experts:
The query in Listing 5.6 is implemented as an aggregation pipeline. It retrieves all the documents (230 documents) whose title contains the word VIDEO. The output lists the matching patents of the organizations and experts from various countries.
// Text search input on the title column, with doc_type: PATENT and
// player as INSTITUTE and EXPERT
db.mergeddocs.aggregate([
  { "$match" : { "$text" : { "$search" : "VIDEO" } } },
  { "$match" : { "doc_type" : "PATENT" } },

  { "$project" : { "player_name" : "$player_name", "player_type" : "$player_type",
      "doc_type" : "$doc_type", "country_code" : "$country_code",
      "number_records" : "$number_records", "_id" : 1 } },
  { "$sort" : { "number_records" : -1 } },
]);

// Ranking output
{
  "_id" : ObjectId("5c7d939dcd1c8a2a36311159"),
  "player_name" : "L'azou, Yves",
  "player_type" : "EXPERT",
  "doc_type" : "PATENT",
  "country_code" : "FR"
},
{
  "_id" : ObjectId("5c7d93a9cd1c8a2a3631bf6c"),
  "player_name" : "Cosson, Laurent",
  "player_type" : "INSTITUTION",
  "doc_type" : "PATENT",
  "country_code" : "FR"
},
...
Retrieve all the organizations and experts related to 'SERVICE' and rank them by number of patents:
In the query shown in Listing 5.7, the aggregation pipeline goes through several stages. The query ranks the number of patents and scientific publications in descending order for all organizations and experts. The output projects all the patents and scientific publications of the organizations and experts from various countries whose title contains the word 'SERVICE'.
// Text search input on the title column.
// player_type can be selected as EXPERT or INSTITUTION and ranked by
// number of PATENT and number of SCIENCE documents.
db.mergeddocs.aggregate([
  { "$match" : { "$text" : { "$search" : "SERVICE" } } },
  { "$match" : { "doc_type" : { "$in" : [ "PATENT", "SCIENCE" ] } } },
  {
    // group to get the list of players and rank them by the number of
    // patents and scientific publications
    "$group" : { "_id" : { "player_type" : "$player_type",
        "country_code" : "$country_code" },
      "number_records" : { "$sum" : 1 }
    }
  },
  { "$project" : { "player_type" : "$player_type", "player_name" : "$player_name",
      "country_code" : "$country_code",
      "number_records" : "$number_records", "_id" : 1 } },
  { "$sort" : { "number_records" : -1 } },
  { "$limit" : 100000 }
]);
// output of the given query
{
  "_id" : {
    "player_type" : "EXPERT",
    "player_name" : "Eidloth, Rainer",
    "country_code" : "DE",
    "doc_type" : "PATENT",
    "doc_source" : "PATSTAT",
    "doc_source_id" : 339191463
  },
  "number_records" : 67
},

{
  "_id" : {
    "player_type" : "EXPERT",
    "player_name" : "Artault, Alexandre",
    "country_code" : "FR",
    "doc_type" : "PATENT",
    "doc_source" : "PATSTAT",
    "doc_source_id" : 334075617
  },
  "number_records" : 28
},
.....
Table 5.4: Queries for performance comparison from PostgreSQL and MongoDB
For queries 4 and 7, retrieving all the patents and scientific publications for all organizations and experts requires a JOIN operation in the PostgreSQL database. In the MongoDB database, the data is embedded into a single collection, so these queries involve no joins and the data is retrieved faster. The execution speed of MongoDB is up to 50% higher than that of the PostgreSQL database.
For queries 5 and 6, the query executes quickly on both databases because of text indexing. Indexing limits the number of documents scanned: the query scans only the documents related to the given text. For the datasets with 10,000 and 50,000 rows, the difference in performance between the databases is small. In Figure 5.3, data retrieval in MongoDB is up to 50% faster than in the PostgreSQL database. In Figure 5.4, the performance of query 5 differs only slightly, whereas for query 6 there is a large difference in query performance between the databases. This is because the queries are executed on the local machine, whose behavior is not representative for such a large dataset. Nevertheless, across all datasets it is clearly evident that the MongoDB database is faster than the PostgreSQL database for the given queries.
For each dataset, the number of rows returned by each query in the PostgreSQL and MongoDB databases is shown in Table 5.6. Since both databases support stemming, each query should return the same number of rows. The output row counts confirm that the datasets used in the PostgreSQL and MongoDB databases are the same.
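As a small sketch of how such a cross-check can be done on the MongoDB side, a $count stage can be appended after a query's match stage; the collection and search term below follow Listing 5.5, and the PostgreSQL counterpart would simply wrap the query in SELECT COUNT(*).
// Count the rows a text search returns, for comparison with the
// corresponding PostgreSQL row count.
db.single_table.aggregate([
  { "$match" : { "$text" : { "$search" : "complex" } } },
  { "$count" : "rows_returned" }
]);
// -> { "rows_returned" : 736 } for the 'complex' query of Listing 5.5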
5.4 Discussion
The decision to select a NoSQL database depends on the requirements. For instance, if a company needs to manage relationships between huge datasets, graph databases are a better fit, for example for cooperation networks between organizations or between the authors of scientific publications.
In this thesis, we used the MongoDB database. One of its main advantages is the flexible schema: the data model is easy to maintain and modify in an ever-changing environment.
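As a hedged sketch of what this flexibility means in practice (the document contents are invented for illustration; the field names follow the thesis's data model), documents with different field sets can live in the same collection without any schema migration:
// Two documents with different fields in one collection; adding the
// extra nested "meta" attribute requires no ALTER TABLE-style migration.
db.doc.insertMany([
  { player_name : "Example Org", player_type : "INSTITUTION",
    doc_type : "PATENT", country_code : "DE" },
  { player_name : "Example Expert", player_type : "EXPERT",
    doc_type : "SCIENCE", meta : { source : "example" } }
]);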
In Section 5.3, the query performance of the databases is evaluated. On the single table, the query performance of PostgreSQL and MongoDB is almost the same. For the embedded collection (multiple tables in the PostgreSQL database), MongoDB shows a clear dominance for the given datasets (Figure 5.1, Figure 5.2, Figure 5.3, and Figure 5.4). However, the results may vary for data that contains several millions of records.
To work with several million records, MongoDB provides high scalability: the data is sharded over multiple machines, which facilitates working with large datasets [1]. In the case of PostgreSQL, there is no native sharding technique for distributing the data across the nodes of a cluster [1]. Using other horizontal scaling techniques, such as manual sharding or a sharding framework, can lead to the loss of important relational abilities, data integrity, and ACID properties [1].

[1] https://www.mongodb.com/compare/mongodb-postgresql
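A minimal sketch of native sharding in the mongo shell follows; it assumes a running sharded cluster, and the database and collection names are the placeholders used earlier in this thesis.
// Enable sharding for the database, then shard the collection on a
// hashed key so documents spread evenly across the shards.
sh.enableSharding("datasample");
sh.shardCollection("datasample.doc", { "_id" : "hashed" });
sh.status();   // inspect how chunks are distributed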
6. Related Work
“Study the past if you would define the future.”
-Confucius
In this chapter, we discuss related work that is similar to ours in implementing MongoDB database operations and that also describes the performance of the MongoDB database.
Tim Mohring investigates the Tinnitus database project [Moh16]. The Tinnitus database contains information such as patient symptoms and ideal treatment methods. It is based on the relational MySQL database, which has some disadvantages that can lead to unacceptable errors in case of misuse. The author examines different NoSQL databases to overcome the problems that occur with the MySQL database and concludes that the performance of a document-oriented database is high when retrieving data that would require joins over multiple tables in the relational database. He used MongoDB for the practical implementation and evaluated the results in terms of performance and schema validation. Due to the flexible schema, the easy query capabilities, and the lack of joins, the query performance of MongoDB is higher than that of the relational database [Moh16]. In his work, he concluded that the MongoDB database is superior for queries that involve JOIN operations in the MySQL database. Similarly, in this thesis we implemented various complex queries in the MongoDB database and compared the resulting performance with the query performance of the company's PostgreSQL database. The query performance of MongoDB is higher than that of the PostgreSQL database.
Ningthoujam et al. [NCP+14] designed a MongoDB data model for ethnomedicinal plant data. In the article, the authors describe the data modeling patterns of MongoDB. There are two options for designing a data model in MongoDB: embedding, or referencing through a unique identifier. With embedding, the related data is merged into a single collection; with referencing, the collections are connected through a unique identifier. The authors used both options, depending on the desired data representation. The datasets were imported with the mongoimport tool and tested for scalability, flexibility, extensibility, and query execution speed. The authors conclude that the ultimate decisions on a MongoDB data model are based on the access pattern: MongoDB queries depend on the data usage pattern, indexing limits the number of scanned documents, and the use of indexes when developing a query increases the query execution speed, which results in high performance [NCP+14]. In our approach, we implemented the similar idea of embedding the tables into a single collection and used the aggregation pipeline with text indexing. The text index fetches only the related documents, which reduces the scanning time and results in high query performance.
Parker et al. [PPV13] compared NoSQL and SQL databases. The authors implemented different NoSQL databases and compared key-value operations, namely storage, reads, writes, and deletes, on Microsoft SQL Server Express, MongoDB, CouchDB, RavenDB, Cassandra, Hypertable, and Couchbase. They evaluated the results of all operations for 100,000 records and performed data retrieval queries for 10, 50, 100, 1,000, 10,000, and 100,000 records. Comparing the query response time of all databases, they concluded that Couchbase and MongoDB are the fastest in retrieving data for the given datasets. In our work, we migrated datasets of 23.7 million, 1 million, and 100 thousand records and compared the query performance with the PostgreSQL database. The MongoDB database is superior in fetching data from the large datasets compared to PostgreSQL.
Chickerur et al. [CGK15] used an airline database with 1,050,000 records to compare a relational database with the MongoDB database. First, the authors migrated the data from the relational database to MongoDB. Then queries were executed on the MongoDB database and their performance was compared with the MySQL database. They implemented various query operations and concluded that MongoDB provides more efficient performance for a big data application than the MySQL database. Similarly, in our thesis we migrated the data from the PostgreSQL database to MongoDB and executed various queries after the migration. The MongoDB database provides efficient query performance for the big datasets extracted from the PostgreSQL database.
7. Conclusion and Future Work
“In three words I can sum up everything I’ve learned about data: it goes on.”
-Robert Frost
7.1 Summary
In this thesis, we compared the performance of the MongoDB database with the PostgreSQL database. To evaluate the performance, the data was migrated to the MongoDB database.
In this work, we provide an overview of the existing PostgreSQL database, show its structure, and address the issues that arose in using it. The PostgreSQL database contains multiple tables. The tables contain information on patents and scientific publications of organizations and experts from different countries, taken from different data sources.
To retrieve the relevant data from these tables, JOIN operations are used. The use of multiple JOINs, especially in complex queries, lowers the query execution speed and results in low performance. To overcome these issues, a NoSQL database was selected.
The MongoDB database was selected because it supports the required characteristics of the existing PostgreSQL setup. Initially, the data was migrated from the PostgreSQL database to MongoDB. The data model of MongoDB is flexible and easy to design. For evaluating the performance, the aggregation pipeline was used, and the operations involved in the aggregation pipeline were discussed.
As a prototype, a user-friendly interactive web application was developed using the R framework SHINY.
For our experiments, we used four data sets of different sizes. We ran all the queries on the local machine (Windows version 10).
7.2 Conclusion
We examined the query performance on small and large datasets. The results of this work are practically relevant to developers who use these databases. We checked the results of each query in the PostgreSQL and MongoDB databases. We also provide a query optimization technique that helps users develop aggregation pipelines in a proper way, which can reduce the query execution time. The results are provided for different queries, ranging from simple queries to large complex queries, on both a single table and multiple tables.
• Queries 1, 2, and 3 are performed on the single table (see Table 5.4), where no join operations are required. The PostgreSQL query performance for a single table is almost the same as that of the MongoDB database.
• To perform queries 5, 6, 7, and 8 (see Table 5.4), multiple tables are denormalized to create an embedded collection in the MongoDB database. In PostgreSQL, queries that involve multiple tables take a long response time; MongoDB shows a clear dominance for complex queries that involve JOIN operations.
• For complex queries where join operations are performed, the query performance of PostgreSQL is up to 50% lower compared to the MongoDB database.
Our experimental results make clear that MongoDB dominates in query execution speed on both the single collection and the embedded collection, showing that the MongoDB database provides higher performance than the PostgreSQL database. In conclusion, the thesis shows that the use of a NoSQL database (MongoDB) is beneficial, especially in the case of large complex queries where JOIN operations are involved. However, different database sizes can impact the performance.
Listing A.1: Query covering some field of interest and get a list of relevant documents
# shiny = (ui.R, server.R)
# libraries used for connecting to the MongoDB database
library(shiny)
library(mongolite)
library(jsonlite)
# Define the UI for the mongodb text search application
ui <- fluidPage(

  # Application title
  titlePanel("Mongodb Data"),
  sidebarLayout(
    sidebarPanel(
      textInput("title_id", "Title text", "")
    ),

    # Show the mongodb text search output in the main panel
    mainPanel(
      dataTableOutput("mydata")
    ))
)

server <- function(input, output) {
  mon <- mongo(collection = "documents", db = "entity_documents",
               url = "mongodb://localhost:27017")

  title_search_result <- reactive({
    # Defining the mongodb index
    mon$index(toJSON(list("title" = "text"), auto_unbox = TRUE))
    text <- input$title_id

    # text search output
    mon$find(toJSON(list("$text" = list("$search" = text)), auto_unbox = TRUE))
  })
  output$mydata <- renderDataTable({
    title_search_result()
  })
}
shinyApp(ui = ui, server = server)
Listing A.2: Query covering some field of interest and get a list of organizations
ranked by number of patents, scientific publications matching the query
# shiny = (ui.R, server.R)
# libraries used for connecting to the MongoDB database
library(shiny)
library(mongolite)
library(jsonlite)
# Define the UI for the mongodb text search application
ui <- fluidPage(
  # Application title
  titlePanel("Mongodb text search Data"),
  sidebarLayout(
    sidebarPanel(
      textInput("query_id", "Title text", ""),
      selectInput("doc_id", "document", choices = c("PATENT", "SCIENCE")),
      actionButton("act", "output")
    ),
    # Show the mongodb text search output in the main panel
    mainPanel(
      tabsetPanel(
        tabPanel("INSTITUTE", dataTableOutput('table1')),
        tabPanel("EXPERT", dataTableOutput('table2'))
      )
    ))
)
Listing A.3: Query for an organization and get a list of collaborators, i.e., organizations with common documents; rank them by number of common patents, number of common scientific publications at the user interface. The server side is shown in Listing A.4
        } },
      { "$sort" : { "number_records" : -1 } },
      { "$limit" : 10 }
    ]')
    jsonlite::validate(q)
    query <- mdt$aggregate(q)
  })
  EXPERT <- eventReactive(input$act, {
    mdt$index(toJSON(list("title" = "text"), auto_unbox = TRUE))
    q <- paste0('[ { "$match" : { "$text" : { "$search" : "', input$query_id, '" } } },
      { "$match" : { "doc_type" : "', input$doc_id, '" } },
      { "$match" : { "player_type" : "EXPERT" } },
      { "$project" : { "player_name" : 1, "title" : 1, "player_type" : 1,
          "country_code" : 1 } },
      { "$group" :
        { "_id" : { "player_name" : "$player_name" },
          "number_records" : { "$sum" : 1 },
          "player_name" : { "$first" : "$player_name" },
          "player_type" : { "$first" : "$player_type" },
          "country_code" : { "$first" : "$country_code" }
        } },
      { "$sort" : { "number_records" : -1 } },
      { "$limit" : 10 }
    ]')
    jsonlite::validate(q)
    query <- mdt$aggregate(q)
  })
  output$table1 <- renderDataTable({
    INSTITUTION()
  })
  output$table2 <- renderDataTable({
    EXPERT()
  })
}
shinyApp(ui = ui, server = server)
Listing A.4: Query for an organization and get a list of collaborators, i.e.,
organizations with common documents; rank them by number of common patents,
number of common scientific publications at server side
-- Data selection from a single table
SELECT *
FROM public.entity_docs
where doc_type in ('SCIENCE')

-- Data selection from a single table
SELECT *
FROM public.entity_docs
where tsv_title
  @@ to_tsquery('Motion');
-- Data selection from multiple tables
select
  *
from public.link_player_doc x
join public.entity_docs y on x.doc_id = y.doc_id and y.doc_type in ('SCIENCE')
join public.entity_player z on z.player_id = x.player_id
-- Data selection from multiple tables
select
  z.player_id,
  z.player_type,
  z.player_sub_type,
  z.player_name,
  z.country_code,
  z.address
from public.link_player_doc x
join public.entity_docs y on x.doc_id = y.doc_id and y.doc_type in ('PATENT', 'SCIENCE')
join public.entity_player z on z.player_id = x.player_id and player_type = 'INSTITUTION'
where tsv_fulltext @@ to_tsquery('service')
group by z.player_id
-- Data selection from multiple tables
select
  z.player_id,
  z.player_type,
  z.player_sub_type,
  z.player_name,
  z.country_code,
  z.address,
  count(*) filter (where doc_type = 'SCIENCE') as nb_science,
  count(*) filter (where doc_type = 'patent') as nb_patent
  -- y.doc_id,
  -- y.doc_type,
  -- y.title
from public.link_player_doc x
join public.entity_docs y on x.doc_id = y.doc_id and y.doc_type in ('PATENT', 'SCIENCE')
join public.entity_player z on z.player_id = x.player_id and player_type = 'INSTITUTION'
where tsv_fulltext @@ to_tsquery('Motion')
-- Data selection from multiple tables
select
  *
from public.link_player_doc x
join public.entity_docs y on x.doc_id = y.doc_id and y.doc_type in ('PATENT', 'SCIENCE')
join public.entity_player z on z.player_id = x.player_id
where x.player_doc_link_type in ('{INVENTOR}')
-- Data selection from multiple tables
select
  z.player_sub_type,
  z.player_name,
  z.player_type,
  y.doc_source,
  z.date_inserted,
  y.meta,
  y.country_code
from public.link_player_doc x
join public.entity_docs y on x.doc_id = y.doc_id and y.doc_type in ('SCIENCE')
join public.entity_player z on z.player_id = x.player_id and z.player_sub_type = ('{STARTUP,COMPANY}')
where tsv_fulltext @@ to_tsquery('behavior')
// Data selection from a single table
db.documents.find({})
  .projection({})
  .limit(1000000)

db.doc_table.find({ 'doc_type' : 'SCIENCE' })
  .projection({})
  .limit(1000000)
// Data selection from a single table
db.documents.aggregate([
  { "$match" :
    { "$text" :
      { "$search" : "Motion" }
    } }
]);

// Data selection from multiple tables
db.documents.aggregate([
  { "$match" :
    { "$text" :
      { "$search" : "Science" }
    } }
]);
// Data selection from multiple tables
db.embedded.aggregate([
  { "$match" : { "$text" : { "$search" : "SERVICE" } } },
  { "$group" : { "_id" : { "player_name" : "$player_name",
        "player_type" : "INSTITUTION",
        "country_code" : "$country_code" },
      "number_records" : { "$sum" : 1 }
    }
  },
  { "$project" : { "player_name" : "$player_name", "country_code" : "$country_code",
      "number_records" : "$number_records", "_id" : 1 } },
  { "$sort" : { "number_records" : -1 } },
  { "$limit" : 1000000 }
]);
// Data selection from multiple tables
db.embedded.aggregate([
  { "$match" :
    { "$text" :
      { "$search" : "MOTION" } } },
  { "$match" : { "doc_type" : "PATENT" } },  // For doc_type = PATENT
  { "$match" : { "doc_type" : "SCIENCE" } }, // For doc_type = SCIENCE
  { "$group" : { "_id" : { "player_name" : "$player_name",
        "player_type" : "INSTITUTION",
        "country_code" : "$country_code" },
      "number_records" : { "$sum" : 1 }
    }
  },
  { "$project" : { "player_name" : "$player_name",
      "country_code" : "$country_code",
      "number_records" : "$number_records",
      "_id" : 1 } },
  { "$sort" : { "number_records" : -1 } },
  { "$limit" : 100000 }
]);
// Data selection from multiple tables
db.embedded.find(
  { 'player_doc_link_type' : '{INVENTOR}' })
  .projection({})
  .sort({ _id : -1 })
  .limit(1000000)
// Data selection from multiple tables
db.embedded.aggregate([
  { "$match" :
    { "$text" : { "$search" : "behaviour" } } },
  { "$match" : { "doc_type" : "SCIENCE" } },

  { "$group" :
    { "_id" :
      { "player_sub_type" : ('{STARTUP,COMPANY}'),
        "player_name" : "$player_name",
        "player_type" : "INSTITUTION",
        "doc_source" : "$doc_source",
        "date_inserted" : "$date_inserted",
        "meta" : "$meta",
        "country_code" : "$country_code" },
      "number_records" : { "$sum" : 1 }
    }
  },
  { "$project" : { "player_name" : "$player_name",
      "player_type" : "INSTITUTION",
      "doc_type" : "SCIENCE",
      "date_inserted" : "$date_inserted",
      "country_code" : "$country_code",
      "number_records" : "$number_records",
      "_id" : 1 } },
  { "$sort" : { "number_records" : -1 } },
  { "$limit" : 100000 }
]);
[HELD11] Jing Han, Haihong E, Guan Le, and Jian Du. Survey on NoSQL database. In 2011 6th International Conference on Pervasive Computing and Applications, pages 363–366, Oct 2011. (cited on Page 26)
[HJ11a] Robin Hecht and Stefan Jablonski. NoSQL evaluation: A use case oriented survey. In 2011 International Conference on Cloud and Service Computing, Hong Kong, pages 336–341, 2011. (cited on Page 12)
[HJ11b] Robin Hecht and Stefan Jablonski. NoSQL evaluation: A use case oriented survey. In 2011 International Conference on Cloud and Service Computing, CSC 2011, Hong Kong, pages 336–341, December 12-14, 2011. (cited on Page 2)
[KYLTC12] Ken Ka-Yin Lee, Wai-Choi Tang, and Kup-Sze Choi. Alternatives to relational database: Comparison of NoSQL and XML approaches for clinical data storage. Computer Methods and Programs in Biomedicine, 110:99–110, 11 2012. (cited on Page 2)
[MD18] Amruta Mhatre, Ajeet Ghodeswar, Trupti Shah, and Santosh Dodamani. A comparative study of data migration techniques. IOSR Journal of Engineering (IOSRJEN), 9:77–82, 2018. (cited on Page 31)
[PPV13] Zachary Parker, Scott Poe, and Susan V. Vrbsky. Comparing NoSQL MongoDB to an SQL DB. In Proceedings of the 51st ACM Southeast Conference, ACMSE '13, pages 5:1–5:6, New York, NY, USA, 2013. ACM. (cited on Page 68)
[Sim12] Salomé Simon. Brewer’s CAP Theorem, 2012. (cited on Page 13)
[SKC16] W. Seo, N. Kim, and S. Choi. Big data framework for analyzing patents to support strategic R&D planning. In 2016 IEEE 14th Intl Conf on Dependable, Autonomic and Secure Computing, 14th Intl Conf on Pervasive Intelligence and Computing, 2nd Intl Conf on Big Data Intelligence and Computing and Cyber Science and Technology Congress (DASC/PiCom/DataCom/CyberSciTech), pages 746–753, Aug 2016. (cited on Page 29)
[TML+ 02] Qijia Tian, Jian Ma, Cleve J. Liang, Ron Chi-Wai Kwok, Ou Liu,
and Quan Zhang. An organizational decision support approach to r&d
project selection. In 35th Hawaii International Conference on Sys-
tem Sciences (HICSS-35 2002), CD-ROM / Abstracts Proceedings, 7-
10 January 2002, Big Island, HI, USA, page 251, 2002. (cited on Page 5)
[VGPC+ 17] Jose M. Vicente-Gomila, Anna Palli, Begoña Calle, Miguel A. Artacho,
and Sara Jimenez. Discovering shifts in competitive strategies in probi-
otics, accelerated with techmining. Scientometrics, 111(3):1907–1923,
June 2017. (cited on Page 5)
I hereby declare that I have written this thesis independently and have used no sources or aids other than those indicated.
Magdeburg, April 23, 2019