Analyzing The Performance of NoSQL vs. SQL Databases For Spatial
Conference Proceedings
Volume 17 Boston, USA Article 4
2017
KS Rajan
International Institute of Information Technology Hyderabad Gachibowli, Hyderabad, India
Recommended Citation
Agarwal, Sarthak and Rajan, KS (2017) "Analyzing the performance of NoSQL vs. SQL databases for Spatial and Aggregate queries,"
Free and Open Source Software for Geospatial (FOSS4G) Conference Proceedings: Vol. 17 , Article 4.
DOI: https://doi.org/10.7275/R5736P26
Available at: https://scholarworks.umass.edu/foss4g/vol17/iss1/4
Analyzing the performance of NoSQL vs. SQL databases for Spatial and
Aggregate queries
International Institute of Information Technology Hyderabad Gachibowli, Hyderabad, India
Abstract: Relational databases have been around for a long time, and spatial databases
have built on them for close to two decades. The recent past has seen the development
of NoSQL (non-relational) databases, which are now being adopted for spatial object storage and
handling, too. While SQL databases face scalability and agility challenges and fail to take
advantage of the cheap memory and processing power available these days, NoSQL databases
can handle growth in both the volume of stored data and the frequency at which it is accessed
and processed. These are essential features in geospatial scenarios, which do not deal with a
fixed schema (geometry) or a fixed data size. This paper compares the performance of
an existing NoSQL database, MongoDB, with its built-in spatial functions against that of a SQL
database with a spatial extension, PostGIS, for two problems involving spatial and aggregate
queries, across a range of datasets with varying feature counts. All the data in the analysis was
processed in memory, and no secondary memory was used. Initial results suggest that MongoDB
performs better by an average factor of 10x-25x, which increases exponentially as the data size
increases, in both indexed and non-indexed operations. Given these results, NoSQL databases may be
better suited for simultaneous multiple-user query systems, including Web-GIS and mobile-GIS.
Further studies are required to understand the full potential of NoSQL databases across various
geometries and spatial query types.
∗ Corresponding author: Sarthak Agarwal
Submitted to FOSS4G 2017 Conference Proceedings, Boston, USA. September 20, 2017
FOSS4G 2017 Academic Program Performance of NoSQL vs. SQL databases
1. Introduction
Traditionally, databases were designed to structure and organize any form of data. As database
sizes grew, spatial databases emerged to optimize storage and querying for the geospatial
domain. Satellite images are one prominent example of spatial data (Govind and Sharma
2013). To extract spatial information from a satellite image, it has to be processed in a spatial
frame of reference. However, satellite images are not the only type of spatial data; maps are
also stored in spatial databases.
Like other database systems, spatial databases have relied on relational databases to
handle and manage spatial objects and their associated attribute information. Relational
databases have been of great value in cases where the schema has a defined structure, including
the geometrical characteristics of the spatial objects. However, many spatial applications do not
have a fixed schema: there can be many geometries with different shapes, and the requirements
evolve as the data size increases. In such scenarios, relational databases can
limit the potential use or design of the solution.
Database characteristics such as scalability, performance and latency play a crucial role in
user satisfaction. In particular, social media projects with high user traffic, such as Facebook
and Google+, use other database management systems, such as Apache
Cassandra or Google BigTable. Instead of the relational approach, a Not-only-SQL (NoSQL)
approach is used. NoSQL databases are increasingly used to deal with simultaneously high read
and write request rates on large datasets.
While SQL databases face scalability and agility challenges and fail to take advantage of
the cheap memory and processing power available these days, NoSQL databases can handle
growth in both the volume of stored data and the frequency at which it is accessed and processed.
These are essential features in geospatial scenarios, which do not deal with a fixed schema
(geometry) or a fixed data size (Lourenço et al. 2015).
NoSQL data stores may provide advantages over relational databases; however, they generally
give up some of the robustness of relational databases in exchange for those advantages. The aim
of this paper is to identify some benefits of a selected NoSQL data store compared to a traditional
relational database when storing and querying spatial vector data. Our work is influenced by
previous theoretical work (de Souza Baptista et al. 2011, Xiao and Liu 2011,
Schmid et al. 2015) and by some practical implementations of those ideas (Popescu and Bacalu
2009, Steiniger and Hunter 2012, van der Veen et al. 2012).
One important problem is that of aggregate queries, i.e. queries that combine
ordinary non-spatial data with spatial data. Here we address such a problem,
building on our previous studies. The aim is to combine real-time spatial and
non-spatial scenarios into one and then analyze the performance of SQL vs. NoSQL databases
in such scenarios.
Two problems were carefully chosen. The first is to find the total number of restaurants in an
area that are vegetarian only. It requires a containment query to find the restaurants in the
area, followed by a sum aggregate over those that are vegetarian. The second problem is to find
the total number of distinct cuisines in an area. It also relies on an important aggregate
feature, namely distinct values in a column. Using these queries, we can simulate most
real-world scenarios and analyze the performance of each database.
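To make the two problems concrete, the following is a minimal Python sketch of their intended semantics on an in-memory list. The field names (x, y, vegetarian, cuisine), the sample values and the axis-aligned box are illustrative assumptions, not the schema or data used in the experiments.

```python
from typing import Iterable, NamedTuple


class Restaurant(NamedTuple):
    x: float          # longitude-like coordinate (hypothetical field)
    y: float          # latitude-like coordinate (hypothetical field)
    vegetarian: bool  # non-spatial attribute (hypothetical field)
    cuisine: str      # non-spatial attribute (hypothetical field)


class Box(NamedTuple):
    min_x: float
    min_y: float
    max_x: float
    max_y: float

    def contains(self, r: Restaurant) -> bool:
        # Strict inequalities: a point on the boundary is not "completely inside".
        return self.min_x < r.x < self.max_x and self.min_y < r.y < self.max_y


def count_vegetarian(restaurants: Iterable[Restaurant], area: Box) -> int:
    """Problem 1: containment filter followed by a conditional count."""
    return sum(1 for r in restaurants if area.contains(r) and r.vegetarian)


def distinct_cuisines(restaurants: Iterable[Restaurant], area: Box) -> set:
    """Problem 2: containment filter followed by a distinct aggregate."""
    return {r.cuisine for r in restaurants if area.contains(r)}


# Made-up sample data, for illustration only.
sample = [
    Restaurant(1.0, 1.0, True, "indian"),
    Restaurant(2.0, 2.0, False, "italian"),
    Restaurant(3.0, 1.5, True, "indian"),
    Restaurant(9.0, 9.0, True, "thai"),  # lies outside the area below
]
area = Box(0.0, 0.0, 5.0, 5.0)
print(count_vegetarian(sample, area))        # 2
print(len(distinct_cuisines(sample, area)))  # 2
```

In a database, the containment step would be delegated to a spatial predicate and the counting step to the engine's aggregate facilities; the sketch only fixes what the expected answer is.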
SQL Server and other relational databases have been the go-to databases for over 20 years. The
increased need to process higher volumes, velocities, and varieties of data at a rapid rate has
altered the data storage needs of application developers. To meet present-day processing
needs, NoSQL databases have gained popularity due to their ability to store unstructured
and heterogeneous data at scale. Relational databases remain a popular default option due
to their easy-to-understand table structure; however, there are many reasons to explore beyond
relational databases.
NoSQL is a category of databases distinctly different from SQL databases. The term is often
used to refer to data management systems that are "Not SQL", or to an approach to data
management that includes "Not only SQL". There are a number of technologies in the NoSQL category,
including document databases, key-value stores, column-family stores, and graph databases,
which are popular with gaming, social, and IoT apps.
Many NoSQL database systems are currently available, but only a few support spatial
data. In this section, we discuss some of those systems and the functions
each of them offers.
MongoDB currently uses two geospatial indexes, 2d and 2dsphere. The 2d index is used
to calculate distances on a plane surface. The 2dsphere index calculates geometries over an
Earth-like sphere. The coordinate reference system is currently limited to the WGS84 datum.
MongoDB computes the geohash values for the coordinate pairs and then indexes the geohash
values. A precise description of the indexing techniques of the geohash values is not available
at the moment (Anonymous 2011-2015).
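Since the indexing details are not documented, the following sketch only illustrates the generic geohash idea: bits from longitude and latitude are interleaved, so that nearby coordinate pairs tend to share a common string prefix that an ordinary ordered index can exploit. It is a plain textbook encoder written for illustration, not MongoDB's internal implementation, and the sample coordinates are arbitrary.

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # standard geohash alphabet


def geohash_encode(lat: float, lon: float, precision: int = 8) -> str:
    """Encode a WGS84 coordinate pair as a base-32 geohash string."""
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    chars = []
    bits, bit_count = 0, 0
    use_lon = True  # bits alternate between longitude and latitude, longitude first
    while len(chars) < precision:
        if use_lon:
            mid = (lon_lo + lon_hi) / 2
            if lon >= mid:
                bits = (bits << 1) | 1
                lon_lo = mid
            else:
                bits <<= 1
                lon_hi = mid
        else:
            mid = (lat_lo + lat_hi) / 2
            if lat >= mid:
                bits = (bits << 1) | 1
                lat_lo = mid
            else:
                bits <<= 1
                lat_hi = mid
        use_lon = not use_lon
        bit_count += 1
        if bit_count == 5:  # every 5 bits become one base-32 character
            chars.append(BASE32[bits])
            bits, bit_count = 0, 0
    return "".join(chars)


# Two arbitrary nearby points; their hashes can be compared by prefix.
first = geohash_encode(17.445, 78.349, precision=6)
second = geohash_encode(17.446, 78.350, precision=6)
print(first, second)
```

A shorter hash is always a prefix of a longer hash of the same point, which is what allows a range scan over the index to stand in for a coarse spatial filter.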
Besides the storage and indexing of spatial data, the query process is an important aspect.
For querying spatial data, several geo-functions are available in relational databases. They
enable different queries with geo-context at the database level using SQL. An example of a
geo-function is the calculation of a buffer around a point feature or a line feature.
For the relational database PostgreSQL, a special extension, PostGIS, is available for
integrating several geo-functions. MongoDB and CouchBase do not have a separate extension
at the moment, but they support some geo-functions. Table 1 compares the geo-functions of the
three databases.
PostgreSQL/PostGIS provides more than one thousand geo-functions; Table 1 includes only
a selection of them. MongoDB supports only three geo-functions: $geoWithin, $geoIntersects
and $near. The MongoDB $geoWithin operator corresponds to the ST_Within function in
PostgreSQL/PostGIS, and the MongoDB $geoIntersects operator corresponds to the
ST_Intersects function in PostgreSQL/PostGIS.
The $near function returns the geometry nearest to a predefined point. It
can be used in combination with a $maxDistance parameter, in which case MongoDB
returns all geometries within a certain distance, ordered by distance. PostgreSQL can
calculate this using the ST_DWithin function; the results, however, need to be ordered
by distance in an additional step.
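The correspondence between the operators can be sketched as follows. The Python helpers only build MongoDB filter documents and hold PostGIS query strings; nothing is executed. The collection, table and column names (restaurants, areas, location, geom, id) and the psycopg-style named placeholders are assumptions made for illustration, and the SQL assumes geometries stored with SRID 4326.

```python
def geo_within_filter(field: str, ring: list) -> dict:
    """MongoDB filter: documents whose `field` lies within a GeoJSON polygon.

    `ring` is a closed list of [lon, lat] pairs (first pair == last pair).
    """
    return {
        field: {
            "$geoWithin": {
                "$geometry": {"type": "Polygon", "coordinates": [ring]}
            }
        }
    }


def near_filter(field: str, lon: float, lat: float, max_distance_m: float) -> dict:
    """MongoDB filter: documents near a point, nearest first, within a distance in meters.

    $near requires a geospatial index on `field`.
    """
    return {
        field: {
            "$near": {
                "$geometry": {"type": "Point", "coordinates": [lon, lat]},
                "$maxDistance": max_distance_m,
            }
        }
    }


# PostGIS counterpart of $geoWithin (hypothetical tables and columns).
WITHIN_SQL = """
SELECT r.id
FROM restaurants AS r, areas AS a
WHERE a.id = %(area_id)s
  AND ST_Within(r.geom, a.geom);
"""

# PostGIS counterpart of $near with $maxDistance: the ordering must be
# requested explicitly with ORDER BY.
DWITHIN_SQL = """
SELECT r.id
FROM restaurants AS r
WHERE ST_DWithin(
        r.geom::geography,
        ST_SetSRID(ST_MakePoint(%(lon)s, %(lat)s), 4326)::geography,
        %(max_distance_m)s)
ORDER BY ST_Distance(
        r.geom::geography,
        ST_SetSRID(ST_MakePoint(%(lon)s, %(lat)s), 4326)::geography);
"""

square = [[78.0, 17.0], [79.0, 17.0], [79.0, 18.0], [78.0, 18.0], [78.0, 17.0]]
print(geo_within_filter("location", square))
print(near_filter("location", 78.35, 17.44, 500))
```

With a driver such as pymongo, a filter built this way would typically be passed to a collection's find method, and the SQL strings to a cursor's execute method together with a parameter dictionary.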
CouchBase can only query point geometries within a bounding box (BBox). The BBox
function of CouchBase can be compared to the $geoWithin (MongoDB) and ST_Within (PostGIS)
functions; however, MongoDB and PostGIS can use arbitrary polygons, not only an
axis-parallel rectangle.
The overview of the implemented geo-functions shows that PostgreSQL with its extension
PostGIS has the most comprehensive geo-functionality, with more than one thousand functions.
For a complete list of all implemented functions, the reader is referred to the PostGIS handbook.
The two NoSQL databases implement very few geo-functions: MongoDB implements
three, whereas CouchBase implements only one.
NoSQL spatial databases are still in their infancy. A lot of research and
development remains to be done on both their theoretical and practical aspects. Currently,
they support only a limited set of geometry types and very few functions to manipulate them.
These databases should be designed so that more functions can be added
flexibly. We can take inspiration from the functions currently present in PostGIS and start
implementing them in MongoDB as well. Spatial databases also support
various indexes; Xiang et al. (2016) illustrate the implementation of R-trees in MongoDB for
spatial indexing.
3. Dataset
The point containment problem works for any geometry and reports whether a given geometry
is completely inside another geometry. This is a vital and traditional problem in spatial
databases. It is useful for map generation, modeling, and analyzing spatial data over
an area. For example, to report how the number of houses in a city has changed over time,
we can count the number of points (each representing a house) in the polygon (the city)
for each year using the point-within-polygon problem.
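The house-counting example can be sketched with a ray-casting test, a common way to decide whether a point lies inside a polygon. This is an illustrative stand-alone implementation, not the algorithm used internally by PostGIS or MongoDB; the city outline and yearly house locations are made-up values, and points exactly on the boundary are not handled robustly.

```python
def point_in_polygon(x: float, y: float, polygon: list) -> bool:
    """Ray casting: count crossings of a horizontal ray going right from (x, y)."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does this edge straddle the horizontal line through y?
        if (y1 > y) != (y2 > y):
            # x-coordinate at which the edge crosses that line
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside  # each crossing toggles inside/outside
    return inside


def count_inside(points: list, polygon: list) -> int:
    """Number of points that fall inside the polygon."""
    return sum(1 for (px, py) in points if point_in_polygon(px, py, polygon))


# Hypothetical city outline and house locations per year.
city = [(0.0, 0.0), (4.0, 0.0), (4.0, 4.0), (0.0, 4.0)]
houses_by_year = {
    2015: [(1.0, 1.0), (5.0, 5.0)],
    2016: [(1.0, 1.0), (2.0, 3.0), (5.0, 5.0)],
}
for year, houses in sorted(houses_by_year.items()):
    print(year, count_inside(houses, city))
```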
The dataset D1 consisted of two layers. The first layer simulated the restaurants, which are
represented as points with non-spatial attributes such as timing, price, cuisine, etc.
The second layer consisted of square boxes of different perimeters scattered both sequentially and
randomly over the space, with a few points completely inside each box and the others completely
outside it, simulating the area of interest.
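A dataset with this two-layer shape could be generated along the following lines. This is only a sketch under stated assumptions: the attribute values, coordinate extent, seed and GeoJSON layout are hypothetical and are not the generator or the data behind D1.

```python
import random


def make_restaurants(n: int, extent: float = 10.0, seed: int = 0) -> list:
    """Point layer: n GeoJSON points with made-up non-spatial attributes."""
    rng = random.Random(seed)  # fixed seed so the dataset is reproducible
    cuisines = ["indian", "italian", "thai", "mexican"]  # hypothetical values
    return [
        {
            "location": {
                "type": "Point",
                # [lon, lat], kept inside a small valid WGS84 window
                "coordinates": [rng.uniform(0, extent), rng.uniform(0, extent)],
            },
            "vegetarian": rng.random() < 0.5,
            "cuisine": rng.choice(cuisines),
        }
        for _ in range(n)
    ]


def make_square(cx: float, cy: float, side: float) -> dict:
    """Area layer: an axis-aligned square as a closed GeoJSON polygon ring."""
    h = side / 2
    ring = [
        [cx - h, cy - h],
        [cx + h, cy - h],
        [cx + h, cy + h],
        [cx - h, cy + h],
        [cx - h, cy - h],  # GeoJSON rings repeat the first vertex at the end
    ]
    return {"type": "Polygon", "coordinates": [ring]}


restaurants = make_restaurants(1000)
areas = [make_square(5.0, 5.0, side) for side in (1.0, 2.0, 4.0)]
print(len(restaurants), len(areas))
```

Documents in this form can be loaded into a MongoDB collection directly, or converted to rows with a geometry column for PostGIS, so that both systems are queried over identical data.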
As in the previous problem, we use the point containment query, which works for any
geometry and reports whether a given geometry is completely inside another geometry.
The dataset D2 consists of similar layers, one simulating restaurants and the other simulating
the area of interest.
4. Performance
Tests were run on both database systems with the same datasets, once with an index and once
without. Times were recorded in seconds for both the indexed and non-indexed analyses.
Lines and polygons in PostGIS were indexed using the GiST index method, and in MongoDB they
were indexed using the 2dsphere index.
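An average query time in seconds can be collected with a small harness like the one below. It is a generic sketch: the number of repetitions and the idea of passing the query as a zero-argument callable are assumptions, since the measurement procedure is not described in further detail.

```python
import statistics
import time


def average_seconds(run_query, repeats: int = 5) -> float:
    """Run `run_query` several times and return the mean wall-clock time in seconds."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        run_query()  # e.g. a function that sends one query and fetches all results
        timings.append(time.perf_counter() - start)
    return statistics.mean(timings)


# Stand-in workload; in practice this would wrap a MongoDB or PostGIS query.
elapsed = average_seconds(lambda: sum(range(100_000)), repeats=3)
print(f"{elapsed:.6f} s")
```

Running the same harness once with the index present and once after dropping it gives the indexed and non-indexed timings for a dataset.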
Test Setup
All the data in the analysis was processed in memory, and no secondary memory was used.
Hardware used:
• RAM: 16 GB
• Processor: Intel Core i7-5500U CPU @ 2.40 GHz x 4
• OS: Ubuntu 14.04, 64-bit
• Disk: solid-state drive
Software used:
• PostgreSQL 9.3.12
• PostGIS 2.1
• MongoDB 3.2.5
For post-experiment analysis of the results, LibreOffice and Google Sheets were used to plot graphs.
The graph in Figure 1 shows that MongoDB is much faster than
PostGIS in all cases, from small datasets to large datasets. The difference in performance
is approximately a factor of 10. Another observation from the graph is that
the performance difference between indexed and non-indexed queries is small in MongoDB
compared to PostGIS. We conclude that both indexed and non-indexed
queries on D1 perform better in MongoDB than in PostGIS.
In experiment 2, the performance was first analyzed without indexing any geometry. The
observed times suggest that MongoDB performs better as the data size increases,
whereas PostGIS does not perform as well with huge datasets and its query time increases
exponentially. After indexing the geometries, however, the performance of both database engines
improved by a substantial factor.
The results of the experiments show that indexed datasets perform
better than non-indexed datasets, and that PostGIS query time increases exponentially as the size
of the dataset increases, whereas MongoDB still performs within some bounds.
The experiments analyse the performance difference and ease of implementation when developing
real-time applications involving both spatial and non-spatial use cases. These results
suggest that MongoDB performs better by an average factor of 10, which increases exponentially
as the data size increases, in both indexed and non-indexed operations for both problems. Given
these results, NoSQL databases may be better suited for simultaneous multiple-user query systems,
including Web-GIS and mobile-GIS.
This implies that non-relational databases are better suited to multi-user query systems
and have the potential to be deployed on servers with limited computational power. Further
studies are required to identify their appropriateness and to incorporate a range of spatial
algorithms within non-relational databases.
The focus of our research is mainly to benchmark real-world scenarios that include both spatial
and non-spatial use cases. The motivation behind this work is to minimize the dependency
on a server during mobile routing. We want the client itself to act as a server during routing,
and for that we need a light database system that can be easily ported to mobile devices.
We therefore compare the performance of SQL vs. NoSQL databases that are comparatively
lightweight.
However, there are still some limitations to using NoSQL databases instead of SQL databases.
NoSQL databases offer fewer spatial functions than SQL databases, and the currently implemented
geo-functions support only very basic operations. Relational databases are still far superior if
the user needs to calculate geoinformation at the database level (Schmid et al. 2015). The results
presented in this paper are valid only for the chosen database settings, but they clearly show
that NoSQL databases are a possible alternative, at least for querying attribute information.
PostgreSQL also has a NoSQL implementation of its own; it does not currently support
PostGIS, although the results of PostGIS queries can be exported as GeoJSON objects.
In the future, we plan to expand our study to other spatial query functions as well as
spatial algorithms, such as the shortest path problem, to evaluate the performance of NoSQL on
such tasks, and also to test the performance of MongoDB on distributed systems.
References
Anonymous, 2011-2015. MongoDB, Inc.; Geospatial indexes and queries. Online; accessed July 21, 2017.
URL http://docs.mongodb.org/manual/applications/geospatialindexes/
de Souza Baptista, C., de Oliveira, M. G., da Silva, T. E., 2011. Using OGC services to interoperate spatial data
stored in SQL and NoSQL databases. XII GEOINFO, Campos do Jordão, Brazil.
Govind, S., Sharma, A., 2013. Open source spatial database for mobile devices. Computer Engineering and
Intelligent Systems 4(6).
Lourenço, J. R., Cabral, B., Carreiro, P., Vieira, M., Bernardino, J., 2015. Choosing the right NoSQL database for
the job: a quality attribute evaluation. Journal of Big Data 2:18.
Ostrovsky, D., Rodenski, Y., 2014. Pro Couchbase Server. Apress.
Popescu, A., Bacalu, A.-M., 2009. Geo NoSQL: CouchDB, MongoDB, and Tokyo Cabinet. Online; accessed
July 21, 2017.
URL http://nosql.mypopescu.com/post/300199706/geo-nosql-couchdb-mongodb-tokyo-cabinet
Schmid, S., Galicz, E., Reinhardt, W., 2015. Performance investigation of selected SQL and NoSQL databases.
AGILE 2015, Lisbon.
Steiniger, S., Hunter, A. J. S., 2012. Free and open source GIS software for building a spatial data infrastructure.
Geospatial Free and Open Source Software in the 21st Century (Part 5).
van der Veen, J. S., van der Waaij, B., Meijer, R. J., 2012. Sensor data storage performance: SQL or NoSQL, physical
or virtual. IEEE Fifth International Conference on Cloud Computing, Honolulu, Hawaii, USA.
Xiang, L., Shao, X., Wang, D., 2016. Providing R-tree support for MongoDB. The International Archives of the
Photogrammetry, Remote Sensing and Spatial Information Sciences, Prague, Czech Republic XLI-B4.
Xiao, Z., Liu, Y., 2011. Remote sensing image database based on NoSQL database. Geoinformatics.
Table 2: Average Time for Vegan in an area problem (*NC- Not Computable).
Figure 1: Graph of No. of restaurants vs average time for all datasets in D1.
Figure 2: Graph of No. of restaurants vs average time for all datasets in D2.