Computing (2020) 102:221–246
https://doi.org/10.1007/s00607-019-00736-1
Bringing SQL databases to key-based NoSQL databases:
a canonical approach
Geomar A. Schreiner1 · Denio Duarte2 · Ronaldo dos Santos Mello1
Received: 14 December 2017 / Accepted: 24 June 2019 / Published online: 29 June 2019
© Springer-Verlag GmbH Austria, part of Springer Nature 2019
Abstract
Big Data management has brought several challenges to data-centric applications, such as
support for data heterogeneity, rapid data growth and huge data volumes. NoSQL
databases have been proposed to tackle Big Data challenges by offering horizontal
scalability, schemaless data storage and high availability, among others. However,
NoSQL databases do not have a standard query language, which results in a steep
learning curve for developers. On the other hand, traditional relational databases and
SQL are very popular standards for storing and manipulating critical data, but they are
not suitable for Big Data management. One solution for relational-based applications to
move to NoSQL databases is to offer a way to access NoSQL databases through SQL
instructions. Several approaches have been proposed for translating relational database
schemata and operations to equivalent ones in NoSQL databases in order to improve
scalability and availability. However, these approaches map relational databases only
to a single NoSQL data model and, sometimes, to a specific NoSQL database product.
This paper presents a canonical approach, called SQLtoKeyNoSQL, that translates
relational schemata as well as SQL instructions to equivalent schemata and access
methods of any key-oriented NoSQL database. We present the architecture of our
layer, focusing on the mapping strategies, as well as experiments that evaluate the
benefits of our approach against some state-of-the-art baselines.
Keywords Data interoperability · Cloud computing · Relational-cloud mapping ·
NoSQL · Big data
Geomar A. Schreiner
Denio Duarte
Ronaldo dos Santos Mello
1 Federal University of Santa Catarina, Florianópolis, Brazil
2 Federal University of Fronteira Sul, Chapecó, Brazil
123
222 G. A. Schreiner et al.
Mathematics Subject Classification 68P15
1 Introduction
Relational databases (RDB) and SQL have been the preferred technologies to store
and manage data for decades. However, we have witnessed tremendous growth in
the size of data sets in several application domains over the last years. These data
sets, the so-called Big Data, are often loosely structured, schemaless and heterogeneous,
and their management is challenging since, in general, high availability and
scalability are required. Social networks, sensor networks, and healthcare are examples
of data-centric applications that produce this kind of data. Cloud computing-based
approaches for data management are emerging as a promising solution to deal with Big
Data [1]. Distributed data centers accessed through the Internet are a typical system
architecture in this context. Although RDB are very popular for data management,
they are not suited to Big Data-centric applications because they enforce the ACID
properties for data manipulation, which are at odds with the availability and
scalability requirements for Big Data manipulation. In fact, the overhead introduced to
guarantee ACID transactions may be prohibitive when a large volume of data must
be handled. Besides, the fixed record format of relational data, known as the schema first,
data later paradigm [16], also introduces modeling and storage challenges for data
instances that do not conform to a schema. This variety of representation is also a typical
Big Data issue.
NoSQL databases (NoSQL DB) have been proposed to overcome these problems
[21]. NoSQL DB support scalability and elasticity requirements to efficiently
manipulate data sets that can increase in size quickly. They are based on new data
models that better represent heterogeneous (and complex) data instances (also known
as the data first, schema later paradigm [16]) and provide horizontal elasticity instead
of the (limited) vertical elasticity supported by most RDB management systems
(RDBMS). Horizontal elasticity improves performance for Big Data management
since machines can be added or removed based on the application storage needs
[7,21]. Usually, the data models of NoSQL DB are organized into four categories:
(i) document-oriented (e.g., MongoDB and SimpleDB), (ii) column-oriented (see
footnote 1; e.g., Cassandra and Cloudy), (iii) key-value (e.g., Voldemort and Redis),
and (iv) graph (e.g., Neo4j).
Based on this motivation, many organizations have been moving their relational data
to DB in the cloud (DB-as-a-service, DBaaS), in particular, NoSQL DB. However,
the cost of this move is high due to the new paradigm that must be faced. The
database access interface is the most challenging aspect: developers are used to defining
and manipulating data using the SQL language of RDB. In contrast, NoSQL DB provide
different access methods and access languages depending on their data model or
specific product, and usually have limited (or no) support for the SQL standard. As a
consequence, the learning curve to start using NoSQL DB is very steep due to these
differences in terms of data representation and data access.
1 Some authors, like [7], use the terms extensible record stores, wide column stores or simply columnar as
synonyms for column-oriented databases.
Adapting applications to new computational environments brings some risks to
organizations, and the existing solutions that deal with this problem may be organized
into three categories [6]: (i) redevelopment, which rewrites the existing applications
from scratch; (ii) wrapping, which provides a new interface to a software component,
making it more easily accessible by other components; and (iii) migration, which
moves the application to the new environment, while retaining the original system's
data and functionality. The choice of one of these solutions depends on an evaluation
of the costs, like the amount of required changes, as well as the involved risks. The first
solution is the most expensive since it requires a whole system re-implementation. The
third one requires less effort than the first one, considering that not all of the system
will be recoded. The second solution is the least costly one as it usually provides
a faster moving strategy. In this case, the wrapping component acts as an interface to
a service that performs some processing required by an external client that does not
need to know how the service is implemented.
The approach we propose in this paper minimizes the cost of moving relational-based
applications to NoSQL DB in the cloud by following this wrapping strategy.
Although wrapping is a provisional solution, it offers a fast way to deploy applications
to new platforms while, for example, a new application is developed (redevelopment
category).
Related work on SQL-to-NoSQL mapping based on wrapping usually adopts one of
these two strategies: (i) to modify the storage system of an RDBMS kernel, allowing
the RDBMS to store data in a NoSQL DB [3,10,24]; or (ii) to develop a layer that
translates SQL instructions to corresponding access methods to be executed at the
target NoSQL DB [2,8,9,15,19]. Our approach, called SQLtoKeyNoSQL, fits into the
second strategy. We propose a canonical model that maps a subset of SQL instructions
to a hierarchical structure organized as a tree $T$. $T$, in turn, can be mapped to any
key-based NoSQL DB (document-oriented, column-oriented and key-value). We argue that
we have a canonical approach because we provide transparent interoperability of an
application with an RDB access interface to one or more key-based NoSQL DB. The
existing approaches that provide interoperability between RDB and NoSQL DB do
not offer a comprehensive solution, i.e., they focus on a specific NoSQL DB product.
We give more details about the differences between our approach and the related work
in Sect. 4.
This paper extends several points of our previous work that introduces
SQLtoKeyNoSQL [22]. First of all, we review and detail the formal definitions for the
canonical model and all the schema and SQL instruction mappings in order to make
them more precise. This revision is essential mainly for reproducibility purposes.
Secondly, we enhance our access methods for NoSQL DB by adding a new (and optimized)
method for retrieving blocks of data. In the previous version of SQLtoKeyNoSQL,
data were accessed row by row (i.e., each retrieved row required a separate access
to the NoSQL DB). Despite being a more technical contribution, this new access
method reduces the number of effective accesses to the target NoSQL DB by retrieving
a block of rows on each access, which increases the performance of our solution. We
show that through our experiments (Sect. 5). Thirdly, we provide join queries for our
approach. We allow the join of data coming from different target NoSQL DB (e.g., a
join over three tables, one stored in a Cassandra database, and the other two in a Redis
database) by executing the efficient and well-known merge join and hash join
algorithms. Besides, developers can implement their own join algorithm and set it as the
default join strategy in SQLtoKeyNoSQL. The last contribution is a new set of
experiments. In our previous work, we evaluated the overhead of our approach. In this paper
we conduct new experiments that compare our approach against two state-of-the-art
baselines: SimpleSQL [9] and Unity [15]. We provide a fair comparison by considering
the same types of SQL instructions, NoSQL target, computational environment
and configurations for each approach, as detailed in Sect. 5. The results show that
our approach outperforms SimpleSQL and Unity concerning processing time to query
relational data stored in NoSQL DB.
The remainder of this paper is organized as follows. Section 2 presents
the fundamentals of key-based NoSQL data models. Section 3 details the
SQLtoKeyNoSQL approach. Section 4 discusses related work, Sect. 5 presents an
experimental evaluation, and Sect. 6 is dedicated to the final considerations and future work.
2 Key-based NoSQL data models
NoSQL DB have been proposed to manage highly heterogeneous and voluminous
data efficiently. A NoSQL DB can be defined as a non-relational database that has
six properties [7,21]: (i) horizontal scaling; (ii) ability to store complex data in a
distributed way; (iii) simple access interface or protocol for manipulating data; (iv)
relaxed/non-existent ACID support; (v) high availability; and (vi) optional and flexible
schema.
NoSQL DB have independent designs, each one with specific data models that
support complex data. In the literature, we find different taxonomies related to the
data models of the NoSQL DB [7,21]. In this paper, we consider the four categories
of NoSQL data models defined in [21]: (i) document-oriented; (ii) column-oriented;
(iii) key-value; and (iv) graph.
The canonical model proposed in this paper supports the first three categories,
which are called key-based NoSQL data models. Any NoSQL database in this family
is able to retrieve an individual data object given an input key, but different NoSQL
DB may differ in terms of accessing internal object components [4]. Besides, each
key-based NoSQL DB may offer different access methods and protocols. Several of
them support the REST API [11], but this is not a standard. We define the common
concepts of each key-based NoSQL data model in the following. We illustrate each
definition using an extract from a Cars RDB presented in Table 1.
The key-value data model is the simplest NoSQL data model. It is composed of
a set of key-value pairs, with the value being accessed through a key. A value can
maintain a simple or complex content, but this content cannot be queried, i.e., it is a
“black-box” content. Because of this, we assume that any value in a key-value data
model has an atomic domain. A database based on the key-value model is defined as
follows.
Definition 1 (Key-Value Database) A key-value database $db$ is a tuple $db =
(n_{db}, KV_{db})$, where $n_{db}$ is the name of the database, $KV_{db}$ is the set of key-value
pairs, and $db$ is accessed by $n_{db}$.
Definition 2 (Key-Value Set) A key-value pair $kv \in KV_{db}$ is defined as $kv = (key : value)$,
where each $kv.key$ is a unique value and $kv.value$ holds an atomic value.
Example 1 shows an extract of a key-value database based on the first tuple of
Brands of Cars database (Table 1).
Example 1 Applying Definitions 1 and 2 to an extract of the Cars RDB results in $db =
(Cars, \{(Brands.1.id : 1), (Brands.1.name : Ford), (Brands.1.founded : 1903)\})$, where Cars
is the database name, and $(Brands.1.id : 1)$ is one of the key-value pairs.
The document-oriented data model is a specialization of the key-value data model.
A document encompasses a set of key-value pairs, and each document is accessed by
a unique atomic key. However, a document content is composed of a set of simple or
complex attributes. A simple attribute holds an atomic value, and a complex attribute
has a list, set or tuple domain.
The document-oriented data model is composed of a database, as well as collections,
documents (items), attributes and values [21], as defined in the following.
Definition 3 (Document Database) A document database $D$ is a tuple $D = (n_D, C_D)$,
where $n_D$ is the name of $D$, $C_D$ is a set of document collections, and $D$ is accessed
by $n_D$.
Definition 4 (Document Collection) A document collection $dc \in C_D$ is a tuple $dc =
(k_{DC}, DOCS)$, where $k_{DC}$ is the key of $dc$, $DOCS$ is a set of documents, and $dc$ is
accessed by $k_{DC}$.
Definition 5 (Document) A document $d \in DOCS$ is a tuple $d = (k_d, A)$, where $k_d$
is the key of $d$, $A$ is a set of attributes, and $d$ is accessed by $k_d$.
Definition 6 (Attribute) An attribute $\alpha \in A$ is a pair $(k_\alpha : v)$, where $k_\alpha$ is the key of $\alpha$,
$v$ holds a value whose domain can be atomic, a list, a set, or a tuple, and $\alpha$ is accessed
by $k_\alpha$.
The following example shows a document-oriented modeling based on Table 1.
Table 1 An extract of a database about cars

Table Brands
Id  Name     Country  Founded
1   Ford     USA      1903
2   BMW      Germany  1916
3   Renault  France   1899

Table Models
Id  Name    Prod_begin  Prod_end  Brand_id
1   Clio    1990        –         3
2   Corcel  1968        1986      1
3   E65     2002        2008      2
Example 2 Applying Definitions 3–6 to an extract of the Cars RDB results in $D =
(Cars, \{(Models, \{(Model\_1, \{(id : \text{"1"}), (name : \text{"Clio"})\})\})\})$, where Cars is the document
database, Models is the key of the single document collection in Cars, Model_1 is the
key of the single document in Models, and (id: "1") and (name: "Clio") are attribute
pairs $(k_\alpha : v)$ of Model_1.
Finally, the column-oriented data model represents data properties based on a
column-distributed schema. It is composed of a keyspace, column family, column
set accessed by a unique key, columns and values [21], as defined in the following.
Definition 7 (Keyspace) A keyspace $K$ is a tuple $K = (n_K, F)$, where $n_K$ is the name
of $K$, $F$ is a set of column families, and $K$ is accessed by $n_K$.
Definition 8 (Column Family) A column family $f \in F$ is a tuple $f = (n_f, S_c)$, where
$n_f$ is the name of $f$, $S_c$ is a set of column sets, and $f$ is accessed by $n_f$.
Definition 9 (Column Set) A column set $cs \in S_c$ is a tuple $cs = (n_{cs}, Cols)$, where
$n_{cs}$ is the name of $cs$, $Cols$ is a set of columns, and $cs$ is accessed by $n_{cs}$.
Definition 10 (Column) A column $c \in Cols$ is a tuple $c = (n_c, v)$, where $n_c$ is the
name of $c$, $v$ is an atomic value, and $c$ is accessed by $n_c$.
Example 3 shows a column-oriented modeling based on Table 1.
Example 3 Applying Definitions 7–10 to an extract of the Cars database yields the
column-oriented database $K = (Cars, \{(Models, \{(row1, \{(id : \text{"1"}), (name : \text{"Clio"})\})\}),
(Brands, \{(row1, \{(id : \text{"3"}), (name : \text{"Renault"})\})\})\})$, where Cars is a keyspace,
Models and Brands are column families, row1 is a column set, and (id: "1") and
(name: "Clio") are columns with their respective values.
Definitions 1–10 are the basis for the definition of our canonical hierarchical model
as well as the mapping rules adopted by our approach. We detail them in the next
section.
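As an informal illustration of Definitions 1–10, the extracts used in Examples 1–3 can be written as nested Python structures. This is only a sketch: the variable names, the `lookup` helper and the use of in-memory dictionaries are assumptions made for illustration and are not part of the approach.

```python
# Key-value model (Definitions 1-2): a flat set of pairs whose values are
# opaque ("black-box") to the database. Two pairs of Example 1 are shown.
key_value_db = ("Cars", {
    "Brands.1.id": 1,
    "Brands.1.name": "Ford",
})

# Document-oriented model (Definitions 3-6):
# database -> collection -> document -> attributes.
document_db = ("Cars", {
    "Models": {
        "Model_1": {"id": "1", "name": "Clio"},
    },
})

# Column-oriented model (Definitions 7-10):
# keyspace -> column family -> column set -> columns.
column_db = ("Cars", {
    "Models": {"row1": {"id": "1", "name": "Clio"}},
    "Brands": {"row1": {"id": "3", "name": "Renault"}},
})


def lookup(db, *keys):
    """Follow a path of access keys from the database content to a value."""
    _name, content = db
    for key in keys:
        content = content[key]
    return content


print(lookup(key_value_db, "Brands.1.name"))
print(lookup(document_db, "Models", "Model_1", "name"))
print(lookup(column_db, "Brands", "row1", "name"))
```

In all three cases data is reached through a chain of keys; only the depth of the chain and the visibility of the stored value differ, which is what the canonical model exploits.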
3 The SQLtoKeyNoSQL approach
SQLtoKeyNoSQL is a layer that allows relational access to data stored in NoSQL DB. In
order to guarantee transparent and general access to any key-based NoSQL DB, our
approach maps a relational schema to an intermediary canonical model that abstracts
the target NoSQL data models. In fact, these data models can be generalized to two
concepts (key and value), and the canonical model represents keys and values in
a simple hierarchical structure, as presented in Sect. 3.1. Our layer also maps SQL
instructions to intermediary methods based on the REST API methods (get, put and
delete), given that most key-based NoSQL DB support this API for data access.
In the following, we present the canonical model, the mapping strategies accom-
plished by our layer as well as its architecture.
3.1 Canonical model
The proposed canonical model is composed of a set of key and value nodes organized
in a hierarchical structure that is able to represent a relational schema. Besides the
root node, it is limited to three key node levels and one atomic value in the leaf nodes.
As our model has these specific structural constraints, we decided not to use other
available hierarchical data models, like XML (footnote 2) and DOM (footnote 3), which are
less constrained and more complex at the same time.
The definition of a schema based on our canonical model is given as follows.
Definition 11 (Canonical Schema) A canonical schema $Can$ is a tree structure defined
as $Can = (n_{root}, S_{L1})$, where $n_{root}$ is a node with a property that holds the RDB name,
and $S_{L1} = \{k_{L1_1}, \ldots, k_{L1_n}\}$ is the set of First Level Keys, with each $k_{L1_i} \in S_{L1}$ being a
child node of $n_{root}$ that represents a mapped RDB relation.
Definition 12 (First Level Key) A first level key $k_{L1_i} \in S_{L1}$ is a tuple $k_{L1_i} =
(n_{L1_i}, S_{L2})$, where $n_{L1_i}$ is a node property that holds an RDB relation name, and
$S_{L2} = \{k_{L2_1}, \ldots, k_{L2_o}\}$ is a set of Second Level Keys, with each $k_{L2_j} \in S_{L2}$ being a
child node of $k_{L1_i}$ that uniquely identifies (by the concatenation of primary key values) a tuple
of a mapped RDB relation.
Definition 13 (Second Level Key) A second level key $k_{L2_j} \in S_{L2}$ is a tuple $k_{L2_j} =
(n_{L2_j}, S_{L3})$, where $n_{L2_j}$ is a node property that holds the concatenation of the primary
key values of a tuple, and $S_{L3} = \{k_{L3_1}, \ldots, k_{L3_p}\}$ is a set of Third Level Keys, with
each $k_{L3_k} \in S_{L3}$ being a child node of $k_{L2_j}$ that represents an attribute of a mapped RDB
relation tuple.
Definition 14 (Third Level Key) A third level key $k_{L3_k} \in S_{L3}$ is a tuple $k_{L3_k} =
(n_{L3_k}, \nu)$, where $n_{L3_k}$ is a node property that holds an attribute name of a tuple, and $\nu$
is the child node of $k_{L3_k}$ with a property that holds the value of the attribute represented
by $k_{L3_k}$. We define this single child node of $k_{L3_k}$ as $child(k_{L3_k})$.
Figure 1 shows the Cars RDB schema (Fig. 1a, the same as Table 1) and its
corresponding schema in the canonical model (Fig. 1b). $n_{root}$ is named Cars. The
tables Brands and Models are mapped to first level key nodes. The primary keys from
both tables (id attributes) are mapped to second level key nodes. Attributes are mapped
to third level key nodes, and their values are mapped to leaf nodes.
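The construction described by Definitions 11–14 can be sketched in a few lines of Python. The function name `to_canonical`, the nested-dictionary encoding of the tree, and the "." separator used when a primary key has several attributes are assumptions made for this illustration; the paper itself only states that primary key values are concatenated.

```python
def to_canonical(db_name, tables, primary_keys):
    """Build the canonical tree as nested dicts:
    root -> first level key (relation) -> second level key (primary key
    values, concatenated) -> third level key (attribute) -> leaf value."""
    relations = {}
    for relation, rows in tables.items():
        second_level = {}
        for row in rows:
            # Second level key: concatenation of the primary key values.
            key2 = ".".join(str(row[pk]) for pk in primary_keys[relation])
            # Third level keys: one per attribute, each with a single leaf.
            second_level[key2] = dict(row)
        relations[relation] = second_level
    return (db_name, relations)


# Rows of Table 1 (None stands for the empty Prod_end of Clio).
tables = {
    "Brands": [
        {"id": 1, "name": "Ford", "country": "USA", "founded": 1903},
        {"id": 2, "name": "BMW", "country": "Germany", "founded": 1916},
        {"id": 3, "name": "Renault", "country": "France", "founded": 1899},
    ],
    "Models": [
        {"id": 1, "name": "Clio", "prod_begin": 1990, "prod_end": None, "brand_id": 3},
        {"id": 2, "name": "Corcel", "prod_begin": 1968, "prod_end": 1986, "brand_id": 1},
        {"id": 3, "name": "E65", "prod_begin": 2002, "prod_end": 2008, "brand_id": 2},
    ],
}
primary_keys = {"Brands": ["id"], "Models": ["id"]}

n_root, first_level = to_canonical("Cars", tables, primary_keys)
print(n_root, sorted(first_level))
print(first_level["Brands"]["1"])
```

The fixed depth of the result (three key levels plus a leaf) mirrors the structural constraint stated at the beginning of this section.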
3.2 Mapping strategies
The core of the SQLtoKeyNoSQL approach comprises mapping strategies to translate
relational schemas to corresponding key-based NoSQL schemas as well as basic DDL
and DML SQL instructions to compatible REST API access methods. We detail both
types of mappings in the following.
2 https://www.w3.org/XML/.
3 https://www.w3.org/DOM/.
Fig. 1 Cars RDB schema (a) and the corresponding schema in the canonical model (b)
3.2.1 Schema mapping
SQLtoKeyNoSQL supports mappings from the relational model to our canonical
model, and from the canonical model to the target NoSQL data model. Since the
former is described by the Definitions 11–14, we detail the latter in this section.
As stated before, the canonical model generalizes all key-based NoSQL data mod-
els, which simplifies the mapping of canonical schemas to each one of these models.
The proposed mappings are stated by the rules that are defined for each one of the
three NoSQL data models. Next, we present these rules, beginning with the preliminary
definition of the Node Name function.
Definition 15 (Node Name Function) Given a canonical schema C, the function
name(n) returns the property value of a node n that belongs to the tree structure
of C.
Rule 01 (Key-Value Mapping) The mapping of a canonical schema $C$ to a key-value
NoSQL schema proceeds as follows: (i) $C$ maps to a key-value DB $db$ named
$name(n_{root})$; (ii) each first level key node $key_1 \in n_{root}.K_{L1}$ and, in turn, each second
level key node $key_2 \in key_1.K_{L2_r}$ $(1 \le r \le q)$ generate a key $k \in db$ whose name is the
concatenation of $name(key_1)$ and $name(key_2)$; and (iii) the value of each generated
key $k \in db$ is a set of key-value pairs $KV_{db}$, with each pair $kv \in KV_{db}$ defined as $kv
= (name(key_{3_l}), \nu_l)$ where $key_{3_l} \in key_2.K_{L3_s}$ $(1 \le s \le p)$, i.e., $kv$ has $name(key_{3_l})$
as key and $name(child(key_{3_l}))$ as value $\nu_l$.
Figure 2 shows an example of the application of Rule 01 considering the canonical
schema from Fig. 1b. $n_{root}$ in the canonical schema is mapped to Cars
(database name). First level keys (table names) are concatenated with their corresponding
second level keys (primary key values) to build the keys of the key-value
schema, e.g., Brands.1 and Models.2. Finally, third level keys and their values
at the leaf nodes build the value of the key, generating, for instance, (Brands.1,
{id: 1; name: Ford; country: USA; founded: 1903}).
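Rule 01 can be read as a flattening of the first two key levels. The sketch below assumes the canonical schema is held as nested Python dictionaries (an encoding chosen for this illustration) and uses the "." separator that appears in keys such as Brands.1; the function name `map_to_key_value` is hypothetical.

```python
# A fragment of the canonical schema of Fig. 1b, encoded as nested dicts:
# (root name, {first level key: {second level key: {third level key: leaf}}}).
canonical = ("Cars", {
    "Brands": {
        "1": {"id": 1, "name": "Ford", "country": "USA", "founded": 1903},
        "2": {"id": 2, "name": "BMW", "country": "Germany", "founded": 1916},
    },
    "Models": {
        "2": {"id": 2, "name": "Corcel", "prod_begin": 1968,
              "prod_end": 1986, "brand_id": 1},
    },
})


def map_to_key_value(canonical):
    """Rule 01: (i) the root names the database; (ii) each pair of first and
    second level keys yields one key; (iii) its value is the set of
    (third level key, leaf) pairs."""
    db_name, relations = canonical
    pairs = {}
    for key1, second_level in relations.items():
        for key2, attributes in second_level.items():
            pairs[f"{key1}.{key2}"] = dict(attributes)
    return (db_name, pairs)


db_name, kv_pairs = map_to_key_value(canonical)
print(db_name, sorted(kv_pairs))
print(kv_pairs["Brands.1"])
```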
Fig. 2 A key-value schema generated by the mapping of the canonical schema from Fig. 1b
Rule 02 (Document-Oriented Mapping) The mapping of a canonical schema $C$ to a
document-oriented NoSQL schema proceeds as follows: (i) $C$ maps to a document-oriented
DB $D$ named $name(n_{root})$; (ii) each first level key node $key_1 \in n_{root}.K_{L1}$
generates a document collection $DC \in D$ whose access key is $name(key_1)$; (iii) each
second level key node $key_2 \in key_1.K_{L2_r}$ $(1 \le r \le q)$ generates a document $d \in DC$
whose key is $name(key_2)$; and (iv) each third level key node $key_3 \in key_2.K_{L3_s}$ $(1 \le
s \le p)$ generates an attribute $a_i \in d$ whose key is $name(key_3)$ and whose value is
$name(child(key_3))$.
Figure 3 presents the document-oriented DB built from Rule 02 over RDB Cars
(Fig. 1a). The node $n_{root}$ is mapped to the document DB Cars. Each first level key
(Brands and Models) is mapped to a document collection with the same name. Each
second level key is mapped to a document in Cars DB. For example, Brands collection
is composed of three documents with the keys 1, 2 and 3. The third level keys and their
child nodes are mapped to document attributes and values. For example, document 1
from Brands is composed of the following attributes: {id: 1; name: Ford; country:
USA; founded: 1903}.
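A possible reading of Rule 02 in code is shown below. The dataclasses `Document`, `Collection` and `DocumentDB` are hypothetical types introduced only to mirror Definitions 3–5; they do not correspond to the API of any particular document store, and the nested-dictionary encoding of the canonical schema is likewise an assumption.

```python
from dataclasses import dataclass, field


@dataclass
class Document:
    key: str            # name(key2): concatenated primary key values
    attributes: dict    # name(key3) -> leaf value


@dataclass
class Collection:
    key: str            # name(key1): relation name
    documents: list = field(default_factory=list)


@dataclass
class DocumentDB:
    name: str           # name(n_root): database name
    collections: list = field(default_factory=list)


def map_to_document(canonical):
    """Rule 02: root -> database, first level key -> collection,
    second level key -> document, third level key -> attribute."""
    db_name, relations = canonical
    db = DocumentDB(db_name)
    for key1, second_level in relations.items():
        collection = Collection(key1)
        for key2, attributes in second_level.items():
            collection.documents.append(Document(key2, dict(attributes)))
        db.collections.append(collection)
    return db


canonical = ("Cars", {
    "Brands": {
        "1": {"id": 1, "name": "Ford", "country": "USA", "founded": 1903},
        "2": {"id": 2, "name": "BMW", "country": "Germany", "founded": 1916},
        "3": {"id": 3, "name": "Renault", "country": "France", "founded": 1899},
    },
})

cars = map_to_document(canonical)
brands = cars.collections[0]
print(brands.key, [d.key for d in brands.documents])
print(brands.documents[0].attributes)
```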
Rule 03 (Column-Oriented Mapping) The mapping of a canonical schema $C$ to a
column-oriented NoSQL schema proceeds as follows: (i) $C$ is mapped to a keyspace
$K$ named $name(n_{root})$; (ii) each first level key node $key_1 \in n_{root}.K_{L1}$ generates a
column family $f \in K$ whose key is $name(key_1)$; (iii) each second level key node
$key_2 \in key_1.K_{L2_r}$ $(1 \le r \le q)$ generates an access key $ch \in f$ whose name is
$name(key_2)$; and (iv) each third level key node $key_3 \in key_2.K_{L3_s}$ $(1 \le s \le p)$
generates a column $c \in f$, indexed by $ch$, whose name is $name(key_3)$ and whose
value is $name(child(key_3))$.
Figure 4 exemplifies the application of Rule 03. The root node of the canonical
schema $n_{root}$ is mapped to a keyspace Cars. First level keys are mapped to column
families, e.g., Brands and Models. Each second level key is mapped to a row key,
like row key 1 in the column family Brands. Each third level key and its child node
are mapped to a column and its respective value, e.g., {id: 1; name: Ford; country:
USA; founded: 1903} for row key 1 of column family Brands.
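Rule 03 can be sketched in the same spirit. Here each column is stored under a (row key, column name) pair to make the phrase "indexed by the access key" explicit; this cell-level encoding, the nested-dictionary form of the canonical schema and the function name `map_to_column_oriented` are assumptions of this illustration rather than the layout of any specific column store.

```python
def map_to_column_oriented(canonical):
    """Rule 03: root -> keyspace, first level key -> column family,
    second level key -> row (access) key, third level key -> column."""
    keyspace_name, relations = canonical
    families = {}
    for key1, second_level in relations.items():
        cells = {}
        for key2, attributes in second_level.items():
            for key3, value in attributes.items():
                # Each column is indexed by the row key generated from key2.
                cells[(key2, key3)] = value
        families[key1] = cells
    return (keyspace_name, families)


canonical = ("Cars", {
    "Brands": {
        "1": {"id": 1, "name": "Ford", "country": "USA", "founded": 1903},
    },
    "Models": {
        "2": {"id": 2, "name": "Corcel", "prod_begin": 1968,
              "prod_end": 1986, "brand_id": 1},
    },
})

keyspace, families = map_to_column_oriented(canonical)
print(keyspace, sorted(families))
print(families["Brands"][("1", "country")])
```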
Fig. 3 A document-oriented
schema generated by the
mapping of the canonical
schema of Fig. 1b
Rules 01–03 are considered for organizing and storing data in NoSQL DB. Notice
that the canonical data model acts as a simple intermediary schema abstraction. This
abstraction is used as a standard to store and access data in different key-based NoSQL
DB through a single data model.
SQLtoKeyNoSQL also provides the mapping of SQL instructions in order to support
SQL-to-NoSQL interoperability. In the next section, we describe the mapping of the
main SQL DDL and DML instructions.
3.2.2 SQL instruction mapping
Our layer is also able to map a subset of SQL DDL and DML instructions to corre-
sponding REST API access methods. These mappings are supported by a dictionary
(see Sect. 3.4) that maintains relevant metadata. In fact, the execution of SQL DDL
instructions triggers updates in the dictionary, and during the processing of SQL DML
instructions the dictionary is queried to generate suitable REST API methods to be
executed at the target NoSQL DB. The considered SQL DDL instructions, as well as
their capabilities, are the following:
– CREATE TABLE: it creates a table definition in the dictionary. Only the table
name, attributes, primary and foreign key constraints are considered.
– ALTER TABLE: it can add, rename or remove attributes. If an attribute is removed,
its definition is removed from the dictionary, and all third level keys in the canonical
schema that correspond to it are also removed. This action also propagates to the
NoSQL DB, i.e., all corresponding attributes and their data are also removed. The
same reasoning applies in case of an attribute renaming, i.e., if an attribute name is
modified, this operation is accomplished in the dictionary and in all corresponding
data stored in the NoSQL DB. If a new attribute is added, it is just created in the
dictionary. Primary key changes are not allowed because the primary key is the
basis for data access in NoSQL DB.
– DROP TABLE: it removes the table definition from the dictionary, as well as the
corresponding first level key in the canonical schema and the corresponding data in the
NoSQL DB. Tables can be removed only if other tables do not reference them.
Fig. 4 A column-oriented schema generated by the mapping of the canonical schema of Fig. 1b
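The dictionary side of these DDL instructions can be sketched as follows. The class `Dictionary`, its method names and its internal layout are hypothetical; the sketch covers only the metadata updates and the two restrictions stated above (no primary key changes, no removal of referenced tables), and leaves out the propagation to the canonical schema and to the NoSQL DB.

```python
class Dictionary:
    """Metadata kept by the layer: one entry per mapped relation."""

    def __init__(self):
        self.tables = {}

    def create_table(self, name, attributes, primary_key, foreign_keys=None):
        # Only name, attributes, primary key and foreign keys are recorded.
        self.tables[name] = {
            "attributes": list(attributes),
            "primary_key": list(primary_key),
            # Assumed layout: attribute name -> referenced table name.
            "foreign_keys": dict(foreign_keys or {}),
        }

    def add_attribute(self, table, attribute):
        # A new attribute is just created in the dictionary.
        self.tables[table]["attributes"].append(attribute)

    def drop_attribute(self, table, attribute):
        meta = self.tables[table]
        if attribute in meta["primary_key"]:
            raise ValueError("primary key changes are not allowed")
        meta["attributes"].remove(attribute)
        meta["foreign_keys"].pop(attribute, None)

    def drop_table(self, name):
        # A table can be removed only if no other table references it.
        for other, meta in self.tables.items():
            if other != name and name in meta["foreign_keys"].values():
                raise ValueError(f"{name} is referenced by {other}")
        del self.tables[name]


dictionary = Dictionary()
dictionary.create_table("Brands", ["id", "name", "country", "founded"], ["id"])
dictionary.create_table(
    "Models", ["id", "name", "prod_begin", "prod_end", "brand_id"], ["id"],
    foreign_keys={"brand_id": "Brands"},
)
try:
    dictionary.drop_table("Brands")
except ValueError as error:
    print(error)
dictionary.drop_table("Models")
dictionary.drop_table("Brands")
```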
The mapping of SQL DML instructions generates one or more of the following
primitive REST API methods: put (stores a value based on a key), get (retrieves a
value based on a key) and delete (deletes a value based on a key). The following
definitions show how NoSQL DB are accessed by our approach.
Definition 16 (Canonical Key) A canonical key is a key obtained by concatenating a
first level key $key_1 \in n_{root}.K_{N1_m}$ $(1 \le m \le o)$ with one of its child nodes $key_2 \in
key_1.K_{N2_r}$ $(1 \le r \le p)$.
As an example of Definition 16, the canonical key Brands.1 is the concatenation
of the first level key Brands with its child 1.
Definition 17 (Record) A record is a set of key-value pairs, each obtained from
a third level key $key_3$ and its leaf node under a given second level key $key_2$;
together, these pairs represent the data associated with a given canonical key.
The concept of a record is similar to a tuple in the relational model. For example,
{id: 1; name: Ford; country: USA; founded: 1903} corresponds to a record
in our canonical model, while (1, Ford, USA, 1903) is a tuple in the corresponding
relational model. Given these preliminary definitions, we now define the considered
primitive REST API methods.
Definition 18 (Put) The primitive method $put(k_{put}, \nu)$ stores a record $\nu$ corresponding
to a canonical key $k_{put}$.
For example, to store the first tuple of table Brands (Fig. 1a), we issue
put(Brands.1, {id: 1; name: Ford; country: USA; founded: 1903}).
Definition 19 (Get) The primitive method $\nu = get(k_{get})$ returns a record $\nu$ from a
target NoSQL DB corresponding to a canonical key $k_{get}$.
The get method searches for a given canonical key in a NoSQL DB and retrieves
its corresponding value (if it exists). For example, get(Brands.1) returns the record
{id: 1; name: Ford; country: USA; founded: 1903}.
Definition 20 (Delete) The primitive method $delete(k_{del})$ removes a record identified
by the canonical key $k_{del}$.
The delete method removes a record given a canonical key as the input parameter.
For example, delete(Brands.1) removes Brands.1 from the Cars DB.
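The three primitives and the canonical key of Definition 16 can be exercised against a stand-in store. The class `InMemoryStore` and the helper `canonical_key` below are hypothetical and backed by a plain dictionary; a real deployment would issue the equivalent requests to the target NoSQL DB.

```python
def canonical_key(relation, primary_key_values):
    """Definition 16: first level key concatenated with a second level key."""
    return ".".join([relation, *(str(v) for v in primary_key_values)])


class InMemoryStore:
    """Stand-in for a key-based NoSQL DB exposing put, get and delete."""

    def __init__(self):
        self._records = {}

    def put(self, key, record):
        # Definition 18: store a record under a canonical key.
        self._records[key] = dict(record)

    def get(self, key):
        # Definition 19: return the record, or None if the key is unknown.
        return self._records.get(key)

    def delete(self, key):
        # Definition 20: remove the record identified by the key.
        self._records.pop(key, None)


store = InMemoryStore()
key = canonical_key("Brands", [1])
store.put(key, {"id": 1, "name": "Ford", "country": "USA", "founded": 1903})
print(key, store.get(key))
store.delete(key)
print(store.get(key))
```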
From the get primitive method, we propose the getN method that retrieves a set of
records given a set of canonical keys. We use this method to optimize data retrieval
by considering NoSQL DB that return blocks of records.
Definition 21 (GetN) A method $R = getN(CK_{all}, FL)$ returns a set of records $R$
based on a set of canonical keys $CK_{all}$ and a stack of filters $FL$.
The getN method returns a set of records that are stored in a NoSQL DB. The
retrieved records must match the given canonical keys and (possibly empty) filters.
For example, getN({Brands.1, Brands.2}, null) returns the records {id: 1; name:
Ford; country: USA; founded: 1903} and {id: 2; name: BMW; country: Germany;
founded: 1916}.
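A minimal sketch of getN is given below. Several choices are assumptions made for the illustration: the store is a plain dictionary, filters are Python predicates that must all hold, and the function is named `get_n`. The loop only simulates block retrieval; against a real NoSQL DB the point of the method is that the whole set of keys is served in one access instead of one access per row.

```python
store = {
    "Brands.1": {"id": 1, "name": "Ford", "country": "USA", "founded": 1903},
    "Brands.2": {"id": 2, "name": "BMW", "country": "Germany", "founded": 1916},
    "Brands.3": {"id": 3, "name": "Renault", "country": "France", "founded": 1899},
}


def get_n(store, canonical_keys, filters=None):
    """Return the records stored under the given canonical keys that
    satisfy every filter (no filters means every found record matches)."""
    records = []
    for key in canonical_keys:
        record = store.get(key)
        if record is None:
            continue  # unknown keys are skipped
        if all(predicate(record) for predicate in (filters or [])):
            records.append(record)
    return records


# Same call as the example in the text: no filters.
print(get_n(store, ["Brands.1", "Brands.2"], None))


def founded_before_1910(record):
    return record["founded"] < 1910


matches = get_n(store, ["Brands.1", "Brands.2", "Brands.3"], [founded_before_1910])
print([record["name"] for record in matches])
```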
Based on Definitions 18–21, we now detail the mapping of the basic SQL DML
instructions to NoSQL DB access methods as follows:
– INSERT: it is translated to a set of put methods based on the canonical schema
and the dictionary (see Sect. 3.4). Nested queries are not supported. The values are
stored based on the given input attributes, and primary key values are required.
– UPDATE: it is translated to getN and put methods. One or more tuples can be
updated, simple filters over attributes (predicates linked by AND or OR logical
connectors) can be used, but nested queries are not supported.
– DELETE: it is translated to getN and delete methods. As with the
UPDATE instruction, one or more tuples may be deleted based on simple filters
without nested queries.
– SELECT: it is translated to a set of getN methods. Projections, selections, and joins
are allowed, but nested queries, aggregations and ordering are not implemented
yet.
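The translation of INSERT, UPDATE and DELETE into primitives can be sketched as below. Everything here is illustrative: the store is a module-level dictionary, the helper `keys_of` finds the canonical keys of a table by prefix (how the layer actually enumerates them is not described in this section), and predicates are plain Python functions standing in for parsed WHERE clauses.

```python
store = {}  # canonical key -> record (stand-in for the target NoSQL DB)


def put(key, record):
    store[key] = dict(record)


def delete(key):
    store.pop(key, None)


def get_n(keys, predicate=None):
    """Return (key, record) pairs for the keys whose record passes the filter."""
    return [(k, store[k]) for k in keys
            if k in store and (predicate is None or predicate(store[k]))]


def keys_of(table):
    # Assumption: canonical keys of a table share the "<table>." prefix.
    return [k for k in store if k.startswith(table + ".")]


def sql_insert(table, primary_key, record):
    # INSERT -> put; the primary key value is required to build the key.
    put(f"{table}.{record[primary_key]}", record)


def sql_update(table, assignments, predicate):
    # UPDATE -> getN followed by one put per matching record.
    for key, record in get_n(keys_of(table), predicate):
        put(key, {**record, **assignments})


def sql_delete(table, predicate):
    # DELETE -> getN followed by one delete per matching record.
    for key, _record in get_n(keys_of(table), predicate):
        delete(key)


sql_insert("Brands", "id", {"id": 1, "name": "Ford", "country": "USA", "founded": 1903})
sql_insert("Brands", "id", {"id": 2, "name": "BMW", "country": "Germany", "founded": 1916})
sql_insert("Brands", "id", {"id": 3, "name": "Renault", "country": "France", "founded": 1899})

# UPDATE Brands SET country = 'United States' WHERE name = 'Ford'
sql_update("Brands", {"country": "United States"}, lambda r: r["name"] == "Ford")
# DELETE FROM Brands WHERE founded < 1900
sql_delete("Brands", lambda r: r["founded"] < 1900)

print(sorted(store))
print(store["Brands.1"]["country"])
```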
In the following, we present the layer architecture that manages all of these map-
pings as well as the dictionary.
3.3 Architecture
The architecture of the SQLtoKeyNoSQL layer is composed of seven modules, as
illustrated in Fig. 5. In the current version, each module is implemented in
Java 8.
The first module is the Access Interface, which receives SQL instructions from
a relational-based application (Relational App) or an Ad-Hoc query and sends them
to the SQL Parser module. It also receives results from the Execution Engine
module (result sets and/or messages) and, in turn, forwards them to the external
components.
The SQL Parser module receives an SQL instruction and performs syntactic
and semantic analysis with the support of the Dictionary module. If the instruction is
a SELECT, DELETE, or UPDATE, it is forwarded to the Query Planner module.
Otherwise, it is sent directly to the Translator module. The Query Planner
defines a plan that optimizes query execution. Currently, this module can optimize
queries that use filters connected by the AND operator. The optimization guarantees
that filters over specific tables are executed with high priority to reduce the data volume
to be processed by further (and more expensive) join operations. The output of this
module is a query tree that is translated by the Translator module and further executed
by the Execution Engine module.
Fig. 5 SQLtoKeyNoSQL layer architecture
Fig. 6 An input SQL query (a) and the set of primitive methods generated as output (b) by the Translator
module
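The filter prioritization performed by the Query Planner can be pictured with the following minimal Python sketch. It is only an illustration under assumptions: the function name plan_filters and the representation of a predicate as a (tables, text) pair are hypothetical and do not come from the SQLtoKeyNoSQL code base.

```python
def plan_filters(predicates):
    """Split AND-connected predicates into per-table filters and join conditions.

    Each predicate is a (tables, text) pair, where tables is the set of table
    names that the predicate mentions (hypothetical representation).
    """
    table_filters = {}    # table name -> filters to apply when fetching it
    join_conditions = []  # predicates spanning more than one table
    for tables, text in predicates:
        if len(tables) == 1:
            (table,) = tuple(tables)
            table_filters.setdefault(table, []).append(text)
        else:
            join_conditions.append(text)
    return table_filters, join_conditions
```

Predicates that mention a single table are attached to the fetch of that table, so they run before any join; only the remaining conditions are left for the (more expensive) join step.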
The Translator module receives an SQL instruction or a query plan as input and
translates it as described in Sect. 3.2. The output is a stack of primitive or extended
access methods, which is then sent to the Execution Engine module.
For example, Fig. 6a shows an input SQL query and Fig. 6b shows the respective
output generated by the Translator module. The stack of output methods first filters
the data of each pair of tables (getN methods) before joining them.
The Execution Engine module handles the execution of methods with the support
of the Communication module and metadata stored in the Dictionary. It is responsible
for processing filters over returned data, sending (and receiving) data sets to (from)
the Join Processing module, and generating the result set or messages to be sent to
the Access Interface module. In the stack of methods of Fig. 6b, for example, the
Execution Engine first executes the getN operation for table1, then getN for table2. The
two result sets R1 and R2 are sent to the Join Processing module with the join condition.
After receiving the join result (Rj), it executes getN for table3, which produces
the result set R3. Finally, Rj and R3 are sent to the Join Processing module with the
next join condition, and the final result set Rf is sent to the Access Interface module.
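The walk-through above can be mirrored by a small Python sketch. It is a simplified illustration, not the actual Execution Engine: the method encoding, the sample tables, and the nested-loop join used as a placeholder for the Join Processing module are all assumptions.

```python
# Hypothetical sample data standing in for three tables stored in a NoSQL DB.
TABLES = {
    "table1": [{"a": 1, "x": "p"}, {"a": 2, "x": "q"}],
    "table2": [{"a": 1, "b": 7}, {"a": 2, "b": 8}],
    "table3": [{"b": 7, "z": "ok"}],
}


def get_n(table, predicate):
    """Fetch the records of a table and filter them."""
    return [row for row in TABLES[table] if predicate(row)]


def join(left, right, attr):
    """Placeholder equi-join (nested loops) on a shared attribute."""
    return [{**l, **r} for l in left for r in right if l[attr] == r[attr]]


def execute(methods):
    """Run the methods in order, keeping intermediate result sets on a stack."""
    results = []
    for op, arg in methods:
        if op == "getN":
            table, predicate = arg
            results.append(get_n(table, predicate))
        elif op == "join":
            right = results.pop()
            left = results.pop()
            results.append(join(left, right, arg))
        else:
            raise ValueError(f"unknown method: {op}")
    return results.pop()


def always(row):
    return True


# getN(table1), getN(table2), join -> Rj; getN(table3), join -> Rf
METHODS = [
    ("getN", ("table1", always)),
    ("getN", ("table2", always)),
    ("join", "a"),
    ("getN", ("table3", always)),
    ("join", "b"),
]
```

Calling execute(METHODS) follows the same order as the text: R1 and R2 are joined into Rj, R3 is fetched, and Rj is joined with R3 to produce Rf.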
The Join Processing module executes joins between data sets under the control of
the Execution Engine module. Each getN operation returns a set of records. These sets
of records and the join condition are passed to the Join Processing module. The Join
Processing module, in turn, executes a join algorithm that combines the records based
on the join condition. Finally, the resulting join records are returned to the Execution
Engine. Multiple joins are supported and are executed in a left to right order.
The current version of SQLtoKeyNoSQL implements a join operation in two fla-
vors: Merge-Join (for data that do not fit in the main memory), and Hash-Join (for
data that fit in the main memory). Our implementations are based on classical join
algorithms [17].
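For readers unfamiliar with the two classical algorithms, the following Python sketch shows a textbook hash join and a textbook sort-merge join over lists of records. It is an illustration only and not the SQLtoKeyNoSQL code; in particular, the merge join below sorts in memory for brevity, whereas a version meant for data that do not fit in main memory would rely on external sorting.

```python
from collections import defaultdict


def hash_join(left, right, attr):
    """Build a hash table on the left input and probe it with the right input."""
    index = defaultdict(list)
    for l in left:
        index[l[attr]].append(l)
    return [{**l, **r} for r in right for l in index.get(r[attr], [])]


def merge_join(left, right, attr):
    """Sort both inputs on the join attribute and merge them in one pass."""
    left = sorted(left, key=lambda row: row[attr])
    right = sorted(right, key=lambda row: row[attr])
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][attr], right[j][attr]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Locate the run of right records that share this key.
            j_end = j
            while j_end < len(right) and right[j_end][attr] == lk:
                j_end += 1
            # Combine every left record with that key with the whole run.
            while i < len(left) and left[i][attr] == lk:
                out.extend({**left[i], **r} for r in right[j:j_end])
                i += 1
            j = j_end
    return out
```

Both functions compute the same equi-join; they differ only in memory behavior, which is the criterion the text gives for choosing between them.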
Finally, the Communication module executes requested access methods over one or
more NoSQL DB. It is composed of connectors (wrappers) that translate the getN, put
and delete methods to the specific signature of the target NoSQL DB access methods.
Such translations usually perform minor syntactic adjustments in the method
parameters. If a NoSQL database does not support the retrieval of a set of data at a time, the
getN method is first mapped to a set of get methods. Under the hood, each NoSQL
target is represented by a Java connector class that implements a uniform
interface called Connector. The Connector interface provides a method for each
primitive/extended operation (get, put, delete, etc.), and each NoSQL target implements
the methods of the Connector interface using its specific API. The Communication module
accesses data through the Connector interface, relying on polymorphism to reach the specific
connector of each NoSQL target. Data returned from the NoSQL DB are sent back
to the Execution Engine through the Buffer component of the SQLtoKeyNoSQL layer.
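The connector design described above can be sketched as follows. The sketch is written in Python for brevity, whereas the layer itself is implemented in Java; the method signatures, the two in-memory connector classes, and the request counter are assumptions introduced only to make the fallback from getN to a set of get calls visible.

```python
from abc import ABC, abstractmethod


class Connector(ABC):
    """Uniform interface that every NoSQL target has to implement."""

    @abstractmethod
    def get(self, table, key): ...

    @abstractmethod
    def put(self, table, key, value): ...

    @abstractmethod
    def delete(self, table, key): ...

    def get_n(self, table, keys):
        # Fallback for targets that cannot return a set of data at a time:
        # getN is mapped to one get per key.
        return [self.get(table, key) for key in keys]


class DictConnector(Connector):
    """Hypothetical target without bulk retrieval, backed by a dict."""

    def __init__(self):
        self.data = {}
        self.requests = 0  # number of read requests sent to the "database"

    def get(self, table, key):
        self.requests += 1
        return self.data[(table, key)]

    def put(self, table, key, value):
        self.data[(table, key)] = value

    def delete(self, table, key):
        self.data.pop((table, key), None)


class BulkDictConnector(DictConnector):
    """Hypothetical target whose API can return many records per request."""

    def get_n(self, table, keys):
        self.requests += 1
        return [self.data[(table, key)] for key in keys]
```

Code that holds a Connector reference calls get_n without knowing which target is behind it; the two classes return the same records but issue a different number of requests.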
Another important component of the SQLtoKeyNoSQL architecture is the
Dictionary. It holds the metadata necessary to perform all the mappings and is detailed in
the next section.
3.4 Dictionary
The Dictionary maintains metadata for each considered RDB schema (attributes, pri-
mary keys, foreign keys, among others), as well as information about the target NoSQL
DB where the data are stored. The dictionary is defined as follows.
Definition 22 (Dictionary) A dictionary D is a tuple D = (T, N), where T is a set
of RDB table metadata and N is a set of target NoSQL DB.
Definition 23 (Table Metadata) A table metadata t ∈ T is a tuple t = (name, ATT,
PK, FK, KEYS, db), where name is the table name, ATT is the set of attribute
names of the table, PK is the primary key of the table, FK = {(att_1, tname_1), ...,
(att_n, tname_n)} is a (possibly empty) set of foreign keys of the table, where each pair
(att_i, tname_i) ∈ FK holds the attribute name of a foreign key and the name of the
referenced table, KEYS is the set of keys (third-level key node names in the
canonical schema) of the table, and db is the alias of the target NoSQL DB.
Definition 24 (Target NoSQL) A target NoSQL bd ∈ N is a tuple bd =
(alias, user, psw, url), where alias is a unique name for the NoSQL DB, user and
psw are the user and password to connect to the NoSQL DB, respectively, and url is the
address of the NoSQL DB.
Notice that the canonical model maps the relational data in a hierarchical model
using a tree view of the database (tree root), table, tuple identifier (the concatenation
of the primary key values of each tuple), columns and values (tree leaves). The rela-
tionships between tables or tuples are not explicitly defined in the canonical model.
They are maintained only in the dictionary.
Figure 7 presents an example of a SQLtoKeyNoSQL dictionary corresponding to the
database from Fig. 1a. It shows, for example, that the metadata of the Models table is
(Models, ATT: {id, name, prod_begin, prod_end, brand_id}, PK: id, KEYS: {1, 2, 3},
FK: {(brand_id, Brands)}, DB: DB2). Notice also, from Fig. 7, that table Brands is
stored in NoSQL DB DB1 and table Models in NoSQL DB DB2. This means that an RDB
may be distributed among several NoSQL DB.
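Definitions 22 to 24 map naturally onto simple record types. The Python sketch below is one possible encoding, given as an assumption rather than as the actual dictionary implementation; the class and method names are hypothetical, and only the Models metadata quoted above is taken from the text (connection details are left as placeholders).

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class TargetNoSQL:
    """Definition 24: (alias, user, psw, url)."""
    alias: str
    user: str
    psw: str
    url: str


@dataclass
class TableMetadata:
    """Definition 23: (name, ATT, PK, FK, KEYS, db)."""
    name: str
    att: set
    pk: tuple
    fk: set    # pairs (foreign key attribute, referenced table name)
    keys: set  # tuple identifiers stored for this table
    db: str    # alias of the target NoSQL DB


@dataclass
class Dictionary:
    """Definition 22: D = (T, N)."""
    tables: dict = field(default_factory=dict)   # T, indexed by table name
    targets: dict = field(default_factory=dict)  # N, indexed by alias

    def add_target(self, target):
        self.targets[target.alias] = target

    def add_table(self, table):
        if table.db not in self.targets:
            raise KeyError(f"unknown target NoSQL DB: {table.db}")
        self.tables[table.name] = table

    def target_of(self, table_name):
        """Resolve the NoSQL DB where a table is stored."""
        return self.targets[self.tables[table_name].db]


def example_dictionary():
    d = Dictionary()
    # Placeholder credentials and address; the text gives only the alias.
    d.add_target(TargetNoSQL("DB2", "user", "secret", "db2.example"))
    d.add_table(TableMetadata(
        name="Models",
        att={"id", "name", "prod_begin", "prod_end", "brand_id"},
        pk=("id",),
        fk={("brand_id", "Brands")},
        keys={1, 2, 3},
        db="DB2",
    ))
    return d
```

Because each table carries only the alias of its target, tables of the same RDB can point to different NoSQL DB, which is how the distribution mentioned above is represented.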
Fig. 7 Example of the SQLtoKeyNoSQL dictionary schema
4 Related work
Many works deal with the mapping of RDB or SQL instructions to NoSQL
DB. These works follow several approaches. Some of them build a unified layer over
different NoSQL DB by introducing SQL-like languages [25] or by using other query
languages to access data [4,5,23]. There are also works that present migration techniques
to move a relational-based application to NoSQL DB, helping users to rewrite their
SQL instructions as NoSQL API methods [12–14,20].
Different from the approaches mentioned above, our solution fits into the related work
that offers a way to query/update NoSQL DB using traditional SQL instructions.
As stated in Sect. 1, the related work to which our solution belongs falls into two
categories: layer and storage engine. The first one comprises approaches based on
a software layer that provides a schema and operation abstraction over NoSQL DB,
allowing users to define and manipulate relational data through SQL instructions.
The second one comprises approaches that modify the kernel of an RDBMS to store
relational data in a NoSQL DB.
Due to limited horizontal space, we present the features of the related approaches
in two tables. Table 2 focuses on (i) the approach category; (ii) the target NoSQL
DB; and (iii) the data model of the target NoSQL DB. Table 3 focuses on (i) the
supported SQL instructions; (ii) dictionary support; and (iii) the type of supported join (if
considered).
According to Table 2, five approaches of the category Layer are found.
SimpleSQL [9] is a layer that supports the mapping of a subset of SQL DDL and DML
instructions and stores data in SimpleDB, a document-oriented NoSQL DB. JackHare
[8] is also a relational layer, but different from SimpleSQL, it provides mappings
to HBase, a column-oriented NoSQL DB. Another work that considers a target
document-oriented DB is Unity [15], but its SQL support is limited to a subset of
DML instructions. The approach of Rith et al. [19] is capable of accessing data stored in
Cassandra and MongoDB. The last layer approach is Apache Phoenix [2], which accesses and
stores data in HBase.
Table 2 Related work comparison (Part 1)
Approach               Category        NoSQL DB           Data model
SimpleSQL (2013)       Layer           SimpleDB           Document
JackHare (2013)        Layer           HBase              Column
Unity (2014)           Layer           Cassandra/MongoDB  Column/document
Rith et al. [19]       Layer           Cassandra/MongoDB  Column/document
Apache Phoenix (2014)  Layer           HBase              Column
Phoenix (2011)         Storage engine  Scalaris           Key-value
CloudyStore (2009)     Storage engine  Cloudy             Column
DQE (2013)             Storage engine  HBase              Column
SQLtoKeyNoSQL          Layer           Key-oriented       Key-oriented
Table 3 Related work comparison (Part 2)
Approach               SQL support       Dictionary  Join
SimpleSQL (2013)       DDL + DML subset  Yes         By similarity
JackHare (2013)        DDL + DML subset  Yes         Map-reduce
Unity (2014)           DML subset        Yes         Hash-Join
Rith et al. [19]       DML subset        –           –
Apache Phoenix (2014)  DML + DML         Yes         Hash-Join
Phoenix (2011)         DDL + DML         No          RDBMS-dependent
CloudyStore (2009)     DDL + DML         Yes         RDBMS-dependent
DQE (2013)             DDL + DML         Yes         RDBMS-dependent
SQLtoKeyNoSQL          DDL + DML subset  Yes         Merge-Join/Hash-Join
There are also three approaches of the category Storage Engine. Two of them
(Phoenix [3] and CloudyStore [10]) modify the MySQL RDBMS storage engine to
provide persistence of relational data in NoSQL DB. Like our approach, Phoenix is based
on an intermediary model, called VOEM (Value-based OEM), which is an extension
of OEM (Object Exchange Model) [18], a data model that is more complex than
our canonical model, mainly in terms of number of concepts. Besides, it supports the
mapping of VOEM schemata only to the key-value data model, specifically, to the
Scalaris NoSQL database. Different from Phoenix, CloudyStore does not provide an
intermediary model: it manages the mapping of relational data to the column-oriented
NoSQL DB called Cloudy, and uses the MySQL rowIds to optimize access to
data stored in Cloudy. The last approach is DQE [24], which modifies the kernel
of the Derby RDBMS, including its query optimization module. Similar to JackHare,
DQE stores relational data in the HBase NoSQL database.
Most approaches are limited to mapping to only one NoSQL data model. Only Rith
et al. and Unity support more than one target data model. Rith et al. translate SQL
queries to the query languages of Cassandra and MongoDB using the query properties of
each target DB. Unity supports multiple data sources, but its mappings must be coded
by hand through wrappers. Besides, Unity details mappings only for MongoDB. Our
approach is more flexible than those since it supports any key-oriented
NoSQL DB.
Table 3 shows that approaches of category Storage Engine have full support for
SQL DDL and DML instructions. This is justified by the fact they are extensions of
existing RDBMS that naturally offer SQL-based access. Despite their limited SQL
support, Layer approaches are more flexible, since they are not strongly coupled to a
particular RDBMS.
A challenging task for SQL-to-NoSQL mapping is join operation support, since
NoSQL DB do not have this query capability. Table 3 highlights that Storage Engine
approaches depend on the RDBMS join capabilities for such a task. In the Layer
category, only the work of Rith et al. does not support joins. SimpleSQL applies a
join-by-similarity algorithm to match foreign and primary key values. JackHare uses
map-reduce jobs to take advantage of parallel processing for improving join operation
performance. Unity and Apache Phoenix execute a hash join algorithm that considers
the primary and foreign keys as hash entries. Our approach also supports join operations,
providing more than one join algorithm depending on whether or not the data set fits
into main memory.
The next section presents an evaluation of SQLtoKeyNoSQL through a set of
experiments that compare it with some related work (baseline approaches).
5 Experiments
This section presents a set of experiments conducted to show the effectiveness of the
SQLtoKeyNoSQL approach. We focus our experiments on query operations since they
are the most frequent operations performed on an RDB.
We refer the reader to [22] for experiments that evaluate the overhead introduced
by our approach when considering a data-centric application that directly accesses a
relational database. In that paper, we executed a set of SELECT and INSERT
instructions on three NoSQL databases with and without our layer. The results
revealed that the overhead of our solution is not prohibitive.
To compare the processing time of our approach with two baselines, we executed
two sets of experiments. We first compared SQLtoKeyNoSQL with Unity using
MongoDB as the NoSQL database target. Then, we compared SimpleSQL with
SQLtoKeyNoSQL using Amazon SimpleDB as the NoSQL database target. Unfortunately,
we did not find any available open source Storage Engine approach to consider in
our experiments. We also checked the completeness and correctness of our approach:
we executed a set of queries both directly over an RDB and through SQLtoKeyNoSQL,
and the returned tuples were the same in both cases.
Table 4 Some metadata of the Prova Brasil RDB
Tables            PKs         FKs        #Cols  Original rows  Reduced rows
ts_school         id_school   –          128    79,252         79,252
ts_student_3rdhs  id_student  id_school  98     150,430        200,000
ts_student_5th    id_student  id_school  98     2,720,589      200,000
ts_student_9th    id_student  id_school  98     2,524,126      200,000
5.1 Experiment setup
The experiments were performed on an Intel Core i5-2430M processor with 8
GB of DDR3 1066 MHz RAM and a 240 GB SanDisk SSD, running the Linux 4.5.5-04 kernel
(Xubuntu 16.04 distribution). In the experiments, we used two different NoSQL DB as
targets: MongoDB and SimpleDB. Both are document-oriented
NoSQL DB. We ran the experiments for each baseline in the same environment.
MongoDB ran on localhost as a single node without replicas. SimpleDB ran on
Amazon AWS and was accessed through a REST API. We constrained our experiments to MongoDB
and SimpleDB because of the restrictions of the baselines.
5.2 Experiment methodology
Our experiments considered, as a use case, a real RDB. This RDB, called Prova Brasil
(PBdb), stores data about the academic performance of students at the compulsory level
(elementary and high school) in Brazil. We extracted four tables from PBdb to perform
the experiments: ts_student_3rdhs, which stores the test results of third-year
Brazilian high school students; ts_student_5th and ts_student_9th, which store the
results of fifth- and ninth-year students of basic school, respectively; and ts_school,
which stores data about all the Brazilian public schools.
Table 4 shows some metadata of PBdb. The first column shows the table names.
The second and third columns show the primary key and foreign key attributes. The
column #Cols presents the number of columns of the tables. The next column presents
the original number of rows of each table. Due to network lag and failures, we
could not use the cardinality of the original tables for the experiments with the SimpleSQL
baseline. After some tests, we decided to export only 200,000 rows for each table. Thus,
the last column presents the number of rows considered in our experiments (Reduced
rows).
The main working tables are ts_student_3rdhs, ts_student_5th, and ts_student_9th.
The table ts_school was chosen because it is the largest table (79,252 rows and 128
attributes) that can be joined with the other three tables.
We considered fifteen SQL queries (Q1 to Q15) in our experiments, as shown in
Tables 5 and 6. From query to query, we varied the number of projected columns and the
number of filters. The queries were defined based on the SQL support provided by our
approach and the baselines. In short, we avoided aggregations and nested queries. Three
of the queries perform join operations (Q13, Q14 and Q15).
Table 5 Queries considered in the experiments (Part 1)
Query  SQL
Q1     SELECT id_uf, id_city, id_area, id_shift, id_grade, tx_resp_q001
       FROM ts_students_3rdhs;
Q2     SELECT id_uf, id_city, id_area, id_shift, id_grade, tx_resp_q001
       FROM ts_students_3rdhs
       WHERE id_city = 6236282;
Q3     SELECT id_uf, id_city, id_area, id_shift, id_grade, tx_resp_q001
       FROM ts_students_3rdhs
       WHERE id_city = 6236282 AND id_shift = 2;
Q4     SELECT * FROM ts_students_3rdhs;
Q5     SELECT id_uf, id_city, id_area, id_shift, id_grade
       FROM ts_students_5th;
Q6     SELECT tx_resp_q001, tx_resp_q002, tx_resp_q003, tx_resp_q004,
       tx_resp_q005, tx_resp_q006, tx_resp_q007, tx_resp_q008
       FROM ts_students_5th
       WHERE id_uf = 43;
Q7     SELECT tx_resp_q001, tx_resp_q002, tx_resp_q003, tx_resp_q004,
       tx_resp_q005, tx_resp_q006, tx_resp_q007, tx_resp_q008
       FROM ts_students_5th
       WHERE id_uf = '11' AND id_location = '1';
Q8     SELECT id_turma, id_shift, id_city, id_block_1, id_block_2,
       id_grade, id_students
       FROM ts_students_5th
       WHERE id_uf = '15' AND id_students > '11161931'
Q9     SELECT id_uf, id_city, _id_escola, id_students
       FROM ts_students_9th;
Q10    SELECT id_turma, id_shift, id_city, id_block_1, id_block_2,
       id_grade, id_students
       FROM ts_students_5th
       WHERE id_uf = '11' AND id_students > '10913619';
To compare SQLtoKeyNoSQL with the baselines, we organized the experiments in
two parts. First, we evaluated the processing time of the fifteen queries. Second, we
evaluated the scalability of all approaches. For the scalability test, we randomly picked a query
(Q8) among those in Table 5 that are based on table ts_students_5th (the table with the highest
number of rows). We decided not to evaluate queries with joins
(Q13 to Q15 from Table 6) in the scalability experiment because the baselines performed
poorly w.r.t. SQLtoKeyNoSQL in the processing time experiment.
Table 6 Queries considered in the experiments (Part 2)
Query  SQL
Q11    SELECT id_uf, id_city, _id_escola, id_students, tx_resp_q001,
       tx_resp_q002, tx_resp_q003, tx_resp_q004, tx_resp_q005,
       tx_resp_q006
       FROM ts_students_9th
       WHERE id_uf > '11' AND id_uf < '15';
Q12    SELECT id_uf, id_city, id_escola, id_students
       FROM ts_students_9th
       WHERE id_uf > '10' AND id_uf < '15' AND id_area = '2';
Q13    SELECT ts_students_3rdhs.id_uf, ts_students_3rdhs.id_city,
       ts_students_3rdhs.id_shift, id_grade, tx_resp_q001
       FROM ts_students_3rdhs NATURAL JOIN ts_uf;
Q14    SELECT ts_escola.id_escola, ts_students_9th.id_uf,
       ts_students_9th.id_city, ts_students_9th.id_shift,
       ts_students_9th.id_grade, ts_students_9th.tx_resp_q001,
       ts_students_9th.id_escola
       FROM ts_students_9th NATURAL JOIN ts_escola;
Q15    SELECT ts_escola.id_escola, ts_students_9th.id_uf,
       ts_students_9th.id_city, ts_students_9th.id_shift,
       ts_students_9th.id_grade, ts_students_9th.tx_resp_q001,
       ts_students_9th.id_escola
       FROM ts_students_9th NATURAL JOIN ts_escola
       NATURAL JOIN ts_uf
       WHERE ts_students_9th.id_location = 1;
The execution of each query in the processing time experiment considered an initial
warm-up phase (we ran each query 3 times); then, we ran each query 5 times and
reported the average time. The results were compared employing statistical significance
tests (paired t-test) with a 95% confidence interval. The t-test is a type of inferential
statistic used to determine whether there is a significant difference between the means of
two different groups. We applied the test over two groups of values for each query:
one group is our approach and the other one a state-of-the-art approach (depending on
the experiment). The hypothesis to be verified is that our approach produces a better
performance (a reduced execution time) than the baselines.
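The measurement protocol and the paired t-test can be sketched in Python as follows. This is a minimal illustration under assumptions: the function names are hypothetical, any timings fed to it in an example are made up for demonstration and are not results from the experiments, and the two-sided critical value 2.776 (4 degrees of freedom, 95% confidence) is the standard tabulated value for five paired runs.

```python
import math
import statistics
import time


def measure(run_query, warmup=3, runs=5):
    """Run warm-up executions, then return the wall-clock time of each run."""
    for _ in range(warmup):
        run_query()
    times = []
    for _ in range(runs):
        start = time.perf_counter()
        run_query()
        times.append(time.perf_counter() - start)
    return times


def paired_t(baseline, ours):
    """Return the paired t statistic and its degrees of freedom.

    Both lists must hold the same number of paired measurements, and the
    differences must not all be identical (otherwise the variance is zero).
    """
    diffs = [b - o for b, o in zip(baseline, ours)]
    n = len(diffs)
    std_error = statistics.stdev(diffs) / math.sqrt(n)
    return statistics.mean(diffs) / std_error, n - 1


# Tabulated two-sided critical value of Student's t for df = 4 at 95%.
T_CRITICAL_DF4 = 2.776


def significantly_different(baseline, ours):
    """True if five paired runs differ at the 95% confidence level."""
    t, df = paired_t(baseline, ours)
    if df != 4:
        raise ValueError("critical value above is only valid for 5 paired runs")
    return abs(t) > T_CRITICAL_DF4
```

A positive t statistic means that the baseline times are larger on average, which is the direction stated by the hypothesis above.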
For the scalability experiment, we executed scripts that insert different numbers
of synthetic rows in the ts_student_5th table. This experiment was divided into 5 parts
based on the number of inserted rows: 500,000, 1,000,000, 1,500,000, 2,000,000 and 2,500,000
(in the case of SimpleDB: 40,000, 80,000, 120,000, 160,000 and 200,000 rows). We
executed query Q8 5 times and computed the average time. Again, the results were compared
employing statistical significance tests (paired t-test) with a 95% confidence interval.
5.3 SQLtoKeyNoSQL vs Unity
Figure 8 shows the processing time comparison of our approach (SQLtoKeyNoSQL)
with Unity. The bar graph shows, in the x-axis, the queries and, in the y-axis, the
corresponding processing time in seconds.
The results show that our approach obtained better performance w.r.t. Unity for all
queries. One possible reason is our abstract method getN, which exploits
the MongoDB capability for retrieving blocks of data. Unity, instead, fetches only one
record at a time.
Notice that the biggest difference is related to the join queries (Q13, Q14, and
Q15). Our approach has a significantly lower processing time for all of them by
running classical (and efficient) join algorithms and prioritizing main memory processing
when possible. Unity, instead, implements a more complex join processing strategy
[15].
Fig. 8 Processing time comparison between SQLtoKeyNoSQL and Unity
Fig. 9 Scalability comparison between SQLtoKeyNoSQL and Unity
Figure 9 shows the results for the second part of the experiments, i.e., the
scalability evaluation. The line graph presents, in the x-axis, the number of rows and, in
the y-axis, the respective processing time in seconds. Both approaches obtained about
the same performance up to 1,500,000 rows. After that, SQLtoKeyNoSQL presents a
slightly superior performance. That improvement is probably due to the number
of data requests to MongoDB, since SQLtoKeyNoSQL can retrieve blocks of data
for each request by executing the getN method. We also notice that the processing time of
both approaches increases significantly after the mark of 1,500,000 tuples. This
is because MongoDB is running as a single-node instance and is not able to
scale.
5.4 SQLtoKeyNoSQL vs SimpleSQL
We conducted the same set of experiments for SimpleSQL by accessing Amazon
SimpleDB through the Amazon AWS cloud. As stated before, for this set of experiments
we reduced the cardinality of each table to 200,000 rows.
Figure 10 shows the results for the first part of the experimental evaluation. Notice
that SQLtoKeyNoSQL has a significantly lower processing time for all queries. This
is probably due to two main differences between the strategies followed by the approaches.
First, SimpleSQL executes at least two requests to SimpleDB to retrieve the rows: one
to get metadata information about the table and another one to get the mapped rows of
the table. SQLtoKeyNoSQL, instead, keeps metadata information in main memory and
accesses SimpleDB only to get the rows. Second, SimpleSQL has to query each item
stored in SimpleDB by filtering on a special metadata attribute (SimpleSQL_TableName)
that maintains the table name. In contrast, SQLtoKeyNoSQL accesses all
metadata information in its main memory dictionary.
Figure 11 shows the results for the SimpleSQL scalability experiments. The line
graph presents, in the x-axis, the number of rows for the table and, in the y-axis, the
respective processing time in minutes. Again, SQLtoKeyNoSQL outperforms
SimpleSQL in all five experiments. The difference increases drastically as the number
of retrieved rows grows.
Fig. 10 Processing times comparison between SQLtoKeyNoSQL and SimpleSQL
Fig. 11 Scalability comparison between SQLtoKeyNoSQL and SimpleSQL
In short, the experiments have shown that SQLtoKeyNoSQL is a promising
approach for performing SQL queries over NoSQL data stores. The canonical model
and the dictionary help to manage and retrieve the data in an efficient way. Moreover,
we offer the flexibility to store data in any key-based NoSQL data model, allowing
users to choose the best one for their needs.
6 Conclusion
This paper presents SQLtoKeyNoSQL, an approach that provides an SQL-based access
interface for data maintained in NoSQL DB. The idea behind this proposal is to offer
a solution for relational-based applications that intend to migrate their data to NoSQL
DB without incurring the high costs of learning new NoSQL DB access
methods and of adapting their SQL interface to these new access methods.
Moreover, it allows the movement of relational data to one or more key-based NoSQL
data models.
Our approach is materialized as a layer allowing users to execute a subset of
SQL DDL and DML instructions over any key-based access NoSQL DB (document-
oriented, column-oriented and key-value NoSQL DB). It supports a canonical data
model that works as an intermediate schema between the relational data model and
the key-based access NoSQL data models, providing transparent access. Besides,
SQLtoKeyNoSQL allows the user to choose the NoSQL target DB where each table
is going to be stored.
We evaluated SQLtoKeyNoSQL, regarding processing time, against two baselines
available in the literature (SimpleSQL and Unity), as detailed in Sect. 5. The results of
the experiments were considered satisfactory. Our approach outperforms SimpleSQL
for all proposed queries, being three times more effective in terms of join processing.
The experiments also showed that our approach achieved lower processing times than
Unity in the scalability tests, with 95% confidence. Based on the results of the
experiments, we also conclude that the new features added to SQLtoKeyNoSQL make
it a more robust and scalable approach. SQLtoKeyNoSQL can be a very useful tool
for users that intend to migrate their applications from the relational data model to a
key-based NoSQL data model with a lower learning curve.
This paper contributes a basis for a comprehensive and efficient solution for
relational-based access to any key-based NoSQL DB. Even so, several directions for
future work remain: (i) support for index management; (ii) a possible extension
of the canonical model to support the graph data model; and (iii) enhancing the SQL
subset by adding support for aggregation and subqueries.
References
1. Abadi DJ (2009) Data management in the cloud: limitations and opportunities. IEEE Data Eng Bull
32(1):3–12
2. Apache (2017) White paper: Apache Phoenix. http://phoenix.apache.org/. Accessed 24 Aug 2018
3. Arnaut DE, Schroeder R, Hara CS (2011) Phoenix: a relational storage component for the cloud. In:
2013 IEEE SICCC 0
4. Atzeni P, Bugiotti F, Rossi L (2012) Sos (save our systems): a uniform programming interface for
non-relational systems. In: Proceedings of the 15th international conference on extending database
technology. ACM, New York
5. Banerjee S, Goto T, Debnath NC, Sarkar A (2017) Ontology driven query language for nosql databases.
In: 2017 IEEE 15th international conference on industrial informatics (INDIN), pp 951–956
6. Bisbal J, Lawless D, Wu B, Grimson J (1999) Legacy information systems: issues and directions. IEEE
Softw 16(5):103–111
7. Cattell R (2011) Scalable SQL and NoSQL data stores. SIGMOD Rec 39(4):12–27
8. Chung WC, Lin HP, Chen SC, Jiang MF, Chung YC (2014) Jackhare: a framework for SQL to NoSQL
translation using mapreduce. Autom Softw Eng 21(4):489–508
9. dos Santos Ferreira G, Calil A, dos Santos Mello R (2013) On providing DDL support for a relational
layer over a document NoSQL database. In: IIWAS. ACM, New York
10. Egger D (2009) SQL in the cloud. Ph.D. thesis, Master Thesis ETH Zurich
11. Fielding RT (2000) Architectural styles and the design of network-based software architectures. Ph.D.
thesis, University of California, Irvine
12. Hamouda S, Zainol Z (2017) Document-oriented data schema for relational database migration to
NoSQL. In: 2017 International conference on big data innovations and applications (innovate-data),
pp 43–50
13. Kim HJ, Ko EJ, Jeon YH, Lee KH (2018a) Migration from RDBMS to column-oriented NoSQL:
lessons learned and open problems. In: Lee W, Choi W, Jung S, Song M (eds) Proceedings of the 7th
international conference on emerging databases. Springer Singapore, pp 25–33
14. Kim HJ, Ko EJ, Jeon YH, Lee KH (2018b) Techniques and guidelines for effective migration from
RDBMS to NoSQL. J Supercomput. https://doi.org/10.1007/s11227-018-2361-2
15. Lawrence R (2014) Integration and virtualization of relational SQL and NoSQL systems including
MySQL and MongoDB. In: CSCI, vol 1
16. Liu ZH, Hammerschmidt BC, McMahon D (2014) JSON data management: supporting schema-less
development in RDBMS. In: ICMD, SIGMOD
17. Mishra P, Eich MH (1992) Join processing in relational databases. ACM CSUR 24(1):63–113
18. Papakonstantinou Y, Garcia-Molina H, Widom J (1995) Object exchange across heterogeneous infor-
mation sources. In: 11th CDE. IEEE
19. Rith J, Lehmayr PS, Meyer-Wegener K (2014) Speaking in tongues: SQL access to NoSQL systems.
In: 29th ACM SAC, New York
20. Rocha L, Vale F, Cirilo E, Barbosa D, Mouro F (2015) A framework for migrating relational datasets
to NoSQL1. Procedia Comput Sci 51(C):2593–2602
21. Sadalage PJ, Fowler M (2012) NoSQL distilled: a brief guide to the emerging world of polyglot
persistence. Pearson Education, London
22. Schreiner GA, Duarte D, dos Santos Mello R (2015) SQLtoKeyNoSQL: a layer for relational to key-
based NoSQL database mapping. In: iiWAS, ACM, New York
23. Vathy-Fogarassy G, Hugyk T (2017) Uniform data access platform for SQL and NoSQL database
systems. Inf Syst 69(C):93–105
24. Vilaça R, Cruz F, Pereira J, Oliveira R (2013) An effective scalable SQL engine for NoSQL databases.
In: Dowling J, Taïani F (eds) 13th IFIP, DAIS, Springer, Berlin
25. Xu J, Shi M, Chen C, Zhang Z, Fu J, Liu CH (2016) ZQL: a unified middleware bridging both relational
and NoSQL databases. In: 2016 IEEE 14th ICD, ASC, 14th ICPIC, 2nd CyberSciTech, pp 730–737
Publisher’s Note Springer Nature remains neutral with regard to jurisdictional claims in published maps
and institutional affiliations.