Implementation of Multidimensional Databases with Document-Oriented NoSQL
Abstract. NoSQL (Not Only SQL) systems are becoming popular due to known
advantages such as horizontal scalability and elasticity. In this paper, we study
the implementation of data warehouses with document-oriented NoSQL
systems. We propose mapping rules that transform the multidimensional data
model to logical document-oriented models. We consider three different
logical translations and we use them to instantiate multidimensional data
warehouses. We focus on data loading, model-to-model conversion and cuboid
computation.
1 Introduction
NoSQL solutions have shown clear advantages with respect to relational database
management systems (RDBMS) [14]. Nowadays, research attention has moved towards the
use of these systems for storing and analyzing "big" data. This work follows our previous
work on the use of NoSQL solutions for data warehousing [3] and joins substantial ongoing
work [6, 9, 15]. In this paper, we focus on one class of NoSQL stores, namely
document-oriented systems [7].
Document-oriented systems are one of the best-known families of NoSQL systems.
Data is stored in collections, which contain documents. Each document is composed of
key-value pairs, and a value can itself be a nested sub-document. Document-oriented
stores enable more flexibility in schema design: they allow the storage of complex
structured data and of heterogeneous data in one collection. Although document-oriented
databases are said to be "schemaless" (no schema is required), most uses still conform
to some data model.
When it comes to data warehouses, previous work has shown that they can be instantiated
with different logical models [10]. We recall that data warehousing relies mostly on the
multidimensional data model. The latter is a conceptual model,1 and we need to map it
onto document-oriented logical models. Mapping the multidimensional model to relational
databases is quite straightforward, but until now no work (apart from our previous work
[3]) has considered the direct mapping from the multidimensional conceptual model to
NoSQL logical models (Fig. 1). NoSQL models support more complex data structures than
the relational model, i.e. we are not limited to describing data and relations with
atomic attributes; they offer flexible data structures (e.g. nested elements). In this
context, more than one logical model is a candidate for mapping the multidimensional
model. Moreover, evolving needs may require switching from one model to another. This is
the scope of our work: NoSQL logical models and their use for multidimensional data
warehousing.
Fig. 1. At the logical level, the multidimensional conceptual model can be translated either into relational or into NoSQL models; OLAP analyses are performed on top of both.
1 The conceptual level describes the data in a generic way, regardless of the information
technology, whereas the logical level uses a specific technique to implement the
conceptual level.
Illustration: Let us consider an excerpt of the star schema benchmark [12]. It models
the monitoring of a sales system: orders are placed by customers and the lines of the
orders are analyzed. A line concerns a part (a product) bought from a supplier and sold
to a customer on a specific date. The conceptual schema of this case study is presented
in Fig. 2.
Fig. 2. Conceptual (multidimensional) schema of the case study: the fact LineOrder (measures Quantity, Discount, Revenue, Tax) is analyzed along four dimensions: CUSTOMER (hierarchy HCust: Customer, City, Region, Nation, All), PART (hierarchies HBrand: Partkey, Brand, Type, All and HCateg: Partkey, Category, All; weak attributes Size, Prod_Name), DATE (hierarchy HTime: Date, Month, Year, All; weak attribute Month_Name) and SUPPLIER (hierarchy HSuppl: Supplier, City, Region, Nation, All; weak attribute Name).
– F_SSB = {F_LineOrder}
– D_SSB = {D_Customer, D_Part, D_Date, D_Supplier}
– Star_SSB(F_LineOrder) = {D_Customer, D_Part, D_Date, D_Supplier}
From this schema, called E_SSB, we can define cuboids, for instance:
– (F_LineOrder, {D_Customer, D_Date, D_Supplier}),
– (F_LineOrder, {D_Customer, D_Date}).
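For concreteness, the sketch below renders the schema E_SSB and the two cuboids above as plain Python data structures; the variable names are ours and purely illustrative, not part of the benchmark.

```python
# Illustrative only: a plain-Python rendering of the star schema E_SSB
# and of two of its cuboids. Names are hypothetical.
from typing import Dict, List, Tuple

# Fact and dimensions of E_SSB
F_LINEORDER = {"measures": ["Quantity", "Discount", "Revenue", "Tax"]}
D_SSB: Dict[str, List[str]] = {
    "Customer": ["Customer", "City", "Region", "Nation"],
    "Part":     ["Partkey", "Brand", "Type", "Category", "Size", "Prod_Name"],
    "Date":     ["Date", "Month", "Year", "Month_Name"],
    "Supplier": ["Supplier", "City", "Region", "Nation", "Name"],
}

# A cuboid pairs the fact with a subset of its dimensions.
Cuboid = Tuple[str, List[str]]
cuboid_cds: Cuboid = ("LineOrder", ["Customer", "Date", "Supplier"])
cuboid_cd:  Cuboid = ("LineOrder", ["Customer", "Date"])
```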
We consider three document-oriented logical models: MLD0, MLD1 and MLD2.
MLD0: For a given fact, all measures and all dimension attributes are stored as simple
attributes of one flat collection, with no nesting and no references.
MLD1: For a given fact, all dimension attributes are nested under the respective
dimension name and all measures are nested in a sub-document with key "measures". This
model is inspired by [3]. Note that there are different ways to nest data; this is just
one of them.
MLD2: For a given fact and its dimensions, we store data in dedicated collections: one
per dimension and one for the fact. Each collection is kept simple, with no
sub-documents. The fact documents hold references to the dimension documents. We call
this model MLD2 (or the shattered model). This model has known advantages such as lower
memory usage and better data integrity, but it can slow down querying.
Table 1. Mapping rules from the conceptual model to the logical models
Conceptual Model to MLD0: To instantiate this model from the conceptual model,
these rules are applied:
• Each cuboid O (F_O and its dimensions D_O) is translated into a collection C.
• Each measure m ∈ M_F is translated into a simple attribute of C (i.e. C[id]{m}).
• For each dimension D ∈ D_O, each attribute a ∈ A_D of the dimension D is converted
into a simple attribute of C (i.e. C[id]{a}).
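As an illustration of these rules, the following sketch (using pymongo, with database, collection and field names of our own choosing) builds one MLD0 document for a single line order and inserts it; it is a minimal example of the flat mapping, not the generator used in our experiments.

```python
# Minimal MLD0 sketch: every measure and every dimension attribute becomes a
# simple attribute of one flat document. All names and values are illustrative.
from pymongo import MongoClient

db = MongoClient("mongodb://localhost:27017")["ssb_mld0"]

line_order_flat = {
    # measures of the fact
    "quantity": 17, "discount": 4, "revenue": 2116823, "tax": 2,
    # attributes of the Customer dimension
    "customer": "Customer#001", "c_city": "Lyon",
    "c_region": "Europe", "c_nation": "France",
    # attributes of the Date dimension
    "date": "1997-03-04", "month": "1997-03", "year": 1997,
    # Part and Supplier attributes would be flattened in the same way
}
db["lineorder"].insert_one(line_order_flat)
```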
Conceptual Model to MLD1: To instantiate this model from the conceptual model,
these rules are applied:
• Each cuboid O (F_O and its dimensions D_O) is translated into a collection C.
• The attributes of the fact F_O are nested in a dedicated sub-document C[id]{N_F}.
Each measure m ∈ M_F is translated into a simple attribute C[id]{N_F:m}.
• For each dimension D ∈ D_O, its attributes are nested in a dedicated sub-document
C[id]{N_D}. Every attribute a ∈ A_D of the dimension D is mapped into a simple
attribute C[id]{N_D:a}.
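The same illustrative line order, shaped according to MLD1, would look as follows (again a minimal pymongo sketch with hypothetical names):

```python
# Minimal MLD1 sketch: measures nested under "measures", each dimension's
# attributes nested under the dimension's name. Names and values are illustrative.
from pymongo import MongoClient

db = MongoClient("mongodb://localhost:27017")["ssb_mld1"]

line_order_nested = {
    "measures": {"quantity": 17, "discount": 4, "revenue": 2116823, "tax": 2},
    "customer": {"customer": "Customer#001", "city": "Lyon",
                 "region": "Europe", "nation": "France"},
    "date": {"date": "1997-03-04", "month": "1997-03", "year": 1997},
    # "part" and "supplier" sub-documents would follow the same pattern
}
db["lineorder"].insert_one(line_order_nested)
```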
Conceptual Model to MLD2: To instantiate this model from the conceptual model,
these rules are applied:
• For each cuboid O (F_O and its dimensions D_O), the fact F_O is translated into a
collection C_F and each dimension D ∈ D_O into a collection C_D.
• Each measure m ∈ M_F is translated within C_F into a simple attribute (i.e. C_F[id']{m}).
• For each dimension D ∈ D_O, each attribute a ∈ A_D of the dimension D is mapped into
C_D as a simple attribute (i.e. C_D[id]{a}), and if a = id_D the document C_F is
completed with a simple attribute C_F[id']{a} (the value referencing the linked
dimension document).
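For MLD2, the same illustrative data is split across collections, with the fact document holding only measures and references; the sketch below uses identifiers and names of our own choosing.

```python
# Minimal MLD2 (shattered) sketch: dedicated collections per dimension, fact
# documents keep measures plus references to dimension documents.
from pymongo import MongoClient

db = MongoClient("mongodb://localhost:27017")["ssb_mld2"]

# One document per dimension member, in dedicated collections.
db["customer"].insert_one({"_id": "CUST001", "customer": "Customer#001",
                           "city": "Lyon", "region": "Europe", "nation": "France"})
db["date"].insert_one({"_id": "19970304", "date": "1997-03-04",
                       "month": "1997-03", "year": 1997})

# The fact document holds its measures and the identifiers of the dimension
# documents it refers to (the value references).
db["lineorder"].insert_one({
    "quantity": 17, "discount": 4, "revenue": 2116823, "tax": 2,
    "custkey": "CUST001", "datekey": "19970304",
    # "partkey" and "suppkey" references would be added likewise
})
```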
5 Experiments
Our experimental goal is to validate the instantiation of data warehouses with the three
approaches mentioned earlier. Then, we consider converting data from one model to the
other. In the end, we generate OLAP cuboids and we compare the effort needed by
model. We rely on the SSB + benchmark that is popular for generating data for decision
support systems. As data store, we rely on MongoDB one of the most popular document-
oriented system.
5.1 Protocol
Data: We generate data using the SSB+ benchmark [4]. The benchmark models a simple
product-retail scenario. It contains one fact table "LineOrder" and 4 dimensions
"Customer", "Supplier", "Part" and "Date". This corresponds to a star schema. The
dimensions are hierarchical, e.g. "Date" has the hierarchy of attributes [d_date,
d_month, d_year]. We have extended the benchmark to generate raw data specific to our
models in JSON file format, which is convenient for our experimental purposes since
JSON is the most natural file format for MongoDB data loading. We use different scale
factors, namely sf = 1, sf = 10, sf = 25 and sf = 100, in our experiments. The scale
factor sf = 1 generates approximately 10^7 lines for the LineOrder fact, sf = 10
approximately 10^8 lines, and so on. In the MLD2 model we will have sf × 10^7 lines
for LineOrder and considerably fewer for the dimensions.
Data loading: Data is loaded into MongoDB using native instructions, which are supposed
to load data faster from files. The current version of MongoDB could not load data
matching our logical models from CSV files, so we had to use JSON files.
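For instance, MongoDB's native bulk loader can be driven from a script as sketched below; the database, collection and file names are placeholders and not necessarily those used in our experiments.

```python
# Sketch of driving MongoDB's native bulk loader (mongoimport) on a JSON file
# produced by the extended SSB+ generator. Names are placeholders.
import subprocess

subprocess.run(
    ["mongoimport",
     "--db", "ssb_mld0",
     "--collection", "lineorder",
     "--file", "lineorder_mld0.json"],
    check=True,
)
```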
5.2 Results
In Table 2, we summarize data loading times by model and scale factor. At scale factor
SF1, we have 10^7 lines in each LineOrder collection, for 4.2 GB of disk usage with
MLD2 (15 GB for MLD0 and MLD1). At scale factors SF10 and SF100 we have respectively
10^8 and 10^9 lines, for 42 GB (150 GB for MLD0 and MLD1) and 420 GB (1.5 TB for MLD0
and MLD1) of disk usage. We observe that memory usage is lower with the MLD2 model.
This is explained by the absence of redundancy in the dimensions: the collections
"Customer", "Supplier", "Part" and "Date" have respectively 50000, 3333, 3333333 and
2556 records.
In Fig. 4, we show the time needed to convert data of one model into another at SF1.
When we convert data from MLD0 to MLD1 and vice versa, conversion times are comparable.
To transform data from MLD0 to MLD1 we simply introduce a nesting depth of 1 in the
documents; in the other direction (MLD1 to MLD0), we reduce the depth by one. The
conversion is more complicated when we consider MLD0 and MLD2. To convert MLD0 data
into MLD2 we need to split the data into multiple collections: we have to apply 5
projections on the original data and keep only distinct keys for the dimensions.
Although we produce less data (in terms of storage), we need more processing time than
when we convert data to MLD1. Converting from MLD2 to MLD0 is by far the slowest
process. This is due to the fact that most NoSQL systems (including MongoDB) do not
natively support joins. We had to test different hand-coded optimization techniques;
the conversion times range from 5 h to 125 h for SF1. It might be possible to optimize
this conversion further, but the results are illustrative of the join issues in MongoDB.
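One possible hand-coded strategy, sketched below with pymongo and illustrative names, caches the small dimension collections in memory and flattens each fact document; it is given only to illustrate why this conversion amounts to an application-side join, not as the exact code we benchmarked.

```python
# Application-side join from MLD2 back to MLD0 (one possible strategy):
# cache the dimension collections in memory, then flatten each fact document.
from pymongo import MongoClient

src = MongoClient("mongodb://localhost:27017")["ssb_mld2"]
dst = MongoClient("mongodb://localhost:27017")["ssb_mld0"]

# Cache each dimension keyed by its identifier (only two shown for brevity).
customers = {d["_id"]: d for d in src["customer"].find()}
dates     = {d["_id"]: d for d in src["date"].find()}

batch = []
for fact in src["lineorder"].find():
    # Keep the measures, drop the internal id and the reference attributes.
    flat = {k: v for k, v in fact.items() if k not in ("_id", "custkey", "datekey")}
    # Replace each reference by the attributes of the referenced dimension document.
    flat.update({k: v for k, v in customers[fact["custkey"]].items() if k != "_id"})
    flat.update({k: v for k, v in dates[fact["datekey"]].items() if k != "_id"})
    batch.append(flat)
    if len(batch) == 10_000:          # insert in bulk to limit round trips
        dst["lineorder"].insert_many(batch)
        batch = []
if batch:
    dst["lineorder"].insert_many(batch)
```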
Fig. 4. Model-to-model conversion times at SF1: conversions between MLD0 and MLD1 take 550 s and 720 s; MLD0 to MLD2 takes 870 s; MLD2 to MLD0 takes between 5 h and 125 h.
OLAP lattice excerpt at SF1 (record counts and computation times): the base cuboid CSPD contains 10,000,000 documents (loading time only, no processing time); the two-dimension cuboids CS, CP, CD, SP, SD and PD contain between 21,250 and 937,478 documents and are computed in 35 s to 237 s; the apex cuboid (All) contains 1 document and is computed in less than 1 s.
We observe, as expected, that the number of records decreases from one level to the next
lower level. The same holds for computation time: we need between 300 and 500 s to
compute the cuboids at the first level (3 dimensions), between 30 s and 250 s at the
second level (2 dimensions), and less than one second at the third and fourth levels
(1 and 0 dimensions).
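As an illustration, a two-dimension cuboid such as (Customer, Date) can be materialized from the MLD0 collection with MongoDB's aggregation framework; the pipeline below is a sketch with illustrative field names and is not necessarily the implementation used in our experiments.

```python
# Sketch: compute and materialize the (Customer, Date) cuboid from the flat
# MLD0 collection by grouping on the two dimension keys and summing measures.
from pymongo import MongoClient

db = MongoClient("mongodb://localhost:27017")["ssb_mld0"]

pipeline = [
    {"$group": {
        "_id": {"customer": "$customer", "date": "$date"},   # grouping keys
        "sum_quantity": {"$sum": "$quantity"},                # aggregated measures
        "sum_revenue":  {"$sum": "$revenue"},
    }},
    {"$out": "cuboid_customer_date"},                         # materialize the cuboid
]
db["lineorder"].aggregate(pipeline, allowDiskUse=True)
```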
OLAP computation using the MLD1 model provides similar results. The performance is
significantly lower with the MLD2 model due to joins. These differences only concern
layer 1 (depth one) of the OLAP lattice, because the other layers can be computed from
it. We do not report these results due to space constraints.
Observations: We observe that the time needed to load data into one model is comparable
to the conversion times (except for MLD2 to MLD0). We also observe reasonable times for
computing OLAP cuboids. These observations are important. On the one hand, we show that
we can instantiate data warehouses in document-oriented systems. On the other hand, we
can think of pivot models or materialized views that can be computed in parallel with a
chosen data model.
6 Conclusion
In this paper, we have studied the instantiation of data warehouses with document-
oriented systems. We propose three approaches at the document-oriented logical level.
Using a simple formalism, we describe the mapping from the multidimensional conceptual
data model to the logical level.
Our experimental work illustrates the instantiation of data warehouses with each of the
three approaches. Each model has its weaknesses and strengths. The shattered model
(MLD2) uses less disk space, but it is quite inefficient when it comes to answering
queries that require joins. The simple models MLD0 and MLD1 do not show significant
performance differences. Passing from one model to another is shown to be easy and
comparable in time to loading data from scratch. One conversion performs significantly
worse: the mapping from multiple collections (MLD2) to one collection. Interesting
results are also obtained in the computation of the OLAP lattice with document-oriented
models, where computation times remain reasonable.
For future work, we will consider logical models for column-oriented and graph-oriented
systems. After exploring data warehouse instantiation across different NoSQL systems,
we need to generalize across logical models. We need a simple formalism to express
model differences, and we need to compare models within each paradigm and across
paradigms (e.g. document versus column).
References
1. Bosworth, A., Gray, J., Layman, A., Pirahesh, H.: Data cube: A relational aggregation
operator generalizing group-by, cross-tab, and sub-totals. Technical report MSR-TR-95-22,
Microsoft Research, February 1995
2. Chaudhuri, S., Dayal, U.: An overview of data warehousing and OLAP technology. SIGMOD
Rec. 26, 65–74 (1997)
3. Chevalier, M., El Malki, M., Kopliku, A., Teste, O., Tournier, R.: Implementing
multidimensional data warehouses into NoSQL. In: 17th International Conference on
Enterprise Information Systems (ICEIS), April 2015
4. Chevalier, M., El Malki, M., Kopliku, A., Teste, O., Tournier, R.: Benchmark for OLAP on
NoSQL technologies, comparing NoSQL multidimensional data warehousing solutions. In:
9th International Conference on Research Challenges in Information Science (RCIS). IEEE
(2015)
5. Colliat, G.: OLAP, relational, and multidimensional database systems. SIGMOD Rec. 25(3),
64–69 (1996)
6. Cuzzocrea, A., Song, I.Y., Davis, K.C.: Analytics over large-scale multidimensional data:
The big data revolution! In: 14th International Workshop on Data Warehousing and OLAP
(DOLAP 2011), pp. 101–104. ACM (2011)
7. Dede, E., Govindaraju, M., Gunter, D., Canon, R.S., Ramakrishnan, L.: Performance
evaluation of a MongoDB and Hadoop platform for scientific data analysis. In: 4th ACM
Workshop on Scientific Cloud Computing (ScienceCloud 2013), pp. 13–20. ACM (2013)
8. Dehdouh, K., Boussaid, O., Bentayeb, F.: Columnar NoSQL star schema benchmark. In: Ait
Ameur, Y., Bellatreche, L., Papadopoulos, G.A. (eds.) MEDI 2014. LNCS, vol. 8748, pp.
281–288. Springer, Heidelberg (2014)
9. Golfarelli, M., Maio, D., Rizzi, S.: The dimensional fact model: A conceptual model for data
warehouses. Int. J. Coop. Inf. Syst. 7, 215–247 (1998)
10. Kimball, R., Ross, M.: The Data Warehouse Toolkit: The Complete Guide to Dimensional
Modeling, 2nd edn. Wiley, New York (2002)
11. Mior, M.J.: Automated schema design for NoSQL databases. In: SIGMOD (2014)
12. O’Neil, P., O’Neil, E., Chen, X., Revilak, S.: The star schema benchmark and augmented fact
table indexing. In: Nambiar, R., Poess, M. (eds.) TPCTC 2009. LNCS, vol. 5895, pp. 237–
252. Springer, Heidelberg (2009)
13. Ravat, F., Teste, O., Tournier, R., Zurfluh, G.: Algebraic and graphic languages for OLAP
manipulations. IJDWM 4(1), 17–46 (2008)
14. Stonebraker, M.: New opportunities for new SQL. Commun. ACM 55(11), 10–11 (2012).
http://doi.acm.org/10.1145/2366316.2366319
15. Zhao, H., Ye, X.: A practice of TPC-DS multidimensional implementation on NoSQL
database systems. In: Nambiar, R., Poess, M. (eds.) TPCTC 2013. LNCS, vol. 8391, pp. 93–
108. Springer, Heidelberg (2014)