
NoSQL Databases: MongoDB vs Cassandra

Veronika Abramova, Jorge Bernardino

Polytechnic Institute of Coimbra
ISEC - Coimbra Institute of Engineering
Rua Pedro Nunes, 3030-199 Coimbra, Portugal
Tel. +351 239 790 200
[email protected], [email protected]

ABSTRACT
In the past, relational databases were used in a wide range of applications due to their rich set of features, query capabilities and transaction management. However, they are not able to store and process big data effectively and are not very efficient at transactions and join operations. Recently, a new paradigm, NoSQL databases, has emerged to overcome some of these problems; these databases are more suitable for use in web environments. In this paper, we describe NoSQL databases, their characteristics and operational principles. The main focus of this paper is to compare and evaluate two of the most popular NoSQL databases: MongoDB and Cassandra.

Categories and Subject Descriptors
H.2 [Database Management]. H.2.5 [Heterogeneous Databases]. H.2.6 [Database Machines].

General Terms
Management, Measurement, Performance, Experimentation, Verification.

Keywords
Database Management Systems (DBMS), NoSQL Databases.

1. INTRODUCTION
Some years ago, databases appeared as repositories of organized and structured data, where all that data is combined into a set of registers arranged into a regular structure to enable easy extraction of information. To access data, it is common to use a system usually known as a DataBase Management System (DBMS). A DBMS can be defined as a collection of mechanisms that enables the storage, editing and extraction of data; over the past years the concept of a DBMS has become a synonym of database. The size and complexity of databases are defined by the number of registers used. A simple database can be represented as a file with data, while more complex databases are able to store millions of registers, amounting to huge numbers of gigabytes, all over the globe. More and more, databases became an important enterprise tool. Over the past years, with the evolution of Information and Communications Technology, the storage type, functionalities and interaction with databases have improved. Moreover, databases became a resource used every day by millions of people in countless applications. All that value and usage created the need to have all data structured and organized in the best way, so that extraction can be made fast and easy. Whenever the quantity of data increases, databases become larger. With the exponential growth of database size, access to data has to be made as efficient as possible. That leads to the well-known problem of efficiency in information extraction.

Edgar Frank Codd introduced the relational model in 1970, publishing a paper in the Communications of the ACM magazine while working as an IBM programmer [2]. As a research result, Codd proposed a solution to overcome data storage and usage difficulties according to principles based on relations between data. So, 12 rules were introduced to manage data under the relational model, known as "E. F. Codd's 12 rules" [3]. Meanwhile, System R [15], an experimental database system, was developed to demonstrate the usability and advantages of the relational model. With it a new language was created, the Structured Query Language, known as SQL [6]. Since then, SQL has become a standard for data interaction and manipulation. Relational databases store data as a set of tables, each one with different information. All data is related, so it is possible to access information from different tables simultaneously. The relational model is based on the "relationship" concept. The origin of the relational model was the concept Codd used to define a table with data, which he called a "relation". So, basically, a relation is a table organized in columns and rows. Each table is formed by a set of tuples with the same attributes. Those attributes contain information about some object. More complex databases contain many tables with millions of entries. Those tables are connected so that data from one table can be related to another by a key. There are different types of keys, but essentially there are two: the primary key and the foreign key. A primary key is used to identify each tuple of a table as unique. A foreign key is used to cross-reference tables: a foreign key in one table refers to a primary key in another.

While data volume increases exponentially, some problems became evident. One of those is database performance related to data access and the basic structure of the relational model. SQL enables easy data extraction, but when the information volume is huge, query execution time can become slow [10, 12, 18]. Any application with a large amount of data will unavoidably lose performance. To overcome those efficiency problems, different types of databases emerged. One of those is known as NoSQL, which corresponds to "Not Only SQL" [16]. NoSQL was introduced by Carlo Strozzi in 1998 to refer to an open source database that was not using a SQL interface. Carlo Strozzi preferred to refer to NoSQL as "noseequel"
or "NoRel", which marks the principal difference between that technology and the already existing one [13]. The origin of NoSQL can be related to BigTable, a model developed by Google [7]. That database system, BigTable, was used to store Google's projects, such as Google Earth. Later, Amazon developed its own system, Dynamo [5]. Both of those projects highly contributed to NoSQL development and evolution. However, the NoSQL term was not popular or widely known until the meeting held in San Francisco in 2009 [20, 21]. Ever since then, NoSQL has become a buzzword.

This paper is focused on testing NoSQL databases and comparing the performance of two widely used databases, MongoDB and Cassandra. We describe the main characteristics and advantages of NoSQL databases compared to commonly used relational databases. Some advantages and innovations brought by the noseequel model, as well as the different existing types of NoSQL databases, are discussed. The benchmarking of these two NoSQL databases, MongoDB and Cassandra, is also described.

The experimental evaluation of both databases will test the difference in managing data and data volume scalability, and verify how the databases respond to a read/update mix while running on just one node without a lot of memory and processor resources, just like a personal computer. More specifically, a Virtual Machine environment will be used. It is common to benchmark databases on clusters with high processing power and large capabilities, but in our study the main goal is to focus on servers with less capacity.

The remainder of this paper is organized as follows. Section 2 reviews related work on the topic and Section 3 makes a brief summary of NoSQL databases. Section 4 describes the comparison between MongoDB and Cassandra. Section 5 describes the YCSB – Yahoo! Cloud Serving Benchmark. In Section 6 the experimental results obtained in the study are shown. Finally, Section 7 presents our conclusions and suggests future work.

2. RELATED WORK
The performance and functional principles of NoSQL databases have been studied ever since those databases gained popularity. While analyzing different papers and studies of NoSQL databases, two types of approaches can be identified. The first is focused on comparing commonly used SQL databases with NoSQL databases, evaluating and studying performance in order to distinguish those two types of databases. The other consists of comparisons only between NoSQL databases. Those studies commonly pick the most well-known NoRel databases and compare their performance. However, both of those comparisons in most cases are focused on analyzing the number of operations per second and the latency of each database. While latency may be considered an important factor when working in a cluster environment, there is no value in it for a single-node study.

Brian F. Cooper et al. analyzed the performance of NoSQL databases and of the MySQL database using the YCSB benchmark, relating latency with the number of operations per second [4]. In our paper the main focus is to perform studies prioritizing different execution parameters. More specifically, our goal is based on relating execution time to the number of records used in each execution. More importantly, since benchmarking is commonly done on clusters with high processing power and lots of memory, it is also important to understand how these databases behave in simpler environments and while using just one server.

The main difference of our paper is its goal of studying the evolution of execution time according to the increase in database size. Although all the different studies performed are important and allow a better understanding of the capabilities of NoSQL databases and of how they differ, we consider data volume a very important factor that must be evaluated. At the same time, execution time provides a better perception of performance, while the number of operations per second may be hard to analyze. Also, while examining related work, it is important to notice that there are not many papers discussing the performance and benchmarking of NoSQL databases. With all the aspects defined above, the main aim of our study is to increase the number of analyses and studies available, while focusing on different parameters compared to existing papers.

3. NOSQL DATABASES
The main reason for NoSQL development was Web 2.0, which increased both the use of databases and the quantity of data stored in them [8, 11]. Social networks and large companies, such as Google, interact with data at a very large scale [12]. In order not to lose performance, the necessity arises to scale data horizontally. Horizontal scaling can be defined as an attempt to improve performance by increasing the number of storage units [19]. A large number of computers can be connected to create a cluster whose performance exceeds that of a single node with a lot of additional CPUs and memory. With the increased adherence to social networks, information volume increased greatly. In order to fulfill users' demands and capture even more attention, multimedia sharing became more used and popular. Users became able to upload and share multimedia content. So, the difficulty of keeping up performance and satisfying users became higher [19]. Enterprises became even more aware of efficiency and of the importance of information promptness. The most widely used social network, Facebook, developed by Mark Zuckerberg, grew rapidly. With that, meeting all the requirements of its users became a hard task. It is difficult to estimate how many millions of people use this network at the same time to perform different activities. Internally, the interaction of all those users with multimedia data is represented by millions of requests to the database at the same time. The system must be designed to handle a large number of requests and to process data in a fast and efficient way. In order to keep up with all demands as well as keep performance high, companies invest in horizontal scaling [18]. Beyond efficiency, costs are also reduced: it is less expensive to have a large number of computers with fewer resources than to build a supercomputer. Relational databases allow data to be scaled horizontally, but NoSQL provides that in an easier way. This is due to the ACID principles and transaction support that are described in the next section. Since data integrity and consistency are highly important for relational databases, communication channels between nodes and clusters would have to instantly synchronize all transactions. NoSQL databases are designed to handle all types of failures. A variety of hardware failures may occur and the system must be prepared, so it is more functional to consider those concerns as eventual occurrences rather than as exceptional events.

In the next sections the principles of operation, the characteristics and the different types of NoSQL databases are described.

3.1 ACID vs BASE
Relational databases are based on a set of principles to optimize performance. The principles used by relational or NoSQL databases are derived from the CAP theorem [11]. According to this theorem, the following guarantees can be defined:
• Consistency – all nodes have the same data at the same time;
• Availability – all requests have a response;
• Partition tolerance – if part of the system fails, the whole system does not collapse.

ACID is a principle based on the CAP theorem and used as a set of rules for relational database transactions. ACID's guarantees are [17]:
• Atomic – a transaction is completed when all operations are completed, otherwise a rollback¹ is performed;
• Consistent – a transaction cannot leave the database in an invalid state; if an operation is illegal, a rollback is performed;
• Isolated – all transactions are independent and cannot affect each other;
• Durable – when a commit² is performed, transactions cannot be undone.

¹ Operation that returns the database to a consistent state.
² Operation that confirms all the changes done over the database as permanent.

It is noticeable that, in order to have a robust and correct database, those guarantees are important. But when the amount of data is large, ACID may be hard to attain. That is why NoSQL focuses on the BASE principle [17, 20]:
• Basically Available – all data is distributed; even when there is a failure the system continues to work;
• Soft state – there is no consistency guarantee;
• Eventually consistent – the system guarantees that even when data is not consistent, eventually it will be.

It is important to notice that BASE still follows the CAP theorem and, if the system is distributed, two of the three guarantees must be chosen [1]. What to choose depends on the particular needs and the database purpose. BASE is more flexible than ACID and the big difference concerns consistency. If consistency is crucial, relational databases may be a better solution, but when there are hundreds of nodes in a cluster, consistency becomes very hard to accomplish.

3.2 Data access
When it comes to data access, data interaction and extraction in NoSQL databases is different. The usual SQL language cannot be used anymore. NoSQL databases tend to favor Linux, so data can be manipulated with UNIX commands. All information can be easily manipulated using simple commands such as ls, cp, cat, etc. and extracted with I/O and redirect mechanisms. Even so, since SQL became a standard and is widely used, there are NoSQL databases where a SQL-like query language can be used; for example, UnQL – the Unstructured Query Language developed by Couchbase [22], or CQL – the Cassandra Query Language [23].

3.3 Types of NoSQL databases
With the high adherence to NoSQL databases, many different databases have been developed. Currently there are over 150 different NoSQL databases. All of them are based on the same principles but have some different characteristics. Typically, four categories can be defined [9]:
• Key-value Store. All data is stored as a set of keys and values. All keys are unique and data access is done by relating those keys to values. A hash contains all keys in order to provide information when needed. But the value may not be the actual information; it may be another key. Examples of Key-value Store databases are: DynamoDB, Azure Table Storage, Riak, Redis.
• Document Store. Those databases can be defined as a set of key-value stores that are then transformed into documents. Every document is identified by a unique key, and documents can be grouped together. The type of the documents is defined by known standards, such as XML or JSON. Data access can be done using a key or a specific value. Some examples of Document Store databases are: MongoDB, Couchbase Server, CouchDB, RavenDB.
• Column-family. This is the type most similar to the relational database model. Data is structured in columns, which may be countless. One of the projects with that approach is HBase, based on Google's Bigtable [24]. The data structure and organization consists of:
  o Column – represents a unit of data identified by a key and a value;
  o Super-column – columns grouped by information;
  o Column family – a set of structured data similar to a relational database table, constituted by a variety of super-columns.
  The structure of the database is defined by the super-columns and column families. New columns can be added whenever necessary. Data access is done by indicating the column family, key and column in order to obtain a value, using the following structure (see the sketch after this list):
  <columnFamily>.<key>.<column> = <value>
  Examples of Column-family databases: HBase, Cassandra, Accumulo, Hypertable.
• Graph database. Those databases are used when data can be represented as a graph, for example, social networks. Examples of Graph databases: Neo4J, Infinite Graph, InfoGrid, HyperGraphDB.
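To make the <columnFamily>.<key>.<column> = <value> addressing concrete, the following is a minimal sketch in Python that models a column family as nested dictionaries. The "Users" column family, its row keys and its column names are hypothetical example data used only to illustrate the addressing scheme; the sketch is not tied to any particular database API.

    # Column-family addressing sketch: columnFamily -> key -> column -> value.
    column_families = {
        "Users": {
            "user123423": {"name": "Alice", "city": "Coimbra"},
            "user123424": {"name": "Bob", "city": "Porto"},
        }
    }

    def get(column_family, key, column):
        """Resolve <columnFamily>.<key>.<column> to its stored value."""
        return column_families[column_family][key][column]

    print(get("Users", "user123423", "city"))  # prints: Coimbra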
In the next section we describe the main characteristics of the two popular NoSQL databases under test.

4. MONGODB VS CASSANDRA
In this section we describe MongoDB and Cassandra, which are the databases chosen for analysis and tests. The main characteristics to be analyzed are: data loading, only reads, reads and updates mix, read-modify-write and only updates. Those databases were chosen in order to compare different types of
databases: MongoDB as a Document Store and Cassandra as a Column-family database.

4.1 MongoDB
MongoDB is an open source NoSQL database developed in C++. It is a multiplatform database developed in 2007 by 10gen, with the first public release in 2009; it is currently in version 2.4.3 and available for download at http://www.mongodb.org/downloads.

MongoDB is a document store database where documents are grouped into collections according to their structure, although documents with a different structure can also be stored. However, in order to keep efficiency up, similarity is recommended. The format used to store documents in MongoDB is BSON – Binary JSON – and the maximum size of each document is limited to 16 MB. The identification is made by a defined type, not just an id; for example, it can be the combination of an id and a timestamp in order to keep documents unique. It is important to notice that 32-bit MongoDB has a major limitation: only 2 GB of data can be stored per node. The reason for that is the memory usage made by MongoDB. In order to increase performance, data files are mapped into memory. By default data is sent to disc every 60 seconds, but that interval can be customized. When new files are created, everything is flushed to disc, releasing memory. It is not known whether the size of the memory used by MongoDB can be defined; eventually unused files will be removed from memory by the operating system. So, since 64-bit operating systems are capable of addressing more memory, 32-bit operating systems are limited.

In order to increase performance while working with documents, MongoDB uses indexing similar to relational databases. Each document is identified by an _id field, and a unique index is created over that field. Although indexing is important to execute read operations efficiently, it may have a negative impact on inserts. Apart from the automatic index created on the _id field, additional indexes can be created by the database administrator. For example, an index can be defined over several fields within a specific collection; that feature of MongoDB is called a "compound index". However, all indexes use the same B-tree structure. Each query uses only one index, chosen by the query optimizer mechanism, which gives preference to the more efficient index. Eventually the query optimizer reevaluates the indexing used by executing alternative plans and comparing execution costs.
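As an illustration of a compound index, the sketch below uses PyMongo, the official Python driver; note that the method names follow the current driver API rather than the 2013-era shell syntax, and the database, collection and field names are hypothetical.

    from pymongo import MongoClient, ASCENDING

    client = MongoClient("localhost", 27017)
    collection = client["testdb"]["usertable"]

    # Compound index over two fields of one collection; the automatic unique
    # index on _id already exists and does not have to be created.
    collection.create_index([("field0", ASCENDING), ("field1", ASCENDING)])

    # The query optimizer picks at most one index per query, preferring the
    # more efficient one for this filter.
    doc = collection.find_one({"field0": "abc", "field1": "xyz"})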
Some of the most important characteristics of this database are durability and concurrency. Durability of data is ensured through the creation of replicas. MongoDB uses a Master-Slave replication mechanism: it allows defining a Master and one or more Slaves. The Master can write or read files, while a Slave serves as a backup, so only read operations are allowed on it. When the Master goes down, the Slave with the most recent data is promoted to Master. All replicas are asynchronous, which means that updates are not propagated immediately. Replica members can be configured by the system administrator in a variety of ways, such as (a configuration sketch follows this list):
• Secondary-Only Members. Those replicas store data but cannot be promoted to Master under any circumstances.
• Hidden Members. Hidden replicas may not become primary and are invisible to client applications. Usually those members provide dedicated backups and read-only testing. However, those replicas still vote for a new Master when a failover occurs and a primary unit must be chosen.
• Delayed Members. Replicas that copy the primary unit's operations with a specified delay, which means that the data on the replica will be older compared to the Master and will not match the last updates done.
• Arbiters. These members exist only to participate in elections and interact with all other members.
• Non-Voting Members. These replicas may not take part in elections and are usually used only in large clusters with more than 7 members.
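The member types above map onto options of the replica set configuration document. The sketch below shows such a document as a Python dictionary; the replica set name and host names are hypothetical, and the exact option names should be checked against the documentation of the MongoDB version in use (in recent versions, for example, slaveDelay has been renamed).

    # Hypothetical replica set configuration illustrating the member types above;
    # a document like this would be passed to the replSetInitiate/rs.initiate() command.
    replica_set_config = {
        "_id": "rs0",
        "members": [
            {"_id": 0, "host": "node0:27017"},                       # regular member, can become Master
            {"_id": 1, "host": "node1:27017", "priority": 0},        # secondary-only: never promoted
            {"_id": 2, "host": "node2:27017", "priority": 0,
             "hidden": True},                                        # hidden: invisible to clients, still votes
            {"_id": 3, "host": "node3:27017", "priority": 0,
             "slaveDelay": 3600},                                    # delayed: applies operations one hour late
            {"_id": 4, "host": "node4:27017", "arbiterOnly": True},  # arbiter: votes only, stores no data
            {"_id": 5, "host": "node5:27017", "votes": 0},           # non-voting member
        ],
    }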
Starting from version 2.2, MongoDB uses locks to ensure the consistency of data and to prevent multiple clients from reading and updating data at the same time. Before that, information could simply be replaced while being transferred to memory.

Similarly to an RDBMS, four core database operations can be defined and executed over MongoDB. That set of operations is called CRUD and stands for Create, Read, Update and Delete. In Figure 1 an example of the MongoDB interface is shown.

Figure 1 – MongoDB interface

Like other NoSQL databases, MongoDB is controlled by a UNIX shell, but there are some projects that have developed an interface, such as Edda, MongoVision and UMongo [25].
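A minimal CRUD sequence is sketched below with PyMongo; the method names follow the current driver API, which differs from the 2013-era shell shown in Figure 1, and the database, collection and document fields are hypothetical.

    from pymongo import MongoClient

    client = MongoClient("localhost", 27017)
    people = client["testdb"]["people"]

    # Create: insert a document (stored as BSON) into the collection.
    people.insert_one({"_id": "user123423", "name": "Alice", "age": 30})

    # Read: fetch the document back by its _id.
    doc = people.find_one({"_id": "user123423"})

    # Update: modify one field of the stored document.
    people.update_one({"_id": "user123423"}, {"$set": {"age": 31}})

    # Delete: remove the document.
    people.delete_one({"_id": "user123423"})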
4.2 Cassandra
Cassandra is a NoSQL database developed by the Apache Software Foundation and written in Java. Cassandra is distributed under the Apache License and is available at http://cassandra.apache.org/.

Being part of the Column-family category, Cassandra is very similar to the usual relational model, made of columns and rows. The main difference is the stored data, which can be structured, semi-structured or unstructured.

While using Cassandra, there is a community of support as well as professional support from some companies. Cassandra is designed to store large amounts of data and to deal with huge volumes in an efficient way. Cassandra can handle billions of columns and millions of operations per day [26]. Data can be distributed all over the world, deployed on a large number of nodes across multiple data centers. When it comes to storage, all data is stored over the cluster's nodes. When some node is added or removed, all data is automatically redistributed over the other nodes, and a failed node can be replaced with no downtime. With that, it is no longer necessary to calculate and assign data to each node. Every node in the cluster has the same role,
which means that there is no master. That architecture is known as peer-to-peer and overcomes master-slave limitations, providing high availability and massive scalability. Data is replicated over multiple nodes in the cluster. It is possible to store terabytes or petabytes of data. Failed nodes are detected by gossip protocols and those nodes can be replaced with no downtime. The total number of replicas is referred to as the replication factor. For example, a replication factor of 1 means that there is only one copy of each row, on one node, while a replication factor of 2 means that there are two copies of the same records, each one on a different node. There are two available replication strategies (a keyspace definition sketch follows this list):
• Simple Strategy: it is recommended when using a single data center. A data center can be defined as a group of related nodes in a cluster, used for replication purposes. The first replica is defined by the system administrator and additional replica nodes are chosen clockwise in the ring.
• Network Topology Strategy: it is the recommended strategy when the cluster is deployed across multiple data centers. Using this strategy it is possible to specify the number of replicas to use per data center. Commonly, in order to keep fault tolerance and consistency, two or three replicas should be used in each data center.
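As an illustration, the sketch below creates one keyspace with each strategy through the DataStax Python driver; the keyspace names, contact point and data center names are hypothetical, and the replica counts simply follow the recommendation above.

    from cassandra.cluster import Cluster

    cluster = Cluster(["127.0.0.1"])
    session = cluster.connect()

    # SimpleStrategy: single data center, three copies of each row placed clockwise in the ring.
    session.execute(
        "CREATE KEYSPACE demo_simple "
        "WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3}"
    )

    # NetworkTopologyStrategy: the number of replicas is specified per data center.
    session.execute(
        "CREATE KEYSPACE demo_multi_dc "
        "WITH replication = {'class': 'NetworkTopologyStrategy', 'dc1': 3, 'dc2': 2}"
    )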
One of the important features of Cassandra is durability. There are two available replication types, synchronous and asynchronous, and the user is able to choose which one to use. A commit log is used to capture all writes and redundancies in order to ensure data durability.

Another important feature of Cassandra is indexing. Each node maintains all the indexes of the tables it manages. It is important to notice that each node knows the range of keys managed by the other nodes, so requested rows are located by analyzing only the relevant nodes. Indexes are implemented as a hidden table, separated from the actual data. In addition, multiple indexes can be created, over different fields. However, it is important to understand when indexes should be used: with larger data volumes and a large number of unique values, more overhead will exist to manage the indexes. For example, in a database with millions of clients' records, indexing by an e-mail field, which is usually unique, will be highly inefficient.

All stored data can be easily manipulated using CQL – the Cassandra Query Language, based on the widely used SQL. Since the syntax is familiar, the learning curve is reduced and it is easier to interact with the data. In Figure 2 a Cassandra client console is shown.

Figure 2 – Cassandra console
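A short CQL session is sketched below, again through the DataStax Python driver; the keyspace, table and column names are hypothetical and simply echo the indexing caveat discussed above.

    from cassandra.cluster import Cluster

    session = Cluster(["127.0.0.1"]).connect("demo_simple")  # keyspace from the previous sketch

    # A simple table (column family) keyed by a user id.
    session.execute("CREATE TABLE users (userid text PRIMARY KEY, name text, email text)")
    session.execute("INSERT INTO users (userid, name, email) "
                    "VALUES ('user123423', 'Alice', 'alice@example.org')")

    # Secondary index over a non-key column; reasonable for a repeated value such as a name,
    # but, as noted above, inefficient for an almost-unique column such as e-mail.
    session.execute("CREATE INDEX ON users (name)")

    rows = list(session.execute("SELECT name, email FROM users WHERE userid = 'user123423'"))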
There are different ways to use Cassandra; some of the most prominent areas of use are: financial, social media, advertising, entertainment, health care, government, etc. There are many companies that use Cassandra, for example, IBM, HP, Cisco and eBay [24].

4.3 Features comparison
In order to better understand the differences between MongoDB and Cassandra, we study some features of those NoSQL databases, such as: development language, storage type, replication, data storage, usage and some other characteristics. All those characteristics are shown in Table 1.

Table 1. MongoDB and Cassandra features

                       MongoDB                                  Cassandra
Development language   C++                                      Java
Storage type           BSON files                               Column
Protocol               TCP/IP                                   TCP/IP
Transactions           No                                       Local
Concurrency            Instant update                           MVCC
Locks                  Yes                                      Yes
Triggers               No                                       Yes
Replication            Master-Slave                             Multi-Master
CAP theorem            Consistency, Partition tolerance         Partition tolerance, High Availability
Operating systems      Linux / Mac OS / Windows                 Linux / Mac OS / Windows
Data storage           Disc                                     Disc
Characteristics        Retains some SQL properties              A cross between BigTable and Dynamo;
                       such as query and index                  high availability
Areas of use           CMS systems, comment storage             Banking, finance, logging

By analyzing the core properties it is possible to conclude that there are similarities when it comes to the file types used, querying, transactions, locks, data storage and operating systems. But it is important to notice the main difference: according to the CAP theorem, MongoDB is a CP-type system – Consistency and Partition tolerance – while Cassandra is PA – Partition tolerance and Availability. In terms of replication, MongoDB uses Master-Slave while Cassandra uses peer-to-peer replication, typically named Multi-Master.

In terms of usage and best application, MongoDB is better suited for Content Management Systems (CMS), with dynamic queries and frequently written data. Cassandra is optimized to store and interact with large amounts of data and can be used in different areas such as finance or advertising. In the following, we describe the benchmark used to test the MongoDB and Cassandra databases.
5. YCSB BENCHMARK
The YCSB – Yahoo! Cloud Serving Benchmark – is one of the most used benchmarks to test NoSQL databases [10]. The YCSB client consists of two parts: the workload generator and the set of scenarios. Those scenarios, known as workloads, are combinations of read, write and update operations performed on randomly chosen records. The predefined workloads are:
• Workload A: Update heavy workload. This workload has a mix of 50/50 reads and updates.
• Workload B: Read mostly workload. This workload has a 95/5 reads/updates mix.
• Workload C: Read only. This workload is 100% reads.
• Workload D: Read latest workload. In this workload, new records are inserted, and the most recently inserted records are the most popular.
• Workload E: Short ranges. In this workload, short ranges of records are queried, instead of individual records.
• Workload F: Read-modify-write. In this workload, the client will read a record, modify it, and write back the changes.

Because our focus is on update and read operations, workloads D and E will not be used. Instead, to better understand update and read performance, two additional workloads were defined (a sketch of such a workload definition follows the list):
• Workload G: Update mostly workload. This workload has a 5/95 reads/updates mix.
• Workload H: Update only. This workload is 100% updates.
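YCSB workloads are defined in plain Java properties files. The sketch below shows what a hypothetical definition of Workload H could look like; the property names follow the YCSB CoreWorkload, while the record and operation counts and the request distribution are assumptions made for illustration (the counts match the largest data volume used later in the experiments).

    # Hypothetical properties file for Workload H (100% updates).
    workload=com.yahoo.ycsb.workloads.CoreWorkload
    recordcount=700000
    operationcount=700000
    fieldcount=10
    fieldlength=100
    readproportion=0
    updateproportion=1
    scanproportion=0
    insertproportion=0
    requestdistribution=zipfian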
The loaded data comes from a variety of files, each one with a certain number of fields. Each record is identified by a key, a string like "user123423", and each field is named field0, field1 and so on. The values of each field are random characters. For testing we use records with 10 fields, each of 100 bytes, meaning 1 KB per record.

Since the client and the server are hosted on the same node, latency will not take part in this study. YCSB provides the configuration of threads and of the number of operations per thread. During initial tests we observed that, when using threads, the number of operations per second actually decreased. That is due to the fact that the tests are running on a virtual machine with even fewer resources than the host.

6. EXPERIMENTAL EVALUATION
In this section we describe the experiments using different workloads and data volumes. Tests were run using an Ubuntu Server 12.04 32-bit Virtual Machine on VMware Player. As experimental setup, it is important to notice that the VM had 2 GB of RAM available and the host was a single-node Core 2 Quad 2.40 GHz with 4 GB of RAM running the Windows 7 operating system. The tested versions of the NoSQL databases are MongoDB version 2.4.3 and Cassandra version 1.2.4.

As the focus of the study, we take the execution time to evaluate the best database performance. All workloads were executed three times, with a reset of the computer between tests. All the values are shown as (minutes:seconds) and represent the average of the three executions.

In the following figures we show the data loading phase tests and the execution times for the different types of workloads: A, B, C, F, G, and H.

Data loading phase, execution time in min:sec:
             100K     280K     700K
MongoDB      00:45    02:00    04:42
Cassandra    00:59    02:24    05:42
Figure 3 - Data loading test

To compare loading speed and throughput, different volumes of data were loaded, with 100.000, 280.000 and 700.000 records, as shown in Figure 3. Observing the results, it is possible to see that there was no significant difference between MongoDB and Cassandra. MongoDB had a slightly lower insert time, regardless of the number of records, compared to Cassandra, which has an average overhead of 24%. When the size of the loaded data increases, the execution time increases in a similar proportion for both databases, with the highest times being 04:42 for MongoDB and 05:42 for Cassandra when inserting 700.000 records.
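As a sanity check, the 24% average overhead can be recomputed from the values in Figure 3 by averaging the per-volume Cassandra/MongoDB loading-time ratios, for example with the short Python snippet below (the averaging method is an assumption, but it reproduces the reported figure).

    # Loading times from Figure 3, converted to seconds, for 100K, 280K and 700K records.
    mongodb = [45, 120, 282]
    cassandra = [59, 144, 342]

    ratios = [c / m for c, m in zip(cassandra, mongodb)]
    overhead = sum(ratios) / len(ratios) - 1
    print(f"average Cassandra loading overhead: {overhead:.0%}")  # about 24%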
Workload A (50/50 reads and updates), execution time in min:sec:
             100K     280K     700K
MongoDB      00:19    00:31    00:28
Cassandra    00:10    00:14    00:11
Figure 4 - Workload A experiments

Compared to MongoDB, Cassandra had better execution times regardless of database size. The performance of Cassandra can be 2.54 times faster than MongoDB using a mix of 50/50 reads and updates with 700.000 records. Another important fact that can be observed is the decrease in execution time when the number of records used goes from 280.000 up to 700.000, for both databases (see Figure 4). This happens due to the optimization of the databases to work with larger volumes of data.
Workload B (95/5 reads and updates), execution time in min:sec:
             100K     280K     700K
MongoDB      00:12    00:22    00:32
Cassandra    00:29    00:21    00:18
Figure 5 - Workload B experiments

When we test the databases with a 95/5 reads/updates mix, the results for Cassandra and MongoDB show different behavior, as shown in Figure 5. While the execution time for MongoDB kept increasing, Cassandra was able to reduce its time as the data volume became larger. However, the highest time for Cassandra was 00:29 and corresponds to querying over 100.000 records, whereas for MongoDB the highest time was 00:32, for 700.000 records. The performance of Cassandra with this workload is 56% better than MongoDB when using 700.000 records, although for small data sizes (100.000 records) MongoDB has better results.

Workload C (100% reads), execution time in min:sec:
             100K     280K     700K
MongoDB      00:16    00:27    00:35
Cassandra    00:43    00:24    00:20
Figure 6 - Workload C experiments

In this workload we have 100% reads. As in the previous experiments, when it comes to a large amount of read operations, Cassandra becomes more efficient with a bigger quantity of data, as illustrated in Figure 6. MongoDB showed behavior similar to the previous workload, where execution time is directly proportional to data size. However, MongoDB is 2.68 times faster when using 100.000 records but 1.75 times slower for 700.000 records, when compared to Cassandra's execution time. The fastest execution time of MongoDB is 00:16 and of Cassandra is 00:20; however, those results correspond to opposite volumes of data, the better execution time being for Cassandra with a high number of records and for MongoDB with just 100.000 records.

Workload F (read-modify-write), execution time in min:sec:
             100K     280K     700K
MongoDB      00:19    00:21    00:36
Cassandra    00:40    00:21    00:20
Figure 7 - Workload F experiments

In this workload, the client reads a record, modifies it, and writes back the changes. Here Cassandra and MongoDB showed opposite behavior, as illustrated in Figure 7. Cassandra's highest execution time was with the small data volume and, as the volume increased, it kept decreasing, while MongoDB has its worst time with the bigger data size. MongoDB is 2.1 times faster for querying over 100.000 records but 1.8 times slower for 700.000 records, and has the same value for 280.000 records, when compared to Cassandra's execution time. The smallest execution time variations were 00:01 for Cassandra when increasing the number of records from 280.000 up to 700.000, and 00:02 for MongoDB when lowering the number of records used from 280.000 down to 100.000.

Workload G (5/95 reads and updates), execution time in min:sec:
             100K     280K     700K
MongoDB      00:23    00:31    00:36
Cassandra    00:01    00:02    00:03
Figure 8 - Workload G experiments

This workload has a 5/95 reads/updates mix. The results shown in Figure 8 are absolutely demonstrative of the superiority of Cassandra over MongoDB for all database sizes. In every execution Cassandra showed better results. With the growth of data volume both Cassandra and MongoDB started having higher execution times, but MongoDB was not even close to Cassandra. The performance of Cassandra with this workload varies from 23 to 12 times faster than MongoDB. That established difference in performance allows us to conclude that, in this environment, Cassandra is more optimized for update operations compared to MongoDB, showing surprisingly high performance results.
Workload H (100% updates), execution time in min:sec:
             100K     280K     700K
MongoDB      00:25    00:27    00:43
Cassandra    00:01    00:01    00:01
Figure 9 - Workload H experiments

When it came to a 100% update workload, Cassandra had stable performance even with an increased number of records, as shown in Figure 9. Similarly to the results of workload G, Cassandra showed great results compared to MongoDB, being from 25 to 43 times better. For MongoDB, the difference in execution time between 100.000 records and 280.000 records was not big, but the time almost doubled when using 700.000 records.

7. CONCLUSIONS AND FUTURE WORK
The development of the Web needs databases able to store and process big data effectively and demands high performance when reading and writing, so traditional relational databases are facing many new challenges. NoSQL databases have gained popularity in recent years and have been successful in many production systems. In this paper we analyze and evaluate two of the most popular NoSQL databases: MongoDB and Cassandra. In the experiments we test the execution time according to database size and type of workload. We test six different types of workloads: a mix of 50/50 reads and updates; a mix of 95/5 reads/updates; read only; a read-modify-write cycle; a mix of 5/95 reads/updates; and update only. With the increase of data size, MongoDB's performance started to decrease, sometimes showing poor results. In contrast, Cassandra just got faster while working with increasing amounts of data. Also, after running the different workloads to analyze read/update performance, it is possible to conclude that when it comes to update operations, Cassandra is faster than MongoDB, providing lower execution times independently of the database size used in our evaluation. The overall analysis shows that MongoDB fell short as the number of records increased, while Cassandra still has a lot to offer. In conclusion, Cassandra shows the best results for almost all scenarios.

As future work, we intend to analyze the number of operations per second vs. database size. That would help to understand how those databases behave with a higher number of records to read/update as the data volume grows.

8. ACKNOWLEDGMENTS
Our thanks to ISEC – Coimbra Institute of Engineering from the Polytechnic Institute of Coimbra for allowing us to use the facilities of the Laboratory of Research and Technology Innovation of the Computer Science and Systems Engineering Department.

9. REFERENCES
[1] Brewer, E., "CAP twelve years later: How the 'rules' have changed," Computer, vol. 45, no. 2, pp. 23-29, Feb. 2012. doi: 10.1109/MC.2012.37.
[2] Codd, E. F. 1970. A relational model of data for large shared data banks. Communications of the ACM 13, 6 (June 1970), 377-387. doi: 10.1145/362384.362685.
[3] Codd, E. F. 1985. "Is your DBMS Really Relational?" and "Does your DBMS Run by the Rules?" Computer World, October 14 and October 21.
[4] Cooper, B. F., Adam Silberstein, Erwin Tam, Raghu Ramakrishnan, and Russell Sears. 2010. Benchmarking cloud serving systems with YCSB. In Proceedings of the 1st ACM Symposium on Cloud Computing (SoCC '10). ACM, New York, NY, USA, 143-154. doi: 10.1145/1807128.1807152. http://doi.acm.org/10.1145/1807128.1807152
[5] DeCandia, Giuseppe, Deniz Hastorun, Madan Jampani, Gunavardhan Kakulapati, Avinash Lakshman, Alex Pilchin, Swaminathan Sivasubramanian, Peter Vosshall, and Werner Vogels. 2007. Dynamo: Amazon's highly available key-value store. In Proceedings of the Twenty-First ACM SIGOPS Symposium on Operating Systems Principles (SOSP '07). ACM, New York, NY, USA, 205-220.
[6] Donald D. Chamberlin, Raymond F. Boyce: SEQUEL: A Structured English Query Language. SIGMOD Workshop, Vol. 1, 1974: 249-264.
[7] Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach, Mike Burrows, Tushar Chandra, Andrew Fikes, and Robert E. Gruber. 2006. Bigtable: a distributed storage system for structured data. In Proceedings of the 7th USENIX Symposium on Operating Systems Design and Implementation (OSDI '06), Vol. 7. USENIX Association, Berkeley, CA, USA, 15-15.
[8] Hecht, R.; Jablonski, S., "NoSQL evaluation: A use case oriented survey," Cloud and Service Computing (CSC), 2011 International Conference on, pp. 336-341, 12-14 Dec. 2011. doi: 10.1109/CSC.2011.6138544.
[9] Indrawan-Santiago, M., "Database Research: Are We at a Crossroad? Reflection on NoSQL," Network-Based Information Systems (NBiS), 2012 15th International Conference on, pp. 45-51, 26-28 Sept. 2012. doi: 10.1109/NBiS.2012.95.
[10] Jayathilake, D.; Sooriaarachchi, C.; Gunawardena, T.; Kulasuriya, B.; Dayaratne, T., "A study into the capabilities of NoSQL databases in handling a highly heterogeneous tree," Information and Automation for Sustainability (ICIAfS), 2012 IEEE 6th International Conference on, pp. 106-111, 27-29 Sept. 2012. doi: 10.1109/ICIAFS.2012.6419890.
[11] Jing Han; Haihong, E.; Guan Le; Jian Du, "Survey on NoSQL database," Pervasive Computing and Applications (ICPCA), 2011 6th International Conference on, pp. 363-366, 26-28 Oct. 2011. doi: 10.1109/ICPCA.2011.6106531.
[12] Leavitt, N., "Will NoSQL Databases Live Up to Their Promise?," Computer, vol. 43, no. 2, pp. 12-14, Feb. 2010. doi: 10.1109/MC.2010.58.
[13] Lith, Adam; Jakob Mattson (2010). "Investigating storage solutions for large data: A comparison of well performing and scalable data storage solutions for real time extraction and batch insertion of data". Göteborg: Department of Computer Science and Engineering, Chalmers University of Technology.
[14] Lombardo, S.; Di Nitto, E.; Ardagna, D., "Issues in Handling Complex Data Structures with NoSQL Databases," Symbolic and Numeric Algorithms for Scientific Computing (SYNASC), 2012 14th International Symposium on, pp. 443-448, 26-29 Sept. 2012. doi: 10.1109/SYNASC.2012.59.
[15] M. M. Astrahan, A history and evaluation of System R, Performance Evaluation, Volume 1, Issue 1, January 1981, Page 95, ISSN 0166-5316, doi: 10.1016/0166-5316(81)90053-5.
[16] nosql-database.org, accessed on 30th April 2013.
[17] Roe, C. 2012. "ACID vs. BASE: The Shifting pH of Database Transaction Processing" - http://www.dataversity.net/acid-vs-base-the-shifting-ph-of-database-transaction-processing/.
[18] Shidong Huang; Lizhi Cai; Zhenyu Liu; Yun Hu, "Non-structure Data Storage Technology: A Discussion," Computer and Information Science (ICIS), 2012 IEEE/ACIS 11th International Conference on, pp. 482-487, May 30 2012 - June 1 2012. doi: 10.1109/ICIS.2012.76.
[19] Silberstein, A.; Jianjun Chen; Lomax, D.; McMillan, B.; Mortazavi, M.; Narayan, P. P. S.; Ramakrishnan, R.; Sears, R., "PNUTS in Flight: Web-Scale Data Serving at Yahoo," Internet Computing, IEEE, vol. 16, no. 1, pp. 13-23, Jan.-Feb. 2012. doi: 10.1109/MIC.2011.142.
[20] Tudorica, B.G.; Bucur, C., "A comparison between several NoSQL databases with comments and notes," Roedunet International Conference (RoEduNet), 2011 10th, pp. 1-5, 23-25 June 2011. doi: 10.1109/RoEduNet.2011.5993686.
[21] Yahoo! Developer Network 2009. Notes from the NoSQL Meetup. http://developer.yahoo.com/blogs/ydn/notes-nosql-meetup-7663.html.
[22] http://www.couchbase.com/press-releases/unql-query-language, accessed on 30th April 2013.
[23] http://www.datastax.com/docs/1.0/references/cql/index, accessed on 30th April 2013.
[24] http://cassandra.apache.org/, accessed on 30th April 2013.
[25] http://docs.mongodb.org/ecosystem/tools/administration-interfaces/, accessed on 30th April 2013.
[26] http://www.datastax.com/what-we-offer/products-services/datastax-enterprise/apache-cassandra, accessed on 30th April 2013.
