Cassandra vs MongoDB
Independent benchmark analyses of various NoSQL platforms under big data and
production-level workloads have been performed over the years. Most of them, including recent
ones, show that Apache Cassandra performs significantly better than Couchbase 3.0 and MongoDB 3.0
(with the WiredTiger storage engine) in both throughput and latency.
Cassandra advantages:
Peer-to-Peer Architecture:
Every node in a Cassandra cluster plays an identical role; there is no master node, so there is no
single point of failure and any node can serve any request.
Elastic Scalability:
One of the biggest advantages of using Cassandra is its elastic scalability. A Cassandra cluster can be
easily scaled up or scaled down: any number of nodes can be added to or removed from the cluster
with little disruption. You don't have to restart the cluster or change the application's queries while
scaling. This is why Cassandra is known for sustaining very high throughput at large node counts. As
scaling happens, read and write throughput both increase, with zero downtime and no pause to the
applications.
Data Replication:
Another striking feature of Cassandra is data replication, which makes Cassandra highly available and
fault-tolerant. Replication means each piece of data is stored in more than one location, so even if
one node fails, the user can still retrieve the data from another location. In a Cassandra cluster, each
row is replicated based on its row key, and you can set the number of replicas you want to create. Just
like scaling, data replication can also span multiple data centers, which gives Cassandra strong
backup and recovery capabilities.
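As a minimal sketch of how this is configured, using the DataStax Python driver (the keyspace name,
data center names, and replica counts here are illustrative assumptions):

    from cassandra.cluster import Cluster

    cluster = Cluster(["127.0.0.1"])
    session = cluster.connect()

    # Keep 3 replicas of every row in dc1 and 2 replicas in dc2.
    session.execute("""
        CREATE KEYSPACE IF NOT EXISTS app_data
        WITH replication = {
            'class': 'NetworkTopologyStrategy',
            'dc1': 3,
            'dc2': 2
        }
    """)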
High Performance:
Cassandra provides very fast writes, which are actually faster than reads; a node can move data at
about 80-360 MB/sec. It achieves this using two techniques:
Cassandra keeps most of the data in memory at the responsible node, and any updates are
made in memory and written to persistent storage (the file system) in a lazy fashion. To avoid
losing data, however, Cassandra writes all transactions to a commit log on disk. Unlike
in-place updates of data items on disk, writes to the commit log are append-only and therefore
avoid rotational delay while writing to the disk.
Unless a write requests full consistency, Cassandra writes data to enough nodes without
resolving any data inconsistencies; inconsistencies are resolved only at the first read. This
process is called "read repair."
Tunable Consistency:
Cassandra supports two consistency models:
Eventual consistency - the client is acknowledged as soon as the cluster accepts the
write
Strong consistency - any update is broadcast to all machines or nodes on which the
particular data is situated
You can adopt either of these, based on your requirements. You also have the freedom to blend
eventual and strong consistency. For instance, you can use eventual consistency for remote
data centers, where latency is quite high, and strong consistency for local data centers, where
latency is low.
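As a minimal sketch of tunable consistency with the DataStax Python driver (the users table is an
illustrative assumption), the consistency level can be chosen per statement:

    from cassandra import ConsistencyLevel
    from cassandra.cluster import Cluster
    from cassandra.query import SimpleStatement

    session = Cluster(["127.0.0.1"]).connect("app_data")

    # Eventual consistency: acknowledge once a single replica accepts the write.
    fast_write = SimpleStatement(
        "INSERT INTO users (id, name) VALUES (%s, %s)",
        consistency_level=ConsistencyLevel.ONE,
    )
    session.execute(fast_write, (1, "alice"))

    # Strong consistency: a majority of replicas must acknowledge the write.
    safe_write = SimpleStatement(
        "INSERT INTO users (id, name) VALUES (%s, %s)",
        consistency_level=ConsistencyLevel.QUORUM,
    )
    session.execute(safe_write, (2, "bob"))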
Replication:
Cassandra has much more advanced support for replication because it is aware of the network
topology. The server can be set to use a specific consistency level to ensure that queries are replicated
locally or to remote data centers. This means you can let Cassandra handle redundancy across nodes
while it knows which rack and data center each node is in. Cassandra can also monitor nodes and
route queries away from slow-responding nodes.
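A minimal sketch of this topology awareness on the client side, assuming the DataStax Python driver
and a local data center named dc1:

    from cassandra.cluster import Cluster, ExecutionProfile, EXEC_PROFILE_DEFAULT
    from cassandra.policies import DCAwareRoundRobinPolicy, TokenAwarePolicy

    # Route each query to a replica that owns the requested data (token-aware),
    # preferring nodes in the local data center.
    profile = ExecutionProfile(
        load_balancing_policy=TokenAwarePolicy(
            DCAwareRoundRobinPolicy(local_dc="dc1")
        )
    )
    cluster = Cluster(["127.0.0.1"],
                      execution_profiles={EXEC_PROFILE_DEFAULT: profile})
    session = cluster.connect("app_data")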
Idempotency:
Idempotency is easy to maintain (there is no need to run a query before an insertion), which prevents
duplication of data.
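This falls out of Cassandra's write model: an INSERT is an upsert on the primary key. A minimal
sketch (the users table is an illustrative assumption):

    from cassandra.cluster import Cluster

    session = Cluster(["127.0.0.1"]).connect("app_data")

    insert = "INSERT INTO users (id, name) VALUES (%s, %s)"
    session.execute(insert, (1, "alice"))
    session.execute(insert, (1, "alice"))  # same primary key: overwrite, not a duplicate

    row = session.execute("SELECT count(*) FROM users WHERE id = 1").one()
    print(row[0])  # 1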
Memory requirements:
Cassandra is much lighter on memory requirements, especially if you don't need to keep a lot of
data in cache.
Hadoop advantages:
Hadoop Distributed File System (HDFS) - can store massive, distributed, unstructured data sets.
Data can be stored directly in HDFS, or in a semi-structured format in HBase,
which allows rapid record-level data access
MapReduce capabilities are very strong
Hadoop disadvantages:
HDFS is extremely complex to set up
Has single points of failure
MongoDB advantages:
Easier development and much better documentation
Better fit for a single server
Stores BSON (a binary form of JSON), which is easy to manage and extremely useful when working
with web applications
Strongly consistent by default
Scalability – MongoDB has a number of functions related to scalability (see the sketch after this list)
o automatic sharding (auto-partitioning of data across servers)
o reads and writes distributed over shards
o eventually-consistent reads that can be distributed over replicated servers
Availability - data is spread across several shards (replica sets).
Typically, each shard consists of multiple Mongo Daemon instances, including an
arbiter node, a master node, and multiple slaves. If a slave node fails, its
workload is automatically redistributed to the remaining slaves. If the master
node crashes, the surviving nodes, with the arbiter casting a vote, elect a new master.
A replica set can span multiple data centers, but writes can only go to the one primary instance
in one data center.
Simple and powerful indexing - indexes work very much like in relational databases. You can
create single-field or compound indexes at the collection level, and every document inserted
into that collection has those fields indexed. Querying by index is extremely fast as long as all
your indexes fit in memory (see the sketch after this list).
Dynamic queries, sorting, rich updates…
MapReduce can be used for batch processing of data and for aggregation operations. The
aggregation framework enables users to obtain the kinds of results for which the SQL GROUP BY
clause is used.
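As a minimal pymongo sketch of the sharding, indexing, and aggregation features above (database,
collection, and field names are illustrative, and the sharding commands assume a connection to the
mongos router of an already configured sharded cluster):

    from pymongo import ASCENDING, MongoClient

    client = MongoClient("mongodb://localhost:27017")
    db = client.shop

    # Auto-sharding: partition shop.orders across shards by customer_id.
    client.admin.command("enableSharding", "shop")
    client.admin.command("shardCollection", "shop.orders", key={"customer_id": 1})

    # Compound index at the collection level; every inserted document
    # has these fields indexed.
    db.orders.create_index([("customer_id", ASCENDING), ("created_at", ASCENDING)])

    # Aggregation framework: the MongoDB counterpart of SQL GROUP BY.
    pipeline = [
        {"$group": {"_id": "$customer_id", "total": {"$sum": "$amount"}}},
        {"$sort": {"total": -1}},
    ]
    for doc in db.orders.aggregate(pipeline):
        print(doc["_id"], doc["total"])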
MongoDB disadvantages:
A global write lock limits its use for big data applications. (When you perform a write operation in
MongoDB, it takes a lock on the entire database, not just the affected entries, and not just for
a particular connection. This lock blocks not only other write operations but also read
operations.)
Writes in MongoDB are "unsafe" by default.
Data isn't written to disk right away, so a write operation can return
success yet be lost if the server fails before the data is flushed to disk. This is how Mongo attains
high performance. If you need increased durability, you can request a safe write, which
guarantees the data is written to disk before returning (see the sketch after this list).
Memory usage - MongoDB naturally tends to use more memory because it has to
store the key names within each document, since the data structure is not
necessarily consistent among the data objects.
Increasing the cluster size in Mongo involves a lot of manual operations done through the command
line, so a highly skilled system administrator is practically mandatory for this database.
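A minimal pymongo sketch of trading durability for speed per collection (the default write concern
has changed across MongoDB versions, so the levels are spelled out explicitly here; names are
illustrative):

    from pymongo import MongoClient
    from pymongo.write_concern import WriteConcern

    db = MongoClient("mongodb://localhost:27017").shop

    # Fire-and-forget write (w=0): fastest, but may be silently lost if the
    # server fails before the data reaches disk.
    fast = db.orders.with_options(write_concern=WriteConcern(w=0))
    fast.insert_one({"customer_id": 1, "amount": 9.99})

    # "Safe" write: wait for acknowledgement and for the journal to be
    # written to disk before returning.
    safe = db.orders.with_options(write_concern=WriteConcern(w=1, j=True))
    safe.insert_one({"customer_id": 2, "amount": 19.99})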
Couchbase advantages:
Couchbase is really user/developer/admin friendly. You can easily see what's going on in your
cluster through the web console, and when things go wrong, the web console is a huge advantage.
Built-in caching mechanism - Couchbase includes a Memcached component that can operate
independently (if you wish) of the document storage components.
Low-latency read and write operations
No single point of failure
Document (key-value) access in Couchbase is strongly consistent, while query access is eventually
consistent (see the sketch after this list)
Scalability - it is easy to scale out the cluster, and live cluster topology changes are supported (all
nodes are identical, easy to set up, and can be added or removed with no changes to the application)
Cross-datacenter replication makes it possible to scale a cluster across datacenters for better
data locality and faster data access.
Availability - Couchbase Server maintains multiple copies (up to three replicas) of
each document in a cluster. Each server is identical and serves both active and
replica documents. Data is uniformly distributed across all the nodes, and the
clients are aware of the topology. If a node in the cluster fails, Couchbase Server
detects the failure and promotes replica documents on other live nodes to active.
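A minimal sketch of that consistency split, assuming the Couchbase Python SDK 4.x (module paths
moved between SDK versions) and an illustrative bucket named travel with a primary index in place:

    from couchbase.auth import PasswordAuthenticator
    from couchbase.cluster import Cluster
    from couchbase.n1ql import QueryScanConsistency
    from couchbase.options import ClusterOptions, QueryOptions

    cluster = Cluster("couchbase://localhost",
                      ClusterOptions(PasswordAuthenticator("user", "password")))
    collection = cluster.bucket("travel").default_collection()

    # Key-value (document) access is strongly consistent: a get after an
    # upsert always observes the write.
    collection.upsert("user::1", {"name": "alice"})
    print(collection.get("user::1").content_as[dict])

    # Query (N1QL) access is eventually consistent by default; REQUEST_PLUS
    # forces the index to catch up with this client's mutations first.
    result = cluster.query(
        "SELECT t.* FROM travel AS t WHERE t.name = $1",
        QueryOptions(scan_consistency=QueryScanConsistency.REQUEST_PLUS,
                     positional_parameters=["alice"]),
    )
    for row in result:
        print(row)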