DBMS Module 5 Part 2
MODULE 5 PART 3
What Is NoSQL?
Databases are one of the most important components of modern technology and applications. Data
needs to be stored in a specific structure and format so that it can be retrieved whenever
required. However, data is not always in a structured format, i.e., its schema is not rigid.
In this chapter, you will learn in detail about NoSQL and its characteristic features.
NoSQL can be defined as an approach to database design that accommodates a wide variety of data
models, such as key-value, multimedia, document, columnar, and graph formats, as well as external
files. NoSQL databases are purpose-built for specific data models and have flexible schemas for
building modern applications.
NoSQL is known for its rich functionality, ease of development, and performance at scale. Because
it handles such diverse data, a NoSQL database is called a non-relational database. It does not
follow the rules of Relational Database Management Systems (RDBMS), and hence does not use
traditional SQL statements to query data. Some well-known examples are MongoDB, Neo4j, and
HyperGraphDB.
Why NoSQL?
The concept of NoSQL databases became popular with Internet giants such as Google,
Facebook, and Amazon, which deal with huge volumes of data. System response time
becomes slow when an RDBMS is used for massive volumes of data.
To resolve this problem, we could “scale up” our systems by upgrading our existing
hardware. This process is expensive.
The alternative is to distribute the database load across multiple hosts whenever the
load increases. This method is known as “scaling out.”
NoSQL databases are non-relational and designed with web applications in mind, so they
scale out better than relational databases.
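As a rough illustration of scaling out, the sketch below spreads keys across several hosts using a stable hash. The host names, the key format, and the `pick_host` routing rule are assumptions made for this example; real NoSQL systems use their own partitioning schemes.

```python
import hashlib

def pick_host(key: str, hosts: list[str]) -> str:
    """Map a key to one host with a stable hash (hypothetical routing rule)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(hosts)
    return hosts[index]

def distribute(keys: list[str], hosts: list[str]) -> dict[str, list[str]]:
    """Group keys by the host that would store them."""
    placement = {host: [] for host in hosts}
    for key in keys:
        placement[pick_host(key, hosts)].append(key)
    return placement

hosts = ["host-a", "host-b", "host-c"]   # hypothetical host names
keys = [f"user:{n}" for n in range(12)]  # hypothetical keys
placement = distribute(keys, hosts)

for host, assigned in placement.items():
    print(host, len(assigned))
```

Adding a host to the `hosts` list spreads the same keys over more machines, which is the idea behind scaling out. With this simple modulo rule many keys would move to a different host, which is one reason production systems tend to use more careful schemes.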
Features of NoSQL
Non-relational
Schema-free
Simple API
• Offers easy-to-use interfaces for storing and querying data
• APIs allow low-level data manipulation and selection methods
• Text-based protocols are mostly used, typically HTTP REST with JSON
• Mostly no standards-based NoSQL query language is used
• Web-enabled databases run as internet-facing services
Distributed
1. Key-value stores: the most straightforward type, where every item in the database is stored
as an attribute name (i.e., "key") along with its value.
2. Wide-column stores: store data together by column rather than by row, which is optimized
for querying big datasets.
3. Document databases: pair every key with a composite data structure called a document.
A document can hold many different key-value pairs, key-array pairs, or even nested
documents.
4. Graph databases: used for storing information about networks, such as social connections.
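The four types listed above can be pictured with plain Python structures. This is only a sketch of the data shapes; the sample keys and values and the `followers_of` helper are made up for illustration and do not correspond to any particular product.

```python
# Key-value store: one value per key.
key_value = {"session:42": "alice"}

# Wide-column store: each row key maps to its own set of named columns.
wide_column = {
    "user:1": {"name": "Alice", "city": "Pune"},
    "user:2": {"name": "Bob"},  # rows need not share the same columns
}

# Document database: a key paired with a composite document.
documents = {
    "order:7": {
        "customer": "Alice",
        "items": ["pen", "notebook"],  # key-array pair
        "shipping": {"city": "Pune"},  # nested document
    }
}

# Graph database: relationships stored as (source, relation, target) edges.
graph_edges = [("Alice", "FOLLOWS", "Bob"), ("Bob", "FOLLOWS", "Carol")]

def followers_of(person, edges):
    """Return everyone with a FOLLOWS edge pointing at the given person."""
    return [src for src, rel, dst in edges if rel == "FOLLOWS" and dst == person]

print(followers_of("Bob", graph_edges))
```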
• SQL databases are mainly relational databases (RDBMS), whereas NoSQL databases are mostly
non-relational or distributed databases.
• SQL databases are table-oriented, whereas NoSQL databases are document-oriented, key-value,
wide-column, or graph stores.
• SQL databases have a predefined, static schema that is rigid, whereas NoSQL databases have a
dynamic or flexible schema that can handle unstructured data.
• SQL is used to store structured data, whereas NoSQL is used to store structured as well as
unstructured data.
• SQL databases are considered vertically scalable, whereas NoSQL databases are considered
horizontally scalable.
• SQL databases are scaled by increasing the horsepower of the hardware, whereas NoSQL
databases are scaled by adding database servers to reduce the load.
• Examples of SQL databases: MySQL, SQLite, Oracle, PostgreSQL, and MS-SQL. Examples of NoSQL
databases: Bigtable, MongoDB, Redis, Cassandra, RavenDB, HBase, CouchDB, and Neo4j.
• SQL databases are a good fit for query-intensive environments with complex queries, whereas
NoSQL databases are not an excellent fit for complex queries. NoSQL query languages are not as
powerful as SQL.
• SQL databases need vertical scalability, i.e., excess load is managed by increasing the CPU,
SSD, RAM, GPU, etc., on your server. NoSQL databases are horizontally scalable, i.e., adding
more servers eases the load.
RDBMS | NoSQL
It uses Structured Query Language (SQL), Data Manipulation Language (DML), and Data Definition Language (DDL) for defining and manipulating the data. | The query language varies from database to database.
Examples: MySQL, Oracle, SQLite, PostgreSQL, and MS-SQL. | Examples: MongoDB, Redis, HBase, RavenDB, Cassandra, Neo4j, and CouchDB.
Advantages of NoSQL
Disadvantages of NoSQL
• No standardization rules
• Limited query capabilities
• RDBMS databases and tools are comparatively mature
• Many NoSQL databases do not offer traditional database capabilities, such as consistency
when multiple transactions are performed simultaneously.
• As the volume of data increases, it becomes difficult to maintain unique values as keys.
• Doesn’t work as well with relational data
• The learning curve is steep for new developers
• Many options are open source, which makes them less popular with enterprises.
Redis
Once Redis is installed on a server, run the Redis CLI (Command Line Interface) to issue
commands to Redis. While you work in the CLI tool, your command-line prompt will
change to: redis>
Features:
• Persistence : While all the data lives in memory, changes are asynchronously
saved on disk using flexible policies based on elapsed time and/or number of
updates since last save. Redis also supports an append-only file persistence
mode.
• Atomic Operations : Redis operations working on the different Data Types are
atomic, so setting or increasing a key, adding and removing elements from a
set, increasing a counter will all be accomplished safely.
• Portable : Redis is written in ANSI C and works in most POSIX systems like
Linux, BSD, Mac OS X, Solaris, and so on. Redis is reported to compile and
work under WIN32 if compiled with Cygwin, but there is no official support for
Windows currently.
• Redis is a different evolution path in the key-value DBs where values can
contain more complex data types, with atomic operations defined on those
data types. Redis data types are closely related to fundamental data
structures and are exposed to the programmer as such, without additional
abstraction layers.
If Redis runs out of memory, it will either be killed by the Linux kernel OOM killer, crash
with an error, or start to slow down. On modern operating systems, malloc() returning NULL is
not common; usually the server will start swapping and Redis performance will degrade,
so you will probably notice that something is wrong.
The INFO command will report the amount of memory Redis is using so you can
write scripts that monitor your Redis servers checking for critical conditions.
Redis has built-in protections that let the user cap memory usage with the maxmemory
option in the config file. If this limit is reached, Redis will start to reply with an
error to write commands, or, if you are using Redis for caching, you can configure it
to evict keys when the max memory limit is reached.
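The two behaviours described above (reject writes, or evict keys) can be modelled with a toy in-memory store. This is a sketch of the idea only, not how Redis is implemented: the `BoundedStore` class is hypothetical, it counts keys rather than bytes, and it always evicts the least recently used key.

```python
from collections import OrderedDict

class BoundedStore:
    """Toy key-value store with a cap on the number of keys (hypothetical)."""

    def __init__(self, max_keys: int, evict: bool):
        self.max_keys = max_keys
        self.evict = evict          # True: evict old keys; False: reject writes
        self._data = OrderedDict()  # oldest (least recently used) entry first

    def set(self, key, value):
        if key in self._data:
            self._data[key] = value
            self._data.move_to_end(key)
            return
        if len(self._data) >= self.max_keys:
            if not self.evict:
                raise MemoryError("limit reached: write rejected")
            self._data.popitem(last=False)  # drop the least recently used key
        self._data[key] = value

    def get(self, key):
        if key not in self._data:
            return None
        self._data.move_to_end(key)  # mark as recently used
        return self._data[key]

    def keys(self):
        return list(self._data)

cache = BoundedStore(max_keys=2, evict=True)
cache.set("a", 1)
cache.set("b", 2)
cache.get("a")     # "a" is now the most recently used key
cache.set("c", 3)  # the store is full, so "b" is evicted
print(cache.keys())
```

With `evict=False` the same third write raises an error instead, which mirrors the case where the store is used as a database rather than as a cache.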
You can easily build complex systems on top of Redis, here is a sample list :
MongoDB
MongoDB is a document-oriented NoSQL database system. It provides high availability and high
performance, along with automatic scaling. This open-source product was developed by the
company 10gen in October 2007, and the company also maintains it. MongoDB was released as a
free database management tool under the GNU Affero General Public License (AGPL), with newer
releases under the Server Side Public License (SSPL), and it is also available under a
commercial license from the manufacturer. MongoDB was also intended to run on commodity
servers. Companies of different sizes all over the world, across all industries, use MongoDB
as their database.
In MongoDB, a database can be defined as a physical container for collections of data.
Every database has its own set of files on the file system. Usually, a MongoDB
server contains multiple databases.
A collection can be defined as a group of MongoDB documents that exist within a
single database. It is comparable to a table in a relational database management
system. MongoDB collections do not enforce a schema. Documents within a
collection can have different fields. Typically, all the documents residing within a
collection are meant for a comparable or related purpose.
A document can be defined as a set of key-value pairs with a dynamic
schema. Dynamic schema means that documents in the same collection do not need
to have the same set of fields or structure, and a common field is capable of
holding various types of data.
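A dynamic schema can be pictured with plain Python dictionaries. The sketch below models a collection as a list of dict documents; it does not use a MongoDB driver, and the `students` data and the helper functions are invented for illustration.

```python
# A "collection" modelled as a list of dict "documents" (illustrative data).
students = [
    {"_id": 1, "name": "Asha", "marks": 82},
    {"_id": 2, "name": "Ravi", "marks": "A+", "email": "ravi@example.com"},
    {"_id": 3, "name": "Meera", "address": {"city": "Kochi"}},
]

def fields_in_use(collection):
    """Collect every field name that appears in at least one document."""
    names = set()
    for doc in collection:
        names.update(doc)
    return names

def types_of(collection, field):
    """Return the type names a given field takes across the collection."""
    return {type(doc[field]).__name__ for doc in collection if field in doc}

print(sorted(fields_in_use(students)))
print(sorted(types_of(students, "marks")))
```

The documents share a purpose but not a fixed set of fields, and the common field `marks` holds an integer in one document and a string in another.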
The terminologies used in RDBMS and MongoDB
RDBMS MongoDB
Database Database
Table Collection
Column Field
Here is a list of some popular and multinational companies and organizations that are
using MongoDB as their official database to perform and manage different business
applications.
• Adobe
• McAfee
• LinkedIn
• FourSquare
• MetLife
• eBay
• SAP
Where Is MongoDB Used?
Beginners need to understand why MongoDB is used and what need it fills in contrast to SQL
and other database systems. In simple terms, modern applications involve big data, analysis
of different forms of data, rapid feature development, and flexible deployment, which older
database systems are not well equipped to handle. Hence, MongoDB is the next choice.
• MongoDB can also be used as a file system, which helps with load balancing.
• MongoDB supports searching by field as well as by regular expression (regex).
• Users can also run MongoDB as a Windows service.
• It does not require any VM to run on different platforms.
• It also supports sharding of data.
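To make the field and regex search point concrete, here is a minimal sketch using Python's `re` module over a list of dict documents. It only mimics the idea. The `books` data and the `find_by_field` and `find_by_regex` helpers are hypothetical and are not part of any MongoDB API.

```python
import re

# Hypothetical "collection": a list of dict documents.
books = [
    {"title": "Learning Redis", "year": 2015},
    {"title": "MongoDB Basics", "year": 2019},
    {"title": "Graph Databases", "year": 2015},
]

def find_by_field(collection, field, value):
    """Return documents whose field equals the given value."""
    return [doc for doc in collection if doc.get(field) == value]

def find_by_regex(collection, field, pattern):
    """Return documents whose string field matches the regular expression."""
    compiled = re.compile(pattern, re.IGNORECASE)
    return [
        doc for doc in collection
        if isinstance(doc.get(field), str) and compiled.search(doc[field])
    ]

from_2015 = find_by_field(books, "year", 2015)
about_db = find_by_regex(books, "title", r"db|database")

print([doc["title"] for doc in from_2015])
print([doc["title"] for doc in about_db])
```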
Cassandra
Apache Cassandra is an open-source, distributed, and decentralized storage
system (database) for managing very large amounts of structured data spread out across the
world. It provides a highly available service with no single point of failure.
Listed below are some of the notable points of Apache Cassandra −
• It is scalable, fault-tolerant, and consistent.
• It is a column-oriented database.
• Its distribution design is based on Amazon’s Dynamo and its data model on Google’s
Bigtable.
• Created at Facebook, it differs sharply from relational database management systems.
• Cassandra implements a Dynamo-style replication model with no single point of
failure, but adds a more powerful “column family” data model.
• Cassandra is used by some of the biggest companies, such as Facebook, Twitter,
Cisco, Rackspace, eBay, Netflix, and more.
Features of Cassandra
• Elastic scalability − Cassandra is highly scalable; it allows you to add more hardware to
accommodate more customers and more data as per requirement.
• Always on architecture − Cassandra has no single point of failure and it is
continuously available for business-critical applications that cannot afford a failure.
• Fast linear-scale performance − Cassandra is linearly scalable, i.e., it increases your
throughput as you increase the number of nodes in the cluster. Therefore it maintains
a quick response time.
• Flexible data storage − Cassandra accommodates all possible data formats including:
structured, semi-structured, and unstructured. It can dynamically accommodate
changes to your data structures according to your need.
• Easy data distribution − Cassandra provides the flexibility to distribute data where
you need by replicating data across multiple data centers.
• Transaction support − Cassandra offers atomic, isolated, and durable writes with tunable
consistency, rather than the full Atomicity, Consistency, Isolation, and Durability (ACID)
transactions of an RDBMS.
• Fast writes − Cassandra was designed to run on cheap commodity hardware. It
performs blazingly fast writes and can store hundreds of terabytes of data, without
sacrificing the read efficiency.
The design goal of Cassandra is to handle big data workloads across multiple nodes without
any single point of failure. Cassandra has a peer-to-peer distributed system across its nodes,
and data is distributed among all the nodes in a cluster.
• All the nodes in a cluster play the same role. Each node is independent and at the
same time interconnected to other nodes.
• Each node in a cluster can accept read and write requests, regardless of where the data
is actually located in the cluster.
• When a node goes down, read/write requests can be served from other nodes in the
network.
Data Replication in Cassandra
In Cassandra, one or more of the nodes in a cluster act as replicas for a given piece of
data. If it is detected that some of the nodes responded with an out-of-date value, Cassandra
will return the most recent value to the client. After returning the most recent value, Cassandra
performs a read repair in the background to update the stale values.
The following figure shows a schematic view of how Cassandra uses data replication among
the nodes in a cluster to ensure no single point of failure.
Note − Cassandra uses the Gossip Protocol in the background to allow the nodes to
communicate with each other and detect any faulty nodes in the cluster.
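The read path described in this section (return the most recent value, then repair stale replicas) can be sketched as follows. This is a simplified model under stated assumptions: replicas are plain dictionaries, recency is decided by an integer timestamp, and the repair runs inline rather than in the background as Cassandra does. The `Versioned` class and `read_with_repair` function are hypothetical names.

```python
from dataclasses import dataclass

@dataclass
class Versioned:
    value: str
    timestamp: int  # a larger timestamp means a more recent write

def read_with_repair(replicas, key):
    """Return the newest value for key and overwrite stale replica copies."""
    responses = [replica[key] for replica in replicas if key in replica]
    if not responses:
        return None
    newest = max(responses, key=lambda item: item.timestamp)
    for replica in replicas:
        current = replica.get(key)
        if current is None or current.timestamp < newest.timestamp:
            # Read repair: bring the out-of-date replica up to date.
            replica[key] = Versioned(newest.value, newest.timestamp)
    return newest.value

# Three replicas of the same key; the second one holds the latest write.
replicas = [
    {"color": Versioned("red", 1)},
    {"color": Versioned("blue", 2)},
    {"color": Versioned("red", 1)},
]
latest = read_with_repair(replicas, "color")
print(latest, [replica["color"].value for replica in replicas])
```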
Components of Cassandra
The key components of Cassandra are as follows −
• Node − It is the place where data is stored.
• Data center − It is a collection of related nodes.
• Cluster − A cluster is a component that contains one or more data centers.
• Commit log − The commit log is a crash-recovery mechanism in Cassandra. Every
write operation is written to the commit log.
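The role of the commit log in crash recovery can be illustrated with a small append-only log that is replayed on startup. This is a sketch of the general idea, not Cassandra's on-disk format: the `TinyStore` class, the JSON line format, and the temporary file path are all assumptions for the example.

```python
import json
import os
import tempfile

class TinyStore:
    """Toy store that appends every write to a log before applying it."""

    def __init__(self, log_path):
        self.log_path = log_path
        self.in_memory = {}
        self._replay()

    def _replay(self):
        """Rebuild the in-memory state from the log, e.g. after a crash."""
        if not os.path.exists(self.log_path):
            return
        with open(self.log_path, encoding="utf-8") as log:
            for line in log:
                record = json.loads(line)
                self.in_memory[record["key"]] = record["value"]

    def write(self, key, value):
        # Append to the log first so the write survives a crash.
        with open(self.log_path, "a", encoding="utf-8") as log:
            log.write(json.dumps({"key": key, "value": value}) + "\n")
            log.flush()
            os.fsync(log.fileno())
        self.in_memory[key] = value

log_path = os.path.join(tempfile.mkdtemp(), "commit.log")
store = TinyStore(log_path)
store.write("user:1", "Meera")
store.write("user:2", "Ravi")

# Simulate a restart: a fresh instance recovers its state from the log.
recovered = TinyStore(log_path)
print(recovered.in_memory)
```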
A column family is a container for an ordered collection of rows. Each row, in turn, is
an ordered collection of columns. The following table lists the points that differentiate a
column family from a table of relational databases.
A column is the basic data structure of Cassandra with three values, namely key or
column name, value, and a time stamp. Given below is the structure of a column.
A super column is a special column, therefore, it is also a key-value pair. But a super
column stores a map of sub-columns.
Generally, column families are stored on disk in individual files. Therefore, to optimize
performance, it is important to keep columns that you are likely to query together in the same
column family, and a super column can be helpful here. Given below is the structure of a super
column.
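The column and super column described above can be sketched as small Python data classes. The class names, the sample values, and the nested dictionary used for the column family are illustrative assumptions, not Cassandra's actual storage layout.

```python
from dataclasses import dataclass, field

@dataclass
class Column:
    name: str       # key or column name
    value: str
    timestamp: int  # time stamp of the write

@dataclass
class SuperColumn:
    name: str
    sub_columns: dict = field(default_factory=dict)  # sub-column name -> Column

    def add(self, column: Column) -> None:
        self.sub_columns[column.name] = column

address = SuperColumn("address")
address.add(Column("city", "Kochi", 1700000000))
address.add(Column("pin", "682001", 1700000000))

# A column family sketched as: row key -> {column name -> Column or SuperColumn}
users = {
    "user:1": {
        "name": Column("name", "Meera", 1700000000),
        "address": address,
    }
}

print(users["user:1"]["address"].sub_columns["city"].value)
```

Keeping `city` and `pin` under one super column groups values that are likely to be read together, which is the point the text makes about query performance.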
RDBMS | Cassandra
RDBMS deals with structured data. | Cassandra deals with unstructured data.
Database is the outermost container that contains data corresponding to an application. | Keyspace is the outermost container that contains data corresponding to an application.
Tables are the entities of a database. | Tables or column families are the entities of a keyspace.
RDBMS supports the concepts of foreign keys and joins. | Relationships are represented using collections.
ArangoDB
ArangoDB is described by its developers as a native multi-model database, which sets it
apart from other NoSQL databases. In this database, data can be stored as documents,
key/value pairs, or graphs, and any or all of it can be accessed with a single declarative
query language. Moreover, different models can be combined in a single query. Owing to its
multi-model style, one can build lean applications that scale horizontally with
any or all of the three data models.
Layered vs. Native Multi-Model Databases
Many database vendors call their product “multi-model,” but adding a graph layer to a
key/value or document store does not qualify as native multi-model.
With ArangoDB, one can combine different data models and features in a single query, using
the same core and the same query language. In ArangoDB, there is no “switching”
between data models, and there is no shifting of data from A to B to execute queries. This
gives ArangoDB performance advantages over the “layered” approaches.
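To illustrate what combining data models in a single query means, the sketch below keeps documents and graph edges side by side in Python and answers one question that needs both. It is a conceptual model only and does not use ArangoDB or its query language. The `people` and `knows` data and the `friends_in_city` function are invented for the example.

```python
# Document model: people stored by key (the key also allows key/value lookup).
people = {
    "alice": {"name": "Alice", "city": "Pune"},
    "bob": {"name": "Bob", "city": "Kochi"},
    "carol": {"name": "Carol", "city": "Pune"},
}

# Graph model: directed "knows" edges between document keys.
knows = [("alice", "bob"), ("alice", "carol"), ("bob", "carol")]

def friends_in_city(start, city):
    """One query mixing a graph step (neighbours) with a document filter."""
    neighbours = [dst for src, dst in knows if src == start]
    return [people[key]["name"] for key in neighbours
            if people[key]["city"] == city]

print(friends_in_city("alice", "Pune"))
```

Because both models live in the same store here, the query never has to copy data from one system to another, which is the advantage the text attributes to the native approach.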
Features of ArangoDB
• Multi-model Paradigm
• ACID Properties
• HTTP API
ArangoDB supports all popular database models. Following are a few models supported by
ArangoDB −
• Document model
• Key/Value model
• Graph model
A single query language is enough to retrieve data out of the database.
The four properties Atomicity, Consistency, Isolation, and Durability (ACID) describe
the guarantees of database transactions. ArangoDB supports ACID-compliant transactions.
ArangoDB allows clients, such as browsers, to interact with the database through an HTTP
API that is resource-oriented and extendable with JavaScript.
Advantages of using ArangoDB
• Consolidation : As a native multi-model database, ArangoDB eliminates the need to
deploy multiple databases, and thus decreases the number of components and their
maintenance. Consequently, it reduces the technology-stack complexity for the
application. In addition to consolidating your overall technical needs, this
simplification leads to lower total cost of ownership and increased flexibility.
• Simplified Performance Scaling : As applications grow over time, ArangoDB
can handle growing performance and storage needs by scaling independently with
different data models. Because ArangoDB can scale both vertically and horizontally,
when your performance demands decrease (a deliberate, desired slow-down),
your back-end system can easily be scaled down to save on hardware as well as
operational costs.
• Reduced Operational Complexity: The idea of Polyglot Persistence is to employ
the best tool for every job you undertake. Certain tasks need a document database,
while others may need a graph database. Working with several single-model
databases can lead to multiple operational challenges. Integrating single-model
databases is a difficult job in itself. But the biggest challenge is building a large
cohesive structure with data consistency and fault tolerance between separate,
unrelated database systems. It may prove nearly impossible.
Polyglot Persistence can be handled with a native multi-model database, as it makes
polyglot data easy to have while keeping data consistency on a fault-tolerant
system. With ArangoDB, we can use the correct data model for the complex job.
• Strong Data Consistency : If one uses multiple single-model databases, data
consistency can become an issue. These databases aren’t designed to communicate
with each other, therefore some form of transaction functionality needs to be
implemented to keep your data consistent between different models.
Supporting ACID transactions, ArangoDB manages your different data models with a
single back-end, providing strong consistency on a single instance, and atomic
operations when operating in cluster mode.
• Fault Tolerance : It is a challenge to build fault-tolerant systems with many unrelated
components. This challenge becomes more complex when working with clusters.
Expertise is required to deploy and maintain such systems, using different technologies
and/or technology stacks. Moreover, integrating multiple subsystems, designed to run
independently, incurs large engineering and operational costs.