BDA Unit-3
Introduction to NoSQL
What is NoSQL (Not Only SQL)?
• The term NoSQL was first coined by Carlo Strozzi in 1998 to name his lightweight, open-source,
relational database that did not expose the standard SQL interface.
• Johan Oskarsson, who was then a developer at Last.fm, reintroduced the term NoSQL in 2009 at
an event called to discuss open-source distributed databases.
• The hashtag #NoSQL was coined by Eric Evans, and a few other database people at the event
found it suitable to describe these non-relational databases.
• A few features of NoSQL databases are as follows:
1. They are open source.
2. They are non-relational.
3. They are distributed.
4. They are schema-less.
5. They are cluster friendly.
6. They are born out of 21st century web applications.
Where is it Used?
• NoSQL databases are widely used in big data and other real-time web applications. Refer to
Figure - 1. NoSQL databases are used to store log data, which can then be pulled for analysis.
• Likewise, they are used to store social media data and other data that cannot be stored and
analyzed comfortably in an RDBMS.
What is it?
• NoSQL stands for Not Only SQL. These are non-relational, open source, distributed
databases.
• They are hugely popular today owing to their ability to scale out or scale horizontally and
the adeptness at dealing with a rich variety of data: structured, semi-structured and
unstructured data. Refer Figure - 2 for additional features of NoSQL.
• NoSQL databases are non-relational:
o They do not adhere to relational data model. In fact, they are either key–value pairs
or document-oriented or column-oriented or graph-based databases.
• NoSQL databases are distributed:
o They are distributed meaning the data is distributed across several nodes in a cluster
constituted of low-cost commodity hardware.
• Most NoSQL databases offer limited or no support for ACID properties (Atomicity, Consistency,
Isolation, and Durability):
o They typically do not guarantee the ACID properties of transactions across the whole database.
o Instead, their design is guided by Brewer’s CAP (Consistency, Availability, and Partition
tolerance) theorem, and they often compromise on consistency in favor of availability and
partition tolerance.
• NoSQL databases provide no fixed table schema:
o NoSQL databases are becoming increasingly popular owing to their flexible schema.
o They do not require the data to strictly adhere to any schema structure at the time of
storage.
• The need for horizontal scaling instead of vertical scaling (faster processors) shifts the
organization from serial processing to parallel processing, where data problems are broken
down into separate paths and sent to separate processors to divide and conquer.
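The divide-and-conquer idea above can be sketched in a few lines of Python. This is only an illustration: the function names (`split`, `count_errors`, `parallel_count`) and the sample log lines are made up for this example, and worker threads stand in for the separate machines of a real cluster.

```python
from concurrent.futures import ThreadPoolExecutor


def split(records, parts):
    """Break the records into at most `parts` roughly equal chunks."""
    size = max(1, -(-len(records) // parts))  # ceiling division, at least 1
    return [records[i:i + size] for i in range(0, len(records), size)]


def count_errors(chunk):
    """Work done independently on one chunk (one 'node' in the cluster)."""
    return sum(1 for line in chunk if "ERROR" in line)


def parallel_count(records, workers=4):
    """Divide the data, process the chunks in parallel, combine the results."""
    chunks = split(records, workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial_counts = list(pool.map(count_errors, chunks))
    return sum(partial_counts)


# Hypothetical sample data for the illustration
logs = ["INFO start", "ERROR disk", "INFO ok", "ERROR net", "ERROR cpu"]
total = parallel_count(logs, workers=2)  # → 3
```

Each chunk is processed without looking at any other chunk, which is what makes it possible to add more workers (scale out) instead of buying a faster single processor (scale up).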
Types of NoSQL
• Traditional RDBMSs use SQL syntax to store and retrieve data from relational databases.
• NoSQL databases all use a data model that has a different structure than the traditional
row-and-column table model used with relational database management systems (RDBMSs).
• A NoSQL database system encompasses a wide range of database technologies that can store
structured, semi-structured, unstructured and polymorphic data.
• They can be broadly classified into the following:
1. Key-Value Pair Oriented
o Key-value stores are the simplest type of NoSQL database.
o Data is stored in key/value pairs.
o It uses keys and values to store the data. The attribute name is stored in ‘key’,
whereas the values corresponding to that key will be held in ‘value’.
Key          Value
First Name   Rahul
Last Name    Mehta
o In key-value stores, the key is usually a string, whereas the value can hold a string,
JSON, XML, a BLOB, etc. Because each value is looked up directly by its key, such stores are
capable of handling massive data volumes and loads.
o Key-value stores are mainly used to hold user preferences, user profiles, shopping carts,
etc.
o DynamoDB, Riak and Redis are a few well-known examples of key-value store NoSQL databases.
o Use cases:
▪ For storing user session data
▪ Maintaining schema-less user profiles
▪ Storing user preferences
▪ Storing shopping cart data
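A minimal in-memory sketch of the key-value idea is given below. The class `KeyValueStore` and the key `cart:rahul` are hypothetical names chosen for this illustration; real stores such as Redis or DynamoDB have their own client APIs, which are not shown here.

```python
import json


class KeyValueStore:
    """Toy key-value store: string keys, opaque values."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        if not isinstance(key, str):
            raise TypeError("key must be a string")
        self._data[key] = value

    def get(self, key, default=None):
        # The only way to reach a value is through its key
        return self._data.get(key, default)

    def delete(self, key):
        self._data.pop(key, None)


store = KeyValueStore()
store.put("First Name", "Rahul")
store.put("Last Name", "Mehta")

# The value can also be a serialized structure, e.g. a shopping cart as JSON
store.put("cart:rahul", json.dumps({"items": ["book", "pen"]}))
cart = json.loads(store.get("cart:rahul"))
```

Note that the store never looks inside the value. It cannot answer a question such as "which carts contain a pen" without reading every value, which is the main limitation of this model.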
2. Document Oriented
o Document Databases use key-value pairs to store and retrieve data from the
documents.
o Documents can contain many different key-value pairs, or key-array pairs, or even
nested documents. MongoDB is the most popular of these databases.
o A document is stored in the form of XML or JSON.
o As in a key-value store, data is stored as a value, and its associated key is the unique
identifier for that value.
o The difference is that, in a document database, the value contains structured or
semi-structured data.
o Example:
{
"Book Name": "Fundamentals of Business Analytics",
"Publisher": "Wiley India",
"Year of Publication": "2011"
}
o This structured/semi-structured value is referred to as a document and can be in
XML, JSON or BSON format.
o Examples of Document databases are – MongoDB, OrientDB, Apache CouchDB, IBM
Cloudant, CrateDB, BaseX, and many more.
o Use cases:
▪ E-commerce platforms
▪ Content management systems
▪ Analytics platforms
▪ Blogging platforms
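The sketch below illustrates how a document store differs from a plain key-value store: the database can look inside the value and query on its fields. The `Collection` class and its `insert`/`find` methods are hypothetical names for this illustration and are not the API of MongoDB or any other product; the `_id` field is assumed to be the unique key of each document.

```python
class Collection:
    """Toy document collection: each document is a dict with a unique '_id'."""

    def __init__(self):
        self._docs = {}

    def insert(self, doc):
        self._docs[doc["_id"]] = doc

    def find(self, criteria):
        """Return every document whose fields match all of the criteria."""
        return [
            doc for doc in self._docs.values()
            if all(doc.get(field) == value for field, value in criteria.items())
        ]


books = Collection()
# Documents in the same collection need not share the same set of fields
books.insert({
    "_id": 101,
    "Book Name": "Fundamentals of Business Analytics",
    "Publisher": "Wiley India",
    "Year of Publication": "2011",
})
books.insert({"_id": 102, "Book Name": "Big Data and Analytics"})

wiley_books = books.find({"Publisher": "Wiley India"})
```

The second document has no "Publisher" field at all, yet it is stored without any schema change. It simply does not match the query.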
3. Column Oriented
o Column-based databases store data together as columns instead of rows and are optimized
for queries over large datasets.
o They work on columns and are based on the BigTable paper by Google.
o Every column is treated separately. Values of a single column are stored contiguously.
Figure: a column family, in which each row key maps to named columns holding key-value pairs.
o They deliver high performance on aggregation queries like SUM, COUNT, AVG, MIN etc. as the
data is readily available in a column.
o HBase, Cassandra and Hypertable are examples of column-based NoSQL databases.
o Use cases:
▪ Content management systems
▪ Blogging platforms
▪ Systems that maintain counters
▪ Services that have expiring usage
▪ Systems that require heavy write requests (like log aggregators)
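The following toy example illustrates why storing the values of one column together helps aggregation queries. The sample rows and the helper `to_columns` are made up for this illustration, and the code assumes every row has the same fields. It shows the layout idea only, not how HBase or Cassandra are implemented internally.

```python
# Row-oriented layout: all fields of one record sit together
rows = [
    {"id": 1, "city": "Pune", "amount": 120},
    {"id": 2, "city": "Delhi", "amount": 80},
    {"id": 3, "city": "Pune", "amount": 200},
]


def to_columns(rows):
    """Re-arrange row records so that the values of each column sit together."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


columns = to_columns(rows)

# An aggregate now reads one contiguous list and never touches 'id' or 'city'
total = sum(columns["amount"])            # → 400
average = total / len(columns["amount"])
smallest = min(columns["amount"])         # → 80
```

In the row layout, computing the same SUM would require reading every complete record and picking out one field from each.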
4. Graph Oriented
o Graph databases store the data together with the relationships between data elements.
o Each data element is stored in a node, and that node is linked to other nodes.
o A typical example of a graph database use case is Facebook: the graph holds the
relationship between each user and their further connections.
o Graph databases help search the connections between data elements and link one part to
various parts directly or indirectly.
o Graph databases can be used in social media, fraud detection, and knowledge graphs.
Examples of graph databases are Neo4j, InfiniteGraph, OrientDB, FlockDB, etc.
o Use cases:
▪ Fraud detection
▪ Graph based search
▪ Network and IT operations
▪ Social networks, etc
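A small sketch of the kind of question a graph model answers well: how many hops separate two users in a social network. The user names and the function `degrees_of_separation` are hypothetical, and the graph is a plain adjacency list rather than a real graph database. The search used is an ordinary breadth-first search.

```python
from collections import deque

# Each node (user) is linked to the nodes it is directly connected to
friends = {
    "Asha": {"Ravi", "Meena"},
    "Ravi": {"Asha", "Kiran"},
    "Meena": {"Asha"},
    "Kiran": {"Ravi", "Zoya"},
    "Zoya": {"Kiran"},
}


def degrees_of_separation(graph, start, goal):
    """Breadth-first search: smallest number of links between two nodes."""
    if start == goal:
        return 0
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        person, distance = queue.popleft()
        for neighbour in graph.get(person, ()):
            if neighbour == goal:
                return distance + 1
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, distance + 1))
    return None  # the two nodes are not connected


hops = degrees_of_separation(friends, "Asha", "Zoya")  # → 3
```

Asha reaches Zoya through Ravi and Kiran, an indirect connection of three links. In a relational model the same question would typically need one self-join per hop.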
Advantages of NoSQL
1. It can easily scale up and down: NoSQL databases support rapid, elastic scaling and can
even scale to the cloud.
o Cluster scale: It allows distribution of the database across 100+ nodes, often in multiple
data centers.
o Performance scale: It sustains 100,000+ database reads and writes per second.
o Data scale: It supports housing of 1 billion+ documents in the database.
2. Doesn’t require a pre-defined schema: NoSQL does not require any adherence to a
pre-defined schema.
o It is pretty flexible. For example, if we look at MongoDB, the documents (equivalent of
records in RDBMS) in a collection (equivalent of table in RDBMS) can have different sets
of key–value pairs.
{_id: 101, "BookName": "Fundamentals of Business Analytics", "AuthorName": "Seema Acharya", "Publisher": "Wiley India"}
{_id: 102, "BookName": "Big Data and Analytics"}
3. Cheap, easy to implement: Deploying NoSQL properly allows for all of the benefits of scale,
high availability, fault tolerance, etc. while also lowering operational costs.
4. Relaxes the data consistency requirement: NoSQL databases are designed around the CAP
theorem (Consistency, Availability, and Partition tolerance). Most NoSQL databases
compromise on consistency in favour of availability and partition tolerance. However, they
do provide eventual consistency: once writes stop, all replicas converge to the same value.
5. Data can be replicated to multiple nodes and can be partitioned: There are two terms that
we will discuss here:
o Sharding: Sharding is when different pieces of data are distributed across multiple
servers.
o NoSQL databases support auto-sharding; this means that they can natively and
automatically spread data across an arbitrary number of servers, without requiring
the application to even be aware of the composition of the server pool.
o Servers can be added or removed from the data layer without application downtime.
o This would mean that data and query load are automatically balanced across servers,
and when a server goes down, it can be quickly and transparently replaced with no
application disruption.
o Replication: Replication is when multiple copies of data are stored across the cluster
and even across data centers. This promises high availability and fault tolerance.
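Sharding and replication can be sketched together as follows. The class `ShardedStore`, its parameters `num_shards` and `replicas`, and the `down` argument are all assumptions made for this illustration. The sketch places keys by hashing modulo the number of shards, which is the simplest possible rule; production systems generally use more elaborate placement schemes so that adding a server does not move most keys.

```python
import hashlib


class ShardedStore:
    """Toy cluster: each key lives on `replicas` consecutive shards."""

    def __init__(self, num_shards=3, replicas=2):
        if replicas > num_shards:
            raise ValueError("cannot keep more replicas than there are shards")
        self.shards = [dict() for _ in range(num_shards)]
        self.replicas = replicas

    def _owners(self, key):
        """Sharding: hash the key to pick the shards responsible for it."""
        digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
        first = int(digest, 16) % len(self.shards)
        return [(first + i) % len(self.shards) for i in range(self.replicas)]

    def put(self, key, value):
        """Replication: write the same value to every owning shard."""
        for index in self._owners(key):
            self.shards[index][key] = value

    def get(self, key, down=()):
        """Read from the first owning shard that is not in the `down` set."""
        for index in self._owners(key):
            if index not in down:
                return self.shards[index].get(key)
        return None  # every replica of this key is unavailable


cluster = ShardedStore(num_shards=3, replicas=2)
cluster.put("user:1", "Rahul")
owners = cluster._owners("user:1")

# Simulate the failure of the first owner: the second copy still answers
value = cluster.get("user:1", down={owners[0]})
```

The caller only ever passes a key. Which server holds the data, and which copy answers after a failure, is decided inside the store, which is the point of auto-sharding and replication as described above.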
SQL versus NoSQL
SQL | NoSQL
Vertically scalable (by increasing system resources) | Horizontally scalable (by creating a cluster of commodity machines)
Not preferred for large datasets | Largely preferred for large datasets
Not a best fit for hierarchical data | Best fit for hierarchical storage as it follows the key–value pair way of storing data, like JSON (JavaScript Object Notation)
Supports complex querying and data keeping needs | Does not have good support for complex querying
Can be configured for strong consistency | A few support strong consistency (e.g., MongoDB); some others can be configured for eventual consistency (e.g., Cassandra)
Examples: Oracle, DB2, MySQL, MS SQL, PostgreSQL, etc. | Examples: MongoDB, HBase, Cassandra, Redis, Neo4j, CouchDB, Couchbase, Riak, etc.
NoSQL Vendors
• Refer to the table for a few popular NoSQL vendors.
Distributed Model
From a distribution perspective, there are two main models:
1. Peer-to-Peer Model
2. Master-Slave Model
• Distribution models determine the responsibility for processing data when a request is made.
• Peer-to-peer models may be more resilient to failure than master-slave models.
• Some master-slave distribution models have single points of failure that might impact your
system availability, so you might need to take special care when configuring these systems.
• In the master-slave model, one node (the master) is in charge and the rest are slave nodes.
• Using the right distribution model will depend on your business requirements:
o If high availability is a concern, a peer-to-peer network might be the best solution.
o If you can manage your big data using batch jobs that run in off hours, then the simpler
master-slave model might be best.
Peer-to-Peer Model
• Peer-to-peer systems distribute the responsibility of the master to each node in the cluster.
• In this situation, testing is much easier since you can remove any node in the cluster and the
other nodes will continue to function.
• The disadvantage of peer-to-peer networks is that there’s an increased complexity and
communication overhead that must occur for all nodes to be kept up to date with the cluster
status.
Master-Slave Model
• Hadoop was initially designed to use a master-slave architecture, with the NameNode of a
cluster being responsible for managing the status of the cluster.
• The NameNode's job is to manage and distribute queries to the correct nodes on the cluster.
• Later versions of Hadoop are also designed to remove single points of failure from a Hadoop
cluster.
Figure: master-slave model (one master node and its slave nodes).
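The difference between the two distribution models can be sketched by asking one question: which node is allowed to accept and route a request? The classes `Node`, `MasterSlaveCluster` and `PeerToPeerCluster` below are hypothetical and model only that routing responsibility, not data storage or replication.

```python
class Node:
    def __init__(self, name):
        self.name = name
        self.alive = True


class MasterSlaveCluster:
    """Only the master may accept and route requests."""

    def __init__(self, master, slaves):
        self.master = master
        self.slaves = slaves

    def coordinator(self):
        if not self.master.alive:
            # Single point of failure: healthy slaves cannot take over here
            raise RuntimeError("master is down, no node can route requests")
        return self.master


class PeerToPeerCluster:
    """Every peer carries the routing responsibility."""

    def __init__(self, peers):
        self.peers = peers

    def coordinator(self):
        for peer in self.peers:
            if peer.alive:
                return peer
        raise RuntimeError("all peers are down")


ms = MasterSlaveCluster(Node("master"), [Node("slave-1"), Node("slave-2")])
p2p = PeerToPeerCluster([Node("peer-1"), Node("peer-2"), Node("peer-3")])

# Remove one node from each cluster
ms.master.alive = False
p2p.peers[0].alive = False

survivor = p2p.coordinator().name  # → "peer-2"
```

After the failure, the peer-to-peer cluster still finds a node to handle requests, while `ms.coordinator()` raises an error even though both slaves are healthy. The price of the peer-to-peer design, not modelled here, is the extra communication needed to keep every peer informed of the cluster status.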
What is the CAP theorem?
• The CAP theorem makes system designers aware of the trade-offs involved in designing
networked shared-data systems. It has influenced the design of many distributed data
systems. It is very important to understand the CAP theorem, as it forms the basis for
choosing a NoSQL database based on the requirements.
• CAP theorem states that in networked shared-data systems or distributed systems, we can
only achieve at most two out of three guarantees for a database: Consistency, Availability
and Partition Tolerance.
• A distributed system is a network that stores data on more than one node (physical or
virtual machines) at the same time.
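The trade-off can be made concrete with a toy store that keeps two replicas of the data. The class `TwoReplicaStore`, its `mode` values "CP" and "AP", and the `partitioned` flag are assumptions made for this sketch. It models only what happens to writes when the network between the replicas is cut.

```python
class TwoReplicaStore:
    """Two copies of the data, linked by a network that may be partitioned."""

    def __init__(self, mode):
        if mode not in ("CP", "AP"):
            raise ValueError("mode must be 'CP' or 'AP'")
        self.mode = mode
        self.replicas = [{}, {}]
        self.partitioned = False

    def write(self, replica, key, value):
        if not self.partitioned:
            # Healthy network: every replica receives the write
            for copy in self.replicas:
                copy[key] = value
        elif self.mode == "CP":
            # Keep consistency, give up availability
            raise RuntimeError("write rejected: other replica unreachable")
        else:
            # Keep availability, give up consistency
            self.replicas[replica][key] = value

    def read(self, replica, key):
        return self.replicas[replica].get(key)


cp = TwoReplicaStore("CP")
ap = TwoReplicaStore("AP")
for store in (cp, ap):
    store.write(0, "x", 1)      # written before the partition
    store.partitioned = True    # the network between the replicas fails

ap.write(0, "x", 2)             # accepted by replica 0 only
ap_views = (ap.read(0, "x"), ap.read(1, "x"))  # → (2, 1)
```

During the partition, the AP store stays available but its two replicas disagree, while the CP store would raise an error on `cp.write(0, "x", 2)` and both of its replicas would keep returning 1. Since partition tolerance is being assumed in both cases, the choice left to the designer is between consistency and availability.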