
BDA Unit-3

Uploaded by

status wind sk
Copyright
© © All Rights Reserved

Unit-3:NoSQL

Introduction to NoSQL
What is NoSQL (Not Only SQL)?
• The term NoSQL was first coined by Carlo Strozzi in 1998 to name his lightweight, open-source,
relational database that did not expose the standard SQL interface.
• In 2009, Johan Oskarsson, who was then a developer at Last.fm, reintroduced the term NoSQL
at an event called to discuss open-source distributed databases.
• The hashtag #NoSQL was coined by Eric Evans, and a few other database people at the event
found it suitable to describe these non-relational databases.
• A few features of NoSQL databases are as follows:
1. They are open source.
2. They are non-relational.
3. They are distributed.
4. They are schema-less.
5. They are cluster friendly.
6. They are born out of 21st century web applications.

Where is it Used?
• NoSQL databases are widely used in big data and other real-time web applications. Refer to
Figure 1. NoSQL databases are used to store log data, which can then be pulled for analysis.
• Likewise, they are used to store social media data and any other data that cannot be stored
and analyzed comfortably in an RDBMS.

Figure 1 - Where to use NoSQL?

What is it?
• NoSQL stands for Not Only SQL. These are non-relational, open source, distributed
databases.
• They are hugely popular today owing to their ability to scale out or scale horizontally and
the adeptness at dealing with a rich variety of data: structured, semi-structured and
unstructured data. Refer Figure - 2 for additional features of NoSQL.
• NoSQL databases are non-relational:
o They do not adhere to relational data model. In fact, they are either key–value pairs
or document-oriented or column-oriented or graph-based databases.
• NoSQL databases are distributed:
o They are distributed meaning the data is distributed across several nodes in a cluster
constituted of low-cost commodity hardware.
• NoSQL databases generally do not offer full support for ACID properties (Atomicity,
Consistency, Isolation, and Durability):
o Most of them do not offer full support for the ACID properties of transactions.
o Instead, they are designed around Brewer’s CAP (Consistency, Availability, and
Partition tolerance) theorem and often compromise on consistency in
favor of availability and partition tolerance.
• NoSQL databases provide no fixed table schema:
o NoSQL databases are becoming increasingly popular owing to their schema
flexibility.
o They do not require the data to strictly adhere to any schema structure at the
time of storage.

NoSQL Business Drivers


• The demands of volume, velocity, variability, and agility play a key role in the emergence of
NoSQL solutions.
• As each of these drivers applies pressure to the single-processor relational model, its
foundation becomes less stable and in time no longer meets the organization’s needs.
• Together, the business drivers (volume, velocity, variability, and agility) apply pressure to
the single-CPU system, which eventually results in failures.

Figure 2 - NoSQL Business Drivers


• Volume and velocity refer to the ability to handle large datasets that arrive quickly.
• Variability refers to how diverse data types don’t fit into structured tables.
• Agility refers to how quickly an organization responds to business change.

NoSQL Business Driver – Volume


• The key factor that led organizations to seek alternatives to their current RDBMS was the
need to use commodity processor clusters to query big data.
• Until around 2005, performance problems were solved by buying faster processors.
• Over time, increasing processing speed stopped being an option. As chip density
increases, heat can no longer be dissipated quickly enough, and chips overheat.
• This phenomenon, known as the power wall, forces system designers to shift their attention
from increasing the speed of a single chip to using more processors to work together.

Figure 3 - Business Driver - Volume

• The need for horizontal scaling instead of vertical scaling (faster processors) shifts the
organization from serial processing to parallel processing, where data problems are broken
down into separate paths and sent to separate processors to divide and conquer.
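
The divide-and-conquer idea above can be sketched as follows. This is a minimal illustration, not a real cluster: the worker threads stand in for separate processors, and the function names and the sample log lines are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

def split_into_chunks(records, n_chunks):
    """Divide the records into at most n_chunks roughly equal slices."""
    size = max(1, -(-len(records) // n_chunks))  # ceiling division, at least 1
    return [records[i:i + size] for i in range(0, len(records), size)]

def count_errors(chunk):
    """Work done independently by one worker on its own slice."""
    return sum(1 for line in chunk if "ERROR" in line)

def parallel_error_count(records, n_workers=4):
    """Send each slice to a separate worker, then combine the partial results."""
    chunks = split_into_chunks(records, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partial_counts = list(pool.map(count_errors, chunks))
    return sum(partial_counts)

# Hypothetical log data: 1,000 lines, half of them errors.
log_lines = ["INFO start", "ERROR disk full", "INFO ok", "ERROR timeout"] * 250
total = parallel_error_count(log_lines)
```

In a real scale-out system each slice would already live on a different machine, so only the small partial counts travel over the network.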

NoSQL Business Driver – Velocity


• Although big data problems are a consideration for many organizations moving away from
RDBMSs, the ability of single-processor systems to quickly read and write data is also critical.
• Many single-processor RDBMSs cannot meet the real-time insertion and online database
query needs of public websites.
• RDBMSs often index many columns of each new row, a process that reduces system
performance.
• When a single-processor RDBMS is used as the back end of a web store front end, random
bursts in web traffic will slow down everyone's response speed, and the cost of adjusting
these systems when high read and write performance is required can be high.

NoSQL Business Driver – Variability
• Organizations that want to capture and report on exception data will encounter difficulties
when trying to use the strict database schema structure enforced by the RDBMS.
• For example, if a business unit wants to capture some custom fields for a specific customer,
all customer rows in the database must store this information, even if it is not applicable.
• Adding a new column to an RDBMS table requires executing the ALTER TABLE command and,
in many systems, taking the database offline while the command runs.
• When the database is large, this process reduces the availability of the system and
costs time and money.
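
The custom-field problem can be seen in a small sketch using Python's built-in sqlite3 module. The table, the customer names, and the loyalty_tier field are hypothetical. SQLite adds a column cheaply, so the sketch shows only the "every row stores the field" effect, not the downtime.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO customer (id, name) VALUES (?, ?)",
    [(1, "Asha"), (2, "Ravi"), (3, "Meera")],
)

# One customer needs a custom field, so the whole table gets a new column.
conn.execute("ALTER TABLE customer ADD COLUMN loyalty_tier TEXT")
conn.execute("UPDATE customer SET loyalty_tier = 'gold' WHERE id = 2")

# Every row now carries the column; it is NULL (None) where it does not apply.
rows = conn.execute(
    "SELECT id, name, loyalty_tier FROM customer ORDER BY id"
).fetchall()

# Schema-less alternative: only the record that needs the field stores it.
customers = [
    {"id": 1, "name": "Asha"},
    {"id": 2, "name": "Ravi", "loyalty_tier": "gold"},
    {"id": 3, "name": "Meera"},
]
with_tier = [c["name"] for c in customers if "loyalty_tier" in c]
```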

NoSQL Business Driver – Agility


• The most complex part of creating an application with RDBMS is the process of entering and
extracting data from the database.
• If your data has nested and repeated data structure subgroups, you must include an object-
relational mapping layer.
• The responsibility of this layer is to generate the correct combination of SQL INSERT,
UPDATE, DELETE, and SELECT statements to move object data in and out of the RDBMS
persistence layer.
• This process is not simple. When developing new applications or modifying existing
applications, it is the biggest obstacle to rapid change.
• Object-relational mapping usually requires object-relational frameworks such as Java
Hibernate (or NHibernate for .Net systems).
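
To show the kind of work a mapping layer does, here is a hand-written sketch that moves one nested object in and out of two relational tables using Python's sqlite3 module. The table layout, the helper names save_order and load_order, and the SKU values are hypothetical. A framework such as Hibernate generates comparable statements automatically.

```python
import json
import sqlite3

# A nested object with a repeated subgroup (the list of items).
order = {
    "order_id": 1,
    "customer": "Rahul Mehta",
    "items": [
        {"sku": "BK-101", "qty": 1},
        {"sku": "BK-205", "qty": 2},
    ],
}

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer TEXT)")
conn.execute("CREATE TABLE order_items (order_id INTEGER, sku TEXT, qty INTEGER)")

def save_order(conn, order):
    """Flatten the object into one parent row and several child rows."""
    conn.execute(
        "INSERT INTO orders VALUES (?, ?)", (order["order_id"], order["customer"])
    )
    conn.executemany(
        "INSERT INTO order_items VALUES (?, ?, ?)",
        [(order["order_id"], item["sku"], item["qty"]) for item in order["items"]],
    )

def load_order(conn, order_id):
    """Reassemble the nested object from the two tables."""
    customer = conn.execute(
        "SELECT customer FROM orders WHERE order_id = ?", (order_id,)
    ).fetchone()[0]
    items = conn.execute(
        "SELECT sku, qty FROM order_items WHERE order_id = ? ORDER BY rowid",
        (order_id,),
    ).fetchall()
    return {
        "order_id": order_id,
        "customer": customer,
        "items": [{"sku": sku, "qty": qty} for sku, qty in items],
    }

save_order(conn, order)
restored = load_order(conn, 1)

# A document store keeps the nested structure as-is; no mapping code is needed.
document = json.dumps(order)
```

Each new nested field would require changes to the tables and to both helper functions, which is why this layer slows down rapid change.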

Types of NoSQL
• A traditional RDBMS uses SQL syntax to store and retrieve data.
• NoSQL databases all use a data model that has a different structure than the traditional
row-and-column table model used with relational database management systems (RDBMSs).
• Instead, NoSQL encompasses a wide range of database technologies that
can store structured, semi-structured, unstructured and polymorphic data.
• They can be broadly classified into the following:
1. Key-Value Pair Oriented
o Key-value stores are the simplest type of NoSQL database.
o Data is stored in key/value pairs.
o It uses keys and values to store the data. The attribute name is stored in ‘key’,
whereas the values corresponding to that key will be held in ‘value’.

Key        | Value
First Name | Rahul
Last Name  | Mehta
o In key-value store databases, the key is typically a string, whereas the value can store
a string, JSON, XML, a BLOB, etc. Owing to this simplicity, a key-value store is capable of
handling massive data volumes and loads.
o Key-value stores are mainly used to store user preferences, user profiles,
shopping carts, etc.
o DynamoDB, Riak, Redis are a few famous examples of Key-value store NoSQL
databases.
o Use cases:
▪ For storing user session data
▪ Maintaining schema-less user profiles
▪ Storing user preferences
▪ Storing shopping cart data
2. Document Oriented
o Document Databases use key-value pairs to store and retrieve data from the
documents.
o Documents can contain many different key-value pairs, or key-array pairs, or even
nested documents. MongoDB is the most popular of these databases.
o A document is stored in the form of XML and JSON.
o Data is stored as a value. Its associated key is the unique identifier for that value.
o The difference is that, in a document database, the value contains structured or
semi-structured data.
o Example:
{
"Book Name": "Fundamentals of Business Analytics",
"Publisher": "Wiley India",
"Year of Publication": "2011"
}
o This structured/semi-structured value is referred to as a document and can be in
XML, JSON or BSON format.
o Examples of Document databases are – MongoDB, OrientDB, Apache CouchDB, IBM
Cloudant, CrateDB, BaseX, and many more.
o Use cases:
▪ E-commerce platforms
▪ Content management systems
▪ Analytics platforms
▪ Blogging platforms
3. Column Oriented
o Column-based databases store data together as columns instead of rows and are
optimized for queries over large datasets.
o They work on columns and are based on the BigTable paper by Google.
o Every column is treated separately. The values of a single column are stored
contiguously.
Column Family
Row Key | Column Name | Key   | Key   | Key
        |             | Value | Value | Value
        | Column Name | Key   | Key   | Key
        |             | Value | Value | Value
o They deliver high performance on aggregation queries like SUM, COUNT, AVG, MIN
etc. as the data is readily available in a column.
o HBase, Cassandra, and Hypertable are examples of column-based NoSQL
databases.
o Use cases:
▪ Content management systems
▪ Blogging platforms
▪ Systems that maintain counters
▪ Services that have expiring usage
▪ Systems that require heavy write requests (like log aggregators)
4. Graph Oriented
o Graph databases form and store the relationship of the data.
o Each element/data is stored in a node, and that node is linked to another
data/element.
o A typical example for Graph database use cases is Facebook.
o It holds the relationship between each user and their further connections.
o Graph databases help search the connections between data elements and link one
part to various parts directly or indirectly.

o Graph databases can be used in social media, fraud detection, and knowledge
graphs. Examples of graph databases are Neo4j, InfiniteGraph, OrientDB, FlockDB,
etc.
o Use cases:
▪ Fraud detection
▪ Graph based search
▪ Network and IT operations
▪ Social networks, etc
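
The four data models above can be sketched with plain Python data structures. This is only an illustration of how each model shapes the data, not how any of the named products stores it internally. The key naming scheme, the product and price columns, and the people in the graph are hypothetical.

```python
# 1. Key-value: an opaque value looked up by a unique key.
key_value_store = {
    "user:101:first_name": "Rahul",
    "user:101:last_name": "Mehta",
}

# 2. Document: the value is itself structured and can be nested.
document_store = {
    "book:1": {
        "Book Name": "Fundamentals of Business Analytics",
        "Publisher": "Wiley India",
        "Year of Publication": "2011",
    }
}

# 3. Column-oriented: the values of one column are stored together,
#    so an aggregate such as SUM reads a single list.
column_store = {
    "product": ["pen", "book", "bag"],
    "price": [10, 250, 900],
}
total_price = sum(column_store["price"])

# 4. Graph: nodes plus explicit relationships (here, an adjacency list).
friends = {
    "asha": {"ravi"},
    "ravi": {"asha", "meera"},
    "meera": {"ravi"},
}

def friends_of_friends(graph, person):
    """Return people connected to `person` only indirectly, through one friend."""
    direct = graph[person]
    indirect = set()
    for friend in direct:
        indirect |= graph[friend]
    return indirect - direct - {person}

suggestions = friends_of_friends(friends, "asha")
```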

Why NoSQL in Big Data?


1. It has scale out architecture instead of the monolithic architecture of relational databases.
2. It can house large volumes of structured, semi-structured, and unstructured data.
3. Dynamic schema: NoSQL database allows insertion of data without a pre-defined schema. In
other words, it facilitates application changes in real time, which thus supports faster
development, easy code integration, and requires less database administration.
4. Auto-sharding: It automatically spreads data across an arbitrary number of servers. More
often than not, the application in question is not even aware of the composition of the server
pool. It balances the data and query load across the available servers; if and when a server
goes down, it is quickly replaced without any major disruption of activity.
5. Replication: It offers good support for replication which in turn guarantees high availability,
fault tolerance, and disaster recovery.

Using NoSQL to Manage Big Data


• The main reason organizations are moving towards NoSQL solutions and leaving
RDBMS systems behind is the requirement to analyze large volumes of data.
• A big data problem is any business problem so large that a single processor cannot manage it.
• Big data problems therefore force a move from a single-processor environment to a distributed
computing environment, which brings its own problems and challenges.

Advantages of NoSQL
1. It can easily scale up and down: NoSQL databases support rapid, elastic scaling and
even allow scaling to the cloud.
o Cluster scale: It allows distribution of the database across 100+ nodes, often in multiple
data centers.
o Performance scale: It sustains 100,000+ database reads and writes per second.
o Data scale: It supports housing of 1 billion+ documents in the database.
2. Doesn’t require a pre-defined schema: NoSQL does not require any adherence to pre-
defined schema.
o It is pretty flexible. For example, in MongoDB, the documents (equivalent of
records in RDBMS) in a collection (equivalent of a table in RDBMS) can have different sets
of key–value pairs.
{_id: 101, "BookName": "Fundamentals of Business Analytics", "AuthorName": "Seema Acharya", "Publisher": "Wiley India"}
{_id: 102, "BookName": "Big Data and Analytics"}

3. Cheap, easy to implement: Deploying NoSQL properly allows for all of the benefits of scale,
high availability, fault tolerance, etc. while also lowering operational costs.
4. Relaxes the data consistency requirement: NoSQL databases have adherence to CAP
theorem (Consistency, Availability, and Partition tolerance). Most of the NoSQL databases
compromise on consistency in favour of availability and partition tolerance. However, they
do go for eventual consistency.
5. Data can be replicated to multiple nodes and can be partitioned: There are two terms that
we will discuss here:
o Sharding: Sharding is when different pieces of data are distributed across multiple
servers.
o NoSQL databases support auto-sharding; this means that they can natively and
automatically spread data across an arbitrary number of servers, without requiring
the application to even be aware of the composition of the server pool.
o Servers can be added or removed from the data layer without application downtime.
o This would mean that data and query load are automatically balanced across servers,
and when a server goes down, it can be quickly and transparently replaced with no
application disruption.
o Replication: Replication is when multiple copies of data are stored across the cluster
and even across data centers. This promises high availability and fault tolerance.
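
Sharding and replication can be combined in a toy cluster like the one below. This is a minimal sketch under stated assumptions: the TinyCluster class is hypothetical, each server is an in-memory dict, and a key's replicas are simply placed on the next servers in order. Real databases use more elaborate placement and failure detection.

```python
import hashlib

class TinyCluster:
    """Toy cluster: each key is sharded to one server and copied to the next ones."""

    def __init__(self, n_servers, replicas=2):
        if replicas > n_servers:
            raise ValueError("cannot keep more copies than there are servers")
        self.servers = [dict() for _ in range(n_servers)]
        self.replicas = replicas

    def _owners(self, key):
        """Sharding: hash the key to choose the servers that hold its copies."""
        digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
        first = int(digest, 16) % len(self.servers)
        return [(first + i) % len(self.servers) for i in range(self.replicas)]

    def put(self, key, value):
        """Replication: write the value to every owner of the key."""
        for idx in self._owners(key):
            self.servers[idx][key] = value

    def get(self, key, down=()):
        """Read from the first owner that is not in the `down` set."""
        for idx in self._owners(key):
            if idx not in down:
                return self.servers[idx].get(key)
        raise RuntimeError("all replicas for this key are down")

cluster = TinyCluster(n_servers=4, replicas=2)
for user_id in range(100):
    cluster.put(f"user:{user_id}", {"id": user_id})

# Simulate the failure of the first server that owns "user:42".
failed = cluster._owners("user:42")[0]
still_readable = cluster.get("user:42", down={failed})
```

The application calls only put and get; it never names a server, which is the sense in which it is unaware of the composition of the server pool.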

Figure 4 - Advantages of NoSQL

Four Ways that NoSQL Systems Handle Big Data Problems


1. Moving Queries to the data, Not Data to the Queries
2. Using Hash Rings to Evenly Distribute Data on a Cluster
3. Using Replication to Scale Reads
4. Letting the Database Distribute Queries Evenly to Data Nodes
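
The second technique, a hash ring, can be sketched as follows. This is an illustrative consistent-hashing implementation, not the one used by any particular database. The HashRing class, the node names, and the choice of 100 virtual points per node are assumptions.

```python
import bisect
import hashlib

def _hash(value):
    """Map a string to a large integer position on the ring."""
    return int(hashlib.sha256(value.encode("utf-8")).hexdigest(), 16)

class HashRing:
    def __init__(self, nodes, vnodes=100):
        self.vnodes = vnodes
        self._ring = []  # sorted list of (position, node) pairs
        for node in nodes:
            self.add_node(node)

    def add_node(self, node):
        """Place several virtual points per node to even out the distribution."""
        for i in range(self.vnodes):
            bisect.insort(self._ring, (_hash(f"{node}#{i}"), node))

    def node_for(self, key):
        """Walk clockwise from the key's position to the next node point."""
        pos = bisect.bisect_right(self._ring, (_hash(key), ""))
        if pos == len(self._ring):
            pos = 0  # wrap around the end of the ring
        return self._ring[pos][1]

ring = HashRing(["node-a", "node-b", "node-c"])
keys = [f"user:{i}" for i in range(1000)]
before = {k: ring.node_for(k) for k in keys}

# Add a server: only the keys that now belong to the new node change owner.
ring.add_node("node-d")
after = {k: ring.node_for(k) for k in keys}
moved = [k for k in keys if before[k] != after[k]]
```

With simple modulo hashing, adding a fourth server would reassign most keys. On the ring, every moved key lands on the new node and the rest stay where they were.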

SQL Vs. NoSQL


SQL | NoSQL
Relational database | Non-relational, distributed database
Relational model | Model-less approach
Pre-defined schema | Dynamic schema for unstructured data
Table-based databases | Document-based, graph-based, wide-column store, or key–value pair databases
Vertically scalable (by increasing system resources) | Horizontally scalable (by creating a cluster of commodity machines)
Uses SQL | Uses UnSQL (Unstructured Query Language)
Not preferred for large datasets | Largely preferred for large datasets
Not a best fit for hierarchical data | Best fit for hierarchical storage, as it follows the key–value pair way of storing data, like JSON (JavaScript Object Notation)
Excellent support from vendors | Relies heavily on community support
Supports complex querying and data keeping needs | Does not have good support for complex querying
Can be configured for strong consistency | A few support strong consistency (e.g., MongoDB); some others can be configured for eventual consistency (e.g., Cassandra)
Examples: Oracle, DB2, MySQL, MS SQL, PostgreSQL, etc. | Examples: MongoDB, HBase, Cassandra, Redis, Neo4j, CouchDB, Couchbase, Riak, etc.

NoSQL Vendors
• Refer to the table below for a few popular NoSQL vendors.

Company  | Product   | Most Widely Used by
Amazon   | DynamoDB  | LinkedIn, Mozilla
Facebook | Cassandra | Netflix, Twitter, eBay
Google   | BigTable  | Adobe Photoshop

Analyzing Big Data with a Shared-Nothing Architecture
Resources can be shared between computer systems in three ways:
1. Shared RAM
2. Shared disk
3. Shared nothing

Distributed Model
From a distribution perspective, there are two main models:
1. Peer-to-Peer Model
2. Master-Slave Model
• Distribution models determine the responsibility for processing data when a request is made.
• Peer-to-peer models may be more resilient to failure than master-slave models.
• Some master-slave distribution models have single points of failure that might impact your
system availability, so you might need to take special care when configuring these systems.
• In the master-slave model, one node is in charge (the master) and the rest are slave nodes.
• Using the right distribution model will depend on your business requirements:
o If high availability is a concern, a peer-to-peer network might be the best solution.
o If you can manage your big data using batch jobs that run in off hours, then the simpler
master-slave model might be best.
Peer-to-Peer Model
• Peer-to-peer systems distribute the responsibility of the master to each node in the cluster.
• In this situation, testing is much easier since you can remove any node in the cluster and the
other nodes will continue to function.
• The disadvantage of peer-to-peer networks is that there’s an increased complexity and
communication overhead that must occur for all nodes to be kept up to date with the cluster
status.

Master-Slave Model
• The initial versions of Hadoop were designed to use a master-slave architecture, with the
NameNode of a cluster being responsible for managing the status of the cluster.
• The NameNode's job is to manage and distribute queries to the correct nodes on the cluster.
• Later versions of Hadoop are also designed to remove single points of failure from a Hadoop cluster.

[Figure: master-slave model, with one master node and its slave nodes]

What is the CAP theorem?
• The CAP theorem makes system designers aware of the trade-offs involved in
designing networked shared-data systems. It has influenced the design of
many distributed data systems. It is very important to understand the CAP theorem, as it
forms the basis for choosing a NoSQL database based on the requirements.
• CAP theorem states that in networked shared-data systems or distributed systems, we can
only achieve at most two out of three guarantees for a database: Consistency, Availability
and Partition Tolerance.
• A distributed system is a network that stores data on more than one node (physical or
virtual machines) at the same time.

Let’s first understand C, A, and P in simple words:


1. Consistency: means that all clients see the same data at the same time, no matter which
node they connect to in a distributed system. To achieve consistency, whenever data is
written to one node, it must be instantly forwarded or replicated to all the other nodes in
the system before the write is deemed successful.
2. Availability: means that every non-failing node returns a response for all read and write
requests in a reasonable amount of time, even if one or more nodes are down. Another
way to state this — all working nodes in the distributed system return a valid response for
any request, without failing or exception.
3. Partition Tolerance: means that the system continues to operate despite arbitrary
message loss or failure of part of the system. In other words, even if there is a network
outage in the data center and some of the computers are unreachable, still the system
continues to perform. Distributed systems guaranteeing partition tolerance can gracefully
recover from partitions once the partition heals.

The CAP theorem categorizes systems into three categories:


1. CP (Consistent and Partition Tolerant) database:
• A CP database delivers consistency and partition tolerance at the expense of
availability. When a partition occurs between any two nodes, the system has to shut
down the non-consistent node (i.e., make it unavailable) until the partition is resolved.
• Partition refers to a communication break between nodes within a distributed system.
Meaning, if a node cannot receive any messages from another node in the system,
there is a partition between the two nodes. Partition could have been because of
network failure, server crash, or any other reason.

2. AP (Available and Partition Tolerant) database:


• An AP database delivers availability and partition tolerance at the expense of
consistency. When a partition occurs, all nodes remain available but those at the
wrong end of a partition might return an older version of data than others. When the
partition is resolved, the AP databases typically resync the nodes to repair all
inconsistencies in the system.
3. CA (Consistent and Available) database:
• A CA database delivers consistency and availability in the absence of any network partition.
Single-node DB servers are often categorized as CA systems, because they
do not need to deal with partition tolerance.
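
The difference between CP and AP behavior during a partition can be seen in a toy two-replica store. This is a deliberately simplified sketch: the TwoNodeStore class is hypothetical, the CP mode is modeled as rejecting writes during a partition, and healing resolves conflicts by keeping the copy with the highest version number. Real systems use quorums and more careful conflict resolution.

```python
class Replica:
    def __init__(self, name):
        self.name = name
        self.value = None
        self.version = 0

class TwoNodeStore:
    """Two replicas that normally replicate every write to each other."""

    def __init__(self, mode):
        if mode not in ("CP", "AP"):
            raise ValueError("mode must be 'CP' or 'AP'")
        self.mode = mode
        self.a, self.b = Replica("a"), Replica("b")
        self.partitioned = False

    def write(self, replica, value):
        other = self.b if replica is self.a else self.a
        if self.partitioned:
            if self.mode == "CP":
                # CP: refuse the write rather than let the replicas diverge.
                raise RuntimeError("unavailable: cannot reach the other replica")
            # AP: accept the write locally; the other replica becomes stale.
            replica.value, replica.version = value, replica.version + 1
            return
        # No partition: the write succeeds only once both replicas have it.
        new_version = max(replica.version, other.version) + 1
        for r in (replica, other):
            r.value, r.version = value, new_version

    def read(self, replica):
        return replica.value

    def heal(self):
        """End the partition and resync both replicas to the newest version."""
        self.partitioned = False
        newest = max((self.a, self.b), key=lambda r: r.version)
        for r in (self.a, self.b):
            r.value, r.version = newest.value, newest.version

# AP: stays available during the partition but serves stale data on one side.
ap = TwoNodeStore("AP")
ap.write(ap.a, "v1")
ap.partitioned = True
ap.write(ap.a, "v2")
stale_read = ap.read(ap.b)
ap.heal()
healed_read = ap.read(ap.b)

# CP: stays consistent during the partition but rejects the write.
cp = TwoNodeStore("CP")
cp.write(cp.a, "v1")
cp.partitioned = True
try:
    cp.write(cp.a, "v2")
    cp_write_accepted = True
except RuntimeError:
    cp_write_accepted = False
cp_read = cp.read(cp.b)
```

In the AP run, replica b returns the older value until the partition heals. In the CP run, both replicas keep agreeing on "v1" because the second write is never accepted.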

In any networked shared-data system or distributed system, partition tolerance is a must.
Network partitions and dropped messages are a fact of life and must be handled
appropriately. Consequently, system designers must choose between consistency and
availability. They must take the CAP theorem into consideration when designing or choosing
distributed storage, as one of C and A has to be sacrificed for the other.
