Module 2.3

This document provides an introduction to big data frameworks Hadoop and NoSQL databases. It discusses the limitations of relational databases for distributed applications and large datasets due to issues with scalability, replication, and rigid schemas. It then introduces the CAP theorem, which states that a distributed system can only optimize for two of consistency, availability, and partition tolerance. This leads into explanations of BASE databases that sacrifice consistency for availability being better suited for large-scale distributed systems compared to ACID-compliant relational databases. Common NoSQL database types of CP, AP, and CA are also defined based on their support for consistency and availability.

Module 2

Introduction to Big Data Frameworks:


Hadoop, NOSQL

Big Data Analytics


BEITC801 Prof. Priyanka Bandagale
FAMT, Ratnagiri
Introduction
• Database - Organized collection of data
• DBMS - A package of computer programs that controls the creation, maintenance and use of a database
• Databases are created to operate on large quantities of information by inputting, storing, retrieving, and managing that information.
Relational databases

• Benefits of Relational databases:
• Designed for all purposes
• ACID
• Strong consistency, concurrency, recovery
• Standard Query Language (SQL)
• Lots of supporting tools, e.g. reporting services, entity frameworks, ...
SQL databases
What is Wrong With RDBMS?
• Relational databases were not built for distributed applications.
• Impedance mismatch.
• Object Relational Mapping doesn't work quite well.
• Rigid schema design.
• Harder to scale.
• Replication.
• Joins across multiple nodes? Hard.
• How does an RDBMS handle data growth? Hard.
• In its favor: many programmers are already familiar with it.
• In its favor: transactions and ACID make development easy.
ACID Semantics
• Atomicity: All or nothing.
• Consistency: Every transaction takes the data from one valid state to another.
• Isolation: Transactions are isolated from each other.
• Durability: Once a transaction is committed, its changes persist.

Any data store can achieve Atomicity, Isolation and Durability, but do you always need consistency? No.

By giving up ACID properties, one can achieve higher performance and scalability.
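
The "all or nothing" behavior of Atomicity can be seen in a minimal sketch using Python's built-in sqlite3 module. The accounts table, the names and the transfer helper are assumptions made up for this illustration; they are not from the slides.

```python
import sqlite3

def transfer(conn, src, dst, amount):
    """Move `amount` between two accounts atomically; roll back on any error."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
            (balance,) = conn.execute(
                "SELECT balance FROM accounts WHERE name = ?", (src,)
            ).fetchone()
            if balance < 0:
                raise ValueError("insufficient funds")
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
        return True
    except ValueError:
        return False

def balances(conn):
    """Return {name: balance} for every account."""
    return dict(conn.execute("SELECT name, balance FROM accounts ORDER BY name"))

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)])
conn.commit()

ok = transfer(conn, "alice", "bob", 30)       # both updates are applied
failed = transfer(conn, "alice", "bob", 500)  # the debit is rolled back
```

After the failed transfer the debit from the first UPDATE is undone, so the balances are the same as after the successful one.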
Brewer’s CAP Theorem
A distributed system can support only two of the following characteristics at the same time:
• Consistency
• Availability
• Partition tolerance
• Proven by Seth Gilbert and Nancy Lynch at MIT.
• http://www.cs.berkeley.edu/~brewer/cs262b-2004/PODC-keynote.pdf
CAP theorem

We cannot achieve all three properties at once in a distributed database system (the center of the diagram).
CAP theorem
Consistency
• Consistency: All clients should read the same data. There are many levels of consistency.
• Strict Consistency – RDBMS.
• Tunable Consistency – Cassandra.
• Eventual Consistency – Amazon Dynamo.
• The client perceives that a set of operations has occurred all at once – Pritchett
• More like the Atomic property of ACID transactions
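
Tunable consistency is often explained with quorums: with N replicas, a write waits for W acknowledgements and a read asks R replicas. A read is guaranteed to see the latest acknowledged write when R + W > N, because the two sets of replicas must overlap. The helper below is a hypothetical sketch of that rule only; it is not the Cassandra API.

```python
def read_overlaps_write(n, w, r):
    """True if every read quorum must intersect every write quorum (R + W > N)."""
    if not (1 <= w <= n and 1 <= r <= n):
        raise ValueError("w and r must be between 1 and n")
    return r + w > n

# Three replicas, quorum writes and quorum reads: the quorums always overlap.
strong = read_overlaps_write(n=3, w=2, r=2)

# Three replicas, one ack per write and one replica per read: a read may hit
# a replica that has not received the write yet (eventual consistency).
eventual = read_overlaps_write(n=3, w=1, r=1)
```

Lowering W or R trades consistency for latency and availability, which is what "tunable" refers to.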
Availability
• Availability: Data should always be available.
• Node failures do not prevent survivors from continuing to operate – Wikipedia
• Every operation must terminate in an intended response – Pritchett
Partition Tolerance
• Partition Tolerance: The system tolerates data being partitioned across network segments by network failures.
• The system continues to operate despite arbitrary message loss – Wikipedia
• Operations will complete, even if individual components are unavailable – Pritchett

February 16, 2024
CAP Theorem
• Consistency
• 2 types of consistency:
1. Strong consistency – ACID (Atomicity, Consistency, Isolation, Durability)
2. Weak consistency – BASE (Basically Available, Soft state, Eventual consistency)
BASE, an ACID Alternative
Almost the opposite of ACID.
• Basically Available: Nodes in a distributed environment can go down, but the whole system shouldn’t be affected.
• Soft State (scalable): The state of the system and data changes over time.
• Eventual Consistency: Given enough time, data will be consistent across the distributed system.
Characteristics of BASE Transactions
• Weak consistency, meaning stale data is okay.
• Availability takes priority.
• Best effort.
• Approximate answers are okay.
• Simpler and faster.
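
Stale reads followed by eventual convergence can be illustrated with a toy model. The Replica class, the integer version numbers and the anti_entropy function are assumptions made up for this sketch; real stores use their own mechanisms (timestamps, vector clocks, read repair and so on).

```python
class Replica:
    """A hypothetical replica storing key -> (version, value)."""

    def __init__(self, name):
        self.name = name
        self.data = {}

    def write(self, key, value, version):
        # Last write wins: keep the value only if its version is newer.
        current = self.data.get(key)
        if current is None or version > current[0]:
            self.data[key] = (version, value)

    def read(self, key):
        entry = self.data.get(key)
        return None if entry is None else entry[1]

def anti_entropy(replicas):
    """Exchange state so every replica ends with the newest version of each key."""
    for source in replicas:
        for key, (version, value) in list(source.data.items()):
            for target in replicas:
                target.write(key, value, version)

a, b = Replica("a"), Replica("b")
a.write("cart", ["book"], version=1)
b.write("cart", ["book"], version=1)
a.write("cart", ["book", "pen"], version=2)  # b has not seen this update yet

stale_read = b.read("cart")      # still the old cart: stale data is tolerated
anti_entropy([a, b])             # background sync between replicas
converged_read = b.read("cart")  # now matches replica a
```

Both replicas stay readable the whole time; the price is that a read between the update and the sync returns the older value.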
CAP
The CAP theorem categorizes systems into three categories:
CP (consistent and partition tolerant): The CP label can be confusing, because it seems to describe a system that is consistent and partition tolerant but never available. It actually refers to systems where availability is sacrificed only in the case of a network partition.
CA (consistent and available): CA systems are consistent and available in the absence of any network partition. Single-node DB servers are often categorized as CA systems, because they do not need to deal with partition tolerance. The only hole in this theory is that single-node DB systems are not a network of shared data systems and thus do not fall under the purview of CAP.
AP (available and partition tolerant): These systems are available and partition tolerant but cannot guarantee consistency.
CAP theorem with databases that “choose” CA, CP and AP
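Example: CA
[Figure: an RDBMS serving Read/Write requests.]
Example: AP
[Figure: nodes A and B connected by Replication; two Users issue Read and Write requests.]
Example: CP
[Figure: node A connected by Replication to a Backup; a User issues Read/Write requests to A.]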
• Today, NoSQL databases are classified based on the two CAP characteristics
they support:
• CP database: A CP database delivers consistency and partition tolerance at
the expense of availability. When a partition occurs between any two
nodes, the system has to shut down the non-consistent node (i.e., make it
unavailable) until the partition is resolved.
• AP database: An AP database delivers availability and partition tolerance at
the expense of consistency. When a partition occurs, all nodes remain
available but those at the wrong end of a partition might return an older
version of data than others. (When the partition is resolved, the AP
databases typically resync the nodes to repair all inconsistencies in the
system.)
• CA database: A CA database delivers consistency and availability across all
nodes. It can’t do this if there is a partition between any two nodes in the
system, however, and therefore can’t deliver fault tolerance.
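
The difference between a CP and an AP database during a partition can be sketched with a two-node toy model. The Node class and the write_during_partition helper are hypothetical names for this illustration, and the model ignores everything except the partition itself.

```python
class Node:
    """A hypothetical database node running in "CP" or "AP" mode."""

    def __init__(self, mode):
        self.mode = mode
        self.value = "v1"
        self.partitioned = False  # True when cut off from its peer

    def read(self):
        # A CP node refuses to answer when it cannot confirm the latest value.
        if self.partitioned and self.mode == "CP":
            raise RuntimeError("unavailable: cannot confirm latest value")
        return self.value

def write_during_partition(mode):
    """Write "v2" to one node while its peer is unreachable, then read the peer."""
    primary, peer = Node(mode), Node(mode)
    peer.partitioned = True
    primary.value = "v2"  # replication to the peer is blocked by the partition
    try:
        return peer.read()
    except RuntimeError as err:
        return f"error: {err}"

ap_result = write_during_partition("AP")  # peer answers with the older value
cp_result = write_during_partition("CP")  # peer refuses to answer
```

The AP peer stays available but returns the older version; the CP peer stays consistent by becoming unavailable until the partition is resolved.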
What is NOSQL?
• The Name:
• Stands for Not Only SQL
• The term NOSQL was introduced by Carlo Strozzi in 1998 to name his file-based database
• It was reintroduced by Eric Evans when an event was organized to discuss open source distributed databases
• Eric states that “… but the whole point of seeking alternatives is that you need to solve a problem that relational databases are a bad fit for. …”
What is NOSQL?
• Key features (advantages):
• non-relational
• don’t require a schema
• data are replicated to multiple nodes (so, identical & fault-tolerant) and can be partitioned:
• down nodes easily replaced
• no single point of failure
• horizontally scalable
• cheap, easy to implement (open-source)
• massive write performance
• fast key-value access
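
Partitioning and replication of key-value data can be sketched as follows. The node names, the replicas_for helper and the simple "hash modulo number of nodes" placement are assumptions for illustration; production systems typically use consistent hashing so that adding a node moves fewer keys.

```python
import hashlib

def replicas_for(key, nodes, replication_factor=2):
    """Pick `replication_factor` consecutive nodes, starting at hash(key) mod N."""
    if not 1 <= replication_factor <= len(nodes):
        raise ValueError("replication_factor must be between 1 and len(nodes)")
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    start = int.from_bytes(digest[:8], "big") % len(nodes)
    return [nodes[(start + i) % len(nodes)] for i in range(replication_factor)]

nodes = ["node-0", "node-1", "node-2", "node-3"]

# Each key lives on two different nodes, so one node can fail without data loss.
placement = {key: replicas_for(key, nodes) for key in ["user:1", "user:2", "user:3"]}
```

Because the placement depends only on the key, any client can compute where a key lives without consulting a central index, which is one reason key-value access is fast.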
What is NOSQL?
• Disadvantages:
• Don’t fully support relational features
• no join, group by, order by operations (except within partitions)
• no referential integrity constraints across partitions
• No declarative query language (e.g., SQL) → more programming
• Relaxed ACID (see CAP theorem) → fewer guarantees
• No easy integration with other applications that support SQL
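
The "more programming" cost can be made concrete: without join, group by and order by in the store, the application combines the data itself. The users and orders below are made-up sample data, and totals_by_user is a hypothetical helper, not part of any database client.

```python
# Hypothetical documents, as they might come back from two separate lookups.
users = {
    "u1": {"name": "Asha"},
    "u2": {"name": "Ravi"},
}
orders = [
    {"order_id": 1, "user_id": "u1", "total": 250},
    {"order_id": 2, "user_id": "u2", "total": 100},
    {"order_id": 3, "user_id": "u1", "total": 75},
]

def totals_by_user(users, orders):
    """Join orders to users and sum totals, like JOIN + GROUP BY + ORDER BY in SQL."""
    totals = {}
    for order in orders:
        name = users[order["user_id"]]["name"]             # the "join"
        totals[name] = totals.get(name, 0) + order["total"]  # the "group by"
    # the "order by": highest total first
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)

report = totals_by_user(users, orders)
```

In SQL the same report is a single declarative statement; here the join strategy and its correctness are the programmer's responsibility.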
Questions?
