NoSQL Database
Because...
❑ Joins are expensive
❑ Hard to scale horizontally (adding more machines)
❑ Object-relational impedance mismatch occurs
❑ Expensive (product cost, hardware, maintenance)
NoSQL why, what and when?
And....
It’s weak in:
❑ Speed (performance)
❑ High availability
❑ Partition tolerance
Why NoSQL now? Driving Trends
What is NoSQL?
o simplicity of design
o simpler "horizontal" scaling to clusters of machines (which is a problem for relational databases)
o finer control over availability
Servers can be added or removed without application downtime.
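To make the idea of adding or removing servers without downtime more concrete, here is a minimal sketch of consistent hashing, one common way to spread keys over a cluster so that a membership change relocates only a fraction of the keys. The slides do not name this technique; the `HashRing` class, the `replicas` parameter, and the server names are hypothetical and chosen for illustration only.

```python
import bisect
import hashlib


def _hash(value: str) -> int:
    """Map a string to a stable integer position on the ring."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


class HashRing:
    """Toy consistent-hash ring (hypothetical, for illustration only)."""

    def __init__(self, nodes=(), replicas=100):
        self.replicas = replicas  # virtual points placed on the ring per node
        self._points = []         # sorted ring positions
        self._owner = {}          # ring position -> node name
        for node in nodes:
            self.add_node(node)

    def add_node(self, node: str) -> None:
        for i in range(self.replicas):
            point = _hash(f"{node}#{i}")
            self._owner[point] = node
            bisect.insort(self._points, point)

    def remove_node(self, node: str) -> None:
        for i in range(self.replicas):
            point = _hash(f"{node}#{i}")
            del self._owner[point]
            self._points.remove(point)

    def node_for(self, key: str) -> str:
        """Return the node owning the first ring point clockwise from the key."""
        if not self._points:
            raise LookupError("ring has no nodes")
        idx = bisect.bisect(self._points, _hash(key)) % len(self._points)
        return self._owner[self._points[idx]]


# Usage: add a fourth server and count how many keys change owner.
ring = HashRing(["server-a", "server-b", "server-c"])
keys = [f"user:{i}" for i in range(1000)]
before = {k: ring.node_for(k) for k in keys}
ring.add_node("server-d")
after = {k: ring.node_for(k) for k in keys}
moved = [k for k in keys if before[k] != after[k]]
print(f"{len(moved)} of {len(keys)} keys moved to the new server")
```

In this sketch, every key that changes owner moves to the newly added server, and all other keys stay where they were, which is why the rest of the cluster can keep serving requests during the change.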
NoSQL avoids:
▶ Overhead of ACID transactions
▶ Complexity of SQL queries
▶ Burden of up-front schema design
▶ DBA presence
▶ Transactions (they should be handled at the application layer)
Provides:
▶ Easy and frequent changes to DB
▶ Fast development
▶ Large data volumes (e.g., Google)
▶ Schema-less design
What do we need?
Which is impossible, according to the CAP theorem.
CAP Theorem
■ Three properties of a system
❑ Consistency (all copies have the same value)
❑ Availability (the system can run even if parts have failed), achieved via replication
❑ Partition tolerance (the network can break into two or more parts, each with active systems that can’t talk to the other parts)
■ Brewer’s CAP “Theorem”: you can have at most two of these three properties for any system
■ Very large systems will partition at some point
❑ 🡺 Choose one of consistency or availability
❑ Traditional databases choose consistency
❑ Most Web applications choose availability
■ Except for specific parts such as order processing
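The trade-off between consistency and availability during a partition can be sketched with a toy two-replica cluster. This is an illustration under simplifying assumptions, not a model of any real database: the `ToyCluster` and `Replica` classes and the `"CP"`/`"AP"` mode labels are hypothetical names introduced here.

```python
class Replica:
    """One copy of the data."""

    def __init__(self, name):
        self.name = name
        self.data = {}


class ToyCluster:
    """Two replicas with a switchable network partition (illustrative only)."""

    def __init__(self, mode):
        if mode not in ("CP", "AP"):
            raise ValueError("mode must be 'CP' or 'AP'")
        self.mode = mode
        self.a = Replica("a")
        self.b = Replica("b")
        self.partitioned = False

    def write(self, replica, key, value):
        """Try to write through one replica; return True if accepted."""
        other = self.b if replica is self.a else self.a
        if not self.partitioned:
            # Healthy network: both copies are updated together.
            replica.data[key] = value
            other.data[key] = value
            return True
        if self.mode == "CP":
            # Prefer consistency: refuse the write, giving up availability.
            return False
        # Prefer availability: accept locally, so the copies diverge.
        replica.data[key] = value
        return True

    def consistent(self):
        return self.a.data == self.b.data


# Usage: the same partition handled under each choice.
cp = ToyCluster("CP")
cp.write(cp.a, "stock", 10)
cp.partitioned = True
cp_accepted = cp.write(cp.a, "stock", 9)   # rejected, copies still agree

ap = ToyCluster("AP")
ap.write(ap.a, "stock", 10)
ap.partitioned = True
ap_accepted = ap.write(ap.a, "stock", 9)   # accepted, copies now differ
```

The consistency-first cluster stays consistent but turns the write away, while the availability-first cluster accepts it and must reconcile the two copies after the partition heals.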
Availability
In relational Databases:
In NoSQL Databases:
application layer
• Key-value
• Document
• Column family (or wide column)
• Graph
🡺 Basic operations:
Insert(key, value),
Fetch(key),
Update(key),
Delete(key)
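The four basic operations can be sketched as a small in-memory class. This is a minimal sketch, not the API of any particular key-value product: the `KeyValueStore` class is hypothetical, and `update` is assumed to take the new value as a second argument, which the slide does not show.

```python
class KeyValueStore:
    """In-memory key-value store exposing the four basic operations."""

    def __init__(self):
        self._data = {}

    def insert(self, key, value):
        """Add a new key; fail if it already exists."""
        if key in self._data:
            raise KeyError(f"key already exists: {key!r}")
        self._data[key] = value

    def fetch(self, key):
        """Return the value for a key; raises KeyError if it is missing."""
        return self._data[key]

    def update(self, key, value):
        """Replace the value of an existing key (new value is an assumption)."""
        if key not in self._data:
            raise KeyError(f"no such key: {key!r}")
        self._data[key] = value

    def delete(self, key):
        """Remove a key; raises KeyError if it is missing."""
        del self._data[key]


# Usage: the store treats values as opaque and looks them up only by key.
store = KeyValueStore()
store.insert("user:1", {"name": "Ada"})
first = store.fetch("user:1")
store.update("user:1", {"name": "Grace"})
second = store.fetch("user:1")
store.delete("user:1")
```

Note that there is no query by value here: everything goes through the key, which is what keeps this model simple to partition across machines.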
Column family data model
🡺 Each column family typically contains multiple columns that are used together.
🡺 Within a given column family, all data is stored in a row-by-row fashion, such that the columns for a given row are stored together, rather than each column being stored separately.
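The layout described above can be sketched with nested dictionaries: one store per column family, and inside each store one entry per row that holds all of that row's columns for the family. This is an illustration of the data model only, not how HBase or any specific database lays out data on disk; the `ColumnFamilyTable` class and the family and column names are hypothetical.

```python
class ColumnFamilyTable:
    """Toy wide-column table: family -> row key -> {column: value}."""

    def __init__(self, families):
        # Column families are fixed when the table is created;
        # columns inside a family can be added freely per row.
        self._stores = {family: {} for family in families}

    def put(self, row_key, family, column, value):
        if family not in self._stores:
            raise KeyError(f"unknown column family: {family!r}")
        self._stores[family].setdefault(row_key, {})[column] = value

    def get_family(self, row_key, family):
        """All columns of one family for one row, which are stored together."""
        return dict(self._stores[family].get(row_key, {}))

    def get_row(self, row_key):
        """Assemble a full row by visiting every family that holds the key."""
        return {
            family: dict(rows[row_key])
            for family, rows in self._stores.items()
            if row_key in rows
        }


# Usage: two families that are typically read for different purposes.
table = ColumnFamilyTable(["profile", "activity"])
table.put("row1", "profile", "name", "Ada")
table.put("row1", "profile", "email", "ada@example.com")
table.put("row1", "activity", "last_login", "2024-01-01")
profile = table.get_family("row1", "profile")
full_row = table.get_row("row1")
```

Reading one family touches a single store, while reading the whole row has to visit each family, which reflects why columns that are used together are grouped into the same family.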
[Diagram labels: Producer (data write), Consumer (data read), HDFS write]
Hive vs. HBase
o Unlike Hive, HBase operations run in real time on its database rather than as MapReduce jobs
o Apache Hive is a data warehouse system built on top of Hadoop. Apache HBase is a NoSQL key/value store on top of HDFS
o Apache Hive provides SQL features for Spark/Hadoop data. HBase can store or process Hadoop data with near real-time read/write needs
o Hive should be used for analytical querying of data collected over a period of time. HBase is primarily used to store and process unstructured Hadoop data
o HBase is well suited to real-time querying of Big Data. Hive should not be used for real-time querying
What: Features-1
[Diagram labels: ZooKeeper, client]
How: Setup: Hadoop Cluster
Typical Hadoop+HBase setup
[Diagram labels: Master Node, HDFS, TaskTracker, DataNode]
Write-heavy applications*