NoSQL Database

The document discusses the benefits and limitations of relational databases compared to NoSQL databases, highlighting that while relational databases are designed for structured data and provide ACID properties, they struggle with scalability and distributed applications. NoSQL databases offer flexibility, horizontal scaling, and schema-less design, making them suitable for handling large volumes of diverse data. The CAP theorem is introduced, emphasizing the trade-offs between consistency, availability, and partition tolerance in distributed systems.


Relational databases 5

• Benefits of Relational databases:

🡺 Designed for OLTP
🡺 ACID properties
🡺 Strong consistency, concurrency, recovery
🡺 Mathematical background
🡺 Standard query language (SQL)
🡺 Lots of tools to use with them, e.g. reporting services, entity frameworks, ...
NoSQL why, what and when? 8

But...
❑ Relational databases were not built for distributed applications.

Because...
❑ Joins are expensive
❑ Hard to scale horizontally (adding more machines)
❑ Object-relational impedance mismatch occurs
❑ Expensive (product cost, hardware, maintenance)
NoSQL why, what and when? 9

And...
They are weak in:
❑ Speed (performance)
❑ High availability
❑ Partition tolerance
Why NoSQL now? Driving Trends 11

What is NoSQL? 13

❑ A NoSQL database provides a mechanism for storage and retrieval of data that employs less constrained models than traditional relational databases.

❑ NoSQL systems are also referred to as "Not only SQL" to emphasize that they may in fact allow SQL-like query languages to be used.
Motivations of NoSQL databases 14

o Simplicity of design
o Simpler "horizontal" scaling to clusters of machines (which is a problem for relational databases)
o Finer control over availability: servers can be added or removed without application downtime
o Limiting the object-relational impedance mismatch
Characteristics of NoSQL databases 14

NoSQL avoids:
▶ Overhead of ACID transactions
▶ Complexity of SQL queries
▶ Burden of up-front schema design
▶ DBA presence
▶ Transactions (they should be handled at the application layer)

Provides:
▶ Easy and frequent changes to the DB
▶ Fast development
▶ Large data volumes (e.g. Google)
▶ Schema-less design
What do we need? 26

• We need a distributed database system with the following features:
– Fault tolerance
– High availability
– Consistency
– Scalability

Which is impossible!!!
According to the CAP theorem
CAP Theorem
■ Three properties of a system:
❑ Consistency (all copies have the same value)
❑ Availability (system can run even if parts have failed)
❑ Via replication
❑ Partitions (network can break into two or more parts, each with active systems that can't talk to the other parts)
■ Brewer's CAP "Theorem": you can have at most two of these three properties for any system
■ Very large systems will partition at some point:
❑ 🡺 Choose one of consistency or availability
❑ Traditional databases choose consistency
❑ Most Web applications choose availability
■ Except for specific parts such as order processing
Availability

■ Traditionally thought of as the server/process being available "five nines" of the time (99.999%).
■ However, for a large node system, at almost any point in time there is a good chance that a node is either down or there is a network disruption among the nodes.
❑ We want a system that is resilient in the face of network disruption
Eventual Consistency

■ When no updates occur for a long period of time, eventually all updates will propagate through the system and all the nodes will be consistent
■ For a given accepted update and a given node, eventually either the update reaches the node or the node is removed from service
■ Known as BASE (Basically Available, Soft state, Eventual consistency), as opposed to ACID
❑ Soft state: copies of a data item may be inconsistent
❑ Eventually consistent: copies become consistent at some later time if there are no more updates to that data item
CAP theorem 27

We cannot achieve all three properties at once in distributed database systems.
NoSQL when? 10

o To handle a huge volume of structured, semi-structured and unstructured data.
o Where there is a need to follow modern software development practices like Agile/Scrum, and if you need to deliver prototypes or applications fast.
o If you prefer object-oriented programming.
o If your relational database is not capable of scaling up to your traffic at an acceptable cost.
o If you want an efficient, scale-out architecture in place of an expensive, monolithic architecture.
o If you have local data transactions that need not be very durable.
o If you are going with schema-less data and want to include new fields without any ceremony.
o When your priority is easy scalability and availability.
NoSQL when not? 10

o If you are required to perform complex and dynamic querying and reporting, then you should avoid NoSQL, as it has limited query functionality. For such requirements, prefer SQL.
o NoSQL also lacks the ability to perform dynamic operations and cannot guarantee ACID properties. For cases like financial transactions, you may go with SQL databases.
o You should also avoid NoSQL if your application needs run-time flexibility.
o If consistency is a must and there are not going to be any large-scale changes in data volume, then going with a SQL database is the better option.
NoSQL is getting more & more popular 15
What is a schema-less data model? 16

In relational databases:

▶ You can't add a record which does not fit the schema
▶ You need to add NULLs to unused items in a row
▶ You have to consider the datatypes, i.e. you can't add a string to an integer field
▶ You can't add multiple items in a field (you would have to create another table: primary key, foreign key, joins, normalization, ... !!!)
What is a schema-less data model? 17

In NoSQL databases:

▶ There is no schema to consider
▶ There are no unused cells
▶ There are no explicit datatypes (types are implicit)
▶ Most of these considerations are handled in the application layer
▶ We gather all items in an aggregate (document)
Aggregate Data Models 18

NoSQL databases are classified into four major data models:

• Key-value
• Document
• Column family (or wide column)
• Graph

Each DB has its own query language.


Aggregate Data Models 18

Column family: Azure Cosmos DB, Accumulo, Cassandra, Scylla, HBase

Document: Azure Cosmos DB, Apache CouchDB, ArangoDB, BaseX, Clusterpoint, Couchbase, eXist-db, IBM Domino, MarkLogic, MongoDB, OrientDB, Qizx, RethinkDB

Key-value: Azure Cosmos DB, Aerospike, Apache Ignite, ArangoDB, Berkeley DB, Couchbase, Dynamo, FoundationDB, InfinityDB, MemcacheDB, MUMPS, Oracle NoSQL Database, OrientDB, Redis, Riak, SciDB, SDBM/Flat File dbm

Graph: Azure Cosmos DB, AllegroGraph, ArangoDB, InfiniteGraph, Apache Giraph, MarkLogic, Neo4j, AgensGraph, OrientDB, Virtuoso
Key-value data model 19

🡺 Simplest NoSQL databases
🡺 The main idea is the use of a hash table
🡺 Access data (values) by strings called keys
🡺 Data has no required format: data may have any format
🡺 Data model: (key, value) pairs
🡺 Basic operations: Insert(key, value), Fetch(key), Update(key), Delete(key) (see the sketch below)
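As a rough illustration (not tied to any particular product), the four basic operations map directly onto a hash table. The sketch below is a minimal in-memory key-value store in Java; the class and method names are made up to mirror the operation list above.

```java
import java.util.HashMap;
import java.util.Map;

// Minimal in-memory key-value store sketching the four basic operations.
// Real key-value databases (Redis, Riak, ...) add persistence, replication
// and expiry, but the access pattern is the same: everything goes by key.
public class SimpleKeyValueStore {
    private final Map<String, byte[]> table = new HashMap<>();

    public void insert(String key, byte[] value) { table.put(key, value); }

    public byte[] fetch(String key) { return table.get(key); }   // null if absent

    public void update(String key, byte[] value) { table.put(key, value); }

    public void delete(String key) { table.remove(key); }
}
```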
Column family data model 20

🡺 Based on Google's Bigtable
🡺 The column is the lowest/smallest unit of data
🡺 The names and format of the columns can vary from row to row in the same table
🡺 Each column family typically contains multiple columns that are used together
🡺 Within a given column family, all data is stored in a row-by-row fashion, such that the columns for a given row are stored together, rather than each column being stored separately
Column family data model 20

🡺 A wide-column store can be interpreted as a two-dimensional key-value store (see the sketch below)
🡺 A cell is a tuple that contains a name, a value and a timestamp
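A minimal sketch of that two-dimensional view, using nothing beyond the standard Java library: row key -> column name -> timestamped versions. The class and method names are illustrative, not any vendor's API.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

// A wide-column row: row key -> column qualifier -> (timestamp -> value).
// This mirrors the (name, value, timestamp) tuple mentioned above; real
// stores such as Cassandra or HBase persist and distribute this structure.
public class WideColumnSketch {
    // row key -> column -> versions ordered by timestamp
    private final Map<String, Map<String, TreeMap<Long, String>>> rows = new HashMap<>();

    public void put(String rowKey, String column, long timestamp, String value) {
        rows.computeIfAbsent(rowKey, k -> new HashMap<>())
            .computeIfAbsent(column, c -> new TreeMap<>())
            .put(timestamp, value);
    }

    // Latest version of a cell, or null if the row/column does not exist.
    public String getLatest(String rowKey, String column) {
        Map<String, TreeMap<Long, String>> row = rows.get(rowKey);
        if (row == null || !row.containsKey(column)) return null;
        return row.get(column).lastEntry().getValue();
    }
}
```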
Column family data model 21

Some statistics about Facebook Search (using Cassandra):

❖ MySQL, > 50 GB of data:
🡺 Writes average: ~300 ms
🡺 Reads average: ~350 ms

❖ Rewritten with Cassandra, > 50 GB of data:
🡺 Writes average: 0.12 ms
🡺 Reads average: 15 ms
Graph data model 22

🡺 Similar to the network data model at a high level of abstraction
🡺 Based on graph theory
🡺 You can use graph algorithms easily
🡺 Graph query languages (GQL): Gremlin, Cypher, SPARQL
🡺 The underlying storage mechanism of graph databases can vary: relational, key-value store or document-oriented database (see the sketch below)
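As a toy illustration only (real graph databases add indexing, persistence and query languages such as Gremlin or Cypher), a property graph can be sketched as labeled nodes plus typed edges; all names below are made up.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Tiny property-graph sketch: nodes with properties, connected by typed edges.
// Graph databases optimise traversals over exactly this kind of structure.
public class GraphSketch {
    record Edge(String type, String to) {}

    private final Map<String, Map<String, String>> nodeProps = new HashMap<>();
    private final Map<String, List<Edge>> adjacency = new HashMap<>();

    public void addNode(String id, Map<String, String> props) {
        nodeProps.put(id, props);
        adjacency.putIfAbsent(id, new ArrayList<>());
    }

    public void addEdge(String from, String type, String to) {
        adjacency.computeIfAbsent(from, k -> new ArrayList<>()).add(new Edge(type, to));
    }

    // One-hop traversal: ids of neighbours reached over edges of a given type.
    public List<String> neighbours(String id, String type) {
        List<String> out = new ArrayList<>();
        for (Edge e : adjacency.getOrDefault(id, List.of())) {
            if (e.type().equals(type)) out.add(e.to());
        }
        return out;
    }
}
```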
Document based data model 23

• The central concept of a document-oriented database is the notion of a document
• Documents in a document store are roughly equivalent to the programming concept of an object
• While each document-oriented database implementation differs on the details of this definition, in general they all assume documents encapsulate and encode data (or information) in some standard format or encoding
• Encodings in use include XML, YAML and JSON, as well as binary forms like BSON
• Different types of documents are allowed in a single store
• Documents are addressed in the database via a unique key that represents that document. This key is a simple identifier (or ID), typically a string, a URI, or a path
Document based data model 23

• Each key is paired with a complex data structure known as a document (see the example below)
• Documents can contain many different key-value pairs, key-array pairs, or even nested documents
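For illustration, a hypothetical user document expressed as plain Java maps; in a real document store this would be serialised as JSON/BSON and addressed by its unique key. All field names are invented.

```java
import java.util.List;
import java.util.Map;

// A document is just nested key-value data; field names here are made up.
public class DocumentExample {
    public static void main(String[] args) {
        Map<String, Object> document = Map.of(
            "_id", "user:42",                       // the document's key
            "name", "Alice",
            "emails", List.of("alice@example.com"), // key-array pair
            "address", Map.of(                      // nested document
                "city", "Minsk",
                "country", "Belarus"
            )
        );
        System.out.println(document);
    }
}
```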
SQL vs NoSQL 25
Common Advantages of NoSQL Systems

■ Cheap, easy to implement (open source)
■ Data are replicated to multiple nodes (therefore identical and fault-tolerant) and can be partitioned
❑ When data is written, the latest version is on at least one node and then replicated to other nodes
❑ No single point of failure
■ Easy to distribute
■ Don't require a schema
What does NoSQL Not Provide?
■ Joins
■ Group by
❑ But PNUTS (a massively parallel and geographically distributed database system for Yahoo!'s web applications) provides a materialized-view approach to joins/aggregation
■ ACID transactions
■ SQL
■ Integration with applications that are based on SQL
What: HBase is...
An open-source, non-relational, distributed column-family database modeled after Google's Bigtable.

Think of it as a sparse, consistent, distributed, multidimensional, sorted map: labeled tables of rows, where a row consists of key-value cells:

(row key, column family, column, timestamp) -> value

HBase
Random, real-time read/write access to Big Data.
The goal is the hosting of very large tables -- billions of rows X millions of columns -- atop clusters of commodity hardware.
HDFS vs HBase
HBase
Tables in HBase can serve as the input and output for MapReduce jobs run in Hadoop, and may be accessed through the Java API as well as through the REST, Avro or Thrift gateway APIs.

HBase runs on top of HDFS and is well-suited for fast read and write operations on large datasets with high throughput and low input/output latency.
Phoenix
HBase is not a direct replacement for a classic SQL database; however, the Apache Phoenix project provides a SQL layer for HBase.
Apache Phoenix is an open-source, massively parallel, relational database engine supporting OLTP for Hadoop, using Apache HBase as its backing store.
Phoenix provides a JDBC driver that hides the intricacies of the NoSQL store, enabling users to create, delete, and alter SQL tables, views, indexes, and sequences; insert and delete rows singly and in bulk; and query data through SQL.
Phoenix compiles queries and other statements into native NoSQL store APIs (see the sketch below).
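A minimal sketch of that JDBC usage, assuming the Phoenix client jar is on the classpath and a ZooKeeper quorum on localhost:2181; the table and columns are made up. Note that Phoenix uses UPSERT rather than INSERT.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

// Querying HBase through Phoenix with plain JDBC. Phoenix compiles the SQL
// below into native HBase operations.
public class PhoenixExample {
    public static void main(String[] args) throws Exception {
        // Connection string format: jdbc:phoenix:<zookeeper quorum>
        try (Connection conn = DriverManager.getConnection("jdbc:phoenix:localhost:2181");
             Statement stmt = conn.createStatement()) {

            stmt.executeUpdate("CREATE TABLE IF NOT EXISTS CITY " +
                               "(ID VARCHAR PRIMARY KEY, COUNTRY VARCHAR, POPULATION BIGINT)");
            stmt.executeUpdate("UPSERT INTO CITY VALUES ('Minsk', 'Belarus', 1937000)");
            conn.commit();  // Phoenix connections are not auto-commit by default

            try (ResultSet rs = stmt.executeQuery("SELECT ID, POPULATION FROM CITY")) {
                while (rs.next()) {
                    System.out.println(rs.getString("ID") + " -> " + rs.getLong("POPULATION"));
                }
            }
        }
    }
}
```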
Usage
HBase now serves several data-driven websites.
Facebook elected to implement its new messaging platform using HBase in November 2010, but migrated away from HBase in 2018 (to MyRocks).
Twitter runs HBase across its entire Hadoop cluster.
HP IceWall SSO is a web-based single sign-on solution that uses HBase to store user data for authenticating users.
Adobe currently has about 30 nodes running HDFS, Hadoop and HBase in clusters ranging from 5 to 14 nodes, on both production and development.

Powered By Apache HBase: http://hbase.apache.org/poweredbyhbase.html
Enterprises that use HBase
What: Part of the Hadoop ecosystem

Provides real-time random read/write access to data stored in HDFS.

[Diagram: consumers read data from HBase, producers write data to HBase, and HBase writes to HDFS]
Hive vs. HBase
o Unlike Hive, HBase operations run in real time on its database rather than as MapReduce jobs
o Apache Hive is a data warehouse system built on top of Hadoop. Apache HBase is a NoSQL key/value store on top of HDFS
o Apache Hive provides SQL features over Spark/Hadoop data. HBase can store or process Hadoop data with near real-time read/write needs
o Hive should be used for analytical querying of data collected over a period of time. HBase is primarily used to store and process unstructured Hadoop data
o HBase is perfect for real-time querying of Big Data. Hive should not be used for real-time querying
What: Features-1

Linear scalability, capable of storing hundreds of terabytes of data

Automatic and configurable sharding of tables

Automatic failover support

Strictly consistent reads and writes


What: Features-2
Integrates nicely with Hadoop MapReduce (both as source and destination)

Easy Java API for client access

Thrift gateway and REST APIs

Bulk import of large amounts of data

Replication across clusters & backup options

Block cache and Bloom filters for real-time queries
How to use HBase?
HBase Table
How: the Data

Row keys are uninterpreted byte arrays.
Columns are grouped into column families (CFs); CFs are defined statically upon table creation.
A cell is an uninterpreted byte array plus a timestamp.
Rows are ordered and accessed by row key; different kinds of data are separated into different CFs, and all values are stored as byte arrays.
Rows can have different columns, a cell can have multiple versions, and data can be very "sparse".

Row Key         Data
Minsk           geo:{'country':'Belarus', 'region':'Minsk'}
                demography:{'population':'1,937,000'@ts=2011}
New_York_City   geo:{'country':'USA', 'state':'NY'}
                demography:{'population':'8,175,133'@ts=2010, 'population':'8,244,910'@ts=2011}
Suva            geo:{'country':'Fiji'}
How: Writing the Data
Row updates are atomic.

Updates across multiple rows are NOT atomic; there is no transaction support out of the box.

HBase stores N versions of a cell (default 3).

Tables are usually "sparse": not all columns are populated in every row (a Java client sketch follows).
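A sketch of a single-row write with the HBase Java client, assuming an HBase 2.x client on the classpath and an already-created table 'cities' with a column family 'geo' (both names are made up); the single Put below is atomic, per the rule above.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

// Writing one row: all cells in a single Put are applied atomically.
public class HBaseWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();   // reads hbase-site.xml
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("cities"))) {

            Put put = new Put(Bytes.toBytes("Minsk"));                     // row key
            put.addColumn(Bytes.toBytes("geo"), Bytes.toBytes("country"),  // cf:qualifier
                          Bytes.toBytes("Belarus"));
            put.addColumn(Bytes.toBytes("geo"), Bytes.toBytes("region"),
                          Bytes.toBytes("Minsk"));
            table.put(put);   // atomic for this row only
        }
    }
}
```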
How: Reading the Data
A reader will always read the last written (and committed) values.

Reading a single row: Get
Reading multiple rows: Scan (very fast)

A Scan usually defines a start key and a stop key.
Rows are ordered, so it is easy to do a partial key scan (see the sketch below).
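A sketch of Get and Scan with the HBase 2.x Java client, reusing the hypothetical 'cities' table from the write example; the start/stop keys simply bound a partial key scan.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

// Reading: Get for a single row, Scan with start/stop keys for a range.
public class HBaseReadExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("cities"))) {

            // Single row by key.
            Result row = table.get(new Get(Bytes.toBytes("Minsk")));
            System.out.println(Bytes.toString(
                row.getValue(Bytes.toBytes("geo"), Bytes.toBytes("country"))));

            // Range scan: rows are sorted by key, so [startRow, stopRow) is cheap.
            Scan scan = new Scan()
                .withStartRow(Bytes.toBytes("M"))
                .withStopRow(Bytes.toBytes("N"));
            try (ResultScanner scanner = table.getScanner(scan)) {
                for (Result r : scanner) {
                    System.out.println(Bytes.toString(r.getRow()));
                }
            }
        }
    }
}
```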


How: MapReduce Integration
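The original slide shows only a diagram. As a hedged sketch of the integration, the HBase mapreduce utilities can wire a table in as MapReduce input roughly as follows (HBase 2.x client assumed; the mapper and table name are made up).

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;

// Using an HBase table as MapReduce input: each region becomes an input
// split, and the mapper receives (row key, Result) pairs.
public class HBaseMapReduceSketch {

    // Hypothetical mapper that emits one count per row.
    static class RowCountMapper extends TableMapper<Text, IntWritable> {
        @Override
        protected void map(ImmutableBytesWritable rowKey, Result value, Context context)
                throws IOException, InterruptedException {
            context.write(new Text("rows"), new IntWritable(1));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        Job job = Job.getInstance(conf, "hbase-row-count");
        job.setJarByClass(HBaseMapReduceSketch.class);

        TableMapReduceUtil.initTableMapperJob(
            "cities",              // input table (assumed to exist)
            new Scan(),            // full-table scan
            RowCountMapper.class,
            Text.class,            // mapper output key
            IntWritable.class,     // mapper output value
            job);

        job.setNumReduceTasks(0);                        // map-only sketch
        job.setOutputFormatClass(NullOutputFormat.class); // discard output for brevity
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```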
How: Sharding the Data
Automatic and configurable sharding of tables:

Tables are partitioned into Regions.
A Region is defined by start & end row keys.
Regions are the "atoms" of distribution.
Regions are assigned to RegionServers (HBase cluster slaves).
How: Setup: Components

[Diagram: HBase components -- client, ZooKeeper, ...]
How: Setup: Hadoop Cluster
Typical Hadoop+HBase setup:

Master Node: HDFS NameNode, MapReduce JobTracker, HBase HMaster
Slave Nodes (each): HDFS DataNode, MapReduce TaskTracker, HBase RegionServer


How: Setup: Automatic Failover
When to Use HBase?
When: What HBase is good at

Serving large amounts of data: built to scale from the get-go, with fast random access to the data

Write-heavy applications*

Append-style writing (inserting/overwriting new data) rather than heavy read-modify-write operations
When: HBase vs ...
General COMMANDS
• status: Provides the status of HBase,
for example, the number of servers.
• version: Provides the version of HBase
being used.
• table_help: Provides help for
table-reference commands.
• whoami: Provides information about the
user.
HBase DDL commands
• create: Creates a table.
• list: Lists all the tables in HBase.
• disable: Disables a table.
• is_disabled: Verifies whether a table
is disabled.
• enable: Enables a table.
• is_enabled: Verifies whether a table
is enabled.
• describe: Provides the description of
a table.
• alter: Alters a table.
• exists: Verifies whether a table
exists.
• drop: Drops a table from HBase.
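These are HBase shell commands; roughly equivalent steps through the Java Admin API (HBase 2.x client assumed, with a made-up table and column family) look like this:

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.ColumnFamilyDescriptorBuilder;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.TableDescriptorBuilder;

// Create / disable / drop a table programmatically, mirroring the shell's
// exists, create, list/describe, disable and drop commands.
public class HBaseDdlExample {
    public static void main(String[] args) throws Exception {
        try (Connection connection = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Admin admin = connection.getAdmin()) {

            TableName name = TableName.valueOf("cities");

            if (!admin.tableExists(name)) {                        // shell: exists
                admin.createTable(                                 // shell: create 'cities', 'geo'
                    TableDescriptorBuilder.newBuilder(name)
                        .setColumnFamily(ColumnFamilyDescriptorBuilder.of("geo"))
                        .build());
            }

            System.out.println(admin.listTableDescriptors());      // shell: list / describe

            admin.disableTable(name);                              // shell: disable
            admin.deleteTable(name);                               // shell: drop
        }
    }
}
```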
HBase Data Manipulation commands

• put: Puts a cell value at a specified column in a specified row in a particular table.
• get: Fetches the contents of a row or a cell.
• delete: Deletes a cell value in a table.
• deleteall: Deletes all the cells in a given row.
• scan: Scans and returns the table data.
• count: Counts and returns the number of rows in a table.
• truncate: Disables, drops, and recreates a table.
