04-NoSQL

The CAP Theorem
[Venn diagram: Consistency, Availability, Partition tolerance]
ACID Transactions
• A DBMS is expected to support “ACID transactions,” processes that are (see the sketch below):
– Atomic: Either the whole process is done or none is.
– Consistent: Database constraints are preserved.
– Isolated: It appears to the user as if only one process executes at a time.
– Durable: Effects of a process do not get lost if the system crashes.
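To make the four properties concrete, here is a minimal sketch using Python’s built-in sqlite3 module; the accounts table, balances, and CHECK constraint are invented for illustration.

    import sqlite3

    # Hypothetical schema: account balances with a CHECK constraint, i.e. a static
    # integrity constraint the DBMS preserves (Consistency).
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY,"
                 " balance INTEGER CHECK (balance >= 0))")
    conn.execute("INSERT INTO accounts VALUES (1, 100), (2, 50)")
    conn.commit()

    try:
        with conn:  # one transaction: committed on success, rolled back on error
            conn.execute("UPDATE accounts SET balance = balance + 150 WHERE id = 2")  # succeeds
            conn.execute("UPDATE accounts SET balance = balance - 150 WHERE id = 1")  # violates CHECK
    except sqlite3.IntegrityError:
        # The second UPDATE fails, so the already-applied credit is rolled back
        # too (Atomicity): neither account changes.
        pass

    print(conn.execute("SELECT * FROM accounts ORDER BY id").fetchall())
    # [(1, 100), (2, 50)]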
Atomicity
• A real-world event either happens or does
not happen
– Student either registers or does not register
Transaction Consistency
• Consistent transaction: if DB is in consistent
state initially, when the transaction completes:
– All static integrity constraints are satisfied (but
constraints might be violated in intermediate states)
• Can be checked by examining snapshot of database
– New state satisfies specifications of transaction
• Cannot be checked from database snapshot
– No dynamic constraints have been violated
• Cannot be checked from database snapshot
Isolation
• Serial Execution: transactions execute in sequence
– Each one starts after the previous one completes.
• Execution of one transaction is not affected by the
operations of another since they do not overlap in time
– The execution of each transaction is isolated from
all others.
• If the initial database state and all transactions are
consistent, then the final database state will be
consistent and will accurately reflect the real-world
state, but
• Serial execution is inadequate from a performance
perspective
Isolation
[Figure: transactions T1 and T2 perform local computation on local variables; their operations (op1,1, op1,2, op2,1, op2,2) are interleaved into a single sequence of database operations input to the DBMS]
Durability
• Effects of a committed transaction do not get lost if the system crashes.

Availability
• Traditionally thought of as the server/process being available “five nines” (99.999%) of the time.
• However, for a large multi-node system, at almost any point in time there is a good chance that some node is down or that there is a network disruption among the nodes.
– We want a system that is resilient in the face of network disruption.
The CAP Theorem
• Partition tolerance: a system can continue to operate in the presence of network partitions.
• The theorem: a distributed system can provide at most two of Consistency, Availability, and Partition tolerance at the same time.
What kinds of NoSQL
• NoSQL solutions fall into two major areas:
– Key/Value or ‘the big hash table’.
• Amazon S3 (Dynamo)
• Voldemort
• Scalaris
• Memcached (in-memory key/value store)
• Redis
– Schema-less, which comes in multiple flavors: column-based, document-based, or graph-based.
• Cassandra (column-based)
• CouchDB (document-based)
• MongoDB (document-based)
• Neo4J (graph-based)
• HBase (column-based)
Key/Value
Pros:
– very fast
– very scalable
– simple model
– able to distribute horizontally
Cons:
- many data structures (objects) can't be easily modeled as key/value pairs (see the sketch below)
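As a hedged illustration of the key/value model and of the modeling limitation above, the sketch below uses the redis-py client and assumes a Redis server on localhost:6379; all key names and values are made up.

    import json
    import redis  # assumes the redis-py client and a Redis server on localhost:6379

    r = redis.Redis(host="localhost", port=6379)

    # Simple values map naturally onto the "big hash table".
    r.set("session:abc123", "user-42")
    print(r.get("session:abc123"))          # b'user-42'

    # A structured object has to be flattened by the application: the store only
    # sees an opaque blob, so it cannot index or query the fields inside it.
    profile = {"name": "Ada", "follows": ["grace", "alan"]}
    r.set("user:42:profile", json.dumps(profile))
    print(json.loads(r.get("user:42:profile"))["follows"])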
Schema-Less
Pros:
- Schema-less data model is richer than key/value pairs
- eventual consistency
- many are distributed
- still provide excellent performance and scalability
Cons:
- typically no ACID transactions or joins
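A hedged sketch of the document flavor of schema-less storage, using pymongo and assuming a local MongoDB server; the database, collection, and documents are invented for illustration.

    from pymongo import MongoClient  # assumes pymongo and a local MongoDB server

    client = MongoClient("localhost", 27017)
    posts = client["demo_db"]["posts"]   # illustrative database/collection names

    # Documents in the same collection need not share a schema.
    posts.insert_one({"title": "CAP in practice", "tags": ["nosql", "cap"]})
    posts.insert_one({"title": "Hello", "author": {"name": "Ada"}, "views": 10})

    # Fields inside documents are queryable, unlike opaque key/value blobs...
    print(posts.find_one({"tags": "nosql"})["title"])

    # ...but there are no joins: related data is either embedded in the document
    # or fetched with separate queries by the application.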
Common Advantages
• Cheap, easy to implement (open source)
• Data are replicated to multiple nodes (therefore
identical and fault-tolerant) and can be partitioned
– Down nodes easily replaced
– No single point of failure
• Easy to distribute
• Don't require a schema
• Can scale up and down
• Relax the data consistency requirement (CAP)
What am I giving up?
• joins
• group by
• order by
• ACID transactions
• SQL as a sometimes frustrating but still powerful query
language
• easy integration with other applications that support SQL
Big Table and HBase (C+P)
Data Model
• A table in Bigtable is a sparse, distributed,
persistent multidimensional sorted map
• Map indexed by a row key, column key, and a
timestamp
– (row:string, column:string, time:int64) → uninterpreted byte array
• Supports lookups, inserts, deletes
– Single row transactions only
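The sketch below is a toy, in-memory Python model of that map, not Bigtable’s API; the rows and columns follow the webtable example used later in these slides.

    # Toy model: a sparse map from (row key, column key, timestamp) to an
    # uninterpreted byte array; real Bigtable also keeps rows sorted so that
    # range scans are cheap.
    table = {}   # {(row, column, timestamp): bytes}

    def put(row, column, ts, value):
        table[(row, column, ts)] = value

    def get(row, column):
        """Return the newest version of a single cell (single-row lookup)."""
        versions = [(ts, v) for (r, c, ts), v in table.items() if r == row and c == column]
        return max(versions)[1] if versions else None

    put("com.cnn.www", "contents:", 3, b"<html>old ...")
    put("com.cnn.www", "contents:", 6, b"<html>new ...")
    put("com.cnn.www", "anchor:cnnsi.com", 9, b"CNN")
    print(get("com.cnn.www", "contents:"))   # b'<html>new ...' -- newest timestamp wins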
SSTable
[Figure: an SSTable consists of 64K blocks plus an index; a tablet is built from multiple SSTables and covers a sorted range of row keys (e.g., aardvark … apple, apple_two_E … boat)]
HBase: Overview
HBase: Part of Hadoop’s Ecosystem
HBase vs. HDFS
HBase vs. HDFS (Cont’d)
HBase Data Model
[Figure: each cell is addressed by a row key, a column, and a timestamp, and holds a value]
HBase Logical View
HBase: Keys and Column Families
• Row key: a byte array; serves as the primary key for the table and is indexed for fast lookup (e.g., “com.apache.www”).
• Column families, e.g., “contents:” and “anchor:”; a column is named family:qualifier, e.g., the column named “anchor:apache.com”.

Row key            Time Stamp   Column “contents:”   Column “anchor:”
“com.apache.www”   t12          <html> …
                   t11          <html> …
• The time stamp serves as a version number for each cell; values are byte arrays.

Row key         Time Stamp   Column “contents:”   Column “anchor:”
“com.cnn.www”   t13                               “anchor:my.look.ca” → “CNN.com”
                t6           <html> …
                t5           <html> …
                t3           <html> …
Notes on Data Model
Notes on Data Model (Cont’d)
HBase Physical Model
Example
Column Families
HBase Regions
HBase Architecture
Three Major Components
• The HBaseMaster – one master
• The HRegionServer – many region servers
HBase Components
• Region
– A subset of a table’s rows, like horizontal range partitioning (see the sketch below)
– Automatically done
• RegionServer (many slaves)
– Manages data regions
– Serves data for reads and writes (using a log)
• Master
– Responsible for coordinating the slaves
– Assigns regions, detects failures
– Admin functions
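A minimal Python sketch of the range-partitioning idea: the region boundary keys and the master’s assignment of regions to region servers are invented for illustration and are not the HBase API.

    import bisect

    # Each region owns a contiguous, sorted range of row keys; the master assigns
    # regions to region servers.
    region_start_keys = ["", "g", "n", "t"]            # sorted start key of each region
    region_servers    = ["rs1", "rs2", "rs1", "rs3"]   # hypothetical assignment by the master

    def locate(row_key):
        """Return the region server that holds the region containing row_key."""
        idx = bisect.bisect_right(region_start_keys, row_key) - 1
        return region_servers[idx]

    print(locate("com.cnn.www"))   # falls in region ["", "g") -> rs1
    print(locate("org.apache"))    # falls in region ["n", "t") -> rs1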
Big Picture
ZooKeeper
Creating a Table
Operations On Regions: Get()
Get(): Select value from table where key=‘com.apache.www’ AND label=‘anchor:apache.com’

Row key            Time Stamp   Column “anchor:”
“com.apache.www”   t12
                   t11
“com.cnn.www”      t9           “anchor:cnnsi.com” → “CNN”
                   t8           “anchor:my.look.ca” → “CNN.com”
                   t6
                   t5
                   t3
Operations On Regions: Scan()
Scan(): Select value from table where anchor=‘cnnsi.com’

Row key            Time Stamp   Column “anchor:”
“com.apache.www”   t12
                   t11
“com.cnn.www”      t9           “anchor:cnnsi.com” → “CNN”
                   t8           “anchor:my.look.ca” → “CNN.com”
                   t6
                   t5
                   t3
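The sketch below emulates these two operations over a tiny in-memory copy of the example table; it is pure Python for illustration, not the HBase client API, and the cell value for com.apache.www is hypothetical.

    webtable = {
        # row key -> {(column, timestamp): value}
        "com.apache.www": {("anchor:apache.com", 12): "Apache"},   # hypothetical value
        "com.cnn.www": {
            ("anchor:cnnsi.com", 9): "CNN",
            ("anchor:my.look.ca", 8): "CNN.com",
        },
    }

    def get(row_key, column):
        """Get(): newest version of one cell in one row."""
        cells = webtable.get(row_key, {})
        versions = [(ts, v) for (col, ts), v in cells.items() if col == column]
        return max(versions)[1] if versions else None

    def scan(column):
        """Scan(): walk all rows in sorted row-key order, keeping cells in the column."""
        for row_key in sorted(webtable):
            for (col, ts), value in sorted(webtable[row_key].items()):
                if col == column:
                    yield row_key, ts, value

    print(get("com.apache.www", "anchor:apache.com"))   # 'Apache' (hypothetical value)
    print(list(scan("anchor:cnnsi.com")))                # [('com.cnn.www', 9, 'CNN')]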
Operations On Regions: Put()
Operations On Regions: Delete()
HBase: Joins
Altering a Table
Logging Operations
HBase Deployment
[Figure: one master node and multiple slave nodes]
HBase vs. HDFS
HBase vs. RDBMS
When to use HBase
Cassandra
Structured Storage System over a P2P Network
Why Cassandra?
• Lots of data
– Copies of messages, reverse indices of messages,
per user data.
• Many incoming requests resulting in a lot of
random reads and random writes.
• No existing production-ready solutions in the market meet these requirements.
Design Goals
• High availability
• Eventual consistency
– trade off strong consistency in favor of high availability
• Incremental scalability
• Optimistic Replication
• “Knobs” to tune tradeoffs between consistency,
durability and latency
• Low total cost of ownership
• Minimal administration
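As an illustration of such a “knob,” the sketch below uses the DataStax cassandra-driver to issue one read at consistency level ONE (lower latency) and one at QUORUM (stronger consistency); it assumes a running cluster and an illustrative keyspace and table.

    from cassandra.cluster import Cluster                  # DataStax Python driver
    from cassandra.query import SimpleStatement
    from cassandra import ConsistencyLevel

    cluster = Cluster(["127.0.0.1"])                       # assumes a local Cassandra node
    session = cluster.connect("demo_keyspace")             # illustrative keyspace

    # Fast, weaker read: any single replica may answer (favors availability/latency).
    fast = SimpleStatement("SELECT * FROM users WHERE id = %s",
                           consistency_level=ConsistencyLevel.ONE)

    # Stronger read: a majority of replicas must answer (favors consistency).
    strong = SimpleStatement("SELECT * FROM users WHERE id = %s",
                             consistency_level=ConsistencyLevel.QUORUM)

    print(session.execute(fast, (42,)).one())
    print(session.execute(strong, (42,)).one())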
innovation at scale
• google bigtable (2006)
– consistency model: strong
– data model: sparse map
– clones: hbase, hypertable
• amazon dynamo (2007)
– O(1) dht
– consistency model: client tune-able
– clones: riak, voldemort
web 2.0
• used at Twitter, Rackspace, Mahalo, Reddit,
Cloudkick, Cisco, Digg, SimpleGeo, Ooyala, OpenX,
others
Data Model
• Columns are added and modified dynamically.
[Figure: a column family (Name: MailList, Type: Simple, Sort: Name) maps each key to a set of columns (tid1 … tid4), sorted by name]
[Figure: rows such as K3, K10, K30 hold serialized data, some marked DELETED (tombstones); the sorted runs are combined by a merge sort, a per-run index file (keys such as K1, K2) is loaded in memory to locate data, and the query result is returned]
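A toy Python sketch of the merge step shown above: sorted runs are merge-sorted, the newest version of each key wins, and tombstones suppress deleted keys. The run contents are invented and this is not Cassandra’s implementation.

    import heapq

    # Two sorted runs (SSTables); each entry is (key, version, value). "DELETED"
    # marks a tombstone. The keys mirror those in the figure (K3, K10, K30).
    sstable_old = [("K10", 1, "serialized data"), ("K3", 1, "old data")]
    sstable_new = [("K3", 2, "DELETED"), ("K30", 3, "serialized data")]

    def read_all(*runs):
        result = {}
        for key, version, value in heapq.merge(*runs):   # merge sort of the sorted runs
            result[key] = value                           # newer versions overwrite older ones
        return {k: v for k, v in result.items() if v != "DELETED"}   # drop tombstoned keys

    print(read_all(sstable_old, sstable_new))
    # {'K10': 'serialized data', 'K30': 'serialized data'}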
Cassandra Cluster
[Figure: a digest query is sent to the other replicas (Replica B and Replica C), which return digest responses]
Partitioning And Replication
[Figure: nodes A, B, D, E, F are placed on a consistent-hashing ring (positions 0 to 1); a key is hashed onto the ring (h(key1), h(key2)) and stored on the next node clockwise, with N=3 replicas on the following nodes]
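A minimal Python sketch of this partitioning scheme under stated assumptions (MD5 as the ring hash, the node names from the figure, N=3); Cassandra’s actual partitioners differ in detail.

    import hashlib
    from bisect import bisect_right

    NODES = ["A", "B", "D", "E", "F"]
    N = 3  # replication factor

    def ring_position(name):
        """Hash a node name or key onto the ring."""
        return int(hashlib.md5(name.encode()).hexdigest(), 16)

    ring = sorted((ring_position(n), n) for n in NODES)
    positions = [p for p, _ in ring]

    def replicas(key):
        """Walk clockwise from h(key) and return the N nodes that store it."""
        start = bisect_right(positions, ring_position(key)) % len(ring)
        return [ring[(start + i) % len(ring)][1] for i in range(N)]

    print(replicas("key1"))   # three distinct nodes clockwise from h(key1)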
Cluster Membership and Failure
Detection
• Gossip protocol is used for cluster membership.
• Super lightweight with mathematically provable properties.
• State disseminated in O(logN) rounds where N is the number of nodes in
the cluster.
• Every T seconds each member increments its heartbeat counter and selects one other member to send its list to.
• A member merges the received list with its own list.
Accrual Failure Detector
• Valuable for system management, replication, load balancing etc.
• Defined as a failure detector that outputs a value, PHI, associated with
each process.
• Also known as Adaptive Failure detectors - designed to adapt to changing
network conditions.
• The value output, PHI, represents a suspicion level.
• Applications set an appropriate threshold, trigger suspicions and perform
appropriate actions.
• In Cassandra the average time taken to detect a failure is 10-15 seconds
with the PHI threshold set at 5.
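A simplified sketch of the idea, assuming exponentially distributed heartbeat inter-arrival times; it is not Cassandra’s implementation, but it shows PHI rising with the time since the last heartbeat and a threshold of 5 triggering suspicion.

    import math

    class AccrualFailureDetector:
        def __init__(self, threshold=5.0):
            self.threshold = threshold
            self.intervals = []        # recent heartbeat inter-arrival times (seconds)
            self.last_heartbeat = None

        def heartbeat(self, now):
            if self.last_heartbeat is not None:
                self.intervals.append(now - self.last_heartbeat)
                self.intervals = self.intervals[-100:]   # sliding window
            self.last_heartbeat = now

        def phi(self, now):
            """Suspicion level: -log10(P(a heartbeat arrives later than now))."""
            if not self.intervals:
                return 0.0
            mean = sum(self.intervals) / len(self.intervals)
            elapsed = now - self.last_heartbeat
            # assume exponentially distributed inter-arrival times
            return elapsed / (mean * math.log(10))

        def suspected(self, now):
            return self.phi(now) > self.threshold

    d = AccrualFailureDetector(threshold=5.0)
    for t in range(10):                       # heartbeats every second for 10 seconds
        d.heartbeat(float(t))
    print(d.phi(10.5), d.suspected(10.5))     # small phi shortly after a heartbeat
    print(d.phi(21.0), d.suspected(21.0))     # phi exceeds the threshold after a long silence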
Information Flow in the
Implementation
Performance Benchmark
• Loading of data - limited by network
bandwidth.
• Read performance for Inbox Search in
production: