
NoSQL and Big Data Processing

Hbase, Hive and Pig, etc.


Adopted from slides by Jerome Simeon, Perry
Hoekstra, Jiaheng Lu, Avinash Lakshman,
Prashant Malik, and Jimmy Lin
History of the World, Part 1
• Relational Databases – mainstay of business
• Web-based applications caused spikes in load
– Especially true for public-facing e-Commerce sites
• Developers began to front RDBMS with memcached or integrate other caching mechanisms within the application (e.g., Ehcache)
Scaling Up
• Issues with scaling up when the dataset is just too big
• RDBMS were not designed to be distributed
• Began to look at multi-node database solutions
• Known as ‘scaling out’ or ‘horizontal scaling’
• Different approaches include:
– Master-slave
– Sharding
Scaling RDBMS – Master/Slave
• Master-Slave
– All writes are written to the master. All reads performed against
the replicated slave databases
– Critical reads may be incorrect as writes may not have been
propagated down
– Large data sets can pose problems as master needs to duplicate
data to slaves
Scaling RDBMS - Sharding
• Partitioning or sharding
– Scales well for both reads and writes
– Not transparent; the application needs to be partition-aware
– Can no longer have relationships/joins across partitions
– Loss of referential integrity across shards
• Different sharding approaches:
– Vertical Partitioning: Have tables related to a specific feature sit on their own server. May have to rebalance or reshard if tables outgrow the server.
– Range-Based Partitioning: When a single table cannot sit on one server, split the table across multiple servers based on some critical value range.
– Key or Hash-Based Partitioning: Hash a key value and use the resulting value to pick one of multiple servers (a minimal sketch follows this list).
– Directory-Based Partitioning: Have a lookup service that has knowledge of the partitioning scheme. This allows adding servers or changing the partition scheme without changing the application.
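A minimal sketch of key/hash-based partitioning (the class name, shard list, and key format are all illustrative, not from the slides):

import java.util.List;

// Minimal key/hash-based partitioning: hash the row key and use the result to
// pick one of the partitioned servers.
public class HashSharding {
    private final List<String> shards;   // e.g. JDBC URLs of the partitioned servers

    public HashSharding(List<String> shards) {
        this.shards = shards;
    }

    public String shardFor(String rowKey) {
        int bucket = Math.floorMod(rowKey.hashCode(), shards.size());
        return shards.get(bucket);
    }

    public static void main(String[] args) {
        HashSharding sharding = new HashSharding(
                List.of("jdbc:mysql://db0/app", "jdbc:mysql://db1/app", "jdbc:mysql://db2/app"));
        System.out.println(sharding.shardFor("user:42"));   // always routes to the same server
    }
}

Note that adding a server changes most key-to-bucket assignments, which is one reason directory-based schemes (or consistent hashing, used later by Cassandra) are often preferred.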
Other ways to scale RDBMS
• Multi-Master replication
• INSERT only, not UPDATES/DELETES
• No JOINs, thereby reducing query time
– This involves de-normalizing data
• In-memory databases
What is NoSQL?
• Stands for Not Only SQL
• Class of non-relational data storage systems
• Usually do not require a fixed table schema nor do they use
the concept of joins
• All NoSQL offerings relax one or more of the ACID properties
(will talk about the CAP theorem)
Why NoSQL?
• For data storage, an RDBMS cannot be the be-all/end-all
• Just as there are different programming languages, need to
have other data storage tools in the toolbox
• A NoSQL solution is more acceptable to a client now than
even 5 or 10 years ago
How did we get here?
• Explosion of social media sites (Facebook, Twitter) with
large data needs
• Rise of cloud-based solutions such as Amazon S3 (Simple Storage Service)
• Just as moving to dynamically-typed languages
(Ruby/Groovy), a shift to dynamically-typed data with
frequent schema changes
• Open-source community
Dynamo and BigTable
• Three major papers were the seeds of the NoSQL movement
– BigTable (Google)
– Dynamo (Amazon)
• Gossip protocol (discovery and error detection)
• Distributed key-value data store
• Eventual consistency
– CAP Theorem (discuss in a sec ..)
The Perfect Storm
• Large datasets, acceptance of alternatives, and dynamically-typed data have come together in a perfect storm
• Not a backlash/rebellion against RDBMS
• SQL is a rich query language that cannot be rivaled by the
current list of NoSQL offerings
CAP Theorem
• Three properties of a shared-data system: consistency, availability and partition tolerance
• You can have at most two of these three properties for any shared-data system
• To scale out, you have to partition. That leaves either consistency or availability to choose from
– In almost all cases, you would choose availability over consistency
The CAP Theorem

(The three corners of the CAP triangle: Consistency, Availability, Partition tolerance.)
The CAP Theorem
• Consistency: once a writer has written, all readers will see that write
Consistency
• Two kinds of consistency:
– strong consistency – ACID (Atomicity, Consistency, Isolation, Durability)
– weak consistency – BASE (Basically Available, Soft-state, Eventual consistency)

ACID Transactions
• A DBMS is expected to support “ACID
transactions,” processes that are:
– Atomic : Either the whole process is done or none
is.
– Consistent : Database constraints are preserved.
– Isolated : It appears to the user as if only one
process executes at a time.
– Durable : Effects of a process do not get lost if the
system crashes.

Atomicity
• A real-world event either happens or does not happen
– Student either registers or does not register
• Similarly, the system must ensure that either the corresponding transaction runs to completion or, if not, it has no effect at all
– Not true of ordinary programs. A crash could leave files partially updated on recovery

Commit and Abort
• If the transaction successfully completes, it is said to commit
– The system is responsible for ensuring that all changes to the database have been saved
• If the transaction does not successfully complete, it is said to abort
– The system is responsible for undoing, or rolling back, all changes the transaction has made
Database Consistency
• Enterprise (business) rules limit the occurrence of certain real-world events
– Student cannot register for a course if the current number of registrants equals the maximum allowed
• Correspondingly, allowable database states are restricted
– cur_reg <= max_reg
• These limitations are called (static) integrity constraints: assertions that must be satisfied by all database states (state invariants)
Database Consistency (state invariants)
• Other static consistency requirements are related to the fact that the database might store the same information in different ways
– cur_reg = |list_of_registered_students|
– Such limitations are also expressed as integrity constraints
• Database is consistent if all static integrity constraints are satisfied
Transaction Consistency
• A consistent database state does not necessarily model the actual state of the enterprise
– A deposit transaction that increments the balance by the wrong amount maintains the integrity constraint balance ≥ 0, but does not maintain the relation between the enterprise and database states
• A consistent transaction maintains database consistency and the correspondence between the database state and the enterprise state (implements its specification)
– Specification of deposit transaction includes balance′ = balance + amt_deposit (balance′ is the next value of balance)
Dynamic Integrity Constraints (transition invariants)
• Some constraints restrict allowable state transitions
– A transaction might transform the database from one consistent state to another, but the transition might not be permissible
– Example: A letter grade in a course (A, B, C, D, F) cannot be changed to an incomplete (I)
• Dynamic constraints cannot be checked by examining the database state

Transaction Consistency
• Consistent transaction: if DB is in consistent
state initially, when the transaction completes:
– All static integrity constraints are satisfied (but
constraints might be violated in intermediate states)
• Can be checked by examining snapshot of database
– New state satisfies specifications of transaction
• Cannot be checked from database snapshot
– No dynamic constraints have been violated
• Cannot be checked from database snapshot
Isolation
• Serial Execution: transactions execute in sequence
– Each one starts after the previous one completes.
• Execution of one transaction is not affected by the
operations of another since they do not overlap in time
– The execution of each transaction is isolated from
all others.
• If the initial database state and all transactions are
consistent, then the final database state will be
consistent and will accurately reflect the real-world
state, but
• Serial execution is inadequate from a performance
perspective

Isolation
• Concurrent execution offers performance benefits:
– A computer system has multiple resources capable of executing independently (e.g., CPUs, I/O devices), but
– A transaction typically uses only one resource at a time
– Hence, only concurrently executing transactions can make effective use of the system
– Concurrently executing transactions yield interleaved schedules
Concurrent Execution
(Figure: transaction T1 runs begin trans … op1,1 … op1,2 … commit, with local computation between its database operations; T2 similarly issues op2,1 and op2,2. The DBMS receives the interleaved sequence of operations as input, e.g. op1,1 op2,1 op2,2 op1,2.)

Durability
• The system must ensure that once a transaction commits, its effect on the database state is not lost in spite of subsequent failures
– Not true of ordinary programs. A media failure after a program successfully terminates could cause the file system to be restored to a state that preceded the program’s execution
Implementing Durability
• Database stored redundantly on mass storage
devices to protect against media failure
• Architecture of mass storage devices affects
type of media failures that can be tolerated
• Related to Availability: extent to which a
(possibly distributed) system can provide
service despite failure
• Non-stop DBMS (mirrored disks)
• Recovery based DBMS (log)
Consistency Model
• A consistency model determines rules for visibility and apparent
order of updates.
• For example:
– Row X is replicated on nodes M and N
– Client A writes row X to node N
– Some period of time t elapses.
– Client B reads row X from node M
– Does client B see the write from client A?
– Consistency is a continuum with tradeoffs
– For NoSQL, the answer would be: maybe (a toy simulation of this scenario follows below)
– CAP Theorem states: Strict Consistency can't be achieved at the
same time as availability and partition-tolerance.
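A toy simulation of the scenario above (all names and delays are illustrative; this only mimics asynchronous replication between two in-memory "nodes"):

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Toy model of eventually consistent replicas: the write reaches node M only after a delay.
public class EventualConsistencyDemo {
    static final Map<String, String> nodeN = new ConcurrentHashMap<>();
    static final Map<String, String> nodeM = new ConcurrentHashMap<>();
    static final ScheduledExecutorService replicator = Executors.newSingleThreadScheduledExecutor();

    // Client A writes to node N; replication to node M happens asynchronously after delayMs.
    static void write(String key, String value, long delayMs) {
        nodeN.put(key, value);
        replicator.schedule(() -> nodeM.put(key, value), delayMs, TimeUnit.MILLISECONDS);
    }

    public static void main(String[] args) throws InterruptedException {
        write("X", "v1", 100);
        System.out.println("B reads from M immediately: " + nodeM.get("X")); // likely null (stale)
        Thread.sleep(200);
        System.out.println("B reads from M later: " + nodeM.get("X"));       // "v1" (converged)
        replicator.shutdown();
    }
}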
Eventual Consistency
• When no updates occur for a long period of time,
eventually all updates will propagate through the
system and all the nodes will be consistent
• For a given accepted update and a given node,
eventually either the update reaches the node or the
node is removed from service
• Known as BASE (Basically Available, Soft state,
Eventual consistency), as opposed to ACID
The CAP Theorem
• Availability: the system is available during software and hardware upgrades and node failures
Availability
• Traditionally thought of as the server/process being available "five 9s" (99.999%) of the time
• However, for a large node system, at almost any point in time there’s a good chance that a node is either down or there is a network disruption among the nodes
– Want a system that is resilient in the face of network disruption
The CAP Theorem
• Partition tolerance: the system can continue to operate in the presence of network partitions
The CAP Theorem
• Theorem: you can have at most two of these three properties (consistency, availability, partition tolerance) for any shared-data system
What kinds of NoSQL
• NoSQL solutions fall into two major areas:
– Key/Value or ‘the big hash table’.
• Amazon S3 (Dynamo)
• Voldemort
• Scalaris
• Memcached (in-memory key/value store)
• Redis
– Schema-less which comes in multiple flavors, column-based,
document-based or graph-based.
• Cassandra (column-based)
• CouchDB (document-based)
• MongoDB (document-based)
• Neo4J (graph-based)
• HBase (column-based)
Key/Value
Pros:
– very fast
– very scalable
– simple model
– able to distribute horizontally

Cons:
- many data structures (objects) can't be easily modeled as key value
pairs
Schema-Less
Pros:
- Schema-less data model is richer than key/value pairs
- eventual consistency
- many are distributed
- still provide excellent performance and scalability

Cons:
- typically no ACID transactions or joins
Common Advantages
• Cheap, easy to implement (open source)
• Data are replicated to multiple nodes (therefore
identical and fault-tolerant) and can be partitioned
– Down nodes easily replaced
– No single point of failure
• Easy to distribute
• Don't require a schema
• Can scale up and down
• Relax the data consistency requirement (CAP)
What am I giving up?
• joins
• group by
• order by
• ACID transactions
• SQL as a sometimes frustrating but still powerful query
language
• easy integration with other applications that support SQL
Bigtable and HBase
(C+P)
Data Model
• A table in Bigtable is a sparse, distributed,
persistent multidimensional sorted map
• Map indexed by a row key, column key, and a
timestamp
– (row:string, column:string, time:int64) → uninterpreted byte array
• Supports lookups, inserts, deletes
– Single row transactions only

Image Source: Chang et al., OSDI 2006
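One way to picture the (row, column, timestamp) → byte[] map is as nested sorted maps; the sketch below is only an in-memory illustration of the logical model, not how Bigtable is implemented:

import java.util.Comparator;
import java.util.TreeMap;

// Logical view of Bigtable's data model: row -> column -> timestamp -> value.
// Rows and columns sort lexicographically; timestamps sort descending (newest first).
public class LogicalTable {
    private final TreeMap<String, TreeMap<String, TreeMap<Long, byte[]>>> rows = new TreeMap<>();

    public void put(String row, String column, long timestamp, byte[] value) {
        rows.computeIfAbsent(row, r -> new TreeMap<>())
            .computeIfAbsent(column, c -> new TreeMap<Long, byte[]>(Comparator.reverseOrder()))
            .put(timestamp, value);
    }

    // Return the most recent value for (row, column), or null if absent.
    public byte[] get(String row, String column) {
        TreeMap<String, TreeMap<Long, byte[]>> cols = rows.get(row);
        if (cols == null) return null;
        TreeMap<Long, byte[]> versions = cols.get(column);
        return (versions == null || versions.isEmpty()) ? null : versions.firstEntry().getValue();
    }
}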


Rows and Columns
• Rows maintained in sorted lexicographic order
– Applications can exploit this property for efficient row
scans
– Row ranges dynamically partitioned into tablets
• Columns grouped into column families
– Column key = family:qualifier
– Column families provide locality hints
– Unbounded number of columns
Bigtable Building Blocks
• GFS
• Chubby
• SSTable
SSTable
• Basic building block of Bigtable
• Persistent, ordered, immutable map from keys to values
– Stored in GFS
• Sequence of blocks on disk plus an index for block lookup
– Can be completely mapped into memory
• Supported operations:
– Look up value associated with key
– Iterate key/value pairs within a key range

(Figure: an SSTable as a sequence of 64K blocks plus an index. Source: graphic from slides by Erik Paulson)
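A simplified, in-memory stand-in for the SSTable interface (no GFS, no block encoding; the class and method names are illustrative) showing the two supported operations:

import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

// Simplified stand-in for an SSTable: an immutable, ordered map from keys to values.
// A real SSTable is a sequence of on-disk blocks plus a block index, stored in GFS.
public class SimpleSSTable {
    private final TreeMap<String, byte[]> entries;

    public SimpleSSTable(SortedMap<String, byte[]> sortedEntries) {
        this.entries = new TreeMap<>(sortedEntries); // copied once; never mutated afterwards
    }

    // Operation 1: look up the value associated with a key.
    public byte[] lookup(String key) {
        return entries.get(key);
    }

    // Operation 2: iterate key/value pairs within a key range [fromKey, toKey).
    public Iterable<Map.Entry<String, byte[]>> scan(String fromKey, String toKey) {
        return entries.subMap(fromKey, toKey).entrySet();
    }
}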


Tablet
• Dynamically partitioned range of rows
• Built from multiple SSTables

(Figure: a tablet covering the row range Start: aardvark – End: apple, built from two SSTables, each a sequence of 64K blocks plus an index. Source: graphic from slides by Erik Paulson)


Table
• Multiple tablets make up the table
• SSTables can be shared

(Figure: two tablets, aardvark–apple and apple_two_E–boat, built from SSTables, with one SSTable shared between the two tablets. Source: graphic from slides by Erik Paulson)


Architecture
• Client library
• Single master server
• Tablet servers
Bigtable Master
• Assigns tablets to tablet servers
• Detects addition and expiration of tablet
servers
• Balances tablet server load
• Handles garbage collection
• Handles schema changes
Bigtable Tablet Servers
• Each tablet server manages a set of tablets
– Typically between ten and a thousand tablets
– Each 100-200 MB by default
• Handles read and write requests to the tablets
• Splits tablets that have grown too large
Tablet Location
• Upon discovery, clients cache tablet locations

(Image source: Chang et al., OSDI 2006)
Tablet Assignment
• Master keeps track of:
– Set of live tablet servers
– Assignment of tablets to tablet servers
– Unassigned tablets
• Each tablet is assigned to one tablet server at a time
– Tablet server maintains an exclusive lock on a file in Chubby
– Master monitors tablet servers and handles assignment
• Changes to tablet structure
– Table creation/deletion (master initiated)
– Tablet merging (master initiated)
– Tablet splitting (tablet server initiated)
Tablet Serving

“Log Structured Merge Trees”

Image Source: Chang et al., OSDI 2006


Compactions
• Minor compaction
– Converts the memtable into an SSTable
– Reduces memory usage and log traffic on restart
• Merging compaction
– Reads the contents of a few SSTables and the memtable, and writes out a new SSTable (sketched below)
– Reduces number of SSTables
• Major compaction
– Merging compaction that results in only one SSTable
– No deletion records, only live data
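A minimal sketch of the merging-compaction idea, with TreeMaps standing in for the memtable and SSTables and last-writer-wins resolution assumed (this is an illustration, not Bigtable's actual code):

import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;

// Sketch of a merging compaction: read a few sorted inputs (SSTables + memtable),
// write out a single new sorted output. Newer inputs take precedence for duplicate keys.
public class MergingCompaction {
    public static NavigableMap<String, byte[]> compact(List<NavigableMap<String, byte[]>> inputsOldestFirst) {
        NavigableMap<String, byte[]> merged = new TreeMap<>();
        for (NavigableMap<String, byte[]> input : inputsOldestFirst) {
            merged.putAll(input); // later (newer) inputs overwrite older versions of the same key
        }
        return merged; // in Bigtable this would be written out as one new SSTable
    }
}

A major compaction would additionally drop deletion markers so that only live data remains in the single output SSTable.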
Bigtable Applications
• Data source and data sink for MapReduce
• Google’s web crawl
• Google Earth
• Google Analytics
Lessons Learned
• Fault tolerance is hard
• Don’t add functionality before understanding
its use
– Single-row transactions appear to be sufficient
• Keep it simple!
HBase is an open-source,
distributed, column-oriented
database built on top of HDFS
based on BigTable!
HBase is ..
• A distributed data store that can scale horizontally to
1,000s of commodity servers and petabytes of
indexed storage.
• Designed to operate on top of the Hadoop
distributed file system (HDFS) or Kosmos File System
(KFS, aka Cloudstore) for scalability, fault tolerance,
and high availability.
Benefits
• Distributed storage
• Table-like in data structure
– multi-dimensional map
• High scalability
• High availability
• High performance
HBase Is Not …
• Tables have one primary index, the row key.
• No join operators.
• Scans and queries can select a subset of available
columns, perhaps by using a wildcard.
• There are three types of lookups:
– Fast lookup using row key and optional timestamp.
– Full table scan
– Range scan from region start to end.
HBase Is Not …(2)
• Limited atomicity and transaction support.
– HBase supports multiple batched mutations of
single rows only.
– Data is unstructured and untyped.
• Not accessed or manipulated via SQL.
– Programmatic access via Java, REST, or Thrift APIs.
– Scripting via JRuby.
Why Bigtable?
• RDBMS performance is good for transaction processing, but for very large scale analytic processing the solutions are commercial, expensive, and specialized.
• Very large scale analytic processing
– Big queries – typically range or table scans.
– Big databases (100s of TB)
Why Bigtable? (2)
• MapReduce on Bigtable, optionally with Cascading on top to support some relational algebra, may be a cost-effective solution.
• Sharding is not a solution to scale open source RDBMS platforms
– Application specific
– Labor intensive (re)partitioning
Why HBase ?
• HBase is a Bigtable clone.
• It is open source
• It has a good community and promise for the
future
• It is developed on top of and has good
integration for the Hadoop platform, if you are
using Hadoop already.
• It has a Cascading connector.
HBase benefits over RDBMS
• No real indexes
• Automatic partitioning
• Scale linearly and automatically with new
nodes
• Commodity hardware
• Fault tolerance
• Batch processing
HBase

HBase: Overview
• HBase is a distributed column-oriented data store built on top of HDFS
• HBase is an Apache open source project whose goal is to provide storage for Hadoop distributed computing
• Data is logically organized into tables, rows and columns
HBase: Part of Hadoop’s Ecosystem
• HBase is built on top of HDFS
• HBase files are internally stored in HDFS
HBase vs. HDFS
• Both are distributed systems that scale to hundreds or thousands of nodes
• HDFS is good for batch processing (scans over big files)
– Not good for record lookup
– Not good for incremental addition of small batches
– Not good for updates
HBase vs. HDFS (Cont’d)
• HBase is designed to efficiently address the above points
– Fast record lookup
– Support for record-level insertion
– Support for updates (not in place)
• HBase updates are done by creating new versions of values
HBase vs. HDFS (Cont’d)
• If your application has neither random reads nor writes → stick to HDFS
HBase Data Model

HBase Data Model
• HBase is based on Google’s Bigtable model
– Key-Value pairs
• Column Family
• Row key
• TimeStamp
• Value
HBase Logical View

HBase: Keys and Column Families
• Each record is divided into Column Families
• Each row has a Key
• Each column family consists of one or more Columns
Example: a table with two column families, “contents” and “anchor”

Row key          Time Stamp   Column “contents:”   Column “anchor:”
com.apache.www   t12          <html> …
                 t11          <html> …
                 t10                               anchor:apache.com = “APACHE”
com.cnn.www      t15                               anchor:cnnsi.com = “CNN”
                 t13                               anchor:my.look.ca = “CNN.com”
                 t6           <html> …
                 t5           <html> …
                 t3           <html> …

• Row key
– Byte array
– Serves as the primary key for the table
– Indexed for fast lookup
• Column Family
– Has a name (string)
– Contains one or more related columns
• Column
– Belongs to one column family
– Included inside the row: familyName:columnName (e.g. the column named “anchor:apache.com”)
Version number for each row
• Version Number
– Unique within each key
– By default → system’s timestamp
– Data type is Long
• Value (Cell)
– Byte array

(The table is the same webtable example as above; the timestamps t3–t15 serve as the version numbers of the cell values.)
Notes on Data Model
• HBase schema consists of several Tables
• Each table consists of a set of Column Families
– Columns are not part of the schema
• HBase has Dynamic Columns
– Because column names are encoded inside the cells
– Different cells can have different columns

(Figure: a “Roles” column family that has different columns in different cells.)
Notes on Data Model (Cont’d)
• The version number can be user-supplied
– It does not even have to be inserted in increasing order
– Version numbers are unique within each key
• Table can be very sparse
– In the example, the table has only two anchor columns (cnnsi.com & my.look.ca) and many cells are empty
• Keys are indexed as the primary key


HBase Physical Model

HBase Physical Model
• Each column family is stored in a separate file (called HTables)
– Key & Version numbers are replicated with each column family
– Empty cells are not stored
• HBase maintains a multi-level index on values: <key, column family, column name, timestamp>
Example

Column Families

HBase Regions
• Each HTable (column family) is partitioned horizontally into regions
– Regions are counterpart to HDFS blocks
(Figure: each horizontal partition of the table becomes one region.)
HBase Architecture

Three Major Components
• The HBaseMaster
– One master
• The HRegionServer
– Many region servers
• The HBase client
HBase Components
• Region
– A subset of a table’s rows, like horizontal range partitioning
– Automatically done
• RegionServer (many slaves)
– Manages data regions
– Serves data for reads and writes (using a log)
• Master
– Responsible for coordinating the slaves
– Assigns regions, detects failures
– Admin functions
Big Picture

ZooKeeper
• HBase depends on ZooKeeper
• By default HBase manages the ZooKeeper instance
– E.g., starts and stops ZooKeeper
• HMaster and HRegionServers register themselves with ZooKeeper
Creating a Table

// Uses the older (pre-0.96) HBase admin API shown in the original slide; newer clients
// use Admin and TableDescriptorBuilder instead. Assumes `config` is an HBase Configuration
// (e.g. from HBaseConfiguration.create()). Needed imports: org.apache.hadoop.hbase.HColumnDescriptor,
// org.apache.hadoop.hbase.HTableDescriptor, org.apache.hadoop.hbase.client.HBaseAdmin,
// org.apache.hadoop.hbase.util.Bytes.
HBaseAdmin admin = new HBaseAdmin(config);

HColumnDescriptor[] column = new HColumnDescriptor[2];
column[0] = new HColumnDescriptor("columnFamily1:");
column[1] = new HColumnDescriptor("columnFamily2:");

HTableDescriptor desc = new HTableDescriptor(Bytes.toBytes("MyTable"));
desc.addFamily(column[0]);   // declare both column families up front
desc.addFamily(column[1]);
admin.createTable(desc);
Operations On Regions: Get()
• Given a key → return the corresponding record
• For each value, return the highest version
– Can control the number of versions you want
Operations On Regions: Scan()

Get()
Select value from table where key=‘com.apache.www’ AND label=‘anchor:apache.com’

Row key            Time Stamp   Column “anchor:”
“com.apache.www”   t12
                   t11
                   t10          “anchor:apache.com” = “APACHE”
“com.cnn.www”      t9           “anchor:cnnsi.com” = “CNN”
                   t8           “anchor:my.look.ca” = “CNN.com”
                   t6
                   t5
                   t3

The Get matches the single cell (“com.apache.www”, “anchor:apache.com”) and returns “APACHE”.
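A rough sketch of this lookup with the HBase Java client (the table name "webtable" and the configuration setup are assumptions, not from the slides; method names vary slightly across HBase versions):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class GetExample {
    public static void main(String[] args) throws Exception {
        Configuration config = HBaseConfiguration.create();
        HTable table = new HTable(config, "webtable");   // table name is illustrative

        Get get = new Get(Bytes.toBytes("com.apache.www"));                  // row key
        get.addColumn(Bytes.toBytes("anchor"), Bytes.toBytes("apache.com")); // family, qualifier

        Result result = table.get(get);
        byte[] value = result.getValue(Bytes.toBytes("anchor"), Bytes.toBytes("apache.com"));
        System.out.println(Bytes.toString(value));        // expected: APACHE
        table.close();
    }
}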
Scan()
Select value from table where anchor=‘cnnsi.com’

(Same table as above: the scan walks the rows, and the “anchor:cnnsi.com” column matches only in row “com.cnn.www”, where the value is “CNN”.)
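A corresponding sketch of the scan with the HBase Java client (again, the table name and setup are assumptions rather than part of the original slides):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class ScanExample {
    public static void main(String[] args) throws Exception {
        Configuration config = HBaseConfiguration.create();
        HTable table = new HTable(config, "webtable");    // table name is illustrative

        Scan scan = new Scan();
        scan.addColumn(Bytes.toBytes("anchor"), Bytes.toBytes("cnnsi.com")); // restrict to one column

        ResultScanner scanner = table.getScanner(scan);
        for (Result row : scanner) {
            byte[] value = row.getValue(Bytes.toBytes("anchor"), Bytes.toBytes("cnnsi.com"));
            if (value != null) {
                // expected output: com.cnn.www -> CNN
                System.out.println(Bytes.toString(row.getRow()) + " -> " + Bytes.toString(value));
            }
        }
        scanner.close();
        table.close();
    }
}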
Operations On Regions: Put()
• Insert a new record (with a new key), or insert a record for an existing key
– Implicit version number (timestamp)
– Explicit version number
Operations On Regions: Delete()
• Marking table cells as deleted
• Multiple levels
– Can mark an entire column family as deleted
– Can mark all column families of a given row as deleted
• All operations are logged by the RegionServers
• The log is flushed periodically
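A sketch of Put and Delete with the same client API (row key, family and values are illustrative; newer HBase versions rename some of these methods, e.g. addColumn/addFamily):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Delete;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class PutDeleteExample {
    public static void main(String[] args) throws Exception {
        Configuration config = HBaseConfiguration.create();
        HTable table = new HTable(config, "webtable");     // table name is illustrative

        // Put: insert a record for a new or existing key; omitting the timestamp
        // gives an implicit version number, passing one sets it explicitly.
        Put put = new Put(Bytes.toBytes("com.example.www"));
        put.add(Bytes.toBytes("anchor"), Bytes.toBytes("example.com"), Bytes.toBytes("EXAMPLE"));
        table.put(put);

        // Delete: mark cells as deleted; can target a single column, a whole
        // column family, or (with no family specified) the entire row.
        Delete delete = new Delete(Bytes.toBytes("com.example.www"));
        delete.deleteFamily(Bytes.toBytes("anchor"));       // mark the entire family as deleted
        table.delete(delete);

        table.close();
    }
}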
HBase: Joins
• HBase does not support joins
• Joins can be done in the application layer
– Using scan() and get() operations
Altering a Table
• Disable the table before changing the schema
Logging Operations

HBase Deployment
(Figure: one master node and multiple slave nodes.)
HBase vs. HDFS

HBase vs. RDBMS

When to use HBase

Cassandra
Structured Storage System over a P2P Network
Why Cassandra?
• Lots of data
– Copies of messages, reverse indices of messages,
per user data.
• Many incoming requests resulting in a lot of
random reads and random writes.
• No existing production ready solutions in the
market meet these requirements.
Design Goals
• High availability
• Eventual consistency
– trade-off strong consistency in favor of high availability
• Incremental scalability
• Optimistic Replication
• “Knobs” to tune tradeoffs between consistency,
durability and latency
• Low total cost of ownership
• Minimal administration
innovation at scale
• google bigtable (2006)
– consistency model: strong
– data model: sparse map
– clones: hbase, hypertable
• amazon dynamo (2007)
– O(1) dht
– consistency model: client tune-able
– clones: riak, voldemort

cassandra ~= bigtable + dynamo


proven
• Facebook stores 150 TB of data on 150 nodes

web 2.0
• used at Twitter, Rackspace, Mahalo, Reddit,
Cloudkick, Cisco, Digg, SimpleGeo, Ooyala, OpenX,
others
Data Model
• Column families are declared upfront; columns are added and modified dynamically
• ColumnFamily1 – Name: MailList, Type: Simple, Sort: Name
– Each row KEY holds columns (Name: tid1 … tid4), each with Value: <Binary> and TimeStamp: t1 … t4
• ColumnFamily2 – Name: WordList, Type: Super, Sort: Time
– SuperColumns (e.g. “aloha”, “dude”) are added and modified dynamically; each holds columns (C1 … C6) with values (V1 … V6) and timestamps (T1 … T6)
• ColumnFamily3 – Name: System, Type: Super, Sort: Name
– SuperColumns (Name: hint1 … hint4), each holding a <Column List>

Write Operations
• A client issues a write request to a random
node in the Cassandra cluster.
• The “Partitioner” determines the nodes
responsible for the data.
• Locally, write operations are logged and then
applied to an in-memory version.
• Commit log is stored on a dedicated disk local
to the machine.
Write cont’d
• Each write (key plus column families CF1, CF2, CF3, binary serialized) is appended to the commit log on a dedicated disk and then applied to a per-column-family Memtable
• A Memtable is flushed to a data file on disk when thresholds are exceeded: data size, number of objects, lifetime
• Data file layout: <key name><size of key data><index of columns/supercolumns><serialized column family>, plus a block index (<key name> → offset entries such as K128, K256, K384) and a Bloom filter; the index is kept in memory
Compactions
• Data files are periodically merge-sorted: several sorted files (e.g. K1, K2, K3 …; K2, K10, K30 …; K4, K5, K10 …) are combined into one new sorted data file (K1, K2, K3, K4, K5, K10, K30 …)
• Entries marked DELETED are dropped during the merge
• For the new data file, an index file (key → offset, e.g. K1, K5, K30) and a Bloom filter are loaded in memory
Write Properties
• No locks in the critical path
• Sequential disk access
• Behaves like a write-back cache
• Append support without read ahead
• Atomicity guarantee for a key
• “Always Writable”
– accept writes during failure scenarios
Read
• The client sends a query to the Cassandra cluster and gets back the result
• The cluster sends the full read to the closest replica (Replica A) and digest queries to the other replicas (Replica B and Replica C)
• If the digest responses differ from the data response, a read repair reconciles the replicas
Partitioning And Replication
• Nodes (A, B, D, E, F, …) are placed at positions on a consistent hashing ring (0 to 1)
• A key is hashed onto the ring (h(key1), h(key2)) and stored on the first node encountered walking the ring from that position
• With replication factor N=3, the key is also replicated on the next N-1 nodes along the ring
(A minimal sketch of such a ring follows.)
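A minimal consistent-hashing sketch in Java (class and method names are mine, not Cassandra's; the real partitioner uses tokens and virtual nodes and is considerably more involved):

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.ArrayList;
import java.util.List;
import java.util.SortedMap;
import java.util.TreeMap;

// Minimal consistent-hashing ring with N-way replication (illustrative only).
public class HashRing {
    private final TreeMap<Long, String> ring = new TreeMap<>(); // position on ring -> node
    private final int replicationFactor;

    public HashRing(List<String> nodes, int replicationFactor) throws Exception {
        this.replicationFactor = replicationFactor;
        for (String node : nodes) ring.put(hash(node), node);
    }

    // Walk clockwise from the key's position and collect N distinct nodes.
    public List<String> replicasFor(String key) throws Exception {
        List<String> replicas = new ArrayList<>();
        SortedMap<Long, String> tail = ring.tailMap(hash(key));
        for (String node : tail.values()) {
            if (replicas.size() == replicationFactor) break;
            replicas.add(node);
        }
        for (String node : ring.values()) {              // wrap around the ring if needed
            if (replicas.size() == replicationFactor) break;
            if (!replicas.contains(node)) replicas.add(node);
        }
        return replicas;
    }

    private static long hash(String s) throws Exception {
        byte[] d = MessageDigest.getInstance("MD5").digest(s.getBytes(StandardCharsets.UTF_8));
        long h = 0;
        for (int i = 0; i < 8; i++) h = (h << 8) | (d[i] & 0xff);
        return h;
    }
}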
Cluster Membership and Failure
Detection
• Gossip protocol is used for cluster membership.
• Super lightweight with mathematically provable properties.
• State disseminated in O(logN) rounds where N is the number of nodes in
the cluster.
• Every T seconds each member increments its heartbeat counter and selects one other member to send its list to.
• A member merges the received list with its own list (a bare-bones sketch follows).
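A bare-bones sketch of that heartbeat/merge exchange (illustrative only; real gossip also carries generation numbers, application state, and drives the failure detector):

import java.util.HashMap;
import java.util.Map;

// Bare-bones gossip state: each member keeps a heartbeat counter per known node
// and periodically merges another member's list into its own, keeping the larger counters.
public class GossipState {
    private final String self;
    private final Map<String, Long> heartbeats = new HashMap<>();

    public GossipState(String self) {
        this.self = self;
        heartbeats.put(self, 0L);
    }

    // Called every T seconds: bump our own heartbeat before gossiping to a random member.
    public Map<String, Long> tickAndSnapshot() {
        heartbeats.merge(self, 1L, Long::sum);
        return new HashMap<>(heartbeats);
    }

    // Merge a received list with our own, taking the freshest (largest) counter per node.
    public void merge(Map<String, Long> remote) {
        remote.forEach((node, hb) -> heartbeats.merge(node, hb, Math::max));
    }
}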
Accrual Failure Detector
• Valuable for system management, replication, load balancing etc.
• Defined as a failure detector that outputs a value, PHI, associated with
each process.
• Also known as Adaptive Failure detectors - designed to adapt to changing
network conditions.
• The value output, PHI, represents a suspicion level.
• Applications set an appropriate threshold, trigger suspicions and perform
appropriate actions.
• In Cassandra the average time taken to detect a failure is 10-15 seconds
with the PHI threshold set at 5.
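A rough sketch of how such a PHI value can be computed from observed heartbeat inter-arrival times, using the common exponential-distribution approximation (Cassandra's actual detector differs in its details):

import java.util.ArrayDeque;
import java.util.Deque;

// Rough sketch of an accrual failure detector: PHI grows the longer a heartbeat is overdue,
// scaled by the mean inter-arrival time observed so far (exponential approximation).
public class PhiAccrualDetector {
    private final Deque<Long> intervalsMs = new ArrayDeque<>();
    private long lastHeartbeatMs = -1;

    public void heartbeat(long nowMs) {
        if (lastHeartbeatMs >= 0) {
            intervalsMs.addLast(nowMs - lastHeartbeatMs);
            if (intervalsMs.size() > 1000) intervalsMs.removeFirst(); // sliding window
        }
        lastHeartbeatMs = nowMs;
    }

    // phi = -log10( P(next heartbeat arrives later than now) ) under an exponential model.
    public double phi(long nowMs) {
        if (intervalsMs.isEmpty()) return 0.0;
        double meanMs = intervalsMs.stream().mapToLong(Long::longValue).average().orElse(1.0);
        double elapsed = nowMs - lastHeartbeatMs;
        return (elapsed / meanMs) * Math.log10(Math.E); // -log10(exp(-elapsed/mean))
    }
}

An application would treat a node as suspect once phi crosses the configured threshold (5 in the slides above).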
Information Flow in the
Implementation
Performance Benchmark
• Loading of data – limited by network bandwidth.
• Read performance for Inbox Search in production:

           Search Interactions   Term Search
  Min      7.69 ms               7.78 ms
  Median   15.69 ms              18.27 ms
  Average  26.13 ms              44.41 ms
MySQL Comparison
• MySQL, > 50 GB of data
– Writes average: ~300 ms
– Reads average: ~350 ms
• Cassandra, > 50 GB of data
– Writes average: 0.12 ms
– Reads average: 15 ms
Lessons Learnt
• Add fancy features only when absolutely
required.
• Many types of failures are possible.
• Big systems need proper systems-level
monitoring.
• Value simple designs
Future work
• Atomicity guarantees across multiple keys
• Analysis support via Map/Reduce
• Distributed transactions
• Compression support
• Granular security via ACLs
