Week-4 Lecture Notes

Data Placement Strategies

Data Placement Strategies
Replication Strategy:
1. SimpleStrategy
2. NetworkTopologyStrategy

1. SimpleStrategy: uses the Partitioner, of which there are two kinds
• RandomPartitioner: Chord-like hash partitioning
• ByteOrderedPartitioner: assigns ranges of keys to servers
  • Easier for range queries (e.g., get me all Twitter users starting with [a-b])
2. NetworkTopologyStrategy: for multi-DC deployments
• Two or three replicas per DC
• Per DC:
  • First replica placed according to the Partitioner
  • Then go clockwise around the ring until you hit a different rack
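To make the placement rule concrete, here is a minimal Python sketch of ring-based replica placement in the spirit of this slide. The hash function, token values, server names and rack labels are illustrative assumptions, not Cassandra's actual implementation.

import hashlib
from bisect import bisect_right

# Illustrative ring: (token, (server, rack)); tokens and racks are made up.
RING = sorted([
    (10, ("s1", "rack1")), (35, ("s2", "rack2")),
    (60, ("s3", "rack1")), (85, ("s4", "rack2")),
])

def token(key):
    # Hash partitioning in the spirit of RandomPartitioner (toy 0..99 token space).
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % 100

def place_replicas(key, rf=2):
    # First replica goes to the first node clockwise from the key's token;
    # further replicas go to the next nodes clockwise that sit on a different rack.
    tokens = [t for t, _ in RING]
    start = bisect_right(tokens, token(key)) % len(RING)
    chosen = [RING[start][1]]
    for step in range(1, len(RING)):
        if len(chosen) >= rf:
            break
        server, rack = RING[(start + step) % len(RING)][1]
        if all(rack != r for _, r in chosen):
            chosen.append((server, rack))
    return chosen

print(place_replicas("tweet:42"))   # e.g., two replicas on different racks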
Snitches
Maps IPs to racks and DCs. Configured in the cassandra.yaml config file.
Some options:
SimpleSnitch: unaware of topology (rack-unaware)
RackInferringSnitch: assumes the topology of the network from the octets of the server's IP address
• 101.102.103.104 = x.<DC octet>.<rack octet>.<node octet>
PropertyFileSnitch: uses a config file
EC2Snitch: uses EC2
• EC2 region = DC
• Availability zone = rack
Other snitch options available
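The octet convention above is easy to illustrate in Python. The helper below simply treats the second and third octets as the DC and rack; it is a sketch of the idea, not Cassandra's snitch code.

def infer_location(ip):
    """Infer (datacenter, rack) from an IPv4 address, RackInferringSnitch-style:
    x.<DC octet>.<rack octet>.<node octet>."""
    octets = ip.split(".")
    if len(octets) != 4:
        raise ValueError("expected a dotted-quad IPv4 address")
    return {"dc": octets[1], "rack": octets[2], "node": octets[3]}

print(infer_location("101.102.103.104"))  # {'dc': '102', 'rack': '103', 'node': '104'}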


Writes
Need to be lock-free and fast (no reads or disk seeks)
Client sends the write to one coordinator node in the Cassandra cluster
• Coordinator may be per-key, per-client, or per-query
• A per-key coordinator ensures writes for the key are serialized
Coordinator uses the Partitioner to send the query to all replica nodes responsible for the key
When X replicas respond, the coordinator returns an acknowledgement to the client
X? (We will check what X is later.)


Writes (2)
Always writable: Hinted Handoff mechanism
• If any replica is down, the coordinator writes to all other replicas, and keeps the write locally until the down replica comes back up.
• When all replicas are down, the coordinator (front end) buffers writes (for up to a few hours).
One ring per datacenter
• A per-DC coordinator is elected to coordinate with other DCs
• Election is done via Zookeeper, which runs a Paxos (consensus) variant
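A minimal Python sketch of the hinted-handoff idea described above; the in-memory hint store and replay step are illustrative simplifications of what Cassandra actually does.

from collections import defaultdict

class Coordinator:
    """Toy coordinator: writes go to live replicas; hints are kept for down ones."""
    def __init__(self, replicas):
        self.replicas = replicas            # name -> dict acting as the replica's store
        self.down = set()                   # names of replicas currently unreachable
        self.hints = defaultdict(list)      # replica name -> [(key, value), ...]

    def write(self, key, value):
        for name, store in self.replicas.items():
            if name in self.down:
                self.hints[name].append((key, value))   # keep the write locally
            else:
                store[key] = value

    def replica_recovered(self, name):
        self.down.discard(name)
        for key, value in self.hints.pop(name, []):     # replay hinted writes
            self.replicas[name][key] = value

c = Coordinator({"r1": {}, "r2": {}, "r3": {}})
c.down.add("r3")
c.write("user:1", "alice")
c.replica_recovered("r3")   # r3 now also holds user:1 -> alice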


Writes at a replica node
On receiving a write:
1. Log it in the disk commit log (for failure recovery)
2. Make changes to the appropriate memtables
• Memtable = in-memory representation of multiple key-value pairs
• Typically an append-only data structure (fast)
• A cache that can be searched by key
• Write-back, as opposed to write-through
Later, when the memtable is full or old, flush it to disk:
• Data file: an SSTable (Sorted String Table) – a list of key-value pairs, sorted by key
• SSTables are immutable (once created, they don't change)
• Index file: an SSTable of (key, position in data SSTable) pairs
• And a Bloom filter (for efficient search)
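The memtable-to-SSTable flow can be sketched in a few lines of Python. The on-disk format below (one sorted key-value pair per line plus a separate index of byte offsets) is a deliberately simplified stand-in for Cassandra's real file formats.

import json

class Memtable:
    """In-memory, write-back store for recent writes, searchable by key."""
    def __init__(self):
        self.data = {}

    def put(self, key, value):
        self.data[key] = value              # updated in memory only (no disk seek)

    def flush(self, path):
        """Write a sorted, immutable 'SSTable' plus a (key -> offset) index file."""
        index = {}
        with open(path, "w") as f:
            for key in sorted(self.data):
                index[key] = f.tell()       # position of this row in the data file
                f.write(json.dumps({key: self.data[key]}) + "\n")
        with open(path + ".index", "w") as f:
            json.dump(index, f)
        self.data.clear()                   # memtable starts fresh after the flush

mt = Memtable()
mt.put("banana", 2); mt.put("apple", 1)
mt.flush("sstable-1.json")                  # rows land on disk sorted by key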
Bloom Filter
Compact way of representing a set of items
Checking for existence in the set is cheap
Some probability of false positives: an item not in the set may check true as being in the set
Never false negatives
[Figure: a key K is run through k hash functions (Hash1 ... Hashk), each selecting a bit in a large bit map. On insert, set all hashed bits. On check-if-present, return true if all hashed bits are set.]
False positive rate is low, e.g., with m = 4 hash functions, 100 items, and 3200 bits, the FP rate = 0.02%
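A runnable Bloom-filter sketch matching the description above. The double-hashing trick (deriving the k positions from two hashes) is a common implementation choice and an assumption here, not something mandated by Cassandra.

import hashlib

class BloomFilter:
    def __init__(self, m_bits=3200, k_hashes=4):
        self.m, self.k = m_bits, k_hashes
        self.bits = bytearray(m_bits)       # one byte per bit, for simplicity

    def _positions(self, item):
        # Double hashing: positions h1 + i*h2 mod m, a standard way to get k hashes.
        h1 = int(hashlib.md5(item.encode()).hexdigest(), 16)
        h2 = int(hashlib.sha1(item.encode()).hexdigest(), 16)
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def insert(self, item):
        for p in self._positions(item):     # on insert, set all hashed bits
            self.bits[p] = 1

    def might_contain(self, item):
        # True if all hashed bits are set: maybe present (or a false positive);
        # False means definitely not present (no false negatives).
        return all(self.bits[p] for p in self._positions(item))

bf = BloomFilter()
for key in ("apple", "banana"):
    bf.insert(key)
print(bf.might_contain("apple"), bf.might_contain("cherry"))  # True, (almost surely) False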
Compaction
Data updates accumulate over time, and SSTables and logs need to be compacted
The process of compaction merges SSTables, i.e., by merging the updates for a key
Run periodically and locally at each server


Deletes
Delete: don't delete the item right away
Add a tombstone to the log
Eventually, when compaction encounters the tombstone, it will delete the item
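The following Python sketch ties the last two slides together: it merges SSTables (represented as dicts of key -> (timestamp, value)), keeps the latest update per key, and drops keys whose latest update is a tombstone. The data layout is an illustrative assumption.

TOMBSTONE = object()   # marker meaning "this key was deleted"

def compact(*sstables):
    """Merge SSTables, keeping only the latest (timestamp, value) per key and
    discarding keys whose latest value is a tombstone."""
    merged = {}
    for table in sstables:
        for key, (ts, value) in table.items():
            if key not in merged or ts > merged[key][0]:
                merged[key] = (ts, value)
    return {k: v for k, v in merged.items() if v[1] is not TOMBSTONE}

old = {"user:1": (10, "alice"), "user:2": (11, "bob")}
new = {"user:1": (20, TOMBSTONE), "user:3": (21, "carol")}   # user:1 was deleted later
print(compact(old, new))   # {'user:2': (11, 'bob'), 'user:3': (21, 'carol')}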


Reads
Read: similar to writes, except
Coordinator can contact X replicas (e.g., in the same rack)
• Coordinator sends the read to the replicas that have responded quickest in the past
• When X replicas respond, the coordinator returns the latest-timestamped value from among those X
• (X? We will check it later.)
Coordinator also fetches the value from the other replicas
• Checks consistency in the background, initiating a read repair if any two values are different
• This mechanism seeks to eventually bring all replicas up to date
At a replica
• A row may be split across multiple SSTables => reads need to touch multiple SSTables => reads slower than writes (but still fast)
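A small Python sketch of the read path just described: take the latest-timestamped value among the first X responders, then repair stale replicas in the background. Replica stores are plain dicts here; this is an illustration, not Cassandra's code.

def coordinator_read(key, replicas, x):
    """replicas: list of dicts mapping key -> (timestamp, value).
    Return the freshest value among the first x replicas, then read-repair the rest."""
    responses = [r.get(key, (0, None)) for r in replicas[:x]]
    latest_ts, latest_val = max(responses)            # latest-timestamped value wins
    # Background read repair: bring every replica up to the latest version seen.
    for r in replicas:
        if r.get(key, (0, None))[0] < latest_ts:
            r[key] = (latest_ts, latest_val)
    return latest_val

r1 = {"k": (5, "old")}; r2 = {"k": (9, "new")}; r3 = {}
print(coordinator_read("k", [r1, r2, r3], x=2))        # 'new'; r1 and r3 get repaired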


Membership
Any server in the cluster could be the coordinator
So every server needs to maintain a list of all the other servers that are currently in the cluster
The list needs to be updated automatically as servers join, leave, and fail


Cluster Membership – Gossip-Style
Cassandra uses gossip-based cluster membership.
[Figure: each node keeps a membership list of (Address, Heartbeat Counter, Time (local)) entries; node 1's list is gossiped to node 2, which merges it with its own list at local time 70 (clocks are asynchronous).]
Protocol:
• Nodes periodically gossip their membership list
• On receipt, the local membership list is updated, as shown
• If any heartbeat is older than Tfail, the node is marked as failed
(Remember this?)
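A compact Python sketch of the gossip merge step: for each address, keep the entry with the higher heartbeat counter and stamp it with the local receive time. Tfail and the sample values are illustrative.

import time

T_FAIL = 30.0   # seconds without a newer heartbeat before a node is marked failed

def merge_membership(local, received, now=None):
    """local/received: dict addr -> (heartbeat_counter, last_update_local_time)."""
    now = time.time() if now is None else now
    for addr, (hb, _) in received.items():
        if addr not in local or hb > local[addr][0]:
            local[addr] = (hb, now)          # newer heartbeat: refresh local timestamp
    return local

def failed_nodes(local, now=None):
    now = time.time() if now is None else now
    return [a for a, (_, t) in local.items() if now - t > T_FAIL]

node2_list = {"10.0.0.1": (10118, 64), "10.0.0.3": (10090, 58)}
node1_view = {"10.0.0.1": (10120, 66), "10.0.0.3": (10098, 63)}
print(merge_membership(node2_list, node1_view, now=70))   # both entries refreshed at time 70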
Suspicion Mechanisms in Cassandra
Suspicion mechanisms adaptively set the timeout based on the underlying network and failure behavior.
Accrual detector: the failure detector outputs a value (PHI) representing suspicion.
Applications set an appropriate threshold.
PHI calculation for a member uses the inter-arrival times of gossip messages:
PHI(t) = -log10( P(t_now - t_last) )
where P is derived from the CDF of the historical heartbeat inter-arrival times.
PHI basically determines the detection timeout, but takes into account historical inter-arrival time variations for gossiped heartbeats.
In practice, PHI = 5 => 10-15 s detection time.
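A minimal phi-accrual sketch in Python. It assumes exponentially distributed heartbeat inter-arrival times (so the probability of the next heartbeat arriving even later is e^(-t/mean)), which is a simplification of the model used in the phi-accrual paper and in Cassandra.

import math

class PhiAccrualDetector:
    def __init__(self, threshold=5.0):
        self.threshold = threshold
        self.intervals = []          # observed heartbeat inter-arrival times
        self.last_heartbeat = None

    def heartbeat(self, now):
        if self.last_heartbeat is not None:
            self.intervals.append(now - self.last_heartbeat)
        self.last_heartbeat = now

    def phi(self, now):
        if not self.intervals:
            return 0.0
        mean = sum(self.intervals) / len(self.intervals)
        elapsed = now - self.last_heartbeat
        p_later = math.exp(-elapsed / mean)      # P(next heartbeat arrives after 'now')
        return -math.log10(max(p_later, 1e-12))  # suspicion grows as p_later shrinks

    def suspect(self, now):
        return self.phi(now) > self.threshold

d = PhiAccrualDetector()
for t in (0, 1, 2, 3):
    d.heartbeat(t)
print(d.phi(4), d.suspect(15))   # low suspicion at t=4; suspected after a long silence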
Cassandra vs. RDBMS
MySQL is one of the most popular RDBMSs (and has been for a while)
On > 50 GB of data:
MySQL
• Writes: 300 ms avg
• Reads: 350 ms avg
Cassandra
• Writes: 0.12 ms avg
• Reads: 15 ms avg
Orders of magnitude faster
What's the catch? What did we lose?


CAP Theorem

Proposed by Eric Brewer (Berkeley)
Subsequently proved by Gilbert and Lynch (NUS and MIT)
In a distributed system you can satisfy at most 2 out of the 3 guarantees:
1. Consistency: all nodes see the same data at any time, or reads return the latest written value by any client
2. Availability: the system allows operations all the time, and operations return quickly
3. Partition-tolerance: the system continues to work in spite of network partitions


Why is Availability Important?
Availability = reads/writes complete reliably and quickly.
Measurements have shown that a 500 ms increase in latency for operations at Amazon.com or at Google.com can cause a 20% drop in revenue.
At Amazon, each added millisecond of latency implies a $6M yearly loss.
User cognitive drift: if more than a second elapses between clicking and material appearing, the user's mind is already somewhere else.
SLAs (Service Level Agreements) written by providers predominantly deal with latencies faced by clients.


Why is Consistency Important?
Consistency = all nodes see the same data at any time, or reads return the latest written value by any client.
When you access your bank or investment account via multiple clients (laptop, workstation, phone, tablet), you want the updates done from one client to be visible to the other clients.
When thousands of customers are looking to book a flight, all updates from any client (e.g., book a flight) should be accessible by the other clients.


Why is Partition-Tolerance Important?
Partitions can happen across datacenters when the Internet gets disconnected:
• Internet router outages
• Under-sea cables cut
• DNS not working
Partitions can also occur within a datacenter, e.g., a rack switch outage.
We still desire the system to continue functioning normally under this scenario.


CAP Theorem Fallout
Since partition-tolerance is essential in today's cloud computing systems, the CAP theorem implies that a system has to choose between consistency and availability.
Cassandra
• Eventual (weak) consistency, Availability, Partition-tolerance
Traditional RDBMSs
• Strong consistency over availability under a partition


CAP Tradeoff
Starting point for the NoSQL revolution.
A distributed storage system can achieve at most two of C, A, and P.
When partition-tolerance is important, you have to choose between consistency and availability.
[Figure: triangle with vertices Consistency, Availability, Partition-tolerance ("pick 2"). Consistency + Partition-tolerance: HBase, HyperTable, BigTable, Spanner. Consistency + Availability: RDBMSs (non-replicated). Availability + Partition-tolerance: Cassandra, Riak, Dynamo, Voldemort.]


Eventual Consistency
If all writes stop (to a key), then all its values (replicas) will converge eventually.
If writes continue, then the system always tries to keep converging.
• A moving "wave" of updated values lags behind the latest values sent by clients, but always tries to catch up.
May still return stale values to clients (e.g., if there are many back-to-back writes).
But works well when there are periods of low writes – the system converges quickly.


RDBMS vs. Key-value stores
While RDBMSs provide ACID:
• Atomicity
• Consistency
• Isolation
• Durability
Key-value stores like Cassandra provide BASE:
• Basically Available, Soft-state, Eventual consistency
• Prefers availability over consistency


Consistency in Cassandra
Cassandra has consistency levels.
The client is allowed to choose a consistency level for each operation (read/write):
ANY: any server (may not be a replica)
• Fastest: coordinator caches the write and replies quickly to the client
ALL: all replicas
• Ensures strong consistency, but slowest
ONE: at least one replica
• Faster than ALL, but cannot tolerate a failure
QUORUM: quorum across all replicas in all datacenters (DCs)
• What? (See the next slides.)
Quorums for Consistency
In a nutshell:
Quorum = majority (> 50%)
Any two quorums intersect
• Client 1 does a write in the red quorum
• Then client 2 does a read in the blue quorum
• At least one server in the blue quorum returns the latest write
Quorums are faster than ALL, but still ensure strong consistency.
[Figure: five replicas of a key-value pair, with two overlapping quorums (red and blue) that share at least one server.]
Quorums in Detail
Several key-value/NoSQL stores (e.g., Riak and Cassandra) use quorums.
Reads
• Client specifies a value of R (≤ N = total number of replicas of that key).
• R = read consistency level.
• Coordinator waits for R replicas to respond before sending the result to the client.
• In the background, the coordinator checks for consistency of the remaining (N-R) replicas, and initiates read repair if needed.


Quorums in Detail (Contd.)
Writes come in two flavors:
• Client specifies W (≤ N)
• W = write consistency level.
• Client writes the new value to W replicas and returns. Two flavors:
  • Coordinator blocks until the quorum is reached.
  • Asynchronous: just write and return.


Quorums in Detail (Contd.)
R = read replica count, W = write replica count
Two necessary conditions:
1. W + R > N
2. W > N/2
Select values based on the application:
• (W=1, R=1): very few writes and reads
• (W=N, R=1): great for read-heavy workloads
• (W=N/2+1, R=N/2+1): great for write-heavy workloads
• (W=1, R=N): great for write-heavy workloads with mostly one client writing per key
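A quick Python check of the two conditions above, applied to the example (W, R) choices from this slide; purely illustrative arithmetic.

def valid_quorum(n, w, r):
    """W + R > N guarantees a read quorum overlaps every write quorum;
    W > N/2 guarantees any two write quorums overlap (writes are ordered)."""
    return (w + r > n) and (w > n / 2)

N = 5
for w, r in [(1, 1), (N, 1), (N // 2 + 1, N // 2 + 1), (1, N)]:
    print(f"W={w}, R={r}: both conditions hold = {valid_quorum(N, w, r)}")
# With N=5: (5,1) and (3,3) satisfy both conditions; (1,1) and (1,5) trade them away for speed.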


Cassandra Consistency Levels (Contd.)
The client is allowed to choose a consistency level for each operation (read/write):
ANY: any server (may not be a replica)
• Fastest: coordinator may cache the write and reply quickly to the client
ALL: all replicas
• Slowest, but ensures strong consistency
ONE: at least one replica
• Faster than ALL, and ensures durability without failures
QUORUM: quorum across all replicas in all datacenters (DCs)
• Global consistency, but still fast
LOCAL_QUORUM: quorum in the coordinator's DC
• Faster: only waits for a quorum in the first DC the client contacts
EACH_QUORUM: quorum in every DC
• Lets each DC do its own quorum: supports hierarchical replies
Types of Consistency
Cassandra offers Eventual Consistency.
Are there other types of weak consistency models?


Consistency Solutions
[Figure: a spectrum of consistency models, from Eventual on the left (faster reads and writes) to Strong, e.g., Sequential, on the right (more consistency).]


Eventual Consistency
Cassandra offers Eventual Consistency:
• If writes to a key stop, all replicas of the key will converge.
• Originally from Amazon's Dynamo and LinkedIn's Voldemort systems.
(On the spectrum: Eventual sits at the "faster reads and writes" end; Strong, e.g., Sequential, at the "more consistency" end.)


Newer Consistency Models
Striving towards strong consistency, while still trying to maintain high availability and partition-tolerance:
• Red-Blue
• Causal
• Per-key sequential
• Probabilistic
• CRDTs
(These sit between Eventual and Strong, e.g., Sequential, on the spectrum.)


Newer Consistency Models (Contd.)
Per-key sequential: per key, all operations have a global order.
CRDTs (Commutative Replicated Data Types): data structures for which commutated writes give the same result [INRIA, France].
• E.g., value == int, and the only op allowed is +1.
• Effectively, servers don't need to worry about consistency.
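A tiny grow-only counter CRDT in Python, matching the "int with only +1" example: each replica counts its own increments, and merging takes the per-replica maximum, so updates commute regardless of delivery order. This is a generic G-Counter sketch, not tied to any particular store.

class GCounter:
    """Grow-only counter CRDT: one slot per replica; merge = element-wise max."""
    def __init__(self, replica_id, n_replicas):
        self.id = replica_id
        self.counts = [0] * n_replicas

    def increment(self):
        self.counts[self.id] += 1          # the only allowed operation: +1

    def value(self):
        return sum(self.counts)

    def merge(self, other):
        # Commutative, associative, idempotent: replicas converge in any merge order.
        self.counts = [max(a, b) for a, b in zip(self.counts, other.counts)]

a, b = GCounter(0, 2), GCounter(1, 2)
a.increment(); a.increment(); b.increment()
a.merge(b); b.merge(a)
print(a.value(), b.value())   # 3 3 -- both replicas converge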


Newer Consistency Models (Contd.)
Red-blue consistency: rewrite client transactions to separate operations into red operations vs. blue operations [MPI-SWS, Germany].
• Blue operations can be executed (commutated) in any order across DCs.
• Red operations need to be executed in the same order at each DC.


Newer Consistency Models (Contd.)
Causal consistency: reads must respect the partial order based on information flow [Princeton, CMU].
[Figure: timelines for three clients. Client A writes W(K1, 33). Client B reads R(K1), which returns 33, and then writes W(K2, 55). A write W(K1, 22) also occurs, concurrent with these operations. Client C's first read R(K1) may return 22 or 33; its read R(K2) returns 55; after that, its read R(K1) must return 33. Causality, not messages.]


Which Consistency Model should you use?
Use the lowest (leftmost on the spectrum) consistency model that is "correct" for your application.
This gets you the fastest availability.


Strong Consistency Models
Linearizability: each operation by a client is visible (or available) instantaneously to all other clients
• Instantaneously in real time
Sequential consistency [Lamport]:
• "... the result of any execution is the same as if the operations of all the processors were executed in some sequential order, and the operations of each individual processor appear in this sequence in the order specified by its program."
• After the fact, find a "reasonable" ordering of the operations (can re-order operations) that obeys sanity (consistency) at all clients, and across clients.
Transaction ACID properties; example: newer key-value/NoSQL stores (sometimes called "NewSQL")
• Hyperdex [Cornell]
• Spanner [Google]
• Transaction chains [Microsoft Research]
Conclusion
Traditional databases (RDBMSs) work with strong consistency and offer ACID.
Modern workloads don't need such strong guarantees, but do need fast response times (availability).
Unfortunately, the CAP theorem forces a trade-off.
Key-value/NoSQL systems offer BASE [Basically Available, Soft-state, Eventual consistency]:
• Eventual consistency, and a variety of other consistency models striving towards strong consistency.
We have also discussed the design of Cassandra and different consistency solutions.


CQL
(Cassandra Query Language)

Dr. Rajiv Misra
Dept. of Computer Science & Engg.
Indian Institute of Technology Patna
[email protected]
Preface
Content of this lecture:
In this lecture, we will discuss CQL (Cassandra Query Language) and its mapping to Cassandra's internal data structure.


What Problems does CQL Solve?
The awesomeness that is Cassandra:
• Distributed columnar data store
• No single point of failure
• Optimized for availability (though "tunably" consistent)
• Optimized for writes
• Easily maintainable
• Almost infinitely scalable


What Problems does CQL Solve? (Contd.)
Cassandra's usability challenges:
• NoSQL: "Where are my JOINs? No schema? De-normalize!?"
• BigTable: "Tables with millions of columns!?"
CQL saves the day!
• A best-practices interface to Cassandra
• Uses a familiar SQL-like language


C* Data Model
[Figure: diagram of Cassandra's internal data model; details are listed on the next slide.]


C* Data Model (Contd.)
• Row key, column name, and column value have types
• The column name has a comparator
• The row key has a partitioner
• Rows can have any number of columns – even in the same column family
• Rows can have many columns
• Column values can be omitted
• Time-to-live is useful!
• Tombstones


C* Data Model: Writes
• Insert into the MemTable
• Dump to the CommitLog
• No read
• Very fast! Blocks on CPU before I/O!


C* Data Model: Reads
• Get values from the Memtable
• Get values from the row cache if present
• Otherwise check the Bloom filter to find the appropriate SSTables
• Check the key cache for fast SSTable search
• Get values from the SSTables
• Repopulate the row cache
Super-fast column retrieval
Fast row slicing




Introducing CQL
CQL is a reintroduction of schema so that you don't have to read code to understand the data model.
CQL creates a common language so that details of the data model can be easily communicated.
CQL is a best-practices Cassandra interface and hides the messy details.




Remember this:
• Cassandra finds rows fast
• Cassandra scans columns fast
• Cassandra does not scan rows


The CQL/Cassandra Mapping
[Figures: how CQL tables, rows, and columns map onto Cassandra's internal storage rows and columns; the diagrams are not recoverable from the text extraction.]


CQL for Sets, Lists and Maps
Collection semantics:
• Sets hold a list of unique elements
• Lists hold ordered, possibly repeating elements
• Maps hold a list of key-value pairs
• Uses the same old Cassandra data structure
Declaring: collection columns are declared in CREATE TABLE, e.g. myset set<int>, mylist list<int>, mymap map<int,int> (see the Example slide below).


Inserting
INSERT INTO mytable (row, myset)
VALUES (123, {'apple', 'banana'});

INSERT INTO mytable (row, mylist)
VALUES (123, ['apple', 'banana', 'apple']);

INSERT INTO mytable (row, mymap)
VALUES (123, {1: 'apple', 2: 'banana'});


Updating
UPDATE mytable SET myset = myset + {'apple', 'banana'} WHERE row = 123;
UPDATE mytable SET myset = myset - {'apple'} WHERE row = 123;

UPDATE mytable SET mylist = mylist + ['apple', 'banana'] WHERE row = 123;
UPDATE mytable SET mylist = ['banana'] + mylist WHERE row = 123;

UPDATE mytable SET mymap['fruit'] = 'apple' WHERE row = 123;
UPDATE mytable SET mymap = mymap + {'fruit': 'apple'} WHERE row = 123;
SETS, LISTS, MAPS
[Figures: how set, list, and map columns are laid out in Cassandra's internal column structure; the diagrams are not recoverable from the text extraction.]


Example
(in cqlsh)
CREATE KEYSPACE test WITH replication =
  {'class': 'SimpleStrategy', 'replication_factor': 1};
USE test;
CREATE TABLE stuff ( a int, b int, myset set<int>,
  mylist list<int>, mymap map<int,int>, PRIMARY KEY (a,b));
UPDATE stuff SET myset = {1,2}, mylist = [3,4,5], mymap = {6:7,8:9}
  WHERE a = 0 AND b = 1;
SELECT * FROM stuff;

(in cassandra-cli)
use test;
list stuff;

(in cqlsh)
SELECT key_aliases, column_aliases FROM system.schema_columnfamilies
  WHERE keyspace_name = 'test' AND columnfamily_name = 'stuff';
Conclusion
• CQL is a reintroduction of schema.
• CQL creates a common data modeling language.
• CQL is a best-practices Cassandra interface.
• CQL lets you take advantage of the C* data structure with any language.
• The CQL protocol is binary and therefore interoperable.
• CQL is asynchronous and fast (the Thrift transport layer is synchronous).
• CQL allows the possibility of prepared statements.


Design of Zookeeper

Dr. Rajiv Misra
Dept. of Computer Science & Engg.
Indian Institute of Technology Patna
[email protected]


Preface
Content of this lecture:
In this lecture, we will discuss the design of ZooKeeper, which is a service for coordinating processes of distributed applications.
We will discuss its basic fundamentals, design goals, architecture and applications.
https://zookeeper.apache.org/
ZooKeeper, why do we need it?
Coordination is important.


Classic Distributed System
[Figure: a master node with several slave nodes reporting to it.]
Most systems, like HDFS, have one master and a couple of slave nodes, and these slave nodes report to the master.
Fault Tolerant Distributed System
[Figure: a coordination service alongside a primary master and a backup master.]
A real distributed, fault-tolerant system has a coordination service, a master, and a backup master.
If the primary fails, then the backup takes over for it.
What is a Race Condition?
When two processes are competing with each other, causing data corruption.
[Figure: Person A and Person B both depositing into the same bank account.]
As shown in the diagram, two persons are trying to deposit 1 rupee online into the same bank account. The initial amount is 17 rupees. Due to the race condition, the final amount in the bank is 18 rupees instead of 19.


What is a Deadlock?
When two or more processes are waiting for each other, directly or indirectly, it is called a deadlock.
[Figure: Process 1 waiting for Process 2, Process 2 waiting for Process 3, and Process 3 waiting for Process 1.]
Here, Process 1 is waiting for Process 2, Process 2 is waiting for Process 3 to finish, and Process 3 is waiting for Process 1 to finish. All three processes would keep waiting and will never end. This is called a deadlock.
What is Coordination?
How would email processors avoid reading the same emails?
Suppose there is an inbox from which we need to index emails. Indexing is a heavy process and might take a lot of time.
Here, we have multiple machines which are indexing the emails. Every email has an id. You cannot delete any email; you can only read an email and mark it read or unread.
Now, how would you handle the coordination between multiple indexer processes so that every email is indexed (and none is read twice)?


What is Coordination? (Contd.)
If the indexers were running as multiple threads of a single process, coordination would be easier, by way of the synchronization constructs of the programming language.
But since there are multiple processes running on multiple machines which need to coordinate, we need a central storage.
[Figure: indexer processes sharing a central storage that records, per email, its id, timestamp, subject, and status.]
This central storage should be safe from all concurrency-related problems.
This central storage is exactly the role of ZooKeeper.


What is Coordination? (Contd.)
• Group membership: a set of datanodes (tasks) belong to the same group
• Leader election: electing a leader between a primary and a backup
• Dynamic configuration: multiple services joining, communicating and leaving (service lookup registry)
• Status monitoring: monitoring various processes and services in a cluster
• Queuing: one process enqueues work items and another consumes them
• Barriers: all the processes reach the barrier and leave the barrier together
• Critical sections: which process will enter the critical section, and when?
What is ZooKeeper?
ZooKeeper is a highly reliable distributed coordination kernel, which can be used for distributed locking, configuration management, leader election, work queues, ...
ZooKeeper is a replicated service that holds the metadata of distributed applications.
Key attributes of such data:
• Small size
• Performance sensitive
• Dynamic
• Critical
In very simple words, it is a central store of key-value pairs, using which distributed systems can coordinate. Since it needs to be able to handle the load, ZooKeeper itself runs on many machines.
What is ZooKeeper? (Contd.)
• Exposes a simple set of primitives
• Very easy to program
• Uses a data model like a directory tree
Used for:
• Synchronisation
• Locking
• Maintaining configuration
A coordination service that does not suffer from:
• Race conditions
• Deadlocks


Design Goals: 1. Simple
A shared hierarchical namespace that looks like a standard file system
• The namespace has data nodes – znodes (similar to files/dirs)
• Data is kept in-memory
• Achieves high throughput and low latency numbers
High performance
• Used in large, distributed systems
Highly available
• No single point of failure
Strictly ordered access
• Synchronisation
Design Goals: 2. Replicated
[Figure: an ensemble of ZooKeeper servers, one of which is the leader, serving many clients.]
• All servers have a copy of the state in memory
• A leader is elected at startup
• Followers service clients; all updates go through the leader
• Update responses are sent when a majority of servers have persisted the change
We need 2f+1 machines to tolerate f failures.


Design Goals: 2. Replicated (Contd.)
The client:
• Keeps a TCP connection
• Gets watch events
• Sends heartbeats
• If the connection breaks, connects to a different server
The servers:
• Know each other
• Keep an in-memory image of state
• Keep transaction logs & snapshots – persistent


Design Goals: 3. Ordered
ZooKeeper stamps each update with a number.
The number:
• Reflects the order of transactions
• Is used to implement higher-level abstractions, such as synchronization primitives


Design Goals: 4. Fast
Performs best where reads are more common than writes, at ratios of around 10:1.
At Yahoo!, where it was created, the throughput of a ZooKeeper cluster has been benchmarked at over 10,000 operations per second for write-dominant workloads generated by hundreds of clients.


Data Model
The way you store data in any store is called its data model.
Think of it as a highly available file system: in the case of ZooKeeper, think of the data model as a highly available file system with a few differences.
Znode: we store data in an entity called a znode.
JSON data: the data that we store should be in JSON format (JavaScript Object Notation).
No append operation: a znode can only be updated; it does not support append operations.
Data access (read/write) is atomic: a read or write either completes fully or throws an error if it fails; there is no intermediate state like a half-written value.
Znodes can have children.


Data Model (Contd.)
So, znodes inside znodes make a tree-like hierarchy.
The top-level znode is "/".
The znode "/zoo" is a child of "/", which is the top-level znode.
duck is a child znode of zoo; it is denoted as /zoo/duck.
Though "." and ".." are invalid characters, as opposed to the file system.
[Figure: a znode tree rooted at "/", with /zoo and its child /zoo/duck.]


Data Model – Znode Types
Persistent
• Such znodes remain in ZooKeeper until deleted. This is the default type of znode. To create such a node you can use the command: create /name_of_myznode "mydata"
Ephemeral
• An ephemeral node gets deleted if the session in which the node was created has disconnected. Though it is tied to the client's session, it is visible to other users.
• An ephemeral node cannot have children, not even ephemeral children.


Data Model – Znode Types (Contd.)
Sequential
• Creates a node with a sequence number in the name; the number is automatically appended.
  create -s /zoo v     →  Created /zoo0000000008
  create -s /xyz v     →  Created /xyz0000000009
  create -s /zoo/ v    →  Created /zoo/0000000003
  create -s /zoo/ v    →  Created /zoo/0000000004
• The counter keeps increasing monotonically.
• Each parent node keeps its own counter for its children.
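The znode types above can be exercised from Python. The sketch below assumes the third-party kazoo client library and a ZooKeeper server at 127.0.0.1:2181, so treat it as an illustrative recipe rather than part of the lecture.

from kazoo.client import KazooClient

zk = KazooClient(hosts="127.0.0.1:2181")
zk.start()                                    # connects and creates a session

# Persistent znode (default): survives until explicitly deleted.
zk.create("/config", b"v1", makepath=True)

# Ephemeral + sequential znode: disappears when this client's session ends,
# and gets a monotonically increasing counter appended, e.g. /workers/w-0000000003.
zk.create("/workers/w-", b"", ephemeral=True, sequence=True, makepath=True)

print(zk.get_children("/workers"))
zk.stop()                                     # session ends; the ephemeral znode goes away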
Architecture
ZooKeeper can run in two modes: (i) standalone and (ii) replicated.
(i) Standalone:
• In standalone mode, it is just running on one machine, and for practical purposes we do not use standalone mode.
• It is only for testing purposes, and it doesn't provide high availability.
(ii) Replicated:
• Runs on a cluster of machines called an ensemble.
• High availability.
• Tolerates failures as long as a majority of the ensemble is available.
Architecture: Phase 1
Phase 1: Leader election within the ensemble (Paxos-like algorithm)
• The machines elect a distinguished member – the leader.
• The others are termed followers.
• This phase is finished when a majority have synced their state with the leader.
• If the leader fails, the remaining machines hold an election; this takes about 200 ms.
• If a majority of the machines isn't available at any point of time, the leader automatically steps down.
Architecture: Phase 2
Phase 2: Atomic broadcast
• All write requests are forwarded to the leader, and the leader broadcasts the update to the followers.
• When a majority have persisted the change (e.g., 3 out of 4 followers have saved it):
  • The leader commits the update
  • The client gets a success response
• The protocol for achieving consensus is atomic, like two-phase commit.
• Machines write to disk before updating the in-memory state.
[Figure: a client sends a write to the leader, the leader replicates it to the followers, and replies "write successful" once a majority have saved it.]
Election in Zookeeper
A centralized service for maintaining configuration information.
Uses a variant of Paxos called Zab (ZooKeeper Atomic Broadcast).
Needs to keep a leader elected at all times.
Reference: https://zookeeper.apache.org/


Election in Zookeeper (2)
• Each server creates a new sequence number for itself; let's say the sequence numbers are ids.
• A server gets the highest id so far (from the ZK (ZooKeeper) file system), creates the next-higher id, and writes it into the ZK file system.
• Elect the highest-id server as the leader.
[Figure: servers N3, N5, N6, N12, N32, N80 on a ring, with N80 (the highest id) as master.]


Election in Zookeeper (3)
Failures:
• One option: everyone monitors the current master (directly or via a failure detector).
• On failure, initiate an election.
• This leads to a flood of elections – too many messages.
[Figure: master N80 crashes; N3, N5, N6, N12, N32 all detect the crash and start elections.]


Election in Zookeeper (4)
Second option (implemented in Zookeeper):
• Each process monitors its next-higher-id process.
• If that successor was the leader and it has failed, become the new leader.
• Else, wait for a timeout, and check your successor again.
[Figure: N3, N5, N6, N12, N32, N80 arranged so that each monitors the next-higher id, forming a chain up to the leader.]


Election in Zookeeper (5)
What about id conflicts? What if the leader fails during the election?
To address this, Zookeeper uses a two-phase-commit-like protocol (run after the sequence/id phase) to commit the leader:
• The leader sends a NEW_LEADER message to all.
• Each process responds with an ACK to at most one leader, i.e., the one with the highest process id.
• The leader waits for a majority of ACKs, and then sends COMMIT to all.
• On receiving COMMIT, a process updates its leader variable.
This ensures that safety is still maintained.
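The "monitor your neighbour" scheme of slides (4) and (5) is commonly built on sequential ephemeral znodes. The sketch below, assuming the kazoo library and an /election path invented for illustration, has each candidate create a sequential ephemeral znode, treat the highest id as leader, and poll its next-higher neighbour on a timeout, as the slides describe; it is a simplified recipe, not ZooKeeper's internal Zab election.

import time
from kazoo.client import KazooClient

ELECTION_PATH = "/election"          # illustrative path, not mandated by ZooKeeper

def run_candidate(name):
    zk = KazooClient(hosts="127.0.0.1:2181")
    zk.start()
    zk.ensure_path(ELECTION_PATH)
    # Sequential ephemeral znode: its suffix acts as this candidate's id.
    me = zk.create(f"{ELECTION_PATH}/n-", name.encode(),
                   ephemeral=True, sequence=True).rsplit("/", 1)[1]
    while True:
        children = sorted(zk.get_children(ELECTION_PATH))
        if me == children[-1]:                            # highest id wins, as in the slides
            print(f"{name} ({me}) is the leader")
        else:
            successor = children[children.index(me) + 1]  # next-higher id to monitor
            print(f"{name} ({me}) follows and monitors {successor}")
        time.sleep(5)                                     # wait for a timeout, then re-check

# run_candidate("server-A")   # run one copy per server in the ensemble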


Election Demo
If you have three nodes A, B, C with A as the leader, and A dies, will someone become leader?
Yes. Either B or C.


Election Demo (Contd.)
If you have three nodes A, B, C, and A and B die, will C become leader?
No one will become leader; C will become a follower.
Reason: a majority is not available.
Why do we need majority?
Imagine we have an ensemble spread over two data centres.
Now imagine the network between the data centres gets disconnected. If we did not need a majority for electing a leader, what would happen?
Each data centre would elect its own leader: no consistency, and utter chaos.
That is why a majority is required.


Sessions
Let's try to understand how ZooKeeper decides to delete ephemeral nodes and takes care of session management.
• A client has a list of servers in the ensemble.
• It tries each until successful.
• The server creates a new session for the client.
• A session has a timeout period – decided by the caller.


Sessions (Contd.)
• If the server hasn't received a request within the timeout period, it may expire the session.
• On session expiry, ephemeral nodes are lost.
• To keep sessions alive, the client sends pings (heartbeats).
• The client library takes care of heartbeats.
• Sessions are still valid on switching to another server.
• Failover is handled automatically by the client.
• The application can't remain entirely agnostic of server reconnections, because operations will fail during a disconnection.


States
[Slide figure not recoverable from the text extraction.]


Use Case: Many Servers – How do they Coordinate?
Let us say there are many servers which can respond to your requests, and there are many clients which might want the service.
[Figure: servers and clients connected through ZooKeeper.]
Use Case: Many Servers – How do they Coordinate? (Contd.)
From time to time, some of the servers will keep going down. How can all of the clients keep track of the available servers?
[Figure: servers and clients connected through ZooKeeper.]
Use Case: Many Servers – How do they Coordinate? (Contd.)
It is very easy using ZooKeeper as a central agency. Each server will create its own ephemeral znode under a particular znode, say "/servers". The clients would simply query ZooKeeper for the most recent list of servers ("Available servers?").
[Figure: servers registering under ZooKeeper and clients querying it.]
Use Case: Many Servers – How do they Coordinate? (Contd.)
Let's take the case of two servers and a client. The two servers, duck and cow, created their ephemeral nodes under the "/servers" znode. The client would simply discover the alive servers cow and duck using the command ls /servers.
Say a server called "duck" goes down: its ephemeral node will disappear from the /servers znode, and hence the next time the client comes and queries, it would only get "cow". So the coordination has been heavily simplified and made efficient because of ZooKeeper.
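A sketch of this service-discovery pattern in Python, again assuming the kazoo client; the /servers path and the duck/cow naming follow the slide's example.

from kazoo.client import KazooClient

def register_server(name):
    """Each server registers itself as an ephemeral child of /servers."""
    zk = KazooClient(hosts="127.0.0.1:2181")
    zk.start()
    zk.ensure_path("/servers")
    zk.create(f"/servers/{name}", ephemeral=True)   # vanishes if the server dies
    return zk    # keep the client (and hence the session/znode) alive

def watch_servers():
    """A client keeps an up-to-date view of the alive servers."""
    zk = KazooClient(hosts="127.0.0.1:2181")
    zk.start()

    def on_change(children):
        print("alive servers:", children)            # e.g. ['cow', 'duck']

    zk.ChildrenWatch("/servers", on_change)           # kazoo re-arms this watch for us
    return zk

# duck = register_server("duck"); cow = register_server("cow"); client = watch_servers()
# duck.stop()   # duck's session ends -> its ephemeral znode disappears -> clients see ['cow']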
Guarantees
Sequential consistency
• Updates from any particular client are applied in the order they were sent
Atomicity
• Updates either succeed or fail
Single system image
• A client will see the same view of the system regardless of the server it connects to; a new server will not accept the connection until it has caught up
Durability
• Once an update has succeeded, it will persist and will not be undone
Timeliness
• Rather than allow a client to see very stale data, a server will shut down


Operations
OPERATION              DESCRIPTION
create                 Creates a znode (the parent znode must exist)
delete                 Deletes a znode (it mustn't have children)
exists / ls            Tests whether a znode exists & gets its metadata
getACL, setACL         Gets/sets the ACL for a znode
getChildren / ls       Gets a list of the children of a znode
getData/get, setData   Gets/sets the data associated with a znode
sync                   Synchronizes a client's view of a znode with ZooKeeper


Multi Update
• Batches multiple operations together
• Either all fail or all succeed in their entirety
• Makes it possible to implement transactions
• Others never observe any inconsistent state


APIs
Two core bindings: Java & C
contrib: Perl, Python, REST
For each binding, sync and async versions are available.
Sync:
public Stat exists(String path, Watcher watcher)
    throws KeeperException, InterruptedException
Async:
public void exists(String path, Watcher watcher, StatCallback cb, Object ctx)


Watches
Watches allow clients to get notifications when a znode changes in some way.
Watchers are triggered only once.
For multiple notifications, re-register the watch.


Watch Triggers
The read operations exists, getChildren, and getData may have watches.
Watches are triggered by the write operations: create, delete, setData.
ACL (Access Control List) operations do not participate in watches.

WATCH OF...     ...IS TRIGGERED WHEN THE ZNODE IS...
exists          created, deleted, or has its data updated
getData         deleted or has its data updated
getChildren     deleted, or any of its children is created or deleted
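Because a plain watch fires only once, a client must re-register after every notification. The sketch below assumes kazoo and shows the one-shot pattern explicitly by passing a watch callback to get(); kazoo's DataWatch recipe can re-arm the watch automatically if preferred.

from kazoo.client import KazooClient

zk = KazooClient(hosts="127.0.0.1:2181")
zk.start()
zk.ensure_path("/config/app")

def on_event(event):
    # Fires at most once; read the znode again and re-register to keep watching.
    print("znode changed:", event.path, event.type)
    data, stat = zk.get("/config/app", watch=on_event)
    print("new value:", data, "version:", stat.version)

data, stat = zk.get("/config/app", watch=on_event)   # the initial read sets the watch
zk.set("/config/app", b"v2")                          # a setData write triggers on_event once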


ACLs – Access Control Lists
An ACL determines who can perform certain operations on a znode.
An ACL is the combination of:
• an authentication scheme,
• an identity for that scheme,
• and a set of permissions.
Authentication schemes:
• digest – the client is authenticated by a username & password
• sasl – the client is authenticated using Kerberos
• ip – the client is authenticated by its IP address


Use Cases
• Building a reliable configuration service
• A distributed lock service
  • Only a single process may hold the lock


When Not to Use?
1. To store big data, because:
• The number of copies == the number of nodes
• All data is loaded in RAM too
• Network load of transferring all data to all nodes
2. Extremely strong consistency


ZooKeeper Applications: The Fetching Service
• The Fetching Service: crawling is an important part of a search engine, and Yahoo! crawls billions of Web documents. The Fetching Service (FS) is part of the Yahoo! crawler and is currently in production. Essentially, it has master processes that command page-fetching processes.
• The master provides the fetchers with configuration, and the fetchers write back, informing it of their status and health. The main advantages of using ZooKeeper for FS are recovering from failures of masters, guaranteeing availability despite failures, and decoupling the clients from the servers, allowing them to direct their requests to healthy servers by just reading their status from ZooKeeper.
• Thus, FS uses ZooKeeper mainly to manage configuration metadata, although it also uses ZooKeeper to elect masters (leader election).
ZooKeeper Applications: Katta
Katta is a distributed indexer that uses ZooKeeper for coordination, and it is an example of a non-Yahoo! application. Katta divides the work of indexing using shards.
• A master server assigns shards to slaves and tracks progress.
• Slaves can fail, so the master must redistribute load as slaves come and go.
• The master can also fail, so other servers must be ready to take over in case of failure. Katta uses ZooKeeper to track the status of slave servers and the master (group membership), and to handle master failover (leader election).
• Katta also uses ZooKeeper to track and propagate the assignments of shards to slaves (configuration management).


ZooKeeper Applications: Yahoo! Message Broker
Yahoo! Message Broker (YMB) is a distributed publish-subscribe system. The system manages thousands of topics that clients can publish messages to and receive messages from. The topics are distributed among a set of servers to provide scalability.
Each topic is replicated using a primary-backup scheme that ensures messages are replicated to two machines to ensure reliable message delivery. The servers that make up YMB use a shared-nothing distributed architecture, which makes coordination essential for correct operation.
YMB uses ZooKeeper to manage the distribution of topics (configuration metadata), deal with failures of machines in the system (failure detection and group membership), and control system operation.
ZooKeeper Applications: Yahoo! Message Broker (Contd.)
[Figure: the layout of Yahoo! Message Broker (YMB) structures in ZooKeeper.]
The figure shows part of the znode data layout for YMB.
Each broker domain has a znode called nodes that has an ephemeral znode for each of the active servers that compose the YMB service.
Each YMB server creates an ephemeral znode under nodes with load and status information, providing both group membership and status information through ZooKeeper.
ZooKeeper Applications: Yahoo! Message Broker (Contd.)
The topics directory has a child znode for each topic managed by YMB.
These topic znodes have child znodes that indicate the primary and backup server for each topic, along with the subscribers of that topic.
The primary and backup server znodes not only allow servers to discover the servers in charge of a topic, but they also manage leader election and server crashes.
[Figure: the layout of Yahoo! Message Broker (YMB) structures in ZooKeeper.]


More Details
See: https://zookeeper.apache.org/


Conclusion
ZooKeeper takes a wait-free approach to the problem of coordinating processes in distributed systems, by exposing wait-free objects to clients.
ZooKeeper achieves throughput values of hundreds of thousands of operations per second for read-dominant workloads by using fast reads with watches, both of which are served by local replicas.
In this lecture, we have discussed the basic fundamentals, design goals, architecture and applications of ZooKeeper.