Week-4 Lecture Notes
1. RandomPartitioner: Chord-like hash partitioning
2. ByteOrderedPartitioner: Assigns ranges of keys to servers.
• Easier for range queries (e.g., get all Twitter users whose names start with [a-b])
2. NetworkTopologyStrategy: for multi-DC deployments
• Two or three replicas per DC
• Per DC:
– First replica placed according to the Partitioner
– Then go clockwise around the ring until you hit a different rack
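A minimal sketch of this per-DC placement rule (illustrative only, not Cassandra's actual code; the ring, tokens, and rack names are hypothetical):

import java.util.*;

// Illustrative sketch: place the first replica on the node owning the key's token,
// then walk the ring clockwise, taking nodes on racks not used yet.
// (Real Cassandra also falls back to reusing racks when there are too few of them.)
class RingPlacement {
    record Node(long token, String rack) {}

    static List<Node> placeReplicas(List<Node> ring, long keyToken, int replicas) {
        // ring is sorted by token; find the first node whose token >= keyToken (wrap around)
        int start = 0;
        while (start < ring.size() && ring.get(start).token() < keyToken) start++;
        if (start == ring.size()) start = 0;

        List<Node> chosen = new ArrayList<>();
        Set<String> racksUsed = new HashSet<>();
        for (int i = 0; i < ring.size() && chosen.size() < replicas; i++) {
            Node n = ring.get((start + i) % ring.size());
            if (racksUsed.add(n.rack())) chosen.add(n);   // only take racks not used yet
        }
        return chosen;
    }
}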
Snitches
Maps IPs to racks and DCs. Configured in the cassandra.yaml config file.
Some options:
SimpleSnitch: Unaware of Topology (Rack-unaware)
RackInferringSnitch: infers the network topology from the octets of the server's IP address
• 101.102.103.104 = x.<DC octet>.<rack octet>.<node octet>
PropertyFileSnitch: uses a config file
EC2Snitch: uses EC2.
• EC2 Region = DC
• Availability zone = rack
Other snitch options available
Writes
Coordinator may be per-key, or per-client, or per-query
Per-key Coordinator ensures writes for the key are
serialized
Coordinator uses Partitioner to send query to all replica
nodes responsible for key
When X replicas respond, the coordinator returns an acknowledgement to the client. (X? We will check it later.)
When all replicas are down, the Coordinator (front end)
buffers writes (for up to a few hours).
One ring per datacenter
Per-DC coordinator elected to coordinate with other
DCs
Election done via Zookeeper, which runs a Paxos
(consensus) variant
Memtable: typically an append-only data structure (fast)
• A cache that can be searched by key
• Write-back, as opposed to write-through
Later, when memtable is full or old, flush to disk
Data File: An SSTable (Sorted String Table) – list of key-value pairs,
sorted by key
SSTables are immutable (once created, they don’t change)
Index file: An SSTable of (key, position in data sstable) pairs
And a Bloom filter (for efficient search)
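A toy sketch of the flush step (illustrative only; the file layout and encoding here are made up, not Cassandra's on-disk format):

import java.io.*;
import java.util.*;

// Illustrative sketch of flushing a memtable: the sorted in-memory map is written
// out as an immutable, key-sorted data file plus an index of key -> byte offset.
// (A Bloom filter over the keys would also be written; omitted here.)
class MemtableFlush {
    static void flush(TreeMap<String, String> memtable, File dataFile, File indexFile)
            throws IOException {
        try (DataOutputStream data = new DataOutputStream(new FileOutputStream(dataFile));
             DataOutputStream index = new DataOutputStream(new FileOutputStream(indexFile))) {
            for (Map.Entry<String, String> e : memtable.entrySet()) {
                index.writeUTF(e.getKey());
                index.writeLong(data.size());   // position of this key in the data file
                data.writeUTF(e.getKey());
                data.writeUTF(e.getValue());
            }
        }
        // Once written, the files are never modified again (SSTables are immutable).
    }
}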
Bloom Filter
Compact way of representing a set of items
Checking for existence in set is cheap
Some probability of false positives: an item not in the set may be reported as being in the set
[Figure: a large bit map with positions 0..127; key K is fed to hash functions Hash1, Hash2, …, Hashk, each selecting a bit position (e.g., 3, 69, 111, 127)]
• On insert, set all hashed bits.
• On check-if-present, return true if all hashed bits are set.
• Never false negatives.
• False positives possible, but the false positive rate is low
– e.g., with 4 hash functions, 100 items, and 3200 bits, the FP rate = 0.02%
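A Bloom filter is only a few lines of code. The sketch below is illustrative (a double-hashing trick stands in for k independent hash functions); constructing it as new BloomFilter(3200, 4) matches the figures above.

import java.util.BitSet;

// Illustrative Bloom filter sketch (not Cassandra's implementation).
class BloomFilter {
    private final BitSet bits;
    private final int m;      // number of bits
    private final int k;      // number of hash functions

    BloomFilter(int m, int k) { this.bits = new BitSet(m); this.m = m; this.k = k; }

    private int position(String key, int i) {
        int h1 = key.hashCode();
        int h2 = Integer.rotateLeft(h1, 16) ^ 0x9E3779B9;   // second, roughly independent hash
        return Math.floorMod(h1 + i * h2, m);
    }

    void insert(String key) {                 // on insert, set all hashed bits
        for (int i = 0; i < k; i++) bits.set(position(key, i));
    }

    boolean mightContain(String key) {        // true if all hashed bits are set
        for (int i = 0; i < k; i++) if (!bits.get(position(key, i))) return false;
        return true;                          // possibly a false positive, never a false negative
    }
}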
Compaction
The process of compaction merges SSTables, i.e., by merging updates for a key.
Deletes add a tombstone (a deletion marker) rather than removing the item right away. Eventually, when compaction encounters the tombstone, it deletes the item.
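A minimal sketch of the merge step (illustrative only; real compaction also keeps tombstones for a grace period before dropping them):

import java.util.*;

// Illustrative sketch of compaction's merge: rows from two SSTables are merged
// key-by-key, the newest timestamp wins, and tombstoned items are finally deleted.
class Compaction {
    record Cell(String value, long timestamp, boolean tombstone) {}

    static TreeMap<String, Cell> merge(TreeMap<String, Cell> older, TreeMap<String, Cell> newer) {
        TreeMap<String, Cell> merged = new TreeMap<>(older);
        newer.forEach((key, cell) ->
            merged.merge(key, cell, (a, b) -> a.timestamp() >= b.timestamp() ? a : b));
        merged.values().removeIf(Cell::tombstone);   // drop tombstoned items
        return merged;
    }
}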
Reads
When X replicas respond, the coordinator returns the latest-timestamped value from among those X.
• (X? We will check it later.)
Coordinator also fetches value from other replicas
• Checks consistency in the background, initiating a read repair if
any two values are different
• This mechanism seeks to eventually bring all replicas up to date
At a replica
• A row may be split across multiple SSTables => reads need to touch
multiple SSTables => reads slower than writes (but still fast)
Cluster membership
Every server needs to maintain a list of all the other servers that are currently in the cluster.
The list needs to be updated automatically as servers join, leave, and fail.
[Figure: gossip-style membership lists. Each entry has an Address, a Heartbeat Counter, and a (local) Time; the example shows the list at node 2, whose current local time is 70. Clocks are asynchronous across nodes.]
Protocol:
• Nodes periodically gossip their membership list
• On receipt, the local membership list is updated, as shown
• If any heartbeat is older than Tfail, the node is marked as failed
(Remember this?)
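A small sketch of this merge-and-timeout rule (illustrative; addresses and times are arbitrary):

import java.util.HashMap;
import java.util.Map;

// Illustrative gossip merge: for each address, keep the entry with the higher heartbeat
// counter, stamping refreshed entries with the local time so staleness can be checked.
class GossipMembership {
    record Entry(long heartbeat, long localTime) {}

    final Map<Integer, Entry> members = new HashMap<>();   // address -> entry

    void onGossip(Map<Integer, Long> received, long now) {
        received.forEach((addr, hb) -> {
            Entry cur = members.get(addr);
            if (cur == null || hb > cur.heartbeat()) {
                members.put(addr, new Entry(hb, now));      // newer heartbeat: refresh local time
            }
        });
    }

    boolean hasFailed(int addr, long now, long tFail) {
        Entry e = members.get(addr);
        return e != null && now - e.localTime() > tFail;    // heartbeat older than Tfail
    }
}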
Suspicion Mechanisms in Cassandra
Suspicion mechanisms to adaptively set the timeout based on
underlying network and failure behavior
Accrual detector: Failure Detector outputs a value (PHI)
representing suspicion
Applications set an appropriate threshold
PHI calculation for a member:
• Based on the inter-arrival times of gossip messages from that member
• PHI(t) = – log10( P(t_now – t_last) ), where P is the CDF (probability) estimated from the history of inter-arrival times
PHI basically determines the detection timeout, but takes into account historical inter-arrival time variations for gossiped heartbeats
In practice, PHI = 5 => 10-15 sec detection time
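A sketch of the calculation (illustrative; it approximates the inter-arrival distribution with an exponential whose mean tracks recent gaps, whereas a production detector keeps a window of samples):

// Illustrative accrual failure detector: PHI(t) = -log10( P(gap at least this large) ),
// with P approximated as exp(-(t_now - t_last) / meanInterArrival).
class AccrualDetector {
    private double meanInterArrival = 1000;   // ms, running estimate
    private long lastHeartbeat = System.currentTimeMillis();

    void onHeartbeat(long now) {
        meanInterArrival = 0.9 * meanInterArrival + 0.1 * (now - lastHeartbeat);
        lastHeartbeat = now;
    }

    double phi(long now) {
        double p = Math.exp(-(now - lastHeartbeat) / meanInterArrival);
        return -Math.log10(p);
    }
    // A member is suspected when phi() crosses an application-chosen threshold (e.g., 5).
}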
Cassandra Vs. RDBMS
MySQL is one of the most popular (and has been for a
while)
On > 50 GB data
MySQL
• Writes: 300 ms avg
• Reads: 350 ms avg
Cassandra
• Writes: 0.12 ms avg
• Reads: 15 ms avg
Orders of magnitude faster
What’s the catch? What did we lose?
CAP Theorem
3 guarantees:
1. Consistency: all nodes see same data at any time, or
reads return latest written value by any client
2. Availability: the system allows operations all the time,
and operations return quickly
3. Partition-tolerance: the system continues to work in
spite of network partitions
can cause a 20% drop in revenue.
At Amazon, each added millisecond of latency implies a
$6M yearly loss.
User cognitive drift: if more than a second elapses between clicking and material appearing, the user's mind is already somewhere else.
SLAs (Service Level Agreements) written by providers
predominantly deal with latencies faced by clients.
When you access your bank or investment account via
multiple clients (laptop, workstation, phone, tablet), you
want the updates done from one client to be visible to
other clients.
When thousands of customers are looking to book a flight,
all updates from any client (e.g., book a flight) should be
accessible by other clients.
Partitions can occur across datacenters, e.g.:
• Under-sea cables cut
• DNS not working
Partitions can also occur within a datacenter, e.g., a rack
switch outage
Still desire system to continue functioning normally
under this scenario
Under a network partition, the system has to choose between consistency and availability.
Cassandra
Eventual (weak) consistency, Availability, Partition-
tolerance
Traditional RDBMSs
Strong consistency over availability under a partition
A distributed system can achieve at most two of C, A, and P ("pick 2"). When partition-tolerance is important, you have to choose between consistency and availability.
[Figure: the CAP triangle]
• Consistency + Availability: RDBMSs (non-replicated)
• Consistency + Partition-tolerance: HBase, HyperTable, BigTable, Spanner
• Availability + Partition-tolerance: Cassandra, RIAK, Dynamo, Voldemort
Eventual consistency: the system keeps converging.
• A moving "wave" of updated values lags behind the latest values sent by clients, but always tries to catch up.
May still return stale values to clients (e.g., if many back-
to-back writes).
But it works well when there are periods of low writes – the system converges quickly.
RDBMSs provide ACID:
• Atomicity
• Consistency
• Isolation
• Durability
Key-value stores like Cassandra provide BASE
Basically Available Soft-state Eventual Consistency
Prefers Availability over Consistency
Write consistency levels (client-specified):
ANY: any server (may not be a replica)
• Fastest: coordinator caches write and replies quickly to client
ALL: all replicas
• Ensures strong consistency, but slowest
ONE: at least one replica
• Faster than ALL, but cannot tolerate a failure
QUORUM: quorum across all replicas in all datacenters (DCs)
• What is a quorum? (explained next)
Quorums for Consistency
In a nutshell:
• Quorum = majority (> 50%)
• Any two quorums intersect
[Figure: two overlapping quorums of servers (a "red" quorum and a second, "blue" quorum) sharing at least one server]
• Client 1 does a write in the red quorum
• Then client 2 does a read in the blue quorum; at least one server lies in both quorums, so the read sees the write
Client specifies value of R (≤ N = total number of
replicas of that key).
R = read consistency level.
Coordinator waits for R replicas to respond before
sending result to client.
In background, coordinator checks for consistency of
remaining (N-R) replicas, and initiates read repair if
needed.
Client writes new value to W replicas and returns. Two
flavors:
• Coordinator blocks until quorum is reached.
• Asynchronous: Just write and return.
Two necessary conditions:
1. W + R > N
2. W > N/2
Select values based on application:
• (W=1, R=1): very few writes and reads
• (W=N, R=1): great for read-heavy workloads
• (W=N/2+1, R=N/2+1): great for write-heavy workloads
• (W=1, R=N): great for write-heavy workloads with mostly one client writing per key
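A tiny check of the two conditions (illustrative): with N = 3, (W = 2, R = 2) satisfies both, while (W = 1, R = 1) satisfies neither and only gives eventual consistency.

// Illustrative sketch of the quorum conditions for a given (N, W, R).
class QuorumCheck {
    static boolean readsSeeLatestWrite(int n, int w, int r) {
        return w + r > n;        // read and write quorums must intersect
    }
    static boolean writesOrdered(int n, int w) {
        return w > n / 2.0;      // any two write quorums must intersect
    }

    public static void main(String[] args) {
        int n = 3;
        System.out.println(readsSeeLatestWrite(n, 2, 2) && writesOrdered(n, 2)); // true
        System.out.println(readsSeeLatestWrite(n, 1, 1) && writesOrdered(n, 1)); // false
    }
}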
ALL: all replicas
• Slowest, but ensures strong consistency
ONE: at least one replica
• Faster than ALL, and ensures durability without failures
QUORUM: quorum across all replicas in all datacenters (DCs)
• Global consistency, but still fast
LOCAL_QUORUM: quorum in coordinator’s DC
• Faster: only waits for quorum in first DC client contacts
EACH_QUORUM: quorum in every DC
• Lets each DC do its own quorum: supports hierarchical replies
Types of Consistency
Cassandra offers Eventual Consistency
[Figure: a spectrum of consistency models, from Eventual consistency (faster reads and writes; e.g., Voldemort-style systems) to Strong consistency (more consistency; e.g., Sequential)]
[Figure: intermediate models on the spectrum between Eventual and Strong (e.g., Sequential): CRDTs, Red-Blue, Causal, Probabilistic, Per-key sequential]
CRDTs (Commutative Replicated Data Types): data structures for which commutated writes give the same result [INRIA, France]
• E.g., value == int, and the only op allowed is +1
• Effectively, servers don't need to worry about consistency
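A grow-only counter is the classic instance of the "+1 only" example above; a minimal sketch (illustrative, not any particular library):

import java.util.HashMap;
import java.util.Map;

// Illustrative grow-only CRDT counter: each replica increments only its own slot,
// so updates commute and replicas can merge in any order.
class GCounter {
    private final String replicaId;
    private final Map<String, Long> counts = new HashMap<>();

    GCounter(String replicaId) { this.replicaId = replicaId; }

    void increment() { counts.merge(replicaId, 1L, Long::sum); }     // the only allowed op: +1

    long value() { return counts.values().stream().mapToLong(Long::longValue).sum(); }

    void merge(GCounter other) {                                     // element-wise max commutes
        other.counts.forEach((id, c) -> counts.merge(id, c, Math::max));
    }
}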
Red-Blue consistency:
• Blue operations can be executed (commutated) in any order across DCs
• Red operations need to be executed in the same order at each DC
Causal consistency:
[Figure: clients reading and writing keys K1 and K2; R(K1) returns 33, R(K2) returns 55, and client C may see 22 or 33. Causality, not messages.]
Gets you the fastest availability.
Strong consistency models
Sequential consistency [Lamport]: "... the result of any execution is the same as if the operations of all the processors were executed in some sequential order, and the operations of each individual processor appear in this sequence in the order specified by its program."
In other words: after the fact, find a "reasonable" ordering of the operations (one can re-order operations) that obeys sanity (consistency) at all clients, and across clients.
Transaction ACID properties, example: newer key-value/NoSQL stores
(sometimes called “NewSQL”)
Hyperdex [Cornell]
Spanner [Google]
Transaction chains [Microsoft Research]
Conclusion
Traditional Databases (RDBMSs) work with strong
consistency, and offer ACID
Modern workloads don’t need such strong guarantees, but
do need fast response times (availability)
Unfortunately, CAP theorem
Key-value/NoSQL systems offer BASE
[Basically Available Soft-state Eventual Consistency]
Eventual consistency, and a variety of other consistency
models striving towards strong consistency
We have also discussed the design of Cassandra and
different consistency solutions.
Dr. Rajiv Misra
Dept. of Computer Science & Engg.
Indian Institute of Technology Patna
[email protected]
Preface
In this lecture, we will discuss CQL (Cassandra Query
Language) Mapping to Cassandra's Internal Data
Structure.
• No single point of failure
• Optimized for availability (though "tunably" consistent)
• Optimized for writes
• Easily maintainable
• Almost infinitely scalable
• BigTable: “Tables with millions of columns!?”
Column Name has comparator
RowKey has partitioner
No read before write – very fast! Writes block on CPU before I/O!
Read path:
• Check the Key Cache for fast SSTable search
• Get values from SSTables
• Repopulate the Row Cache
CQL creates a common language so that details of the data model can be easily communicated.
CQL is a best-practices Cassandra interface and hides the messy details.
Cassandra does not scan rows
Maps hold a list of key-value pairs
Uses the same old Cassandra data structure
Declaring: e.g., mymap map<int,int> (see the CREATE TABLE example below)
INSERT INTO mytable (row, mylist)
VALUES (123, ['apple', 'banana', 'apple']);

INSERT INTO mytable (row, mymap)
VALUES (123, {1: 'apple', 2: 'banana'});
UPDATE mytable SET mylist = mylist + ['apple', 'banana']
WHERE row = 123;

UPDATE mytable SET mylist = ['banana'] + mylist
WHERE row = 123;

UPDATE mytable SET mymap['fruit'] = 'apple'
WHERE row = 123;

UPDATE mytable SET mymap = mymap + {'fruit': 'apple'}
WHERE row = 123;
SETS
CREATE TABLE stuff (a int, b int, myset set<int>, mylist list<int>, mymap map<int,int>, PRIMARY KEY (a, b));
UPDATE stuff SET myset = {1,2}, mylist = [3,4,5], mymap = {6:7, 8:9} WHERE a = 0 AND b = 1;
SELECT * FROM stuff;

(in cassandra-cli)
use test;
list stuff;

(in cqlsh)
SELECT key_aliases, column_aliases FROM system.schema_columnfamilies
WHERE keyspace_name = 'test' AND columnfamily_name = 'stuff';
Conclusion
CQL is a reintroduction of schema
CQL creates a common data modeling language
CQL is a best-practices Cassandra interface
CQL lets you take advantage of the C* data structure
Dr. Rajiv Misra
Dept. of Computer Science & Engg.
Indian Institute of Technology Patna
[email protected]
In this lecture, we will discuss 'ZooKeeper', which is a service for coordinating processes of distributed applications.
We will discuss its basic fundamentals, design goals,
architecture and applications.
https://zookeeper.apache.org/
ZooKeeper, why do we need it?
Coordination is important
Most systems, like HDFS, have one master and a couple of slave nodes, and these slave nodes report to the master.
Fault Tolerant Distributed System
A real distributed fault-tolerant system has a coordination service, a master, and a backup master.
If the primary fails, the backup takes over.
What is a Race Condition?
A race condition occurs when two processes compete with each other, causing data corruption.
[Figure: Person A and Person B both depositing into the same bank account]
As shown in the diagram, two persons are trying to deposit Rs. 1 online into the same bank account. The initial amount is Rs. 17. Due to the race condition, the final amount in the bank is Rs. 18 instead of Rs. 19.
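The same race in a few lines of Java (the Account class is hypothetical, purely for illustration): both threads read balance 17, add 1, and write back 18.

// Illustrative unsynchronized read-modify-write: two threads can interleave here.
class Account {
    private int balance = 17;

    void deposit(int amount) { balance = balance + amount; }   // not atomic

    // Marking deposit() as synchronized would serialize the two updates
    // and give the expected final balance of 19.
    int balance() { return balance; }

    public static void main(String[] args) throws InterruptedException {
        Account acct = new Account();
        Thread a = new Thread(() -> acct.deposit(1));   // Person A
        Thread b = new Thread(() -> acct.deposit(1));   // Person B
        a.start(); b.start();
        a.join(); b.join();
        System.out.println(acct.balance());             // may print 18 instead of 19
    }
}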
[Figure: Process 1 is waiting for Process 2, Process 2 is waiting for Process 3, and Process 3 is waiting for Process 1]
Here, Process 1 is waiting for Process 2, Process 2 is waiting for Process 3 to finish, and Process 3 is waiting for Process 1 to finish. All three processes would keep waiting and never finish. This is called a deadlock.
What is Coordination ?
How would email processors avoid reading the same emails?
Suppose there is an inbox from which we need to index emails. Indexing is a heavy process and might take a lot of time.
Here, we have multiple machines which are indexing the emails. Every email has an id. You cannot delete any email; you can only read an email and mark it read or unread.
Now, how would you handle the coordination between multiple indexer processes so that every email is indexed?
[Figure: the email table has columns Email Id, Timestamp, Subject, Status]
The indexers could be written in any programming language. But since there are multiple processes running on multiple machines which need to coordinate, we need a central storage. This central storage should be safe from all concurrency-related problems. This central storage is exactly the role of ZooKeeper.
Dynamic configuration: multiple services are joining, communicating, and leaving in a cluster (service lookup registry)
Status monitoring: monitoring various processes and services
ZooKeeper stores the coordination data of distributed applications.
Key attributes of such data:
• Small size
• Performance sensitive
• Dynamic
• Critical
In very simple words, it is a central key-value store using which distributed systems can coordinate. Since it needs to be able to handle the load, ZooKeeper itself runs on many machines.
What is ZooKeeper ?
Exposes a simple set of primitives
Very easy to program
Uses a data model like directory tree
Used for
Synchronisation
Locking
Maintaining Configuration
Coordination service that does not suffer from
Race Conditions
Dead Locks
High performance:
• Data is kept in-memory
• Achieves high throughput and low latency numbers
Used in large, distributed systems
Highly available:
• No single point of failure
Strictly ordered access:
• Synchronisation
Design Goals: 2. Replicated
• All servers have a copy of the state in memory
• A leader is elected at startup
• Followers service clients, all updates go through leader
• Update responses are sent when a majority of servers have
persisted the change
We need 2f+1 machines to tolerate f failures
Clients send heartbeats; if the connection breaks, the client connects to a different server.
The servers:
• Know each other
• Keep an in-memory image of state
• Keep transaction logs & snapshots persistently
Each update is stamped with a number:
• The number reflects the order of transactions
• It is used to implement higher-level abstractions, such as synchronization primitives
At Yahoo!, where it was created, the throughput for a ZooKeeper cluster has been benchmarked at over 10,000 operations per second for write-dominant workloads generated by hundreds of clients.
Znode: we store data in an entity called a znode.
JSON data: the data that we store should be in JSON format (JavaScript Object Notation).
No append operation: a znode can only be updated; it does not support append operations.
Data access (read/write) is atomic: a read or write either completes fully or throws an error if it fails; there is no intermediate state like a half-written value.
Znode: Can have children
The znode "/zoo" is a child of "/", which is the top-level znode.
"duck" is a child znode of "zoo"; it is denoted as /zoo/duck.
Ephemeral
An ephemeral node gets deleted if the session in which the node was created disconnects. Though it is tied to the client's session, it is visible to other users.
An ephemeral node cannot have children, not even ephemeral children.
Sequential znodes have a monotonically increasing counter appended to their name, e.g.:
create -s /zoo v
Created /zoo0000000008
create -s /xyz v
Created /xyz0000000009
create -s /zoo/ v
Created /zoo/0000000003
create -s /zoo/ v
Created /zoo/0000000004
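The same behavior can be driven from the ZooKeeper Java client; a minimal sketch (the connection string and paths are placeholders, and the /zoo parent znode is assumed to exist):

import org.apache.zookeeper.*;

// Creating a node with EPHEMERAL_SEQUENTIAL appends a counter to the name
// and ties the node's lifetime to this client's session.
public class SequentialZnodeDemo {
    public static void main(String[] args) throws Exception {
        ZooKeeper zk = new ZooKeeper("localhost:2181", 5000, event -> { });

        String path = zk.create("/zoo/member-", new byte[0],
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL_SEQUENTIAL);

        System.out.println("Created " + path);    // e.g., /zoo/member-0000000003
        zk.close();                               // session ends: the ephemeral node disappears
    }
}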
(i) Standalone:
• In standalone mode, ZooKeeper runs on just one machine; for practical purposes we do not use standalone mode.
• It is only for testing purposes; it doesn't have high availability.
(ii) Replicated:
• Runs on a cluster of machines called an ensemble.
• High availability.
• Tolerates failures as long as a majority of the machines are up.
Architecture: Phase 1
Phase 1: Leader election (Paxos-style algorithm) within the ensemble
• The machines elect a distinguished member - the leader.
• The others are termed followers.
• This phase is finished when a majority have synced their state with the leader.
• If the leader fails, the remaining machines hold an election; this takes about 200 ms.
• If a majority of the machines aren't available at any point of time, the leader automatically steps down.
Architecture: Phase 2
Phase 2: Atomic broadcast
[Figure: a client sends a write to the leader, which broadcasts it to four followers; once 3 out of 4 have saved it, the write is successful]
• All write requests are forwarded to the leader.
• The leader broadcasts the update to the followers.
• When a majority have persisted the change:
– The leader commits the update.
– The client gets a success response.
• The protocol for achieving consensus is atomic, like two-phase commit.
• Machines write to disk before updating the in-memory state.
Election in Zookeeper
Centralized service for maintaining configuration
information
Uses a variant of Paxos called Zab (Zookeeper Atomic
Broadcast)
Needs to keep a leader elected at all times
Reference: http://zookeeper.apache.org/
[Figure: servers N5, N6, N32, N80 registered in the ZooKeeper file system; the numbers are ids; N80 is the master]
• Each server gets the highest id so far (from the ZK (ZooKeeper) file system), creates the next-higher id, and writes it into the ZK file system.
• Elect the highest-id server as leader.
[Figure: the master N80 crashes; N5, N6, N32 notice]
• Each server monitors the current master (directly or via a failure detector).
• On failure, initiate an election.
• This leads to a flood of elections - too many messages.
[Figure: processes N3, N5, N6, N12, N32, N80; each one monitors the process with the next-higher id]
• Each process monitors its next-higher id process.
• If that successor was the leader and it has failed:
– Become the new leader.
• Else:
– Wait for a timeout, and check your successor again.
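The standard ZooKeeper election recipe applies the same "watch one neighbor" idea using ephemeral sequential znodes; a sketch with the ZooKeeper Java API (the /election parent znode, names, and error handling are assumptions):

import org.apache.zookeeper.*;
import org.apache.zookeeper.data.Stat;
import java.util.Collections;
import java.util.List;

// Each candidate creates an ephemeral sequential znode; the lowest sequence number is
// the leader, and every other candidate watches only its predecessor.
public class LeaderElection {
    private final ZooKeeper zk;
    private String myNode;                       // e.g. /election/n_0000000007

    LeaderElection(ZooKeeper zk) { this.zk = zk; }

    void volunteer() throws KeeperException, InterruptedException {
        myNode = zk.create("/election/n_", new byte[0],
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL_SEQUENTIAL);
        checkLeadership();
    }

    private void checkLeadership() throws KeeperException, InterruptedException {
        List<String> children = zk.getChildren("/election", false);
        Collections.sort(children);
        int idx = children.indexOf(myNode.substring("/election/".length()));
        if (idx == 0) {
            System.out.println("I am the leader: " + myNode);
            return;
        }
        String predecessor = "/election/" + children.get(idx - 1);
        Stat stat = zk.exists(predecessor, event -> {      // watch only the predecessor
            try { checkLeadership(); } catch (Exception ignored) { }
        });
        if (stat == null) checkLeadership();               // predecessor already gone: re-check
    }
}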
A commit protocol (run after the sequence/id step) is used to commit the leader:
• The leader sends a NEW_LEADER message to all.
• Each process responds with ACK to at most one leader, i.e., the one with the highest process id.
• The leader waits for a majority of ACKs, and then sends COMMIT to all.
• On receiving COMMIT, a process updates its leader variable.
• This ensures that safety is still maintained.
Yes. Either B or C.
No one will become Leader.
C will become Follower.
Reason: Majority is not available.
Why do we need majority?
Imagine: We have an ensemble spread over two data centres.
A client has a list of the servers in the ensemble.
It tries each until successful.
The server creates a new session for the client.
A session has a timeout period - decided by the caller.
To keep sessions alive, the client sends pings (heartbeats).
The client library takes care of heartbeats.
Sessions are still valid on switching to another server.
Failover is handled automatically by the client.
The application can't remain agnostic of server reconnections, because operations will fail during disconnection.
Use Case: Many Servers How do they Coordinate?
From time to time some of the servers will keep going down. How can all of the clients keep track of the available servers?
[Figure: servers and clients coordinating through ZooKeeper]
Use Case: Many Servers How do they Coordinate?
It is very easy using ZooKeeper as a central agency. Each server will create its own ephemeral znode under a particular znode, say "/servers". The clients would simply query ZooKeeper for the most recent list of servers.
[Figure: clients ask ZooKeeper "Available servers?"; the servers register themselves with ZooKeeper]
Use Case: Many Servers How do they Coordinate?
Let's take the case of two servers and a client. The two servers, duck and cow, created their ephemeral nodes under the "/servers" znode. The client would simply discover the alive servers cow and duck using the command ls /servers.
Say the server called "duck" goes down: its ephemeral node will disappear from the /servers znode, and hence the next time the client comes and queries, it would only get "cow". So coordination has been heavily simplified and made efficient because of ZooKeeper.
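The same pattern in the ZooKeeper Java API (a sketch; the "/servers" parent znode is assumed to exist as a persistent node, and server names and connection handling are illustrative):

import org.apache.zookeeper.*;
import java.util.List;

public class ServiceDiscovery {
    // Server side: register an ephemeral node that disappears when this server dies.
    static void register(ZooKeeper zk, String name) throws KeeperException, InterruptedException {
        zk.create("/servers/" + name, new byte[0],
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
    }

    // Client side: read the current list of alive servers and watch for changes.
    static List<String> aliveServers(ZooKeeper zk) throws KeeperException, InterruptedException {
        return zk.getChildren("/servers",
                event -> System.out.println("Server list changed: " + event.getType()));
    }
}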
Guarantees
Sequential consistency: updates from any particular client are applied in the order they were sent.
Atomicity: updates either succeed or fail.
Single system image: a client will see the same view of the system regardless of the server it connects to; a new server will not accept the connection until it has caught up.
Durability: once an update has succeeded, it persists.
OPERATION: DESCRIPTION
create: creates a znode (the parent znode must exist)
delete: deletes a znode (it mustn't have children)
exists/ls: tests whether a znode exists & gets its metadata
getACL, setACL: gets/sets the ACL for a znode
getChildren/ls: gets a list of the children of a znode
getData/get, setData: gets/sets the data associated with a znode
sync: synchronizes a client's view of a znode with ZooKeeper
Multi-update: batches multiple operations together.
• Either all fail or succeed in their entirety.
• Possible to implement transactions.
• Others never observe any inconsistent state.
Synchronous:
public Stat exists(String path, Watcher watcher) throws KeeperException, InterruptedException
Asynchronous:
public void exists(String path, Watcher watcher, StatCallback cb, Object ctx)
Watches are one-time triggers; for multiple notifications, re-register.
Watches: a watch set by … is triggered when the znode is …
• exists: created, deleted, or its data updated
• getData: deleted, or has its data updated
• getChildren: deleted, or any of its children is created or deleted
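A sketch of the re-registration pattern with the Java API (the "/config" path is a placeholder):

import org.apache.zookeeper.*;
import org.apache.zookeeper.data.Stat;

// The watch set by exists() fires once, so the callback re-registers it.
public class ConfigWatcher implements Watcher {
    private final ZooKeeper zk;

    ConfigWatcher(ZooKeeper zk) { this.zk = zk; }

    void watchConfig() throws KeeperException, InterruptedException {
        Stat stat = zk.exists("/config", this);   // triggered on create/delete/data update
        System.out.println("/config " + (stat == null ? "does not exist yet" : "exists"));
    }

    @Override
    public void process(WatchedEvent event) {
        System.out.println("Event on " + event.getPath() + ": " + event.getType());
        try {
            watchConfig();                         // re-register for the next notification
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}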
ACLs: each ACL entry specifies
• an authentication scheme,
• an identity for that scheme,
• and a set of permissions.
Authentication schemes:
• digest - the client is authenticated by a username & password.
• sasl - the client is authenticated using Kerberos.
• ip - the client is authenticated by its IP address.
Locks: only a single process may hold the lock at a time.
• Network load of transferring all data to all nodes
2. Extremely strong consistency
ZooKeeper Applications: The Fetching Service (FS)
• The master provides the fetchers with configuration, and the fetchers write back informing it of their status and health.
• The main advantages of using ZooKeeper for FS are recovering from failures of masters, guaranteeing availability despite failures, and decoupling the clients from the servers, allowing them to direct their requests to healthy servers by just reading their status from ZooKeeper.
ZooKeeper Applications: Katta
• Slaves can fail, so the master must redistribute load as slaves come and go.
• The master can also fail, so other servers must be ready to take over in case of failure.
• Katta uses ZooKeeper to track the status of the slave servers and the master (group membership), and to handle master failover (leader election).
• Katta also uses ZooKeeper to track and propagate the assignments of shards to slaves (configuration management).
ZooKeeper Applications: Yahoo! Message Broker (YMB)
• Topics are distributed among a set of servers to provide scalability.
• Each topic is replicated using a primary-backup scheme that ensures messages are replicated to two machines, to ensure reliable message delivery.
• The servers that make up YMB use a shared-nothing distributed architecture, which makes coordination essential for correct operation.
• YMB uses ZooKeeper to manage the distribution of topics (configuration metadata), deal with failures of machines in the system (failure detection and group membership), and control system operation.
ZooKeeper Applications: Yahoo! Message Broker
The figure shows part of the znode data layout for YMB.
[Figure: the layout of Yahoo! Message Broker (YMB) structures in ZooKeeper]
• Each broker domain has a znode called nodes that has an ephemeral znode for each of the active servers that compose the YMB service.
• Each YMB server creates an ephemeral znode under nodes with load and status information, providing both group membership and status information through ZooKeeper.
ZooKeeper Applications: Yahoo! Message Broker
• The topics directory has a child znode for each topic managed by YMB.
• These topic znodes have child znodes that indicate the primary and backup server for each topic, along with the subscribers of that topic.
• The primary and backup server znodes not only allow servers to discover the servers in charge of a topic, but they also manage leader election and server crashes.
[Figure: the layout of Yahoo! Message Broker (YMB) structures in ZooKeeper]
See: https://zookeeper.apache.org/
ZooKeeper achieves throughput values of hundreds of thousands of operations per second for read-dominant workloads by using fast reads with watches, both of which are served by local replicas.
In this lecture, we have discussed the basic fundamentals,
design goals, architecture and applications of ZooKeeper.