
ZOOKEEPER
Ken Birman, Spring 2018
CS5412: http://www.cs.cornell.edu/courses/cs5412/2018sp
CLOUD SYSTEMS HAVE MANY "FILE SYSTEMS"
Before we discuss Zookeeper, let's think about file systems. Clouds have many!
One is for bulk storage: some form of "global file system" or GFS.
 At Google, it is actually called GFS.
 At Amazon, S3 plays this role.
 Azure uses the "Azure storage fabric".
These often offer built-in block replication through a Linux feature, but the guarantees are somewhat weak.
HOW DO THEY WORK?
 A "Name Node" service runs, fault-tolerantly, and tracks file meta-data (like a Linux inode): name, create/update time, size, seek pointer, etc.
 The name node tells you which data nodes hold your file.
 It is very common to use a simple DHT scheme to fragment the NameNode into subsets, hopefully spreading the work around. DataNodes are hashed at the block level (large blocks).
 Some form of primary/backup scheme is used for fault-tolerance. Writes are automatically forwarded from the primary to the backup.
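As a rough illustration of the hashing idea (not the actual placement scheme of GFS, HDFS, or any particular system), the sketch below hashes a file path to a NameNode shard and each block of the file to a DataNode. The class name, shard count, and node names are hypothetical; real systems add replication and rebalancing on top.

```java
// Hypothetical sketch of DHT-style placement: one NameNode shard owns the
// metadata for a path, and each large block of the file hashes to a DataNode.
import java.util.List;

public class BlockPlacementSketch {
    private final int nameNodeShards;
    private final List<String> dataNodes;

    public BlockPlacementSketch(int nameNodeShards, List<String> dataNodes) {
        this.nameNodeShards = nameNodeShards;
        this.dataNodes = dataNodes;
    }

    // Which NameNode shard holds the metadata for this path?
    public int nameNodeShardFor(String path) {
        return Math.floorMod(path.hashCode(), nameNodeShards);
    }

    // Which DataNode holds block number 'blockIndex' of this file?
    public String dataNodeFor(String path, long blockIndex) {
        int h = Math.floorMod((path + "#" + blockIndex).hashCode(), dataNodes.size());
        return dataNodes.get(h);
    }

    public static void main(String[] args) {
        BlockPlacementSketch p = new BlockPlacementSketch(4, List.of("dn1", "dn2", "dn3"));
        System.out.println(p.nameNodeShardFor("/videos/cam7/clip.mp4"));
        System.out.println(p.dataNodeFor("/videos/cam7/clip.mp4", 12));
    }
}
```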
HOW DO THEY WORK? (continued)
[Figure: a client's "open" request goes to the NameNode, which holds the file metadata and returns a copy of it; the client then reads the file data directly from the DataNodes that store its blocks.]
GLOBAL FILE SYSTEMS: SUMMARY
Pros:
1. Scales well even for massive objects.
2. Works well for large sequential reads/writes, etc.
3. Provides high performance (massive throughput).
4. Simple but robust reliability model.
Cons:
1. NameNode (master) can become overloaded, especially if individual files become extremely popular.
2. NameNode is a single point of failure.
3. A slow NameNode can impact the whole data center.
4. Concurrent writes to the same file can interfere with one another.
EDGE/FOG USE CASE
Building a smart highway: like an air traffic control system, but for cars.
 We want to make sure there is always exactly one controller for each runway at the airport.
 We need to be reliable: if the primary controller crashes, the backup takes over within a few seconds.
COULD WE USE A GLOBAL FILE SYSTEM?
Yes, for images and videos of the cars.
 We could capture them on a cloud of tier-one data collectors.
 Store the data into the global file system, then run image analysis on the files.
WHAT ABOUT THE CONSISTENCY ASPECT?
Consider the "configuration" of our application, which implements this control behavior.
 We need to track which machines are operational and what roles they have.
 We want a file that will hold a table of resources and roles.
 Every application in the system would track the contents of this file.
 So… this file is in the cloud file system! But which file system?
CLOUD FILE SYSTEM LIMITATIONS
[Figure: a timeline of status-log events: "A, B and C are running", "A is primary, C is backup", "D crashed", "A has restarted".]
Consider this simple approach:
 We maintain a log of computer status events: "crashed", "recovered", …
 The log is append-only. When you sense something, do a write to the end of the log.
 Issue: if two events occur more or less at the same time, one can overwrite the other, hence one might be lost.
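To make the failure mode concrete, here is a minimal sketch of the naive "append by read-modify-write" pattern that a weakly consistent object store forces on clients. The in-memory map stands in for an S3-like blob store and the key and event strings are made up; the point is only that two clients running this concurrently can silently lose one of the appends.

```java
// Hypothetical sketch: appending to a log stored as a single object.
// Two clients doing this concurrently can overwrite each other's append.
import java.nio.charset.StandardCharsets;
import java.util.concurrent.ConcurrentHashMap;

public class NaiveLogAppend {
    // Stand-in for an eventually consistent object store (e.g., an S3-like API).
    static final ConcurrentHashMap<String, byte[]> store = new ConcurrentHashMap<>();

    static void appendStatusEvent(String logKey, String event) {
        byte[] old = store.getOrDefault(logKey, new byte[0]);            // 1. read the current log
        String merged = new String(old, StandardCharsets.UTF_8) + event + "\n"; // 2. append locally
        store.put(logKey, merged.getBytes(StandardCharsets.UTF_8));      // 3. write the whole object back
        // If another client ran steps 1-3 between our step 1 and step 3,
        // its event is overwritten: the log "forgets" it ever happened.
    }

    public static void main(String[] args) {
        appendStatusEvent("status-log", "B reports: A crashed");
        appendStatusEvent("status-log", "D launched");
        System.out.println(new String(store.get("status-log"), StandardCharsets.UTF_8));
    }
}
```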
OVERWRITES CAUSE INCONSISTENCY!
We have discussed the concept of consistency many times, but haven't really spent much time on its evil nemesis, inconsistency.
 If we are logging versions of flight plans, when one update overwrites a second one, some machines might have "seen" the lost one and be in a state different from others. This is an example of a split-brain problem.
 If we are logging the status of machines, some machines may think that C crashed, but others never saw this message. (Worst case: maybe C really is up, and the original log report was due to a transient timeout… but now half our system thinks C is up, and half thinks C is down: another split-brain scenario.)
AVOIDING "SPLIT BRAIN"
The name comes from the title of an old science-fiction movie.
 We must never have two active controllers simultaneously, or two different versions of the same flight plan record that use the same version id.
 This one kind of mistake can turn into many kinds of risky problems that we would never tolerate in an ATC system. So we must avoid such problems entirely!
ROOT ISSUE (1)
The quality of failure sensing is limited.
 If we sense faults by noticing timeouts, we might make mistakes. Then if we reassign the role but the "faulty" machine is really still running and was just transiently inaccessible, we have two controllers!
 This problem is unavoidable in a distributed system, so we have to use agreement on membership, not "apparent" failures sensed by individual machines.
ROOT ISSUE (2)
In many systems two or more programs can try to write to the same file at the same time, or to create the same file.
 In such situations the normal Linux file system will work correctly if the programs and the files are all on one machine. Writes to the local file system won't interfere.
 But in distributed systems using global file systems, we lack this property!
HOW DOES A CONSISTENT LOG SOLVE THIS?
If you trust the log, just read log records from top to bottom: you get an unambiguous way to track membership. In effect, membership is managed in a consistent way.
 Even if a logged record is "wrong", e.g. "node 6 has crashed" but it hasn't, we are forced to agree to use that record.
 But this works only if we can trust the log. And that, in turn, depends on the file system!
 Equivalent mechanisms exist in systems like Derecho (self-managed Paxos).
WITH S3 OR GFS WE CANNOT TRUST A LOG!
These file systems don't guarantee consistency! They are unsafe with concurrent writes. Concurrent log appends could:
 Overwrite one another, so one is lost.
 Be briefly perceived out of order, or some machine might glimpse a written record that will then be erased a moment later and overwritten.
 Sometimes we can even have two conflicting versions that linger for extended periods.
EXACTLY WHAT GOES WRONG?
"Append-only log" behavior → in our application using it:
1. Machines A, B and C are running. → A is selected to be the primary controller for runway 7; C is assigned as backup.
2-a. Machine D is launched.
2-b. Concurrently, B thinks A crashed. → C notices 2-b and takes over. But A wasn't really crashed; B was wrong!
3. 2-b is overwritten by 2-a. → Log entry 2-b is gone. A keeps running.
4. A turns out to be fine, after all. → Now we have A and C both in the "control runway 7" role: a split brain!
WHY CAN'T WE JUST FIX THE BUG?
First, be aware that S3, GFS and similar systems are perfectly fine for object storage by a single, non-concurrent writer.
 If nervous, take one more step and add a "self-certifying signature".
 SHA3 hash codes are common for this: very secure and robust. But there are many options; even simple parity codes can help.
The reason they don't handle concurrent writes well is that the required protocol is slower than the current weak consistency model.
ZOOKEEPER: A SOLUTION FOR THIS ISSUE
The need in many systems is for a place to store configuration, parameters, lists of which machines are running, which nodes are "primary" or "backup", etc.
 We desire a file system interface, but with "strong, fault-tolerant semantics".
 Zookeeper is widely used in this role. It offers stronger guarantees than GFS.
 Data lives in (small) files. Zookeeper is quite slow and not very scalable.
 But even so, it is useful for many purposes, even locking and synchronization.
SHOULD I USE ZOOKEEPER FOR EVERYTHING?
Zookeeper isn't for long-term storage, or for large objects. Put those in the GFS, then share the URL, which is small.
 Use Zookeeper for the small files used for distributed coordination, synchronization, or configuration data (small objects).
 Mostly, try to have Zookeeper handle "active" configuration, so it won't need to persist data to a disk at all.
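As a hedged illustration of this "big data in the GFS, small pointer in Zookeeper" split: the sketch below stores a large object in a hypothetical blob store and keeps only its URL in a znode. The BlobStore interface, the znode path, and the helper name are made up; the ZooKeeper calls are the standard Java client API.

```java
// Sketch: the large object goes to bulk storage, and only its small locator
// (a URL string) is kept in a znode that other components can read or watch.
import java.nio.charset.StandardCharsets;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class PointerInZk {
    interface BlobStore {            // hypothetical stand-in for S3/GFS-style storage
        String put(byte[] data);     // returns a URL for the stored object
    }

    static void publish(ZooKeeper zk, BlobStore store, byte[] bigObject) throws Exception {
        String url = store.put(bigObject);                      // heavy data goes to the object store
        byte[] pointer = url.getBytes(StandardCharsets.UTF_8);  // tiny pointer goes to Zookeeper
        // Created once here; later updates would use setData on the same path.
        zk.create("/config/latest-model", pointer,
                  ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
    }
}
```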
ZOOKEEPER DURABILITY LIMITATION
Zookeeper is mostly used with no real persistence guarantee, meaning that if we shut it down completely, it normally loses any "state".
 There is a checkpointing mechanism, but it is not fully synchronized with file updates. Recent updates might not yet have been checkpointed.
 The developers view this as a tradeoff for high performance.
 Normally, it is configured to run once every 5s.
 Many applications simply leave Zookeeper running; if it shuts down, the whole application must shut down, then restart.
HOW DOES ZOOKEEPER WORK?
Zookeeper has a layered implementation.
 The health of all components is tracked, so that we know who is up and who has crashed. In Derecho, this is called membership status.
 The Zookeeper meta-data layer is a single program: consistent by design.
 The Zookeeper data replication layer uses an atomic multicast to ensure that all replicas are in the same state.
 For long-term robustness, they checkpoint periodically (every 5s) and restart from the checkpoint.
DETAILS ON ZOOKEEPER (THIS IS THEIR SLIDE SET AND REPEATS SOME PARTS OF MINE)
Hunt, P., Konar, M., Junqueira, F.P. and Reed, B., 2010. ZooKeeper: Wait-free Coordination for Internet-scale Systems. In USENIX Annual Technical Conference (Vol. 8, p. 9).
 Several other papers can be found on https://scholar.google.com.
ZOOKEEPER GOALS
 An open-source platform created to behave like a simple file system.
 Easy to use and fault-tolerant. A bit slow, but not to a point of being an issue.
 Unlike standard cloud computing file systems, it provides strong guarantees.
 Employed in many cloud computing platforms as a quick way to solve coordination and configuration problems.
HOW IS ZOOKEEPER USED TODAY?
Naming service: identifying the nodes in a cluster by name. Similar to DNS, but for nodes.
Configuration management: latest and up-to-date configuration information of the system for a joining node.
Cluster management: joining/leaving of a node in a cluster and node status in real time.
Leader election: electing a node as leader for coordination purposes.
Locking and synchronization service: locking the data while modifying it. This mechanism helps with automatic failure recovery when connecting other distributed applications, like Apache HBase.
Highly reliable data registry: availability of data even when one or a few nodes are down.
Recall the Amazon µ-Services example.
[Figure: a browser connects over HTTPS to a Web Interface Server (Client SDK), which talks over HTTP or TCP/IP to an Application Server (Server SDK, Kafka or SQS, resource plugins), which fans out to µ-services such as Shipment Planner, Billing, Packing, and Mailing Labels. In a µ-services system, these components need resource management and scheduling.]
A SIMPLE µ-SERVICE SYSTEM
[Figure: N jobs arrive at an API Gateway / UI Server & Router, which routes them to replicated Microservices 1, 2 and 3.]
Some questions that might arise:
• Is Replica 2 of Microservice 3 up and running?
• Do I have at least one service running?
• Microservice 3 uses Master-Worker, and the Master just failed. What do I do?
• Replica 2 needs to find its configuration information. Where?
APACHE ZOOKEEPER AND µ-SERVICES
Zookeeper can manage information in your system:
 IP addresses, version numbers, and other configuration information of your microservices.
 The health of the microservices.
 The state of a particular calculation.
 Group membership.
APACHE ZOOKEEPER IS…
A system for solving distributed coordination problems for multiple cooperating clients.
A lot like a distributed file system...
 As long as the files are tiny.
 You can get notified when a file changes.
 The full file pathname is meaningful to applications.
A way to solve µ-service management problems.
THE ZOOKEEPER SERVICE
Zookeeper is itself an interesting distributed system (your microservices are its clients).
 The ZooKeeper service is replicated over a set of machines.
 All machines store a copy of the data in memory (!). It is checkpointed to disk if you wish.
 A leader is elected on service startup.
 Clients connect to a single ZooKeeper server and maintain a TCP connection.
 A client can read from any ZooKeeper server.
 Writes go through the leader and need majority consensus.
https://cwiki.apache.org/confluence/display/ZOOKEEPER/ProjectDescription
ZOOKEEPER SUMMARY
ZooKeeper provides a simple and high-performance kernel for building more complex coordination primitives.
 It helps distributed components share information.
 Clients (your applications) contact Zookeeper services to read and write metadata.
 Reads come from a cache, but writes are more complicated.
 Sometimes, just the existence and name of a node are enough.
ZOOKEEPER SUMMARY (CONT.)
Tree model for organizing information into nodes.
 Node names may be all you need.
 Lightly structured metadata is stored in the nodes.
Wait-free aspects of shared registers, with an event-driven mechanism similar to the cache invalidations of distributed file systems.
Targets simple metadata systems that read more than they write.
 Small total storage.
ZOOKEEPER, MORE BRIEFLY
Zookeeper clients (that is, your applications) can create and discover nodes on Zookeeper trees.
Clients can put small pieces of data into the nodes and get small pieces out.
 1 MB max data per znode by default.
 Each node also has built-in metadata, like its version number.
You could build a small DNS with Zookeeper.
Some simple analogies: lock files and .pid files on Linux systems.
ZNODES
Znodes maintain file meta-data, with version numbers for data changes and ACL changes, and timestamps.
 Version numbers increase with changes.
 Data is read and written in its entirety.
ZNODE TYPES
Regular: clients create and delete them explicitly.
Ephemeral: like regular znodes, but associated with sessions; deleted when the session expires.
Sequential: a property of regular and ephemeral znodes; a universal, monotonically increasing counter is appended to the name.
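For reference, here is a hedged mapping of these types onto the Java client's CreateMode constants; the /demo paths and empty data are placeholders, and the sketch assumes the /demo parent znode already exists.

```java
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZnodeTypes {
    static void examples(ZooKeeper zk) throws Exception {
        byte[] none = new byte[0];
        // Regular: lives until a client deletes it explicitly.
        zk.create("/demo/regular", none, ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        // Ephemeral: removed automatically when this client's session expires.
        zk.create("/demo/ephemeral", none, ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
        // Sequential: a monotonically increasing counter is appended to the name,
        // e.g. /demo/seq-0000000007 (available for both persistent and ephemeral nodes).
        String actual = zk.create("/demo/seq-", none,
                                  ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT_SEQUENTIAL);
        System.out.println(actual);
    }
}
```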
ZOOKEEPER API (1/2)
create(path, data, flags): creates a znode with path name path, stores data[] in it, and returns the name of the new znode.
 flags enables a client to select the type of znode (regular or ephemeral) and to set the sequential flag.
delete(path, version): deletes the znode path if that znode is at the expected version.
exists(path, watch): returns true if the znode with path name path exists, and returns false otherwise.
 Note the watch flag.
ZOOKEEPER API (2/2)
getData(path, watch): returns the data and meta-data, such as version information, associated with the znode.
setData(path, data, version): writes data[] to znode path if the version number is the current version of the znode.
getChildren(path, watch): returns the set of names of the children of a znode.
sync(path): waits for all updates pending at the start of the operation to propagate to the server that the client is connected to.
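A small, hedged tour of the corresponding Java client calls (org.apache.zookeeper.ZooKeeper). The connect string, paths, and data below are made up; error handling is minimal. Note how setData's version argument gives optimistic concurrency: the write succeeds only if nobody changed the znode since we read it.

```java
import java.nio.charset.StandardCharsets;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

public class ZkApiTour {
    public static void main(String[] args) throws Exception {
        // Hypothetical ensemble address; 5000 ms session timeout; no-op default watcher.
        ZooKeeper zk = new ZooKeeper("zk1:2181,zk2:2181,zk3:2181", 5000, event -> {});

        // create: a regular (persistent) znode holding a little data.
        zk.create("/demo", "hello".getBytes(StandardCharsets.UTF_8),
                  ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);

        // getData: read the data and its metadata (version, timestamps).
        Stat stat = new Stat();
        byte[] data = zk.getData("/demo", false, stat);
        System.out.println(new String(data, StandardCharsets.UTF_8) + " v" + stat.getVersion());

        // setData: conditional write, succeeds only if the znode is still at the
        // version we read (optimistic concurrency).
        zk.setData("/demo", "world".getBytes(StandardCharsets.UTF_8), stat.getVersion());

        // getChildren and delete round out the tour.
        System.out.println(zk.getChildren("/", false));
        zk.delete("/demo", -1);   // -1 means "any version"
        zk.close();
    }
}
```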
CONFIGURATION MANAGEMENT
All clients get their configuration information from a named znode, e.g. /root/config-me.
 Example: you can build a public key store with Zookeeper.
 Clients set watches to see if the configuration changes (a client-side sketch follows).
Zookeeper doesn't explicitly decide which clients are allowed to update the configuration.
 That would be an implementation choice.
 Zookeeper uses a leader-follower model internally, so you could model your own implementation after this.
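Here is a minimal sketch of the client side, assuming the /root/config-me znode from the slide already exists. ZooKeeper watches fire only once, so the client re-reads the znode (and thereby re-arms the watch) on every change.

```java
// Sketch of a client that tracks /root/config-me and re-arms its watch on each change.
import java.nio.charset.StandardCharsets;
import org.apache.zookeeper.Watcher.Event.EventType;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

public class ConfigWatcher {
    private final ZooKeeper zk;
    private volatile String currentConfig;

    public ConfigWatcher(ZooKeeper zk) { this.zk = zk; }

    public void start() throws Exception {
        readAndWatch();
    }

    private void readAndWatch() throws Exception {
        Stat stat = new Stat();
        byte[] data = zk.getData("/root/config-me", event -> {
            // When the znode's data changes, re-read it and re-register the watch.
            if (event.getType() == EventType.NodeDataChanged) {
                try { readAndWatch(); } catch (Exception e) { e.printStackTrace(); }
            }
        }, stat);
        currentConfig = new String(data, StandardCharsets.UTF_8);
        System.out.println("config v" + stat.getVersion() + ": " + currentConfig);
    }
}
```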
THE RENDEZVOUS PROBLEM
A classic distributed computing problem.
Consider master-worker:
 Specific configurations may not be known until runtime, e.g. IP addresses and port numbers.
 Workers and the master may start in any order.
Zookeeper implementation (sketched below):
 Create a rendezvous node: /root/rendezvous.
 Workers read /root/rendezvous and set a watch.
 If it is empty, they use the watch to detect when the master posts its configuration information.
 The master fills in its configuration information (host, port).
 Workers are notified of the content change and get the configuration information.
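A hedged sketch of that recipe using the standard Java client: /root/rendezvous comes from the slide, the host:port string format is an assumption, and a production version would also handle session loss and retries.

```java
// Sketch of the rendezvous recipe: the master posts its host:port under
// /root/rendezvous; workers read it, or watch until it appears.
import java.nio.charset.StandardCharsets;
import java.util.concurrent.CountDownLatch;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.Watcher.Event.EventType;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class Rendezvous {
    // Master side: publish its address (assumed "host:port" string) once it is up.
    static void masterPublish(ZooKeeper zk, String hostPort) throws Exception {
        zk.create("/root/rendezvous", hostPort.getBytes(StandardCharsets.UTF_8),
                  ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
    }

    // Worker side: block until the master's configuration shows up, then read it.
    static String workerWait(ZooKeeper zk) throws Exception {
        CountDownLatch created = new CountDownLatch(1);
        // exists() with a watcher fires when the node is created.
        if (zk.exists("/root/rendezvous", event -> {
                if (event.getType() == EventType.NodeCreated) created.countDown();
            }) == null) {
            created.await();   // wait for the NodeCreated event
        }
        return new String(zk.getData("/root/rendezvous", false, null), StandardCharsets.UTF_8);
    }
}
```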
LOCKS
Familiar analogy: lock files used by Apache HTTPD and MySQL processes.
Zookeeper example: who is the leader with the primary copy of the data?
Implementation:
 The leader creates an ephemeral file: /root/leader/lockfile.
 Other would-be leaders place watches on the lock file.
 If the leader client dies or doesn't renew the lease, clients can attempt to create a replacement lock file.
Use SEQUENTIAL to solve the herd-effect problem (a sketch follows this list):
 Create a sequence of ephemeral child nodes.
 Clients only watch the node immediately ahead of them in the sequence.
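A minimal sketch of that SEQUENTIAL recipe: each contender creates an ephemeral+sequential child under a lock directory, the lowest sequence number holds the lock, and everyone else watches only its immediate predecessor. The lockDir path is hypothetical, and in practice most deployments use a packaged recipe (e.g. Apache Curator) rather than hand-rolling this.

```java
// Herd-effect-free lock sketch: watch only the node immediately ahead of you.
import java.util.Collections;
import java.util.List;
import java.util.concurrent.CountDownLatch;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class SeqLockSketch {
    static void acquire(ZooKeeper zk, String lockDir) throws Exception {
        // Ephemeral: the node vanishes if our session dies, releasing the lock.
        String me = zk.create(lockDir + "/lock-", new byte[0],
                              ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL_SEQUENTIAL);
        String myName = me.substring(me.lastIndexOf('/') + 1);

        while (true) {
            List<String> children = zk.getChildren(lockDir, false);
            Collections.sort(children);
            int myIndex = children.indexOf(myName);
            if (myIndex == 0) return;   // lowest sequence number: we hold the lock

            // Watch only the node just ahead of us, avoiding the herd effect.
            String predecessor = lockDir + "/" + children.get(myIndex - 1);
            CountDownLatch gone = new CountDownLatch(1);
            if (zk.exists(predecessor, event -> gone.countDown()) != null) {
                gone.await();           // wait for the predecessor to disappear, then re-check
            }
        }
    }
}
```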
Zookeeper is popular in science-computing gateways.
[Figure: a browser connects over HTTPS to a Web Interface Server (Client SDK), which talks over HTTP or TCP/IP to a "super" scheduling and resource-management Application Server (Server SDK, resource plugins). Behind it sit clusters with different architectures, schedulers, and administrative domains: Karst (MOAB/Torque), Stampede (SLURM), Comet (SLURM), Jureca (SLURM). In a micro-service architecture, these also need scheduling.]
With -
Services we API Server

often
replicate Application Metadata
components. Manager Server

API Server
API Server
API Server
API Server

Application Metadata
Application Metadata
Manager
Application Server
Metadata
Manager
Application Server
Metadata
Manager
Application Server
Metadata
Manager
Application Server
Metadata
Manager Server
Manager Server
WHY IS THIS FORM OF REPLICATION NEEDED?
Fault tolerance.
Increased throughput, load balancing.
Component versions:
 Not all components of the same type need to be on the same version.
 Backward-compatibility checking.
Component flavors:
 Application managers can serve different types of resources.
 It is useful to separate them into separate processes if their libraries conflict.
CONFIGURATION MANAGEMENT
Problem: gateway components in a distributed system need to get the correct configuration file.
Solution: components contact Zookeeper to get configuration metadata.
Comments: this includes both the component's own configuration file and the configurations of other components.
 This is the rendezvous problem again.
SERVICE DISCOVERY
Problem: Component A needs to find instances of Component B.
Solution: use Zookeeper to find the available group of member instances of Component B (see the sketch below).
 More: get useful metadata about Component B instances, like version, domain name, port #, and flavor.
Comments:
 Useful for components that need to communicate directly, but not for asynchronous communication (message queues).
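A hedged discovery sketch: list the children of a group znode and read each child's data, which the instances use to advertise their endpoint and metadata. The /services/componentB path and the "host:port;version" data format are assumptions, not anything this deck prescribes.

```java
// Sketch of discovery: enumerate instances of Component B from a group znode.
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import org.apache.zookeeper.ZooKeeper;

public class Discovery {
    static List<String> instancesOfB(ZooKeeper zk) throws Exception {
        List<String> endpoints = new ArrayList<>();
        for (String child : zk.getChildren("/services/componentB", false)) {
            byte[] meta = zk.getData("/services/componentB/" + child, false, null);
            endpoints.add(new String(meta, StandardCharsets.UTF_8));  // e.g. "host:port;version"
        }
        return endpoints;
    }
}
```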
GROUP MEMBERSHIP
Problem: a job needs to go to a specific flavor of application manager. How can this be located?
Solution: have application managers join the appropriate Zookeeper-managed group when they come up (a sketch follows).
Comments: this is useful to support scheduling.
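The join side pairs naturally with the discovery sketch above: a manager of a given flavor registers an ephemeral child under its group, so if it dies its entry disappears automatically. The /appmanagers/<flavor> path is hypothetical.

```java
// Sketch of joining a Zookeeper-managed group with an ephemeral member node.
import java.nio.charset.StandardCharsets;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class GroupJoin {
    static String join(ZooKeeper zk, String flavor, String myHostPort) throws Exception {
        String group = "/appmanagers/" + flavor;   // hypothetical group path; parent must exist
        // Ephemeral + sequential: the entry vanishes with the session, and the
        // sequence suffix gives each member a unique name.
        return zk.create(group + "/member-", myHostPort.getBytes(StandardCharsets.UTF_8),
                         ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL_SEQUENTIAL);
    }
}
```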
SYSTEM STATE FOR DISTRIBUTED SYSTEMS
Which servers are up and running? What versions?
Services that run for long periods could use ZK to indicate whether they are busy (or under heavy load) or not.
Note the overlap with our Registry:
 What state does the Registry manage? What state would be more appropriate for ZK?
LEADER ELECTION
Problem: metadata servers are replicated for read access, but only the master has write privileges. The master crashes.
Solution: use Zookeeper to elect a new metadata server leader.
Comment: this is not necessarily the best way to do this.
UNDER THE ZOOKEEPER HOOD: THE ZAB PROTOCOL
ZOOKEEPER HANDLING OF WRITES
READ requests are served by any Zookeeper server.
 This scales linearly, although the information can be stale.
WRITE requests change state, so they are handled differently:
 One Zookeeper server acts as the leader.
 The leader executes all write requests forwarded by followers.
 The leader then broadcasts the changes.
 The update is successful if a majority of Zookeeper servers have the correct state at the end of the protocol.
SOME ZOOKEEPER IMPLEMENTATION SIMPLIFICATIONS
Uses TCP for its transport layer.
 Message order is maintained by the network.
 The network is reliable?
Assumes a reliable file system.
 Logging and DB checkpointing.
SOME ZOOKEEPER IMPLEMENTATION SIMPLIFICATIONS (CONT.)
Does write-ahead logging.
 Requests are first written to the log.
 The Zookeeper DB is updated from the log.
Zookeeper servers can acquire the correct state by reading the logs from the file system.
 Reading from a checkpoint means you don't have to reread the entire history.
Assumes a single administrator, so no deep security.
Speed isn't everything: having many servers increases reliability but decreases performance.
[Figure: a cluster of 5 ZooKeeper instances responds to manually injected failures.
1. Failure and recovery of a follower.
2. Failure and recovery of another follower.
3. Failure of the leader (about 200 ms to recover).
4. Failure of two followers (4a and 4b), recovery at 4c.
5. Failure of the leader.
6. Recovery of the leader.]
STATE OF THE ART TODAY?
Zookeeper is solving a problem that Leslie Lamport formalized as the state machine replication problem. Much work has been done on this.
 The "Paxos" protocols solve this problem. Zookeeper's ZAB is similar to the Paxos concept of an "Atomic Multicast" (sometimes called "Vertical Paxos").
 But checkpointing is not the same as the true durable Paxos. Durable Paxos is like checkpointing on every operation; Zookeeper does so every 5s.
SUMMARY AND CAUTIONS
Zookeeper is powerful for system management/configuration data.
Derecho is great for ultra-high-speed atomic multicast and replication.
A message queue (Corfu) is also a powerful distributed computing concept.
 You could build a queuing system with Zookeeper, but you shouldn't.
 See https://cwiki.apache.org/confluence/display/CURATOR/TN
 There are high-quality queuing systems already.
Where is the state of your system? Make one choice.