Lecture 18
Spring, 2018
HTTP://WWW.CS.CORNELL.EDU/COURSES/CS5412/2018SP
CLOUD SYSTEMS HAVE MANY “FILE SYSTEMS”
Before we discuss Zookeeper, let’s think about file systems.
Clouds have many!
One is for bulk storage: some form of “global file system” or GFS.
At Google, it is actually called GFS
At Amazon, S3 plays this role
Azure uses “Azure storage fabric”
The NameNode tracks per-file metadata (like a Linux inode): name, create/update time, size, seek pointer, etc.
The name node tells you which data nodes hold your file.
Very common to use a simple DHT scheme to fragment the NameNode
HOW DO THEY WORK?
[Figure: a client opens a file by contacting the NameNode, which returns the file’s metadata; the client keeps a copy of that metadata.]
GLOBAL FILE SYSTEMS: SUMMARY

Pros:
1. Scales well even for massive objects.
2. Works well for large sequential reads/writes, etc.
3. Provides high performance (massive throughput).
4. Simple but robust reliability model.

Cons:
1. NameNode (master) can become overloaded, especially if individual files become extremely popular.
2. NameNode is a single point of failure.
3. A slow NameNode can impact the whole data center.
4. Concurrent writes to the same file can interfere with one another.
EDGE/FOG USE CASE
Building a smart highway: like an air traffic control system, but for cars.
Store the data in the global file system, run image analysis on the files.
WHAT ABOUT THE CONSISTENCY ASPECT?
Consider the “configuration” of our application, which implements this control behavior.
LIMITATIONS
[Figure: failure-handling mistakes. A is thought to have crashed; C is the backup; D is launched; then A turns out to have restarted.]
This one type of mistake can cascade into many kinds of risky problems that we would never want to tolerate in an ATC system. So we must avoid such problems entirely!
ROOT ISSUE (1)
The quality of failure sensing is limited
WITH S3 OR GFS, WE CANNOT TRUST A LOG!
These file systems don’t guarantee consistency!
They are unsafe with concurrent writes. Concurrent log appends could:
Overwrite one another, so one is lost
Be briefly perceived out of order, or some machine might glimpse a written record that will then be erased a moment later and overwritten
Sometimes we can even have two conflicting versions that linger for extended periods.
EXACTLY WHAT GOES WRONG?

“Append-only log” behavior, and what it means in our application using it:

1. Machines A, B and C are running. → A is selected to be the primary controller for runway 7; C is assigned as backup.
2-a. Machine D is launched.
2-b. Concurrently, B thinks A crashed. → C notices 2-b, and takes over. But A wasn’t really crashed – B was wrong!
3. 2-b is overwritten by 2-a. → Log entry 2-b is gone. A keeps running.
4. A turns out to be fine, after all. → Now we have A and C both in the “control runway 7” role – a split brain!
WHY CAN’T WE JUST FIX THE BUG?
First, be aware that S3 and GFS and similar systems are perfectly fine for object storage by a single, non-concurrent writer.
If nervous, take one more step and add a “self-certifying signature” (see the sketch below).
SHA3 hash codes are common for this: very secure and robust. But there are many options; even simple parity codes can help.
The reason these systems don’t handle concurrent writes well is that the protocol required for consistency would be slower than their current weak consistency model.
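A minimal sketch of the “self-certifying signature” idea above, in Python. The object’s storage key embeds a SHA3-256 digest of its bytes, so any reader can re-hash what it fetched and detect a damaged or partially overwritten copy. The key prefix, sample bytes, and the put() call are made-up placeholders, not a real S3/GFS API.

import hashlib

def self_certifying_key(data: bytes) -> str:
    # Name the object by a SHA3-256 digest of its contents.
    return "frames/" + hashlib.sha3_256(data).hexdigest()

blob = b"...image bytes from one camera frame..."   # hypothetical object
key = self_certifying_key(blob)
# put(key, blob)        # store in S3 / GFS (placeholder call)

def verify(key: str, data: bytes) -> bool:
    # On read, recompute the digest and compare it with the name.
    return key.endswith(hashlib.sha3_256(data).hexdigest())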
ZOOKEEPER: A SOLUTION FOR THIS ISSUE
The need in many systems is for a place to store configuration, parameters, lists of which machines are running, which nodes are “primary” or “backup”, etc.
SHOULD I USE ZOOKEEPER FOR EVERYTHING?
Zookeeper isn’t for long-term storage, or for large objects. Put those in the GFS. Then share the URL, which is small.
The health of all components is tracked, so that we know who is up and who has crashed. In Derecho, this is called membership status.
The Zookeeper meta-data layer is a single program: consistent by design.
The Zookeeper data replication layer uses an atomic multicast to ensure that all replicas are in the same state.
For long-term robustness, they checkpoint periodically (every 5s) and restart from checkpoint.
DETAILS ON ZOOKEEPER (THIS IS THEIR SLIDE SET AND REPEATS SOME PARTS OF MINE)
Hunt, P., Konar, M., Junqueira, F.P. and Reed, B., 2010, June. ZooKeeper: Wait-free Coordination for Internet-scale Systems. In USENIX Annual Technical Conference (Vol. 8, p. 9).
A SIMPLE µ-SERVICE SYSTEM
[Figure: a client SDK talks over HTTP or TCP/IP to an application server built from a server SDK, Kafka or SQS, and resource plugins, which fronts microservices such as a shipment planner, billing, packing, and mailing labels, each with multiple instances. In µ-services, these pieces need resource management and scheduling.]
Zookeeper implementation of a rendezvous (see the sketch after this list):
Create a rendezvous node: /root/rendezvous
Workers read /root/rendezvous and set a watch
If empty, use watch to detect when master posts its configuration information
Master fills in its configuration information (host, port)
Workers are notified of content change and get the configuration information
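A minimal sketch of this rendezvous recipe, using the third-party kazoo Python client against an assumed ensemble at 127.0.0.1:2181; the /root/rendezvous path and the host/port payload follow the list above, everything else is illustrative.

from kazoo.client import KazooClient
from kazoo.exceptions import NodeExistsError

zk = KazooClient(hosts="127.0.0.1:2181")
zk.start()

# Create the rendezvous node, empty at first (ignore if it already exists).
try:
    zk.create("/root/rendezvous", b"", makepath=True)
except NodeExistsError:
    pass

# Worker side: read /root/rendezvous and set a watch; if it is empty, the
# watch fires once the master posts its configuration information.
def on_change(event):
    data, _stat = zk.get("/root/rendezvous")
    print("master configuration:", data.decode())   # e.g. "host:port"

data, _stat = zk.get("/root/rendezvous", watch=on_change)
if data:
    print("master configuration:", data.decode())

# Master side: fill in its configuration information (host, port).
zk.set("/root/rendezvous", b"10.0.0.5:9000")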
LOCKS
Familiar analogy: lock files used by Apache HTTPD and MySQL processes
Zookeeper example: who is the leader with primary copy of data?
Implementation (see the sketch after this list):
Leader creates an ephemeral file: /root/leader/lockfile
Other would-be leaders place watches on the lock file
If the leader client dies or doesn’t renew the lease, clients can attempt to create a replacement lock file
Use SEQUENTIAL to solve the herd effect problem:
Create a sequence of ephemeral child nodes
Clients only watch the node immediately ahead of them in the sequence
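A minimal sketch of the ephemeral + SEQUENTIAL recipe just described, again with the kazoo client; the /root/leader path comes from the slide, while the connection string, payload and function name are made up. kazoo also ships a ready-made Lock recipe (zk.Lock) built on this same pattern, which is normally preferable to hand-rolled code.

from kazoo.client import KazooClient

zk = KazooClient(hosts="127.0.0.1:2181")
zk.start()
zk.ensure_path("/root/leader")

# Every contender creates an ephemeral, sequential child node.
me = zk.create("/root/leader/lock-", b"host:port",
               ephemeral=True, sequence=True)
my_name = me.rsplit("/", 1)[1]

def check_lock():
    children = sorted(zk.get_children("/root/leader"))
    if children[0] == my_name:
        return True                      # lowest sequence number holds the lock
    # Watch only the node immediately ahead of me, not the whole directory,
    # so a release wakes exactly one waiter (no herd effect).
    ahead = children[children.index(my_name) - 1]
    if zk.exists("/root/leader/" + ahead,
                 watch=lambda event: check_lock()) is None:
        return check_lock()              # it vanished already; re-check
    return False

check_lock()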
Zookeeper is popular in science-computing gateways.
[Figure: a science gateway. A browser connects over HTTPS to a web interface server; its client SDK talks over HTTP or TCP/IP to an application server with a server SDK, resource plugins, and “super” scheduling and resource management, which submits work to clusters with different architectures, schedulers, and admin domains: Karst (MOAB/Torque), Stampede (SLURM), Comet (SLURM), Jureca (SLURM). In a micro-service architecture, these also need scheduling.]
With µ-services we often replicate components.
[Figure: many replicated instances of each component type: API servers, application managers, and metadata servers.]
WHY IS THIS FORM OF REPLICATION NEEDED?
Fault tolerance
Increased throughput, load balancing
Component versions
Not all components of the same type need to be on the same version
Backward compatibility checking
Component flavors
Application managers can serve different types of resources
Useful to separate them into separate processes if libraries conflict.
CONFIGURATION MANAGEMENT
Problem: gateway components in a distributed system need to get the correct configuration file.
Solution: components contact Zookeeper to get configuration metadata (see the sketch below).
Comments: this includes both the component’s own configuration file as well as configurations for other components
Rendezvous problem
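A minimal sketch of this pattern with the kazoo client: the component watches its configuration znode and reloads whenever the data changes. The znode path, the JSON payload format, and the ensemble address are assumptions, not part of any real gateway.

import json
from kazoo.client import KazooClient

zk = KazooClient(hosts="127.0.0.1:2181")
zk.start()

@zk.DataWatch("/gateway/config/app-manager")
def on_config_change(data, stat):
    # Called once with the current value, then again after every update.
    if data is not None:
        config = json.loads(data.decode("utf-8"))
        print("reloading configuration, version", stat.version, ":", config)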
SERVICE DISCOVERY
Problem: Component A needs to find instances of Component B
Solution: use Zookeeper to find the available group members, i.e., the instances of Component B (see the sketch below)
More: get useful metadata about Component B instances like version, domain name, port #, flavor
Comments:
Useful for components that need to directly communicate, but not for asynchronous communication (message queues)
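A minimal sketch of the discovery pattern above with kazoo: each instance of Component B registers an ephemeral znode carrying its metadata, and Component A lists and watches the group. The paths, metadata fields and addresses are all made up.

import json
from kazoo.client import KazooClient

zk = KazooClient(hosts="127.0.0.1:2181")
zk.start()
zk.ensure_path("/services/component-b")

# On a Component B instance: register myself (ephemeral, so the entry
# disappears if my session dies).
meta = json.dumps({"host": "10.0.0.7", "port": 8080,
                   "version": "1.4", "flavor": "gpu"}).encode()
zk.create("/services/component-b/instance-", meta,
          ephemeral=True, sequence=True)

# On Component A: discover instances and re-run on membership changes.
@zk.ChildrenWatch("/services/component-b")
def on_members_change(children):
    instances = [json.loads(zk.get("/services/component-b/" + c)[0])
                 for c in children]
    print("live Component B instances:", instances)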
GROUP MEMBERSHIP
Problem: a job needs to go to a specific flavor of application manager. How can this be located?
Solution: have application managers join the appropriate Zookeeper-managed group when they come up (see the sketch below).
Comments: this is useful to support scheduling
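A minimal sketch of flavor-based group membership with kazoo: an application manager joins the group for its flavor at startup, and the scheduler picks a member from the group a job needs. The group paths and flavor names are illustrative only.

from kazoo.client import KazooClient

zk = KazooClient(hosts="127.0.0.1:2181")
zk.start()

# Application manager side: join the group for my flavor. The node is
# ephemeral, so I leave the group automatically if my session dies.
flavor = "slurm"
zk.ensure_path("/groups/app-manager/" + flavor)
zk.create("/groups/app-manager/%s/member-" % flavor, b"host:port",
          ephemeral=True, sequence=True)

# Scheduler side: find a manager of the flavor this job requires.
candidates = zk.get_children("/groups/app-manager/slurm")
target = sorted(candidates)[0] if candidates else None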
SYSTEM STATE FOR DISTRIBUTED SYSTEMS
Which servers are up and running? What versions?
Services that run for long periods could use ZK to indicate if they are busy (or under heavy load) or not. A sketch follows below.
Note overlap with our Registry.
What state does the Registry manage? What state would be more appropriate for ZK?
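A minimal sketch of advertising liveness and load through ZooKeeper with kazoo: the service keeps an ephemeral znode whose data it updates as its load changes, and the znode disappears if the service dies. The path and the JSON fields are made up; handling of a stale node left over from a previous session is omitted.

import json
import socket
from kazoo.client import KazooClient

zk = KazooClient(hosts="127.0.0.1:2181")
zk.start()
zk.ensure_path("/status")

path = "/status/" + socket.gethostname()
zk.create(path, b'{"busy": false}', ephemeral=True)

def report_load(busy, queue_len):
    # Update the ephemeral status node whenever the local load changes.
    zk.set(path, json.dumps({"busy": busy, "queue": queue_len}).encode())

report_load(True, 12)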
LEADER ELECTION
Problem: metadata servers are replicated for read access, but only the master has write privileges. The master crashes.
Solution: use Zookeeper to elect a new metadata server leader (see the sketch below).
Comment: this is not necessarily the best way to do this
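A minimal sketch using kazoo’s built-in Election recipe (itself built from ephemeral-sequential znodes); the election path and the identifier are made up. Whichever replica wins runs the leader function; if it dies, leadership passes to another contender.

from kazoo.client import KazooClient

zk = KazooClient(hosts="127.0.0.1:2181")
zk.start()

election = zk.Election("/metadata/leader", "metadata-server-42")

def lead():
    # Runs only while this process is the elected leader; when it returns,
    # leadership is released and another contender can win.
    print("I am the metadata master; accepting writes.")

election.run(lead)    # blocks until elected, then calls lead()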
UNDER THE ZOOKEEPER HOOD: THE ZAB PROTOCOL
[Figure: ZooKeeper servers keep their replicas consistent by running the Zab atomic broadcast protocol among themselves.]
ZOOKEEPER HANDLING OF WRITES
READ requests are served by any Zookeeper server.
Scales linearly, although information can be stale.
WRITE requests are forwarded to the leader, which orders them with the Zab atomic broadcast and applies them at every replica.
But checkpointing is not the same as true durable Paxos. Durable Paxos is like checkpointing on every operation; Zookeeper checkpoints only every 5s.
SUMMARY AND CAUTIONS
Zookeeper is powerful for system management/configuration data
Derecho is great for ultra-high-speed atomic multicast, replication
A message queue (Corfu) is also a powerful distributed computing concept
You could build a queuing system with Zookeeper, but you shouldn’t
https://cwiki.apache.org/confluence/display/CURATOR/TN
There are high quality queuing systems already