Introduction to ZooKeeper
14 SEPTEMBER 2014
UPDATE: This article refers to our hosted Elasticsearch offering by an older name, Found. Please note that Found is
now known as Elastic Cloud.
Let's explore Apache ZooKeeper, a distributed coordination service for distributed systems. Needless to say, there are
plenty of use cases! At Found, for example, we use ZooKeeper extensively for discovery, resource allocation, leader
election and high priority notifications. In this article, we'll introduce you to this King of Coordination and look closely at
how we use ZooKeeper at Found.
What Is ZooKeeper?
ZooKeeper is a coordination service for distributed systems. By providing a robust implementation of a few basic
operations, ZooKeeper simplifies the implementation of many advanced patterns in distributed systems.
Unlike an ordinary distributed file system, ZooKeeper supports the concepts of ephemeral zNodes and sequential
zNodes. An ephemeral zNode is a node that will disappear when the session of its owner ends. A typical use case for
ephemeral nodes is when using ZooKeeper for discovery of hosts in your distributed system. Each server can then
publish its IP address in an ephemeral node, and should a server lose connectivity with ZooKeeper and fail to
reconnect within the session timeout, then its information is deleted.
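As a minimal sketch of this discovery pattern with the plain Java client (the connection string and the /servers path are illustrative, not what we use at Found), a server could publish itself like this:

import java.net.InetAddress;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class DiscoveryExample {
    public static void main(String[] args) throws Exception {
        // Connect with a 10 second session timeout; the watcher ignores connection events here.
        ZooKeeper zk = new ZooKeeper("localhost:2181", 10000, event -> {});

        // Make sure the persistent parent exists. Another server may have created it already.
        try {
            zk.create("/servers", new byte[0], ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        } catch (KeeperException.NodeExistsException ignored) {
        }

        // Publish this server's IP in an ephemeral zNode. If the session times out,
        // ZooKeeper deletes the node and the server disappears from the registry.
        String ip = InetAddress.getLocalHost().getHostAddress();
        zk.create("/servers/" + ip, ip.getBytes(), ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);

        Thread.sleep(Long.MAX_VALUE); // keep the session (and thereby the node) alive
    }
}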
Sequential nodes are nodes whose names are automatically assigned a sequence number suffix. This suffix is strictly
increasing and assigned by ZooKeeper when the zNode is created. An easy way of doing leader election with ZooKeeper
is to let every server publish its information in a zNode that is both sequential and ephemeral. Then, whichever server
has the lowest sequential zNode is the leader. If the leader, or any other server for that matter, goes offline, its session
dies, its ephemeral node is removed, and all other servers can observe who the new leader is.
The pattern of every node creating a sequential and ephemeral zNode effectively organizes all the nodes in a
queue that is observable to all. This is not only useful for leader election; it can just as well be generalized to
distributed locks for any purpose, with any number of holders inside the lock.
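A sketch of the election check itself with the plain Java client (the /election path is illustrative, and a production implementation would also watch the candidate just ahead of it rather than re-listing all children):

import java.util.Collections;
import java.util.List;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ElectionExample {
    // Assumes the persistent parent /election already exists.
    static boolean runForLeader(ZooKeeper zk, byte[] myInfo) throws Exception {
        // EPHEMERAL_SEQUENTIAL: ZooKeeper appends a strictly increasing suffix and
        // removes the node when this client's session ends.
        String myNode = zk.create("/election/candidate-", myInfo,
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL_SEQUENTIAL);

        // Whoever holds the lowest sequence number is the leader.
        List<String> candidates = zk.getChildren("/election", false);
        Collections.sort(candidates);
        return myNode.endsWith(candidates.get(0));
    }
}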
An important thing to note is that watchers are always one shot: a client registers a watcher on a zNode and is notified
only the next time that zNode changes, so if you want further updates to that zNode you have to re-register the watcher.
This implies that you might lose an update between receiving one and re-registering, but you can detect this by utilizing
the version number of the zNode. If, however, every version is important, then sequential zNodes are the way to go.
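A sketch of the re-registration loop with the Java client (the /config path and the handling are illustrative): re-reading the data re-registers the watch, and the version number in the Stat reveals whether any updates were missed in between.

import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

public class ConfigWatcher implements Watcher {
    private final ZooKeeper zk;
    private int lastSeenVersion = -1;

    ConfigWatcher(ZooKeeper zk) { this.zk = zk; }

    void readAndWatch() throws Exception {
        Stat stat = new Stat();
        // Passing ourselves as the watcher (re-)registers a one-shot watch on /config.
        byte[] data = zk.getData("/config", this, stat);
        if (lastSeenVersion != -1 && stat.getVersion() > lastSeenVersion + 1) {
            System.out.println("Missed intermediate updates; only the latest data is visible");
        }
        lastSeenVersion = stat.getVersion();
        // ... act on data ...
    }

    @Override
    public void process(WatchedEvent event) {
        if (event.getType() == Watcher.Event.EventType.NodeDataChanged) {
            try {
                readAndWatch(); // watches fire only once, so register again
            } catch (Exception e) {
                // handle connection loss and retries in real code
            }
        }
    }
}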
ZooKeeper gives guarantees about ordering. Every update is part of a total ordering. All clients might not be at the
exact same point in time, but they will all see every update in the same order. It is also possible to do writes conditioned
on a certain version of the zNode so that if two clients try to update the same zNode based on the same version, only
one of the updates will be successful. This makes it easy to implement distributed counters and perform partial
updates to node data. ZooKeeper even provides a mechanism for submitting multiple update operations in a batch so
that they may be executed atomically, meaning that either all or none of the operations will be executed. If you store
data structures in ZooKeeper that need to be consistent over multiple zNodes, then the multi-update API is useful;
however, it is still not as powerful as ACID transactions in traditional SQL databases. You can’t say: “BEGIN
TRANSACTION”, as you still have to specify the expected pre-state of each zNode you rely on.
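A sketch of both ideas with the Java client (paths and values are illustrative): a compare-and-set style update conditioned on the version we read, and a multi call whose operations are applied atomically only if a version check on another zNode passes.

import java.util.Arrays;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.Op;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

public class ConditionalWrites {
    // A simple distributed counter built on version-conditioned writes.
    static void incrementCounter(ZooKeeper zk) throws Exception {
        while (true) {
            Stat stat = new Stat();
            int value = Integer.parseInt(new String(zk.getData("/counter", false, stat)));
            try {
                // Succeeds only if nobody else updated /counter since we read it.
                zk.setData("/counter", Integer.toString(value + 1).getBytes(), stat.getVersion());
                return;
            } catch (KeeperException.BadVersionException e) {
                // Someone beat us to it; read the new value and retry.
            }
        }
    }

    // Either both operations are applied, or neither is.
    static void guardedUpdate(ZooKeeper zk, int expectedConfigVersion) throws Exception {
        zk.multi(Arrays.asList(
                Op.check("/config", expectedConfigVersion),
                Op.setData("/state", "updated".getBytes(), -1)));
    }
}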
Similarly, you are likely to end up with throughput issues if you use ZooKeeper when what you really need is a message
queue, as ZooKeeper is all about correctness and consistency first and speed second. That said, it is still pretty fast
when operating normally.
ZooKeeper at Found
At Found we use ZooKeeper extensively for discovery, resource allocation, leader election and high priority
notifications. Our entire service is built up of multiple systems reading and writing to ZooKeeper.
One example of such a system is our customer console, the web application that our customers use to create and
manage Elasticsearch clusters hosted by Found. One can also think of the customer console as the customer's window
into ZooKeeper. When a customer creates a new cluster or makes a change to an existing one, this is stored in
ZooKeeper as a pending plan change.
The next step is done by the Constructor, which has a watch in ZooKeeper for new plans. The Constructor implements
the plan by deciding how many Elasticsearch instances are required and if any of the existing instances may be reused.
The Constructor then updates the instance list for each Elasticsearch server accordingly and waits for the new
instances to start.
On each server running Elasticsearch instances, we have a small application that monitors the server's instance list in
ZooKeeper and starts or stops LXC containers with Elasticsearch instances as needed. When an Elasticsearch instance
starts, we use a plugin inside Elasticsearch to report the IP and port to ZooKeeper and discover other Elasticsearch
instances to form a cluster with.
The Constructor waits for the Elasticsearch instances to report back through ZooKeeper with their IP address and port
and uses this information to connect with each instance and to ensure they have formed a cluster successfully. And of
course, if this does not happen within a certain timeout, then the Constructor will begin rolling back the changes. A
common issue that may lead to new nodes having trouble starting is a misconfigured Elasticsearch plugin or a plugin
that requires more memory than anticipated.
To provide our customers with high availability and easy failover we have a proxy in front of the Elasticsearch clusters.
It is also crucial that this proxy forwards traffic to the correct server, whether changes are planned or not. By
monitoring information reported to ZooKeeper by each Elasticsearch instance, our proxy is able to detect whether it
should divert traffic to other instances or block traffic altogether to prevent deterioration of information in an unhealthy
cluster.
We also use ZooKeeper for leader election among services where this is required. One such example is our backup
service. The actual backups are made with the Snapshot and Restore API in Elasticsearch, while the scheduling of the
backups is done externally. We decided to co-locate the scheduling of the backups with each Elasticsearch instance.
Thus, for customers that pay for high availability, the backup service is also highly available. However, if there are no
live nodes in a cluster there is no point in attempting a backup. Since we only want to trigger one backup per cluster
and not one per instance, there is a need for coordinating the backup schedulers. This is done by letting them elect a
leader for each of the clusters.
With this many systems relying on ZooKeeper, we need a reliable low latency connection to it. Hence, we run one
ZooKeeper cluster per region. While having both client and server in the same region goes a long way in terms of
network reliability, you should still anticipate intermittent glitches, especially when doing maintenance on the
ZooKeeper cluster itself. At Found, we’ve learned first hand that it’s very important to have a clear idea of what
information a client should maintain a local cache of, and what actions a client system may perform while not having a
live connection to ZooKeeper.
Three or more independent servers form a ZooKeeper cluster and elect a master. The master receives all writes and
publishes changes to the other servers in an ordered fashion. The other servers provide redundancy in case the master
fails, and they relieve the master of read requests and client notifications. The concept of ordering is important for
understanding the quality of service that ZooKeeper provides. All operations are ordered as they are received, and this
ordering is maintained as information flows through the ZooKeeper cluster to other clients, even in the event of a
master node failure. Two clients might not have the exact same point in time view of the world at any given time, but
they will observe all changes in the same order.
Consistency Algorithm
Although ZooKeeper provides similar functionality to the Paxos algorithm, the core consensus algorithm of ZooKeeper
is not Paxos. The algorithm used in ZooKeeper is called ZAB, short for ZooKeeper Atomic Broadcast. Like Paxos, it
relies on a quorum for durability. The differences can be summed up as follows: ZAB has only one promoter of changes
(the leader) at a time, whereas Paxos may have many promoters of issues concurrently; ZAB puts a much stronger focus
on a total ordering of all changes; and every election of a new leader is followed by a synchronization phase before any
new changes are accepted. If you want to
read up on the specifics of the algorithm, I recommend the paper: “Zab: High-performance broadcast for primary-
backup systems”.
What everyone running ZooKeeper in production needs to know is that having a quorum means that more than half of
the number of nodes are up and running; for a cluster of five servers, that means at least three. If your client is
connecting with a ZooKeeper server which does not
participate in a quorum, then it will not be able to answer any queries. This is the only way ZooKeeper is capable of
protecting itself against split brains in case of a network partition.
As much as we love ZooKeeper, we have become so dependent on it that we're also taking care to avoid pushing its
limits. Just because we need to send a piece of information from A to B and they both use ZooKeeper does not mean
that ZooKeeper is the solution. In order to allow sending anything through ZooKeeper, the value and the urgency of the
information have to be high enough in relation to the cost of sending it (size and update frequency). A few examples of
data we deliberately keep out of ZooKeeper:
• Application logs - As annoying as it is to miss logs when you try to debug something, logs are usually the first
thing you're willing to sacrifice when your system is pushed to the limit. Hence, ZooKeeper is not a good fit; you
actually want something with looser consistency requirements.
• Binaries - These fellas are just too big and would require tweaking ZooKeeper settings to the point where a lot of
corner cases nobody has ever tested are likely to happen. Instead, we store binaries on S3 and keep the URLs in
ZooKeeper.
• Metrics - It may work very well to start off with, but in the long run it would pose a scaling issue. If we had been
sending metrics through ZooKeeper, it would simply be too expensive to have a comfortable buffer between
required and available capacity. That goes for metrics in general, the exception being two critical metrics that
are also used for application logic: the current disk and memory usage of each node. The disk usage is used by the
proxies to stop indexing when the customer exceeds their disk quota, and the memory usage will at some stage in the
future be used to upgrade customers' plans when needed.
In Practice
ZooKeeper has become a fairly big open source project, with many developers implementing pretty advanced stuff and
with a very high focus on correctness. Clearly, such a project requires a certain effort to become familiar with, but don’t
let that put you off. It is very much worth it when you are working with distributed systems. To help people get started
there are three guides, depending on your starting point. There’s the general getting started guide, which shows you
how to start a single ZooKeeper server and connect to it with the shell client and do a few basic operations before you
continue with either one or both of the more extensive guides: the Programmer's Guide, which details a lot of important
things to know and understand before building a solution with ZooKeeper, and the Administrator's Guide, which targets
the options relevant to a production cluster.
After starting ZooKeeper, you can connect to it with the CLI client. It attempts a connection to localhost by default, and
since there is no password for the root node, there is no need to give the client any parameters.
[zk: localhost:2181(CONNECTED) 0]
And you can type in ls / to see if there are any zNodes. You can create a zNode like this:
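create /test mydata
Created /test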
To create an ephemeral and sequential node, use the flags -e and -s:
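create -e -s /mynode- somedata
Created /mynode-0000000001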
Now, if you disconnect and then reconnect, the ephemeral node will have been removed by the server.
To create a watcher on a certain zNode you can add watch to the stat command like this:
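stat /test watch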
Then we can connect to ZooKeeper from a different terminal and change the zNode like this:
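set /test newdata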
This triggers the watcher in our first session and the cli prints this:
WATCHER::
WatchedEvent state:SyncConnected type:NodeDataChanged path:/test
This lets us know that the data at the path we were watching has been updated and that we should retrieve it if we're
interested in the updated contents.
Available Clients
There are two client libraries maintained by the ZooKeeper project, one in Java and another in C. With regard to other
programming languages, some libraries have been made that wrap either the Java or the C client.
Curator
Curator is an independent open source project started by Netflix and adopted by the Apache foundation. It currently
has the status of incubator project in Apache terms. The purpose of the Curator project is to create well tested
implementations of common patterns on top of ZooKeeper. This is not due to ZooKeeper being faulty or misleading in
its API, but simply because it can still be challenging to create solid implementations that correctly handle all the
possible exceptions and corner cases involved with networking.
Curator provides a layer on top of the Java client that deals with retries and connection losses and provides
standardized implementations of common distributed patterns like leader election, distributed locks, shared counters,
queues and caches. In Curator lingo these are referred to as recipes, but even if you don’t need any of the recipes I
highly recommend Curator when working with ZooKeeper on the Java Virtual Machine. Or, as stated on the Curator
wiki: “Friends don’t let friends write ZooKeeper recipes”.
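As a sketch of how little code a recipe leaves you to write (the connection string and latch path are illustrative), leader election with Curator's LeaderLatch looks roughly like this:

import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.framework.recipes.leader.LeaderLatch;
import org.apache.curator.retry.ExponentialBackoffRetry;

public class LeaderLatchExample {
    public static void main(String[] args) throws Exception {
        // Curator handles retries and reconnects behind the scenes.
        CuratorFramework client = CuratorFrameworkFactory.newClient(
                "localhost:2181", new ExponentialBackoffRetry(1000, 3));
        client.start();

        // Every participant uses the same latch path; exactly one of them is leader at a time.
        LeaderLatch latch = new LeaderLatch(client, "/my-service/leader");
        latch.start();
        latch.await(); // blocks until this participant becomes the leader

        if (latch.hasLeadership()) {
            System.out.println("I am the leader now"); // do the work only the leader should do
        }

        latch.close();
        client.close();
    }
}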
There are alternatives to Curator for other platforms, but not all of these are interoperable. Platform interoperability is
actually one of the cases where you just might have to stick with the low level stuff and implement recipes yourself.