

14 SEPTEMBER 2014

ZooKeeper - The King of Coordination


By Konrad Beiske


UPDATE: This article refers to our hosted Elasticsearch offering by an older name, Found. Please note that Found is
now known as Elastic Cloud.

Let's explore Apache ZooKeeper, a distributed coordination service for distributed systems. Needless to say, there are
plenty of use cases! At Found, for example, we use ZooKeeper extensively for discovery, resource allocation, leader
election and high priority notifications. In this article, we'll introduce you to this King of Coordination and look closely at
how we use ZooKeeper at Found.

What Is ZooKeeper?

ZooKeeper is a coordination service for distributed systems. By providing a robust implementation of a few basic
operations, ZooKeeper simplifies the implementation of many advanced patterns in distributed systems.

ZooKeeper as a Distributed File System


One way of getting to know ZooKeeper is to think of it as a distributed file system. In fact, the way information in ZooKeeper is organized is quite similar to a file system. At the top there is a root, simply referred to as /. Below the root there are nodes referred to as zNodes, short for ZooKeeper node; the term is mostly used to avoid confusion with computer nodes. A zNode may act as both a file containing binary data and a directory with more zNodes as children. As in most file systems, each zNode has some metadata, including read and write permissions and version information.

Unlike an ordinary distributed file system, ZooKeeper supports the concepts of ephemeral zNodes and sequential zNodes. An ephemeral zNode is a node that will disappear when the session of its owner ends. A typical use case for ephemeral nodes is using ZooKeeper for discovery of hosts in your distributed system. Each server can then publish its IP address in an ephemeral node, and should a server lose connectivity with ZooKeeper and fail to reconnect within the session timeout, then its information is deleted.
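As a minimal sketch of that discovery pattern (the connection string, session timeout, paths and addresses below are hypothetical, not Found's actual layout), a server could register itself like this with the official Java client:

import java.util.concurrent.CountDownLatch;

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class DiscoverySketch {
    public static void main(String[] args) throws Exception {
        // Hypothetical connection string and session timeout (milliseconds).
        CountDownLatch connected = new CountDownLatch(1);
        ZooKeeper zk = new ZooKeeper("localhost:2181", 10000, event -> {
            if (event.getState() == Watcher.Event.KeeperState.SyncConnected) {
                connected.countDown();
            }
        });
        connected.await();  // wait until the session is established

        // Make sure the persistent parent zNode exists (ignoring the race where two
        // servers try to create it at the same time).
        if (zk.exists("/servers", false) == null) {
            zk.create("/servers", new byte[0], ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        }

        // Publish this server's address in an ephemeral zNode. It disappears automatically
        // if the session ends, e.g. when the server loses connectivity and fails to
        // reconnect within the session timeout.
        zk.create("/servers/server-1", "10.0.0.5:9300".getBytes(),
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);

        // Other components can list /servers to discover the hosts that are currently live.
        System.out.println("Live servers: " + zk.getChildren("/servers", false));
    }
}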

Sequential zNodes are nodes whose names are automatically assigned a sequence number suffix. This suffix is strictly growing and assigned by ZooKeeper when the zNode is created. An easy way of doing leader election with ZooKeeper is to let every server publish its information in a zNode that is both sequential and ephemeral. Then, whichever server has the lowest sequential zNode is the leader. If the leader, or any other server for that matter, goes offline, its session dies and its ephemeral node is removed, and all other servers can observe who the new leader is.

This pattern, with every node creating a sequential and ephemeral zNode, effectively organizes all the nodes in a queue that is observable to all. It is not only useful for leader election; it may just as well be generalized to distributed locks for any purpose, with any number of nodes inside the lock.
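A rough sketch of that election scheme with the plain Java client might look like the following; the /election path and connection details are hypothetical, and for simplicity every participant just re-lists all children instead of watching its predecessor:

import java.util.Collections;
import java.util.List;

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class LeaderElectionSketch {
    public static void main(String[] args) throws Exception {
        // As in the previous sketch, production code would first wait for the session
        // to reach the SyncConnected state.
        ZooKeeper zk = new ZooKeeper("localhost:2181", 10000, event -> {});

        // Each participant creates an ephemeral, sequential zNode under a shared parent
        // (assumed to already exist). ZooKeeper appends a strictly growing suffix,
        // e.g. /election/candidate-0000000042.
        String myNode = zk.create("/election/candidate-", "10.0.0.5".getBytes(),
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL_SEQUENTIAL);

        // The participant holding the lowest sequence number is the leader. If its session
        // dies, its ephemeral zNode vanishes and the next-lowest candidate takes over.
        List<String> candidates = zk.getChildren("/election", false);
        Collections.sort(candidates);
        String leader = "/election/" + candidates.get(0);

        System.out.println(myNode.equals(leader) ? "I am the leader" : "The leader is " + leader);
    }
}

In a real implementation each participant would watch only the zNode immediately ahead of its own, so that a leader change does not wake up every candidate at once; Curator, discussed later in this article, ships a well tested recipe for exactly this.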

ZooKeeper as a Message Queue


Speaking of observing changes, another key feature of ZooKeeper is the possibility of registering watchers on zNodes. This allows clients to be notified of the next update to that zNode. With watchers one can implement a message queue by letting all clients interested in a certain topic register a watcher on a zNode for that topic; messages regarding that topic can then be broadcast to all clients by writing to that zNode.

An important thing to note about watchers, though, is that they’re always one-shot, so if you want further updates to that zNode you have to re-register them. This implies that you might lose an update between receiving one and re-registering, but you can detect this by utilizing the version number of the zNode. If, however, every version is important, then sequential zNodes are the way to go.
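A minimal sketch of that re-registration pattern with the official Java client, assuming a hypothetical topic zNode and ignoring connection-loss handling, could look like this:

import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

public class TopicListener implements Watcher {
    private final ZooKeeper zk;
    private final String path;   // hypothetical topic node, e.g. "/topics/news"
    private int lastSeenVersion = -1;

    public TopicListener(ZooKeeper zk, String path) throws Exception {
        this.zk = zk;
        this.path = path;
        read();  // the initial read also registers the first watch
    }

    @Override
    public void process(WatchedEvent event) {
        try {
            // Watches are one-shot: re-register by reading again. If several updates
            // happened in between, the jump in dataVersion tells us we missed some.
            read();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    private void read() throws Exception {
        Stat stat = new Stat();
        byte[] data = zk.getData(path, this, stat);   // passing `this` re-arms the watch
        if (lastSeenVersion != -1 && stat.getVersion() > lastSeenVersion + 1) {
            System.out.println("Missed some intermediate updates");
        }
        lastSeenVersion = stat.getVersion();
        System.out.println("Version " + stat.getVersion() + ": " + new String(data));
    }
}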

ZooKeeper gives guarantees about ordering. Every update is part of a total ordering. All clients might not be at the
exact same point in time, but they will all see every update in the same order. It is also possible to do writes conditioned
on a certain version of the zNode so that if two clients try to update the same zNode based on the same version, only
one of the updates will be successful. This makes it easy to implement distributed counters and perform partial
updates to node data. ZooKeeper even provides a mechanism for submitting multiple update operations in a batch so
that they may be executed atomically, meaning that either all or none of the operations will be executed. If you store
data structures in ZooKeeper that need to be consistent over multiple zNodes, then the multi-update API is useful;
however, it is still not as powerful as ACID transactions in traditional SQL databases. You can’t say: “BEGIN
TRANSACTION”, as you still have to specify the expected pre-state of each zNode you rely on.
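As a sketch of both mechanisms, assuming hypothetical zNode paths and leaving out connection handling, a version-conditioned counter and a batched multi-update could look like this with the Java client:

import java.util.Arrays;

import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.Op;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

public class ConditionalUpdates {

    // Increment a counter stored in a zNode. The write only succeeds if the node is
    // still at the version we read; otherwise another client got there first and we retry.
    static void incrementCounter(ZooKeeper zk, String path) throws Exception {
        while (true) {
            Stat stat = new Stat();
            int value = Integer.parseInt(new String(zk.getData(path, false, stat)));
            try {
                zk.setData(path, Integer.toString(value + 1).getBytes(), stat.getVersion());
                return;
            } catch (KeeperException.BadVersionException e) {
                // Someone else updated the node in between; read and try again.
            }
        }
    }

    // Update two zNodes atomically: either both writes happen or neither does. Note that
    // this is not a general "BEGIN TRANSACTION"; we still have to state the expected
    // version of every zNode the update relies on.
    static void updatePair(ZooKeeper zk, byte[] a, int versionA,
                           byte[] b, int versionB) throws Exception {
        zk.multi(Arrays.asList(
                Op.setData("/config/part-a", a, versionA),   // hypothetical paths
                Op.setData("/config/part-b", b, versionB)));
    }
}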

Don’t Replace Your Distributed File System and Message Queue


Although it might be tempting to have one system for everything, you’re bound to run into some issues if you try to
replace your file servers with ZooKeeper. The first issue is likely to be the zNode limit imposed by the jute.maxbuffer setting. This is a limit on the size of each zNode, and the default value is one megabyte. In general, it is not
recommended to change that setting, simply because ZooKeeper was not implemented to be a large datastore. The
exception to the rule, which we’ve experienced at Found, is when a client with many watchers has lost connection to
ZooKeeper, and the client library - in this case Curator - attempts to recreate all the watchers upon reconnection. Since
the same setting also applies to all messages sent to and from ZooKeeper, we had to increase it to allow Curator to
reconnect smoothly for these clients.

Similarly, you are likely to end up with throughput issues if you use ZooKeeper when what you really need is a message
queue, as ZooKeeper is all about correctness and consistency first and speed second. That said, it is still pretty fast
when operating normally.

ZooKeeper at Found

At Found we use ZooKeeper extensively for discovery, resource allocation, leader election and high priority
notifications. Our entire service is built up of multiple systems reading and writing to ZooKeeper.

One example of such a system is our customer console, the web application that our customers use to create and
manage Elasticsearch clusters hosted by Found. One can also think of the customer console as the customer's window
into ZooKeeper. When a customer creates a new cluster or makes a change to an existing one, this is stored in
ZooKeeper as a pending plan change.

The next step is done by the Constructor, which has a watch in ZooKeeper for new plans. The Constructor implements
the plan by deciding how many Elasticsearch instances are required and if any of the existing instances may be reused.
The Constructor then updates the instance list for each Elasticsearch server accordingly and waits for the new
instances to start.

On each server running Elasticsearch instances, we have a small application that monitors the server's instance list in ZooKeeper and starts or stops LXC containers with Elasticsearch instances as needed. When an Elasticsearch instance
starts, we use a plugin inside Elasticsearch to report the IP and port to ZooKeeper and discover other Elasticsearch
instances to form a cluster with.

The Constructor waits for the Elasticsearch instances to report back through ZooKeeper with their IP address and port
and uses this information to connect with each instance and to ensure they have formed a cluster successfully. And of
course, if this does not happen within a certain timeout, then the Constructor will begin rolling back the changes. A
common issue that may lead to new nodes having trouble starting is a misconfigured Elasticsearch plugin or a plugin
that requires more memory than anticipated.

To provide our customers with high availability and easy failover we have a proxy in front of the Elasticsearch clusters.
It is also crucial that this proxy forwards traffic to the correct server, whether changes are planned or not. By
monitoring information reported to ZooKeeper by each Elasticsearch instance, our proxy is able to detect whether it
should divert traffic to other instances or block traffic altogether to prevent deterioration of information in an unhealthy
cluster.

We also use ZooKeeper for leader election among services where this is required. One such example is our backup
service. The actual backups are made with the Snapshot and Restore API in Elasticsearch, while the scheduling of the
backups is done externally. We decided to co-locate the scheduling of the backups with each Elasticsearch instance.
Thus, for customers that pay for high availability, the backup service is also highly available. However, if there are no
live nodes in a cluster there is no point in attempting a backup. Since we only want to trigger one backup per cluster
and not one per instance, there is a need for coordinating the backup schedulers. This is done by letting them elect a
leader for each of the clusters.

With this many systems relying on ZooKeeper, we need a reliable low latency connection to it. Hence, we run one
ZooKeeper cluster per region. While having both client and server in the same region goes a long way in terms of
network reliability, you should still anticipate intermittent glitches, especially when doing maintenance on the
ZooKeeper cluster itself. At Found, we’ve learned first hand that it’s very important to have a clear idea of what
information a client should maintain a local cache of, and what actions a client system may perform while not having a
live connection to ZooKeeper.

How Does It Work?

Three or more independent servers form a ZooKeeper cluster and elect a master. The master receives all writes and
publishes changes to the other servers in an ordered fashion. The other servers provide redundancy in case the master
fails and offload read requests and client notifications from the master. The concept of ordering is important in order to
understand the quality of service that ZooKeeper provides. All operations are ordered as they are received and this
ordering is maintained as information flows through the ZooKeeper cluster to other clients, even in the event of a
master node failure. Two clients might not have the exact same point in time view of the world at any given time, but
they will observe all changes in the same order.

The CAP Theorem


Consistency, Availability and Partition tolerance are the three properties considered in the CAP theorem. The theorem states that a distributed system can only provide two of these three properties. ZooKeeper is a CP system with regard to the CAP theorem. This implies that it sacrifices availability in order to achieve consistency and partition tolerance. In other words, if it cannot guarantee correct behaviour it will not respond to queries.

Consistency Algorithm
Although ZooKeeper provides similar functionality to the Paxos algorithm, the core consensus algorithm of ZooKeeper
is not Paxos. The algorithm used in ZooKeeper is called ZAB, short for ZooKeeper Atomic Broadcast. Like Paxos, it
relies on a quorum for durability. The differences can be summed up as: only one promoter at a time, whereas Paxos
may have many promoters of issues concurrently; a much stronger focus on a total ordering of all changes; and every
election of a new leader is followed by a synchronization phase before any new changes are accepted. If you want to
read up on the specifics of the algorithm, I recommend the paper: “Zab: High-performance broadcast for primary-
backup systems”.

What everyone running ZooKeeper in production needs to know is that having a quorum means that more than half of the nodes are up and running; for example, a five-node ensemble can tolerate two failed nodes and still serve requests. If your client is connected to a ZooKeeper server that does not participate in a quorum, it will not be able to answer any queries. This is the only way ZooKeeper is capable of protecting itself against split brains in case of a network partition.

What We Don’t Use ZooKeeper For

As much as we love ZooKeeper, we have become so dependent on it that we’re also taking care to avoid pushing its limits. Just because we need to send a piece of information from A to B and they both use ZooKeeper does not mean that ZooKeeper is the solution. For anything we send through ZooKeeper, the value and urgency of the information have to be high enough in relation to the cost of sending it (size and update frequency).

• Application logs - As annoying as it is to miss logs when you try to debug something, logs are usually the first thing you’re willing to sacrifice when your system is pushed to the limit. Hence, ZooKeeper is not a good fit; you actually want something with looser consistency requirements.

• Binaries - These fellas are just too big and would require tweaking ZooKeeper settings to the point where a lot of
corner cases nobody has ever tested are likely to happen. Instead we store binaries on S3 and keep the URLs in
ZooKeeper.

• Metrics - It may work very well to start off with, but in the long run it would pose a scaling issue. If we had been sending metrics through ZooKeeper, it would simply be too expensive to have a comfortable buffer between required and available capacity. That goes for metrics in general, with the exception of two critical metrics that are also used for application logic: the current disk and memory usage of each node. Disk usage is used by the proxies to stop indexing when a customer exceeds their disk quota, and memory usage will at some stage in the future be used to upgrade customers' plans when needed.

In Practice

ZooKeeper has become a fairly big open source project, with many developers implementing pretty advanced stuff and
with a very high focus on correctness. Clearly, such a project requires a certain effort to become familiar with, but don’t
let that put you off. It is very much worth it when you are working with distributed systems. To help people get started
there are three guides, depending on your starting point. There’s the general Getting Started guide, which shows you how to start a single ZooKeeper server, connect to it with the shell client and do a few basic operations before you continue with either one or both of the more extensive guides: the Programmer's Guide, which details a lot of important things to know and understand before building a solution with ZooKeeper, and the Administrator's Guide, which covers the options relevant to a production cluster.

After starting ZooKeeper, you can connect to it with the CLI client. It attempts a connection to localhost by default, and since there is no password for the root node, there is no need to give the client any parameters.

beiske-retina:~ beiske$ bin/zkCli


# Removed some output for brevity
Welcome to ZooKeeper!
JLine support is enabled
[zk: localhost:2181(CONNECTING) 0]

Once connected you get a shell prompt like this:

[zk: localhost:2181(CONNECTED) 0]

And you can type in ls / to see if there are any zNodes. You can create a zNode like this:

[zk: localhost:2181(CONNECTED) 1] create /test HelloWorld!


Created /test

And retrieve its contents:



[zk: localhost:2181(CONNECTED) 11] get /test


HelloWorld!
cZxid = 0xa1f54
ctime = Sun Jul 20 15:22:57 CEST 2014
mZxid = 0xa1f54
mtime = Sun Jul 20 15:22:57 CEST 2014
pZxid = 0xa1f54
cversion = 0
dataVersion = 0
aclVersion = 0
ephemeralOwner = 0x0
dataLength = 11
numChildren = 0

To create an ephemeral and sequential node, use the flags -e and -s.

[zk: localhost:2181(CONNECTED) 12] create -e -s /myEphemeralAndSequentialNode ThisIsASequentialAndEphemaralNode


Created /myEphemeralAndSequentialNode0000000001
[zk: localhost:2181(CONNECTED) 13] ls /
[test, myEphemeralAndSequentialNode0000000001]
[zk: localhost:2181(CONNECTED) 14] get /myEphemeralAndSequentialNode0000000001
ThisIsASequentialAndEphemaralNode
cZxid = 0xa1f55
ctime = Sun Jul 20 15:29:02 CEST 2014
mZxid = 0xa1f55
mtime = Sun Jul 20 15:29:02 CEST 2014
pZxid = 0xa1f55
cversion = 0
dataVersion = 0
aclVersion = 0
ephemeralOwner = 0x14753d68f850008
dataLength = 33
numChildren = 0

Now if you disconnect and then reconnect, the ephemeral node will have been removed by the server.

To register a watcher on a certain zNode, you can add watch to the stat command like this:

[zk: localhost:2181(CONNECTED) 30] stat /test watch


cZxid = 0xa1f60
ctime = Sun Jul 20 15:46:45 CEST 2014
mZxid = 0xa1f60
mtime = Sun Jul 20 15:46:45 CEST 2014
pZxid = 0xa1f60
cversion = 0
dataVersion = 0
aclVersion = 0
ephemeralOwner = 0x0
dataLength = 10
numChildren = 0
[zk: localhost:2181(CONNECTED) 31]

Then we can connect to ZooKeeper from a different terminal and change the zNode like this:

[zk: localhost:2181(CONNECTED) 1] set /test ByeCruelWorld!


cZxid = 0xa1f60
ctime = Sun Jul 20 15:46:45 CEST 2014
mZxid = 0xa1f62
mtime = Sun Jul 20 15:47:41 CEST 2014
pZxid = 0xa1f60
cversion = 0
dataVersion = 1
aclVersion = 0
ephemeralOwner = 0x0
dataLength = 14
numChildren = 0

This triggers the watcher in our first session and the CLI prints this:

WATCHER::
WatchedEvent state:SyncConnected type:NodeDataChanged path:/test

This lets us know that the data at the path we were watching has been updated and that we should retrieve it if we’re interested in the updated contents.

Available Clients
There are two client libraries maintained by the ZooKeeper project, one in Java and another in C. With regard to other
programming languages, some libraries have been made that wrap either the Java or the C client.

Curator
Curator is an independent open source project started by Netflix and adopted by the Apache foundation. It currently
has the status of incubator project in Apache terms. The purpose of the Curator project is to create well tested
implementations of common patterns on top of ZooKeeper. This is not due to ZooKeeper being faulty or misleading in
its API, but simply because it can still be challenging to create solid implementations that correctly handle all the
possible exceptions and corner cases involved with networking.

Curator provides a layer on top of the Java client that deals with retries and connection losses and provides
standardized implementations of common distributed patterns like leader election, distributed locks, shared counters,
queues and caches. In Curator lingo these are referred to as recipes, but even if you don’t need any of the recipes I
highly recommend Curator when working with ZooKeeper on the Java Virtual Machine. Or, as stated on the Curator
wiki: “Friends don’t let friends write ZooKeeper recipes”.
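As a minimal sketch, assuming a hypothetical connection string and latch path rather than our actual setup, electing one backup scheduler per cluster with Curator's LeaderLatch recipe could look like this:

import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.framework.recipes.leader.LeaderLatch;
import org.apache.curator.retry.ExponentialBackoffRetry;

public class BackupSchedulerElection {
    public static void main(String[] args) throws Exception {
        // Curator deals with retries and connection loss; the retry policy and
        // connection string here are illustrative.
        CuratorFramework client = CuratorFrameworkFactory.newClient(
                "localhost:2181", new ExponentialBackoffRetry(1000, 3));
        client.start();

        // One latch per cluster (hypothetical path); whichever participant acquires
        // leadership runs the scheduled backups for that cluster.
        LeaderLatch latch = new LeaderLatch(client, "/backup-schedulers/cluster-42");
        latch.start();

        latch.await();  // blocks until this participant becomes leader
        System.out.println("I am the backup scheduler leader for cluster-42");

        // ... schedule backups while we hold leadership ...

        latch.close();
        client.close();
    }
}

Production code would also listen for connection-state changes and stop scheduling work if the session is lost, but the recipe spares you the low-level watcher and retry plumbing.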

There are alternatives to Curator for other platforms, but not all of these are interoperable. Platform interoperability is
actually one of the cases where you just might have to stick with the low level stuff and implement recipes yourself.

Operations: Yet Another System to Manage


Some people argue the benefits of only having one system to deploy and upgrade. For those of us with more than one system to look after, it is good practice to keep each of these systems as small and independent as possible. For us at Found, ZooKeeper is a crucial step towards this design goal.

