Notes - 5 Unit Big Data
1. Pig :
It provides a high-level scripting language, known as Pig Latin, which is used to develop data analysis code.
Applications :
2. Hive :
It resides on top of Hadoop to summarize Big Data and makes querying and analyzing easy.
It is used by different companies. For example, Amazon uses it in Amazon Elastic MapReduce.
Benefits :
Ease of use
Accelerated initial insertion of data
Superior scalability, flexibility, and cost-efficiency
Streamlined security
Low overhead
Exceptional working capacity
3. HBase :
HBase is a column-oriented non-relational database management system that runs on top of the Hadoop
Distributed File System (HDFS). HBase provides a fault-tolerant way of storing sparse data sets, which
are common in many big data use cases. HBase also supports writing applications in Apache Avro, REST, and Thrift.
Application :
Medical
Sports
Web
Oil and petroleum
e-commerce
Apache Pig
Pig represents Big Data as data flows. Pig is a high-level platform or tool used to process large datasets. It provides a high level of abstraction over MapReduce and a high-level scripting language, known as Pig Latin, for developing data analysis code. To process data stored in HDFS, programmers first write scripts in the Pig Latin language. Internally, the Pig Engine (a component of Apache Pig) converts these scripts into a series of map and reduce tasks, but this conversion is not visible to the programmers, which is what provides the high level of abstraction. Pig Latin and the Pig Engine are the two main components of the Apache Pig tool. The result of Pig is always stored in HDFS.
Note: The Pig Engine has two types of execution environments: a local execution environment in a single JVM (used when the dataset is small) and a distributed execution environment on a Hadoop cluster.
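As a rough illustration, a small Pig Latin script might look like the following sketch (the input path, output path, and field names are hypothetical). Pig compiles these statements into one or more MapReduce jobs:
-- Hypothetical sketch: group sales records by product and total the amounts.
records = LOAD '/data/sales.txt' USING PigStorage('\t') AS (product:chararray, amount:int);
grouped = GROUP records BY product;
totals  = FOREACH grouped GENERATE group AS product, SUM(records.amount) AS total;
STORE totals INTO '/output/sales_totals';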
Pig allows nested data types such as map, tuple, and bag, whereas SQL does not allow nested data types.
There are several differences between the two languages, and between Pig and relational database
management systems (RDBMSs) in general. The most significant difference is that Pig Latin is a
data flow programming language, whereas SQL is a declarative programming language. In other
words, a Pig Latin program is a step-by-step set of operations on an input relation, in which each
step is a single transformation. By contrast, SQL statements are a set of constraints that, taken
together, define the output. In many ways, programming in Pig Latin is like working at the level
of an RDBMS query planner, which figures out how to turn a declarative statement into a system
of steps.
RDBMSs store data in tables, with tightly predefined schemas. Pig is more relaxed about the data
that it processes: you can define a schema at runtime, but it’s optional. Essentially, it will operate
on any source of tuples (although the source should support being read in parallel, by being in
multiple files, for example), where a UDF is used to read the tuples from their raw representation.
The most common representation is a text file with tab-separated fields, and Pig provides a built-
in load function for this format. Unlike with a traditional database, there is no data import process
to load the data into the RDBMS. The data is loaded from the filesystem (usually HDFS) as the
first step in the processing.
RDBMSs have several features to support online, low-latency queries, such as transactions and
indexes, that are absent in Pig. Pig does not support random reads or queries on the order of tens
of milliseconds. Nor does it support random writes to update small portions of data; all writes are
bulk streaming writes, just like with MapReduce.
Operator set – Many operations, such as join, filter, and sort, can be performed through Pig's built-in operators.
Programming ease – Pig Latin closely resembles SQL, so it is easy to write a Pig script if you are good at SQL.
User-defined functions – Pig offers the facility to create user-defined functions in other programming languages such as Java, and to invoke or embed them in Pig scripts (see the sketch after this list).
Extensibility – Developers can develop their own functions to read, process and write data.
Optimization opportunities – Pig tasks optimize their execution automatically, so programmers only need to focus on the semantics of the language. The animal pig eats anything it can get its mouth on; Apache Pig is named after it because it similarly processes all kinds of data, structured, semi-structured, and unstructured, and stores the results in HDFS.
Handles all kinds of data – Handling all kinds of data is one of the reasons for easy programming: Pig analyses all kinds of data, whether structured or unstructured, and stores the results in HDFS.
Extensibility – Extensibility is one of its most interesting features. Users can develop their own functions to read, process, and write data, using the existing operators.
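A hedged sketch of registering and invoking a Java UDF from a Pig script; the jar name, class com.example.pig.UPPER, and input path are assumptions for illustration:
-- Register a jar containing a hypothetical Java UDF, then call it per row.
REGISTER myudfs.jar;
DEFINE UPPER com.example.pig.UPPER();
names   = LOAD '/data/names.txt' AS (name:chararray);
shouted = FOREACH names GENERATE UPPER(name);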
Pig vs. SQL
Pig Latin is a procedural language, whereas SQL is a declarative language.
In Apache Pig, the schema is optional: data can be stored without designing a schema (values are then referenced by position as $0, $1, etc.). In SQL, a schema is mandatory.
The data model in Apache Pig is nested relational, whereas the data model used in SQL is flat relational.
Apache Pig provides limited opportunity for query optimization, whereas there is more opportunity for query optimization in SQL.
Grunt
Grunt is the interactive shell for Apache Pig. It allows you to enter Pig Latin statements and Pig commands interactively, which is useful for learning Pig Latin, inspecting intermediate relations, and developing and debugging scripts before running them in batch mode. From Grunt you can also run filesystem commands and execute complete Pig scripts.
Grunt has line-editing facilities like those found in GNU Readline (used in the bash shell and many other command-line applications). For instance, the Ctrl-E key combination moves the cursor to the end of the line. Grunt remembers command history, too, and you can recall lines in the history buffer using Ctrl-P or Ctrl-N (for previous and next), or equivalently, the up or down cursor keys. Another handy feature is Grunt's completion mechanism, which will try to complete Pig Latin keywords and functions when you press the Tab key.
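A short hypothetical Grunt session might look like the following (the file path and schema are assumptions); DESCRIBE prints the schema of a relation and DUMP shows its contents:
grunt> records = LOAD '/data/sample.txt' AS (year:int, temperature:int);
grunt> DESCRIBE records;
grunt> DUMP records;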
Pig Latin
Pig Latin is a data flow language used by Apache Pig to analyse data in Hadoop. It is a textual language that abstracts the programming from the Java MapReduce idiom into a higher-level notation. Pig Latin is a fairly simple language with SQL-like semantics, which makes it productive to use, and it contains a rich set of functions for data manipulation.
Data Processing Operators in Pig
Apache Pig operators form a high-level procedural language for querying large data sets using Hadoop and the MapReduce platform. A Pig Latin statement is an operator that takes a relation as input and produces another relation as output. These operators are the main tools Pig Latin provides to operate on the data.
The LOAD operator is used to load data from external storage for processing in Pig. Storing the results is straightforward, too; here is an example of using PigStorage to store tuples as plain-text values separated by a colon character.
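A minimal sketch (the relation name A and the paths are assumptions):
-- Store a hypothetical relation A as colon-separated text in directory 'out'.
STORE A INTO 'out' USING PigStorage(':');
-- Loading colon-separated data works the same way:
A = LOAD '/data/input.txt' USING PigStorage(':') AS (f1:chararray, f2:int);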
Filtering Data
Once you have some data loaded into a relation, often the next step is to filter it to remove the data that you are not interested in. By filtering early in the processing pipeline, you minimize the amount of data flowing through the system, which can improve efficiency. The FOREACH...GENERATE operator is used to act on every row in a relation. It can be used to remove fields or to generate new ones.
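A hedged sketch of both operators (the input path, field names, and sentinel value 9999 are assumptions):
-- Keep only valid readings, then project just the fields we need.
records = LOAD '/data/weather.txt' AS (year:int, temperature:int, quality:int);
good    = FILTER records BY temperature != 9999 AND (quality == 0 OR quality == 1);
trimmed = FOREACH good GENERATE year, temperature;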
Pig has very good built-in support for join operations, making joins much more approachable than writing them by hand in MapReduce. However, since the large datasets that are suitable for analysis by Pig (and MapReduce in general) are typically not normalized, joins are used less frequently in Pig than in relational databases.
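A minimal sketch of a join on a shared key (relation names, paths, and fields are assumptions):
A = LOAD '/data/a.txt' AS (id:int, name:chararray);
B = LOAD '/data/b.txt' AS (id:int, amount:int);
C = JOIN A BY id, B BY id;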
Sorting Data
Relations are unordered in Pig. There is no guarantee which order the rows will be processed in. If you
want to impose an order on the output, you can use the ORDER operator to sort a relation by one or more
fields.
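For example, assuming a relation records with fields temperature and year (a sketch, not taken from these notes):
sorted = ORDER records BY temperature DESC, year;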
Sometimes you have several relations that you would like to combine into one. For this, the UNION
statement is used. Pig attempts to merge the schemas from the relations that UNION is operating on. The
SPLIT operator is the opposite of UNION: it partitions a relation into two or more relations.
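A hedged sketch, assuming relations A and B with compatible schemas and a numeric field named value:
C = UNION A, B;
SPLIT C INTO negatives IF value < 0, non_negatives IF value >= 0;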
Hive
Hive is a data warehouse and ETL tool that provides an SQL-like interface between the user and the Hadoop Distributed File System (HDFS). It is built on top of Hadoop as a software project that provides data query and analysis. It facilitates reading, writing, and handling large datasets stored in distributed storage, queried using SQL-like syntax. It is not built for Online Transaction Processing (OLTP) workloads; it is frequently used for data warehousing tasks such as data encapsulation, ad-hoc queries, and analysis of huge datasets. It is designed to enhance scalability, extensibility, performance, fault tolerance, and loose coupling with its input formats.
Hive was initially developed by Facebook and is used by companies such as Amazon (in Amazon Elastic MapReduce) and Netflix. It delivers standard SQL functionality for analytics. Without Hive, SQL-style queries over distributed data would have to be implemented directly in the MapReduce Java API. Hive also aids portability, since most data warehousing applications work with SQL-based query languages.
Apache Hive is a data warehouse system for Hadoop that runs SQL-like queries called HQL (Hive Query Language), which get internally converted into MapReduce jobs. It supports Data Definition Language (DDL), Data Manipulation Language (DML), and user-defined functions. Apache Hive is a data warehouse software project built on top of the Hadoop ecosystem. It provides an SQL-like interface to query and analyse large datasets stored in Hadoop's distributed file system (HDFS) or other compatible storage systems.
Hive uses a language called HiveQL, which is similar to SQL, to allow users to express data queries,
transformations, and analyses in a familiar syntax. HiveQL statements are compiled into MapReduce jobs,
which are then executed on the Hadoop cluster to process the data.
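As a hedged illustration, a HiveQL session might look like the following sketch (the table, columns, and input path are assumptions, not part of these notes):
-- Define a table over tab-separated data, load a file, and query it.
CREATE TABLE employees (id INT, name STRING, salary DOUBLE)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

LOAD DATA INPATH '/data/employees.txt' INTO TABLE employees;

SELECT name, salary FROM employees WHERE salary > 50000;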
The architecture of Apache Hive consists of the following major components:
User Interface (UI) – As the name suggests, the user interface provides an interface between the user and Hive. It enables users to submit queries and other operations to the system. The Hive Web UI, the Hive command line, and Hive HDInsight (on Windows Server) are supported user interfaces.
Hive Server – It is referred to as the Apache Thrift Server. It accepts requests from different clients and forwards them to the Hive Driver.
Driver – The driver receives the user's queries from the interface. It implements the concept of session handles and provides execute and fetch APIs modelled on JDBC/ODBC interfaces.
Compiler – The compiler parses the query and performs semantic analysis on the different query blocks and query expressions. It eventually generates an execution plan with the help of table and partition metadata looked up from the metastore.
Metastore – The metastore stores all the structural information of the different tables and partitions in the warehouse, including attributes and attribute-level information.
Execution Engine – The execution engine executes the execution plan created by the compiler. The plan is a DAG of stages; the execution engine manages the dependencies between the various stages of the plan and executes these stages on the appropriate system components.
HDFS or HBASE – The Hadoop Distributed File System or HBase is the data storage layer used to store the data in the file system.
Hive Shell
The shell is the primary way that we will interact with Hive, by issuing commands in HiveQL.
HiveQL is Hive’s query language, a dialect of SQL. It is heavily influenced by MySQL, so if you
are familiar with MySQL, you should feel at home using Hive. When starting Hive for the first
time, we can check that it is working by listing its tables; there should be none. HiveQL is generally
case insensitive (except for string comparisons).
Non-Interactive mode : Hive can also run queries without entering the shell, for example by passing a script file with the -f option ($ hive -f script.hql) or an inline query string with the -e option ($ hive -e 'SHOW TABLES;').
Interactive mode :
The hive can work in interactive mode by directly typing the command “hive” in the terminal.
Example: $hive
Other useful Hive shell features include the ability to run commands on the host operating system by using a ! prefix to the command and the ability to access Hadoop filesystems using the dfs command.
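For example (the paths shown are illustrative only):
hive> !ls /tmp;
hive> dfs -ls /user/hive/warehouse;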
Hive CLI - The Hive CLI (Command Line Interface) is a shell where we can execute Hive
queries and commands.
Hive Web User Interface - The Hive Web UI is just an alternative of Hive CLI. It provides a
web-based GUI for executing Hive queries and commands.
Hive MetaStore - It is a central repository that stores all the structural information of the various tables and partitions in the warehouse. It also includes metadata for columns and their types, the serializers and deserializers used to read and write data, and the corresponding HDFS files where the data is stored.
Hive Server - It is referred to as Apache Thrift Server. It accepts the request from different
clients and provides it to Hive Driver.
Hive Driver - It receives queries from different sources like web UI, CLI, Thrift, and
JDBC/ODBC driver. It transfers the queries to the compiler.
Hive Compiler - The purpose of the compiler is to parse the query and perform semantic
analysis on the different query blocks and expressions. It converts HiveQL statements into
MapReduce jobs.
Hive Execution Engine - Optimizer generates the logical plan in the form of DAG of map-
reduce tasks and HDFS tasks. In the end, the execution engine executes the incoming tasks in the
order of their dependencies.
HiveQL
Hive Query Language (HiveQL) is a query language in Apache Hive for processing and analysing structured data. It shields users from the complexity of MapReduce programming. It reuses common concepts from relational databases, such as tables, rows, columns, and schema, to ease learning. Hive provides a CLI for writing queries in HiveQL. Generally, HiveQL syntax is similar to the SQL syntax that most data analysts are familiar with. Hive supports four file formats: TEXTFILE, SEQUENCEFILE, ORC, and RCFILE (Record Columnar File). By default, Hive uses the Derby database for its metastore.
Hive Tables
Hive is designed for querying and managing only structured data stored in tables. Fundamentally, Hive
knows two different types of tables: Internal table and the External table. The Internal table is also known
as the managed table.
Hive Partition is a way to split a large table into smaller tables based on the values of a column (one partition for each distinct value), whereas a Bucket is a technique to divide the data into a manageable number of parts (you can specify how many buckets you want). Both partitioning and bucketing in Hive are used to improve performance by eliminating full table scans when dealing with a large set of data on a Hadoop file system (HDFS). The major difference between partitioning and bucketing lies in how they split the data.
Hive Partition is a way to organize large tables into smaller logical tables based on values of columns; one
logical table (partition) for each distinct value. In Hive, tables are created as a directory on HDFS. A table
can have one or more partitions that correspond to a sub-directory for each partition inside a table directory.
Hive Bucketing, a.k.a. clustering, is a technique to split the data into more manageable files by specifying the number of buckets to create. The value of the bucketing column is hashed into the user-defined number of buckets.
Bucketing can be created on one or more columns, and you can also create buckets on a partitioned table to further split the data, which further improves the query performance of the partitioned table.
Each bucket is stored as a file within the table's directory or the partition directories. Note that a partition creates a directory, and you can have a partition on one or more columns; these are some of the differences between a Hive partition and a bucket. Another difference: you can't manage the number of partitions to create, whereas you can manage the number of buckets to create by specifying the count (see the sketch below).
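A hedged HiveQL sketch of a table that is both partitioned and bucketed (table, column names, bucket count, and the staging table are assumptions):
-- One partition per country; within each partition, rows are hashed into 16 buckets by user_id.
CREATE TABLE page_views (user_id INT, url STRING)
PARTITIONED BY (country STRING)
CLUSTERED BY (user_id) INTO 16 BUCKETS;

INSERT INTO TABLE page_views PARTITION (country = 'IN')
SELECT user_id, url FROM staging_views;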
There are two dimensions that govern table storage in Hive: the row format and the file format. The row
format dictates how rows, and the fields in a particular row, are stored. The file format dictates the container
format for fields in a row. The simplest format is a plain-text file, but there are row-oriented and column-
oriented binary formats available, too. When you create a table with no ROW FORMAT or STORED AS
clauses, the default format is delimited text with one row per line. Hive has native support for the Parquet,
RCFile, and ORCFile column oriented binary formats.
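For illustration, two hypothetical table definitions showing an explicit row format and a column-oriented file format (table and column names are assumptions):
CREATE TABLE logs_text (ts STRING, msg STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

CREATE TABLE logs_orc (ts STRING, msg STRING)
STORED AS ORC;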
Collection Functions
Date Functions
Mathematical Functions
Conditional Functions
String Functions
Misc. Functions
In Hive, users can define their own functions to meet specific client requirements. These are known as UDFs (user-defined functions) in Hive, and they are typically written in Java for specific modules.
Some of UDFs are specifically designed for the reusability of code in application frameworks. The
developer will develop these functions in Java and integrate those UDFs with the Hive. During the Query
execution, the developer can directly use the code, and UDFs will return outputs according to the user-
defined tasks. It will provide high performance in terms of coding and execution.
For example, for string stemming we don’t have any predefined function in Hive. For this, we can write
stem UDF in Java. Wherever we require Stem functionality, we can directly call this Stem UDF in Hive.
Depending on the use cases, the UDFs can be written. It will accept and produce different numbers of input
and output values. The general type of UDF will accept a single input value and produce a single output
value. If the UDF is used in the query, then UDF will be called once for each row in the result data set.
1. Regular UDF: A UDF works on a single row in a table and produces a single value as output; it is a one-to-one relationship between the input and output of the function.
2. UDAF: A user-defined aggregate function works on more than one row and gives a single row as output.
3. UDTF: A user-defined tabular function works on one row as input and returns multiple rows as output, so here the relation is one-to-many.
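A hedged sketch of a regular UDF in Java, using the older org.apache.hadoop.hive.ql.exec.UDF base class; the package, class, and column names are assumptions for illustration:
// Hypothetical regular UDF: one row in, one value out (upper-cases a string).
package com.example.hive;

import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

public class Shout extends UDF {
    // Called once per row; returns the upper-cased input, or null for null input.
    public Text evaluate(Text input) {
        if (input == null) {
            return null;
        }
        return new Text(input.toString().toUpperCase());
    }
}
It could then be used roughly as follows (the jar path, function name, and table are also assumptions):
ADD JAR /path/to/myudfs.jar;
CREATE TEMPORARY FUNCTION shout AS 'com.example.hive.Shout';
SELECT shout(name) FROM employees;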
The SORT BY clause operates on column names of Hive tables to sort the output. We can specify DESC to sort in descending order and ASC for ascending order. SORT BY sorts the rows before feeding them to the reducer. The ordering always depends on the column type: if the column type is numeric it sorts in numeric order, and if the column type is string it sorts in lexicographical order.
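For example (a sketch reusing the hypothetical employees table from above):
SELECT name, salary FROM employees SORT BY salary DESC;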
Hive Aggregate Functions are the most used built-in functions that take a set of values and return a single
value, when used with a group, it aggregates all values in each group and returns one value for each group.
As in SQL, aggregate functions in Hive can be used with or without GROUP BY; however, they are mostly used with GROUP BY, so the sketch below shows them both with and without grouping.
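A hedged sketch (the employees table and its department column are assumptions):
-- Without GROUP BY: one value over the whole table.
SELECT count(*), avg(salary) FROM employees;
-- With GROUP BY: one value per group.
SELECT department, count(*), max(salary) FROM employees GROUP BY department;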
The Map/Reduce script is used to process large volumes of data. It works well in instances when the
data can be broken down into smaller parts. When you run this script, a structured framework creates
enough jobs to process all of these smaller parts. This technique does not require management by the user; it is managed automatically.
This script also has the benefit of allowing these jobs to be processed in parallel. While deploying the
script, the user can select the level of parallelism.
A map/reduce script, like a scheduled script, can be run manually or on a set schedule. Compared to
scheduled scripts, this script has a few advantages. One is that if a map/reduce task breaches certain
features of NetSuite governance, the map/reduce framework will automatically force the job to yield
and its work to be rescheduled for a later time without disrupting the script.
Joins
In Apache Hive, we use the JOIN clause (a HiveQL SELECT ... JOIN query) to combine specific fields from two tables by using values common to each. In other words, to combine records from two or more tables in the database we use the JOIN clause. It is more or less similar to SQL JOIN, and we use it to combine rows from multiple tables.
There are some points we need to observe about Hive joins:
a. Inner Join
Basically, to combine and retrieve records from multiple tables we use the Hive JOIN clause. An inner join, as in SQL, returns only the rows that have matching values in both tables. The JOIN condition is usually expressed using the primary and foreign keys of the tables.
b. Left Outer Join
On defining HiveQL Left Outer Join, even if there are no matches in the right table it returns all the rows
from the left table. Even if the ON clause matches 0 (zero) records in the right table, then also this Hive
JOIN still returns a row in the result. Although, it returns with NULL in each column from the right table.
In addition, it returns all the values from the left table. Also, the matched values from the right table, or
NULL in case of no matching JOIN predicate.
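A hedged sketch of both join types (the customers and orders tables and their columns are assumptions):
-- Inner join: only customers that have at least one matching order.
SELECT c.id, c.name, o.amount
FROM customers c JOIN orders o ON (c.id = o.customer_id);

-- Left outer join: every customer, with NULL order columns when there is no match.
SELECT c.id, c.name, o.amount
FROM customers c LEFT OUTER JOIN orders o ON (c.id = o.customer_id);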
Subqueries
A query nested within another query is known as a subquery. The main query will depend on the values returned
by the subqueries. Hive supports subqueries in FROM clauses and in WHERE clauses of SQL statements.
A subquery is a SQL expression that is evaluated and returns a result set. Then that result set is used to
evaluate the parent query. The parent query is the outer query that contains the child subquery. Subqueries
in WHERE clauses are supported in Hive 0.13 and later.
When to use:
To get a particular value combined from two column values from different tables
Dependency of one table values on other tables
Comparative checking of one column values from other tables
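Two hedged sketches, one subquery in the FROM clause and one in the WHERE clause (tables and columns are assumptions):
-- Subquery in FROM (the subquery must be given an alias, here t).
SELECT t.department, t.total
FROM (SELECT department, sum(salary) AS total FROM employees GROUP BY department) t;

-- Subquery in WHERE (supported in Hive 0.13 and later).
SELECT name FROM employees
WHERE department IN (SELECT department FROM departments WHERE region = 'APAC');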
HBase
HBase is an open-source, column-oriented, distributed database system that runs in a Hadoop environment. It is modeled after Google's Bigtable and is primarily written in Java. Apache HBase is needed for real-time Big Data applications. HBase can store massive amounts of data, from terabytes to petabytes; the tables in HBase can consist of billions of rows with millions of columns. HBase is built for low-latency operations and has some specific features compared to traditional relational models. HBase provides a fault-tolerant way of storing sparse data sets, which are common in many big data use cases. It is well suited for real-time data processing and random read/write access to large volumes of data.
HBase relies on Zookeeper for high-performance coordination. ZooKeeper is built into HBase, but if you’re
running a production cluster, it’s suggested that you have a dedicated ZooKeeper cluster that’s integrated
with your HBase cluster. HBase works well with Hive, a query engine for batch processing of big data, to
enable fault-tolerant big data applications.
Features of HBase
Consistent reads and writes – HBase offers consistent reads and writes, which makes it suitable for high-speed requirements.
Atomic read and write – During one read or write process, all other processes are prevented from performing any read or write operations; HBase offers atomic reads and writes at the row level.
Sharding – In order to reduce I/O time and overhead, HBase offers automatic and manual splitting of regions into smaller subregions as soon as a region reaches a threshold size.
High availability – HBase offers failover and recovery across LAN and WAN. At the core there is a master server which handles monitoring of the region servers as well as all metadata for the cluster.
Client API – When it comes to programmatic access, HBase offers an easy-to-use Java API.
Scalability – HBase supports scalability in both linear and modular form; in addition, we can say it is linearly scalable.
Hadoop/HDFS integration – HBase can run on top of other file systems as well as on Hadoop/HDFS.
Automatic failure support – By using multiple block allocation and replication, HDFS is internally distributed and automatically recovered, and since HBase runs on top of HDFS, HBase is automatically recovered as well. Region Server replication also facilitates this failover.
Backup support – HBase supports back-up of Hadoop MapReduce jobs in HBase tables.
Sorted row keys – Since searching is done over ranges of rows and HBase stores row keys in lexicographical order, we can build optimized requests using these sorted row keys and timestamps.
Real-time processing – In order to perform real-time query processing, HBase supports block caches and Bloom filters.
Faster lookups – HBase internally uses hash tables, offers random access, and stores the data in indexed HDFS files, which makes lookups faster.
Schema-less – There is no fixed column schema in HBase; only column families are defined up front.
High write throughput – Due to its high security and easy management characteristics, HBase offers unprecedentedly high write throughput.
The HBase data model is designed to handle semi-structured data that may differ in field size, data type, and number of columns. The data model's layout partitions the data into simpler components and spreads them across the cluster. HBase's data model consists of various logical components, such as tables, rows, column families, columns, column qualifiers, cells, and versions.
Row: An HBase row consists of a row key and one or more associated value columns. Row keys are uninterpreted bytes. Rows are ordered lexicographically, with the row having the lowest key appearing first in a table. The layout of the row key is therefore very critical.
Column: A column in HBase consists of a column family and a column qualifier, which are delimited by a : (colon) character.
Column Family: Apache HBase columns are separated into column families. The column families physically colocate a group of columns and their values to improve performance. Every row in a table has the same set of column families, but a given row may store nothing in a given column family.
Column Qualifier: A column qualifier is added to a column family to provide an index for a given piece of data. For example, given a column family content, column qualifiers might be content:html and content:pdf. Although column families are fixed at table creation, column qualifiers are mutable and can vary significantly from row to row.
Cell: A cell stores data and is a unique combination of row key, column family, and column qualifier. The data stored in a cell is called its value, and it is always treated as an uninterpreted array of bytes.
Timestamp: In addition to each value, a timestamp is written, which acts as the identifier for a given version of a value. The timestamp reflects the time when the data was written on the Region Server, but when we put data into a cell, we can assign a different timestamp value.
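As a hedged illustration of how these components appear in the HBase Java client API (the table name "users", column family "info", and qualifier "email" are assumptions, not part of these notes):
// Hypothetical sketch: write and read one cell identified by row key, family, and qualifier.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseCellExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("users"))) {

            // Write one cell: row key + column family + qualifier -> value (all bytes).
            Put put = new Put(Bytes.toBytes("row1"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("email"),
                          Bytes.toBytes("a@example.com"));
            table.put(put);

            // Read the cell back; the stored value is an uninterpreted byte array.
            Get get = new Get(Bytes.toBytes("row1"));
            Result result = table.get(get);
            byte[] value = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("email"));
            System.out.println(Bytes.toString(value));
        }
    }
}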
Comparison of RDBMS and HBase (S.No., Parameter, RDBMS, HBase):
1. SQL – RDBMS requires SQL (Structured Query Language); SQL is not required in HBase.
3. Database type – RDBMS is a row-oriented database; HBase is a column-oriented database.
6. Data retrieval – In RDBMS, retrieval of data is slower; in HBase, retrieval of data is faster.
9. Sparse data – RDBMS cannot handle sparse data; HBase can handle sparse data.
10. Volume of data – In RDBMS, the amount of data is determined by the server's configuration; in HBase, the amount of data depends on the number of machines deployed rather than on a single machine.
11. Transaction integrity – In RDBMS, there is mostly a guarantee associated with transaction integrity; in HBase, there is no such guarantee.
The HBase schema design is very different from relational database schema design. Below are some general concepts that should be followed while designing a schema in HBase:
Row key: Each table in HBase is indexed on the row key, and data is sorted lexicographically by this row key. There are no secondary indexes available on an HBase table.
Atomicity: Avoid designing tables that require atomicity across rows; all operations on HBase rows are atomic only at the row level.
Even distribution: Reads and writes should be uniformly distributed across all nodes available in the cluster. Design the row key in such a way that related entities are stored in adjacent rows to increase read efficiency.
In an HBase schema, the row key, column family name, column qualifier, individual cell value, and overall row size all have practical size limits.
When choosing a row key for an HBase table, you should design the table in such a way that there is no hot-spotting. To get the best performance out of an HBase cluster, design a row key that allows the system to write evenly across all the nodes (see the sketch below). A poorly designed row key can cause a full table scan when you request some data out of it.
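For illustration only, one common way to avoid hot-spotting is to prefix ("salt") a monotonically increasing natural key so writes spread across regions; the bucket count and key format below are assumptions:
// Hypothetical sketch of a salted row key.
import java.nio.charset.StandardCharsets;

public class SaltedRowKey {
    static final int SALT_BUCKETS = 16; // assumed bucket count

    static byte[] saltedKey(String naturalKey) {
        // Derive a small prefix from the key so consecutive keys land in different regions.
        int salt = Math.abs(naturalKey.hashCode() % SALT_BUCKETS);
        String salted = String.format("%02d-%s", salt, naturalKey);
        return salted.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println(new String(saltedKey("2024-05-01T10:15:00"), StandardCharsets.UTF_8));
    }
}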
Integer Indexing
This mechanism helps in selecting any arbitrary item in an array based on its N-dimensional index. Each integer array represents the indexes into one dimension. When the index consists of as many integer arrays as the dimensions of the target ndarray, selection becomes straightforward.
Boolean array indexing, another type of advanced indexing, is used when the resultant object is meant to be the result of Boolean operations, such as comparison operators.
Zookeeper
ZooKeeper is an open-source Apache project that provides a centralized service for configuration information, naming, synchronization, and group services over large clusters in distributed systems. ZooKeeper is used in distributed systems to coordinate distributed processes and services. It
provides a simple, tree-structured data model, a simple API, and a distributed protocol to ensure data
consistency and availability. Zookeeper is designed to be highly reliable and fault-tolerant, and it can
handle high levels of read and write throughput.
Zookeeper is implemented in Java and is widely used in distributed systems, particularly in the Hadoop
ecosystem. The goal is to make these systems easier to manage with improved, more reliable propagation
of changes.
ZooKeeper Characteristics
ZooKeeper is simple
ZooKeeper is, at its core, a stripped-down filesystem that exposes a few simple operations and
some extra abstractions, such as ordering and notifications.
ZooKeeper is expressive
The ZooKeeper primitives are a rich set of building blocks that can be used to build a large class
of coordination data structures and protocols. Examples include distributed queues, distributed
locks, and leader election among a group of peers.
ZooKeeper facilitates loosely coupled interactions
ZooKeeper interactions support participants that do not need to know about one another. For example, ZooKeeper can be used as a rendezvous mechanism so that processes that otherwise don't know of each other's existence (or network details) can discover and interact with one another. Coordinating parties may not even be contemporaneous, since one process may leave a message in ZooKeeper that is read by another after the first has shut down.
ZooKeeper is a library
ZooKeeper provides an open source, shared repository of implementations and recipes of common
coordination patterns. Individual programmers are spared the burden of writing common protocols
themselves (which is often difficult to get right). Over time, the community can add to and improve
the libraries, which is to everyone’s benefit.
Apache ZooKeeper is an open-source server that reliably coordinates distributed processes and
applications. It allows distributed processes to coordinate with each other through a shared hierarchal
namespace which is organized similarly to a standard file system. Apache ZooKeeper provides a
hierarchical file system (with ZNodes as the system files) that helps with the discovery, registration,
configuration, locking, leader selection, queueing, etc. of services working in different machines.
The ZooKeeper server maintains configuration information and naming, and provides distributed synchronization and group services used by distributed applications.
Applications Manager's ZooKeeper monitoring aims to help administrators manage their ZooKeeper
server - collect all the metrics that can help with troubleshooting, display performance graphs and be alerted
automatically of potential issues. In order to keep track of your ZooKeeper server's overall operation
efficiency, monitor key performance metrics such as:
Automatically discover ZooKeeper clusters, monitor memory (heap and non-heap) on the Znode, and get
alerts of changes in resource consumption. Automatically collect, graph and get alerts on garbage collection
iterations, heap size, system usage, and threads. ZooKeeper hosts are deployed in a cluster and, as long as
a majority of hosts are up, the service will be available. Applications Manager's ZooKeeper monitoring
helps make sure the total node count inside the ZooKeeper tree is consistent.
Analyze JVM thread dumps with Apache ZooKeeper monitoring to pinpoint the root cause of performance
issues for troubleshooting. Track thread usage with ZooKeeper monitoring metrics like daemon, peak and
live thread count. Ensure that started threads don't overload the server's memory.
Performance statistics
With our ZooKeeper monitor, gauge the amount of time it takes for the server to respond to a client
request, queued requests and connections in the server and performance degradation due to network usage
(client packets sent and received). Get a consistent view of ZooKeeper performance, regardless of whether hosts change roles from follower to leader or back.
Track the number of Znodes, the number of watchers setup over the nodes, and the number of followers
within the ensemble. Keep an eye on the leader selection stats and client session times. Know where the
Leader is for a quorum, and when there is a change in Leaders. Get alerts on the number of active, connected
sessions, and measure the growth rate over a specific time period.
A Configuration Service
The contract of the write() method is that a key with the given value is written to ZooKeeper. It hides the difference between creating a new znode and updating an existing znode with a new value by testing first for the znode using the exists operation and then performing the appropriate operation.
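A minimal sketch of such a write() method using the ZooKeeper Java API (the class name, path handling, and charset are illustrative assumptions):
// Hypothetical sketch: create the znode if it does not exist, otherwise update it.
import java.nio.charset.StandardCharsets;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

public class ConfigWriter {
    private final ZooKeeper zk;

    public ConfigWriter(ZooKeeper zk) {
        this.zk = zk;
    }

    public void write(String path, String value)
            throws KeeperException, InterruptedException {
        byte[] data = value.getBytes(StandardCharsets.UTF_8);
        Stat stat = zk.exists(path, false);
        if (stat == null) {
            // Znode does not exist yet: create it as a persistent node.
            zk.create(path, data, ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        } else {
            // Znode exists: overwrite its value (-1 means "any version").
            zk.setData(path, data, -1);
        }
    }
}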
The first of the Fallacies of Distributed Computing states that "the network is reliable." As they stand, our programs so far have assumed a reliable network, so when they run on a real network they can fail in several ways. Below are some possible failure modes and what we can do to correct them so that our programs are resilient in the face of failure. Every ZooKeeper operation in the Java API declares two types of exception in its throws clause: InterruptedException and KeeperException.
InterruptedException
An InterruptedException is thrown if the operation is interrupted. There is a standard Java mechanism for
canceling blocking methods, which is to call interrupt() on the thread from which the blocking method was
called. A successful cancellation will result in an InterruptedException. ZooKeeper adheres to this standard,
so you can cancel a ZooKeeper operation in this way. Classes or libraries that use ZooKeeper usually should
propagate the InterruptedException so that their clients can cancel their operations.
KeeperException
There are two ways, then, to handle KeeperException: either catch KeeperException and test its code to
determine what remedying action to take, or catch the equivalent KeeperException subclasses and perform
the appropriate action in each catch block. KeeperExceptions fall into three broad categories.
State exceptions
A state exception occurs when the operation fails because it cannot be applied to the znode tree. State
exceptions usually happen because another process is mutating a znode at the same time. For example, a
setData operation with a version number will fail with a KeeperException.BadVersionException if the
znode is updated by another process first because the version number does not match.
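A hedged sketch of handling such a state exception: an optimistic update that retries when another process changes the znode first (the append logic and class name are illustrative assumptions):
// Hypothetical sketch: retry a conditional setData on BadVersionException.
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

public class OptimisticUpdate {
    // Appends a suffix to the znode's data, retrying on version conflicts.
    public static void append(ZooKeeper zk, String path, byte[] suffix)
            throws KeeperException, InterruptedException {
        while (true) {
            Stat stat = new Stat();
            byte[] current = zk.getData(path, false, stat);
            byte[] updated = new byte[current.length + suffix.length];
            System.arraycopy(current, 0, updated, 0, current.length);
            System.arraycopy(suffix, 0, updated, current.length, suffix.length);
            try {
                // Fails with BadVersionException if another process updated the znode
                // after we read it, because the version number no longer matches.
                zk.setData(path, updated, stat.getVersion());
                return;
            } catch (KeeperException.BadVersionException e) {
                // Another process won the race; re-read and try again.
            }
        }
    }
}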
Recoverable exceptions
Recoverable exceptions are those from which the application can recover within the same ZooKeeper
session. A recoverable exception is manifested by KeeperException.ConnectionLossException, which
means that the connection to ZooKeeper has been lost. ZooKeeper will try to reconnect, and in most cases
the reconnection will succeed and ensure that the session is intact.
Unrecoverable exceptions
In some cases, the ZooKeeper session becomes invalid, perhaps because of a timeout or because the session was closed (both of these scenarios get a KeeperException.SessionExpiredException), or perhaps because authentication failed (KeeperException.AuthFailedException). In either case, all ephemeral znodes associated with the session are lost, so the application will usually need to rebuild its state with a new session.
A Lock Service
A distributed lock is a mechanism for providing mutual exclusion between a collection of processes. At
any one time, only a single process may hold the lock. Distributed locks can be used for leader election in
a large distributed system, where the leader is the process that holds the lock at any point in time.
To implement a distributed lock using ZooKeeper, we use sequential znodes to impose an order on the
processes vying for the lock. The idea is simple: first, designate a lock znode, typically describing the entity
being locked on (say, /leader); then, clients that want to acquire the lock create sequential ephemeral znodes
as children of the lock znode.
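A minimal sketch of this first step using the ZooKeeper Java API (the lock path and child-name prefix are assumptions; watch handling and waiting for the lock are omitted):
// Hypothetical sketch: the lowest-numbered ephemeral sequential child holds the lock.
import java.util.Collections;
import java.util.List;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class LockSketch {
    public static boolean tryAcquire(ZooKeeper zk, String lockPath)
            throws KeeperException, InterruptedException {
        // Creates e.g. /leader/lock-0000000042 (the number is assigned by ZooKeeper).
        String myNode = zk.create(lockPath + "/lock-", new byte[0],
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL_SEQUENTIAL);

        List<String> children = zk.getChildren(lockPath, false);
        Collections.sort(children);

        // We hold the lock if our znode has the lowest sequence number.
        String myName = myNode.substring(myNode.lastIndexOf('/') + 1);
        return myName.equals(children.get(0));
    }
}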
Although this algorithm is correct, there are some problems with it. The first problem is that this
implementation suffers from the herd effect. Consider hundreds or thousands of clients, all trying to acquire
the lock. Each client places a watch on the lock znode for changes in its set of children. Every time the lock
is released or another process starts the lock acquisition process, the watch fires, and every client receives
a notification. The “herd effect” refers to a large number of clients being notified of the same event when
only a small number of them can actually proceed.
In this case, only one client will successfully acquire the lock, and the process of maintaining and sending
watch events to all clients causes traffic spikes, which put pressure on the ZooKeeper servers. To avoid the
herd effect, the condition for notification needs to be refined. The key observation for implementing locks
is that a client needs to be notified only when the child znode with the previous sequence number goes
away, not when any child znode is deleted (or created).
Recoverable exceptions
Another problem with the lock algorithm as it stands is that it doesn’t handle the case when the create
operation fails due to connection loss. Recall that in this case we do not know whether the operation
succeeded or failed. Creating a sequential znode is a nonidempotent operation, so we can’t simply retry,
because if the first create had succeeded we would have an orphaned znode that would never be deleted
(until the client session ended, at least). Deadlock would be the unfortunate result.
The problem is that, after reconnecting, the client can't tell whether it created any of the child znodes. The solution is to embed an identifier in the znode name: if the client suffers a connection loss, it can check whether any of the children of the lock znode have its identifier in their names. If a child contains its identifier, it knows the create operation succeeded and it shouldn't create another child znode; if no child has the identifier in its name, the client can safely create a new sequential child znode.
Unrecoverable exceptions
If a client’s ZooKeeper session expires, the ephemeral znode created by the client will be deleted,
effectively relinquishing the lock (or at least forfeiting the client’s turn to acquire the lock). The application
using the lock should realize that it no longer holds the lock, clean up its state, and then start again by
creating a new lock object and trying to acquire it. Notice that it is the application that controls this process,
not the lock implementation, since it cannot second-guess how the application needs to clean up its state.