BDA - Unit 4


Big Data

• The term ‘Big Data’ refers to collections of large datasets characterized by huge volume, high velocity, and a
wide variety of data, all of which keep growing day by day. Such data is difficult to process using traditional
data management systems.
• Therefore, the Apache Software Foundation introduced a framework called Hadoop to solve Big Data
management and processing challenges.
Features of Big Data Platform
Here are the most important features of any good Big Data analytics platform:
a) A Big Data platform should be able to accommodate new platforms and tools based on business requirements,
because business needs can change due to new technologies or changes in business processes.
b) It should support linear scale-out.
c) It should have the capability for rapid deployment.
d) It should support a variety of data formats.
e) The platform should provide data analysis and reporting tools.
f) It should provide real-time data analysis software.
g) It should have tools for searching through large data sets.

Big data is a term for data sets that are so large or complex that traditional data processing applications are
inadequate. Challenges include:-
• Analysis,
• Capture,
• Data Curation,
• Search,
• Sharing,
• Storage,
• Transfer,
• Visualization,
• Querying,
• Updating

HADOOP ECOSYSTEM TOOLS


The Hadoop ecosystem contains different sub-projects (tools), such as Sqoop, Pig, and Hive, that support the
core Hadoop modules.
• Sqoop: Used to import and export data between HDFS and relational databases (RDBMS).
• Pig: A procedural-language platform used to develop scripts for MapReduce operations.
• Hive: A platform used to develop SQL-type scripts to perform MapReduce operations.

List of Big Data Platforms


a) Hadoop
• Hadoop is an open-source, Java-based programming framework and server software used to store and
analyze data across hundreds or even thousands of commodity servers in a clustered environment.
• Hadoop is designed to store and process large datasets extremely fast and in a fault-tolerant way.
• Hadoop uses HDFS (Hadoop Distributed File System) for storing data on a cluster of commodity computers. If
any server goes down, Hadoop replicates the data, so there is no loss of data even in the event of hardware failure.
• Hadoop is an Apache-sponsored project, and it consists of many software packages that run on top of the
Apache Hadoop system.
• Hadoop provides a set of tools and software that forms the backbone of a Big Data analytics system.
• The Hadoop ecosystem provides the necessary tools and software for handling and analyzing Big Data.
• On top of the Hadoop system, many applications can be developed and plugged in to provide an ideal
solution for Big Data needs.
b) Cloudera
• Cloudera is one of the first commercial Hadoop-based Big Data analytics platforms offering Big Data solutions.
• Its product range includes Cloudera Analytic DB, Cloudera Operational DB, Cloudera Data Science &
Engineering, and Cloudera Essentials.
• All these products are based on Apache Hadoop and provide real-time processing and analytics of
massive data sets.

c) Amazon Web Services


• Amazon offers a Hadoop environment in the cloud as part of its Amazon Web Services (AWS) package.
• The AWS Hadoop solution is a hosted solution that runs on Amazon’s Elastic Compute Cloud (EC2) and Simple
Storage Service (S3).
• Enterprises can use AWS to run their Big Data processing and analytics in the cloud environment.
• Amazon EMR allows companies to set up and easily scale Apache Hadoop, Spark, HBase, Presto, Hive, and
other Big Data frameworks using its cloud hosting environment.

d) Hortonworks
Hortonworks uses 100% open-source software without any proprietary components. Hortonworks was the first
to integrate support for Apache HCatalog. Hortonworks is a Big Data company based in California that
develops and supports applications for Apache Hadoop.
The Hortonworks Hadoop distribution is 100% open source and enterprise-ready, with the following features:
• Centralized management and configuration of clusters
• Security and data governance built into the system
• Centralized security administration across the system

e) MapR
• MapR is another Big Data platform, which uses a UNIX-based file system for handling data.
• It does not use HDFS, and the system is easy to learn for anyone familiar with UNIX.
• This solution integrates Hadoop, Spark, and Apache Drill with real-time data processing features.

f) IBM Open Platform


IBM also offers a Big Data platform based on Hadoop ecosystem software.
IBM uses the latest Hadoop software and provides the following features (IBM Open Platform features):
• Based on 100% open-source software
• Native support for rolling Hadoop upgrades
• Support for long-running applications within YARN
• Support for heterogeneous storage in HDFS, including in-memory and SSD tiers in addition to HDD
• Native support for Spark; developers can use Java, Python, and Scala to write programs
• The platform includes Ambari, a tool for provisioning, managing, and monitoring Apache Hadoop
clusters
• IBM Open Platform includes all the software of the Hadoop ecosystem, e.g. HDFS, YARN, MapReduce,
Ambari, HBase, Hive, Oozie, Parquet, Parquet Format, Pig, Snappy, Solr, Spark, Sqoop, ZooKeeper,
OpenJDK, Knox, Slider
• Developers can download a trial Docker image or native installer for testing and learning the system
• The platform is well supported by the IBM technology team

g) Microsoft HDInsight
• Microsoft HDInsight is also based on the Hadoop distribution, and it is a commercial Big Data platform
from Microsoft.
• Microsoft is a software giant known for developing the Windows operating system for desktop and server
users.
• HDInsight is a Hadoop distribution offering that runs in Windows and Azure environments.
• It offers customized, optimized, open-source Hadoop-based analytics clusters that use Spark, Hive,
MapReduce, HBase, Storm, Kafka, and R Server, running on the Hadoop system in the Windows/Azure
environment.
The main Hadoop ecosystem tools and their functions are summarized below:
• ZooKeeper (coordination service) – Provisions a high-performance coordination service for the distributed
running of applications and tasks.
• Avro (data serialization and transfer utility) – Provisions data serialization during data transfer between the
application and processing layers.
• Oozie – Provides a way to package and bundle multiple coordinator and workflow jobs and manage the
lifecycle of those jobs.
• Sqoop (SQL-to-Hadoop, a data-transfer tool) – Provisions data transfer between data stores such as relational
DBs and Hadoop.
• Flume (large data transfer utility) – Provisions reliable data transfer and provides for recovery in case of
failure; transfers large amounts of data in applications, e.g. social-media messages.
• Ambari (a web-based tool) – Provisions, monitors, manages, and views the functioning of the cluster and of
the MapReduce, Hive, and Pig APIs.
• Chukwa (a data collection system) – Provisions and manages data collection for large, distributed systems.
• HBase (a structured data store) – Provisions a scalable and structured database for large tables.
• Cassandra (a database) – Provisions a scalable and fault-tolerant database with multiple masters.
• Hive (a data warehouse system) – Provisions data aggregation, data summarization, data warehouse
infrastructure, ad hoc (unstructured) querying, and a SQL-like scripting language for query processing
using HiveQL.
• Pig (a high-level dataflow language) – Provisions dataflow (DF) functionality and the execution framework
for parallel computations.
• Mahout – Provisions scalable machine learning and library functions for data mining and analytics.

Hive
Hive is a data warehouse infrastructure tool to process structured data in Hadoop. It resides on top of Hadoop to
summarize Big Data, and makes querying and analyzing easy.
Initially Hive was developed by Facebook; later the Apache Software Foundation took it up and developed it
further as open source under the name Apache Hive. It is used by different companies. For example, Amazon
uses it in Amazon Elastic MapReduce.
Apache Hive is a data warehouse system for Hadoop that runs SQL-like queries, called HQL (Hive Query
Language), which are internally converted to MapReduce jobs.
Hive supports Data Definition Language (DDL), Data Manipulation Language (DML), and user-defined functions.
DDL: create table, create index, create view.
DML: Select, Where, Group By, Join, Order By.
Pluggable Functions:
UDF: User-Defined Function
UDAF: User-Defined Aggregate Function
UDTF: User-Defined Table-Generating Function
Hive is not
 A relational database
 A design for OnLine Transaction Processing (OLTP)
 A language for real-time queries and row-level updates
Features of Hive
 Hive's metastore is used to persist the schema, i.e. the table definition (table name, columns, types), the location
of table files, the row format of table files, and the storage format of files.
 Hive Query Language (HiveQL or HQL) is similar to SQL for querying, and queries are converted to MapReduce
jobs in the backend.
 Hive's default metastore database is Derby.
 It stores the schema in a database and the processed data in HDFS.
 It is designed for OLAP.
 It is familiar, fast, scalable and extensible.

Architecture of Hive
The following component diagram depicts the architecture of Hive:

This component diagram contains different units. The following table describes each unit:

Unit Name – Operation

User Interface – Hive is a data warehouse infrastructure software that can create interaction between the user
and HDFS. The user interfaces that Hive supports are the Hive Web UI, the Hive command line, and Hive
HDInsight (on Windows Server).

Meta Store – Hive chooses respective database servers to store the schema or metadata of tables, databases,
columns in a table, their data types, and the HDFS mapping.

HiveQL Process Engine – HiveQL is similar to SQL for querying schema information in the Metastore. It is one
of the replacements of the traditional approach for MapReduce programs. Instead of writing a MapReduce
program in Java, we can write a query for the MapReduce job and process it.

Execution Engine – The conjunction part of the HiveQL Process Engine and MapReduce is the Hive Execution
Engine. The execution engine processes the query and generates results the same as MapReduce results. It uses
the flavor of MapReduce.

HDFS or HBASE – HDFS (Hadoop Distributed File System) or HBase are the data storage techniques used to
store data in the file system.
Working of Hive
The following diagram depicts the workflow between Hive and Hadoop.

The following table defines how Hive interacts with the Hadoop framework:

Step No. Operation
1 Execute Query
The Hive interface, such as the Command Line or Web UI, sends the query to the Driver (any database driver
such as JDBC, ODBC, etc.) to execute.
2 Get Plan
The driver takes the help of the query compiler, which parses the query to check the syntax and the query
plan or the requirements of the query.
3 Get Metadata
The compiler sends metadata request to Metastore (any database).
4 Send Metadata
Metastore sends metadata as a response to the compiler.
5 Send Plan
The compiler checks the requirement and resends the plan to the driver. Up to here, the
parsing and compiling of a query is complete.
6 Execute Plan
The driver sends the execute plan to the execution engine.
7 Execute Job
Internally, the process of executing the job is a MapReduce job. The execution engine sends the
job to the JobTracker, which is in the Name node, and it assigns this job to the TaskTracker, which is in the
Data node. Here, the query executes as a MapReduce job.
7.1 Metadata Ops
The execution engine can execute metadata operations with Metastore.
8 Fetch Result
The execution engine receives the results from Data nodes.
9 Send Results
The execution engine sends those resultant values to the driver.
10 Send Results
The driver sends the results to Hive Interfaces.
File Formats in Hive
 File Format specifies how records are encoded in a file.
 Record Format implies how the stream of bytes for a given record is encoded.
 The default file format is TEXTFILE – each record is a line in the file.
 Hive uses different control characters as delimiters in text files:
 ^A (octal 001), ^B (octal 002), ^C (octal 003), \n
 The term field is used when overriding the default delimiter, e.g. FIELDS TERMINATED BY ‘\001’ (see the
sketch after this list).
 Supports text files – csv, tsv.
 A TextFile can contain JSON or XML documents.
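As a rough illustration of overriding the default delimiters, a table can be declared with explicit field and line
terminators; the table name and columns used here are only examples, not part of the original material:

hive> CREATE TABLE page_view (viewTime INT, userid BIGINT, page_url STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\001'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE;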

Commonly used File Formats –


1. TextFile format
 Suitable for sharing data with other tools
 Can be viewed/edited manually
2. SequenceFile
 Flat files that store binary key-value pairs
 SequenceFile offers Reader, Writer, and Sorter classes for reading, writing, and sorting respectively
 Supports Uncompressed, Record-compressed (only the value is compressed), and Block-compressed (both
key and value compressed) formats
3. RCFile
 RCFile (Record Columnar File) stores the columns of a table in a record-columnar way
4. ORC (Optimized Row Columnar)
5. AVRO
The storage format for a table is chosen with the STORED AS clause, as sketched below.
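A rough sketch of choosing a storage format at table-creation time; the table and column names are illustrative only:

hive> CREATE TABLE logs_text (msg STRING) STORED AS TEXTFILE;
hive> CREATE TABLE logs_seq (k STRING, v STRING) STORED AS SEQUENCEFILE;
hive> CREATE TABLE logs_rc (k STRING, v STRING) STORED AS RCFILE;
hive> CREATE TABLE logs_orc (k STRING, v STRING) STORED AS ORC;
hive> CREATE TABLE logs_avro (k STRING, v STRING) STORED AS AVRO;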

Hive Commands

Hive supports Data Definition Language (DDL), Data Manipulation Language (DML), and user-defined functions;
a brief usage sketch follows the command lists below.

Hive DDL Commands


 create database
 drop database
 create table
 drop table
 alter table
 create index
 create view

Hive DML Commands


 Select
 Where
 Group By
 Order By
 Load Data
 Join:
o Inner Join
o Left Outer Join
o Right Outer Join
o Full Outer Join
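A brief, illustrative sketch of how these commands fit together. The database name and the file path are
assumptions made for the example; the employee columns follow the example table used later in this section:

hive> CREATE DATABASE IF NOT EXISTS demo;
hive> USE demo;
hive> CREATE TABLE IF NOT EXISTS employee (Id INT, Name STRING, Salary FLOAT, Designation STRING, Dept STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE;
hive> LOAD DATA LOCAL INPATH '/tmp/employee.csv' INTO TABLE employee;  -- hypothetical local path
hive> ALTER TABLE employee RENAME TO emp;
hive> DROP TABLE emp;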
Hive Query Language - The Hive Query Language (HiveQL) is a query language for Hive to process and analyze
structured data in a Metastore.

The SELECT statement is used to retrieve data from a table. The WHERE clause works like a condition: it filters
the data using the condition and gives you a finite result. The built-in operators and functions generate an
expression which fulfils the condition.
Syntax of the SELECT query:
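Roughly, the general form is as follows (clauses in square brackets are optional; the names such as select_expr
and table_reference are placeholders):

SELECT [ALL | DISTINCT] select_expr, select_expr, ...
FROM table_reference
[WHERE where_condition]
[GROUP BY col_list]
[HAVING having_condition]
[ORDER BY col_list]
[LIMIT number];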

Example for SELECT…WHERE clause.


Assume an employee table with fields named Id, Name, Salary, Designation, and Dept. Generate a query to
retrieve the details of employees who earn a salary of more than Rs 30000.
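One way to write such a query against this employee table:

hive> SELECT * FROM employee WHERE Salary > 30000;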

The ORDER BY clause is used to retrieve the details based on one column and sort the result set in ascending or
descending order.
Syntax of the ORDER BY clause:
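Roughly, the general form is (placeholders as before):

SELECT select_expr, ...
FROM table_reference
[WHERE where_condition]
ORDER BY col_name [ASC | DESC];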

Example - Generate a query to retrieve, in sorted order, the details of employees who earn a salary of more than Rs 30000.
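One possible query; the prompt does not name a sort column, so ordering by Dept here is only an assumption:

hive> SELECT Id, Name, Salary, Designation, Dept
FROM employee
WHERE Salary > 30000
ORDER BY Dept;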
The GROUP BY clause is used to group all the records in a result set using a particular collection
column. It is used to query a group of records.
Syntax of GROUP BY clause is as follows:
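Roughly, the general form is (placeholders as before):

SELECT col_list, aggregate_function(expr)
FROM table_reference
[WHERE where_condition]
GROUP BY col_list;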

Example - Generate a query to retrieve the number of employees in each department.
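One way to write it, using the Dept column of the employee table described above:

hive> SELECT Dept, count(*)
FROM employee
GROUP BY Dept;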

JOIN is a clause that is used for combining specific fields from two tables by using values common to each one. It is
used to combine records from two or more tables in the database.
Syntax of JOIN clause is as follows:
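A rough sketch of the general form, where t1 and t2 are table aliases; JOIN may be replaced by the outer-join
variants listed below:

SELECT t1.col_name, t2.col_name, ...
FROM table1 t1
JOIN table2 t2
ON (t1.key_col = t2.key_col);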

There are different types of joins given as follows:


 JOIN
 LEFT OUTER JOIN
 RIGHT OUTER JOIN
 FULL OUTER JOIN

Example
Consider the following two tables named CUSTOMERS and ORDERS.
JOIN
The JOIN clause is used to combine and retrieve records from multiple tables. JOIN is the same as INNER JOIN in
SQL. A JOIN condition is specified using the primary keys and foreign keys of the tables.

Eg. - The following query executes JOIN on the CUSTOMERS and ORDERS tables and retrieves the records:
hive> SELECT c.ID, c.NAME, c.AGE, o.AMOUNT
FROM CUSTOMERS c JOIN ORDERS o
ON (c.ID = o.CUSTOMER_ID);
On successful execution of the query, you get to see the following response:

LEFT OUTER JOIN


The HiveQL LEFT OUTER JOIN returns all the rows from the left table, even if there are no matches in the
right table.
This means, if the ON clause matches 0 (zero) records in the right table, the JOIN still returns a row in the
result, but with NULL in each column from the right table.
A LEFT JOIN returns all the values from the left table, plus the matched values from the right table, or NULL
in case of no matching JOIN predicate.

Eg.- The following query demonstrates LEFT OUTER JOIN between the CUSTOMERS and ORDERS tables:
hive> SELECT c.ID, c.NAME, o.AMOUNT, o.DATE
FROM CUSTOMERS c
LEFT OUTER JOIN ORDERS o
ON (c.ID = o.CUSTOMER_ID);
On successful execution of the query, you get to see the following response:
RIGHT OUTER JOIN
The HiveQL RIGHT OUTER JOIN returns all the rows from the right table, even if there are no matches in the
left table.
If the ON clause matches 0 (zero) records in the left table, the JOIN still returns a row in the result, but with
NULL in each column from the left table.
A RIGHT JOIN returns all the values from the right table, plus the matched values from the left table, or NULL
in case of no matching join predicate.

Eg.- The following query demonstrates RIGHT OUTER JOIN between the CUSTOMERS and ORDERS tables.
hive> SELECT c.ID, c.NAME, o.AMOUNT, o.DATE
FROM CUSTOMERS c
RIGHT OUTER JOIN ORDERS o ON (c.ID = o.CUSTOMER_ID);
On successful execution of the query, you get to see the following response:

FULL OUTER JOIN


The HiveQL FULL OUTER JOIN combines the records of both the left and right tables that fulfill the
JOIN condition.
The joined table contains either all the records from both tables, or fills in NULL values for missing matches
on either side.

Eg.- The following query demonstrates FULL OUTER JOIN between the CUSTOMERS and ORDERS tables:
hive> SELECT c.ID, c.NAME, o.AMOUNT, o.DATE
FROM CUSTOMERS c
FULL OUTER JOIN ORDERS o
ON (c.ID = o.CUSTOMER_ID);
On successful execution of the query, you get to see the following response:
Apache Pig
Apache Pig is a high-level data flow platform for executing MapReduce programs on Hadoop.
The language used for Pig is Pig Latin. Pig Latin is a data flow language used by Apache Pig to analyze data
in Hadoop. It is a textual language that abstracts the programming from the Java MapReduce idiom into a
higher-level notation.
Pig scripts are internally converted to MapReduce jobs and executed on data stored in HDFS.
Apart from that, Pig can also execute its jobs in Apache Tez or Apache Spark.
Pig can handle any type of data, i.e. structured, semi-structured, or unstructured, and stores the corresponding
results in the Hadoop Distributed File System (HDFS).
Every task that can be achieved using Pig can also be achieved using Java in MapReduce.

Features of Apache Pig


1) Ease of programming
Writing complex Java programs for MapReduce is quite tough for non-programmers. Pig makes this
process easy. In Pig, the queries are converted to MapReduce internally.
2) Optimization opportunities
The way tasks are encoded permits the system to optimize their execution automatically, allowing the
user to focus on semantics rather than efficiency.
3) Extensibility
Users can write user-defined functions containing their own logic to execute over the data set.
4) Flexible
It can easily handle structured as well as unstructured data.
5) In-built operators
It contains various types of operators such as sort, filter, and join.

Advantages of Apache Pig


o Less code - Pig requires fewer lines of code to perform an operation.
o Reusability - Pig code is flexible enough to be reused.
o Nested data types - Pig provides useful nested data types like tuple, bag, and map.

Apache Pig Execution Modes


Apache Pig can be executed in two modes, namely, Local Mode and HDFS mode.
Local Mode
In this mode, all the files are installed and run from the local host and the local file system. There is no need for
Hadoop or HDFS. This mode is generally used for testing purposes.
MapReduce Mode
MapReduce mode is where we load or process data that exists in the Hadoop Distributed File System (HDFS)
using Apache Pig. In this mode, whenever we execute Pig Latin statements to process the data, a MapReduce job
is invoked in the back-end to perform a particular operation on the data that exists in HDFS.

Apache Pig scripts can be executed in three ways, namely, interactive mode, batch mode and embedded mode.
Interactive Mode (Grunt shell) − Apache Pig can be executed in interactive mode using the Grunt shell. In this
shell, Pig Latin statements can be entered and the output can be produced (using the Dump operator).
Batch Mode (Script) − Apache Pig can be executed in batch mode by writing the Pig Latin script in a single file
with a .pig extension.
Embedded Mode (UDF) − Apache Pig provides the provision of defining our own functions (User Defined
Functions) in programming languages such as Java, and using them in the script.
Pig Data Types - Pig Latin provides simple data types such as int, long, float, double, chararray, boolean,
bytearray, and datetime, and complex types such as tuple, bag, and map.

Given below in the table are some frequently used Pig Commands:-

Command Function
load Reads data from the file system

store Writes data to the file system


foreach Applies expressions to each record and outputs one or more records
filter Applies predicate and removes records that do not return true
Group/cogroup Collects records with the same key from one or more inputs
join Joins two or more inputs based on a key
order Sorts records based on a key
distinct Removes duplicate records
union Merges data sets
split Splits data into two or more sets based on filter conditions
stream Sends all records through a user-provided binary
dump Writes output to stdout
limit Limits the number of records

Pig Vs Hive
Apache Spark
Apache Spark is a general-purpose, lightning-fast cluster computing system designed for fast computation. It
provides high-level APIs in Java, Scala, Python, and R.
Spark can run workloads up to 100 times faster than Hadoop MapReduce when processing data in memory, and
about 10 times faster when accessing data from disk. It builds on the Hadoop MapReduce model and extends it to
efficiently support more types of computation, including interactive queries and stream processing. It can be
integrated with Hadoop and can process existing Hadoop HDFS data.
Spark is written in Scala but provides rich APIs in Scala, Java, Python, and R.

Spark Architecture
Apache Spark uses a master-slave architecture.
Just as in the real world, the master gets the job done by using its slaves: there is one master process and
multiple slave processes controlled by that dedicated master. In the Spark environment, master nodes are
referred to as drivers and slaves are referred to as executors.
Drivers - The master manages, maintains, and monitors the slaves, while the slaves are the actual workers that
perform the processing tasks. The user tells the master what work is to be done, and the master takes care of the
rest, completing the task using its slaves.
The driver is the master process in the Spark environment. It contains all the metadata about the slaves or, in
Spark terms, the executors.
Drivers are responsible for: analyzing, distributing, scheduling, and monitoring the executors.
Executors
Executors are the slave processes in the Spark environment. They perform the data processing that is assigned to
them by their master.

Apache Spark Components


Apache Spark promises faster data processing and easier development. The components of Spark resolve
the issues that cropped up while using Hadoop MapReduce.

i. Spark Core - It is the kernel of Spark, which provides an execution platform for all Spark applications. It is a
generalized platform to support a wide array of applications.
ii. Spark SQL - It enables users to run SQL/HQL queries on top of Spark. Using Apache Spark SQL, we
can process structured as well as semi-structured data. It also provides an engine that allows unmodified
Hive queries to run up to 100 times faster on existing deployments.
iii. Spark Streaming - Apache Spark Streaming enables powerful interactive and analytical applications
on live streaming data. The live streams are converted into micro-batches which are executed on top of
Spark Core.
iv. Spark MLlib - It is the scalable machine learning library which delivers both efficiency and
high-quality algorithms. Apache Spark MLlib is one of the hottest choices for data scientists due to its
capability for in-memory data processing, which drastically improves the performance of iterative
algorithms.
v. Spark GraphX - Apache Spark GraphX is the graph computation engine built on top of Spark that enables
processing of graph data at scale.
vi. SparkR - It is an R package that provides a lightweight frontend to use Apache Spark from R. It allows data
scientists to analyze large datasets and interactively run jobs on them from the R shell. The main idea
behind SparkR was to explore different techniques to integrate the usability of R with the scalability of
Spark.
There are two main ways to execute Spark applications.
1. Using the Spark shell
2. Using the spark-submit method
#1) Spark shell
The Spark shell is an interactive way to execute Spark applications on the terminal. The Spark shell prints
information about the user's last action as the user enters it into the shell.
#2) Spark-submit
In a real-world scenario, you cannot execute programs interactively. More often than not, you have to process
data periodically or in real time using jar files, which contain the code to process the data. Spark-submit is the
method used in Spark to run application jars in a Spark environment. Just like running a Java jar file on a
terminal, you can execute Spark applications using spark-submit. Spark gets the data file, splits it into small
chunks of data, and distributes those chunks of data across the cluster.

Features of Apache Spark:-


 Speed / Swift Processing − Apache Spark is a fast, in-memory data processing engine. Spark helps run an
application in a Hadoop cluster up to 100 times faster in memory, and 10 times faster when running on disk.
This is possible by reducing the number of read/write operations to disk; it stores the intermediate processing
data in memory.
 With development APIs, it allows executing streaming, machine learning, or SQL workloads.
 In-Memory Processing: Improves efficiency through:
 In-memory computing primitives
 General computation graphs (DAG)
 Up to 100× faster
 Improves usability through:
 Rich APIs in Java, Scala, Python
 Interactive shell
 Often 2-10× less code
 Open-source parallel process computational framework primarily used for data engineering and analytics.
 Supports multiple languages / Dynamic − Spark provides built-in APIs in Java, Scala, and Python, and users can
write applications in different languages. Spark comes with 80 high-level operators for interactive
querying.
 Advanced Analytics − Spark not only supports ‘Map’ and ‘Reduce’; it also supports SQL queries, streaming
data, machine learning (ML), and graph algorithms.
 Reusability - Apache Spark provides code reusability for batch processing, joining streams against historical
data, or running ad hoc queries on stream state.
 Fault Tolerance - Spark RDDs (Resilient Distributed Datasets) are an abstraction designed to seamlessly handle
failures of any worker node in the cluster, so the loss of data and information is negligible.
 Apache Spark is compatible with both versions of the Hadoop ecosystem.
Zookeeper
Apache Zookeeper is a distributed, open-source coordination service for distributed applications. It provides a
central place for distributed applications to store data, communicate with one another and coordinate activities.
It exposes a simple set of primitives to implement higher-level services for synchronization, configuration
maintenance and group naming. It provides a simple, tree-structured data model, a simple API, and a distributed
protocol to ensure data consistency and availability. Zookeeper is designed to be highly reliable and fault-
tolerant, and it can handle high levels of read and write throughput.
In a distributed system, there are multiple nodes or machines that need to communicate with each other and
coordinate their actions. ZooKeeper provides a way to ensure that these nodes are aware of each other and can
coordinate their actions. It does this by maintaining a hierarchical tree of data nodes called “Znodes“, which
can be used to store and retrieve data and maintain state information.
ZooKeeper provides a set of primitives, such as locks, barriers and queues, that can be used to coordinate the
actions of nodes in a distributed system. It also provides features such as leader election, failover and recovery,
which can help ensure that the system is resilient to failures. ZooKeeper is widely used in distributed systems
such as Hadoop, Kafka and HBase.

ZooKeeper is implemented in Java; it is an Apache Software Foundation project released under the
Apache License 2.0.
Why Apache Zookeeper is needed?
 Coordination services: The integration/communication of services in a distributed environment.
 Coordination services are complex to get right. They are especially prone to errors such as race conditions
and deadlock.
 Race condition − Two or more systems trying to perform the same task at the same time.
 Deadlock − Two or more operations waiting for each other.
 To make coordination between distributed environments easy, developers came up with ZooKeeper, so that
distributed applications are relieved of the responsibility of implementing coordination services from
scratch.

Architecture of Zookeeper

The ZooKeeper architecture consists of a hierarchy of nodes called znodes, organized in a tree-like structure.
Each znode can store data and has a set of permissions that control access to the znode. The znodes are
organized in a hierarchical namespace, similar to a file system. At the root of the hierarchy is the root znode
and all other znodes are children of the root znode. Each znode can have children and grandchildren, and so on.
Important Components in Zookeeper

 Leader & Follower


 Request Processor – Active in Leader Node and is responsible for processing write requests. After processing, it
sends changes to the follower nodes
 Atomic Broadcast – Present in both Leader Node and Follower Nodes. It is responsible for sending the changes
to other Nodes.
 In-memory Databases (Replicated Databases) - Responsible for storing the data in ZooKeeper. Every node
contains its own database. Data is also written to the file system, providing recoverability in case of any problems
with the cluster.

Other Components
 Client – One of the nodes in the distributed application cluster. It accesses information from the server. Every
client sends a message to the server to let the server know that it is alive.
 Server – Provides all the services to the client and gives acknowledgment to the client.
 Ensemble – A group of ZooKeeper servers. The minimum number of nodes required to form an ensemble is 3.

In Zookeeper, data is stored in a hierarchical namespace, similar to a file system. Each node in the namespace is
called a Znode, and it can store data and have children. Znodes are similar to files and directories in a file system.
Zookeeper provides a simple API for creating, reading, writing, and deleting Znodes. It also provides mechanisms for
detecting changes to the data stored in Znodes, such as watches and triggers. Znodes maintain a stat structure that
includes: Version number, ACL, Timestamp, Data Length.

The common services provided by ZooKeeper are as follows −


 Naming service − Identifying the nodes in a cluster by name. It is similar to DNS, but for nodes.
 Configuration management − Latest and up-to-date configuration information of the system for a joining node.
 Cluster management − Joining / leaving of a node in a cluster and node status at real time.
 Leader election − Electing a node as leader for coordination purpose.
 Locking and synchronization service − Locking the data while modifying it. This mechanism helps you in
automatic fail recovery while connecting other distributed applications like Apache HBase.
 Highly reliable data registry − Availability of data even when one or a few nodes are down.

Benefits of ZooKeeper
 Simple distributed coordination process
 Synchronization − Mutual exclusion and co-operation between server processes. This process helps in Apache
HBase for configuration management.
 Ordered Messages
 Serialization − Encode the data according to specific rules to ensure your application runs consistently. This
approach can be used in MapReduce to coordinate the queue for executing running threads.
 Reliability
 Atomicity − A data transfer either succeeds or fails completely; no transaction is partial.
Applications of ZooKeeper
1. Yahoo!
The ZooKeeper framework was originally built at “Yahoo!”. A well-designed distributed application needs to
meet requirements such as data transparency, better performance, robustness, centralized configuration, and
coordination. So, they designed the ZooKeeper framework to meet these requirements.
2. Apache Hadoop
Apache Hadoop is the driving force behind the growth of Big Data industry. Hadoop relies on ZooKeeper for
configuration management and coordination. ZooKeeper provides the facilities for cross-node
synchronization and ensures the tasks across Hadoop projects are serialized and synchronized.
Multiple ZooKeeper servers support large Hadoop clusters. Each client machine communicates with one of
the ZooKeeper servers to retrieve and update its synchronization information. Some of the real-time examples
are −
 Human Genome Project − The Human Genome Project contains terabytes of data. Hadoop
MapReduce framework can be used to analyze the dataset and find interesting facts for human development.
 Healthcare − Hospitals can store, retrieve, and analyze huge sets of patient medical records, which are
normally in terabytes.
3. Apache HBase
Apache HBase is an open-source, distributed, NoSQL database used for real-time read/write access to large
datasets, and it runs on top of HDFS. HBase follows a master-slave architecture where the HBase Master
governs all the slaves. Slaves are referred to as Region Servers.
HBase distributed application installation depends on a running ZooKeeper cluster. Apache HBase uses
ZooKeeper to track the status of distributed data throughout the master and region servers with the help
of centralized configuration management and distributed mutex mechanisms. Some use-cases of HBase −
 Telecom − The telecom industry stores billions of mobile call records (around 30 TB/month), and accessing
these call records in real time becomes a huge task. HBase can be used to process all the records in real
time, easily and efficiently.
 Social network − Similar to telecom industry, sites like Twitter, LinkedIn, and Facebook receive huge
volumes of data through the posts created by users. HBase can be used to find recent trends and other
interesting facts.
4. Apache Solr
Apache Solr is a fast, open-source search platform written in Java. It is a blazing-fast, fault-tolerant distributed
search engine. Built on top of Lucene, it is a high-performance, full-featured text search engine.
Solr extensively uses ZooKeeper features such as configuration management, leader election, node
management, and locking and synchronization of data. Some of the use-cases of Apache Solr include e-commerce,
job search, etc.
Solr uses ZooKeeper for both indexing the data in multiple nodes and searching from multiple nodes.
ZooKeeper contributes the following features −
 Add / remove nodes as and when needed
 Replication of data between nodes and subsequently minimizing data loss
 Sharing of data between multiple nodes and subsequently searching from multiple nodes for faster search
results
5. Apache Mesos
Apache Mesos is a cluster manager that offers efficient resource isolation and sharing across distributed
applications. Mesos uses ZooKeeper for its fault-tolerant replicated master.
6. Apache Accumulo
Apache Accumulo, another sorted distributed key/value store, is built on top of Apache ZooKeeper (and
Apache Hadoop).
7. Neo4j
Neo4j is a distributed graph database that uses ZooKeeper for write-master selection and read-slave
coordination.
8. Cloudera
Cloudera Search integrates search functionality with Hadoop and uses ZooKeeper for centralized
configuration management.
