SITA1603 Unit 3 Material
Syllabus
UNIT 3 BIG DATA ECOSYSTEM TOOLS 9 Hrs.
Data Ingestion and Streaming: Kafka, Flume - Data Querying and Analysis: Drill, Impala - Data
Processing: Hive, Pig, Pig Latin scripts to process data stored in Hadoop- Data Integration and Workflow
Management: Nifi, Oozie. Managing and coordinating distributed systems: ZooKeeper.
3.1 Kafka
Kafka is a distributed streaming platform for processing real-time data. Streaming means the data is unbounded: it never ends, it just keeps arriving, and you can process it in real time. Distributed means that Kafka runs as a cluster, and each node in the cluster is called a Broker. Those brokers are simply servers running a copy of Apache Kafka. So, Kafka is a set of machines working together to handle and process real-time, unbounded data.
• Apache Kafka is an open-source platform, also available as a cloud-managed service, for real-time information and event-driven architecture.
• The architecture of Apache Kafka includes producers, topics, partitions, brokers, and
consumers, which work together in a distributed data consumption system.
• Apache Kafka provides high throughput, scalability, fault tolerance, and real-time data
processing, making it a popular choice for messaging, log aggregation, and website activity
tracking.
Apache Kafka is a famous distributed streaming platform used to create real-time data pipelines and streaming applications. It is renowned for its high throughput, fault tolerance, and scalability. Apache Kafka is open source and can be used either as a cloud-managed service, such as Confluent Cloud, or as a self-managed Kafka deployment. When using Apache Kafka, it is essential to regularly monitor the cluster and optimize the configuration settings for best performance.
Apache Kafka manages real-time information, facilitates stream processing, and enables event-driven
architecture. It provides straightforward approaches for processing and evaluating large amounts of
data, making it well-suited for implementation in social media platforms and other applications that
require real-time data processing and analysis.
The architecture of Apache Kafka is a key concept to understand to utilize this powerful
platform fully for data streaming and processing. It is a distributed system that can handle large
volumes of data. It is highly scalable and fault-tolerant. The key components include producers,
topics, partitions, brokers, and consumers.
1. Producers
1. Producers in Apache Kafka are responsible for publishing data from various sources
or feeds.
2. They transmit raw data to Kafka topics for further processing and distribution to
consumers.
3. Producers ensure that the raw data is efficiently transmitted to Kafka brokers for
storage and retrieval.
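As a minimal sketch of the producer side, the console producer shipped with Kafka can publish messages from the command line (the topic name page-views and the broker address are assumptions for illustration):
$ kafka-console-producer.sh --bootstrap-server localhost:9092 --topic page-views
> {"user": "u123", "page": "/home"}
> {"user": "u456", "page": "/cart"}
Each line typed at the > prompt is sent to the topic as one message.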
2. Topics
A topic is a particular type of data stream; it is very similar to a queue in that it receives and delivers messages. A topic is divided into partitions; each topic can have one or more partitions, and we need to specify that number when creating the topic.
1. Topics in Apache Kafka organize and categorize the data streams, creating a multi-step
pipeline for efficient data processing. They act as an immutable log that stores messages in a
serialized fashion.
2. Creating topics involves multiple steps: setting configurations, creating partitions, defining
replication factor, and adjusting retention policies.
When building a data processing system, we structured our Kafka topics to support a multi-
step pipeline. This enabled seamless data flow and efficient processing, highlighting the power of
Kafka’s topic architecture.
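As a hedged example of the topic-creation step described above, recent Kafka versions let you create a topic and fix its partition count and replication factor with the kafka-topics.sh tool (the topic name and broker address are assumptions; older Kafka versions use a --zookeeper option instead of --bootstrap-server):
$ kafka-topics.sh --create --topic page-views \
    --bootstrap-server localhost:9092 \
    --partitions 3 --replication-factor 3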
3. Partitions
1. Partitions in Apache Kafka are physical log files that store messages.
2. They allow distributed data consumption by enabling parallel processing among consumers.
3. Partitions also contribute to Data Recovery in case of failure, as each message is persisted
and replicated across brokers.
4. Brokers
Kafka works in a distributed way. A Kafka cluster may contain as many brokers as needed. Each broker in a cluster is identified by an ID and contains at least one partition of a topic. Brokers handle the pipeline of publishing, storing, and retrieving data. The number of copies kept for each partition is controlled by the Replication Factor, which we configure when creating a topic.
Let’s say that we have three brokers in our cluster and a topic with three partitions and a Replication Factor of three. In that case, each broker will be the leader for one partition of the topic and will also hold replicas of the other two partitions. In this example the number of partitions matches the number of brokers, so each broker leads exactly one partition and the load is spread evenly; in general, though, the partition count and broker count do not have to be equal.
To ensure the reliability of the cluster, Kafka introduces the concept of the Partition Leader. For each partition of a topic, one broker acts as the leader, and there can be only one leader per partition. The leader is the only broker that receives messages for that partition; the replicas simply sync the data from it (they need to stay in sync to do that). This ensures that even if a broker goes down, its data is not lost, because of the replicas. When a leader goes down, a replica is automatically elected as the new leader by ZooKeeper.
For example, suppose Broker 1 is the leader of Partition 1 of Topic 1 and has a replica on Broker 2. If Broker 1 dies, ZooKeeper detects the change and makes Broker 2 the new leader of Partition 1. This is what makes the distributed architecture of Kafka so powerful.
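To see how partitions, leaders, and replicas are laid out across the brokers, you can describe the topic (a sketch; the topic name and broker address are assumptions). The output lists, for each partition, its Leader broker, its Replicas, and its in-sync replicas (Isr):
$ kafka-topics.sh --describe --topic page-views --bootstrap-server localhost:9092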
5. Consumers
• Consumers in Apache Kafka are responsible for distributed data consumption from the
topics.
• They enable stream processing by subscribing to specific topics and processing the Raw Data
in real-time.
• Consumers play a pivotal role in ensuring that data is efficiently processed, enabling various
use cases like log aggregation and real-time analytics.
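A minimal consumer sketch using the console consumer bundled with Kafka (the topic, group name, and broker address are assumptions); it subscribes to the topic and prints every message as it arrives, starting from the beginning of the log:
$ kafka-console-consumer.sh --bootstrap-server localhost:9092 \
    --topic page-views --group analytics --from-beginning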
In order to understand the inner workings of Apache Kafka, it is important to first grasp how it
functions as a data management system. This section will delve into the essential components of Kafka
and how they work together to facilitate the publishing, storing, and retrieving of data. From the role
of producers and consumers to the importance of topics and partitions, we will explore the intricate
processes involved in the distribution and management of data within Kafka.
1. Publishing Data
1. Producers collect raw data from various sources or feeds.
2. Producers then send this data to the Kafka cluster for storage and processing.
3. Kafka stores and replicates this data across brokers for fault tolerance and high availability.
2. Storing Data
1. Decide on the topic and partition to which the data will be stored.
2. Serialize the data and add it to the appropriate topic and partition.
3. Once the data is added, it is stored in the Kafka broker’s file system.
True story: When our company implemented Kafka for data storage, organizing the topics
and partitions efficiently boosted data retrieval speed by 40%.
3. Retrieving Data
1. Retrieve Data: Consumers retrieve data from Kafka by pulling messages from specific partitions or topics.
2. Distributed Data Consumption: Consumers can be distributed across multiple nodes for
parallel data consumption.
3. Data Sources: Data can be retrieved from various sources, such as IoT devices, applications,
and databases.
For instance, a company utilized Apache Kafka to retrieve real-time data from sensors
located at various factory locations. This allowed them to monitor equipment performance, enabling
timely maintenance and reducing downtime.
Apache Kafka offers a powerful and efficient way to handle large-scale real-time data processing. In this section, we will examine the various benefits that come with using this popular stream processing platform. From its high throughput to its fault tolerance and real-time processing capabilities, the following are the major benefits that enhance the overall data processing experience.
1. High Throughput
• Utilize Apache Kafka for high throughput by leveraging its efficient messaging system.
• Configure Kafka to handle extensive data inflow using its high scalability.
• Take advantage of Kafka’s fault tolerance to ensure continuous high throughput even in the
event of failures.
• Employ Kafka for real-time stream processing, enabling the processing of raw data with
simple methods.
2. Scalability
• Implement partitioning to distribute data across multiple brokers for enhanced performance.
• Employ load balancing to distribute workloads evenly across brokers and consumers in a large-scale distributed system.
A few years ago, our company faced a challenge of processing a massive amount of data from
various data sources. With the implementation of Apache Kafka’s scalability features, we were able to
efficiently handle the increased workload, ensuring seamless data processing for multiple users in our
large-scale distributed system.
3. Fault Tolerance
• Immutable Log: Apache Kafka achieves fault tolerance through an immutable log, ensuring
that once data is written, it cannot be modified. This log enables easy data recovery in case
of failures.
• Consistent System State: Kafka maintains fault tolerance by keeping replicas of data in sync,
ensuring a consistent system state even in the event of node failures.
• Data Recovery: In case of node failures, Kafka’s fault tolerance mechanisms ensure seamless
data recovery, maintaining system reliability and consistency.
4. Real-Time Data Processing
• Real-time data processing involves handling and analyzing event data as it occurs.
• Utilize Apache Kafka to capture and process real-time information seamlessly.
Apache Kafka is a powerful tool that can be utilized for a variety of use cases. From messaging to log aggregation to stream processing, Kafka offers a versatile platform for handling large amounts of data. The major applications include messaging, log aggregation, stream processing, and website activity tracking.
1. Messaging
• Choose a suitable pubsub system for your needs, such as Apache Kafka, ensuring it aligns
with real-world messaging requirements.
• Consider scalability and fault tolerance to cater to evolving messaging demands efficiently.
When delving into messaging with Apache Kafka, prioritize finding a fitting pubsub system and
embracing straightforward methods for real-world applicability.
2. Log Aggregation
1. Collecting data: Apache Kafka gathers log data from various sources, including physical log
files and application-generated log events.
2. Aggregating logs: It aggregates logs from different systems into a centralized location,
facilitating easy access and management.
3. Distributed data consumption: Through Apache Kafka, log data can be consumed in a
distributed manner for real-time processing, analysis, and monitoring.
3. Stream Processing
• Extract raw data: In the first step of stream processing, raw data is extracted from various
sources such as databases, IoT devices, or logs.
• Transform and process: The data is then transformed and processed through a multi-step
pipeline to derive insights and meaningful information.
• Real-time analysis: Stream processing allows for real-time analysis of the data, enabling
immediate actions and decision-making based on the processed information.
Fact: Apache Kafka’s stream processing capabilities empower businesses to analyze and act on raw
data as it arrives, providing real-time insights for enhanced decision-making.
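As a very rough sketch of the extract-transform-load idea, the console tools can be chained with ordinary shell pipes (the topic names are assumptions; a real deployment would normally use Kafka Streams or a similar framework rather than grep):
$ kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic raw-events \
    | grep "purchase" \
    | kafka-console-producer.sh --bootstrap-server localhost:9092 --topic purchase-events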
4. Website Activity Tracking
1. Implement clickstream tracking to capture user clickstream activities and page views.
2. Utilize Kafka to ingest, process, and store website activity data in real time.
To enhance website activity tracking, consider integrating Apache Kafka with analytics tools for
deeper insights into user interactions and page views.
3.2 Flume
Introduction to Apache Flume
Apache Flume is a tool for data ingestion into HDFS. It collects, aggregates, and transports large amounts of streaming data, such as log files and events from various sources like network traffic, social media, and email messages, into HDFS. Flume is highly reliable and distributed.
The main idea behind Flume's design is to capture streaming data from various web servers into HDFS. It has a simple and flexible architecture based on streaming data flows. It is fault-tolerant and provides reliability mechanisms for fault tolerance and failure recovery.
There are several advantages of Apache Flume which makes it a better choice over others.
The advantages are:
• Flume is a scalable, reliable, fault-tolerant and customizable for different sources and
sinks.
• Apache Flume can store data in centralized stores (i.e., data from many sources is collected into a single store) like HBase and HDFS.
• If the read rate exceeds the write rate, Flume provides a steady flow of data between read
and write operations.
• Flume provides reliable message delivery. The transactions in Flume are channel-based
where two transactions (one sender & one receiver) are maintained for each message.
• Using Flume, we can ingest data from multiple servers into Hadoop.
• It gives us a solution which is reliable and distributed and helps us in collecting, aggregating, and moving large amounts of data from sources such as Facebook, Twitter, and e-commerce websites.
• It helps us to ingest online streaming data from various sources like network traffic, social
media, email messages, log files etc. in HDFS.
Flume Architecture
A Flume agent has three main components: Source, Channel, and Sink.
Source: It accepts the data from the incoming streamline and stores the data in the channel.
Channel: In general, the reading speed is faster than the writing speed. Thus, we need some buffer
to match the read & write speed difference. Basically, the buffer acts as intermediary storage that
stores the data being transferred temporarily and therefore prevents data loss. Similarly, channel acts
as the local storage or temporary storage between the source of data and persistent data in the
HDFS.
Sink: Then, our last component i.e. Sink, collects the data from the channel and commits or writes
the data in the HDFS permanently.
In this practical, we will stream data from Twitter using Flume and then store the data in HDFS as
shown in the below image.
After creating a Twitter application, you will find its key and access token. Copy the key and the access token; we will pass these tokens in our Flume configuration file to connect to the application.
Now create a flume.conf file in the flume’s root directory as shown in the below image. As we
discussed, in the Flume’s Architecture, we will configure our Source, Sink and Channel. Our Source is
Twitter, from where we are streaming the data and our Sink is HDFS, where we are writing the data.
In the Sink configuration, we are going to configure HDFS properties. We will set the HDFS
path, write format, file type, batch size etc. At last, we are going to set a memory channel as shown
in the below image.
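A rough sketch of what such a flume.conf might look like follows; the agent name (TwitterAgent), the keywords, the channel sizes, and the HDFS path are assumptions, and the four Twitter keys must be replaced with the ones copied earlier:
TwitterAgent.sources = Twitter
TwitterAgent.channels = MemChannel
TwitterAgent.sinks = HDFS
TwitterAgent.sources.Twitter.type = org.apache.flume.source.twitter.TwitterSource
TwitterAgent.sources.Twitter.consumerKey = <your consumer key>
TwitterAgent.sources.Twitter.consumerSecret = <your consumer secret>
TwitterAgent.sources.Twitter.accessToken = <your access token>
TwitterAgent.sources.Twitter.accessTokenSecret = <your access token secret>
TwitterAgent.sources.Twitter.keywords = hadoop, bigdata
TwitterAgent.sinks.HDFS.type = hdfs
TwitterAgent.sinks.HDFS.hdfs.path = hdfs://localhost:9000/user/flume/tweets/
TwitterAgent.sinks.HDFS.hdfs.fileType = DataStream
TwitterAgent.sinks.HDFS.hdfs.writeFormat = Text
TwitterAgent.sinks.HDFS.hdfs.batchSize = 1000
TwitterAgent.channels.MemChannel.type = memory
TwitterAgent.channels.MemChannel.capacity = 10000
TwitterAgent.channels.MemChannel.transactionCapacity = 100
TwitterAgent.sources.Twitter.channels = MemChannel
TwitterAgent.sinks.HDFS.channel = MemChannel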
Now we are all set for execution. Let us go ahead and execute this command:
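Assuming the configuration file above and an agent named TwitterAgent, the invocation would look roughly like this when run from the Flume home directory:
$ bin/flume-ng agent --conf ./conf/ -f flume.conf -Dflume.root.logger=DEBUG,console -n TwitterAgent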
After the command has run for a while, you can exit the terminal using CTRL+C. Then you can go to your Hadoop directory and check the configured HDFS path to see whether the file has been created or not.
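Assuming the HDFS path used in the sink configuration above, a quick way to check is:
$ hdfs dfs -ls /user/flume/tweets/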
Download the file and open it. You will get something as shown in the below image.
3.3 Data Querying and Analysis: Drill
Drill is an Apache open-source SQL query engine for Big Data exploration. Drill is designed from the
ground up to support high-performance analysis on the semi-structured and rapidly evolving data
coming from modern Big Data applications, while still providing the familiarity and ecosystem of ANSI
SQL, the industry-standard query language. Drill provides plug-and-play integration with existing
Apache Hive and Apache HBase deployments.
As an open-source, schema-free SQL query engine, Drill represents a significant leap forward in the
realm of big data processing, providing unprecedented flexibility and speed.
Drill is a universal translator for different data sources. You can easily load your data into your favourite business intelligence tool or expose it with REST, and you query the data with standard SQL commands. Drill provides Open Database Connectivity (ODBC) and Java Database Connectivity (JDBC) interfaces, so you can easily connect most BI tools, and there are modules for scripting languages such as Python and R.
Formats:
▪ Parquet
▪ JSON
▪ Avro
▪ Log files
▪ PCAP/PCAP-NG
External Systems:
▪ HBase
▪ Hive
▪ MapR-DB
▪ MongoDB
▪ MapR-FS
Since Drill looks like a relational database to the user, users often expect a database-like performance.
Benefits
Drill can scale from a single node to thousands of nodes and query petabytes of data within seconds. Drill supports user-defined functions. Drill's symmetrical architecture and simple installation make it easy to deploy and operate very large clusters. Drill has a flexible data model and an extensible architecture. Drill's columnar execution model performs SQL processing on complex data without flattening it into rows. It also supports very large datasets.
Key Features
▪ Drill uses self-describing data where a schema is specified as a part of the data itself, so no
need for centralized schema definitions or management.
▪ Specialized memory management that reduces the amount of main memory a program uses or references while running and avoids garbage-collection pauses.
▪ Apache Drill is mostly used for data analytics. When many files, logs, databases, and other data types are spread across VMs, file systems, and database servers, Apache Drill can query them all in place.
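As a hedged illustration of how Drill is queried without any upfront schema, Drill's REST endpoint (on the default web port 8047) accepts a SQL statement and returns JSON; the file path and port here are assumptions, and the same query could equally be run from the Drill web UI or over ODBC/JDBC:
$ curl -X POST -H "Content-Type: application/json" \
    -d '{"queryType": "SQL", "query": "SELECT * FROM dfs.`/tmp/employee.json` LIMIT 5"}' \
    http://localhost:8047/query.json
Note how the JSON file is queried in place, with the schema discovered from the data itself.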
Impala Features
• Impala is an open-source project.
• It supports in-memory data processing; data stored on Hadoop data nodes (in HDFS) can be analyzed or accessed without data movement.
• Users can store data in storage systems such as HDFS, HBase, Amazon S3.
• Impala can be integrated with Business Intelligence (BI) tools such as Tableau.
• Impala supports various file formats such as LZO, Sequence File, Avro, RCFile, and Parquet.
• Impala uses the same metadata storage mechanisms as Hive, and it also provides an ODBC driver and a Hive-like SQL grammar.
Advantages of Impala
• Impala can process data stored in HDFS using existing SQL knowledge.
• Since the processing is performed in a place where data is stored, the stored data in Hadoop
does not need to be updated or transferred.
• Users can access data stored in HDFS, HBase, Amazon S3 without a knowledge of
MapReduce because Impala supports existing SQL query and does not rely on MapReduce
paradigm.
• Impala reduces the need for users to perform preliminary ETL (Extract-Transform-Load) jobs.
Disadvantages of Impala
• Impala does not provide the capability to recover from a failing node during the processing of a SQL query that involves multiple nodes; the query must be re-run.
• Impala can only read text files; it cannot read user-defined binary file formats.
• Impala tables must be refreshed whenever new records/files are added to the data directory in HDFS.
• Whereas MapReduce materializes all intermediate results, which enables better scalability and fault tolerance (at the expense of data processing time), Impala streams intermediate results directly among the cluster nodes (called executors). This approach limits the horizontal scalability of the cluster, especially when dealing with hundreds of nodes.
Impalad
• Impalad is the process in charge of the distributed query engine; it processes queries and plans query execution on the data nodes of the Hadoop cluster.
• Multiple queries are served by Impalad running on other nodes as well. After accepting the
query, Impalad reads and writes to data files and parallelizes the queries by distributing the
work to other Impala cluster nodes (executors).
• Impala Statestored is responsible for checking the health of each Impalad and then frequently conveying the health status of each node to the other nodes.
• Every Impalad process reports its latest health status to the Statestored process, and this information is used by the other nodes in the cluster so they can make correct decisions before distributing queries to a specific Impalad.
• In the event of a node failure due to any reason, Statestored updates all other nodes about
this failure, and once such a notification is available to other Impalad, they will not assign any
further query job to the affected node.
Impala Catalogd
• The Catalogd service manages Impala's metadata; table definitions are stored in a traditional relational database such as MySQL or PostgreSQL.
• The important details, such as table and column information and table definitions, are stored in a centralized database known as the Metastore, whose contents Catalogd relays to the Impala daemons.
• Impala tracks information about file metadata, that is, the physical location of the blocks
about data files in HDFS.
• When dealing with an extremely large amount of data and/or many partitions, getting table
specific metadata could require a large amount of time.
• For this reason, a locally stored metadata cache is provisioned to help provide such information immediately.
Impala Components
• Clients — Entities including Hue, ODBC clients, JDBC clients, and the Impala Shell can all interact with Impala.
• Hive Metastore — It (generally) stores information about the data available to Impala.
• For example, the metastore lets Impala know what databases are available and what the
structure of those databases is.
• As you create, drop, and alter schema objects, load data into tables, and so on through
Impala SQL statements, the relevant metadata changes are automatically broadcast to all
Impala nodes by the dedicated Catalogd service.
• Impalad — This process, which runs on DataNodes, coordinates and executes queries. Each
instance of Impala can receive, plan, and coordinate queries from Impala clients.
• Queries are distributed among Impala nodes, and these nodes then act as workers, executing
parallel query fragments.
• The status of Impalad is constantly transmitted to Statestored (not shown in the figure for
simplicity).
Impala Query Execution Flow
1. User applications send SQL queries to the Impala cluster through ODBC or JDBC drivers, which provide standardized querying interfaces. The user application may connect to any Impalad in the cluster. This Impalad becomes the coordinator for the given query.
2. Impala parses the query and analyzes it to determine what tasks need to be performed by Impalad instances across the cluster. Execution is planned for optimal efficiency, and the available nodes are used for the query process.
3. Services such as HDFS and HBase are accessed by local impalad instances to provide data.
4. Each Impalad returns data to the coordinating Impalad, which sends these results to the
client.
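A hedged sketch of submitting a query from the impala-shell client (the host name, port, and table name are assumptions); the shell connects to one Impalad, which becomes the coordinator for the query:
$ impala-shell -i impalad-host:21000 -q "SELECT city, COUNT(*) FROM customers GROUP BY city;"
Because Impala does not automatically notice files added directly to HDFS, a REFRESH customers statement would typically be issued after new data files are loaded (see the disadvantages above).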
Apache Hive
• Hive is a data warehouse system used for querying and analysing large datasets stored in HDFS. It processes structured and semi-structured data in Hadoop.
• Hive sits on top of a data platform compatible with Hadoop (HDFS/Amazon S3).
• Data accessed by Hive is stored in HDFS.
A user wants to make a query. The two popular channels to query Hive through are:
JDBC is Java-based, hence targeted for connecting Java applications to databases. One example of
this is beeline - a JDBC client which is a command line interface that allows you to submit queries.
ODBC is for a wider range of applications beyond Java. This includes Windows applications such as
Excel.
• Type in your query to get the data you need for analysis. You use HiveQL (HQL), a SQL-like language: it is similar to SQL, but not exactly the same.
• Your query then gets sent to Hive server (HiveServer2), which translates it for the driver.
• Metastore, as the name suggests, stores relevant metadata about the Hive table: database
name, table name, column names, etc…
• The metastore also includes details about where the table data is located in HDFS, which is
vital in guiding your Hive queries to look in the appropriate directories in HDFS
Hive Architecture
Metastore: stores metadata for Hive tables (like their schema and location) and partitions in a
relational database(traditional RDBMS format).
Driver: acts like a controller which receives the HiveQL statements. It monitors the life cycle and the progress of the execution of the HiveQL statement, and it stores the necessary metadata generated during execution. It also drives the compiling, optimizing, and executing of the HiveQL statement.
Compiler: It performs the compilation of the HiveQL query, converting the query into an execution plan that contains the tasks (MapReduce jobs).
Optimizer: It performs various transformations on the execution plan to produce an optimized plan. It aggregates transformations together, such as converting a pipeline of joins into a single join.
Executor: Once compilation and optimization are complete, the executor executes the tasks.
Thrift application: is a software framework which allows external clients to interact with Hive over a
network, similar to the JDBC or ODBC protocols.
Beeline: The Beeline is a command shell supported by HiveServer2, where the user can submit its
queries and command to the system.
HiveServer2: an enhanced version of HiveServer1 which allows multiple clients to submit requests to Hive and retrieve the final results. It is basically designed to provide the best support for open API clients like JDBC, ODBC, and Thrift.
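As a small sketch of the Beeline channel described above, a query can be submitted from the command line over JDBC (the host, port, and query are assumptions; 10000 is the usual HiveServer2 port):
$ beeline -u jdbc:hive2://localhost:10000 -e "SELECT city, COUNT(*) FROM student GROUP BY city;"
Beeline connects to HiveServer2, which then drives the compilation, optimization, and execution steps described below.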
Steps to execute an HQL statement
1. executeQuery: The user interface calls the driver to execute the HQL statement (query).
2. getPlan: The driver accepts the query, creates a session handle for the query, and passes the
query to the compiler for generating the execution plan.
3. getMetaData: The compiler sends the metadata request to the metastore.
4. sendMetaData: The metastore sends the metadata to the compiler.
1. The compiler uses this metadata for performing type-checking and semantic analysis
on the expressions in the query tree. The compiler then generates the execution
plan (Directed acyclic Graph).
5. sendPlan: The compiler then sends the generated execution plan to the driver.
6. executePlan: After receiving the execution plan from compiler, driver sends the execution
plan to the execution engine for executing the plan.
7. submit job to MapReduce: The execution engine then sends these stages of DAG to
appropriate components.
8–10. sendResult: For queries, the execution engine reads the contents of the temporary files directly from HDFS as part of a fetch call from the driver, and the driver then sends the results to the Hive interface.
Hive Data Model
Table
Hive tables are the same as the tables present in a Relational Database.
Partition
Hive organizes tables into partitions to group the same type of data together, based on one or more partition keys that identify a particular partition.
Bucket
Tables or partitions are subdivided into buckets, based on the hash of a column in the table, to give extra structure to the data that can be used for more efficient queries.
Example:
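A hedged HiveQL sketch of a table that is partitioned on one column and bucketed on another (the table and column names are assumptions):
CREATE TABLE student (id INT, firstname STRING, lastname STRING, phone STRING)
PARTITIONED BY (city STRING)
CLUSTERED BY (id) INTO 4 BUCKETS
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
Here each city gets its own partition directory, and within a partition the rows are hashed on id into 4 bucket files.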
Hive Data Types: Hive supports primitive types such as TINYINT, INT, BIGINT, FLOAT, DOUBLE, STRING, BOOLEAN, and TIMESTAMP, as well as complex types such as ARRAY, MAP, and STRUCT.
Hive Modes
• Hive operates in two modes, depending on the number of data nodes and the size of the data.
• Local Mode: used when Hadoop has a single data node and the data is small. Processing is very fast on smaller datasets that are present on the local machine.
• MapReduce Mode: used when Hadoop has multiple data nodes and the data is spread across them. Processing large datasets is more efficient in this mode.
Cheat Sheet — Hive for SQL Users: covers query syntax, metadata commands, and command-line usage.
Apache Pig
Apache Pig is a high-level programming language especially designed for analyzing large data sets. In the MapReduce framework, programs have to be translated into a sequence of Map and Reduce stages. Pig supports all the data manipulation operations in Hadoop. Apache Pig allows developers to write data analysis programs using Pig Latin, a highly flexible language that supports users in developing custom functions for writing, reading, and processing data. Apache Pig comes with a component called the Pig engine that takes scripts written in Pig Latin as input and converts them into MapReduce jobs. Using the various operators provided by the Pig Latin language, programmers can develop their own functions for reading, writing, and processing data.
• Rich Set of Operators: Pig provides a rich set of operators to perform operations such as join, filter, sort, and many more.
• Ease of Programming: Pig Latin is similar to SQL, so developers who already know SQL find it very easy to write and learn Pig scripts.
• Optimization opportunities: The execution of tasks in Apache Pig is optimized automatically by the framework, so programmers need to focus only on the semantics of the language.
• Extensibility: By using the existing operators, users can easily develop their own functions to
read, process, and write data.
• User Define Functions (UDF’s): we can easily create User Defined Functions on a number of
programming languages such as Java and invoke or embed them in Pig Scripts.
• All types of data handling: Analysis of all types of Data (i.e. both structured as well as
unstructured) is provided by Apache Pig and the results are stored inside HDFS.
Pig Latin simplifies the work of programmers by eliminating the need to write complex codes in java
for performing MapReduce tasks. The multi-query approach of Apache Pig reduces the length of code
drastically and minimizes development time. Pig Latin is almost similar to SQL and if you are familiar
with SQL then it becomes very easy for you to learn.
A Pig Latin program comprises a series of transformations or operations which uses input data to
produce output.
Pig Architecture
Parser: The parser checks the syntax of the script and outputs a Directed Acyclic Graph (DAG) that represents the Pig Latin statements and logical operators.
Optimizer: It performs logical optimizations, such as projection and pushdown, so that the amount of data flowing through the pipeline is kept to a minimum.
Compiler : The compiler component transforms the optimized logical plan into a sequence of
MapReduce jobs.
Execution Engine: This component submits all the MapReduce jobs in sorted order to the Hadoop.
Finally, all the MapReduce jobs are executed on Apache Hadoop to produce desired results.
Pig Execution Modes
In Hadoop, Pig can be executed in two different modes:
Local Mode: Here Pig language makes use of a local file system and runs in a single JVM. The local
mode is ideal for analyzing small data sets.
Map Reduce Mode: In this mode, all the queries written using Pig Latin are converted into
MapReduce jobs and these jobs are run on a Hadoop cluster. MapReduce Mode is highly suitable for
running Pig on large datasets.
Applications of Pig and Hive
Pig and Hive are widely used in various data processing and analysis scenarios:
Data Transformation: Pig is well-suited for complex data transformations, such as cleansing,
normalization, and enrichment of raw data.
Ad-hoc Data Analysis: Hive is ideal for ad-hoc data analysis, allowing users to quickly query and
analyze large datasets using familiar SQL-like syntax.
ETL Pipelines: Both Pig and Hive can be integrated into ETL pipelines for data extraction,
transformation, and loading, providing robust solutions for data processing and analysis.
Machine Learning and Data Science: Pig and Hive can be used to preprocess data for machine
learning algorithms or perform exploratory data analysis in data science projects.
Data Warehousing: Hive is particularly useful for building data warehouses on top of Hadoop,
providing a scalable and cost-effective solution for storing and analyzing large volumes of structured
data.
Apache Pig versus MapReduce:
• Any novice programmer with a basic knowledge of SQL can work conveniently with Apache Pig, whereas exposure to Java is a must to work with MapReduce.
• Apache Pig uses a multi-query approach, thereby reducing the length of the code to a great extent, whereas MapReduce requires almost 20 times more lines of code to perform the same task.
Pig Latin Data Model
• The data model of Pig Latin is fully nested, and it allows complex non-atomic data types such as map and tuple.
Atom
• Any single value in Pig Latin, irrespective of their data, type is known as an Atom.
• It is stored as string and can be used as string and number. int, long, float, double, chararray,
and bytearray are the atomic values of Pig.
Tuple
• A record formed by an ordered set of fields is known as a tuple; the fields can be of any type. A tuple is similar to a row in an RDBMS table.
Bag
• A bag is an unordered collection of tuples, and each tuple can have any number of fields (flexible schema). A bag is represented by ‘{}’.
Map
• A map (or data map) is a set of key-value pairs. The key needs to be of type chararray and
should be unique. The value might be of any type. It is represented by ‘[]’
Relation
• A relation is a bag of tuples. The relations in Pig Latin are unordered (there is no guarantee
that tuples are processed in any particular order).
Apache Pig supports many data types, including the atomic types listed above (int, long, float, double, chararray, bytearray) and the complex types tuple, bag, and map.
• Apache Pig scripts can be executed in three ways, namely, interactive mode, batch mode,
and embedded mode.
• Interactive Mode (Grunt shell) − You can run Apache Pig in interactive mode using the Grunt
shell. In this shell, you can enter the Pig Latin statements and get the output (using Dump
operator).
• Batch Mode (Script) − You can run Apache Pig in Batch mode by writing the Pig Latin script in
a single file with .pig extension.
• Embedded Mode (UDF) − Apache Pig provides the provision of defining our own functions
(User Defined Functions) in programming languages such as Java, and using them in our
script.
Invoking the Grunt Shell
You can invoke the Grunt shell in a desired mode (local/MapReduce) using the −x option as shown
below.
Local mode: $ pig -x local
MapReduce mode: $ pig -x mapreduce
• Either of these commands gives you the Grunt shell prompt as shown below.
grunt>
• After invoking the Grunt shell, you can execute a Pig script by directly entering the Pig Latin
statements in it.
You can write an entire Pig Latin script in a file and execute it by passing the file name to the pig command together with the –x option. Let us suppose we have a Pig script in a file named sample_script.pig as shown below.
Sample_script.pig
Now, you can execute the script in the above file as shown below.
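For example, assuming the script file is in the current working directory (use -x local for local mode):
$ pig -x mapreduce Sample_script.pig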
• Apache Pig is an analytical tool that analyzes large datasets that exist in the Hadoop File System.
• To analyze data using Apache Pig, we have to initially load the data into Apache Pig.
Preparing HDFS
• In the local file system, create an input file student_data.txt containing data as shown below.
• Now, move the file from the local file system to HDFS using put command as shown below.
(You can use copyFromLocal command as well.)
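For example, assuming a target HDFS directory named /pig_data/ (the directory name is an assumption):
$ hdfs dfs -mkdir /pig_data/
$ hdfs dfs -put student_data.txt /pig_data/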
• You can use the cat command to verify whether the file has been moved into the HDFS, as
shown below.
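Assuming the same directory as above:
$ hdfs dfs -cat /pig_data/student_data.txt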
• You can load data into Apache Pig from the file system (HDFS/ Local) using LOAD operator
of Pig Latin.
Syntax
• The load statement consists of two parts divided by the “=” operator. On the left-hand side, we mention the name of the relation where we want to store the data, and on the right-hand side, we define how and from where the data is to be loaded. Given below is the syntax of the Load operator.
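The general form is as follows (square brackets mark optional parts; the terms are explained below):
grunt> Relation_name = LOAD 'Input file path' USING function [AS (schema)];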
• relation_name − We have to mention the relation in which we want to store the data.
• Input file path − We have to mention the HDFS directory where the file is stored. (In
MapReduce mode)
• function − We have to choose a function from the set of load functions provided by Apache
Pig (BinStorage, JsonLoader, PigStorage, TextLoader).
• Schema − We have to define the schema of the data. We can define the required schema as
follows −
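For example, a schema naming each column and its type might look like this (the column names are placeholders):
(column1 : int, column2 : chararray, column3 : float)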
• Note − If we load the data without specifying the schema, the columns will be addressed as $0, $1, $2, and so on.
Example
• As an example, let us load the data in student_data.txt in Pig under the schema
named Student using the LOAD command.
• First of all, open the Linux terminal. Start the Pig Grunt shell in MapReduce mode as shown
below.
$ pig -x mapreduce
grunt>
• Now load the data from the file student_data.txt into Pig by executing the following Pig Latin
statement in the Grunt shell.
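Assuming the file was placed in the HDFS directory /pig_data/ as above (the directory name is an assumption), the statement might look like this:
grunt> student = LOAD 'hdfs://localhost:9000/pig_data/student_data.txt'
   USING PigStorage(',')
   AS (id:int, firstname:chararray, lastname:chararray, phone:chararray, city:chararray);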
• Note − The load statement will simply load the data into the specified relation in Pig. To
verify the execution of the Load statement, you have to use the Diagnostic Operators
Store Functions
• You can store the loaded data in the file system using the store operator.
Syntax
STORE Relation_name INTO ' required_directory_path ' [USING function];
Example
• Assume that we have a file student_data.txt in HDFS with the following content.
001,Rajiv,Reddy,9848022337,Hyderabad
002,siddarth,Battacharya,9848022338,Kolkata
003,Rajesh,Khanna,9848022339,Delhi
004,Preethi,Agarwal,9848022330,Pune
005,Trupthi,Mohanthy,9848022336,Bhuwaneshwar
006,Archana,Mishra,9848022335,Chennai.
• And we have read it into a relation student using the LOAD operator as shown earlier.
• Now, let us store the relation in the HDFS directory “/pig_Output/” as shown below.
grunt> STORE student INTO ' hdfs://localhost:9000/pig_Output/ ' USING PigStorage (',');
Output
• After executing the statement, the directory /pig_Output/ is created in HDFS and the contents of the student relation are written into it as part files.
Filter Operator
• The FILTER operator is used to select the required tuples from a relation based on a condition. As an example, let us select the students who belong to the city Chennai.
• Verify the relation filter_data using the DUMP operator as shown below.
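A hedged sketch of the statements, assuming the data (including an age column) has been loaded into a relation named student_details:
grunt> filter_data = FILTER student_details BY city == 'Chennai';
grunt> Dump filter_data;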
Output
(6,Archana,Mishra,23,9848022335,Chennai)
(8,Bharathi,Nambiayar,24,9848022333,Chennai)
To verify the execution of the Load statement, you have to use the Diagnostic Operators. Pig Latin
provides four different types of diagnostic operators −
1) Dump operator
2) Describe operator
3) Explanation operator
4) Illustration operator
Dump Operator
• The Dump operator is used to run the Pig Latin statements and display the results on the
screen. It is generally used for debugging Purpose.
• Syntax
grunt> Dump Relation_Name
• Now, let us print the contents of the relation using the Dump operator as shown below.
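Assuming the relation student loaded earlier:
grunt> Dump student;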
Output
(1,Rajiv,Reddy,9848022337,Hyderabad)
(2,siddarth,Battacharya,9848022338,Kolkata)
(3,Rajesh,Khanna,9848022339,Delhi)
(4,Preethi,Agarwal,9848022330,Pune)
(5,Trupthi,Mohanthy,9848022336,Bhuwaneshwar)
(6,Archana,Mishra,9848022335,Chennai)
Describe Operator
• The describe operator is used to view the schema of a relation.
Syntax
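The general form is (the relation name is a placeholder):
grunt> Describe Relation_name;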
Explain Operator
• The explain operator is used to display the logical, physical, and MapReduce execution plans
of a relation.
Syntax
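The general form is (the relation name is a placeholder):
grunt> explain Relation_name;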
Illustrate operator
• The illustrate operator gives you the step-by-step execution of a sequence of statements.
Syntax
Example
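The general form, followed by a usage sketch assuming the student relation loaded earlier:
grunt> illustrate Relation_name;
grunt> illustrate student;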
Apache NiFi
Data processing tasks within various organizations have become increasingly complex, necessitating the use of advanced data flow tools. Apache NiFi is a highly valuable data flow system designed to automate and schedule various data operations across different processes. It is capable of handling both simple and complex systems effectively. The performance of Apache NiFi, in terms of IO, CPU, and RAM utilization, is spectacular as well. Its key characteristics include:
• Flow management
• Ease of use
• Security
• Extensible architecture
• Flexible scaling model
The most important concepts related to Apache NiFi are DataFlow Manager (DFM), FlowFile, Bulletin, Processor, Connection, Controller Service, Flow Controller, and Process Group.
DFM
• In simple words, a DFM is the user who has different permissions and complete management
of Apache NiFi.
FlowFile
• The FlowFile concept refers to each object moving through the system. It has two main parts:
attributes and content. The first one stores metadata about the FlowFile, such as its
filename, size, etc.
• The standard attributes that each FlowFile contains are Universally Unique Identifier (UUID),
used to distinguish the FlowFile; filename, used to represent the name of the output of the
data when it is written to some path; and path, used to define the path of the filename. On
the other hand, the content part is responsible for storing the actual data of the FlowFile.
Bulletin
It can be considered a service reporter tool that is available for each service. It provides
critical information such as rolling statistics, current status, and severity levels such as Debug, Info,
Warning, and Error, giving insights into the current situation of the component.
Processor
A Processor, as the name suggests, is responsible for performing the actual work, such as data routing and transformation, as it has direct access to the attributes and the content of a given FlowFile. For example, a processor can read data files, execute scripts written in languages such as Python, and run SQL queries.
Connection
A Connection can be regarded as the link between different processors. It manages the interaction between the processors by establishing queues. A queue in Apache NiFi
refers to a buffer that has the ability to store FlowFiles. Therefore, the output of one processor, the
FlowFile, is stored in a queue until the other processor is ready to process it.
Controller Service
When a Controller Service is created, it will automatically start up every time Apache NiFi
is initiated. Several essential services can be specified in the Controller Service section. For instance,
DBCPConnectionPool and CSVRecordSetWriter are utilized to establish connections between various
database systems and Apache NiFi, and to enable writing CSV files to the local system, respectively.
Once defined, these services can be utilized by any processors.
Apache Oozie
What is Oozie?
Oozie is a workflow scheduler system to manage Apache Hadoop jobs. Oozie Workflow jobs are
Directed Acyclic Graphs (DAGs) of actions. Workflows, Coordinators, Bundles all come together to form
the building blocks of Oozie. Oozie allows orchestration and control of such complex multi-stage
Hadoop jobs. Multi-stage Hadoop jobs can then be run as a single Oozie job — the Oozie job is the
only thing for you to manage
What is a workflow?
A workflow specifies a set of actions and the order and conditions under which these actions should be performed. The actions are dependent on one another, as the next action can only be executed after the output of the current action is available. These actions could be anything: a MapReduce job, Hive query, shell script, Pig query, or Java program. You might want to run these workflows at a specific time and frequency (provided the input data is available); for that we need coordinators. We can create different types of actions based on the job, and each type of action can have its own type of tags. The workflow and the scripts or JARs should be placed in an HDFS path before executing the workflow.
Oozie Workflow Jobs− These are Directed Acyclic Graphs (DAGs) which specifies a sequence of actions
to be executed.
Oozie Coordinator Jobs− These consist of workflow jobs triggered by time and data availability.
Oozie Bundles− These can be referred to as a package of multiple coordinators and workflow jobs.
Oozie Coordinator
You can schedule complex workflows, as well as workflows that run regularly, using a Coordinator. Oozie Coordinators trigger workflow jobs based on time, data, or event predicates; the workflows inside a coordinator job start when the given condition is satisfied. The key coordinator control properties are listed below.
• timeout− The maximum time, in minutes, for which an action will wait to satisfy the additional
conditions, before getting discarded. 0 indicates that if all the input events are not satisfied at
the time of action materialization, the action should timeout immediately. -1 indicates no
timeout, the action will wait forever. The default value is -1.
• concurrency− The maximum number of actions for a job that can run parallelly. The default
value is 1.
• execution– It specifies the execution order if multiple instances of the coordinator job have
satisfied their execution criteria. It can be:
• FIFO (default)
• LIFO
• LAST_ONLY
So what is coordinator?
• The Coordinator schedules the execution of a workflow at a specified time and/or a specified
frequency and with the availability of the data
• If input data is not available, then the workflow is delayed till the data becomes available.
• If no input data is needed, then the workflow runs purely at a specified time or frequency.
• A collection of Coordinator jobs which can be started, stopped and modified together is called
a Bundle.
• The output of one Coordinator job managing a workflow can be the input to another Coordinator job; chaining Coordinator jobs in this way is called a Data Pipeline.
Oozie Architecture
Database: All job related information is stored in the database. Oozie supports Derby by default.
An Oozie application has one file defined in XML which describes it, known as workflow.xml. It references
and includes other configuration files, JARs and scripts which perform the actions. The application can
be a workflow run manually, a single coordinator or a number of coordinators forming a bundle. Oozie
expects all files to be in HDFS before it can run. These XML files along with other files which are
required for the Oozie application are copied over to HDFS before the job can run.
Workflows:
• 1. Control nodes
• 2. Action nodes
• 3. Global configuration
• Control nodes — These control the start, end, and basic execution flow of the workflow.
• Action nodes — These specify the actual unit of execution (for example, a MapReduce, Hive, Pig, or shell action).
• Global configuration — Common settings can be defined once in a global section rather than repeated for each action.
• The job.properties file defines the parameters for a particular run of the application; the application path property points to the HDFS directory in which workflow.xml is stored.
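A hedged sketch of a minimal job.properties; the host names, ports, and HDFS path are assumptions (oozie.wf.application.path is the property that points Oozie at the directory containing workflow.xml):
nameNode=hdfs://localhost:9000
jobTracker=localhost:8032
oozie.wf.application.path=${nameNode}/user/oozie/apps/my-workflow
The job is then submitted and started with the Oozie command-line client:
$ oozie job -oozie http://localhost:11000/oozie -config job.properties -run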
Zookeeper overview
Why ZooKeeper?
• Unreliable networks
• Latency in operations
ZooKeeper Services
• Reliable Data Registry – Ensures data availability even if some nodes fail
ZooKeeper Ensemble
A ZooKeeper Ensemble is a group of ZooKeeper servers working together to provide a reliable and
fault-tolerant coordination service for distributed systems.
• One server acts as the Leader, and the others act as Followers.
• All write operations go through the Leader, while read operations can be served by any node.
• If the Leader fails, a new Leader is automatically elected from the Followers.
• Ensemble ensures data consistency and coordination even if some nodes fail.
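A hedged sketch of a three-server ensemble configuration (zoo.cfg); the host names and data directory are assumptions, and 2888/3888 are the conventional peer-communication and leader-election ports:
tickTime=2000
dataDir=/var/lib/zookeeper
clientPort=2181
initLimit=5
syncLimit=2
server.1=zk1.example.com:2888:3888
server.2=zk2.example.com:2888:3888
server.3=zk3.example.com:2888:3888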
Architecture of Zookeeper:
Client
Server
Ensemble
Follower
Namespaces:
Data Storage:
Persistent Znode: remains in ZooKeeper until it is explicitly deleted, even if the creating client disconnects.
Ephemeral Znode: exists only as long as the session of the client that created it; it is removed automatically when that session ends.
Sequential Znode: a persistent or ephemeral znode whose name is suffixed with a monotonically increasing counter assigned by ZooKeeper.
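A short zkCli.sh session illustrating the three znode types (the paths and data values are assumptions; the text after # is an annotation, not part of the command):
$ zkCli.sh -server localhost:2181
create /app "cluster-config"       # persistent znode: survives until explicitly deleted
create -e /app/worker1 "alive"     # ephemeral znode: removed when this client session ends
create -s /app/task- "payload"     # sequential znode: a counter is appended, e.g. /app/task-0000000001
ls /app
get /app/worker1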
Sessions in ZooKeeper: a client interacts with the ensemble through a session; ephemeral znodes created by the client live only as long as its session.
Watches in ZooKeeper: a client can set a watch on a znode to receive a one-time notification when that znode's data or children change.
Benefits of Zookeeper
Atomic Transactions – Guarantees operations are either fully completed or not done at all.