SITA1603 Unit 3 Material
Syllabus
UNIT 3 BIG DATA ECOSYSTEM TOOLS 9 Hrs.
Data Ingestion and Streaming: Kafka, Flume - Data Querying and Analysis: Drill, Impala - Data
Processing: Hive, Pig, Pig Latin scripts to process data stored in Hadoop- Data Integration and Workflow
Management: Nifi, Oozie. Managing and coordinating distributed systems: ZooKeeper.
3.1 Kafka
Kafka is a distributed streaming platform for processing real-time data. Streaming means the data is unbounded: it never ends, it just keeps arriving, and you can process it in real time. Distributed means that Kafka runs as a cluster, and each node in the cluster is called a Broker. Those brokers are simply servers running a copy of Apache Kafka. So, Kafka is a set of machines working together to handle and process real-time, unbounded data.
• Apache Kafka is an open-source platform, also available as a cloud-managed service, for real-time information and event-driven architecture.
• The architecture of Apache Kafka includes producers, topics, partitions, brokers, and
consumers, which work together in a distributed data consumption system.
• Apache Kafka provides high throughput, scalability, fault tolerance, and real-time data
processing, making it a popular choice for messaging, log aggregation, and website activity
tracking.
Apache Kafka is a famous distributed streaming platform used to create real-time data pipelines and streaming applications. It is renowned for its high throughput, fault tolerance, and scalability. Apache Kafka is open source and can be used either as a cloud-managed service, such as Confluent Cloud, or as a self-managed Kafka deployment. When using Apache Kafka, it is essential to regularly monitor the cluster and optimize the configuration settings for best performance.
Apache Kafka manages real-time information, facilitates stream processing, and enables event-driven
architecture. It provides straightforward approaches for processing and evaluating large amounts of
data, making it well-suited for implementation in social media platforms and other applications that
require real-time data processing and analysis.
The architecture of Apache Kafka is a key concept to understand to utilize this powerful
platform fully for data streaming and processing. It is a distributed system that can handle large
volumes of data. It is highly scalable and fault-tolerant. The key components include producers,
topics, partitions, brokers, and consumers.
1. Producers
1. Producers in Apache Kafka are responsible for publishing data from various sources
or feeds.
2. They transmit raw data to Kafka topics for further processing and distribution to
consumers.
3. Producers ensure that the raw data is efficiently transmitted to Kafka brokers for
storage and retrieval.
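As a minimal sketch of the producer side, the console producer shipped with Kafka can publish messages from the command line (the topic name page-views and the broker address are assumptions for illustration):
$ kafka-console-producer.sh --bootstrap-server localhost:9092 --topic page-views
> {"user": "u123", "page": "/home"}
> {"user": "u456", "page": "/cart"}
Each line typed at the > prompt is sent to the topic as one message.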
2. Topics
A topic is a particular type of data stream; it is very similar to a queue in that it receives and delivers messages. A topic is divided into partitions; each topic can have one or more partitions, and we need to specify that number when creating the topic.
1. Topics in Apache Kafka organize and categorize the data streams, creating a multi-step
pipeline for efficient data processing. They act as an immutable log that stores messages in a
serialized fashion.
2. Creating topics involves multiple steps: setting configurations, creating partitions, defining
replication factor, and adjusting retention policies.
When building a data processing system, we structured our Kafka topics to support a multi-
step pipeline. This enabled seamless data flow and efficient processing, highlighting the power of
Kafka’s topic architecture.
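As a hedged example of the topic-creation step described above, recent Kafka versions let you create a topic and fix its partition count and replication factor with the kafka-topics.sh tool (the topic name and broker address are assumptions; older Kafka versions use a --zookeeper option instead of --bootstrap-server):
$ kafka-topics.sh --create --topic page-views \
    --bootstrap-server localhost:9092 \
    --partitions 3 --replication-factor 3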
3. Partitions
1. Partitions in Apache Kafka are physical log files that store messages.
2. They allow distributed data consumption by enabling parallel processing among consumers.
3. Partitions also contribute to Data Recovery in case of failure, as each message is persisted
and replicated across brokers.
4. Brokers
Kafka works in a distributed way. A Kafka cluster may contain as many brokers as needed. Each broker in a cluster is identified by an ID and contains at least one partition of a topic. Brokers handle the pipeline of publishing, storing, and retrieving data. The number of copies kept for each partition is controlled by the Replication Factor, which we configure when creating a topic.
Let’s say that we have three brokers in our cluster and a topic with three partitions and a Replication Factor of three. In that case, each broker will be the leader for one partition of the topic and will also hold replicas of the other two partitions. In this example the number of partitions matches the number of brokers, so each broker leads exactly one partition and the load is spread evenly; in general, though, the partition count and broker count do not have to be equal.
To ensure the reliability of the cluster, Kafka introduces the concept of the Partition Leader. For each partition of a topic, one broker acts as the leader, and there can be only one leader per partition. The leader is the only broker that receives messages for that partition; the replicas simply sync the data from it (they need to stay in sync to do that). This ensures that even if a broker goes down, its data is not lost, because of the replicas. When a leader goes down, a replica is automatically elected as the new leader by ZooKeeper.
For example, suppose Broker 1 is the leader of Partition 1 of Topic 1 and has a replica on Broker 2. If Broker 1 dies, ZooKeeper detects the change and makes Broker 2 the new leader of Partition 1. This is what makes the distributed architecture of Kafka so powerful.
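To see how partitions, leaders, and replicas are laid out across the brokers, you can describe the topic (a sketch; the topic name and broker address are assumptions). The output lists, for each partition, its Leader broker, its Replicas, and its in-sync replicas (Isr):
$ kafka-topics.sh --describe --topic page-views --bootstrap-server localhost:9092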
5. Consumers
• Consumers in Apache Kafka are responsible for distributed data consumption from the
topics.
• They enable stream processing by subscribing to specific topics and processing the Raw Data
in real-time.
• Consumers play a pivotal role in ensuring that data is efficiently processed, enabling various
use cases like log aggregation and real-time analytics.
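A minimal consumer sketch using the console consumer bundled with Kafka (the topic, group name, and broker address are assumptions); it subscribes to the topic and prints every message as it arrives, starting from the beginning of the log:
$ kafka-console-consumer.sh --bootstrap-server localhost:9092 \
    --topic page-views --group analytics --from-beginning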
In order to understand the inner workings of Apache Kafka, it is important to first grasp how it
functions as a data management system. This section will delve into the essential components of Kafka
and how they work together to facilitate the publishing, storing, and retrieving of data. From the role
of producers and consumers to the importance of topics and partitions, we will explore the intricate
processes involved in the distribution and management of data within Kafka.
1. Publishing Data
1. Producers collect raw data from various sources or feeds.
2. Producers then send this data to the Kafka cluster for storage and processing.
3. Kafka stores and replicates this data across brokers for fault tolerance and high availability.
2. Storing Data
1. Decide on the topic and partition to which the data will be stored.
2. Serialize the data and add it to the appropriate topic and partition.
3. Once the data is added, it is stored in the Kafka broker’s file system.
True story: When our company implemented Kafka for data storage, organizing the topics
and partitions efficiently boosted data retrieval speed by 40%.
3. Retrieving Data
1. Retrieve Data: Consumers retrieve data from Kafka by pulling messages from specific partitions or topics.
2. Distributed Data Consumption: Consumers can be distributed across multiple nodes for
parallel data consumption.
3. Data Sources: Data can be retrieved from various sources, such as IoT devices, applications,
and databases.
For instance, a company utilized Apache Kafka to retrieve real-time data from sensors
located at various factory locations. This allowed them to monitor equipment performance, enabling
timely maintenance and reducing downtime.
Apache Kafka offers a powerful and efficient way to handle large-scale real-time data processing. In this section, we will examine the various benefits that come with using this popular stream processing platform. From its high throughput to its fault tolerance and real-time processing capabilities, the following are the major benefits that enhance the overall data processing experience.
1. High Throughput
• Utilize Apache Kafka for high throughput by leveraging its efficient messaging system.
• Configure Kafka to handle extensive data inflow using its high scalability.
• Take advantage of Kafka’s fault tolerance to ensure continuous high throughput even in the
event of failures.
• Employ Kafka for real-time stream processing, enabling the processing of raw data with
simple methods.
2. Scalability
• Implement partitioning to distribute data across multiple brokers for enhanced performance.
• Employ load balancing to distribute workloads evenly across brokers and consumers in a large-scale distributed system.
A few years ago, our company faced a challenge of processing a massive amount of data from
various data sources. With the implementation of Apache Kafka’s scalability features, we were able to
efficiently handle the increased workload, ensuring seamless data processing for multiple users in our
large-scale distributed system.
3. Fault Tolerance
• Immutable Log: Apache Kafka achieves fault tolerance through an immutable log, ensuring
that once data is written, it cannot be modified. This log enables easy data recovery in case
of failures.
• Consistent System State: Kafka maintains fault tolerance by keeping replicas of data in sync,
ensuring a consistent system state even in the event of node failures.
• Data Recovery: In case of node failures, Kafka’s fault tolerance mechanisms ensure seamless
data recovery, maintaining system reliability and consistency.
4. Real-Time Data Processing
• Real-time data processing involves handling and analyzing event data as it occurs.
• Utilize Apache Kafka to capture and process real-time information seamlessly.
Apache Kafka is a powerful tool that can be utilized for a variety of use cases. From messaging to log aggregation to stream processing, Kafka offers a versatile platform for handling large amounts of data. The major applications include messaging, log aggregation, stream processing, and website activity tracking.
1. Messaging
• Choose a suitable pubsub system for your needs, such as Apache Kafka, ensuring it aligns
with real-world messaging requirements.
• Consider scalability and fault tolerance to cater to evolving messaging demands efficiently.
When delving into messaging with Apache Kafka, prioritize finding a fitting pubsub system and
embracing straightforward methods for real-world applicability.
2. Log Aggregation
1. Collecting data: Apache Kafka gathers log data from various sources, including physical log
files and application-generated log events.
2. Aggregating logs: It aggregates logs from different systems into a centralized location,
facilitating easy access and management.
3. Distributed data consumption: Through Apache Kafka, log data can be consumed in a
distributed manner for real-time processing, analysis, and monitoring.
3. Stream Processing
• Extract raw data: In the first step of stream processing, raw data is extracted from various
sources such as databases, IoT devices, or logs.
• Transform and process: The data is then transformed and processed through a multi-step
pipeline to derive insights and meaningful information.
• Real-time analysis: Stream processing allows for real-time analysis of the data, enabling
immediate actions and decision-making based on the processed information.
Fact: Apache Kafka’s stream processing capabilities empower businesses to analyze and act on raw
data as it arrives, providing real-time insights for enhanced decision-making.
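As a very rough sketch of the extract-transform-load idea, the console tools can be chained with ordinary shell pipes (the topic names are assumptions; a real deployment would normally use Kafka Streams or a similar framework rather than grep):
$ kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic raw-events \
    | grep "purchase" \
    | kafka-console-producer.sh --bootstrap-server localhost:9092 --topic purchase-events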
4. Website Activity Tracking
1. Implement clickstream tracking to capture user clickstream activities and page views.
2. Utilize Kafka to ingest, process, and store website activity data in real time.
To enhance website activity tracking, consider integrating Apache Kafka with analytics tools for
deeper insights into user interactions and page views.
3.2 Flume
Introduction to Apache Flume
Apache Flume is a tool for data ingestion into HDFS. It collects, aggregates, and transports large amounts of streaming data, such as log files and events from various sources like network traffic, social media, and email messages, into HDFS. Flume is highly reliable and distributed.
The main idea behind Flume's design is to capture streaming data from various web servers into HDFS. It has a simple and flexible architecture based on streaming data flows. It is fault-tolerant and provides reliability mechanisms for fault tolerance and failure recovery.
There are several advantages of Apache Flume which makes it a better choice over others.
The advantages are:
• Flume is a scalable, reliable, fault-tolerant and customizable for different sources and
sinks.
• Apache Flume can store data in centralized stores (i.e., data from many sources is collected into a single store) like HBase and HDFS.
• If the read rate exceeds the write rate, Flume provides a steady flow of data between read
and write operations.
• Flume provides reliable message delivery. The transactions in Flume are channel-based
where two transactions (one sender & one receiver) are maintained for each message.
• Using Flume, we can ingest data from multiple servers into Hadoop.
• It gives us a solution which is reliable and distributed and helps us in collecting, aggregating, and moving large amounts of data from sources such as Facebook, Twitter, and e-commerce websites.
• It helps us to ingest online streaming data from various sources like network traffic, social
media, email messages, log files etc. in HDFS.
Flume Architecture
A Flume agent has three main components: Source, Channel, and Sink.
Source: It accepts the data from the incoming streamline and stores the data in the channel.
Channel: In general, the reading speed is faster than the writing speed. Thus, we need some buffer
to match the read & write speed difference. Basically, the buffer acts as intermediary storage that
stores the data being transferred temporarily and therefore prevents data loss. Similarly, channel acts
as the local storage or temporary storage between the source of data and persistent data in the
HDFS.
Sink: Then, our last component i.e. Sink, collects the data from the channel and commits or writes
the data in the HDFS permanently.
In this practical, we will stream data from Twitter using Flume and then store the data in HDFS as
shown in the below image.
After creating a Twitter application, you will find its key and access token. Copy the key and the access token; we will pass these tokens in our Flume configuration file to connect to the application.
Now create a flume.conf file in the flume’s root directory as shown in the below image. As we
discussed, in the Flume’s Architecture, we will configure our Source, Sink and Channel. Our Source is
Twitter, from where we are streaming the data and our Sink is HDFS, where we are writing the data.
In the Sink configuration, we are going to configure HDFS properties. We will set the HDFS
path, write format, file type, batch size etc. At last, we are going to set a memory channel as shown
in the below image.
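A rough sketch of what such a flume.conf might look like follows; the agent name (TwitterAgent), the keywords, the channel sizes, and the HDFS path are assumptions, and the four Twitter keys must be replaced with the ones copied earlier:
TwitterAgent.sources = Twitter
TwitterAgent.channels = MemChannel
TwitterAgent.sinks = HDFS
TwitterAgent.sources.Twitter.type = org.apache.flume.source.twitter.TwitterSource
TwitterAgent.sources.Twitter.consumerKey = <your consumer key>
TwitterAgent.sources.Twitter.consumerSecret = <your consumer secret>
TwitterAgent.sources.Twitter.accessToken = <your access token>
TwitterAgent.sources.Twitter.accessTokenSecret = <your access token secret>
TwitterAgent.sources.Twitter.keywords = hadoop, bigdata
TwitterAgent.sinks.HDFS.type = hdfs
TwitterAgent.sinks.HDFS.hdfs.path = hdfs://localhost:9000/user/flume/tweets/
TwitterAgent.sinks.HDFS.hdfs.fileType = DataStream
TwitterAgent.sinks.HDFS.hdfs.writeFormat = Text
TwitterAgent.sinks.HDFS.hdfs.batchSize = 1000
TwitterAgent.channels.MemChannel.type = memory
TwitterAgent.channels.MemChannel.capacity = 10000
TwitterAgent.channels.MemChannel.transactionCapacity = 100
TwitterAgent.sources.Twitter.channels = MemChannel
TwitterAgent.sinks.HDFS.channel = MemChannel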
Now we are all set for execution. Let us go ahead and execute this command:
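Assuming the configuration file above and an agent named TwitterAgent, the invocation would look roughly like this when run from the Flume home directory:
$ bin/flume-ng agent --conf ./conf/ -f flume.conf -Dflume.root.logger=DEBUG,console -n TwitterAgent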
After the command has run for a while, you can exit the terminal using CTRL+C. Then you can go to your Hadoop directory and check the configured HDFS path to see whether the file has been created or not.
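Assuming the HDFS path used in the sink configuration above, a quick way to check is:
$ hdfs dfs -ls /user/flume/tweets/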
Download the file and open it. You will get something as shown in the below image.
3.3 Data Querying and Analysis: Drill
Drill is an Apache open-source SQL query engine for Big Data exploration. Drill is designed from the
ground up to support high-performance analysis on the semi-structured and rapidly evolving data
coming from modern Big Data applications, while still providing the familiarity and ecosystem of ANSI
SQL, the industry-standard query language. Drill provides plug-and-play integration with existing
Apache Hive and Apache HBase deployments.
As an open-source, schema-free SQL query engine, Drill represents a significant leap forward in the
realm of big data processing, providing unprecedented flexibility and speed.
Drill is a universal translator for different data sources. You can easily load your data into your favourite business intelligence tool or expose it with REST, and you query the data with standard SQL commands. Drill provides Open Database Connectivity (ODBC) and Java Database Connectivity (JDBC) interfaces, so you can easily connect most BI tools, and there are modules for scripting languages such as Python and R.
Formats:
▪ Parquet
▪ JSON
▪ Avro
▪ Log files
▪ PCAP/PCAP-NG
External Systems:
▪ HBase
▪ Hive
▪ MapR-DB
▪ MongoDB
▪ MapR-FS
Since Drill looks like a relational database to the user, users often expect a database-like performance.
Benefits
Drill can scale from a single node to thousands of nodes and query petabytes of data within seconds. Drill supports user-defined functions. Drill's symmetrical architecture and simple installation make it easy to deploy and operate very large clusters. Drill has a flexible data model and an extensible architecture. Drill's columnar execution model performs SQL processing on complex data without flattening it into rows. It also supports very large datasets.
Key Features
▪ Drill uses self-describing data where a schema is specified as a part of the data itself, so no
need for centralized schema definitions or management.
▪ Specialized memory management that reduces the amount of main memory a program uses or references while running and avoids garbage-collection pauses.
▪ Apache Drill is mostly used for data analytics. When many files, logs, databases, and other data types are spread across VMs, file systems, and database servers, Apache Drill can query them all in place.
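As a hedged illustration of how Drill is queried without any upfront schema, Drill's REST endpoint (on the default web port 8047) accepts a SQL statement and returns JSON; the file path and port here are assumptions, and the same query could equally be run from the Drill web UI or over ODBC/JDBC:
$ curl -X POST -H "Content-Type: application/json" \
    -d '{"queryType": "SQL", "query": "SELECT * FROM dfs.`/tmp/employee.json` LIMIT 5"}' \
    http://localhost:8047/query.json
Note how the JSON file is queried in place, with the schema discovered from the data itself.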
Impala Features
• Impala is an open-source project.
• It supports in-memory data processing; data stored on Hadoop data nodes (in HDFS) can be analyzed or accessed without data movement.
• Users can store data in storage systems such as HDFS, HBase, Amazon S3.
• Impala can be integrated with Business Intelligence (BI) tools such as Tableau.
• Impala supports various file formats such as LZO, Sequence File, Avro, RCFile, and Parquet.
• Impala uses the same metadata storage mechanisms as Hive, and it also provides an ODBC driver and a Hive-like SQL grammar.
Advantages of Impala
• Impala can process data stored in HDFS using existing SQL knowledge.
• Since the processing is performed in a place where data is stored, the stored data in Hadoop
does not need to be updated or transferred.
• Users can access data stored in HDFS, HBase, Amazon S3 without a knowledge of
MapReduce because Impala supports existing SQL query and does not rely on MapReduce
paradigm.
• Impala reduces the need for users to perform preliminary ETL (Extract-Transform-Load) jobs.
Disadvantages of Impala
• Impala does not provide the capability to recover from a failing node during the processing of a SQL query that involves multiple nodes; the query must be re-run.
• Impala can only read text files; it cannot read user-defined binary file formats.
• Impala tables must be refreshed whenever new records/files are added to the data directory in HDFS.
• Whereas MapReduce materializes all intermediate results, which enables better scalability and fault tolerance (at the expense of data processing time), Impala streams intermediate results directly among the cluster nodes (called executors). This approach limits the horizontal scalability of the cluster, especially when dealing with hundreds of nodes.
Impalad
• Impalad is the process in charge of the distributed query engine; it processes queries and plans query execution on the data nodes of the Hadoop cluster.
• Multiple queries are served by Impalad running on other nodes as well. After accepting the
query, Impalad reads and writes to data files and parallelizes the queries by distributing the
work to other Impala cluster nodes (executors).
• Impala Statestored is responsible for checking the health of each Impalad and then frequently conveying the health status of each node to the other nodes.
• Every Impalad process reports its latest health status to the Statestored process, and this information is used by the other nodes in the cluster so they can make correct decisions before distributing queries to a specific Impalad.
• In the event of a node failure due to any reason, Statestored updates all other nodes about
this failure, and once such a notification is available to other Impalad, they will not assign any
further query job to the affected node.
Impala Catalogd
• The Catalogd service manages Impala's metadata; table definitions are stored in a traditional relational database such as MySQL or PostgreSQL.
• The important details, such as table and column information and table definitions, are stored in a centralized database known as the Metastore, whose contents Catalogd relays to the Impala daemons.
• Impala tracks information about file metadata, that is, the physical location of the blocks
about data files in HDFS.
• When dealing with an extremely large amount of data and/or many partitions, getting table
specific metadata could require a large amount of time.
• For this reason, a locally stored metadata cache is provisioned to help provide such information immediately.
Impala Components
• Clients — Entities including Hue, ODBC clients, JDBC clients, and the Impala Shell can all interact with Impala.
• Hive Metastore — It (generally) stores information about the data available to Impala.
• For example, the metastore lets Impala know what databases are available and what the
structure of those databases is.
• As you create, drop, and alter schema objects, load data into tables, and so on through
Impala SQL statements, the relevant metadata changes are automatically broadcast to all
Impala nodes by the dedicated Catalogd service.
• Impalad — This process, which runs on DataNodes, coordinates and executes queries. Each
instance of Impala can receive, plan, and coordinate queries from Impala clients.
• Queries are distributed among Impala nodes, and these nodes then act as workers, executing
parallel query fragments.
• The status of Impalad is constantly transmitted to Statestored (not shown in the figure for
simplicity).
Impala Query Execution Flow
1. User applications send SQL queries to the Impala cluster through ODBC or JDBC drivers, which provide standardized querying interfaces. The user application may connect to any Impalad in the cluster. This Impalad becomes the coordinator for the given query.
2. Impala parses the query and analyzes it to determine what tasks need to be performed by Impalad instances across the cluster. Execution is planned for optimal efficiency, and the available nodes are used for the query process.
3. Services such as HDFS and HBase are accessed by local impalad instances to provide data.
4. Each Impalad returns data to the coordinating Impalad, which sends these results to the
client.
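A hedged sketch of submitting a query from the impala-shell client (the host name, port, and table name are assumptions); the shell connects to one Impalad, which becomes the coordinator for the query:
$ impala-shell -i impalad-host:21000 -q "SELECT city, COUNT(*) FROM customers GROUP BY city;"
Because Impala does not automatically notice files added directly to HDFS, a REFRESH customers statement would typically be issued after new data files are loaded (see the disadvantages above).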
Apache Hive
• Hive is a data warehouse system used for querying and analysing large datasets stored in HDFS. It processes structured and semi-structured data in Hadoop.
• Hive sits on top of a data platform compatible with Hadoop (HDFS/Amazon S3).
• Data accessed by Hive is stored in HDFS.
A user wants to make a query. The two popular channels to query Hive through are:
JDBC is Java-based, hence targeted for connecting Java applications to databases. One example of
this is beeline - a JDBC client which is a command line interface that allows you to submit queries.
ODBC is for a wider range of applications beyond Java. This includes Windows applications such as
Excel.
• Type in your query to get the data you need for analysis. You use HiveQL (HQL), a SQL-like language: it is similar to SQL, but not exactly the same.
• Your query then gets sent to Hive server (HiveServer2), which translates it for the driver.
• Metastore, as the name suggests, stores relevant metadata about the Hive table: database
name, table name, column names, etc…
• The metastore also includes details about where the table data is located in HDFS, which is
vital in guiding your Hive queries to look in the appropriate directories in HDFS
Hive Architecture
Metastore: stores metadata for Hive tables (like their schema and location) and partitions in a
relational database(traditional RDBMS format).
Driver: acts like a controller which receives the HiveQL statements. It monitors the life cycle and the progress of the execution of the HiveQL statement, and it stores the necessary metadata generated during execution. It also drives the compiling, optimizing, and executing of the HiveQL statement.
Compiler: It performs the compilation of the HiveQL query, converting the query into an execution plan that contains the tasks (MapReduce jobs).
Optimizer: It performs various transformations on the execution plan to produce an optimized plan. It aggregates transformations together, such as converting a pipeline of joins into a single join.
Executor: Once compilation and optimization are complete, the executor executes the tasks.
Thrift application: is a software framework which allows external clients to interact with Hive over a
network, similar to the JDBC or ODBC protocols.
Beeline: The Beeline is a command shell supported by HiveServer2, where the user can submit its
queries and command to the system.
HiveServer2: an enhanced version of HiveServer1 which allows multiple clients to submit requests to Hive and retrieve the final results. It is basically designed to provide the best support for open API clients like JDBC, ODBC, and Thrift.
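As a small sketch of the Beeline channel described above, a query can be submitted from the command line over JDBC (the host, port, and query are assumptions; 10000 is the usual HiveServer2 port):
$ beeline -u jdbc:hive2://localhost:10000 -e "SELECT city, COUNT(*) FROM student GROUP BY city;"
Beeline connects to HiveServer2, which then drives the compilation, optimization, and execution steps described below.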
Steps to execute an HQL statement
1. executeQuery: The user interface calls the driver to execute the HQL statement (query).
2. getPlan: The driver accepts the query, creates a session handle for the query, and passes the
query to the compiler for generating the execution plan.
3. getMetaData: The compiler sends the metadata request to the metastore.
4. sendMetaData: The metastore sends the metadata to the compiler.
1. The compiler uses this metadata for performing type-checking and semantic analysis
on the expressions in the query tree. The compiler then generates the execution
plan (Directed acyclic Graph).
5. sendPlan: The compiler then sends the generated execution plan to the driver.
6. executePlan: After receiving the execution plan from compiler, driver sends the execution
plan to the execution engine for executing the plan.
7. submit job to MapReduce: The execution engine then sends these stages of DAG to
appropriate components.
8–10. sendResult: For queries, the execution engine reads the contents of the temporary files directly from HDFS as part of a fetch call from the driver, and the driver then sends the results to the Hive interface.
Hive Data Model
Table
Hive tables are the same as the tables present in a Relational Database.
Partition
Hive organizes tables into partitions to group the same type of data together, based on one or more partition keys that identify a particular partition.
Bucket
Tables or partitions are subdivided into buckets, based on the hash of a column in the table, to give extra structure to the data that can be used for more efficient queries.
Example:
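A hedged HiveQL sketch of a table that is partitioned on one column and bucketed on another (the table and column names are assumptions):
CREATE TABLE student (id INT, firstname STRING, lastname STRING, phone STRING)
PARTITIONED BY (city STRING)
CLUSTERED BY (id) INTO 4 BUCKETS
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
Here each city gets its own partition directory, and within a partition the rows are hashed on id into 4 bucket files.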
Hive Data Types: Hive supports primitive types such as TINYINT, INT, BIGINT, FLOAT, DOUBLE, STRING, BOOLEAN, and TIMESTAMP, as well as complex types such as ARRAY, MAP, and STRUCT.
Hive Modes
• Hive operates in two modes, depending on the number of data nodes and the size of the data.
• Local Mode: used when Hadoop has a single data node and the data is small. Processing is very fast on smaller datasets that are present on the local machine.
• MapReduce Mode: used when Hadoop has multiple data nodes and the data is spread across them. Processing large datasets is more efficient in this mode.
Cheat Sheet — Hive for SQL Users: covers query syntax, metadata commands, and command-line usage.
Apache Pig
Apache Pig is a high-level programming language especially designed for analyzing large data sets. In the MapReduce framework, programs have to be translated into a sequence of Map and Reduce stages. Pig supports all the data manipulation operations in Hadoop. Apache Pig allows developers to write data analysis programs using Pig Latin, a highly flexible language that supports users in developing custom functions for writing, reading, and processing data. Apache Pig comes with a component called the Pig engine that takes scripts written in Pig Latin as input and converts them into MapReduce jobs. Using the various operators provided by the Pig Latin language, programmers can develop their own functions for reading, writing, and processing data.
• Rich Set of Operators: Pig provides a rich set of operators to perform operations such as join, filter, sort, and many more.
• Ease of Programming: Pig Latin is similar to SQL, so developers who already know SQL find it very easy to write and learn Pig scripts.
• Optimization opportunities: The execution of tasks in Apache Pig is optimized automatically by the framework, so programmers need to focus only on the semantics of the language.
• Extensibility: By using the existing operators, users can easily develop their own functions to
read, process, and write data.
• User Define Functions (UDF’s): we can easily create User Defined Functions on a number of
programming languages such as Java and invoke or embed them in Pig Scripts.
• All types of data handling: Analysis of all types of Data (i.e. both structured as well as
unstructured) is provided by Apache Pig and the results are stored inside HDFS.
Pig Latin simplifies the work of programmers by eliminating the need to write complex codes in java
for performing MapReduce tasks. The multi-query approach of Apache Pig reduces the length of code
drastically and minimizes development time. Pig Latin is almost similar to SQL and if you are familiar
with SQL then it becomes very easy for you to learn.
A Pig Latin program comprises a series of transformations or operations which uses input data to
produce output.
Pig Architecture
Parser: The parser checks the syntax of the script and outputs a Directed Acyclic Graph (DAG) that represents the Pig Latin statements and logical operators.
Optimizer: It performs logical optimizations, such as projection and pushdown, so that the amount of data flowing through the pipeline is kept to a minimum.
Compiler : The compiler component transforms the optimized logical plan into a sequence of
MapReduce jobs.
Execution Engine: This component submits all the MapReduce jobs in sorted order to the Hadoop.
Finally, all the MapReduce jobs are executed on Apache Hadoop to produce desired results.
Pig Execution Modes
In Hadoop, Pig can be executed in two different modes:
Local Mode: Here Pig language makes use of a local file system and runs in a single JVM. The local
mode is ideal for analyzing small data sets.
Map Reduce Mode: In this mode, all the queries written using Pig Latin are converted into
MapReduce jobs and these jobs are run on a Hadoop cluster. MapReduce Mode is highly suitable for
running Pig on large datasets.
Applications of Pig and Hive
Pig and Hive are widely used in various data processing and analysis scenarios:
Data Transformation: Pig is well-suited for complex data transformations, such as cleansing,
normalization, and enrichment of raw data.
Ad-hoc Data Analysis: Hive is ideal for ad-hoc data analysis, allowing users to quickly query and
analyze large datasets using familiar SQL-like syntax.
ETL Pipelines: Both Pig and Hive can be integrated into ETL pipelines for data extraction,
transformation, and loading, providing robust solutions for data processing and analysis.
Machine Learning and Data Science: Pig and Hive can be used to preprocess data for machine
learning algorithms or perform exploratory data analysis in data science projects.
Data Warehousing: Hive is particularly useful for building data warehouses on top of Hadoop,
providing a scalable and cost-effective solution for storing and analyzing large volumes of structured
data.
Apache Pig versus MapReduce:
• Any novice programmer with a basic knowledge of SQL can work conveniently with Apache Pig, whereas exposure to Java is a must to work with MapReduce.
• Apache Pig uses a multi-query approach, thereby reducing the length of the code to a great extent, whereas MapReduce requires almost 20 times more lines of code to perform the same task.
Pig Latin Data Model
• The data model of Pig Latin is fully nested, and it allows complex non-atomic data types such as map and tuple.
Atom
• Any single value in Pig Latin, irrespective of their data, type is known as an Atom.
• It is stored as string and can be used as string and number. int, long, float, double, chararray,
and bytearray are the atomic values of Pig.
Tuple
• A record formed by an ordered set of fields is known as a tuple; the fields can be of any type. A tuple is similar to a row in an RDBMS table.
Bag
• A bag is an unordered collection of tuples, and each tuple can have any number of fields (flexible schema). A bag is represented by ‘{}’.
Map
• A map (or data map) is a set of key-value pairs. The key needs to be of type chararray and
should be unique. The value might be of any type. It is represented by ‘[]’
Relation
• A relation is a bag of tuples. The relations in Pig Latin are unordered (there is no guarantee
that tuples are processed in any particular order).
Apache Pig supports many data types, including the atomic types listed above (int, long, float, double, chararray, bytearray) and the complex types tuple, bag, and map.
• Apache Pig scripts can be executed in three ways, namely, interactive mode, batch mode,
and embedded mode.
• Interactive Mode (Grunt shell) − You can run Apache Pig in interactive mode using the Grunt
shell. In this shell, you can enter the Pig Latin statements and get the output (using Dump
operator).
• Batch Mode (Script) − You can run Apache Pig in Batch mode by writing the Pig Latin script in
a single file with .pig extension.
• Embedded Mode (UDF) − Apache Pig provides the provision of defining our own functions
(User Defined Functions) in programming languages such as Java, and using them in our
script.
Invoking the Grunt Shell
You can invoke the Grunt shell in a desired mode (local/MapReduce) using the −x option as shown
below.
Local mode: $ pig -x local
MapReduce mode: $ pig -x mapreduce
• Either of these commands gives you the Grunt shell prompt as shown below.
grunt>
• After invoking the Grunt shell, you can execute a Pig script by directly entering the Pig Latin
statements in it.
You can write an entire Pig Latin script in a file and execute it by passing the file name to the pig command together with the –x option. Let us suppose we have a Pig script in a file named sample_script.pig as shown below.
Sample_script.pig
Now, you can execute the script in the above file as shown below.
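For example, assuming the script file is in the current working directory (use -x local for local mode):
$ pig -x mapreduce Sample_script.pig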
• Apache Pig is an analytical tool that analyzes large datasets that exist in the Hadoop File System.
• To analyze data using Apache Pig, we have to initially load the data into Apache Pig.
Preparing HDFS
• In the local file system, create an input file student_data.txt containing data as shown below.
• Now, move the file from the local file system to HDFS using put command as shown below.
(You can use copyFromLocal command as well.)
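For example, assuming a target HDFS directory named /pig_data/ (the directory name is an assumption):
$ hdfs dfs -mkdir /pig_data/
$ hdfs dfs -put student_data.txt /pig_data/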
• You can use the cat command to verify whether the file has been moved into the HDFS, as
shown below.
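Assuming the same directory as above:
$ hdfs dfs -cat /pig_data/student_data.txt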
• You can load data into Apache Pig from the file system (HDFS/ Local) using LOAD operator
of Pig Latin.
Syntax
• The load statement consists of two parts divided by the “=” operator. On the left-hand side, we mention the name of the relation where we want to store the data, and on the right-hand side, we define how and from where the data is to be loaded. Given below is the syntax of the Load operator.
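The general form is as follows (square brackets mark optional parts; the terms are explained below):
grunt> Relation_name = LOAD 'Input file path' USING function [AS (schema)];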
• relation_name − We have to mention the relation in which we want to store the data.
• Input file path − We have to mention the HDFS directory where the file is stored. (In
MapReduce mode)
• function − We have to choose a function from the set of load functions provided by Apache
Pig (BinStorage, JsonLoader, PigStorage, TextLoader).
• Schema − We have to define the schema of the data. We can define the required schema as
follows −
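For example, a schema naming each column and its type might look like this (the column names are placeholders):
(column1 : int, column2 : chararray, column3 : float)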
• Note − If we load the data without specifying the schema, the columns will be addressed as $0, $1, $2, and so on.
Example
• As an example, let us load the data in student_data.txt in Pig under the schema
named Student using the LOAD command.
• First of all, open the Linux terminal. Start the Pig Grunt shell in MapReduce mode as shown
below.
$ pig -x mapreduce
grunt>
• Now load the data from the file student_data.txt into Pig by executing the following Pig Latin
statement in the Grunt shell.
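Assuming the file was placed in the HDFS directory /pig_data/ as above (the directory name is an assumption), the statement might look like this:
grunt> student = LOAD 'hdfs://localhost:9000/pig_data/student_data.txt'
   USING PigStorage(',')
   AS (id:int, firstname:chararray, lastname:chararray, phone:chararray, city:chararray);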
• Note − The load statement will simply load the data into the specified relation in Pig. To
verify the execution of the Load statement, you have to use the Diagnostic Operators
Store Functions
• You can store the loaded data in the file system using the store operator.
Syntax
STORE Relation_name INTO ' required_directory_path ' [USING function];
Example
• Assume that we have a file student_data.txt in HDFS with the following content.
001,Rajiv,Reddy,9848022337,Hyderabad
002,siddarth,Battacharya,9848022338,Kolkata
003,Rajesh,Khanna,9848022339,Delhi
004,Preethi,Agarwal,9848022330,Pune
005,Trupthi,Mohanthy,9848022336,Bhuwaneshwar
006,Archana,Mishra,9848022335,Chennai.
• And we have read it into a relation student using the LOAD operator as shown earlier.
• Now, let us store the relation in the HDFS directory “/pig_Output/” as shown below.
grunt> STORE student INTO ' hdfs://localhost:9000/pig_Output/ ' USING PigStorage (',');
Output
• After executing the statement, the directory /pig_Output/ is created in HDFS and the contents of the student relation are written into it as part files.
Filter Operator
• The FILTER operator is used to select the required tuples from a relation based on a condition. As an example, let us select the students who belong to the city Chennai.
• Verify the relation filter_data using the DUMP operator as shown below.
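A hedged sketch of the statements, assuming the data (including an age column) has been loaded into a relation named student_details:
grunt> filter_data = FILTER student_details BY city == 'Chennai';
grunt> Dump filter_data;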
Output
(6,Archana,Mishra,23,9848022335,Chennai)
(8,Bharathi,Nambiayar,24,9848022333,Chennai)
To verify the execution of the Load statement, you have to use the Diagnostic Operators. Pig Latin
provides four different types of diagnostic operators −
1) Dump operator
2) Describe operator
3) Explanation operator
4) Illustration operator
Dump Operator
• The Dump operator is used to run the Pig Latin statements and display the results on the
screen. It is generally used for debugging Purpose.
• Syntax
grunt> Dump Relation_Name
• Now, let us print the contents of the relation using the Dump operator as shown below.
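Assuming the relation student loaded earlier:
grunt> Dump student;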
Output
(1,Rajiv,Reddy,9848022337,Hyderabad)
(2,siddarth,Battacharya,9848022338,Kolkata)
(3,Rajesh,Khanna,9848022339,Delhi)
(4,Preethi,Agarwal,9848022330,Pune)
(5,Trupthi,Mohanthy,9848022336,Bhuwaneshwar)
(6,Archana,Mishra,9848022335,Chennai)
Describe Operator
• The describe operator is used to view the schema of a relation.
Syntax
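The general form is (the relation name is a placeholder):
grunt> Describe Relation_name;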
Explain Operator
• The explain operator is used to display the logical, physical, and MapReduce execution plans
of a relation.
Syntax
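The general form is (the relation name is a placeholder):
grunt> explain Relation_name;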
Illustrate operator
• The illustrate operator gives you the step-by-step execution of a sequence of statements.
Syntax
Example
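The general form, followed by a usage sketch assuming the student relation loaded earlier:
grunt> illustrate Relation_name;
grunt> illustrate student;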
Apache NiFi
Data processing tasks within various organizations have become increasingly complex, necessitating the use of advanced data flow tools. Apache NiFi is a highly valuable data flow system designed to automate and schedule various data operations across different processes. It is capable of handling both simple and complex systems effectively. The performance of Apache NiFi, in terms of IO, CPU, and RAM utilization, is spectacular as well. Its key characteristics include:
• Flow management
• Ease of use
• Security
• Extensible architecture
• Flexible scaling model
The most important concepts related to Apache NiFi are DataFlow Manager (DFM), FlowFile, Bulletin, Processor, Connection, Controller Service, Flow Controller, and Process Group.
DFM
• In simple words, a DFM is the user who has different permissions and complete management
of Apache NiFi.
FlowFile
• The FlowFile concept refers to each object moving through the system. It has two main parts:
attributes and content. The first one stores metadata about the FlowFile, such as its
filename, size, etc.
• The standard attributes that each FlowFile contains are Universally Unique Identifier (UUID),
used to distinguish the FlowFile; filename, used to represent the name of the output of the
data when it is written to some path; and path, used to define the path of the filename. On
the other hand, the content part is responsible for storing the actual data of the FlowFile.
Bulletin
It can be considered a service reporter tool that is available for each service. It provides
critical information such as rolling statistics, current status, and severity levels such as Debug, Info,
Warning, and Error, giving insights into the current situation of the component.
Processor
A Processor, as the name suggests, is responsible for performing the actual work, such as data routing and transformation, as it has direct access to the attributes and the content of a given FlowFile. For example, a processor can read data files, execute scripts written in languages such as Python, and run SQL queries.
Connection
A Connection can be regarded as the link between different processors. It manages the interaction between the processors by establishing queues. A queue in Apache NiFi
refers to a buffer that has the ability to store FlowFiles. Therefore, the output of one processor, the
FlowFile, is stored in a queue until the other processor is ready to process it.
Controller Service
When a Controller Service is created, it will automatically start up every time Apache NiFi
is initiated. Several essential services can be specified in the Controller Service section. For instance,
DBCPConnectionPool and CSVRecordSetWriter are utilized to establish connections between various
database systems and Apache NiFi, and to enable writing CSV files to the local system, respectively.
Once defined, these services can be utilized by any processors.
Apache Oozie
What is Oozie?
Oozie is a workflow scheduler system to manage Apache Hadoop jobs. Oozie Workflow jobs are
Directed Acyclic Graphs (DAGs) of actions. Workflows, Coordinators, Bundles all come together to form
the building blocks of Oozie. Oozie allows orchestration and control of such complex multi-stage
Hadoop jobs. Multi-stage Hadoop jobs can then be run as a single Oozie job — the Oozie job is the
only thing for you to manage
What is a workflow?
A workflow specifies a set of actions and the order and conditions under which these actions should be performed. The actions are dependent on one another, as the next action can only be executed after the output of the current action is available. These actions could be anything: a MapReduce job, Hive query, shell script, Pig query, or Java program. You might want to run these workflows at a specific time and frequency (provided the input data is available); for that we need coordinators. We can create different types of actions based on the job, and each type of action can have its own type of tags. The workflow and the scripts or JARs should be placed in an HDFS path before executing the workflow.
Oozie Workflow Jobs− These are Directed Acyclic Graphs (DAGs) which specifies a sequence of actions
to be executed.
Oozie Coordinator Jobs− These consist of workflow jobs triggered by time and data availability.
Oozie Bundles− These can be referred to as a package of multiple coordinators and workflow jobs.
Oozie Coordinator
You can schedule complex workflows, as well as workflows that run regularly, using a Coordinator. Oozie Coordinators trigger workflow jobs based on time, data, or event predicates; the workflows inside a coordinator job start when the given condition is satisfied. The key coordinator control properties are listed below.
• timeout− The maximum time, in minutes, for which an action will wait to satisfy the additional
conditions, before getting discarded. 0 indicates that if all the input events are not satisfied at
the time of action materialization, the action should timeout immediately. -1 indicates no
timeout, the action will wait forever. The default value is -1.
• concurrency− The maximum number of actions for a job that can run parallelly. The default
value is 1.
• execution– It specifies the execution order if multiple instances of the coordinator job have
satisfied their execution criteria. It can be:
• FIFO (default)
• LIFO
• LAST_ONLY
So what is coordinator?
• The Coordinator schedules the execution of a workflow at a specified time and/or a specified
frequency and with the availability of the data
• If input data is not available, then the workflow is delayed till the data becomes available.
• If no input data is needed, then the workflow runs purely at a specified time or frequency.
• A collection of Coordinator jobs which can be started, stopped and modified together is called
a Bundle.
• The output of one Coordinator job managing a workflow can be the input to another Coordinator job; chaining Coordinator jobs in this way is called a Data Pipeline.
Oozie Architecture
Database: All job related information is stored in the database. Oozie supports Derby by default.
An Oozie application has one file defined in XML which describes it, known as workflow.xml. It references
and includes other configuration files, JARs and scripts which perform the actions. The application can
be a workflow run manually, a single coordinator or a number of coordinators forming a bundle. Oozie
expects all files to be in HDFS before it can run. These XML files along with other files which are
required for the Oozie application are copied over to HDFS before the job can run.
Workflows:
• 1. Control nodes
• 2. Action nodes
• 3. Global configuration
• Control nodes — These control the start, end, and basic execution flow of the workflow.
• Action nodes — These specify the actual unit of execution (for example, a MapReduce, Hive, Pig, or shell action).
• Global configuration — Common settings can be defined once in a global section rather than repeated for each action.
• The job.properties file defines the parameters for a particular run of the application; the application path property points to the HDFS directory in which workflow.xml is stored.
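A hedged sketch of a minimal job.properties; the host names, ports, and HDFS path are assumptions (oozie.wf.application.path is the property that points Oozie at the directory containing workflow.xml):
nameNode=hdfs://localhost:9000
jobTracker=localhost:8032
oozie.wf.application.path=${nameNode}/user/oozie/apps/my-workflow
The job is then submitted and started with the Oozie command-line client:
$ oozie job -oozie http://localhost:11000/oozie -config job.properties -run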
Zookeeper overview
Why ZooKeeper?
• Unreliable networks
• Latency in operations
ZooKeeper Services
• Reliable Data Registry – Ensures data availability even if some nodes fail
ZooKeeper Ensemble
A ZooKeeper Ensemble is a group of ZooKeeper servers working together to provide a reliable and
fault-tolerant coordination service for distributed systems.
• One server acts as the Leader, and the others act as Followers.
• All write operations go through the Leader, while read operations can be served by any node.
• If the Leader fails, a new Leader is automatically elected from the Followers.
• Ensemble ensures data consistency and coordination even if some nodes fail.
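A hedged sketch of a three-server ensemble configuration (zoo.cfg); the host names and data directory are assumptions, and 2888/3888 are the conventional peer-communication and leader-election ports:
tickTime=2000
dataDir=/var/lib/zookeeper
clientPort=2181
initLimit=5
syncLimit=2
server.1=zk1.example.com:2888:3888
server.2=zk2.example.com:2888:3888
server.3=zk3.example.com:2888:3888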
Architecture of Zookeeper:
Client
Server
Ensemble
Follower
Namespaces:
Data Storage:
Persistent Znode: remains in ZooKeeper until it is explicitly deleted, even if the creating client disconnects.
Ephemeral Znode: exists only as long as the session of the client that created it; it is removed automatically when that session ends.
Sequential Znode: a persistent or ephemeral znode whose name is suffixed with a monotonically increasing counter assigned by ZooKeeper.
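A short zkCli.sh session illustrating the three znode types (the paths and data values are assumptions; the text after # is an annotation, not part of the command):
$ zkCli.sh -server localhost:2181
create /app "cluster-config"       # persistent znode: survives until explicitly deleted
create -e /app/worker1 "alive"     # ephemeral znode: removed when this client session ends
create -s /app/task- "payload"     # sequential znode: a counter is appended, e.g. /app/task-0000000001
ls /app
get /app/worker1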
Sessions in ZooKeeper: a client interacts with the ensemble through a session; ephemeral znodes created by the client live only as long as its session.
Watches in ZooKeeper: a client can set a watch on a znode to receive a one-time notification when that znode's data or children change.
Benefits of Zookeeper
Atomic Transactions – Guarantees operations are either fully completed or not done at all.