HADOOP Notes Unit 3 and 4

The document outlines the data flow in the Hadoop ecosystem, detailing stages from ingestion to output, and tools like Flume and Sqoop for data ingestion. It also covers Hadoop I/O, focusing on compression, serialization, and file-based data structures like Avro, Parquet, and ORC. Additionally, it introduces the Hadoop ecosystem components, Apache Pig for data analysis, and its execution modes, highlighting the use of Pig Latin and User Defined Functions (UDFs).

MODULE 3

1. Data Flow in Hadoop Ecosystem

The data flow in a Hadoop ecosystem involves the stages through which data passes from
source to processing and finally storage. The flow typically follows a path from data
ingestion, storage, processing, and finally output.

Key Stages of Data Flow:

1. Data Ingestion:

o Data is collected from various sources and ingested into the Hadoop system.

o This step involves tools like Flume, Sqoop, and custom data ingestion processes.

2. Data Storage:

o Data is stored in the Hadoop Distributed File System (HDFS), where it is organized into blocks and replicated for fault tolerance.

o HDFS is the backbone storage system that supports scalable and fault-tolerant data storage.

3. Data Processing:

o Once data is stored in HDFS, it is processed using MapReduce, Apache Spark, or Apache Hive.

o This processing might include transformations, aggregations, or analysis of large datasets.

4. Data Output:

o After processing, data is output to HDFS or other storage systems.

o It can also be output to external systems like relational databases or data lakes for further use in downstream applications.

Data Flow Example:

A company might use Flume to ingest logs from servers into HDFS. The data is processed
using MapReduce or Hive, and finally, the processed data is output to HDFS, or it might be
used for further analysis in tools like Apache Spark.

2. Data Ingestion with Flume and Sqoop

Data ingestion refers to the process of collecting and importing data into Hadoop. Two
commonly used tools for this task are Flume and Sqoop.

Flume:

Apache Flume is a distributed, reliable, and available service for collecting, aggregating, and
moving large amounts of streaming data into HDFS.

Key Features of Flume:

 Stream Processing: Flume can ingest data from multiple sources like logs,
social media feeds, and sensors.

 Reliability: Flume ensures data reliability with mechanisms like Channel, Source, and
Sink.

 Scalability: Flume is highly scalable and can be extended with custom plugins.

 Integration: Flume integrates well with Hadoop, HDFS, and other data
storage systems.

Flume Architecture:

 Source: Where data comes from (e.g., logs, syslog, or Kafka).

 Channel: The medium through which data flows (e.g., memory, file, or database).

 Sink: The destination where data is stored (typically HDFS).

Example of Flume Use Case:

Collecting logs from multiple web servers and ingesting them into HDFS:

flume-ng agent --conf /etc/flume/conf --conf-file /etc/flume/conf/flume.conf --name agent1
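
A minimal flume.conf for such an agent might look like the sketch below; the source command, channel sizing, and HDFS path are illustrative assumptions rather than values from a real deployment.

# agent1: reads a web-server log with an exec source, buffers events in a
# memory channel, and delivers them to HDFS through an HDFS sink.
agent1.sources = logSource
agent1.channels = memChannel
agent1.sinks = hdfsSink

# Source: tail the access log (log path is an assumption for illustration)
agent1.sources.logSource.type = exec
agent1.sources.logSource.command = tail -F /var/log/httpd/access_log
agent1.sources.logSource.channels = memChannel

# Channel: in-memory buffer between the source and the sink
agent1.channels.memChannel.type = memory
agent1.channels.memChannel.capacity = 10000

# Sink: write events into HDFS (NameNode address is an assumption)
agent1.sinks.hdfsSink.type = hdfs
agent1.sinks.hdfsSink.hdfs.path = hdfs://namenode:8020/flume/weblogs
agent1.sinks.hdfsSink.channel = memChannel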

Sqoop:

Apache Sqoop is a tool designed for efficiently transferring bulk data between Hadoop and
relational databases like MySQL, PostgreSQL, and Oracle.

Key Features of Sqoop:

Relational Database Integration: Sqoop is primarily used for importing data from RDBMS into HDFS and exporting data from HDFS back into RDBMS.

 Bulk Data Transfer: Sqoop allows efficient bulk transfers of data using parallelism.

 Data Types: Sqoop supports a wide range of data types and can map them
from RDBMS to HDFS.

Sqoop Operations:

 Import: Imports data from an RDBMS into HDFS.

sqoop import --connect jdbc:mysql://localhost/test --table employees --target-dir /user/hadoop/employees

 Export: Exports data from HDFS to an RDBMS.

sqoop export --connect jdbc:mysql://localhost/test --table employees --export-dir /user/hadoop/employees

Sqoop Features:

 Data Import: Efficiently imports large datasets from RDBMS into HDFS.

Incremental Imports: Supports importing only new or updated data based on a column like a timestamp (see the sketch after this list).

 Data Export: Exports processed data from Hadoop back into relational databases
for downstream consumption.
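
A sketch of an incremental import; the connection string, table, check column, and last value are illustrative assumptions. With --incremental append, Sqoop pulls only rows whose check column exceeds the last imported value:

sqoop import --connect jdbc:mysql://localhost/test --table employees --incremental append --check-column id --last-value 1000 --target-dir /user/hadoop/employees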

Hadoop Archives (HAR)

Hadoop Archives (HAR) are used to bundle multiple files into a single archive file in HDFS.
HAR files help in reducing the overhead associated with having too many small files in HDFS.

Key Features:

 Storage Optimization: HAR files reduce the number of files in HDFS and
optimize storage.

 Efficient Access: Improves data retrieval and access when working with many
small files.

 Compression: HAR files can be compressed, saving space on HDFS.

Creating HAR Files:

hadoop archive -archiveName archive.har -p /source/directory /user/hadoop/target/directory
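
Once created, the archive can be browsed through the har:// filesystem scheme; for example, assuming the paths used above:

hadoop fs -ls har:///user/hadoop/target/directory/archive.har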

3. Hadoop I/O: Compression, Serialization, Avro, and File-Based Data Structures

Hadoop I/O focuses on how data is compressed, serialized, and stored efficiently for
processing and storage in Hadoop. This includes the handling of compression, serialization,
and formats like Avro and other file-based data structures.

Compression in Hadoop I/O

Compression in Hadoop helps reduce the amount of disk space required for storing data and
improves the I/O performance of the system by reducing network bandwidth and disk
storage requirements.

Key Compression Formats:

1. Gzip: Commonly used compression format in Hadoop, particularly for text files.

2. Bzip2: Offers better compression than Gzip but has slower performance.

3. LZO: Provides a good balance between compression ratio and speed, used in real-time applications.

4. Snappy: Provides high-speed compression with a lower compression ratio, ideal for real-time processing.

Enabling Compression in Hadoop:

Compression can be enabled at various stages:

MapReduce: Compression can be applied to the output of MapReduce jobs by setting the appropriate compression codec, as in the sketch below.
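
A minimal sketch of enabling output compression in a job driver; the job name is arbitrary and the Snappy codec is assumed to be available on the cluster:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.compress.SnappyCodec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CompressedOutputDriver {
    public static void main(String[] args) throws Exception {
        // Compress the final output of a MapReduce job with the Snappy codec
        Job job = Job.getInstance(new Configuration(), "compressed-output-job");
        FileOutputFormat.setCompressOutput(job, true);
        FileOutputFormat.setOutputCompressorClass(job, SnappyCodec.class);
        // ... set mapper, reducer, input/output paths, then job.waitForCompletion(true)
    }
}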

Serialization in Hadoop I/O

Serialization is the process of converting an object into a byte stream for storage or
transmission, and deserialization is the reverse process.

Key Serialization Formats:

1. Writable: The default serialization mechanism in Hadoop. It is highly optimized for performance but not always the most flexible.

o Example: IntWritable, Text, LongWritable.

2. Avro: A more flexible serialization format, offering compact storage and schema-
based data serialization. It is ideal for complex data structures and is often used in
Big Data systems.

Example of Serialization with Writable:
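
A minimal sketch using the built-in Text and IntWritable types; the key and value contents are arbitrary assumptions for illustration:

import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;

public class WritableDemo {
    public static void main(String[] args) throws IOException {
        // Serialize a Text key and an IntWritable value into a byte stream,
        // much like Hadoop does for intermediate MapReduce key/value pairs.
        Text key = new Text("age");
        IntWritable value = new IntWritable(42);

        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bytes);
        key.write(out);    // Writable.write() encodes the object
        value.write(out);
        out.close();

        System.out.println("Serialized " + bytes.size() + " bytes");
    }
}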

4. Avro

Avro is a binary serialization format used by Hadoop for encoding data. It provides a
compact, fast, and efficient way of encoding data, and is particularly useful for serializing
data that can be written to disk or transferred over the network.

Key Features of Avro:

Compact: It offers a compact binary format, which helps in reducing storage and network bandwidth requirements.

 Schema-based: Avro data is always serialized with its schema, ensuring the
data’s structure is always known and consistent.

Interoperability: Avro can be used in different programming languages such as Java, Python, C, and C++.

 Data Evolution: Avro supports schema evolution, meaning new fields can be
added to data without breaking compatibility with older versions.

Example of Avro Usage:

1. Avro Schema Definition: The schema is defined in JSON format.
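
A minimal sketch of such a schema, here describing a User record with two fields (the record name, namespace, and fields are illustrative assumptions that match the user.avro example below):

{
  "type": "record",
  "name": "User",
  "namespace": "example.avro",
  "fields": [
    {"name": "name", "type": "string"},
    {"name": "age", "type": "int"}
  ]
}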

2. Reading Avro Data:

// Read generic records back from an Avro data file
DatumReader<GenericRecord> reader = new GenericDatumReader<>();
DataFileReader<GenericRecord> fileReader = new DataFileReader<>(new File("user.avro"), reader);

5. File-Based Data Structures

File-based data structures are formats used for storing data on HDFS or similar distributed
file systems. These structures help in organizing data for better access, processing, and
compression. Some common file-based data structures are:

1. SequenceFile:

 SequenceFile is a flat file format used for storing binary key/value pairs.

 It is highly efficient for sequential access and works well for intermediate
MapReduce results.

 Supports compression of both keys and values.

Example:
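
A minimal sketch of writing binary key/value pairs to a SequenceFile; the output path and the sample records are illustrative assumptions:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SequenceFileWriteDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path path = new Path("data.seq"); // assumed output path

        // Append a few key/value pairs to the SequenceFile
        try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(path),
                SequenceFile.Writer.keyClass(IntWritable.class),
                SequenceFile.Writer.valueClass(Text.class))) {
            for (int i = 0; i < 3; i++) {
                writer.append(new IntWritable(i), new Text("record-" + i));
            }
        }
    }
}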

2. Parquet:

Parquet is a columnar storage format for Hadoop, optimized for storing large datasets.

It is particularly useful for analytical workloads where column-based access is required.

Parquet provides efficient data compression and encoding schemes, reducing storage costs and improving query performance.

Key Features of Parquet:

 Columnar Storage: Enables fast reading of a subset of columns.

 Optimized for Read-heavy Workloads: Great for use in data warehouses and
for analytical queries.

 Efficient Compression: Parquet’s columnar storage format allows for
higher compression ratios compared to row-based formats.

3. ORC (Optimized Row Columnar):

 ORC is a columnar format designed to optimize the read and write performance
in Hive.

It provides highly efficient storage by applying lightweight compression techniques and efficient indexing.

Benefits of ORC:

 Efficient Compression: ORC files can reduce storage space by up to 75% compared
to text-based formats.

 Efficient Querying: ORC enables faster query performance for Hive queries due to
its efficient indexing and predicate pushdown.

MODULE 4
1. Hadoop Ecosystem

The Hadoop Ecosystem refers to a suite of tools and frameworks that extend the
capabilities of the Hadoop platform. It consists of various components, each serving a
specific purpose, ranging from data ingestion, processing, and storage to resource
management, workflow
orchestration, and data analysis.

Key Components of the Hadoop Ecosystem:

1. Hadoop Common: The foundational libraries and utilities that support other
Hadoop modules.

2. HDFS (Hadoop Distributed File System): A distributed storage system that stores large datasets across multiple nodes.

3. MapReduce: A computational model and processing engine for processing large datasets in parallel.

4. YARN (Yet Another Resource Negotiator): A resource management layer that allocates resources to different applications in a cluster.

5. Hive: A data warehouse solution that allows SQL-like queries to be executed on Hadoop.

6. Pig: A high-level platform for processing large datasets using a scripting language called Pig Latin.

7. HBase: A distributed NoSQL database that runs on top of HDFS and is suitable
for real-time random access to large datasets.

8. Sqoop: A tool for transferring data between relational databases and HDFS.

9. Flume: A distributed service for collecting, aggregating, and moving log data or real-time data into HDFS.

10. Oozie: A workflow scheduler system for managing Hadoop jobs.

11. Zookeeper: A coordination service for managing distributed applications.

12. Mahout: A machine learning library built on top of Hadoop for scalable algorithms.

13. Storm: A real-time computation system designed for processing unbounded streams of data.

These components work together to form a robust environment for storing, processing, and
analyzing large-scale data.

2. Pig: Introduction to Pig

Apache Pig is a platform for analyzing large datasets. It is built on top of the Hadoop
ecosystem and provides a simpler way to write MapReduce programs. Pig is often used for
ETL (Extract, Transform, Load) tasks.

Key Features of Pig:

Abstraction Over MapReduce: Pig simplifies the process of writing MapReduce programs by providing a higher-level abstraction called Pig Latin, which is a scripting language.

 Extensibility: Pig allows users to write custom User Defined Functions (UDFs)
to extend its capabilities.

Data Processing: It is designed for batch processing of large datasets.

 Optimized Execution: Pig automatically optimizes queries for better performance.

Pig vs. MapReduce:

Pig abstracts the complexity of writing MapReduce code by using a simple scripting language.

 With Pig, the developer can focus more on the logic and transformations
without worrying about the intricacies of the MapReduce programming model.

3. Execution Modes of Pig

Pig can run in different modes based on the needs of the user and the environment in which
it is deployed. The execution modes define how and where Pig queries are executed.

1. Local Mode:

 Description: In Local Mode, Pig runs in a single JVM (Java Virtual Machine),
typically on a local machine or a single node.

 Use Case: Useful for small-scale datasets or for development and testing purposes.

Execution: The processing is done sequentially (i.e., without parallelization), making it suitable for non-distributed, small datasets.

 Limitations: Limited scalability as it runs on a single node and does not take
full advantage of Hadoop’s distributed nature.

2. MapReduce Mode (Hadoop Mode):

Description: In this mode, Pig submits jobs to Hadoop’s MapReduce framework for distributed execution across a cluster of machines.

 Use Case: Ideal for large-scale data processing when the system needs to scale
out and distribute the workload.

 Execution: Pig scripts are converted into MapReduce jobs and are executed
across the Hadoop cluster.

 Advantages: Scalability and parallel processing of large datasets.

3. Tez Mode:

Description: Apache Tez is an advanced execution engine that improves the performance of Pig by reducing the overhead of MapReduce.

 Use Case: Used when higher performance and optimization are needed,
especially for complex queries.

Execution: Tez runs on top of YARN and supports more efficient DAG-based processing than the traditional MapReduce model.

 Advantages: Faster execution for complex data flows compared to MapReduce.
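
The execution mode is chosen with the -x flag when launching Pig; a quick sketch (the script name is an illustrative assumption):

pig -x local analysis.pig
pig -x mapreduce analysis.pig
pig -x tez analysis.pig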

4. Comparison of Pig with Databases

Pig and traditional databases (such as SQL databases) differ in their design, purpose, and
capabilities. Below is a comparison highlighting their strengths and differences.
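
In brief, the commonly cited differences are:

Data model: Pig handles an optional schema and nested types (bags, tuples, maps); a relational database requires a fixed schema of flat tables.

Language: Pig Latin is a procedural data flow language; SQL is declarative.

Workload: Pig targets batch ETL over very large datasets; databases target interactive queries and transactional (OLTP) workloads.

Latency: Pig scripts run as MapReduce or Tez jobs, so response times are high; databases answer indexed lookups with low latency.

Updates: Data in HDFS is written once and read many times, with no row-level updates; databases support inserts, updates, and deletes.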

5. Grunt

Grunt is the interactive shell that comes with Pig, which allows users to run Pig Latin scripts
interactively from the command line. Grunt is useful for testing Pig Latin scripts and for
performing ad-hoc data exploration.

Key Features:

Interactive Querying: Provides an interactive environment to execute Pig scripts line-by-line.

Feedback and Debugging: It allows users to quickly see the results of their commands and debug them.

 Access to Pig Functions: Users can access Pig’s built-in functions and libraries
directly through Grunt.

Example of Grunt Usage:

grunt> A = LOAD 'data.txt' USING PigStorage(',') AS (name:chararray, age:int);

grunt> DUMP A;

6. Pig Latin

Pig Latin is a data flow language used to write Pig scripts. It is procedural in nature, meaning
that it allows users to describe the sequence of operations to perform on the data.

Key Features:

 Simplicity: Pig Latin scripts are simpler to write than MapReduce code and are
more readable.

Extensibility: Users can define User Defined Functions (UDFs) to extend Pig's functionality.

 Optimized Execution: Pig Latin scripts are automatically optimized by the Pig
engine for efficient execution.

Basic Syntax:

LOAD: Reads data from the file system (HDFS or local) into a relation.

 FILTER: Filters records based on conditions.

 GROUP: Groups data for aggregation.

 JOIN: Performs a join between two datasets.

 FOREACH: Applies transformations to data.

 DUMP: Displays the output of the query.

 STORE: Saves the output to a specified location.

Example of Pig Latin:

A = LOAD 'input.txt' USING PigStorage(',') AS (name:chararray, age:int);

B = FILTER A BY age > 30;

C = GROUP B BY name;

DUMP C;

7. User Defined Functions (UDFs) in Pig

User Defined Functions (UDFs) are custom functions that can be written in Java, Python,
or Ruby to extend the capabilities of Pig Latin. UDFs allow you to implement complex
business logic that is not available through built-in operators.

Types of UDFs:

1. Eval Functions: These are used to transform the data (e.g., changing data types
or performing calculations).

o Example: MyUDF that calculates the square of a number.

2. Load/Store Functions: These are used for custom data loading and
storing mechanisms.

o Example: A UDF for reading from a custom data source.

3. Bag and Tuple UDFs: These allow you to manipulate complex data structures
like bags, tuples, and maps.

Example of a Simple UDF in Pig (Java):
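
A minimal sketch of an Eval UDF that returns the square of an integer input, matching the MyUDF idea above; the class name and input handling are illustrative assumptions:

import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

// Eval UDF: returns the square of its integer argument
public class Square extends EvalFunc<Integer> {
    @Override
    public Integer exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0 || input.get(0) == null) {
            return null;   // propagate nulls instead of failing
        }
        int value = (Integer) input.get(0);
        return value * value;
    }
}

After packaging the class into a jar (the jar name below is an assumption), it can be registered and called from Pig Latin:

REGISTER square-udf.jar;
B = FOREACH A GENERATE Square(age);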

8. Data Processing Operators in Pig

Pig provides several built-in operators for data processing. These operators allow users to
perform various types of transformations, aggregations, and joins on datasets.

Key Operators:

1. LOAD: Reads data into the Pig environment.

A = LOAD 'data.txt' USING PigStorage(',') AS (name:chararray, age:int);

2. FILTER: Filters the data based on a condition.

B = FILTER A BY age > 30;

3. GROUP: Groups data based on a key.

C = GROUP B BY name;

4. JOIN: Joins two datasets based on a common key.

D = JOIN A BY name, B BY name;

5. FOREACH: Applies transformations to each element of the data.

E = FOREACH D GENERATE A::name AS name, A::age * 2 AS age;

6. ORDER: Orders the data based on a given field.

F = ORDER E BY age;

7. DISTINCT: Removes duplicate records.

G = DISTINCT F;

8. CROSS: Performs a Cartesian product of two datasets.

H = CROSS A, B;


9. LIMIT: Restricts the number of records.

J = LIMIT E 100;
