HADOOP Notes Unit 3 and 4
The data flow in a Hadoop ecosystem describes the stages through which data passes from its source to processing and finally to output. The flow typically follows a path from data ingestion to storage, then processing, and finally output.
1. Data Ingestion:
o Data is collected from various sources and ingested into the Hadoop system.
o This step involves tools like Flume, Sqoop, and custom data ingestion processes.
2. Data Storage:
o HDFS is the backbone storage system that supports scalable and fault-tolerant data storage.
3. Data Processing:
o Frameworks such as MapReduce, Hive, or Spark process the data stored in HDFS.
4. Data Output:
o The processed results are written back to HDFS or handed to downstream tools for further analysis and reporting.
A company might use Flume to ingest logs from servers into HDFS. The data is processed
using MapReduce or Hive, and finally, the processed data is output to HDFS, or it might be
used for further analysis in tools like Apache Spark.
2. Data Ingestion with Flume and Sqoop
Data ingestion refers to the process of collecting and importing data into Hadoop. Two
commonly used tools for this task are Flume and Sqoop.
Flume:
Apache Flume is a distributed, reliable, and available service for collecting, aggregating, and
moving large amounts of streaming data into HDFS.
Stream Processing: Flume can ingest streaming data from many sources, such as application logs, social media feeds, and sensors.
Reliability: Flume guarantees reliable delivery through transactional hand-offs between its Source, Channel, and Sink components.
Scalability: Flume is highly scalable and can be extended with custom plugins.
Integration: Flume integrates well with Hadoop, HDFS, and other data storage systems.
Flume Architecture:
Source: Receives events from data generators (for example, web server logs) and places them on a Channel.
Channel: The medium through which data flows between Source and Sink (e.g., memory, file, or database).
Sink: Removes events from the Channel and delivers them to the destination, typically HDFS.
Example: collecting logs from multiple web servers and ingesting them into HDFS, as sketched in the configuration below.
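A minimal single-agent configuration sketch (the agent name webagent, the spool directory, and the HDFS path are illustrative assumptions, not values from these notes):

# Source: pick up log files the web servers drop into a spooling directory
webagent.sources = logsrc
webagent.channels = memch
webagent.sinks = hdfssink
webagent.sources.logsrc.type = spooldir
webagent.sources.logsrc.spoolDir = /var/log/webservers
webagent.sources.logsrc.channels = memch
# Channel: buffer events in memory between the source and the sink
webagent.channels.memch.type = memory
webagent.channels.memch.capacity = 10000
# Sink: write the events into HDFS
webagent.sinks.hdfssink.type = hdfs
webagent.sinks.hdfssink.channel = memch
webagent.sinks.hdfssink.hdfs.path = hdfs://namenode:8020/data/weblogs
webagent.sinks.hdfssink.hdfs.fileType = DataStream

The agent would then be started with something like: flume-ng agent --name webagent --conf-file webagent.conf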
Sqoop:
Apache Sqoop is a tool designed for efficiently transferring bulk data between Hadoop and
relational databases like MySQL, PostgreSQL, and Oracle.
Bulk Data Transfer: Sqoop allows efficient bulk transfers of data using parallelism.
Data Types: Sqoop supports a wide range of data types and maps RDBMS column types to the corresponding Hadoop/Java types during import.
Sqoop Operations:
Sqoop supports two core operations, import and export, both invoked from the command line as shown in the examples below.
Sqoop Features:
Data Import: Efficiently imports large datasets from an RDBMS into HDFS.
Data Export: Exports processed data from Hadoop back into relational databases for downstream consumption.
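Example commands (a sketch only; the JDBC connection string, table names, and directories are hypothetical):

sqoop import --connect jdbc:mysql://dbhost/sales --username etl --password-file /user/etl/.dbpass --table orders --target-dir /data/orders --num-mappers 4
sqoop export --connect jdbc:mysql://dbhost/sales --username etl --password-file /user/etl/.dbpass --table order_summary --export-dir /data/order_summary

Sqoop runs each transfer as parallel map tasks (four in the import above), which is where its bulk-transfer efficiency comes from.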
Hadoop Archives (HAR) are used to bundle multiple files into a single archive file in HDFS. HAR files help reduce the overhead associated with having too many small files in HDFS; an example of creating and accessing an archive follows the feature list below.
Key Features:
Storage Optimization: HAR files reduce the number of files in HDFS and
optimize storage.
Efficient Access: Improves data retrieval and access when working with many
small files.
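An archive is created with the hadoop archive command and read back through the har:// URI scheme (the paths below are illustrative):

hadoop archive -archiveName logs.har -p /user/data/logs /user/data/archives
hdfs dfs -ls har:///user/data/archives/logs.har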
Hadoop I/O focuses on how data is compressed, serialized, and stored efficiently for
processing and storage in Hadoop. This includes the handling of compression, serialization,
and formats like Avro and other file-based data structures.
Compression in Hadoop I/O
Compression in Hadoop helps reduce the amount of disk space required for storing data and
improves the I/O performance of the system by reducing network bandwidth and disk
storage requirements.
1. Gzip: Commonly used compression format in Hadoop, particularly for text files.
2. Bzip2: Offers better compression than Gzip but has slower performance.
3. LZO: Provides a good balance between compression ratio and speed, used in real-time applications.
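As a sketch of how a codec is used through Hadoop's compression framework (the output file name is hypothetical; GzipCodec ships with Hadoop):

import java.io.FileOutputStream;
import java.io.OutputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.util.ReflectionUtils;

public class CompressDemo {
    public static void main(String[] args) throws Exception {
        // Instantiate a codec through Hadoop's codec framework
        Configuration conf = new Configuration();
        CompressionCodec codec = ReflectionUtils.newInstance(GzipCodec.class, conf);
        // Wrap a plain output stream so everything written is gzip-compressed
        try (OutputStream out = codec.createOutputStream(new FileOutputStream("data.txt.gz"))) {
            out.write("hello hadoop compression\n".getBytes("UTF-8"));
        }
    }
}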
Serialization is the process of converting an object into a byte stream for storage or transmission, and deserialization is the reverse process. Hadoop's main serialization options include:
1. Writable: Hadoop's native serialization mechanism, defined by the Writable interface and used for MapReduce keys and values (e.g., IntWritable, Text).
2. Avro: A more flexible serialization format, offering compact storage and schema-based data serialization. It is ideal for complex data structures and is often used in Big Data systems.
Example of Serialization with Writable:
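A minimal sketch of serializing and deserializing one of Hadoop's built-in Writable types through an in-memory byte stream (the class name and value are arbitrary):

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;

public class WritableDemo {
    public static void main(String[] args) throws IOException {
        // Serialize an IntWritable into a byte stream
        IntWritable original = new IntWritable(163);
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bytes);
        original.write(out);                // Writable.write() performs the serialization
        out.close();

        // Deserialize the bytes back into a fresh IntWritable
        IntWritable restored = new IntWritable();
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes.toByteArray()));
        restored.readFields(in);            // Writable.readFields() performs the deserialization
        System.out.println(restored.get()); // prints 163
    }
}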
4. Avro
Avro is a binary serialization format used by Hadoop for encoding data. It provides a
compact, fast, and efficient way of encoding data, and is particularly useful for serializing
data that can be written to disk or transferred over the network.
Schema-based: Avro data is always serialized with its schema, ensuring the
data’s structure is always known and consistent.
Data Evolution: Avro supports schema evolution, meaning new fields can be
added to data without breaking compatibility with older versions.
2. Reading Avro Data:
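A minimal sketch using Avro's generic Java API (the file name users.avro is an assumption; the schema is read from the file's own header):

import java.io.File;
import java.io.IOException;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.DatumReader;

public class ReadAvro {
    public static void main(String[] args) throws IOException {
        // No separate schema file is needed: Avro stores the writer's schema in the data file.
        DatumReader<GenericRecord> datumReader = new GenericDatumReader<>();
        try (DataFileReader<GenericRecord> fileReader =
                 new DataFileReader<>(new File("users.avro"), datumReader)) {
            for (GenericRecord record : fileReader) {
                System.out.println(record);   // each record prints in a JSON-like form
            }
        }
    }
}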
File-based data structures are formats used for storing data on HDFS or similar distributed
file systems. These structures help in organizing data for better access, processing, and
compression. Some common file-based data structures are:
1. SequenceFile:
SequenceFile is a flat file format used for storing binary key/value pairs.
It is highly efficient for sequential access and works well for intermediate
MapReduce results.
Example:
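A sketch of writing and then reading key/value pairs (the path and the records are hypothetical):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SequenceFileDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path path = new Path("/data/demo.seq");   // hypothetical path

        // Write a few IntWritable/Text pairs
        try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(path),
                SequenceFile.Writer.keyClass(IntWritable.class),
                SequenceFile.Writer.valueClass(Text.class))) {
            for (int i = 0; i < 3; i++) {
                writer.append(new IntWritable(i), new Text("record-" + i));
            }
        }

        // Read the pairs back sequentially
        try (SequenceFile.Reader reader = new SequenceFile.Reader(conf,
                SequenceFile.Reader.file(path))) {
            IntWritable key = new IntWritable();
            Text value = new Text();
            while (reader.next(key, value)) {
                System.out.println(key + "\t" + value);
            }
        }
    }
}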
2. Parquet:
Parquet is a columnar storage format for the Hadoop ecosystem.
Optimized for Read-heavy Workloads: Great for use in data warehouses and for analytical queries.
Efficient Compression: Parquet’s columnar storage format allows for
higher compression ratios compared to row-based formats.
3. ORC (Optimized Row Columnar):
ORC is a columnar format designed to optimize read and write performance in Hive.
Benefits of ORC:
Efficient Compression: ORC files can reduce storage space by up to 75% compared
to text-based formats.
Efficient Querying: ORC enables faster query performance for Hive queries due to
its efficient indexing and predicate pushdown.
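For example, a Hive table can be stored in ORC simply by declaring the format (the table and columns below are illustrative):

CREATE TABLE sales_orc (id INT, amount DOUBLE, region STRING)
STORED AS ORC;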
MODULE- 4
1. Hadoop Ecosystem
The Hadoop Ecosystem refers to a suite of tools and frameworks that extend the
capabilities of the Hadoop platform. It consists of various components, each serving a
specific purpose, ranging from data ingestion, processing, and storage to resource
management, workflow
orchestration, and data analysis.
1. Hadoop Common: The foundational libraries and utilities that support other
Hadoop modules.
7. HBase: A distributed NoSQL database that runs on top of HDFS and is suitable
for real-time random access to large datasets.
8. Sqoop: A tool for transferring data between relational databases and HDFS.
9. Flume: A distributed service for collecting, aggregating, and moving log data or real-time data into HDFS.
12. Mahout: A machine learning library built on top of Hadoop for scalable algorithms.
These components work together to form a robust environment for storing, processing, and
analyzing large-scale data.
2. Pig: Introduction to PIG
Apache Pig is a platform for analyzing large datasets. It is built on top of the Hadoop
ecosystem and provides a simpler way to write MapReduce programs. Pig is often used for
ETL (Extract, Transform, Load) tasks.
Extensibility: Pig allows users to write custom User Defined Functions (UDFs)
to extend its capabilities.
Data Processing: It is designed for batch processing of large datasets rather than low-latency, real-time queries.
With Pig, the developer can focus more on the logic and transformations
without worrying about the intricacies of the MapReduce programming model.
Pig can run in different modes based on the needs of the user and the environment in which
it is deployed. The execution modes define how and where Pig queries are executed.
1. Local Mode:
Description: In Local Mode, Pig runs in a single JVM (Java Virtual Machine),
typically on a local machine or a single node.
Use Case: Useful for small-scale datasets or for development and testing purposes.
Limitations: Limited scalability as it runs on a single node and does not take
full advantage of Hadoop’s distributed nature.
2. MapReduce Mode:
Description: In MapReduce Mode (the default), Pig runs on a Hadoop cluster and executes scripts as MapReduce jobs.
Use Case: Ideal for large-scale data processing when the system needs to scale
out and distribute the workload.
Execution: Pig scripts are converted into MapReduce jobs and are executed
across the Hadoop cluster.
3. Tez Mode:
Use Case: Used when higher performance and optimization are needed,
especially for complex queries.
Execution: Tez runs on top of YARN and supports more efficient DAG-based processing than the traditional MapReduce model.
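The execution mode is selected with the -x (exectype) flag when launching Pig; the script name below is hypothetical:

pig -x local myscript.pig        # Local Mode: single JVM, local file system
pig -x mapreduce myscript.pig    # MapReduce Mode: runs on the Hadoop cluster
pig -x tez myscript.pig          # Tez Mode: runs on the cluster using the Tez engine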
Pig and traditional databases (such as SQL databases) differ in their design, purpose, and capabilities. The key differences include:
Language: Pig Latin is a procedural data-flow language, whereas SQL is declarative.
Schema: Pig applies a schema at read time and tolerates loosely structured data, while traditional databases enforce a schema when data is written.
Workload: Pig targets batch analysis of very large datasets on HDFS, while traditional databases are optimized for low-latency queries, transactions, and updates.
5. Grunt
Grunt is the interactive shell that comes with Pig, which allows users to run Pig Latin scripts
interactively from the command line. Grunt is useful for testing Pig Latin scripts and for
performing ad-hoc data exploration.
Key Features:
Access to Pig Functions: Users can access Pig’s built-in functions and libraries
directly through Grunt.
For example (assuming a small sample file named data.txt):
grunt> A = LOAD 'data.txt' AS (name:chararray, age:int);
grunt> DUMP A;
6. Pig Latin
Pig Latin is a data flow language used to write Pig scripts. It is procedural in nature, meaning
that it allows users to describe the sequence of operations to perform on the data.
Key Features:
Simplicity: Pig Latin scripts are simpler to write than MapReduce code and are
more readable.
Optimized Execution: Pig Latin scripts are automatically optimized by the Pig
engine for efficient execution.
Basic Syntax:
LOAD: Reads data from HDFS into a relation.
FILTER: Selects the tuples that satisfy a condition.
GROUP: Groups tuples by a key.
DUMP: Displays the output of the query.
Example (the file students.txt and its fields are assumptions used for illustration):
A = LOAD 'students.txt' USING PigStorage(',') AS (name:chararray, age:int);
B = FILTER A BY age > 18;
C = GROUP B BY name;
DUMP C;
7. User Defined Functions (UDFs)
User Defined Functions (UDFs) are custom functions that can be written in Java, Python, or Ruby to extend the capabilities of Pig Latin. UDFs allow you to implement complex business logic that is not available through the built-in operators; a minimal Java Eval UDF is sketched after the list below.
Types of UDFs:
1. Eval Functions: These are used to transform the data (e.g., changing data types
or performing calculations).
2. Load/Store Functions: These are used for custom data loading and
storing mechanisms.
3. Bag and Tuple UDFs: These allow you to manipulate complex data structures
like bags, tuples, and maps.
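A minimal Eval UDF sketch in Java (the class name ToUpper and the jar name below are assumptions):

import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

// Eval UDF that upper-cases its first chararray argument.
public class ToUpper extends EvalFunc<String> {
    @Override
    public String exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0 || input.get(0) == null) {
            return null;                      // pass nulls through untouched
        }
        return ((String) input.get(0)).toUpperCase();
    }
}

Once packaged into a jar, the function is made available with REGISTER myudfs.jar; and called like any built-in, e.g. B = FOREACH A GENERATE ToUpper(name);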
8. Data Processing Operators in Pig
Pig provides several built-in operators for data processing. These operators allow users to
perform various types of transformations, aggregations, and joins on datasets.
Key Operators:
The main operators are illustrated by the following sequence (the file students.txt and its fields are assumed for illustration):
A = LOAD 'students.txt' USING PigStorage(',') AS (name:chararray, age:int); -- LOAD: read data into a relation
B = FILTER A BY age > 18;               -- FILTER: keep tuples that match a condition
C = GROUP B BY name;                    -- GROUP: group tuples by a key
D = FOREACH C GENERATE group, COUNT(B); -- FOREACH ... GENERATE: transform or aggregate
E = FOREACH B GENERATE name, age;       -- projection used by the statements below
F = ORDER E BY age;                     -- ORDER: sort by one or more fields
G = DISTINCT F;                         -- DISTINCT: remove duplicate tuples
H = CROSS A, B;                         -- CROSS: cross product of two relations
J = LIMIT E 100;                        -- LIMIT: restrict the number of output tuples