Big Data-UNIT-2

Hadoop is an open-source framework for processing and storing large datasets, inspired by Google's technologies, and consists of key components including HDFS for storage, MapReduce for processing, and YARN for resource management. HDFS provides scalable, fault-tolerant storage by distributing data across multiple nodes, while MapReduce enables parallel data processing. Various data formats are supported in Hadoop, with binary formats like Avro and Parquet offering advantages in compression and performance for big data analytics.


UNIT-2

1. Introduction to Hadoop
Hadoop is an open-source framework designed for processing and storing large datasets across
distributed clusters of computers. It was inspired by Google's MapReduce and Google File
System (GFS) to handle big data efficiently.

History of Hadoop

●​ Developed by Doug Cutting and Mike Cafarella in 2005.


●​ Initially part of the Apache Nutch project, later split into a separate project under the
Apache Software Foundation.
●​ Named after Doug Cutting’s son’s toy elephant.
●​ Became an Apache top-level project in 2008.

Apache Hadoop

●​ Open-source implementation of Google's MapReduce and Google File System.


●​ Designed to scale from a single server to thousands of machines.
●​ Fault-tolerant and handles failures automatically.

2. Hadoop Distributed File System (HDFS)

●​ HDFS is Hadoop's storage system designed for large-scale data storage and
high-throughput access.
●​ Key features:
○​ Distributed Storage: Data is split into blocks and stored across multiple nodes.
○​ Fault Tolerance: Data replication ensures no data loss.
○​ Scalability: Can handle petabytes of data.

Hadoop Distributed File System (HDFS) is a scalable, fault-tolerant, distributed storage system
designed to store and process large volumes of data across multiple machines in a cluster. HDFS
is a core component of the Apache Hadoop ecosystem and is optimized for handling big data
workloads efficiently.

1. Features of HDFS

HDFS is designed to address the challenges of big data storage and processing by providing:

1.​ Scalability: It can store petabytes (PB) of data across thousands of machines.
2.​ Fault Tolerance: Data is replicated across multiple nodes to prevent data loss.
3.​ High Throughput: Optimized for batch processing, ensuring high-speed data access.
4.​ Data Locality: Moves computation closer to data, reducing network traffic.
5.​ Streaming Access: Best suited for sequential read/write operations.
6.​ Write-Once, Read-Many: Data is written once and read multiple times, ideal for big
data analytics.

2. HDFS Architecture

HDFS follows a master-slave architecture, consisting of three primary components:

a) NameNode (Master)
●​ Role: Manages metadata (file names, locations, replicas).
●​ Responsibilities:
○​ Keeps track of file system structure.
○​ Manages access permissions.
○​ Ensures fault tolerance by monitoring DataNodes.
○​ Does not store actual data, only metadata.

b) DataNode (Slaves)

●​ Role: Stores actual data blocks of files.


●​ Responsibilities:
○​ Reads and writes data as instructed by NameNode.
○​ Periodically sends a "heartbeat" to NameNode indicating availability.
○​ Performs block replication and data recovery.

c) Secondary NameNode

●​ Role: Assists NameNode in metadata management.


●​ Responsibilities:
○​ Takes periodic snapshots of the NameNode metadata.
○​ Helps recover NameNode failure (not a real-time backup).
○​ Reduces the workload of the NameNode.

3. HDFS File Storage Mechanism

a) Blocks in HDFS

●​ Each file in HDFS is split into fixed-size blocks (default 128 MB).
●​ Blocks are stored across multiple DataNodes.
●​ Large block size reduces the overhead of managing many small files.
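
For example, with the default 128 MB block size, a 1 GB (1,024 MB) file is stored as 8 blocks; the last block of a file may be smaller than the configured block size.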

b) Data Replication

●​ Each block is replicated (default: 3 copies) across different nodes.


●​ Ensures fault tolerance and high availability.
●​ If a DataNode fails, the system automatically creates a new copy.

c) Rack Awareness

●​ HDFS understands the physical topology of the cluster.


●​ Tries to store copies of data on different racks to minimize data loss risk.
4. HDFS Data Read & Write Operations

a) Writing a File to HDFS

1.​ The client contacts the NameNode to get metadata information.


2.​ The NameNode assigns DataNodes to store the file's blocks.
3.​ The client writes data to the first DataNode, which then copies to others for replication.
4.​ Once replication is completed, the client gets confirmation.

b) Reading a File from HDFS

1.​ The client requests the NameNode for file location.


2.​ The NameNode provides a list of DataNodes containing the file blocks.
3.​ The client reads data directly from the nearest DataNode.
4.​ Data is fetched in parallel to improve performance.
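
The same write and read flow can be driven programmatically through the Hadoop FileSystem Java API. The sketch below is a minimal illustration (the path /data/example.txt and the sample text are placeholders, not part of the notes above); the client only talks to the FileSystem abstraction, while the NameNode and DataNodes handle block placement and replication as described.

java

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);       // handle to HDFS (or the configured default FS)

        Path file = new Path("/data/example.txt");  // hypothetical path

        // Write: the client streams bytes; HDFS splits them into blocks and replicates them.
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.write("Hadoop stores files as replicated blocks\n".getBytes(StandardCharsets.UTF_8));
        }

        // Read: the NameNode supplies block locations; data is read from the nearest DataNode.
        try (FSDataInputStream in = fs.open(file);
             BufferedReader reader = new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8))) {
            System.out.println(reader.readLine());
        }

        fs.close();
    }
}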

5. Fault Tolerance in HDFS

HDFS ensures reliability through:

1.​ Replication: Multiple copies of each block are maintained.


2.​ Heartbeats: DataNodes regularly report health status to the NameNode.
3.​ Automatic Recovery: If a DataNode fails, the NameNode reallocates missing blocks.

6. HDFS Commands (Basic)

Here are some common HDFS commands:

a) File Operations

●​ List files: hdfs dfs -ls /


●​ Create a directory: hdfs dfs -mkdir /data
●​ Upload a file: hdfs dfs -put localfile.txt /data/
●​ Download a file: hdfs dfs -get /data/file.txt localfile.txt
●​ Delete a file: hdfs dfs -rm /data/file.txt

b) Directory Operations

●​ List directory contents: hdfs dfs -ls /data


●​ Remove a directory: hdfs dfs -rm -r /data
3. Components of Hadoop

Hadoop consists of four main components:

1.​ Hadoop Common – The core utilities and libraries.


2.​ HDFS (Hadoop Distributed File System) – Storage layer for Hadoop.
3.​ MapReduce – Processing layer for distributed data computation.
4.​ YARN (Yet Another Resource Negotiator) – Resource management and job scheduling
framework.

1. Hadoop Distributed File System (HDFS)

HDFS is the storage layer of Hadoop, designed to store huge volumes of data across multiple
machines in a cluster. It provides fault tolerance, scalability, and high throughput.

Key Features of HDFS


●​ Stores large files as blocks (default: 128 MB per block).
●​ Replication mechanism: Each block is replicated (default: 3 copies) for fault tolerance.
●​ Write-once, read-many model: Files are immutable after being written.
●​ Data locality optimization: Computation is moved closer to data to minimize network
load.

Components of HDFS

●​ NameNode (Master): Manages metadata (file locations, permissions, and structure) but does not store actual data.
●​ DataNode (Slaves): Stores actual data blocks and reports health status to the NameNode.
●​ Secondary NameNode: Periodically takes snapshots of the NameNode metadata (not a backup NameNode).

2. MapReduce

MapReduce is the data processing layer of Hadoop, designed for parallel processing of large
datasets across multiple nodes.

How MapReduce Works

MapReduce follows a divide-and-conquer approach, consisting of two main stages:

1.​ Map Phase:


○​ Splits data into key-value pairs.
○​ Processes data in parallel across different nodes.
2.​ Reduce Phase:
○​ Aggregates the intermediate results from the Map phase.
○​ Produces the final output.
Example of MapReduce

Imagine you have a large text file, and you want to count how many times each word appears.

Map Function: Takes input as lines of text and breaks them into words. Example:

("Hadoop", 1), ("is", 1), ("great", 1), ("Hadoop", 1)

Shuffle & Sort: Groups similar words together.

("Hadoop", [1, 1]), ("is", [1]), ("great", [1])

Reduce Function: Aggregates the values.

("Hadoop", 2), ("is", 1), ("great", 1)

This process allows Hadoop to process terabytes of data in parallel.

3. Yet Another Resource Negotiator (YARN)

YARN is the resource management layer of Hadoop. It allocates system resources (CPU,
memory) to different applications running on the cluster.

Key Components of YARN

●​ Resource Manager (RM): Master component that manages resources across the cluster.
●​ Node Manager (NM): Runs on each node to monitor resources and report to the RM.
●​ Application Master (AM): Manages the lifecycle of individual applications (jobs).
●​ Container: The basic unit of resource allocation in YARN (CPU, memory).

Advantages of YARN

●​ Supports multiple frameworks beyond MapReduce (e.g., Spark, Tez).


●​ Efficient resource utilization across Hadoop clusters.
●​ Improves job scheduling and workload management.

4. Hadoop Common

Hadoop Common is a set of shared utilities, libraries, and configuration files that support all
other Hadoop components.

Main Functions

●​ Provides Java libraries and APIs required for Hadoop components.


●​ Includes File System APIs to interact with HDFS.
●​ Manages configuration settings for Hadoop clusters.

4. Data Format in Hadoop


●​ Text Files (CSV, JSON, XML)
●​ Sequence Files
●​ Avro Files
●​ Parquet Files
●​ ORC (Optimized Row Columnar) Files

Hadoop supports various data formats to efficiently store, process, and analyze large datasets.
The choice of format depends on storage efficiency, processing speed, and compatibility with
Hadoop tools.
1. Types of Data Formats in Hadoop

Data in Hadoop can be stored in structured, semi-structured, or unstructured formats. The main
types include:

A. Text-Based Formats

1.​ Plain Text (CSV, TSV, JSON)


2.​ SequenceFile
3.​ JSON
4.​ XML

B. Binary Formats

1.​ Avro
2.​ Parquet
3.​ ORC (Optimized Row Columnar)
4.​ Protocol Buffers (Protobuf)

Each format has different advantages based on compression, schema evolution, and efficiency.

2. Common Hadoop Data Formats

A. Text-Based Formats

1. Plain Text (CSV, TSV)

●​ Description: Stores data in simple text format, with rows separated by new lines and
columns separated by delimiters (e.g., , or \t).
●​ Advantages:
○​ Human-readable.
○​ Easy to process with Hadoop tools like MapReduce, Hive, and Pig.
●​ Disadvantages:
○​ Large storage size (no compression).
○​ No schema enforcement.

Example: CSV Format

ID, Name, Age
1, Alice, 25
2, Bob, 30

2. SequenceFile

●​ Description: A binary format that stores key-value pairs in Hadoop.


●​ Advantages:
○​ Faster than plain text.
○​ Supports compression (gzip, bzip2, LZO).
●​ Disadvantages:
○​ Not human-readable.

Example:

Key: UserID, Value: JSON Data
(1, {"Name": "Alice", "Age": 25})
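
A minimal sketch of producing such a file with the Hadoop SequenceFile Java API follows (the output path and the two records are illustrative):

java

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SequenceFileWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path path = new Path("/data/users.seq");  // hypothetical output path

        // Writer options declare the key and value classes stored in the file header.
        SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(path),
                SequenceFile.Writer.keyClass(IntWritable.class),
                SequenceFile.Writer.valueClass(Text.class));

        writer.append(new IntWritable(1), new Text("{\"Name\": \"Alice\", \"Age\": 25}"));
        writer.append(new IntWritable(2), new Text("{\"Name\": \"Bob\", \"Age\": 30}"));

        writer.close();
    }
}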

3. JSON

●​ Description: Stores data in key-value pairs in a human-readable format.


●​ Advantages:
○​ Semi-structured format.
○​ Easily parsed by Hive, Spark, Pig.
●​ Disadvantages:
○​ Large file size.
○​ Slower processing than binary formats.

Example:

json

{
  "ID": 1,
  "Name": "Alice",
  "Age": 25
}

4. XML

●​ Description: Stores hierarchical data using tags.


●​ Advantages:
○​ Widely used in web services and configurations.
●​ Disadvantages:
○​ Slow processing speed.
○​ Larger size compared to JSON.

Example:

xml

<user>
  <ID>1</ID>
  <Name>Alice</Name>
  <Age>25</Age>
</user>

B. Binary Formats (Optimized for Hadoop)

1. Avro
●​ Description: A row-based binary format designed for fast data serialization.
●​ Advantages:
○​ Supports schema evolution (can add or remove fields without breaking data).
○​ Highly compressed and efficient for MapReduce.
●​ Disadvantages:
○​ Not human-readable.

Example Avro Schema:

json

{
  "type": "record",
  "name": "User",
  "fields": [
    {"name": "ID", "type": "int"},
    {"name": "Name", "type": "string"},
    {"name": "Age", "type": "int"}
  ]
}
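
A minimal sketch of writing one record with this schema using the Avro Java library (org.apache.avro); the output file name users.avro is illustrative:

java

import java.io.File;

import org.apache.avro.Schema;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.DatumWriter;

public class AvroWriteExample {
    public static void main(String[] args) throws Exception {
        // Parse the schema shown above (held inline here for brevity).
        String schemaJson = "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
                + "{\"name\":\"ID\",\"type\":\"int\"},"
                + "{\"name\":\"Name\",\"type\":\"string\"},"
                + "{\"name\":\"Age\",\"type\":\"int\"}]}";
        Schema schema = new Schema.Parser().parse(schemaJson);

        // Build a record that conforms to the schema.
        GenericRecord user = new GenericData.Record(schema);
        user.put("ID", 1);
        user.put("Name", "Alice");
        user.put("Age", 25);

        // Serialize it to an Avro container file (binary, with the schema embedded in the header).
        DatumWriter<GenericRecord> datumWriter = new GenericDatumWriter<>(schema);
        DataFileWriter<GenericRecord> fileWriter = new DataFileWriter<>(datumWriter);
        fileWriter.create(schema, new File("users.avro"));  // hypothetical local file
        fileWriter.append(user);
        fileWriter.close();
    }
}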

2. Parquet

●​ Description: A columnar storage format used in Hadoop.


●​ Advantages:
○​ Efficient for querying large datasets (used by Hive, Spark).
○​ Compression-friendly (stores similar values together).
○​ Faster read performance for analytics.
●​ Disadvantages:
○​ Slow write speed compared to row-based formats.
3. ORC (Optimized Row Columnar)

●​ Description: A highly optimized columnar format designed for Hive.


●​ Advantages:
○​ Better compression than Parquet.
○​ Faster read/write speeds in Hive.
●​ Disadvantages:
○​ Works best with Hive (not as flexible as Parquet).

4. Protocol Buffers (Protobuf)

●​ Description: A compact, schema-based binary format by Google.


●​ Advantages:
○​ Smaller file size.
○​ Faster serialization and deserialization.
●​ Disadvantages:
○​ Requires schema definition.

Text-based formats (CSV, JSON, XML) are easy to use but inefficient for big data.

Binary formats (Avro, Parquet, ORC, Protobuf) provide better compression, performance, and
flexibility.

Avro is best for schema evolution.

Parquet & ORC are best for analytics.

SequenceFile is good for Hadoop MapReduce applications.

5. Analyzing Data with Hadoop


●​ Use MapReduce to process large datasets in parallel.
●​ Use Hive for SQL-like queries.
●​ Use Pig for scripting-based data transformations.

1. Steps in Analyzing Data with Hadoop

Step 1: Data Collection

●​ Data is ingested from various sources, including:


○​ Relational databases (MySQL, PostgreSQL) using Sqoop.
○​ Logs, sensor data, social media using Flume/Kafka.
○​ Cloud storage (AWS S3, Azure Blob) or file uploads.

Step 2: Data Storage in HDFS

●​ Data is stored in Hadoop Distributed File System (HDFS) in different formats like:
○​ Text-based formats (CSV, JSON, XML).
○​ Optimized formats (Parquet, ORC, Avro).
○​ Key-value format (SequenceFile).

Step 3: Data Processing

●​ Batch Processing: Uses MapReduce for parallel processing.


●​ SQL Querying: Uses Apache Hive or Presto for structured queries.
●​ Real-time Processing: Uses Apache Spark, Flink, or Kafka Streams.
●​ Machine Learning: Uses Apache Mahout or MLlib in Spark.

Step 4: Data Analysis & Visualization

●​ Analyzed data can be exported to BI tools like Tableau, Power BI.


●​ Hadoop-based querying tools like Impala and Hue can be used for interactive analytics.

6. Scaling Out Hadoop


●​ Horizontal Scaling: Adding more nodes to the cluster.
●​ Replication Factor: Ensuring data redundancy.
●​ Load Balancing: Distributing tasks across nodes efficiently.

Scaling out Hadoop refers to increasing the number of nodes in a Hadoop cluster to handle larger
workloads, higher storage demands, and improved processing speed. Unlike scaling up (adding
more CPU, RAM, or storage to an existing machine), scaling out distributes the load across
multiple machines.

1. Why Scale Out Hadoop?

Hadoop is designed to handle big data, which continuously grows in volume, velocity, and
variety. Scaling out helps in:

1.​ Handling larger datasets: More nodes mean more storage in HDFS.
2.​ Improving processing speed: Jobs run in parallel across more nodes.
3.​ Enhancing fault tolerance: More nodes reduce data loss risks.
4.​ Supporting higher workloads: More computing resources mean better performance.

2. Types of Hadoop Scaling


There are two ways to scale a Hadoop cluster:

A. Vertical Scaling (Scaling Up)

●​ Increasing resources (CPU, RAM, disk space) of existing nodes.


●​ Advantages:
○​ Simple to implement.
○​ No need to add more nodes.
●​ Disadvantages:
○​ Expensive (high-end hardware).
○​ Limited scalability.

B. Horizontal Scaling (Scaling Out)

●​ Adding more machines (nodes) to the cluster.


●​ Advantages:
○​ More cost-effective (uses commodity hardware).
○​ Scales to very large clusters by adding commodity nodes.
●​ Disadvantages:
○​ Requires cluster rebalancing and configuration.

7. Hadoop Streaming and Pipes


●​ Hadoop Streaming: Allows writing MapReduce programs in any language (Python,
Ruby, etc.).
●​ Hadoop Pipes: C++ API for Hadoop, enabling integration with non-Java applications.
Hadoop Streaming vs. Hadoop Pipes

●​ Programming Language: Streaming works with any language that can read stdin and write stdout (Python, Shell, etc.); Pipes uses C++.
●​ Performance: Streaming is moderate (overhead from inter-process communication); Pipes is high (native C++ execution).
●​ Ease of Development: Streaming is easy; Pipes is more complex.
●​ Use Case: Streaming suits quick prototyping and small jobs; Pipes suits high-performance computing and large-scale jobs.
●​ Setup Complexity: Streaming is simple; Pipes requires more configuration.

●​ Hadoop Streaming is best for running MapReduce jobs in non-Java languages like
Python or Shell scripts.
●​ Hadoop Pipes is useful for high-performance applications requiring native C++
execution.
●​ Choose Streaming for ease of use and Pipes for better performance.
8. Hadoop Ecosystem
●​ Apache Hive – Data warehousing and SQL queries.
●​ Apache Pig – Scripting-based data analysis.
●​ Apache HBase – NoSQL database for real-time data.
●​ Apache Sqoop – Data transfer between Hadoop and RDBMS.
●​ Apache Flume – Collecting and aggregating log data.
●​ Apache Oozie – Workflow scheduler for managing Hadoop jobs.

The Hadoop ecosystem consists of a set of tools and frameworks that extend the functionality
of Hadoop for data storage, processing, analysis, and management. These tools work
together to handle Big Data efficiently.
1. Core Components of Hadoop

The Hadoop ecosystem is built around four core components:

1.​ Hadoop Distributed File System (HDFS) – Distributed storage system.


2.​ Yet Another Resource Negotiator (YARN) – Resource management.
3.​ MapReduce – Batch processing framework.
4.​ Hadoop Common – Shared utilities and libraries.

These components enable scalable, fault-tolerant, and high-speed processing of massive datasets.

2. Key Components of the Hadoop Ecosystem

The Hadoop ecosystem includes various tools categorized into different functions:

A. Data Storage & Management

●​ HDFS: Stores large-scale structured and unstructured data.
●​ HBase: NoSQL database for fast real-time access to Hadoop data.
●​ HCatalog: Table management layer for Hadoop (integrates with Hive and Pig).
●​ Ozone: Distributed object store for Hadoop, an alternative to HDFS.

B. Data Processing & Computation

●​ MapReduce: Batch processing framework for distributed data.
●​ Apache Spark: In-memory computing framework for real-time & batch processing.
●​ Apache Tez: Optimized DAG-based processing engine (faster than MapReduce).
●​ Apache Flink: Stream processing framework for real-time analytics.

C. Data Querying & Analysis

●​ Hive: SQL-like query engine for structured data in Hadoop.
●​ Pig: High-level scripting language for data transformation (Pig Latin).
●​ Impala: Low-latency SQL queries on Hadoop (faster than Hive).
●​ Drill: Schema-free SQL engine for interactive queries.
●​ Presto: SQL query engine for interactive data analysis.

D. Data Ingestion & Integration

●​ Sqoop: Transfers data between Hadoop and relational databases (MySQL, PostgreSQL).
●​ Flume: Collects and ingests real-time streaming data (e.g., logs, IoT data).
●​ Kafka: Distributed event streaming platform for real-time data ingestion.
●​ NiFi: Automates data flow between systems in Hadoop.

E. Machine Learning & AI

●​ Apache Mahout: Machine learning library for Hadoop.
●​ MLlib (Spark ML): Scalable machine learning library in Spark.
●​ H2O.ai: AI and deep learning framework for Hadoop.

F. Workflow & Scheduling

●​ Oozie: Workflow scheduler for managing Hadoop jobs.
●​ Azkaban: Job scheduler for Hadoop workflows.
●​ Airflow: Workflow automation and orchestration tool.

G. Security & Governance

●​ Ranger: Provides security and access control for Hadoop.
●​ Knox: Secure gateway for accessing Hadoop services.
●​ Atlas: Data governance and metadata management.

H. Data Visualization & BI

●​ Zeppelin: Interactive notebooks for data visualization.
●​ Hue: Web-based interface for managing Hadoop queries.
●​ Tableau/Power BI: Business Intelligence (BI) tools for visual analytics.


9. Introduction to MapReduce
●​ MapReduce is the processing framework of Hadoop, allowing large-scale parallel
computations.
●​ Based on the principles of Map (transforming data) and Reduce (aggregating results).


MapReduce is a distributed data processing framework in Hadoop designed to process large datasets in parallel across multiple nodes. It follows a divide-and-conquer approach, splitting tasks into smaller sub-tasks and distributing them across a cluster.

Key Features of MapReduce:

●​ Scalability – Processes petabytes of data efficiently.


●​ Fault tolerance – Automatically handles failures.
●​ Parallelism – Tasks execute concurrently across multiple nodes.
●​ Data locality optimization – Moves computation closer to data to reduce network
overhead.

2. How MapReduce Works

MapReduce consists of two main phases:

1.​ Map Phase – Processes input data and converts it into intermediate key-value pairs.
2.​ Reduce Phase – Aggregates and processes the intermediate results.

Workflow:

1.​ Input data is split into chunks (HDFS blocks).


2.​ The Mapper processes each chunk and produces key-value pairs.
3.​ The Shuffling & Sorting phase groups values with the same key.
4.​ The Reducer aggregates values to produce the final output.
5.​ The output is stored in HDFS.

10. How MapReduce Works


1.​ Splitting: Input data is split into chunks.
2.​ Mapping: Mapper processes each chunk in parallel.
3.​ Shuffling: Data is sorted and grouped.
4.​ Reducing: Reducer aggregates the grouped data.
5.​ Output: The final result is written to HDFS.

MapReduce is a distributed computing model used in Hadoop to process large-scale data in a parallel, fault-tolerant, and scalable manner. It follows a divide-and-conquer approach, breaking down a large job into smaller tasks that execute in parallel across multiple nodes in a Hadoop cluster.

1. Components of MapReduce

MapReduce consists of three main components:

1.​ Mapper – Processes input data and converts it into key-value pairs.
2.​ Reducer – Aggregates and processes the intermediate key-value pairs generated by the
Mapper.
3.​ Driver Program – Manages the execution of the MapReduce job.

2. The MapReduce Workflow

Step 1: Input Splitting (HDFS)

●​ The input dataset is stored in HDFS and divided into splits (chunks of data).
●​ Each split is processed by a separate Mapper task.

Step 2: Mapping (Map Phase)

●​ The Mapper function reads each split and processes it line by line.
●​ It converts the data into key-value pairs.
Step 3: Shuffling & Sorting

●​ The output from multiple Mappers is grouped by keys.


●​ Similar keys are shuffled to the same Reducer node.
●​ Hadoop automatically sorts the intermediate keys before reducing.

Step 4: Reducing (Reduce Phase)

●​ The Reducer function aggregates, filters, or processes key-value pairs.


●​ It produces the final output, stored back in HDFS.

Step 5: Output Storage

●​ The final processed data is stored in HDFS for further use.


●​ It can be queried using Hive, Pig, or Spark.

11. Developing a MapReduce Application


●​ Define Mapper
●​ Define Reducer
●​ Configure Job
●​ Run and Monitor Execution

Developing a MapReduce application in Hadoop involves writing Mapper and Reducer functions, configuring the MapReduce job, and running it on a Hadoop cluster. Below is a detailed step-by-step guide.
1. Define Mapper
The Mapper processes input data and converts it into intermediate key-value pairs.

Example: Word Count Mapper (Java)

import java.io.IOException;

import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;

import org.apache.hadoop.io.Text;

import org.apache.hadoop.mapreduce.Mapper;

public class WordCountMapper extends Mapper<Object, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {

        StringTokenizer itr = new StringTokenizer(value.toString());

        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, one); // Output: (word, 1)
        }
    }
}

How It Works

●​ Reads each line from the input file.


●​ Splits the line into words.
●​ Emits (word, 1) pairs for each word.

2. Define Reducer
The Reducer aggregates and processes intermediate key-value pairs produced by the Mapper.

Example: Word Count Reducer (Java)

java

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;

import org.apache.hadoop.io.Text;

import org.apache.hadoop.mapreduce.Reducer;

public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {

        int sum = 0;

        for (IntWritable val : values) {
            sum += val.get(); // Summing all occurrences of the word
        }

        context.write(key, new IntWritable(sum)); // Output: (word, count)
    }
}

How It Works

●​ Receives a list of counts for each unique word.


●​ Sums up the counts.
●​ Emits the final (word, count) pair.

3. Configure and Set Up the Job


The Driver Program (Main Class) manages the execution of the MapReduce job.

Example: Configuring the Job in Java

java

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;

import org.apache.hadoop.io.IntWritable;

import org.apache.hadoop.io.Text;

import org.apache.hadoop.mapreduce.Job;

import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    public static void main(String[] args) throws Exception {

        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");

        job.setJarByClass(WordCount.class);
        job.setMapperClass(WordCountMapper.class);
        job.setCombinerClass(WordCountReducer.class); // Optional combiner to optimize performance
        job.setReducerClass(WordCountReducer.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));   // Input path in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // Output path in HDFS

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

How It Works

●​ Configures input/output paths.


●​ Sets Mapper and Reducer classes.
●​ Defines output key/value types.
●​ Calls job.waitForCompletion(true) to execute the job.

4. Run and Monitor Execution

Step 1: Compile the Java Program

shell
javac -classpath `hadoop classpath` -d . WordCountMapper.java
WordCountReducer.java WordCount.java

jar -cvf wordcount.jar *.class

Step 2: Upload Input File to HDFS

shell

hdfs dfs -mkdir -p /user/hadoop/input

hdfs dfs -put localfile.txt /user/hadoop/input/

Step 3: Run the Hadoop Job

shell

hadoop jar wordcount.jar WordCount /user/hadoop/input /user/hadoop/output

Step 4: Monitor Job Execution

Check the job progress on the Hadoop ResourceManager UI (default: http://localhost:8088), or use:

shell

hadoop job -status <job_id>


Step 5: View Output in HDFS

shell

hdfs dfs -ls /user/hadoop/output

hdfs dfs -cat /user/hadoop/output/part-r-00000

●​ Mapper: Converts input into key-value pairs.


●​ Reducer: Aggregates intermediate results.
●​ Driver Program: Configures and runs the job.
●​ Execution: Compiles, runs, and monitors in Hadoop.

12. Testing MapReduce Applications


Testing MapReduce applications is crucial to ensure that they process data correctly and
efficiently. The testing process involves verifying the mapper, reducer, and combiner
functions separately and together. Below are the key aspects and methods for testing
MapReduce applications:

1. Types of Testing in MapReduce


Unit Testing

●​ Test individual Mapper and Reducer classes.


●​ Use mock input data and check whether the expected output is generated.
●​ Frameworks: JUnit, MRUnit

Integration Testing

●​ Test the full MapReduce job with input and output formats.
●​ Ensure that data passes correctly from Mapper to Reducer.
●​ Frameworks: Local Hadoop Cluster, MiniMRCluster

Performance Testing

●​ Check how the application handles large datasets.


●​ Test with different cluster configurations and varying data sizes.
●​ Tools: Apache JMeter, YARN Resource Manager

Functional Testing

●​ Validate the business logic of MapReduce jobs.


●​ Test with real-world data and check if expected outputs are generated.

Regression Testing

●​ Ensure that new changes do not break existing functionality.


●​ Compare results before and after changes.

2. Tools for Testing MapReduce Applications


MRUnit

●​ A lightweight framework for unit testing MapReduce applications.


●​ Allows isolated testing of Mapper, Reducer, and Combiner.
●​ Does not require a Hadoop cluster.

Example Test Case using MRUnit (Java)

java

@Test
public void testMapper() throws IOException {
    Mapper<LongWritable, Text, Text, IntWritable> mapper = new MyMapper();
    MapDriver<LongWritable, Text, Text, IntWritable> mapDriver = MapDriver.newMapDriver(mapper);

    mapDriver.withInput(new LongWritable(1), new Text("hadoop hadoop spark"));
    mapDriver.withOutput(new Text("hadoop"), new IntWritable(1));
    mapDriver.withOutput(new Text("hadoop"), new IntWritable(1));
    mapDriver.withOutput(new Text("spark"), new IntWritable(1));

    mapDriver.runTest();
}
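
A matching Reducer test follows the same pattern with MRUnit's ReduceDriver. This sketch assumes the WordCountReducer from earlier, plus imports of org.apache.hadoop.mrunit.mapreduce.ReduceDriver, org.apache.hadoop.mapreduce.Reducer, and java.util.Arrays:

java

@Test
public void testReducer() throws IOException {
    Reducer<Text, IntWritable, Text, IntWritable> reducer = new WordCountReducer();
    ReduceDriver<Text, IntWritable, Text, IntWritable> reduceDriver =
            ReduceDriver.newReduceDriver(reducer);

    // After shuffle & sort, both occurrences of "hadoop" arrive grouped under one key.
    reduceDriver.withInput(new Text("hadoop"), Arrays.asList(new IntWritable(1), new IntWritable(1)));
    reduceDriver.withOutput(new Text("hadoop"), new IntWritable(2));

    reduceDriver.runTest();
}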

Local Mode Testing

●​ Run MapReduce jobs on a single machine without setting up a full Hadoop cluster.
●​ Faster than a real cluster.
●​ Configure in core-site.xml:

xml

<property>
  <name>mapreduce.framework.name</name>
  <value>local</value>
</property>

MiniMRCluster

●​ A lightweight cluster for integration testing.


●​ Simulates a real Hadoop environment.

Example Code:
java

MiniMRCluster miniMRCluster = new MiniMRCluster(2, "file:///", 1);
JobConf jobConf = new JobConf(miniMRCluster.createJobConf());

Apache JMeter for Performance Testing

●​ Stress test MapReduce jobs by simulating multiple input loads.

3. Common Issues and Debugging in MapReduce Testing

●​ Incorrect Output: caused by bugs in Mapper/Reducer logic; validate with MRUnit and check the logs.
●​ Job Stalls: caused by data skew or insufficient reducers; use combiners and increase the reducer count.
●​ Out of Memory: caused by a large dataset per node; tune heap size and use compression.
●​ Slow Processing: caused by poor partitioning or high I/O; optimize data partitioning and use counters.
4. Best Practices for Testing MapReduce Applications
●​ Test Mapper and Reducer independently before full job testing.
●​ Use small datasets first to validate logic.
●​ Validate data integrity at each stage (Mapper → Combiner → Reducer).
●​ Enable debug logging for troubleshooting.
●​ Use MiniMRCluster before deploying to production.
●​ Automate testing using JUnit & MRUnit.

13. Anatomy of a MapReduce Job Run


A MapReduce job run consists of multiple phases, starting from job submission to output
writing. The execution flow involves splitting input data, mapping, shuffling, reducing, and
writing output.

1. Key Components of a MapReduce Job


A. Job Client (User Program)

●​ Submits the job to the JobTracker (YARN ResourceManager in Hadoop 2+).


●​ Configures input format, mapper, combiner, partitioner, reducer, and output format.

B. JobTracker (ResourceManager in Hadoop 2)

●​ Accepts jobs from clients.


●​ Divides them into tasks and assigns them to available TaskTrackers (NodeManagers in
Hadoop 2).

C. TaskTracker (NodeManager in Hadoop 2)

●​ Executes individual Map and Reduce tasks.


●​ Reports progress to the JobTracker.
2. MapReduce Execution Phases
The execution of a MapReduce job follows these key steps:

Step 1: Input Splitting

●​ HDFS Input files are split into logical input splits.


●​ Each split is processed by a separate Mapper.
●​ Example: If an input file has 300 MB and the block size is 128 MB, it will be divided into
3 splits.

Step 2: Map Phase

●​ The Mapper processes each split and emits intermediate key-value pairs.
●​ The output of the mapper is written to local disk before being shuffled.
●​ Example:
    Input: "hadoop mapreduce hadoop"
    Output:
    ("hadoop", 1)
    ("mapreduce", 1)
    ("hadoop", 1)

Step 3: Shuffle and Sort Phase (Partitioning)

●​ The intermediate key-value pairs from the Mappers are sorted and grouped before being
sent to the Reducer.
●​ Partitioning decides which Reducer gets which keys.
●​ Example:
    Mapper 1 Output: ("hadoop", 1) ("mapreduce", 1)
    Mapper 2 Output: ("hadoop", 1) ("bigdata", 1)
    Shuffled & Sorted:
    ("bigdata", [1])
    ("hadoop", [1,1])
    ("mapreduce", [1])

Step 4: Reduce Phase

●​ Reducers take the sorted key-value list and process it.


●​ Output is written to HDFS.

Example Reduce Output:

("bigdata", 1)
("hadoop", 2)
("mapreduce", 1)

Step 5: Output Writing

●​ The final output is written to HDFS in the format specified by OutputFormat.

3. MapReduce Job Flow Diagram


+-----------------------+
|    Job Submission     |
+-----------------------+
           |
+-----------------------+
|    Input Splitting    |
+-----------------------+
           |
+-----------------------+
|       Map Phase       |
+-----------------------+
           |
+-----------------------+
|    Shuffle & Sort     |
+-----------------------+
           |
+-----------------------+
|     Reduce Phase      |
+-----------------------+
           |
+-----------------------+
|    Output to HDFS     |
+-----------------------+

4. Components Involved in Execution


A. InputFormat

●​ Defines how the input data is split.


●​ Examples:
○​ TextInputFormat: Splits text files by line.
○​ KeyValueTextInputFormat: Reads key-value pairs from input files.

B. Mapper

●​ Processes each split and outputs key-value pairs.

C. Combiner (Optional)

●​ Acts as a mini-reducer to reduce the size of intermediate data.


●​ Example: Summing word counts locally before the Reduce phase.
D. Partitioner

●​ Decides which Reducer gets which key.
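
As a sketch of how this hook is used (the class name WordPartitioner is chosen here for illustration), a custom partitioner extends org.apache.hadoop.mapreduce.Partitioner and is registered on the job; the built-in HashPartitioner applies essentially the same hash-modulo rule by default.

java

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class WordPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        // Mask the sign bit so the result is non-negative, then take modulo the reducer count.
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}

// In the driver: job.setPartitionerClass(WordPartitioner.class);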

E. Shuffle and Sort

●​ Groups all values with the same key together.

F. Reducer

●​ Processes grouped key-value pairs and outputs results.

G. OutputFormat

●​ Defines how the final output is written.


●​ Example:
○​ TextOutputFormat: Outputs text files.
○​ SequenceFileOutputFormat: Writes binary files.

5. Example: Word Count MapReduce Job


A. Mapper Code

java

public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokenizer = new StringTokenizer(value.toString());
        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            context.write(word, one);
        }
    }
}

B. Reducer Code

java

public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        context.write(key, new IntWritable(sum));
    }
}

C. Driver Code

java

public class WordCountDriver {

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");

        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);
        job.setCombinerClass(WordCountReducer.class);
        job.setReducerClass(WordCountReducer.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

6. Optimization Tips for MapReduce Jobs


●​ Use Combiner: Reduces data size before Reduce phase.
●​ Use Counters: Debug and analyze job performance.
●​ Use Compression: Saves storage and speeds up job execution.
●​ Tune Reducer Count: Optimize reducer tasks to balance load.
●​ Avoid Skewed Data: Ensure even data distribution among reducers.
●​ Use Proper InputFormat: Select the best format based on data structure.
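
For the "Use Counters" tip, a counter can be incremented from inside the Mapper or Reducer and is aggregated by the framework; totals appear in the job's console summary after the run. The group name "DataQuality" and counter name "EMPTY_LINES" below are arbitrary labels chosen for illustration.

java

// Inside map() (or reduce()): skip empty lines and count them with a user-defined counter.
if (value.toString().trim().isEmpty()) {
    context.getCounter("DataQuality", "EMPTY_LINES").increment(1);
    return; // skip this record
}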

Understanding the anatomy of a MapReduce job run is crucial for developing optimized and
efficient data-processing applications. The execution follows a structured process of splitting,
mapping, shuffling, reducing, and output writing. Proper tuning and testing ensure better
performance in large-scale distributed systems.

14. Handling Failures in MapReduce


●​ Task Reattempts: If a task fails, Hadoop retries execution.
●​ Speculative Execution: Identifies slow tasks and re-executes them.
●​ Node Failure: Jobs are rescheduled on available nodes.

Handling Failures in MapReduce:

1.​ Task Reattempts: If a task (Map or Reduce) fails, Hadoop automatically retries the
execution on a different node. The number of retries is configurable.​

2.​ Speculative Execution: Hadoop identifies tasks that are running slower than expected
and executes duplicate copies on other nodes. The first completed result is used, while
other executions are discarded, preventing bottlenecks caused by slow nodes.​

3.​ Node Failure: If a node crashes or becomes unresponsive, Hadoop detects the failure
through the heartbeat mechanism. The JobTracker (in Hadoop 1) or ResourceManager (in
Hadoop 2/YARN) reschedules the tasks on other available nodes to ensure job
completion.
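
These behaviours can be tuned per job through standard Hadoop 2/YARN properties. The fragment below is a sketch meant to sit in a driver's main() (assuming the usual Configuration and Job imports; the job name is illustrative, and the values shown mirror the usual defaults).

java

Configuration conf = new Configuration();

// Maximum attempts per task before the job is marked as failed (retries land on other nodes).
conf.setInt("mapreduce.map.maxattempts", 4);
conf.setInt("mapreduce.reduce.maxattempts", 4);

// Speculative execution: run backup copies of slow tasks and keep the first result to finish.
conf.setBoolean("mapreduce.map.speculative", true);
conf.setBoolean("mapreduce.reduce.speculative", true);

Job job = Job.getInstance(conf, "fault-tolerant word count");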

15. Job Scheduling in Hadoop


●​ FIFO (First In, First Out) – Default scheduler.
●​ Fair Scheduler – Equal resource sharing.
●​ Capacity Scheduler – Multi-tenant resource allocation.

Hadoop job scheduling determines the order and priority in which MapReduce jobs execute on a
Hadoop cluster. Efficient job scheduling optimizes resource utilization, job execution time, and
cluster performance.

1. Why is Job Scheduling Important?


●​ Efficient resource allocation: Ensures fair distribution of CPU and memory.
●​ Improved performance: Prevents bottlenecks and ensures smooth execution.
●​ Priority management: Allows high-priority jobs to execute first.
●​ Multi-user environment: Supports multiple users running jobs simultaneously.

2. Hadoop Job Schedulers

Hadoop supports multiple job schedulers to manage MapReduce jobs. The main schedulers are:

●​ FIFO (First In, First Out) Scheduler: Runs jobs in the order they arrive (default). Best for simple batch processing in a single-user environment.
●​ Fair Scheduler: Allocates resources fairly among users and jobs. Best for multi-user clusters where fairness is important.
●​ Capacity Scheduler: Divides cluster resources into queues for different users/groups. Best for large organizations with multi-tenant clusters.
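
With the Capacity Scheduler (or a queue-based Fair Scheduler setup), a job can be directed to a specific queue at submission time. A minimal driver fragment, assuming a queue named "analytics" has already been defined in the scheduler configuration:

java

Configuration conf = new Configuration();

// Submit the job to a named queue; "analytics" is an illustrative queue name.
conf.set("mapreduce.job.queuename", "analytics");

Job job = Job.getInstance(conf, "queued job");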

16. Shuffle and Sort in MapReduce


●​ Shuffle: Data is transferred from Mapper to Reducer.
●​ Sort: Data is sorted before being passed to the Reducer.

17. Task Execution in MapReduce


●​ Mapper Execution: Processes input splits and emits key-value pairs.
●​ Reducer Execution: Aggregates results based on key-value pairs.

Task Execution in MapReduce


1.​ Mapper Execution:​

○​ Each Mapper processes an input split and converts raw data into intermediate
key-value pairs.
○​ The map() function is applied to each input record.
○​ The output is then sorted and shuffled before being sent to the Reducer.
2.​ Reducer Execution:​

○​ The Reducer receives grouped key-value pairs from multiple Mappers.


○​ It processes each key and aggregates or performs computations based on its
values.
○​ The final output is written to the Hadoop Distributed File System (HDFS).

18. MapReduce Types


●​ Identity Mapper & Reducer
●​ Chain Mapper & Reducer
●​ Combiner Function (Mini-Reducer at Mapper stage)

MapReduce Types

1.​ Identity Mapper & Identity Reducer:​

○​ The Identity Mapper simply passes input key-value pairs to the output without
modification.
○​ The Identity Reducer does the same, outputting the data as received from the
shuffle phase.
○​ Used when no transformation is required but sorting/shuffling is needed.
2.​ Chain Mapper & Chain Reducer:​

○​ Allows multiple Mapper or Reducer steps to be executed sequentially.


○​ A Chain Mapper applies multiple map() functions one after another before
sending data to the shuffle phase.
○​ A Chain Reducer allows multiple reduce() functions to process data
step-by-step before final output.
○​ Useful for complex processing pipelines (a configuration sketch follows this list).
3.​ Combiner Function (Mini-Reducer at Mapper Stage):​
○​ Works as a local mini-reducer that aggregates data before sending it to the
Reducer.
○​ Reduces the amount of data transferred over the network, improving efficiency.
○​ Commonly used for counting and summation tasks.
○​ Example: Summing word counts at the Mapper stage before sending partial
results to the Reducer.
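
A configuration sketch for the Chain Mapper/Reducer case above, using the org.apache.hadoop.mapreduce.lib.chain classes. The TokenizerMapper, LowerCaseMapper, and SumReducer classes are placeholders for illustration and are not defined in these notes:

java

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.chain.ChainMapper;
import org.apache.hadoop.mapreduce.lib.chain.ChainReducer;

public class ChainedJobDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "chained word count");
        job.setJarByClass(ChainedJobDriver.class);

        // First map step: split lines into (word, 1) pairs.
        ChainMapper.addMapper(job, TokenizerMapper.class,
                LongWritable.class, Text.class, Text.class, IntWritable.class, new Configuration(false));

        // Second map step: normalise each word before it reaches the shuffle phase.
        ChainMapper.addMapper(job, LowerCaseMapper.class,
                Text.class, IntWritable.class, Text.class, IntWritable.class, new Configuration(false));

        // Reduce step: sum the counts per word.
        ChainReducer.setReducer(job, SumReducer.class,
                Text.class, IntWritable.class, Text.class, IntWritable.class, new Configuration(false));

        // Input/output paths and format classes are set as in the earlier WordCount driver.
    }
}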

19. Input and Output Formats in MapReduce


Input Formats:

●​ TextInputFormat – Default format (line-based text files).


●​ SequenceFileInputFormat – Binary file format.
●​ KeyValueTextInputFormat – Key-value pairs.

Output Formats:

●​ TextOutputFormat – Default output format.


●​ SequenceFileOutputFormat – Binary output files.
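
The input and output formats are selected on the Job object in the driver. A brief sketch (paths are placeholders; the format classes come from org.apache.hadoop.mapreduce.lib.input and org.apache.hadoop.mapreduce.lib.output):

java

// In the driver, after Job job = Job.getInstance(conf, "..."):
job.setInputFormatClass(KeyValueTextInputFormat.class);     // read tab-separated key/value lines
job.setOutputFormatClass(SequenceFileOutputFormat.class);   // write binary SequenceFile output

FileInputFormat.addInputPath(job, new Path("/user/hadoop/input"));        // illustrative paths
FileOutputFormat.setOutputPath(job, new Path("/user/hadoop/output-seq"));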

20. Features of MapReduce


●​ Parallel Processing
●​ Fault Tolerance
●​ Data Locality
●​ Automatic Load Balancing

21. Real-World Applications of MapReduce


●​ Log Processing: Web logs analysis (e.g., Facebook, LinkedIn).
●​ Search Indexing: Google’s search engine.
●​ Recommendation Systems: Amazon, Netflix.
●​ Sentiment Analysis: Analyzing social media trends.
●​ Bioinformatics: Genome sequencing.
