Big Data-UNIT-2
1. Introduction to Hadoop
Hadoop is an open-source framework designed for processing and storing large datasets across
distributed clusters of computers. It was inspired by Google's MapReduce and Google File
System (GFS) to handle big data efficiently.
History of Hadoop
Hadoop was created by Doug Cutting and Mike Cafarella, starting around 2005 as part of the Apache Nutch web-search project. It was named after Cutting's son's toy elephant and became a top-level Apache project in 2008.
Apache Hadoop
The core modules of Apache Hadoop are HDFS (storage), MapReduce (processing), YARN (resource management), and Hadoop Common (shared utilities).
● HDFS is Hadoop's storage system designed for large-scale data storage and
high-throughput access.
● Key features:
○ Distributed Storage: Data is split into blocks and stored across multiple nodes.
○ Fault Tolerance: Data replication ensures no data loss.
○ Scalability: Can handle petabytes of data.
Hadoop Distributed File System (HDFS) is a scalable, fault-tolerant, distributed storage system
designed to store and process large volumes of data across multiple machines in a cluster. HDFS
is a core component of the Apache Hadoop ecosystem and is optimized for handling big data
workloads efficiently.
1. Features of HDFS
HDFS is designed to address the challenges of big data storage and processing by providing:
1. Scalability: It can store petabytes (PB) of data across thousands of machines.
2. Fault Tolerance: Data is replicated across multiple nodes to prevent data loss.
3. High Throughput: Optimized for batch processing, ensuring high-speed data access.
4. Data Locality: Moves computation closer to data, reducing network traffic.
5. Streaming Access: Best suited for sequential read/write operations.
6. Write-Once, Read-Many: Data is written once and read multiple times, ideal for big
data analytics.
2. HDFS Architecture
a) NameNode (Master)
● Role: Manages metadata (file names, locations, replicas).
● Responsibilities:
○ Keeps track of file system structure.
○ Manages access permissions.
○ Ensures fault tolerance by monitoring DataNodes.
○ Does not store actual data, only metadata.
b) DataNode (Slaves)
● Role: Stores the actual data blocks.
● Responsibilities:
○ Serves read/write requests from clients.
○ Sends periodic heartbeats and block reports to the NameNode.
c) Secondary NameNode
● Periodically merges the NameNode's edit log with the file system image (fsimage) to keep metadata compact.
● It is a checkpointing helper, not a hot standby for the NameNode.
a) Blocks in HDFS
● Each file in HDFS is split into fixed-size blocks (default 128 MB).
● Blocks are stored across multiple DataNodes.
● Large block size reduces the overhead of managing many small files.
b) Data Replication
● Each block is replicated on multiple DataNodes (default replication factor: 3).
● If a DataNode fails, the NameNode re-creates the lost replicas from the surviving copies.
c) Rack Awareness
● Replicas are placed on different racks so data survives a full rack failure, while keeping most traffic within a rack.
● A small example of inspecting a file's blocks and replicas follows.
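The sketch below is illustrative only (not part of the original notes); it uses the HDFS Java API to print each block of a file and the DataNodes holding its replicas. The path /data/sample.txt is a hypothetical example.
java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockInfoExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/data/sample.txt");   // hypothetical file

        FileStatus status = fs.getFileStatus(file);
        // Each BlockLocation is one block of the file; getHosts() lists the DataNodes
        // that hold its replicas (replication-factor copies).
        for (BlockLocation block : fs.getFileBlockLocations(status, 0, status.getLen())) {
            System.out.println("Block at offset " + block.getOffset()
                    + ", length " + block.getLength()
                    + ", replicas on " + String.join(", ", block.getHosts()));
        }
        fs.close();
    }
}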
a) File Operations
● Create, write, read, append, and delete files; copy data in and out of HDFS (e.g., hdfs dfs -put, -get, -cat, -rm).
b) Directory Operations
● Create directories, list their contents, and remove them (e.g., hdfs dfs -mkdir, -ls, -rm -r).
A programmatic sketch of these operations follows.
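The following is a minimal illustrative sketch (not part of the original notes) of common file and directory operations through the HDFS Java FileSystem API; the paths are hypothetical.
java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsOperationsExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // reads core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);       // connect to the configured file system

        Path dir = new Path("/user/demo");          // hypothetical directory
        fs.mkdirs(dir);                             // directory operation: create

        Path file = new Path(dir, "hello.txt");     // hypothetical file
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.writeUTF("Hello HDFS");             // file operation: write
        }

        for (FileStatus status : fs.listStatus(dir)) {   // directory operation: list
            System.out.println(status.getPath() + " " + status.getLen() + " bytes");
        }

        fs.delete(file, false);                     // file operation: delete
        fs.delete(dir, true);                       // directory operation: recursive delete
        fs.close();
    }
}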
HDFS is the storage layer of Hadoop, designed to store huge volumes of data across multiple
machines in a cluster. It provides fault tolerance, scalability, and high throughput.
Components of HDFS
● NameNode (Master) – Manages metadata (file locations, permissions, and structure) but does not store actual data.
● DataNode (Slaves) – Stores actual data blocks and reports health status to the NameNode.
2. MapReduce
MapReduce is the data processing layer of Hadoop, designed for parallel processing of large
datasets across multiple nodes.
Imagine you have a large text file, and you want to count how many times each word appears.
Map Function: Takes input as lines of text, breaks them into words, and emits each word with a count of 1. Example:
("Hadoop", 1), ("is", 1), ("great", 1), ("Hadoop", 1)
Reduce Function: Groups the pairs by word and sums the counts. Example:
("Hadoop", 2), ("is", 1), ("great", 1)
3. YARN (Yet Another Resource Negotiator)
YARN is the resource management layer of Hadoop. It allocates system resources (CPU,
memory) to different applications running on the cluster.
● Resource Manager (RM) – Master component that manages resources across the cluster.
● Node Manager (NM) – Runs on each node to monitor resources and report to the RM.
● Application Master (AM) – Manages the lifecycle of individual applications (jobs).
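As an optional illustration (not from the original notes), the sketch below uses the YARN client API to ask the ResourceManager for the applications it is currently tracking; it assumes a cluster reachable through the settings in yarn-site.xml.
java
import java.util.List;
import org.apache.hadoop.yarn.api.records.ApplicationReport;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class ListYarnApps {
    public static void main(String[] args) throws Exception {
        YarnClient yarnClient = YarnClient.createYarnClient();
        yarnClient.init(new YarnConfiguration());   // picks up yarn-site.xml settings
        yarnClient.start();

        // Ask the ResourceManager for the applications it knows about
        List<ApplicationReport> apps = yarnClient.getApplications();
        for (ApplicationReport app : apps) {
            System.out.println(app.getApplicationId() + " " + app.getName()
                    + " state=" + app.getYarnApplicationState());
        }
        yarnClient.stop();
    }
}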
Advantages of YARN
● Better cluster utilization through fine-grained resource allocation.
● Supports multiple processing frameworks (MapReduce, Spark, Tez) on the same cluster.
● Improved scalability and fault tolerance compared to the Hadoop 1 JobTracker model.
4. Hadoop Common
Hadoop Common is a set of shared utilities, libraries, and configuration files that support all
other Hadoop components.
Main Functions
● Provides the common Java libraries and APIs used by HDFS, MapReduce, and YARN.
● Supplies file system abstractions, RPC, and serialization utilities.
● Holds the shared configuration files and scripts needed to start and manage Hadoop.
5. Hadoop Data Formats
Hadoop supports various data formats to efficiently store, process, and analyze large datasets.
The choice of format depends on storage efficiency, processing speed, and compatibility with
Hadoop tools.
1. Types of Data Formats in Hadoop
Data in Hadoop can be stored in structured, semi-structured, or unstructured formats. The main
types include:
A. Text-Based Formats
1. CSV/TSV
2. SequenceFile
3. JSON
4. XML
B. Binary Formats
1. Avro
2. Parquet
3. ORC (Optimized Row Columnar)
4. Protocol Buffers (Protobuf)
Each format has different advantages based on compression, schema evolution, and efficiency.
A. Text-Based Formats
1. CSV (Comma-Separated Values)
● Description: Stores data in simple text format, with rows separated by new lines and
columns separated by delimiters (e.g., , or \t).
● Advantages:
○ Human-readable.
○ Easy to process with Hadoop tools like MapReduce, Hive, and Pig.
● Disadvantages:
○ Large storage size (no compression).
○ No schema enforcement.
Example:
1, Alice, 25
2, Bob, 30
2. SequenceFile
● Description: A flat binary file of key-value pairs, commonly used for intermediate MapReduce output.
● Advantages: Splittable and compressible, so it works well with MapReduce.
● Disadvantages: Not human-readable; tied to the Hadoop ecosystem.
A short write example is shown below.
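A minimal illustrative sketch (assumed, not part of the original notes) of writing a SequenceFile with the Hadoop Java API; the output path and records are hypothetical.
java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SequenceFileWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path out = new Path("/data/users.seq");   // hypothetical output path

        // A SequenceFile stores typed key-value pairs in binary form.
        try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(out),
                SequenceFile.Writer.keyClass(IntWritable.class),
                SequenceFile.Writer.valueClass(Text.class))) {
            writer.append(new IntWritable(1), new Text("Alice,25"));
            writer.append(new IntWritable(2), new Text("Bob,30"));
        }
    }
}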
3. JSON
Example:
json
{
"ID": 1,
"Name": "Alice",
"Age": 25
4. XML
Example:
xml
<user>
<ID>1</ID>
<Name>Alice</Name>
<Age>25</Age>
</user>
1. Avro
● Description: A row-based binary format designed for fast data serialization.
● Advantages:
○ Supports schema evolution (can add or remove fields without breaking data).
○ Highly compressed and efficient for MapReduce.
● Disadvantages:
○ Not human-readable.
Example:
json
{
  "type": "record",
  "name": "User",
  "fields": [
    {"name": "ID", "type": "int"},
    {"name": "Name", "type": "string"},
    {"name": "Age", "type": "int"}
  ]
}
2. Parquet
● Description: A columnar binary format optimized for analytical queries that read only a subset of columns.
● Advantages: Excellent compression and fast column scans; well supported by Hive, Impala, and Spark.
● Disadvantages: Not human-readable; writes are heavier than in row-based formats.
Text-based formats (CSV, JSON, XML) are easy to use but inefficient for big data.
Binary formats (Avro, Parquet, ORC, Protobuf) provide better compression, performance, and
flexibility.
● Data is stored in Hadoop Distributed File System (HDFS) in different formats like:
○ Text-based formats (CSV, JSON, XML).
○ Optimized formats (Parquet, ORC, Avro).
○ Key-value format (SequenceFile).
6. Scaling Out Hadoop
Scaling out Hadoop refers to increasing the number of nodes in a Hadoop cluster to handle larger
workloads, higher storage demands, and improved processing speed. Unlike scaling up (adding
more CPU, RAM, or storage to an existing machine), scaling out distributes the load across
multiple machines.
Hadoop is designed to handle big data, which continuously grows in volume, velocity, and
variety. Scaling out helps in:
1. Handling larger datasets: More nodes mean more storage in HDFS.
2. Improving processing speed: Jobs run in parallel across more nodes.
3. Enhancing fault tolerance: More nodes reduce data loss risks.
4. Supporting higher workloads: More computing resources mean better performance.
7. Hadoop Streaming and Hadoop Pipes
● Hadoop Streaming is best for running MapReduce jobs in non-Java languages like
Python or Shell scripts.
● Hadoop Pipes is useful for high-performance applications requiring native C++
execution.
● Choose Streaming for ease of use and Pipes for better performance.
8. Hadoop Ecosystem
● Apache Hive – Data warehousing and SQL queries.
● Apache Pig – Scripting-based data analysis.
● Apache HBase – NoSQL database for real-time data.
● Apache Sqoop – Data transfer between Hadoop and RDBMS.
● Apache Flume – Collecting and aggregating log data.
● Apache Oozie – Workflow scheduler for managing Hadoop jobs.
The Hadoop ecosystem consists of a set of tools and frameworks that extend the functionality
of Hadoop for data storage, processing, analysis, and management. These tools work
together to handle Big Data efficiently.
1. Core Components of Hadoop
The Hadoop ecosystem includes various tools categorized into different functions:
Data Storage & Table Management
● HCatalog – Table management layer for Hadoop (integrates with Hive and Pig).
Data Processing
● MapReduce – Batch processing framework for distributed data.
● Apache Spark – In-memory computing framework for real-time & batch processing.
Data Ingestion
● Sqoop – Transfers data between Hadoop and relational databases (MySQL, PostgreSQL).
● Flume – Collects and ingests real-time streaming data (e.g., logs, IoT data).
Anatomy of a MapReduce Job Run
A MapReduce job processes data in two main phases:
1. Map Phase – Processes input data and converts it into intermediate key-value pairs.
2. Reduce Phase – Aggregates and processes the intermediate results.
Workflow: Job Submission → Input Splitting → Map Phase → Shuffle & Sort → Reduce Phase → Output to HDFS (see the diagram later in this section).
1. Components of MapReduce
1. Mapper – Processes input data and converts it into key-value pairs.
2. Reducer – Aggregates and processes the intermediate key-value pairs generated by the
Mapper.
3. Driver Program – Manages the execution of the MapReduce job.
Step 1: Input Splitting
● The input dataset is stored in HDFS and divided into splits (chunks of data).
● Each split is processed by a separate Mapper task.
Step 2: Mapping
● The Mapper function reads each split and processes it line by line.
● It converts the data into key-value pairs.
Step 3: Shuffling & Sorting
● The intermediate key-value pairs are sorted by key and grouped before being sent to the Reducer.
Word Count Example in Java
1. Define Mapper
The Mapper reads each line of input, splits it into words, and emits (word, 1) pairs.
java
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordCountMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        // Split the input line into words
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, one);   // emit (word, 1)
        }
    }
}
How It Works
● Each call to map() receives one line of input; StringTokenizer splits it into words, and a (word, 1) pair is emitted for every token.
2. Define Reducer
The Reducer aggregates and processes intermediate key-value pairs produced by the Mapper.
java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
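// The reducer class body is not included in the original notes; the following is a
// minimal sketch consistent with the imports above and with the WordCountReducer
// class name referenced in the Driver code below.
public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {   // add up all the counts emitted for this word
            sum += val.get();
        }
        result.set(sum);
        context.write(key, result);        // emit (word, total count)
    }
}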
How It Works
● For each word key, the Reducer receives the list of counts emitted by the Mappers, sums them, and writes the final (word, total) pair.
3. Define Driver
The Driver configures and submits the MapReduce job.
java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");

        job.setJarByClass(WordCount.class);
        job.setMapperClass(WordCountMapper.class);
        job.setCombinerClass(WordCountReducer.class); // Optional combiner to optimize performance
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Input and output paths are passed on the command line
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
How It Works
● The Driver sets the Mapper, Combiner, and Reducer classes, the output key/value types, and the input/output paths, then submits the job and waits for completion.
Compiling and Running the Job
shell
# 1. Compile the classes against the Hadoop classpath
javac -classpath `hadoop classpath` -d . WordCountMapper.java WordCountReducer.java WordCount.java

# 2. Package the compiled classes into a jar (file names are illustrative)
jar -cvf wordcount.jar *.class

# 3. Copy the input data into HDFS
hdfs dfs -mkdir -p /input
hdfs dfs -put input.txt /input

# 4. Run the job (the output directory must not already exist)
hadoop jar wordcount.jar WordCount /input /output

# 5. View the result
hdfs dfs -cat /output/part-r-00000
Integration Testing
● Test the full MapReduce job with input and output formats.
● Ensure that data passes correctly from Mapper to Reducer.
● Frameworks: Local Hadoop Cluster, MiniMRCluster
Performance Testing
● Measure job execution time, throughput, and resource usage on representative data volumes.
Functional Testing
● Verify that the Mapper and Reducer produce the expected output (business logic) for given inputs.
Regression Testing
● Re-run existing test suites after code or configuration changes to ensure previous functionality still works.
Unit Testing with MRUnit
● MRUnit lets you test Mappers and Reducers in isolation, without a running cluster. Example:
java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mrunit.mapreduce.MapDriver;
import org.junit.Test;

public class MyMapperTest {

    @Test
    public void testMapper() throws IOException {
        // MyMapper is the mapper under test (e.g., the WordCountMapper shown earlier)
        Mapper<LongWritable, Text, Text, IntWritable> mapper = new MyMapper();
        MapDriver<LongWritable, Text, Text, IntWritable> mapDriver =
                MapDriver.newMapDriver(mapper);

        // One input line; a word-count mapper should emit one (word, 1) pair per word
        mapDriver.withInput(new LongWritable(1), new Text("hadoop hadoop spark"));
        mapDriver.withOutput(new Text("hadoop"), new IntWritable(1));
        mapDriver.withOutput(new Text("hadoop"), new IntWritable(1));
        mapDriver.withOutput(new Text("spark"), new IntWritable(1));

        mapDriver.runTest();
    }
}
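A Reducer can be tested the same way with MRUnit's ReduceDriver; the sketch below is illustrative (not from the original notes) and assumes a hypothetical word-count reducer class MyReducer.
java
import java.io.IOException;
import java.util.Arrays;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mrunit.mapreduce.ReduceDriver;
import org.junit.Test;

public class MyReducerTest {

    @Test
    public void testReducer() throws IOException {
        // MyReducer is a hypothetical reducer class (e.g., the WordCountReducer above)
        ReduceDriver<Text, IntWritable, Text, IntWritable> reduceDriver =
                ReduceDriver.newReduceDriver(new MyReducer());

        // The reducer receives all counts for "hadoop" and should emit their sum
        reduceDriver.withInput(new Text("hadoop"),
                Arrays.asList(new IntWritable(1), new IntWritable(1)));
        reduceDriver.withOutput(new Text("hadoop"), new IntWritable(2));

        reduceDriver.runTest();
    }
}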
LocalJobRunner
● Runs MapReduce jobs on a single machine (in one JVM) without setting up a full Hadoop cluster.
● Faster than a real cluster and convenient for debugging.
● Configure in mapred-site.xml (or programmatically, as sketched below):
xml
<property>
  <name>mapreduce.framework.name</name>
  <value>local</value>
</property>
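Alternatively, the same setting can be applied programmatically in the driver for local testing; this is an illustrative sketch, not from the original notes.
java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class LocalRunExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Run the job in-process with the LocalJobRunner and the local file system
        conf.set("mapreduce.framework.name", "local");
        conf.set("fs.defaultFS", "file:///");

        Job job = Job.getInstance(conf, "local test job");
        // ... set mapper/reducer/input/output as in the WordCount driver above ...
    }
}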
MiniMRCluster
● Starts a small, in-process MapReduce cluster inside a test so that jobs can be exercised end-to-end without a real cluster (used for integration testing).
Common Issues and Fixes
● Out of Memory – Cause: large dataset per node. Fix: tune heap size, use compression.
Map Phase
● The Mapper processes each split and emits intermediate key-value pairs.
● The output of the Mapper is written to local disk before being shuffled.
● Example:
Mapper 1 Output:
("hadoop", 1) ("mapreduce", 1)
Mapper 2 Output:
("hadoop", 1) ("bigdata", 1)
Shuffle & Sort
● The intermediate key-value pairs from the Mappers are sorted and grouped before being sent to the Reducer.
● Partitioning decides which Reducer gets which keys (a custom partitioner sketch follows this example).
● Example:
Shuffled & Sorted:
("bigdata", [1])
("hadoop", [1,1])
("mapreduce", [1])
Reduce Phase
● The Reducer sums the grouped values for each key and writes the final output to HDFS.
● Example:
("bigdata", 1)
("hadoop", 2)
("mapreduce", 1)
The overall job flow:
+-----------------------+
|    Job Submission     |
+-----------------------+
           |
+-----------------------+
|    Input Splitting    |
+-----------------------+
           |
+-----------------------+
|       Map Phase       |
+-----------------------+
           |
+-----------------------+
|     Shuffle & Sort    |
+-----------------------+
           |
+-----------------------+
|      Reduce Phase     |
+-----------------------+
           |
+-----------------------+
|     Output to HDFS    |
+-----------------------+
A. InputFormat
● Defines how the input data is split and read (TextInputFormat by default).
B. Mapper
● Processes each input split and emits intermediate key-value pairs.
C. Combiner (Optional)
● Performs local aggregation on the Mapper output to reduce data transferred during the shuffle.
D. Partitioner
● Decides which Reducer receives each intermediate key (HashPartitioner by default).
E. Shuffle & Sort
● Transfers, sorts, and groups the Mapper output by key before the reduce step.
F. Reducer
● Aggregates the grouped intermediate values and produces the final result.
G. OutputFormat
● Writes the final key-value pairs to HDFS (TextOutputFormat by default).
The Mapper Code, Reducer Code, and Driver Code for these stages are the WordCount classes shown earlier in this section.
Understanding the anatomy of a MapReduce job run is crucial for developing optimized and
efficient data-processing applications. The execution follows a structured process of splitting,
mapping, shuffling, reducing, and output writing. Proper tuning and testing ensure better
performance in large-scale distributed systems.
Handling Failures in MapReduce
1. Task Reattempts: If a task (Map or Reduce) fails, Hadoop automatically retries the
execution on a different node. The number of retries is configurable.
2. Speculative Execution: Hadoop identifies tasks that are running slower than expected
and executes duplicate copies on other nodes. The first completed result is used, while
other executions are discarded, preventing bottlenecks caused by slow nodes.
3. Node Failure: If a node crashes or becomes unresponsive, Hadoop detects the failure
through the heartbeat mechanism. The JobTracker (in Hadoop 1) or ResourceManager (in
Hadoop 2/YARN) reschedules the tasks on other available nodes to ensure job
completion. These behaviors can be tuned per job, as sketched below.
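A minimal illustrative sketch (not from the original notes) of the standard job properties that control task retries and speculative execution:
java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class FaultToleranceConfigExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Maximum number of attempts per map/reduce task before the job fails
        conf.setInt("mapreduce.map.maxattempts", 4);
        conf.setInt("mapreduce.reduce.maxattempts", 4);

        // Enable (or disable) speculative execution of slow tasks
        conf.setBoolean("mapreduce.map.speculative", true);
        conf.setBoolean("mapreduce.reduce.speculative", true);

        Job job = Job.getInstance(conf, "fault tolerance demo");
        // ... set mapper/reducer/input/output as in the WordCount driver above ...
    }
}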
Hadoop job scheduling determines the order and priority in which MapReduce jobs execute on a
Hadoop cluster. Efficient job scheduling optimizes resource utilization, job execution time, and
cluster performance.
Hadoop supports multiple job schedulers to manage MapReduce jobs. The main schedulers are:
● FIFO (First In, First Out) Scheduler – Runs jobs in the order they arrive (default). Suited to simple batch processing in a single-user environment.
● Capacity Scheduler – Divides the cluster into queues with guaranteed capacities so multiple teams or organizations can share the cluster.
● Fair Scheduler – Shares resources fairly among running jobs so that short jobs are not starved by long-running ones.
1. Mapper Execution:
○ Each Mapper processes an input split and converts raw data into intermediate
key-value pairs.
○ The map() function is applied to each input record.
○ The output is then sorted and shuffled before being sent to the Reducer.
2. Reducer Execution:
○ The Reducer receives the grouped key-value pairs, applies the reduce() function to each group, and writes the final output to HDFS.
MapReduce Types
1. Identity Mapper & Identity Reducer:
○ The Identity Mapper simply passes input key-value pairs to the output without
modification.
○ The Identity Reducer does the same, outputting the data as received from the
shuffle phase.
○ Used when no transformation is required but sorting/shuffling is needed.
2. Chain Mapper & Chain Reducer:
○ Allow multiple Mappers (and a Reducer followed by further Mappers) to be chained within a single MapReduce job, with the output of one step feeding the next.
○ Useful for multi-step transformations without writing intermediate results to HDFS; a configuration sketch follows.
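A minimal configuration sketch (assumed, not from the original notes) of chaining two Mappers before a Reducer with ChainMapper/ChainReducer; TokenizerMapper, UpperCaseMapper, and SumReducer are hypothetical classes.
java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.chain.ChainMapper;
import org.apache.hadoop.mapreduce.lib.chain.ChainReducer;

public class ChainJobExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "chain example");
        job.setJarByClass(ChainJobExample.class);

        // First mapper: splits lines into (word, 1) pairs (hypothetical class)
        ChainMapper.addMapper(job, TokenizerMapper.class,
                LongWritable.class, Text.class, Text.class, IntWritable.class,
                new Configuration(false));

        // Second mapper: transforms the keys, e.g. upper-cases the words (hypothetical class)
        ChainMapper.addMapper(job, UpperCaseMapper.class,
                Text.class, IntWritable.class, Text.class, IntWritable.class,
                new Configuration(false));

        // Final reducer: sums the counts per key (hypothetical class)
        ChainReducer.setReducer(job, SumReducer.class,
                Text.class, IntWritable.class, Text.class, IntWritable.class,
                new Configuration(false));

        // Input/output paths and formats would be set as in the WordCount driver above
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}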
Output Formats:
○ TextOutputFormat – Default; writes each key-value pair as a tab-separated line of text.
○ SequenceFileOutputFormat – Writes binary SequenceFiles that can feed further MapReduce jobs.
○ A custom format can be selected in the Driver with job.setOutputFormatClass(...).