Unit 3: MapReduce
Contents
I. Fundamentals of HDFS (Hadoop Distributed File System)
    1. What is HDFS?
    2. Key Components of HDFS
    3. HDFS File Storage Mechanism
    4. Core Features of HDFS
    5. Important Concepts
    6. Advantages of HDFS
    7. Limitations
II. Basic HDFS Commands with Examples
    1. Create a Directory
    2. List Directory Contents
    3. Copy File from Local to HDFS
    4. Copy File from HDFS to Local
    5. Display File Content
    6. Check File Size and Block Info
    7. Delete a File or Directory
    8. Check File Existence
    9. Copy File within HDFS
    10. Move (Rename) File
    11. Set Permissions
    12. Change Ownership
    13. Display File System Usage
III. Introduction to MapReduce
    1. What is MapReduce?
    2. Basic Concept
    3. How MapReduce Works – Step by Step
    4. MapReduce Example – Word Count
    5. Why Use MapReduce?
    6. Limitations
IV. Fundamentals of Parallel Processing in Hadoop
    1. What is Parallel Processing?
    2. How Hadoop Supports Parallel Processing
    3. Key Concepts in Hadoop's Parallel Processing
    4. Parallel Processing Flow in Hadoop
    5. Benefits of Parallel Processing in Hadoop
    6. Challenges in Parallel Processing
    7. Example: Word Count Program
V. MapReduce Streaming with Python
    1. What is Hadoop Streaming?
    2. How It Works
    3. Word Count Example Using Python
        Step 1: Mapper Script (mapper.py)
        Step 2: Reducer Script (reducer.py)
    4. Running MapReduce Streaming Job
    5. Checking the Output
    6. Notes
VI. Numerical Analysis with Hadoop Streaming (Python)
    Calculate mean, min, max, and sum of a list of numbers stored in a text file using MapReduce (Streaming) with Python.
    1. Sample Input File (numbers.txt)
    2. Mapper Script (mapper.py)
    3. Reducer Script (reducer.py)
    4. Make Scripts Executable
    5. Run the Hadoop Streaming Job
    6. Check the Output
    7. Key Points
I. Fundamentals of HDFS (Hadoop Distributed File System)
2. Key Components of HDFS
1. NameNode (Master)
o Manages metadata: filenames, directories, file permissions, block locations.
o Knows which block is stored on which DataNode.
2. DataNode (Slave)
o Stores actual data blocks.
o Periodically reports back to the NameNode.
3. Secondary NameNode
o Performs periodic checkpoints of file system metadata.
o Not a backup node! It helps in merging edits with the current state of the filesystem.
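On a single-node (pseudo-distributed) installation, these daemons run as separate Java processes, so a quick way to confirm they are alive is the jps command; when everything is running, its listing includes NameNode, DataNode, and SecondaryNameNode entries.
# List the Hadoop daemon processes running on this machine
jps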
5. Important Concepts
Block: Minimum unit of storage in HDFS. Default size: 128 MB.
Replication: Copies of data blocks for reliability.
Rack Awareness: Data is placed on different racks to avoid data loss during rack failure.
Heartbeat: Signal sent by DataNodes to let the NameNode know they're alive.
Data Integrity: Uses checksums to verify data accuracy.
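Several of these concepts can be observed directly on a cluster. For example, hdfs fsck reports the blocks, replication factor, and block locations of a file; the path below is only a placeholder.
hdfs fsck /user/hadoop/data/sample.txt -files -blocks -locations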
6. Advantages of HDFS
Handles hardware failure gracefully
Suitable for big data and batch processing
Easily scalable and cost-effective
7. Limitations
II. Basic HDFS Commands with Examples
1. Create a Directory
hdfs dfs -mkdir /user/hadoop/data
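As with the Linux mkdir command, adding the -p option creates any missing parent directories in the path:
hdfs dfs -mkdir -p /user/hadoop/data/input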
13. Display File System Usage
Shows info like total capacity, used space, number of DataNodes, etc.
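Typical commands for displaying this information are shown below (output details vary by Hadoop version):
hdfs dfs -df -h /
hdfs dfsadmin -report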
III. Introduction to MapReduce
2. Basic Concept
1. Input Splitting:
o The input data is split into chunks.
o Each chunk is processed by a separate Map task.
2. Mapping:
o The Map function processes each record and produces intermediate key-value pairs.
3. Shuffling and Sorting:
o The intermediate key-value pairs are grouped by key.
o This ensures that all values associated with the same key go to the same Reducer.
4. Reducing:
o The Reduce function processes each group of key-value pairs and generates output.
5. Output:
o The final results are written back to HDFS.
4. MapReduce Example – Word Count
For an input line such as "apple banana apple", the Map output is:
(apple, 1)
(banana, 1)
(apple, 1)
Reduce Output:
(apple, 2)
(banana, 1)
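The same flow can be traced outside Hadoop in a few lines of plain Python. The sketch below is only an in-memory illustration of the Map, Shuffle/Sort, and Reduce steps for the input line implied above; it is not how Hadoop itself executes a job.
from itertools import groupby
from operator import itemgetter

line = "apple banana apple"        # input implied by the Map output above

# Map: emit a (word, 1) pair for every word
mapped = [(word, 1) for word in line.split()]

# Shuffle & sort: group the pairs by key
mapped.sort(key=itemgetter(0))

# Reduce: sum the values for each key
reduced = {key: sum(count for _, count in pairs)
           for key, pairs in groupby(mapped, key=itemgetter(0))}

print(reduced)                     # {'apple': 2, 'banana': 1}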
5. Why Use MapReduce?
Scalability: Easily processes petabytes of data.
Fault Tolerance: Automatically recovers from failures.
Parallel Processing: Runs tasks in parallel for efficiency.
Data Locality: Moves computation closer to the data.
6. Limitations
IV. Fundamentals of Parallel Processing in Hadoop
Parallel processing is the technique of dividing a large task into smaller sub-tasks, which are processed simultaneously using multiple processors or nodes.
In Hadoop, parallel processing is the core principle behind its ability to efficiently handle large-scale data.
3. Key Concepts in Hadoop's Parallel Processing
Data Locality: Moves computation to where the data resides to reduce network traffic.
Task Parallelism: Multiple tasks run simultaneously, each handling a part of the job.
Job Tracker & Task Tracker: In classic Hadoop, they managed the scheduling and execution of tasks.
YARN (Yet Another Resource Negotiator): Replaces the JobTracker in Hadoop 2+ to manage resources more efficiently.
V. MapReduce Streaming with Python
1. What is Hadoop Streaming?
Hadoop Streaming is a utility that allows you to create and run MapReduce jobs with any executable or script (like Python) that can read from stdin and write to stdout.
You don't have to write your programs in Java; you can use Python, Bash, Perl, etc.
2. How It Works
The Mapper reads input line by line from stdin, processes it, and outputs key-value pairs to stdout.
The Reducer reads sorted key-value pairs from stdin, aggregates the values for each key, and then outputs the results.
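3. Word Count Example Using Python
Step 1: Mapper Script (mapper.py)
A minimal streaming word-count mapper, consistent with the reducer below, can be written as the following sketch (assuming Python 3 on the cluster nodes): it reads text from stdin and emits one word<TAB>1 pair per word.
#!/usr/bin/env python3
import sys

# Emit one "word<TAB>1" pair for every word on every input line
for line in sys.stdin:
    for word in line.strip().split():
        print(f"{word}\t1")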
Step 2: Reducer Script (reducer.py)
#!/usr/bin/env python3
import sys

current_word = None
current_count = 0

# Mapper output arrives on stdin already sorted by key
for line in sys.stdin:
    word, count = line.strip().split('\t', 1)
    count = int(count)
    if current_word == word:
        current_count += count
    else:
        if current_word:
            print(f"{current_word}\t{current_count}")
        current_word = word
        current_count = count

if current_word:  # flush the final word
    print(f"{current_word}\t{current_count}")
Explanation: Hadoop sorts the mapper output by key, so all counts for the same word arrive at the reducer consecutively. The reducer keeps a running total for the current word, prints the word and its total whenever the key changes, and flushes the last word after the loop ends.
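4. Running MapReduce Streaming Job
The job is launched with the hadoop-streaming jar (assuming both scripts have been made executable, e.g. with chmod +x). The jar location and the HDFS input/output paths below are placeholders that depend on the installation; the -file options ship the two scripts to the cluster nodes.
hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
    -input /user/hadoop/input \
    -output /user/hadoop/wordcount_output \
    -mapper mapper.py \
    -reducer reducer.py \
    -file mapper.py \
    -file reducer.py
5. Checking the Output
The results can then be read directly from the job's output directory in HDFS:
hdfs dfs -cat /user/hadoop/wordcount_output/part-00000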
6. Notes
Python reads input from stdin and writes output to stdout.
Output format should be key<TAB>value (tab-separated).
Hadoop automatically handles sorting and grouping for the reducer.
VI. Numerical Analysis with Hadoop Streaming (Python)
Calculate the mean, min, max, and sum of a list of numbers stored in a text file using MapReduce (Streaming) with Python.
1. Sample Input File (numbers.txt)
10
20
5
40
15
2. Mapper Script (mapper.py)
#!/usr/bin/env python3
import sys
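# Sketch of the rest of this mapper (the aggregation is all done in the
# reducer below): forward each non-empty number from stdin unchanged.
for line in sys.stdin:
    value = line.strip()
    if value:
        print(value)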
3. Reducer Script (reducer.py)
#!/usr/bin/env python3
import sys

total = 0
count = 0
min_val = None
max_val = None

# Read one number per line from stdin and update the running statistics
for line in sys.stdin:
    line = line.strip()
    if not line:
        continue
    value = float(line)
    total += value
    count += 1
    if min_val is None or value < min_val:
        min_val = value
    if max_val is None or value > max_val:
        max_val = value

# Output results
print(f"Sum\t{total}")
print(f"Count\t{count}")
print(f"Min\t{min_val}")
print(f"Max\t{max_val}")
print(f"Mean\t{total / count if count != 0 else 0}")
Expected Output:
Sum 90.0
Count 5
Min 5.0
Max 40.0
Mean 18.0
7. Key Points