
Unit 3: MapReduce

Contents
I. Fundamentals of HDFS (Hadoop Distributed File System)
   1. What is HDFS?
   2. Key Components of HDFS
   3. HDFS File Storage Mechanism
   4. Core Features of HDFS
   5. Important Concepts
   6. Advantages of HDFS
   7. Limitations
II. Basic HDFS Commands with Examples
   1. Create a Directory
   2. List Directory Contents
   3. Copy File from Local to HDFS
   4. Copy File from HDFS to Local
   5. Display File Content
   6. Check File Size and Block Info
   7. Delete a File or Directory
   8. Check File Existence
   9. Copy File within HDFS
   10. Move (Rename) File
   11. Set Permissions
   12. Change Ownership
   13. Display File System Usage
III. Introduction to MapReduce
   1. What is MapReduce?
   2. Basic Concept
   3. How MapReduce Works – Step by Step
   4. MapReduce Example – Word Count
   5. Why Use MapReduce?
   6. Limitations
IV. Fundamentals of Parallel Processing in Hadoop
   1. What is Parallel Processing?
   2. How Hadoop Supports Parallel Processing
   3. Key Concepts in Hadoop's Parallel Processing
   4. Parallel Processing Flow in Hadoop
   5. Benefits of Parallel Processing in Hadoop
   6. Challenges in Parallel Processing
   7. Example: Word Count Program
V. MapReduce Streaming with Python
   1. What is Hadoop Streaming?
   2. How It Works
   3. Word Count Example Using Python
      Step 1: Mapper Script (mapper.py)
      Step 2: Reducer Script (reducer.py)
   4. Running MapReduce Streaming Job
   5. Checking the Output
   6. Notes
VI. Numerical Analysis with Hadoop Streaming (Python)
   1. Sample Input File (numbers.txt)
   2. Mapper Script (mapper.py)
   3. Reducer Script (reducer.py)
   4. Make Scripts Executable
   5. Run the Hadoop Streaming Job
   6. Check the Output
   7. Key Points

I. Fundamentals of HDFS (Hadoop Distributed File System)
1. What is HDFS?
HDFS is the primary storage system used by Hadoop applications. It is designed to store very
large files reliably and stream those data sets at high bandwidth to user applications.

2. Key Components of HDFS

1. NameNode (Master)
o Manages metadata: filenames, directories, file permissions, block locations.
o Knows which block is stored on which DataNode.
2. DataNode (Slave)
o Stores actual data blocks.
o Periodically reports back to the NameNode.
3. Secondary NameNode
o Performs periodic checkpoints of file system metadata.
o Not a backup node! It helps in merging edits with the current state of the
filesystem.

3. HDFS File Storage Mechanism

 Files are split into blocks (default block size: 128 MB; larger sizes such as 256 MB are common in practice).


 Each block is replicated across multiple DataNodes (default replication factor: 3) to
ensure fault tolerance.
 For example, a 300 MB file is split into 3 blocks and each block is replicated 3 times.
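
To see how a particular file has actually been split and replicated, you can ask the NameNode for its block report (the file path below is only an example):

hdfs fsck /user/hadoop/data/sample.txt -files -blocks -locations

The output lists each block, its size, and the DataNodes holding its replicas.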

4. Core Features of HDFS

 Fault Tolerance: If a DataNode fails, data is retrieved from another replica.


 High Throughput: Optimized for large data sets, supports batch processing.
 Scalability: Can scale to thousands of nodes by simply adding more DataNodes.
 Write Once, Read Many: Files are typically written once and read multiple times.
 Data Locality: Processing is moved closer to where the data resides (reducing network
congestion).

5. Important Concepts

Concept Description
Block Minimum unit of storage in HDFS. Default size: 128 MB
Replication Copies of data blocks for reliability
Rack Awareness Data is placed on different racks to avoid data loss during rack failure
Heartbeat Signal sent by DataNodes to let the NameNode know they’re alive
Data Integrity Uses checksums to verify data accuracy
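
As a small illustration of replication in practice, the replication factor of an existing file can be changed (and the change waited on) with -setrep; the path is just an example:

hdfs dfs -setrep -w 2 /user/hadoop/data/sample.txt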

6. Advantages of HDFS
 Handles hardware failure gracefully
 Suitable for big data and batch processing
 Easily scalable and cost-effective

7. Limitations

 Not ideal for low-latency access or real-time processing


 Not optimized for storing a large number of small files
 Doesn't support random writes (append-only model)

II. Basic HDFS Commands with Examples

1. Create a Directory
hdfs dfs -mkdir /user/hadoop/data

Creates a new directory named data inside /user/hadoop.

2. List Directory Contents


hdfs dfs -ls /user/hadoop

Lists all files and subdirectories in /user/hadoop.

3. Copy File from Local to HDFS


hdfs dfs -put sample.txt /user/hadoop/data

Uploads sample.txt from the local file system to HDFS.

4. Copy File from HDFS to Local


hdfs dfs -get /user/hadoop/data/sample.txt /home/user/

Downloads the file from HDFS to your local file system.


5. Display File Content
hdfs dfs -cat /user/hadoop/data/sample.txt

Displays the contents of the file.

6. Check File Size and Block Info


hdfs dfs -du -h /user/hadoop/data/sample.txt

Shows file size in a human-readable format.

7. Delete a File or Directory


hdfs dfs -rm /user/hadoop/data/sample.txt

Deletes the file.

hdfs dfs -rm -r /user/hadoop/data

Recursively deletes a directory.

8. Check File Existence


hdfs dfs -test -e /user/hadoop/data/sample.txt && echo "Exists" || echo "Not found"

Checks if a file exists.

9. Copy File within HDFS


hdfs dfs -cp /user/hadoop/data/sample.txt /user/hadoop/archive/

Copies a file to a new location within HDFS.


10. Move (Rename) File
hdfs dfs -mv /user/hadoop/data/sample.txt /user/hadoop/data/sample_old.txt

Renames or moves the file.

11. Set Permissions


hdfs dfs -chmod 755 /user/hadoop/data

Sets the permissions to 755: read, write, and execute for the owner, and read and execute for group and others.

12. Change Ownership


hdfs dfs -chown hadoop:hadoop /user/hadoop/data

Changes the owner and group of a directory or file.

13. Display File System Usage


hdfs dfsadmin -report

Shows info like total capacity, used space, number of DataNodes, etc.

III. Introduction to MapReduce


1. What is MapReduce?

MapReduce is a programming model and an associated implementation for processing and generating large data sets with a parallel, distributed algorithm on a Hadoop cluster.

It was developed by Google and later implemented in Apache Hadoop.

2. Basic Concept

MapReduce works in two main phases:


1. Map Phase
o Takes input data and converts it into a set of key-value pairs.
o Each mapper processes a subset of the data independently.
2. Reduce Phase
o Takes the key-value pairs generated by the Map phase.
o Aggregates, summarizes, or transforms them into final output.

3. How MapReduce Works – Step by Step

1. Input Splitting:
o The input data is split into chunks.
o Each chunk is processed by a separate Map task.
2. Mapping:
o The Map function processes each record and produces intermediate key-value
pairs.
3. Shuffling and Sorting:
o The intermediate key-value pairs are grouped by key.
o This ensures that all values associated with the same key go to the same Reducer.
4. Reducing:
o The Reduce function processes each group of key-value pairs and generates
output.
5. Output:
o The final results are written back to HDFS.

4. MapReduce Example – Word Count

Let’s say you have a file with the content:

apple banana apple

 Map Output:

(apple, 1)
(banana, 1)
(apple, 1)

 Shuffle & Sort:

(apple, [1, 1])


(banana, [1])

 Reduce Output:
(apple, 2)
(banana, 1)
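
The same flow can be sketched in a few lines of plain Python. This is only a single-machine illustration of the three phases, not Hadoop's actual implementation:

#!/usr/bin/env python3
# Single-machine sketch of map -> shuffle/sort -> reduce for the example above.
from itertools import groupby

text = "apple banana apple"

# Map: emit a (word, 1) pair for every word
mapped = [(word, 1) for word in text.split()]

# Shuffle & sort: group all values by key
mapped.sort(key=lambda kv: kv[0])
grouped = {key: [v for _, v in pairs] for key, pairs in groupby(mapped, key=lambda kv: kv[0])}

# Reduce: sum the values for each key
reduced = {key: sum(values) for key, values in grouped.items()}
print(reduced)  # {'apple': 2, 'banana': 1}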

5. Why Use MapReduce?

Feature Description
Scalability Easily processes petabytes of data
Fault Tolerance Automatically recovers from failures
Parallel Processing Runs tasks in parallel for efficiency
Data Locality Moves computation closer to the data

6. Limitations

 High latency for small tasks


 Not suitable for real-time processing
 Complex for iterative or multi-stage algorithms

IV. Fundamentals of Parallel Processing in Hadoop


1. What is Parallel Processing?

Parallel processing is the technique of dividing a large task into smaller sub-tasks, which are
processed simultaneously using multiple processors or nodes.

In Hadoop, parallel processing is the core principle behind its ability to efficiently handle large-
scale data.

2. How Hadoop Supports Parallel Processing

Hadoop uses two key components:

 HDFS (Hadoop Distributed File System): Distributes data across nodes.


 MapReduce: Processes the data in parallel.

Together, they enable Hadoop to:

 Store massive datasets across multiple machines.


 Process large data volumes concurrently.
3. Key Concepts in Hadoop's Parallel Processing

Concept Description
Data Locality Moves computation to where the data resides to reduce network traffic.
Task Parallelism Multiple tasks run simultaneously, each handling a part of the job.
JobTracker & TaskTracker In classic Hadoop, they managed the scheduling and execution of tasks.
YARN (Yet Another Resource Negotiator) Replaces the JobTracker in Hadoop 2+ to manage resources more efficiently.

4. Parallel Processing Flow in Hadoop

1. Input data is split into blocks (e.g., 128 MB each).


2. Each block is assigned to a mapper task.
3. Mapper tasks run in parallel across the Hadoop cluster.
4. Intermediate results are shuffled and sorted.
5. Reducer tasks then process grouped key-value pairs in parallel.
6. The final output is written back to HDFS.

5. Benefits of Parallel Processing in Hadoop

1. High Throughput – Processes terabytes or petabytes of data quickly.


2. Scalability – New nodes can be added easily to increase processing capacity.
3. Fault Tolerance – Failed tasks are re-executed on other available nodes.
4. Efficiency – Utilizes multiple CPUs and resources simultaneously.

6. Challenges in Parallel Processing

1. Requires tuning and configuration to avoid processing bottlenecks.


2. Not suitable for large numbers of small files.
3. Dependent on data distribution and network speed.
4. Debugging and monitoring distributed tasks can be complex.
7. Example: Word Count Program

 Input file is split into 3 blocks.


 3 Mapper tasks run in parallel, each producing key-value pairs like (word, 1).
 Reducers aggregate key-value pairs in parallel to output: (word, total_count).
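
As a toy illustration of this idea on a single machine, the sketch below runs three "mapper" functions in parallel with Python's multiprocessing module and then merges their partial counts. The block contents are made up, and this is not how Hadoop itself schedules tasks:

#!/usr/bin/env python3
# Toy illustration: three "mapper tasks" counting words in parallel, then one merge step.
from collections import Counter
from multiprocessing import Pool

def map_task(block):
    # Each mapper counts the words in its own block independently.
    return Counter(block.split())

if __name__ == "__main__":
    # Pretend these are the three input blocks (made-up contents).
    blocks = [
        "apple banana apple",
        "banana cherry apple",
        "cherry cherry banana",
    ]
    with Pool(processes=3) as pool:
        partial_counts = pool.map(map_task, blocks)

    # "Reduce": merge the partial counts into (word, total_count).
    totals = Counter()
    for partial in partial_counts:
        totals.update(partial)
    print(dict(totals))  # {'apple': 3, 'banana': 3, 'cherry': 3}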

V. MapReduce Streaming with Python

1. What is Hadoop Streaming?

Hadoop Streaming is a utility that allows you to create and run MapReduce jobs with any
executable or script (like Python) that can read from stdin and write to stdout.

You don’t have to write your programs in Java — you can use Python, Bash, Perl, etc.

2. How It Works

 Mapper reads input line-by-line from stdin, processes it, and outputs key-value pairs to
stdout.
 Reducer reads sorted key-value pairs from stdin and aggregates values for each key,
then outputs results.

3. Word Count Example Using Python


Step 1: Mapper Script (mapper.py)
#!/usr/bin/env python3
import sys

# Read input line by line from stdin and emit one "word<TAB>1" pair per word
for line in sys.stdin:
    words = line.strip().split()
    for word in words:
        print(f"{word}\t1")

Step 2: Reducer Script (reducer.py)


#!/usr/bin/env python3
import sys

current_word = None
current_count = 0

# Hadoop delivers the mapper output sorted by key, so identical words arrive consecutively
for line in sys.stdin:
    word, count = line.strip().split('\t')
    count = int(count)

    if current_word == word:
        current_count += count
    else:
        if current_word:
            print(f"{current_word}\t{current_count}")
        current_word = word
        current_count = count

# output the last word
if current_word:
    print(f"{current_word}\t{current_count}")

Make sure both files are executable:

chmod +x mapper.py reducer.py
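
Before submitting the job, you can test the scripts locally; piping through sort imitates Hadoop's shuffle-and-sort step (this assumes a local file named input.txt):

cat input.txt | ./mapper.py | sort | ./reducer.py

The local output should match what the cluster produces for the same input.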

4. Running MapReduce Streaming Job


hadoop jar /path/to/hadoop-streaming.jar \
-input /user/hadoop/input.txt \
-output /user/hadoop/output \
-mapper mapper.py \
-reducer reducer.py \
-file mapper.py \
-file reducer.py

Explanation:

 -input: HDFS input path


 -output: HDFS output path (must not already exist)
 -mapper: your Python mapper script
 -reducer: your Python reducer script
 -file: ensures your Python scripts are shipped to cluster nodes

5. Checking the Output


hdfs dfs -cat /user/hadoop/output/part-00000

6. Notes
 Python reads input from stdin and writes output to stdout.
 Output lines should be in the form key\tvalue (key and value separated by a tab).
 Hadoop automatically handles sorting and grouping for the reducer.

VI. Numerical Analysis with Hadoop Streaming (Python)

Calculate mean, min, max, and sum of a list of numbers stored in a text file using
MapReduce (Streaming) with Python.

1. Sample Input File (numbers.txt)


Suppose the input file contains one number per line:

10
20
5
40
15

Saved in HDFS as:

hdfs dfs -put numbers.txt /user/hadoop/input/

2. Mapper Script (mapper.py)


The mapper simply emits each number:

#!/usr/bin/env python3
import sys

# Emit every non-empty line under the single key "num"
for line in sys.stdin:
    value = line.strip()
    if value:
        print(f"num\t{value}")

3. Reducer Script (reducer.py)


The reducer calculates sum, count, min, max, and mean:

#!/usr/bin/env python3
import sys

total = 0
count = 0
min_val = None
max_val = None

# Accumulate running statistics over all values received for the key "num"
for line in sys.stdin:
    key, value = line.strip().split('\t')
    value = float(value)

    total += value
    count += 1

    if min_val is None or value < min_val:
        min_val = value
    if max_val is None or value > max_val:
        max_val = value

# Output results
print(f"Sum\t{total}")
print(f"Count\t{count}")
print(f"Min\t{min_val}")
print(f"Max\t{max_val}")
print(f"Mean\t{total / count if count != 0 else 0}")

4. Make Scripts Executable


chmod +x mapper.py reducer.py

5. Run the Hadoop Streaming Job


hadoop jar /path/to/hadoop-streaming.jar \
-input /user/hadoop/input/numbers.txt \
-output /user/hadoop/output/numerical_stats \
-mapper mapper.py \
-reducer reducer.py \
-file mapper.py \
-file reducer.py
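
Because the mapper tags every number with the same key (num), all values are routed to one reducer, which is what makes a global sum, min, and max possible; any additional reducers would simply produce empty part files. If you want exactly one output file, one option is to force a single reduce task with the generic -D option (shown here as an assumption about your setup; generic options must come before the streaming options):

hadoop jar /path/to/hadoop-streaming.jar \
    -D mapreduce.job.reduces=1 \
    -input /user/hadoop/input/numbers.txt \
    -output /user/hadoop/output/numerical_stats \
    -mapper mapper.py \
    -reducer reducer.py \
    -file mapper.py \
    -file reducer.py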

6. Check the Output


hdfs dfs -cat /user/hadoop/output/numerical_stats/part-00000

Expected Output:

Sum 90.0
Count 5
Min 5.0
Max 40.0
Mean 18.0

7. Key Points
Feature Description
Input Format One number per line
Mapper Tags each number with the key num
Reducer Performs the numerical calculations (sum, count, min, max, mean)
Scalable? Yes, works on large datasets in a distributed way
