MapReduce Theory
MapReduce is a programming model for processing large datasets in parallel across a distributed
cluster of computers. The programmer supplies two functions, Map and Reduce; between them, the
framework performs a Shuffle and Sort step, giving the three-step flow described below.

1. Map Step:
The input data is divided into chunks, and the map function processes these chunks in parallel,
emitting intermediate key-value pairs derived from each input record. A sketch of such a map
function follows below.
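As a minimal single-machine sketch of what a map function can look like, here is a Python illustration using the word-count logic developed later in this document (the name map_chunk and the chunk format, an iterable of text lines, are assumptions for illustration):

def map_chunk(chunk):
    # chunk: an iterable of input records; here, lines of text
    pairs = []
    for line in chunk:
        for word in line.split():
            pairs.append((word, 1))  # emit the key-value pair (word, 1)
    return pairs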

2. Shuffle and Sort Step:
After the map function processes the data, the system groups and sorts these key-value pairs based
on their keys, so that all values corresponding to the same key are together.
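A minimal single-machine sketch of this grouping (in a real cluster the framework performs this step for you, moving intermediate data between machines; the name shuffle_and_sort is my own):

from collections import defaultdict

def shuffle_and_sort(pairs):
    # Collect every value under its key, then order the groups by key
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())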

3. Reduce Step:
In this step, the reduce function processes the grouped data and combines or aggregates it to
produce the final result.
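A matching sketch of a reduce function: it receives one key together with all of that key's grouped values and aggregates them (summation here, as in the word-count example below; other problems use other aggregations):

def reduce_group(key, values):
    # values: the list collected for this key during shuffle and sort
    return (key, sum(values))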

MapReduce Workflow:
Map: Processes the input chunks in parallel, producing intermediate key-value pairs.
Shuffle: Groups and sorts the intermediate key-value pairs by key.
Reduce: Combines or aggregates each group of values into the final output.

MapReduce enables large-scale data processing because it runs across many machines, with each
machine handling only its own portion of the data in parallel.

Key Features of MapReduce:

Scalability: Can handle petabytes of data by distributing the computation across a cluster of machines.
Fault Tolerance: If a task fails on one machine, it is reassigned to another, ensuring reliable execution.
Parallel Processing: Processes data simultaneously on multiple machines, speeding up computation.

Simple Example of MapReduce

Problem: Count the number of occurrences of each word in a large text file.

Input Data:

Hello World Hello Hadoop Hadoop is great

Step-by-Step Process:

1. Map Step:
Each line of the text is split into words, and for each word, we emit a key-value pair. The key is the
word, and the value is 1 (representing the occurrence of that word).
Input:


Hello World Hello Hadoop Hadoop is great

Map Output (Key-Value Pairs):


(Hello, 1) (World, 1) (Hello, 1) (Hadoop, 1) (Hadoop, 1) (is, 1) (great, 1)

2. Shuffle and Sort Step:
The intermediate data is grouped by keys (the words), so all occurrences of the same word are
grouped together.
Grouped Data:


(Hello, [1, 1]) (World, [1]) (Hadoop, [1, 1]) (is, [1]) (great, [1])

3. Reduce Step:
The reduce function takes each group of values associated with a key and aggregates them (sums
them up in this case).
Reduce Output:


(Hello, 2) (World, 1) (Hadoop, 2) (is, 1) (great, 1)

Final Output:
The final result is the word count for each word in the input text:


Hello: 2 World: 1 Hadoop: 2 is: 1 great: 1

Code Example (Python):

Here's a simple Python implementation of the same word count:

# Map function
def map_function(input_data):
    output = []
    for line in input_data:
        for word in line.split():
            output.append((word, 1))
    return output

# Reduce function
def reduce_function(mapped_data):
    result = {}
    for word, count in mapped_data:
        if word in result:
            result[word] += count
        else:
            result[word] = count
    return result

# Example usage
input_data = ["Hello World", "Hello Hadoop", "Hadoop is great"]

# Map phase
mapped_data = map_function(input_data)

# Reduce phase
final_result = reduce_function(mapped_data)
print(final_result)

Output:


{'Hello': 2, 'World': 1, 'Hadoop': 2, 'is': 1, 'great': 1}
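Note that reduce_function above folds the grouping (shuffle) and the aggregation into a single function. A variant that mirrors the three-phase model more literally, grouping first and then reducing each group, could look like this (a single-machine sketch continuing from the example above):

from collections import defaultdict

# Shuffle: group all counts emitted for the same word
groups = defaultdict(list)
for word, count in map_function(input_data):
    groups[word].append(count)

# Reduce: aggregate each group independently
final_result = {word: sum(counts) for word, counts in sorted(groups.items())}
print(final_result)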

Conclusion:
Big Data refers to large-scale datasets that require specialized tools for storage and
processing.
MapReduce is a powerful method for parallel processing of large datasets, where data is
processed in the form of key-value pairs.
The process involves mapping input data to intermediate key-value pairs, shuffling and sorting
those pairs by key, and reducing each group to a final result.

MapReduce is particularly effective for distributed computing environments like Hadoop, where
data is spread across many machines.
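For concreteness, here is a minimal sketch of the same word count in the Hadoop Streaming style, where the mapper and reducer are ordinary scripts that read standard input and write tab-separated key-value lines (the file names mapper.py and reducer.py are illustrative; the exact job-submission command depends on your Hadoop installation):

# mapper.py: emit one "word<TAB>1" line per word
import sys

for line in sys.stdin:
    for word in line.split():
        print(word + "\t1")

# reducer.py: Hadoop delivers the mapper output sorted by key, so equal
# words arrive on consecutive lines; sum each run of identical keys.
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t", 1)
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print(current_word + "\t" + str(current_count))
        current_word, current_count = word, int(count)
if current_word is not None:
    print(current_word + "\t" + str(current_count))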
