
Map Reduce Algorithm - Hadoop

The document discusses the MapReduce algorithm and Hadoop. It describes the map and reduce functions and how MapReduce handles tasks like scheduling, data distribution, synchronization, and fault tolerance. It provides examples of using MapReduce for problems like finding the sum of squares and building an inverted index.

Uploaded by

Sushan Gautam
Copyright
© All Rights Reserved

Map Reduce Algorithm – Hadoop
18BCE2482
18BCE2488
18BCE2490
MapReduce?
 Programmers specify two functions:
map (k, v) → <k’, v’>*
reduce (k’, v’) → <k’, v’>*
 All values with the same key are reduced together

 Usually, programmers also specify:
partition (k’, number of partitions) → partition for k’
 Often a simple hash of the key, e.g. hash(k’) mod n
 Allows reduce operations for different keys to run in parallel
combine (k’, v’) → <k’, v’>*
 Mini-reducers that run in memory after the map phase
 Used as an optimization to reduce network traffic
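These two optional functions can be sketched in Python (a toy illustration, not Hadoop's actual API; the combiner below pre-sums values, which assumes the reducer is also a sum):

```python
def partition(key, num_partitions):
    # partition (k', n) -> partition for k':
    # a simple hash of the key, hash(k') mod n
    return hash(key) % num_partitions

def combine(key, values):
    # mini-reducer run in memory on map output before it crosses
    # the network; pre-summing is valid because sum is associative
    return key, sum(values)
```

Because every value for a given key hashes to the same partition, reducers for different keys can run in parallel without coordination.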
MapReduce Runtime

 Handles scheduling
Assigns workers to map and reduce tasks
 Handles data distribution
Moves the process to the data
 Handles synchronization
Gathers, sorts, and shuffles intermediate data
 Handles faults
Detects worker failures and restarts their tasks
 Everything happens on top of a distributed file system
Sum of Square
Sum of Square of Even and Odd
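The two sum-of-squares slides above are figures; a minimal Python sketch of the same computations in map/reduce style (an in-memory toy simulation, not Hadoop code — the "sum", "even", and "odd" keys are illustrative choices):

```python
from collections import defaultdict

def map_square(_, n):
    # emit ("sum", n^2) for the plain sum of squares, and tag
    # each square as even or odd for the second example
    yield ("sum", n * n)
    yield ("even" if n % 2 == 0 else "odd", n * n)

def reduce_sum(key, values):
    # reduce: all values with the same key are summed together
    return key, sum(values)

def run(nums):
    # model the shuffle by grouping mapped pairs by key
    groups = defaultdict(list)
    for n in nums:
        for k, v in map_square(None, n):
            groups[k].append(v)
    return dict(reduce_sum(k, vs) for k, vs in groups.items())

# run([1, 2, 3, 4]) -> {"sum": 30, "even": 20, "odd": 10}
```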
Map Reduce Architecture
Map Reduce with Combiner
Map Side
1. The map task writes its output to a circular memory buffer
2. Once the buffer reaches a threshold, the task starts to spill its contents to local disk
3. Before writing to disk, the data is partitioned according to the reducers it will be sent to
4. Each partition is sorted by key, and the combiner is run on the sorted output
5. Multiple spill files may be created by the time the map finishes; they are merged into a single partitioned, sorted output file
6. The output file's partitions are made available to reducers over HTTP
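Steps 3–4 above (partition the buffered output by reducer, then sort each partition by key) can be sketched as a toy in-memory model (the `spill` helper is hypothetical, not Hadoop's implementation):

```python
from collections import defaultdict

def spill(pairs, num_reducers):
    # step 3: route each (key, value) pair to a reducer partition
    # using hash(key) mod R
    partitions = defaultdict(list)
    for k, v in pairs:
        partitions[hash(k) % num_reducers].append((k, v))
    # step 4: sort each partition by key so a combiner (or later
    # the reducer) sees all values for a key together
    return {p: sorted(kvs) for p, kvs in partitions.items()}
```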
Reduce Side
1. The map outputs sit on the local disks of the map machines; reduce tasks need this data to proceed
2. Each reduce task needs the map output for its particular partition from several maps across the cluster
3. The reduce task starts copying map outputs as soon as each map completes (the copy phase); outputs are fetched in parallel by multiple threads
4. Map outputs are copied into the JVM's memory if small enough, otherwise to disk. As copies accumulate, they are merged into larger sorted files; once all are copied, they are merged while maintaining sort order
5. The reduce function is invoked for each key in the sorted output, and its output is written
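The merge in step 4 and the per-key reduce in step 5 can be sketched with Python's heapq.merge, which streams already-sorted runs into one sorted sequence (illustrative only, not the Hadoop implementation):

```python
import heapq

def merge_sorted_runs(runs):
    # each run is a list of (key, value) pairs already sorted by key;
    # merging preserves the overall sort order (step 4)
    return list(heapq.merge(*runs))

def reduce_phase(merged, reduce_fn):
    # step 5: invoke the reduce function once per key over
    # that key's grouped values
    out, i = [], 0
    while i < len(merged):
        key, values = merged[i][0], []
        while i < len(merged) and merged[i][0] == key:
            values.append(merged[i][1])
            i += 1
        out.append(reduce_fn(key, values))
    return out
```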
SEARCHING
Employee data is stored in four different files − A, B, C, and D.

Map Phase
 Processes each input file and emits the employee data as key-value pairs (<k, v> : <emp name, salary>)

Combiner Phase
 Accepts the input from the Map phase
 The combiner checks all the employee salaries to find the highest-salaried employee in each file

Reducer Phase
 Takes the per-file maxima produced by the combiners and outputs the single highest-salaried employee overall, e.g. <gopal,50000>
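The three phases above can be sketched in Python (a toy simulation; the file contents are invented for illustration, except the <gopal,50000> pair from the slide):

```python
def map_file(records):
    # map phase: emit <emp name, salary> pairs from one file
    return list(records)

def combine_max(pairs):
    # combiner: keep only the highest-salaried employee per file
    return max(pairs, key=lambda p: p[1])

def reduce_max(per_file_maxima):
    # reducer: the overall highest-salaried employee across files
    return max(per_file_maxima, key=lambda p: p[1])

files = {
    "A": [("satish", 26000), ("gopal", 50000)],  # hypothetical data
    "B": [("kiran", 45000)],                     # hypothetical data
}
maxima = [combine_max(map_file(recs)) for recs in files.values()]
# reduce_max(maxima) -> ("gopal", 50000)
```

The combiner is what keeps network traffic low here: each file forwards a single pair to the reducer instead of every employee record.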
Inverted Index
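The inverted-index slide is a figure; a minimal sketch of building one in map/reduce style (an in-memory toy where the shuffle is modeled by grouping doc ids per word):

```python
from collections import defaultdict

def map_doc(doc_id, text):
    # map: emit <word, doc id> for every word in the document
    for word in text.lower().split():
        yield word, doc_id

def build_inverted_index(docs):
    # shuffle + reduce: collect the set of doc ids per word,
    # then emit a sorted, de-duplicated posting list
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word, d in map_doc(doc_id, text):
            index[word].add(d)
    return {w: sorted(ds) for w, ds in index.items()}
```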
