Map Reduce Algorithm - Hadoop
18BCE2482
18BCE2488
18BCE2490
What is MapReduce?
Programmers specify two functions:
map (k, v) → <k', v'>*
reduce (k', [v']) → <k', v'>*
All values with the same key are reduced together
Handles scheduling
Assigns workers to map and reduce tasks
Handles data distribution
Moves the process to the data
Handles synchronization
Gathers, sorts, and shuffles intermediate data
Handles faults
Detects worker failures and restarts their tasks
Everything happens on top of a distributed FS
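The map/reduce contract above can be sketched as a single-machine simulation. This is a minimal sketch, not Hadoop's API: the `run_mapreduce` driver and the word-count `map_fn`/`reduce_fn` are illustrative names, and the grouping step stands in for the framework's shuffle and sort.

```python
from collections import defaultdict

def map_fn(key, value):
    # map(k, v) -> <k', v'>* ; here: emit (word, 1) for each word in a line
    for word in value.split():
        yield (word, 1)

def reduce_fn(key, values):
    # reduce(k', [v']) -> <k', v'>* ; all values with the same key arrive together
    yield (key, sum(values))

def run_mapreduce(inputs, map_fn, reduce_fn):
    # Shuffle: group intermediate values by key (what the framework handles)
    groups = defaultdict(list)
    for k, v in inputs:
        for k2, v2 in map_fn(k, v):
            groups[k2].append(v2)
    out = {}
    for k2 in sorted(groups):           # keys are presented to reduce in sorted order
        for k3, v3 in reduce_fn(k2, groups[k2]):
            out[k3] = v3
    return out

result = run_mapreduce([(0, "a b a")], map_fn, reduce_fn)
# result == {"a": 2, "b": 1}
```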
Sum of Squares
Sum of Squares of Even and Odd Numbers
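One way to phrase the even/odd sum-of-squares example as map and reduce: the mapper keys each square by the number's parity, and the reducer sums per key. The input list and function name below are illustrative, not from the slides.

```python
from collections import defaultdict

def map_parity_square(n):
    # emit (parity, n^2) so even and odd squares are reduced separately
    return ("even" if n % 2 == 0 else "odd", n * n)

groups = defaultdict(list)              # shuffle: group squares by parity key
for n in [1, 2, 3, 4, 5]:
    k, v = map_parity_square(n)
    groups[k].append(v)

sums = {k: sum(vs) for k, vs in groups.items()}   # reduce: sum values per key
# sums == {"odd": 1 + 9 + 25, "even": 4 + 16} == {"odd": 35, "even": 20}
```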
Map Reduce Architecture
Map Reduce with Combiner
Map Side
1. The map task writes its output to an in-memory circular buffer
2. Once the buffer fills to a threshold, its contents start spilling to local
disk
3. Before writing to disk, the data is partitioned according to the
reducers it will be sent to
4. Each partition is sorted by key, and the combiner is run on the sorted
output
5. Several spill files may exist by the time the map finishes.
These spill files are merged into a single partitioned, sorted output
file
6. The output file's partitions are made available to reducers over
HTTP
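Steps 3 and 4 above (partition, sort, combine during a spill) can be sketched as follows. This is a simplified model, not Hadoop internals: `spill` and `partition` are illustrative names, and the byte-sum partitioner is a deterministic stand-in for Hadoop's `HashPartitioner` (hash of key modulo number of reducers).

```python
def partition(key, num_reducers):
    # stand-in for Hadoop's HashPartitioner: hash(key) mod #reducers;
    # a byte sum keeps this example deterministic across runs
    return sum(key.encode()) % num_reducers

def spill(records, num_reducers, combiner):
    # records: (key, value) pairs buffered in memory before the spill
    parts = [[] for _ in range(num_reducers)]
    for k, v in records:                          # partition by target reducer
        parts[partition(k, num_reducers)].append((k, v))
    spilled = []
    for p in parts:
        p.sort(key=lambda kv: kv[0])              # sort each partition by key
        combined, i = [], 0
        while i < len(p):                         # run the combiner per key run
            j = i
            while j < len(p) and p[j][0] == p[i][0]:
                j += 1
            combined.append((p[i][0], combiner([v for _, v in p[i:j]])))
            i = j
        spilled.append(combined)
    return spilled

# word-count style records, 2 reducers, sum as the combiner
out = spill([("a", 1), ("b", 1), ("a", 1)], 2, sum)
# out == [[("b", 1)], [("a", 2)]]
```

Running the combiner at spill time shrinks the data written to disk and later shipped over HTTP, which is the point of step 4.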
Reduce Side
1. The map outputs sit on the mappers' local disks; reduce tasks need
this data in order to proceed
2. Each reduce task needs the map output for its particular partition from
many map tasks across the cluster
3. The reduce task starts copying the map outputs as soon as each
map completes. This is the copy phase. The map outputs are
fetched in parallel by multiple threads
4. Map outputs are copied into the reduce task's JVM memory if small enough,
otherwise to disk. As copies accumulate, they are merged into larger
sorted files. When all are copied, they are merged while maintaining
their sort order
5. The reduce function is invoked for each key in the sorted output, and the
output is written
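The merge in step 4 and the per-key reduce in step 5 can be simulated with the standard library: `heapq.merge` combines already-sorted runs without re-sorting, and `itertools.groupby` hands each key's values to the reduce function. The two map-output lists below are made-up sample data.

```python
import heapq
from itertools import groupby

# Each map task's output for this reducer's partition is already key-sorted
map_outputs = [
    [("ant", 1), ("bee", 2)],     # hypothetical output from map task 1
    [("ant", 3), ("cat", 1)],     # hypothetical output from map task 2
]

# Merge the sorted runs while maintaining sort order (the merge phase)
merged = heapq.merge(*map_outputs, key=lambda kv: kv[0])

# Invoke the reduce function (here: sum) once per key in the sorted stream
result = {k: sum(v for _, v in grp)
          for k, grp in groupby(merged, key=lambda kv: kv[0])}
# result == {"ant": 4, "bee": 2, "cat": 1}
```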
SEARCHING
Suppose employee data is stored in four different files: A, B, C, and D
Map Phase
- processes each input file and emits the employee data as
key-value pairs (<k, v> = <emp name, salary>)
Combiner phase
accepts the output of the Map phase
for each file, the combiner checks all the employee salaries to find the highest-paid
employee in that file
Reducer phase
receives the highest-paid employee from each file and outputs the overall highest:
<gopal, 50000>
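The searching example can be sketched end to end. Only the record `<gopal, 50000>` comes from the slides; the other names, salaries, and the `combiner` helper are made-up placeholders.

```python
# Four hypothetical input files of (emp name, salary) pairs;
# only ("gopal", 50000) is taken from the slides
files = {
    "A": [("satish", 26000), ("gopal", 50000)],
    "B": [("krishna", 25000), ("kiran", 45000)],
    "C": [("manisha", 45000), ("gopal", 50000)],
    "D": [("kiran", 45000), ("krishna", 25000)],
}

def combiner(records):
    # per-file combiner: keep only that file's highest-paid employee
    return max(records, key=lambda kv: kv[1])

# The reducer sees one (name, salary) pair per file and takes the overall maximum
per_file_max = [combiner(recs) for recs in files.values()]
name, salary = max(per_file_max, key=lambda kv: kv[1])
# (name, salary) == ("gopal", 50000)
```

The combiner cuts the reducer's input from all records down to one record per file, which is exactly why it is run on the map side.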
Inverted Index
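An inverted index maps each word to the set of documents containing it; in MapReduce terms, the mapper emits (word, doc id) pairs and the reducer collects the ids per word. The sketch below simulates both phases in one pass; the documents are made-up samples.

```python
from collections import defaultdict

# Hypothetical document collection: doc id -> text
docs = {"d1": "hadoop map reduce", "d2": "map side spill"}

# Map: emit (word, doc_id) for every word; Reduce: collect doc ids per word
index = defaultdict(set)
for doc_id, text in docs.items():
    for word in text.split():
        index[word].add(doc_id)

# index["map"] == {"d1", "d2"}; index["hadoop"] == {"d1"}
```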