Day 6
Created buckets (dim_tables):
1- CustomerName: has two cols: name
2- Product: has two cols: name, category
3- Orders: date, qty, totalprice
MapReduce
MapReduce is a software framework (a “template”) for
developing applications that process large amounts of
data in parallel across a distributed environment. A
MapReduce job consists of two main phases:
1- Map
2- Reduce
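To make the two phases concrete, here is a minimal single-process sketch of a word count in plain Python. It is not Hadoop code: the function names (map_phase, reduce_phase, run_word_count) are made up for illustration, and the grouping step in the middle stands in for the work the framework does between the phases.

```python
from collections import defaultdict

def map_phase(line):
    """Map: emit a (word, 1) pair for every word in the line."""
    return [(word, 1) for word in line.split()]

def reduce_phase(word, counts):
    """Reduce: collapse all values that share a key into one result."""
    return word, sum(counts)

def run_word_count(lines):
    # Group the intermediate pairs by key. In a real framework this
    # grouping happens between the map and reduce phases.
    grouped = defaultdict(list)
    for line in lines:
        for word, one in map_phase(line):
            grouped[word].append(one)
    # Call the reducer once per key, in sorted key order.
    return dict(reduce_phase(w, c) for w, c in sorted(grouped.items()))

lines = ["apple banana apple", "banana apple mango", "mango banana banana"]
word_counts = run_word_count(lines)
print(word_counts)  # {'apple': 3, 'banana': 4, 'mango': 2}
```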
1. Input Splitting
Before the mappers start, the input data (a file or
dataset) is divided into smaller splits. Each split is
processed by one mapper. This ensures parallelism.
Example Input File:
Line 1: apple banana apple
Line 2: banana apple mango
Line 3: mango banana banana
Input Splits:
o Split 1: apple banana apple
o Split 2: banana apple mango
o Split 3: mango banana banana
Each split is sent to a mapper.
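The splitting step can be sketched in Python as below. This is only an illustration: make_splits and lines_per_split are hypothetical names, and splitting by line count is an assumption chosen to match the example. Real frameworks typically split by data size rather than by number of lines.

```python
def make_splits(lines, lines_per_split=1):
    """Cut the input into consecutive chunks, one chunk per mapper."""
    return [
        lines[i:i + lines_per_split]
        for i in range(0, len(lines), lines_per_split)
    ]

input_lines = [
    "apple banana apple",
    "banana apple mango",
    "mango banana banana",
]

splits = make_splits(input_lines)
for n, split in enumerate(splits, start=1):
    # Each split would be handed to its own mapper.
    print(f"Split {n} -> Mapper {n}: {split}")
```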
2. Mapper Execution
Each mapper processes its assigned split line by line
and applies the map function to generate
intermediate key-value pairs.
What Happens Inside a Mapper
1. Read Input:
o The mapper reads its assigned split.
o Example: Mapper 1 gets apple banana apple.
2. Apply Map Function:
o The map function processes each line and
extracts key-value pairs based on the logic
defined.
o For example, if we’re counting words:
Split the line into words.
For each word, emit a key-value pair
where the key is the word, and the value
is 1.
o Mapper 1 Output: {apple: 1}, {banana: 1}, {apple: 1}
3. Buffer Intermediate Data:
o The mapper temporarily stores these key-
value pairs in memory.
o If the buffer is full, it spills the data to disk in
sorted chunks.
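The three mapper steps above can be sketched as follows. The names map_function and run_mapper are assumptions for illustration, and the buffer here is just a Python list with no size limit, so nothing is spilled to disk.

```python
def map_function(line):
    """Word-count map logic: emit (word, 1) for each word in the line."""
    for word in line.split():
        yield (word, 1)

def run_mapper(split):
    """Read the split line by line and collect the emitted pairs."""
    buffer = []  # stands in for the in-memory buffer
    for line in split:
        buffer.extend(map_function(line))
    return buffer

mapper1_output = run_mapper(["apple banana apple"])
print(mapper1_output)  # [('apple', 1), ('banana', 1), ('apple', 1)]
```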
3. Partitioning
Once the mapper has processed its input split:
The key-value pairs are partitioned based on the
key and the number of reducers.
Example:
o Assume 2 reducers.
o A partitioner determines which keys go to
which reducer.
o Keys like apple and mango go to Reducer 1,
while banana goes to Reducer 2.
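A common way to partition is to hash the key and take the remainder modulo the number of reducers. The sketch below assumes that approach and uses zlib.crc32 as a stable hash (Python's built-in hash() for strings varies between runs). The partition function is a hypothetical helper, and the reducer it picks for each word depends on the hash, so it need not match the assignment in the example above.

```python
import zlib

def partition(key, num_reducers):
    """Return a reducer index in the range 0 .. num_reducers - 1."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers

for key in ["apple", "banana", "mango"]:
    # The same key always maps to the same reducer index.
    print(key, "-> reducer index", partition(key, 2))
```

Because the function depends only on the key and the reducer count, every mapper sends a given word to the same reducer.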
4. Combiner (Optional)
Before sending the intermediate data to the reducers, a
combiner may run on the mapper output to reduce the
amount of data transferred.
The combiner performs a mini-reduce locally at
each mapper.
Example:
o Mapper 1 Output: {apple: 1}, {banana: 1},
{apple: 1}
o After Combiner: {apple: 2}, {banana: 1}
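For word count, the combiner can be sketched as a local sum over one mapper's output. The combine function below is an illustrative name, not a framework API. It works here because addition gives the same total whether it is applied once or in stages.

```python
from collections import Counter

def combine(pairs):
    """Mini-reduce on one mapper's output: sum the values per key."""
    totals = Counter()
    for key, value in pairs:
        totals[key] += value
    return sorted(totals.items())

mapper1_output = [("apple", 1), ("banana", 1), ("apple", 1)]
combined = combine(mapper1_output)
print(combined)  # [('apple', 2), ('banana', 1)]
```

Three pairs shrink to two before leaving the mapper, which is the saving the combiner is meant to provide.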
Detailed Example
Input:
Split 1 (to Mapper 1): apple banana apple
Steps Inside Mapper:
1. Read Data:
o Read the line: apple banana apple
2. Split Line into Words:
o Words: ["apple", "banana", "apple"]
3. Emit Key-Value Pairs:
o apple -> 1
o banana -> 1
o apple -> 1
4. Buffer and Sort:
o {apple: [1, 1], banana: [1]}
5. Partition (for Reducers):
o Reducer 1: apple -> [1, 1]
o Reducer 2: banana -> [1]
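Steps 4 and 5 can be sketched in Python as below. The names sort_and_group and REDUCER_FOR_KEY are assumptions for illustration. The lookup table simply copies the key-to-reducer assignment used in these notes; a real partitioner would compute it from the key.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical fixed assignment, copied from the example in the notes.
REDUCER_FOR_KEY = {"apple": 1, "mango": 1, "banana": 2}

def sort_and_group(pairs):
    """Sort pairs by key, then collect the values of each key in a list."""
    ordered = sorted(pairs, key=itemgetter(0))
    return {
        key: [value for _, value in group]
        for key, group in groupby(ordered, key=itemgetter(0))
    }

grouped = sort_and_group([("apple", 1), ("banana", 1), ("apple", 1)])
print(grouped)  # {'apple': [1, 1], 'banana': [1]}

# Route each key's values to the reducer chosen for that key.
by_reducer = {}
for key, values in grouped.items():
    by_reducer.setdefault(REDUCER_FOR_KEY[key], {})[key] = values
print(by_reducer)  # {1: {'apple': [1, 1]}, 2: {'banana': [1]}}
```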
Example: Buffer Spill and Merge
Scenario:
Mapper processes an input split and generates
these intermediate key-value pairs:
{apple: 1}, {banana: 1}, {apple: 1}, {mango: 1},
{banana: 1}, {mango: 1}
The memory buffer can hold only 3 key-value pairs
at a time.
Spill Files:
Spill File 1 (after the first buffer overflow):
{apple: 1}, {banana: 1}, {apple: 1} -> Sorted:
{apple: [1, 1], banana: [1]}
Spill File 2 (after the second buffer overflow):
{mango: 1}, {banana: 1}, {mango: 1} -> Sorted:
{banana: [1], mango: [1, 1]}
Merging Spill Files:
Merge Spill File 1 and Spill File 2 into a single,
sorted file:
{apple: [1, 1]}, {banana: [1, 1]}, {mango: [1, 1]}
Final Output:
Partitioned and ready to send to reducers:
o Reducer 1: {apple: [1, 1], mango: [1, 1]}
o Reducer 2: {banana: [1, 1]}
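The spill-and-merge behaviour in this example can be sketched as follows. This is a simplified in-memory model: spill_runs, merge_runs and BUFFER_CAPACITY are hypothetical names, the "spill files" are plain Python lists rather than files on disk, and the buffer spills exactly when it holds 3 pairs.

```python
import heapq
from itertools import groupby
from operator import itemgetter

BUFFER_CAPACITY = 3  # assumed limit, taken from the scenario above

def spill_runs(pairs, capacity=BUFFER_CAPACITY):
    """Fill the buffer; each time it is full, write out a sorted run."""
    runs, buffer = [], []
    for pair in pairs:
        buffer.append(pair)
        if len(buffer) == capacity:
            runs.append(sorted(buffer))
            buffer = []
    if buffer:  # flush whatever is left at the end
        runs.append(sorted(buffer))
    return runs

def merge_runs(runs):
    """Merge the sorted runs and collect the values of each key."""
    merged = heapq.merge(*runs)  # inputs are already sorted
    return {
        key: [value for _, value in group]
        for key, group in groupby(merged, key=itemgetter(0))
    }

pairs = [("apple", 1), ("banana", 1), ("apple", 1),
         ("mango", 1), ("banana", 1), ("mango", 1)]
runs = spill_runs(pairs)
merged = merge_runs(runs)
print(runs)
print(merged)  # {'apple': [1, 1], 'banana': [1, 1], 'mango': [1, 1]}
```

heapq.merge only needs to look at the head of each run, which is why sorting each spill before writing it makes the final merge cheap.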