Day 6

At the start we discussed lab assignment 3.

In this table there is no unique column:

CustomerName, ProductName, Category, SaleDate, Quantity, TotalPrice

Date/year columns can be made into separate dimension tables.

So, in order to tackle the above problem, we create unique key columns using AUTO_INCREMENT.

Created buckets (dim tables):
1- Customer: name
2- Product: name, category
3- Orders: date, qty, totalprice
Each dim table gets its own AUTO_INCREMENT key as its unique column.
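To make this concrete, here is a hedged sketch (not from the lab handout; the table names, column types, and connection details are all assumptions) of creating the dimension tables with AUTO_INCREMENT surrogate keys through JDBC:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

// Hypothetical schema: each dim table gets an AUTO_INCREMENT surrogate key,
// since the source data itself has no unique column. Names and types are
// illustrative; the fact table references the dims by their new keys.
public class CreateDimTables {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:mysql://localhost/lab3", "user", "password"); // placeholder credentials
             Statement st = conn.createStatement()) {
            st.executeUpdate("CREATE TABLE dim_customer ("
                    + "customer_id INT AUTO_INCREMENT PRIMARY KEY, "
                    + "name VARCHAR(100))");
            st.executeUpdate("CREATE TABLE dim_product ("
                    + "product_id INT AUTO_INCREMENT PRIMARY KEY, "
                    + "name VARCHAR(100), category VARCHAR(50))");
            st.executeUpdate("CREATE TABLE fact_orders ("
                    + "order_id INT AUTO_INCREMENT PRIMARY KEY, "
                    + "sale_date DATE, qty INT, total_price DECIMAL(10,2), "
                    + "customer_id INT, product_id INT)");
        }
    }
}
```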

MapReduce
MapReduce is a software framework, a "template", for developing applications that process large amounts of data in parallel across a distributed environment. A MapReduce job consists of two main phases:
1- Map
2- Reduce

1- Map: data is input into mappers, where it is transformed and prepared for the reducer.
2- Reduce: retrieves the data from the mappers and performs the desired computation or analysis.
The shuffle-and-sort phase of MapReduce is part of the framework, so it does not require any programming on your part (the developer's).
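The classic illustration of these two phases is word counting. Below is a minimal sketch using the Hadoop Java API (class names follow the standard WordCount example; treat it as an outline, not assignment code):

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

    // Map phase: for every word in the input line, emit (word, 1).
    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE); // e.g. ("apple", 1)
            }
        }
    }

    // Reduce phase: sum the 1s collected for each word.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result); // e.g. ("apple", 5)
        }
    }
}
```

The shuffle and sort between the two classes is done by the framework, exactly as noted above.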
MAP PHASE
The Map Phase in the MapReduce framework is the
first step where raw input data is transformed into
intermediate key-value pairs. Let’s break it down into
detailed steps:

1. Input Splitting
Before the mappers start, the input data (a file or
dataset) is divided into smaller splits. Each split is
processed by one mapper. This ensures parallelism.
 Example Input File:
 Line 1: apple banana apple
 Line 2: banana apple mango
 Line 3: mango banana banana
 Input Splits:
o Split 1: apple banana apple
o Split 2: banana apple mango
o Split 3: mango banana banana
Each split is sent to a mapper.
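In code, the job driver's input format decides the splits. The sketch below (driver and path handling are illustrative) wires the WordCount classes above into a job; FileInputFormat computes the splits, roughly one per HDFS block for text files, and the framework launches one map task per split:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Illustrative driver: FileInputFormat decides the input splits, and the
// framework runs one TokenizerMapper task per split.
public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCount.TokenizerMapper.class);
        job.setCombinerClass(WordCount.IntSumReducer.class); // optional mini-reduce (section 4 below)
        job.setReducerClass(WordCount.IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```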

2. Mapper Execution
Each mapper processes its assigned split line by line
and applies the map function to generate
intermediate key-value pairs.
What Happens Inside a Mapper
1. Read Input:
o The mapper reads its assigned split.
o Example: Mapper 1 gets apple banana apple.
2. Apply Map Function:
o The map function processes each line and
extracts key-value pairs based on the logic
defined.
o For example, if we’re counting words:
 Split the line into words.
 For each word, emit a key-value pair
where the key is the word, and the value
is 1.
o Mapper 1 Output:
o {apple: 1}, {banana: 1}, {apple: 1}
3. Buffer Intermediate Data:
o The mapper temporarily stores these key-
value pairs in memory.
o If the buffer is full, it spills the data to disk in
sorted chunks.

3. Partitioning
Once the mapper has processed its input split:
 The key-value pairs are partitioned based on the
key and the number of reducers.
 Example:
o Assume 2 reducers.
o A partitioner determines which keys go to
which reducer.
o Keys like apple and mango go to Reducer 1,
while banana goes to Reducer 2.
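Hadoop's default partitioner hashes the key and takes the remainder modulo the number of reducers. A minimal custom partitioner using the same formula is sketched below (the class name is made up; which reducer a given word actually lands on depends on its hash, so the apple/banana assignment above is illustrative):

```java
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Same formula as Hadoop's default HashPartitioner: mask off the sign bit
// of the key's hash, then take the remainder modulo the reducer count, so
// every occurrence of the same key goes to the same reducer.
public class WordPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}
```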

4. Combiner (Optional)
Before sending the intermediate data to the reducers, a
combiner may run on the mapper output to reduce the
amount of data transferred.
 The combiner performs a mini-reduce locally at
each mapper.
 Example:
o Mapper 1 Output: {apple: 1}, {banana: 1},
{apple: 1}
o After Combiner: {apple: 2}, {banana: 1}
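In the word-count job above, the reducer itself can serve as the combiner (job.setCombinerClass in the driver sketch), because summing counts is associative and commutative. A toy, framework-free simulation of what the combiner does with one mapper's output:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Simulates a combiner: locally sum values sharing a key before anything
// is sent across the network to a reducer.
public class CombinerDemo {
    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> mapperOutput = List.of(
                Map.entry("apple", 1), Map.entry("banana", 1), Map.entry("apple", 1));
        Map<String, Integer> combined = new HashMap<>();
        for (Map.Entry<String, Integer> kv : mapperOutput) {
            combined.merge(kv.getKey(), kv.getValue(), Integer::sum);
        }
        System.out.println(combined); // {banana=1, apple=2} (order may vary)
    }
}
```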

5. Shuffling and Sorting


 The intermediate key-value pairs are sorted by key
within each mapper.
 Sorted data is prepared for transfer to reducers in
the next phase.

Detailed Example
Input:
Split 1 (to Mapper 1): apple banana apple
Steps Inside Mapper:
1. Read Data:
o Read the line: apple banana apple
2. Split Line into Words:
o Words: ["apple", "banana", "apple"]
3. Emit Key-Value Pairs:
o apple -> 1
o banana -> 1
o apple -> 1
4. Buffer and Sort:
o {apple: [1, 1], banana: [1]}
5. Partition (for Reducers):
o Reducer 1: apple -> [1, 1]
o Reducer 2: banana -> [1]

Mapper Output (Intermediate Data)


For the entire dataset:
 Mapper 1:
 {apple: 1}, {banana: 1}, {apple: 1}
 Mapper 2:
 {banana: 1}, {apple: 1}, {mango: 1}
 Mapper 3:
 {mango: 1}, {banana: 1}, {banana: 1}
This intermediate data is then shuffled and sent to the
reducers for further processing.

1- The mapper reads data from HDFS line by line.
2- The mapper generates key-value pairs.
3- The map method's output key-value pairs are serialized and stored in buffer memory.
4- When the buffer fills up, or when the map task is complete, the key-value pairs in buffer memory are sorted and spilled to local disk as files known as intermediate files.
5- If more than one spill file was created, these files are merged into a single file of sorted key-value pairs. The sorted records in the spill file wait to be retrieved by the reducers.

The spill and merge process in the Map Phase of MapReduce optimizes memory usage and prepares data for the reducer. Here's a detailed explanation of what happens when more than one spill file is created:
1. Why Spill Files Are Created
 During the map phase, the intermediate key-value
pairs generated by the mapper are first stored in
memory (in a buffer).
 When this buffer becomes full (reaches a
threshold), the data is written ("spilled") to disk to
avoid running out of memory.
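The buffer size and spill threshold are configurable per job. A hedged sketch (the property names are the standard Hadoop 2.x ones; the values shown are the usual defaults, not recommendations):

```java
import org.apache.hadoop.conf.Configuration;

// Knobs governing the spill behaviour described above: the in-memory sort
// buffer size, and the fill fraction at which a background thread starts
// spilling sorted chunks to disk.
public class SpillTuning {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        conf.set("mapreduce.task.io.sort.mb", "100");         // sort buffer, in MB
        conf.set("mapreduce.map.sort.spill.percent", "0.80"); // spill at 80% full
        System.out.println(conf.get("mapreduce.task.io.sort.mb"));
    }
}
```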

2. Multiple Spill Files


 If the mapper processes a large input split, the
intermediate data may exceed the memory buffer
multiple times, causing multiple spill files to be
created on disk.
 Each spill file contains:
o Sorted Key-Value Pairs: The data is sorted
by key before being written to disk.
o Partitions: The spill file is divided into
partitions, one for each reducer, based on the
partitioning function.

3. Merging Spill Files


To make the data transfer to reducers efficient:
 All the spill files are merged into a single, final file.
 During merging:
o Data from all spill files is combined in sorted
order.
o Duplicate keys (within a partition) are not
aggregated yet—that's the reducer's job.
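A toy, framework-free illustration of this merge (the real merge streams partition-by-partition from disk; here two in-memory "spill files" stand in for it):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Merging two sorted spill files: keys stay in sorted order, and values for
// a duplicate key are concatenated, not aggregated; aggregation is left to
// the reducer (or to a combiner, if one runs during the merge).
public class SpillMergeDemo {
    public static void main(String[] args) {
        Map<String, List<Integer>> spill1 = new TreeMap<>(Map.of(
                "apple", List.of(1, 1), "banana", List.of(1)));
        Map<String, List<Integer>> spill2 = new TreeMap<>(Map.of(
                "banana", List.of(1), "mango", List.of(1, 1)));

        Map<String, List<Integer>> merged = new TreeMap<>();
        spill1.forEach((k, v) -> merged.computeIfAbsent(k, x -> new ArrayList<>()).addAll(v));
        spill2.forEach((k, v) -> merged.computeIfAbsent(k, x -> new ArrayList<>()).addAll(v));

        System.out.println(merged); // {apple=[1, 1], banana=[1, 1], mango=[1, 1]}
    }
}
```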
4. Combiner During Merge (Optional)
 If a combiner function is used, it may run during
the merge process to reduce the size of the
intermediate data.
 Example: If multiple spill files contain {apple: 1},
{apple: 1}, the combiner aggregates them into
{apple: 2} within the same partition.

5. Final Output of the Mapper


 After merging, the mapper produces a single
sorted file for each partition (reducer).
 This file is then shuffled to its assigned reducer.

Example
Scenario:
 Mapper processes an input split and generates
these intermediate key-value pairs:
 {apple: 1}, {banana: 1}, {apple: 1}, {mango: 1},
{banana: 1}, {mango: 1}
 The memory buffer can hold only 3 key-value pairs
at a time.
Spill Files:
 Spill File 1 (after the first buffer overflow):
 {apple: 1}, {banana: 1}, {apple: 1} -> Sorted:
{apple: [1, 1], banana: [1]}
 Spill File 2 (after the second buffer overflow):
 {mango: 1}, {banana: 1}, {mango: 1} -> Sorted:
{banana: [1], mango: [1, 1]}
Merging Spill Files:
 Merge Spill File 1 and Spill File 2 into a single,
sorted file:
 {apple: [1, 1]}, {banana: [1, 1]}, {mango: [1, 1]}
Final Output:
 Partitioned and ready to send to reducers:
o Reducer 1: {apple: [1, 1]}
o Reducer 2: {banana: [1, 1], mango: [1, 1]}

6. Advantages of Spill and Merge


 Efficient Memory Usage: Ensures the mapper
does not run out of memory when processing large
datasets.
 Optimized Disk I/O: Reduces the number of files
written to disk by merging multiple spill files.
 Sorted Data: Prepares data for efficient shuffling
and reduces sorting work for the reducers.
This process is a critical optimization step in the
MapReduce framework, ensuring scalability for massive
datasets.
As mappers finish their tasks, the reducers start fetching the records.
After the spill and merge, once all mappers complete, the reduce method gets invoked.
The output of the reducer is written to HDFS (by default), or wherever the output was configured to be sent.
