Day 6

At the start we discussed lab assignment 3.

In this table there is no unique column:

CustomerName, ProductName, Category, SaleDate, Quantity, TotalPrice

Date/year columns can be made into separate dimension tables.

So, in order to tackle the above problem, we create unique key columns using AUTO_INCREMENT.

Created buckets (dim tables):
1- Customer: name
2- Product: name, category
3- Orders: date, qty, totalprice
Each dim table gets its own AUTO_INCREMENT key as its unique column.
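To make this concrete, here is a hedged sketch (not from the lab handout; the table names, column types, and connection details are all assumptions) of creating the dimension tables with AUTO_INCREMENT surrogate keys through JDBC:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

// Hypothetical schema: each dim table gets an AUTO_INCREMENT surrogate key,
// since the source data itself has no unique column. Names and types are
// illustrative; the fact table references the dims by their new keys.
public class CreateDimTables {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:mysql://localhost/lab3", "user", "password"); // placeholder credentials
             Statement st = conn.createStatement()) {
            st.executeUpdate("CREATE TABLE dim_customer ("
                    + "customer_id INT AUTO_INCREMENT PRIMARY KEY, "
                    + "name VARCHAR(100))");
            st.executeUpdate("CREATE TABLE dim_product ("
                    + "product_id INT AUTO_INCREMENT PRIMARY KEY, "
                    + "name VARCHAR(100), category VARCHAR(50))");
            st.executeUpdate("CREATE TABLE fact_orders ("
                    + "order_id INT AUTO_INCREMENT PRIMARY KEY, "
                    + "sale_date DATE, qty INT, total_price DECIMAL(10,2), "
                    + "customer_id INT, product_id INT)");
        }
    }
}
```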

MapReduce
MapReduce is a software framework, a "template", for developing applications that process large amounts of data in parallel across a distributed environment. A MapReduce job consists of two main phases:
1- Map
2- Reduce

1- Map: data is input into mappers, where it is transformed and prepared for the reducer.
2- Reduce: retrieves the data from the mappers and performs the desired computation or analysis.
The shuffle-and-sort phase of MapReduce is part of the framework, so it does not require any programming on your part (the developer's).
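The classic illustration of these two phases is word counting. Below is a minimal sketch using the Hadoop Java API (class names follow the standard WordCount example; treat it as an outline, not assignment code):

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

    // Map phase: for every word in the input line, emit (word, 1).
    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE); // e.g. ("apple", 1)
            }
        }
    }

    // Reduce phase: sum the 1s collected for each word.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result); // e.g. ("apple", 5)
        }
    }
}
```

The shuffle and sort between the two classes is done by the framework, exactly as noted above.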
MAP PHASE
The Map Phase in the MapReduce framework is the
first step where raw input data is transformed into
intermediate key-value pairs. Let’s break it down into
detailed steps:

1. Input Splitting
Before the mappers start, the input data (a file or
dataset) is divided into smaller splits. Each split is
processed by one mapper. This ensures parallelism.
 Example Input File:
 Line 1: apple banana apple
 Line 2: banana apple mango
 Line 3: mango banana banana
 Input Splits:
o Split 1: apple banana apple
o Split 2: banana apple mango
o Split 3: mango banana banana
Each split is sent to a mapper.
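In code, the job driver's input format decides the splits. The sketch below (driver and path handling are illustrative) wires the WordCount classes above into a job; FileInputFormat computes the splits, roughly one per HDFS block for text files, and the framework launches one map task per split:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Illustrative driver: FileInputFormat decides the input splits, and the
// framework runs one TokenizerMapper task per split.
public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCount.TokenizerMapper.class);
        job.setCombinerClass(WordCount.IntSumReducer.class); // optional mini-reduce (section 4 below)
        job.setReducerClass(WordCount.IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```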

2. Mapper Execution
Each mapper processes its assigned split line by line
and applies the map function to generate
intermediate key-value pairs.
What Happens Inside a Mapper
1. Read Input:
o The mapper reads its assigned split.
o Example: Mapper 1 gets apple banana apple.
2. Apply Map Function:
o The map function processes each line and
extracts key-value pairs based on the logic
defined.
o For example, if we’re counting words:
 Split the line into words.
 For each word, emit a key-value pair
where the key is the word, and the value
is 1.
o Mapper 1 Output:
o {apple: 1}, {banana: 1}, {apple: 1}
3. Buffer Intermediate Data:
o The mapper temporarily stores these key-
value pairs in memory.
o If the buffer is full, it spills the data to disk in
sorted chunks.

3. Partitioning
Once the mapper has processed its input split:
 The key-value pairs are partitioned based on the
key and the number of reducers.
 Example:
o Assume 2 reducers.
o A partitioner determines which keys go to
which reducer.
o Keys like apple and mango go to Reducer 1,
while banana goes to Reducer 2.
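Hadoop's default partitioner hashes the key and takes the remainder modulo the number of reducers. A minimal custom partitioner using the same formula is sketched below (the class name is made up; which reducer a given word actually lands on depends on its hash, so the apple/banana assignment above is illustrative):

```java
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Same formula as Hadoop's default HashPartitioner: mask off the sign bit
// of the key's hash, then take the remainder modulo the reducer count, so
// every occurrence of the same key goes to the same reducer.
public class WordPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}
```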

4. Combiner (Optional)
Before sending the intermediate data to the reducers, a
combiner may run on the mapper output to reduce the
amount of data transferred.
 The combiner performs a mini-reduce locally at
each mapper.
 Example:
o Mapper 1 Output: {apple: 1}, {banana: 1},
{apple: 1}
o After Combiner: {apple: 2}, {banana: 1}
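In the word-count job above, the reducer itself can serve as the combiner (job.setCombinerClass in the driver sketch), because summing counts is associative and commutative. A toy, framework-free simulation of what the combiner does with one mapper's output:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Simulates a combiner: locally sum values sharing a key before anything
// is sent across the network to a reducer.
public class CombinerDemo {
    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> mapperOutput = List.of(
                Map.entry("apple", 1), Map.entry("banana", 1), Map.entry("apple", 1));
        Map<String, Integer> combined = new HashMap<>();
        for (Map.Entry<String, Integer> kv : mapperOutput) {
            combined.merge(kv.getKey(), kv.getValue(), Integer::sum);
        }
        System.out.println(combined); // {banana=1, apple=2} (order may vary)
    }
}
```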

5. Shuffling and Sorting


 The intermediate key-value pairs are sorted by key
within each mapper.
 Sorted data is prepared for transfer to reducers in
the next phase.

Detailed Example
Input:
Split 1 (to Mapper 1): apple banana apple
Steps Inside Mapper:
1. Read Data:
o Read the line: apple banana apple
2. Split Line into Words:
o Words: ["apple", "banana", "apple"]
3. Emit Key-Value Pairs:
o apple -> 1
o banana -> 1
o apple -> 1
4. Buffer and Sort:
o {apple: [1, 1], banana: [1]}
5. Partition (for Reducers):
o Reducer 1: apple -> [1, 1]
o Reducer 2: banana -> [1]

Mapper Output (Intermediate Data)


For the entire dataset:
 Mapper 1:
 {apple: 1}, {banana: 1}, {apple: 1}
 Mapper 2:
 {banana: 1}, {apple: 1}, {mango: 1}
 Mapper 3:
 {mango: 1}, {banana: 1}, {banana: 1}
This intermediate data is then shuffled and sent to the
reducers for further processing.

1- The mapper reads data from HDFS line by line.
2- The mapper generates key-value pairs.
3- The map method's output key-value pairs are serialized and stored in buffer memory.
4- When the buffer fills up, or when the map task is complete, the key-value pairs in buffer memory are sorted and spilled to local disk as files known as intermediate files.
5- If more than one spill file was created, these files are merged into a single file of sorted key-value pairs. The sorted records in the spill file wait to be retrieved by the reducers.

The spill and merge process in the Map Phase of MapReduce optimizes memory usage and prepares data for the reducer. Here's a detailed explanation of what happens when more than one spill file is created:
1. Why Spill Files Are Created
 During the map phase, the intermediate key-value
pairs generated by the mapper are first stored in
memory (in a buffer).
 When this buffer becomes full (reaches a
threshold), the data is written ("spilled") to disk to
avoid running out of memory.
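The buffer size and spill threshold are configurable per job. A hedged sketch (the property names are the standard Hadoop 2.x ones; the values shown are the usual defaults, not recommendations):

```java
import org.apache.hadoop.conf.Configuration;

// Knobs governing the spill behaviour described above: the in-memory sort
// buffer size, and the fill fraction at which a background thread starts
// spilling sorted chunks to disk.
public class SpillTuning {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        conf.set("mapreduce.task.io.sort.mb", "100");         // sort buffer, in MB
        conf.set("mapreduce.map.sort.spill.percent", "0.80"); // spill at 80% full
        System.out.println(conf.get("mapreduce.task.io.sort.mb"));
    }
}
```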

2. Multiple Spill Files


 If the mapper processes a large input split, the
intermediate data may exceed the memory buffer
multiple times, causing multiple spill files to be
created on disk.
 Each spill file contains:
o Sorted Key-Value Pairs: The data is sorted
by key before being written to disk.
o Partitions: The spill file is divided into
partitions, one for each reducer, based on the
partitioning function.

3. Merging Spill Files


To make the data transfer to reducers efficient:
 All the spill files are merged into a single, final file.
 During merging:
o Data from all spill files is combined in sorted
order.
o Duplicate keys (within a partition) are not
aggregated yet—that's the reducer's job.
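A toy, framework-free illustration of this merge (the real merge streams partition-by-partition from disk; here two in-memory "spill files" stand in for it):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Merging two sorted spill files: keys stay in sorted order, and values for
// a duplicate key are concatenated, not aggregated; aggregation is left to
// the reducer (or to a combiner, if one runs during the merge).
public class SpillMergeDemo {
    public static void main(String[] args) {
        Map<String, List<Integer>> spill1 = new TreeMap<>(Map.of(
                "apple", List.of(1, 1), "banana", List.of(1)));
        Map<String, List<Integer>> spill2 = new TreeMap<>(Map.of(
                "banana", List.of(1), "mango", List.of(1, 1)));

        Map<String, List<Integer>> merged = new TreeMap<>();
        spill1.forEach((k, v) -> merged.computeIfAbsent(k, x -> new ArrayList<>()).addAll(v));
        spill2.forEach((k, v) -> merged.computeIfAbsent(k, x -> new ArrayList<>()).addAll(v));

        System.out.println(merged); // {apple=[1, 1], banana=[1, 1], mango=[1, 1]}
    }
}
```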
4. Combiner During Merge (Optional)
 If a combiner function is used, it may run during
the merge process to reduce the size of the
intermediate data.
 Example: If multiple spill files contain {apple: 1},
{apple: 1}, the combiner aggregates them into
{apple: 2} within the same partition.

5. Final Output of the Mapper


 After merging, the mapper produces a single
sorted file for each partition (reducer).
 This file is then shuffled to its assigned reducer.

Example
Scenario:
 Mapper processes an input split and generates
these intermediate key-value pairs:
 {apple: 1}, {banana: 1}, {apple: 1}, {mango: 1},
{banana: 1}, {mango: 1}
 The memory buffer can hold only 3 key-value pairs
at a time.
Spill Files:
 Spill File 1 (after the first buffer overflow):
 {apple: 1}, {banana: 1}, {apple: 1} -> Sorted:
{apple: [1, 1], banana: [1]}
 Spill File 2 (after the second buffer overflow):
 {mango: 1}, {banana: 1}, {mango: 1} -> Sorted:
{banana: [1], mango: [1, 1]}
Merging Spill Files:
 Merge Spill File 1 and Spill File 2 into a single,
sorted file:
 {apple: [1, 1]}, {banana: [1, 1]}, {mango: [1, 1]}
Final Output:
 Partitioned and ready to send to reducers:
o Reducer 1: {apple: [1, 1]}
o Reducer 2: {banana: [1, 1], mango: [1, 1]}

6. Advantages of Spill and Merge


 Efficient Memory Usage: Ensures the mapper
does not run out of memory when processing large
datasets.
 Optimized Disk I/O: Reduces the number of files
written to disk by merging multiple spill files.
 Sorted Data: Prepares data for efficient shuffling
and reduces sorting work for the reducers.
This process is a critical optimization step in the
MapReduce framework, ensuring scalability for massive
datasets.
As mappers finish their tasks, the reducers start fetching the records.
After the spill and merge, once all mappers complete, the reduce method gets invoked.
The output of the reducer is written to HDFS (by default), or wherever the output was configured to be sent.
