Lecture 10 Chapter 6 Part 1 Big Data Processing Concepts
The document discusses the MapReduce programming model, which is essential for writing data-centric parallel applications and is a key component of Big Data management. It details the operations of MapReduce, including the roles of the Map and Reduce functions, as well as the architecture of Apache Hadoop, which supports distributed storage and processing of large datasets. Additionally, it clarifies the differences between Hadoop and MapReduce and outlines the execution workflow of a MapReduce program.
Big Data Processing Concepts
Lecture 10: Chapter 6 Part 1
MapReduce programming model
• MapReduce is the current de facto framework/paradigm for writing data-centric parallel applications in both industry and academia
• MapReduce is inspired by the commonly used functions Map and Reduce, in combination with the divide-and-conquer parallel paradigm
• MapReduce is a framework composed of a programming model and its implementation
• It is one of the first essential steps for the new generation of Big Data management and analytics tools
• It enables developers to write programs that can support parallel processing

MapReduce programming model Cont.
• In MapReduce, both input and output data are treated as key-value pairs with different types. This design follows from the requirements of parallelization and scalability: key-value pairs can be easily partitioned and distributed to be processed on distributed clusters
• The MapReduce programming model uses two subsequent functions that handle data computations: the Map function and the Reduce function

MapReduce Program Operations
• More precisely, a MapReduce program relies on the following operations (a minimal Python sketch of these operations is shown after Figure 1 below):
1. First, the Map function divides the input data (e.g., a long text file) into independent data partitions that constitute key-value pairs
2. Then, the MapReduce framework sends all the key-value pairs to the Mapper, which processes each of them individually through several parallel map tasks across the cluster
• Each data partition is assigned to a unique compute node
• The Mapper outputs one or more intermediate key-value pairs
• At this stage, the framework is responsible for collecting all the intermediate key-value pairs and sorting and grouping them by key, so the result is many keys, each with a list of all the associated values

MapReduce Program Operations Cont.
3. Next, the Reduce function is used to process the intermediate output data
• For each unique key, the Reduce function aggregates the values associated with the key according to a predefined program (e.g., filtering, summarizing, sorting, hashing, taking the average, or finding the maximum)
• After that, it produces one or more output key-value pairs
4. Finally, the MapReduce framework stores all the output key-value pairs in an output file

Apache Hadoop
• Apache Hadoop is an open-source framework designed for distributed storage and processing of large datasets across clusters of commodity hardware
• It uses the Hadoop Distributed File System (HDFS) for scalable storage, the MapReduce programming model for parallel data processing, and YARN (Yet Another Resource Negotiator) for efficient resource management and job scheduling
• Hadoop enables efficient handling of big data by distributing tasks across multiple nodes, offering fault tolerance, scalability, and the ability to process diverse data types, making it a cornerstone of big data analytics

Apache Hadoop Architecture
Figure 1: Apache Hadoop Architecture
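To make the Map and Reduce operations above concrete, here is a minimal, self-contained Python sketch of the word-count pattern. It only illustrates the programming model, not Hadoop's actual API: the function names (map_fn, reduce_fn, mapreduce) and the in-memory shuffle are assumptions made for this example.

from collections import defaultdict

def map_fn(line):
    # Map: emit an intermediate ("word", 1) pair for every word in one line
    for word in line.split():
        yield (word, 1)

def reduce_fn(key, values):
    # Reduce: aggregate all values collected for one key (here: sum the counts)
    return (key, sum(values))

def mapreduce(lines):
    # Map phase: treat each line as one input partition and apply the Map function
    intermediate = []
    for line in lines:
        intermediate.extend(map_fn(line))
    # Sort/group phase: collect all intermediate values under their key
    groups = defaultdict(list)
    for key, value in intermediate:
        groups[key].append(value)
    # Reduce phase: one Reduce call per unique key, producing the "output file" contents
    return [reduce_fn(key, values) for key, values in sorted(groups.items())]

print(mapreduce(["cat dog", "dog cat", "dog fish"]))
# [('cat', 2), ('dog', 3), ('fish', 1)]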
Hadoop and MapReduce are Different
• Although Hadoop and MapReduce are often used interchangeably, they are fundamentally different: Hadoop is a comprehensive framework for distributed storage and processing of big data, while MapReduce is a programming model for processing large datasets in parallel
• In reality, Hadoop's MapReduce is just one specific implementation of the broader MapReduce paradigm
• There are several other implementations of the MapReduce model beyond Hadoop's version, each tailored to different use cases and environments, for example Google's MapReduce, Apache Spark, and Apache Flink (a short PySpark sketch follows Figure 2 below). A further list is available at: https://www.ibm.com/docs/en/spectrum-symphony/7.3.2?topic=applications-supported-mapreduce

MapReduce Programming Model Execution Workflow Revisited
Figure 2: MapReduce Execution workflow
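As noted above, Apache Spark is one of several implementations of the MapReduce paradigm. Before walking through Hadoop's execution workflow in detail, here is a hedged PySpark sketch of the same word count; the file name input.txt and the local[*] master setting are assumptions made for this illustration, not part of the lecture material.

from operator import add
from pyspark.sql import SparkSession

# Start a local Spark session (assumes PySpark is installed, e.g. pip install pyspark)
spark = SparkSession.builder.appName("WordCount").master("local[*]").getOrCreate()
sc = spark.sparkContext

counts = (
    sc.textFile("input.txt")               # read the input lines
      .flatMap(lambda line: line.split())  # Map side: one token per word
      .map(lambda word: (word, 1))         # emit ("word", 1) key-value pairs
      .reduceByKey(add)                    # Reduce side: sum the counts per key
)

for word, count in counts.collect():
    print(word, count)

spark.stop()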
MapReduce Programming Model Execution Workflow
• Figure 2 shows the overall flow of a MapReduce operation. When the user program calls the MapReduce function, the following sequence of actions occurs (the numbered labels in Figure 2 correspond to the numbers in the list below)
1. The MapReduce library in the user program first splits the input files into M pieces, typically 64-128 MB per piece (controllable by the user via an optional parameter). It then starts up many copies of the program on a cluster of machines

MapReduce Programming Model Execution Workflow Cont.
• Input split in MapReduce: a logical chunk of data that a single map task will process. It defines the range of data that a mapper will read from the HDFS blocks
• The purpose of input splits is to define the work for each map task and to optimize data locality by trying to place map tasks on nodes where the data resides
• The splits are assigned to mapper tasks, which then read the corresponding data from the HDFS blocks

Key Differences between HDFS Block (File Split) and MapReduce Input Split

Feature      | HDFS Block (File Split)                    | MapReduce Input Split
Purpose      | Physical storage division in HDFS          | Logical division for processing
Replication  | Replicated across nodes (default 3 copies) | Not replicated; logical division for map tasks
Determines   | How data is distributed in HDFS            | How many map tasks will be created

MapReduce Programming Model Execution Workflow Cont.
2. One of the copies of the program, the master, is special. The rest are workers that are assigned work by the master. There are M map tasks and R reduce tasks to assign. The master picks idle workers (which are nodes in the cluster) and assigns each one a map task or a reduce task
• Scenario: Assume that the MapReduce split size is 64 MB and the data size on a particular HDFS node is 128 MB. Then two mappers (map tasks) will be assigned to the same HDFS node (see the short sketch after the Sort and Shuffle definitions below)
• How will the mappers work together? On the node, if resources (CPU and memory) are sufficient, both mapper tasks can run concurrently. If resources are limited, they may be queued or run in sequence based on the available capacity
• This approach maximizes data locality, meaning that the data is processed where it is stored, reducing network overhead

MapReduce Programming Model Execution Workflow Cont.
3. A worker that is assigned a map task reads the contents of the corresponding input split
• It parses key-value pairs out of the input data and passes each pair to the user-defined Map function
• For example, if the Map function is counting word occurrences, it might output intermediate pairs like ("word", 1) for each word in the input
• The intermediate key-value pairs produced by the Map function are buffered in memory
• By buffering in memory (the RAM of the worker node executing the map task), the system can quickly sort and group these intermediate results before writing them to disk or sending them to the next phase
• However, if the amount of data exceeds a certain threshold, the buffered data may be spilled to disk to prevent memory overflow

Understanding Sort and Shuffle in MapReduce
• Sort: the process of grouping all intermediate key-value pairs by key. This sorting happens on each worker node after the map phase
• Shuffle: the process of transferring and merging the sorted key-value pairs from all map tasks to the appropriate reducer tasks. It ensures that all key-value pairs with the same key end up at the same reducer
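To make the split-size scenario above concrete, here is a small Python sketch of the arithmetic; the helper name num_map_tasks is a hypothetical name used only for this illustration.

import math

def num_map_tasks(data_size_mb, split_size_mb=64):
    # Each input split becomes one map task, so the count is data size / split size, rounded up
    return math.ceil(data_size_mb / split_size_mb)

print(num_map_tasks(128, 64))  # 2 map tasks for 128 MB of data with 64 MB splits
print(num_map_tasks(200, 64))  # 4 map tasks; the last split is smaller than 64 MB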
Understanding Sort and Shuffle in MapReduce Cont.
• Example Scenario: Word Count
• Let's use a classic example of counting the occurrences of words in a dataset
• Input Data: Imagine we have the following three lines of text as input: "cat dog", "dog cat", "dog fish"

Understanding Sort and Shuffle in MapReduce Cont.
• Input Splits: the input might be split into two parts (for simplicity):
• Split 1: "cat dog"
• Split 2: "dog cat" and "dog fish"
• Map Phase: each input split is processed by a map task, which emits intermediate key-value pairs like this:
• Map Task 1 (processing Split 1: "cat dog") outputs: ("cat", 1), ("dog", 1)

Understanding Sort and Shuffle in MapReduce Cont.
• Map Task 2 (processing Split 2: "dog cat" and "dog fish") outputs: ("dog", 1), ("cat", 1), ("dog", 1), ("fish", 1)
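As an aside, here is a hedged sketch of how such a map function could be written as a Hadoop Streaming mapper script (Hadoop Streaming lets any executable that reads lines from standard input and writes tab-separated key/value lines to standard output act as a mapper); the file name mapper.py is an assumption made for this illustration.

#!/usr/bin/env python3
# mapper.py: emits one tab-separated ("word", 1) line per word read from stdin
import sys

for line in sys.stdin:
    for word in line.split():
        print(f"{word}\t1")

The framework would run one such mapper per input split, feeding it the lines of Split 1 or Split 2 above.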
• Sorting: Each map task sorts its intermediate key-value pairs by key. So the sorted output of Map Task 1 is: ("cat", 1), ("dog", 1)

Understanding Sort and Shuffle in MapReduce Cont.
• The sorted output of Map Task 2 is: ("cat", 1), ("dog", 1), ("dog", 1), ("fish", 1)
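As a small illustration (assuming the intermediate pairs above are held as Python tuples), the per-task sort is simply an ordering by key:

map_task_1 = [("cat", 1), ("dog", 1)]
map_task_2 = [("dog", 1), ("cat", 1), ("dog", 1), ("fish", 1)]

# Each map task sorts only its own output before the shuffle
print(sorted(map_task_1))  # [('cat', 1), ('dog', 1)]
print(sorted(map_task_2))  # [('cat', 1), ('dog', 1), ('dog', 1), ('fish', 1)]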
• Shuffling: Now, the framework shuffles these sorted outputs, grouping by key across all map tasks, and sends them to the appropriate reducer. Here is how it works:
• All values associated with the key "cat" are combined
• All values associated with the key "dog" are combined
• All values associated with the key "fish" are combined

Understanding Sort and Shuffle in MapReduce Cont.
• The shuffled input for the reducers might look like this:
• For the key "cat": ("cat", [1, 1])
• For the key "dog": ("dog", [1, 1, 1])
• For the key "fish": ("fish", [1])
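A hedged sketch of this grouping step, again using plain Python as a stand-in for the framework's shuffle:

from collections import defaultdict

# Sorted outputs of the two map tasks from the example above
map_task_1 = [("cat", 1), ("dog", 1)]
map_task_2 = [("cat", 1), ("dog", 1), ("dog", 1), ("fish", 1)]

# The shuffle merges all map outputs and groups the values by key
shuffled = defaultdict(list)
for key, value in map_task_1 + map_task_2:
    shuffled[key].append(value)

print(dict(shuffled))  # {'cat': [1, 1], 'dog': [1, 1, 1], 'fish': [1]}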
Understanding Sort and Shuffle in MapReduce Cont.
• Reduce Phase: the reduce tasks then process each group of key-value pairs to produce the final output:
• Reducer 1 (handling "cat"): Input: ("cat", [1, 1]), Output: ("cat", 2)
• Reducer 2 (handling "dog"): Input: ("dog", [1, 1, 1]), Output: ("dog", 3)
• Reducer 3 (handling "fish"): Input: ("fish", [1]), Output: ("fish", 1)
• Final Output: ("cat", 2), ("dog", 3), ("fish", 1)

MapReduce Programming Model Execution Workflow Cont.
4. Periodically, the buffered pairs are written to local disk, partitioned into R regions by the partitioning function (a short sketch follows at the end of this lecture). The locations of these buffered pairs on the local disk are passed back to the master, which is responsible for forwarding these locations to the reduce workers

MapReduce Programming Model Execution Workflow Cont.
5. When a reduce worker is notified by the master about these locations, it uses remote procedure calls to read the buffered data from the local disks of the map workers. When a reduce worker has read all intermediate data for its partition, it sorts the data by the intermediate keys so that all occurrences of the same key are grouped together. The sorting is needed because typically many different keys map to the same reduce task. If the amount of intermediate data is too large to fit in memory, an external sort is used

MapReduce Programming Model Execution Workflow Cont.
6. The reduce worker iterates over the sorted intermediate data and, for each unique intermediate key encountered, passes the key and the corresponding set of intermediate values to the user's Reduce function. The output of the Reduce function is appended to a final output file for this reduce partition

MapReduce Programming Model Execution Workflow Cont.
7. When all map tasks and reduce tasks have been completed, the master wakes up the user program. At this point, the MapReduce call in the user program returns back to the user code

Interesting Resource
https://www.youtube.com/watch?v=aReuLtY0YMI
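To make step 4's partitioning function concrete, here is a hedged sketch of hash partitioning, the common default way of deciding which of the R reduce partitions an intermediate key belongs to. The helper name partition and the use of zlib.crc32 are assumptions made for this illustration; Hadoop's own default partitioner hashes the key's Java hashCode in the same spirit.

import zlib

R = 3  # number of reduce tasks / output partitions

def partition(key, num_reducers=R):
    # Hash partitioning: the same key always lands in the same reduce partition
    return zlib.crc32(key.encode("utf-8")) % num_reducers

for key in ["cat", "dog", "fish"]:
    print(key, "-> reduce partition", partition(key))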