
Big Data Processing Concepts

Lecture 10: Chapter 6 Part 1


MapReduce programming model
• MapReduce is the current de facto framework/paradigm for writing
data-centric parallel applications in both industry and academia
• MapReduce is inspired by the commonly used functions Map and
Reduce, combined with the divide-and-conquer parallel paradigm
• MapReduce is a framework composed of a programming model
and its implementation
• It was one of the first essential steps for the new generation of Big
Data management and analytics tools
• It enables developers to write programs that support parallel processing
MapReduce programming model Cont.
• In MapReduce, both input and output data are represented as key-value
pairs of varying types. This design stems from the requirements of
parallelization and scalability: key-value pairs can be easily
partitioned and distributed to be processed on distributed clusters
• The MapReduce programming model uses two successive functions
that handle data computations: the Map function and the
Reduce function, as sketched below
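A minimal sketch of these two functions for a word count job, written in plain Python for illustration (this is not the Hadoop API; the function names are illustrative):

```python
# Minimal sketch of the two MapReduce functions for word count.
# Plain Python for illustration; not the actual Hadoop API.

def map_fn(key, value):
    """Map: takes an input key-value pair (e.g., line offset, line text)
    and emits zero or more intermediate key-value pairs."""
    for word in value.split():
        yield (word, 1)

def reduce_fn(key, values):
    """Reduce: takes an intermediate key and all its associated values
    and emits zero or more output key-value pairs."""
    yield (key, sum(values))

# Example: one input pair produces several intermediate pairs.
print(list(map_fn(0, "cat dog cat")))   # [('cat', 1), ('dog', 1), ('cat', 1)]
print(list(reduce_fn("cat", [1, 1])))   # [('cat', 2)]
```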
MapReduce Program Operations
• More precisely, a MapReduce program relies on the following
operations:
1. First, the Map function divides the input data (e.g., a long text file) into
independent data partitions that constitute key-value pairs.
2. Then, the MapReduce framework sends all the key-value pairs to the
Mapper, which processes each of them individually, through several
parallel map tasks across the cluster
• Each data partition is assigned to a unique compute node
• The Mapper outputs one or more intermediate key-value pairs
• At this stage, the framework is responsible for collecting all the intermediate
key-value pairs and sorting and grouping them by key, so the result is a set of
keys, each with a list of all its associated values
MapReduce Program Operations Cont.
3. Next, the Reduce function is used to process the intermediate
output data
• For each unique key, the Reduce function aggregates the values
associated with the key according to a predefined program (e.g., filtering,
summarizing, sorting, hashing, taking an average, or finding the maximum)
• After that, it produces one or more output key-value pairs
4. Finally, the MapReduce framework stores all the output key-value
pairs in an output file, as simulated in the sketch below
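These four operations can be simulated end to end in plain Python. This is a single-process sketch with illustrative names; a real framework runs the map and reduce tasks in parallel across a cluster:

```python
from collections import defaultdict

def map_fn(key, value):
    for word in value.split():
        yield (word, 1)

def reduce_fn(key, values):
    yield (key, sum(values))

# 1. Split the input into independent partitions (here: one line per split).
input_data = ["cat dog", "dog cat", "dog fish"]
splits = list(enumerate(input_data))          # [(0, "cat dog"), ...]

# 2. Run a map task per split (sequentially here; in parallel on a cluster).
intermediate = []
for key, value in splits:
    intermediate.extend(map_fn(key, value))

# Collect, sort, and group the intermediate pairs by key.
groups = defaultdict(list)
for k, v in sorted(intermediate):
    groups[k].append(v)

# 3. Run the Reduce function on each unique key.
output = []
for k, vs in groups.items():
    output.extend(reduce_fn(k, vs))

# 4. Store the output key-value pairs (printed here instead of a file).
print(output)   # [('cat', 2), ('dog', 3), ('fish', 1)]
```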
Apache Hadoop
• Apache Hadoop is an open-source framework designed for
distributed storage and processing of large datasets across
clusters of commodity hardware
• It uses the Hadoop Distributed File System (HDFS) for scalable
storage, the MapReduce programming model for parallel data
processing, and YARN (Yet Another Resource Negotiator) for
efficient resource management and job scheduling
• Hadoop enables efficient handling of big data by distributing
tasks across multiple nodes, offering fault tolerance, scalability,
and the ability to process diverse data types, making it a
cornerstone in big data analytics
Apache Hadoop Architecture

Figure 1: Apache Hadoop Architecture


Hadoop and MapReduce are Different
• Although Hadoop and MapReduce are often used interchangeably,
they are fundamentally different. Hadoop is a comprehensive
framework for distributed storage and processing of big data, while
MapReduce is a programming model for processing large datasets in
parallel
• In reality, Hadoop's MapReduce is just one specific implementation of
the broader MapReduce paradigm
• There are several other implementations of the MapReduce model
beyond Hadoop's version, each tailored to different use cases and
environments: for example, Google's MapReduce, Apache Spark, and
Apache Flink. Another list is available here:
https://www.ibm.com/docs/en/spectrum-symphony/7.3.2?topic=applications-supported-mapreduce
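For instance, the same word count written against Apache Spark's Python API follows the same map-shuffle-reduce paradigm with a very different implementation. A minimal sketch, assuming a local Spark installation and an input file named input.txt (both are assumptions):

```python
# Word count in Apache Spark (PySpark): same map/shuffle/reduce paradigm,
# different implementation. Assumes a local Spark installation and an
# input file named input.txt (both are assumptions of this sketch).
from pyspark import SparkContext

sc = SparkContext("local", "WordCount")

counts = (sc.textFile("input.txt")                   # read input splits
            .flatMap(lambda line: line.split())      # Map: emit words
            .map(lambda word: (word, 1))             # emit (word, 1) pairs
            .reduceByKey(lambda a, b: a + b))        # shuffle + Reduce

print(counts.collect())
sc.stop()
```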
MapReduce Programming Model Execution Workflow
Revisited

Figure 2: MapReduce Execution workflow


MapReduce Programming Model Execution
Workflow
Figure 2 shows the overall flow of a MapReduce operation. When
the user program calls the MapReduce function, the following
sequence of actions occurs (the numbered labels in Figure 2
correspond to the numbers in the following list)
1. The MapReduce library in the user program first splits the input
files into M pieces of typically 64-128MB per piece (controllable
by the user via an optional parameter). It then starts up many
copies of the program on a cluster of machines
MapReduce Programming Model Execution
Workflow Cont.
• Input split in MapReduce: a logical chunk of data that a single
Map task will process. It defines the range of data that a mapper
will read from HDFS blocks
• The purpose of input splits is to define the work for each Map
task and to optimize data locality by trying to place Map tasks on
nodes where the data resides
• The splits are assigned to Mapper tasks, which then read the
corresponding data from the HDFS blocks (see the sketch below)
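The number of map tasks follows directly from the input size and the split size. A back-of-the-envelope sketch with illustrative figures:

```python
import math

# Illustrative figures: a 1 GB input file with a 128 MB split size.
file_size_mb = 1024
split_size_mb = 128

# One map task is created per input split.
num_splits = math.ceil(file_size_mb / split_size_mb)
print(num_splits)   # 8 input splits -> 8 map tasks
```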
Key Differences between HDFS Block (File
Split) and MapReduce (Input Split)
Feature     | HDFS Block (File Split)                    | MapReduce Input Split
Purpose     | Physical storage division in HDFS          | Logical division for processing
Replication | Replicated across nodes (default 3 copies) | Not replicated; logical division for Map tasks
Determines  | How data is distributed in HDFS            | How many Map tasks will be created
MapReduce Programming Model Execution
Workflow Cont.
2. One of the copies of the program—the master—is special. The rest
are workers that are assigned work by the master. There are M map tasks
and R reduce tasks to assign. The master picks idle workers (which are
nodes in the cluster) and assigns each one a map task or a reduce task
• Scenario: Assume that the MapReduce split size is 64 MB and the data
size on a particular HDFS node is 128 MB. Then two mappers (map
tasks) will be assigned to the same HDFS node. How will the
mappers work together?
• On the node, if resources (CPU and memory) are sufficient, both Mapper tasks
can run concurrently. If resources are limited, they may be queued or run in
sequence based on the available capacity
• This approach maximizes data locality, meaning that the data is processed where it is
stored, reducing network overhead
MapReduce Programming Model Execution
Workflow Cont.
3. A worker who is assigned a map task reads the contents of the
corresponding input split
• It parses key/value pairs out of the input data and passes each pair to the
user-defined map function
• For example, if the map function is counting word occurrences, it might output
intermediate pairs like ("word", 1) for each word in the input
• The intermediate key/value pairs produced by the map function are
buffered in memory
• By buffering in memory (the RAM of the worker node executing the map
task), the system can quickly sort and group these intermediate results
before writing them to disk or sending them to the next phase
• However, if the amount of data exceeds a certain threshold, the buffered data is
spilled to disk to prevent memory overflow, as in the sketch below
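A toy sketch of this buffer-and-spill behavior in plain Python (the threshold and spill-file handling are illustrative assumptions; Hadoop's actual sort buffer is configurable and far larger):

```python
import json, os, tempfile

SPILL_THRESHOLD = 4        # illustrative; real buffers hold many megabytes
buffer, spill_files = [], []

def emit(pair):
    """Buffer an intermediate pair in memory; spill to disk when full."""
    buffer.append(pair)
    if len(buffer) >= SPILL_THRESHOLD:
        spill()

def spill():
    """Sort the buffered pairs by key and write them to a local spill file."""
    buffer.sort()
    fd, path = tempfile.mkstemp(suffix=".spill")
    with os.fdopen(fd, "w") as f:
        json.dump(buffer, f)
    spill_files.append(path)
    buffer.clear()

for word in "cat dog dog cat dog fish".split():
    emit((word, 1))
print(len(spill_files), "spill file(s),", len(buffer), "pair(s) still buffered")
```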
Understanding Sort and Shuffle in
MapReduce
• Sort: The process of grouping all intermediate key-value pairs by
key. This sorting happens on each worker node after the map
phase
• Shuffle: The process of transferring and merging the sorted key-
value pairs from all map tasks to the appropriate reducer tasks. It
ensures that all key-value pairs with the same key end up at the
same reducer
Understanding Sort and Shuffle in
MapReduce Cont.
• Example Scenario: Word Count
• Let's use a classic example of counting the occurrences of
words in a dataset
• Input Data: Imagine we have the following 3 lines of text as input:
"cat dog", "dog cat", and "dog fish"
Understanding Sort and Shuffle in
MapReduce Cont.
• Input Splits:
• The input might be split into two parts (for simplicity):
• Split 1: "cat dog"
• Split 2: "dog cat" and "dog fish"
• Map Phase:
• Each input split is processed by a map task, and it emits intermediate
key-value pairs like this:
• Map Task 1 (processing Split 1: "cat dog"):
• Output: ("cat", 1), ("dog", 1)
Understanding Sort and Shuffle in
MapReduce Cont.
• Map Task 2 (processing Split 2: "dog cat" and "dog fish"):
• Output: ("dog", 1), ("cat", 1), ("dog", 1), ("fish", 1)
• Sorting: Each map task sorts its intermediate key-value pairs by key.
So the Sorted Output of Map Task 1 is: ("cat", 1), ("dog", 1)
Understanding Sort and Shuffle in
MapReduce Cont.
• The Sorted Output of Map Task 2 is: ("cat", 1), ("dog", 1), ("dog", 1), ("fish", 1)
• Shuffling: Now the framework shuffles these sorted outputs, grouping
them by key across all map tasks, and sends them to the appropriate reducer.
Here's how it works:
• All values associated with the key "cat" are combined.
• All values associated with the key "dog" are combined.
• All values associated with the key "fish" are combined.
Understanding Sort and Shuffle in
MapReduce Cont.
• The shuffled input for the reducers might look like this:
• For the key "cat": ("cat", [1, 1])
• For the key "dog": ("dog", [1, 1, 1])
• For the key "fish": ("fish", [1])


Understanding Sort and Shuffle in
MapReduce Cont.
• Reduce Phase: The reduce tasks then process each group of key-value
pairs to produce the final output:
• Reducer 1 (handling "cat"):
• Input: ("cat", [1, 1])
• Output: ("cat", 2)
• Reducer 2 (handling "dog"):
• Input: ("dog", [1, 1, 1])
• Output: ("dog", 3)
• Reducer 3 (handling "fish"):
• Input: ("fish", [1])
• Output: ("fish", 1)
MapReduce Programming Model Execution
Workflow Cont.
4. Periodically, the buffered pairs are written to local disk,
partitioned into R regions by the partitioning function. The locations
of these buffered pairs on the local disk are passed back to the
master who is responsible for forwarding these locations to the
reduce workers
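The partitioning function is typically a hash of the key modulo R (Hadoop's default partitioner is hash-based). A minimal sketch; CRC32 is used here so the result is deterministic across runs, since Python's built-in string hash is randomized:

```python
import zlib

R = 3   # number of reduce tasks (illustrative)

def partition(key, num_reducers=R):
    """Route a key to one of R reduce partitions by hashing it."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers

# Every occurrence of the same key lands in the same partition,
# so all its values end up at the same reduce worker.
for key in ["cat", "dog", "fish"]:
    print(key, "-> reducer", partition(key))
```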
MapReduce Programming Model Execution
Workflow Cont.
5. When a reduce worker is notified by the master about these
locations, it uses remote procedure calls to read the buffered data
from the local disks of the map workers. When a reduce worker has
read all intermediate data for its partition, it sorts it by the
intermediate keys so that all occurrences of the same key are
grouped together. The sorting is needed because typically many
different keys map to the same reduce task. If the amount of
intermediate data is too large to fit in memory, an external sort is
used
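An external sort merges several individually sorted runs without loading them all into memory at once. A minimal sketch using Python's heapq.merge; in a real external sort, each run would be streamed from the map workers' spill files rather than held in a list:

```python
import heapq

# Pretend these are sorted runs fetched from the map workers' local disks.
sorted_runs = [
    [("cat", 1), ("dog", 1)],
    [("cat", 1), ("dog", 1), ("dog", 1), ("fish", 1)],
]

# heapq.merge lazily merges already-sorted iterables, so only one
# element per run needs to be in memory at a time.
for pair in heapq.merge(*sorted_runs):
    print(pair)
```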
MapReduce Programming Model Execution
Workflow Cont.
6. The reduce worker iterates over the sorted intermediate data and
for each unique intermediate key encountered, it passes the key
and the corresponding set of intermediate values to the user’s
reduce function. The output of the reduce function is appended to a
final output file for this reduce partition
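Iterating over the sorted data and grouping consecutive pairs with equal keys is exactly what Python's itertools.groupby does. A sketch of this step (the output file name is illustrative):

```python
from itertools import groupby
from operator import itemgetter

# Sorted intermediate data for this reduce partition (from step 5).
sorted_pairs = [("cat", 1), ("cat", 1), ("dog", 1), ("dog", 1), ("dog", 1)]

# groupby yields each unique key with an iterator over its pairs,
# which works because the input is already sorted by key.
with open("part-00000.txt", "w") as out:   # illustrative output file name
    for key, group in groupby(sorted_pairs, key=itemgetter(0)):
        total = sum(value for _, value in group)   # the user's "reduce" step
        out.write(f"{key}\t{total}\n")             # write to the partition's output file
```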
MapReduce Programming Model Execution
Workflow Cont.
7. When all map tasks and reduce tasks have been completed, the
master wakes up the user program. At this point, the MapReduce
call in the user program returns to the user code
Interesting Resource
https://www.youtube.com/watch?v=aReuLtY0YMI
