Lecture 10 Chapter 6 Part 1 Big Data Processing Concepts
The document discusses the MapReduce programming model, which is essential for writing data-centric parallel applications and is a key component of Big Data management. It details the operations of MapReduce, including the roles of the Map and Reduce functions, as well as the architecture of Apache Hadoop, which supports distributed storage and processing of large datasets. Additionally, it clarifies the differences between Hadoop and MapReduce and outlines the execution workflow of a MapReduce program.
Big Data Processing Concepts
Lecture 10: Chapter 6 Part 1
MapReduce programming model
• MapReduce is the current de facto framework/paradigm for writing data-centric parallel applications in both industry and academia
• MapReduce is inspired by the commonly used functions Map and Reduce, in combination with the divide-and-conquer parallel paradigm
• MapReduce is a framework composed of a programming model and its implementation
• It is one of the first essential steps for the new generation of Big Data management and analytics tools
• It enables developers to write programs that can support parallel processing

MapReduce programming model Cont.
• In MapReduce, both input and output data are treated as key-value pairs with different types. This design follows from the requirements of parallelization and scalability: key-value pairs can be easily partitioned and distributed to be processed on distributed clusters
• The MapReduce programming model uses two subsequent functions that handle data computations: the Map function and the Reduce function

MapReduce Program Operations
• More precisely, a MapReduce program relies on the following operations (a minimal Python sketch of these operations is shown after Figure 1 below):
1. First, the Map function divides the input data (e.g., a long text file) into independent data partitions that constitute key-value pairs
2. Then, the MapReduce framework sends all the key-value pairs to the Mapper, which processes each of them individually through several parallel map tasks across the cluster
• Each data partition is assigned to a unique compute node
• The Mapper outputs one or more intermediate key-value pairs
• At this stage, the framework is responsible for collecting all the intermediate key-value pairs and sorting and grouping them by key, so the result is many keys, each with a list of all the associated values

MapReduce Program Operations Cont.
3. Next, the Reduce function is used to process the intermediate output data
• For each unique key, the Reduce function aggregates the values associated with the key according to a predefined program (e.g., filtering, summarizing, sorting, hashing, taking the average, or finding the maximum)
• After that, it produces one or more output key-value pairs
4. Finally, the MapReduce framework stores all the output key-value pairs in an output file

Apache Hadoop
• Apache Hadoop is an open-source framework designed for distributed storage and processing of large datasets across clusters of commodity hardware
• It uses the Hadoop Distributed File System (HDFS) for scalable storage, the MapReduce programming model for parallel data processing, and YARN (Yet Another Resource Negotiator) for efficient resource management and job scheduling
• Hadoop enables efficient handling of big data by distributing tasks across multiple nodes, offering fault tolerance, scalability, and the ability to process diverse data types, making it a cornerstone of big data analytics

Apache Hadoop Architecture
Figure 1: Apache Hadoop Architecture
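To make the Map and Reduce operations above concrete, here is a minimal, self-contained Python sketch of the word-count pattern. It only illustrates the programming model, not Hadoop's actual API: the function names (map_fn, reduce_fn, mapreduce) and the in-memory shuffle are assumptions made for this example.

from collections import defaultdict

def map_fn(line):
    # Map: emit an intermediate ("word", 1) pair for every word in one line
    for word in line.split():
        yield (word, 1)

def reduce_fn(key, values):
    # Reduce: aggregate all values collected for one key (here: sum the counts)
    return (key, sum(values))

def mapreduce(lines):
    # Map phase: treat each line as one input partition and apply the Map function
    intermediate = []
    for line in lines:
        intermediate.extend(map_fn(line))
    # Sort/group phase: collect all intermediate values under their key
    groups = defaultdict(list)
    for key, value in intermediate:
        groups[key].append(value)
    # Reduce phase: one Reduce call per unique key, producing the "output file" contents
    return [reduce_fn(key, values) for key, values in sorted(groups.items())]

print(mapreduce(["cat dog", "dog cat", "dog fish"]))
# [('cat', 2), ('dog', 3), ('fish', 1)]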
Hadoop and MapReduce are Different
• Although Hadoop and MapReduce are often used interchangeably, they are fundamentally different: Hadoop is a comprehensive framework for distributed storage and processing of big data, while MapReduce is a programming model for processing large datasets in parallel
• In reality, Hadoop's MapReduce is just one specific implementation of the broader MapReduce paradigm
• There are several other implementations of the MapReduce model beyond Hadoop's version, each tailored to different use cases and environments, for example Google's MapReduce, Apache Spark, and Apache Flink (a short PySpark sketch follows Figure 2 below). A further list is available at: https://www.ibm.com/docs/en/spectrum-symphony/7.3.2?topic=applications-supported-mapreduce

MapReduce Programming Model Execution Workflow Revisited
Figure 2: MapReduce Execution workflow
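As noted above, Apache Spark is one of several implementations of the MapReduce paradigm. Before walking through Hadoop's execution workflow in detail, here is a hedged PySpark sketch of the same word count; the file name input.txt and the local[*] master setting are assumptions made for this illustration, not part of the lecture material.

from operator import add
from pyspark.sql import SparkSession

# Start a local Spark session (assumes PySpark is installed, e.g. pip install pyspark)
spark = SparkSession.builder.appName("WordCount").master("local[*]").getOrCreate()
sc = spark.sparkContext

counts = (
    sc.textFile("input.txt")               # read the input lines
      .flatMap(lambda line: line.split())  # Map side: one token per word
      .map(lambda word: (word, 1))         # emit ("word", 1) key-value pairs
      .reduceByKey(add)                    # Reduce side: sum the counts per key
)

for word, count in counts.collect():
    print(word, count)

spark.stop()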
MapReduce Programming Model Execution Workflow
• Figure 2 shows the overall flow of a MapReduce operation. When the user program calls the MapReduce function, the following sequence of actions occurs (the numbered labels in Figure 2 correspond to the numbers in the list below)
1. The MapReduce library in the user program first splits the input files into M pieces, typically 64-128 MB per piece (controllable by the user via an optional parameter). It then starts up many copies of the program on a cluster of machines

MapReduce Programming Model Execution Workflow Cont.
• Input split in MapReduce: a logical chunk of data that a single map task will process. It defines the range of data that a mapper will read from the HDFS blocks
• The purpose of input splits is to define the work for each map task and to optimize data locality by trying to place map tasks on nodes where the data resides
• The splits are assigned to mapper tasks, which then read the corresponding data from the HDFS blocks

Key Differences between HDFS Block (File Split) and MapReduce Input Split

Feature      | HDFS Block (File Split)                    | MapReduce Input Split
Purpose      | Physical storage division in HDFS          | Logical division for processing
Replication  | Replicated across nodes (default 3 copies) | Not replicated; logical division for map tasks
Determines   | How data is distributed in HDFS            | How many map tasks will be created

MapReduce Programming Model Execution Workflow Cont.
2. One of the copies of the program, the master, is special. The rest are workers that are assigned work by the master. There are M map tasks and R reduce tasks to assign. The master picks idle workers (which are nodes in the cluster) and assigns each one a map task or a reduce task
• Scenario: Assume that the MapReduce split size is 64 MB and the data size on a particular HDFS node is 128 MB. Then two mappers (map tasks) will be assigned to the same HDFS node (see the short sketch after the Sort and Shuffle definitions below)
• How will the mappers work together? On the node, if resources (CPU and memory) are sufficient, both mapper tasks can run concurrently. If resources are limited, they may be queued or run in sequence based on the available capacity
• This approach maximizes data locality, meaning that the data is processed where it is stored, reducing network overhead

MapReduce Programming Model Execution Workflow Cont.
3. A worker that is assigned a map task reads the contents of the corresponding input split
• It parses key-value pairs out of the input data and passes each pair to the user-defined Map function
• For example, if the Map function is counting word occurrences, it might output intermediate pairs like ("word", 1) for each word in the input
• The intermediate key-value pairs produced by the Map function are buffered in memory
• By buffering in memory (the RAM of the worker node executing the map task), the system can quickly sort and group these intermediate results before writing them to disk or sending them to the next phase
• However, if the amount of data exceeds a certain threshold, the buffered data may be spilled to disk to prevent memory overflow

Understanding Sort and Shuffle in MapReduce
• Sort: the process of grouping all intermediate key-value pairs by key. This sorting happens on each worker node after the map phase
• Shuffle: the process of transferring and merging the sorted key-value pairs from all map tasks to the appropriate reducer tasks. It ensures that all key-value pairs with the same key end up at the same reducer
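To make the split-size scenario above concrete, here is a small Python sketch of the arithmetic; the helper name num_map_tasks is a hypothetical name used only for this illustration.

import math

def num_map_tasks(data_size_mb, split_size_mb=64):
    # Each input split becomes one map task, so the count is data size / split size, rounded up
    return math.ceil(data_size_mb / split_size_mb)

print(num_map_tasks(128, 64))  # 2 map tasks for 128 MB of data with 64 MB splits
print(num_map_tasks(200, 64))  # 4 map tasks; the last split is smaller than 64 MB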
Understanding Sort and Shuffle in MapReduce Cont.
• Example Scenario: Word Count
• Let's use a classic example of counting the occurrences of words in a dataset
• Input Data: Imagine we have the following three lines of text as input: "cat dog", "dog cat", "dog fish"

Understanding Sort and Shuffle in MapReduce Cont.
• Input Splits: the input might be split into two parts (for simplicity):
• Split 1: "cat dog"
• Split 2: "dog cat" and "dog fish"
• Map Phase: each input split is processed by a map task, which emits intermediate key-value pairs like this:
• Map Task 1 (processing Split 1: "cat dog") outputs: ("cat", 1), ("dog", 1)

Understanding Sort and Shuffle in MapReduce Cont.
• Map Task 2 (processing Split 2: "dog cat" and "dog fish") outputs: ("dog", 1), ("cat", 1), ("dog", 1), ("fish", 1)
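As an aside, here is a hedged sketch of how such a map function could be written as a Hadoop Streaming mapper script (Hadoop Streaming lets any executable that reads lines from standard input and writes tab-separated key/value lines to standard output act as a mapper); the file name mapper.py is an assumption made for this illustration.

#!/usr/bin/env python3
# mapper.py: emits one tab-separated ("word", 1) line per word read from stdin
import sys

for line in sys.stdin:
    for word in line.split():
        print(f"{word}\t1")

The framework would run one such mapper per input split, feeding it the lines of Split 1 or Split 2 above.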
• Sorting: Each map task sorts its intermediate key-value pairs by key. So the sorted output of Map Task 1 is: ("cat", 1), ("dog", 1)

Understanding Sort and Shuffle in MapReduce Cont.
• The sorted output of Map Task 2 is: ("cat", 1), ("dog", 1), ("dog", 1), ("fish", 1)
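As a small illustration (assuming the intermediate pairs above are held as Python tuples), the per-task sort is simply an ordering by key:

map_task_1 = [("cat", 1), ("dog", 1)]
map_task_2 = [("dog", 1), ("cat", 1), ("dog", 1), ("fish", 1)]

# Each map task sorts only its own output before the shuffle
print(sorted(map_task_1))  # [('cat', 1), ('dog', 1)]
print(sorted(map_task_2))  # [('cat', 1), ('dog', 1), ('dog', 1), ('fish', 1)]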
• Shuffling: Now, the framework shuffles these sorted outputs, grouping by key across all map tasks, and sends them to the appropriate reducer. Here is how it works:
• All values associated with the key "cat" are combined
• All values associated with the key "dog" are combined
• All values associated with the key "fish" are combined

Understanding Sort and Shuffle in MapReduce Cont.
• The shuffled input for the reducers might look like this:
• For the key "cat": ("cat", [1, 1])
• For the key "dog": ("dog", [1, 1, 1])
• For the key "fish": ("fish", [1])
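A hedged sketch of this grouping step, again using plain Python as a stand-in for the framework's shuffle:

from collections import defaultdict

# Sorted outputs of the two map tasks from the example above
map_task_1 = [("cat", 1), ("dog", 1)]
map_task_2 = [("cat", 1), ("dog", 1), ("dog", 1), ("fish", 1)]

# The shuffle merges all map outputs and groups the values by key
shuffled = defaultdict(list)
for key, value in map_task_1 + map_task_2:
    shuffled[key].append(value)

print(dict(shuffled))  # {'cat': [1, 1], 'dog': [1, 1, 1], 'fish': [1]}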
Understanding Sort and Shuffle in MapReduce Cont.
• Reduce Phase: the reduce tasks then process each group of key-value pairs to produce the final output:
• Reducer 1 (handling "cat"): Input: ("cat", [1, 1]), Output: ("cat", 2)
• Reducer 2 (handling "dog"): Input: ("dog", [1, 1, 1]), Output: ("dog", 3)
• Reducer 3 (handling "fish"): Input: ("fish", [1]), Output: ("fish", 1)
• Final Output: ("cat", 2), ("dog", 3), ("fish", 1)

MapReduce Programming Model Execution Workflow Cont.
4. Periodically, the buffered pairs are written to local disk, partitioned into R regions by the partitioning function (a short sketch follows at the end of this lecture). The locations of these buffered pairs on the local disk are passed back to the master, which is responsible for forwarding these locations to the reduce workers

MapReduce Programming Model Execution Workflow Cont.
5. When a reduce worker is notified by the master about these locations, it uses remote procedure calls to read the buffered data from the local disks of the map workers. When a reduce worker has read all intermediate data for its partition, it sorts the data by the intermediate keys so that all occurrences of the same key are grouped together. The sorting is needed because typically many different keys map to the same reduce task. If the amount of intermediate data is too large to fit in memory, an external sort is used

MapReduce Programming Model Execution Workflow Cont.
6. The reduce worker iterates over the sorted intermediate data and, for each unique intermediate key encountered, passes the key and the corresponding set of intermediate values to the user's Reduce function. The output of the Reduce function is appended to a final output file for this reduce partition

MapReduce Programming Model Execution Workflow Cont.
7. When all map tasks and reduce tasks have been completed, the master wakes up the user program. At this point, the MapReduce call in the user program returns back to the user code

Interesting Resource
https://www.youtube.com/watch?v=aReuLtY0YMI
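To make step 4's partitioning function concrete, here is a hedged sketch of hash partitioning, the common default way of deciding which of the R reduce partitions an intermediate key belongs to. The helper name partition and the use of zlib.crc32 are assumptions made for this illustration; Hadoop's own default partitioner hashes the key's Java hashCode in the same spirit.

import zlib

R = 3  # number of reduce tasks / output partitions

def partition(key, num_reducers=R):
    # Hash partitioning: the same key always lands in the same reduce partition
    return zlib.crc32(key.encode("utf-8")) % num_reducers

for key in ["cat", "dog", "fish"]:
    print(key, "-> reduce partition", partition(key))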