Introduction To MapReduce


Introduction: MapReduce

• The concept of MapReduce was pioneered by Google.


• The original paper titled "MapReduce: Simplified Data Processing on Large
Clusters" was written by Jeffrey Dean and Sanjay Ghemawat, and it was
published in 2004.
• In the paper, they introduced the MapReduce programming model and
described its implementation at Google for processing large-scale data across
distributed clusters.
• MapReduce became a fundamental framework for distributed computing and
played a significant role in the development of big data technologies.
• While Google introduced the concept, the open-source Apache Hadoop project
later implemented its own version of MapReduce, making it accessible to a
broader community of developers and organizations.

Prerequisites that can help you grasp MapReduce more effectively


1. Programming Languages:

• Proficiency in a programming language is crucial.


• Java is commonly used in the Hadoop ecosystem, and many MapReduce
examples are written in Java.
• Knowledge of Python can also be useful.

2. Distributed Systems:

• Understanding the basics of distributed computing is essential.


• Familiarize yourself with concepts like nodes, clusters, parallel processing, and
the challenges associated with distributed systems.
3. Hadoop Ecosystem:

• MapReduce is often associated with the Hadoop framework.


• Therefore, it's helpful to have a basic understanding of Hadoop and its
ecosystem components, such as HDFS (Hadoop Distributed File System) and
YARN (Yet Another Resource Negotiator).

4. Basic Understanding of Big Data:

• MapReduce is commonly used in the context of big data processing.


• It's beneficial to have a foundational understanding of what constitutes "big
data," the challenges associated with large datasets, and the motivation behind
distributed computing for big data.

5. Linux/Unix Commands:

• Many big data platforms, including Hadoop, are typically deployed on Unix-
like systems.
• Familiarity with basic command-line operations in a Unix environment can be
helpful for interacting with Hadoop clusters.

6. SQL (Structured Query Language):

• If you are planning to use tools like Apache Hive, which provides a SQL-like
interface for querying data in Hadoop, a basic understanding of SQL can be
beneficial.

7. Concepts of Data Storage and Retrieval:

• Understanding how data is stored and retrieved in a distributed environment
is crucial.
• Concepts like sharding, replication, and indexing are relevant.

8. Algorithmic and Problem-Solving Skills:

• MapReduce involves breaking down problems into smaller tasks that can be
executed in parallel.
• Strong algorithmic and problem-solving skills are valuable for designing
efficient MapReduce jobs.
Explanation
Q: Describe MapReduce Execution steps with a neat diagram 12 M

• MapReduce is a programming model and processing technique designed for
processing and generating large datasets that can be parallelized across a
distributed cluster of computers.
• A job is a MapReduce program.
• Each job consists of several smaller units, called MapReduce tasks.
• The basic idea behind MapReduce is to divide a large computation into smaller
tasks that can be performed in parallel across multiple nodes in a cluster.

In a MapReduce job:

1. The data is split into smaller chunks, and a "map" function is applied to each
chunk independently.
2. The results are then shuffled and sorted, and a "reduce" function is applied to
combine the intermediate results into the final output.

The MapReduce programming approach allows for efficient processing of large
datasets in a distributed computing environment.
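As an illustration, the two phases can be sketched as a single-process Python function. The name run_mapreduce and the in-memory lists are assumptions made for this sketch; a real Hadoop job distributes these phases across cluster nodes.

```python
from collections import defaultdict

def run_mapreduce(records, map_fn, reduce_fn):
    # Map phase: apply map_fn to each input record independently.
    intermediate = []
    for key, value in records:
        intermediate.extend(map_fn(key, value))

    # Shuffle and sort: group all intermediate values by key.
    groups = defaultdict(list)
    for key, value in intermediate:
        groups[key].append(value)

    # Reduce phase: combine each group into the final output pairs.
    output = []
    for key in sorted(groups):
        output.extend(reduce_fn(key, groups[key]))
    return output
```

For example, passing a tokenizing map function and a summing reduce function turns this skeleton into a word counter.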
JobTracker and Task Tracker

• MapReduce consists of a single master JobTracker and one slave TaskTracker
per cluster node.
• The master is responsible for scheduling the component tasks in a job onto the
slaves, monitoring them and re-executing the failed tasks.
• The slaves execute the tasks as directed by the master.
• The MapReduce framework operates exclusively on (key, value) pairs.
• The framework views the input to a task as a set of (key, value) pairs and
produces a set of (key, value) pairs as the output of the task, which may be
of different types.

Map-Tasks

A map task is a task that implements the map() function, which runs user
application code for each key-value pair (k1, v1).

• Key k1 is a set of keys.
• Key k1 maps to a group of data values.
• Values v1 are large strings read from the input file(s).
• The output of map() is either zero pairs (when no values are found) or
intermediate key-value pairs (k2, v2).
Reduce Task

• A reduce task takes the map output (k2, v2) as its input and combines those
data pieces into a smaller set of data using a combiner.
• The reduce task is always performed after the map task.

Key-Value Pair

Each phase (Map phase and Reduce phase) of MapReduce has key-value pairs as
input and output.
Data should be first converted into key-value pairs before it is passed to the Mapper,
as the Mapper only understands key-value pairs of data.

Key-value pairs in Hadoop MapReduce are generated as follows:

• InputSplit - Defines a logical representation of the data and presents the
split data for processing by an individual map().
• RecordReader - Communicates with the InputSplit and converts the split into
records, which are key-value pairs in a format suitable for reading by the
Mapper.
• RecordReader uses TextInputFormat by default for converting data into key-
value pairs.
• RecordReader communicates with the InputSplit until the file is read.
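As a rough illustration of the default behaviour, the Python sketch below mimics what TextInputFormat's RecordReader produces: one (byte offset, line) key-value pair per line of a split. The function name text_input_format is an assumption for this sketch, not Hadoop's actual API.

```python
def text_input_format(split_text):
    """Simulate TextInputFormat's RecordReader: convert a text split
    into (byte offset, line) key-value pairs for the Mapper."""
    offset = 0
    for line in split_text.splitlines(keepends=True):
        # The key is the byte offset of the line; the value is the
        # line content without its trailing newline.
        yield (offset, line.rstrip("\n"))
        offset += len(line.encode("utf-8"))
```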

Grouping by Key

• When a map task completes, the shuffle process aggregates (combines) all the
Mapper outputs by grouping the key-value pairs of the Mapper output, and the
values v2 are appended into a list of values.
• A "Group By" operation on the intermediate keys creates the list(v2).

Shuffle and Sorting Phase

• All pairs with the same group key (k2) are collected and grouped together,
creating one group for each key.
• The shuffle output is a list of (k2, list(v2)) pairs. Thus, a different
subset of the intermediate key space is assigned to each reduce node.
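A minimal Python sketch of this grouping step, assuming the Mapper output is an in-memory list of (k2, v2) pairs (the name shuffle_and_sort is illustrative):

```python
from collections import defaultdict

def shuffle_and_sort(mapper_output):
    # Group every (k2, v2) pair by k2, producing sorted
    # (k2, list(v2)) entries as handed to the reducers.
    groups = defaultdict(list)
    for k2, v2 in mapper_output:
        groups[k2].append(v2)
    return sorted(groups.items())
```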

Reduce Tasks

• A reduce task implements reduce(), which takes the shuffled and sorted
Mapper output, grouped by key as (k2, list(v2)), and applies the function in
parallel to each group.
• The reduce function iterates over the list of values associated with a key
and produces outputs such as aggregations and statistics.
• The reduce function emits zero or more key-value pairs (k3, v3) to the
final output file. Reduce: (k2, list(v2)) → list(k3, v3)
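A sketch of such a reduce function in Python: it iterates over the list of values for one key and emits multiple (k3, v3) aggregation pairs. The name reduce_stats and the count/mean statistics are assumptions for illustration, not part of any Hadoop API.

```python
def reduce_stats(k2, v2_list):
    # Iterate over the list of values for key k2 and emit (k3, v3)
    # pairs; a reducer may emit zero or more output pairs per group.
    n = len(v2_list)
    return [((k2, "count"), n), ((k2, "mean"), sum(v2_list) / n)]
```

Note that the output keys (k3) need not equal the input key (k2), matching the (k2, list(v2)) → list(k3, v3) signature above.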
MapReduce Implementation

• MapReduce is a programming model and processing technique for handling
large datasets in a parallel and distributed fashion.
• The word count problem is a classic example of a task that can be solved using
MapReduce.
• The following is a mathematical representation of the MapReduce algorithm
for the word count problem.
Example:

Step 1: Input Document:

D="hello Hadoop, hi Hadoop, Hello MongoDB, hi Cassandra Hadoop"

Step 2: Map Function:

The Map function processes each word in the document and emits key-value pairs
where the key is the word, and the value is 1 (indicating the count).

Map("hello") → {("hello", 1)},

Map("Hadoop") → {("Hadoop", 1), ("Hadoop", 1), ("Hadoop", 1)},

Map("hi") → {("hi", 1), ("hi", 1)}, …

Step 3: Shuffle and Sort (Grouping by Key):

Group and sort the intermediate key-value pairs by key.

("hello", [1]), ("Hadoop", [1,1,1]), ("hi", [1,1]), …

Step 4: Reduce Function:

The Reduce function takes each unique key and the list of values and calculates the
sum.

Reduce("hello", [1]) → {("hello", 1)},

Reduce("Hadoop", [1,1,1]) → {("Hadoop", 3)},

Reduce("hi", [1,1]) → {("hi", 2)}, …

Step 5: Final Output:

{("hello",1), ("Hadoop",3), ("hi",2), ("Hello",1), ("MongoDB",1), ("Cassandra",1)}
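The five steps above can be reproduced with a short single-process Python simulation (not actual Hadoop code; stripping the trailing commas from tokens is an assumption that mirrors the splitting implied by the worked example):

```python
from collections import defaultdict

# Step 1: input document, tokenized on whitespace with commas stripped.
D = "hello Hadoop, hi Hadoop, Hello MongoDB, hi Cassandra Hadoop"
words = [w.strip(",") for w in D.split()]

# Step 2: Map - emit (word, 1) for every occurrence.
mapped = [(w, 1) for w in words]

# Step 3: Shuffle and sort - group the 1s into a list per word.
grouped = defaultdict(list)
for word, one in mapped:
    grouped[word].append(one)

# Step 4: Reduce - sum the list of values for each word.
counts = {word: sum(ones) for word, ones in grouped.items()}
```

Running this yields the counts listed in Step 5 (note "hello" and "Hello" remain distinct keys, since the comparison is case-sensitive).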
