Introduction To MapReduce


Introduction: MapReduce

• The concept of MapReduce was pioneered by Google.


• The original paper titled "MapReduce: Simplified Data Processing on Large
Clusters" was written by Jeffrey Dean and Sanjay Ghemawat, and it was
published in 2004.
• In the paper, they introduced the MapReduce programming model and
described its implementation at Google for processing large-scale data across
distributed clusters.
• MapReduce became a fundamental framework for distributed computing and
played a significant role in the development of big data technologies.
• While Google introduced the concept, the open-source Apache Hadoop project
later implemented its own version of MapReduce, making it accessible to a
broader community of developers and organizations.

Prerequisites that can help you grasp MapReduce more effectively


1. Programming Languages:

• Proficiency in a programming language is crucial.


• Java is commonly used in the Hadoop ecosystem, and many MapReduce
examples are written in Java.
• Knowledge of Python can also be useful.

2. Distributed Systems:

• Understanding the basics of distributed computing is essential.


• Familiarize yourself with concepts like nodes, clusters, parallel processing, and
the challenges associated with distributed systems.
3. Hadoop Ecosystem:

• MapReduce is often associated with the Hadoop framework.


• Therefore, it's helpful to have a basic understanding of Hadoop and its
ecosystem components, such as HDFS (Hadoop Distributed File System) and
YARN (Yet Another Resource Negotiator).

4. Basic Understanding of Big Data:

• MapReduce is commonly used in the context of big data processing.


• It's beneficial to have a foundational understanding of what constitutes "big
data," the challenges associated with large datasets, and the motivation behind
distributed computing for big data.

5. Linux/Unix Commands:

• Many big data platforms, including Hadoop, are typically deployed on Unix-
like systems.
• Familiarity with basic command-line operations in a Unix environment can be
helpful for interacting with Hadoop clusters.

6. SQL (Structured Query Language):

• If you are planning to use tools like Apache Hive, which provides a SQL-like
interface for querying data in Hadoop, a basic understanding of SQL can be
beneficial.

7. Concepts of Data Storage and Retrieval:

• Understanding how data is stored and retrieved in a distributed environment
is crucial.
• Concepts like sharding, replication, and indexing are relevant.

8. Algorithmic and Problem-Solving Skills:

• MapReduce involves breaking down problems into smaller tasks that can be
executed in parallel.
• Strong algorithmic and problem-solving skills are valuable for designing
efficient MapReduce jobs.
Explanation
Q: Describe MapReduce Execution steps with a neat diagram 12 M

• MapReduce is a programming model and processing technique designed for
processing and generating large datasets that can be parallelized across a
distributed cluster of computers.
• A job is a MapReduce program.
• Each job consists of several smaller units, called MapReduce tasks.
• The basic idea behind MapReduce is to divide a large computation into smaller
tasks that can be performed in parallel across multiple nodes in a cluster.

In a MapReduce job:

1. The data is split into smaller chunks, and a "map" function is applied to each
chunk independently.
2. The results are then shuffled and sorted, and a "reduce" function is applied to
combine the intermediate results into the final output.

The MapReduce programming approach allows for efficient processing of large
datasets in a distributed computing environment.
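As an illustration, the two phases can be sketched as a single-process Python function. The name run_mapreduce and the in-memory lists are assumptions made for this sketch; a real Hadoop job distributes these phases across cluster nodes.

```python
from collections import defaultdict

def run_mapreduce(records, map_fn, reduce_fn):
    # Map phase: apply map_fn to each input record independently.
    intermediate = []
    for key, value in records:
        intermediate.extend(map_fn(key, value))

    # Shuffle and sort: group all intermediate values by key.
    groups = defaultdict(list)
    for key, value in intermediate:
        groups[key].append(value)

    # Reduce phase: combine each group into the final output pairs.
    output = []
    for key in sorted(groups):
        output.extend(reduce_fn(key, groups[key]))
    return output
```

For example, passing a tokenizing map function and a summing reduce function turns this skeleton into a word counter.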
JobTracker and Task Tracker

• MapReduce consists of a single master JobTracker and one slave TaskTracker
per cluster node.
• The master is responsible for scheduling the component tasks in a job onto the
slaves, monitoring them and re-executing the failed tasks.
• The slaves execute the tasks as directed by the master.
• The MapReduce framework operates exclusively on (key, value) pairs.
• The framework views the input to a task as a set of (key, value) pairs and
produces a set of (key, value) pairs as the output of the task, which may be
of different types.

Map-Tasks

A map task is a task that implements the map() function, which runs user
application code for each key-value pair (k1, v1).

• Key k1 is a set of keys.
• Key k1 maps to a group of data values.
• Values v1 are large strings read from the input file(s).
• The output of map() is either zero pairs (when no values are found) or
intermediate key-value pairs (k2, v2).
Reduce Task

• A reduce task takes the map output (k2, v2) as its input and combines those
data pieces into a smaller set of data using a combiner.
• The reduce task is always performed after the map task.

Key-Value Pair

Each phase (Map phase and Reduce phase) of MapReduce has key-value pairs as
input and output.
Data should be first converted into key-value pairs before it is passed to the Mapper,
as the Mapper only understands key-value pairs of data.

Key-value pairs in Hadoop MapReduce are generated as follows:

• InputSplit - Defines a logical representation of the data and presents the
split data for processing by an individual map().
• RecordReader - Communicates with the InputSplit and converts the split into
records, which are key-value pairs in a format suitable for reading by the
Mapper.
• RecordReader uses TextInputFormat by default for converting data into key-
value pairs.
• RecordReader communicates with the InputSplit until the file is read.
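As a rough illustration of the default behaviour, the Python sketch below mimics what TextInputFormat's RecordReader produces: one (byte offset, line) key-value pair per line of a split. The function name text_input_format is an assumption for this sketch, not Hadoop's actual API.

```python
def text_input_format(split_text):
    """Simulate TextInputFormat's RecordReader: convert a text split
    into (byte offset, line) key-value pairs for the Mapper."""
    offset = 0
    for line in split_text.splitlines(keepends=True):
        # The key is the byte offset of the line; the value is the
        # line content without its trailing newline.
        yield (offset, line.rstrip("\n"))
        offset += len(line.encode("utf-8"))
```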

Grouping by Key

• When a map task completes, the shuffle process aggregates (combines) all the
Mapper outputs by grouping the key-value pairs of the Mapper output, and the
values v2 are appended into a list of values.
• A "Group By" operation on the intermediate keys creates the list(v2).

Shuffle and Sorting Phase

• All pairs with the same group key (k2) are collected and grouped together,
creating one group for each key.
• The shuffle output is a list of (k2, list(v2)) pairs. Thus, a different
subset of the intermediate key space is assigned to each reduce node.
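A minimal Python sketch of this grouping step, assuming the Mapper output is an in-memory list of (k2, v2) pairs (the name shuffle_and_sort is illustrative):

```python
from collections import defaultdict

def shuffle_and_sort(mapper_output):
    # Group every (k2, v2) pair by k2, producing sorted
    # (k2, list(v2)) entries as handed to the reducers.
    groups = defaultdict(list)
    for k2, v2 in mapper_output:
        groups[k2].append(v2)
    return sorted(groups.items())
```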

Reduce Tasks

• A reduce task implements reduce(), which takes the shuffled and sorted
Mapper output, grouped by key as (k2, list(v2)), and applies the function in
parallel to each group.
• The reduce function iterates over the list of values associated with a key
and produces outputs such as aggregations and statistics.
• The reduce function emits zero or more key-value pairs (k3, v3) to the
final output file. Reduce: (k2, list(v2)) → list(k3, v3)
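A sketch of such a reduce function in Python: it iterates over the list of values for one key and emits multiple (k3, v3) aggregation pairs. The name reduce_stats and the count/mean statistics are assumptions for illustration, not part of any Hadoop API.

```python
def reduce_stats(k2, v2_list):
    # Iterate over the list of values for key k2 and emit (k3, v3)
    # pairs; a reducer may emit zero or more output pairs per group.
    n = len(v2_list)
    return [((k2, "count"), n), ((k2, "mean"), sum(v2_list) / n)]
```

Note that the output keys (k3) need not equal the input key (k2), matching the (k2, list(v2)) → list(k3, v3) signature above.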
MapReduce Implementation

• MapReduce is a programming model and processing technique for handling
large datasets in a parallel and distributed fashion.
• The word count problem is a classic example of a task that can be solved using
MapReduce.
• The following is a mathematical representation of the MapReduce algorithm
for the word count problem.
Example:

Step 1: Input Document:

D="hello Hadoop, hi Hadoop, Hello MongoDB, hi Cassandra Hadoop"

Step 2: Map Function:

The Map function processes each word in the document and emits key-value pairs
where the key is the word, and the value is 1 (indicating the count).

Map("hello") → {("hello", 1)},

Map("Hadoop") → {("Hadoop", 1), ("Hadoop", 1), ("Hadoop", 1)},

Map("hi") → {("hi", 1), ("hi", 1)}, …

Step 3: Shuffle and Sort (Grouping by Key):

Group and sort the intermediate key-value pairs by key.

("hello", [1]), ("Hadoop", [1,1,1]), ("hi", [1,1]), …

Step 4: Reduce Function:

The Reduce function takes each unique key and the list of values and calculates the
sum.

Reduce("hello", [1]) → {("hello", 1)},

Reduce("Hadoop", [1,1,1]) → {("Hadoop", 3)},

Reduce("hi", [1,1]) → {("hi", 2)}, …

Step 5: Final Output:

{("hello",1), ("Hadoop",3), ("hi",2), ("Hello",1), ("MongoDB",1), ("Cassandra",1)}
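The five steps above can be reproduced with a short single-process Python simulation (not actual Hadoop code; stripping the trailing commas from tokens is an assumption that mirrors the splitting implied by the worked example):

```python
from collections import defaultdict

# Step 1: input document, tokenized on whitespace with commas stripped.
D = "hello Hadoop, hi Hadoop, Hello MongoDB, hi Cassandra Hadoop"
words = [w.strip(",") for w in D.split()]

# Step 2: Map - emit (word, 1) for every occurrence.
mapped = [(w, 1) for w in words]

# Step 3: Shuffle and sort - group the 1s into a list per word.
grouped = defaultdict(list)
for word, one in mapped:
    grouped[word].append(one)

# Step 4: Reduce - sum the list of values for each word.
counts = {word: sum(ones) for word, ones in grouped.items()}
```

Running this yields the counts listed in Step 5 (note "hello" and "Hello" remain distinct keys, since the comparison is case-sensitive).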
