Introduction To MapReduce
Introducing MapReduce
• Now that we have described how Hadoop stores data, let's turn
our attention to how it processes data
• We typically process data in Hadoop using MapReduce
• MapReduce is not a language; it's a programming model
• MapReduce is a method for distributing a task across multiple
nodes. Each node processes data stored on that node.
• MapReduce consists of two functions:
map (K1, V1) -> (K2, V2)
reduce (K2, list(V2)) -> list(K3, V3)
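The two signatures above can be sketched with a hypothetical word-count example in plain Python (a simulation of the model, not Hadoop itself; the function names are illustrative):

```python
# map(K1, V1) -> list of (K2, V2): here K1 is a byte offset, V1 a line of text,
# and each output pair is (word, 1).
def map_fn(offset, line):
    return [(word, 1) for word in line.split()]

# reduce(K2, list(V2)) -> list of (K3, V3): here K2 is a word, list(V2) its
# counts, and the output is (word, total).
def reduce_fn(word, counts):
    return [(word, sum(counts))]

print(map_fn(0, "to be or not to be"))
# [('to', 1), ('be', 1), ('or', 1), ('not', 1), ('to', 1), ('be', 1)]
print(reduce_fn("to", [1, 1]))
# [('to', 2)]
```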
• Automatic parallelization and distribution (the biggest advantage)
• Fault tolerance (individual tasks can be retried)
• Hadoop comes with standard status and monitoring tools
• A clean abstraction for developers
• MapReduce programs are usually written in Java (possibly in other
languages using Hadoop Streaming)
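In the streaming style just mentioned, the mapper and reducer are ordinary programs that exchange tab-separated key/value lines. A sketch of word count in that style (the pipeline is simulated on a list here; in a real streaming job each function would read standard input):

```python
from itertools import groupby

def mapper(lines):
    """Emit one 'word<TAB>1' record per word in each input line."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

def reducer(lines):
    """Hadoop delivers the mapper output sorted by key; sum counts per key."""
    pairs = (line.split("\t") for line in lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"

# Simulating the shuffle/sort between the two stages with sorted():
mapped = sorted(mapper(["to be or not to be"]))
print(list(reducer(mapped)))
# ['be\t2', 'not\t1', 'or\t1', 'to\t2']
```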
Understanding Map and Reduce
The map function always runs first
• Typically used to “break down” the input data
• Filter, transform, or parse data, e.g. parse the stock symbol, price, and time
from a data feed
• The output from the map function (eventually) becomes the input to the
reduce function
The reduce function
• Typically used to aggregate data from the map function
• e.g. Compute the average hourly price of the stock
• Not always needed and therefore optional
• You can run something called a “map-only” job
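The stock example from the bullets above can be sketched as two small functions (a hypothetical illustration; the CSV record format and function names are assumptions, not from the original text):

```python
# Map side: filter/transform/parse — pull (symbol, hour) and price
# out of a raw "symbol,price,HH:MM" record.
def parse_quote(line):
    symbol, price, timestamp = line.split(",")
    hour = timestamp.split(":")[0]          # e.g. "09:15" -> "09"
    return ((symbol, hour), float(price))

# Reduce side: aggregate — average the prices seen for one (symbol, hour) key.
def average_price(key, prices):
    return (key, sum(prices) / len(prices))

feed = ["ACME,10.00,09:15", "ACME,12.00,09:45", "ACME,11.00,10:05"]
mapped = [parse_quote(line) for line in feed]
print(mapped[0])
# (('ACME', '09'), 10.0)
print(average_price(("ACME", "09"), [10.0, 12.0]))
# (('ACME', '09'), 11.0)
```

In a map-only job, the `parse_quote` step would run alone and its output would be written directly, with no aggregation step.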
Between these two tasks there is typically a hidden phase known as the
“Shuffle and Sort”
• Which organizes map output for delivery to the reducer
Each individual piece is simple, but collectively they are quite powerful
• Analogous to pipes and filters in Unix
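The Shuffle and Sort phase described above can be sketched as a grouping step: map output pairs are sorted by key so each call to the reducer sees one key with all of its values, much like `sort` sitting between two filters in a Unix pipeline (a simplified single-machine illustration; the function name is an assumption):

```python
from itertools import groupby

def shuffle_and_sort(map_output):
    """Turn (key, value) pairs into (key, [values]) lists, ordered by key."""
    ordered = sorted(map_output, key=lambda kv: kv[0])
    return [(key, [v for _, v in group])
            for key, group in groupby(ordered, key=lambda kv: kv[0])]

print(shuffle_and_sort([("be", 1), ("to", 1), ("be", 1)]))
# [('be', [1, 1]), ('to', [1])]
```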
MapReduce Daemons
JobTracker
◦ Determines the execution plan for the job
◦ Assigns individual tasks
TaskTracker
◦ Runs individual map and reduce tasks and reports their progress to the JobTracker
MapReduce: The Big Picture
Map Process