Intro to MapReduce

The document discusses MapReduce and Hadoop. It introduces MapReduce as a framework for processing large datasets in parallel across clusters of computers, and describes how Hadoop uses MapReduce to parallelize jobs automatically and to handle failures. It explains the key aspects of a MapReduce job in Hadoop, such as the map and reduce functions, partitioning, sorting, and output.

CPS216: Advanced Database Systems (Data-intensive Computing Systems)

Introduction to MapReduce and Hadoop
Shivnath Babu
Word Count over a Given Set of Web Pages

Input pages:
see bob throw
see spot run

Counts emitted per page:
see 1
bob 1
throw 1
see 1
spot 1
run 1

Final counts:
bob 1
run 1
see 2
spot 1
throw 1

Can we do word count in parallel?
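
One way to see why the answer is yes: each page can be counted independently, and the per-page counts only need to be grouped by word and summed afterwards. The following is a minimal Python sketch of that idea, not Hadoop code. The function names (`map_page`, `reduce_counts`, `word_count`) are hypothetical, and a small thread pool stands in for a cluster.

```python
from collections import defaultdict
from multiprocessing.dummy import Pool  # thread pool standing in for a cluster


def map_page(page):
    """Map: emit a (word, 1) pair for every word on one page."""
    return [(word, 1) for word in page.split()]


def reduce_counts(word, counts):
    """Reduce: sum all the counts emitted for one word."""
    return word, sum(counts)


def word_count(pages):
    # Pages are independent of each other, so the map calls can run in parallel.
    with Pool(2) as pool:
        mapped = pool.map(map_page, pages)
    # Group the emitted pairs by word before reducing.
    groups = defaultdict(list)
    for pairs in mapped:
        for word, one in pairs:
            groups[word].append(one)
    # Reduce each group; keys are visited in sorted order.
    return dict(reduce_counts(w, c) for w, c in sorted(groups.items()))


if __name__ == "__main__":
    print(word_count(["see bob throw", "see spot run"]))
    # {'bob': 1, 'run': 1, 'see': 2, 'spot': 1, 'throw': 1}
```

The two pages from the slide produce the same final counts as the table above.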
The MapReduce Framework
(pioneered by Google)
Automatic Parallel Execution in
MapReduce (Google)
Handles failures automatically, e.g., restarts tasks if a
node fails; runs multiple copies of the same task to
avoid a slow task slowing down the whole job
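
The two behaviours named above, restarting a failed task and running duplicate copies of a task, can be illustrated with a small Python simulation. This is only a sketch under stated assumptions: `run_with_retries` and `run_speculatively` are hypothetical helpers, threads stand in for cluster nodes, and `time.sleep` simulates a slow node. It does not reflect how Google's or Hadoop's scheduler is implemented.

```python
import time
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait


def run_with_retries(task, max_attempts=3):
    """Re-run a task that fails, as a framework might after a node failure."""
    for attempt in range(1, max_attempts + 1):
        try:
            return task(attempt)
        except RuntimeError:
            if attempt == max_attempts:
                raise  # give up after the last allowed attempt


def run_speculatively(task, delays):
    """Start one copy of the task per delay and keep whichever finishes first."""
    def copy(delay):
        time.sleep(delay)  # simulates a fast or a slow node
        return delay, task()

    pool = ThreadPoolExecutor(max_workers=len(delays))
    futures = [pool.submit(copy, d) for d in delays]
    done, _ = wait(futures, return_when=FIRST_COMPLETED)
    pool.shutdown(wait=False)  # do not block on the slower copies
    return next(iter(done)).result()


def flaky(attempt):
    # Hypothetical task that fails on its first two attempts.
    if attempt < 3:
        raise RuntimeError("node failed")
    return "done"


if __name__ == "__main__":
    print(run_with_retries(flaky))
    print(run_speculatively(lambda: sum(range(10)), delays=[0.3, 0.01]))
```

In the second call the copy with the 0.01 s delay wins, so the slow copy no longer determines when the job can proceed.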
MapReduce in Hadoop (1)
MapReduce in Hadoop (2)
MapReduce in Hadoop (3)
Data Flow in a MapReduce
Program in Hadoop
InputFormat
Map function
Partitioner
Sorting & Merging
Combiner
Shuffling
Merging
Reduce function
OutputFormat

1:many
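
The stages listed above can be traced in a small in-memory simulation. The sketch below is plain Python, not Hadoop code, and all names (`NUM_REDUCERS`, `map_fn`, `partition`, `combine`, `reduce_fn`, `run_job`) are hypothetical. A list of splits stands in for the InputFormat, the returned lists stand in for the OutputFormat, and the partitioner is a simple deterministic stand-in for a hash.

```python
import heapq
from itertools import groupby
from operator import itemgetter

NUM_REDUCERS = 2  # hypothetical setting for this sketch


def map_fn(line):
    """Map function: one input record yields many (key, value) pairs."""
    return [(word, 1) for word in line.split()]


def partition(key):
    """Partitioner: pick the reduce task for a key (byte sum as a stand-in hash)."""
    return sum(key.encode()) % NUM_REDUCERS


def combine(sorted_pairs):
    """Combiner: pre-aggregate values per key on the map side."""
    return [(k, sum(v for _, v in grp))
            for k, grp in groupby(sorted_pairs, key=itemgetter(0))]


def reduce_fn(key, values):
    """Reduce function: sum the values for one key."""
    return key, sum(values)


def run_job(splits):
    # Map side: one map task per input split.
    map_outputs = []
    for split in splits:
        buckets = [[] for _ in range(NUM_REDUCERS)]
        for line in split:
            for key, value in map_fn(line):
                buckets[partition(key)].append((key, value))
        # Sort each partition, then apply the combiner to it.
        map_outputs.append([combine(sorted(b)) for b in buckets])

    # Reduce side: one reduce task per partition.
    results = []
    for r in range(NUM_REDUCERS):
        runs = [out[r] for out in map_outputs]  # shuffle: fetch partition r
        merged = heapq.merge(*runs)             # merge the sorted runs
        results.append([reduce_fn(k, [v for _, v in grp])
                        for k, grp in groupby(merged, key=itemgetter(0))])
    return results


if __name__ == "__main__":
    for r, output in enumerate(run_job([["see bob throw"], ["see spot run"]])):
        print(r, output)
```

Each reduce task ends up with a disjoint, sorted set of keys, because the partitioner sends every occurrence of a key to the same task.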
Lifecycle of a MapReduce Job
Map function
Reduce function
Run this program as a
MapReduce job
[Figure: timeline of a MapReduce job. Labels: Input Splits; Map Wave 1, Map Wave 2; Reduce Wave 1, Reduce Wave 2. Axis: Time.]
How are the number of splits, number of map and reduce
tasks, memory allocation to tasks, etc., determined?
Job Configuration Parameters
190+ parameters in Hadoop
Set manually or defaults are used
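
As a rough illustration of how a few such settings shape a job, the sketch below estimates the number of splits and the number of map and reduce waves. It rests on simplifying assumptions: one split per fixed-size chunk of input, one map task per split, and a fixed number of task slots in the cluster. `plan_job` and its parameter names are hypothetical and are not Hadoop configuration keys.

```python
def ceil_div(a, b):
    """Integer ceiling division."""
    return -(-a // b)


def plan_job(input_bytes, split_bytes, map_slots, reduce_tasks, reduce_slots):
    """Estimate splits, map tasks, and waves under the assumptions above."""
    splits = ceil_div(input_bytes, split_bytes)
    map_tasks = splits  # assumption: one map task per input split
    return {
        "splits": splits,
        "map_tasks": map_tasks,
        # Tasks beyond the available slots have to wait for a later wave.
        "map_waves": ceil_div(map_tasks, map_slots),
        "reduce_waves": ceil_div(reduce_tasks, reduce_slots),
    }


if __name__ == "__main__":
    MiB = 1024 * 1024
    print(plan_job(input_bytes=1024 * MiB, split_bytes=128 * MiB,
                   map_slots=4, reduce_tasks=4, reduce_slots=2))
    # {'splits': 8, 'map_tasks': 8, 'map_waves': 2, 'reduce_waves': 2}
```

With these example numbers, eight map tasks share four slots, so the map phase runs in two waves.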
How to sort data using Hadoop?
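
One common approach, which is not necessarily the one the lecture's example uses, relies on the framework's own sort: the map function emits each record as a key, a range partitioner sends each key range to a different reduce task, and the outputs of the reduce tasks are concatenated in order. The Python sketch below simulates this in memory. `mapreduce_sort`, `make_range_partitioner`, and the `boundaries` argument are hypothetical names for illustration.

```python
import bisect


def map_fn(record):
    """Identity map: the record itself becomes the key."""
    return record, None


def make_range_partitioner(boundaries):
    """Keys <= boundaries[0] go to reduce task 0, the next range to task 1, etc."""
    def partition(key):
        return bisect.bisect_left(boundaries, key)
    return partition


def mapreduce_sort(records, boundaries):
    partition = make_range_partitioner(boundaries)
    buckets = [[] for _ in range(len(boundaries) + 1)]
    for record in records:
        key, _ = map_fn(record)
        buckets[partition(key)].append(key)
    # The framework sorts keys within each reduce task; reduce is the identity.
    outputs = [sorted(bucket) for bucket in buckets]
    # Key ranges do not overlap, so concatenating the outputs in task order
    # gives a globally sorted result.
    return [key for output in outputs for key in output]


if __name__ == "__main__":
    print(mapreduce_sort([25, 3, 17, 10, 42, 8], boundaries=[10, 20]))
    # [3, 8, 10, 17, 25, 42]
```

The choice of boundaries affects only load balance between reduce tasks, not the correctness of the final order.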
Let us look at a complete
example MapReduce program
in Hadoop