Introduction to MapReduce
MapReduce: Motivation
Problem Scope
Required functions
Automatic parallelization & distribution
Fault-tolerance
Status and monitoring tools
A clean abstraction for programmers
Functional programming meets distributed computing
A batch data processing system
Commodity Clusters
MapReduce & Hadoop - History
2003: Google publishes a paper on its cluster architecture & distributed file system (GFS)
2004: Google publishes a paper on its MapReduce model, used on top of GFS
Both GFS and MapReduce are written in C++ and are closed-source, with Python and Java APIs available to Google programmers only
2006: Apache & Yahoo! -> Hadoop & HDFS
open-source, Java implementations of Google MapReduce and GFS with a diverse set of APIs available to the public
Evolved from Apache Lucene/Nutch open-source web search engine
2008: Hadoop becomes an independent Apache project
Yahoo! uses Hadoop in production
Today: Hadoop is used as a general-purpose storage and analysis platform for big data
Other Hadoop distributions are available from several vendors, including EMC, IBM, Microsoft, Oracle, Cloudera, etc.
Many users (http://wiki.apache.org/hadoop/PoweredBy)
Research and development actively continues...
Google Cluster Architecture: Key Ideas
What Makes MapReduce Unique?
Its simplified programming model, which allows the user to quickly write and test distributed systems
Its efficient and automatic distribution of data and workload across machines
Its flat scalability curve. Specifically, after a MapReduce program is written and functioning on 10 nodes, very little, if any, work is required to make that same program run on 1000 nodes.
MapReduce ties smaller and more reasonably priced machines together into a single cost-effective commodity cluster
Isolated Tasks
MapReduce in a Nutshell
Given:
a very large dataset
a well-defined computation task to be performed on elements of this dataset (preferably, in a parallel fashion on a large cluster)
MapReduce framework:
Just express what you want to compute (map() & reduce()).
Don't worry about parallelization, fault tolerance, data distribution, load balancing (MapReduce takes care of these).
What changes from one application to another is the actual computation; the programming structure stays similar.
In simple terms:
Read lots of data.
Map: extract something that you care about from each record.
Shuffle and sort.
Reduce: aggregate, summarize, filter, or transform.
Write the results.
One can use as many Maps and Reduces as needed to model a given problem.
Functional programming foundations
Note: There is no precise 1-1 correspondence. Please take this just as an analogy.
MapReduce Basic Programming Model
Word Count
map(k1, v1) -> list(k2, v2)
reduce(k2, list(v2)) -> list(v2)
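To make these signatures concrete, below is a minimal word-count sketch in Hadoop's Java API; it is essentially the canonical WordCount example, and a driver that configures and submits the job is sketched later under InputFormat.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {
  // map(k1, v1) -> list(k2, v2): emit (word, 1) for every token in the line
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // reduce(k2, list(v2)) -> list(v2): sum the 1s for each word
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }
}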
Parallel processing model
Execution overview
Read as part of this lecture: Jeffrey Dean and Sanjay Ghemawat. 2008. MapReduce: simplified data processing on large clusters. Commun. ACM 51, 1 (January 2008), 107-113.
The master coordinates the workers; map output is written to local disk, and reducers fetch it via remote reads.
MapReduce Scheduling
Data Distribution
An underlying distributed file system (e.g., GFS) splits large data files into chunks which are managed by different nodes in the cluster
Input data: a large file
Even though the file chunks are distributed across several machines, they form a single namespace
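As a small illustration of that single namespace, the sketch below uses HDFS's Java FileSystem API to list where each chunk of a file physically lives; the path is a placeholder.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocations {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());

    // One logical path in the namespace, many physical chunks in the cluster.
    FileStatus file = fs.getFileStatus(new Path("/data/input/large-file.txt"));
    BlockLocation[] blocks = fs.getFileBlockLocations(file, 0, file.getLen());

    for (BlockLocation block : blocks) {
      // Each block is replicated on several nodes.
      System.out.println(block.getOffset() + " -> "
          + String.join(", ", block.getHosts()));
    }
  }
}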
Partitions
Choosing M and R
MapReduce Fault Tolerance
On worker failure:
Master detects failure via periodic heartbeats.
Both completed and in-progress map tasks on that worker should be re-executed (output is stored on local disk).
Only in-progress reduce tasks on that worker should be re-executed (output is stored in the global file system).
All reduce workers will be notified about any map re-executions.
On master failure:
State is checkpointed to GFS: a new master recovers & continues.
Robustness:
Example: Lost 1600 of 1800 machines once, but finished fine.
MapReduce Data Locality
Stragglers & Backup Tasks
Other Practical Extensions
Basic MapReduce Program Design
MapReduce vs. Traditional RDBMS
More Hadoop details
Hadoop
Hadoop MapReduce: A Closer Look
(Figure: on each node, files are loaded from the local HDFS store, pass through InputFormat, input splits, and RecordReaders, and the final output is written back to the local HDFS store via OutputFormat.)
Input Files
Input files are where the data for a MapReduce task is initially stored
The input files typically reside in a distributed file system (e.g., HDFS)
The format of input files is arbitrary
Line-based log files
Binary files
Multi-line input records
Or something else entirely
InputFormat
How the input files are split up and read is defined by the InputFormat
InputFormat is a class that does the following:
Selects the files that should be used for input
Defines the InputSplits that break up a file
Provides a factory for RecordReader objects that read the file
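A sketch of a driver that selects the InputFormat when configuring a job (paths are placeholders; the Mapper and Reducer are the word-count classes sketched earlier):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class WordCountDriver {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCountDriver.class);

    job.setMapperClass(WordCount.TokenizerMapper.class);
    job.setReducerClass(WordCount.IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    // The InputFormat decides how files are split and read;
    // TextInputFormat yields one (byte offset, line) record per line.
    job.setInputFormatClass(TextInputFormat.class);
    job.setOutputFormatClass(TextOutputFormat.class);

    FileInputFormat.addInputPath(job, new Path("/data/input"));     // placeholder
    FileOutputFormat.setOutputPath(job, new Path("/data/output"));  // placeholder

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}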
InputFormat Types
Input Splits
An input split describes a unit of work that comprises a single map task in a MapReduce program
If the file is very large, splitting it can improve performance significantly through parallelism
RecordReader
The input split defines a slice of work but does not describe how to access it
The RecordReader class actually loads data from its source and converts it into (K, V) pairs suitable for reading by Mappers
Mapper
The Mapper performs the user-defined work of the first phase of the MapReduce program
A new instance of Mapper is created for each split
Partitioner
Each mapper may emit (K, V) pairs to any partition
Therefore, the map nodes must all agree on where to send different pieces of intermediate data
The partitioner class determines which partition a given (K, V) pair will go to
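As a sketch, a custom partitioner in Hadoop's Java API looks like the following; it mirrors the behavior of Hadoop's default HashPartitioner, which routes each pair by the key's hash.

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class WordPartitioner extends Partitioner<Text, IntWritable> {
  @Override
  public int getPartition(Text key, IntWritable value, int numPartitions) {
    // Mask off the sign bit so the result is a valid partition index;
    // all mappers compute the same function, so they agree on routing.
    return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
  }
}

It would be registered on the job with job.setPartitionerClass(WordPartitioner.class).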
Sort
Each Reducer is responsible for reducing the values associated with (several) intermediate keys
The set of intermediate keys on a single node is automatically sorted before being presented to the Reducer
OutputFormat
The OutputFormat defines how the final (K, V) pairs produced by the Reducers are written back to files in HDFS
Questions?
Exercise
MapReduce Use Case: Word Length
Desired output (count of words in each length class):
Big 37
Medium 148
Small 200
Tiny 9
MapReduce Use Case: Word Length
Map output (one (class, 1) pair per word):
Big 1
Medium 1
Small 1
Tiny 1
...
After shuffle & sort (values grouped by key):
Big 1,1,1,1,...
Medium 1,1,1,...
Small 1,1,1,1,...
Tiny 1,1,1,1,...
Reduce output:
Big 37
Medium 148
Small 200
Tiny 9
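A sketch of a mapper for this use case; the length thresholds for Tiny/Small/Medium/Big are illustrative assumptions, not taken from the slides, and the IntSumReducer from the word-count example can be reused unchanged.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordLengthMapper
    extends Mapper<Object, Text, Text, IntWritable> {
  private static final IntWritable ONE = new IntWritable(1);
  private final Text category = new Text();

  @Override
  public void map(Object key, Text value, Context context)
      throws IOException, InterruptedException {
    StringTokenizer itr = new StringTokenizer(value.toString());
    while (itr.hasMoreTokens()) {
      int len = itr.nextToken().length();
      // Thresholds are illustrative assumptions, not from the slides.
      if (len < 3) {
        category.set("Tiny");
      } else if (len < 6) {
        category.set("Small");
      } else if (len < 10) {
        category.set("Medium");
      } else {
        category.set("Big");
      }
      context.write(category, ONE);  // emit (length class, 1)
    }
  }
}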
MapReduce Use Case: Inverted Indexing
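The slide body is not preserved here, but the classic pattern is: map emits (term, documentId) for each term in a document, and reduce collects the posting list per term. A sketch, assuming an InputFormat such as KeyValueTextInputFormat that delivers (documentId, documentText) records:

import java.io.IOException;
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.StringTokenizer;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class InvertedIndex {
  public static class IndexMapper extends Mapper<Text, Text, Text, Text> {
    private final Text term = new Text();

    @Override
    public void map(Text docId, Text doc, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(doc.toString());
      while (itr.hasMoreTokens()) {
        term.set(itr.nextToken());
        context.write(term, docId);  // emit (term, docId)
      }
    }
  }

  public static class IndexReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    public void reduce(Text term, Iterable<Text> docIds, Context context)
        throws IOException, InterruptedException {
      Set<String> postings = new LinkedHashSet<>();  // dedupe repeated mentions
      for (Text id : docIds) {
        postings.add(id.toString());
      }
      context.write(term, new Text(String.join(",", postings)));
    }
  }
}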
Sources & References