TechMesh London 2012, December 5, 2012
MapReduce and Its Discontents
[email protected]
polyglotprogramming.com/talks
Beyond MapReduce: it's been a useful technology, but it has a first-generation feel. What's next?
Copyright Dean Wampler, 2011-2013, All Rights Reserved. Photos can only be used with permission. Otherwise, the content is free to use.
About Me...
[email protected]
@deanwampler
github.com/deanwampler
Functional Programming for Java Developers, Dean Wampler
Programming Hive, Dean Wampler, Jason Rutherglen & Edward Capriolo
It's a buzzword, but it is generally associated with the problem of data sets too big to manage with traditional SQL databases. A parallel development has been the NoSQL movement, which is good at handling semi-structured data, scaling, etc.
Trends
Three trends influence my thinking...
Data Size
[Diagram: an object model (ParentB1 with children ChildB1 and ChildB2, each with a toJSON method) mapped through Object-Relational Mapping and SQL to a Result Set from the Database.]
The toJSON methods are there because we often convert these object graphs back into fundamental structures, such as the maps and arrays of JSON so we can send them to the browser!
[Diagram: the relational/functional alternative: domain logic works with functional abstractions over a functional wrapper for relational data, which issues SQL queries; contrast with other, object-oriented domain logic using query objects.]
Focus on: Lists, Maps, Sets, Trees, ...
[Diagram: the same relational/functional picture, with domain logic using these functional collections over a functional wrapper for relational data.]
The three trends: Data Size, Formal Schema, and Data-Driven Programs.
[Diagram: Process 1, Process 2, and Process 3 consuming data from databases and files.]
[Figure: Word Count. Input documents on the left ("Hadoop uses MapReduce", "There is a Map phase", ...); the word-count output on the right (a 2, hadoop 1, is 2, map 1, mapreduce 1, phase 2, ...). We need to convert the Input to the Output.]
Four input documents, one left empty, the others with small phrases (for simplicity). The word-count output is on the right (we'll see why there are three output documents). We need to get from the input on the left-hand side to the output on the right-hand side.
[Figure: the full pipeline: Input, Mappers, Sort/Shuffle, Reducers, Output. The input documents ("Hadoop uses MapReduce", "There is a Map phase", "There is a Reduce phase", and one empty) flow through to the output word counts (a 2, hadoop 1, is 2, map 1, mapreduce 1, phase 2, reduce 1, there 2, uses 1).]
Here is a schematic view of the steps in Hadoop MapReduce. Each Input file is read by a single Mapper process (by default; it can be many-to-many, as we'll see later).
The Mappers emit key-value pairs that will be sorted, then partitioned and shuffled to the reducers, where each Reducer will get all instances of a given key (with 1 or more values).
Each Reducer generates the final key-value pairs and writes them to one or more files (based on the size of the output).
[Figure: Input to Mappers. Each document is fed to its own mapper as (n, "line of text") key-value pairs, where n is the offset of the line in the file.]
Each document gets a mapper. All data is organized into key-value pairs; each line will be a value and the offset position into the file will be the key, which we don't care about. I'm showing each document's contents in a box and 1 mapper task (JVM process) per document. Large documents might get split across several mapper tasks.
The mappers tokenize each line, one at a time, converting all words to lower case and counting them...
[Figure: the Mappers tokenize each line and emit (word, 1) pairs: (hadoop, 1), (uses, 1), (mapreduce, 1) from the first document; (there, 1), (is, 1), (a, 1), (map, 1), (phase, 1) from the second; nothing from the empty document; (there, 1), (is, 1), (a, 1), (reduce, 1), (phase, 1) from the last.]
The mappers emit key-value pairs, where each key is one of the words and the value is the count. In the most naive (but also most memory-efficient) implementation, each mapper simply emits (word, 1) each time word is seen. However, this is I/O inefficient!
Note that the mapper for the empty doc emits no pairs, as you would expect.
[Figure: Input, Mappers, Sort/Shuffle, Reducers. The emitted pairs are partitioned across three reducers by key: keys starting with 0-9 or a-l, keys starting with m-q, and keys starting with r-z.]
The mappers themselves don't decide to which reducer each pair should be sent. Rather, the job setup configures what to do and the Hadoop runtime enforces it during the Sort/Shuffle phase, where the key-value pairs in each mapper are sorted by key (that is, locally, not globally) and then the pairs are routed to the correct reducer, on the current machine or other machines.
Note how we partitioned the reducers, by the first letter of the keys. (By default, MR just hashes the keys and distributes them modulo the number of reducers.)
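As a small aside (not from the original slides), the default partitioning just described amounts to roughly the following one-liner; partitionFor is a hypothetical helper that mirrors what Hadoop's HashPartitioner does.

// Pick a reducer by hashing the key, forcing the hash non-negative,
// and taking it modulo the number of reducers.
def partitionFor(key: String, numReducers: Int): Int =
  (key.hashCode & Int.MaxValue) % numReducers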
[Figure: after the shuffle, each Reducer receives every key in its partition along with the collection of all values for that key: (a, [1,1]), (hadoop, [1]), (is, [1,1]) for 0-9, a-l; (map, [1]), (mapreduce, [1]), (phase, [1,1]) for m-q; (reduce, [1]), (there, [1,1]), (uses, [1]) for r-z.]
The reducers are passed each key (word) and a collection of all the values for that key (the
individual counts emitted by the mapper tasks). The MR framework creates these collections
for us.
[Figure: the complete flow: Input, Mappers, Sort/Shuffle, Reducers, Output. Each reducer sums the counts for its keys and writes its own output file: (a 2, hadoop 1, is 2), (map 1, mapreduce 1, phase 2), and (reduce 1, there 2, uses 1), which is why there are three output documents.]
The final view of the WordCount process flow. The reducer just sums the counts and writes the output.
The output files contain one line for each key (the word) and value (the count), assuming we're using text output. The choice of delimiter between key and value is up to you, but tab is common.
[Figure: the same flow again, annotated to highlight the Map step (which emits the key-value pairs) and the Reduce step (which combines them into the final counts).]
To recap, a map transforms one input to one output, but this is generalized in MapReduce to be one to 0-N. The output key-value pairs are distributed to reducers. The reduce collects together multiple inputs with the same key and combines them into the final output for that key.
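To make that recap concrete, here is a minimal sketch (not from the original slides) of the same word count written with ordinary Scala collection operations: the flatMap/map plays the role of the mappers, and the groupBy plus per-key sum plays the role of the sort/shuffle and the reducers. The object and value names are my own.

object WordCountSketch {
  def main(args: Array[String]): Unit = {
    // The four input "documents" from the figures (one is empty).
    val docs = Seq(
      "Hadoop uses MapReduce",
      "There is a Map phase",
      "",
      "There is a Reduce phase")

    // "Map" step: tokenize each line, lower-case the words, emit (word, 1) pairs.
    val pairs = docs.flatMap(_.split("""\s+""").filter(_.nonEmpty))
                    .map(word => (word.toLowerCase, 1))

    // "Shuffle" + "Reduce" step: group the pairs by word and sum the counts.
    val counts = pairs.groupBy(_._1).map { case (word, ps) => (word, ps.map(_._2).sum) }

    // Prints: a 2, hadoop 1, is 2, map 1, mapreduce 1, phase 2, reduce 1, there 2, uses 1.
    counts.toSeq.sortBy(_._1).foreach { case (word, count) => println(word + "\t" + count) }
  }
}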
History of MapReduce
Let's review where MapReduce came from and its best-known, open-source incarnation, Hadoop.
How would you index the web?
Did Google search the entire web in 0.26 seconds to find these ~49M results?
You ask a phrase and the search engine finds the best match in billions of web pages.
A distributed file system provides horizontal scalability and resiliency when file blocks are
duplicated around the cluster.
MapReduce for Computation (2004)
The compute model for processing all that data is MapReduce. It handles lots of boilerplate,
like breaking down jobs into tasks, distributing the tasks around the cluster, monitoring the
tasks, etc. You write your algorithm to the MR programming model.
About this time, Doug Cutting, the creator of Lucene, and Mike Cafarella were working on Nutch...
Lucene is an open-source text search engine. Nutch is an open source web crawler.
They started clean-room versions of MapReduce and GFS...
The name Hadoop comes from a toy stuffed elephant that Cutting's son owned at the time.
Benefits of MapReduce
We can't build vertical systems big enough, and if we could, they would cost a fortune!
Hadoop Design Goals
Maximize I/O Performance!!
And parallelize execution; run on server-class, commodity hardware.
Maximizing disk and network I/O is critical, because it's the largest throughput bottleneck. So, optimization is a core design goal of Hadoop (both MR and HDFS). It affects the features and performance of everything in the stack above it, including high-level programming tools!
By design, Hadoop is great for batch-mode data crunching.
Is MapReduce the end of the story? Does it meet all our needs? Let's look at a few problems...
#1: It's hard to implement many algorithms in MapReduce.
Even word count is not obvious. When you get to fancier stuff like joins, group-bys, etc.,
the mapping from the algorithm to the implementation is not trivial at all. In fact,
implementing algorithms in MR is now a specialized body of knowledge...
#2: For Hadoop in particular, the Java API is hard to use.
The Hadoop Java API is even more verbose and tedious to use than it should be.
import java.io.IOException;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;

public class WordCountMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {

  // Reusable output objects, a common Hadoop idiom to reduce object churn.
  private final Text word = new Text();
  private final IntWritable one = new IntWritable(1);

  @Override
  public void map(LongWritable key, Text valueDocContents,
      OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
    // Split each line on whitespace, lower-case each word, and emit (word, 1).
    String[] tokens = valueDocContents.toString().split("\\s+");
    for (String wordString : tokens) {
      if (wordString.length() > 0) {
        word.set(wordString.toLowerCase());
        output.collect(word, one);
      }
    }
  }
}
[Figure: a Cascading flow schematic: a source Tap and a sink Tap, both over HDFS.]
Schematically, here is what Word Count looks like in Cascading. See http://docs.cascading.org/cascading/1.2/userguide/html/ch02.html for details.
import cascading.*;
...
public class WordCount {
  public static void main(String[] args) {
    String inputPath = args[0];
    String outputPath = args[1];
    Properties properties = new Properties();
    FlowConnector.setApplicationJarClass(properties, Main.class);
    ...
That's It!!
See https://github.com/cloudera/crunch.
Others include Scoobi (http://nicta.github.com/scoobi/) and Spark, which we'll discuss next.
Use Spark (Scala) (Solution #2)
Distributed computing with in-memory caching. Up to 30x faster than MapReduce.
See http://www.spark-project.org/
Why isn't it more widely used? 1) lack of commercial support, 2) it has only recently emerged out of academia.
Spark is a Hadoop MapReduce alternative:
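As a taste of what that alternative looks like, here is a minimal word-count sketch (not from the original slides) in Spark's Scala API, using today's org.apache.spark package names; the application name, master setting, and the use of args for the input and output paths are my own placeholder choices.

import org.apache.spark.{SparkConf, SparkContext}

object SparkWordCount {
  def main(args: Array[String]): Unit = {
    // Placeholder configuration: run locally, read args(0), write args(1).
    val sc = new SparkContext(new SparkConf().setAppName("word-count").setMaster("local[*]"))

    sc.textFile(args(0))                                 // read the input documents
      .flatMap(_.split("""\s+""").filter(_.nonEmpty))    // "map" phase: tokenize each line
      .map(word => (word.toLowerCase, 1))                // emit (word, 1) pairs
      .reduceByKey(_ + _)                                // "reduce" phase: sum the counts per word
      .saveAsTextFile(args(1))                           // write the output

    sc.stop()
  }
}

The whole job is a handful of lines, and intermediate results can be cached in memory between stages rather than written back to disk after every pass.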
See http://www.cloudera.com/content/cloudera/en/products/cloudera-enterprise-core/cloudera-enterprise-RTQ.html. However, this was only recently announced (at the time of this writing), so it's not quite production-ready yet...
#3: It's not suitable for real-time event processing.
[Figure: a Storm topology: Spouts feed streams of events into Bolts, which can feed further Bolts.]
In Storm terminology, Spouts are data sources and Bolts are the event processors. There are facilities to support reliable message handling, various sources encapsulated in Spouts, and various targets of output. Distributed processing is baked in from the start.
Databases?
Use a SQL database unless you need the scale and looser schema of a NoSQL database!
#4: It's not ideal for graph processing.
Recall that PageRank is the famous algorithm invented by Sergey Brin and Larry Page to rank web pages. It's the foundation of Google's search engine.
Why not MapReduce?
[Diagram: a graph with nodes A through F.]
1 MR job for each iteration over all n nodes/edges; the graph is saved to disk after each iteration.
Pregel is the name of the river that runs through the city of Königsberg, Prussia (now Kaliningrad, Russia). 7 bridges crossed the river in the city (including 5 bridges to the 2 islands between the river branches). Leonhard Euler invented graph theory when he analyzed the question of whether or not you can cross all 7 bridges without retracing your steps (you can't).
Open-source Alternatives
Apache Giraph.
Apache Hama.
Aurelius Titan.
All are somewhat immature.
http://incubator.apache.org/giraph/
http://hama.apache.org/
http://thinkaurelius.github.com/titan/
None is very mature, and none has extensive commercial support.
A Manifesto...
Note that one reason SQL has succeeded all these years is that it, too, is inspired by math, e.g., set theory.
Functional Collections.
We already have the right model in the collection APIs that come with functional languages. They are far better engineered for intuitive data
transformations. They provide the right abstractions and hide boilerplate. In fact, they make it relatively easy to optimize implementations for
parallelization. The Scala collections offer parallelization with a tiny API call. Spark and Cascading transparently distribute collections across a
cluster.
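As a minimal illustration (not from the original slides) of that tiny API call, here is the earlier word count switched to Scala's parallel collections with a single .par; this assumes Scala 2.12 or earlier, where .par is built in (later versions need the separate scala-parallel-collections module).

object ParallelWordCount {
  def main(args: Array[String]): Unit = {
    val docs = Seq("Hadoop uses MapReduce", "There is a Map phase", "", "There is a Reduce phase")

    // .par turns the ordinary collection into a parallel one; the rest of the
    // pipeline is unchanged and now runs across the available cores.
    val counts = docs.par
      .flatMap(_.split("""\s+""").filter(_.nonEmpty))
      .map(_.toLowerCase)
      .groupBy(identity)
      .map { case (word, ws) => (word, ws.size) }

    counts.toList.sortBy(_._1).foreach { case (word, count) => println(word + "\t" + count) }
  }
}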
Erlang, Akka: Actor-based, Distributed Computation.
Fine-Grain Compute Models.
We can start using new, more efficient compute models, like Spark, Pregel, and Impala, today. Of course, you have to consider maturity, viability, and support issues in large organizations. So if you want to wait until these alternatives are more mature, then at least use better APIs for Hadoop!
For example, Erlang is a very mature language with the Actor model baked in. Akka is a Scala distributed computing toolkit based on the Actor model of concurrency. It exposes clean, low-level primitives for robust, distributed services (e.g., Actors), upon which we can build flexible big data systems that can handle soft real-time and batch processing efficiently and with great scalability.
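To make the actor primitives concrete, here is a minimal sketch (not from the original slides) using the classic Akka actor API; the EventCounter actor, its messages, and the actor-system name are hypothetical names of my own.

import akka.actor.{Actor, ActorSystem, Props}

// Hypothetical messages for a tiny event-counting service.
case class Event(word: String)
case object Report

// An actor owns its state and processes one message at a time, so the
// counts need no locks even when many senders fire events concurrently.
class EventCounter extends Actor {
  private var counts = Map.empty[String, Int].withDefaultValue(0)
  def receive = {
    case Event(word) => counts += word -> (counts(word) + 1)
    case Report      => sender() ! counts
  }
}

object ActorSketch {
  def main(args: Array[String]): Unit = {
    val system = ActorSystem("event-demo")
    val counter = system.actorOf(Props[EventCounter], "counter")
    Seq("hadoop", "uses", "mapreduce", "hadoop").foreach(word => counter ! Event(word))
    // A real service would ask (?) the actor for its counts; here we just shut down.
    system.terminate()
  }
}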
Final Thought:
@deanwampler
[email protected]
All pictures Copyright Dean Wampler, 2011-2013, All Rights Reserved. All other content is
free to use, but attribution is requested.