TechMesh London 2012, December 5, 2012
MapReduce and Its Discontents
[email protected]
polyglotprogramming.com/talks
Beyond MapReduce: it's been a useful technology, but it has a first-generation feel. What's next?
Copyright Dean Wampler, 2011-2013, All Rights Reserved. Photos can only be used with permission. Otherwise, the content is free to use.
About Me...
[email protected]
@deanwampler
github.com/deanwampler
Functional Programming for Java Developers, Dean Wampler
Programming Hive, Dean Wampler, Jason Rutherglen & Edward Capriolo
It's a buzzword, but it is generally associated with the problem of data sets too big to manage with traditional SQL databases. A parallel development has been the NoSQL movement, which is good at handling semi-structured data, scaling, etc.
Trends
Three trends influence my thinking...
Data Size
[Diagram: an object model (ParentB1 with children ChildB1 and ChildB2, each with a toJSON method) mapped through Object-Relational Mapping and SQL to a Result Set from the Database.]
The toJSON methods are there because we often convert these object graphs back into fundamental structures, such as the maps and arrays of JSON so we can send them to the browser!
[Diagram: the relational/functional alternative: domain logic works with functional abstractions over a functional wrapper for relational data, which issues SQL queries; contrast with other, object-oriented domain logic using query objects.]
Focus on: Lists, Maps, Sets, Trees, ...
[Diagram: the same relational/functional picture, with domain logic using these functional collections over a functional wrapper for relational data.]
The three trends: Data Size, Formal Schema, and Data-Driven Programs.
[Diagram: Process 1, Process 2, and Process 3 consuming data from databases and files.]
[Figure: Word Count. Input documents on the left ("Hadoop uses MapReduce", "There is a Map phase", ...); the word-count output on the right (a 2, hadoop 1, is 2, map 1, mapreduce 1, phase 2, ...). We need to convert the Input to the Output.]
Four input documents, one left empty, the others with small phrases (for simplicity). The word-count output is on the right (we'll see why there are three output documents). We need to get from the input on the left-hand side to the output on the right-hand side.
[Figure: the full pipeline: Input, Mappers, Sort/Shuffle, Reducers, Output. The input documents ("Hadoop uses MapReduce", "There is a Map phase", "There is a Reduce phase", and one empty) flow through to the output word counts (a 2, hadoop 1, is 2, map 1, mapreduce 1, phase 2, reduce 1, there 2, uses 1).]
Here is a schematic view of the steps in Hadoop MapReduce. Each Input file is read by a single Mapper process (by default; it can be many-to-many, as we'll see later).
The Mappers emit key-value pairs that will be sorted, then partitioned and shuffled to the reducers, where each Reducer will get all instances of a given key (with 1 or more values).
Each Reducer generates the final key-value pairs and writes them to one or more files (based on the size of the output).
[Figure: Input to Mappers. Each document is fed to its own mapper as (n, "line of text") key-value pairs, where n is the offset of the line in the file.]
Each document gets a mapper. All data is organized into key-value pairs; each line will be a value and the offset position into the file will be the key, which we don't care about. I'm showing each document's contents in a box and 1 mapper task (JVM process) per document. Large documents might get split across several mapper tasks.
The mappers tokenize each line, one at a time, converting all words to lower case and counting them...
[Figure: the Mappers tokenize each line and emit (word, 1) pairs: (hadoop, 1), (uses, 1), (mapreduce, 1) from the first document; (there, 1), (is, 1), (a, 1), (map, 1), (phase, 1) from the second; nothing from the empty document; (there, 1), (is, 1), (a, 1), (reduce, 1), (phase, 1) from the last.]
The mappers emit key-value pairs, where each key is one of the words and the value is the count. In the most naive (but also most memory-efficient) implementation, each mapper simply emits (word, 1) each time word is seen. However, this is I/O inefficient!
Note that the mapper for the empty doc emits no pairs, as you would expect.
[Figure: Input, Mappers, Sort/Shuffle, Reducers. The emitted pairs are partitioned across three reducers by key: keys starting with 0-9 or a-l, keys starting with m-q, and keys starting with r-z.]
The mappers themselves don't decide to which reducer each pair should be sent. Rather, the job setup configures what to do and the Hadoop runtime enforces it during the Sort/Shuffle phase, where the key-value pairs in each mapper are sorted by key (that is, locally, not globally) and then the pairs are routed to the correct reducer, on the current machine or other machines.
Note how we partitioned the reducers, by the first letter of the keys. (By default, MR just hashes the keys and distributes them modulo the number of reducers.)
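As a small aside (not from the original slides), the default partitioning just described amounts to roughly the following one-liner; partitionFor is a hypothetical helper that mirrors what Hadoop's HashPartitioner does.

// Pick a reducer by hashing the key, forcing the hash non-negative,
// and taking it modulo the number of reducers.
def partitionFor(key: String, numReducers: Int): Int =
  (key.hashCode & Int.MaxValue) % numReducers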
[Figure: after the shuffle, each Reducer receives every key in its partition along with the collection of all values for that key: (a, [1,1]), (hadoop, [1]), (is, [1,1]) for 0-9, a-l; (map, [1]), (mapreduce, [1]), (phase, [1,1]) for m-q; (reduce, [1]), (there, [1,1]), (uses, [1]) for r-z.]
The reducers are passed each key (word) and a collection of all the values for that key (the
individual counts emitted by the mapper tasks). The MR framework creates these collections
for us.
[Figure: the complete flow: Input, Mappers, Sort/Shuffle, Reducers, Output. Each reducer sums the counts for its keys and writes its own output file: (a 2, hadoop 1, is 2), (map 1, mapreduce 1, phase 2), and (reduce 1, there 2, uses 1), which is why there are three output documents.]
The final view of the WordCount process flow. The reducer just sums the counts and writes the output.
The output files contain one line for each key (the word) and value (the count), assuming we're using text output. The choice of delimiter between key and value is up to you, but tab is common.
[Figure: the same flow again, annotated to highlight the Map step (which emits the key-value pairs) and the Reduce step (which combines them into the final counts).]
To recap, a map transforms one input to one output, but this is generalized in MapReduce to be one to 0-N. The output key-value pairs are distributed to reducers. The reduce collects together multiple inputs with the same key and combines them into the final output for that key.
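To make that recap concrete, here is a minimal sketch (not from the original slides) of the same word count written with ordinary Scala collection operations: the flatMap/map plays the role of the mappers, and the groupBy plus per-key sum plays the role of the sort/shuffle and the reducers. The object and value names are my own.

object WordCountSketch {
  def main(args: Array[String]): Unit = {
    // The four input "documents" from the figures (one is empty).
    val docs = Seq(
      "Hadoop uses MapReduce",
      "There is a Map phase",
      "",
      "There is a Reduce phase")

    // "Map" step: tokenize each line, lower-case the words, emit (word, 1) pairs.
    val pairs = docs.flatMap(_.split("""\s+""").filter(_.nonEmpty))
                    .map(word => (word.toLowerCase, 1))

    // "Shuffle" + "Reduce" step: group the pairs by word and sum the counts.
    val counts = pairs.groupBy(_._1).map { case (word, ps) => (word, ps.map(_._2).sum) }

    // Prints: a 2, hadoop 1, is 2, map 1, mapreduce 1, phase 2, reduce 1, there 2, uses 1.
    counts.toSeq.sortBy(_._1).foreach { case (word, count) => println(word + "\t" + count) }
  }
}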
History of MapReduce
Let's review where MapReduce came from and its best-known, open-source incarnation, Hadoop.
How would you index the web?
Did Google search the entire web in 0.26 seconds to find these ~49M results?
You ask a phrase and the search engine finds the best match in billions of web pages.
A distributed file system provides horizontal scalability and resiliency when file blocks are
duplicated around the cluster.
MapReduce for Computation (2004)
The compute model for processing all that data is MapReduce. It handles lots of boilerplate,
like breaking down jobs into tasks, distributing the tasks around the cluster, monitoring the
tasks, etc. You write your algorithm to the MR programming model.
About this time, Doug Cutting, the creator of Lucene, and Mike Cafarella were working on Nutch...
Lucene is an open-source text search engine. Nutch is an open source web crawler.
They started clean-room versions of MapReduce and GFS...
The name Hadoop comes from a toy stuffed elephant that Cutting's son owned at the time.
Benefits of MapReduce
We can't build vertical systems big enough, and if we could, they would cost a fortune!
Hadoop Design Goals
Maximize I/O Performance!!
And parallelize execution; run on server-class, commodity hardware.
Maximizing disk and network I/O is critical, because it's the largest throughput bottleneck. So, optimization is a core design goal of Hadoop (both MR and HDFS). It affects the features and performance of everything in the stack above it, including high-level programming tools!
By design, Hadoop is great for batch-mode data crunching.
Is MapReduce the end of the story? Does it meet all our needs? Let's look at a few problems...
#1: It's hard to implement many algorithms in MapReduce.
Even word count is not obvious. When you get to fancier stuff like joins, group-bys, etc.,
the mapping from the algorithm to the implementation is not trivial at all. In fact,
implementing algorithms in MR is now a specialized body of knowledge...
#2: For Hadoop in particular, the Java API is hard to use.
The Hadoop Java API is even more verbose and tedious to use than it should be.
import java.io.IOException;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;

public class WordCountMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {

  // Reusable output objects, a common Hadoop idiom to reduce object churn.
  private final Text word = new Text();
  private final IntWritable one = new IntWritable(1);

  @Override
  public void map(LongWritable key, Text valueDocContents,
      OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
    // Split each line on whitespace, lower-case each word, and emit (word, 1).
    String[] tokens = valueDocContents.toString().split("\\s+");
    for (String wordString : tokens) {
      if (wordString.length() > 0) {
        word.set(wordString.toLowerCase());
        output.collect(word, one);
      }
    }
  }
}
[Figure: a Cascading flow schematic: a source Tap and a sink Tap, both over HDFS.]
Schematically, here is what Word Count looks like in Cascading. See http://docs.cascading.org/cascading/1.2/userguide/html/ch02.html for details.
import cascading.*;
...
public class WordCount {
  public static void main(String[] args) {
    String inputPath = args[0];
    String outputPath = args[1];
    Properties properties = new Properties();
    FlowConnector.setApplicationJarClass(properties, Main.class);
    ...
That's It!!
See https://github.com/cloudera/crunch.
Others include Scoobi (http://nicta.github.com/scoobi/) and Spark, which we'll discuss next.
Use Spark (Scala) (Solution #2)
Distributed computing with in-memory caching. Up to 30x faster than MapReduce.
See http://www.spark-project.org/
Why isn't it more widely used? 1) lack of commercial support, 2) it has only recently emerged out of academia.
Spark is a Hadoop MapReduce alternative:
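As a taste of what that alternative looks like, here is a minimal word-count sketch (not from the original slides) in Spark's Scala API, using today's org.apache.spark package names; the application name, master setting, and the use of args for the input and output paths are my own placeholder choices.

import org.apache.spark.{SparkConf, SparkContext}

object SparkWordCount {
  def main(args: Array[String]): Unit = {
    // Placeholder configuration: run locally, read args(0), write args(1).
    val sc = new SparkContext(new SparkConf().setAppName("word-count").setMaster("local[*]"))

    sc.textFile(args(0))                                 // read the input documents
      .flatMap(_.split("""\s+""").filter(_.nonEmpty))    // "map" phase: tokenize each line
      .map(word => (word.toLowerCase, 1))                // emit (word, 1) pairs
      .reduceByKey(_ + _)                                // "reduce" phase: sum the counts per word
      .saveAsTextFile(args(1))                           // write the output

    sc.stop()
  }
}

The whole job is a handful of lines, and intermediate results can be cached in memory between stages rather than written back to disk after every pass.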
See http://www.cloudera.com/content/cloudera/en/products/cloudera-enterprise-core/cloudera-enterprise-RTQ.html. However, this was only recently announced (at the time of this writing), so it's not quite production-ready yet...
#3: It's not suitable for real-time event processing.
[Figure: a Storm topology: Spouts feed streams of events into Bolts, which can feed further Bolts.]
In Storm terminology, Spouts are data sources and Bolts are the event processors. There are facilities to support reliable message handling, various sources encapsulated in Spouts, and various targets of output. Distributed processing is baked in from the start.
Databases?
Use a SQL database unless you need the scale and looser schema of a NoSQL database!
#4: It's not ideal for graph processing.
Recall that PageRank is the famous algorithm invented by Sergey Brin and Larry Page to rank web pages. It's the foundation of Google's search engine.
Why not MapReduce?
[Diagram: a graph with nodes A through F.]
1 MR job for each iteration over all n nodes/edges; the graph is saved to disk after each iteration.
Pregel is the name of the river that runs through the city of Königsberg, Prussia (now Kaliningrad, Russia). 7 bridges crossed the river in the city (including 5 bridges to the 2 islands between the river branches). Leonhard Euler invented graph theory when he analyzed the question of whether or not you can cross all 7 bridges without retracing your steps (you can't).
Open-source Alternatives
Apache Giraph.
Apache Hama.
Aurelius Titan.
All are somewhat immature.
http://incubator.apache.org/giraph/
http://hama.apache.org/
http://thinkaurelius.github.com/titan/
None is very mature, and none has extensive commercial support.
A Manifesto...
Note that one reason SQL has succeeded all these years is that it, too, is inspired by math, e.g., set theory.
Functional Collections.
We already have the right model in the collection APIs that come with functional languages. They are far better engineered for intuitive data
transformations. They provide the right abstractions and hide boilerplate. In fact, they make it relatively easy to optimize implementations for
parallelization. The Scala collections offer parallelization with a tiny API call. Spark and Cascading transparently distribute collections across a
cluster.
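As a minimal illustration (not from the original slides) of that tiny API call, here is the earlier word count switched to Scala's parallel collections with a single .par; this assumes Scala 2.12 or earlier, where .par is built in (later versions need the separate scala-parallel-collections module).

object ParallelWordCount {
  def main(args: Array[String]): Unit = {
    val docs = Seq("Hadoop uses MapReduce", "There is a Map phase", "", "There is a Reduce phase")

    // .par turns the ordinary collection into a parallel one; the rest of the
    // pipeline is unchanged and now runs across the available cores.
    val counts = docs.par
      .flatMap(_.split("""\s+""").filter(_.nonEmpty))
      .map(_.toLowerCase)
      .groupBy(identity)
      .map { case (word, ws) => (word, ws.size) }

    counts.toList.sortBy(_._1).foreach { case (word, count) => println(word + "\t" + count) }
  }
}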
Erlang, Akka: Actor-based, Distributed Computation.
Fine-Grain Compute Models.
We can start using new, more efficient compute models, like Spark, Pregel, and Impala, today. Of course, you have to consider maturity, viability, and support issues in large organizations. So if you want to wait until these alternatives are more mature, then at least use better APIs for Hadoop!
For example, Erlang is a very mature language with the Actor model baked in. Akka is a Scala distributed computing toolkit based on the Actor model of concurrency. It exposes clean, low-level primitives for robust, distributed services (e.g., Actors), upon which we can build flexible big data systems that can handle soft real-time and batch processing efficiently and with great scalability.
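To make the actor primitives concrete, here is a minimal sketch (not from the original slides) using the classic Akka actor API; the EventCounter actor, its messages, and the actor-system name are hypothetical names of my own.

import akka.actor.{Actor, ActorSystem, Props}

// Hypothetical messages for a tiny event-counting service.
case class Event(word: String)
case object Report

// An actor owns its state and processes one message at a time, so the
// counts need no locks even when many senders fire events concurrently.
class EventCounter extends Actor {
  private var counts = Map.empty[String, Int].withDefaultValue(0)
  def receive = {
    case Event(word) => counts += word -> (counts(word) + 1)
    case Report      => sender() ! counts
  }
}

object ActorSketch {
  def main(args: Array[String]): Unit = {
    val system = ActorSystem("event-demo")
    val counter = system.actorOf(Props[EventCounter], "counter")
    Seq("hadoop", "uses", "mapreduce", "hadoop").foreach(word => counter ! Event(word))
    // A real service would ask (?) the actor for its counts; here we just shut down.
    system.terminate()
  }
}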
Final Thought:
@deanwampler
[email protected]
All pictures Copyright Dean Wampler, 2011-2013, All Rights Reserved. All other content is
free to use, but attribution is requested.