
Conference Home Page @GOTOCHGO

Welcome to
GOTO 2014 Night Chicago # 1
Speakers
Steve Vinoski, Chief Architect Basho
Dave Thomas, Bedarra Research Labs

2003 Bedarra Research Labs. All rights reserved.

Friday, November 29, 13


Confirmed speakers, more to come



GOTO Chicago Conference May 20-21, 2014

3 Keynotes - 30 Invited Talks, 8 Full Day Workshops


JAOO Aarhus, Denmark, 16 years ago. GOTO Aarhus, Amsterdam, Berlin, Chicago,
Zurich; QCON London; FlowCon SFO; CodeMesh London; YOW! Melbourne, Brisbane,
Sydney; Lambda Jam Brisbane, Chicago; Erlang Factory SFO, London

Mission: Complement the local tech community by:

1. Bringing world-leading software experts to meet local developers
2. Focusing on the latest and emerging technologies and practices
3. Helping expand the local network of people who really get software and will
be the next generation of enlightened software leaders/executives
4. Inviting all speakers via an independent program committee, based on their
competence and reputation, independent of sponsor, platform, or
organizer bias
Conference Home Page



Copious Data: The Killer App
for Functional Programming

Nov. 21, 2013


GOTO Chicago 2014 Night #2
[email protected]
polyglotprogramming.com/talks
Copyright 2011-2013, Dean Wampler, All Rights Reserved

Copyright Dean Wampler, 2011-2013, All Rights Reserved. Photos can only be used with
permission. Otherwise, the content is free to use.
Photo: Cloud Gate (a.k.a. The Bean) in Millennium Park, Chicago, Illinois, USA
Consultant at
Typesafe

Dean Wampler...



Typesafe builds tools for creating Reactive Applications, http://typesafe.com/platform. See


also the Reactive Manifesto, http://www.reactivemanifesto.org/

Photo: The Chicago River


Founder,
Chicago-Area Scala
Enthusiasts
and co-organizer,
Chicago Hadoop User Group

Dean Wampler...



I've been doing Scala for 6 years and Big Data for 3.5 years.
Programming Hive
Edward Capriolo, Dean Wampler & Jason Rutherglen

Functional Programming for Java Developers
Dean Wampler

Dean Wampler...



My books
What Is Big, err,
Copious Data?

Copious
Data
Data so big that
traditional solutions are
too slow, too small, or
too expensive to use.
Hat tip: Bob Korbus


Big Data is a buzzword, but generally associated with the problem of data sets too big to
manage with traditional SQL databases. A parallel development has been the NoSQL
movement, which is good at handling semistructured data, scaling, etc.
3 Trends

Three prevailing trends driving data-centric computing.
Photo: Pritzker Pavilion, Millennium Park, Chicago (designed by Frank Gehry)
Data Size

Data volumes are obviously growing rapidly.
Facebook now has over 600PB (Petabytes) of data in Hadoop clusters!
Formal Schemas

There is less emphasis on formal schemas and domain models, i.e., both relational models of data and OO models, because data schemas and
sources change rapidly, and we need to integrate so many disparate sources of data. So, using relatively-agnostic software, e.g., collections of
things where the software is more agnostic about the structure of the data and the domain, tends to be faster to develop, test, and deploy. Put
another way, we find it more useful to build somewhat agnostic applications and drive their behavior through data...
Data-Driven Programs

This is the 2nd generation Stanley, the most successful self-driving car ever built (by a Google-Stanford team). Machine learning is growing in
importance. Here, generic algorithms and data structures are trained to represent the world using data, rather than encoding a model of the
world in the software itself. It's another example of generic algorithms that produce the desired behavior by being application agnostic and data
driven, rather than hard-coding a model of the world. (In practice, however, a balance is struck between completely agnostic apps and some
engineering for the specific problem, as you might expect...)
Probabilistic
Models vs.
Formal
Grammars

An interesting manifestation of this trend is the public argument between Noam Chomsky and Peter Norvig on the nature of language. Chomsky
long ago proposed a hierarchical model of formal language grammars. Peter Norvig is a proponent of probabilistic models of language. Indeed all
successful automated language processing systems are probabilistic.
http://www.tor.com/blogs/2011/06/norvig-vs-chomsky-and-the-fight-for-the-future-of-ai
What Is
MapReduce?

Cloud Gate - The Bean - in Millennium Park, Chicago, on a sunny day - with some of my relatives ;)
Hadoop is the dominant
copious data platform
today.

A Hadoop Cluster
Hadoop v1.X Cluster (diagram): a master node running the JobTracker and NameNode (writing to an NFS disk), a backup master running the Secondary NameNode, and worker nodes each running a TaskTracker and a DataNode over a bunch of local disks.
A Hadoop v1.X cluster. (V2.X introduces changes in the master processes, including support for high-availability and federation). In brief:
JobTracker (JT): Master of submitted MapReduce jobs. Decomposes job into tasks (each a JVM process), often run where the blocks of input files
are located, to minimize net IO.
NameNode (NN): HDFS (Hadoop Distributed File System) master. Knows all the metadata, like block locations. Writes updates to a shared NFS disk
(in V1) for use by the Secondary NameNode.
Secondary NameNode (SNN): periodically merges in-memory HDFS metadata with update log on NFS disk to form new metadata image used when
booting the NN and SNN.
TaskTracker: manages each task given to it by the JT.
DataNode: manages the actual blocks it has on the node.
Disks: By default, Hadoop just works with a bunch of disks - cheaper and sometimes faster than RAID. Blocks are replicated 3x (default) so most
HW failures don't result in data loss.
MapReduce in Hadoop
Let's look at a
MapReduce algorithm:
Inverted Index.
Used for text/web search.
Let's walk through a simple version of computing an inverted index. Imagine a web crawler has found all docs on the web and stored their URLs
and contents in HDFS. Now we'll index it: build a map from each word to all the docs where it's found, ordered by term frequency within the docs.
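The steps just described can be sketched outside Hadoop; here is a toy, single-process version in Python (the helper names map_task, reduce_task, and inverted_index are illustrative, not the Hadoop API):

```python
from collections import defaultdict

def map_task(doc_id, contents):
    # Emit (word, (doc_id, count)) pairs for one document,
    # mirroring what each Hadoop map task does per input block.
    counts = defaultdict(int)
    for word in contents.lower().split():
        counts[word] += 1
    return [(word, (doc_id, n)) for word, n in counts.items()]

def reduce_task(word, doc_counts):
    # Collect all (doc_id, count) pairs for one word, most frequent first.
    return (word, sorted(doc_counts, key=lambda dc: -dc[1]))

def inverted_index(docs):
    # The "shuffle": group map outputs by key, then reduce each group.
    grouped = defaultdict(list)
    for doc_id, contents in docs.items():
        for word, dc in map_task(doc_id, contents):
            grouped[word].append(dc)
    return dict(reduce_task(w, dcs) for w, dcs in grouped.items())

index = inverted_index({
    "wikipedia.org/hadoop": "hadoop provides mapreduce and hdfs",
    "wikipedia.org/hbase":  "hbase stores data in hdfs",
})
```

In the real cluster, the map tasks, the shuffle, and the reduce tasks each run on many machines; the logic is the same.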
Crawl teh Interwebs
Web Crawl -> Map Phase (diagram): the crawl stores each page in HDFS as (URL, contents) pairs, e.g., (wikipedia.org/hadoop, "Hadoop provides MapReduce and HDFS..."), (wikipedia.org/hbase, "HBase stores data in HDFS..."), (wikipedia.org/hive, "Hive queries HDFS files and HBase tables with SQL..."); a Map Task processes each block.
Crawl pages, including Wikipedia. Use the URL as the document id in our first index, and the contents of each document (web page) as the second
column in our data set.
Compute Inverted Index
Web Crawl -> Map Phase -> Sort, Shuffle -> Reduce Phase (diagram): Map Tasks read the crawled blocks; the sort/shuffle stage routes each key to a reducer; Reduce Tasks write index blocks for key ranges such as hadoop, hbase, hdfs, hive, ... and, ...
Now run a MapReduce job, where a separate Map task for each input block will be started. Each map tokenizes the content into words, counts the
words, and outputs key-value pairs...
Compute Inverted Index
(diagram, continued): key-value pairs output by the first map task for the (fake) Wikipedia Hadoop page include (hadoop,(wikipedia.org/hadoop,1)), (provides,(wikipedia.org/hadoop,1)), (mapreduce,(wikipedia.org/hadoop,1)), (and,(wikipedia.org/hadoop,1)), (hdfs,(wikipedia.org/hadoop,1)), ...
Each key is a word that was found and the corresponding value is a tuple of the URL (or other document id) and the count of the words (or
alternatively, the frequency within the document). Shown are what the first map task would output (plus other k-v pairs) for the (fake) Wikipedia
Hadoop page. (Note that we convert to lower case)
Compute Inverted Index
Map Phase -> Reduce Phase (diagram, final): after the shuffle, each Reduce task writes inverted index blocks, e.g., hadoop -> (.../hadoop,1); hbase -> (.../hbase,1),(.../hive,1); hdfs -> (.../hadoop,1),(.../hbase,1),(.../hive,1); hive -> (.../hive,1); and -> (.../hadoop,1),(.../hive,1).
Finally, each reducer will get some range of the keys. There are ways to control this, but we'll just assume that the first reducer got all keys starting
with 'h' and the last reducer got all the 'and' keys. The reducer outputs each word as a key and a list of tuples consisting of the URLs (or doc ids)
and the frequency/count of the word in that document, sorted by most frequent first. (All our docs have only one occurrence of any word, so the
sort is moot.)
Anatomy: MapReduce Job
Map Phase -> Sort, Shuffle -> Reduce Phase (diagram).
Map (or Flatmap): transform one input to 0-N outputs.
Reduce: collect multiple inputs into one output.
To recap, a true functional/mathematical map transforms one input to one output, but this is generalized in MapReduce to be one to 0-N. In
other words, it should be FlatmapReduce!! The output key-value pairs are distributed to reducers. The reduce collects together multiple inputs
with the same key into one output.
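That one-to-0-N generalization is easy to state in code; a minimal Python sketch (flat_map here is a hypothetical helper, not part of any MapReduce API):

```python
def flat_map(f, records):
    # Apply f to each record; f returns a list of zero or more outputs,
    # which are concatenated into one flat result stream.
    return [out for rec in records for out in f(rec)]

# A "mapper" that tokenizes one line into many (word, 1) pairs.
# Note the empty line contributes zero outputs: one input, 0-N outputs.
pairs = flat_map(lambda line: [(w, 1) for w in line.split()],
                 ["hadoop provides hdfs", "", "hbase stores data"])
```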

Quiz. Do you understand this tweet?


So, MapReduce is
a mashup of our friends
flatmap and reduce.


Even in this somewhat primitive and coarse-grain framework, our functional data concepts are evident!
Today,
Hadoop is our best
general-purpose tool
for horizontal scaling
of Copious Data.
MapReduce and Its
Discontents


Is MapReduce the end of the story? Does it meet all our needs? Let's look at a few problems.
Photo: Gratuitous Romantic beach scene, Ohio St. Beach, Chicago, Feb. 2011.
MapReduce doesn't fit
all computation needs.
HDFS doesn't fit all
storage needs.

It's hard to implement
many algorithms
in MapReduce.


Even word count is not obvious. When you get to fancier stuff like joins, group-bys, etc., the
mapping from the algorithm to the implementation is not trivial at all. In fact, implementing
algorithms in MR is now a specialized body of knowledge.
MapReduce is very
coarse-grained.

1-Map, 1-Reduce
phase...

Multiple MR jobs
required for some
algorithms.
Each one flushes its
results to disk!

If you have to sequence MR jobs to implement an algorithm, ALL the data is flushed to disk
between jobs. There's no in-memory caching of data, leading to huge IO overhead.
MapReduce is designed
for offline, batch-mode
analytics.
High latency; not
suitable for event
processing.

Alternatives are emerging to provide event-stream (real-time) processing.


The Hadoop Java API
is hard to use.


The Hadoop Java API is even more verbose and tedious to use than it should be.
Let's look at code for a
simpler algorithm,
Word Count.
(Tokenize as before, but
ignore original
document locations.)

In Word Count, the mapper just outputs the word-count pairs. We forget about the document
URL/id. The reducer gets all word-count pairs for a word from all mappers and outputs each
word with its final, global count.
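Before looking at the Java, the same algorithm fits in a few lines of Python; this single-process sketch (wc_map and wc_reduce are illustrative names, not any framework's API) makes the two roles concrete:

```python
from collections import Counter

def wc_map(line):
    # Mapper: emit (word, 1) for every token, ignoring the document id.
    return [(w.lower(), 1) for w in line.split()]

def wc_reduce(pairs):
    # Reducer: sum the counts for each word across all mapper outputs.
    totals = Counter()
    for word, n in pairs:
        totals[word] += n
    return dict(totals)

lines = ["Hadoop provides HDFS", "HBase stores data in HDFS"]
word_counts = wc_reduce(p for line in lines for p in wc_map(line))
```

Keep this in mind as a baseline for how little essential logic there is, compared to the framework code that follows.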
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;

class WCMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {

  static final IntWritable one = new IntWritable(1);
  static final Text word = new Text();  // Value will be set in a non-thread-safe way!

  @Override
  public void map(LongWritable key, Text valueDocContents,
      OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws java.io.IOException {
    String[] tokens = valueDocContents.toString().split("\\s+");
    for (String wordString: tokens) {
      if (wordString.length() > 0) {
        word.set(wordString.toLowerCase());
        output.collect(word, one);
      }
    }
  }
}

class Reduce extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {

  public void reduce(Text keyWord, java.util.Iterator<IntWritable> valuesCounts,
      OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws java.io.IOException {
    int totalCount = 0;
    while (valuesCounts.hasNext()) {
      totalCount += valuesCounts.next().get();
    }
    output.collect(keyWord, new IntWritable(totalCount));
  }
}
This is intentionally too small to read and we're not showing the main routine, which doubles the code size. The algorithm is simple, but the framework is in your
face. In the next several slides, notice which colors dominate. In this slide, it's dominated by green for types (classes), with relatively few yellow functions that
implement actual operations (i.e., do actual work).
The main routine I've omitted contains boilerplate details for configuring and running the job. This is just the core MapReduce code. In fact, Word Count is not
too bad, but when you get to more complex algorithms, even conceptually simple ideas like relational-style joins and group-bys, the corresponding MapReduce
code in this API gets complex and tedious very fast!
(Slide 36 repeats the same code, highlighting the interesting bits: the handful of lines that actually tokenize, count, and collect.)
(Slide 37 repeats the same code once more.) The '90s called. They want their EJBs back!
Use Cascading (Java)

Cascading is a Java library that provides higher-level abstractions for building data processing pipelines, with concepts familiar from SQL such as
joins, group-bys, etc. It works on top of Hadoop's MapReduce and hides most of the boilerplate from you.
See http://cascading.org.
Photo: Fermi Lab Office Building, Batavia, IL.
Cascading Concepts
Data flows consist of
source and sink Taps
connected by Pipes.

Word Count
Flow (diagram): a source Tap reads lines from HDFS into a Pipe assembly ("word count assembly"): Each(Regex) splits each line into words, GroupBy groups by word, and Every(Count) counts each group; a sink Tap writes the (word, count) results back to HDFS.

Schematically, here is what Word Count looks like in Cascading. See
http://docs.cascading.org/cascading/1.2/userguide/html/ch02.html for details.
import cascading.*;
...
public class WordCount {
public static void main(String[] args) {
String inputPath = args[0];
String outputPath = args[1];
Properties properties = new Properties();
FlowConnector.setApplicationJarClass( properties, WordCount.class );

Scheme sourceScheme = new TextLine( new Fields( "line" ) );


Scheme sinkScheme = new TextLine( new Fields( "word", "count" ) );
Tap source = new Hfs( sourceScheme, inputPath );
Tap sink = new Hfs( sinkScheme, outputPath, SinkMode.REPLACE );

Pipe assembly = new Pipe( "wordcount" );

String regex = "(?<!\\pL)(?=\\pL)[^ ]*(?<=\\pL)(?!\\pL)";


Function function = new RegexGenerator( new Fields( "word" ), regex );
assembly = new Each( assembly, new Fields( "line" ), function );
assembly = new GroupBy( assembly, new Fields( "word" ) );
Aggregator count = new Count( new Fields( "count" ) );
assembly = new Every( assembly, count );

FlowConnector flowConnector = new FlowConnector( properties );


Flow flow = flowConnector.connect( "word-count", source, sink, assembly);
flow.complete();
}
}
Here is the Cascading Java code. It's cleaner than the MapReduce API, because the code is more focused on the algorithm with less boilerplate,
although it looks like it's not that much shorter. HOWEVER, this is all the code, whereas previously I omitted the setup (main) code. See
http://docs.cascading.org/cascading/1.2/userguide/html/ch02.html for details of the API features used here; we won't discuss them here, but just
mention some highlights.
Note that there is still a lot of green for types, but at least the API emphasizes composing behaviors together.
Use Scalding (Scala)

Scalding is a Scala DSL (domain-specific language) that wraps Cascading, providing an even more intuitive and more boilerplate-free API for
writing MapReduce jobs. https://github.com/twitter/scalding
Scala is a new JVM language that modernizes Java's object-oriented (OO) features and adds support for functional programming, as we discussed
previously and we'll revisit shortly.
import com.twitter.scalding._

class WordCountJob(args: Args) extends Job(args) {
  TextLine( args("input") )
    .read
    .flatMap('line -> 'word) {
      line: String => line.trim.toLowerCase.split("\\W+")
    }
    .groupBy('word) {
      group => group.size('count)
    }
    .write(Tsv(args("output")))   // That's it!!
}
This Scala code is almost pure domain logic with very little boilerplate. There are a few minor differences in the implementation. You don't explicitly specify the
Hfs (Hadoop Distributed File System) taps. That's handled by Scalding implicitly when you run in non-local mode. Also, I'm using a simpler tokenization
approach here, where I split on anything that isn't a word character [0-9a-zA-Z_].
There is little green, in part because Scala infers types in many cases. There is a lot more yellow for the functions that do real work!
What if MapReduce, and hence Cascading and Scalding, went obsolete tomorrow? This code is so short, I wouldn't care about throwing it away! I invested little
time writing it, testing it, etc.
Use Cascalog (Clojure)

http://nathanmarz.com/blog/introducing-cascalog-a-clojure-based-query-language-for-hado.html
Clojure is a new JVM, Lisp-based language with lots of important concepts, such as persistent data structures.
(defn lowercase [w] (.toLowerCase w))

(?<- (stdout) [?word ?count]
     (sentence ?s)
     (split ?s :> ?word1)
     (lowercase ?word1 :> ?word)
     (c/count ?count))

Datalog-style queries
Cascalog embeds Datalog-style logic queries. The variables to match are named ?foo.
Other Improved APIs:
Crunch (Java) &
Scrunch (Scala)
Scoobi (Scala)
...

See https://github.com/cloudera/crunch.
Others include Scoobi (http://nicta.github.com/scoobi/) and Spark, which we'll discuss next.
Use Spark
(Not MapReduce)

http://www.spark-project.org/
Spark started as a Berkeley project. Recently, the developers launched Databricks to commercialize it, given the growing interest in Spark as a
MapReduce replacement. It can run under YARN, the newer Hadoop resource manager (it's not clear that's the best strategy, though, vs. using
Mesos, another Berkeley project being commercialized by Mesosphere), and Spark can talk to HDFS, the Hadoop Distributed File System.
import org.apache.spark.SparkContext

object WordCountSpark {
def main(args: Array[String]) {
val ctx = new SparkContext(...)
val file = ctx.textFile(args(0))
val counts = file.flatMap(
line => line.split("\\W+"))
.map(word => (word, 1))
.reduceByKey(_ + _)
    counts.saveAsTextFile(args(1))
  }
}

Also small and concise!

This Spark example is actually closer in a few details, i.e., function names used, to the original Hadoop Java API example, but it cuts down boilerplate to the bare
minimum.
Spark is a Hadoop
MapReduce alternative:
Distributed computing with
in-memory caching.
~30x faster than MapReduce
(in part due to caching of
intermediate data).

Spark also addresses the lack of flexibility for the MapReduce model.
Spark is a Hadoop
MapReduce alternative:

Originally designed for


machine learning applications.
Developed by Berkeley AMP.
Use SQL!
Hive, Shark, Impala,
Presto, or Lingual

Using SQL when you can! Here are 5 (and growing!) options.
Use SQL when you can!
Hive: SQL on top of MapReduce.
Shark: Hive ported to Spark.
Impala & Presto: HiveQL with
new, faster back ends.
Lingual: ANSI SQL on Cascading.

See http://hive.apache.org/ or my book for Hive, http://shark.cs.berkeley.edu/ for Shark,
http://www.cloudera.com/content/cloudera/en/products/cloudera-enterprise-core/cloudera-enterprise-RTQ.html for Impala, and
http://www.facebook.com/notes/facebook-engineering/presto-interacting-with-petabytes-of-data-at-facebook/10151786197628920
for Presto. Impala & Presto are relatively new.
Word Count in Hive SQL!
CREATE TABLE docs (line STRING);
LOAD DATA INPATH '/path/to/docs'
INTO TABLE docs;

CREATE TABLE word_counts AS
SELECT word, count(1) AS count FROM
(SELECT explode(split(line, '\W+'))
AS word FROM docs) w
GROUP BY word
ORDER BY word;

... and similarly for the other SQL tools.


This is how you could implement word count in Hive. We're using some Hive built-in functions for tokenizing words in each line, the one column in the docs
table, etc.
Lingual is similar, but because it's more ANSI-compliant, the example would be much different.
We're in the era I call
The SQL Strikes Back!

(with apologies to
George Lucas...)

IT shops realize that NoSQL is useful and all, but people really, Really, REALLY love SQL. So,
it's making a big comeback. You can see it in Hadoop, in SQL-like APIs for some NoSQL
DBs, e.g., Cassandra and MongoDB's JavaScript-based query language, as well as NewSQL
DBs.
Combinators

Photo: The defunct Esquire movie theater on Oak St., off the Magnificent Mile, in Chicago. Now completely gone!
Why were the
Scala, Clojure, and SQL
solutions so concise
and appealing??

Data problems
are fundamentally
Mathematics!

evanmiller.org/mathematical-hacker.html


A blog post about how developers ignore mathematics at their peril!


Category Theory

Monads - Structure.
Abstracting over collections.
Control flow and mutability
containment.

Monads generalize the properties of containers, like lists and maps, such as applying a
function to each element and returning a new instance of the same container type. This also
applies to encapsulations of state transformations and principled mutability, as used in
Haskell.
Category Theory

Monoids, Groups, Rings, etc.


Abstracting over addition,
subtraction, multiplication, and
division.

Monoid: Addition

(a + b) + (c + d) for some a, b, c, d.
Add All the Things, Avi Bryant,
StrangeLoop 2013.

infoq.com/presentations/abstract-algebra-analytics

For an explanation of this slide, see this great presentation by Avi Bryant at StrangeLoop
2013 on generalizing addition (monoids).
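The point of the monoid laws for copious data: because the operation is associative (and has an identity), (a + b) + (c + d) can be regrouped freely, so partial results can be computed on separate machines and combined later. A small Python sketch, assuming only that op is associative with the given identity (monoid_fold is an illustrative name):

```python
from functools import reduce

def monoid_fold(op, identity, chunks):
    # Fold each chunk independently (the "map side" partial results),
    # then combine the partials -- legal only because op is associative.
    partials = [reduce(op, chunk, identity) for chunk in chunks]
    return reduce(op, partials, identity)

data = [[1, 2], [3, 4], [5]]                        # three "machines" worth of data
total = monoid_fold(lambda a, b: a + b, 0, data)    # integer addition monoid
biggest = monoid_fold(max, float("-inf"), data)     # max is also a monoid
```

Any monoid (sum, max, set union, approximate sketches like HyperLogLog) plugs into the same combining machinery unchanged; that is Bryant's point.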
Linear Algebra
Eigenvector and Singular Value
Decomposition.
Essential tools in machine
learning.
Av = λv
Example: Eigenfaces
Represent images
as vectors.
Solve for
modes.
Top N modes
approx. faces!
http://en.wikipedia.org/wiki/File:Eigenfaces.png
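The eigenvector equation Av = λv can be attacked numerically with power iteration: repeatedly apply A and renormalize, and the vector converges toward the dominant mode. This is only a sketch of the idea behind finding "top modes" like eigenfaces; real systems use SVD routines from numerical libraries. A dependency-free Python sketch (power_iteration is an illustrative name; it assumes a dominant eigenvalue exists and the first component stays nonzero):

```python
def power_iteration(A, steps=100):
    # Repeatedly apply A and renormalize; converges to the eigenvector
    # with the largest-magnitude eigenvalue (the dominant "mode").
    n = len(A)
    v = [1.0] * n
    for _ in range(steps):
        w = [sum(A[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = max(abs(x) for x in w)
        v = [x / norm for x in w]
    # Rayleigh-style estimate of lambda from the first row: (Av)[0] / v[0].
    lam = sum(A[0][j] * v[j] for j in range(n)) / v[0]
    return lam, v

A = [[2.0, 0.0],
     [0.0, 1.0]]
lam, v = power_iteration(A)
```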
Set Theory and
First-Order Logic
Relational Model.
Data organized
into tuples,
grouped by
relations.
http://dl.acm.org/citation.cfm?doid=362384.362685

Formulated by Codd in '69. Most systems don't follow it exactly, like allowing identical
records, where set elements are unique. Codd's original model didn't support NULLs either
(unknown), but he later proposed a revision to allow them.
Set Theory and
First-Order Logic
Relational Model.
Most RDBMSs deviate from RM.


What are Combinators?
Functions that are side-effect
free.
They get all their information
from their inputs and write all
their work to their outputs.
Let's look at
4 relational operators
and the corresponding
functional combinators.


See, for example, the discussions in Database in Depth and SQL and Relational Theory,
Second Edition, both by C.J. Date (O'Reilly).
Recall our Word Counts:
CREATE TABLE word_counts (
word CHARACTER(64),
count INTEGER);
(ANSI SQL syntax)

val word_counts: Stream[(String,Int)]

(Scala)

Our word_counts table from before, using ANSI SQL syntax this time.
The corresponding Scala might be any kind of collection, e.g., a List. Here, I'll use a Stream, which is a lazy collection useful for large data structures like I/O...
Note that it's a stream of a two-element tuple, a String (for the word) and an Int (for the count).
Restrict

SELECT * FROM word_counts


WHERE word = 'Chicago';

vs.

word_counts.filter {
case (word, count) =>
word == "Chicago"
}

For the Scala example, assume word_counts is a collection (List, Vector, etc.) of 2-element tuples. The case match in the anonymous function passed to filter is a
way of conveniently assigning variables to each element of the tuple, here word and count. Then I filter on only certain word values.
Project

SELECT word FROM word_counts;

vs.

word_counts.map {
case (word, count) =>
word
}

Here, I just return the words in each record or Scala tuple.
Join
CREATE TABLE dictionary (
word CHARACTER(64),
definition CHARACTER(256));

Table for join examples.

First, we need something to join with; lets use a dictionary of word definitions.
Join - SQL
SELECT w.word, d.definition
FROM word_counts AS w,
dictionary AS d
WHERE w.word = d.word;

Here is the SQL join that gives us the words and their definitions. (Side note: Hive doesn't support this inferred join syntax; you have to use a more explicit JOIN
ON syntax.)
Join - Scalding
val word_counts =
Csv("/path/", ('wword, 'count)).read
val dictionary =
Csv("/path/", ('dword, 'definition)).read

word_counts
.joinWithLarger('wword -> 'dword,
dictionary)
.project('wword, 'definition)

The Scala collections library doesn't have a join combinator. We would have to build up something that understands the data, such as exploiting sort order. This is
a case where a large-scale data system will implement expensive operations, where a general-purpose programming library might not. So, I'm using a Scalding
example.
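To illustrate what building it by hand looks like (my own sketch, not from the deck), here is a naive in-memory equi-join over plain Scala collections. Fine for small data; a big-data system must do much better (sort-merge joins, broadcast joins, and so on):

```scala
// Naive hash-join sketch: build a map on the (presumably smaller)
// dictionary side, then probe it for each word-count record.
val wordCounts = List(("chicago", 3), ("goto", 2), ("data", 5))
val dictionary = List(("chicago", "a city"), ("data", "facts and figures"))

val dictMap = dictionary.toMap
val joined = for {
  (word, _)  <- wordCounts
  definition <- dictMap.get(word).toList // inner join: unmatched words drop out
} yield (word, definition)
```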
Join
SELECT w.word, d.definition
FROM word_counts AS w,
dictionary AS d
WHERE w.word = d.word;

vs.

word_counts
.joinWithLarger('wword -> 'dword,
dictionary)
.project('wword, 'definition)

Now shown together, with some of the Scalding setup code removed.
Joins are expensive.
Your data system needs
to exploit
optimizations.

Group By
SELECT count, COUNT(*) AS size
FROM word_counts
GROUP BY count
ORDER BY size DESC;

vs.
word_counts.groupBy {
case (word, count) => count
}.toList.map {
case (count, words) => (count, words.size)
}.sortBy {
case (count, size) => -size
}
How many words appeared once, twice, 3 times, ..., N-times? Order this list descending.
I'm back to the Scala library (as opposed to Scalding). The code inputs a collection of tuples, (word,count) and groups by count. This creates a map with the
count as the key and a list of the words as the value.
Next we convert this to a list of tuples (count,List(words)) and map it to a list of tuples with the (count, size of List(words)), then finally sort descending by the list
sizes.
Example
scala> val word_counts = List(
("a", 1), ("b", 2), ("c", 3),
("d", 2), ("e", 2), ("f", 3))

scala> val out = word_counts.groupBy {


case (word, count) => count
}.toList.map {
case (count, words) => (count, words.size)
}.sortBy {
case (count, size) => -size
}

out: List[(Int,Int)] =
List((2,3), (3,2), (1,1))
Here's a simple example you can run in the Scala REPL (prompts are scala>).
We could go on, but
you get the point.
Declarative, functional
combinators are a
natural tool for data.
SQL vs. FP
SQL
Lots of optimizations for data
manipulation.
FP
More combinators.
First class functions!

A drawback of SQL is that it doesn't provide first-class functions, so (depending on the
system) you're limited to those that are built-in or UDFs (user-defined functions) that you can
write and add. FP languages make this easy!
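A quick sketch of the contrast (my own example): in Scala, any function, named or anonymous, can be passed straight into a combinator, with no UDF registration step as in most SQL engines:

```scala
// First-class functions: arbitrary logic flows into combinators directly.
val words = List("Chicago", "goto", "FP", "data")

// A named predicate and an inline transformation, composed on the fly:
val isShort: String => Boolean = _.length <= 4
val shouted = words.filter(isShort).map(_.toUpperCase)
// shouted == List("GOTO", "FP", "DATA")
```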
FP to the
Rescue!

Outside my condo window one Sunday morning...
Popular Claim:

Multicore concurrency
is driving FP adoption.


We've all heard this. In fact, this is how I got interested in FP.
My Claim:
Data will drive the next
wave of widespread
FP adoption.


Even today, most developers get by without understanding concurrency. Many will just use an
Actor or Reactive model to solve their problems. I think more devs will have to learn how to
work with data at scale and that fact will drive them to FP. This will be the next wave.
Data Architectures

What should software architectures look like for these kinds of systems?
Photo: Two famous 19th Century Buildings in Chicago.
[Diagram: traditional object-oriented architecture. Queries flow from object-oriented domain logic through an in-memory object model (ParentB1 with children ChildB1 and ChildB2, each with toJSON methods), via an Object-Relational Mapping over the SQL result set, down to the database.]
Traditionally, weve kept a rich, in-memory domain model requiring an ORM to convert persistent data into the model. This is resource overhead and complexity we cant afford in big data
systems. Rather, we should treat the result set as it is, a particular kind of collection, do the minimal transformation required to exploit our collections libraries and classes representing some
domain concepts (e.g., Address, StockOption, etc.), then write functional code to implement business logic (or drive emergent behavior with machine learning algorithms)

The toJSON methods are there because we often convert these object graphs back into fundamental structures, such as the maps and arrays of JSON so we can send them to the browser!
[Diagram: the relational/functional alternative shown side-by-side with the object-oriented stack above. Domain logic is built on functional abstractions and a thin functional wrapper for relational data, directly over the SQL result set and database, with no object model or ORM layer.]
But the traditional systems are a poor fit for this new world: 1) they add too much overhead in computation (the ORM layer, etc.) and memory (to store the objects). Most of what we do with
data is mathematical transformation, so were far more productive (and runtime efficient) if we embrace fundamental data structures used throughout (lists, sets, maps, trees) and build rich
transformations into those libraries, transformations that are composable to implement business logic.
[Diagram: the same relational/functional stack, annotated "Focus on: Lists, Maps, Sets, Trees, ..." -- fundamental data structures instead of an object model.]
But the traditional systems are a poor fit for this new world: 1) they add too much overhead in computation (the ORM layer, etc.) and memory (to store the objects). Most of what we do with
data is mathematical transformation, so were far more productive (and runtime efficient) if we embrace fundamental data structures used throughout (lists, sets, maps, trees) and build rich
transformations into those libraries, transformations that are composable to implement business logic.
[Diagram: three web clients (Web Client 1, 2, 3) all talking to a single monolithic object model (ParentB1, ChildB1, ChildB2 with toJSON), backed by a database and files.]
In a broader view, object models tend to push us towards centralized, complex systems that don't decompose well and stifle reuse and optimal deployment scenarios. FP code makes it
easier to write smaller, focused services that we compose and deploy as appropriate.
[Diagram: the same three web clients now talking to smaller, focused services (Process 1, Process 2, Process 3) over the database and files, replacing the monolithic object model.]
In a broader view, object models tend to push us towards centralized, complex systems that don't decompose well and stifle reuse and optimal deployment scenarios. FP code makes it
easier to write smaller, focused services that we compose and deploy as appropriate. Each ProcessN could be a parallel copy of another process, for horizontal, shared-nothing
scalability, or some of these processes could be other services.
Smaller, focused services scale better, especially horizontally. They also don't encapsulate more business logic than is required, and this (informal) architecture is also suitable for scaling
ML and related algorithms.
[Diagram: the services architecture again, annotated with the trends from the start of the talk: growing Data Size, less Formal Schema, more Data-Driven Programs.]
And this structure better fits the trends I outlined at the beginning of the talk.
Hadoop is the
Enterprise Java Beans
of our time.

I worked with EJBs a decade ago. The framework was completely invasive into your business logic. There were too many configuration options in
XML files. The framework paradigm was a poor fit for most problems (like soft real-time systems and most algorithms beyond Word Count).
Internally, EJB implementations were inefficient and hard to optimize, because they relied on poorly considered object boundaries that muddled
more natural boundaries. (I've argued in other presentations and my FP for Java Devs book that OOP is a poor modularity tool.)
The fact is, Hadoop reminds me of EJBs in almost every way. It's a 1st-generation solution that mostly works okay and people do get work done
with it, but just as the Spring Framework brought an essential rethinking to Enterprise Java, I think there is an essential rethink that needs to
happen in Big Data, specifically around Hadoop. The functional programming community is well positioned to create it...
MapReduce
is waning


We've seen a lot of issues with MapReduce. Already, alternatives are being developed, either general options, like
Spark and Storm, or special-purpose built replacements, like Impala. Let's consider other options...
Emerging replacements
are based on
Functional Languages...
import com.twitter.scalding._

class WordCountJob(args: Args) extends Job(args) {
  TextLine(args("input"))
    .read
    .flatMap('line -> 'word) { line: String =>
      line.trim.toLowerCase.split("\\W+")
    }
    .groupBy('word) { group => group.size('count) }
    .write(Tsv(args("output")))
}


FP is such a natural fit for the problem that any attempts to build big data systems without it will be handicapped
and probably fail.
Let's consider other MapReduce options...
... and SQL

CREATE TABLE docs (line STRING);


LOAD DATA INPATH '/path/to/docs'
INTO TABLE docs;

CREATE TABLE word_counts AS


SELECT word, count(1) AS count FROM
(SELECT explode(split(line, '\\W+'))
AS word FROM docs) w
GROUP BY word
ORDER BY word;


Questions?

Nov. 21, 2013


GOTO Chicago 2014 Night #2
[email protected]
@deanwampler
polyglotprogramming.com/talks

All pictures Copyright Dean Wampler, 2011-2013, All Rights Reserved. All other content is free to use, but
attribution is requested.
Photo: Building in fog on Michigan Avenue
