Copious Data: The Killer App for FP
Welcome to
GOTO 2014 Night Chicago # 1
Speakers
Steve Vinoski, Chief Architect Basho
Dave Thomas, Bedarra Research Labs
Copyright Dean Wampler, 2011-2013, All Rights Reserved. Photos can only be used with
permission. Otherwise, the content is free to use.
Photo: Cloud Gate (a.k.a. The Bean) in Millennium Park, Chicago, Illinois, USA
Consultant at
Typesafe
I've been doing Scala for 6 years and Big Data for 3.5 years.
My books:
Functional Programming for Java Developers (Dean Wampler)
Programming Hive (Edward Capriolo, Dean Wampler & Jason Rutherglen)
What Is Big ... err ... Copious Data?
Big Data is a buzzword, but it is generally associated with the problem of data sets too big to
manage with traditional SQL databases. A parallel development has been the NoSQL
movement, which is good at handling semistructured data, scaling, etc.
Trends
tor.com/blogs/...
Copyright 2011-2013, Dean Wampler, All Rights Reserved. Friday, November 29, 2013
An interesting manifestation of this trend is the public argument between Noam Chomsky and Peter Norvig on the nature of language. Chomsky
long ago proposed a hierarchical model of formal language grammars. Peter Norvig is a proponent of probabilistic models of language. Indeed all
successful automated language processing systems are probabilistic.
http://www.tor.com/blogs/2011/06/norvig-vs-chomsky-and-the-fight-for-the-future-of-ai
What Is MapReduce?
[Diagram: a cluster of disks holding blocks of data, then MapReduce building a word index over three documents:
  wikipedia.org/hadoop: "Hadoop provides MapReduce and HDFS"
  wikipedia.org/hbase: "HBase stores data in HDFS"
  wikipedia.org/hive: "Hive queries HDFS files and HBase tables with SQL"
Each block is read by a Map Task. The map output goes through Sort, Shuffle to the Reduce Tasks, which write the key-values output, e.g., (hdfs,(wikipedia.org/hadoop, 1)) and (and,(.../hadoop,1),(.../hive,1)).]
Even in this somewhat primitive and coarse-grain framework, our functional data concepts are evident!
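The map, shuffle, and reduce steps in the diagram can be sketched on ordinary Scala collections. This is only an in-memory illustration, not Hadoop code: the names `docs`, `mapPhase`, `shuffle`, and `reducePhase` are hypothetical, and the tokenizer (lowercase, split on non-word characters) is an assumption.

```scala
// Hypothetical in-memory stand-in for the three documents in the diagram.
val docs: Seq[(String, String)] = Seq(
  "wikipedia.org/hadoop" -> "Hadoop provides MapReduce and HDFS",
  "wikipedia.org/hbase"  -> "HBase stores data in HDFS",
  "wikipedia.org/hive"   -> "Hive queries HDFS files and HBase tables with SQL"
)

// "Map": emit one (word, (url, 1)) pair per token (assumed tokenizer).
def mapPhase(url: String, text: String): Seq[(String, (String, Int))] =
  text.toLowerCase.split("\\W+").toSeq.filter(_.nonEmpty).map(w => (w, (url, 1)))

// "Sort, Shuffle": bring all pairs with the same word together.
def shuffle(pairs: Seq[(String, (String, Int))]): Map[String, Seq[(String, Int)]] =
  pairs.groupBy(_._1).map { case (w, ps) => (w, ps.map(_._2)) }

// "Reduce": for one word, sum the counts per document.
def reducePhase(word: String, hits: Seq[(String, Int)]): (String, Seq[(String, Int)]) =
  (word, hits.groupBy(_._1).map { case (url, hs) => (url, hs.map(_._2).sum) }.toSeq.sortBy(_._1))

val invertedIndex: Map[String, Seq[(String, Int)]] =
  shuffle(docs.flatMap { case (url, text) => mapPhase(url, text) })
    .map { case (w, hits) => reducePhase(w, hits) }
```

Every step is a pure function from collection to collection, which is the sense in which the functional data concepts show through even in MapReduce.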
Today, Hadoop is our best, general-purpose tool for horizontal scaling of Copious Data.
MapReduce and Its Discontents
Is MapReduce the end of the story? Does it meet all our needs? Let's look at a few problems.
Photo: Gratuitous Romantic beach scene, Ohio St. Beach, Chicago, Feb. 2011.
MapReduce doesn't fit all computation needs.
HDFS doesn't fit all storage needs.
Even word count is not obvious. When you get to fancier stuff like joins, group-bys, etc., the
mapping from the algorithm to the implementation is not trivial at all. In fact, implementing
algorithms in MR is now a specialized body of knowledge.
MapReduce is very coarse-grained. 1-Map, 1-Reduce phase...
Multiple MR jobs required for some algorithms. Each one flushes its results to disk!
If you have to sequence MR jobs to implement an algorithm, ALL the data is flushed to disk
between jobs. There's no in-memory caching of data, leading to huge I/O overhead.
MapReduce is designed for offline, batch-mode analytics. High latency; not suitable for event processing.
The Hadoop Java API is even more verbose and tedious to use than it should be.
Let's look at code for a simpler algorithm, Word Count. (Tokenize as before, but ignore original document locations.)
In Word Count, the mapper just outputs the word-count pairs. We forget about the document
URL/id. The reducer gets all word-count pairs for a word from all mappers and outputs each
word with its final, global count.
import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;

// The enclosing class and the fields "one" and "word" are not visible on the
// slide; they are assumed here so that the fragment compiles.
public class WordCount {
  public static class Map extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable one = new IntWritable(1);
    private final Text word = new Text();  // reused for every output record

    @Override
    public void map(LongWritable key, Text valueDocContents,
        OutputCollector<Text, IntWritable> output, Reporter reporter)
        throws IOException {
      String[] tokens = valueDocContents.toString().split("\\s+");
      for (String wordString : tokens) {
        if (wordString.length() > 0) {
          word.set(wordString.toLowerCase());
          output.collect(word, one);  // emit (word, 1) for non-empty tokens only
        }
      }
    }
  }

  public static class Reduce extends MapReduceBase
      implements Reducer<Text, IntWritable, Text, IntWritable> {
    // Assumed body, following the notes above: sum all counts for one word.
    @Override
    public void reduce(Text keyWord, Iterator<IntWritable> counts,
        OutputCollector<Text, IntWritable> output, Reporter reporter)
        throws IOException {
      int totalCount = 0;
      while (counts.hasNext()) {
        totalCount += counts.next().get();
      }
      output.collect(keyWord, new IntWritable(totalCount));
    }
  }
}
The interesting bits
The '90s called. They want their EJBs back!
[Diagram: a Cascading flow that reads from a source Tap on HDFS and writes to a sink Tap.]
Schematically, here is what Word Count looks like in Cascading. See http://docs.cascading.org/cascading/1.2/userguide/html/ch02.html for details.
import org.cascading.*;
...
public class WordCount {
public static void main(String[] args) {
String inputPath = args[0];
String outputPath = args[1];
Properties properties = new Properties();
FlowConnector.setApplicationJarClass( properties, WordCount.class );
Datalog-style queries
Cascalog embeds Datalog-style logic queries. The variables to match are named ?foo.
Other Improved APIs:
Crunch (Java) & Scrunch (Scala)
Scoobi (Scala)
...
See https://github.com/cloudera/crunch.
Others include Scoobi (http://nicta.github.com/scoobi/) and Spark, which we'll discuss next.
Use Spark (Not MapReduce)
object WordCountSpark {
  def main(args: Array[String]): Unit = {
    val ctx = new SparkContext(...)
    val file = ctx.textFile(args(0))
    val counts = file.flatMap(
        line => line.split("\\W+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)
    counts.saveAsTextFile(args(1))
  }
}
Also small and concise!
Spark also addresses the lack of flexibility for the MapReduce model.
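The same pipeline shape can be tried without a cluster. The sketch below is an assumption-laden analogue on plain Scala collections, not Spark code: `sampleLines` and `wordCount` are hypothetical names, and `groupBy` followed by a sum stands in for what `reduceByKey(_ + _)` does across the cluster.

```scala
// Hypothetical input standing in for ctx.textFile(...).
val sampleLines: Seq[String] = Seq("to be or not to be", "that is the question")

def wordCount(lines: Seq[String]): Map[String, Int] =
  lines
    .flatMap(line => line.split("\\W+"))  // tokenize each line
    .filter(_.nonEmpty)                   // drop empty tokens from leading separators
    .map(word => (word, 1))               // same (word, 1) pairs as the Spark version
    .groupBy(_._1)                        // local stand-in for the shuffle
    .map { case (word, pairs) => (word, pairs.map(_._2).sum) }

val counts: Map[String, Int] = wordCount(sampleLines)
```

The flatMap and map steps are the same combinators as in the Spark program, so the logic can be worked out on small in-memory data first.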
Spark is a Hadoop MapReduce alternative:
(with apologies to George Lucas...)
IT shops realize that NoSQL is useful and all, but people really, Really, REALLY love SQL. So,
it's making a big comeback. You can see it in Hadoop, in SQL-like APIs for some NoSQL
DBs, e.g., Cassandra and MongoDB's JavaScript-based query language, as well as NewSQL
DBs.
Combinators
evanmiller.org/mathematical-hacker.html
Monads - Structure.
Abstracting over collections.
Control flow and mutability containment.
Monads generalize the properties of containers, like lists and maps, such as applying a
function to each element and returning a new instance of the same container type. This also
applies to encapsulations of state transformations and principled mutability, as used in
Haskell.
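A small sketch of that idea, using only the Scala standard library. `List` and `Option` are two different containers, yet both support `map` (apply a function to each element, get the same kind of container back) and `flatMap` (chain steps that themselves return a container). The helpers `parseInt` and `addStrings` are hypothetical names introduced for illustration.

```scala
// map keeps the container type: List stays List, Option stays Option.
val doubledList: List[Int]  = List(1, 2, 3).map(_ * 2)
val doubledOpt: Option[Int] = Option(21).map(_ * 2)

// Hypothetical helper: parsing can fail, so the result is wrapped in Option.
def parseInt(s: String): Option[Int] =
  try Some(s.trim.toInt) catch { case _: NumberFormatException => None }

// A for-comprehension is sugar for flatMap and map. The failure case is
// contained in the Option, with no explicit control flow in this function.
def addStrings(a: String, b: String): Option[Int] =
  for {
    x <- parseInt(a)
    y <- parseInt(b)
  } yield x + y
```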
Category Theory
(a + b) + (c + d) for some a, b, c, d.
Add All the Things, Avi Bryant, StrangeLoop 2013.
infoq.com/presentations/abstract-algebra-analytics
For an explanation of this slide, see this great presentation by Avi Bryant at StrangeLoop
2013 on generalizing addition (monoids).
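As a rough sketch of why the grouping (a + b) + (c + d) matters: if "addition" is associative and has a zero, each chunk of data can be summed independently and the partial sums combined afterwards. The `Monoid` trait and the names `intAddition`, `countMerge`, and `sumInChunks` below are hypothetical, written for illustration rather than taken from any library.

```scala
// A monoid: an associative "plus" with an identity element "zero".
trait Monoid[A] {
  def zero: A
  def plus(x: A, y: A): A
}

val intAddition: Monoid[Int] = new Monoid[Int] {
  def zero: Int = 0
  def plus(x: Int, y: Int): Int = x + y
}

// Word-count maps can be "added" too: merge the keys and sum the counts.
val countMerge: Monoid[Map[String, Int]] = new Monoid[Map[String, Int]] {
  def zero: Map[String, Int] = Map.empty
  def plus(x: Map[String, Int], y: Map[String, Int]): Map[String, Int] =
    y.foldLeft(x) { case (acc, (k, v)) => acc.updated(k, acc.getOrElse(k, 0) + v) }
}

// Sum each chunk on its own (as separate workers could), then combine the
// partial sums. Associativity guarantees the same answer as one long sum.
def sumInChunks[A](xs: Seq[A], chunkSize: Int)(m: Monoid[A]): A =
  xs.grouped(chunkSize)
    .map(chunk => chunk.foldLeft(m.zero)(m.plus))
    .foldLeft(m.zero)(m.plus)
```

With a chunk size of 2 and four elements, `sumInChunks` computes exactly (a + b) + (c + d).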
Linear Algebra
Eigenvector and Singular Value Decomposition.
Essential tools in machine learning.
Av = mv
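To make Av = mv concrete, here is a minimal power-iteration sketch that estimates the largest eigenvalue m and its eigenvector v. It is one simple technique, not necessarily what a production library uses, and it assumes the matrix has a single dominant eigenvalue and that the start vector is not orthogonal to its eigenvector. The names `Vec`, `Mat`, `multiply`, `norm`, and `dominantEigen` are hypothetical.

```scala
type Vec = Vector[Double]
type Mat = Vector[Vec]

// Matrix-vector product A v.
def multiply(a: Mat, v: Vec): Vec =
  a.map(row => row.zip(v).map { case (x, y) => x * y }.sum)

def norm(v: Vec): Double = math.sqrt(v.map(x => x * x).sum)

// Power iteration: repeatedly apply A and rescale to unit length.
def dominantEigen(a: Mat, iterations: Int = 100): (Double, Vec) = {
  val start: Vec = Vector.fill(a.length)(1.0 / math.sqrt(a.length.toDouble))
  val v = (1 to iterations).foldLeft(start) { (cur, _) =>
    val next = multiply(a, cur)
    val n = norm(next)
    next.map(_ / n)
  }
  // For a unit vector v, the Rayleigh quotient v . (A v) estimates m.
  val m = v.zip(multiply(a, v)).map { case (x, y) => x * y }.sum
  (m, v)
}
```

For the diagonal matrix with entries 2 and 1, the iteration converges to m close to 2 and v close to (1, 0).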
Example: Eigenfaces
Represent images as vectors. Solve for modes. Top N modes approx. faces!
http://en.wikipedia.org/wiki/File:Eigenfaces.png
Set Theory and First-Order Logic
Relational Model. Data organized into tuples, grouped by relations.
http://dl.acm.org/citation.cfm?doid=362384.362685
Formulated by Codd in '69. Most systems don't follow it exactly; for example, they allow identical
records, whereas set elements are unique. Codd's original model didn't support NULLs (unknown
values) either, but he later proposed a revision to allow them.
Most RDBMSs deviate from RM.
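The difference between a relation (a set of tuples) and what most SQL tables actually hold (a bag that permits identical records) can be seen with two standard Scala collections. The data and the names `relation` and `table` are made up for illustration.

```scala
// A relation in the set-theoretic sense: identical tuples collapse into one.
val relation: Set[(String, Int)] =
  Set(("Chicago", 3), ("Hadoop", 5), ("Chicago", 3))

// What a typical SQL table behaves like: a bag that keeps duplicate rows.
val table: List[(String, Int)] =
  List(("Chicago", 3), ("Hadoop", 5), ("Chicago", 3))

// Removing duplicates from the bag recovers the relation.
val deduplicated: Set[(String, Int)] = table.distinct.toSet
```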
What are Combinators?
Functions that are side-effect free.
They get all their information from their inputs and write all their work to their outputs.
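A minimal sketch of what that buys you: because each function below reads only its argument and returns a new value, the pieces can be composed freely into a larger function. The names `tokenize`, `dropShort`, `countAll`, and `pipeline` are hypothetical, and the "shorter than 3 characters" rule is an arbitrary assumption.

```scala
// Each step depends only on its input and returns a new value.
val tokenize: String => Seq[String] =
  (s: String) => s.toLowerCase.split("\\W+").toSeq.filter(w => w.nonEmpty)

val dropShort: Seq[String] => Seq[String] =
  (ws: Seq[String]) => ws.filter(w => w.length > 2)

val countAll: Seq[String] => Map[String, Int] =
  (ws: Seq[String]) => ws.groupBy(identity).map { case (w, group) => (w, group.size) }

// Nothing is mutated, so the steps compose into one larger combinator.
val pipeline: String => Map[String, Int] =
  tokenize andThen dropShort andThen countAll
```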
Let's look at 4 relational operators and the corresponding functional combinators.
See, for example, the discussions in Database in Depth and SQL and Relational Theory,
Second Edition, both by C.J. Date (O'Reilly).
Recall our Word Counts:
CREATE TABLE word_counts (
word CHARACTER(64),
count INTEGER);
(ANSI SQL syntax)
(Scala)
vs.
word_counts.filter {
  case (word, count) =>
    word == "Chicago"
}
vs.
word_counts.map {
  case (word, count) =>
    word
}
vs.
word_counts
  .joinWithLarger('wword -> 'dword,
    dictionary)
  .project('wword, 'definition)
vs.
word_counts.groupBy {
  case (word, count) => count
}.toList.map {
  case (count, words) => (count, words.size)
}.sortBy {
  case (count, size) => -size
}
How many words appeared once, twice, 3 times, ..., N times? Order this list descending.
I'm back to the Scala library (as opposed to Scalding). The code inputs a collection of tuples, (word,count), and groups by count. This creates a map with the
count as the key and a list of the words as the value.
Next we convert this to a list of tuples (count,List(words)) and map it to a list of tuples (count, size of List(words)), then finally sort descending by the list
sizes.
Example
scala> val word_counts = List(
         ("a", 1), ("b", 2), ("c", 3),
         ("d", 2), ("e", 2), ("f", 3))
scala> val out = word_counts.groupBy { case (word, count) => count }.toList.map {
         case (count, words) => (count, words.size)
       }.sortBy { case (count, size) => -size }
out: List[(Int,Int)] =
  List((2,3), (3,2), (1,1))
Here's a simple example you can run in the Scala REPL (prompts are scala>).
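For readers who want all four operators in one runnable place, here is a sketch on plain Scala collections (no Scalding). The SQL keywords in the comments are the usual counterparts and are an assumption, as are the sample data and the names `wordCounts`, `dictionary`, `restricted`, `projected`, `joined`, and `grouped`.

```scala
val wordCounts: List[(String, Int)] = List(("Chicago", 3), ("data", 5), ("Spark", 3))
val dictionary: Map[String, String] =
  Map("Chicago" -> "a city in Illinois", "data" -> "facts and statistics")

// Restrict (WHERE): keep the rows that satisfy a predicate.
val restricted = wordCounts.filter { case (word, _) => word == "Chicago" }

// Project (SELECT word): keep only some of the columns.
val projected = wordCounts.map { case (word, _) => word }

// Inner join on the word, keeping the word and its definition.
// Words without a dictionary entry drop out of the result.
val joined = wordCounts.flatMap { case (word, _) =>
  dictionary.get(word).map(definition => (word, definition)).toList
}

// Group by count and aggregate (GROUP BY ... COUNT(*)).
val grouped = wordCounts.groupBy(_._2).map { case (count, ws) => (count, ws.size) }
```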
We could go on, but you get the point.
Declarative, functional combinators are a natural tool for data.
SQL vs. FP
SQL: Lots of optimizations for data manipulation.
FP: More combinators. First class functions!
A drawback of SQL is that it doesn't provide first-class functions, so (depending on the
system) you're limited to those that are built in or to UDFs (user-defined functions) that you can
write and add. FP languages make this easy!
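A brief sketch of the contrast: in an FP language the "UDF" is just an ordinary function value passed to a query-like operation. The names `isCapitalized`, `select`, and `sampleCounts` are hypothetical and the data is made up.

```scala
// A user-defined predicate is an ordinary function; nothing to register.
def isCapitalized(word: String): Boolean = word.headOption.exists(_.isUpper)

// A query-like operation that takes the predicate as an argument.
def select(rows: List[(String, Int)])(keep: String => Boolean): List[(String, Int)] =
  rows.filter { case (word, _) => keep(word) }

val sampleCounts: List[(String, Int)] = List(("Chicago", 3), ("data", 5), ("Spark", 3))

val capitalized = select(sampleCounts)(isCapitalized)  // a named function
val longWords   = select(sampleCounts)(_.length > 5)   // an anonymous function
```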
FP to the Rescue!
Multicore concurrency is driving FP adoption.
We've all heard this. In fact, this is how I got interested in FP.
My Claim: Data will drive the next wave of widespread FP adoption.
Even today, most developers get by without understanding concurrency. Many will just use an
Actor or Reactive model to solve their problems. I think more devs will have to learn how to
work with data at scale and that fact will drive them to FP. This will be the next wave.
Data Architectures
What should software architectures look like for these kinds of systems?
Photo: Two famous 19th Century Buildings in Chicago.
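[Diagram: an object-oriented architecture. Domain logic and other object-oriented query objects sit on an object model (ParentB1 with children ChildB1 and ChildB2, each with a toJSON method). The object model is filled through object-relational mapping from the result sets that SQL queries return from the database.]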
The toJSON methods are there because we often convert these object graphs back into fundamental structures, such as the maps and arrays of JSON so we can send them to the browser!
[Diagram: a relational/functional architecture shown beside the object-oriented one. Functional query and domain logic use functional abstractions over a functional wrapper for relational data, which sends SQL to the database and receives result sets.]
Focus on: Lists, Maps, Sets, Trees, ...
[Diagram: the object model (ParentB1, ChildB1, ChildB2, each with toJSON) on top of databases and files. Remaining labels: Data Size, Formal Schema, Process 1, Process 2, Process 3, Data-Driven Programs.]
We've seen a lot of issues with MapReduce. Already, alternatives are being developed, either general options, like
Spark and Storm, or special-purpose replacements, like Impala. Let's consider other options...
Emerging replacements are based on Functional Languages...
import com.twitter.scalding._
FP is such a natural fit for the problem that any attempts to build big data systems without it will be handicapped
and probably fail.
Let's consider other MapReduce options...
... and SQL
Questions?
All pictures Copyright Dean Wampler, 2011-2013, All Rights Reserved. All other content is free to use, but
attribution is requested.
Photo: Building in fog on Michigan Avenue