Overview
Project Goals
Extend the MapReduce model to better support
two common classes of analytics apps:
»Iterative algorithms (machine learning, graphs)
»Interactive data mining
Enhance programmability:
»Integrate into Scala programming language
»Allow interactive use from Scala interpreter
Motivation
Most current cluster programming models are
based on acyclic data flow from stable storage
to stable storage
[Diagram: acyclic data flow from input through Map and Reduce stages to output]
Benefits of data flow: runtime can decide where to run tasks and can automatically recover from failures
Motivation
Acyclic data flow is inefficient for applications
that repeatedly reuse a working set of data:
»Iterative algorithms (machine learning, graphs)
»Interactive data mining tools (R, Excel, Python)
With current frameworks, apps reload data
from stable storage on each query
Solution: Resilient Distributed Datasets (RDDs)
Allow apps to keep working sets in memory for
efficient reuse
Retain the attractive properties of MapReduce
» Fault tolerance, data locality, scalability
Actions on RDDs
» Count, reduce, collect, save, …
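For example, a cached RDD can serve repeated actions without re-reading from disk. A minimal sketch in the deck's own Scala style (the file path is a placeholder, and spark is the context variable used in the Log Mining example below):

val nums = spark.textFile("hdfs://.../ints.txt").map(_.toInt).cache()
nums.count                      // first action scans the file and fills the cache
nums.reduce(_ + _)              // later actions reuse the in-memory working set
nums.filter(_ > 100).collect()  // action: returns matching elements to the driver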
Example: Log Mining
Load error messages from a log into memory, then
interactively search for various patterns
lines = spark.textFile("hdfs://...")              // base RDD
errors = lines.filter(_.startsWith("ERROR"))      // transformed RDD
messages = errors.map(_.split('\t')(2))
cachedMsgs = messages.cache()

cachedMsgs.filter(_.contains("foo")).count        // action
cachedMsgs.filter(_.contains("bar")).count
. . .

[Diagram: the driver ships tasks to workers; each worker reads an HDFS block (Block 1-3), builds an in-memory partition of the cached RDD (Cache 1-3), and returns results to the driver]
Result: full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data)
Result: scaled to 1 TB data in 5-7 sec (vs 170 sec for on-disk data)
RDD Fault Tolerance
RDDs maintain lineage information that can be
used to reconstruct lost partitions
Ex: messages = textFile(...).filter(_.startsWith("ERROR"))
                            .map(_.split('\t')(2))
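As a minimal model of the idea (plain Scala, not Spark's actual internals), each node in a lineage chain knows its parent and how to recompute any one partition, so a lost partition can be rebuilt on demand:

trait Lineage[T] { def compute(split: Int): Iterator[T] }

// The source re-reads its HDFS block; derived nodes replay their transformation.
class HdfsSource(blocks: Array[Array[String]]) extends Lineage[String] {
  def compute(split: Int) = blocks(split).iterator
}
class Filtered[T](parent: Lineage[T], f: T => Boolean) extends Lineage[T] {
  def compute(split: Int) = parent.compute(split).filter(f)
}
class Mapped[T, U](parent: Lineage[T], f: T => U) extends Lineage[U] {
  def compute(split: Int) = parent.compute(split).map(f)
}

// Rebuilding partition 1 of `messages` after a failure just replays the chain:
val source   = new HdfsSource(Array(Array("INFO\ta\tb"), Array("ERROR\tdisk\tfull")))
val errors   = new Filtered[String](source, _.startsWith("ERROR"))
val messages = new Mapped[String, String](errors, _.split('\t')(2))
messages.compute(1).foreach(println)   // prints "full"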
Example: Logistic Regression
val data = spark.textFile(...).map(readPoint).cache()

var w = Vector.random(D)

for (i <- 1 to ITERATIONS) {
  val gradient = data.map(p =>
    (1 / (1 + exp(-p.y * (w dot p.x))) - 1) * p.y * p.x
  ).reduce(_ + _)
  w -= gradient
}

println("Final w: " + w)
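The snippet assumes a Point type, a readPoint parser, and a Vector supporting the arithmetic the loop uses; none of these are defined on the slide. One hypothetical, self-contained way to fill them in:

import scala.util.Random

// Hypothetical support code (not from the talk): a tiny dense vector with
// just the operations the loop needs, plus the Point type and line parser.
case class Vector(elems: Array[Double]) {
  def dot(o: Vector): Double = elems.zip(o.elems).map { case (a, b) => a * b }.sum
  def *(s: Double): Vector   = Vector(elems.map(_ * s))
  def +(o: Vector): Vector   = Vector(elems.zip(o.elems).map { case (a, b) => a + b })
  def -(o: Vector): Vector   = Vector(elems.zip(o.elems).map { case (a, b) => a - b })
}
object Vector {
  def random(d: Int): Vector = Vector(Array.fill(d)(Random.nextDouble))
}
// Lets the gradient expression write scalar * vector as well as vector * scalar.
implicit class Scalar(s: Double) { def *(v: Vector): Vector = v * s }

case class Point(x: Vector, y: Double)   // y is the label: +1.0 or -1.0

// Parse one line of "label f1 f2 ... fD" into a Point.
def readPoint(line: String): Point = {
  val nums = line.trim.split(' ').map(_.toDouble)
  Point(Vector(nums.tail), nums.head)
}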
Logistic Regression Performance
[Chart: running time (s) vs number of iterations (1, 5, 10, 20, 30) for Hadoop and Spark. Hadoop: 127 s / iteration. Spark: 174 s for the first iteration, 6 s for each further iteration]
Spark Applications
In-memory data mining on Hive data (Conviva)
Predictive analytics (Quantifind)
City traffic prediction (Mobile Millennium)
Twitter spam classification (Monarch)
Collaborative filtering via matrix factorization
…
Conviva GeoReport
[Chart: time (hours): Hive 20, Spark 0.5]
Pipelines functions within a stage
Cache-aware work reuse and locality
[Diagram: DAG of RDDs (A-G) split into stages at shuffle boundaries such as groupBy]
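As a rough illustration (not Spark's scheduler code), pipelining within a stage amounts to fusing narrow transformations into one pass over each partition, much like chained Scala iterators:

// filter and map run in a single traversal of the partition; no intermediate
// collection is materialized between the two operations.
val partition = Iterator("ERROR\tdisk\tfull", "INFO\tall\tok")
val pipelined = partition
  .filter(_.startsWith("ERROR"))   // narrow dependency: stays in the same stage
  .map(_.split('\t')(2))           // fused into the same single pass
pipelined.foreach(println)         // prints: full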
www.spark-project.org
[email protected]
Related Work
DryadLINQ, FlumeJava
» Similar “distributed collection” API, but cannot reuse
datasets efficiently across queries
Relational databases
» Lineage/provenance, logical logging, materialized views
GraphLab, Piccolo, BigTable, RAMCloud
» Fine-grained writes similar to distributed shared memory
Iterative MapReduce (e.g. Twister, HaLoop)
» Implicit data sharing for a fixed computation pattern
Caching systems (e.g. Nectar)
» Store data in files, no explicit control over what is cached
Behavior with Not Enough RAM
[Chart: iteration time (s) vs % of working set in memory: cache disabled 68.8, 25% 58.1, 50% 40.7, 75% 29.7, fully cached 11.5]
Fault Recovery Results
[Chart: iteration time (s) over iterations 1-10, comparing no failure against a failure in the 6th iteration. Iteration 1: 119 s; other iterations: 56-59 s; the 6th iteration with a failure: 81 s]
Spark Operations
Transformations (define a new RDD):
» map, flatMap, filter, union, sample, join, groupByKey, cogroup, reduceByKey, cross, sortByKey, mapValues
Actions (return a result to the driver program):
» collect, reduce, count, save, lookupKey
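A short sketch combining a few of these operations (data and variable names invented for illustration; assumes a SparkContext named spark):

val pairs  = spark.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
val counts = pairs.reduceByKey(_ + _)    // transformation: ("a", 4), ("b", 2)
val labels = spark.parallelize(Seq(("a", "x")))
val joined = counts.join(labels)         // transformation: ("a", (4, "x"))
joined.collect()                         // action: returns the results to the driver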