4 Spark SBP
• RDDs are a distributed-memory abstraction, motivated by two types of applications that current computing frameworks handle inefficiently:
• Iterative algorithms
• Interactive data mining tools
• In both cases, keeping data in memory can improve performance by an order of magnitude.
• RDDs are immutable, partitioned collections of records.
• They can only be created through coarse-grained operations such as map, filter, groupBy, etc.
• Coarse-grained means that an operation is applied to all elements in a dataset.
• RDDs can only be created by:
• Reading data from stable storage such as HDFS, or
• Transforming existing RDDs.
Resilient Distributed Datasets (RDDs)
• Once data is read into an RDD object in Spark, a variety of operations can be performed on the RDD by invoking the Spark APIs.
• The two major types of operation available are transformations and actions (sketched below).
• (a) Transformations return a new, modified RDD based on the original. Several transformations are available through the Spark API, including map(), filter(), sample(), and union().
• (b) Actions return a value based on some computation performed on an RDD. Examples of actions supported by the Spark API include reduce(), count(), first(), and foreach().
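A minimal sketch of the two kinds of operation, assuming a SparkContext named sc:

val nums = sc.parallelize(Seq(1, 2, 3, 4, 5))
val doubled = nums.map(_ * 2)           // transformation: lazily defines a new RDD
val evens = doubled.filter(_ % 4 == 0)  // transformation: still nothing computed
println(evens.count())                  // action: triggers computation, returns 2
println(doubled.reduce(_ + _))          // action: sums the elements, returns 30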
Iterative and Interactive Applications
• Both Iterative and Interactive applications require faster data sharing
across parallel jobs.
• Data sharing is slow in MapReduce due to replication, serialization,
and disk IO.
• In terms of storage, most Hadoop applications spend more than 90% of their time doing HDFS read-write operations.
Iterative Operations on MapReduce
• Multi-stage applications reuse intermediate results across multiple computations. The following illustration shows how the current framework handles iterative operations on MapReduce: it incurs substantial overheads from data replication, disk I/O, and serialization, which make the system slow.
Interactive Operations on MapReduce
• Users run ad-hoc queries on the same subset of data. Each query performs disk I/O against stable storage, which can dominate application execution time.
• The following illustration explains how the current framework
works while doing the interactive queries on MapReduce.
Iterative Operations on Spark RDD
• The illustration given below shows iterative operations on Spark RDD. Intermediate results are stored in distributed memory instead of stable storage (disk), which makes the system faster (see the sketch below).
Note: if the distributed memory (RAM) is not sufficient to store intermediate results (the state of the job), those results are stored on disk.
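A minimal sketch of this reuse, assuming a SparkContext sc and a hypothetical HDFS path:

val points = sc.textFile("hdfs://.../points").cache()   // keep partitions in distributed memory
val n = points.count()                      // first action reads from HDFS and populates the cache
val m = points.filter(_.nonEmpty).count()   // later passes (as in iterative algorithms) read from RAM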
Interactive Operations on Spark RDD
• This illustration shows interactive operations on Spark RDD. If different queries are run on the
same set of data repeatedly, this particular data can be kept in memory for better execution
times.
• By default, each transformed RDD may be recomputed each time you run an action on it. However, you may also persist an RDD, in which case Spark will keep the elements around on the cluster for much faster access the next time you query it. There is also support for persisting RDDs on disk, or replicating them across multiple nodes; a sketch of the options follows below.
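A minimal sketch of the persistence options, assuming a SparkContext sc and a hypothetical path (cache() is shorthand for persisting at the memory-only level):

import org.apache.spark.storage.StorageLevel

val msgs = sc.textFile("hdfs://.../logs").filter(_.contains("ERROR"))
// An RDD's storage level is set once; common choices:
msgs.persist(StorageLevel.MEMORY_AND_DISK)  // in memory, spilling to disk when it does not fit
// StorageLevel.MEMORY_ONLY   -- what cache() uses
// StorageLevel.DISK_ONLY     -- persist on disk only
// StorageLevel.MEMORY_ONLY_2 -- replicate each partition on two nodes
msgs.count()                                // the first action materializes and persists the RDD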
Directed Acyclic Graph (DAG)
• DAG is an important feature of real-time data processing platforms such as Spark, Storm, and Tez; it helps them offer new capabilities for building highly interactive, real-time computing systems that power real-time BI, predictive analytics, real-time marketing, and other critical systems.
• The DAG scheduler is the scheduling layer of Apache Spark that implements stage-oriented scheduling: after an RDD action has been called, it becomes a job that is then transformed into a set of stages submitted as task sets for execution.
• In general, the DAG scheduler does three things in Spark:
(a) Computes an execution DAG, i.e. a DAG of stages, for a job;
(b) Determines the preferred locations to run each task on;
(c) Handles failures due to shuffle output files being lost.
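For example, a job containing a shuffle is split into two stages; a minimal sketch, assuming a SparkContext sc and a hypothetical input path:

val counts = sc.textFile("hdfs://.../input")
  .flatMap(_.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)      // shuffle dependency: the scheduler places a stage boundary here

counts.collect()           // action: becomes a job, submitted as two stages of task sets
println(counts.toDebugString) // prints the lineage from which the DAG of stages is built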
How DAGs Work
• RDD transformations are of two types: narrow transformations, where each output partition depends on a single input partition (e.g. map, filter), and wide transformations, which shuffle data across partitions (e.g. groupByKey); a sketch follows below.
[Figure: job DAG with map and reduce tasks between input and output]
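A minimal sketch of the two kinds, assuming a SparkContext sc:

val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
val bumped  = pairs.mapValues(_ + 1) // narrow: each output partition depends on one input partition
val grouped = pairs.groupByKey()     // wide: data is shuffled across partitions (stage boundary)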
Motivation
• Most current cluster programming models are based
on acyclic data flow from stable storage to stable
storage
• Benefits of data flow: the runtime can decide where to run tasks and can automatically recover from failures.
[Figure: acyclic data flow from Input through Map and Reduce tasks to Output]
Example: Log Mining
Load error messages from a log into memory, then
interactively search for various patterns
lines = spark.textFile("hdfs://...")           // base RDD
errors = lines.filter(_.startsWith("ERROR"))   // transformed RDD
messages = errors.map(_.split('\t')(2))
cachedMsgs = messages.cache()

cachedMsgs.filter(_.contains("foo")).count     // action
cachedMsgs.filter(_.contains("bar")).count
...

[Figure: the driver ships tasks to workers; each worker reads its input block (Block 1, 2, 3), caches the messages partition (Cache 1, 2, 3), and returns results to the driver]
Result: full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data)
Result: scaled to 1 TB data in 5-7 sec (vs 170 sec for on-disk data)
RDD Fault Tolerance
RDDs maintain lineage information that can be used to
reconstruct lost partitions
Ex:
messages = textFile(...).filter(_.startsWith("ERROR"))
                        .map(_.split('\t')(2))
Example: Logistic Regression
• An iterative algorithm that repeatedly computes a gradient over the same dataset, so caching the data in memory pays off across iterations; see the sketch below.
Logistic Regression Performance
[Figure: running time (s) vs. number of iterations (1, 5, 10, 20, 30) for Hadoop and Spark. Hadoop: ~127 s per iteration; Spark: 174 s for the first iteration, ~6 s per further iteration]
Spark Applications
• In-memory data mining on Hive data (Conviva)
• Predictive analytics (Quantifind)
• City traffic prediction (Mobile Millennium)
• Twitter spam classification (Monarch)
• Collaborative filtering via matrix factorization
Conviva GeoReport
[Figure: report generation time in hours. Hive: 20; Spark: 0.5]
• Partitioning-aware map to avoid shuffles (e.g. before a join).
[Figure: iteration time (s) vs. % of working set in memory. Cache disabled: 68.8; 25%: 58.1; 50%: 40.7; 75%: 29.7; fully cached: 11.5]
Spark Operations
• Transformations (define a new RDD): map, filter, sample, groupByKey, reduceByKey, sortByKey, flatMap, union, join, cogroup, cross, mapValues
• Actions (return a result to the driver program): collect, reduce, count, save, lookupKey
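A minimal sketch combining several of these operations, assuming a SparkContext sc:

val users  = sc.parallelize(Seq((1, "alice"), (2, "bob")))
val visits = sc.parallelize(Seq((1, "home"), (1, "search"), (2, "cart")))
val joined = users.join(visits)    // transformation: yields (id, (name, page)) pairs
println(joined.count())            // action: returns 3
joined.collect().foreach(println)  // action: brings all results back to the driver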
Case study: Conviva, Inc.
• Real-time monitoring of online video metadata
• HBO, ESPN, ABC, SyFy, …
Existing Streaming Systems
• Storm
• Replays a record if not processed by a node
• Processes each record at least once
• May update mutable state twice!
• Mutable state can be lost due to failure!
Requirements
• Scalable to large clusters
• Second-scale latencies
• Simple programming model
• Integrated with batch & interactive processing
• Efficient fault-tolerance in stateful computations
Spark Streaming
Discretized Stream Processing
Run a streaming computation as a series of very small, deterministic batch jobs:
• Chop up the live data stream into batches of X seconds.
• Spark treats each batch of data as an RDD and processes it using RDD operations.
• Finally, the processed results of the RDD operations are returned in batches.
[Figure: live data stream → Spark Streaming → batches of X seconds → Spark → processed results]
Discretized Stream Processing
Run a streaming computation as a series of very small, deterministic batch jobs:
• Batch sizes as low as ½ second, latency ~ 1 second.
• Potential for combining batch processing and streaming processing in the same system.
[Figure: live data stream → Spark Streaming → batches of X seconds → Spark → processed results]
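The batch interval is fixed when the streaming context is created; a minimal sketch (the master URL and app name are hypothetical):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Milliseconds, StreamingContext}

val conf = new SparkConf().setMaster("local[2]").setAppName("StreamDemo")
val ssc = new StreamingContext(conf, Milliseconds(500)) // batch interval of half a second
// ... define DStream sources and operations here ...
ssc.start()            // start receiving data and processing batches
ssc.awaitTermination()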
Example 1 – Get hashtags from Twitter

val tweets = ssc.twitterStream(<Twitter username>, <Twitter password>)

• tweets is a DStream: a sequence of RDDs representing the stream of data, with a new batch arriving every batch interval.
• A DStream transformation modifies data in one DStream to create another DStream.
[Figure: the tweets DStream transformed into a new DStream that emits [#cat, #dog, …] every batch]
Example 1 – Get hashtags from Twitter

val tweets = ssc.twitterStream(<Twitter username>, <Twitter password>)
val hashTags = tweets.flatMap(status => getTags(status))
hashTags.saveAsHadoopFiles("hdfs://...")

• saveAsHadoopFiles is an output operation: it pushes data to external storage.
[Figure: for each batch (@ t, t+1, t+2), flatMap turns the tweets DStream into the hashTags DStream, and save writes every batch to HDFS]
Java Example
Scala:
val tweets = ssc.twitterStream(<Twitter username>, <Twitter password>)
val hashTags = tweets.flatMap(status => getTags(status))
hashTags.saveAsHadoopFiles("hdfs://...")
Java:
JavaDStream<Status> tweets = ssc.twitterStream(<Twitter username>, <Twitter password>);
JavaDStream<String> hashTags = tweets.flatMap(new Function<...> { }); // Function object defines the transformation
hashTags.saveAsHadoopFiles("hdfs://...");
Fault-tolerance
• RDDs remember the sequence of operations that created them from the original fault-tolerant input data.
• Batches of input data are replicated in memory, so a lost partition can be recomputed from the replicated input.
[Figure: replicated tweets input RDDs → reduceByKey per batch → tagCounts, e.g. [(#cat, 10), (#dog, 25), ...]]
Example 3 – Count the hashtags over the last 10 mins

val tweets = ssc.twitterStream(<Twitter username>, <Twitter password>)
val hashTags = tweets.flatMap(status => getTags(status))
val tagCounts = hashTags.window(Minutes(10), Seconds(1)).countByValue()

• window is a sliding window operation: Minutes(10) is the window length and Seconds(1) is the sliding interval.
Example 3 – Counting the hashtags over the last 10 mins

val tagCounts = hashTags.window(Minutes(10), Seconds(1)).countByValue()

• countByValue counts over all the data in the current window.
Smart window-based countByValue

val tagCounts = hashTags.countByValueAndWindow(Minutes(10), Seconds(1))

• Instead of recounting the whole window for every slide, the counts are updated incrementally as new batches enter the window and old ones leave it.
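Because the incremental computation carries state from batch to batch, the streaming context needs a checkpoint directory; a sketch with a hypothetical HDFS path:

ssc.checkpoint("hdfs://.../checkpoints") // required for stateful window operations
val tagCounts = hashTags.countByValueAndWindow(Minutes(10), Seconds(1))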
Demo
Fault-tolerant Stateful Processing
All intermediate data (e.g. tagCounts) are RDDs, hence they can be recomputed if lost.
Fault-tolerant Stateful Processing
• State data not lost even if a worker node dies
• Does not change the value of your result
Other Interesting Operations
Maintaining arbitrary state, tracking sessions
- Maintain per-user mood as state, and update it with his/her tweets (see the sketch below)
tweets.updateStateByKey(tweet => updateMood(tweet))
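In the Spark Streaming API, the update function receives all new values for a key together with the key's previous state; a minimal sketch, where userTweets is a DStream[(Long, String)] keyed by user id and Mood, neutral, and updateMood are hypothetical:

val moods = userTweets.updateStateByKey[Mood] { (newTweets: Seq[String], prev: Option[Mood]) =>
  Some(newTweets.foldLeft(prev.getOrElse(neutral))(updateMood)) // fold tweets into the mood state
}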
Comparison with Storm and S4
• Higher throughput than Storm
• Spark Streaming: 670k records/second/node
• Storm: 115k records/second/node
• Apache S4: 7.5k records/second/node
Real Applications: Conviva
Real-time monitoring of video metadata
• Achieved 1-2 second latency
• Millions of video sessions processed
• Scales linearly with cluster size
Real Applications: Mobile Millennium Project
Traffic transit time estimation using online
machine learning on GPS observations
• Markov chain Monte Carlo simulations on GPS
observations
• Very CPU intensive, requires dozens of
machines for useful computation
• Scales linearly with cluster size
Vision - one stack to rule them all
[Figure: a single stack (Spark + Shark + Spark Streaming) covering stream processing, ad-hoc queries, and batch processing]
Spark program vs Spark Streaming program
Spark Streaming program on Twitter stream
val tweets = ssc.twitterStream(<Twitter username>, <Twitter password>)
val hashTags = tweets.flatMap (status => getTags(status))
hashTags.saveAsHadoopFiles("hdfs://...")
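For comparison, a sketch of the batch counterpart: only the context and input source change (sc is a plain SparkContext; the file-based calls are shown schematically):

val tweets = sc.hadoopFile("hdfs://...")
val hashTags = tweets.flatMap(status => getTags(status))
hashTags.saveAsHadoopFile("hdfs://...")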