
Spark

Large-scale near-real-time stream processing


• Speed − Spark helps run applications in a Hadoop cluster up to 100 times
faster in memory and 10 times faster when running on disk. This is
possible because it reduces the number of read/write operations to disk
by storing intermediate processing data in memory.
• Supports multiple languages − Spark provides built-in APIs in Java,
Scala, and Python, so you can write applications in different languages.
Spark also offers around 80 high-level operators for interactive querying.
• Advanced Analytics − Spark supports not only 'Map' and 'Reduce' but
also SQL queries, streaming data, machine learning (ML), and graph
algorithms.
What is Spark Streaming?
• Apache Spark is an integrated, fast, in-memory, general-purpose engine for large-scale data processing.
• Framework for large-scale stream processing
• Scales to 100s of nodes
• Achieves 10-100x performance over Hadoop by operating on an in-memory data construct called Resilient Distributed Datasets (RDDs)
• Can achieve second-scale latencies
• Integrates with Spark's batch (iterative) and interactive processing
• Spark offers built-in libraries for machine learning, graph processing, stream processing, and SQL to deliver superfast data processing along with high programmer productivity.
• Spark is compatible with Hadoop file systems and tools.
• Provides a simple batch-like API for implementing complex algorithms
• Spark is an alternative to MapReduce rather than a replacement for the Hadoop file system.
• Can absorb live data streams from Kafka, Flume, ZeroMQ, etc.
More on Spark
• Apache Spark was developed in 2009 in UC Berkeley's AMPLab and open sourced in 2010; it later became an Apache project.
• It can process data from a variety of data repositories, including HDFS and NoSQL databases such as HBase and Cassandra.
• Spark prioritizes in-memory processing to boost the performance of big data analytics applications; however, it can also perform conventional disk-based processing when data sets are too large to fit into the available system memory.
• Spark enables applications in Hadoop clusters to run up to 100 times faster in memory and 10 times faster when running on disk.
Motivation
• Many important applications must process large streams of live data and provide
results in near-real-time
• Social network trends
• Website statistics
• Intrusion detection systems
• etc.

• Require large clusters to handle workloads

• Require latencies of a few seconds


Need for a framework …
… for building such complex stream processing applications

But what are the requirements from such a framework?
Requirements
• Scalable to large clusters
• Second-scale latencies
• Simple programming model
Spark Architecture
• The core Spark engine functions partly as an API layer and underpins a
set of related tools for managing and analyzing data.
• These include:
  • a SQL query engine
  • a library for machine learning algorithms
  • a graph processing system, and
  • streaming data processing software
• Spark allows programmers to develop complex, multi-step data pipelines
using a directed acyclic graph (DAG) pattern. It also supports in-memory
data sharing across DAGs, so that different jobs can work with the same data.
• Spark runs on top of existing HDFS infrastructure to provide enhanced
and additional functionality.
SPARK Architecture
• Libraries: Spark SQL (structured data) | Spark Streaming (real-time) | MLlib (machine learning) | GraphX (graph processing)
• Spark Core (API)
• Cluster managers: Standalone Scheduler | YARN | Mesos

Resilient Distributed Datasets (RDDs)
• Allow apps to keep working sets in memory for efficient reuse
• Retain the attractive properties of MapReduce
  • Fault tolerance, data locality, scalability
• Support a wide range of applications

• RDDs are a distributed memory construct, motivated by two types of applications that current computing frameworks handle inefficiently:
  • Iterative algorithms
  • Interactive data mining tools
• In both cases, keeping data in memory can improve performance by an order of magnitude.
• RDDs are immutable, partitioned collections of records.
• They can only be built through coarse-grained operations such as map, filter, group-by, etc.
• Coarse-grained means that the operations are applied to all elements in a dataset.
• RDDs can only be created by:
  • reading data from stable storage such as HDFS, or
  • applying transformations to existing RDDs.
Resilient Distributed Datasets (RDDs)
• Once data is read into an RDD object in Spark, a variety of operations can be performed on the RDD by invoking Spark's abstract APIs.
• The two major types of operation available are transformations and actions.
• (a) Transformations return a new, modified RDD based on the original. Several transformations are available through the Spark API, including map(), filter(), sample(), and union().
• (b) Actions return a value based on some computation performed on an RDD. Some examples of actions supported by the Spark API include reduce(), count(), first(), and foreach().
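A minimal sketch of the difference, assuming a SparkContext named sc; the file path and field index are illustrative, not from the slides:

  // Transformations only define a new RDD; actions trigger computation and return a value.
  val lines  = sc.textFile("hdfs://.../logs/input.txt")   // base RDD read from stable storage
  val errors = lines.filter(_.startsWith("ERROR"))        // transformation: nothing runs yet
  val fields = errors.map(_.split('\t')(1))               // transformation: another derived RDD

  val numErrors = errors.count()                          // action: job runs, count returned to the driver
  val sample    = fields.first()                          // action: returns the first element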
Iterative and Interactive Applications
• Both iterative and interactive applications require faster data sharing across parallel jobs.
• Data sharing is slow in MapReduce due to replication, serialization, and disk I/O.
• In terms of the storage system, most Hadoop applications spend more than 90% of their time doing HDFS read-write operations.
Iterative Operations on MapReduce
• Multi-stage applications need to reuse intermediate results across multiple computations. When iterative operations are run on MapReduce, each stage writes its results back to stable storage, which incurs substantial overheads due to data replication, disk I/O, and serialization, and makes the system slow.
Interactive Operations on MapReduce
• The user runs ad-hoc queries on the same subset of data. With MapReduce, each query does disk I/O against stable storage, which can dominate application execution time.
Iterative Operations on Spark RDD
• In iterative operations on Spark RDDs, intermediate results are stored in distributed memory instead of stable storage (disk), making the system faster.
Note − If the distributed memory (RAM) is not sufficient to store intermediate results (the state of the job), Spark will store those results on disk.
Interactive Operations on Spark RDD
• If different queries are run repeatedly on the same set of data, that data can be kept in memory for better execution times.
• By default, each transformed RDD may be recomputed each time you run an action on it. However, you may also persist an RDD in memory, in which case Spark will keep the elements around on the cluster for much faster access the next time you query it. There is also support for persisting RDDs on disk, or replicating them across multiple nodes.
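A small sketch of explicit persistence (the path and variable names are illustrative):

  import org.apache.spark.storage.StorageLevel

  val logs   = sc.textFile("hdfs://.../logs")
  val errors = logs.filter(_.contains("ERROR"))

  errors.cache()                                    // shorthand for persist(StorageLevel.MEMORY_ONLY)
  // errors.persist(StorageLevel.MEMORY_AND_DISK)   // spill to disk if the data does not fit in RAM
  // errors.persist(StorageLevel.MEMORY_ONLY_2)     // replicate each partition on a second node

  errors.count()   // first action materializes the RDD and caches it
  errors.count()   // later actions reuse the in-memory copy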
Directed Acyclic Graph (DAG)
• A DAG is an important feature for real-time data processing platforms such as Spark, Storm, and Tez, and helps them offer new capabilities for building highly interactive, real-time computing systems to power real-time BI, predictive analytics, real-time marketing, and other critical systems.
• The DAG scheduler is the scheduling layer of Apache Spark that implements stage-oriented scheduling: after an RDD action has been called, it becomes a job that is then transformed into a set of stages submitted as task sets for execution.
• In general, the DAG scheduler does three things in Spark:
  (a) computes an execution DAG, i.e. a DAG of stages, for a job;
  (b) determines the preferred locations to run each task on; and
  (c) handles failures due to lost shuffle output files.
How DAGs Work
• RDDs support two types of transformations: narrow and wide.
• When an action is encountered, Spark creates a DAG and submits it to the DAG scheduler.
• The DAG scheduler splits the graph into multiple stages
  • Stages are created based on the transformations
  • Narrow transformations are grouped together into a single stage
• The DAG scheduler then submits the stages to the task scheduler, as sketched below.
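A small sketch (not from the slides) of how a job splits into stages: the narrow transformations (flatMap, map) pipeline within one stage, while the wide reduceByKey introduces a shuffle and hence a stage boundary. toDebugString prints the lineage the DAG scheduler works from; the input path is illustrative.

  val counts = sc.textFile("hdfs://.../docs")
    .flatMap(_.split(" "))          // narrow: no shuffle, same stage
    .map(word => (word, 1))         // narrow: still the same stage
    .reduceByKey(_ + _)             // wide: shuffle => new stage

  println(counts.toDebugString)     // lineage; indentation marks shuffle/stage boundaries
  counts.collect()                  // action: the DAG is built and handed to the DAG scheduler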
Spark EcoSystem
• Spark is an integrated stack of tools responsible for scheduling, distributing, and monitoring applications consisting of many computational tasks across many worker machines, i.e. a computing cluster.
• Spark is written primarily in Scala, but includes code in Python, Java, R, and other languages.
• Spark comes with a set of integrated tools that reduce learning time and deliver higher user productivity; the Spark ecosystem includes the Mesos resource manager and other tools.
• Spark has largely overtaken Hadoop MapReduce because of the benefits it provides in terms of faster execution of iterative processing algorithms.
SPARK Applications
• Real time log data monitoring
• Massive Natural language processing
• Large scale online recommendation systems
• …….
Spark vs Hadoop
Feature              | Hadoop                                                                   | Spark
Purpose              | Resilient, cost-effective storage and processing of large data sets     | Fast, general-purpose engine for large-scale data processing
Core component       | Hadoop Distributed File System (HDFS)                                    | Spark Core, the in-memory processing engine
Storage              | HDFS manages massive data collections across multiple nodes within a cluster of commodity servers | Spark does not do distributed storage; it operates on distributed data collections
Fault tolerance      | Hadoop uses replication to achieve fault tolerance                       | Spark uses RDDs for fault tolerance, which minimizes network I/O
Nature of processing | With MapReduce, batch processing of data in parallel                     | Batch as well as stream processing
Sweet spot           | Batch processing                                                         | Iterative and interactive processing jobs that can fit in memory
Processing speed     | MapReduce is slow                                                        | Spark can be up to 10x faster than MapReduce for batch processing and up to 100x faster for stream processing
Spark vs Hadoop
Feature              | Hadoop                                                                   | Spark
Security             | More secure                                                              | Less secure
Failure recovery     | Hadoop can recover from system faults or failures since data is written to disk after every operation | With Spark, data objects are stored in RDDs; these can be reconstructed after faults or failures
Analytics tools      | Separate engines                                                         | Built-in MLlib and GraphX libraries
Compatibility        | Primary storage model is HDFS                                            | Compatible with HDFS and other storage formats
Language support     | Java                                                                     | Scala is the native language; APIs for Python, R, Java, and others
Driving organization | Yahoo                                                                    | AMPLab at UC Berkeley
Technology owners    | Apache, open-source, free                                                | Open-source, free
Key distributors     | Cloudera, Hortonworks, MapR                                              | Databricks, AMPLab
Cost of system       | Medium to high                                                           | Medium to high
Project Goals
• Extend the MapReduce model to better support two
common classes of analytics apps:
• Iterative algorithms (machine learning, graphs)
• Interactive data mining
• Enhance programmability:
• Integrate into Scala programming language
• Allow interactive use from Scala interpreter
Motivation
• Most current cluster programming models are based on acyclic data flow from stable storage to stable storage (e.g., MapReduce: Input → Map → Reduce → Output).
• Benefits of this data flow model: the runtime can decide where to run tasks and can automatically recover from failures.
Example: Log Mining
Load error messages from a log into memory, then interactively search for various patterns.

  lines = spark.textFile("hdfs://...")                  // base RDD
  errors = lines.filter(_.startsWith("ERROR"))          // transformed RDD
  messages = errors.map(_.split('\t')(2))
  cachedMsgs = messages.cache()                         // cached on the workers

  cachedMsgs.filter(_.contains("foo")).count            // action: driver ships tasks to workers
  cachedMsgs.filter(_.contains("bar")).count
  ...

Result: full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data)
Result: scaled to 1 TB of data in 5-7 sec (vs 170 sec for on-disk data)
RDD Fault Tolerance
RDDs maintain lineage information that can be used to reconstruct lost partitions.
Ex:
  messages = textFile(...).filter(_.startsWith("ERROR"))
                          .map(_.split('\t')(2))

Lineage: HDFS File --filter--> Filtered RDD --map--> Mapped RDD
Example: Logistic Regression
  val data = spark.textFile(...).map(readPoint).cache()

  var w = Vector.random(D)

  for (i <- 1 to ITERATIONS) {
    val gradient = data.map(p =>
      (1 / (1 + exp(-p.y * (w dot p.x))) - 1) * p.y * p.x
    ).reduce(_ + _)
    w -= gradient
  }

  println("Final w: " + w)
Logistic Regression Performance
• Hadoop: 127 s per iteration
• Spark: 174 s for the first iteration, 6 s for further iterations
Spark Applications
• In-memory data mining on Hive data (Conviva)
• Predictive analytics (Quantifind)
• City traffic prediction (Mobile Millennium)
• Twitter spam classification (Monarch)
• Collaborative filtering via matrix factorization
Conviva GeoReport
• Query time: Hive ≈ 20 hours; Spark ≈ 0.5 hours

• Aggregations on many keys w/ same WHERE clause

• 40× gain comes from:
  • Not re-reading unused columns or filtered records
  • Avoiding repeated decompression
  • In-memory storage of objects
Frameworks Built on Spark
• Pregel on Spark (Bagel)
  • Google's message-passing model for graph computation
  • 200 lines of code
• Hive on Spark (Shark)
  • 3000 lines of code
  • Compatible with Apache Hive
  • ML operators in Scala
Implementation

• Runs on Apache Mesos to share resources with Hadoop & other apps
• Can read from any Hadoop input source (e.g. HDFS)
• No changes to the Scala compiler

Spark Scheduler
• Dryad-like DAGs
• Pipelines functions within a stage
• Cache-aware work reuse & locality
• Partitioning-aware to avoid shuffles
(Original figure: a job DAG split into stages 1-3 by groupBy, map, union, and join operators over cached data partitions.)
Interactive Spark
Modified Scala interpreter to allow Spark to be used
interactively from the command line
Required two changes:
• Modified wrapper code generation so that each line typed
has references to objects for its dependencies
• Distribute generated classes over the network
Related Work
• DryadLINQ, FlumeJava
  • Similar "distributed collection" API, but cannot reuse datasets efficiently across queries
• Relational databases
  • Lineage/provenance, logical logging, materialized views
• GraphLab, Piccolo, BigTable, RAMCloud
  • Fine-grained writes similar to distributed shared memory
• Iterative MapReduce (e.g. Twister, HaLoop)
  • Implicit data sharing for a fixed computation pattern
• Caching systems (e.g. Nectar)
  • Store data in files, no explicit control over what is cached
Behavior with Not Enough RAM
• Iteration time vs. fraction of the working set in memory:
  • Cache disabled: 68.8 s
  • 25% cached: 58.1 s
  • 50% cached: 40.7 s
  • 75% cached: 29.7 s
  • Fully cached: 11.5 s
Spark Operations
• Transformations (define a new RDD): map, filter, sample, groupByKey, reduceByKey, sortByKey, flatMap, union, join, cogroup, cross, mapValues
• Actions (return a result to the driver program): collect, reduce, count, save, lookupKey
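A brief sketch chaining a few of these operations (the input path and field index are assumptions):

  val hits = sc.textFile("hdfs://.../access.log")
    .map(line => (line.split(" ")(0), 1))   // key by the first field (e.g. client IP)
    .reduceByKey(_ + _)                      // transformation: total hits per key
    .sortByKey()                             // transformation: order results by key

  val result = hits.collect()                // action: bring the results back to the driver
  println(result.take(10).mkString("\n"))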
Case study: Conviva, Inc.
• Real-time monitoring of online video metadata
  • HBO, ESPN, ABC, SyFy, …
• Two processing stacks:
  • Custom-built distributed stream processing system
    • 1000s of complex metrics on millions of video sessions
    • Requires many dozens of nodes for processing
  • Hadoop backend for offline analysis
    • Generating daily and monthly reports
    • Similar computation as the streaming system
Case study: XYZ, Inc.
• Any company that wants to process live streaming data has this problem
• Maintaining two processing stacks means:
  • Twice the effort to implement any new function
  • Twice the number of bugs to solve
  • Twice the headache
• Custom-built distributed stream processing system
  • 1000s of complex metrics on millions of video sessions
  • Requires many dozens of nodes for processing
• Hadoop backend for offline analysis
  • Generating daily and monthly reports
  • Similar computation as the streaming system
Requirements
• Scalable to large clusters
• Second-scale latencies
• Simple programming model
• Integrated with batch & interactive processing
Stateful Stream Processing
• Traditional streaming systems have an event-driven, record-at-a-time processing model
  • Each node has mutable state
  • For each input record, the node updates its state & sends out new records
• State is lost if a node dies!
• Making stateful stream processing fault-tolerant is challenging
Existing Streaming Systems
• Storm
  • Replays a record if it is not processed by a node
  • Processes each record at least once
  • May update mutable state twice!
  • Mutable state can be lost due to failure!

• Trident – uses transactions to update state
  • Processes each record exactly once
  • Per-state transaction updates are slow
Requirements
• Scalable to large clusters
• Second-scale latencies
• Simple programming model
• Integrated with batch & interactive processing
• Efficient fault-tolerance in stateful computations
Spark Streaming

Discretized Stream Processing
Run a streaming computation as a series of very small, deterministic batch jobs:
• Chop up the live data stream into batches of X seconds
• Spark treats each batch of data as an RDD and processes it using RDD operations
• Finally, the processed results of the RDD operations are returned in batches
• Batch sizes can be as low as ½ second, with end-to-end latency of about 1 second
• Potential for combining batch processing and streaming processing in the same system
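A minimal sketch of setting up discretized stream processing with the StreamingContext API used in current Spark releases (the slides use an older constructor); the host and port are placeholders:

  import org.apache.spark.SparkConf
  import org.apache.spark.streaming.{Seconds, StreamingContext}

  val conf = new SparkConf().setAppName("DStreamSketch").setMaster("local[2]")
  val ssc  = new StreamingContext(conf, Seconds(1))     // chop the live stream into 1-second batches

  val lines  = ssc.socketTextStream("localhost", 9999)  // each batch of lines becomes an RDD
  val errors = lines.filter(_.contains("ERROR"))        // RDD-style operation applied to every batch
  errors.print()                                        // output operation, runs once per batch

  ssc.start()                                           // start receiving and processing
  ssc.awaitTermination()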
Example 1 – Get hashtags from Twitter
  val tweets = ssc.twitterStream(<Twitter username>, <Twitter password>)

• tweets is a DStream: a sequence of RDDs representing a stream of data, fed by the Twitter Streaming API
• Each batch (batch @ t, batch @ t+1, …) is stored in memory as an RDD (immutable, distributed)
Example 1 – Get hashtags from Twitter
  val tweets = ssc.twitterStream(<Twitter username>, <Twitter password>)
  val hashTags = tweets.flatMap(status => getTags(status))

• flatMap is a DStream transformation: it modifies the data in one DStream to create another DStream
• New RDDs (e.g. [#cat, #dog, …]) are created for every batch of the hashTags DStream
Example 1 – Get hashtags from Twitter
  val tweets = ssc.twitterStream(<Twitter username>, <Twitter password>)
  val hashTags = tweets.flatMap(status => getTags(status))
  hashTags.saveAsHadoopFiles("hdfs://...")

• saveAsHadoopFiles is an output operation: it pushes data to external storage
• For every batch, flatMap produces the hashTags RDD, which is then saved to HDFS
Java Example
Scala
  val tweets = ssc.twitterStream(<Twitter username>, <Twitter password>)
  val hashTags = tweets.flatMap(status => getTags(status))
  hashTags.saveAsHadoopFiles("hdfs://...")

Java
  JavaDStream<Status> tweets = ssc.twitterStream(<Twitter username>, <Twitter password>);
  JavaDStream<String> hashTags = tweets.flatMap(new Function<...>() { ... });  // Function object defines the transformation
  hashTags.saveAsHadoopFiles("hdfs://...");
Fault-tolerance
• RDDs remember the sequence of operations that created them from the original fault-tolerant input data
• Batches of input data are replicated in the memory of multiple worker nodes, and are therefore fault-tolerant
• Data lost due to worker failure (e.g. lost partitions of the hashTags RDD) can be recomputed from the replicated input data on other workers
Key concepts
• DStream – sequence of RDDs representing a stream of data
  • Sources: Twitter, HDFS, Kafka, Flume, ZeroMQ, Akka Actor, TCP sockets

• Transformations – modify data from one DStream to another
  • Standard RDD operations – map, countByValue, reduce, join, …
  • Stateful operations – window, countByValueAndWindow, …

• Output operations – send data to an external entity
  • saveAsHadoopFiles – saves to HDFS
  • foreach – do anything with each batch of results (see the sketch below)
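A sketch of the catch-all output operation (named foreachRDD in current releases), assuming tagCounts is a DStream of (tag, count) pairs as in the examples that follow:

  tagCounts.foreachRDD { rdd =>
    // this block runs on the driver once per batch; the RDD operations run on the cluster
    val top5 = rdd.sortBy(_._2, ascending = false).take(5)
    println("Top hashtags this batch: " + top5.mkString(", "))
  }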
Example 2 – Count the hashtags
  val tweets = ssc.twitterStream(<Twitter username>, <Twitter password>)
  val hashTags = tweets.flatMap(status => getTags(status))
  val tagCounts = hashTags.countByValue()

• For each batch, countByValue is executed as a map followed by a reduceByKey over the hashTags RDD
• tagCounts contains results such as [(#cat, 10), (#dog, 25), ...]
Example 3 – Count the hashtags over the last 10 mins
  val tweets = ssc.twitterStream(<Twitter username>, <Twitter password>)
  val hashTags = tweets.flatMap(status => getTags(status))
  val tagCounts = hashTags.window(Minutes(10), Seconds(1)).countByValue()

• window is a sliding window operation: Minutes(10) is the window length, Seconds(1) is the sliding interval
Example 3 – Counting the hashtags over the last 10 mins
  val tagCounts = hashTags.window(Minutes(10), Seconds(1)).countByValue()

• At each slide (t-1, t, t+1, …), countByValue is computed over all the hashTags data in the current window
Smart window-based countByValue
  val tagCounts = hashTags.countByValueAndWindow(Minutes(10), Seconds(1))

• Instead of recomputing over the whole window, add the counts from the new batch entering the window and subtract the counts from the batch leaving the window
Smart window-based reduce
• The technique of incrementally computing counts generalizes to many reduce operations
• Need a function to "inverse reduce" ("subtract" for counting)

• Could have implemented counting as:
  hashTags.reduceByKeyAndWindow(_ + _, _ - _, Minutes(1), …)
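A fuller sketch of this incremental pattern, assuming hashTags is a DStream[String] as in the earlier examples; the inverse-reduce form requires a checkpoint directory (the path is a placeholder):

  ssc.checkpoint("hdfs://.../checkpoints")   // needed for window operations with an inverse function

  val tagCounts = hashTags
    .map(tag => (tag, 1))                    // key-value pairs: reduceByKeyAndWindow needs a pair DStream
    .reduceByKeyAndWindow(
      _ + _,                                 // add counts from batches entering the window
      _ - _,                                 // "inverse reduce": subtract counts from batches leaving it
      Minutes(10),                           // window length
      Seconds(1)                             // slide interval
    )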
Demo
Fault-tolerant Stateful Processing
• All intermediate data are RDDs, hence they can be recomputed if lost (e.g. the hashTags and tagCounts RDDs for any batch)
Fault-tolerant Stateful Processing
• State data is not lost even if a worker node dies
  • Does not change the value of your result
• Exactly-once semantics for all transformations
  • No double counting!
Other Interesting Operations
• Maintaining arbitrary state, tracking sessions
  - Maintain per-user mood as state, and update it with his/her tweets
    tweets.updateStateByKey(tweet => updateMood(tweet))

• Do arbitrary Spark RDD computation within a DStream
  - Join incoming tweets with a spam file to filter out bad tweets
    tweets.transform(tweetsRDD => {
      tweetsRDD.join(spamHDFSFile).filter(...)
    })
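A sketch of updateStateByKey with its actual signature (the per-user mood call above is pseudocode); here the state is a running count per hashtag, and stateful operations require a checkpoint directory (path is a placeholder):

  ssc.checkpoint("hdfs://.../checkpoints")

  val tagPairs = hashTags.map(tag => (tag, 1))
  val runningCounts = tagPairs.updateStateByKey[Int] {
    (newValues: Seq[Int], state: Option[Int]) =>
      Some(state.getOrElse(0) + newValues.sum)   // fold this batch's values into the saved state
  }
  runningCounts.print()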
Performance
• Can process 6 GB/sec (60M records/sec) of data on 100 nodes at sub-second latency
  - Tested with 100 streams of data on 100 EC2 instances with 4 cores each
Comparison with Storm and S4
• Higher throughput than Storm
  • Spark Streaming: 670k records/second/node
  • Storm: 115k records/second/node
  • Apache S4: 7.5k records/second/node
Real Applications: Conviva
• Real-time monitoring of video metadata
  • Achieved 1-2 second latency
  • Millions of video sessions processed
  • Scales linearly with cluster size
Real Applications: Mobile Millennium Project
• Traffic transit time estimation using online machine learning on GPS observations
  • Markov chain Monte Carlo simulations on GPS observations
  • Very CPU intensive, requires dozens of machines for useful computation
  • Scales linearly with cluster size
Vision - one stack to rule them all
• A single Spark-based stack covering stream processing (Spark Streaming), batch processing (Spark), and ad-hoc queries (Shark)
Spark program vs Spark Streaming program
Spark Streaming program on a Twitter stream:
  val tweets = ssc.twitterStream(<Twitter username>, <Twitter password>)
  val hashTags = tweets.flatMap(status => getTags(status))
  hashTags.saveAsHadoopFiles("hdfs://...")

Spark program on a Twitter log file:
  val tweets = sc.hadoopFile("hdfs://...")
  val hashTags = tweets.flatMap(status => getTags(status))
  hashTags.saveAsHadoopFile("hdfs://...")
Vision - one stack to rule them all
• Explore data interactively using the Spark Shell / PySpark to identify problems:

  $ ./spark-shell
  scala> val file = sc.hadoopFile("smallLogs")
  scala> val filtered = file.filter(_.contains("ERROR"))
  scala> val mapped = filtered.map(...)
  ...

• Use the same code in Spark stand-alone programs to identify problems in production logs:

  object ProcessProductionData {
    def main(args: Array[String]) {
      val sc = new SparkContext(...)
      val file = sc.hadoopFile("productionLogs")
      val filtered = file.filter(_.contains("ERROR"))
      val mapped = filtered.map(...)
      ...
    }
  }

• Use similar code in Spark Streaming to identify problems in live log streams:

  object ProcessLiveStream {
    def main(args: Array[String]) {
      val sc = new StreamingContext(...)
      val stream = sc.kafkaStream(...)
      val filtered = stream.filter(_.contains("ERROR"))
      val mapped = filtered.map(...)
      ...
    }
  }
Conclusion
• Spark provides a simple, efficient, and powerful
programming model for a wide range of apps
• Download open source release:
• www.spark-project.org
