Intro To Apache Spark: Credits to the CS 347 Stanford course, 2015, Reynold Xin, Databricks (Spark provider)
Agenda
1. MapReduce
2. Introduction to Spark
3. Resilient Distributed Datasets (RDDs)
4. DataFrames
5. Internals
Traditional Network Programming

Data-Parallel Models
MapReduce Programming Model
MapReduce turned out to be an incredibly useful and widely deployed framework for processing large amounts of data. However, its design forces programs to comply with its computation model: a map phase that emits intermediate key-value pairs, a shuffle/sort that groups those pairs by key, and a reduce phase that aggregates each group.
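To make the model concrete, here is a minimal single-machine sketch of the map → shuffle → reduce flow for a hypothetical word-count job (the function names are illustrative, not part of any MapReduce framework):

# Minimal sketch of the MapReduce computation model: map -> shuffle/group -> reduce
from collections import defaultdict

def map_phase(document):
    for word in document.split():
        yield (word, 1)                  # emit intermediate key-value pairs

def shuffle(pairs):
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)        # group all intermediate values by key
    return groups

def reduce_phase(key, values):
    return (key, sum(values))            # aggregate the values for one key

docs = ["spark makes mapreduce easier", "mapreduce inspired spark"]
intermediate = [kv for d in docs for kv in map_phase(d)]
result = [reduce_phase(k, vs) for k, vs in shuffle(intermediate).items()]
print(result)                            # e.g. [('spark', 2), ('mapreduce', 2), ...]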
MapReduce drawbacks
• Many applications had to run MapReduce over multiple passes to process their data.
• All intermediate data had to be stored back in the file system (GFS at Google, HDFS elsewhere), which tended to be slow since stored data was not just written to disks but also replicated.
• The next MapReduce phase could not start until the previous MapReduce job had completed fully.
• MapReduce was also designed to read its data from a distributed file system (GFS/HDFS). In many cases, however, data resides within an SQL database or is streaming in (e.g., activity logs, remote monitoring).
MapReduce Programmability
A Brief History: MapReduce
circa 1979 – Stanford, MIT, CMU, etc.
set/list operations in LISP, Prolog, etc., for parallel processing
www-formal.stanford.edu/jmc/history/lisp/lisp.htm
Agenda
1. MapReduce Review
2. Introduction to Spark and RDDs
3. Generality of RDDs (e.g. streaming, ML)
4. DataFrames
5. Internals (time permitting)
Spark: A Brief History
Spark Summary
• highly flexible and general-purpose way of dealing with big data processing needs
Programmability

Two ways to create RDDs:
• parallelized collections – take an existing single-node collection and parallelize it
• Hadoop datasets – files on HDFS or other compatible storage
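A minimal PySpark sketch of both creation paths (the HDFS path is a hypothetical example):

data = [1, 2, 3, 4, 5]
distData = sc.parallelize(data)                            # parallelized collection: distribute a local list

logs = sc.textFile("hdfs://namenode:8020/logs/app.log")    # Hadoop dataset: a file on HDFS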
RDD: Core abstractions

An application that uses Spark identifies data sources and the operations on that data. The main application, called the driver program, is linked with the Spark API, which creates a SparkContext (the heart of the Spark system, which coordinates all processing activity). This SparkContext in the driver program connects to a Spark cluster manager. The cluster manager is responsible for allocating worker nodes, launching executors on them, and keeping track of their status.

Each worker node runs one or more executors. An executor is a process that runs an instance of a Java Virtual Machine (JVM).
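A minimal sketch of how a driver program obtains a SparkContext; the application name and master URL below are assumptions, not values from the slides:

from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("log-mining") \
                  .setMaster("spark://cluster-manager:7077")   # hypothetical cluster manager URL
sc = SparkContext(conf=conf)   # the driver's handle to the cluster; coordinates all processing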
RDD

3. An RDD can be the output of a transformation function. This allows one task to create data that can be consumed by another task and is the way tasks pass data around. For example, one task can filter out unwanted data and generate a set of key-value pairs, writing them to an RDD.
RDD properties
• They are immutable. Their contents cannot be changed. A task can read from an RDD and create a new RDD, but it cannot modify an RDD. The framework magically garbage collects unneeded intermediate RDDs.
• They are typed. An RDD has some kind of structure within it, such as a key-value pair or a set of fields. Tasks need to be able to parse RDD streams.
• They are ordered. An RDD contains a set of elements that can be sorted. In the case of key-value lists, the elements are sorted by key. The sorting function can be defined by the programmer, but sorting enables one to implement things like Reduce operations.
• They are partitioned. Parts of an RDD may be sent to different servers. The default partitioning function sends a row of data to the server corresponding to hash(key) mod servercount.
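A small illustrative sketch of immutability and hash partitioning (the data and partition count are made up):

pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3)])
doubled = pairs.mapValues(lambda v: v * 2)     # returns a new RDD; `pairs` itself is never modified

# Default hash partitioning: each record goes to partition hash(key) mod numPartitions.
repartitioned = doubled.partitionBy(4)
print(repartitioned.getNumPartitions())        # 4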
RDD operations
Spark allows two types of operations on RDDs: transformations and actions.

Transformations create a new RDD from an existing one (e.g., map, filter). They are lazy: Spark records them but computes nothing until an action requires a result.

Actions are operations that evaluate and return a new value. When an action is requested on an RDD object, the necessary transformations are computed and the result is returned. Actions tend to be the things that generate the final output needed by a program. Example actions are reduce, grab samples, and write to file.
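A short sketch showing the transformation/action split and lazy evaluation:

nums = sc.parallelize(range(1, 6))

squares = nums.map(lambda x: x * x)           # transformation: lazy, nothing executes yet
evens = squares.filter(lambda x: x % 2 == 0)  # transformation: still lazy

total = evens.reduce(lambda a, b: a + b)      # action: triggers computation, returns 20
few = evens.take(1)                           # action: grab a sample of the results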
Spark Essentials: Transformations

transformation – description
• groupByKey([numTasks]) – when called on a dataset of (K, V) pairs, returns a dataset of (K, Seq[V]) pairs
• reduceByKey(func, [numTasks]) – when called on a dataset of (K, V) pairs, returns a dataset of (K, V) pairs where the values for each key are aggregated using the given reduce function
• sortByKey([ascending], [numTasks]) – when called on a dataset of (K, V) pairs where K implements Ordered, returns a dataset of (K, V) pairs sorted by keys in ascending or descending order, as specified in the boolean ascending argument
• join(otherDataset, [numTasks]) – when called on datasets of type (K, V) and (K, W), returns a dataset of (K, (V, W)) pairs with all pairs of elements for each key
• cogroup(otherDataset, [numTasks]) – when called on datasets of type (K, V) and (K, W), returns a dataset of (K, Seq[V], Seq[W]) tuples – also called groupWith
• cartesian(otherDataset) – when called on datasets of types T and U, returns a dataset of (T, U) pairs (all pairs of elements)
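A quick sketch of a few of these pair-RDD transformations (the data is made up):

pairs = sc.parallelize([("spark", 1), ("hadoop", 1), ("spark", 1)])

counts = pairs.reduceByKey(lambda a, b: a + b)   # ("spark", 2), ("hadoop", 1)
grouped = pairs.groupByKey()                     # ("spark", [1, 1]), ("hadoop", [1])
ordered = counts.sortByKey()                     # sorted by key, ascending by default

tags = sc.parallelize([("spark", "fast"), ("hadoop", "batch")])
joined = counts.join(tags)                       # ("spark", (2, "fast")), ("hadoop", (1, "batch"))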
Spark Essentials: Actions

action – description
• reduce(func) – aggregate the elements of the dataset using a function func (which takes two arguments and returns one); func should also be commutative and associative so that it can be computed correctly in parallel
• collect() – return all the elements of the dataset as an array at the driver program; usually useful after a filter or other operation that returns a sufficiently small subset of the data
• count() – return the number of elements in the dataset
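And the corresponding actions in a small sketch (the data is made up):

words = sc.parallelize(["error", "warn", "error", "info"])

words.count()                                       # 4
words.distinct().collect()                          # small list pulled back to the driver
words.map(lambda w: 1).reduce(lambda a, b: a + b)   # 4, computed in parallel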
Spark does not care how data is stored. The appropriate RDD
connector determines how to read data.
Fault tolerance
RDDs track their lineage: every RDD knows which transformation and parent data were needed to create it. If any RDD is lost (e.g., a task that created one died), the driver can have the transformation that generated it re-run to recreate it.
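As a rough illustration, the lineage that recovery relies on can be inspected on any RDD (the HDFS path here is hypothetical):

errors = sc.textFile("hdfs://namenode:8020/logs/app.log") \
           .filter(lambda s: s.startswith("ERROR"))

# Prints the chain of parent RDDs Spark would replay if a partition were lost.
print(errors.toDebugString())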
Working With RDDs

[Diagram: an RDD is created from a data source; transformations produce new RDDs; an action finally returns a value to the driver.]

textFile = sc.textFile("SomeFile.txt")

linesWithSpark.count()    # 74
linesWithSpark.first()    # "# Apache Spark"
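The slide omits the intermediate transformation; a plausible complete version (the filter predicate is an assumption) looks like this:

textFile = sc.textFile("SomeFile.txt")
linesWithSpark = textFile.filter(lambda line: "Spark" in line)   # transformation: new RDD

linesWithSpark.count()    # action: number of matching lines
linesWithSpark.first()    # action: the first matching line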
Example: Log Mining

Load error messages from a log into memory, then interactively search for various patterns.

lines = spark.textFile("hdfs://...")                      # Base RDD
errors = lines.filter(lambda s: s.startswith("ERROR"))    # Transformed RDD
messages = errors.map(lambda s: s.split("\t")[2])
messages.cache()

messages.filter(lambda s: "mysql" in s).count()
messages.filter(lambda s: "php" in s).count()

[Diagram sequence: the driver sends tasks to the workers; each worker reads its HDFS block (Block 1, 2, 3), processes it, caches its partition of messages (Cache 1, 2, 3), and returns results to the driver. The first count() reads from HDFS and populates the caches; subsequent queries, such as the "php" count, are processed entirely from the workers' caches.]
Performance

Java & Scala are faster due to static typing, but Python is often fine.

JavaRDD<String> lines = sc.textFile(...);
lines.filter(new Function<String, Boolean>() {
  Boolean call(String s) {
    return s.contains("error");
  }
}).count();
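For contrast, the same filter-and-count is a near one-liner in PySpark (a sketch; the path is a hypothetical stand-in for the elided one in the slide):

lines = sc.textFile("hdfs://namenode:8020/logs/app.log")   # hypothetical path
lines.filter(lambda s: "error" in s).count()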
Expressive API

map, filter, groupBy, sort, union, join, leftOuterJoin, rightOuterJoin, reduce, count, fold, reduceByKey, groupByKey, cogroup, cross, zip, sample, take, first, partitionBy, mapWith, pipe, save, ...
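A brief sketch touching a few of the less common operations (the data is made up):

nums = sc.parallelize(range(10))
more = sc.parallelize(range(5, 15))

nums.union(more).distinct().count()           # 15 distinct values across both RDDs
nums.sample(False, 0.3).collect()             # ~30% random sample, without replacement
nums.zip(nums.map(lambda x: x * x)).take(3)   # [(0, 0), (1, 1), (2, 4)]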
Fault Recovery

[Chart: iteration time in seconds across 10 iterations of a job; most iterations take about 56–59 s, while the iteration with the failure takes 81 s as lost partitions are recomputed, after which times return to normal.]
Agenda
1. MapReduce Review
2. Introduction to Spark and RDDs
3. Generality of RDDs (e.g. streaming, ML)
4. DataFrames
5. Internals (time permitting)
Generality of RDDs

[Diagram: Spark Streaming (real-time), Spark SQL, MLlib (machine learning), and GraphX (graph) are all expressed in terms of RDDs, Transformations, and Actions on top of the Spark core.]
Spark Streaming: Motivation

[Diagram: multiple incoming streams (stream 1, stream 2, …) arriving continuously.]
Programming Interface

Simple functional API:

views = readStream("http:...", "1s")
ones = views.map(ev => (ev.url, 1))
counts = ones.runningReduce(_ + _)

[Diagram: the views, ones, and counts streams materialized as a series of RDDs at each batch interval (t = 1, t = 2, …), connected by map and reduce steps.]
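The snippet above is D-Streams-style pseudocode; a rough PySpark Streaming analogue (the socket source, port, and state-update logic are assumptions) might look like this:

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="url-counts")
ssc = StreamingContext(sc, 1)                    # 1-second batch interval
ssc.checkpoint("/tmp/streaming-checkpoint")      # required for stateful operations

views = ssc.socketTextStream("localhost", 9999)  # hypothetical source: one URL per line
ones = views.map(lambda url: (url, 1))

# Running count per URL across batches (analogue of runningReduce).
counts = ones.updateStateByKey(lambda new, running: sum(new) + (running or 0))
counts.pprint()

ssc.start()
ssc.awaitTermination()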
• Task scheduling
• Monitoring/instrumentation
Agenda
1. MapReduce Review
2. Introduction to Spark and RDDs
3. Generality of RDDs (e.g. streaming, ML)
4. DataFrames
5. Internals (time permitting)
From MapReduce to Spark
DataFrames in Spark

[Chart: runtime comparison of the DataFrame API (SQL, Python, Scala) and the RDD API (Python, Scala); x-axis 0–10.]
[Diagram: a query plan with filter, join, and scan(users) operators; this join is expensive when it runs before the filter.]

Data sources: built-in and external – JDBC, { JSON }, and more …
More Than Naïve Scans

[Diagram: the logical plan (a join of scan(users) and scan(events) followed by a filter) is rewritten into an optimized plan that pushes the filter below the join, and, with intelligent data sources, into an optimized plan that pushes the filtering into the scans themselves.]
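A hedged sketch of such a query with the DataFrame API (the file paths, JSON format, and column names are assumptions); explain() shows the plan the optimizer actually chose:

from pyspark.sql import SQLContext

sqlCtx = SQLContext(sc)
users = sqlCtx.read.json("users.json")      # hypothetical inputs
events = sqlCtx.read.json("events.json")

# The optimizer is free to push this filter below the join, or into the scans.
joined = events.join(users, "user_id").filter(users["age"] > 21)
joined.explain()                            # prints the optimized physical plan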
df = sqlCtx.load("/path/to/data")
model = pipeline.fit(df)

[Diagram: machine learning pipelines (here ending in an lr stage) and R-style operations such as group_by(...) and summarize(...) all expose and consume DataFrames.]
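A minimal sketch of the pipeline implied by pipeline.fit(df), using spark.ml; the stage choices and the column names ("text", "label") are assumptions:

from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, HashingTF
from pyspark.ml.classification import LogisticRegression

tokenizer = Tokenizer(inputCol="text", outputCol="words")
hashingTF = HashingTF(inputCol="words", outputCol="features")
lr = LogisticRegression(maxIter=10)

pipeline = Pipeline(stages=[tokenizer, hashingTF, lr])
model = pipeline.fit(df)            # df: a DataFrame assumed to have "text" and "label" columns
predictions = model.transform(df)   # adds prediction columns to the DataFrame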
[Diagram: Spark Streaming, Spark SQL, MLlib, and GraphX built on Spark Core, with a Data Sources layer ({JSON}, etc.) underneath.]

Goal: unified engine across data sources, workloads and environments
Agenda
1. MapReduce Review
2. Introduction to Spark and RDDs
3. Generality of RDDs (e.g. streaming, ML)
4. DataFrames
5. Internals
Spark Application

[Diagram: your program (JVM / Python) runs the Spark driver (the app master), which coordinates multiple Spark executors.]
Example: Filtered RDD

[Diagram: the "file" RDD is a HadoopRDD with path = hdfs://..., dependencies = none, partitioner = none; the "errors" RDD is a FilteredRDD with func = _.contains(…), shouldCache = true, and the HadoopRDD as its parent. The RDD graph (a DAG of tasks) is broken into Task1, Task2, ...]
Example: JoinedRDD

[Diagram: a joined RDD with partitioner = HashPartitioner(numTasks); because the partitioner is recorded, Spark will now know this data is already hash-partitioned and can avoid re-shuffling it.]
Dependency Types

[Diagram: "Narrow" (pipeline-able) dependencies, where each child partition depends on a small number of parent partitions, vs. "Wide" (shuffle) dependencies, where a child partition depends on many or all parent partitions.]

Scheduler roles:
• Build stages of tasks
• Submit them to the lower-level scheduler (e.g. YARN, Mesos, Standalone) as they become ready
• The lower-level scheduler places tasks based on data locality
• Resubmit failed stages if outputs are lost
Job Scheduler
• Captures the RDD dependency graph
• Pipelines functions into "stages"
• Cache-aware for data reuse & locality

[Diagram: a dependency graph of RDDs A–G with groupBy, map, and join operations, partitioned into stages (e.g. Stage 1 around the groupBy).]
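An illustrative sketch of how narrow and wide dependencies show up in a job (the data and partition count are arbitrary):

rdd = sc.parallelize(range(100), 4)

mapped = rdd.map(lambda x: (x % 10, x))          # narrow dependency: no shuffle
filtered = mapped.filter(lambda kv: kv[1] > 5)   # narrow: pipelined with map in the same stage
grouped = filtered.groupByKey()                  # wide dependency: shuffle, starts a new stage
grouped.count()                                  # action: runs the resulting two-stage job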
References

Paco Nathan, Intro to Apache Spark, ITAS Workshop, Databricks
Hands-on Tour of Apache Spark in 5 Minutes, Hortonworks
Running Spark Applications, Cloudera 5.5.x documentation
Sandy Ryza, Apache Spark Resource Management and YARN App Models, Cloudera Engineering Blog, May 30, 2014
http://spark.apache.org/index.html
http://spark.apache.org/docs/latest/index.html