Spark is a framework for large-scale data processing. It uses Resilient Distributed Datasets (RDDs) as its core abstraction, which allow data to be partitioned across clusters and operated on in parallel. RDDs can be created from external storage systems, streaming sources, or by transforming existing RDDs. Spark programs define a sequence of transformations and actions on RDDs to perform multi-step processing more efficiently than traditional MapReduce. Spark stores intermediate data in memory to improve performance over the disk-based approach of MapReduce.

Intro to Apache Spark

Credits to CS 347, Stanford course, 2015, Reynold Xin, Databricks (Spark provider)


Content

1. MapReduce
2. Introduction to Spark
3. Resilient Distributed Datasets (RDDs)
4. DataFrames
5. Internals

Traditional Network Programming

Message-passing between nodes (MPI, RPC, etc.)

Really hard to do at scale:

• How to split the problem across nodes?
  – Important to consider network and data locality
• How to deal with failures?
  – If a typical server fails every 3 years, a 10,000-node cluster sees about 10 faults/day!
Data-Parallel Models

Restrict the programming interface so that the system can do more automatically.

“Here’s an operation, run it on all of the data”
• I don’t care where it runs (you schedule that)
• In fact, feel free to run it twice on different nodes
MapReduce Programming Model
MapReduce turned out to be an incredibly useful and
widely-deployed framework for processing large
amounts of data. However, its design forces programs to
comply with its computation model, which is:

Map: create key-value pairs

Shuffle: combine common keys together and partition them to reduce workers

Reduce: process each unique key and all of its associated values
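To make the model concrete, here is a minimal word-count sketch of the Map -> Shuffle -> Reduce steps in plain Python (the function names and the single-process, in-memory shuffle are illustrative; a real MapReduce framework distributes these steps across workers):

from collections import defaultdict

def map_fn(line):                        # Map: emit (key, value) pairs
    return [(word, 1) for word in line.split()]

def shuffle(pairs):                      # Shuffle: group values by common key
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_fn(key, values):              # Reduce: process one key and all its values
    return key, sum(values)

lines = ["to be or not to be"]
pairs = [kv for line in lines for kv in map_fn(line)]
counts = dict(reduce_fn(k, vs) for k, vs in shuffle(pairs).items())
print(counts)                            # {'to': 2, 'be': 2, 'or': 1, 'not': 1}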
MapReduce drawbacks
Many applications had to run MapReduce over multiple passes to process their data.

All intermediate data had to be stored back in the file system (GFS at Google, HDFS elsewhere), which tended to be slow since stored data was not just written to disks but also replicated.

The next MapReduce phase could not start until the previous MapReduce job completed fully.

MapReduce was also designed to read its data from a distributed file system (GFS/HDFS). In many cases, however, data resides within an SQL database or is streaming in (e.g., activity logs, remote monitoring).
MapReduce Programmability

Most real applications require multiple MR steps:

• Google indexing pipeline: 21 steps
• Analytics queries (e.g. count clicks & top K): 2-5 steps
• Iterative algorithms (e.g. PageRank): 10s of steps

Multi-step jobs create spaghetti code:

• 21 MR steps -> 21 mapper and reducer classes
A Brief History: MapReduce
circa 1979 – Stanford, MIT, CMU, etc.
set/list operations in LISP, Prolog, etc., for parallel processing
www-formal.stanford.edu/jmc/history/lisp/lisp.htm

circa 2004 – Google


MapReduce: Simplified Data Processing on Large Clusters
Jeffrey Dean and Sanjay Ghemawat
research.google.com/archive/mapreduce.html

circa 2006 – Apache


Hadoop, originating from the Nutch Project
Doug Cutting
research.yahoo.com/files/cutting.pdf

circa 2008 – Yahoo


web scale search indexing
Hadoop Summit, HUG, etc.
developer.yahoo.com/hadoop/

circa 2009 – Amazon AWS


Elastic MapReduce
Hadoop modified for EC2/S3, plus support
for Hive, Pig, Cascading, etc.
aws.amazon.com/elasticmapreduce/
Problems with MapReduce

MapReduce use cases showed two major limitations:

1. Difficulty of programming directly in MR
2. Performance bottlenecks

In short, MR doesn’t compose well for large applications.

Therefore, people built high-level frameworks and specialized systems.
Specialized Systems

Agenda

1. MapReduce Review
2. Introduction to Spark and RDDs
3. Generality of RDDs (e.g. streaming,
ML)
4. DataFrames
5. Internals (time permitting)

Spark: A Brief History

Spark Summary
• highly flexible and general-purpose way of dealing with big data processing needs

• does not impose a rigid computation model, and supports a variety of input types

• deals with text files, graph data, database queries, and streaming sources, and is not confined to a two-stage processing model

• Programmers can develop arbitrarily-complex, multi-step data pipelines arranged in an arbitrary directed acyclic graph (DAG) pattern.
Spark Summary

• programming in Spark involves defining a sequence of transformations and actions

• Spark has support for map and reduce operations, so it can implement traditional MapReduce processing, but it also supports SQL queries, graph processing, and machine learning

• stores its intermediate results in memory, providing dramatically higher performance
Spark Ecosystem

Programmability

WordCount in 3 lines of Spark vs. WordCount in 50+ lines of Java MapReduce
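For reference, a three-line word count in PySpark along the lines the slide alludes to (the HDFS paths are illustrative):

text = sc.textFile("hdfs://.../input.txt")
counts = text.flatMap(lambda line: line.split()).map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)
counts.saveAsTextFile("hdfs://.../counts")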
Performance
Time to sort 100 TB:

2013 Record (Hadoop): 2100 machines, 72 minutes
2014 Record (Spark): 207 machines, 23 minutes

Spark also sorted 1 PB in 4 hours.
RDD: Core Abstraction
Write programs in terms of distributed datasets and operations on them.

Resilient Distributed Datasets:
• Collections of objects spread across a cluster, stored in RAM or on disk
• Built through parallel transformations
• Automatically rebuilt on failure

Operations:
• Transformations (e.g. map, filter, groupBy)
• Actions (e.g. count, collect, save)
RDD

Resilient Distributed Datasets are the primary abstraction in Spark: a fault-tolerant collection of elements that can be operated on in parallel.

Two types:
• parallelized collections: take an existing single-node collection and parallelize it
• Hadoop datasets: files on HDFS or other compatible storage
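A minimal PySpark sketch of both creation paths (the local master URL and file path are illustrative):

from pyspark import SparkContext

sc = SparkContext("local[*]", "rdd-intro")

# 1. Parallelized collection: distribute an existing single-node list
nums = sc.parallelize([1, 2, 3, 4, 5], numSlices=2)

# 2. Hadoop dataset: each line of the HDFS file becomes an element
lines = sc.textFile("hdfs://namenode:9000/logs/app.log")

print(nums.sum())      # actions trigger the actual computation
print(lines.count())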
RDD: Core Abstraction

An application that uses Spark identifies data sources and the operations on that data. The main application, called the driver program, is linked with the Spark API, which creates a SparkContext (the heart of the Spark system, which coordinates all processing activity). This SparkContext in the driver program connects to a Spark cluster manager. The cluster manager is responsible for allocating worker nodes, launching executors on them, and keeping track of their status.

• Each worker node runs one or more executors. An executor is a process that runs an instance of a Java Virtual Machine (JVM).
• When each executor is launched by the manager, it establishes a connection back to the driver program.
• The executor runs tasks on behalf of a specific SparkContext (application) and keeps related data in memory or disk storage.
• A task is a transformation or action. The executor remains running for the duration of the application. This provides a performance advantage over the MapReduce approach since new tasks can be started very quickly.
• The executor also maintains a cache, which stores frequently-used data in memory instead of having to store it to a disk-based file as the MapReduce framework does.
• The driver goes through the user’s program, which consists of actions and transformations on data, and converts it into a series of tasks. The driver then sends tasks to the executors that registered with it.
• A task is application code that runs in the executor on a Java Virtual Machine (JVM) and can be written in languages such as Scala, Java, Python, Clojure, and R. It is transmitted as a jar file to an executor, which then runs it.
RDD
Data in Spark is a collection of Resilient Distributed Datasets (RDDs). This is often a huge collection of data. Think of an individual RDD as a table in a database or a structured file.

Input data is organized into RDDs, which will often be partitioned across many computers. RDDs can be created in three ways:

1. They can be present as any file stored in HDFS or any other storage system supported in Hadoop. This includes Amazon S3 (a key-value server, similar in design to Dynamo), HBase (Hadoop’s version of Bigtable), and Cassandra (a NoSQL eventually-consistent database). This data is created by other services, such as event streams, text logs, or a database. For instance, the results of a specific query can be treated as an RDD. A list of files in a specific directory can also be an RDD.
RDD

2. RDDs can be streaming sources using the Spark Streaming extension. This could be a stream of events from remote sensors, for example. For fault tolerance, a sliding window is used, where the contents of the stream are buffered in memory for a predefined time interval.
RDD
3. An RDD can be the output of a transformation function. This allows one task to create data that can be consumed by another task, and is the way tasks pass data around.

For example, one task can filter out unwanted data and generate a set of key-value pairs, writing them to an RDD. This RDD will be cached in memory (overflowing to disk if needed) and will be read by a task that reads the output of the task that created the key-value data.
RDD properties
They are immutable. That means their contents cannot be changed. A task can read from an RDD and create a new RDD, but it cannot modify an RDD. The framework magically garbage collects unneeded intermediate RDDs.

They are typed. An RDD will have some kind of structure within it, such as a key-value pair or a set of fields. Tasks need to be able to parse RDD streams.

They are ordered. An RDD contains a set of elements that can be sorted. In the case of key-value lists, the elements will be sorted by a key. The sorting function can be defined by the programmer, but sorting enables one to implement things like Reduce operations.

They are partitioned. Parts of an RDD may be sent to different servers. The default partitioning function is to send a row of data to the server corresponding to hash(key) mod server_count.
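A small PySpark sketch of this partitioning behaviour (the partition count and hash function are illustrative):

pairs = sc.parallelize([("a", 1), ("b", 2), ("c", 3), ("a", 4)])

# Route each (key, value) row to partition hash(key) mod 4
partitioned = pairs.partitionBy(4, partitionFunc=lambda k: hash(k))

print(partitioned.getNumPartitions())   # 4
print(partitioned.glom().collect())     # elements grouped by partition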
RDD operations
Spark allows two types of operations on RDDs: transformations and actions.

Transformations read an RDD and return a new RDD. Example transformations are map, filter, groupByKey, and reduceByKey. Transformations are evaluated lazily, which means they are computed only when some task wants their data (the RDD that they generate). At that point, the driver schedules them for execution.

Actions are operations that evaluate and return a value. When an action is requested on an RDD object, the necessary transformations are computed and the result is returned. Actions tend to be the things that generate the final output needed by a program. Example actions are reduce, grab samples, and write to file.
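A short PySpark sketch of this lazy evaluation (the file path is illustrative): the transformations only describe the computation, and nothing runs until the action at the end.

lines = sc.textFile("hdfs://.../app.log")              # transformation: nothing is read yet
errors = lines.filter(lambda l: "ERROR" in l)          # transformation: still lazy
fields = errors.map(lambda l: l.split("\t"))           # transformation: still lazy
print(errors.count())                                  # action: the whole chain executes now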
Spark Essentials:
Transformations
groupByKey([numTasks]): when called on a dataset of (K, V) pairs, returns a dataset of (K, Seq[V]) pairs

reduceByKey(func, [numTasks]): when called on a dataset of (K, V) pairs, returns a dataset of (K, V) pairs where the values for each key are aggregated using the given reduce function

sortByKey([ascending], [numTasks]): when called on a dataset of (K, V) pairs where K implements Ordered, returns a dataset of (K, V) pairs sorted by keys in ascending or descending order, as specified in the boolean ascending argument

join(otherDataset, [numTasks]): when called on datasets of type (K, V) and (K, W), returns a dataset of (K, (V, W)) pairs with all pairs of elements for each key

cogroup(otherDataset, [numTasks]): when called on datasets of type (K, V) and (K, W), returns a dataset of (K, Seq[V], Seq[W]) tuples; also called groupWith

cartesian(otherDataset): when called on datasets of types T and U, returns a dataset of (T, U) pairs (all pairs of elements)
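A short PySpark sketch exercising a few of these pair transformations (the tiny in-memory datasets are illustrative):

views  = sc.parallelize([("a.html", 1), ("b.html", 1), ("a.html", 1)])
owners = sc.parallelize([("a.html", "alice"), ("b.html", "bob")])

grouped = views.groupByKey()                        # (K, Seq[V])
counts  = views.reduceByKey(lambda a, b: a + b)     # (K, V) with values aggregated per key
ordered = counts.sortByKey(ascending=False)         # sorted by key
joined  = counts.join(owners)                       # (K, (V, W))
print(joined.collect())   # e.g. [('a.html', (2, 'alice')), ('b.html', (1, 'bob'))]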
Spark Essentials: Actions

reduce(func): aggregate the elements of the dataset using a function func (which takes two arguments and returns one); func should be commutative and associative so that it can be computed correctly in parallel

collect(): return all the elements of the dataset as an array at the driver program; usually useful after a filter or other operation that returns a sufficiently small subset of the data

count(): return the number of elements in the dataset

first(): return the first element of the dataset; similar to take(1)

take(n): return an array with the first n elements of the dataset; currently not executed in parallel, instead the driver program computes all the elements

takeSample(withReplacement, fraction, seed): return an array with a random sample of num elements of the dataset, with or without replacement, using the given random number generator seed
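And a companion PySpark sketch for these actions (the dataset and seed are illustrative):

nums = sc.parallelize([5, 3, 8, 1, 9, 2])

total   = nums.reduce(lambda a, b: a + b)      # func is commutative and associative
all_out = nums.collect()                       # brings every element to the driver
n       = nums.count()
first   = nums.first()                         # similar to take(1)
first_n = nums.take(3)
sample  = nums.takeSample(False, 2, 42)        # without replacement, fixed seed
print(total, n, first, first_n, sample)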
Data storage

Spark does not care how data is stored. The appropriate RDD connector determines how to read data.

For example, RDDs can be the result of a query in a Cassandra database, and new RDDs can be written to Cassandra tables.

Alternatively, RDDs can be read from HDFS files or written to an HBase table.
Fault tolerance

For each RDD, the driver tracks the sequence of transformations used to create it.

That means every RDD knows which task was needed to create it. If any RDD is lost (e.g., a task that creates one died), the driver can ask the task that generated it to recreate it.

The driver maintains the entire dependency graph, so this recreation may end up being a chain of transformation tasks going back to the original data.
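In PySpark, this tracked lineage can be inspected with the RDD's debug string (a quick sketch; the path is illustrative):

lines  = sc.textFile("hdfs://.../app.log")
errors = lines.filter(lambda l: l.startswith("ERROR"))
fields = errors.map(lambda l: l.split("\t"))
print(fields.toDebugString())   # prints the chain of parent RDDs, i.e. the lineage (returned as bytes in PySpark)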
Working With RDDs
textFile = sc.textFile("SomeFile.txt")                            # creates an RDD

linesWithSpark = textFile.filter(lambda line: "Spark" in line)    # transformations build new RDDs

linesWithSpark.count()    # action returns a value: 74
linesWithSpark.first()    # action returns a value: '# Apache Spark'
Example: Log Mining
Load error messages from a log into memory, then interactively search for various patterns.

lines = spark.textFile("hdfs://...")                        # base RDD, one partition per HDFS block
errors = lines.filter(lambda s: s.startswith("ERROR"))      # transformed RDD
messages = errors.map(lambda s: s.split("\t")[2])
messages.cache()

messages.filter(lambda s: "mysql" in s).count()             # action
messages.filter(lambda s: "php" in s).count()

The driver sends tasks to the workers. For the first action, each worker reads its HDFS block, computes the filtered and mapped partitions, caches them in memory, and returns results to the driver. Subsequent actions (e.g. the "php" count) are processed from the cache instead of re-reading HDFS.

Cache your data -> Faster Results
Full-text search of Wikipedia:
• 60 GB on 20 EC2 machines
• 0.5 sec from memory vs. 20 s on disk
Language Support

Standalone Programs: Python, Scala, & Java
Interactive Shells: Python & Scala
Performance: Java & Scala are faster due to static typing, but Python is often fine

Python
lines = sc.textFile(...)
lines.filter(lambda s: "ERROR" in s).count()

Scala
val lines = sc.textFile(...)
lines.filter(x => x.contains("ERROR")).count()

Java
JavaRDD<String> lines = sc.textFile(...);
lines.filter(new Function<String, Boolean>() {
  Boolean call(String s) {
    return s.contains("error");
  }
}).count();
Expressive API

map, filter, groupBy, sort, union, join, leftOuterJoin, rightOuterJoin, reduce, count, fold, reduceByKey, groupByKey, cogroup, cross, zip, sample, take, first, partitionBy, mapWith, pipe, save, ...
Fault Recovery

RDDs track lineage information that can be used to efficiently reconstruct lost partitions.

Ex: messages = textFile(...).filter(_.startsWith("ERROR"))
                            .map(_.split("\t")(2))

HDFS File -> filter (func = _.contains(...)) -> Filtered RDD -> map (func = _.split(...)) -> Mapped RDD
Fault Recovery Results

Figure: iteration time (s) over 10 iterations of a job with one failure. Most iterations take roughly 56-59 s; the slower iterations (119 s and 81 s) include the one where the failure happens and lost partitions are rebuilt from lineage.
Agenda

1. MapReduce Review
2. Introduction to Spark and RDDs
3. Generality of RDDs (e.g. streaming,
ML)
4. DataFrames
5. Internals (time permitting)

Generality of RDDs

Spark Streaming (real-time), Spark SQL, MLlib (machine learning), and GraphX (graph) are all built on Spark.

At the data level, DStreams (streams of RDDs), SchemaRDDs, RDD-based matrices, and RDD-based graphs are all expressed with RDDs, transformations, and actions.
Spark Streaming: Motivation

Many important apps must process large data streams at


second-scale latencies
• Site statistics, intrusion detection, online ML

To build and scale these apps users want:


• Integration: with offline analytical stack
• Fault-tolerance: both for crashes and stragglers
• Efficiency: low cost beyond base processing
Discretized Stream Processing
At each time step (t = 1, 2, ...), the input received on each stream is stored reliably as an immutable dataset; a batch operation pulls it and produces another immutable dataset (the output or state), which is stored in memory as an RDD.
Programming Interface

Simple functional API that interoperates with RDDs:

views = readStream("http:...", "1s")
ones = views.map(ev => (ev.url, 1))
counts = ones.runningReduce(_ + _)

// Join stream with static RDD
counts.join(historicCounts).map(...)

// Ad-hoc queries on stream state
counts.slice("21:00", "21:05").topK(10)

At each time step, views, ones, and counts are RDDs made of partitions, produced by map and reduce over that interval's data.
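The readStream/runningReduce calls above are pseudocode from the original DStream design. A rough PySpark Streaming equivalent, as a sketch (host, port, batch interval, and checkpoint path are illustrative):

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext("local[2]", "page-views")
ssc = StreamingContext(sc, 1)                       # 1-second batches

views = ssc.socketTextStream("localhost", 9999)     # one URL per line
ones = views.map(lambda url: (url, 1))
counts = ones.updateStateByKey(                     # running count per URL
    lambda new_values, total: sum(new_values) + (total or 0))
counts.pprint()

ssc.checkpoint("/tmp/stream-checkpoint")            # required for stateful operations
ssc.start()
ssc.awaitTermination()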
Inherited “for free” from Spark
RDD data model and API

Data partitioning and shuffles

Task scheduling

Monitoring/instrumentation

Scheduling and resource allocation


Benefits for Users

High performance data sharing


• Data sharing is the bottleneck in many environments
• RDDs provide in-place sharing through memory

Applications can compose models


• Run a SQL query and then PageRank the results
• ETL your data and then run graph/ML on it

Benefit from investment in shared functionality


• E.g. re-usable components (shell) and performance
optimizations
Agenda

1. MapReduce Review
2. Introduction to Spark and RDDs
3. Generality of RDDs (e.g. streaming,
ML)
4. DataFrames
5. Internals (time permitting)

From MapReduce to Spark

DataFrames in Spark

Distributed collection of data grouped into named columns (i.e. an RDD with a schema)

DSL designed for common tasks:
• Metadata
• Sampling
• Project, filter, aggregation, join, …
• UDFs

Available in Python, Scala, Java, and R (via SparkR)
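A small PySpark sketch of this DSL in the Spark 1.3/1.4 style used elsewhere in this deck (sqlCtx, the JSON path, and the column names are illustrative):

from pyspark.sql import SQLContext

sqlCtx = SQLContext(sc)
df = sqlCtx.jsonFile("hdfs://.../events.json")     # DataFrame: RDD of rows with a schema

df.printSchema()                                    # metadata
sample = df.sample(False, 0.1)                      # sampling
out = (df.select("user", "url")                     # projection
         .filter(df.date >= "2015-01-01")           # filter
         .groupBy("user")
         .count())                                  # aggregation
out.show()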
Not Just Less Code: Faster Implementations

Figure: time to aggregate 10 million int pairs (secs, 0-10 scale). DataFrame code performs about the same whether written in SQL, Python, or Scala, while RDD code is noticeably slower in Python than in Scala.


DataFrame Internals

Represented internally as a “logical plan”

Execution is lazy, allowing it to be optimized by a query optimizer
Plan Optimization & Execution

DataFrames and SQL share the same optimization/execution pipeline, maximizing code reuse and shared optimization effort.

joined = users.join(events, users.id == events.uid)
filtered = joined.filter(events.date >= "2015-01-01")

Logical plan: the filter sits on top of an expensive join of scan(users) and scan(events).
Physical plan: the optimizer pushes the filter below the join, so only filtered events are joined with users.
Data Sources supported by DataFrames

Built-in: JDBC, { JSON }, and more.
External: additional connectors via the Data Sources API.

More Than Naïve Scans

Data Sources API can automatically prune columns and push filters to the source:
• Parquet: skip irrelevant columns and blocks of data; turn string comparisons into integer comparisons for dictionary-encoded data
• JDBC: rewrite queries to push predicates down
joined = users.join(events, users.id == events.uid)
filtered = joined.filter(events.date > "2015-01-01")

Logical plan: filter over a join of scan(users) and scan(events).
Optimized plan: the filter is pushed below the join.
Optimized plan with intelligent data sources: the filter is pushed into the events scan itself.
Our Experience So Far

SQL is wildly popular and important


• 100% of Databricks customers use some SQL

Schema is very useful


• Most data pipelines, even the ones that start with unstructured
data, end up having some implicit structure
• Key-value too limited
• That said, semi-/un-structured support is paramount

Separation of logical vs physical plan


• Important for performance optimizations (e.g. join selection)
Machine Learning Pipelines
tokenizer = Tokenizer(inputCol="text", outputCol="words")
hashingTF = HashingTF(inputCol="words", outputCol="features")
lr = LogisticRegression(maxIter=10, regParam=0.01)
pipeline = Pipeline(stages=[tokenizer, hashingTF, lr])

df = sqlCtx.load("/path/to/data")
model = pipeline.fit(df)

Pipeline: df0 -> tokenizer -> df1 -> hashingTF -> df2 -> lr.model -> df3
R Interface (SparkR)

Spark 1.4 (June) exposes DataFrames and the ML library in R.

df = jsonFile("tweets.json")
summarize(
  group_by(
    df[df$user == "matei", ],
    "date"),
  sum("retweets"))
Data Science at Scale

Higher level interfaces in Scala, Java, Python, R

Drastically easier to program Big Data
• With APIs similar to single-node tools

The stack: DataFrames and ML Pipelines sit on top of Spark SQL, Spark Streaming, MLlib, and GraphX, which run on Spark Core over pluggable Data Sources ({JSON}, ...).

Goal: unified engine across data sources, workloads and environments.
Agenda

1. MapReduce Review
2. Introduction to Spark and RDDs
3. Generality of RDDs (e.g. streaming,
ML)
4. DataFrames
5. Internals

Spark Application
Your program (JVM / Python):
  sc = new SparkContext
  f = sc.textFile("...")
  f.filter(...).count()
  ...

Spark driver (app master): RDD graph, scheduler, block tracker, shuffle tracker
Spark executors (multiple of them): task threads, block manager
A cluster manager sits between driver and executors; data is stored in HDFS, HBase, ...

A single application often contains multiple actions.


RDD is an interface
1. Set of partitions (“splits” in Hadoop)
2. List of dependencies on parent RDDs
3. Function to compute a partition (as an Iterator) given its parent(s)
4. (Optional) partitioner (hash, range)
5. (Optional) preferred location(s) for each partition

Items 1-3 capture an RDD’s lineage; items 4-5 enable optimized execution.
Example: HadoopRDD

partitions = one per HDFS block

dependencies = none

compute(part) = read corresponding block

preferredLocations(part) = HDFS block location

partitioner = none
Example: Filtered RDD

partitions = same as parent RDD

dependencies = “one-to-one” on parent

compute(part) = compute parent and filter it

preferredLocations(part) = none (ask parent)

partitioner = none
RDD Graph (DAG of tasks)

Dataset-level view:
  file: HadoopRDD (path = hdfs://...)
  errors: FilteredRDD (func = _.contains(…), shouldCache = true)

Partition-level view: one task per partition (Task1, Task2, ...)
Example: JoinedRDD

partitions = one per reduce task

dependencies = “shuffle” on each parent

compute(partition) = read and join shuffled data

preferredLocations(part) = none

partitioner = HashPartitioner(numTasks)   (Spark will now know this data is hash-partitioned)
Dependency Types
“Narrow” (pipeline-able): map, filter, union, join with inputs co-partitioned

“Wide” (shuffle): groupByKey on non-partitioned data, join with inputs not co-partitioned
Execution Process
RDD Objects -> DAG Scheduler -> Task Scheduler -> Worker

rdd1.join(rdd2).groupBy(…).filter(…)   builds the operator DAG

DAG Scheduler: splits the graph into stages of tasks and submits each stage as ready
Task Scheduler: launches tasks via the cluster manager and retries failed or straggling tasks
Worker: threads execute tasks; the block manager stores and serves blocks
DAG Scheduler
Input: RDD and partitions to compute

Output: output from actions on those partitions

Roles:
• Build stages of tasks
• Submit them to the lower level scheduler (e.g. YARN, Mesos, Standalone) as ready
• The lower level scheduler will schedule data based on locality
• Resubmit failed stages if outputs are lost
Job Scheduler

• Captures RDD dependency graph
• Pipelines functions into “stages”
• Cache-aware for data reuse & locality
• Partitioning-aware to avoid shuffles

Figure: an example dependency graph split into three stages around wide dependencies (groupBy, join), with map and union pipelined within a stage; already-cached partitions do not need to be recomputed.
Goal: unified engine across data sources, workloads and environments: DataFrames and ML Pipelines on top of Spark SQL, Spark Streaming, MLlib, and GraphX, all on Spark Core over pluggable Data Sources ({JSON}, ...).
References

Paco Nathan, Intro to Apache Spark, ITAS Workshop, Databricks
Hands-on Tour of Apache Spark in 5 Minutes, Hortonworks
Running Spark Applications, Cloudera 5.5.x documentation
Sandy Ryza, Apache Spark Resource Management and YARN App Models, Cloudera Engineering Blog, May 30, 2014

http://spark.apache.org/index.html
http://spark.apache.org/docs/latest/index.html
