Spark is a framework for large-scale data processing. It uses Resilient Distributed Datasets (RDDs) as its core abstraction, which allow data to be partitioned across clusters and operated on in parallel. RDDs can be created from external storage systems, streaming sources, or by transforming existing RDDs. Spark programs define a sequence of transformations and actions on RDDs to perform multi-step processing more efficiently than traditional MapReduce. Spark stores intermediate data in memory to improve performance over the disk-based approach of MapReduce.

Intro to Apache Spark

Credits to CS 347, Stanford course, 2015, Reynold Xin, Databricks (Spark provider)


Content

1. MapReduce
2. Introduction to Spark
3. Resilient Distributed Datasets (RDDs)
4. DataFrames
5. Internals

Traditional Network Programming

Message-passing between nodes (MPI, RPC, etc.)

Really hard to do at scale:

• How to split the problem across nodes?
  – Important to consider network and data locality
• How to deal with failures?
  – If a typical server fails every 3 years, a 10,000-node cluster sees about 10 faults/day!
Data-Parallel Models

Restrict the programming interface so that the system can do more automatically.

“Here’s an operation, run it on all of the data”
• I don’t care where it runs (you schedule that)
• In fact, feel free to run it twice on different nodes
MapReduce Programming Model
MapReduce turned out to be an incredibly useful and
widely-deployed framework for processing large
amounts of data. However, its design forces programs to
comply with its computation model, which is:

Map: create key-value pairs

Shuffle: combine common keys together and partition them to reduce workers

Reduce: process each unique key and all of its associated values
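To make the model concrete, here is a minimal word-count sketch of the Map -> Shuffle -> Reduce steps in plain Python (the function names and the single-process, in-memory shuffle are illustrative; a real MapReduce framework distributes these steps across workers):

from collections import defaultdict

def map_fn(line):                        # Map: emit (key, value) pairs
    return [(word, 1) for word in line.split()]

def shuffle(pairs):                      # Shuffle: group values by common key
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_fn(key, values):              # Reduce: process one key and all its values
    return key, sum(values)

lines = ["to be or not to be"]
pairs = [kv for line in lines for kv in map_fn(line)]
counts = dict(reduce_fn(k, vs) for k, vs in shuffle(pairs).items())
print(counts)                            # {'to': 2, 'be': 2, 'or': 1, 'not': 1}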
MapReduce drawbacks
Many applications had to run MapReduce over multiple passes to process their data.

All intermediate data had to be stored back in the file system (GFS at Google, HDFS elsewhere), which tended to be slow since stored data was not just written to disks but also replicated.

The next MapReduce phase could not start until the previous MapReduce job completed fully.

MapReduce was also designed to read its data from a distributed file system (GFS/HDFS). In many cases, however, data resides within an SQL database or is streaming in (e.g., activity logs, remote monitoring).
MapReduce Programmability

Most real applications require multiple MR steps:

• Google indexing pipeline: 21 steps
• Analytics queries (e.g. count clicks & top K): 2-5 steps
• Iterative algorithms (e.g. PageRank): 10s of steps

Multi-step jobs create spaghetti code:

• 21 MR steps -> 21 mapper and reducer classes
A Brief History: MapReduce
circa 1979 – Stanford, MIT, CMU, etc.
set/list operations in LISP, Prolog, etc., for parallel processing
www-formal.stanford.edu/jmc/history/lisp/lisp.htm

circa 2004 – Google


MapReduce: Simplified Data Processing on Large Clusters
Jeffrey Dean and Sanjay Ghemawat
research.google.com/archive/mapreduce.html

circa 2006 – Apache


Hadoop, originating from the Nutch Project
Doug Cutting
research.yahoo.com/files/cutting.pdf

circa 2008 – Yahoo


web scale search indexing
Hadoop Summit, HUG, etc.
developer.yahoo.com/hadoop/

circa 2009 – Amazon AWS


Elastic MapReduce
Hadoop modified for EC2/S3, plus support
for Hive, Pig, Cascading, etc.
aws.amazon.com/elasticmapreduce/
Problems with MapReduce

MapReduce use cases showed two major limitations:

1. Difficulty of programming directly in MR
2. Performance bottlenecks

In short, MR doesn’t compose well for large applications.

Therefore, people built high-level frameworks and specialized systems.
Specialized Systems

Agenda

1. MapReduce Review
2. Introduction to Spark and RDDs
3. Generality of RDDs (e.g. streaming,
ML)
4. DataFrames
5. Internals (time permitting)

Spark: A Brief History

Spark Summary
• highly flexible and general-purpose way of dealing with big data processing needs

• does not impose a rigid computation model, and supports a variety of input types

• deals with text files, graph data, database queries, and streaming sources, and is not confined to a two-stage processing model

• Programmers can develop arbitrarily-complex, multi-step data pipelines arranged in an arbitrary directed acyclic graph (DAG) pattern.
Spark Summary

• programming in Spark involves defining a sequence of transformations and actions

• Spark has support for map and reduce operations, so it can implement traditional MapReduce processing, but it also supports SQL queries, graph processing, and machine learning

• stores its intermediate results in memory, providing dramatically higher performance
Spark Ecosystem

Programmability

WordCount in 3 lines of Spark vs. WordCount in 50+ lines of Java MapReduce
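For reference, a three-line word count in PySpark along the lines the slide alludes to (the HDFS paths are illustrative):

text = sc.textFile("hdfs://.../input.txt")
counts = text.flatMap(lambda line: line.split()).map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)
counts.saveAsTextFile("hdfs://.../counts")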
Performance
Time to sort 100 TB:

2013 Record (Hadoop): 2100 machines, 72 minutes
2014 Record (Spark): 207 machines, 23 minutes

Spark also sorted 1 PB in 4 hours.
RDD: Core Abstraction
Write programs in terms of distributed datasets and operations on them.

Resilient Distributed Datasets:
• Collections of objects spread across a cluster, stored in RAM or on disk
• Built through parallel transformations
• Automatically rebuilt on failure

Operations:
• Transformations (e.g. map, filter, groupBy)
• Actions (e.g. count, collect, save)
RDD

Resilient Distributed Datasets are the primary abstraction in Spark: a fault-tolerant collection of elements that can be operated on in parallel.

Two types:
• parallelized collections: take an existing single-node collection and parallelize it
• Hadoop datasets: files on HDFS or other compatible storage
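A minimal PySpark sketch of both creation paths (the local master URL and file path are illustrative):

from pyspark import SparkContext

sc = SparkContext("local[*]", "rdd-intro")

# 1. Parallelized collection: distribute an existing single-node list
nums = sc.parallelize([1, 2, 3, 4, 5], numSlices=2)

# 2. Hadoop dataset: each line of the HDFS file becomes an element
lines = sc.textFile("hdfs://namenode:9000/logs/app.log")

print(nums.sum())      # actions trigger the actual computation
print(lines.count())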
RDD: Core Abstraction

An application that uses Spark identifies data sources and the operations on that data. The main application, called the driver program, is linked with the Spark API, which creates a SparkContext (the heart of the Spark system, which coordinates all processing activity). This SparkContext in the driver program connects to a Spark cluster manager. The cluster manager is responsible for allocating worker nodes, launching executors on them, and keeping track of their status.

• Each worker node runs one or more executors. An executor is a process that runs an instance of a Java Virtual Machine (JVM).
• When each executor is launched by the manager, it establishes a connection back to the driver program.
• The executor runs tasks on behalf of a specific SparkContext (application) and keeps related data in memory or disk storage.
• A task is a transformation or action. The executor remains running for the duration of the application. This provides a performance advantage over the MapReduce approach since new tasks can be started very quickly.
• The executor also maintains a cache, which stores frequently-used data in memory instead of having to store it to a disk-based file as the MapReduce framework does.
• The driver goes through the user’s program, which consists of actions and transformations on data, and converts it into a series of tasks. The driver then sends tasks to the executors that registered with it.
• A task is application code that runs in the executor on a Java Virtual Machine (JVM) and can be written in languages such as Scala, Java, Python, Clojure, and R. It is transmitted as a jar file to an executor, which then runs it.
RDD
Data in Spark is a collection of Resilient Distributed Datasets (RDDs). This is often a huge collection of data. Think of an individual RDD as a table in a database or a structured file.

Input data is organized into RDDs, which will often be partitioned across many computers. RDDs can be created in three ways:

1. They can be present as any file stored in HDFS or any other storage system supported in Hadoop. This includes Amazon S3 (a key-value server, similar in design to Dynamo), HBase (Hadoop’s version of Bigtable), and Cassandra (a NoSQL eventually-consistent database). This data is created by other services, such as event streams, text logs, or a database. For instance, the results of a specific query can be treated as an RDD. A list of files in a specific directory can also be an RDD.
RDD

2. RDDs can be streaming sources using the Spark Streaming extension. This could be a stream of events from remote sensors, for example. For fault tolerance, a sliding window is used, where the contents of the stream are buffered in memory for a predefined time interval.
RDD
3. An RDD can be the output of a transformation function. This allows one task to create data that can be consumed by another task, and is the way tasks pass data around.

For example, one task can filter out unwanted data and generate a set of key-value pairs, writing them to an RDD. This RDD will be cached in memory (overflowing to disk if needed) and will be read by a task that reads the output of the task that created the key-value data.
RDD properties
They are immutable. That means their contents cannot be changed. A task can read from an RDD and create a new RDD, but it cannot modify an RDD. The framework magically garbage collects unneeded intermediate RDDs.

They are typed. An RDD will have some kind of structure within it, such as a key-value pair or a set of fields. Tasks need to be able to parse RDD streams.

They are ordered. An RDD contains a set of elements that can be sorted. In the case of key-value lists, the elements will be sorted by a key. The sorting function can be defined by the programmer, but sorting enables one to implement things like Reduce operations.

They are partitioned. Parts of an RDD may be sent to different servers. The default partitioning function is to send a row of data to the server corresponding to hash(key) mod server_count.
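A small PySpark sketch of this partitioning behaviour (the partition count and hash function are illustrative):

pairs = sc.parallelize([("a", 1), ("b", 2), ("c", 3), ("a", 4)])

# Route each (key, value) row to partition hash(key) mod 4
partitioned = pairs.partitionBy(4, partitionFunc=lambda k: hash(k))

print(partitioned.getNumPartitions())   # 4
print(partitioned.glom().collect())     # elements grouped by partition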
RDD operations
Spark allows two types of operations on RDDs: transformations and actions.

Transformations read an RDD and return a new RDD. Example transformations are map, filter, groupByKey, and reduceByKey. Transformations are evaluated lazily, which means they are computed only when some task wants their data (the RDD that they generate). At that point, the driver schedules them for execution.

Actions are operations that evaluate and return a value. When an action is requested on an RDD object, the necessary transformations are computed and the result is returned. Actions tend to be the things that generate the final output needed by a program. Example actions are reduce, grab samples, and write to file.
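A short PySpark sketch of this lazy evaluation (the file path is illustrative): the transformations only describe the computation, and nothing runs until the action at the end.

lines = sc.textFile("hdfs://.../app.log")              # transformation: nothing is read yet
errors = lines.filter(lambda l: "ERROR" in l)          # transformation: still lazy
fields = errors.map(lambda l: l.split("\t"))           # transformation: still lazy
print(errors.count())                                  # action: the whole chain executes now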
Spark Essentials:
Transformations
groupByKey([numTasks]): when called on a dataset of (K, V) pairs, returns a dataset of (K, Seq[V]) pairs

reduceByKey(func, [numTasks]): when called on a dataset of (K, V) pairs, returns a dataset of (K, V) pairs where the values for each key are aggregated using the given reduce function

sortByKey([ascending], [numTasks]): when called on a dataset of (K, V) pairs where K implements Ordered, returns a dataset of (K, V) pairs sorted by keys in ascending or descending order, as specified in the boolean ascending argument

join(otherDataset, [numTasks]): when called on datasets of type (K, V) and (K, W), returns a dataset of (K, (V, W)) pairs with all pairs of elements for each key

cogroup(otherDataset, [numTasks]): when called on datasets of type (K, V) and (K, W), returns a dataset of (K, Seq[V], Seq[W]) tuples; also called groupWith

cartesian(otherDataset): when called on datasets of types T and U, returns a dataset of (T, U) pairs (all pairs of elements)
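A short PySpark sketch exercising a few of these pair transformations (the tiny in-memory datasets are illustrative):

views  = sc.parallelize([("a.html", 1), ("b.html", 1), ("a.html", 1)])
owners = sc.parallelize([("a.html", "alice"), ("b.html", "bob")])

grouped = views.groupByKey()                        # (K, Seq[V])
counts  = views.reduceByKey(lambda a, b: a + b)     # (K, V) with values aggregated per key
ordered = counts.sortByKey(ascending=False)         # sorted by key
joined  = counts.join(owners)                       # (K, (V, W))
print(joined.collect())   # e.g. [('a.html', (2, 'alice')), ('b.html', (1, 'bob'))]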
Spark Essentials: Actions

reduce(func): aggregate the elements of the dataset using a function func (which takes two arguments and returns one); func should be commutative and associative so that it can be computed correctly in parallel

collect(): return all the elements of the dataset as an array at the driver program; usually useful after a filter or other operation that returns a sufficiently small subset of the data

count(): return the number of elements in the dataset

first(): return the first element of the dataset; similar to take(1)

take(n): return an array with the first n elements of the dataset; currently not executed in parallel, instead the driver program computes all the elements

takeSample(withReplacement, fraction, seed): return an array with a random sample of num elements of the dataset, with or without replacement, using the given random number generator seed
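And a companion PySpark sketch for these actions (the dataset and seed are illustrative):

nums = sc.parallelize([5, 3, 8, 1, 9, 2])

total   = nums.reduce(lambda a, b: a + b)      # func is commutative and associative
all_out = nums.collect()                       # brings every element to the driver
n       = nums.count()
first   = nums.first()                         # similar to take(1)
first_n = nums.take(3)
sample  = nums.takeSample(False, 2, 42)        # without replacement, fixed seed
print(total, n, first, first_n, sample)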
Data storage

Spark does not care how data is stored. The appropriate RDD connector determines how to read data.

For example, RDDs can be the result of a query in a Cassandra database, and new RDDs can be written to Cassandra tables.

Alternatively, RDDs can be read from HDFS files or written to an HBase table.
Fault tolerance

For each RDD, the driver tracks the sequence of transformations used to create it.

That means every RDD knows which task was needed to create it. If any RDD is lost (e.g., a task that creates one died), the driver can ask the task that generated it to recreate it.

The driver maintains the entire dependency graph, so this recreation may end up being a chain of transformation tasks going back to the original data.
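In PySpark, this tracked lineage can be inspected with the RDD's debug string (a quick sketch; the path is illustrative):

lines  = sc.textFile("hdfs://.../app.log")
errors = lines.filter(lambda l: l.startswith("ERROR"))
fields = errors.map(lambda l: l.split("\t"))
print(fields.toDebugString())   # prints the chain of parent RDDs, i.e. the lineage (returned as bytes in PySpark)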
Working With RDDs
textFile = sc.textFile("SomeFile.txt")                            # creates an RDD

linesWithSpark = textFile.filter(lambda line: "Spark" in line)    # transformations build new RDDs

linesWithSpark.count()    # action returns a value: 74
linesWithSpark.first()    # action returns a value: '# Apache Spark'
Example: Log Mining
Load error messages from a log into memory, then interactively search for various patterns.

lines = spark.textFile("hdfs://...")                        # base RDD, one partition per HDFS block
errors = lines.filter(lambda s: s.startswith("ERROR"))      # transformed RDD
messages = errors.map(lambda s: s.split("\t")[2])
messages.cache()

messages.filter(lambda s: "mysql" in s).count()             # action
messages.filter(lambda s: "php" in s).count()

The driver sends tasks to the workers. For the first action, each worker reads its HDFS block, computes the filtered and mapped partitions, caches them in memory, and returns results to the driver. Subsequent actions (e.g. the "php" count) are processed from the cache instead of re-reading HDFS.

Cache your data -> Faster Results
Full-text search of Wikipedia:
• 60 GB on 20 EC2 machines
• 0.5 sec from memory vs. 20 s on disk
Language Support

Standalone Programs: Python, Scala, & Java
Interactive Shells: Python & Scala
Performance: Java & Scala are faster due to static typing, but Python is often fine

Python
lines = sc.textFile(...)
lines.filter(lambda s: "ERROR" in s).count()

Scala
val lines = sc.textFile(...)
lines.filter(x => x.contains("ERROR")).count()

Java
JavaRDD<String> lines = sc.textFile(...);
lines.filter(new Function<String, Boolean>() {
  Boolean call(String s) {
    return s.contains("error");
  }
}).count();
Expressive API

map, filter, groupBy, sort, union, join, leftOuterJoin, rightOuterJoin, reduce, count, fold, reduceByKey, groupByKey, cogroup, cross, zip, sample, take, first, partitionBy, mapWith, pipe, save, ...
Fault Recovery

RDDs track lineage information that can be used to efficiently reconstruct lost partitions.

Ex: messages = textFile(...).filter(_.startsWith("ERROR"))
                            .map(_.split("\t")(2))

HDFS File -> filter (func = _.contains(...)) -> Filtered RDD -> map (func = _.split(...)) -> Mapped RDD
Fault Recovery Results

Figure: iteration time (s) over 10 iterations of a job with one failure. Most iterations take roughly 56-59 s; the slower iterations (119 s and 81 s) include the one where the failure happens and lost partitions are rebuilt from lineage.
Agenda

1. MapReduce Review
2. Introduction to Spark and RDDs
3. Generality of RDDs (e.g. streaming,
ML)
4. DataFrames
5. Internals (time permitting)

Generality of RDDs

Spark Streaming (real-time), Spark SQL, MLlib (machine learning), and GraphX (graph) are all built on Spark.

At the data level, DStreams (streams of RDDs), SchemaRDDs, RDD-based matrices, and RDD-based graphs are all expressed with RDDs, transformations, and actions.
Spark Streaming: Motivation

Many important apps must process large data streams at


second-scale latencies
• Site statistics, intrusion detection, online ML

To build and scale these apps users want:


• Integration: with offline analytical stack
• Fault-tolerance: both for crashes and stragglers
• Efficiency: low cost beyond base processing
Discretized Stream Processing
At each time step (t = 1, 2, ...), the input received on each stream is stored reliably as an immutable dataset; a batch operation pulls it and produces another immutable dataset (the output or state), which is stored in memory as an RDD.
Programming Interface

Simple functional API that interoperates with RDDs:

views = readStream("http:...", "1s")
ones = views.map(ev => (ev.url, 1))
counts = ones.runningReduce(_ + _)

// Join stream with static RDD
counts.join(historicCounts).map(...)

// Ad-hoc queries on stream state
counts.slice("21:00", "21:05").topK(10)

At each time step, views, ones, and counts are RDDs made of partitions, produced by map and reduce over that interval's data.
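The readStream/runningReduce calls above are pseudocode from the original DStream design. A rough PySpark Streaming equivalent, as a sketch (host, port, batch interval, and checkpoint path are illustrative):

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext("local[2]", "page-views")
ssc = StreamingContext(sc, 1)                       # 1-second batches

views = ssc.socketTextStream("localhost", 9999)     # one URL per line
ones = views.map(lambda url: (url, 1))
counts = ones.updateStateByKey(                     # running count per URL
    lambda new_values, total: sum(new_values) + (total or 0))
counts.pprint()

ssc.checkpoint("/tmp/stream-checkpoint")            # required for stateful operations
ssc.start()
ssc.awaitTermination()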
Inherited “for free” from Spark
RDD data model and API

Data partitioning and shuffles

Task scheduling

Monitoring/instrumentation

Scheduling and resource allocation


Benefits for Users

High performance data sharing


• Data sharing is the bottleneck in many environments
• RDDs provide in-place sharing through memory

Applications can compose models


• Run a SQL query and then PageRank the results
• ETL your data and then run graph/ML on it

Benefit from investment in shared functionality


• E.g. re-usable components (shell) and performance
optimizations
Agenda

1. MapReduce Review
2. Introduction to Spark and RDDs
3. Generality of RDDs (e.g. streaming,
ML)
4. DataFrames
5. Internals (time permitting)

From MapReduce to Spark

DataFrames in Spark

Distributed collection of data grouped into named columns (i.e. an RDD with a schema)

DSL designed for common tasks:
• Metadata
• Sampling
• Project, filter, aggregation, join, …
• UDFs

Available in Python, Scala, Java, and R (via SparkR)
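A small PySpark sketch of this DSL in the Spark 1.3/1.4 style used elsewhere in this deck (sqlCtx, the JSON path, and the column names are illustrative):

from pyspark.sql import SQLContext

sqlCtx = SQLContext(sc)
df = sqlCtx.jsonFile("hdfs://.../events.json")     # DataFrame: RDD of rows with a schema

df.printSchema()                                    # metadata
sample = df.sample(False, 0.1)                      # sampling
out = (df.select("user", "url")                     # projection
         .filter(df.date >= "2015-01-01")           # filter
         .groupBy("user")
         .count())                                  # aggregation
out.show()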
Not Just Less Code: Faster Implementations

Figure: time to aggregate 10 million int pairs (secs, 0-10 scale). DataFrame code performs about the same whether written in SQL, Python, or Scala, while RDD code is noticeably slower in Python than in Scala.


DataFrame Internals

Represented internally as a “logical plan”

Execution is lazy, allowing it to be optimized by a query optimizer
Plan Optimization & Execution

DataFrames and SQL share the same optimization/execution pipeline, maximizing code reuse and shared optimization effort.

joined = users.join(events, users.id == events.uid)
filtered = joined.filter(events.date >= "2015-01-01")

Logical plan: the filter sits on top of an expensive join of scan(users) and scan(events).
Physical plan: the optimizer pushes the filter below the join, so only filtered events are joined with users.
Data Sources supported by DataFrames

Built-in: JDBC, { JSON }, and more.
External: additional connectors via the Data Sources API.

More Than Naïve Scans

Data Sources API can automatically prune columns and push filters to the source:
• Parquet: skip irrelevant columns and blocks of data; turn string comparisons into integer comparisons for dictionary-encoded data
• JDBC: rewrite queries to push predicates down
joined = users.join(events, users.id == events.uid)
filtered = joined.filter(events.date > "2015-01-01")

Logical plan: filter over a join of scan(users) and scan(events).
Optimized plan: the filter is pushed below the join.
Optimized plan with intelligent data sources: the filter is pushed into the events scan itself.
Our Experience So Far

SQL is wildly popular and important


• 100% of Databricks customers use some SQL

Schema is very useful


• Most data pipelines, even the ones that start with unstructured
data, end up having some implicit structure
• Key-value too limited
• That said, semi-/un-structured support is paramount

Separation of logical vs physical plan


• Important for performance optimizations (e.g. join selection)
Machine Learning Pipelines
tokenizer = Tokenizer(inputCol="text", outputCol="words")
hashingTF = HashingTF(inputCol="words", outputCol="features")
lr = LogisticRegression(maxIter=10, regParam=0.01)
pipeline = Pipeline(stages=[tokenizer, hashingTF, lr])

df = sqlCtx.load("/path/to/data")
model = pipeline.fit(df)

Pipeline: df0 -> tokenizer -> df1 -> hashingTF -> df2 -> lr.model -> df3
R Interface (SparkR)

Spark 1.4 (June) exposes DataFrames and the ML library in R.

df = jsonFile("tweets.json")
summarize(
  group_by(
    df[df$user == "matei", ],
    "date"),
  sum("retweets"))
Data Science at Scale

Higher level interfaces in Scala, Java, Python, R

Drastically easier to program Big Data
• With APIs similar to single-node tools

The stack: DataFrames and ML Pipelines sit on top of Spark SQL, Spark Streaming, MLlib, and GraphX, which run on Spark Core over pluggable Data Sources ({JSON}, ...).

Goal: unified engine across data sources, workloads and environments.
Agenda

1. MapReduce Review
2. Introduction to Spark and RDDs
3. Generality of RDDs (e.g. streaming,
ML)
4. DataFrames
5. Internals

Spark Application
Your program (JVM / Python):
  sc = new SparkContext
  f = sc.textFile("...")
  f.filter(...).count()
  ...

Spark driver (app master): RDD graph, scheduler, block tracker, shuffle tracker
Spark executors (multiple of them): task threads, block manager
A cluster manager sits between driver and executors; data is stored in HDFS, HBase, ...

A single application often contains multiple actions.


RDD is an interface
1. Set of partitions (“splits” in Hadoop)
2. List of dependencies on parent RDDs
3. Function to compute a partition (as an Iterator) given its parent(s)
4. (Optional) partitioner (hash, range)
5. (Optional) preferred location(s) for each partition

Items 1-3 capture an RDD’s lineage; items 4-5 enable optimized execution.
Example: HadoopRDD

partitions = one per HDFS block

dependencies = none

compute(part) = read corresponding block

preferredLocations(part) = HDFS block location

partitioner = none
Example: Filtered RDD

partitions = same as parent RDD

dependencies = “one-to-one” on parent

compute(part) = compute parent and filter it

preferredLocations(part) = none (ask parent)

partitioner = none
RDD Graph (DAG of tasks)

Dataset-level view:
  file: HadoopRDD (path = hdfs://...)
  errors: FilteredRDD (func = _.contains(…), shouldCache = true)

Partition-level view: one task per partition (Task1, Task2, ...)
Example: JoinedRDD

partitions = one per reduce task

dependencies = “shuffle” on each parent

compute(partition) = read and join shuffled data

preferredLocations(part) = none

partitioner = HashPartitioner(numTasks)   (Spark will now know this data is hash-partitioned)
Dependency Types
“Narrow” (pipeline-able): map, filter, union, join with inputs co-partitioned

“Wide” (shuffle): groupByKey on non-partitioned data, join with inputs not co-partitioned
Execution Process
RDD Objects -> DAG Scheduler -> Task Scheduler -> Worker

rdd1.join(rdd2).groupBy(…).filter(…)   builds the operator DAG

DAG Scheduler: splits the graph into stages of tasks and submits each stage as ready
Task Scheduler: launches tasks via the cluster manager and retries failed or straggling tasks
Worker: threads execute tasks; the block manager stores and serves blocks
DAG Scheduler
Input: RDD and partitions to compute

Output: output from actions on those partitions

Roles:
• Build stages of tasks
• Submit them to the lower level scheduler (e.g. YARN, Mesos, Standalone) as ready
• The lower level scheduler will schedule data based on locality
• Resubmit failed stages if outputs are lost
Job Scheduler

• Captures RDD dependency graph
• Pipelines functions into “stages”
• Cache-aware for data reuse & locality
• Partitioning-aware to avoid shuffles

Figure: an example dependency graph split into three stages around wide dependencies (groupBy, join), with map and union pipelined within a stage; already-cached partitions do not need to be recomputed.
Goal: unified engine across data sources, workloads and environments: DataFrames and ML Pipelines on top of Spark SQL, Spark Streaming, MLlib, and GraphX, all on Spark Core over pluggable Data Sources ({JSON}, ...).
References

Paco Nathan, Intro to Apache Spark, ITAS Workshop, Databricks
Hands-on Tour of Apache Spark in 5 Minutes, Hortonworks
Running Spark Applications, Cloudera 5.5.x documentation
Sandy Ryza, Apache Spark Resource Management and YARN App Models, Cloudera Engineering Blog, May 30, 2014

http://spark.apache.org/index.html
http://spark.apache.org/docs/latest/index.html
