Spark uses resilient distributed datasets (RDDs) that allow data to be rebuilt on the fly based on transformations applied to other datasets. This avoids writing to disk and enables faster processing by keeping data in memory when possible. RDDs track how a dataset was created from previous transformations to rebuild it when needed. This makes Spark more efficient than MapReduce for iterative jobs, streaming data, and interactive applications that require frequent random access.


Spark: Resilient Distributed Datasets, Workflow System

H. Andrew Schwartz
CSE545, Spring 2023
Big Data Analytics, The Class
Goal: Generalizations -- a model or summarization of the data.

Data Workflow Frameworks:
● Hadoop File System
● MapReduce
● Spark
● Streaming
● Deep Learning Frameworks

Analytics and Algorithms:
● Similarity Search
● Hypothesis Testing
● Transformers/Self-Supervision
● Recommendation Systems
● Link Analysis
Where is MapReduce Inefficient?
● Long pipelines sharing data
● Interactive applications
● Streaming applications
● Iterative algorithms (optimization problems)

DFS -> Map -> LocalFS -> Network -> Reduce -> DFS -> Map -> ...

(Anytime MapReduce would need to write to and read from disk a lot.)
Spark’s Big Idea
Resilient Distributed Datasets (RDDs) -- a read-only, partitioned collection of records (like a DFS) but with a record of how the dataset was created as a combination of transformations from other dataset(s).

An RDD is created either from "stable storage" (e.g., a file such as dfs://filename) or from other RDDs via transformations (map, filter, join, ...).

Example lineage:
● RDD1: the data, created from dfs://filename
● RDD2: transformation1() applied to RDD1
● RDD3: transformation2() applied to RDD2
● RDD4: transformation3() applied to RDD2

Because every RDD records its lineage, Spark can drop the data of an intermediate RDD (e.g., RDD2) and recreate it on demand when a later transformation (here, transformation3() for RDD4) needs it.

● Enables rebuilding datasets on the fly.
● Intermediate datasets are not stored on disk (and are kept in memory only if needed and there is enough space)
-> Faster communication and I/O.
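To make the lineage idea concrete, here is a minimal PySpark sketch (the dfs://filename path comes from the slide; the specific transformations are hypothetical). toDebugString() returns a description of the RDD and its recursive dependencies -- the lineage Spark would replay to rebuild a dropped dataset.

Python (sketch):

from pyspark import SparkContext

sc = SparkContext(appName="rdd-lineage-sketch")

rdd1 = sc.textFile("dfs://filename")                    # RDD1: created from stable storage
rdd2 = rdd1.map(lambda line: line.split("\t"))          # RDD2: transformation1 from RDD1
rdd3 = rdd2.filter(lambda fields: len(fields) > 2)      # RDD3: transformation2 from RDD2

# Each RDD records how it was derived; if RDD2's data is dropped from memory,
# Spark recomputes it from RDD1 whenever RDD3 (or any later RDD) needs it.
print(rdd3.toDebugString())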
(original) Transformations: RDD to RDD
(original) Actions: RDD to Value/Object, or Storage

Transformations take one or more existing RDDs (operating on multiple records at a time) and define a new RDD; actions return a value/object to the driver program or write data out to storage.

[The slide shows the table of the original transformations and actions from the RDD paper.]

Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael J. Franklin, Scott Shenker, Ion Stoica. "Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing." NSDI 2012, April 2012.
Current Transformations and Actions

http://spark.apache.org/docs/latest/rdd-programming-guide.html#transformations

common transformations: filter, map, flatMap, reduceByKey, groupByKey

http://spark.apache.org/docs/latest/rdd-programming-guide.html#actions

common actions: collect, count, take
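As a quick illustration (a minimal sketch with synthetic data, assuming an existing SparkContext sc as in the pyspark shell), transformations lazily define new RDDs while actions bring results back to the driver:

Python (sketch):

nums = sc.parallelize(range(10))             # source RDD from an in-memory list
evens = nums.filter(lambda x: x % 2 == 0)    # transformation: RDD -> RDD (lazy)
doubled = evens.map(lambda x: x * 2)         # transformation: RDD -> RDD (lazy)

print(doubled.take(3))                       # action: [0, 4, 8] returned to the driver
print(doubled.count())                       # action: 5
print(doubled.collect())                     # action: [0, 4, 8, 12, 16]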


Example

Count errors in a log file (each line has tab-separated fields: TYPE, MESSAGE, TIME, ...):

lines
  filter(_.startsWith("ERROR"))
errors
  count()

Pseudocode:

lines = sc.textFile("dfs:...")
errors = lines.filter(_.startsWith("ERROR"))
errors.count()
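A PySpark equivalent of the pseudocode above might look like the following (a sketch, assuming an existing SparkContext sc; the dfs: path is a placeholder as in the slide):

Python (sketch):

lines = sc.textFile("dfs:...")                                  # placeholder path from the slide
errors = lines.filter(lambda line: line.startswith("ERROR"))    # transformation (lazy)
print(errors.count())                                           # action: triggers the computation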
Example 2

Collect times of HDFS-related errors (same log fields: TYPE, MESSAGE, TIME, ...):

lines
  filter(_.startsWith("ERROR"))
errors
  filter(_.contains("HDFS"))
HDFS errors
  map(_.split('\t')(3))
time fields
  collect()

Persistence: you can specify that an RDD "persists" in memory so other queries can use it, and you can pass parameters to persist, including a priority; lower priority means it moves to disk earlier, if needed.

Pseudocode:

lines = sc.textFile("dfs:...")
errors = lines.filter(_.startsWith("ERROR"))
errors.persist()
errors.count()
errors.filter(_.contains("HDFS"))
      .map(_.split('\t')(3))
      .collect()

The recorded chain of transformations behind each RDD (lines -> errors -> HDFS errors -> time fields) is its "lineage", and the transformations and actions are written in a functional programming style.

Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael J. Franklin, Scott Shenker, Ion Stoica. "Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing." NSDI 2012, April 2012.
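A runnable PySpark version of Example 2 might look like this (a sketch only; the hdfs: path, the MEMORY_ONLY storage level, and the assumption that the time field is the fourth tab-separated column are illustrative):

Python (sketch):

from pyspark import SparkContext, StorageLevel

sc = SparkContext(appName="hdfs-error-times")

lines = sc.textFile("hdfs:///logs/app.log")                  # hypothetical path
errors = lines.filter(lambda line: line.startswith("ERROR"))
errors.persist(StorageLevel.MEMORY_ONLY)                     # keep in memory for reuse across queries

print(errors.count())                                        # first action: computes and caches `errors`

times = (errors.filter(lambda line: "HDFS" in line)          # reuses the cached RDD
               .map(lambda line: line.split("\t")[3])        # time field (assumed index 3)
               .collect())                                   # bring the results to the driver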
Advantages as Workflow System

● More efficient failure recovery
● More efficient grouping of tasks and scheduling
● Integration of programming language features:
  ○ loops (not an "acyclic" workflow system)
  ○ function libraries

(MMDSv3)

The Spark Programming Model

[Figure from Gupta, Manish. Lightening Fast Big Data Analytics using Apache Spark. UniCom 2014.]
Example

Word Count:

textFile
  flatMap(split(" "))
(words)
  map((word, 1))
tuples of (word, 1)
  reduceByKey(_ + _)
tuples of (word, count)
  saveAsTextFile

Scala:

val textFile = sc.textFile("hdfs://...")
val counts = textFile
  .flatMap(line => line.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)
counts.saveAsTextFile("hdfs://...")

Apache Spark Examples
http://spark.apache.org/examples.html
Example

Word Count:

Python:

textFile = sc.textFile("hdfs://...")
counts = (textFile
    .flatMap(lambda line: line.split(" "))       # words
    .map(lambda word: (word, 1))                 # tuples of (word, 1)
    .reduceByKey(lambda a, b: a + b))            # tuples of (word, count)
counts.saveAsTextFile("hdfs://...")

Apache Spark Examples
http://spark.apache.org/examples.html
PySpark Demo: Wordcount
Example

Word Count (simulating the map-reduce approach -- much slower!):

Python:

textFile = sc.textFile("hdfs://...")
counts = (textFile
    .flatMap(lambda line: line.split(" "))       # words
    .map(lambda word: (word, 1))                 # tuples of (word, 1)
    .groupByKey()                                # tuples of (word, [1, 1, ...])
    .mapValues(sum))                             # tuples of (word, count)
counts.saveAsTextFile("hdfs://...")
Lazy Evaluation

Spark waits to load data and execute transformations until necessary -- lazy.
Spark tries to complete actions as quickly as possible -- eager.

Why?

● Only executes what is necessary to achieve the action.
● Can optimize the complete chain of operations to reduce communication.

e.g.

rdd.map(lambda r: r[1]*r[3]).take(5)  # only executes map for five records

rdd.filter(lambda r: "ERROR" in r[0]).map(lambda r: r[1]*r[3])  # only passes through the data once
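To see the laziness directly (a small sketch with synthetic data, assuming an existing SparkContext sc), note that defining a transformation returns immediately, while the action is what triggers the actual work:

Python (sketch):

import time

rdd = sc.parallelize(range(5_000_000))

t0 = time.time()
squared = rdd.map(lambda x: x * x)                 # transformation: returns instantly, nothing computed yet
print("define map:", round(time.time() - t0, 3), "s")

t0 = time.time()
print(squared.count())                             # action: launches the job and does the work
print("count:", round(time.time() - t0, 3), "s")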
PySpark Demo: Statistics

https://data.worldbank.org/data-catalog/poverty-and-equity-database
https://databank.worldbank.org/data/download/PovStats_CSV.zip
Broadcast Variables

Read-only objects can be shared across all nodes.

A broadcast variable is a wrapper: access the object with .value

Python:

filterWords = ['one', 'two', 'three', 'four', …]

fwBC = sc.broadcast(set(filterWords))

textFile = sc.textFile("hdfs:...")
counts = (textFile
    .flatMap(lambda line: line.split(" "))
    .filter(lambda word: word in fwBC.value)
    .map(lambda word: (word, 1))
    .reduceByKey(lambda a, b: a + b))
counts.saveAsTextFile("hdfs:...")
Accumulators
Write-only (from tasks) objects that keep a running aggregation.
The default Accumulator assumes a sum function.

initialValue = 0
sumAcc = sc.accumulator(initialValue)
rdd.foreach(lambda i: sumAcc.add(i))
print(sumAcc.value)

Custom Accumulator: subclass AccumulatorParam and override its methods.

from pyspark.accumulators import AccumulatorParam
import numpy as np

class MinAccum(AccumulatorParam):
    def zero(self, zeroValue=np.inf):   # override this: initial value for each local accumulator
        return zeroValue
    def addInPlace(self, v1, v2):       # override this: how two values are combined
        return min(v1, v2)

minAcc = sc.accumulator(np.inf, MinAccum())
rdd.foreach(lambda i: minAcc.add(i))
print(minAcc.value)
Spark System: Review
● RDDs provide full recovery by backing up the transformations from stable storage rather than backing up the data itself.
● RDDs, which are immutable, can be stored in memory and thus are often much faster.
● Functional programming is used to define transformations and actions on RDDs.
Spark System: Hierarchy

● Driver
● Executors, each with:
  ○ Cores -- for executing tasks; each core is a "slot" for a task on a partition
  ○ Working Memory -- for storing persisted RDDs
  ○ Storage (disk) -- for reading from the DFS, for disk-persisted RDDs, and extra space for shuffles
Spark System: Hierarchy
Eager action -> sets off (lazy) chain of transformations
  -> launches jobs -> broken into stages -> broken into tasks

Technically, an executor is a virtual machine with slots for scheduling tasks. In practice, one core is allocated per slot and one task is run per slot at a time.


Spark System: Hierarchy
Eager action -> sets off (lazy) chain of transformations
  -> launches jobs -> broken into stages -> broken into tasks

Two types of transformations:
1) Narrow: record in -> process -> record[s] out
2) Wide: records in -> shuffle (regroup across the cluster) -> process -> record[s] out

Co-partitions: when the partitions of two RDDs are based on the same hash function and key.

Image from Nguyen: https://trongkhoanguyen.com/spark/understand-rdd-operations-transformations-and-actions/
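As a rough illustration of the two types (a minimal sketch with made-up pair data, assuming an existing SparkContext sc): mapValues and filter are narrow, since each output partition depends on a single input partition, while reduceByKey is wide because its shuffle regroups records with the same key across the cluster.

Python (sketch):

pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3), ("c", 4)], 4)

# Narrow transformations: no data moves between partitions.
scaled = pairs.mapValues(lambda v: v * 10)
positives = scaled.filter(lambda kv: kv[1] > 0)

# Wide transformation: the shuffle regroups records by key across the cluster
# (and marks a stage boundary in the job).
totals = positives.reduceByKey(lambda a, b: a + b)

print(totals.collect())   # e.g., [('a', 40), ('b', 20), ('c', 40)]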
Spark System: Hierarchy

Driver: where the program is launched; coordinates everything (like the name node in Hadoop).

To see/set restrictions: spark-defaults.conf
(most common bottleneck: spark.executor.memory)
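These settings can also be supplied programmatically through SparkConf (a sketch with illustrative values; size them to your own cluster, and note that driver-side settings are usually best placed in spark-defaults.conf or passed to spark-submit before the driver starts):

Python (sketch):

from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("configured-app")
        .set("spark.executor.memory", "8g")   # per-executor memory (working memory + storage); common bottleneck
        .set("spark.executor.cores", "4"))    # task slots per executor

sc = SparkContext(conf=conf)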


Spark System: Scheduling
Eager action -> sets off (lazy) chain of transformations
  -> launches jobs -> broken into stages -> broken into tasks

Jobs: a series of transformations (in a DAG) needed for the action.

Stages: 1 or more per job -- one per set of operations separated by a shuffle.

Tasks: many per stage -- each repeats the exact same operation on one partition.

Job
  Stage 1: Task (partition) on Core/Thread 1, Task (partition) on Core/Thread 2, ...
  -- shuffle --
  Stage 2: Task (partition) on Core/Thread 1, Task (partition) on Core/Thread 2, ...
  ...

Image from Nguyen: https://trongkhoanguyen.com/spark/understand-rdd-operations-transformations-and-actions/
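Tying this back to the word-count example (a sketch; the hdfs: paths are placeholders and sc is assumed): the narrow transformations before the shuffle form one stage, reduceByKey introduces the stage boundary, and the eager saveAsTextFile action launches the job.

Python (sketch):

textFile = sc.textFile("hdfs://...")                # placeholder path

counts = (textFile
    .flatMap(lambda line: line.split(" "))          # Stage 1 (narrow): one task per input partition
    .map(lambda word: (word, 1))                    # still Stage 1
    .reduceByKey(lambda a, b: a + b))               # shuffle here => boundary between Stage 1 and Stage 2

counts.saveAsTextFile("hdfs://...")                 # eager action: launches the job
                                                    # (Stage 1 tasks, shuffle, then Stage 2 tasks write the output)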
Spark Overview: MapReduce or Spark?

● Spark is typically faster:
  ○ RDDs in memory
  ○ Lazy evaluation enables optimizing the chain of operations.
● Spark is typically more flexible (custom chains of transformations).

However:
● Still need HDFS (or some DFS) to hold original or resulting data efficiently and reliably.
● Memory across the Spark cluster should be large enough to hold the entire dataset to fully leverage its speed.

Thus, MapReduce may sometimes be more cost-effective for very large data that does not fit in memory.
