H. Andrew Schwartz
CSE545
Spring 2023
Big Data Analytics, The Class
Goal: Generalizations
A model or summarization of the data.
Spark's Big Idea
Resilient Distributed Datasets (RDDs) -- read-only, partitioned collections of
records (like a DFS), but with a record of how the dataset was created as a
combination of transformations from other dataset(s).

[Diagram: dfs://filename --create RDD--> RDD1 --transformation1()--> RDD2
--transformation2()--> RDD3; RDD2 --transformation3()--> RDD4. Each RDD records
its lineage ("created from dfs://filename", "transformation1 from RDD1",
"transformation2 from RDD2", "transformation3 from RDD2") and will recreate its
data from that record if lost.]
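To make lineage concrete, here is a minimal PySpark sketch (assuming a running
SparkContext `sc`; the path and the particular transformations are hypothetical
stand-ins). Each call returns a new read-only RDD that records how it was
derived, so a lost partition is recomputed from its parents rather than
restored from a backup.

Python:
rdd1 = sc.textFile("dfs://...")               # RDD1: created from dfs://filename
rdd2 = rdd1.map(lambda line: line.strip())    # RDD2: transformation1 from RDD1
rdd3 = rdd2.filter(lambda l: len(l) > 0)      # RDD3: transformation2 from RDD2
rdd4 = rdd2.distinct()                        # RDD4: transformation3 from RDD2
print(rdd4.toDebugString())                   # shows the recorded lineage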
(original) Transformations: RDD to RDD
Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael J. Franklin, Scott Shenker, Ion Stoica. "Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing." NSDI 2012, April 2012.
Current Transformations and Actions
http://spark.apache.org/docs/latest/rdd-programming-guide.html#transformations
http://spark.apache.org/docs/latest/rdd-programming-guide.html#actions
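A brief sketch of the distinction (assuming a running SparkContext `sc`):
transformations are lazy and return a new RDD; actions are eager and return a
value to the driver.

Python:
pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3)])
summed = pairs.reduceByKey(lambda a, b: a + b)  # transformation: lazy, RDD -> RDD
print(summed.collect())  # action: eager, returns [('a', 4), ('b', 2)] (order may vary)
print(summed.count())    # action: returns 2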
Example
Pseudocode:
  lines = sc.textFile("dfs:...")
  errors = lines.filter(_.startswith("ERROR"))
  errors.count()

[Diagram: lineage graph -- dfs file -> lines --filter--> errors --count()-> result.]

Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael J. Franklin, Scott Shenker, Ion Stoica. "Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing." NSDI 2012, April 2012.
Example 2
Pseudocode:
  lines = sc.textFile("dfs:...")
  errors = lines.filter(_.startswith("ERROR"))
  errors.persist()
  errors.count()
  ...

Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael J. Franklin, Scott Shenker, Ion Stoica. "Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing." NSDI 2012, April 2012.
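A runnable PySpark counterpart to the pseudocode above (hypothetical path;
assumes a running SparkContext `sc`). persist() keeps `errors` in memory so
the later actions reuse it instead of re-reading and re-filtering the file.

Python:
lines = sc.textFile("hdfs://...")                         # hypothetical path
errors = lines.filter(lambda line: line.startswith("ERROR"))
errors.persist()                                          # cache in memory for reuse
print(errors.count())                                     # first action triggers the chain
hdfs_errors = errors.filter(lambda line: "HDFS" in line)  # further (lazy) transformation
print(hdfs_errors.count())                                # reuses the persisted errors RDD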
Example
Pseudocode:
  errors.filter(_.contains("HDFS"))

The recorded chain of transformations that produced an RDD is its "lineage". (MMDSv3)
The Spark Programming Model
Gupta, Manish. Lightening Fast Big Data Analytics using Apache Spark. UniCom 2014.
Example
Scala:
val textFile = sc.textFile("hdfs://...")
val counts = textFile
  .flatMap(line => line.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)
counts.saveAsTextFile("hdfs://...")

Python:
textFile = sc.textFile("hdfs://...")
counts = (textFile
  .flatMap(lambda line: line.split(" "))
  .map(lambda word: (word, 1))
  .reduceByKey(lambda a, b: a + b))
counts.saveAsTextFile("hdfs://...")

[Diagram: lines --flatMap--> (words) --map--> tuples of (word, 1)
--reduceByKey--> tuples of (word, count) --> saveAsTextFile.]
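The examples above assume a SparkContext `sc` already exists, as it does in the
PySpark shell. A minimal sketch of creating one in a standalone script (the app
name and local master are illustrative):

Python:
from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("wordCount").setMaster("local[*]")  # local mode, all cores
sc = SparkContext(conf=conf)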
Lazy Evaluation
Spark waits to load data and execute transformations until necessary -- lazy.
Spark tries to complete actions as quickly as possible -- eager.
Why?
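One way to see the split (a sketch assuming a running `sc`): the
transformations below return almost immediately because they only record
lineage; loading and computing happen at the action.

Python:
import time

rdd = sc.parallelize(range(1000000))
t0 = time.time()
evens = rdd.map(lambda x: x * x).filter(lambda x: x % 2 == 0)  # lazy: no work yet
print("transformations took", time.time() - t0, "s")           # ~0 seconds
t0 = time.time()
print("count:", evens.count())                                 # eager: runs the chain
print("action took", time.time() - t0, "s")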
e.g.
https://data.worldbank.org/data-catalog/poverty-and-equity-database
https://databank.worldbank.org/data/download/PovStats_CSV.zip
Broadcast Variables
Python:
textFile = sc.textFile("hdfs:...")
counts = (textFile
  .flatMap(lambda line: line.split(" "))
  .filter(lambda word: word in fwBC.value)
  .map(lambda word: (word, 1))
  .reduceByKey(lambda a, b: a + b))
counts.saveAsTextFile("hdfs:...")
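The code above reads fwBC.value, but the broadcast variable's creation is not
shown; a minimal sketch of how it might be made (the function-word set is a
hypothetical stand-in). sc.broadcast() ships one read-only copy to each
executor rather than one per task.

Python:
funcWords = {"the", "a", "of", "and", "to"}  # hypothetical function-word set
fwBC = sc.broadcast(funcWords)               # shipped once per executor
print(fwBC.value)                            # tasks read it via .value, as in the filter above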
Accumulators
Write-only objects that keep a running aggregation
Default Accumulator assumes sum function
initialValue = 0
sumAcc = sc.accumulator(initialValue)
rdd.foreach(lambda i: sumAcc.add(i))
print(sumAcc.value)
Accumulators
Write-only objects that keep a running aggregation.
Default Accumulator assumes a sum function.
Custom Accumulator: inherit from AccumulatorParam as a class and override its methods:

import numpy as np
from pyspark import AccumulatorParam

class MinAccum(AccumulatorParam):
    def zero(self, zeroValue=np.inf):  # override this
        return zeroValue
    def addInPlace(self, v1, v2):  # override this
        return min(v1, v2)

minAcc = sc.accumulator(np.inf, MinAccum())
rdd.foreach(lambda i: minAcc.add(i))
print(minAcc.value)
Spark System: Review
● RDDs provide full recovery by recording the transformations applied since
stable storage rather than backing up the data itself.
● RDDs, which are immutable, can be stored in memory and thus are often
much faster.
● Functional programming is used to define transformations and actions on
RDDs.
Spark System: Hierarchy

[Diagram: a Driver coordinates a cluster of Executors; each Executor has Cores,
Working Memory, Storage, and local Disks.]

Driver: where the program is launched; coordinates everything (like the name
node in Hadoop).
Executors: technically, a virtual machine with slots for scheduling tasks. In
practice, one core is allocated per slot and one task is run per slot at a time.
Cores: for executing tasks.
Working Memory: for storing persisted RDDs.
Storage and Disks: for reading from the DFS; disk-persisted RDDs; extra space
for shuffles.

Eager action -> sets off (lazy) chain of transformations
-> launches jobs -> broken into stages -> broken into tasks

Two types of transformations:
1) Narrow: record in -> process -> record[s] out
2) Wide: records in -> shuffle: regroup across cluster -> process -> record[s] out

Co-partitions: if the partitions of two RDDs are based on the same hash
function and key.

Image from Nguyen: https://trongkhoanguyen.com/spark/understand-rdd-operations-transformations-and-actions/
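To connect this hierarchy to configuration, a hedged sketch using real Spark
property names (the values are illustrative; spark.executor.instances applies
under cluster managers such as YARN):

Python:
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("hierarchyDemo")
        .set("spark.executor.instances", "4")  # number of executors
        .set("spark.executor.cores", "2")      # cores (task slots) per executor
        .set("spark.executor.memory", "4g"))   # working + storage memory per executor
sc = SparkContext(conf=conf)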
Spark System: Scheduling
Eager action -> sets off (lazy) chain of transformations
-> launches jobs -> broken into stages -> broken into tasks

[Diagram: a Job is broken into Stages at shuffle boundaries (Stage 1 ->
shuffle -> Stage 2 -> ...); each Stage is broken into Tasks, one per partition,
each running on a core/thread.]

Image from Nguyen: https://trongkhoanguyen.com/spark/understand-rdd-operations-transformations-and-actions/
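A small sketch (assuming `sc`) of how the breakdown shows up in practice: the
narrow map is pipelined within a stage, the wide reduceByKey forces a shuffle
and a new stage, and each stage runs one task per partition.

Python:
rdd = sc.parallelize(range(100), 4)             # 4 partitions -> 4 tasks per stage
pairs = rdd.map(lambda x: (x % 10, 1))          # narrow: stays in the same stage
counts = pairs.reduceByKey(lambda a, b: a + b)  # wide: shuffle -> stage boundary
print(counts.getNumPartitions())                # tasks in the post-shuffle stage
print(counts.collect())                         # eager action: launches the job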
Spark Overview: MapReduce or Spark?
● Spark is typically faster
  ○ RDDs in memory
  ○ Lazy evaluation enables optimizing the chain of operations.
● Spark is typically more flexible (custom chains of transformations)
However:
● Still need HDFS (or some DFS) to hold original or resulting data efficiently
and reliably.
● Memory across the Spark cluster should be large enough to hold the entire
dataset to fully leverage its speed.
Thus, MapReduce may sometimes be more cost-effective for very large data that
does not fit in memory.