Spark uses resilient distributed datasets (RDDs) that allow data to be rebuilt on the fly based on transformations applied to other datasets. This avoids writing to disk and enables faster processing by keeping data in memory when possible. RDDs track how a dataset was created from previous transformations to rebuild it when needed. This makes Spark more efficient than MapReduce for iterative jobs, streaming data, and interactive applications that require frequent random access.


Spark: Resilient Distributed Datasets, Workflow System

H. Andrew Schwartz
CSE545, Spring 2023
Big Data Analytics, The Class
Goal: Generalizations -- a model or summarization of the data.

Data Workflow Frameworks:
● Hadoop File System
● MapReduce
● Spark
● Streaming
● Deep Learning Frameworks

Analytics and Algorithms:
● Similarity Search
● Hypothesis Testing
● Transformers/Self-Supervision
● Recommendation Systems
● Link Analysis
Where is MapReduce Inefficient?
● Long pipelines sharing data
● Interactive applications
● Streaming applications
● Iterative algorithms (optimization problems)

DFS -> Map -> LocalFS -> Network -> Reduce -> DFS -> Map -> ...

(Anytime MapReduce would need to write to and read from disk a lot.)
Spark’s Big Idea
Resilient Distributed Datasets (RDDs) -- a read-only, partitioned collection of records (like a DFS) but with a record of how the dataset was created as a combination of transformations from other dataset(s).

An RDD is created either from "stable storage" (e.g., a file such as dfs://filename) or from other RDDs via transformations (map, filter, join, ...).

Example lineage:
● RDD1: the data, created from dfs://filename
● RDD2: transformation1() applied to RDD1
● RDD3: transformation2() applied to RDD2
● RDD4: transformation3() applied to RDD2

Because every RDD records its lineage, Spark can drop the data of an intermediate RDD (e.g., RDD2) and recreate it on demand when a later transformation (here, transformation3() for RDD4) needs it.

● Enables rebuilding datasets on the fly.
● Intermediate datasets are not stored on disk (and are kept in memory only if needed and there is enough space)
-> Faster communication and I/O.
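To make the lineage idea concrete, here is a minimal PySpark sketch (the dfs://filename path comes from the slide; the specific transformations are hypothetical). toDebugString() returns a description of the RDD and its recursive dependencies -- the lineage Spark would replay to rebuild a dropped dataset.

Python (sketch):

from pyspark import SparkContext

sc = SparkContext(appName="rdd-lineage-sketch")

rdd1 = sc.textFile("dfs://filename")                    # RDD1: created from stable storage
rdd2 = rdd1.map(lambda line: line.split("\t"))          # RDD2: transformation1 from RDD1
rdd3 = rdd2.filter(lambda fields: len(fields) > 2)      # RDD3: transformation2 from RDD2

# Each RDD records how it was derived; if RDD2's data is dropped from memory,
# Spark recomputes it from RDD1 whenever RDD3 (or any later RDD) needs it.
print(rdd3.toDebugString())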
(original) Transformations: RDD to RDD
(original) Actions: RDD to Value/Object, or Storage

Transformations take one or more existing RDDs (operating on multiple records at a time) and define a new RDD; actions return a value/object to the driver program or write data out to storage.

[The slide shows the table of the original transformations and actions from the RDD paper.]

Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael J. Franklin, Scott Shenker, Ion Stoica. "Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing." NSDI 2012, April 2012.
Current Transformations and Actions

http://spark.apache.org/docs/latest/rdd-programming-guide.html#transformations

common transformations: filter, map, flatMap, reduceByKey, groupByKey

http://spark.apache.org/docs/latest/rdd-programming-guide.html#actions

common actions: collect, count, take
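As a quick illustration (a minimal sketch with synthetic data, assuming an existing SparkContext sc as in the pyspark shell), transformations lazily define new RDDs while actions bring results back to the driver:

Python (sketch):

nums = sc.parallelize(range(10))             # source RDD from an in-memory list
evens = nums.filter(lambda x: x % 2 == 0)    # transformation: RDD -> RDD (lazy)
doubled = evens.map(lambda x: x * 2)         # transformation: RDD -> RDD (lazy)

print(doubled.take(3))                       # action: [0, 4, 8] returned to the driver
print(doubled.count())                       # action: 5
print(doubled.collect())                     # action: [0, 4, 8, 12, 16]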


Example

Count errors in a log file (each line has tab-separated fields: TYPE, MESSAGE, TIME, ...):

lines
  filter(_.startsWith("ERROR"))
errors
  count()

Pseudocode:

lines = sc.textFile("dfs:...")
errors = lines.filter(_.startsWith("ERROR"))
errors.count()
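A PySpark equivalent of the pseudocode above might look like the following (a sketch, assuming an existing SparkContext sc; the dfs: path is a placeholder as in the slide):

Python (sketch):

lines = sc.textFile("dfs:...")                                  # placeholder path from the slide
errors = lines.filter(lambda line: line.startswith("ERROR"))    # transformation (lazy)
print(errors.count())                                           # action: triggers the computation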
Example 2

Collect times of HDFS-related errors (same log fields: TYPE, MESSAGE, TIME, ...):

lines
  filter(_.startsWith("ERROR"))
errors
  filter(_.contains("HDFS"))
HDFS errors
  map(_.split('\t')(3))
time fields
  collect()

Persistence: you can specify that an RDD "persists" in memory so other queries can use it, and you can pass parameters to persist, including a priority; lower priority means it moves to disk earlier, if needed.

Pseudocode:

lines = sc.textFile("dfs:...")
errors = lines.filter(_.startsWith("ERROR"))
errors.persist()
errors.count()
errors.filter(_.contains("HDFS"))
      .map(_.split('\t')(3))
      .collect()

The recorded chain of transformations behind each RDD (lines -> errors -> HDFS errors -> time fields) is its "lineage", and the transformations and actions are written in a functional programming style.

Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael J. Franklin, Scott Shenker, Ion Stoica. "Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing." NSDI 2012, April 2012.
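A runnable PySpark version of Example 2 might look like this (a sketch only; the hdfs: path, the MEMORY_ONLY storage level, and the assumption that the time field is the fourth tab-separated column are illustrative):

Python (sketch):

from pyspark import SparkContext, StorageLevel

sc = SparkContext(appName="hdfs-error-times")

lines = sc.textFile("hdfs:///logs/app.log")                  # hypothetical path
errors = lines.filter(lambda line: line.startswith("ERROR"))
errors.persist(StorageLevel.MEMORY_ONLY)                     # keep in memory for reuse across queries

print(errors.count())                                        # first action: computes and caches `errors`

times = (errors.filter(lambda line: "HDFS" in line)          # reuses the cached RDD
               .map(lambda line: line.split("\t")[3])        # time field (assumed index 3)
               .collect())                                   # bring the results to the driver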
Advantages as Workflow System

● More efficient failure recovery
● More efficient grouping of tasks and scheduling
● Integration of programming language features:
  ○ loops (not an "acyclic" workflow system)
  ○ function libraries

(MMDSv3)

The Spark Programming Model

[Figure from Gupta, Manish. Lightening Fast Big Data Analytics using Apache Spark. UniCom 2014.]
Example

Word Count:

textFile
  flatMap(split(" "))
(words)
  map((word, 1))
tuples of (word, 1)
  reduceByKey(_ + _)
tuples of (word, count)
  saveAsTextFile

Scala:

val textFile = sc.textFile("hdfs://...")
val counts = textFile
  .flatMap(line => line.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)
counts.saveAsTextFile("hdfs://...")

Apache Spark Examples
http://spark.apache.org/examples.html
Example

Word Count:

Python:

textFile = sc.textFile("hdfs://...")
counts = (textFile
    .flatMap(lambda line: line.split(" "))       # words
    .map(lambda word: (word, 1))                 # tuples of (word, 1)
    .reduceByKey(lambda a, b: a + b))            # tuples of (word, count)
counts.saveAsTextFile("hdfs://...")

Apache Spark Examples
http://spark.apache.org/examples.html
PySpark Demo: Wordcount
Example

Word Count (simulating the map-reduce approach -- much slower!):

Python:

textFile = sc.textFile("hdfs://...")
counts = (textFile
    .flatMap(lambda line: line.split(" "))       # words
    .map(lambda word: (word, 1))                 # tuples of (word, 1)
    .groupByKey()                                # tuples of (word, [1, 1, ...])
    .mapValues(sum))                             # tuples of (word, count)
counts.saveAsTextFile("hdfs://...")
Lazy Evaluation

Spark waits to load data and execute transformations until necessary -- lazy.
Spark tries to complete actions as quickly as possible -- eager.

Why?

● Only executes what is necessary to achieve the action.
● Can optimize the complete chain of operations to reduce communication.

e.g.

rdd.map(lambda r: r[1]*r[3]).take(5)  # only executes map for five records

rdd.filter(lambda r: "ERROR" in r[0]).map(lambda r: r[1]*r[3])  # only passes through the data once
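To see the laziness directly (a small sketch with synthetic data, assuming an existing SparkContext sc), note that defining a transformation returns immediately, while the action is what triggers the actual work:

Python (sketch):

import time

rdd = sc.parallelize(range(5_000_000))

t0 = time.time()
squared = rdd.map(lambda x: x * x)                 # transformation: returns instantly, nothing computed yet
print("define map:", round(time.time() - t0, 3), "s")

t0 = time.time()
print(squared.count())                             # action: launches the job and does the work
print("count:", round(time.time() - t0, 3), "s")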
PySpark Demo: Statistics

https://data.worldbank.org/data-catalog/poverty-and-equity-database
https://databank.worldbank.org/data/download/PovStats_CSV.zip
Broadcast Variables

Read-only objects can be shared across all nodes.

A broadcast variable is a wrapper: access the object with .value

Python:

filterWords = ['one', 'two', 'three', 'four', …]

fwBC = sc.broadcast(set(filterWords))

textFile = sc.textFile("hdfs:...")
counts = (textFile
    .flatMap(lambda line: line.split(" "))
    .filter(lambda word: word in fwBC.value)
    .map(lambda word: (word, 1))
    .reduceByKey(lambda a, b: a + b))
counts.saveAsTextFile("hdfs:...")
Accumulators
Write-only (from tasks) objects that keep a running aggregation.
The default Accumulator assumes a sum function.

initialValue = 0
sumAcc = sc.accumulator(initialValue)
rdd.foreach(lambda i: sumAcc.add(i))
print(sumAcc.value)

Custom Accumulator: subclass AccumulatorParam and override its methods.

from pyspark.accumulators import AccumulatorParam
import numpy as np

class MinAccum(AccumulatorParam):
    def zero(self, zeroValue=np.inf):   # override this: initial value for each local accumulator
        return zeroValue
    def addInPlace(self, v1, v2):       # override this: how two values are combined
        return min(v1, v2)

minAcc = sc.accumulator(np.inf, MinAccum())
rdd.foreach(lambda i: minAcc.add(i))
print(minAcc.value)
Spark System: Review
● RDDs provide full recovery by backing up the transformations from stable storage rather than backing up the data itself.
● RDDs, which are immutable, can be stored in memory and thus are often much faster.
● Functional programming is used to define transformations and actions on RDDs.
Spark System: Hierarchy

● Driver
● Executors, each with:
  ○ Cores -- for executing tasks; each core is a "slot" for a task on a partition
  ○ Working Memory -- for storing persisted RDDs
  ○ Storage (disk) -- for reading from the DFS, for disk-persisted RDDs, and extra space for shuffles
Spark System: Hierarchy
Eager action -> sets off (lazy) chain of transformations
  -> launches jobs -> broken into stages -> broken into tasks

Technically, an executor is a virtual machine with slots for scheduling tasks. In practice, one core is allocated per slot and one task is run per slot at a time.


Spark System: Hierarchy
Eager action -> sets off (lazy) chain of transformations
  -> launches jobs -> broken into stages -> broken into tasks

Two types of transformations:
1) Narrow: record in -> process -> record[s] out
2) Wide: records in -> shuffle (regroup across the cluster) -> process -> record[s] out

Co-partitions: when the partitions of two RDDs are based on the same hash function and key.

Image from Nguyen: https://trongkhoanguyen.com/spark/understand-rdd-operations-transformations-and-actions/
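As a rough illustration of the two types (a minimal sketch with made-up pair data, assuming an existing SparkContext sc): mapValues and filter are narrow, since each output partition depends on a single input partition, while reduceByKey is wide because its shuffle regroups records with the same key across the cluster.

Python (sketch):

pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3), ("c", 4)], 4)

# Narrow transformations: no data moves between partitions.
scaled = pairs.mapValues(lambda v: v * 10)
positives = scaled.filter(lambda kv: kv[1] > 0)

# Wide transformation: the shuffle regroups records by key across the cluster
# (and marks a stage boundary in the job).
totals = positives.reduceByKey(lambda a, b: a + b)

print(totals.collect())   # e.g., [('a', 40), ('b', 20), ('c', 40)]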
Spark System: Hierarchy

Driver: where the program is launched; coordinates everything (like the name node in Hadoop).

To see/set restrictions: spark-defaults.conf
(most common bottleneck: spark.executor.memory)
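These settings can also be supplied programmatically through SparkConf (a sketch with illustrative values; size them to your own cluster, and note that driver-side settings are usually best placed in spark-defaults.conf or passed to spark-submit before the driver starts):

Python (sketch):

from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("configured-app")
        .set("spark.executor.memory", "8g")   # per-executor memory (working memory + storage); common bottleneck
        .set("spark.executor.cores", "4"))    # task slots per executor

sc = SparkContext(conf=conf)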


Spark System: Scheduling
Eager action -> sets off (lazy) chain of transformations
  -> launches jobs -> broken into stages -> broken into tasks

Jobs: a series of transformations (in a DAG) needed for the action.

Stages: 1 or more per job -- one per set of operations separated by a shuffle.

Tasks: many per stage -- each repeats the exact same operation on one partition.

Job
  Stage 1: Task (partition) on Core/Thread 1, Task (partition) on Core/Thread 2, ...
  -- shuffle --
  Stage 2: Task (partition) on Core/Thread 1, Task (partition) on Core/Thread 2, ...
  ...

Image from Nguyen: https://trongkhoanguyen.com/spark/understand-rdd-operations-transformations-and-actions/
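Tying this back to the word-count example (a sketch; the hdfs: paths are placeholders and sc is assumed): the narrow transformations before the shuffle form one stage, reduceByKey introduces the stage boundary, and the eager saveAsTextFile action launches the job.

Python (sketch):

textFile = sc.textFile("hdfs://...")                # placeholder path

counts = (textFile
    .flatMap(lambda line: line.split(" "))          # Stage 1 (narrow): one task per input partition
    .map(lambda word: (word, 1))                    # still Stage 1
    .reduceByKey(lambda a, b: a + b))               # shuffle here => boundary between Stage 1 and Stage 2

counts.saveAsTextFile("hdfs://...")                 # eager action: launches the job
                                                    # (Stage 1 tasks, shuffle, then Stage 2 tasks write the output)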
Spark Overview: MapReduce or Spark?

● Spark is typically faster:
  ○ RDDs in memory
  ○ Lazy evaluation enables optimizing the chain of operations.
● Spark is typically more flexible (custom chains of transformations).

However:
● Still need HDFS (or some DFS) to hold original or resulting data efficiently and reliably.
● Memory across the Spark cluster should be large enough to hold the entire dataset to fully leverage its speed.

Thus, MapReduce may sometimes be more cost-effective for very large data that does not fit in memory.
