
CS4225/CS5425 Big Data Systems for Data Science
Spark I: Basics

Ai Xin
School of Computing
National University of Singapore
[email protected]

1
Intro
 Lecturer: Ai Xin
 Email: [email protected]
 Office Hours: 2-3pm on 20 Oct, 3, 17 and 24 Nov at COM3-B1-24
 TAs
 Assignment 2 (Post to Canvas/Discussion or Email TAs)
• SIDDARTH NANDANAHOSUR SURESH (Name A-G)
• TAN TZE YEONG (Name H-L)
• TAN YAN RONG AMELIA (Name L-R)
• TENG YI SHIONG (Name R-W)
• TOH WEI JIE (Name W-Z)

 Tutorial and Lecture (Post to Canvas/Discussion or Email TAs)


• ZHANG JIHAI (weeks 7–9)
• GOH TECK LUN (conducts tutorials)
• Hu Zhiyuan (weeks 10–13)

2
Schedule

3
Today’s Plan
 Introduction and Basics
 Working with RDDs
 Caching and DAGs
 DataFrames and Datasets

4
Motivation: Hadoop vs Spark

 Issues with Hadoop MapReduce:
 Network and disk I/O costs: intermediate data has to be written to local
disks and shuffled across machines, which is slow
 Not suitable for iterative processing (i.e. repeatedly modifying small
amounts of data), such as interactive workflows, since each individual
step has to be modelled as a separate MapReduce job
 Spark stores most of its intermediate results in memory, making
it much faster, especially for iterative processing
 When memory is insufficient, Spark spills to disk, which incurs disk I/O
5
Performance Comparison

6
Ease of Programmability

WordCount (Hadoop MapReduce)


7
Ease of Programmability

val file = sc.textFile("hdfs://...")

val counts = file.flatMap(line => line.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)
counts.saveAsTextFile("...")

WordCount (Spark)

8
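For comparison, a minimal PySpark sketch of the same word count (the input and output paths are placeholders, as in the Scala version above):

file = sc.textFile("hdfs://...")

counts = (file.flatMap(lambda line: line.split(" "))   # split each line into words
              .map(lambda word: (word, 1))             # pair each word with a count of 1
              .reduceByKey(lambda a, b: a + b))        # sum the counts per word

counts.saveAsTextFile("...")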
Spark Components and API Stack

9
Spark Architecture

 Driver Process responds to user input, manages the Spark application etc., and
distributes work to Executors, which run the code assigned to them and send
the results back to the driver
 Cluster Manager (can be Spark’s standalone cluster manager, YARN, Mesos or
Kubernetes) allocates resources when the application requests it
 In local mode, all these processes run on the same machine
10
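As a concrete illustration, here is a minimal sketch of starting a Spark application in local mode with PySpark (the application name is arbitrary; "local[*]" requests one worker thread per core):

from pyspark.sql import SparkSession

# Start a Spark application in local mode: the driver and executors run on
# this machine, with no external cluster manager involved.
spark = (SparkSession.builder
         .appName("cs4225-demo")    # arbitrary application name
         .master("local[*]")        # local mode, one worker thread per core
         .getOrCreate())

sc = spark.sparkContext             # the underlying SparkContext used in later examples
print(sc.master)                    # e.g. "local[*]"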
Evolution of Spark APIs

Resilient Distributed Datasets (2011)
• A collection of JVM objects
• Functional operators (map, filter, etc.)

DataFrame (2013)
• A collection of Row objects
• Expression-based operations
• Logical plans and optimizer

DataSet (2013)
• Internally rows, externally JVM objects
• Almost the "best of both worlds": type safe + fast

11
Today’s Plan
 Introduction and Basics
 Working with RDDs
 Caching and DAGs
 DataFrames and Datasets

12
Resilient Distributed Datasets (RDDs)
 Resilient: achieve fault tolerance through lineages
 Distributed: represent a collection of objects that is distributed over machines

13
RDD: Distributed Data
# Create an RDD of names, distributed over 3 partitions
dataRDD = sc.parallelize(["Alice", "Bob", "Carol", "Daniel"], 3)

 RDDs are immutable, i.e. they cannot be changed once created.
 This is an RDD with 4 strings. On actual hardware, it will be
partitioned across the 3 workers.

(Diagram: the driver partitions the data into 3 parts; the workers hold
[Alice, Bob], [Carol] and [Daniel] respectively.)
14
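A quick way to inspect how an RDD was partitioned is sketched below (the exact split across partitions is decided by Spark):

dataRDD = sc.parallelize(["Alice", "Bob", "Carol", "Daniel"], 3)

print(dataRDD.getNumPartitions())   # 3
# glom() groups the elements of each partition into a list, so collecting it
# shows how the four names were split across the 3 partitions.
print(dataRDD.glom().collect())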
Transformations
 Transformations are a way of transforming RDDs into RDDs.

# Create an RDD: length of names


dataRDD = sc.parallelize(["Alice", "Bob", "Carol", "Daniel"], 3)
nameLen = dataRDD.map(lambda s: len(s))

 This represents the transformation that maps each string to its


length, creating a new RDD.
 However, transformations are lazy. This means a transformation
will not be executed until an action is called on it
 Q: what are the advantages of being lazy?
 A: Spark can optimize the query plan to improve speed (e.g. removing
unneeded operations)
 Examples of transformations: map, order, groupBy,
filter, join, select
15
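A small sketch of this laziness: the filter below is only recorded, and nothing runs on the workers until the count() action is called:

dataRDD = sc.parallelize(["Alice", "Bob", "Carol", "Daniel"], 3)

# Returns immediately: Spark only records the transformation in its plan.
longNames = dataRDD.filter(lambda s: len(s) > 3)

# Only this action triggers the actual computation on the workers.
print(longNames.count())   # 3 (Alice, Carol and Daniel)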
Actions
 Actions trigger Spark to compute a result from a series of
transformations.

dataRDD = sc.parallelize(["Alice", "Bob", "Carol", "Daniel"], 3)


nameLen = dataRDD.map(lambda s: len(s))
nameLen.collect()

[5, 3, 5, 6]

 collect() here is an action.


 It is the action that asks Spark to retrieve all elements of the RDD to the driver
node.
 Examples of actions: show, count, save, collect
16
Distributed Processing
# Create an RDD: length of names
dataRDD = sc.parallelize(["Alice", "Bob", "Carol", "Daniel"], 3)
nameLen = dataRDD.map(lambda s: len(s))
nameLen.collect()

 As we previously said, RDDs are actually distributed across machines.
 Thus, the transformations and actions are executed in parallel. The
results are only sent to the driver in the final step.

(Diagram, built up over slides 17-20: each worker applies map to its own
partition, [Alice, Bob] -> [5, 3], [Carol] -> [5], [Daniel] -> [6]; collect
then sends the partial results to the driver, which assembles [5, 3, 5, 6].)
20
Working with RDDs
Note: this reads the file on each worker node in parallel, not on the
driver node

textFile = sc.textFile("File.txt")

linesWithSpark = textFile.filter(lambda line: "Spark" in line)

linesWithSpark.count()
74

linesWithSpark.first()
# Apache Spark

(Diagram: a chain of transformations turns one RDD into another,
RDD -> RDD -> RDD -> RDD; an action then turns the final RDD into a value.)
Today’s Plan
 Introduction and Basics
 Working with RDDs
 Caching and DAGs
 DataFrames and Datasets

22
Caching

Log Mining example: Load error messages from a log into memory, then
interactively search for various patterns

lines = sc.textFile("hdfs://...")
errors = lines.filter(lambda s: s.startswith("ERROR"))
messages = errors.map(lambda s: s.split("\t")[2])
messages.cache()

messages.filter(lambda s: "mysql" in s).count()
messages.filter(lambda s: "php" in s).count()

(Diagram, built up over slides 23-33: the driver sends tasks to the three
workers; for the first count, each worker reads its HDFS block, computes and
caches its partition of messages, and returns its result to the driver; for
the second count, the workers process the already-cached partitions, so no
HDFS read is needed.)

Cache your data → Faster Results
Full-text search of Wikipedia
• 60GB on 20 EC2 machines
• 0.5 sec from mem vs. 20s for on-disk
Caching
 cache(): saves an RDD to memory (of each worker node).
 persist(options): can be used to save an RDD to memory,
disk, or off-heap memory
 When should we cache or not cache an RDD?
 When it is expensive to compute and needs to be re-used multiple times.
 If worker nodes do not have enough memory, they will evict the least
recently used RDD partitions. So, be aware of memory limitations when caching.

34
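A short sketch of the two calls in PySpark (the storage level names come from the standard pyspark StorageLevel API):

from pyspark import StorageLevel

logs = sc.textFile("hdfs://...")
errors = logs.filter(lambda s: s.startswith("ERROR"))

# cache() keeps the computed partitions in executor memory.
errors.cache()

# persist() lets you pick the storage level explicitly, e.g. spill to disk
# when memory runs out instead of evicting and recomputing.
warnings = logs.filter(lambda s: s.startswith("WARN"))
warnings.persist(StorageLevel.MEMORY_AND_DISK)

errors.count()   # first action materializes and caches 'errors'
errors.count()   # second action reads the cached partitions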
Directed Acyclic Graph (DAG)

 Internally, Spark creates a graph


(“directed acyclic graph”) which
represents all the RDD objects
and how they will be
transformed.
 Transformations construct this
graph; actions trigger
computations on it.

val file = sc.textFile("hdfs://...")

val counts = file.flatMap(line => line.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)
counts.saveAsTextFile("...")
35
WordCount (Spark)
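To look at the lineage graph Spark has recorded for an RDD, toDebugString prints it; a PySpark sketch (the exact output format varies between Spark versions):

file = sc.textFile("hdfs://...")
counts = (file.flatMap(lambda line: line.split(" "))
              .map(lambda word: (word, 1))
              .reduceByKey(lambda a, b: a + b))

# Prints the recorded lineage; the ShuffledRDD entry marks the stage
# boundary introduced by reduceByKey.
print(counts.toDebugString().decode())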
Narrow and Wide Dependencies
 Narrow dependencies are where each
partition of the parent RDD is used by at
most 1 partition of the child RDD
 E.g. map, flatMap, filter, contains
 Wide dependencies are the opposite (each
partition of parent RDD is used by multiple
partitions of the child RDD)
 E.g. reduceByKey, groupBy, orderBy
 In the DAG, consecutive narrow
dependencies are grouped together as
“stages”.
 Within stages, Spark performs consecutive
transformations on the same machines.
 Across stages, data needs to be shuffled, i.e.
exchanged across partitions, in a process
very similar to map-reduce, which involves
writing intermediate results to disk
 Minimizing shuffling is good practice for improving performance.
36
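A small sketch of the distinction: filter and map below are narrow dependencies, while reduceByKey is wide, so Spark shuffles data and starts a new stage at that point:

nums = sc.parallelize(range(1000), 4)

# Narrow: each output partition depends on exactly one input partition.
evens = nums.filter(lambda x: x % 2 == 0)
pairs = evens.map(lambda x: (x % 10, x))

# Wide: grouping by key needs rows from every partition, so Spark shuffles
# the data and begins a new stage here.
sums = pairs.reduceByKey(lambda a, b: a + b)
sums.collect()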
Lineage and Fault Tolerance
 Unlike Hadoop, Spark does not use
replication to allow fault tolerance.
Why?
 Spark tries to store all the data in
memory, not disk. Memory capacity is
much more limited than disk, so simply
duplicating all data is expensive.
 Lineage approach: if a worker node
goes down, we replace it by a new
worker node, and use the graph
(DAG) to recompute the data in the
lost partition.
 Note that we only need to recompute the partitions that were lost, not
the entire RDD.
37
Today’s Plan
 Introduction and Basics
 Working with RDDs
 Caching and DAGs
 DataFrames and Datasets

38
DataFrames
 A DataFrame represents a table of data, similar to tables in SQL, or
DataFrames in pandas.
 Compared to RDDs, this is a higher level interface, e.g. it has
transformations that resemble SQL operations.
 DataFrames (and Datasets) are the recommended interface for working with
Spark – they are easier to use than RDDs and almost all tasks can be done
with them, while only rarely using the RDD functions.
 However, all DataFrame operations are still ultimately compiled down to
RDD operations by Spark.

39
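As a minimal illustration before the CSV example on the next slide, a DataFrame can also be created directly from local data (the column names here are made up):

df = spark.createDataFrame(
    [("Alice", 5), ("Bob", 3), ("Carol", 5)],   # rows
    ["name", "name_len"],                       # illustrative column names
)
df.show()          # prints the table
df.printSchema()   # prints the inferred schema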
DataFrames: example
flightData2015 = spark\
.read\
.option("inferSchema", "true")\
.option("header", "true")\
.csv("/mnt/defg/flight-data/csv/2015-summary.csv")

 Reads in a DataFrame from a CSV file.


flightData2015.sort("count").take(3)

 Sorts by 'count' and outputs the first 3 rows (action)


Array([United States,Romania,15], [United States,Croatia...

40
DataFrames: transformations
 An easy way to transform DataFrames is to use SQL queries.
This takes in a DataFrame and returns a DataFrame (the output
of the query).
flightData2015.createOrReplaceTempView("flight_data_2015")
maxSql = spark.sql("""
SELECT DEST_COUNTRY_NAME, sum(count) as destination_total
FROM flight_data_2015
GROUP BY DEST_COUNTRY_NAME
ORDER BY sum(count) DESC
LIMIT 5
""")
maxSql.collect()

41
DataFrames: DataFrame interface

 We can also run the exact same query as follows:


from pyspark.sql.functions import desc
flightData2015\
.groupBy("DEST_COUNTRY_NAME")\
.sum("count")\
.withColumnRenamed("sum(count)", "destination_total")\
.sort(desc("destination_total"))\
.limit(5)\
.collect()

 Generally, these transformation functions (groupBy, sort, …) take in


either strings or “column objects”, which represent columns.
 For example, “desc” here returns a column object.
42
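To check that the SQL query and the DataFrame version describe the same work, explain() prints the physical plan Spark will execute; a sketch (the plan text varies by Spark version):

from pyspark.sql.functions import desc

flightData2015\
    .groupBy("DEST_COUNTRY_NAME")\
    .sum("count")\
    .withColumnRenamed("sum(count)", "destination_total")\
    .sort(desc("destination_total"))\
    .limit(5)\
    .explain()

maxSql.explain()   # the SQL version from the previous slide yields the same plan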
Datasets
 Datasets are similar to DataFrames, but are type-safe.
 In fact, in Spark (Scala), DataFrame is just an alias for Dataset[Row]
 However, Datasets are not available in Python and R, since these are
dynamically typed languages
case class Flight(DEST_COUNTRY_NAME: String, ORIGIN_COUNTRY_NAME: String, count: BigInt)

val flightsDF = spark.read.parquet("/mnt/defg/flight-data/parquet/2010-summary.parquet/")
val flights = flightsDF.as[Flight]
flights.collect()

 The Dataset flights is type safe – its type is the “Flight” class.
 Now when calling collect(), it will also return objects of the
“Flight” class, instead of Row objects.
43
Example: Spark Notebook in Google Colab
 To experiment with simple Spark commands without needing to install /
setup anything on your computer, you can run Spark on Google Colab
 See the simple example notebook at
https://colab.research.google.com/drive/1qtNpkieNEUzyF2NnXTyqyGL3LQD1TVlI#scrollTo=pUgUMWYUKAU3

44
Example: Spark Notebooks in Databricks
 You need to sign up a Databricks community edition account (free)

 Source: https://github.com/databricks/LearningSparkV2
45
Demo_1: Spark Web UI

(Slides 46-49: screenshots of the Spark Web UI.)
Demo_2: Caching Data

50
Acknowledgements
 CS4225 slides by He Bingsheng and Bryan Hooi
 Jules S. Damji, Brooke Wenig, Tathagata Das & Denny Lee,
“Learning Spark: Lightning-Fast Data Analytics”
 Databricks, “The Data Engineer’s Guide to Spark”
 https://www.pinterest.com/pin/739364463807740043/
 https://colab.research.google.com/github/jmbanda/BigDataProgramming_2019/blob/master/Chapter_5_Loading_and_Saving_Data_in_Spark.ipynb
 https://untitled-life.github.io/blog/2018/12/27/wide-vs-narrow-dependencies/

51
