Spark I: Basics
Ai Xin
School of Computing
National University of Singapore
[email protected]
Intro
Lecturer: Ai Xin
Email: [email protected]
Office Hours: 2-3pm on 20 Oct, and on 3, 17 and 24 Nov, at COM3-B1-24
TAs
Assignment 2 (post questions to Canvas/Discussions or email the TAs)
• SIDDARTH NANDANAHOSUR SURESH (Name A-G)
• TAN TZE YEONG (Name H-L)
• TAN YAN RONG AMELIA (Name L-R)
• TENG YI SHIONG (Name R-W)
• TOH WEI JIE (Name W-Z)
Schedule
Today’s Plan
Introduction and Basics
Working with RDDs
Caching and DAGs
DataFrames and Datasets
Motivation: Hadoop vs Spark
Ease of Programmability
WordCount (Spark)
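The WordCount code itself is not reproduced above; a minimal PySpark sketch of the usual pattern (the input path "input.txt" is a placeholder):

from pyspark import SparkContext

sc = SparkContext("local[*]", "WordCount")

counts = (sc.textFile("input.txt")                 # one RDD element per line
            .flatMap(lambda line: line.split())    # split each line into words
            .map(lambda word: (word, 1))           # pair each word with a count of 1
            .reduceByKey(lambda a, b: a + b))      # sum the counts per word

print(counts.take(5))                              # e.g. [('spark', 3), ('data', 2), ...]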
Spark Components and API Stack
Spark Architecture
The Driver Process responds to user input, manages the Spark application, and distributes work to Executors, which run the code assigned to them and send the results back to the driver.
The Cluster Manager (Spark's standalone cluster manager, YARN, Mesos or Kubernetes) allocates resources when the application requests them.
In local mode, all of these processes run on the same machine.
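As a sketch of what local mode looks like in code (the names here are illustrative, not from the slide):

from pyspark.sql import SparkSession

# "local[*]" runs driver and executors in a single process on this machine,
# with one worker thread per CPU core; no external cluster manager is used.
spark = SparkSession.builder \
    .master("local[*]") \
    .appName("SparkBasics") \
    .getOrCreate()

sc = spark.sparkContext   # the SparkContext used by the RDD examples below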
Evolution of Spark APIs
Resilient Distributed Datasets (2011) → DataFrame (2013) → DataSet (2013)
Today’s Plan
Introduction and Basics
Working with RDDs
Caching and DAGs
DataFrames and Datasets
Resilient Distributed Datasets (RDDs)
• Represent a collection of objects that is distributed over machines
• Achieve fault tolerance through lineages
RDD: Distributed Data
# Create an RDD of names, distributed over 3 partitions
dataRDD = sc.parallelize(["Alice", "Bob", "Carol", "Daniel"], 3)
(Diagram: the four names are distributed across the 3 partitions, e.g. with [Daniel] in a partition of its own.)
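To inspect the partitioning directly, one option (a sketch, assuming the SparkContext sc from above) is glom(), which collects each partition as a separate list; the exact split of names may differ from the diagram:

dataRDD = sc.parallelize(["Alice", "Bob", "Carol", "Daniel"], 3)
print(dataRDD.getNumPartitions())   # 3
print(dataRDD.glom().collect())     # e.g. [['Alice'], ['Bob'], ['Carol', 'Daniel']]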
Distributed Processing
# Create an RDD: length of names
dataRDD = sc.parallelize(["Alice", "Bob", "Carol", "Daniel"], 3)
nameLen = dataRDD.map(lambda s: len(s))
nameLen.collect()
# [5, 3, 5, 6]
Working with RDDs
textFile = sc.textFile("File.txt")
Note: this reads the file on each worker node in parallel, not on the driver node.
(Diagram: an RDD flows through a chain of transformations, each producing a new RDD; an action on the final RDD returns a value to the driver.)
linesWithSpark.count()   # returns 74
linesWithSpark.first()   # returns '# Apache Spark'
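linesWithSpark is not defined on this slide; a plausible definition, following the standard Spark quick-start example (the README.md path is an assumption):

textFile = sc.textFile("README.md")                              # transformation: nothing is read yet
linesWithSpark = textFile.filter(lambda line: "Spark" in line)   # transformation: still lazy
linesWithSpark.count()   # action: triggers the read and returns the number of matching lines
linesWithSpark.first()   # action: returns the first matching line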
Caching
Log Mining example: Load error messages from a log into memory, then interactively search for various patterns

lines = sc.textFile("hdfs://...")
errors = lines.filter(lambda s: s.startswith("ERROR"))
messages = errors.map(lambda s: s.split("\t")[2])
messages.cache()

messages.filter(lambda s: "mysql" in s).count()
messages.filter(lambda s: "php" in s).count()

(Diagram, built up over several slides: the driver sends tasks to the workers; each worker reads its HDFS block (Block 1-3), processes the data, caches its partition of messages (Cache 1-3), and returns results to the driver. The first count() reads from HDFS and populates the caches; the second count() is processed directly from the cached partitions.)
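A short follow-up sketch (not from the slides, continuing the example above): cache() is lazy, so nothing is stored until the first action runs, and the cached partitions can be inspected and released explicitly.

messages.cache()                   # only marks the RDD for caching; nothing is computed yet
messages.count()                   # the first action materializes and caches the partitions
print(messages.is_cached)          # True
print(messages.getStorageLevel())  # cache() uses the MEMORY_ONLY storage level
messages.unpersist()               # release the cached partitions when no longer needed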
Directed Acyclic Graph (DAG)
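The DAG figure is not reproduced here. One way to peek at the lineage graph Spark builds (a sketch reusing the messages RDD from the caching example) is toDebugString(); the Spark Web UI in Demo_1 also visualizes the DAG for each job.

# Shows the chain of RDDs (the lineage) Spark will execute to compute `messages`,
# e.g. a mapped RDD depending on a filtered RDD depending on an RDD read from HDFS.
lineage = messages.toDebugString()
print(lineage.decode() if isinstance(lineage, bytes) else lineage)   # PySpark may return bytes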
DataFrames
A DataFrame represents a table of data, similar to a table in SQL or a DataFrame in pandas.
Compared to RDDs, this is a higher-level interface, e.g. it has transformations that resemble SQL operations.
DataFrames (and Datasets) are the recommended interface for working with Spark: they are easier to use than RDDs, and almost all tasks can be done with them, with the RDD functions needed only rarely.
However, all DataFrame operations are still ultimately compiled down to RDD operations by Spark.
DataFrames: example
flightData2015 = spark\
.read\
.option("inferSchema", "true")\
.option("header", "true")\
.csv("/mnt/defg/flight-data/csv/2015-summary.csv")
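A quick follow-up (not on the slide) to check what the header and inferSchema options produced:

flightData2015.printSchema()   # column names and the inferred types
flightData2015.show(3)         # first few rows, rendered as a table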
DataFrames: transformations
An easy way to transform DataFrames is to use SQL queries: register the DataFrame as a temporary view, then run a query against it with spark.sql(), which returns the result as a new DataFrame.
flightData2015.createOrReplaceTempView("flight_data_2015")
maxSql = spark.sql("""
SELECT DEST_COUNTRY_NAME, sum(count) as destination_total
FROM flight_data_2015
GROUP BY DEST_COUNTRY_NAME
ORDER BY sum(count) DESC
LIMIT 5
""")
maxSql.collect()
DataFrames: DataFrame interface
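The DataFrame-interface code itself is not reproduced here; a sketch of the same top-5 query written with the DataFrame API instead of SQL (column names taken from the query above):

from pyspark.sql.functions import desc

maxDf = (flightData2015
         .groupBy("DEST_COUNTRY_NAME")
         .sum("count")                                         # yields a column named "sum(count)"
         .withColumnRenamed("sum(count)", "destination_total")
         .orderBy(desc("destination_total"))
         .limit(5))
maxDf.collect()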
Datasets go a step further: the Dataset flights is type-safe – its type is the "Flight" class. Now, when calling collect(), it returns objects of the "Flight" class instead of generic Row objects.
Example: Spark Notebook in Google Colab
To experiment with simple Spark commands without installing or setting up anything on your computer, you can run Spark on Google Colab.
See the simple example notebook at
https://colab.research.google.com/drive/1qtNpkieNEUzyF2NnXTyqyGL3LQD1TVlI#scrollTo=pUgUMWYUKAU3
Example: Spark Notebooks in Databricks
You need to sign up for a Databricks Community Edition account (free).
Source: https://github.com/databricks/LearningSparkV2
Demo_1: Spark Web UI
Demo_2: Caching Data
Acknowledgements
CS4225 slides by He Bingsheng and Bryan Hooi
Jules S. Damji, Brooke Wenig, Tathagata Das & Denny Lee,
“Learning Spark: Lightning-Fast Data Analytics”
Databricks, “The Data Engineer’s Guide to Spark”
https://www.pinterest.com/pin/739364463807740043/
https://colab.research.google.com/github/jmbanda/BigDataProgramming_2019/blob/master/Chapter_5_Loading_and_Saving_Data_in_Spark.ipynb
https://untitled-life.github.io/blog/2018/12/27/wide-vs-narrow-dependencies/