
CS4225/CS5425 Big Data Systems for Data Science
Spark I: Basics

Ai Xin
School of Computing
National University of Singapore
[email protected]

1
Intro
 Lecturer: Ai Xin
 Email: [email protected]
 Office Hours: 2-3pm on 20 Oct, 3, 17 and 24 Nov at COM3-B1-24
 TAs
 Assignment 2 (Post to Canvas/Discussion or Email TAs)
• SIDDARTH NANDANAHOSUR SURESH (Name A-G)
• TAN TZE YEONG (Name H-L)
• TAN YAN RONG AMELIA (Name L-R)
• TENG YI SHIONG (Name R-W)
• TOH WEI JIE (Name W-Z)

 Tutorial and Lecture (Post to Canvas/Discussion or Email TAs)


• ZHANG JIHAI (weeks 7–9)
• GOH TECK LUN (conducts tutorials)
• Hu Zhiyuan (weeks 10–13)

2
Schedule

3
Today’s Plan
 Introduction and Basics
 Working with RDDs
 Caching and DAGs
 DataFrames and Datasets

4
Motivation: Hadoop vs Spark

 Issues with Hadoop MapReduce:
 Network and disk I/O costs: intermediate data has to be written to local
disks and shuffled across machines, which is slow
 Not suitable for iterative processing (i.e. repeatedly modifying small
amounts of data), such as interactive workflows, since each individual
step has to be modelled as a separate MapReduce job
 Spark stores most of its intermediate results in memory, making
it much faster, especially for iterative processing
 When memory is insufficient, Spark spills to disk, which incurs disk I/O
5
Performance Comparison

6
Ease of Programmability

WordCount (Hadoop MapReduce)


7
Ease of Programmability

val file = sc.textFile("hdfs://...")

val counts = file.flatMap(line => line.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)
counts.saveAsTextFile("...")

WordCount (Spark)

8
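For comparison, a minimal PySpark sketch of the same word count (the input and output paths are placeholders, as in the Scala version above):

file = sc.textFile("hdfs://...")

counts = (file.flatMap(lambda line: line.split(" "))   # split each line into words
              .map(lambda word: (word, 1))             # pair each word with a count of 1
              .reduceByKey(lambda a, b: a + b))        # sum the counts per word

counts.saveAsTextFile("...")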
Spark Components and API Stack

9
Spark Architecture

 Driver Process responds to user input, manages the Spark application etc., and
distributes work to Executors, which run the code assigned to them and send
the results back to the driver
 Cluster Manager (can be Spark’s standalone cluster manager, YARN, Mesos or
Kubernetes) allocates resources when the application requests it
 In local mode, all these processes run on the same machine
10
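As a concrete illustration, here is a minimal sketch of starting a Spark application in local mode with PySpark (the application name is arbitrary; "local[*]" requests one worker thread per core):

from pyspark.sql import SparkSession

# Start a Spark application in local mode: the driver and executors run on
# this machine, with no external cluster manager involved.
spark = (SparkSession.builder
         .appName("cs4225-demo")    # arbitrary application name
         .master("local[*]")        # local mode, one worker thread per core
         .getOrCreate())

sc = spark.sparkContext             # the underlying SparkContext used in later examples
print(sc.master)                    # e.g. "local[*]"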
Evolution of Spark APIs

Resilient Distributed Datasets (2011)
• A collection of JVM objects
• Functional operators (map, filter, etc.)

DataFrame (2013)
• A collection of Row objects
• Expression-based operations
• Logical plans and optimizer

DataSet (2013)
• Internally rows, externally JVM objects
• Almost the "best of both worlds": type safe + fast

11
Today’s Plan
 Introduction and Basics
 Working with RDDs
 Caching and DAGs
 DataFrames and Datasets

12
Resilient Distributed Datasets (RDDs)
 Resilient: achieve fault tolerance through lineages
 Distributed: represent a collection of objects that is distributed over machines

13
RDD: Distributed Data
# Create an RDD of names, distributed over 3 partitions
dataRDD = sc.parallelize(["Alice", "Bob", "Carol", "Daniel"], 3)

 RDDs are immutable, i.e. they cannot be changed once created.
 This is an RDD with 4 strings. On actual hardware, it will be
partitioned across the 3 workers.

(Diagram: the driver partitions the data into 3 parts; the workers hold
[Alice, Bob], [Carol] and [Daniel] respectively.)
14
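A quick way to inspect how an RDD was partitioned is sketched below (the exact split across partitions is decided by Spark):

dataRDD = sc.parallelize(["Alice", "Bob", "Carol", "Daniel"], 3)

print(dataRDD.getNumPartitions())   # 3
# glom() groups the elements of each partition into a list, so collecting it
# shows how the four names were split across the 3 partitions.
print(dataRDD.glom().collect())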
Transformations
 Transformations are a way of transforming RDDs into RDDs.

# Create an RDD: length of names


dataRDD = sc.parallelize(["Alice", "Bob", "Carol", "Daniel"], 3)
nameLen = dataRDD.map(lambda s: len(s))

 This represents the transformation that maps each string to its


length, creating a new RDD.
 However, transformations are lazy. This means a transformation
will not be executed until an action is called on it
 Q: what are the advantages of being lazy?
 A: Spark can optimize the query plan to improve speed (e.g. removing
unneeded operations)
 Examples of transformations: map, order, groupBy,
filter, join, select
15
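A small sketch of this laziness: the filter below is only recorded, and nothing runs on the workers until the count() action is called:

dataRDD = sc.parallelize(["Alice", "Bob", "Carol", "Daniel"], 3)

# Returns immediately: Spark only records the transformation in its plan.
longNames = dataRDD.filter(lambda s: len(s) > 3)

# Only this action triggers the actual computation on the workers.
print(longNames.count())   # 3 (Alice, Carol and Daniel)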
Actions
 Actions trigger Spark to compute a result from a series of
transformations.

dataRDD = sc.parallelize(["Alice", "Bob", "Carol", "Daniel"], 3)


nameLen = dataRDD.map(lambda s: len(s))
nameLen.collect()

[5, 3, 5, 6]

 collect() here is an action.


 It is the action that asks Spark to retrieve all elements of the RDD to the driver
node.
 Examples of actions: show, count, save, collect
16
Distributed Processing
# Create an RDD: length of names
dataRDD = sc.parallelize(["Alice", "Bob", "Carol", "Daniel"], 3)
nameLen = dataRDD.map(lambda s: len(s))
nameLen.collect()

 As we previously said, RDDs are actually distributed across machines.
 Thus, the transformations and actions are executed in parallel. The
results are only sent to the driver in the final step.

(Diagram, built up over slides 17-20: each worker applies map to its own
partition, [Alice, Bob] -> [5, 3], [Carol] -> [5], [Daniel] -> [6]; collect
then sends the partial results to the driver, which assembles [5, 3, 5, 6].)
20
Working with RDDs
Note: this reads the file on each worker node in parallel, not on the
driver node

textFile = sc.textFile("File.txt")

linesWithSpark = textFile.filter(lambda line: "Spark" in line)

linesWithSpark.count()
74

linesWithSpark.first()
# Apache Spark

(Diagram: a chain of transformations turns one RDD into another,
RDD -> RDD -> RDD -> RDD; an action then turns the final RDD into a value.)
Today’s Plan
 Introduction and Basics
 Working with RDDs
 Caching and DAGs
 DataFrames and Datasets

22
Caching

Log Mining example: Load error messages from a log into memory, then
interactively search for various patterns

lines = sc.textFile("hdfs://...")
errors = lines.filter(lambda s: s.startswith("ERROR"))
messages = errors.map(lambda s: s.split("\t")[2])
messages.cache()

messages.filter(lambda s: "mysql" in s).count()
messages.filter(lambda s: "php" in s).count()

(Diagram, built up over slides 23-33: the driver sends tasks to the three
workers; for the first count, each worker reads its HDFS block, computes and
caches its partition of messages, and returns its result to the driver; for
the second count, the workers process the already-cached partitions, so no
HDFS read is needed.)

Cache your data → Faster Results
Full-text search of Wikipedia
• 60GB on 20 EC2 machines
• 0.5 sec from mem vs. 20s for on-disk
Caching
 cache(): saves an RDD to memory (of each worker node).
 persist(options): can be used to save an RDD to memory,
disk, or off-heap memory
 When should we cache or not cache an RDD?
 When it is expensive to compute and needs to be re-used multiple times.
 If worker nodes do not have enough memory, they will evict the least
recently used RDD partitions. So, be aware of memory limitations when caching.

34
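A short sketch of the two calls in PySpark (the storage level names come from the standard pyspark StorageLevel API):

from pyspark import StorageLevel

logs = sc.textFile("hdfs://...")
errors = logs.filter(lambda s: s.startswith("ERROR"))

# cache() keeps the computed partitions in executor memory.
errors.cache()

# persist() lets you pick the storage level explicitly, e.g. spill to disk
# when memory runs out instead of evicting and recomputing.
warnings = logs.filter(lambda s: s.startswith("WARN"))
warnings.persist(StorageLevel.MEMORY_AND_DISK)

errors.count()   # first action materializes and caches 'errors'
errors.count()   # second action reads the cached partitions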
Directed Acyclic Graph (DAG)

 Internally, Spark creates a graph


(“directed acyclic graph”) which
represents all the RDD objects
and how they will be
transformed.
 Transformations construct this
graph; actions trigger
computations on it.

val file = sc.textFile("hdfs://...")

val counts = file.flatMap(line => line.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)
counts.saveAsTextFile("...")
35
WordCount (Spark)
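To look at the lineage graph Spark has recorded for an RDD, toDebugString prints it; a PySpark sketch (the exact output format varies between Spark versions):

file = sc.textFile("hdfs://...")
counts = (file.flatMap(lambda line: line.split(" "))
              .map(lambda word: (word, 1))
              .reduceByKey(lambda a, b: a + b))

# Prints the recorded lineage; the ShuffledRDD entry marks the stage
# boundary introduced by reduceByKey.
print(counts.toDebugString().decode())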
Narrow and Wide Dependencies
 Narrow dependencies are where each
partition of the parent RDD is used by at
most 1 partition of the child RDD
 E.g. map, flatMap, filter, contains
 Wide dependencies are the opposite (each
partition of parent RDD is used by multiple
partitions of the child RDD)
 E.g. reduceByKey, groupBy, orderBy
 In the DAG, consecutive narrow
dependencies are grouped together as
“stages”.
 Within stages, Spark performs consecutive
transformations on the same machines.
 Across stages, data needs to be shuffled, i.e.
exchanged across partitions, in a process
very similar to map-reduce, which involves
writing intermediate results to disk
 Minimizing shuffling is good practice for improving performance.
36
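A small sketch of the distinction: filter and map below are narrow dependencies, while reduceByKey is wide, so Spark shuffles data and starts a new stage at that point:

nums = sc.parallelize(range(1000), 4)

# Narrow: each output partition depends on exactly one input partition.
evens = nums.filter(lambda x: x % 2 == 0)
pairs = evens.map(lambda x: (x % 10, x))

# Wide: grouping by key needs rows from every partition, so Spark shuffles
# the data and begins a new stage here.
sums = pairs.reduceByKey(lambda a, b: a + b)
sums.collect()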
Lineage and Fault Tolerance
 Unlike Hadoop, Spark does not use
replication to allow fault tolerance.
Why?
 Spark tries to store all the data in
memory, not disk. Memory capacity is
much more limited than disk, so simply
duplicating all data is expensive.
 Lineage approach: if a worker node
goes down, we replace it by a new
worker node, and use the graph
(DAG) to recompute the data in the
lost partition.
 Note that we only need to recompute the partitions that were lost, not
the entire RDD.
37
Today’s Plan
 Introduction and Basics
 Working with RDDs
 Caching and DAGs
 DataFrames and Datasets

38
DataFrames
 A DataFrame represents a table of data, similar to tables in SQL, or
DataFrames in pandas.
 Compared to RDDs, this is a higher level interface, e.g. it has
transformations that resemble SQL operations.
 DataFrames (and Datasets) are the recommended interface for working with
Spark – they are easier to use than RDDs and almost all tasks can be done
with them, while only rarely using the RDD functions.
 However, all DataFrame operations are still ultimately compiled down to
RDD operations by Spark.

39
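As a minimal illustration before the CSV example on the next slide, a DataFrame can also be created directly from local data (the column names here are made up):

df = spark.createDataFrame(
    [("Alice", 5), ("Bob", 3), ("Carol", 5)],   # rows
    ["name", "name_len"],                       # illustrative column names
)
df.show()          # prints the table
df.printSchema()   # prints the inferred schema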
DataFrames: example
flightData2015 = spark\
.read\
.option("inferSchema", "true")\
.option("header", "true")\
.csv("/mnt/defg/flight-data/csv/2015-summary.csv")

 Reads in a DataFrame from a CSV file.


flightData2015.sort("count").take(3)

 Sorts by 'count' and outputs the first 3 rows (action)


Array([United States,Romania,15], [United States,Croatia...

40
DataFrames: transformations
 An easy way to transform DataFrames is to use SQL queries.
This takes in a DataFrame and returns a DataFrame (the output
of the query).
flightData2015.createOrReplaceTempView("flight_data_2015")
maxSql = spark.sql("""
SELECT DEST_COUNTRY_NAME, sum(count) as destination_total
FROM flight_data_2015
GROUP BY DEST_COUNTRY_NAME
ORDER BY sum(count) DESC
LIMIT 5
""")
maxSql.collect()

41
DataFrames: DataFrame interface

 We can also run the exact same query as follows:


from pyspark.sql.functions import desc
flightData2015\
.groupBy("DEST_COUNTRY_NAME")\
.sum("count")\
.withColumnRenamed("sum(count)", "destination_total")\
.sort(desc("destination_total"))\
.limit(5)\
.collect()

 Generally, these transformation functions (groupBy, sort, …) take in


either strings or “column objects”, which represent columns.
 For example, “desc” here returns a column object.
42
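To check that the SQL query and the DataFrame version describe the same work, explain() prints the physical plan Spark will execute; a sketch (the plan text varies by Spark version):

from pyspark.sql.functions import desc

flightData2015\
    .groupBy("DEST_COUNTRY_NAME")\
    .sum("count")\
    .withColumnRenamed("sum(count)", "destination_total")\
    .sort(desc("destination_total"))\
    .limit(5)\
    .explain()

maxSql.explain()   # the SQL version from the previous slide yields the same plan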
Datasets
 Datasets are similar to DataFrames, but are type-safe.
 In fact, in Spark (Scala), DataFrame is just an alias for Dataset[Row]
 However, Datasets are not available in Python and R, since these are
dynamically typed languages
case class Flight(DEST_COUNTRY_NAME: String, ORIGIN_COUNTRY_NAME: String, count: BigInt)

val flightsDF = spark.read.parquet("/mnt/defg/flight-data/parquet/2010-summary.parquet/")
val flights = flightsDF.as[Flight]
flights.collect()

 The Dataset flights is type safe – its type is the “Flight” class.
 Now when calling collect(), it will also return objects of the
“Flight” class, instead of Row objects.
43
Example: Spark Notebook in Google Colab
 To experiment with simple Spark commands without needing to install /
setup anything on your computer, you can run Spark on Google Colab
 See the simple example notebook at
https://colab.research.google.com/drive/1qtNpkieNEUzyF2NnXTyqyGL3LQD1TVlI#scrollTo=pUgUMWYUKAU3

44
Example: Spark Notebooks in Databricks
 You need to sign up a Databricks community edition account (free)

 Source: https://github.com/databricks/LearningSparkV2
45
Demo_1: Spark Web UI

(Slides 46-49: screenshots of the Spark Web UI.)
Demo_2: Caching Data

50
Acknowledgements
 CS4225 slides by He Bingsheng and Bryan Hooi
 Jules S. Damji, Brooke Wenig, Tathagata Das & Denny Lee,
“Learning Spark: Lightning-Fast Data Analytics”
 Databricks, “The Data Engineer’s Guide to Spark”
 https://www.pinterest.com/pin/739364463807740043/
 https://colab.research.google.com/github/jmbanda/BigDataProgramming_2019/blob/master/Chapter_5_Loading_and_Saving_Data_in_Spark.ipynb
 https://untitled-life.github.io/blog/2018/12/27/wide-vs-narrow-dependencies/

51
