
Données massives et apprentissage profond

Lecture 1 – Apache Spark

Gianluca Quercini

[email protected]

Polytech Paris-Saclay, 2022


General information

Organization of the course

MapReduce and Spark.

Spark programming.

SQL and NoSQL.

MongoDB practice.

Hadoop technologies.

Scaling.

Gianluca Quercini Big Data Polytech Paris-Saclay, 2022 1 / 64


General information

Class material

Available online
https://tinyurl.com/p7jb5wra

Slides of the lectures.


Tutorials and lab assignments.
References (books and articles).

Gianluca Quercini Big Data Polytech Paris-Saclay, 2022 2 / 64


General information

Evaluation

Lab assignments. Lab assignments 1 and 2 will be graded.


Lab assignment 1. Spark programming
Lab assignment 2. MongoDB
Submission: source code + written report.

Written exam. 1 hour.


Spark programming.
Data modeling in MongoDB.
Querying in MongoDB.

Gianluca Quercini Big Data Polytech Paris-Saclay, 2022 3 / 64


General information

Contact

Email: [email protected]

Gianluca Quercini Big Data Polytech Paris-Saclay, 2022 4 / 64


Lecture 1 – Apache Spark Objectives

What you will learn

In this lecture you will learn:

What Spark is and its main features.

The components of the Spark stack.

The high-level Spark architecture.

The notion of Resilient Distributed Dataset (RDD).

The main transformations and actions on RDDs.

Gianluca Quercini Big Data Polytech Paris-Saclay, 2022 5 / 64


Lecture 1 – Apache Spark Introduction to Spark

Apache Spark

Definition (Apache Spark)


Apache Spark is a distributed computing framework designed to be fast
and general-purpose. Source

Main features
Speed. Run computations in memory (Hadoop relies on disks).
General-purpose. Different workloads in the same system.
Batch applications, iterative algorithms.
Interactive queries, streaming applications.
Accessibility. Python, Scala, Java, SQL and R; rich built-in libraries.
Integration. With other Big Data tools, such as Hadoop.

Gianluca Quercini Big Data Polytech Paris-Saclay, 2022 6 / 64


Lecture 1 – Apache Spark Introduction to Spark

Spark components

Spark SQL | Structured Streaming | MLlib | GraphX
Spark Core
Standalone Scheduler | YARN | Mesos | Kubernetes

Image source

Gianluca Quercini Big Data Polytech Paris-Saclay, 2022 7 / 64


Lecture 1 – Apache Spark Introduction to Spark

Spark components

Spark core
Scheduling, distributing, and monitoring applications.
Data structures for manipulating data (RDDs, DataFrames).

Spark SQL
Spark’s package for working with (semi-)structured data.
Data querying with SQL and HQL (Hive Query Language).
Many sources of data: JSON, XML, Parquet...

Gianluca Quercini Big Data Polytech Paris-Saclay, 2022 8 / 64


Lecture 1 – Apache Spark Introduction to Spark

Spark components

Structured streaming
Processing of live streams of data (e.g., real-time event logs)
Similar API to batch processing.

MLlib
Machine learning algorithms (e.g., classification, regression,
clustering)
All methods designed to scale out across a cluster.

Gianluca Quercini Big Data Polytech Paris-Saclay, 2022 9 / 64


Lecture 1 – Apache Spark Introduction to Spark

Spark components

GraphX
Manipulation of graph data.
Library with common graph algorithms (e.g., PageRank)

Cluster managers
Control how tasks are distributed across a cluster.
Spark provides its own standalone cluster manager.
Spark can also use other cluster managers.

Gianluca Quercini Big Data Polytech Paris-Saclay, 2022 10 / 64


Lecture 1 – Apache Spark Introduction to Spark

Spark unified stack: benefits

Shallow learning curve. Same programming model across all components.

Optimization propagation. Higher-level components automatically benefit from improvements to lower-layer components.

Cost minimization. No need for additional software components.

Heterogeneous processing models in the same application:
Read a stream of data.
Apply machine learning algorithms.
Use SQL to analyze the results.

Gianluca Quercini Big Data Polytech Paris-Saclay, 2022 11 / 64


Lecture 1 – Apache Spark Introduction to Spark

Using Spark

Interactive mode
Using a command-line interface (CLI) or shell.
Python and Scala shell.
SparkSQL shell.
SparkR shell.

Data processing applications


Building an application by using the Spark APIs.
Scala (Spark’s native language).
Python.
Java.

Gianluca Quercini Big Data Polytech Paris-Saclay, 2022 12 / 64


Lecture 1 – Apache Spark Introduction to Spark

Who uses Spark

Several important actors use Spark:

Amazon.

eBay. Log transaction aggregation and analytics.

Groupon.

Stanford DAWN. Research project aiming at democratizing AI.

TripAdvisor.

Yahoo!

+ Full list available at http://spark.apache.org/powered-by.html

Gianluca Quercini Big Data Polytech Paris-Saclay, 2022 13 / 64


Lecture 1 – Apache Spark Introduction to Spark

Spark application
Spark application: a set of independent processes called executors.
Executors run computations and store the data for the application.
Executors are coordinated by the driver.

Figure: the Spark driver (holding the SparkContext) communicates with the cluster manager; each worker node runs an executor with a cache, executing tasks. Image source

Gianluca Quercini Big Data Polytech Paris-Saclay, 2022 14 / 64


Lecture 1 – Apache Spark Introduction to Spark

Spark application execution

The driver is launched and creates the SparkContext object.

The SparkContext obtains executors from the cluster manager.

The driver sends the user’s code to the executors.

The driver assigns each executor a set of tasks.

A task is a computation on a chunk of data.

Figure: Spark driver (SparkContext) → cluster manager → worker node with an executor (cache) running tasks.

Gianluca Quercini Big Data Polytech Paris-Saclay, 2022 15 / 64


Lecture 1 – Apache Spark Introduction to Spark

Spark application execution

Applications are isolated from one another.


Each application has its own SparkContext.
An executor only runs tasks of one application.
A driver only schedules tasks for one application.
Data cannot be shared across different applications.

Spark is agnostic to the underlying cluster manager.


The driver listens to incoming connections from the executors on a
network port.
The driver should be in the same local network as the executors.

+ Two different Spark applications can still share data through an external storage system (e.g., a database or HDFS files).

Gianluca Quercini Big Data Polytech Paris-Saclay, 2022 16 / 64


Lecture 1 – Apache Spark Programming with Resilient Distributed Datasets (RDDs)

Spark programming

Two options exist to write a Spark application:

Low-level programming, using operations on a low-level data structure called Resilient Distributed Dataset (RDD).

High-level programming, using high-level libraries, such as SparkSQL and Structured Streaming.

+ In this lecture, we'll focus on low-level programming to better understand the inner workings of Spark.

Gianluca Quercini Big Data Polytech Paris-Saclay, 2022 17 / 64


Lecture 1 – Apache Spark Programming with Resilient Distributed Datasets (RDDs)

Low-level Spark programming

A Spark program uses an object called SparkContext.


SparkContext represents a connection to a cluster.

Initializing the SparkContext


from pyspark import SparkConf, SparkContext

conf = SparkConf().setMaster(<cluster URL>).setAppName(<app_name>)

sc = SparkContext(conf = conf)

A Spark program is a sequence of operations invoked on the SparkContext (sc).

These operations manipulate a special type of data structure, called Resilient Distributed Dataset (RDD).
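
A minimal sketch of this initialization for a local run; the master URL "local[*]" and the application name "lecture1-demo" are illustrative values, not prescribed by the course.

from pyspark import SparkConf, SparkContext

# Run locally, using all available cores (illustrative values).
conf = SparkConf().setMaster("local[*]").setAppName("lecture1-demo")
sc = SparkContext(conf=conf)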

Gianluca Quercini Big Data Polytech Paris-Saclay, 2022 18 / 64


Lecture 1 – Apache Spark Programming with Resilient Distributed Datasets (RDDs)

Resilient Distributed Dataset (RDD)

Definition (Resilient Distributed Dataset)


A Resilient Distributed Dataset, or simply RDD, is an immutable,
distributed collection of objects. Source

The data in each RDD is split across multiple partitions.

Each partition resides on one node of the cluster.

Two partitions can reside on the same node.

Gianluca Quercini Big Data Polytech Paris-Saclay, 2022 19 / 64


Lecture 1 – Apache Spark Programming with Resilient Distributed Datasets (RDDs)

Resilient Distributed Dataset (RDD)

Figure: an input file (an excerpt of Moby Dick) stored in HDFS is loaded into an RDD with four partitions (Partition 0 to Partition 3). By default, 1 HDFS block = 1 partition; it is possible to specify a different number of partitions.

Gianluca Quercini Big Data Polytech Paris-Saclay, 2022 20 / 64


Lecture 1 – Apache Spark Programming with Resilient Distributed Datasets (RDDs)

Resilient Distributed Dataset (RDD)

Figure: the four partitions of the RDD, each holding one chunk of the input text, distributed across the nodes of the cluster.

Gianluca Quercini Big Data Polytech Paris-Saclay, 2022 21 / 64


Lecture 1 – Apache Spark Programming with Resilient Distributed Datasets (RDDs)

Creating an RDD

1 From an in-memory collection (e.g., a list or a set).

sc.parallelize([1, 5, 3, 2, 6, 7])

+ This method is used for debugging and prototyping on small datasets.

2 From a data source on disk (e.g., a file or a database).

sc.textFile("hdfs://sar01:9000/data/sample_text.txt")

+ This method is used in production to process large datasets.

Gianluca Quercini Big Data Polytech Paris-Saclay, 2022 22 / 64


Lecture 1 – Apache Spark Programming with Resilient Distributed Datasets (RDDs)

About the number of partitions

RDDs created with parallelize

Local mode: number of cores on the local machine.

Cluster mode: total number of cores on all executor nodes, or 2, whichever is larger.

RDDs created from files stored in HDFS

Number of HDFS blocks in the input file, or 2, whichever is larger.
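
As a quick check, an RDD exposes getNumPartitions(); a small sketch (the HDFS path reuses the one shown earlier, and the exact counts depend on the cluster):

rdd = sc.parallelize(range(100), 4)        # explicitly request 4 partitions
rdd.getNumPartitions()                     # 4
text = sc.textFile("hdfs://sar01:9000/data/sample_text.txt", minPartitions=8)
text.getNumPartitions()                    # at least 8, depending on the number of blocks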

Gianluca Quercini Big Data Polytech Paris-Saclay, 2022 23 / 64


Lecture 1 – Apache Spark Programming with Resilient Distributed Datasets (RDDs)

RDD transformations

A transformation is an operation that takes in one or more RDDs and returns a new RDD. A transformation is applied in parallel on each partition.

Figure: a chain of transformations (RDD 1 → RDD 2 → RDD 3 → RDD 4, via Transformation 1, 2 and 3); each transformation is applied independently to every partition (Partition 0 to Partition 7).

Gianluca Quercini Big Data Polytech Paris-Saclay, 2022 24 / 64


Lecture 1 – Apache Spark Programming with Resilient Distributed Datasets (RDDs)

RDD transformations: map

map() takes in a function f and an RDD < x_i | 0 ≤ i ≤ n >; returns a new RDD < f(x_i) | 0 ≤ i ≤ n >.

map(lambda x: x*x)
Partition 0: 2; 5; 6; 7; 8; 11; 13 → 4; 25; 36; 49; 64; 121; 169
Partition 1: 4; 5; 2; 3; 4; 5; 8 → 16; 25; 4; 9; 16; 25; 64
Partition 2: 1; 4; 3; 2; 4; 5; 6 → 1; 16; 9; 4; 16; 25; 36
Partition 3: 2; 4; 5; 2; 3; 4; 8 → 4; 16; 25; 4; 9; 16; 64

+ Partition i of the input RDD is on the same node as partition i of the output RDD.
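
A small map sketch in PySpark (values are illustrative):

numbers = sc.parallelize([2, 5, 6, 7, 8, 11, 13])
squares = numbers.map(lambda x: x * x)     # transformation: returns a new RDD
squares.collect()                          # [4, 25, 36, 49, 64, 121, 169]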
Gianluca Quercini Big Data Polytech Paris-Saclay, 2022 25 / 64
Lecture 1 – Apache Spark Programming with Resilient Distributed Datasets (RDDs)

RDD transformations: flatMap


flatMap is used instead of map when the function f returns a list and we need the
results to be flattened.

Figure: starting from an RDD of text lines (Lorem ipsum...), flatMap(lambda x: x.split()) produces one flat RDD of words, while map(lambda x: x.split()) produces an RDD whose elements are lists of words (one list per input line).
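
A small sketch contrasting flatMap and map on the same input (values are illustrative):

lines = sc.parallelize(["Lorem ipsum dolor", "sit amet"])
lines.map(lambda x: x.split()).collect()       # [['Lorem', 'ipsum', 'dolor'], ['sit', 'amet']]
lines.flatMap(lambda x: x.split()).collect()   # ['Lorem', 'ipsum', 'dolor', 'sit', 'amet']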

Gianluca Quercini Big Data Polytech Paris-Saclay, 2022 26 / 64


Lecture 1 – Apache Spark Programming with Resilient Distributed Datasets (RDDs)

RDD transformations: filter

filter() takes in a predicate p and an RDD < x_i | 0 ≤ i ≤ n >; returns a new RDD < x_i | 0 ≤ i ≤ n, p(x_i) is true >.

filter(lambda x: x>3)
Partition 0: 2; 5; 6; 7; 8; 11; 13 → 5; 6; 7; 8; 11; 13
Partition 1: 4; 5; 2; 3; 4; 5; 8 → 4; 5; 4; 5; 8
Partition 2: 1; 4; 3; 2; 4; 5; 6 → 4; 4; 5; 6
Partition 3: 2; 4; 5; 2; 3; 4; 8 → 4; 5; 4; 8
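
A small filter sketch (values are illustrative; filter preserves the order of the elements within each partition):

numbers = sc.parallelize([2, 5, 6, 7, 8, 11, 13])
numbers.filter(lambda x: x > 3).collect()      # [5, 6, 7, 8, 11, 13]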

Gianluca Quercini Big Data Polytech Paris-Saclay, 2022 27 / 64


Lecture 1 – Apache Spark Programming with Resilient Distributed Datasets (RDDs)

RDD transformations: union

union() takes in two RDDs and returns a new RDD containing the
items of the first and second RDD with repetitions.

union()
RDD 1 partitions: 3;4 | 1;5 | 4;2 | 4;5
RDD 2 partitions: 10 | 12;13 | 2;4 | 3;6
Result: 3;4 | 1;5 | 4;2 | 4;5 | 10 | 12;13 | 2;4 | 3;6 (all partitions of both RDDs, repetitions kept)

Gianluca Quercini Big Data Polytech Paris-Saclay, 2022 28 / 64


Lecture 1 – Apache Spark Programming with Resilient Distributed Datasets (RDDs)

RDD transformations: distinct

distinct() takes in one RDD and returns a new RDD containing the items of the input RDD without repetitions.

distinct()
Partition 0: 2; 5; 6; 7; 8; 11; 13 → 8; 4
Partition 1: 4; 5; 2; 3; 4; 5; 8 → 5; 13; 1
Partition 2: 1; 4; 3; 2; 4; 5; 6 → 2; 6
Partition 3: 2; 4; 5; 2; 3; 4; 8 → 7; 11; 3

+ Unlike the previous transformations, distinct leads to data being shuffled.
Gianluca Quercini Big Data Polytech Paris-Saclay, 2022 29 / 64
Lecture 1 – Apache Spark Programming with Resilient Distributed Datasets (RDDs)

About data shuffling F

Which partition does the element 23 belong to in the RDD obtained after applying the transformation distinct?

distinct()
Partition 0: 4; 5; 4 → 4; 12
Partition 1: 3; 2; 6; 7 → 5; 1
Partition 2: 23; 12; 1; 4 → 2; 6
Partition 3: 4; 23; 11; 2 → 3; 7; 11

Gianluca Quercini Big Data Polytech Paris-Saclay, 2022 30 / 64


Lecture 1 – Apache Spark Programming with Resilient Distributed Datasets (RDDs)

About data shuffling F

The element 23 belongs to partition 3.

While shuffling, the destination partition p of an element K in an RDD with n partitions is computed as follows:

p = hashCode(K) mod n
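
In PySpark, the default partitioning function used for shuffled RDDs is portable_hash, so the destination partition can be sketched as follows (assuming 4 partitions, as in the figure above):

from pyspark.rdd import portable_hash

portable_hash(23) % 4     # 3: the element 23 goes to partition 3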

Gianluca Quercini Big Data Polytech Paris-Saclay, 2022 31 / 64


Lecture 1 – Apache Spark Programming with Resilient Distributed Datasets (RDDs)

RDD transformations: intersection

intersection() takes in two RDDs and returns a new RDD containing the items that occur in both RDDs (without duplicates).

intersection()
RDD 1 partitions: 3;4 | 1;5 | 4;2 | 4;5
RDD 2 partitions: 10 | 12;13 | 2;4 | 3;6
Result: 2 | 3 | 4 (duplicates removed)
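
A small sketch combining union, distinct and intersection (values are illustrative; the order of the elements returned by collect() is not guaranteed after a shuffle):

r1 = sc.parallelize([3, 4, 1, 5, 4, 2, 4, 5])
r2 = sc.parallelize([10, 12, 13, 2, 4, 3, 6])
r1.union(r2).count()             # 15: duplicates are kept
r1.distinct().collect()          # e.g. [4, 1, 5, 2, 3]: duplicates removed
r1.intersection(r2).collect()    # e.g. [2, 3, 4]: items present in both RDDs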

Gianluca Quercini Big Data Polytech Paris-Saclay, 2022 32 / 64


Lecture 1 – Apache Spark Programming with Resilient Distributed Datasets (RDDs)

Narrow transformations F

Definition (Narrow transformation)


A narrow transformation is one where each partition of the output RDD
depends on at most one partition of the input RDD.

Which of the above transformations are narrow?

Narrow transformations are inexpensive.


No need for communication between executors.

Gianluca Quercini Big Data Polytech Paris-Saclay, 2022 33 / 64


Lecture 1 – Apache Spark Programming with Resilient Distributed Datasets (RDDs)

Narrow transformations F

Definition (Narrow transformation)


A narrow transformation is one where each partition of the output RDD
depends on at most one partition of the input RDD.

filter, map, flatMap and union are narrow transformations.

Narrow transformations are inexpensive.


No need for communication between executors.

Gianluca Quercini Big Data Polytech Paris-Saclay, 2022 33 / 64


Lecture 1 – Apache Spark Programming with Resilient Distributed Datasets (RDDs)

Wide transformations F

Definition (Wide transformation)


A wide transformation is one where each partition of the output RDD
may depend on several partitions of the input RDD.

Which of the above transformations are wide?

Wide transformations are more costly.


Executors need to communicate.
Data is shuffled across the cluster network.

Gianluca Quercini Big Data Polytech Paris-Saclay, 2022 34 / 64


Lecture 1 – Apache Spark Programming with Resilient Distributed Datasets (RDDs)

Wide transformations F

Definition (Wide transformation)


A wide transformation is one where each partition of the output RDD
may depend on several partitions of the input RDD.

distinct and intersection are wide transformations.

Wide transformations are more costly.


Executors need to communicate.
Data is shuffled across the cluster network.

Gianluca Quercini Big Data Polytech Paris-Saclay, 2022 34 / 64


Lecture 1 – Apache Spark Programming with Resilient Distributed Datasets (RDDs)

RDD actions

An action is an operation that takes in an RDD and returns a value to the driver after running a computation on the dataset.

The result of an action is sent to the driver.

If the result is a list of values, all values are sent to the driver.

The result of an action can also be written to disk.

Disk writes can be to the local file system or HDFS.

Gianluca Quercini Big Data Polytech Paris-Saclay, 2022 35 / 64


Lecture 1 – Apache Spark Programming with Resilient Distributed Datasets (RDDs)

RDD actions: reduce

reduce() takes in an RDD and a function f and applies the function pair-wise to all elements of the input RDD.

reduce(lambda x, y: x+y)
Partition 0: 3; 4 → 3 + 4 = 7
Partition 1: 1; 5 → 1 + 5 = 6
Partition 2: 4; 2 → 4 + 2 = 6
Partition 3: 2; 4; 5 → 2 + 4 + 5 = 11
Driver: 7 + 6 + 6 + 11 = 30

The function f must take in 2 arguments.


The type of the value returned by the function f must be the
same as the type of the elements of the input RDD.
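
A small reduce sketch (values are illustrative; the result is a plain Python value on the driver, not an RDD):

numbers = sc.parallelize([3, 4, 1, 5, 4, 2, 2, 4, 5])
numbers.reduce(lambda x, y: x + y)     # 30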
Gianluca Quercini Big Data Polytech Paris-Saclay, 2022 36 / 64
Lecture 1 – Apache Spark Programming with Resilient Distributed Datasets (RDDs)

RDD actions: collect

collect() takes in an RDD and returns the list of the elements in the RDD.

collect()
Partition 0: 3; 4 → [3, 4]
Partition 1: 1; 5 → [1, 5]
Partition 2: 4; 2 → [4, 2]
Partition 3: 2; 4; 5 → [2, 4, 5]
Driver: [3, 4, 1, 5, 4, 2, 2, 4, 5]

Gianluca Quercini Big Data Polytech Paris-Saclay, 2022 37 / 64


Lecture 1 – Apache Spark Programming with Resilient Distributed Datasets (RDDs)

Is collect() a safe action? F

What are the risks, if any, while invoking collect() on a large RDD?

Gianluca Quercini Big Data Polytech Paris-Saclay, 2022 38 / 64


Lecture 1 – Apache Spark Programming with Resilient Distributed Datasets (RDDs)

Is collect() a safe action? F

What are the risks, if any, while invoking collect() on a large RDD?

High network traffic.

The driver's memory may not be enough to store all the RDD elements.

Gianluca Quercini Big Data Polytech Paris-Saclay, 2022 38 / 64


Lecture 1 – Apache Spark Programming with Resilient Distributed Datasets (RDDs)

RDD actions: count

count() takes in an RDD and returns the number of items in the RDD.

count()
Partition 0: 3; 4 → 2
Partition 1: 1; 5 → 2
Partition 2: 4; 2 → 2
Partition 3: 2; 4; 5 → 3
Driver: 2 + 2 + 2 + 3 = 9

Gianluca Quercini Big Data Polytech Paris-Saclay, 2022 39 / 64


Lecture 1 – Apache Spark Programming with Resilient Distributed Datasets (RDDs)

Understanding the code F

What does the following code do?

r1 = sc.parallelize(["computer science", "geology",
                     "chemistry", "biology", "astronomy"])
r2 = r1.map(lambda x: x.capitalize())

Gianluca Quercini Big Data Polytech Paris-Saclay, 2022 40 / 64


Lecture 1 – Apache Spark Programming with Resilient Distributed Datasets (RDDs)

Understanding the code F

What does the following code do?

r1 = sc.parallelize(["computer science", "geology",
                     "chemistry", "biology", "astronomy"])
r2 = r1.map(lambda x: x.capitalize())

r2 is an RDD (result of a transformation).


r2 has as many elements as r1.
Each item of r2 is a string from r1 with the first letter
capitalized.

Gianluca Quercini Big Data Polytech Paris-Saclay, 2022 40 / 64


Lecture 1 – Apache Spark Programming with Resilient Distributed Datasets (RDDs)

What does the following code do? F

What does the following code do?

r1 = sc.parallelize(["computer science", "geology",
                     "chemistry", "biology", "astronomy"])
r2 = r1.filter(lambda x: len(x) > 10)

Gianluca Quercini Big Data Polytech Paris-Saclay, 2022 41 / 64


Lecture 1 – Apache Spark Programming with Resilient Distributed Datasets (RDDs)

What does the following code do? F

What does the following code do?

r1 = sc.parallelize(["computer science", "geology",
                     "chemistry", "biology", "astronomy"])
r2 = r1.filter(lambda x: len(x) > 10)

r2 is an RDD (result of a transformation).

r2 has fewer elements than r1.
r2 only contains the items from r1 that have more than 10 characters.

Gianluca Quercini Big Data Polytech Paris-Saclay, 2022 41 / 64


Lecture 1 – Apache Spark Programming with Resilient Distributed Datasets (RDDs)

What does the following code do? F

What does the following code do?

r1 = sc.parallelize(["computer science", "geology",
                     "chemistry", "biology", "astronomy"])
r2 = r1.reduce(lambda x, y: "{} - {}".format(x, y))

Gianluca Quercini Big Data Polytech Paris-Saclay, 2022 42 / 64


Lecture 1 – Apache Spark Programming with Resilient Distributed Datasets (RDDs)

What does the following code do? F

What does the following code do?

r1 = sc.parallelize(["computer science", "geology",
                     "chemistry", "biology", "astronomy"])
r2 = r1.reduce(lambda x, y: "{} - {}".format(x, y))

r2 is a string, not an RDD (result of an action).

r2 is the string "computer science - geology - chemistry - biology - astronomy".

Gianluca Quercini Big Data Polytech Paris-Saclay, 2022 42 / 64


Lecture 1 – Apache Spark Programming with Resilient Distributed Datasets (RDDs)

What does the following code do? F

What does the following code do?

r1 = sc.parallelize(["computer science", "geology",
                     "chemistry", "biology", "astronomy"])
r2 = r1.reduce(lambda x, y: [x + y])

Gianluca Quercini Big Data Polytech Paris-Saclay, 2022 43 / 64


Lecture 1 – Apache Spark Programming with Resilient Distributed Datasets (RDDs)

What does the following code do? F

What does the following code do?

r1 = sc.parallelize(["computer science", "geology",
                     "chemistry", "biology", "astronomy"])
r2 = r1.reduce(lambda x, y: [x + y])

The code is incorrect, because the return type (list) of the reduce
function is different from the type of the input RDD elements (string).

Gianluca Quercini Big Data Polytech Paris-Saclay, 2022 43 / 64


Lecture 1 – Apache Spark Programming with Resilient Distributed Datasets (RDDs)

What does the following code do? F

What does the following code do?

r1 = sc.parallelize(["author", "title", "edition"])
r2 = r1.flatMap(lambda x: [c for c in x])

Gianluca Quercini Big Data Polytech Paris-Saclay, 2022 44 / 64


Lecture 1 – Apache Spark Programming with Resilient Distributed Datasets (RDDs)

What does the following code do? F

What does the following code do?

r1 = sc.parallelize(["author", "title", "edition"])
r2 = r1.flatMap(lambda x: [c for c in x])

r2 is an RDD (result of a transformation).


Each element of r2 is a letter from a string in r1. How would
that be different if we had used map instead of flatMap?

Gianluca Quercini Big Data Polytech Paris-Saclay, 2022 44 / 64


Lecture 1 – Apache Spark Programming with Resilient Distributed Datasets (RDDs)

Key-value RDDs

Key-value RDDs (a.k.a., Pair RDDs) are RDDs where each item is
a pair (k, v ), k being the key and v being the value.

Key-value RDDs are important building blocks in many applications.

Key-value RDDs support all the transformations and actions that can
be applied on regular RDDs.

Key-value RDDs support special transformations and actions.

Gianluca Quercini Big Data Polytech Paris-Saclay, 2022 45 / 64


Lecture 1 – Apache Spark Programming with Resilient Distributed Datasets (RDDs)

Key-value RDDs transformations: reduceByKey

reduceByKey takes in an RDD with (K, V) pairs and a function f and returns a new RDD of (K, V) pairs where the values for each key are aggregated using f, which must be of type (V, V) → V.

reduceByKey(lambda x, y: x+y)
Input Partition 0: ('cat', 2) ; ('owl', 3)
Input Partition 1: ('dog', 5) ; ('cat', 2)
Input Partition 2: ('dog', 1) ; ('cow', 1)
Input Partition 3: ('cat', 3) ; ('owl', 4) ; ('tiger', 1)
Output: ('cat', 7) ; ('owl', 7) ; ('cow', 1) ; ('tiger', 1) | ('dog', 6)
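
A small reduceByKey sketch matching the figure above (the order of the pairs returned by collect() is not guaranteed):

pairs = sc.parallelize([('cat', 2), ('owl', 3), ('dog', 5), ('cat', 2),
                        ('dog', 1), ('cow', 1), ('cat', 3), ('owl', 4), ('tiger', 1)])
pairs.reduceByKey(lambda x, y: x + y).collect()
# e.g. [('cat', 7), ('owl', 7), ('cow', 1), ('tiger', 1), ('dog', 6)]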

Gianluca Quercini Big Data Polytech Paris-Saclay, 2022 46 / 64


Lecture 1 – Apache Spark Programming with Resilient Distributed Datasets (RDDs)

Key-value RDDs transformations: reduceByKey

The input RDD has a certain number of partitions n.

No assumption can be made on which elements belong to which partition.

The RDD returned by reduceByKey is hash partitioned: each item belongs to a precise partition.

The partition number p of a pair (K, V) is derived as follows:

p = hashCode(K) mod num_partitions

Gianluca Quercini Big Data Polytech Paris-Saclay, 2022 47 / 64


Lecture 1 – Apache Spark Programming with Resilient Distributed Datasets (RDDs)

Key-value RDDs transformations: groupByKey

groupByKey takes in an RDD with (K, V) pairs and returns a new RDD of (K, Iterable<V>) pairs.

groupByKey()
Input Partition 0: ('cat', 2) ; ('owl', 3)
Input Partition 1: ('dog', 5) ; ('cat', 2)
Input Partition 2: ('dog', 1) ; ('cow', 1)
Input Partition 3: ('cat', 3) ; ('owl', 4) ; ('tiger', 1)
Output: ('cat', iter) ; ('owl', iter) ; ('cow', iter) ; ('tiger', iter) | ('dog', iter)

Gianluca Quercini Big Data Polytech Paris-Saclay, 2022 48 / 64


Lecture 1 – Apache Spark Programming with Resilient Distributed Datasets (RDDs)

Key-value RDDs transformations: mapValues

mapValues takes in an RDD with (K, V) pairs and a function f and returns a new RDD where the function f is applied to each value V (keys are not modified).

mapValues(lambda x: len(x))
Partition 0: ('cat', [2, 2, 3]) ; ('owl', [3, 4]) ; ('cow', [1]) ; ('tiger', [1]) → ('cat', 3) ; ('owl', 2) ; ('cow', 1) ; ('tiger', 1)
Partition 1: (empty)
Partition 2: (empty)
Partition 3: ('dog', [5, 1]) → ('dog', 2)
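
A small sketch chaining groupByKey and mapValues (values are illustrative; mapValues(list) materializes the iterables so they print nicely):

pairs = sc.parallelize([('cat', 2), ('owl', 3), ('dog', 5), ('cat', 2),
                        ('dog', 1), ('cow', 1), ('cat', 3), ('owl', 4), ('tiger', 1)])
grouped = pairs.groupByKey().mapValues(list)   # (key, list of values) pairs
grouped.collect()          # e.g. [('cat', [2, 2, 3]), ('owl', [3, 4]), ('dog', [5, 1]), ...]
grouped.mapValues(len).collect()
# e.g. [('cat', 3), ('owl', 2), ('dog', 2), ('cow', 1), ('tiger', 1)]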

Gianluca Quercini Big Data Polytech Paris-Saclay, 2022 49 / 64


Lecture 1 – Apache Spark Programming with Resilient Distributed Datasets (RDDs)

Example: Word count

def word_count(input_file):
text = sc.textFile(input_file)
return text.flatMap(lambda line: line.split(" "))\
.map(lambda word: (word, 1))\
.reduceByKey(lambda x, y: x+y)

The function textFile reads a text file into an RDD.

Two narrow transformations (flatMap and map) and one wide transformation (reduceByKey).

+ Spark maintains a logical execution plan (called RDD lineage) described as a Directed Acyclic Graph (DAG).
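
A possible way to invoke the function above (the HDFS path reuses the one shown earlier in this lecture; the result depends on the input file):

counts = word_count("hdfs://sar01:9000/data/sample_text.txt")
counts.take(5)     # first five (word, count) pairs; the exact pairs depend on the input file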

Gianluca Quercini Big Data Polytech Paris-Saclay, 2022 50 / 64


Lecture 1 – Apache Spark Spark execution model

RDD lineage

Spark has a DAG scheduler that splits the graph into multiple stages.

Lineage of the word-count program:
SparkContext (sc)
HadoopRDD: sc.textFile(input_file)
MappedRDD: flatMap(lambda line: line.split(" "))
MappedRDD: map(lambda word: (word, 1))
ShuffledRDD: reduceByKey(lambda x, y: x+y)
Gianluca Quercini Big Data Polytech Paris-Saclay, 2022 51 / 64


Lecture 1 – Apache Spark Spark execution model

RDD lineage: stages

Sequences of narrow transformations are pipelined into a single stage. Wide transformations always trigger a new stage.

Stage 1: SparkContext (sc) → HadoopRDD: sc.textFile(input_file) → MappedRDD: flatMap(lambda line: line.split(" ")) → MappedRDD: map(lambda word: (word, 1))
Stage 2: ShuffledRDD: reduceByKey(lambda x, y: x+y)
Gianluca Quercini Big Data Polytech Paris-Saclay, 2022 52 / 64


Lecture 1 – Apache Spark Spark execution model

RDD lineage: stages

Stage 1: sc.textFile(input_file) → map(lambda x: ...) → filter(...)
Stage 2: sc.parallelize(...)
Stage 3: join() → saveAsTextFile()

Stages that have no dependency can be executed in parallel.

Gianluca Quercini Big Data Polytech Paris-Saclay, 2022 53 / 64


Lecture 1 – Apache Spark Spark execution model

RDD lineage: tasks

The DAG scheduler submits the stages to the task scheduler, which creates as many tasks as there are partitions in the RDD. Tasks are executed in parallel.

Stage 1: sc.textFile(input_file) → flatMap(lambda line: line.split(" ")) → map(lambda word: (word, 1))
Stage 2: reduceByKey(lambda x, y: x+y)

Gianluca Quercini Big Data Polytech Paris-Saclay, 2022 54 / 64


Lecture 1 – Apache Spark Spark execution model

RDD lineage: fault tolerance

What to do when a partition is lost?

Figure: a lineage graph combining map, union, groupBy and join, in which one partition has been lost.

Gianluca Quercini Big Data Polytech Paris-Saclay, 2022 55 / 64


Lecture 1 – Apache Spark Spark execution model

RDD lineage: fault tolerance

Lost partitions can be recomputed thanks to the lineage graph.

Figure: the same lineage graph (map, union, groupBy, join), used to recompute the lost partition.

No need to save intermediate results to disk.

Gianluca Quercini Big Data Polytech Paris-Saclay, 2022 56 / 64


Lecture 1 – Apache Spark Spark execution model

Lazy evaluation

In Spark, transformations are lazily evaluated.

Definition (Lazy evaluation)


When a transformation is invoked, Spark does not execute it
immediately. Transformations are only executed when the first action is
called.

An RDD can be thought of as a set of instructions on how to compute the data, which we build up through transformations.

Lazy evaluation helps reduce the number of passes needed to load and transform the data.

Gianluca Quercini Big Data Polytech Paris-Saclay, 2022 57 / 64


Lecture 1 – Apache Spark Spark execution model

Lazy evaluation: motivating example F

lines = sc.textFile("./data/logfile.txt")
exceptions = lines.filter(lambda line : "exception" in line)
nb_lines = exceptions.count()
print("Number of exception lines ", nb_lines)

What happens if Spark executes each transformation immediately?

Gianluca Quercini Big Data Polytech Paris-Saclay, 2022 58 / 64


Lecture 1 – Apache Spark Spark execution model

Lazy evaluation: motivating example F

Invoking sc.textFile() does not immediately load the data.
The transformation filter() is not applied when it is invoked.
Transformations are applied only when the action count() is invoked.
Only the data that meets the constraint of the filter is loaded from the file.

lines = sc.textFile("./data/logfile.txt")
exceptions = lines.filter(lambda line : "exception" in line)
nb_lines = exceptions.count()
print("Number of exception lines ", nb_lines)

+ Without lazy evaluation, we would have loaded the whole content of the input file into main memory.

Gianluca Quercini Big Data Polytech Paris-Saclay, 2022 59 / 64


Lecture 1 – Apache Spark Spark execution model

Lazy evaluation: consequences F

The following code invokes two actions: which ones?

What happens when we invoke the second action?

Example
lines = sc.textFile("./data/logfile.txt")
exceptions = lines.filter(lambda line : "exception" in line)
nb_lines = exceptions.count()
exceptions.collect()

Gianluca Quercini Big Data Polytech Paris-Saclay, 2022 60 / 64


Lecture 1 – Apache Spark Spark execution model

Lazy evaluation: consequences F

With lazy evaluation, transformations are computed each time an action is invoked on a given RDD.

In the following example, all transformations are computed when we invoke the function count() and again when we invoke the function collect().

Example
lines = sc.textFile("./data/logfile.txt")
exceptions = lines.filter(lambda line : "exception" in line)
nb_lines = exceptions.count()
exceptions.collect()

To avoid computing transformations multiple times, we can persist the data.

Gianluca Quercini Big Data Polytech Paris-Saclay, 2022 61 / 64


Lecture 1 – Apache Spark Spark execution model

Persisting the data


Persisting the data means caching the result of the transformations.
Either in main memory (the default), on disk, or both.
If a node in the cluster fails, Spark recomputes the persisted
partitions.
We can replicate persisted partitions on other nodes to recover from
failures without recomputing.

from pyspark import StorageLevel

lines = sc.textFile("./data/logfile.txt")
exceptions = lines.filter(lambda line : "exception" in line)
exceptions.persist(StorageLevel.MEMORY_AND_DISK)
nb_lines = exceptions.count()
exceptions.collect()

persist() is called right before the first action.


persist() does not force the evaluation of transformations.
unpersist() can be called to evict persisted partitions.
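
A small sketch of the caching calls (for RDDs, cache() is shorthand for persist() with the default MEMORY_ONLY storage level):

exceptions.cache()              # mark the RDD to be kept in memory
nb_lines = exceptions.count()   # first action: transformations run, partitions get cached
exceptions.collect()            # reuses the cached partitions
exceptions.unpersist()          # evict the cached partitions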
Gianluca Quercini Big Data Polytech Paris-Saclay, 2022 62 / 64
Lecture 1 – Apache Spark References

References

Karau, Holden, et al. Learning Spark: Lightning-Fast Big Data Analysis. O'Reilly Media, Inc., 2015.

Gianluca Quercini Big Data Polytech Paris-Saclay, 2022 63 / 64


Lecture 1 – Apache Spark References

Playing with transformations and actions

Notebook available on Google Colab.

+ Select File → Save a copy in Drive to create a copy of the notebook in your Drive and play with it.

Gianluca Quercini Big Data Polytech Paris-Saclay, 2022 64 / 64
