
Données massives et apprentissage profond

Lecture 1 – Apache Spark

Gianluca Quercini

[email protected]

Polytech Paris-Saclay, 2022


General information

Organization of the course

MapReduce and Spark.

Spark programming.

SQL and NoSQL.

MongoDB practice.

Hadoop technologies.

Scaling.

Gianluca Quercini Big Data Polytech Paris-Saclay, 2022 1 / 64


General information

Class material

Available online
https://tinyurl.com/p7jb5wra

Slides of the lectures.


Tutorials and lab assignments.
References (books and articles).

Gianluca Quercini Big Data Polytech Paris-Saclay, 2022 2 / 64


General information

Evaluation

Lab assignments. Lab assignments 1 and 2 will be graded.


Lab assignment 1. Spark programming
Lab assignment 2. MongoDB
Submission: source code + written report.

Written exam. 1 hour.


Spark programming.
Data modeling in MongoDB.
Querying in MongoDB.

Gianluca Quercini Big Data Polytech Paris-Saclay, 2022 3 / 64


General information

Contact

Email: [email protected]

Gianluca Quercini Big Data Polytech Paris-Saclay, 2022 4 / 64


Lecture 1 – Apache Spark Objectives

What you will learn

In this lecture you will learn:

What Spark is and its main features.

The components of the Spark stack.

The high-level Spark architecture.

The notion of Resilient Distributed Dataset (RDD).

The main transformations and actions on RDDs.

Gianluca Quercini Big Data Polytech Paris-Saclay, 2022 5 / 64


Lecture 1 – Apache Spark Introduction to Spark

Apache Spark

Definition (Apache Spark)


Apache Spark is a distributed computing framework designed to be fast
and general-purpose. Source

Main features
Speed. Run computations in memory (Hadoop relies on disks).
General-purpose. Different workloads in the same system.
Batch applications, iterative algorithms.
Interactive queries, streaming applications.
Accessibility. Python, Scala, Java, SQL and R; rich built-in libraries.
Integration. With other Big Data tools, such as Hadoop.

Gianluca Quercini Big Data Polytech Paris-Saclay, 2022 6 / 64


Lecture 1 – Apache Spark Introduction to Spark

Spark components

Spark SQL | Structured Streaming | MLlib | GraphX
Spark Core
Standalone Scheduler | YARN | Mesos | Kubernetes

Image source

Gianluca Quercini Big Data Polytech Paris-Saclay, 2022 7 / 64


Lecture 1 – Apache Spark Introduction to Spark

Spark components

Spark core
Scheduling, distributing, and monitoring applications.
Data structures for manipulating data (RDDs, DataFrames).

Spark SQL
Spark’s package for working with (semi-)structured data.
Data querying with SQL and HQL (Hive Query Language).
Many sources of data: JSON, XML, Parquet...

Gianluca Quercini Big Data Polytech Paris-Saclay, 2022 8 / 64


Lecture 1 – Apache Spark Introduction to Spark

Spark components

Structured streaming
Processing of live streams of data (e.g., real-time event logs)
Similar API to batch processing.

MLlib
Machine learning algorithms (e.g., classification, regression,
clustering)
All methods designed to scale out across a cluster.

Gianluca Quercini Big Data Polytech Paris-Saclay, 2022 9 / 64


Lecture 1 – Apache Spark Introduction to Spark

Spark components

GraphX
Manipulation of graph data.
Library with common graph algorithms (e.g., PageRank)

Cluster managers
Control how tasks are distributed across a cluster.
Spark provides its own standalone cluster manager.
Spark can also use other cluster managers.

Gianluca Quercini Big Data Polytech Paris-Saclay, 2022 10 / 64


Lecture 1 – Apache Spark Introduction to Spark

Spark unified stack: benefits

Shallow learning curve. Same programming model across all components.

Optimization propagation. Higher-level components automatically benefit from improvements to lower-layer components.

Cost minimization. No need for additional software components.

Heterogeneous processing models in the same application:
Read a stream of data.
Apply machine learning algorithms.
Use SQL to analyze the results.

Gianluca Quercini Big Data Polytech Paris-Saclay, 2022 11 / 64


Lecture 1 – Apache Spark Introduction to Spark

Using Spark

Interactive mode
Using a command-line interface (CLI) or shell.
Python and Scala shell.
SparkSQL shell.
SparkR shell.

Data processing applications


Building an application by using the Spark APIs.
Scala (Spark’s native language).
Python.
Java.

Gianluca Quercini Big Data Polytech Paris-Saclay, 2022 12 / 64


Lecture 1 – Apache Spark Introduction to Spark

Who uses Spark

Several important actors use Spark:

Amazon.

eBay. Log transaction aggregation and analytics.

Groupon.

Stanford DAWN. Research project aiming at democratizing AI.

TripAdvisor.

Yahoo!

+ Full list available at http://spark.apache.org/powered-by.html

Gianluca Quercini Big Data Polytech Paris-Saclay, 2022 13 / 64


Lecture 1 – Apache Spark Introduction to Spark

Spark application
Spark application: a set of independent processes called executors.
Executors run computations and store the data for the application.
Executors are coordinated by the driver.

Figure: the Spark driver (holding the SparkContext) communicates with the cluster manager; each worker node runs an executor with a cache, executing tasks. Image source

Gianluca Quercini Big Data Polytech Paris-Saclay, 2022 14 / 64


Lecture 1 – Apache Spark Introduction to Spark

Spark application execution

The driver is launched and creates the SparkContext object.

The SparkContext obtains executors from the cluster manager.

The driver sends the user’s code to the executors.

The driver assigns each executor a set of tasks.

A task is a computation on a chunk of data.

Figure: Spark driver (SparkContext) → cluster manager → worker node with an executor (cache) running tasks.

Gianluca Quercini Big Data Polytech Paris-Saclay, 2022 15 / 64


Lecture 1 – Apache Spark Introduction to Spark

Spark application execution

Applications are isolated from one another.


Each application has its own SparkContext.
An executor only runs tasks of one application.
A driver only schedules tasks for one application.
Data cannot be shared across different applications.

Spark is agnostic to the underlying cluster manager.


The driver listens to incoming connections from the executors on a
network port.
The driver should be in the same local network as the executors.

+ Two different Spark applications can still share data through an external storage system (e.g., a database or HDFS files).

Gianluca Quercini Big Data Polytech Paris-Saclay, 2022 16 / 64


Lecture 1 – Apache Spark Programming with Resilient Distributed Datasets (RDDs)

Spark programming

Two options exist to write a Spark application:

Low-level programming, using operations on a low-level data structure called Resilient Distributed Dataset (RDD).

High-level programming, using high-level libraries, such as SparkSQL and Structured Streaming.

+ In this lecture, we'll focus on low-level programming to better understand the inner workings of Spark.

Gianluca Quercini Big Data Polytech Paris-Saclay, 2022 17 / 64


Lecture 1 – Apache Spark Programming with Resilient Distributed Datasets (RDDs)

Low-level Spark programming

A Spark program uses an object called SparkContext.


SparkContext represents a connection to a cluster.

Initializing the SparkContext


from pyspark import SparkConf, SparkContext

conf = SparkConf().setMaster(<cluster URL>).setAppName(<app_name>)

sc = SparkContext(conf = conf)

A Spark program is a sequence of operations invoked on the SparkContext (sc).

These operations manipulate a special type of data structure, called Resilient Distributed Dataset (RDD).
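
A minimal sketch of this initialization for a local run; the master URL "local[*]" and the application name "lecture1-demo" are illustrative values, not prescribed by the course.

from pyspark import SparkConf, SparkContext

# Run locally, using all available cores (illustrative values).
conf = SparkConf().setMaster("local[*]").setAppName("lecture1-demo")
sc = SparkContext(conf=conf)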

Gianluca Quercini Big Data Polytech Paris-Saclay, 2022 18 / 64


Lecture 1 – Apache Spark Programming with Resilient Distributed Datasets (RDDs)

Resilient Distributed Dataset (RDD)

Definition (Resilient Distributed Dataset)


A Resilient Distributed Dataset, or simply RDD, is an immutable,
distributed collection of objects. Source

The data in each RDD is split across multiple partitions.

Each partition resides on one node of the cluster.

Two partitions can reside on the same node.

Gianluca Quercini Big Data Polytech Paris-Saclay, 2022 19 / 64


Lecture 1 – Apache Spark Programming with Resilient Distributed Datasets (RDDs)

Resilient Distributed Dataset (RDD)

Figure: an input file (an excerpt of Moby Dick) stored in HDFS is loaded into an RDD with four partitions (Partition 0 to Partition 3). By default, 1 HDFS block = 1 partition; it is possible to specify a different number of partitions.

Gianluca Quercini Big Data Polytech Paris-Saclay, 2022 20 / 64


Lecture 1 – Apache Spark Programming with Resilient Distributed Datasets (RDDs)

Resilient Distributed Dataset (RDD)

Figure: the four partitions of the RDD, each holding one chunk of the input text, distributed across the nodes of the cluster.

Gianluca Quercini Big Data Polytech Paris-Saclay, 2022 21 / 64


Lecture 1 – Apache Spark Programming with Resilient Distributed Datasets (RDDs)

Creating an RDD

1 From an in-memory collection (e.g., a list or a set).

sc.parallelize([1, 5, 3, 2, 6, 7])

+ This method is used for debugging and prototyping on small datasets.

2 From a data source on disk (e.g., a file or a database).

sc.textFile("hdfs://sar01:9000/data/sample_text.txt")

+ This method is used in production to process large datasets.

Gianluca Quercini Big Data Polytech Paris-Saclay, 2022 22 / 64


Lecture 1 – Apache Spark Programming with Resilient Distributed Datasets (RDDs)

About the number of partitions

RDDs created with parallelize

Local mode: number of cores on the local machine.

Cluster mode: total number of cores on all executor nodes, or 2, whichever is larger.

RDDs created from files stored in HDFS

Number of HDFS blocks in the input file, or 2, whichever is larger.
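
As a quick check, an RDD exposes getNumPartitions(); a small sketch (the HDFS path reuses the one shown earlier, and the exact counts depend on the cluster):

rdd = sc.parallelize(range(100), 4)        # explicitly request 4 partitions
rdd.getNumPartitions()                     # 4
text = sc.textFile("hdfs://sar01:9000/data/sample_text.txt", minPartitions=8)
text.getNumPartitions()                    # at least 8, depending on the number of blocks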

Gianluca Quercini Big Data Polytech Paris-Saclay, 2022 23 / 64


Lecture 1 – Apache Spark Programming with Resilient Distributed Datasets (RDDs)

RDD transformations

A transformation is an operation that takes in one or more RDDs and returns a new RDD. A transformation is applied in parallel on each partition.

Figure: a chain of transformations (RDD 1 → RDD 2 → RDD 3 → RDD 4, via Transformation 1, 2 and 3); each transformation is applied independently to every partition (Partition 0 to Partition 7).

Gianluca Quercini Big Data Polytech Paris-Saclay, 2022 24 / 64


Lecture 1 – Apache Spark Programming with Resilient Distributed Datasets (RDDs)

RDD transformations: map

map() takes in a function f and an RDD < x_i | 0 ≤ i ≤ n >; returns a new RDD < f(x_i) | 0 ≤ i ≤ n >.

map(lambda x: x*x)
Partition 0: 2; 5; 6; 7; 8; 11; 13 → 4; 25; 36; 49; 64; 121; 169
Partition 1: 4; 5; 2; 3; 4; 5; 8 → 16; 25; 4; 9; 16; 25; 64
Partition 2: 1; 4; 3; 2; 4; 5; 6 → 1; 16; 9; 4; 16; 25; 36
Partition 3: 2; 4; 5; 2; 3; 4; 8 → 4; 16; 25; 4; 9; 16; 64

+ Partition i of the input RDD is on the same node as partition i of the output RDD.
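
A small map sketch in PySpark (values are illustrative):

numbers = sc.parallelize([2, 5, 6, 7, 8, 11, 13])
squares = numbers.map(lambda x: x * x)     # transformation: returns a new RDD
squares.collect()                          # [4, 25, 36, 49, 64, 121, 169]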
Gianluca Quercini Big Data Polytech Paris-Saclay, 2022 25 / 64
Lecture 1 – Apache Spark Programming with Resilient Distributed Datasets (RDDs)

RDD transformations: flatMap


flatMap is used instead of map when the function f returns a list and we need the
results to be flattened.

Figure: starting from an RDD of text lines (Lorem ipsum...), flatMap(lambda x: x.split()) produces one flat RDD of words, while map(lambda x: x.split()) produces an RDD whose elements are lists of words (one list per input line).
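
A small sketch contrasting flatMap and map on the same input (values are illustrative):

lines = sc.parallelize(["Lorem ipsum dolor", "sit amet"])
lines.map(lambda x: x.split()).collect()       # [['Lorem', 'ipsum', 'dolor'], ['sit', 'amet']]
lines.flatMap(lambda x: x.split()).collect()   # ['Lorem', 'ipsum', 'dolor', 'sit', 'amet']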

Gianluca Quercini Big Data Polytech Paris-Saclay, 2022 26 / 64


Lecture 1 – Apache Spark Programming with Resilient Distributed Datasets (RDDs)

RDD transformations: filter

filter() takes in a predicate p and an RDD < x_i | 0 ≤ i ≤ n >; returns a new RDD < x_i | 0 ≤ i ≤ n, p(x_i) is true >.

filter(lambda x: x>3)
Partition 0: 2; 5; 6; 7; 8; 11; 13 → 5; 6; 7; 8; 11; 13
Partition 1: 4; 5; 2; 3; 4; 5; 8 → 4; 5; 4; 5; 8
Partition 2: 1; 4; 3; 2; 4; 5; 6 → 4; 4; 5; 6
Partition 3: 2; 4; 5; 2; 3; 4; 8 → 4; 5; 4; 8
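
A small filter sketch (values are illustrative; filter preserves the order of the elements within each partition):

numbers = sc.parallelize([2, 5, 6, 7, 8, 11, 13])
numbers.filter(lambda x: x > 3).collect()      # [5, 6, 7, 8, 11, 13]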

Gianluca Quercini Big Data Polytech Paris-Saclay, 2022 27 / 64


Lecture 1 – Apache Spark Programming with Resilient Distributed Datasets (RDDs)

RDD transformations: union

union() takes in two RDDs and returns a new RDD containing the
items of the first and second RDD with repetitions.

union()
RDD 1 partitions: 3;4 | 1;5 | 4;2 | 4;5
RDD 2 partitions: 10 | 12;13 | 2;4 | 3;6
Result: 3;4 | 1;5 | 4;2 | 4;5 | 10 | 12;13 | 2;4 | 3;6 (all partitions of both RDDs, repetitions kept)

Gianluca Quercini Big Data Polytech Paris-Saclay, 2022 28 / 64


Lecture 1 – Apache Spark Programming with Resilient Distributed Datasets (RDDs)

RDD transformations: distinct

distinct() takes in one RDD and returns a new RDD containing the items of the input RDD without repetitions.

distinct()
Partition 0: 2; 5; 6; 7; 8; 11; 13 → 8; 4
Partition 1: 4; 5; 2; 3; 4; 5; 8 → 5; 13; 1
Partition 2: 1; 4; 3; 2; 4; 5; 6 → 2; 6
Partition 3: 2; 4; 5; 2; 3; 4; 8 → 7; 11; 3

+ Unlike the previous transformations, distinct leads to data being shuffled.
Gianluca Quercini Big Data Polytech Paris-Saclay, 2022 29 / 64
Lecture 1 – Apache Spark Programming with Resilient Distributed Datasets (RDDs)

About data shuffling F

Which partition does the element 23 belong to in the RDD obtained after applying the transformation distinct?

distinct()
Partition 0: 4; 5; 4 → 4; 12
Partition 1: 3; 2; 6; 7 → 5; 1
Partition 2: 23; 12; 1; 4 → 2; 6
Partition 3: 4; 23; 11; 2 → 3; 7; 11

Gianluca Quercini Big Data Polytech Paris-Saclay, 2022 30 / 64


Lecture 1 – Apache Spark Programming with Resilient Distributed Datasets (RDDs)

About data shuffling F

The element 23 belongs to partition 3.

While shuffling, the destination partition p of an element K in an RDD with n partitions is computed as follows:

p = hashCode(K) mod n
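
In PySpark, the default partitioning function used for shuffled RDDs is portable_hash, so the destination partition can be sketched as follows (assuming 4 partitions, as in the figure above):

from pyspark.rdd import portable_hash

portable_hash(23) % 4     # 3: the element 23 goes to partition 3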

Gianluca Quercini Big Data Polytech Paris-Saclay, 2022 31 / 64


Lecture 1 – Apache Spark Programming with Resilient Distributed Datasets (RDDs)

RDD transformations: intersection

intersection() takes in two RDDs and returns a new RDD containing the items that occur in both RDDs (without duplicates).

intersection()
RDD 1 partitions: 3;4 | 1;5 | 4;2 | 4;5
RDD 2 partitions: 10 | 12;13 | 2;4 | 3;6
Result: 2 | 3 | 4 (duplicates removed)
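
A small sketch combining union, distinct and intersection (values are illustrative; the order of the elements returned by collect() is not guaranteed after a shuffle):

r1 = sc.parallelize([3, 4, 1, 5, 4, 2, 4, 5])
r2 = sc.parallelize([10, 12, 13, 2, 4, 3, 6])
r1.union(r2).count()             # 15: duplicates are kept
r1.distinct().collect()          # e.g. [4, 1, 5, 2, 3]: duplicates removed
r1.intersection(r2).collect()    # e.g. [2, 3, 4]: items present in both RDDs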

Gianluca Quercini Big Data Polytech Paris-Saclay, 2022 32 / 64


Lecture 1 – Apache Spark Programming with Resilient Distributed Datasets (RDDs)

Narrow transformations F

Definition (Narrow transformation)


A narrow transformation is one where each partition of the output RDD
depends on at most one partition of the input RDD.

Which of the above transformations are narrow?

Narrow transformations are inexpensive.


No need for communication between executors.

Gianluca Quercini Big Data Polytech Paris-Saclay, 2022 33 / 64


Lecture 1 – Apache Spark Programming with Resilient Distributed Datasets (RDDs)

Narrow transformations F

Definition (Narrow transformation)


A narrow transformation is one where each partition of the output RDD
depends on at most one partition of the input RDD.

filter, map, flatMap and union are narrow transformations.

Narrow transformations are inexpensive.


No need for communication between executors.

Gianluca Quercini Big Data Polytech Paris-Saclay, 2022 33 / 64


Lecture 1 – Apache Spark Programming with Resilient Distributed Datasets (RDDs)

Wide transformations F

Definition (Wide transformation)


A wide transformation is one where each partition of the output RDD
may depend on several partitions of the input RDD.

Which of the above transformations are wide?

Wide transformations are more costly.


Executors need to communicate.
Data is shuffled across the cluster network.

Gianluca Quercini Big Data Polytech Paris-Saclay, 2022 34 / 64


Lecture 1 – Apache Spark Programming with Resilient Distributed Datasets (RDDs)

Wide transformations F

Definition (Wide transformation)


A wide transformation is one where each partition of the output RDD
may depend on several partitions of the input RDD.

distinct and intersection are wide transformations.

Wide transformations are more costly.


Executors need to communicate.
Data is shuffled across the cluster network.

Gianluca Quercini Big Data Polytech Paris-Saclay, 2022 34 / 64


Lecture 1 – Apache Spark Programming with Resilient Distributed Datasets (RDDs)

RDD actions

An action is an operation that takes in an RDD and returns a value to the driver after running a computation on the dataset.

The result of an action is sent to the driver.

If the result is a list of values, all values are sent to the driver.

The result of an action can also be written to disk.

Disk writes can be to the local file system or HDFS.

Gianluca Quercini Big Data Polytech Paris-Saclay, 2022 35 / 64


Lecture 1 – Apache Spark Programming with Resilient Distributed Datasets (RDDs)

RDD actions: reduce

reduce() takes in an RDD and a function f and applies the function pair-wise to all elements of the input RDD.

reduce(lambda x, y: x+y)
Partition 0: 3; 4 → 3 + 4 = 7
Partition 1: 1; 5 → 1 + 5 = 6
Partition 2: 4; 2 → 4 + 2 = 6
Partition 3: 2; 4; 5 → 2 + 4 + 5 = 11
Driver: 7 + 6 + 6 + 11 = 30

The function f must take in 2 arguments.


The type of the value returned by the function f must be the
same as the type of the elements of the input RDD.
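
A small reduce sketch (values are illustrative; the result is a plain Python value on the driver, not an RDD):

numbers = sc.parallelize([3, 4, 1, 5, 4, 2, 2, 4, 5])
numbers.reduce(lambda x, y: x + y)     # 30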
Gianluca Quercini Big Data Polytech Paris-Saclay, 2022 36 / 64
Lecture 1 – Apache Spark Programming with Resilient Distributed Datasets (RDDs)

RDD actions: collect

collect() takes in an RDD and returns the list of the elements in the RDD.

collect()
Partition 0: 3; 4 → [3, 4]
Partition 1: 1; 5 → [1, 5]
Partition 2: 4; 2 → [4, 2]
Partition 3: 2; 4; 5 → [2, 4, 5]
Driver: [3, 4, 1, 5, 4, 2, 2, 4, 5]

Gianluca Quercini Big Data Polytech Paris-Saclay, 2022 37 / 64


Lecture 1 – Apache Spark Programming with Resilient Distributed Datasets (RDDs)

Is collect() a safe action? F

What are the risks, if any, while invoking collect() on a large RDD?

Gianluca Quercini Big Data Polytech Paris-Saclay, 2022 38 / 64


Lecture 1 – Apache Spark Programming with Resilient Distributed Datasets (RDDs)

Is collect() a safe action? F

What are the risks, if any, while invoking collect() on a large RDD?

High network traffic.

The driver's memory may not be enough to store all the RDD elements.

Gianluca Quercini Big Data Polytech Paris-Saclay, 2022 38 / 64


Lecture 1 – Apache Spark Programming with Resilient Distributed Datasets (RDDs)

RDD actions: count

count() takes in an RDD and returns the number of items in the RDD.

count()
Partition 0: 3; 4 → 2
Partition 1: 1; 5 → 2
Partition 2: 4; 2 → 2
Partition 3: 2; 4; 5 → 3
Driver: 2 + 2 + 2 + 3 = 9

Gianluca Quercini Big Data Polytech Paris-Saclay, 2022 39 / 64


Lecture 1 – Apache Spark Programming with Resilient Distributed Datasets (RDDs)

Understanding the code F

What does the following code do?

r1 = sc.parallelize(["computer science", "geology",
                     "chemistry", "biology", "astronomy"])
r2 = r1.map(lambda x: x.capitalize())

Gianluca Quercini Big Data Polytech Paris-Saclay, 2022 40 / 64


Lecture 1 – Apache Spark Programming with Resilient Distributed Datasets (RDDs)

Understanding the code F

What does the following code do?

r1 = sc.parallelize(["computer science", "geology",
                     "chemistry", "biology", "astronomy"])
r2 = r1.map(lambda x: x.capitalize())

r2 is an RDD (result of a transformation).


r2 has as many elements as r1.
Each item of r2 is a string from r1 with the first letter
capitalized.

Gianluca Quercini Big Data Polytech Paris-Saclay, 2022 40 / 64


Lecture 1 – Apache Spark Programming with Resilient Distributed Datasets (RDDs)

What does the following code do? F

What does the following code do?

r1 = sc.parallelize(["computer science", "geology",
                     "chemistry", "biology", "astronomy"])
r2 = r1.filter(lambda x: len(x) > 10)

Gianluca Quercini Big Data Polytech Paris-Saclay, 2022 41 / 64


Lecture 1 – Apache Spark Programming with Resilient Distributed Datasets (RDDs)

What does the following code do? F

What does the following code do?

r1 = sc.parallelize(["computer science", "geology",
                     "chemistry", "biology", "astronomy"])
r2 = r1.filter(lambda x: len(x) > 10)

r2 is an RDD (result of a transformation).

r2 has fewer elements than r1.
r2 only contains the items from r1 that have more than 10 characters.

Gianluca Quercini Big Data Polytech Paris-Saclay, 2022 41 / 64


Lecture 1 – Apache Spark Programming with Resilient Distributed Datasets (RDDs)

What does the following code do? F

What does the following code do?

r1 = sc.parallelize(["computer science", "geology",
                     "chemistry", "biology", "astronomy"])
r2 = r1.reduce(lambda x, y: "{} - {}".format(x, y))

Gianluca Quercini Big Data Polytech Paris-Saclay, 2022 42 / 64


Lecture 1 – Apache Spark Programming with Resilient Distributed Datasets (RDDs)

What does the following code do? F

What does the following code do?

r1 = sc.parallelize(["computer science", "geology",
                     "chemistry", "biology", "astronomy"])
r2 = r1.reduce(lambda x, y: "{} - {}".format(x, y))

r2 is a string, not an RDD (result of an action).

r2 is the string "computer science - geology - chemistry - biology - astronomy".

Gianluca Quercini Big Data Polytech Paris-Saclay, 2022 42 / 64


Lecture 1 – Apache Spark Programming with Resilient Distributed Datasets (RDDs)

What does the following code do? F

What does the following code do?

r1 = sc.parallelize(["computer science", "geology",
                     "chemistry", "biology", "astronomy"])
r2 = r1.reduce(lambda x, y: [x + y])

Gianluca Quercini Big Data Polytech Paris-Saclay, 2022 43 / 64


Lecture 1 – Apache Spark Programming with Resilient Distributed Datasets (RDDs)

What does the following code do? F

What does the following code do?

r1 = sc.parallelize(["computer science", "geology",
                     "chemistry", "biology", "astronomy"])
r2 = r1.reduce(lambda x, y: [x + y])

The code is incorrect, because the return type (list) of the reduce
function is different from the type of the input RDD elements (string).

Gianluca Quercini Big Data Polytech Paris-Saclay, 2022 43 / 64


Lecture 1 – Apache Spark Programming with Resilient Distributed Datasets (RDDs)

What does the following code do? F

What does the following code do?

r1 = sc.parallelize(["author", "title", "edition"])
r2 = r1.flatMap(lambda x: [c for c in x])

Gianluca Quercini Big Data Polytech Paris-Saclay, 2022 44 / 64


Lecture 1 – Apache Spark Programming with Resilient Distributed Datasets (RDDs)

What does the following code do? F

What does the following code do?

r1 = sc.parallelize(["author", "title", "edition"])
r2 = r1.flatMap(lambda x: [c for c in x])

r2 is an RDD (result of a transformation).


Each element of r2 is a letter from a string in r1. How would
that be different if we had used map instead of flatMap?

Gianluca Quercini Big Data Polytech Paris-Saclay, 2022 44 / 64


Lecture 1 – Apache Spark Programming with Resilient Distributed Datasets (RDDs)

Key-value RDDs

Key-value RDDs (a.k.a., Pair RDDs) are RDDs where each item is
a pair (k, v ), k being the key and v being the value.

Key-value RDDs are important building blocks in many applications.

Key-value RDDs support all the transformations and actions that can
be applied on regular RDDs.

Key-value RDDs support special transformations and actions.

Gianluca Quercini Big Data Polytech Paris-Saclay, 2022 45 / 64


Lecture 1 – Apache Spark Programming with Resilient Distributed Datasets (RDDs)

Key-value RDDs transformations: reduceByKey

reduceByKey takes in an RDD with (K, V) pairs and a function f and returns a new RDD of (K, V) pairs where the values for each key are aggregated using f, which must be of type (V, V) → V.

reduceByKey(lambda x, y: x+y)
Input Partition 0: ('cat', 2) ; ('owl', 3)
Input Partition 1: ('dog', 5) ; ('cat', 2)
Input Partition 2: ('dog', 1) ; ('cow', 1)
Input Partition 3: ('cat', 3) ; ('owl', 4) ; ('tiger', 1)
Output: ('cat', 7) ; ('owl', 7) ; ('cow', 1) ; ('tiger', 1) | ('dog', 6)
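
A small reduceByKey sketch matching the figure above (the order of the pairs returned by collect() is not guaranteed):

pairs = sc.parallelize([('cat', 2), ('owl', 3), ('dog', 5), ('cat', 2),
                        ('dog', 1), ('cow', 1), ('cat', 3), ('owl', 4), ('tiger', 1)])
pairs.reduceByKey(lambda x, y: x + y).collect()
# e.g. [('cat', 7), ('owl', 7), ('cow', 1), ('tiger', 1), ('dog', 6)]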

Gianluca Quercini Big Data Polytech Paris-Saclay, 2022 46 / 64


Lecture 1 – Apache Spark Programming with Resilient Distributed Datasets (RDDs)

Key-value RDDs transformations: reduceByKey

The input RDD has a certain number of partitions n.

No assumption can be made on which elements belong to which partition.

The RDD returned by reduceByKey is hash partitioned: each item belongs to a precise partition.

The partition number p of a pair (K, V) is derived as follows:

p = hashCode(K) mod num_partitions

Gianluca Quercini Big Data Polytech Paris-Saclay, 2022 47 / 64


Lecture 1 – Apache Spark Programming with Resilient Distributed Datasets (RDDs)

Key-value RDDs transformations: groupByKey

groupByKey takes in an RDD with (K, V) pairs and returns a new RDD of (K, Iterable<V>) pairs.

groupByKey()
Input Partition 0: ('cat', 2) ; ('owl', 3)
Input Partition 1: ('dog', 5) ; ('cat', 2)
Input Partition 2: ('dog', 1) ; ('cow', 1)
Input Partition 3: ('cat', 3) ; ('owl', 4) ; ('tiger', 1)
Output: ('cat', iter) ; ('owl', iter) ; ('cow', iter) ; ('tiger', iter) | ('dog', iter)

Gianluca Quercini Big Data Polytech Paris-Saclay, 2022 48 / 64


Lecture 1 – Apache Spark Programming with Resilient Distributed Datasets (RDDs)

Key-value RDDs transformations: mapValues

mapValues takes in an RDD with (K, V) pairs and a function f and returns a new RDD where the function f is applied to each value V (keys are not modified).

mapValues(lambda x: len(x))
Partition 0: ('cat', [2, 2, 3]) ; ('owl', [3, 4]) ; ('cow', [1]) ; ('tiger', [1]) → ('cat', 3) ; ('owl', 2) ; ('cow', 1) ; ('tiger', 1)
Partition 1: (empty)
Partition 2: (empty)
Partition 3: ('dog', [5, 1]) → ('dog', 2)
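
A small sketch chaining groupByKey and mapValues (values are illustrative; mapValues(list) materializes the iterables so they print nicely):

pairs = sc.parallelize([('cat', 2), ('owl', 3), ('dog', 5), ('cat', 2),
                        ('dog', 1), ('cow', 1), ('cat', 3), ('owl', 4), ('tiger', 1)])
grouped = pairs.groupByKey().mapValues(list)   # (key, list of values) pairs
grouped.collect()          # e.g. [('cat', [2, 2, 3]), ('owl', [3, 4]), ('dog', [5, 1]), ...]
grouped.mapValues(len).collect()
# e.g. [('cat', 3), ('owl', 2), ('dog', 2), ('cow', 1), ('tiger', 1)]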

Gianluca Quercini Big Data Polytech Paris-Saclay, 2022 49 / 64


Lecture 1 – Apache Spark Programming with Resilient Distributed Datasets (RDDs)

Example: Word count

def word_count(input_file):
text = sc.textFile(input_file)
return text.flatMap(lambda line: line.split(" "))\
.map(lambda word: (word, 1))\
.reduceByKey(lambda x, y: x+y)

The function textFile reads a text file into an RDD.

Two narrow transformations (flatMap and map) and one wide transformation (reduceByKey).

+ Spark maintains a logical execution plan (called RDD lineage) described as a Directed Acyclic Graph (DAG).
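
A possible way to invoke the function above (the HDFS path reuses the one shown earlier in this lecture; the result depends on the input file):

counts = word_count("hdfs://sar01:9000/data/sample_text.txt")
counts.take(5)     # first five (word, count) pairs; the exact pairs depend on the input file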

Gianluca Quercini Big Data Polytech Paris-Saclay, 2022 50 / 64


Lecture 1 – Apache Spark Spark execution model

RDD lineage

Spark has a DAG scheduler that splits the graph into multiple stages.

Lineage of the word-count program:
SparkContext (sc)
HadoopRDD: sc.textFile(input_file)
MappedRDD: flatMap(lambda line: line.split(" "))
MappedRDD: map(lambda word: (word, 1))
ShuffledRDD: reduceByKey(lambda x, y: x+y)
Gianluca Quercini Big Data Polytech Paris-Saclay, 2022 51 / 64


Lecture 1 – Apache Spark Spark execution model

RDD lineage: stages

Sequences of narrow transformations are pipelined into a single stage. Wide transformations always trigger a new stage.

Stage 1: SparkContext (sc) → HadoopRDD: sc.textFile(input_file) → MappedRDD: flatMap(lambda line: line.split(" ")) → MappedRDD: map(lambda word: (word, 1))
Stage 2: ShuffledRDD: reduceByKey(lambda x, y: x+y)
Gianluca Quercini Big Data Polytech Paris-Saclay, 2022 52 / 64


Lecture 1 – Apache Spark Spark execution model

RDD lineage: stages

Stage 1: sc.textFile(input_file) → map(lambda x: ...) → filter(...)
Stage 2: sc.parallelize(...)
Stage 3: join() → saveAsTextFile()

Stages that have no dependency can be executed in parallel.

Gianluca Quercini Big Data Polytech Paris-Saclay, 2022 53 / 64


Lecture 1 – Apache Spark Spark execution model

RDD lineage: tasks

The DAG scheduler submits the stages to the task scheduler, which creates as many tasks as there are partitions in the RDD. Tasks are executed in parallel.

Stage 1: sc.textFile(input_file) → flatMap(lambda line: line.split(" ")) → map(lambda word: (word, 1))
Stage 2: reduceByKey(lambda x, y: x+y)

Gianluca Quercini Big Data Polytech Paris-Saclay, 2022 54 / 64


Lecture 1 – Apache Spark Spark execution model

RDD lineage: fault tolerance

What to do when a partition is lost?

Figure: a lineage graph combining map, union, groupBy and join, in which one partition has been lost.

Gianluca Quercini Big Data Polytech Paris-Saclay, 2022 55 / 64


Lecture 1 – Apache Spark Spark execution model

RDD lineage: fault tolerance

Lost partitions can be recomputed thanks to the lineage graph.

Figure: the same lineage graph (map, union, groupBy, join), used to recompute the lost partition.

No need to save intermediate results to disk.

Gianluca Quercini Big Data Polytech Paris-Saclay, 2022 56 / 64


Lecture 1 – Apache Spark Spark execution model

Lazy evaluation

In Spark, transformations are lazily evaluated.

Definition (Lazy evaluation)


When a transformation is invoked, Spark does not execute it
immediately. Transformations are only executed when the first action is
called.

An RDD can be thought of as a set of instructions on how to compute the data, which we build up through transformations.

Lazy evaluation helps reduce the number of passes needed to load and transform the data.

Gianluca Quercini Big Data Polytech Paris-Saclay, 2022 57 / 64


Lecture 1 – Apache Spark Spark execution model

Lazy evaluation: motivating example F

lines = sc.textFile("./data/logfile.txt")
exceptions = lines.filter(lambda line : "exception" in line)
nb_lines = exceptions.count()
print("Number of exception lines ", nb_lines)

What happens if Spark executes each transformation immediately?

Gianluca Quercini Big Data Polytech Paris-Saclay, 2022 58 / 64


Lecture 1 – Apache Spark Spark execution model

Lazy evaluation: motivating example F

Invoking sc.textFile() does not immediately load the data.
The transformation filter() is not applied when it is invoked.
Transformations are applied only when the action count() is invoked.
Only the data that meets the constraint of the filter is loaded from the file.

lines = sc.textFile("./data/logfile.txt")
exceptions = lines.filter(lambda line : "exception" in line)
nb_lines = exceptions.count()
print("Number of exception lines ", nb_lines)

+ Without lazy evaluation, we would have loaded the whole content of the input file into main memory.

Gianluca Quercini Big Data Polytech Paris-Saclay, 2022 59 / 64


Lecture 1 – Apache Spark Spark execution model

Lazy evaluation: consequences F

The following code invokes two actions: which ones?

What happens when we invoke the second action?

Example
lines = sc.textFile("./data/logfile.txt")
exceptions = lines.filter(lambda line : "exception" in line)
nb_lines = exceptions.count()
exceptions.collect()

Gianluca Quercini Big Data Polytech Paris-Saclay, 2022 60 / 64


Lecture 1 – Apache Spark Spark execution model

Lazy evaluation: consequences F

With lazy evaluation, transformations are computed each time an action is invoked on a given RDD.

In the following example, all transformations are computed when we invoke the function count() and again when we invoke the function collect().

Example
lines = sc.textFile("./data/logfile.txt")
exceptions = lines.filter(lambda line : "exception" in line)
nb_lines = exceptions.count()
exceptions.collect()

To avoid computing transformations multiple times, we can persist the data.

Gianluca Quercini Big Data Polytech Paris-Saclay, 2022 61 / 64


Lecture 1 – Apache Spark Spark execution model

Persisting the data


Persisting the data means caching the result of the transformations.
Either in main memory (the default), on disk, or both.
If a node in the cluster fails, Spark recomputes the persisted
partitions.
We can replicate persisted partitions on other nodes to recover from
failures without recomputing.

from pyspark import StorageLevel

lines = sc.textFile("./data/logfile.txt")
exceptions = lines.filter(lambda line : "exception" in line)
exceptions.persist(StorageLevel.MEMORY_AND_DISK)
nb_lines = exceptions.count()
exceptions.collect()

persist() is called right before the first action.


persist() does not force the evaluation of transformations.
unpersist() can be called to evict persisted partitions.
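
A small sketch of the caching calls (for RDDs, cache() is shorthand for persist() with the default MEMORY_ONLY storage level):

exceptions.cache()              # mark the RDD to be kept in memory
nb_lines = exceptions.count()   # first action: transformations run, partitions get cached
exceptions.collect()            # reuses the cached partitions
exceptions.unpersist()          # evict the cached partitions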
Gianluca Quercini Big Data Polytech Paris-Saclay, 2022 62 / 64
Lecture 1 – Apache Spark References

References

Karau, Holden, et al. Learning Spark: Lightning-Fast Big Data Analysis. O'Reilly Media, Inc., 2015.

Gianluca Quercini Big Data Polytech Paris-Saclay, 2022 63 / 64


Lecture 1 – Apache Spark References

Playing with transformations and actions

Notebook available on Google Colab.

+ Select File → Save a copy in Drive to create a copy of the notebook in your Drive and play with it.

Gianluca Quercini Big Data Polytech Paris-Saclay, 2022 64 / 64
