07_Apache Spark - An Introduction
7 Apache Spark
An Introduction
Compiled by
Dr. Muhammad Sajid Qureshi
Contents
❖ Apache Spark
▪ Introduction, features, ecosystem, and major benefits
▪ Components of Spark
• Spark Core, Spark SQL, Spark Streaming, Spark MLlib, and Spark GraphX
▪ Architecture of Spark
• RDDs, Jobs, Tasks, and Context
▪ Anatomy of a Spark Job Run
• Job submission, DAG creation, Task scheduling, Task Execution
▪ Executors and Cluster Managers
▪ Spark applications
• Spark can store large datasets in a distributed fashion and apply parallel processing to them.
• Being an in-memory data processing engine, it can process real-time data streams.
• Spark can run on YARN and works with Hadoop file formats and storage backends like HDFS.
▪ Data analysts use Spark to process, analyze, transform, and visualize data at very large scale.
• Spark provides a user-friendly interface for programming a cluster, with implicit data parallelism and fault tolerance.
▪ In 2014, Spark set a world record when Databricks used it to sort a 100 TB dataset in the Daytona GraySort contest.
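A minimal PySpark sketch of this workflow (application and variable names are illustrative, not from the slides): it distributes a collection across the cluster, processes it in parallel, and returns a result to the driver.

```python
from pyspark.sql import SparkSession

# Entry point to Spark; connects to whatever cluster is configured.
spark = SparkSession.builder.appName("IntroDemo").getOrCreate()
sc = spark.sparkContext

# Distribute a local collection across the cluster as an RDD.
numbers = sc.parallelize(range(1_000_000))

# The squaring work runs in parallel on the RDD's partitions.
total = numbers.map(lambda x: x * x).sum()

print(total)
spark.stop()
```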
❖ In-memory computing
▪ Because the data resides in RAM, read/write operations and processing are very fast.
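A small sketch of the idea (dataset and names are illustrative): once an RDD is cached, later actions read its partitions from executor memory instead of recomputing them.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("CacheDemo").getOrCreate()
sc = spark.sparkContext

data = sc.parallelize(range(1_000_000)).map(lambda x: x % 97)

data.cache()             # ask Spark to keep the partitions in executor memory
data.count()             # first action computes the RDD and fills the cache
data.distinct().count()  # this job reads the cached partitions from RAM

spark.stop()
```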
❖ Fault tolerance
▪ The RDD mechanism makes Spark a reliable data processing engine: if a node fails, the lost data is recovered by recomputing RDD partitions from their lineage.
Hadoop vs. Spark
o Speed: Hadoop's MapReduce framework is slower than Spark because it loads data from storage devices before processing it; Spark, being an in-memory processing engine, performs parallel data processing much faster.
o Authentication: Hadoop uses Kerberos, which is complicated to set up; Spark uses a shared secret for easier authentication, and can additionally run on YARN to use Kerberos.
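The shared secret mentioned above is enabled through Spark configuration; a minimal sketch, assuming a standalone-style setup where the secret is set manually (on YARN, Spark generates the secret automatically). The secret value is a placeholder.

```python
from pyspark import SparkConf
from pyspark.sql import SparkSession

conf = (
    SparkConf()
    .set("spark.authenticate", "true")              # turn on shared-secret authentication
    .set("spark.authenticate.secret", "CHANGE_ME")  # placeholder; never commit a real secret
)

spark = SparkSession.builder.config(conf=conf).appName("AuthDemo").getOrCreate()
```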
▪ An RDD is a read-only collection of objects that is partitioned across multiple data nodes in a cluster.
• Through a series of transformations, input RDDs are turned into a set of target RDDs, on which an action is then performed.
▪ RDDs are resilient because Spark can automatically reconstruct a lost partition by recomputing it from the RDDs it was derived from.
• If the return type of an operation is an RDD, it's a transformation; otherwise, it's an action.
▪ An action triggers a computation on an RDD and does something with the results—either returning them to
the user, or saving them to external storage.
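The return-type rule is easy to check in code; a short sketch with illustrative data (the output path in the last line is hypothetical):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("RddDemo").getOrCreate()
sc = spark.sparkContext

rdd = sc.parallelize([1, 2, 3, 4, 5])

evens = rdd.filter(lambda x: x % 2 == 0)  # returns an RDD -> transformation (lazy)
doubled = evens.map(lambda x: x * 2)      # returns an RDD -> transformation (lazy)

print(doubled.collect())  # returns a list to the driver -> action, triggers the computation

doubled.saveAsTextFile("out/doubled")  # an action that saves the results to external storage
```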
▪ Stages are split into tasks by the Spark runtime and are run in parallel on partitions of an RDD spread across
the cluster—just like tasks in MapReduce.
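The one-task-per-partition relationship can be observed directly; a small sketch:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("PartitionDemo").getOrCreate()
sc = spark.sparkContext

rdd = sc.parallelize(range(1000), numSlices=8)  # explicitly request 8 partitions
print(rdd.getNumPartitions())                   # 8

rdd.count()  # the resulting stage runs 8 tasks, one per partition, in parallel
```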
▪ A job always runs in the context of an application (represented by a SparkContext instance) that serves to
group RDDs and shared variables.
▪ An application can run more than one job, in series or in parallel, and provides the mechanism for a job to
access an RDD that was cached by a previous job in the same application.
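A sketch of two jobs in one application: both actions below run in the same SparkContext, and the second job reuses the RDD cached by the first.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("TwoJobsDemo").getOrCreate()
sc = spark.sparkContext  # one SparkContext = one application

words = sc.parallelize(["spark", "hadoop", "spark", "yarn"]).cache()

print(words.count())             # job 1: computes the RDD and populates the cache
print(words.distinct().count())  # job 2: same application, reuses the cached RDD

spark.stop()
```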
▪ Job submission
▪ DAG creation
▪ Task scheduling
▪ Task execution
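The DAG built for a job can be inspected before the job runs; a small sketch using the RDD's toDebugString(), whose indentation marks the stage boundary introduced by the shuffle:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("DagDemo").getOrCreate()
sc = spark.sparkContext

pairs = (
    sc.parallelize(range(100))
      .map(lambda x: (x % 10, 1))
      .reduceByKey(lambda a, b: a + b)  # the shuffle here splits the job into two stages
)

print(pairs.toDebugString().decode())  # lineage of the RDD; indentation shows the stages

pairs.collect()  # the action submits the job: DAG -> stages -> tasks -> execution
```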
❖ Spark uses executors to run the tasks that make up a job.
▪ First, the executor makes sure that the task's dependencies are up to date; it keeps a local cache of all the dependencies that previous tasks have used.
▪ Second, it deserializes the task code from the serialized bytes that were sent as part of the
launch-task message.
▪ Third, it executes the task code; the task runs in the same JVM as the executor.
• Tasks can return a result to the driver. The result is serialized and sent to the executor
backend, and then back to the driver as a status update message.
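This serialization is visible to the programmer: the task's function, together with any driver-side variables it captures, is serialized and shipped to the executors, and the action's result comes back the same way. A sketch:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ClosureDemo").getOrCreate()
sc = spark.sparkContext

factor = 3  # a driver-side variable captured by the closure below

# The lambda (and 'factor' with it) is serialized and sent to the executors.
scaled = sc.parallelize([1, 2, 3]).map(lambda x: x * factor)

# take() is an action: executors run the task code, serialize the results,
# and the driver receives them via the status update messages.
print(scaled.take(3))  # [3, 6, 9]
```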
▪ Spark requires a cluster manager to manage the lifecycle of executors that run the jobs.
• Local mode
✓ In local mode, there is a single executor running in the same JVM as the driver.
• Standalone
✓ A simple distributed implementation that runs a single master and multiple worker nodes.
• Apache Mesos
✓ A general-purpose cluster resource manager that can share a cluster among many different applications.
• Hadoop YARN
✓ When YARN is used as a cluster manager for Spark, each Spark application corresponds to an
instance of a YARN application, and each executor runs in its own YARN container.
✓ The Mesos and YARN cluster managers are superior to the standalone manager as they can
manage resources of other applications running on the cluster. They also enforce a scheduling
policy across all of them.
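The cluster manager is chosen through the master URL passed when the application starts; a minimal sketch (hostnames are placeholders, and in practice the URL is usually supplied via spark-submit --master rather than hard-coded):

```python
from pyspark.sql import SparkSession

# Local mode: driver and executor share a single JVM.
spark = SparkSession.builder.master("local[*]").appName("ManagerDemo").getOrCreate()

# Standalone mode would use a URL such as "spark://master-host:7077",
# and YARN the literal string "yarn" (with HADOOP_CONF_DIR pointing at
# the Hadoop cluster configuration).
```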
▪ Deploy modes (on YARN)
• Client mode
✓ In client mode, the driver program runs inside the client process.
✓ The client mode is required for programs having an interactive component, like Spark-shell or
PySpark.
• Cluster mode
✓ In this mode, the driver program runs on the cluster in the YARN Application Master.
✓ YARN cluster mode is appropriate for production jobs, since the entire application runs on the cluster and its logs are collected there.
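Deploy mode is fixed at submission time (spark-submit's --deploy-mode flag); from inside the application it can only be inspected. A small sketch:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ModeDemo").getOrCreate()

# "client": the driver runs in the submitting process (spark-shell, pyspark).
# "cluster": the driver runs on the cluster, inside the YARN Application Master.
mode = spark.sparkContext.getConf().get("spark.submit.deployMode", "client")
print(f"Running in {mode} deploy mode")
```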
❖ Suggested videos
• https://www.youtube.com/watch?v=QaoJNXW6SQo&t=3s
• https://www.youtube.com/watch?v=znBa13Earms
• https://www.youtube.com/watch?v=jDkLiqlyQaY