UNIT - III
Spark Definition
Spark is a unified computing engine with a set of libraries for parallel data processing on computer clusters.
Computing Engine:
Spark makes parallel data computing/processing possible, much like MapReduce in Hadoop.
The difference is that MapReduce works only on data stored in Hadoop (HDFS), whereas
Spark can work on top of any distributed storage, relieving the end user from worrying
about where to store and retrieve the data.
History
In 2009, when the University of California, Berkeley created a new resource manager for Hadoop called Mesos,
Spark was created as a programming framework to test the functionality of Mesos.
• Polyglot: Spark supports four languages: Java, Scala, Python, and R. We can write Spark
code in any one of these languages. Spark also provides a command-line interface (shell)
in Scala and Python.
Layered View of Spark
• Layer 1: The base Spark download provides Spark Core, which can process data both in
memory (RAM) and on disk.
Spark SQL:
• Spark SQL is built on top of Spark Core. It provides support for structured
data.
• It allows querying the data via SQL (Structured Query Language) as well as the
Apache Hive variant of SQL called HQL (Hive Query Language).
• It supports JDBC and ODBC connections, which allow existing databases,
data warehouses, and business intelligence tools to connect to Spark.
• It also supports various sources of data like Hive tables, Parquet, and JSON.
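The snippet below is a minimal sketch of these ideas for the Scala spark-shell, where a SparkSession named spark is predefined; the people.json file and its name and age columns are hypothetical examples.

// Load structured data from a hypothetical JSON file into a DataFrame.
val people = spark.read.json("people.json")

// Register the DataFrame as a temporary view so it can be queried with SQL.
people.createOrReplaceTempView("people")

// Query the data with ordinary SQL; HiveQL works the same way when Hive support is enabled.
spark.sql("SELECT name, age FROM people WHERE age >= 18").show()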
Components of Spark Unified Stack
Spark Streaming:
• Spark Streaming is a Spark component that supports scalable and fault-
tolerant processing of streaming data.
• It uses Spark Core's fast scheduling capability to perform streaming
analytics.
• It accepts data in mini-batches and performs RDD transformations on that
data.
• Its design ensures that the applications written for streaming data can be
reused to analyze batches of historical data with little modification.
• The log files generated by web servers can be considered as a real-time
example of a data stream
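A minimal DStream sketch for the Scala spark-shell (where a SparkContext named sc is predefined); the socket source on localhost:9999 is a hypothetical stand-in for a real stream such as web-server logs.

import org.apache.spark.streaming.{Seconds, StreamingContext}

// Build a StreamingContext that cuts the stream into 5-second mini-batches.
val ssc = new StreamingContext(sc, Seconds(5))

// A DStream of text lines from a hypothetical socket source (e.g. started with: nc -lk 9999).
val lines = ssc.socketTextStream("localhost", 9999)

// Ordinary RDD transformations are applied to every mini-batch.
val counts = lines.flatMap(_.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
counts.print()

ssc.start()             // begin receiving and processing the stream
ssc.awaitTermination()  // run until the streaming job is stopped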
MLlib:
• MLlib is a machine learning library that contains various machine
learning algorithms.
• These include correlations and hypothesis testing, classification and
regression, clustering, and principal component analysis.
• It is nine times faster than the disk-based implementation used by
Apache Mahout.
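A small MLlib sketch for the Scala spark-shell showing one of the algorithms listed above, k-means clustering; the four 2-D points are made-up values used only for illustration.

import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.linalg.Vectors

// A tiny made-up dataset of 2-D points in a DataFrame with a "features" column.
val points = Seq(
  Vectors.dense(0.0, 0.0), Vectors.dense(0.1, 0.1),
  Vectors.dense(9.0, 9.0), Vectors.dense(9.1, 8.9)
).map(Tuple1.apply)
val data = spark.createDataFrame(points).toDF("features")

// Fit a k-means model with two clusters and print the learned cluster centres.
val model = new KMeans().setK(2).setSeed(1L).fit(data)
model.clusterCenters.foreach(println)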
GraphX:
• GraphX is a library used to manipulate graphs and
perform graph-parallel computations.
• It facilitates creating a directed graph with arbitrary properties
attached to each vertex and edge.
• To manipulate graphs, it supports various fundamental operators like
subgraph, joinVertices, and aggregateMessages.
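A small GraphX sketch for the Scala spark-shell; the vertex names and the "follows" edge labels are made-up properties used only to show how a directed property graph is built and filtered with the subgraph operator.

import org.apache.spark.graphx.{Edge, Graph}

// Vertices carry an arbitrary property (a name); edges carry a relationship label.
val vertices = sc.parallelize(Seq((1L, "Alice"), (2L, "Bob"), (3L, "Carol")))
val edges    = sc.parallelize(Seq(Edge(1L, 2L, "follows"), Edge(2L, 3L, "follows")))

// Build the directed property graph.
val graph = Graph(vertices, edges)

// Fundamental operators: in-degrees of every vertex, and a filtered subgraph.
graph.inDegrees.collect().foreach(println)
val follows = graph.subgraph(epred = triplet => triplet.attr == "follows")
println(follows.edges.count())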
Spark Application Architecture
Spark applications run as independent sets of processes on a cluster,
coordinated by the Spark Context object in our main program (called the
driver program).
Specifically, to run on a cluster, the Spark Context can connect to several
types of cluster managers (either Spark’s own standalone cluster
manager, Mesos, YARN or Kubernetes), which allocate resources across
applications. Once connected, Spark acquires executors on nodes in the
cluster, which are processes that run computations and store data for our
application. Next, it sends our application code (defined by JAR or Python
files passed to Spark Context) to the executors. Finally, Spark Context
sends tasks to the executors to run.
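A hedged sketch of the driver side of this flow as a standalone Scala application; the master URL, application name, and memory setting below are hypothetical placeholders for whatever cluster manager and resources are actually used.

import org.apache.spark.sql.SparkSession

object MyDriverApp {
  def main(args: Array[String]): Unit = {
    // The driver program: building the SparkSession creates the SparkContext,
    // which connects to the cluster manager named by the master URL.
    val spark = SparkSession.builder()
      .appName("MyDriverApp")
      .master("spark://master-host:7077")     // hypothetical standalone master; could be YARN, Mesos or Kubernetes
      .config("spark.executor.memory", "2g")  // resources requested for the executors
      .getOrCreate()

    // Work defined here is split into tasks that the SparkContext sends to the executors.
    val total = spark.sparkContext.parallelize(1 to 1000000).sum()
    println(s"sum = $total")

    spark.stop()  // release the executors acquired for this application
  }
}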
Useful things to note
1. Each application gets its own executor processes, which stay up for
the duration of the whole application and run tasks in multiple
threads. This has the benefit of isolating applications from each other,
on both the scheduling side (each driver schedules its own tasks) and
executor side (tasks from different applications run in different JVMs).
However, it also means that data cannot be shared across different
Spark applications (instances of Spark Context) without writing it to an
external storage system.
2. Spark is agnostic to the underlying cluster manager. As long as it
can acquire executor processes, and these communicate with each
other, it is relatively easy to run it even on a cluster manager that also
supports other applications (e.g. Mesos/YARN/Kubernetes).
3. The driver program must listen for and accept incoming connections
from its executors throughout its lifetime. As such, the driver program
must be network addressable from the worker nodes.
Worker Nodes:
• The worker (slave) nodes host the executors, which process tasks and return the
results back to the Spark Context.
• The Spark Context issues tasks, and the worker nodes execute them.
• They simplify processing by using the worker nodes (1 to n) to handle as many
jobs as possible in parallel, dividing each job into sub-jobs spread across
multiple machines.
• The Spark master monitors the worker nodes to ensure that the computation is
carried out correctly.
• Each Spark task runs on a single worker node, inside an executor.
• In Spark, a partition is the unit of work; each partition is processed by one task,
which is assigned to a single executor.
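A short spark-shell sketch of the partition-to-task relationship described above; the four partitions and the 1-to-1000 range are arbitrary illustration values.

// Split an RDD into 4 partitions; each partition is processed by exactly one task,
// and the tasks are distributed to executors on the worker nodes.
val rdd = sc.parallelize(1 to 1000, numSlices = 4)
println(rdd.getNumPartitions)   // 4

// Count the records that each task (one per partition) would process.
val perPartition = rdd.mapPartitionsWithIndex { (index, it) => Iterator((index, it.size)) }
perPartition.collect().foreach(println)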
RealTime Analytics with Spark
• The Apache Spark architecture handles a continuous stream of data by
dividing the stream into micro-batches, exposed through an API called a
Discretized Stream, or DStream.
• A DStream is a sequence of RDDs created from input data arriving from
sources such as Kafka or Flume, or by applying operations on other
DStreams.
• The RDDs thus generated can be converted into DataFrames and queried
using Spark SQL.
• DStream data can be kept in Spark's working memory and exposed, through
Spark's JDBC server, to any application that can query RDDs, so it can be
queried later on demand.
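A hedged spark-shell sketch of this flow, with a hypothetical socket source on localhost:9999 standing in for Kafka or Flume; each micro-batch RDD is turned into a DataFrame and queried with Spark SQL.

import org.apache.spark.streaming.{Seconds, StreamingContext}
import spark.implicits._

val ssc = new StreamingContext(sc, Seconds(10))
val lines = ssc.socketTextStream("localhost", 9999)   // hypothetical input stream

// Each micro-batch of the DStream is an RDD; convert it to a DataFrame and query it.
lines.foreachRDD { rdd =>
  val words = rdd.flatMap(_.split(" ")).toDF("word")
  words.createOrReplaceTempView("words")
  spark.sql("SELECT word, COUNT(*) AS cnt FROM words GROUP BY word").show()
}

ssc.start()
ssc.awaitTermination()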
Most real-time analytics systems can be broken down into a receiver system, a
stream-processing system, and a storage system.
Batch Analytics with Spark
• Batch processing is used to deal with enormous amounts of data for
implementing high-volume and repeating data jobs, each of which
performs a specific operation without the need for user intervention.
• A single RDD can be divided into multiple logical partitions so that these
partitions can be stored and processed on different machines of a cluster.
• RDDs are immutable (read-only) in nature.
• You cannot change an original RDD, but you can create new RDDs by performing
coarse-grained operations, like transformations, on an existing RDD.
• An RDD in Spark can be cached and used again for future transformations, which
is a huge benefit for users.
• RDDs are said to be lazily evaluated, i.e., they delay the evaluation until it is really
needed. This saves time.
Features of an RDD in Spark
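A spark-shell sketch of the features listed above: immutability (transformations return new RDDs), lazy evaluation (nothing runs until an action), partitioning, and caching.

// Source RDD split into 4 partitions; transformations never modify it, they return new RDDs.
val numbers = sc.parallelize(1 to 100, 4)

// Lazily evaluated: no computation happens yet, Spark only records the lineage.
val evens   = numbers.filter(_ % 2 == 0)
val squares = evens.map(n => n * n)

// Cache the derived RDD so future actions can reuse it without recomputation.
squares.cache()

// Actions trigger the actual evaluation.
println(squares.count())   // first action: computes and caches the partitions
println(squares.sum())     // second action: served from the cache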