Spark Slides
- Spark is an open-source framework designed for large-scale data processing. It is generally faster and more efficient than disk-based frameworks such as Hadoop MapReduce.
- Spark can process data in-memory, making it much faster than disk-based processing. It includes a variety of high-level APIs and libraries that make it easy to develop and deploy data processing applications, including Spark SQL, MLlib, and GraphX.
- Spark can be run on a cluster of computers, allowing it to scale to handle very large datasets (see the short sketch below).
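To make this concrete, a minimal word-count sketch (it assumes a SparkContext named sc is already available, e.g. in spark-shell, and the two input strings are illustrative):
// Distribute two lines of text and count the words in parallel, in memory.
val lines  = sc.parallelize(Seq("spark is fast", "spark is scalable"))
val counts = lines
  .flatMap(_.split(" "))    // split each line into words
  .map(word => (word, 1))   // pair each word with a count of 1
  .reduceByKey(_ + _)       // sum the counts per word across partitions
counts.collect().foreach(println)   // e.g. (spark,2), (is,2), (fast,1), (scalable,1)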
Hadoop vs. Spark
1. Processing Model: Hadoop is based on the MapReduce programming model, which involves dividing
large datasets into smaller, more manageable chunks and processing them in parallel on a distributed
computing infrastructure. Spark, on the other hand, is based on the Resilient Distributed Datasets
(RDD) programming model, which allows for in-memory processing of data.
2. Performance: Spark is generally faster than Hadoop due to its ability to perform in-memory
processing. In addition, Spark has a more efficient execution engine and can run batch processing,
stream processing, machine learning, and graph processing workloads more efficiently.
3. Ease of Use: Spark is generally considered to be easier to use than Hadoop, as it includes a variety of high-level
APIs and libraries that make it easier to develop and deploy data processing applications.
4. Real-time Processing: Spark is designed to handle real-time processing tasks, whereas Hadoop is primarily used
for batch processing.
5. Scalability: Both Hadoop and Spark are designed to scale horizontally, meaning that they can handle large amounts of data by distributing processing across a cluster of computers. However, Spark typically makes better use of the same cluster for iterative and interactive workloads, because it can keep data in memory and has a more efficient execution engine.
What is a Spark Cluster ?
● A Spark cluster is a group of connected computers (nodes) that work together to process large datasets.
● The processing is coordinated by a Spark driver program that distributes tasks across the nodes.
● The driver program runs on one node, while the other nodes act as worker nodes that process the tasks.
● The data is divided into smaller partitions, which the worker nodes process in parallel to speed up the computation (see the sketch after this list).
● Spark clusters can be deployed on-premises, in a cloud environment, or on a hybrid infrastructure.
● By distributing the processing tasks across multiple nodes, Spark clusters can process large datasets
much faster than traditional single-node systems.
● Spark provides fault tolerance, which means that if a node fails during processing, the failed task can
be automatically redistributed to another node to complete the processing.
● The size of a Spark cluster can range from a few nodes to thousands, depending on the size of the
dataset and performance requirements of the application.
● Spark clusters can be used for a wide range of data processing tasks, such as data transformation,
machine learning, and graph processing.
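A small sketch of how data is split into partitions that are processed in parallel (assuming a SparkContext named sc, as in the examples later in these slides):
// Distribute the numbers 1..1000 across 4 partitions; each partition is
// processed in parallel by an executor running on a worker node.
val rdd = sc.parallelize(1 to 1000, numSlices = 4)
println(rdd.getNumPartitions)   // 4
// mapPartitions runs the function once per partition, so the four sums
// below are computed in parallel across the cluster.
val partitionSums = rdd.mapPartitions(iter => Iterator(iter.sum))
println(partitionSums.collect().mkString(", "))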
What is an RDD in Spark ?
In simple terms, RDD (Resilient Distributed Dataset) in Spark can be thought of as a collection of data that is spread across multiple
computers, which can be processed in parallel. RDDs are fault-tolerant and immutable, meaning they can recover from failures and their
data cannot be changed once created.
Think of RDDs as a big, distributed list of elements that can be operated on in parallel. Each element can be a number, a string, or even
a more complex data structure like an array or a tuple. You can transform RDDs by applying operations like filter, map, and reduce,
which will be executed in parallel across the distributed nodes.
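For example, a short sketch of this idea (again assuming a SparkContext named sc):
// A distributed "list" of numbers spread across the cluster.
val numbers = sc.parallelize(1 to 10)
// Transformations describe new RDDs; the work is executed in parallel.
val evens   = numbers.filter(_ % 2 == 0)   // keep the even numbers
val squares = evens.map(n => n * n)        // square each remaining element
// reduce is an action: it combines the elements and returns a single value.
println(squares.reduce(_ + _))             // 220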
Types of Operations in Spark
Transformation Operations:
These operations transform the input data into some other form of data. Because RDDs are immutable, transformations do not change the original data; instead, they create new RDDs. Transformations are also lazy: they describe the computation but are not executed until an action is called. Here are some examples of transformation operations:
map() - applies a function to each element in the RDD and returns a new RDD with the transformed elements. For example:
new_rdd = rdd.map(lambda x: x + 1)
filter() - applies a predicate function to each element in the RDD and returns a new RDD with the elements that satisfy the predicate. For
example:
new_rdd = rdd.filter(lambda x: x % 2 == 0)
groupByKey() - groups the values for each key in the RDD and returns a new RDD of pairs, where each key is paired with the collection of its values. For example:
rdd = sc.parallelize([(1, 'a'), (2, 'b'), (1, 'c'), (2, 'd')])
new_rdd = rdd.groupByKey()
Action Operations:
These operations trigger the actual computation on the RDD and return a result to the driver program or write data out. Unlike transformations, which are lazy and simply produce new RDDs, actions force Spark to execute the recorded transformations. Here are some examples of action operations:
collect() - returns all the elements of the RDD to the driver program as an array. For example:
rdd = sc.parallelize([1, 2, 3])
result = rdd.collect()  # [1, 2, 3]
count() - returns the number of elements in the RDD. For example:
count = rdd.count()
reduce() - aggregates the elements of the RDD using a function and returns a single value. For example:
result = rdd.reduce(lambda x, y: x + y)
How Does an RDD Function in Spark ?
RDD (Resilient Distributed Datasets) is a fundamental concept in Apache Spark for processing and
analyzing large-scale datasets in a distributed environment. To visualize RDDs in Spark, we can use a graph
representation where each node represents a partition of the RDD and each block represents an element in
the RDD.
Let's take an example of a simple RDD that contains a list of integers:
val data = Array(1, 2, 3, 4, 5)
val rdd = sc.parallelize(data, 2)
In this example, we create an RDD rdd with two partitions using the sc.parallelize method. Each
partition of the RDD will contain a subset of the data. We can visualize this RDD as follows:
              RDD (rdd)
             /         \
        Node 1          Node 2
        /    \         /   |   \
  Block 1  Block 2  Block 3  Block 4  Block 5
In this visualization, the RDD is represented as a node with two child nodes, each representing a
partition of the RDD. Each partition is further divided into blocks, which represent the individual
elements in the RDD.
In this case, the RDD has two partitions, so the first partition (Node 1) contains the first two blocks
(Block 1 and Block 2) and the second partition (Node 2) contains the remaining three blocks (Block 3,
Block 4, and Block 5).
By visualizing the RDD in this way, we can better understand how Spark partitions and distributes the
data across a cluster of machines for parallel processing. It also helps us to optimize our Spark
applications by identifying potential bottlenecks or areas for improvement.
Let's take another example where we have an RDD of words, and we want to visualize how the data is distributed across
partitions and blocks:
val data = Array("apple", "banana", "orange", "grape", "pineapple")
val rdd = sc.parallelize(data, 3)
In this example, we create an RDD rdd with three partitions using the sc.parallelize method. We can visualize this
RDD as follows:
                 RDD (rdd)
            /        |        \
       Node 1      Node 2      Node 3
       /    \      /    \         |
  Block 1  Block 2  Block 3  Block 4  Block 5
  "apple"  "banana" "orange" "grape"  "pineapple"
In this visualization, the RDD is represented as a node with three child nodes, each representing a partition of the RDD.
Each partition is further divided into blocks, which represent the individual elements in the RDD (in this case, words).
The RDD has three partitions, so the first partition (Node 1) contains the first two blocks (Block 1 and Block 2), the
second partition (Node 2) contains the next two blocks (Block 3 and Block 4), and the third partition (Node 3) contains
the final block (Block 5).
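To check the actual layout rather than drawing it by hand, a small sketch using glom(), which collects the elements of each partition into an array (using the rdd from the example above; the exact split Spark chooses may differ slightly from the drawing):
// glom() turns each partition into an array of its elements, so collecting
// it shows exactly which words ended up in which partition.
rdd.glom().collect().zipWithIndex.foreach { case (elems, i) =>
  println(s"Partition $i: ${elems.mkString(", ")}")
}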
Immutability in RDD
Let's take an example to explain the immutability of RDDs in Apache Spark.
Consider the following example where we have an RDD of integers:
val numbersRDD = sc.parallelize(Seq(1, 2, 3, 4, 5))
Since RDDs are immutable, we cannot modify the contents of the numbersRDD once it is created.
For example, if we try to change the value of the first element in the RDD, like this:
numbersRDD(0) = 6 // This will result in an error
We will get a compilation error, because RDDs are read-only and their data cannot be changed.
Instead, if we want to modify the data in an RDD, we need to create a new RDD with the modified
data. For example, if we want to add 1 to each element in the numbersRDD, we can create a new
RDD with the modified data, like this:
val incrementedRDD = numbersRDD.map(x => x + 1)
This will create a new RDD called incrementedRDD with the modified data, where each element is
incremented by 1. The original numbersRDD remains unchanged.
Resilience in RDD
RDDs are resilient to node failures, which means they can recover from failures and continue processing data. This is achieved through the use of lineage, which is a record of how the RDD was created from other RDDs.
Consider the following example, where we have an RDD of integers:
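A minimal sketch of such an RDD, assuming two partitions as described below:
val numbersRDD = sc.parallelize(Seq(1, 2, 3, 4, 5), numSlices = 2)   // stored in two partitions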
Let's say we have a cluster of three machines (M1, M2, and M3), and we store the numbersRDD across two partitions, with one partition
on M1 and the other partition on M2. If one of the machines (say, M2) fails, Spark can use the lineage information to recover the lost
partition and continue processing the data.
Here's how it works: When we create the numbersRDD, Spark keeps track of how it was created, and stores this information in the
lineage. The lineage includes the transformations and dependencies that were used to create the RDD, as well as the location of the
parent RDDs. In this example, the lineage for numbersRDD would include the parallelize transformation and the location of the two
partitions.
If M2 fails, the partition on M2 would be lost, and Spark would not be able to access the data in that partition. However, because the lineage records how the RDD was created (here, the parallelize step), Spark can recreate the lost partition by re-executing that step on another available node. This ensures that the data is still available for processing, even if one of the nodes fails.
What is Lineage in an RDD ?
Lineage is a record of how an RDD (Resilient Distributed Dataset) was created from other RDDs through transformations. It is a fundamental concept in Spark's fault tolerance mechanism and is used to recover data lost due to node failures.
The lineage of an RDD is a directed acyclic graph (DAG) that shows the dependencies between RDDs
and the transformations applied to them. Each node in the graph represents an RDD, and each edge
represents a transformation that was applied to create the new RDD.
For example, consider the following code that creates an RDD and applies two transformations to it:
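A minimal sketch of such a chain (the two operations referenced below are parallelize and filter; the names rdd1 and rdd2 match the explanation that follows, and the values are illustrative):
val rdd1 = sc.parallelize(Seq(1, 2, 3, 4, 5))   // base RDD built from a local collection
val rdd2 = rdd1.filter(_ % 2 == 0)              // transformation: keep the even numbers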
If a node in the cluster fails and some data is lost, Spark can use the lineage information to reconstruct the lost data by re-executing the
transformations that were applied to create the lost RDDs. In the above example, if rdd2 was lost due to a node failure, Spark can
recreate it by re-applying the filter transformation on rdd1. If rdd1 was lost, Spark can recreate it by re-executing the parallelize
transformation.
Lineage allows Spark to achieve fault tolerance without requiring expensive replication of data across the cluster. It also enables Spark
to optimize the execution plan by reordering transformations based on their dependencies and minimizing data shuffling between nodes.
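As a quick way to inspect this lineage yourself, a sketch using toDebugString, which prints the chain of RDDs and transformations Spark has recorded (using rdd2 from the example above):
// Prints the lineage (the RDD DAG) Spark would replay to rebuild lost partitions.
println(rdd2.toDebugString)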