Spark Final Theory

Apache Spark

1. What is Apache Spark?

Ans: Apache Spark is a unified computing engine and a set of libraries for parallel data processing on computer clusters.
Unified: Spark is designed to support a wide range of tasks over the same computing engine. For example, data scientists, data engineers, and data analysts can use the same platform for their data transformation or modelling work.
Compute engine: Spark does not store data anywhere by default; it uses in-memory computing. We can store data using S3, an RDBMS, and various other storage systems.
-> Spark is limited to being a computing engine. It does not store the data.
-> Spark can connect to different data sources like S3, HDFS, Azure, JDBC, etc.
-> Spark can work with almost all data storage systems.
Computer cluster:

Fig: Master-slave architecture: one master node coordinating four slave (worker) nodes, each with 16 GB RAM, 1 TB disk, and 4 CPU cores.

Pros of Spark
1. Fast processing.
2. In-memory computing [in Spark, data is stored in RAM, so it can access the data quickly and accelerate the speed of analytics].
3. Flexible [it supports multiple languages like Java, Scala, R, and Python].
4. Fault tolerance [Spark contains Resilient Distributed Datasets (RDDs) that are designed to handle the failure of any worker node in the cluster; thus, it ensures that data loss is reduced to zero].
5. Better analytics [Spark has a rich set of SQL queries, machine learning algorithms, complex analytics, etc. With all these functionalities, analytics can be performed better].
Apache Spark
2. Why Apache Spark?
Ans:
Big Data
Big data is a term that describes large, hard-to-manage volumes of data, both structured and unstructured, that inundate businesses on a day-to-day basis.
These data sets are so voluminous that traditional data processing software just cannot handle them. But these large volumes of data can be used to address business problems we would not have been able to tackle before.

5 V's of Big Data

1. Volume -> 5 GB / 5 TB / 10 TB
2. Velocity -> per second / per hour
3. Variety -> structured, semi-structured, unstructured [unstructured is produced the most in the recent market]
4. Value
5. Veracity [uncertainty of data]
If the 5 V's are satisfied, then the data can be called big data.
ETL = Extract Transform Load [old concept]
ELT = Extract Load Transform [new concept]
Issues
i. Storage
ii. Processing: RAM/CPU
Options
i. Monolithic approach
ii. Distributed approach

Monolithic: vertical scaling; expensive; low availability; single point of failure.
Distributed: horizontal scaling; economical; high availability; no single point of failure.
Hadoop vs Apache Spark:

Hadoop: Processing data is slow compared to Apache Spark, because Hadoop writes data back to disk and reads it again from disk into memory [it stores the intermediate data on disk].
Spark: Apache Spark can process data up to 100 times faster than Hadoop because it performs computations in memory [it does not store intermediate data on disk; everything is done in-memory].

Hadoop: Performs batch processing of data.
Spark: Performs both batch processing and real-time processing of data.

Hadoop: Has more lines of code. Since it is written in Java, it takes more time to execute.
Spark: Has fewer lines of code, as it is implemented in Scala.

Hadoop: It is difficult to write code in Hadoop; Hive was built to make it easier.
Spark: Easy to use; code is easy to write and debug.

Hadoop: Supports Kerberos authentication, which is difficult to manage.
Spark: Supports authentication via a shared secret. It can also run on YARN, leveraging the capability of Kerberos [YARN: Yet Another Resource Negotiator].

Hadoop: Provides more security features than Apache Spark.
Spark: Has fewer security features.

Hadoop: For fault tolerance, it uses blocks of data and a replication factor to handle failures.
Spark: Uses a DAG [Directed Acyclic Graph] to provide fault tolerance.

Some misconceptions about Hadoop and Apache Spark:

i. Hadoop is a database.
ii. Spark is 100 times faster than Hadoop.
iii. Spark processes data in RAM but Hadoop does not.

Resilient Distributed Dataset (RDD)
Spark Core is built around RDDs: an immutable, fault-tolerant, distributed collection of objects that can be operated on in parallel.
Spark Ecosystem:
Components of Apache Spark
1. Spark Core
2. Spark SQL
3. Spark Streaming
4. Spark MLlib
5. Spark GraphX
Fig: Spark ecosystem. The components above run on the Spark engine (Spark Core), which runs on top of a cluster manager [YARN / Mesos / Kubernetes / Standalone].

Spark Core
Spark Core is the base engine for large-scale parallel and distributed data processing. It is responsible for:
a) Memory management
b) Fault recovery
c) Scheduling, distributing, and monitoring jobs on a cluster
d) Interacting with storage systems

Spark Streaming
Spark Streaming is a lightweight API that allows developers to perform batch processing and real-time streaming of data with ease. It provides secure, reliable, and fast processing of live data streams.

Spark MLlib
Spark MLlib is a machine learning library that is simple to use, scalable, and compatible with various programming languages.
MLlib eases the development and deployment of scalable machine learning algorithms.
It basically contains machine learning libraries that provide implementations of various machine learning algorithms.

Spark GraphX
GraphX is Spark's own graph computation engine and data store.
ETL -> Extract Transform Load

Spark Architecture
Spark uses a master-slave architecture that consists of a driver, which runs on the master node, and multiple executors, which run across the worker nodes in the cluster.
Workflow ->
On the master node there is a resource manager such as YARN. When a Spark application runs, the request first goes to the resource manager. Suppose it requests: driver = 20 GB, executor = 25 GB, number of executors = 5, CPU cores per executor = 5.
The resource manager first creates the driver (on the master node or on one of the worker nodes, depending on the deploy mode). After the driver is created, it creates the 5 requested executors on worker nodes where free space is available, and also allocates the CPU cores there. The container in which the driver is created is called the Application Master. In the application we can write code in Java or Python. If we write the code in Python, there is a PySpark driver and also a JVM wrapper that converts the PySpark calls to JVM calls, because Spark is written in Scala. The flow is: Python wrapper -> Java wrapper -> Spark Core.
PySpark only comes into the picture if we write code in Python; the JVM main method is always there, because Spark is written in Scala. If we write code in Java, the JVM main method is already there, so no PySpark is needed.
If we write user-defined functions (UDFs) in Python, they are called at runtime, so the executors also need a Python worker process, and the application may become slow. For this reason we should avoid such user-defined functions where possible. A sketch of the corresponding spark-submit request follows.
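
As a rough sketch, a resource request like the one above could be expressed with spark-submit; the flag values simply mirror the hypothetical numbers above and the jar name is made up:

spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --driver-memory 20g \
  --executor-memory 25g \
  --num-executors 5 \
  --executor-cores 5 \
  my-spark-app.jar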

Transformations & Actions:

What are Transformations?
In Apache Spark, transformations are functions applied to existing datasets (represented as Resilient Distributed Datasets, or RDDs) to create new ones.
Or,
A transformation is basically the process of manipulating data from a data source.
There are two types of transformations:
i. Narrow dependency transformations: transformations that do not require data movement between partitions. Ex: filter, select, union, map.
ii. Wide dependency transformations: transformations that require data movement between partitions. Ex: join, groupBy, distinct, etc. (See the sketch after this list.)
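
A minimal Java sketch; the file name and the "age" and "dept" columns are just assumptions for illustration:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class TransformationsDemo {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("TransformationsDemo").master("local[2]").getOrCreate();

        // Hypothetical input file; any CSV with "age" and "dept" columns works.
        Dataset<Row> people = spark.read().option("header", "true").csv("people.csv");

        // Narrow dependency: each partition can be filtered independently, no data movement.
        Dataset<Row> adults = people.filter("age >= 18");

        // Wide dependency: rows with the same key must be moved to the same partition (shuffle).
        Dataset<Row> counts = adults.groupBy("dept").count();

        counts.show();   // an action: this is what actually triggers execution
        spark.stop();
    }
}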

Lazy Evaluation and DAG:

DAG -> Directed Acyclic Graph
Transformations are not executed immediately; they are recorded in a lineage graph. Only when an action (a function that returns a value) is triggered are the necessary transformations performed. This optimizes resource usage and allows for efficient data processing.
DAG (Directed Acyclic Graph): In Apache Spark, DAG stands for Directed Acyclic Graph. It is a fundamental concept that visualizes and organizes the different steps involved in processing your data. Think of it like a roadmap for your Spark job.
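
A small sketch of the laziness, continuing the hypothetical TransformationsDemo example above (where "counts" was built from filter() and groupBy()):

// At this point nothing has been read or computed; the transformations only extended the lineage/DAG.
counts.explain();          // prints the plan Spark would run, still without touching the data
long n = counts.count();   // an action: only now are the recorded transformations executed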

Spark SQL Engine:

Questions
i. What is the Catalyst Optimizer / Spark SQL engine?
ii. Why do we get an AnalysisException error?
iii. What is the catalog?
iv. What is physical planning / a Spark plan?
v. Is the Spark SQL engine a compiler?
vi. How many phases are involved in the Spark SQL engine to convert code into Java bytecode? => 4

Fig: SQL, DataFrame API, and Dataset code all go through the Catalyst Optimizer / Spark SQL engine, which produces RDD code.
Four phases of the Spark SQL engine:
i. Analysis
ii. Logical planning
iii. Physical planning
iv. Code generation
The flow is top-down:

Code
-> Unresolved logical plan
-> Analysis (the plan is resolved against the catalog [metadata: tables, files, databases])
-> Resolved logical plan
-> Logical optimization
-> Optimized logical plan
-> Physical planning (physical plans 1, 2, 3, ..., n are generated and a cost model chooses the best one)
-> Best physical plan
-> Final code [RDDs]
RDD (Resilient Distributed Dataset):
Resilient -> In case of failure, an RDD knows how it was created, so it can heal/recover easily [fault tolerant] via the lineage graph.
Distributed -> The data is distributed over the cluster.
Dataset -> The actual data.
An RDD is immutable: once it is created, it cannot be changed. An RDD is fault tolerant.

Why is an RDD called fault tolerant?

In case of any failure, each RDD remembers how it was created from other RDDs through a lineage graph (which is part of the DAG).
If a worker node holding an RDD fails, Spark doesn't need to recompute the entire dataset from scratch. Instead, it uses the lineage graph to identify the unaffected partitions and only recomputes the lost ones based on their parent RDDs. This saves a significant amount of time and resources.

Features:
Immutable
Lazy evaluation

Pros of an RDD:
i. Works on unstructured data.
ii. Type safe.

Cons of an RDD:
i. No optimization is done by Spark.
ii. We have to spell out both how to do things and what to do.

When do we need an RDD?

If we want full control over our data, or want to work with unstructured data, then we need an RDD.
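
A minimal JavaRDD sketch for unstructured text; the log file path is hypothetical:

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.SparkSession;

public class RddDemo {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("RddDemo").master("local[2]").getOrCreate();
        JavaSparkContext jsc = new JavaSparkContext(spark.sparkContext());

        // Unstructured text: each element is one raw line, and we control the parsing ourselves.
        JavaRDD<String> lines = jsc.textFile("server.log");
        JavaRDD<String> errors = lines.filter(l -> l.contains("ERROR"));   // transformation (lazy)

        System.out.println("error lines: " + errors.count());              // action
        spark.stop();
    }
}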

Spark Session:
In Apache Spark, the SparkSession serves as the entry point for interacting with Spark's
functionalities.
Code to create a SparkSession in Java:

import org.apache.spark.sql.SparkSession;

public class MySparkApp {

    public static void main(String[] args) {

        // 1. Create a SparkSession builder
        SparkSession spark = SparkSession.builder()

                // 2. Set the application name (optional)
                .appName("My Spark Application")

                // 3. Configure the master (spark://, local[n], yarn, etc.)
                .master("local[2]")

                // 4. (Optional) Specify additional configurations
                .config("spark.sql.shuffle.partitions", 4)

                // 5. Get or create the SparkSession
                .getOrCreate();

        // ... Do your Spark operations with spark ...

        // 6. Stop the SparkSession
        spark.stop();
    }
}

Applications, Jobs, Stages & Tasks:


Application:
•Represents the entire Spark program you submit for execution. It holds a logical unit of work, like
analyzing a dataset or training a model.
•Launched when you call SparkSession.getOrCreate() in your code.
•Can involve multiple jobs, each performing specific transformations and actions on the data.
Job:
•A collection of transformations and actions applied to a dataset within an application.
•Triggered by an action like count(), collect(), or save().
•Spark creates a Directed Acyclic Graph (DAG) for each job, representing the dependencies
between transformations.
•Responsible for transforming and manipulating the data based on your instructions.
Stage:
•A set of tasks that can be executed independently without shuffling data across the network.
•Created based on shuffle dependencies in the job's DAG.
•Stages are processed sequentially, but tasks within a stage can run concurrently across executors.
•Represent logical units of work within a job, enabling parallel execution.
Task:
•The smallest unit of execution in Spark, responsible for processing a specific partition of data.
•Assigned to an executor node in the cluster.
•Executes a single transformation or action on its assigned partition.
•Tasks within a stage can run concurrently, contributing to the overall speed of your application.
Visualization:
Imagine you have an application to clean and analyze a large dataset. The application launches
multiple jobs for tasks like filtering, aggregating, and joining data. Each job might be divided into
stages based on how data needs to be shuffled. Within each stage, individual tasks work on separate
data partitions in parallel. This distributed approach helps Spark process your data efficiently.

Flow:
Driver -> Jobs -> Stages -> Tasks
NB: groupBy by default creates 200 shuffle partitions, so that shuffle stage has 200 tasks; see the sketch below.
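
A small sketch, assuming "spark" is an active SparkSession and "orders" is a hypothetical Dataset<Row> with a "customer_id" column; the shuffle partition count (and hence the task count of the shuffle stage) is controlled by spark.sql.shuffle.partitions:

spark.conf().set("spark.sql.shuffle.partitions", "50");

Dataset<Row> counts = orders.groupBy("customer_id").count();
System.out.println(counts.rdd().getNumPartitions());   // 50 instead of the default 200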

Repartition vs Coalesce:
Repartition:
Repartitioning is the process of re-partitioning the data, creating new partitions from the existing ones. Here data shuffling happens, which makes it slow. We can increase as well as decrease the number of partitions.
Pros -> Evenly distributed data.
Cons -> More I/O.
Coalesce:
Coalesce is the process of merging partitions. Here we can only decrease the number of partitions.
Pros -> Less expensive; no shuffling, so it takes less time.
Cons -> Uneven distribution of data.
Repartition and coalesce are basically opposites of each other: one increases the number of partitions and the other decreases it. However, we can also decrease the number of partitions with repartition. A small sketch follows.
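
A small sketch, assuming "df" is a hypothetical Dataset<Row>:

Dataset<Row> df20 = df.repartition(20);   // full shuffle; can increase or decrease; evenly sized partitions
Dataset<Row> df4  = df.coalesce(4);       // merges existing partitions; decrease only; no full shuffle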
Spark Join:
i. What are the join strategies in Spark?
ii. Why is a join expensive?
iii. What is the difference between a shuffle hash join and a shuffle sort-merge join?
iv. Where do we need a broadcast join?
A join also creates 200 shuffle partitions by default.
A join is expensive because it is a wide dependency transformation, where data shuffling happens between partitions, which is expensive and time-consuming.
Join strategies:
i. Shuffle sort-merge join.
ii. Shuffle hash join.
iii. Broadcast hash join.
iv. Broadcast nested loop join.
Shuffle sort-merge join:
After the data shuffling between partitions is done, the data in the data frame is sorted and then the join is performed; this is called a sort-merge join. Time: O(n log n). Spark chooses it by default.
For example, if one side's partition holds the keys [4, 1, 5, 2] and the other side's holds [1, 3, 2], then after sorting they become [1, 2, 4, 5] and [1, 2, 3]. After that, the join is performed by merging: 1 -> 1, 2 -> 2, 3 -> 3, and so on.

Shuffle hash join:

Here Spark creates a hash table from the smaller table; the hash table is stored in memory. Complexity: O(n).

Broadcast hash join:
The broadcast threshold is 10 MB by default. Suppose we have two tables, one small and one large, and the large table is divided into multiple partitions stored across multiple worker nodes. We then broadcast the small table to each worker node, so no shuffling is needed and the data can be joined directly; this is called a broadcast hash join. A sketch follows.
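
A small sketch, assuming hypothetical datasets largeDf and smallDf that share a "customer_id" column; broadcast() hints Spark to ship the small table to every executor so the large table does not need to be shuffled:

import static org.apache.spark.sql.functions.broadcast;

Dataset<Row> joined = largeDf.join(broadcast(smallDf), "customer_id");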

Driver Out Of Memory:

i. What is OOM in Spark?
ii. Why do we get driver OOM?
iii. What is driver overhead memory?
iv. What are common reasons for driver OOM?
v. How do we handle OOM?
.show() -> This method does not show all the records at the same time; by default 20 records are shown. Typically only one partition is brought to the driver, not all of them.
.collect() -> This brings all the records to the driver, so it is much more likely to cause a driver out-of-memory error (see the sketch after the list below).
Driver memory:
spark.driver.memory => JVM process [default 1 GB of driver memory]
spark.driver.memoryOverhead => non-JVM / container memory [10% of driver memory, minimum 384 MB]
Common reasons for getting driver out of memory:
i. The collect() method is used.
ii. Broadcasting a large table.
iii. Too many objects used in a process.
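
A small sketch of point i, assuming "df" is a hypothetical Dataset<Row>; bounded actions keep the driver's memory usage predictable:

// Prefer bounded actions over collect() when the full result is not needed on the driver.
java.util.List<Row> sample = df.takeAsList(100);   // brings only 100 rows to the driver
df.show();                                         // prints 20 rows by default, not the whole dataset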

Executor Out Of Memory:

i. Why do we get OOM when data can be spilled to the disk?
ii. How does Spark manage storage inside executors?
iii. How are tasks split across executors?
iv. Why do we need overhead memory?
v. When do we get executor OOM?
vi. What types of memory manager are there in Spark?
spark.executor.memory => JVM process [default 1 GB of executor memory]
spark.executor.memoryOverhead => non-JVM / container memory [10% of executor memory, minimum 384 MB]
spark.executor.memory is split into 3 regions. Suppose it is 10 GB:
-> User memory: 40% of the remaining ~9.7 GB
-> Spark memory: 60% of the remaining ~9.7 GB [Spark memory is divided into two parts: the storage memory pool (50%) and the execution memory pool (50%)]
-> Reserved memory: 300 MB
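
A rough worked breakdown under the simplified model above (in real Spark these fractions correspond to spark.memory.fraction = 0.6 and spark.memory.storageFraction = 0.5, applied to the heap minus the 300 MB reserved):

10 GB heap - 300 MB reserved = ~9.7 GB usable
User memory = 40% of 9.7 GB = ~3.88 GB
Spark memory = 60% of 9.7 GB = ~5.82 GB
    Storage pool = 50% of 5.82 GB = ~2.91 GB
    Execution pool = 50% of 5.82 GB = ~2.91 GB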

Reserved memory usage:

i. It is used to store Spark's internal objects.
ii. It is used by the Spark engine.
User memory usage:
i. It is used to store user-defined data structures, Spark internal metadata, and any user-defined functions created.
ii. It is used by RDD operations, e.g. aggregations and mapPartitions transformations.
Spark memory:
Storage memory:
i. It is used for storing the intermediate state of tasks, such as joins.
ii. It is used to store cached data.
iii. Memory eviction is done in LRU (least recently used) fashion.
Execution memory:
i. It is used for storing objects that are required during the execution of Spark tasks.
ii. It stores hash tables for hash aggregation.
iii. It can spill to disk.
iv. It is short-lived: cleaned up after each operation.

Fig: Executor memory layout: user memory (40%), Spark memory (60%, split into a 50% storage memory pool and a 50% execution memory pool), and 300 MB of reserved memory.

Spill:
Spill refers to the process of moving data from in-memory storage to disk. This happens when a partition of data becomes too large to fit in the available memory on an executor. It's essentially a safety net to avoid OutOfMemoryErrors.

• When a spill occurs, Spark writes the excess data to a temporary file on the local disk. This data can be read back into memory when needed, but accessing data from disk is significantly slower than accessing it from RAM.
Evict:
•Eviction refers to the process of removing data from the in-memory cache. This happens when
the cache reaches its capacity and needs to free up space for new data.
•Spark uses a Least Recently Used (LRU) eviction policy, meaning the data that hasn't been
accessed recently is the first to be evicted.

In short: if execution memory is full but storage memory is not, execution can borrow from storage memory. If storage memory is also full and execution requires extra space, then spillable data is spilled to disk. But if the data is not spillable and both storage and execution memory are full, Spark throws an out-of-memory exception. For example, if my execution memory and storage memory are full and we perform a join or hashing step, which is not spillable, we get the out-of-memory exception.

Deployment Modes in Spark:

Edge node:
An edge node is basically an external machine used to connect to the cluster. It acts as a simple bridge between the cluster and the client/users.

If we use
bin/spark-submit \
--master yarn \
--deploy-mode client
then the driver will be created on the edge node.

If instead we use
bin/spark-submit \
--master yarn \
--deploy-mode cluster
then the driver will be created inside the cluster, on one of the worker nodes.
Client and cluster mode deployment in Apache Spark

Client mode: Logs are generated on the client machine, so it is easy for the client to debug.
Cluster mode: Logs are generated in the stdout and stderr files on the cluster; it is suitable for production workloads.

Client mode: Network latency is high.
Cluster mode: Network latency is low.

Client mode: Driver out-of-memory errors can occur.
Cluster mode: The driver can still run out of memory, but the chances are lower.

Client mode: The driver goes away once the edge node/server is disconnected or closed.
Cluster mode: Even if the edge server is closed, the process still runs on the cluster.

Adaptive Query Execution [AQE]:

Features of AQE:
i. Dynamically coalescing shuffle partitions.
ii. Dynamically switching join strategies.
iii. Dynamically optimizing skew joins.

Dynamically coalescing shuffle partitions in AQE:

In Apache Spark, shuffle operations are used to distribute data across worker nodes and perform aggregations or joins. Traditionally, the number of shuffle partitions was manually configured, but this could lead to inefficiency in two ways:
• Too many partitions: if you had more partitions than worker cores, it could create overhead and underutilize resources.
• Too few partitions: too few partitions could cause data skew and bottlenecks, especially for unevenly distributed data.
AQE's dynamic coalescing to the rescue:
AQE addresses these issues by automatically adjusting the number of shuffle partitions during query execution: it dynamically coalesces many small shuffle partitions into fewer, reasonably sized ones.
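
A configuration sketch, assuming "spark" is an active SparkSession; these are the standard Spark 3.x AQE settings (they can also be passed via spark-submit --conf):

spark.conf().set("spark.sql.adaptive.enabled", "true");
spark.conf().set("spark.sql.adaptive.coalescePartitions.enabled", "true");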

Dynamically switching join strategies:

Dynamically switching join strategies is another powerful feature of Adaptive Query Execution (AQE) in Apache Spark. It allows the join type to be chosen automatically based on runtime statistics, rather than being statically defined in the original query. This leads to improved performance and efficiency, especially when dealing with datasets with unknown characteristics or skewed data.

Working flow:
1. Initial plan: based on available statistics (e.g., table sizes, data types), AQE creates an initial query plan with a chosen join strategy. This might be a sort-merge join, broadcast hash join, or nested loop join, depending on factors like table sizes and join conditions.
2. Runtime information: as the query executes, AQE gathers runtime statistics, such as the actual size of each table after filtering, the number of distinct values in the join keys, and data skew.
3. Re-evaluation: using this runtime information, AQE re-evaluates the chosen join strategy. It estimates the cost of executing each possible join algorithm with the new data characteristics.
4. Switching if beneficial: if AQE finds a different join strategy with a significantly lower estimated cost, it dynamically switches to that strategy on the fly. This means the query execution changes course to use the more efficient join approach.

Dynamically optimizing the skew join:

To handle skewed data, Apache Spark can split a skewed partition at runtime into multiple smaller partitions, but the conditions below must be satisfied:
a partition is treated as skewed (and can be split) if it is more than five times the median partition size and is also larger than 256 MB.
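
A configuration sketch of the related Spark 3.x settings, assuming "spark" is an active SparkSession; the values shown match the documented defaults:

spark.conf().set("spark.sql.adaptive.skewJoin.enabled", "true");
spark.conf().set("spark.sql.adaptive.skewJoin.skewedPartitionFactor", "5");
spark.conf().set("spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes", "256MB");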

Cache and Persist in Apache Spark:

Caching: It is basically an optimization technique that stores intermediate results.
In Apache Spark, caching refers to the process of temporarily storing a DataFrame, Dataset, or RDD (Resilient Distributed Dataset) in memory across the Spark cluster. This can significantly improve the performance of your application by speeding up subsequent operations that use the same data. If the cached data is bigger than the available memory, the data is spilled to disk, and then reads and writes take more time as well.

Benefits of caching:
• Reduced processing time: subsequent operations on the cached data avoid reading from disk, which is much slower than accessing memory.
• Improved performance: this optimization can significantly speed up your application, especially for iterative analyses or repeated computations.
• Fault tolerance: cached data can be recomputed in case of failures, enhancing data availability.

Persist:
In Apache Spark, persist() is another option for storing data efficiently, similar to caching, but offering more flexibility and control. While caching uses a default storage level (MEMORY_ONLY for RDDs), persist allows you to specify various storage levels based on your needs. A small sketch follows, and the comparison table is below it.

Feature: Method
  Persist: persist(storageLevel)
  Caching: cache()

Feature: Default storage level
  Persist: user-defined (e.g., MEMORY_ONLY, MEMORY_AND_DISK)
  Caching: MEMORY_ONLY

Feature: Data spilling
  Persist: can spill data to disk based on the chosen storage level
  Caching: can spill data to disk only if there are no other options (e.g., the executor runs out of memory with MEMORY_ONLY)

Feature: Explicit control
  Persist: offers control over
    - the storage level (e.g., choose MEMORY_AND_DISK to explicitly enable spilling)
    - the serialization format (e.g., MEMORY_ONLY_SER for a smaller in-memory footprint)
    - the replication factor (e.g., for fault tolerance)
    - the spill location (indirectly via spark.local.dir)
  Caching: limited control; only allows the default MEMORY_ONLY level

Feature: Flexibility
  Persist: more flexible, allows customizing various persistence aspects
  Caching: less flexible, limited to the default behavior

Feature: Convenience
  Persist: requires understanding of storage levels and configuration
  Caching: easier to use, ideal for simple caching scenarios

Feature: Use cases
  Persist: specific needs requiring control over storage location, format, and fault tolerance
  Caching: simple scenarios where the default caching behavior is sufficient

Dynamic Resource Allocation in Spark:

Dynamic Resource Allocation (DRA) is a powerful feature in Apache Spark that allows the cluster to automatically adjust the number of executors assigned to an application based on its workload. This means that Spark can scale up resources when needed and scale down when they are no longer required.
There are two types of resource allocation:
i. Static resource allocation
ii. Dynamic resource allocation
Example:
Imagine you have a Spark application that processes large datasets in phases. In the initial phase, it might require more resources for data ingestion and cleaning. Later, during aggregation and analysis, fewer resources might be sufficient. With DRA enabled, Spark can automatically allocate more executors during the initial phase and scale down later, efficiently utilizing cluster resources and optimizing performance throughout the execution.
When to avoid dynamic resource allocation:
In a production environment, when the process is critical, we may avoid dynamic allocation. With dynamic allocation, a process may release memory that is currently unused, and that memory may then be taken by another process; when the first process needs the memory again, it may not be available right away, so the process may fail or take more time.
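
When DRA is used, it is typically enabled at submit time with settings like the following; a hedged sketch where the executor counts are just illustrative and the jar name is made up (on YARN an external shuffle service is usually also required):

spark-submit \
  --master yarn \
  --conf spark.dynamicAllocation.enabled=true \
  --conf spark.dynamicAllocation.minExecutors=2 \
  --conf spark.dynamicAllocation.maxExecutors=20 \
  --conf spark.shuffle.service.enabled=true \
  my-spark-app.jar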

Dynamic Partition Pruning:

Conditions for dynamic partition pruning:
i. The data should be partitioned.
ii. The second table must be broadcasted.

Dynamic partition pruning is a performance optimization technique used when querying partitioned tables in systems like Apache Spark. It reduces the amount of data scanned by leveraging the partition information present in the query and the data itself.
Here's how it works:
1. Identifying potential pruning: when you execute a query with a filter condition on a partitioned column, Spark can analyze the filter and compare it to the partition values. If the filter doesn't apply to certain partitions, those partitions can be pruned from the scan, meaning they won't be read at all.
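
A small sketch, assuming "spark" is an active SparkSession: in Spark 3.x this behaviour is controlled by a configuration flag that is enabled by default:

spark.conf().set("spark.sql.optimizer.dynamicPartitionPruning.enabled", "true");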

Salting:

Salting is a technique used in Apache Spark, primarily before version 3.0 and its AQE skew handling, to address data skew issues when performing operations like joins and aggregations on large datasets. Data skew occurs when data is unevenly distributed across partitions, causing some partitions to hold significantly more data than others. This can lead to performance bottlenecks and slow down Spark jobs. A sketch of the idea follows.
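
A minimal Java sketch of the idea, assuming hypothetical datasets largeDf (skewed) and smallDf that join on a "customer_id" column; the skewed side gets a random salt, and the small side is replicated once per salt value so the keys still match:

import static org.apache.spark.sql.functions.*;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

int saltBuckets = 8;

// Skewed side: append a random salt 0..7 to the join key, spreading one hot key over 8 keys.
Dataset<Row> saltedLarge = largeDf.withColumn("salted_key",
        concat(col("customer_id").cast("string"), lit("_"),
               rand().multiply(saltBuckets).cast("int").cast("string")));

// Small side: replicate each row once per salt value so every salted key has a matching row.
Dataset<Row> saltedSmall = smallDf
        .withColumn("salt", explode(sequence(lit(0), lit(saltBuckets - 1))))
        .withColumn("salted_key",
                concat(col("customer_id").cast("string"), lit("_"), col("salt").cast("string")));

// Join on the salted key: the formerly skewed key is now spread across 8 partitions.
Dataset<Row> joined = saltedLarge.join(saltedSmall, "salted_key");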