Spark Final Theory
Pros Of Spark
1. Fast Processing
2. In-Memory Computing [In Spark, data is stored in RAM, so it can access the data quickly and
accelerate the speed of analytics]
3. Flexible [It supports multiple languages like Java, Scala, R, Python]
4. Fault Tolerance [Spark contains Resilient Distributed Datasets (RDDs) that are designed to handle
the failure of any worker node in the cluster. Thus, it ensures that the loss of data is reduced to zero]
5. Better Analytics [Spark has a rich set of SQL queries, machine learning algorithms, complex
analytics, etc. With all these functionalities, analytics can be performed better]
2. Why Apache Spark?
Ans:-
Big Data
Big data is a term that describes large, hard-to-manage volumes of data - both structured and
unstructured - that inundate businesses on a day-to-day basis.
These data sets are so voluminous that traditional data processing software just cannot handle
them. But these large volumes of data can be used to address business problems we would not have
been able to tackle before.
Monolithic vs Distributed:
Monolithic: Vertical scaling, Expensive, Low availability, Single point of failure.
Distributed: Horizontal scaling, Economical, High availability, No single point of failure.
Hadoop vs Apache Spark:
Hadoop: Processing data is slow, because it writes data back to disk and reads it again from disk
into memory [stores the intermediate data on disk].
Spark: Can process the data up to 100 times faster than Hadoop, because it performs in-memory
computations [it does not store intermediate data on disk; everything is done in-memory].
Hadoop: Performs batch processing of data.
Spark: Performs both batch processing and real-time processing of data.
Hadoop: Has more lines of code. Since it is written in Java, it takes more time to execute.
Spark: Has fewer lines of code as it is implemented in Scala.
Hadoop: Difficult to write code in; Hive was built to make it easier.
Spark: Easy to use; writing and debugging code is simpler.
Hadoop: Provides more security features than Apache Spark.
Spark: Has fewer security features.
Hadoop: For fault tolerance it uses blocks of data and a replication factor to handle failures.
Spark: Uses a DAG [Directed Acyclic Graph] to provide fault tolerance.
Spark Engine
Cluster Manager [YARN / Mesos / Kubernetes / Standalone]
Spark Core
Spark Core is the base engine for large-scale parallel and distributed data processing. It is
responsible for:-
a) Memory management
b) Fault recovery
c) Scheduling, distributing, and monitoring jobs on a cluster
d) Interacting with storage systems
Spark Streaming
Spark Streaming is a lightweight API that allows developers to perform batch processing and
real-time streaming of data with ease. It provides secure, reliable, and fast processing of live data
streams.
Spark MLlib
Spark MLlib is a low-level machine learning library that is simple to use, scalable, and compatible
with various programming languages.
MLlib eases the development and deployment of scalable machine learning algorithms.
It basically contains machine learning libraries that have an implementation of various machine
learning algorithms.
Spark GraphX
GraphX is Spark's own graph computation engine and data store.
ETL -> Extract, Transform, Load
Spark Architecture
Spark uses a master-slave architecture that consists of a driver, which runs on a master node, and
multiple executors which run across the worker nodes in the cluster.
WorkFlow->
In the master node there is a resource manager like YARN. When the Spark application runs, the
request first goes to the resource manager. Suppose it requests Driver = 20 GB, Executor = 25 GB,
number of executors = 5, and CPU cores = 5.
The resource manager first creates the driver on one of the slave/worker nodes. After the driver is
created, it creates the 5 executors (in our case) on the worker nodes where free space is available,
and also allocates the CPU cores there. The container where the driver is created is called the
Application Master. In the Application Master we can write code in Java or Python. If we write the
code in Python, then there is a PySpark driver and also a JVM wrapper that converts PySpark calls
to JVM calls, because Spark is written in Scala. The flow is Python Wrapper -> Java Wrapper ->
Spark Core.
If we write the code in Python, only then does PySpark come into the picture; it is not always there,
but the JVM main method is always there because Spark is written in Scala. If we write the code in
Java, the JVM main method is already there, so no PySpark is needed.
If we write a user-defined function (UDF) in Python, it will be called at runtime, so in the executors
we also need a Python worker; because of this the application may be slow. For this reason we are
not supposed to use Python user-defined functions.
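For illustration only, a minimal Java sketch of registering a UDF (the function name and logic are
made up, not from these notes). Because this UDF is written in Java, it runs inside the executor
JVM, so no separate Python worker process is needed:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.api.java.UDF1;
import org.apache.spark.sql.types.DataTypes;

public class UdfDemo {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("UdfDemo")
                .master("local[*]")
                .getOrCreate();

        // Register a simple UDF; it runs inside the executor JVM, unlike a Python UDF,
        // which needs a separate Python worker process on each executor.
        spark.udf().register("toUpper",
                (UDF1<String, String>) s -> s == null ? null : s.toUpperCase(),
                DataTypes.StringType);

        Dataset<Row> result = spark.sql("SELECT toUpper('hello spark') AS shouted");
        result.show();

        spark.stop();
    }
}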
SQL queries, the DataFrame API, and the Dataset API all go through the Catalyst Optimizer / Spark
SQL engine, which compiles them down to RDDs.
Four Phases of the Spark SQL Engine:-
i. Analysis
ii. Logical Planning
iii. Physical Planning
iv. Code Generation
Flow is top-down:-
Code
-> Unresolved Logical Plan
-> Resolved Logical Plan
-> Logical Optimization
-> Optimized Logical Plan
-> Physical Plans 1, 2, 3, 4, ..., n
-> Cost model chooses the best physical plan
-> Best Physical Plan
-> Final Code [RDDs]
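These plans can be inspected from code. A minimal Java sketch (the statements are assumed to sit
inside a main method; the query itself is only an illustration): Dataset.explain(true) prints the
parsed, analyzed, and optimized logical plans plus the physical plan.

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

SparkSession spark = SparkSession.builder()
        .appName("PlanDemo")
        .master("local[*]")
        .getOrCreate();

// A tiny illustrative query: range() generates a column of longs.
Dataset<Row> df = spark.range(1000).toDF("id")
        .filter("id % 2 = 0")
        .groupBy("id").count();

// Prints the parsed, analyzed, and optimized logical plans and the physical plan.
df.explain(true);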
RDD (Resilient Distributed Dataset):-
Resilient -> In case of failure it knows how it was created, so it can heal/recover easily [fault
tolerant] via the lineage graph.
Distributed -> Data is distributed over the cluster.
Dataset -> Actual data.
An RDD is immutable: once it is created, it cannot be changed. An RDD is fault tolerant.
Features:-
Immutable
Lazy evaluation
Pros of an RDD:-
i. Works on unstructured data.
ii. Type safe.
Cons of an RDD:-
i. No optimization is done by Spark.
ii. The developer has to tell Spark how to do the computation, not just what to do.
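A minimal RDD sketch in Java (class name and values are illustrative), showing that
transformations are lazy and only run when an action is called:

import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class RddDemo {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("RddDemo").setMaster("local[*]");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Distribute a small collection across the cluster as an RDD.
        JavaRDD<Integer> numbers = sc.parallelize(Arrays.asList(1, 2, 3, 4, 5));

        // map() is a lazy transformation; it only extends the lineage graph.
        JavaRDD<Integer> doubled = numbers.map(n -> n * 2);

        // collect() is an action; only now is the job actually executed.
        System.out.println(doubled.collect()); // [2, 4, 6, 8, 10]

        sc.close();
    }
}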
Spark Session:-
In Apache Spark, the SparkSession serves as the entry point for interacting with Spark's
functionalities.
Code to create a SparkSession in Java:
import org.apache.spark.sql.SparkSession;
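The line above shows only the import; a minimal runnable sketch (the app name and master setting
are just examples) could look like this:

import org.apache.spark.sql.SparkSession;

public class SparkSessionDemo {
    public static void main(String[] args) {
        // Build (or reuse) the single entry point for the DataFrame/Dataset and SQL APIs.
        SparkSession spark = SparkSession.builder()
                .appName("SparkSessionDemo")
                .master("local[*]")   // local mode; on a cluster the master comes from spark-submit
                .getOrCreate();

        spark.sql("SELECT 1 AS one").show();

        spark.stop();
    }
}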
Flow:-
Driver -> Jobs -> Stages -> Tasks
NB:- groupBy by default creates 200 partitions, and in total there are 200 tasks.
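That default comes from spark.sql.shuffle.partitions. A small sketch of checking and changing it
(assuming an existing SparkSession named spark, as in the example above):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// groupBy triggers a shuffle; the shuffled result uses spark.sql.shuffle.partitions partitions.
Dataset<Row> counts = spark.range(1000).toDF("id").groupBy("id").count();
System.out.println(counts.rdd().getNumPartitions()); // typically 200, unless AQE coalesces them

// Lower the number of shuffle partitions for small data sets.
spark.conf().set("spark.sql.shuffle.partitions", "50");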
Repartition vs Coalesce:-
Repartitioning:-
Repartitioning is the process of partitioning data, creating new partitions from the existing
partitions. Here data shuffling happens, and because of this it is slow. We can increase as well as
decrease the number of partitions as we want.
Pros:->
Evenly distributed data
Cons:->
More I/O
Coalesce:-
It is the process of merging partitions. Here we can only decrease the number of partitions.
Pros:-> Less expensive; no shuffling, so less time
Cons:-> Uneven distribution of data
Repartitioning and coalesce are basically opposite to each other: one increases the number of
partitions and one decreases it. However, we can also decrease the number of partitions with
repartition.
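A small sketch of both (assuming an existing SparkSession named spark; the partition counts are
just examples):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

Dataset<Row> df = spark.range(1_000_000).toDF("id");
System.out.println(df.rdd().getNumPartitions());     // depends on the environment

// repartition() shuffles: it can increase or decrease, and gives evenly sized partitions.
Dataset<Row> wide = df.repartition(100);
System.out.println(wide.rdd().getNumPartitions());   // 100

// coalesce() only merges existing partitions: it can only decrease, with no full shuffle.
Dataset<Row> narrow = wide.coalesce(10);
System.out.println(narrow.rdd().getNumPartitions()); // 10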
Spark Join:-
i. What are the join strategies in Spark?
ii. Why is join expensive?
iii. Difference between shuffle hash join and shuffle sort-merge join?
iv. Where do we need broadcast join?
Join also creates 200 partitions by default.
Join is expensive because it is a wide-dependency transformation where data shuffling
happens between partitions, which is expensive and time-consuming.
Join strategies:-
i. Shuffle sort-merge join
ii. Shuffle hash join
iii. Broadcast hash join
iv. Broadcast nested loop join
Shuffle sort-merge join:-
After data shuffling is done between partitions, the data in each partition of the data frame is
sorted and then the join is performed; that is called sort-merge join. Time -> O(n log n). It is
chosen by Spark by default.
Example: one side of the join has keys 4, 1, 5, 2 and the other side has keys 1, 3, 2.
After sorting:-
The keys become 1, 2, 4, 5 and 1, 2, 3.
After that the join is performed by merging the sorted keys (1 -> 1, 2 -> 2, and so on).
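For the broadcast hash join listed above (useful when one table is small enough to fit in every
executor's memory), a rough sketch with an explicit broadcast hint (assuming an existing
SparkSession named spark; the tables are illustrative):

import static org.apache.spark.sql.functions.broadcast;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

Dataset<Row> orders = spark.range(1_000_000).toDF("customer_id");   // large side
Dataset<Row> customers = spark.range(100).toDF("customer_id");      // small side

// broadcast() hints Spark to ship the small table to every executor,
// avoiding a shuffle of the large table (broadcast hash join).
Dataset<Row> joined = orders.join(broadcast(customers), "customer_id");
joined.explain(); // the physical plan should show BroadcastHashJoin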
Executor memory layout:
Reserved Memory: 300 MB
User Memory: 40%
Spark Memory: 60%, split into Storage Memory (50%) and Executor/Execution Memory (50%)
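These fractions correspond to the settings spark.memory.fraction (default 0.6) and
spark.memory.storageFraction (default 0.5). A minimal sketch of setting them explicitly when
building the session (the memory sizes are only example values):

import org.apache.spark.sql.SparkSession;

// spark.memory.fraction: share of (heap - 300 MB reserved) used as unified Spark memory.
// spark.memory.storageFraction: share of that unified region set aside for storage.
SparkSession spark = SparkSession.builder()
        .appName("MemoryConfigDemo")
        .master("local[*]")
        .config("spark.executor.memory", "4g")          // example executor heap size
        .config("spark.memory.fraction", "0.6")
        .config("spark.memory.storageFraction", "0.5")
        .getOrCreate();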
Spill:-
Spill refers to the process of moving data from in-memory storage to disk. This happens when a
partition of data becomes too large to fit in the available memory on an executor. It's essentially a
safety net to avoid OutOfMemoryErrors.
•When a spill occurs, Spark writes the excess data to a temporary file on the local disk. This data can
be read back into memory when needed, but accessing data from disk is significantly slower than
accessing it from RAM.
Evict:
•Eviction refers to the process of removing data from the in-memory cache. This happens when
the cache reaches its capacity and needs to free up space for new data.
•Spark uses a Least Recently Used (LRU) eviction policy, meaning the data that hasn't been
accessed recently is the first to be evicted.
That means, in short: if executor (execution) memory is full but storage memory is not full, then it
utilizes storage memory; but if storage memory is also full and execution memory requires some
extra space, then if the data is spillable it is spilled to disk. If the data is not spillable and both
storage memory and execution memory are full, then it throws an out-of-memory exception. For
example, if my execution memory and storage memory are full and we perform a join or hashing
step, which is non-spillable, then it throws an out-of-memory exception.
If we use
bin/spark-submit \
--master yarn \
--deploy-mode client
then the driver will be created on the edge node (the client machine);
else if we use
bin/spark-submit \
--master yarn \
--deploy-mode cluster
then the driver will be created inside the cluster, on one of the worker nodes.
Client and cluster mode deployment in Apache Spark
Adaptive Query Execution (AQE)
Working Flow:-
Initial Plan: Based on available statistics (e.g., table sizes, data types), AQE creates an initial query
plan with a chosen join strategy. This might be a sort-merge join, broadcast hash join, or nested loop
join, depending on factors like table sizes and join conditions.
1. Runtime Information: As the query executes, AQE gathers runtime statistics, such as the actual
size of each table after filtering, the number of distinct values in join keys, and data skew.
2. Re-evaluation: Using this runtime information, AQE re-evaluates the chosen join strategy. It
calculates the estimated cost of executing each possible join algorithm with the new data
characteristics.
3. Switching if Beneficial: If AQE finds a different join strategy with a significantly lower
estimated cost, it dynamically switches to that strategy on the fly. This means the query execution
changes course to use the more efficient join approach.
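AQE is controlled through configuration. A minimal sketch of turning it on (assuming an existing
SparkSession named spark; since Spark 3.2 the first flag is already on by default):

// Enable Adaptive Query Execution, runtime skew-join handling,
// and coalescing of small shuffle partitions.
spark.conf().set("spark.sql.adaptive.enabled", "true");
spark.conf().set("spark.sql.adaptive.skewJoin.enabled", "true");
spark.conf().set("spark.sql.adaptive.coalescePartitions.enabled", "true");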
Benefits of Caching:-
Reduced Processing Time: Subsequent operations on the cached data avoid reading from disk,
which is much slower than accessing memory.
•Improved Performance: This optimization can significantly speed up your application, especially
for iterative analyses or repeated computations.
•Fault Tolerance: Cached data can be used for recomputation in case of failures, enhancing data
availability.
Persist:-
In Apache Spark, "persist" is another option for storing data efficiently, similar to caching, but
offering more flexibility and control. While caching uses a default storage level
("MEMORY_ONLY"), persist allows you to specify various storage levels based on your needs.
Cache vs Persist:
Default Storage Level -> Cache: MEMORY_ONLY; Persist: user-defined (e.g., MEMORY_ONLY,
MEMORY_AND_DISK).
Data Spilling -> Persist: can spill data to disk based on the chosen storage level; Cache: can spill
data to disk only if no ...
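A minimal sketch of both (assuming an existing SparkSession named spark; the data is
illustrative):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.storage.StorageLevel;

// cache() uses the default storage level.
Dataset<Row> cached = spark.range(1_000_000).toDF("id").cache();

// persist() lets you pick the storage level explicitly, e.g. spill to disk when memory is full.
Dataset<Row> persisted = spark.range(1_000_000).toDF("id")
        .persist(StorageLevel.MEMORY_AND_DISK());

cached.count();        // the first action materializes the cached data
persisted.count();

cached.unpersist();    // free the memory when the data is no longer needed
persisted.unpersist();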
Partition Pruning:-
Partition pruning is an optimization technique for queries on partitioned tables in systems like
Apache Spark. It reduces the amount of data scanned by leveraging the partition structure.
1. Identifying potential pruning: When you execute a query with a filter condition on a partitioned
column, Spark can analyze the filter and compare it to the partition values. If the filter doesn't apply
to certain partitions, those partitions can be pruned from the scan, meaning they won't be read at
all.
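A rough sketch (assuming an existing SparkSession named spark; the path and column names are
made up for illustration) of writing a partitioned table and letting Spark prune partitions on read:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;

// Write data partitioned by "year": one sub-directory per distinct year value.
Dataset<Row> events = spark.range(1000).toDF("id")
        .withColumn("year", org.apache.spark.sql.functions.lit(2023));
events.write().mode(SaveMode.Overwrite).partitionBy("year").parquet("/tmp/events");

// The filter on the partition column lets Spark read only the year=2023 directory.
Dataset<Row> pruned = spark.read().parquet("/tmp/events").filter("year = 2023");
pruned.explain(); // look for PartitionFilters in the scan node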
Salting:-
Salting was a technique used in Apache Spark, primarily before version 3.0, to address data skew
issues when performing operations like joins and aggregations on large datasets. Data skew occurs
when data is unevenly distributed across partitions, causing some partitions to hold significantly
more data than others. This can lead to performance bottlenecks and slow down Spark jobs.
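As a hedged illustration of manual salting (assuming an existing SparkSession plus two
DataFrames bigDf and smallDf that share a join column named key; the number of salt buckets is
made up): the large, skewed side gets a random salt and the small side is replicated once per salt
value so the join still matches.

import static org.apache.spark.sql.functions.*;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

int saltBuckets = 10; // illustrative number of salt values

// Large, skewed table: append a random salt in [0, saltBuckets) so hot keys are
// spread across more partitions.
Dataset<Row> saltedBig = bigDf.withColumn(
        "salt", floor(rand().multiply(lit(saltBuckets))).cast("int"));

// Small table: replicate each row once per salt value so every salted key finds a match.
Dataset<Row> saltedSmall = smallDf.withColumn(
        "salt", explode(sequence(lit(0), lit(saltBuckets - 1))));

// Join on the original key plus the salt; the duplicate key/salt columns can be dropped afterwards.
Dataset<Row> joined = saltedBig.join(saltedSmall,
        saltedBig.col("key").equalTo(saltedSmall.col("key"))
                .and(saltedBig.col("salt").equalTo(saltedSmall.col("salt"))));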