
Spark Tutorial

Driver:-
1. The heart of the Spark application
2. Manages the information and state of the executors
3. Analyses, distributes, and schedules the work across the executors
Executor:-
1. Executes the code assigned by the driver
2. Reports the status of execution back to the driver
A user submits a job to the driver; the driver analyses it, breaks it
down into stages and tasks, and distributes them to the executors.
Executors are JVM processes running on the cluster machines, and each
executor hosts a set of cores.
NOTE:-
1. Each task can only work on one partition of data at a time.
2. Tasks can execute in parallel.
3. Executors are JVM processes running on cluster machines.
4. Executors host cores, and each core can run one task at a time.

Narrow Transformation:-
After applying the transformation, each input partition contributes to
at most one output partition.
Wide Transformation:-
After applying the transformation, one input partition contributes to
more than one output partition. These transformations lead to a data
shuffle.
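As a minimal sketch (the DataFrame df and its salary and department_id columns are assumptions), filter() is a narrow transformation while groupBy().count() is wide and triggers a shuffle:

    # Narrow: each input partition maps to at most one output partition
    filtered_df = df.filter(df.salary > 50000)

    # Wide: rows with the same key must be brought together, causing a shuffle
    counts_df = df.groupBy("department_id").count()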
How does Spark work on data partitions?
- Spark distributes the data across the cluster in the form of partitions.

Logical Planning is the 1st Phase of Execution Planning


Physical Planning is the 2nd and Final Phase of Execution Planning
Directed Acyclic Graph (DAG) - A DAG represents the logical
execution plan of a Spark job.

Example - Generate the EMP DataFrame and filter for EMP salary > 50000. Write the output in CSV format.
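A minimal sketch of this example, assuming an existing spark session; the sample rows, column names, and output path are illustrative:

    emp_df = spark.createDataFrame(
        [(1, "Alice", 60000), (2, "Bob", 45000)],
        ["emp_id", "emp_name", "salary"],
    )

    high_paid_df = emp_df.filter(emp_df.salary > 50000)

    # Write the filtered result as CSV
    high_paid_df.write.format("csv") \
        .option("header", True) \
        .mode("overwrite") \
        .save("/tmp/emp_output")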
Spark Session object is the entry point for our Spark application.
Spark Session object name - we use the generic name "spark" for
better understanding.
local[*] runs Spark locally with as many worker threads as there are
logical cores on your machine.
If you want to reference the session under a different name, you can
use code like: new_object_name = spark.getActiveSession()
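A typical way to create the session (the application name is illustrative):

    from pyspark.sql import SparkSession

    # Entry point for the application; local[*] uses one worker thread per logical core
    spark = SparkSession.builder \
        .appName("SparkTutorial") \
        .master("local[*]") \
        .getOrCreate()

    # Reference the same active session under another name
    new_object_name = spark.getActiveSession()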
Schema for DataFrame - the schema is the metadata that defines the
names and the types of the columns.

DataFrame is divided into two parts: Rows & Columns


You need a DataFrame to manipulate the columns.
The function _parse_datatype_string in PySpark is a private method
used internally to parse a string representation of a data type into a
PySpark DataType object. It is primarily utilized when working with
schema definitions or manipulating data types programmatically.
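For illustration, the same schema can be defined with StructType or as a DDL string; since _parse_datatype_string is private, a safer pattern is to pass the DDL string directly to spark.read.schema(), which accepts it (file path and columns here are assumptions):

    from pyspark.sql.types import StructType, StructField, IntegerType, StringType

    # Programmatic schema
    schema = StructType([
        StructField("emp_id", IntegerType(), True),
        StructField("emp_name", StringType(), True),
    ])

    # Equivalent DDL-string schema, accepted directly by the reader
    ddl_schema = "emp_id INT, emp_name STRING"
    df = spark.read.schema(ddl_schema).option("header", True).csv("/tmp/emp.csv")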
Adding or overriding columns :- Spark provides withColumn() to
create a new column or override an existing one.
Adding static value columns :- Spark provides lit() to create a
distributed static (literal) column.
Renaming existing columns :- Spark provides withColumnRenamed()
for this.
Multiple ways to rename :- we can also use expr() or selectExpr()
for renaming columns (e.g. selectExpr("employee_id as emp_id")).
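A short illustrative sketch of these operations; emp_df and its columns (salary, employee_id) are assumptions:

    from pyspark.sql.functions import col, lit

    emp_df = (emp_df
        .withColumn("tax", col("salary") * 0.1)        # add or override a column
        .withColumn("country", lit("IN"))              # static (literal) column
        .withColumnRenamed("employee_id", "emp_id"))   # rename a column

    # Renaming via selectExpr
    renamed_df = emp_df.selectExpr("emp_id as employee_id", "salary")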
Downstream system :- Any system which is dependent on your
output.
Note- It is never recommended to use spaces in the column names.
Remove columns from a DataFrame :- use drop().
Limit :- use limit() when we only need a few records rather than the
full dataset.
Union vs UnionAll :- in Spark SQL, UNION removes duplicates from the
combined dataset whereas UNION ALL doesn't. (In the DataFrame API,
union() and unionAll() behave identically and keep duplicates; call
distinct() after union() to deduplicate.)
Using select or selectExpr - we can write column aggregation
expressions in select or selectExpr as well.
Unique and distinct data from selected columns - use distinct() to
get the unique rows.
Window or Analytical Functions :- apply aggregate and ranking
functions over a particular window (set of rows).
expr() function to the rescue :- whenever the PySpark API feels
difficult, we can write the same logic as SQL inside expr().
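A minimal sketch, assuming emp_df has department_id and salary columns:

    from pyspark.sql import Window
    from pyspark.sql import functions as F

    window_spec = Window.partitionBy("department_id").orderBy(F.col("salary").desc())

    ranked_df = (emp_df
        .withColumn("rank_in_dept", F.row_number().over(window_spec))
        .withColumn("max_dept_salary", F.max("salary").over(Window.partitionBy("department_id"))))

    # Same ranking written as SQL inside expr()
    ranked_sql_df = emp_df.withColumn(
        "rank_in_dept",
        F.expr("row_number() over (partition by department_id order by salary desc)"))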
Coalesce vs Repartition:-
Repartition involves a full data shuffle, whereas Coalesce doesn't.
Repartition can increase or decrease the number of partitions, but
Coalesce can only decrease it, not increase it.
Repartition gives a uniform data distribution, but Coalesce can't
guarantee it.
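A quick illustrative comparison (df and the partition counts are assumptions):

    # Full shuffle: can increase or decrease, gives roughly uniform partitions
    df_16 = df.repartition(16)

    # No full shuffle: merges existing partitions, can only reduce the count
    df_4 = df.coalesce(4)

    print(df_16.rdd.getNumPartitions(), df_4.rdd.getNumPartitions())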
How many jobs are created when we specify the schema during a data
read?
- When you specify the schema during data reading in Apache
Spark, no additional jobs are created for schema inference.
This behavior can significantly improve performance because
schema inference requires Spark to scan part of the data,
which would otherwise trigger additional jobs.
Read modes for CSV :- useful for handling bad records; an explicit
schema is mandatory.
Permissive Mode :- the default read mode; we don't need to specify
it explicitly in the code.
Drop Malformed Mode :- drops the bad records.
Fail Fast Mode :- fails the job as soon as a bad record is identified.
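An illustrative read with an explicit schema and a read mode; the path and columns are assumptions:

    schema = "emp_id INT, emp_name STRING, salary DOUBLE"

    df = (spark.read
        .format("csv")
        .schema(schema)                   # no extra job for schema inference
        .option("header", True)
        .option("mode", "PERMISSIVE")     # or DROPMALFORMED / FAILFAST
        .load("/tmp/emp.csv"))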
Row vs Columnar Data Format
Benefit of Columnar Format:- The columnar format lets the reader
read, decompress, and process only the columns that are required
for the current query.
Parquet vs ORC vs AVRO

Parquet stores the schema (metadata) along with the data.


Demo for Columnar Data Benefit:- Performance benefit while
reading Columnar data.
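A simple way to see the benefit, assuming a Parquet dataset at an illustrative path: selecting only the needed column lets Spark read and decompress just that column chunk instead of whole rows.

    parquet_df = spark.read.parquet("/tmp/emp_parquet")

    # Only the salary column is read and decompressed
    parquet_df.select("salary").agg({"salary": "avg"}).show()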
Python data types – List/Array and Struct/Dict are important for
working with JSON.
Reading JSON files – the default is single-line JSON; for multiline
JSON we need to specify an option.
Read JSON data in a single column – we use the TEXT format for this.
Read with schema – reading a JSON file with an explicit schema.
Write a JSON schema – how to write a complex JSON schema in string
DDL format.
from_json function – parses JSON from a string.
to_json function – converts parsed JSON data back to a JSON string.
Flatten JSON data – expand and explode the JSON data into a simple
tabular structure.
Dot notation – it can be used to extract/expand struct/dict data.
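A compact sketch of these JSON operations; the paths, column names, and schema are assumptions:

    from pyspark.sql import functions as F

    # Multiline JSON needs an explicit option
    json_df = spark.read.option("multiLine", True).json("/tmp/emp_multiline.json")

    # JSON text held in a single string column, parsed with from_json
    raw_df = spark.read.format("text").load("/tmp/emp_raw.json")
    schema = "emp_id INT, emp_name STRING, skills ARRAY<STRING>, address STRUCT<city: STRING, pin: STRING>"
    parsed_df = raw_df.select(F.from_json("value", schema).alias("emp")).select("emp.*")

    # Flatten: dot notation for structs, explode() for arrays
    flat_df = (parsed_df
        .withColumn("city", F.col("address.city"))
        .withColumn("skill", F.explode("skills"))
        .drop("address", "skills"))

    # Back to a JSON string
    json_text_df = parsed_df.select(F.to_json(F.struct(*parsed_df.columns)).alias("json_value"))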
How does Spark write data?
- Cluster with 1 node, 2 executors, and each executor with 2 cores.
- Total cores / parallelism: 4
Number of files – the processed data from each partition is written
to its own file.
Default parallelism – the total number of cores available for
processing data in parallel.
Spark creates the output folder and writes one file per partition
inside it.
Spark partition folders – they allow Spark to skip unnecessary data
reads when you query data for a specific partition.
Note – Spark doesn't write the department_id column into the files
because it is the partition column (the value comes from the folder
name). You can see it is missing in the files.
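An illustrative partitioned write; the output path and columns are assumptions:

    (emp_df.write
        .format("parquet")
        .mode("overwrite")
        .partitionBy("department_id")   # creates department_id=<value> sub-folders
        .save("/tmp/emp_partitioned"))

    # department_id is not stored inside the files; Spark recovers it from the folder names
    spark.read.parquet("/tmp/emp_partitioned").filter("department_id = 10").show()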
Spark Submit – the command used to execute a Spark application on a
cluster.
Note – the spark-submit command should be run from the installed
Spark folder; check your setup before executing the code.
Note: for Standalone clusters, the --num-executors parameter may not
always work. So, to control the number of executors:
1. Define the number of cores per executor with the --executor-cores
parameter (spark.executor.cores)
2. Control the maximum number of cores for execution with the
--total-executor-cores parameter (spark.cores.max)
If you need 3 executors with 2 cores each (you don't need to use
--num-executors): --executor-cores 2 --total-executor-cores 6
The --num-executors parameter can be used to control the number of
executors with the YARN resource manager.
Note: in production settings, spark-submit commands are often
scheduled or triggered to execute jobs on clusters.
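For example, a standalone-cluster submission along these lines (the master host, memory setting, and application file are placeholders):

    spark-submit \
      --master spark://<master-host>:7077 \
      --executor-cores 2 \
      --total-executor-cores 6 \
      --executor-memory 2g \
      my_spark_app.py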

What are the JVMs here?
- The JVMs are the executor processes requested by the Spark
application.
Shuffle or Exchange – a shuffle divides the job into stages.
Default shuffle partitions – by default the shuffle partition value
is 200. This can be controlled from the Spark config.
Shuffle files:
These are serialized in the Tungsten binary format (UnsafeRow). They
can be read directly into memory, which improves read performance.
Adaptive Query Execution (AQE):
By default it provides some performance benefits around shuffles.
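These settings can be adjusted on the session; the values below are illustrative:

    # Default is 200 shuffle partitions
    spark.conf.set("spark.sql.shuffle.partitions", 100)

    # AQE is enabled by default in recent Spark versions; it can be toggled explicitly
    spark.conf.set("spark.sql.adaptive.enabled", True)
    spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", True)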
Note:
Caching needs an action to materialize it; count() and write are
preferred because they scan the whole dataset, resulting in a
complete cache.
Storage Level:
The default storage level for cache is MEMORY_AND_DISK for
DataFrames and Datasets.
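A minimal sketch (df is an assumption):

    from pyspark import StorageLevel

    df.cache()       # defaults to MEMORY_AND_DISK for DataFrames
    df.count()       # action that scans the full dataset and materializes the cache

    df.unpersist()   # release the cache before choosing a different storage level
    df.persist(StorageLevel.DISK_ONLY)
    df.count()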
Note:
Data partitions are read by the executors in parallel, so the data
ends up spread across different executors.
Joins can lead to a shuffle if not optimized correctly.
What is skewness?
- Unbalanced data, i.e. data that is not distributed evenly across
partitions.
2 types of spillage - spill (memory) and spill (disk).
How to identify skewness?
- We can check the record count per partition of the shuffled data.
Salted keys: create new salted keys on both tables and join on them.
Note: the salting technique doesn't need to be used every time; use
it only when you have memory issues due to spillage (mostly
out-of-memory errors).
This was just a demo to show how to fix spillage.
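A rough sketch of the salting idea, not the author's exact code; the table names (fact_df, dim_df), the join key, and the salt count are assumptions. Add a random salt to the large, skewed side, replicate the small side once per salt value, then join on the composite key:

    from pyspark.sql import functions as F

    NUM_SALTS = 8  # illustrative

    # Skewed (large) side: random salt per row
    salted_fact = fact_df.withColumn("salt", (F.rand() * NUM_SALTS).cast("int"))

    # Small side: replicate each row once per salt value
    salted_dim = dim_df.withColumn(
        "salt", F.explode(F.array([F.lit(i) for i in range(NUM_SALTS)])))

    joined = salted_fact.join(salted_dim, on=["join_key", "salt"], how="inner")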
Skewed Partitions Optimization:
Spark AQE will take care of partition sizing to balance the data and
avoid spillage.
spark.sql.adaptive.advisoryPartitionSizeInBytes:
The advisory size in bytes of the shuffle partitions during adaptive
optimization. This controls the size of shuffle partitions
post-shuffle.
Now that our task partition memory is small, we make the partitions
smaller by setting this to 8 MB.
spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes:
A partition is considered skewed if its size in bytes is larger than
this threshold.
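These can be set like any other Spark configuration; the 8 MB value matches the example above and the threshold value is illustrative:

    spark.conf.set("spark.sql.adaptive.enabled", True)
    spark.conf.set("spark.sql.adaptive.skewJoin.enabled", True)
    spark.conf.set("spark.sql.adaptive.advisoryPartitionSizeInBytes", "8m")
    spark.conf.set("spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes", "64m")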
What is the Catalog?
- The catalog stores the metadata of SQL objects.
Temp Views:
Temp views are only available while the session is active.
Spark SQL Functions:
PySpark DataFrame API functions are available in Spark SQL by
default. No import is needed.
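A minimal sketch; emp_df and its columns are assumptions:

    # Register a temp view; it lives only for the current Spark session
    emp_df.createOrReplaceTempView("emp")

    # The catalog exposes metadata about databases, tables and views
    print(spark.catalog.listTables())

    # Built-in SQL functions are available directly in SQL, no import needed
    spark.sql("select department_id, sum(salary) as total_salary from emp group by department_id").show()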
Using Secrets Securely:
This strategy works for on-premise or standard clusters. If you are
working with Databricks, make sure to save secrets in Azure Key
Vault and import them while reading or writing data.
Partitioning breaks data into folders:
Folder names are in the format <partition_column>=<value>,
e.g. country=IN
Note: the partitioning column should be part of the query filter in
order to improve performance.
Impact of High Cardinality Column:
Always avoid partitioning on High Cardinality or Unique value
columns to avoid creating too many partitions.
High Cardinality Column:
Columns with more unique values (less repetition)
Z-ordering can be done on more than one column.
_metadata is a hidden column that can be used to get the source
metadata.
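For example, on file-based sources that support it (Spark 3.2+ and Databricks), the hidden column can be selected explicitly; the path is an assumption and the exact set of fields (file_path, file_name, file_size, file_modification_time, ...) is version-dependent:

    files_df = spark.read.parquet("/tmp/emp_partitioned")
    files_df.select("*", "_metadata").show(truncate=False)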
Configuration:
This configuration is used to control the maximum file size for file
compaction with the OPTIMIZE command.
Selective Z-Ordering with Partition filters:
Only optimizes and Z-orders the data for the specified partition.
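On Delta Lake / Databricks this is expressed in SQL; the table name, partition value, and Z-order columns are illustrative, and the max-file-size setting shown is a Databricks Delta configuration, so treat its name and value as assumptions for your platform and version:

    # Illustrative: cap the compacted file size used by OPTIMIZE (Databricks setting)
    spark.conf.set("spark.databricks.delta.optimize.maxFileSize", 256 * 1024 * 1024)

    # Z-order on more than one column, restricted to one partition via a filter
    spark.sql("OPTIMIZE sales WHERE country = 'IN' ZORDER BY (customer_id, order_date)")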
