
Spark Tutorial

Driver:-
1. The heart of the Spark application
2. Manages the information and state of the executors
3. Analyses, distributes, and schedules the work across the executors
Executor:-
1. Executes the code assigned by the driver
2. Reports the status of execution back to the driver
A user submits a job to the driver; the driver analyses it, breaks it
down into stages and tasks, and distributes them to the executors.
Executors are JVM processes running on the cluster machines, and each
executor hosts a set of cores.
NOTE:-
1. Each task can only work on one partition of data at a time.
2. Tasks can execute in parallel.
3. Executors are JVM processes running on cluster machines.
4. Executors host cores, and each core can run one task at a time.

Narrow Transformation:-
After applying the transformation, each input partition contributes to
at most one output partition.
Wide Transformation:-
After applying the transformation, one input partition contributes to
more than one output partition. These transformations lead to a data
shuffle.
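As a minimal sketch (the DataFrame df and its salary and department_id columns are assumptions), filter() is a narrow transformation while groupBy().count() is wide and triggers a shuffle:

    # Narrow: each input partition maps to at most one output partition
    filtered_df = df.filter(df.salary > 50000)

    # Wide: rows with the same key must be brought together, causing a shuffle
    counts_df = df.groupBy("department_id").count()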
How does Spark work on data partitions?
- Spark distributes the data across the cluster in the form of partitions.

Logical Planning is the 1st Phase of Execution Planning


Physical Planning is the 2nd and Final Phase of Execution Planning
Directed Acyclic Graph (DAG) - A DAG represents the logical
execution plan of a Spark job.

Example - Generate the EMP DataFrame and filter for EMP salary > 50000. Write the output in CSV format.
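A minimal sketch of this example, assuming an existing spark session; the sample rows, column names, and output path are illustrative:

    emp_df = spark.createDataFrame(
        [(1, "Alice", 60000), (2, "Bob", 45000)],
        ["emp_id", "emp_name", "salary"],
    )

    high_paid_df = emp_df.filter(emp_df.salary > 50000)

    # Write the filtered result as CSV
    high_paid_df.write.format("csv") \
        .option("header", True) \
        .mode("overwrite") \
        .save("/tmp/emp_output")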
Spark Session object is the entry point for our Spark application.
Spark Session object name - we use the generic name "spark" for
better understanding.
local[*] runs Spark locally with as many worker threads as there are
logical cores on your machine.
If you want to reference the session under a different name, you can
use code like: new_object_name = spark.getActiveSession()
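A typical way to create the session (the application name is illustrative):

    from pyspark.sql import SparkSession

    # Entry point for the application; local[*] uses one worker thread per logical core
    spark = SparkSession.builder \
        .appName("SparkTutorial") \
        .master("local[*]") \
        .getOrCreate()

    # Reference the same active session under another name
    new_object_name = spark.getActiveSession()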
Schema for DataFrame - the schema is the metadata that defines the
names and the types of the columns.

DataFrame is divided into two parts: Rows & Columns


You need a DataFrame to manipulate the columns.
The function _parse_datatype_string in PySpark is a private method
used internally to parse a string representation of a data type into a
PySpark DataType object. It is primarily utilized when working with
schema definitions or manipulating data types programmatically.
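For illustration, the same schema can be defined with StructType or as a DDL string; since _parse_datatype_string is private, a safer pattern is to pass the DDL string directly to spark.read.schema(), which accepts it (file path and columns here are assumptions):

    from pyspark.sql.types import StructType, StructField, IntegerType, StringType

    # Programmatic schema
    schema = StructType([
        StructField("emp_id", IntegerType(), True),
        StructField("emp_name", StringType(), True),
    ])

    # Equivalent DDL-string schema, accepted directly by the reader
    ddl_schema = "emp_id INT, emp_name STRING"
    df = spark.read.schema(ddl_schema).option("header", True).csv("/tmp/emp.csv")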
Adding or overriding columns :- Spark provides withColumn() to
create a new column or override an existing one.
Adding static value columns :- Spark provides lit() to create a
distributed static (literal) column.
Renaming existing columns :- Spark provides withColumnRenamed()
for this.
Multiple ways to rename :- we can also use expr() or selectExpr()
for renaming columns (e.g. selectExpr("employee_id as emp_id")).
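A short illustrative sketch of these operations; emp_df and its columns (salary, employee_id) are assumptions:

    from pyspark.sql.functions import col, lit

    emp_df = (emp_df
        .withColumn("tax", col("salary") * 0.1)        # add or override a column
        .withColumn("country", lit("IN"))              # static (literal) column
        .withColumnRenamed("employee_id", "emp_id"))   # rename a column

    # Renaming via selectExpr
    renamed_df = emp_df.selectExpr("emp_id as employee_id", "salary")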
Downstream system :- Any system which is dependent on your
output.
Note- It is never recommended to use spaces in the column names.
Remove columns from a DataFrame :- use drop().
Limit :- use limit() when we only need a few records rather than the
full dataset.
Union vs UnionAll :- in Spark SQL, UNION removes duplicates from the
combined dataset whereas UNION ALL doesn't. (In the DataFrame API,
union() and unionAll() behave identically and keep duplicates; call
distinct() after union() to deduplicate.)
Using select or selectExpr - we can write column aggregation
expressions in select or selectExpr as well.
Unique and distinct data from selected columns - use distinct() to
get the unique rows.
Window or Analytical Functions :- apply aggregate and ranking
functions over a particular window (set of rows).
expr() function to the rescue :- whenever the PySpark API feels
difficult, we can write the same logic as SQL inside expr().
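A minimal sketch, assuming emp_df has department_id and salary columns:

    from pyspark.sql import Window
    from pyspark.sql import functions as F

    window_spec = Window.partitionBy("department_id").orderBy(F.col("salary").desc())

    ranked_df = (emp_df
        .withColumn("rank_in_dept", F.row_number().over(window_spec))
        .withColumn("max_dept_salary", F.max("salary").over(Window.partitionBy("department_id"))))

    # Same ranking written as SQL inside expr()
    ranked_sql_df = emp_df.withColumn(
        "rank_in_dept",
        F.expr("row_number() over (partition by department_id order by salary desc)"))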
Coalesce vs Repartition:-
Repartition involves a full data shuffle, whereas Coalesce doesn't.
Repartition can increase or decrease the number of partitions, but
Coalesce can only decrease it, not increase it.
Repartition gives a uniform data distribution, but Coalesce can't
guarantee it.
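A quick illustrative comparison (df and the partition counts are assumptions):

    # Full shuffle: can increase or decrease, gives roughly uniform partitions
    df_16 = df.repartition(16)

    # No full shuffle: merges existing partitions, can only reduce the count
    df_4 = df.coalesce(4)

    print(df_16.rdd.getNumPartitions(), df_4.rdd.getNumPartitions())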
How many jobs are created when we specify the schema during a data
read?
- When you specify the schema during data reading in Apache
Spark, no additional jobs are created for schema inference.
This behavior can significantly improve performance because
schema inference requires Spark to scan part of the data,
which would otherwise trigger additional jobs.
Read modes for CSV :- useful for handling bad records; an explicit
schema is mandatory.
Permissive Mode :- the default read mode; we don't need to specify
it explicitly in the code.
Drop Malformed Mode :- drops the bad records.
Fail Fast Mode :- fails the job as soon as a bad record is identified.
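An illustrative read with an explicit schema and a read mode; the path and columns are assumptions:

    schema = "emp_id INT, emp_name STRING, salary DOUBLE"

    df = (spark.read
        .format("csv")
        .schema(schema)                   # no extra job for schema inference
        .option("header", True)
        .option("mode", "PERMISSIVE")     # or DROPMALFORMED / FAILFAST
        .load("/tmp/emp.csv"))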
Row vs Columnar Data Format
Benefit of Columnar Format:- The columnar format lets the reader
read, decompress, and process only the columns that are required
for the current query.
Parquet vs ORC vs AVRO

Parquet stores the schema (metadata) along with the data.


Demo for Columnar Data Benefit:- Performance benefit while
reading Columnar data.
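A simple way to see the benefit, assuming a Parquet dataset at an illustrative path: selecting only the needed column lets Spark read and decompress just that column chunk instead of whole rows.

    parquet_df = spark.read.parquet("/tmp/emp_parquet")

    # Only the salary column is read and decompressed
    parquet_df.select("salary").agg({"salary": "avg"}).show()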
Python data types – List/Array and Struct/Dict are important for
working with JSON.
Reading JSON files – the default is single-line JSON; for multiline
JSON we need to specify an option.
Read JSON data in a single column – we use the TEXT format for this.
Read with schema – reading a JSON file with an explicit schema.
Write a JSON schema – how to write a complex JSON schema in string
DDL format.
from_json function – parses JSON from a string.
to_json function – converts parsed JSON data back to a JSON string.
Flatten JSON data – expand and explode the JSON data into a simple
tabular structure.
Dot notation – it can be used to extract/expand struct/dict data.
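A compact sketch of these JSON operations; the paths, column names, and schema are assumptions:

    from pyspark.sql import functions as F

    # Multiline JSON needs an explicit option
    json_df = spark.read.option("multiLine", True).json("/tmp/emp_multiline.json")

    # JSON text held in a single string column, parsed with from_json
    raw_df = spark.read.format("text").load("/tmp/emp_raw.json")
    schema = "emp_id INT, emp_name STRING, skills ARRAY<STRING>, address STRUCT<city: STRING, pin: STRING>"
    parsed_df = raw_df.select(F.from_json("value", schema).alias("emp")).select("emp.*")

    # Flatten: dot notation for structs, explode() for arrays
    flat_df = (parsed_df
        .withColumn("city", F.col("address.city"))
        .withColumn("skill", F.explode("skills"))
        .drop("address", "skills"))

    # Back to a JSON string
    json_text_df = parsed_df.select(F.to_json(F.struct(*parsed_df.columns)).alias("json_value"))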
How does Spark write data?
- Cluster with 1 node, 2 executors, and each executor with 2 cores.
- Total cores / parallelism: 4
Number of files – the processed data from each partition is written
to its own file.
Default parallelism – the total number of cores available for
processing data in parallel.
Spark creates the output folder and writes one file per partition
inside it.
Spark partition folders – they allow Spark to skip unnecessary data
reads when you query data for a specific partition.
Note – Spark doesn't write the department_id column into the files
because it is the partition column (the value comes from the folder
name). You can see it is missing in the files.
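An illustrative partitioned write; the output path and columns are assumptions:

    (emp_df.write
        .format("parquet")
        .mode("overwrite")
        .partitionBy("department_id")   # creates department_id=<value> sub-folders
        .save("/tmp/emp_partitioned"))

    # department_id is not stored inside the files; Spark recovers it from the folder names
    spark.read.parquet("/tmp/emp_partitioned").filter("department_id = 10").show()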
Spark Submit – the command used to execute a Spark application on a
cluster.
Note – the spark-submit command should be run from the installed
Spark folder; check your setup before executing the code.
Note: for Standalone clusters, the --num-executors parameter may not
always work. So, to control the number of executors:
1. Define the number of cores per executor with the --executor-cores
parameter (spark.executor.cores)
2. Control the maximum number of cores for execution with the
--total-executor-cores parameter (spark.cores.max)
If you need 3 executors with 2 cores each (you don't need to use
--num-executors): --executor-cores 2 --total-executor-cores 6
The --num-executors parameter can be used to control the number of
executors with the YARN resource manager.
Note: in production settings, spark-submit commands are often
scheduled or triggered to execute jobs on clusters.
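For example, a standalone-cluster submission along these lines (the master host, memory setting, and application file are placeholders):

    spark-submit \
      --master spark://<master-host>:7077 \
      --executor-cores 2 \
      --total-executor-cores 6 \
      --executor-memory 2g \
      my_spark_app.py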

What are the JVMs here?
- The JVMs are the executor processes requested by the Spark
application.
Shuffle or Exchange – a shuffle divides the job into stages.
Default shuffle partitions – by default the shuffle partition value
is 200. This can be controlled from the Spark config.
Shuffle files:
These are serialized in the Tungsten binary format (UnsafeRow). They
can be read directly into memory, which improves read performance.
Adaptive Query Execution (AQE):
By default it provides some performance benefits around shuffles.
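These settings can be adjusted on the session; the values below are illustrative:

    # Default is 200 shuffle partitions
    spark.conf.set("spark.sql.shuffle.partitions", 100)

    # AQE is enabled by default in recent Spark versions; it can be toggled explicitly
    spark.conf.set("spark.sql.adaptive.enabled", True)
    spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", True)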
Note:
Caching needs an action to materialize it; count() and write are
preferred because they scan the whole dataset, resulting in a
complete cache.
Storage Level:
The default storage level for cache is MEMORY_AND_DISK for
DataFrames and Datasets.
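A minimal sketch (df is an assumption):

    from pyspark import StorageLevel

    df.cache()       # defaults to MEMORY_AND_DISK for DataFrames
    df.count()       # action that scans the full dataset and materializes the cache

    df.unpersist()   # release the cache before choosing a different storage level
    df.persist(StorageLevel.DISK_ONLY)
    df.count()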
Note:
Data partitions are read by the executors in parallel, so the data
ends up spread across different executors.
Joins can lead to a shuffle if not optimized correctly.
What is skewness?
- Unbalanced data, i.e. data that is not distributed evenly across
partitions.
2 types of spillage - spill (memory) and spill (disk).
How to identify skewness?
- We can check the record count per partition of the shuffled data.
Salted keys: create new salted keys on both tables and join on them.
Note: the salting technique doesn't need to be used every time; use
it only when you have memory issues due to spillage (mostly
out-of-memory errors).
This was just a demo to show how to fix spillage.
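A rough sketch of the salting idea, not the author's exact code; the table names (fact_df, dim_df), the join key, and the salt count are assumptions. Add a random salt to the large, skewed side, replicate the small side once per salt value, then join on the composite key:

    from pyspark.sql import functions as F

    NUM_SALTS = 8  # illustrative

    # Skewed (large) side: random salt per row
    salted_fact = fact_df.withColumn("salt", (F.rand() * NUM_SALTS).cast("int"))

    # Small side: replicate each row once per salt value
    salted_dim = dim_df.withColumn(
        "salt", F.explode(F.array([F.lit(i) for i in range(NUM_SALTS)])))

    joined = salted_fact.join(salted_dim, on=["join_key", "salt"], how="inner")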
Skewed Partitions Optimization:
Spark AQE will take care of partition sizing to balance the data and
avoid spillage.
spark.sql.adaptive.advisoryPartitionSizeInBytes:
The advisory size in bytes of the shuffle partitions during adaptive
optimization. This controls the size of shuffle partitions
post-shuffle.
Now that our task partition memory is small, we make the partitions
smaller by setting this to 8 MB.
spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes:
A partition is considered skewed if its size in bytes is larger than
this threshold.
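These can be set like any other Spark configuration; the 8 MB value matches the example above and the threshold value is illustrative:

    spark.conf.set("spark.sql.adaptive.enabled", True)
    spark.conf.set("spark.sql.adaptive.skewJoin.enabled", True)
    spark.conf.set("spark.sql.adaptive.advisoryPartitionSizeInBytes", "8m")
    spark.conf.set("spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes", "64m")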
What is the Catalog?
- The catalog stores the metadata of SQL objects.
Temp Views:
Temp views are only available while the session is active.
Spark SQL Functions:
PySpark DataFrame API functions are available in Spark SQL by
default. No import is needed.
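A minimal sketch; emp_df and its columns are assumptions:

    # Register a temp view; it lives only for the current Spark session
    emp_df.createOrReplaceTempView("emp")

    # The catalog exposes metadata about databases, tables and views
    print(spark.catalog.listTables())

    # Built-in SQL functions are available directly in SQL, no import needed
    spark.sql("select department_id, sum(salary) as total_salary from emp group by department_id").show()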
Using Secrets Securely:
This strategy works for on-premise or standard clusters. If you are
working with Databricks, make sure to save secrets in Azure Key
Vault and import them while reading or writing data.
Partitioning breaks data into folders:
Folder names are in the format <partition_column>=<value>,
e.g. country=IN
Note: the partitioning column should be part of the query filter in
order to improve performance.
Impact of High Cardinality Column:
Always avoid partitioning on High Cardinality or Unique value
columns to avoid creating too many partitions.
High Cardinality Column:
Columns with more unique values (less repetition)
Z-ordering can be done on more than one column.
_metadata is a hidden column that can be used to get the source
metadata.
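For example, on file-based sources that support it (Spark 3.2+ and Databricks), the hidden column can be selected explicitly; the path is an assumption and the exact set of fields (file_path, file_name, file_size, file_modification_time, ...) is version-dependent:

    files_df = spark.read.parquet("/tmp/emp_partitioned")
    files_df.select("*", "_metadata").show(truncate=False)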
Configuration:
This configuration is used to control the maximum file size for file
compaction with the OPTIMIZE command.
Selective Z-Ordering with Partition filters:
Only optimizes and Z-orders the data for the specified partition.
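On Delta Lake / Databricks this is expressed in SQL; the table name, partition value, and Z-order columns are illustrative, and the max-file-size setting shown is a Databricks Delta configuration, so treat its name and value as assumptions for your platform and version:

    # Illustrative: cap the compacted file size used by OPTIMIZE (Databricks setting)
    spark.conf.set("spark.databricks.delta.optimize.maxFileSize", 256 * 1024 * 1024)

    # Z-order on more than one column, restricted to one partition via a filter
    spark.sql("OPTIMIZE sales WHERE country = 'IN' ZORDER BY (customer_id, order_date)")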
