Spark Optimization
1. What is Shuffle
2. How a groupBy() works
3. Broadcast Join & Normal Shuffle-Sort-Merge Join
4. Partition Skew
5. Adaptive Query Execution (AQE)
6. Dynamically handling Partition Skew
Broadcast Join & Normal Shuffle-Sort-Merge Join

Broadcast Join:
• Use Case:
  • Suitable when one of the DataFrames involved in the join is small enough to fit entirely in the memory of each executor.
• How it Works (see the sketch after this list):
  • The smaller DataFrame is broadcast to all the worker nodes in the Spark cluster.
  • The larger DataFrame is partitioned, and each partition is joined with the entire smaller DataFrame on each node.
• Advantages:
  • Reduces the amount of data shuffled over the network, improving performance.
  • Efficient for joining small lookup tables with larger datasets.
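A minimal sketch of the pattern described above. The input paths, table names, and the order_status join key are illustrative assumptions; the broadcast() hint and the join call are the standard Spark DataFrame API.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

object BroadcastJoinExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("broadcast-join").getOrCreate()

    // Hypothetical inputs: a large fact table and a small lookup table.
    val orders   = spark.read.parquet("/data/orders")        // large
    val statuses = spark.read.parquet("/data/order_status")  // small lookup

    // broadcast() hints Spark to ship the small side to every executor,
    // so each partition of the large side is joined locally without a shuffle.
    val joined = orders.join(broadcast(statuses), Seq("order_status"))

    joined.explain() // the physical plan should show BroadcastHashJoin
  }
}
```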
Shuffle-Sort-Merge Join (Normal Join):
• Use Case:
  • Appropriate when neither of the DataFrames can fit entirely in the memory of the executors, and a large-scale distributed join is required.
• How it Works (see the sketch after this list):
  • Data is partitioned and shuffled across the cluster based on the join key.
  • Each partition is sorted locally on each node.
  • Finally, a merge operation combines the sorted partitions to produce the final joined result.
• Advantages:
  • Scales to very large inputs, since neither side has to fit in executor memory.
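A minimal sketch of forcing the same kind of join to run as a shuffle-sort-merge join. The table paths and the customer_id key are illustrative assumptions; disabling spark.sql.autoBroadcastJoinThreshold is one common way to keep Spark from choosing a broadcast join for a demonstration.

```scala
import org.apache.spark.sql.SparkSession

object SortMergeJoinExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("sort-merge-join").getOrCreate()

    // Turning off automatic broadcasting forces the shuffle-sort-merge path,
    // which is what Spark falls back to when both inputs are large.
    spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")

    val orders    = spark.read.parquet("/data/orders")     // hypothetical path
    val customers = spark.read.parquet("/data/customers")  // hypothetical path

    // Both sides are shuffled on customer_id, sorted within each partition,
    // then merged partition by partition to produce the joined result.
    val joined = orders.join(customers, Seq("customer_id"))

    joined.explain() // the plan should show SortMergeJoin with Exchange + Sort on both sides
  }
}
```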
Partition Skew
Partition skew occurs when one partition holds significantly more data than the rest of the partitions.
Even though a wide transformation creates 200 shuffle partitions by default (spark.sql.shuffle.partitions), all records with order_status = 'COMPLETE' are still moved to a single partition, as illustrated in the sketch below.
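One way to make the skew visible is to count rows per shuffle partition. The sketch below assumes the same hypothetical orders table; repartitioning on order_status mimics the shuffle a wide transformation would perform, and spark_partition_id() exposes which partition each row landed in.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, spark_partition_id}

object SkewInspection {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("skew-inspection").getOrCreate()

    val orders = spark.read.parquet("/data/orders") // hypothetical path

    // Shuffle on the grouping key the same way a wide transformation would,
    // then count how many rows land in each of the 200 shuffle partitions.
    val rowsPerPartition = orders
      .repartition(200, col("order_status"))
      .groupBy(spark_partition_id().alias("partition_id"))
      .count()
      .orderBy(col("count").desc)

    // If most orders are 'COMPLETE', one partition holds far more rows than
    // the others and the task processing it becomes the straggler of the stage.
    rowsPerPartition.show(10)
  }
}
```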
Adaptive Query Execution (AQE)
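Only the heading survives from this slide, so here is a minimal configuration sketch of how AQE can handle partition skew dynamically at runtime (the last agenda item). The option names are the standard Spark 3.x AQE settings; the specific factor and threshold values shown are illustrative assumptions, not recommendations.

```scala
import org.apache.spark.sql.SparkSession

object AqeSkewHandling {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("aqe-skew-handling")
      // Enable Adaptive Query Execution (on by default since Spark 3.2).
      .config("spark.sql.adaptive.enabled", "true")
      // Let AQE split oversized shuffle partitions during sort-merge joins.
      .config("spark.sql.adaptive.skewJoin.enabled", "true")
      // A partition is treated as skewed if it is 5x the median partition size ...
      .config("spark.sql.adaptive.skewJoin.skewedPartitionFactor", "5")
      // ... and also larger than this absolute size threshold.
      .config("spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes", "256m")
      // Additionally let AQE coalesce many small post-shuffle partitions.
      .config("spark.sql.adaptive.coalescePartitions.enabled", "true")
      .getOrCreate()

    // With these settings, the skewed 'COMPLETE' partition from the previous
    // example would be split into several tasks at runtime instead of being
    // processed by a single straggler task.
  }
}
```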