Spark Otp

Uploaded by

NIKHIL RANJAN

The following topics will be covered.

1. What is Shuffle
2. How groupBy() works
3. Broadcast Join & Normal Shuffle-Sort-Merge Join
4. Partition Skew
5. Adaptive Query Execution (AQE)
6. Dynamically handling Partition Skew
7. Optimization of joining 2 large tables: Bucketing
8. Memory Management in Apache Spark
9. Sort Aggregate vs Hash Aggregate

1 © 2023 Nokia Nokia internal use


How groupBy() works
• When a wide transformation is triggered, Spark creates 200 shuffle partitions by default (the default value of spark.sql.shuffle.partitions).
• Since order_status has only 9 unique keys, and all rows with the same key go to the same partition, at most 9 partitions will hold data and at least 191 partitions will remain empty (as shown in the diagram below).
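
The partition count above can be checked with a small simulation. This is only a sketch in plain Python, not Spark code: zlib.crc32 stands in for Spark's internal hash function, and every status value other than 'COMPLETE' is a hypothetical placeholder chosen to make up 9 keys.

```python
import zlib

NUM_SHUFFLE_PARTITIONS = 200  # default value of spark.sql.shuffle.partitions

# Hypothetical order_status values; only the count (9) and 'COMPLETE' come from the slides.
ORDER_STATUSES = [
    "COMPLETE", "CLOSED", "PENDING", "PENDING_PAYMENT", "PROCESSING",
    "ON_HOLD", "CANCELED", "PAYMENT_REVIEW", "SUSPECTED_FRAUD",
]

def partition_for(key: str, num_partitions: int = NUM_SHUFFLE_PARTITIONS) -> int:
    """Stand-in hash partitioner: hash(key) mod num_partitions."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions

# Every row with the same key lands in the same partition,
# so the number of non-empty partitions cannot exceed the number of keys.
used_partitions = {partition_for(status) for status in ORDER_STATUSES}
non_empty = len(used_partitions)
empty = NUM_SHUFFLE_PARTITIONS - non_empty
print(f"non-empty partitions: {non_empty}, empty partitions: {empty}")
```

If two keys happen to hash to the same partition, fewer than 9 partitions hold data, which is why the bound is "at most 9".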

Broadcast Join & Normal Shuffle-Sort-Merge Join
• Use Case:
• Suitable when one of the DataFrames involved in the join is small enough to fit entirely in the memory of each executor.
• How it Works:
• The smaller DataFrame is broadcast to all the worker nodes in the Spark cluster.
• The larger DataFrame stays partitioned, and each of its partitions is joined locally with the full copy of the smaller DataFrame on each node.
• Advantages:
• Avoids shuffling the larger DataFrame over the network, which improves performance.
• Efficient for joining small lookup tables with larger datasets.
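
The broadcast mechanism described above can be sketched in plain Python. This is an illustration of the idea, not Spark's implementation: the table contents, the column layout, and the function name broadcast_hash_join are all hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical small lookup table: status_id -> status name.
small_table: Dict[int, str] = {1: "COMPLETE", 2: "PENDING", 3: "CLOSED"}

# Hypothetical large table, already split into partitions of (order_id, status_id).
large_partitions: List[List[Tuple[int, int]]] = [
    [(101, 1), (102, 2)],
    [(103, 3), (104, 1), (105, 9)],  # status_id 9 has no match in the lookup table
]

def broadcast_hash_join(
    partition: List[Tuple[int, int]], broadcast_copy: Dict[int, str]
) -> List[Tuple[int, int, str]]:
    """Inner-join one partition of the large side against the full small table."""
    return [
        (order_id, status_id, broadcast_copy[status_id])
        for order_id, status_id in partition
        if status_id in broadcast_copy
    ]

# Each partition gets its own copy of the small table; the large rows never move.
joined = [
    row
    for partition in large_partitions
    for row in broadcast_hash_join(partition, dict(small_table))
]
print(joined)
```

Only the small table is copied, so the cost of the join grows with the size of the lookup table times the number of executors rather than with the size of the large table.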

Shuffle-Sort-Merge Join (Normal Join):
• Use Case:
• Appropriate when neither of the DataFrames can fit entirely in the memory of the executors, and a large-scale distributed join is required.
• How it Works:
• Both DataFrames are partitioned and shuffled across the cluster based on the join key, so rows with the same key end up in the same partition.
• Each partition is sorted locally by the join key on its node.
• Finally, a merge operation walks the matching sorted partitions from both sides to produce the joined result.
• Advantages:
• Scales well for large datasets that cannot fit in memory.
• Handles distributed joins efficiently.
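
The shuffle, sort, and merge steps above can be sketched as follows. This is a simplified single-process illustration in plain Python, not Spark's implementation: the partition count, the sample rows, and all function names are hypothetical, and key % num_partitions stands in for Spark's hash partitioning.

```python
from typing import List, Tuple

Row = Tuple[int, str]
NUM_PARTITIONS = 4  # hypothetical; kept small for readability

def shuffle(rows: List[Row], num_partitions: int = NUM_PARTITIONS) -> List[List[Row]]:
    """Step 1: send each (key, value) row to partition key % num_partitions."""
    parts: List[List[Row]] = [[] for _ in range(num_partitions)]
    for key, value in rows:
        parts[key % num_partitions].append((key, value))
    return parts

def merge_join(left: List[Row], right: List[Row]) -> List[Tuple[int, str, str]]:
    """Step 3: inner-join two lists of rows that are already sorted by key."""
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][0], right[j][0]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Find the run of equal keys on the right side.
            j_end = j
            while j_end < len(right) and right[j_end][0] == lk:
                j_end += 1
            # Pair every left row with this key against the whole right run.
            while i < len(left) and left[i][0] == lk:
                out.extend((lk, left[i][1], right[r][1]) for r in range(j, j_end))
                i += 1
            j = j_end
    return out

def sort_merge_join(left_rows: List[Row], right_rows: List[Row]):
    result = []
    for left_part, right_part in zip(shuffle(left_rows), shuffle(right_rows)):
        # Step 2: sort each partition locally, then merge the matching pair.
        result.extend(merge_join(sorted(left_part), sorted(right_part)))
    return result

orders = [(1, "o-a"), (2, "o-b"), (1, "o-c"), (5, "o-d")]
customers = [(1, "alice"), (2, "bob"), (3, "carol")]
joined = sort_merge_join(orders, customers)
print(sorted(joined))
```

Because both sides use the same partitioning function, rows with equal keys always meet in the same partition pair, and no partition ever needs the full dataset in memory.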

Partition Skew
Partition skew occurs when one partition holds significantly more data than the rest of the partitions.

Even though 200 partitions are created after shuffling in a wide transformation, all the records with order_status ‘COMPLETE’ will still be moved to one single partition, because rows with the same key always go to the same partition.
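
A rough simulation makes the effect of skew visible. This is a plain Python sketch under stated assumptions: the row counts per status are invented for illustration, and zlib.crc32 stands in for Spark's internal hash function.

```python
import zlib
from collections import Counter

NUM_SHUFFLE_PARTITIONS = 200  # default value of spark.sql.shuffle.partitions

# Hypothetical row counts per order_status; 'COMPLETE' dominates the dataset.
rows_per_status = {
    "COMPLETE": 900_000,
    "PENDING": 40_000,
    "CLOSED": 35_000,
    "CANCELED": 25_000,
}

def partition_for(key: str) -> int:
    """Stand-in hash partitioner: hash(key) mod number of shuffle partitions."""
    return zlib.crc32(key.encode("utf-8")) % NUM_SHUFFLE_PARTITIONS

# All rows for one key go to one partition, however many partitions exist.
partition_sizes: Counter = Counter()
for status, count in rows_per_status.items():
    partition_sizes[partition_for(status)] += count

largest = max(partition_sizes.values())
total = sum(partition_sizes.values())
print(f"largest partition holds {largest / total:.0%} of all rows")
```

The task that processes the largest partition determines how long the whole stage takes, so adding more shuffle partitions alone does not relieve this kind of skew.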

Adaptive Query Execution (AQE)
