Spark Optimization
1. What is Shuffle
2. How a groupBy() works
3. Broadcast Join & Normal Shuffle-Sort-Merge Join
4. Partition Skew
5. Adaptive Query Execution (AQE)
6. Dynamically handling Partition Skew
Broadcast Join & Normal Shuffle-Sort-Merge Join

Broadcast Join:
• Use Case:
  • Suitable when one of the DataFrames involved in the join is small enough to fit entirely in the memory of each executor.
• How it Works (see the sketch after this list):
  • The smaller DataFrame is broadcast to all the worker nodes in the Spark cluster.
  • The larger DataFrame is partitioned, and each partition is joined with the entire smaller DataFrame on each node.
• Advantages:
  • Reduces the amount of data shuffled over the network, improving performance.
  • Efficient for joining small lookup tables with larger datasets.
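A minimal sketch of the pattern described above. The input paths, table names, and the order_status join key are illustrative assumptions; the broadcast() hint and the join call are the standard Spark DataFrame API.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

object BroadcastJoinExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("broadcast-join").getOrCreate()

    // Hypothetical inputs: a large fact table and a small lookup table.
    val orders   = spark.read.parquet("/data/orders")        // large
    val statuses = spark.read.parquet("/data/order_status")  // small lookup

    // broadcast() hints Spark to ship the small side to every executor,
    // so each partition of the large side is joined locally without a shuffle.
    val joined = orders.join(broadcast(statuses), Seq("order_status"))

    joined.explain() // the physical plan should show BroadcastHashJoin
  }
}
```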
Shuffle-Sort-Merge Join (Normal Join):
• Use Case:
  • Appropriate when neither of the DataFrames can fit entirely in the memory of the executors, and a large-scale distributed join is required.
• How it Works (see the sketch after this list):
  • Data is partitioned and shuffled across the cluster based on the join key.
  • Each partition is sorted locally on each node.
  • Finally, a merge operation combines the sorted partitions to produce the final joined result.
• Advantages:
  • Scales to very large inputs, since neither side has to fit in executor memory.
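A minimal sketch of forcing the same kind of join to run as a shuffle-sort-merge join. The table paths and the customer_id key are illustrative assumptions; disabling spark.sql.autoBroadcastJoinThreshold is one common way to keep Spark from choosing a broadcast join for a demonstration.

```scala
import org.apache.spark.sql.SparkSession

object SortMergeJoinExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("sort-merge-join").getOrCreate()

    // Turning off automatic broadcasting forces the shuffle-sort-merge path,
    // which is what Spark falls back to when both inputs are large.
    spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")

    val orders    = spark.read.parquet("/data/orders")     // hypothetical path
    val customers = spark.read.parquet("/data/customers")  // hypothetical path

    // Both sides are shuffled on customer_id, sorted within each partition,
    // then merged partition by partition to produce the joined result.
    val joined = orders.join(customers, Seq("customer_id"))

    joined.explain() // the plan should show SortMergeJoin with Exchange + Sort on both sides
  }
}
```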
Partition Skew
Partition skew occurs when one partition holds significantly more data than the rest of the partitions.
Even though a wide transformation creates 200 shuffle partitions by default (spark.sql.shuffle.partitions), all records with order_status = 'COMPLETE' are still moved to a single partition, as illustrated in the sketch below.
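One way to make the skew visible is to count rows per shuffle partition. The sketch below assumes the same hypothetical orders table; repartitioning on order_status mimics the shuffle a wide transformation would perform, and spark_partition_id() exposes which partition each row landed in.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, spark_partition_id}

object SkewInspection {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("skew-inspection").getOrCreate()

    val orders = spark.read.parquet("/data/orders") // hypothetical path

    // Shuffle on the grouping key the same way a wide transformation would,
    // then count how many rows land in each of the 200 shuffle partitions.
    val rowsPerPartition = orders
      .repartition(200, col("order_status"))
      .groupBy(spark_partition_id().alias("partition_id"))
      .count()
      .orderBy(col("count").desc)

    // If most orders are 'COMPLETE', one partition holds far more rows than
    // the others and the task processing it becomes the straggler of the stage.
    rowsPerPartition.show(10)
  }
}
```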
Adaptive Query Execution (AQE)
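Only the heading survives from this slide, so here is a minimal configuration sketch of how AQE can handle partition skew dynamically at runtime (the last agenda item). The option names are the standard Spark 3.x AQE settings; the specific factor and threshold values shown are illustrative assumptions, not recommendations.

```scala
import org.apache.spark.sql.SparkSession

object AqeSkewHandling {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("aqe-skew-handling")
      // Enable Adaptive Query Execution (on by default since Spark 3.2).
      .config("spark.sql.adaptive.enabled", "true")
      // Let AQE split oversized shuffle partitions during sort-merge joins.
      .config("spark.sql.adaptive.skewJoin.enabled", "true")
      // A partition is treated as skewed if it is 5x the median partition size ...
      .config("spark.sql.adaptive.skewJoin.skewedPartitionFactor", "5")
      // ... and also larger than this absolute size threshold.
      .config("spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes", "256m")
      // Additionally let AQE coalesce many small post-shuffle partitions.
      .config("spark.sql.adaptive.coalescePartitions.enabled", "true")
      .getOrCreate()

    // With these settings, the skewed 'COMPLETE' partition from the previous
    // example would be split into several tasks at runtime instead of being
    // processed by a single straggler task.
  }
}
```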