Data Skew and Remedies in Spark Programming
What is Skew
“Data Skew” is a Condition in which a Table’s data is “Unevenly Distributed” among
the Partitions in the Cluster. Operations, such as “Join”, perform very slowly on such
Partitions.
Initially, data is typically read in “Partitions” of 128 MB each, and evenly
distributed throughout the “Cluster of Computers”.
However, as the data is “Transformed”, for example “Aggregated”, it is possible to
have significantly more records in one “Partition” than another. This is called “Skew”.
To some degree, a small amount of “Skew” is ignorable, perhaps in the 10% range. But
a large “Skew” can result in “Spill”, or even worse, hard-to-diagnose “Out of
Memory Errors”.
As seen from the above image, after the “Aggregation”, if the data in City D is 2
times larger than the data in City A, City B, or City C -
➢ The data in City D is going to take 2 times as long to process as the data in
City A, City B, or City C, assuming it takes “n” seconds per Record.
➢ The data in City D requires twice as much RAM to process as the data in City
A, City B, or City C, assuming it takes “n” MB per Record.
The ramification of “Skew” -
➢ The entire “Stage” is going to take as long as the “Longest-Running Task”.
• Purely from an Execution-Time Standpoint, if all the four Partitions are
processed simultaneously, the entire Job will take as long as the
“Longest-Running Task”, i.e., the Task processing the data of City D.
➢ The more dire problem would be to not have enough RAM for the “Skewed
Partitions”.
# Count the number of records per "city_id" to visualize any Skew
dfVisualize = (spark.read
    .format("delta")
    .load(transactionPath)
    .groupBy("city_id")
    .count()
    .orderBy("count"))

display(dfVisualize)
The resultant data, in the created DataFrame, can be visually displayed in the following
way -
From the above image, it could be seen that it took almost “2 minutes”
to process the Join operation between the two Tables/Views without
“Skew Hint”.
Spark SQL Query of a Join between two Tables/Views with “Skew Hint” -
%sql
SELECT /*+ SKEW('A') */ A.city_id, B.city,
B.country, A.transacted_at, A.trx_id, A.retailer_id, A.description,
A.amount
FROM global_temp.tblTransaction A
INNER JOIN global_temp.tblCity B
ON A.city_id = B.city_id
From the above image, it could be seen that it took “8.13 minutes” to
process the Join operation between a Table/View and a Subquery
without “Skew Hint”.
From the above image, it could be seen that it took “8.68 minutes” to
process the Join operation between a Table/View and a Subquery with
“Skew Hint”.
In this case, using the “Skew Hint” on a Relation (Table/View/Subquery)
mitigates the Data Skew but introduces a Performance Hit.
➢ Configure Skew Hint with Relation Name and Column Names - There might be
multiple Joins on a “Relation” and only some of the Joins will suffer from Skew.
“Skew Join Optimization” has some overhead. So, it is better to use it only when
needed. For this purpose, the “Skew Hint” accepts Column Names. Only the
Joins with the provided Columns use “Skew Join Optimization”.
✓ Skew on a Single Column - Spark SQL Query of a Join on a Single Column
between two Tables/Views with “Skew Hint” -
%sql
SELECT /*+ SKEW('A', 'city_id') */ A.city_id, B.city,
B.country, A.transacted_at, A.trx_id, A.retailer_id, A.description,
A.amount
FROM global_temp.tblTransaction A
INNER JOIN global_temp.tblCity B
ON A.city_id = B.city_id
➢ Configure Skew Hint with Relation Name, Column Names and Skew Values - It
is possible to specify the Skew Values in the “Skew Hint”. Depending on the Query
and the Data, the Skew Values might be known or might be easy to find out, since
the Skew Values never change. Doing this reduces the overhead of the “Skew Join
Optimization”. Otherwise, “Delta Lake” detects the Skew Values automatically.
✓ Skew on a Single Column and a Single Skew Value - Spark SQL Query of a Join
on a Single Column and a Single Skew Value between two Tables/Views
with “Skew Hint” -
%sql
SELECT /*+ SKEW('A', 'city_id', 28424447) */
A.city_id, B.city, B.country, A.transacted_at, A.trx_id, A.retailer_id,
A.description, A.amount
FROM global_temp.tblTransaction A
INNER JOIN global_temp.tblCity B
ON A.city_id = B.city_id
✓ Skew on Multiple Columns and Multiple Skew Values - Spark SQL Query
of a Join on Multiple Columns and Multiple Skew Values between two
Tables/Views with “Skew Hint” -
%sql
SELECT /*+ SKEW('A', ('city_id', 'city'), ((28424447,
'Albany'), (559832710, 'Jackson'))) */ A.city_id, B.city, B.country,
A.transacted_at, A.trx_id, A.retailer_id, A.description, A.amount
FROM global_temp.tblTransaction A
INNER JOIN global_temp.tblCity B
ON A.city_id = B.city_id
# Disable "Adaptive Query Execution" and its automatic "Skew Join" handling,
# so that the manual "Salting" technique can be demonstrated in isolation
spark.conf.set("spark.sql.adaptive.enabled", False)
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", False)
Step 2 - Cross Join the Salt DataFrame “saltDf” with the Dimension DataFrame to
create the “Salted Dimension DataFrame”. Re-Partitioning can help mitigate the Spills
and evenly re-distribute the new Dimension Table across all Partitions.
In this case “Re-Partition” is used to reduce the Impact of “Cross Join”, because the
“Cross Join” would be an “Expensive Operation” if the “Skew Factor” is high.
Step 3 - For the Fact DataFrame, randomly assign a “Salt” to each record.
sc.setJobDescription("Create Salted Fact DataFrame")

# Join the two Salted DataFrames on "salted_city_id" and execute a "noop" Write
(dfSaltedTransaction
    .join(dfSaltedCity,
          dfSaltedTransaction["salted_city_id"] == dfSaltedCity["salted_city_id"])
    .write
    .format("noop")
    .mode("overwrite")
    .save())