Removing duplicate columns after DataFrame join in PySpark
Last Updated: 29 Dec, 2021
In this article, we will discuss how to remove duplicate columns after a DataFrame join in PySpark.
Create the first dataframe for demonstration:
Python3
# Importing necessary libraries
from pyspark.sql import SparkSession
# Create a spark session
spark = SparkSession.builder.appName('pyspark - example join').getOrCreate()
# Create data in dataframe
data = [('Ram', 1, 'M'),
        ('Mike', 2, 'M'),
        ('Rohini', 3, 'M'),
        ('Maria', 4, 'F'),
        ('Jenis', 5, 'F')]
# Column names in dataframe
columns = ["Name","ID","Gender"]
# Create the spark dataframe
df1 = spark.createDataFrame(data=data, schema = columns)
# Print the dataframe
df1.show()
Output:
+------+---+------+
| Name| ID|Gender|
+------+---+------+
| Ram| 1| M|
| Mike| 2| M|
|Rohini| 3| M|
| Maria| 4| F|
| Jenis| 5| F|
+------+---+------+
Create a second dataframe for demonstration:
Python3
# Create data in dataframe
data2 = [(1, 3000),
         (2, 4000),
         (3, 4000),
         (4, 4000),
         (5, 1200)]
# Column names in dataframe
columns = ["ID","salary"]
# Create the spark dataframe
df2 = spark.createDataFrame(data=data2, schema=columns)
# Print the dataframe
df2.show()
Output:
+---+------+
| ID|salary|
+---+------+
| 1| 3000|
| 2| 4000|
| 3| 4000|
| 4| 4000|
| 5| 1200|
+---+------+
Using join()
This joins the two dataframes on a join condition.
Syntax: dataframe.join(dataframe1, dataframe.column == dataframe1.column).show()
where,
- dataframe is the first dataframe
- dataframe1 is the second dataframe
Let's see the dataframe after join:
Python3
df = df1.join(df2, df1.ID == df2.ID)
df.show()
Output:
+------+---+------+---+------+
| Name| ID|Gender| ID|salary|
+------+---+------+---+------+
| Jenis| 5| F| 5| 1200|
| Ram| 1| M| 1| 3000|
|Rohini| 3| M| 3| 4000|
| Mike| 2| M| 2| 4000|
| Maria| 4| F| 4| 4000|
+------+---+------+---+------+
Here we see that the ID and salary columns from the second dataframe have been added to our existing dataframe, leaving two ID columns.
Now, let's check the columns:
First we select the Gender column, which appears only once, so it works fine.
Python3
df.select('Gender').show()
Output:
+------+
|Gender|
+------+
| F|
| M|
| M|
| M|
| F|
+------+
Now let's check one of the duplicate columns:
Here it produces an error because the ID column appears twice in the joined dataframe.
Python3
df.select('ID').show()
Output:
AnalysisException: Reference 'ID' is ambiguous, could be: ID, ID.
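One common workaround (assuming the two dataframes are distinct, i.e. this is not a self-join) is to reference the ambiguous column through the dataframe it originally came from. The sketch below reuses df, df1 and df2 from the example above.
Python3
# Disambiguate by selecting ID through its source dataframe
df.select(df1.ID).show()
# The other columns are unambiguous, so they can be referenced by name
df.select(df1.ID, 'Name', 'Gender', 'salary').show()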
Removing duplicate columns after join in PySpark
If we want to avoid the duplicate column, we pass the common column name to the join function as a list instead of a join expression. Spark then keeps only a single copy of the join column in the result.
Syntax: dataframe.join(dataframe1, ['column_name']).show()
where,
- dataframe is the first dataframe
- dataframe1 is the second dataframe
- column_name is the common column that exists in both dataframes
Python3
new_df = df1.join(df2, ["ID"])
new_df.show()
Output:
+---+------+------+------+
| ID| Name|Gender|salary|
+---+------+------+------+
| 5| Jenis| F| 1200|
| 1| Ram| M| 3000|
| 3|Rohini| M| 4000|
| 2| Mike| M| 4000|
| 4| Maria| F| 4000|
+---+------+------+------+
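If the join has to stay as an explicit expression (for example, when the key columns have different names in the two dataframes), another common option is to drop one copy of the column right after the join. This is a minimal sketch using the same example dataframes; it is an alternative to the list-based join above, not part of the original example.
Python3
# Join on an explicit condition, then drop df2's copy of the ID column
dropped_df = df1.join(df2, df1.ID == df2.ID).drop(df2.ID)
dropped_df.show()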