Merge two DataFrames with different numbers of columns in PySpark
Last Updated: 21 Dec, 2021
In this article, we will discuss how to perform a union on two dataframes with different numbers of columns in PySpark in Python.
Let's consider the first dataframe.
It has three columns, named ID, NAME, and Address.
Python3
# importing module
import pyspark
# import when and lit function
from pyspark.sql.functions import when, lit
# importing sparksession from pyspark.sql module
from pyspark.sql import SparkSession
# creating sparksession and giving an app name
spark = SparkSession.builder.appName('sparkdf').getOrCreate()
# list of employee data
data = [["1", "sravan", "kakumanu"],
        ["2", "ojaswi", "hyd"],
        ["3", "rohith", "delhi"],
        ["4", "sridevi", "kakumanu"],
        ["5", "bobby", "guntur"]]
# specify column names
columns = ['ID', 'NAME', 'Address']
# creating a dataframe from the lists of data
dataframe1 = spark.createDataFrame(data, columns)
# display
dataframe1.show()
Output:

Let's consider the second dataframe.
Here we create a dataframe with two columns, ID and Age.
Python3
# importing module
import pyspark
# import when and lit function
from pyspark.sql.functions import when, lit
# importing sparksession from pyspark.sql module
from pyspark.sql import SparkSession
# creating sparksession and giving an app name
spark = SparkSession.builder.appName('sparkdf').getOrCreate()
# list of employee data
data = [["1", 23],
        ["2", 21],
        ["3", 32]]
# specify column names
columns = ['ID', 'Age']
# creating a dataframe from the lists of data
dataframe2 = spark.createDataFrame(data, columns)
# display
dataframe2.show()
Output:
We cannot perform a union yet because the column sets differ, so we first have to add the missing columns. The first dataframe (dataframe1) has the columns ['ID', 'NAME', 'Address'], and the second dataframe (dataframe2) has the columns ['ID', 'Age'].
Now we have to add the Age column to the first dataframe, and the NAME and Address columns to the second. We can do this with the lit() function, available in pyspark.sql.functions, which adds a column filled with a constant value. Here we add the columns with the value None.
Syntax:
for column in [column for column in dataframe1.columns if column not in dataframe2.columns]:
  dataframe2 = dataframe2.withColumn(column, lit(None))
where,
- dataframe1 is the first dataframe
- dataframe2 is the second dataframe
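The loops above only need the two column lists, so the selection logic can be sketched in plain Python, independent of Spark (the helper name missing_columns is ours, not part of PySpark):

```python
def missing_columns(target_cols, source_cols):
    """Return the columns present in target_cols but absent from source_cols,
    preserving the order in which they appear in target_cols."""
    return [c for c in target_cols if c not in source_cols]

# Column lists from the two example dataframes
cols1 = ['ID', 'NAME', 'Address']
cols2 = ['ID', 'Age']

print(missing_columns(cols1, cols2))  # ['NAME', 'Address'] -> add to dataframe2
print(missing_columns(cols2, cols1))  # ['Age'] -> add to dataframe1
```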
Example: Add missing columns to both the dataframes
Python3
# importing module
import pyspark
# import lit function
from pyspark.sql.functions import lit
# importing sparksession from pyspark.sql module
from pyspark.sql import SparkSession
# creating sparksession and giving an app name
spark = SparkSession.builder.appName('sparkdf').getOrCreate()
# list of employee data
data = [["1", "sravan", "kakumanu"],
        ["2", "ojaswi", "hyd"],
        ["3", "rohith", "delhi"],
        ["4", "sridevi", "kakumanu"],
        ["5", "bobby", "guntur"]]
# specify column names
columns = ['ID', 'NAME', 'Address']
# creating a dataframe from the lists of data
dataframe1 = spark.createDataFrame(data, columns)
# list of employee data
data = [["1", 23],
        ["2", 21],
        ["3", 32]]
# specify column names
columns = ['ID', 'Age']
# creating a dataframe from the lists of data
dataframe2 = spark.createDataFrame(data, columns)
# add columns in dataframe1 that are missing from dataframe2
for column in [column for column in dataframe2.columns
               if column not in dataframe1.columns]:
    dataframe1 = dataframe1.withColumn(column, lit(None))
# add columns in dataframe2 that are missing from dataframe1
for column in [column for column in dataframe1.columns
               if column not in dataframe2.columns]:
    dataframe2 = dataframe2.withColumn(column, lit(None))
# now see the columns of dataframe1
print(dataframe1.columns)
# now see the columns of dataframe2
print(dataframe2.columns)
Output:
['ID', 'NAME', 'Address', 'Age']
['ID', 'Age', 'NAME', 'Address']
Example 1: Using union()
Now we can combine the two dataframes with the union() function, which appends the rows of the second dataframe to the first. Note that union() resolves columns by position, not by name, so the second dataframe's columns must first be put in the same order as the first's.
Syntax: dataframe1.union(dataframe2)
Example:
Python3
# importing module
import pyspark
# import lit function
from pyspark.sql.functions import lit
# importing sparksession from pyspark.sql module
from pyspark.sql import SparkSession
# creating sparksession and giving an app name
spark = SparkSession.builder.appName('sparkdf').getOrCreate()
# list of employee data
data = [["1", "sravan", "kakumanu"],
        ["2", "ojaswi", "hyd"],
        ["3", "rohith", "delhi"],
        ["4", "sridevi", "kakumanu"],
        ["5", "bobby", "guntur"]]
# specify column names
columns = ['ID', 'NAME', 'Address']
# creating a dataframe from the lists of data
dataframe1 = spark.createDataFrame(data, columns)
# list of employee data
data = [["1", 23],
        ["2", 21],
        ["3", 32]]
# specify column names
columns = ['ID', 'Age']
# creating a dataframe from the lists of data
dataframe2 = spark.createDataFrame(data, columns)
# add columns in dataframe1 that are missing from dataframe2
for column in [column for column in dataframe2.columns
               if column not in dataframe1.columns]:
    dataframe1 = dataframe1.withColumn(column, lit(None))
# add columns in dataframe2 that are missing from dataframe1
for column in [column for column in dataframe1.columns
               if column not in dataframe2.columns]:
    dataframe2 = dataframe2.withColumn(column, lit(None))
# perform union; select() reorders dataframe2's columns to match
# dataframe1, since union() resolves columns by position
dataframe1.union(dataframe2.select(dataframe1.columns)).show()
Output:

Example 2: Using unionAll()
Since Spark 2.0, unionAll() is simply an alias for union() and behaves identically, matching columns by position.
Syntax: dataframe1.unionAll(dataframe2)
Python3
# importing module
import pyspark
# import lit function
from pyspark.sql.functions import lit
# importing sparksession from pyspark.sql module
from pyspark.sql import SparkSession
# creating sparksession and giving an app name
spark = SparkSession.builder.appName('sparkdf').getOrCreate()
# list of employee data
data = [["1", "sravan", "kakumanu"],
        ["2", "ojaswi", "hyd"],
        ["3", "rohith", "delhi"],
        ["4", "sridevi", "kakumanu"],
        ["5", "bobby", "guntur"]]
# specify column names
columns = ['ID', 'NAME', 'Address']
# creating a dataframe from the lists of data
dataframe1 = spark.createDataFrame(data, columns)
# list of employee data
data = [["1", 23],
        ["2", 21],
        ["3", 32]]
# specify column names
columns = ['ID', 'Age']
# creating a dataframe from the lists of data
dataframe2 = spark.createDataFrame(data, columns)
# add columns in dataframe1 that are missing from dataframe2
for column in [column for column in dataframe2.columns
               if column not in dataframe1.columns]:
    dataframe1 = dataframe1.withColumn(column, lit(None))
# add columns in dataframe2 that are missing from dataframe1
for column in [column for column in dataframe1.columns
               if column not in dataframe2.columns]:
    dataframe2 = dataframe2.withColumn(column, lit(None))
# perform unionAll; like union(), it resolves columns by position,
# so select() reorders dataframe2's columns to match dataframe1
dataframe1.unionAll(dataframe2.select(dataframe1.columns)).show()
Output:

Example 3: Using unionByName
We can also use unionByName(), which matches columns by name rather than by position, so the differing column order of the two dataframes is handled automatically.
Syntax: dataframe1.unionByName(dataframe2)
Example:
Python3
# importing module
import pyspark
# import lit function
from pyspark.sql.functions import lit
# importing sparksession from pyspark.sql module
from pyspark.sql import SparkSession
# creating sparksession and giving an app name
spark = SparkSession.builder.appName('sparkdf').getOrCreate()
# list of employee data
data = [["1", "sravan", "kakumanu"],
        ["2", "ojaswi", "hyd"],
        ["3", "rohith", "delhi"],
        ["4", "sridevi", "kakumanu"],
        ["5", "bobby", "guntur"]]
# specify column names
columns = ['ID', 'NAME', 'Address']
# creating a dataframe from the lists of data
dataframe1 = spark.createDataFrame(data, columns)
# list of employee data
data = [["1", 23],
        ["2", 21],
        ["3", 32]]
# specify column names
columns = ['ID', 'Age']
# creating a dataframe from the lists of data
dataframe2 = spark.createDataFrame(data, columns)
# add columns in dataframe1 that are missing from dataframe2
for column in [column for column in dataframe2.columns
               if column not in dataframe1.columns]:
    dataframe1 = dataframe1.withColumn(column, lit(None))
# add columns in dataframe2 that are missing from dataframe1
for column in [column for column in dataframe1.columns
               if column not in dataframe2.columns]:
    dataframe2 = dataframe2.withColumn(column, lit(None))
# perform unionByName
dataframe1.unionByName(dataframe2).show()
Output: