How to join on multiple columns in PySpark?

Last Updated: 19 Dec, 2021
Author: sravankumar_171fa07058

In this article, we will discuss how to join a PySpark DataFrame on multiple columns using Python.

Let's create the first dataframe:

```python
# importing module
import pyspark

# importing SparkSession from the pyspark.sql module
from pyspark.sql import SparkSession

# creating a SparkSession and giving it an app name
spark = SparkSession.builder.appName('sparkdf').getOrCreate()

# list of employee data
data = [(1, "sravan"), (2, "ojsawi"), (3, "bobby")]

# specify column names
columns = ['ID1', 'NAME1']

# creating a dataframe from the list of data
dataframe = spark.createDataFrame(data, columns)
dataframe.show()
```

Output:

Let's create the second dataframe:

```python
# importing module
import pyspark

# importing SparkSession from the pyspark.sql module
from pyspark.sql import SparkSession

# creating a SparkSession and giving it an app name
spark = SparkSession.builder.appName('sparkdf').getOrCreate()

# list of employee data
data = [(1, "sravan"), (2, "ojsawi"), (3, "bobby"),
        (4, "rohith"), (5, "gnanesh")]

# specify column names
columns = ['ID2', 'NAME2']

# creating a dataframe from the list of data
dataframe1 = spark.createDataFrame(data, columns)
dataframe1.show()
```

Output:

We can join on multiple columns by passing the join() function a condition that combines several column comparisons with conditional operators.

Syntax:

dataframe.join(dataframe1, (dataframe.column1 == dataframe1.column1) & (dataframe.column2 == dataframe1.column2))

where:

- dataframe is the first dataframe
- dataframe1 is the second dataframe
- column1 is the first matching column in both dataframes
- column2 is the second matching column in both dataframes

Example 1: PySpark code to join the two dataframes on multiple columns (id and name)

```python
# importing module
import pyspark

# importing SparkSession from the pyspark.sql module
from pyspark.sql import SparkSession

# creating a SparkSession and giving it an app name
spark = SparkSession.builder.appName('sparkdf').getOrCreate()

# first dataframe of employee data
data = [(1, "sravan"), (2, "ojsawi"), (3, "bobby")]
columns = ['ID1', 'NAME1']
dataframe = spark.createDataFrame(data, columns)

# second dataframe of employee data
data = [(1, "sravan"), (2, "ojsawi"), (3, "bobby"),
        (4, "rohith"), (5, "gnanesh")]
columns = ['ID2', 'NAME2']
dataframe1 = spark.createDataFrame(data, columns)

# join rows where both the ID and name columns match
dataframe.join(dataframe1,
               (dataframe.ID1 == dataframe1.ID2) &
               (dataframe.NAME1 == dataframe1.NAME2)).show()
```

Output:

Example 2: Join with the | (or) operator

```python
# importing module
import pyspark

# importing SparkSession from the pyspark.sql module
from pyspark.sql import SparkSession

# creating a SparkSession and giving it an app name
spark = SparkSession.builder.appName('sparkdf').getOrCreate()

# first dataframe of employee data
data = [(1, "sravan"), (2, "ojsawi"), (3, "bobby")]
columns = ['ID1', 'NAME1']
dataframe = spark.createDataFrame(data, columns)

# second dataframe of employee data
data = [(1, "sravan"), (2, "ojsawi"), (3, "bobby"),
        (4, "rohith"), (5, "gnanesh")]
columns = ['ID2', 'NAME2']
dataframe1 = spark.createDataFrame(data, columns)

# join rows where either the ID columns or the name columns match
dataframe.join(dataframe1,
               (dataframe.ID1 == dataframe1.ID2) |
               (dataframe.NAME1 == dataframe1.NAME2)).show()
```

Output: