How to find distinct values of multiple columns in PySpark?
Last Updated: 04 Jul, 2021
In this article, we will discuss how to find the distinct values of multiple columns in a PySpark DataFrame.
Let's create a sample dataframe for demonstration:
Python3
# importing module
import pyspark
# importing sparksession from pyspark.sql module
from pyspark.sql import SparkSession
# creating sparksession and giving an app name
spark = SparkSession.builder.appName('sparkdf').getOrCreate()
# list of employee data
data = [["1", "Tezas", "Google"],
        ["2", "Mohit Rawat", "Rakuten"],
        ["3", "rohith", "Geeksforgeeks"],
        ["4", "Nancy", "IBM"],
        ["1", "Raghav", "Wipro"],
        ["4", "Komal", "Amazon"]]
# specify column names
columns = ['ID', 'NAME', 'Company']
# creating a dataframe from the lists of data
dataframe = spark.createDataFrame(data, columns)
dataframe.show()
Method 1: Using distinct() method
The distinct() method returns a new DataFrame with duplicate rows removed. It takes no arguments and always compares entire rows, so to work on specific columns you first pick them with select().
Syntax: df.distinct()
Example 1: Get the distinct rows of the whole DataFrame.
No two rows of the sample data match in every column, so all six rows are returned.
Python3
dataframe.distinct().show()
Example 2: Get the distinct values of a single column.
Select the column by name with select(), then call distinct() on the result.
Python3
dataframe.select('NAME').distinct().show()
Example 3: Get the distinct values of multiple columns.
Pass several column names to select(), then call distinct(). The result holds each unique combination of values in those columns.
Python3
dataframe.select('ID', 'NAME').distinct().show()
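Method 2: Using dropDuplicates() method
The dropDuplicates() method removes rows that have the same values in the selected columns. Called without arguments, it considers all columns and behaves like distinct().
Syntax: df.dropDuplicates(subset), where subset is an optional list of column names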
Example 1: Get the distinct rows of the whole DataFrame.
Python3
dataframe.dropDuplicates().show()
Example 2: Get the distinct values of a single column.
Select the column by name with select(), then call dropDuplicates() on the result.
Python3
dataframe.select("NAME").dropDuplicates().show()
Example 3: Get the distinct values of multiple columns.
Pass the column names as a list to dropDuplicates(), then use select() to keep only those columns.
Python3
dataframe.dropDuplicates(["NAME","ID"]).select(["ID","NAME"]).show()