Drop duplicate rows in PySpark DataFrame
Last Updated : 29 Aug, 2022

In this article, we are going to drop duplicate rows using the distinct() and dropDuplicates() functions of a PySpark DataFrame in Python.

Let's create a sample DataFrame:

```python
# importing module
import pyspark

# importing SparkSession from the pyspark.sql module
from pyspark.sql import SparkSession

# creating a SparkSession and giving an app name
spark = SparkSession.builder.appName('sparkdf').getOrCreate()

# list of employee data
data = [["1", "sravan", "company 1"],
        ["2", "ojaswi", "company 1"],
        ["3", "rohith", "company 2"],
        ["4", "sridevi", "company 1"],
        ["1", "sravan", "company 1"],
        ["4", "sridevi", "company 1"]]

# specify column names
columns = ['Employee ID', 'Employee NAME', 'Company']

# creating a dataframe from the lists of data
dataframe = spark.createDataFrame(data, columns)

print('Actual data in dataframe')
dataframe.show()
```

Method 1: distinct()

Distinct data means unique data.
It removes the duplicate rows in the DataFrame.

Syntax: dataframe.distinct()

where dataframe is the DataFrame created from the nested lists above.

```python
print('distinct data after dropping duplicate rows')

# display distinct data
dataframe.distinct().show()
```

We can use the select() function along with distinct() to get distinct values from particular columns.

Syntax: dataframe.select(['column 1', 'column n']).distinct().show()

```python
# display distinct data in the Employee ID and Employee NAME columns
dataframe.select(['Employee ID', 'Employee NAME']).distinct().show()
```

Method 2: dropDuplicates()

Syntax: dataframe.dropDuplicates()

where dataframe is the DataFrame created from the nested lists above.

```python
# remove duplicate data using the dropDuplicates() function
dataframe.dropDuplicates().show()
```

Python program to remove duplicate values in specific columns:

```python
# remove duplicate data using the dropDuplicates() function
# on two columns
dataframe.select(['Employee ID', 'Employee NAME']).dropDuplicates().show()
```

Author: sravankumar_171fa07058