Drop duplicate rows in PySpark DataFrame
Last Updated : 29 Aug, 2022
Author: sravankumar_171fa07058
Article Tags : Python, Python-Pyspark

In this article, we are going to drop duplicate rows from a PySpark DataFrame in Python by using the distinct() and dropDuplicates() functions.

Let's create a sample DataFrame.

Python3
    # import SparkSession from the pyspark.sql module
    from pyspark.sql import SparkSession

    # create a SparkSession and give it an app name
    spark = SparkSession.builder.appName('sparkdf').getOrCreate()

    # list of employee data (rows 1 and 4 appear twice)
    data = [["1", "sravan", "company 1"],
            ["2", "ojaswi", "company 1"],
            ["3", "rohith", "company 2"],
            ["4", "sridevi", "company 1"],
            ["1", "sravan", "company 1"],
            ["4", "sridevi", "company 1"]]

    # specify column names
    columns = ['Employee ID', 'Employee NAME', 'Company']

    # create a DataFrame from the list of rows
    dataframe = spark.createDataFrame(data, columns)

    print('Actual data in dataframe')
    dataframe.show()

Output: all six rows are displayed, including the repeated rows for employees 1 and 4.

Method 1: distinct()

Distinct data means unique data. distinct() returns a new DataFrame in which rows that are identical across all columns appear only once.

Syntax: dataframe.distinct()

where dataframe is the DataFrame created above from the nested lists.

Python3
    print('distinct data after dropping duplicate rows')

    # display distinct rows
    dataframe.distinct().show()

Output: four unique rows remain (the row order may vary).

We can use the select() function together with distinct() to get the distinct values of particular columns.

Syntax: dataframe.select(['column 1', 'column n']).distinct().show()

Python3
    # display distinct combinations of Employee ID and Employee NAME
    dataframe.select(['Employee ID', 'Employee NAME']).distinct().show()

Output: the four unique (Employee ID, Employee NAME) pairs; the Company column is not included.

Method 2: dropDuplicates()

Called without arguments, dropDuplicates() removes rows that are duplicated across all columns, so it gives the same result as distinct().

Syntax: dataframe.dropDuplicates()

where dataframe is the DataFrame created above from the nested lists.

Python3
    # remove duplicate rows using dropDuplicates()
    dataframe.dropDuplicates().show()

Output: the same four unique rows as with distinct().

Python program to remove duplicate values in specific columns

Python3
    # select two columns, then remove duplicate rows
    # from the result using dropDuplicates()
    dataframe.select(['Employee ID', 'Employee NAME']).dropDuplicates().show()

Output: the four unique (Employee ID, Employee NAME) pairs; only the selected columns are returned.