Removing duplicate rows based on specific column in PySpark DataFrame

Last Updated : 06 Jun, 2021

In this article, we are going to drop duplicate rows based on a specific column from a dataframe using pyspark in Python. Here, duplicate data means rows that share the same values in some condition columns. For this, we use the dropDuplicates() method:

Syntax: dataframe.dropDuplicates(['column 1', 'column 2', 'column n']).show()

where,
dataframe is the input dataframe and the column names are the specific columns checked for duplicates
show() method is used to display the dataframe

Let's create the dataframe.

Python3

# importing module
import pyspark

# importing sparksession from pyspark.sql module
from pyspark.sql import SparkSession

# creating sparksession and giving an app name
spark = SparkSession.builder.appName('sparkdf').getOrCreate()

# list of students data
data = [["1", "sravan", "vignan"],
        ["2", "ojaswi", "vvit"],
        ["3", "rohith", "vvit"],
        ["4", "sridevi", "vignan"],
        ["1", "sravan", "vignan"],
        ["5", "gnanesh", "iit"]]

# specify column names
columns = ['student ID', 'student NAME', 'college']

# creating a dataframe from the lists of data
dataframe = spark.createDataFrame(data, columns)

print('Actual data in dataframe')
dataframe.show()

Output: [the full DataFrame of six rows, including the repeated row for student ID 1]

Dropping based on one column

Python3

# remove duplicate rows based on the college column
dataframe.dropDuplicates(['college']).show()

Output: [three rows, one per college; dropDuplicates() keeps an arbitrary row from each group, so which student appears for a given college may vary]

Dropping based on multiple columns

Python3

# remove duplicate rows based on the college and student ID columns
dataframe.dropDuplicates(['college', 'student ID']).show()

Output: [five rows; only the exact repeat of student ID 1 at vignan is removed]
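As a quick sanity check, you can compare row counts before and after deduplication. This is a minimal sketch, not part of the original article, and it reuses the dataframe and columns created above:

Python3

# sanity check (sketch): count rows before and after dropping
# duplicates on the college and student ID columns
print(dataframe.count())                                           # 6 rows in the sample data
print(dataframe.dropDuplicates(['college', 'student ID']).count())  # 5 rows once the exact repeat is gone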
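One caveat worth knowing: dropDuplicates() keeps an arbitrary row from each group of duplicates. If you need to control which duplicate survives, a common pattern is to rank rows inside each group with a window function and keep only the first-ranked row. The sketch below illustrates this; it is not from the original article, and the tie-breaking order (lowest student ID here) is an assumption you would adapt to your data:

Python3

# sketch: keep one deterministic row per college by ranking rows
# within each college and filtering to the first-ranked row
from pyspark.sql import Window
from pyspark.sql.functions import col, row_number

# partition by the deduplication key; orderBy decides which
# duplicate is kept (assumed tie-breaker: lowest student ID)
w = Window.partitionBy('college').orderBy(col('student ID'))

(dataframe
 .withColumn('rn', row_number().over(w))  # 1 marks the first row per college
 .filter(col('rn') == 1)                  # keep only that row
 .drop('rn')                              # remove the helper column
 .show())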