Data Duplication Removal from Dataset Using Python
Last Updated : 28 Jul, 2025

Duplicates are a common issue in real-world datasets and can negatively impact our analysis. They occur when identical rows or entries appear multiple times in a dataset. Although they may seem harmless, they can cause problems in analysis if not fixed. Duplicates typically arise from:

Data entry errors: the same information is recorded more than once by mistake.
Merging datasets: combining data from different sources can produce overlapping records that create duplicates.

Why Duplicates Can Cause Problems?

Skewed Analysis: duplicates can distort analysis results and lead to misleading conclusions, such as a wrong average salary.
Inaccurate Models: they can cause machine learning models to overfit, which reduces their ability to perform well on new data.
Increased Computational Costs: they consume extra computational power, slowing down analysis and the overall workflow.
Data Redundancy and Complexity: they make it harder to maintain accurate records and organize data, and they add unnecessary complexity.

Identifying Duplicates

To manage duplicates, the first step is identifying them in the dataset. Pandas offers several functions that help spot and remove duplicate rows. Below we will see how to identify and remove duplicates using Python. We will use the Pandas library and the sample dataset created below.

Python

import pandas as pd

data = {
    'Name': ['Alice', 'Bob', 'Alice', 'Charlie', 'Bob', 'David'],
    'Age': [25, 30, 25, 35, 30, 40],
    'City': ['New York', 'Los Angeles', 'New York', 'Chicago', 'Los Angeles', 'San Francisco']
}

df = pd.DataFrame(data)
df

Output: Sample Dataset

1. Using duplicated() Method

The duplicated() method helps identify duplicate rows in a dataset. It returns a boolean Series indicating whether each row is a duplicate of a previous row.

Python

duplicates = df.duplicated()
duplicates

Output: Using duplicated()

Removing Duplicates

Duplicates may appear in only one or two columns rather than across the entire row. In such cases, we can choose specific columns to check for duplicates.

1. Based on Specific Columns

Here we specify the columns Name and City to remove duplicates using drop_duplicates().

Python

df_no_duplicates_columns = df.drop_duplicates(subset=['Name', 'City'])
df_no_duplicates_columns

Output: Removing duplicates based on columns

2. Keeping the First or Last Occurrence

By default, drop_duplicates() keeps the first occurrence of each duplicate row. However, we can pass keep='last' to keep the last occurrence instead.

Python

df_keep_last = df.drop_duplicates(keep='last')
df_keep_last

Output: Keeping the first or last occurrence

Cleaning duplicates is an important step in ensuring data accuracy, improving model performance and optimizing analysis efficiency.
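As a quick wrap-up, here is a minimal end-to-end sketch that ties the steps together, assuming the same df defined above. It counts exact duplicate rows, drops them, and verifies that none remain; the reset_index(drop=True) call is an extra step not shown in the article's examples, but it is a common follow-up after dropping rows so the index stays contiguous.

Python

import pandas as pd

# Same sample data as above
data = {
    'Name': ['Alice', 'Bob', 'Alice', 'Charlie', 'Bob', 'David'],
    'Age': [25, 30, 25, 35, 30, 40],
    'City': ['New York', 'Los Angeles', 'New York', 'Chicago', 'Los Angeles', 'San Francisco']
}
df = pd.DataFrame(data)

# Count fully duplicated rows before cleaning
print(df.duplicated().sum())        # 2 (the repeated Alice and Bob rows)

# Drop exact duplicate rows (keeps the first occurrence by default)
# and renumber the index (extra step, not in the article's examples)
df_clean = df.drop_duplicates().reset_index(drop=True)

# Verify that no duplicates remain
print(df_clean.duplicated().sum())  # 0
print(df_clean)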