Sometimes during data analysis we need to look at the duplicate rows to understand more about our data, rather than dropping them straight away.
Luckily, pandas gives us a few methods to work with duplicates.
.duplicated()
This method allows us to extract the duplicate rows of a DataFrame. We will use a new dataset that contains duplicates: the HR dataset (the download URL appears in the code below).
import pandas as pd
import numpy as np

# Import the HR dataset with only the columns we need
df = pd.read_csv(
    "https://raw.githubusercontent.com/sasankac/TestDataSet/master/HRDataset.csv",
    usecols=["Employee_Name", "PerformanceScore", "Position", "CitizenDesc"],
)

# Sort the values on employee name and make the change permanent
df.sort_values("Employee_Name", inplace=True)
df.head(3)
  | Employee_Name | Position | CitizenDesc | PerformanceScore
---|---|---|---|---
0 | Adinolfi | Production Technician I | US Citizen | Exceeds |
1 | Adinolfi | Sr. DBA | US Citizen | Fully Meets |
2 | Adinolfi | Production Technician II | US Citizen | Fully Meets |
The behavior of duplicated() is controlled by the keep parameter, which defaults to "first". This marks the very first occurrence of each value as a non-duplicate.
In other words, the method does not mark a value as a duplicate simply because it appears more than once; it marks each occurrence after the first one. Confused? Let me explain with an example: suppose there are 3 apples in a basket. This method marks the first apple as a non-duplicate and the remaining two apples as duplicates.
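The apple analogy can be sketched on a toy Series (hypothetical data, not from the HR dataset):

```python
import pandas as pd

# Three identical apples: the first is kept as a non-duplicate,
# every later occurrence is flagged as a duplicate
basket = pd.Series(["apple", "apple", "apple"])
print(basket.duplicated().tolist())
# [False, True, True]
```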
Example
df["Employee_Name"].head(3)
Output
0    Adinolfi
1    Adinolfi
2    Adinolfi
Name: Employee_Name, dtype: object
Example
df["Employee_Name"].duplicated().head(3)
Output
0    False
1     True
2     True
Name: Employee_Name, dtype: bool
Now, to extract the duplicates (remember: the first occurrence is not a duplicate; the subsequent occurrences are, and those are what this method flags), we use the resulting boolean Series to filter the DataFrame.
df.shape
(310, 4)
df[df["Employee_Name"].duplicated()]
  | Employee_Name | Position | CitizenDesc | PerformanceScore
---|---|---|---|---
1 | Adinolfi | Sr. DBA | US Citizen | Fully Meets |
2 | Adinolfi | Production Technician II | US Citizen | Fully Meets |
3 | Adinolfi | Production Technician I | US Citizen | Fully Meets |
4 | Adinolfi | Production Manager | US Citizen | Fully Meets |
6 | Anderson | Production Technician I | US Citizen | Exceeds |
... | ... | ... | ... | ... |
303 | Wang | Production Technician II | US Citizen | Fully Meets |
304 | Wang | Production Technician II | US Citizen | Fully Meets |
305 | Wang | Production Technician I | US Citizen | PIP |
306 | Wang | CIO | US Citizen | Exceeds |
307 | Wang | Data Analyst | US Citizen | Fully Meets |
79 rows × 4 columns
From the output above, the DataFrame has 310 rows, of which 79 are flagged as duplicates and extracted by the .duplicated() method.
ARGUMENT - "LAST"
By default, this method marks the first occurrence of a value as the non-duplicate. We can flip this behavior by passing keep="last".
In the apple example, keep="last" marks the first two apples as duplicates and keeps the last one as the non-duplicate.
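On the same toy basket, flipping to keep="last" reverses which occurrence survives:

```python
import pandas as pd

# keep="last": the last apple is the non-duplicate,
# all earlier occurrences are flagged
basket = pd.Series(["apple", "apple", "apple"])
print(basket.duplicated(keep="last").tolist())
# [True, True, False]
```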
df[df["Employee_Name"].duplicated(keep="last")]
  | Employee_Name | Position | CitizenDesc | PerformanceScore
---|---|---|---|---
0 | Adinolfi | Production Technician I | US Citizen | Exceeds |
1 | Adinolfi | Sr. DBA | US Citizen | Fully Meets |
2 | Adinolfi | Production Technician II | US Citizen | Fully Meets |
3 | Adinolfi | Production Technician I | US Citizen | Fully Meets |
5 | Anderson | Production Technician I | US Citizen | Fully Meets |
... | ... | ... | ... | ... |
302 | Wang | Production Technician II | US Citizen | Exceeds |
303 | Wang | Production Technician II | US Citizen | Fully Meets |
304 | Wang | Production Technician II | US Citizen | Fully Meets |
305 | Wang | Production Technician I | US Citizen | PIP |
306 | Wang | CIO | US Citizen | Exceeds |
ARGUMENT - FALSE
The keep parameter also accepts False, which marks every value occurring more than once as a duplicate. In our apple example, all 3 apples are marked as duplicates, rather than keeping the first or last as in the examples above.
Note - when passing False, do not use quotes: it is the Python boolean False, not the string "false".
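A toy basket (hypothetical data) shows the effect: every repeated value is flagged, while a value that appears only once stays False.

```python
import pandas as pd

# keep=False flags ALL occurrences of repeated values;
# "orange" appears once, so it is not a duplicate
basket = pd.Series(["apple", "apple", "apple", "orange"])
print(basket.duplicated(keep=False).tolist())
# [True, True, True, False]
```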
df[df["Employee_Name"].duplicated(keep=False)]
  | Employee_Name | Position | CitizenDesc | PerformanceScore
---|---|---|---|---
0 | Adinolfi | Production Technician I | US Citizen | Exceeds |
1 | Adinolfi | Sr. DBA | US Citizen | Fully Meets |
2 | Adinolfi | Production Technician II | US Citizen | Fully Meets |
3 | Adinolfi | Production Technician I | US Citizen | Fully Meets |
4 | Adinolfi | Production Manager | US Citizen | Fully Meets |
... | ... | ... | ... | ... |
303 | Wang | Production Technician II | US Citizen | Fully Meets |
304 | Wang | Production Technician II | US Citizen | Fully Meets |
305 | Wang | Production Technician I | US Citizen | PIP |
306 | Wang | CIO | US Citizen | Exceeds |
307 | Wang | Data Analyst | US Citizen | Fully Meets |
Now, finally, to extract the unique values from the dataset, we can use the "~" (tilde) operator to negate the boolean mask.
df_unique = ~df["Employee_Name"].duplicated(keep=False)
df[df_unique]
  | Employee_Name | Position | CitizenDesc | PerformanceScore
---|---|---|---|---
7 | Andreola | Software Engineer | US Citizen | Fully Meets |
25 | Bozzi | Production Manager | US Citizen | Fully Meets |
26 | Bramante | Director of Operations | US Citizen | Exceeds |
27 | Brill | Production Technician I | US Citizen | Fully Meets |
34 | Burkett | Production Technician II | US Citizen | Fully Meets |
... | ... | ... | ... | ... |
276 | Sweetwater | Software Engineer | US Citizen | Exceeds |
277 | Szabo | Production Technician I | Non-Citizen | Fully Meets |
278 | Tavares | Production Technician II | US Citizen | Fully Meets |
308 | Zhou | Production Technician I | US Citizen | Fully Meets |
309 | Zima | NaN | NaN | NaN |
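The same negation pattern on a toy Series (hypothetical data) keeps only the values that never repeat:

```python
import pandas as pd

# ~ flips the keep=False mask, so only never-repeated values survive
basket = pd.Series(["apple", "apple", "orange"])
unique_mask = ~basket.duplicated(keep=False)
print(basket[unique_mask].tolist())
# ['orange']
```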
drop_duplicates()
This method is pretty similar to the previous one; however, it can be called on an entire DataFrame rather than on a single Series.
NOTE :- By default, this method looks for rows that are duplicated across all the columns of the DataFrame and drops them.
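A minimal sketch with a toy DataFrame (hypothetical data): only a row that is identical across every column is dropped; a row matching in just one column survives.

```python
import pandas as pd

# Rows 0 and 1 are identical across ALL columns; row 2 differs in "score"
toy = pd.DataFrame({"name": ["A", "A", "A"], "score": [1, 1, 2]})
print(len(toy.drop_duplicates()))
# 2  -- only the fully identical row is dropped
```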
len(df)
Output
310
len(df.drop_duplicates())
Output
290
SUBSET PARAMETER
The subset parameter accepts a list of column names as string values in which we can check for duplicates.
df1 = df.drop_duplicates(subset=["Employee_Name"], keep="first")
df1
  | Employee_Name | Position | CitizenDesc | PerformanceScore
---|---|---|---|---
0 | Adinolfi | Production Technician I | US Citizen | Exceeds |
5 | Anderson | Production Technician I | US Citizen | Fully Meets |
7 | Andreola | Software Engineer | US Citizen | Fully Meets |
14 | Athwal | Production Technician I | US Citizen | Fully Meets |
20 | Beak | Production Technician I | US Citizen | Fully Meets |
... | ... | ... | ... | ... |
293 | Von Massenbach | Production Technician II | US Citizen | Fully Meets |
295 | Wallace | Production Technician I | US Citizen | Needs Improvement |
300 | Wang | Production Technician I | Eligible NonCitizen | Fully Meets |
308 | Zhou | Production Technician I | US Citizen | Fully Meets |
309 | Zima | NaN | NaN | NaN |
We can specify multiple columns and use all the keep parameters discussed in the previous section.
df1 = df.drop_duplicates(subset=["Employee_Name", "CitizenDesc"], keep=False)
df1
  | Employee_Name | Position | CitizenDesc | PerformanceScore
---|---|---|---|---
7 | Andreola | Software Engineer | US Citizen | Fully Meets |
16 | Beak | Production Technician I | Eligible NonCitizen | Fully Meets |
25 | Bozzi | Production Manager | US Citizen | Fully Meets |
26 | Bramante | Director of Operations | US Citizen | Exceeds |
27 | Brill | Production Technician I | US Citizen | Fully Meets |
... | ... | ... | ... | ... |
287 | Tejeda | Network Engineer | Eligible NonCitizen | Fully Meets |
286 | Tejeda | Software Engineer | Non-Citizen | Fully Meets |
300 | Wang | Production Technician I | Eligible NonCitizen | Fully Meets |
308 | Zhou | Production Technician I | US Citizen | Fully Meets |
309 | Zima | NaN | NaN | NaN |
unique() Method
The unique() method finds the unique values in a Series and returns them as a NumPy array. Note that this method does not exclude missing values.
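A toy Series (hypothetical data) makes the missing-value behavior visible: NaN shows up in the returned array just like any other value.

```python
import pandas as pd
import numpy as np

# unique() returns every distinct value, INCLUDING NaN
s = pd.Series(["A", "A", np.nan])
print(s.unique())       # array(['A', nan], dtype=object)
print(len(s.unique()))  # 2 -- NaN counts as a value here
```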
len(df["Employee_Name"])
Output
310
df["Employee_Name"].unique()
array(['Adinolfi', 'Anderson', 'Andreola', 'Athwal', 'Beak', 'Bondwell', 'Bozzi', 'Bramante', 'Brill', 'Brown', 'Burkett', 'Butler', 'Carabbio', 'Carey', 'Carr', 'Carter', 'Chace', 'Champaigne', 'Chan', 'Chang', 'Chivukula', 'Cierpiszewski', 'Cisco', 'Clayton', 'Cloninger', 'Close', 'Clukey', 'Cockel', 'Cole', 'Cornett', 'Costa', 'Crimmings', 'Daneault', 'Daniele', 'Darson', 'Davis', 'DeGweck', 'Del Bosque', 'Demita', 'Desimone', 'DiNocco', 'Dickinson', 'Dietrich', 'Digitale', 'Dobrin', 'Dolan', 'Dougall', 'Dunn', 'Eaton', 'Employee_Name', 'Engdahl', 'England', 'Erilus', 'Estremera', 'Evensen', 'Exantus', 'Faller', 'Fancett', 'Favis', 'Ferguson', 'Fernandes', 'Ferreira', 'Fidelia', 'Fitzpatrick', 'Foreman', 'Foss', 'Foster-Baker', 'Fraval', 'Friedman', 'Galia', 'Garcia', 'Garneau', 'Gaul', 'Gentry', 'Gerke', 'Gill', 'Gonzales', 'Gonzalez', 'Good', 'Handschiegl', 'Hankard', 'Harrison', 'Heitzman', 'Horton', 'Houlihan', 'Howard', 'Hubert', 'Hunts', 'Hutter', 'Huynh', 'Immediato', 'Ivey', 'Jackson', 'Jacobi', 'Jeannite', 'Jeremy Prater', 'Jhaveri', 'Johnson', 'Johnston', 'Jung', 'Kampew', 'Keatts', 'Khemmich', 'King', 'Kinsella', 'Kirill', 'Knapp', 'Kretschmer', 'LaRotonda', 'Lajiri', 'Langford', 'Langton', 'Latif', 'Le', 'LeBel', 'LeBlanc', 'Leach', 'Leruth', 'Liebig', 'Linares', 'Linden', 'Lindsay', 'Lundy', 'Lunquist', 'Lydon', 'Lynch', 'MacLennan', 'Mahoney', 'Manchester', 'Mancuso', 'Mangal', 'Martin', 'Martins', 'Maurice', 'McCarthy', 'McKinzie', 'Mckenna', 'Meads', 'Medeiros', 'Merlos', 'Miller', 'Monkfish', 'Monroe', 'Monterro', 'Moran', 'Morway', 'Motlagh', 'Moumanil', 'Mullaney', 'Murray', 'Navathe', 'Ndzi', 'Newman', 'Ngodup', 'Nguyen', 'Nowlan', 'O'hare', 'Oliver', 'Onque', 'Osturnka', 'Owad', 'Ozark', 'Panjwani', 'Patronick', 'Pearson', 'Pelech', 'Pelletier', 'Perry', 'Peters', 'Peterson', 'Petingill', 'Petrowsky', 'Pham', 'Pitt', 'Potts', 'Power', 'Punjabhi', 'Purinton', 'Quinn', 'Rachael', 'Rarrick', 'Rhoads', 'Riordan', 'Rivera', 'Roberson', 
'Robertson', 'Robinson', 'Roby', 'Roehrich', 'Rogers', 'Roper', 'Rose', 'Rossetti', 'Roup', 'Ruiz', 'Saada', 'Saar-Beckles', 'Sadki', 'Sahoo', 'Salter', 'Sander', 'Semizoglou', 'Sewkumar', 'Shepard', 'Shields', 'Simard', 'Singh', 'Sloan', 'Smith', 'Soto', 'South', 'Sparks', 'Spirea', 'Squatrito', 'Stanford', 'Stanley', 'Steans', 'Stoica', 'Strong', 'Sullivan', 'Sutwell', 'Sweetwater', 'Szabo', 'Tavares', 'Tejeda', 'Veera', 'Von Massenbach', 'Wallace', 'Wang', 'Zhou', 'Zima'], dtype=object)
len(df["Employee_Name"].unique())
Output
231
.nunique() Method
This method returns the number of unique values in a Series. By default it excludes missing values, via the parameter dropna=True.
Pass dropna=False if you want missing values counted as well.
df["Employee_Name"].nunique()
Output
231
df["Employee_Name"].nunique(dropna=False)
Output
231
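Because Employee_Name happens to contain no missing values, both calls return 231 here. A toy Series with a NaN (hypothetical data) makes the dropna difference visible:

```python
import pandas as pd
import numpy as np

s = pd.Series(["A", "A", np.nan])
print(s.nunique())              # 1 -- NaN excluded by default
print(s.nunique(dropna=False))  # 2 -- NaN counted as a value
```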