Pandas Dataframe Difference

Last Updated : 16 Dec, 2024

When working with multiple DataFrames, you might want to compute the differences between them, such as identifying rows that are in one DataFrame but not in another. Pandas provides various ways to compute the difference between DataFrames, whether it's comparing rows, columns, or entire DataFrames. This is useful in data analysis, especially when you need to track changes between datasets over time or compare two similar datasets.

In this article, we will explore methods to find the difference between DataFrames using Pandas.

Python
import pandas as pd

# Create DataFrames for Dataset 1 and Dataset 2
data1 = {'Name': ['John', 'Alice', 'Bob', 'Eve'], 
         'Age': [25, 30, 22, 35], 
         'Gender': ['Male', 'Female', 'Male', 'Female']}
df1 = pd.DataFrame(data1)

data2 = {'Name': ['John', 'Alice', 'Charlie', 'Eve'], 
         'Age': [25, 32, 28, 35], 
         'Gender': ['Male', 'Female', 'Male', 'Female']}
df2 = pd.DataFrame(data2)


Finding Rows in One DataFrame but Not in Another

The most common way to find the difference between DataFrames is to identify rows that are in one DataFrame but not in the other. This can be done using the merge() method with the indicator=True option or by using isin() method.

  • Use merge() with indicator=True to identify differences.
Python
# Merge the DataFrames with the 'indicator' flag to track the source of each row
merged_df = pd.merge(df1, df2, how='outer', indicator=True)

# Find rows that are only in df1 but not in df2
diff_df1 = merged_df[merged_df['_merge'] == 'left_only']
print(diff_df1)

# Find rows that are only in df2 but not in df1
diff_df2 = merged_df[merged_df['_merge'] == 'right_only']
print(diff_df2)



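The merge() method is used with the indicator=True flag to add a new column (_merge) that shows whether a row is only in df1, only in df2, or in both. Because no on argument is given, merge() matches rows on all columns the two DataFrames share (Name, Age and Gender), so a row counts as shared only if every value is identical. We then filter for rows where _merge is 'left_only' (rows unique to df1) or 'right_only' (rows unique to df2). Alice appears in both results because her Age is 30 in df1 and 32 in df2.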
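The filtering step described above can be wrapped in a small function. This is a minimal sketch, not part of the original article: rows_only_in_left is a hypothetical helper name, and it assumes both DataFrames have the same columns so that merge() compares whole rows.

```python
import pandas as pd

df1 = pd.DataFrame({'Name': ['John', 'Alice', 'Bob', 'Eve'],
                    'Age': [25, 30, 22, 35],
                    'Gender': ['Male', 'Female', 'Male', 'Female']})
df2 = pd.DataFrame({'Name': ['John', 'Alice', 'Charlie', 'Eve'],
                    'Age': [25, 32, 28, 35],
                    'Gender': ['Male', 'Female', 'Male', 'Female']})

def rows_only_in_left(left, right):
    """Hypothetical helper: rows of `left` with no identical row in `right`."""
    # Outer merge on all shared columns; _merge records where each row came from
    merged = left.merge(right, how='outer', indicator=True)
    only_left = merged[merged['_merge'] == 'left_only']
    # Drop the bookkeeping column so the result has the original columns
    return only_left.drop(columns='_merge').reset_index(drop=True)

only_in_df1 = rows_only_in_left(df1, df2)  # Alice (30) and Bob
only_in_df2 = rows_only_in_left(df2, df1)  # Alice (32) and Charlie
print(only_in_df1)
print(only_in_df2)
```

Swapping the arguments gives the rows unique to the other DataFrame, so one helper covers both directions.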

Finding the Difference in Values (Element-wise)

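If you want to find the difference between corresponding elements in two DataFrames, you can subtract one DataFrame from another. This works for numerical data. Pandas aligns the two DataFrames on their index and column labels before subtracting, so values are paired by label rather than by position. Here both DataFrames use the default index 0 to 3, which means row 2 compares Bob's age in df1 with Charlie's age in df2.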

Python
# Subtract df2 from df1 (numerical columns only)
df_diff = df1.select_dtypes(include=['number']) - df2.select_dtypes(include=['number'])
print(df_diff)

The select_dtypes(include=['number']) method selects only the numerical columns for subtraction. Subtracting the corresponding values in df1 and df2 produces a new DataFrame with the element-wise differences.
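When the rows of the two DataFrames do not line up by position, it can help to align them on a key column before subtracting. The following is a sketch under the assumption that Name uniquely identifies a row in each DataFrame. That assumption holds for the sample data but is not stated in the original article.

```python
import pandas as pd

df1 = pd.DataFrame({'Name': ['John', 'Alice', 'Bob', 'Eve'],
                    'Age': [25, 30, 22, 35],
                    'Gender': ['Male', 'Female', 'Male', 'Female']})
df2 = pd.DataFrame({'Name': ['John', 'Alice', 'Charlie', 'Eve'],
                    'Age': [25, 32, 28, 35],
                    'Gender': ['Male', 'Female', 'Male', 'Female']})

# Index by Name so subtraction pairs values by person, not by row position
ages1 = df1.set_index('Name')['Age']
ages2 = df2.set_index('Name')['Age']

# Names present in only one DataFrame (Bob, Charlie) produce NaN
age_diff = ages1.sub(ages2)
print(age_diff)
```

A NaN in the result marks a name that exists in only one DataFrame, which separates "value changed" from "row missing".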

Using isin to Find Values Not Shared Between DataFrames

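The isin() method is another way to compare DataFrames. Applied to a single column, it checks whether each value in that column appears in a column of the other DataFrame, which lets you filter for rows whose key value has no match. Unlike the merge() approach, it does not compare the remaining columns of the row.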

Python
# Find rows in df1 that are not in df2
df_diff = df1[~df1['Name'].isin(df2['Name'])]
print(df_diff)

The isin() method checks if each value in the Name column of df1 is present in the Name column of df2. The tilde (~) negates the result, meaning we filter for rows in df1 whose Name does not exist in df2.
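The same idea can be extended to more than one column by building a MultiIndex from the key columns and calling isin() on it. This is a sketch, not the article's method: the choice of Name and Age as key columns (key_cols) is an assumption made only for illustration.

```python
import pandas as pd

df1 = pd.DataFrame({'Name': ['John', 'Alice', 'Bob', 'Eve'],
                    'Age': [25, 30, 22, 35],
                    'Gender': ['Male', 'Female', 'Male', 'Female']})
df2 = pd.DataFrame({'Name': ['John', 'Alice', 'Charlie', 'Eve'],
                    'Age': [25, 32, 28, 35],
                    'Gender': ['Male', 'Female', 'Male', 'Female']})

key_cols = ['Name', 'Age']  # assumed key columns for this illustration

# Each row becomes a (Name, Age) tuple in a MultiIndex
keys1 = pd.MultiIndex.from_frame(df1[key_cols])
keys2 = pd.MultiIndex.from_frame(df2[key_cols])

# True where the (Name, Age) pair from df1 has no match in df2
mask = ~keys1.isin(keys2)
rows_not_in_df2 = df1[mask]
print(rows_not_in_df2)
```

With Name alone only Bob is reported. With Name and Age together Alice is reported as well, because her age differs between the two DataFrames.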

Comparing DataFrame Indexes

You may also want to compare the indexes of two DataFrames to see if they are the same or different. You can use the .index attribute to compare indexes between DataFrames.


Python
# Compare indexes between df1 and df2
index_diff = df1.index.difference(df2.index)
print(index_diff)

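The difference() method returns the index labels that are present in df1 but not in df2. With the sample data the result is empty, because both DataFrames use the same default index (0 to 3). This check is useful when you want to confirm whether the row labels are the same across DataFrames.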
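Index comparison becomes more informative when the index carries meaningful labels. The sketch below assumes Name is a suitable row label, which is a choice made for illustration rather than something the article prescribes, and also shows symmetric_difference() for labels found in either DataFrame but not both.

```python
import pandas as pd

df1 = pd.DataFrame({'Name': ['John', 'Alice', 'Bob', 'Eve'],
                    'Age': [25, 30, 22, 35],
                    'Gender': ['Male', 'Female', 'Male', 'Female']})
df2 = pd.DataFrame({'Name': ['John', 'Alice', 'Charlie', 'Eve'],
                    'Age': [25, 32, 28, 35],
                    'Gender': ['Male', 'Female', 'Male', 'Female']})

# Default integer indexes are identical, so nothing is reported
default_diff = df1.index.difference(df2.index)

# Use Name as the row label instead (assumed key for this illustration)
idx1 = df1.set_index('Name').index
idx2 = df2.set_index('Name').index

only_in_df1 = idx1.difference(idx2)        # labels in df1 but not in df2
only_in_df2 = idx2.difference(idx1)        # labels in df2 but not in df1
either = idx1.symmetric_difference(idx2)   # labels in exactly one of the two

print(list(only_in_df1), list(only_in_df2), list(either))
```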

Summary:

Pandas provides multiple methods for finding the difference between DataFrames, each suited for specific use cases:

  • merge() with the indicator=True flag is great for finding rows that differ between DataFrames.
  • Subtraction is useful for comparing numerical values element-wise.
  • isin() is helpful for filtering rows whose values in a chosen column do not appear in the other DataFrame.
  • difference() can be used to compare DataFrame indexes.

These techniques can be combined and customized to suit a variety of data comparison tasks in your analysis workflow.
