Pandas Cheat Sheet
Deepali Srivastava, Author of “Ultimate Python Programming”
Importing numpy and pandas
import numpy as np
import pandas as pd
Indexing
# By default, Pandas assigns an integer index (0, 1, 2, 3, ...) to each row
df.index # Shows the row labels, which can be numeric (default) or custom labels (like strings, datetime, etc.)
df = df.set_index('name') # Set 'name' column as the index
df = df.reset_index() # Reset to the default integer index; the old index becomes a regular column
'John' in df.index # Check whether a label exists in the index; returns True or False
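The indexing calls above can be tried end to end on a small frame. This is a minimal sketch; the sample data and the variable names `indexed`, `restored`, `has_john`, and `has_zoe` are made up for illustration.

```python
import pandas as pd

# Hypothetical sample data, not from the cheat sheet
df = pd.DataFrame({'name': ['John', 'Mary', 'Alex'], 'age': [25, 32, 41]})

default_index = list(df.index)        # default integer labels: [0, 1, 2]

indexed = df.set_index('name')        # 'name' values become the row labels
has_john = 'John' in indexed.index    # True
has_zoe = 'Zoe' in indexed.index      # False

restored = indexed.reset_index()      # 'name' is a regular column again
```

Note that `set_index` and `reset_index` return new DataFrames, which is why the cheat sheet assigns the result back to `df`.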
Correlation
# Getting pairwise correlation of numeric columns
correlation_matrix = df.corr(numeric_only=True) # Ensures only numeric columns are used
# Default method is 'pearson', can use 'kendall' or 'spearman'
correlation_matrix = df.corr(numeric_only=True, method='pearson')
# Access a specific correlation value between two columns using .loc
correlation_matrix.loc['age','salary'] # Correlation between column 'age' and 'salary'
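As a quick check of the correlation calls above, here is a small sketch with invented data in which 'salary' rises in a straight line with 'age', so the Pearson correlation should come out at 1. The column values are assumptions chosen only to make the result easy to verify.

```python
import pandas as pd

# Hypothetical data: salary is an exact linear function of age
df = pd.DataFrame({
    'name': ['A', 'B', 'C', 'D'],          # non-numeric, ignored by numeric_only=True
    'age': [25, 30, 35, 40],
    'salary': [50000, 60000, 70000, 80000],
})

correlation_matrix = df.corr(numeric_only=True)        # 2 x 2 matrix: age, salary
age_salary = correlation_matrix.loc['age', 'salary']   # close to 1.0 for this data
```

The matrix is symmetric, so `correlation_matrix.loc['salary', 'age']` gives the same value.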
# Use axis=1 to operate across the columns of each row, giving one result per row (row sums, row means, etc.)
df.min(axis=1) # Minimum value in each row
df.sum(axis=1) # Sum of each row
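To see the difference between the two axes, the sketch below runs the same reductions on a tiny, made-up table of scores. The frame `scores` and its column names are illustrative assumptions.

```python
import pandas as pd

# Hypothetical scores: one row per student, one column per subject
scores = pd.DataFrame({'math': [90, 70], 'art': [80, 60], 'music': [85, 95]})

row_min = scores.min(axis=1)   # one value per row: 80, 60
row_sum = scores.sum(axis=1)   # one value per row: 255, 225
col_sum = scores.sum(axis=0)   # default axis, one value per column: 160, 140, 180
```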
# Use relational operators (<, >, ==, >=, <=, !=) to filter rows
df[df['age'] > 10]
# Use .between() to filter rows where a column's values fall within a specified range
df[df['age'].between(20, 30)] # both 20 and 30 inclusive
df[df['age'].between(20, 30, inclusive='both')] # default - both 20 and 30 inclusive
df[df['age'].between(20, 30, inclusive='neither')] # Exclude both 20 and 30 from the filter
df[df['age'].between(20, 30, inclusive='left')] # Include 20 but exclude 30
df[df['age'].between(20, 30, inclusive='right')] # Exclude 20 but include 30
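The four `inclusive` options are easiest to compare side by side on data that sits exactly on the boundaries. A minimal sketch with invented ages follows; it assumes a pandas version where `inclusive` accepts strings (1.3 or later).

```python
import pandas as pd

# Hypothetical ages, including both boundary values 20 and 30
df = pd.DataFrame({'age': [19, 20, 25, 30, 31]})

both = df[df['age'].between(20, 30)]['age'].tolist()                          # [20, 25, 30]
neither = df[df['age'].between(20, 30, inclusive='neither')]['age'].tolist()  # [25]
left = df[df['age'].between(20, 30, inclusive='left')]['age'].tolist()        # [20, 25]
right = df[df['age'].between(20, 30, inclusive='right')]['age'].tolist()      # [25, 30]
```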
def age_category(age):
    return 'Young' if age < 18 else 'Adult'
df['category'] = df['age'].apply(age_category) # Apply age_category function to 'age' column
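Here is the same `apply` pattern as a self-contained sketch, with made-up ages chosen so that the boundary case (exactly 18) is visible.

```python
import pandas as pd

def age_category(age):
    # Called once per value in the column
    return 'Young' if age < 18 else 'Adult'

# Hypothetical ages; 18 is not below 18, so it maps to 'Adult'
df = pd.DataFrame({'age': [12, 18, 40]})
df['category'] = df['age'].apply(age_category)
categories = df['category'].tolist()   # ['Young', 'Adult', 'Adult']
```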
Removing columns
df = df.drop('name', axis=1) # Drop the 'name' column
df = df[['name','age','phone']] # Keep only 'name', 'age', and 'phone' columns, other columns dropped
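Both approaches to removing columns can be compared on a small frame. The sketch below uses invented data; `df.drop(columns=[...])` is shown as an equivalent spelling of `axis=1`.

```python
import pandas as pd

# Hypothetical contact data
df = pd.DataFrame({
    'name': ['Ann', 'Bob'],
    'age': [30, 41],
    'phone': ['555-0101', '555-0102'],
    'city': ['Pune', 'Agra'],
})

without_city = df.drop('city', axis=1)       # drop one column by name
same_result = df.drop(columns=['city'])      # equivalent keyword form
subset = df[['name', 'age']]                 # keep only the listed columns
```

Neither call changes `df` itself; each returns a new DataFrame.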
# Copying Rows
some_rows = df.iloc[:2].copy() # Copy the first two rows
df_adults = df[df['age'] > 18].copy() # Copy rows where age > 18
# Copying Columns
df_1 = df[['name', 'city']].copy() # Copy only the 'name' and 'city' columns
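The point of `.copy()` is that later changes to the copy leave the original untouched. A minimal sketch with made-up rows:

```python
import pandas as pd

# Hypothetical data
df = pd.DataFrame({
    'name': ['Ann', 'Bob', 'Cy'],
    'city': ['Pune', 'Agra', 'Goa'],
    'age': [17, 25, 40],
})

df_adults = df[df['age'] > 18].copy()    # independent copy of the filtered rows
df_adults['age'] = df_adults['age'] + 1  # modify the copy only

original_ages = df['age'].tolist()       # still [17, 25, 40]
adult_ages = df_adults['age'].tolist()   # [26, 41]
```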
# Filter the rows by a boolean condition, select specific columns, then sort by column 'col2'
df[condition][['col1', 'col2', 'col3']].sort_values('col2')
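The chained expression above assumes `condition` is a boolean Series built beforehand. The sketch below fills that in with an invented condition and data, and also shows a single `.loc` call that gives the same result.

```python
import pandas as pd

# Hypothetical data
df = pd.DataFrame({
    'col1': [1, 2, 3, 4],
    'col2': [40, 30, 20, 10],
    'col3': ['a', 'b', 'c', 'd'],
    'col4': [0, 0, 0, 0],
})

condition = df['col1'] > 1   # boolean Series, True for the last three rows

result = df[condition][['col1', 'col2', 'col3']].sort_values('col2')

# Same selection with one .loc call (rows and columns chosen together)
result_loc = df.loc[condition, ['col1', 'col2', 'col3']].sort_values('col2')
```

The original row labels are kept after sorting, so `result.index` here is 3, 2, 1.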
# Apply max() for 'age' column and mean() for 'height' column
df.groupby('grade').agg({ 'age': 'max', 'height': 'mean' })
Custom aggregation
df.groupby('grade').agg(func) # Apply custom function func()
df.groupby('store').agg({'sales': ['sum', 'mean'],
                         'items_sold': lambda x: x.sum() / len(x)})
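The aggregation patterns above can be checked on a small, made-up sales table. `value_range` is a hypothetical custom function standing in for `func`; it is not part of pandas.

```python
import pandas as pd

# Hypothetical sales data
df = pd.DataFrame({
    'store': ['A', 'A', 'B', 'B'],
    'sales': [100, 200, 50, 150],
    'items_sold': [10, 20, 5, 15],
})

def value_range(x):
    # Custom aggregation: spread between largest and smallest value in the group
    return x.max() - x.min()

ranges = df.groupby('store').agg(value_range)   # applied to every non-key column

summary = df.groupby('store').agg({
    'sales': ['sum', 'mean'],                     # two results for one column
    'items_sold': lambda x: x.sum() / len(x),     # hand-written mean
})

sales_sum = summary[('sales', 'sum')].tolist()    # [300, 200]
```

Because one column is given a list of functions, `summary` has two-level column labels such as `('sales', 'sum')`.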
# Include only those grades for which average height is greater than 150
df.groupby('grade').filter(lambda x : x['height'].mean() > 150)
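`filter` keeps or drops whole groups, not individual rows. In the sketch below (invented heights), grade 'A' survives because its average is 155, even though one of its rows is exactly 150.

```python
import pandas as pd

# Hypothetical data: grade A averages 155, grade B averages 145
df = pd.DataFrame({
    'grade': ['A', 'A', 'B', 'B'],
    'height': [160, 150, 140, 150],
})

tall_grades = df.groupby('grade').filter(lambda x: x['height'].mean() > 150)
kept = tall_grades['grade'].unique().tolist()   # ['A']
```

The result has the same columns as `df` and contains every row of each group that passed the test.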
# Group by the 'grade' column, sum the numeric columns, then sort the resulting DataFrame by the 'age' column
df.groupby('grade').sum(numeric_only=True).sort_values('age')
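A short sketch of grouping, summing, and sorting in one chain, using made-up data. It passes `numeric_only=True` to `sum()` as an assumption about intent, so that the text column 'name' is left out of the totals.

```python
import pandas as pd

# Hypothetical data with one text column and two numeric columns
df = pd.DataFrame({
    'grade': ['A', 'B', 'A', 'B'],
    'name': ['Ann', 'Bob', 'Cy', 'Di'],
    'age': [15, 12, 16, 11],
    'height': [160, 150, 165, 145],
})

totals = df.groupby('grade').sum(numeric_only=True).sort_values('age')
order = totals.index.tolist()   # ['B', 'A'] because 23 < 31
```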
df['name_split'] = df['name'].str.split(' ') # Split the 'name' column into a list of words
df['first_name'] = df['name'].str.split(' ').str[0] # Get the first name (first element of list)
# Split the 'name' column into first and last name and expand into separate columns
# n=1 splits on the first space only, so names with more than two words still fit two columns
df[['first_name', 'last_name']] = df['name'].str.split(' ', n=1, expand=True) # Expand into two new columns
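The string-splitting calls can be compared on a couple of invented names. This sketch limits the split to the first space with `n=1`, an assumption added here so that a three-word name still produces exactly two columns.

```python
import pandas as pd

# Hypothetical names, one with three words
df = pd.DataFrame({'name': ['John Smith', 'Ana Maria Lopez']})

name_split = df['name'].str.split(' ')   # each value becomes a list of words
first = name_split.str[0]                # first element of each list

# Split at the first space only and spread the two parts over two columns
df[['first_name', 'last_name']] = df['name'].str.split(' ', n=1, expand=True)
```

Without `n=1`, the second name would split into three parts and the assignment to two columns would fail.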
# Convert a datetime object back to a string with a specific format using strftime
df['formatted_date'] = df['date'].dt.strftime('%Y-%m-%d') # Format datetime as 'YYYY-MM-DD'
# Calculate the difference between two datetime columns using subtraction, which results in a Timedelta object
df['duration'] = df['end_date'] - df['start_date'] # Time difference between 'end_date' and 'start_date'
df['days'] = df['duration'].dt.days # Extract the number of days from the Timedelta
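The formatting and subtraction steps above can be run together on a small frame. The dates below are invented, and the columns are built with `pd.to_datetime` so that the `.dt` accessor is available.

```python
import pandas as pd

# Hypothetical start and end dates
df = pd.DataFrame({
    'start_date': pd.to_datetime(['2025-01-01', '2025-02-10']),
    'end_date': pd.to_datetime(['2025-01-15', '2025-03-01']),
})

# Datetime to string, here as day/month/year
df['formatted_start'] = df['start_date'].dt.strftime('%d/%m/%Y')

# Subtracting two datetime columns gives a Timedelta column
df['duration'] = df['end_date'] - df['start_date']
df['days'] = df['duration'].dt.days   # whole days: 14 and 19
```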
# Filtering by Date
df[df['date'] > '2025-01-01'] # Filter rows where 'date' is after January 1, 2025
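Date filtering works with plain strings as long as the column holds datetime values. A minimal sketch with invented dates, showing that `>` excludes the boundary date while `>=` includes it:

```python
import pandas as pd

# Hypothetical dates; the column must be datetime, not plain strings
df = pd.DataFrame({
    'date': pd.to_datetime(['2024-12-31', '2025-01-01', '2025-06-15']),
})

after = df[df['date'] > '2025-01-01']        # strictly after January 1, 2025
on_or_after = df[df['date'] >= '2025-01-01'] # includes January 1, 2025

after_dates = after['date'].dt.strftime('%Y-%m-%d').tolist()
```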
# Reindex df to a new date range (daterange, e.g. built with pd.date_range); dates missing from the original index become new rows with NaN in all columns
df = df.reindex(daterange)
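The reindex step assumes `df` has a datetime index and that `daterange` already exists. One way to build it is `pd.date_range`, as in this sketch with invented sales figures for two of four days.

```python
import pandas as pd

# Hypothetical data: values exist only for January 1 and January 3
df = pd.DataFrame(
    {'sales': [10, 30]},
    index=pd.to_datetime(['2025-01-01', '2025-01-03']),
)

# Assumed construction of daterange: every day from January 1 to January 4
daterange = pd.date_range('2025-01-01', '2025-01-04', freq='D')

df = df.reindex(daterange)                 # 4 rows; Jan 2 and Jan 4 are new
missing = df['sales'].isna().tolist()      # [False, True, False, True]
```

The 'sales' column becomes float after reindexing because NaN cannot be stored in an integer column.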