Pandas Cheat Sheet

The document is a cheat sheet for using Pandas in Python, authored by Deepali Srivastava, which includes a comprehensive list of functions and methods for data manipulation and analysis. It covers topics such as importing libraries, reading and writing data, data exploration, indexing, and various data aggregation techniques. Additionally, it provides links to a Jupyter Notebook with 108 questions for practical application of the concepts discussed.

Deepali Srivastava
Author of “Ultimate Python Programming”

Access the Pandas Jupyter Notebook with 108 questions here 👇

https://github.com/Deepali-Srivastava/Pandas-Cheat-Sheet-for-Data-Analysis

Cheat Sheet Contents

▪ Importing numpy and pandas
▪ Reading and writing data to files
▪ Reading a CSV with specific options
▪ Reading a file that does not contain a header row
▪ Treat specific values as missing values (NaN) while reading a file
▪ Data Exploration
▪ Exploring and converting data types
▪ Indexing
▪ Correlation
▪ Commonly used aggregation functions
▪ Iterating over a DataFrame
▪ Getting Unique Values and their Frequencies
▪ Identifying and Removing Duplicate Rows in a DataFrame
▪ Identifying and Removing Duplicate Columns in a DataFrame
▪ Data Selection
▪ Conditional Data selection
▪ Changing values using .loc
▪ Changing values using .iloc
▪ Faster single value updates using .at(label-based) or .iat(index-based)
▪ Replacing Values
▪ Clipping : Limiting the values within a specified range (useful in handling outliers)
▪ Adding or modifying columns
▪ Apply a function to add/modify a column
▪ Removing columns
▪ Removing rows based on label index
▪ Changing column names
▪ Finding missing Values
▪ Filling missing values
▪ Removing rows/columns that contain missing values
▪ Copying a DataFrame or a part of it
▪ Sorting values in columns
▪ Getting n largest and n smallest values for a column
▪ concat(): Combine DataFrames vertically or horizontally
▪ merge(): Combine two DataFrames based on common columns or indices
▪ Grouping and aggregating data : Analysing data per category
    ▪ Common aggregations
    ▪ Multiple aggregations using agg()
    ▪ Custom aggregation
    ▪ Multilevel grouping - grouping data by multiple columns
    ▪ Groupby on filtered dataset
    ▪ Grouping and aggregation works by split-apply-combine
    ▪ Iterating through groups
    ▪ Retrieve the DataFrame for a specific group
▪ Working with strings
▪ Working with datetime

Importing numpy and pandas
import numpy as np
import pandas as pd

Reading and writing data to files

df = pd.read_csv('data.csv')
df.to_csv('data.csv')
df = pd.read_excel('data.xlsx')
df.to_excel('data.xlsx')

Reading a CSV with specific options

df = pd.read_csv(
    'data.csv',
    sep=':',                              # CSV with colons as separating values
    index_col=0,                          # Set the first column as the index
    usecols=['Name', 'Age', 'column_2'],  # Read only these columns ('column_2' must be included to be parsed below)
    nrows=100,                            # Read first 100 rows only
    dtype={'Age': float},                 # Ensure 'Age' is read as a float
    parse_dates=['column_2'],             # Parse 'column_2' as datetime
    encoding='utf-8'                      # Use UTF-8 encoding
)
Reading a file that does not contain a header row
df = pd.read_csv(
    'data.csv',
    header=None,                     # First row not treated as header
    names=['Name', 'Age', 'Phone']   # Manually specify column names
)
Treat specific values as missing values(NaN) while reading a file
# Values 'n.a.', 'n/a' or 'null' in any column are treated as missing values
df = pd.read_csv('data.csv', na_values=['n.a.', 'n/a', 'null'])
# Column-Specific Custom Missing Values
df = pd.read_csv('data.csv', na_values={'email': ['unknown'], 'age':[-1], 'location': ['not available','n/a']})
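The header and missing-value options above can be tried without a file on disk. The following is a minimal sketch that assumes a small in-memory CSV; the text in csv_text and the names people, name, age and city are made up for illustration.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for 'data.csv' (it has no header row)
csv_text = "Asha,31,not available\nRavi,-1,Agra\nMeera,28,Pune\n"

people = pd.read_csv(
    io.StringIO(csv_text),
    header=None,                       # the first line is data, not column names
    names=["name", "age", "city"],     # supply the column names manually
    na_values={"age": [-1], "city": ["not available"]},  # per-column missing markers
)

print(people)
print(people.isna().sum())             # number of missing values per column
```

Because -1 is treated as missing, the age column is read as floating point rather than integer.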

Data Exploration
df.head(n) # Display the first n rows of the DataFrame
df.tail(n) # Display the last n rows of the DataFrame
df.sample(n) # Returns n random rows from the DataFrame
df.sample(frac=0.05) # Returns 5% random rows from the DataFrame
df.info() # Get a summary of the DataFrame, including data types and non-null values
df.describe() # Generate summary statistics for numerical columns
df.shape # Get the dimensions of the DataFrame in a tuple (rows, columns)
df.size # Returns the total number of elements in the DataFrame (rows × columns)
df.columns # List all column names in the DataFrame

Exploring and converting data types

df.dtypes # Get the data types of each column
df['age'].dtype # Get the data type of a specific column
df['salary'] = df['salary'].astype('float64') # astype() returns a new Series, so assign it back to convert the column to float64

Indexing
# By default, Pandas assigns an integer index (0, 1, 2, 3, ...) to each row
df.index # Shows the row labels, which can be numeric (default) or custom labels (like strings, datetime, etc.)
df = df.set_index('name') # Set 'name' column as the index
df = df.reset_index() # Reset the index to the default integer index, original index will be added as a new column
'John' in df.index # Checking if a value exists in the index, Returns True or False

Correlation
# Getting pairwise correlation of numeric columns
correlation_matrix = df.corr(numeric_only=True) # Ensures only numeric columns are used
# Default method is 'pearson'; 'kendall' or 'spearman' can also be used
correlation_matrix = df.corr(numeric_only=True, method='pearson')
# Access a specific correlation value between two columns using .loc
correlation_matrix.loc['age','salary'] # Correlation between column 'age' and 'salary'
Commonly used aggregation functions
df.max() # Maximum value in each column
df['age'].max() # Maximum value in the 'age' column
df[['age','height']].max() # Maximum value in the 'age' and 'height' columns
df['age'].min() # Minimum value of 'age' column
df['age'].idxmax() # Index label of the maximum value in 'age' column
df['age'].idxmin() # Index label of the minimum value in 'age' column
df.loc[df['age'].idxmax()] # Row with the maximum 'age' (idxmax() returns a label, so use .loc rather than .iloc)
df['age'].mean() # Mean of the 'age' column
df['age'].median() # Median of the 'age' column
df['age'].mode() # returns most frequent value(s) of 'age' column in a Series
mode_value = df['age'].mode()[0] # Accesses the first mode value from the Series
df['age'].sum() # Sum of the 'age' column
df['age'].count() # Number of non-null values in the 'age' column
df['age'].std() # Standard deviation of the 'age' column
df['age'].var() # Variance of the 'age' column
df['age'].prod() # Product of all values in the 'age' column
df['age'].quantile(0.25) # 25th percentile(first quartile) of 'age' column
df['age'].describe() # Summary statistics for the 'age' column
df['age'].cumsum() # Cumulative sum of 'age' column

# Use axis=1 to perform operations on rows, so you get results like row sums, row means, etc.
df.min(axis=1) # Minimum value in each row
df.sum(axis=1) # Sum of each row
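idxmax() returns an index label rather than a position, which only matters once the index is no longer the default 0, 1, 2, ... A small sketch, assuming a made-up DataFrame named ages with string labels:

```python
import pandas as pd

# Hypothetical data with a non-default (string) index
ages = pd.DataFrame(
    {"age": [25, 41, 33]},
    index=["asha", "ravi", "meera"],
)

oldest_label = ages["age"].idxmax()   # label of the row holding the maximum
oldest_row = ages.loc[oldest_label]   # label-based lookup, so .loc is used

print(oldest_label)
print(oldest_row)
```

With the default integer index the label and the position happen to coincide, which is why the difference is easy to overlook.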

Iterating over a DataFrame
for index, row in df.iterrows():
    print(index) # Print the index of the current row.
    print(row) # Print the data of the current row.

Getting Unique Values and their Frequencies

# unique() returns an array of the unique values in the 'Class' column of df
df['Class'].unique()
# nunique() returns the number of unique values in the 'Class' column of df
df['Class'].nunique()
# Get a count per category for categorical columns
# value_counts() returns a Series containing the count of occurrences of each unique value
df['Class'].value_counts() # Sorted in descending order by default, showing the most frequent values first

# Get the count of a specific category

count_grade_9 = df['Class'].value_counts()['Grade 9'] # Get the count of 'Grade 9' in 'Class' column

Identifying and Removing Duplicate Rows in a DataFrame

# Return a Boolean Series where True indicates a duplicate row.
df.duplicated()
# Count the number of duplicate (True) and unique (False) rows.
df.duplicated().value_counts()
# Remove all duplicate rows, keeping only the first occurrence.
df.drop_duplicates() # By default, it considers all columns
df.drop_duplicates(subset=['column1', 'column2']) # check for duplicates only in specific columns

Identifying and Removing Duplicate Columns in a DataFrame

df.T.duplicated() # Check for duplicate columns by transposing the DataFrame and applying duplicated()
df.T.duplicated().value_counts() # Count how many columns are unique (False) and how many are duplicates (True)
df = df.T.drop_duplicates().T # Remove duplicate columns
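As a quick illustration of the row-level calls above, here is a sketch on a made-up orders table (the names orders, customer and item are assumptions for the example only):

```python
import pandas as pd

# Hypothetical data: the third row repeats the first one exactly
orders = pd.DataFrame({
    "customer": ["A", "B", "A", "A"],
    "item": ["pen", "pen", "pen", "ink"],
})

dup_mask = orders.duplicated()                 # True for repeats of an earlier row
deduped = orders.drop_duplicates()             # compares all columns
first_per_customer = orders.drop_duplicates(subset=["customer"])  # compares 'customer' only

print(dup_mask.tolist())
print(deduped)
print(first_per_customer)
```

In each case the first occurrence is the one that is kept, which is the default behaviour of drop_duplicates().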
Data Selection
# Selecting columns
df['column'] # Select a single column as a Series
df[['column']] # Select a single column as a DataFrame
df[['col1', 'col2']] # Select multiple columns as a DataFrame

# Position-based selection using the iloc indexer

df.iloc[row_positions, column_positions] # Select data based on its numerical position in the dataframe
df.iloc[0] # Select a single row by position (here, the 0th row)
df.iloc[[0,2]] # Select multiple rows by position (here, the 0th and 2nd rows)
df.iloc[0,2] # Select specific rows and columns, (here,the value at 0th row and 2nd column)
# Slicing in iloc, start is included, end is excluded
df.iloc[0:2] # Slice rows from position 0 to 2 (excluding position 2)
df.iloc[:, 0:2] # Slice columns from position 0 to 2 (excluding position 2)
df.iloc[1:3, 0:2] # Rows 1 to 3 (excluding 3) and columns 0 to 2 (excluding 2)
df.iloc[:, 1] # Column at position 1
# Label-based selection using the loc indexer
df.loc[row_label, col_label] # Access rows and columns by label
df.loc['a'] # Select a single row by label
df.loc[['a', 'c']] # Select multiple rows by label
df.loc['a', 'City'] # Select a specific row and column
df.loc[['a', 'b'], ['Name', 'Age']] # Select a subset of rows and columns
df.loc[df['Age'] > 30] # Boolean indexing to filter rows based on a condition
# Slicing in loc: both start and end are included
df.loc['b':'d', 'Name':'City'] # Rows from 'b' to 'd' and columns from 'Name' to 'City'
df.loc['a':'c'] # Slice rows from 'a' to 'c'
df.loc[:, 'Name':'Age'] # Slice columns from 'Name' to 'Age'
df.loc['b':'d', 'Age'] # Rows from 'b' to 'd', only 'Age' column
df.loc[df['Age'] > 30, 'Name':'City'] # Select rows where 'Age' > 30 and specific columns
df.loc[df['Age'] > 25, 'Name'] # Slice rows with a condition and only 'Name' column
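The difference in slice endpoints between iloc and loc is easiest to see side by side. A minimal sketch, assuming a made-up DataFrame cities indexed by the labels 'a' to 'd':

```python
import pandas as pd

# Hypothetical data with string row labels
cities = pd.DataFrame(
    {
        "Name": ["Asha", "Ravi", "Meera", "Kabir"],
        "Age": [31, 24, 45, 38],
        "City": ["Agra", "Bareilly", "Lucknow", "Bangalore"],
    },
    index=["a", "b", "c", "d"],
)

by_position = cities.iloc[0:2]            # positions 0 and 1; the end position is excluded
by_label = cities.loc["a":"c"]            # labels 'a', 'b' and 'c'; the end label is included
name_to_age = cities.loc[:, "Name":"Age"] # column slices by label are inclusive too

print(by_position)
print(by_label)
print(name_to_age)
```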

Conditional Data selection: Filter rows based on a condition or multiple conditions

# Use relational operators (<, >, ==, >=, <=, !=) to filter rows
df[df['age'] > 10]

# Combine multiple conditions using logical operators & | ~

df[(df['age'] > 20) & (df['score'] > 85)] # Each condition should be enclosed in parentheses
df[(df['age'] > 10) | (df['height'] <150)]

# To filter rows where a column's value is in a specified list, use .isin()

df[df['city'].isin(['Bangalore', 'Bareilly', 'Agra'])]
df[~df['city'].isin(['Bangalore', 'Bareilly', 'Agra'])]

# Use .between() to filter rows where a column's values fall within a specified range
df[df['age'].between(20, 30)] # both 20 and 30 inclusive
df[df['age'].between(20, 30, inclusive='both')] # default - both 20 and 30 inclusive
df[df['age'].between(20, 30, inclusive='neither')] # Exclude both 20 and 30 from the filter
df[df['age'].between(20, 30, inclusive='left')] # Include 20 but exclude 30
df[df['age'].between(20, 30, inclusive='right')] # Exclude 20 but include 30

# Selecting specific columns after filtering

df[ cond1 & cond2][['col1', 'col2', 'col3']][:5]

# To filter rows where a column contains a specific string, use .str.contains()

df[df['Name'].str.contains('ux')]
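The filters above can be combined freely. The sketch below uses a made-up students table (all names and values are illustrative) to show between(), isin() and the & and ~ operators together:

```python
import pandas as pd

# Hypothetical data
students = pd.DataFrame({
    "name": ["Asha", "Ravi", "Meera", "Kabir"],
    "age": [19, 25, 30, 42],
    "city": ["Agra", "Bareilly", "Agra", "Bangalore"],
})

in_range = students[students["age"].between(20, 30)]                      # 20 and 30 both included
left_only = students[students["age"].between(20, 30, inclusive="left")]   # 30 excluded
not_agra_over_20 = students[(students["age"] > 20) & ~students["city"].isin(["Agra"])]

print(in_range["name"].tolist())
print(left_only["name"].tolist())
print(not_agra_over_20["name"].tolist())
```

Each condition is wrapped in parentheses because & and ~ bind more tightly than the comparison operators.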

Changing values using .loc
# Modify a single value
df.loc[1, 'age'] = 32 # Change 'age' of row with index 1(using integer index)
df.loc['Jim', 'age'] = 32 # Change 'age' of row with index 'Jim'( if 'name' column set as index)
# Modify multiple values
df.loc[0:1, 'city'] = 'Bengaluru' # Update 'city' for rows with index 0 and 1
# Modify based on a condition
df.loc[df['age'] > 30, 'age'] += 5 # Increase 'age' by 5 for rows where 'age' > 30
df.loc[df['gpa'] < 2.5, 'result'] = 'fail' # Assign 'fail' to the 'result' column for rows where 'gpa' < 2.5
# Update an entire row
df.loc[1] = ['N/A', 'N/A', 'N/A'] # Set all values for row with index 1 to 'N/A'
# Update an entire column
df.loc[:, 'city'] = 'Bareilly' # Set all values for 'city' column

Changing values using .iloc

# Modify a Single Value
df.iloc[0, 2] = 'Agra' # Change value at 0th row and 2nd column
# Modify Multiple Values
df.iloc[0:2, 1] = [28, 35] # For the 0th and 1st rows, Change 1st column values
# Update an entire row
df.iloc[1] = ['Ram', 32, 'Bareilly'] # Change the 1st row
# Update an entire column
df.iloc[:, 2] = 'Lucknow' # Set all values for 2nd column to 'Lucknow'

Faster single value updates using .at(label-based) or .iat(index-based)

df.at[1, 'name'] = 'Maruti' # Change 'name' of the row with index 1 to 'Maruti'
df.iat[1, 1] = 36 # Update 1st column of 1st row

Replacing Values
df = df.replace('Yes', 'Y') # Replace a single value in the entire DataFrame
df['result'] = df['result'].replace('Pass', 'P') # Replace a single value in a specific column
df['result'] = df['result'].replace(['Pass', 'Fail'],['P', 'F']) # Replace multiple values in a column
df['result'] = df['result'].map({'Pass':'P', 'Fail':'F'}) # Replace multiple values using map (values not in the dict become NaN)
df['result'] = df['result'].replace(['Pass', 'Good', 'Satisfactory'], '1') # Replace multiple values with a single value

# Replacing multiple values in multiple columns

df = df.replace(
    {
        'result': ['Pass', 'Fail'],
        'grade': ['A', 'B', 'C', 'D', 'F']
    },
    {
        'result': ['P', 'F'],
        'grade': ['4', '3', '2', '1', '0']
    }
)
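replace() and map() look interchangeable in the lines above, but they treat unlisted values differently. A short sketch on a made-up Series named results:

```python
import pandas as pd

# Hypothetical data: 'Absent' does not appear in the mapping
results = pd.Series(["Pass", "Fail", "Absent"])
codes = {"Pass": "P", "Fail": "F"}

replaced = results.replace(codes)   # values without a match are left unchanged
mapped = results.map(codes)         # values without a match become NaN

print(replaced.tolist())
print(mapped.tolist())
```

replace() is therefore the safer choice when only some values should change, while map() suits cases where every value is expected to have a mapping.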

Clipping : Limiting the values within a specified range(useful in handling outliers)

# Clip values in column 'A' to be between 15 and 40
df['A'] = df['A'].clip(lower=15, upper=40) # Values lower than 15 replaced by 15, Values greater than 40 replaced by 40
# Clip all values in entire DataFrame to be between 10 and 30
df_clipped = df.clip(lower=10, upper=30)
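# Clip column 'A' to be between 15 and 40, and column 'B' to be between 10 and 40
# Per-column bounds are passed as Series indexed by column name, aligned with axis=1
df_clipped = df[['A', 'B']].clip(lower=pd.Series({'A': 15, 'B': 10}), upper=pd.Series({'A': 40, 'B': 40}), axis=1)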

Adding or modifying columns

# Add a new column or modify if the column already exists
df['country'] = 'India' # Creates a column with the same value for all rows (or updates if column exists)
df['age_in_months'] = df['age'] * 12 # Creates a column using an existing column (or updates if column exists)
df['total'] = df['marksA'] + df['marksB'] # Using a Calculation with multiple Columns
df['Maximum Marks'] = df['Maximum Marks'] + 10 # Modify the column
# Vectorized operation to add/modify a column based on 2 columns
df['type'] = np.where((df['age'] < 15) & (df['medals'] > 5), 'exceptional', 'normal')
Apply a function to add/modify a column
df['subject'] = df['subject'].apply(lambda name: name.strip().lower()) # Strip and lowercase text

def age_category(age):
    return 'Young' if age < 18 else 'Adult'

df['category'] = df['age'].apply(age_category) # Apply age_category function to 'age' column

def func(age, medals):
    if age < 15 and medals > 5:
        return 'exceptional'
    else:
        return 'normal'

# np.vectorize applies func element by element across the two columns
# (a convenience wrapper around a Python loop, not a true vectorized speed-up)
df['type'] = np.vectorize(func)(df['age'], df['medals'])
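For simple conditions, the row-wise function above and the np.where() form shown earlier give the same answer. A minimal sketch comparing them on a made-up athletes table (the name classify is illustrative):

```python
import numpy as np
import pandas as pd

# Hypothetical data
athletes = pd.DataFrame({"age": [12, 14, 20], "medals": [7, 3, 9]})

def classify(age, medals):
    """Plain-Python rule applied to one row at a time."""
    return "exceptional" if age < 15 and medals > 5 else "normal"

# Whole-column comparison, evaluated by NumPy in one pass
with_where = np.where(
    (athletes["age"] < 15) & (athletes["medals"] > 5), "exceptional", "normal"
)

# Row-by-row call of the Python function
with_apply = athletes.apply(lambda row: classify(row["age"], row["medals"]), axis=1)

print(list(with_where))
print(with_apply.tolist())
```

np.where() is usually preferable on large frames because the condition is evaluated on whole columns instead of calling a Python function once per row.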

Removing columns
df = df.drop('name', axis=1) # Drop the 'name' column
df = df[['name','age','phone']] # Keep only 'name', 'age', and 'phone' columns, other columns dropped

Removing rows based on label index

df = df.drop('134ABC',errors='ignore') # Avoid error if index doesn't exist

Changing column names

df = df.rename(columns={'col1':'new_col1','col2':'new_col2', 'col3':'new_col3'}) # Rename specific columns
df.columns = ['new_col1', 'new_col2', 'new_col3'] # Rename all columns at once(make sure the length matches)
df.columns = [col.replace('%', '') for col in df.columns] # Remove '%' from column names

Finding missing Values
df.isnull() # Returns a DataFrame with True for missing values
df.notnull() # Detect non-missing values
df.isnull().sum() # Count Missing Values in Each Column
df.isnull().values.any() # Check if Any Missing Value Exists
df[df.isnull().any(axis=1)] # Locate Rows with Missing Values

Filling missing values

# Fill all missing values in the DataFrame with 'unknown'
df = df.fillna('unknown')

# Fill all missing values in column 'gender' with 'unknown'

df['gender'] = df['gender'].fillna('unknown')

# Fill missing values in different columns with different values

df = df.fillna({'gender': 'unknown', 'age': -1, 'country': 'India'})

# Fill missing values in a column with column mean/median/mode

df['column_name'] = df['column_name'].fillna(df['column_name'].mean())
df['column_name'] = df['column_name'].fillna(df['column_name'].median())
df['column_name'] = df['column_name'].fillna(df['column_name'].mode()[0])

# Fill Missing values with the Previous/Next Value

df = df.ffill() # Forward fill (uses previous row's value); fillna(method='ffill') is deprecated in recent pandas versions
df = df.bfill() # Backward fill (uses next row's value); fillna(method='bfill') is deprecated in recent pandas versions
# Fill missing values with interpolation
df['column_name'] = df['column_name'].interpolate() # Interpolate between available values
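The three gap-filling strategies above behave differently on the same data. A small sketch on a made-up Series named readings with two consecutive gaps:

```python
import numpy as np
import pandas as pd

# Hypothetical data with two missing values in the middle
readings = pd.Series([10.0, np.nan, np.nan, 40.0])

forward = readings.ffill()        # carry the previous value forward
backward = readings.bfill()       # pull the next value backward
linear = readings.interpolate()   # default is linear interpolation between neighbours

print(forward.tolist())   # [10.0, 10.0, 10.0, 40.0]
print(backward.tolist())  # [10.0, 40.0, 40.0, 40.0]
print(linear.tolist())    # [10.0, 20.0, 30.0, 40.0]
```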

Removing rows/columns that contain missing values
# Drop rows with any missing values
df = df.dropna()
# Drop rows with all values missing
df = df.dropna(how='all')
# Drop rows where 'price' or 'quantity' is missing (only these two columns are checked)
df = df.dropna(subset=['price', 'quantity'])
# Drop rows that have fewer than 4 non-missing values, keeps rows where at least 4 columns have valid data
df = df.dropna(thresh=4) # thresh denotes minimum number of non-NaN values
# Drop columns with any missing values
df = df.dropna(axis=1)
# Drop any columns that have fewer than 100 non-null values in them
df = df.dropna(axis=1, thresh=100)
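The how, subset and thresh options are easy to confuse, so here is a sketch that applies each one to the same made-up sales table (all names and values are illustrative):

```python
import numpy as np
import pandas as pd

# Hypothetical data: row 0 is complete, row 1 lacks a price, row 2 is entirely missing
sales = pd.DataFrame({
    "price": [10.0, np.nan, np.nan],
    "quantity": [1.0, 2.0, np.nan],
    "store": ["A", "B", None],
})

any_missing = sales.dropna()                               # keeps only fully populated rows
all_missing = sales.dropna(how="all")                      # drops only rows that are entirely missing
key_columns = sales.dropna(subset=["price", "quantity"])   # checks just these two columns
at_least_two = sales.dropna(thresh=2)                      # keeps rows with 2 or more non-missing values

print(len(any_missing), len(all_missing), len(key_columns), len(at_least_two))
```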

Copying a DataFrame or a part of it

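# Plain assignment creates a second name for the same DataFrame, so modifying one affects the other
df_1 = df
# Filtering returns a new object, but modifying it can trigger a SettingWithCopyWarning in some pandas versions
df_adults = df[df['age'] > 18]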

# Create an independent copy using copy()

# Copying entire DataFrame
df_copy = df.copy()

# Copying Rows
some_rows = df.iloc[:2].copy() # Copy the first two rows
df_adults = df[df['age'] > 18].copy() # Copy rows where 'age' > 18

# Copying Columns
df_1 = df[['name', 'city']].copy() # Copy only the 'name' and 'city' columns
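The difference between a second reference and an independent copy can be checked directly. A minimal sketch, assuming a made-up two-row DataFrame:

```python
import pandas as pd

# Hypothetical data
df = pd.DataFrame({"name": ["Asha", "Ravi"], "age": [17, 30]})

alias = df                 # a second name for the same object
independent = df.copy()    # a separate object with its own data

alias.loc[0, "age"] = 18   # a change made through the alias

print(alias is df)                 # True: both names refer to one DataFrame
print(df.loc[0, "age"])            # 18: the original sees the change
print(independent.loc[0, "age"])   # 17: the copy is unaffected
```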

Sorting values in columns
# Sort by single column, default is ascending order
df = df.sort_values('column_name')
df = df.sort_values(by='column_name') # Using the 'by' parameter for better clarity

# Sort by single column in descending order

df = df.sort_values(by='column_name', ascending = False)

# Sort by col1(ascending order) then by col2(descending order)

df = df.sort_values(by=['col1', 'col2'], ascending=[True, False])

# Sort Rows by Index, default is ascending order

df.sort_index()

# Sort Rows by Index in descending order

df.sort_index(ascending=False)

# Sort column names, default is ascending order

df.sort_index(axis=1)

# Descending Column Names

df.sort_index(axis=1, ascending=False)
# key parameter for custom sorting
df.sort_values(by='column_name', key=lambda col: col.str.len()) # Sorting strings by length

# Filter the rows based on condition, select specific columns and then sort on columns 'col2'
df[condition][['col1', 'col2', 'col3']].sort_values('col2')

Getting n largest and n smallest values for a column

# Using nsmallest and nlargest() is more efficient than using sort_values with head() or tail()
df.nlargest(3, 'Marks') # Get the 3 rows with the largest values in the 'Marks' column
df.nsmallest(3, 'Marks') # Get the 3 rows with the smallest values in the 'Marks' column

concat() : Combine DataFrames either vertically (stacking rows) or horizontally (joining columns)
# Concatenate vertically (stack rows).
# If DataFrames have different columns, missing columns are filled with NaN values
df_vertical = pd.concat([df1, df2], axis=0, ignore_index=True) # ignore_index=True resets the index after concatenation
# Concatenate horizontally (join columns)
# If indices do not match for some rows, it will result in NaN values for the rows that don't exist in one of the DataFrames
df_horizontal = pd.concat([df1, df2], axis=1) # rows are aligned by their index
# Concatenate with keys to distinguish the DataFrames; adds hierarchical indexing
# (do not combine keys with ignore_index=True, which would discard the keys)
df_all = pd.concat([males_df, females_df], axis=0, keys=['males', 'females'])
df_all.loc['males'] # Access data by key
df_all.loc['females'] # Access data by key
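The keys argument and ignore_index serve opposite purposes, which the following sketch illustrates with two tiny made-up DataFrames:

```python
import pandas as pd

# Hypothetical data
males_df = pd.DataFrame({"name": ["Ravi", "Kabir"]})
females_df = pd.DataFrame({"name": ["Asha"]})

# keys adds an outer index level recording which DataFrame each row came from
df_all = pd.concat([males_df, females_df], keys=["males", "females"])

# ignore_index throws the old index away and numbers the rows 0, 1, 2, ...
flat = pd.concat([males_df, females_df], ignore_index=True)

print(df_all)
print(df_all.loc["males"])
print(flat)
```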

merge(): Combine two DataFrames based on common columns or indices

pd.merge(df1, df2, on='ID', how='inner')
pd.merge(df1, df2, on='ID') # by default inner merge
pd.merge(df1, df2, on='ID', how='left')
pd.merge(df1, df2, on='ID', how='right')
pd.merge(df1, df2, on='ID', how='outer')

# Joining with different column names in the two DataFrames

pd.merge(df1, df2, left_on='ID', right_on='EmpID')
# Joining on column in left DataFrame and index in right DataFrame
pd.merge(df1, df2, left_on='ID', right_index=True)
# Merge with indicator flag, adds a new column to the resulting DataFrame called _merge
# _merge column indicates whether each row comes from left only, right only or both DataFrames
pd.merge(df1, df2, on='ID', how='outer', indicator=True)
# Merge with custom suffixes
# By default _x and _y added to columns with same names in both DataFrames(except column(s) used to join)
pd.merge(df1, df2, on='ID', how='outer', suffixes=('_df1', '_df2')) # Adds suffixes to the overlapping column names
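To see how the how and indicator options interact, here is a sketch that merges two made-up tables sharing only some IDs (employees and salaries are hypothetical names):

```python
import pandas as pd

# Hypothetical data: IDs 2 and 3 appear in both tables
employees = pd.DataFrame({"ID": [1, 2, 3], "name": ["Asha", "Ravi", "Meera"]})
salaries = pd.DataFrame({"ID": [2, 3, 4], "salary": [50, 60, 70]})

inner = pd.merge(employees, salaries, on="ID")               # only IDs present in both
left = pd.merge(employees, salaries, on="ID", how="left")    # every employee, salary may be NaN
outer = pd.merge(employees, salaries, on="ID", how="outer", indicator=True)

print(inner)
print(left)
print(outer[["ID", "_merge"]])   # _merge shows where each row came from
```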

Grouping and aggregating data : Analysing data per category
Common aggregations
# Group by 'column_name' and apply aggregation functions
df.groupby('column_name').sum(numeric_only=True) # Sum of numeric columns
df.groupby('column_name').mean(numeric_only=True) # Mean of numeric columns (without numeric_only=True, text columns raise an error in pandas 2.x)
df.groupby('column_name').count() # Count of non-null values
df.groupby('column_name').min() # Minimum value for each group
df.groupby('column_name').max() # Maximum value for each group
df.groupby('column_name').describe() # Summary statistics for each group
df.groupby('column_name').plot() # Plot grouped data
df.groupby('grade').mean(numeric_only=True) # Group by 'grade' column and apply mean() for all numeric columns
df.groupby('grade')['height'].mean() # Group by 'grade' column and apply mean() only for 'height' column
df.groupby('grade')[['height', 'age']].mean() # Group by 'grade' column and apply mean() for 'height' and 'age' columns

Multiple aggregations using agg()

# Apply min(), mean() and max() for 'height' and 'age' columns
df.groupby('grade')[['height', 'age']].agg(['min', 'mean', 'max'])

# Apply max() for 'age' column and mean() for 'height' column
df.groupby('grade').agg({ 'age': 'max', 'height': 'mean' })

Custom aggregation
df.groupby('grade').agg(func) # Apply custom function func()
df.groupby('store').agg({'sales': ['sum', 'mean'],
                         'items_sold': lambda x: x.sum() / len(x)})

Multilevel grouping - grouping data by multiple columns

df.groupby(['grade', 'section']).mean(numeric_only=True) # grouped first by 'grade' then by 'section'
df.groupby(['grade', 'section']).mean(numeric_only=True).reset_index() # convert the hierarchical index into a flat table

Groupby on filtered dataset

df[['name','age','grade']].groupby('grade').mean(numeric_only=True) # 'name' is text, so it is skipped
df[df['age'] > 10].groupby('grade').mean(numeric_only=True)
df[df['grade'].isin(['4','6'])].groupby('grade').mean(numeric_only=True)
# After grouping, the grouped column(s) become the index; to reset the index, call reset_index()
df.groupby('column_name').sum().reset_index()

# Prevent the grouped column from becoming the index

df.groupby('column_name', as_index=False).sum() # as_index=False keeps 'column_name' as a regular column

# Include or exclude entire groups based on some condition

df.groupby('column_name').filter(func) # Apply custom function 'func' to filter groups
# Include only those grades that contain more than 10 rows
df.groupby('grade').filter(lambda x : x.shape[0] > 10)

# Include only those grades for which average height is greater than 150
df.groupby('grade').filter(lambda x : x['height'].mean() > 150)

# Group by 'grade' column, find sum of numeric columns, sort the resulting DataFrame by the 'age' column
df.groupby('grade').sum(numeric_only=True).sort_values('age')
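Several of the grouping calls above can be exercised on one small table. This is a sketch with made-up data (pupils and its values are illustrative), showing agg() with a list of functions, filter() and as_index=False:

```python
import pandas as pd

# Hypothetical data
pupils = pd.DataFrame({
    "grade": ["4", "4", "5", "5", "6"],
    "height": [141, 145, 146, 149, 152],
    "age": [9, 10, 10, 11, 12],
})

# One row per grade; columns form two levels such as ('height', 'mean')
stats = pupils.groupby("grade")[["height", "age"]].agg(["min", "mean", "max"])

# Keep the original rows of grades whose average height exceeds 145
tall_grades = pupils.groupby("grade").filter(lambda g: g["height"].mean() > 145)

# Keep 'grade' as a regular column instead of the index
flat = pupils.groupby("grade", as_index=False)["height"].mean()

print(stats)
print(tall_grades)
print(flat)
```

Note that filter() returns rows of the original DataFrame rather than one row per group.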

Grouping and aggregation works by split-apply-combine

# Finding average height of students in each grade
df.groupby('grade')['height'].mean()

Split: the rows are separated into one group per grade
Grade   Height
4       141
4       145
4       140

Grade   Height
5       146
5       149

Grade   Height
6       152
6       150
6       149

Apply (mean): the mean height is computed within each group
Grade   Height
4       142.0

Grade   Height
5       147.5

Grade   Height
6       150.3

Combine: the per-group results are assembled into a single result
Grade   Height
4       142.0
5       147.5
6       150.3
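The three stages can also be carried out by hand to confirm that they match the one-line groupby call. The sketch below uses the heights from the illustration; the row order and the variable names are assumptions made for the example.

```python
import pandas as pd

# Heights per grade as in the illustration (row order is an assumption)
heights = pd.DataFrame({
    "grade": [4, 5, 4, 6, 5, 6, 4, 6],
    "height": [141, 146, 145, 152, 149, 150, 140, 149],
})

# Split: one Series of heights per grade
groups = {grade: part["height"] for grade, part in heights.groupby("grade")}

# Apply: compute the mean inside each group
means = {grade: values.mean() for grade, values in groups.items()}

# Combine: assemble the per-group results into one Series
by_hand = pd.Series(means, name="height")

# The same result in a single call
shortcut = heights.groupby("grade")["height"].mean()

print(by_hand.round(1))
print(shortcut.round(1))
```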
Iterating through groups
g = df.groupby('grade') # creates a GroupBy object by grouping the rows based on the unique values in the column 'grade'
for grade, data in g:
    print('Grade –', grade) # Print the group name (grade)
    print('Data – ')
    print(data) # Print the data for the specific group
    print('\n')

# Apply aggregate functions on the GroupBy object

g.mean(numeric_only=True) # Returns the mean value for each numeric column in each group
g.max() # Returns the maximum value for each column in each group
g.size() # Returns the size of each group (number of rows per group)

Retrieve the DataFrame for a specific group

g = df.groupby('grade')
g.get_group('6')
df.groupby('column_name').get_group('group_value')
# Group by 'region' and 'store', get data for region='East' and store='A'
df.groupby(['region', 'store']).get_group(('East', 'A'))

Working with strings

# .str accessor used to apply string methods to columns
df['name_stripped'] = df['name_with_spaces'].str.strip() # Remove leading and trailing spaces
df['name_replaced'] = df['name'].str.replace('a', '@') # Replace 'a' with '@' in the 'name' column
df['country'].str.upper().value_counts() # Convert 'country' column to uppercase and get value counts

df['name_split'] = df['name'].str.split(' ') # Split the 'name' column into a list of words
df['first_name'] = df['name'].str.split(' ').str[0] # Get the first name (first element of list)
# Split the 'name' column into first and last name and expand into separate columns
df[['first_name', 'last_name']] = df['name'].str.split(' ', n=1, expand=True) # n=1 splits at the first space only, so exactly two columns are produced
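Splitting into a fixed number of columns only works if every row yields the same number of pieces. A short sketch on made-up names, one of which has three words:

```python
import pandas as pd

# Hypothetical data: one name has surrounding spaces, one has three words
names = pd.DataFrame({"name": ["  Asha Rao ", "Ravi Kumar Singh"]})

names["name"] = names["name"].str.strip()   # remove leading and trailing spaces first

# n=1 limits the split to the first space, so there are at most two pieces per row
names[["first_name", "last_name"]] = names["name"].str.split(" ", n=1, expand=True)

print(names)
```

Without n=1, the three-word name would produce a third column and the assignment to two column names would fail.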

Working with datetime
# Converting a Column to Datetime
df['dob'] = pd.to_datetime(df['dob']) # Convert the 'dob' column to datetime

# Extract various components (year, month, day) from a datetime column.

df['birth_year'] = df['dob'].dt.year # Extract year from 'dob'
df['birth_month'] = df['dob'].dt.month # Extract month from 'dob'
df['birth_day'] = df['dob'].dt.day # Extract day from 'dob'
df['dob'].dt.weekday # Get weekday (integer: Monday=0, Sunday=6)
df['dob'].dt.day_name() # Get the name of the day (e.g., 'Monday', 'Tuesday')

# Convert a datetime object back to a string with a specific format using strftime
df['formatted_date'] = df['date'].dt.strftime('%Y-%m-%d') # Format datetime as 'YYYY-MM-DD'

# Calculate the difference between two datetime columns using subtraction, which results in Timedelta values
df['duration'] = df['end_date'] - df['start_date'] # Time difference between 'end_date' and 'start_date'
df['days'] = df['duration'].dt.days # Extract the number of days from the Timedelta

# Filtering by Date
df[df['date'] > '2025-01-01'] # Filter rows where 'date' is after January 1, 2025

Inserting Missing Dates

# Create a complete date range from January 1, 2024, to January 8, 2024 with daily frequency
daterange = pd.date_range('2024-01-01', '2024-01-08', freq='D')

# Ensure the 'date' column is a datetime type and set it as index

df['date'] = pd.to_datetime(df['date'])
df.set_index('date', inplace=True)

# Reindex the DataFrame df to the new date range, Missing dates will create new rows with NaN values in all columns
df = df.reindex(daterange)
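The reindexing steps above can be followed end to end on a tiny example. This sketch assumes a made-up visits table with two dates missing from a four-day range:

```python
import pandas as pd

# Hypothetical data: 2 January and 4 January are absent
visits = pd.DataFrame({"date": ["2024-01-01", "2024-01-03"], "count": [5, 7]})
visits["date"] = pd.to_datetime(visits["date"])
visits = visits.set_index("date")

full_range = pd.date_range("2024-01-01", "2024-01-04", freq="D")

with_gaps = visits.reindex(full_range)                  # new rows are filled with NaN
with_zeros = visits.reindex(full_range, fill_value=0)   # or with a chosen default value

print(with_gaps)
print(with_zeros)
```

Passing fill_value is a convenient alternative when a missing date should count as zero rather than as unknown.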

Learn Python with 650+ Programs, 900+ Practice Questions, and 5 Projects

