Pandas Pro Level Cheat Sheet
Import Pandas
import pandas as pd
Data Structures:
Series: pd.Series(data, index=index)
DataFrame: pd.DataFrame(data, columns=columns)
Index: An immutable array that is used to reference data.
Example: index = pd.Index(['one', 'two', 'three', 'four', 'five'])
MultiIndex: Hierarchical indexing, allowing for more than one index level on an axis.
Example (requires import numpy as np):
arrays = [np.array(['A', 'A', 'B', 'B']), np.array([1, 2, 1, 2])]
multi_index = pd.MultiIndex.from_arrays(arrays, names=('letters', 'numbers'))
DatetimeIndex: A specialized Index for datetime data.
Example: datetime_index = pd.date_range('20220101', periods=6)
PeriodIndex: A specialized Index for Period data.
Example: period_index = pd.period_range('2022-01', periods=6, freq='M')
TimedeltaIndex: A specialized Index for timedelta data.
Example: timedelta_index = pd.timedelta_range('1 days', periods=5)
Categorical: A data type for categorical data.
Example: categories = pd.Categorical(['a', 'b', 'c', 'a'], categories=['a', 'b', 'c'])
Sparse data: Efficiently represent sparse data (data with a large number of zero or missing
values). SparseDataFrame was removed in pandas 1.0; use a sparse array/dtype instead.
Example: df_sparse = pd.DataFrame({'A': pd.arrays.SparseArray([0, 0, 0, 1, 0, 2, 0], fill_value=0)})
Panel: (Removed in pandas 1.0) A three-dimensional labeled data structure. Use a MultiIndex
DataFrame or the xarray library instead.
Example (MultiIndex replacement): panel_like = pd.concat({'Item1': df1, 'Item2': df2, 'Item3': df3})
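The structures above can be exercised in one short sketch; all the values and names below are illustrative:

```python
import numpy as np
import pandas as pd

# Series with an explicit Index
s = pd.Series([10, 20, 30], index=pd.Index(['one', 'two', 'three']))

# DataFrame from a dict of columns
df = pd.DataFrame({'A': [1, 2, 3, 4], 'B': [5.0, 6.0, 7.0, 8.0]})

# MultiIndex from parallel arrays
arrays = [np.array(['A', 'A', 'B', 'B']), np.array([1, 2, 1, 2])]
multi_index = pd.MultiIndex.from_arrays(arrays, names=('letters', 'numbers'))

# Date / period / timedelta indexes
datetime_index = pd.date_range('2022-01-01', periods=6)
period_index = pd.period_range('2022-01', periods=6, freq='M')
timedelta_index = pd.timedelta_range('1 days', periods=5)

# Categorical data, and a sparse column (the modern replacement for SparseDataFrame)
categories = pd.Categorical(['a', 'b', 'c', 'a'], categories=['a', 'b', 'c'])
df_sparse = pd.DataFrame({'A': pd.arrays.SparseArray([0, 0, 0, 1, 0, 2, 0], fill_value=0)})
```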
Reading and Writing Data:
Read CSV: df = pd.read_csv('file.csv')
Write CSV: df.to_csv('output.csv', index=False)
Read Excel: df = pd.read_excel('file.xlsx', sheet_name='Sheet1')
Write Excel: df.to_excel('output.xlsx', sheet_name='Sheet1', index=False)
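A minimal round-trip through CSV, using a temporary directory so the sketch is self-contained (the filename is illustrative):

```python
import os
import tempfile
import pandas as pd

df = pd.DataFrame({'name': ['a', 'b'], 'score': [1, 2]})

# Write a CSV and read it straight back
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'output.csv')
    df.to_csv(path, index=False)   # index=False keeps the row labels out of the file
    df_back = pd.read_csv(path)
```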
Basic DataFrame Operations:
Display DataFrame: Print the DataFrame to view its contents.
Example: print(df)
Head and Tail: Display the first and last n rows of the DataFrame.
Example: df.head(n), df.tail(n)
Info and Describe: Get information about the DataFrame, including data types and non-null values.
Describe provides basic statistics.
Example: df.info(), df.describe()
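A quick sketch of these inspection calls on a toy frame:

```python
import pandas as pd

df = pd.DataFrame({'x': range(10), 'y': [v * 2.0 for v in range(10)]})

first = df.head(3)       # first 3 rows
last = df.tail(3)        # last 3 rows
df.info()                # dtypes, memory usage, non-null counts (prints to stdout)
summary = df.describe()  # count, mean, std, min, quartiles, max per numeric column
```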
Selection and Indexing:
Select Column: Retrieve a specific column from the DataFrame.
Example: df['column_name']
Select Row by Index: Retrieve a specific row by its index.
Example: df.loc[index]
Select Rows and Columns by Index: Retrieve specific rows and columns based on index values.
Example: df.loc[start_index:end_index, ['col1', 'col2']]
Filtering Data: Filter rows based on a condition.
Example: df[df['column_name'] > value]
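The four selection patterns above, on a small labeled frame (names are illustrative):

```python
import pandas as pd

df = pd.DataFrame({'name': ['a', 'b', 'c'], 'score': [10, 20, 30]},
                  index=['r1', 'r2', 'r3'])

col = df['score']                    # one column, as a Series
row = df.loc['r2']                   # one row, by label
sub = df.loc['r1':'r2', ['name']]    # label slices with .loc include both endpoints
high = df[df['score'] > 15]         # boolean filtering keeps matching rows
```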
Manipulating DataFrames:
Add Column: Add a new column to the DataFrame.
Example: df['new_column'] = values
Drop Column: Remove a column from the DataFrame.
Example: df.drop('column_name', axis=1, inplace=True)
Rename Columns: Rename columns in the DataFrame.
Example: df.rename(columns={'old_name': 'new_name'}, inplace=True)
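The add/drop/rename trio in one sketch; here the non-inplace forms are used, which return a modified copy:

```python
import pandas as pd

df = pd.DataFrame({'old_name': [1, 2, 3]})

df['new_column'] = [10, 20, 30]                    # add a column
df = df.rename(columns={'old_name': 'new_name'})   # rename without inplace returns a copy
df = df.drop('new_column', axis=1)                 # drop a column
```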
More on Selection and Indexing:
1. Selecting Columns:
● By Label: df['column_name']
● By List of Labels: df[['col1', 'col2', 'col3']]
2. Selecting Rows:
● By Label (loc): df.loc[index]
● By Labels Range (loc): df.loc[start_index:end_index]
● By Position (iloc): df.iloc[row_position]
3. Selecting Rows and Columns:
● By Label (loc): df.loc[start_index:end_index, ['col1', 'col2']]
● By Position (iloc): df.iloc[row_start:row_end, col_start:col_end]
4. Conditional Selection:
● Filtering Rows Based on a Condition: df[df['column_name'] > value]
● Filtering Rows with Multiple Conditions: df[(df['col1'] > value1) & (df['col2'] < value2)]
5. Selecting by Data Type:
● Selecting Numeric Columns: df.select_dtypes(include='number')
● Selecting Categorical Columns: df.select_dtypes(include='category')
6. Indexing Techniques:
● Setting a Column as Index: df.set_index('column_name', inplace=True)
● Resetting Index: df.reset_index(inplace=True)
7. Hierarchical Indexing (MultiIndex):
● Creating a MultiIndex: df.set_index(['index_col1', 'index_col2'], inplace=True)
● Selecting from a MultiIndex: df.loc[('label1', 'label2')] (pass a tuple so both index levels are matched; df.loc['label1', 'label2'] would be read as row 'label1', column 'label2')
8. Slicing and Dicing:
● Slicing Rows: df[start_index:end_index]
● Slicing Columns: df.loc[:, 'col_start':'col_end']
9. Working with Dates and Times:
● Selecting Rows Based on Date Range: df.loc['start_date':'end_date']
● Resampling Time Series Data: df.resample('D').mean() (requires a DatetimeIndex)
10. Other Useful Methods:
● isin Method for Filtering: df[df['column'].isin(['value1', 'value2'])]
● query Method for Expressive Filtering: df.query('col1 > value1 and col2 < value2')
● Conditional Assignment: df.loc[df['column'] > value, 'new_column'] = 'New Value'
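Several of the idioms above, exercised together on a toy frame (column names are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    'col1': [1, 5, 3, 8],
    'col2': [9, 2, 7, 4],
    'cat': pd.Categorical(['x', 'y', 'x', 'y']),
})

by_pos = df.iloc[1:3, 0:2]                        # positional slice, end-exclusive
both = df[(df['col1'] > 2) & (df['col2'] < 8)]    # combine conditions with & / | and parentheses
nums = df.select_dtypes(include='number')         # only numeric columns
q = df.query('col1 > 2 and col2 < 8')             # same filter, written as a query string
df.loc[df['col1'] > 4, 'flag'] = 'big'            # conditional assignment creates 'flag'
```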
More on Manipulating DataFrames:
1. Add Column:
Add a new column to the DataFrame. Example: df['new_column'] = values
2. Drop Column:
Remove a column from the DataFrame. Example: df.drop('column_name', axis=1, inplace=True)
3. Rename Columns:
Rename columns in the DataFrame. Example: df.rename(columns={'old_name': 'new_name'},
inplace=True)
4. Handling Missing Data:
● Check for Missing Values: df.isnull(), df.notnull()
● Drop Missing Values: df.dropna()
● Fill Missing Values: df.fillna(value)
5. Grouping and Aggregation:
● Group By: df.groupby('column_name')
● Aggregation Functions: df['column_name'].agg(['mean', 'sum', 'count'])
6. Merging and Concatenation:
● Concatenate DataFrames: pd.concat([df1, df2], axis=0, ignore_index=True)
● Merge DataFrames: pd.merge(df1, df2, on='common_column', how='inner')
7. Time Series Operations:
● Convert to DateTime: df['datetime_column'] = pd.to_datetime(df['datetime_column'])
● Resample Time Series: df.resample('D').mean()
● Shift and Lag: df['lagged_column'] = df['column'].shift(periods=n)
8. Pivot Tables:
● Create Pivot Table: pivot_table = pd.pivot_table(df, values='values', index='index_column',
columns='column_to_pivot', aggfunc='sum') (prefer string names like 'sum'; passing np.sum raises a FutureWarning in recent pandas)
9. Advanced Functions:
● Apply Function: df.apply(lambda x: custom_function(x), axis=1)
10. Window Functions (rolling):
df['rolling_mean'] = df['column'].rolling(window=n).mean()
11. Custom Functions with np.vectorize:
vectorized_function = np.vectorize(custom_function)
df['new_column'] = vectorized_function(df['column'])
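A compact sketch touching several of the operations in this section (fill, group, roll, vectorize); values are illustrative:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'grp': ['a', 'a', 'b', 'b'], 'val': [1.0, np.nan, 3.0, 4.0]})

df['val'] = df['val'].fillna(df['val'].mean())          # fill NaN with the column mean (8/3)
stats = df.groupby('grp')['val'].agg(['mean', 'sum'])   # one row of stats per group
df['rolling'] = df['val'].rolling(window=2).mean()      # NaN until the window fills

# np.vectorize wraps a plain Python function for element-wise use (a convenience, not a speedup)
double = np.vectorize(lambda x: x * 2)
df['doubled'] = double(df['val'])
```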
More on Handling Missing Data:
1. Checking for Missing Values:
● Check for Missing Values:
● df.isnull(): Returns a DataFrame of the same shape as df with True where missing
values are present, False otherwise.
● df.notnull(): Returns the opposite of df.isnull().
2. Removing Missing Values:
● Drop Missing Values:
● df.dropna(): Removes rows containing any missing values.
● df.dropna(subset=['col1', 'col2']): Removes rows where specific columns have
missing values.
● df.dropna(axis=1): Removes columns containing any missing values.
3. Filling Missing Values:
● Fill Missing Values:
● df.fillna(value): Fills missing values with a specific constant value.
● df.fillna(df.mean()): Fills missing values with the mean of each column.
● df.ffill(): Forward fills missing values using the previous value (df.fillna(method='ffill') is deprecated).
● df.bfill(): Backward fills missing values using the next value (df.fillna(method='bfill') is deprecated).
4. Interpolation:
● Linear Interpolation:
● df.interpolate(): Performs linear interpolation to fill missing values.
5. Imputation:
● Simple Imputation:
● Using statistical measures like mean, median, or mode to fill missing values.
● Impute with Sklearn:
● Using Scikit-Learn's imputation methods for more advanced imputation strategies.
6. Handling Missing Values in Time Series Data:
● Forward and Backward Filling:
● Particularly useful for time series data.
● df.ffill(): Forward fills missing values.
● df.bfill(): Backward fills missing values.
7. Handling Missing Values in Categorical Data:
● Fill Categorical Missing Values:
● df['categorical_column'].fillna(value): Fills missing values in a specific categorical
column.
● Fill Categorical with Mode:
● df['categorical_column'].fillna(df['categorical_column'].mode()[0]): Fills missing
values with the mode.
8. Handling Missing Values in Text Data:
● Fill Text Data:
● df['text_column'].fillna(value): Fills missing values in a text column.
9. Handling Missing Values in Specific Columns:
● Conditional Filling:
● Filling missing values in a column based on conditions.
10. Handling Missing Values in Grouped Data:
● Grouped Imputation:
● Imputing missing values based on groups in the data.
11. Missing Values Heatmap:
● Visualizing Missing Data:
● Using a heatmap to visualize the distribution of missing values.
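The main filling strategies above, side by side on one small Series:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, np.nan, 4.0])

filled_const = s.fillna(0)     # constant fill
filled_ffill = s.ffill()       # carry the previous valid value forward
filled_bfill = s.bfill()       # pull the next valid value backward
interp = s.interpolate()       # linear interpolation fills 2.0 and 3.0

cat = pd.Series(['x', None, 'x', 'y'])
cat_filled = cat.fillna(cat.mode()[0])   # mode-based fill for categorical-like data
```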
More on Grouping and Aggregation:
1. Grouping Data:
● Group By Single Column: grouped = df.groupby('column_name')
● Group By Multiple Columns: grouped = df.groupby(['col1', 'col2'])
2. Aggregating Data:
● Basic Aggregation Functions: grouped['numeric_column'].agg(['mean', 'sum', 'count'])
● Applying Different Aggregations to Different Columns: grouped.agg({'col1': 'mean', 'col2':
'sum'})
3. Multiple Aggregation Functions:
● Using Multiple Aggregation Functions: grouped['numeric_column'].agg(['mean', 'sum',
'count'])
4. Custom Aggregation Functions:
● Defining and Using Custom Aggregation Functions:
grouped['numeric_column'].agg(custom_function)
5. Grouping by Time Periods:
● Grouping Time Series Data: df.groupby(pd.Grouper(freq='M')).agg({'col1': 'mean', 'col2':
'sum'}) (requires a DatetimeIndex; otherwise pass key='datetime_column' to the Grouper)
6. Grouping and Aggregating with Multiple Operations:
● Using .agg with Multiple Operations: grouped['numeric_column'].agg(['mean', 'sum',
lambda x: x.max() - x.min()])
7. Transforming Data:
● Transforming Data Within Groups: grouped['numeric_column'].transform(lambda x: (x -
x.mean()) / x.std())
8. Filtering Groups:
● Filtering Groups Based on Aggregated Values: grouped.filter(lambda x:
x['numeric_column'].sum() > threshold)
9. Grouping by Index:
● Grouping by Index Levels: df.set_index(['col1', 'col2']).groupby(level=['col1', 'col2'])
10. Pivot Tables:
● Creating Pivot Tables: pivot_table = pd.pivot_table(df, values='values',
index='index_column', columns='column_to_pivot', aggfunc='sum')
11. Named Aggregations (Pandas 0.25.0 and later):
● Using Named Aggregations: grouped.agg(total=('numeric_column', 'sum'),
average=('numeric_column', 'mean'))
12. Combining GroupBy with Other Pandas Operations:
● GroupBy with apply: df.groupby('column_name').apply(lambda x: custom_function(x))
● GroupBy with merge and concat: pd.merge(df, grouped.agg('mean').reset_index(),
on='common_column', how='inner') (merge the aggregated result, not the GroupBy object itself)
13. Hierarchical Indexing (MultiIndex) with Grouping:
● Grouping with MultiIndex: df.groupby(['col1', 'col2']).agg({'numeric_column': 'mean'})
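Named aggregation, transform, and filter in one runnable sketch (names are illustrative):

```python
import pandas as pd

df = pd.DataFrame({'grp': ['a', 'a', 'b', 'b'], 'val': [1, 2, 3, 4]})
grouped = df.groupby('grp')

# Named aggregations label the output columns explicitly
out = grouped.agg(total=('val', 'sum'), average=('val', 'mean'))

# transform keeps the original shape: demean within each group
df['demeaned'] = grouped['val'].transform(lambda x: x - x.mean())

# filter drops entire groups that fail the condition
big = grouped.filter(lambda g: g['val'].sum() > 4)
```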
More on Merging and Concatenation:
1. Concatenation:
● Concatenating DataFrames vertically (along rows):
● pd.concat([df1, df2])
● pd.concat([df1, df2], ignore_index=True)
● Concatenating DataFrames horizontally (along columns):
● pd.concat([df1, df2], axis=1)
2. Merging:
● Merging DataFrames based on a common column:
● pd.merge(df1, df2, on='common_column', how='inner')
● Merging on multiple columns:
● pd.merge(df1, df2, on=['col1', 'col2'], how='inner')
● Different types of joins (how parameter):
● Inner Join: pd.merge(df1, df2, on='common_column', how='inner')
● Left Join: pd.merge(df1, df2, on='common_column', how='left')
● Right Join: pd.merge(df1, df2, on='common_column', how='right')
● Outer Join: pd.merge(df1, df2, on='common_column', how='outer')
● Merging on index:
● pd.merge(df1, df2, left_index=True, right_index=True, how='inner')
3. Joining:
● Joining DataFrames on index:
● df1.join(df2, how='inner')
● Joining on columns:
● df1.join(df2, on='common_column', how='inner')
4. Concatenating Series:
● Concatenating Series along rows:
● pd.concat([series1, series2])
● Concatenating Series along columns:
● pd.concat([series1, series2], axis=1)
5. Concatenating and Merging with MultiIndex:
● Concatenating along MultiIndex:
● pd.concat([df1, df2], keys=['key1', 'key2'])
● Merging with MultiIndex:
● pd.merge(df1, df2, left_on=['col1', 'col2'], right_index=True, how='inner')
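A minimal sketch contrasting inner and outer joins with vertical concatenation:

```python
import pandas as pd

left = pd.DataFrame({'key': ['a', 'b', 'c'], 'l': [1, 2, 3]})
right = pd.DataFrame({'key': ['b', 'c', 'd'], 'r': [20, 30, 40]})

inner = pd.merge(left, right, on='key', how='inner')   # keys present in both: b, c
outer = pd.merge(left, right, on='key', how='outer')   # union of keys: a, b, c, d
stacked = pd.concat([left, left], ignore_index=True)   # rows stacked, index rebuilt 0..5
```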
More on Time Series Operations:
1. Convert to DateTime:
● Convert a column to a datetime format:
● df['datetime_column'] = pd.to_datetime(df['datetime_column'])
2. Resample Time Series:
● Resample time series data:
● df.resample('D').mean(): Resample to daily frequency, calculating the mean.
3. Shifting and Lagging:
● Shift (lag) data in a column:
● df['lagged_column'] = df['column'].shift(periods=n)
● A positive n lags the series (values move forward); a negative n leads it.
4. Rolling Windows:
● Calculate rolling mean:
● df['rolling_mean'] = df['column'].rolling(window=n).mean()
● Calculate rolling sum:
● df['rolling_sum'] = df['column'].rolling(window=n).sum()
● Other rolling window functions: min(), max(), std(), etc.
5. Time Series Indexing:
● Set DateTime as index:
● df.set_index('datetime_column', inplace=True)
● Reset index:
● df.reset_index(inplace=True)
6. Time Delta:
● Create a time delta column:
● df['time_delta'] = df['end_date'] - df['start_date']
7. Date Range:
● Generate a date range:
● date_range = pd.date_range(start='start_date', end='end_date', freq='D')
8. Time Series Visualization:
● Plot time series data:
● df.plot(x='datetime_column', y='value_column', kind='line')
9. Time Zone Handling:
● Set time zone:
● df['datetime_column'] =
df['datetime_column'].dt.tz_localize('UTC').dt.tz_convert('America/New_York')
● Remove time zone information:
● df['datetime_column'] = df['datetime_column'].dt.tz_localize(None)
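Resampling, shifting, and timedeltas on a small daily series (values are illustrative):

```python
import pandas as pd

idx = pd.date_range('2022-01-01', periods=6, freq='D')
ts = pd.DataFrame({'value': [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]}, index=idx)

two_day = ts.resample('2D').mean()              # resampling needs a DatetimeIndex
ts['lagged'] = ts['value'].shift(periods=1)     # previous row's value; first row becomes NaN
ts['delta'] = ts.index - ts.index[0]            # timedeltas from the first timestamp
```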
Pivot Tables:
1. Create Pivot Table:
● Create a basic pivot table:
● pivot_table = pd.pivot_table(df, values='values', index='index_column',
columns='column_to_pivot', aggfunc='sum')
2. Aggregate Values:
● Aggregate values using different aggregation functions:
● pivot_table = pd.pivot_table(df, values='values', index='index_column',
columns='column_to_pivot', aggfunc={'values': 'sum'})
3. Handling Missing Values in Pivot Tables:
● Handling missing values during aggregation:
● pivot_table = pd.pivot_table(df, values='values', index='index_column',
columns='column_to_pivot', aggfunc='sum', fill_value=0)
4. Multi-level Pivot Tables:
● Create a multi-level pivot table:
● pivot_table = pd.pivot_table(df, values='values', index=['index_column1',
'index_column2'], columns='column_to_pivot', aggfunc='sum')
5. Pivot Tables with Multiple Aggregation Functions:
● Use multiple aggregation functions:
● pivot_table = pd.pivot_table(df, values='values', index='index_column',
columns='column_to_pivot', aggfunc={'values': ['sum', 'mean']})
6. Pivot Tables with Custom Aggregation Functions:
● Use custom aggregation functions:
● pivot_table = pd.pivot_table(df, values='values', index='index_column',
columns='column_to_pivot', aggfunc={'values': custom_function})
7. Grand Totals and Margins:
● Include grand totals and margins:
● pivot_table = pd.pivot_table(df, values='values', index='index_column',
columns='column_to_pivot', aggfunc='sum', margins=True,
margins_name='Total')
8. Pivot Tables with Date-Time Index:
● Create a pivot table with a date-time index:
● pivot_table = pd.pivot_table(df, values='values', index=pd.Grouper(freq='M',
key='datetime_column'), columns='column_to_pivot', aggfunc='sum')
9. Pivot Tables with Categories:
● Create a pivot table with categorical columns:
● pivot_table = pd.pivot_table(df, values='values', index='index_column',
columns='category_column', aggfunc='sum')
10. Cross Tabulation:
● Create a cross-tabulation table:
● cross_tab = pd.crosstab(index=df['index_column'], columns=df['column_to_pivot'],
values=df['values'], aggfunc='sum')
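A pivot table with margins and the matching crosstab, on a toy sales frame (all names are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    'region': ['N', 'N', 'S', 'S'],
    'product': ['x', 'y', 'x', 'y'],
    'sales': [10, 20, 30, 40],
})

pivot = pd.pivot_table(df, values='sales', index='region', columns='product',
                       aggfunc='sum', fill_value=0,
                       margins=True, margins_name='Total')

xtab = pd.crosstab(index=df['region'], columns=df['product'],
                   values=df['sales'], aggfunc='sum')
```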
More on Advanced Functions:
1. Apply Function:
● Apply a function to each element or row/column of a DataFrame:
● df['new_column'] = df['column'].apply(lambda x: custom_function(x))
● df.apply(lambda x: custom_function(x), axis=1)
2. Vectorized Operations with NumPy:
● Use vectorized operations for improved performance:
● df['new_column'] = np.sqrt(df['column'])
● df['new_column'] = np.log(df['column'])
3. Custom Functions with np.vectorize:
● Vectorize custom functions for element-wise operations:
● vectorized_function = np.vectorize(custom_function)
● df['new_column'] = vectorized_function(df['column'])
4. Window Functions (Rolling):
● Apply rolling window functions for time-series data:
● df['rolling_mean'] = df['column'].rolling(window=n).mean()
● df['rolling_sum'] = df['column'].rolling(window=n).sum()
5. Lambda Functions:
● Use lambda functions for quick, inline operations:
● df['new_column'] = df['column'].apply(lambda x: x * 2)
6. map Function:
● Map values based on a dictionary or function:
● df['category_column'] = df['numeric_column'].map({1: 'A', 2: 'B', 3: 'C'})
● df['new_column'] = df['column'].map(lambda x: custom_mapping_function(x))
7. applymap Function:
● Apply a function to every element of a DataFrame (applymap is deprecated as of pandas 2.1; use DataFrame.map instead):
● df.applymap(lambda x: custom_function(x)), or in pandas 2.1+: df.map(lambda x: custom_function(x))
8. transform Function:
● Apply a function element-wise or along a specific axis:
● df['new_column'] = df['numeric_column'].transform(lambda x: x - x.mean())
9. pivot Function:
● Pivot the DataFrame based on the specified columns:
● df_pivot = df.pivot(index='index_column', columns='column_to_pivot',
values='values')
10. pd.get_dummies for One-Hot Encoding:
● Convert categorical variables into dummy/indicator variables:
● df_encoded = pd.get_dummies(df, columns=['categorical_column'])
11. groupby with Custom Aggregation Functions:
● Use custom aggregation functions with groupby:
● grouped = df.groupby('group_column').agg(custom_function)
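apply, map, group-wise transform, and one-hot encoding in one runnable sketch (names are illustrative):

```python
import pandas as pd

df = pd.DataFrame({'n': [1, 2, 3], 'cat': ['a', 'b', 'a']})

df['squared'] = df['n'].apply(lambda x: x ** 2)              # element-wise apply
df['label'] = df['n'].map({1: 'one', 2: 'two', 3: 'three'})  # dict-based mapping
df['demeaned'] = df.groupby('cat')['n'].transform(lambda x: x - x.mean())  # group-wise demean

dummies = pd.get_dummies(df[['cat']], columns=['cat'])       # one-hot columns cat_a, cat_b
```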