The document provides a cheat sheet with 33 techniques for cleaning and processing data in Python. It covers topics like handling missing values, data type conversions, duplicate removal, text cleaning, categorical processing, outlier detection, feature engineering, and geospatial data processing. The goal is to serve as a reference for common data cleaning and preparation tasks in Python.
# [ Data Cleaning ] {CheatSheet}
1. Handling Missing Values
● Identify Missing Values: df.isnull().sum()
● Drop Rows with Missing Values: df.dropna()
● Drop Columns with Missing Values: df.dropna(axis=1)
● Fill Missing Values with a Constant: df.fillna(value)
● Fill Missing Values with Mean/Median/Mode: df.fillna(df.mean())
● Forward Fill Missing Values: df.ffill()
● Backward Fill Missing Values: df.bfill()
● Interpolate Missing Values: df.interpolate()
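A minimal sketch of the calls above, using a small hypothetical DataFrame with gaps in made-up `age` and `score` columns:

```python
import pandas as pd
import numpy as np

# Hypothetical DataFrame with missing entries (illustration only).
df = pd.DataFrame({
    "age":   [25, np.nan, 31, np.nan, 40],
    "score": [88.0, 92.5, np.nan, 70.0, np.nan],
})

print(df.isnull().sum())                               # missing-value count per column

filled       = df.fillna(df.mean(numeric_only=True))   # fill with column means
interpolated = df.interpolate()                        # linear interpolation between known values
no_gap_rows  = df.dropna()                             # drop rows containing any NaN
```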
2. Data Type Conversions
● Convert Data Type of a Column: df['col'] = df['col'].astype('type')
● Convert to Numeric: pd.to_numeric(df['col'], errors='coerce')
● Convert to Datetime: pd.to_datetime(df['col'], errors='coerce')
● Convert to Categorical: df['col'] = df['col'].astype('category')
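A short sketch, assuming raw data that arrived as strings in hypothetical `price`, `when`, and `status` columns; `errors='coerce'` turns unparseable values into `NaN`/`NaT` instead of raising:

```python
import pandas as pd

# Hypothetical columns read in as plain strings.
df = pd.DataFrame({
    "price":  ["10.5", "not a number", "7"],
    "when":   ["2021-01-01", "not a date", "2021-03-15"],
    "status": ["open", "closed", "open"],
})

df["price"]  = pd.to_numeric(df["price"], errors="coerce")   # invalid -> NaN
df["when"]   = pd.to_datetime(df["when"], errors="coerce")   # invalid -> NaT
df["status"] = df["status"].astype("category")               # memory-friendly categorical dtype
print(df.dtypes)
```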
3. Dealing with Duplicates
● Identify Duplicate Rows: df.duplicated()
● Drop Duplicate Rows: df.drop_duplicates()
● Drop Duplicates in a Specific Column: df.drop_duplicates(subset='col')
● Drop Duplicates Keeping the Last Occurrence: df.drop_duplicates(keep='last')
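A quick sketch with a made-up `id`/`value` table; `subset` restricts the duplicate check to chosen columns and `keep` decides which occurrence survives:

```python
import pandas as pd

# Hypothetical table containing repeated rows.
df = pd.DataFrame({
    "id":    [1, 1, 2, 3, 3],
    "value": ["a", "a", "b", "c", "d"],
})

print(df.duplicated().sum())                               # number of full-row duplicates

deduped    = df.drop_duplicates()                          # keep the first full-row occurrence
one_per_id = df.drop_duplicates(subset="id", keep="last")  # one row per id, last occurrence wins
```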
4. Text Data Cleaning
● Trim Whitespace: df['col'] = df['col'].str.strip()
● Convert to Lowercase: df['col'] = df['col'].str.lower()
● Convert to Uppercase: df['col'] = df['col'].str.upper()
● Remove Specific Characters: df['col'] = df['col'].str.replace('[character]', '', regex=True)
● Replace Text Based on Pattern (Regex): df['col'] = df['col'].str.replace(r'[regex]', 'replacement', regex=True)
● Split Text into Columns: df[['col1', 'col2']] = df['col'].str.split(',', expand=True)
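A minimal sketch chaining the string methods above on a hypothetical `name` column (the patterns shown are examples, not part of the original cheat sheet):

```python
import pandas as pd

# Hypothetical messy name strings.
df = pd.DataFrame({"name": ["  Alice Smith ", "BOB   JONES", "carol  ng"]})

df["name"] = df["name"].str.strip()                           # trim surrounding whitespace
df["name"] = df["name"].str.lower()                           # normalize case
df["name"] = df["name"].str.replace(r"\s+", " ", regex=True)  # collapse repeated inner spaces
df[["first", "last"]] = df["name"].str.split(" ", n=1, expand=True)  # split into two columns
```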
● Set Datetime Index: df.set_index('datetime_col', inplace=True)
● Resample Time Series Data: df.resample('D').mean()
● Fill Missing Time Series Data: df.asfreq('D', method='ffill')
● Time-Based Filtering: df['year'] = df.index.year; df[df['year'] > 2000]
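A sketch of the time-series calls, assuming a hypothetical `reading` column indexed by `datetime_col`:

```python
import pandas as pd

# Hypothetical readings with gaps in the date coverage.
df = pd.DataFrame({
    "datetime_col": pd.to_datetime(
        ["2020-01-01", "2020-01-02", "2020-01-05", "2021-02-01"]),
    "reading": [1.0, 2.0, 5.0, 9.0],
})
df.set_index("datetime_col", inplace=True)

monthly = df.resample("M").mean()          # downsample to monthly means
daily   = df.asfreq("D", method="ffill")   # reindex to a daily grid, forward-filling gaps
recent  = df[df.index.year > 2020]         # time-based filtering on the index year
```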
10. Data Frame Operations
● Merge Data Frames: pd.merge(df1, df2, on='key', how='inner')
● Concatenate Data Frames: pd.concat([df1, df2], axis=0)
● Join Data Frames: df1.join(df2, on='key')
● Pivot Table: df.pivot_table(index='row', columns='col', values='value')
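A compact sketch with two made-up tables sharing a `key` column; `how='inner'` keeps only keys present in both frames:

```python
import pandas as pd

# Hypothetical fact and dimension tables.
df1 = pd.DataFrame({"key": [1, 2, 3], "sales": [100, 200, 300]})
df2 = pd.DataFrame({"key": [2, 3, 4], "region": ["east", "west", "east"]})

merged  = pd.merge(df1, df2, on="key", how="inner")         # rows with keys in both frames
stacked = pd.concat([df1, df1], axis=0, ignore_index=True)  # stack rows vertically
pivoted = merged.pivot_table(index="region", values="sales", aggfunc="sum")  # summary table
```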
● Extracting Date Components: df['year'] = df['date_col'].dt.year
● Calculating Date Differences: df['days_diff'] = (df['date_col1'] - df['date_col2']).dt.days
● Date Range Generation for Time Series: pd.date_range(start='2020-01-01', end='2020-12-31', freq='D')
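A last sketch, assuming hypothetical `ordered`/`shipped` datetime columns:

```python
import pandas as pd

# Hypothetical order and shipping dates.
df = pd.DataFrame({
    "ordered": pd.to_datetime(["2020-01-01", "2020-06-15"]),
    "shipped": pd.to_datetime(["2020-01-04", "2020-06-20"]),
})

df["year"]      = df["ordered"].dt.year                    # extract a date component
df["days_diff"] = (df["shipped"] - df["ordered"]).dt.days  # difference in whole days
days_2020       = pd.date_range(start="2020-01-01", end="2020-12-31", freq="D")  # daily range
```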