Data Cleaning

This document provides a cheat sheet for data cleaning techniques in Python. It discusses how to deal with missing data through techniques like dropping null values, imputing means/modes, and interpolation. It also addresses dealing with duplicates, outlier detection, encoding categorical features, and data transformations. Common Python libraries used include Pandas, NumPy, Scikit-Learn, and Seaborn.

Uploaded by

lequangtrung010389


Data Cleaning Cheat Sheet in Python - By Eugenia Anello

Table of Contents:
1. Dealing with Missing Data
2. Dealing with Duplicates
3. Outlier Detection
4. Encode Categorical Features
5. Transformation

1. Dealing with Missing Data

Count the missing values in each column of the dataset

df.isnull().sum()

Drop rows in which every value is missing

df.dropna(how='all')

Drop columns that have missing values

df.dropna(axis='columns')

Drop rows that have missing values in specific columns

df.dropna(subset=['municipal', 'city'])
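
The three dropna variants above behave quite differently, which a small example makes visible. This is a minimal sketch on hypothetical toy data; only the column names 'municipal' and 'city' come from the cheat sheet, and the 'price' column and all values are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data (values invented for illustration)
toy = pd.DataFrame({
    'municipal': ['A', np.nan, 'C', np.nan],
    'city': ['X', 'Y', np.nan, np.nan],
    'price': [100.0, 200.0, 300.0, np.nan],
})

# Drops only rows where every value is missing (the last row here)
all_missing_dropped = toy.dropna(how='all')

# Drops every column containing at least one missing value (all three here)
complete_columns = toy.dropna(axis='columns')

# Drops rows with a missing value in 'municipal' or 'city' (keeps the first row only)
subset_dropped = toy.dropna(subset=['municipal', 'city'])

print(len(all_missing_dropped), complete_columns.shape, len(subset_dropped))
```

None of these calls modify toy; each returns a new DataFrame, so the result has to be assigned to a name to be kept.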

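Replace missing values with Mean/Median/Mode

df['price'] = df['price'].fillna(df['price'].mean())
df['age'] = df['age'].fillna(df['age'].median())
# mode() returns a Series (ties are possible), so take its first value
df['type_building'] = df['type_building'].fillna(df['type_building'].mode()[0])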

Replace missing values with Mean/Median/Mode of the group

df['price'] = df['price'].fillna(df.groupby('type_building')['price'].transform('mean'))
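
Group-wise imputation works because transform returns a Series aligned with the original rows, so each missing value is filled with the mean of its own group. A minimal sketch, assuming hypothetical toy values for the 'type_building' and 'price' columns:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data (values invented for illustration)
toy = pd.DataFrame({
    'type_building': ['flat', 'flat', 'flat', 'house', 'house'],
    'price': [100.0, np.nan, 300.0, 500.0, np.nan],
})

# One mean per row, taken from that row's group: flat -> 200.0, house -> 500.0
group_mean = toy.groupby('type_building')['price'].transform('mean')

# Only the missing entries are replaced; existing prices are left untouched
toy['price'] = toy['price'].fillna(group_mean)
print(toy)
```

Swapping 'mean' for 'median' gives the group median in the same way.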

Forward Fill - Fill missing values with values before them

df['stock_price'] = df['stock_price'].ffill()

Forward Fill within Groups

df['stock_price'] = df.groupby('type_stock')['stock_price'].ffill()

Backward Fill - Fill missing values with values after them

df['stock_price'] = df['stock_price'].bfill()

Backward Fill within Groups

df['stock_price'] = df.groupby('type_stock')['stock_price'].bfill()
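
The difference between a plain fill and a fill within groups is that the grouped version never carries a value across a group boundary. A minimal sketch on hypothetical toy data (the values are assumptions, the column names follow the cheat sheet):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: two stocks, each with one missing price
toy = pd.DataFrame({
    'type_stock': ['a', 'a', 'b', 'b'],
    'stock_price': [1.0, np.nan, np.nan, 4.0],
})

# Plain forward fill: stock 'b' inherits the last price of stock 'a'
plain_ffill = toy['stock_price'].ffill()

# Grouped forward fill: the first row of 'b' stays missing (nothing before it in its group)
grouped_ffill = toy.groupby('type_stock')['stock_price'].ffill()

# Grouped backward fill: the last row of 'a' stays missing (nothing after it in its group)
grouped_bfill = toy.groupby('type_stock')['stock_price'].bfill()

print(pd.DataFrame({'plain_ffill': plain_ffill,
                    'grouped_ffill': grouped_ffill,
                    'grouped_bfill': grouped_bfill}))
```

Both fills assume the rows are already sorted in a meaningful order (for example by date) within each group.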

Fill missing values using the interpolation method

# method='polynomial' requires SciPy to be installed
df['stock_price'] = df['stock_price'].interpolate(method='polynomial', order=2)

Fill missing values using the interpolation method within groups

df['stock_price'] = df.groupby('type_stock')['stock_price'].transform(
    lambda x: x.interpolate(method='polynomial', order=2))
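
Interpolation within groups can be checked on a small example. The sketch below swaps in method='linear' instead of the polynomial method used above, purely so that it runs without SciPy; the toy values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one gap in the middle of each stock's series
toy = pd.DataFrame({
    'type_stock': ['a', 'a', 'a', 'b', 'b', 'b'],
    'stock_price': [10.0, np.nan, 30.0, 100.0, np.nan, 300.0],
})

# transform keeps the original index, so the result can be assigned straight back
toy['stock_price'] = toy.groupby('type_stock')['stock_price'].transform(
    lambda s: s.interpolate(method='linear'))
print(toy)
```

Each gap is filled from its own group's neighbours only: 20.0 for stock 'a' and 200.0 for stock 'b'.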

2. Dealing with Duplicates

Count the duplicate rows

df.duplicated().sum()

Extract duplicate rows from the dataframe

df[df.duplicated()]

Drop duplicates

df.drop_duplicates()

Aggregate rows that share the same id (here, by averaging the price)

df.groupby('id').agg({'price':'mean'}).reset_index()
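
Dropping duplicates and aggregating solve different problems: the first removes rows that are identical in every column, the second collapses rows that share a key but disagree elsewhere. A minimal sketch, assuming hypothetical toy values:

```python
import pandas as pd

# Hypothetical toy data: id 1 has an exact duplicate, id 2 has two conflicting prices
toy = pd.DataFrame({
    'id': [1, 1, 2, 2],
    'price': [10.0, 10.0, 20.0, 40.0],
})

n_duplicates = toy.duplicated().sum()   # counts rows identical to an earlier row
deduplicated = toy.drop_duplicates()    # keeps the first copy of each identical row
per_id = toy.groupby('id').agg({'price': 'mean'}).reset_index()  # one row per id

print(n_duplicates)
print(deduplicated)
print(per_id)
```

After drop_duplicates, id 2 still appears twice because its two rows differ in price; only the aggregation reduces it to a single row.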

3. Outlier Detection

Detect range of values for each column of the dataset

df.describe(percentiles=[x * 0.1 for x in range(10)])

Display a boxplot to show the distribution of a column

import seaborn as sns

sns.boxplot(x=df['age'])

Display a histogram to show the distribution of a column

sns.displot(df, x='column1')

Remove outliers (here, keep only rows below the 90th percentile of age)

df = df[df['age'] < df['age'].quantile(0.9)]
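
A quantile filter like the one above trims the upper tail only, and it removes the top share of rows whether or not they are truly anomalous. A minimal sketch on hypothetical ages, with one obviously implausible value:

```python
import pandas as pd

# Hypothetical toy data: nine plausible ages and one extreme value
toy = pd.DataFrame({'age': [20, 22, 25, 27, 30, 31, 33, 35, 38, 95]})

# 90th percentile, computed with pandas' default linear interpolation
upper = toy['age'].quantile(0.9)

# Keep only the rows strictly below the threshold
filtered = toy[toy['age'] < upper]

print(upper)
print(filtered['age'].max())
```

The threshold here is an assumption to tune per dataset; a lower bound can be added in the same way with a second quantile and the & operator.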

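Outlier detection with machine learning models, like Isolation Forest

from sklearn.ensemble import IsolationForest

# 'if' is a reserved keyword in Python, so the model needs another name
iso = IsolationForest(random_state=42)
iso.fit(X)
y_pred = iso.predict(X)  # -1 for outliers, 1 for inliers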
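
The predictions of an Isolation Forest are labels, not scores: -1 marks a point the model considers an outlier and 1 marks an inlier. A minimal sketch, assuming a hypothetical one-feature dataset with a single extreme value appended; which points get flagged depends on the data and on the contamination setting, so the printed rows are not guaranteed.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Hypothetical data: 100 values around 50, plus one extreme value of 500
rng = np.random.default_rng(0)
X = np.concatenate([rng.normal(50, 5, size=(100, 1)), [[500.0]]])

iso = IsolationForest(random_state=42)
y_pred = iso.fit_predict(X)   # fit and predict in one call

# Boolean mask selecting the rows labelled as outliers
outliers = X[y_pred == -1]
inliers = X[y_pred == 1]
print(len(outliers), len(inliers))
```

The mask y_pred == 1 can be used to filter a DataFrame built from the same rows, provided the row order is unchanged.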

4. Encode Categorical Features

Apply one-hot-encoding to a categorical column

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

ohe = OneHotEncoder()
encoded_data = pd.DataFrame(ohe.fit_transform(df[['type_build']]).toarray(),
                            columns=ohe.get_feature_names_out(), index=df.index)
new_df = df.join(encoded_data)

Apply label-encoding to a categorical column

from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
df['type_build'] = le.fit_transform(df['type_build'])

Apply ordinal-encoding to a categorical column to retain its ordinal nature

from sklearn.preprocessing import OrdinalEncoder

oe = OrdinalEncoder()
# OrdinalEncoder expects 2D input, hence the double brackets
df['price_level'] = oe.fit_transform(df[['price_level']]).ravel()
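
By default OrdinalEncoder orders categories alphabetically, which rarely matches their real ranking. Passing the order explicitly through the categories parameter keeps the ordinal meaning. A minimal sketch, assuming hypothetical levels 'low', 'medium' and 'high' and an invented output column name:

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical toy data (the three level names are an assumption)
toy = pd.DataFrame({'price_level': ['medium', 'low', 'high', 'low']})

# One list of categories per encoded column, in increasing order
oe = OrdinalEncoder(categories=[['low', 'medium', 'high']])

# fit_transform returns a 2D array; ravel() flattens it to one value per row
toy['price_level_code'] = oe.fit_transform(toy[['price_level']]).ravel()
print(toy)
```

With this order, low maps to 0.0, medium to 1.0 and high to 2.0; the alphabetical default would instead place high before low.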

5. Transformation

Standardize features by removing the mean and scaling to unit variance

from sklearn.preprocessing import StandardScaler

# A scaler must be fitted before it can transform; fit_transform does both
X_std = StandardScaler().fit_transform(X)

Rescale features into the range [0,1]

from sklearn.preprocessing import MinMaxScaler

X_mms = MinMaxScaler().fit_transform(X)

Scale features exploiting statistics that are robust to outliers

from sklearn.preprocessing import RobustScaler

X_rs = RobustScaler().fit_transform(X)
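
The three scalers differ in which statistics they learn from the data, which shows up clearly when one value is extreme. A minimal sketch on a hypothetical single-feature array; the values and the X_new name are assumptions for illustration.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler

# Hypothetical data: four ordinary values and one extreme value
X = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

X_std = StandardScaler().fit_transform(X)   # uses the mean and standard deviation
X_mms = MinMaxScaler().fit_transform(X)     # uses the minimum and maximum
X_rs = RobustScaler().fit_transform(X)      # uses the median and interquartile range

# A fitted scaler can be reused on new data with the statistics it already learned
scaler = StandardScaler().fit(X)
X_new = np.array([[5.0]])
X_new_std = scaler.transform(X_new)

print(X_std.ravel())
print(X_mms.ravel())
print(X_rs.ravel())
```

With min-max scaling, the extreme value pins the other four points close to 0, while the robust scaler keeps them spread around the median. Fitting on the training data and calling only transform on later data keeps both on the same scale.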
