Data Cleansing Checklist
1. Data Collection: most often from CSV files, usually after a long battle with SQL.
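A minimal loading sketch, assuming a hypothetical CSV export named customer_churn.csv:

```python
import pandas as pd

# Hypothetical file name; in practice this is whatever the SQL query exported.
df = pd.read_csv("customer_churn.csv")
print(df.shape)  # (number of rows, number of columns)
```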
2. Data Inspection: Check the structure and format of the data using:
* info() to get the dtype of each column, the number of columns and the non-null counts, i.e. to check for missing values (if a column has around 50% of its values missing, we just delete it entirely).
* describe() to check the mean, min, max and other summary statistics.
* value_counts():
- For the target column, to see which type of situation we're in (Binary Classification, Multiclass Classification or Regression).
- For non-numerical (categorical) columns, to get the count of each value.
3. Data Cleaning (a combined pandas sketch follows the bullets of this step):
* Don't forget to convert time columns to datetime first, then work with them, e.g. extracting the hour or the month (NaN values become NaT).
* Handle missing numerical data (data mutation):
+ Delete the entire row with dropna(), especially when the missing value is in an important column. IDs are not always important, but sometimes they are: in telecommunication fraud detection, for example, the phone number (the ID in this case) is very important.
+ Replace the missing values with 0 or the mean using fillna() or SimpleImputer().
* Non-numerical (categorical) data:
- Filling with a Placeholder Value: Replace missing values with a specific
placeholder string, like 'Unknown', 'Missing', or an empty string ''.
- Forward Fill or Backward Fill: Use the previous or next value in the column
to fill missing values.
- Dropping Missing Values: Remove rows or columns that contain missing values
(when it represents a small portion of the data).
- Mode Imputation: Replace missing values with the most frequent value (mode) in the column, or infer them from correlated columns.
* Delete useless columns that have nothing to do with the target column or that don't help the model identify patterns, like sequentially assigned numbers (IDs), names, etc.
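A combined sketch of the cleaning operations above, assuming hypothetical column names (signup_date, phone_number, monthly_charges, tenure, contract_type, payment_method, internet_service, customer_id, name):

```python
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.read_csv("customer_churn.csv")  # hypothetical dataset

# Time columns: convert to datetime first (invalid values and NaN become NaT),
# then extract the parts you need.
df["signup_date"] = pd.to_datetime(df["signup_date"], errors="coerce")
df["signup_month"] = df["signup_date"].dt.month

# Missing numerical data: drop rows where a critical column is missing ...
df = df.dropna(subset=["phone_number"])
# ... or impute with 0 / the mean.
df["monthly_charges"] = df["monthly_charges"].fillna(0)
df[["tenure"]] = SimpleImputer(strategy="mean").fit_transform(df[["tenure"]])

# Missing categorical data: placeholder value, forward fill, or mode.
df["contract_type"] = df["contract_type"].fillna("Unknown")
df["payment_method"] = df["payment_method"].ffill()
df["internet_service"] = df["internet_service"].fillna(df["internet_service"].mode()[0])

# Drop columns that carry no pattern the model can use.
df = df.drop(columns=["customer_id", "name"])
```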
4. Data Visualization:
* Use a boxplot ("boîte à moustaches") to check whether there are any outliers.
* Use matplotlib's hist() to see whether the dataset is tail-heavy (skewed to one side), to determine whether to apply a logarithm that compresses the data and pulls values closer to the mean.
* Correlation search: first between the input columns themselves (if two columns have a strong correlation, close to 1 or -1, we delete one of them or merge them), then between each input column and the target column (with a special condition). Here are the 3 possible outcomes:
- 1: Both variables change in the same direction.
- -1: The variables change in opposite directions.
- 0: No relationship between the changes of the variables.
Also check the 19AssociationEffectSize image in the Statistics folder for more details on correlation.
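A sketch of these visual checks with pandas/matplotlib, reusing the hypothetical columns from the cleaning sketch:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("customer_churn.csv")  # hypothetical dataset

# Boxplot ("boîte à moustaches") to spot outliers in a numerical column.
df.boxplot(column="monthly_charges")
plt.show()

# Histogram to see whether the distribution is tail-heavy (skewed).
df["monthly_charges"].hist(bins=50)
plt.show()

# If it is, a log transform compresses the tail and pulls values toward the mean.
df["log_monthly_charges"] = np.log1p(df["monthly_charges"])

# Correlation matrix of the numerical columns (includes the target if it is numeric).
print(df.select_dtypes(include="number").corr())
```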
5. Feature Scaling:
* Normalization: is good to use when you know that the distribution of your data does not follow a Gaussian distribution. This can be useful for algorithms that do not assume any distribution of the data, like K-Nearest Neighbors and Neural Networks; use MinMaxScaler for this.
* Standardization "std": on the other hand, can be helpful in cases where the data follows a Gaussian distribution, though this does not have to be strictly true. Also, unlike normalization, standardization does not have a bounding range, so even if your data contains outliers they distort the result far less than with min-max scaling. Overall, we use StandardScaler for this; it is more often used than MinMaxScaler, especially when there are outliers.
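A minimal scaling sketch with scikit-learn, assuming the same hypothetical numerical columns; in practice the scaler is fit on the training split only:

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler

df = pd.read_csv("customer_churn.csv")    # hypothetical dataset
num_cols = ["tenure", "monthly_charges"]  # hypothetical numerical columns

# Normalization: rescale each column to [0, 1]; no distribution assumed,
# but a single outlier squashes the rest of the values into a narrow band.
normalized = MinMaxScaler().fit_transform(df[num_cols])

# Standardization: zero mean, unit variance; unbounded, so outliers
# distort it far less -- usually the default choice.
df[num_cols] = StandardScaler().fit_transform(df[num_cols])
```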