Data Cleansing Checklist


Here are the different data preprocessing steps (note: I can bypass SQL since I can
do the different kinds of joins directly in Jupyter with pandas, as sketched below):

Before we start we should look at the CSV file and sort through the columns to check
the different values of each column, and also figure out which situation we are in:
classification, regression, etc.
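A minimal sketch of doing SQL-style joins in pandas instead of SQL (the file names
and the join key "customer_id" are just placeholders, not from the original notes):

    import pandas as pd

    # Hypothetical example files and join key, purely to illustrate the idea
    customers = pd.read_csv("customers.csv")
    orders = pd.read_csv("orders.csv")

    # The usual SQL join types, done directly in pandas
    inner = customers.merge(orders, on="customer_id", how="inner")
    left = customers.merge(orders, on="customer_id", how="left")
    outer = customers.merge(orders, on="customer_id", how="outer")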

1. Data Collection: most often from CSV files, after a long battle with SQL.

2. Data Inspection: Check the structure and format of the data using (sketch below):
   * info() "to get the dtype and non-null count of each column, i.e. check for
missing values (if a column has 50% of its values missing we just delete it
entirely)".
   * describe() "to check the mean, min, max & other summary statistics".
   * value_counts():
       - For the target column, to see which type of situation we're in (Binary
Classification, Multiclass Classification or Regression).
       - For non-numerical (categorical) columns, "to get the count for each value
instance".
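A quick sketch of these inspection calls, assuming the data was loaded into a
DataFrame called df and the target column is named "target" (both are placeholders):

    import pandas as pd

    df = pd.read_csv("data.csv")               # placeholder file name

    df.info()                                  # dtypes and non-null counts per column
    print(df.describe())                       # mean, std, min, max, quartiles

    print(df["target"].value_counts())         # number of classes -> binary / multiclass / regression
    print(df["some_category"].value_counts())  # counts per value of a categorical column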

3. Data Cleaning (see the sketch after this list):
   * Don't forget to convert time columns to datetime first, then work with them,
e.g. extracting the hour or month (NaN values become NaT).
   * Handle missing numerical data (data mutation):
       + Delete the entire row with dropna(), especially if the missing value is in
an important column like an ID. However, this is not always the right move: only
when the ID is something meaningful, like the phone number in telecommunication
fraud detection, which in that case is very important.
       + Replace them with 0 or the mean using fillna() or SimpleImputer().
   * Non-numerical (categorical) data:
       - Filling with a placeholder value: replace missing values with a specific
placeholder string, like 'Unknown', 'Missing', or an empty string ''.
       - Forward fill or backward fill: use the previous or next value in the column
to fill missing values.
       - Dropping missing values: remove rows or columns that contain missing values
(when they represent a small portion of the data).
       - Mode imputation: replace missing values with the most frequent value (mode)
in the column, or even infer them from correlated columns.
   * Delete useless columns that have nothing to do with the target column or that
don't help the model identify patterns, like sequentially assigned numbers (IDs),
names, etc.
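A minimal sketch of these cleaning steps; the column names ("signup_date", "age",
"city", "phone_number", "row_id", "customer_name") are placeholders:

    import pandas as pd
    from sklearn.impute import SimpleImputer

    # Time columns: convert to datetime first (NaN becomes NaT), then extract parts
    df["signup_date"] = pd.to_datetime(df["signup_date"], errors="coerce")
    df["signup_month"] = df["signup_date"].dt.month

    # Missing numerical data: drop the row, or replace with 0 / the mean
    df = df.dropna(subset=["phone_number"])           # drop rows missing a meaningful ID
    df["age"] = df["age"].fillna(df["age"].mean())    # fillna with the mean
    # df[["age"]] = SimpleImputer(strategy="mean").fit_transform(df[["age"]])

    # Missing categorical data: placeholder, forward/backward fill, or mode
    df["city"] = df["city"].fillna("Unknown")               # placeholder value
    # df["city"] = df["city"].ffill()                       # forward fill
    # df["city"] = df["city"].fillna(df["city"].mode()[0])  # mode imputation

    # Drop columns that carry no signal (sequentially assigned IDs, names, ...)
    df = df.drop(columns=["row_id", "customer_name"])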

4. String Data Transformation (see the sketch after this list):
   * Categorical data: numericalization techniques: label encoder, ordinal encoder,
one-hot encoder (beware: each value becomes its own column, so high cardinality
blows up the array size!) & word embeddings.
     Note:
       - Label Encoder: for binary columns.
       - Ordinal Encoder: use an ordinal encoder when there is a meaningful order or
hierarchy among the categories, i.e. the categories have a clear ranking or some
sort of inherent order. The ordinal encoder assigns integer values to the categories
based on this order, making it suitable for variables with a natural progression,
like some sort of a distance. You can also use a custom mapping when the default
ordering doesn't match the real one.
       - One-Hot Encoder: use a one-hot encoder when there is no inherent order
among the categories, and each category is independent of the others.
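A minimal sketch of the three encoders, assuming placeholder columns "churn"
(binary), "size" (ordered) and "color" (unordered); sparse_output requires a recent
scikit-learn (older versions use sparse=False):

    import pandas as pd
    from sklearn.preprocessing import LabelEncoder, OrdinalEncoder, OneHotEncoder

    # Label encoder: fine for a binary column (or the target)
    df["churn"] = LabelEncoder().fit_transform(df["churn"])

    # Ordinal encoder: categories with a natural order, given explicitly
    size_order = [["small", "medium", "large"]]
    df[["size"]] = OrdinalEncoder(categories=size_order).fit_transform(df[["size"]])

    # One-hot encoder: no inherent order, one new column per category
    onehot = OneHotEncoder(sparse_output=False, handle_unknown="ignore")
    color_arr = onehot.fit_transform(df[["color"]])
    color_df = pd.DataFrame(color_arr,
                            columns=onehot.get_feature_names_out(["color"]),
                            index=df.index)
    df = pd.concat([df.drop(columns=["color"]), color_df], axis=1)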

5. Data Visualization (see the sketch after this list):
   * Use a boxplot ("boîte à moustaches", box-and-whisker plot) to check if there
are any outliers.
   * Use the hist() method of matplotlib/pandas to see if the dataset is tail-heavy
(skewed), to determine whether to apply a logarithm in order to compress the data
and pull values closer to the mean.
   * Correlation search (between the input columns themselves: if two columns have a
strong correlation near 1 or -1 we delete one or merge them; then correlation of
each column with the target column, with a special threshold condition). Here are
the 3 possible outcomes:
       - 1: both variables change in the same direction.
       - -1: the variables change in opposite directions.
       - 0: no relationship in the change of the variables.
     Also check the 19AssociationEffectSize image in the Statistics folder for more
details on correlation.
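A minimal sketch of these checks with pandas and matplotlib (the column names are
placeholders; numeric_only assumes a reasonably recent pandas):

    import matplotlib.pyplot as plt

    # Boxplot to spot outliers in a numerical column
    df.boxplot(column="income")
    plt.show()

    # Histograms to see whether distributions are tail-heavy (skewed)
    df.hist(bins=50, figsize=(12, 8))
    plt.show()

    # Correlation matrix between numerical columns (values near 1 / -1 / 0)
    corr = df.corr(numeric_only=True)
    print(corr["target"].sort_values(ascending=False))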

6. Data Transformation (optional, see the sketch after this list):
   * Numerical data: we can use transformations like the Box-Cox transformation (or
its Yeo-Johnson extension) or a logarithm if the dataset is tail-heavy (skewed).
   * Delete the outlier rows spotted on the boxplot if there are only a few of them;
otherwise rely on the standard scaler later.
   * Delete useless columns based on the correlation search (between the input
columns themselves: if there is a strong correlation (near 1 or -1) we delete one or
merge them; then check the correlation of each column with the target column against
a chosen cap value) and delete the ones near 0.
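A minimal sketch of these transformations, assuming a skewed placeholder column
"income" and a hypothetical low-correlation column "useless_feature":

    import numpy as np
    from sklearn.preprocessing import PowerTransformer

    # Log transform for a tail-heavy, non-negative column
    df["income_log"] = np.log1p(df["income"])

    # Yeo-Johnson (Box-Cox extension that also handles zero/negative values)
    pt = PowerTransformer(method="yeo-johnson")
    df[["income_yj"]] = pt.fit_transform(df[["income"]])

    # Drop outlier rows flagged by the boxplot, e.g. using the IQR rule
    q1, q3 = df["income"].quantile([0.25, 0.75])
    iqr = q3 - q1
    df = df[(df["income"] >= q1 - 1.5 * iqr) & (df["income"] <= q3 + 1.5 * iqr)]

    # Drop a column whose correlation with the target is near 0
    df = df.drop(columns=["useless_feature"])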

7. Feature Scaling (see the sketch after this list):
   * Normalization: good to use when you know that the distribution of your data
does not follow a Gaussian distribution. This can be useful in algorithms that do
not assume any distribution of the data, like K-Nearest Neighbors and Neural
Networks; done with MinMaxScaler.
   * Standardization "std": on the other hand, can be helpful in cases where the
data follows a Gaussian distribution, although this does not necessarily have to be
true. Also, unlike normalization, standardization does not have a bounding range, so
even if you have outliers in your data, they will not be squashed into a fixed
interval by standardization.
     Overall, for this we use StandardScaler, which is used more often than
MinMaxScaler, especially when there are outliers.
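A minimal sketch of both scalers, applied to placeholder numerical columns:

    from sklearn.preprocessing import MinMaxScaler, StandardScaler

    num_cols = ["age", "income_yj"]        # placeholder numerical columns

    # Normalization: rescale each column to [0, 1]
    df[num_cols] = MinMaxScaler().fit_transform(df[num_cols])

    # Standardization: zero mean, unit variance (preferred when there are outliers)
    # df[num_cols] = StandardScaler().fit_transform(df[num_cols])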

8. Data Preparation (Pipeline): At the end, for the "deployment phase", create a
pipeline to group all of these data preprocessing steps, then fit_transform the data
(see the sketch below).
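A minimal sketch of such a pipeline with scikit-learn, assuming placeholder lists of
numerical and categorical columns:

    from sklearn.pipeline import Pipeline
    from sklearn.compose import ColumnTransformer
    from sklearn.impute import SimpleImputer
    from sklearn.preprocessing import StandardScaler, OneHotEncoder

    num_cols = ["age", "income"]           # placeholder numerical columns
    cat_cols = ["city", "color"]           # placeholder categorical columns

    # Numerical branch: impute missing values, then scale
    num_pipe = Pipeline([
        ("impute", SimpleImputer(strategy="mean")),
        ("scale", StandardScaler()),
    ])

    # Categorical branch: impute with the mode, then one-hot encode
    cat_pipe = Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ])

    # Apply each branch to its columns and fit_transform the whole dataset
    preprocess = ColumnTransformer([
        ("num", num_pipe, num_cols),
        ("cat", cat_pipe, cat_cols),
    ])
    X_ready = preprocess.fit_transform(df)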

Here's a general sequence:


* Data Collection: gather raw data.
* Data Inspection: see the data types and check for missing values.
* Data Cleaning, Visualization and Transformation: handle missing values and
outliers, and perform the necessary transformations (not the final ones, just for
testing).
* Feature Scaling: mostly StandardScaler or MinMaxScaler.
* Data Preparation (Pipeline): create a pipeline to organize and apply the various
preprocessing steps.
