Data Cleansing Checklist
1. Data Collection: most often from CSV files, usually after a long battle with SQL.
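A minimal loading sketch, assuming a hypothetical CSV export named customer_churn.csv:

```python
import pandas as pd

# Hypothetical file name; in practice this is whatever the SQL query exported.
df = pd.read_csv("customer_churn.csv")
print(df.shape)  # (number of rows, number of columns)
```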
2. Data Inspection: Check the structure and format of the data using:
* info() to get the dtype of each column, the number of columns and the non-null counts, i.e. to check for missing values (if a column has around 50% of its values missing, we just delete it entirely).
* describe() to check the mean, min, max and other summary statistics.
* value_counts():
- For the target column, to see which type of situation we're in (Binary Classification, Multiclass Classification or Regression).
- For non-numerical (categorical) columns, to get the count of each value.
3. Data Cleaning (a combined pandas sketch follows the bullets of this step):
* Don't forget to convert time columns to datetime first, then work with them, e.g. extracting the hour or the month (NaN values become NaT).
* Handle missing numerical data (data mutation):
+ Delete the entire row with dropna(), especially when the missing value is in an important column. IDs are not always important, but sometimes they are: in telecommunication fraud detection, for example, the phone number (the ID in this case) is very important.
+ Replace the missing values with 0 or the mean using fillna() or SimpleImputer().
* Non-numerical (categorical) data:
- Filling with a Placeholder Value: Replace missing values with a specific
placeholder string, like 'Unknown', 'Missing', or an empty string ''.
- Forward Fill or Backward Fill: Use the previous or next value in the column
to fill missing values.
- Dropping Missing Values: Remove rows or columns that contain missing values
(when it represents a small portion of the data).
- Mode Imputation: Replace missing values with the most frequent value (mode) in the column, or infer them from correlated columns.
* Delete useless columns that have nothing to do with the target column or that don't help the model identify patterns, like sequentially assigned numbers (IDs), names, etc.
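A combined sketch of the cleaning operations above, assuming hypothetical column names (signup_date, phone_number, monthly_charges, tenure, contract_type, payment_method, internet_service, customer_id, name):

```python
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.read_csv("customer_churn.csv")  # hypothetical dataset

# Time columns: convert to datetime first (invalid values and NaN become NaT),
# then extract the parts you need.
df["signup_date"] = pd.to_datetime(df["signup_date"], errors="coerce")
df["signup_month"] = df["signup_date"].dt.month

# Missing numerical data: drop rows where a critical column is missing ...
df = df.dropna(subset=["phone_number"])
# ... or impute with 0 / the mean.
df["monthly_charges"] = df["monthly_charges"].fillna(0)
df[["tenure"]] = SimpleImputer(strategy="mean").fit_transform(df[["tenure"]])

# Missing categorical data: placeholder value, forward fill, or mode.
df["contract_type"] = df["contract_type"].fillna("Unknown")
df["payment_method"] = df["payment_method"].ffill()
df["internet_service"] = df["internet_service"].fillna(df["internet_service"].mode()[0])

# Drop columns that carry no pattern the model can use.
df = df.drop(columns=["customer_id", "name"])
```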
4. Data Visualization:
* Use a boxplot ("boîte à moustaches") to check whether there are any outliers.
* Use matplotlib's hist() to see whether the dataset is tail-heavy (skewed to one side), to determine whether to apply a logarithm that compresses the data and pulls values closer to the mean.
* Correlation search: first between the input columns themselves (if two columns have a strong correlation, close to 1 or -1, we delete one of them or merge them), then between each input column and the target column (with a special condition). Here are the 3 possible outcomes:
- 1: Both variables change in the same direction.
- -1: The variables change in opposite directions.
- 0: No relationship between the changes of the variables.
Also check the 19AssociationEffectSize image in the Statistics folder for more details on correlation.
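A sketch of these visual checks with pandas/matplotlib, reusing the hypothetical columns from the cleaning sketch:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("customer_churn.csv")  # hypothetical dataset

# Boxplot ("boîte à moustaches") to spot outliers in a numerical column.
df.boxplot(column="monthly_charges")
plt.show()

# Histogram to see whether the distribution is tail-heavy (skewed).
df["monthly_charges"].hist(bins=50)
plt.show()

# If it is, a log transform compresses the tail and pulls values toward the mean.
df["log_monthly_charges"] = np.log1p(df["monthly_charges"])

# Correlation matrix of the numerical columns (includes the target if it is numeric).
print(df.select_dtypes(include="number").corr())
```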
5. Feature Scaling:
* Normalization: is good to use when you know that the distribution of your data does not follow a Gaussian distribution. This can be useful for algorithms that do not assume any distribution of the data, like K-Nearest Neighbors and Neural Networks; use MinMaxScaler for this.
* Standardization "std": on the other hand, can be helpful in cases where the data follows a Gaussian distribution, though this does not have to be strictly true. Also, unlike normalization, standardization does not have a bounding range, so even if your data contains outliers they distort the result far less than with min-max scaling. Overall, we use StandardScaler for this; it is more often used than MinMaxScaler, especially when there are outliers.
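A minimal scaling sketch with scikit-learn, assuming the same hypothetical numerical columns; in practice the scaler is fit on the training split only:

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler

df = pd.read_csv("customer_churn.csv")    # hypothetical dataset
num_cols = ["tenure", "monthly_charges"]  # hypothetical numerical columns

# Normalization: rescale each column to [0, 1]; no distribution assumed,
# but a single outlier squashes the rest of the values into a narrow band.
normalized = MinMaxScaler().fit_transform(df[num_cols])

# Standardization: zero mean, unit variance; unbounded, so outliers
# distort it far less -- usually the default choice.
df[num_cols] = StandardScaler().fit_transform(df[num_cols])
```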