Data Cleaning
Part I
Zakaria KERKAOU
[email protected]
Why do we need to clean data?
Garbage in, garbage out.
Data profiling.
Visualisation.
Data profiling
Summary statistics about the data, known as a data profile, are very helpful for getting a general idea of the quality of the data.
If a value is missing because it doesn't exist (like the height of the oldest child
of someone who doesn't have any children) then it doesn't make sense to try
and guess what it might be. These values you probably do want to keep as
NaN.
On the other hand, if a value is missing because it wasn't recorded, then you
can try to guess what it might have been based on the other values in that
column and row.
Handling Missing Values
pandas.isnull() will detect missing values for an array-like object.
This function takes a scalar or array-like object and indicates whether values
are missing (NaN in numeric arrays, None or NaN in object arrays, NaT in
datetimelike).
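As a brief sketch of how this works (the small DataFrame below is invented for illustration):

```python
import numpy as np
import pandas as pd

# pd.isnull() accepts scalars as well as array-like objects
print(pd.isnull(np.nan))   # True
print(pd.isnull(None))     # True

# Hypothetical DataFrame with one missing value per column
df = pd.DataFrame({"a": [1.0, np.nan, 3.0],
                   "b": ["x", None, "y"]})
print(df.isnull())         # boolean mask: True where a value is missing
print(df.isnull().sum())   # missing-value count per column
```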
Drop missing values
If you're in a hurry or don't have a reason to figure out why your
values are missing, one option you have is to just remove any rows
or columns that contain missing values.
To remove all rows that contain a missing value:
However, if every row in our dataset has at least one missing value, this will remove all our data.
In that case, we might have better luck removing all the columns that have at least one missing value instead.
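A minimal sketch with pandas' dropna(); the toy DataFrame is invented and has at least one missing value in every row and every column:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1.0, np.nan, 3.0],
                   "b": [np.nan, 5.0, 6.0],
                   "c": [7.0, 8.0, np.nan]})

rows_dropped = df.dropna()        # drop every row containing a NaN
cols_dropped = df.dropna(axis=1)  # drop every column containing a NaN

print(rows_dropped.shape)  # (0, 3): every row had a missing value
print(cols_dropped.shape)  # (3, 0): every column had one too
```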
Filling in missing values automatically
Another option is to try and fill in the missing values.
Different ways to fill the missing values
1. Mean/Median, Mode
2. bfill, ffill
3. interpolate
4. replace
We use pandas' fillna() function to fill in missing values in a dataframe for us.
Here, I'm saying that I would like to replace all the NaN values with 0.
We can also replace each missing value with whatever value comes directly after it in the same column; this is valid for datasets where the observations have some sort of logical order to them.
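A short sketch of both options, with an invented score column:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"score": [1.0, np.nan, 3.0]})

# Replace all NaN values with 0
filled_zero = df.fillna(0)

# Replace each NaN with the value that comes directly after it
# in the same column (a backward fill)
filled_next = df.bfill()

print(filled_zero["score"].tolist())  # [1.0, 0.0, 3.0]
print(filled_next["score"].tolist())  # [1.0, 3.0, 3.0]
```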
Mean/Median, Mode
1. Numerical data → mean/median
2. Categorical data → mode
1. Mean — when the data has no outliers. The mean is the average value, and it is affected by outliers.
2. Median — when the data has more outliers, it's best to replace them with the median value. The median is the middle value (the 50th percentile).
3. Mode — in columns holding categorical data, we can fill the missing values with the mode, which is the most common value.
Example:
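The example in the original slides is an image; a sketch of the idea, with an invented age column, might look like this:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [20.0, np.nan, 30.0, 46.0]})

# Fill with the mean (sensitive to outliers): (20 + 30 + 46) / 3 = 32
mean_filled = df["age"].fillna(df["age"].mean())

# Fill with the median (robust to outliers): the middle value, 30
median_filled = df["age"].fillna(df["age"].median())

print(mean_filled.tolist())    # [20.0, 32.0, 30.0, 46.0]
print(median_filled.tolist())  # [20.0, 30.0, 30.0, 46.0]
```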
Categorical Data
If we want to replace missing values in categorical data, we can replace them with the mode (the most common value).
Example:
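The original example is an image; a sketch with an invented color column:

```python
import pandas as pd

df = pd.DataFrame({"color": ["red", "blue", None, "red"]})

# mode() returns a Series (there can be ties), so take the first entry
mode_value = df["color"].mode()[0]          # "red"
filled = df["color"].fillna(mode_value)

print(filled.tolist())  # ['red', 'blue', 'red', 'red']
```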
bfill, ffill
1. bfill (backward fill) propagates the next observed non-null value backward.
2. ffill (forward fill) propagates the last observed non-null value forward.
Example:
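A sketch on an invented Series; note that a NaN with no later (for bfill) or earlier (for ffill) value stays missing:

```python
import numpy as np
import pandas as pd

s = pd.Series([np.nan, 2.0, np.nan, 4.0, np.nan])

print(s.bfill().tolist())  # [2.0, 2.0, 4.0, 4.0, nan]
print(s.ffill().tolist())  # [nan, 2.0, 2.0, 4.0, 4.0]
```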
Interpolate
Instead of filling all the missing rows with the same value, we can use the interpolate() method.
Example:
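A sketch on an invented Series; linear interpolation spreads the gap evenly between the surrounding observations instead of repeating one value:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, np.nan, np.nan, 5.0])

# Linear interpolation fills the three-row gap with evenly spaced values
print(s.interpolate().tolist())  # [1.0, 2.0, 3.0, 4.0, 5.0]
```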
Scaling and Normalization
Scaling vs. Normalization: What's the difference?
In both cases, you're transforming the
values of numeric variables so that the
transformed data points have specific
helpful properties. The difference is that:
In scaling, you're changing the range of
your data.
In normalization, you're changing the
shape of the distribution of your data.
Scaling
In scaling you're transforming your data so that it fits within a specific scale, like 0-100 or
0-1.
For example, you might be looking at the prices of some products in both Yen and US Dollars. One US Dollar is worth about 100 Yen, but if you don't scale your prices, machine learning methods like SVM or KNN will consider a difference in price of 1 Yen to be as important as a difference of 1 US Dollar!
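A sketch of min-max scaling (one common way to bring data into the 0-1 range), with invented prices; after scaling, both currencies land on the same scale:

```python
import numpy as np

# Hypothetical prices of the same three products in Yen and US Dollars
prices_yen = np.array([100.0, 250.0, 400.0])
prices_usd = np.array([1.0, 2.5, 4.0])

def min_max_scale(x):
    """Rescale values linearly into the 0-1 range."""
    return (x - x.min()) / (x.max() - x.min())

print(min_max_scale(prices_yen))  # [0.  0.5 1. ]
print(min_max_scale(prices_usd))  # [0.  0.5 1. ]
```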
Normalization
Normalization is a more radical transformation. The point of normalization is to change
your observations so that they can be described as a normal distribution.
Normal distribution: also known as the "bell curve", this is a specific statistical distribution in which roughly equal numbers of observations fall above and below the mean.
In general, you'll normalize your data if you're going to be using a machine learning or
statistics technique that assumes your data is normally distributed.
One of the most widely used normalization methods is the Box-Cox transformation.
Its parameter (lambda) is estimated using the profile likelihood function and goodness-of-fit tests.
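A sketch using SciPy's implementation, scipy.stats.boxcox, which estimates lambda by maximum likelihood when none is given; the right-skewed sample is synthetic:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
skewed = rng.exponential(scale=2.0, size=1000)  # strongly right-skewed data

# boxcox requires strictly positive data; with no lmbda argument it
# estimates the transformation parameter by maximum likelihood
normalized, lam = stats.boxcox(skewed)

# The transformed data should be much closer to symmetric
print(stats.skew(skewed), stats.skew(normalized))
```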
We notice here that the shape of our data has changed: before normalizing it was almost L-shaped, but after normalizing it looks more like the outline of a bell (hence "bell curve").