Data Cleaning

The document discusses data cleaning and provides details on why it is important, key aspects of data quality, and common techniques for cleaning data. Specifically, it outlines that raw data often contains errors that can lead to incorrect conclusions if not cleaned. It then defines several criteria for high quality data, including validity, accuracy, completeness, consistency and uniformity. The document also describes common data issues like missing values, outliers and inconsistencies and techniques for addressing them, such as dropping rows, imputing values, and scaling/normalizing data. Finally, it presents the typical workflow of inspecting data for issues, cleaning the data to address problems found, and then verifying the cleaned data.


Data Cleaning
Part I
Zakaria KERKAOU
[email protected]
Why do we need to clean data?
Garbage in, garbage out.

Quality data beats fancy algorithms.


• In a data science workflow, we usually access raw data.
• However, raw data can contain duplicate values, misspellings, data type
parsing errors and values inherited from legacy systems.
• Incorrect or inconsistent data leads to false conclusions. And so, how well you
clean and understand the data has a high impact on the quality of the results.
• In fact, a simple algorithm can outperform a complex one just because it was
given enough high-quality data.
Data quality
• High-quality data needs to pass a set of quality criteria. Those
include:

• Validity.
• Accuracy.
• Completeness.
• Consistency.
• Uniformity.
Data Validity
• Data Validity is the degree to which the data conform to defined
business rules or constraints. For example:

• Data-Type Constraints.
• Range Constraints.
• Mandatory Constraints.
• Unique Constraints.
• Set-Membership constraints.
• Foreign-key constraints.
• Regular expression patterns.
• Cross-field validation.
Data Validity
• Data-Type Constraints: values in a particular column must be of
a particular datatype, e.g., boolean, numeric, date, etc.
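As a minimal pandas sketch (the "age" column and its values are hypothetical), coercing a column to the expected type makes type violations visible as NaN:

```python
import pandas as pd

# Hypothetical column that should be numeric but was read in as strings.
df = pd.DataFrame({"age": ["25", "31", "forty", "18"]})

# Coerce to numeric; anything that cannot be parsed becomes NaN,
# so violations of the data-type constraint are easy to spot.
df["age_numeric"] = pd.to_numeric(df["age"], errors="coerce")
print(df[df["age_numeric"].isnull()])  # rows violating the constraint
```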
Data Validity
• Range Constraints: typically, numbers or dates should fall within
a certain range. That is, they have minimum and/or maximum
permissible values.
• For example a five star rating system should only have a
maximum value of 5.
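A short sketch, assuming a hypothetical five-star "rating" column, that flags and then clips out-of-range values:

```python
import pandas as pd

df = pd.DataFrame({"rating": [4, 5, 7, 3, -1]})  # hypothetical ratings

# Flag values outside the permissible 1-5 range.
print(df[~df["rating"].between(1, 5)])

# One possible fix: clip values back into the valid range.
df["rating"] = df["rating"].clip(lower=1, upper=5)
```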
Data Validity
• Mandatory Constraints: Certain columns cannot be empty. For
Example the identifier (id).
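A minimal sketch (the table and column names are made up) showing how to find and drop rows whose mandatory identifier is empty:

```python
import pandas as pd

df = pd.DataFrame({"id": [1, None, 3], "name": ["pen", "book", "lamp"]})

# Rows with a missing identifier violate the mandatory constraint.
print(df[df["id"].isnull()])

# A common fix is to drop such rows.
df = df.dropna(subset=["id"])
```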
Data Validity
• Uniqueness Constraints: A field, or a combination of fields, must
be unique across a dataset. For example no two products can have
the same identifier.
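A small sketch, with a hypothetical product "id" column, that detects duplicated identifiers and keeps only the first occurrence:

```python
import pandas as pd

df = pd.DataFrame({"id": [1, 2, 2, 3]})  # hypothetical product identifiers

# duplicated(keep=False) marks every row involved in a duplicate.
print(df[df["id"].duplicated(keep=False)])

# One possible fix: keep only the first occurrence of each id.
df = df.drop_duplicates(subset=["id"], keep="first")
```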
Data Validity
Set-Membership constraints: values of a column come from a
set of discrete values, e.g. enum values. For example, a person’s
gender may be male or female.
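A brief sketch of a set-membership check with isin(); the column and the allowed set are illustrative:

```python
import pandas as pd

df = pd.DataFrame({"gender": ["male", "female", "m", "female"]})
allowed = {"male", "female"}

# Rows whose value is not in the allowed set (here "m") violate the constraint.
print(df[~df["gender"].isin(allowed)])
```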
Data Validity
Cross-field validation: certain conditions that span across
multiple fields must hold. For example, a patient’s date of discharge
from the hospital cannot be earlier than the date of admission.
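A minimal sketch of a cross-field check on hypothetical admission/discharge dates:

```python
import pandas as pd

df = pd.DataFrame({
    "admission": pd.to_datetime(["2023-01-10", "2023-02-05"]),
    "discharge": pd.to_datetime(["2023-01-15", "2023-02-01"]),
})

# A discharge date earlier than the admission date violates the rule.
print(df[df["discharge"] < df["admission"]])
```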
Data Validity
Foreign-key constraints: as in relational databases, a foreign key
column can’t have a value that does not exist in the referenced
primary key.
Regular expression patterns: text fields that have to be in a
certain pattern. For example, phone numbers may be required to
have the pattern (999) 999–9999.
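As a sketch of a regular-expression check (the phone numbers and the exact pattern are assumptions), str.match() flags values that do not follow the required format:

```python
import pandas as pd

df = pd.DataFrame({"phone": ["(123) 456-7890", "123-456-7890"]})

# Values that do not match the required pattern are flagged.
pattern = r"^\(\d{3}\) \d{3}-\d{4}$"
print(df[~df["phone"].str.match(pattern)])
```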
Data Accuracy
The degree to which the data is close to the true values.
While defining all possible valid values allows invalid values to be
easily spotted, a valid value is not necessarily an accurate one.
For example, a recorded height may fall within the valid human range
and still differ from the person's true height.
Data Completeness
The degree to which all required measures are known.
Incompleteness is almost impossible to fix with data cleansing
methodology: one cannot infer facts that were not captured when
the data in question was initially recorded.
For some data, such as interview data, it may be possible to fix
incompleteness by going back to the original source.
Missing data will happen for various reasons. One can mitigate the
problem by questioning the original source when possible, say by
re-interviewing the subject.
But chances are, the subject will either give a different answer or
will be hard to reach again.
Data Consistency
The degree to which the data is consistent, within the same data set
or across multiple data sets.
Inconsistency occurs when two values in the data set contradict each
other, for example a customer whose recorded age does not match
their date of birth.
Data Uniformity
The degree to which the data is specified using the same unit of
measure, for example weights recorded consistently in kilograms
rather than mixed with pounds.
The workflow
The workflow is a sequence of three steps aiming at producing high-
quality data and taking into account all the criteria we’ve talked
about.

1) Inspection: Detect unexpected, incorrect, and inconsistent data.
2) Cleaning: Fix or remove the anomalies discovered.
3) Verifying: After cleaning, the results are inspected to verify
correctness.

What you see as a sequential process is, in fact, an iterative,
endless process. One can go from verifying to inspection when new
flaws are detected.
Inspection

Inspecting the data is time-consuming and requires using many
methods for exploring the underlying data for error detection.

• Data profiling.
• Visualisation.
Data profiling
Summary statistics about the data, called data profiling, are really
helpful to give a general idea about the quality of the data.

For example, check whether a particular column conforms to
particular standards or patterns.
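A minimal profiling sketch on a toy dataset (the columns are invented for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    "age": [25, 31, None, 40],
    "country": ["MA", "FR", "MA", "ma"],
})

df.info()                            # dtypes and non-null counts per column
print(df.describe())                 # summary statistics for numeric columns
print(df["country"].value_counts())  # spots inconsistent categories like "MA" vs "ma"
```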
Visualizations
By analysing and visualizing the data using statistical measures such
as the mean, standard deviation, range, or quantiles, one can find
outlier values that are unexpected and thus possibly erroneous.
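As a hedged sketch (the "price" values are made up), a histogram or a box plot makes a suspicious value stand out:

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({"price": [10, 12, 11, 13, 9, 250]})  # 250 is a likely outlier

df["price"].plot(kind="hist", title="Price distribution")
plt.show()

df["price"].plot(kind="box", title="Price box plot")
plt.show()
```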
Cleaning
Data cleaning involves different techniques depending on the problem
and the data type. Different methods can be applied, each with its
own trade-offs.

Overall, incorrect data is either:
• Removed
• Corrected
• Imputed
Handling Missing Values
First we need to figure out why the data is missing.
This is the point at which we get into the part of data science that I like to call
"data intuition". We need to ask the question

Is this value missing because it wasn't recorded or because it
doesn't exist?

If a value is missing because it doesn't exist (like the height of the oldest child
of someone who doesn't have any children) then it doesn't make sense to try
and guess what it might be. These values you probably do want to keep as
NaN.
On the other hand, if a value is missing because it wasn't recorded, then you
can try to guess what it might have been based on the other values in that
column and row.
Handling Missing Values
pandas.isnull() will detect missing values for an array-like object.

This function takes a scalar or array-like object and indicates whether values
are missing (NaN in numeric arrays, None or NaN in object arrays, NaT in
datetimelike).
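A short illustration of pandas.isnull() on a Series and on a whole DataFrame (the data is arbitrary):

```python
import pandas as pd
import numpy as np

s = pd.Series([1.0, np.nan, 3.0, None])
print(pd.isnull(s))       # boolean mask marking the missing entries

df = pd.DataFrame({"a": [1, None], "b": ["x", "y"]})
print(df.isnull().sum())  # number of missing values per column
```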
Drop missing values
If you're in a hurry or don't have a reason to figure out why your
values are missing, one option you have is to just remove any rows
or columns that contain missing values.
To remove all rows that contain at least one missing value, we use
dropna() (a sketch follows below).

However, in case every row in our dataset has at least one missing
value, this will remove all our data.
In that case, we might have better luck removing all the columns
that have at least one missing value instead.
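A minimal sketch of both options with dropna(); the DataFrame is a toy example:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({"a": [1, np.nan, 3], "b": [np.nan, np.nan, 6]})

print(df.dropna())        # remove every row with at least one missing value
print(df.dropna(axis=1))  # remove every column with at least one missing value
```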
Filling in missing values automatically
Another option is to try and fill in the missing values.
Different ways to fill the missing values:
1. Mean/Median, Mode
2. bfill, ffill
3. interpolate
4. replace
Filling in missing values automatically
We use pandas' fillna() function to fill in missing values in a
dataframe for us.

Here, I'm saying that I would like to replace all the NaN values with 0.
We can also replace missing values with whatever value comes directly
after them in the same column; this is valid for datasets where the
observations have some sort of logical order to them.
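A small sketch of both ideas with fillna() and bfill(); the "sales" column is hypothetical:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({"sales": [100, np.nan, 120, np.nan]})

print(df.fillna(0))  # replace every NaN with 0
print(df.bfill())    # replace each NaN with the next value in the same column
```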
Mean/Median, Mode
1. Numerical Data → Mean/Median
2. Categorical Data → Mode

1. Mean — When the data has no outliers. The mean is the average
value and will be affected by outliers.

2. Median — When the data has outliers, it is best to fill missing
values with the median. The median is the middle value (50%).

3. Mode — In columns holding categorical data, we can fill the
missing values with the mode, which is the most common value.
Mean/Median, Mode
Example:
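A possible illustration (the "salary" values, including the outlier, are invented) of filling with the mean versus the more robust median:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({"salary": [3000, 3200, np.nan, 2900, 25000]})  # 25000 is an outlier

print(df["salary"].fillna(df["salary"].mean()))    # mean is pulled up by the outlier
print(df["salary"].fillna(df["salary"].median()))  # median is more robust
```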
Categorical Data
If we want to replace missing values in categorical data, we can
replace them with the mode (the most common value).
Example:
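A brief sketch with a hypothetical "city" column, filling the missing value with the mode:

```python
import pandas as pd

df = pd.DataFrame({"city": ["Rabat", "Casablanca", None, "Casablanca"]})

# mode() can return several values; [0] takes the most common one.
df["city"] = df["city"].fillna(df["city"].mode()[0])
print(df)
```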
bfill, ffill
1. bfill — backward fill — propagates the first observed non-null
value backward.
2. ffill — forward fill — propagates the last observed non-null
value forward.

Example:
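A minimal sketch on an arbitrary Series showing the two directions of filling:

```python
import pandas as pd
import numpy as np

s = pd.Series([1.0, np.nan, np.nan, 4.0])

print(s.bfill())  # NaNs become 4.0: the next observed value is propagated backward
print(s.ffill())  # NaNs become 1.0: the last observed value is propagated forward
```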
Interpolate
Instead of filling all missing rows with the same value, we can use
the interpolate method, which estimates each missing value from its
neighbouring observations.

Example:
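A small sketch on an arbitrary Series; linear interpolation spreads the gap evenly:

```python
import pandas as pd
import numpy as np

s = pd.Series([10.0, np.nan, np.nan, np.nan, 50.0])

print(s.interpolate())  # the gap is filled with 20, 30 and 40
```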
Scaling and Normalization
Scaling vs. Normalization: What's the difference?
In both cases, you're transforming the values of numeric variables so
that the transformed data points have specific helpful properties.
The difference is that:

• In scaling, you're changing the range of your data.
• In normalization, you're changing the shape of the distribution of
your data.
Scaling
In scaling you're transforming your data so that it fits within a specific scale, like 0-100 or
0-1.
For example, you might be looking at the prices of some products in both Yen and US
Dollars. One US Dollar is worth about 100 Yen, but if you don't scale your prices, machine
learning methods like SVM or KNN will consider a difference in price of 1 Yen as
important as a difference of 1 US Dollar!
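As a hedged sketch (the prices are invented), a simple min-max scaling in plain pandas maps each column to the 0-1 range so the two currencies become comparable; scikit-learn's MinMaxScaler does the same thing:

```python
import pandas as pd

prices = pd.DataFrame({
    "price_usd": [1.0, 2.5, 4.0],
    "price_yen": [100.0, 250.0, 400.0],
})

# Min-max scaling: (x - min) / (max - min), column by column.
scaled = (prices - prices.min()) / (prices.max() - prices.min())
print(scaled)
```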
Normalization
Normalization is a more radical transformation. The point of normalization is to change
your observations so that they can be described as a normal distribution.

Normal distribution: Also known as the "bell curve", this is a specific statistical
distribution where roughly equal numbers of observations fall above and below the mean.
In general, you'll normalize your data if you're going to be using a machine learning or
statistics technique that assumes your data is normally distributed.
One of the most used methods to normalize data is the Box-Cox Transformation.

Its parameter (lambda) is estimated using the profile likelihood function or
goodness-of-fit tests.
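One common implementation is scipy.stats.boxcox; the sketch below uses synthetic, strictly positive skewed data, since Box-Cox requires positive values:

```python
import numpy as np
from scipy import stats

# Synthetic right-skewed data, shifted to be strictly positive.
data = np.random.exponential(size=1000) + 1e-3

# boxcox returns the transformed data and the lambda estimated by
# maximizing the (profile) log-likelihood.
normalized, fitted_lambda = stats.boxcox(data)
print(fitted_lambda)
```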
Normalization

We notice here that the shape of our data has changed. Before
normalizing it was almost L-shaped, but after normalizing it looks
more like the outline of a bell (hence "bell curve").
