Data Cleaning

The document discusses data cleaning and provides details on why it is important, key aspects of data quality, and common techniques for cleaning data. Specifically, it outlines that raw data often contains errors that can lead to incorrect conclusions if not cleaned. It then defines several criteria for high quality data, including validity, accuracy, completeness, consistency and uniformity. The document also describes common data issues like missing values, outliers and inconsistencies and techniques for addressing them, such as dropping rows, imputing values, and scaling/normalizing data. Finally, it presents the typical workflow of inspecting data for issues, cleaning the data to address problems found, and then verifying the cleaned data.


Data Cleaning
Part I
Zakaria KERKAOU
[email protected]
Why do we need to clean data?
Garbage in, garbage out.

Quality data beats fancy algorithms.


• In a data science workflow, we usually access raw data.
• However, raw data can contain duplicate values, misspellings, data type
parsing errors and values inherited from legacy systems.
• Incorrect or inconsistent data leads to false conclusions. And so, how well you
clean and understand the data has a high impact on the quality of the results.
• In fact, a simple algorithm can outperform a complex one just because it was
given enough high-quality data.
Data quality
• High-quality data needs to pass a set of quality criteria. Those
include:

• Validity.
• Accuracy.
• Completeness.
• Consistency.
• Uniformity.
Data Validity
• Data Validity is the degree to which the data conform to defined
business rules or constraints. For example:

• Data-Type Constraints.
• Range Constraints.
• Mandatory Constraints.
• Unique Constraints.
• Set-Membership constraints.
• Foreign-key constraints.
• Regular expression patterns.
• Cross-field validation.
Data Validity
• Data-Type Constraints: values in a particular column must be of
a particular datatype, e.g., boolean, numeric, date, etc.
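As a minimal pandas sketch (the "age" column and its values are hypothetical), coercing a column to the expected type makes type violations visible as NaN:

```python
import pandas as pd

# Hypothetical column that should be numeric but was read in as strings.
df = pd.DataFrame({"age": ["25", "31", "forty", "18"]})

# Coerce to numeric; anything that cannot be parsed becomes NaN,
# so violations of the data-type constraint are easy to spot.
df["age_numeric"] = pd.to_numeric(df["age"], errors="coerce")
print(df[df["age_numeric"].isnull()])  # rows violating the constraint
```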
Data Validity
• Range Constraints: typically, numbers or dates should fall within
a certain range. That is, they have minimum and/or maximum
permissible values.
• For example a five star rating system should only have a
maximum value of 5.
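A short sketch, assuming a hypothetical five-star "rating" column, that flags and then clips out-of-range values:

```python
import pandas as pd

df = pd.DataFrame({"rating": [4, 5, 7, 3, -1]})  # hypothetical ratings

# Flag values outside the permissible 1-5 range.
print(df[~df["rating"].between(1, 5)])

# One possible fix: clip values back into the valid range.
df["rating"] = df["rating"].clip(lower=1, upper=5)
```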
Data Validity
• Mandatory Constraints: Certain columns cannot be empty. For
Example the identifier (id).
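A minimal sketch (the table and column names are made up) showing how to find and drop rows whose mandatory identifier is empty:

```python
import pandas as pd

df = pd.DataFrame({"id": [1, None, 3], "name": ["pen", "book", "lamp"]})

# Rows with a missing identifier violate the mandatory constraint.
print(df[df["id"].isnull()])

# A common fix is to drop such rows.
df = df.dropna(subset=["id"])
```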
Data Validity
• Uniqueness Constraints: A field, or a combination of fields, must
be unique across a dataset. For example no two products can have
the same identifier.
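A small sketch, with a hypothetical product "id" column, that detects duplicated identifiers and keeps only the first occurrence:

```python
import pandas as pd

df = pd.DataFrame({"id": [1, 2, 2, 3]})  # hypothetical product identifiers

# duplicated(keep=False) marks every row involved in a duplicate.
print(df[df["id"].duplicated(keep=False)])

# One possible fix: keep only the first occurrence of each id.
df = df.drop_duplicates(subset=["id"], keep="first")
```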
Data Validity
Set-Membership constraints: values of a column come from a
set of discrete values, e.g. enum values. For example, a person’s
gender may be male or female.
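A brief sketch of a set-membership check with isin(); the column and the allowed set are illustrative:

```python
import pandas as pd

df = pd.DataFrame({"gender": ["male", "female", "m", "female"]})
allowed = {"male", "female"}

# Rows whose value is not in the allowed set (here "m") violate the constraint.
print(df[~df["gender"].isin(allowed)])
```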
Data Validity
Cross-field validation: certain conditions that span across
multiple fields must hold. For example, a patient’s date of discharge
from the hospital cannot be earlier than the date of admission.
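A minimal sketch of a cross-field check on hypothetical admission/discharge dates:

```python
import pandas as pd

df = pd.DataFrame({
    "admission": pd.to_datetime(["2023-01-10", "2023-02-05"]),
    "discharge": pd.to_datetime(["2023-01-15", "2023-02-01"]),
})

# A discharge date earlier than the admission date violates the rule.
print(df[df["discharge"] < df["admission"]])
```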
Data Validity
Foreign-key constraints: as in relational databases, a foreign key
column can’t have a value that does not exist in the referenced
primary key.
Regular expression patterns: text fields that have to be in a
certain pattern. For example, phone numbers may be required to
have the pattern (999) 999–9999.
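As a sketch of a regular-expression check (the phone numbers and the exact pattern are assumptions), str.match() flags values that do not follow the required format:

```python
import pandas as pd

df = pd.DataFrame({"phone": ["(123) 456-7890", "123-456-7890"]})

# Values that do not match the required pattern are flagged.
pattern = r"^\(\d{3}\) \d{3}-\d{4}$"
print(df[~df["phone"].str.match(pattern)])
```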
Data Accuracy
The degree to which the data is close to the true values.
While defining all possible valid values allows invalid values to be
easily spotted, a valid value is not necessarily an accurate one.
For example, a recorded height may fall within the valid human range
and still differ from the person's true height.
Data Completeness
The degree to which all required measures are known.
Incompleteness is almost impossible to fix with data cleansing
methodology: one cannot infer facts that were not captured when
the data in question was initially recorded.
For some data, such as interview data, it may be possible to fix
incompleteness by going back to the original source.
Missing data will happen for various reasons. One can mitigate the
problem by questioning the original source when possible, say by
re-interviewing the subject.
But chances are, the subject will either give a different answer or
will be hard to reach again.
Data Consistency
The degree to which the data is consistent, within the same data set
or across multiple data sets.
Inconsistency occurs when two values in the data set contradict each
other, for example a customer whose recorded age does not match
their date of birth.
Data Uniformity
The degree to which the data is specified using the same unit of
measure, for example weights recorded consistently in kilograms
rather than mixed with pounds.
The workflow
The workflow is a sequence of three steps aiming at producing high-
quality data and taking into account all the criteria we’ve talked
about.

1) Inspection: Detect unexpected, incorrect, and inconsistent data.
2) Cleaning: Fix or remove the anomalies discovered.
3) Verifying: After cleaning, the results are inspected to verify
correctness.

What you see as a sequential process is, in fact, an iterative,
endless process. One can go from verifying to inspection when new
flaws are detected.
Inspection

Inspecting the data is time-consuming and requires using many
methods for exploring the underlying data for error detection.

• Data profiling.
• Visualisation.
Data profiling
Summary statistics about the data, called data profiling, are really
helpful to give a general idea about the quality of the data.

For example, check whether a particular column conforms to
particular standards or patterns.
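A minimal profiling sketch on a toy dataset (the columns are invented for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    "age": [25, 31, None, 40],
    "country": ["MA", "FR", "MA", "ma"],
})

df.info()                            # dtypes and non-null counts per column
print(df.describe())                 # summary statistics for numeric columns
print(df["country"].value_counts())  # spots inconsistent categories like "MA" vs "ma"
```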
Visualizations
By analysing and visualizing the data using statistical measures such
as the mean, standard deviation, range, or quantiles, one can find
outlier values that are unexpected and thus possibly erroneous.
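As a hedged sketch (the "price" values are made up), a histogram or a box plot makes a suspicious value stand out:

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({"price": [10, 12, 11, 13, 9, 250]})  # 250 is a likely outlier

df["price"].plot(kind="hist", title="Price distribution")
plt.show()

df["price"].plot(kind="box", title="Price box plot")
plt.show()
```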
Cleaning
Data cleaning involves different techniques depending on the problem
and the data type. Different methods can be applied, each with its
own trade-offs.

Overall, incorrect data is either:
• Removed
• Corrected
• Imputed
Handling Missing Values
First we need to figure out why the data is missing.
This is the point at which we get into the part of data science that I like to call
"data intuition". We need to ask the question

Is this value missing because it wasn't recorded or because it
doesn't exist?

If a value is missing because it doesn't exist (like the height of the oldest child
of someone who doesn't have any children) then it doesn't make sense to try
and guess what it might be. These values you probably do want to keep as
NaN.
On the other hand, if a value is missing because it wasn't recorded, then you
can try to guess what it might have been based on the other values in that
column and row.
Handling Missing Values
pandas.isnull() will detect missing values for an array-like object.

This function takes a scalar or array-like object and indicates whether values
are missing (NaN in numeric arrays, None or NaN in object arrays, NaT in
datetimelike).
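A short illustration of pandas.isnull() on a Series and on a whole DataFrame (the data is arbitrary):

```python
import pandas as pd
import numpy as np

s = pd.Series([1.0, np.nan, 3.0, None])
print(pd.isnull(s))       # boolean mask marking the missing entries

df = pd.DataFrame({"a": [1, None], "b": ["x", "y"]})
print(df.isnull().sum())  # number of missing values per column
```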
Drop missing values
If you're in a hurry or don't have a reason to figure out why your
values are missing, one option you have is to just remove any rows
or columns that contain missing values.
To remove all rows that contain at least one missing value, we use
dropna() (a sketch follows below).

However, in case every row in our dataset has at least one missing
value, this will remove all our data.
In that case, we might have better luck removing all the columns
that have at least one missing value instead.
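A minimal sketch of both options with dropna(); the DataFrame is a toy example:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({"a": [1, np.nan, 3], "b": [np.nan, np.nan, 6]})

print(df.dropna())        # remove every row with at least one missing value
print(df.dropna(axis=1))  # remove every column with at least one missing value
```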
Filling in missing values automatically
Another option is to try and fill in the missing values.
Different ways to fill the missing values:
1. Mean/Median, Mode
2. bfill, ffill
3. interpolate
4. replace
Filling in missing values automatically
We use pandas' fillna() function to fill in missing values in a
dataframe for us.

Here, I'm saying that I would like to replace all the NaN values with 0.
We can also replace missing values with whatever value comes directly
after them in the same column; this is valid for datasets where the
observations have some sort of logical order to them.
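A small sketch of both ideas with fillna() and bfill(); the "sales" column is hypothetical:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({"sales": [100, np.nan, 120, np.nan]})

print(df.fillna(0))  # replace every NaN with 0
print(df.bfill())    # replace each NaN with the next value in the same column
```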
Mean/Median, Mode
1. Numerical Data → Mean/Median
2. Categorical Data → Mode

1. Mean — When the data has no outliers. The mean is the average
value and will be affected by outliers.

2. Median — When the data has outliers, it is best to fill missing
values with the median. The median is the middle value (50%).

3. Mode — In columns holding categorical data, we can fill the
missing values with the mode, which is the most common value.
Mean/Median, Mode
Example:
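A possible illustration (the "salary" values, including the outlier, are invented) of filling with the mean versus the more robust median:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({"salary": [3000, 3200, np.nan, 2900, 25000]})  # 25000 is an outlier

print(df["salary"].fillna(df["salary"].mean()))    # mean is pulled up by the outlier
print(df["salary"].fillna(df["salary"].median()))  # median is more robust
```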
Categorical Data
If we want to replace missing values in categorical data, we can
replace them with the mode (the most common value).
Example:
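A brief sketch with a hypothetical "city" column, filling the missing value with the mode:

```python
import pandas as pd

df = pd.DataFrame({"city": ["Rabat", "Casablanca", None, "Casablanca"]})

# mode() can return several values; [0] takes the most common one.
df["city"] = df["city"].fillna(df["city"].mode()[0])
print(df)
```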
bfill, ffill
1. bfill — backward fill — propagates the first observed non-null
value backward.
2. ffill — forward fill — propagates the last observed non-null
value forward.

Example:
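A minimal sketch on an arbitrary Series showing the two directions of filling:

```python
import pandas as pd
import numpy as np

s = pd.Series([1.0, np.nan, np.nan, 4.0])

print(s.bfill())  # NaNs become 4.0: the next observed value is propagated backward
print(s.ffill())  # NaNs become 1.0: the last observed value is propagated forward
```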
Interpolate
Instead of filling all missing rows with the same value, we can use
the interpolate method, which estimates each missing value from its
neighbouring observations.

Example:
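A small sketch on an arbitrary Series; linear interpolation spreads the gap evenly:

```python
import pandas as pd
import numpy as np

s = pd.Series([10.0, np.nan, np.nan, np.nan, 50.0])

print(s.interpolate())  # the gap is filled with 20, 30 and 40
```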
Scaling and Normalization
Scaling vs. Normalization: What's the difference?
In both cases, you're transforming the values of numeric variables so
that the transformed data points have specific helpful properties.
The difference is that:

• In scaling, you're changing the range of your data.
• In normalization, you're changing the shape of the distribution of
your data.
Scaling
In scaling you're transforming your data so that it fits within a specific scale, like 0-100 or
0-1.
For example, you might be looking at the prices of some products in both Yen and US
Dollars. One US Dollar is worth about 100 Yen, but if you don't scale your prices, machine
learning methods like SVM or KNN will consider a difference in price of 1 Yen as
important as a difference of 1 US Dollar!
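As a hedged sketch (the prices are invented), a simple min-max scaling in plain pandas maps each column to the 0-1 range so the two currencies become comparable; scikit-learn's MinMaxScaler does the same thing:

```python
import pandas as pd

prices = pd.DataFrame({
    "price_usd": [1.0, 2.5, 4.0],
    "price_yen": [100.0, 250.0, 400.0],
})

# Min-max scaling: (x - min) / (max - min), column by column.
scaled = (prices - prices.min()) / (prices.max() - prices.min())
print(scaled)
```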
Normalization
Normalization is a more radical transformation. The point of normalization is to change
your observations so that they can be described as a normal distribution.

Normal distribution: Also known as the "bell curve", this is a specific statistical
distribution where roughly equal numbers of observations fall above and below the mean.
In general, you'll normalize your data if you're going to be using a machine learning or
statistics technique that assumes your data is normally distributed.
One of the most used methods to normalize data is the Box-Cox Transformation.

Its parameter (lambda) is estimated using the profile likelihood function or
goodness-of-fit tests.
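One common implementation is scipy.stats.boxcox; the sketch below uses synthetic, strictly positive skewed data, since Box-Cox requires positive values:

```python
import numpy as np
from scipy import stats

# Synthetic right-skewed data, shifted to be strictly positive.
data = np.random.exponential(size=1000) + 1e-3

# boxcox returns the transformed data and the lambda estimated by
# maximizing the (profile) log-likelihood.
normalized, fitted_lambda = stats.boxcox(data)
print(fitted_lambda)
```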
Normalization

We notice here that the shape of our data has changed. Before
normalizing it was almost L-shaped, but after normalizing it looks
more like the outline of a bell (hence "bell curve").
