Importance of Data Cleaning 1

The document discusses the significance of data cleaning, outlining its definition, necessity, methods, and best practices. It emphasizes that clean data enhances analysis efficiency, improves data quality, and prevents false conclusions that could lead to poor decision-making. The document also highlights various data quality attributes and the overall benefits of implementing effective data cleaning processes.

Research Skill Enhancement Webinar co‐hosted by

RDA CODATA Summer School and CODATA Connect group


presented by
Simisani Ndaba
Importance of Data Cleaning

05 August 2021
This work is licensed under Creative Commons
Attribution 4.0 International License.
Overview
• Meaning of Data Cleaning
• Need for Data Cleaning
• Data Cleaning Methods
• Data Cleaning Steps
• Best Practices
• Data Quality Attributes
• How Data Cleaning is used in a Dataset
• Overall Benefits of Data Cleaning
What is Data Cleaning?
Data cleaning is a process in which you go through all of the data in a data set
and either remove or update information that is:
• incomplete,
• incorrect,
• improperly formatted,
• duplicated, or
• irrelevant.
Related terms: Data Scrubbing, Data Cleansing, Data Pre‐processing
Michael Walker (2021) Python Data Cleaning Cookbook
Raw Data vs Clean Data
• Raw data is the data that is collected directly from the data source.

Template showing an example of raw data: the number of colonies per treatment
condition and controls, Ponti et al (2014)
Raw Data vs Clean Data
• “Dirty Data” is raw data full of irrelevances, errors, and corrupt information
• Clean data is in analyzable format

An example of dirty data and cleaned sample (Shaded cells denote dirty
values, and their cleaned values are in bold font), Krishnan et al(2014)
Figure: raw data becomes processed data through data cleaning.
Source: CrowdFlower 2016 to 2018
The Need for Data Cleaning
• Clean data helps you perform the analysis faster,
saving precious time.
• Cleaning improves the quality of data, making it “fit for
use” by users.
• It improves user documentation and presentation.
• False conclusions because of incorrect or “dirty” data
can inform poor decision‐making.
• False conclusions can lead to moments in reporting
when you realize your data doesn’t stand up to
scrutiny.
• It is important to create a culture of quality data in
your research work.
The Need for Data Cleaning contd..
• Combining multiple data sources can create
synchronisation issues.
• If data is incorrect, outcomes are
unreliable.
• Data cleaning processes will vary from
dataset to dataset.
• Establish a template for your data
cleaning process.
Data Cleaning Methods
• Histograms
• Conversion Tables
• Tools
• Algorithms
• Manually
How do you Clean Data?
• Import data
• Merge data sets
• Rebuild missing data
• Standardisation
• Normalisation
• Verification and enrichment
• Export data
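The steps above can be sketched end to end in a few lines of Python. This is only a minimal illustration using the standard library, not the presenter's method: the sample data, the column names (id, city) and the helper names are all assumptions, and for brevity the sketch drops incomplete rows instead of rebuilding their missing values.

```python
import csv
import io

# Two hypothetical raw sources with stray spaces, mixed case and a missing value.
RAW_A = "id,city\n1, Warsaw \n2,\n"
RAW_B = "id,city\n3,KRAKOW\n1, Warsaw \n"

def import_rows(text):
    """Import: parse CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text)))

def standardise(row):
    """Standardise: trim whitespace and use one capitalisation style."""
    return {key: value.strip().title() for key, value in row.items()}

def clean(sources):
    """Merge the sources, standardise, drop incomplete rows, remove duplicates."""
    merged = [row for text in sources for row in import_rows(text)]
    rows = [standardise(row) for row in merged]
    rows = [row for row in rows if all(row.values())]  # simplification: drop, not rebuild
    seen, unique = set(), []
    for row in rows:
        key = tuple(row.items())
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique

def export_rows(rows):
    """Export: write the cleaned rows back out as CSV text."""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=["id", "city"], lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()

cleaned = clean([RAW_A, RAW_B])
print(export_rows(cleaned))
```

In a real project each stage would usually be a separate, documented step so that it can be re-run and checked on its own.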
Best Practices in Data Cleaning

• Consider your data in the most holistic way possible
• Increase controls on database inputs
• Choose the right software solutions
• Limit your sample size
• Spot check errors throughout
• Leverage free online courses
Data Quality
Validity
The degree to which your data conforms to defined rules or constraints.
• Data‐Type Constraints: values must be of a particular datatype, e.g., boolean,
numeric, date, etc.
• Range Constraints: numbers or dates should fall within a certain range.
• Mandatory Constraints: certain columns cannot be empty.
• Unique Constraints: a field, or a combination of fields, must be unique across
a dataset.
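As a rough illustration, the four kinds of constraint can each be expressed as a small check. The field names (id, age), the 0 to 120 age range and the function name below are assumptions made for this sketch, not rules taken from the presentation.

```python
def validity_errors(rows):
    """Return (row index, message) pairs for every constraint a row breaks."""
    errors = []
    seen_ids = set()
    for index, row in enumerate(rows):
        record_id = row.get("id")
        age = row.get("age")
        # Mandatory constraint: id cannot be empty.
        if record_id in (None, ""):
            errors.append((index, "mandatory: id is empty"))
        # Unique constraint: id must not repeat across the data set.
        elif record_id in seen_ids:
            errors.append((index, "unique: id repeated"))
        else:
            seen_ids.add(record_id)
        # Data-type constraint: age must be an integer (bool is excluded).
        if not isinstance(age, int) or isinstance(age, bool):
            errors.append((index, "data type: age is not an integer"))
        # Range constraint: age must fall within an assumed plausible range.
        elif not 0 <= age <= 120:
            errors.append((index, "range: age outside 0-120"))
    return errors

rows = [
    {"id": "A1", "age": 34},
    {"id": "A1", "age": 29},       # repeated id
    {"id": "", "age": "forty"},    # empty id, wrong type
    {"id": "A3", "age": 140},      # out of range
]
errors = validity_errors(rows)
for index, message in errors:
    print(index, message)
```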
Accuracy
• Ensure your data is close to the true values.
• Defining the possible valid values allows invalid values to be easily
spotted, but it does not mean that the valid values are accurate.
• Note the difference between accuracy and precision.
• Accuracy refers to how close a measurement is to the true or accepted value.
Precision refers to how close measurements of the same item are to each
other. Precision is independent of accuracy.
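The distinction can be made concrete with a little arithmetic. In the sketch below, accuracy is summarised as the distance between the mean measurement and the true value, and precision as the spread (standard deviation) of the measurements; the two scales and their readings are invented for illustration.

```python
from statistics import mean, pstdev

def accuracy_error(measurements, true_value):
    """Distance between the average measurement and the true value."""
    return abs(mean(measurements) - true_value)

def precision_spread(measurements):
    """Spread of repeated measurements around their own mean."""
    return pstdev(measurements)

TRUE_WEIGHT = 100.0
scale_a = [100.1, 99.9, 100.0]   # close to the truth and close to each other
scale_b = [104.9, 105.0, 105.1]  # close to each other, but far from the truth

for name, readings in (("scale_a", scale_a), ("scale_b", scale_b)):
    print(name, accuracy_error(readings, TRUE_WEIGHT), precision_spread(readings))
```

Both scales show the same spread, so they are equally precise, yet only the first is accurate.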
Completeness
• The degree to which all required data is known.
• Missing data is going to happen for various reasons.
• One can mitigate this problem by questioning the original source if
possible, say re‐interviewing the subject.
• Chances are, the subject is either going to give a different answer or
will be hard to reach again.
Consistency
• Ensure your data is consistent within the same dataset and/or across
multiple data sets.
• Inconsistency occurs when two values in the data set contradict each
other.
• A valid age, say 10, might not match the marital status, say
divorced. A customer is recorded in two different tables with two
different addresses. Which one is true?
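Both kinds of contradiction can be detected with simple cross-field and cross-table checks. The sketch below is illustrative only: the field names, the minimum age of 16 and the sample records are assumptions, and the checks only flag conflicts; deciding which value is true still needs a human or a trusted source.

```python
def implausible_marital_status(people, min_age=16):
    """Return records whose marital status implies a marriage below min_age."""
    married_states = {"married", "divorced", "widowed"}
    return [
        person for person in people
        if person["age"] < min_age and person["marital_status"] in married_states
    ]

def conflicting_addresses(table_a, table_b):
    """Return ids of customers recorded with different addresses in two tables."""
    shared_ids = table_a.keys() & table_b.keys()
    return sorted(cid for cid in shared_ids if table_a[cid] != table_b[cid])

people = [
    {"name": "Ann", "age": 10, "marital_status": "divorced"},
    {"name": "Ben", "age": 42, "marital_status": "divorced"},
]
billing = {"c1": "1 Main St", "c2": "5 Oak Ave"}
shipping = {"c1": "1 Main St", "c2": "9 Elm Rd", "c3": "2 Pine Ln"}

print(implausible_marital_status(people))
print(conflicting_addresses(billing, shipping))
```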
Uniqueness
• A measure of unwanted duplication existing within or across systems
for a particular field, record, or data set.
Timeliness
• The extent to which the age of the data is appropriate for the task at
hand.
Other Dimensions

Integrity
• The quality, reliability, trustworthiness, and completeness of a
data set – providing accuracy, consistency and context.
• This criterion looks at whether a dataset follows the rules and
standards set. Are there any values missing that can harm the
efficacy of the data or keep analysts from discerning important
relationships or patterns?
Uniformity
• The degree to which the data is specified using the same unit of
measure.
• The weight may be recorded either in pounds or kilos. The date might
follow the USA format or the European format. The currency is sometimes in
US dollars and sometimes in yen.
• So the data must be converted to a single unit of measure.
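A minimal sketch of such a conversion, assuming each value is stored together with a label for its unit or date style (the labels "lb", "kg", "US" and "EU" and the function names are made up for this example):

```python
from datetime import datetime

# Conversion factors to kilograms (1 lb is defined as exactly 0.45359237 kg).
KG_PER_UNIT = {"kg": 1.0, "lb": 0.45359237}

# Day/month order differs between the two date styles.
DATE_FORMATS = {"US": "%m/%d/%Y", "EU": "%d/%m/%Y"}

def to_kilograms(value, unit):
    """Convert a weight in the given unit to kilograms."""
    return value * KG_PER_UNIT[unit]

def to_iso_date(text, style):
    """Convert a US or European style date string to ISO format (YYYY-MM-DD)."""
    return datetime.strptime(text, DATE_FORMATS[style]).date().isoformat()

print(to_kilograms(10, "lb"))
print(to_iso_date("08/05/2021", "US"))
print(to_iso_date("05/08/2021", "EU"))
```

A string such as 05/08/2021 is ambiguous on its own, which is why the sketch requires the style to be known rather than guessing it.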
Example using Data cleansing in stages

• The following example cleans a data set containing company registration
numbers, e‐mails, addresses, etc. in these stages:
• Importing dataset
• Data validation and Removing Irrelevant data
• Formatting data to a common value (standardization / consistency)
• Cleaning up duplicates
• Filling missing data vs. erasing incomplete data
Example using Data cleansing…..
Data cleansing Step 1: Importing Data set
List of tax numbers of Polish companies (Transparent Data, 2021)

Data Validation of company TAX numbers (raw data)


Data cleansing Step 2: Data Validation
• In this dataset, the last digit of each tax identification number is a
‘check digit’, which is validated by an algorithm.
• Check digit validation:
• Multiply each of the first nine digits of the tax number (542269845) by the
weights (in sequence: 6, 5, 7, 2, 3, 4, 5, 6, 7) and sum the results to get the checksum:
(5*6)+(4*5)+(2*7)+(2*2)+(6*3)+(9*4)+(8*5)+(4*6)+(5*7)=221
• Divide the checksum by 11 and take the remainder:
221 mod 11 = 1 (since 221 = 11*20 + 1)
• The remainder of the division should be identical to the last digit in the tax
number, that is, from the list 542269845(1). Here the remainder is 1, so the
number is valid.
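The check can be written as a short function. This is a sketch of the weighted modulo 11 rule described on the slide, with function names chosen for illustration; it uses the integer remainder operator, so 221 % 11 evaluates to 1 and no rounding is involved.

```python
WEIGHTS = (6, 5, 7, 2, 3, 4, 5, 6, 7)

def check_digit(first_nine):
    """Weighted sum of the first nine digits, reduced modulo 11."""
    checksum = sum(int(digit) * weight for digit, weight in zip(first_nine, WEIGHTS))
    return checksum % 11

def is_valid_tax_number(number):
    """True if number is ten digits and its last digit matches the check digit."""
    if len(number) != 10 or not (number.isascii() and number.isdigit()):
        return False
    # A remainder of 10 can never equal a single digit, so such numbers fail.
    return check_digit(number[:9]) == int(number[9])

print(check_digit("542269845"))
print(is_valid_tax_number("5422698451"))
print(is_valid_tax_number("5422698452"))
```

The function expects a number that has already been stripped of prefixes and separators.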
Data cleansing Step 2: Data Validation and Removing Irrelevant Data

• Data Validation of company TAX numbers (data after validation)


Data cleansing Step 3: Formatting data to a common form
• The next step is to normalize the data.
• Some tax numbers were written with dashes, spaces or the prefix
“PL”, which stands for Poland.
• How do we format all company tax numbers to a common form?
• Omit the prefix with the country code.
• Write all numbers without any special characters separating the digits.
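A minimal sketch of these two rules in Python, assuming the only separators are spaces and dashes (the function name and sample values are illustrative):

```python
import re

def normalise_tax_number(raw):
    """Remove separators and the PL country prefix from a tax number."""
    text = re.sub(r"[\s-]", "", raw).upper()  # drop spaces and dashes
    if text.startswith("PL"):
        text = text[2:]                       # omit the country code prefix
    return text

samples = ["PL 542-269-84-51", "542 269 84 51", "pl5422698451", "5422698451"]
normalised = [normalise_tax_number(sample) for sample in samples]
print(normalised)
```

Any other stray characters are deliberately left in place so that a later validation step can still reject them.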
after formatting data


Data cleansing Step 4: Cleaning up duplicates
• The next step in data cleaning is to check for duplicates

after removing duplicates
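Once the values share a common form, exact duplicates can be dropped while keeping the first occurrence of each value. The sketch below uses a plain Python dictionary for this; the sample numbers are invented.

```python
def drop_duplicates(values):
    """Remove repeated values, keeping the first occurrence and the original order."""
    # dict keys are unique and remember insertion order.
    return list(dict.fromkeys(values))

tax_numbers = ["5422698451", "1234563218", "5422698451"]
unique_numbers = drop_duplicates(tax_numbers)
print(unique_numbers)
```

Running this before formatting would miss duplicates such as "PL 542-269-84-51" and "5422698451", which is why de-duplication comes after the formatting step.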


Data cleansing Step 5: Filling missing data and erasing incomplete data
• The next step is to deal with incomplete data.
• The voivodeship or district can easily be filled in based on the name of the city or the postal
code.

addresses data set: Filling missing data and erasing incomplete data
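Filling and erasing can be combined in one pass: fill the voivodeship from the city where a lookup is available, and erase rows that are still incomplete afterwards. The lookup table below holds only two real city to voivodeship pairs and stands in for a full reference list; the field and function names are assumptions.

```python
# Tiny stand-in for a complete city -> voivodeship reference table.
VOIVODESHIP_BY_CITY = {
    "Warszawa": "mazowieckie",
    "Kraków": "małopolskie",
}

def fill_or_drop(rows):
    """Fill a missing voivodeship from the city; drop rows that stay incomplete."""
    cleaned = []
    for row in rows:
        row = dict(row)  # work on a copy so the input is not modified
        if not row.get("voivodeship"):
            row["voivodeship"] = VOIVODESHIP_BY_CITY.get(row.get("city"), "")
        if row.get("city") and row["voivodeship"]:
            cleaned.append(row)
    return cleaned

addresses = [
    {"city": "Warszawa", "voivodeship": ""},
    {"city": "Kraków", "voivodeship": "małopolskie"},
    {"city": "", "voivodeship": ""},
]
completed = fill_or_drop(addresses)
print(completed)
```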
Table after filling missing data and erasing incomplete data


Datasets For Data Cleaning Practice
• Common Crawl Corpus
• Trending YouTube Video Statistics
• Google Books Ngrams
• Kaggle
• Hotel Booking Demand
• Iris Species
• FiveThirtyEight
• PAN at CLEF
• Taxi Trajectory Data
• Socrata
Example using Data cleansing in stages
SurveyMonkey.com Data Cleaning
• In qualitative data collection, Survey Data Analysis is used
• Survey data cleaning involves identifying and removing responses
• Some common cases in SurveyMonkey.com:
• Respondents who answer only a portion of the questions
• Respondents who speed through your survey
• Respondents who give inconsistent responses
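These three cases can be flagged automatically on exported survey data. The sketch below is not SurveyMonkey's own tooling: the response layout, the question names, the 60 second threshold and the consistency rule are all assumptions made for illustration.

```python
def flag_response(response, questions, min_seconds=60):
    """Return a list of quality flags for one survey response."""
    answers = response["answers"]
    flags = []
    # Partial: at least one question was left unanswered.
    answered = [q for q in questions if answers.get(q) not in (None, "")]
    if len(answered) < len(questions):
        flags.append("partial")
    # Speeder: finished faster than an assumed minimum plausible time.
    if response["seconds"] < min_seconds:
        flags.append("speeder")
    # Inconsistent: hypothetical rule, names a car brand but owns no car.
    if answers.get("owns_car") == "no" and answers.get("car_brand"):
        flags.append("inconsistent")
    return flags

QUESTIONS = ["owns_car", "car_brand", "commute_minutes"]
responses = [
    {"id": 1, "seconds": 240,
     "answers": {"owns_car": "yes", "car_brand": "Toyota", "commute_minutes": "30"}},
    {"id": 2, "seconds": 25,
     "answers": {"owns_car": "no", "car_brand": "", "commute_minutes": ""}},
    {"id": 3, "seconds": 200,
     "answers": {"owns_car": "no", "car_brand": "Ford", "commute_minutes": "15"}},
]
kept = [r["id"] for r in responses if not flag_response(r, QUESTIONS)]
print(kept)
```

A real survey with skip logic would need the "partial" rule to ignore questions a respondent was never shown.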
Challenges of data cleaning
• Data cleaning can run into several problems along the way.
You need to understand these problems and figure out how to
tackle them.
• Ongoing maintenance can be expensive and time‐consuming
• Limited knowledge about what is causing anomalies, creating difficulties in
creating the right transformations
• Privacy and Security
• Data deletion, where a loss of information leads to incomplete data that cannot
be accurately ‘filled in’
• It is difficult to build a data cleansing graph to assist with the process ahead of
time
Overall Benefits of Data Cleaning

• Better data quality, which helps increase your efficiency and speeds up the
decision‐making process.
• You can better monitor your errors to help you eliminate incorrect,
corrupt, or inconsistent data.
• You will make fewer errors overall.
• You can map different functions and what your data should do.
• It is easier to remove errors across multiple data sources.
In Summary

• One of the most interesting things about data in this era is its ease of
accessibility online through social media, search engines, websites, etc.
• Much of this data is either incorrect or full of irrelevancies. In order to
leverage the huge amount of easily accessible data, we need to take our time to
clean it.
• Data cleaning is arguably one of the most important steps towards
achieving great results from the data analysis process.
• If the data isn’t cleaned, data analysis will not yield reliable results.