Importance of Data Cleaning 1
05 August 2021
This work is licensed under Creative Commons
Attribution 4.0 International License.
Overview
• Meaning of Data Cleaning
• Need for Data Cleaning
• Data Cleaning Methods
• Data Cleaning Steps
• Best Practices
• Data Quality Attributes
• How Data Cleaning is used in a Dataset
• Overall Benefits of Data Cleaning
What is Data Cleaning?
Data cleaning is a process in which you go through all of the data in a
data set and either remove or update information that is
• incomplete,
• incorrect,
• improperly formatted,
• duplicated, or
• irrelevant.
Related terms: Data Scrubbing, Data Cleansing, Data Pre-processing
Michael Walker (2021) Python Data Cleaning Cookbook
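The five kinds of problem listed above can be illustrated with a minimal sketch in plain Python. The records, the field names and the 0 to 120 age range are assumptions made up for this example; they do not come from the cited cookbook.

```python
# Hypothetical records: every value here is invented for illustration.
records = [
    {"name": "Ada", "email": "ada@example.org", "age": "36", "fax": "n/a"},
    {"name": "Ada", "email": "ada@example.org", "age": "36", "fax": "n/a"},      # duplicated
    {"name": "Ben", "email": "", "age": "41", "fax": "n/a"},                     # incomplete
    {"name": " cara ", "email": "CARA@EXAMPLE.ORG", "age": "29", "fax": "n/a"},  # improperly formatted
    {"name": "Dan", "email": "dan@example.org", "age": "-5", "fax": "n/a"},      # incorrect
]


def clean(rows):
    """Return a cleaned copy of rows: update what can be fixed, remove the rest."""
    cleaned, seen = [], set()
    for row in rows:
        # Irrelevant: drop a column that carries no information (assumed here).
        row = {key: value for key, value in row.items() if key != "fax"}
        # Improperly formatted: trim whitespace and unify letter case.
        row["name"] = row["name"].strip().title()
        row["email"] = row["email"].strip().lower()
        # Incomplete: remove rows without an email address.
        if not row["email"]:
            continue
        # Incorrect: remove rows whose age is outside an assumed valid range.
        age = int(row["age"])
        if not 0 <= age <= 120:
            continue
        row["age"] = age
        # Duplicated: keep only the first row for each (name, email) pair.
        key = (row["name"], row["email"])
        if key in seen:
            continue
        seen.add(key)
        cleaned.append(row)
    return cleaned


cleaned = clean(records)
for row in cleaned:
    print(row)
```

Whether a problem row should be removed or repaired depends on the data set; this sketch removes incomplete and incorrect rows only to keep the example short.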
Raw Data vs Clean Data
• Raw data is the data that is collected directly from the data source.
Figure: Template showing an example of raw data: the number of colonies per
treatment condition and controls, Ponti et al. (2014)
Raw Data vs Clean Data
• “Dirty data” is raw data full of irrelevancies, errors, and corrupt information
• Clean data is in an analyzable format
Figure: An example of dirty data and a cleaned sample (shaded cells denote dirty
values, and their cleaned values are in bold font), Krishnan et al. (2014)
Figure: Raw data, data cleaning, processed data (Source: CrowdFlower 2016 to 2018)
The Need for Data Cleaning
• Clean data helps you perform the analysis faster, saving precious time.
• It improves the quality of the data, making it “fit for use” by users.
• It improves user documentation and presentation.
• False conclusions drawn from incorrect or “dirty” data can lead to poor
decision-making.
• False conclusions can lead to moments in reporting
when you realize your data doesn’t stand up to
scrutiny.
• It is important to create a culture of quality data in
your research work.
The Need for Data Cleaning contd..
• Combining multiple data sources creates
synchronisation issues.
• If data is incorrect, outcomes are
unreliable.
• Data cleaning processes vary from
dataset to dataset.
• Establish a template for your data
cleaning process.
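One way to make synchronisation issues between sources visible is to compare the record keys each source contains before merging them. The following is a sketch only; the two sets of customer IDs and the function name compare_keys are hypothetical.

```python
# Hypothetical customer IDs from two sources that should describe the same people.
crm_ids = {"C001", "C002", "C003"}
billing_ids = {"C002", "C003", "C004"}


def compare_keys(left, right):
    """Report which keys appear in only one source and which appear in both."""
    return {
        "only_left": sorted(left - right),
        "only_right": sorted(right - left),
        "both": sorted(left & right),
    }


report = compare_keys(crm_ids, billing_ids)
print(report)
```

Keys that appear in only one source are candidates for review before the sources are combined.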
Data Cleaning Methods
• Histograms
• Conversion Tables
• Tools
• Algorithms
• Manually
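Two of the methods above, histograms and conversion tables, can be sketched with the Python standard library. The ages, the bin width and the country-code table below are assumptions chosen for illustration.

```python
from collections import Counter

# Hypothetical ages; 340 is a deliberate data-entry error.
ages = [23, 25, 31, 34, 38, 42, 47, 51, 340]


def histogram(values, bin_width):
    """Count values per bin; each key is the lower edge of its bin."""
    return Counter((value // bin_width) * bin_width for value in values)


counts = histogram(ages, bin_width=10)
for lower in sorted(counts):
    # A bin far away from the others hints at an outlier worth checking.
    print(f"{lower:>3}-{lower + 9:<3} {'#' * counts[lower]}")

# Conversion table: maps the spellings found in the data to one agreed code.
COUNTRY_CODES = {
    "usa": "US",
    "u.s.": "US",
    "united states": "US",
    "uk": "GB",
    "united kingdom": "GB",
}


def convert(value, table):
    """Look a value up in the conversion table; leave unknown values unchanged."""
    return table.get(value.strip().lower(), value)


print(convert(" USA ", COUNTRY_CODES))
print(convert("France", COUNTRY_CODES))
```

The histogram does not fix anything by itself; it only shows where values cluster so that isolated bins can be inspected manually.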
How do you Clean Data?
• Import data
• Merge data sets
• Rebuild missing data
• Standardisation
• Normalisation
• Verification and enrichment
• Export data
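The steps above can be sketched as a small pipeline. This is an illustration under stated assumptions, not a definitive implementation: the two CSV inputs, the column names and the default value "Unknown" are hypothetical, normalisation is taken to mean min-max scaling to the range 0 to 1, and enrichment is left out because it depends on an external data source.

```python
import csv
import io

# Hypothetical inputs standing in for two source files.
CUSTOMERS_CSV = "id,name,city\n1, alice ,Lagos\n2,BOB,\n3,Chen,Accra\n"
SPEND_CSV = "id,spend\n1,120\n2,80\n3,200\n"


def import_data(text):
    """Import: parse CSV text into a list of dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def merge(left, right, key):
    """Merge: attach the matching right-hand row to each left-hand row."""
    lookup = {row[key]: row for row in right}
    return [{**row, **lookup.get(row[key], {})} for row in left]


def rebuild_missing(rows, field, default):
    """Rebuild missing data: fill empty values with an agreed default."""
    for row in rows:
        if not (row.get(field) or "").strip():
            row[field] = default
    return rows


def standardise(rows, field):
    """Standardise: trim whitespace and use one capitalisation style."""
    for row in rows:
        row[field] = row[field].strip().title()
    return rows


def normalise(rows, field):
    """Normalise: min-max scale a numeric field to the range 0 to 1."""
    values = [float(row[field]) for row in rows]
    low, high = min(values), max(values)
    for row, value in zip(rows, values):
        row[field] = (value - low) / (high - low) if high > low else 0.0
    return rows


def verify(rows, required):
    """Verify: fail loudly if a required field is still empty."""
    for row in rows:
        missing = [name for name in required if row.get(name) in ("", None)]
        if missing:
            raise ValueError(f"row {row.get('id')} is missing {missing}")
    return rows


def export_data(rows):
    """Export: write the cleaned rows back to CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0]))
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


rows = merge(import_data(CUSTOMERS_CSV), import_data(SPEND_CSV), key="id")
rows = rebuild_missing(rows, "city", default="Unknown")
rows = standardise(rows, "name")
rows = normalise(rows, "spend")
rows = verify(rows, required=["id", "name", "city", "spend"])
output = export_data(rows)
print(output)
```

Keeping each step in its own function makes it easier to reorder or swap steps when a different data set needs a different process.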
Data Quality Attributes in Data Cleaning
Integrity
• The quality, reliability, trustworthiness, and completeness of a
data set – providing accuracy, consistency and context.
• This criterion looks at whether a dataset follows the rules and
standards set. Are there any values missing that could harm the
efficacy of the data or keep analysts from discerning important
relationships or patterns?
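An integrity check of this kind can be sketched as a report that counts missing and rule-breaking values per field. The rows and the three rules below are assumptions for illustration; real rules come from the standards set for the data set in question.

```python
# Hypothetical rows; the second has a missing email, the third an invalid age.
rows = [
    {"id": "1", "email": "a@example.org", "age": "34"},
    {"id": "2", "email": "", "age": "27"},
    {"id": "3", "email": "c@example.org", "age": "abc"},
]

# Assumed rules: one validity test per field.
RULES = {
    "id": lambda value: value.isdigit(),
    "email": lambda value: "@" in value,
    "age": lambda value: value.isdigit() and 0 <= int(value) <= 120,
}


def integrity_report(rows, rules):
    """Count missing and invalid values for every field that has a rule."""
    report = {field: {"missing": 0, "invalid": 0} for field in rules}
    for row in rows:
        for field, is_valid in rules.items():
            value = (row.get(field) or "").strip()
            if not value:
                report[field]["missing"] += 1
            elif not is_valid(value):
                report[field]["invalid"] += 1
    return report


report = integrity_report(rows, RULES)
for field, result in report.items():
    print(field, result)
```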
Uniformity
• The degree to which the data is specified using the same unit of
measure.
• Weight may be recorded in either pounds or kilos. Dates might follow
the USA format or the European format. Currency is sometimes in USD and
sometimes in yen.
• The data must therefore be converted to a single unit of measure.
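Converting to a single unit can be sketched as follows for weights and dates. The unit spellings, the region labels "US" and "EU" and the function names are assumptions; currency is left out because it needs an exchange rate for a chosen date, which the slides do not give.

```python
from datetime import datetime

LB_TO_KG = 0.45359237  # definition of the international avoirdupois pound


def to_kilograms(value, unit):
    """Convert a weight to kilograms from an assumed set of unit spellings."""
    unit = unit.strip().lower()
    if unit in ("kg", "kilo", "kilos"):
        return float(value)
    if unit in ("lb", "lbs", "pound", "pounds"):
        return float(value) * LB_TO_KG
    raise ValueError(f"unknown weight unit: {unit!r}")


# Assumed layouts: month first for the USA, day first for Europe.
DATE_FORMATS = {"US": "%m/%d/%Y", "EU": "%d/%m/%Y"}


def to_iso_date(text, region):
    """Convert a regional date string to the ISO 8601 form YYYY-MM-DD."""
    return datetime.strptime(text, DATE_FORMATS[region]).date().isoformat()


print(round(to_kilograms(10, "lb"), 3))
print(to_iso_date("08/05/2021", "US"))
print(to_iso_date("05/08/2021", "EU"))
```

The region of each date has to be known from the source; a string such as 08/05/2021 cannot be interpreted reliably from its content alone.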
Example using Data cleansing in stages
Addresses data set: filling missing data and erasing incomplete data
Example using Data cleansing contd..
Data cleansing Step 5: Filling missing data and erasing incomplete data
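A minimal sketch of this step on an addresses data set, assuming hypothetical records and field names: a missing country is filled in from another record with the same postcode, and records that are still incomplete afterwards are erased.

```python
# Hypothetical address records; none of these values come from the slides.
addresses = [
    {"street": "12 Main St", "city": "Leeds", "postcode": "LS1 4AP", "country": "UK"},
    {"street": "7 Hill Rd", "city": "Leeds", "postcode": "LS1 4AP", "country": ""},
    {"street": "", "city": "", "postcode": "", "country": "UK"},
]


def fill_country_from_postcode(rows):
    """Fill a missing country using another row that shares the postcode."""
    known = {
        row["postcode"]: row["country"]
        for row in rows
        if row["postcode"] and row["country"]
    }
    for row in rows:
        if not row["country"] and row["postcode"] in known:
            row["country"] = known[row["postcode"]]
    return rows


def drop_incomplete(rows, required):
    """Erase rows that still lack any of the required fields."""
    return [row for row in rows if all(row[field] for field in required)]


filled = fill_country_from_postcode(addresses)
result = drop_incomplete(filled, required=["street", "city", "postcode", "country"])
for row in result:
    print(row)
```

Filling runs before erasing so that records which can be repaired are not discarded.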
Overall Benefits of Data Cleaning
• Better data quality helps increase your efficiency and speeds up the
decision-making process.
• You can better monitor your errors to help you eliminate incorrect,
corrupt, or inconsistent data.
• You will make fewer errors overall.
• You can map the different functions and what your data is intended to do.
• It becomes easier to remove errors when multiple data sources are involved.
In Summary
• One of the most interesting things about data in this era is how easily it
can be accessed online through social media, search engines, websites, etc.
• Much of this data is either incorrect or full of irrelevancies. To make use
of this large volume of easily accessible data, we need to take the time to
clean it.
• Data cleaning is arguably one of the most important steps towards
achieving great results from the data analysis process.
• If the data isn’t cleaned, data analysis will not yield reliable results.