Data Cleaning

Data cleaning is the process of correcting or removing inaccurate, corrupted, or incomplete data from a dataset to improve data quality and decision-making. The process involves steps such as inspection, cleaning, verification, and reporting, and is essential for effective data management in business intelligence and data science. Techniques for data cleaning include addressing missing data, duplicates, data entry errors, standardization, and managing outliers.

What is data cleaning?

● Data cleaning is the process of fixing or removing incorrect, corrupted, incorrectly formatted, duplicate, or incomplete data within a dataset.
● It involves identifying data errors and correcting them by changing, updating, or removing the affected data.
● The process improves data quality, provides more accurate and consistent information, and leads to better decision-making.
● It is essential for data management, data preparation, and use in business intelligence (BI) and data science.
● It is typically performed by data quality analysts, engineers, or other data management professionals, although data scientists, BI analysts, and business users may also participate.

What are the steps in the Data Cleaning process?


1. Inspection and Profiling - Assessing data quality, identifying issues, and documenting data characteristics (a minimal profiling sketch follows this list).
2. Cleaning - Correcting data errors and addressing inconsistencies, duplicates, and redundancy.
3. Verification - Inspecting the cleaned data to ensure accuracy and adherence to data quality standards.
4. Reporting - Documenting the results of data cleansing, including the issues found and corrected and updated quality metrics.
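
To make step 1 concrete, here is a minimal profiling sketch in pandas; the file name and columns are hypothetical, and real inspection usually goes further (value distributions, referential checks, and so on).

```python
# Minimal data-profiling sketch (step 1). "customers.csv" and its
# columns are hypothetical placeholders.
import pandas as pd

df = pd.read_csv("customers.csv")

df.info()                          # column types and non-null counts
print(df.isna().sum())             # missing values per column
print(df.duplicated().sum())       # number of fully duplicated rows
print(df.describe(include="all"))  # summary statistics per column
```
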
5 Characteristics Of Quality Data
● Validity. The degree to which your data conforms to defined business rules or constraints.
● Accuracy. The degree to which your data is close to the true values.
● Completeness. The degree to which all required data is known.
● Consistency. The degree to which your data is consistent within the same dataset and/or across multiple datasets.
● Uniformity. The degree to which the data is specified using the same unit of measure.
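
These characteristics can be translated into measurable checks. The sketch below is one possible mapping for a hypothetical orders table; the business rule (positive quantities) and the agreed unit (kg) are invented for illustration, and accuracy is omitted because it requires comparison against a trusted reference.

```python
# Hypothetical quality metrics mapped to the characteristics above.
import pandas as pd

df = pd.DataFrame({
    "order_id": [1, 2, 3, 3],
    "quantity": [5, -2, 7, 7],              # -2 violates the validity rule
    "unit":     ["kg", "kg", "lb", "kg"],   # "lb" violates uniformity
})

validity     = (df["quantity"] > 0).mean()       # share meeting the business rule
completeness = 1 - df["order_id"].isna().mean()  # share of required values present
consistency  = 1 - df.duplicated().mean()        # one rough proxy: non-duplicated rows
uniformity   = (df["unit"] == "kg").mean()       # share using the agreed unit
# Accuracy would compare values against a trusted source, omitted here.

print(validity, completeness, consistency, uniformity)
```
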

Why do we do Data Cleaning?


● Improved Decision-Making. More accurate data leads to better-informed decisions.
● More Effective Marketing and Sales. Clean customer data enhances marketing and sales efforts.
● Better Operational Performance. High-quality data helps avoid operational issues such as inventory shortages and delivery problems.
● Increased Use of Data. Trustworthy data encourages its use in business processes.
● Reduced Data Costs. Clean data prevents errors from propagating, saving time and money.

Techniques in Data Cleaning


Missing Data
● Missing data are a fact of life in multivariate analysis; they often cannot be avoided, but they must be addressed so that they do not affect the generalizability of the results.
● A missing data process is any systematic event external to the respondent (such as data entry errors or data collection problems) or action on the part of the respondent (such as refusal to answer) that leads to missing values.
● Data cleansing corrects various structural errors in data sets, including misspellings and other typographical errors, wrong numerical entries, syntax errors, and missing values such as blank or null fields that should contain data.
Types of Missing Data
1. Missing at Random (MAR) – the probability that a value is missing depends on the observed data but not on the missing values themselves. The data are not missing across all observations, only within sub-samples of the data, and it is not known whether a value should be there; it is missing given the observed data. The missing values can therefore be predicted from the complete observed data.
2. Missing Completely at Random (MCAR) – values are missing independently of both the observed data and the would-be values themselves. One informal check is to compare the cases with missing observations against the cases without; if a t-test shows no difference between the two groups, the data are characterized as MCAR (a sketch of this check follows the list). Data may be missing due to test design, failure in the observations, or failure in recording observations; the reasons for the absence are external and unrelated to the value of the observation. It is typically safe to remove MCAR data because the results will be unbiased; the test may lose some power, but the results remain reliable.
3. Missing Not at Random (MNAR) – the MNAR category applies when the missingness has a structure to it; there appear to be reasons the data is missing. In a survey, perhaps a specific group of people, say women ages 45 to 55, did not answer a question. Unlike MAR, the missing data cannot be predicted from the observed data, because the missing information itself drives the absence. Data scientists must model the missing data to develop an unbiased estimate; simply removing observations with missing data could result in a biased model.
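
The t-test comparison mentioned under MCAR can be sketched as follows. The columns are hypothetical, and a non-significant result is only weak evidence for MCAR, not proof.

```python
# Informal MCAR check: compare an observed variable between rows where
# another variable is missing vs. present. Columns are hypothetical.
import numpy as np
import pandas as pd
from scipy import stats

df = pd.DataFrame({
    "age":    [23, 35, 41, 29, 52, 47, 31, 38],
    "income": [np.nan, 42000, 51000, np.nan, 60000, 58000, 39000, np.nan],
})

missing = df.loc[df["income"].isna(), "age"]
present = df.loc[df["income"].notna(), "age"]

t, p = stats.ttest_ind(missing, present, equal_var=False)
print(f"t = {t:.2f}, p = {p:.3f}")  # a large p gives no evidence against MCAR
```
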
Approaches in Dealing with Missing Data
1. Use of Observations with Complete Data Only
• The simplest and most direct approach for dealing with missing data is
to include only those observations with complete data, also known as the
complete case approach.
2. Delete Case(s) and/or Variable(s)
• Another simple remedy for missing data is to delete the offending
case(s) and/or variable(s) to reduce bias. In this approach, the researcher
determines the extent of missing data on each case and variable and
then deletes the case(s) or variable(s) with excessive levels. In many cases
where a nonrandom pattern of missing data is present, this may be the
most efficient solution.
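
Both approaches above map directly onto pandas' dropna; the data and the column threshold below are arbitrary illustrations.

```python
# Complete-case analysis and case/variable deletion (threshold is arbitrary).
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, 4.0],
    "b": [np.nan, np.nan, np.nan, 2.0],
    "c": [5.0, 6.0, 7.0, 8.0],
})

complete_cases = df.dropna()                    # keep only rows with no missing values
drop_sparse_cols = df.dropna(axis=1, thresh=3)  # drop variables with < 3 observed values
print(complete_cases)
print(drop_sparse_cols)
```
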
3. Imputation Method
• Imputation is the process of estimating the missing value based on
valid values of other variables and/or cases in the sample.
• However, imputation is both seductive and dangerous. It is seductive
because it can lull the user into the pleasurable state of believing that
the data are complete after all, and it is dangerous because it lumps
together situations where the problem is sufficiently minor that it can be
legitimately handled in this way and situations where standard
estimators applied to the real and imputed data have substantial biases.
For quantitative variables (single imputation; a combined sketch of methods a, e, and f follows this list)
a. Mean imputation
• Simply calculate the mean of the observed values for that variable for
all individuals who are non-missing. It has the advantage of keeping the
same mean and the same sample size.
b. Substitution
• Impute the value from a new individual who was not selected to be in
the sample. In other words, go find a new subject and use their value
instead.
c. Hot Deck imputation
• A randomly chosen value from an individual in the sample who has
similar values on other variables. An advantage is the random
component, which adds in some variability. This is important for
accurate standard errors.
d. Cold Deck Imputation
• A systematically chosen value from an individual who has similar
values on other variables; similar to Hot Deck but removes the random
variation. So for example, you may always choose the third individual in
the same experimental condition and block.
e. Regression imputation
• The predicted value obtained by regressing the missing variable on
other variables. This preserves relationships among variables involved in
the imputation model, but not variability around predicted values.
f. Stochastic Regression Imputation
• The predicted value from a regression plus a random residual value. This has all the advantages of regression imputation but adds in the advantages of the random component. Most multiple imputation is based on some form of stochastic regression imputation.
g. Interpolation and extrapolation
• An estimated value from other observations from the same individual. It
usually only works in longitudinal data.
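
Here is a combined sketch of methods (a), (e), and (f) on a hypothetical two-column dataset; hot deck, cold deck, and interpolation are omitted for brevity.

```python
# Sketch of mean, regression, and stochastic regression imputation.
# Data and column names are hypothetical.
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "y": [2.1, np.nan, 6.2, 7.9, np.nan, 12.3],
})

# (a) Mean imputation: replace missing y with the observed mean.
y_mean = df["y"].fillna(df["y"].mean())

# (e) Regression imputation: predict missing y from x.
obs = df.dropna()
model = LinearRegression().fit(obs[["x"]], obs["y"])
pred = model.predict(df.loc[df["y"].isna(), ["x"]])
y_reg = df["y"].copy()
y_reg[df["y"].isna()] = pred

# (f) Stochastic regression imputation: add a random residual to each prediction.
resid_sd = (obs["y"] - model.predict(obs[["x"]])).std()
y_stoch = df["y"].copy()
y_stoch[df["y"].isna()] = pred + rng.normal(0, resid_sd, size=len(pred))

print(pd.DataFrame({"mean": y_mean, "regression": y_reg, "stochastic": y_stoch}))
```
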
For quantitative variables (multiple imputation)
• A combination of two or more methods is used to derive a composite estimate – usually the mean of the various estimates – for the missing value. The rationale of this approach is that using multiple approaches minimizes the specific concerns with any single method, and the composite will be the best possible estimate.
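
One widely used implementation of this idea is scikit-learn's IterativeImputer, which chains regressions in the spirit of stochastic regression imputation; the columns below are hypothetical.

```python
# Multiple-imputation-style estimation with scikit-learn's IterativeImputer.
# Column names and values are hypothetical.
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

df = pd.DataFrame({
    "age":    [25, 32, np.nan, 41, 38, np.nan],
    "income": [30000, np.nan, 52000, 61000, np.nan, 45000],
})

# Each missing value is modeled from the other columns, iterating until
# the estimates stabilize; sample_posterior adds the random component.
imputer = IterativeImputer(random_state=0, sample_posterior=True)
completed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
print(completed)
# For true multiple imputation, repeat with different random_state values
# and pool (e.g., average) the resulting estimates.
```
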

Duplicates
● Data cleansing identifies duplicate records in data sets and either removes or merges them through deduplication measures. For example, when data from two systems is combined, duplicate data entries can be reconciled to create single records (see the sketch below).
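
In pandas, a basic deduplication pass might look like the sketch below; the matching key is hypothetical, and real record linkage often requires fuzzier matching.

```python
# Deduplication sketch after combining two hypothetical sources.
import pandas as pd

a = pd.DataFrame({"customer_id": [1, 2], "email": ["a@x.com", "b@x.com"]})
b = pd.DataFrame({"customer_id": [2, 3], "email": ["b@x.com", "c@x.com"]})

combined = pd.concat([a, b], ignore_index=True)

# Drop duplicates, keeping the first occurrence of each customer_id.
deduped = combined.drop_duplicates(subset="customer_id", keep="first")
print(deduped)
```
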
Data Entry Errors
● Data entry errors are structural errors introduced when data is recorded: misspellings and other typographical errors, wrong numerical entries, stray whitespace, and syntax errors. They are detected during inspection and corrected by fixing or recoding the affected values (see the sketch below).
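
A minimal sketch of correcting such errors with pandas; the misspelling map and columns are invented for illustration.

```python
# Fixing hypothetical entry errors: stray whitespace, inconsistent
# casing, known misspellings, and non-numeric values in a numeric column.
import pandas as pd

df = pd.DataFrame({
    "city":  [" manila", "Quezon Cty", "MANILA ", "Cebu"],
    "sales": ["1200", "95O", "780", "1010"],   # "95O" contains a letter O typo
})

df["city"] = (
    df["city"].str.strip()                          # remove stray whitespace
              .str.title()                          # normalize capitalization
              .replace({"Quezon Cty": "Quezon City"})  # fix known misspellings
)

# Coerce non-numeric entries to NaN so they surface as missing values.
df["sales"] = pd.to_numeric(df["sales"], errors="coerce")
print(df)
```
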
Standardizing Data
● Standardizing data brings values into a common format and unit of measure, supporting the uniformity characteristic above: consistent capitalization, date formats, category labels, and measurement units across the dataset (see the sketch below).
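
A sketch of standardizing formats and units; the columns and the pound-to-kilogram scenario are hypothetical, and the mixed-format date parsing requires pandas 2.0 or newer.

```python
# Standardizing hypothetical date formats, category labels, and units.
import pandas as pd

df = pd.DataFrame({
    "signup":    ["2023-01-05", "05/02/2023", "March 3, 2023"],
    "plan":      ["basic", "BASIC", "Premium"],
    "weight_lb": [150.0, 180.0, 200.0],
})

# Parse mixed date strings into one datetime format (pandas >= 2.0).
df["signup"] = pd.to_datetime(df["signup"], format="mixed")

# Normalize category labels to one spelling.
df["plan"] = df["plan"].str.lower()

# Convert everything to one unit of measure (pounds -> kilograms).
df["weight_kg"] = df["weight_lb"] * 0.453592
df = df.drop(columns="weight_lb")
print(df)
```
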
Outliers
• Outliers are observations with a unique combination of characteristics
identifiable as distinctly different from the other observations.
• A univariate outlier is a data point that consists of an extreme value on
one variable. A multivariate outlier is a combination of unusual scores on
at least two variables. Both types of outliers can influence the outcome of
statistical analyses.
• Outliers cannot be categorically characterized as either beneficial or
problematic, but instead must be viewed within the context of the
analysis and should be evaluated by the types of information they may
provide.
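
One common way to flag univariate outliers is the 1.5 × IQR rule, sketched below on a hypothetical numeric column; multivariate outliers require combined measures such as Mahalanobis distance, since each variable alone may look ordinary.

```python
# Flag univariate outliers with the 1.5 * IQR rule (values are hypothetical).
import pandas as pd

s = pd.Series([12, 14, 13, 15, 14, 13, 98, 12])  # 98 looks suspicious

q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = s[(s < lower) | (s > upper)]
print(outliers)  # -> 98
```
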
Four Classes of Outliers
1. Outliers arising from a procedural error, such as a data entry error or a mistake in coding. These should be identified in the data cleaning stage, but if overlooked, they should be eliminated or recoded as missing values.
2. Outliers resulting from an extraordinary event, which explains the uniqueness of the observation. The researcher must decide whether the extraordinary event should be represented in the sample. If so, the outlier should be retained in the analysis; if not, it should be deleted.
3. Extraordinary observations for which the researcher has no explanation. Although these are the outliers most likely to be omitted, they may be retained if the researcher feels they represent a valid segment of the population.
4. Observations that fall within the ordinary range of values on each of the variables but are unique in their combination of values across the variables. In these situations, the researcher should retain the observation unless specific evidence is available that discounts the outlier as a valid member of the population.
