Data Cleaning

Data cleaning is the process of correcting or removing inaccurate, corrupted, or incomplete data from a dataset to improve data quality and decision-making. The process involves steps such as inspection, cleaning, verification, and reporting, and is essential for effective data management in business intelligence and data science. Techniques for data cleaning include addressing missing data, duplicates, data entry errors, standardization, and managing outliers.

What is data cleaning?

● Data cleaning is the process of fixing or removing incorrect, corrupted, incorrectly formatted, duplicate, or incomplete data within a dataset.
● It involves identifying data errors and correcting them by changing, updating, or removing the affected data.
● The process improves data quality, provides more accurate and consistent information, and leads to better decision-making.
● It is essential for data management, data preparation, and use in business intelligence (BI) and data science.
● It is typically performed by data quality analysts, engineers, or other data management professionals, although data scientists, BI analysts, and business users may also participate.

What are the steps in the Data Cleaning process?


1. Inspection and Profiling - Assessing data quality, identifying issues, and documenting data characteristics (a minimal profiling sketch follows this list).
2. Cleaning - Correcting data errors and addressing inconsistencies, duplicates, and redundancy.
3. Verification - Inspecting the cleaned data to ensure accuracy and adherence to data quality standards.
4. Reporting - Documenting the results of data cleansing, including the issues found and corrected and updated quality metrics.
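
To make step 1 concrete, here is a minimal profiling sketch in pandas; the file name and columns are hypothetical, and real inspection usually goes further (value distributions, referential checks, and so on).

```python
# Minimal data-profiling sketch (step 1). "customers.csv" and its
# columns are hypothetical placeholders.
import pandas as pd

df = pd.read_csv("customers.csv")

df.info()                          # column types and non-null counts
print(df.isna().sum())             # missing values per column
print(df.duplicated().sum())       # number of fully duplicated rows
print(df.describe(include="all"))  # summary statistics per column
```
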
5 Characteristics Of Quality Data
● Validity. The degree to which your data conforms to defined business rules or constraints.
● Accuracy. The degree to which your data is close to the true values.
● Completeness. The degree to which all required data is known.
● Consistency. The degree to which your data is consistent within the same dataset and/or across multiple datasets.
● Uniformity. The degree to which the data is specified using the same unit of measure.
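
These characteristics can be translated into measurable checks. The sketch below is one possible mapping for a hypothetical orders table; the business rule (positive quantities) and the agreed unit (kg) are invented for illustration, and accuracy is omitted because it requires comparison against a trusted reference.

```python
# Hypothetical quality metrics mapped to the characteristics above.
import pandas as pd

df = pd.DataFrame({
    "order_id": [1, 2, 3, 3],
    "quantity": [5, -2, 7, 7],              # -2 violates the validity rule
    "unit":     ["kg", "kg", "lb", "kg"],   # "lb" violates uniformity
})

validity     = (df["quantity"] > 0).mean()       # share meeting the business rule
completeness = 1 - df["order_id"].isna().mean()  # share of required values present
consistency  = 1 - df.duplicated().mean()        # one rough proxy: non-duplicated rows
uniformity   = (df["unit"] == "kg").mean()       # share using the agreed unit
# Accuracy would compare values against a trusted source, omitted here.

print(validity, completeness, consistency, uniformity)
```
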

Why do we do Data Cleaning?


● Improved Decision-Making. More accurate data leads to better-informed decisions.
● More Effective Marketing and Sales. Clean customer data enhances marketing and sales efforts.
● Better Operational Performance. High-quality data helps avoid operational issues such as inventory shortages and delivery problems.
● Increased Use of Data. Trustworthy data encourages its use in business processes.
● Reduced Data Costs. Clean data prevents errors from propagating, saving time and money.

Techniques in Data Cleaning


Missing Data
● Missing data are a fact of life in multivariate analysis; they often cannot be avoided, but they must be addressed so that they do not affect the generalizability of the results.
● A missing data process is any systematic event external to the respondent (such as data entry errors or data collection problems) or action on the part of the respondent (such as refusal to answer) that leads to missing values.
● Data cleansing corrects various structural errors in data sets, including misspellings and other typographical errors, wrong numerical entries, syntax errors, and missing values such as blank or null fields that should contain data.
Types of Missing Data
1. Missing at Random (MAR) – the probability that a value is missing depends on the observed data but not on the missing values themselves. The data are not missing across all observations, only within sub-samples of the data, and it is not known whether a value should be there; it is missing given the observed data. The missing values can therefore be predicted from the complete observed data.
2. Missing Completely at Random (MCAR) – values are missing independently of both the observed data and the would-be values themselves. One informal check is to compare the cases with missing observations against the cases without; if a t-test shows no difference between the two groups, the data are characterized as MCAR (a sketch of this check follows the list). Data may be missing due to test design, failure in the observations, or failure in recording observations; the reasons for the absence are external and unrelated to the value of the observation. It is typically safe to remove MCAR data because the results will be unbiased; the test may lose some power, but the results remain reliable.
3. Missing Not at Random (MNAR) – the MNAR category applies when the missingness has a structure to it; there appear to be reasons the data is missing. In a survey, perhaps a specific group of people, say women ages 45 to 55, did not answer a question. Unlike MAR, the missing data cannot be predicted from the observed data, because the missing information itself drives the absence. Data scientists must model the missing data to develop an unbiased estimate; simply removing observations with missing data could result in a biased model.
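
The t-test comparison mentioned under MCAR can be sketched as follows. The columns are hypothetical, and a non-significant result is only weak evidence for MCAR, not proof.

```python
# Informal MCAR check: compare an observed variable between rows where
# another variable is missing vs. present. Columns are hypothetical.
import numpy as np
import pandas as pd
from scipy import stats

df = pd.DataFrame({
    "age":    [23, 35, 41, 29, 52, 47, 31, 38],
    "income": [np.nan, 42000, 51000, np.nan, 60000, 58000, 39000, np.nan],
})

missing = df.loc[df["income"].isna(), "age"]
present = df.loc[df["income"].notna(), "age"]

t, p = stats.ttest_ind(missing, present, equal_var=False)
print(f"t = {t:.2f}, p = {p:.3f}")  # a large p gives no evidence against MCAR
```
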
Approaches in Dealing with Missing Data
1. Use of Observations with Complete Data Only
• The simplest and most direct approach for dealing with missing data is
to include only those observations with complete data, also known as the
complete case approach.
2. Delete Case(s) and/or Variable(s)
• Another simple remedy for missing data is to delete the offending
case(s) and/or variable(s) to reduce bias. In this approach, the researcher
determines the extent of missing data on each case and variable and
then deletes the case(s) or variable(s) with excessive levels. In many cases
where a nonrandom pattern of missing data is present, this may be the
most efficient solution.
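
Both approaches above map directly onto pandas' dropna; the data and the column threshold below are arbitrary illustrations.

```python
# Complete-case analysis and case/variable deletion (threshold is arbitrary).
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, 4.0],
    "b": [np.nan, np.nan, np.nan, 2.0],
    "c": [5.0, 6.0, 7.0, 8.0],
})

complete_cases = df.dropna()                    # keep only rows with no missing values
drop_sparse_cols = df.dropna(axis=1, thresh=3)  # drop variables with < 3 observed values
print(complete_cases)
print(drop_sparse_cols)
```
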
3. Imputation Method
• Imputation is the process of estimating the missing value based on
valid values of other variables and/or cases in the sample.
• However, imputation is both seductive and dangerous. It is seductive
because it can lull the user into the pleasurable state of believing that
the data are complete after all, and it is dangerous because it lumps
together situations where the problem is sufficiently minor that it can be
legitimately handled in this way and situations where standard
estimators applied to the real and imputed data have substantial biases.
For quantitative variables (single imputation; a combined sketch of methods a, e, and f follows this list)
a. Mean imputation
• Simply calculate the mean of the observed values for that variable for
all individuals who are non-missing. It has the advantage of keeping the
same mean and the same sample size.
b. Substitution
• Impute the value from a new individual who was not selected to be in
the sample. In other words, go find a new subject and use their value
instead.
c. Hot Deck imputation
• A randomly chosen value from an individual in the sample who has
similar values on other variables. An advantage is the random
component, which adds in some variability. This is important for
accurate standard errors.
d. Cold Deck Imputation
• A systematically chosen value from an individual who has similar
values on other variables; similar to Hot Deck but removes the random
variation. So for example, you may always choose the third individual in
the same experimental condition and block.
e. Regression imputation
• The predicted value obtained by regressing the missing variable on
other variables. This preserves relationships among variables involved in
the imputation model, but not variability around predicted values.
f. Stochastic Regression Imputation
• The predicted value from a regression plus a random residual value. This has all the advantages of regression imputation but adds in the advantages of the random component. Most multiple imputation is based on some form of stochastic regression imputation.
g. Interpolation and extrapolation
• An estimated value from other observations from the same individual. It
usually only works in longitudinal data.
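
Here is a combined sketch of methods (a), (e), and (f) on a hypothetical two-column dataset; hot deck, cold deck, and interpolation are omitted for brevity.

```python
# Sketch of mean, regression, and stochastic regression imputation.
# Data and column names are hypothetical.
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "y": [2.1, np.nan, 6.2, 7.9, np.nan, 12.3],
})

# (a) Mean imputation: replace missing y with the observed mean.
y_mean = df["y"].fillna(df["y"].mean())

# (e) Regression imputation: predict missing y from x.
obs = df.dropna()
model = LinearRegression().fit(obs[["x"]], obs["y"])
pred = model.predict(df.loc[df["y"].isna(), ["x"]])
y_reg = df["y"].copy()
y_reg[df["y"].isna()] = pred

# (f) Stochastic regression imputation: add a random residual to each prediction.
resid_sd = (obs["y"] - model.predict(obs[["x"]])).std()
y_stoch = df["y"].copy()
y_stoch[df["y"].isna()] = pred + rng.normal(0, resid_sd, size=len(pred))

print(pd.DataFrame({"mean": y_mean, "regression": y_reg, "stochastic": y_stoch}))
```
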
For quantitative variables (multiple imputation)
• A combination of two or more methods is used to derive a composite estimate – usually the mean of the various estimates – for the missing value. The rationale of this approach is that using multiple approaches minimizes the specific concerns with any single method, and the composite will be the best possible estimate.
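
One widely used implementation of this idea is scikit-learn's IterativeImputer, which chains regressions in the spirit of stochastic regression imputation; the columns below are hypothetical.

```python
# Multiple-imputation-style estimation with scikit-learn's IterativeImputer.
# Column names and values are hypothetical.
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

df = pd.DataFrame({
    "age":    [25, 32, np.nan, 41, 38, np.nan],
    "income": [30000, np.nan, 52000, 61000, np.nan, 45000],
})

# Each missing value is modeled from the other columns, iterating until
# the estimates stabilize; sample_posterior adds the random component.
imputer = IterativeImputer(random_state=0, sample_posterior=True)
completed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
print(completed)
# For true multiple imputation, repeat with different random_state values
# and pool (e.g., average) the resulting estimates.
```
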

Duplicates
● Data cleansing identifies duplicate records in data sets and either removes or merges them through deduplication measures. For example, when data from two systems is combined, duplicate data entries can be reconciled to create single records (see the sketch below).
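
In pandas, a basic deduplication pass might look like the sketch below; the matching key is hypothetical, and real record linkage often requires fuzzier matching.

```python
# Deduplication sketch after combining two hypothetical sources.
import pandas as pd

a = pd.DataFrame({"customer_id": [1, 2], "email": ["a@x.com", "b@x.com"]})
b = pd.DataFrame({"customer_id": [2, 3], "email": ["b@x.com", "c@x.com"]})

combined = pd.concat([a, b], ignore_index=True)

# Drop duplicates, keeping the first occurrence of each customer_id.
deduped = combined.drop_duplicates(subset="customer_id", keep="first")
print(deduped)
```
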
Data Entry Errors
● Data entry errors are structural errors introduced when data is recorded: misspellings and other typographical errors, wrong numerical entries, stray whitespace, and syntax errors. They are detected during inspection and corrected by fixing or recoding the affected values (see the sketch below).
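
A minimal sketch of correcting such errors with pandas; the misspelling map and columns are invented for illustration.

```python
# Fixing hypothetical entry errors: stray whitespace, inconsistent
# casing, known misspellings, and non-numeric values in a numeric column.
import pandas as pd

df = pd.DataFrame({
    "city":  [" manila", "Quezon Cty", "MANILA ", "Cebu"],
    "sales": ["1200", "95O", "780", "1010"],   # "95O" contains a letter O typo
})

df["city"] = (
    df["city"].str.strip()                          # remove stray whitespace
              .str.title()                          # normalize capitalization
              .replace({"Quezon Cty": "Quezon City"})  # fix known misspellings
)

# Coerce non-numeric entries to NaN so they surface as missing values.
df["sales"] = pd.to_numeric(df["sales"], errors="coerce")
print(df)
```
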
Standardizing Data
● Standardizing data brings values into a common format and unit of measure, supporting the uniformity characteristic above: consistent capitalization, date formats, category labels, and measurement units across the dataset (see the sketch below).
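
A sketch of standardizing formats and units; the columns and the pound-to-kilogram scenario are hypothetical, and the mixed-format date parsing requires pandas 2.0 or newer.

```python
# Standardizing hypothetical date formats, category labels, and units.
import pandas as pd

df = pd.DataFrame({
    "signup":    ["2023-01-05", "05/02/2023", "March 3, 2023"],
    "plan":      ["basic", "BASIC", "Premium"],
    "weight_lb": [150.0, 180.0, 200.0],
})

# Parse mixed date strings into one datetime format (pandas >= 2.0).
df["signup"] = pd.to_datetime(df["signup"], format="mixed")

# Normalize category labels to one spelling.
df["plan"] = df["plan"].str.lower()

# Convert everything to one unit of measure (pounds -> kilograms).
df["weight_kg"] = df["weight_lb"] * 0.453592
df = df.drop(columns="weight_lb")
print(df)
```
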
Outliers
• Outliers are observations with a unique combination of characteristics
identifiable as distinctly different from the other observations.
• A univariate outlier is a data point that consists of an extreme value on
one variable. A multivariate outlier is a combination of unusual scores on
at least two variables. Both types of outliers can influence the outcome of
statistical analyses.
• Outliers cannot be categorically characterized as either beneficial or
problematic, but instead must be viewed within the context of the
analysis and should be evaluated by the types of information they may
provide.
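
One common way to flag univariate outliers is the 1.5 × IQR rule, sketched below on a hypothetical numeric column; multivariate outliers require combined measures such as Mahalanobis distance, since each variable alone may look ordinary.

```python
# Flag univariate outliers with the 1.5 * IQR rule (values are hypothetical).
import pandas as pd

s = pd.Series([12, 14, 13, 15, 14, 13, 98, 12])  # 98 looks suspicious

q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = s[(s < lower) | (s > upper)]
print(outliers)  # -> 98
```
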
Four Classes of Outliers
1. Outliers arising from a procedural error, such as a data entry error or a mistake in coding. These should be identified in the data cleaning stage, but if overlooked, they should be eliminated or recoded as missing values.
2. Outliers resulting from an extraordinary event, which explains the uniqueness of the observation. The researcher must decide whether the extraordinary event should be represented in the sample. If so, the outlier should be retained in the analysis; if not, it should be deleted.
3. Extraordinary observations for which the researcher has no explanation. Although these are the outliers most likely to be omitted, they may be retained if the researcher feels they represent a valid segment of the population.
4. Observations that fall within the ordinary range of values on each of the variables but are unique in their combination of values across the variables. In these situations, the researcher should retain the observation unless specific evidence is available that discounts the outlier as a valid member of the population.
