Data Cleaning
Duplicates
● Data cleansing identifies duplicate records in data sets and either
removes or merges them through the use of deduplication
measures. For example, when data from two systems is combined,
duplicate data entries can be reconciled to create single records.
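The reconciliation described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the field names (name, email, phone), the sample records, and the rule "same name and email means same person" are all hypothetical, and real deduplication usually needs fuzzier matching.

```python
def normalize_key(record):
    """Build a comparison key; ASSUMES name + email identify one person."""
    return (record["name"].strip().lower(), record["email"].strip().lower())


def merge_records(records):
    """Collapse duplicates, keeping the first non-empty value seen per field."""
    merged = {}
    for record in records:
        key = normalize_key(record)
        if key not in merged:
            # First time this key appears: keep a copy as the surviving record.
            merged[key] = dict(record)
            continue
        # Duplicate: fill only the fields the surviving record is missing.
        for field, value in record.items():
            if not merged[key].get(field) and value:
                merged[key][field] = value
    return list(merged.values())


# Hypothetical exports from two systems that describe the same customer.
crm = [{"name": "Ana Cruz", "email": "ana@example.com", "phone": ""}]
billing = [
    {"name": "ana cruz ", "email": "ANA@example.com", "phone": "555-0101"},
    {"name": "Ben Reyes", "email": "ben@example.com", "phone": "555-0102"},
]

deduplicated = merge_records(crm + billing)
for row in deduplicated:
    print(row)
```

The two "Ana Cruz" entries collapse into one record that keeps the CRM spelling of the name and picks up the phone number from billing. Merging rather than deleting is a design choice here, so that no information from either system is lost.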
Data Entry Errors
Standardizing Data
Outliers
• Outliers are observations with a unique combination of characteristics
identifiable as distinctly different from the other observations.
• A univariate outlier is a data point that consists of an extreme value on
one variable. A multivariate outlier is a combination of unusual scores on
at least two variables. Both types of outliers can influence the outcome of
statistical analyses.
• Outliers cannot be categorically characterized as either beneficial or
problematic, but instead must be viewed within the context of the
analysis and should be evaluated by the types of information they may
provide.
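One common way to flag candidate univariate outliers is the 1.5 x IQR rule (Tukey's fences). The slides do not prescribe a detection method, so the rule, the multiplier, and the sample values below are assumptions made for illustration. In keeping with the point above, the sketch only flags values for review; it does not delete them.

```python
from statistics import quantiles


def iqr_outliers(values, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR].

    k=1.5 is the conventional multiplier, an assumption of this sketch.
    """
    q1, _, q3 = quantiles(values, n=4)  # quartiles, default 'exclusive' method
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]


# Hypothetical measurements on one variable; 98 is far from the rest.
scores = [12, 13, 12, 14, 13, 15, 14, 13, 98]

flagged = iqr_outliers(scores)
print(flagged)
```

Whether the flagged value is an error or a legitimate extreme case still has to be judged in the context of the analysis.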
Four Classes of Outliers
1. Outliers that arise from a procedural error, such as a data entry error or a
mistake in coding. These outliers should be identified in the data
cleaning stage, but if overlooked, they should be eliminated or recorded
as missing values.
2. Observations that occur as the result of an extraordinary event,
which explains the uniqueness of the observation. The
researcher must decide whether the extraordinary event should be
represented in the sample. If so, the outlier should be retained in the
analysis; if not, it should be deleted.
3. Extraordinary observations for which the researcher has no
explanation. Although these are the outliers most likely to be omitted,
they may be retained if the researcher feels they represent a valid
segment of the population.
4. Observations that fall within the ordinary range of
values on each of the variables but are unique in their combination
of values across the variables. In these situations, the researcher
should retain the observation unless specific evidence is available that
discounts the outlier as a valid member of the population.
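The fourth class is the hardest to spot, because each value looks ordinary on its own. A common way to detect such cases is the Mahalanobis distance, which measures how far a point lies from the centre of the data while accounting for the correlation between variables. The sketch below is an illustration, not a method taken from the slides: the data are made up, only two variables are handled, and the cutoff 5.991 (the 0.95 quantile of a chi-square distribution with 2 degrees of freedom) is a conventional choice assumed here.

```python
from statistics import mean

# ASSUMED cutoff: 0.95 quantile of chi-square with 2 degrees of freedom.
CHI2_CUTOFF = 5.991


def mahalanobis_squared(points):
    """Squared Mahalanobis distance of each (x, y) point from the sample centroid."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    mx, my = mean(xs), mean(ys)
    n = len(points)
    # Sample variances and covariance (n - 1 denominator).
    sxx = sum((x - mx) ** 2 for x in xs) / (n - 1)
    syy = sum((y - my) ** 2 for y in ys) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in points) / (n - 1)
    det = sxx * syy - sxy ** 2  # determinant of the 2x2 covariance matrix
    distances = []
    for x, y in points:
        dx, dy = x - mx, y - my
        # d^2 = [dx dy] * inverse(covariance) * [dx dy]^T, written out for 2x2.
        distances.append((syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det)
    return distances


# Hypothetical data: y tracks x closely, except for the last point.
# (7, 2.0) is within the observed range of x (1 to 8) and of y (1.2 to 8.0),
# but the combination "high x, low y" is unusual.
data = [(1, 1.2), (2, 1.9), (3, 3.1), (4, 4.2), (5, 4.8),
        (6, 6.1), (7, 6.9), (8, 8.0), (7, 2.0)]

d2 = mahalanobis_squared(data)
flagged = [i for i, d in enumerate(d2) if d > CHI2_CUTOFF]
print(flagged)
```

Only the last point is flagged, even though neither of its values is extreme by itself. As the text above says, such an observation should be retained unless there is specific evidence that it is not a valid member of the population.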
REFERENCES:
● https://www.tableau.com/learn/articles/what-is-data-cleaning#:~:text=Data%20cleaning%20is%20the%20process,incomplete%20data%20within%20a%20dataset.
● https://www.techtarget.com/searchdatamanagement/definition/data-scrubbing
● Ma’am Meann's PPT (lecture slides)