Da 5

Data Quality

Uploaded by

Sharath Rakki
Data Quality

• Poor data quality negatively affects many data processing
efforts
“The most important point is that poor data quality is an
unfolding disaster.
– Poor data quality costs the typical company at least
ten percent (10%) of revenue; twenty percent (20%)
is probably a better estimate.”
• Data mining example: a classification model for detecting
people who are loan risks is built using poor data
– Some credit-worthy candidates are denied loans
– More loans are given to individuals who default
Data Quality …
• What kinds of data quality problems?
• How can we detect problems with the data?
• What can we do about these problems?

• Examples of data quality problems:
– Noise and outliers
– Missing values
– Duplicate data
– Wrong data
Noise
• For objects, noise is an extraneous object
• For attributes, noise refers to modification of original values
– Examples: distortion of a person’s voice when talking on a poor phone
and “snow” on television screen
– Noise is commonly quantified by the signal-to-noise ratio (SNR).
The left image shows two sine waves without noise, so its SNR is very high;
the right image shows the same two waves combined with noise and has a lower SNR

[Figure: left panel "Two Sine Waves"; right panel "Two Sine Waves + Noise"]
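The SNR comparison above can be checked numerically. The following is a minimal sketch, assuming the usual definition SNR = 10 * log10(signal power / noise power) in decibels; the function name snr_db, the two sine frequencies, and the noise level are illustrative choices, not values taken from the slides.

```python
import math
import random

def snr_db(signal, noisy):
    """Signal-to-noise ratio in decibels: 10 * log10(P_signal / P_noise)."""
    # The noise is whatever was added on top of the clean signal.
    noise = [n - s for s, n in zip(signal, noisy)]
    p_signal = sum(s * s for s in signal) / len(signal)
    p_noise = sum(e * e for e in noise) / len(noise)
    if p_noise == 0:
        return math.inf  # no noise at all
    return 10 * math.log10(p_signal / p_noise)

random.seed(0)  # fixed seed so the sketch is repeatable
t = [i / 1000 for i in range(1000)]
# Two sine waves added together (frequencies are arbitrary assumptions).
clean = [math.sin(2 * math.pi * 5 * x) + math.sin(2 * math.pi * 12 * x) for x in t]
# The same waves with Gaussian noise added (standard deviation is an assumption).
noisy = [c + random.gauss(0, 0.5) for c in clean]

print(snr_db(clean, clean))  # noise-free signal
print(snr_db(clean, noisy))  # signal plus noise: a finite, lower SNR
```

The noise-free signal has unbounded SNR, while adding noise brings the ratio down to a finite value; more noise would lower it further.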


Origins of noise
• Outliers -- values seemingly out of the normal
range of data
• Duplicate records -- good database design should
minimize this (use DISTINCT on SQL retrievals)
• Incorrect attribute values -- again, good database design
and integrity constraints should minimize this
• Numeric-only attributes -- deal with rogue strings or characters
where numbers should be
• Outlier treatment -- decide how to locate and treat values
seemingly out of the normal range
• Null handling for attributes (nulls = missing values)
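Two of the items above, duplicate records and rogue strings in numeric fields, can be sketched in a few lines. This is only an illustration under assumed data: the record layout (dictionaries with "id" and "age" keys) and the helper names drop_duplicates and to_number are hypothetical, and in a database the same effect would come from DISTINCT and integrity constraints.

```python
def drop_duplicates(records):
    """Keep the first occurrence of each record, preserving order."""
    seen = set()
    unique = []
    for rec in records:
        # A record is identified by all of its (attribute, value) pairs.
        key = tuple(sorted(rec.items()))
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    return unique

def to_number(value):
    """Return value as a float, or None if it is not numeric."""
    try:
        return float(str(value).strip())
    except ValueError:
        return None  # rogue string: treat as a missing value

# Hypothetical records: one exact duplicate and one rogue string.
records = [
    {"id": 1, "age": "34"},
    {"id": 2, "age": "n/a"},
    {"id": 1, "age": "34"},
]
unique = drop_duplicates(records)
ages = [to_number(r["age"]) for r in unique]
print(unique)
print(ages)
```

Mapping rogue strings to None turns an incorrect-value problem into a missing-value problem, which can then be handled with the usual null-handling strategies.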
Outliers
• Outliers are data objects with characteristics that
are considerably different from those of most of the other
data objects in the data set
– Case 1: Outliers are noise that interferes with data analysis
– Case 2: Outliers are the goal of our analysis
• Credit card fraud
• Intrusion detection
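The slides do not name a detection method. As one common option, the sketch below flags values outside the interquartile-range (IQR) fences, Q1 - 1.5 * IQR and Q3 + 1.5 * IQR. The function name iqr_outliers, the multiplier 1.5, and the sample values are assumptions made for illustration.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return the values lying outside the IQR fences (k = 1.5 by convention)."""
    q1, _, q3 = statistics.quantiles(values, n=4)  # quartile cut points
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

# Hypothetical measurements with one value far from the rest.
values = [10, 12, 11, 13, 12, 11, 95, 12, 10, 13]
print(iqr_outliers(values))
```

Whether a flagged value is discarded (Case 1) or investigated (Case 2) depends on the goal of the analysis; the detection step is the same.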
Missing Values
• Reasons for missing values
– Information is not collected
(e.g., people decline to give their age and weight)
– Attributes may not be applicable to all cases
(e.g., annual income is not applicable to children)

• Handling missing values
– Eliminate data objects or variables
– Estimate missing values
• Example: time series of temperature
• Example: census results
– Ignore the missing value during analysis
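The three handling strategies above can be sketched on a small temperature time series. This is a minimal illustration, assuming missing values are represented as None and that linear interpolation is an acceptable estimate; the function names and the sample readings are hypothetical.

```python
def drop_missing(values):
    """Eliminate: remove the entries that have no value."""
    return [v for v in values if v is not None]

def interpolate(values):
    """Estimate: fill interior gaps linearly between known neighbors.

    Leading and trailing gaps have only one neighbor and are left as None.
    """
    filled = list(values)
    known = [i for i, v in enumerate(values) if v is not None]
    for left, right in zip(known, known[1:]):
        step = (values[right] - values[left]) / (right - left)
        for i in range(left + 1, right):
            filled[i] = values[left] + step * (i - left)
    return filled

def mean_ignoring_missing(values):
    """Ignore: compute the statistic over the present values only."""
    present = drop_missing(values)
    return sum(present) / len(present) if present else None

# Hypothetical temperature readings with gaps.
temps = [20.0, None, 22.0, None, None, 25.0]
print(drop_missing(temps))
print(interpolate(temps))
print(mean_ignoring_missing(temps))
```

Interpolation suits a time series such as temperature because neighboring readings are closely related; for unordered data, an estimate such as the attribute mean or a value from similar objects would be a more natural choice.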
