Data Quality
• Poor data quality negatively affects many data processing efforts
  “The most important point is that poor data quality is an unfolding disaster.
  – Poor data quality costs the typical company at least ten percent (10%) of revenue; twenty percent (20%) is probably a better estimate.”
• Data mining example: a classification model for detecting people who are loan risks is built using poor data
  – Some credit-worthy candidates are denied loans
  – More loans are given to individuals that default

Data Quality …
• What kinds of data quality problems?
• How can we detect problems with the data?
• What can we do about these problems?

• Examples of data quality problems:
  – Noise and outliers
  – Missing values
  – Duplicate data
  – Wrong data

Noise
• For objects, noise is an extraneous object
• For attributes, noise refers to modification of original values
  – Examples: distortion of a person’s voice when talking on a poor phone connection and “snow” on a television screen
  – We talk about the signal-to-noise ratio (SNR). The left image of two sine waves is noise-free, so its SNR is very high; the right image shows the same two waves combined with noise and has a lower SNR
[Figure: two panels, "Two Sine Waves" (left) and "Two Sine Waves + Noise" (right)]
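
To make the signal-to-noise comparison concrete, here is a minimal Python sketch that builds two summed sine waves, adds Gaussian noise, and measures the SNR in decibels. The function name `snr_db`, the frequencies (7 and 13 cycles), and the noise standard deviation (0.5) are illustrative assumptions, not values taken from the slides.

```python
import math
import random


def snr_db(signal, noisy):
    """SNR in decibels: 10 * log10(signal power / noise power)."""
    noise = [n - s for s, n in zip(signal, noisy)]
    p_signal = sum(s * s for s in signal) / len(signal)
    p_noise = sum(e * e for e in noise) / len(noise)
    if p_noise == 0:
        return math.inf  # a noise-free signal has unbounded SNR
    return 10 * math.log10(p_signal / p_noise)


rng = random.Random(0)  # fixed seed so the run is repeatable
t = [i / 1000 for i in range(1000)]

# "Two Sine Waves": the clean signal (assumed frequencies).
clean = [math.sin(2 * math.pi * 7 * x) + math.sin(2 * math.pi * 13 * x) for x in t]

# "Two Sine Waves + Noise": the same signal with Gaussian noise added.
noisy = [c + rng.gauss(0, 0.5) for c in clean]

print("clean SNR (dB):", snr_db(clean, clean))
print("noisy SNR (dB):", snr_db(clean, noisy))
```

Under these assumptions the clean signal has power of about 1.0 and the noise about 0.25, so the noisy panel should come out near 10·log10(4) ≈ 6 dB, while the clean panel has no noise to divide by.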

Origins of noise
• outliers -- values seemingly out of the normal range of data
• duplicate records -- good database design should minimize this (use DISTINCT on SQL retrievals)
• incorrect attribute values -- again, good database design and integrity constraints should minimize this
• numeric-only attributes -- deal with rogue strings or characters where numbers should be
• how to locate and treat outliers (values seemingly out of the normal range)
• null handling for attributes (nulls = missing values)

Outliers
• Outliers are data objects with characteristics that are considerably different from most of the other data objects in the data set
  – Case 1: Outliers are noise that interferes with data analysis
  – Case 2: Outliers are the goal of our analysis
    • Credit card fraud
    • Intrusion detection

Missing Values
• Reasons for missing values
  – Information is not collected (e.g., people decline to give their age and weight)
  – Attributes may not be applicable to all cases (e.g., annual income is not applicable to children)
• Handling missing values
  – Eliminate data objects or variables
  – Estimate missing values
    • Example: time series of temperature
    • Example: census results
  – Ignore the missing value during analysis
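
The three handling strategies above can be sketched in a few lines of Python. This is only an illustration: missing values are assumed to be stored as `None`, the estimate uses linear interpolation (one plausible choice for a temperature time series, not something the slides prescribe), and the function names and sample readings are made up.

```python
import math


def drop_missing(values):
    """Strategy 1: eliminate data objects whose value is missing (None)."""
    return [v for v in values if v is not None]


def interpolate_missing(values):
    """Strategy 2: estimate interior gaps by linear interpolation
    between the nearest known neighbours."""
    known = [i for i, v in enumerate(values) if v is not None]
    filled = list(values)
    for lo, hi in zip(known, known[1:]):
        step = (values[hi] - values[lo]) / (hi - lo)
        for i in range(lo + 1, hi):
            filled[i] = values[lo] + step * (i - lo)
    return filled  # gaps before the first or after the last known value stay None


def mean_ignoring_missing(values):
    """Strategy 3: ignore missing values during the analysis itself."""
    present = drop_missing(values)
    return sum(present) / len(present) if present else math.nan


# Hypothetical temperature readings with gaps.
temps = [20.0, None, 24.0, None, None, 30.0]

print(drop_missing(temps))
print(interpolate_missing(temps))
print(mean_ignoring_missing(temps))
```

Which strategy fits depends on the data: interpolation only makes sense when neighbouring values are related, as in a time series, whereas dropping objects is simplest but shrinks the data set.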
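
The cleaning tasks listed under "Origins of noise" (rogue strings in numeric fields, duplicate records, locating outliers) can be sketched the same way. The records below are invented, and the outlier rule is Tukey's interquartile-range fences with the conventional multiplier 1.5, which is an assumption on my part; the slides do not name a detection method. Outliers are flagged rather than deleted, since they may be the goal of the analysis.

```python
import math
import statistics


def to_number(raw):
    """Return a finite float, or None when the field holds a rogue string."""
    try:
        value = float(raw)
    except (TypeError, ValueError):
        return None
    return value if math.isfinite(value) else None


def iqr_outliers(values, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR] (Tukey's fences)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [v for v in values if v < low or v > high]


# Hypothetical (name, age) records with a duplicate, a rogue string, and a typo.
records = [
    ("ann", "34"), ("bob", "29"), ("cat", "n/a"), ("bob", "29"),
    ("dan", "41"), ("eve", "410"), ("fay", "38"), ("gus", "36"), ("hal", "31"),
]

# Drop exact duplicate records while keeping first-seen order
# (the in-memory analogue of SELECT DISTINCT).
unique = list(dict.fromkeys(records))

# Coerce ages to numbers; rogue strings become None and are set aside.
ages = [to_number(age) for _, age in unique]
valid_ages = [a for a in ages if a is not None]

# Locate candidate outliers for review.
suspects = iqr_outliers(valid_ages)
print(suspects)
```

Whether a flagged value such as an age of 410 is then corrected, removed, or investigated depends on which of the two outlier cases applies.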