R21 DM Unit1
• Data quality is the measure of how well a data set serves its
specific purpose.
• The focus is on measurement and data collection issues.
Measurement and Data Collection Errors
• The term measurement error refers to any problem resulting from
the measurement process.
• A common problem is that the value recorded differs from the true
value to some extent.
• For continuous attributes, the numerical difference of the measured
and true value is called the error.
• The term data collection error refers to errors such as
omitting data objects or attribute values, or inappropriately
including a data object.
Noise and Artifacts
• Noise is the random component of a measurement error. It may
involve the distortion of a value or the addition of spurious
objects.
• Data errors may also result from a more deterministic phenomenon,
such as a streak in the same place on a set of photographs.
• Such deterministic distortions of the data are referred to as
artifacts.
• In statistics, the quality of the measurement process and the
resulting data is measured by precision and bias.
• Precision: The closeness of repeated measurements (of the same
quantity) to one another.
• Bias: A systematic variation of measurements from the quantity
being measured.
• Accuracy: The closeness of measurements to the true value of the
quantity being measured.
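The two definitions above can be made concrete with a short calculation. This is a minimal sketch, assuming precision is reported as the sample standard deviation of repeated measurements and bias as the difference between their mean and the known true value. The function name and the weighings of a 1 g reference mass are illustrative, not taken from these notes.

```python
import statistics

def precision_and_bias(measurements, true_value):
    """Return (precision, bias) for repeated measurements of one quantity.

    Assumption: precision = sample standard deviation,
                bias = mean of the measurements minus the true value.
    """
    mean = statistics.mean(measurements)
    precision = statistics.stdev(measurements)
    bias = mean - true_value
    return precision, bias

# Illustrative data: five weighings of a 1 g reference mass.
weights = [1.015, 0.990, 1.013, 1.001, 0.986]
precision, bias = precision_and_bias(weights, true_value=1.0)
print(f"precision = {precision:.3f} g, bias = {bias:+.3f} g")
```

A small standard deviation means the measurements agree with one another (high precision). A bias near zero means they are centred on the true value. Accuracy depends on both.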
Outliers
• Outliers are either (1) data objects that have characteristics that
are different from the other data objects in the data set, or
(2) values of an attribute that are unusual with respect to the values
for that attribute.
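For case (2), one common heuristic (not prescribed by these notes) is the interquartile-range rule: flag values that lie more than k times the IQR below the first quartile or above the third. The sketch below assumes k = 1.5 and uses a made-up list of ages. Other definitions of "unusual" are equally valid.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)  # the three quartile cut points
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

# Illustrative attribute values: ages with one suspicious entry.
ages = [23, 25, 27, 29, 31, 33, 35, 120]
print(iqr_outliers(ages))  # prints [120]
```

A flagged value is only a candidate: an outlier can be a legitimate object of interest rather than an error.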
Missing Values
• It is not unusual for an object to be missing one or more attribute values.
• In some cases, the information was not collected; e.g., some people
decline to give their age or weight.
There are several strategies for dealing with missing data:
• Eliminate Data Objects or Attributes: A simple and effective
strategy is to eliminate objects with missing values.
• Estimate Missing Values: Sometimes missing data can be reliably
estimated.
• For example, consider a time series that changes in a reasonably
smooth fashion, but has a few, widely scattered missing values.
• In such cases, the missing values can be estimated by using the
remaining values.
• Ignore the Missing Value during Analysis: Many data mining
approaches can be modified to ignore missing values.
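• For example, suppose that objects are being clustered and the
similarity between pairs of data objects needs to be calculated.
If one or both objects of a pair have missing values for some
attributes, the similarity can be calculated using only the
attributes that do not have missing values.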
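The three strategies can be sketched in a few lines. This is a minimal illustration under stated assumptions: missing values are represented as None, estimation uses linear interpolation between the nearest known neighbours (one possible estimator for a smooth series), and the proximity measure is Euclidean distance rather than a similarity score. All function names are hypothetical.

```python
import math

def drop_incomplete(objects):
    """Strategy 1: keep only objects with no missing (None) attribute values."""
    return [obj for obj in objects if None not in obj]

def interpolate(series):
    """Strategy 2: fill interior gaps in a time series by linear interpolation."""
    filled = list(series)
    known = [i for i, v in enumerate(series) if v is not None]
    for left, right in zip(known, known[1:]):
        step = (series[right] - series[left]) / (right - left)
        for i in range(left + 1, right):
            filled[i] = series[left] + step * (i - left)
    return filled  # gaps before the first or after the last known value stay None

def distance_ignoring_missing(a, b):
    """Strategy 3: Euclidean distance over attributes present in both objects."""
    pairs = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    if not pairs:
        return None  # no shared attributes, so no distance can be computed
    return math.sqrt(sum((x - y) ** 2 for x, y in pairs))

print(drop_incomplete([(1, 2), (None, 3)]))
print(interpolate([1.0, None, 3.0, None, None, 6.0]))
print(distance_ignoring_missing((1.0, None, 4.0), (4.0, 2.0, 8.0)))
```

Dropping objects is the simplest option but discards information. Distances computed over different attribute subsets are not directly comparable unless they are rescaled, which this sketch does not attempt.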
Inconsistent Values
• Data can contain inconsistent values. Consider an address field,
where both a zip code and city are listed, but the specified zip code
area is not contained in that city.
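The zip code example can be checked mechanically when a reference table is available. The sketch below assumes a small hypothetical lookup table, ZIP_TO_CITY, and records stored as dictionaries with "zip" and "city" keys. A real check would use a full postal reference data set.

```python
# Hypothetical reference table (ZIP code -> city), for illustration only.
ZIP_TO_CITY = {
    "10001": "New York",
    "60601": "Chicago",
    "94105": "San Francisco",
}

def find_inconsistent(records, zip_to_city=ZIP_TO_CITY):
    """Return records whose city disagrees with the city expected for their ZIP code."""
    flagged = []
    for rec in records:
        expected = zip_to_city.get(rec["zip"])
        if expected is None:
            continue  # unknown ZIP code: cannot be verified with this table
        if expected.casefold() != rec["city"].strip().casefold():
            flagged.append(rec)
    return flagged

addresses = [
    {"zip": "10001", "city": "New York"},
    {"zip": "60601", "city": "San Francisco"},  # inconsistent pair
]
print(find_inconsistent(addresses))
```

Detecting the inconsistency does not say which field is wrong. Correcting it usually needs additional or redundant information.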
Duplicate Data
• A data set may include data objects that are duplicates, or almost
duplicates, of one another.
• Many people receive duplicate mailings because they appear in a
database multiple times under slightly different names.
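Near-duplicate names of this kind can be surfaced with approximate string matching. This is a minimal sketch, assuming a simple normalisation step and the standard library's difflib.SequenceMatcher with an arbitrary similarity threshold of 0.85. The names and the threshold are illustrative, and comparing every pair is only practical for small lists.

```python
from difflib import SequenceMatcher
from itertools import combinations

def normalize(name):
    """Lowercase, drop periods and commas, and collapse whitespace."""
    cleaned = name.lower().replace(".", " ").replace(",", " ")
    return " ".join(cleaned.split())

def near_duplicates(names, threshold=0.85):
    """Return pairs of names whose normalized forms are at least `threshold` similar."""
    pairs = []
    for a, b in combinations(names, 2):
        ratio = SequenceMatcher(None, normalize(a), normalize(b)).ratio()
        if ratio >= threshold:
            pairs.append((a, b))
    return pairs

mailing_list = ["John A. Smith", "Jon A Smith", "Mary Jones"]
print(near_duplicates(mailing_list))
```

Matched pairs are candidates only: two different people can share a very similar name, so other attributes such as the address should be compared before records are merged.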
Issues Related to Applications
• Data quality issues can also be considered from an application
viewpoint, as expressed by the statement “data is of high quality
if it is suitable for its intended use.”
• Timeliness: Some data starts to age as soon as it has been collected.
• Relevance: The available data must contain the information
necessary for the application.
• Consider the task of building a model that predicts the accident