Data Quality and Data Preprocessing
For a period of time following each month, the data stored in the database are incomplete. However,
once all of the data are received, they are correct. The fact that the month-end data are not updated
in a timely fashion has a negative impact on the data quality.
Two other factors affecting data quality are believability and
interpretability. Believability reflects how much the data are trusted by users,
while interpretability reflects how easily the data are understood. Suppose that a database, at
one point, had several errors, all of which have since been corrected. The past errors, however,
had caused many problems for sales department users, and so they no longer trust the data. The
data also use many accounting codes, which the sales department does not know how to
interpret. Even though the database is now accurate, complete, consistent, and timely, sales
department users may regard it as of low quality due to poor believability and interpretability.
Data cleaning routines work to “clean” the data by filling in missing values, smoothing noisy
data, identifying or removing outliers, and resolving inconsistencies. If users believe the data
are dirty, they are unlikely to trust the results of any data mining that has been applied.
Furthermore, dirty data can cause confusion for the mining procedure, resulting in unreliable
output. Although most mining routines have some procedures for dealing with incomplete or
noisy data, they are not always robust. Instead, they may concentrate on avoiding overfitting
the data to the function being modeled. Therefore, a useful preprocessing step is to run your
data through some data cleaning routines.
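To make these routines concrete, the following minimal sketch (not part of the original text) uses Python with pandas to fill in missing values, flag an outlier with a simple interquartile-range rule, and resolve inconsistent categorical labels; the customer records and column names such as annual_salary and branch are hypothetical.

```python
import pandas as pd

# Hypothetical customer records with typical "dirty data" problems:
# missing values, an extreme salary outlier, and inconsistent branch labels.
df = pd.DataFrame({
    "age": [23, 35, None, 41, 29],
    "annual_salary": [48000, 52000, 61000, 9900000, None],  # 9,900,000 is an outlier
    "branch": ["NY", "ny", "Boston", "NY", "boston"],
})

# Fill in missing values with the median of each numeric attribute.
for col in ["age", "annual_salary"]:
    df[col] = df[col].fillna(df[col].median())

# Identify outliers with a simple interquartile-range rule and remove them.
q1, q3 = df["annual_salary"].quantile([0.25, 0.75])
iqr = q3 - q1
df = df[df["annual_salary"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)]

# Resolve inconsistent categorical values by mapping them to canonical names.
df["branch"] = df["branch"].str.lower().map({"ny": "New York", "boston": "Boston"})

print(df)
```

In practice the choice between removing, capping, or smoothing a flagged value depends on the application; the rule above simply illustrates one common policy.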
Getting back to your task at AllElectronics, suppose that you would like to include data from
multiple sources in your analysis. This would involve integrating multiple databases, data
cubes, or files (i.e., data integration). Yet some attributes representing a given concept may
have different names in different databases, causing inconsistencies and redundancies. For
example, the attribute for customer identification may be referred to as customer_id in one data
store and cust_id in another. Naming inconsistencies may also occur for attribute values. For
example, the same first name could be registered as “Bill” in one database, “William” in
another, and “B.” in a third. Furthermore, you suspect that some attributes may be inferred
from others (e.g., annual revenue). Having a large amount of redundant data may slow down
or confuse the knowledge discovery process. Clearly, in addition to data cleaning, steps must
be taken to help avoid redundancies during data integration. Typically, data cleaning and data
integration are performed as a preprocessing step when preparing data for a data warehouse.
Additional data cleaning can be performed to detect and remove redundancies that may have
resulted from data integration.
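As an illustration of this kind of schema reconciliation and redundancy detection (a minimal sketch, not a prescribed procedure), the pandas code below renames cust_id to customer_id before merging two hypothetical extracts, and then drops an attribute that turns out to be perfectly correlated with, and hence derivable from, another.

```python
import pandas as pd

# Hypothetical extracts from two data stores that name the customer key differently.
store_a = pd.DataFrame({"customer_id": [1, 2, 3],
                        "monthly_revenue": [4.0, 7.5, 3.2]})
store_b = pd.DataFrame({"cust_id": [1, 2, 3],
                        "annual_revenue": [48.0, 90.0, 38.4],
                        "first_name": ["Bill", "William", "B."]})

# Reconcile the schema: map both keys to one canonical attribute name before merging.
store_b = store_b.rename(columns={"cust_id": "customer_id"})
merged = store_a.merge(store_b, on="customer_id", how="inner")

# Check for redundancy: annual_revenue may be derivable from monthly_revenue.
corr = merged["monthly_revenue"].corr(merged["annual_revenue"])
if abs(corr) > 0.99:  # near-perfect correlation suggests a redundant attribute
    merged = merged.drop(columns=["annual_revenue"])

print(merged.columns.tolist())
```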
“Hmmm,” you wonder, as you consider your data even further. “The data set I have selected
for analysis is HUGE, which is sure to slow down the mining process. Is there a way I can
reduce the size of my data set without jeopardizing the data mining results?” Data
reduction obtains a reduced representation of the data set that is much smaller in volume, yet
produces the same (or almost the same) analytical results. Data reduction strategies
include dimensionality reduction
techniques (e.g., wavelet transforms and principal components analysis), attribute subset
selection (e.g., removing irrelevant attributes), and attribute construction (e.g., where a small
set of more useful attributes is derived from the original set).
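A brief sketch of dimensionality reduction, assuming scikit-learn is available and using synthetic data rather than the AllElectronics tables: principal components analysis keeps just enough components to explain a chosen share of the variance, so far fewer attributes need to be mined.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Synthetic stand-in for a wide customer table: 1,000 records whose
# 20 numeric attributes are noisy mixtures of only 4 underlying factors.
factors = rng.normal(size=(1000, 4))
X = factors @ rng.normal(size=(4, 20)) + 0.05 * rng.normal(size=(1000, 20))

# Dimensionality reduction: keep just enough principal components to
# explain 95% of the variance instead of all 20 original attributes.
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X)
print(X.shape, "->", X_reduced.shape)  # typically (1000, 20) -> (1000, 4)
```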
Getting back to your data, you have decided, say, that you would like to use a distance-based
mining algorithm for your analysis, such as neural networks, nearest-neighbor classifiers, or
clustering. Such methods provide better results if the data to be analyzed have
been normalized, that is, scaled to a smaller range such as [0.0, 1.0]. Your customer data, for
example, contain the attributes age and annual salary. The annual salary attribute usually
takes much larger values than age. Therefore, if the attributes are left unnormalized, the
distance measurements taken on annual salary will generally outweigh distance measurements
taken on age. Discretization and concept hierarchy generation can also be useful, where raw
data values for attributes are replaced by ranges or higher conceptual levels. For example, raw
values for age may be replaced by higher-level concepts, such as youth, adult, or senior.
Discretization and concept hierarchy generation are powerful tools for data mining in that they
allow data mining at multiple abstraction levels. Normalization, data discretization, and
concept hierarchy generation are forms of data transformation. You soon realize such data
transformation operations are additional data preprocessing procedures that would contribute
toward the success of the mining process.
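The sketch below (illustrative only; the values, cut points, and column names are assumptions) shows min-max normalization of age and annual salary to [0.0, 1.0], and discretization of age into the higher-level concepts youth, adult, and senior, using pandas.

```python
import pandas as pd

# Hypothetical customer attributes; the values are illustrative.
customers = pd.DataFrame({
    "age": [19, 34, 52, 67, 45],
    "annual_salary": [28000, 61000, 84000, 39000, 72000],
})

# Min-max normalization: rescale both attributes to [0.0, 1.0] so that
# annual_salary no longer dominates distance computations over age.
normalized = (customers - customers.min()) / (customers.max() - customers.min())

# Discretization / concept hierarchy generation: replace raw age values
# with the higher-level concepts youth, adult, and senior.
customers["age_group"] = pd.cut(customers["age"],
                                bins=[0, 24, 59, 120],
                                labels=["youth", "adult", "senior"])

print(normalized.round(2))
print(customers[["age", "age_group"]])
```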
Figure 3.1 summarizes the data preprocessing steps described here. Note that the previous
categorization is not mutually exclusive. For example, the removal of redundant data may be
seen as a form of data cleaning, as well as data reduction.
In summary, real-world data tend to be dirty, incomplete, and inconsistent. Data preprocessing
techniques can improve data quality, thereby helping to improve the accuracy and efficiency
of the subsequent mining process. Data preprocessing is an important step in the knowledge
discovery process, because quality decisions must be based on quality data. Detecting data
anomalies, rectifying them early, and reducing the data to be analyzed can lead to huge payoffs
for decision making.