1. Preparing Data
1.1. Missing Data
For many real-world applications of data mining, even when there are huge amounts of data, the subset of cases with complete data may be relatively small. Available samples and also future cases may have values missing. Some of the data mining methods accept missing values and satisfactorily process data to reach a final conclusion. Other methods require that all values be available. An obvious question is whether these missing values can be filled in during data preparation, prior to the application of the data mining methods. The simplest solution for this problem is the reduction of the data set and the elimination of all samples with missing values. That is possible when large data sets are available, and missing values occur only in a small percentage of samples. If we do not drop the samples with missing values, then we have to find values for them. What are the practical solutions?
First, a data miner, together with the domain expert, can manually examine samples that have missing values and enter a reasonable, probable, or expected value, based on domain experience. The method is straightforward for small numbers of missing values and relatively small data sets. But if there is no obvious or plausible value for each case, the miner is introducing noise into the data set by manually generating a value.
The second approach gives an even simpler solution for the elimination of missing values. It is based on a formal, often automatic, replacement of missing values with some constants, such as:
a. replace all missing values with a single global constant (the selection of a global constant is highly application-dependent);
b. replace a missing value with its feature mean;
c. replace a missing value with its feature mean for the given class (this approach is possible only for classification problems where samples are classified in advance).
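To make these three replacement schemes concrete, the following minimal Python sketch applies each of them to a small feature vector, with missing values encoded as None. The data, class labels, and the global constant 0.0 are invented for the illustration and are not taken from the text.

from statistics import mean

# Toy feature with missing values (None) and a class label per sample;
# both are invented purely for illustration
feature = [3.0, None, 7.0, 5.0, None, 4.0]
labels = ['A', 'A', 'B', 'B', 'B', 'A']

present = [v for v in feature if v is not None]

# (a) replace with a single global constant (application-dependent choice)
filled_constant = [v if v is not None else 0.0 for v in feature]

# (b) replace with the feature mean over all available values
filled_mean = [v if v is not None else mean(present) for v in feature]

# (c) replace with the feature mean for the sample's class
#     (possible only when class labels are known in advance)
class_mean = {c: mean([v for v, l in zip(feature, labels)
                       if l == c and v is not None])
              for c in set(labels)}
filled_class_mean = [v if v is not None else class_mean[l]
                     for v, l in zip(feature, labels)]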
These simple solutions are tempting. Their main flaw is that the substituted value is not the correct value. By replacing the missing value with a constant, or by changing the values for a few different features, the data are biased. The replaced value (or values) will homogenize the cases with missing values into a uniform subset directed toward the class with the most missing values (an artificial class). If missing values are replaced with a single global constant for all features, an unknown value may be implicitly made into a positive factor that is not objectively justified.
One possible interpretation of missing values is that they are "don't care" values. In other words, we suppose that these values do not have any influence on the final data mining results. In that case, a sample with a missing value may be extended to a set of artificial samples, where, for each new sample, the missing value is replaced with one of the possible feature values of a given domain. Although this interpretation may look more natural, the problem with this approach is the combinatorial explosion of artificial samples. For example, if one three-dimensional sample X is given as X = {1, ?, 3}, where the second feature's value is missing, the process will generate five artificial samples for the feature domain [0, 1, 2, 3, 4]:
X1 = {1, 0, 3}, X2 = {1, 1, 3}, X3 = {1, 2, 3}, X4 = {1, 3, 3}, and X5 = {1, 4, 3}.
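The expansion itself is mechanical, and a short sketch makes it explicit. The function name and the encoding of a missing value as None are assumptions made for the example:

def expand_dont_care(sample, domain):
    # Replace every missing value (None) with each value from the
    # feature domain, producing the full set of artificial samples
    results = [[]]
    for value in sample:
        options = domain if value is None else [value]
        results = [r + [v] for r in results for v in options]
    return results

# The sample X = {1, ?, 3} over the feature domain [0, 1, 2, 3, 4]
for artificial in expand_dont_care([1, None, 3], [0, 1, 2, 3, 4]):
    print(artificial)   # [1, 0, 3], [1, 1, 3], [1, 2, 3], [1, 3, 3], [1, 4, 3]

With several missing values the number of artificial samples is the product of the domain sizes, which is exactly the combinatorial explosion described above.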
Finally, the data miner can generate a predictive model to predict each of the missing values. For example, if three features A, B, and C are given for each sample, then, based on the samples that have all three values as a training set, the data miner can generate a model of correlation between features. Different techniques, such as regression, Bayesian formalism, clustering, or decision tree induction, may be used depending on data types (all these techniques are explained later). Once you have a trained model, you can present a new sample that has a value missing and generate a predictive value. For example, if values for features A and B are given, the model generates the value for the feature C. If a missing value is highly correlated with the other known features, this process will generate the best value for that feature. Of course, if you can always predict a missing value with certainty, this means that the feature is redundant in the data set and not necessary in further data mining analyses. In real-world applications, you should expect an imperfect correlation between the feature with the missing value and the other features. Therefore, all automatic methods fill in values that may not be correct. Such automatic methods, however, are among the most popular in the data mining community. In comparison to the other methods, they use the most information from the present data to predict missing values.
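As one concrete instance of this idea, the sketch below fits a least-squares linear model of C as a function of A and B on the complete samples and uses it to predict a missing C. The training values are invented, and linear regression stands in for any of the techniques mentioned above:

import numpy as np

# Complete samples form the training set: features A, B and target C
# (the numbers are invented for illustration)
train = np.array([[1.0, 2.0, 2.9],
                  [2.0, 1.0, 4.1],
                  [3.0, 3.0, 6.0],
                  [4.0, 2.0, 7.2]])
X = np.c_[train[:, :2], np.ones(len(train))]   # A, B, and an intercept column
C = train[:, 2]

# Least-squares fit of the linear model C = w1*A + w2*B + w0
coef, *_ = np.linalg.lstsq(X, C, rcond=None)

# New sample with A and B known and C missing
a, b = 2.5, 2.0
predicted_C = np.array([a, b, 1.0]) @ coef
print(f"predicted C = {predicted_C:.2f}")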
In general, it is speculative and often misleading to replace missing values using a simple, artificial schema of data preparation. It is best to generate multiple solutions of data mining with and without the features that have missing values, and then to analyze and interpret them.
1.2. Outlier Analysis
Very often, in large data sets, there exist samples that do not comply with the general behavior of the data model. Such samples, which are significantly different from or inconsistent with the remaining set of data, are called outliers.
The following are examples of outliers:
a. A person's age in the database is -1. The value is obviously not correct, and the error could have been caused by a default setting of the field "unrecorded age" in the computer program.
b. The number of children for one person is 25. This value is unusual and has to be checked; it could be a typographical error.
Many data mining algorithms try to minimize the influence of outliers on the final model, or to eliminate them in the preprocessing phases. The data mining analyst has to be very careful in the automatic elimination of outliers because, if the data are correct, elimination could result in the loss of important hidden information.
Some data mining applications are focused on outlier detection, where it is the essential result of the data analysis. For example, when detecting fraudulent credit card transactions in a bank, the outliers are the typical examples that may indicate fraudulent activity, and the entire data mining process is concentrated on their detection. But in most other data mining applications, especially if they are supported by large data sets, outliers are not very useful.
Outlier detection and potential removal from a data set can be described as a process of selecting k out of n samples that are considerably dissimilar, exceptional, or inconsistent with respect to the remaining data.
The problem of defining outliers is nontrivial, especially in multidimensional samples. Data visualization methods that are useful in outlier detection for one to three dimensions are weaker in multidimensional data because of a lack of adequate visualization methodologies for these spaces. An example of a visualization of two-dimensional samples and a visual detection of outliers is given in Figure 1.1.
1.2.1. Simplest Approach
The simplest approach to outlier detection for one-dimensional samples is based on statistics. Assuming that the distribution of values is given, it is necessary to find basic statistical parameters such as mean value and variance. Based on these values and the expected (or predicted) number of outliers, it is possible to establish the threshold value as a function of variance. All samples out of the threshold value are candidates for outliers. The main problem with this simple methodology is the a priori assumption about the data distribution. In most real-world examples the data distribution may not be known.
For example, if the given data set represents the feature Age with twenty different values:
Age = {3, 56, 23, 39, 156, 41, 22, 9, 28, 139, 31, 55, 20, -67, 37, 11, 55, 45, 37}
then the corresponding statistical parameters are:
Mean = 39.9
Standard deviation = 45.65
If we select the threshold value for a normal distribution of data as:
Threshold = Mean ± 2 × Standard deviation
then all data that are out of the range [-51.4, 131.2] will be potential outliers. Additional knowledge of the characteristics of the feature (Age is always greater than zero) may further reduce the range to [0, 131.2]. In our example there are three values that are outliers based on the given criteria: 156, 139, and -67. With high probability we can conclude that all three of them are typing errors (data entered with additional digits or an additional "-" sign).
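The whole procedure fits in a few lines. The sketch below recomputes the statistics for the listed Age values, using the population standard deviation, and applies the Mean ± 2 × Standard deviation threshold together with the domain constraint Age > 0; the computed figures are approximate:

from statistics import mean, pstdev

age = [3, 56, 23, 39, 156, 41, 22, 9, 28, 139,
       31, 55, 20, -67, 37, 11, 55, 45, 37]

m, sd = mean(age), pstdev(age)        # mean and population standard deviation
low, high = m - 2 * sd, m + 2 * sd    # Threshold = Mean ± 2 × Standard deviation
low = max(low, 0)                     # domain knowledge: Age > 0

outliers = [v for v in age if v < low or v > high]
print(outliers)                       # -> [156, 139, -67]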
1.2.2. Distance-based outlier detection
Is a second method that eliminates some oI the limitations imposed by the statistical approach.
The most important diIIerence is that this method is applicable to multidimensional samples
while statistical descriptors analyze only a single dimension, or several dimensions, but
separately. The basic computational complexity oI this method is the evaluation oI distance
measures between all samples in an n-dimensional data set. Then, a sample si in a data set S is an
outlier iI at least Iraction p oI the samples in S lies at a distance greater than d. In other words,
distance based outliers are those samples which do not have enough neighbors, where neighbors
are deIined through the multidimensional two parameters, p and d, which may be given in
advanceusing knowledge about the data, or which may be changed during the iterations (trial and
error approach) to select the most representative outliers.
To illustrate the approach, we can analyze a set of two-dimensional samples S, where the requirements for outliers are the threshold values p ≥ 4 and d ≥ 3:
S = {s1, s2, s3, s4, s5, s6, s7} = {(2,4), (3,2), (1,1), (4,3), (1,6), (5,3), (4,2)}
The table of Euclidean distances, d = [(x1 - x2)^2 + (y1 - y2)^2]^(1/2), between these samples can then be computed.
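A direct implementation of this definition for the set S is sketched below. For each sample it counts how many of the other samples lie farther away than d, and flags those with at least p such distant samples; with these thresholds, s3 and s5 are reported as outliers.

import math

# The two-dimensional samples and thresholds from the text
S = [(2, 4), (3, 2), (1, 1), (4, 3), (1, 6), (5, 3), (4, 2)]
p, d = 4, 3

def euclidean(a, b):
    return math.sqrt((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2)

for i, s in enumerate(S, start=1):
    # number of other samples at a distance greater than d
    far = sum(1 for t in S if t is not s and euclidean(s, t) > d)
    if far >= p:
        print(f"s{i} = {s} is an outlier ({far} samples farther than d = {d})")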