DATA PREPROCESSING
4.1 PREAMBLE
"Information quality is not an esoteric notion; it directly affects the
effectiveness and efficiency of business processes. Information quality
also plays a major role in customer satisfaction." - Larry P. English
4.2 PREPROCESSING
Data preprocessing prepares raw data for further processing. The traditional
approach to data preprocessing is reactive: it starts with data that is assumed to be
ready for analysis, and there is no feedback that influences the way the data is
collected. Inconsistency between data sets is the main difficulty in data
preprocessing.
Data Cleaning
Data cleaning is the process of filling in missing values, smoothing noisy data,
identifying or removing outliers, and resolving inconsistencies.
Data Integration
Integration of multiple databases, data cubes, or files.
Data Transformation
Data transformation is the task of data normalization and aggregation.
Data Reduction
The process of obtaining a reduced representation of the data that is much smaller
in volume but produces the same or similar analytical results.
Data Discretization
Part of data reduction, but of particular importance, especially for
numerical data.
The proposed model and tasks for preprocessing are described in the following
sections.
Figure 4.2 Model for Proposed Preprocessing task
4.4 DATA CLEANING
Data cleaning routines work to "clean" the data by filling in missing values,
smoothing noisy data, identifying or removing outliers, and resolving
inconsistencies. The actual process of data cleansing may involve
removing typographical errors or validating and correcting values against a known
list of entities. The validation may be strict.
Data cleansing differs from data validation in that validation almost
invariably means data is rejected from the system at entry and is performed at
entry time, rather than on batches of data.
Data cleansing may also involve activities such as harmonization and
standardization of data. For example, harmonization converts short codes (St, Rd) to
actual words (Street, Road). Standardization of data is a means of changing a
reference data set to a new standard, for example, the use of standard codes.
Among these tasks, missing values cause inconsistencies for data mining, and
handling them properly is therefore essential.
In the medical domain, data may be missing because the value is not relevant
to a particular case, could not be recorded when the data was collected, is
withheld by users because of privacy concerns, is unobtainable because it was
unfeasible for the patient to undergo the clinical tests, or is lost through
equipment malfunction. Methods for resolving missing values are therefore needed
in health care systems to enhance the quality of diagnosis. The following sections
describe the proposed data cleaning methods.
Figure 4.3 Model for Data Cleaning
The method incorporated for outlier detection is the Rule-Based Outlier Detection
Method. Outlier (or anomaly) detection is an important problem for many
domains, including fraud detection, risk analysis, network intrusion and medical
diagnosis, and the discovery of significant outliers is becoming an integral aspect
of data mining. Outlier detection is a mature field of research with its origins in
statistics.
Outlier detection techniques can operate in one of the following three modes:
supervised (both normal and anomalous instances are labelled for training),
semi-supervised (only the normal class is modelled), and unsupervised (no labels
are available).
Rule based techniques generate rules that capture the normal behavior of a
system [Skalak and Rissland 1990]. Any instance that is not covered by any such
rule is considered an anomaly. Several rule based anomaly detection techniques
operate in a semi-supervised mode, where rules are learnt for the normal class(es)
and the confidence associated with the rule that "fires" for a test instance
determines whether it is normal or anomalous [Fan et al. 2001; Helmer et al. 1998;
Lee et al. 1997; Salvador and Chan 2003; Teng et al. 2002].
In this concept, separate rules are constructed for the positive and negative
class cases. The confirmation rules for the positive class must be true for many
positive cases and for no negative case. If a negative case is found to be true for
any confirmation rule developed for the positive class, it is a reliable sign that
the case is an outlier. In the same way, confirmation rules constructed for the
negative class can be used for outlier detection in positive patient records.
Preliminary inductive learning results [Gamberger et al., 2000] have demonstrated
that explicit detection of outliers can be useful for maintaining the data quality
of medical records and that it might be a key to improving medical decisions and
their reliability in regular medical practice. With the intention of on-line
detection of possible data inconsistencies, sets of confirmation rules have been
developed for the database, and their test results are reported in this work. An
additional advantage of the approach is that the user is given information about
the rule that raised the alarm, which can be useful in the error detection process.
/* Phase 1 - initialization */
Begin
  Step 1:
    counter = 0
    Repeat
      counter++
  Step 2:
      While not end of the database do
        check the current record against the confirmation rules and flag it
        if a rule of the opposite class fires
    Until (counter = k)
End
Figure 4.4 Procedure for Outlier detection
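Since Figure 4.4 preserves only the skeleton of the procedure, the following is a minimal sketch of how rule-based outlier flagging can be implemented, assuming each record is a dictionary and each confirmation rule is a predicate paired with the class it confirms. The rules, field names and threshold values shown are illustrative only, not the confirmation rules actually learned in this work.

# A minimal sketch of rule-based outlier detection, assuming each record is a
# dict and each confirmation rule is a (predicate, class_label) pair. The rule
# sets and field names below are illustrative, not the thesis's actual rules.

def find_outliers(records, confirmation_rules):
    """Flag records whose class label contradicts a confirmation rule that fires."""
    outliers = []
    for record in records:
        for predicate, rule_class in confirmation_rules:
            # A rule "fires" when its conditions hold for the record.
            if predicate(record) and record["class"] != rule_class:
                outliers.append((record, rule_class))  # keep the rule's class for reporting
                break
    return outliers

if __name__ == "__main__":
    # Hypothetical confirmation rules for a cardiology data set.
    rules = [
        (lambda r: r["systole"] > 180 and r["chest_pain"] == "typical", "positive"),
        (lambda r: r["systole"] < 120 and r["chest_pain"] == "none", "negative"),
    ]
    data = [
        {"id": 1, "systole": 190, "chest_pain": "typical", "class": "positive"},
        {"id": 2, "systole": 110, "chest_pain": "none", "class": "positive"},  # contradicts rule 2
    ]
    for record, rule_class in find_outliers(data, rules):
        print(f"Record {record['id']} flagged: a {rule_class}-class rule fired "
              f"but the record is labelled {record['class']}.")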
The outcome of the above algorithm is a data set free of rule-based outliers.
Missing data is another important issue in preprocessing; it is discussed in the
next section.
Several methods have been proposed in the literature to treat missing data. These
methods are divided into three categories, as proposed by Dempster et al. [1977].
The different patterns of missing values are discussed in the next section.
Non-Ignorable (NI)
NI exists when missing values are not randomly distributed across
observations. If the probability that a cell is missing depends on the unobserved
value of the missing response, then the process is non-ignorable.
In the next section, the theoretical framework for handling missing values is
discussed.
4.4.2.2 The theoretical framework
Missing data are classified into the following three mechanisms:
• If the probability of an observation being missing does not depend on
observed or unobserved measurements, then the observation is Missing Completely
At Random (MCAR). A typical example is a patient moving to another city for
non-health reasons. Patients who drop out of a study for this reason can be
considered a random sample of the total study population, with similar
characteristics.
• If the probability of an observation being missing depends only on
observed measurements, then the observation is Missing At Random (MAR). This
assumption implies that the behavior of the post-drop-out observations can be
predicted from the observed variables, and therefore that the response can be
estimated without bias using only the observed data. For example, when a patient
drops out due to lack of efficacy (illness due to lack of vitamin efficacy),
reflected by a series of poor efficacy outcomes that have already been observed,
an appropriate value to assign to the subsequent efficacy endpoint for this
patient can be calculated using the observed data.
• When observations are neither MCAR nor MAR, they are classified as
Missing Not At Random (MNAR), i.e. non-ignorable: the probability of an
observation being missing depends on unobserved measurements. In this scenario,
the value of the unobserved responses depends on information not available for
the analysis (i.e. not the values observed previously on the analysis variable or
the covariates being used), and thus future observations cannot be predicted
without bias by the model. For example, it may happen that after a series of
visits with a good outcome, a patient drops out due to lack of efficacy. In this
situation an analysis model based on the observed data, including relevant
covariates, is likely to continue to predict a good outcome, but it is usually
unreasonable to expect the patient to continue to derive benefit from treatment.
In practice it is impossible to be certain whether there is a relationship
between missing values and the unobserved outcome variable, or to judge whether
the missing data can be adequately predicted from the observed data. It is not
possible to know whether the MAR, never mind MCAR, assumption is appropriate in
any practical situation, and a proposition that no data in a confirmatory
clinical trial are MNAR seems implausible. Because some data must be considered
MNAR, the properties (e.g. bias) of any method based on MCAR or MAR assumptions
cannot be reliably determined for any given dataset.
Therefore, the method chosen should not depend primarily on the properties of the
method under the MAR or MCAR assumptions, but on whether it is considered to
provide an appropriately conservative estimate in the circumstances of the trial
under consideration. The methods and procedures for handling missing values are
described in the next section.
The specific methods for handling missing values are mentioned below.
a. Ignoring and discarding data
1. The first method (complete case analysis) discards every instance that contains
at least one missing value.
2. The second method determines the level of missing values in each instance and
attribute, and discards the instances with a high level of missing data.
b. Parameter estimation
The maximum likelihood procedure is used to estimate the parameters of a
model defined for the complete data. Maximum likelihood procedures that
use variants of the Expectation-Maximization algorithm can handle parameter
estimation in the presence of missing data [Mehala et al., 2009; Dempster et
al., 1977].
c. Imputation techniques
Imputation is the substitution of some value for a missing data point or a
missing component of a data point. Once all missing values have been imputed,
the dataset can then be analyzed using standard techniques for complete data. The
analysis should ideally take into account that there is a greater degree of
uncertainty than if the imputed values had actually been observed, however, and
this generally requires some modification of the standard complete data analysis
methods. In this research work, the Estimation Maximization (EM) method is
implemented.
The EM algorithm, combined with the most common feature method, is used for
handling missing values. The procedure is discussed in Figure 4.5.
Input: D /* the cardiology database */
Begin
  Identify the global constant used for the variable /* global constant used = "cardio" */
Figure 4.5 Procedure for Estimation Maximization Method for Missing Values
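Since Figure 4.5 retains only the initialization phase, the following is a hedged sketch of an EM-style imputation loop for numeric data, assuming the rows follow a multivariate Gaussian and missing entries are marked as NaN. The covariance correction term of the exact E-step is omitted for brevity, so this is a simplification rather than the exact procedure of Figure 4.5.

import numpy as np

# A hedged sketch of EM-based imputation for numeric data, assuming the rows
# follow a multivariate Gaussian. Missing cells (NaN) are replaced by their
# conditional means given the observed cells; the mean vector and covariance
# are then re-estimated. (The covariance correction term of the exact E-step
# is omitted to keep the sketch short.)

def em_impute(X, n_iter=20):
    X = X.astype(float).copy()
    missing = np.isnan(X)
    # Initialization: fill missing cells with the observed column means.
    X[missing] = np.take(np.nanmean(X, axis=0), np.where(missing)[1])
    for _ in range(n_iter):
        mu = X.mean(axis=0)
        sigma = np.cov(X, rowvar=False) + 1e-6 * np.eye(X.shape[1])
        for i in range(X.shape[0]):
            m = missing[i]
            if not m.any():
                continue
            o = ~m
            # Conditional mean of the missing block given the observed block.
            cond = mu[m] + sigma[np.ix_(m, o)] @ np.linalg.solve(
                sigma[np.ix_(o, o)], X[i, o] - mu[o])
            X[i, m] = cond
    return X

if __name__ == "__main__":
    data = np.array([[120., 80., 72.],
                     [135., np.nan, 90.],
                     [np.nan, 60., 65.],
                     [150., 95., np.nan]])
    print(np.round(em_impute(data), 2))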
Imputation is "estimating missing data of an observation, based on valid values
of other variables" (Hair et al. 1998). Imputation minimizes bias in the mining
process and preserves "expensive to collect" data that would otherwise be
discarded (Marvin et al. 2003). It is important that the estimates for the
missing values are accurate, as even a small number of biased estimates may lead
to inaccurate and misleading results in the mining process.
The advantages of this method are:
• Prediction of quantitative and qualitative attributes
• Handling of multiple missing values in a record
In mean imputation, the mean of the values of an attribute that contains
missing data is used to fill in the missing values. In the case of a categorical
attribute, the mode, which is the most frequent value, is used instead of the mean
[Liu et al., 2004]. The algorithm imputes missing values for each attribute
separately. Mean imputation can be conditional or unconditional; unconditional
imputation is not conditioned on the values of other variables in the record,
whereas the conditional mean method imputes a mean value that depends on the
values of the complete attributes for the incomplete record.
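A small sketch of unconditional mean/mode imputation with pandas is shown below; the column names are illustrative, and missing entries are assumed to be stored as NaN.

import pandas as pd

# A small sketch of unconditional mean/mode imputation, assuming a DataFrame
# in which missing entries are NaN. Column names are illustrative.

def mean_mode_impute(df):
    out = df.copy()
    for col in out.columns:
        if pd.api.types.is_numeric_dtype(out[col]):
            out[col] = out[col].fillna(out[col].mean())          # numeric: column mean
        else:
            out[col] = out[col].fillna(out[col].mode().iloc[0])  # categorical: most frequent value
    return out

if __name__ == "__main__":
    patients = pd.DataFrame({
        "systole": [120, None, 150],
        "chest_pain": ["typical", None, "typical"],
    })
    print(mean_mode_impute(patients))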
Multiple imputation addresses missing data in three steps:
• The missing data are filled in m times to generate m complete data sets
• The m complete data sets are analyzed using standard procedures
• The results from the m complete data sets are combined for the inference
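The following sketch illustrates the three-step workflow in the bullets above. The imputation model (column mean plus Gaussian noise) and the analysis step (a simple mean estimate) are placeholders chosen for illustration, not the models used in this work.

import numpy as np

# A sketch of the multiple-imputation workflow: impute the data m times,
# analyse each completed data set, then combine the m results.

rng = np.random.default_rng(0)

def impute_once(x):
    filled = x.copy()
    obs = x[~np.isnan(x)]
    n_missing = np.isnan(x).sum()
    # Draw imputations from a simple model of the observed values.
    filled[np.isnan(x)] = rng.normal(obs.mean(), obs.std(ddof=1), size=n_missing)
    return filled

def multiple_imputation(x, m=5):
    estimates = []
    for _ in range(m):
        completed = impute_once(x)          # step 1: fill in the data
        estimates.append(completed.mean())  # step 2: analyse the completed data
    # Step 3: combine the m analyses (here, pool the point estimates and
    # report the between-imputation spread).
    return np.mean(estimates), np.std(estimates, ddof=1)

if __name__ == "__main__":
    cholesterol = np.array([210., 180., np.nan, 250., np.nan, 195.])
    pooled, spread = multiple_imputation(cholesterol)
    print(f"pooled estimate = {pooled:.1f}, between-imputation s.d. = {spread:.2f}")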
4.4.3.4 LSImpute_Rows
The LSImpute_Rows method estimates missing values based on the least square
error principle and the correlation between cases (rows in the input matrix)
[Liu et al., 2004; José et al., 2006].
4.4.3.5 EMImpute_Columns
The EMImpute_Columns method estimates missing values using the same
imputation model, but based on the correlation between the features (columns in
the input matrix) [Marisol et al., 2005]. LSImpute_Rows and EMImpute_Columns
involve multiple regressions to make their predictions.
The imputation methods are compared in terms of their predictive performance by
estimating the average imputation error, a measure that represents the degree of
inconsistency between the observed and the imputed values. The approach is
experimented on the PIMA Indian Type II Diabetes data set, which originally does
not have any missing data. All 8 attributes are considered for the experiments,
as the decision attribute is derived from these 8 attributes. In each data set,
missing values were simulated by randomly labeling feature values as missing.
Data sets with different amounts of missing values (from 5% to 35% of the total
available data) were generated, and for each percentage of missing data, 20
random simulations were conducted. The data were standardised using the maximum
difference normalisation procedure, which mapped the data into the interval
[0..1]. The estimated values were compared to those in the original data set.
The average estimation error E was calculated as follows:
E = (1 / (m × n)) Σ_{i=1..m} Σ_{j=1..n} |O_ij − I_ij|
where n is the number of imputed values, m is the number of random simulations
for each percentage of missing values, O_ij is the original value to be imputed,
I_ij is the imputed value, and j indexes the feature to which O_ij and I_ij
belong. The result analysis of all these methods is discussed in the next section.
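The following sketch illustrates the evaluation loop just described: entries are randomly labelled as missing, imputed, and compared with the originals. The absolute-difference form of E and the column-mean imputer used here are assumptions made for illustration.

import numpy as np

# A sketch of the evaluation above: randomly label a percentage of entries as
# missing, impute them, and compute the average estimation error E.

rng = np.random.default_rng(42)

def simulate_missing(X, fraction):
    mask = rng.random(X.shape) < fraction
    X_missing = X.copy()
    X_missing[mask] = np.nan
    return X_missing, mask

def average_estimation_error(original, imputed_runs, masks):
    # E = (1 / (m*n)) * sum_i sum_j |O_ij - I_ij| over all imputed cells.
    total, count = 0.0, 0
    for imputed, mask in zip(imputed_runs, masks):
        total += np.abs(original[mask] - imputed[mask]).sum()
        count += mask.sum()
    return total / count

if __name__ == "__main__":
    data = rng.random((100, 8))          # stand-in for the 8 PIMA attributes, scaled to [0, 1]
    runs, masks = [], []
    for _ in range(20):                  # 20 random simulations per percentage
        X_missing, mask = simulate_missing(data, 0.05)
        col_means = np.nanmean(X_missing, axis=0)
        imputed = np.where(np.isnan(X_missing), col_means, X_missing)
        runs.append(imputed)
        masks.append(mask)
    print(f"E at 5% missing: {average_estimation_error(data, runs, masks):.4f}")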
4.4.3.7 Result analysis
The estimated error results obtained from the different methods for the
databases are tabulated in Table 4.1. Several k-NN estimators were implemented,
but only the most accurate model is shown. The 10-NN models produced an average
estimation error that is consistently lower than those obtained using the Mean
imputation, NORM and LSImpute_Rows methods. Table 4.1 and Figure 4.6 show the
average estimated errors and the corresponding standard deviations. The
predictive performance of these methods depends on the amount of missing values
and the number of complete cases contained in the dataset.
[Plot: average estimated error (y-axis) versus percentage of missing values, 5% to 35% (x-axis), for the 10-NN, Mean based Imputation, NORM, EMImpute_Columns and LSImpute_Rows methods.]
Figure 4.6 Comparison of different methods using different percentages of missing values
Attribute correction using data mining concepts is discussed in the following
section.
4.4.4.1 Framework
[Framework diagram: the imputed attribute is mapped to a corrected attribute.]
4.4.4.2 Context Dependent Attribute Correction using Association Rule
(CDACAR)
• If 's' is "test" and 't' is "tent", then LD(s,t) = 1, because one substitution
(changing "s" to "n") is sufficient to transform 's' into 't'.
where Lev(s1, s2) denotes the Levenshtein distance between strings s1 and s2. A
modified (relative) Levenshtein distance is used in the correction steps below.
Step 2: Generate all the association rules from the already generated
sets. The rules generated may have 1, 2 or 3 predecessors and only one
successor. The association rules are generated from the set of validation
rules.
Step 3: The algorithm discovers records whose attribute values are the
predecessors of a generated rule but whose attribute value differs from
the successor of that rule.
Step 4: The value of the suspicious attribute in a row is compared
with all the successors.
Step 5: If the relative Levenshtein distance is lower than the threshold
distance, the value may be corrected. If there are more values within the
accepted range of the parameter, the value most similar to the value of the
record is chosen. A sketch of this correction procedure is given below.
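A minimal sketch of the context-dependent correction test follows. The Levenshtein distance is computed with the standard dynamic-programming recurrence; normalizing it by the longer string's length is an assumed definition of the relative distance, and the rule representation (antecedent attribute values paired with an expected successor value) is illustrative rather than the rule format actually generated by the algorithm.

# A sketch of the context-dependent correction test (steps 3-5 above). The
# normalisation by the longer string's length and the rule representation
# (antecedent dict -> expected successor value) are assumptions.

def levenshtein(s, t):
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

def relative_levenshtein(s, t):
    longest = max(len(s), len(t)) or 1
    return levenshtein(s, t) / longest

def correct_value(record, rules, attribute, dist_thresh):
    """If a rule's antecedents match but the successor differs, propose the
    closest successor value within the threshold."""
    current = record[attribute]
    candidates = []
    for antecedent, expected in rules:
        if all(record.get(k) == v for k, v in antecedent.items()) and expected != current:
            d = relative_levenshtein(current, expected)
            if d <= dist_thresh:
                candidates.append((d, expected))
    return min(candidates)[1] if candidates else current

if __name__ == "__main__":
    # Hypothetical association rule: this sex/age combination implies CP = "Angial".
    rules = [({"sex": "M", "age_band": "50-60"}, "Angial")]
    row = {"sex": "M", "age_band": "50-60", "cp": "Angal"}
    print(correct_value(row, rules, "cp", dist_thresh=0.3))  # -> "Angial"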
4.4.4.3 Context Independent Attribute Correction using Clustering
Technique (CIACCT)
The algorithm uses two parameters:
i. Distthresh, the minimum distance between two values allowing them to be marked
as similar and related
ii. Occrel, used to determine whether both compared values belong to the
reference data set
Step 1: First, a cleaning pass is performed: all attribute values are converted
from lower case to upper case, all non-alphanumeric characters are removed,
and then the number of occurrences of each value in the cleaned data set is
calculated.
Step 2: Each element is assigned to a separate cluster.
Step 3: The cluster element with the highest number of occurrences is treated as
the cluster representative.
Step 4: Starting from the first cluster, compare all the clusters and calculate
the distance between them using the modified Levenshtein distance.
Step 5: If the distance is lower than the distthresh parameter and the ratio of
occurrences of the cluster representatives is greater than or equal to the
Occrel parameter, the clusters are merged.
Step 6: After all the clusters are compared, each cluster is examined to see
whether it contains values whose distance to the cluster representative is above
the threshold value. If so, they are removed from the cluster and added to the
cluster list as separate clusters.
Step 7: Repeat these steps until there are no changes in the cluster list, i.e.
no clusters are merged and no clusters are created. The cluster representative is
taken from the reference data set, and the clusters define transformation rules:
for a given cluster, the values should be replaced with the value of the cluster
representative. A sketch of this clustering procedure is given below.
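The sketch below follows the clustering steps above, with simplifications that should be noted: the string distance is (1 − SequenceMatcher ratio) from Python's difflib, used as a stand-in for the modified Levenshtein distance; the occurrence-ratio test compares the representatives of the two clusters being merged, which is one possible reading of the Occrel parameter; and the splitting step (Step 6) is omitted for brevity. The parameter values and example data are illustrative.

from collections import Counter
from difflib import SequenceMatcher

# A sketch of the context-independent clustering steps above. The string
# distance here is a stand-in for the modified Levenshtein distance.

def distance(a, b):
    return 1.0 - SequenceMatcher(None, a, b).ratio()

def cluster_values(values, dist_thresh=0.3, occ_rel=2.0):
    # Step 1: normalise case, drop non-alphanumeric characters, count occurrences.
    cleaned = ["".join(ch for ch in v.upper() if ch.isalnum()) for v in values]
    counts = Counter(cleaned)
    # Steps 2-3: each distinct value starts in its own cluster; the most frequent
    # member of a cluster acts as its representative.
    clusters = [[v] for v in counts]

    merged = True
    while merged:                       # Step 7: repeat until the cluster list is stable
        merged = False
        i = 0
        while i < len(clusters):
            rep_i = max(clusters[i], key=counts.get)
            j = i + 1
            while j < len(clusters):
                rep_j = max(clusters[j], key=counts.get)
                close = distance(rep_i, rep_j) < dist_thresh
                frequent_enough = counts[rep_i] / max(counts[rep_j], 1) >= occ_rel
                if close and frequent_enough:   # Steps 4-5: merge similar clusters
                    clusters[i].extend(clusters.pop(j))
                    merged = True
                else:
                    j += 1
            i += 1

    # Transformation rules: replace every member with its cluster representative.
    rules = {}
    for cluster in clusters:
        rep = max(cluster, key=counts.get)
        for member in cluster:
            if member != rep:
                rules[member] = rep
    return rules

if __name__ == "__main__":
    cp_values = ["Angial"] * 50 + ["Angail", "Angal", "Anchail", "Asymptomatic"]
    print(cluster_values(cp_values, dist_thresh=0.4))  # misspellings map to "ANGIAL"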
4.4.4.4 Results analysis of attribute correction
The algorithm was tested using a sample Cardiology data set drawn from the
Hungarian data. The rule-generation part of the algorithm is performed on the
whole data set, while the attribute-correction part is performed on a random
sample.
The following measures are used for checking the correctness of the algorithm. Let
Pc - percentage of correctly altered values
Pi - percentage of incorrectly altered values
P0 - percentage of values marked during the review as incorrect, but not
altered during cleaning
From Table 4.2, the relationship between the measures and the distthresh
parameter can be observed. Figure 4.8 shows that the number of values marked as
incorrect and altered grows as the distthresh parameter increases, which
indicates that the context-dependent algorithm performs better at identifying
incorrect entries. The number of incorrectly altered values also grows with the
parameter. However, a value of the distthresh parameter can be identified that
gives optimal results, i.e. the number of correctly altered values is high and
the number of incorrectly altered values is low.
Table 4.2 Dependency between the measures and the parameter for the Context-dependent algorithm
Distthresh Pc Pi P0
0.1 90 10 73.68
[Plot: Pc, Pi and Po (percentage, y-axis) against the Distthresh parameter from 0 to 1 (x-axis).]
Figure 4.8 Dependency between the measures and the parameter for Context-dependent
algorithm
The results show that the number of values marked as incorrect (Pi) and altered
grows as the DistThresh parameter increases. Some attributes that may at first
glance seem incorrect are correct in the context of other attributes within the
same record. The percentage of correctly marked entries reaches its peak for a
DistThresh parameter equal to 0.05.
The algorithm was tested using the sample Cardiology data set drawn from the
Hungarian data. There are about 44,000 records, divided into 11 batches of 4,000
records. For the attribute CP (chest pain type), 'Angial' is one of the values;
this type of chest pain occurs when an area of the heart muscle does not get
enough oxygen-rich blood. Using CIACCT, 4.22% of the whole data set (1,856
elements) was identified as incorrect and hence subject to alteration. Table 4.3
contains example transformation rules discovered during the execution.
Table 4.3 Transformation Rules
Angail → Angial
Anchail → Angial
Angal → Angial
Ancail → Angial
From Table 4.4 and Figure 4.9, the relationship between the measures and the
distthresh parameter can be observed. The results show that the number of values
marked as incorrect and altered grows as the distthresh parameter increases,
which indicates that the context-independent algorithm performs better at
identifying incorrect entries. However, a value of the distthresh parameter can
be identified that gives good results, i.e. the number of correctly altered
values (Pc) is high and the number of incorrectly altered values (Pi) is low.
Table 4.4 Dependency between the measures and the parameter for the Context-independent algorithm
Distthresh Pc Pi P0
[Plot: Pc, Pi and Po (percentage, y-axis) against the Distthresh parameter from 0 to 1 (x-axis).]
Figure 4.9 – Dependency between the measures and the parameter for Context –Independent
algorithm
The algorithm performs better for longer strings, as short strings would
require a higher value of the parameter to discover a correct reference value.
High values of the distthresh parameter result in a larger number of incorrectly
altered elements. This algorithm achieves 92% correctly altered elements, which
is an acceptable value. The range of application of this method is limited to
elements that can be standardized and for which reference data is available;
consequently, using this method for cleaning last names would not yield good
results.
4.5 DATA INTEGRATION
In this work, the integration of two cardiovascular databases from different
hospitals is considered.
The data consumed and/or produced by one component is the same as the data
produced and/or consumed by all the other components. This description of
integration highlights the three primary types of system integration, specifically:
presentation, control and data integration.
Data integration appears with increasing frequency as the volume and the need
to share existing data explodes. As information systems grow in complexity and
volume, the need for scalability and versatility of data integration increases. In
management practice, data integration is frequently called Enterprise Information
Integration.
The rapid growth of distributed data has fueled significant interest in building
data integration systems. However, developing these systems today still requires
an enormous amount of labor from system builders. Several nontrivial tasks must
be performed, such as wrapper construction and mapping between schemas. Then,
in dynamic environments such as the Web, sources often undergo changes that
break the system, requiring the builder to continually invest maintenance effort.
This has resulted in very high cost of ownership for integration systems, and
severely limited their deployment in practice.
Health care providers collect and maintain large quantities of data. The major
issue with these data repositories is the dissimilarity in structure; very rarely
does the structure of the databases remain the same. Yet data communication and
data sharing are becoming more important as organizations see the advantages
of integrating their activities and the cost benefits that accrue when data can
be reused rather than recreated from scratch.
The integration of heterogeneous data sources has a long research history
following the different evolutions of information systems. Integrating various data
sources is a major problem in knowledge management. It deals with integrating
heterogeneous data sources and it is a complex activity that involves reconciliation
at various levels - data models, data schema and data instances. Thus there arises a
strong need for a viable automation tool that organizes data into a common syntax.
Some of the current work in data integration research concerns the Semantic
Integration problem. This problem is not about how to structure the architecture of
the integration, but how to resolve semantic conflicts between heterogeneous data
sources. For example if two companies merge their databases, certain concepts
and definitions in their respective schemas like "earnings" inevitably have
different meanings. In one database it may mean profits in dollars (a floating point
number), while in the other it might be the number of sales (an integer). A
common strategy for the resolution of such problems is the use of an ontology,
which explicitly defines schema terms and thus helps to resolve semantic conflicts.
While comparing the columns of one dataset with another for similarity, there
exist two kinds of similarities namely Exact matching and Statistical matching.
Exact matching involves an exact match between the strings of the column names,
while statistical matching involves a partial match between the strings in the
column names. For example, 'Pname' in one database column matching 'Pname' in
another database column is exact matching, whereas 'Pname' in one database column
matching 'Patient name' in another database column is called statistical
matching.
This method has a limitation: if two strings represent the same thing but differ
in wording (for example, 'cost' and 'price' are two different words with the same
meaning), it is not possible to match them using the Jaro-Winkler method. To
avoid this, a knowledge repository is used in this research work to cover the
words that cannot be matched by the Jaro method. The column names are first
compared against the knowledge repository; if they do not match, they are
compared using the Jaro-Winkler method.
First, all the columns of one database are copied into the new database; then
each column of the other database to be integrated is compared with the knowledge
repository maintained in this research work. If the column name matches one of
the entries in the knowledge base, it is integrated into the corresponding column
of the new database; otherwise it is compared using the Jaro-Winkler measure. If
the columns still do not match, a new column is created in the integrated
database. Figure 4.10 describes the procedure based on the Jaro-Winkler algorithm.
Step 2: Check the attribute names in both tables and calculate the Jaro
distance metric.
Step 3: The higher the Jaro distance metric, the higher the similarity between
the two attributes; if the similarity is high enough, the two attributes
are considered similar and their values are merged.
Step 4: If two attributes are dissimilar, check for their names in the knowledge
repository.
Input: database1, database2
Output: database3 (integrated database)
Method:
Step 1: Copy all the attributes and values of database1 into database3.
Step 2: For every attribute in database2,
    set flag = 0
    do { select each attribute from database3
        { match it using the knowledge repository
          String[] st3 = {"p_id,pat_id,id,patient", "address,street,area",
                          "amount,amt,cost", "phone no, mobile no, contact no"}
          IF it matches {
              set flag = 1
              copy all the values of that particular column into the
              corresponding matching column of database3
          } ELSE {
              check the similarity between the two attributes from
              database2 and database3 using the JARO method (string comparison)
              IF it matches {
                  set flag = 1
                  copy all the values of that particular column into the
                  corresponding matching column of database3
              } } } }
    IF (flag = 0) {
        create a new column in database3 with the same column name as in
        database2
        copy all the values of that column into the corresponding column in
        database3 }
Step 3: End
Figure 4.10 Procedure for Data Integration
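The sketch below implements the attribute-matching core of Figure 4.10 in Python. The Jaro and Jaro-Winkler similarities follow their standard definitions; the synonym groups mirror the knowledge repository shown in the figure, while the 0.85 acceptance threshold and the column names in the example are assumptions.

# A sketch of the attribute-matching step from Figure 4.10. The knowledge
# repository is consulted first; Jaro-Winkler similarity is the fallback.

def jaro(s1, s2):
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if not len1 or not len2:
        return 0.0
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1, matched2 = [False] * len1, [False] * len2
    matches = 0
    for i, c in enumerate(s1):                      # count matching characters
        for j in range(max(0, i - window), min(i + window + 1, len2)):
            if not matched2[j] and s2[j] == c:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if not matches:
        return 0.0
    transpositions, k = 0, 0                        # count transpositions
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                transpositions += 1
            k += 1
    t = transpositions / 2
    return (matches / len1 + matches / len2 + (matches - t) / matches) / 3

def jaro_winkler(s1, s2, p=0.1):
    j = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == 4:
            break
        prefix += 1
    return j + prefix * p * (1 - j)

# Knowledge repository of synonymous column names (from the procedure above).
SYNONYM_GROUPS = [
    {"p_id", "pat_id", "id", "patient"},
    {"address", "street", "area"},
    {"amount", "amt", "cost"},
    {"phone no", "mobile no", "contact no"},
]

def match_column(name, target_columns, threshold=0.85):
    """Return the column of the integrated database that `name` should merge into."""
    name = name.lower()
    for target in target_columns:
        for group in SYNONYM_GROUPS:                # first try the knowledge repository
            if name in group and target.lower() in group:
                return target
    best = max(target_columns, key=lambda t: jaro_winkler(name, t.lower()), default=None)
    if best and jaro_winkler(name, best.lower()) >= threshold:
        return best                                 # then fall back to Jaro-Winkler
    return None                                     # no match: create a new column

if __name__ == "__main__":
    existing = ["pat_id", "pname", "systole"]
    for col in ["p_id", "Pname", "diastole"]:
        print(col, "->", match_column(col, existing))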
4.6 DATA DISCRETIZATION
Discretization accelerates induction and may produce simpler and more accurate
classification. As it is generally done, global discretization denies the
learning algorithm the chance of taking advantage of the ordering information
implicit in numeric attributes. This simple technique makes global discretization
an even more useful step for data preprocessing.
The supervised algorithms fall into several categories: entropy-based algorithms,
including Ent-MDLP and Mantaras distance; dependence-based algorithms, including
ChiMerge and Chi2; and binning-based algorithms, including 1R and Marginal Ent.
The unsupervised algorithms include equal width, equal frequency and some other
recently proposed algorithms, such as an algorithm using tree-based density
estimation.
Local methods produce partitions that are applied to localized regions of the
instance space. Global methods, such as binning, produce a mesh over the entire
continuous instance space, where each feature is partitioned into regions
independent of the other attributes.
3. Then divide the attribute values into k intervals of approximately equal size.
4. For each interval set, replace the values with a class name (a sketch of these
two steps follows below).
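A minimal sketch of steps 3-4 is given below, assuming equal-width intervals; the generic interval labels are placeholders for the class names used in this work.

import numpy as np

# A sketch of steps 3-4 above: divide an attribute's range into k intervals of
# roughly equal width and replace each value with the name of its interval.

def equal_width_discretize(values, k, labels=None):
    values = np.asarray(values, dtype=float)
    edges = np.linspace(values.min(), values.max(), k + 1)
    # np.digitize assigns each value to one of the k intervals (1..k).
    bins = np.clip(np.digitize(values, edges[1:-1], right=True) + 1, 1, k)
    labels = labels or [f"interval_{i}" for i in range(1, k + 1)]
    return [labels[b - 1] for b in bins]

if __name__ == "__main__":
    systole = [85, 95, 120, 135, 160, 180]
    print(equal_width_discretize(systole, k=3, labels=["low", "medium", "high"]))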
In this work, the following rules are used to transform the data in the database:
Systole: 90-130 Normal; below 90 Hypotension; above 130 Hypertension
Diastole: 60-80 Normal; below 60 Hypotension; above 80 Hypertension
Heart beat: 72 Normal for adult; 140-150 Normal for child; 100-300 low; 300-500 medium; below 500 heavy dose
Anesthesia: 1-3 Normal; 4-8 Serious
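The systole and diastole rules above can be encoded directly; the sketch below does so, treating the stated ranges as inclusive (an assumption, since the boundary handling is not specified) and omitting the remaining attributes whose ranges are less clearly delimited.

# A direct encoding of the systole and diastole rules listed above. Boundary
# values (90, 130, 60, 80) are assumed to be inclusive in the "Normal" range.

def discretize_systole(value):
    if value < 90:
        return "Hypotension"
    if value <= 130:
        return "Normal"
    return "Hypertension"

def discretize_diastole(value):
    if value < 60:
        return "Hypotension"
    if value <= 80:
        return "Normal"
    return "Hypertension"

if __name__ == "__main__":
    for s, d in [(85, 55), (120, 75), (145, 95)]:
        print(f"systole={s}: {discretize_systole(s)}, diastole={d}: {discretize_diastole(d)}")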
4.7 DATA REDUCTION
Data warehouses store vast amounts of data, and mining such a complete and
complex data set takes a long time. Data reduction reduces the data set and
provides a smaller-volume data set that yields the same or similar analytical
results as the complete data set.
Working with data collected through a team effort or at multiple sites can be
both challenging and rewarding. The sheer size and complexity of the dataset
sometimes makes the analysis daunting, but a large data set may also yield richer
and more useful information. The benefits of the data reduction techniques
increase as the data sets grow in size and complexity.
1. Dimensionality Reduction
2. Numerosity Reduction
Dimensionality Reduction
Numerosity Reduction
Numerosity Reduction fits the data into a model and can be handled by parametric
methods. The parameter on which the numerosity reduction has to take place is
obtained from the user. According to the parameter, the corresponding values are
stored and the remaining data are discarded. The algorithm for numerosity
reduction is described in Figure 4.13.
4.7.2. Implementation of Data Reduction
Dimensionality Reduction
Numerosity Reduction
1. Get the input dataset for which numerosity reduction has to be done.
2. Get the attribute names and the parametric value according to which
numerosity reduction has to be done.
3. The dataset records that satisfy the parameter value are stored and the
remaining data are discarded.
Dimensionality Reduction
Begin
Numerosity Reduction
Begin
Step 1: For each attribute in D
Step 2: Get the input parameter according to which reduction has to be performed
Step 3: For each record in the database, remove the records that do not satisfy
the parameter
End
Figure 4.13 Procedure for Numerosity Reduction
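A short pandas sketch of the procedure in Figure 4.13 is given below; the attribute name and the filtering condition are illustrative stand-ins for the user-supplied parameter.

import pandas as pd

# A sketch of the numerosity-reduction procedure of Figure 4.13: keep only the
# records that satisfy a user-supplied parameter and discard the rest.

def numerosity_reduce(df, attribute, predicate):
    """Return only the records whose `attribute` value satisfies `predicate`."""
    return df[df[attribute].apply(predicate)].reset_index(drop=True)

if __name__ == "__main__":
    records = pd.DataFrame({
        "patient_id": [1, 2, 3, 4],
        "systole": [120, 150, 95, 170],
    })
    # User-supplied parameter: keep only hypertensive records (systole > 130).
    reduced = numerosity_reduce(records, "systole", lambda v: v > 130)
    print(reduced)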
4.8 SUMMARY