
CHAPTER 4

DATA PREPROCESSING

4.1 PREAMBLE
"Information quality is not an esoteric notion; it directly affects the effectiveness and efficiency of business processes. Information quality also plays a major role in customer satisfaction." - Larry P. English

As noted by Han and Kamber (2006), today's real-world databases are highly susceptible to noise, missing values, and inconsistent data because of their typically huge size (often several gigabytes or more) and their likely origin from multiple, heterogeneous sources. Low-quality data will lead to low-quality mining results. Incomplete, noisy, and inconsistent data are commonplace properties of large real-world databases and data warehouses. Incomplete data can occur for a number of reasons. Attributes of interest may not always be available. Other data may not be included simply because they were not considered important at the time of entry.
Relevant data may not be recorded due to misunderstanding, or because of equipment malfunctions. Data that were inconsistent with other recorded data may have been deleted. Furthermore, the recording of the history or modifications to the data may have been overlooked. Missing data, particularly for tuples with missing values for some attributes, may need to be inferred (Han and Kamber, 2006).
Data preprocessing is a data mining technique that transforms raw data into an understandable format and is a proven means of resolving such issues.

4.2 PREPROCESSING

Data preprocessing prepares raw data for further processing. The traditional data preprocessing approach is reactive: it starts with data that are assumed to be ready for analysis and provides no feedback to the way the data were collected. Inconsistency between data sets is the main difficulty for data preprocessing.

Figure 4.1 Preprocessing Task

The major tasks of preprocessing are the following.

Data Cleaning
Data cleaning is the process of filling in missing values, smoothing noisy data, identifying or removing outliers, and resolving inconsistencies.

Data Integration
Integration of multiple databases, data cubes, or files.

Data Transformation
Data transformation is the task of data normalization and aggregation.

Data Reduction
The process of obtaining a reduced representation of the data that is smaller in volume yet produces the same or similar analytical results.

Data Discretization
Part of data reduction, but of particular importance, especially for numerical data.

The proposed model and tasks for preprocessing are described in the following sections.

4.3 GENERAL MODEL FOR PREPROCESSING


The proposed preprocessing task in this research work is modeled in Figure 4.2 and comprises the following:

• Treating missing values
  o Rule-based outlier detection
  o Imputation methods for treating missing values
  o Attribute correction using data mining concepts
• Data integration using a knowledge repository and Jaro-Winkler
• Data discretization using the equal width methodology
• Data reduction
  o Dimensionality reduction
  o Numerosity reduction

Figure 4.2 Model for Proposed Preprocessing task

4.4 DATA CLEANING

"Data cleaning is the number one problem in data warehousing." - DCI (Discovery Corps, Inc.) survey

Data quality is an essential characteristic that determines the reliability of data for making decisions. High-quality data is

• Complete: All relevant data, such as accounts, addresses and relationships for a given customer, are linked.
• Accurate: Common data problems like misspellings, typos, and random abbreviations have been cleaned up.
• Available: Required data are accessible on demand; users do not need to search manually for the information.
• Timely: Up-to-date information is readily available to support decisions.

In general, data quality is defined as an aggregated value over a set of quality criteria [Naumann.F, 2002; Heiko and Johann, 2006]. Starting with the quality criteria defined in [Naumann.F, 2002], the author describes the set of criteria that are affected by comprehensive data cleansing and defines how to assess scores for each of them for an existing data collection. To measure the quality of a data collection, scores have to be assessed for each of the quality criteria. The assessment of scores for quality criteria can be used to quantify the necessity of data cleansing for a data collection as well as the success of a performed data cleansing process. Quality criteria can also be used within the optimization of data cleansing by specifying priorities for each of the criteria, which in turn influence the execution of the data cleansing methods affecting the specific criteria.

Data cleaning routines work to "clean" the data by filling in missing values,
smoothing noisy data, identifying or removing outliers, and resolving
inconsistencies. The actual process of data cleansing may involve

removing typographical errors or validating and correcting values against a known
list of entities. The validation may be strict.
Data cleansing differs from data validation in that validation almost
invariably means data is rejected from the system at entry and is performed at
entry time, rather than on batches of data.
Data cleansing may also involve activities such as harmonization and standardization of data. For example, harmonization maps short codes (St, Rd) to actual words (street, road), and standardization changes a reference data set to a new standard, e.g., the use of standard codes.

The major data cleaning tasks include:

• Identifying outliers and smoothing out noisy data
• Filling in missing values
• Correcting inconsistent data
• Resolving redundancy caused by data integration

Among these tasks, missing values cause inconsistencies for data mining. Handling missing values is therefore an effective way to overcome these inconsistencies.

In the medical domain, missing data might occur because the value is not relevant to a particular case, could not be recorded when the data were collected, is ignored by users because of privacy concerns, or because it is unfeasible for the patient to undergo the clinical tests, the equipment malfunctions, and so on. Methods for resolving missing values are therefore needed in health care systems to enhance the quality of diagnosis. The following sections describe the proposed data cleaning methods.

Figure 4.3 Model For Data cleaning

4.4.1. Outlier Detection

The method incorporated for outlier detection is the rule-based outlier detection method. Outlier (or anomaly) detection is an important problem for many domains, including fraud detection, risk analysis, network intrusion and medical diagnosis, and the discovery of significant outliers is becoming an integral aspect of data mining. Outlier detection is a mature field of research with its origins in statistics.

Outlier detection techniques can operate in one of the following three modes:

(i) Supervised outlier detection


Techniques trained in supervised mode assume the availability
of a training data set which has labeled instances for normal as well
as outlier class. The typical approach in such cases is to build a
predictive model for normal vs. outlier classes. Any unseen data
instance is compared against the model to determine which class it
belongs to. There are two major issues that arise in supervised
outlier detection. First, the anomalous instances are few, as
compared to the normal instances in the training data. Second, obtaining accurate and representative labels, especially for the outlier class, is usually challenging.
(ii) Semi-Supervised outlier detection
Techniques that operate in a semi-supervised mode assume that the training data has labeled instances for only the normal class. Since
they do not require labels for the outlier class, they are more widely
applicable than supervised techniques. For example, in space craft
fault detection, an outlier scenario would signify an accident, which
is not easy to model. The typical approach used in such techniques is
to build a model for the class corresponding to normal behavior, and
use the model to identify outliers in the test data.
(iii) Unsupervised outlier detection
Techniques that operate in unsupervised mode do not require
training data, and thus are most widely applicable. The techniques in
this category make the implicit assumption that normal instances are
far more frequent than outliers in the test data. If this assumption is
not true, then such techniques suffer from a high false alarm rate.

Rule based techniques generate rules that capture the normal behavior of a
system [Skalak and Rissland 1990]. Any instance that is not covered by any such
rule is considered as an anomaly. Several rule based anomaly detection techniques
operate in a semi-supervised mode where rules are learnt for normal class(es) and
the confidence associated with the rule that "fires" for a test instance determines if it is normal or anomalous [Fan et al. 2001; Helmer et al. 1998; Lee et al. 1997;
Salvador and Chan 2003; Teng et al. 2002].

4.4.1.1. Rule based method of outlier detection

The rule-based outlier detection is more appropriate for on-line inconsistency testing. It works with data of a particular domain only and the consequence is its
simplicity and high execution speed. The approach is actually a set of logical tests
that must be satisfied by every patient record. If one or more of the tests is not
satisfied, the record is detected as an outlier. The logical tests are defined by the
set of rules that hold for the patient records in the domain [Gamberger et. al.,
2000].

In this concept, separate rules are constructed for the positive and negative
class cases. The confirmation rules for the positive class must be true for many
positive cases and for no negative case. If a negative case is detected true for any
confirmation rule developed for the positive class, it is a reliable sign that the case
is an outlier. In the same way, confirmation rules constructed for the negative class
can be used for outlier detection of positive patient records. Preliminary inductive learning results [Gamberger et. al., 2000] have demonstrated that explicit detection of outliers can be useful for maintaining the data quality of medical records and that it might be a key to improving medical decisions and their reliability in regular medical practice. With the intention of on-line detection of possible data inconsistencies, sets of confirmation rules have been developed for the database and their test results are reported in this work. An additional advantage of the approach is that the user is given information about the rule that raised the alert, which can be useful in the error detection process.

Steps Involved in Rule-Based Outlier Detection

• Get the input cardiac dataset.
• For each record in the table, a set of logical tests (rules) is applied.
• Records which do not satisfy the rules are considered to be outliers.
• Outliers are then removed from the table.
4.4.1.2. Procedure for outlier detection

Figure 4.4 describes the procedure for outlier detection.

Input: D /* the cardiology database */, K /* number of desired outliers */
Output: K identified outliers

/* Phase 1 - initialization */
Begin
Step 1:
  For each record t in D do
    Update hash table using t
    Label t as a non-outlier with flag "0"

/* Phase 2 - outlier identification using the rule-based outlier detection method */
  Counter = 0
  Repeat
    Counter++
Step 2:
    While not end of the database do
      Read next record t which is labeled "0"   // non-outlier
      Compute the characteristics by labeling t as an outlier
      If the computed characteristic is not equal to the characteristic in the rules then
        Update hash tables using t
        Label t as outlier with flag "1"
  Until (Counter = K)
End

Figure 4.4 Procedure for Outlier Detection
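The idea of a rule set acting as a filter over patient records can be illustrated with a minimal Python sketch. The rule predicates and field names below are illustrative only; they are not the actual confirmation rules used for the cardiology database.

def is_outlier(record, rules):
    """A record is flagged as an outlier if it violates at least one rule."""
    return any(not rule(record) for rule in rules)

# Example rules: each rule must hold for every valid patient record.
rules = [
    lambda r: 0 < r["age"] <= 120,
    lambda r: r["sex"] in ("M", "F"),
    lambda r: 30 <= r["heart_rate"] <= 250,
]

records = [
    {"age": 54, "sex": "M", "heart_rate": 72},
    {"age": 47, "sex": "F", "heart_rate": 300},   # violates the heart-rate rule
]

clean = [r for r in records if not is_outlier(r, rules)]
outliers = [r for r in records if is_outlier(r, rules)]
print(len(clean), "clean records,", len(outliers), "outlier removed")

In practice, each violated rule can also be reported back to the user, which is the error-detection advantage mentioned above.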

The outcome of the above-discussed algorithm is a dataset without outliers, based on the rules. Missing data is another important issue in preprocessing; it is discussed in the next section.

4.4.2 Handling Missing Values


The method used to treat missing values plays an important role in data preprocessing. Missing data is a common problem in statistical analysis. The tolerance level of missing data is classified as follows:

Missing Value (Percentage) - Significance
Up to 1% - Trivial
1-5% - Manageable
5-15% - Requires sophisticated methods to handle
More than 15% - Severe impact on interpretation

Several methods have been proposed in the literature to treat missing data. Those methods are divided into three categories, as proposed by Dempster et al. [1977]. The different patterns of missing values are discussed in the next section.

4.4.2.1 Patterns of missing values


Missing values in a database fall into three categories, viz., Missing Completely at Random (MCAR), Missing at Random (MAR) and Non-Ignorable (NI).

Missing Completely at Random (MCAR)


This is the highest level of randomness. It occurs when the probability of an instance (case) having a missing value for an attribute does not depend on either the known values or the missing value itself; the missing values are randomly distributed across all observations. This is not a realistic assumption for many real-world data.

Missing at Random (MAR)


Missingness does not depend on the true value of the missing variable, but it might depend on the values of other variables that are observed. This pattern occurs when missing values are not randomly distributed across all observations; rather, they are randomly distributed within one or more sub-samples.

Non-Ignorable (NI)
NI exists when missing values are not randomly distributed across
observations. If the probability that a cell is missing depends on the unobserved
value of the missing response, then the process is non-ignorable.

In the next section, the theoretical framework for handling missing values is discussed.

4.4.2.2 The theoretical framework
Missing data are classified according to the following three mechanisms:
• If the probability of an observation being missing does not depend on
observed or unobserved measurements, then the observation is MCAR. A
typical example is a patient moving to another city for non-health reasons.
Patients who drop-out of a study for this reason could be considered as a
random sample of the total study population and their characteristics are
similar.
• If the probability of an observation being missing depends only on
observed measurements, then the observation is MAR. This assumption
implies that the behavior of the post drop-out observations can be predicted
from the observed variables, and therefore that response can be estimated
without bias using exclusively the observed data. For example, when a patient drops out due to lack of efficacy (illness due to lack of vitamin efficiency), reflected by a series of poor efficacy outcomes that have been observed, the appropriate value to assign to the subsequent efficacy endpoint for this patient can be calculated using the observed data.
• When observations are neither MCAR nor MAR, they are classified as Missing Not At Random (MNAR), or non-ignorable, i.e. the probability of an observation being missing depends on unobserved measurements. In this scenario, the value of the unobserved responses depends on information not available for the analysis (i.e. not the values observed previously on the analysis variable or the covariates being used), and thus future observations cannot be predicted without bias by the model. For example, it may happen that after a series of visits with a good outcome, a patient drops out due to lack of efficacy. In this situation the analysis model based on the observed data, including relevant covariates, is likely to continue to predict a good outcome, but it is usually unreasonable to expect the patient to continue to derive benefit from treatment. It is impossible to be certain whether there is a relationship between missing values and the unobserved outcome variable, or to judge whether the missing data can be adequately predicted from the observed data. It is not possible to know whether the MAR, never mind MCAR, assumptions are appropriate in any practical situation. A proposition that no data in a confirmatory clinical trial are MNAR seems implausible. Because it is considered that some data are MNAR, the properties (e.g. bias) of any methods based on MCAR or MAR assumptions cannot be reliably determined for any given dataset.

Therefore, the method chosen should not depend primarily on the properties of the
method under the MAR or MCAR assumptions, but on whether it is considered to
provide an appropriately conservative estimate in the circumstances of the trial
under consideration. The methods and procedures for handling missing values are described in the next section.

4.4.2.3 Methods for handling missing values

The specific methods for handling missing values are listed below:

• Method of ignoring instances with unknown feature values
• Most common feature value
• Method of treating missing feature values as special values (filling in a global constant like "Cardio" for missing values in character data types)

a. Ignoring or Discarding Data

In this method there are two ways to discard the data with missing values:
1. The first way is complete case analysis, where the entire instance with missing values is discarded.
2. The second method determines the level of missing values in each instance and attribute. It discards the instances with a high level of missing data.

b. Parameter estimation
The maximum likelihood procedure is used to estimate the parameters of a model defined for the complete data. Maximum likelihood procedures that use variants of the Expectation-Maximization algorithm can handle parameter estimation in the presence of missing data [Mehala et. al., 2009; Dempster et al., 1977].

c. Imputation techniques
Imputation is the substitution of some value for a missing data point or a
missing component of a data point. Once all missing values have been imputed,
the dataset can then be analyzed using standard techniques for complete data. The
analysis should ideally take into account that there is a greater degree of
uncertainty than if the imputed values had actually been observed, however, and
this generally requires some modification of the standard complete data analysis
methods. In this research work the estimation maximization method is
implemented.

ESTIMATION MAXIMIZATION (EM) METHOD FOR MISSING VALUES

The algorithm used for handling missing values using the most common feature method is the EM algorithm. The procedure is discussed in Figure 4.5.

1. Estimates the most appropriate value to be filled in the missing field.
2. Maximizes the value of all the missing fields in the corresponding attribute.
Input: D /* the cardiology database */
Output: K identified output (with filled-in values for missing values)

Begin
Step 1: For each record t in D do
Step 2:   If the field is an integer then
            /* filling missing values by substituting the mean for the integer field */
            Compute the mean / average of the field values
Step 3:     Update the field with the computed value
              if col name = Age
                calculate average of Age
                update col name with avg(Age)
Step 4:   If the field is a character then
            /* filling a global constant for values missing in a text field */
            Identify the global constant used for the variable /* global constant used = "cardio" */
Step 5:     Update the field with the global constant

Figure 4.5 Procedure for Estimation Maximization Method for Missing Values
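A minimal sketch of the fill-in step from Figure 4.5 is shown below, assuming the records sit in a pandas DataFrame. The column names and the global constant are illustrative, not the exact schema of the cardiology database.

import pandas as pd

GLOBAL_CONSTANT = "cardio"   # constant used for missing character fields

def fill_missing(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()
    for col in out.columns:
        if pd.api.types.is_numeric_dtype(out[col]):
            # Numeric field: substitute the column mean (e.g. the average Age).
            out[col] = out[col].fillna(out[col].mean())
        else:
            # Character field: substitute the global constant.
            out[col] = out[col].fillna(GLOBAL_CONSTANT)
    return out

df = pd.DataFrame({"Age": [54, None, 61], "ChestPainType": ["Angial", None, "Angial"]})
print(fill_missing(df))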

4.4.3. Missing Value Imputation Methods

As an alternative to the EM model, missing data imputation is a procedure that replaces the missing values with some plausible values. Imputed values are treated as just as reliable as the truly observed data, but they are only as good as the assumptions used to create them.

Imputation is a method of filling in missing values by attributing to them values derived from other available data. Imputation is defined as "the process of estimating missing data of an observation, based on valid values of other variables" (Hair et al. 1998). Imputation minimizes bias in the mining process and preserves 'expensive to collect' data that would otherwise be discarded (Marvin et al. 2003). It is important that the estimates for the missing values are accurate, as even a small number of biased estimates may lead to inaccurate and misleading results in the mining process.

Imputation is of several types, viz., single imputation, partial imputation, multiple imputation and iterative imputation. Zhang.S.C [2010] handled missing values in heterogeneous data sets using a semi-parametric iterative imputation method.

Multiple imputation (MI) has several desirable features:

• Introducing appropriate random error in the imputation process makes it possible to get approximately unbiased estimates of all parameters. No deterministic imputation method can do this in general settings.
• Repeated imputation allows one to get good estimates of the standard errors. Single imputation methods don't allow for the additional error introduced by imputation (without specialized software of very limited generality).
• MI can be used with any kind of data and any kind of analysis without specialized software.

4.4.3.1 Imputation in K-Nearest Neighbors (K-NN)

In this method, the missing values of an instance are imputed considering a given number of instances that are most similar to the instance of interest. The similarity is calculated using a distance function.

The advantages of this method are:
• Prediction of quantitative and qualitative attributes
• Handling of multiple missing values in a record

The disadvantages of this method are:
(i) It searches through the whole dataset looking for the most similar instances, which is time consuming.
(ii) The choice of a distance function to calculate the distance.

A brief sketch using an off-the-shelf k-NN imputer is given below.
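The sketch below uses scikit-learn's KNNImputer, which fills each missing cell from the nearest neighbors under a distance that ignores missing entries. The attribute values are illustrative; the thesis experiments use k = 10 (Section 4.4.3.7), while the toy array here uses k = 2 only because it has four rows.

import numpy as np
from sklearn.impute import KNNImputer

X = np.array([
    [54.0, 130.0, 72.0],
    [61.0, np.nan, 80.0],
    [47.0, 118.0, np.nan],
    [58.0, 140.0, 76.0],
])

imputer = KNNImputer(n_neighbors=2)   # k = 10 for the real data set
X_filled = imputer.fit_transform(X)
print(X_filled)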

4.4.3.2 Mean based imputation (single imputation)

In the mean imputation, the mean of the values of an attribute that contains
missing data is used to fill in the missing values. In the case of a categorical
attribute, the mode, which is the most frequent value, is used instead of the mean
[Liu et.al, 2004]. The algorithm imputes missing values for each attribute
separately. Mean imputation can be conditional or unconditional, i.e., not conditioned on the values of other variables in the record. The conditional mean method imputes a mean value that depends on the values of the complete attributes for the incomplete record.

4.4.3.3 NORM that implements missing value estimation

Based on the expectation maximization algorithm [Schafer J.L, 1999], multiple imputation inference involves three distinct phases:

• The missing data are filled in m times to generate m complete data sets
• The m complete data sets are analyzed by using standard procedures
• The results from the m complete data sets are combined for the inference

4.4.3.4 LSImpute_Rows
The LSImpute_Rows method estimates missing values based on the least square error principle and the correlation between cases (rows in the input matrix) [Liu et.al. 2004; José et.al, 2006].

4.4.3.5 EMImpute_Columns
The EMImpute_Columns estimates missing values using the same
imputation model, but based on the correlation between the features [Marisol et.al,
2005] (columns in the input matrix). LSImpute_Rows and EMImpute_Columns
involve multiple regressions to make their predictions.

4.4.3.6 Other imputation methods

Hot deck imputation
In this method the missing value is filled with a value drawn from an estimated distribution of the missing value in the data set. In random hot deck, a missing value of an attribute is replaced by an observed value of the attribute chosen randomly.

Cold deck imputation
It is the same as hot deck imputation, except that the imputed value is obtained from a different source.

Imputation using decision trees
Decision tree classifiers handle missing values using built-in approaches.

GCFIT_MISS_IMPUTE, proposed by Ilango et. al. [2009], is used to impute the missing values in Type II diabetes databases and to evaluate its performance by estimating the average imputation error. The average imputation error is a measure that represents the degree of inconsistency between the observed and imputed values. The approach was experimented on the PIMA Indian Type II Diabetes data set, which originally does not have any missing data. All 8 attributes are considered for the experiments, as the decision attribute is derived from these 8 attributes. Datasets with different percentages of missing data (from 5% to 85%) were generated using the random labeling feature, and for each percentage of missing data, 20 random simulations were conducted.
In each dataset, missing values were simulated by randomly labeling feature values as missing. Datasets with different amounts of missing values (from 5% to 35% of the total available data) were generated. For each percentage of missing data, 20 random simulations were conducted. The data were standardised using the maximum difference normalisation procedure, which mapped the data into the interval [0..1]. The estimated values were compared to those in the original data set. The average estimation error E was calculated as follows:

E = \frac{1}{m\,n}\sum_{k=1}^{m}\sum_{i=1}^{n}\frac{\left|O_{ij}-I_{ij}\right|}{\max_{j}-\min_{j}} \qquad (4.1)

where n is the number of imputed values, m is the number of random simulations for each missing value, O_ij is the original value to be imputed, I_ij is the imputed value, and j is the corresponding feature to which O_ij and I_ij belong. A small computational sketch of this measure is given below; the result analysis of all these methods is then discussed in the next section.
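A minimal sketch of Equation (4.1) follows, assuming the numerator uses the absolute difference between the original value O and the imputed value I, normalised by the feature range. The arrays and ranges below are illustrative.

import numpy as np

def average_estimation_error(originals, imputed, feature_max, feature_min):
    """originals, imputed: arrays of shape (m, n); feature_max/min: shape (n,)."""
    originals = np.asarray(originals, dtype=float)
    imputed = np.asarray(imputed, dtype=float)
    ranges = np.asarray(feature_max, dtype=float) - np.asarray(feature_min, dtype=float)
    m, n = originals.shape
    # Sum of range-normalised absolute errors, averaged over n values and m simulations.
    return np.sum(np.abs(originals - imputed) / ranges) / n / m

O = [[120.0, 72.0], [118.0, 80.0]]      # m = 2 simulations, n = 2 imputed values
I = [[124.0, 70.0], [121.0, 77.0]]
print(average_estimation_error(O, I, feature_max=[200.0, 150.0], feature_min=[80.0, 40.0]))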

4.4.3.7 Result analysis
The estimated error results obtained from the different methods for the database are tabulated in Table 4.1. Different k-NN estimators were implemented, but only the most accurate model is shown. The 10-NN models produced an average estimation error that is consistently lower than those obtained using the Mean imputation, NORM and LSImpute_Rows methods. Table 4.1 and Figure 4.6 show the average estimated errors and corresponding standard deviations. The predictive performance of these methods depends on the amount of missing values and the number of complete cases contained in the dataset.

Table 4.1 Average estimated error ± standard deviation (columns: percentage of missing data)

Method                  5           10          15          20          25          30          35
10-NN                   10.5±9.4    11.1±9.7    11.7±10.2   12.6±10.6   13.7±11.6   14.7±12.2   15.5±12.7
Mean based Imputation   13.6±11.3   14.0±11.5   13.5±11.1   13.7±11.4   13.4±11.3   13.7±11.4   13.8±11.5
NORM                    12.4±13.5   13.3±14.8   12.7±13.9   14.0±14.4   14.6±15.3   14.7±15.3   15.3±15.2
EMImpute_Columns        8.5±22.7    9.2±22.5    9.1±22.4    9.3±22.3    9.2±22.2    7.8±23.2    7.7±23.1
LSImpute_Rows           12.3±22.7   13.6±22.7   14.4±22.6   14.3±22.6   14.6±22.7   13.1±23.7   12.9±23.6

[Chart: average estimated error (y-axis) versus percentage of missing values (x-axis, 5%-35%) for 10-NN, Mean based Imputation, NORM, EMImpute_Columns and LSImpute_Rows]
Figure 4.6 Comparison of different methods using different percentages of missing values

From the analysis, it is clearly understood that the 10-NN method produced the least variability in results. However, when more than 30% of the data were missing, the performance of k-NN started to deteriorate significantly. This deterioration occurs when the number of complete cases (nearest neighbors) used to impute a missing value is actually smaller than k. This is one of the limitations of this study, because the k-NN models only considered complete cases (nearest neighbors) for making estimations.
The k-NN was able to generate relatively accurate and less variable results for different amounts of missing data, which were assessed using 20 missing-value random simulations. However, it is important to remark that, while on the one hand this study allowed us to assess the potential of different missing data estimation methods, on the other hand it did not offer significant evidence to describe a relationship between the amount of missing data and the accuracy of the predictions. Attribute correction using data mining concepts is discussed in the following section.

4.4.4 Attribute Correction Using Association Rule and Clustering Techniques

In this section the two proposed algorithms, Context Dependent Attribute Correction using Association Rule (CDACAR) and Context Independent Attribute Correction using Clustering Technique (CIACCT), for attribute correction using data mining techniques as external reference are discussed. The algorithms described in this section examine whether the data set can serve as a source of reference data that could be used to identify incorrect entries and enable their correction.

4.4.4.1 Framework

The Framework for Attribute correction is shown in Figure 4.7.

[Diagram: an imputed attribute is corrected either context-dependently using association rules or context-independently using clustering, yielding the corrected attribute]

Figure 4.7 Framework for Attribute correction

4.4.4.2 Context Dependent Attribute Correction using Association Rule
(CDACAR)

Context dependent attribute correction refers to correcting attribute values by considering the reference data values together with the other attribute values of the record.
In this algorithm the association rule methodology is used to discover validation rules for data sets. The frequent item sets are generated using the Apriori algorithm [Webb.J, 2003].
The following two parameters are used in CDACAR:
Minsup is defined analogously to the parameter of the same name in the Apriori algorithm.
Distthresh is the distance threshold between the value of the "suspicious" attribute and the value proposed by the successor of the violated rule, below which a correction is made.
Levenshtein distance (LD) is a measure of the similarity between two strings, which we will refer to as the source string (s) and the target string (t). The distance is the number of deletions, insertions, or substitutions required to transform s into t. For example,

• If s is "test" and t is "test", then LD(s,t) = 0, because no transformations are needed; the strings are already identical.
• If s is "test" and t is "tent", then LD(s,t) = 1, because one substitution (change "s" to "n") is sufficient to transform s into t.

The following is the modified Levenshtein distance:

mLev(s_1, s_2) = \frac{1}{2}\left(\frac{Lev(s_1,s_2)}{|s_1|} + \frac{Lev(s_1,s_2)}{|s_2|}\right) \qquad (4.2)

where Lev(s1,s2) denotes the Levenshtein distance between strings s1 and s2.

The modified distance for strings may be interpreted as the average fraction of one string that has to be modified to transform it into the other. For instance, the LD between "Articulation" and "Articaulation" is 1 (one insertion), and the modified Levenshtein distance for these strings is about 0.08. The modification was introduced to make the measure independent of the string length during the comparison.
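A minimal sketch of the standard Levenshtein distance and of the relative measure of Equation (4.2) is given below; it assumes both strings are nonempty.

def levenshtein(s: str, t: str) -> int:
    """Classic edit distance: minimum insertions, deletions and substitutions."""
    prev = list(range(len(t) + 1))
    for i, sc in enumerate(s, start=1):
        curr = [i]
        for j, tc in enumerate(t, start=1):
            cost = 0 if sc == tc else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

def modified_levenshtein(s1: str, s2: str) -> float:
    """Relative distance of Equation (4.2): average fraction of a string to modify."""
    d = levenshtein(s1, s2)
    return 0.5 * (d / len(s1) + d / len(s2))

print(levenshtein("test", "tent"))            # 1
print(modified_levenshtein("test", "tent"))   # 0.25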

The algorithm is outlined below:

Step 1: Generate all the frequent item sets.
Step 2: Generate all the association rules from the already generated sets. The rules generated may have 1, 2 or 3 predecessors and only one successor. The validation rules are drawn from the generated association rules.
Step 3: The algorithm discovers records whose attribute values are the predecessors of a generated rule but whose successor attribute holds a value different from the successor of that rule.
Step 4: The value of the suspicious attribute in such a row is compared with all the successors.
Step 5: If the relative Levenshtein distance is lower than the threshold distance, the value may be corrected. If there are more values within the accepted range of the parameter, the value most similar to the value of the record is chosen.

A simplified sketch of the correction step is given below.
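This sketch covers only the correction loop (Steps 3-5) and reuses modified_levenshtein from the previous listing. Rules are represented as (antecedent, attribute, expected value) triples; the rule set, attribute names and threshold are illustrative, not the rules mined from the cardiology data.

DISTTHRESH = 0.3

def correct_record(record, rules):
    for antecedent, attr, expected in rules:
        # The record matches the rule's predecessors ...
        if all(record.get(k) == v for k, v in antecedent.items()):
            suspicious = record.get(attr)
            # ... but its successor attribute holds a different value.
            if suspicious != expected:
                if modified_levenshtein(str(suspicious), str(expected)) < DISTTHRESH:
                    record[attr] = expected
    return record

rules = [({"sex": "M", "exang": "yes"}, "cp", "Angial")]
print(correct_record({"sex": "M", "exang": "yes", "cp": "Angal"}, rules))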

The results are analyzed in Section 4.4.4.4.

4.4.4.3 Context Independent Attribute Correction using Clustering
Technique (CIACCT)

Context-independent attribute correction implies that all record attributes are examined and cleaned in isolation, without regard to the values of other attributes of a given record. The main idea behind this algorithm is based on the observation that in most data sets there are a certain number of values with a large number of occurrences and a very large number of values with a very low number of occurrences. Therefore, the most representative values may serve as the source of reference data, while the values with a low number of occurrences are noise or misspelled instances of the reference data.

The same modified Levenshtein distance discussed for the previous algorithm is used in this method.

In this method the following two parameters are considered:
i. Distthresh, the distance threshold below which two values are marked as similar and related
ii. Occrel, used to determine whether both compared values belong to the reference data set

The CIACCT algorithm is described below:

Step 1: First, a cleaning pass is performed: all attribute values are converted from lower case to upper case, all non-alphanumeric characters are removed, and then the number of occurrences of each value in the cleaned data set is calculated.

Step 2: Each element is assigned to a separate cluster. The cluster element with the highest number of occurrences is treated as the cluster representative.

Step 3: The cluster list is sorted in descending order according to the number of occurrences of each cluster representative.

Step 4: Starting from the first cluster, all clusters are compared and the distance between clusters is calculated using the modified Levenshtein distance.

Step 5: If the distance is lower than the Distthresh parameter and the ratio of occurrences of the cluster representatives is greater than or equal to the Occrel parameter, the clusters are merged.

Step 6: After all the clusters are compared, each cluster is examined for values whose distance to the cluster representative is above the threshold value; if such values exist, they are removed from the cluster and added to the cluster list as separate clusters.

Step 7: The same steps are repeated until there are no changes in the cluster list, i.e. no clusters are merged and no clusters are created. The cluster representatives form the reference data set, and each cluster defines a transformation rule: values in a given cluster should be replaced with the value of the cluster representative.

As far as the reference dictionary is concerned, it may happen that it contains values whose number of occurrences is very small. These values may be marked as noise and trimmed in order to preserve the compactness of the dictionary.
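A simplified, self-contained sketch of this clustering idea follows: frequent values act as cluster representatives (reference data) and rare, similar values are mapped onto them. It uses difflib's similarity ratio as a stand-in for the modified Levenshtein distance, and the thresholds are illustrative.

import difflib
from collections import Counter

DISTTHRESH = 0.4     # maximum relative distance for merging a value into a cluster
OCCREL = 5.0         # the representative must occur at least 5x as often

def build_transformation_rules(values):
    counts = Counter(v.strip().upper() for v in values)
    ordered = [v for v, _ in counts.most_common()]   # representatives, most frequent first
    rules = {}
    for rare in reversed(ordered):                   # start from the least frequent values
        for rep in ordered:
            if rep == rare or rep in rules:
                continue
            distance = 1.0 - difflib.SequenceMatcher(None, rare, rep).ratio()
            if distance <= DISTTHRESH and counts[rep] / counts[rare] >= OCCREL:
                rules[rare] = rep                    # replace 'rare' with the representative
                break
    return rules

values = ["Angial"] * 50 + ["Angail", "Angal", "Anchail"]
print(build_transformation_rules(values))

The resulting dictionary plays the role of the transformation rules of Table 4.3 in the next section.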

4.4.4.4 Results analysis of attribute correction

Context Dependent Attribute Correction using Association Rule (CDACAR)

The algorithm was tested using the sample cardiology dataset drawn from the Hungarian data. The rule-generation part of the algorithm was performed on the whole data set, while the attribute correction part was performed on a random sample.

The following measures are used for checking the correctness of the algorithm. Let
Pc - percentage of correctly altered values
Pi - percentage of incorrectly altered values
P0 - percentage of values marked during the review as incorrect, but not altered during cleaning

The measures are defined as

Pc = nc / na * 100    (4.3)
Pi = ni / na * 100    (4.4)
P0 = n00 / n0 * 100   (4.5)

where nc is the number of correctly altered values, ni the number of incorrectly altered values, na the total number of altered values, n0 the number of values identified as incorrect, and n00 the number of elements initially marked as incorrect that were not altered during the cleaning process.

From Table 4.2 the relationship between the measures and the Distthresh parameter can be observed. Figure 4.8 shows that the number of values marked as incorrect and altered grows with the increase of the Distthresh parameter. This also shows that the context-dependent algorithm performs well at identifying incorrect entries. The number of incorrectly altered values also grows with the increase of the parameter. However, a value of the Distthresh parameter can be identified that gives optimal results, i.e. the number of correctly altered values is high and the number of incorrectly altered values is low.

Table 4.2 Dependency between the measures and the parameter for the context-dependent algorithm

Distthresh    Pc       Pi       P0
0             0.0      0.0      100.0
0.1           90       10       73.68
0.2           68.24    31.76    46.62
0.3           31.7     68.3     36.09
0.4           17.26    82.74    33.83
0.5           11.84    88.16    31.33
0.6           10.2     89.8     31.08
0.7           9.38     90.62    30.33
0.8           8.6      91.4     28.82
0.9           8.18     91.82    27.32
1.0           7.77     92.23    17.79

[Chart: Pc, Pi and Po (percentage, y-axis) versus Distthresh (x-axis, 0 to 1)]

Figure 4.8 Dependency between the measures and the parameter for Context-dependent
algorithm

The results show that the number of values marked as incorrect (Pi) and altered grows with the increase of the DistThresh parameter. Some attributes that may at first glance seem to be incorrect are correct in the context of other attributes within the same record. The percentage of correctly marked entries reaches its peak for a DistThresh value of 0.05.

Context Independent Attribute Correction using Clustering Technique (CIACCT)

The algorithm was tested using the sample cardiology dataset drawn from the Hungarian data. There are about 44000 records, divided into 11 batches of 4 thousand records. Among the values of the attribute CP (chest pain type), Angial is one of the types; it occurs when an area of the heart muscle does not get enough oxygen-rich blood. By using CIACCT, 4.22% of the whole data set, i.e. 1856 elements, were identified as incorrect and hence subject to alteration. Table 4.3 contains example transformation rules discovered during the execution.

Table 4.3 Transformation Rules

Original value    Correct value
Angail            Angial
Anchail           Angial
Angal             Angial
Ancail            Angial

The same measures are used:

Pc - percentage of correctly altered values
Pi - percentage of incorrectly altered values
P0 - percentage of values marked during the review as incorrect, but not altered during cleaning

From Table 4.4 and Figure 4.9 the relationship between the measures and the Distthresh parameter can be observed. The results show that the number of values marked as incorrect and altered grows with the increase of the Distthresh parameter. This also shows that the context-independent algorithm performs well at identifying incorrect entries. However, a value of the Distthresh parameter can be identified that gives good results, i.e. the number of correctly altered values (Pc) is high and the number of incorrectly altered values (Pi) is low.

Table 4.4 Dependency between the measures and the parameter for the context-independent algorithm

Distthresh    Pc       Pi       P0
0             0.0      0.0      100.0
0.1           92.63    7.37     92.45
0.2           79.52    20.48    36.96
0.3           67.56    32.44    29.25
0.4           47.23    52.77    26.93
0.5           29.34    70.66    23.41
0.6           17.36    82.64    19.04
0.7           7.96     92.04    8.92
0.8           4.17     95.83    1.11
0.9           1.17     98.83    0.94
1.0           0.78     99.22    0

[Chart: Pc, Pi and Po (percentage, y-axis) versus Distthresh (x-axis, 0 to 1)]

Figure 4.9 – Dependency between the measures and the parameter for Context –Independent
algorithm

The algorithm performs better for longer strings, as short strings would require a higher value of the parameter to discover a correct reference value. High values of the Distthresh parameter result in a larger number of incorrectly altered elements. This algorithm achieves an efficiency of 92% correctly altered elements, which is an acceptable value. The range of application of this method is limited to elements that can be standardized and for which reference data is available. Consequently, using this method for cleaning last names would not yield good results.

4.5 DATA INTEGRATION

Data integration is the process of combining data residing at different sources and providing the user with a unified view of these data. This process emerges in a variety of situations, both commercial (when two similar companies need to merge their databases) and scientific (combining research results from different bioinformatics repositories). In this work, combining two cardiovascular databases from different hospitals is taken into consideration.

The data consumed and/or produced by one component is the same as the data
produced and/or consumed by all the other components. This description of
integration highlights the three primary types of system integration, specifically:
presentation, control and data integration.

4.5.1. Need for Data Integration

Data integration appears with increasing frequency as the volume and the need
to share existing data explodes. As information systems grow in complexity and
volume, the need for scalability and versatility of data integration increases. In
management practice, data integration is frequently called Enterprise Information
Integration.

The rapid growth of distributed data has fueled significant interest in building
data integration systems. However, developing these systems today still requires
an enormous amount of labor from system builders. Several nontrivial tasks must
be performed, such as wrapper construction and mapping between schemas. Then,
in dynamic environments such as the Web, sources often undergo changes that
break the system, requiring the builder to continually invest maintenance effort.
This has resulted in very high cost of ownership for integration systems, and
severely limited their deployment in practice.

Health care providers collect and maintain large quantities of data. The major issue in these data representations is the dissimilarity in structure; very rarely does the structure of the database remain the same. Yet data communication and data sharing are becoming more important as organizations see the advantages of integrating their activities and the cost benefits that accrue when data can be reused rather than recreated from scratch.

The integration of heterogeneous data sources has a long research history
following the different evolutions of information systems. Integrating various data
sources is a major problem in knowledge management. It deals with integrating
heterogeneous data sources and it is a complex activity that involves reconciliation
at various levels - data models, data schema and data instances. Thus there arises a strong need for a viable automation tool that organizes data into a common syntax.

Some of the current work in data integration research concerns the Semantic
Integration problem. This problem is not about how to structure the architecture of
the integration, but how to resolve semantic conflicts between heterogeneous data
sources. For example if two companies merge their databases, certain concepts
and definitions in their respective schemas like "earnings" inevitably have
different meanings. In one database it may mean profits in dollars (a floating point
number), while in the other it might be the number of sales (an integer). A
common strategy for the resolution of such problems is the use of ontologies, which explicitly define schema terms and thus help to resolve semantic conflicts.

4.5.2. Implementation of Data Integration

Data integration is done using the Jaro-Winkler algorithm.

The Jaro-Winkler distance is a measure of similarity between two strings. It is a variant of the Jaro distance metric and is mainly used in the area of record linkage. The higher the Jaro-Winkler distance of two strings, the more similar the strings are. The Jaro distance metric considers two strings s1 and s2 similar to the extent that the characters of s1 match the characters of s2; the number of matching characters is denoted as m. Two characters from s1 and s2, respectively, are considered matching only if they are identical and close enough in position.

While comparing the columns of one dataset with another for similarity, two kinds of similarity exist, namely exact matching and statistical matching. Exact matching involves the exact matching of strings in the column names, and statistical matching involves partial matching of the strings present in the column names. For example, Pname in one database column matching Pname in another database column is exact matching, while "Pname" in one database column matching "Patient name" in another database column is called statistical matching.

This method has a limitation: two strings may represent the same thing but use different words. For example, cost and price are two different words with the same meaning, and it is not possible to match them using the Jaro-Winkler method. To overcome this, a knowledge repository is used in this research work to hold such words that cannot be matched by the Jaro method. The given strings are first compared against the repository; if they do not match, they are compared using the Jaro-Winkler method.
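For reference, a plain-Python sketch of the Jaro and Jaro-Winkler similarities as they are usually defined is given below; it is a self-contained illustration, not the exact implementation used in this work.

def jaro(s1: str, s2: str) -> float:
    """Plain Jaro similarity between two strings (1.0 = identical)."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    window = max(max(len1, len2) // 2 - 1, 0)
    s1_matches = [False] * len1
    s2_matches = [False] * len2
    matches = 0
    # Count characters that are identical and within the allowed window.
    for i, ch in enumerate(s1):
        start, end = max(0, i - window), min(i + window + 1, len2)
        for j in range(start, end):
            if not s2_matches[j] and s2[j] == ch:
                s1_matches[i] = s2_matches[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count transpositions among the matched characters.
    transpositions, k = 0, 0
    for i in range(len1):
        if s1_matches[i]:
            while not s2_matches[k]:
                k += 1
            if s1[i] != s2[k]:
                transpositions += 1
            k += 1
    transpositions //= 2
    return (matches / len1 + matches / len2 + (matches - transpositions) / matches) / 3.0

def jaro_winkler(s1: str, s2: str, prefix_scale: float = 0.1) -> float:
    """Jaro-Winkler: boosts the Jaro score for a common prefix (up to 4 characters)."""
    j = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a == b and prefix < 4:
            prefix += 1
        else:
            break
    return j + prefix * prefix_scale * (1.0 - j)

print(jaro_winkler("Pname", "Pname"))          # 1.0
print(jaro_winkler("Pname", "Patient name"))   # roughly 0.66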

After data integration is performed, the database may contain an incomplete data set; hence it is advisable to perform data cleaning after every data integration. This increases data reliability.

A sample of the knowledge repository maintained in this work is given below:

String                           Similar names
Patient identification number   p_id, pat_id, id, patient, pat_no, p_no, file_no, f_no
Address                          Address, street, area
Blood pressure                   BP, Pressure, stress
Medicine                         Medicine, drug, medication

First, all the columns of one database are copied into the new database; then each column of the other database to be integrated is compared with the knowledge repository maintained in this research work. If the column name matches one of the words in the knowledge base, it is integrated into the corresponding column of the new database; otherwise it is compared using the Jaro-Winkler measure. If the columns still do not match, a new column is created in the integrated database. Figure 4.10 describes the procedure for this Jaro-Winkler based integration.

Algorithm for Data Integration

Step 1: Get the two databases which need to be integrated as input.
Step 2: Check the attribute names in both tables and calculate the Jaro distance metric.
Step 3: The higher the Jaro distance metric, the higher the similarity between the two attributes; if the metric is high enough, the two attributes are considered similar and their values are merged.
Step 4: If two attributes are dissimilar, check for their names in the knowledge repository.
Step 5: If found, the values of the two attributes are merged; else the attribute is considered a new attribute and added to the database.

Input: database1, database2
Output: database3 (integrated database)
Method:
Step 1: Copy all the attributes and values of database1 into database3.
Step 2: For every attribute in database2 {
          set flag = 0
          for each attribute in database3 {
            // first try to match using the knowledge repository
            String[] st3 = {"p_id,pat_id,id,patient", "address,street,area",
                            "amount,amt,cost", "phone no,mobile no,contact no"}
            IF it matches {
              set flag = 1
              copy all the values of that column into the corresponding
              matching column of database3
            } ELSE {
              check the similarity between the two attributes from database2
              and database3 using the JARO method (string comparison)
              IF it matches {
                set flag = 1
                copy all the values of that column into the corresponding
                matching column of database3
              }
            }
          }
          IF (flag = 0) {
            create a new column in database3 with the same column name as in database2
            copy all the values of that column into the corresponding column in database3
          }
        }
Step 3: End

Figure 4.10 Procedure for Data Integration
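A compact sketch of the column-matching flow of Figure 4.10 follows. The knowledge repository is consulted first and a string-similarity fallback is applied afterwards; difflib is used here as a self-contained stand-in for the Jaro-Winkler measure, and the repository entries, threshold and column names are illustrative.

import difflib

KNOWLEDGE_REPOSITORY = {
    "patient_id": {"p_id", "pat_id", "id", "patient", "pat_no", "p_no", "file_no", "f_no"},
    "address": {"address", "street", "area"},
    "blood_pressure": {"bp", "pressure", "stress"},
    "medicine": {"medicine", "drug", "medication"},
}
SIM_THRESHOLD = 0.7

def match_column(name, target_columns, similarity=None):
    """Return the column of the integrated table that 'name' should merge into, or None."""
    similarity = similarity or (lambda a, b: difflib.SequenceMatcher(None, a, b).ratio())
    lowered = name.lower()
    # 1. Knowledge-repository lookup.
    for canonical, synonyms in KNOWLEDGE_REPOSITORY.items():
        if lowered in synonyms:
            for col in target_columns:
                if col.lower() in synonyms or col.lower() == canonical:
                    return col
    # 2. String-similarity fallback.
    best = max(target_columns, key=lambda col: similarity(lowered, col.lower()))
    if similarity(lowered, best.lower()) >= SIM_THRESHOLD:
        return best
    return None   # no match: the caller adds it as a new column

db3_columns = ["p_id", "address", "medicine"]
for col in ["pat_id", "street", "medicines", "smoker"]:
    print(col, "->", match_column(col, db3_columns))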

4.6 DATA DISCRETIZATION

Discretization is a process that transforms quantitative data into qualitative data. Quantitative data are commonly involved in data mining applications. Discretization can significantly improve the quality of the discovered knowledge and also reduces the running time of various data mining tasks such as association rule discovery, classification, clustering and prediction.

Discretization transforms data containing a quantitative attribute so that the attribute in question is replaced by a qualitative attribute. A many-to-one mapping function is created so that each value of the original quantitative attribute is mapped onto a value of the new qualitative attribute. First, discretization divides the value range of the quantitative attribute into a finite number of intervals. The mapping function then associates all of the quantitative values in a single interval to a single qualitative value.

Discrete data is information that can be categorized into a classification. Discrete data are based on counts: only a finite number of values are possible, and the values cannot be subdivided meaningfully. Attribute data (discrete data) cannot be broken down into smaller units that add additional meaning; it typically consists of things counted in whole numbers.

4.6.1. Need for Discretization

Reducing the number of values of an attribute is especially beneficial if decision-tree-based methods of classification are to be applied to the preprocessed data. The reason is that these methods are typically recursive, and a large amount of time is spent on sorting the data at each step.
Before applying learning algorithms to data sets, practitioners often globally discretize any numeric attributes. If the algorithm cannot handle numeric attributes directly, prior discretization is essential. Even if it can, prior discretization often accelerates induction, and may produce simpler and more accurate classification. As it is generally done, global discretization denies the learning algorithm the chance of taking advantage of the ordering information implicit in numeric attributes.

However, a simple transformation of discretized data preserves this information in a form that learners can use. This work shows that, compared to using the discretized data directly, this transformation significantly increases the accuracy of decision trees built by C4.5, decision lists built by PART, and decision tables built using the wrapper method, on several benchmark datasets. Moreover, it can significantly reduce the size of the resulting classifiers.

This simple technique makes global discretization an even more useful tool for data preprocessing.

Many algorithms developed in the machine learning community focus on


learning in nominal feature spaces. However, many real-world databases often
involve continuous features. Those features have to be discretized before using
such algorithms. Discretization methods can transform continuous features into a finite number of intervals, where each interval is associated with a numerical discrete value. Discretized intervals can then be treated as ordinal values during induction and deduction.

4.6.2. Methods in Discretization

The discretization methods can be classified according to three axes:


supervised versus unsupervised, global versus local, and static versus dynamic. A
supervised method would use the classification information during the
discretization process, while the unsupervised method would not depend on class
information. The popular supervised discretization algorithms fall into several categories: entropy-based algorithms, including Ent-MDLP and Mantaras distance; dependence-based algorithms, including ChiMerge and Chi2; and binning-based algorithms, including 1R and Marginal Ent. The unsupervised algorithms include equal width, equal frequency and some other recently proposed algorithms, such as an algorithm using tree-based density estimation.

Local methods produce partitions that are applied to localized regions of the instance space. Global methods, such as binning, produce a mesh over the entire continuous instance space, where each feature is partitioned into regions independent of the other attributes.

Many discretization methods require a parameter, n, indicating the maximum


number of partition intervals in discretizing a feature. Static methods, such as Ent-
MDLP, perform the discretization on each feature and determine the value of n for
each feature independent of the other features. However, the dynamic methods
search through the space of possible n values for all features simultaneously,
thereby capturing interdependencies in feature discretization. There are a wide
variety of discretization methods starting with the naive methods such as equal-
width and equal-frequency.

The simplest and most efficient discretization method is an unsupervised direct method named equal width discretization, which is a binning methodology. It calculates the maximum and the minimum of the feature being discretized and partitions the observed range into k approximately equal-sized intervals.

4.6.3. Equal width Discretization Methodology

The equal width discretization methodology is described below; a small sketch follows the list.

1. Get the input dataset which has to be discretized.
2. For each attribute, calculate its minimum possible value and maximum possible value.
3. Divide the attribute's value range into k intervals of approximately equal size.
4. For each interval, replace the values with a class name.
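The sketch below shows unsupervised equal-width binning: the observed range of a numeric attribute is split into k equally sized intervals and each value is replaced by its interval label. The attribute, the value of k and the labels are illustrative; the clinical rules used in this work are listed after Figure 4.11.

import numpy as np

def equal_width_discretize(values, k=3, labels=None):
    values = np.asarray(values, dtype=float)
    edges = np.linspace(values.min(), values.max(), k + 1)
    # Assign each value to one of the k bins using the interior edges.
    bins = np.clip(np.digitize(values, edges[1:-1]), 0, k - 1)
    labels = labels or [f"bin_{i}" for i in range(k)]
    return [labels[b] for b in bins]

systole = [85, 92, 110, 128, 135, 150]
print(equal_width_discretize(systole, k=3, labels=["low", "medium", "high"]))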

The algorithm for the equal width discretization methodology is described in Figure 4.11.

Rules for Discretization

In this work, the following rules are used to transform the data in the database.

Systole
  90-130       Normal
  below 90     Hypotension
  above 130    Hypertension

Diastole
  60-80        Normal
  below 60     Hypotension
  above 80     Hypertension

Heart beat
  72           Normal for adult
  140-150      Normal for child

BMI (Body Mass Index)
  below 18.5   Underweight
  18.5-25      Normal range
  25-30        Overweight
  above 30     Obesity

Dose
  100-300      Low
  300-500      Medium
  above 500    Heavy dose

Anesthesia
  1-3          Normal
  4-8          Serious

Input: database to be discretized
Output: database (discretized database)
Begin
Step 1: Get each column from the input database.
Step 2: Check the column name against the column names present in the rules for discretization.
        set flag = 0
        IF it matches { set flag = 1
          do { check the condition in the rules and transform the numerical
               attribute in the column into its corresponding categorical attribute
          } until all the values in the column are discretized
        }
        IF (flag = 0)
        { leave that column and go on to the next column (start from Step 1) }
End

Figure 4.11 Procedure for Data Discretization

4.7. DATA REDUCTION

Data warehouses store vast amounts of data, and mining takes a long time to run on such complete and complex data sets. Data reduction provides a smaller-volume data set that yields the same or similar analytical results as the complete data set.

Working with data collected through a team effort or at multiple sites can be
both challenging and rewarding. The sheer size and complexity of the dataset
sometimes makes the analysis daunting, but a large data set may also yield richer
and more useful information. The benefits of data reduction techniques increase as the data sets grow in size and complexity.

4.7.1. Methods for Data Reduction

Reduction can be handled by two methods, which are discussed as follows:

1. Dimensionality Reduction
2. Numerosity Reduction

Dimensionality Reduction

Dimensionality reduction is defined as the removal of unimportant attributes. The method used for dimensionality reduction is feature selection, a process that chooses an optimal subset of features according to an objective function. It selects the minimum set of attributes (features) that is sufficient for the data mining task. The algorithm for dimensionality reduction is described in Figure 4.12.

Numerosity Reduction

Numerosity reduction fits the data into a model and can be handled by parametric methods. The parameter on which numerosity reduction is to take place is obtained from the user; the records matching the parameter are stored and the remaining data are discarded. The algorithm for numerosity reduction is described in Figure 4.13.

4.7.2. Implementation of Data Reduction

Dimensionality Reduction

1. Get the input dataset which has to be reduced.
2. According to the needs of the data mining algorithms, get the attribute names that are necessary for the domain.
3. Remove the other attributes, which are not needed, from the dataset.

Numerosity Reduction

1. Get the input dataset for which numerosity reduction has to be done.
2. Get the attribute names and the parametric value according to which numerosity reduction has to be done.
3. The records that satisfy the parameter value are stored and the remaining data are discarded.
Dimensionality Reduction

Input: D /* the cardiology database */
       K /* number of attributes to be reduced */
Output: The cardiology database with reduced dimensionality

Begin
Step 1: For each attribute in D
Step 2: Get the number of attributes and the attribute names which have to be removed from the database
Step 3: Delete the attribute from the database
Step 4: Repeat until all the attributes which need to be reduced are deleted
End
Figure 4.12 Procedure for Dimensionality Reduction

Numerosity Reduction

Input: D /* the cardiology database with discretized attributes */
       K /* parameter according to which numerosity reduction has to be performed */
Output: The cardiology database with reduced numerosity

Begin
Step 1: For each attribute in D
Step 2: Get the input parameter according to which reduction has to be performed
Step 3: For each record in the database, remove the records which do not satisfy the parameter
End

Figure 4.13 Procedure for Numerosity Reduction
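Both reduction steps can be sketched on a pandas DataFrame as shown below: dimensionality reduction drops attributes that are not needed for the mining task, and numerosity reduction keeps only the records that satisfy the user-supplied parameter. The column names and the filter condition are illustrative.

import pandas as pd

df = pd.DataFrame({
    "age": [54, 61, 47],
    "systole": ["Normal", "Hypertension", "Normal"],
    "file_no": ["F1", "F2", "F3"],   # administrative attribute, not mined
})

# Dimensionality reduction: remove unimportant attributes.
reduced = df.drop(columns=["file_no"])

# Numerosity reduction: keep only records matching the chosen parameter value.
reduced = reduced[reduced["systole"] == "Normal"]
print(reduced)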

4.8 SUMMARY

In this part of the research work, a new preprocessing technique is implemented and the need for the proposed model is discussed in detail. Randomly simulated missing values were estimated by five data imputation methods, of which K-NN produced the most promising results. Attribute correction algorithms for the context-dependent and context-independent cases are proposed and implemented, a knowledge repository along with Jaro-Winkler is implemented for data integration, the equal width discretization methodology is used for data discretization, and dimensionality reduction and numerosity reduction are used to reduce the data for better knowledge discovery.

