A Comparison of Mining Incomplete and Inconsistent Data

Patrick G. Clark; Cheng Gao; Jerzy Grzymala-Busse

doi:10.5755/j01.itc.46.2.17330

Authors

Patrick G. Clark
Cheng Gao
Jerzy Grzymala-Busse, University of Kansas

DOI:

https://doi.org/10.5755/j01.itc.46.2.17330

Keywords:

Incomplete data, lost values, "do not care" conditions, inconsistent data, rough set theory, probabilistic approximations, MLEM2 rule induction algorithm.

Abstract

We present experimental results on a comparison of incompleteness and inconsistency. We used two interpretations of missing attribute values: lost values and "do not care" conditions. Our experiments were conducted on 204 data sets, including 71 data sets with lost values, 71 data sets with "do not care" conditions and 62 inconsistent data sets, created from eight original numerical data sets. We used the Modified Learning from Examples Module version 2 (MLEM2) rule induction algorithm for data mining, combined with three types of probabilistic approximations: lower, middle and upper. We used an error rate, computed by ten-fold cross validation, as the criterion of quality. There is experimental evidence that incompleteness is worse than inconsistency for data mining (two-tailed test, 5% level of significance). Additionally, lost values are better than "do not care" conditions, again with regard to the error rate, and there is a small difference in the error rate among the three types of probabilistic approximations.
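The three probabilistic approximations named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation and does not include MLEM2. It assumes the usual rough-set definition, in which the approximation of a concept X with threshold alpha is the union of all blocks of indiscernible cases whose conditional probability of X is at least alpha. Lower corresponds to alpha = 1, middle to alpha = 0.5, and upper to any sufficiently small positive alpha. The sketch handles only inconsistent data with no missing values; the toy data set, the function name `probabilistic_approximation`, and the variable names are hypothetical.

```python
from collections import defaultdict
from fractions import Fraction


def probabilistic_approximation(cases, concept_label, alpha):
    """Return indices of cases in the alpha-probabilistic approximation.

    cases: list of (attribute_tuple, decision) pairs, no missing values.
    A block is the set of cases sharing the same attribute tuple.
    A block is included when Pr(concept | block) >= alpha.
    """
    blocks = defaultdict(set)
    for i, (attrs, _) in enumerate(cases):
        blocks[attrs].add(i)
    concept = {i for i, (_, d) in enumerate(cases) if d == concept_label}

    result = set()
    for block in blocks.values():
        pr = Fraction(len(block & concept), len(block))
        if pr >= alpha:
            result |= block
    return result


# Hypothetical inconsistent data set: identical attribute values,
# conflicting decisions (for example cases 0, 1 and 2).
cases = [
    (("high", "yes"), "flu"),  # 0
    (("high", "yes"), "flu"),  # 1
    (("high", "yes"), "no"),   # 2
    (("low", "no"), "no"),     # 3
    (("high", "no"), "flu"),   # 4
    (("high", "no"), "no"),    # 5
    (("high", "no"), "no"),    # 6
    (("low", "yes"), "flu"),   # 7
]

# Any positive probability is at least 1/len(cases), so this alpha
# admits every block that intersects the concept and no other block.
smallest_alpha = Fraction(1, len(cases) + 1)

lower = probabilistic_approximation(cases, "flu", Fraction(1))
middle = probabilistic_approximation(cases, "flu", Fraction(1, 2))
upper = probabilistic_approximation(cases, "flu", smallest_alpha)
```

In this toy example the three approximations are nested: the lower one keeps only the block that is entirely "flu", the middle one adds the block where "flu" is the majority, and the upper one adds every block containing at least one "flu" case. Exact fractions are used so the threshold comparison is not affected by floating-point rounding.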
