Information Sciences 233 (2013) 25–35

I.B. Aydilek, A. Arslan

journal homepage: www.elsevier.com/locate/ins

Article history: Received 1 May 2011; Received in revised form 21 October 2012; Accepted 19 January 2013; Available online 1 February 2013.

Abstract

Missing values in datasets should be extracted from the datasets or should be estimated before they are used for classification, association rules or clustering in the preprocessing stage of data mining. In this study, we utilize a fuzzy c-means clustering hybrid approach that combines support vector regression and a genetic algorithm. In this method, the fuzzy clustering parameters, cluster size and weighting factor, are optimized and missing values are estimated. The proposed novel hybrid method yields sufficient and sensible imputation performance results. The results are compared with those of fuzzy c-means genetic algorithm imputation, support vector regression genetic algorithm imputation and zero imputation.

Keywords: Missing data; Missing values; Imputation; Support vector regression; Fuzzy c-means

© 2013 Elsevier Inc. All rights reserved. doi: http://dx.doi.org/10.1016/j.ins.2013.01.021
1. Introduction
Missing values are highly undesirable in data mining, machine learning and other information systems [33]. In recent
years, much research on missing value estimation and imputation has been performed [3,9,24,33,35,47,49]. To deal with missing values in datasets, ignoring, deleting, zero or mean estimation methods might be used instead of imputation methods [7,30]. However, the primary disadvantages of these estimation methods are the loss of effi-
ciency due to discarding incomplete observations and biases in estimates when data are missing in a systematic manner
[35]; these disadvantages reduce data quality. Quality data mining results can be obtained only with high quality data
[37,41]. Therefore, missing values should be estimated to increase data quality. Missing values typically occur because of
sensor faults, a lack of response in scientific experiments, faulty measurements, data transfer problems in digital systems
or respondents’ unwillingness to respond to survey questions [1,27,31,32,36]. In scientific research, especially in psychology,
data for some variables in the database to be analyzed may be missing. If the missing values are not treated correctly, they
may decrease or even jeopardize the validity of the research [3,5,14,22,34].
2. Literature review
This section presents a brief summary of the studies related to support vector regression imputation and fuzzy c-means
imputation.
Abdella et al. studied the use of genetic algorithms and neural networks to approximate missing data in databases [1].
Deogun et al. utilized the clustering method with soft computing, which is more tolerant of inaccuracy and uncertainty,
and they applied a fuzzy clustering algorithm to treat incomplete data [9]. Liao et al. presented a fuzzy k-means clustering
algorithm that uses a sliding window for the imputation of incomplete data to improve data quality [24]. Pelckmans et al.
proposed an alternative approach, in which no attempt is made to reconstruct the values that are missing, but the impact of
the missing data on the result and the expected cost are modeled using support vector machines. The approach is to assume
some models for the covariates of missing values and then use a maximum likelihood approach to obtain the estimates for
these models. The advantage of this approach is that classification rules can be learned from observational data even when
missing values occur amongst the input variables, whereas the disadvantage is that the proposed model aims for high clas-
sification accuracy rather than high imputation accuracy for the missing values [35]. Lim et al. proposed a hybrid neural net-
work that uses fuzzy ARTMAP and fuzzy c-means clustering for pattern classification using incomplete training and test data.
One of the disadvantages of fuzzy ARTMAP is that it is very sensitive to the arrangement of the training data. Fuzzy ARTMAP
is also acutely sensitive to the selection of the vigilance parameter because determining the optimal value for the vigilance
parameters can be quite difficult [25]. Hathaway et al. introduced an approach for clustering that is based on incomplete
dissimilarity data. An advantage of the method is that fuzzy c-means is regarded as a reliable clustering algorithm for
incomplete data [20]. Feng et al. also presented a support vector regression (SVR)-based imputation method that uses an
orthogonal coding scheme to estimate missing values for DNA microarray gene expression data. A comparative study of their
method with those previously developed, such as the K-nearest neighbor and Bayesian principal component analysis imputation methods, indicated that the SVR method is effective in imputation. A significant advantage of the SVR model is that it
requires less computational time, but our hybrid SVR clustering technique yields more sensible results for outlier values [47].
Timm et al. noted that incomplete datasets are a significant problem in data analysis. They introduced a class-specific prob-
ability for missing values to assign incomplete data points to clusters appropriately [44]. Farhangfar et al. aimed to provide a
comprehensive review of representative imputation techniques. The use of a low-quality single-imputation method yielded
imputation accuracy comparable to the accuracy achieved when one utilizes some other advanced imputation techniques
[11]. Li et al. assumed that missing attributes are represented as intervals, and they proposed a novel fuzzy c-means algo-
rithm for incomplete data based on nearest neighbor intervals. The disadvantage of this method is that there is no theoretical
basis for the selection of the cluster (c) number, so further research is needed to investigate this problem [23]. Nuovo com-
pared fuzzy c-means (FCM) imputation with case deletion imputation. The comparison was made in a psychological research
environment, using a database of mentally retarded inpatients. The results indicated that completion techniques, in partic-
ular the technique based on FCM, yield effective data imputation and help avoid the deletion of elements with missing data
that diminish the power of the research. Fuzzy c-means (FCM) imputation is more accurate than regression imputation (RI)
and expectation maximization estimation (EME). However, a major disadvantage is that the FCM implementation uses a
weighting factor (m) parameter value, which is equal to 2, and the value should be adapted to the dataset type [34].
There are three types of missing data described in the literature [26]:
1. Missing completely at random (MCAR) – The missing value has no dependency on any other variable.
2. Missing at random (MAR) – The missing value depends on other variables. The missing value can be estimated using
other variables.
3. Missing not at random (MNAR) – The missing value depends on other missing values, and thus missing data cannot be
estimated from existing variables.
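To make the first two mechanisms concrete, the following minimal Python sketch (illustrative only; the array, the affected attribute and the thresholds are assumptions, not from the paper) generates MCAR and MAR gaps in a synthetic dataset:

import numpy as np

rng = np.random.default_rng(0)
X = rng.random((100, 5))                  # a complete dataset: 100 records, 5 attributes

# MCAR: cells of attribute X3 are removed with a fixed probability,
# independently of every value in the data.
X_mcar = X.copy()
X_mcar[rng.random(100) < 0.1, 2] = np.nan

# MAR: whether X3 is missing depends only on an observed variable (here X1),
# so the gaps can be modeled from the remaining data.
X_mar = X.copy()
X_mar[X[:, 0] > 0.9, 2] = np.nan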
In this paper, we assume that the data are MAR, which implies that the missing values are deducible in some complex
manner from the remaining data [38]. In Table 1, we present a section of a dataset with missing values. In this paper, we
aim to estimate missing values using fuzzy c-means optimized with support vector regression and a genetic algorithm. These
notations will be used in the rest of the paper: Y1, Y2, Y3, Y4, Y5 and Y6 are records (rows). X1, X2, X3, X4 and X5 are attri-
butes (columns). Y2, Y5 and Y6, which do not have any missing values, are ‘complete’ rows, and Y1, Y3 and Y4, which have
missing values, are called ‘incomplete’ rows.
Table 1
A section of a dataset with missing values.
X1 X2 X3 X4 X5
Y1 0.113524 0.084785 ? 0.625473 0.06385
Y2 0.112537 0.138211 0.15942 0.625473 0.068545
Y3 0.110563 ? 0.144928 0.624212 0.083568
Y4 0.110563 0.170732 0.146998 0.623581 ?
Y5 0.108588 0.129501 0.144928 0.624212 0.076056
Y6 0.108588 0.082462 0.112836 0.626103 0.015023
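As an illustration (a Python/numpy sketch, not the authors' Matlab code), the fragment in Table 1 can be represented with np.nan standing for the '?' entries and split into the complete and incomplete row sets used throughout the paper:

import numpy as np

# the fragment of Table 1, with np.nan marking the '?' entries
Y = np.array([
    [0.113524, 0.084785, np.nan,   0.625473, 0.06385 ],   # Y1 (incomplete)
    [0.112537, 0.138211, 0.15942,  0.625473, 0.068545],   # Y2 (complete)
    [0.110563, np.nan,   0.144928, 0.624212, 0.083568],   # Y3 (incomplete)
    [0.110563, 0.170732, 0.146998, 0.623581, np.nan  ],   # Y4 (incomplete)
    [0.108588, 0.129501, 0.144928, 0.624212, 0.076056],   # Y5 (complete)
    [0.108588, 0.082462, 0.112836, 0.626103, 0.015023],   # Y6 (complete)
])

complete_mask = ~np.isnan(Y).any(axis=1)
dataset_complete = Y[complete_mask]       # rows Y2, Y5, Y6
dataset_incomplete = Y[~complete_mask]    # rows Y1, Y3, Y4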
2.2. Support vector regression (SVR)

Support Vector Machines (SVMs) are based on the statistical learning theory created by Vapnik and Lerner in 1963. In 1996, the theory was further developed by Scholkopf, Burges and Vapnik, who constructed a hyperplane or a set of hyperplanes in a high- or infinite-dimensional space over known data and successfully used SVMs for the classification of unseen data [21,42]. A version of SVM for regression, called support vector regression (SVR), was proposed by Vapnik, Golowich and Smola in 1997 [45]. The model produced by support vector classification depends on only a subset of
the training data because the cost function for building the model does not consider training points that lie beyond the mar-
gin. Analogously, the model produced by SVR depends on only a subset of the training data because the cost function for
building the model ignores any training data that are close (within a threshold ε) to the model prediction. Support vector regression (SVR) is the most commonly used form of support vector machines [4].
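The ε-insensitive behavior described above can be sketched with an off-the-shelf SVR implementation; this toy example uses scikit-learn rather than the LS-SVM toolbox employed later in the paper, and the kernel and parameter values are illustrative assumptions:

import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(1)
x = rng.uniform(-3, 3, (200, 1))                  # training inputs
y = np.sin(x).ravel() + rng.normal(0, 0.1, 200)   # noisy target values

# epsilon defines the insensitive tube: training points whose residual is
# smaller than epsilon contribute nothing to the cost function.
model = SVR(kernel='rbf', C=10.0, epsilon=0.1)
model.fit(x, y)
y_hat = model.predict(x)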
2.3. Genetic algorithm (GA)

In 1975, John Holland introduced the Genetic Algorithm (GA), which mimics natural selection: the law of survival of the fittest is applied to a population of individuals. Genetic algorithms have been widely used as an effective search technique to perform searches ranging from general to specific and from simple to complex, and this nature-inspired method is used for optimization. Genetic algorithms are implemented by generating a population and then creating new populations by performing the following procedures: reproduction, crossover, and mutation [29].
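A minimal sketch of such a loop is given below (Python; the real-coded operators and the population handling are illustrative assumptions, while the population size, generation count and mutation rate mirror the settings reported in Section 3):

import numpy as np

def genetic_minimize(fitness, bounds, pop_size=20, generations=40, rng=None):
    """Minimal GA: reproduction (selection), crossover and mutation."""
    rng = rng or np.random.default_rng(0)
    low, high = np.array(bounds, dtype=float).T
    pop = rng.uniform(low, high, (pop_size, len(bounds)))
    for _ in range(generations):
        scores = np.array([fitness(ind) for ind in pop])
        parents = pop[np.argsort(scores)][: pop_size // 2]   # reproduction: keep the fittest
        children = []
        while len(children) < pop_size - len(parents):
            a, b = parents[rng.integers(len(parents), size=2)]
            w = rng.random()
            child = w * a + (1 - w) * b                      # arithmetic crossover
            if rng.random() < 0.03:                          # mutation with 3% probability
                child = child + rng.normal(0.0, 0.1 * (high - low))
            children.append(np.clip(child, low, high))
        pop = np.vstack([parents] + children)
    scores = np.array([fitness(ind) for ind in pop])
    return pop[np.argmin(scores)]

With the fitness set to a recall error such as (9) below, this loop plays the role of the GA block in the imputation models that follow.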
2.4. Basic model: Support Vector Regression (SVR) and Genetic Algorithm (GA) imputation
1. Select the examples in which there are no missing attribute values.
2. Set one of the condition attributes (the input attribute), some of whose values are missing, as the decision attribute
(the output attribute), and conversely, set the decision attributes as the condition attributes.
3. Use SVM regression to predict the decision attribute values [12].
Using the steps (1, 2, and 3) outlined above for each attribute one by one and combining all attribute outputs yields the model output that corresponds to the model input; thus, the model is trained to recall itself [28,29,32]. Fig. 1 illustrates the manner in which the regression methods and the optimization technique are used to impute data. First, the support vector regression model must be trained with complete records before being used for data imputation. When using SVR, the inputs are recalled at the output; xu (the unknown attribute value) is approximated by the GA. The input of the model (6) is composed of xk (the known attribute values) and xu, and f is the model output that corresponds to the input (7). The model should recall the input data; thus, the difference between the input and the output is called the error (8). The GA attempts to reduce the error between the model output and the input; the error (9) must be non-negative for minimization, which results in a data value that is likely to be the missing value. However, for completeness, all of the outputs are used to reduce the error of the approximated value [28,29,32].
$$\mathrm{SVR\ input} = \begin{pmatrix} X_k \\ X_u \end{pmatrix} \quad (6)$$

$$\mathrm{SVR\ output} = f\begin{pmatrix} X_k \\ X_u \end{pmatrix} \quad (7)$$

$$\mathrm{error} = \begin{pmatrix} X_k \\ X_u \end{pmatrix} - f\begin{pmatrix} X_k \\ X_u \end{pmatrix} \quad (8)$$

$$\mathrm{GA\ fitness\ function} = \left( \begin{pmatrix} X_k \\ X_u \end{pmatrix} - f\begin{pmatrix} X_k \\ X_u \end{pmatrix} \right)^{2} \quad (9)$$
Fig. 1. A model for missing value imputation with a support vector regression genetic algorithm.
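A sketch of the fitness computation (6)–(9) in Python is given below; svr_models is assumed to be a list of per-attribute regressors trained on the complete records as in steps 1–3 above, and all names are illustrative:

import numpy as np

def ga_fitness(x_u, x_k, u_idx, k_idx, svr_models, n_attrs):
    """Fitness (9): squared recall error of the trained SVR models when the
    unknown attribute values are replaced by the GA's candidate x_u."""
    row = np.empty(n_attrs)
    row[k_idx] = x_k                          # known attribute values
    row[u_idx] = x_u                          # candidate for the missing values -> input (6)
    # each per-attribute model predicts its own attribute from the others -> output (7)
    recalled = np.array([svr_models[j].predict(np.delete(row, j)[None, :])[0]
                         for j in range(n_attrs)])
    return np.sum((row - recalled) ** 2)      # error (8), squared and summed (9)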
2.5. Basic model: fuzzy c-means (FCM) imputation

Given a set of objects, the overall objective of clustering is to divide the dataset into groups based on the similarity of the
objects and to minimize the intra-cluster dissimilarity. Fuzzy c-means (FCM) is a method of clustering that allows one datum
to belong to two or more clusters. This method, developed by Dunn in 1973 [8,10] and improved by Bezdek in 1981 [46], is
frequently used in pattern recognition. It is based on the minimization of the following objective function (1):
$$J_m = \sum_{i=1}^{N} \sum_{j=1}^{C} u_{ij}^{m} \, \lVert x_i - c_j \rVert^{2} \quad (1)$$

$$2 \leq c \leq N \quad (2)$$

$$1 \leq m \leq \infty \quad (3)$$

$$u_{ij} = \frac{1}{\sum_{k=1}^{c} \left( \dfrac{\lVert x_i - c_j \rVert}{\lVert x_i - c_k \rVert} \right)^{\frac{2}{m-1}}} \quad (4)$$

$$c_j = \frac{\sum_{i=1}^{N} u_{ij}^{m} x_i}{\sum_{i=1}^{N} u_{ij}^{m}} \quad (5)$$
In (2), c is the cluster number, which ranges between two and the record count (N). m is a parameter called the weighting
factor and ranges between one and infinity. This parameter controls the amount of fuzziness in the clustering process [9,46]
(3). There is no theoretical optimal choice of c and m [9,15]. The parameters can be changed depending on the characteristics
of the dataset and the relations of the attributes to each other. In this paper, we aim to find optimal values of the c and m
parameters for the dataset currently in use.
In fuzzy clustering, each data object xi has a membership function (4), which describes the degree to which the data ob-
ject belongs to a certain cluster cj (5). In the process of updating the membership functions and centroids, only complete
attributes are considered. In this process, the data object cannot be assigned to a concrete cluster represented by a cluster
centroid, as is done in the classical K-means clustering algorithm; this is why each data object belongs to all c clusters
with different membership degrees. The missing value of the incomplete data object xi is estimated using the information
about membership degrees and the values of the cluster centroids [9]. Experiments demonstrate that the fuzzy imputation
algorithm yields better performance than the basic clustering algorithm [9].
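The alternating updates (4) and (5) can be sketched in a few lines of Python applied to the complete data (illustrative only; the initialization, the iteration count and the small constant guarding against zero distances are assumptions, and m > 1 is assumed):

import numpy as np

def fcm(X, c, m, n_iter=100, rng=None):
    """Fuzzy c-means: alternate the membership update (4) and the centroid
    update (5) to minimize the objective function (1)."""
    rng = rng or np.random.default_rng(0)
    centroids = X[rng.choice(len(X), size=c, replace=False)]   # illustrative init
    for _ in range(n_iter):
        # d[i, j] = ||x_i - c_j||; a tiny constant guards against division by zero
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2) + 1e-12
        u = 1.0 / np.sum((d[:, :, None] / d[:, None, :]) ** (2.0 / (m - 1.0)), axis=2)  # (4)
        um = u ** m
        centroids = (um.T @ X) / um.T.sum(axis=1, keepdims=True)                        # (5)
    return u, centroids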
Fig. 2 illustrates how to estimate a sample missing value, where we suppose that '?' is one of the missing values in the dataset. The complete data are clustered into three clusters, and a weighting factor value of 2 is used (c = 3, m = 2). The membership values of the missing value '?' are estimated to be 0.5, 0.3, and 0.2, and the centroids of the clusters are estimated as 10, 15, and 20. Thus, the missing value is calculated as ? = 0.5 × 10 + 0.3 × 15 + 0.2 × 20 = 13.5.
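The same computation in code form, with the values taken from the example above:

memberships = [0.5, 0.3, 0.2]     # membership degrees of '?' in the three clusters
centroids = [10.0, 15.0, 20.0]    # cluster centroids for the missing attribute
estimate = sum(u * c for u, c in zip(memberships, centroids))
print(estimate)                   # 13.5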
2.6. The proposed method: estimating missing values using fuzzy c-means optimized with SVR and a genetic algorithm
A typical dataset containing missing values can be divided into two sections: Dataset (complete) and Dataset (incom-
plete) rows. An incomplete record is a row of the dataset in which one or more columns have missing value(s), whereas
a complete record is a row of the dataset in which no attributes have missing value(s).
Fig. 3 illustrates the proposed method. We can estimate missing values using fuzzy c-means, where c is the number of
clusters and m is the parameter of weighting factor. The optimal cluster (c) number and weighting factor (m) should be
determined to obtain the best predictive accuracy. Here, the purpose of the genetic algorithm, in cooperation with support
vector regression, is to minimize error. The minimized error function is error = (X − Y)², where X is the output of the support
vector regression (SVR) prediction and Y is the output of fuzzy c-means algorithm prediction. Before estimating the missing
values, the SVR model should be trained with the Dataset (complete) rows to estimate the output values that closely corre-
spond to the input. Thus, the genetic algorithm (GA) finds optimal (c, m) parameter values for which the fitness function has a
minimum difference, which is the error value between X and Y.
The pseudocode of the proposed method, fuzzy c-means imputation optimized with SVR-GA, is as follows:
1. Train the support vector regression algorithm with the Dataset (complete) rows, for which Input(X) ≈ Output(Y).
2. Estimate the dataset (incomplete) rows using fuzzy c-means, and compare the fuzzy c-means output with the SVR
output vector.
3. Obtain the optimized c and m parameters by using the genetic algorithm to minimize the difference between the SVR
output and the fuzzy c-means output.
4. Estimate the missing values using fuzzy c-means with optimized parameters.
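A compact sketch of this loop is given below (Python; fcm_impute is an assumed helper that clusters the data and fills each np.nan cell from memberships and centroids as in Section 2.5, svr_model is an assumed autoassociative regressor trained as in Section 2.4, ga_search is any GA minimizer such as the sketch in Section 2.3, and the upper bound on m is an illustrative assumption):

import numpy as np

def svr_fcm_ga_impute(data, svr_model, ga_search, fcm_impute):
    """Sketch of the proposed method: a GA searches the (c, m) plane; the
    fitness is the squared difference between the SVR recall X and the
    fuzzy c-means estimate Y of the incomplete rows."""
    incomplete = np.isnan(data).any(axis=1)
    def fitness(params):
        c, m = int(round(params[0])), params[1]
        filled = fcm_impute(data, c, m)       # FCM estimate of every missing cell
        Y = filled[incomplete]                # FCM output for the incomplete rows
        X = svr_model.predict(Y)              # autoassociative SVR recall of those rows
        return np.sum((X - Y) ** 2)           # error = (X - Y)^2
    # bounds: 2 <= c <= N as in (2); m > 1 as in (3), capped here for the search
    c_opt, m_opt = ga_search(fitness, [(2, len(data)), (1.1, 5.0)])
    return fcm_impute(data, int(round(c_opt)), m_opt)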
Another approach is fuzzy c-means genetic algorithm imputation (FCM-GA): in this method, some of the values in complete records are artificially deleted, and the genetic algorithm optimizes the c and m parameters so that fuzzy c-means reproduces the deleted values. FCM with the optimized parameters can then be used to impute the actual missing values.
The pseudocode for the fuzzy c-means imputation optimized with a genetic algorithm is given below:
1. Artificially delete some Dataset (complete) values using the ratio of Dataset (incomplete).
2. Estimate new Dataset (incomplete) values using fuzzy c-means.
3. Obtain optimized c and m parameters by using the genetic algorithm to minimize the difference between the artifi-
cially deleted and actual values.
4. Estimate the missing values using fuzzy c-means with the optimized parameters.
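The corresponding fitness for FCM-GA can be sketched as follows (Python; del_mask is an assumed boolean mask drawn with the same missing ratio as the Dataset (incomplete) rows, and fcm_impute is the same assumed helper as above):

import numpy as np

def fcm_ga_fitness(params, data_complete, del_mask, fcm_impute):
    """Fitness for FCM-GA: values deleted from complete records (step 1) are
    re-estimated by FCM (step 2) and compared with the true values (step 3)."""
    c, m = int(round(params[0])), params[1]
    corrupted = data_complete.copy()
    corrupted[del_mask] = np.nan              # step 1: artificial deletion
    estimated = fcm_impute(corrupted, c, m)   # step 2: FCM estimate of the deleted cells
    return np.sum((estimated[del_mask] - data_complete[del_mask]) ** 2)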
3. Experimental implementation
3.1. Data
We used six datasets that are frequently used in the literature and are available in the UCI Repository of Machine Learning Databases [6] (Table 2).
Testing on more datasets and working with several numbers and types of missing values are needed to determine how
the algorithm generalizes. All datasets are artificially regenerated such that they have 1%, 5%, 10%, 15%, 20% and 25% missing
value ratios.
All the datasets are transformed using min–max normalization (10) to [0, 1] before use, which reduces the support vector regression training time. Because each data attribute has a different domain and we want to test the algorithms under equal conditions, we first normalize the dataset so that all the data values are between 0 and 1.
An attribute is normalized by transforming its values such that they fall within a small specified range, such as 0.0–1.0. Nor-
malization is especially useful for classification algorithms; normalizing the input values for each attribute calculated in the
training samples helps accelerate the learning phase. For distance-based methods, normalization initially helps prevent attri-
butes with larger ranges from outweighing attributes with smaller ranges [19].
Table 2. The datasets used.
$$x_{i,\mathrm{norm}} = \frac{x_i - x_{i,\min}}{x_{i,\max} - x_{i,\min}} \quad (10)$$
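Applied column-wise, equation (10) is a few lines of numpy (a sketch; a constant column would need a guard against division by zero in practice):

import numpy as np

def min_max_normalize(X):
    """Column-wise min-max normalization (10): maps each attribute to [0, 1]."""
    x_min, x_max = X.min(axis=0), X.max(axis=0)
    return (X - x_min) / (x_max - x_min)   # assumes no constant column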
3.2. Implementation

The proposed method was coded in Matlab R2009b (version 7.9), and the least squares support vector machine (LS-SVM) toolbox [43] was used to perform the regression with a radial basis kernel. The genetic algorithm toolbox in Matlab was used with a population size of 20, 40 generations, a crossover fraction of 60% and a mutation fraction of 3%.
3.3. Evaluation criteria

The efficiency of the missing data estimation system is evaluated using the root mean square error (RMSE), the relative prediction accuracy (A), the Wilcoxon rank sum statistical significance test (W) and the runtime in seconds (t). The root mean square error measures the error between the real values and the estimated values and quantifies the accuracy of the prediction [1,28]. It is given by (11), in which $x_i$ is the real value, $\hat{x}_i$ is the estimated value, and n is the number of missing values.
$$\mathrm{RMSE} = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \hat{x}_i)^{2}}{n}} \quad (11)$$
The relative prediction accuracy (A) is a measure of how many estimations fall within a certain tolerance [1,28]; the tolerance is set to 10%, as done by Nelwamondo et al. The accuracy is given by (12), in which n is the total number of predictions and nT is the number of correct predictions within the tolerance. With a 10% tolerance, the relative prediction accuracy accepts as accurate those estimates within ±10% of the actual value. For example, if the actual value is 100 and the estimated value is between 90 and 110, the estimate is regarded as accurate.
$$A = \frac{n_T}{n} \times 100 \quad (12)$$
The Wilcoxon rank sum statistical significance test (W) was performed to test the validity of the estimation accuracy of the different imputation methods at a statistical significance level of 0.05 [18]. The principle underlying this test is that it does not require the data to have equal variances, which is essential because the variances of the data can be affected by erroneous estimations, particularly in the case of Zeroimpute. The null hypothesis is H0: Y = Yest (13), where Y and Yest are the actual and estimated matrices, respectively, and the test returns the P-value of this hypothesis.
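The three criteria can be sketched as follows (Python; scipy's ranksums is used for the Wilcoxon rank sum test, and the variable names are illustrative):

import numpy as np
from scipy.stats import ranksums

def rmse(x, x_hat):
    """Root mean square error (11) over the n imputed values."""
    x, x_hat = np.asarray(x), np.asarray(x_hat)
    return np.sqrt(np.mean((x - x_hat) ** 2))

def accuracy_10pct(x, x_hat):
    """Relative prediction accuracy (12): percentage of estimates within +/-10%."""
    x, x_hat = np.asarray(x), np.asarray(x_hat)
    return 100.0 * np.mean(np.abs(x_hat - x) <= 0.1 * np.abs(x))

def wilcoxon_p(y_actual, y_estimated):
    """P-value of the Wilcoxon rank sum test (13) for H0: Y = Yest."""
    stat, p = ranksums(np.ravel(y_actual), np.ravel(y_estimated))
    return p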
4. Experimental results

Figs. 4–7 present box plots of the performance evaluation for the glass, haberman, iris, musk1, wine and yeast datasets, which have 1%, 5%, 10%, 15%, 20% and 25% missing values. The results confirm the good performance of the hybrid fuzzy c-means approach for estimating missing values.
Box plots: On each box, the central mark is the median; the edges of the box are the 25th and 75th percentiles; the whis-
kers extend to the most extreme data points that are not considered outliers; and outliers are plotted individually.
Fig. 4 presents a comparison of the four methods on the six datasets that have 1–25% missing values. Each box contains 36 results of the method in terms of the root mean square error (RMSE), which indicates the error of the predictions; a lower error value indicates better performance. The median RMSE values are 0.0215, 0.0178, 0.0511 and 0.2765 for SvrFcmGa, FcmGa, SvrGa and ZeroImpute, respectively.
Fig. 4. RMS error over all datasets with 1–25% of the values missing.
Fig. 5. Accuracy for datasets in which 1–25% of the data are missing.
Fig. 6. Wilcoxon rank sum test for datasets in which 1–25% of the data are missing.
Fig. 5 shows the accuracy, as quantified by the difference between the actual and estimated data with 10% tolerance, for the four methods applied to the six datasets with 1–25% missing values. A higher accuracy value indicates better imputation. The median accuracy values are 92.03, 92.65, 72.17 and 65.95 for SvrFcmGa, FcmGa, SvrGa and ZeroImpute, respectively.
Fig. 6 shows the Wilcoxon rank sum statistical confidence measures; a higher value of P (13) implies that the method is superior to the others. The values for the four methods applied to the six datasets in which 1–25% of the data are missing are 0.8877, 0.8908, 0.8352 and 0 for SvrFcmGa, FcmGa, SvrGa and ZeroImpute, respectively.
Fig. 7 shows the run times (in seconds) obtained on a standard machine. The run times indicate that the hybrid fuzzy c-means approach is computationally demanding, because minimizing the fuzzy c-means objective function (1) inside the optimization loop is complex; therefore, the time statistics are a deficiency of the method proposed in this paper compared with the others. The median runtime values are 85.28, 128.14, 27.03 and 0.001 s for SvrFcmGa, FcmGa, SvrGa and ZeroImpute, respectively.
Fig. 7. Runtime (s) for datasets in which 1–25% of the data are missing.
Fig. 8. Runtime (s) for the Iris dataset, in which 1–25% of the data are missing.
Fig. 9. RMS error in the Iris dataset for 1–25% missing data values.
The conditions that produced the primary results can be observed in Figs. 4–7. Fuzzy c-means is a strong tool for identifying changing class structures, and it remains flexible in the presence of uncertain data (i.e., outliers and noise) [40], which improves the imputation accuracy of the proposed method.
Figs. 8–11 present plots of the results for the single dataset Iris, which has 1%, 5%, 10%, 15%, 20% and 25% missing values. Fig. 8 demonstrates that the proposed hybrid FCM method has a longer runtime: it includes a longer fuzzy c-means clustering process and requires a longer support vector regression training time than the other methods. If the training (complete) data are more suitable for clustering and regression, the runtime decreases dramatically, as can be observed in Fig. 8 for 20–25% missing values. In collaboration with the genetic algorithm, a robust optimization method, support vector regression, which can determine the relations among the input variables, yields lower runtimes for the iris data.
Fig. 9 demonstrates that the hybrid FCM method has lower error values than SvrGa and Zeroimpute for the iris data.
Fig. 10 shows that the proposed hybrid FCM method attains higher statistical confidence values than the other methods for the iris data.
Fig. 11 shows the accuracy of the estimated and actual data with 10% tolerance. Note that especially for 1–10% missing
values, the proposed method is more accurate for the iris data.
In the following paragraphs, the unique features of the proposed approaches and their primary advantages over the other methods are discussed. The basic fuzzy c-means imputation model uses constant, predetermined cluster number (c) and weighting factor (m) values for each dataset, with which missing values cannot be estimated with high sensitivity. These values must be changed depending on the type of data: data may be time series or non-time series and real or integer, and the number of rows in the dataset and the attribute column size can vary. The SvrGa imputation fails for some outlier data, and the
genetic algorithm may find a suboptimal solution unless the whole solution space is searched. This is a local minimization
problem. A problem with GAs is that the genes that are highly fit but not optimal may rapidly dominate the population,
which causes it to converge to a local minimum. Once the population has converged, the ability of the GA to search for better
solutions is significantly constrained; the crossover of almost identical chromosomes produces little that is new. Another GA
deficiency is that using only mutation to explore entirely new ground results in a slow, random search [17].
Fig. 10. Wilcoxon rank sum test for Iris dataset for 1–25% missing data values.
Fig. 11. Accuracy in the iris dataset for 1–25% missing data values.
Fig. 12. Wilcoxon rank sum test for datasets analyzed using the SvrFcmGa imputation.
On the other hand, the proposed SvrFcmGa approach determines optimal parameter values suitable for the data, although the results depend on the initial choice of weights. Support vector regression can be used to determine the relations among the input variables. The fuzzy c-means imputation method yields conservative data estimation for outliers and noisy data. Fuzzy estimation has been used successfully for noise reduction and enhancement, and it improves the accuracy of clustering under noise [2,16,48].
There are some deficiencies of the proposed method. Training the support vector regression model is a significant issue; the kernel type and the performance criteria must be specified before imputation. To further improve the imputation accuracy, feature selection and/or dimension reduction techniques [13] can be applied to the training data; for example, principal component analysis can be applied before the support vector training to reduce the training time. In Fig. 12, it can be observed that data that are more suitable for the clustering process and the support vector training stage yield superior imputation performance. To make further improvements in estimation and to ensure that the data are consistent and suitable for the clustering process, different cluster validity analyses can be performed and their results compared before the imputation task.
5. Conclusions
In this paper, a hybrid method that combines support vector regression, which is known as a reliable machine learning technique, and a genetic algorithm with fuzzy clustering was used to estimate missing values. The complete training data were clustered based on their similarity, and fuzzy principles were used during clustering; therefore, each missing value belongs to more than one cluster centroid, which yields more sensible imputation results. Six datasets with different characteristics were used in this paper, and the cluster size and weighting factor parameters were optimized for the corresponding dataset. Better imputation accuracy is achieved compared with the FcmGa, SvrGa and Zero imputation methods: in empirical tests, the proposed method proved to be more accurate than the others. The proposed fuzzy c-means SvrGa imputation was compared with other representative models, the FCM-genetic algorithm and the support vector regression-genetic algorithm. The experimental results demonstrated that the fuzzy c-means SvrGa imputation yields a more sensible estimation accuracy ratio for suitable clustering data.
References
[1] M. Abdella, T. Marwala, The use of genetic algorithms and neural networks to approximate missing data in database, Comput. Inform. 24 (2005) 577–
589.
[2] M.N. Ahmed, S.M. Yamany, N. Mohamed, A.A. Farag, T. Moriarty, A modified fuzzy c-means algorithm for bias field estimation and segmentation of MRI
data, IEEE Trans. Med. Imaging 21 (2002) 193–199.
[3] I.B. Aydilek, A. Arslan, A novel hybrid approach to estimating missing values in databases using k-nearest neighbors and neural networks, Int. J. Innov.
Comput. I (8) (2012) 4705–4717.
[4] Debasish Basak, Srimanta Pal, Dipak Chandra Patranabis, Support vector regression, Neural Inform. Process. – Lett. Rev. 11 (10) (2007).
[5] C. Bergmeir, J.M. Benitez, On the use of cross-validation for time series predictor evaluation, Inform. Sci. 191 (2012) 192–213.
[6] C.L. Blake, C.J. Merz, UCI Repository of Machine Learning Databases <http://www.ics.uci.edu/~mlearn/MLRepository.html>, Irvine, CA, U. of California,
Department of Information and Computer Science, 1998 (accessed 19.10.2012).
[7] Y. Cheng, D.Q. Miao, Q.R. Feng, Positive approximation and converse approximation in interval-valued fuzzy rough sets, Inform. Sci. 181 (2011) 2086–
2110.
[8] S. Das, S. Sil, Kernel-induced fuzzy clustering of image pixels with an improved differential evolution algorithm, Inform. Sci. 180 (2010) 1237–1256.
[9] D. Li, J. Deogun, W. Spaulding, B. Shuart, Towards missing data imputation: a study of fuzzy K-means clustering method, Rough Sets Curr. Trends
Comput. 3066 (2004) 573–579.
[10] J.C. Dunn, A fuzzy relative of the isodata process and its use in detecting compact well-separated clusters, J. Cybernet. 3 (1973) 32–57, http://
dx.doi.org/10.1080/01969727308546046.
I.B. Aydilek, A. Arslan / Information Sciences 233 (2013) 25–35 35
[11] A. Farhangfar, L.A. Kurgan, W. Pedrycz, A novel framework for imputation of missing values in databases, IEEE Trans. Syst. Man. Cybernet. A 37 (2007)
692–709.
[12] H.H. Feng, G.S. Chen, C. Yin, B.R. Yang, Y.M. Chen, A SVM regression based approach to filling in missing values, Proc. Knowled.-Based Intell. Inform.
Eng. Syst. 3683 (Pt 3) (2005) 581–587.
[13] I.K. Fodor, A survey of dimension reduction techniques, Technical Report UCRL-ID-148494, Lawrence Livermore National Laboratory, 2002.
[14] S. Genc, F.E. Boran, D. Akay, Z.S. Xu, Interval multiplicative transitivity for consistency, missing values and priority weights of interval fuzzy preference
relations, Inform. Sci. 180 (2010) 4877–4891.
[15] S. Ghosh, P.P. Mujumdar, Future rainfall scenario over Orissa with GCM projections by statistical downscaling, Curr. Sci. India 90 (2006) 396–404.
[16] M. Gil, E.G. Sarabia, J.R. Llata, J.P. Oria, Fuzzy c-means clustering for noise reduction, enhancement and reconstruction of 3D ultrasonic images, in:
Proceedings of the Emerging Technologies and Factory Automation, 1999, ETFA‘99.
[17] D.E. Goldberg, Sizing populations for serial and parallel genetic algorithms, in: Proceedings of the Third International Conference on Genetic
Algorithms, 1989, pp. 70–79.
[18] J. Hajek, Z.k. Sidak, Theory of Rank Tests, Academic Press, New York, 1967.
[19] Jiawei Han, Micheline Kamber, Data Mining Concepts and Techniques, Academic Press, 2001.
[20] R.J. Hathaway, J.C. Bezdek, Clustering incomplete relational data using the non-Euclidean relational fuzzy c-means algorithm, Pattern Recogn. Lett. 23
(2002) 151–160.
[21] C.H. Huang, H.Y. Kao, Interval regression analysis with soft-margin reduced support vector machine, Proc. Next-Gener. Appl. Intell. 5579 (2009) 826–
835.
[22] J. Van Hulse, T.M. Khoshgoftaar, Incomplete-case nearest neighbor imputation in software measurement data, in: IRI 2007: Proceedings of the 2007
IEEE International Conference on Information Reuse and Integration, 2007, pp. 630–637.
[23] D. Li, H. Gu, L.Y. Zhang, A fuzzy c-means clustering algorithm based on nearest-neighbor intervals for incomplete data, Expert Syst. Appl. 37 (2010)
6942–6947.
[24] Zaifei Liao, Xinjie Lu, Tian Yang, Hongan Wang, Missing data imputation: a fuzzy K-means clustering algorithm over sliding window, Fuzzy Syst.
Knowled. Discovery 14–16 (2009) 133–137.
[25] C.P. Lim, J.H. Leong, M.M. Kuan, A hybrid neural network system for pattern classification tasks with missing features, IEEE Trans. Pattern Anal. 27
(2005) 648–653.
[26] R.J.A. Little, D.B. Rubin, Statistical Analysis with Missing Data, Wiley, New York, 1987.
[27] M.S. Mahmoud, M.F. Emzir, State estimation with asynchronous multi-rate multi-smart sensors, Inform. Sci. 196 (2012) 15–27.
[28] T. Marwala, Computational Intelligence for Missing Data Imputation, Estimation and Management: Knowledge Optimization Techniques, Information
Science Reference, Hershey PA, 2009.
[29] T. Marwala, S. Chakraverty, Fault classification in structures with incomplete measured data using autoassociative neural networks and genetic
algorithm, Curr. Sci. India 90 (2006) 542–548.
[30] Z.Q. Meng, Z.Z. Shi, Extended rough set-based attribute reduction in inconsistent incomplete decision systems, Inform. Sci. 204 (2012) 44–69.
[31] S. Mohamed, T Marwala, Neural network based techniques for estimating missing data in databases, in: 16th Annual Symposium of the Pattern
Recognition Association of South Africa, Langebaan, 2005, pp. 27–32.
[32] F.V. Nelwamondo, S. Mohamed, T. Marwala, Missing data: a comparison of neural network and expectation maximization techniques, Curr. Sci. India
93 (2007) 1514–1521.
[33] Fulufhelo V. Nelwamondo, Dan Golding, Tshilidzi Marwala, A dynamic programming approach to missing data estimation using neural Networks,
Inform. Sci. (2009), http://dx.doi.org/10.1016/j.ins.2009.10.008.
[34] A.G. Di Nuovo, Missing data analysis with fuzzy C-Means: a study of its application in a psychological scenario, Expert Syst. Appl. 38 (2011) 6793–
6797.
[35] K. Pelckmans, J. De Brabanter, J.A.K. Suykens, B. De Moor, Handling missing values in support vector machine classifiers, Neural Networks 18 (2005)
684–692.
[36] W. Qiao, Z. Gao, R.G. Harley, Continuous on-line identification of nonlinear plants in power systems with missing sensor measurements, in:
Proceedings of the International Joint Conference on Neural Networks (IJCNNs), vols. 1–5, 2005, pp. 1729–1734.
[37] A.S. Salama, Topological solution of missing attribute values problem in incomplete information tables, Inform. Sci. 180 (2010) 631–639.
[38] J.L. Schafer, J.W. Graham, Missing data: our view of the state of the art, Psychol. Meth. 7 (2) (2002) 147–177.
[39] M.S.B. Sehgal, I. Gondal, L.S. Dooley, R. Coppel, Ameliorative missing value imputation for robust biological knowledge inference, J. Biomed. Inform. 41
(2008) 499–514.
[40] A. Shahi, R.B. Atan, M.N. Sulaiman, Detecting effectiveness of outliers and noisy data on fuzzy system using FCM, Eur. J. Sci. Res. 36 (4) (2009) 627–638.
[41] K. Sim, G.M. Liu, V. Gopalkrishnan, J.Y. Li, A case study on financial ratios via cross-graph quasi-bicliques, Inform. Sci. 181 (2011) 201–216.
[42] A.J. Smola, B. Scholkopf, A tutorial on support vector regression, Stat. Comput. 14 (2004) 199–222.
[43] J.A.K. Suykens, T. Van Gestel, J. De Brabanter, B. De Moor, J. Vandewalle, Least Squares Support Vector Machines <http://www.esat.kuleuven.be/sista/
lssvmlab/>, World Scientific, Singapore, 2002 (accessed 19.10.2012).
[44] H. Timm, C. Doring, R. Kruse, Different approaches to fuzzy clustering of incomplete datasets, Int. J. Approx. Reason. 35 (2004) 239–249.
[45] V. Vapnik, S.E. Golowich, A. Smola, Support vector method for function approximation, regression estimation, and signal processing, Adv. Neur. Inform.
9 (1997) 281–287.
[46] P.H. Wang, Pattern-recognition with fuzzy objective function algorithms – Bezdek JC, SIAM Rev. 25 (1983) 442.
[47] X. Wang, A. Li, Z.H. Jiang, H.Q. Feng, Missing value estimation for DNA microarray gene expression data by support vector regression imputation and orthogonal coding scheme, BMC Bioinform. 7 (2006).
[48] Xiaohong Wu, Bin Wu, Yong Deng, Jiewen Zhao, The fuzzy learning vector quantization with allied fuzzy c-means clustering for clustering noisy data, J. Inform. Comput. Sci. 8 (9) (2011) 1713–1719.
[49] L.Y. Yang, L.S. Xu, Topological properties of generalized approximation spaces, Inform. Sci. 181 (2011) 3570–3580.