Information Sciences 233 (2013) 25–35

I.B. Aydilek, A. Arslan

journal homepage: www.elsevier.com/locate/ins

Article history: Received 1 May 2011; Received in revised form 21 October 2012; Accepted 19 January 2013; Available online 1 February 2013.

Abstract

Missing values in datasets should be extracted from the datasets or should be estimated before they are used for classification, association rules or clustering in the preprocessing stage of data mining. In this study, we utilize a fuzzy c-means clustering hybrid approach that combines support vector regression and a genetic algorithm. In this method, the fuzzy clustering parameters, cluster size and weighting factor, are optimized and missing values are estimated. The proposed novel hybrid method yields sufficient and sensible imputation performance results. The results are compared with those of fuzzy c-means genetic algorithm imputation, support vector regression genetic algorithm imputation and zero imputation.

Keywords: Missing data; Missing values; Imputation; Support vector regression; Fuzzy c-means

© 2013 Elsevier Inc. All rights reserved. doi: http://dx.doi.org/10.1016/j.ins.2013.01.021
1. Introduction
Missing values are highly undesirable in data mining, machine learning and other information systems [33]. In recent
years, much research on missing value estimation and imputation has been performed [3,9,24,33,35,47,49]. To deal with missing values in datasets, ignoring, deleting, zero or mean estimation methods might be used instead of imputation methods [7,30]. However, the primary disadvantages of these estimation methods are the loss of effi-
ciency due to discarding incomplete observations and biases in estimates when data are missing in a systematic manner
[35]; these disadvantages reduce data quality. Quality data mining results can be obtained only with high quality data
[37,41]. Therefore, missing values should be estimated to increase data quality. Missing values typically occur because of
sensor faults, a lack of response in scientific experiments, faulty measurements, data transfer problems in digital systems
or respondents’ unwillingness to respond to survey questions [1,27,31,32,36]. In scientific research, especially in psychology,
data for some variables in the database to be analyzed may be missing. If the missing values are not treated correctly, they
may decrease or even jeopardize the validity of the research [3,5,14,22,34].
2. Literature review
This section presents a brief summary of the studies related to support vector regression imputation and fuzzy c-means
imputation.
Abdella et al. studied the use of genetic algorithms and neural networks to approximate missing data in databases [1].
Deogun et al. utilized the clustering method with soft computing, which is more tolerant of inaccuracy and uncertainty,
and they applied a fuzzy clustering algorithm to treat incomplete data [9]. Liao et al. presented a fuzzy k-means clustering
algorithm that uses a sliding window for the imputation of incomplete data to improve data quality [24]. Pelckmans et al.
proposed an alternative approach, in which no attempt is made to reconstruct the values that are missing, but the impact of
the missing data on the result and the expected cost are modeled using support vector machines. The approach is to assume
some models for the covariates of missing values and then use a maximum likelihood approach to obtain the estimates for
these models. The advantage of this approach is that classification rules can be learned from observational data even when
missing values occur amongst the input variables, whereas the disadvantage is that the proposed model aims for high clas-
sification accuracy rather than high imputation accuracy for the missing values [35]. Lim et al. proposed a hybrid neural net-
work that uses fuzzy ARTMAP and fuzzy c-means clustering for pattern classification using incomplete training and test data.
One of the disadvantages of fuzzy ARTMAP is that it is very sensitive to the arrangement of the training data. Fuzzy ARTMAP
is also acutely sensitive to the selection of the vigilance parameter because determining the optimal value for the vigilance
parameters can be quite difficult [25]. Hathaway et al. introduced an approach for clustering that is based on incomplete
dissimilarity data. An advantage of the method is that fuzzy c-means is regarded as a reliable clustering algorithm for
incomplete data [20]. Feng et al. also presented a support vector regression (SVR)-based imputation method that uses an
orthogonal coding scheme to estimate missing values for DNA microarray gene expression data. A comparative study of their
method with those previously developed, such as the K-nearest neighbor and Bayesian principal component analysis imputation methods, indicated that the SVR method is effective in imputation. A significant advantage of the SVR model is that it
requires less computational time, but our hybrid SVR clustering technique yields more sensible results for outlier values [47].
Timm et al. noted that incomplete datasets are a significant problem in data analysis. They introduced a class-specific prob-
ability for missing values to assign incomplete data points to clusters appropriately [44]. Farhangfar et al. aimed to provide a
comprehensive review of representative imputation techniques. The use of a low-quality single-imputation method yielded
imputation accuracy comparable to the accuracy achieved when one utilizes some other advanced imputation techniques
[11]. Li et al. assumed that missing attributes are represented as intervals, and they proposed a novel fuzzy c-means algo-
rithm for incomplete data based on nearest neighbor intervals. The disadvantage of this method is that there is no theoretical
basis for the selection of the cluster (c) number, so further research is needed to investigate this problem [23]. Nuovo com-
pared fuzzy c-means (FCM) imputation with case deletion imputation. The comparison was made in a psychological research
environment, using a database of mentally retarded inpatients. The results indicated that completion techniques, in partic-
ular the technique based on FCM, yield effective data imputation and help avoid the deletion of elements with missing data
that diminish the power of the research. Fuzzy c-means (FCM) imputation is more accurate than regression imputation (RI)
and expectation maximization estimation (EME). However, a major disadvantage is that the FCM implementation uses a
weighting factor (m) parameter value, which is equal to 2, and the value should be adapted to the dataset type [34].
There are three types of missing data described in the literature [26]:
1. Missing completely at random (MCAR) – The missing value has no dependency on any other variable.
2. Missing at random (MAR) – The missing value depends on other variables. The missing value can be estimated using
other variables.
3. Missing not at random (MNAR) – The missing value depends on other missing values, and thus missing data cannot be
estimated from existing variables.
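To make the first two mechanisms concrete, the following minimal Python sketch (illustrative only; the array, the affected attribute and the thresholds are assumptions, not from the paper) generates MCAR and MAR gaps in a synthetic dataset:

import numpy as np

rng = np.random.default_rng(0)
X = rng.random((100, 5))                  # a complete dataset: 100 records, 5 attributes

# MCAR: cells of attribute X3 are removed with a fixed probability,
# independently of every value in the data.
X_mcar = X.copy()
X_mcar[rng.random(100) < 0.1, 2] = np.nan

# MAR: whether X3 is missing depends only on an observed variable (here X1),
# so the gaps can be modeled from the remaining data.
X_mar = X.copy()
X_mar[X[:, 0] > 0.9, 2] = np.nan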
In this paper, we assume that the data are MAR, which implies that the missing values are deducible in some complex
manner from the remaining data [38]. In Table 1, we present a section of a dataset with missing values. In this paper, we
aim to estimate missing values using fuzzy c-means optimized with support vector regression and a genetic algorithm. These
notations will be used in the rest of the paper: Y1, Y2, Y3, Y4, Y5 and Y6 are records (rows). X1, X2, X3, X4 and X5 are attri-
butes (columns). Y2, Y5 and Y6, which do not have any missing values, are ‘complete’ rows, and Y1, Y3 and Y4, which have
missing values, are called ‘incomplete’ rows.
Table 1
A section of a dataset with missing values.
X1 X2 X3 X4 X5
Y1 0.113524 0.084785 ? 0.625473 0.06385
Y2 0.112537 0.138211 0.15942 0.625473 0.068545
Y3 0.110563 ? 0.144928 0.624212 0.083568
Y4 0.110563 0.170732 0.146998 0.623581 ?
Y5 0.108588 0.129501 0.144928 0.624212 0.076056
Y6 0.108588 0.082462 0.112836 0.626103 0.015023
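As an illustration (a Python/numpy sketch, not the authors' Matlab code), the fragment in Table 1 can be represented with np.nan standing for the '?' entries and split into the complete and incomplete row sets used throughout the paper:

import numpy as np

# the fragment of Table 1, with np.nan marking the '?' entries
Y = np.array([
    [0.113524, 0.084785, np.nan,   0.625473, 0.06385 ],   # Y1 (incomplete)
    [0.112537, 0.138211, 0.15942,  0.625473, 0.068545],   # Y2 (complete)
    [0.110563, np.nan,   0.144928, 0.624212, 0.083568],   # Y3 (incomplete)
    [0.110563, 0.170732, 0.146998, 0.623581, np.nan  ],   # Y4 (incomplete)
    [0.108588, 0.129501, 0.144928, 0.624212, 0.076056],   # Y5 (complete)
    [0.108588, 0.082462, 0.112836, 0.626103, 0.015023],   # Y6 (complete)
])

complete_mask = ~np.isnan(Y).any(axis=1)
dataset_complete = Y[complete_mask]       # rows Y2, Y5, Y6
dataset_incomplete = Y[~complete_mask]    # rows Y1, Y3, Y4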
2.2. Support vector regression (SVR)

Support Vector Machines (SVMs) are based on the statistical learning theory created by Vapnik and Lerner in 1963. In 1996, the theory was further developed by Scholkopf, Burges and Vapnik, who constructed a hyperplane or a set of hyperplanes in a high- or infinite-dimensional space over known data and successfully used SVMs for the classification of unseen data [21,42]. A version of SVM for regression, called support vector regression (SVR), was proposed by Vapnik, Golowich and Smola in 1997 [45]. The model produced by support vector classification depends on only a subset of
the training data because the cost function for building the model does not consider training points that lie beyond the mar-
gin. Analogously, the model produced by SVR depends on only a subset of the training data because the cost function for
building the model ignores any training data that are close (within a threshold ε) to the model prediction. Support vector regression (SVR) is the most commonly used form of support vector machines [4].
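The ε-insensitive behavior described above can be sketched with an off-the-shelf SVR implementation; this toy example uses scikit-learn rather than the LS-SVM toolbox employed later in the paper, and the kernel and parameter values are illustrative assumptions:

import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(1)
x = rng.uniform(-3, 3, (200, 1))                  # training inputs
y = np.sin(x).ravel() + rng.normal(0, 0.1, 200)   # noisy target values

# epsilon defines the insensitive tube: training points whose residual is
# smaller than epsilon contribute nothing to the cost function.
model = SVR(kernel='rbf', C=10.0, epsilon=0.1)
model.fit(x, y)
y_hat = model.predict(x)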
2.3. Genetic algorithm (GA)

In 1975, John Holland introduced the Genetic Algorithm (GA), which mimics natural selection: the law of survival of the fittest is applied to a population of individuals. Genetic algorithms have been widely used as an effective search technique to perform searches ranging from general to specific and from simple to complex, and this nature-inspired method is used for optimization. Genetic algorithms are implemented by generating a population and then creating new populations by performing the following procedures: reproduction, crossover, and mutation [29].
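A minimal sketch of such a loop is given below (Python; the real-coded operators and the population handling are illustrative assumptions, while the population size, generation count and mutation rate mirror the settings reported in Section 3):

import numpy as np

def genetic_minimize(fitness, bounds, pop_size=20, generations=40, rng=None):
    """Minimal GA: reproduction (selection), crossover and mutation."""
    rng = rng or np.random.default_rng(0)
    low, high = np.array(bounds, dtype=float).T
    pop = rng.uniform(low, high, (pop_size, len(bounds)))
    for _ in range(generations):
        scores = np.array([fitness(ind) for ind in pop])
        parents = pop[np.argsort(scores)][: pop_size // 2]   # reproduction: keep the fittest
        children = []
        while len(children) < pop_size - len(parents):
            a, b = parents[rng.integers(len(parents), size=2)]
            w = rng.random()
            child = w * a + (1 - w) * b                      # arithmetic crossover
            if rng.random() < 0.03:                          # mutation with 3% probability
                child = child + rng.normal(0.0, 0.1 * (high - low))
            children.append(np.clip(child, low, high))
        pop = np.vstack([parents] + children)
    scores = np.array([fitness(ind) for ind in pop])
    return pop[np.argmin(scores)]

With the fitness set to a recall error such as (9) below, this loop plays the role of the GA block in the imputation models that follow.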
2.4. Basic model: Support Vector Regression (SVR) and Genetic Algorithm (GA) imputation
1. Select the examples in which there are no missing attribute values.
2. Set one of the condition attributes (the input attribute), some of whose values are missing, as the decision attribute
(the output attribute), and conversely, set the decision attributes as the condition attributes.
3. Use SVM regression to predict the decision attribute values [12].
Using the steps (1, 2, and 3) outlined above for each attribute one by one and combining all attribute outputs yields the model output that corresponds to the model input; thus, the model is trained to recall itself [28,29,32]. Fig. 1 illustrates the manner in which the regression methods and the optimization technique are used to impute data. First, the support vector regression model must be trained with complete records before being used for data imputation. When using SVR, the inputs are recalled at the output; xu (the unknown attribute value) is approximated by the GA. The input of the model (6) is composed of xk (the known attribute values) and xu, and f is the model output that corresponds to the input (7). The model should recall the input data; thus, the difference between the input and the output is called the error (8). The GA attempts to reduce the error between the model output and the input; the error (9) must be non-negative for minimization, which results in a data value that is likely to be the missing value. However, for completeness, all of the outputs are used to reduce the error of the approximated value [28,29,32].
$$\mathrm{SVR\ input} = \begin{pmatrix} X_k \\ X_u \end{pmatrix} \quad (6)$$

$$\mathrm{SVR\ output} = f\begin{pmatrix} X_k \\ X_u \end{pmatrix} \quad (7)$$

$$\mathrm{error} = \begin{pmatrix} X_k \\ X_u \end{pmatrix} - f\begin{pmatrix} X_k \\ X_u \end{pmatrix} \quad (8)$$

$$\mathrm{GA\ fitness\ function} = \left( \begin{pmatrix} X_k \\ X_u \end{pmatrix} - f\begin{pmatrix} X_k \\ X_u \end{pmatrix} \right)^{2} \quad (9)$$
Fig. 1. A model for missing value imputation with a support vector regression genetic algorithm.
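A sketch of the fitness computation (6)–(9) in Python is given below; svr_models is assumed to be a list of per-attribute regressors trained on the complete records as in steps 1–3 above, and all names are illustrative:

import numpy as np

def ga_fitness(x_u, x_k, u_idx, k_idx, svr_models, n_attrs):
    """Fitness (9): squared recall error of the trained SVR models when the
    unknown attribute values are replaced by the GA's candidate x_u."""
    row = np.empty(n_attrs)
    row[k_idx] = x_k                          # known attribute values
    row[u_idx] = x_u                          # candidate for the missing values -> input (6)
    # each per-attribute model predicts its own attribute from the others -> output (7)
    recalled = np.array([svr_models[j].predict(np.delete(row, j)[None, :])[0]
                         for j in range(n_attrs)])
    return np.sum((row - recalled) ** 2)      # error (8), squared and summed (9)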
2.5. Basic model: fuzzy c-means (FCM) imputation

Given a set of objects, the overall objective of clustering is to divide the dataset into groups based on the similarity of the
objects and to minimize the intra-cluster dissimilarity. Fuzzy c-means (FCM) is a method of clustering that allows one datum
to belong to two or more clusters. This method, developed by Dunn in 1973 [8,10] and improved by Bezdek in 1981 [46], is
frequently used in pattern recognition. It is based on the minimization of the following objective function (1):
$$J_m = \sum_{i=1}^{N} \sum_{j=1}^{C} u_{ij}^{m} \, \lVert x_i - c_j \rVert^{2} \quad (1)$$

$$2 \leq c \leq N \quad (2)$$

$$1 \leq m \leq \infty \quad (3)$$

$$u_{ij} = \frac{1}{\sum_{k=1}^{c} \left( \dfrac{\lVert x_i - c_j \rVert}{\lVert x_i - c_k \rVert} \right)^{\frac{2}{m-1}}} \quad (4)$$

$$c_j = \frac{\sum_{i=1}^{N} u_{ij}^{m} x_i}{\sum_{i=1}^{N} u_{ij}^{m}} \quad (5)$$
In (2), c is the cluster number, which ranges between two and the record count (N). m is a parameter called the weighting
factor and ranges between one and infinity. This parameter controls the amount of fuzziness in the clustering process [9,46]
(3). There is no theoretical optimal choice of c and m [9,15]. The parameters can be changed depending on the characteristics
of the dataset and the relations of the attributes to each other. In this paper, we aim to find optimal values of the c and m
parameters for the dataset currently in use.
In fuzzy clustering, each data object xi has a membership function (4), which describes the degree to which the data ob-
ject belongs to a certain cluster cj (5). In the process of updating the membership functions and centroids, only complete
attributes are considered. In this process, the data object cannot be assigned to a concrete cluster represented by a cluster
centroid, as is done in the classical K-means clustering algorithm; this is why each data object belongs to all c clusters
with different membership degrees. The missing value of the incomplete data object xi is estimated using the information
about membership degrees and the values of the cluster centroids [9]. Experiments demonstrate that the fuzzy imputation
algorithm yields better performance than the basic clustering algorithm [9].
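The alternating updates (4) and (5) can be sketched in a few lines of Python applied to the complete data (illustrative only; the initialization, the iteration count and the small constant guarding against zero distances are assumptions, and m > 1 is assumed):

import numpy as np

def fcm(X, c, m, n_iter=100, rng=None):
    """Fuzzy c-means: alternate the membership update (4) and the centroid
    update (5) to minimize the objective function (1)."""
    rng = rng or np.random.default_rng(0)
    centroids = X[rng.choice(len(X), size=c, replace=False)]   # illustrative init
    for _ in range(n_iter):
        # d[i, j] = ||x_i - c_j||; a tiny constant guards against division by zero
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2) + 1e-12
        u = 1.0 / np.sum((d[:, :, None] / d[:, None, :]) ** (2.0 / (m - 1.0)), axis=2)  # (4)
        um = u ** m
        centroids = (um.T @ X) / um.T.sum(axis=1, keepdims=True)                        # (5)
    return u, centroids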
Fig. 2 illustrates how to estimate a sample missing value, where we suppose that '?' is one of the missing values in the dataset. The complete data are clustered into three clusters, and a weighting factor value of 2 is used (c = 3, m = 2). The membership values of the missing value '?' are estimated to be 0.5, 0.3, and 0.2, and the centroids of the clusters are estimated as 10, 15, and 20. Thus, the missing value is calculated as ? = 0.5 × 10 + 0.3 × 15 + 0.2 × 20 = 13.5.
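The same computation in code form, with the values taken from the example above:

memberships = [0.5, 0.3, 0.2]     # membership degrees of '?' in the three clusters
centroids = [10.0, 15.0, 20.0]    # cluster centroids for the missing attribute
estimate = sum(u * c for u, c in zip(memberships, centroids))
print(estimate)                   # 13.5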
2.6. The proposed method: estimating missing values using fuzzy c-means optimized with SVR and a genetic algorithm
A typical dataset containing missing values can be divided into two sections: Dataset (complete) and Dataset (incom-
plete) rows. An incomplete record is a row of the dataset in which one or more columns have missing value(s), whereas
a complete record is a row of the dataset in which no attributes have missing value(s).
Fig. 3 illustrates the proposed method. We can estimate missing values using fuzzy c-means, where c is the number of
clusters and m is the parameter of weighting factor. The optimal cluster (c) number and weighting factor (m) should be
determined to obtain the best predictive accuracy. Here, the purpose of the genetic algorithm, in cooperation with support
vector regression, is to minimize error. The minimized error function is error = (X − Y)², where X is the output of the support
vector regression (SVR) prediction and Y is the output of fuzzy c-means algorithm prediction. Before estimating the missing
values, the SVR model should be trained with the Dataset (complete) rows to estimate the output values that closely corre-
spond to the input. Thus, the genetic algorithm (GA) finds optimal (c, m) parameter values for which the fitness function has a
minimum difference, which is the error value between X and Y.
The pseudocode of the proposed method, fuzzy c-means imputation optimized with SVR-GA, is as follows:
1. Train the support vector regression algorithm with the Dataset (complete) rows, for which Input(X) ≈ Output(Y).
2. Estimate the dataset (incomplete) rows using fuzzy c-means, and compare the fuzzy c-means output with the SVR
output vector.
3. Obtain the optimized c and m parameters by using the genetic algorithm to minimize the difference between the SVR
output and the fuzzy c-means output.
4. Estimate the missing values using fuzzy c-means with optimized parameters.
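A compact sketch of this loop is given below (Python; fcm_impute is an assumed helper that clusters the data and fills each np.nan cell from memberships and centroids as in Section 2.5, svr_model is an assumed autoassociative regressor trained as in Section 2.4, ga_search is any GA minimizer such as the sketch in Section 2.3, and the upper bound on m is an illustrative assumption):

import numpy as np

def svr_fcm_ga_impute(data, svr_model, ga_search, fcm_impute):
    """Sketch of the proposed method: a GA searches the (c, m) plane; the
    fitness is the squared difference between the SVR recall X and the
    fuzzy c-means estimate Y of the incomplete rows."""
    incomplete = np.isnan(data).any(axis=1)
    def fitness(params):
        c, m = int(round(params[0])), params[1]
        filled = fcm_impute(data, c, m)       # FCM estimate of every missing cell
        Y = filled[incomplete]                # FCM output for the incomplete rows
        X = svr_model.predict(Y)              # autoassociative SVR recall of those rows
        return np.sum((X - Y) ** 2)           # error = (X - Y)^2
    # bounds: 2 <= c <= N as in (2); m > 1 as in (3), capped here for the search
    c_opt, m_opt = ga_search(fitness, [(2, len(data)), (1.1, 5.0)])
    return fcm_impute(data, int(round(c_opt)), m_opt)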
Another approach is fuzzy c-means genetic algorithm imputation (FCM-GA): in this method, some of the values in complete records are artificially deleted, and the genetic algorithm optimizes the c and m parameters so that fuzzy c-means reproduces the deleted values. FCM with the optimized parameters can then be used to impute the actual missing values.
The pseudocode for the fuzzy c-means imputation optimized with a genetic algorithm is given below:
1. Artificially delete some Dataset (complete) values using the ratio of Dataset (incomplete).
2. Estimate new Dataset (incomplete) values using fuzzy c-means.
3. Obtain optimized c and m parameters by using the genetic algorithm to minimize the difference between the artifi-
cially deleted and actual values.
4. Estimate the missing values using fuzzy c-means with the optimized parameters.
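The corresponding fitness for FCM-GA can be sketched as follows (Python; del_mask is an assumed boolean mask drawn with the same missing ratio as the Dataset (incomplete) rows, and fcm_impute is the same assumed helper as above):

import numpy as np

def fcm_ga_fitness(params, data_complete, del_mask, fcm_impute):
    """Fitness for FCM-GA: values deleted from complete records (step 1) are
    re-estimated by FCM (step 2) and compared with the true values (step 3)."""
    c, m = int(round(params[0])), params[1]
    corrupted = data_complete.copy()
    corrupted[del_mask] = np.nan              # step 1: artificial deletion
    estimated = fcm_impute(corrupted, c, m)   # step 2: FCM estimate of the deleted cells
    return np.sum((estimated[del_mask] - data_complete[del_mask]) ** 2)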
3. Experimental implementation
3.1. Data
We used six datasets that are frequently used in the literature and are available in the UCI Repository of Machine Learning Databases [6] (Table 2).
Testing on more datasets and working with several numbers and types of missing values are needed to determine how
the algorithm generalizes. All datasets are artificially regenerated such that they have 1%, 5%, 10%, 15%, 20% and 25% missing
value ratios.
All the datasets are transformed using min–max normalization (10) to [0, 1] before use, which reduces the support vector regression training time. Because each data attribute has a different domain and we want to test the algorithms under equal conditions, we first normalize the dataset so that all the data values are between 0 and 1.
An attribute is normalized by transforming its values such that they fall within a small specified range, such as 0.0–1.0. Nor-
malization is especially useful for classification algorithms; normalizing the input values for each attribute calculated in the
training samples helps accelerate the learning phase. For distance-based methods, normalization initially helps prevent attri-
butes with larger ranges from outweighing attributes with smaller ranges [19].
Table 2. The datasets used.
$$x_{i,\mathrm{norm}} = \frac{x_i - x_{i,\min}}{x_{i,\max} - x_{i,\min}} \quad (10)$$
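Applied column-wise, equation (10) is a few lines of numpy (a sketch; a constant column would need a guard against division by zero in practice):

import numpy as np

def min_max_normalize(X):
    """Column-wise min-max normalization (10): maps each attribute to [0, 1]."""
    x_min, x_max = X.min(axis=0), X.max(axis=0)
    return (X - x_min) / (x_max - x_min)   # assumes no constant column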
3.2. Implementation

The proposed method was coded in Matlab R2009b (version 7.9), and the least squares support vector machine (LS-SVM) toolbox [43] was used to perform the regression with a radial basis kernel. The genetic algorithm toolbox in Matlab was used with a population size of 20, 40 generations, a crossover fraction of 60% and a mutation fraction of 3%.
3.3. Evaluation criteria

The efficiency of the missing data estimation system is evaluated using the root mean square error (RMSE), the relative prediction accuracy (A), the Wilcoxon rank sum statistical significance test (W) and the runtime in seconds (t). The root mean square error measures the error between the real values and the estimated values and quantifies the accuracy of the prediction [1,28]. It is given by (11), in which $x_i$ is the real value, $\hat{x}_i$ is the estimated value, and n is the number of missing values.
$$\mathrm{RMSE} = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \hat{x}_i)^{2}}{n}} \quad (11)$$
The relative prediction accuracy (A) is a measure of how many estimations fall within a certain tolerance [1,28]; the tolerance is set to 10%, as done by Nelwamondo et al. The accuracy is given by (12), in which n is the total number of predictions and nT is the number of correct predictions within the tolerance. With a 10% tolerance, the relative prediction accuracy accepts as accurate those estimates within ±10% of the actual value. For example, if the actual value is 100 and the estimated value is between 90 and 110, the estimate is regarded as accurate.
$$A = \frac{n_T}{n} \times 100 \quad (12)$$
The Wilcoxon rank sum statistical significance test (W) was performed to test the validity of the estimation accuracy of the different imputation methods at a statistical significance level of 0.05 [18]. The principle underlying this test is that it does not require the data to have equal variances, which is essential because the variances of the data can be affected by erroneous estimations, particularly in the case of Zeroimpute. The null hypothesis is H0: Y = Yest (13), where Y and Yest are the actual and estimated matrices, respectively, and the test returns the P-value of this hypothesis.
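The three criteria can be sketched as follows (Python; scipy's ranksums is used for the Wilcoxon rank sum test, and the variable names are illustrative):

import numpy as np
from scipy.stats import ranksums

def rmse(x, x_hat):
    """Root mean square error (11) over the n imputed values."""
    x, x_hat = np.asarray(x), np.asarray(x_hat)
    return np.sqrt(np.mean((x - x_hat) ** 2))

def accuracy_10pct(x, x_hat):
    """Relative prediction accuracy (12): percentage of estimates within +/-10%."""
    x, x_hat = np.asarray(x), np.asarray(x_hat)
    return 100.0 * np.mean(np.abs(x_hat - x) <= 0.1 * np.abs(x))

def wilcoxon_p(y_actual, y_estimated):
    """P-value of the Wilcoxon rank sum test (13) for H0: Y = Yest."""
    stat, p = ranksums(np.ravel(y_actual), np.ravel(y_estimated))
    return p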
4. Experimental results

Figs. 4–7 present box plots of the performance evaluation for the glass, haberman, iris, musk1, wine and yeast datasets, which have 1%, 5%, 10%, 15%, 20% and 25% missing values. The results confirm the good performance of the hybrid fuzzy c-means approach for estimating missing values.
Box plots: On each box, the central mark is the median; the edges of the box are the 25th and 75th percentiles; the whis-
kers extend to the most extreme data points that are not considered outliers; and outliers are plotted individually.
Fig. 4 presents a comparison of the four methods on the six datasets that have 1–25% missing values. Each box contains 36 results of the method in terms of the root mean square error (RMSE), which indicates the error of the predictions; a lower error value indicates better performance. The median RMSE values are 0.0215, 0.0178, 0.0511 and 0.2765 for SvrFcmGa, FcmGa, SvrGa and ZeroImpute, respectively.
Fig. 4. RMS error over all datasets with 1–25% of the values missing.
Fig. 5. Accuracy for datasets in which 1–25% of the data are missing.
Fig. 6. Wilcoxon rank sum test for datasets in which 1–25% of the data are missing.
Fig. 5 shows the accuracy, as quantified by the difference between the actual and estimated data with 10% tolerance, for the four methods applied to the six datasets with 1–25% missing values. A higher accuracy value indicates better imputation. The median accuracy values are 92.03, 92.65, 72.17 and 65.95 for SvrFcmGa, FcmGa, SvrGa and ZeroImpute, respectively.
Fig. 6 shows the Wilcoxon rank sum statistical confidence measures; a higher value of P (13) implies that the method is superior to the others. The values for the four methods applied to the six datasets in which 1–25% of the data are missing are 0.8877, 0.8908, 0.8352 and 0 for SvrFcmGa, FcmGa, SvrGa and ZeroImpute, respectively.
Fig. 7 shows the run times (in seconds) obtained on a standard machine. The run times indicate that the hybrid fuzzy c-means approach is computationally demanding, because minimizing the fuzzy c-means objective function (1) inside the optimization loop is complex; therefore, the time statistics are a deficiency of the method proposed in this paper compared with the others. The median runtime values are 85.28, 128.14, 27.03 and 0.001 s for SvrFcmGa, FcmGa, SvrGa and ZeroImpute, respectively.
Fig. 7. Runtime (s) for datasets in which 1–25% of the data are missing.
Fig. 8. Runtime (s) for the Iris dataset, in which 1–25% of the data are missing.
Fig. 9. RMS error in the Iris dataset for 1–25% missing data values.
The conditions that produced the primary results can be observed in Figs. 4–7. Fuzzy c-means is a strong tool for identifying changing class structures, and it remains flexible in the presence of uncertain data (i.e., outliers and noise) [40], which improves the imputation accuracy of the proposed method.
Figs. 8–11 present plots of the results for the single dataset Iris, which has 1%, 5%, 10%, 15%, 20% and 25% missing values. Fig. 8 demonstrates that the proposed hybrid FCM method has a longer runtime: it includes a longer fuzzy c-means clustering process and requires a longer support vector regression training time than the other methods. If the training (complete) data are more suitable for clustering and regression, the runtime decreases dramatically, as can be observed in Fig. 8 for 20–25% missing values. In collaboration with the genetic algorithm, a robust optimization method, support vector regression, which can determine the relations among the input variables, yields lower runtimes for the iris data.
Fig. 9 demonstrates that the hybrid FCM method has lower error values than SvrGa and Zeroimpute for the iris data.
Fig. 10 shows that the proposed hybrid FCM method attains higher statistical confidence values than the other methods for the iris data.
Fig. 11 shows the accuracy of the estimated and actual data with 10% tolerance. Note that especially for 1–10% missing
values, the proposed method is more accurate for the iris data.
In the following paragraphs, the unique features of the proposed approaches and their primary advantages over the other methods are discussed. The basic fuzzy c-means imputation model uses constant, predetermined cluster number (c) and weighting factor (m) values for each dataset, with which missing values cannot be estimated with high sensitivity. These values must be changed depending on the type of data: data may be time series or non-time series and real or integer, and the number of rows in the dataset and the attribute column size can vary. The SvrGa imputation fails for some outlier data, and the
genetic algorithm may find a suboptimal solution unless the whole solution space is searched. This is a local minimization
problem. A problem with GAs is that the genes that are highly fit but not optimal may rapidly dominate the population,
which causes it to converge to a local minimum. Once the population has converged, the ability of the GA to search for better
solutions is significantly constrained; the crossover of almost identical chromosomes produces little that is new. Another GA
deficiency is that using only mutation to explore entirely new ground results in a slow, random search [17].
Fig. 10. Wilcoxon rank sum test for Iris dataset for 1–25% missing data values.
Fig. 11. Accuracy in the iris dataset for 1–25% missing data values.
Fig. 12. Wilcoxon rank sum test for datasets analyzed using the SvrFcmGa imputation.
On the other hand, the proposed SvrFcmGa approach determines optimal parameter values suitable for the data, although the results depend on the initial choice of weights. Support vector regression can be used to determine the relations among the input variables. The fuzzy c-means imputation method yields conservative data estimation for outliers and noisy data. Fuzzy estimation has been used successfully for noise reduction and enhancement, and it improves the accuracy of clustering under noise [2,16,48].
There are some deficiencies of the proposed method. Training the support vector regression model is a significant issue; the kernel type and the performance criteria must be specified before imputation. To further improve the imputation accuracy, feature selection and/or dimension reduction techniques [13] can be applied to the training data; for example, principal component analysis can be applied before the support vector training to reduce the training time. In Fig. 12, it can be observed that data that are more suitable for the clustering process and the support vector training stage yield superior imputation performance. To make further improvements in estimation and to ensure that the data are consistent and suitable for the clustering process, different cluster validity analyses can be performed and their results compared before the imputation task.
5. Conclusions
In this paper, a hybrid method that combines support vector regression, which is known as a reliable machine learning technique, and a genetic algorithm with fuzzy clustering was used to estimate missing values. The complete training data were clustered based on their similarity, and fuzzy principles were used during clustering; therefore, each missing value belongs to more than one cluster centroid, which yields more sensible imputation results. Six datasets with different characteristics were used in this paper, and the cluster size and weighting factor parameters were optimized for the corresponding dataset. Better imputation accuracy is achieved compared with the FcmGa, SvrGa and Zero imputation methods: in empirical tests, the proposed method proved to be more accurate than the others. The proposed fuzzy c-means SvrGa imputation was compared with other representative models, the FCM-genetic algorithm and the support vector regression-genetic algorithm. The experimental results demonstrated that the fuzzy c-means SvrGa imputation yields a more sensible estimation accuracy ratio for suitable clustering data.
References
[1] M. Abdella, T. Marwala, The use of genetic algorithms and neural networks to approximate missing data in database, Comput. Inform. 24 (2005) 577–
589.
[2] M.N. Ahmed, S.M. Yamany, N. Mohamed, A.A. Farag, T. Moriarty, A modified fuzzy c-means algorithm for bias field estimation and segmentation of MRI
data, IEEE Trans. Med. Imaging 21 (2002) 193–199.
[3] I.B. Aydilek, A. Arslan, A novel hybrid approach to estimating missing values in databases using k-nearest neighbors and neural networks, Int. J. Innov.
Comput. I (8) (2012) 4705–4717.
[4] Debasish Basak, Srimanta Pal, Dipak Chandra Patranabis, Support vector regression, Neural Inform. Process. – Lett. Rev. 11 (10) (2007).
[5] C. Bergmeir, J.M. Benitez, On the use of cross-validation for time series predictor evaluation, Inform. Sci. 191 (2012) 192–213.
[6] C.L. Blake, C.J. Merz, UCI Repository of Machine Learning Databases <http://www.ics.uci.edu/~mlearn/MLRepository.html>, Irvine, CA, U. of California,
Department of Information and Computer Science, 1998 (accessed 19.10.2012).
[7] Y. Cheng, D.Q. Miao, Q.R. Feng, Positive approximation and converse approximation in interval-valued fuzzy rough sets, Inform. Sci. 181 (2011) 2086–
2110.
[8] S. Das, S. Sil, Kernel-induced fuzzy clustering of image pixels with an improved differential evolution algorithm, Inform. Sci. 180 (2010) 1237–1256.
[9] D. Li, J. Deogun, W. Spaulding, B. Shuart, Towards missing data imputation: a study of fuzzy K-means clustering method, Rough Sets Curr. Trends
Comput. 3066 (2004) 573–579.
[10] J.C. Dunn, A fuzzy relative of the isodata process and its use in detecting compact well-separated clusters, J. Cybernet. 3 (1973) 32–57, http://
dx.doi.org/10.1080/01969727308546046.
I.B. Aydilek, A. Arslan / Information Sciences 233 (2013) 25–35 35
[11] A. Farhangfar, L.A. Kurgan, W. Pedrycz, A novel framework for imputation of missing values in databases, IEEE Trans. Syst. Man. Cybernet. A 37 (2007)
692–709.
[12] H.H. Feng, G.S. Chen, C. Yin, B.R. Yang, Y.M. Chen, A SVM regression based approach to filling in missing values, Proc. Knowled.-Based Intell. Inform.
Eng. Syst. 3683 (Pt 3) (2005) 581–587.
[13] I.K. Fodor, A survey of dimension reduction techniques, Technical Report UCRL-ID-148494, Lawrence Livermore National Laboratory, 2002.
[14] S. Genc, F.E. Boran, D. Akay, Z.S. Xu, Interval multiplicative transitivity for consistency, missing values and priority weights of interval fuzzy preference
relations, Inform. Sci. 180 (2010) 4877–4891.
[15] S. Ghosh, P.P. Mujumdar, Future rainfall scenario over Orissa with GCM projections by statistical downscaling, Curr. Sci. India 90 (2006) 396–404.
[16] M. Gil, E.G. Sarabia, J.R. Llata, J.P. Oria, Fuzzy c-means clustering for noise reduction, enhancement and reconstruction of 3D ultrasonic images, in:
Proceedings of the Emerging Technologies and Factory Automation, 1999, ETFA‘99.
[17] D.E. Goldberg, Sizing populations for serial and parallel genetic algorithms, in: Proceedings of the Third International Conference on Genetic
Algorithms, 1989, pp. 70–79.
[18] J. Hajek, Z.k. Sidak, Theory of Rank Tests, Academic Press, New York, 1967.
[19] Jiawei Han, Micheline Kamber, Data Mining Concepts and Techniques, Academic Press, 2001.
[20] R.J. Hathaway, J.C. Bezdek, Clustering incomplete relational data using the non-Euclidean relational fuzzy c-means algorithm, Pattern Recogn. Lett. 23
(2002) 151–160.
[21] C.H. Huang, H.Y. Kao, Interval regression analysis with soft-margin reduced support vector machine, Proc. Next-Gener. Appl. Intell. 5579 (2009) 826–
835.
[22] J. Van Hulse, T.M. Khoshgoftaar, Incomplete-case nearest neighbor imputation in software measurement data, in: IRI 2007: Proceedings of the 2007
IEEE International Conference on Information Reuse and Integration, 2007, pp. 630–637.
[23] D. Li, H. Gu, L.Y. Zhang, A fuzzy c-means clustering algorithm based on nearest-neighbor intervals for incomplete data, Expert Syst. Appl. 37 (2010)
6942–6947.
[24] Zaifei Liao, Xinjie Lu, Tian Yang, Hongan Wang, Missing data imputation: a fuzzy K-means clustering algorithm over sliding window, Fuzzy Syst.
Knowled. Discovery 14–16 (2009) 133–137.
[25] C.P. Lim, J.H. Leong, M.M. Kuan, A hybrid neural network system for pattern classification tasks with missing features, IEEE Trans. Pattern Anal. 27
(2005) 648–653.
[26] R.J.A. Little, D.B. Rubin, Statistical Analysis with Missing Data, Wiley, New York, 1987.
[27] M.S. Mahmoud, M.F. Emzir, State estimation with asynchronous multi-rate multi-smart sensors, Inform. Sci. 196 (2012) 15–27.
[28] T. Marwala, Computational Intelligence for Missing Data Imputation, Estimation and Management: Knowledge Optimization Techniques, Information
Science Reference, Hershey PA, 2009.
[29] T. Marwala, S. Chakraverty, Fault classification in structures with incomplete measured data using autoassociative neural networks and genetic
algorithm, Curr. Sci. India 90 (2006) 542–548.
[30] Z.Q. Meng, Z.Z. Shi, Extended rough set-based attribute reduction in inconsistent incomplete decision systems, Inform. Sci. 204 (2012) 44–69.
[31] S. Mohamed, T Marwala, Neural network based techniques for estimating missing data in databases, in: 16th Annual Symposium of the Pattern
Recognition Association of South Africa, Langebaan, 2005, pp. 27–32.
[32] F.V. Nelwamondo, S. Mohamed, T. Marwala, Missing data: a comparison of neural network and expectation maximization techniques, Curr. Sci. India
93 (2007) 1514–1521.
[33] Fulufhelo V. Nelwamondo, Dan Golding, Tshilidzi Marwala, A dynamic programming approach to missing data estimation using neural Networks,
Inform. Sci. (2009), http://dx.doi.org/10.1016/j.ins.2009.10.008.
[34] A.G. Di Nuovo, Missing data analysis with fuzzy C-Means: a study of its application in a psychological scenario, Expert Syst. Appl. 38 (2011) 6793–
6797.
[35] K. Pelckmans, J. De Brabanter, J.A.K. Suykens, B. De Moor, Handling missing values in support vector machine classifiers, Neural Networks 18 (2005)
684–692.
[36] W. Qiao, Z. Gao, R.G. Harley, Continuous on-line identification of nonlinear plants in power systems with missing sensor measurements, in:
Proceedings of the International Joint Conference on Neural Networks (IJCNNs), vols. 1–5, 2005, pp. 1729–1734.
[37] A.S. Salama, Topological solution of missing attribute values problem in incomplete information tables, Inform. Sci. 180 (2010) 631–639.
[38] J.L. Schafer, J.W. Graham, Missing data: our view of the state of the art, Psychol. Meth. 7 (2) (2002) 147–177.
[39] M.S.B. Sehgal, I. Gondal, L.S. Dooley, R. Coppel, Ameliorative missing value imputation for robust biological knowledge inference, J. Biomed. Inform. 41
(2008) 499–514.
[40] A. Shahi, R.B. Atan, M.N. Sulaiman, Detecting effectiveness of outliers and noisy data on fuzzy system using FCM, Eur. J. Sci. Res. 36 (4) (2009) 627–638.
[41] K. Sim, G.M. Liu, V. Gopalkrishnan, J.Y. Li, A case study on financial ratios via cross-graph quasi-bicliques, Inform. Sci. 181 (2011) 201–216.
[42] A.J. Smola, B. Scholkopf, A tutorial on support vector regression, Stat. Comput. 14 (2004) 199–222.
[43] J.A.K. Suykens, T. Van Gestel, J. De Brabanter, B. De Moor, J. Vandewalle, Least Squares Support Vector Machines <http://www.esat.kuleuven.be/sista/
lssvmlab/>, World Scientific, Singapore, 2002 (accessed 19.10.2012).
[44] H. Timm, C. Doring, R. Kruse, Different approaches to fuzzy clustering of incomplete datasets, Int. J. Approx. Reason. 35 (2004) 239–249.
[45] V. Vapnik, S.E. Golowich, A. Smola, Support vector method for function approximation, regression estimation, and signal processing, Adv. Neur. Inform.
9 (1997) 281–287.
[46] P.H. Wang, Pattern-recognition with fuzzy objective function algorithms – Bezdek JC, SIAM Rev. 25 (1983) 442.
[47] X. Wang, A. Li, Z.H. Jiang, H.Q. Feng, Missing value estimation for DNA microarray gene expression data by support vector regression imputation and orthogonal coding scheme, BMC Bioinform. 7 (2006).
[48] Xiaohong Wu, Bin Wu, Yong Deng, Jiewen Zhao, The fuzzy learning vector quantization with allied fuzzy c-means clustering for clustering noisy data, J. Inform. Comput. Sci. 8 (9) (2011) 1713–1719.
[49] L.Y. Yang, L.S. Xu, Topological properties of generalized approximation spaces, Inform. Sci. 181 (2011) 3570–3580.