Extreme Learning Machine For Missing Data Using Multiple Imputations
Dušan Sovilj a,b,∗, Emil Eirola a, Yoan Miche b, Kaj-Mikael Björk a, Rui Nian c, Anton Akusok d, Amaury Lendasse a,d
a Arcada University of Applied Sciences, 00550 Helsinki, Finland
b Aalto University School of Science, FI-00076, Finland
c Ocean University of China, 266003 Qingdao, China
d The University of Iowa, Iowa City, IA 52242-1527, USA
Abstract
In this paper, we examine the general regression problem under the missing data scenario. In order to provide reliable estimates of the regression function (approximation), a novel methodology based on the Gaussian Mixture Model and the Extreme Learning Machine is developed. The Gaussian Mixture Model, adapted to handle missing values, is used to model the data distribution, while the Extreme Learning Machine enables a multiple imputation strategy for the final estimation. With multiple imputation and an ensemble over many Extreme Learning Machines, the final estimate is improved compared to mean imputation performed only once to complete the data. The proposed methodology has longer running times than simple methods, but the overall increase in accuracy justifies this trade-off.
Keywords: Extreme Learning Machine, missing data, multiple imputation, gaussian mixture model,
mixture of gaussians, conditional distribution
1. Introduction
A recurring problem in many scientific domains is accurate prediction or forecasting for unknown and/or future instances. This issue is addressed by assuming that there exists an underlying mechanism that generates the available data, and then building a model that provides a good enough approximation of that mechanism. Any subsequent inference is based on the constructed model, assuming all the necessary information is taken into account. The task of making predictions, for example of daily temperature or retail sales for some specific time period, is considered a regression problem or estimation of the regression function. Another issue becoming more prevalent in the Machine Learning domain is missing data in the databases encountered in many research areas [1, 2, 3, 4]. This issue has a huge impact on both the learning algorithms and the subsequent inference procedures. If it is not treated correctly, any kind of inference results in severely biased and inaccurate estimates.
In this paper, we are interested in regression problems of the form

$y_i = f(x_i) + \epsilon_i$   (1)

in the presence of missing data, where $(X, Y) = \{(x_i, y_i)\}_{i=1}^{N}$ are data samples with $x_i$ consisting of $d$ explanatory features or variables, $y_i$ the target variable and $\epsilon_i$ the noise term. The usual assumption behind the noise term is that it follows a Gaussian distribution with zero mean and known variance, $\epsilon \sim \mathcal{N}(0, \sigma^2)$. The regression problem is to find a model $\mathcal{M}$ that is a close approximation to the true underlying function $f$.
∗ Corresponding author.
Email addresses: dusan.sovilj@aalto.fi (Dušan Sovilj), emil.eirola@arcada.fi (Emil Eirola), yoan.miche@aalto.fi
(Yoan Miche), kaj-mikeal.bjork@arcada.fi (Kaj-Mikael Björk), anton-akusok@uiowa.edu (Anton Akusok),
amaury-lendasse@uiowa.edu (Amaury Lendasse)
Several strategies exist for dealing with missing values:
• Conditional mean imputation, which is optimal in terms of minimising the mean squared error of the imputed values, but suffers from biased statistics of the data. For instance, estimates of variance or distance are negatively biased.
• Random draw imputation, which is more appropriate for generating a representative instance of a fully imputed data set. However, the imputations are too variable for any single imputed value to be accurate.
• Multiple imputation. This setup draws several representative imputations of the data, analyses each set separately, and combines the results to form an overall estimate with uncertainty taken into account [6] (a toy sketch of this idea follows the list). This approach can result in unbiased and accurate estimates after a sufficiently high number of draws, but it is not always straightforward to determine the posterior distribution to draw from [7]. In the context of Machine Learning, repeating the analysis several times is, however, impractical, as training and analysing a sophisticated model tends to be computationally expensive.
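To make the multiple imputation idea from the last item concrete, the following toy sketch in NumPy draws several imputations of a single feature from a crude normal model, fits a linear regression to each imputed copy, and pools the resulting slope estimates. The imputation model, the missingness rate and all variable names are illustrative assumptions for this sketch only; they are not part of the methodology developed later in the paper.

```python
# Toy multiple imputation: impute V times, analyse each copy, pool the results.
import numpy as np

rng = np.random.default_rng(0)
N, V = 200, 50
x = rng.normal(size=N)
y = 2.0 * x + rng.normal(scale=0.5, size=N)
missing = rng.random(N) < 0.3                       # 30% of x is unobserved
x_obs = np.where(missing, np.nan, x)

mu, sigma = np.nanmean(x_obs), np.nanstd(x_obs)     # crude model for the missing entries

slopes = []
for _ in range(V):
    x_imp = x_obs.copy()
    x_imp[missing] = rng.normal(mu, sigma, size=missing.sum())  # one representative draw
    A = np.vstack([x_imp, np.ones(N)]).T
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)                # analyse this imputed copy
    slopes.append(coef[0])

print("pooled slope estimate:", np.mean(slopes))      # combined over all imputations
print("between-imputation spread:", np.std(slopes))   # reflects imputation uncertainty
```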
The conceptually simplest approach to dealing with incomplete data is to fill in the missing values before
commencing any further analysis. Many methods have been suggested for imputation with the intent to
appropriately conform to the distribution of the data. These include imputation by nearest neighbours [8],
or the improved incomplete-case k-NN imputation [9]. An alternative approach is to study the input density
indirectly through conditional distributions, by fully conditional specification [10]. However, the uncertainty
of the imputed values is often not explicitly modelled in most imputation methods, and hence ignored in
the further analysis, potentially leading to biased results.
Having an appropriate model to take missing data into consideration has several advantages. First, with any kind of imputation, many learning algorithms can be directly applied to the imputed data, such as neural networks, Gaussian processes and density estimation methods. Second, having a specific model designed to tackle missing values allows the variability of the imputed values to be taken into account, and thus the variance of the final estimate the practitioner is interested in.
Finite mixture models are a powerful modelling tool with a wide array of applications. Of considerable importance is the Gaussian Mixture Model (GMM), also known as the Mixture of Gaussians, which has been studied extensively for describing the distribution of a data set. This model provides a suitable estimate of the underlying data density, as the GMM is a universal approximator [11]. This enables the GMM to model any continuous density to arbitrary precision, and it has been employed for a variety of problems in vision [12, 13], language identification [14], and speech [15, 16] and image [17, 18] processing. The parameters of the GMM are obtained via maximum likelihood (ML) estimation by the Expectation-Maximisation (EM) algorithm [19]. The EM algorithm is a general-purpose algorithm for finding the ML solution with latent variables or incomplete data, and it does not require any derivatives of the likelihood function. The GMM has been extended to accommodate missing values in data sets [20, 21], an approach which has seen some resurgence in recent years [22, 23, 24].
In this paper, we consider regression estimation in the presence of missing data. First, a mixture of Gaussians is fitted to the original data with missing values. Second, a large number of imputations is performed, that is, a multiple imputation approach is adopted. After all newly formed data sets are available, a suitable regression model is built. As the number of draws can be large and the data sets can often contain a huge number of samples, a fast (in terms of training speed) and accurate model should be used. The choice is the Extreme Learning Machine (ELM), as it satisfies both criteria. In the case of difficult data, where a substantial number of imputed data sets is required, the ELM is a good choice, as fast computational models are more viable than the alternative gradient-based neural networks or kernel methods.
The Gaussian Mixture Model has been used to train neural networks in the presence of missing data [25], with the average gradient computed for the relevant parameters by using the conditional distribution of the missing values. That method is designed for training networks with back-propagation and is not applicable to other Machine Learning methods. The Extreme Learning Machine has also been adapted to handle missing values [26, 27], with both approaches estimating distances between samples that are subsequently used for the RBF kernel in the hidden layer. One advantage of that approach is that it circumvents estimation of all the missing values and focuses only on providing the information required by distance-based methods, such as Support Vector Machines or k-nearest neighbours. However, the method only returns expected pairwise distances that are then employed by the ELM for regression. The downside is that other activation functions have to be ignored, and the imputation is done once by the conditional mean. Although conditional mean imputation provides improved results over simple ad hoc solutions, it neglects the variability introduced by the underlying Gaussian Mixture Model.
The rest of the paper is organised as follows: Section 2 explains the overall approach in more detail, focusing on the main points of the methodology. The two main components of the approach, namely the mixture of Gaussians for missing data and the Extreme Learning Machine, are explained in Sections 3 and 4, respectively. Section 5 presents results for two types of imputation – conditional mean and multiple imputation – combined with two different modelling strategies. Finally, summarising remarks are given in Section 6.
2. Methodology
The proposed methodology consists of four stages (a simplified sketch follows the list):
1. Fitting the Gaussian Mixture Model on a data set with missing values.
2. Generating new data sets via multiple imputation based on the Gaussian Mixture Model from the first
stage.
3. Building Extreme Learning Machine for each generated data set in the second stage.
4. Combining all the Extreme Learning Machines to provide final estimates.
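The listing below is a deliberately simplified, runnable sketch of this four-stage flow. It assumes a single Gaussian component in the first stage (fitted crudely from the observed entries) and a small random-feature network with a least-squares output layer in place of the full ELM; every function here is an illustrative stand-in for the corresponding stage described in Sections 3 and 4, not the actual implementation used in the experiments.

```python
# Simplified end-to-end sketch of the four stages (illustrative stand-ins only).
import numpy as np

rng = np.random.default_rng(1)

def fit_gaussian_ignoring_missing(X):
    """Stage 1 stand-in: mean/covariance from observed entries (a K = 1 'mixture')."""
    mu = np.nanmean(X, axis=0)
    Xc = np.where(np.isnan(X), mu, X) - mu              # crude: centre mean-filled values
    cov = Xc.T @ Xc / len(X) + 1e-6 * np.eye(X.shape[1])
    return mu, cov

def draw_imputation(X, mu, cov, rng):
    """Stage 2 stand-in: sample missing entries from the conditional Gaussian per row."""
    X_imp = X.copy()
    for row in X_imp:
        m = np.isnan(row)
        if not m.any():
            continue
        o = ~m
        Soo_inv = np.linalg.inv(cov[np.ix_(o, o)])
        cm = mu[m] + cov[np.ix_(m, o)] @ Soo_inv @ (row[o] - mu[o])
        cc = cov[np.ix_(m, m)] - cov[np.ix_(m, o)] @ Soo_inv @ cov[np.ix_(o, m)]
        row[m] = rng.multivariate_normal(cm, cc)
    return X_imp

def train_random_feature_model(X, y, L, rng):
    """Stage 3 stand-in: random hidden layer + least-squares output weights."""
    W = rng.normal(size=(X.shape[1], L))
    b = rng.normal(size=L)
    H = np.tanh(X @ W + b)
    beta = np.linalg.solve(H.T @ H + 1e-6 * np.eye(L), H.T @ y)
    return W, b, beta

def predict(model, X):
    W, b, beta = model
    return np.tanh(X @ W + b) @ beta

# toy data with roughly 20% of the input values removed at random
X_full = rng.normal(size=(300, 4))
y = X_full @ np.array([1.0, -2.0, 0.5, 0.0]) + 0.1 * rng.normal(size=300)
X = X_full.copy()
mask = rng.random(X.shape) < 0.2
mask[mask.all(axis=1), 0] = False          # keep at least one observed value per sample
X[mask] = np.nan

mu, cov = fit_gaussian_ignoring_missing(X)                        # stage 1
models = []
for _ in range(20):                                               # V = 20 imputations
    Xv = draw_imputation(X, mu, cov, rng)                         # stage 2
    models.append(train_random_feature_model(Xv, y, 50, rng))     # stage 3
y_hat = np.mean([predict(mdl, X_full) for mdl in models], axis=0) # stage 4: ensemble average
print("ensemble training MSE:", np.mean((y - y_hat) ** 2))
```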
2.2. Multiple Imputation for Missing Values
Conditional mean imputation remains one of the prevalent solutions to the problem of missing data, offering better estimates than simple ad hoc procedures [27]. However, it provides insufficient information for further inference, as the whole underlying distribution is condensed into a single value. For this reason, multiple imputation considers many replacements for the unobserved values, and subsequent inference is based on combining inferences across all imputed versions. In this way, uncertainty about the missing values is taken into account.
In the second stage of the methodology, many data sets are drawn from the GMM model $\Gamma$ obtained in the first phase. The imputation is done per sample $x_i$ that contains unobserved values. The conditional distribution $p(x_i^{miss} \mid x_i^{obs})$ of the missing variables (conditioned on the observed variables) is again a Gaussian Mixture Model, with components adjusted based on the parameters of $\Gamma$. Drawing from the adjusted Gaussian Mixture Model is straightforward, and it shows that conditional mean imputation can be severely biased if the conditional distribution $p(x_i^{miss} \mid x_i^{obs})$ is multimodal.
The filling in is done per sample until all samples are processed. One complete sweep produces one new data set, and the procedure is repeated a prespecified number of times, say $V$. The end result of this stage is $V$ new data sets $\{X_v\}_{v=1}^{V}$. Looked at from a different perspective, for each missing value there are a total of $V$ imputations, which are then collected to form the new data sets. The next step is to train the models on these complete data sets.
$\hat{y}_{new} = \frac{1}{V} \sum_{v=1}^{V} \mathcal{M}_v(x_{new}),$   (2)

where $\mathcal{M}_v(x_{new})$ is the output of a model $\mathcal{M}_v$ for a sample $x_{new}$. In this way, the estimate $\hat{y}_{new}$ properly reflects sampling variability due to incomplete data.
The second strategy can be considered a true combining approach, as $V$ models are generated, one for each data set $X_v$. Each model has a different initialisation of the input weights, and the actual prediction is again given by Eq. (2).
3. Mixture of Gaussians for Missing Data

The goal of the Expectation-Maximisation algorithm [19] is to find the maximum likelihood solution in the case where unknown latent variables are added to the data. In the case of a Gaussian Mixture Model with $K$ components, the criterion to be optimised is the log-likelihood

$\log L(\theta) = \log p(X \mid \theta) = \sum_{i=1}^{N} \log\left( \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x_i \mid \mu_k, \Sigma_k) \right),$   (3)

where $\mathcal{N}(x \mid \mu_k, \Sigma_k)$ is the probability density function of the multivariate normal distribution, and $\theta = \{\pi_k, \mu_k, \Sigma_k\}_{k=1}^{K}$ is the set of parameters to be determined. The E-step consists of finding the posterior probabilities of the latent variables, while in the M-step new values for the means $\mu_k$, covariances $\Sigma_k$ and mixing coefficients $\pi_k$ are recomputed based on the probabilities from the E-step. All the parameters $\theta$ are initialised to some suitable values (with the constraints $0 < \pi_k < 1$ and $\sum_{k=1}^{K} \pi_k = 1$), and then the E- and M-steps are alternated until convergence, either in the log-likelihood or in the parameter values. For the EM algorithm, we only consider the input or feature space $X$, ignoring the target vector $Y$ altogether.
When the data contain missing values, the log-likelihood is evaluated over the observed components of each sample only:

$\log L(\theta) = \log p(X^O \mid \theta) = \sum_{i=1}^{N} \log\left( \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x_i^{O_i} \mid \mu_k, \Sigma_k) \right),$   (4)

where $X^O = \{x_i^{O_i}\}_{i=1}^{N}$ and $\mathcal{N}(x_i^{O_i} \mid \mu_k, \Sigma_k)$ is used for the marginal multivariate normal probability density of the observed values of $x_i$.
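As a small check of Eq. (4), the marginal density over the observed coordinates of a sample is simply a multivariate normal restricted to those coordinates, which SciPy can evaluate directly. The sketch below computes the observed-data log-likelihood for an arbitrary example parameter set; the parameter values and names are only illustrative.

```python
# Observed-data log-likelihood of Eq. (4): for each sample, sum the component densities
# evaluated on the observed coordinates only (marginal of a multivariate normal).
import numpy as np
from scipy.stats import multivariate_normal

def observed_loglik(X, pis, mus, covs):
    total = 0.0
    for x in X:
        o = ~np.isnan(x)                      # index set O_i of observed coordinates
        dens = sum(pi * multivariate_normal.pdf(x[o], mean=mu[o], cov=cov[np.ix_(o, o)])
                   for pi, mu, cov in zip(pis, mus, covs))
        total += np.log(dens)
    return total

# arbitrary two-component example in three dimensions, one missing value
X = np.array([[0.1, np.nan, -0.3], [1.2, 0.4, 0.0]])
pis = [0.5, 0.5]
mus = [np.zeros(3), np.ones(3)]
covs = [np.eye(3), 0.5 * np.eye(3)]
print(observed_loglik(X, pis, mus, covs))
```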
To account for the missing data, certain additional expectations need to be computed in the EM algorithm. These are the conditional expectations of the missing components of a sample with respect to each Gaussian component $k$, and their conditional covariance matrices, i.e., $\tilde{\mu}_{ik}^{M_i} = \mathrm{E}[x_i^{M_i} \mid x_i^{O_i}]$ and $\tilde{\Sigma}_{ik}^{M_i M_i} = \mathrm{Var}[x_i^{M_i} \mid x_i^{O_i}]$, where the mean and covariance are calculated under the assumption that $x_i$ originates from the $k$th Gaussian. For convenience, we also define the corresponding imputed data vectors $\tilde{x}_{ik}$ and full covariance matrices $\tilde{\Sigma}_{ik}$, which are padded with zeros for the known components. Then the E-step is:
$t_{ik} = \frac{\pi_k \, \mathcal{N}(x_i^{O_i} \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j \, \mathcal{N}(x_i^{O_i} \mid \mu_j, \Sigma_j)},$   (5)

$\tilde{\mu}_{ik}^{M_i} = \mu_k^{M_i} + \Sigma_k^{M_i O_i} (\Sigma_k^{O_i O_i})^{-1} (x_i^{O_i} - \mu_k^{O_i}), \qquad \tilde{x}_{ik} = \begin{bmatrix} x_i^{O_i} \\ \tilde{\mu}_{ik}^{M_i} \end{bmatrix},$   (6)

$\tilde{\Sigma}_{ik}^{M_i M_i} = \Sigma_k^{M_i M_i} - \Sigma_k^{M_i O_i} (\Sigma_k^{O_i O_i})^{-1} \Sigma_k^{O_i M_i}, \qquad \tilde{\Sigma}_{ik} = \begin{bmatrix} 0^{O_i O_i} & 0^{O_i M_i} \\ 0^{M_i O_i} & \tilde{\Sigma}_{ik}^{M_i M_i} \end{bmatrix}.$   (7)
The notation $\mu_k^{M_i}$ refers to using only the elements of the vector $\mu_k$ specified by the index set $M_i$, and similarly for $x_i^{O_i}$, etc. For matrices, $\Sigma_k^{M_i O_i}$ refers to the elements in the rows specified by $M_i$ and the columns specified by $O_i$. The expressions for the parameters in Eqs. (6) and (7) originate from the observation that the conditional distribution of the missing components also follows a multivariate normal distribution with these parameters [37, Thm. 2.5.1].
The M-step is slightly altered to reflect that we are dealing with missing data. Component means are estimated based on the imputed data vectors $\tilde{x}_{ik}$, and the covariance matrix estimates require an additional term concerning the covariances of the imputed values:

$\mu_k = \frac{1}{N_k} \sum_{i=1}^{N} t_{ik} \tilde{x}_{ik},$   (8)

$\Sigma_k = \frac{1}{N_k} \sum_{i=1}^{N} t_{ik} \left[ (\tilde{x}_{ik} - \mu_k)(\tilde{x}_{ik} - \mu_k)^T + \tilde{\Sigma}_{ik} \right],$   (9)

$\pi_k = \frac{N_k}{N},$   (10)

where $N_k = \sum_{i=1}^{N} t_{ik}$.
An efficient way to evaluate the matrix inverses required in Eqs. (6) and (7) is to use the sweep operator [5]. These steps are explained in more detail in [26]. In the same paper, the case of high-dimensional spaces is also studied, with the proposed solution based on high-dimensional data clustering [38]. The problem with high-dimensional data is reliable estimation of the parameters, as it becomes difficult to fit a Gaussian mixture model.
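For concreteness, the sketch below implements one EM iteration under missing data, following Eqs. (5)–(10) directly: responsibilities from the observed coordinates, conditional means and covariances per component, and then the adjusted M-step with the extra covariance term. It uses plain matrix inverses instead of the sweep operator, and the toy data and starting parameters are arbitrary; it is a sketch of the update rules, not the authors' implementation.

```python
# One EM iteration for a Gaussian mixture with missing values (Eqs. 5-10).
import numpy as np
from scipy.stats import multivariate_normal

def em_step_missing(X, pis, mus, covs):
    N, d = X.shape
    K = len(pis)
    T = np.zeros((N, K))              # responsibilities t_ik, Eq. (5)
    x_tilde = np.zeros((N, K, d))     # imputed vectors x~_ik, Eq. (6)
    S_tilde = np.zeros((N, K, d, d))  # zero-padded conditional covariances, Eq. (7)

    for i, x in enumerate(X):
        o, m = ~np.isnan(x), np.isnan(x)
        for k in range(K):
            T[i, k] = pis[k] * multivariate_normal.pdf(
                x[o], mean=mus[k][o], cov=covs[k][np.ix_(o, o)])
            x_tilde[i, k, o] = x[o]
            if m.any():
                Soo_inv = np.linalg.inv(covs[k][np.ix_(o, o)])
                reg = covs[k][np.ix_(m, o)] @ Soo_inv
                x_tilde[i, k, m] = mus[k][m] + reg @ (x[o] - mus[k][o])        # Eq. (6)
                S_tilde[i, k][np.ix_(m, m)] = (covs[k][np.ix_(m, m)]
                                               - reg @ covs[k][np.ix_(o, m)])  # Eq. (7)
        T[i] /= T[i].sum()                                                     # Eq. (5)

    Nk = T.sum(axis=0)
    new_pis = Nk / N                                                           # Eq. (10)
    new_mus = [(T[:, k, None] * x_tilde[:, k]).sum(axis=0) / Nk[k]
               for k in range(K)]                                              # Eq. (8)
    new_covs = []
    for k in range(K):
        diff = x_tilde[:, k] - new_mus[k]
        new_covs.append(sum(T[i, k] * (np.outer(diff[i], diff[i]) + S_tilde[i, k])
                            for i in range(N)) / Nk[k])                        # Eq. (9)
    return new_pis, new_mus, new_covs

# arbitrary example: two components in three dimensions, some entries missing
rng = np.random.default_rng(2)
X = rng.normal(size=(100, 3))
mask = rng.random(X.shape) < 0.2
mask[mask.all(axis=1), 0] = False      # keep at least one observed value per sample
X[mask] = np.nan
pis = np.array([0.5, 0.5])
mus = [np.zeros(3), np.ones(3)]
covs = [np.eye(3), np.eye(3)]
pis, mus, covs = em_step_missing(X, pis, mus, covs)
print("updated mixing coefficients:", pis)
```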
Once the EM algorithm has converged, the conditional distribution of the missing values of a sample $x_i$, given its observed values, is again a mixture of Gaussians. For each component $k$, the conditional means are

$\tilde{\mu}_{ik}^{M_i} = \mu_k^{M_i} + \Sigma_k^{M_i O_i} (\Sigma_k^{O_i O_i})^{-1} (x_i^{O_i} - \mu_k^{O_i})$   (11)

and the covariances

$\tilde{\Sigma}_{ik}^{M_i M_i} = \Sigma_k^{M_i M_i} - \Sigma_k^{M_i O_i} (\Sigma_k^{O_i O_i})^{-1} \Sigma_k^{O_i M_i}.$   (12)

The mixing coefficients correspond to the probabilities of the sample originating from each component:

$t_{ik} = \frac{\pi_k \, \mathcal{N}(x_i^{O_i} \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j \, \mathcal{N}(x_i^{O_i} \mid \mu_j, \Sigma_j)}.$   (13)
The conditional mean imputation of the missing values is then realised as the weighted average of the component centres:

$\tilde{x}_i^{M_i} = \sum_{k=1}^{K} t_{ik} \, \tilde{\mu}_{ik}^{M_i}.$   (14)
Sampling from the conditional distribution is accomplished by first fixing the component $k$ (drawing from the categorical distribution defined by the probabilities $t_{ik}$), drawing $|M_i|$ independent standard normal variables into a vector $z$, and using the Cholesky factor $L$ of the covariance matrix $\tilde{\Sigma}_{ik}^{M_i M_i}$ corresponding to component $k$. A representative sample is then generated by

$\tilde{\mu}_{ik}^{M_i} + L z.$   (15)
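Both imputation rules can be written in a few lines once the component-wise conditional parameters of Eqs. (11)–(13) are available: Eq. (14) is a weighted average of the conditional means, and Eq. (15) picks a component at random and adds Cholesky-correlated noise. The sketch below shows both for a single sample, using arbitrary example values for the fitted mixture parameters.

```python
# Conditional mean imputation (Eq. 14) and a random draw (Eq. 15) for one sample,
# given already-fitted GMM parameters (arbitrary example values below).
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(3)

# example fitted model: K = 2 components in d = 3 dimensions
pis = np.array([0.6, 0.4])
mus = [np.array([0.0, 0.0, 0.0]), np.array([2.0, 2.0, 2.0])]
covs = [np.eye(3), np.array([[1.0, 0.3, 0.0], [0.3, 1.0, 0.2], [0.0, 0.2, 1.0]])]

x = np.array([0.5, np.nan, np.nan])            # sample with missing coordinates
o, m = ~np.isnan(x), np.isnan(x)

t, cond_mu, cond_cov = [], [], []
for pi, mu, cov in zip(pis, mus, covs):
    Soo_inv = np.linalg.inv(cov[np.ix_(o, o)])
    reg = cov[np.ix_(m, o)] @ Soo_inv
    cond_mu.append(mu[m] + reg @ (x[o] - mu[o]))                 # Eq. (11)
    cond_cov.append(cov[np.ix_(m, m)] - reg @ cov[np.ix_(o, m)]) # Eq. (12)
    t.append(pi * multivariate_normal.pdf(x[o], mean=mu[o], cov=cov[np.ix_(o, o)]))
t = np.array(t) / np.sum(t)                                      # Eq. (13)

x_cm = x.copy()
x_cm[m] = sum(t_k * mu_k for t_k, mu_k in zip(t, cond_mu))       # Eq. (14): conditional mean

k = rng.choice(len(pis), p=t)                                    # pick a component
L = np.linalg.cholesky(cond_cov[k])
z = rng.standard_normal(m.sum())
x_mi = x.copy()
x_mi[m] = cond_mu[k] + L @ z                                     # Eq. (15): one random draw

print("conditional mean imputation:", x_cm)
print("one multiple-imputation draw:", x_mi)
```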
The number of mixture components $K$ is selected with the Bayesian Information Criterion (BIC),

$\mathrm{BIC} = -2 \log L(\theta) + P \log N,$   (16)

where $P$ is the number of free parameters. In the case of a full covariance matrix for each component $k$, there are in total $P = Kd + K - 1 + Kd(d+1)/2$ parameters to estimate: $Kd$ for the means, $Kd(d+1)/2$ for the covariance matrices and $K - 1$ for the mixing coefficients. As the number of dimensions $d$ increases, the number of parameters quickly tends to become larger than the number of available samples, making the BIC criterion invalid. One possibility to circumvent this issue is to impose restrictions on the structure of the covariance matrices, making the model less powerful.
A simple and commonly applied method to determine an appropriate number of components is the following: start with a single component $K = 1$ and learn the model $\Gamma_1$; then set $K = 2$ and check whether the newly fitted $\Gamma_2$ is more suitable than the model with $K = 1$. If this is the case, keep increasing $K$ until the criterion (BIC in our case) no longer improves.
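A compact way to express the parameter count $P$ and this search over $K$ is sketched below. For brevity it fits the candidate mixtures with scikit-learn's GaussianMixture on complete, synthetic data (that implementation does not handle missing values), so it only illustrates the count $P = Kd + K - 1 + Kd(d+1)/2$ and the stopping rule; in the methodology of this paper the fitting call would be the missing-data EM of Section 3, and the data, the restarts and the names are illustrative.

```python
# BIC-based choice of the number of components K: keep increasing K while BIC improves.
# scikit-learn's GaussianMixture is used here only as a stand-in on complete data.
import numpy as np
from sklearn.mixture import GaussianMixture

def n_free_parameters(K, d):
    return K * d + (K - 1) + K * d * (d + 1) // 2   # means + mixing coefficients + covariances

def bic(log_likelihood, K, d, N):
    return -2.0 * log_likelihood + n_free_parameters(K, d) * np.log(N)

rng = np.random.default_rng(4)
X = np.vstack([rng.normal(0, 1, size=(200, 3)), rng.normal(4, 1, size=(200, 3))])
N, d = X.shape

best_bic, best_K = np.inf, 1
for K in range(1, 11):                           # cap the search for this illustration
    gm = GaussianMixture(n_components=K, covariance_type="full", n_init=10,
                         random_state=0).fit(X)  # 10 EM restarts per K, as in the experiments
    current = bic(gm.score(X) * N, K, d, N)      # score() is the mean log-likelihood per sample
    if current >= best_bic:                      # BIC no longer improves: stop increasing K
        break
    best_bic, best_K = current, K
print("selected K:", best_K)
```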
4. Extreme Learning Machine

The Extreme Learning Machine (ELM) [39, 40] presents a novel technique for training a neural network that has been applied to a variety of cases [41, 42, 43, 44]. The ELM belongs to the family of single-hidden layer feedforward networks (SLFN) and considerably reduces the training time. These networks are particularly appealing due to their universal approximation capability, meaning that any continuous function $f$ can be approximated to a desired level of accuracy [45]. Reaching that accuracy requires a suitable learning algorithm that adjusts or tunes all the network parameters. The best known algorithm for training these networks is the back-propagation algorithm [46], an iterative procedure based on the gradient of the error function. The novelty that distinguishes the ELM is that certain network parameters need not be tuned, and can instead be randomly generated.
Let us consider a data set with $N$ samples in $d$-dimensional space, i.e., $\{(x_i, y_i)\}_{i=1}^{N} \in \mathbb{R}^d \times \mathbb{R}$. A SLFN models the data sample $x_i$ with

$\sum_{j=1}^{L} \beta_j g_j(w_j \cdot x_i + b_j), \quad i = 1, \ldots, N,$   (17)

where $w_j$ are the input weights, $b_j$ the bias, $w_j \cdot x_i$ the inner product and $g_j$ the activation function of the $j$th neuron in the hidden layer. $\beta_j$ is the output weight from the $j$th hidden neuron to the output neuron, while $L$ is the total number of neurons in the hidden layer. In this paper, we only consider the case of a single output, that is, the target $y_i$ is a scalar value. The above formula can easily be generalised to the multivariate case with a vector of outputs $y_i$. For the univariate case, the hidden layer output weights can be collected in a vector $\beta = [\beta_1, \ldots, \beta_L]^T$.
That a SLFN can approximate the data with zero error means that the output of the network is exactly the desired target value $y_i$ for all samples, that is,

$\sum_{j=1}^{L} \beta_j g_j(w_j \cdot x_i + b_j) = y_i, \quad i = 1, \ldots, N,$   (18)

which in matrix form is written as

$H \beta = y,$   (19)

where

$H = \begin{bmatrix} g_1(w_1 \cdot x_1 + b_1) & \cdots & g_L(w_L \cdot x_1 + b_L) \\ \vdots & \ddots & \vdots \\ g_1(w_1 \cdot x_N + b_1) & \cdots & g_L(w_L \cdot x_N + b_L) \end{bmatrix}_{N \times L},$   (20)

$\beta = [\beta_1, \ldots, \beta_L]^T \quad \text{and} \quad y = [y_1, \ldots, y_N]^T.$   (21)

$H$ is the hidden layer output matrix or feature mapping of the SLFN. The $j$th column of $H$ is the output of the $j$th hidden neuron for all data samples, while the $i$th row of the matrix is the sample $x_i$ passed through all the neurons, that is, the sample transformed into the new feature space $\mathbb{R}^L$. Training a SLFN requires tuning all the parameters of the network – $w_j$, $b_j$ and $\beta_j$, $j = 1, \ldots, L$.
The novelty that the ELM brings to the learning of a SLFN is that the network parameters $w_j$ and $b_j$ need not be tuned at all – they can be randomly generated before encountering the data and kept fixed throughout the learning stage. This does not cause the ELM to lose its universal approximation capability, provided that the activation functions $g_j(x)$ used in the hidden layer satisfy certain mild conditions [47, 48]. The probability distribution for $\{(w_j, b_j)\}_{j=1}^{L}$ can be any continuous probability distribution on any interval of $\mathbb{R}^d \times \mathbb{R}$. By dropping the tuning of the input weights of the hidden layer, the only remaining adjustable parameters are the output weights $\beta$. Given that the relation between the hidden layer matrix $H$ and the output $y$ is linear, the solution to Eq. (19) is obtained with an ordinary least-squares approach, that is, the goal is to find $\beta$ for the minimisation problem

$\min_{\beta} \| H\beta - y \|^2 .$

As $H$ might be non-square (in the case $L < N$), the solution is to use the Moore-Penrose generalised inverse (pseudo-inverse) $H^{\dagger}$ of the matrix $H$, which gives the solution as

$\beta = H^{\dagger} y .$

Adding a regularisation term $C$ to the diagonal leads to the solution

$\beta = (H^T H + C I)^{-1} H^T y ,$

which in statistics is known as ridge regression [50]. This modified version of the basic ELM is used in the experiments. The parameter $C$ can be cross-validated to achieve better results for the data set at hand. In our experiments, we skip this validation phase in order to speed up the execution time, and we use a small value $C = 10^{-6}$ to prevent numerical instabilities in the computation of $(H^T H + C I)^{-1}$.
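A minimal ELM for regression, as described above, only needs to generate the input weights and biases at random, build the hidden layer matrix $H$ of Eq. (20), and solve the regularised least-squares problem for $\beta$. The sketch below does exactly that with a sigmoid activation and $C = 10^{-6}$; the synthetic data and the function names are illustrative, not taken from the paper.

```python
# Minimal ELM regressor: random hidden layer (Eq. 20) + ridge solution for the output weights.
import numpy as np

def elm_train(X, y, L, C=1e-6, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    d = X.shape[1]
    W = rng.uniform(-1.0, 1.0, size=(d, L))    # random input weights w_j (never tuned)
    b = rng.uniform(-1.0, 1.0, size=L)         # random biases b_j
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))     # sigmoid activations; H is N x L
    beta = np.linalg.solve(H.T @ H + C * np.eye(L), H.T @ y)  # beta = (H'H + CI)^-1 H'y
    return W, b, beta

def elm_predict(model, X):
    W, b, beta = model
    return (1.0 / (1.0 + np.exp(-(X @ W + b)))) @ beta

# synthetic regression example
rng = np.random.default_rng(6)
X = rng.uniform(-3, 3, size=(500, 2))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] + 0.05 * rng.normal(size=500)
model = elm_train(X, y, L=100, rng=rng)
print("training MSE:", np.mean((elm_predict(model, X) - y) ** 2))
```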
5. Experiments

The effectiveness of the proposed approach is tested on several data sets taken from Machine Learning repositories. Since none of the data sets used contain missing values, the real-world scenario is simulated by removing a portion of the data before the whole methodology is applied. Values in the data set are removed at random with a fixed probability until a prespecified number of instances are discarded. In the experiments, we focus only on the supervised regression task, but the approach can easily be extended to classification tasks (binary and multiclass). Although it is possible to use the outputs $Y$ as another feature to help estimate missing values in the input samples $X$, they are left out during the first stage when fitting the mixture of Gaussians.
Table 1 shows all the data sets used in the experiments. Data sets are taken from two repositories
for Machine Learning related tasks: the UCI Machine Learning repository [52] and the LIACC regression
repository [53].
Table 1: Data sets used in the experiments. N indicates the total number of samples in the data set and d
denotes the number of features.
Name N d
Abalone 4177 8
Bank 8FM 4500 8
Boston housing 506 13
Machine CPU 209 6
Stocks 950 9
Wine quality (red) 1599 11
Table 2: Number of neurons in the ELM used for each data set.
The comparison is done between two strategies for missing value imputation: conditional mean imputation (CM) and multiple imputation (MI). Both are based on the same fitted Gaussian Mixture Model, which is the first step of the methodology. In the CM case, each missing value is replaced by the conditional mean value given by Eq. (14), which gives a data set $X_{cm}$. On the other hand, for the MI scenario a total of $V = 1000$ new data sets $\{X_v\}_{v=1}^{V}$ are generated using Eq. (15), which are subsequently used to train the ELM models. The criterion for comparison is the squared error risk, estimated as an average over 10 Monte-Carlo runs on a test set. That is, the data set containing missing values is first split into training and test parts. The training part contains two-thirds of the samples, while the remaining third belongs to the test set. With this approach it is possible to have missing values in the test set, and if this is the case, they are replaced by their true values in order to be able to compute the required risk.
Once the splitting is done, the variables are standardised to zero mean and unit variance, since the ELM is sensitive to the range of the variables. The test set is then transformed according to the statistics (mean and variance) obtained on the training set.
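The data preparation just described (random removal of values, a two-thirds/one-third split, and standardisation of the test part with training statistics only) can be sketched as follows; the removal probability, the placeholder data and the variable names are illustrative.

```python
# Simulating missing values and standardising with training statistics only.
import numpy as np

rng = np.random.default_rng(7)
X = rng.normal(size=(506, 13))                 # placeholder standing in for a real data set
y = rng.normal(size=506)

# remove values at random with a fixed probability, keeping at least one value per sample
mask = rng.random(X.shape) < 0.10
mask[mask.all(axis=1), 0] = False
X_miss = np.where(mask, np.nan, X)

# two-thirds training, one-third test split
perm = rng.permutation(len(X))
cut = 2 * len(X) // 3
train_idx, test_idx = perm[:cut], perm[cut:]
X_tr, y_tr = X_miss[train_idx], y[train_idx]
X_te, y_te = X[test_idx], y[test_idx]          # missing test entries replaced by their true values

# standardise with the mean and variance of the training part (ignoring missing entries)
mean = np.nanmean(X_tr, axis=0)
std = np.nanstd(X_tr, axis=0)
X_tr = (X_tr - mean) / std
X_te = (X_te - mean) / std
```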
The running time of the first stage depends on the dimensionality of the data and the “optimal” number of components. As explained in Section 3, the number of components $K$ keeps increasing as long as the BIC keeps decreasing. For each value of $K$, there are 10 repetitions of the EM algorithm in order to find several local maxima and return the best one. The quickest step is the imputation part to complete the data, since sampling from the GMM is straightforward.
Figure 1: Average test mean squared error for the Stocks data set as the amount of missing data increases (panels: 0%, 5%, 10%, 15%, 20%, 25% and 30% missing; horizontal axis: number of neurons, vertical axis: mean test error). The blue line indicates the multiple imputation scenario (MI), the red line signifies conditional mean imputation (CM) and the black line is the simple removal strategy.
The best performing network structure is selected according to the test error, that is, it is chosen once the testing is done. Although this cannot be done in practice (choosing a model on unseen data), it is shown only to provide the insight that choosing the most complex network stays very close to the optimal network structure, and in some cases it is actually better than the CM strategy (Boston housing and Stocks data).
6. Conclusion
In this paper, the task of accurate prediction is tackled on data containing missing values. The complete methodology consists of four steps built from simple and well-known methods. A mixture of Gaussians is employed to model the underlying distribution of the data, while the Extreme Learning Machine enables the multiple imputation approach to be executed on a reasonable scale. Adjustments to the Expectation-Maximisation algorithm for the Gaussian Mixture Model are given in order to tackle the missing values in the data, alongside the required updates for the conditional Gaussian Mixture Model needed to sample new data sets.
The combination of GMM and ELM allows adopting the multiple imputation approach to missing data, which has been shown to be superior in almost all tested cases to the method based on conditional mean imputation. Having a distribution to reflect uncertainty in the data due to missing values can be beneficial compared to simple ad hoc methods, but imputing only once by the conditional mean can still be severely biased, as the experiments have shown. In order to ensure stable and reliable predictions, a sufficient number of draws is required to properly represent the underlying data distribution. In such a scenario, with a potentially large number of data sets, the ELM is a suitable model due to its fast training times and good generalisation properties.
Figure 2: Average test mean squared error for all tested data sets with respect to the amount of missing values (panels: Abalone, Bank 8FM, Boston housing, Machine CPU, Stocks, Wine quality (red); horizontal axis: percentage of missing values, vertical axis: mean test error). Blue colour is the MI strategy and red colour represents the CM strategy. Solid lines represent the most complex networks while dashed lines are the best performing networks on a test set.
Figure 3: Average test mean squared error for the Abalone and Stocks data sets (horizontal axis: percentage of missing values, vertical axis: mean test error). Blue colour indicates the ensembling approach and red colour a single initialisation of the ELM. Solid lines represent the MI strategy while dashed lines are the CM strategy. The graphs correspond to the most complex networks.
The disadvantage of applying the multiple imputation procedure is a notable increase in computational
time. This can be seen as a trade-off between time and accuracy compared to alternative methods of
handling the missing values. Ignoring all incomplete samples is a poor solution, and the difference in
accuracy is considerable already at low proportions of missing values, as shown by the experiments. For
larger fractions of missing data, it is necessary to apply an appropriate procedure which avoids discarding
partially known samples, and multiple imputation provides a practical approach.
An interesting side-effect of having more missing values in the data is the potential removal of the model structure selection procedure. The models with the highest complexity have results competitive with the optimal models on the majority of data sets. This suggests that validation procedures based on training/validation errors can simply be replaced by a model with enough neurons in the hidden layer of the ELM.
References
[1] A. R. T. Donders, G. J. van der Heijden, T. Stijnen, K. G. Moons, Review: A gentle introduction to imputation of missing
values, Journal of Clinical Epidemiology 59 (10) (2006) 1087–1091.
[2] A. Sorjamaa, A. Lendasse, Y. Cornet, E. Deleersnijder, An improved methodology for filling missing values in spatiotem-
poral climate data set, Computational Geosciences 14 (2010) 55–64.
[3] A. N. Baraldi, C. K. Enders, An introduction to modern missing data analyses, Journal of School Psychology 48 (1) (2010)
5–37.
[4] P. D. Allison, Missing data: Quantitative applications in the social sciences, British Journal of Mathematical and Statistical
Psychology 55 (1) (2002) 193–196.
[5] R. J. A. Little, D. B. Rubin, Statistical Analysis with Missing Data, 2nd Edition, Wiley-Interscience, 2002.
[6] D. B. Rubin, Multiple Imputation for Nonresponse in Surveys, Wiley, 1987.
[7] C. K. Enders, Applied Missing Data Analysis, Methodology in the Social Sciences, Guilford Press, 2010.
[8] E. R. Hruschka, E. R. Hruschka Jr., N. F. F. Ebecken, Evaluating a nearest-neighbor method to substitute continuous
missing values, in: AI 2003: Advances in Artificial Intelligence, Vol. 2903 of Lecture Notes in Computer Science, Springer
Berlin Heidelberg, 2003, pp. 723–734.
[9] J. Van Hulse, T. M. Khoshgoftaar, Incomplete-case nearest neighbor imputation in software measurement data, in: Pro-
ceedings of 2007 IEEE International Conference on Information Reuse and Integration (IRI 2007), Las Vegas, NV, USA,
2007, pp. 630–637.
[10] S. Van Buuren, J. P. Brand, C. G. Groothuis-Oudshoorn, D. B. Rubin, Fully conditional specification in multivariate
imputation, Journal of Statistical Computation and Simulation 76 (12) (2006) 1049–1064.
[11] D. Titterington, A. Smith, U. Makov, Statistical Analysis of Finite Mixture Distributions, Wiley, New York, 1985.
Figure 4: Average test mean squared error for all tested data sets with respect to the amount of missing values (panels: Abalone, Bank 8FM, Boston housing, Machine CPU, Stocks, Wine quality (red); horizontal axis: percentage of missing values, vertical axis: mean test error). Blue colour is the MI strategy and red colour represents the CM strategy. Solid lines represent the most complex networks while dashed lines are the best performing networks on a test set. The results are for the ensemble of 1000 trained models.
[12] Z. Zivkovic, Improved adaptive gaussian mixture model for background subtraction, in: Proceedings of 17th International
Conference on Pattern Recognition (ICPR 2004), Vol. 2, Cambridge, UK, 2004, pp. 28–31.
[13] P. KaewTraKulPong, R. Bowden, An improved adaptive background mixture model for real-time tracking with shadow
detection, in: Video-Based Surveillance Systems, Springer US, 2002, pp. 135–144.
[14] P. A. Torres-Carrasquillo, D. A. Reynolds, J. Deller Jr, Language identification using gaussian mixture model tokenization,
in: Proceedings of 2002 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2002), Vol. 1,
Orlando, FL, USA, 2002, pp. 757–760.
[15] D. A. Reynolds, R. C. Rose, Robust text-independent speaker identification using gaussian mixture speaker models, IEEE
Transactions on Speech and Audio Processing 3 (1) (1995) 72–83.
[16] D. Povey, L. Burget, M. Agarwal, P. Akyazi, F. Kai, A. Ghoshal, O. Glembek, N. Goel, M. Karafiát, A. Rastrow, et al.,
The subspace gaussian mixture model: A structured model for speech recognition, Computer Speech & Language 25 (2)
(2011) 404–439.
[17] H. Greenspan, A. Ruf, J. Goldberger, Constrained gaussian mixture model framework for automatic segmentation of mr
brain images, IEEE Transactions on Medical Imaging 25 (9) (2006) 1233–1245.
[18] M. Ait Kerroum, A. Hammouch, D. Aboutajdine, Textural feature selection by joint mutual information based on gaussian
mixture model for multispectral image classification, Pattern Recognition Letters 31 (10) (2010) 1168–1174.
[19] A. P. Dempster, N. M. Laird, D. B. Rubin, Maximum likelihood from incomplete data via the em algorithm, Journal of
the Royal Statistical Society, Series B 39 (1) (1977) 1–37.
[20] Z. Ghahramani, M. Jordan, Learning from incomplete data, Tech. rep., Lab Memo No. 1509, CBCL Paper No. 108, MIT
AI Lab (1995).
[21] L. Hunt, M. Jorgensen, Mixture model clustering for mixed data with missing information, Computational Statistics &
Data Analysis 41 (3–4) (2003) 429–440.
[22] T. I. Lin, J. C. Lee, H. J. Ho, On fast supervised learning for normal mixture models with missing information, Pattern
Recognition 39 (6) (2006) 1177–1187.
[23] Inference from multiple imputation for missing data using mixtures of normals, Statistical Methodology 7 (3) (2010)
351–365.
[24] O. Delalleau, A. C. Courville, Y. Bengio, Efficient em training of gaussian mixtures with missing data, CoRR
abs/1209.0521, http://arxiv.org/abs/1209.0521.
[25] V. Tresp, S. Ahmad, R. Neuneier, Training neural networks with deficient data, in: Proceedings of 7th Conference on
Neural Information Processing Systems (NIPS 1993), Vol. 6 of Advances in Neural Information Processing Systems,
Pasadena, CA, USA, 1993, pp. 128–135.
[26] E. Eirola, A. Lendasse, V. Vandewalle, C. Biernacki, Mixture of gaussians for distance estimation with missing data,
Neurocomputing 131 (2014) 32–42.
[27] Q. Yu, Y. Miche, E. Eirola, M. van Heeswijk, E. Sverin, A. Lendasse, Regularized extreme learning machine for regression
with missing data, Neurocomputing 102 (2013) 45–51.
[28] H. Akaike, A new look at the statistical model identification, IEEE Transactions on Automatic Control 19 (6) (1974)
716–723.
[29] G. Schwarz, Estimating the dimension of a model, The Annals of Statistics 6 (2) (1978) 461–464.
[30] C. M. Hurvich, C.-L. Tsai, Regression and time series model selection in small samples, Biometrika 76 (2) (1989) 297–307.
[31] Y. Miche, E. Eirola, P. Bas, O. Simula, C. Jutten, A. Lendasse, M. Verleysen, Ensemble modeling with a constrained linear
system of leave-one-out outputs, in: Proceedings of 18th European Symposium on Artificial Neural Networks (ESANN
2010), Computational Intelligence and Machine Learning, Bruges, Belgium, 2010, pp. 19–24.
[32] M. van Heeswijk, Y. Miche, E. Oja, A. Lendasse, Gpu-accelerated and parallelized elm ensembles for large-scale regression,
Neurocomputing 74 (16) (2011) 2430–2437.
[33] M. van Heeswijk, Y. Miche, T. Lindh-Knuutila, P. Hilbers, T. Honkela, E. Oja, A. Lendasse, Adaptive ensemble models of
extreme learning machines for time series prediction, in: Proceedings of 19th International Conference on Artificial Neural
Networks (ICANN 2009), Vol. 5769 of Lecture Notes in Computer Science, Cyprus, 2009, pp. 305–314.
[34] D. Sovilj, A. Lendasse, O. Simula, Extending extreme learning machine with combination layer, in: Proceedings of 12th
International Work-Conference on Artificial Neural Networks (IWANN 2013), Vol. 7902 of Lecture Notes in Computer
Science, Puerto de la Cruz, Tenerife, Spain, 2013, pp. 417–426.
[35] L. Breiman, Stacked regressions, Machine Learning 24 (1) (1996) 49–64.
[36] D. Draper, Assessment and propagation of model uncertainty, Journal of the Royal Statistical Society, Series B 57 (1)
(1995) 45–97.
[37] T. W. Anderson, An Introduction to Multivariate Statistical Analysis, 3rd Edition, Wiley-Interscience, 2003.
[38] C. Bouveyron, S. Girard, C. Schmid, High-dimensional data clustering, Computational Statistics & Data Analysis 52 (1)
(2007) 502–519.
[39] G.-B. Huang, Q.-Y. Zhu, C.-K. Siew, Extreme learning machine: A new learning scheme of feedforward neural networks, in:
Proceedings of 2004 IEEE International Joint Conference on Neural Networks (IJCNN 2004), Vol. 2, Budapest, Hungary,
2004, pp. 985–990.
[40] G.-B. Huang, An insight into extreme learning machines: Random neurons, random features and kernels, Cognitive
Computation 6 (3) (2014) 376–390.
[41] A. Sorjamaa, Y. Miche, R. Weiss, A. Lendasse, Long-term prediction of time series using nne-based projection and op-
elm, in: Proceedings of 2008 IEEE World Conference on Computational Intelligence (WCCI 2008), Hong Kong, 2008, pp.
2675–2681.
[42] C. Cheng, W. P. Tay, G.-B. Huang, Extreme learning machines for intrusion detection, in: Proceedings of 2012 International Joint Conference on Neural Networks (IJCNN 2012), Brisbane, Australia, 2012, pp. 1–8.
[43] R. Minhas, A. Baradarani, S. Seifzadeh, Q. M. Jonathan Wu, Human action recognition using extreme learning machine
based on visual vocabularies, Neurocomputing 73 (10–12) (2010) 1906–1917.
[44] F. Chen, T. Ou, Sales forecasting system based on gray extreme learning machine with taguchi method in retail industry,
Expert Systems with Applications 38 (3) (2011) 1336–1345.
[45] K. Hornik, M. Stinchcombe, H. White, Multilayer feedforward networks are universal approximators, Neural Networks
2 (5) (1989) 359–366.
[46] D. Rumelhart, G. Hinton, R. Williams, Learning representations by back-propagating errors, Nature 323 (1986) 533–536.
[47] G.-B. Huang, Q.-Y. Zhu, C.-K. Siew, Extreme learning machine: Theory and applications, Neurocomputing 70 (2006) 489–501.
[48] G.-B. Huang, Q.-Y. Zhu, C.-K. Siew, Universal approximation using incremental constructive feedforward networks with random hidden nodes, IEEE Transactions on Neural Networks 17 (4) (2006) 879–892.
[49] Y. Miche, A. Sorjamaa, P. Bas, O. Simula, C. Jutten, A. Lendasse, Op-elm: Optimally-pruned extreme learning machine,
IEEE Transactions on Neural Networks 21 (1) (2010) 158–162.
[50] A. Hoerl, R. Kennard, Ridge regression: Biased estimation for nonorthogonal problems, Technometrics 12 (1) (1970) 55–67.
[51] Y. Lan, Y. C. Soh, G.-B. Huang, Constructive hidden nodes selection of extreme learning machine for regression, Neuro-
computing 73 (16–18) (2010) 3191–3199.
[52] A. Frank, A. Asuncion, UCI machine learning repository, http://archive.ics.uci.edu/ml (2012).
[53] L. Torgo, LIACC regression data sets, http://www.dcc.fc.up.pt/~ltorgo/Regression/DataSets.html.