A Robust and Regularized Extreme Learning Machine
Abstract—At a moment when the study of outlier robustness within the Extreme Learning Machine is still in its infancy, we propose a method that combines maximization of the hidden layer's information transmission, through Batch Intrinsic Plasticity (BIP), with robust estimation of the output weights. The method, named R-ELM/BIP, produces a reliable solution in the presence of corrupted data, with good generalization capability and small weight norms. Computer experiments were carried out on three regression problems using the traditional ELM, ELM with BIP, ELM with Iteratively Reweighted Least Squares as the estimation method (ROB-ELM) and our proposal (R-ELM/BIP).

I. INTRODUCTION

Machine learning problems are often contaminated by noise, which reflects inaccuracies in the observations and the stochastic nature of the underlying process [1]. This contamination may generate outliers, which can be defined intuitively as data points inconsistent with the remainder of the data set [2]. Khamis et al. (2005) [3] and Steege et al. (2012) [4] show that outliers affect both the modeling accuracy and the estimated parameters. Therefore, when fitting a model to the data, those outliers need to be identified and eliminated or, alternatively, examined closely, as they may be of main interest themselves [1].

Robust neural networks have been a subject of interest for many years and in many different applications. Liu (1993) [5] shows that the conventional back-propagation algorithm for neural network regression is robust to leverage points (corrupted input data x), but not to outliers (corrupted output data y). Larsen et al. (1998) [6] proposed a neural network optimized with the maximum a posteriori technique and a modified likelihood function that incorporates the potential risk of outliers in the data. Lee et al. (2009) [7] proposed a Welsch M-estimator radial basis function network with pruning and growing techniques for noisy time series prediction. Łobos et al. (2000) [8] present on-line techniques for robust estimation of the parameters of harmonic signals based on total least-squares criteria, which can be implemented by analogue adaptive circuits. Feng et al. (2010) [9] propose a neural network quantile regression algorithm based on a Majorization-Minimization optimization scheme and apply it to an empirical analysis of credit card portfolio data. Aladag et al. (2014) [10] propose a median neuron model multilayer feedforward (MNM-MFF) model, trained with a modified particle swarm optimization metaheuristic, to deal with forecasting performance problems caused by outliers.

In recent years, a multilayer feedforward neural network with random hidden weight values, commonly known as the Extreme Learning Machine (ELM) [11], has become very popular due to its generalization performance and much faster learning speed [12]. The random choice of the input-to-hidden-layer weights (input weights, for short) leaves only the hidden-to-output-layer weights (output weights) to be determined analytically. However, even with the considerable attention that ELM techniques have recently received in the computational intelligence and machine learning communities, the study of the effects of outliers on ELM is only in its infancy [13].

Two aspects influence the robustness properties of an ELM network: computational robustness, related to numerical stability, and outlier robustness. The first has been generally ignored, since most efforts emphasize the accuracy of the solutions [14]. These computational problems occur when the hidden layer output matrix H is ill-conditioned, as a consequence of the random selection of the input weights and biases. The linear system used to train the output weights then yields a solution that is sensitive to data perturbations and becomes a poor estimate of the true mapping [14]. Besides, it is known that the size of the output layer weights is more relevant for the generalization capability than the configuration of the neural network in terms of number of neurons and type of activation function [15], [16]. Works such as [15], [17], [18] and [19] explore this issue.

The second aspect, outlier robustness, has been explored in recent years in a few proposals that use estimation methods known to be less sensitive to outliers than Ordinary Least Squares (OLS). Huynh and Won (2008) [2] replace the Singular Value Decomposition method with Weighted Least Squares, which is similar to OLS but assigns penalties to the training patterns in order to weight their contribution to the final solution. Barros and Barreto (2013) [20] concentrate their efforts on robust classification problems, proposing an ELM that uses Iteratively Reweighted Least Squares (IRLS), named ROB-ELM. Finally, Horata et al. (2013) [13] address both aspects by applying three estimation methods, IRLS, the Multivariate Least-Trimmed Squares (MLTS) estimator and the One-Step Reweighted MLTS (RMLTS), modified by the Extended Complete Orthogonal Decomposition (ECOD), which tackles the computational problem.

That being said, it is not possible to overlook the importance of addressing both aspects. We therefore propose a method that adds both regularization and outlier robustness to ELM learning: the hidden layer output is optimized by altering the parameters of the activation functions with the recently proposed Batch Intrinsic Plasticity (BIP) method [21], and the output weights are then estimated with the robust IRLS method. BIP shapes the output distributions of the hidden neurons into exponential distributions through the adaptation of the slope and bias of each neuron's activation function. By forcing the hidden neuron activations into an exponential distribution, it maximizes the network's information transmission, owing to the high entropy of that distribution, and also leads to good generalization properties [21]. Our proposal has been tested on regression problems with artificial and real-world datasets, with promising results.
II. ALGORITHMS

A. Extreme Learning Machine

The ELM is a two-layer network with a random and fixed weight matrix in the hidden layer. Its input-to-hidden-layer weight matrix is represented by β_hdd ∈ ℝ^{p×q}, where p is the number of input attributes and q the number of hidden neurons [12]. The output of the hidden layer is given by

    h_n = φ(β_hdd^T u_n + b_n),    (1)

where h_n ∈ ℝ^q, u_n ∈ ℝ^p is the current input vector, b_n ∈ ℝ^q contains the biases and φ is a logistic activation function.

During the first step of the training phase, all inputs from the training sequence {(u_n, d_n)}, n = 1, ..., N, are presented to the network and the corresponding network states (h_n, d_n) are harvested in the matrices H and D, respectively, where d_n is the desired output.

Since the network output is given by Eq. (2), for the second and last part of training we compute the output weights β ∈ ℝ^{q×m}, which connect the hidden layer to the output neurons, by solving a linear regression problem:

    Y = β^T H.    (2)

The ordinary least squares (OLS) solution of the resulting linear system is given by the Moore-Penrose generalized inverse as follows:

    β = (H H^T)^{-1} H D^T.    (3)

In several real-world problems the matrix H H^T can be singular, undermining the use of Eq. (3). In fact, a nearly singular (yet invertible) H H^T is also a problem, because it can lead to numerically unstable results. Huang et al. [11] addressed this issue by using the Singular Value Decomposition (SVD) to compute the Moore-Penrose pseudo-inverse instead of (H H^T)^{-1} H [13]. That method supports both full and deficient column rank matrices [13], which gives the ELM some computational robustness. Unfortunately, it is also computationally expensive when dealing with large datasets and may produce an unreliable solution when the training data is corrupted by outliers.
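To make the two training steps above concrete, the following minimal MATLAB sketch implements Eqs. (1)-(3) for a generic problem. The variable names (U, D, q, B_hdd) and the logistic activation are our illustrative choices, so this should be read as a sketch of the standard ELM procedure rather than the authors' exact code.

    % Minimal ELM training sketch (assumed variable names; illustration only).
    % U: p x N matrix of training inputs, D: m x N matrix of desired outputs.
    q = 100;                                 % number of hidden neurons
    [p, N] = size(U);
    B_hdd = rand(p, q) * 2 - 1;              % random input weights in [-1, 1]
    b     = rand(q, 1) * 2 - 1;              % random biases
    phi   = @(z) 1 ./ (1 + exp(-z));         % logistic activation, Eq. (1)
    H = phi(B_hdd' * U + repmat(b, 1, N));   % q x N hidden-layer states
    B = (H * H') \ (H * D');                 % OLS output weights, Eq. (3)
    Y = B' * H;                              % network outputs, Eq. (2)
    % pinv(H') * D' would give the SVD-based pseudo-inverse solution mentioned above.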
B. Introduction to M-Estimation

An important characteristic of OLS is that it assigns the same importance to all error samples, i.e., all errors contribute in the same way to the final solution [20]. To handle this issue we could, in principle, remove the outliers from the training data, although this is not always feasible. A second approach, known as robust regression, uses estimation methods that are not as sensitive to outliers as OLS.

Huber [22] introduced the concept of M-estimation, where M stands for "maximum likelihood". Here, robustness is achieved by minimizing a function other than the sum of the squared errors [20]. Based on Huber's theory, a general M-estimator applied to the i-th output neuron minimizes the following objective function:

    J(β_i) = Σ_{n=1}^{N} ρ(d_in − β_i^T h_n),    (4)

where the function ρ(·) gives the contribution of each error e_in = d_in − y_in to the objective function, and d_in is the desired output of the i-th output neuron for the n-th linear system input sample h_n. OLS is a particular case of M-estimator, characterized by ρ(e_in) = e_in².

The function ρ should possess the following properties [23]:

Property 1: ρ(e_in) ≥ 0.
Property 2: ρ(0) = 0.
Property 3: ρ(e_in) = ρ(−e_in).
Property 4: ρ(e_in) ≥ ρ(e_i'n), for |e_in| > |e_i'n|.

Let ψ be the derivative of ρ. Differentiating J with respect to the estimated weight vector β_i and setting the result to zero, we have

    Σ_{n=1}^{N} ψ(d_in − β̂_i^T h_n) h_n^T = 0,    (5)

where 0 ∈ ℝ^q is a row vector of zeros. Then, defining the weight function w(e_in) = ψ(e_in)/e_in and letting w_in = w(e_in), the estimating equations are given by

    Σ_{n=1}^{N} w_in (d_in − β̂_i^T h_n) h_n^T = 0.    (6)

Thus, solving the estimating equations corresponds to solving a weighted least-squares problem, minimizing

    Σ_n w_in² e_in² = Σ_n w²(e_in) e_in².    (7)

It should be highlighted that the weights depend on the residuals (i.e., the estimated errors), the residuals depend upon the estimated coefficients, and the estimated coefficients depend upon the weights [20]. As a consequence, there is no closed-form equation for the estimation of β_i. An alternative is the iterative estimation method named iteratively reweighted least squares (IRLS) [23], which is often used and is explained below.
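As a concrete illustration of the weight function w(e) = ψ(e)/e, the MATLAB fragment below evaluates it for the classical Huber ρ from [22]; the tuning constant k and the function handles are our own illustrative names, and the paper itself adopts the bisquare function defined later in Eq. (9).

    % Huber M-estimator: rho, its derivative psi, and the IRLS weight w(e) = psi(e)/e.
    k   = 1.345;                                        % common Huber tuning constant
    rho = @(e) (abs(e) <= k) .* (0.5 * e.^2) + ...
               (abs(e) >  k) .* (k * abs(e) - 0.5 * k^2);
    psi = @(e) max(min(e, k), -k);                      % derivative of rho
    w   = @(e) psi(e) ./ e;                             % weight tends to 1 as e -> 0
    e   = [-4 -2 -1 -0.1 0.1 1 2 4];
    disp([e; w(e)]);                                    % small errors weigh ~1, large errors less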
C. Iteratively Reweighted Least Squares

As described in [20], [23]:

Step 1 - Provide an initial estimate β̂_i(0) using the OLS solution in Eq. (3).

Step 2 - At each iteration t, compute the residuals from the previous iteration, e_in(t−1), n = 1, ..., N, associated with the i-th output neuron, and then compute the corresponding weights w_in(t−1) = w[e_in(t−1)].

Step 3 - Solve for the new weighted least-squares estimate of β_i(t):

    β̂_i(t) = [H W(t−1) H^T]^{-1} H W(t−1) D_i^T,    (8)

where W(t−1) = diag{w_in(t−1)} is an N×N weight matrix and D_i is the desired output matrix for the i-th output neuron. Repeat Steps 2 and 3 until convergence of the estimated coefficient vector β̂_i(t).

Several weighting functions for the M-estimators can be chosen; in this work, we adopted the bisquare weighting function:

    w(e_in) = [1 − (e_in/κ)²]²,  if |e_in| ≤ κ,
              0,                 otherwise,    (9)

where the parameter κ is a tuning constant. Smaller values of κ lead to more resistance to outliers, but at the expense of lower efficiency when the errors are normally distributed [20]. In particular, κ = 4.685σ for the bisquare function, where σ is a robust estimate of the standard deviation of the errors. A common approach is to take σ = MAR/0.6745, where MAR is the median absolute residual.
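The steps above can be written compactly as in the MATLAB sketch below, which reuses H and D from the earlier ELM fragment; the iteration cap, the convergence tolerance and the single-output assumption are our own illustrative choices, not prescriptions from the paper.

    % IRLS with the bisquare weight function of Eq. (9), for one output neuron.
    d    = D(1, :);                          % desired outputs of the i-th output neuron (1 x N)
    beta = (H * H') \ (H * d');              % Step 1: OLS initialization, Eq. (3)
    for t = 1:50                             % iterate Steps 2 and 3
        e     = d - beta' * H;               % residuals e_in(t-1)
        sigma = max(median(abs(e)) / 0.6745, eps);   % robust scale estimate (MAR-based)
        kappa = 4.685 * sigma;
        w     = (1 - (e / kappa).^2).^2 .* (abs(e) <= kappa);   % bisquare weights, Eq. (9)
        W     = diag(w);                     % N x N weight matrix
        beta_new = (H * W * H') \ (H * W * d');                 % Eq. (8)
        if norm(beta_new - beta) < 1e-6, beta = beta_new; break; end
        beta = beta_new;
    end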
D. Batch Intrinsic Plasticity

BIP is an unsupervised learning rule, based on a biologically plausible mechanism, that adapts the bias (b_j) and slope (a_j) of the hidden neurons' activation functions, tuning them into more suitable regimes, maximizing information transmission and acting as a feature regularizer [21]. This is accomplished by forcing the activation of the j-th hidden neuron, equipped with a logistic-type activation function (a hyperbolic tangent in this case, see Eq. (10)), into a desired exponential distribution f_des:

    h_jn = (1 − exp(−a_j x_jn − b_j)) / (1 + exp(−a_j x_jn − b_j)).    (10)

For each hidden neuron, all the incoming synaptic sums x_j = β_hdd,j^T U are collected, where U = (u(1), ..., u(N)). Then random targets t_fdes = (t_1, ..., t_N)^T are drawn from the desired exponential output distribution, and both the targets and the collected stimuli x_j are sorted in ascending order. The model Φ(x_j) = (x_j^T, (1, ..., 1)^T) is built so that we can calculate

    (a_j, b_j)^T = (Φ(x_j)^T Φ(x_j) + λI)^{-1} Φ(x_j)^T f^{-1}(t_fdes),    (11)

where f^{-1} is the inverse of the activation function, λ > 0 is the regularization parameter and I ∈ ℝ^{2×2} is an identity matrix.
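The following MATLAB sketch shows one way to implement Eqs. (10)-(11) for a single hidden neuron j, reusing B_hdd and U from the earlier fragment. The exponential mean mu, the regularization value lambda and the clipping of the random targets away from the activation bounds are practical assumptions we added, not values specified in the paper.

    % BIP adaptation of (a_j, b_j) for hidden neuron j (illustrative sketch).
    x_j    = (B_hdd(:, j)' * U)';            % N x 1 vector of synaptic sums for neuron j
    mu     = 0.2;  lambda = 1e-3;            % desired exponential mean and ridge parameter (assumed)
    t      = -mu * log(rand(size(x_j)));     % random targets from the exponential distribution f_des
    t      = min(t, 1 - 1e-3);               % keep targets inside the activation range (practical guard)
    x_j    = sort(x_j);  t = sort(t);        % sort stimuli and targets in ascending order
    Phi    = [x_j, ones(size(x_j))];         % model Phi(x_j) = (x_j, 1)
    finv   = @(h) -log((1 - h) ./ (1 + h));  % inverse of the activation in Eq. (10)
    ab     = (Phi' * Phi + lambda * eye(2)) \ (Phi' * finv(t));   % Eq. (11): ab = (a_j; b_j)
    a_j = ab(1);  b_j = ab(2);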
E. Robust ELM with Batch Intrinsic Plasticity

In sum, the basic idea of the proposed approach is very simple: we combine the regularizing effect and the learning optimization property of BIP with the outlier robustness of the M-estimation framework and the IRLS algorithm to create an optimized ELM network. We will refer to this approach as Robust ELM/BIP (R-ELM/BIP, for short). The steps for its implementation follow below, with a sketch of the resulting workflow after the list.

Step 1 - Randomly initialize β_hdd and collect all stimuli X = (x_1, ..., x_q) with all the training input data;

Step 2 - Calculate (a_j, b_j) for all hidden neurons as described in Section II-D;

Step 3 - Re-introduce the training input data to the network and collect the network states H;

Step 4 - Finally, find β̂_i for all output neurons according to Section II-C.
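Putting the previous fragments together, Steps 1-4 can be organized as in the MATLAB outline below; bip_tune and irls_bisquare are hypothetical wrappers around the code shown in Sections II-C and II-D, so this is a sketch of the workflow rather than the authors' implementation.

    % R-ELM/BIP training outline (hypothetical helper functions).
    [p, N] = size(U);  q = 100;
    B_hdd  = rand(p, q) * 2 - 1;                 % Step 1: random input weights
    X      = B_hdd' * U;                         % stimuli x_j collected row-wise (q x N)
    for j = 1:q
        [a(j), b(j)] = bip_tune(X(j, :)');       % Step 2: BIP slope and bias (Section II-D)
    end
    Z = diag(a) * X + repmat(b', 1, N);          % a_j * x_jn + b_j
    H = (1 - exp(-Z)) ./ (1 + exp(-Z));          % Step 3: hidden states with Eq. (10)
    for i = 1:size(D, 1)
        B(:, i) = irls_bisquare(H, D(i, :));     % Step 4: robust output weights (Section II-C)
    end
    Y = B' * H;                                  % network predictions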
III. RESULTS

The experiments were carried out with three regression problems: SinC, Abalone and Boston Housing. The first is an artificial dataset composed of 2000 samples with 1 input and 1 output, generated from Eq. (12):

    y_i = sin(π x_i)/(π x_i),  i = 1, ..., N,  with −π ≤ x_i ≤ π.    (12)

The Abalone and Boston Housing Corrected (Boston, for short) datasets, taken respectively from the UCI (http://archive.ics.uci.edu/ml/index.html) and StatLib repositories, are real-world problems. The Abalone dataset offers 4177 samples with 7 inputs and 1 output, while Boston has 506 samples with 18 inputs and 1 output. For the ELM's training and testing, the sets were divided as follows: SinC (1000/1000), Abalone (2000/2177) and Boston (379/127). Besides, the attributes of all sets were scaled to [0, 1] and their target values to [-1, 1].

Following Horata's evaluation methodology [13], the outlier robustness properties of the methods presented in Section II are investigated by randomly contaminating the training data targets with one-sided or two-sided outliers. To apply those outliers, a subset K ⊂ {1, ..., N} of row indexes of D indicates which samples will be contaminated, and Δ_k ∈ ℝ^m, ∀k ∈ K, is a row vector of normally distributed errors.

Let d_k be a row of D and d̃_k be the corresponding contaminated row. For one-sided outliers,

    d̃_k = d_k + |Δ_k|,    (13)

and for two-sided outliers,

    d̃_k = d_k + Δ_k.    (14)

Each problem gives rise to sub-problems depending on the type of outlier and its contamination rate. The training targets are corrupted by one-sided or two-sided outliers, and the percentage of contaminated samples may be 10%, 20%, 30% or 40% of the total number of training samples. Hence, this setup results in 8 sub-problems for each dataset, as shown in Tab. I and II.
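A minimal MATLAB sketch of this contamination protocol is given below; the outlier standard deviation sigma_out and the use of randperm to pick the subset K are our assumptions, since the paper does not report the spread of the normally distributed errors.

    % Contaminate a fraction of the training targets with outliers (Eqs. (13)-(14)).
    rate      = 0.30;                          % contamination rate: 0.10, 0.20, 0.30 or 0.40
    sigma_out = 1.0;                           % spread of the Gaussian perturbation (assumed)
    N         = size(D, 2);
    K         = randperm(N, round(rate * N));  % indexes of the samples to contaminate
    Delta     = sigma_out * randn(size(D, 1), numel(K));
    D_one     = D;  D_one(:, K) = D(:, K) + abs(Delta);   % one-sided outliers, Eq. (13)
    D_two     = D;  D_two(:, K) = D(:, K) + Delta;        % two-sided outliers, Eq. (14)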
TABLE I. Comparison of training's mean RMSE and standard deviation of ELM, ELM/BIP, ROB-ELM and R-ELM/BIP with the artificial dataset (SinC) and real regression problems (Abalone and Boston Housing).
Outlier contamination rate (%):        10                       20                       30                       40
ELM 0.31728 ± 0.01759 0.42294 ± 0.022938 0.50394 ± 0.017375 0.54911 ± 0.016443
SinC ELM/BIP 0.30301 ± 0.018275 0.41215 ± 0.023355 0.4943 ± 0.017155 0.53996 ± 0.016893
(1 sided) R-ELM 0.32699 ± 0.018297 0.44896 ± 0.024869 0.55008 ± 0.019694 0.59641 ± 0.020741
R-ELM/BIP 0.31575 ± 0.019083 0.44451 ± 0.025284 0.55402 ± 0.018957 0.63107 ± 0.019579
ELM 0.32718 ± 0.021844 0.44715 ± 0.01762 0.55631 ± 0.023449 0.6355 ± 0.02309
SinC ELM/BIP 0.3141 ± 0.022108 0.43723 ± 0.018417 0.54661 ± 0.023217 0.62714 ± 0.023176
(2 sided) R-ELM 0.32833 ± 0.02183 0.44892 ± 0.017735 0.55872 ± 0.023568 0.63782 ± 0.022994
R-ELM/BIP 0.3167 ± 0.022154 0.44112 ± 0.018683 0.55189 ± 0.023331 0.63266 ± 0.02314
ELM 0.30937 ± 0.015178 0.41892 ± 0.014043 0.49249 ± 0.015587 0.54649 ± 0.014396
Abalone ELM/BIP 0.30962 ± 0.01533 0.41921 ± 0.01404 0.49273 ± 0.01571 0.54659 ± 0.014432
(1 sided) R-ELM 0.32316 ± 0.015803 0.45169 ± 0.014966 0.54808 ± 0.017707 0.62445 ± 0.01714
R-ELM/BIP 0.32332 ± 0.015825 0.45202 ± 0.014961 0.54818 ± 0.017724 0.62428 ± 0.017088
ELM 0.31865 ± 0.015152 0.44631 ± 0.014997 0.54589 ± 0.016201 0.63184 ± 0.016713
Abalone ELM/BIP 0.31893 ± 0.015148 0.44648 ± 0.015001 0.54601 ± 0.016026 0.63168 ± 0.016412
(2 sided) R-ELM 0.32091 ± 0.015043 0.44927 ± 0.015082 0.54944 ± 0.01625 0.63559 ± 0.01667
R-ELM/BIP 0.32104 ± 0.015045 0.44941 ± 0.015083 0.54954 ± 0.016202 0.6356 ± 0.016628
ELM 0.29169 ± 0.033615 0.40355 ± 0.030263 0.47067 ± 0.033956 0.52405 ± 0.032386
Boston ELM/BIP 0.29252 ± 0.033987 0.40409 ± 0.029703 0.47028 ± 0.033765 0.52485 ± 0.032287
(1 sided) R-ELM 0.31603 ± 0.036465 0.45043 ± 0.033717 0.54187 ± 0.040013 0.5924 ± 0.043863
R-ELM/BIP 0.3168 ± 0.035917 0.45117 ± 0.033769 0.54166 ± 0.040368 0.591 ± 0.044263
ELM 0.29624 ± 0.036144 0.42591 ± 0.034494 0.53481 ± 0.03501 0.59901 ± 0.043044
Boston ELM/BIP 0.29661 ± 0.036586 0.42572 ± 0.034833 0.53576 ± 0.035314 0.60062 ± 0.042521
(2 sided) R-ELM 0.3116 ± 0.035407 0.44557 ± 0.035873 0.5599 ± 0.037153 0.62804 ± 0.043756
R-ELM/BIP 0.31081 ± 0.036255 0.44614 ± 0.035326 0.56035 ± 0.036784 0.62959 ± 0.044111
TABLE II. Comparison of test's mean RMSE and standard deviation of ELM, ELM/BIP, ROB-ELM and R-ELM/BIP with the artificial dataset (SinC) and real regression problems (Abalone and Boston Housing).
Outlier contamination rate (%):        10                       20                       30                       40
ELM 0.11997 ± 0.0050207 0.18408 ± 0.0091126 0.26066 ± 0.0095486 0.33467 ± 0.010247
SinC ELM/BIP 0.089072 ± 0.0071982 0.16654 ± 0.010607 0.25085 ± 0.0094876 0.3269 ± 0.011963
(1 sided) R-ELM 0.087491 ± 0.0012655 0.089213 ± 0.003473 0.091219 ± 0.0015507 0.12792 ± 0.013189
R-ELM/BIP 5.0649×10−5 ± 8.5172×10−5 2.2089×10−5 ± 4.0014×10−5 4.711×10−6 ± 6.1517×10−6 4.6419×10−6 ± 9.6504×10−6
ELM 0.091381 ± 0.0030023 0.095552 ± 0.0048112 0.1019 ± 0.0067492 0.10256 ± 0.0076451
SinC ELM/BIP 0.040336 ± 0.0079798 0.057581 ± 0.010523 0.076226 ± 0.012953 0.08296 ± 0.016879
(2 sided) R-ELM 0.087293 ± 0.0011283 0.087445 ± 0.001065 0.088951 ± 0.0079636 0.088068 ± 0.0013317
R-ELM/BIP 4.4631×10−5 ± 8.7431×10−5 1.721×10−5 ± 3.2819×10−5 9.2179×10−6 ± 1.6389×10−5 4.3035×10−6 ± 7.1121×10−6
ELM 0.10728 ± 0.0062478 0.17863 ± 0.0070967 0.25384 ± 0.0090707 0.33196 ± 0.0099962
Abalone ELM/BIP 0.10809 ± 0.0055507 0.17856 ± 0.0070735 0.25435 ± 0.0087548 0.33289 ± 0.0099281
(1 sided) R-ELM 0.062609 ± 0.0012822 0.062607 ± 0.0016053 0.06265 ± 0.0015469 0.067467 ± 0.0018998
R-ELM/BIP 0.063793 ± 0.0020453 0.063137 ± 0.0015558 0.063912 ± 0.0021094 0.06857 ± 0.0019539
ELM 0.072112 ± 0.0049853 0.081733 ± 0.006677 0.0897 ± 0.0095399 0.096313 ± 0.012501
Abalone ELM/BIP 0.07261 ± 0.0035585 0.082111 ± 0.0075051 0.089434 ± 0.0083095 0.098868 ± 0.01285
(2 sided) R-ELM 0.06307 ± 0.0015705 0.062332 ± 0.001205 0.062575 ± 0.0015548 0.063167 ± 0.0018103
R-ELM/BIP 0.063934 ± 0.0015417 0.063521 ± 0.0021724 0.063724 ± 0.001802 0.063873 ± 0.0020998
ELM 0.14095 ± 0.026788 0.22239 ± 0.027087 0.29328 ± 0.032442 0.376 ± 0.034651
Boston ELM/BIP 0.13998 ± 0.020638 0.22301 ± 0.025641 0.29786 ± 0.039615 0.37361 ± 0.030934
(1 sided) R-ELM 0.067481 ± 0.0081434 0.064883 ± 0.0060752 0.069308 ± 0.011972 0.14881 ± 0.060624
R-ELM/BIP 0.069561 ± 0.0098091 0.067461 ± 0.0088994 0.073066 ± 0.010384 0.1506 ± 0.056514
ELM 0.11801 ± 0.032185 0.15066 ± 0.025102 0.19344 ± 0.033945 0.21684 ± 0.03698
Boston ELM/BIP 0.11963 ± 0.019048 0.15508 ± 0.020773 0.18842 ± 0.030141 0.21325 ± 0.036929
(2 sided) R-ELM 0.066718 ± 0.011338 0.069368 ± 0.014331 0.067116 ± 0.010247 0.087772 ± 0.023071
R-ELM/BIP 0.068507 ± 0.0099174 0.068531 ± 0.011441 0.069932 ± 0.011803 0.084385 ± 0.017704
TABLE III. Holm-Sidak's post hoc test results for the comparison of all methods with R-ELM/BIP.

                     ELM       ELM/BIP    ROB-ELM
1-sided problems
  p-values           0.0000    0.0000     0.2709
  α                  0.0170    0.0253     0.0500
2-sided problems
  p-values           0.0001    0.0005     0.0787
  α                  0.0170    0.0253     0.0500

[Fig. 1: average norm of the estimated output weights (vertical scale ×10^10) for ELM, ELM/BIP, R-ELM and R-ELM/BIP, under 1-sided and 2-sided contamination.]

All implementations were executed in MATLAB and, for each dataset and sub-problem, training and test samples were randomly drawn in 50 independent runs.

In Tab. I, we present the training's mean RMSE and standard deviation for all datasets and their respective sub-problems. The algorithms showed similar performance in training although, as the contamination rate increased, the methods without robust estimation presented better results. This behavior was expected, since the non-robust methods cannot ignore the contaminated samples at any level and try to learn even the noise.
In Tab. II, we present the test's mean RMSE and its respective standard deviation for all datasets and their respective sub-problems. With the artificial dataset SinC, it is clear that R-ELM/BIP produces the best performance. Furthermore, the methods that use M-estimators presented smaller variations in their mean RMSE in spite of the increasing contamination rate. To determine the statistical significance of the rank differences observed for each method on the three datasets, we carried out a non-parametric Friedman test (provided by MATLAB) on the ranking of the mean test RMSE for 1-sided problems, 2-sided problems and both at once. Given the null hypothesis that all algorithms are equivalent, the test provided p = 1.9368 × 10^-6, p = 9.1203 × 10^-5 and p = 1.1826 × 10^-10, respectively, all with p < 0.01. Therefore, we can reject the null hypothesis, stating that there is a statistically significant difference.

Based on this rejection, we applied the Holm-Sidak post hoc test (provided by [24]) to compare all methods with the proposed R-ELM/BIP; it tests whether the performance of each of the other methods is significantly different from that of R-ELM/BIP. In Tab. III, we present the p-values and their α, where it is clear that our proposed method is significantly different, in terms of test mean RMSE, from ELM and ELM/BIP but not from ROB-ELM.

Another aspect to be investigated is presented in Fig. 1 and Fig. 2: the average norm of the estimated output weights for each type of outlier and contamination rate. As discussed previously, the size of the output weights influences the generalization and the sensitivity to data perturbation; therefore, the methods using BIP show the best results. Even so, due to the ill-conditioned H matrix, the resulting output weights can reach very large values (see Fig. 1). In this particular case, R-ELM/BIP presents norms around 10^7, which is smaller than the other methods' norms of around 10^10. Also, from Fig. 2, we can see the evolution of the average weight norm across the different contamination rates. For the networks without BIP, the norm values not only increase with the data corruption but can also reach high values, depending on the problem. However, the networks with BIP showed norms that were almost insensitive to the corruption rate, providing much smaller values.

Fig. 2. Average weight norm with the Abalone and Boston Housing datasets, contaminated by 1-sided (1) and 2-sided (2) noise.
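For reference, the step-down Holm-Sidak correction behind Tab. III can be reproduced with a few lines of MATLAB; the raw p-value vector below reuses the 1-sided row of Tab. III, and [24] provides a ready-made routine for the same computation.

    % Holm-Sidak step-down correction (p-values from the 1-sided row of Tab. III).
    p     = [0.0000 0.0000 0.2709];          % raw p-values of ELM, ELM/BIP, ROB-ELM vs. R-ELM/BIP
    alpha = 0.05;  m = numel(p);
    [p_sorted, order] = sort(p);             % test the smallest p-value first
    alpha_sidak = 1 - (1 - alpha).^(1 ./ (m - (1:m) + 1));   % 0.0170, 0.0253, 0.0500, as in Tab. III
    reject = false(1, m);
    for j = 1:m
        if p_sorted(j) <= alpha_sidak(j)
            reject(order(j)) = true;         % significant difference w.r.t. R-ELM/BIP
        else
            break;                           % stop at the first non-rejection (step-down rule)
        end
    end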
IV. CONCLUSION

This work presented a new ELM algorithm endowed with an optimized hidden layer combined with outlier-robust estimation of the output weights. That optimization forces the hidden neurons to respond only to the few most important stimuli, preventing saturated neurons. Moreover, the IRLS promotes a learning process that takes into account the contribution of each residual to the objective function, which diminishes the influence of outliers on the final solution.

From the results in Tab. II and Figs. 1 and 2, the proposed R-ELM/BIP approach achieved results consistent with the best performances presented, with little variation along the increasing contamination rate, plus regularized output weights that provide less sensitivity to data perturbation. Even though there was no statistically significant difference in test mean RMSE between ROB-ELM and R-ELM/BIP, the output weight values given by our approach are less sensitive to the contamination rates and also to the different problems, which makes the achievable results more reliable.

As future work, we will investigate the influence of model selection and of different regularization methods on the ELM performance with corrupted data.

ACKNOWLEDGMENT

The authors would like to thank the Coordenação de Aperfeiçoamento de Pessoal de Nível Superior (CAPES) and the Fundação Núcleo de Tecnologia Industrial do Ceará (NUTEC) for the financial support.

REFERENCES

[1] G. Beliakov, A. Kelarev, and J. Yearwood, "Robust artificial neural networks and outlier detection. Technical report," CoRR, 2011. [Online]. Available: http://dblp.uni-trier.de/db/journals/corr/corr1110.html#abs-1110-0169
[2] H. T. Huynh and Y. Won, "Weighted least squares scheme for reducing effects of outliers in regression based on extreme learning machine," J. Digital Content Technol. Appl. (JDCTA), vol. 2, no. 3, pp. 40-46, 2008. [Online]. Available: http://dblp.uni-trier.de/db/journals/jdcta/jdcta2.html#HuynhW08
[3] A. Khamis, Z. Ismail, K. Haron, and A. T. Mohammed, "The effects of outliers data on neural network performance," Journal of Applied Sciences, vol. 5, no. 8, pp. 1394-1398, 2005.
[4] F. Steege, V. Stephan, and H. Grob, "Effects of noise-reduction on neural function approximation," in Proc. 20th European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning, 2012, pp. 73-78.
[5] Y. Liu, "Robust parameter estimation and model selection for neural network regression," in Advances in Neural Information Processing Systems (NIPS), J. D. Cowan, G. Tesauro, and J. Alspector, Eds. Morgan Kaufmann, 1993, pp. 192-199.
[6] J. Larsen, L. Nonboe, M. Hintz-Madsen, and L. Hansen, "Design of robust neural network classifiers," in Proceedings of the 1998 IEEE International Conference on Acoustics, Speech and Signal Processing, vol. 2, May 1998, pp. 1205-1208.
[7] C.-C. Lee, C.-L. Tsai, Y.-C. Chiang, and C.-Y. Shih, "Noisy time series prediction using M-estimator based robust radial basis function neural networks with growing and pruning techniques," Expert Systems with Applications, vol. 36, no. 3, pp. 4717-4724, 2009.
[8] T. Łobos, P. Kostyła, Z. Wacławek, and A. Cichocki, "Adaptive neural networks for robust estimation of signal parameters," The International Journal for Computation and Mathematics in Electrical and Electronic Engineering, vol. 19, no. 3, pp. 903-912, 2000.
[9] Y. Feng, R. Li, A. Sudjianto, and Y. Zhang, "Robust neural network with applications to credit portfolio data analysis," Stat Interface, vol. 3, no. 4, pp. 437-444, 2010.
[10] C. H. Aladag, E. Egrioglu, and U. Yolcu, "Robust multilayer neural network based on median neuron model," Neural Computing and Applications, vol. 24, no. 3-4, pp. 945-956, 2014.
[11] G.-B. Huang, Q.-Y. Zhu, and C.-K. Siew, "Extreme learning machine: Theory and applications," Neurocomputing, vol. 70, no. 1-3, pp. 489-501, 2006.
[12] G.-B. Huang, D. Wang, and Y. Lan, "Extreme learning machines: a survey," International Journal of Machine Learning and Cybernetics, vol. 2, no. 2, pp. 107-122, 2011.
[13] P. Horata, S. Chiewchanwattana, and K. Sunat, "Robust extreme learning machine," Neurocomputing, vol. 102, pp. 31-34, 2013 (Advances in Extreme Learning Machines, ELM 2011).
[14] G. Zhao, Z. Shen, and Z. Man, "Robust input weight selection for well-conditioned extreme learning machine," International Journal of Information Technology, vol. 17, no. 1, 2011.
[15] A. C. P. Kulaif and F. J. V. Zuben, "Improved regularization in extreme learning machines," in Annals of the Congresso Brasileiro de Inteligência Computacional (CBIC), 2013.
[16] P. L. Bartlett, "The sample complexity of pattern classification with neural networks: the size of the weights is more important than the size of the network," IEEE Transactions on Information Theory, vol. 44, no. 2, pp. 525-536, March 1998.
[17] W. Deng, Q. Zheng, and L. Chen, "Regularized extreme learning machine," in CIDM, IEEE, 2009, pp. 389-395.
[18] Y. Wang, F. Cao, and Y. Yuan, "A study on effectiveness of extreme learning machine," Neurocomputing, vol. 74, no. 16, pp. 2483-2490, 2011.
[19] J. M. Martínez-Martínez, P. Escandell-Montero, E. Soria-Olivas, J. D. Martín-Guerrero, R. Magdalena-Benedito, and J. Gómez-Sanchis, "Regularized extreme learning machine for regression problems," Neurocomputing, vol. 74, no. 17, pp. 3716-3721, 2011.
[20] A. L. B. Barros and G. A. Barreto, "Building a robust extreme learning machine for classification in the presence of outliers," in Hybrid Artificial Intelligent Systems, ser. Lecture Notes in Computer Science, J.-S. Pan, M. Polycarpou, M. Woźniak, A. C. Carvalho, H. Quintián, and E. Corchado, Eds. Springer Berlin Heidelberg, 2013, vol. 8073, pp. 588-597.
[21] K. Neumann and J. Steil, "Optimizing extreme learning machines via ridge regression and batch intrinsic plasticity," Neurocomputing, vol. 102, pp. 23-30, 2013 (Advances in Extreme Learning Machines, ELM 2011).
[22] P. J. Huber, "Robust estimation of a location parameter," Annals of Mathematical Statistics, vol. 35, no. 1, pp. 73-101, 1964.
[23] J. Fox, Applied Regression Analysis, Linear Models, and Related Methods. Sage Publications, 1997.
[24] G. Cardillo, "Holm-Sidak t-test: a routine for multiple t-test comparisons," http://www.mathworks.com/matlabcentral/fileexchange/12786.