
IEEE TRANSACTIONS ON SYSTEMS, MAN, AND CYBERNETICS—PART B: CYBERNETICS, VOL. 42, NO. 2, APRIL 2012

Extreme Learning Machine for Regression and Multiclass Classification

Guang-Bin Huang, Senior Member, IEEE, Hongming Zhou, Xiaojian Ding, and Rui Zhang

Abstract—Due to the simplicity of their implementations, least square support vector machine (LS-SVM) and proximal support vector machine (PSVM) have been widely used in binary classification applications. The conventional LS-SVM and PSVM cannot be used in regression and multiclass classification applications directly, although variants of LS-SVM and PSVM have been proposed to handle such cases. This paper shows that both LS-SVM and PSVM can be simplified further and that a unified learning framework of LS-SVM, PSVM, and other regularization algorithms, referred to as the extreme learning machine (ELM), can be built. ELM works for the "generalized" single-hidden-layer feedforward networks (SLFNs), but the hidden layer (also called the feature mapping) in ELM need not be tuned. Such SLFNs include but are not limited to SVM, the polynomial network, and the conventional feedforward neural networks. This paper shows the following: 1) ELM provides a unified learning platform with a widespread type of feature mappings and can be applied in regression and multiclass classification applications directly; 2) from the optimization method point of view, ELM has milder optimization constraints compared to LS-SVM and PSVM; 3) in theory, compared to ELM, LS-SVM and PSVM achieve suboptimal solutions and require higher computational complexity; and 4) in theory, ELM can approximate any target continuous function and classify any disjoint regions. As verified by the simulation results, ELM tends to have better scalability and achieve similar (for regression and binary class cases) or much better (for multiclass cases) generalization performance at much faster learning speed (up to thousands of times) than traditional SVM and LS-SVM.

Index Terms—Extreme learning machine (ELM), feature mapping, kernel, least square support vector machine (LS-SVM), proximal support vector machine (PSVM), regularization network.

Manuscript received August 22, 2011; accepted September 4, 2011. Date of publication October 6, 2011; date of current version March 16, 2012. This work was supported by a grant from Singapore Academic Research Fund (AcRF) Tier 1 under Project RG 22/08 (M52040128) and also a grant from China National Natural Science Foundation under Project 61075050. This paper was recommended by Editor E. Santos, Jr.

G.-B. Huang and H. Zhou are with the School of Electrical and Electronic Engineering, Nanyang Technological University, Singapore 639798 (e-mail: [email protected]; [email protected]). R. Zhang is with the School of Electrical and Electronic Engineering, Nanyang Technological University, Singapore 639798, and also with the Department of Mathematics, Northwest University, Xi'an, China 710069 (e-mail: [email protected]). X. Ding is with the School of Electronic and Information Engineering, Xi'an Jiaotong University, Xi'an, China 710049 (e-mail: [email protected]). Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org. Digital Object Identifier 10.1109/TSMCB.2011.2168604

I. INTRODUCTION

IN THE PAST two decades, due to their surprising classification capability, support vector machine (SVM) [1] and its variants [2]–[4] have been extensively used in classification applications. SVM has two main learning features: 1) in SVM, the training data are first mapped into a higher dimensional feature space through a nonlinear feature mapping function φ(x), and 2) the standard optimization method is then used to find the solution that maximizes the separating margin of the two different classes in this feature space while minimizing the training errors. With the introduction of the epsilon-insensitive loss function, the support vector method has been extended to solve regression problems [5].

As the training of SVMs involves a quadratic programming problem, the computational complexity of SVM training algorithms is usually intensive, at least quadratic with respect to the number of training examples. It is difficult to deal with large problems using single traditional SVMs [6]; instead, SVM mixtures can be used in large applications [6], [7].

Least square SVM (LS-SVM) [2] and proximal SVM (PSVM) [3] provide fast implementations of the traditional SVM. Both LS-SVM and PSVM use equality optimization constraints instead of the inequalities of the traditional SVM, which results in a direct least square solution by avoiding quadratic programming.

SVM, LS-SVM, and PSVM were originally proposed for binary classification. Different methods have been proposed in order for them to be applied to multiclass classification problems. One-against-all (OAA) and one-against-one (OAO) methods are mainly used in the implementation of SVM in multiclass classification applications [8]. OAA-SVM consists of m SVMs, where m is the number of classes. The ith SVM is trained with all of the samples in the ith class with positive labels and all the other examples from the remaining m − 1 classes with negative labels. OAO-SVM consists of m(m − 1)/2 SVMs, where each is trained with the samples from two classes only. Some encoding schemes such as minimal output coding (MOC) [9] and Bayesian coding–decoding schemes [10] have been proposed to solve multiclass problems with LS-SVM. Each class is represented by a unique binary output codeword of m bits; m outputs are used in MOC-LS-SVM in order to scale up to 2^m classes [9]. Bayes' rule-based LS-SVM uses m binary LS-SVM plug-in classifiers with their binary class probabilities inferred in a second step within the related probabilistic framework [10]. With the prior multiclass probabilities and the posterior binary class probabilities, Bayes' rule is then applied m times to infer the posterior multiclass probabilities [10]. Bayes' rule and a different coding scheme are used in PSVM for multiclass problems [11].
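To make the OAA decomposition described above concrete, the sketch below trains m independent binary classifiers (class i versus the rest) and predicts by the largest decision value. It is only an illustration of the scheme; the use of scikit-learn's SVC as the base binary classifier is an assumption for convenience and is not part of the paper's setup.

```python
# Hypothetical one-against-all (OAA) sketch: m binary SVMs, the ith trained
# with class i as positive and the remaining m-1 classes as negative.
import numpy as np
from sklearn.svm import SVC

def oaa_fit(X, y, C=1.0, gamma=1.0):
    classes = np.unique(y)
    models = []
    for c in classes:
        t = np.where(y == c, 1, -1)          # positive labels for class c only
        clf = SVC(kernel="rbf", C=C, gamma=gamma)
        clf.fit(X, t)
        models.append(clf)
    return classes, models

def oaa_predict(X, classes, models):
    # pick the class whose binary SVM gives the largest decision value
    scores = np.column_stack([m.decision_function(X) for m in models])
    return classes[np.argmax(scores, axis=1)]
```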
The decision functions of the binary SVM, LS-SVM, and PSVM classifiers have the same form

f(x) = \mathrm{sign}\left( \sum_{i=1}^{N} \alpha_i t_i K(x, x_i) + b \right)    (1)


where t_i is the corresponding target class label of the training data x_i, α_i is the Lagrange multiplier to be computed by the learning machines, and K(u, v) is a suitable kernel function to be given by users. From the network architecture point of view, SVM, LS-SVM, and PSVM can be considered as a specific type of single-hidden-layer feedforward network (SLFN) (the so-called support vector network termed by Cortes and Vapnik [1]) where the output of the ith hidden node is K(x, x_i) and the output weight linking the ith hidden node to the output node is α_i t_i. The term bias b plays an important role in SVM, LS-SVM, and PSVM: it produces the equality optimization constraints in the dual optimization problems of these methods. For example, the only difference between LS-SVM and PSVM lies in how the bias b is used in the optimization formula, while they have the same optimization constraint, resulting in different least square solutions. No learning parameter in the hidden-layer output function (kernel) K(u, v) needs to be tuned by SVM, LS-SVM, and PSVM, although some user-specified parameters need to be chosen a priori.

Extreme learning machine (ELM) [12]–[16] studies a much wider type of "generalized" SLFNs whose hidden layer need not be tuned. ELM has been attracting the attention of more and more researchers [17]–[22]. ELM was originally developed for single-hidden-layer feedforward neural networks [12]–[14] and then extended to the "generalized" SLFNs, which may not be neuron alike [15], [16]:

f(x) = h(x)\beta    (2)

where h(x) is the hidden-layer output corresponding to the input sample x and β is the output weight vector between the hidden layer and the output layer. One of the salient features of ELM is that the hidden layer need not be tuned. Essentially, ELM originally proposes to apply random computational nodes in the hidden layer, which are independent of the training data. Different from traditional learning algorithms for a neural type of SLFNs [23], ELM aims to reach not only the smallest training error but also the smallest norm of output weights. ELM [12], [13] and its variants [14]–[16], [24]–[28] mainly focus on regression applications. The latest developments of ELM have shown some relationships between ELM and SVM [18], [19], [29].
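As a minimal illustration of (2) and of the fact that the hidden layer is not tuned, the sketch below generates a random Sigmoid hidden layer once and computes the output weights as a least-squares solution. This is a simplified reading of the ELM idea under stated assumptions (Sigmoid nodes, uniformly drawn parameters), not the exact implementation used by the authors; the regularized and kernel variants appear in Sections III and IV.

```python
# Minimal random-hidden-node ELM sketch (assumptions: Sigmoid nodes,
# parameters drawn from a uniform distribution, least-squares output weights).
import numpy as np

rng = np.random.default_rng(0)

def random_hidden_layer(d, L):
    A = rng.uniform(-1.0, 1.0, size=(d, L))   # input weights a_i (never tuned)
    b = rng.uniform(-1.0, 1.0, size=L)        # biases b_i (never tuned)
    return A, b

def h(X, A, b):
    # hidden-layer output matrix H, one row h(x) per sample
    return 1.0 / (1.0 + np.exp(-(X @ A + b)))

# toy regression target
X = rng.uniform(-1, 1, size=(200, 3))
T = np.sin(X).sum(axis=1, keepdims=True)

A, b = random_hidden_layer(d=3, L=50)
H = h(X, A, b)
beta, *_ = np.linalg.lstsq(H, T, rcond=None)  # minimum-norm least-squares beta
f = h(X, A, b) @ beta                          # ELM output f(x) = h(x) beta
```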
Suykens and Vandewalle [30] described a training method for SLFNs which applies the hidden-layer output mapping as the feature mapping of SVM. However, the hidden-layer parameters need to be iteratively computed by solving an optimization problem (refer to the last paragraph of Section IV-A1 for details). As Suykens and Vandewalle stated in their work (see [30, p. 907]), the drawbacks of this method are the high computational cost and the larger number of parameters in the hidden layer. Liu et al. [18] and Frénay and Verleysen [19] show that the ELM learning approach can be applied to SVMs directly by simply replacing SVM kernels with (random) ELM kernels and that better generalization can be achieved. Different from the study of Suykens and Vandewalle [30], in which the hidden layer is parametric, the ELM hidden layer used in the studies of Liu et al. [18] and Frénay and Verleysen [19] is nonparametric, and the hidden-layer parameters need not be tuned and can be fixed once randomly generated. Liu et al. [18] suggest applying the ELM kernel in SVMs and particularly study PSVM [3] with the ELM kernel. Later, Frénay and Verleysen [19] show that the normalized ELM kernel can also be applied in the traditional SVM. Their proposed SVM with ELM kernel and the conventional SVM have the same optimization constraints (e.g., both inequality constraints and bias b are used). Recently, Huang et al. [29] further show the following: 1) SVM's maximal separating margin property and ELM's minimal norm of output weight property are actually consistent, and within the ELM framework, SVM's maximal separating margin property and Bartlett's theory on feedforward neural networks remain consistent, and 2) compared to SVM, ELM requires fewer optimization constraints and results in simpler implementation, faster learning, and better generalization performance. However, similar to SVM, inequality optimization constraints are used in [29]. Huang et al. [29] use random kernels and discard the term bias b used in the conventional SVM. However, no direct relationship has so far been found between the original ELM implementation [12]–[16] and LS-SVM/PSVM. Whether feedforward neural networks, SVM, LS-SVM, and PSVM can be unified still remains open.

Different from the studies of Huang et al. [29], Liu et al. [18], and Frénay and Verleysen [19], this paper extends ELM to LS-SVM and PSVM and provides a unified solution for LS-SVM and PSVM under equality constraints. In particular, the following contributions have been made in this paper.

1) ELM was originally developed from feedforward neural networks [12]–[16]. Different from other ELM work in the literature, this paper manages to extend ELM to kernel learning: it is shown that ELM can use a wide type of feature mappings (hidden-layer output functions), including random hidden nodes and kernels. With this extension, the unified ELM solution can be obtained for feedforward neural networks, the RBF network, LS-SVM, and PSVM.

2) Furthermore, ELM, which has higher scalability and less computational complexity, not only unifies different popular learning algorithms but also provides a unified solution to different practical applications (e.g., regression, binary, and multiclass classification). Different variants of LS-SVM and SVM are required for different types of applications; ELM avoids such trivial and tedious situations faced by LS-SVM and SVM. In the ELM method, all these applications can be resolved with one formula.

3) From the optimization method point of view, ELM and LS-SVM have the same optimization cost function; however, ELM has milder optimization constraints compared to LS-SVM and PSVM. As analyzed in this paper and further verified by simulation results over 36 wide types of data sets, compared to ELM, LS-SVM achieves suboptimal solutions (when the same kernels are used) and has higher computational complexity. As verified by simulations, the resultant ELM method can run much faster than LS-SVM. ELM with random hidden nodes can run even up to tens of thousands of times faster than SVM and LS-SVM. Different from earlier ELM works, which do not perform well on sparse data sets, the ELM method proposed in this paper can handle sparse data sets well.

4) This paper also shows that the proposed ELM method not only has universal approximation capability (of approximating any target continuous function) but also has classification capability (of classifying any disjoint regions).


II. BRIEF OF SVMS

This section briefs the conventional SVM [1] and its variants, namely, LS-SVM [2] and PSVM [3].

A. SVM

Cortes and Vapnik [1] study the relationship between SVM and multilayer feedforward neural networks and showed that SVM can be seen as a specific type of SLFN, the so-called support vector network. In 1962, Rosenblatt [31] suggested that multilayer feedforward neural networks (perceptrons) can be trained in a feature space Z of the last hidden layer. In this feature space, a linear decision function is constructed:

f(x) = \mathrm{sign}\left( \sum_{i=1}^{L} \alpha_i z_i(x) \right)    (3)

where z_i(x) is the output of the ith neuron in the last hidden layer of a perceptron. In order to find an alternative solution of z_i(x), in 1995, Cortes and Vapnik [1] proposed the SVM, which maps the data from the input space to some feature space Z through some nonlinear mapping φ(x) chosen a priori. Constrained-optimization methods are then used to find the separating hyperplane which maximizes the separating margin of the two different classes in the feature space.

Given a set of training data (x_i, t_i), i = 1, ..., N, where x_i ∈ R^d and t_i ∈ {−1, 1}, due to the nonlinear separability of these training data in the input space, in most cases, one can map the training data x_i from the input space to a feature space Z through a nonlinear mapping φ : x_i → φ(x_i). The distance between the two different classes in the feature space Z is 2/‖w‖. To maximize the separating margin and to minimize the training errors ξ_i is equivalent to

Minimize:  L_{P_{SVM}} = \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{N}\xi_i
Subject to:  t_i(w \cdot \phi(x_i) + b) \ge 1 - \xi_i,\quad i = 1, \ldots, N
             \xi_i \ge 0,\quad i = 1, \ldots, N    (4)

where C is a user-specified parameter and provides a tradeoff between the distance of the separating margin and the training error.

Based on the Karush–Kuhn–Tucker (KKT) theorem [32], to train such an SVM is equivalent to solving the following dual optimization problem:

minimize:  L_{D_{SVM}} = \frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N} t_i t_j \alpha_i \alpha_j \phi(x_i)\cdot\phi(x_j) - \sum_{i=1}^{N}\alpha_i
subject to:  \sum_{i=1}^{N} t_i \alpha_i = 0
             0 \le \alpha_i \le C,\quad i = 1, \ldots, N    (5)

where each Lagrange multiplier α_i corresponds to a training sample (x_i, t_i). Vectors x_i for which t_i(w · φ(x_i) + b) = 1 are termed support vectors [1].

Kernel functions K(u, v) = φ(u) · φ(v) are usually used in the implementation of the SVM learning algorithm. In this case, we have

minimize:  L_{D_{SVM}} = \frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N} t_i t_j \alpha_i \alpha_j K(x_i, x_j) - \sum_{i=1}^{N}\alpha_i
subject to:  \sum_{i=1}^{N} t_i \alpha_i = 0
             0 \le \alpha_i \le C,\quad i = 1, \ldots, N.    (6)

The SVM kernel function K(u, v) needs to satisfy Mercer's condition [1]. The decision function of SVM is

f(x) = \mathrm{sign}\left( \sum_{s=1}^{N_s} \alpha_s t_s K(x, x_s) + b \right)    (7)

where N_s is the number of support vectors x_s.

B. LS-SVM

Suykens and Vandewalle [2] propose a least square version of the SVM classifier. Instead of the inequality constraints (4) adopted in SVM, equality constraints are used in LS-SVM [2]. Hence, by solving a set of linear equations instead of a quadratic programming problem, one can implement the least square approach easily. LS-SVM is proven to have excellent generalization performance and low computational cost in many applications. In LS-SVM, the classification problem is formulated as

Minimize:  L_{P_{LS\text{-}SVM}} = \frac{1}{2} w \cdot w + C\,\frac{1}{2}\sum_{i=1}^{N}\xi_i^2
Subject to:  t_i(w \cdot \phi(x_i) + b) = 1 - \xi_i,\quad i = 1, \ldots, N.    (8)

Based on the KKT theorem, to train such an LS-SVM is equivalent to solving the following dual optimization problem:

L_{D_{LS\text{-}SVM}} = \frac{1}{2} w \cdot w + C\,\frac{1}{2}\sum_{i=1}^{N}\xi_i^2 - \sum_{i=1}^{N}\alpha_i\left( t_i(w \cdot \phi(x_i) + b) - 1 + \xi_i \right).    (9)

Different from the Lagrange multipliers in (5) for SVM, in LS-SVM the Lagrange multipliers α_i can be either positive or negative due to the equality constraints used. Based on the KKT theorem, we can have the optimality conditions of (9) as follows:

\frac{\partial L_{D_{LS\text{-}SVM}}}{\partial w} = 0 \;\rightarrow\; w = \sum_{i=1}^{N}\alpha_i t_i \phi(x_i)    (10a)


\frac{\partial L_{D_{LS\text{-}SVM}}}{\partial b} = 0 \;\rightarrow\; \sum_{i=1}^{N}\alpha_i t_i = 0    (10b)

\frac{\partial L_{D_{LS\text{-}SVM}}}{\partial \xi_i} = 0 \;\rightarrow\; \alpha_i = C\xi_i,\quad i = 1, \ldots, N    (10c)

\frac{\partial L_{D_{LS\text{-}SVM}}}{\partial \alpha_i} = 0 \;\rightarrow\; t_i(w \cdot \phi(x_i) + b) - 1 + \xi_i = 0,\quad i = 1, \ldots, N.    (10d)

By substituting (10a)–(10c) into (10d), the aforementioned equations can be equivalently written as

\begin{bmatrix} 0 & T^T \\ T & \frac{I}{C} + \Omega_{LS\text{-}SVM} \end{bmatrix}\begin{bmatrix} b \\ \alpha \end{bmatrix} = \begin{bmatrix} 0 & T^T \\ T & \frac{I}{C} + ZZ^T \end{bmatrix}\begin{bmatrix} b \\ \alpha \end{bmatrix} = \begin{bmatrix} 0 \\ \vec{1} \end{bmatrix}    (11)

where

Z = \begin{bmatrix} t_1\phi(x_1) \\ \vdots \\ t_N\phi(x_N) \end{bmatrix},\qquad \Omega_{LS\text{-}SVM} = ZZ^T.    (12)

The feature mapping φ(x) is a row vector,¹ T = [t_1, t_2, ..., t_N]^T, α = [α_1, α_2, ..., α_N]^T, and \vec{1} = [1, 1, ..., 1]^T. In LS-SVM, as φ(x) is usually unknown, Mercer's condition [33] can be applied to the matrix Ω_{LS-SVM}:

\Omega_{LS\text{-}SVM_{i,j}} = t_i t_j\, \phi(x_i)\cdot\phi(x_j) = t_i t_j K(x_i, x_j).    (13)

The decision function of the LS-SVM classifier is f(x) = \mathrm{sign}\left(\sum_{i=1}^{N}\alpha_i t_i K(x, x_i) + b\right).
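For concreteness, a small sketch of solving the LS-SVM system (11) with a Gaussian kernel is given below. It is a didactic illustration only, not the LS-SVMlab code used in the experiments of Section V; the kernel and parameter choices are assumptions.

```python
# Sketch: solve the LS-SVM linear system (11) for b and alpha with a Gaussian kernel.
import numpy as np

def gaussian_kernel(X1, X2, gamma):
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def lssvm_fit(X, t, C=1.0, gamma=1.0):
    N = X.shape[0]
    Omega = (t[:, None] * t[None, :]) * gaussian_kernel(X, X, gamma)  # (13)
    A = np.zeros((N + 1, N + 1))
    A[0, 1:] = t                       # first row:  [0, T^T]
    A[1:, 0] = t                       # first col:  [0; T]
    A[1:, 1:] = Omega + np.eye(N) / C  # I/C + Omega_LS-SVM
    rhs = np.concatenate(([0.0], np.ones(N)))
    sol = np.linalg.solve(A, rhs)
    b, alpha = sol[0], sol[1:]
    return b, alpha

def lssvm_predict(X_new, X, t, alpha, b, gamma=1.0):
    # f(x) = sign( sum_i alpha_i t_i K(x, x_i) + b )
    K = gaussian_kernel(X_new, X, gamma)
    return np.sign(K @ (alpha * t) + b)
```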
The Lagrange multipliers α_i are proportional to the training errors ξ_i in LS-SVM, while in the conventional SVM, many Lagrange multipliers α_i are typically equal to zero. Compared to the conventional SVM, sparsity is lost in LS-SVM [9]; this is also true for PSVM [3].

C. PSVM

Fung and Mangasarian [3] propose the PSVM classifier, which classifies data points depending on their proximity to one of two separation planes that are pushed as far apart as possible. Similar to LS-SVM, the key idea of PSVM is that the separation hyperplanes are no longer bounding planes but "proximal" planes, and this is reflected in the mathematical expressions in that the inequality constraints are changed to equality constraints. Different from LS-SVM, in the objective formula of linear PSVM, (w · w + b²) is used instead of w · w, making the optimization problem strongly convex; this has little or no effect on the original optimization problem.

The mathematical model built for linear PSVM is

Minimize:  L_{P_{PSVM}} = \frac{1}{2}(w \cdot w + b^2) + C\,\frac{1}{2}\sum_{i=1}^{N}\xi_i^2
Subject to:  t_i(w \cdot x_i + b) = 1 - \xi_i,\quad i = 1, \ldots, N.    (14)

The corresponding dual optimization problem is

L_{D_{PSVM}} = \frac{1}{2}(w \cdot w + b^2) + C\,\frac{1}{2}\sum_{i=1}^{N}\xi_i^2 - \sum_{i=1}^{N}\alpha_i\left( t_i(w \cdot x_i + b) - 1 + \xi_i \right).    (15)

By applying the KKT optimality conditions [similar to (10a)–(10d)], we have

\left( \frac{I}{C} + \Omega_{PSVM} + TT^T \right)\alpha = \left( \frac{I}{C} + ZZ^T + TT^T \right)\alpha = \vec{1}    (16)

where Z = [t_1 x_1, ..., t_N x_N]^T and Ω_{PSVM} = ZZ^T.

Similar to LS-SVM, the training data x can be mapped from the input space into a feature space φ : x → φ(x), and one can obtain the nonlinear version of PSVM: Z = [t_1φ(x_1)^T, ..., t_Nφ(x_N)^T]^T. As the feature mapping φ(x) is usually unknown, Mercer's conditions can be applied to the matrix Ω_{PSVM}: Ω_{PSVM_{i,j}} = t_i t_j φ(x_i) · φ(x_j) = t_i t_j K(x_i, x_j), which is the same as LS-SVM's kernel matrix Ω_{LS-SVM} (13). The decision function of the PSVM classifier is f(x) = \mathrm{sign}\left(\sum_{i=1}^{N}\alpha_i t_i K(x, x_i) + b\right).

III. PROPOSED CONSTRAINED-OPTIMIZATION-BASED ELM

ELM [12]–[14] was originally proposed for single-hidden-layer feedforward neural networks and was then extended to the generalized SLFNs where the hidden layer need not be neuron alike [15], [16]. In ELM, the hidden layer need not be tuned. The output function of ELM for generalized SLFNs (taking the one-output-node case as an example) is

f_L(x) = \sum_{i=1}^{L}\beta_i h_i(x) = h(x)\beta    (17)

where β = [β_1, ..., β_L]^T is the vector of the output weights between the hidden layer of L nodes and the output node and h(x) = [h_1(x), ..., h_L(x)] is the output (row) vector of the hidden layer with respect to the input x. h(x) actually maps the data from the d-dimensional input space to the L-dimensional hidden-layer feature space (ELM feature space) H, and thus, h(x) is indeed a feature mapping. For binary classification applications, the decision function of ELM is

f_L(x) = \mathrm{sign}\left( h(x)\beta \right).    (18)

Different from traditional learning algorithms [23], ELM tends to reach not only the smallest training error but also the smallest norm of output weights. According to Bartlett's theory [34], for feedforward neural networks reaching smaller training error, the smaller the norms of the weights are, the better generalization performance the networks tend to have.

¹In order to keep the notation and formula formats consistent, similar to LS-SVM [2], PSVM [3], ELM [29], and TER-ELM [22], the feature mappings φ(x) and h(x) are defined as row vectors, while the rest of the vectors are defined as column vectors in this paper unless explicitly specified.


We conjecture that this may also be true for the generalized SLFNs whose hidden layer may not be neuron alike [15], [16]. ELM is to minimize the training error as well as the norm of the output weights [12], [13]:

Minimize:  \|H\beta - T\|^2 \ \text{and} \ \|\beta\|    (19)

where H is the hidden-layer output matrix

H = \begin{bmatrix} h(x_1) \\ \vdots \\ h(x_N) \end{bmatrix} = \begin{bmatrix} h_1(x_1) & \cdots & h_L(x_1) \\ \vdots & \ddots & \vdots \\ h_1(x_N) & \cdots & h_L(x_N) \end{bmatrix}.    (20)

Seen from (18), to minimize the norm of the output weights β is actually to maximize the distance between the separating margins of the two different classes in the ELM feature space: 2/‖β‖.

The minimal norm least square method, instead of the standard optimization method, was used in the original implementation of ELM [12], [13]:

\beta = H^{\dagger}T    (21)

where H† is the Moore–Penrose generalized inverse of the matrix H [35], [36]. Different methods can be used to calculate the Moore–Penrose generalized inverse of a matrix: the orthogonal projection method, the orthogonalization method, the iterative method, and singular value decomposition (SVD) [36]. The orthogonal projection method [36] can be used in two cases: when H^T H is nonsingular and H† = (H^T H)^{-1} H^T, or when HH^T is nonsingular and H† = H^T (HH^T)^{-1}.

According to ridge regression theory [37], one can add a positive value to the diagonal of H^T H or HH^T; the resultant solution is more stable and tends to have better generalization performance. Toh [22] and Deng et al. [21] have studied the performance of ELM with this enhancement for the Sigmoid additive type of SLFNs. This section extends such study to generalized SLFNs with different types of hidden nodes (feature mappings) as well as kernels.
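A short numerical check of the two orthogonal-projection forms of H† and of the ridge-style diagonal term mentioned above; this is a sketch under the assumption that H^T H is nonsingular (random data is used purely for illustration).

```python
# Sketch: Moore-Penrose solution (21) and the ridge-stabilized variant.
import numpy as np

rng = np.random.default_rng(1)
N, L = 100, 20
H = rng.normal(size=(N, L))            # hidden-layer output matrix, cf. (20)
T = rng.normal(size=(N, 1))

beta_pinv = np.linalg.pinv(H) @ T                               # beta = H^+ T, eq. (21)
beta_proj = np.linalg.solve(H.T @ H, H.T @ T)                   # (H^T H)^{-1} H^T T, valid when H^T H is nonsingular
C = 2.0 ** 5
beta_ridge = np.linalg.solve(np.eye(L) / C + H.T @ H, H.T @ T)  # positive value I/C added to the diagonal

print("pinv vs projection agree:", np.allclose(beta_pinv, beta_proj, atol=1e-6))
print("norm without / with ridge:", np.linalg.norm(beta_proj), np.linalg.norm(beta_ridge))
```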
There is a gap between ELM and LS-SVM/PSVM, and it is not clear whether there is some relationship between ELM and LS-SVM/PSVM. This section aims to fill the gap and build the relationship between ELM and LS-SVM/PSVM.

A. Sufficient and Necessary Conditions for Universal Classifiers

1) Universal Approximation Capability: According to ELM learning theory, a widespread type of feature mappings h(x) can be used in ELM so that ELM can approximate any continuous target function (refer to [14]–[16] for details). That is, given any target continuous function f(x), there exists a series of β_i such that

\lim_{L\to+\infty}\left\| f_L(x) - f(x) \right\| = \lim_{L\to+\infty}\left\| \sum_{i=1}^{L}\beta_i h_i(x) - f(x) \right\| = 0.    (22)

With this universal approximation capability, the bias b in the optimization constraints of SVM, LS-SVM, and PSVM can be removed, and the resultant learning algorithm has milder optimization constraints. Thus, better generalization performance and lower computational complexity can be obtained. In SVM, LS-SVM, and PSVM, as the feature mapping φ(x_i) may be unknown, not every feature mapping to be used in SVM, LS-SVM, and PSVM satisfies the universal approximation condition. Obviously, a learning machine with a feature mapping which does not satisfy the universal approximation condition cannot approximate all target continuous functions. Thus, the universal approximation condition is not only a sufficient condition but also a necessary condition for a feature mapping to be widely used. This is also true for classification applications.

2) Classification Capability: Similar to the classification capability theorem of single-hidden-layer feedforward neural networks [38], we can prove the classification capability of the generalized SLFNs with the hidden-layer mapping h(x) satisfying the universal approximation condition (22).

Definition 3.1: A closed set is called a region regardless of whether it is bounded or not.

Lemma 3.1 [38]: Given disjoint regions K_1, K_2, ..., K_m in R^d, the corresponding m arbitrary real values c_1, c_2, ..., c_m, and an arbitrary region X disjoint from any K_i, there exists a continuous function f(x) such that f(x) = c_i if x ∈ K_i and f(x) = c_0 if x ∈ X, where c_0 is an arbitrary real value different from c_1, c_2, ..., c_m.

The classification capability theorem of Huang et al. [38] can be extended to generalized SLFNs which need not be neuron alike.

Theorem 3.1: Given a feature mapping h(x), if h(x)β is dense in C(R^d) or in C(M), where M is a compact set of R^d, then a generalized SLFN with such a hidden-layer mapping h(x) can separate arbitrary disjoint regions of any shape in R^d or M.

Proof: Given m disjoint regions K_1, K_2, ..., K_m in R^d and their corresponding m labels c_1, c_2, ..., c_m, according to Lemma 3.1, there exists a continuous function f(x) in C(R^d) or on one compact set of R^d such that f(x) = c_i if x ∈ K_i. Hence, if h(x)β is dense in C(R^d) or on one compact set of R^d, then it can approximate the function f(x), and there exists a corresponding generalized SLFN to implement such a function f(x). Thus, such a generalized SLFN can separate these decision regions regardless of their shapes. □

Seen from Theorem 3.1, it is a necessary and sufficient condition that the feature mapping h(x) be chosen so that h(x)β has the capability of approximating any target continuous function. If h(x)β cannot approximate any target continuous function, there may exist some shapes of regions which cannot be separated by a classifier with such a feature mapping h(x). In other words, as long as the dimensionality of the feature mapping (the number of hidden nodes L in a classifier) is large enough, the output of the classifier h(x)β can be as close to the class labels in the corresponding regions as possible.

In the binary classification case, ELM only uses a single output node, and the class label closer to the output value of ELM is chosen as the predicted class label of the input data. There are two solutions for the multiclass classification case.

1) ELM only uses a single output node, and among the multiclass labels, the class label closer to the output value of ELM is chosen as the predicted class label of the input data.


In this case, the ELM solution to the binary classification case becomes a specific case of the multiclass solution.

2) ELM uses multiple output nodes, and the index of the output node with the highest output value is considered as the label of the input data.

For the sake of readability, these two solutions are analyzed separately. It can be found that, eventually, the same solution formula is obtained for both cases.

B. Simplified Constrained-Optimization Problems

1) Multiclass Classifier With Single Output: Since ELM can approximate any target continuous function and the output of the ELM classifier h(x)β can be as close to the class labels in the corresponding regions as possible, the classification problem for the proposed constrained-optimization-based ELM with a single output node can be formulated as

Minimize:  L_{P_{ELM}} = \frac{1}{2}\|\beta\|^2 + C\,\frac{1}{2}\sum_{i=1}^{N}\xi_i^2
Subject to:  h(x_i)\beta = t_i - \xi_i,\quad i = 1, \ldots, N.    (23)

Based on the KKT theorem, to train ELM is equivalent to solving the following dual optimization problem:

L_{D_{ELM}} = \frac{1}{2}\|\beta\|^2 + C\,\frac{1}{2}\sum_{i=1}^{N}\xi_i^2 - \sum_{i=1}^{N}\alpha_i\left( h(x_i)\beta - t_i + \xi_i \right)    (24)

where each Lagrange multiplier α_i corresponds to the ith training sample. We can have the KKT optimality conditions of (24) as follows:

\frac{\partial L_{D_{ELM}}}{\partial \beta} = 0 \;\rightarrow\; \beta = \sum_{i=1}^{N}\alpha_i h(x_i)^T = H^T\alpha    (25a)
\frac{\partial L_{D_{ELM}}}{\partial \xi_i} = 0 \;\rightarrow\; \alpha_i = C\xi_i,\quad i = 1, \ldots, N    (25b)
\frac{\partial L_{D_{ELM}}}{\partial \alpha_i} = 0 \;\rightarrow\; h(x_i)\beta - t_i + \xi_i = 0,\quad i = 1, \ldots, N    (25c)

where α = [α_1, ..., α_N]^T.

2) Multiclass Classifier With Multiple Outputs: An alternative approach for multiclass applications is to let ELM have multiple output nodes instead of a single output node. An m-class classifier has m output nodes. If the original class label is p, the expected output vector of the m output nodes is t_i = [0, ..., 0, 1_p, 0, ..., 0]^T. In this case, only the pth element of t_i = [t_{i,1}, ..., t_{i,m}]^T is one, while the rest of the elements are set to zero. The classification problem for ELM with multiple output nodes can be formulated as

Minimize:  L_{P_{ELM}} = \frac{1}{2}\|\beta\|^2 + C\,\frac{1}{2}\sum_{i=1}^{N}\|\xi_i\|^2
Subject to:  h(x_i)\beta = t_i^T - \xi_i^T,\quad i = 1, \ldots, N    (26)

where ξ_i = [ξ_{i,1}, ..., ξ_{i,m}]^T is the training error vector of the m output nodes with respect to the training sample x_i. Based on the KKT theorem, to train ELM is equivalent to solving the following dual optimization problem:

L_{D_{ELM}} = \frac{1}{2}\|\beta\|^2 + C\,\frac{1}{2}\sum_{i=1}^{N}\|\xi_i\|^2 - \sum_{i=1}^{N}\sum_{j=1}^{m}\alpha_{i,j}\left( h(x_i)\beta_j - t_{i,j} + \xi_{i,j} \right)    (27)

where β_j is the vector of the weights linking the hidden layer to the jth output node and β = [β_1, ..., β_m]. We can have the corresponding KKT optimality conditions as follows:

\frac{\partial L_{D_{ELM}}}{\partial \beta_j} = 0 \;\rightarrow\; \beta_j = \sum_{i=1}^{N}\alpha_{i,j}h(x_i)^T \;\rightarrow\; \beta = H^T\alpha    (28a)
\frac{\partial L_{D_{ELM}}}{\partial \xi_i} = 0 \;\rightarrow\; \alpha_i = C\xi_i,\quad i = 1, \ldots, N    (28b)
\frac{\partial L_{D_{ELM}}}{\partial \alpha_i} = 0 \;\rightarrow\; h(x_i)\beta - t_i^T + \xi_i^T = 0,\quad i = 1, \ldots, N    (28c)

where α_i = [α_{i,1}, ..., α_{i,m}]^T and α = [α_1, ..., α_N]^T.

It can be seen from (24), (25a)–(25c), (27), and (28a)–(28c) that the single-output-node case can be considered a specific case of multiple output nodes when the number of output nodes is set to one: m = 1. Thus, we only need to consider the multiclass classifier with multiple output nodes. For both cases, the hidden-layer matrix H (20) remains the same, and the size of H is decided only by the number of training samples N and the number of hidden nodes L, which is irrelevant to the number of output nodes (number of classes).

C. Equality Constrained-Optimization-Based ELM

Different solutions to the aforementioned KKT conditions can be obtained based on concerns about efficiency for different sizes of training data sets.

1) For the Case Where the Number of Training Samples Is Not Huge: In this case, by substituting (28a) and (28b) into (28c), the aforementioned equations can be equivalently written as

\left( \frac{I}{C} + HH^T \right)\alpha = T    (29)

where

T = \begin{bmatrix} t_1^T \\ \vdots \\ t_N^T \end{bmatrix} = \begin{bmatrix} t_{11} & \cdots & t_{1m} \\ \vdots & \ddots & \vdots \\ t_{N1} & \cdots & t_{Nm} \end{bmatrix}.    (30)

From (28a) and (29), we have

\beta = H^T\left( \frac{I}{C} + HH^T \right)^{-1}T.    (31)
i − ξi , i = 1, . . . , N (26) C


The output function of the ELM classifier is

f(x) = h(x)\beta = h(x)H^T\left( \frac{I}{C} + HH^T \right)^{-1}T.    (32)

1) Single output node (m = 1): For multiclass classification, among all the multiclass labels, the predicted class label of a given testing sample is the one closest to the output of the ELM classifier. For the binary classification case, ELM needs only one output node (m = 1), and the decision function of the ELM classifier is

f(x) = \mathrm{sign}\left( h(x)H^T\left( \frac{I}{C} + HH^T \right)^{-1}T \right).    (33)

2) Multiple output nodes (m > 1): For multiclass cases, the predicted class label of a given testing sample is the index of the output node which has the highest output value for the given testing sample. Let f_j(x) denote the output function of the jth output node, i.e., f(x) = [f_1(x), ..., f_m(x)]^T; then, the predicted class label of sample x is

\mathrm{label}(x) = \arg\max_{i\in\{1,\ldots,m\}} f_i(x).    (34)
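The following sketch applies (32)–(34) directly: one-hot targets as in (30), output weights from the N × N form, and argmax decoding of the predicted class. It is a minimal illustration with random Sigmoid hidden nodes; the toy data and parameter values are made-up assumptions, not the paper's experiments.

```python
# Sketch of the unified ELM classifier, eqs. (32)-(34):
# beta = H^T (I/C + H H^T)^{-1} T, one-hot targets, argmax label decoding.
import numpy as np

rng = np.random.default_rng(2)

def sigmoid_features(X, A, b):
    return 1.0 / (1.0 + np.exp(-(X @ A + b)))

# toy 3-class data in 2-D
N, d, m, L, C = 150, 2, 3, 100, 2.0 ** 5
y = rng.integers(0, m, size=N)
X = rng.normal(size=(N, d)) + 3.0 * np.eye(m)[y][:, :2]    # shift each class to its own center

T = np.zeros((N, m)); T[np.arange(N), y] = 1.0             # one-hot target matrix, cf. (30)
A, b = rng.uniform(-1, 1, (d, L)), rng.uniform(-1, 1, L)   # random, untuned hidden layer
H = sigmoid_features(X, A, b)

beta = H.T @ np.linalg.solve(np.eye(N) / C + H @ H.T, T)   # eq. (32)
pred = np.argmax(sigmoid_features(X, A, b) @ beta, axis=1) # label(x) = argmax_i f_i(x), eq. (34)
print("training accuracy:", (pred == y).mean())
```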
output vector) h(x) = [h1 (x), . . . , hL (x)] is usually known
to users. According to [15] and [16], almost all nonlinear
2) For the Case Where the Number of Training Samples is
piecewise continuous functions can be used as the hidden-node
Huge: If the number of training data is very large, for example,
output functions, and thus, the feature mappings used in ELM
it is much larger than the dimensionality of the feature space,
can be very diversified.
N  L, we have an alternative solution. From (28a) and (28b),
For example, as mentioned in [29], we can have
we have
h(x) = [G(a1 , b1 , x), . . . , G(aL , bL , x)] (40)
β = CHT ξ (35)
1  T †
ξ= H β. (36) where G(a, b, x) is a nonlinear piecewise continuous function
C satisfying ELM universal approximation capability theorems
From (28c), we have [14]–[16] and {(ai , bi )}L
i=1 are randomly generated according
to any continuous probability distribution. For example, such
1 †
Hβ − T + (HT ) β = 0 nonlinear piecewise continuous functions can be as follows.
 C  1) Sigmoid function
T 1 T †
H H + (H ) β = HT T
C 1
 −1 G(a, b, x) = . (41)
I T
1 + exp (−(a · x + b))
β= +H H HT T. (37)
C 2) Hard-limit function
In this case, the output function of ELM classifier is 
1, if a · x − b ≥ 0
 −1 G(a, b, x) = (42)
I 0, otherwise.
f (x) = h(x)β = h(x) + HT H HT T. (38)
C 3) Gaussian function
 
1) Single-output node (m = 1): For multiclass classifica- G(a, b, x) = exp −bx − a2 . (43)
tions, the predicted class label of a given testing sample
is the class label closest to the output value of ELM clas- 4) Multiquadric function
sifier. For binary classification case, the decision function  1/2
of ELM classifier is G(a, b, x) = x − a2 + b2 . (44)
  −1 
I T T Sigmoid and Gaussian functions are two of the major hidden-
f (x) = sign h(x) +H H H T . (39)
C layer output functions used in the feedforward neural networks
and RBF networks2 , respectively. Interestingly, ELM with
2) Multioutput nodes (m > 1): The predicted class label of
a given testing sample is the index of the output node 2 Readers can refer to [39] for the difference between ELM and RBF
which has the highest output. networks.


IV. DISCUSSIONS

A. Random Feature Mappings and Kernels

1) Random Feature Mappings: Different from SVM, LS-SVM, and PSVM, in ELM, a feature mapping (hidden-layer output vector) h(x) = [h_1(x), ..., h_L(x)] is usually known to users. According to [15] and [16], almost all nonlinear piecewise continuous functions can be used as the hidden-node output functions, and thus, the feature mappings used in ELM can be very diversified.

For example, as mentioned in [29], we can have

h(x) = [G(a_1, b_1, x), \ldots, G(a_L, b_L, x)]    (40)

where G(a, b, x) is a nonlinear piecewise continuous function satisfying the ELM universal approximation capability theorems [14]–[16] and {(a_i, b_i)}_{i=1}^{L} are randomly generated according to any continuous probability distribution. For example, such nonlinear piecewise continuous functions can be as follows.

1) Sigmoid function:
G(a, b, x) = \frac{1}{1 + \exp\left( -(a \cdot x + b) \right)}.    (41)

2) Hard-limit function:
G(a, b, x) = \begin{cases} 1, & \text{if } a \cdot x - b \ge 0 \\ 0, & \text{otherwise.} \end{cases}    (42)

3) Gaussian function:
G(a, b, x) = \exp\left( -b\|x - a\|^2 \right).    (43)

4) Multiquadric function:
G(a, b, x) = \left( \|x - a\|^2 + b^2 \right)^{1/2}.    (44)

Sigmoid and Gaussian functions are two of the major hidden-layer output functions used in feedforward neural networks and RBF networks,² respectively. Interestingly, ELM with hard-limit [24] and multiquadric functions can have good generalization performance as well.

²Readers can refer to [39] for the difference between ELM and RBF networks.
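The four node types (41)–(44) can be written down directly; the helper below is a sketch in which the parameter ranges are arbitrary choices for illustration, and a and b are random and never tuned, as the text above describes.

```python
# Sketch of the hidden-node output functions (41)-(44) and the map h(x) of (40).
import numpy as np

def G_sigmoid(a, b, x):                      # (41)
    return 1.0 / (1.0 + np.exp(-(a @ x + b)))

def G_hardlimit(a, b, x):                    # (42)
    return 1.0 if a @ x - b >= 0 else 0.0

def G_gaussian(a, b, x):                     # (43)
    return np.exp(-b * np.sum((x - a) ** 2))

def G_multiquadric(a, b, x):                 # (44)
    return np.sqrt(np.sum((x - a) ** 2) + b ** 2)

def random_feature_map(x, params, G):
    # h(x) = [G(a_1, b_1, x), ..., G(a_L, b_L, x)]   (40)
    return np.array([G(a, b, x) for a, b in params])

rng = np.random.default_rng(4)
L, d = 5, 3
params = [(rng.uniform(-1, 1, d), rng.uniform(0, 1)) for _ in range(L)]
x = rng.normal(size=d)
print(random_feature_map(x, params, G_multiquadric))
```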
Suykens and Vandewalle [30] described a training method for SLFNs which applies the hidden-layer output mapping as the feature mapping of SVM. However, different from ELM, where the hidden layer is not parametric and need not be tuned, the feature mapping of their SVM implementation is parametric, and the hidden-layer parameters need to be iteratively computed by solving an optimization problem. Their learning algorithm was briefed as follows:

minimize:  r\|w\|^2    (45)
subject to:
C1: QP subproblem:
    w = \sum_{i=1}^{N}\alpha_i^* t_i \tanh(Vx_i + B)
    \alpha_i^* = \arg\max_{\alpha_i} Q\left( \alpha_i; \tanh(Vx_i + B) \right)
    0 \le \alpha_i^* \le c
C2: \|V(:); B\|^2 \le \gamma
C3: r is the radius of the smallest ball containing \{\tanh(Vx_i) + B\}_{i=1}^{N}    (46)

where V denotes the interconnection matrix for the hidden layer, B is the bias vector, (:) is a columnwise scan of the interconnection matrix for the hidden layer, and γ is a positive constant. In addition, Q is the cost function of the corresponding SVM dual problem

\max_{\alpha_i} Q\left( \alpha_i; K(x_i, x_j) \right) = -\frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N} t_i t_j \alpha_i \alpha_j K(x_i, x_j) + \sum_{i=1}^{N}\alpha_i.    (47)

QP subproblems need to be solved for the hidden-node parameters V and B, while the hidden-node parameters of ELM are randomly generated and known to users.

2) Kernels: If a feature mapping h(x) is unknown to users, one can apply Mercer's conditions on ELM. We can define a kernel matrix for ELM as follows:

\Omega_{ELM} = HH^T:\quad \Omega_{ELM_{i,j}} = h(x_i)\cdot h(x_j) = K(x_i, x_j).    (48)

Then, the output function of the ELM classifier (32) can be written compactly as

f(x) = h(x)H^T\left( \frac{I}{C} + HH^T \right)^{-1}T = \begin{bmatrix} K(x, x_1) \\ \vdots \\ K(x, x_N) \end{bmatrix}^T\left( \frac{I}{C} + \Omega_{ELM} \right)^{-1}T.    (49)

In this specific case, similar to SVM, LS-SVM, and PSVM, the feature mapping h(x) need not be known to users; instead, its corresponding kernel K(u, v) (e.g., K(u, v) = exp(−γ‖u − v‖²)) is given to users. The dimensionality L of the feature space (number of hidden nodes) need not be given either.
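A sketch of the kernel form (49) with the Gaussian kernel, where neither h(x) nor L is needed explicitly; the kernel choice and parameters here are illustrative assumptions.

```python
# Sketch of kernel ELM, eq. (49):
# f(x) = [K(x, x_1), ..., K(x, x_N)] (I/C + Omega_ELM)^{-1} T.
import numpy as np

def gaussian_kernel(X1, X2, gamma):
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def kernel_elm_fit(X, T, C, gamma):
    Omega = gaussian_kernel(X, X, gamma)                     # Omega_ELM, eq. (48)
    return np.linalg.solve(np.eye(len(X)) / C + Omega, T)    # (I/C + Omega)^{-1} T

def kernel_elm_predict(X_new, X, W, gamma):
    # each row of the kernel block is [K(x, x_1), ..., K(x, x_N)]
    return gaussian_kernel(X_new, X, gamma) @ W

# usage note: T is the target matrix (one-hot columns for classification,
# real values for regression); the same two functions cover both cases.
```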


3) Feature Mapping Matrix: In ELM, H = [h(x_1)^T, ..., h(x_N)^T]^T is called the hidden-layer output matrix (or feature mapping matrix) because it represents the hidden-layer outputs of the given N training samples. h(x_i) denotes the output of the hidden layer with regard to the input sample x_i. The feature mapping h(x_i) maps the data x_i from the input space to the hidden-layer feature space, and the feature mapping matrix H is irrelevant to the target t_i. As observed from the essence of the feature mapping, it is reasonable to have the feature mapping matrix independent of the target values t_i. However, in both LS-SVM and PSVM, the feature mapping matrix Z = [t_1φ(x_1)^T, ..., t_Nφ(x_N)^T]^T (12) is designed to depend on the targets t_i of the training samples x_i.

B. ELM: Unified Learning Mode for Regression, Binary, and Multiclass Classification

As observed from (32) and (38), ELM has unified solutions for regression, binary, and multiclass classification. The kernel matrix Ω_ELM = HH^T is only related to the input data x_i and the number of training samples. The kernel matrix Ω_ELM is relevant neither to the number of output nodes m nor to the training target values t_i. However, in multiclass LS-SVM, aside from the input data x_i, the kernel matrix Ω_M (52) also depends on the number of output nodes m and the training target values t_i.

For the multiclass case with m labels, LS-SVM uses m output nodes in order to encode the multiple classes, where t_{i,j} denotes the output value of the jth output node for the training data x_i [10]. The m outputs can be used to encode up to 2^m different classes. For the multiclass case, the primal optimization problem of LS-SVM can be given as [10]

Minimize:  L_{P_{LS\text{-}SVM}}^{(m)} = \frac{1}{2}\sum_{j=1}^{m} w_j \cdot w_j + C\,\frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{m}\xi_{i,j}^2
Subject to:
    t_{i,1}(w_1 \cdot \phi_1(x_i) + b_1) = 1 - \xi_{i,1}
    t_{i,2}(w_2 \cdot \phi_2(x_i) + b_2) = 1 - \xi_{i,2}
    \ldots
    t_{i,m}(w_m \cdot \phi_m(x_i) + b_m) = 1 - \xi_{i,m}
    i = 1, \ldots, N.    (50)

Similar to the LS-SVM solution (11) for binary classification, with the KKT conditions, the corresponding LS-SVM solution for multiclass cases can be obtained as follows:

\begin{bmatrix} 0 & T^T \\ T & \Omega_M \end{bmatrix}\begin{bmatrix} b_M \\ \alpha_M \end{bmatrix} = \begin{bmatrix} 0 \\ \vec{1} \end{bmatrix}    (51)

\Omega_M = \mathrm{blockdiag}\left( \Omega^{(1)} + \frac{I}{C}, \ldots, \Omega^{(m)} + \frac{I}{C} \right)
\Omega^{(j)}_{kl} = t_{k,j}\, t_{l,j}\, K^{(j)}(x_k, x_l)
b_M = [b_1, \ldots, b_m]
\alpha_M = [\alpha_{1,1}, \ldots, \alpha_{N,1}, \ldots, \alpha_{1,m}, \ldots, \alpha_{N,m}]    (52)

K^{(j)}(x_k, x_l) = \phi_j(x_k)\cdot\phi_j(x_l) = \exp\left( -\frac{\|x_k - x_l\|^2}{\sigma_j^2} \right),\quad j = 1, \ldots, m.    (53)

Fig. 1. Scalability of different classifiers: an example on the Letter data set. The training time spent by LS-SVM and ELM (Gaussian kernel) increases sharply when the number of training data increases. However, the training time spent by ELM with Sigmoid additive nodes and multiquadric function nodes increases very slowly when the number of training data increases.

Seen from (52), the multiclass LS-SVM actually uses m binary-class LS-SVMs concurrently for m class labels; each of the m binary-class LS-SVMs may have a different kernel matrix Ω^(j), j = 1, ..., m. However, in any case, ELM has one hidden layer linking to all the m output nodes. In multiclass LS-SVM, different kernels may be used in each individual binary LS-SVM, and the jth LS-SVM uses the kernel K^(j)(u, v). Taking the Gaussian kernel as an example, K^(j)(u, v) = exp(−‖x_k − x_l‖²/σ_j²); from a practical point of view, it may be time consuming and tedious for users to choose different kernel parameters σ_j, and thus, one may set a common value σ_j = σ for all the kernels. In multiclass LS-SVM, the size of Ω_M is Nm × Nm, which is related to the number of output nodes m. However, in ELM, the size of the kernel matrix Ω_ELM = HH^T is N × N, which is fixed for all the regression, binary, and multiclass classification cases.

C. Computational Complexity and Scalability

For LS-SVM and PSVM, the main computational cost comes from calculating the Lagrange multipliers α based on (11) and (16). Obviously, ELM computes α based on a simpler method (29). More importantly, in large-scale applications, instead of HH^T (size: N × N), ELM can get a solution based on (37), where H^T H (size: L × L) is used. As in most applications the number of hidden nodes L can be much smaller than the number of training samples, L ≪ N, the computational cost is reduced dramatically. For the case L ≪ N, ELM can use H^T H (size: L × L). Compared with LS-SVM and PSVM, which use HH^T (size: N × N), ELM has much better computational scalability with regard to the number of training samples N (cf. Fig. 1 for an example).

In order to reduce the computational cost of LS-SVM in large-scale problems, fixed-size LS-SVM has been proposed by Suykens et al. [40]–[44]. Fixed-size LS-SVM uses an M-sample subset of the original training data set (M ≪ N) to compute a finite dimensional approximation φ̂(x) to the feature map φ(x). However, different from fixed-size LS-SVM, if L ≪ N, the L × L solution of ELM still uses the entire set of N training samples. In any case, the feature map h(x) of ELM is not approximated. In fact, the feature map h(x) of ELM is randomly generated and independent of the training samples (if random hidden nodes are used). The kernel matrix of fixed-size LS-SVM is built with the subset of size M ≪ N, while the kernel matrix of ELM is built with the entire data set of size N in all cases.

D. Difference From Other Regularized ELMs

Toh [22] and Deng et al. [21] proposed two different types of weighted regularized ELMs.

The total error rate (TER) ELM [22] uses m output nodes for m-class classification applications. In TER-ELM, the counting cost function adopts a quadratic approximation. The OAA method is used in the implementation of TER-ELM in multiclass classification applications. Essentially, TER-ELM consists of m binary TER-ELMs, where the jth TER-ELM is trained with all of the samples in the jth class with positive labels and all the other examples from the remaining m − 1 classes with negative labels. Suppose that there are m_j^+ positive category patterns and m_j^− negative category patterns in the jth binary TER-ELM. We have a positive output y_j^+ = (τ + η)\vec{1}_j^+ for the jth class of samples and a negative class output y_j^− = (τ − η)\vec{1}_j^− for all the non-jth class samples, where \vec{1}_j^+ = [1, ..., 1]^T ∈ R^{m_j^+} and \vec{1}_j^− = [1, ..., 1]^T ∈ R^{m_j^−}. A common setting for the threshold (τ) and bias (η) is used for all the m outputs. The output weight vector β_j in the jth binary TER-ELM is calculated as

\beta_j = \left( \frac{1}{m_j^-}H_j^{-T}H_j^- + \frac{1}{m_j^+}H_j^{+T}H_j^+ \right)^{-1}\left( \frac{1}{m_j^-}H_j^{-T}y_j^- + \frac{1}{m_j^+}H_j^{+T}y_j^+ \right)    (54)

where H_j^+ and H_j^− denote the hidden-layer matrices of the jth binary TER-ELM corresponding to the positive and negative samples, respectively.

By defining two class-specific diagonal weighting matrices W_j^+ = diag(0, ..., 0, 1/m_j^+, ..., 1/m_j^+) and W_j^− = diag(1/m_j^−, ..., 1/m_j^−, 0, ..., 0), the solution formula (54) of TER-ELM can be written as

\beta_j = \left( \frac{I}{C} + H_j^T W_j H_j \right)^{-1}H_j^T W_j y_j    (55)

where W_j = W_j^+ + W_j^− and the elements of H_j and y_j are ordered according to the positive and negative samples of the two classes (the jth class samples and all the non-jth class samples). In order to improve the stability of the learning, I/C is introduced in the aforementioned formula. If the dimensionality of the hidden layer is much larger than the number of training data (L ≫ N), an alternative solution suggested in [22] is

\beta_j = H_j^T\left( \frac{I}{C} + W_j H_j H_j^T \right)^{-1}W_j y_j.    (56)

Kernels and generalized feature mappings are not considered in TER-ELM.
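For reference, the class-balanced solution (55) is a single regularized weighted least squares per output node; the sketch below assumes that the samples of H_j and y_j are ordered positive-then-negative as described above, and it is only an illustration of the formula, not the TER-ELM implementation of [22].

```python
# Sketch of the weighted regularized solution (55):
# beta_j = (I/C + H^T W H)^{-1} H^T W y, with class-balanced weights.
import numpy as np

def weighted_output_weights(H_pos, H_neg, y_pos, y_neg, C):
    H = np.vstack([H_pos, H_neg])
    y = np.concatenate([y_pos, y_neg])
    w = np.concatenate([np.full(len(H_pos), 1.0 / len(H_pos)),   # 1/m_j^+ on positive samples
                        np.full(len(H_neg), 1.0 / len(H_neg))])  # 1/m_j^- on negative samples
    HTW = H.T * w                       # H^T W without forming the diagonal matrix explicitly
    L = H.shape[1]
    return np.linalg.solve(np.eye(L) / C + HTW @ H, HTW @ y)
```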


Deng et al. [21] mainly focus on the case where L < N, and (37) of ELM and the solution formula of Deng et al. [21] look similar to each other. However, different from the ELM solutions provided in this paper, Deng et al. [21] do not consider kernels and generalized feature mappings in their weighted regularized ELM. In the proposed ELM solutions, the L hidden nodes may have different types of hidden-node output functions h_i(x): h(x) = [h_1(x), ..., h_L(x)], while in [21], all the hidden nodes use the Sigmoid type of activation function. Deng et al. [21] do not handle the alternative solution (31). Seen from (37), the multivariate polynomial model [45] can be considered as a specific case of ELM.

The original solutions (21) of ELM [12], [13], [26], TER-ELM [22], and the weighted regularized ELM [21] are not able to apply kernels in their implementations. With the newly suggested approach, kernels can be used in ELM [cf. (49)].

E. Milder Optimization Constraints

In LS-SVM, as the feature mapping φ(x) is usually unknown, it is reasonable to think that the separating hyperplane in LS-SVM may not necessarily pass through the origin in the LS-SVM feature space, and thus, a term bias b is required in its optimization constraints: t_i(w · φ(x_i) + b) = 1 − ξ_i. The corresponding KKT condition (necessary condition) [cf. (10b)] for the conventional LS-SVM is \sum_{i=1}^{N}\alpha_i t_i = 0. Poggio et al. [46] prove in theory that the term bias b is not required for positive definite kernels and that it is not incorrect to have the term bias b in the SVM model. Different from the analysis of Poggio et al. [46], Huang et al. [29] show that, from the practical and universal approximation points of view, the term bias b should not be given in ELM learning.

According to ELM theories [12]–[16], almost all nonlinear piecewise continuous functions used as feature mappings can make ELM satisfy the universal approximation capability, and the separating hyperplane of ELM basically tends to pass through the origin in the ELM feature space. There is no term bias b in the optimization constraint of ELM, h(x_i)β = t_i − ξ_i, and thus, different from LS-SVM, ELM does not need to satisfy the condition \sum_{i=1}^{N}\alpha_i t_i = 0. Although LS-SVM and ELM have the same primal optimization formula, ELM has milder optimization constraints than LS-SVM, and thus, compared to ELM, LS-SVM obtains a suboptimal optimization.

The differences and relationships among ELM, LS-SVM/PSVM, and SVM are summarized in Table I.

V. PERFORMANCE VERIFICATION

This section compares the performance of different algorithms (SVM, LS-SVM, and ELM) on real-world benchmark regression, binary, and multiclass classification data sets. In order to test the performance of the proposed ELM with various feature mappings on supersmall data sets, we have also tested ELM on the XOR problem.

A. Benchmark Data Sets

In order to extensively verify the performance of the different algorithms, wide types of data sets have been tested in our simulations, which are of small sizes, low dimensions, large sizes, and/or high dimensions. These data sets include 12 binary classification cases, 12 multiclass classification cases, and 12 regression cases. Most of the data sets are taken from the UCI Machine Learning Repository [47] and Statlib [48].

TABLE II — Specification of Binary Classification Problems (table not reproduced)

1) Binary Class Data Sets: The 12 binary class data sets (cf. Table II) can be classified into four groups of data:
1) data sets with relatively small size and low dimensions, e.g., Pima Indians diabetes, Statlog Australian credit, Bupa Liver disorders [47], and Banana [49];
2) data sets with relatively small size and high dimensions, e.g., the leukemia data set [50] and the colon microarray data set [51];
3) data sets with relatively large size and low dimensions, e.g., the Star/Galaxy-Bright data set [52], the Galaxy Dim data set [52], and the mushroom data set [47];
4) data sets with large size and high dimensions, e.g., the adult data set [47].

The leukemia data set was originally taken from a collection of leukemia patient samples [53]. The data set consists of 72 samples: 25 samples of AML and 47 samples of ALL. Each sample of the leukemia data set is measured over 7129 genes (cf. Leukemia in Table II). The colon microarray data set consists of 22 normal and 40 tumor tissue samples. In this data set, each sample of the colon microarray data set contains 2000 genes (cf. Colon in Table II).

Performances of the different algorithms have also been tested on both the leukemia data set and the colon microarray data set after applying the minimum-redundancy–maximum-relevance feature selection method [54] (cf. Leukemia (Gene Sel) and Colon (Gene Sel) in Table II).

2) Multiclass Data Sets: The 12 multiclass data sets (cf. Table III) can be classified into four groups of data as well:
1) data sets with relatively small size and low dimensions, e.g., Iris, Glass Identification, and Wine [47];
2) data sets with relatively medium size and medium dimensions, e.g., Vowel Recognition, Statlog Vehicle Silhouettes, and Statlog Image Segmentation [47];
3) data sets with relatively large size and medium dimensions, e.g., Letter and Shuttle [47];
4) data sets with large size and/or large dimensions, e.g., DNA, Satimage [47], and USPS [50].

3) Regression Data Sets: The 12 regression data sets (cf. Table IV) can be classified into three groups of data:
1) data sets with relatively small size and low dimensions, e.g., Basketball, Strike [48], Cloud, and Autoprice [47];
simulations, which are of small sizes, low dimensions, large e.g., Basketball, Strike [48], Cloud, and Autoprice [47];


2) data sets with relatively small size and medium dimensions, e.g., Pyrim, Housing [47], Bodyfat, and Cleveland [48];
3) data sets with relatively large size and low dimensions, e.g., Balloon, Quake, Space-ga [48], and Abalone [47].

The column "random perm" in Tables II–IV shows whether the training and testing data of the corresponding data sets are reshuffled at each trial of simulation. If the training and testing data of the data sets remain fixed for all trials of simulations, it is marked "No." Otherwise, it is marked "Yes."

TABLE I — Feature Comparisons Among ELM, LS-SVM, and SVM (table not reproduced)
TABLE III — Specification of Multiclass Classification Problems (table not reproduced)
TABLE IV — Specification of Regression Problems (table not reproduced)

B. Simulation Environment Settings

The simulations of the different algorithms on all the data sets except for the Adult, Letter, Shuttle, and USPS data sets are carried out in the MATLAB 7.0.1 environment running on a Core 2 Quad, 2.66-GHz CPU with 2-GB RAM. The codes used for SVM and LS-SVM are downloaded from [55] and [56], respectively. Simulations on large data sets (e.g., the Adult, Letter, Shuttle, and USPS data sets) are carried out on a high-performance computer with a 2.52-GHz CPU and 48-GB RAM. The symbol "∗" marked in Tables VI and VII indicates that the corresponding data sets are tested on such a high-performance computer.


Fig. 2. Performances of LS-SVM and ELM with Gaussian kernel are sensitive to the user-specified parameters (C, γ): An example on the Satimage data set. (a) LS-SVM with Gaussian kernel. (b) ELM with Gaussian kernel.

Fig. 3. Performance of ELM (with Sigmoid additive node and multiquadric RBF node) is not very sensitive to the user-specified parameters (C, L), and good testing accuracies can be achieved as long as L is large enough: An example on the Satimage data set. (a) ELM with Sigmoid additive node. (b) ELM with multiquadric RBF node.

C. User-Specified Parameters

The popular Gaussian kernel function K(u, v) = exp(−γ‖u − v‖^2) is used in SVM, LS-SVM, and ELM. ELM performance is also tested in the cases of the Sigmoid type of additive hidden node and the multiquadric RBF hidden node. In order to achieve good generalization performance, the cost parameter C and kernel parameter γ of SVM, LS-SVM, and ELM need to be chosen appropriately. We have tried a wide range of C and γ. For each data set, we have used 50 different values of C and 50 different values of γ, resulting in a total of 2500 pairs of (C, γ). The 50 different values of C and γ are {2^−24, 2^−23, . . . , 2^24, 2^25}.

It is known that the performance of SVM is sensitive to the combination of (C, γ). Similar to SVM, the generalization performance of LS-SVM and ELM with Gaussian kernel depends closely on the combination of (C, γ) as well (see Fig. 2 for the sensitivity of LS-SVM and ELM with Gaussian kernel to the user-specified parameters (C, γ)). The best generalization performance of SVM, LS-SVM, and ELM with Gaussian kernel is usually achieved in a very narrow range of such combinations. Thus, the best combination of (C, γ) of SVM, LS-SVM, and ELM with Gaussian kernel needs to be chosen for each data set.

For ELM with Sigmoid additive hidden node and multiquadric RBF hidden node, h(x) = [G(a_1, b_1, x), . . . , G(a_L, b_L, x)], where G(a, b, x) = 1/(1 + exp(−(a · x + b))) for the Sigmoid additive hidden node or G(a, b, x) = (‖x − a‖^2 + b^2)^(1/2) for the multiquadric RBF hidden node. All the hidden-node parameters (a_i, b_i), i = 1, . . . , L, are randomly generated based on a uniform distribution. The user-specified parameters are (C, L), where C is chosen from the range {2^−24, 2^−23, . . . , 2^24, 2^25}. As seen from Fig. 3, ELM can achieve good generalization performance as long as the number of hidden nodes L is large enough. In all our simulations on ELM with Sigmoid additive hidden node and multiquadric RBF hidden node, L = 1000. In other words, the performance of ELM with Sigmoid additive hidden node and multiquadric RBF hidden node is not sensitive to the number of hidden nodes L. Moreover, L need not be specified by users; instead, users only need to specify one parameter: C.

Fifty trials have been conducted for each problem. Simulation results, including the average testing accuracy, the corresponding standard deviation (Dev), and the training times, are given in this section.
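As a concrete illustration of the ELM configuration just described, the following is a minimal NumPy sketch (our own code, not the authors' implementation): the hidden-node parameters (a_i, b_i) are drawn from a uniform distribution, h(x) is built with either Sigmoid or multiquadric RBF nodes, and the output weights are obtained from the regularized least-squares solution β = (I/C + HᵀH)⁻¹HᵀT, the form suitable when the number of training samples exceeds L. Only C and L are exposed, with L simply set to a large value such as 1000.

import numpy as np

def elm_train(X, T, L=1000, C=1.0, node="sigmoid", seed=None):
    # X: (N, d) inputs; T: (N, m) targets (one-hot rows for classification).
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    A = rng.uniform(-1.0, 1.0, size=(d, L))   # random input weights a_i
    b = rng.uniform(-1.0, 1.0, size=L)        # random biases b_i
    H = _hidden(X, A, b, node)                # N x L hidden-layer output matrix
    # beta = (I/C + H^T H)^(-1) H^T T  -- regularized least squares
    beta = np.linalg.solve(np.eye(L) / C + H.T @ H, H.T @ T)
    return A, b, beta

def _hidden(X, A, b, node):
    if node == "sigmoid":
        return 1.0 / (1.0 + np.exp(-(X @ A + b)))      # G(a, b, x) = 1/(1 + exp(-(a.x + b)))
    sq = ((X[:, None, :] - A.T[None, :, :]) ** 2).sum(axis=2)
    return np.sqrt(sq + b ** 2)                        # G(a, b, x) = (||x - a||^2 + b^2)^(1/2)

def elm_predict(X, A, b, beta, node="sigmoid"):
    return _hidden(X, A, b, node) @ beta               # f(x) = h(x) beta

For multiclass problems, T is the one-hot target matrix, and the predicted class of a sample is taken as the index of the largest entry of elm_predict(·).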


TABLE V
PARAMETERS OF THE CONVENTIONAL SVM, LS-SVM, AND ELM

The user-specified parameters chosen in our simulations are given in Table V.
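A simple way to carry out the (C, γ) search described above (our own sketch; the paper does not prescribe a particular selection script) is to evaluate every pair from the 50 × 50 grid {2^−24, 2^−23, . . . , 2^25} × {2^−24, 2^−23, . . . , 2^25} on held-out data and keep the best-scoring pair; train_fn and score_fn are placeholder callables for whichever learner (SVM, LS-SVM, or ELM with Gaussian kernel) is being tuned.

import numpy as np

# Candidate values: {2^-24, 2^-23, ..., 2^24, 2^25} (50 values each for C and gamma)
C_GRID = [2.0 ** p for p in range(-24, 26)]
GAMMA_GRID = [2.0 ** p for p in range(-24, 26)]

def select_C_gamma(train_fn, score_fn, X_tr, T_tr, X_val, T_val):
    # Exhaustive search over all 2500 (C, gamma) pairs, keeping the best validation score.
    best_C, best_gamma, best_acc = None, None, -np.inf
    for C in C_GRID:
        for gamma in GAMMA_GRID:
            model = train_fn(X_tr, T_tr, C, gamma)
            acc = score_fn(model, X_val, T_val)
            if acc > best_acc:
                best_C, best_gamma, best_acc = C, gamma, acc
    return best_C, best_gamma, best_acc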

D. Performance Comparison on XOR Problem


The performance of SVM, LS-SVM, and ELM has been tested on the XOR problem, which has two training samples in each class. The aim of this simulation is to verify whether ELM can handle some rare cases, such as cases with extremely few training samples. Fig. 4 shows the boundaries of the different classifiers in the XOR problem. It can be seen that, similar to SVM and LS-SVM, ELM is able to solve the XOR problem well. The user-specified parameters used in this XOR problem are chosen as follows: (C, γ) for SVM is (2^10, 2^0), (C, γ) for LS-SVM is (2^4, 2^14), (C, γ) for ELM with Gaussian kernel is (2^5, 2^15), and (C, L) for ELM with Sigmoid additive hidden node is (2^0, 3000).
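To make this setup concrete, the following toy sketch (our own illustration, not the authors' code) applies ELM with a Gaussian kernel to four XOR points using the kernel form of the ELM output, f(x) = [K(x, x_1), . . . , K(x, x_N)](I/C + Ω)^−1 T, where Ω is the N × N kernel matrix; the (C, γ) = (2^5, 2^15) setting quoted above is reused, while the point coordinates and labels are our own choice.

import numpy as np

# Four XOR training points, two per class (labels +1 / -1)
X = np.array([[0.0, 0.0], [1.0, 1.0], [0.0, 1.0], [1.0, 0.0]])
T = np.array([1.0, 1.0, -1.0, -1.0])
C, gamma = 2.0 ** 5, 2.0 ** 15        # (C, gamma) for ELM with Gaussian kernel on XOR

def gaussian_kernel(A, B, gamma):
    # K(u, v) = exp(-gamma * ||u - v||^2)
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-gamma * d2)

Omega = gaussian_kernel(X, X, gamma)                       # N x N kernel matrix
alpha = np.linalg.solve(np.eye(len(X)) / C + Omega, T)     # (I/C + Omega)^(-1) T

def predict(Xq):
    return gaussian_kernel(Xq, X, gamma) @ alpha           # f(x) = k(x) (I/C + Omega)^(-1) T

print(np.sign(predict(X)))   # recovers the training labels: [ 1.  1. -1. -1.]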

Fig. 4. Separating boundaries of different classifiers in the XOR problem. (a) SVM. (b) LS-SVM. (c) ELM (Gaussian kernel). (d) ELM (Sigmoid additive node).

E. Performance Comparison on Real-World Benchmark Data Sets

Tables VI–VIII show the performance comparison of SVM, LS-SVM, and ELM with Gaussian kernel, random Sigmoid hidden nodes, and multiquadric RBF nodes. It can be seen that ELM always achieves performance comparable to SVM and LS-SVM with much faster learning speed. As seen from Tables VI–VIII, different output functions of ELM can be used for different sizes of data sets in order to have efficient implementations, although any output function can be used on all types of data sets.


TABLE VI
PERFORMANCE COMPARISON OF SVM, LS-SVM, AND ELM: BINARY CLASS DATA SETS

TABLE VII
PERFORMANCE COMPARISON OF SVM, LS-SVM, AND ELM: MULTICLASS DATA SETS

Take the Shuttle (large number of training samples) and USPS (medium-size data set with high input dimensions) data sets in Table VII as examples.

1) For the Shuttle data set, ELM with Gaussian kernel and with random multiquadric RBF nodes runs 6 and 4466 times faster than LS-SVM, respectively.
2) For the USPS data set, ELM with Gaussian kernel and with random multiquadric RBF nodes runs 6 and 65 times faster than LS-SVM, respectively, and runs 1342 and 13 832 times faster than SVM, respectively.

On the other hand, different from LS-SVM, which is sensitive to the combination of parameters (C, γ), ELM with random multiquadric RBF nodes is not sensitive to the unique user-specified parameter C [cf. Fig. 3(b)] and is easy to use in the respective implementations.

Tables VI–VIII particularly highlight the performance comparison between LS-SVM and ELM with Gaussian kernel; among the comparisons of these two algorithms, the apparently better test results are given in boldface. It can be seen that ELM with Gaussian kernel achieves the same generalization performance as LS-SVM in almost all the binary classification and regression cases at much faster learning speeds; however, ELM usually achieves much better generalization performance than LS-SVM in the multiclass classification cases (cf. Table VII). Fig. 5 shows the boundaries of the different classifiers in the Banana case. It can be seen that ELM can classify the different classes well.


TABLE VIII
PERFORMANCE COMPARISON OF SVM, LS-SVM, AND ELM: REGRESSION DATA SETS

Fig. 5. Separating boundaries of different classifiers in the Banana case. (a) SVM. (b) LS-SVM. (c) ELM (Gaussian kernel). (d) ELM (Sigmoid additive node).

VI. CONCLUSION

ELM is a learning mechanism for the generalized SLFNs, where learning is made without iterative tuning. The essence of ELM is that the hidden layer of the generalized SLFNs need not be tuned. Different from traditional learning theories, ELM learning theory [14]–[16] shows that if SLFNs f(x) = h(x)β with tunable piecewise continuous hidden-layer feature mapping h(x) can approximate any target continuous function, then tuning is not required in the hidden layer. All the hidden-node parameters, which are supposed to be tuned by conventional learning algorithms, can be randomly generated according to any continuous sampling distribution [14]–[16].

This paper has shown that both LS-SVM and PSVM can be simplified by removing the bias term b, and the resultant learning algorithms are unified with ELM. Instead of requesting different variants for different types of applications, ELM can be applied in regression and multiclass classification applications directly. More importantly, according to ELM theory [14]–[16], ELM can work with a widespread type of feature mappings (including Sigmoid networks, RBF networks, trigonometric networks, threshold networks, fuzzy inference systems, fully complex neural networks, high-order networks, ridge polynomial networks, etc.).

ELM requires less human intervention than SVM and LS-SVM/PSVM. If the feature mappings h(x) are known to users, only one parameter C needs to be specified by users in ELM. The generalization performance of ELM is not sensitive to the dimensionality L of the feature space (the number of hidden nodes) as long as L is set large enough (e.g., L ≥ 1000 for all the real-world cases tested in our simulations). Different from SVM, LS-SVM, and PSVM, which usually request two parameters (C, γ) to be specified by users, this single-parameter setting makes ELM easy and efficient to use.

If feature mappings are unknown to users, then, similar to SVM, LS-SVM, and PSVM, kernels can be applied in ELM as well. Different from LS-SVM and PSVM, ELM does not have constraints on the Lagrange multipliers α_i. Since LS-SVM and ELM have the same optimization objective function and LS-SVM has some optimization constraints on the Lagrange multipliers α_i, in this sense, LS-SVM tends to obtain a solution which is suboptimal to that of ELM.

As verified by the simulation results, compared to SVM and LS-SVM, ELM achieves similar or better generalization performance for regression and binary classification cases, and much better generalization performance for multiclass classification cases. ELM has better scalability and runs at much faster learning speed (up to thousands of times faster) than traditional SVM and LS-SVM.

This paper has also shown that, in theory, ELM can approximate any target continuous function and classify any disjoint regions.

ACKNOWLEDGMENT

The authors would like to thank L. Ljung from Linköpings Universitet, Sweden, for reminding us of possible relationships between ELM and LS-SVM. The authors would also like to thank H. White from the University of California, San Diego, for his constructive and inspiring comments and suggestions on our research on ELM.

REFERENCES

[1] C. Cortes and V. Vapnik, “Support vector networks,” Mach. Learn., vol. 20, no. 3, pp. 273–297, 1995.
[2] J. A. K. Suykens and J. Vandewalle, “Least squares support vector machine classifiers,” Neural Process. Lett., vol. 9, no. 3, pp. 293–300, Jun. 1999.
[3] G. Fung and O. L. Mangasarian, “Proximal support vector machine classifiers,” in Proc. Int. Conf. Knowl. Discov. Data Mining, San Francisco, CA, 2001, pp. 77–86.
[4] Y.-J. Lee and O. L. Mangasarian, “RSVM: Reduced support vector machines,” in Proc. SIAM Int. Conf. Data Mining, Chicago, IL, Apr. 5–7, 2001.
[5] H. Drucker, C. J. Burges, L. Kaufman, A. Smola, and V. Vapnik, “Support vector regression machines,” in Neural Information Processing Systems 9, M. Mozer, J. Jordan, and T. Petscbe, Eds. Cambridge, MA: MIT Press, 1997, pp. 155–161.


[6] G.-B. Huang, K. Z. Mao, C.-K. Siew, and D.-S. Huang, “Fast modular network implementation for support vector machines,” IEEE Trans. Neural Netw., vol. 16, no. 6, pp. 1651–1663, Nov. 2005.
[7] R. Collobert, S. Bengio, and Y. Bengio, “A parallel mixtures of SVMs for very large scale problems,” Neural Comput., vol. 14, no. 5, pp. 1105–1114, May 2002.
[8] C.-W. Hsu and C.-J. Lin, “A comparison of methods for multiclass support vector machines,” IEEE Trans. Neural Netw., vol. 13, no. 2, pp. 415–425, Mar. 2002.
[9] J. A. K. Suykens and J. Vandewalle, “Multiclass least squares support vector machines,” in Proc. IJCNN, Jul. 10–16, 1999, pp. 900–903.
[10] T. Van Gestel, J. A. K. Suykens, G. Lanckriet, A. Lambrechts, B. De Moor, and J. Vandewalle, “Multiclass LS-SVMs: Moderated outputs and coding-decoding schemes,” Neural Process. Lett., vol. 15, no. 1, pp. 48–58, Feb. 2002.
[11] Y. Tang and H. H. Zhang, “Multiclass proximal support vector machines,” J. Comput. Graph. Statist., vol. 15, no. 2, pp. 339–355, Jun. 2006.
[12] G.-B. Huang, Q.-Y. Zhu, and C.-K. Siew, “Extreme learning machine: A new learning scheme of feedforward neural networks,” in Proc. IJCNN, Budapest, Hungary, Jul. 25–29, 2004, vol. 2, pp. 985–990.
[13] G.-B. Huang, Q.-Y. Zhu, and C.-K. Siew, “Extreme learning machine: Theory and applications,” Neurocomputing, vol. 70, no. 1–3, pp. 489–501, Dec. 2006.
[14] G.-B. Huang, L. Chen, and C.-K. Siew, “Universal approximation using incremental constructive feedforward networks with random hidden nodes,” IEEE Trans. Neural Netw., vol. 17, no. 4, pp. 879–892, Jul. 2006.
[15] G.-B. Huang and L. Chen, “Convex incremental extreme learning machine,” Neurocomputing, vol. 70, no. 16–18, pp. 3056–3062, Oct. 2007.
[16] G.-B. Huang and L. Chen, “Enhanced random search based incremental extreme learning machine,” Neurocomputing, vol. 71, no. 16–18, pp. 3460–3468, Oct. 2008.
[17] X. Tang and M. Han, “Partial Lanczos extreme learning machine for single-output regression problems,” Neurocomputing, vol. 72, no. 13–15, pp. 3066–3076, Aug. 2009.
[18] Q. Liu, Q. He, and Z. Shi, “Extreme support vector machine classifier,” Lecture Notes in Computer Science, vol. 5012, pp. 222–233, 2008.
[19] B. Frénay and M. Verleysen, “Using SVMs with randomised feature spaces: An extreme learning approach,” in Proc. 18th ESANN, Bruges, Belgium, Apr. 28–30, 2010, pp. 315–320.
[20] Y. Miche, A. Sorjamaa, P. Bas, O. Simula, C. Jutten, and A. Lendasse, “OP-ELM: Optimally pruned extreme learning machine,” IEEE Trans. Neural Netw., vol. 21, no. 1, pp. 158–162, Jan. 2010.
[21] W. Deng, Q. Zheng, and L. Chen, “Regularized extreme learning machine,” in Proc. IEEE Symp. CIDM, Mar. 30–Apr. 2, 2009, pp. 389–395.
[22] K.-A. Toh, “Deterministic neural classification,” Neural Comput., vol. 20, no. 6, pp. 1565–1595, Jun. 2008.
[23] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, “Learning representations by back-propagation errors,” Nature, vol. 323, pp. 533–536, 1986.
[24] G.-B. Huang, Q.-Y. Zhu, K. Z. Mao, C.-K. Siew, P. Saratchandran, and N. Sundararajan, “Can threshold networks be trained directly?” IEEE Trans. Circuits Syst. II, Exp. Briefs, vol. 53, no. 3, pp. 187–191, Mar. 2006.
[25] N.-Y. Liang, G.-B. Huang, P. Saratchandran, and N. Sundararajan, “A fast and accurate on-line sequential learning algorithm for feedforward networks,” IEEE Trans. Neural Netw., vol. 17, no. 6, pp. 1411–1423, Nov. 2006.
[26] M.-B. Li, G.-B. Huang, P. Saratchandran, and N. Sundararajan, “Fully complex extreme learning machine,” Neurocomputing, vol. 68, pp. 306–314, Oct. 2005.
[27] G. Feng, G.-B. Huang, Q. Lin, and R. Gay, “Error minimized extreme learning machine with growth of hidden nodes and incremental learning,” IEEE Trans. Neural Netw., vol. 20, no. 8, pp. 1352–1357, Aug. 2009.
[28] H.-J. Rong, G.-B. Huang, N. Sundararajan, and P. Saratchandran, “Online sequential fuzzy extreme learning machine for function approximation and classification problems,” IEEE Trans. Syst., Man, Cybern. B, Cybern., vol. 39, no. 4, pp. 1067–1072, Aug. 2009.
[29] G.-B. Huang, X. Ding, and H. Zhou, “Optimization method based extreme learning machine for classification,” Neurocomputing, vol. 74, no. 1–3, pp. 155–163, Dec. 2010.
[30] J. A. K. Suykens and J. Vandewalle, “Training multilayer perceptron classifier based on a modified support vector method,” IEEE Trans. Neural Netw., vol. 10, no. 4, pp. 907–911, Jul. 1999.
[31] F. Rosenblatt, Principles of Neurodynamics: Perceptrons and the Theory of Brain Mechanisms. New York: Spartan Books, 1962.
[32] R. Fletcher, Practical Methods of Optimization: Volume 2 Constrained Optimization. New York: Wiley, 1981.
[33] S. Haykin, Neural Networks: A Comprehensive Foundation. Englewood Cliffs, NJ: Prentice-Hall, 1999.
[34] P. L. Bartlett, “The sample complexity of pattern classification with neural networks: The size of the weights is more important than the size of the network,” IEEE Trans. Inf. Theory, vol. 44, no. 2, pp. 525–536, Mar. 1998.
[35] D. Serre, Matrices: Theory and Applications. New York: Springer-Verlag, 2002.
[36] C. R. Rao and S. K. Mitra, Generalized Inverse of Matrices and Its Applications. New York: Wiley, 1971.
[37] A. E. Hoerl and R. W. Kennard, “Ridge regression: Biased estimation for nonorthogonal problems,” Technometrics, vol. 12, no. 1, pp. 55–67, Feb. 1970.
[38] G.-B. Huang, Y.-Q. Chen, and H. A. Babri, “Classification ability of single hidden layer feedforward neural networks,” IEEE Trans. Neural Netw., vol. 11, no. 3, pp. 799–801, May 2000.
[39] G.-B. Huang, M.-B. Li, L. Chen, and C.-K. Siew, “Incremental extreme learning machine with fully complex hidden nodes,” Neurocomputing, vol. 71, no. 4–6, pp. 576–583, Jan. 2008.
[40] J. A. K. Suykens, T. Van Gestel, J. De Brabanter, B. De Moor, and J. Vandewalle, Least Squares Support Vector Machines. Singapore: World Scientific, 2002.
[41] M. Espinoza, J. A. K. Suykens, and B. De Moor, “Fixed-size least squares support vector machines: A large scale application in electrical load forecasting,” Comput. Manage. Sci.—Special Issue on Support Vector Machines, vol. 3, no. 2, pp. 113–129, Apr. 2006.
[42] M. Espinoza, J. A. K. Suykens, R. Belmans, and B. De Moor, “Electric load forecasting—Using kernel based modeling for nonlinear system identification,” IEEE Control Syst. Mag.—Special Issue on Applications of System Identification, vol. 27, no. 5, pp. 43–57, Oct. 2007.
[43] K. D. Brabanter, J. D. Brabanter, J. A. K. Suykens, and B. De Moor, “Optimized fixed-size kernel models for large data sets,” Comput. Statist. Data Anal., vol. 54, no. 6, pp. 1484–1504, Jun. 2010.
[44] P. Karsmakers, K. Pelckmans, K. D. Brabanter, H. V. Hamme, and J. A. K. Suykens, “Sparse conjugate directions pursuit with application to fixed-size kernel models,” Mach. Learn., vol. 85, no. 1–2, pp. 109–148, 2011.
[45] K.-A. Toh, Q.-L. Tran, and D. Srinivasan, “Benchmarking a reduced multivariate polynomial pattern classifier,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 26, no. 6, pp. 740–755, Jun. 2004.
[46] T. Poggio, S. Mukherjee, R. Rifkin, A. Rakhlin, and A. Verri, “b,” Artif. Intell. Lab., MIT, Cambridge, MA, A.I. Memo No. 2001-011, CBCL Memo 198, 2001.
[47] C. L. Blake and C. J. Merz, “UCI Repository of Machine Learning Databases,” Dept. Inf. Comput. Sci., Univ. California, Irvine, CA, 1998. [Online]. Available: http://www.ics.uci.edu/~mlearn/MLRepository.html
[48] M. Mike, “Statistical Datasets,” Dept. Statist., Univ. Carnegie Mellon, Pittsburgh, PA, 1989. [Online]. Available: http://lib.stat.cmu.edu/datasets/
[49] A. Bordes, S. Ertekin, J. Weston, and L. Bottou, “Fast kernel classifiers with online and active learning,” J. Mach. Learn. Res., vol. 6, pp. 1579–1619, Sep. 2005.
[50] J. J. Hull, “A database for handwritten text recognition research,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 16, no. 5, pp. 550–554, May 1994.
[51] J. Li and H. Liu, “Kent Ridge Bio-Medical Data Set Repository,” School Comput. Eng., Nanyang Technol. Univ., Singapore, 2004. [Online]. Available: http://levis.tongji.edu.cn/gzli/data/mirror-kentridge.html
[52] S. C. Odewahn, E. B. Stockwell, R. L. Pennington, R. M. Humphreys, and W. A. Zumach, “Automated star/galaxy discrimination with neural networks,” Astron. J., vol. 103, no. 1, pp. 318–331, Jan. 1992.
[53] T. R. Golub, D. K. Slonim, P. Tamayo, C. Huard, M. Gaasenbeek, J. P. Mesirov, H. Coller, M. L. Loh, J. R. Downing, M. A. Caligiuri, C. D. Bloomfield, and E. S. Lander, “Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring,” Science, vol. 286, no. 5439, pp. 531–537, Oct. 1999.
[54] H. Peng, F. Long, and C. Ding, “Feature selection based on mutual information criteria of max-dependency, max-relevance, and min-redundancy,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 27, no. 8, pp. 1226–1238, Aug. 2005.
[55] S. Canu, Y. Grandvalet, V. Guigue, and A. Rakotomamonjy, “SVM and Kernel Methods Matlab Toolbox,” Perception Systèmes et Information, INSA de Rouen, Rouen, France, 2005. [Online]. Available: http://asi.insa-rouen.fr/enseignants/~arakotom/toolbox/index.html
[56] K. Pelckmans, J. A. K. Suykens, T. Van Gestel, J. De Brabanter, L. Lukas, B. Hamers, B. De Moor, and J. Vandewalle, “LS-SVMLab Toolbox,” Dept. Elect. Eng., ESAT-SCD-SISTA, Leuven, Belgium, 2002. [Online]. Available: http://www.esat.kuleuven.be/sista/lssvmlab/


Guang-Bin Huang (M’98–SM’04) received the B.Sc. degree in applied mathematics and the M.Eng. degree in computer engineering from Northeastern University, Shenyang, China, in 1991 and 1994, respectively, and the Ph.D. degree in electrical engineering from Nanyang Technological University, Singapore, in 1999.
During his undergraduate period, he also concurrently studied in the Applied Mathematics Department and Wireless Communication Department, Northeastern University, China. From June 1998 to May 2001, he was a Research Fellow with the Singapore Institute of Manufacturing Technology (formerly known as Gintic Institute of Manufacturing Technology), where he has led/implemented several key industrial projects (e.g., as Chief Designer and Technical Leader of the Singapore Changi Airport Cargo Terminal Upgrading Project). Since May 2001, he has been an Assistant Professor and Associate Professor with the School of Electrical and Electronic Engineering, Nanyang Technological University, Singapore. His current research interests include machine learning, computational intelligence, and extreme learning machines.
Dr. Huang serves as an Associate Editor of Neurocomputing and the IEEE TRANSACTIONS ON SYSTEMS, MAN, AND CYBERNETICS—PART B.

Hongming Zhou received the B.Eng. degree from Nanyang Technological University, Singapore, in 2009, where he is currently working toward the Ph.D. degree in the School of Electrical and Electronic Engineering.
His research interests include extreme learning machines, neural networks, and support vector machines.

Xiaojian Ding received the B.Sc. degree in applied mathematics and the M.Sc. degree in computer engineering from Xi’an University of Technology, Xi’an, China, in 2003 and 2006, respectively, and the Ph.D. degree from the School of Electronic and Information Engineering, Xi’an Jiaotong University, Xi’an, in 2010. He studied in the School of Electrical and Electronic Engineering, Nanyang Technological University, Singapore, under an award from the Chinese Scholarship Council of China from August 2009 to August 2010.
His research interests include extreme learning machines, neural networks, pattern recognition, and machine learning.

Rui Zhang received the B.Sc. and M.Sc. degrees in mathematics from Northwest University, Xi’an, China, in 1994 and 1997, respectively. She is currently working toward the Ph.D. degree in the School of Electrical and Electronic Engineering, Nanyang Technological University, Singapore.
From August 2004 to January 2005, she was a Visiting Scholar with the Department of Mathematics, University of Illinois at Urbana–Champaign, Urbana. Since 1997, she has been with the Department of Mathematics, Northwest University, where she is currently an Associate Professor. Her current research interests include extreme learning machines, machine learning, and neural networks.
