Extreme Learning Machine Based Transfer Learning

Neurocomputing
Article history: Received 30 September 2014; received in revised form 5 January 2015; accepted 6 January 2015; available online 11 August 2015.

Keywords: Extreme learning machine; Transfer learning (TL); Classification

Abstract

The extreme learning machine (ELM) is a new learning scheme for single-hidden-layer feed-forward networks (SLFNs) with a much simpler training procedure. Conventional extreme learning machines assume that the training and test data are drawn from the same distribution, yet in reality it is often desirable to learn an accurate model using only a tiny amount of new data and a large amount of old data. Transfer learning (TL) aims to solve related but different target domain problems by using plenty of labeled source domain data. When a task from a new domain arrives, relabeling the new domain samples is costly, and it would be wasteful to discard all the old domain data. Therefore, an algorithm called TL-ELM, based on the ELM algorithm, is proposed, which uses a small amount of labeled target domain data and a large amount of old source domain data to build a high-quality classification model. The method inherits the advantages of ELM and makes up for the defect that traditional ELM cannot transfer knowledge. Experimental results indicate that the performance of the proposed method is superior to, or at least comparable with, existing benchmark methods. In addition, a novel domain adaptation kernel extreme learning machine (TL-DAKELM) based on the kernel extreme learning machine is proposed as an extension of TL-ELM. Experimental results show the effectiveness of the proposed algorithms.
1. Introduction

As a nonlinear model, the neural network (NN) has good generalization and nonlinear mapping ability and can be used to alleviate the curse of dimensionality [1]. Combining forward propagation of information with back-propagation of error, the typical back-propagation (BP) neural network plays an important role in neural learning [2]. Besides, the support vector machine (SVM) is based on statistical learning theory and the structural risk minimization principle [3,4]. However, it is known that both the BP neural network and the SVM have some challenging issues, such as (1) slow learning speed, (2) considerable human intervention, and (3) poor computational scalability [37]. A new learning algorithm, the extreme learning machine (ELM), was proposed by Huang et al. [5]. Compared with the BP neural network and the SVM, the ELM has better generalization performance at a much faster learning speed and with the least human intervention.

Although ELM has made some achievements, there is still room for improvement. Some scholars have focused on optimizing the learning algorithm of ELM. Han et al. [6] encoded a priori information to improve the function approximation of ELM. Kim et al. [7] introduced a variable projection method to reduce the dimension of the parameter space. Zhu et al. [8] used a differential evolutionary algorithm to select the input weights of ELM. Other scholars have been dedicated to optimizing the structure of ELM. Wang et al. [9] made a proper selection of the input weights and biases of ELM in order to improve its performance. Li et al. [10] proposed a structure-adjustable online ELM learning method, which can adjust the number of hidden-layer RBF nodes. Huang et al. [11,12] proposed an incremental-structure ELM, which increases the number of hidden nodes gradually. Meanwhile, another incremental approach, referred to as the error-minimized extreme learning machine (EM-ELM), was proposed by Feng et al. [13]. All these incremental ELMs start from a small ELM hidden layer and add random hidden nodes to it; during the growth of the network, the output weights are updated incrementally. On the other hand, an alternative way to optimize the structure of ELM is to train an ELM that is larger than necessary and then prune the unnecessary nodes during learning. A pruned ELM (PELM) was proposed by Rong et al. [14,15] for classification problems. Yoan et al. [16] proposed an optimally pruned extreme learning machine (OP-ELM) methodology.
Besides, there are still other attempts to optimize the structure of ELM, such as CS-ELM [17] proposed by Lan et al., which uses a subset model selection method. Zong et al. [18] put forward the weighted extreme learning machine for imbalance learning. The kernel trick applied to ELM was introduced in previous work [19]. Conventional extreme learning machines assume that the training and test data are under the same distribution; in reality, however, it is often desirable to learn an accurate model using only a tiny amount of new data and a large amount of old data.

The transfer learning technique can reuse past experience and knowledge to solve the current problem. Generally speaking, the goal of transfer learning is to use training data from related tasks to aid learning on a future problem of interest. Transfer learning refers to the problem of retaining and leveraging the knowledge available for one or more tasks, domains, or distributions to efficiently develop a reasonable hypothesis for a new task, domain, or distribution [20]. Transfer learning (TL) is a method that aims at reusing knowledge learned in one environment to improve the learning performance in new environments, which can solve the problem of transferring learning across different but similar tasks [21]. According to the relationship between the source and target domains, TL can be divided into inductive transfer [22] and transductive transfer [23]. Dai et al. [24] proposed a boosting algorithm, TrAdaBoost, which is an extension of the AdaBoost algorithm, to address inductive transfer learning problems. Wu et al. [25] integrated the source domain (auxiliary) data into the Support Vector Machine (SVM) framework to improve classification performance. Pan et al. [29] proposed a Q-learning system for continuous spaces which is constructed as a regression problem for an ELM. Instead of involving generalization across problem instances, transfer learning emphasizes the transfer of knowledge across tasks, domains, and distributions that are related but not the same. The default assumption of traditional supervised learning methods is that training and testing data are drawn from the same distribution. When the two distributions do not match, two distinct transfer learning sub-problems can be defined depending on whether the training and testing data refer to the same domain or not [33]. In the framework of domain adaptation, most learning methods are inspired by the idea that the two considered domains, although different, are highly correlated [35]. Duan et al. also proposed a domain transfer SVM (DTSVM) and its extended version DTMKL for domain adaptation learning (DAL) problems such as cross-domain video concept detection and text classification [36].

In this paper, we investigate these issues. An algorithm called transfer learning based on the ELM algorithm (TL-ELM) is proposed, which uses a small amount of labeled target domain data and a large amount of old source domain data to build a high-quality classification model. The method retains the advantages of the traditional ELM and makes up for the defect that traditional ELM cannot transfer knowledge. In addition, we propose a so-called TL-DAKELM based on the kernel extreme learning machine as an extension of the TL-ELM method for pattern classification problems. Experimental results show the effectiveness of the proposed algorithms.

2. Kernel extreme learning machine

In this section we briefly review the ELM proposed in [26]. The essence of ELM is that its hidden layer need not be tuned. The output function of ELM for generalized SLFNs is

f_L(x) = \sum_{i=1}^{L} \beta_i h_i(x_j) = \sum_{i=1}^{L} \beta_i h(w_i \cdot x_j + b_i) = h(x)\beta, \quad j = 1, \ldots, N,   (1)

where w_i ∈ R^n is the weight vector connecting the input nodes and the ith hidden node, b_i ∈ R is the bias of the ith hidden node, β_i ∈ R is the weight connecting the ith hidden node and the output node, and f_L(x) ∈ R is the output of the SLFN. w_i · x_j denotes the inner product of w_i and x_j. w_i and b_i are the learning parameters of the hidden nodes, and they are randomly chosen before learning.

If the SLFN with L hidden nodes can approximate these N samples with zero error, then there exist β_i, w_i, and b_i such that

\sum_{i=1}^{L} \beta_i h(w_i \cdot x_j + b_i) = t_j, \quad j = 1, \ldots, N.   (2)

Eq. (2) can be written compactly as

H\beta = T,   (3)

where

H = \begin{bmatrix} h(x_1) \\ \vdots \\ h(x_N) \end{bmatrix} = \begin{bmatrix} h(w_1, b_1, x_1) & \cdots & h(w_L, b_L, x_1) \\ \vdots & \ddots & \vdots \\ h(w_1, b_1, x_N) & \cdots & h(w_L, b_L, x_N) \end{bmatrix}_{N \times L},
\quad T = [t_1, \ldots, t_N]^T, \quad \beta = [\beta_1, \beta_2, \ldots, \beta_L]^T.

H is called the hidden layer output matrix of the network [4,5]; the ith column of H is the ith hidden node's output vector with respect to the inputs x_1, x_2, …, x_N, and the jth row of H is the output vector of the hidden layer with respect to the input x_j. As introduced in [27], one of the methods to calculate the Moore–Penrose generalized inverse of a matrix is the orthogonal projection method: H† = H^T(HH^T)^{-1}.

According to ridge regression theory [27], one can add a positive value to the diagonal of HH^T; the resulting solution is more stable and tends to have better generalization performance:

f(x) = h(x)\beta = h(x)H^T\left(\frac{I}{C} + HH^T\right)^{-1} T.   (4)

The feature mapping h(x) is usually known to users in ELM. However, if a feature mapping h(x) is unknown to users, a kernel matrix for ELM can be defined as follows [6]:

\Omega_{ELM} = HH^T, \quad \Omega_{ELM\,i,j} = h(x_i) \cdot h(x_j) = K(x_i, x_j).   (5)

Thus, the output function of the ELM classifier can be written compactly as

f(x) = h(x)H^T\left(\frac{I}{C} + HH^T\right)^{-1} T = \begin{bmatrix} K(x, x_1) \\ \vdots \\ K(x, x_N) \end{bmatrix}^{T}\left(\frac{I}{C} + \Omega_{ELM}\right)^{-1} T.   (6)
In this paper, we would like to investigate these issues. There- Kðx; xN Þ
fore, an algorithm called transfer learning based on the ELM
N
algorithm (TL-ELM) is proposed, which uses a small number of Algorithm1. Given a training set ðxi ; t i Þ i ¼ 1 Rn Rn , activation
target tag data and a large number of source domain old data to kernel function gð U Þ, and the hidden node number L :
build a high-quality classification model. The method takes the
advantages of the traditional ELM and makes up for the defect that Step 1: Randomly assign input weight wi and bias bi ; i ¼ 1; ⋯; L:
traditional ELM cannot migrate knowledge. In addition, we pro- Step 2: Calculate the hidden layer output matrix H:
pose a so-called TL-DAKELM based on the kernel extreme learning Step 3: Calculate the output weight β : β ¼ H† T:
3. ELM in transfer learning

3.1. Minimum norm least-squares (LS) solution of SLFNs

It is interesting and perhaps surprising that, unlike the common understanding that all the parameters of SLFNs need to be adjusted, the input weights w_i and the hidden layer biases b_i are in fact not necessarily tuned, and the hidden layer output matrix H can actually remain unchanged once random values have been assigned to these parameters at the beginning of learning. For fixed input weights w_i and hidden layer biases b_i, as seen from Eq. (7), training an SLFN is simply equivalent to finding a least-squares solution β̂ of the linear system Hβ = T.
The smallest norm least-squares solution of the above linear system is

\hat{\beta} = H^{\dagger} T,   (9)

where H† is the Moore–Penrose generalized inverse of the matrix H [28].

3.2. Framework of TL-ELM

The framework of the proposed TL-ELM method is shown in Fig. 1. An initial ELM classifier is trained on the labeled training dataset of the source domain; ω_s is the source transfer knowledge reserved when the initial ELM classifier is generated. The target classifier is then optimized using the few labeled target domain samples together with the transfer knowledge [30], and the target domain test dataset is classified by the resulting TL-ELM classifier.

The traditional ELM builds a learning model using training and test data that are assumed to follow the same distribution; its disadvantage is that it lacks the capability of dealing with transfer knowledge. Applying transfer learning, the TL-ELM training problem for the target domain is formulated as

\min_{\omega_t,\ \xi^t}\ \frac{1}{2}\|\omega_t\|^2 + C_t\sum_{i=1}^{N}\xi_i^t + \frac{\mu}{2}\|\omega_t - \omega_s\|^2,
\quad \text{s.t.}\ y_i^t\big(\omega_t \cdot h(x_i^t)\big) \ge 1 - \xi_i^t, \quad \xi_i^t \ge 0, \quad i = 1, \ldots, N,   (10)

where ω_s is the source transfer knowledge, ω_t is the target transfer knowledge, and μ is a penalty parameter. The corresponding Lagrangian is

L = \frac{1}{2}\|\omega_t\|^2 + C_t\sum_{i=1}^{N}\xi_i^t + \frac{\mu}{2}\|\omega_t - \omega_s\|^2 - \sum_{i=1}^{N}\alpha_i\big[y_i^t\big(\omega_t \cdot h(x_i^t)\big) - 1 + \xi_i^t\big] - \sum_{i=1}^{N}\beta_i\xi_i^t,

with multipliers α_i ≥ 0 and β_i ≥ 0. Setting the derivative of L with respect to ξ_i^t to zero gives C_t − α_i − β_i = 0, which implies that

0 \le \alpha_i \le C_t.   (14)

In the case of ω_t we have

\frac{\partial L}{\partial \omega_t} = \omega_t + \mu(\omega_t - \omega_s) - \sum_{i=1}^{N}\alpha_i y_i^t h(x_i^t) = 0,   (15)

which results in

\omega_t = \frac{\mu\,\omega_s}{1+\mu} + \frac{1}{1+\mu}\sum_{i=1}^{N}\alpha_i y_i^t h(x_i^t).   (16)

Substituting (16) back into the Lagrangian yields the dual form of the original problem,

\min_{\alpha}\ \frac{1}{2(1+\mu)}\sum_{i=1}^{N}\sum_{j=1}^{N}\alpha_i\alpha_j y_i^t y_j^t\big(h(x_i^t)\cdot h(x_j^t)\big) + \sum_{i=1}^{N}\left(\frac{\mu\,y_i^t\big(h(x_i^t)\cdot\omega_s\big)}{\mu+1} - 1\right)\alpha_i,
\quad \text{s.t.}\ 0 \le \alpha_i \le C_t,\ i = 1, \ldots, N.   (17)

3.2.1. TL-ELM algorithm

The proposed TL-ELM algorithm can be summarized as follows.

Step 1: Obtain the initial knowledge ω_s, and select the appropriate penalty parameters C_t and μ.
Step 2: For the given C_t, find the optimal vector α* = (α_1*, …, α_N*)^T by solving Eq. (17).
Step 3: According to Eq. (16), obtain ω_t*.
Step 4: Output the decision function.
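The following NumPy sketch illustrates Steps 2–4 under the assumptions stated above: the dual (17) is treated as a box-constrained quadratic program and solved here with a simple projected-gradient loop (any off-the-shelf QP solver could be substituted), labels are assumed to be in {−1, +1}, and ω_s is taken to be the output weight vector of the source ELM in the same explicit feature space h(·). None of these implementation choices come from the paper itself.

```python
import numpy as np

def tl_elm_fit(Ht, yt, w_s, Ct=1.0, mu=1.0, steps=2000, lr=1e-3):
    """Ht: hidden-layer outputs h(x_i^t) of the target samples (N x L),
    yt: target labels in {-1, +1}, w_s: source transfer knowledge (length L)."""
    N = len(yt)
    Q = (yt[:, None] * yt[None, :]) * (Ht @ Ht.T)       # y_i y_j h(x_i).h(x_j)
    c = mu * yt * (Ht @ w_s) / (mu + 1.0) - 1.0          # linear term of the dual (17)
    alpha = np.zeros(N)
    for _ in range(steps):                               # projected gradient on the dual
        grad = Q @ alpha / (1.0 + mu) + c
        alpha = np.clip(alpha - lr * grad, 0.0, Ct)      # keep 0 <= alpha_i <= C_t (14)
    # Eq. (16): combine source knowledge and target support information.
    w_t = (mu * w_s + Ht.T @ (alpha * yt)) / (1.0 + mu)
    return w_t

def tl_elm_decision(H, w_t):
    return np.sign(H @ w_t)                              # decision function of Step 4
```

In practice α* would come from a standard QP solver; the projected-gradient loop above is included only to keep the sketch self-contained.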
3.3. Domain adaptation kernel extreme learning machine (TL-DAKELM)

For the proposed TL-DAKELM, the corresponding optimization problem on the target domain is formulated as

\mathop{\arg\min}_{w,\,b,\,\xi}\ f = \frac{C}{2}\sum_{i=1}^{n}\xi_i^2 + \frac{1}{2}\gamma_{KMS}(p, q),
\quad \text{s.t.}\ \sum_{j=1}^{n+m}\beta_j k_{\sigma/\gamma}(x_i, x_j) = y_i - \xi_i, \quad \xi_i \ge 0, \quad i = 1, \ldots, n,   (19)

where Ω is the positive semi-definite kernel matrix. We use the standard Gaussian kernel k_{σ/γ}(x, y) = exp(−‖x − y‖² / (2(σ/γ)²)), with bandwidth σ/γ, as the default kernel.

For a classification problem, given the parameter λ ∈ [0, 1], the optimal solution of Eq. (19) is equivalent to the following linear system of equations with respect to the variable α:

\begin{bmatrix} 0 & \mathbf{1}_n^{T} \\ \mathbf{1}_n & \tilde{\Omega} \end{bmatrix}\begin{bmatrix} b \\ \alpha \end{bmatrix} = \begin{bmatrix} 0 \\ Y \end{bmatrix},   (20)

where 1_n = [1, …, 1]^T, α = [α_1, …, α_n]^T, Y = [y_1, …, y_n]^T, Ω̃ = K_s^T Ω^{-1} K_s + I_n/C, and I_n is the n-dimensional identity matrix.

As for multiclass classification problems, the traditional approach is to decompose a multiclass classification problem into several binary classification problems, in a one-against-one (OAO) or one-against-all (OAA) manner. However, these schemes suffer from high computational complexity. Tao et al. [34] introduced vector-labeled outputs into the solution of LS-DAKSVM, which makes the corresponding computational complexity independent of the number of classes and requires no more computation than a single binary classifier. According to ELM theories [11–15], almost all nonlinear piecewise continuous functions used as feature mappings can make ELM satisfy the universal approximation capability, and the separating hyperplane of ELM basically tends to pass through the origin in the ELM feature space. There is no bias term b in the optimization constraint of ELM; thus, different from LS-SVM, ELM does not need to satisfy the condition \sum_{i=1}^{n}\alpha_i y_i = 0. Although LS-SVM and ELM have the same primal optimization formula, ELM has milder optimization constraints than LS-SVM, and thus, compared to ELM, LS-SVM obtains a suboptimal solution.

We represent the class labels according to the one-of-c rule: if the training sample x_i (i = 1, …, n) belongs to the t-th class, then its class label is y_i = [0, …, 1, …, 0]^T ∈ R^c, where the t-th element is 1 and all the other elements are 0. Hence, for multi-class classification problems, the optimal solution of TL-DAKELM can be formulated as

\min_{\tilde{\beta},\,\xi}\ f = \frac{1}{2}\tilde{\beta}^{T}\tilde{\Omega}\tilde{\beta} + \frac{C}{2}\sum_{i=1}^{n}\xi_i^2,
\quad \text{s.t.}\ \tilde{\beta}^{T} K_s = y_i - \xi_i, \quad i = 1, \ldots, n,   (21)

where β̃ ∈ R^{n×c}.

The proposed TL-DAKELM algorithm can be summarized as follows.

Algorithm 3. Input the data set matrices X and Y, and set the Gaussian kernel bandwidths σ and σ/γ.

Step 1: Determine the parameter γ.
Step 2: For the given C, find the optimal vector β by applying Eq. (21).
Step 3: According to Eq. (20), obtain w.
Step 4: Output the decision function.
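As a rough illustration of Steps 2 and 3, the sketch below assembles a kernel matrix, forms the regularized matrix Ω̃, and solves the block linear system of Eq. (20) with NumPy. The way K_s and Ω are built from the data (here simply equal to one Gaussian kernel matrix) and the use of one-of-c label columns in Y are assumptions made for the sake of a runnable example, not the paper's exact construction.

```python
import numpy as np

def one_of_c(labels, c):
    # One-of-c coding: the t-th entry is 1 for a sample of class t, all others 0.
    return np.eye(c)[labels]

def solve_eq20(K_s, Omega, Y, C=10.0):
    """Solve [[0, 1^T], [1, Omega_tilde]] [b; alpha] = [0; Y] (Eq. (20))."""
    n = K_s.shape[1]
    Omega_tilde = K_s.T @ np.linalg.solve(Omega, K_s) + np.eye(n) / C
    A = np.zeros((n + 1, n + 1))
    A[0, 1:] = 1.0
    A[1:, 0] = 1.0
    A[1:, 1:] = Omega_tilde
    rhs = np.vstack([np.zeros((1, Y.shape[1])), Y])
    sol = np.linalg.solve(A, rhs)
    return sol[0], sol[1:]            # b and alpha (one column per class)

# Illustrative usage with a Gaussian kernel over random data and 3 classes.
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 5))
labels = rng.integers(0, 3, size=40)
d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
Omega = np.exp(-d2 / 2.0) + 1e-6 * np.eye(40)   # kernel matrix, jittered for invertibility
# For illustration K_s is taken equal to the kernel matrix itself.
b, alpha = solve_eq20(Omega, Omega, one_of_c(labels, 3), C=10.0)
```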
4. Experiments

In this section, in order to evaluate the properties of our framework, we perform experiments on one non-text data set from the UCI machine learning repository, on text data sets from the 20-Newsgroups repository, and on face data sets.

4.1.1. Dataset

4.1.1.1. UCI dataset. The UCI machine learning repository contains the Banana, Titanic, Waveform, Image, Heart, Diabetes, Flare Solar, and Splice datasets (Table 1).

Table 1
UCI data sets used in the experiments.

Data set      Dimensionality   Sample size
Banana        2                4900
Titanic       3                2200
Waveform      21               5000
Image         18               2310
Heart         13               270
Diabetes      8                768
Flare Solar   9                1066
Splice        60               3175

4.1.1.2. 20-NewsGroups dataset. The 20-NewsGroups data set contains 20,000 documents distributed evenly over 20 different newsgroups, each corresponding to a different topic. Some of the newsgroups are closely related and can be grouped into one category at a higher level, while others remain separate categories. For example, the top category sci contains the subcategories sci.crypt, sci.electronics, sci.med, and sci.space in the science field (Table 2). Any two top categories can be selected to construct a cross-domain data set: the source domain contains some subcategories from the two top categories, and the target domain contains the remaining subcategories. The details of the constructed data sets are listed in Table 2. For the 20-NewsGroups categorization data, the goal in each case is to correctly discriminate between articles at the top level, e.g. "comp" articles vs. "rec" articles, using different sets of subcategories within each top category for training and testing.

Table 2
Top class and sub-class on the 20-NewsGroups dataset.

Top class   Subclass number   Subclass                    Sample size
Comp        1                 comp.sys.ibm.pc.hardware    968
Comp        2                 comp.windows.x              978
Comp        3                 comp.sys.mac.hardware       956
Rec         4                 rec.motorcycles             975
Rec         5                 rec.sport.baseball          982
Rec         6                 rec.sport.hockey            973
Sci         7                 sci.electronics             976
Sci         8                 sci.med                     985
Sci         9                 sci.space                   987
Talk        10                talk.politics.guns          907
Talk        11                talk.politics.misc          992
Talk        12                talk.religion.misc          778
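As an illustration of how such a cross-domain split can be assembled, the sketch below builds one source/target task (task 1 of Table 4: subcategories 1 vs 5 for the source and 3 vs 6 for the target) with scikit-learn's 20-Newsgroups loader and a bag-of-words representation. The particular vectorizer and the mapping of top categories to binary labels are assumptions, not the paper's preprocessing.

```python
import numpy as np
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer

def load_domain(positive, negative):
    # Documents of the positive top category get label +1, the negative one -1.
    data = fetch_20newsgroups(subset='all', categories=[positive, negative],
                              remove=('headers', 'footers', 'quotes'))
    y = np.where(np.array(data.target_names)[data.target] == positive, 1, -1)
    return data.data, y

# Source domain: comp.sys.ibm.pc.hardware (comp) vs rec.sport.baseball (rec);
# target domain: comp.sys.mac.hardware (comp) vs rec.sport.hockey (rec).
Xs_txt, ys = load_domain('comp.sys.ibm.pc.hardware', 'rec.sport.baseball')
Xt_txt, yt = load_domain('comp.sys.mac.hardware', 'rec.sport.hockey')

vec = TfidfVectorizer(max_features=5000)
Xs = vec.fit_transform(Xs_txt).toarray()   # fit the vocabulary on the source domain
Xt = vec.transform(Xt_txt).toarray()       # reuse it for the target domain
```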
4.1.1.3. ORL and Yale face dataset. The ORL face database [31] contains 10 different images for each of 40 distinct subjects. The subjects were imaged at different times, with varying lighting conditions, facial expressions, and facial details. All images were captured against a dark homogeneous background with the subjects in an upright, frontal position, with a small tolerance for side movement (Fig. 2).

The Yale face database [32] contains 165 grayscale images of 15 individuals. There are 11 images per subject, one per facial expression or configuration: center-light, with glasses, happy, left-light, without glasses, normal, right-light, sad, sleepy, surprised, and winking (Fig. 3).

4.1.2. Classification performance assessment

For the UCI data, the task is to discriminate between the positive and negative classes of each dataset. The parameter C is chosen from the range {0.001, 0.01, 0.05, 0.1, 0.2, 0.5, 1, 2, 5, 10, 20, 50, 100, 500, 1000, 2000, 4000, 8000}. Because of the small number of training samples in the target domain, the 12 candidate values of the parameter C_t are {0.0001, 0.001, 0.01, 0.05, 0.1, 0.2, 0.5, 1, 2, 5, 10, 100}. In TL-ELM, μ is selected from the range {0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1}. The Tr-AdaBoost (Tr-SVM) parameters are set according to [6].
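A hedged sketch of this model-selection loop is given below: it grid-searches C_t and μ for TL-ELM by cross-validation on the small labeled target set, reusing the tl_elm_fit sketch given after the TL-ELM algorithm in Section 3.2.1; the fold count and scoring are illustrative assumptions.

```python
import numpy as np
from itertools import product
from sklearn.model_selection import StratifiedKFold

Ct_grid = [0.0001, 0.001, 0.01, 0.05, 0.1, 0.2, 0.5, 1, 2, 5, 10, 100]
mu_grid = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]

def select_tl_elm_params(Ht, yt, w_s, n_splits=3):
    """Pick (C_t, mu) maximizing mean cross-validated accuracy on the target data.
    tl_elm_fit is the TL-ELM training sketch from Section 3.2.1."""
    best = (None, None, -np.inf)
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=0)
    for Ct, mu in product(Ct_grid, mu_grid):
        scores = []
        for tr, va in cv.split(Ht, yt):
            w_t = tl_elm_fit(Ht[tr], yt[tr], w_s, Ct=Ct, mu=mu)
            scores.append((np.sign(Ht[va] @ w_t) == yt[va]).mean())
        if np.mean(scores) > best[2]:
            best = (Ct, mu, np.mean(scores))
    return best
```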
In Table 3, the TL-ELM method delivers more stable results across all the datasets and is highly competitive on most of them, obtaining the best classification accuracy more often than any other method. Hence, as discussed in the above section, the TL-ELM algorithm possesses overall advantages over Tr-AdaBoost (Tr-SVM) and the other methods in the sense of both computational complexity and classification accuracy.

Table 3
Different algorithms' performance on the UCI datasets (accuracy in %, training time in s).

Data set      Source sample (positive vs negative class)   Target sample (positive vs negative class)   Accuracy: Tr-AdaBoost (Tr-SVM) / ELM / TL-ELM   Training time: Tr-AdaBoost (Tr-SVM) / ELM / TL-ELM
Banana        (30%, 450) vs (50%, 1700)                     (30%, 390) vs (50%, 1300)                     82.51 / 82.62 / 86.31                           1.31 / 0.16 / 0.86
Titanic       (30%, 250) vs (50%, 683)                      (30%, 250) vs (50%, 684)                      92.31 / 93.78 / 96.62                           1.12 / 0.19 / 0.39
Waveform      (30%, 520) vs (50%, 6535)                     (30%, 520) vs (50%, 6534)                     90.93 / 92.62 / 96.73                           1.42 / 0.11 / 0.98
Image         (30%, 260) vs (50%, 722)                      (30%, 260) vs (50%, 722)                      96.31 / 96.40 / 97.62                           0.91 / 0.06 / 0.26
Heart         (30%, 30) vs (50%, 85)                        (30%, 30) vs (50%, 85)                        93.76 / 94.67 / 95.58                           1.62 / 0.21 / 0.85
Diabetes      (30%, 86) vs (50%, 241)                       (30%, 86) vs (50%, 242)                       87.25 / 91.09 / 96.24                           1.14 / 0.13 / 0.82
Flare Solar   (30%, 120) vs (50%, 333)                      (30%, 120) vs (50%, 333)                      89.69 / 89.91 / 92.62                           1.83 / 0.15 / 0.56
Splice        (30%, 368) vs (50%, 974)                      (30%, 368) vs (50%, 975)                      87.78 / 90.42 / 96.32                           1.05 / 0.10 / 0.79

Details of the data in the different domains are summarized in Table 2. As shown in Table 4, TL-ELM is clearly superior to Tr-AdaBoost (Tr-SVM) in classification accuracy on almost all of these datasets, and it obtains good performance compared with the traditional ELM. Although the TL-ELM algorithm achieves higher accuracy, it costs more training time than the traditional ELM algorithm.

Table 4
Different algorithms' performance on the 20-NewsGroups datasets (accuracy in %, training time in s).

Task   Source sample (positive vs negative class)   Target sample (positive vs negative class)   Accuracy: Tr-AdaBoost (Tr-SVM) / ELM / TL-ELM   Training time: Tr-AdaBoost (Tr-SVM) / ELM / TL-ELM
1      1 vs 5                                        3 vs 6                                        88.65 / 96.35 / 97.35                           2.76 / 0.22 / 0.67
2      1 vs 7                                        3 vs 8                                        93.21 / 94.31 / 94.28                           1.54 / 0.11 / 0.35
3      1 vs 11                                       3 vs 12                                       86.74 / 88.29 / 89.15                           0.89 / 0.09 / 0.21
4      5 vs 7                                        6 vs 8                                        83.21 / 87.12 / 87.03                           2.34 / 0.21 / 0.98
5      5 vs 6                                        11 vs 12                                      79.62 / 86.27 / 89.37                           1.67 / 0.13 / 0.34
6      7 vs 11                                       8 vs 12                                       88.15 / 89.67 / 90.87                           2.13 / 0.24 / 0.67

Fig. 4 shows how the classification accuracy of the different algorithms changes with the ratio between T_t and T_s. Obviously, high classification accuracy is strongly dependent on the ratio between T_t and T_s. The sensitivity analysis of the parameter μ is shown in Fig. 5; μ is a prescribed value that lies in the interval [0, 1], and when the two domains are closely related and μ is larger, the classification accuracy is higher.

The relationship between noise level and classification accuracy is detailed in Table 5. The noise experiment is carried out on the comp.vs.talk data set: zero-mean Gaussian noise with different standard deviations was added to the training samples. As shown in Table 5, TL-ELM is clearly superior to the traditional ELM in classification accuracy, but its accuracy declines as the noise increases, so it does not show very good robustness to noise.
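The noise protocol described above can be reproduced with a few lines of NumPy; the standard-deviation levels chosen here are illustrative, since Table 5's exact noise levels are not reproduced in this text.

```python
import numpy as np

def add_gaussian_noise(X, std, rng=np.random.default_rng(0)):
    # Zero-mean Gaussian noise with the given standard deviation, added feature-wise.
    return X + rng.normal(0.0, std, size=X.shape)

# Example: evaluate a fitted classifier on increasingly noisy copies of the training data.
# for std in [0.0, 0.1, 0.2, 0.5, 1.0]:
#     Xs_noisy = add_gaussian_noise(Xs, std)
#     ... retrain TL-ELM on Xs_noisy and record the accuracy ...
```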
Fig. 6. Face image samples of the Yale and ORL datasets: (a) Yale faces of a subject, (b) Yale faces of a subject with rotation of 10°, (c) ORL faces of a subject, and (d) ORL faces of a subject with rotation of 10°.
We use the overall classification accuracy (ACC, i.e., the percentage of correctly labeled samples over the total number of samples) as the reference classification accuracy measure. The performance of TL-DAKELM is compared with LS-SVM, LS-DAKSVM, and the traditional ELM. For the multiclass classification tasks, LS-SVM uses the one-against-one (OAO) multi-class separation strategy. For each evaluation, 5 rounds of experiments are repeated with randomly selected training data, and the average result is recorded as the final classification accuracy in Table 6.

Table 6
Means (%) of classification accuracy (ACC) of all algorithms on Yale and ORL with different rotation angles.

Face data   Rotation   LS-SVM   LS-DAKSVM   ELM     TL-DAKELM
Yale        10°        67.90    75.89       76.90   89.62
Yale        30°        65.32    71.62       70.66   86.78
Yale        50°        60.17    65.78       64.32   80.34
ORL         10°        79.45    85.77       86.92   92.66
ORL         30°        76.74    84.15       78.93   89.71
ORL         50°        72.81    83.23       75.85   85.45
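A sketch of this averaged-accuracy protocol is shown below; the train/test split ratio and the placeholder train_and_predict hook are assumptions standing in for whichever classifier (LS-SVM, LS-DAKSVM, ELM, or TL-DAKELM) is being evaluated.

```python
import numpy as np

def mean_accuracy(X, y, train_and_predict, rounds=5, train_frac=0.5,
                  rng=np.random.default_rng(0)):
    """Average accuracy over `rounds` random train/test splits (as used for Table 6)."""
    accs = []
    for _ in range(rounds):
        idx = rng.permutation(len(y))
        n_train = int(train_frac * len(y))
        tr, te = idx[:n_train], idx[n_train:]
        y_pred = train_and_predict(X[tr], y[tr], X[te])   # hypothetical classifier hook
        accs.append(np.mean(y_pred == y[te]))
    return float(np.mean(accs))
```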
The overall accuracy of LS-SVM is lower than that of every other classifier in all the DAL tasks. With the increase in rotation angle, the classification performance of all classifiers descends gradually; however, TL-DAKELM decreases more slowly than the other methods. The traditional ELM exhibits competitive performance to some extent compared to the other methods, particularly on the more complex datasets. As shown in Table 6, the TL-DAKELM method delivers more stable results across all the datasets and is highly competitive on most of them, obtaining the best classification accuracy more often than any other method. Hence, as discussed in the above section, TL-DAKELM possesses overall advantages over the other methods for DAL in the sense of both computational complexity and classification accuracy.

5. Conclusions and future research

We address the issue of transfer learning based on ELM for classification in this paper. The basic idea of TL-ELM is to use a small amount of labeled target domain data and a large amount of old source domain data to build a high-quality classification model. Starting from the solution of independent ELMs, we showed that adding a new term to the cost function, which penalizes the difference between the source and target classifiers, produces knowledge transfer. The results show that the proposed TL-ELM can effectively improve classification by learning cross-domain knowledge and is robust to different sizes of training data.

In addition, a novel domain adaptation kernel extreme learning machine (TL-DAKELM) based on the kernel extreme learning machine was proposed as an extension of TL-ELM, and experimental results show its effectiveness. In the framework of domain adaptation, due to the absence of prior information for the target domain, the traditional statistical validation strategies proposed in the previous literature cannot be used to efficiently assess the resulting classifier. Hence, in the near future, we also expect to investigate an effective validation strategy to validate solutions consistent with both the source and target domains.

Acknowledgments

We appreciate the anonymous reviewers for their valuable comments. This work was supported by the Zhejiang Provincial Natural Science Foundation of China (No. LR12F03002) and the National Natural Science Foundation of China (No. 61375049).
References

[1] A.K. Jain, J. Mao, Artificial neural networks: a tutorial, IEEE Comput. (1996) 31–44.
[2] R. Hetch-Neilsen, Theory of the back propagation neural network, in: Proceedings of the International Joint Conference on Neural Networks, 1989, pp. 593–605.
[3] C. Cortes, V.N. Vapnik, Support vector networks, Mach. Learn. 20 (1995) 273–297.
[4] J.A.K. Suykens, J. Vandewalle, Least squares support vector machine classifiers, Neural Process. Lett. 9 (1999) 293–300.
[5] G.B. Huang, L. Chen, C.K. Siew, Universal approximation using incremental constructive feedforward networks with random hidden nodes, IEEE Trans. Neural Netw. 17 (2006) 879–892.
[6] F. Han, D.S. Huang, Improved extreme learning machine for function approximation by encoding a priori information, Neurocomputing 69 (2006) 2369–2373.
[7] C.T. Kim, J.J. Lee, Training two-layered feedforward networks with variable projection method, IEEE Trans. Neural Netw. 19 (2008) 371–375.
[8] Q.Y. Zhu, A.K. Qin, P.N. Suganthan, G.B. Huang, Evolutionary extreme learning machine, Pattern Recognit. 38 (2005) 1759–1763.
[9] Y. Wang, F. Cao, Y. Yuan, A study on effectiveness of extreme learning machine, Neurocomputing 74 (2011) 2483–2490.
[10] G.H. Li, M. Liu, M.Y. Dong, A new online learning algorithm for structure-adjustable extreme learning machine, Comput. Math. Appl. 60 (2010) 377–389.
[11] G.B. Huang, L. Chen, Convex incremental extreme learning machine, Neurocomputing 70 (2007) 3056–3062.
[12] G.B. Huang, M.B. Li, L. Chen, C.K. Siew, Incremental extreme learning machine with fully complex hidden nodes, Neurocomputing 71 (2008) 576–583.
[13] G. Feng, G.B. Huang, Q. Lin, R. Gay, Error minimized extreme learning machine with growth of hidden nodes and incremental learning, IEEE Trans. Neural Netw. 20 (2009) 1352–1357.
[14] H.J. Rong, Y.S. Ong, A.H. Tan, Z. Zhu, A fast pruned-extreme learning machine for classification problem, Neurocomputing 72 (2008) 359–366.
[15] H.J. Rong, G.B. Huang, N. Sundararajan, P. Saratchandran, Online sequential fuzzy extreme learning machine for function approximation and classification problems, IEEE Trans. Syst. Man Cybern. Part B: Cybern. 39 (2009) 1067–1072.
[16] M. Yoan, A. Sorjamaa, P. Bas, O. Simula, C. Jutten, A. Lendasse, OP-ELM: optimally pruned extreme learning machine, IEEE Trans. Neural Netw. 21 (2010) 158–162.
[17] Y. Lan, Y.C. Soh, G.B. Huang, Constructive hidden nodes selection of extreme learning machine for regression, Neurocomputing 73 (2010) 3191–3199.
[18] W.W. Zong, G.B. Huang, Y. Chen, Weighted extreme learning machine for imbalance learning, Neurocomputing 101 (2013) 229–242.
[19] E. Parviainen, J. Riihimäki, Y. Miche, A. Lendasse, Interpreting extreme learning machine as an approximation to an infinite neural network, in: Proceedings of the International Conference on Knowledge Discovery and Information Retrieval, 2010, pp. 65–73.
[20] S.J. Pan, J.T. Kwok, Q. Yang, Transfer learning via dimensionality reduction, in: Proceedings of the AAAI Conference on Artificial Intelligence, 2008, pp. 677–682.
[21] S.J. Pan, Q. Yang, A survey on transfer learning, IEEE Trans. Knowl. Data Eng. 22 (2010) 1345–1359.
[22] T. Croonenborghs, K. Driessens, M. Bruynooghe, Learning relational options for inductive transfer in relational reinforcement learning, in: Proceedings of the 17th Conference on Inductive Logic Programming, 2007.
[23] A. Arnold, R. Nallapati, W.W. Cohen, A comparative study of methods for transductive transfer learning, in: Proceedings of the 17th International Conference on Data Mining Workshops, 2007, pp. 77–82.
[24] W. Dai, Q. Yang, G. Xue, Y. Yu, Boosting for transfer learning, in: Proceedings of the 24th International Conference on Machine Learning, 2007, pp. 193–200.
[25] P. Wu, T.G. Dietterich, Improving SVM accuracy by training on auxiliary data sources, in: Proceedings of the 21st International Conference on Machine Learning, 2004.
[26] G.B. Huang, Q.Y. Zhu, C.K. Siew, Extreme learning machine: theory and applications, Neurocomputing 70 (2006) 489–501.
[27] G.B. Huang, H. Zhou, X. Ding, R. Zhang, Extreme learning machine for regression and multiclass classification, IEEE Trans. Syst. Man Cybern. Part B: Cybern. 42 (2012) 513–529.
[28] D. Serre, Matrices: Theory and Applications, Springer, 2002.
[29] J. Pan, X. Wang, Y. Cheng, et al., Multi-source transfer ELM-based Q learning, Neurocomputing (2013).
[30] G.B. Huang, X. Ding, H. Zhou, Optimization method based extreme learning machine for classification, Neurocomputing 74 (2010) 155–163.
[31] ORL Face Database, AT&T Laboratories Cambridge, 〈http://www.camorl.co.uk/facedatabase.html〉, 2005.
[32] Yale Face Database, Columbia Univ., 〈http://www.cs.columbia.edu/belhumeur/pub/images/yalefaces/〉, 2005.
[33] L. Bruzzone, M. Marconcini, Domain adaptation problems: a DASVM classification technique and a circular validation strategy, IEEE Trans. Pattern Anal. Mach. Intell. 32 (2010) 770–787.
[34] J.W. Tao, F.L. Chung, S. Wang, On minimum distribution discrepancy support vector machine for domain adaptation, Pattern Recognit. 45 (2012) 3962–3984.
[35] E.W. Xiang, B. Cao, D.H. Hu, Q. Yang, Bridging domains using world wide knowledge for transfer learning, IEEE Trans. Knowl. Data Eng. 22 (2010) 770–783.
[36] L. Duan, I.W. Tsang, D. Xu, et al., Domain transfer SVM for video concept detection, in: Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition, 2009, pp. 1375–1381.
[37] G.B. Huang, D.H. Wang, Y. Lan, Extreme learning machines: a survey, Int. J. Mach. Learn. Cybernet. 2 (2011) 107–122.

Wei Jiang received the Ph.D. degree from the Faculty of Engineering, Tokyo University of Industry, Tokyo, Japan, in 2005. His current research interests include machine vision, pattern recognition, and extreme learning machines.