
IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, VOL. 29, NO. 1, JANUARY 2018

Broad Learning System: An Effective and Efficient Incremental Learning System Without the Need for Deep Architecture

C. L. Philip Chen, Fellow, IEEE, and Zhulin Liu
Abstract— Broad Learning System (BLS), which aims to offer an alternative way of learning in deep structure, is proposed in this paper. Deep structure and learning suffer from a time-consuming training process because of a large number of connecting parameters in filters and layers. Moreover, a complete retraining process is required if the structure is not sufficient to model the system. The BLS is established in the form of a flat network, where the original inputs are transferred and placed as "mapped features" in feature nodes and the structure is expanded in a wide sense in the "enhancement nodes." Incremental learning algorithms are developed for fast remodeling in broad expansion without a retraining process if the network is deemed to need expansion. Two incremental learning algorithms are given for both the increment of the feature nodes (or filters in deep structure) and the increment of the enhancement nodes. The designed model and algorithms are very versatile for selecting a model rapidly. In addition, another incremental learning algorithm is developed for a system that has been modeled and then encounters new incoming inputs. Specifically, the system can be remodeled in an incremental way without the entire retraining from the beginning. Satisfactory results for model reduction using singular value decomposition are presented to simplify the final structure. Compared with existing deep neural networks, experimental results on the Modified National Institute of Standards and Technology database and the NYU NORB object recognition data set demonstrate the effectiveness of the proposed BLS.

Index Terms— Big data, big data modeling, broad learning system (BLS), deep learning, incremental learning, random vector functional-link neural networks (RVFLNN), single layer feedforward neural networks (SLFN), singular value decomposition (SVD).

Manuscript received March 25, 2017; revised May 20, 2017 and June 11, 2017; accepted June 12, 2017. Date of publication July 21, 2017; date of current version December 29, 2017. This work was supported in part by the Macao Science and Technology Development Fund under Grant 019/2015/A1, in part by UM Research Grants, and in part by the National Nature Science Foundation of China under Grant 61572540. (Corresponding author: C. L. Philip Chen.)

C. L. P. Chen is with the Department of Computer and Information Science, Faculty of Science and Technology, University of Macau, Macau 99999, China; with Dalian Maritime University, Dalian 116026, China; and also with the State Key Laboratory of Management and Control for Complex Systems, Institute of Automation, Chinese Academy of Sciences, Beijing 100080, China (e-mail: [email protected]).

Z. Liu is with the Department of Computer and Information Science, Faculty of Science and Technology, University of Macau, Macau 99999, China (e-mail: [email protected]).

Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/TNNLS.2017.2716952

I. INTRODUCTION

DEEP structure neural networks and learning have been applied in many fields and have achieved breakthrough successes in a great number of applications [1], [2], particularly in large scale data processing [3], [4]. Among them, the most popular deep networks are the deep belief networks (DBN) [5], [6], the deep Boltzmann machines (DBM) [7], and the convolutional neural networks (CNN) [8], [9]. Although the deep structure has been so powerful, most networks suffer from a time-consuming training process because a great number of hyperparameters and complicated structures are involved. Moreover, this complication makes it so difficult to analyze the deep structure theoretically that most work is spent on tuning the parameters or stacking more layers for better accuracy. To achieve this, more and more powerful computing resources have been involved. Recently, some variations in hierarchical structure [10]–[12] or ensembles [13]–[17] have been proposed to improve the training performance.

Single layer feedforward neural networks (SLFN) have been widely applied to solve problems such as classification and regression because of their universal approximation capability [18]–[21]. Conventional methods for training SLFN are gradient-descent-based learning algorithms [22], [23]. Their generalization performance is sensitive to parameter settings such as the learning rate. Similarly, they usually suffer from slow convergence and may be trapped in a local minimum. The random vector functional-link neural network (RVFLNN), proposed in [19]–[21], offers a different learning method. RVFLNN effectively eliminates the drawback of the long training process and also provides generalization capability in function approximation [20], [24]. It has been proven that RVFLNN is a universal approximator for continuous functions on compact sets with a fast learning property. Therefore, RVFLNN has been employed to solve problems in diverse domains, including the context of modeling and control [25]. Although RVFLNN improves the performance of the perceptron significantly, this technique could not work well on remodeling high-volume and time-varying data in the modern large data era [26]. To model moderate data sizes, a dynamic step-wise updating algorithm was proposed in [27] to update the output weights of the RVFLNN for both a newly added pattern and a newly added enhancement node. That paper paves a path for remodeling a system that has been modeled and encounters new incoming data.

Nowadays, in addition to the growth of data in size, the data dimensions also increase tremendously. Taking raw data with high dimension directly into a neural network, the system cannot sustain its accessibility anymore. The challenge of solving high dimensional data problems has become imperative recently. Two common practices to alleviate this problem are dimension reduction and feature extraction. Feature extraction
is to seek, if possible, the optimal transformation of the input data into feature vectors. Common approaches for feature selection, which have the advantage of easy implementation and outstanding efficiency, include variable ranking [28], feature subset selection [29], penalized least squares [30], random feature extraction such as nonadaptive random projections [31] and random forest [32], or convolution-based input mapping, to name a few.

Like feature extraction, the RVFLNN can take mapped features as its input. The proposed Broad Learning System (BLS) is designed based on the idea of the RVFLNN. In addition, the BLS can effectively and efficiently update the system (or relearn) incrementally when it is deemed necessary. The BLS is designed as follows: first, the mapped features are generated from the input data to form the feature nodes. Second, the mapped features are enhanced as enhancement nodes with randomly generated weights. The connections of all the mapped features and the enhancement nodes are fed into the output. Ridge regression of the pseudoinverse is designed to find the desired connection weights. Broad learning algorithms are designed for the network through the broad expansion in both the feature nodes and the enhancement nodes. The incremental learning algorithms are developed for fast remodeling in broad expansion without a retraining process if the network is deemed to need expansion.

Fig. 1. Functional-link neural network [27].

Fig. 2. Same functional-link network redrawn from Fig. 1 [27].

It should be noted that once the learning system has completed its modeling, it may contain some redundancy due to the broad expansion. The system can be further simplified using low-rank approximations. Low-rank approximation has been established as a new tool in scientific computing to address large-scale linear and multilinear algebra problems that would be intractable by classical techniques. In [33] and [34], comprehensive expositions of the theory, algorithms, and applications of structured low-rank approximations are presented. Among the various algorithms, the singular value decomposition (SVD) and nonnegative matrix factorization (NMF) [35] are widely used for exploratory data analysis. By embedding these classical low-rank algorithms into the proposed broad learning network, we also design an SVD-based structure to simplify the broad learning algorithm. This simplification can be done by compressing the system on-the-fly right after additional feature nodes are inserted, right after additional enhancement nodes are inserted, or after both together. One can also choose to compress the whole structure after the learning is completed. The algorithms are designed to be so versatile that one can select the number of equivalent neural nodes in the final structure. This lays out an excellent approach for model selection.

This paper is organized as follows. In Section II, preliminaries of the RVFLNN, the ridge regression approach to the pseudoinverse, the sparse autoencoder, and the SVD are given. Section III presents the BLS and gives the details of the proposed broad learning algorithms. Section IV compares the performance in Modified National Institute of Standards and Technology database (MNIST) classification and NYU object recognition benchmark (NORB) classification with those from various deep systems. The performance analysis of the proposed broad learning algorithms is also addressed there. Finally, discussions and conclusions are given in Section V.

II. PRELIMINARIES

Pao and Takefuji [19] and Igelnik and Pao [21] give a classical mathematical discussion of the advantages of the functional-link network in terms of training speed and its generalization property over general feedforward networks [20]. The capabilities and universal approximation properties of the RVFLNN have been clearly shown and, hence, are omitted in this section. The illustration of the flatted characteristics of the functional-link network is shown again in Figs. 1 and 2. In this section, first, the proposed broad learning is introduced based on the rapid and dynamic learning features of the functional-link network developed by Chen and Wan [27] and Chen [36]. Second, the ridge regression approximation algorithm for the pseudoinverse is recalled. After that, the sparse autoencoder and the SVD are briefly discussed.

A. Dynamic Stepwise Updating Algorithm for Functional-Link Neural Networks

For a general network in a usual classification task, referring to Figs. 1 and 2, denote by A the matrix [X | ξ(X W_h + β_h)], where A is the expanded input matrix consisting of all input vectors combined with the enhancement components. The functional-link network model is illustrated in Fig. 2. In [27], a dynamic model was proposed to update the weights of the system instantly for both a newly added pattern and a newly added enhancement node. Compared with the classic model, this model is simple, fast, and easy to update. Generally, this model was inspired by the pseudoinverse of a partitioned matrix described in [37] and [38].

Fig. 3. Illustration of the stepwise update algorithm [27].

Denote by A_n the n × m pattern matrix. In this section, we only introduce the stepwise updating algorithm for adding a new enhancement node to the network, as shown in Fig. 3. In this case, it is equivalent to adding a new column to the input matrix A_n. Denote A_{n+1} \triangleq [A_n \,|\, a]. Then the pseudoinverse of the new matrix A_{n+1} equals

A_{n+1}^+ = \begin{bmatrix} A_n^+ - d\,b^T \\ b^T \end{bmatrix}

where d = A_n^+ a,

b^T = \begin{cases} (c)^+ & \text{if } c \neq 0 \\ (1 + d^T d)^{-1} d^T A_n^+ & \text{if } c = 0 \end{cases}

and c = a - A_n d. Again, the new weights are

W_{n+1} = \begin{bmatrix} W_n - d\,b^T Y_n \\ b^T Y_n \end{bmatrix}

where W_{n+1} and W_n are the weights after and before the new enhancement node is added, respectively. In this way, the new weights can be updated easily by only computing the pseudoinverse of the corresponding added node. It should be noted that if A_n is of full rank, then c = 0, which allows a fast update of the pseudoinverse A_n^+ and the weights W_n.
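As a concrete illustration of the stepwise update above, the following NumPy sketch adds one column a (a new enhancement node) to A_n and updates the pseudoinverse and output weights with the partitioned-matrix formulas of this subsection. The function name and the tolerance used to test c = 0 are assumptions made for the illustration, not part of the paper.

```python
import numpy as np

def stepwise_add_column(A, A_pinv, W, Y, a, tol=1e-12):
    """Add one column a (a new enhancement node) to A_n and update its
    pseudoinverse and the output weights with the partitioned-matrix
    formulas of Section II-A. A_pinv is the current A_n^+; Y holds the targets."""
    a = a.reshape(-1, 1)
    d = A_pinv @ a                       # d = A_n^+ a
    c = a - A @ d                        # c = a - A_n d
    if np.linalg.norm(c) > tol:          # c != 0
        b_T = np.linalg.pinv(c)          # (c)^+, a 1 x N row vector
    else:                                # c == 0
        b_T = (d.T @ A_pinv) / (1.0 + float(d.T @ d))
    A_new = np.hstack([A, a])
    A_pinv_new = np.vstack([A_pinv - d @ b_T, b_T])
    W_new = np.vstack([W - d @ (b_T @ Y), b_T @ Y])
    return A_new, A_pinv_new, W_new
```

The same partitioned-matrix pattern reappears in Section III, where whole blocks of columns (or rows) are appended instead of a single column.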
B. Pseudoinverse and Ridge Regression Learning Algorithms

In a flatted network, the pseudoinverse can be considered a very convenient approach to solve for the output-layer weights of a neural network. Different methods can be used to calculate this generalized inverse, such as the orthogonal projection method, the orthogonalization method, the iterative method, and the SVD [39], [40]. However, a direct solution is too expensive, especially when the training samples and input patterns suffer from high volume, high velocity, and/or high variety [26]. Also, the pseudoinverse, which is the least square estimator of linear equations, aims to reach the output weights with the smallest training error, but this may not be true for the generalization error, especially for ill-conditioned problems. In fact, the following kind of optimization problem is an alternative way to solve for the pseudoinverse:

\arg\min_W : \; \| A W - Y \|_v^{\sigma_1} + \lambda \| W \|_u^{\sigma_2}   (1)

where σ_1 > 0, σ_2 > 0, and u, v denote the kinds of norm regularization. By taking σ_1 = σ_2 = u = v = 2, the above optimization problem becomes the regular l_2 norm regularization, which is convex and has better generalization performance.

The value λ denotes the further constraint on the sum of the squared weights W. This solution is equivalent to ridge regression theory, which gives an approximation to the Moore–Penrose generalized inverse by adding a positive number to the diagonal of A^T A or A A^T [41]. Theoretically, if λ = 0, the inverse problem degenerates into the least square problem and leads to the original pseudoinverse solution. On the other hand, if λ → ∞, the solution is heavily constrained and tends to 0. Consequently, we have

W = (\lambda I + A^T A)^{-1} A^T Y.   (2)

Specifically, we have that

A^+ = \lim_{\lambda \to 0} (\lambda I + A^T A)^{-1} A^T.   (3)
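A minimal NumPy sketch of the ridge approximation (2)–(3) follows; the helper names and the default λ = 1e-8 (the value reported later in Section IV) are assumptions made for the illustration.

```python
import numpy as np

def ridge_pinv(A, lam=1e-8):
    """Ridge approximation of the Moore-Penrose pseudoinverse,
    A^+ ~ (lam*I + A^T A)^{-1} A^T, cf. Eqs. (2)-(3)."""
    k = A.shape[1]
    return np.linalg.solve(lam * np.eye(k) + A.T @ A, A.T)

def ridge_weights(A, Y, lam=1e-8):
    """Output weights W = (lam*I + A^T A)^{-1} A^T Y of the flat network."""
    return ridge_pinv(A, lam) @ Y
```

Any small positive λ plays the same role; as λ shrinks the solution approaches the exact pseudoinverse solution of the least square problem.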

C. Sparse Autoencoder

Supervised learning tasks, such as classification, usually need a good feature representation of the input to achieve outstanding performance. Feature representation is not only an efficient way of representing the data but also, more importantly, captures the characteristics of the data. Usually, either an intractable mathematical derivation is used or an easy random initialization is applied to generate a set of random features. However, randomness suffers from unpredictability and needs guidance. To overcome this random nature, the sparse autoencoder can be regarded as an important tool to slightly fine-tune the random features into a set of sparse and compact features. Specifically, sparse feature learning models have been attractive because they can explore essential characterizations [12], [42], [43].

To extract sparse features from the given training data X, one can consider solving the following optimization problem [44]. Notice that it is equivalent to the optimization problem in (1) if we set σ_2 = u = 1 and σ_1 = v = 2:

\arg\min_{\hat{W}} : \; \| Z \hat{W} - X \|_2^2 + \lambda \| \hat{W} \|_1   (4)

where Ŵ is the sparse autoencoder solution and Z is the desired output of the given linear equation, i.e., X W = Z.

The above problem, denoted as the lasso in [45], is convex. Consequently, the approximation problem in (4) can be solved in dozens of ways, such as orthogonal matching pursuit [46], K-SVD [47], the alternating direction method of multipliers (ADMM) [48], and the fast iterative shrinkage-thresholding algorithm (FISTA) [49]. Among them, the ADMM method is actually designed for general decomposition methods and decentralized algorithms in optimization problems. Moreover, it has been shown that many state-of-the-art algorithms for l_1-norm problems can be derived from ADMM [50], such as FISTA. Hence, a brief review of a typical approach for the lasso is presented below.

First, (4) can be equivalently considered as the following general problem:

\arg\min_w : \; f(w) + g(w), \quad w \in \mathbb{R}^n   (5)

where f(w) = \| Z w - x \|_2^2 and g(w) = \lambda \| w \|_1. In ADMM form, the above problem can be rewritten as

\arg\min_w : \; f(w) + g(o), \quad \text{s.t. } w - o = 0.   (6)

Therefore, the proximal problem can be solved by the following iterative steps:

\begin{cases} w_{k+1} := (Z^T Z + \rho I)^{-1} (Z^T x + \rho (o_k - u_k)) \\ o_{k+1} := S_{\lambda/\rho}(w_{k+1} + u_k) \\ u_{k+1} := u_k + (w_{k+1} - o_{k+1}) \end{cases}   (7)

where ρ > 0 and S is the soft thresholding operator, which is defined as

S_\kappa(a) = \begin{cases} a - \kappa, & a > \kappa \\ 0, & |a| \le \kappa \\ a + \kappa, & a < -\kappa. \end{cases}   (8)
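A compact NumPy sketch of the ADMM iteration (7)–(8) for the lasso (4) follows; this is the linear inverse problem that Section III-A applies to fine-tune the random feature weights. The iteration count, ρ, λ, and the function names are assumptions made for the sketch.

```python
import numpy as np

def soft_threshold(a, kappa):
    """Soft-thresholding operator S_kappa of Eq. (8), applied elementwise."""
    return np.sign(a) * np.maximum(np.abs(a) - kappa, 0.0)

def lasso_admm(Z, X, lam=1e-3, rho=1.0, iters=50):
    """ADMM iteration of Eq. (7) for arg min ||Z W - X||_2^2 + lam ||W||_1."""
    d = Z.shape[1]
    W = np.zeros((d, X.shape[1]))
    O = np.zeros_like(W)
    U = np.zeros_like(W)
    # Factor (Z^T Z + rho I) once; it is reused in every w-update of (7).
    L = np.linalg.cholesky(Z.T @ Z + rho * np.eye(d))
    ZtX = Z.T @ X
    for _ in range(iters):
        W = np.linalg.solve(L.T, np.linalg.solve(L, ZtX + rho * (O - U)))
        O = soft_threshold(W + U, lam / rho)
        U = U + W - O
    return O      # the sparse o-variable is returned as the lasso solution
```

Caching the Cholesky factor of Z^T Z + ρI is a standard ADMM design choice: the linear system in the w-update is the same in every iteration, so it is factored once and only back-substitutions are repeated.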
D. Singular Value Decomposition

Now comes a highlight of linear algebra. Any real m × n matrix A can be factored as

A = U \Sigma V^T

where U is an m × m orthogonal matrix whose columns are the eigenvectors of A A^T, V is an n × n orthogonal matrix whose columns are the eigenvectors of A^T A, and Σ is an m × n diagonal matrix of the form

\Sigma = \mathrm{diag}\{\sigma_1, \ldots, \sigma_r, 0, \ldots, 0\}

with σ_1 ≥ σ_2 ≥ ··· ≥ σ_r > 0 and r = rank(A). Moreover, σ_1, ..., σ_r are the square roots of the eigenvalues of A^T A. They are called the singular values of A. Therefore, we achieve a decomposition of the matrix A, which is one of a number of effective numerical analysis tools used to analyze matrices. In our algorithms, two different ways to reduce the size of the matrix are involved. In the first, the threshold parameter η is set as 0 < η ≤ 1, which means that the components associated with a singular value σ_i ≥ ησ_1 are kept. The second case is to select a fixed number l of singular values, where l is smaller than n. Define a threshold value ε, which is η for case 1 and l for case 2. In practice, both cases may occur depending on various requirements. The SVD technique is known for its advantage in feature selection.

III. BROAD LEARNING SYSTEM

In this section, the details of the proposed BLS are given. First, the model construction that is suitable for broad expansion is introduced; the second part is the incremental learning for the dynamic expansion of the model. The two characteristics result in a complete system. Finally, model simplification using the SVD is presented.

A. Broad Learning Model

The proposed BLS is constructed based on the traditional RVFLNN. However, unlike the traditional RVFLNN that takes the input directly and establishes the enhancement nodes, we first map the inputs to construct a set of mapped features. In addition, we also develop incremental learning algorithms that can update the system dynamically.

Assume that we present the input data X and project the data, using φ_i(X W_{e_i} + β_{e_i}), to become the ith group of mapped features, Z_i, where W_{e_i} are random weights with the proper dimensions. Denote Z^i ≡ [Z_1, ..., Z_i], which is the concatenation of all the first i groups of mapped features. Similarly, the jth group of enhancement nodes, ξ_j(Z^i W_{h_j} + β_{h_j}), is denoted as H_j, and the concatenation of all the first j groups of enhancement nodes is denoted as H^j ≡ [H_1, ..., H_j]. In practice, i and j can be selected differently depending upon the complexity of the modeling tasks. Furthermore, φ_i and φ_k can be different functions for i ≠ k. Similarly, ξ_j and ξ_r can be different functions for j ≠ r. Without loss of generality, the subscripts of the ith random mappings φ_i and the jth random mappings ξ_j are omitted in this paper.

In our BLS, to take advantage of the sparse autoencoder characteristics, we apply the linear inverse problem in (7) and fine-tune the initial W_{e_i} to obtain better features. Next, the details of the algorithm are given below.

Assume the input data set X contains N samples, each with M dimensions, and Y is the output matrix, which belongs to R^{N×C}. For n feature mappings, each generating k nodes, the feature nodes can be represented as the equation of the form

Z_i = \phi(X W_{e_i} + \beta_{e_i}), \quad i = 1, \ldots, n   (9)

where W_{e_i} and β_{e_i} are randomly generated. Denote all the feature nodes as Z^n ≡ [Z_1, ..., Z_n], and denote the mth group of enhancement nodes as

H_m \equiv \xi(Z^n W_{h_m} + \beta_{h_m}).   (10)

Hence, the broad model can be represented as the equation of the form

Y = [Z_1, \ldots, Z_n \,|\, \xi(Z^n W_{h_1} + \beta_{h_1}), \ldots, \xi(Z^n W_{h_m} + \beta_{h_m})] W^m
  = [Z_1, \ldots, Z_n \,|\, H_1, \ldots, H_m] W^m
  = [Z^n \,|\, H^m] W^m

where W^m = [Z^n | H^m]^+ Y. W^m are the connecting weights for the broad structure and can be easily computed through the ridge regression approximation of [Z^n | H^m]^+ using (3). Fig. 4(a) shows the above broad learning network.
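The following sketch puts (9), (10), and the ridge solution for W^m together into a one-shot BLS trainer. It assumes NumPy, a tanh feature nonlinearity, a sigmoid enhancement nonlinearity, and a single enhancement group; the function and variable names (bls_train and so on) are illustrative and are not the authors' released code, and the sparse fine-tuning of W_{e_i} via (7) is omitted for brevity.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def bls_train(X, Y, n_groups=10, k=10, m_enh=1000, lam=1e-8, rng=None):
    """One-shot broad learning: build Z^n by (9), one enhancement group H by (10),
    and solve W = [Z^n | H]^+ Y through the ridge approximation (2)-(3)."""
    rng = np.random.default_rng(rng)
    N, M = X.shape
    Z_groups, We = [], []
    for _ in range(n_groups):                        # feature nodes, Eq. (9)
        W_ei = rng.uniform(-1, 1, (M, k))
        b_ei = rng.uniform(-1, 1, (1, k))
        We.append((W_ei, b_ei))
        Z_groups.append(np.tanh(X @ W_ei + b_ei))
    Zn = np.hstack(Z_groups)                         # Z^n = [Z_1, ..., Z_n]
    W_h = rng.uniform(-1, 1, (Zn.shape[1], m_enh))   # enhancement nodes, Eq. (10)
    b_h = rng.uniform(-1, 1, (1, m_enh))
    Hm = sigmoid(Zn @ W_h + b_h)
    A = np.hstack([Zn, Hm])                          # A = [Z^n | H^m]
    # Ridge-regression pseudoinverse: W^m = (lam*I + A^T A)^{-1} A^T Y
    Wm = np.linalg.solve(lam * np.eye(A.shape[1]) + A.T @ A, A.T @ Y)
    return Wm, We, (W_h, b_h)
```

At test time, the same random parameters We and (W_h, b_h) are reused to build [Z^n | H^m] for the new inputs and the output is obtained by one matrix product with W^m.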

Fig. 4. Illustration of BLS. (a) BLS. (b) BLS with an alternative enhancement nodes establishment.

B. Alternative Enhancement Nodes Establishment

In the previous section, the broad expansion of enhancement nodes is added synchronously with the connections from all mapped features. Here, a different construction can be done by connecting each group of mapped features to a group of enhancement nodes. Details are described below.

For the input data set X, for n mapped features, and for n enhancement groups, the new construction is

Y = [Z_1, \xi(Z_1 W_{h_1} + \beta_{h_1}) \,|\, \ldots \,|\, Z_n, \xi(Z_n W_{h_n} + \beta_{h_n})] W_n   (11)
  \triangleq [Z_1, \ldots, Z_n \,|\, \xi(Z_1 W_{h_1} + \beta_{h_1}), \ldots, \xi(Z_n W_{h_n} + \beta_{h_n})] W_n   (12)

where Z_i, i = 1, ..., n, are the N × α dimensional mapped features achieved by (9) and W_{h_j} ∈ R^{α×γ}. This model structure is illustrated in Fig. 4(b).

It is obvious that the main difference between the two constructions in Fig. 4(a) and (b) is in the connections of the enhancement nodes. The following theorem proves that the above two different connections of the enhancement nodes are actually equivalent.

Theorem 1: For the model in Section III-A [or Fig. 4(a)], let the feature dimension of Z_i^{(a)} be k, for i = 1, ..., n, and the dimension of H_j^{(a)} be q, for j = 1, ..., m. Respectively, for the model in Section III-B [or Fig. 4(b)], let the feature dimension of Z_i^{(b)} be k, for i = 1, ..., n, and the dimension of H_j^{(b)} be γ, for j = 1, ..., n. Then, if mq = nγ, and H^{(a)} and H^{(b)} are normalized, the two networks are exactly equivalent.

Consequently, no matter which kind of establishment of enhancement nodes is adopted, the network is essentially the same, as long as the total numbers of feature nodes and enhancement nodes are equal. Hereby, only the model in Section III-A [or Fig. 4(a)] will be considered in the rest of this paper. The proof is as follows.

Proof: Suppose that the elements of W_{e_i}, W_{h_j}, β_{e_i}, and β_{h_j} are randomly drawn independently from the same distribution ρ(w). For the first model, treat the nk feature mapping nodes together as Z^{(a)} ∈ R^{N×nk}, whose weights are W_e^{(a)}, and, separately, denote the mq enhancement nodes together as H^{(a)} ∈ R^{N×mq}, whose weights are W_h^{(a)}. Respectively, for the second model, treat the nk feature mapping nodes together as Z^{(b)} ∈ R^{N×nk}, whose weights are W_e^{(b)}, and, separately, denote the nγ enhancement nodes together as H^{(b)} ∈ R^{N×nγ}, whose weights are W_h^{(b)}. Obviously, W_e^{(b)} and W_e^{(a)}, which are of the same dimension, are exactly equivalent because the entries of the two matrices are generated from the same distribution.

As for the enhancement nodes part, first, H^{(a)} and H^{(b)} are of the same size if mq = nγ. Therefore, we need to prove that their elements are equivalent. For any sample x_l chosen from the data set, denote the columns of W_e^{(a)} as w_{e_i}^a, i = 1, ..., nk, and the columns of W_h^{(a)} as w_{h_j}^a ∈ R^{nk}, j = 1, ..., mq.

Hence, the jth enhancement node associated with the sample x_l should be

H_{lj}^{(a)} = \xi\big( [\phi(X W_e^{(a)} + \beta_e^{(a)}) W_h^{(a)} + \beta_h^{(a)}]_{lj} \big)
           = \xi\big( [\phi(x_l w_{e_1}^a + \beta_{e_{l1}}^a), \ldots, \phi(x_l w_{e_{nk}}^a + \beta_{e_{l,nk}}^a)]\, w_{h_j}^a + \beta_{h_j}^a \big)
           = \xi\Big( \sum_{i=1}^{nk} \phi(x_l w_{e_i}^a + \beta_{l i}^a)\, w_{h_{li}}^a + \beta_{h_{li}}^a \Big)
           \approx nk\, E\big[ \xi\big( \phi(x_l w_e^a + \beta_e^a)\, w_h^a + \beta_h^a \big) \big]
           = nk\, E\big[ \xi\big( \phi(x_l w_e + \beta_e)\, w_h + \beta_h \big) \big].

Here, E stands for the expectation over the distribution, w_e is the random vector drawn from the distribution density ρ(w), and w_h is the scalar number sampled from ρ(w).

Similarly, for the columns of W_e^{(b)}, denoted w_{e_i}^b, i = 1, ..., nk, and the columns of W_h^{(b)}, denoted w_{h_j}^b ∈ R^k, j = 1, ..., nγ, it can be deduced that for the second model we have

H_{lj}^{(b)} = \xi\big( [\phi(X W_e^{(b)} + \beta_e^{(b)}) W_h^{(b)} + \beta_h^{(b)}]_{lj} \big)
           = \xi\big( [\phi(x_l w_{e_1}^b + \beta_{e_{l1}}^b), \ldots, \phi(x_l w_{e_k}^b + \beta_{e_{lk}}^b)]\, w_{h_j}^b + \beta_{h_j}^b \big)
           = \xi\Big( \sum_{i=1}^{k} \phi(x_l w_{e_i}^b + \beta_{e_{li}}^b)\, w_{h_{li}}^b + \beta_{h_{li}}^b \Big)
           \approx k\, E\big[ \xi\big( \phi(x_l w_e^b + \beta_e^b)\, w_h^b + \beta_h^b \big) \big]
           = k\, E\big[ \xi\big( \phi(x_l w_e + \beta_e)\, w_h + \beta_h \big) \big].

Since all the w_e, w_h and β_e, β_h are drawn from the same distribution, the expectations of the above two composite distributions are obviously the same. Hence, it is clear that

H_{lj}^{(b)} \approx \frac{1}{n} H_{lj}^{(a)}.

Therefore, we can conclude that, under the given assumption, H^{(a)} and H^{(b)} are also equivalent if the normalization operator is applied.

C. Incremental Learning for Broad Expansion: Increment of Additional Enhancement Nodes

In some cases, if the learning cannot reach the desired accuracy, one solution is to insert additional enhancement nodes to achieve better performance. Next, we detail the broad expansion method for adding p enhancement nodes. Denote A^m = [Z^n | H^m] and A^{m+1} as

A^{m+1} \equiv [A^m \,|\, \xi(Z^n W_{h_{m+1}} + \beta_{h_{m+1}})]

where W_{h_{m+1}} ∈ R^{nk×p} and β_{h_{m+1}} ∈ R^p are the connecting weights and biases from the mapped features to the p additional enhancement nodes; they are randomly generated.

By the discussion in Section II, we can deduce the pseudoinverse of the new matrix as

(A^{m+1})^+ = \begin{bmatrix} (A^m)^+ - D B^T \\ B^T \end{bmatrix}   (13)

where D = (A^m)^+ \xi(Z^n W_{h_{m+1}} + \beta_{h_{m+1}}),

B^T = \begin{cases} (C)^+ & \text{if } C \neq 0 \\ (1 + D^T D)^{-1} D^T (A^m)^+ & \text{if } C = 0 \end{cases}   (14)

and C = \xi(Z^n W_{h_{m+1}} + \beta_{h_{m+1}}) - A^m D. Again, the new weights are

W^{m+1} = \begin{bmatrix} W^m - D B^T Y \\ B^T Y \end{bmatrix}.   (15)

The broad learning construction model and learning procedure are listed in Algorithm 1; meanwhile, the structure is illustrated in Fig. 5. Notice that all the pseudoinverses of the involved matrices are calculated by the regularization approach in (3). Specifically, this algorithm only needs to compute the pseudoinverse of the additional enhancement nodes instead of computing the entire (A^{m+1})^+, and thus results in fast incremental learning.

Algorithm 1 Broad Learning: Increment of p Additional Enhancement Nodes
Input: training samples X;
Output: W
1  for i = 1; i ≤ n do
2    Random W_{e_i}, β_{e_i};
3    Calculate Z_i = [φ(X W_{e_i} + β_{e_i})];
4  end
5  Set the feature mapping group Z^n = [Z_1, ..., Z_n];
6  for j = 1; j ≤ m do
7    Random W_{h_j}, β_{h_j};
8    Calculate H_j = [ξ(Z^n W_{h_j} + β_{h_j})];
9  end
10 Set the enhancement nodes group H^m = [H_1, ..., H_m];
11 Set A^m and calculate (A^m)^+ with Eq. (3);
12 while the training error threshold is not satisfied do
13   Random W_{h_{m+1}}, β_{h_{m+1}};
14   Calculate H_{m+1} = [ξ(Z^n W_{h_{m+1}} + β_{h_{m+1}})];
15   Set A^{m+1} = [A^m | H_{m+1}];
16   Calculate (A^{m+1})^+ and W^{m+1} by Eqs. (13)-(15);
17   m = m + 1;
18 end
19 Set W = W^{m+1};
could not learn the task well, general practices are to either
By the discussion in Section II, we could deduce the
increase the number of the filter (or window) or increase the
pseudoinverse of the new matrix as
 m +  number of layer. The procedures suffer from tedious learning
( A ) − D BT by resetting the parameters for new structures.
( Am+1 )+ = (13)
BT Instead, in the proposed BLS, if the increment of a new
feature mapping is needed, the whole structure can be easily
where D = ( Am )+ ξ(Z n Wh m+1 + βh m+1 )
 constructed and the incremental learning is applied without
(C)+ if C = 0 retraining the whole network from the beginning.
B =
T
−1 +
(14) Here, let us consider the incremental learning for newly
(1 + D D) B ( A ) if C = 0
T T m
incremental feature nodes. Assume that the initial structure
and C = ξ(Z n Wh m+1 + βh m+1 ) − Am D. consists of n groups feature mapping nodes and m groups
Again, the new weights are broad enhancement nodes. Considering that the (n + 1)th
 m  feature mapping group nodes are added and denoted as
W − D BT Y
W m+1 = . (15)
BT Y Zn+1 = φ(X Wen+1 + βen+1 ). (16)
The broad learning construction model and learning pro-
cedure is listed in Algorithm 1, meanwhile, the structure is The corresponding enhancement nodes are randomly gener-
illustrated in Fig. 5. Notice that all the peseudoinverse of the ated as follows:
involved matrix are calculated by the regularization approach Hex m = [ξ(Zn+1 Wex1 +βex1 ), . . . , ξ(Zn+1 Wexm +βexm )] (17)
in (3). Specifically, this algorithm only needs to compute the
pseudoinverse of the additional enhancement nodes instead of where Wexi and βexi are randomly generated. Denote Am n+1 =
computations of the entire ( Am+1 ) and thus results in fast [ Am
n |Z n+1 |Hex m ], which is the upgrade of new mapped fea-
incremental learning. tures and the corresponding enhancement nodes. The relatively

Fig. 5. Broad learning: increment of p additional enhancement nodes.

The updated pseudoinverse matrix can then be obtained as follows:

(A_{n+1}^m)^+ = \begin{bmatrix} (A_n^m)^+ - D B^T \\ B^T \end{bmatrix}   (18)

where D = (A_n^m)^+ [Z_{n+1} | H_{ex_m}],

B^T = \begin{cases} (C)^+ & \text{if } C \neq 0 \\ (1 + D^T D)^{-1} D^T (A_n^m)^+ & \text{if } C = 0 \end{cases}   (19)

and C = [Z_{n+1} | H_{ex_m}] - A_n^m D. Again, the new weights are

W_{n+1}^m = \begin{bmatrix} W_n^m - D B^T Y \\ B^T Y \end{bmatrix}.   (20)

Specifically, this algorithm only needs to compute the pseudoinverse of the additional mapped features instead of computing the entire A_{n+1}^m, thus resulting in fast incremental learning.

The incremental algorithm for adding a feature mapping group is shown in Algorithm 2, and the incremental network for the additional (n + 1)th feature mapping as well as p enhancement nodes is shown in Fig. 6.

Algorithm 2 Broad Learning: Increment of n + 1 Mapped Features
Input: training samples X;
Output: W
1  for i = 1; i ≤ n do
2    Random W_{e_i}, β_{e_i};
3    Calculate Z_i = [φ(X W_{e_i} + β_{e_i})];
4  end
5  Set the feature mapping group Z^n = [Z_1, ..., Z_n];
6  for j = 1; j ≤ m do
7    Random W_{h_j}, β_{h_j};
8    Calculate H_j = [ξ(Z^n W_{h_j} + β_{h_j})];
9  end
10 Set the enhancement nodes group H^m = [H_1, ..., H_m];
11 Set A_n^m and calculate (A_n^m)^+ with Eq. (3);
12 while the training error threshold is not satisfied do
13   Random W_{e_{n+1}}, β_{e_{n+1}};
14   Calculate Z_{n+1} = [φ(X W_{e_{n+1}} + β_{e_{n+1}})];
15   Random W_{ex_i}, β_{ex_i}, i = 1, ..., m;
16   Calculate H_{ex_m} = [ξ(Z_{n+1} W_{ex_1} + β_{ex_1}), ..., ξ(Z_{n+1} W_{ex_m} + β_{ex_m})];
17   Update A_{n+1}^m;
18   Update (A_{n+1}^m)^+ and W_{n+1}^m by Eqs. (18)-(20);
19   n = n + 1;
20 end
21 Set W = W_{n+1}^m;
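A sketch of the feature-mapping increment of (16)–(20), in the same NumPy style as before; the tanh/sigmoid nonlinearities, the per-group width q of the extra enhancement nodes, and all names are assumptions, and np.linalg.pinv again stands in for the ridge approximation (3).

```python
import numpy as np

def add_feature_group(A, A_pinv, W, Y, X, k, m, q=None, rng=None):
    """Add the (n+1)th feature group Z_{n+1} and its m extra enhancement groups
    H_ex (Eqs. (16)-(17)), then update A^m_{n+1}, its pseudoinverse, and the
    weights by Eqs. (18)-(20). q is the width of each extra enhancement group."""
    rng = np.random.default_rng(rng)
    q = q or k
    N, M = X.shape
    W_e = rng.uniform(-1, 1, (M, k)); b_e = rng.uniform(-1, 1, (1, k))
    Z_new = np.tanh(X @ W_e + b_e)                       # Eq. (16)
    H_ex = []
    for _ in range(m):                                   # Eq. (17)
        W_x = rng.uniform(-1, 1, (k, q)); b_x = rng.uniform(-1, 1, (1, q))
        H_ex.append(1.0 / (1.0 + np.exp(-(Z_new @ W_x + b_x))))
    block = np.hstack([Z_new] + H_ex)                    # [Z_{n+1} | H_ex_m]
    D = A_pinv @ block
    C = block - A @ D
    if np.linalg.norm(C) > 1e-12:
        B_T = np.linalg.pinv(C)                          # (C)^+, cf. Eq. (19)
    else:
        B_T = np.linalg.solve(np.eye(D.shape[1]) + D.T @ D, D.T @ A_pinv)
    A_new = np.hstack([A, block])
    A_pinv_new = np.vstack([A_pinv - D @ B_T, B_T])      # Eq. (18)
    W_new = np.vstack([W - D @ (B_T @ Y), B_T @ Y])      # Eq. (20)
    return A_new, A_pinv_new, W_new
```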
E. Incremental Learning for the Increment of Input Data

Now, let us come to the case in which input training samples keep entering. Often, once the system modeling is completed, if a new input with a corresponding output enters the model, the model should be updated to reflect the additional samples. The algorithm in this section is designed to update the weights easily without an entire training cycle.

Denote X_a as the new inputs added into the neural network, and denote A_n^m as the n groups of feature mapping nodes and m groups of enhancement nodes of the initial network. The respective increments of the mapped feature nodes and the enhancement nodes are formulated as follows:

A_x = [\phi(X_a W_{e_1} + \beta_{e_1}), \ldots, \phi(X_a W_{e_n} + \beta_{e_n}) \,|\,   (21)
      \;\; \xi(Z_x^n W_{h_1} + \beta_{h_1}), \ldots, \xi(Z_x^n W_{h_m} + \beta_{h_m})]   (22)

where Z_x^n = [φ(X_a W_{e_1} + β_{e_1}), ..., φ(X_a W_{e_n} + β_{e_n})] is the group of incremental features generated from X_a. The W_{e_i}, W_{h_j} and β_{e_i}, β_{h_j} were randomly generated during the initialization of the network. Hence, we have the updated matrix

{}^x A_n^m = \begin{bmatrix} A_n^m \\ A_x^T \end{bmatrix}.

The associated pseudoinverse updating algorithm can be deduced as follows:

({}^x A_n^m)^+ = \big[ (A_n^m)^+ - B D^T \,\big|\, B \big]   (23)

where D^T = A_x^T (A_n^m)^+,

B^T = \begin{cases} (C)^+ & \text{if } C \neq 0 \\ (1 + D^T D)^{-1} (A_n^m)^+ D & \text{if } C = 0 \end{cases}   (24)

and C = A_x^T - D^T A_n^m.

Fig. 6. Broad learning: increment of n + 1 mapped features.

Fig. 7. Broad learning: increment of input data.

Therefore, the updated weights are

{}^x W_n^m = W_n^m + (Y_a^T - A_x^T W_n^m) B   (25)

where Y_a are the respective labels of the additional inputs X_a.

Similarly, the input-increment updating algorithm is shown in Algorithm 3, and the network is illustrated in Fig. 7. Again, this incremental learning saves time by computing only the necessary pseudoinverse. This particular scheme is perfect for incremental learning with new incoming input data.
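A sketch of the new-input update of (21)–(25): the new samples are pushed through the existing random mappings and the pseudoinverse and weights are updated row-wise. The helper names, the nonlinearities, and the orientation chosen for A_x (stored here with one row per new sample, which absorbs the transposes of the paper's A_x^T notation) are assumptions made for the illustration; np.linalg.pinv again stands in for the ridge approximation (3).

```python
import numpy as np

def add_inputs(A, A_pinv, W, X_a, Y_a, We, W_h, b_h):
    """Update a trained BLS when new samples (X_a, Y_a) arrive, Eqs. (21)-(25).
    We is the list of (W_ei, b_ei) pairs of the existing feature groups;
    (W_h, b_h) are the existing enhancement weights."""
    Zx = np.hstack([np.tanh(X_a @ W_ei + b_ei) for W_ei, b_ei in We])
    Hx = 1.0 / (1.0 + np.exp(-(Zx @ W_h + b_h)))
    Ax = np.hstack([Zx, Hx])                        # new rows of A, Eqs. (21)-(22)
    D_T = Ax @ A_pinv                               # cf. D^T in Eq. (23)
    C = Ax - D_T @ A
    if np.linalg.norm(C) > 1e-12:
        B = np.linalg.pinv(C)                       # B with B^T = (C)^+, Eq. (24)
    else:
        B = A_pinv @ D_T.T @ np.linalg.inv(np.eye(D_T.shape[0]) + D_T @ D_T.T)
    A_new = np.vstack([A, Ax])
    A_pinv_new = np.hstack([A_pinv - B @ D_T, B])   # Eq. (23)
    W_new = W + B @ (Y_a - Ax @ W)                  # Eq. (25)
    return A_new, A_pinv_new, W_new
```

This mirrors the third branch of Algorithm 3: only the block associated with the new rows is inverted, so the cost per update depends on the batch of new samples rather than on the full training set.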
Remark 1: So far, a general framework of the proposed BLS has been presented, while the selection of functions for the feature mapping deserves attention. Theoretically, the functions φ_i(·) have no explicit restrictions, which means that common choices such as kernel mappings, nonlinear transformations, or convolutional functions are acceptable. Specifically, if the feature mapping uses convolutional functions, the broad learning network structure is very similar to that of the classical CNN structure, except that the broad learning network has additional connecting links between the convolutional layers and the output layer. Furthermore, additional connections, either laterally or by stacking, among the feature nodes can be explored.

Remark 2: The proposed BLS is constructed based on the characteristics of flatted functional-link networks. In general, such flatted functional extensions and incremental learning algorithms can be used in various networks, such as the support vector machine or RBF networks. The pseudoinverse computation is used here; it can be replaced with an iterative algorithm if desired. A gradient descent approach can also be used to find the weights of the enhancement nodes if desired.

F. Broad Learning: Structure Simplification and Low-Rank Learning

After the broad expansion with added mapped features and enhancement nodes via incremental learning, the structure may run the risk of being redundant due to poor initialization or redundancy in the input data.

Generally, the structure can be simplified by a series of low-rank approximation methods. In this section, we adopt the classical SVD as a conservative choice to offer structure simplification for the proposed broad model.

The simplification can be done in different ways: 1) during the generation of mapped features; 2) during the generation of enhancement nodes; or 3) at the completion of broad learning.

1) SVD Simplification of Mapping Features: Let us begin with the random initial network with n groups of feature nodes, which can be represented as the equation of the following form:

Y = [\phi(X W_{e_1} + \beta_{e_1}), \ldots, \phi(X W_{e_n} + \beta_{e_n})] W_n^0 = [Z_1, \ldots, Z_n] W_n^0.

Similarly, following the previous sections, we denote A_n^0 = [Z_1, ..., Z_n], which yields

Y = A_n^0 W_n^0.

To explore the characteristics of the matrix A_n^0, we apply the SVD to Z_i, i = 1, ..., n, as follows:

Z_i = U_{Z_i} \Sigma_{Z_i} V_{Z_i}^T   (26)
    = U_{Z_i} \cdot [\Sigma_{Z_i}^P \,|\, \Sigma_{Z_i}^Q] \cdot [V_{Z_i}^P \,|\, V_{Z_i}^Q]^T   (27)
    = U_{Z_i} \Sigma_{Z_i}^P V_{Z_i}^{P\,T} + U_{Z_i} \Sigma_{Z_i}^Q V_{Z_i}^{Q\,T} = Z_i^P + Z_i^Q   (28)
where Σ_{Z_i}^P and Σ_{Z_i}^Q are split by the order of the singular values under the parameter ε.

Remember that our motivation is to achieve a satisfactory reduction of the number of nodes. The idea is to compress Z_i by its principal portion, Z_i^P. The relation between Z_i and Z_i^P is derived as follows:

Z_i^P V_{Z_i}^P = U_{Z_i} \Sigma_{Z_i}^P V_{Z_i}^{P\,T} V_{Z_i}^P
             = U_{Z_i} \Sigma_{Z_i}^P V_{Z_i}^{P\,T} V_{Z_i}^P + U_{Z_i} \Sigma_{Z_i}^Q V_{Z_i}^{Q\,T} V_{Z_i}^P
             = (Z_i^P + Z_i^Q) V_{Z_i}^P
             = Z_i V_{Z_i}^P

since V_{Z_i}^{Q\,T} V_{Z_i}^P = 0.

Algorithm 3 Broad Learning: Increment of Feature-Mapping Nodes, Enhancement Nodes, and New Inputs
Input: training samples X;
Output: W
1  for i = 1; i ≤ n do
2    Random W_{e_i}, β_{e_i};
3    Calculate Z_i = [φ(X W_{e_i} + β_{e_i})];
4  end
5  Set the feature mapping group Z^n = [Z_1, ..., Z_n];
6  for j = 1; j ≤ m do
7    Random W_{h_j}, β_{h_j};
8    Calculate H_j = [ξ(Z^n W_{h_j} + β_{h_j})];
9  end
10 Set the enhancement nodes group H^m = [H_1, ..., H_m];
11 Set A_n^m and calculate (A_n^m)^+ with Eq. (3);
12 while the training error threshold is not satisfied do
13   if p enhancement nodes are added then
14     Random W_{h_{m+1}}, β_{h_{m+1}};
15     Calculate H_{m+1} = [ξ(Z^n W_{h_{m+1}} + β_{h_{m+1}})]; update A_n^{m+1};
16     Calculate (A_n^{m+1})^+ and W_n^{m+1} by Eqs. (13)-(15);
17     m = m + 1;
18   else
19     if the (n + 1)th feature mapping group is added then
20       Random W_{e_{n+1}}, β_{e_{n+1}};
21       Calculate Z_{n+1} = [φ(X W_{e_{n+1}} + β_{e_{n+1}})];
22       Random W_{ex_i}, β_{ex_i}, i = 1, ..., m;
23       Calculate H_{ex_m} = [ξ(Z_{n+1} W_{ex_1} + β_{ex_1}), ..., ξ(Z_{n+1} W_{ex_m} + β_{ex_m})];
24       Update A_{n+1}^m;
25       Update (A_{n+1}^m)^+ and W_{n+1}^m by Eqs. (18)-(20);
26       n = n + 1;
27     else
28       New inputs are added as X_a;
29       Calculate A_x by (21) and (22); update {}^x A_n^m;
30       Update ({}^x A_n^m)^+ and {}^x W_n^m by Eqs. (23)-(25);
31     end
32   end
33 end
34 Set W;

As for the original model, define

W_n^0 \triangleq \big[ W_{Z_1}^{\{0,n\}} \,|\, \ldots \,|\, W_{Z_n}^{\{0,n\}} \big]^T

so that

Y = A_n^0 W_n^0 = [Z_1, \ldots, Z_n] W_n^0
  = Z_1 W_{Z_1}^{\{0,n\}} + \cdots + Z_n W_{Z_n}^{\{0,n\}}
  = Z_1 V_{Z_1}^P V_{Z_1}^{P\,T} W_{Z_1}^{\{0,n\}} + \cdots + Z_n V_{Z_n}^P V_{Z_n}^{P\,T} W_{Z_n}^{\{0,n\}}
  = \big[ Z_1 V_{Z_1}^P, \ldots, Z_n V_{Z_n}^P \big]
    \begin{bmatrix} V_{Z_1}^{P\,T} W_{Z_1}^{\{0,n\}} \\ \vdots \\ V_{Z_n}^{P\,T} W_{Z_n}^{\{0,n\}} \end{bmatrix}
  = A_F^{\{0,n\}} W_F^{\{0,n\}}

where

W_F^{\{0,n\}} = \begin{bmatrix} V_{Z_1}^{P\,T} W_{Z_1}^{\{0,n\}} \\ \vdots \\ V_{Z_n}^{P\,T} W_{Z_n}^{\{0,n\}} \end{bmatrix}.

Finally, by solving a least square linear equation, the model is refined to

Y = A_F^{\{0,n\}} W_F^{\{0,n\}}   (29)

where

W_F^{\{0,n\}} = \big( A_F^{\{0,n\}} \big)^+ Y.   (30)

Here, (A_F^{\{0,n\}})^+ is the pseudoinverse of A_F^{\{0,n\}}. In this way, the original A_n^0 is simplified to A_F^{\{0,n\}}.

2) SVD Simplification of Enhancement Nodes: We are also able to simplify the structure after adding a new group of enhancement nodes to the network. Suppose that the n groups of feature mapping nodes and m groups of enhancement nodes have been added, and the network is

Y = A_F^{\{m,n\}} W_F^{\{m,n\}}

where

A_F^{\{m,n\}} = \big[ Z_1 V_{Z_1}^P, \ldots, Z_n V_{Z_n}^P \,|\, H_1 V_{H_1}^P, \ldots, H_m V_{H_m}^P \big]

and

H_j = \xi\big( [Z_1 V_{Z_1}^P, \ldots, Z_n V_{Z_n}^P] W_{h_j} + \beta_{h_j} \big).

In the above equations, the H_j^P, j = 1, ..., m, are obtained in the same way as the Z_i^P, which means

H_j = U_{H_j} \Sigma_{H_j} V_{H_j}^T   (31)
    = U_{H_j} \cdot [\Sigma_{H_j}^P \,|\, \Sigma_{H_j}^Q] \cdot [V_{H_j}^P \,|\, V_{H_j}^Q]^T   (32)
    = U_{H_j} \Sigma_{H_j}^P V_{H_j}^{P\,T} + U_{H_j} \Sigma_{H_j}^Q V_{H_j}^{Q\,T} = H_j^P + H_j^Q.   (33)

Similarly, the simplified structure is obtained by substituting H_j with H_j V_{H_j}^P.
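A sketch of the SVD-based compression used in 1) and 2): keep only the right-singular directions whose singular values satisfy σ_i ≥ η σ_1 (or a fixed number l of them), replace each group by Z_i V_{Z_i}^P or H_j V_{H_j}^P, and refit the output weights as in (29)–(30). The helper names and default thresholds are assumptions, and np.linalg.pinv stands in for the ridge approximation (3).

```python
import numpy as np

def principal_directions(Z, eps):
    """Return V^P for Z: if 0 < eps <= 1, keep singular values with
    sigma_i >= eps * sigma_1; otherwise treat eps as a fixed count l."""
    _, s, Vt = np.linalg.svd(Z, full_matrices=False)
    r = int(np.sum(s >= eps * s[0])) if 0 < eps <= 1 else int(eps)
    return Vt[:max(r, 1)].T              # columns span the principal subspace

def svd_simplify(Z_groups, H_groups, Y, eps_e=0.9, eps_h=0.9):
    """Compress each group to Z_i V^P_{Z_i} and H_j V^P_{H_j}, then refit the
    output weights on the reduced matrix A_F, cf. Eqs. (29)-(30)."""
    reduced = [Z @ principal_directions(Z, eps_e) for Z in Z_groups]
    reduced += [H @ principal_directions(H, eps_h) for H in H_groups]
    A_F = np.hstack(reduced)
    W_F = np.linalg.pinv(A_F) @ Y        # W_F = (A_F)^+ Y
    return A_F, W_F
```

The same principal_directions helper can be applied once more to the assembled A_F, which is the "completion" variant described in 4) below.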

Algorithm 4 Broad Learning: SVD Structure Simplification
Input: training samples X; thresholds ε_e, ε_h, ε
Output: W
1  for i = 1; i ≤ n do
2    Random W_{e_i}, β_{e_i};
3    Calculate Z_i = [φ(X W_{e_i} + β_{e_i})];
4    Calculate V_{Z_i}^P by SVD under threshold ε_e;
5  end
6  for j = 1; j ≤ m do
7    Random W_{h_j}, β_{h_j};
8    Calculate H_j = [ξ([Z_1 V_{Z_1}^P, ..., Z_n V_{Z_n}^P] W_{h_j} + β_{h_j})];
9    Calculate V_{H_j}^P by SVD under threshold ε_h;
10 end
11 Set A^{m,n} = [Z_1 V_{Z_1}^P, ..., Z_n V_{Z_n}^P | H_1 V_{H_1}^P, ..., H_m V_{H_m}^P];
12 Calculate (A^{m,n})^+ with Eq. (3);
13 while the training error threshold is not satisfied do
14   Random W_{h_{m+1}}, β_{h_{m+1}};
15   Calculate H_{m+1} = [ξ([Z_1 V_{Z_1}^P, ..., Z_n V_{Z_n}^P] W_{h_{m+1}} + β_{h_{m+1}})];
16   Calculate V_{H_{m+1}}^P by SVD under threshold ε_h;
17   Update A^{m+1,n};
18   Calculate (A_F^{m+1,n})^+ and W_F^{m+1,n} by Eqs. (34)-(36);
19   m = m + 1;
20 end
21 Get V_F^P by SVD of A_F^{m,n} under threshold ε;
22 Calculate A_F = A_F^{m,n} V_F^P;
23 Calculate A_F^+ and W_F by Eq. (3);
24 Set W = W_F;

Fig. 8. MNIST data set [51].

3) SVD Simplification for Inserting Additional p Enhancement Nodes: Without loss of generality, based on the above assumptions, we deduce an SVD simplification for inserting additional p enhancement nodes as follows:

A^{m+1,n} = \big[ A_F^{m,n} \,|\, H_{m+1} \big]

where H_{m+1} = ξ([Z_1 V_{Z_1}^P, ..., Z_n V_{Z_n}^P] W_{h_{m+1}} + β_{h_{m+1}}). Similarly, as with the SVD implemented in the previous steps, we have

A_F^{m+1,n} = \big[ A_F^{m,n} \,|\, H_{m+1} V_{H_{m+1}}^P \big].

To update the pseudoinverse of A_F^{m+1,n}, similar to (13)–(15), we conclude that

(A_F^{m+1,n})^+ = \begin{bmatrix} (A_F^{m,n})^+ - D B^T \\ B^T \end{bmatrix}   (34)

where D = (A_F^{m,n})^+ H_{m+1} V_{H_{m+1}}^P,

B^T = \begin{cases} (C)^+ & \text{if } C \neq 0 \\ (1 + D^T D)^{-1} D^T (A_F^{m,n})^+ & \text{if } C = 0 \end{cases}   (35)

and C = H_{m+1} V_{H_{m+1}}^P - A_F^{m,n} D. The new weights are

W_F^{m+1,n} = \begin{bmatrix} W_F^{m,n} - D B^T Y \\ B^T Y \end{bmatrix}.   (36)

Here, W_F^{m+1,n} is the least square solution of the following model:

Y = A_F^{m+1,n} W_F^{m+1,n}.

4) SVD Simplification at the Completion of Broad Learning: Now, it seems that a complete network has been built; however, there may be a need to simplify it further. A possible solution is to cut off more of the small singular value components. Therefore, we have

A_F^{m,n} = U_F \Sigma_F V_F^T
          = U_F \cdot [\Sigma_F^P \,|\, \Sigma_F^Q] \cdot [V_F^P \,|\, V_F^Q]^T
          = U_F \Sigma_F^P V_F^{P\,T} + U_F \Sigma_F^Q V_F^{Q\,T}
          = A_F^{P\{m,n\}} + A_F^{Q\{m,n\}}.

Similar to the beginning of the algorithm, set

A_F = A_F^{m,n} V_F^P

and we have an approximation of the model as follows:

Y = A_F W_F   (37)

where

W_F = A_F^+ Y.   (38)

Finally, the structure-simplified broad learning algorithm is given in Algorithm 4. Generally, the number of neural nodes can be significantly reduced depending on the threshold values ε_e, ε_h, and ε for simplifying the feature mapping nodes, the enhancement nodes, and the final structure, respectively.

IV. EXPERIMENT AND DISCUSSION

In this section, experimental results are given to verify the proposed system. To confirm the effectiveness of the proposed system, classification experiments are conducted on the popular MNIST and NORB data. To prove the effectiveness of BLS, we compare the classification ability of our method with existing mainstream methods, including stacked autoencoders (SAE) [6], another version of the stacked autoencoder (SDA) [52], DBN [5], multilayer perceptron-based methods (MLP) [53], DBM [7], and two kinds of extreme learning machine (ELM)-based multilayer structures, denoted as MLELM [54] and HELM [10], respectively. The above comparison experiments are tested on the MATLAB software platform on a laptop
that is equipped with an Intel i7 2.4-GHz CPU and 16 GB of memory. The classification results of the above methods are cited from [10]. Furthermore, we compare our results with an extended fuzzy restricted Boltzmann machine (FRBM) [13], which offers a more reasonable, solid, and theoretical foundation for establishing FRBMs. The one-layer FRBM and our proposed broad learning results are tested on a PC with a 3.40-GHz Intel i7-6700 CPU processor on the MATLAB platform. It should be noted that duplicate experiments are tested on a server computer equipped with a 2.30-GHz Intel Xeon E5-2650 CPU processor, and the corresponding testing accuracy and training time are given with a special superscript ∗.

TABLE I. Classification Accuracy on the MNIST Data Set.

TABLE II. Classification Accuracy on the MNIST Data Set with Different Numbers of Enhancement Nodes.

A state-of-the-art deep CNN [3] has achieved extremely good results even on the ImageNet challenge. However, this is generally done with the help of a very deep structure or ensembles of various operations. In this paper, only a comparison with the original deep CNN (LeNet-5) in [22] is presented for fairness, because the proposed BLS only uses linear feature mapping. The related CNN experiments are tested on the above server computer on the Theano platform.

Generally, all the methods mentioned above, except for HELM and MLELM, are deep structures, and their hyperparameters are tuned based on back propagation, with the initial learning rate set to 0.1 and the decay rate for each learning epoch set to 0.95. As for the ELM-based networks, the regularization parameters of MLELM are set to 10^{-1}, 10^3, and 10^8, respectively, while the penalty parameter of HELM is 10^8. Other detailed parameters can be checked in [10]. In our proposed BLS, the regularization parameter λ for ridge regression is set to 10^{-8}; meanwhile, a one-layer linear feature mapping with a one-step fine-tuning to emphasize the selected features is adopted. In addition, the associated parameters W_{e_i} and β_{e_i}, for i = 1, ..., n, are drawn from the standard uniform distribution on the interval [−1, 1]. For the enhancement nodes, the sigmoid function is chosen to establish the BLS. In the remainder of this section, experimental results based on the proposed broad learning algorithms are given.

A. MNIST Data

In this section, a series of experiments focuses on the classical MNIST handwritten digit images [8]. This data set consists of 70 000 handwritten digits partitioned into a training set containing 60 000 samples and a test set of 10 000 samples. Every digit is represented by an image of 28 × 28 gray-scale pixels. Typical images are shown in Fig. 8.

To test the accuracy and efficiency of our proposed broad learning structures in classification, we give a priori knowledge about the numbers of feature nodes and enhancement nodes. This is exactly the normal practice for building the network in deep learning neural networks, which is in fact the most challenging task of the whole learning process. In our experiments, the network is constructed with a total of 10 × 10 feature nodes and 1 × 11 000 enhancement nodes. For reference, the deep structures of SAE, DBN, DBM, MLELM, and HELM are 1000−500−25−30, 500−500−2000, 500−500−1000, 700−700−15 000, and 300−300−12 000, respectively. The testing accuracies of the mentioned methods and our proposed method are shown in Table I. Although the 98.74% accuracy is not the best one (in fact, the performance of broad learning is still better than SAE and MLP), the training time in the server is very fast at a surprising level, namely 29.9157 s, while the testing time in the server is 1.0783 s. Moreover, it should be noticed that the number of feature mapping nodes is only 100 here. This result accords with the intuition in large scale learning that the information in practical applications is usually redundant. More results with different numbers of mapped features are given in Table II.
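For orientation, the one-shot configuration used here (10 × 10 feature nodes, 11 000 enhancement nodes, λ = 10^{-8}) maps onto the hypothetical bls_train sketch from Section III-A roughly as follows. X_train, Y_train, and the prediction helper are placeholders, not the authors' released MATLAB implementation, and the enhancement layer is scaled down so the toy example runs quickly.

```python
import numpy as np

# Hypothetical placeholders: X_train is N x 784 pixels, Y_train is N x 10 one-hot.
rng = np.random.default_rng(0)
X_train = rng.random((1000, 784))
Y_train = np.eye(10)[rng.integers(0, 10, 1000)]

# One-shot BLS in the spirit of the 10 x 10 feature-node setting; m_enh is
# scaled down from the 11 000 enhancement nodes reported in the paper.
Wm, We, (W_h, b_h) = bls_train(X_train, Y_train,
                               n_groups=10, k=10, m_enh=1100, lam=1e-8, rng=0)

def bls_predict(X, Wm, We, W_h, b_h):
    Zn = np.hstack([np.tanh(X @ W_ei + b_ei) for W_ei, b_ei in We])
    Hm = 1.0 / (1.0 + np.exp(-(Zn @ W_h + b_h)))
    return np.argmax(np.hstack([Zn, Hm]) @ Wm, axis=1)

print((bls_predict(X_train, Wm, We, W_h, b_h) == Y_train.argmax(axis=1)).mean())
```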

TABLE III. Classification Accuracy on the MNIST Data Set Using Incremental Learning.

TABLE IV. Snapshot Results of MNIST Classification Using Incremental Learning.

Next, we show the speed and effectiveness of the incremental learning system; the associated experiments are implemented on the server computer mentioned above. Let the final structure consist of 100 feature nodes and 11 000 enhancement nodes. Two different initial networks are used to test the incremental learning here.

First, suppose the initial network is set as 10 × 10 feature nodes and 9000 enhancement nodes. Then, Algorithm 1 is applied to dynamically add 500 enhancement nodes each time until the number reaches 11 000.

Second, three dynamic increments are used to increase dynamically: 1) the feature nodes; 2) the corresponding enhancement nodes; and 3) the additional enhancement nodes. The scenario is shown in Fig. 6. The network is initially set to have 10 × 6 feature nodes and 7000 enhancement nodes at the beginning of Algorithm 3. Then, the feature nodes are dynamically increased from 60 to 100 in steps of 10 in each update, the corresponding enhancement nodes for the additional features are increased by 250 each, and the additional enhancement nodes are increased by 750 each. Equivalently, at each step, 10 feature nodes and 1000 enhancement nodes are inserted into the network. Table III presents the performance of the above two different dynamic constructions for MNIST classification compared with the results in Table I. The incremental versions have similar performance to the one-shot construction. It is surprising that the dynamic increments on both feature nodes and enhancement nodes perform the best. This may be caused by the random nature of the feature nodes and the enhancement nodes. This implies that dynamic updating of the model using incremental learning can produce a comparable result; meanwhile, it provides the opportunity to adjust the structure and accuracy of the system to match the desired performance.

An additional experiment is designed to test the elapsed time of the incremental learning. The initial network is set as 10 × 6 feature nodes and 3000 enhancement nodes. Similarly to the previous case, the feature nodes are dynamically increased from 60 to 100 in steps of 10 in each update, the corresponding enhancement nodes for the additional features are increased by 750 each, and the additional enhancement nodes are increased by 1250 each. The training times and results of each update are presented in Table IV. The result is very competitive with that of the one-shot solution when the network reaches 100 feature nodes and 11 000 enhancement nodes, as shown in Table I. This proves that the incremental learning algorithms are very effective.

Finally, the incremental broad learning algorithms on added inputs are tested. On the one hand, suppose the initial network is trained on the first 10 000 training samples of the MNIST data set. Then, the incremental algorithm is applied to dynamically add 10 000 input patterns each time until all 60 000 training samples are fed. The structure of the tested BLS is set as 10 × 10 feature nodes and 5000 enhancement nodes, and the snapshot results of each update are shown in Table V. On the other hand, experiments for the increment of both input patterns and enhancement nodes are tested. The network is initially set to have 10 × 10 feature nodes and 5000 enhancement nodes. Then, the additional enhancement nodes are increased dynamically by 250 each, and the additional input patterns are increased by 10 000 each. The attractive results of each update can be checked in Table VI.

Fig. 9. Examples of training figures.

Fig. 10. Examples of testing figures.

B. NORB Data

The NORB data set [55] is a more complicated data set compared with the MNIST data set; it has a total of 48 600 images, each with 2 × 32 × 32 pixels. The NORB contains images of 50 different 3-D toy objects belonging to five distinct categories: 1) animals; 2) humans; 3) airplanes; 4) trucks; and 5) cars,
as shown in Figs. 9 and 10. The sampled objects are imaged under various lighting conditions, elevations, and azimuths. The training set contains 24 300 stereo images of 25 objects (five per class), as shown in Fig. 9, while the testing set contains 24 300 images of the remaining 25 objects, as shown in Fig. 10. In our experiments, the network was constructed by the one-shot model, which consists of 100 × 10 feature nodes and 1 × 9000 enhancement nodes. Compared with the deep and complex structures of DBN and HELM, which are 4000−4000−4000 and 3000−3000−15 000, respectively, the proposed BLS presents a faster training time. The testing results, shown in Table VII, present a pleasant performance, especially the training time of the proposed broad learning. Similar to the MNIST cases, although the accuracy is not the best, the performance matches that of previous work, with a testing time of 6.0299 s in the server. Considering the superfast computation speed, which is the best among the existing methods, the proposed broad learning network is very attractive.

TABLE V. Snapshot Results of MNIST Classification Using Incremental Learning: Increment of Input Patterns.

TABLE VI. Snapshot Results of MNIST Classification Using Incremental Learning: Increment of Input Patterns and Enhancement Nodes.

TABLE VII. Classification Accuracy on the NORB Data Set.

C. SVD-Based Structure Simplification

In this part, we run simulations using the SVD to simplify the structure after the model is constructed. The experiments are tested on the MNIST data set. During the experiments, the thresholds ε_e = ε_h = 1 and ε = N are set, which means that there is no simplification of the feature nodes and enhancement nodes during their generation, and only the first N important principal components are kept in the final simplified network, i.e., the SVD operation is applied to A_F^{m,n}. As shown in Table VIII, N is selected as 500, 600, 800, 1000, 1500, 2000, 2500, and 3000. The paired entries in the table denote the network structures of the BLS, where the first number is the number of feature nodes and the second is the number of enhancement nodes; specifically, the sum of the numbers in a column is the total number of nodes in the broad network. In the "BLSVD" column of the table, the SVD operation is applied to the network and the network is compressed to the desired N nodes. The tests compare with the restricted Boltzmann machine (RBM) and the original BLS. The parameters of the RBM are taken from [5], i.e., the learning rate is 0.05 and the weight decay is 0.001. All three methods are repeated ten times. In the table, the minimal test error (MTE) and the average test error (ATE) over all ten experiments are shown in percentage.

From the table, we can clearly observe that when the number of nodes exceeds 1000, both BLS models have better results than the RBM. Moreover, the models selected by the SVD significantly improve the accuracy. Specifically, the RBM is in fact trapped at around a 3% test error no matter how many nodes are added to the network. In addition, the result of the proposed SVD-based broad learning varies over a narrower range than that of the RBM.
TABLE VIII
Network Compression Result Using SVD Broad Learning Algorithm

D. Performance Analysis

Based on the experiments above, BLS clearly outperforms the existing deep structure neural networks in terms of training speed. Furthermore, compared with other MLP training methods, BLS leads to a promising performance in classification accuracy and learning speed. Compared with the hours or days of training and hundreds of epochs of iteration on high-performance computers required by deep structures, the BLS can be easily constructed in a few minutes, even on a regular PC.

In addition, it should be mentioned that, from Tables III and IV, the incremental version of broad learning does not lose classification accuracy, and is even better in the MNIST case.

Furthermore, the broad structure of our system could be simplified by applying a series of low-rank approximations. In this paper, only the classical SVD method is discussed; compared with a one-layer RBM, the proposed SVD-based broad learning is more stable. A fast structure reduction using different low-rank algorithms can be developed if the SVD is considered not efficient enough.

V. CONCLUSION

BLS is proposed in this paper, which aims to offer an alternative way to deep learning and structure. The establishment of the system is based on the idea of the RVFLNN.

The designed model can be expanded in a wide fashion when new feature nodes and enhancement nodes are needed, and the corresponding incremental learning algorithms are designed accordingly. The incremental learning algorithms are developed for fast remodeling in broad expansion without a retraining process if the network is deemed to require expansion. From the incremental experimental results presented in Table IV, it is shown that the incremental learning algorithms can rapidly update and remodel the system. It is observed that the learning time of a one-shot structure is smaller than that of the step-by-step incremental version that reaches the same final structure. However, this incremental learning offers an approach for system remodeling and for model selection, especially in modeling high-volume time-varying systems. The experiments on the MNIST and NORB data confirm the dynamic update properties of the proposed BLS. Finally, the SVD approach is applied to simplify the structure. It is also indicated that the simplified networks demonstrate promising results.

Lastly, with proper arrangement of the feature nodes, the proposed broad learning algorithms and incremental learning algorithms (see Section III) can be applied to a flat network or to a network that only needs to compute the connecting weights of the last layer, such as the ELM.

ACKNOWLEDGMENT

Source codes of the paper can be found on the author's website at: http://www.fst.umac.mo/en/staff/pchen.html.

REFERENCES

[1] M. Gong, J. Zhao, J. Liu, Q. Miao, and L. Jiao, "Change detection in synthetic aperture radar images based on deep neural networks," IEEE Trans. Neural Netw. Learn. Syst., vol. 27, no. 1, pp. 125–138, Jan. 2016.
[2] W. Hou, X. Gao, D. Tao, and X. Li, "Blind image quality assessment via deep learning," IEEE Trans. Neural Netw. Learn. Syst., vol. 26, no. 6, pp. 1275–1286, Jun. 2015.
[3] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Advances in Neural Information Processing Systems 25 (NIPS 2012), F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, Eds. New York, NY, USA: Curran Associates, Inc., 2012, pp. 1097–1105.
[4] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. Cambridge, MA, USA: MIT Press, 2016.
[5] G. E. Hinton, S. Osindero, and Y.-W. Teh, "A fast learning algorithm for deep belief nets," Neural Comput., vol. 18, no. 7, pp. 1527–1554, Jul. 2006.
[6] G. E. Hinton and R. R. Salakhutdinov, "Reducing the dimensionality of data with neural networks," Science, vol. 313, no. 5786, pp. 504–507, 2006.
[7] R. Salakhutdinov and G. E. Hinton, "Deep Boltzmann machines," in Proc. AISTATS, vol. 1, 2009, p. 3.
[8] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition," Proc. IEEE, vol. 86, no. 11, pp. 2278–2324, Nov. 1998.
[9] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," 2014. [Online]. Available: https://arxiv.org/abs/1409.1556
[10] J. Tang, C. Deng, and G.-B. Huang, "Extreme learning machine for multilayer perceptron," IEEE Trans. Neural Netw. Learn. Syst., vol. 27, no. 4, pp. 809–821, Apr. 2016.
[11] M. Chen, Z. E. Xu, K. Q. Weinberger, and F. Sha, "Marginalized denoising autoencoders for domain adaptation," 2012. [Online]. Available: https://arxiv.org/abs/1206.4683
[12] M. Gong, J. Liu, H. Li, Q. Cai, and L. Su, "A multiobjective sparse feature learning model for deep neural networks," IEEE Trans. Neural Netw. Learn. Syst., vol. 26, no. 12, pp. 3263–3277, Dec. 2015.
[13] S. Feng and C. L. P. Chen, "A fuzzy restricted Boltzmann machine: Novel learning algorithms based on crisp possibilistic mean value of fuzzy numbers," IEEE Trans. Fuzzy Syst., to be published.
[14] C. L. P. Chen, C.-Y. Zhang, L. Chen, and M. Gan, "Fuzzy restricted Boltzmann machine for the enhancement of deep learning," IEEE Trans. Fuzzy Syst., vol. 23, no. 6, pp. 2163–2173, Dec. 2015.
[15] Z. Yu, L. Li, J. Liu, and G. Han, "Hybrid adaptive classifier ensemble," IEEE Trans. Cybern., vol. 45, no. 2, pp. 177–190, Feb. 2015.
[16] Z. Yu, H. Chen, J. Liu, J. You, H. Leung, and G. Han, "Hybrid k-nearest neighbor classifier," IEEE Trans. Cybern., vol. 46, no. 6, pp. 1263–1275, Jun. 2016.
[17] Z. Yu et al., "Incremental semi-supervised clustering ensemble for high dimensional data clustering," IEEE Trans. Knowl. Data Eng., vol. 28, no. 3, pp. 701–714, Mar. 2016.


[18] M. Leshno, V. Y. Lin, A. Pinkus, and S. Schocken, "Multilayer feedforward networks with a nonpolynomial activation function can approximate any function," Neural Netw., vol. 6, no. 6, pp. 861–867, 1993.
[19] Y.-H. Pao and Y. Takefuji, "Functional-link net computing: Theory, system architecture, and functionalities," Computer, vol. 25, no. 5, pp. 76–79, May 1992.
[20] Y.-H. Pao, G.-H. Park, and D. J. Sobajic, "Learning and generalization characteristics of the random vector functional-link net," Neurocomputing, vol. 6, no. 2, pp. 163–180, 1994.
[21] B. Igelnik and Y.-H. Pao, "Stochastic choice of basis functions in adaptive function approximation and the functional-link net," IEEE Trans. Neural Netw., vol. 6, no. 6, pp. 1320–1329, Nov. 1995.
[22] Y. LeCun et al., "Handwritten digit recognition with a back-propagation network," in Proc. Neural Inf. Process. Syst. (NIPS), 1990, pp. 396–404.
[23] J. S. Denker et al., "Neural network recognizer for hand-written zip code digits," in Advances in Neural Information Processing Systems, D. S. Touretzky, Ed. San Francisco, CA, USA: Morgan Kaufmann, 1989, pp. 323–331.
[24] K. S. Narendra and K. Parthasarathy, "Identification and control of dynamical systems using neural networks," IEEE Trans. Neural Netw., vol. 1, no. 1, pp. 4–27, Mar. 1990.
[25] I. Y. Tyukin and D. V. Prokhorov, "Feasibility of random basis function approximators for modeling and control," in Proc. IEEE Control Appl. (CCA) Intell. Control (ISIC), Jul. 2009, pp. 1391–1396.
[26] C. L. P. Chen and C.-Y. Zhang, "Data-intensive applications, challenges, techniques and technologies: A survey on big data," Inf. Sci., vol. 275, pp. 314–347, Aug. 2014.
[27] C. L. P. Chen and J. Z. Wan, "A rapid learning and dynamic stepwise updating algorithm for flat neural networks and the application to time-series prediction," IEEE Trans. Syst., Man, Cybern. B, Cybern., vol. 29, no. 1, pp. 62–72, Feb. 1999.
[28] A. Rakotomamonjy, "Variable selection using SVM-based criteria," J. Mach. Learn. Res., vol. 3, pp. 1357–1370, Mar. 2003.
[29] P. M. Narendra and K. Fukunaga, "A branch and bound algorithm for feature subset selection," IEEE Trans. Comput., vol. 26, no. 9, pp. 917–922, Sep. 1977.
[30] J. Fan and R. Li, "Variable selection via nonconcave penalized likelihood and its oracle properties," J. Amer. Statist. Assoc., vol. 96, no. 456, pp. 1348–1360, 2001.
[31] R. G. Baraniuk and M. B. Wakin, "Random projections of smooth manifolds," Found. Comput. Math., vol. 9, no. 1, pp. 51–77, 2009.
[32] L. Breiman, "Random forests," Mach. Learn., vol. 45, no. 1, pp. 5–32, 2001.
[33] L. Grasedyck, D. Kressner, and C. Tobler, "A literature survey of low-rank tensor approximation techniques," GAMM-Mitteilungen, vol. 36, no. 1, pp. 53–78, 2013.
[34] I. Markovsky, "Low rank approximation," in Algorithms, Implementation, Applications (Communications and Control Engineering). London, U.K.: Springer, 2011.
[35] Z. Yang, Y. Xiang, K. Xie, and Y. Lai, "Adaptive method for nonsmooth nonnegative matrix factorization," IEEE Trans. Neural Netw. Learn. Syst., vol. 28, no. 4, pp. 948–960, Apr. 2017.
[36] C. L. P. Chen, "A rapid supervised learning neural network for function interpolation and approximation," IEEE Trans. Neural Netw., vol. 7, no. 5, pp. 1220–1230, Sep. 1996.
[37] C. Leonides, "Control and dynamic systems V18," in Advances in Theory and Applications (Control and Dynamic Systems). Amsterdam, The Netherlands: Elsevier, 2012.
[38] A. Ben-Israel and T. Greville, Generalized Inverses: Theory and Applications. New York, NY, USA: Wiley, 1974.
[39] C. R. Rao and S. K. Mitra, "Generalized inverse of a matrix and its applications," in Proc. 6th Berkeley Symp. Math. Statist. Probab., vol. 1, 1972, pp. 601–620.
[40] D. Serre, "Matrices," in Theory and Applications (Graduate Texts in Mathematics). New York, NY, USA: Springer-Verlag, 2002.
[41] A. E. Hoerl and R. W. Kennard, "Ridge regression: Biased estimation for nonorthogonal problems," Technometrics, vol. 42, no. 1, pp. 80–86, Feb. 2000.
[42] Z. Xu, X. Chang, F. Xu, and H. Zhang, "L1/2 regularization: A thresholding representation theory and a fast solver," IEEE Trans. Neural Netw. Learn. Syst., vol. 23, no. 7, pp. 1013–1027, Jul. 2012.
[43] W. Yang, Y. Gao, Y. Shi, and L. Cao, "MRM-lasso: A sparse multiview feature selection method via low-rank analysis," IEEE Trans. Neural Netw. Learn. Syst., vol. 26, no. 11, pp. 2801–2815, Nov. 2015.
[44] B. A. Olshausen and D. J. Field, "Sparse coding with an overcomplete basis set: A strategy employed by V1?" Vis. Res., vol. 37, no. 23, pp. 3311–3325, 1997.
[45] R. Tibshirani, "Regression shrinkage and selection via the lasso," J. Roy. Statist. Soc. B (Methodological), vol. 58, no. 1, pp. 267–288, 1996.
[46] J. A. Tropp and A. C. Gilbert, "Signal recovery from random measurements via orthogonal matching pursuit," IEEE Trans. Inf. Theory, vol. 53, no. 12, pp. 4655–4666, Dec. 2007.
[47] M. Aharon, M. Elad, and A. Bruckstein, "K-SVD: An algorithm for designing overcomplete dictionaries for sparse representation," IEEE Trans. Signal Process., vol. 54, no. 11, pp. 4311–4322, Nov. 2006.
[48] S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein, "Distributed optimization and statistical learning via the alternating direction method of multipliers," Found. Trends Mach. Learn., vol. 3, no. 1, pp. 1–122, Jan. 2011.
[49] A. Beck and M. Teboulle, "A fast iterative shrinkage-thresholding algorithm for linear inverse problems," SIAM J. Imag. Sci., vol. 2, no. 1, pp. 183–202, 2009.
[50] T. Goldstein, B. O'Donoghue, S. Setzer, and R. Baraniuk, "Fast alternating direction optimization methods," SIAM J. Imag. Sci., vol. 7, no. 3, pp. 1588–1623, 2014.
[51] O. Breuleux, Y. Bengio, and P. Vincent, "Quickly generating representative samples from an RBM-derived process," Neural Comput., vol. 23, no. 8, pp. 2058–2073, Aug. 2011.
[52] P. Vincent, H. Larochelle, Y. Bengio, and P.-A. Manzagol, "Extracting and composing robust features with denoising autoencoders," in Proc. 25th Int. Conf. Mach. Learn. (ICML), New York, NY, USA, 2008, pp. 1096–1103.
[53] C. M. Bishop, Pattern Recognition and Machine Learning (Information Science and Statistics). Secaucus, NJ, USA: Springer-Verlag, 2006.
[54] E. Cambria et al., "Extreme learning machines [trends & controversies]," IEEE Intell. Syst., vol. 28, no. 6, pp. 30–59, Nov. 2013.
[55] Y. LeCun, F. J. Huang, and L. Bottou, "Learning methods for generic object recognition with invariance to pose and lighting," in Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit. (CVPR), vol. 2, Jun. 2004, pp. II-94–II-104.

C. L. Philip Chen (S'88–M'88–SM'94–F'07) received the M.S. degree in electrical engineering from the University of Michigan, Ann Arbor, MI, USA, in 1985, and the Ph.D. degree in electrical engineering from Purdue University, West Lafayette, IN, USA, in 1988.
He was a tenured professor in the United States for 23 years. He is currently the Dean of the Faculty of Science and Technology, University of Macau, Macau, China, where he is the Chair Professor of the Department of Computer and Information Science. He has also served as a Department Head and an Associate Dean at two different universities. His current research interests include systems, cybernetics, and computational intelligence.
Dr. Chen is a fellow of AAAS, the Chinese Association of Automation, and HKIE. He was the President of the IEEE Systems, Man, and Cybernetics Society from 2012 to 2013. He has been the Editor-in-Chief of the IEEE TRANSACTIONS ON SYSTEMS, MAN, AND CYBERNETICS: SYSTEMS since 2014 and an Associate Editor of several IEEE Transactions. He is also the Chair of TC 9.1 Economic and Business Systems of the International Federation of Automatic Control, and an Accreditation Board of Engineering and Technology Education Program Evaluator for computer engineering, electrical engineering, and software engineering programs.

Zhulin Liu received the bachelor's degree in mathematics from Shandong University, Shandong, China, in 2005, and the M.S. degree in mathematics from the University of Macau, Macau, China, in 2009, where she is currently pursuing the Ph.D. degree with the Faculty of Science and Technology.
Her current research interests include computational intelligence, machine learning, and function approximation.
