Classifying With Adaptive Hyper-Spheres: An Incremental Classifier Based on Competitive Learning

Tie Li, Gang Kou, Yi Peng, and Yong Shi
Abstract—Nowadays, datasets are always dynamic and the patterns in them are changing. Instances with different labels are intertwined and often linearly inseparable, which brings new challenges to traditional learning algorithms. This paper proposes the adaptive hyper-sphere (AdaHS) classifier, an adaptive incremental classifier, and its kernelized version, Nys-AdaHS. The classifier incorporates competitive training with a border zone. With an adaptive hidden layer and tunable radii of hyper-spheres, AdaHS has a strong capability for local learning like instance-based algorithms, but is free from their slow searching speed and excessive memory consumption. The experiments showed that AdaHS is robust, adaptive, and highly accurate. It is especially suitable for dynamic data in which patterns are changing, decision borders are complicated, and instances with the same label can be spherically clustered.

Index Terms—Adaptive algorithms, Nyström method, pattern clustering, self-organizing feature maps (SOFMs).

Manuscript received October 31, 2016; accepted September 26, 2017. This work was supported by the National Natural Science Foundation of China under Grant 71325001 and Grant 71771037. This paper was recommended by Associate Editor Z. Liu. (Corresponding author: Yi Peng.) T. Li and Y. Peng are with the School of Management and Economics, University of Electronic Science and Technology of China, Chengdu 611731, China (e-mail: [email protected]; [email protected]). G. Kou is with the School of Business Administration, Southwestern University of Finance and Economics, Chengdu 611130, China (e-mail: [email protected]). Y. Shi is with the Key Laboratory of Big Data Mining and Knowledge Management, Chinese Academy of Sciences, Beijing 100190, China (e-mail: [email protected]). Digital Object Identifier 10.1109/TSMC.2017.2761360

Fig. 1. Data points that are not linearly separable.

I. INTRODUCTION

ADAPTIVE incremental learning, also called online learning, aims to handle dynamic data arriving in real time [1]. Data in the real world are not always processed all at once, for example in real-time financial data analysis, network intrusion detection, and dynamic Web page mining [2]–[4]. Zhou and Chen [5] proposed three types of incremental learning: 1) example-incremental learning; 2) class-incremental learning; and 3) attribute-incremental learning (A-IL). The first two types, which assume that the attributes are fixed, have been studied more than A-IL [6]–[9].

With the availability of increasing amounts of dynamic data, a great deal of research on incremental learning has been conducted in recent years [6]–[10]. Some of these studies categorize the modeling strategies of online learning into statistical learning models and adversarial models. A latent assumption of statistical learning models is that variables are "independent and identically distributed." However, patterns evolve and change over time, and the statistical assumptions are hard to satisfy [10]. For example, fraudulent information and censorship are adversarial in credit risk assessment. Fraud and anti-fraud strategies evolve interactively, which results in changing patterns [11]. Analogous phenomena can also be found in the stock market and in network intrusion detection. In practice, given the complexity of data sources, uniform patterns do not always exist across an entire dataset. Specific patterns only fit certain parts of the dataset at a particular time. It is therefore necessary for learning algorithms to adapt to new patterns in dynamic data.

Fig. 1 shows an example of changing patterns in different areas of a dataset. Some feasible solutions for identifying changing patterns in dynamic data are described as follows.

1) Instance-Based Learning: A well-known instance-based learning algorithm is k-nearest neighbors (k-NN). The problem with k-NN is that the search time is usually unbearable and the memory consumption is too high in large-scale data applications [12]. Although indexing technologies such as ball trees, KD-trees, R-trees, locality-sensitive hashing, and other hashing technologies [13] can make the search for similar points faster, the k-NN model is too simple and lacks a characterization of the data distribution [14].

2) SVM and Kernel Methods: SVMs use hyper-planes to divide the space [15]. However, hyper-planes are not
adaptive enough to divide a space with a highly complicated data distribution (such as Fig. 1). In fact, a hyper-plane decision border is not realistic in most cases. SVMs rely on kernel tricks to project instances into a reproducing kernel Hilbert space [16]. The problem is that kernel methods are hard to apply to large-scale datasets because the time cost of computing the kernel matrix is O(n^2). Furthermore, the search for the optimal parameters of the kernel functions is also time consuming.

Given the limitations of hyper-planes, a straightforward intuition is to use hyper-spheres to divide the space [17]. Many methods, such as support vector data description (SVDD), ball trees, competitive learning, and clustering algorithms, explicitly or implicitly use hyper-spheres to divide the space. SVDD is a representative model that uses a mathematical method to obtain an appropriate hyper-sphere. It builds a minimum-radius hyper-sphere around the data. The primal form of the optimization problem of SVDD [18] is

    min R^2 + C Σ_{i=1}^{n} ξ_i
    s.t. ||x_i − α||^2 ≤ R^2 + ξ_i,  ξ_i ≥ 0                                   (1)

where x_i ∈ R^m, i = 1, ..., n, is the training data; R represents the radius; ξ_i are the slack variables; α is the center of the hyper-sphere; and C is the penalty constant. SVDD was first introduced for single-class classification and outlier detection [18], and many improvements have since been made for multiclass classification by building one hyper-sphere for each class [19]. The dual problem of SVDD can be expressed in inner-product form. When the data distribution is complex, SVDD also uses kernel methods to project the instances into a reproducing kernel Hilbert space, in which the instances of a particular class are more likely to be enclosed by one single hyper-sphere [18].
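For orientation only (this restatement is not part of the original paper), the inner-product form referred to above is, in the standard SVDD formulation from the literature,

    max_β  Σ_i β_i κ(x_i, x_i) − Σ_{i,j} β_i β_j κ(x_i, x_j)
    s.t.   Σ_i β_i = 1,   0 ≤ β_i ≤ C

which involves the data only through inner products κ(x_i, x_j) and is therefore directly kernelizable.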

The main advantage of SVDD is that it can be solved via mathematical optimization methods and can easily use kernel tricks. The limitations of SVDD include the following.
1) There is only one hyper-sphere for each class. If the data distribution of the same class is complex, one hyper-sphere is obviously not enough [19].
2) It is hard for SVDD to determine the number of hyper-spheres adaptively.

3) Clustering-Based Classification Models: Numerous studies have revealed that there is a connection between clustering and classification [17], [20], [21]. Such studies include radial basis function networks (RBFNs) [20] and functional link neural networks [21]. RBFN is a clustering-classification style neural network classifier and has an incremental version, IRBFN, but its number of clusters is fixed, which limits its adaptivity [20]. Adaptive resonance theory (ART) can use clusters for classification, add clusters adaptively, and be trained incrementally [22].

4) Competitive Learning: ART is a type of competitive neural network. It tries to fit each new input pattern into an existing subclass. If no matching subclass can be found, a new subclass is created containing the new pattern. In the past two decades, extensive research has been conducted regarding ART, including ART-1, ART-2, ART-3, fuzzy ART, and ARTMAP [23]–[25]. The primary goal of ARTMAP, fuzzy ARTMAP, and Gaussian ARTMAP is to solve the plasticity–stability dilemma [25]. In general, the structure of ART is too complicated for most cases.

ART, counter propagation networks (CPNs) [26], [27], and learning vector quantization [28] are usually used for classification or supervised clustering. They are partially based on the self-organizing feature map (SOFM), namely "Kohonen learning" [29], which is an incremental clustering algorithm and can be trained in linear time. Xiao and Chaovalitwongse [30] proposed an analogous model based on k-NN, in which the hyper-spheres are referred to as "prototypes." Valente and Abrão [31] proposed a multi-input multi-output transmit scheme based on the morphological perceptron. Dai and Song [32] proposed a supervised competitive learning algorithm for the generation of multiple classifier systems.

The main criticism of competitive learning has been that its accuracy is not as high as that of mathematical optimization models, especially when the size of a dataset is small. In incremental learning, most mathematical optimization problems resort to stochastic gradient descent (SGD) or its variants as the substitute for gradient descent methods, and SGD suffers from "regret error" [33]. Other disadvantages of competitive learning include the following.
1) It has no mechanism for minimizing generalization errors, which may cause over-fitting.
2) Unlike SVDD, it lacks an explicit definition of each hyper-sphere's varying boundary, or "decision border."
3) It is difficult to express in inner-product form, and thus hard to combine with kernel tricks.

Fig. 2. Causations of the proposed model.

Given the limitations of the existing models, this paper proposes a new model based on the sequence of causations shown in Fig. 2. The purpose of this paper is to propose a new incremental learning model that can be used for the aforementioned complicated dynamic scenarios. Adaptability, locality, and bounded memory consumption are the key requirements.

Assumption 1: In order to bind this paper within a specific framework, we make the following assumptions.
1) Regarding Clustering: Data distributions are always complex, with instances of different labels intertwined. There are usually a large number of clusters containing instances with the same label in local dense areas. Local clustering is practical in most cases and easy to accomplish using methods such as local distance metric learning [34].
2) Regarding Local Consistency: We assume that instances near to one another tend to have the same label. This assumption is used in many semisupervised learning studies [35] and conforms to the basic theories of local models, such as RBFN and k-NN.

Contributions: The main contributions of the proposed model are as follows.
1) It utilizes the adaptivity of competitive neural networks while recognizing decision borders of data points like SVM.
2) It can be incrementally trained and does not require retraining when new patterns emerge. It can be applied to datasets that are not linearly separable and maintains reasonable memory consumption.
3) It can apply feasible kernel methods on small- and large-scale datasets.

The main differences between the proposed model and the existing ones are summarized in Table I.

TABLE I. Comparison of the Hyper-Spheres-Based Models.

The remainder of this paper is organized as follows. Section II describes the basic theory, including the analysis of competitive learning and kernel methods. Section III presents the proposed algorithms. Section IV reports the results and discussion of the experiments. Section V concludes this paper with conclusions and future work.

II. BASIC THEORY

A. Basic Theory of Supervised Competitive Learning

We partially borrow the topological structure of the CPN to introduce our model. CPNs are a combination of competitive networks and Grossberg's [22] outstar networks. The topological structure of a CPN has three layers: 1) input layer; 2) hidden layer; and 3) output layer (Fig. 3).

Fig. 3. Topological structure of CPN.

Suppose there are N elements in the input layer, M neurons in the hidden layer, and L neurons in the output layer. Let the vector V_i = (v_i1, ..., v_iN)^T denote the weights of neuron i in the hidden layer connecting to each of the elements of the input layer. Then V = (V_1, ..., V_M) denotes the weight matrix of the instars. If the training in stage 1 can be viewed as a clustering process, then neuron i is cluster c_i and V_i is the centroid of cluster c_i.

When an instance arrives, the network computes the proximity between the instance and each V_i in the weight matrix, i.e., the centroid of cluster c_i. Here, proximity can be measured by computing the inner product net_j = V_j^T x (j = 1, 2, ..., M). A winner-takes-all strategy determines which neuron's weights are adjusted. The winner is net_{j*} = max{net_j}; in other words, the winner is c_{j*}, whose centroid is the closest to the incoming instance. The winning neuron's weights are adjusted as follows:

    V_{j*}(t + 1) = V_{j*}(t) + α(x − V_{j*}(t))                              (2)

where α is the learning rate, indicating that the centroid of the winning cluster moves in the direction of x. As instances keep arriving, the weight vectors, i.e., the centroids of the hyper-spheres, tend to move toward the densest regions of the space. This first stage of the CPN's training algorithm is a process of self-organizing clustering, although it is structured as a network.

The second part of the structure is Grossberg [22] learning. We will redesign a different hidden layer and a different connection from the hidden layer to the output layer.
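As an illustration of this first training stage (not code from the paper; the class name CompetitiveLayer and the default learning rate are choices made here), the winner-takes-all update of (2) can be sketched in Python, using Euclidean proximity to the centroids:

import numpy as np

class CompetitiveLayer:
    """Minimal instar layer: one weight vector (cluster centroid) per hidden neuron."""
    def __init__(self, centroids, alpha=0.1):
        self.V = np.asarray(centroids, dtype=float)   # shape (M, N)
        self.alpha = alpha                            # learning rate in (2)

    def winner(self, x):
        # Winner-takes-all: the neuron whose centroid is closest to x
        return int(np.argmin(np.linalg.norm(self.V - x, axis=1)))

    def update(self, x):
        j = self.winner(x)
        # Eq. (2): move the winning centroid toward the incoming instance
        self.V[j] += self.alpha * (x - self.V[j])
        return j

# Usage: stream instances one by one; centroids drift toward dense regions.
layer = CompetitiveLayer(centroids=np.random.rand(6, 2), alpha=0.1)
for x in np.random.rand(100, 2):
    layer.update(x)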

Fig. 4. Artificial datasets and the proposed clustering solutions.

B. Advantages and Disadvantages of the Original Model

To illustrate the advantages and disadvantages of the original model, a set of 2-D artificial datasets was created and visualized in Fig. 4.

In Fig. 4(a), instances can be grouped into six clusters. Setting the number of neurons in the hidden layer to six, the first training stage of the model in Fig. 3 can automatically find the centroids of the six clusters, which are represented by the weights of the six neurons. The second training stage can learn each cluster's connection to the right class. The distance from each instance in Fig. 4(a) to its cluster centroid is smaller than its distances to the centroids of the other clusters. The dataset shown in Fig. 4(a) is ideal for the CPN to classify.

The data distribution in Fig. 4(a) is simplified and idealistic. Data with a distribution similar to Fig. 4(b) cause two kinds of problems for the original model.
1) First, the self-organized clustering process depends on the similarity measures between data points and the hyper-sphere centroids. Points closer to one cluster's centroid may belong to another cluster. Therefore, every cluster should have a definite scope or radius, and the scope should be as far away from the others as possible.
2) Second, the number of clusters in the hidden layer is fixed in the original model. However, it is difficult to estimate the number of clusters in advance. Given different numbers of neurons in the hidden layer, the accuracy varies dramatically. The training of the instar layer, i.e., the clustering process, is contingent on this fixed number.

C. Building of the DMZ

To solve the first aforementioned problem, we should have a general knowledge of the scope of the clusters. For example, points of cluster A [in Fig. 4(b)] near the border may be closer to the centroid of cluster B, so these points will be considered to belong to cluster B in the original model. We must identify the decision border that separates clusters according to their labels. When two instances with conflicting labels fall into the same cluster, it gives us an opportunity to identify a border point that lies somewhere between the two conflicting instances (as long as the instance is not an outlier). To maintain the maximum margin and for the sake of simplicity, the median point of the two instances can be selected as a point in a zone called a demilitarized zone (DMZ), and clusters should be kept as far away from the DMZ as possible. As the number of conflicting instances increases, a general zone gradually forms as the DMZ. This mechanism can find borders of any shape that are surrounded by many hyper-spheres.

To solve the second problem, the number of clusters should not be predetermined. The clusters should be formed dynamically and merged or split if necessary. The scope of the hyper-spheres, represented by the corresponding radii, should be adjusted on demand. As an example, consider the situation presented in Fig. 4(b): with instances of conflicting labels found in the top cluster, the original cluster should tune its radius. After training, a new cluster would be formed beneath the top cluster, containing instances whose labels differ from those in the top cluster. The radii of the two clusters should be tuned according to their distance to the borders.

One single hyper-sphere may not enclose an area whose shape is not hyper-spherical [36]. However, any shape can be enclosed as long as the number of formed hyper-spheres is unlimited. Consider the clusters represented by the 2-D circles in Fig. 4(c). All of the instances can be clustered no matter what the data distribution is and what the shape of the border is, as long as there are enough hyper-spheres of varying radii and they are properly arranged.

D. Proposed Topological Structure

Given the solutions above, the structure of our improved model is shown in Fig. 5.

Fig. 5. Topological structure of the proposed model.

The first difference is that our model has an adaptive, dynamic hidden layer: the number of neurons in the hidden layer is adaptive. The second difference is that each neuron H_i connects to only one particular neuron in the output layer, and w_ij is used to record the radius of neuron H_i.

E. Kernelization

It is challenging for competitive learning models to apply kernel methods because they cannot be denoted in inner-product form. Some previous studies use approximation methods for the kernelization of competitive learning [37], [38]. This paper uses the Nyström method to kernelize the proposed model [39], [40].

Let the kernel matrix be written in block form

    A = [ A_11  A_12
          A_21  A_22 ].                                                        (3)

Let C = [A_11 A_21]^T. The Nyström method uses A_11 and C to approximate the large matrix A. Suppose C is a uniform sampling of the columns; the Nyström method then generates a rank-k approximation of A (k ≤ n), defined by

    A_k^nys = C A_11^+ C^T = [ A_11              A_21^T
                               A_21    A_21 A_11^+ A_21^T ] ≈ A                (4)

where A_11^+ denotes the generalized pseudo-inverse of A_11. There exists an eigendecomposition A_11^+ = V Λ^{-1} V^T, such that each element (A_k^nys)_ij of A_k^nys can be decomposed as

    (A_k^nys)_ij = C_i^T V Λ^{-1} V^T C_j
                 = (Λ^{-1/2} V^T C_i)^T (Λ^{-1/2} V^T C_j)
                 = (Λ^{-1/2} V^T (κ(x_i, x_1), ..., κ(x_i, x_m))^T)^T
                   · (Λ^{-1/2} V^T (κ(x_j, x_1), ..., κ(x_j, x_m))^T)          (5)

where κ(x_i, x_j) is the base kernel function, and x_1, x_2, ..., x_m are representative data points that can be obtained by uniform sampling or by clustering methods such as K-means and SOFM. Let

    φ_m(x) = Λ^{-1/2} V^T (κ(x, x_1), ..., κ(x, x_m))^T                        (6)

such that

    (A_k^nys)_ij = φ_m(x_i)^T φ_m(x_j) = κ_m(x_i, x_j).                        (7)

With the Nyström method, we obtain an explicit approximation of the nonlinear projection φ(x), namely

    x → φ_m(x).                                                                (8)

To justify why we use kernel methods for our model, we first used the Nyström method to raise the dimension of dataset 3 to 403, and then used singular value decomposition (SVD) to reduce the dimension to 2 for the purpose of visualization. Fig. 6 illustrates the transformed dataset 3 from Fig. 4(c).

Fig. 6. Artificial dataset 3 after Nyström and SVD transformation.

Compared with Fig. 4(c), the data in Fig. 6 can be covered with fewer hyper-spheres, or equivalently, each hyper-sphere can enclose more data points. Because the sampling points in the Nyström method can be obtained dynamically, the projection of (8) can be applied to every single instance in competitive learning and thus directly to our incremental model.

Without loss of generality, we use φ(x) to denote a potential projection of x in the remainder of this paper. If the model works in the original space, the projection of x is simply x itself.
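As a concrete illustration (not code from the paper), the explicit Nyström feature map of (6) can be sketched with numpy; the RBF base kernel and the names used here are assumptions made for readability:

import numpy as np

def rbf_kernel(X, Y, gamma=1.0):
    # kappa(x, y) = exp(-gamma * ||x - y||^2), the base kernel later used in (13)
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def nystrom_feature_map(landmarks, gamma=1.0, eps=1e-10):
    """Return phi such that phi(x_i) . phi(x_j) approximates kappa(x_i, x_j).

    landmarks: the m representative points x_1..x_m (uniform sample or SOFM/K-means centers).
    """
    A11 = rbf_kernel(landmarks, landmarks, gamma)   # m x m block of the kernel matrix
    eigvals, V = np.linalg.eigh(A11)                # eigendecomposition of A11
    keep = eigvals > eps                            # pseudo-inverse: drop near-zero eigenvalues
    W = V[:, keep] / np.sqrt(eigvals[keep])         # equals V * Lambda^{-1/2}
    def phi(X):
        # Eq. (6): phi_m(x) = Lambda^{-1/2} V^T (kappa(x, x_1), ..., kappa(x, x_m))^T
        return rbf_kernel(np.atleast_2d(X), landmarks, gamma) @ W
    return phi

# Usage: project streaming instances into the approximate kernel feature space.
landmarks = np.random.rand(50, 2)
phi = nystrom_feature_map(landmarks, gamma=2.0)
z = phi(np.random.rand(2))   # explicit finite-dimensional representation, usable incrementally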

III. PROPOSED CLASSIFIER: ADAHS

The main characteristic of the proposed model is that it adaptively builds hyper-spheres. Therefore, we call the model adaptive hyper-spheres (AdaHS), and the version after Nyström projection is called Nys-AdaHS.

A. Training Stages

Our algorithms are trained in three stages, which are described below.

Stage 1 (Forming Hyper-Spheres and Adjusting Centroids and Radii):

1) Forming Hyper-Spheres and Adjusting Centroids: Given that instances are read dynamically, there is no hyper-sphere at the beginning. The first input instance forms a hyper-sphere whose centroid is the instance itself and whose initial radius is set to a large value. When a new instance arrives and does not fall into any existing hyper-sphere, a new hyper-sphere is formed in the same way. If a new instance falls into one or more existing hyper-spheres, the winner is the one whose centroid is closest to the new instance. The winning cluster's centroid is recalculated as

    c_i(t + 1) = c_i(t) + α[φ(x) − c_i(t)]                                     (9)

where x is the new input instance, c_i(t) is the original centroid of the hyper-sphere, c_i(t + 1) is the new centroid, and α is the learning rate. As the number of instances falling within a particular hyper-sphere grows, its centroid tends to move toward the densest zone. In order to speed up the search for the winner, we build simple k-dimension trees over all hyper-spheres. With knowledge of the radius, it is easy to determine the upper and lower bounds of the selected k dimensions. In this way, the algorithm avoids computing the Euclidean distance for every instance and hyper-sphere pair.

2) Building the Decision Border Zone (DMZ): The goal of this step is to find the DMZ's median points, which approximate the shape of the DMZ. We find the points using the following technique. The first time a labeled instance falls into a hyper-sphere, the hyper-sphere is labeled with the label of this instance. If another instance with a conflicting label falls into the same hyper-sphere, it indicates that the hyper-sphere has entered the DMZ. We identify the nearest data point in the hyper-sphere to the newly input conflicting instance, and let p_i represent the median point as follows:

    p_i = (φ(x_conflicting) + c_i) / 2                                         (10)

where φ(x_conflicting), p_i ∈ c_i, and p_i is recorded and used in the posterior clustering process.

3) Adjusting the Radii of Hyper-Spheres: Once a DMZ point is found in a hyper-sphere, the radius of the hyper-sphere should be updated so that it does not enter the DMZ. The new radius of hyper-sphere c_i should therefore be set as

    r_i = d(p_i, c_i) − d_safe                                                 (11)

where d_safe represents a safe distance that a hyper-sphere should keep from the closest DMZ point.

The logic of this stage is outlined in Algorithm 1 below.

Algorithm 1 Forming of Hyper-Spheres and Adjusting of the Centroids and Radii
Input:
  x, the newly read instance;
Output:
  C: a set of hyper-spheres whose centroids and radii are tuned properly;
  DMZ: a set of points that approximately shape the decision border.
Method:
(1)  ct = Null, len = −1;
(2)  For Each ci in C
(3)    If φ(x) falls into ci
         // Find the winner of the hyper-spheres
(4)      If label(x) = label(ci) and (len = −1 or dE(φ(x), ci) < len)
(5)        ct = ci;                 // store the current nearest hyper-sphere
(6)        len = dE(φ(x), ci);      // store the current nearest distance
(7)      Else If label(x) ≠ label(ci)   // split the hyper-sphere
(8)        pi = ½(φ(x_conflicting) + ci), with φ(x_conflicting), pi ∈ ci;
(9)        Add pi to DMZ;
(10)       ri = d(pi, ci) − d_safe;  // adjust the radius of hyper-sphere ci
(11)       Mark ci as "support hyper-sphere";
(12)     End If
(13)   End If
(14) End For
(15) If ct ≠ Null
       // adjust the winning hyper-sphere's centroid
(16)   ci(t + 1) = ci(t) + α[φ(x) − ci(t)];
(17) Else
(18)   Form a new hyper-sphere, and make φ(x) its centroid;
(19)   Let the label of the new hyper-sphere be label(x).
(20) End If
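The following Python sketch mirrors the logic of Stage 1 and Algorithm 1 (illustrative only; the class and parameter names such as HyperSphere and d_safe are choices made here, and a simple linear scan replaces the k-d trees of the paper's Java implementation):

import numpy as np

class HyperSphere:
    def __init__(self, centroid, label, radius):
        self.c = np.asarray(centroid, dtype=float)
        self.label = label
        self.r = radius
        self.count = 1                      # number of instances absorbed so far

class AdaHSStage1:
    def __init__(self, init_radius=1.0, alpha=0.1, d_safe=1e-3):
        self.spheres, self.dmz = [], []
        self.init_radius, self.alpha, self.d_safe = init_radius, alpha, d_safe

    def train_one(self, x, y):
        x = np.asarray(x, dtype=float)
        winner, best = None, np.inf
        for s in self.spheres:
            d = np.linalg.norm(x - s.c)
            if d > s.r:                     # x does not fall into this hyper-sphere
                continue
            if s.label == y and d < best:   # candidate winner with a consistent label
                winner, best = s, d
            elif s.label != y:              # conflicting label: the sphere touches the DMZ
                p = 0.5 * (x + s.c)         # eq. (10): median point between x and the centroid
                self.dmz.append(p)
                s.r = max(np.linalg.norm(p - s.c) - self.d_safe, 0.0)  # eq. (11)
        if winner is not None:
            winner.c += self.alpha * (x - winner.c)   # eq. (9): pull the centroid toward x
            winner.count += 1
        else:                               # no consistent sphere covers x: open a new one
            self.spheres.append(HyperSphere(x, y, self.init_radius))

# Usage on a small stream of labeled 2-D points:
model = AdaHSStage1(init_radius=0.5)
for x, y in zip(np.random.rand(200, 2), np.random.randint(0, 2, 200)):
    model.train_one(x, y)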

Stage 2 (Merging Hyper-Spheres): Hyper-spheres may overlap with one another or even be contained in others. Therefore, after a certain period of training, a merging operation should be performed. Suppose that we have two hyper-spheres, c_A and c_B, whose radii are not the same. Let c_big = max_radius(c_A, c_B), c_small = min_radius(c_A, c_B), d_t = d(c_big, c_small), and let θ be the merging coefficient. If d_t + r_small <= r_big + θ × r_small, the prerequisite to merge is met. Then let r_temp = d_t + r_small, and the new radius of c_big will be r_new = max(r_temp, r_big).

The details of this stage are outlined in Algorithm 2.

Algorithm 2 Merging of Hyper-Spheres
Input:
  C: a set of hyper-spheres formed in stage 1;
Output:
  C: the remaining hyper-spheres after merging.
Method:
(1) For Each ci in C
(2)   For Each cj in C except ci
(3)     cbig = max_radius(ci, cj), csmall = min_radius(ci, cj), dt = d(cbig, csmall);
(4)     If dt + rsmall <= rbig + θ × rsmall     // θ is the merging coefficient
(5)       Merge ci and cj;
(6)     End If
(7)   End For
(8) End For

Stage 3 (Selecting Hyper-Spheres): Since the training process is entirely autonomous, the number of generated hyper-spheres can be large. Therefore, the final stage needs to select hyper-spheres. Three prominent types of hyper-spheres are described as follows.
1) The first type includes a large number of instances. Because these are the fundamental hyper-spheres that contain most data points, they are marked as "core hyper-spheres."
2) The second type has fewer instances but is located near the border. These are marked as "support hyper-spheres," because such hyper-spheres can be found by measuring the distance between the hyper-spheres and the nearest DMZ points.
3) The third type has a small number of instances and lies far away from the border. These hyper-spheres can be discarded.

To achieve high classification accuracy, both core hyper-spheres and support hyper-spheres should be selected. The logic of the third stage is outlined in Algorithm 3.

Algorithm 3 Selection of Hyper-Spheres
Input:
  C: the set of hyper-spheres formed in the preceding stages;
Output:
  C: the remaining hyper-spheres after selection.
Method:
(1) For Each ci in C
      // T is the minimum number of instances a hyper-sphere must contain.
      // num(c) computes the number of instances in a hyper-sphere.
      // d(ci, DMZ) is the distance from the centroid of ci to the nearest data point in DMZ.
(2)   If num(ci) < T
(3)     If ri < d(ci, DMZ)
(4)       Discard ci;
(5)     End If
(6)   Else
(7)     Mark ci as "core hyper-sphere";
(8)   End If
(9) End For
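A compact Python sketch of Stages 2 and 3 follows (again an illustration under the same assumed HyperSphere structure as the earlier sketch, not the paper's Java code; the merge is a simplified one-pass variant of the pairwise check in Algorithm 2):

import numpy as np

def merge_spheres(spheres, theta=0.5):
    """Stage 2: absorb a smaller sphere into a bigger one when it is (almost) contained."""
    merged = []
    for s in sorted(spheres, key=lambda s: -s.r):     # visit bigger spheres first
        absorbed = False
        for big in merged:
            dt = np.linalg.norm(big.c - s.c)
            if dt + s.r <= big.r + theta * s.r:       # merging prerequisite of Algorithm 2
                big.r = max(dt + s.r, big.r)          # r_new = max(r_temp, r_big)
                big.count += s.count
                absorbed = True
                break
        if not absorbed:
            merged.append(s)
    return merged

def select_spheres(spheres, dmz, T=5):
    """Stage 3: keep core spheres (>= T instances) and support spheres (touching the DMZ)."""
    dmz = np.asarray(dmz) if len(dmz) else None
    kept = []
    for s in spheres:
        if s.count >= T:                              # core hyper-sphere
            kept.append(s)
        elif dmz is not None:
            d = np.min(np.linalg.norm(dmz - s.c, axis=1))
            if s.r >= d:                              # near the border: support hyper-sphere
                kept.append(s)
        # otherwise: few instances and far from the DMZ -> discard
    return kept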

B. Mini-Batch Learning and Distributed Computing

To make the model applicable to large-scale applications, we encapsulate the proposed algorithms in a MapReduce framework. We collect the incoming instances as a mini-batch set and then train them in MapReduce tasks. The computing model of the algorithms is illustrated in Fig. 7.

Fig. 7. MapReduce computing model.

The collected mini-batch instances are encapsulated in key-value pairs and mapped into mapper tasks. In each mapper task, the operations are based on instances. For every instance, the mapper queries the local cache to find out into which hyper-spheres the instance falls, marks the winning hyper-sphere and the conflicting ones, and emits the affected hyper-spheres along with a description of the needed operations as key-value <id, hyper-sphere> pairs.

In each reducer task, the operations are based on hyper-spheres, which are aggregated according to the hyper-sphere id emitted from the mapper tasks. The competitive learning can be conducted collectively with the aggregated instances. The tuning of a radius is performed only once, with the closest conflicting instance, and the reducer should identify the orphan points and return the tuned hyper-sphere at the end.

After one turn of the MapReduce tasks, the merging and selection of the hyper-spheres should be performed. After all of the operations, the tuned hyper-spheres are saved to the cache. The orphan points are retrained in the next turn. In the whole MapReduce process, subtasks do not coordinate with each other. Thus the hyper-spheres and the DMZ are not updated in real time within a mini-batch turn; they are updated collectively after all reducer tasks return.
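As a rough illustration of this mini-batch flow (a sketch only; the paper's actual implementation uses Java and Hazelcast, the function names here are invented, spheres is assumed to be a dict of id -> {"c": [...], "r": float, "label": ...}, and orphan-point handling is omitted):

from collections import defaultdict

def distance(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b)) ** 0.5

def map_phase(batch, spheres):
    """Mapper: for each instance, find covering spheres and emit <sphere_id, operation> pairs."""
    emitted = []
    for x, y in batch:
        for sid, s in spheres.items():
            if distance(x, s["c"]) <= s["r"]:
                op = "update" if s["label"] == y else "conflict"
                emitted.append((sid, (op, x, y)))
    return emitted

def reduce_phase(emitted, spheres, alpha=0.1):
    """Reducer: aggregate operations per hyper-sphere id and apply them collectively."""
    grouped = defaultdict(list)
    for sid, op in emitted:
        grouped[sid].append(op)
    for sid, ops in grouped.items():
        s = spheres[sid]
        conflicts = [x for kind, x, y in ops if kind == "conflict"]
        if conflicts:
            # tune the radius only once, using the closest conflicting instance
            # (roughly eqs. (10)-(11), with d_safe omitted for brevity)
            nearest = min(conflicts, key=lambda x: distance(x, s["c"]))
            s["r"] = distance(nearest, s["c"]) / 2.0
        for kind, x, y in ops:
            if kind == "update":
                s["c"] = [c + alpha * (xi - c) for c, xi in zip(s["c"], x)]
    return spheres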

C. Predicting Labels

Just like other supervised competitive neural networks, AdaHS must determine the winning hyper-sphere in the hidden layer to predict the label of a new instance. There are two situations. In the first situation, the new instance falls into an existing hyper-sphere, and the label of the instance is determined by the label of that hyper-sphere. In the second situation, the new instance does not fall into any existing hyper-sphere, and the label of the new instance is coordinated by the labels of the k nearest hyper-spheres

    y = arg max_{l_j} Σ_{c_i ∈ N_k(x)} w_j I(y_i = l_j)                        (12)

where w_j = exp(−d_E(φ(x), c_j)^2 / (2 r_j^2)); i = 1, 2, ..., L; j = 1, 2, ..., k; N_k(x) is the set of the k nearest hyper-spheres; and I is the indicator function. The default value of k is set to 3.
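The fallback vote of (12) can be sketched as follows (illustrative Python, reusing the assumed HyperSphere objects from the earlier sketch):

import numpy as np
from collections import defaultdict

def predict(x, spheres, k=3):
    """Predict a label: use the covering sphere if one exists, else a weighted k-nearest vote (12)."""
    x = np.asarray(x, dtype=float)
    dists = np.array([np.linalg.norm(x - s.c) for s in spheres])
    inside = [s for s, d in zip(spheres, dists) if d <= s.r]
    if inside:
        # situation 1: the winner is the covering sphere with the closest centroid
        return min(inside, key=lambda s: np.linalg.norm(x - s.c)).label
    # situation 2: weighted vote among the k nearest hyper-spheres
    votes = defaultdict(float)
    for idx in np.argsort(dists)[:k]:
        s = spheres[idx]
        w = np.exp(-dists[idx] ** 2 / (2.0 * max(s.r, 1e-12) ** 2))   # weight from eq. (12)
        votes[s.label] += w
    return max(votes, key=votes.get)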
IV. EXPERIMENTS

We implemented our classifier in Java, with the help of third-party JARs including commons-math3, Weka, JOptimizer, and a local caching framework. The distributed MapReduce implementation of AdaHS was built upon Hazelcast [41], which also provides a distributed caching system. Most experiments were conducted on a computer with an i7-4560U processor (4 CPUs, 2 GHz), 8-GB RAM, and Ubuntu OS. The distributed deployment of AdaHS was conducted on clusters of two and four machines, respectively, using the same configuration.

A. Benchmark Datasets

To evaluate AdaHS, we used 20 datasets as benchmarks. Among them, three were the 2-D artificial datasets mentioned in Section II; "loans" and "URLs" were real datasets that we collected from other real projects; the other datasets were sourced from the University of California at Irvine Machine Learning Repository [42] and LibSVM [43]. All attributes were numeric, and the details of the datasets are summarized in Table II.

TABLE II. Details of the Datasets.

B. Kernel Approximation With the Nyström Method

As shown in (6), two types of elements need to be determined: the sampling points and the kernel function.

For datasets with fewer than 500 dimensions, we select all data points as the samples. For datasets with more than 500 fields, we use SOFM, which can be viewed as an independent parallel process of AdaHS, to obtain 500 cluster centers and use these centers as the representative points in (6). Previous research showed that sampling with clustering enables the Nyström method to achieve a much better approximation than uniform sampling [36].

We used the radial basis function as the base kernel function for the Nyström method

    κ(x_i, x_j) = exp(−γ ||x_i − x_j||^2).                                     (13)

The optimal values of the parameter γ were obtained by grid search over [2^−12, 2^−11, ..., 2^12]. For comparison, we also performed a grid search for kernelized SVM in the same way. The optimal values, which make the classifiers perform best on each dataset with regard to accuracy, are recorded in Table III. It can be observed that the optimal values for SVM and for our model were not the same.

TABLE III. Optimal Value for γ in the Kernel Function.

C. Training: Clustering With Class Constraints

AdaHS uses the labels of instances during the clustering phase. Training AdaHS consists of a clustering process, a monitoring process that watches boundary points, and a building process that constructs the DMZ. Thus, training is a process of adaptive clustering, with the constraint that all instances in a hyper-sphere must have the same label.

Both AdaHS and Nys-AdaHS record the number of hyper-spheres after training. Table IV shows the number of hyper-spheres for all datasets. It can be observed that the Nyström method generated a much smaller number of hyper-spheres than the original model.

TABLE IV. Number of Hyper-Spheres.

D. Convergence Tests for AdaHS

We use the "distortion error" to monitor the training process and test for convergence. The distortion error is defined as follows [37], [38]:

    error = Σ_{i=1}^{n} ||φ(x_i) − c_i||^2                                     (14)

where c_i is the centroid of the hyper-sphere to which x_i belongs. If AdaHS converges, it should find the globally optimal solution. In such a situation, both the number of hyper-spheres and the distortion error should be stable and converge to a particular value. Otherwise, the value of the distortion error oscillates and does not converge. As long as the constraints on the same dataset are satisfied, the smaller the distortion error, the better the quality of the clustering. We performed this test on the 20 benchmark datasets.

Fig. 8. Convergence test on all of the datasets. (a) Dataset 1. (b) Dataset 2. (c) Dataset 3. (d) Iris. (e) Seeds. (f) Segment. (g) Wholesale. (h) Glass. (i) Diabetes. (j) Wine. (k) Credit-g. (l) Credit-rating. (m) Phishing websites. (n) Credit card. (o) Pendigits. (p) Shuttle. (q) Occupancy. (r) HAPT. (s) Loans. (t) URLs.

On datasets that could be easily clustered, such as dataset 1, the distortion error decreases to the bottom of the curve in only a few iterations and remains stable. On datasets that could not be easily hyper-spherically clustered, such as dataset 2, the distortion error values converge to the minimum after relatively more iterations of training. As shown in Fig. 4(b), the borders are relatively complex, so it took more hyper-spheres to enclose the entire set of instances and more time to converge. On some datasets, such as iris, segment, wholesale, credit-rating, and HAPT, the classifier converged after several small oscillations.

The results of the convergence tests on all 20 datasets showed that, given enough hyper-spheres and with the constraint that all instances in the same cluster have the same label, the competitive learning was able to provide a clustering solution no matter how complex and irregular the decision border is and whatever the data distribution is.
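A minimal way to monitor (14) during training, continuing the earlier illustrative sketch (here each instance is simply assigned to its nearest centroid), is:

import numpy as np

def distortion_error(X, spheres):
    """Eq. (14): sum of squared distances from each instance to the centroid of its assigned sphere."""
    total = 0.0
    for x in np.asarray(X, dtype=float):
        d = min(np.linalg.norm(x - s.c) for s in spheres)   # nearest (winning) hyper-sphere
        total += d ** 2
    return total

# Usage: record the error once per training epoch and plot it to check convergence, e.g.
# errors.append(distortion_error(X_train, model.spheres))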

E. Performance Evaluation

Tenfold cross-validation was used to test the accuracy of AdaHS. To examine the details of the resulting predictions, performance on the two types of prediction was studied separately.

1) Two Types of Prediction: As discussed in Section III, our algorithm may be confronted with two situations in prediction, i.e., there is an explicit winning hyper-sphere, or the instance does not fall into any existing hyper-sphere. Experiments on the 20 datasets showed that the prediction accuracies of the two situations varied. The accuracies of the first situation were much higher than those of the second, as shown in Table V.

TABLE V. Details of the Accuracy (%) in Situations I and II.

2) Accuracy and Time Cost Comparison: To evaluate the relative performance of AdaHS, we selected several other well-known algorithms, including naïve Bayes, LDA, SVM, C4.5, RBFN, and other incremental learning algorithms, for comparison. Both accuracy and time cost were recorded. The comparative results are shown in Tables VI and VII.

TABLE VI. Accuracy (%) Comparison With Other Algorithms.

TABLE VII. Time Cost (Seconds) Comparison With Other Algorithms.

The results in Tables VI and VII show that C4.5 performed best on the datasets phishing_sites and loans, whose attributes are mostly nominal. LDA and L-SVM performed well on datasets with a globally consistent pattern, such as iris, seeds, wine, and wholesale, but performed poorly on datasets 1–3, segment, glass, and URLs. k-NN, k-SVM, and LWL were slow on large-scale datasets such as loans and URLs. Kernel methods improved the performance on most datasets, both in Nys-AdaHS and in k-SVM.

AdaHS fit quite well on specific datasets, such as datasets 1–3, shuttle, occupancy, loans, and URLs, while maintaining acceptable accuracy on the other datasets. As a local model, AdaHS works well on datasets that are linearly inseparable. In addition, because of its built-in clustering mechanism, its accuracy is comparable to k-NN and even to SVM with kernel methods, but it is free from their slow searching speed and excessive memory consumption.

We observed slightly lower performance of AdaHS and k-NN on diabetes, wine, credit-g, and credit-rating. This is due to the "bad distance metrics" noted in [31], which are crucial to distance-based models, implying that the features were not selected or scaled properly. Besides, one assumption of AdaHS is that instances in local areas can be clustered well. If this assumption is violated, such as in occupancy and loans, AdaHS regresses toward the behavior of k-NN: there will be too many clusters and DMZ points in memory, and the searching speed will drop accordingly.

F. Discussion

1) Time Complexity and Space Complexity: It is obvious that the time costs of Algorithms 1–3 are n × m, m^2, and m, respectively, where n is the number of data points and m is the number of clusters. The original form of the time complexity is O(nm + m^2 + m). If the number of clusters m is constant, the total computational cost is O(n). That means that if the "clustering" assumption holds, AdaHS runs in linear time. The data kept in memory are the clusters and the DMZ information, so the space cost is O(m + l), where l is the data size of the DMZ.

In Nys-AdaHS, the time cost of SOFM is O(nk), where k is the cluster number of SOFM and the target dimension of the Nyström method. The time cost of the SVD on A_11^+ in (4) is O(rk^2), where r is the rank of A_11^+ [40], and the multiplication with the vector in (6) also takes O(nk). So the total time cost is O(rk^2 + 2nk + nm + m^2 + m). With the cluster centers of SOFM kept in memory, the space cost of Nys-AdaHS is O(k + m + l).

It can be observed from Table VII that the time cost of Nys-AdaHS is far less than that of k-SVM, especially on the last seven datasets, because the computation of the kernel matrix in k-SVM takes O(n^2), which makes it hardly feasible in real applications. For example, on URLs, k-SVM took 6.085E5 s (about seven days), which is not realistic in practice; Nys-AdaHS only took 894 s.

2) Significance of the Nyström Method for AdaHS: Our motivation for applying kernel methods to AdaHS is not exactly the same as SVM's. Based on the data in Tables IV, VI, and VII, the benefits of the Nyström method for AdaHS are summarized as follows.

a) Improving classification accuracy: On datasets 1–3, segment, wholesale, glass, and loans, kernel methods improved the accuracy of SVM dramatically. That is because the RBF kernel brings a local learning ability to SVM and improves its accuracy on datasets that are not linearly separable. AdaHS does not rely as heavily on kernel methods as SVM does in terms of accuracy, since AdaHS is already a local model. Nevertheless, if a kernel method can pull dissimilar points apart and make them linearly separable in the new space [16], it becomes easier for the hyper-spheres to enclose the points and classify them. The effect of this benefit can be observed on most of the datasets in Table VI, where most accuracies were improved slightly.

b) Increasing the hyper-spheres' ability for data definition: It can be observed from Table IV that the number of clusters was reduced significantly on all datasets with the Nyström method. That means each hyper-sphere in the new space can enclose more data points. In SVDD, this phenomenon was stated as "kernel methods increase the hyper-sphere's ability for data definition" [18]. The reason is that AdaHS is a clustering-based method, and now each cluster contains more information. This is especially useful when we analyze evolving trends or the similar instances contained in the same cluster.

c) Improving the training speed: The Nyström method does add a time cost of O(2nk + rk^2). However, by projecting data points into a new space with a simpler distribution, the number of clusters can be reduced significantly, and the time cost saved by this is O((m_1 − m_2)(n + m_1 + m_2 + 1)), where m_1 and m_2 refer to the number of clusters in AdaHS and Nys-AdaHS, respectively. So there is a tradeoff between the two terms. On large-scale datasets with complex data distributions (i.e., when the original number of clusters is very large, as for credit card, occupancy, loans, and URLs), the total learning time of Nys-AdaHS can be reduced.
V. CONCLUSION

To deal with dynamic data and changing patterns, this paper proposed a new algorithm, AdaHS, which incorporates the adaptivity of competitive neural networks and the idea of building a border zone. It has a strong capability for local learning. By keeping only cluster and DMZ information in memory, it avoids the problem of excessive memory consumption and improves the searching speed dramatically. The experiments on 20 datasets showed that AdaHS is especially suitable for datasets whose patterns are changing, whose decision borders are complex, and whose instances with the same label can be spherically clustered. AdaHS has great potential in fields like anti-fraud analysis, network intrusion detection, the stock market, and credit scoring.

AdaHS is proposed as a classifier to deal with changing patterns, which is a subtopic of system uncertainties [44]–[47]. The theory of system uncertainties is significant for many important applications such as "actuator dynamics" [46], "multiagent-based systems" [47], and "nonlinear systems" [47]. One of our future works will take these problems into consideration and explore potential applications in those areas.

REFERENCES

[1] A. Bouchachia, "Adaptation in classification systems," in Foundations of Computational Intelligence, vol. 2. Heidelberg, Germany: Springer, 2009, pp. 237–258.
[2] G. Kou, Y. Peng, and G. Wang, "Evaluation of clustering algorithms for financial risk analysis using MCDM methods," Inf. Sci., vol. 275, pp. 1–12, Aug. 2014.
[3] G. Kou, Y. Peng, Y. Shi, Z. Chen, and X. Chen, "A multiple-criteria quadratic programming approach to network intrusion detection," in Data Mining and Knowledge Management, vol. 3327. Heidelberg, Germany: Springer, 2005, pp. 145–153. [Online]. Available: https://link.springer.com/chapter/10.1007%2F978-3-540-30537-8_16#citeas
[4] Y. Huang and G. Kou, "A kernel entropy manifold learning approach for financial data analysis," Decis. Support Syst., vol. 64, pp. 31–42, Aug. 2014.
[5] Z.-H. Zhou and Z.-Q. Chen, "Hybrid decision tree," Knowl. Based Syst., vol. 15, no. 8, pp. 515–528, 2002.
[6] C. Alippi, D. Liu, D. Zhao, and L. Bu, "Detecting and reacting to changes in sensing units: The active classifier case," IEEE Trans. Syst., Man, Cybern., Syst., vol. 44, no. 3, pp. 353–362, Mar. 2014.
[7] V. Bruni and D. Vitulano, "An improvement of kernel-based object tracking based on human perception," IEEE Trans. Syst., Man, Cybern., Syst., vol. 44, no. 11, pp. 1474–1485, Nov. 2014.
[8] H. He, S. Chen, K. Li, and X. Xu, "Incremental learning from stream data," IEEE Trans. Neural Netw., vol. 22, no. 12, pp. 1901–1914, Dec. 2011.
[9] M. Pratama, S. G. Anavatti, P. P. Angelov, and E. Lughofer, "PANFIS: A novel incremental learning machine," IEEE Trans. Neural Netw. Learn. Syst., vol. 25, no. 1, pp. 55–68, Jan. 2014.
[10] L. L. Minku, A. P. White, and X. Yao, "The impact of diversity on online ensemble learning in the presence of concept drift," IEEE Trans. Knowl. Data Eng., vol. 22, no. 5, pp. 730–742, May 2010.
[11] Y. Guo, W. Zhou, C. Luo, C. Liu, and H. Xiong, "Instance-based credit risk assessment for investment decisions in P2P lending," Eur. J. Oper. Res., vol. 249, no. 2, pp. 417–426, 2016.
[12] C.-H. Chen, "Feature selection for clustering using instance-based learning by exploring the nearest and farthest neighbors," Inf. Sci., vol. 318, pp. 14–27, Oct. 2015.
[13] R. Spring and A. Shrivastava, "Scalable and sustainable deep learning via randomized hashing," in Proc. ACM SIGKDD, Halifax, NS, Canada, 2017, pp. 445–454.
[14] A. Andoni and P. Indyk, "Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions," Commun. ACM, vol. 51, no. 1, pp. 117–122, 2008.
[15] P. Laskov, C. Gehl, S. Krüger, and K.-R. Müller, "Incremental support vector learning: Analysis, implementation and applications," J. Mach. Learn. Res., vol. 7, pp. 1909–1936, Sep. 2006.
[16] S. Agarwal, V. V. Saradhi, and H. Karnick, "Kernel-based online machine learning and support vector reduction," Neurocomputing, vol. 71, nos. 7–9, pp. 1230–1237, 2008.
[17] B. Li, M. Chi, J. Fan, and X. Xue, "Support cluster machine," in Proc. ICML, Corvallis, OR, USA, 2007, pp. 505–512.
[18] G. Chen, X. Zhang, Z. J. Wang, and F. Li, "Robust support vector data description for outlier detection with noise or uncertain data," Knowl. Based Syst., vol. 90, pp. 129–137, Dec. 2015.
[19] G. Huang, H. Chen, Z. Zhou, F. Yin, and K. Guo, "Two-class support vector data description," Pattern Recognit., vol. 44, no. 2, pp. 320–329, 2011.
[20] Z. Uykan, C. Guzelis, M. E. Celebi, and H. N. Koivo, "Analysis of input–output clustering for determining centers of RBFN," IEEE Trans. Neural Netw., vol. 11, no. 4, pp. 851–858, Jul. 2000.
[21] B. Chandra and M. Gupta, "A novel approach for distance-based semi-supervised clustering using functional link neural network," Soft Comput., vol. 17, no. 3, pp. 369–379, 2013.
[22] S. Grossberg, "Adaptive resonance theory: How a brain learns to consciously attend, learn, and recognize a changing world," Neural Netw., vol. 37, pp. 1–47, Jan. 2013.
[23] G. A. Carpenter, S. Grossberg, and J. H. Reynolds, "ARTMAP: Supervised real-time learning and classification of nonstationary data by a self-organizing neural network," Neural Netw., vol. 4, no. 5, pp. 565–588, 1991.
[24] J. R. Williamson, "Gaussian ARTMAP: A neural network for fast incremental learning of noisy multidimensional maps," Neural Netw., vol. 9, no. 5, pp. 881–897, 1996.
[25] B. Vigdor and B. Lerner, "The Bayesian ARTMAP," IEEE Trans. Neural Netw., vol. 18, no. 6, pp. 1628–1644, Nov. 2007.
[26] R. Hecht-Nielsen, "Counter propagation networks," Appl. Opt., vol. 26, no. 23, pp. 4979–4983, 1987.
[27] Y. Dong, M. Shao, and X. Tai, "An adaptive counter propagation network based on soft competition," Pattern Recognit. Lett., vol. 29, no. 7, pp. 938–949, 2008.
[28] P. Schneider, M. Biehl, and B. Hammer, "Adaptive relevance matrices in learning vector quantization," Neural Comput., vol. 21, no. 12, pp. 3532–3561, 2009.
[29] T. Kohonen, "Self-organized formation of topologically correct feature maps," Biol. Cybern., vol. 43, no. 1, pp. 59–69, 1982.
[30] C. Xiao and W. A. Chaovalitwongse, "Optimization models for feature selection of decomposed nearest neighbor," IEEE Trans. Syst., Man, Cybern., Syst., vol. 46, no. 2, pp. 177–184, Feb. 2016.
[31] R. A. Valente and T. Abrão, "MIMO transmit scheme based on morphological perceptron with competitive learning," Neural Netw., vol. 80, pp. 9–18, Apr. 2016.
[32] Q. Dai and G. Song, "A novel supervised competitive learning algorithm," Neurocomputing, vol. 191, pp. 356–362, May 2016.
[33] N. N. Schraudolph, J. Yu, and S. Günter, "A stochastic quasi-Newton method for online convex optimization," J. Mach. Learn. Res., pp. 436–443, 2007.
[34] W. Bian and D. Tao, "Constrained empirical risk minimization framework for distance metric learning," IEEE Trans. Neural Netw. Learn. Syst., vol. 23, no. 8, pp. 1194–1205, Aug. 2012.
[35] K. Chen and S. H. Wang, "Semi-supervised learning via regularized boosting working on multiple semi-supervised assumptions," IEEE Trans. Pattern Anal. Mach. Intell., vol. 33, no. 1, pp. 129–143, Jan. 2011.
[36] M. Zia-ur Rehman, T. Li, Y. Yang, and H. Wang, "Hyper-ellipsoidal clustering technique for evolving data stream," Knowl. Based Syst., vol. 70, pp. 3–14, Nov. 2014.
[37] J. Lai and C. Wang, "Kernel and graph: Two approaches for nonlinear competitive learning clustering," Front. Elect. Electron. Eng., vol. 7, no. 1, pp. 134–146, 2012.
[38] J.-S. Wu, W.-S. Zheng, and J.-H. Lai, "Approximate kernel competitive learning," Neural Netw., vol. 63, pp. 117–132, Mar. 2015.
[39] S. Kumar, M. Mohri, and A. Talwalkar, "Sampling methods for the Nyström method," J. Mach. Learn. Res., vol. 13, pp. 981–1006, Apr. 2012.
[40] J. Lu, S. C. H. Hoi, J. Wang, P. Zhao, and Z.-Y. Liu, "Large scale online kernel learning," J. Mach. Learn. Res., vol. 17, no. 47, pp. 1–43, 2016.
[41] Hazelcast. Hazelcast: The Leading In-Memory Data Grid. Accessed: Apr. 5, 2016. [Online]. Available: http://hazelcast.com
[42] UC Irvine Machine Learning Repository. Accessed: Dec. 13, 2015. [Online]. Available: http://archive.ics.uci.edu/ml/index.php
[43] LibSVM. LIBSVM Data: Classification, Regression, and Multi-Label. Accessed: Dec. 16, 2015. [Online]. Available: https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/
[44] C. Chen et al., "Adaptive fuzzy asymptotic control of MIMO systems with unknown input coefficients via a robust Nussbaum gain-based approach," IEEE Trans. Fuzzy Syst., vol. 25, no. 5, pp. 1252–1263, Oct. 2017.
[45] Z. Liu, G. Lai, Y. Zhang, X. Chen, and C. L. P. Chen, "Adaptive neural control for a class of nonlinear time-varying delay systems with unknown hysteresis," IEEE Trans. Neural Netw. Learn. Syst., vol. 25, no. 12, pp. 2129–2140, Dec. 2014.
[46] C. Chen, Z. Liu, Y. Zhang, C. L. P. Chen, and S. L. Xie, "Saturated Nussbaum function based approach for robotic systems with unknown actuator dynamics," IEEE Trans. Cybern., vol. 46, no. 10, pp. 2311–2322, Oct. 2016.
[47] C. Chen et al., "Adaptive consensus of nonlinear multi-agent systems with non-identical partially unknown control directions and bounded modelling errors," IEEE Trans. Autom. Control, vol. 62, no. 9, pp. 4654–4659, Sep. 2017.

Tie Li received the M.S. degree in management science and engineering from Shanxi University, Taiyuan, China, in 2009. He is currently pursuing the doctoral degree with the School of Management and Economics, University of Electronic Science and Technology of China, Chengdu, China. He has authored four software packages and ten papers. His current research interests include big data mining and information management.

Gang Kou received the B.S. degree in physics from Tsinghua University, Beijing, China, and the M.S. degree in computer science and the Ph.D. degree in information technology from the University of Nebraska at Omaha, Omaha, NE, USA. He is a Distinguished Professor of the Chang Jiang Scholars Program and the Executive Dean of the School of Business Administration, Southwestern University of Finance and Economics, Chengdu, China. Dr. Kou is the Managing Editor of the International Journal of Information Technology and Decision Making and the Editor-in-Chief of the Springer book series on Quantitative Management.

Yi Peng received the B.S. degree in management information systems from Sichuan University, Chengdu, China, in 1997, and the M.S. degree in management information systems and the Ph.D. degree in information technology from the University of Nebraska at Omaha, Omaha, NE, USA, in 2007. From 2007 to 2011, she was an Assistant Professor with the School of Management and Economics, University of Electronic Science and Technology of China, Chengdu, China, where she has been a Professor since 2011. She has authored three books and over 100 articles. Her current research interests include data mining, multiple criteria decision making, and data mining applications.

Yong Shi received the Ph.D. degree in management science and computer systems from the University of Kansas, Lawrence, KS, USA, in 1991. He is the Director of the Key Laboratory of Big Data Mining and Knowledge Management and the Research Center on Fictitious Economy and Data Science, Chinese Academy of Sciences, Beijing, China. Since 1999, he has been the Charles W. and Margre H. Durham Distinguished Professor of Information Technology with the College of Information Science and Technology, Peter Kiewit Institute, Omaha, NE, USA, and the University of Nebraska, Lincoln, NE, USA. He has authored 17 books and over 200 papers. His current research interests include business intelligence, data mining, and multiple criteria decision making.
