
Neurocomputing 159 (2015) 9–26

Contents lists available at ScienceDirect

Neurocomputing
journal homepage: www.elsevier.com/locate/neucom

An improved bee colony optimization algorithm with an application to document clustering

Rana Forsati (a,*), Andisheh Keikha (b), Mehrnoush Shamsfard (a)

(a) Natural Language Processing (NLP) Research Lab., Faculty of Electrical and Computer Engineering, Shahid Beheshti University, G. C., Tehran, Iran
(b) Department of Computer Science and Engineering, Ryerson University, Ontario, Canada

Article info

Article history:
Received 23 September 2013
Received in revised form 20 November 2014
Accepted 5 February 2015
Available online 24 February 2015
Communicated by Dorothy Ndedi Monekosso

Keywords:
Swarm intelligence
Bee colony optimization
Document clustering

Abstract

The bee colony optimization (BCO) algorithm has proved to be one of the fast, robust and efficient global search heuristics for tackling different practical problems. In this paper we apply BCO to data clustering, a fundamental problem that frequently arises in many applications. However, we discovered some obstacles in directly applying the classical BCO to the clustering problem, which led us to change some basic behaviors of this swarm algorithm. In particular, we present an improved bee colony optimization algorithm, dubbed IBCO, that introduces cloning and fairness concepts into the BCO algorithm and makes it more efficient for data clustering. These features give BCO powerful and balanced exploration and exploitation capabilities to effectively guide the search process toward the proximity of high quality solutions. In particular, the cloning feature allows it to take advantage of experiences gained from previous generations when generating new solutions. The problem of getting stuck in local optima, however, remains in the improved version. To overcome this weakness of the swarm algorithm in searching locally, we hybridize it with the k-means algorithm to exploit the fine-tuning power of this widely used method, which shows good results in local searches. We propose four different hybridized algorithms based on the IBCO and k-means algorithms and investigate their clustering results and convergence behavior. We empirically demonstrate that our hybrid algorithms alleviate the problem of getting stuck in a local solution even for large and high-dimensional data sets such as those arising in document clustering. The results show that the proposed algorithms are robust enough to be used in many applications compared to k-means and other recently proposed evolutionary-based clustering algorithms, including genetic, particle swarm optimization, ant colony, and bee-based algorithms.

© 2015 Published by Elsevier B.V.

1. Introduction

Clustering is one of the crucial unsupervised learning techniques for dealing with massive amounts of heterogeneous information. The aim of clustering is to group a set of data objects into a set of meaningful sub-classes, called clusters, which may or may not be disjoint. Clustering is a fundamental tool in exploratory data analysis with practical importance in a wide variety of applications such as data mining, machine learning, pattern recognition, statistical data analysis, data compression, and vector quantization [88]. The aim of clustering is to find the hidden structure underlying a given collection of data points. In other words, in clustering, a set of patterns, usually vectors in a multi-dimensional space, are classified in such a way that patterns in the same cluster have more similarity to each other than patterns in different clusters [35,76].

Some of the most conventional clustering methods can be broadly classified into two main categories [6,36]. The first category includes the hierarchical clustering methods. A hierarchical algorithm [30,41,68,96] creates a hierarchical decomposition of the given dataset forming a dendrogram, a tree which splits the dataset recursively into smaller subsets and represents the objects in a multi-level structure. The hierarchical procedures can be further divided into agglomerative (bottom-up) algorithms and divisive (top-down) algorithms [87]. In the first category, each element is initially assigned to a separate cluster; the algorithm then repeatedly merges pairs of clusters until a certain stopping criterion is met [87]. On the other hand, the divisive algorithms begin with the whole set of objects and proceed to divide it into a certain number of clusters successively.

Our concern in this paper is based on partitioning clustering [10] methods, which include the most practical clustering algorithms, especially for large data sets. The attempt is to divide the

* Corresponding author.
E-mail addresses: [email protected] (R. Forsati), [email protected] (A. Keikha), [email protected] (M. Shamsfard).

http://dx.doi.org/10.1016/j.neucom.2015.02.048
0925-2312/© 2015 Published by Elsevier B.V.

data set into a set of disjoint clusters without the hierarchical structure. Partitioning methods try to partition a collection of objects into a set of groups, so as to maximize a pre-defined objective value. The most popular partitioning clustering algorithms are the prototype-based clustering methods, where each cluster is represented by the center of the cluster and the objective function used (a square error function) is the sum of the distances from the patterns to the center [64].

Although hierarchical methods are often said to have better quality in clustering, they usually do not provide the reallocation of objects, which may have been poorly classified in the early stages of the analysis [36], and their time complexity is declared to be quadratic [42]. On the other hand, in recent years the partitioning clustering methods have shown a lot of advantages in applications involving large datasets due to their relatively low computational requirements [42,44]. The time complexity of the partitioning technique is almost linear, which makes it widely appealing in real world problems.

Among the partitioning clustering algorithms, especially the center-based clustering algorithms, the k-means algorithm [57] is the most popular thanks to its simplicity and efficiency. Although the k-means algorithm is simple, straightforward and easy to implement and works fast in most situations, it suffers from some major drawbacks that make it inappropriate for many applications. The first disadvantage is that the number of clusters K must be specified prior to application. Also, since the summary statistic is the mean of the values for each cluster, the individual members of the cluster can have a high variance, and the mean may not be a good summary of the cluster members. In addition, as the number of clusters grows, for example to thousands of clusters, k-means clustering becomes untenable, approaching O(n^2) comparisons, where n is the number of points. However, for relatively few clusters and a reduced set of pre-selected words, the k-means algorithm can do well [84]. Another major drawback of the k-means algorithm is its sensitivity to initial centers. Finally, the k-means algorithm converges to the nearest local optimum from the starting position of the search, and the final clusters may not be the optimal solution [25].

In order to overcome these problems that exist in traditional partitioning clustering methods, especially k-means, new concepts and techniques have recently been proposed in this area by researchers from different fields. One of these techniques is optimization methods that try to optimize a pre-defined function, which can be very useful in data clustering. Optimization techniques define a global function to capture the quality of the best partitioning and try to optimize its value by traversing the search space. Therefore different artificial intelligence based clustering methods, such as statistics [24], graph theory [93], expectation-maximization algorithms [65], artificial neural networks [62,51,70], evolutionary algorithms [76,21,72], and swarm intelligence algorithms [83,71,39,73], have been proposed. In principle, any general purpose optimization method can serve as the basis for this approach. Methods such as genetic algorithms [63,33,67], simulated annealing, ant colony optimization [79], particle swarm optimization [18,71,50] and harmony search [25] have been used for data clustering in the context of other meta-heuristics. Also some algorithms based on the bees' behavior have been proposed for this problem, such as honey bee [23,94,77], the bees algorithm [73], and the artificial bee colony algorithm [94,97,89,40]. Another swarm intelligence algorithm based on the bees' behavior is bee colony optimization [54-56], which is our focus in this paper to highlight the power of this optimization algorithm in the data clustering problem.

The BCO algorithm [54-56] is a nature-inspired meta-heuristic optimization method, which is similar to the way bees in nature look for food, and the way optimization algorithms search for an optimum in combinatorial optimization problems. The performance of the BCO algorithm has been compared with those of other well-known heuristic algorithms such as the genetic algorithm, differential evolution algorithm, and particle swarm optimization algorithm for unconstrained optimization problems. The bee colony algorithm has been very successful in a wide variety of optimization problems [82] in engineering and control. In fact, in optimization problems, we want to search the solution space, and in the BCO algorithm this search can be done more efficiently. Since stochastic optimization approaches are good at avoiding convergence to a locally optimal solution, these approaches could be used to find a globally near-optimal solution [45].

The BCO algorithm belongs to the class of population-based techniques applied to find solutions for difficult combinatorial optimization problems. The major idea behind the BCO is to create a multi-agent system capable of efficiently solving hard combinatorial optimization problems. These features increase the flexibility of the BCO algorithm and produce better solutions. The artificial bee colony behaves in some ways like, and in other ways differently from, bee colonies in nature. The bees explore through the search space looking for feasible solutions. In order to discover better and better solutions, artificial bees cooperate with each other and exchange information. Also, they focus on more promising areas and gradually discard solutions from the less promising areas via collective knowledge and sharing of information among themselves.

As the behavior of the k-means algorithm is mostly influenced by the number of clusters specified and the random choice of initial cluster centers, in this study we concentrate on the latter, making the results less dependent on the initial cluster centers chosen, hence more stable, by introducing different algorithms based on the BCO for clustering. In summary, the present work makes the following contributions:

- A basic bee colony based clustering (BCOCLUST) algorithm which solves the clustering problem with the classical BCO method. This basic algorithm has some problems regarding some basic behaviors of the BCO algorithm that cause the bees to follow one solution after a while and get stuck in a local optimum.
- An improved BCO algorithm obtained by introducing cloning and fairness concepts into the BCO algorithm. These modifications are aimed at increasing the explorative power of the BCO algorithm and the propagation of knowledge in an optimization process, respectively. The second proposed clustering algorithm is based on the improved BCO method and referred to as IBCOCLUST, which provides a better modeling for the specific application of clustering.
- Hybrid clustering algorithms using the k-means and IBCOCLUST algorithms. Although the problem of getting stuck in a local optimum has been solved in the IBCOCLUST method, the algorithm still suffers from locating the best solution in the proximity of the found global solution. The hybrid techniques alleviate this problem by combining the fine-tuning capability of the k-means in the proximity of the global solution and the searching power of the IBCOCLUST in locating the global solution. The hybrid methods improve the k-means algorithm by making it less dependent on initial parameters, such as randomly chosen initial cluster centers, hence more stable. It seems that the hybrid algorithms that combine the two ideas can result in an algorithm that can outperform either one individually.
- To demonstrate the effectiveness and convergence rate of IBCOCLUST and the hybrid algorithms, we have applied these algorithms on various standard datasets and got very promising results compared to the k-means and GA and PSO-based clustering algorithms [57,43]. BCO and PSO algorithms fall into the same class of artificial intelligence optimization algorithms, population-based algorithms,

and they are proposed by inspiration of swarm intelligence. Besides comparing the BCO algorithm with the PSO algorithm, the performance of the BCO algorithm is also compared with a wide set of classification techniques that are also given in [82].
- Also, having in mind that document clustering is one of the major challenges in information extraction, to better evaluate the functionality of our proposed algorithms, we apply them to this important application as well. The evaluation of the experimental results based on accuracy, robustness, and convergence rate shows considerable improvement and robustness of the hybridized algorithms for large scale document clustering.

Outline: The paper is organized as follows. We begin in Section 2 by thoroughly surveying the related works that are mostly aligned to our work. In Section 3 we provide background on the basics, including the clustering problem and the principles of the BCO meta-heuristic. The basic BCO based clustering algorithm and the improved BCO algorithm, along with the hybrid algorithms, are discussed in Section 4. Section 5 presents the data sets used in our experiments and an empirical study of BCO parameters on the convergence behavior of the BCOCLUST algorithm. It also contains the performance evaluation of the proposed algorithms compared to well-known algorithms. The experiments and analysis of the proposed algorithms on the application of document clustering are also presented in Section 6. Finally, Section 7 concludes the paper.

2. More related work

Earlier in the introduction, we discussed some of the main lines of research on clustering; here, we survey further lines of study that are directly related to our work on meta-heuristic based clustering algorithms.

Genetic algorithm based clustering: The genetic algorithm (GA) is inspired by the theory of natural selection and begins with a population of solutions which tries to survive in an environment (defined with fitness evaluation). The parent population shares its properties of adaptation to the environment with the children through various mechanisms of evolution, such as genetic crossover and mutation. The process continues over a number of generations to find a desirable solution [28].

The GA has been extensively utilized for the clustering problem. The paper [7] was among the first to propose the use of the basic GA for partitional clustering and in particular document clustering [11]. The standard binary encoding scheme with a fixed number of cluster centers is used for initialization of chromosomes. The reproduction operation is carried out using uniform crossover and cluster-oriented mutation (altering the bits of the binary string). Ravindra Krovi [47] investigated the potential feasibility of using genetic algorithms for the purpose of clustering. A novel hybrid genetic k-means algorithm, dubbed GKA, was proposed by [45], which finds a globally optimal partition of given data into a specified number of clusters. This hybrid method circumvents expensive crossover operations by using the classical gradient descent algorithm that is used in k-means clustering. Using finite Markov chain theory, it was proved that GKA converges to the global optimum. The fast genetic k-means algorithm (FGKA) [52] was inspired by GKA but features several improvements over it. The incremental genetic k-means algorithm (IGKA) [53] was an extension of the previously proposed FGKA clustering algorithm. IGKA outperforms FGKA when the mutation probability is small. The main idea of IGKA was to calculate the total within-cluster variation objective value and the cluster centroids incrementally whenever the mutation probability was small. IGKA inherits the salient feature of FGKA of always converging to the global optimum.

Ant colony based clustering: The ant colony optimization (ACO) algorithm is inspired by ants' behavior in determining the optimal path from nests to the source of food [19]. The clustering problem in its optimization formulation can be solved utilizing the ACO method, as explored in [79]. In [91] a multi-ant-colonies approach for clustering data consists of some parallel and independent ant colonies and a queen ant agent. Each ant colony process takes different ant moving speeds and different versions of the probability conversion function to generate various clustering results. A number of hybrid algorithms based on the ACO method are available in the literature. Initially Kuo et al. [48] proposed an ants based k-means algorithm, which was subsequently improved by hybridization of ACO, self-organizing maps and k-means in [12]. Further, Jiang et al. have developed new hybrid clustering algorithms by combining the ACO with the k-harmonic means algorithm in [38] and the DBSCAN algorithm in [37]. The work in [32] utilized the ACO based clustering for document retrieval, and the AntClust algorithm was introduced in [49] for web session clustering.

A new ant colony based method for text clustering using a validity index is introduced in [90]. In this method the walking of the ants is mapped to the picking or dropping of projected document vectors with different probabilities. In another work, Zhang et al. [95] suggest that the random movements of ants in the solution space lead to slow convergence. They provide a method for faster document clustering, called AFTC. The approach employs the pheromone laid by the ants to avoid randomness of movement, which leads the ants to move towards a direction with high pheromone concentration at each step. The direction of movement is the orientation where the text vectors are relatively more concentrated. A new text clustering approach named elite ant colony optimization clustering (EACOC), based on suitable retention of the elites, has been introduced in [34]. The mechanism is to retain the elites as the algorithm works, in a way that in each iteration it retains a certain number of valuable solutions into the next cycle, with the purpose of improving algorithm performance. A new fully controllable ant colony algorithm (FCACA) for document clustering has been introduced in [20]. This introduces a new version of the basic heuristic decision function that significantly improves the convergence and provides greater control over the process of grouping data.

Particle swarm based clustering: The particle swarm optimization (PSO) algorithm is based on the swarming behavior of particles searching for food in a collaborative manner [13]. Cluster analysis using PSO was proposed in [69] for image clustering. Then Van der Merwe and Engelbrecht [85] applied it to cluster analysis of arbitrary datasets. The algorithm in its basic form for cluster analysis consists of a swarm in a d-dimensional search space in which each particle position consists of K cluster centroid vectors. A number of recent works tried to modify the PSO algorithm to make it more effective for the clustering problem. In [14] the PSO algorithm was adapted to position prototypes (particles) in regions of the space that represent natural clusters of the input data set by influencing the particles' velocity update from the previous position along with taking into account past experiences. A hybrid algorithm based on k-means and PSO is also proposed in [14]. In [2] a PSO based clustering algorithm is proposed for web usage mining and clustering.

Biologically inspired based clustering: The biologically inspired algorithms comprise natural meta-heuristics derived from living phenomena and the behavior of biological organisms. These algorithms encompass artificial immune systems [15] and bacterial foraging optimization [29]. These methods have recently been applied to the clustering problem. In [86] a new clustering algorithm

Table 1
Summary of notations consistently used in the paper and their meaning.

Symbol : Meaning
n : The number of data objects
d : The ambient dimension of data objects
D = {d_1, ..., d_n} : The set of objects to be clustered
K : The number of clusters
A ∈ {0,1}^{n×K} : The assignment matrix
C = {c_1, c_2, ..., c_K} : The cluster centers associated with an assignment matrix A
D_M(·,·): R^d × R^d → R_+ : The Minkowski similarity measure between data points
D_C(·,·): R^d × R^d → R_+ : The cosine similarity measure between data points
B : The number of bees in the hive
B = {b_1, b_2, ..., b_B} : The set of B bees used in optimization
R : The number of recruiter bees
T : The total number of iterations
M : The number of constructive moves
S : The number of stages at each iteration of the BCO algorithm
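As a concrete reading of the notation in Table 1, the sketch below (illustrative Python, not from the paper; the function names and the toy data are our own) implements the two similarity measures listed there, the Minkowski distance D_M and the cosine measure D_C, together with the sum-of-intra-cluster-distances objective minimized by the clustering algorithms discussed later: each object is charged its distance to the closest of the K centers.

```python
import math

def minkowski(d1, d2, p=2):
    """Minkowski distance D_M between two d-dimensional points; p = 2 gives Euclidean."""
    return sum(abs(a - b) ** p for a, b in zip(d1, d2)) ** (1.0 / p)

def cosine(d1, d2):
    """Cosine correlation measure D_C: 1 for identical directions, 0 for orthogonal vectors."""
    dot = sum(a * b for a, b in zip(d1, d2))
    n1 = math.sqrt(sum(a * a for a in d1))
    n2 = math.sqrt(sum(b * b for b in d2))
    return dot / (n1 * n2)

def intra_cluster_objective(D, C, dist=minkowski):
    """f(D, C): sum over all objects of the distance to the closest of the K centers."""
    return sum(min(dist(d, c) for c in C) for d in D)

# Toy example: n = 4 points in d = 2 dimensions, K = 2 centers.
D = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0)]
C = [(0.0, 0.5), (5.0, 5.5)]
print(intra_cluster_objective(D, C))  # each point is 0.5 from its center: 2.0
```

Note that D_C is a similarity rather than a distance (larger means closer), so when the cosine measure is used, 1 - D_C(d, d') is the corresponding dissimilarity to plug into the min-based objective.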

based on the mechanism analysis of bacterial foraging is proposed. It is an optimization based methodology in which a group of bacteria forage to converge to certain positions as final cluster centers by minimizing the fitness function. The initial solution space is created by assigning the bacteria positions as the randomly chosen cluster centroids in the data set. Then the chemotaxis process defines the movement of bacteria, which represents either a tumble followed by a tumble or a tumble followed by a run. Younsi and Wang [92] have developed and used a new artificial immune system algorithm for data clustering that uses the interaction between B-cells and antigens to tackle the optimization problem.

Harmony search based clustering: The harmony search (HS) algorithm is a meta-heuristic algorithm mimicking the improvisation process of musicians (where music players improvise the pitches of their instruments to obtain better harmony) [27,59]. The HS algorithm has been applied to various engineering optimization problems, and in the context of clustering, novel partitional algorithms have been developed very recently. A new HS based document clustering with continuous representation is considered in [61]. In this algorithm each cluster centroid is considered as a decision variable; so each row of the harmony memory, which contains K decision variables, represents one possible solution for clustering. The main drawback of the algorithm developed in [61] is its continuous representation. The continuous representation of the clusters' centroids decreases the efficiency of the pitch adjusting process. Another HS based document clustering with discrete representation, called HKA, is proposed in [60]. Furthermore, using a probabilistic analysis it has been shown that the proposed algorithm converges to a near-optimal solution in a fairly reasonable amount of time. This algorithm codifies the whole partition of the document set in a vector of length n, where n is the number of documents. Considering the behavior of HKA, it was found that the proposed algorithm is good at finding promising areas of the search space, but not as good as k-means at fine-tuning within those areas, so it may take more time to converge. On the other hand, the k-means algorithm is good at fine-tuning, but lacks a global perspective. So a hybrid algorithm that combines the two ideas is proposed in [26]. In the hybrid algorithm, at each improvisation step a one-step k-means is leveraged to fine-tune the new solution.

3. Preliminaries

Notation: Throughout this paper, we use the following notation. We use bold-face letters to denote vectors. For any two vectors d, d' ∈ R^d, we denote by ⟨d, d'⟩ the inner product between d and d', i.e., ⟨d, d'⟩ = Σ_{i=1}^d d_i d'_i. We use bold upper case letters for matrices. Throughout this paper, we only consider the ℓ2-norm, which is defined as ‖x‖_2 = (Σ_{i=1}^d x_i^2)^{1/2} for any vector x ∈ R^d. A summary of the notations used in this paper is provided in Table 1.

3.1. The clustering problem

Clustering algorithms are commonly used to summarize large quantities of data, in a wide variety of domains. Clustering in a d-dimensional Euclidean space R^d is the process of partitioning a given set of n points into a fixed number of K clusters based on some similarity metric in a clustering procedure. The ith object is characterized by a real-valued d-dimensional profile vector where each element corresponds to the jth real-valued feature (j = 1, 2, ..., d). The vector space model gives us a good opportunity for defining different metrics for similarity between two data points.

Let D = {d_1, d_2, ..., d_n} be a set of n objects and use D ∈ R^{n×d} to denote the corresponding data matrix. Given D, the goal of a partitional clustering algorithm is to determine a partition D = D_1 ∪ D_2 ∪ ... ∪ D_K where each partition D_i is associated with a center c_i ∈ R^d, and denote by C = {c_1, c_2, ..., c_K} the set of centers. Each data point d_i is assigned to the closest center, i.e., arg min_{k ∈ {1,2,...,K}} D(d_i, c_k) for a similarity measure D(·,·) which will be discussed shortly. The goal is to find centers such that objects which belong to the same cluster are as similar to each other as possible, while objects which belong to different clusters are as dissimilar as possible. Over the years, two prominent ways have been used to compute the similarity between two data points d and d'. The first method is based on Minkowski distances, which for any two vectors d and d' is defined by

D_M(d, d') = (Σ_{i=1}^d (d_i - d'_i)^p)^{1/p},   (1)

which is equivalent to the Euclidean distance for p = 2.

The other commonly used similarity measure in data clustering is the cosine correlation measure [74], given by

D_C(d, d') = ⟨d, d'⟩ / (‖d‖ ‖d'‖),   (2)

where ‖·‖ denotes the norm of a vector. This measure becomes one if both vectors are identical, and zero if there is nothing in common between them (i.e., the vectors are orthogonal to each other).

To find K centers, the problem is defined as a minimization of an objective function based on both the data points and the center locations. A popular function used to quantify the goodness of a partitioning is determined by the sum of the intra-cluster distances,

which is defined as [31,80]

f(D, C) = Σ_{i=1}^n min_{j=1,2,...,K} D*(d_i, c_j),   (3)

where D*(d_i, c_j) denotes either the Minkowski or the cosine distance between object d_i and center c_j. The clustering problem aims to find the partitioning that has optimal adequacy with respect to the huge number of possible candidate partitionings, which is expressed in the form of the Stirling number of the second kind¹ and indicates the complicated nature of the problem:

(1/K!) Σ_{i=1}^K (-1)^{K-i} C(K, i) i^N,

where C(K, i) denotes the binomial coefficient.

It has been shown that the clustering problem is NP-hard even when the number of clusters is two [16]. This illustrates that clustering by examining all possible partitions of n d-dimensional data vectors into K clusters is not computationally feasible. The problem is even more demanding when, additionally, the number of clusters is unknown; then the number of different combinations is the sum of the Stirling numbers of the second kind. As a result, exhaustive search methods are far too time consuming even with modern computing systems. Obviously, we need to resort to some optimization techniques to reduce the search space, but there is no guarantee that the optimal solution will be found. Although various optimization methodologies have been developed for optimal clustering, the complexity of the task reveals the need for developing efficient algorithms to precisely locate the optimum solution. In this context, this study presents a novel stochastic approach for data clustering, aiming at a better time complexity and partitioning accuracy.

3.2. The bee colony optimization meta-heuristic

The bee colony optimization (BCO) meta-heuristic has been proposed by Lucic and Teodorovic [54,55]. It is a simple and robust stochastic optimization algorithm, and the basic idea is to create a colony of artificial bees capable of successfully solving difficult combinatorial optimization problems. The algorithm simulates the intelligent behavior of bee swarms. An artificial bee colony behaves in some ways like, and in other ways differently from, bee colonies found in the natural world. The performance of the BCO algorithm is compared with those of other well-known modern meta-heuristic algorithms such as the genetic algorithm (GA), differ-

without recruiting nest mates; or (c) dance and thus recruit nest mates before returning to the food source. The bee opts for one of the above alternatives with a certain probability. Within the dance area, the bee dancers advertise different food sources.

The BCO algorithm consists of two alternating phases: a forward pass and a backward pass. During each forward pass, every bee explores the search space and creates various partial solutions. It applies a predefined number of moves (visiting a certain number of nodes), which construct and/or improve the solution, yielding a new solution. During the second forward pass, bees visit a few more nodes and expand the previously created partial solutions. Having obtained new partial solutions, the bees return to the nest and start the second phase, the so-called backward pass. During the backward pass, all bees share information about their solutions. In the nest, all bees participate in a decision-making process and exchange information about the quality of the partial solutions created. Bees compare all generated partial solutions.

During the backward pass, based on the quality of the partial solutions generated, every bee decides with a certain probability whether it will advertise its solution or not. The bees with better solutions have more chances to advertise their solutions. The remaining bees have to decide whether to continue to explore their own solution in the next forward pass, or to start exploring the neighborhood of one of the solutions being advertised. Similarly, this decision is taken with a probability, so that better solutions have a higher probability of being chosen for exploration. Depending on the quality of the partial solutions generated, every bee possesses a certain level of loyalty to the path leading to the previously discovered partial solution. The search process is composed of iterations. The first iteration is finished when bees create for the first time one or more feasible solutions by visiting all nodes.

The two phases of the search algorithm, namely the forward and backward passes, are performed iteratively until a stopping condition is satisfied. The possible stopping conditions could be, for example, the maximum number of iterations or the number of iterations without improvement of the objective function. The best solution discovered during the first iteration is saved, and then the second iteration begins. Within the second iteration, bees again incrementally construct solutions of the problem, etc. There are one or more solutions at the end of each iteration. The analyst (decision maker) prescribes the total number of iterations. The detailed steps of optimizing a function f(x) for decision variable x are summarized as follows. We set the number of stages S to be the same as the number of decision variables. We also let B denote the
ential evolution (DE), and particle swarm optimization (PSO) on number of bees in the hive, T denote the total number of iterations,
constrained and unconstrained problems [81,8]. and M represents the number of constructive moves during one
The BCO is a model of collecting and processing nectar, the forward.
practice which is highly organized. Each bee decides to reach the
nectar source by following a nest mate who has already discovered 1. Initialization: Every bee is set to be an empty solution. A
a patch of flowers. Each hive has a so-called dance floor area on feasible solution of the problem is initialized as the best
which the bees that have discovered nectar sources, dance as a way solution xn .
to convince their nest mates to follow them. If a bee decides to leave 2. Optimization: The following steps are iterated for t ¼ 1; 2; …; T
the hive to get nectar, she follows one of the bee dancers to one of iterations:
the nectar areas or stick to her nectar source. Upon arrival, the (a) Set the current stage to be one, i.e., s’1.
foraging bee takes a load of nectar and returns to the hive relin- (b) For every bee do the forward pass as
quishing the nectar to a food-store bee. After she relinquishes the (i) m’1 //counter for constructive moves in the
food, the bee can (a) abandon the food source and become again an forward pass.
uncommitted follower; (b) continue to forage at the food source (ii) Evaluate all possible constructive moves.
(iii) According to evaluation, choose one move using the
1
The Stirling numbers of the second kind [78], that is usually denoted by Sðn; kÞ roulette wheel or tournament.
 
or nk is the number of ways to partition a set of n labelled objects into k nonempty (iv) m’m þ1; If m o ¼ M then go to Step ii.
unlabelled subsets. Equivalently, it captures the number of different equivalence (c) All bees are back at the hive and backward pass starts.
relations with precisely k equivalence classes that can be defined on an n element Allow bees to exchange information about quality of the
set. For example, the set f1; 2; 3g can be partitioned into three subsets in one way:
ff1g; f2g; f3gg; into two subsets in three ways: ff1; 2g; f3gg, ff1; 3g; f2gg, and
partial solution created.
   
ff1g; f2; 3gg; and into one subset in one way: ff1; 2; 3gg. Obviously, n1 ¼ nn ¼ 1. (d) Evaluate (partial) objective function value for each bee.
Pk kj k n
The Sðn; kÞ can be explicitly calculated by Sðn; kÞ ¼ ð1=k!Þ j ¼ 0 ð  1Þ ð j Þj . (e) Sort the bees by their objective function value.
14 R. Forsati et al. / Neurocomputing 159 (2015) 9–26

   (f) Every bee decides randomly whether to continue its own exploration and become a recruiter, or to become a follower (bees with a higher objective function value have a greater chance to continue their own exploration).
   (g) For every follower, choose a new solution from the recruiters by the roulette wheel.
   (h) s ← s + M; if s < S, go to Step (b).
   (i) If the best solution x_t obtained during the tth iteration is better than the best-known solution, update the best-known solution (x* ← x_t).
   (j) t ← t + 1.
3. Output: Report the best solution x* as the final solution.

4. The basic bee colony based algorithm to data clustering

In this section we first propose a pure bee colony based clustering algorithm, namely BCOCLUST. We begin in Section 4.1 by giving the detailed steps of the first proposed algorithm (BCOCLUST). The nature of the BCO algorithm [82,8] is to force the non-loyal bees to follow only the loyal bees. In some problems, such as clustering, this behavior leads to getting stuck in a local optimum, since most of the bees start following a loyal bee that is foraging for a locally optimal solution. To overcome this problem, we modify this behavior and propose the Improved BCOCLUST (IBCOCLUST) algorithm in Section 4.2. To achieve even better clustering, the explorative power of IBCOCLUST is combined with the refining power of k-means in four ways. Contrary to the localized searching property of the k-means algorithm, the proposed algorithms perform a globalized search in the entire solution space. Additionally, the proposed algorithms improve k-means by making it less dependent on initial parameters, such as the randomly chosen initial cluster centers, therefore making it more stable. The details of these hybridization algorithms are described in Section 4.3.

4.1. BCOCLUST: bee colony based clustering

In order to cluster data using BCO, we must recast clustering as an optimization task that locates the optimal centroids of the clusters rather than finding an optimal partitioning, with clustering quality as the objective, and use a suitable general purpose optimization method to find a good clustering. The principal advantage of this approach is that the objective of the clustering is explicit, which enables us to better understand the performance of the clustering algorithm on particular types of data and to use task-specific clustering objectives. It is even possible to consider several objectives simultaneously, an approach explored recently in [94]. This model offers us a chance to apply the BCO algorithm to the optimal clustering of a collection of data.

Recall that the goal is to partition n vectors, each of dimension d, into K clusters. In each iteration of our algorithm, for each cluster, all the bees leave the hive to allocate some of the data to that cluster with specific probabilities, and come back to the hive to see the work of the other bees and decide whether to continue with their own decision or to select another bee's solution to go on with. We consider each cluster centroid as a decision variable, so each solution extracted by a bee after an iteration, which contains K decision variables, represents one possible clustering. In other words, each solution contains a set of candidate centroids, one representing each cluster; each extracted solution thus contains K vectors C = {c_1, c_2, …, c_K}.

Viewing the clustering problem as an optimization problem over such an objective function formalizes the problem to some extent. However, we are not aware of any function that optimally captures the notion of a good cluster, since for any function one can exhibit cases for which it fails. Furthermore, not surprisingly, no polynomial time algorithm is known for optimizing such cost functions. The brute force solution would be to enumerate all possible clusterings and pick the best; as the number of possible partitionings grows exponentially, this approach is not feasible. The key design challenge in objective function-based clustering is the formulation of an objective function capable of reflecting the nature of the problem, so that its optimization reveals meaningful structure (clusters) in the data. The following subsections describe our modeling and bee colony operators according to this model for the clustering purpose.

4.1.1. Representation of solutions

The first issue in solving data clustering by the BCO is how to represent solutions so as to exploit the bee colony algorithm. To this end, we decompose data clustering into K stages, where each stage targets one cluster center: the first stage represents the first cluster center, the second stage represents the second cluster center, and so on. At each stage s ∈ {1, 2, …, K}, every bee should select a subset of objects which will be allocated to cluster s, and these objects will be used to compute c_s. We note that each individual bee consists of an encoding of a candidate solution (food source) and a fitness that indicates its quality. In order to apply it to the clustering problem, we have used floating point arrays to encode cluster centers. Let A ∈ {0,1}^{n×K} be the assignment matrix, with n rows and K columns, where a_ij, 1 ≤ i ≤ n, 1 ≤ j ≤ K, indicates whether or not the ith data point is assigned to cluster j, i.e.,

a_ij = 1 if the ith data point is assigned to the jth cluster, and a_ij = 0 otherwise.   (4)

The assignment matrix A = [a_ij] has the property that each data point must be assigned to exactly one cluster (i.e., Σ_{j=1}^{K} a_ij = 1 for i = 1, 2, …, n). An assignment that represents K nonempty clusters is a legal assignment. In this model, each food source discovered by a bee is a candidate solution and corresponds to a set of K centroids. So, the search space is the space of all matrices A ∈ {0,1}^{n×K} that satisfy the constraints that each data point is allocated to exactly one cluster and that there is no empty cluster. Each stage involves optimizing one variable. We set the number of stages to K and the number of bees participating in the search process to B. The algorithm proceeds in T iterations. Each bee at each stage s = 1, 2, …, K decides about the set of objects to be assigned to cluster s. An example of the representation of solutions is reported in Fig. 1. The number of solution components to be visited within one forward pass is set to one (m = 1); therefore, at each forward pass, bees visit a single stage.

All bees are located in the hive at the beginning of the search process. Each artificial bee allocates a subset of the data to the corresponding cluster with specified probabilities at each stage, and in this way constructs a solution of the problem incrementally. Bees keep adding solution components to the current partial solution until they have visited all of the K stages. The search process is composed of iterations. The first iteration is finished when the bees create feasible solutions. The best solution discovered during the first iteration is saved, and then the second iteration begins. In each iteration of the proposed algorithm, for each cluster (stage), all the bees leave the hive to allocate some of the data to that cluster with specific probabilities, and come back to the hive to see the work of the other bees so far and decide whether to continue their own way or to select one of the other bees' solutions and continue on that way.
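The representation above can be sketched in code. The following minimal Python illustration is ours, not the paper's: it checks the legality constraints on the assignment matrix A of Eq. (4) (every data point in exactly one cluster, no empty cluster) and decodes the K centroids implied by an assignment; the function names and toy data are assumptions for illustration only.

```python
def is_legal_assignment(A):
    """Check the constraints of Eq. (4) on the n-by-K 0/1 assignment
    matrix A: every data point belongs to exactly one cluster (each row
    sums to 1) and no cluster is empty (each column sum is at least 1)."""
    n, K = len(A), len(A[0])
    each_point_once = all(sum(row) == 1 for row in A)
    no_empty_cluster = all(sum(A[i][j] for i in range(n)) >= 1 for j in range(K))
    return each_point_once and no_empty_cluster


def centroids_from_assignment(A, D):
    """Decode a solution: the K centroids implied by assignment matrix A
    over the data D, i.e. the mean of the points assigned to each cluster."""
    n, K, d = len(A), len(A[0]), len(D[0])
    centroids = []
    for j in range(K):
        members = [D[i] for i in range(n) if A[i][j] == 1]
        centroids.append([sum(p[dim] for p in members) / len(members)
                          for dim in range(d)])
    return centroids


# Toy example: n = 4 points in d = 2 dimensions, K = 2 clusters.
D = [[0, 0], [2, 0], [10, 10], [12, 10]]
A = [[1, 0], [1, 0], [0, 1], [0, 1]]
print(is_legal_assignment(A))           # True
print(centroids_from_assignment(A, D))  # [[1.0, 0.0], [11.0, 10.0]]
```

In a full implementation, each bee would build such a matrix stage by stage; the decoded centroids are what the SICD objective of Eq. (8) is evaluated against.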

4.1.2. Evaluation of solutions

A key characteristic of most clustering algorithms is that they use a global criterion function whose optimization drives the entire clustering process. For such algorithms, the clustering problem can be stated as computing a clustering solution such that the value of a particular objective function is optimized. Our objective is to minimize the intra-cluster distances while maximizing the inter-cluster distances. The fitness value of each solution, which corresponds to one potential clustering, is determined by the sum of the intra-cluster distances (SICD), as detailed in Eq. (8). Clearly, the smaller the sum of the distances, the higher the quality of the clustering. A food source represents a possible solution to the problem, and the quantity of nectar in the areas explored by the bees corresponds to the quality of the solution represented by that food source. Each iteration of the BCOCLUST algorithm proceeds as follows.

Step 1. Initialization: The first step is the initialization. If this is not the first iteration of the algorithm and the best cluster centers discovered during the previous iterations are available, the initial cluster centers for all the stages are set to the best answer of the previous iteration. In other words, after generation of the set of initial solutions obtained in the previous iteration (described in the next steps), we rank the initial solutions based on the fitness function and set the K best of them as the initial cluster centers. Otherwise, if this is the first iteration, a set of initial cluster centers generated randomly from the data points in D will be set for each cluster. Each solution represents K cluster centers; the cluster center of the ith stage is randomly selected from the uniform distribution over the data set.

There is a loop from 1 to K in which the following two steps are performed:

Step 2. Constructive moves in the forward pass (allocating data to clusters): In each forward pass, every artificial bee visits one stage, allocates data to the corresponding cluster, and then returns to the hive as detailed in Step 3. Within each pass of the loop from 1 to K, the data are allocated to the corresponding cluster. For each cluster, the probability p_ij of a bee choosing data point i as a member of the jth cluster (C_j) is expressed using the Logit model as follows:

p_ij = exp(−D_n(d_i, c_j)) / Σ_{m=1}^{n} exp(−D_n(d_m, c_j)),  j = 1, 2, …, K   (5)

where D_n(d_i, c_j) denotes the distance of the ith data point to the jth cluster center. For each data point that has not been assigned yet, a random number is generated; if the number is less than the data point's allocation probability, the data point is allocated to cluster j with probability 1/K, and otherwise it is left free. Within each forward pass, a bee visits a certain number of nodes and creates a partial solution. After the solutions are evaluated (and normalized), the loyalty decision and the recruiting process are performed as described in the following subsection.

Step 3. Backward pass (comparison of the bees' partial solutions): After all of the bees have completed Step 2, they return to the hive to compare their partial solutions. We assume that every bee can obtain information about the quality of the solutions generated by all other bees; in this way, the bees compare all generated partial solutions. Based on the quality of the partial solutions generated, every bee decides whether to abandon its created partial solution, or to dance and thus recruit nest mates before returning to it. Depending on the quality of the partial solutions generated, every bee possesses a certain level of loyalty to its previously discovered partial solution. Our criterion for deciding the goodness of a discovered solution is the sum of the distances of all vectors from their cluster centers, which we want to be as small as possible. So, as the bees are back at the hive, the probability that the bth bee (during stage s ∈ {1, 2, …, K} and iteration t ∈ {1, 2, …, T}) will be faithful to its previously generated partial solution (the loyalty decision) is expressed as follows:

p_b(s+1, t) = e^{−O_b(s,t)/(s·t)},  b = 1, 2, …, B,   (6)

where O_b(s,t) is the normalized value of SICD_b, defined as

O_b(s,t) = (SICD_b(s,t) − SICD_min(s,t)) / (SICD_max(s,t) − SICD_min(s,t))   (7)

where SICD_max and SICD_min denote the objective function values of the worst and the best partial solutions discovered among all the bees, respectively, and SICD_b is the sum of the distances of each object from its cluster center over all the objects assigned by bee b:

SICD_b(s,t) = Σ_{i=1}^{s} Σ_{j=1}^{n} D^b_{ij}.   (8)

D^b_{ij} = D_n(d_j, c^b_i) if the jth data point is assigned to the ith cluster by bee b, and D^b_{ij} = 0 otherwise,   (9)

where c^b_i is the center of cluster i decided by bee b.

SICD_min(s,t) = min_{i ∈ {1,2,…,B}} SICD_i(s,t)   (10)

SICD_max(s,t) = max_{i ∈ {1,2,…,B}} SICD_i(s,t)   (11)

Let us discuss Eq. (6) in more detail. A better partial solution (i.e., a lower O_b value) increases the probability that bee b will be loyal to its previously discovered partial solution. Additionally, the greater the ordinal number of the forward pass, the higher the influence of the already discovered partial solution; this is expressed by the term s in the denominator of the exponent. In other words, at the beginning of the search process the bees are braver in searching the solution space; the more forward passes they make, the less courage they have to explore it, and the closer we come to the end of the search process, the more focused the bees are on the already known partial solution.

4.1.3. Recruiting process

At the beginning of a new stage, if a bee does not want to expand its previously generated partial solution, the bee goes to the dancing area and follows another bee. Within the dance area, the bee dancers (recruiters) advertise different partial solutions. We have assumed in this paper that the probability that the partial solution of bee b is chosen by any uncommitted bee is equal to

p_b = e^{−γ O_b(s,t)},  b = 1, 2, …, R   (12)

Fig. 1. Representation of BCO in clustering.
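The loyalty decision of Eqs. (6) and (7) can be sketched as follows. This is a minimal Python illustration of the backward pass, not the authors' code; the function names and the handling of the degenerate case where all bees have equal SICD are our own assumptions.

```python
import math
import random

def loyalty_probabilities(sicd, s, t):
    """Loyalty decision of Eqs. (6)-(7): each bee's SICD_b(s, t) is
    normalized to O_b in [0, 1] against the best (min) and worst (max)
    partial solutions, then p_b = exp(-O_b / (s * t))."""
    lo, hi = min(sicd), max(sicd)
    spread = (hi - lo) or 1.0          # all bees equal -> O_b = 0 for all
    return [math.exp(-((v - lo) / spread) / (s * t)) for v in sicd]

def backward_pass(sicd, s, t, rng=random):
    """One backward pass: each bee keeps its own partial solution with
    its loyalty probability, otherwise it becomes an uncommitted follower."""
    probs = loyalty_probabilities(sicd, s, t)
    return [rng.random() < p for p in probs]   # True = stays loyal

# The best bee (minimum SICD) has O_b = 0, hence loyalty probability 1,
# and a larger s*t pushes all probabilities toward 1 (less exploration).
probs = loyalty_probabilities([10.0, 14.0, 18.0], s=1, t=1)
print(probs[0])                        # 1.0 -- the best bee is always loyal
print(probs[2] < probs[1] < probs[0])  # True -- worse solutions, less loyalty
```

The uncommitted followers produced by such a pass would then pick a recruiter with probability proportional to Eq. (12), e.g., by roulette wheel or tournament selection.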



Table 2
The statistics of the general purpose benchmark data sets used in our first set of experiments.

Dataset                    # Attributes (d)   # Classes (K)   # Samples (n)
Iris                       4                  3               150
Wine                       13                 3               178
Glass                      9                  6               214
Wisconsin breast cancer    9                  2               683
Vowel                      3                  6               871

Table 3
Model selection for the number of bees. For each data set we fix the values of the other parameters, including the maximum number of iterations and γ, and evaluate the performance of the IBCOCLUST algorithm for different numbers of bees. Recall that T denotes the maximum number of iterations.

Constant configuration   Data set   Number of bees   SICD
γ = 1, T = 1000          Iris       2                123.76
                                    5                122.18
                                    7                122.16
                                    8                122.15
                                    10               122.15
                                    12               97.23
                                    15               97.23
                                    16               97.35
                                    17               97.34
                                    20               97.23
                                    24               97.23
                                    25               97.34
                                    30               97.23
γ = 0.01, T = 1000       Glass      2                250.44
                                    3                246.66
                                    6                248.92
                                    8                228.64
                                    9                219.85
                                    12               220.56
                                    15               218.72
                                    16               220.13
                                    17               219.74
                                    18               214.74
                                    20               214.85
                                    25               220.96
                                    30               217.68

where γ ∈ (0, 1) and R denotes the number of recruiters. The probability p_b is used in a roulette wheel or tournament selection algorithm, and one of the bees is selected.

Using Eq. (12) and a random number generator, every uncommitted follower joins one bee dancer (recruiter). Recruiters fly together with their recruited nest mates in the next forward pass along the path discovered by the recruiter. So a bee that wants to continue another partial solution sets its partial solution exactly as the selected bee's, but continues the algorithm independently. At the end of this path, all bees are free to independently search the solution space and generate the next iteration of constructive moves. It can be inferred from Eq. (7) that if a bee has discovered the lowest distance in stage s of iteration t, the bee will fly along the same partial path with probability equal to one. The higher the sum of distances of a discovered path, the smaller the probability of flying again along the same path.

Step 4. Allocation of non-allocated data: After the loop over the K clusters has finished, for each bee there might be some data that have not yet been allocated to any cluster. These vectors are allocated greedily: each vector is assigned to the cluster whose center is nearest to that vector.

Step 5. Setting the cluster centers (computing the centroids of clusters): Finally, the cluster centers are computed for each bee as the centroids of the vectors belonging to each cluster, as follows: each solution extracted by a bee corresponds to a clustering with assignment matrix A. Let {c_1, c_2, …, c_K} be the set of K centroids for assignment matrix A. The kth centroid is computed as

c_k = Σ_{i=1}^{n} a_ik d_i / [Σ_{j=1}^{n} a_jk].   (13)

Step 6. Selecting the best answer: In this phase, the best among all generated solutions is determined and used to update the global best. The global best will be used for setting the cluster centers for all the stages in the next iteration. At this point, all B solutions are deleted, and the new iteration starts. The BCO proceeds iteratively until a stopping condition is met.

4.2. IBCOCLUST: improved BCOCLUST algorithm

A major shortcoming of the BCOCLUST algorithm, which can lead to unreasonable results, is the low diversity of solutions in the course of the search process. This phenomenon is a consequence of the nature of the algorithm: after a few iterations, all the bees start to follow the one whose answer is the best among the others (i.e., exploitation), and so the answer converges to a locally optimal one. To overcome this problem, we present the Improved BCOCLUST (IBCOCLUST) algorithm, which is based on two major modifications we have made to the original BCO algorithm: fairness and cloning. The main insight from which the proposed improved algorithm and the cloning and fairness ideas stem is as follows. For meta-heuristic algorithms such as BCO, the main tricky point is to balance the trade-off between exploration and exploitation. On one hand, a high value for the exploitation rate forces the algorithm to mostly stick to the existing solutions found by the bees (i.e., exploitation), consequently leading to less exploration of the solution space. On the other hand, by choosing a small value for exploitation, the algorithm behaves randomly in the solution space (i.e., exploration), hence losing all the information collected during the past rounds, which deteriorates the effectiveness of the algorithm. The cloning idea is to give a reasonable amount of exploitation to the BCO algorithm, which is lacking in the original algorithm. The detailed description of these two features is as follows.

Fairness: With this modification, we aim at giving every bee a chance to be followed. In particular, in the improved algorithm, after a bee decides to follow another bee, it is not forced to follow only loyal bees; it may consider both loyal and non-loyal ones. In other words, there is no restriction to following just loyal bees, and every bee can follow any other bee. Obviously, the chance that a loyal bee is selected is much higher than the chance that a non-loyal bee is selected, due to the fitness included in the selection probability. By giving non-loyal bees a chance to be followed, even with a small probability, the algorithm is able to keep the diversity of solutions at a reasonable level and therefore has much better explorative power.

Cloning: The other improvement we have made resolves the forgiveness characteristic of the BCO algorithm. In the standard BCO, the iterations of the algorithm are independent and no propagation of knowledge happens between different iterations. But since the best solution of each iteration potentially carries all the information the algorithm learns in the course of that specific iteration, it would be better to incorporate this knowledge in the next iteration. To this end, we propose the following novel idea to propagate information during the optimization. Let C*_{t−1} = {c*_{t−1,1}, c*_{t−1,2}, …, c*_{t−1,K}} denote the best clustering solution obtained up to iteration t−1. In the tth iteration, we add a specific bee to the set of bees, called the cloning bee, that behaves as follows. The cloning bee is similar to the other bees with

Table 4
Model selection for parameter γ. For each data set we fix the values of the other parameters, including the number of bees and the maximum number of iterations, and evaluate the performance of the algorithm for different values of γ. Recall that T and B denote the maximum number of iterations and the number of bees, respectively.

Constant configuration   Data set   Variable parameter   Value    SICD
B = 12, T = 1000         Iris       γ                    0.0001   97.33
                                                         0.0005   97.34
                                                         0.001    97.23
                                                         0.005    97.33
                                                         0.01     97.34
                                                         0.05     97.33
                                                         0.1      97.35
                                                         0.5      97.45
                                                         1        97.23
B = 12, T = 1000         Wine       γ                    0.0001   16,453.28
                                                         0.0005   16,460.80
                                                         0.001    16,455.28
                                                         0.005    16,460.00
                                                         0.01     16,476.80
                                                         0.05     16,460.80
                                                         0.1      16,497.59
                                                         0.5      16,465.50
                                                         1        16,567.45
B = 12, T = 1000         Vowel      γ                    0.0001   149,786.85
                                                         0.0005   149,786.85
                                                         0.001    150,769.34
                                                         0.005    149,786.85
                                                         0.01     14,998.34
                                                         0.05     149,786.85
                                                         0.1      149,466.23
                                                         0.5      150,093.05
                                                         1        150,693.05
B = 18, T = 1000         Glass      γ                    0.0001   218.98
                                                         0.0005   218.90
                                                         0.001    248.55
                                                         0.005    223.90
                                                         0.01     218.26
                                                         0.05     223.84
                                                         0.1      223.94
                                                         0.5      223.90
                                                         1        223.14

Fig. 2. Convergence behavior of IBCOCLUST on the Iris data set with γ = 1 and different numbers of bees.

Fig. 3. Convergence behavior of the proposed algorithms on the Glass data set with number of bees B = 9 and γ = 0.01.

Fig. 4. Convergence behavior of IBCOCLUST on the Glass data set with γ = 0.01.

the difference that, in the forward pass, it is forced to follow the decision of C*_{t−1}. More specifically, at stage s of iteration t, the cloning bee's decision is the subset of data points in D that are assigned to the sth cluster by C*_{t−1}, i.e., the cluster represented by c*_{t−1,s}.

Remark 1. The main insight from which the cloning and fairness ideas stem is as follows. For meta-heuristic algorithms such as BCO, the main tricky point is to balance the trade-off between exploration and exploitation. On one hand, a high value for the exploitation rate forces the algorithm to mostly stick to the existing solutions found by the bees (i.e., exploitation), consequently leading to less exploration of the solution space. On the other hand, by choosing a small value for exploitation, the algorithm behaves randomly in the solution space (i.e., exploration), hence losing all the information collected during the past rounds, which deteriorates the effectiveness of the algorithm. The cloning idea is to give a reasonable amount of exploitation to the BCO algorithm, which is lacking in the original algorithm. In other words, a chance is given to learn from the experiences of other bees, based on the quality of the solutions they have found, to reduce the amount of search done by all bees.

The other parts of the algorithm, such as the initialization and selecting the bees to be loyal or non-loyal, are exactly the same as in BCOCLUST. The pseudo-code of IBCOCLUST is demonstrated in Algorithm 1.

Algorithm 1. IBCOCLUST.

Input: The number of iterations T, the number of clusters K, the number of bees B, the data set D = {d_1, d_2, …, d_n}
1:  Initialize C* ← a random valid clustering
2:  for t = 1, …, T do
3:    for b = 1, …, B do
4:      if t > 1 then
5:        Set the initial cluster centroids of bee b to C*
6:      else
7:        Select a random data point as the centroid of each cluster of each bee b
8:      end if
9:    end for
10:   for s = 1, …, K do
11:     for b = 1, …, B do
12:       Select point d by tournament selection

13:       r ← U(0, 1), where U is the uniform random generator
14:       if r < 1/K then
15:         Allocate the selected point to the current cluster
16:       end if
17:     end for
18:     for b = 1, …, B do
19:       Calculate the probability of the bee sticking to its own solution
20:       r ← U(0, 1)
21:       if r < the probability of sticking to its own solution then
22:         The specified bee sticks to its own solution
23:       else
24:         Select another bee by tournament selection and choose its solution
25:       end if
26:     end for
27:   end for
28:   Allocate non-assigned data points with a greedy algorithm
29:   Select the best solution at iteration t and set it to be C*
30: end for
Return C*

4.3. Bee colony k-means clustering

The IBCOCLUST algorithm performs a global search for solutions, whereas the k-means clustering procedure performs a local search. In a local search, the solution obtained is usually located in the proximity of the solution obtained in the previous step. For example, the k-means clustering algorithm uses randomly generated seeds as the initial cluster centroids and refines the positions of the centroids at each iteration. The refining process of the k-means algorithm indicates that the algorithm only explores the very narrow proximity surrounding the initial randomly generated centroids, and its final solution depends on these initially selected centroids. So, the IBCOCLUST and k-means algorithms have complementary strong and weak points: IBCOCLUST is good at finding promising areas of the search space, but not as good as k-means at fine-tuning within those areas; the k-means algorithm, on the other hand, is good at fine-tuning but lacks a global perspective. It seems that a hybrid algorithm combining IBCOCLUST with k-means can outperform either one individually.

The following lemma shows that the k-means algorithm always improves the objective function.

Lemma 1. The k-means algorithm monotonically decreases the objective until local convergence, i.e., f(D, C^{t+1}) ≤ f(D, C^t).

Proof. Let C^t = (c^t_1, …, c^t_K) be the cluster centers at the tth iteration of the k-means algorithm, which partition the data points into the sets D^t_1, …, D^t_K, where D^t_i, i ∈ [K], is the set of points assigned to the ith cluster based on C^t. The objective function for this clustering is

f(D, C^t) = Σ_{k=1}^{K} Σ_{d ∈ D^t_k} ‖d − c^t_k‖²

which is the sum of the distances of the data points to the centers of their assigned clusters. Let C^{t+1} = (c^{t+1}_1, …, c^{t+1}_K) denote the solution

is the best representation of a cluster in terms of the squared Euclidean distance, i.e.,

Σ_{k=1}^{K} Σ_{d ∈ D^{t+1}_k} ‖d − c^{t+1}_k‖² ≤ Σ_{k=1}^{K} Σ_{d ∈ D^t_k} ‖d − c^t_k‖².

The fact that the algorithm converges locally follows from the fact that the objective function cannot increase and there are only a finite number of possible clusterings of the data. □

We note, however, that the number of iterations required to reach convergence can be exponentially large and, furthermore, there is no non-trivial lower bound on the gap between the value of the k-means objective of the algorithm's output and the minimum possible value of that objective function.

Although Lemma 1 indicates that the k-means algorithm monotonically decreases the objective, it might find only locally optimal solutions with respect to the clustering error. Due to the non-convex nature of the criterion function (the sum of squared Euclidean distances) in terms of both the centers and the assignment of data points, iterative relocation methods are often trapped in one of the local minima. As mentioned before, the problem in general is NP-hard (see, e.g., a few recent results proving this claim [3,17,58]). Hence, the quality of a final clustering solution depends on, and is very sensitive to, the initial configuration, and the obtained partition is often only suboptimal (not the globally best partition). This deficiency becomes more serious for applications such as document clustering, which intensifies the hardness of the problem due to the low quality of text data and the high-dimensional and sparse nature of documents.

To illustrate this fact, we provide a simple setting which shows the hardness of the problem and the sub-optimality of the k-means method. Consider a clustering problem over the real line with five clusters with centers C = {c_1, c_2, …, c_5}. For simplicity of exposition we assume that c_1 < ⋯ < c_5 and that every two consecutive centers are located at a distance of Δ. We assume that there is a ball of radius δ around each center and that n data points are distributed uniformly at random in these balls. Hence, for the optimal clustering of these n points, the sum of squared Euclidean distances of points to cluster centers is O(nδ²), because the distance of each point to its cluster center is at most δ.

We utilize the k-means algorithm to cluster these points. To this end, we initialize the k-means algorithm by choosing five data points at random as the initial centers of the clusters. There is a chance that no data point from the first cluster, two data points from the third cluster, and one data point from each of the remaining clusters have been chosen as centers. In the first round of k-means, all points in clusters 1 and 2 will be assigned to the center chosen from cluster 2, the two centers in cluster 3 will end up sharing that cluster, and the centers in clusters 4 and 5 will move roughly to the centers of those clusters. Thereafter, no further changes will occur. This local optimum has cost Ω(nΔ²). We note that this cost can be made arbitrarily far from the optimum cost by setting the distance between consecutive centers, i.e., Δ, large enough. As this example illustrates, despite the convergence of the k-means algorithm to a local minimum, the initialization of the k-means algorithm significantly affects the final result. In this simple one-dimensional problem, the only scenario in which the k-means algorithm would generate a reasonable result, with a cost close to the optimum cost, is the case where in the initialization step we sample
at ðt þ1Þ th iteration by applying two steps of the k-means on data point from each cluster. It is not hard to see that for large
algorithm. Similarly, let Dt1þ 1 ; …; DtKþ 1 denote the set of the data number of data points compared to the number of clusters, i.e., n=K,
points assigned to each cluster at the end of ðt þ 1Þ th iteration. the probability of this case is inversely proportional to the Stirling
We consider the effect of each step of k-means algorithm number in Eq. (4) and is very small.
separately. The reassignment step results in a non-increasing Motivated by the above example on the poor performance of the
objective since the distance between a data point and its newly k-means algorithm, we propose four different versions of the hybrid
assigned cluster mean never increases the objective. Similarly, the clustering, depending on the stage where we carry out the k-means
mean update step results in an increasing objective since the mean algorithm. The hybrid clustering approaches use k-means algorithm

Table 5
The configuration of parameters for different baseline algorithms which are compared to the proposed algorithms.

GA ACO PSO CABC HSCLUST

Parameter Value Parameter Value Parameter Value Parameter Value Parameter Value

Population 50 Ants (R) 50 Population 100 Colony size 100 HMS 2K
Crossover rate 0.8 Probability for max trial 0.98 Min and max inertia 0.7, Upper bounce 5 HMCR 0.9
0.9
Mutation rate 0.001 Local search probability 0.01 Acceleration factor (c1) 2 Limit 100 PARmin 0.09
Max number of 1000 Evaporation rate 0.01 Acceleration factor (c2) 2 Maximum cycle 1000 PARmax 0.99
iterations number
Max number of 1000 Max number of 1000 Max Number of 1000
iterations iterations Iterations
V min  0.05
V max 0.05

Table 6
SICD comparison. Hybrid I and Hybrid II indicate the one-step hybridization and the interleaved hybridization of the IBCOCLUST and k-means algorithms, respectively.

Data    Measure   GA         ACO        k-means     PSO        CABC       IBCOCLUST   IBKCLUST    KIBCLUST    Hybrid I    Hybrid II
Iris    Average   139.98     97.17      106.05      103.51     —          97.27       97.33       96.4        96.38       95.14
        Best      125.19     97.1       97.33       96.66      —          97.22       97.33       96.4        96.33       95.10
Wine    Average   16,530.53  16,530.53  18,161      16,311     16,449.8   16,460.55   16,460.9    16,460.6    16,458.1    16,295.9
        Best      16,530.50  16,530.50  16,555.68   16,294.00  16,433.37  16,460      16,460.55   16,453.28   16,433.37   16,295
Glass   Average   —          —          260.4       291.33     223.68     225.19      223.4       226.34      226.59      221.5
        Best      —          —          215.68      271.29     212.32     214.85      214.78      217.97      214.78      214.71
Cancer  Average   —          —          2988.3      3334.6     2964.4     2976.89     2976.33     2980.15     2977.59     2976.24
        Best      —          —          2987        2976.3     2964.4     2976.24     2976.06     2980.15     2976.24     2976.11
Vowel   Average   —          —          159,242.87  168,477    —          150,881.16  152,575.13  150,751.38  151,688.49  150,892.17
        Best      —          —          149,422.3   163,882    —          149,466.61  149,466.61  150,469.89  149,490.88  149,473.9
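Lemma 1 and the one-dimensional construction above can be checked with a short simulation. This is an illustrative sketch, not the paper's code; the spacing Δ, ball radius δ, and the adversarial seeding are assumptions chosen to mirror the example.

```python
import random

def kmeans_1d(points, centers, iters=30):
    """Plain k-means on 1-D data; returns final centers and per-iteration SICD."""
    history = []
    for _ in range(iters):
        # Assignment step: each point joins the cluster of its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            j = min(range(len(centers)), key=lambda k: (p - centers[k]) ** 2)
            clusters[j].append(p)
        history.append(sum((p - c) ** 2
                           for cl, c in zip(clusters, centers) for p in cl))
        # Update step: move each center to the mean of its assigned points.
        centers = [sum(cl) / len(cl) if cl else c for cl, c in zip(clusters, centers)]
    return centers, history

random.seed(0)
delta, eps = 100.0, 1.0                      # spacing Δ and ball radius δ
true_centers = [i * delta for i in range(5)]
points = [c + random.uniform(-eps, eps) for c in true_centers for _ in range(200)]

# Adversarial seeding: no seed near cluster 1, two seeds inside cluster 3.
bad_seeds = [true_centers[1], true_centers[2] - 0.5, true_centers[2] + 0.5,
             true_centers[3], true_centers[4]]
_, history = kmeans_1d(points, bad_seeds)

# Lemma 1: the SICD objective never increases between iterations.
assert all(a >= b - 1e-9 for a, b in zip(history, history[1:]))
# The local optimum is Ω(nΔ²): clusters 1 and 2 end up sharing one center,
# so the final cost dwarfs the O(nδ²) optimum.
print(history[0], history[-1])
```

With this seeding the objective decreases monotonically, yet it stabilizes at a cost on the order of nΔ², far from the O(nδ²) optimum, exactly as the example predicts.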

Table 7
CEP comparison. Hybrid I and Hybrid II indicate the one-step hybridization and the interleaved hybridization of the IBCOCLUST and k-means algorithms, respectively.

Algorithm    Iris   Wine   Glass   Cancer
VFI          0      5.77   41.11   7.34
Ridor        0.52   5.1    31.66   6.33
NBTree       2.63   2.22   24.07   7.69
MultiBoost   2.63   17.77  53.7    5.59
Bagging      0.26   2.66   25.36   4.47
Kstar        0.52   3.99   17.58   2.44
RBF          9.99   2.88   44.44   20.27
MlpAnn       0      1.33   28.51   2.93
BayesNet     2.63   0      29.62   4.19
PSO          2.63   2.22   39.5    5.8
ABC          0      0      41.5    2.81
IBCOCLUST    2.63   5.5    32.19   4.49
IBKCLUST     2.63   5.2    30.93   4.49
KIBCLUST     2.63   5.5    28.12   4.49
Hybrid I     2.63   4.31   28.12   3.75
Hybrid II    2.63   3.96   26.76   3.7

to replace the refining stage in the IBCOCLUST algorithm. The hybrid algorithms combine the power of the IBCOCLUST algorithm with the speed of k-means, and the global searching stage and the local refining stage are accomplished by those two modules, respectively. The IBCOCLUST finds the region of the optimum, and then k-means takes over to find the optimum centroids. We need to find the right balance between local exploitation and global exploration.

4.3.1. The sequential hybridization
Two kinds of hybridization of k-means and IBCOCLUST are proposed in this section. One of them first applies the IBCOCLUST and then k-means (improved bee colony + k-means clustering), whereas the other one changes the order of these algorithms (k-means + improved bee colony clustering).

IBKClust: hybridization of improved bee colony and k-means clustering. In the initial stage, the IBCOCLUST module is executed for a short period (50–100 iterations) to discover the vicinity of the optimal solution by a global search and, at the same time, to avoid consuming high computation. When the IBCOCLUST is completed, or shows a negligible trend of improvement after many iterations, the result from the IBCOCLUST module is used as the initial seed of the k-means module. The k-means algorithm is then applied for refining and generating the final result.

KIBClust: hybridization of k-means and improved bee colony clustering. First the k-means module is performed on the data, and then the results of k-means are given to the IBCOCLUST as the initial answers and the IBCOCLUST continues its process.

4.3.2. The interleaved hybridization
In this hybrid algorithm, the local method is integrated into the IBCOCLUST. For instance, after every L iterations, the k-means algorithm uses the best vector from the IBCOCLUST as its starting point. The cluster centers are then updated if the locally optimized vectors (obtained by k-means) have better fitness values than those in the IBCOCLUST, and this procedure is repeated until the stopping condition is met.

4.3.3. The integrated hybridization
To improve the algorithm, a one-step k-means algorithm is introduced. In each iteration, after a new clustering solution

Table 8
The quality of different hybrid clustering algorithms on five data sets measured in terms of SICD.

Data    Measure   K-NM-PSO    K-PSO       K-GA        K-HS        K-ABC       IBCOCLUST   Hybrid (II)
Iris    Average   96.67       96.76       97.10       96.22       96.29       —           95.14
        Best      96.66       96.66       96.10       96.10       96.19       —           95.10
Wine    Average   16,293.00   16,296.00   16,298.70   16,296.10   16,296.10   —           16,294.90
        Best      16,292.00   16,292.00   16,295.00   16,292.20   16,292.50   —           16,292.00
Glass   Average   200.50      221.55      221.70      221.76      221.89      —           221.35
        Best      199.68      213.37      215.70      214.80      215.30      —           214.71
Cancer  Average   2964.70     2965.80     2968.00     2975.30     2970.45     —           2967.24
        Best      2964.50     2964.50     2965.56     2964.50     2964.60     —           2963.11
Vowel   Average   150,895.61  150,990.65  150,992.45  150,908.56  150,903.45  —           150,892.17
        Best      149,496.40  149,486.34  149,556.01  149,478.45  149,498.4   —           149,473.90

Table 9
The statistics of the document data sets used in our experiments.

Dataset        Label       Description                                           # Documents (n)   # of Clusters (K)
Politics       Politics    Random topics of politics                             176               6
TREC           Newspaper   Various articles from certain topics                  873               8
DMOZ           DMOZ        Selected documents among 14 topics                    697               14
20 Newsgroup   Message     Collected from 10 different Usenet newsgroups         9249              10
WebAce         Wap         Web pages listed in the subject hierarchy of Yahoo!   1560              20
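The interleaved scheme of Section 4.3 can be sketched as a single loop. Here `global_step` and `kmeans_step` are hypothetical stand-ins for one IBCOCLUST iteration and one k-means refinement pass; the acceptance logic is an illustrative assumption, not the paper's exact procedure.

```python
def hybrid_clustering(global_step, kmeans_step, fitness, init, iters=100, L=5):
    """Interleaved hybridization sketch: every L iterations, refine the best
    solution with k-means and keep the refinement only if it improves fitness.

    `global_step(sol)` performs one global-search (IBCOCLUST-like) move,
    `kmeans_step(sol)` performs one local k-means refinement pass, and
    `fitness(sol)` returns an SICD-style objective (lower is better).
    """
    best = init
    for t in range(1, iters + 1):
        candidate = global_step(best)
        if fitness(candidate) < fitness(best):   # keep only improving moves
            best = candidate
        if t % L == 0:                           # periodic local refinement
            refined = kmeans_step(best)
            if fitness(refined) < fitness(best):
                best = refined
    return best

# Toy 1-D check: the global step jitters the value, the local step halves it.
import random
random.seed(1)
result = hybrid_clustering(
    global_step=lambda x: x + random.uniform(-1.0, 1.0),
    kmeans_step=lambda x: x / 2,
    fitness=abs,
    init=10.0,
)
assert abs(result) < 10.0   # the hybrid improved the initial solution
```

Setting L = 1 recovers the one-step (integrated) scheme, while running all global iterations before a single refinement phase corresponds to the sequential IBKClust variant.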

is generated by applying the IBCOCLUST, we use k-means to reassign each data point to the cluster with the nearest centroid. If the result of k-means has a better fitness than the solution generated by the IBCOCLUST, then it replaces the corresponding candidate solution of the IBCOCLUST. In each iteration, one iteration of the IBCOCLUST and then one iteration of k-means are performed, and the process continues until the fixed number of iterations has finished.

5. Experimental results on basic data sets

To better understand the performance of the proposed algorithms, we divide the conducted experiments into two different settings. In the first setting, we apply the proposed algorithms to a few well-known general purpose benchmark data sets and compare them with the baseline algorithms. In the second setting, we compare the proposed algorithms to the state-of-the-art algorithms on document clustering. The main reason for splitting our experiments into two parts is the unique challenges that exist in clustering document data sets due to the high-dimensionality and sparseness characteristics of text data, which make it a much more challenging problem. We note that different algorithms might achieve totally different performance on these two types of data sets. The focus of this section is on general purpose data sets. The application of the proposed algorithms to five different document data sets will be discussed in Section 6.

We begin by introducing the data sets we have used in our experiments, followed by comparing the proposed algorithms to a number of well-known algorithms according to their quality and rate of convergence. We also compare the proposed hybrid algorithms to other hybrid evolutionary methods proposed in the literature.

5.1. Data sets

In this work, five benchmark data sets from the UC Irvine (UCI) machine learning repository [94], which is a well-known database repository, are used to evaluate the performance of the proposed BCO based algorithms. The data sets and their main statistics are summarized in Table 2.

5.2. Experimental setup

In the next step, the proposed BCO based algorithms are applied to the data sets summarized in Table 2. The Euclidean distance measure is used as the similarity metric in each algorithm. It should be emphasized at this point that the results shown in the rest of the paper for every dataset are the average over 30 independent runs of the algorithms (to make a fair comparison), each run with randomly generated initial solutions and different seeds of the random number generator. Also, for an easy comparison, the algorithms proceed over 1000 iterations in each run, since 1000 generations are enough for convergence of the algorithms. No parameter needs to be set up for the k-means algorithm. For each data set the optimum number of bees is decided empirically, which will be elaborated in the next subsection. We would like to emphasize that our experimental results revealed that the optimum number of bees is a factor of the number of features in the data set.

5.3. Empirical study of the impact of different parameters on convergence of IBCOCLUST

The aim of this section is to study the effect of two important parameters, namely the number of bees and the parameter γ, on the result attained by the different proposed algorithms. The SICD value of the obtained solution is the value of the fitness function. The algorithm we use to evaluate is the IBCOCLUST, which was described in Section 4.2.

5.3.1. The impact of number of bees on clustering quality

Figs. 2 and 4 report the fitness of solutions measured by SICD for different values of the number of bees. We can see that decreasing the number of bees leads to premature convergence and increasing

Table 10
Normalized SICD comparison. Hybrid I and Hybrid II indicate the one-step hybridization and the interleaved hybridization of the IBCOCLUST and k-means algorithms, respectively.

Data        Criteria    k-means   HSCLUST   IBCOCLUST   IBKCLUST   KIBCLUST   Hybrid I   Hybrid II
Politics    Euclidean   0.73254   0.6524    0.6043      0.6174     0.6162     0.597      0.5821
            Cosine      0.7690    0.6732    0.6235      0.6246     0.6419     0.6382     0.601
Newspaper   Euclidean   0.6815    0.3682    0.5445      0.5729     0.6323     0.5684     0.492
            Cosine      0.7263    0.6723    0.6716      0.6392     0.5978     0.6093     0.5744
DMOZ        Euclidean   0.4587    0.3952    0.2947      0.4543     0.4593     0.3758     0.3669
            Cosine      0.4821    0.4092    0.3374      0.3419     0.3563     0.3655     0.3367
Message     Euclidean   0.8325    0.7612    0.7423      0.7896     0.7747     0.7739     0.7511
            Cosine      0.9206    0.8843    0.8958      0.8737     0.884      0.8624     0.8759
Wap         Euclidean   0.8652    0.8147    0.8234      0.7951     0.8103     0.7968     0.7893
            Cosine      0.8753    0.82      0.8433      0.7814     0.7882     0.7695     0.7532

Table 11
Comparison of the proposed hybrid algorithms to other baseline algorithms in document clustering measured in terms of F-measure quality.

Dataset     Politics   Newspaper   DMOZ      Message   Wap
k-means     0.6035     0.5117      0.5423    0.3236    0.4302
GA          0.6822     0.6427      0.7194    0.5213    0.4982
HSCLUST     0.7655     0.7824      0.72445   0.6692    0.58963
IBCOCLUST   0.8317     0.7256      0.7668    0.6845    0.63291
IBKCLUST    0.8152     0.8324      0.7731    0.75993   0.6022
KIBCLUST    0.7916     0.8212      0.7563    0.6954    0.7225
Hybrid I    0.7928     0.8349      0.7433    0.7166    0.7098
Hybrid II   0.8846     0.8826      0.8574    0.7902    0.7342

the number of bees leads to significant improvements in the initial phase of the clustering. Note that when the time or the number of iterations is fixed, as shown in Figs. 2 and 4, increasing the number of bees may deteriorate the quality of the clustering. In general we can say that the larger the number of bees, the more time (or iterations) is needed for the algorithm to find the optimal solution, but usually a higher quality is achieved.

Fig. 5. Normalized Euclidean SICD comparison.

As can be seen in Figs. 2 and 4, a very small number of bees leads to less exploration of the search space; this potentially causes the algorithm to get stuck in a local optimum. On the other hand, as the number of iterations is finite, increasing the number of bees may deteriorate the quality of the clustering. In general, using a mild number of bees seems to be a good and logical choice, with the advantage of converging to the best result. In addition, our empirical studies demonstrate that with a linear relation between the number of bees and the number of features, better results are reached.

To decide the best setting for the number of bees, the value of the parameter γ and the number of iterations are fixed and the SICD of IBCOCLUST for different values of the number of bees is evaluated. Table 3 shows the effect of increasing the number of bees when the number of generations is constant with the value of 1000. By producing a high number of bees, it is guaranteed that most of the possible solutions available in the solution space will be searched. As can be inferred from Table 3, by increasing the number of bees better clustering results can be acquired, but with some fluctuations. The worst value is obtained when the number of bees is 2, while the best value is achieved when the number of bees is a linear function of the number of features in each cluster. Table 3 indicates that when the number of bees is small, they search the solution space at low depth, while increasing their number explores the solution space in more depth.

5.3.2. The impact of parameter γ on clustering quality

We repeat the same process to decide the value of the parameter γ in our experiments. Similar to the experiments conducted for the number of bees, we fix the value of the other parameters, i.e., the number of bees and the maximum number of iterations, and vary the value of the parameter γ. The results of these experiments on four data sets are reported in Table 4. From the results in Table 4 a few conclusions are in order. First, we note that the value of the parameter γ should be tuned based on the data set at hand to obtain the best results. Second, we can observe that for the Wine and Glass data sets the value of γ is much smaller than the best setting of this parameter for the other two data sets. Comparing the Wine and Glass data sets to the Iris and Vowel data sets in terms of the number of features, this fact demonstrates that for data sets with a large number of features the value of γ must be chosen smaller. Finally, based on the results in Table 4, it seems reasonable to perform model selection on this parameter via a grid search over the candidate values of γ in the (0, 1] interval with steps of 0.01 to find the best possible value.

5.4. Convergence analysis

Here we present experiments to investigate the effectiveness of the proposed algorithms in terms of their convergence rate to the optimal solution.
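The grid search over γ suggested in Section 5.3.2 can be sketched as follows. `run_ibcoclust` is a hypothetical stand-in for one full clustering run returning its final SICD; it is not part of the paper's code.

```python
def tune_gamma(run_ibcoclust, step=0.01):
    """Grid search for gamma over (0, 1] in increments of `step`.

    `run_ibcoclust(gamma)` is assumed to run the clustering once with the
    given gamma and return the resulting SICD (lower is better).
    """
    best_gamma, best_sicd = None, float("inf")
    # Candidate values 0.01, 0.02, ..., 1.00.
    for i in range(1, int(1 / step) + 1):
        gamma = i * step
        sicd = run_ibcoclust(gamma)
        if sicd < best_sicd:
            best_gamma, best_sicd = gamma, sicd
    return best_gamma, best_sicd

# Toy stand-in objective with its minimum near gamma = 0.3.
gamma, sicd = tune_gamma(lambda g: (g - 0.3) ** 2)
print(gamma, sicd)
```

In practice each candidate run would itself be averaged over several random seeds, as done throughout the paper's experiments.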

Fig. 3 illustrates the convergence behavior of the proposed algorithms and the k-means algorithm on the Glass data set with the number of bees B = 9 and γ = 0.01. Fig. 3 illustrates that the reduction of the SICD value in IBCOCLUST follows a smooth curve from its initial vectors to the final optimum solution, with no sharp moves. Another noteworthy point in Fig. 3 is that the SICD has the lowest final value for the L-step interleaved hybridization among the algorithms. The order of the other algorithms with respect to their SICD values is interleaved hybridization, sequential hybridization and IBCOCLUST. It can be inferred from Fig. 3 that the hybrid algorithms overcome the IBCOCLUST's disadvantage by incorporating two-step hybrid schemes: the algorithm uses BCO to get close to the optimal solution and, since BCO does not fine-tune this result, it uses the k-means algorithm to fine-tune it. The results show that the hybrid approaches outperform the component algorithms (k-means and IBCOCLUST) in terms of the quality of the generated clusters. As can be seen from Fig. 3, the IBCOCLUST takes more time to reach the optimal solution than the k-means. This is because the k-means algorithm may be trapped in local optima. Although the k-means algorithm is more efficient than the IBCOCLUST with respect to execution time, the IBCOCLUST generates much better clusterings than the k-means algorithm.

Fig. 6. F-measure comparison.

5.5. Comparison to other baseline algorithms

In this part of the experiments, we evaluate and compare the performance of the proposed algorithms, according to the quality of the generated clusters, with the k-means, PSO based clustering (PSO) [43], GA based clustering (GA) [67], ACO based clustering (ACO) [79] and cooperative artificial bee colony based clustering (CABC) [97] algorithms. The algorithmic parameters used in this set of experiments for each baseline algorithm are reported in Table 5. The setting of the parameters for ACO, GA, PSO and CABC is the same as in their original papers.

To evaluate the quality of the clustering obtained by the different algorithms, we use two metrics, namely Classification Error Percentage (CEP) and SICD, where the first measure has been chosen from the external quality measures and SICD has been selected from the internal measures. CEP expresses the clustering results from an external expert view as shown in Eq. (14), whereas SICD examines how much the clustering satisfies the optimization constraints:

  CEP = (number of misclassified objects / size of test data set) × 100  (14)

We now report and discuss the results for each measure separately in the following subsections.

5.5.1. SICD based evaluation

Table 6 reports the SICD values of the algorithms applied to the mentioned data sets. The smaller the SICD value, the more compact the clustering solution is. Looking at Table 6, we can see that the results obtained by the different proposed algorithms are comparable. It is noticeable from Table 6 that the PSO algorithm outperforms the proposed algorithms on the Wine dataset, but on the other datasets the proposed algorithms have a better SICD than PSO and the other well-known algorithms. Comparing the proposed algorithms to each other, we can say that the 5-step interleaved hybridization outperforms the others.

5.5.2. CEP based evaluation

In order to make a better evaluation of the clustering, as a primary measure of quality we used the widely adopted CEP measure. Since the benchmark data sets have their nominal partitions known to the user, we also compute the mean number of misclassified data points. This is the average number of objects that were assigned to clusters other than according to the nominal classification. In Table 7, we report the misclassification errors (with respect to the nominal classification) for the experiments conducted for the different algorithms.

5.6. Comparison to other hybrid clustering methods

In addition to the basic evolutionary-based clustering algorithms, we compare the proposed algorithms to other state-of-the-art hybrid algorithms. The hybrid models use both evolutionary based algorithms and the k-means algorithm simultaneously. These algorithms include a hybrid technique based on combining the k-means algorithm, Nelder–Mead simplex search, and particle swarm optimization (K-NM-PSO) [22], a hybrid technique of k-means and particle swarm optimization (K-PSO) [1], a hybrid approach based on the genetic algorithm and the k-means algorithm called K-GA [45], the harmony k-means algorithm K-HS [60] and a hybrid algorithm for data clustering using ABC and the k-means algorithm, dubbed K-ABC [46]. A brief description of these algorithms is given below for completeness:

1. K-PSO [1]: The hybrid PSO algorithm first uses k-means clustering to seed the initial swarm, and then uses the PSO algorithm to refine the clusters formed by k-means. In this approach, k-means is used to calculate the distance from each item to the cluster centers, and the result of the k-means algorithm is used as one of the particles, while the remaining particles are initialized randomly.
2. K-NM-PSO [22]: This algorithm is a hybrid technique based on combining the k-means algorithm, Nelder–Mead simplex search, and particle swarm optimization. It clusters arbitrary data by evolving the appropriate cluster centers in an attempt to optimize a given clustering metric. The hybrid algorithm first executes the k-means algorithm, which terminates when there is no change in the centroid vectors. The algorithm randomly generates 3N particles, or vertices, and NM-PSO is then carried out to its completion.
3. K-HS [60]: The hybrid K-HS algorithm combines the harmony search algorithm with a one-step k-means algorithm. Each row of the harmony memory in this algorithm has a discrete representation. This algorithm codifies the whole partition of the data in a vector of length n, where n is the number of data points. In this algorithm, at each improvisation step a one-step k-means is included to fine-tune the new solution.
4. K-GA [45]: This algorithm is a hybrid approach based on the genetic and k-means algorithms, called the genetic k-means algorithm for clustering analysis, which defines a basic mutation operator specific to clustering called distance-based mutation. The genetic

operators that are used in K-GA are the selection, the distance-based mutation and the k-means operator. The representation of the GA is to consider a chromosome of length n and allow each allele in the chromosome to take values from {1, 2, …, K}. The mutation changes an allele value depending on the distances of the cluster centroids from the corresponding data point.

In Table 8, the SICD values of the hybrid versions of IBCOCLUST are compared to other hybrid evolutionary methods proposed in the literature. On the Iris, Wine and Vowel data sets, the proposed algorithm had the lowest SICD values, which makes the algorithm a distinct one, while on other datasets such as Glass and Cancer the proposed algorithm had a similar or superior performance compared to the other competitors. The behavior of the proposed algorithms varies on different datasets, but what is very common among most of the datasets is the superiority of the proposed algorithm over the conventional variations.

6. Experimental results on document clustering

We now turn to comparing the proposed algorithms to the state-of-the-art algorithms on document clustering. Document clustering is a crucial and important application in Information Retrieval [25], and characteristics of this type of data, such as high-dimensionality and sparseness, introduce new challenges to the clustering problem and make it harder compared to other types of data. Having this in mind, we chose this application for evaluating and comparing the proposed algorithms on different document data sets. In this section, a brief introduction to the problem of document clustering is given, the data sets are introduced, and the algorithms are compared with k-means, a GA based algorithm, and harmony search based document clustering (referred to as HSCLUST) [25].

6.1. Document clustering

In document clustering the vector space model is used to represent documents, in which each vector is a set of the document's features such as words, terms and N-grams. These vectors are used in the similarity measure between documents as well. We use the same notation as in Section 3, where in document clustering D is the set of n documents in which d_i, i = 1, 2, …, n, is the i-th document. These real values can be determined as the word frequencies or can be other relevant measures like term frequency and inverse document frequency (TF-IDF), which is the most widely used weighting schema [75]. Having in mind that assuming all of the words in a document will result in a very high vector dimension (i.e., large d), a preprocessing phase is applied to eliminate the unnecessary words and reduce the vector dimension [44]. The similarity measure is one of the two well-known similarities, namely the Euclidean and Cosine measures, which are introduced in Section 3 in Eqs. (1) and (2). Also, the performance of a clustering is measured using the introduced SICD fitness function as defined in Eq. (3).

6.2. Document data sets

For the application of document clustering, five different datasets with different characteristics are used. The first dataset, namely Politics, is a dataset consisting of random topics in politics which was collected in 2006. The TREC dataset is collected among different topics from the San Jose Mercury newspaper, including topics such as computers, electronics, health, medical, research, and technology. The DMOZ dataset is collected among 14 topics, in which for each topic some web pages are selected and included in the data. The 20 Newsgroup dataset is a collection of 1000 messages from each of 10 different Usenet newsgroups, resulting in 10,000 messages, which become 9249 after preprocessing. The last dataset, WebAce, is from the WebACE project (WAP) [9,66]. The details about these datasets are given in Table 9.

The feature vector of each document is its words; however, assuming all of the words in a document makes the feature set too big for text mining. Therefore, to overcome this problem a preprocessing approach is necessary to reduce the dimension of the feature set. To this aim, the common words (e.g. function words: "a", "the", "in", "to"; pronouns: "I", "he", "she", "it") are eliminated from the documents, and different forms of a stem are also conflated into one.

6.3. Experimental setup

All of the proposed algorithms are applied to the introduced datasets. The parameters are tuned as mentioned in Section 5.2. The results shown in the rest of the paper are the average over 30 runs of the algorithms. Based on the findings in Section 5.3, the number of bees is set as a factor of the number of features for each dataset.

6.4. Quality of clustering

In this part of the experiments we compare the proposed algorithms, according to the quality of their generated clusters, with a few well-known and efficient clustering algorithms, including k-means, harmony search and GA based clustering algorithms [25,4]. For the evaluation of clustering quality we used the two widely applicable metrics of SICD and F-measure. The SICD metric measures the internal quality, while the F-measure examines the external quality of the clustering.

6.4.1. SICD based evaluation

Table 10 demonstrates the comparison of normalized SICD for the five datasets using both the cosine and Euclidean similarity measures. As can be seen, the results obtained by our proposed algorithms outperform k-means on all datasets, and on average the 5-step interleaved hybridization outperforms all the proposed algorithms. Fig. 5 depicts this comparison for the Euclidean similarity measure.

6.4.2. F-measure based evaluation

Another evaluation metric we utilize to compare the quality of the clusterings resulting from our proposed algorithms is the F-measure [5]. It is defined as the harmonic mean of precision and recall from information retrieval. In our measurement, each cluster is treated as if it were the result of a query and each class as if it were the desired set of documents for that query. The recall and precision of that cluster for each given class can then be calculated. For a cluster of documents C = {c_1, c_2, …, c_K}, to assess the quality of C with respect to an ideal clustering C* = {c*_1, c*_2, …, c*_K} (categorization by a human), we first compute the precision and recall as

  P(C, C*) = |C ∩ C*| / |C|  and  R(C, C*) = |C ∩ C*| / |C*|  (15)

Then we define

  F(C, C*) = 2 · P(C, C*) · R(C, C*) / (P(C, C*) + R(C, C*))  (16)

Table 11 shows the details of the F-measures compared to k-means and GA. Again the 5-step interleaved hybridization has the best performance according to the F-measure, and all the other algorithms outperform k-means and GA as well. Fig. 6 gives an overview of this comparison.
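Eqs. (15) and (16) can be sketched directly; the function name and the set representation of clusters are illustrative choices, not from the paper.

```python
def f_measure(cluster, ideal):
    """F-measure of a cluster against an ideal class (Eqs. (15)-(16)).

    Both arguments are sets of document identifiers; `cluster` plays the
    role of a query result and `ideal` the desired set of documents.
    """
    overlap = len(cluster & ideal)
    if overlap == 0:
        return 0.0
    precision = overlap / len(cluster)   # |C ∩ C*| / |C|
    recall = overlap / len(ideal)        # |C ∩ C*| / |C*|
    return 2 * precision * recall / (precision + recall)

# Toy example: 3 of the 4 clustered documents belong to the ideal class of 6,
# so precision = 0.75, recall = 0.5, and F = 0.6.
print(f_measure({1, 2, 3, 4}, {2, 3, 4, 5, 6, 7}))  # 0.6
```

An overall score for a clustering is typically obtained by matching each ideal class to its best-scoring cluster and averaging, weighted by class size.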

7. Conclusion

In this paper, we applied the bee colony optimization (BCO) algorithm to the clustering problem. The shortcomings of basic pure BCO based clustering were examined and refined in IBCOCLUST, the improved version of the basic algorithm. In particular, the improved algorithm is a novel modification of the BCO optimization algorithm that introduces the fairness and cloning properties, aimed at increasing the explorative power of the BCO algorithm and the propagation of knowledge in the optimization process, respectively.

Additionally, four different hybridization methods have been proposed, which are compositions of IBCOCLUST and the k-means algorithm in different manners: (1) IBKCLUST, in which IBCOCLUST is applied before k-means; (2) KIBCLUST, in which k-means is applied before IBCOCLUST; (3) One step Hybridization (IBCOCLUST+k-means), in which at each iteration both algorithms are applied simultaneously; and (4) k step Interleaved Hybridization, in which in each iteration k steps of each algorithm are applied in turn. The performance of all of the proposed algorithms was compared with well-known methods widely used by researchers in the two applications of data and document clustering. The experimental results show an impressive improvement over the other methods and indicate that the improved bee colony algorithm can successfully be applied to clustering. They also show that the k step Interleaved Hybridization method yields the best solutions: in each iteration, the k steps of IBCOCLUST widen the search by exploring more globally, after which the k-means algorithm has k steps to find a local optimum within the promising global region that IBCOCLUST provided. As this process goes on, the algorithm has the chance to investigate many local optima within the most promising global solution areas, and therefore attains the best performance.

This work leaves a few directions, both theoretical and empirical, as future work. In our setting, the number of clusters and the number of data points were assumed to be fixed in advance; as a result, a static matrix for the assignment of data points to clusters was sufficient for our optimization purpose. When the number of clusters is not known, or data points can be dynamically added or removed, this static structure is not sufficient and a dynamic data structure is necessary. We note that, considering the hardness of the clustering problem even for a fixed number of clusters and data points, the dynamic problem is much more challenging and requires careful investigation. It would be interesting to examine this issue in future work.

Acknowledgment

The authors would like to thank the Associate Editor and anonymous reviewers for their immensely insightful comments and helpful suggestions on the original version of this paper.
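The k step Interleaved Hybridization described in the conclusion can be sketched as follows. This is a simplified toy sketch, not the paper's implementation: `ibco_step` stands in for one IBCO move (here, a random centroid perturbation kept only if it lowers the clustering objective), `kmeans_step` is one standard Lloyd iteration, and the data, function names, and parameters are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(size=(60, 2))
data[:30] += 4.0  # two loose groups, so clustering has structure to find

def sse(centroids):
    """Sum of squared distances of each point to its nearest centroid."""
    d = ((data[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
    return d.min(axis=1).sum()

def ibco_step(centroids):
    """Toy stand-in for one IBCO move: global random perturbation,
    kept only if it improves the objective."""
    cand = centroids + rng.normal(scale=0.5, size=centroids.shape)
    return cand if sse(cand) < sse(centroids) else centroids

def kmeans_step(centroids):
    """One Lloyd iteration: reassign points, recompute cluster means."""
    d = ((data[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
    assign = d.argmin(axis=1)
    new = centroids.copy()
    for j in range(len(centroids)):
        if np.any(assign == j):  # keep old centroid if cluster is empty
            new[j] = data[assign == j].mean(axis=0)
    return new

def interleaved_hybrid(centroids, k=5, iterations=10):
    """k step Interleaved Hybridization: in each outer iteration, run
    k IBCO steps (global exploration), then k k-means steps (local
    refinement), on the same candidate solution."""
    for _ in range(iterations):
        for _ in range(k):
            centroids = ibco_step(centroids)
        for _ in range(k):
            centroids = kmeans_step(centroids)
    return centroids
```

Because the toy IBCO step only accepts improving moves and a Lloyd step never increases the objective, the final solution is never worse than the initial one; the interleaving lets the local refinement start from progressively better global regions.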
Rana Forsati obtained her Ph.D. degree from Shahid Beheshti University, Tehran, Iran in summer 2014. She was a member of the NLP Research Laboratory of the Electrical and Computer Engineering department. She also spent a year as a visiting research scholar at the University of Minnesota in the Department of Computer Engineering, from March 2013 to March 2014. Her research interests include machine learning, soft computing and data mining, with applications in natural language processing and recommender systems.
Andisheh Keikha is a PhD student of Computer Engineering at Ryerson University. She received the M.Sc. degree from Shahid Beheshti University, Tehran, Iran. Her research interests include data mining and soft computing, with applications in natural language processing.

Mehrnoush Shamsfard received her BS and MS degrees, both in computer software engineering, from Sharif University of Technology, Tehran, Iran. She received her PhD in Computer Engineering – Artificial Intelligence from Amir Kabir University of Technology in 2003. Dr. Shamsfard has been an assistant professor at Shahid Beheshti University since 2004. She is the head of the NLP Research Laboratory of the Electrical and Computer Engineering faculty. Her main fields of interest are natural language processing, ontology engineering, text mining and the semantic web.