Categorizing Overlapping Regions in Clustering Analysis Using Three-Way Decisions
Categorizing Overlapping Regions in Clustering Analysis Using Three-Way Decisions
Email: [email protected]
Abstract—Clustering is a common technique for data analysis, number of clusters in advance, which is difficulty to obtain in
has been widely used in many practical area. In many real the real environment. And, there are some researches on graph-
applications such as social network analysis, wireless sensor based clustering algorithms. Al Hasan et al. [10] proposed
networks, document clustering, and so on, there exist overlaps an overlapping clustering algorithm SimClus, and the method
between different clusters due to various reasons. In this paper, creates a similarity graph by a threshold β, and then covers the
we propose to use the three-way decisions approach to address
graph by find a special kind of subgraph, star shaped subgraph,
categorizing overlapping regions. In contrast to existing soft
clustering methods that just point out the objects whether in which is a cluster. The limitation of SimClus is that it creates
overlapping regions, the three-way decisions method provides a a large number of clusters and may build clusters with severe
greater refinement of the categorization to system operators for overlapping. Prez et al. [11] proposed OClustR algorithm, it’s
further analysis, which is believed to show clearly the objects also a graph-based clustering algorithms, and it uses a new
have different impacts to construct clusters. Besides, we provide a graph-covering strategy and a new filtering strategy to solve
new relation-graph based clustering algorithm to obtain different the problem in the SimClus to some extent.
overlapping region types. The results of comparison experiments
are better and more reasonable to overlapping clustering. Overlapping clustering algorithms have also been widely
used in the environment of overlapping community detection
I. I NTRODUCTION in complex networks. Palla et al. [12] proposed the first over-
lapping community detection algorithm CPM, which assume a
Clustering is one of the most important research field in community or cluster as the k-cliques. The method find out all
machine learning and data mining. In recent years, cluster- k-cliques, and two k-cliques are overlapping if they share the
ing has been widely used in many areas such as business same node. The main drawbacks of CPM are its computational
intelligence, image recognition and biological information [1]. complexity is exponential and the definition of community is
However, there are some applications like social network too strict. Lancichinetti et al. [13] proposed LFM algorithm,
analysis [2], wireless sensor networks [3], document clustering which utilizing local expansion to discovering overlapping
[4], and so on, where it is common that objects may belong to community, the limitation of LFM is the expansion strategy
more than one cluster. In those areas, discovering overlapping may cause repetitive computation. Lee et al. [14] proposed
regions is important and meaningful. GCE algorithm to solve this problem, it uses k-cliques as
For the different application fields, researchers have pro- seeds and expands seeds to build clusters, but it suffers from
posed different overlapping clustering methods. Aydin et al. [5] the problem how to choose k. Shi et al. [15] proposed an
proposed an overlapping clustering algorithm used in Ad hoc overlapping community detection algorithm, which clusters
networks, it can improve network reliability and load balanc- edges instead of nodes to create overlapping communities, a
ing. Specific to the DBLP data, Obadi et al. [6] introduced an node is overlapping if links connected to it are put in more
overlapping clustering method, which can solve the problem than one community. However, the limitation of this algorithm
that a paper may corresponding to multiple topics. Lingras et is that it may create a large number of clusters.
al. [7] compared crisp and fuzzy clustering in the mobile phone
However, rarely of the existing researches addresses the
call dataset, and pointed out that, fuzzy clustering can capture
problems such as that, what is the relationship between these
objects which will split into two or more clusters form a single
objects in overlapping regions, and whether the significance of
cluster when the number of clusters are increased.
these objects to impact the clustering processing is the same.
There are also some achievements by combining uncertain Most of the soft clustering methods just point out that the
approaches such as fuzzy sets theory and rough sets theo- object whether belong to more than one clusters. For example,
ry. For example, Peters et al. [8] proposed an overlapping there are two clusters, football fans and basketball fans, the
clustering algorithm, by keeping roughness constant in each overlapping of two clusters means some one both like football
cycle, this method can detection the structure of cluster in and basketball. However, there are several situations: some one
dynamic environment. Lai et al. [9] extended the rough fuzzy is a fanatic of football and basketball, but another one just
k-means, proposed GRFKM algorithm, which can provide less is an amateur of football and basketball, some one may like
computing time and less threshold setting. But the limitation of football more than basketball or just the opposite. We can see
fuzzy k-means and rough k-means is that they need to set the that, although they both in the overlapping region, but have
of object xn in the attribute d, where n ∈ {1, · · · , N }, more than football, he/she belongs to P OB(Cj , Ci ).
d ∈ {1, · · · , D}. In fact, there may be more overlapping situations, for
example, two clusters may be overlapped in both P OP and
Yao and Lingras et al.[17] had formulated the clustering P OB. Thus, we need to refine the categorization. We call
using the form of interval sets. It is naturally that the region the above three cases as macro types. Table I describes all
between the lower and upper bound of an interval set means micro types, where the symbol “◦” denotes there exist the
the overlapping region. overlapping type, symbol “×” denotes there don’t exist the
That is, we use an interval set to represent a cluster in RC, overlapping type. It can be seen that, we categorize overlapping
regions into 4 macro types or 8 micro types. Macro type
namely, C k = [C k , C k ]. The C k represents the lower bound
TYPE.1 means two clusters exist the P OP case, micro type A,
of C k , whose objects belong to the C k definitely, and also
B, C, D are four categorizations belonging to TYPE.1. TYPE.2
be called as the positive region of C k , denoted as P OS(C k ).
means there don’t exist the P OP case, but there at least exists
The BN D(C k ) = C k − C k represents the boundary region of the P OB case; and micro type E, F are two subtypes belonging
C k , whose objects may belong to C k ; N EG(C k ) = U − C k to TYPE.2. TYPE.3 means there only exists the BOB case,
represents the negative region of C k , whose objects do not micro type G is the only subcase belonging to TYPE.3.
belong to C k definitely.
In order to characterize the diverse overlapping regions,
Then, we have the following clustering result formulated we introduce the overlapping degree between clusters. In this
by interval sets: paper, we mainly discuss the three macro types of overlapping
regions between clusters. That is, we will evaluate overlapping
RC = {[C1 , C1 ], · · · , [Ck , Ck ], · · · , [CK , CK ]}. degree between clusters from the following three respects:
351
TABLE I: Categorization of Overlapping regions the weight of the relationship. We can research the clustering
problems with the help of relation graph.
Traditional Categorization for Overlapping Regions
Macro Type Micro Type P OP P OB BOB definition 1: Relation Graph G =< V, E, W > is a
A ◦ ◦ ◦ relation graph, where V is the set of nodes, E is the set of
B ◦ ◦ × edges, W is the set of weights between nodes.
TYPE.1
Overlapping C ◦ × ◦
D ◦ × × In order to convert the U to the RG, set V = U , vn ∈ V
E × ◦ ◦ is the mapping of object xn ∈ U , (u, v) ∈ E represents node
TYPE.2
F × ◦ × u and v exists edge, and w(u, v) is the weight between them.
TYPE.3 G × × ◦
None None
In the relation graph, we also use interval sets to represent
Overlapping Overlapping
H × × × clusters. That is, a cluster Ck is a connected subgraph S ⊆
G, Ck is a connected subgraph S ⊆ G, and Ck also is a
connected subgraph S ⊆ G. They have S ⊆ S ⊆ S ⊆ G.
DegP OP (Ci , Cj ), the overlapping degree of positive region- For convenience, we still use the lower bound Ck , the upper
s, to reflect the degree of similarity between two clusters; bound Ck to represent the cluster Ck .
DegP OB(Ci , Cj ), the overlapping degree between the pos-
The basic notions in a relation graph are defined as follow.
itive region of Ci and the boundary region of Cj , to reflect
the degree of influence from Cj to Ci ; DegBOB(Ci , Cj ), the definition 2: Adj(v) For a node v, the Adj(v) = (u|∃u ∈
overlapping degree between boundary regions, to reflect the V ∧ (u, v) ∈ E) means the set of nodes which have adjacent
size of neutral space between two clusters. edge with v.
General speaking, we can design the special computing definition 3: W eight(v) For a node v, W eight(v) =
formulae according to the actual application background. In w(u, v)(u ∈ Adj(v)), means the sum of weights between
order to show the categorization is available, we will propose node v and its adjacent edges.
a concrete algorithm based on three-way decisions and these
formulae also be described in the next section. definition 4: LIV (v) The local importance of a node v,
defined as:
A
IV. A LGORITHM LIV (v) =
|Adj(v)|
In order to find the overlapping regions and show the
different type of overlapping regions, this section introduces an where |Adj(v)| means the number of node in Adj(v). A
overlapping clustering algorithm, TDC-RG (Three-way deci- denotes the number of nodes in Adj(v), whose W eight is
sions Overlapping Clustering Algorithm Based on the Relation not greater than the W eight of node v; namely, for node
Graph). The subsection IV-A introduces the relation graph first, u ∈ Adj(v), if W eight(u) ≤ W eight(v), A will be added
the subsection IV-B explains how to categorize the nodes in 1.
a relation graph, and the algorithm is described in subsection
IV-C. Obviously, if W eight of all adjacent nodes of v is greater
than v’s, LIV (v) = 0; if W eight of all adjacent nodes of v is
smaller or equal to node v, LIV (v) = 1. Generally speaking,
A. Relation Graph
if a node has greater W eight, more important is it. Therefore,
In the real world, there generally exists relationships be- LIV (v) can reflect the local importance of a node on a certain
tween individuals or objects. For example: if two people are extent.
friends, there is a friend relationship between them; if two
editors edit the same page on Wikipedia, there may have a B. Classify nodes
common interest relationship between them; if two researchers
publish a paper together, they may have a cooperation relation- The different object usually plays a different role in a
ship; if two objects are close in the Euclid space, they may cluster. Some are the real spirits and be the core leaders in
have similarity relationship. the cluster; some are very important and be the bone numbers
of the cluster; some are not very important to construct the
Meanwhile, the importance of relationships may be dif- cluster and be the trivial numbers, which have little influence to
ferent, usually we set a weight to depict them. For example, the cluster. For example, a community usually has few leader
the weight of friend relationship between good friends is members, who have great influence in the community; in a
bigger than the normal friends; the more same page edited, research area, though there have lots of researchers, there will
the more intense they are in the common interest; the more be few top researchers, who can guide the research directions.
papers published together, the cooperation relationship is more Though the bone objects are not the core objects, they still
firmness. have great importance to build the cluster, and they are the
Such relationship between individuals and individuals, peo- backbone of the cluster; on the contrary, trivial objects belong
ple and people, objects and objects can be mapped into a to the cluster, but they have little influence to build the cluster.
relation graph easily. Namely, in the graph, node represents Based on the above facts, we classify the nodes into three
the individual, people or object; edges between nodes represent types in a relation graph as core nodes, bone nodes and trivial
the relationship between them; the weight of edge represents nodes. According to the definition 4, LIV (v) reflects the local
352
(a) (b)
353
Algorithm 1 The Preliminary Clustering Algorithm
Input:
The bone graph, ClusterCore(v), α, β;
Output:
The preliminary clustering result;
1: Choose a cluster core which has not extended as an
initiating cluster C. If all cluster cores are extended, stop
the algorithm.
2: For the adjacent bone nodes, to calculate the μ(v, C)
according to Equation (2).
3: Choose one node v with the largest μ(v, C).
Fig. 3: ClusterCore(v1 ) 4: If μ(v, C) > α, add the node v to the positive region of
the cluster C, goto 2; if μ(v, C) ≥ β, add the node to the
boundary region of the cluster C, goto 2; if μ(v, C) < β,
goto 1.
Next, we take Fig.3 as an example to explain the advantage
of the proposed method, the figure is a subgraph of Zacharys
karate club’bone graph in Fig.2, it contains node v1 and Thus, a trivial node v must be adjacent to at least one
Adj(v1 ). There are two cluster cores corresponding to the core preliminary cluster. Then, we calculate μ(v, C) between v and
node v1 , namely, ClusterCore(v1 ) = { (v1 , v5 , v6 , v7 , v11 ), C. If μ(v, C) > α, we assign the node to the positive region
(v1 , v2 , v3 , v4 , v9 , }. Obviously, the v1 belongs to the two of the cluster; if μ(v, C) < α but μ(v, C) ≥ β, we assign it
clusters. Thus, the method to construct cluster cores is adept to the boundary of the cluster; if all μ(v, C) < β, we assign
in finding the overlapping objects. the node to the boundary of the cluster which has the highest
2) Preliminary Clustering: μ(v, C).
354
Fig. 4: Overlapping Between Boundary Regions
355
but it don’t point out the node 1 is the overlapping node.
Tabel III shows the overlapping nodes found by these methods.
There also exists overlapping between C2 and C3 , most of the
above algorithms can find the overlapping region. From the
Fig.5, we can observe that the most significance overlapping
nodes between C2 and C3 are the node 9 and the node 10,
the proposed TDC-RG algorithm, the LFM and Sun can find
them; some algorithms just find part of them.
By contrast to the other methods to treat nodes in the
overlapping region equally without distinction, the proposed
method divides these overlap nodes into different types. For Fig. 6: Clustering Result of TDC-RG Algorithm in Dolphin
example, the node 9 belongs to P OP (C2 , C3 ), the node Social Network
10 belongs to BOB(C2 , C3 ), and the node 31 belongs to
P OB(C3 , C2 ).
Further observations show the advantage of the proposed TABLE IV: Overlapping Nodes Found by Different Algorithm-
method. Zachary observed that the club split into two groups s in Dolphin Social Network
after a dispute, the administrator (node 34) and the instructor
(node 1) are the key persons of the two factions, respectively. Algorithm Overlapping Nodes
But in fact, before the fission, each member chose a faction to P OP (C1 , C2 ) 20
support; after the fission, each one chose a new group to join P OB(C1 , C2 ) 24,29,37
TDC-RG
in. P OB(C2 , C1 ) 2,40,42
BOB(C1 , C2 ) 8
Let us observe the faction and the choices. For the node 9, LFM [13] 8,20,29,31,40
he supported the administrator first, but joined the instructor’s EM-BOAD [21] 40
group finally. Zachary thinks that because there is only three Sun [24] 8,29,31,40
weeks away from a test for black belt, this makes the node
9 have to join the instructor’s group to get practice. Thus,
the node 9 is assigned into P OP (C2 , C3 ), which presents leave of SN100 (node 37). As shown in Table IV, only the
the processing he is hardly to decide which cluster he should proposed method successfully finds the special role SN100 in
be. For the node 10, he supported none of the factions, but overlapping regions.
joined the administrator’s group finally. Thus, we assign it into
BOB(C2 , C3 ), which shows that the node is neutral between
the two factions. For the node 31, he supported the adminis- C. Books About US Politics Network
trator and joined the administrator’s group finally. From the Books about US politics network [26] is a network of
Fig.5, we can find that the node 31 supports the administrator books about US politics published around the time of the
but it also have some connection with the instructor’s group. 2004 presidential election and sold by the online bookseller
Thus, it is very reasonable to assign it into P OB(C3 , C2 ). Amazon.com. It consists of 105 nodes and 441 edges, and
edges between books represent frequent co-purchasing of
B. Dolphin Social Network books by the same buyers. According to political inclination,
Dolphin social network is a social network about bottlenose books are divided into three classes, liberalism, conservatism
dolphins, it is compiled by Lusseau et.al [25] after seven years and neutralism.
studies. It consists of 62 nodes and 159 edges. It is generally Fig.7 shows the result of TDC-RG Algorithm. The thick
believed that there exists two clusters. line divides the data set into two clusters, C1 represents
Fig.6 shows the result of TDC-RG algorithm, which finds conservatism, and C2 represents liberalism. The notes labeled
two clusters. As a contrast, LFM [13] and Sun [24] also find with digit are neutralism, there are 13 nodes. It can be seen
two clusters, EM-BOAD [21] finds three clusters, GaoCD [15] that liberalism and conservatism can be easily distinguish, our
finds more than 10 clusters. Table IV shows the overlapping methods find two main clusters well. And the overlapping type
nodes found by these methods except GaoCD, considering of two clusters is type E, that is, there don’t exist the P OP
GaoCD finding too many clusters. By contrast to the other case between the conservatism and liberalism; which reflects
methods, our methods not only finds more overlapping nodes, the reality of the situation.
but also points out the different significance of overlapping
However, neutralism are scattered distribution and difficult
nodes by dividing these nodes into different types.
to form a cluster. To deal with this situation, we assign
For example, for nodes 24, 29, 37, we assign them into these nodes into different types of overlapping regions due
P OB(C1 , C2 ), because these nodes belong to C1 obviously to their different significance. For example, we assign the
but also have lots of connections with the C2 ; and the algorithm node 48 into P OB(C1 , C2 ), which means this book is trend
assigns node 2, 40, 42 into P OB(C2 , C1 ) because of the to conservatism; the nodes 4,7,26,69,104 are assigned into
similar reason. BOB(C2 , C1 ) because they are just neutralism books; we
assign the nodes 0,6,46 into P OS(C1 ) and node 51 into
Furthermore, Lusseau and Newman [27] point out that, this BN D(C1 ), which means these books are preferred to conser-
network may completely fall into two parts because of the vatism books to buyers; we assign the node 76 into P OS(C2 )
356
[7] P. Lingras, P. Bhalchandra, S. Khamitkar, S. Mekewad, and R. Rathod,
“Crisp and soft clustering of mobile calls,” in Multi-disciplinary Trends
in Artificial Intelligence. Springer, 2011, pp. 147–158.
[8] G. Peters, R. Weber, and R. Nowatzke, “Dynamic rough clustering and
its applications,” Applied Soft Computing, vol. 12, no. 10, pp. 3193–
3207, 2012.
[9] J. Z. Lai, E. Y. Juan, and F. J. Lai, “Rough clustering using generalized
fuzzy clustering algorithm,” Pattern Recognition, vol. 46, no. 9, pp.
2538–2547, 2013.
[10] M. Al Hasan, S. Salem, and M. J. Zaki, “Simclus: an effective algorithm
Fig. 7: Clustering Result of TDC-RG Algorithm in Books for clustering with a lower bound on similarity,” Knowledge and
About US Politics Network information systems, vol. 28, no. 3, pp. 665–685, 2011.
[11] A. Pérez-Suárez, J. F. Martı́nez-Trinidad, J. A. Carrasco-Ochoa, and
J. E. Medina-Pagola, “Oclustr: A new graph-based algorithm for over-
lapping clustering,” Neurocomputing, vol. 121, pp. 234–247, 2013.
and node 103 into BN D(C2 ), which means these books are [12] G. Palla, I. Derényi, I. Farkas, and T. Vicsek, “Uncovering the overlap-
ping community structure of complex networks in nature and society,”
preferred to liberalism books to buyers. Nature, vol. 435, no. 7043, pp. 814–818, 2005.
By contrast, GaoCD [15] also finds two main clusters, but [13] A. Lancichinetti, S. Fortunato, and J. Kertész, “Detecting the overlap-
for the rest nodes, it creates more than 10 clusters, which ping and hierarchical community structure in complex networks,” New
Journal of Physics, vol. 11, no. 3, p. 033015, 2009.
hardly to reveal the different inclinations of neutralism nodes.
[14] C. Lee, F. Reid, A. McDaid, and N. Hurley, “Detecting highly over-
lapping community structure by greedy clique expansion,” in SNA-
VI. C ONCLUSION KDD10: Proceedings of the 4th Workshop on Social Network Mining
and Analysis, 2010, pp. 32–42.
This paper introduced further analysis in overlapping re-
[15] C. Shi, Y. Cai, D. Fu, Y. Dong, and B. Wu, “A link clustering
gions based on three-way decisions, to reveal the different based overlapping community detection algorithm,” Data & Knowledge
significances to the clustering processing of these different Engineering, vol. 87, pp. 394–404, 2013.
objects. A cluster is represented by an interval set, the lower [16] Y. Yao, “An outline of a theory of three-way decisions,” in Rough Sets
bound and upper bound give the positive region and boundary and Current Trends in Computing. Springer, 2012, pp. 1–17.
region of a cluster, respectively. Then, the overlapping regions [17] Y. Yao, P. Lingras, R. Wang, and D. Miao, “Interval set cluster
are categorized into 4 macro types or 8 micro types. Thus, analysis: A re-formulation,” in Rough Sets, Fuzzy Sets, Data Mining
the proposed method can find different semantic of objects and Granular Computing. Springer, 2009, pp. 398–405.
in overlapping regions compared to other overlapping clus- [18] H. Yu and Y. Wang, “Three-way decisions method for overlapping
tering algorithms. In addition, a new overlapping clustering clustering,” in Rough Sets and Current Trends in Computing. Springer,
2012, pp. 277–286.
algorithm is proposed based on a relation graph, where cluster
[19] H. Yu and Q. Zhou, “A cluster ensemble framework based on three-way
cores, bone nodes and trivial nodes are defined. Compared decisions,” in Rough Sets and Knowledge Technology. Springer, 2013,
with other overlapping clustering algorithms on some real pp. 302–312.
social networks, the results are better and more reasonable. [20] W. Zachary, “An information flow modelfor conflict and fission in small
How to obtain different overlapping region types with a less groups1,” Journal of anthropological research, vol. 33, no. 4, pp. 452–
computational complexity algorithm is our further work. 473, 1977.
[21] J. Li, X. Wang, and J. Eustace, “Detecting overlapping communities by
ACKNOWLEDGMENT seed community in weighted complex networks,” Physica A: Statistical
Mechanics and its Applications, vol. 392, no. 23, pp. 6125–6134, 2013.
This work was supported in part by the National Natural [22] J. Baumes, M. K. Goldberg, M. S. Krishnamoorthy, M. Magdon-
Science Foundation of China under grant No.61379114 and Ismail, and N. Preston, “Finding communities by clustering a graph
into overlapping subgraphs.” IADIS AC, vol. 5, pp. 97–104, 2005.
No.61272060.
[23] J. Huang, H. Sun, J. Han, and B. Feng, “Density-based shrinkage
for revealing hierarchical and overlapping community structure in
R EFERENCES networks,” Physica A: Statistical Mechanics and its Applications, vol.
[1] A. K. Jain, “Data clustering: 50 years beyond k-means,” Pattern 390, no. 11, pp. 2160–2171, 2011.
Recognition Letters, vol. 31, no. 8, pp. 651–666, 2010. [24] P. G. Sun, L. Gao, and S. Shan Han, “Identification of overlapping and
[2] J. Xie, S. Kelley, and B. K. Szymanski, “Overlapping community non-overlapping community structure by fuzzy clustering in complex
detection in networks: The state-of-the-art and comparative study,” ACM networks,” Information Sciences, vol. 181, no. 6, pp. 1060–1071, 2011.
Computing Surveys (CSUR), vol. 45, no. 4, p. 43, 2013. [25] D. Lusseau, K. Schneider, O. J. Boisseau, P. Haase, E. Slooten, and
[3] A. A. Abbasi and M. Younis, “A survey on clustering algorithms for S. M. Dawson, “The bottlenose dolphin community of doubtful sound
wireless sensor networks,” Computer communications, vol. 30, no. 14, features a large proportion of long-lasting associations,” Behavioral
pp. 2826–2841, 2007. Ecology and Sociobiology, vol. 54, no. 4, pp. 396–405, 2003.
[4] A. G. Alonso, A. P. Suárez, and J. E. M. Pagola, “Acons: a new [26] M. E. Newman, “Modularity and community structure in networks,”
algorithm for clustering documents,” in Progress in Pattern Recognition, Proceedings of the National Academy of Sciences, vol. 103, no. 23, pp.
Image Analysis and Applications. Springer, 2007, pp. 664–673. 8577–8582, 2006.
[5] N. Aydin, F. Nai t Abdesselam, V. Pryyma, and D. Turgut, “Overlapping [27] D. Lusseau and M. E. Newman, “Identifying the role that animals play
clusters algorithm in ad hoc networks,” in Global Telecommunications in their social networks,” Proceedings of the Royal Society of London.
Conference (GLOBECOM 2010), 2010 IEEE. IEEE, 2010, pp. 1–5. Series B: Biological Sciences, vol. 271, no. Suppl 6, pp. S477–S481,
2004.
[6] G. Obadi, P. Drázdilová, L. Hlavacek, J. Martinovic, and V. Snasel,
“A tolerance rough set based overlapping clustering for the dblp data,”
in Web Intelligence and Intelligent Agent Technology (WI-IAT), 2010
IEEE/WIC/ACM International Conference on, vol. 3. IEEE, 2010, pp.
57–60.
357