
Categorizing Overlapping Regions in Clustering Analysis Using Three-Way Decisions

This document summarizes a research paper presented at the 2014 IEEE/WIC/ACM International Joint Conferences on Web Intelligence (WI) and Intelligent Agent Technologies (IAT) that proposes using three-way decisions to categorize overlapping regions in clustering analysis. The paper argues that existing soft clustering methods only indicate whether objects belong to overlapping regions but do not provide more detail. The three-way decisions method categorizes overlapping regions more precisely to help analysts understand how objects in overlapping regions differently impact cluster construction. An algorithm is also introduced that uses relation graphs to obtain different types of overlapping regions.


2014 IEEE/WIC/ACM International Joint Conferences on Web Intelligence (WI) and Intelligent Agent Technologies (IAT)

Categorizing Overlapping Regions in Clustering Analysis Using Three-way Decisions

Hong Yu∗, Peng Jiao∗, Guoyin Wang∗ and Yiyu Yao†

∗ Chongqing Key Laboratory of Computational Intelligence, Chongqing University of Posts and Telecommunications, Chongqing, 400065, China
Email: {yuhong,wanggy}@cqupt.edu.cn
† Department of Computer Science, University of Regina, Regina, Saskatchewan, Canada S4S 0A2

Email: [email protected]

DOI 10.1109/WI-IAT.2014.118 (978-1-4799-4143-8/14 $31.00 © 2014 IEEE)

Abstract: Clustering is a common technique for data analysis and has been widely used in many practical areas. In many real applications, such as social network analysis, wireless sensor networks, and document clustering, different clusters overlap for various reasons. In this paper, we propose to use the three-way decisions approach to categorize overlapping regions. In contrast to existing soft clustering methods, which only indicate whether an object lies in an overlapping region, the three-way decisions method gives system operators a finer categorization for further analysis, one that shows clearly that objects have different impacts on the construction of clusters. In addition, we provide a new relation-graph based clustering algorithm to obtain the different overlapping region types. Comparison experiments show that the results are better and more reasonable for overlapping clustering.

I. INTRODUCTION

Clustering is one of the most important research fields in machine learning and data mining. In recent years, clustering has been widely used in many areas such as business intelligence, image recognition and biological information [1]. However, in some applications, such as social network analysis [2], wireless sensor networks [3], and document clustering [4], it is common for objects to belong to more than one cluster. In those areas, discovering overlapping regions is important and meaningful.

For different application fields, researchers have proposed different overlapping clustering methods. Aydin et al. [5] proposed an overlapping clustering algorithm for ad hoc networks that improves network reliability and load balancing. For the DBLP data, Obadi et al. [6] introduced an overlapping clustering method that handles the problem that a paper may correspond to multiple topics. Lingras et al. [7] compared crisp and fuzzy clustering on a mobile phone call dataset and pointed out that fuzzy clustering can capture the objects that will split from a single cluster into two or more clusters when the number of clusters is increased.

There are also some results that combine approaches to uncertainty such as fuzzy set theory and rough set theory. For example, Peters et al. [8] proposed an overlapping clustering algorithm that, by keeping roughness constant in each cycle, can detect the structure of clusters in a dynamic environment. Lai et al. [9] extended rough fuzzy k-means and proposed the GRFKM algorithm, which requires less computing time and fewer threshold settings. The limitation of fuzzy k-means and rough k-means is that the number of clusters must be set in advance, which is difficult to obtain in a real environment. There is also research on graph-based clustering algorithms. Al Hasan et al. [10] proposed the overlapping clustering algorithm SimClus, which creates a similarity graph using a threshold β and then covers the graph by finding a special kind of subgraph, the star-shaped subgraph, each of which is a cluster. The limitation of SimClus is that it creates a large number of clusters and may build clusters with severe overlapping. Pérez et al. [11] proposed the OClustR algorithm, which is also graph-based and uses a new graph-covering strategy and a new filtering strategy to solve the problem of SimClus to some extent.

Overlapping clustering algorithms have also been widely used for overlapping community detection in complex networks. Palla et al. [12] proposed the first overlapping community detection algorithm, CPM, which assumes that a community or cluster is made up of k-cliques. The method finds all k-cliques, and two k-cliques overlap if they share a node. The main drawbacks of CPM are that its computational complexity is exponential and that its definition of community is too strict. Lancichinetti et al. [13] proposed the LFM algorithm, which uses local expansion to discover overlapping communities; its limitation is that the expansion strategy may cause repetitive computation. Lee et al. [14] proposed the GCE algorithm to solve this problem; it uses k-cliques as seeds and expands the seeds to build clusters, but it suffers from the problem of how to choose k. Shi et al. [15] proposed an overlapping community detection algorithm that clusters edges instead of nodes to create overlapping communities; a node is overlapping if the links connected to it are put in more than one community. However, this algorithm may create a large number of clusters.

However, little of the existing research addresses questions such as what the relationship is between the objects in overlapping regions, and whether these objects are equally significant to the clustering process. Most soft clustering methods only indicate whether an object belongs to more than one cluster. For example, suppose there are two clusters, football fans and basketball fans; the overlap of the two clusters means that someone likes both football and basketball. However, there are several situations: one person is a fanatic of both football and basketball, another is just an amateur of both, and someone may like football more than basketball or just the opposite. We can see that, although they are all in the overlapping region, they have
different semantics. Therefore, in this paper we study the categorization of overlapping regions by using three-way decisions theory.

Three-way decisions were proposed to make up for some drawbacks of binary decisions and are widely used in uncertain information processing such as clinical decision-making, environmental management, paper review, and so on [16]. Three-way decisions theory extends binary decisions (accept, reject) to three-way decisions (accept, defer, reject). Consider the relationship between an object and a cluster: the object certainly belongs to the cluster, the object certainly does not belong to the cluster, or the object might belong to the cluster. This is a typical three-way decision.

A cluster is usually described by a single set. We instead represent a cluster by an interval set [17], which is defined by a pair of sets called the lower and upper bounds. Under this representation, the objects in the lower bound of a cluster definitely belong to the cluster, the objects not in the upper bound of a cluster do not belong to the cluster, and the objects between the two bounds might belong to the cluster. We have proposed preliminary results based on three-way decisions, such as an overlapping clustering method and an ensemble clustering framework [18][19].

Therefore, to gain further insight into overlapping regions, this paper proposes a strategy based on three-way decisions to categorize overlapping regions in Section III, and proposes a new algorithm based on the relation graph to obtain different types of overlapping regions in Section IV. The comparison experiments on several standard complex network data sets are described in Section V. Finally, the conclusions are presented in Section VI.

II. FORMULATION OF CLUSTERING BY INTERVAL SETS

Let the universe be $U = \{x_1, \cdots, x_n, \cdots, x_N\}$. The result of clustering is $RC = \{C_1, \cdots, C_k, \cdots, C_K\}$, which is a family of clusters of $U$. Each $x_n$ is an object with $D$ attributes, that is, $x_n = (x_n^1, \cdots, x_n^d, \cdots, x_n^D)$, where $x_n^d$ is the value of object $x_n$ on attribute $d$, $n \in \{1, \cdots, N\}$, and $d \in \{1, \cdots, D\}$.

Yao, Lingras et al. [17] formulated clustering using interval sets. Naturally, the region between the lower and upper bounds of an interval set represents the overlapping region.

That is, we use an interval set to represent a cluster in $RC$, namely, $C_k = [\underline{C_k}, \overline{C_k}]$. Here $\underline{C_k}$ is the lower bound of $C_k$, whose objects definitely belong to $C_k$; it is also called the positive region of $C_k$, denoted by $POS(C_k)$. $BND(C_k) = \overline{C_k} - \underline{C_k}$ is the boundary region of $C_k$, whose objects may belong to $C_k$; $NEG(C_k) = U - \overline{C_k}$ is the negative region of $C_k$, whose objects definitely do not belong to $C_k$.

Then, we have the following clustering result formulated by interval sets:

$$RC = \{[\underline{C_1}, \overline{C_1}], \cdots, [\underline{C_k}, \overline{C_k}], \cdots, [\underline{C_K}, \overline{C_K}]\}.$$

Obviously, the clusters satisfy the following properties:

(i) $\underline{C_k} \neq \emptyset$, $1 \le k \le K$; (ii) $\bigcup_{C_k \in RC} \overline{C_k} = U$.

Property (i) requires that each cluster cannot be empty. Property (ii) states that every $x \in U$ belongs to at least one cluster. Furthermore, if $C_i \cap C_j = \emptyset$ for all $i \neq j$, it is a crisp clustering; otherwise it is an overlapping clustering.

III. CATEGORIZATION OF OVERLAPPING REGIONS

Since the positive region and the boundary region together represent a cluster accurately, there are obviously three cases of overlapping: overlapping between positive regions, overlapping between a positive region and a boundary region, and overlapping between boundary regions. Let $C_i$, $C_j$ be any two clusters in $RC$; the overlapping regions can be categorized into the following three types.

Case 1: Overlapping between positive regions, denoted by $POP(C_i, C_j)$, or $POP$ for short. Namely, $POS(C_i) \cap POS(C_j) \neq \emptyset$, which can also be written as $\underline{C_i} \cap \underline{C_j} \neq \emptyset$.

Case 2: Overlapping between the positive region of $C_i$ and the boundary region of $C_j$, denoted by $POB(C_i, C_j)$, or $POB$ for short. Namely, $POS(C_i) \cap BND(C_j) \neq \emptyset$, which can also be written as $\underline{C_i} \cap (\overline{C_j} - \underline{C_j}) \neq \emptyset$.

Case 3: Overlapping between boundary regions, denoted by $BOB(C_i, C_j)$, or $BOB$ for short. Namely, $BND(C_i) \cap BND(C_j) \neq \emptyset$, which can also be written as $(\overline{C_i} - \underline{C_i}) \cap (\overline{C_j} - \underline{C_j}) \neq \emptyset$.

As can be seen above, we categorize overlapping regions into three cases. The example about football fans and basketball fans described in Section I can therefore be explained in this model. Denote football fans as the cluster $C_i$ and basketball fans as the cluster $C_j$. Then, if someone is a fanatic of football and basketball, he/she belongs to $POP(C_i, C_j)$; if someone is an amateur of football and basketball, he/she belongs to $BOB(C_i, C_j)$; if someone likes football more than basketball, he/she belongs to $POB(C_i, C_j)$; if someone likes basketball more than football, he/she belongs to $POB(C_j, C_i)$.

In fact, there may be more overlapping situations; for example, two clusters may overlap in both $POP$ and $POB$. Thus, we need to refine the categorization. We call the above three cases macro types. Table I describes all micro types, where the symbol "◦" denotes that the overlapping type exists and the symbol "×" denotes that it does not exist. It can be seen that we categorize overlapping regions into 4 macro types or 8 micro types, counting the case with no overlapping (micro type H). Macro type TYPE.1 means that the $POP$ case exists between two clusters; micro types A, B, C, and D are the four categories belonging to TYPE.1. TYPE.2 means that the $POP$ case does not exist but at least the $POB$ case exists; micro types E and F are the two subtypes belonging to TYPE.2. TYPE.3 means that only the $BOB$ case exists; micro type G is the only subcase belonging to TYPE.3.

In order to characterize the diverse overlapping regions, we introduce the overlapping degree between clusters. In this paper, we mainly discuss the three macro types of overlapping regions between clusters. That is, we evaluate the overlapping degree between clusters in the following three respects:
$DegPOP(C_i, C_j)$, the overlapping degree of positive regions, which reflects the degree of similarity between two clusters; $DegPOB(C_i, C_j)$, the overlapping degree between the positive region of $C_i$ and the boundary region of $C_j$, which reflects the degree of influence from $C_j$ to $C_i$; and $DegBOB(C_i, C_j)$, the overlapping degree between boundary regions, which reflects the size of the neutral space between two clusters.

Generally speaking, specific computing formulae can be designed according to the actual application background. To show that the categorization is practical, we propose a concrete algorithm based on three-way decisions in the next section, where these formulae are also described.

TABLE I: Categorization of Overlapping Regions

Traditional         Categorization for Overlapping Regions
                    Macro Type         Micro Type   POP   POB   BOB
Overlapping         TYPE.1             A            ◦     ◦     ◦
Overlapping         TYPE.1             B            ◦     ◦     ×
Overlapping         TYPE.1             C            ◦     ×     ◦
Overlapping         TYPE.1             D            ◦     ×     ×
Overlapping         TYPE.2             E            ×     ◦     ◦
Overlapping         TYPE.2             F            ×     ◦     ×
Overlapping         TYPE.3             G            ×     ×     ◦
None Overlapping    None Overlapping   H            ×     ×     ×

IV. ALGORITHM

In order to find the overlapping regions and show the different types of overlapping regions, this section introduces an overlapping clustering algorithm, TDC-RG (Three-way Decisions Overlapping Clustering Algorithm Based on the Relation Graph). Subsection IV-A introduces the relation graph, Subsection IV-B explains how to categorize the nodes in a relation graph, and the algorithm is described in Subsection IV-C.

A. Relation Graph

In the real world, relationships generally exist between individuals or objects. For example: if two people are friends, there is a friend relationship between them; if two editors edit the same page on Wikipedia, they may have a common interest relationship; if two researchers publish a paper together, they may have a cooperation relationship; if two objects are close in Euclidean space, they may have a similarity relationship.

Meanwhile, the importance of relationships may differ, and we usually set a weight to describe it. For example, the weight of the friend relationship between good friends is larger than that between ordinary friends; the more pages two editors have edited in common, the stronger their common interest; the more papers two researchers have published together, the firmer their cooperation relationship.

Such relationships between individuals, people, or objects can be mapped into a relation graph easily. Namely, in the graph, a node represents an individual, person or object; an edge between nodes represents the relationship between them; and the weight of an edge represents the weight of the relationship. We can study clustering problems with the help of the relation graph.

Definition 1 (Relation Graph): $G = \langle V, E, W \rangle$ is a relation graph, where $V$ is the set of nodes, $E$ is the set of edges, and $W$ is the set of weights between nodes.

In order to convert $U$ to the relation graph (RG), set $V = U$, so that $v_n \in V$ is the mapping of object $x_n \in U$; $(u, v) \in E$ means that an edge exists between nodes $u$ and $v$, and $w(u, v)$ is the weight between them.

In the relation graph, we also use interval sets to represent clusters. That is, a cluster $C_k$ is a connected subgraph $S \subseteq G$, $\underline{C_k}$ is a connected subgraph $\underline{S} \subseteq G$, and $\overline{C_k}$ is also a connected subgraph $\overline{S} \subseteq G$, with $\underline{S} \subseteq S \subseteq \overline{S} \subseteq G$. For convenience, we still use the lower bound $\underline{C_k}$ and the upper bound $\overline{C_k}$ to represent the cluster $C_k$.

The basic notions in a relation graph are defined as follows.

Definition 2 ($Adj(v)$): For a node $v$, $Adj(v) = \{u \mid u \in V \wedge (u, v) \in E\}$ is the set of nodes that share an edge with $v$.

Definition 3 ($Weight(v)$): For a node $v$, $Weight(v) = \sum_{u \in Adj(v)} w(u, v)$ is the sum of the weights on the edges between $v$ and its adjacent nodes.

Definition 4 ($LIV(v)$): The local importance of a node $v$ is defined as

$$LIV(v) = \frac{A}{|Adj(v)|}$$

where $|Adj(v)|$ is the number of nodes in $Adj(v)$, and $A$ denotes the number of nodes in $Adj(v)$ whose $Weight$ is not greater than the $Weight$ of node $v$; namely, for each node $u \in Adj(v)$, if $Weight(u) \le Weight(v)$, $A$ is increased by 1.

Obviously, if the $Weight$ of every adjacent node of $v$ is greater than that of $v$, then $LIV(v) = 0$; if the $Weight$ of every adjacent node of $v$ is smaller than or equal to that of $v$, then $LIV(v) = 1$. Generally speaking, the greater the $Weight$ of a node, the more important it is. Therefore, $LIV(v)$ reflects the local importance of a node to a certain extent.

B. Classify Nodes

Different objects usually play different roles in a cluster. Some are the guiding spirits and core leaders of the cluster; some are very important and are the backbone members of the cluster; some are not very important to the construction of the cluster and are trivial members, which have little influence on the cluster. For example, a community usually has a few leading members who have great influence in the community; in a research area, although there are many researchers, there are only a few top researchers who can guide the research directions. Although the bone objects are not the core objects, they are still of great importance in building the cluster, and they are its backbone; by contrast, trivial objects belong to the cluster but have little influence on building it.

Based on the above facts, we classify the nodes in a relation graph into three types: core nodes, bone nodes and trivial nodes. According to Definition 4, $LIV(v)$ reflects the local
importance of a node $v$; thus we use $LIV(v)$ to define the three types of nodes as follows.

Definition 5 (Core Node): A node $v \in V$ with $LIV(v) = 1$ is a core node.

Definition 6 (Bone Node): A node $v \in V$ with $0 < LIV(v) < 1$ is a bone node.

Definition 7 (Trivial Node): A node $v \in V$ with $LIV(v) = 0$ is a trivial node.

In addition, the subgraph consisting of the core nodes and bone nodes is called the bone graph, and the subgraph consisting of the trivial nodes is called the trivial graph.

Let us take Zachary's karate club [20] as an example to observe the distribution of the different types of nodes. This data set is a typical social network of friendships between 34 members of a karate club at a US university in the 1970s; the club split into two groups after a dispute between the administrator ($v_{34}$) and the instructor ($v_1$). Fig. 1 shows the distribution of nodes in the data set, where a node denotes a member of the club and an edge between nodes means that the two members are friends.

Fig. 1: Node Distribution in Zachary's Karate Club

According to the above definitions, the core nodes, bone nodes and trivial nodes are found and depicted in Fig. 1. The dark nodes are core nodes, the gray nodes are bone nodes and the colorless nodes are trivial nodes. It can be seen that the administrator ($v_{34}$) and the instructor ($v_1$) are the core nodes, the bone nodes are distributed along the backbone of each group, and the trivial nodes are distributed on the fringe of each group, which reflects the relationships between the members of this club.

This discussion inspires us to propose a new clustering algorithm, which can be divided into three main phases. In the first phase, the relation graph is divided into the bone graph and the trivial graph according to the classification of its nodes. In the second phase, a preliminary clustering is executed on the bone graph to obtain a preliminary clustering result. In the third phase, the nodes in the trivial graph are merged into the preliminary clusters to obtain the final clustering result. In the preliminary clustering phase, the algorithm computes only on the bone graph, which is a relatively small space because the trivial nodes are temporarily set aside; this speeds up the algorithm without losing much information about the relation graph.

C. Description of the TDC-RG Algorithm

This subsection describes the three-way decisions overlapping clustering algorithm based on the relation graph, abbreviated as the TDC-RG algorithm. Subsection IV-C1 introduces how to obtain a cluster core. Subsection IV-C2 describes how to expand a cluster core to get a preliminary clustering result. Subsection IV-C3 introduces how to assign the trivial nodes to the preliminary clusters, and Subsection IV-C4 introduces how to merge clusters.

1) Cluster Core:

In Subsection IV-B, nodes are classified into three types: core nodes, bone nodes and trivial nodes. Since a core node is a very important node in the relation graph, we construct some subgraphs, called cluster cores, from a core node.

Definition 8 ($ClusterCore(v)$): Let $v$ be a core node of a relation graph $G$, and let $G_g(v)$ be the connected subgraph consisting of the core node $v$ and $Adj(v)$. A subgraph of $G_g(v)$ constructed by the following steps is called a cluster core of $v$. The set of all cluster cores of $v$ is denoted by $ClusterCore(v)$.

The process of constructing $ClusterCore(v)$:
Step a: Build the subgraph $G_g(v)$, which consists of the core node $v$ and $Adj(v)$.
Step b: Remove the core node $v$ from $G_g(v)$.
Step c: Find the maximal connected subgraphs, excluding those that have only one node.
Step d: Build a cluster core by adding the core node $v$ to a connected subgraph obtained in Step c; all such cluster cores form $ClusterCore(v)$.

Fig. 2: Process of Constructing Cluster Cores (panels (a) to (d))

Fig. 2 gives an example of constructing cluster cores. Fig. 2a is a subgraph $G_g(v)$ consisting of the core node $v$ and its adjacent nodes $Adj(v)$; Fig. 2b depicts Step b; Fig. 2c shows that there are 3 maximal connected subgraphs with more than one node and 2 maximal connected subgraphs with only one node; Fig. 2d shows the three resulting cluster cores.

As shown in Fig. 2d, each cluster core contains a core node and a group of densely connected nodes, so it is suitable as an initial structure to expand. Meanwhile, the core node may appear in more than one cluster core.
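Definitions 2 to 8 above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the authors' implementation: the adjacency-map representation and the names `build_graph`, `weight`, `liv`, `classify`, and `cluster_cores` are hypothetical, introduced only for illustration, and returning LIV = 0 for a node without neighbors is an assumption, since the paper does not define that case.

```python
from collections import defaultdict


def build_graph(weighted_edges):
    """Build an undirected relation graph as {node: {neighbor: weight}}."""
    graph = defaultdict(dict)
    for u, v, w in weighted_edges:
        graph[u][v] = w
        graph[v][u] = w
    return dict(graph)


def weight(graph, v):
    """Definition 3: sum of the weights on the edges incident to v."""
    return sum(graph[v].values())


def liv(graph, v):
    """Definition 4: share of neighbors whose Weight does not exceed Weight(v)."""
    neighbors = graph[v]
    if not neighbors:
        return 0.0  # assumption: isolated nodes are not covered by the paper
    own = weight(graph, v)
    a = sum(1 for u in neighbors if weight(graph, u) <= own)
    return a / len(neighbors)


def classify(graph, v):
    """Definitions 5-7: core (LIV = 1), trivial (LIV = 0), bone (otherwise)."""
    value = liv(graph, v)
    if value == 1:
        return "core"
    if value == 0:
        return "trivial"
    return "bone"


def cluster_cores(graph, v):
    """Definition 8, Steps a-d: cluster cores built around the core node v."""
    remaining = set(graph[v])  # Steps a-b: Adj(v), i.e. G_g(v) without v
    cores = []
    while remaining:
        start = remaining.pop()
        component, stack = {start}, [start]
        while stack:  # Step c: grow one maximal connected subgraph
            node = stack.pop()
            for nxt in graph[node]:
                if nxt in remaining:
                    remaining.remove(nxt)
                    component.add(nxt)
                    stack.append(nxt)
        if len(component) > 1:  # single-node subgraphs are discarded
            cores.append(component | {v})  # Step d: add v back
    return cores
```

As a usage illustration on a toy graph (not data from the paper): if a node `v` is linked to `a`, `b`, `c`, `d`, `e` with unit weights and the pairs `a`-`b` and `c`-`d` are also linked, then `classify` labels `v` as core, `a` as bone and `e` as trivial, and `cluster_cores(graph, "v")` returns the two cores {v, a, b} and {v, c, d}; `e` forms a single-node subgraph and is dropped in Step c.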
Fig. 3: $ClusterCore(v_1)$

Next, we take Fig. 3 as an example to explain the advantage of the proposed method. The figure is a subgraph of the bone graph of Zachary's karate club in Fig. 2; it contains node $v_1$ and $Adj(v_1)$. There are two cluster cores corresponding to the core node $v_1$, namely, $ClusterCore(v_1)$ = { ($v_1, v_5, v_6, v_7, v_{11}$), ($v_1, v_2, v_3, v_4, v_9$, }. Obviously, $v_1$ belongs to both clusters. Thus, this method of constructing cluster cores is good at finding overlapping objects.

2) Preliminary Clustering:

In the preliminary clustering process, we build a cluster by adding adjacent nodes to each cluster core, which is also called the expanding method. Several expanding methods already exist, such as LFM [13], GCE [14], EM-BOAD [21], and so on. We expand a cluster by using a fitness function to decide which node can be added. The fitness function [22] is given as follows.

$$f(C) = \frac{w_{in}^{C}}{w_{in}^{C} + w_{out}^{C}} \tag{1}$$

where $w_{in}^{C}$ and $w_{out}^{C}$ are the total internal and external weights of the cluster $C$ in RG. $w_{in}^{C}$ equals the sum of the weights on the internal edges of the cluster, and $w_{out}^{C}$ equals the sum of the weights on the edges between internal nodes of the cluster and nodes outside the cluster. Therefore, the fitness of a node with respect to a cluster is given as:

$$\mu(v, C) = f(C \cup \{v\}) - f(C) \tag{2}$$

Let $\alpha$ and $\beta$ be two thresholds; the preliminary clustering algorithm based on three-way decisions is described in Algorithm 1.

Algorithm 1: The Preliminary Clustering Algorithm
Input: the bone graph, $ClusterCore(v)$, $\alpha$, $\beta$.
Output: the preliminary clustering result.
1: Choose a cluster core that has not been extended as the initial cluster $C$. If all cluster cores have been extended, stop the algorithm.
2: For each bone node $v$ adjacent to $C$, calculate $\mu(v, C)$ according to Equation (2).
3: Choose the node $v$ with the largest $\mu(v, C)$.
4: If $\mu(v, C) > \alpha$, add the node $v$ to the positive region of the cluster $C$ and go to 2; otherwise, if $\mu(v, C) \ge \beta$, add the node to the boundary region of the cluster $C$ and go to 2; if $\mu(v, C) < \beta$, go to 1.

3) Merge Trivial Nodes:

After obtaining the preliminary clustering result, we need to merge the trivial nodes into the preliminary clusters.

Theorem 1: A trivial node must be adjacent to a preliminary cluster.

Proof: By contradiction. Suppose a trivial node $v$ is not adjacent to any preliminary cluster. Then every $u \in Adj(v)$ is a trivial node, so $u$ and $v$ are trivial nodes adjacent to each other. However, according to Definition 4 and Definition 7, two trivial nodes cannot be adjacent: $LIV(v) = 0$ requires $Weight(u) > Weight(v)$, while $LIV(u) = 0$ requires $Weight(v) > Weight(u)$. Hence the assumption is false, and the theorem is proven.

Thus, a trivial node $v$ must be adjacent to at least one preliminary cluster. We then calculate $\mu(v, C)$ between $v$ and each such cluster $C$. If $\mu(v, C) > \alpha$, we assign the node to the positive region of the cluster; if $\mu(v, C) < \alpha$ but $\mu(v, C) \ge \beta$, we assign it to the boundary region of the cluster; if $\mu(v, C) < \beta$ for all clusters, we assign the node to the boundary region of the cluster with the highest $\mu(v, C)$.

4) Merge Clusters:

Once the preliminary clustering result is obtained, some clusters might overlap with each other. It is then more reasonable to merge similar clusters into one larger cluster. As pointed out in Section III, $DegPOP(C_i, C_j)$ reflects the degree of similarity between two clusters.

Here, we first introduce the method in [14], which defines the similarity between clusters as $\delta(C_i, C_j)$, calculated by the following equation.

$$\delta(C_i, C_j) = \frac{|C_i \cap C_j|}{\min(|C_i|, |C_j|)} \tag{3}$$

This naturally suggests defining the computing formula of $DegPOP(C_i, C_j)$ as follows.

$$DegPOP(C_i, C_j) = \frac{|\underline{C_i} \cap \underline{C_j}|}{\min(|\underline{C_i}|, |\underline{C_j}|)} \tag{4}$$

Next, we show that the definition of $DegPOP(C_i, C_j)$ is more reasonable than the definition of $\delta(C_i, C_j)$, though both of them are used to describe the degree of overlapping between clusters. Suppose there is a threshold $\varepsilon$. Generally speaking, if $\delta(C_i, C_j) > \varepsilon$ or $DegPOP(C_i, C_j) > \varepsilon$, we can merge the two clusters into one cluster. The method based on $\delta(C_i, C_j)$ will merge clusters whenever their boundary regions have a high degree of overlapping, which does not always accord with the facts.

Take Fig. 4 as an example: the solid line circles the positive region of a cluster, the dashed line circles the boundary region of a cluster, and the overlapping region between the two clusters is denoted by colorless nodes. It can be seen that the colorless nodes belong to the overlapping regions,
but those nodes do not have very close relationships with the two clusters, and it is more reasonable for these objects to belong to the boundary region. If we set $\varepsilon = 0.5$, then according to Equation (3) we have $\delta(C_i, C_j) = 0.5$, and the two clusters will be merged into one cluster. However, according to Equation (4), we have $DegPOP(C_i, C_j) = 0$, so the two clusters will not be merged into one cluster and the boundary region is preserved; this result coincides with the facts.

Fig. 4: Overlapping Between Boundary Regions

Thus, when $DegPOP(C_i, C_j) > \varepsilon$, we first merge the positive regions of these two clusters into the positive region of a cluster $C$. Then, we calculate $\mu(v, C)$ between $v$ and $C$, where $v$ is a node in the boundary regions of the two clusters. If $\mu(v, C) > \alpha$, we assign the node to the positive region of cluster $C$; otherwise, we assign the node to the boundary region of $C$.

V. EXPERIMENTS

This section presents experiments that compare TDC-RG with other overlapping clustering methods, in order to evaluate the performance of the TDC-RG algorithm and to explain the significance of the overlapping types. The data sets used here are real networks: Zachary's karate club network [20], Lusseau's bottlenose dolphins network [25] and the network of books about US politics [26].

The compared methods include: the LFM [13] algorithm, which is a local expanding overlapping clustering algorithm; the GaoCD [15] algorithm, which is a genetic algorithm that detects overlapping communities with link clustering; the DenShrink [23] algorithm, which is a hierarchical network clustering algorithm that finds hubs, where hubs are bridge nodes between clusters and likely overlapping nodes; the EM-BOAD [21] algorithm, which is an overlapping clustering algorithm based on seed expansion; and Sun's algorithm [24], which is a fuzzy clustering algorithm that detects overlapping clusters. We test our algorithm on the same data sets that are used in these references and compare our results directly with the results reported there.

Table II shows which data sets were tested by which algorithms, where the symbol "◦" means the algorithm was tested on the data set and the symbol "×" means the algorithm was not tested on the data set.

TABLE II: Data Sets and Algorithms

Algorithm        Karate   Dolphins   Politics
LFM [13]         ◦        ◦          ×
GaoCD [15]       ×        ◦          ◦
DenShrink [23]   ◦        ×          ×
EM-BOAD [21]     ◦        ◦          ×
Sun [24]         ◦        ◦          ×

In the figures that follow, the solid line draws the positive region, the dashed line draws the boundary region, and the thick line draws the actual partition of clusters; the dark dots are core nodes, the gray dots are bone nodes, and the colorless dots are trivial nodes.

A. Zachary's Karate Club Network

Zachary's karate club network, which consists of 34 nodes and 78 edges, has been widely used as a benchmark for clustering algorithms.

Fig. 5: Clustering Result of TDC-RG Algorithm in Zachary's Karate Club Network

TABLE III: Overlapping Nodes Found by Different Algorithms in Karate Club Network

Algorithm        Overlapping Type    Overlapping Nodes
TDC-RG           POP(C2, C3)         9
TDC-RG           POB(C3, C2)         31
TDC-RG           BOB(C2, C3)         10
LFM [13]                             3, 9, 10, 14, 31
DenShrink [23]                       10, 20
EM-BOAD [21]                         3
Sun [24]                             9, 10, 14, 20

Fig. 5 shows the result of the TDC-RG algorithm. It finds three clusters and two core nodes, node 1 and node 34, which are exactly the instructor and the administrator. The TDC-RG algorithm divides one of the real clusters into two clusters, $C_1$ and $C_2$. Further observation shows the reason: the overlapping type between $C_1$ and $C_2$ is type D, which means the two clusters overlap only in their positive regions. We can also observe that node 1 is the core node and the only node that bridges the two clusters. Thus, dividing this part into two clusters and placing node 1 in the overlapping region is very reasonable. Node 12 is a trivial node here, which has less influence than node 1. But node 12 definitely belongs to $C_1$ and $C_2$ through node 1, so it is reasonable to assign it to the positive regions of the two clusters. On the other hand, if node 12 had more connections with one of the clusters, it could become a bone node. It is an overlapping node in our method, which reveals that the node is a potential member to be developed for one of the clusters.

By contrast, LFM [13], EM-BOAD [21], and Sun [24] directly find two clusters, and DenShrink [23] finds four clusters,
but it don’t point out the node 1 is the overlapping node.
Tabel III shows the overlapping nodes found by these methods.
There also exists overlapping between C2 and C3 , most of the
above algorithms can find the overlapping region. From the
Fig.5, we can observe that the most significance overlapping
nodes between C2 and C3 are the node 9 and the node 10,
the proposed TDC-RG algorithm, the LFM and Sun can find
them; some algorithms just find part of them.
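The region labels used in this section can be read as set intersections between interval-set clusters. The following is a minimal sketch of that reading, not the TDC-RG algorithm itself. It assumes that POP(Ci, Cj) is the intersection of the two positive regions, POB(Ci, Cj) is the intersection of the positive region of Ci with the boundary region of Cj, and BOB(Ci, Cj) is the intersection of the two boundary regions. The class name `IntervalCluster`, the function `overlap_regions`, and the toy node sets are hypothetical and are not taken from the karate club result.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IntervalCluster:
    """Hypothetical container: a cluster given by its positive and boundary regions."""
    pos: frozenset  # nodes that definitely belong to the cluster
    bnd: frozenset  # nodes that possibly belong to the cluster


def overlap_regions(ci, cj):
    """Return the overlapping regions of two clusters under the assumed definitions."""
    return {
        "POP(Ci,Cj)": ci.pos & cj.pos,  # positive in both clusters
        "POB(Ci,Cj)": ci.pos & cj.bnd,  # positive in Ci, boundary in Cj
        "POB(Cj,Ci)": cj.pos & ci.bnd,  # positive in Cj, boundary in Ci
        "BOB(Ci,Cj)": ci.bnd & cj.bnd,  # boundary in both clusters
    }


# Toy clusters with made-up node ids (assumption, for illustration only).
c_a = IntervalCluster(pos=frozenset({1, 2, 3, 4}), bnd=frozenset({5, 6}))
c_b = IntervalCluster(pos=frozenset({4, 5, 7, 8}), bnd=frozenset({3, 6}))

regions = overlap_regions(c_a, c_b)
for name, nodes in regions.items():
    print(name, sorted(nodes))
```

Under these assumptions, a pair of clusters for which only the POP entry is non-empty matches the description of type D given above, where the clusters overlap only in their positive regions.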
In contrast to the other methods, which treat all nodes in an overlapping region equally, the proposed method divides these overlapping nodes into different types. For example, node 9 belongs to POP(C2, C3), node 10 belongs to BOB(C2, C3), and node 31 belongs to POB(C3, C2).

Further observation shows the advantage of the proposed method. Zachary observed that the club split into two groups after a dispute; the administrator (node 34) and the instructor (node 1) are the key persons of the two factions, respectively. In fact, before the fission each member chose a faction to support, and after the fission each member chose a new group to join.

Consider the factions and the choices. Node 9 first supported the administrator but finally joined the instructor's group. Zachary attributes this to the fact that node 9 was only three weeks away from a black belt test, so he had to join the instructor's group to keep practicing. Thus, node 9 is assigned to POP(C2, C3), which reflects how hard it was for him to decide which cluster he should belong to. Node 10 supported neither faction but finally joined the administrator's group. Thus, we assign it to BOB(C2, C3), which shows that the node is neutral between the two factions. Node 31 supported the administrator and finally joined the administrator's group. From Fig. 5, we can see that node 31 supports the administrator but also has some connections with the instructor's group. Thus, it is reasonable to assign it to POB(C3, C2).

B. Dolphin Social Network

The dolphin social network is a social network of bottlenose dolphins, compiled by Lusseau et al. [25] after seven years of study. It consists of 62 nodes and 159 edges. It is generally believed to contain two clusters.

Fig. 6 shows the result of the TDC-RG algorithm, which finds two clusters. By comparison, LFM [13] and Sun [24] also find two clusters, EM-BOAD [21] finds three clusters, and GaoCD [15] finds more than 10 clusters. Table IV shows the overlapping nodes found by these methods, except GaoCD, which finds too many clusters. Compared with the other methods, our method not only finds more overlapping nodes but also points out their different significance by dividing them into different types.

For example, we assign nodes 24, 29, and 37 to POB(C1, C2) because these nodes clearly belong to C1 but also have many connections with C2; the algorithm assigns nodes 2, 40, and 42 to POB(C2, C1) for a similar reason.

Furthermore, Lusseau and Newman [27] point out that this network may completely fall into two parts because of the departure of SN100 (node 37). As shown in Table IV, only the proposed method successfully finds the special role of SN100 in the overlapping regions.

Fig. 6: Clustering Result of TDC-RG Algorithm in Dolphin Social Network

TABLE IV: Overlapping Nodes Found by Different Algorithms in Dolphin Social Network

Algorithm       Region         Overlapping Nodes
TDC-RG          POP(C1, C2)    20
                POB(C1, C2)    24, 29, 37
                POB(C2, C1)    2, 40, 42
                BOB(C1, C2)    8
LFM [13]                       8, 20, 29, 31, 40
EM-BOAD [21]                   40
Sun [24]                       8, 29, 31, 40

C. Books About US Politics Network

The Books about US Politics network [26] is a network of books about US politics published around the time of the 2004 presidential election and sold by the online bookseller Amazon.com. It consists of 105 nodes and 441 edges; edges between books represent frequent co-purchasing of books by the same buyers. According to political inclination, the books are divided into three classes: liberalism, conservatism, and neutralism.

Fig. 7 shows the result of the TDC-RG algorithm. The thick line divides the data set into two clusters: C1 represents conservatism and C2 represents liberalism. The nodes labeled with digits are the neutral books; there are 13 such nodes. Liberalism and conservatism can be easily distinguished, and our method finds the two main clusters well. The overlapping type of the two clusters is type E, that is, there is no POP case between conservatism and liberalism, which reflects the reality of the situation.

However, the neutral books are scattered and difficult to form into a cluster. To deal with this situation, we assign these nodes to different types of overlapping regions according to their different significance. For example, we assign node 48 to POB(C1, C2), which means this book tends toward conservatism; nodes 4, 7, 26, 69, and 104 are assigned to BOB(C2, C1) because they are purely neutral books; we assign nodes 0, 6, and 46 to POS(C1) and node 51 to BND(C1), which means these books are preferred by buyers of conservative books; we assign node 76 to POS(C2)
and node 103 to BND(C2), which means these books are preferred by buyers of liberal books.

By contrast, GaoCD [15] also finds the two main clusters, but it creates more than 10 clusters for the remaining nodes, which hardly reveals the different inclinations of the neutral nodes.

Fig. 7: Clustering Result of TDC-RG Algorithm in Books About US Politics Network

VI. CONCLUSION

This paper introduced a further analysis of overlapping regions based on three-way decisions, to reveal the different significance that the objects in these regions have for the clustering process. A cluster is represented by an interval set, whose lower bound and upper bound give the positive region and boundary region of the cluster, respectively. The overlapping regions are then categorized into 4 macro types or 8 micro types. Thus, compared with other overlapping clustering algorithms, the proposed method can find the different semantics of objects in overlapping regions. In addition, a new overlapping clustering algorithm based on a relation graph is proposed, in which cluster cores, bone nodes, and trivial nodes are defined. Compared with other overlapping clustering algorithms on several real social networks, the proposed algorithm yields reasonable results and distinguishes the roles of overlapping nodes more finely. Obtaining the different overlapping region types with an algorithm of lower computational complexity is left for future work.

ACKNOWLEDGMENT

This work was supported in part by the National Natural Science Foundation of China under grants No. 61379114 and No. 61272060.

REFERENCES

[1] A. K. Jain, "Data clustering: 50 years beyond k-means," Pattern Recognition Letters, vol. 31, no. 8, pp. 651–666, 2010.
[2] J. Xie, S. Kelley, and B. K. Szymanski, "Overlapping community detection in networks: The state-of-the-art and comparative study," ACM Computing Surveys (CSUR), vol. 45, no. 4, p. 43, 2013.
[3] A. A. Abbasi and M. Younis, "A survey on clustering algorithms for wireless sensor networks," Computer Communications, vol. 30, no. 14, pp. 2826–2841, 2007.
[4] A. G. Alonso, A. P. Suárez, and J. E. M. Pagola, "Acons: a new algorithm for clustering documents," in Progress in Pattern Recognition, Image Analysis and Applications. Springer, 2007, pp. 664–673.
[5] N. Aydin, F. Naït-Abdesselam, V. Pryyma, and D. Turgut, "Overlapping clusters algorithm in ad hoc networks," in Global Telecommunications Conference (GLOBECOM 2010), 2010 IEEE. IEEE, 2010, pp. 1–5.
[6] G. Obadi, P. Drázdilová, L. Hlavacek, J. Martinovic, and V. Snasel, "A tolerance rough set based overlapping clustering for the dblp data," in Web Intelligence and Intelligent Agent Technology (WI-IAT), 2010 IEEE/WIC/ACM International Conference on, vol. 3. IEEE, 2010, pp. 57–60.
[7] P. Lingras, P. Bhalchandra, S. Khamitkar, S. Mekewad, and R. Rathod, "Crisp and soft clustering of mobile calls," in Multi-disciplinary Trends in Artificial Intelligence. Springer, 2011, pp. 147–158.
[8] G. Peters, R. Weber, and R. Nowatzke, "Dynamic rough clustering and its applications," Applied Soft Computing, vol. 12, no. 10, pp. 3193–3207, 2012.
[9] J. Z. Lai, E. Y. Juan, and F. J. Lai, "Rough clustering using generalized fuzzy clustering algorithm," Pattern Recognition, vol. 46, no. 9, pp. 2538–2547, 2013.
[10] M. Al Hasan, S. Salem, and M. J. Zaki, "Simclus: an effective algorithm for clustering with a lower bound on similarity," Knowledge and Information Systems, vol. 28, no. 3, pp. 665–685, 2011.
[11] A. Pérez-Suárez, J. F. Martínez-Trinidad, J. A. Carrasco-Ochoa, and J. E. Medina-Pagola, "Oclustr: A new graph-based algorithm for overlapping clustering," Neurocomputing, vol. 121, pp. 234–247, 2013.
[12] G. Palla, I. Derényi, I. Farkas, and T. Vicsek, "Uncovering the overlapping community structure of complex networks in nature and society," Nature, vol. 435, no. 7043, pp. 814–818, 2005.
[13] A. Lancichinetti, S. Fortunato, and J. Kertész, "Detecting the overlapping and hierarchical community structure in complex networks," New Journal of Physics, vol. 11, no. 3, p. 033015, 2009.
[14] C. Lee, F. Reid, A. McDaid, and N. Hurley, "Detecting highly overlapping community structure by greedy clique expansion," in SNA-KDD10: Proceedings of the 4th Workshop on Social Network Mining and Analysis, 2010, pp. 32–42.
[15] C. Shi, Y. Cai, D. Fu, Y. Dong, and B. Wu, "A link clustering based overlapping community detection algorithm," Data & Knowledge Engineering, vol. 87, pp. 394–404, 2013.
[16] Y. Yao, "An outline of a theory of three-way decisions," in Rough Sets and Current Trends in Computing. Springer, 2012, pp. 1–17.
[17] Y. Yao, P. Lingras, R. Wang, and D. Miao, "Interval set cluster analysis: A re-formulation," in Rough Sets, Fuzzy Sets, Data Mining and Granular Computing. Springer, 2009, pp. 398–405.
[18] H. Yu and Y. Wang, "Three-way decisions method for overlapping clustering," in Rough Sets and Current Trends in Computing. Springer, 2012, pp. 277–286.
[19] H. Yu and Q. Zhou, "A cluster ensemble framework based on three-way decisions," in Rough Sets and Knowledge Technology. Springer, 2013, pp. 302–312.
[20] W. Zachary, "An information flow model for conflict and fission in small groups," Journal of Anthropological Research, vol. 33, no. 4, pp. 452–473, 1977.
[21] J. Li, X. Wang, and J. Eustace, "Detecting overlapping communities by seed community in weighted complex networks," Physica A: Statistical Mechanics and its Applications, vol. 392, no. 23, pp. 6125–6134, 2013.
[22] J. Baumes, M. K. Goldberg, M. S. Krishnamoorthy, M. Magdon-Ismail, and N. Preston, "Finding communities by clustering a graph into overlapping subgraphs," IADIS AC, vol. 5, pp. 97–104, 2005.
[23] J. Huang, H. Sun, J. Han, and B. Feng, "Density-based shrinkage for revealing hierarchical and overlapping community structure in networks," Physica A: Statistical Mechanics and its Applications, vol. 390, no. 11, pp. 2160–2171, 2011.
[24] P. G. Sun, L. Gao, and S. Shan Han, "Identification of overlapping and non-overlapping community structure by fuzzy clustering in complex networks," Information Sciences, vol. 181, no. 6, pp. 1060–1071, 2011.
[25] D. Lusseau, K. Schneider, O. J. Boisseau, P. Haase, E. Slooten, and S. M. Dawson, "The bottlenose dolphin community of Doubtful Sound features a large proportion of long-lasting associations," Behavioral Ecology and Sociobiology, vol. 54, no. 4, pp. 396–405, 2003.
[26] M. E. Newman, "Modularity and community structure in networks," Proceedings of the National Academy of Sciences, vol. 103, no. 23, pp. 8577–8582, 2006.
[27] D. Lusseau and M. E. Newman, "Identifying the role that animals play in their social networks," Proceedings of the Royal Society of London. Series B: Biological Sciences, vol. 271, no. Suppl 6, pp. S477–S481, 2004.
