1. Introduction
Identifying groups of nodes that share particular properties helps analysts capture what the members of a community have in common, and many applications build on community detection. For example, precise information delivery such as Google AdWords [1] increases transaction amounts by sending advertisements to the right people. Detecting communities is therefore a popular research topic [2,3,4,5,6,7,8].
Many results focus on disjoint communities, in which each node belongs to exactly one community [2,3]. However, in real-world networks many people belong to multiple communities, so communities may overlap. For example, an engineer may work on several projects in a company. Thus, instead of strict partitions, fuzzy partitions are more appropriate for understanding network structures [9,10]. Fuzzy partitions allow a node to belong to multiple communities simultaneously. Consider a real-world situation in which staff work together in a building and the manager would like to track the movement history of each staff member [11]. Each person may move between various rooms, and the purpose of each move depends on that person's role. When we treat the purposes of the staff as communities, a staff member may belong to several communities.
The modularity function proposed by Newman and Girvan [12] is a well-known measure of the quality of a network partition. It calculates the difference between the actual number of intra-community edges and the expected number of such edges. A partition with a larger modularity value has a stronger community structure than one with a lower value, so finding the partition with maximum modularity is a straightforward approach to community detection. However, modularity maximization has been proved to be NP-hard [13], so finding the optimal partition is computationally difficult. Therefore, many methods have been proposed to compute near-optimal solutions, such as random walk processes [14], structural clustering [15], and polynomial-time approximation algorithms [16].
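As an illustration of the measure described above, the following is a minimal sketch of the standard Newman-Girvan modularity for a disjoint partition of an undirected, unweighted graph without self-loops. The function name `modularity`, the edge-list input, and the toy graph (two triangles joined by one bridge edge) are assumptions made for this example and are not taken from the paper.

```python
from collections import defaultdict


def modularity(edges, community_of):
    """Newman-Girvan modularity of a disjoint partition.

    edges: list of (u, v) pairs of an undirected simple graph (at least one edge).
    community_of: dict mapping each node to its community label.
    """
    m = len(edges)                 # total number of edges
    intra = defaultdict(int)       # edges with both endpoints in community c
    degree = defaultdict(int)      # sum of node degrees inside community c
    for u, v in edges:
        cu, cv = community_of[u], community_of[v]
        degree[cu] += 1
        degree[cv] += 1
        if cu == cv:
            intra[cu] += 1
    # Real fraction of intra-community edges minus the expected fraction.
    return sum(intra[c] / m - (degree[c] / (2 * m)) ** 2 for c in degree)


# Toy graph (hypothetical): two triangles joined by the bridge edge (2, 3).
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
split = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
single = {n: "a" for n in range(6)}

q_split = modularity(edges, split)    # one community per triangle
q_single = modularity(edges, single)  # everything in one community
```

In this toy case the two-triangle partition scores higher than the single-community partition, which matches the intuition that a larger modularity indicates a better community structure.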
Besides its computational complexity, modularity maximization has two problems in detecting communities:
Resolution limits: Fortunato et al. showed that small communities cannot be detected in large networks [17,18]. Because the null model of modularity assumes global connectivity, the expected number of edges between two small groups in a large network may be very small. As a result, the two small groups are treated as one community. Many approaches have been proposed to address resolution limits while providing high solution quality, such as greedy algorithms [19,20], spectral algorithms [21,22,23], simulated annealing algorithms [24], and mathematical programming [25].
Overlapping communities: Some nodes may belong to several communities, so simply assigning each of them to one community is difficult. A straightforward solution is to modify the modularity so that nodes may belong to multiple communities at the same time [26,27,28,29,30].
Figure 1 shows two benchmarks with overlapping communities. In Figure 1a, a single node is the overlapping node and is assigned to communities B and C, which gives three communities with four, five, and six nodes, respectively. In Figure 1b, the overlapping node is assigned to communities A and B.
In this paper, we focus on overlapping community detection and propose the node weight allocation problem to formulate community overlap. Since computing the partition with maximum modularity is NP-hard, reducing the computation cost of finding near-optimal partitions is the popular approach to overlapping community detection. Heuristic algorithms, especially genetic algorithms (GAs), are effective at finding good solutions in a large search space [2,3,8], and several works use a GA as the core of their solutions. Mu et al. use a hybrid heuristic that combines a GA with simulated annealing to find communities [2]. Shang et al. use a GA with an extra local search [3]. However, these results do not deal with overlapping properties. Overlapping networks have various properties, so some approaches use multi-objective optimization to find balanced results [4,5,6,31]. Balanced results take most properties into account, but they may not be close to the real-world properties. Therefore, Behera et al. check the similarity between each pair of nodes [8]. Ezeh et al. also apply node similarity to the overlapping nodes and their neighbors [32]. To emphasize the community attribution of each node, Shakya et al. combine fuzzy communities with a GA [7]: the GA reduces the computation time without decreasing the solution quality too much, and the fuzzy communities identify the overlapping nodes.
Even when an approach provides solutions with high modularity, the partitions may not reflect the properties of real-world networks in some situations. We found that the solution quality can be refined by three operations: ignoring slightly overlapping nodes, merging clusters, and reweighting nodes. Therefore, we use modularity to design the solution searcher of the proposed approach. We first modify its fitness function to take the null model into account, so that the revised fitness function yields partitions that are closer to real-world behavior. Moreover, we design three refinement strategies so that the solutions reflect real-world properties.
In the simulation, we use a synthetic network and popular real-world networks, namely the Zachary Karate Club network, Books about American Politics, and American College Football, to evaluate the quality of the solutions computed by the proposed approach and by other approaches. The derived partitions correctly reflect the real-world properties of both the synthetic and the real-world networks. The proposed refinement strategies are also evaluated, and they improve the quality of the derived partitions from the perspective of real-world behavior. The simulation results therefore show that the proposed approach outputs partitions that are close to the real-world properties.
This paper is organized as follows. The overlapping communities and the problem definition are introduced and formulated in Section 2. The proposed approach and its refinement strategies are presented in Section 3. The simulations, comparisons, and resulting network partitions are given in Section 4. Finally, the conclusion and future work are stated in Section 5.
3. Allocate Node Weight by Genetic Algorithms
Computing the partition with maximum modularity has been proved to be NP-hard [13]. Even with high-performance computing solutions, e.g., cloud computing [35,36] and parallel computing [37], computing the partitions that maximize the modularity still requires huge computation resources. Therefore, we propose a GA-based approach to obtain a near-optimal solution with low computation cost. The proposed algorithm includes two steps. We first apply a GA to obtain a high-quality feasible solution, and then apply three refinement strategies to bring the derived partition closer to real-world behavior. In the following, we introduce the revised GA and the refinement strategies.
3.1. Genetic Algorithm
The iterative process of the GA, shown in Algorithm 1, includes three major processes: crossover, mutation, and selection. Before the iterative process starts, an initial population P of chromosomes is generated. Each chromosome is represented by a weight matrix M, as shown in Figure 2. Each entry of M is a weight that indicates the assignment of a node to a community c. The initial population is generated randomly, and each row of M must satisfy the problem constraints. Given a maximum number of iterations, the GA then repeats the following processes.
Crossover: we randomly select two chromosomes from P and a random column. The offspring is generated from the selected column of one parent and the remaining part of the other, as shown in Figure 3. The number of offspring is controlled by a parameter, which fixes the number of chromosomes obtained after the crossover.
Mutation: the mutation process is triggered with 80% probability after the crossover finishes. Once the mutation is invoked, it is applied to a randomly selected chromosome. Finally, the offspring is normalized into a feasible solution that satisfies the problem constraints.
Selection: we take the modularity as the objective function, and the purpose of the GA is to find the partition with maximum modularity. We compute the fitness function of each solution and sort all chromosomes in descending order of fitness. Since finding the chromosomes with maximum fitness is the major goal of the GA, we select the top individuals, which survive to the next generation.
Algorithm 1: Genetic algorithm for allocating node weight
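The crossover, mutation, and selection steps can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: each row of a chromosome is assumed to hold non-negative weights that sum to one; the fitness is a commonly used fuzzy modularity rather than the paper's revised fitness function; the mutation is assumed to replace one random entry of one randomly chosen offspring with a uniform random value; and all names (`run_ga`, `fuzzy_modularity`, `pop_size`, `n_offspring`, `p_mut`) are hypothetical.

```python
import random


def normalize(row):
    """Scale a row so that its weights sum to one (assumed feasibility constraint)."""
    total = sum(row)
    if total == 0:
        return [1.0 / len(row)] * len(row)
    return [x / total for x in row]


def random_chromosome(n_nodes, n_comms, rng):
    """Random node-by-community weight matrix with normalized rows."""
    return [normalize([rng.random() for _ in range(n_comms)]) for _ in range(n_nodes)]


def fuzzy_modularity(M, adj):
    """Assumed fitness: (1/2m) * sum_c sum_ij (A_ij - k_i k_j / 2m) * M_ic * M_jc."""
    n = len(adj)
    deg = [sum(row) for row in adj]
    two_m = sum(deg)  # assumes the graph has at least one edge
    q = 0.0
    for c in range(len(M[0])):
        for i in range(n):
            for j in range(n):
                q += (adj[i][j] - deg[i] * deg[j] / two_m) * M[i][c] * M[j][c]
    return q / two_m


def crossover(a, b, rng):
    """Offspring takes one random column from parent a and the rest from parent b."""
    col = rng.randrange(len(a[0]))
    child = [row[:] for row in b]
    for i, row in enumerate(child):
        row[col] = a[i][col]
    return [normalize(row) for row in child]  # keep the offspring feasible


def mutate(M, rng):
    """Assumed mutation: overwrite one random entry, then renormalize its row."""
    i = rng.randrange(len(M))
    c = rng.randrange(len(M[0]))
    M[i][c] = rng.random()
    M[i] = normalize(M[i])


def run_ga(adj, n_comms, pop_size=20, n_offspring=20, iterations=50, p_mut=0.8, seed=0):
    rng = random.Random(seed)
    pop = [random_chromosome(len(adj), n_comms, rng) for _ in range(pop_size)]
    for _ in range(iterations):
        children = []
        for _ in range(n_offspring):
            a, b = rng.sample(pop, 2)
            children.append(crossover(a, b, rng))
        if rng.random() < p_mut:  # mutation is triggered with 80% probability
            mutate(rng.choice(children), rng)
        # Selection: sort by fitness in descending order and keep the top individuals.
        ranked = sorted(pop + children, key=lambda m: fuzzy_modularity(m, adj), reverse=True)
        pop = ranked[:pop_size]
    return pop[0]


# Toy graph (hypothetical): two triangles joined by the bridge edge (2, 3).
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
adj = [[0] * 6 for _ in range(6)]
for u, v in edges:
    adj[u][v] = adj[v][u] = 1

best = run_ga(adj, n_comms=2)
```

Because the selection keeps the best chromosomes of parents and offspring together, the best fitness in the population never decreases from one generation to the next in this sketch.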
To keep the heavily overlapping nodes, a threshold is given, and each chromosome is transformed into its thresholded counterpart by Equation (6).
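A minimal sketch of such a threshold transform is given below. It assumes that entries below the threshold are set to zero and that each row is then renormalized, with the strongest entry kept when every entry of a row falls below the threshold; the exact form of Equation (6) may differ. The function name `apply_threshold` and the parameter `theta` are hypothetical.

```python
def apply_threshold(M, theta):
    """Zero out weights below theta and renormalize each row (assumed behavior)."""
    result = []
    for row in M:
        kept = [w if w >= theta else 0.0 for w in row]
        if sum(kept) == 0.0:
            # Every entry is below theta: keep only the strongest community.
            best = max(range(len(row)), key=row.__getitem__)
            kept = [1.0 if c == best else 0.0 for c in range(len(row))]
        total = sum(kept)
        result.append([w / total for w in kept])
    return result


def count_overlapping(M):
    """Number of nodes with non-zero weight in more than one community."""
    return sum(1 for row in M if sum(1 for w in row if w > 0) > 1)


# Toy weight matrix (hypothetical): three nodes, two communities.
M = [[0.9, 0.1], [0.5, 0.5], [0.2, 0.8]]
low = apply_threshold(M, 0.15)   # only the lightly overlapping first node is cleaned
high = apply_threshold(M, 0.6)   # every node ends up in a single community
```

In this toy example a low threshold keeps the two heavily overlapping nodes, while a high threshold assigns every node to exactly one community.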
3.2. Refinement Strategies
The GA provides an elite solution from the population, but this solution may not be suitable for all instances. In the pre-analysis phase, we observed three situations in the derived solutions in which some extra processing yields better solutions: (1) lightly overlapping nodes, (2) mergeable clusters, and (3) nodes that need reweighting. We call this extra processing the “refinement strategies” and provide one strategy for each of the three situations.
Ignore slightly overlapping nodes: The overlapping degree of each node is important for splitting the communities, since determining the community of a node with a low overlapping degree is easier than for a node with a high one. We use the threshold of Equation (6) to decide whether an entry should be treated as an entry without overlap, and Equation (6) also gives the corresponding thresholded solution. When an entry is below the threshold, we set it to zero. With a higher threshold, more nodes are assigned to a single community.
Merge clusters: Some small communities should be merged into a larger community. Given two non-empty communities, we define their overlapping ratio. If the overlapping ratio of any two communities is larger than a given merge threshold, the two communities are merged into a single community.
Reweight node values: When the weight distribution of an overlapping node is obtained by directly converting the chromosome via Equation (6), a node may belong to multiple communities while the majority of its weight is allocated to one community. To avoid this problem, we propose the reweight strategy. The weight of a node in a community c should be proportional to the number of edges that link the node to members of c. Moreover, if the node has more neighbors in c than the average number of neighbors in c, then c is more important to the node than the other communities. The new weight is therefore defined in terms of the average number of neighbors in c, the set of nodes that belong to c, and the set of communities that the node belongs to, and a normalization term keeps the new weights of each node normalized.
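The merge and reweight strategies can be sketched as follows. Both functions rest on assumptions that are not taken from the paper: the overlapping ratio is assumed to be the number of shared nodes divided by the size of the smaller community, and the new weight of a node in c is assumed to be its number of neighbors in c divided by the average number of in-community neighbors of the members of c, normalized over the communities the node belongs to. The names `overlap_ratio`, `merge_communities`, and `reweight` are hypothetical.

```python
def overlap_ratio(a, b):
    """Assumed overlapping ratio: shared nodes over the size of the smaller community."""
    return len(a & b) / min(len(a), len(b))


def merge_communities(communities, merge_threshold):
    """Repeatedly merge any pair whose overlapping ratio exceeds the threshold."""
    comms = [set(c) for c in communities if c]  # ignore empty communities
    merged = True
    while merged:
        merged = False
        for i in range(len(comms)):
            for j in range(i + 1, len(comms)):
                if overlap_ratio(comms[i], comms[j]) > merge_threshold:
                    comms[i] |= comms.pop(j)
                    merged = True
                    break
            if merged:
                break
    return comms


def reweight(adj, communities, node):
    """Assumed reweighting of one node over the communities it belongs to.

    adj: dict node -> set of neighbors; communities: dict label -> set of nodes.
    """
    def links_into(u, members):
        return len(adj[u] & members)

    scores = {}
    for label, members in communities.items():
        if node not in members:
            continue
        avg = sum(links_into(u, members) for u in members) / len(members)
        scores[label] = links_into(node, members) / avg if avg > 0 else 0.0
    total = sum(scores.values())
    if total == 0:
        return {label: 1.0 / len(scores) for label in scores}
    return {label: s / total for label, s in scores.items()}


# Merge example (hypothetical): the first two communities share two of three nodes.
merged = merge_communities([{1, 2, 3, 4}, {3, 4, 5}, {6, 7, 8}], merge_threshold=0.5)
unmerged = merge_communities([{1, 2, 3, 4}, {3, 4, 5}, {6, 7, 8}], merge_threshold=0.7)

# Reweight example (hypothetical): node 0 has three neighbors in A and one in B.
adj = {0: {1, 2, 3, 4}, 1: {0, 2, 3}, 2: {0, 1, 3}, 3: {0, 1, 2}, 4: {0, 5}, 5: {4}}
communities = {"A": {0, 1, 2, 3}, "B": {0, 4, 5}}
weights = reweight(adj, communities, 0)
```

In the reweight example, a purely edge-proportional rule would give node 0 the weights 3/4 and 1/4, whereas dividing by the community averages moves some weight toward the sparser community B.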
5. Conclusion and Discussion
Given a network, modularity measures the partition quality, while fuzzy clustering recognizes overlapping communities. Combining the two concepts into a fuzzy modularity is an appropriate way to formulate the structure of a network with overlapping communities. Maximizing the modularity yields a partition with good network structure, but computing the partition with maximum modularity requires a huge computation cost. Heuristic algorithms are effective at finding high-quality solutions in a large search space, and several research results use them to find partitions with maximum modularity. However, some special cases remain. We identify three common situations in the partitions derived from a GA with modularity maximization and propose three refinement strategies, namely ignoring slightly overlapping nodes, merging clusters, and reweighting nodes, to bring the partition closer to real-world behavior. Moreover, we modify the fitness function of the GA to consider the null model, which measures the distance between the derived partition and a random graph. The simulation results show that the proposed approach provides a significant improvement over previous approaches. The derived partition may not always have maximum modularity, but its community structure is more reasonable than the partitions derived by previous works. The proposed approach measures the connectivity of nodes and reweights the overlapping nodes to reflect the correct properties of the given networks. As a result, it determines the partitions appropriately, whereas other approaches may mark heavily overlapping nodes as interior nodes.
The proposed approach detects the overlapping nodes and allocates them appropriately. During the simulations, we found some extensions that will be addressed in the future, listed as follows:
In our simulations, we obtained an interesting result from the karate network, shown in Figure 14. The result consists of three communities. The community whose nodes are marked in red could be considered an overlapping set, which means that networks may have not only overlapping nodes but also overlapping groups. Applying the fuzzy concept to communities would eliminate this group, and the result may be closer to real-world behavior, since the members of this group may belong to different communities depending on the situation, e.g., competitions or events. Therefore, assigning the red nodes to any single community may be inappropriate.
The proposed algorithm invokes the GA to compute preliminary partitions and then applies the proposed refinement strategies as secondary processes to correct them. The refinement strategies could instead be used as a local search that improves the partition quality in each iteration. However, there is a tradeoff between computation cost and partition quality: once the refinement strategies move from external processes to internal processes of the GA, the computation cost increases. Moreover, a given network may not always have the target properties that the refinement strategies can improve. Therefore, the refinement strategies could be designed as local search approaches, but the conditions for triggering the local search should be analyzed in the future.