Anomaly Detection in Networks
OPEN ACCESS
algorithms
ISSN 1999-4893
www.mdpi.com/journal/algorithms
Concept Paper
Institute 1, Jaypee University of Information Technology 1, Waknaghat, Solan, Himachal, India
arXiv:1511.07210v1 [cs.SI] 23 Nov 2015
structural analysis of data and other query retrieval purposes. NN search is very useful for dealing
with massive datasets, but it suffers from the "curse of dimensionality" [1,2]. However, a recent
surge of results shows that it is also very efficient for high-dimensional data, provided a suitable
space-partitioning data structure is used, such as a kd-tree, quad-tree, R-tree, metric tree or locality
sensitive hashing [3–6]. Some of these data structures also support approximate nearest neighbor
search, which hardly degrades the results while saving a lot of computational time. In
the NN-search problem, the goal is to pre-process a set of data points so that later, given a query
point, one can efficiently find the data point nearest to the query point in the metric space under
consideration. NN search has many applications in data processing and analysis, for instance
information retrieval, searching image databases, finding duplicate pages, compression, and many
others. To represent the objects and the similarity measures, one often uses geometric notions of
nearness [7,8].
One important research direction of recent interest is the extraction of network communities in large
real graphs such as social networks, the web, collaboration networks and bio-networks [9–12]. The
availability of large, detailed datasets representing such networks has stimulated extensive study
of their basic properties and the identification of hierarchical structural features [11,13]. Unlike
general graphs, complex networks are characterized by small average path length and high clustering
coefficient. A network community (also known as a module or cluster) is typically a group of
nodes with more interconnections among its members than with the remaining part of the network
[13–15]. To extract such groups of nodes from a network, one generally selects an objective function
that captures the possible communities as sets of nodes with better internal connectivity than
external connectivity [16,17]. However, very little research has been done on network community detection that
develops a notion of nearness among the nodes of a complex network and uses nearest neighbor search for
partitioning the network [18–24]. Because complex networks are characterized by small average path length
and high clustering coefficient, the metric should be defined so that it captures these crucial
properties. Therefore, we need to create the metric very carefully so that it
can explore the underlying community structure of real-life networks [25].
In this work, we have developed a notion of nearness among the nodes of the network using
some new matrices derived from a modified adjacency matrix of the graph. The notion is flexible across
networks and can be tuned to enhance the structural properties of the network required for
community detection. The main contributions of this work include:
The rest of this paper is organized as follows. In Section 2, several definitions and challenges
relevant to nearest neighbor search in complex networks are discussed. Section 3 describes the
Algorithms 2015, xx 3
notion of nearness in complex networks and develops a method to represent a complex network as
points of a metric space. Section 4 describes the problem of nearest neighbor search over complex
networks and the use of the metric tree data structure in this regard. In Section 5, the problem of
approximate nearest neighbor search on complex networks is discussed together with a newly developed
locality sensitive hashing method. Network community detection using exact and approximate
nearest neighbor search is formulated in Section 6, where several possible solutions are presented
and the initialization procedures, termination criteria and convergence are discussed in detail. The
results of the comparison between community detection algorithms are illustrated in Section 7. The
computational aspects of the proposed framework are also discussed in that section.
The nearest-neighbor search problem is to find the points in a D-dimensional dataset
X ⊂ R^D containing n points that are nearest to a query point q ∈ R^D, usually in a metric space. It has
applications in a wide range of real-world settings, in particular pattern recognition, machine
learning and database querying, to name a few. Several effective methods exist for this problem
when the dimension D is small, such as Voronoi diagrams; kd-trees and metric trees are
more common when the dimension is high. Many methods with different approaches have been developed for
searching data and finding the nearest point. Nearest neighbor search appears in different studies
under different names, such as the post office problem, proximity search, closest point
search, best match file searching problem, index for similarity search, vector quantization encoder
and the light-bulb problem. The solutions for the Nearest Neighbor Search (NNS) problem
usually have two parts: nearness determination in the data and algorithmic developments. In
most NNS algorithms, the main framework is based on four fundamental algorithmic ideas:
branch-and-bound, walks, mapping-based techniques and epsilon nets. There are thousands
of possible framework variations, and any practical application can lead to its own problem
formalization, for example in pattern recognition, searching in multimedia data, data compression,
computational statistics, information retrieval, databases and data mining, machine learning,
algorithmic theory, computational geometry and recommendation systems.
2.1. NN problem definition on complex network An NNS problem in a metric space is
defined below.
Definition 1 (Metric space). Let S be a set of points and d a function that computes the distance
between two points. The pair (S, d) is called a metric space if d satisfies reflexivity, non-negativity,
symmetry and the triangle inequality.
Non-metric data are indexed by special data structures for non-metric spaces, and
searching is then done on these indexes. A few efficient methods exist for searching in non-metric spaces;
most of them convert the non-metric space to a metric space. The focus of this paper is
on the problems defined on a metric space. In a more detailed classification, NNS problems can
be defined in Euclidean space as follows:
Under this definition, the query time is sublinear (or even logarithmic) for a small dataset with
low dimension, but exponential for a massive dataset with high dimension. Fortunately, a small
approximation can reduce the exponential complexity to polynomial time.
Approximate NNS is defined as:
The first requirement for searching in a metric space is the existence of a formula to
calculate the distance between each pair of objects in S. Different metric distance functions can
be defined depending on the search space under consideration. An NN query on a complex network
G consists of a source node s and a metric function d(x, y). This computation depends on the
dimension of the instance and faces the "curse of dimensionality" problem. The computation can be
reduced drastically if, instead of computing the exact nearest neighbor, we compute an approximate
nearest neighbor.
3. Notion of nearness in complex network The notion of nearness among the nodes of a
graph has been used for several purposes throughout the graph theory literature. Most of the time, the
shortest path and edge connectivity are the popular choices to describe the nearness of nodes. However,
edges alone do not give a true measure of network connectivity (proof by Kleinberg). The notion
of network connectivity is sometimes generalized to be the number of paths, of any length, that
exist between two nodes. This measure, called influence by sociologists because it measures the
ability of one node to affect another, gives a better measure of connectivity between nodes of real-life
graphs / complex networks. Besides discovering natural groups within a network, the influence
metric can also help identify the weak ties that bridge different communities. Research in this
direction has gained special attention in the domain of complex network analysis; some of these measures, along
with the one proposed in this article, are discussed in the following subsections.
Methods based on node neighborhoods. For a node x, let N (x) denote the set of neighbors of
x in a graph G(V, E) . A number of approaches are based on the idea that two nodes x and y
are more likely to be affected by one another if their sets of neighbors N (x) and N (y) have large
overlap.
Common neighbors: The most direct implementation of this idea for nearness computation is
to define d(x, y) := |N(x) ∩ N(y)|, the number of neighbors that x and y have in common.
Jaccard coefficient: The Jaccard coefficient, a commonly used similarity metric, measures the
probability that both x and y have a feature f, for a randomly selected feature f that either
x or y has. If we take features here to be neighbors in G(V, E), this leads to the measure
d(x, y) := |N(x) ∩ N(y)| / |N(x) ∪ N(y)|.
Preferential attachment: The probability that a new edge involves node x is proportional to
|N(x)|, the current number of neighbors of x. In collaboration networks, the probability of co-authorship of x and y is
correlated with the product of the number of collaborators of x and y. This corresponds to the
measure d(x, y) := |N(x)| × |N(y)|.
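The three neighborhood-based measures above can be sketched in a few lines. This is a minimal illustration, assuming the graph is stored as a dictionary of neighbor sets; the function names and the small example graph are hypothetical and not taken from the paper.

```python
def common_neighbors(adj, x, y):
    """|N(x) ∩ N(y)|: number of shared neighbors."""
    return len(adj[x] & adj[y])

def jaccard(adj, x, y):
    """|N(x) ∩ N(y)| / |N(x) ∪ N(y)|; defined as 0 when both sets are empty."""
    union = adj[x] | adj[y]
    return len(adj[x] & adj[y]) / len(union) if union else 0.0

def preferential_attachment(adj, x, y):
    """|N(x)| * |N(y)|: product of the two degrees."""
    return len(adj[x]) * len(adj[y])

# Hypothetical undirected example graph: edges a-b, a-c, a-d, b-c, c-d.
adj = {
    "a": {"b", "c", "d"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"a", "c"},
}

cn_bd = common_neighbors(adj, "b", "d")          # b and d share a and c
jc_ab = jaccard(adj, "a", "b")                   # one shared neighbor out of four
pa_ab = preferential_attachment(adj, "a", "b")   # degree 3 times degree 2
```

All three are similarity scores: larger values indicate nearer nodes, so they would have to be inverted or negated before being used as a distance.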
Katz measure: This measure directly sums over the collection of paths, exponentially damped
by length to count short paths more heavily. This leads to the measure
$d(x, y) := \sum_{\ell=1}^{\infty} \beta^{\ell} \, |\mathrm{paths}^{\langle \ell \rangle}(x, y)|$,
where $\mathrm{paths}^{\langle \ell \rangle}(x, y)$ is the set of all length-$\ell$ paths from x to y. (β controls how strongly long paths
are damped: for a very small β, paths of length three or more contribute very little to the summation.)
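A truncated version of the Katz sum can be sketched as follows. This is an illustration under two assumptions that are not stated in the text: paths are counted as walks (nodes may repeat), so that the number of length-l paths is the corresponding entry of the l-th power of the adjacency matrix, and β is small enough for the series to converge. The helper names and the cut-off max_len are hypothetical.

```python
def mat_mul(P, Q):
    """Plain-Python product of two matrices stored as lists of rows."""
    return [[sum(P[i][t] * Q[t][j] for t in range(len(Q)))
             for j in range(len(Q[0]))] for i in range(len(P))]

def katz_scores(A, beta, max_len=30):
    """Sum of beta**l * (number of length-l walks) for l = 1..max_len."""
    n = len(A)
    scores = [[0.0] * n for _ in range(n)]
    walks = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for length in range(1, max_len + 1):
        walks = mat_mul(walks, A)          # entry (i, j) counts walks of this length
        weight = beta ** length
        for i in range(n):
            for j in range(n):
                scores[i][j] += weight * walks[i][j]
    return scores

# Hypothetical example: the path graph 0 - 1 - 2.
A = [[0, 1, 0],
     [1, 0, 1],
     [0, 1, 0]]
katz = katz_scores(A, beta=0.1)
```

For this path graph the series has a closed form, β/(1 − 2β²) for the adjacent pair (0, 1) and β²/(1 − 2β²) for the pair (0, 2), which gives a simple check that adjacent nodes score higher than nodes two steps apart.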
Hitting time and PageRank: A random walk on G starts at a node x and iteratively moves to
a neighbor of the current node chosen uniformly at random. The hitting time H(x, y) from x to y is the expected
number of steps required for a random walk starting at x to reach y. Since the hitting time is not
in general symmetric, it is also natural to consider the commute time C(x, y) := H(x, y) + H(y, x).
Both of these measures serve as natural proximity measures, and hence (negated) can be used
as d(x, y). Random resets form the basis of the PageRank measure for Web pages, and we can
adapt it for link prediction as follows: define d(x, y) to be the stationary probability of y in a
random walk that returns to x with probability α at each step, moving to a random neighbor with
probability 1 − α.
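The random walk with resets described above can be approximated by power iteration. The sketch below is an assumption-laden illustration rather than the paper's procedure: it assumes an undirected graph in which every node has at least one neighbor, and the function name, the value of α and the iteration count are hypothetical choices.

```python
def rooted_pagerank(adj, x, alpha=0.15, iters=200):
    """Approximate stationary distribution of a walk that resets to x with probability alpha."""
    prob = {v: 0.0 for v in adj}
    prob[x] = 1.0                          # the walk starts at the root x
    for _ in range(iters):
        nxt = {v: 0.0 for v in adj}
        for v, p in prob.items():
            nxt[x] += alpha * p            # reset to the root
            share = (1 - alpha) * p / len(adj[v])
            for w in adj[v]:               # otherwise move to a uniform random neighbor
                nxt[w] += share
        prob = nxt
    return prob

# Hypothetical undirected example graph: edges a-b, a-c, a-d, b-c, c-d.
adj = {
    "a": {"b", "c", "d"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"a", "c"},
}
scores = rooted_pagerank(adj, "b")
```

In this example b and d play symmetric roles in the graph, so the difference between their scores comes entirely from the reset to the root b.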
Most of these methods were developed for general graphs and for different types of problems, such as information retrieval,
ranking and prediction. In this article we study a measure
specially designed for complex networks, which is discussed in the next subsection.
3.2. Proposed metric on complex network In this section we demonstrate the procedure
to transform a graph into points of a metric space and develop the methods of community
detection with the help of the metric defined for pairs of points. We also study and analyze
the community structure of the network therein.
The nodes of a graph do not lie in a metric space. The standard Euclidean distance and
spherical distance defined over the adjacency or Laplacian matrices fail to capture similarity
information among the nodes of a complex network. On the other hand, the algorithms developed
on the basis of shortest paths or Jaccard similarity are computationally inefficient and have less success
in terms of standard evaluation criteria (such as conductance and modularity).
In this work, we have tried to develop the notion of similarity among the nodes using some new
matrices derived from the adjacency matrix and the degree matrix of the graph. Let A be the adjacency
matrix and D the degree matrix of the graph G = (V, E). The Laplacian is L = D − A. We
define two diagonal matrices of the same size, D(λ) and D(λx), where λ is a parameter determined
from the given graph that can be optimized using the optimization criteria of the problem under
consideration. In D(λ), a fixed, optimally determined value replaces the diagonal entries of the
matrix D, and in D(λx) a variable, also optimally determined, value replaces the diagonal entries of
the matrix D. The similarities are defined on the matrices L1 and L2, where L1 = D(λ) + A and L2 =
D(λx) + A respectively, as spherical similarities among the rows, and are determined by applying a concave
function φ over standard notions of similarity such as the Pearson coefficient (σPC), the Spearman
coefficient (σSC) or cosine similarity (σCS). φ must be chosen using the chord condition to
obtain a metric.
In this subsection we demonstrate the algorithm that converts the nodes of the graph to
points of a metric space while preserving the community structure of the graph. The algorithm depends
on two sub-modules: 1) construction of Lx (L1 or L2) and 2) obtaining a structure-preserving
distance function. The algorithm works by picking pairs of rows of Lx and computing the
distance defined in the second module.
3.2.1. Lx construction
L1 is defined as L1 = D(λ) + A, where A is the adjacency matrix of the given network and
D(λ) is a diagonal matrix of the same size with diagonal values equal to a non-negative constant λ.
L2 is defined as L2 = D(λx) + A, where A is the adjacency matrix of the given network
and D(λx) is a diagonal matrix of the same size with diagonal values determined by a non-negative
function λx of the node x.
The choice of λ and λx plays a crucial role in combination with the function chosen in the second
module for the determination of a suitable metric, and is discussed in the later part of this subsection.
The function selection module determines the metric for a pair of nodes. The function selector
φ converts a similarity function (Pearson coefficient (σPC), Spearman coefficient (σSC) or cosine
similarity (σCS)) into a distance function. In general, the similarity function satisfies the positivity
and symmetry conditions of a metric but not the triangle inequality. φ is a metric-preserving
(φ(d(xi, xj)) = dφ(xi, xj)), concave and monotonically increasing function. The three conditions
above are referred to as the chord condition. The φ function is chosen to have the minimum internal area with
the chord.
The choices in the above sub-modules play a crucial role in the graph-to-metric transformation
algorithm used for community detection. A complex network is characterized by small
average diameter and high clustering coefficient. Several studies on network structure analysis
reveal that there are hub nodes and local nodes characterizing the interesting structure of the
complex network. Suppose we take φ = arccos, σCS and a constant λ ≥ 0. λ = 0 penalizes
the effect of a direct edge in the metric and is suitable for extracting communities from highly dense
graphs. λ = 1 places similar weight on direct edges and common neighbors, which reduces the effect of
a direct edge in the metric, and is suitable for extracting communities from moderately dense graphs.
λ = 2 gives more importance to direct edges than to common neighbors (this is the common case for
available real networks). λ ≥ 2 penalizes the effect of common neighbors in the metric and is
suitable for extracting communities from very sparse graphs. The choice of λ depends on the input
graph, i.e. whether it is sparse or dense, and on its cluster structure. A more detailed explanation of
the metric described above can be found in [25].
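As an illustration, the construction L1 = D(λ) + A combined with φ = arccos and cosine similarity can be sketched as follows. This is a minimal sketch, not the reference implementation of [25]: the helper names build_L1 and angular_distance and the two-triangle example graph are assumptions made for illustration, and every row of L1 is assumed to be non-zero.

```python
import math

def build_L1(A, lam):
    """Return L1 = D(lam) + A, where D(lam) has the constant lam on its diagonal."""
    n = len(A)
    return [[A[i][j] + (lam if i == j else 0) for j in range(n)] for i in range(n)]

def angular_distance(u, v):
    """arccos of the cosine similarity of two non-zero rows."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    cosine = max(-1.0, min(1.0, dot / (norm_u * norm_v)))  # clamp rounding error
    return math.acos(cosine)

# Hypothetical example: two triangles {0,1,2} and {3,4,5} joined by the edge 2-3.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
A = [[0] * 6 for _ in range(6)]
for i, j in edges:
    A[i][j] = A[j][i] = 1

L1 = build_L1(A, lam=1)
same_side = angular_distance(L1[0], L1[1])   # two nodes of the same triangle
across = angular_distance(L1[0], L1[4])      # nodes of different triangles
```

In this example nodes 0 and 1 have identical rows for λ = 1, so their distance is zero, while the rows of nodes 0 and 4 share no non-zero entries and are at distance π/2.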
4. Nearest neighbor search on complex network using metric tree A large
number of methods have been developed for nearest neighbor search. However, nearest
neighbor search on high-dimensional data suffers from the curse of dimensionality. Some
recent research in this direction has revealed that the dimension constraint can be tackled by using
efficient data structures such as the metric tree and locality sensitive hashing. In this section we
explore the metric tree to perform nearest neighbor search on a complex network with the help of
the metric mapping of complex networks described in the previous section.
4.1. Metric-tree
A metric tree is a data structure specially designed to answer nearest neighbor queries for
points residing in a metric space, and it performs well in high dimensions, particularly when some
approximation is permitted. A metric tree organizes a set of points in a spatial hierarchical
manner. It is a binary tree whose nodes represent sets of points. The root node represents
all points, and the points represented by an internal node v are partitioned into two subsets,
represented by its two children. Formally, if we use N(v) to denote the set of points represented
by node v, and use v.lc and v.rc to denote the left child and the right child of node v, then we
have N(v) = N(v.lc) ∪ N(v.rc) and ∅ = N(v.lc) ∩ N(v.rc) for all non-leaf nodes. At the lowest
level, each leaf node contains very few points.
An M-Tree [26] has these components and sub-components:
• Non-leaf nodes: A set of routing objects NRO , Pointer to Node’s parent object vp .
• Routing Object: (Feature value of) routing object vr , Covering radius r(vr ), Pointer to
covering tree T (vr ), Distance of vr from its parent object d(vr , P (vr ))
• Object: (Feature value of the) object vj , Object identifier oid(vj ), Distance of vj from its
parent object d(vj , P (vj ))
Partitioning: The key to building a metric tree is how to partition a node v. A typical way is
as follows. We first choose two pivot points from N(v), denoted as v.lpv and v.rpv. Ideally, v.lpv
and v.rpv are chosen so that the distance between them is the largest of all distances within N(v).
More specifically, $\|v.lpv - v.rpv\| = \max_{p_1, p_2 \in N(v)} \|p_1 - p_2\|$. However, it takes $O(n^2)$ time to find
the optimal v.lpv and v.rpv. In practice a linear-time heuristic is used to find reasonable pivot
points. v.lpv and v.rpv are then used to partition node v. We first project all the points onto
the vector u = v.rpv − v.lpv, and then find the median point A along u. Next, we assign all the
points projected to the left of A to v.lc, and all the points projected to the right of A to v.rc. We
use L to denote the (d − 1)-dimensional hyperplane that is orthogonal to u and passes through A. It is known as
the decision boundary, since all points to the left of L belong to v.lc and all points to the right of
L belong to v.rc. By using a median point to split the data points, we can ensure that the depth
of a metric tree is O(log n). Each node v also has a hypersphere B such that all points represented
by v fall in the ball centered at v.center with radius v.r, i.e. N(v) ⊆ B(v.center, v.r).
Searching: A search on a metric tree is performed using a stack. The position of the query relative to
the decision boundary is used to decide which child node to search first. If the query q is on the left of the current boundary, then v.lc is
searched first; otherwise, v.rc is searched first. At all times, the algorithm maintains a candidate
NN, which is the nearest neighbor it has found so
far while traversing the tree. We call this point x, and denote the distance between q and x by
r; this distance determines the current radius. If the algorithm is about to explore a node v but discovers that no member of v can be within
distance r of q, then it skips the subtree rooted at v. This happens whenever ‖v.center − q‖ − v.r ≥ r.
In practice, metric tree search typically finds a very good NN candidate quickly, and then spends
most of the time verifying that it is in fact the true NN. However, in the case of approximate NN search we
can save the majority of this time with a moderate approximation guarantee. The algorithm for NN search
using a metric tree is given in Algorithm 1.
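The partitioning and searching steps described above can be sketched as follows for points in Euclidean space. This is a simplified illustration and not the paper's Algorithm 1: the class and function names, the leaf size and the farthest-point pivot heuristic are assumptions, and the child whose center is closer to the query is visited first instead of testing the side of the decision boundary.

```python
import math

def dist(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

class Node:
    """A metric-tree node: a bounding ball plus either two children or a few points."""
    def __init__(self, points, leaf_size=2):
        dim = len(points[0])
        self.center = tuple(sum(p[i] for p in points) / len(points) for i in range(dim))
        self.radius = max(dist(self.center, p) for p in points)
        self.points, self.lc, self.rc = points, None, None
        if len(points) <= leaf_size:
            return                                           # leaf node
        # Linear-time pivot heuristic: a far point, then the point farthest from it.
        lpv = max(points, key=lambda p: dist(points[0], p))
        rpv = max(points, key=lambda p: dist(lpv, p))
        u = tuple(b - a for a, b in zip(lpv, rpv))
        # Project onto u and split at the median.
        proj = sorted(points, key=lambda p: sum(x * y for x, y in zip(p, u)))
        mid = len(proj) // 2
        self.lc, self.rc = Node(proj[:mid], leaf_size), Node(proj[mid:], leaf_size)
        self.points = None

def nearest(root, q):
    """Exact NN search with a stack and ball-based pruning."""
    best, r = None, math.inf
    stack = [root]
    while stack:
        v = stack.pop()
        if dist(v.center, q) - v.radius >= r:
            continue                                         # no point of v can beat r
        if v.lc is None:
            for p in v.points:
                d = dist(p, q)
                if d < r:
                    best, r = p, d
        else:
            near, far = sorted((v.lc, v.rc), key=lambda c: dist(c.center, q))
            stack.append(far)
            stack.append(near)                               # popped, hence searched, first
    return best, r

# Hypothetical data: 60 deterministic pseudo-scattered 2-D points.
pts = [((i * 37) % 101, (i * 53) % 89) for i in range(60)]
root = Node(pts)
answer, radius = nearest(root, (40.5, 40.5))
```

Relaxing the pruning test, for example by skipping a node whenever its lower bound multiplied by (1 + ε) is at least r, would turn this into an approximate search; that variant is not shown here.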
Theorem 4. Let M = (V, d) be a bounded metric space. Then for any fixed data V ⊂ R^n of size
n, and for a constant c ≥ 1, there exists ε such that we may compute d(q, V) with at most $c \cdot \lceil \log(n) + 1 \rceil$
expected metric evaluations [27].
Metric trees so far represent the practical state of the art for achieving efficiency in the
largest dimensionality possible. However, many real-world problems are posed with very large
dimensionality, beyond the capability of such search structures to achieve sub-linear
efficiency. Thus, the high-dimensional case is the long-standing frontier of the nearest-neighbor
problem.
The approximate nearest neighbor can be computed very efficiently using Locality sensitive
hashing.
Given a metric space (S, d) and some finite subset SD ⊂ S of data points on which nearest
neighbor queries are to be made, our aim is to organize SD so that NN queries can be answered more
efficiently. For any q ∈ S, the NN problem consists of finding a single point p ∈ SD such that
d(p, q) is minimum over all points of SD. We denote this by p = NN(q, SD).
An approximate NN of q ∈ S is a point p ∈ SD such that d(p, q) ≤ (1 + ε) d(x, q) for all x ∈ SD.
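The two definitions above can be made concrete with a linear scan. The sketch below is only a baseline for checking answers, not an efficient index; the function names and the one-dimensional example data are hypothetical.

```python
def nearest_neighbor(q, data, d):
    """Exact NN by linear scan: the point of data minimizing d(p, q)."""
    return min(data, key=lambda p: d(p, q))

def is_approximate_nn(p, q, data, d, eps):
    """True if d(p, q) <= (1 + eps) * d(x, q) for every x in data."""
    return d(p, q) <= (1 + eps) * d(nearest_neighbor(q, data, d), q)

# Hypothetical example on the real line with d(p, q) = |p - q|.
data = [1.0, 4.0, 9.0]
d = lambda p, q: abs(p - q)
exact = nearest_neighbor(5.0, data, d)
```

Checking against the exact nearest neighbor is enough, because the inequality holds for all x in the data set exactly when it holds for the closest one.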
5.2. Locality Sensitive Hashing (LSH) Several methods to compute the first nearest neighbor query
exist in the literature, and locality-sensitive hashing (LSH) is the most popular because of its
dimension-independent runtime [28,29]. In locality sensitive hashing, the hash function has the
property that close points are hashed into the same bucket with high probability and distant points
are hashed into the same bucket with low probability. Mathematically, a family H = {h : S → U} is
called (r1, r2, p1, p2)-sensitive if for any p, q ∈ S
• if p ∈ B(q, r1) then Pr_H[h(q) = h(p)] ≥ p1
• if p ∉ B(q, r2) then Pr_H[h(q) = h(p)] ≤ p2
5.3. Locality sensitive hash function for complex network In this sub-section, we discuss the
existence of locality sensitive hash function families for the proposed metric on complex networks.
The LSH data structure stores all nodes in hash tables and searches for the nearest neighbor via
retrieval. A hash table contains many buckets, each identified by a bucket id. Unlike conventional
hashing, the LSH approach tries to maximize the probability of collision of near items and puts them
into the same bucket. For any given query q, the bucket h(q) is considered when searching for the nearest node.
In general, k hash functions are chosen independently and uniformly at random from the hash family
H. The output of the nearest neighbor query is taken from the union of the k buckets. The consensus
of the k functions reduces the error of approximation. For the metric defined in Section 3, we
consider k random points from the metric space. Each random point ri defines a hash function
hi(x) = sign(d(x, ri)), where d is the metric and i ∈ [1, k]. These randomized hash functions are
locality sensitive [34,35].
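To illustrate the bucket mechanism, the sketch below uses the standard random-hyperplane hash for angular distance, h_i(x) = sign(⟨x, r_i⟩), in place of the hash function stated above; this substitution, the concatenation of the k bits into a single bucket key, and all helper names are assumptions made for illustration. The rows are those of L1 = D(1) + A for two triangles joined by one edge.

```python
import math
import random
from collections import defaultdict

def make_hyperplanes(dim, k, seed=0):
    """Draw k random Gaussian directions r_1..r_k (hypothetical helper)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(k)]

def signature(x, planes):
    """k-bit bucket key: bit i records on which side of hyperplane r_i the row x lies."""
    return tuple(1 if sum(a * b for a, b in zip(x, r)) >= 0 else 0 for r in planes)

def build_table(rows, planes):
    """Store every node id in the bucket addressed by its signature."""
    table = defaultdict(list)
    for node, x in enumerate(rows):
        table[signature(x, planes)].append(node)
    return table

def angle(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u) * sum(b * b for b in v))
    return math.acos(max(-1.0, min(1.0, dot / norm)))

def query(q, rows, table, planes):
    """Approximate NN: scan only the bucket h(q); returns None if that bucket is empty."""
    bucket = table.get(signature(q, planes), [])
    return min(bucket, key=lambda node: angle(rows[node], q), default=None)

rows = [[1, 1, 1, 0, 0, 0], [1, 1, 1, 0, 0, 0], [1, 1, 1, 1, 0, 0],
        [0, 0, 1, 1, 1, 1], [0, 0, 0, 1, 1, 1], [0, 0, 0, 1, 1, 1]]
planes = make_hyperplanes(dim=6, k=4)
table = build_table(rows, planes)
hit = query(rows[0], rows, table, planes)
```

Rows at angular distance zero always share a bucket, and rows separated by a large angle are split by at least one hyperplane with high probability; in practice several independent tables are usually combined to reduce the chance of missing a near neighbor.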
Theorem 5. Let M = (V, d) be a bounded metric space. Then for any fixed data V ⊂ R^n of size
n, and for a constant c ≥ 1, there exists ε such that we may compute d(q, V) with at most $m\,n^{O(1/\epsilon)}$ expected
metric evaluations, where m is the number of dimensions of the metric space. In the case of a complex
network, m = n, so the expected time is $n^{O(2/\epsilon)}$ [27,36].
in real networks aims to capture the structural organization of the network using the connectivity
information as input [15,16]. Early work in this domain was carried out by Weiss and Jacobson
while searching for a work group within a government agency [14].
Most of the methods developed for network community detection are based on a two-step
approach. The first step is specifying a quality measure (evaluation measure, objective function)
that quantifies the desired properties of communities, and the second step is applying an algorithmic
technique to assign the nodes of the graph to communities by optimizing the objective function.
Several measures for quantifying the quality of communities have been proposed. They mostly
consider communities to be sets of nodes with many edges between them and few connections
to nodes of different communities (e.g. modularity, conductance, expansion, internal density,
average degree, triangle participation ratio, etc.).
6.1. Popular algorithms In this subsection we give a brief list of the algorithms developed for
network community detection. The algorithms are broadly categorized into three main groups:
spectral (SP), graph traversal based (GT) and semi-definite programming based (SDP). The basic
approach, the category and the complexity of the most popular algorithms are listed in Table 1.
More algorithms have been developed to solve the network community detection problem; a complete
list can be obtained from several survey articles [13,37,38].
6.2. k-central algorithm for network community detection using nearest neighbor search In this
section we describe the k-central algorithm for network community detection, which
uses nearest neighbor search inside the complex network. We also study and analyze the
advantages of the k-central method over the standard algorithms for network community detection.
Community detection based on partitioning of the graph is possible using nearest
neighbor search because the nodes of the graph are converted into points of a metric space.
This algorithm for network community detection converges automatically and does not compute
the value of an objective function in each iteration, and therefore reduces the computation compared to standard
methods. The results of this algorithm are competitive on a large set of networks, as shown in Section
7. The k-central algorithm for community detection and its detailed analysis are given below.
6.2.2. k selection
6.2.3. Initialization
The choice of the set of initial nodes is also a very important problem for the k-central algorithm.
• Input: graph G = (V, E), with the node similarity sim(xa , xb ) defined on it
6.2.4. Convergence
The convergence of network community detection algorithms is one of the least studied research
areas of network science. However, the rate of convergence is an important issue, and
a low rate of convergence is the major pitfall of most of the existing algorithms. Due to the
transformation into the metric space, our algorithm is equipped with the quick convergence
of k-partitioning on a metric space, provided a good set of initial points is given. Another crucial
pitfall suffered by the majority of the existing algorithms is the validation of the objective function in
each iteration during convergence. Our algorithm converges automatically to the optimal partition
and thus reduces the cost of validation during convergence.
Theorem 6. During the course of the k center partitioning algorithm, the cost monotonically
decreases.
Proof. Let $Z^{t} = \{z_1^{t}, \dots, z_k^{t}\}$ and $T^{t} = \{C_1^{t}, \dots, C_k^{t}\}$ denote the centers and clusters at the start of
the $t$-th iteration of the k partitioning algorithm. The first step of the iteration assigns each data point
to its closest center; therefore $\mathrm{cost}(T^{t+1}, Z^{t}) \le \mathrm{cost}(T^{t}, Z^{t})$.
On the second step, each cluster is re-centered at its mean; therefore $\mathrm{cost}(T^{t+1}, Z^{t+1}) \le \mathrm{cost}(T^{t+1}, Z^{t})$.
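The two steps of the proof can be traced on a small example. The sketch below is an illustration under stated assumptions rather than the paper's k-central algorithm: because a mean is not defined in a general metric space, each cluster is re-centered at its medoid (the member minimizing the total distance to the other members), the cost is taken to be the sum of distances from each point to its center, and the function name and data are hypothetical.

```python
def k_partition(dist, n, centers, max_iter=100):
    """Alternate assignment and re-centering; record the cost after each assignment."""
    costs = []
    while len(costs) < max_iter:
        # Step 1: assign every point to its closest current center.
        assign = [min(centers, key=lambda c: dist(i, c)) for i in range(n)]
        costs.append(sum(dist(i, assign[i]) for i in range(n)))
        # Step 2: re-center every cluster at its medoid.
        new_centers = []
        for c in centers:
            members = [i for i in range(n) if assign[i] == c]
            if not members:                      # keep a center that attracted no points
                new_centers.append(c)
                continue
            new_centers.append(min(members, key=lambda m: sum(dist(i, m) for i in members)))
        if new_centers == centers:               # fixed point reached
            break
        centers = new_centers
    return assign, centers, costs

# Hypothetical example: six points on a line, two poorly chosen initial centers.
values = [0, 1, 2, 10, 11, 12]
line_dist = lambda i, j: abs(values[i] - values[j])
assign, centers, costs = k_partition(line_dist, len(values), centers=[0, 1])
```

On this example the recorded costs are 31, 6 and 4, and the centers settle on the middle point of each group, which matches the monotone behavior stated in Theorem 6.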
nearest neighbor search based community detection method for complex networks over several
real networks. The objective of the experiments is to verify the behavior of the algorithm and the time
it requires. One of the major goals is to compare the
behavior of the algorithm with the performance of other popular methods from the
literature with respect to standard measures such as conductance and modularity. Experiments
are conducted to compare the results (Tables 3, 4 and 5) of our algorithm with the state-of-the-art
algorithms (Table 1) available in the literature in terms of the measures most commonly used by
researchers in the domain of network community detection. The details of the experiments
and the analysis of the results are given in the following subsections.
7.1. Experimental designs Experiment for comparison: In this experiment we compare
several algorithms for network community detection with our proposed algorithm developed using
nearest neighbor search in complex networks. The experiment is performed on a large list of network
data sets. Two versions of the experiment are developed for comparison purposes, based on two
different quality measures, conductance and modularity. The results are shown in Tables 3 and
4 respectively.
Experiment on performance and time: In this experiment we evaluate the performance of our algorithm
on the network collection. We also evaluate the time taken by our algorithm on
networks of different sizes; the results are shown in Table 5.
7.2. Performance indicator Modularity: Modularity is the most popular quality measure for
network community detection. The modularity index assigns high scores to communities
that have more internal edges than expected in a random-network model which preserves
the degree distribution of the given network.
Conductance: Conductance is widely used in the graph partitioning literature. The conductance
of a set S with complement S^C is the ratio of the number of edges connecting nodes in S to nodes
in S^C to the total number of edges incident to S or to S^C (whichever number is smaller).
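Both indicators can be computed directly from an adjacency structure. The sketch below assumes an undirected, unweighted graph stored as a dictionary of neighbor sets; it measures the edges incident to a set by its volume (the sum of its node degrees) and uses the usual degree-preserving null model for modularity. These conventions and the function names are assumptions for illustration.

```python
def conductance(adj, S):
    """Cut edges between S and its complement divided by the smaller of the two volumes."""
    S = set(S)
    cut = sum(1 for u in S for v in adj[u] if v not in S)
    vol_S = sum(len(adj[u]) for u in S)
    vol_rest = sum(len(adj[u]) for u in adj if u not in S)
    return cut / min(vol_S, vol_rest)

def modularity(adj, communities):
    """Sum over communities of (internal edge fraction) - (expected fraction)."""
    two_m = sum(len(nbrs) for nbrs in adj.values())   # twice the number of edges
    q = 0.0
    for C in communities:
        C = set(C)
        internal = sum(1 for u in C for v in adj[u] if v in C)  # each edge counted twice
        degree = sum(len(adj[u]) for u in C)
        q += internal / two_m - (degree / two_m) ** 2
    return q

# Hypothetical example: two triangles {0,1,2} and {3,4,5} joined by the edge 2-3.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
phi = conductance(adj, {0, 1, 2})
Q = modularity(adj, [{0, 1, 2}, {3, 4, 5}])
```

For this graph the split into the two triangles gives a conductance of 1/7 and a modularity of 5/14; lower conductance and higher modularity both indicate a better partition.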
7.3. Datasets
A list of real networks taken from several real-life interactions is considered for our experiments;
they are tabulated in Table 2 below. We also list the number of nodes, the number of edges, the average
diameter, the data complexity for community detection (DCC) and the k value used (Section 6.2.2). The
values of the last column can be used to assess the quality of the detected communities.
7.4. Computational results In this subsection we compare two groups of algorithms
for network community detection with our proposed algorithm using nearest neighbor search.
The experiment is performed on a large list of network data sets. Two versions of the experiment
are developed for comparison purposes, based on two different quality measures, conductance and
modularity. The results based on conductance are shown in Table 3 and the results based
on modularity are shown in Table 4. Regarding the two groups of algorithms,
the first group contains algorithms based on semi-definite programming and the second group contains
algorithms based on graph traversal approaches. For each group, we have taken the best value of
conductance in Table 3 and the best value of modularity in Table 4 among all the algorithms in the
group. The results obtained with our approach are very competitive with most of the well-known
algorithms in the literature, and this is supported by the large collection of datasets. On the other
hand, it can be observed that the time taken (Table 5) by our algorithm is considerably less than that of
the other methods, which agrees with the theoretical findings.
Table 3. Comparison of our approaches with other best methods in terms of conductance
Name Spectral SDP GT Index M-tree LSH
Facebook 0.0097 0.1074 0.1044 0.1082 0.0827 0.0340
Gplus 0.0119 0.1593 0.1544 0.1602 0.1207 0.0500
Twitter 0.0035 0.0480 0.0465 0.0483 0.0363 0.0150
Epinions1 0.0087 0.1247 0.1208 0.1254 0.0941 0.0390
LiveJournal1 0.0039 0.0703 0.0680 0.0706 0.0523 0.0218
Pokec 0.0009 0.0174 0.0168 0.0175 0.0129 0.0054
Slashdot0811 0.0005 0.0097 0.0094 0.0098 0.0072 0.0030
Slashdot0922 0.0007 0.0138 0.0133 0.0138 0.0102 0.0043
Friendster 0.0012 0.0273 0.0263 0.0273 0.0200 0.0084
Orkut 0.0016 0.0411 0.0397 0.0412 0.0300 0.0126
Youtube 0.0031 0.0869 0.0838 0.0871 0.0633 0.0267
DBLP 0.0007 0.0210 0.0203 0.0211 0.0152 0.0064
Arxiv-AstroPh 0.0024 0.0929 0.0895 0.0931 0.0669 0.0283
web-Stanford 0.0007 0.0320 0.0308 0.0320 0.0229 0.0097
Amazon0601 0.0018 0.0899 0.0865 0.0900 0.0643 0.0273
P2P-Gnutella31 0.0009 0.0522 0.0503 0.0523 0.0373 0.0158
RoadNet-CA 0.0024 0.1502 0.1445 0.1504 0.1070 0.0455
Wiki-Vote 0.0026 0.1853 0.1783 0.1855 0.1318 0.0561
7.5. Results analysis and achievements
In this subsection, we analyze the results of the experiments reported above and highlight
what they achieve. The results in Tables 3, 4 and 5 show that the proposed nearest neighbor
based method for network community detection, using either a metric tree or locality sensitive
hashing, is competitive with respect to conductance and modularity as well as running time.
The results also show that our methods offer a case-by-case solution to network community
detection: the choice between them depends on whether running time or better
conductance/modularity is the priority.
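To make the time comparison concrete, the running times can be turned into ratios. The sketch
below copies three rows of Table 5 and divides the Spectral time by the M-tree and LSH times.
The table states no time unit, so only the unitless ratios are meaningful; the names `TIMES`
and `speedup` are illustrative and not part of the original experiments.

```python
# Rows copied from Table 5: (Spectral, M-tree, LSH) running times.
TIMES = {
    "Facebook": (6, 4, 1),
    "Friendster": (2061, 880, 280),
    "Wiki-Vote": (54, 23, 7),
}

def speedup(baseline, ours):
    """How many times faster `ours` is than `baseline`."""
    return baseline / ours

for name, (spectral, mtree, lsh) in TIMES.items():
    print(f"{name}: M-tree x{speedup(spectral, mtree):.2f}, "
          f"LSH x{speedup(spectral, lsh):.2f}")
```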
Table 4. Comparison of our approaches with other best methods in terms of modularity
Name Spectral SDP GT Index M-tree LSH
Facebook 0.4487 0.5464 0.5434 0.5472 0.5450 0.5421
Gplus 0.2573 0.4047 0.3998 0.4056 0.4041 0.4021
Twitter 0.3261 0.3706 0.3691 0.3709 0.3692 0.3669
Epinions1 0.0280 0.1440 0.1401 0.1447 0.1443 0.1437
LiveJournal1 0.0791 0.1455 0.1432 0.1458 0.1450 0.1439
Pokec 0.0129 0.0294 0.0288 0.0295 0.0292 0.0287
Slashdot0811 0.0038 0.0130 0.0127 0.0131 0.0129 0.0127
Slashdot0922 0.0045 0.0176 0.0171 0.0176 0.0174 0.0172
Friendster 0.0275 0.0536 0.0526 0.0536 0.0531 0.0525
Orkut 0.0294 0.0689 0.0675 0.0690 0.0685 0.0678
Youtube 0.0096 0.0934 0.0903 0.0936 0.0934 0.0930
DBLP 0.4011 0.4214 0.4207 0.4215 0.4196 0.4171
Arxiv-AstroPh 0.4174 0.5079 0.5045 0.5081 0.5061 0.5035
web-Stanford 0.3595 0.3908 0.3896 0.3908 0.3890 0.3866
Amazon0601 0.1768 0.2649 0.2615 0.2650 0.2637 0.2621
P2P-Gnutella31 0.0009 0.0522 0.0503 0.0523 0.0523 0.0523
RoadNet-CA 0.0212 0.1690 0.1633 0.1692 0.1680 0.1664
Wiki-Vote 0.0266 0.2093 0.2023 0.2095 0.2090 0.2083
Table 5. Comparison of our approaches with other best methods in terms of time
Name Spectral SDP GT Index M-tree LSH
Facebook 6 7 11 6 4 1
Gplus 797 832 1342 661 390 115
Twitter 462 485 786 398 235 68
Epinions1 411 419 667 292 174 56
LiveJournal1 1297 1332 2129 969 576 179
Pokec 1281 1305 2075 901 538 173
Slashdot0811 552 561 891 382 228 74
Slashdot0922 561 570 906 389 232 75
Friendster 2061 2105 3352 1477 880 280
Orkut 1497 1529 2435 1074 640 203
Youtube 829 844 1340 578 345 111
DBLP 381 403 655 341 201 57
Arxiv-AstroPh 217 230 375 197 116 33
web-Stanford 498 525 852 437 258 74
Amazon0601 653 678 1089 520 308 93
P2P-Gnutella31 182 184 293 124 74 24
RoadNet-CA 758 785 1261 599 355 107
Wiki-Vote 54 55 88 39 23 7
8. Conclusions
In this paper, we studied the problem of nearest neighbor queries in complex networks. Nearest
neighbor search in complex networks cannot be handled by straightforward application of earlier
approaches designed for Euclidean space, because node nearness is computed by graph traversal
rather than by geometric distance. We presented a transformation of the graph to a metric space
and an efficient computation of nearest neighbors therein using a metric tree and locality
sensitive hashing. Our techniques can be applied to various structural analyses of complex
networks using geometric approaches. To validate the performance of the proposed nearest
neighbor search for complex networks, we applied our approaches to the community detection
problem. The results obtained on several network data sets support the usefulness of the
proposed method and motivate further applications of nearest neighbor search to other
structural analyses of complex networks.
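The general idea of hashing-based nearest neighbor search over node vectors can be sketched as
follows. This is not the construction used in this paper: the node representation (adjacency
rows with a self-loop), the random-hyperplane hash family, and all function names are
assumptions chosen only to keep the example short and self-contained.

```python
import math
import random

def make_hyperplanes(dim, n_bits, seed=0):
    """Draw `n_bits` random Gaussian hyperplanes in `dim` dimensions."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]

def signature(vec, planes):
    """One bit per hyperplane: which side of the plane the vector lies on."""
    return tuple(sum(p * x for p, x in zip(plane, vec)) >= 0.0 for plane in planes)

def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return 1.0 - dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def build_index(vectors, planes):
    """Group nodes into buckets keyed by their hash signature."""
    buckets = {}
    for node, vec in vectors.items():
        buckets.setdefault(signature(vec, planes), []).append(node)
    return buckets

def nearest(query, vectors, planes, buckets):
    """Approximate nearest node: rank only the nodes in the query's bucket."""
    # Fall back to a full scan when the query's bucket is empty.
    candidates = buckets.get(signature(query, planes)) or list(vectors)
    return min(candidates, key=lambda n: cosine_distance(query, vectors[n]))

# Hypothetical node vectors: adjacency rows (with self-loop) of two triangles
# joined by the edge (2, 3).
vectors = {
    0: [1, 1, 1, 0, 0, 0], 1: [1, 1, 1, 0, 0, 0], 2: [1, 1, 1, 1, 0, 0],
    3: [0, 0, 1, 1, 1, 1], 4: [0, 0, 0, 1, 1, 1], 5: [0, 0, 0, 1, 1, 1],
}
planes = make_hyperplanes(dim=6, n_bits=4)
buckets = build_index(vectors, planes)
hit = nearest(vectors[4], vectors, planes, buckets)
print("nearest node to the vector of node 4:", hit)
```

Only the nodes sharing the query's bucket are compared, which is where the time saving comes
from; the price is that the true nearest neighbor can be missed when it hashes to a different
bucket.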
Acknowledgments
Author Contributions
Suman Saha proposed the algorithm and prepared the manuscript. S.P. Ghrera was in charge
of the overall research and critical revision of the paper.
Conflicts of Interest
References
1. Uhlmann, J.K. Satisfying general proximity / similarity queries with metric trees.
Information Processing Letters 1991, 40, 175 – 179.
2. Ruiz, E.V. An algorithm for finding nearest neighbours in (approximately) constant average
time. Pattern Recognition Letters 1986, 4, 145 – 157.
3. Panigrahy, R. Entropy Based Nearest Neighbor Search in High Dimensions. Proceedings
of the Seventeenth Annual ACM-SIAM Symposium on Discrete Algorithm; Society for
Industrial and Applied Mathematics: Philadelphia, PA, USA, 2006; SODA ’06, pp.
1186–1195.
4. Indyk, P.; Motwani, R. Approximate Nearest Neighbors: Towards Removing the Curse
of Dimensionality. Proceedings of the Thirtieth Annual ACM Symposium on Theory of
Computing; ACM: New York, NY, USA, 1998; STOC ’98, pp. 604–613.
5. Gionis, A.; Indyk, P.; Motwani, R. Similarity Search in High Dimensions via Hashing.
Proceedings of the 25th International Conference on Very Large Data Bases; Morgan
Kaufmann Publishers Inc.: San Francisco, CA, USA, 1999; VLDB ’99, pp. 518–529.
6. Dasgupta, S.; Freund, Y. Random Projection Trees and Low Dimensional Manifolds.
Proceedings of the Fortieth Annual ACM Symposium on Theory of Computing; ACM:
New York, NY, USA, 2008; STOC ’08, pp. 537–546.
7. Akoglu, L.; Khandekar, R.; Kumar, V.; Parthasarathy, S.; Rajan, D.; Wu, K.L. Fast
Nearest Neighbor Search on Large Time-Evolving Graphs. Proceedings of the European
Conference on Machine Learning and Knowledge Discovery in Databases - Volume 8724;
Springer-Verlag New York, Inc.: New York, NY, USA, 2014; ECML PKDD 2014, pp. 17–33.
8. Liu, T.; Moore, A.W.; Gray, E.; Yang, K. An investigation of practical approximate nearest
neighbor algorithms. NIPS2004. MIT Press, 2004, pp. 825–832.
9. Freeman, L.C. Centrality in social networks conceptual clarification. Social Networks 1978,
p. 215.
10. Carrington, P.J.; Scott, J.; Wasserman, S., Eds. Models and methods in social network
analysis; Cambridge University Press, 2005.
11. Newman, M. The Structure and Function of Complex Networks. SIAM review 2003,
45, 167–256.
12. Radicchi, F.; Castellano, C.; Cecconi, F.; Loreto, V.; Parisi, D. Defining and identifying
communities in networks. Proceedings of the National Academy of Sciences 2004, 101, 2658.
13. Fortunato, S. Community detection in graphs. Physics Reports 2010, 486, 75 – 174.
Algorithms 2015, xx 17
14. Weiss, R.; Jacobson, E. A Method for the Analysis of Complex Organisations. American
Sociological Review 1955, 20, 661–668.
15. Schaeffer, S.E. Graph clustering. Computer Science Review 2007, 1, 27 – 64.
16. Newman, M.E.J.; Girvan, M. Finding and evaluating community structure in networks.
Physical Review 2004, E 69.
17. Luxburg, U. A tutorial on spectral clustering. Statistics and Computing 2007, 17, 395–416.
18. Pons, P.; Latapy, M. Computing communities in large networks using random walks. J. of
Graph Alg. and App. 2004, 10, 284–293.
19. Duch, J.; Arenas, A. Community detection in complex networks using Extremal
Optimization. Physical Review E 2005, 72, 027104.
20. Chakrabarti, D. AutoPart: Parameter-Free Graph Partitioning and Outlier Detection.
PKDD; Boulicaut, J.F.; Esposito, F.; Giannotti, F.; Pedreschi, D., Eds. Springer, 2004,
Vol. 3202, Lecture Notes in Computer Science, pp. 112–124.
21. Macropol, K.; Singh, A.K. Scalable Discovery of Best Clusters on Large Graphs. PVLDB
2010, 3, 693–702.
22. Levorato, V.; Petermann, C. Detection of communities in directed networks based on
strongly p-connected components. CASoN. IEEE, 2011, pp. 211–216.
23. Brandes, U.; Gaertler, M.; Wagner, D. Experiments on Graph Clustering Algorithms. ESA;
Battista, G.D.; Zwick, U., Eds. Springer, 2003, Vol. 2832, Lecture Notes in Computer
Science, pp. 568–579.
24. Bullmore, E.; Sporns, O. Complex brain networks: graph theoretical analysis of structural
and functional systems. Nature Reviews Neuroscience 2009, 10, 186–198.
25. Saha, S.; Ghrera, S.P. Network Community Detection on Metric Space. Algorithms 2015,
8, 680–696.
26. Ciaccia, P.; Patella, M.; Zezula, P. M-tree: An Efficient Access Method for Similarity
Search in Metric Spaces. Proceedings of the 23rd International Conference on Very Large
Data Bases (VLDB’97); Morgan Kaufmann Publishers, Inc.: Athens, Greece, 1997; pp.
426–435.
27. Ciaccia, P.; Patella, M.; Zezula, P. A Cost Model for Similarity Queries in Metric Spaces.
Proceedings of the 16th ACM SIGACT-SIGMOD-SIGART Symposium on Principles of
Database Systems (PODS’97); ACM Press: Seattle, WA, 1998; pp. 59–68.
28. Motwani, R.; Naor, A.; Panigrahy, R. Lower Bounds on Locality Sensitive Hashing. SIAM
J. Discrete Math. 2007, 21, 930–935.
29. Paulevé, L.; Jégou, H.; Amsaleg, L. Locality Sensitive Hashing: A Comparison of Hash
Function Types and Querying Mechanisms. Pattern Recogn. Lett. 2010, 31, 1348–1358.
30. Indyk, P.; Motwani, R. Approximate Nearest Neighbors: Towards Removing the Curse
of Dimensionality. Proceedings of the Thirtieth Annual ACM Symposium on Theory of
Computing; ACM: New York, NY, USA, 1998; STOC ’98, pp. 604–613.
31. Gionis, A.; Indyk, P.; Motwani, R. Similarity Search in High Dimensions via Hashing.
Proceedings of the 25th International Conference on Very Large Data Bases; Morgan
Kaufmann Publishers Inc.: San Francisco, CA, USA, 1999; VLDB ’99, pp. 518–529.
Algorithms 2015, xx 18
32. Joly, A.; Buisson, O. A posteriori multi-probe locality sensitive hashing. ACM Multimedia;
El-Saddik, A.; Vuong, S.; Griwodz, C.; Bimbo, A.D.; Candan, K.S.; Jaimes, A., Eds. ACM,
2008, pp. 209–218.
33. Datar, M.; Immorlica, N.; Indyk, P.; Mirrokni, V.S. Locality-sensitive Hashing Scheme
Based on P-stable Distributions. Proceedings of the Twentieth Annual Symposium on
Computational Geometry; ACM: New York, NY, USA, 2004; SCG ’04, pp. 253–262.
34. Andoni, A.; Indyk, P. Near-optimal Hashing Algorithms for Approximate Nearest Neighbor
in High Dimensions. Commun. ACM 2008, 51, 117–122.
35. Charikar, M.S. Similarity Estimation Techniques from Rounding Algorithms. Proceedings
of the Thiry-fourth Annual ACM Symposium on Theory of Computing; ACM: New York,
NY, USA, 2002; STOC ’02, pp. 380–388.
36. Indyk, P. A Sublinear Time Approximation Scheme for Clustering in Metric Spaces. Proc.
40th IEEE FOCS, 2000, pp. 154–159.
37. Leskovec, J.; Lang, K.J.; Mahoney, M.W. Empirical comparison of algorithms for network
community detection. WWW; Rappa, M.; Jones, P.; Freire, J.; Chakrabarti, S., Eds.
ACM, 2010, pp. 631–640.
38. Yang, J.; Leskovec, J. Defining and Evaluating Network Communities Based on
Ground-Truth. ICDM; Zaki, M.J.; Siebes, A.; Yu, J.X.; Goethals, B.; Webb, G.I.; Wu, X.,
Eds. IEEE Computer Society, 2012, pp. 745–754.
39. van Dongen, S. A Cluster Algorithm For Graphs. Technical Report INS-R 0010, CWI,
Amsterdam, the Netherlands, 2000.
40. Eckmann, J.P.; Moses, E. Curvature of co-links uncovers hidden thematic layers in the
World Wide Web. PNAS 2002, 99, 5825–5829.
41. Girvan, M.; Newman, M.E.J. Community structure in social and biological networks.
Proceedings of the National Academy of Sciences 2002, 99, 7821–7826.
42. Zhou, H.; Lipowsky, R. Network Brownian Motion: A New Method to Measure
Vertex-Vertex Proximity and to Identify Communities and Subcommunities. International
Conference on Computational Science; Bubak, M.; van Albada, G.D.; Sloot, P.M.A.;
Dongarra, J., Eds. Springer, 2004, Vol. 3038, Lecture Notes in Computer Science, pp.
1062–1069.
43. Reichardt, J.; Bornholdt, S. Detecting fuzzy community structures in complex networks
with a Potts model. Phys Rev Lett 2004, 93, 218701.
44. Clauset, A.; Newman, M.E.J.; .; Moore, C. Finding community structure in very large
networks. Physical Review E 2004, pp. 1– 6.
45. Wu, F.; Huberman, B. Finding communities in linear time: a physics approach. The
European Physical Journal B - Condensed Matter and Complex Systems 2004, 38, 331–338.
46. Fortunato, S.; Latora, V.; Marchiori, M. Method to find community structures based on
information centrality. Physical Review E (Statistical, Nonlinear, and Soft Matter Physics)
2004, 70, 056104.
47. Donetti, L.; MuÃśoz, M.A. Detecting network communities: a new systematic and efficient
algorithm. Journal of Statistical Mechanics: Theory and Experiment 2004, 2004, P10012.
Algorithms 2015, xx 19
48. Guimera, R.; Amaral, L.A.N. Functional cartography of complex metabolic networks.
Nature 2005, 433, 895–900.
49. Capocci, A.; Servedio, V.D.P.; Caldarelli, G.; Colaiori, F. Detecting communities in large
networks. Physica A: Statistical Mechanics and its Applications 2004, 352, 669–676.
50. Duch, J.; Arenas, A. Community detection in complex networks using Extremal
Optimization. Physical Review E 2005, 72, 027104.
51. Bagrow, J.P.; Bollt, E.M. Local method for detecting communities. Phys. Rev. E 2005,
72, 046108.
52. Palla, G.; Derenyi, I.; Farkas, I.; Vicsek, T. Uncovering the overlapping community
structure of complex networks in nature and society. Nature 2005, 435, 814–818.
53. Raghavan, U.N.; Albert, R.; Kumara, S. Near linear time algorithm to detect community
structures in large-scale networks. Phys. Rev. E 2007, 76, 036106.
54. Rosvall, M.; Bergstrom, C.T. Maps of random walks on complex networks reveal community
structure. Proceedings of the National Academy of Sciences 2008, 105, 1118–1123.
55. Ronhovde, P.; Nussinov, Z. Multiresolution community detection for megascale networks
by information-based replica correlations. Phys. Rev. E 2009, 80, 016109.
56. Leskovec, J.; Lang, K.J.; Dasgupta, A.; Mahoney, M.W. Community Structure in Large
Networks: Natural Cluster Sizes and the Absence of Large Well-Defined Clusters, 2008.
57. Gonzalez, T.F. Clustering to Minimize the Maximum Intercluster Distance. Theor.
Comput. Sci. 1985, 38, 293–306.
© 2015 by the authors; licensee MDPI, Basel, Switzerland. This article is an open access article
distributed under the terms and conditions of the Creative Commons Attribution license
(http://creativecommons.org/licenses/by/4.0/).