Graph Mining: A Survey of Graph Mining Techniques: August 2012
3 authors:
Simon Fong
University of Macau
Abstract: Data mining comprises many data analysis techniques. Its basic objective is to discover hidden and useful patterns in very large data sets. Graph mining, which has gained much attention in the last few decades, is one of the novel approaches for mining data sets represented by graph structures. Graph mining finds applications in various problem domains, including bioinformatics, chemical reactions, program flow structures, computer networks and social networks. Different data mining approaches are used for mining graph-based data and performing useful analysis on the mined data. Various graph mining approaches have been proposed in the literature. Each of these approaches is based on classification, clustering or decision tree data mining techniques. In this study, we present a comprehensive review of various graph mining techniques. These techniques are critically evaluated, and the evaluation is based on different parameters. In our future work, we will provide our own classification-based graph mining technique, which will efficiently and accurately perform mining on graph-structured data.

Index Terms: Graph Mining, Sub graphs, frequent graphs, Data Mining

I. INTRODUCTION

Over the last few years, a considerable amount of research on data mining has sought better performance and innovation. One innovation is mining from structured data, which is a new challenge. Since a structure is represented by proper relations and a graph can easily represent such relations, knowledge discovery from graph-structured data poses a general problem for mining from structured data. Some examples amenable to graph mining are finding typical web browsing patterns, identifying typical substructures of chemical compounds, finding typical subsequences of DNA and discovering diagnostic rules from patient history records [21].

Graph mining techniques have been categorized into the following groups. (1) Graph clustering is the task of grouping the vertices of a given input graph into clusters, taking the edge structure of the graph into consideration so that there are many edges within each cluster and relatively few between the clusters [22]. Graph clustering is based on unsupervised learning, in which the classes are not known prior to clustering. The graph clusters are formed based on similarities in the underlying graph-structured data. (2) Graph classification: the main task is to classify separate, individual graphs in a graph database into two or more categories/classes [22]. Classification is based on supervised or semi-supervised learning, in which the classes of the data are defined in advance. (3) Sub graph mining: a sub graph is a graph whose vertices and edges are subsets of those of another graph. The frequent sub graph mining problem is to produce the set of sub graphs occurring in at least some given threshold of the given n input example graphs [23].

In this study we provide a comprehensive summary of the different graph mining techniques. Each technique is outlined with its technical details and its major research contributions, along with its limitations. These techniques are then critically evaluated.

The rest of this paper is organized as follows. Section II provides the underlying terminology used in graph theory. Section III provides a detailed literature review of the graph mining techniques proposed in the last few decades. Section IV focuses on the critical analysis of these graph mining techniques, whose details are discussed in Section III. The study ends with the conclusion of our work and some future directions in Section V.

II. BASIC GRAPH THEORY

A graph G is a pair of sets G = (V, E). V is the set of vertices, and the number of vertices n = |V| is the order of the graph. The set E contains the edges of the graph. In an undirected graph, each edge is an unordered pair {v, w}. In a directed graph (also called a digraph in much of the literature), edges are ordered pairs. The vertices v and w are called the endpoints of the edge. The edge count |E| = m is the size of the graph. In a weighted graph, a weight function ω : E → R is defined that assigns a weight to each edge. A graph is planar if it can be drawn in a plane without any of the edges crossing [22].
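The basic definitions of order, size, edge type and edge weight can be made concrete with a small sketch. The Python code below is only an illustration and is not taken from any of the surveyed works; the class name `Graph`, its fields and the example vertices are hypothetical choices made here.

```python
from dataclasses import dataclass, field


@dataclass
class Graph:
    """Graph G = (V, E) with a weight attached to every edge (illustrative only)."""
    directed: bool = False
    vertices: set = field(default_factory=set)   # V
    weights: dict = field(default_factory=dict)  # maps each edge in E to its weight

    def add_edge(self, v, w, weight=1.0):
        # Undirected edges are unordered pairs {v, w};
        # directed edges are ordered pairs (v, w).
        edge = (v, w) if self.directed else frozenset((v, w))
        self.vertices.update((v, w))
        self.weights[edge] = weight

    @property
    def order(self):
        """n = |V|, the number of vertices."""
        return len(self.vertices)

    @property
    def size(self):
        """m = |E|, the number of edges."""
        return len(self.weights)


undirected = Graph()
undirected.add_edge("a", "b", weight=2.0)
undirected.add_edge("b", "c")
undirected.add_edge("b", "a", weight=5.0)  # same unordered pair as {a, b}

digraph = Graph(directed=True)
digraph.add_edge("a", "b")
digraph.add_edge("b", "a")  # a different ordered pair from (a, b)

print(undirected.order, undirected.size)  # 3 2
print(digraph.order, digraph.size)        # 2 2
```

Storing undirected edges as `frozenset` objects makes {a, b} and {b, a} the same key, whereas the ordered tuples used for the digraph keep (a, b) and (b, a) distinct.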
The degree matrix of G(V, E) is

\[
D = \begin{pmatrix}
\deg(v_1) & 0 & \cdots & 0 \\
0 & \deg(v_2) & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & \deg(v_n)
\end{pmatrix}
\]

The length of a path is the number of edges on it, and the distance between v and u is the length of the shortest path connecting them in G. The distance from a vertex to itself is zero: the path from a vertex to itself is an empty edge sequence. A graph is connected if there exist paths between all pairs of vertices. If there are vertices that cannot be reached from others, the graph is disconnected. The minimum number of edges that would need to be removed from G in order to make it disconnected is the edge connectivity of the graph. A cycle is a simple path that begins and ends at the same vertex. A graph that contains no cycle is acyclic and is also called a forest. A connected forest is called a tree [21].

A sub graph G^S = (S, E_S) of G = (V, E) is composed of a set of vertices S ⊆ V and a set of edges E_S ⊆ E such that {v, u} ∈ E_S implies u, v ∈ S; the graph G is a super graph of G^S. A connected acyclic sub graph that includes all vertices is called a spanning tree of the graph. A spanning tree necessarily has exactly n-1 edges. If the edges are assigned weights, the spanning tree with the smallest total weight is called the minimum spanning tree. Note that there may exist several minimum spanning trees, which may even be edge disjoint [22].

Two graphs G_i = (V_i, E_i) and G_j = (V_j, E_j) are isomorphic if there exists a bijective (one-to-one) mapping f : V_i → V_j (called an isomorphism) such that {v, w} ∈ E_i if and only if {f(v), f(w)} ∈ E_j. A bipartite graph is a graph where the vertex set V can be split into two sets A and B such that all edges lie between those two sets: if {v, w} ∈ E, either v ∈ A and w ∈ B, or v ∈ B and w ∈ A [23]. A complete graph is a graph where every pair of distinct vertices is adjacent. A complete graph on n vertices is denoted by K_n (or sometimes by K(n)). The complete graph K_n of order n is a simple graph with n vertices in which every vertex is adjacent to every other vertex; it is also called a clique.

III. LITERATURE REVIEW

This section summarizes the different proposed graph mining algorithms with their major research contributions and limitations.

[...] unlabeled nodes of the graphs are predicted by comparing their betweenness measure with the maximum betweenness measure. The technique proposed in [1] has been implemented on the CORA database, and different experiments have been performed using this database. All of these experiments showed that [1] is more efficient, can accurately classify the unlabeled nodes of the graphs, and outperforms existing techniques in the literature such as [2] and [3]. Its main achievement is handling graphs with a larger number of nodes and edges than the techniques in [2] and [3].

In [4], Kashima et al. have proposed a new method for the classification of graphs that have an extremely large number of nodes and edges. Their graph classification method is based on the kernel method; the details of kernel methods can be found in [5]. The method proposed in [4] efficiently computes the inner product of two graphs to construct a feature space for classifying the graphs. The technique takes an unknown graph as input and assigns it to an appropriate class. The method calculates the similarity of two graphs based on the nodes of the graphs and the labels of their edges. In [4], graphs are classified into the same group if their similarities are identical. The technique has been implemented for predicting the properties of chemical compounds using the Mutag and PTC datasets, and different experiments have been performed using these datasets. These experiments showed that [4] is not as efficient as [6] on the Mutag dataset, but on the PTC dataset it is more efficient than existing techniques in the literature such as [6] and [7].

In [8], Dhillon et al. have presented an efficient and fast technique for graph clustering. The technique can handle graphs with a large number of nodes and a very large number of edges. Their graph clustering technique is based on multilevel methods that use the weighted kernel k-means objective function in the refinement algorithm. The details of the weighted kernel k-means objective function for multilevel methods can be found in [9]. Unlike existing graph clustering techniques in the literature, the technique proposed in [8] does not restrict the clusters to be of nearly equal size. Furthermore, the graph clustering objective functions proposed in [8] can be specialized for all phases of the algorithm according to the situation. The technique has been implemented on the IMDB Movie dataset, which has 1.2 million nodes and 7.6 million edges, and different experiments have been performed using this dataset. The proposed technique computes 5000 clusters and 5000 eigenvectors [8], which is impractical for the algorithm in [9] due to requirements of main
memory up to 25 GB. All of these experiments showed that [8] is more efficient, not only in memory consumption but also in running time, than existing techniques such as [9]. Its main achievement is handling graphs with a large number of nodes and edges, which is impractical with the existing graph clustering technique in [9].

In [10], Dias and Ochi have presented an enhancement of the basic Genetic Algorithm (GA). Their proposed technique can efficiently handle the issues of graph partitioning in large graph databases. The work in [10] proposes different procedures as evolutionary steps to improve the performance of the basic GA. The proposed modifications do not alter the global behavior of the basic GA technique, so they are implemented as add-ons to the basic GA. The proposed procedures modify the local search and other diversification procedures [10]. They are implemented in 7 different versions. The performance of the proposed algorithms was evaluated for different numbers of nodes in the graph. The results established that the proposed algorithms produce high quality clusters while maintaining the same running time as existing GAs in the literature. The main contribution of the proposed procedures is good performance when the number of nodes is as high as 500.

In [11], Zhao et al. have proposed a new technique for mining closed free trees in large graphs. Their technique is called CFFTree (Closed Frequent Free Tree). It efficiently mines frequent closed free trees in large graph databases whose nodes are labeled. The technique proposed in [11] can handle the issues of mining frequent free trees in large graph databases, which is NP-complete; the details of NP problems can be found in [12]. A tree t with no designated root is called a free tree, and a free tree t is closed if no super tree of t with the same frequency as t exists [11]. The authors suggested that closed free trees are very few in a graph but can maintain the same useful information as free trees. Furthermore, they established that the computational time of the closed frequent free tree mining algorithm is polynomial and that closed free trees are more efficient. The work in [11] proposes efficient pruning methods, namely safe labeling pruning, safe positioning pruning, automorphism-based pruning and canonical mapping-based pruning, to prune free trees that cannot generate closed free trees and so tune the mining process; the details of these methods can be found in [11]. The technique has been implemented on the AIDS antiviral screen chemical compound dataset from the Developmental Therapeutics Program of NCI/NIH, and different experiments have been performed using this database. All of these experiments showed that [11] is more efficient and can compute free trees more accurately than [13]. The proposed technique was the only technique developed for mining closed frequent free trees at the time the paper was written. The main contribution of the proposed technique is working on the novel concept of closed frequent free tree mining and designing an algorithm for mining closed trees from graph databases.

In [14], Le et al. have proposed a new method for clustering bipartite graphs, called the coring method. The proposed technique can handle the issues of partitioning a large graph into small sub graphs. The nodes of each clustered sub graph are strongly interconnected within the sub graph and weakly connected to the nodes of other sub graphs. The coring method can handle both weighted and unweighted graphs. The technique in [14] computes clusters that have a highly dense core region encircled by a lower-density region. The proposed method works in the following steps.

Step 1: The coring method computes the density variation sequence. The method iteratively computes the minimum density D and the set M of nodes having minimum density. The output of this step is the sequence of D values and M sets.

Step 2: Following step 1, the coring method identifies the core nodes. To do so, the method calculates the rate of decrease/increase in the value of the minimum density. If the rate of increase/decrease in the D value is greater than the threshold and the sequence of M is also in some order, then the nodes are identified as core nodes.

Step 3: The coring method partitions the graph nodes into clusters. The set of core nodes is the output passed to the next step.

Step 4: In the final step of the technique, the core groups are expanded into full clusters. The core nodes are the centers of the clusters, and the lower-density nodes encircle these core nodes.

The technique proposed in [14] has been implemented on a microarray dataset containing 62 samples, including 40 tumor and 22 normal colon tissues. Each sample consists of 2000 gene expressions. The technique in [14] successfully clusters the tumor tissues and normal tissues in the database. The method was further evaluated using an image of size 200x300, and [14] efficiently clusters the core region from the image. The proposed method was also evaluated after introducing noise into the image, and it still successfully clusters the core region. The main strength of the proposed work is that the method can be used efficiently on noisy data.

In [15], Chen et al. have proposed a graph model that can efficiently handle the many-to-many correspondence problem among concepts in ontologies. Their proposed technique uses a weighted bipartite graph to model ontologies. A similarity measure is computed for all the edges using similarity measure techniques such as the one in [16]. The proposed technique assigns the similarity degree as the weight of each edge in the graph. Edges of the bipartite graph with weight greater than the threshold are kept; the other edges are purged. The work in [15] uses a graph partitioning technique [15] to co-cluster the vertices of the graph into concept clusters for the two ontologies. The concept clusters produced in this step contain all common concepts from the ontologies. In the next step, the concept clusters are used to set up mappings among the ontologies. The contribution of the proposed technique is that many-to-many mappings can be established among ontologies.

In [16], Barber has proposed a new graph clustering mechanism that represents a graph in the form of a matrix. The proposed technique extends the incidence matrix (which shows the joined vertices of the graph as a matrix) to a clique matrix.
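The relation between the two matrix forms can be illustrated with a short sketch. The Python code below is not taken from [16]; it assumes that a clique matrix has one column per clique in the same way that an incidence matrix has one column per edge, and the function names, the example graph and the hand-picked cliques are hypothetical.

```python
from itertools import combinations


def incidence_matrix(vertices, edges):
    """One row per vertex, one column per edge; a 1 marks an endpoint of the edge."""
    return [[1 if v in e else 0 for e in edges] for v in vertices]


def clique_matrix(vertices, cliques):
    """One row per vertex, one column per clique; a 1 marks a member of the clique."""
    return [[1 if v in c else 0 for c in cliques] for v in vertices]


def pairs_sharing_a_column(vertices, matrix):
    """Vertex pairs that appear together in at least one column of the matrix."""
    pairs = set()
    for i, j in combinations(range(len(vertices)), 2):
        if any(a and b for a, b in zip(matrix[i], matrix[j])):
            pairs.add(frozenset((vertices[i], vertices[j])))
    return pairs


vertices = ["a", "b", "c", "d"]
edges = [{"a", "b"}, {"a", "c"}, {"b", "c"}, {"c", "d"}]
cliques = [{"a", "b", "c"}, {"c", "d"}]  # assumed cliques that together cover every edge

inc = incidence_matrix(vertices, edges)  # 4 columns, one per edge
clq = clique_matrix(vertices, cliques)   # 2 columns, one per clique

# Both matrices describe the same adjacency structure.
same = pairs_sharing_a_column(vertices, inc) == pairs_sharing_a_column(vertices, clq)
print(len(inc[0]), len(clq[0]), same)    # 4 2 True
```

In this toy graph the clique matrix needs two columns where the incidence matrix needs four, because the triangle a, b, c is stored as a single column.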
The clique matrix shows which nodes of the graph can form a clique. The clique matrix can be used efficiently for graph clustering. The proposed technique executes in the following steps: (1) in the first step, it calculates the maximal clique; (2) in the second step, clustering is performed by [16] by identifying the matrix with the smallest number of columns. The size of the clique is controlled by a threshold parameter that determines how large the clique should be. The technique has been successfully applied to find large well-connected groups in a social network and to cluster gene expressions that exist in a large population. The main contribution of the proposed work is the clique matrix notation for graphs.

In [17], Kraus et al. have proposed a new algorithm for graph clustering, called the semi-supervised divisive hierarchical graph clustering algorithm. Their proposed technique can effectively handle the problem of clustering without any knowledge of the structure of the underlying dataset. The authors proposed a hierarchical algorithm that incorporates background knowledge into the graph. The technique in [17] is used with weighted undirected graphs. The Euclidean distance between two adjacent nodes is calculated using the formula below:

\[
d(p, q) = d(q, p) = \sqrt{(q_1 - p_1)^2 + (q_2 - p_2)^2 + \cdots + (q_n - p_n)^2} = \sqrt{\sum_{i=1}^{n} (q_i - p_i)^2}
\]

and the ratio is computed by dividing the distance by the average Euclidean distance of all the nodes in the graph. Must-links, which indicate that two data items must be placed in the same group, and cannot-links, which indicate that two data items cannot be placed in the same group, are identified. Links with low weight are removed to control the chaining effect of the nodes on the clusters. To propagate background knowledge to adjacent nodes, the probability of visiting nodes within some threshold number of steps is calculated for two nodes, and the neighborhood similarity is measured for the nodes. The proposed algorithm increases the weight of an edge if its two nodes are similar; otherwise the weight of the edge is decreased. Afterward, nodes with a small neighborhood are removed to create clusters. Nodes with similar neighborhood values are clustered in the same group. The main contribution of the proposed work is the inclusion of background knowledge in the clustering process.

In [18], Schenker et al. have proposed a graph model for the classification of web documents. The proposed method is based on k-NN [18] and automatically classifies unknown documents into their respective classes. The experiments conducted in [18] reveal that the computation time of the graph-based model for document classification is comparable to that of other vector-based k-NN models. The experiments showed that for a small number of nodes, up to 30, the classification time of the proposed technique is as efficient as that of vector-based k-NN techniques, but the technique in [18] outperformed vector-based k-NN methods for a large number of nodes in both performance and accuracy.

In [19], T. Ozaki et al. have proposed a new method for sub graph mining in graph-structured databases. Their method is called HSG. The algorithm proposed in [19] is based on frequent hyperclique patterns, and it tries to find the dependencies among graphs in large databases. The method efficiently mines correlations in structured databases. The authors proposed efficient pruning methods based on h-confidence measures and on depth-first and breadth-first search methods; the details of these methods can be found in [19]. The technique has been implemented on the PTE and DTP_CM datasets. The technique in [19] efficiently mines frequent hyperclique patterns in these datasets in reasonable time. The main contribution of the proposed work is that [19] introduces a new concept of hyperclique to mine correlations in graph databases and proposes an algorithm to mine frequent hyperclique patterns in large graph databases.

In [20], Fatta et al. proposed a new method for sub graph mining in large graph databases. The method is called the distributed algorithm. The algorithm is based on a distributed peer-to-peer communication framework and can handle a very high workload in a distributed manner. The distributed algorithm proposed in [20] efficiently mines sub graphs in molecular compounds, which have very large trees and a very large number of sub graphs. In the first step, [20] partitions the search space dynamically to partition a large tree. In the second step, [20] distributes the partitioned tree over the peer-to-peer communication framework. In the last step, the distributed algorithm uses load balancing and receiver-initiated methods [20] to execute the sub graph mining process in the distributed environment. To test the effectiveness of the proposed method, the technique has been implemented on the DTP dataset, and different experiments have been performed using this dataset. All of these experiments showed that [20] is more efficient and can accurately mine sub graphs in a highly distributed and heterogeneous environment. The method proposed in [20] has also been tested for fault tolerance, and the results showed that it handled the situation very efficiently. The main contribution of the proposed work is that it works in a highly distributed and heterogeneous environment.

IV. CRITICAL EVALUATION

In this section we critically comment on the techniques. The critical evaluation is based on the observation of the following metrics: parameters, technique, method, implementation, features, comparison and efficiency. The details are shown in Table I. According to the comparison in Table I, the work in [8] seems to be more efficient in computation time and memory usage during the clustering process than [1] and [4] are for classification. The model in [10] is capable of handling a larger number of nodes than [16] and [18] for clustering in a large graph, in terms of efficiency and features. The results generated by the model in [14] are, however, more accurate than those in [1] and [8] for feature support. Thus the above discussion reveals that [14] may be more accurate for noisy data and [8] may be more efficient for larger graphs.
TABLE I. COMPARISON OF RECENT WORKS ON GRAPH MINING

| Work | Technique | Dataset / implementation | Features | Efficiency | Comparison |
| Callut et al. [1] | D-Walks | CORA | Capable of handling large graphs | 1.4 seconds per graph | yes |
| Kashima et al. [4] | Multi-level Kernel k-means | Mutag, PTC | Reduced chaining effect; computes similarity both on label and edges | | yes |
| Dhillon et al. [8] | Multi-level Kernel k-means | IMDB Movie | Memory efficient; efficient in running time | 25 minutes for 1.2 million nodes and 7.6 million edges | yes |
| Dias and Ochi [10] | Genetic Algorithm | C++ | Tracked the performance of GA for different types of graph | 98 % for 500 nodes | yes |
| Zhao et al. [11] | CFFTree | C++, VS | More efficient for graphs with a large number of nodes | 10 to 1.5 free tree and closed | yes |
| Le et al. [14] | Coring Method | MicroArray dataset, image | Efficiently clustered core region in noisy data | | yes |
| Chen et al. [15] | Bi-partite graph co-clustering | | | | yes |
| Barber [16] | Clique matrix | DIMACS | Clique matrix notation for graphs; clustering based on clique matrix notations | | no |
| Kraus et al. [17] | SSHGCA | MicroArray dataset | Including of background knowledge in clustering process | | yes |
| Schenker et al. [18] | K-NN | Yahoo News; C++ | More efficient and accurate for large size graphs | | yes |
| T. Ozaki et al. [19] | HSG | PTE, DTP_CM; Java | Mine correlation in graphs | | no |
| Fatta et al. [20] | Distributed Algorithm | PTE, DTP_CM; Java | Efficient; distributed; heterogeneous | | no |