Graph Pattern Mining, Search and OLAP
Xifeng Yan
November 21, 2012
Existing studies focus mostly on the multiple-graph scenario.
With some modifications, the mining methodology can be extended to the
single graph scenario [30]. Washio and Motoda [56] conducted a survey
on graph-based data mining. Holder et al. [21] proposed SUBDUE to do
subgraph pattern discovery based on minimum description length and
background knowledge. The most popular graph pattern mining algorithms
adopt either an Apriori-based or a pattern-growth approach.
In an Apriori-based approach, the search for frequent subgraphs starts
with graphs of small size, and proceeds in a bottom-up manner. At each
iteration, the size of newly discovered frequent subgraphs is increased by
one node or edge. The new candidates are generated by joining two similar
but slightly different frequent subgraphs that were discovered already. The
frequency of the newly formed graphs is then checked. Typical Apriori-based
frequent graph pattern mining algorithms include AGM [23], FSG [29], and
an edge-disjoint path-join algorithm [55].
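The level-wise join described above can be illustrated with a minimal
sketch. It rests on a strong simplifying assumption that is not part of
AGM or FSG: every vertex label is unique within a graph, so a graph is
just a set of labeled edges and subgraph containment reduces to a subset
test. The helper names (edge, apriori_subgraphs) are hypothetical.

```python
from itertools import combinations

def edge(a, b):
    """Undirected edge between two vertex labels, stored in canonical order."""
    return (a, b) if a <= b else (b, a)

def is_connected(edges):
    """True if the edge set forms a single connected component."""
    edges = list(edges)
    seen = set(edges[0])
    remaining = edges[1:]
    changed = True
    while changed:
        changed = False
        for e in list(remaining):
            if e[0] in seen or e[1] in seen:
                seen.update(e)
                remaining.remove(e)
                changed = True
    return not remaining

def support(pattern, graphs):
    """Number of graphs containing the pattern (subset test under the assumption)."""
    return sum(1 for g in graphs if pattern <= g)

def apriori_subgraphs(graphs, min_sup):
    """Return {pattern: support} for all frequent connected patterns."""
    # Level 1: frequent single edges.
    all_edges = set().union(*graphs)
    level = {frozenset([e]) for e in all_edges}
    level = {p for p in level if support(p, graphs) >= min_sup}
    result = {}
    while level:
        for p in level:
            result[p] = support(p, graphs)
        # Join two k-edge patterns that differ in exactly one edge.
        candidates = set()
        for p, q in combinations(level, 2):
            c = p | q
            if len(c) == len(p) + 1 and is_connected(c):
                candidates.add(c)
        # Check the frequency of the newly formed candidates.
        level = {c for c in candidates if support(c, graphs) >= min_sup}
    return result
```

With general (repeated) labels, the subset test would have to be replaced
by subgraph isomorphism testing and the join by a proper graph join, which
is where most of the cost of the real algorithms lies.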
In a pattern-growth approach, a frequent graph is extended directly by
adding a new node or edge, in every possible position. A potential problem
with this extension approach is that the same graph can be discovered many
times. The gSpan [59] algorithm solves this problem by introducing a
rightmost extension technique, in which extensions take place only on the
rightmost path. Many other algorithms adopt a similar strategy, including
MoFa [4], FFSM [22], and Gaston [42].
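The duplicate-discovery problem can be made concrete with a small sketch.
It assumes, purely for illustration, that vertex labels are unique within
each graph, so a pattern is a set of labeled edges and containment is a
subset test. The function grow_patterns is hypothetical. It avoids
re-expanding a pattern by remembering every pattern already visited, which
is not the rightmost extension technique of gSpan; it only counts how
often naive extension reaches the same pattern again.

```python
def grow_patterns(graphs, min_sup):
    """Depth-first pattern growth; returns ({pattern: support}, duplicate_hits)."""
    def support(p):
        return sum(1 for g in graphs if p <= g)

    all_edges = set().union(*graphs)
    frequent_edges = {e for e in all_edges
                      if support(frozenset([e])) >= min_sup}
    found, duplicates = {}, 0

    def grow(pattern):
        nonlocal duplicates
        if pattern in found:      # same graph reached by a different growth order
            duplicates += 1
            return
        found[pattern] = support(pattern)
        vertices = {v for e in pattern for v in e}
        for e in frequent_edges - pattern:
            # Only extend with edges touching the pattern, to keep it connected.
            if e[0] in vertices or e[1] in vertices:
                bigger = pattern | {e}
                if support(bigger) >= min_sup:
                    grow(bigger)

    for e in frequent_edges:
        grow(frozenset([e]))
    return found, duplicates
```

Even on a three-edge path the counter is nonzero, because a two-edge
pattern can be grown from either of its edges.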
Graph Patterns with Constraints. Constraint-based graph pattern mining
finds frequent graph patterns that satisfy user-specified constraints on
degree, density, frequency, size, etc. Mining closed graph patterns was
studied in [60]. The goal is to reduce the number of graph patterns by removing
subgraph patterns that can be derived from other patterns. Techniques were
developed for pushing constraints as deep as possible in the mining process
[65].
Approximate Graph Patterns. Due to the complexity of isomorphism
testing and the inelastic pattern definition, frequent subgraphs cannot
capture approximate graph patterns. In [28], a proximity pattern is defined as
a set of labels that co-occur frequently in neighborhoods. It relaxes the rigid
structure constraint of frequent subgraphs, while introducing connectivity to
frequent itemsets. Empirical results show that it not only finds interesting
patterns that are ignored by the existing approaches, but also achieves high
performance for finding proximity patterns in large-scale graphs.
Because the set of frequent graph patterns can be exponentially large, it
is necessary to discover the most representative ones. Random sampling
techniques have been developed to sample the pattern space uniformly [18].
By doing so, the mining time can be significantly reduced while the number
of similar patterns can be reduced.
Discriminative Graph Patterns. Discriminative graph pattern mining
finds significant graph patterns that distinguish between two
sets of graphs. The two sets of graphs could be graphs with different class
labels. The discovered discriminative graph patterns can be used as features
for classification. [52] proposed an algorithm for mining the minimal contrast
subgraph which is able to capture the structural differences between any two
collections of graphs. LEAP [58] is a general approach to leverage structural
proximity and frequency association to quickly skip pattern search space and
find discriminative graph patterns, with respect to the objective function
given by a user.
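As an illustration of what a user-given objective function might look
like, the sketch below scores a pattern by information gain over the two
graph sets. Information gain is an assumption chosen for the example; the
text does not prescribe a particular objective. The inputs are simply the
number of graphs in each set that contain the pattern, and the function
names are hypothetical.

```python
import math

def entropy(pos, neg):
    """Binary entropy (in bits) of a set with pos positive and neg negative graphs."""
    total = pos + neg
    h = 0.0
    for c in (pos, neg):
        if c:
            p = c / total
            h -= p * math.log2(p)
    return h

def information_gain(pos_hits, pos_total, neg_hits, neg_total):
    """Reduction in class-label entropy when splitting on 'contains the pattern'."""
    n = pos_total + neg_total
    with_p = pos_hits + neg_hits          # graphs containing the pattern
    without_p = n - with_p                # graphs not containing it
    h_after = 0.0
    if with_p:
        h_after += with_p / n * entropy(pos_hits, neg_hits)
    if without_p:
        h_after += without_p / n * entropy(pos_total - pos_hits,
                                           neg_total - neg_hits)
    return entropy(pos_total, neg_total) - h_after
```

A pattern that occurs in every graph of one set and in none of the other
gets the maximum score, while a pattern that is equally frequent in both
sets scores zero.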
2 Graph Search
Development of scalable methods for analyzing large graph data sets,
including graphs built from knowledge bases and social networks, poses
great challenges. At the core of many graph analysis applications lies a
common and critical problem: how to efficiently search graphs. There are
two problem settings: multiple graphs and single graph.
Single Graph: Given a graph G and a query graph g, find all the
embeddings of g in G.
a signature computed from the eigenvalues of adjacency matrices. Instead
of casting a graph to a vector form, [3] proposed a metric indexing scheme
which organizes graphs hierarchically according to their mutual distances.
In semistructured/XML databases, query languages built on path
expressions have become popular. Efficient indexing techniques for path expressions
were initially introduced in DataGuide [16] and 1-index [38]. A(k)-index [26]
proposes k-bisimilarity to exploit local similarity existing in semistructured
databases. Index Fabric [10] represents every path in a tree as a string and
stores it in a Patricia trie.
For more complicated graph queries, Shasha et al. [46] (GraphGrep) ex-
tended the path-based technique to do full scale graph retrieval. GraphGrep
is an example of feature-based graph indexing techniques. Let $F$ be a
feature set for a given graph database $D$. For any feature $f \in F$,
$D_f$ is the set of graphs containing $f$,
$D_f = \{G \mid f \subseteq G, G \in D\}$. The graph query processing
has three steps: (1) Search, which enumerates all the features in a query
graph, $Q$, to compute the candidate query answer set,
$C_Q = \bigcap_f D_f$ ($f \subseteq Q$ and $f \in F$); each graph in
$C_Q$ contains all of $Q$'s features. Therefore, $D_Q$ is a subset of
$C_Q$. (2) Fetching, which retrieves the graphs in the candidate answer
set from disks. (3) Verification, which checks the graphs in the
candidate answer set to verify if they really satisfy the query. The
candidate answer set is verified to prune false positives.
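The search and verification steps can be sketched as follows. This is a
minimal in-memory illustration, not GraphGrep or gIndex: the feature set
is assumed to be the label pairs of edges (the simplest path feature), the
fetching step is omitted because everything is in memory, and verification
is a plain backtracking subgraph isomorphism test. The names Graph,
build_index, candidates, contains, and search are hypothetical.

```python
from collections import defaultdict

class Graph:
    def __init__(self, labels, edges):
        self.labels = dict(labels)                  # vertex id -> label
        self.adj = {v: set() for v in self.labels}  # vertex id -> neighbours
        for u, v in edges:
            self.adj[u].add(v)
            self.adj[v].add(u)

    def features(self):
        """Feature set: label pairs of edges (an illustrative choice)."""
        return {tuple(sorted((self.labels[u], self.labels[v])))
                for u in self.adj for v in self.adj[u]}

def build_index(database):
    """Inverted index: feature f -> D_f, the ids of graphs containing f."""
    index = defaultdict(set)
    for gid, g in database.items():
        for f in g.features():
            index[f].add(gid)
    return index

def candidates(index, database, query):
    """C_Q: intersection of D_f over the features f of the query.
    A feature that occurs in no database graph has an empty D_f."""
    cq = set(database)
    for f in query.features():
        cq &= index.get(f, set())
    return cq

def contains(g, q):
    """Backtracking check that q is subgraph-isomorphic to g (labels must match)."""
    q_nodes = list(q.labels)

    def extend(mapping):
        if len(mapping) == len(q_nodes):
            return True
        u = q_nodes[len(mapping)]
        for v in g.labels:
            if v in mapping.values() or g.labels[v] != q.labels[u]:
                continue
            # Every already-mapped neighbour of u must be adjacent to v in g.
            if all(mapping[w] in g.adj[v] for w in q.adj[u] if w in mapping):
                mapping[u] = v
                if extend(mapping):
                    return True
                del mapping[u]
        return False

    return extend({})

def search(index, database, query):
    """Filter with the index, then verify each candidate to drop false positives."""
    return {gid for gid in candidates(index, database, query)
            if contains(database[gid], query)}
```

A graph that has an A-B edge and a B-C edge on different B vertices passes
the filter for the path query A-B-C but fails verification, which is
exactly the kind of false positive the third step removes.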
gIndex [61] introduces a pattern-based indexing technique that facilitates
graph search in graph databases with thousands of instances. Nevertheless,
similar techniques can also be applied to indexing single massive graphs. The
idea is to precompute features from a graph database and build indices based
on these features. There are various kinds of features that could be used,
including node/edge labels, paths, trees, and subgraph patterns. gIndex is
a subgraph pattern-based approach, while GraphGrep is a path-based
approach. FG-index [7] also builds its index using frequent subgraphs.
However, it directly answers frequent graph queries without verification.
Zhao et al. [63] analyzed the effectiveness and efficiency of paths, trees,
and graphs as indexing features from three aspects: feature size, feature
selection cost, and pruning power. Like paths and graphs, tree features can
be effectively and efficiently used as indexing features for graph databases.
GString [25] combines three basic structures together: path, star, and cycle
for graph search.
GCoding [66] is another tree-based graph indexing approach. For each
node u, it extracts a level-n path tree, which consists of all n-step simple
paths from u in a graph. The node is then encoded with eigenvalues derived
from this local tree structure. If a query graph Q is a subgraph of a graph
G, then for each vertex u in Q there must exist a corresponding vertex u'
in G such that the local structure around u in Q is preserved around u'
in G. There is a partial order relationship between the eigenvalues of
these two local structures. Based on this property, GCoding can quickly
prune graphs that violate the order.
Closure-Tree [19] organizes graphs into a tree-based index structure using
graph closures as the bounding boxes.
to convert a large network into a set of multidimensional vectors, where
sophisticated indexing and similarity search algorithms are available. NESS
is appropriate for graphs with low automorphism and high noise, which are
common in many social and information networks.
There are several studies on simulation and bisimulation-based graph
pattern matching, e.g., [37, 12, 34], which define subgraph matching as a
relation among the query nodes and target nodes.
5 Graph OLAP
Graph OLAP aims to provide a model to perform composite structure and
information analysis in heterogeneous networks. For example, in terms of
network intrusions, apart from the topological structures encoded in the un-
derlying network, multidimensional attributes are often specified and associ-
ated with nodes and edges, e.g., security software installed in computers, de-
fense strategies, access policies, etc., forming the so-called multidimensional
networks. While studies on contemporary networks have been around for
decades [41], and a plethora of algorithms and systems have been devised
for multidimensional analysis in relational databases [24], none has taken
both aspects into account in the multidimensional network scenario. Graph
OLAP is the technique developed to fill the technology gaps in multidimen-
sional networks.
Graph OLAP performs discovery-driven OLAP operations for fast and
accurate knowledge discovery, through structure discovery, network summa-
rization, aggregation, correlation, clustering and classification. The concept
of Graph OLAP was first introduced in [6]. Two kinds of OLAPs were de-
fined: Informational OLAP (abbr. I-OLAP) and Topological OLAP (abbr.
T-OLAP). For roll-up in I-OLAP, the characterizing feature is that
snapshots are just different observations of the same underlying network;
when they are all grouped into one cell in the cube, it is like overlaying
multiple pieces of information without changing the objects whose
interactions are being examined. For roll-up in T-OLAP, the reorganization
happens inside individual networks. Here, merging is performed internally,
which zooms out the user's focus to a generalized set of objects, and the
new graph formed by such shrinking might greatly alter the original
network's topological structure.
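The difference between the two roll-ups can be sketched in a few lines.
The representation is an assumption made for the example: each snapshot is
a dictionary from an undirected edge (u, v) to a weight, summation is used
as the aggregate, and edges that fall inside a merged group are dropped.
Neither function name comes from [6].

```python
from collections import Counter

def informational_rollup(snapshots):
    """I-OLAP style: overlay snapshots of the same network.
    The vertex set is unchanged; only edge weights are aggregated."""
    merged = Counter()
    for snap in snapshots:
        merged.update(snap)       # adds weights edge by edge
    return dict(merged)

def topological_rollup(graph, group_of):
    """T-OLAP style: merge vertices into groups and aggregate edges
    between groups, producing a smaller graph with a new topology."""
    coarse = Counter()
    for (u, v), w in graph.items():
        gu, gv = group_of[u], group_of[v]
        if gu != gv:              # intra-group edges are dropped in this sketch
            key = (gu, gv) if gu <= gv else (gv, gu)
            coarse[key] += w
    return dict(coarse)
```

For instance, overlaying two snapshots keeps the vertices a, b, c, d and
adds up the weights, whereas mapping {a, b} to X and {c, d} to Y collapses
the overlaid graph to a single X-Y edge.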
[50] introduced two potential operations to summarize graphs, a key step
in T-OLAP. The first operation, called SNAP, produces a summary graph
by grouping nodes based on user-selected node attributes and relationships.
The second operation, called k-SNAP, further allows users to control the
resolutions of summaries and provides the drill-down and roll-up abilities to
navigate through summaries with different resolutions. [43] discussed how to
efficiently compute T-OLAP using graph cubing techniques. It implemented
Graph Cube by combining special characteristics of multidimensional net-
works with the existing well-studied data cube techniques.
In addition to graph summarization, another important operation in
graph OLAP is similarity search. Because large-scale heterogeneous
information networks consist of multi-typed, interconnected objects, it is
important to provide similarity measures in such networks. Intuitively,
two objects are similar if they are linked by many paths in the network.
However, the different semantic meanings behind paths should be taken
into consideration.
[57] studied similarity search that is defined among the same type of objects
in heterogeneous networks, and introduced the concept of meta path-based
similarity, where a meta path is a path consisting of a sequence of relations
defined between different object types (i.e., structural paths at the meta
level). Meta-path similarity turns out to be more meaningful in many sce-
narios compared with random-walk based similarity measures.
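A minimal sketch of a meta path-based similarity follows. Relations are
stored as nested dictionaries of path counts, and a meta path is evaluated
by composing relations. The normalization used here, twice the number of
path instances between x and y divided by the sum of the path instances
from each object back to itself, is a PathSim-style choice that is assumed
for the example and not stated in the text above. All function names are
hypothetical.

```python
from collections import defaultdict

def compose(r1, r2):
    """Compose two relations given as {src: {dst: path_count}};
    counts multiply along a path and add up across paths."""
    out = defaultdict(lambda: defaultdict(int))
    for a, mids in r1.items():
        for m, c1 in mids.items():
            for b, c2 in r2.get(m, {}).items():
                out[a][b] += c1 * c2
    return {a: dict(bs) for a, bs in out.items()}

def transpose(r):
    """Reverse a relation, e.g. Author-Venue becomes Venue-Author."""
    out = defaultdict(dict)
    for a, bs in r.items():
        for b, c in bs.items():
            out[b][a] = c
    return dict(out)

def meta_path_similarity(counts, x, y):
    """Similarity under a symmetric meta path: 2*|x~>y| / (|x~>x| + |y~>y|)."""
    xy = counts.get(x, {}).get(y, 0)
    xx = counts.get(x, {}).get(x, 0)
    yy = counts.get(y, {}).get(y, 0)
    return 2 * xy / (xx + yy) if xx + yy else 0.0

# Hypothetical data: number of papers each author published in each venue.
publishes_in = {
    "ann": {"KDD": 2, "VLDB": 1},
    "bob": {"KDD": 1},
    "cat": {"VLDB": 3},
}
# Meta path Author-Venue-Author.
ava = compose(publishes_in, transpose(publishes_in))
```

Under the Author-Venue-Author meta path, two authors are similar when they
publish in the same venues in similar proportions; choosing a different
meta path, such as Author-Paper-Author, would express co-authorship
instead.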
6 Vertex Programming
Vertex programming is adopted in several leading distributed graph comput-
ing platforms in clusters such as Pregel [35] and GraphLab [32]. They can
be implemented using the bulk synchronous parallel model or asynchronous
models. Vertex programming is suitable for graph algorithms that can be
modified to store computation states in vertices and these states can be
distributed and shared with multiple vertices. Pregel and GraphLab have
demonstrated their success in computation of shortest paths, random walk,
clustering, and belief propagation which can support many machine learning
algorithms. However, it is unknown if an effective implementation of sub-
graph isomorphism exists using vertex programming. [49] proposed passing
partial matches around computers in order to find a complete match. One can
also implement a centralized algorithm that collects partial matchings from
different machines and assembles them in a central machine. Both algorithms
have pros and cons. They are not compatible with vertex programming and
need a special daemon process in computers to coordinate the partial result
assembly. Our approximate graph search algorithms that use message passing
between vertices, e.g., NESS [27], are suitable for vertex programming.
NESS uses vector representation of graphs. The neighborhood information
of each vertex is computed by propagating the labels of its neighbors with
distance weighting, which is encoded in each vertex. The best matches of
each vertex can be further passed to its neighbors to find the best match of
the entire vertex set. The structure of a vertex's neighbors is encoded with
their distance to that vertex. When the number of distinct labels is high in a
graph, NESS will likely find a good match in terms of subgraph isomorphism.
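To make the vertex-centric model concrete, the sketch below simulates
bulk synchronous supersteps on a single machine for single-source
shortest paths, one of the computations mentioned above. It is an
illustration only and does not use the Pregel or GraphLab API; the
function name and the adjacency format {u: {v: weight}} are assumptions,
and edge weights are assumed to be non-negative.

```python
import math

def vertex_program_sssp(adj, source):
    """Simulate vertex-centric shortest paths in synchronous supersteps.
    adj maps every vertex to {neighbour: non-negative edge weight}."""
    dist = {v: math.inf for v in adj}     # per-vertex state
    inbox = {source: [0.0]}               # messages delivered this superstep
    while inbox:                          # halt when no messages are in flight
        outbox = {}
        for v, messages in inbox.items():
            best = min(messages)
            if best < dist[v]:
                # The vertex updates its own state and notifies its neighbours.
                dist[v] = best
                for nbr, w in adj[v].items():
                    outbox.setdefault(nbr, []).append(best + w)
        inbox = outbox                    # barrier: messages arrive next superstep
    return dist
```

Each vertex only reads its incoming messages, updates its own state, and
sends messages along its edges, which is the property that lets such a
program be partitioned across machines.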
References
[1] P. Barcelo, L. Libkin, and J. L. Reutter. Querying Graph Patterns.
PODS, 2011.
[3] S. Beretti, A. Bimbo, and E. Vicario. Efficient matching and indexing
of graph models in content based retrieval. IEEE Trans. on Pattern
Analysis and Machine Intelligence, 23:1089-1105, 2001.
[6] C. Chen, X. Yan, F. Zhu, J. Han, and P. S. Yu. Graph OLAP: Towards
online analytical processing on graphs. In Proc. 2008 Int. Conf. on Data
Mining, 2008.
[12] W. Fan, J. Li, S. Ma, N. Tang, Y. Wu, and Y. Wu. Graph Pattern
Matching: From Intractable to Polynomial Time. PVLDB, 2010.
[15] B. Gallagher. Matching Structure and Semantics: A Survey on Graph-
Based Pattern Matching. AAAI FS., 2006.
[18] M. A. Hasan and M. J. Zaki. Output space sampling for graph patterns.
Proc. of the VLDB Endowment (35th Int. Conf. on Very Large Data
Bases), 2(1):730-741, 2009.
[26] R. Kaushik, P. Shenoy, P. Bohannon, and E. Gudes. Exploiting local
similarity for efficient indexing of paths in graph structured data. In
Proc. of 2002 Int. Conf. on Data Engineering (ICDE'02), pages 129-140,
2002.
[28] A. Khan, X. Yan, and K.-L. Wu. Towards Proximity Pattern Mining in
Large Graphs. SIGMOD, 2010.
[38] T. Milo and D. Suciu. Index structures for path expressions. Lecture
Notes in Computer Science, 1540:277-295, 1999.
[50] Y. Tian, R. A. Hankins, and J. M. Patel. Efficient Aggregation for
Graph Summarization. SIGMOD, 2008.
[51] Y. Tian and J. M. Patel. TALE: A Tool for Approximate Large Graph
Matching. ICDE, 2008.
[56] T. Washio and H. Motoda. State of the art of graph-based data mining.
SIGKDD Explorations, 5:59-68, 2003.
[60] X. Yan and J. Han. CloseGraph: Mining closed frequent graph patterns.
In Proc. of 2003 Int. Conf. on Knowledge Discovery and Data Mining
(KDD'03), pages 286-295, 2003.
[62] S. Zhang, J. Yang, and W. Jin. SAPPER: Subgraph Indexing and Ap-
proximate Matching in Large Graphs. PVLDB, 2010.
[63] P. Zhao, J. Yu, and P. Yu. Graph Indexing: Tree + Delta >= Graph.
VLDB, 2007.
[66] L. Zou, L. Chen, J. Yu, and Y. Lu. A Novel Spectral Coding in a Large
Graph Database. EDBT, 2008.