Feature-Based Similarity Search in Graph Structures
Xifeng Yan
University of Illinois at Urbana-Champaign
Feida Zhu
University of Illinois at Urbana-Champaign
Philip S. Yu
IBM T. J. Watson Research Center
and
Jiawei Han
University of Illinois at Urbana-Champaign
ACM Transactions on Database Systems, Vol. V, No. N, June 2006, Pages 1–0??.
1. INTRODUCTION
Development of scalable methods for the analysis of large graph data sets, including
graphs built from chemical structures and biological networks, poses great chal-
lenges to database research. Due to the complexity of graph data and the diversity
of their applications, graphs are generally key entities in widely used databases in
chem-informatics and bioinformatics, such as PDB [Berman et al. 2000] and KEGG
[Kanehisa and Goto 2000].
In chemistry, the structures and properties of newly discovered or synthesized
chemical molecules are studied, classified, and recorded for scientific and commer-
cial purposes. ChemIDplus1 , a free data service offered by the National Library
of Medicine (NLM), provides access to structure and nomenclature information.
Users can query molecules by their names, structures, toxicity, and even weight in
a flexible way through its web interface. Given a query structure, it can quickly
identify a small subset of molecules for further analysis [Hagadone 1992, Willett
et al. 1998], thus shortening the discovery cycle in drug design and other scientific
activities. Nevertheless, the usage of a graph database and its query system is not
confined to chemical informatics. In computer vision and pattern recognition
[Petrakis and Faloutsos 1997, Messmer and Bunke 1998, Beretti et al. 2001],
graphs are used to represent complex structures such as hand-drawn symbols, fin-
gerprints, 3D objects, and medical images. Researchers extract graph models from
various objects and compare them to identify unknown objects and scenes. The
developments in bioinformatics also call for efficient mechanisms in querying a large
number of biological pathways and protein interaction networks. These networks
are usually very complex with multi-level structures embedded [Kanehisa and Goto
2000]. All of these applications indicate the importance and broad usage of
graph databases and their accompanying similarity search systems.
While motif discovery in graph datasets has been studied extensively, a sys-
tematic examination of graph querying is becoming equally important. A major kind
of query in graph databases is searching topological structures, which cannot be
answered efficiently using existing database infrastructures. The indices built on
the labels of vertices or edges are usually not selective enough to distinguish com-
plicated, interconnected structures.
Due to the limitation of processing graph queries using existing database tech-
niques, tremendous efforts have been put into building practical graph query sys-
tems. Most of them fall into the following three categories: (1) full structure search:
find structures exactly the same as the query graph [Beretti et al. 2001]; (2) sub-
structure search: find structures that contain the query graph, or vice versa [Shasha
et al. 2002, Srinivasa and Kumar 2003, Yan et al. 2004]; and (3) full structure
similarity search: find structures that are similar to the query graph [Petrakis and
Faloutsos 1997, Willett et al. 1998, Raymond et al. 2002]. These kinds of queries
are very useful. For example, in substructure search, a user may not know the
exact composition of the full structure he wants, but requires that it contain a set
of small functional fragments.

1 http://chem.sis.nlm.nih.gov/chemidplus.
A common problem in substructure search is: what if there is no match or very
few matches for a given query graph? In this situation, a subsequent query refine-
ment process has to be taken in order to find the structures of interest. Unfortu-
nately, it is often too time-consuming for a user to perform manual refinements.
One solution is to ask the system to find graphs that nearly contain the entire
query graph. This similarity search strategy is more appealing since the user can
first define the portion of the query for exact matching and let the system change
the remaining portion slightly. The query could be relaxed progressively until a
relaxation threshold is reached or a reasonable number of matches are found.
[Figures 1 and 2: example chemical structures serving as the target graphs and the
query graph; only scattered atom labels (N, O, S) survive text extraction.]
2. RELATED WORK
Structure similarity search has been studied in various fields. Willett et al. [Willett
et al. 1998] summarized the techniques of fingerprint-based and graph-based sim-
ilarity search in chemical compound databases. Raymond et al. [Raymond et al.
2002] proposed a three-tier algorithm for full structure similarity search. Recently,
substructure search has attracted lots of attention in the database research com-
munity. Shasha et al. [Shasha et al. 2002] developed a path-based approach for
substructure search, while Srinivasa and Kumar [Srinivasa and Kumar 2003] built
multiple abstract graphs for indexing purposes. Yan et al. [Yan et al. 2004] took
the discriminative frequent structures as indexing features to improve the search
performance.
As to substructure similarity search, in addition to graph edit distance and align-
ment distance, maximum common subgraph is used to measure the similarity be-
tween two structures. Unfortunately, finding the maximum common subgraph is
NP-complete [Garey and Johnson 1979]. Nilsson [Nilsson 1980] presented an al-
gorithm for pairwise approximate substructure matching. The matching is
greedily performed to minimize a distance function for two structures. Hagadone
[Hagadone 1992] recognized the importance of substructure similarity search in a
large set of graphs. He used the atom and edge label to do screening. Holder et
al. [Holder et al. 1994] adopted the principle of minimum description length for ap-
proximate graph matching. Messmer and Bunke [Messmer and Bunke 1998] studied
the reverse substructure similarity search problem in computer vision and pattern
recognition. These methods did not explore the potential of using more compli-
cated structures to improve the filtering performance, which is studied extensively
by our work. In [Shasha et al. 2002], Shasha et al. also extended their substructure
search algorithm to support queries with wildcards, i.e., don’t care nodes and edges.
Different from their similarity model, we do not fix the positions of wildcards, thus
allowing a general and flexible search scheme.
In our recent work [Yan et al. 2005], we introduced the basic concept of a feature-
based indexing and filtering methodology for substructure similarity search. It was
shown that using either too few or too many features can result in poor filtering
performance. In this extended work, we provide a geometric interpretation of this
phenomenon, followed by a rigorous analysis of the feature set selection problem
using a linear inequality system. We propose three optimization problems in our
filtering framework and prove that each of them takes Ω(2^m) steps to find the
optimal solution in the worst case, where m is the number of features available for
selection. These results call for selection heuristics, such as the clustering-based
feature set selection developed in our solution.
Besides the full-scale graph search problem, researchers also studied the approx-
imate tree search problem. Wang et al. [Wang et al. 1994] designed an interactive
system that allows a user to search inexact matchings of trees. Kailing et al. [Kail-
ing et al. 2004] presented new filtering methods based on tree height, node degree
and label information.
The structural filtering approach presented in this study is also related to string
filtering algorithms. A comprehensive survey on various approximate string filter-
ing methods was presented by Navarro [Navarro 2001]. The well-known q-gram
method was initially developed by Ullmann [Ullmann 1977]. Ukkonen [Ukkonen
1992] independently discovered the q-gram approach, which was further extended
in [Gravano et al. 2001] against large scale sequence databases. These q-gram al-
gorithms work for consecutive sequences, not structures. Our work generalized the
q-gram method to fit structural patterns of various sizes.
3. PRELIMINARY CONCEPTS
Graphs are widely used to represent complex structures that are difficult to model.
In a labeled graph, vertices and edges are associated with attributes, called labels.
The labels could be tags in XML documents, atoms and bonds in chemical com-
pounds, genes in biological networks, and object descriptors in images. The choice
of using labeled graphs or unlabeled graphs depends on the application need. The
filtering algorithm we proposed in this article can handle both types efficiently.
Let V (G) denote the vertex set of a graph G and E(G) the edge set. A label
function, l, maps a vertex or an edge to a label. The size of a graph is defined by
the number of edges it has, written as |G|. A graph G is a subgraph of G′ if there
exists a subgraph isomorphism from G to G′ , denoted by G ⊆ G′ , in which case
G′ is called a supergraph of G.
Definition 1 Subgraph Isomorphism. A subgraph isomorphism is an injec-
tive function f : V (G) → V (G′ ), such that (1) ∀u ∈ V (G), f (u) ∈ V (G′ ) and
l(u) = l′ (f (u)), and (2) ∀(u, v) ∈ E(G), (f (u), f (v)) ∈ E(G′ ) and l(u, v) =
l′ (f (u), f (v)), where l and l′ are the label functions of G and G′ , respectively.
Such a function f is called an embedding of G in G′ .
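Definition 1 can be checked directly, if expensively, by enumerating injective vertex mappings. The following is a small illustrative sketch; the dict-based graph encoding is our own assumption, not the paper's representation:

```python
from itertools import permutations

def find_embedding(G, G2):
    """Brute-force search for an embedding of G in G2 (Definition 1).
    A graph is a dict: 'vertices' maps vertex -> label, 'edges' maps
    (u, v) -> label for undirected edges stored with u < v."""
    def elabel(H, u, v):
        return H['edges'].get((u, v), H['edges'].get((v, u)))
    vs, vb = list(G['vertices']), list(G2['vertices'])
    for image in permutations(vb, len(vs)):
        f = dict(zip(vs, image))          # candidate injective function
        # condition (1): vertex labels must agree under f
        if any(G['vertices'][u] != G2['vertices'][f[u]] for u in vs):
            continue
        # condition (2): each edge maps to an identically labeled edge
        if all(elabel(G2, f[u], f[v]) == lab
               for (u, v), lab in G['edges'].items()):
            return f                      # an embedding of G in G2
    return None                           # G is not a subgraph of G2

small = {'vertices': {1: 'C', 2: 'N'}, 'edges': {(1, 2): '-'}}
big = {'vertices': {1: 'C', 2: 'N', 3: 'O'},
       'edges': {(1, 2): '-', (2, 3): '='}}
assert find_embedding(small, big) == {1: 1, 2: 2}
assert find_embedding(big, small) is None
```

The factorial cost of this enumeration is precisely why the filtering framework below tries to avoid pairwise structure comparison whenever possible.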
Given a graph database and a query graph, we may not find a graph (or may
find only a few graphs) in the database that contains the whole query graph. Thus,
it would be interesting to find graphs that contain the query graph approximately,
which is a substructure similarity search problem. Based on our observation, this
problem has two scenarios, similarity search and reverse similarity search.
Definition 2 Substructure Similarity Search. Given a graph database
D = {G1 , G2 , . . . , Gn } and a query graph Q, similarity search is to discover all the
graphs that approximately contain this query graph. Reverse similarity search is to
discover all the graphs that are approximately contained in this query graph.
Each type of search scenario has its own applications. In chemical informatics,
similarity search is more popular, while reverse similarity search has key appli-
cations in pattern recognition. In this article, we develop a structural filtering
algorithm for similarity search. Nevertheless, our algorithm can also be applied to
reverse similarity search with slight modifications.
To distinguish a query graph from the graphs in a database, we call the latter
target graphs. The question is how to measure the substructure similarity between
a target graph and the query graph. There are several similarity measures. We can
classify them into three categories: (1) physical property-based, e.g., toxicity and
weight; (2) feature-based; and (3) structure-based. For the feature-based measure,
Example 2. Consider the target graph in Figure 1(a) and the query graph in
Figure 2. Their maximum common subgraph has 11 out of the 12 edges. Thus, the
substructure similarity between these two graphs is around 92% with respect to the
query graph. That also means if we relax the query graph by 8%, the relaxed query
graph is contained in Figure 1(a). The similarity of graphs in Figures 1(b) and
1(c) with the query graph is 92% and 67%, respectively.
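The arithmetic of Example 2 can be restated as a tiny helper (ours, for illustration only):

```python
def substructure_similarity(mcs_size, query_size):
    """Similarity with respect to the query graph: the fraction of the
    query's edges covered by the maximum common subgraph."""
    return mcs_size / query_size

# Example 2: the maximum common subgraph has 11 of the query's 12 edges
sim = substructure_similarity(11, 12)
assert round(sim * 100) == 92          # ~92% similar
assert round((1 - sim) * 100) == 8     # i.e., an 8% relaxation
```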
4. STRUCTURAL FILTERING
Given a relaxed query graph, the major goal of our algorithm is to filter out as
many graphs as possible using a feature-based approach. The features discussed
here could be paths [Shasha et al. 2002], discriminative frequent structures [Yan
et al. 2004], elementary structures, or any structures indexed in a graph database.
Previous work did not investigate the connection between the structure-based simi-
larity measure and the feature-based similarity measure. In this study, we explicitly
transform the query relaxation ratio to the number of misses of indexed features,
thus building a connection between these two measures.
[Figure 3: a query graph with three labeled edges e1 , e2 , and e3 .]
Let us first consider an example. Figure 3 shows a query graph and Figure 4
depicts three structural fragments. Assume that these fragments are indexed as
features in a graph database. For simplicity, we ignore all the label information
in this example. The symbols e1 , e2 , and e3 in Figure 3 do not represent labels
but edges themselves. Suppose we cannot find any match for this query graph in a
graph database. Then a user may relax one edge, e1 , e2 , or e3 , through a deletion or
relabeling operation. He/she may deliberately retain the middle edge, because the
deletion of that edge may break the query graph into pieces. Because the relaxation
can take place among e1 , e2 , and e3 , we are not sure which feature will be affected
by this relaxation. However, no matter which edge is relaxed, the relaxed query
graph should have at least three embeddings of these features. Equivalently, we say
that the relaxed query graph may miss at most four embeddings of these features
in comparison with the original query graph, which have seven embeddings: one fa ,
two fb ’s, and four fc ’s. Using this information, we can discard graphs that do not
contain at least three embeddings of these features. We name the above filtering
concept feature-based structural filtering.
Fig. 5. The feature-graph matrix.

        G1   G2   G3   G4
  fa     0    1    0    0
  fb     0    0    1    0
  fc     2    3    4    4
Using the feature-graph matrix, we can apply the feature-based filtering on any
query graph against a target graph in the database using any subset of the indexed
features. Consider the query shown in Figure 3 with one edge relaxation. According
to the feature-graph matrix in Figure 5, even if we do not know the structure of
G1 , we can filter G1 out immediately based on the features included in it, since G1
contains only two embeddings of fa , fb , and fc in total. This feature-based filtering
process does not involve any costly structure similarity checking. The only
computation needed is to retrieve from the indices the features that belong to a
query graph and to compute the possible feature misses for a given relaxation ratio. Since
our filtering algorithm is fully built on the feature-graph matrix index, we need not
access the physical database unless we want to calculate the accurate substructure
similarity.
We implement the feature-graph matrix based on a list, where each element points
to an array representing the row of the matrix. Using this implementation, we can
flexibly insert and delete features without rebuilding the whole index. In the next
subsection, we will present the general framework of processing similarity search,
and illustrate the position of our structural filtering algorithm in this framework.
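The list-of-rows layout described above can be sketched as follows; the class and method names are our own assumption, not the paper's implementation:

```python
class FeatureGraphMatrix:
    """Sketch of the feature-graph matrix index: one row per feature,
    where rows[fid][i] is the number of embeddings of feature fid in
    the i-th graph of the database."""
    def __init__(self, n_graphs):
        self.n = n_graphs
        self.rows = {}                   # feature id -> list of counts

    def add_feature(self, fid, counts):
        assert len(counts) == self.n
        self.rows[fid] = list(counts)

    def remove_feature(self, fid):
        # features can be dropped without rebuilding the whole index
        self.rows.pop(fid, None)

    def counts_for(self, fid):
        return self.rows[fid]

# the matrix of Figure 5 (graphs G1..G4)
m = FeatureGraphMatrix(4)
m.add_feature('fa', [0, 1, 0, 0])
m.add_feature('fb', [0, 0, 1, 0])
m.add_feature('fc', [2, 3, 4, 4])
# total embeddings of fa, fb, fc contained in G1
total_G1 = sum(m.counts_for(f)[0] for f in ('fa', 'fb', 'fc'))
assert total_G1 == 2    # fewer than the 3 required, so G1 is filtered out
```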
(1) Index construction: Select small structures as features in the graph database,
and build the feature-graph matrix between the features and the graphs in the
database.
(2) Feature miss estimation: Determine the indexed features belonging to the
query graph, select a feature set (i.e., a subset of the features), calculate the
number of selected features contained in the query graph and then compute
the upper bound of feature misses if the query graph is relaxed with one edge
deletion or relabeling. This upper bound is written as dmax . Some portion
of the query graph can be specified as not to be altered, e.g., key functional
structures.
(3) Query processing: Use the feature-graph matrix to calculate the difference
in the number of features between each graph G in the database and query Q.
If the difference is greater than dmax , discard graph G. The remaining graphs
constitute a candidate answer set, written as CQ . We then calculate substruc-
ture similarity using the existing algorithms and prune the false positives in
CQ .
(4) Query relaxation: Relax the query further if the user needs more matches
than those returned from the previous step; iterate Steps 2 to 4.
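The four steps above can be sketched as a driver loop; every function parameter here is a placeholder for a component the framework describes, not the paper's actual API:

```python
def similarity_search(query, index, compute_dmax, feature_diff, verify,
                      max_relaxation):
    """Sketch of the framework: Step 1 (the feature-graph matrix) is
    `index`, built beforehand; Steps 2-4 run per query.  `compute_dmax`
    bounds feature misses for k edge relaxations, `feature_diff` is the
    feature-count difference, and `verify` is the accurate similarity
    computation on the surviving candidates."""
    answers, k = [], 1
    while not answers and k <= max_relaxation:
        dmax = compute_dmax(query, k)                    # Step 2
        candidates = [g for g, feats in index.items()    # Step 3
                      if feature_diff(feats, query) <= dmax]
        answers = [g for g in candidates if verify(g, query, k)]
        k += 1                                           # Step 4: relax more
    return answers

# toy run with stub components: G1 differs by 5 features, G2 by 3
res = similarity_search('Q', {'G1': 5, 'G2': 3},
                        lambda q, k: 4,          # dmax bound
                        lambda f, q: f,          # feature difference
                        lambda g, q, k: True,    # accurate check (stub)
                        2)
assert res == ['G2']
```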
The feature-graph matrix in Step 1 is built beforehand and can be used by any
query. The similarity search for a query graph takes place in Step 2 and Step 3.
The filtering algorithm proposed should return a candidate answer set as small as
possible, since the cost of the accurate similarity computation is proportional to
the size of the candidate set. A substantial amount of work has been done on
calculating pairwise substructure similarity; readers are referred to the related
work in [Nilsson 1980, Hagadone 1992, Raymond et al. 2002].
In the step of feature miss estimation, we calculate the number of features in the
query graph. One feature may have multiple embeddings in a graph; thus, "the
number of embeddings of a feature" is the more precise term. In this article, the
two terms are used interchangeably for convenience.
In the rest of this section, we introduce how to estimate feature misses by
translating the problem into the maximum coverage problem. The estimation is further refined
through a branch-and-bound method. In Section 5, we will explore the opportunity
of using different feature sets to improve filtering efficiency.
4.3 Feature Miss Estimation
Substructure similarity search is akin to approximate string matching. In approx-
imate string matching, filtering algorithms such as q-gram achieve the best per-
formance because they do not inspect all the string characters. However, filtering
algorithms only work for a moderate relaxation ratio and need a validation algo-
rithm to check the actual matches [Navarro 2001]. Similar arguments also apply
to our structural filtering algorithm in substructure similarity search. Fortunately,
since we are doing substructure search instead of full structure similarity search,
usually the relaxation ratio is not very high in our problem setting.
A string with q characters is called a q-gram. A typical q-gram filtering algorithm
builds an index for all q-grams in a string database. A query string Q is broken
into a set of q-grams, which are compared against the q-grams of each target string
in the database. If the difference in the number of q-grams is greater than the
following threshold, Q will not match this string within k edit distance.
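A minimal sketch of such a counting filter (our illustration; the threshold k·q reflects that one edit operation can destroy at most q q-grams):

```python
from collections import Counter

def qgrams(s, q):
    """Multiset of the q-grams of s."""
    return Counter(s[i:i + q] for i in range(len(s) - q + 1))

def may_match(p, s, q, k):
    """Counting filter: if more than k*q of p's q-grams are absent
    from s, the two strings cannot be within edit distance k, so s
    can be discarded without computing the actual edit distance."""
    missed = sum((qgrams(p, q) - qgrams(s, q)).values())
    return missed <= k * q

assert may_match("abcde", "abxde", 2, 1)            # one substitution
assert not may_match("abcdefgh", "zzzzzzzz", 2, 1)  # far apart
```

As with the structural filter, strings that pass still need a validation step; only the rejections are certain.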
Given two strings P and Q, if their edit distance is k, their difference in the
number of q-grams is at most kq, since a single edit operation can affect at most
q q-grams.
        fa   fb   fb   fc   fc   fc   fc
  e2     1    1    0    0    1    0    1
  e3     1    0    1    0    0    1    1
In order to calculate the maximum feature misses for a given relaxation ratio,
we introduce edge-feature matrix that builds a map between edges and features
for a query graph. In this matrix, each row represents an edge while each column
represents an embedding of a feature. Figure 6 shows the matrix built for the query
graph in Figure 3 and the features shown in Figure 4. All of the embeddings are
recorded. For example, the second and the third columns are two embeddings of
feature fb in the query graph. The first embedding of fb covers edges e1 and e2
while the second covers edges e1 and e3 . The middle edge does not appear in the
edge-feature matrix if a user prefers retaining it. We say that an edge ei hits a
feature fj if fj covers ei .
It is not expensive to build the edge-feature matrix on the fly as long as the
number of features is small. Whenever an embedding of a feature is discovered, a
new column is attached to the matrix. We formulate the feature miss estimation
problem as follows: given a query graph Q and a set of features contained in Q,
if the relaxation ratio is θ, what is the maximum number of features that can be
missed? In fact, it is the maximum number of columns that can be hit by k rows in
the edge-feature matrix, where k = ⌊θ · |Q|⌋. This is a classic maximum coverage (or
set k-cover) problem, which has been proved NP-complete. The optimal solution
that finds the maximal number of feature misses can be approximated by a greedy
algorithm. The greedy algorithm first selects a row that hits the largest number of
columns and then removes this row and the columns covering it. This selection and
deletion operation is repeated until k rows are removed. The number of columns
removed by this greedy algorithm provides a way to estimate the upper bound of
feature misses.
Algorithm 1 shows the pseudo-code of the greedy algorithm. Let mrc be the entry
in the r-th row, c-th column of matrix M. Mr denotes the r-th row vector of matrix
M, while Mc denotes the c-th column vector of matrix M. |Mr | represents the
number of non-zero entries in the r-th row. Line 3 in Algorithm 1 returns the row
with the maximum number of non-zero entries.
Algorithm 1 GreedyCover
Input: Edge-feature Matrix M,
Maximum edge relaxations k.
Output: The number of feature misses Wgreedy .
1: let Wgreedy = 0;
2: for each l = 1 . . . k do
3: select row r s.t. r = arg maxi |Mi |;
4: Wgreedy = Wgreedy + |Mr |;
5: for each column c s.t. mrc = 1 do
6: set Mc =0;
7: return Wgreedy ;
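A direct transcription of Algorithm 1; the list-of-0/1-rows encoding of M is our own:

```python
def greedy_cover(M, k):
    """Algorithm 1 (GreedyCover): M is the edge-feature matrix as a
    list of 0/1 rows; column j is one embedding of some feature.
    Returns W_greedy, an estimate of the number of feature embeddings
    that k edge relaxations can hit."""
    M = [row[:] for row in M]          # work on a copy
    w_greedy = 0
    for _ in range(k):
        # line 3: pick the row hitting the most remaining columns
        r = max(range(len(M)), key=lambda i: sum(M[i]))
        w_greedy += sum(M[r])          # line 4
        hit = [c for c, v in enumerate(M[r]) if v == 1]
        for row in M:                  # lines 5-6: zero out hit columns
            for c in hit:
                row[c] = 0
    return w_greedy

# rows e2 and e3 of the edge-feature matrix in Figure 6
M = [[1, 1, 0, 0, 1, 0, 1],
     [1, 0, 1, 0, 0, 1, 1]]
assert greedy_cover(M, 1) == 4   # at most four embeddings missed
```

With one edge relaxation this reproduces the running example: at most four of the seven embeddings can be missed.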
Theorem 1. Let Wgreedy and Wopt be the total feature misses computed by the
greedy solution and by the optimal solution, respectively. We have

    Wgreedy ≥ [1 − (1 − 1/k)^k] Wopt ≥ (1 − 1/e) Wopt ,    (1)

where k is the number of edge relaxations.

Proof. [Hochbaum 1997]

    Wopt ≤ Wgreedy / [1 − (1 − 1/k)^k]
    Wopt ≤ (e / (e − 1)) Wgreedy
    Wopt ≤ 1.6 Wgreedy    (2)
Traditional applications of the maximum coverage problem focus on approximating
the optimal solution as closely as possible. Here we are only interested in an
upper bound on the optimal solution. Let max_i |Mi | be the maximum number
of features that one edge hits. Obviously, Wopt should be at most k times this
number,

    Wopt ≤ k · max_i |Mi |.    (3)
The above bound is actually adopted from q-gram filtering algorithms. This
bound is a bit loose in our problem setting. The upper bound derived from In-
equality 2 is usually tighter for non-consecutive sequences, trees and other complex
structures. It may also be useful for approximate string filtering if we do not enu-
merate all q-grams in strings for a given query string.
4.4 Estimation Refinement
A tight bound of Wopt is critical to the filtering performance since it often leads to a
small set of candidate graphs. Although the bound derived by the greedy algorithm
cannot be improved asymptotically, we may still improve the greedy algorithm in
practice.
Let Wopt (M, k) be the optimal value of the maximum feature misses for k edge
relaxations. Suppose r = arg maxi |Mi |. Let M′ be M except (M′ )r = 0 and
(M′ )c = 0 for any column c that is hit by row r, and M′′ be M except (M′′ )r = 0.
Any optimal solution that leads to Wopt should satisfy one of the following two
cases: (1) r is selected in this solution; or (2) r is not selected (we call r disqualified
for the optimal solution). In the first case, the optimal solution should also contain
the optimal solution for the remaining matrix M′ . That is, Wopt (M, k) = |Mr | +
Wopt (M′ , k − 1). k − 1 means that we need to remove the remaining k − 1 rows
from M′ since row r is selected. In the second case, the optimal solution for M
should be the optimal solution for M′′ , i.e., Wopt (M, k) = Wopt (M′′ , k). k means
that we still need to remove k rows from M′′ since row r is disqualified. We call
the first case the selection step, and the second case the disqualifying step. Since
the optimal solution is to find the maximum number of columns that are hit by k
edges, Wopt should be equal to the maximum value returned by these two steps.
Therefore, we can draw the following conclusion.
Lemma 1.

    Wopt (M, k) = max { |Mr | + Wopt (M′ , k − 1),  Wopt (M′′ , k) }.    (4)
Algorithm 2 West(M, k, h, b)
1: if b ≥ B or h ≥ H then
2: return Wapx (M, k);
3: select row r that maximizes |Mr |;
4: let M′ = M and M′′ = M;
5: set (M′ )r = 0 and (M′ )c = 0 for any c if mrc = 1;
6: set (M′′ )r = 0;
7: W1 = |Mr | + West (M′ , k − 1, h + 1, b) ;
8: W2 = West (M′′ , k, h, b + 1) ;
9: Wa = max(W1 , W2 ) ;
10: Wb = Wapx (M, k);
11: return West = min(Wa , Wb );
We select parameters H and B such that H is less than the number of edge
relaxations, and H + B is less than the number of rows in the matrix. Algorithm 2
is initialized by West (M, k, 0, 0). The bound obtained by Algorithm 2 is not greater
than the bound derived by the greedy algorithm since we intentionally select the
smaller one in Lines 10-11. On the other hand, West (M, k, 0, 0) is not less than
the optimal value since Algorithm 2 is just a simulation of the recursion in Lemma
1, and at each step, it has a greater value. Therefore, we can draw the following
conclusion.
Assume that Wopt (M(h,b) , k) ≤ West (M(h,b) , k, h, b) for some h and b, 0 < h ≤ H
and 0 < b ≤ B. Let West (M(h−1,b) , k, h−1, b) = min{max{W1 , W2 }, Wb } according
Given a query Q and the maximum allowed numbers of selection and disqualifying
steps, H and B, the cost of computing West does not depend on the number of
graphs in the database. Thus, the cost of feature miss estimation remains constant
with respect to the database size.
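The branch-and-bound refinement can be sketched as follows. We repeat the greedy routine from Algorithm 1 so the sketch is self-contained, and we instantiate the fallback bound Wapx with the greedy-derived bound of Inequality 2; that instantiation is an assumption on our part:

```python
import math

def greedy_cover(M, k):
    """Algorithm 1: greedy maximum coverage on a 0/1 edge-feature matrix."""
    M = [row[:] for row in M]
    w = 0
    for _ in range(k):
        r = max(range(len(M)), key=lambda i: sum(M[i]))
        w += sum(M[r])
        hit = [c for c, v in enumerate(M[r]) if v == 1]
        for row in M:
            for c in hit:
                row[c] = 0
    return w

def w_apx(M, k):
    """Upper bound on Wopt via Inequality 2: Wopt <= e/(e-1) * Wgreedy."""
    return math.floor(greedy_cover(M, k) * math.e / (math.e - 1))

def west(M, k, h=0, b=0, H=2, B=2):
    """Sketch of Algorithm 2: branch on selecting or disqualifying the
    heaviest row r (per Lemma 1) until the step budgets H and B are
    spent, then fall back to the approximate bound."""
    if not M or k == 0:
        return 0                       # nothing left to hit
    if b >= B or h >= H:
        return w_apx(M, k)
    r = max(range(len(M)), key=lambda i: sum(M[i]))
    hit = {c for c, v in enumerate(M[r]) if v == 1}
    # M': r selected -- drop row r and zero every column it hits
    M1 = [[0 if c in hit else v for c, v in enumerate(row)]
          for i, row in enumerate(M) if i != r]
    # M'': r disqualified -- drop only row r
    M2 = [row for i, row in enumerate(M) if i != r]
    w1 = sum(M[r]) + west(M1, k - 1, h + 1, b, H, B)
    w2 = west(M2, k, h, b + 1, H, B)
    return min(max(w1, w2), w_apx(M, k))

# rows e2, e3 of the edge-feature matrix in Figure 6
M = [[1, 1, 0, 0, 1, 0, 1],
     [1, 0, 1, 0, 0, 1, 1]]
assert west(M, 1) == 4   # tighter than the fallback bound w_apx(M, 1) == 6
```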
[Figure 7: a target graph G with vertices u1 , . . . , u5 , a query graph Q with vertices
v1 , . . . , v5 , and their feature vectors over features f1 , . . . , f5 .]
We want to know how many more embeddings of feature fi appear in the query
graph, compared to the target graph. Equation (7) calculates this frequency differ-
ence for feature fi ,
    r(ui , vi ) = 0          if ui ≥ vi ,
    r(ui , vi ) = vi − ui    otherwise.    (7)
For the feature vectors shown in Figure 7, r(u1 , v1 ) = 0; we do not take the extra
embeddings from the target graph into account. The summed frequency difference
of each feature in G and Q is written as d(G, Q). Equation (8) sums up all the
frequency differences,
    d(G, Q) = Σ_{i=1}^{n} r(ui , vi ).    (8)
Suppose the query can be relaxed with k edges. Algorithm 2 estimates the upper
bound of allowed feature misses. If d(G, Q) is greater than that bound, we can
conclude that G does not contain Q within k edge relaxations. For this case, we
do not need to perform any complicated structure comparison between G and Q.
Since all the computations are done on the preprocessed information in the indices,
the filtering actually is very fast.
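Equations (7) and (8) and the resulting filter fit in a few lines (a sketch; dmax is the bound produced by the feature miss estimation):

```python
def r(u_i, v_i):
    """Equation (7): how many more embeddings of a feature the query
    has than the target (extra target embeddings are ignored)."""
    return 0 if u_i >= v_i else v_i - u_i

def d(u, v):
    """Equation (8): the summed frequency difference d(G, Q)."""
    return sum(r(ui, vi) for ui, vi in zip(u, v))

def passes_filter(u, v, dmax):
    """Keep G as a candidate only if d(G, Q) does not exceed the
    estimated bound dmax on feature misses."""
    return d(u, v) <= dmax

# G1 from Figure 5 against a query with feature vector (1, 2, 4)
assert d((0, 0, 2), (1, 2, 4)) == 5
assert not passes_filter((0, 0, 2), (1, 2, 4), 4)   # G1 is filtered out
```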
Before we check the problem of feature selection, let us first examine whether
we should include all the embeddings of a feature. An intuition is that we should
eliminate the automorphic embeddings of a feature.
Definition 4 Graph Automorphism. An automorphism of a graph G is an
isomorphism from G to itself, i.e., a mapping from the vertices of G back to the
vertices of G that preserves edge structure and labels.
Consider two feature vectors u = (u1 , u2 , . . . , un ) and v = (v1 , v2 , . . . , vn ) built
from a target graph G and a query graph Q, where ui and vi are the frequencies (the
numbers of embeddings) of feature fi in G and Q, respectively. Suppose structure
fi has κ automorphisms. Then ui and vi are exactly divisible by κ. This also means
that the edge-feature matrix will have duplicate columns. In practice, we should
remove these duplicate columns, since they do not provide additional information.
Let us check a specific case where a query graph Q has only two features, f1 and f2 .
For any target graph G, G ∈ Γ{f1 ,f2 } if and only if d(G, Q) = r(u1 , v1 ) + r(u2 , v2 ) >
dk{f1 ,f2 } . The only situation under which this inequality is guaranteed is when
G ∈ Γ{f1 } and G ∈ Γ{f2 } , since in this case r(u1 , v1 ) − dk{f1 } > 0 and r(u2 , v2 ) −
dk{f2 } > 0; it then follows from the lemma above that r(u1 , v1 ) + r(u2 , v2 ) − dk{f1 ,f2 } > 0.
It is easy to verify that under all other situations, even if G ∈ Γ{f1 } or G ∈ Γ{f2 } ,
it can still be the case that G ∉ Γ{f1 ,f2 } . In the worst case, an evil adversary
can construct an index such that |Γ{f1 ,f2 } | < min{|Γ{f1 } |, |Γ{f2 } |}. This discussion
shows that an algorithm using all features may therefore fail to yield the optimal
solution.
[Figure 8: the pruning halfspaces in the (u1 , u2 ) plane for the feature sets {f1 },
{f2 }, and {f1 , f2 }, bounded by the lines u1 = v1 − dk{f1 } , u2 = v2 − dk{f2 } , and
u1 + u2 = v1 + v2 − dk{f1 ,f2 } .]
For a given query Q with two features {f1 , f2 }, each graph G in the database
can be represented by a point in the plane with coordinates in the form of (u1 , u2 ).
Let v = {v1 , v2 } be the feature vector of Q. To select a feature set and then use it
to prune the target graphs is equivalent to selecting a halfspace and throwing away
all points in the halfspace. Figure 8 depicts three feature selections: {f1 }, {f2 } and
{f1 , f2 }. If only f1 is selected, it corresponds to throwing away all points to the
left of line u1 = v1 − dk{f1 } . If only f2 is selected, it corresponds to throwing away
all points below line u2 = v2 − dk{f2 } . If both f1 and f2 are selected, it corresponds
to throwing away all points below line u1 + u2 = v1 + v2 − dk{f1 ,f2 } , points below
the line u2 = v2 − dk{f1 ,f2 } , and points to the left of line u1 = v1 − dk{f1 ,f2 } . It is
easy to observe that, depending on the distribution of the points, each feature set
could have varied pruning power.
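The halfspace picture can be made concrete with a toy check; the numeric bounds below are illustrative values we made up, not results from the paper:

```python
def pruned_by(u, v, d1, d2, d12):
    """Which feature-set filters discard a target graph at point
    u = (u1, u2), given query vector v = (v1, v2)?  d1, d2, d12 are the
    feature-miss bounds for {f1}, {f2}, {f1,f2}.  Mirrors the three
    halfspace families of Figure 8."""
    (u1, u2), (v1, v2) = u, v
    out = []
    if u1 < v1 - d1:
        out.append('{f1}')
    if u2 < v2 - d2:
        out.append('{f2}')
    if (u1 + u2 < v1 + v2 - d12) or (u1 < v1 - d12) or (u2 < v2 - d12):
        out.append('{f1,f2}')
    return out

# with v = (3, 3) and bounds d1 = d2 = d12 = 1:
assert pruned_by((2, 2), (3, 3), 1, 1, 1) == ['{f1,f2}']
assert pruned_by((1, 3), (3, 3), 1, 1, 1) == ['{f1}', '{f1,f2}']
```

Here the combined set {f1 , f2 } prunes the point (2, 2), which neither single-feature filter catches, illustrating why different feature sets carry different pruning power.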
Note that, by Lemma 4, the line u1 + u2 = v1 + v2 − dk{f1 ,f2 } is always above
the point (v1 − dk{f1 } , v2 − dk{f2 } ); it passes through the point if and only if
dk{f1 ,f2 } = dk{f1 } + dk{f2 } . This explains why even applying all the features one after another for
pruning does not generally guarantee the optimal solution. Alternatively, we can
conclude that, given the set F = {f1 , f2 , . . . , fm } of all features in a query graph
Q, the smallest candidate set remained after the pruning is contained in a convex
subspace of the m-dimensional feature space. The convex subspace in the example
is shown as shaded area in Figure 8.
feasible region. In order to prove this result, we cite the following theorem, which
is well-known in linear programming.
Theorem 2. [Padberg 1995] An inequality dx ≥ d0 is redundant relative to a
system of n linear inequalities in m unknowns, Ax ≥ b, x ≥ 0, if and only if the
inequality system is unsolvable or there exists a row vector u ∈ R^n satisfying

    u ≥ 0,    d ≥ uA,    ub ≥ d0 .
[Figure 9: a weighted set system over features f1 , . . . , fm : sets s1 , s2 , . . . , sn with
weights w(s1 ), w(s2 ), . . . , w(sn ).]
Denote by π(X) the set of nonzero indices of a vector X, and by 2^S the power set
of S.
Lemma 5. Given a graph Q, a feature set F = {f1 , f2 , . . . , fm } (fi ⊆ Q) and a
weighted set system Φ = (I, w), where I ⊆ 2^F \ {∅, F } and w : I → R+ , define the
function gΦ : F → R+ by

    gΦ (f ) = 0                        if {S ∈ I | f ∈ S} = ∅,
    gΦ (f ) = Σ_{f ∈S, S∈I} w(S)       otherwise.

Denote by gΦ (F ) the feature set F weighted by gΦ , such that deleting an edge of
a feature f kills an amount gΦ (f ) of that feature. Let dkgΦ(F) be the maximum
amount of features that can be killed by deleting k edges of Q, for the weighted
feature set gΦ (F ). Then,

    max_{S∈I} { w(S) dkS }  ≤  dkgΦ(F)  ≤  Σ_{S∈I} w(S) dkS .
Proof. (1) For any S ∈ I, since S ⊆ F , we have dkS ≤ dkF , so the weighted
inequality w(S)dkS ≤ w(S)dkF ≤ dkgΦ(F) holds.
(2) Let F ∗ ⊆ F be the set of features killed in a solution of dkgΦ(F). Then for any
S ∈ I, we have |F ∗ ∩ S| ≤ dkS , since all features in F ∗ ∩ S can be hit by deleting k
edges over the feature set S. Summing over I,

    Σ_{S∈I} w(S)dkS  ≥  Σ_{S∈I} w(S) |F ∗ ∩ S|
                     =  Σ_{f ∈F ∗} Σ_{f ∈S, S∈I} w(S)
                     =  Σ_{f ∈F ∗} gΦ (f )  =  dkgΦ(F) .
Figure 9 depicts a weighted set system. The features in each set S are assigned
a weight w(S). The total weight of a feature f is the sum of the weights of the sets
that include this feature, i.e., Σ_{f∈S, S∈I} w(S). Lemma 5 shows that an optimal solution
of d^k_{g_Φ(F)} in a feature set weighted by Φ constitutes a (sub)optimal solution of d^k_S.
In fact, Lemma 5 is a generalization of Lemma 4.
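To make the relationship with Lemma 4 concrete, consider the unit-weight instance below (an illustration we add; notation follows the lemma):

```latex
% Take F = \{f_1, f_2\} and I = \{\{f_1\}, \{f_2\}\} with
% w(\{f_1\}) = w(\{f_2\}) = 1. Then g_\Phi(f_1) = g_\Phi(f_2) = 1,
% so the weighted quantity d^k_{g_\Phi(F)} coincides with d^k_F,
% and the bounds of Lemma 5 specialize to
\max\{d^k_{f_1},\, d^k_{f_2}\} \;\le\; d^k_F \;\le\; d^k_{f_1} + d^k_{f_2},
% the unweighted max/sum bound of Lemma 4.
```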
Lemma 6. Given a feature set F (|F| = m > 1) and a weighted set system Φ = (I, w),
where I ⊆ 2^F \ {∅, F} and w : I → R^+, if ∀f ∈ F, Σ_{f∈S, S∈I} w(S) = 1, then

    Σ_{S∈I} w(S) ≥ 1 + 1/(2^{m−1} − 1).
Proof. For any f ∈ F, there are at most 2^{m−1} − 1 subsets S ∈ I such that
f ∈ S. Let w(S*) = max{w(S) | f ∈ S, S ∈ I}. We have w(S*) ≥ 1/(2^{m−1} − 1),
since Σ_{f∈S, S∈I} w(S) = 1. Since S* ⊂ F, there exists a feature f′ ∉ S*. Since
Σ_{f′∈S, S∈I} w(S) = 1, we conclude

    Σ_{S∈I} w(S) ≥ Σ_{f′∈S, S∈I} w(S) + w(S*) ≥ 1 + 1/(2^{m−1} − 1).
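The bound of Lemma 6 is already attained for m = 2 (a check we add for illustration):

```latex
% For m = 2, I \subseteq 2^F \setminus \{\emptyset, F\} = \{\{f_1\}, \{f_2\}\}.
% The normalization \sum_{f \in S, S \in I} w(S) = 1 for every f forces
w(\{f_1\}) = w(\{f_2\}) = 1,
% hence
\sum_{S \in I} w(S) = 2 = 1 + \frac{1}{2^{m-1} - 1}.
```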
Fig. 10. The constructed query graph Q with components Q_{S_1}, Q_{S_2}, ..., Q_{S_n} and Q_F; each component Q_S contains the features f_{i_1}, ..., f_{i_k} of S, and edge e is a "cutter".
Lemma 7. Given a feature set F and a weighted set system Φ = (I, w), where
I ⊆ 2^F \ {∅, F} and w : I → R^+, if ∀f ∈ F, Σ_{f∈S, S∈I} w(S) = 1, then there exists
a query graph Q such that d^k_{g_Φ(F)} < Σ_{S∈I} w(S) d^k_S, for any such weight function w.
Proof. We prove the lemma by constructing a query graph Q that has a set of
features F = {f_1, f_2, ..., f_m}. Q has 2^m − 1 connected components, as shown in
Figure 10. The components are constructed such that each component Q_S corresponds
to a different feature subset S ∈ 2^F \ {∅}. Each Q_S can be viewed, at a high
level, as a set of connected "rings", such that there is an edge in each ring, called
a "cutter" of Q_S, whose deletion kills α_S copies of each feature in S. In each
component, a "cutter" kills the largest number of features among all edges. Such
a construction is certainly feasible, and in fact straightforward, since we have the
liberty to choose all the features. Edge e in Figure 10 is an example of a cutter;
the deletion of e hits all of the features in Q_S. We then try to set α_S for all the
components so that we can fix both the solution to d^k_{g_Φ(F)} and those to
d^k_S, S ∈ I, and make d^k_{g_Φ(F)} < Σ_{S∈I} (w(S) d^k_S). Let the number of features killed by
deleting a "cutter" from Q_S be x_S = Σ_{f∈S} α_S = |S| α_S. We will later assign α_S such that
x_S is the same for all Q_S, S ∈ I. In particular, the following conditions have to be
satisfied:
(1) The solution to d^k_{g_Φ(F)} is the full feature set F. This means the k edges to be
deleted must all be "cutters" in component Q_F. In this case, each "cutter" kills a
weighted amount of

    Σ_{f∈F} α_F Σ_{f∈S, S∈I} w(S) = Σ_{f∈F} α_F = m α_F.
Lemma 8. There exist a query graph Q and a feature set F, such that none of
the inequalities in the corresponding inequality system Ax ≥ b is redundant.
Proof. Because every feature is contained in 2^{m−1} different feature sets, any given
column of A thus consists of exactly 2^{m−1} 1s and 2^{m−1} − 1 0s. Recall that d^k_F is
defined to be the maximum number of features in a chosen feature set F that can
be killed by deleting k edges from a query graph. Therefore, b ≥ 0.
Take from the system the i-th inequality A_i x ≥ b_i. Let A′x ≥ b′ be the resulting
system after deleting this inequality. We prove that this inequality is not redundant
relative to A′x ≥ b′.
It is obvious that A′x ≥ b′ is solvable, since by assigning values large enough
to all the variables, all inequalities can be satisfied. The feasible region is indeed
unbounded. We are left to show that there exists no row vector u ∈ R^{2^m − 2}
satisfying

    u ≥ 0,  A_i ≥ uA′,  ub′ ≥ b_i.

As there are exactly 2^{m−1} 1s in every column c ∉ π(A_i) of A′, in order to satisfy
u ≥ 0 and A_i ≥ uA′, it has to be that u_j = 0 for j ∈ π(A′_c). We prove that for all such
u, ub′ < b_i.
Let H = π(A_i), and for each row of A′ let θ_{π((A′)_i)} = u_i. For any S ⊆ {1, 2, ..., m}, denote the
feature set F_S = {f_i | i ∈ S}. Define a weighted set system Φ = (I, w), with I = 2^H \ {∅, H}
and w(S) = θ_S.
    ub′ − b_i = Σ_{S∈I} θ_S (Σ_{j∈S} v_j − d^k_{F_S}) − (Σ_{j∈H} v_j − d^k_{F_H})

             = Σ_{S∈I} θ_S Σ_{j∈S} v_j − Σ_{j∈H} v_j [Σ_{j∈S,S∈I} θ_S + (1 − Σ_{j∈S,S∈I} θ_S)]
               − Σ_{S∈I} θ_S d^k_{F_S} + d^k_{F_H}

             = Σ_{S∈I} θ_S Σ_{j∈S} v_j − Σ_{j∈H} v_j Σ_{j∈S,S∈I} θ_S − Σ_{j∈H} v_j (1 − Σ_{j∈S,S∈I} θ_S)
               + d^k_{F_H} − Σ_{S∈I} θ_S d^k_{F_S}

             = Σ_{j∈{1,2,...,m}} v_j Σ_{j∈S,S∈I} θ_S − Σ_{j∈H} v_j Σ_{j∈S,S∈I} θ_S − Σ_{j∈H} v_j (1 − Σ_{j∈S,S∈I} θ_S)
               + d^k_{F_H} − Σ_{S∈I} θ_S d^k_{F_S}

             = − Σ_{j∈H} v_j (1 − Σ_{j∈S,S∈I} θ_S) + d^k_{F_H} − Σ_{S∈I} θ_S d^k_{F_S}
(2) If ∀j, Σ_{j∈S,S∈I} θ_S = 1, then

    ub′ − b_i = d^k_{F_H} − Σ_{S∈I} θ_S d^k_{F_S}
              = d^k_{g_Φ(F_H)} − Σ_{S∈I} θ_S d^k_{F_S}.

Since we have proved in Lemma 7 that there exists a query graph Q and
a feature set F such that, for any u ≥ 0 satisfying ∀j, Σ_{j∈S,S∈I} θ_S = 1,
d^k_{g_Φ(F_H)} < Σ_{S∈I} θ_S d^k_{F_S}, it follows that in this case ub′ − b_i < 0.
Therefore we have ub′ − b_i < 0. As such, there exists no row vector
u ∈ R^{2^m − 2} satisfying

    u ≥ 0,  A_i ≥ uA′,  ub′ ≥ b_i.
Now that we have established these lemmas under the modified definition of r(u_i, v_i)
in Definition (11), it is time to go back to our original Definition (7). For any
selected feature set F_i, let F′_i = {f_j | f_j ∈ F_i, u_j ≥ v_j}. Then the inequality of
F_i becomes Σ_{x_i∈F_i\F′_i} x_i ≥ Σ_{x_i∈F_i\F′_i} v_i − d^k_{F_i}. Since d^k_{F_i\F′_i} ≤ d^k_{F_i}, the
hyperplane defined by this inequality always lies outside the feasible region of the
halfspace defined by Σ_{x_i∈F_i\F′_i} x_i ≥ Σ_{x_i∈F_i\F′_i} v_i − d^k_{F_i\F′_i}, and the latter is an
inequality of the inequality system in our proved lemma. Since a hyperplane has
to intersect the current feasible region to invalidate the nonredundancy of any
inequality, adding these hyperplanes will not make any of the inequalities
in the system redundant. By the definition of a redundant constraint, Lemma 8 also
holds under the original definition of r(u_i, v_i) in Definition (7).
We now prove the lower bound on the complexity of the feature set selection
problem by an adversary argument, a technique that has been widely used in computational
geometry to prove lower bounds for many fundamental geometric problems
[Kislicyn 1964, Erickson 1996]. In general, the argument works as follows. Any algorithm
that correctly computes the output must access the input. Instead of querying
an input chosen in advance, imagine that an all-powerful malicious adversary pretends to
choose an input, and answers queries in whatever way will make the algorithm
do the most work. If the algorithm does not make enough queries, there will be
several different inputs, each consistent with the adversary's answers, that should
result in different outputs. Whatever the output of the algorithm, the adversary
can reveal an input that is consistent with all of its answers, yet inconsistent with
the algorithm's output. Therefore any correct algorithm has to make the
most queries in the worst case.
Theorem 3. [Single Feature Set Selection Problem] Suppose F = {f_1, f_2, ..., f_m}
is the set of all features in query graph Q. In the worst case, it takes Ω(2^m) steps
to compute F_opt such that |Γ_{F_opt}| = max_{F′⊆F} {|Γ_{F′}|}.
Proof. Given a query graph Q, imagine an adversary has the N points at his
disposal, each corresponding to an indexed graph. For any algorithm A to compute
Fopt , it would have to determine if there exists a halfspace defined by a feature set
F ′ that could prune more points than the current best choice. Assume that A has
to compare a point with the hyperplane in order to know if the point lies in the
halfspace. Suppose that it stops after checking k inequalities and claims that Fopt
is found. Let S be the current feasible region formed by these k halfspaces. The
following observations are immediate.
(1) Placing any new point inside S does not change the number of points that can
be pruned by any F ′ already checked, i.e., the current best choice remains the
same.
(2) Any unchecked inequality corresponds to a hyperplane that will “cut” off a
nonempty convex subspace from S since it is not redundant.
Then a simple strategy for the adversary is to always keep more than half of the N
points in hand. Whenever A stops before checking all the 2^m − 1 inequalities and
claims an answer for F_opt, the adversary can put all the points in hand into the
nonempty subspace of the current feasible region that would be cut off by adding an
unchecked inequality. Since this inequality now prunes more points than any other
inequality checked so far, the algorithm A would fail in computing F_opt. Therefore, in
the worst case, any algorithm has to take Ω(2^m) steps to compute F_opt.
Corollary 1. [Fixed Number of Feature Sets Selection Problem] Suppose F =
{f_1, f_2, ..., f_m} is the set of all features in query graph Q. In the worst case, it
takes Ω(2^m) steps to compute S_F = {F′ | F′ ⊆ F} with |S_F| = c such that S_F prunes the
most graphs among all sets of c feature sets, where c is a constant.
Proof. The proof is by an argument similar to that in Theorem 3. An
adversary can always keep more than half of the N points in hand and choose,
depending on the output of the algorithm, whether or not to place them in the
nonempty polytope cut off by an inequality that has not been checked. Since the
algorithm, before checking the corresponding inequality, has no access to this knowledge,
any correct algorithm would fail if it announces an optimal set of c feature
sets in fewer than Ω(2^m) steps.
Corollary 2. [Multiple Feature Sets Selection Problem] Suppose F = {f_1, f_2,
..., f_m} is the set of all features in query graph Q. In the worst case, it takes Ω(2^m)
steps to compute the smallest candidate set.
Proof. The proof is by an argument similar to that in Theorem 3 and Corollary
1.
Theorem 3 shows that to prune the most graphs in one access to the
index structure, it takes time exponential in the number of features in the worst
case. Corollary 1 shows that even if we want to compute a set of feature sets such
that, used one after another over multiple accesses to the index, they prune the most
graphs, such an optimal set is also hard to compute.
5.3 Clustering based Feature Set Selection
Theorem 3 shows that it takes an exponential number of steps to find an optimal
solution in the worst case. In practice, we are interested in the heuristics that
are good for a large number of query graphs. We use selectivity defined below to
measure the filtering power of a feature f for all graphs in the database.
Initially, each feature is an individual cluster. At each level, the algorithm recursively
merges the two closest clusters into a single cluster, where "closest" means that their
selectivities are nearest. Each cluster is associated with two parameters: the average
selectivity of the cluster and the number of features in it. The selectivity of two
merged clusters is defined by a linear interpolation of their own selectivities,

    (n_1 δ_1 + n_2 δ_2) / (n_1 + n_2),    (12)

where n_1 and n_2 are the numbers of features in the two clusters, and δ_1 and δ_2 are
their corresponding selectivities.
Fig. 11. Hierarchical clustering tree over features f_1, f_2, ..., f_6; the cut yields 3 clusters.
Features are first sorted according to their selectivity and then clustered hierarchically.
Assume that δ_{f_1}(D, Q) ≤ δ_{f_2}(D, Q) ≤ ... ≤ δ_{f_6}(D, Q). Figure 11 shows
a hierarchical clustering tree. In the first round, f_5 is merged with f_6. In the second
round, f_1 is merged with f_2. After that, f_4 is merged with the cluster formed by
f_5 and f_6 if f_4 is the closest one to them. Since the clustering is performed in one
dimension, it is very efficient to build.
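As a concrete sketch of this clustering step, the following Python (our illustration; the function names are not from the Grafil implementation) sorts features by selectivity and repeatedly merges the two adjacent clusters with the nearest average selectivities, combining selectivities by Equation (12):

```python
def merge_selectivity(n1, d1, n2, d2):
    """Selectivity of two merged clusters, Eq. (12): (n1*d1 + n2*d2)/(n1 + n2)."""
    return (n1 * d1 + n2 * d2) / (n1 + n2)

def cluster_by_selectivity(features, num_clusters):
    """Agglomerative one-dimensional clustering: repeatedly merge the two
    adjacent clusters whose average selectivities are closest.

    features: list of (feature_id, selectivity) pairs.
    Returns a list of clusters, each a list of feature ids.
    """
    # Sort once; in one dimension the closest pair is always adjacent.
    feats = sorted(features, key=lambda f: f[1])
    # Each cluster is (member ids, feature count, average selectivity).
    clusters = [([fid], 1, sel) for fid, sel in feats]
    while len(clusters) > num_clusters:
        # Find the adjacent pair with the smallest selectivity gap.
        i = min(range(len(clusters) - 1),
                key=lambda j: clusters[j + 1][2] - clusters[j][2])
        (m1, n1, d1), (m2, n2, d2) = clusters[i], clusters[i + 1]
        merged = (m1 + m2, n1 + n2, merge_selectivity(n1, d1, n2, d2))
        clusters[i:i + 2] = [merged]
    return [members for members, _, _ in clusters]
```

On the six features of Figure 11 with illustrative selectivities 0.10, 0.13, 0.30, 0.50, 0.58, 0.60, cutting at three clusters yields {f1, f2}, {f3}, {f4, f5, f6}, matching the grouping described in the text.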
6. ALGORITHM IMPLEMENTATION
In this section, we formulate our filtering algorithm, called Grafil (Graph Similarity
Filtering).
Grafil consists of two components: a base component and a clustering component.
Both of them apply the multi-filter composition strategy. The base component
generates feature sets by grouping features of the same size and uses them to filter
graphs based on the upper bound of allowed feature misses derived in Section 4.4. It
first applies the filter using features with one edge, then the one using features with
two edges, and so on. We denote the base component by Grafil-base. The clustering
component combines the features whose sizes differ by at most 1, and groups them
by their selectivity values. Algorithm 3 sketches the outline of Grafil. Fi in Line 2
represents the set of features with i edges. Lines 2-4 form the base component and
Lines 5-11 form the clustering component. Once the hierarchical clustering is done
on features with i edges and i + 1 edges, Grafil divides them into three groups with
high selectivity, medium selectivity, and low selectivity, respectively. A separate
filter is constructed based on each group of features. For the hierarchical clusters
Algorithm 3 Grafil
Input: Graph database D, feature set F,
       maximum feature size maxL, and
       a relaxed query Q.
Output: Candidate answer set CQ.
1: let CQ = D;
2: for each feature set Fi, i ≤ maxL do
3:   calculate the maximum feature misses dmax;
4:   CQ = { G | d(G, Q) ≤ dmax, G ∈ CQ };
5: for each feature set Fi ∪ Fi+1, i < maxL do
6:   compute the selectivity based on CQ;
7:   do the hierarchical clustering on features in Fi ∪ Fi+1;
8:   cluster features into three groups, X1, X2, and X3;
9:   for each cluster Xi do
10:    calculate the maximum feature misses dmax;
11:    CQ = { G | d(G, Q) ≤ dmax, G ∈ CQ };
12: return CQ;
shown in Figure 11, Grafil will choose f1 and f2 as group 1, f3 as group 2, and f4 ,
f5 and f6 as group 3.
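The filtering steps in Lines 3-4 and 10-11 of Algorithm 3 can be sketched as follows. This is a minimal illustration, assuming the feature-graph matrix is given as per-graph feature-count maps and treating the bound dmax (derived in Section 4.4) as an input; the names `feature_miss` and `apply_filter` are ours, not Grafil's.

```python
def feature_miss(query_counts, graph_counts):
    """Number of query feature occurrences missing from a graph:
    the sum over features of max(0, count in query - count in graph)."""
    return sum(max(0, q - graph_counts.get(f, 0))
               for f, q in query_counts.items())

def apply_filter(candidates, fgm, query_counts, d_max):
    """One filtering step of the Grafil outline: keep graph G only if its
    feature misses do not exceed the bound d_max.

    candidates:   iterable of graph ids
    fgm:          feature-graph matrix, graph id -> {feature: count}
    query_counts: {feature: count}, restricted to one feature set
    """
    return [g for g in candidates
            if feature_miss(query_counts, fgm[g]) <= d_max]
```

In pipeline mode, successive calls chain each returned candidate list into the next call, as in Figure 12; in parallel mode, each filter starts from the same candidate set.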
To deploy the multiple filters, Grafil can run in a pipeline mode or a parallel
mode. The diagram in Figure 12 depicts the pipeline mode, where the candidate
answer set returned from the current step is pumped into the next step.
Fig. 12. The pipeline mode: the candidate answer set produced with each feature set is passed to the next, ending with feature set n producing C_n.

In the pipeline mode, each filter works on the candidate answer set returned from
the previous step, while the parallel mode does not. We will show the performance
impact of this difference in the next section.
7. EMPIRICAL STUDY
In this section, we conduct several experiments to examine the properties of Grafil.
The performance of Grafil is compared with two algorithms based on a single filter:
one using individual edges as features (denoted as Edge) and the other using all
features of a query graph (denoted as Allfeature). Many similarity search algorithms
[Hagadone 1992, Raymond et al. 2002] can only apply the edge-based filtering
mechanism since the mapping between edge deletion/relabeling and feature misses
was not established before this study. In fact, the edge-based filtering approach can
be viewed as a degenerate case of the feature-based approach using a filter with
single edge features. By demonstrating the conditions where Grafil can filter more
graphs than Edge and Allfeature, we show that Grafil can substantially improve
substructure similarity search in large graph databases.
Two kinds of datasets are used throughout our empirical study: one real dataset
and a series of synthetic datasets. The real dataset is an AIDS antiviral screen
dataset containing the topological structures of chemical compounds. This dataset
is available on the website of the Developmental Therapeutics Program (NCI/NIH)3 .
In this dataset, thousands of compounds have been checked for evidence of anti-HIV
activity. The dataset has around 44,000 structures. The synthetic data generator
was kindly provided by Kuramochi et al. [Kuramochi and Karypis 2001]. The gen-
erator allows the user to specify various parameters, such as the database size, the
average graph size, and the label types, to examine the scalability of Grafil.
We built Grafil based on the gIndex algorithm [Yan et al. 2004]. gIndex first
mines frequent subgraphs with size up to 10 and then retains discriminative ones
as indexing features. We thus take the discriminative frequent structures as our
indexing features. Certainly, other kinds of features can be used in Grafil too, since
Grafil does not rely on the kinds of features to be used. For example, Grafil can also
take paths [Shasha et al. 2002] as features to perform the similarity search.
Through our experiments, we illustrate that
(1) Grafil can efficiently prune the search space for substructure similarity search
and generate up to 15 times fewer candidate graphs than the alternatives in
the chemical dataset.
(2) Bound refinement and feature set selection for the multiple filter approach
developed by Grafil are both effective.
(3) Grafil performs much better for graphs with a small number of labels.
(4) The single filter approach using all features together does not perform well due
to the frequency conjugation problem identified in Section 5. Neither does the
approach using individual edges as features due to their low selectivity.
Experiments on the Chemical Compound Dataset.
We first examine the performance of Grafil over the AIDS antiviral database. The
test dataset consists of 10, 000 graphs that are randomly selected from the AIDS
3 http://dtpsearch.ncifcrf.gov/FTP/AIDO99SD.BIN
screen database. These graphs have 25 nodes and 27 edges on average. The largest
one has 214 nodes and 217 edges in total. Note that in this dataset most of the atoms
are carbons and most of the edges are carbon-carbon bonds. This characteristic
makes the substructure similarity search very challenging. The query graphs are
directly sampled from the database and are grouped together according to their
size. We denote the query set by Qm , where m is the size of the graphs in Qm . For
example, if the graphs in a query set have 20 edges each, the query set is written
as Q20. Different from the experimental setting in [Yan et al. 2004], the edges in our
dataset are assigned edge types, such as single bond, double bond, and so on.
By doing so, we reduce the number of exact substructure matches for each query
graph. This is exactly the case where the substructure similarity search will help:
find a relatively large matching set by relaxing the query graph a little bit. When
a user submits a substructure similarity query, he may not allow arbitrary deletion
of some critical atoms and bonds. In order to simulate this constraint, we retain
25% of the edges in each query graph.
Fig. 13. Number of candidate answers (log scale) vs. number of edge relaxations for Edge, All, and Grafil.
Fig. 14. Number of candidate answers (log scale) vs. number of edge relaxations for Edge, All, and Grafil.
Fig. 15. Number of candidate answers vs. number of edge relaxations, with and without bound refinement.
Having examined the overall performance of Grafil in comparison with the other
two approaches, we test the effectiveness of each component of Grafil. We take Q20
as a test set. Figure 15 shows the performance difference before and after we apply
the bound refinement in Grafil. In this experiment, we set the maximum number
of selection steps (H) at 2, and the maximum number of disqualifying steps (B)
at 6. It seems that the bound refinement makes a critical improvement when the
relaxation ratio is below 20%. At high relaxation ratios, bound refinement does
not have apparent effects: as explained in the previous experiments, Grafil mainly
relies on the edge feature set to filter graphs when the ratio is high, and in this case
bound refinement will not be effective at all. In summary, it is worth doing bound
refinement for moderate relaxation ratios.
Fig. 16. Filtering ratio achieved by the clustering component vs. number of edge relaxations.
Figure 16 shows the filtering ratio obtained by applying the clustering component
in Grafil. Let C_Q and C′_Q be the candidate answer sets returned by Grafil (with the
clustering component) and Grafil-base (with the base component only), respectively.
The filtering ratio in the figure is defined by |C′_Q|/|C_Q|. The test is performed on the
query set Q20. Overall, Grafil with the clustering component is 40%–120% better
than Grafil-base. We also perform a similar test to calculate the filtering gain achieved by
the pipeline mode over the parallel mode; the pipeline mode is 20%–60% better.
Experiments on the Synthetic Datasets.
The synthetic data generator first creates a set of seed structures randomly. Seed
structures are then randomly combined to form a synthesized graph. Readers are
referred to [Kuramochi and Karypis 2001] for details about the synthetic data
generator. A typical dataset may have 10,000 graphs and use 200 seed fragments with
10 kinds of nodes and edges. We denote this dataset by D10kI10T50L200E10V10.
E10 (V10) means there are 10 kinds of edge labels (node labels). In this dataset,
each graph has 50 edges (T50) and each seed fragment has 10 edges (I10) on average.
Since the parameters of synthetic datasets are adjustable, we can examine the
conditions under which Grafil outperforms Edge. One can imagine that when the types of
labels in a graph become very diverse, Edge will perform nearly as well as Grafil.
The reason is obvious: since the graph will have fewer duplicate edges, we may treat
it as a set of tuples {node1 label, node2 label, edge label} instead of a complex
structure. This result is confirmed by the following experiment. We generate a
synthetic dataset, D10kI10T50L200E10V10, which has 10 edge labels and 10
node labels. This setting generates (10 × 11)/2 × 10 = 550 different edge tuples.
Most of the graphs in this synthetic dataset have 30 to 100 edges. If we represent
a graph as a set of edge tuples, few edge tuples will be the same for each graph in
this dataset. In this situation, Edge is good enough for similarity search. Figure 17
shows the results for queries with 24 edges. The two curves are very close to each
other, as expected.
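The edge-tuple count above can be verified by direct enumeration (a check we add; the label sets are generic stand-ins):

```python
from itertools import combinations_with_replacement

node_labels = range(10)
edge_labels = range(10)

# An edge tuple {node1 label, node2 label, edge label}: the node pair is
# unordered because the edge is undirected, so pairs are taken with repetition.
edge_tuples = {(pair, e)
               for pair in combinations_with_replacement(node_labels, 2)
               for e in edge_labels}
print(len(edge_tuples))  # (10 * 11)/2 * 10 = 550
```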
Fig. 17. Number of candidate answers vs. number of edge relaxations for Edge and Grafil on D10kI10T50L200E10V10.
We then reduce the number of label types in the above synthetic dataset and only
allow 2 edge labels and 4 vertex labels. This setting significantly increases the self
similarity in a graph. Figure 18 shows that Grafil outperforms Edge significantly in
this dataset. We can further reduce the number of label types. For example, if we
ignore the label information and only consider the topological skeleton of graphs,
the edge-based filtering algorithm will not be effective at all. In that situation,
Grafil has more advantages than Edge.
Fig. 18. Number of candidate answers (log scale) vs. number of edge relaxations for Edge and Grafil, with 2 edge labels and 4 vertex labels.
8. CONCLUSIONS
In this study, we have investigated the problem of substructure similarity search in
large scale graph databases, a problem raised by the emergence of massive, com-
plex structural data. Different from the previous work, our solution explored the
filtering algorithm using indexed structural patterns, without doing costly struc-
ture comparisons. The transformation of the structure-based similarity measure to
the feature-based measure renders our method attractive in terms of accuracy and
efficiency. Since our filtering algorithm is fully built on the feature-graph matrix,
it performs very fast without accessing the physical database. We showed that
the multi-filter composition strategy adopted by Grafil is superior to the single fil-
ter approach using all features together due to the frequency conjugation problem
identified in this article. Based on a geometric interpretation, the complexity of
computing the optimal feature set was analyzed in our study; it is Ω(2^m) in the worst
case. In practice, we identified several criteria to build effective feature sets for filtering
and demonstrated that clustering-based feature selection can further improve the filtering
performance. The proposed feature-based indexing concept is general enough to be
applied to searching approximate non-consecutive sequences, trees, and other
structured data as well.
REFERENCES
Beretti, S., Bimbo, A., and Vicario, E. 2001. Efficient matching and indexing of graph models
in content-based retrieval. IEEE Trans. on Pattern Analysis and Machine Intelligence 23,
1089–1105.
Berman, H., Westbrook, J., Feng, Z., Gilliland, G., Bhat, T., Weissig, H., Shindyalov, I.,
and Bourne, P. 2000. The protein data bank. Nucleic Acids Research 28, 235–242.
Bunke, H. and Shearer, K. 1998. A graph distance metric based on the maximal common
subgraph. Pattern Recognition Letters 19, 255–259.
Erickson, J. 1996. Lower bounds for fundamental geometric problems. Ph.D. Thesis,University
of California at Berkeley.
Feige, U. 1998. A threshold of ln n for approximating set cover. Journal of the ACM 45, 634–652.
Garey, M. and Johnson, D. 1979. Computers and Intractability: A Guide to the Theory of
NP-Completeness. Freeman & Co., New York.
Giugno, R. and Shasha, D. 2002. Graphgrep: A fast and universal method for querying graphs.
Proc. 2002 Int. Conf. on Pattern Recognition (ICPR’02), 112–115.
Gravano, L., Ipeirotis, P., Jagadish, H., Koudas, N., Muthukrishnan, S., Pietarinen, L.,
and Srivastava, D. 2001. Using q-grams in a dbms for approximate string processing. Data
Engineering Bulletin 24, 28–37.
Hagadone, T. 1992. Molecular substructure similarity searching: efficient retrieval in two-
dimensional structure databases. J. Chem. Inf. Comput. Sci. 32, 515–521.
Hochbaum, D. 1997. Approximation Algorithms for NP-Hard Problems. PWS Publishing, MA.
Holder, L., Cook, D., and Djoko, S. 1994. Substructure discovery in the subdue system. In
Proc. AAAI'94 Workshop on Knowledge Discovery in Databases (KDD'94). 169–180.
Kailing, K., Kriegel, H., Schnauer, S., and Seidl, T. 2004. Efficient similarity search for
hierarchical data in large databases. In Proc. 9th Int. Conf. on Extending Database Technology
(EDBT’04). 676–693.
Kanehisa, M. and Goto, S. 2000. Kegg: Kyoto encyclopedia of genes and genomes. Nucleic
Acids Research 28, 27–30.
Kislicyn, S. S. 1964. On the selection of the kth element of an ordered set by pairwise compar-
isons. Sibirskii Matematiceskii Zurnal (in Russian) 5, 557–564.
Kuramochi, M. and Karypis, G. 2001. Frequent subgraph discovery. In Proc. 2001 Int. Conf.
on Data Mining (ICDM’01). 313–320.
Messmer, B. and Bunke, H. 1998. A new algorithm for error-tolerant subgraph isomorphism
detection. IEEE Trans. on Pattern Analysis and Machine Intelligence 20, 493–504.
Navarro, G. 2001. A guided tour to approximate string matching. ACM Computing Surveys 33,
31–88.
Nilsson, N. 1980. Principles of Artificial Intelligence. Morgan Kaufmann, Palo Alto, CA.
Padberg, M. 1995. Linear Optimization and Extensions. Springer-Verlag.
Petrakis, E. and Faloutsos, C. 1997. Similarity searching in medical image databases. Knowl-
edge and Data Engineering 9, 3, 435–447.
Raymond, J., Gardiner, E., and Willett, P. 2002. Rascal: Calculation of graph similarity
using maximum common edge subgraphs. The Computer Journal 45, 631–644.
Shasha, D., Wang, J., and Giugno, R. 2002. Algorithmics and applications of tree and graph
searching. In Proc. 21st ACM Symp. on Principles of Database Systems (PODS'02). 39–52.
Srinivasa, S. and Kumar, S. 2003. A platform based on the multi-dimensional data model
for analysis of bio-molecular structures. In Proc. 2003 Int. Conf. on Very Large Data Bases
(VLDB’03). 975–986.
Ukkonen, E. 1992. Approximate string matching with q-grams and maximal matches. Theoretical
Computer Science, 191–211.
Ullmann, J. 1977. Binary n-gram technique for automatic correction of substitution, deletion,
insertion, and reversal errors in words. The Computer Journal 20, 141–147.
Wang, J., Zhang, K., Jeong, K., and Shasha, D. 1994. A system for approximate tree matching.
IEEE Trans. on Knowledge and Data Engineering 6, 559–571.
Willett, P., Barnard, J., and Downs, G. 1998. Chemical similarity searching. J. Chem. Inf.
Comput. Sci. 38, 983–996.
Yan, X., Yu, P., and Han, J. 2004. Graph indexing: A frequent structure-based approach. In
Proc. 2004 ACM Int. Conf. on Management of Data (SIGMOD'04). 335–346.
Yan, X., Yu, P., and Han, J. 2005. Substructure similarity search in graph databases. In Proc.
2005 ACM Int. Conf. on Management of Data (SIGMOD'05). 766–777.