Near-Neighbor Search in Pattern Distance Spaces


Haixun Wang Chang-Shing Perng Philip S. Yu


IBM Thomas J. Watson Research Center, Hawthorne, NY 10532
{haixun,perng,psyu}@us.ibm.com

Copyright © by SIAM. Unauthorized reproduction of this article is prohibited.

Abstract

In this paper, we study the near-neighbor problem based on pattern similarity, a new type of similarity which conventional distance metrics such as the L_p norm cannot model effectively. The problem, however, is important to many applications. For example, in DNA microarray analysis, the expression levels of two closely related genes may rise and fall under different external conditions or at different times. Although the magnitudes of their expression levels may not be close, the patterns they exhibit over time or under different conditions can be very similar. In this paper, we measure the distance between two objects by pattern similarity, i.e., whether the two objects exhibit a synchronous pattern of rise and fall under different conditions. We then present an efficient algorithm for near-neighbor search based on pattern similarity, and we perform tests on several real and synthetic data sets to show its effectiveness.

1 Introduction

The efficiency of near-neighbor search to a large extent depends on the distance function in use [3]. More importantly, the distance function also determines the meaning of similarity and the meaning of the near-neighbor search. In this paper, we address a new type of similarity for near-neighbor search.

DNA microarray analysis. Finding near neighbors based on subspace pattern similarity is important to many applications [1, 9, 7, 8]. Table 1 shows a small portion of the Yeast expression data, where entry d_ij represents the expression level of gene i under condition j (or at time j). Investigations show that more often than not, several genes contribute to a disease, which motivates researchers to identify genes whose expression levels rise and fall synchronously under different conditions or over a period of time, that is, whether they exhibit fluctuation of a similar shape when conditions change.

Gene     t1    t2    t3    t4    t5    t6   ···
VPS8     401   281   120   275   298   210
SSA1     401   292   109   580   238   289
SP07     228   290   285   148   224   231
EFB1     318   280    37   277   215    99
MDM10     10    10   266   328   101   186
CYS3     322   288    41   278   219   231
DEP1     317   272   334   232   192   110
NTG1     329   296    33   274   228   129
...

Table 1: Expression data of Yeast genes

As shown in Table 1, the expression levels of three genes, VPS8, CYS3, and EFB1, rise and fall coherently under three different external conditions t1, t3 and t5. We can also measure the expression levels of genes at fixed time intervals. In this case, assume t1, t2, ···, t5 in Table 1 represent arbitrarily spaced points on the time axis. We find the expression levels of three genes, SP07, MDM10, and DEP1, manifest a coherent pattern with fixed time shift.

Given a new gene, biologists are interested in finding every gene whose expression levels under a certain set of conditions rise and fall coherently with those of the new gene, as such discovery may reveal connections in gene regulatory networks [1]. Clearly, this pattern similarity cannot be captured by distance functions such as Euclidean distance, even if they are applied in the related subspaces.

In this paper, we extend the concept of near-neighbor to the above situation. We say genes VPS8, CYS3, and EFB1 are near-neighbors in the subspace defined by conditions {t1, t3, t5}, and the time series expression levels of genes SP07, MDM10, and DEP1 are near-neighbors from time t1, t2 and t3.

An even more interesting and challenging type of near-neighbor query is the following. We are given the expression levels of a new gene. This new gene might be related to any gene in the database as long as both of them exhibit a pattern in some subspace or at some time offset. The dimensionality of the subspace, or the length of the time period, is often an indicator of the degree of their closeness, that is, the more columns the pattern spans, the closer the relation between the two genes.

In this paper we focus on the pattern-based similarity described above. Traditional distance functions, such as the Euclidean norm, cannot measure pattern similarity. We propose an efficient method to perform near-neighbor search by pattern similarity. Traditional spatial access methods for nearest neighbor search cannot be used for pattern similarity matching because they depend on metric distance functions satisfying the triangle inequality. Experiments show our method is effective and efficient, and it outperforms alternative algorithms (based on an adaptation of the R-Tree index) by an order of magnitude.

2 NN Search by Subspace Pattern Similarity

In this section, we propose an index structure called P-Index (pattern index) to support fast pattern matching and near-neighbor search. A similar structure was used to support sequence matching [8].

2.1 An Overview

We represent each object u ∈ D as a sequence of (column, value) pairs. For each suffix of the sequence, we derive a base-column aligned suffix and insert it into a trie.

The trie supports matching of patterns defined on a column set composed of a continuous sequence of columns, S = {c_i, c_{i+1}, ..., c_{i+k}}. To find patterns in any subspace efficiently, we create the P-Index on top of the trie.

The trie is employed as an intermediary structure to facilitate the building of the P-Index. It embodies a compact index to all the distinct, non-empty, base-column aligned objects in D. Various approaches to build tries or suffix trees in linear time have been developed. Ukkonen [6], for instance, developed a linear-time, on-line suffix tree construction algorithm. We do not address the details of building tries in this paper.

2.2 The Trie Structure

We first introduce a sequential representation of the data, and then use an example to demonstrate the process of constructing the P-Index.

Let D be a dataset in multidimensional space A = {c_1, c_2, ..., c_n}. Unless the dimensions are already in an ordered domain (for example, time), we create an arbitrary order among the dimensions, that is, we assume c_1 ≺ c_2 ≺ ··· ≺ c_n is a total order. We represent each object u ∈ D as a sequence of (column, value) pairs, that is:

    u = (c_1, u_1), (c_2, u_2), ..., (c_n, u_n)

A suffix of u starting with column c_i is denoted by:

    (c_i, u_i), (c_{i+1}, u_{i+1}), ..., (c_n, u_n)

where 1 ≤ i ≤ n. Using the first column in each suffix as its base column, we derive a base-column aligned suffix by subtracting the value of the base (first column) from each column value in the suffix. We use f(u, i) to denote u's base-column aligned suffix that begins with the i-th column:

    (2.1)  f(u, i) = (c_i, 0), (c_{i+1}, u_{i+1} − u_i), ..., (c_n, u_n − u_i)

or, if the columns are numerical (e.g. time), we have:

    (2.2)  f(u, i) = (c_{i+1} − c_i, u_{i+1} − u_i), ..., (c_n − c_i, u_n − u_i)

We then insert each base-column aligned suffix f(u, i) into a trie. In the following, we use an example to demonstrate the process.

EXAMPLE 1. Let database D be composed of the following 2 objects defined in space A = {c_1, c_2, c_3, c_4, c_5}.

    obj   c1   c2   c3   c4   c5
    #1     3    0    4    2    0
    #2     4    1    5    3    6

We represent each object by a sequence of (column, value) pairs. For instance, object #1 in D can be represented by

    (c_1, 3), (c_2, 0), (c_3, 4), (c_4, 2), (c_5, 0)

We use the first column in the sequence as its base column, and derive a base-column aligned suffix by subtracting the value of the base column from each value in the suffix:

    (c_1, 0), (c_2, −3), (c_3, 1), (c_4, −1), (c_5, −3)

We do the same to each suffix (of length ≥ 2) of the object. Table 2 shows all the base-column aligned suffixes derived from the two objects.

    f(u, i), where u ∈ {#1, #2} and i = 1, ···, 4
    f(#1, 1) = (c_1, 0), (c_2, −3), (c_3, 1), (c_4, −1), (c_5, −3)
    f(#1, 2) = (c_2, 0), (c_3, 4), (c_4, 2), (c_5, 0)
    f(#1, 3) = (c_3, 0), (c_4, −2), (c_5, −4)
    f(#1, 4) = (c_4, 0), (c_5, −2)
    f(#2, 1) = (c_1, 0), (c_2, −3), (c_3, 1), (c_4, −1), (c_5, 2)
    f(#2, 2) = (c_2, 0), (c_3, 4), (c_4, 2), (c_5, 5)
    f(#2, 3) = (c_3, 0), (c_4, −2), (c_5, 1)
    f(#2, 4) = (c_4, 0), (c_5, 3)

Table 2: Sequences and suffixes derived from D

We insert the base-column aligned suffixes into a trie. Figure 1 demonstrates the insertion of sequence:

    f(#1, 1) = (c_1, 0), (c_2, −3), (c_3, 1), (c_4, −1), (c_5, −3)

Each leaf node n in the trie maintains an object list, L_n. If the insertion of f(#1, 1) leads to node x, which is under arc (c_5, −3), we append 1 (object #1) to object list L_x.

[Figure 1 (image not reproduced): a single trie path from the root through arcs (c_1, 0), (c_2, −3), (c_3, 1), (c_4, −1), (c_5, −3), each arc labeled with its column and its distance to the base column, ending at node x, whose object list L_x contains 1.]

Figure 1: Insertion of sequence f(u, 1) = (c_1, 0), (c_2, −3), (c_3, 1), (c_4, −1), (c_5, −3). The id of the object is appended to x's object list L_x.

2.3 Building P-Index over a Trie

The trie enables us to find near-neighbors of a query object q = (c_1, v_1), ..., (c_n, v_n) in a given subspace S, provided S is defined by a set of continuous columns, i.e., S = {c_i, c_{i+1}, ..., c_{i+k}}. If ε = 0, all we need to do is follow the path (c_i, 0), (c_{i+1}, v_{i+1} − v_i), ..., (c_{i+k}, v_{i+k} − v_i) in the trie, and when we reach a certain node x at the end of the path, we return the objects in the object lists of those leaf nodes that are x's descendants (including x if x is a leaf node). If ε > 0, we may need to traverse multiple paths at each level.

    Input:  T: a trie built on D
            S: a subspace defined by a continuous column set {c_i, c_{i+1}, ..., c_k}
            q = (c_1, v_1), ···, (c_n, v_n): a query object
            ε: pattern threshold
    Output: near-neighbors of q in subspace S

    n ← root of T;
    search(n, S);

    Function search(x, S)
        if S = ∅ then
            output the descendants of x;
        else
            assume S = {c_j, c_{j+1}, ..., c_k};
            for x's child node y under edge labeled (c_j, v),
                    where v ∈ [(v_j − v_i) − ε, (v_j − v_i) + ε] do
                search(y, {c_{j+1}, ..., c_k});

Algorithm 1: NN Search in a given subspace defined by a continuous column set

Algorithm 1 is a formal description of the above process. It finds all objects whose value difference between column c_j and c_i is within region (v_j − v_i) ± ε, where j = i, i+1, ..., i+k. Hence the correctness of the algorithm follows.

Algorithm 1, however, only finds near-neighbors in a given subspace defined by a set of continuous columns. In the algorithm, at each step j, we can only go directly to a node under edge (c_{j+1}, ·). Finding a descendant node under edge (c_k, ·), where k > j, requires us to traverse the subtree under the current node, which is time-consuming. The P-Index, described below, allows us to 'jump' directly to nodes under (c_k, ·), where k > j. This enables us to efficiently find near-neighbors in any given subspace. Finding near-neighbors in any subspace whose dimensionality is larger than a given threshold, furthermore, requires additional index structures.

We use the following steps to build the P-Index on top of a trie. First, after all sequences are inserted, we assign to each node x a pair of labels, ⟨n_x, s_x⟩, where n_x is the prefix-order of node x in the trie (starting from 0, which is assigned to the root node), and s_x is the number of x's descendant nodes. Next, we create pattern-distance links for each (col, dist) pair, where col ∈ A, dist ∈ {−ξ + 1, ..., ξ − 1}, and ξ is the number of distinct column values (ξ is also regarded as a discretization parameter, or the number of bins the numerical values are discretized into). The links are constructed by a depth-first walk of the suffix trie. When we encounter a node x under arc (col, dist), we append x's label ⟨n_x, s_x⟩ to the pattern-distance link for pair (col, dist). Thus, a pattern link is composed of nodes that have the same distance from their base columns (root node).

The labeling scheme and the pattern-distance links have the following property.

THEOREM 2.1. P-Index Property

1. if nodes x and y are labeled ⟨n_x, s_x⟩ and ⟨n_y, s_y⟩ respectively, and n_x < n_y ≤ n_x + s_x, then y is a descendant node of x;

2. nodes in any pattern-distance link are ordered by their prefix-order number; and

3. for any node x, x's descendants in any pattern-distance link are contiguous in that link.

Proof. 1) and 2) are due to the labeling scheme, which is based on depth-first traversal. For 3), note that if nodes u, ..., v, ..., w are in a pattern-distance link (in that order), and u, w are descendants of x, we have n_x < n_u < n_v < n_w ≤ n_x + s_x, which means v is also a descendant of x.

The above properties enable us to use range queries to find descendants of a given node in a given pattern-distance link.

Algorithm 2 summarizes the index construction procedure. The P-Index is composed of two major parts: i) arrays of ⟨n_x, s_x⟩ pairs for pattern-distance links; and ii) leaf nodes' object lists.

The time complexity of building the P-Index is O(|D||A|). The Ukkonen algorithm [6] builds a suffix tree in linear time. The construction of the trie for pattern-distance indexing is less time consuming because the length of the indexed subsequences is constrained by |A|. Thus, it can be constructed by a brute-force algorithm [4] in linear time.
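As an illustration of Section 2.2 and Algorithm 1, the following Python sketch computes base-column aligned suffixes (Eq. 2.1), inserts them into a trie, and searches a subspace of continuous columns. It is a minimal sketch under stated assumptions, not the authors' implementation: the names aligned_suffix, TrieNode, build_trie and search_contiguous are hypothetical, columns are 0-based, values are assumed to be integers (already discretized), and a plain dictionary-based trie stands in for whatever structure the paper used.

```python
from typing import Dict, List, Sequence, Set, Tuple

Arc = Tuple[int, int]  # (column index, distance to the base column)


def aligned_suffix(values: Sequence[int], i: int) -> List[Arc]:
    """Base-column aligned suffix f(u, i) of Eq. (2.1); i is 0-based here."""
    base = values[i]
    return [(j, values[j] - base) for j in range(i, len(values))]


class TrieNode:
    def __init__(self) -> None:
        self.children: Dict[Arc, "TrieNode"] = {}
        self.objects: List[int] = []  # object list L_x, filled where a suffix ends


def build_trie(db: Dict[int, Sequence[int]]) -> TrieNode:
    """Insert every aligned suffix of length >= 2 of every object."""
    root = TrieNode()
    for obj_id, values in db.items():
        for i in range(len(values) - 1):
            node = root
            for arc in aligned_suffix(values, i):
                node = node.children.setdefault(arc, TrieNode())
            node.objects.append(obj_id)
    return root


def collect(node: TrieNode) -> Set[int]:
    """Object ids stored at a node and at all of its descendants."""
    found = set(node.objects)
    for child in node.children.values():
        found |= collect(child)
    return found


def search_contiguous(root: TrieNode, query: Sequence[int],
                      first: int, last: int, eps: int) -> Set[int]:
    """Algorithm 1 on columns first..last (0-based, inclusive)."""
    result: Set[int] = set()

    def search(node: TrieNode, j: int) -> None:
        if j > last:                      # all columns of S are matched
            result.update(collect(node))
            return
        target = query[j] - query[first]  # v_j - v_i
        for (col, dist), child in node.children.items():
            if col == j and abs(dist - target) <= eps:
                search(child, j + 1)

    search(root, first)
    return result


if __name__ == "__main__":
    # The two objects of Example 1.
    db = {1: [3, 0, 4, 2, 0], 2: [4, 1, 5, 3, 6]}
    trie = build_trie(db)
    print(aligned_suffix(db[1], 0))
    print(search_contiguous(trie, [10, 7, 11, 9, 12], 0, 3, 0))
```

On the database of Example 1, a query whose differences over c_1 through c_4 are 0, −3, 1, −1 matches both objects with ε = 0, because f(#1, 1) and f(#2, 1) in Table 2 share that prefix and differ only in the c_5 entry.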

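The labeling scheme and the range-query property of Theorem 2.1 can be sketched in Python as follows. This is an illustrative sketch only: build_p_index, descendants_in and search_subspace are hypothetical names, columns are 0-based, values are assumed to be integers, and the extra starts table (mapping each column to the root's child under arc (column, 0)) is an assumption introduced here so that a search can begin from any base column. The point of the sketch is that, once nodes carry ⟨n_x, s_x⟩ labels, descendants inside a pattern-distance link are found by binary search rather than by walking the trie.

```python
from bisect import bisect_left
from collections import defaultdict


def build_p_index(db):
    """Return (links, leaf_objs, starts) for a dict {object id: integer values}."""
    # Step 1: insert every base-column aligned suffix (length >= 2) into a trie.
    root = {"kids": {}, "objs": []}
    for obj_id, values in db.items():
        for i in range(len(values) - 1):
            node = root
            for j in range(i, len(values)):
                arc = (j, values[j] - values[i])  # (column, distance to base)
                node = node["kids"].setdefault(arc, {"kids": {}, "objs": []})
            node["objs"].append(obj_id)

    links = defaultdict(list)  # (col, dist) -> [(n_x, s_x), ...] in prefix order
    leaf_objs = []             # (n_x, object id), also in prefix order
    starts = {}                # col -> label of the root's child under (col, 0)
    counter = 0

    # Step 2: depth-first labeling. n_x is the prefix order and s_x the number
    # of descendants. A slot is reserved on entry so each link stays sorted.
    def label(node, arc, depth):
        nonlocal counter
        n_x = counter
        counter += 1
        if arc is not None:
            slot = len(links[arc])
            links[arc].append(None)
        leaf_objs.extend((n_x, obj) for obj in node["objs"])
        for child_arc, child in node["kids"].items():
            label(child, child_arc, depth + 1)
        s_x = counter - 1 - n_x
        if arc is not None:
            links[arc][slot] = (n_x, s_x)
            if depth == 1:
                starts[arc[0]] = (n_x, s_x)

    label(root, None, 0)
    return dict(links), leaf_objs, starts


def descendants_in(link, n_x, s_x):
    """Entries y of a sorted link with n_x < n_y <= n_x + s_x (Theorem 2.1)."""
    lo = bisect_left(link, (n_x + 1,))
    hi = bisect_left(link, (n_x + s_x + 1,))
    return link[lo:hi]


def search_subspace(index, query, cols, eps):
    """Near-neighbors of query on an arbitrary increasing list of columns."""
    links, leaf_objs, starts = index
    result = set()

    def search(n_x, s_x, k):
        if k == len(cols):  # every column matched: report the stored ids
            lo = bisect_left(leaf_objs, (n_x,))
            hi = bisect_left(leaf_objs, (n_x + s_x + 1,))
            result.update(obj for _, obj in leaf_objs[lo:hi])
            return
        target = query[cols[k]] - query[cols[0]]
        for dist in range(target - eps, target + eps + 1):
            link = links.get((cols[k], dist), [])
            for n_r, s_r in descendants_in(link, n_x, s_x):
                search(n_r, s_r, k + 1)

    if cols[0] in starts:
        search(*starts[cols[0]], 1)
    return result
```

For the two objects of Example 1, calling search_subspace(build_p_index(db), query, [0, 2, 4], eps) matches on the non-adjacent columns c_1, c_3, c_5 without visiting any c_2 or c_4 node, which is the 'jump' that the pattern-distance links are meant to provide.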


    Input:  D: objects in multi-dimensional space A
    Output: P-Index of D

    for each u ∈ D do
        insert f(u, i), 1 ≤ i < |A|, into a trie;   (Eq. 2.1)
    for each node x encountered in a depth-first traversal of the trie do
        label node x by ⟨n_x, s_x⟩;
        let (c, d) be the arc that points to x;
        append ⟨n_x, s_x⟩ to pattern-distance link (c, d);

Algorithm 2: Index Construction

The space taken by the P-Index is linearly proportional to the data size. Since each node appears once and only once in the pattern links, the total number of entries in Part I equals the total number of nodes in the trie, or O(|D||A|^2) in the worst case (if none of the nodes are shared by any subsequences). On the other hand, there are exactly |D|(|A| − 1) object ids stored in Part II. Thus, the space is linearly proportional to the data size |D|.

2.4 Near-Neighbor Search in a Given Subspace

In this section, we find near-neighbors in a given subspace using the P-Index. For instance, assume we have a query object q:

    q = (a, 3), (c, 7), (e, 2)

Using the first column of q as the base column, we get (if the columns are numerical, we get (c − a, 4), (e − a, −1)):

    (a, 0), (c, 4), (e, −1)

We start with the pattern link of (a, 0), which contains only one node. Let us assume its label is ⟨20, 180⟩, meaning sequences starting with column a are indexed by nodes from 20 to 200. Next, we consult pattern-distance link (c, 4), which contains all the c nodes that are 4 units away from their base column (root node). However, we are only interested in those nodes that are descendants of (a, 0). According to the property of pattern-distance links, those descendants are contiguous in the pattern-distance link and their prefix-order numbers are inside range [20, 200]. Since the nodes in the link are organized in ascending order of their prefix-order numbers, the search is carried out as a range query in logarithmic time. Suppose we find three nodes, u = ⟨42, 9⟩, v = ⟨88, 11⟩, and w = ⟨102, 18⟩, in that range. Then, we consult the next pattern-distance link (e, −1) and repeat the process for each of the three nodes. Assume node x is a descendant of node u, node y a descendant of node v, and no nodes in the pattern link of (e, −1) are descendants of node w. We now have matched all the columns in S, and the object lists of nodes x, y and their descendants contain the objects that match the query.

Algorithm 3 outlines the search for near-neighbors in a given subspace (defined by an arbitrary set of columns).

Here, we have demonstrated the purpose of having the pattern-distance links. They enable us to 'jump' directly to the next relevant column in the given subspace, while in a traditional suffix trie we can only follow the tree branches. As a result, the tree structure is not needed in the search, since the pattern-distance links already contain the complete information for pattern matching.

    Input:  q: a query object, S: a given subspace
            ε: pattern threshold
    Output: q's near-neighbors in subspace S

    let (c_1, v_1), ···, (c_|S|, v_|S|) be q's projection on S;
    x ← the node under arc (c_1, 0);
    search(x, 2);

    Function search(x, i)
        if i ≤ |S| then
            for pattern link I of (c_i, v), where v ∈ [v_i − v_1 − ε, v_i − v_1 + ε] do
                /* perform a binary search on I */
                for all nodes r ∈ I with n_r ∈ [n_x, n_x + s_x] do
                    search(r, i + 1);
                end
            end
        else
            output the objects in the object lists of x and its descendants
        end

Algorithm 3: Pattern Matching

3 Experiments

We tested P-Index with both synthetic and real-life data sets on a Linux machine with a 700 MHz CPU and 256 MB main memory.

Gene Expression Data. Gene expression data are being generated by DNA chips and other micro-array techniques. The data set is presented as a matrix. Each row corresponds to a gene and each column represents a condition under which the gene is developed. Each entry represents the relative abundance of the mRNA of a gene under a specific condition. The yeast micro-array is a 2,884 × 17 matrix (2,884 genes under 17 conditions) [5]. The mouse cDNA array is a 10,934 × 49 matrix (10,934 genes under 49 conditions) [2] and it is pre-processed in the same way.

Synthetic Data. We generate random integers from a uniform distribution in the range of 1 to ξ. Let |D| be the number of objects in the dataset and |A| the number of dimensions. The total data size is 4|D||A| bytes.

[Figure 2 (plots not reproduced): (a) |A| = 20, ξ = 5, ..., 80: index size (megabytes) versus dataset size (megabytes), one curve per ξ ∈ {5, 10, 20, 40, 60, 80}. (b) Pattern matching: time (sec.) versus dataset size (megabytes) for the R-Tree index, linear scan, and PD-Index. (c) Find near-neighbors in DNA micro-array.]

Figure 2: Performance.

3.1 Space Analysis

The space requirement of the pattern-distance index is linearly proportional to the data size (Figure 2). In Figure 2(a), we fix the dimensionality of the data at 20 and change ξ, the discretization granularity, from 5 to 80. It shows that ξ has little impact on the index size when the data size is small. When the data size increases, the growth of the trie slows down as each trie node is shared by more objects (this is more obvious for smaller ξ in Figure 2(a)).

3.2 Time Analysis

We compare the algorithms presented in this paper with two alternative approaches: i) brute-force linear scan, and ii) R-Tree family indices. The linear scan approach for near-neighbor search is straightforward to implement. The R-Tree, however, indexes values, not patterns. To support queries based on pattern similarity, we create an extra dimension c_ij = c_i − c_j for every two dimensions c_i and c_j.

The query time presented in Figure 2(b) indicates that P-Index scales much better than the two alternative approaches for pattern matching in given subspaces. The comparisons are carried out on synthetic datasets of dimensionality |A| = 40 and discretization level ξ = 20. Each time, a subspace is designated by randomly selecting 4 dimensions, and random query objects are generated in the subspace.

To further analyze the impact of different query forms on the performance, we base our comparisons on the number of disk accesses. First, we ask random queries against yeast and mouse DNA micro-array data in subspaces of dimensionality ranging from 2 to 5. The selected dimensions are evenly separated. For instance, we select dimension set {c_1, c_13, c_25, c_37, c_49} in a mouse cDNA array that has a total of 49 conditions. Figure 2(c) shows the average number of node accesses and disk accesses. Since P-Index offers increased selectivity for longer queries, it is robust as the dimensionality of the given subspace becomes larger.

4 Conclusion

We identify the need for finding near-neighbors under subspace pattern similarity, a new type of similarity not captured by Euclidean, Manhattan, etc., but essential to a wide range of applications, including DNA microarray analysis. Two objects are similar if they manifest a coherent pattern of rise and fall in an arbitrary subspace, or over a certain time period with time shifting. We propose P-Index, which maps objects to sequences and indexes them using a tree structure. Experimental results show that P-Index achieves orders of magnitude speedup over alternative algorithms based on naive indexing and linear scan.

References

[1] Y. Cheng and G. Church. Biclustering of expression data. In Proc. of 8th International Conference on Intelligent System for Molecular Biology, 2000.
[2] R. Miki et al. Delineating developmental and metabolic pathways in vivo by expression profiling using the RIKEN set of 18,816 full-length enriched mouse cDNA arrays. In Proceedings of National Academy of Sciences, 98, pages 2199–2204, 2001.
[3] Piotr Indyk. On approximate nearest neighbors in non-euclidean spaces. In IEEE Symposium on Foundations of Computer Science, pages 148–155, 1998.
[4] E. M. McCreight. A space-economical suffix tree construction algorithm. Journal of the ACM, 23(2):262–272, April 1976.
[5] S. Tavazoie, J. Hughes, M. Campbell, R. Cho, and G. Church. Yeast micro data set. In http://arep.med.harvard.edu/biclustering/yeast.matrix, 2000.
[6] E. Ukkonen. Constructing suffix-trees on-line in linear time. Algorithms, Software, Architecture: Information Processing, pages 484–92, 1992.
[7] Haixun Wang, Fang Chu, Wei Fan, Philip S. Yu, and Jian Pei. A fast algorithm for subspace clustering by pattern similarity. In 16th International Conference on Scientific and Statistical Database Management (SSDBM), 2004.
[8] Haixun Wang, Chang-Shing Perng, Wei Fan, Sanghyun Park, and Philip S. Yu. Indexing weighted sequences in large databases. In ICDE, 2003.
[9] Haixun Wang, Wei Wang, Jiong Yang, and Philip S. Yu. Clustering by pattern similarity in large data sets. In SIGMOD, 2002.

