Near-Neighbor Search in Pattern Distance Spaces
Abstract
In this paper, we study the near-neighbor problem based on pattern similarity, a new type of similarity which conventional distance metrics such as the Lp norm cannot model effectively. The problem, however, is important to many applications. For example, in DNA microarray analysis, the expression levels of two closely related genes may rise and fall under different external conditions or at different times. Although the magnitude of their expression levels may not be close, the patterns they exhibit over time or under different conditions can be very similar. In this paper, we measure the distance between two objects by pattern similarity, i.e., whether the two objects exhibit a synchronous pattern of rise and fall under different conditions. We then present an efficient algorithm for near-neighbor search based on pattern similarity, and we perform tests on several real and synthetic data sets to show its effectiveness.

1 Introduction
The efficiency of near neighbor search to a large extent depends on the distance function in use [3]. More importantly, the distance function also determines the meaning of similarity and the meaning of the near-neighbor search. In this paper, we address a new type of similarity for near-neighbor search.

DNA microarray analysis Finding near neighbors based on subspace pattern similarity is important to many applications [1, 9, 7, 8]. Table 1 shows a small portion of the Yeast expression data, where entry dij represents the expression level of gene i under condition j (or at time j). Investigations show that more often than not, several genes contribute to a disease, which motivates researchers to identify genes whose expression levels rise and fall synchronously under different conditions or over a period of time, that is, whether they exhibit fluctuation of a similar shape when conditions change.

          t1    t2    t3    t4    t5    t6   ···
VPS8     401   281   120   275   298   210
SSA1     401   292   109   580   238   289
SP07     228   290   285   148   224   231
EFB1     318   280    37   277   215    99
MDM10     10    10   266   328   101   186
CYS3     322   288    41   278   219   231
DEP1     317   272   334   232   192   110
NTG1     329   296    33   274   228   129
...

Table 1: Expression data of Yeast genes

As shown in Table 1, the expression levels of three genes, VPS8, CYS3, and EFB1, rise and fall coherently under three different external conditions t1, t3 and t5. We can also measure the expression levels of genes at fixed time intervals. In this case, assume t1, t2, ···, t5 in Table 1 represent arbitrarily spaced points on the time axis. We find the expression levels of three genes, SP07, MDM10, and DEP1, manifest a coherent pattern with fixed time shift.

Given a new gene, biologists are interested in finding every gene whose expression levels under a certain set of conditions rise and fall coherently with those of the new gene, as such discovery may reveal connections in gene regulatory networks [1]. Clearly, this pattern similarity cannot be captured by distance functions such as Euclidean even if they are applied in the related subspaces.

In this paper, we extend the concept of near-neighbor to the above situation. We say genes VPS8, CYS3, and EFB1 are near-neighbors in the subspace defined by conditions {t1, t3, t5}, and the time series expression levels of genes SP07, MDM10, and DEP1 are near-neighbors from time t1, t2 and t3.

An even more interesting and challenging type of near-neighbor query is the following. We are given the expression levels of a new gene. This new gene might be related to any gene in the database as long as both of them exhibit a pattern in some subspace or at some time offset. The dimensionality of the subspace, or the length of the time period, is often an indicator of the degree of their closeness, that is, the more columns the pattern spans, the closer the relation between the two genes.

In this paper we focus on pattern based similarity described above. Traditional distance functions, such as the Euclidean norm, cannot measure pattern similarity. We pro-
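To make the contrast between pattern similarity and Euclidean distance concrete, the following Python sketch uses the Table 1 rows of VPS8, CYS3, EFB1, and SSA1 in the subspace {t1, t3, t5}. The function `aligned_gap` is a hypothetical measure introduced here for illustration only (it aligns both objects on the first selected column and reports the largest remaining deviation); it is not claimed to be the exact pattern distance defined in this paper.

```python
import math

# Rows copied from Table 1 (columns t1..t6); only the genes used below.
TABLE1 = {
    "VPS8": [401, 281, 120, 275, 298, 210],
    "SSA1": [401, 292, 109, 580, 238, 289],
    "EFB1": [318, 280, 37, 277, 215, 99],
    "CYS3": [322, 288, 41, 278, 219, 231],
}

SUBSPACE = [0, 2, 4]  # zero-based positions of t1, t3, t5


def project(row, cols):
    """Keep only the selected columns of one expression row."""
    return [row[c] for c in cols]


def euclidean(u, v, cols):
    """Ordinary Euclidean distance restricted to the subspace."""
    pu, pv = project(u, cols), project(v, cols)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(pu, pv)))


def aligned_gap(u, v, cols):
    """Hypothetical pattern measure (an assumption, not the paper's definition):
    subtract each object's value in the first selected column, then take the
    largest absolute difference between the two aligned sequences."""
    pu, pv = project(u, cols), project(v, cols)
    return max(abs((a - pu[0]) - (b - pv[0])) for a, b in zip(pu, pv))


if __name__ == "__main__":
    vps8 = TABLE1["VPS8"]
    for name in ("CYS3", "EFB1", "SSA1"):
        other = TABLE1[name]
        print(name,
              "euclidean:", round(euclidean(vps8, other, SUBSPACE), 2),
              "aligned gap:", aligned_gap(vps8, other, SUBSPACE))
```

Under these assumptions, CYS3 and EFB1 have an aligned gap of 0 to VPS8 although their Euclidean distance to VPS8 is large, whereas SSA1 is closer to VPS8 in Euclidean terms but does not follow the same pattern.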
A suffix of u starting with column c_i is denoted by:

    (c_i, u_i), (c_{i+1}, u_{i+1}), ..., (c_n, u_n)

where 1 ≤ i ≤ n. Using the first column in each suffix as its base column, we derive a base-column aligned suffix by subtracting the value of the base (first column) from each column value in the suffix. We use f(u, i) to denote u's base-column aligned suffix that begins with the ith column:

    (2.1)  f(u, i) = (c_i, 0), (c_{i+1}, u_{i+1} − u_i), ..., (c_n, u_n − u_i)

    f(#1, 1) = (c1, 0), (c2, −3), (c3, 1), (c4, −1), (c5, −3)

Each leaf node n in the trie maintains an object list, L_n. If the insertion of f(#1, 1) leads to node x, which is under arc (e, −3), we append 1 (object #1), to object list L_x.

2.3 Building P-Index over a Trie The trie enables us to find near-neighbors of a query object q = (c1, v1), ..., (cn, vn) in a given subspace S, provided S is defined by a set of continuous columns, i.e., S =

edge (ck, ·), where k > j, requires us to traverse the subtree

the numerical values are discretized into.
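The base-column aligned suffix f(u, i) of Equation (2.1) and the insertion of such sequences into a trie with per-node object lists can be sketched in Python as follows. This is a minimal illustration under stated assumptions: columns are identified by their 1-based index, values are used as-is (no discretization step), every suffix of every object is inserted, and the names `aligned_suffix`, `TrieNode`, `insert`, and `build` are hypothetical. The values of `u1` are made up so that f(u1, 1) reproduces the example sequence f(#1, 1) above; they are not taken from the paper.

```python
def aligned_suffix(u, i):
    """f(u, i) from Eq. (2.1): the suffix starting at the i-th column (1-based),
    with the base value u_i subtracted from every value in the suffix.
    Returns a list of (column index, offset) pairs."""
    if not 1 <= i <= len(u):
        raise ValueError("i must satisfy 1 <= i <= n")
    base = u[i - 1]
    return [(k, u[k - 1] - base) for k in range(i, len(u) + 1)]


class TrieNode:
    def __init__(self):
        self.children = {}  # arc label (column, offset) -> child TrieNode
        self.objects = []   # object list of the node where a sequence ends


def insert(root, seq, obj_id):
    """Follow (or create) one arc per element of seq, then record obj_id
    in the object list of the final node. Returns that node."""
    node = root
    for arc in seq:
        node = node.children.setdefault(arc, TrieNode())
    node.objects.append(obj_id)
    return node


def build(objects):
    """Insert all base-column aligned suffixes of every object (an assumption
    of this sketch). `objects` maps object id -> list of column values."""
    root = TrieNode()
    for obj_id, u in objects.items():
        for i in range(1, len(u) + 1):
            insert(root, aligned_suffix(u, i), obj_id)
    return root


# Hypothetical object #1, chosen so that f(u1, 1) matches the example above.
u1 = [3, 0, 4, 2, 0]

if __name__ == "__main__":
    print(aligned_suffix(u1, 1))
    root = build({1: u1})
    node = root
    for arc in aligned_suffix(u1, 1):
        node = node.children[arc]
    print(node.objects)
```

Walking the arcs of f(u1, 1) from the root ends at a node reached through the arc labeled (5, −3), whose object list contains object 1, mirroring the description of list L_x.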
Figure 2: Performance. (a) |A| = 20, ξ = 5, ..., 80 (x-axis: dataset size in megabytes; curves for ξ = 5, 10, 20, 40). (b) Pattern matching (time in seconds vs. dataset size in megabytes; R-Tree index, linear scan, PD-Index). (c) Find near-neighbors in DNA micro-array.
the number of objects in the dataset and |A| the number of dimensions. The total data size is 4|D||A| bytes.

3.1 Space Analysis The space requirement of the pattern-distance index is linearly proportional to the data size (Figure 2). In Figure 2(a), we fix the dimensionality of the data at 20 and change ξ, the discretization granularity, from 5 to 80. It shows that ξ has little impact on the index size when the data size is small. When the data size increases, the growth of the trie slows down as each trie node is shared by more objects (this is more obvious for smaller ξ in Figure 2(a)).

3.2 Time Analysis We compare the algorithms presented in this paper with two alternative approaches, i) brute force linear scan, and ii) R-Tree family indices. The linear scan approach for near-neighbor search is straightforward to implement. The R-Tree, however, indexes values, not patterns. To support queries based on pattern similarity, we create an extra dimension cij = ci − cj for every two dimensions ci and cj.

The query time presented in Figure 2(b) indicates that P-Index scales much better than the two alternative approaches for pattern matching in given subspaces. The comparisons are carried out on synthetic datasets of dimensionality |A| = 40 and discretization level ξ = 20. Each time, a subspace is designated by randomly selecting 4 dimensions, and random query objects are generated in the subspace.

To further analyze the impact of different query forms on the performance, we base our comparisons on the number of disk accesses. First, we ask random queries against yeast and mouse DNA micro-array data in subspaces of dimensionality ranging from 2 to 5. The selected dimensions are evenly separated. For instance, we select dimension set {c1, c13, c25, c37, c49} in a mouse cDNA array that has a total of 49 conditions. Figure 2(c) shows the average number of node accesses and disk accesses. Since P-Index offers increased selectivity for longer queries, it is robust as the dimensionality of the given subspace becomes larger.

4 Conclusion
We identify the need for finding near-neighbors under subspace pattern similarity, a new type of similarity not captured by Euclidean, Manhattan, etc., but essential to a wide range of applications, including DNA microarray analysis. Two objects are similar if they manifest a coherent pattern of rise and fall in an arbitrary subspace, or over a certain time period with time shifting. We propose P-Index, which maps objects to sequences and indexes them using a tree structure. Experimental results show that P-Index achieves orders of magnitude speedup over alternative algorithms based on naive indexing and linear scan.

References

[1] Y. Cheng and G. Church. Biclustering of expression data. In Proc. of 8th International Conference on Intelligent System for Molecular Biology, 2000.
[2] R. Miki et al. Delineating developmental and metabolic pathways in vivo by expression profiling using the RIKEN set of 18,816 full-length enriched mouse cDNA arrays. In Proceedings of National Academy of Sciences, 98, pages 2199–2204, 2001.
[3] Piotr Indyk. On approximate nearest neighbors in non-Euclidean spaces. In IEEE Symposium on Foundations of Computer Science, pages 148–155, 1998.
[4] E. M. McCreight. A space-economical suffix tree construction algorithm. Journal of the ACM, 23(2):262–272, April 1976.
[5] S. Tavazoie, J. Hughes, M. Campbell, R. Cho, and G. Church. Yeast micro data set. In http://arep.med.harvard.edu/biclustering/yeast.matrix, 2000.
[6] E. Ukkonen. Constructing suffix-trees on-line in linear time. Algorithms, Software, Architecture: Information Processing, pages 484–92, 1992.
[7] Haixun Wang, Fang Chu, Wei Fan, Philip S. Yu, and Jian Pei. A fast algorithm for subspace clustering by pattern similarity. In 16th International Conference on Scientific and Statistical Database Management (SSDBM), 2004.
[8] Haixun Wang, Chang-Shing Perng, Wei Fan, Sanghyun Park, and Philip S. Yu. Indexing weighted sequences in large databases. In ICDE, 2003.
[9] Haixun Wang, Wei Wang, Jiong Yang, and Philip S. Yu. Clustering by pattern similarity in large data sets. In SIGMOD, 2002.