
Fast Subsequence Matching in Time-Series Databases

Christos Faloutsos *
M. Ranganathan †
Yannis Manolopoulos ‡

Department of Computer Science
and Institute for Systems Research (ISR)
University of Maryland at College Park
email: christos,ranga,[email protected]

* This research was partially funded by the Institute for Systems Research (ISR), by the National Science Foundation under Grants IRI-9205273 and IRI-8958546 (PYI), with matching funds from EMPRESS Software Inc. and Thinking Machines Inc.
† Also with IBM Federal Systems Company, Gaithersburg, MD 20879.
‡ On sabbatical leave from the Department of Informatics, Aristotle University, Thessaloniki, Greece 54006.

Abstract

We present an efficient indexing method to locate 1-dimensional subsequences within a collection of sequences, such that the subsequences match a given (query) pattern within a specified tolerance. The idea is to map each data sequence into a small set of multidimensional rectangles in feature space. Then, these rectangles can be readily indexed using traditional spatial access methods, like the R*-tree [9]. In more detail, we use a sliding window over the data sequence and extract its features; the result is a trail in feature space. We propose an efficient and effective algorithm to divide such trails into sub-trails, which are subsequently represented by their Minimum Bounding Rectangles (MBRs). We also examine queries of varying lengths, and we show how to handle each case efficiently. We implemented our method and carried out experiments on synthetic and real data (stock price movements). We compared the method to sequential scanning, which is the only obvious competitor. The results were excellent: our method accelerated the search time from 3 times up to 100 times.

1 Introduction

The problem we focus on is the design of fast searching methods that will search a database with time-series of real numbers, to locate subsequences that match a query subsequence, exactly or approximately. Historical, temporal [29] and spatio-temporal [5] databases will benefit from such a facility. Specific applications include the following:

- financial, marketing and production time series, such as stock prices, sales numbers etc. In such databases, typical queries would be 'find companies whose stock prices move similarly', or 'find other companies that have similar sales patterns with our company', or 'find cases in the past that resemble last month's sales pattern of our product'.

- scientific databases, with time series of sensor data. For example, in weather data [11], geological, environmental, astrophysics [30] databases, etc., we want to ask queries of the form, e.g., 'find past days in which the solar magnetic wind showed patterns similar to today's pattern', to help in predictions of the earth's magnetic field [30].

Searching for similar patterns in such databases is essential, because it helps in predictions, hypothesis testing and, in general, in 'data mining' [1, 3, 4] and rule discovery.

For the rest of the paper, we shall use the following notational conventions: if S and Q are two sequences, then:

- Len(S) denotes the length of S;
- S[i : j] denotes the subsequence that includes entries in positions i through j;
- S[i] denotes the i-th entry of sequence S;
- D(S, Q) denotes the distance of the two (equal length) sequences S and Q.

Similarity queries can be classified into two categories:

- Whole Matching. Given a collection of N data sequences of real numbers S1, S2, ..., SN and a query sequence Q, we want to find those data sequences that are within distance ε from Q. Notice that data and query sequences must have the same length.

- Subsequence Matching. Given N data sequences S1, S2, ..., SN of arbitrary lengths, a query sequence Q and a tolerance ε, we want to identify the data sequences Si (1 ≤ i ≤ N) that contain matching subsequences (i.e. subsequences with distance ≤ ε from Q). Report those data sequences, along with the correct offsets within the data sequences that best match the query sequence. (We assume that we are given a function D(S, Q), which gives the distance of the sequences S and Q. For example, D() can be the Euclidean distance.)

The case of 'whole match' queries can be handled as follows [2]: A distance-preserving transform, such as the Discrete Fourier Transform (DFT), can be used to extract f features from sequences (e.g., the first f DFT coefficients), thus mapping them into points in the f-dimensional feature space. Subsequently, any spatial access method (such as R*-trees) can be used to search for range/approximate queries. This approach exploits the assumption that data sequences and query sequences all have the same length. Here, we generalize the problem and present a method to answer approximate-match queries for subsequences of arbitrary length Len(Q). The ideal method should fulfill the following requirements:

- it should be fast. Sequential scanning and distance calculation at each and every possible offset will be too slow for large databases.

- it should be 'correct'. In other words, it should return all the qualifying subsequences, without missing any (i.e., no 'false dismissals'). Notice that 'false alarms' are acceptable, since they can be discarded easily through a post-processing step.

- the proposed method should require a small space overhead.

- the method should be dynamic. It should be easy to insert and delete sequences, as well as to append new measurements at the end of a given data sequence.

- the method should handle data sequences of varying length, as well as queries of varying length.

The remainder of the paper is organized as follows. Section 2 gives some background material on past related work, on spatial access methods and on the Discrete Fourier Transform. Section 3 focuses on subsequence matching; we propose a new indexing mechanism and we show how off-the-shelf spatial access methods (and specifically the R*-tree) can be used. Section 4 discusses performance results obtained from experiments on real and synthetic data, which show the effectiveness of our method. Section 5 summarizes the contributions of the present paper, giving some extensions of this technique and outlining some open problems.

2 Background

To the best of the authors' knowledge, this is the first work that examines indexing methods for approximate subsequence matching in time-series databases. The following work is related, in different respects:

- indexing in text [13] and DNA databases [6]. Text and DNA strings can be viewed as 1-dimensional sequences; however, they consist of discrete symbols as opposed to continuous numbers, which makes a difference when we do the feature extraction.

- 'whole matching' approximate queries on time-sequences [2] or on color images [14, 21] or even on 3-d MRI brain scans [8]. In all these methods, the idea is to use f feature extraction functions to map a whole sequence or image into a point in the (f-dimensional) feature space [18]; then, spatial access methods may be used to search for similar sequences or images. The resulting index that contains points in feature space is called the F-index [2].

The F-index works as follows: Given N sequences, all of the same length n, we apply the n-point Discrete Fourier Transform (DFT) and we keep the first few coefficients. Let's assume that we keep f numbers; thus, each sequence is mapped into a point in an f-dimensional space. These points are organized in an R*-tree, for faster searching. In the typical query, the user specifies a query sequence Q (of length n again) and a tolerance ε, requesting all the data sequences that are within distance ε from Q. To resolve this query, (a) we apply the n-point DFT on the sequence Q and we keep the f features, thus mapping Q into an f-dimensional point qf in feature space; (b) we use the F-index to retrieve all the points within distance ε from qf; (c) we discard the false alarms (see more explanations in Lemma 1), and we return the rest to the user.
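To make the F-index recipe above concrete, the following sketch (ours, not from the paper; illustrative Python/NumPy, with a plain list standing in for the R*-tree) maps each length-n sequence to its first f DFT coefficients and answers a whole-match range query, post-filtering the false alarms with the true Euclidean distance:

    import numpy as np

    def features(x, f=3):
        # First f coefficients of the unitary n-point DFT (1/sqrt(n) scaling),
        # flattened into 2*f real features (real and imaginary parts).
        coeffs = np.fft.fft(np.asarray(x, dtype=float), norm="ortho")[:f]
        return np.concatenate([coeffs.real, coeffs.imag])

    def whole_match(data, q, eps, f=3):
        # Whole matching: indices of the sequences within distance eps of q.
        q = np.asarray(q, dtype=float)
        qf = features(q, f)
        index = [(i, features(s, f)) for i, s in enumerate(data)]   # the "F-index"
        # (b) range query in feature space: a superset of the answers (Lemma 1)
        candidates = [i for i, sf in index if np.linalg.norm(sf - qf) <= eps]
        # (c) discard the false alarms using the actual Euclidean distance
        return [i for i in candidates
                if np.linalg.norm(np.asarray(data[i], dtype=float) - q) <= eps]

Replacing the list scan with a true spatial access method changes only the retrieval step; the filtering logic stays the same.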
Here, we generalize the 'F-index' method, which was designed to handle 'whole matching' queries. Our goal is to handle subsequence queries, by mapping data sequences into a few rectangles in feature space. Since we rely on spatial access methods as the eventual indexing mechanism, we mention that several multidimensional indexing methods have been proposed, forming three classes: R*-trees [9] and the rest of the R-tree family [15, 17, 28]; linear quadtrees [26, 24]; and grid-files [22].

To guarantee that the 'F-index' method above does not result in any false dismissals, the distance in feature space should match or underestimate the distance between two objects.
Mathematically, let O1 and O2 be two objects (e.g., same-length sequences) with distance function Dobject() (e.g., the Euclidean distance) and F(O1), F(O2) be their feature vectors (e.g., their first few Fourier coefficients), with distance function Dfeature() (e.g., the Euclidean distance, again). Then we have:

Lemma 1 To guarantee no false dismissals for range queries, the feature extraction function F() should satisfy the following formula:

    Dfeature(F(O1), F(O2)) ≤ Dobject(O1, O2)    (1)

Proof: Let Q be the query object, O be a qualifying object, and ε be the tolerance. We want to prove that if the object O qualifies for the query, then it will be retrieved when we issue a range query on the feature space. That is, we want to prove that

    Dobject(Q, O) ≤ ε  ⇒  Dfeature(F(Q), F(O)) ≤ ε    (2)

However, this is obvious, since

    Dfeature(F(Q), F(O)) ≤ Dobject(Q, O) ≤ ε    (3)

□

Following [2], we use the Euclidean distance as the distance function between two sequences, that is, the sum of squared differences. Formally, for two sequences S and Q of the same length l, we define their distance D(S, Q) as

    D(S, Q) ≡ ( sum_{i=1}^{l} (S[i] - Q[i])^2 )^{1/2}    (4)

As an example of feature extraction function F() we choose the Discrete Fourier Transform (DFT), for two reasons: (a) it has been used successfully for 'whole matching' [2] and (b) it provides a good, intuitive example to make the presentation more clear. It should be noted that our method is independent of the specific feature extraction function F(), as long as F() satisfies the condition of Lemma 1 (Eq. 1). If the distance among objects (data sequences) is the Euclidean distance, the condition of Lemma 1 is satisfied by any orthonormal transform, such as the Discrete Cosine Transform (DCT) [31], the wavelet transform [25] etc. Next, we give the definition and some properties of the DFT transformation.

The n-point Discrete Fourier Transform [16, 23] of a signal x = [x_i], i = 0, ..., n-1, is defined to be a sequence X of n complex numbers X_F, F = 0, ..., n-1, given by

    X_F = 1/sqrt(n) * sum_{i=0}^{n-1} x_i exp(-j 2π F i / n),    F = 0, 1, ..., n-1    (5)

where j is the imaginary unit, j = sqrt(-1). The signal x can be recovered by the inverse transform:

    x_i = 1/sqrt(n) * sum_{F=0}^{n-1} X_F exp(j 2π F i / n),    i = 0, 1, ..., n-1    (6)

X_F is a complex number (with the exception of X_0, which is real, if the signal x is real). The energy E(x) of a sequence x is defined as the sum of energies (squares of the amplitude |x_i|) at every point of the sequence:

    E(x) ≡ ||x||^2 ≡ sum_{i=0}^{n-1} |x_i|^2    (7)

A fundamental observation for the correctness of our method is Parseval's theorem [23], which states that the DFT preserves the energy of a signal:

Theorem (Parseval). Let X be the Discrete Fourier Transform of the sequence x. Then we have:

    sum_{i=0}^{n-1} |x_i|^2 = sum_{F=0}^{n-1} |X_F|^2    (8)

Since the DFT is a linear transformation [23], Parseval's theorem implies that the DFT also preserves the Euclidean distance between two signals x and y:

    D(x, y) = D(X, Y)    (9)

where X and Y are the Fourier transforms of x and y respectively.

We keep the first few (2-3) coefficients of the DFT as the features, following the recommendation in [2]. This 'truncation' results in under-estimating the distance of two sequences (because we ignore positive terms from equation 4) and thus it introduces no false dismissals, according to Lemma 1.

The truncation will introduce only false alarms, which, for practical sequences, we expect to be few. The reason is that most real sequences fall in the class of 'colored noise', which has a skewed energy spectrum of the form O(F^(-b)). This implies that the first few coefficients contain most of the energy. Thus, the first few coefficients give good estimates of the actual distance of the two sequences. For b = 1, we have the pink noise, which, according to Birkhoff's theory [27], models signals like musical scores and other works of art. For b = 2, we have the brown noise (also known as random walk or brownian walk), which models stock movements and exchange rates (e.g., [10, 20]). For b > 2 we have the black noise, whose spectrum is even more skewed than the spectrum of brown noise; black noise successfully models signals like the water level of a river as it varies over time [20].
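The effect of Parseval's theorem and of the truncation can be checked numerically; the short sketch below is ours (illustrative Python/NumPy, not part of the paper) and compares the true Euclidean distance of two brown-noise-like sequences with the distance of their truncated DFT features:

    import numpy as np

    def dft_features(x, f=3):
        # Unitary DFT (norm="ortho"), so Parseval's theorem holds exactly;
        # keeping f complex coefficients gives 2*f real features.
        c = np.fft.fft(x, norm="ortho")[:f]
        return np.concatenate([c.real, c.imag])

    rng = np.random.default_rng(0)
    n, f = 512, 3
    x = np.cumsum(rng.normal(size=n))   # random-walk ("brown noise") sequences
    y = np.cumsum(rng.normal(size=n))

    d_true = np.linalg.norm(x - y)                                    # equation (4)
    d_feat = np.linalg.norm(dft_features(x, f) - dft_features(y, f))

    # Truncation only drops non-negative terms, so the feature distance
    # can never exceed the true distance (the condition of Lemma 1).
    assert d_feat <= d_true + 1e-9
    print(d_feat, "<=", d_true)

With f = 3 complex coefficients the feature space has 6 real dimensions, which is the setting used in the experiments of Section 4.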
This concludes our discussion on prior work, which concentrated on 'whole match' queries. Next, we describe in detail how we handle the requests for matching subsequences.

3 Proposed Method

Here, we examine the problem of subsequence matching. Specifically, the problem is defined as follows:

- We are given a collection of N sequences of real numbers S1, S2, ..., SN, each one of potentially different length.

- The user specifies a query subsequence Q of length Len(Q) (which may vary) and the tolerance ε, that is, the maximum acceptable dis-similarity (= distance).

- We want to find quickly all the sequences Si (1 ≤ i ≤ N), along with the correct offsets k, such that the subsequence Si[k : k + Len(Q) - 1] matches the query sequence: D(Q, Si[k : k + Len(Q) - 1]) ≤ ε.

The brute-force solution is to examine sequentially every possible subsequence of the data sequences for a match. We shall refer to this method as the 'SequentialScan' method. Next, we describe a method that uses a small space overhead to achieve orders of magnitude savings over the 'SequentialScan' method. The main symbols used throughout the paper and their definitions are summarized in Table 1.

Table 1: Summary of Symbols and Definitions

    Symbol     Definition
    N          Number of data sequences
    Si         The i-th data sequence (1 ≤ i ≤ N)
    Len(S)     Length of sequence S
    S[k]       The k-th entry of sequence S
    S[i : j]   Subsequence of S, including entries in positions i through j
    Q          A query sequence
    w          Minimum query sequence length
    D(Q, S)    (Euclidean) distance between sequences Q and S of equal length
    ε          Tolerance (max. acceptable distance)
    f          Number of features
    mc         Marginal cost of a point

3.1 Sketch of the approach - 'ST-index'

Without loss of generality, we assume that the minimum query length is w, where w (≥ 1) depends on the application. For example, in stock price databases, analysts are interested in weekly or monthly patterns because shorter patterns are susceptible to noise [12]. Notice that we never lose the ability to answer queries shorter than w, because we can always resort to sequential scanning.

Generalizing the reasoning of the method for 'whole matching', we use a sliding window of size w and place it at every possible position (offset), on every data sequence. For each such placement of the window, we extract the features of the subsequence inside the window. Thus, a data sequence of length Len(S) is mapped to a trail in feature space, consisting of Len(S) - w + 1 points: one point for each possible offset of the sliding window. Figure 1(a) gives an example of trails. Assume that we have two sequences, S1 and S2 (not shown in the figure), and that we keep the first f = 2 features (e.g., the amplitude of the first and second coefficient of the w-point DFT). When the window of length w is placed at offset = 0 on S1, we obtain the first point of the trail C1; as the window slides over S1, we obtain the rest of the points of the trail C1. The trail C2 is derived from S2 in the same manner. Figure 4 gives an example of trails, using a real time-series (stock-price movements).
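As an illustration of this trail construction (our own sketch in Python/NumPy; the paper's system was written in C), the following code slides a window of length w over a sequence S and maps each placement to a feature point, producing the Len(S) - w + 1 points of the trail:

    import numpy as np

    def window_features(window, n_coeffs=2):
        # Features of one window placement: amplitudes of the first few
        # coefficients of the w-point DFT (f = 2 features, as in Figure 1).
        return np.abs(np.fft.fft(window, norm="ortho")[:n_coeffs])

    def trail(S, w, n_coeffs=2):
        # Map a data sequence S to its trail: one feature point per offset.
        S = np.asarray(S, dtype=float)
        return np.array([window_features(S[t:t + w], n_coeffs)
                         for t in range(len(S) - w + 1)])

    # Example: a 1000-point random walk gives a trail of 1000 - 512 + 1 = 489 points.
    rng = np.random.default_rng(1)
    C1 = trail(np.cumsum(rng.normal(size=1000)), w=512)
    print(C1.shape)   # (489, 2)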
One straightforward way to index these trails would be to keep track of the individual points of each trail, storing them in a spatial access method. We call this the 'I-naive' method, where 'I' stands for 'Index' (as opposed to sequential scanning). When presented with a query of length w and tolerance ε, we could extract the features of the query and search the spatial access method for a range query with radius ε; the retrieved points would correspond to promising subsequences; after discarding the false alarms (by retrieving all those subsequences and calculating their actual distance from the query) we would have the desired answer set. Notice that the method will not miss any qualifying subsequence, because it satisfies the condition of Lemma 1.

However, storing the individual points of the trail in an R*-tree is inefficient, both in terms of space as well as search speed. The reason is that almost every point in a data sequence will correspond to a point in the f-dimensional feature space, leading to an index with a 1:f increase in storage requirements. Moreover, the search performance will also suffer because the R-tree will become tall and slow. As we shall see in the section with the experiments, the 'I-naive' method ended up being almost twice as slow as the 'SequentialScan'. Thus, we want to improve the 'I-naive' method, by making the representation of the trails more compact.
The solution we propose exploits the fact that successive points of the trail will probably be similar, because the contents of the sliding window in nearby offsets will be similar. We propose to divide the trail of a given data sequence into sub-trails and represent each of them with its minimum bounding (hyper-)rectangle (MBR). Thus, instead of storing thousands of points of a given trail, we shall store only a few MBRs. More important, at the same time we still guarantee 'no false dismissals': when a query arrives, we shall retrieve all the MBRs that intersect the query region; thus, we shall retrieve all the qualifying sub-trails, plus some false alarms (sub-trails that do not intersect the query region, while their MBR does).

Figure 1(a) gives an illustration of the proposed approach. Two trails are drawn; the first curve, labeled C1 (in the north-west side), has been divided into three sub-trails (and MBRs), whereas the second one, labeled C2 (in the south-east side), has been divided into five sub-trails. Notice that it is possible that MBRs belonging to the same trail may overlap, as C2 illustrates.

[Figure 1: Example of (a) dividing trails into sub-trails and MBRs, and (b) grouping of MBRs in larger ones. Both panels plot feature F2 against feature F1.]

Thus, we propose to map a data sequence into a set of rectangles in feature space. This yields significant improvements with respect to space, as well as with respect to response time, as we shall see in section 4. Each MBR corresponds to a whole sub-trail, that is, points in feature space that correspond to successive positionings of the sliding window on the data sequences. For each such MBR we have to store:

- tstart, tend, which are the offsets of the first and last such positionings;

- a unique identifier for the data sequence (sequence_id); and

- the extent of the MBR in each dimension (F1low, F1high, F2low, F2high, ...).

These MBRs can subsequently be stored in a spatial access method. We have used R*-trees [9], in which case these MBRs are recursively grouped into parent MBRs, grandparent MBRs, etc. Figure 1(b) shows how the eight leaf-level MBRs of Figure 1(a) will be grouped to form two MBRs at the next higher level, assuming a fanout of 4 (i.e. at most 4 items per non-leaf node). Note that the higher-level MBRs may contain leaf-level MBRs from different data sequences. For example, in Figure 1(b) we remark how the left-side MBR1 contains a part of the south-east curve C2. Figure 2 shows the structure of a leaf node and a non-leaf node. Notice that the non-leaf nodes do not need to carry information about sequence ids or offsets (tstart and tend).

[Figure 2: Index node layout for the last two levels. Entries at the level above the leaves store only the MBR extents (F1_min, F1_max, F2_min, F2_max, ...); leaf-level entries additionally store Sequence_id, T_start and T_end.]
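A leaf-level entry therefore carries exactly the fields listed above. The record below is our own minimal sketch (illustrative Python, not the paper's data structure); it also includes the standard rectangle-versus-sphere intersection test that a spatial access method would apply when processing a range query:

    from dataclasses import dataclass
    from typing import Tuple

    @dataclass
    class SubTrailMBR:
        # Leaf-level entry of the 'ST-index': one MBR per sub-trail.
        sequence_id: int          # which data sequence the sub-trail comes from
        t_start: int              # offset of the first window placement
        t_end: int                # offset of the last window placement
        low: Tuple[float, ...]    # (F1_low, F2_low, ...): lower corner of the MBR
        high: Tuple[float, ...]   # (F1_high, F2_high, ...): upper corner of the MBR

        def intersects_sphere(self, center, radius):
            # True if the MBR intersects the query sphere (center, radius):
            # compare the radius with the minimum distance from center to the MBR.
            d2 = 0.0
            for c, lo, hi in zip(center, self.low, self.high):
                if c < lo:
                    d2 += (lo - c) ** 2
                elif c > hi:
                    d2 += (c - hi) ** 2
            return d2 <= radius ** 2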
This completes the discussion of the structure of our proposed index. We shall refer to it as the 'ST-index', for 'Sub-Trail index'. There are two questions that we have to answer, to complete the description of our method:

- Insertions: when a new data sequence is inserted, what is a good way to divide its trail in feature space into sub-trails?

- Queries: how to handle queries, and especially the ones that are longer than w?

These are the topics of the next two subsections, respectively.

3.2 Insertion - Methods to divide trails into sub-trails

As we saw before, each data sequence is mapped into a 'trail' in feature space. Then the question arising is: how should we optimally divide a trail in feature space into sub-trails and eventually MBRs, so that the number of disk accesses is minimized? A first idea would be to pack points into sub-trails according to a pre-determined, fixed number (e.g., 50). However, there is no justifiable way to decide the optimal value of this constant. Another idea would be to use a simple function of the length of the stored sequence for this sub-trail size (e.g., sqrt(Len(S))). However, both heuristics may lead to poor results. Figure 3 illustrates the problem of having a pre-determined sub-trail size. It shows a trail with 9 points, and it assumes that the sub-trail length is 3 (i.e., sqrt(9)). The resulting MBRs (Figure 3(a)) are not as good as the MBRs shown in Figure 3(b). We collectively refer to all the above heuristics as the 'I-fixed' method, because they use an index with some fixed-length sub-trails. Clearly, the 'I-naive' method is a special case of the 'I-fixed' method, when the sub-trail length is set to 1.

[Figure 3: Packing the points P1-P9 of a trail using (a) a fixed heuristic (sub-trail size = 3), and (b) an adaptive heuristic.]

Thus we are looking for a method that will be able to adapt to the distribution of the points of the trail. We propose to group points into sub-trails by means of an 'adaptive' heuristic, which is based on a greedy algorithm. The algorithm uses a cost function, which tries to estimate the number of disk accesses for each of the options. The resulting indexing method will be called 'I-adaptive'. This is the last of the four alternatives we have introduced. Table 2 lists all of them, along with a brief description of each method.

Table 2: Summary of searching methods and descriptions

    Method             Description
    'SequentialScan'   Sequential scan of the whole database.
    'I-naive'          Search using an 'ST-index' with 1 point per sub-trail.
    'I-fixed'          Search using an 'ST-index' with a fixed number of points per sub-trail.
    'I-adaptive'       Search using an 'ST-index' with a variable number of points per sub-trail.

To complete the description of the 'I-adaptive' method, we have to define a cost function and the concept of marginal cost of a point. In [19] we developed a formula which, given the sides L = (L1, L2, ..., Ln) of the n-dimensional MBR of a node in an R-tree, estimates the average number of disk accesses DA(L) that this node will contribute for the average range query:

    DA(L) = prod_{i=1}^{n} (Li + 0.5)    (10)

The formula assumes that the address space has been normalized to the unit hyper-cube [0, 1)^n. We use the expected number of disk accesses DA() as the cost function. The marginal cost (mc) of a point is defined as follows: consider a sub-trail of k points with an MBR of sides L1, ..., Ln. Then the marginal cost of each point in this sub-trail is

    mc = DA(L) / k    (11)
That is, we divide the cost of this MBR equally among the contained points. The algorithm is then as follows:

    /* Algorithm Divide-to-Subtrails */
    Assign the first point of the trail in a (trivial) sub-trail
    FOR each successive point
        IF it increases the marginal cost of the current sub-trail
            THEN start another sub-trail
            ELSE include it in the current sub-trail

3.3 Searching - Queries longer than w

In the previous subsection we discussed how to insert a new data sequence in the 'ST-index', using an 'adaptive' heuristic. Here we examine how to search for subsequences that match the query sequence Q within tolerance ε. If the query is the shortest allowable (Len(Q) = w), the searching algorithm is relatively straightforward:

Algorithm 'Search_Short'

- the query sequence Q is mapped to a point qf in feature space; the query corresponds to a sphere in feature space with center qf and radius ε;

- we retrieve the sub-trails whose MBRs intersect the query region, using our index;

- then, we examine the corresponding subsequences of the data sequences, to discard the false alarms.

Notice that the retrieved MBRs of sub-trails are a superset of the sub-trails we should actually retrieve; if a sub-trail intersects the query region, its MBR will definitely do so (while the reverse is not necessarily true). Thus the method introduces no false dismissals.
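A compact sketch of 'Search_Short' in the same illustrative style follows (ours, not the paper's code; it assumes the SubTrailMBR records and the feature-extraction helper sketched earlier, and scans a flat list of leaf entries in place of an R*-tree):

    import numpy as np

    def search_short(leaf_entries, sequences, Q, eps, w, feature_fn):
        # Search_Short for Len(Q) == w. 'leaf_entries' is a list of SubTrailMBR;
        # 'sequences' maps sequence_id -> raw data sequence.
        Q = np.asarray(Q, dtype=float)
        qf = feature_fn(Q)                       # map Q to the point q_f
        hits = []
        for mbr in leaf_entries:                 # stand-in for an R*-tree range query
            if not mbr.intersects_sphere(qf, eps):
                continue
            S = np.asarray(sequences[mbr.sequence_id], dtype=float)
            for k in range(mbr.t_start, mbr.t_end + 1):      # candidate offsets
                if np.linalg.norm(S[k:k + w] - Q) <= eps:    # discard false alarms
                    hits.append((mbr.sequence_id, k))
        return hits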
Handling longer queries (Len(Q) > w) is more complicated. The reason is that the 'ST-index' 'knows' only about subsequences of length w. A straightforward approach would be to select a subsequence (e.g., the prefix) of Q of length w, and use our 'ST-index' to search for data subsequences that match the prefix of Q within tolerance ε. We call this method 'PrefixSearch'. This will clearly return a superset of the qualifying subsequences: a subsequence T that is within tolerance ε of the query sequence Q (Len(Q) = Len(T)) will have all its (sub)subsequences within tolerance ε from the corresponding subsequence of Q. In general we can prove the following lemma:

Lemma 2 If two sequences S and Q with the same length l agree within tolerance ε, then any pair (S[i : j], Q[i : j]) of corresponding subsequences agree within the same tolerance:

    D(S, Q) ≤ ε  ⇒  D(S[i : j], Q[i : j]) ≤ ε    (1 ≤ i ≤ j ≤ l)    (12)

Proof: Since D() is the Euclidean distance, we have

    D(S, Q) ≤ ε  ⇒  sum_{k=1}^{l} (S[k] - Q[k])^2 ≤ ε^2    (13)

Since

    sum_{k=i}^{j} (S[k] - Q[k])^2 ≤ sum_{k=1}^{l} (S[k] - Q[k])^2    (14)

we have

    D(S[i : j], Q[i : j]) = ( sum_{k=i}^{j} (S[k] - Q[k])^2 )^{1/2} ≤ ε    (15)

which completes the proof. □

Using the 'PrefixSearch' method, the query region in feature space is a sphere of radius ε, and therefore it has volume proportional to ε^f. Next, we show how to reduce the volume of the query region and, subsequently, the number of false alarms. Without loss of generality, we assume that Len(Q) is an integral multiple of w; if this is not the case, we use Lemma 2 and keep the longest prefix that is a multiple of w:

    Len(Q) = p · w    (16)

Then, we propose to split the longer query into p pieces of length w each, process each sub-query and merge the results of the sub-queries. This approach takes full advantage of our 'ST-index'. Moreover, as we show, the tolerance specified for each sub-query can be reduced to ε/sqrt(p). The final result is that the total query volume in feature space is reduced. The following Lemma establishes the correctness of the proposed method. Consider two sequences Q and S of the same length Len(Q) = Len(S) = p · w. Consider their p disjoint subsequences qi = Q[i·w + 1 : (i+1)·w] and si = S[i·w + 1 : (i+1)·w], where 0 ≤ i ≤ p - 1.

Lemma 3 If Q and S agree within tolerance ε, then at least one of the pairs (si, qi) of corresponding subsequences agree within tolerance ε/sqrt(p):

    D(Q, S) ≤ ε  ⇒  OR_{i=0}^{p-1} ( D(qi, si) ≤ ε/sqrt(p) )    (17)

where OR indicates disjunction.

Proof: By contradiction. If all the pairs of subsequences have distance > ε/sqrt(p), then, by adding all these distances, the distance of Q and S will be > ε, which is a contradiction. More formally, since for i = 0, ..., p - 1

    D^2(qi, si) = sum_{j=i·w+1}^{(i+1)·w} (qi[j] - si[j])^2    (18)
we have that

    ∀i ( D(qi, si) > ε/sqrt(p) )    (19)

    ⇒  ∀i ( sum_{j=i·w+1}^{(i+1)·w} (qi[j] - si[j])^2 > ε^2/p )    (20)

    ⇒  sum_{j=1}^{p·w} (Q[j] - S[j])^2 > p · ε^2/p = ε^2    (21)

or

    D(Q, S) > ε    (22)

which contradicts the hypothesis. □

The searching algorithm that uses Lemma 3 will be called 'MultiPiece' search. It works as follows:

Algorithm 'Search_Long' ('MultiPiece')

- the query sequence Q is broken into p sub-queries, which correspond to p spheres in feature space with radius ε/sqrt(p);

- we use our 'ST-index' to retrieve the sub-trails whose MBRs intersect at least one of the sub-query regions;

- then, we examine the corresponding subsequences of the data sequences, to discard the false alarms.
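The following sketch of 'MultiPiece' (again ours; illustrative Python building on the search_short sketch above) breaks Q into p pieces, runs each sub-query with the reduced tolerance ε/sqrt(p), and post-filters the union of candidate offsets with the full-length distance:

    import numpy as np

    def search_long(leaf_entries, sequences, Q, eps, w, feature_fn):
        # MultiPiece search for Len(Q) = p * w; Lemma 3 guarantees no dismissals.
        Q = np.asarray(Q, dtype=float)
        p = len(Q) // w
        sub_eps = eps / np.sqrt(p)
        candidates = set()
        for i in range(p):                              # one sub-query per piece
            piece = Q[i * w:(i + 1) * w]
            for sid, k in search_short(leaf_entries, sequences, piece,
                                       sub_eps, w, feature_fn):
                candidates.add((sid, k - i * w))        # align to the start of Q
        hits = []
        for sid, k in sorted(candidates):               # discard false alarms
            S = np.asarray(sequences[sid], dtype=float)
            if k < 0 or k + len(Q) > len(S):
                continue
            if np.linalg.norm(S[k:k + len(Q)] - Q) <= eps:
                hits.append((sid, k))
        return hits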
order to eliminate any bias in the results that may be
Next, we compare the two search algorithms caused by the index structure, which is not uniform at
( `Pre xSearch' and `MultiPiece' ) with respect to the the leaf level. Unless otherwise speci ed, in all the ex-
volume they require in feature space. The volume of an periments we used w = 512 and Len(Q) = w.
f-dimensional sphere of radius  is given by We carried out four groups of experiments, as follows:
K f (23) (a) Comparison of the the proposed `I-adaptive'method
against the `I-naive' method (the method that has
where K is a constant (K =  for a 2-dimensional space, sub-trails, each one consisting of one point).
etc). This is exactly the volume of the `Pre xSearch'
algorithm. The `MultiPiece' algorithm yields p spheres, (b) Experiments to compare the response time of our
each of volume proportional to method ( `I-adaptive') with sequential scanning for
queries of length w.
K(=pp)f (24)
(c) Experiments with queries longer than w.
for a total volume of
p (d) Experiments with a larger, synthetic data set, to
K  p  f = pf = K  f =pf=2?1 (25) see whether the superiority of our method holds for
This means that the proposed `MultiPiece' search other datasets.
method is likely to produce fewer false alarms, and Comparison with the `I-naive' method. Figure
therefore better response time than the `Pre xSearch' 5 illustrates the index size as a function of the length
method, whenever we have f > 2 features. of the sub-trails for the three alternatives ( `I-naive' ,
`I- xed' and `I-adaptive'). Our method requires only
4 Experiments 5Kb, while the `I-naive' method requires 24 MB. The
We implemented the `ST-index' method using the `I- xed' method gives varying results, according to the
`adaptive' heuristic as described in section 3, and we length of its sub-trails.
ran experiments on a stock prices database of 329,000 The large size of the index for the `I-naive' method
points, obtained from sfi.santafe.edu. Each point hurts its search performance as well: in our experiments,

8
[Figure 5: Index space vs. the average sub-trail length (log-log scales).]

The large size of the index for the 'I-naive' method hurts its search performance as well: in our experiments, the 'I-naive' method was approximately two times slower than sequentially scanning the entire database.

Response time - 'Short' queries. We start examining our method's response time using the shortest acceptable queries, that is, queries of length equal to the window size (Len(Q) = w). We used 512 points for Len(Q) and w. Figure 6 gives the relative response time of the sequential scanning method (Ts) vs. our index-assisted method (Tr, where r stands for 'R-tree'), by counting the actual wall-clock time for each method. The horizontal axis is the selectivity; both axes are in logarithmic scales. The query selectivity varies up to 10%, which is fairly large in comparison with our 329,000-point time-series database. We see that our method achieves from 3 up to 100 times better response time for selectivities in the range from 10^-4 to 10%.

[Figure 6: Relative wall clock time (Ts/Tr) vs. selectivity in log-log scale (Len(Q) = w = 512 points).]

We carried out similar experiments for Len(Q) = w = 256, 128 etc., and we observed similar behavior and similar savings. Thus, we omit those plots for brevity. Our conclusion is that our method consistently achieves large savings over the sequential scanning.

Response time - longer queries. Next we examine the relative performance of the two methods for queries that are longer than w. As explained in the previous section, in these cases we have to split the query and merge the results ('MultiPiece' method). As illustrated in Figure 7, again the proposed 'I-adaptive' method outperforms the sequential scanning, from 2 to 40 times.

[Figure 7: Relative wall clock time (Ts/Tr) vs. selectivity in a log-log scale (w = 128, Len(Q) = 512 points).]

Synthetic data. In Figure 8 we examine our technique's performance against a time-series database consisting of 500,000 points produced by a random-walk method. These points were generated with a starting value of 1.5, whereas the step increment on each step was ±0.001. Again we remark that our method outperforms sequential scanning from approximately 100 down to 10 times for selectivities up to 10%.

[Figure 8: Relative wall clock time (Ts/Tr) vs. selectivity for random-walk data in a log-log scale (Len(Q) = w = 512 points).]
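For reference, a random-walk generator of this kind can be written as follows (our own reconstruction; the paper only states the starting value and the step size, so the remaining details are assumptions):

    import numpy as np

    def random_walk(n_points=500_000, start=1.5, step=0.001, seed=0):
        # Each value moves up or down by 'step' (i.e., +/- 0.001) from the previous one.
        rng = np.random.default_rng(seed)
        steps = rng.choice([-step, step], size=n_points - 1)
        return np.concatenate([[start], start + np.cumsum(steps)])

    synthetic = random_walk()
    print(len(synthetic), synthetic[:5])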
5 Conclusions

We have presented the detailed design of a method that efficiently handles approximate (and exact) queries for subsequence matching on a stored collection of data sequences. This method generalizes the work in [2],
which examined the 'whole matching' case (i.e., all queries and all the sequences in the time-series database had to have the same length). The idea in the proposed method is to map a data sequence into a set of boxes in feature space; subsequently, these can be stored in any spatial access method, such as the R*-tree.

The main contribution is that we have designed in detail the first, to our knowledge, indexing method for subsequence matching. The method has the following desirable features:

- it achieves orders of magnitude savings over sequential scanning, as was shown by our experiments on real data;

- it requires small space overhead;

- it is dynamic; and

- it is provably 'correct', that is, it never misses qualifying subsequences (Lemmas 1-3).

Notice that the proposed method can be used with any set of feature-extraction functions (in addition to DFT), as well as with any spatial access method that handles rectangles.

Future work includes the extension of this method to 2-dimensional gray-scale images, and in general to n-dimensional vector-fields (such as 2-d color images, to answer queries by image content as in [21], or 3-d MRI brain scans, to help detect patterns of brain activation as discussed in [7], etc.).

References

[1] R. Agrawal, T. Imielinski, and A. Swami. Database mining: a performance perspective. IEEE Trans. on Knowledge and Data Engineering, 5(6):914-925, 1993.

[2] Rakesh Agrawal, Christos Faloutsos, and Arun Swami. Efficient similarity search in sequence databases. In FODO Conference, Evanston, Illinois, October 1993. Also available through anonymous ftp, from olympos.cs.umd.edu:ftp/pub/TechReports/fodo.ps.

[3] Rakesh Agrawal, Sakti Ghosh, Tomasz Imielinski, Bala Iyer, and Arun Swami. An interval classifier for database mining applications. VLDB Conf. Proc., pages 560-573, August 1992.

[4] Rakesh Agrawal, Tomasz Imielinski, and Arun Swami. Mining association rules between sets of items in large databases. Proc. ACM SIGMOD, pages 207-216, May 1993.

[5] K.K. Al-Taha, R.T. Snodgrass, and M.D. Soo. Bibliography on spatiotemporal databases. ACM SIGMOD Record, 22(1):59-67, March 1993.

[6] S.F. Altschul, W. Gish, W. Miller, E.W. Myers, and D.J. Lipman. A basic local alignment search tool. Journal of Molecular Biology, 1990.

[7] Manish Arya, William Cody, Christos Faloutsos, Joel Richardson, and Arthur Toga. QBISM: a prototype 3-d medical image database system. IEEE Data Engineering Bulletin, 16(1):38-42, March 1993.

[8] Manish Arya, William Cody, Christos Faloutsos, Joel Richardson, and Arthur Toga. QBISM: extending a DBMS to support 3d medical images. Tenth Int. Conf. on Data Engineering (ICDE), February 1994. (to appear).
[9] N. Beckmann, H.-P. Kriegel, R. Schneider, and B. Seeger. The R*-tree: an efficient and robust access method for points and rectangles. ACM SIGMOD, pages 322-331, May 1990.

[10] Christopher Chatfield. The Analysis of Time Series: an Introduction. Chapman and Hall, London & New York, 1984. Third Edition.

[11] Mathematical Committee on Physical and NSF Engineering Sciences. Grand Challenges: High Performance Computing and Communications. National Science Foundation, 1992. The FY 1992 U.S. Research and Development Program.

[12] Robert D. Edwards and John Magee. Technical Analysis of Stock Trends. John Magee, Springfield, Massachusetts, 1966. 5th Edition, second printing.

[13] C. Faloutsos. Access methods for text. ACM Computing Surveys, 17(1):49-74, March 1985.

[14] C. Faloutsos, W. Equitz, M. Flickner, W. Niblack, D. Petkovic, and R. Barber. Efficient and effective querying by image content. Journal of Intelligent Information Systems, 1993. (to appear).

[15] A. Guttman. R-trees: a dynamic index structure for spatial searching. Proc. ACM SIGMOD, pages 47-57, June 1984.

[16] Richard Wesley Hamming. Digital Filters. Prentice-Hall Signal Processing Series, Englewood Cliffs, N.J., 1977.

[17] H. V. Jagadish. Spatial search with polyhedra. Proc. Sixth IEEE Int'l Conf. on Data Engineering, February 1990.

[18] H.V. Jagadish. A retrieval technique for similar shapes. Proc. ACM SIGMOD Conf., pages 208-217, May 1991.

[19] Ibrahim Kamel and Christos Faloutsos. On packing R-trees. Second Int. Conf. on Information and Knowledge Management (CIKM), November 1993.

[20] B. Mandelbrot. Fractal Geometry of Nature. W.H. Freeman, New York, 1977.

[21] Wayne Niblack, Ron Barber, Will Equitz, Myron Flickner, Eduardo Glasman, Dragutin Petkovic, Peter Yanker, Christos Faloutsos, and Gabriel Taubin. The QBIC project: querying images by content using color, texture and shape. SPIE 1993 Intl. Symposium on Electronic Imaging: Science and Technology, Conf. 1908, Storage and Retrieval for Image and Video Databases, February 1993. Also available as IBM Research Report RJ 9203 (81511), Feb. 1, 1993, Computer Science.

[22] J. Nievergelt, H. Hinterberger, and K.C. Sevcik. The grid file: an adaptable, symmetric multikey file structure. ACM TODS, 9(1):38-71, March 1984.

[23] Alan Victor Oppenheim and Ronald W. Schafer. Digital Signal Processing. Prentice-Hall, Englewood Cliffs, N.J., 1975.

[24] J. Orenstein. Spatial query processing in an object-oriented database system. Proc. ACM SIGMOD, pages 326-336, May 1986.

[25] Mary Beth Ruskai, Gregory Beylkin, Ronald Coifman, Ingrid Daubechies, Stephane Mallat, Yves Meyer, and Louise Raphael. Wavelets and Their Applications. Jones and Bartlett Publishers, Boston, MA, 1992.

[26] H. Samet. The Design and Analysis of Spatial Data Structures. Addison-Wesley, 1989.

[27] Manfred Schroeder. Fractals, Chaos, Power Laws: Minutes From an Infinite Paradise. W.H. Freeman and Company, New York, 1991.

[28] T. Sellis, N. Roussopoulos, and C. Faloutsos. The R+-tree: a dynamic index for multi-dimensional objects. In Proc. 13th International Conference on VLDB, pages 507-518, England, September 1987. Also available as SRC-TR-87-32, UMIACS-TR-87-3, CS-TR-1795.

[29] R. Stam and Richard Snodgrass. A bibliography on temporal databases. IEEE Bulletin on Data Engineering, 11(4), December 1988.

[30] Dimitris Vassiliadis. The input-state space approach to the prediction of auroral geomagnetic activity from solar wind variables. Int. Workshop on Applications of Artificial Intelligence in Solar Terrestrial Physics, September 1993.

[31] Gregory K. Wallace. The JPEG still picture compression standard. CACM, 34(4):31-44, April 1991.
