Fast Subsequence Matching in Time-Series Databases

Christos Faloutsos
M. Ranganathan
Yannis Manolopoulos
Let O_1, O_2 be two objects (e.g., same-length sequences) with distance function D_object() (e.g., the Euclidean distance), and let F(O_1), F(O_2) be their feature vectors (e.g., their first few Fourier coefficients), with distance function D_feature() (e.g., the Euclidean distance, again). Then we have:

Lemma 1. To guarantee no false dismissals for range queries, the feature extraction function F() should satisfy the following formula:

    D_{feature}(F(O_1), F(O_2)) \le D_{object}(O_1, O_2)    (1)

Proof: Let Q be the query object, O be a qualifying object, and \epsilon be the tolerance. We want to prove that if the object O qualifies for the query, then it will be retrieved when we issue a range query on the feature space. That is, we want to prove that

    D_{object}(Q, O) \le \epsilon \Rightarrow D_{feature}(F(Q), F(O)) \le \epsilon    (2)

However, this is obvious, since

    D_{feature}(F(Q), F(O)) \le D_{object}(Q, O) \le \epsilon    (3)

\qed

Following [2], we use the Euclidean distance as the distance function between two sequences, that is, the sum of squared differences. Formally, for two sequences S and Q of the same length l, we define their distance D(S, Q) as

    D(S, Q) \equiv \left( \sum_{i=1}^{l} (S[i] - Q[i])^2 \right)^{1/2}    (4)

The n-point Discrete Fourier Transform [23] of a signal \vec{x} = [x_i], i = 0, 1, \ldots, n-1, is a sequence \vec{X} of n complex numbers X_F:

    X_F = 1/\sqrt{n} \sum_{i=0}^{n-1} x_i \exp(-j 2\pi F i / n) \qquad F = 0, 1, \ldots, n-1    (5)

where j is the imaginary unit, j = \sqrt{-1}. The signal \vec{x} can be recovered by the inverse transform:

    x_i = 1/\sqrt{n} \sum_{F=0}^{n-1} X_F \exp(j 2\pi F i / n) \qquad i = 0, 1, \ldots, n-1    (6)

X_F is a complex number (with the exception of X_0, which is real, if the signal \vec{x} is real). The energy E(\vec{x}) of a sequence \vec{x} is defined as the sum of energies (squares of the amplitude |x_i|) at every point of the sequence:

    E(\vec{x}) \equiv \| \vec{x} \|^2 \equiv \sum_{i=0}^{n-1} |x_i|^2    (7)

A fundamental observation for the correctness of our method is Parseval's theorem [23], which states that the DFT preserves the energy of a signal:

Theorem (Parseval). Let \vec{X} be the Discrete Fourier Transform of the sequence \vec{x}. Then we have:

    \sum_{i=0}^{n-1} |x_i|^2 = \sum_{F=0}^{n-1} |X_F|^2    (8)

Since the DFT is a linear transformation [23], Parseval's theorem implies that the DFT also preserves the Euclidean distance between two signals \vec{x} and \vec{y}:

    D(\vec{x}, \vec{y}) = D(\vec{X}, \vec{Y})    (9)
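Both Lemma 1's lower-bounding condition and the distance-preserving property (9) can be checked numerically. Below is a minimal Python sketch (the paper's own system is written in C), using the orthonormal DFT convention of equation (6); truncating to the first few coefficients can only shrink the Euclidean distance, so no false dismissals occur:

```python
import cmath
import math
import random

def dft(x):
    """Orthonormal n-point DFT: X_F = (1/sqrt(n)) * sum_i x_i e^{-j 2 pi F i / n}."""
    n = len(x)
    return [sum(x[i] * cmath.exp(-2j * math.pi * F * i / n) for i in range(n))
            / math.sqrt(n) for F in range(n)]

def dist(a, b):
    """Euclidean distance between equal-length (possibly complex) sequences."""
    return math.sqrt(sum(abs(u - v) ** 2 for u, v in zip(a, b)))

random.seed(0)
n = 64
x = [random.gauss(0, 1) for _ in range(n)]
y = [random.gauss(0, 1) for _ in range(n)]
X, Y = dft(x), dft(y)

print(abs(dist(x, y) - dist(X, Y)) < 1e-9)   # True: eq. (9) via Parseval
print(dist(X[:3], Y[:3]) <= dist(x, y))      # True: Lemma 1's condition holds
```

Keeping only the first 3 coefficients here mirrors the feature extraction used later in the experiments; the choice of 3 is illustrative.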
Symbol      Definition
N           Number of data sequences.
S_i         The i-th data sequence (1 \le i \le N).
Len(S)      Length of sequence S.
S[k]        The k-th entry of sequence S.
S[i:j]      Subsequence of S, including entries in positions i through j.
Q           A query sequence.
w           Minimum query sequence length.
D(Q, S)     (Euclidean) distance between sequences Q and S of equal length.
\epsilon    Tolerance (max. acceptable distance).
f           Number of features.
mc          Marginal cost of a point.

Table 1: Summary of Symbols and Definitions

This concludes our discussion of prior work, which concentrated on `whole match' queries. Next, we describe in detail how we handle requests for matching subsequences.

3 Proposed Method

Here, we examine the problem of subsequence matching. Specifically, the problem is defined as follows:

- We are given a collection of N sequences of real numbers S_1, S_2, ..., S_N, each one of potentially different length.
- The user specifies a query subsequence Q of length Len(Q) (which may vary) and the tolerance \epsilon, that is, the maximum acceptable dis-similarity (= distance).
- We want to find quickly all the sequences S_i (1 \le i \le N), along with the correct offsets k, such that the subsequence S_i[k : k + Len(Q) - 1] matches the query sequence: D(Q, S_i[k : k + Len(Q) - 1]) \le \epsilon.

The brute-force solution is to examine sequentially every possible subsequence of the data sequences for a match; we shall refer to this as the `SequentialScan' method. Next, we describe a method that uses a small space overhead to achieve orders-of-magnitude savings over `SequentialScan'. The main symbols used throughout the paper and their definitions are summarized in Table 1.

3.1 Sketch of the approach - `ST-index'

Without loss of generality, we assume that the minimum query length is w, where w (\ge 1) depends on the application. For example, in stock price databases, analysts are interested in weekly or monthly patterns, because shorter patterns are susceptible to noise [12]. Notice that we never lose the ability to answer queries shorter than w, because we can always resort to sequential scanning.

Generalizing the reasoning of the method for `whole matching', we use a sliding window of size w and place it at every possible position (offset) on every data sequence. For each such placement of the window, we extract the features of the subsequence inside the window. Thus, a data sequence of length Len(S) is mapped to a trail in feature space, consisting of Len(S) - w + 1 points: one point for each possible offset of the sliding window. Figure 1(a) gives an example of trails. Assume that we have two sequences, S_1 and S_2 (not shown in the figure), and that we keep the first f = 2 features (e.g., the amplitudes of the first and second coefficients of the w-point DFT). When the window of length w is placed at offset = 0 on S_1, we obtain the first point of the trail C_1; as the window slides over S_1, we obtain the rest of the points of the trail C_1. The trail C_2 is derived from S_2 in the same manner. Figure 4 gives an example of trails, using a real time series (stock-price movements).

One straightforward way to index these trails would be to keep track of the individual points of each trail, storing them in a spatial access method. We call this the `I-naive' method, where `I' stands for `Index' (as opposed to sequential scanning). When presented with a query of length w and tolerance \epsilon, we could extract the features of the query and search the spatial access method with a range query of radius \epsilon; the retrieved points would correspond to promising subsequences. After discarding the false alarms (by retrieving all those subsequences and calculating their actual distance from the query), we would have the desired answer set. Notice that the method will not miss any qualifying subsequence, because it satisfies the condition of Lemma 1.

However, storing the individual points of the trail in an R*-tree is inefficient, both in terms of space and of search speed. The reason is that almost every point in a data sequence will correspond to a point in the f-dimensional feature space, leading to an index with a 1:f increase in storage requirements. Moreover, the search performance will also suffer, because the R-tree will become tall and slow. As we shall see in the section with the experiments, the `I-naive' method ended up being almost twice as slow as `SequentialScan'. Thus, we want to improve the `I-naive' method by making the representation of the trails more compact. The solution we propose exploits the fact that successive points of the trail will probably be similar, because the contents of the sliding window at nearby offsets will be similar.
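The sliding-window trail construction of this subsection can be sketched in Python as follows; the window size, feature count, and data sequence are toy values chosen purely for illustration (the experiments of section 4 use w = 512 and f = 6):

```python
import cmath
import math

def features(window, nfeat=3):
    """First `nfeat` DFT coefficients of the window (orthonormal convention),
    flattened into 2*nfeat real features (real and imaginary parts)."""
    n = len(window)
    feats = []
    for F in range(nfeat):
        XF = sum(window[i] * cmath.exp(-2j * math.pi * F * i / n)
                 for i in range(n)) / math.sqrt(n)
        feats += [XF.real, XF.imag]
    return feats

def trail(S, w, nfeat=3):
    """Slide a window of size w over sequence S: one feature-space point
    per offset, Len(S) - w + 1 points in total."""
    return [features(S[k:k + w], nfeat) for k in range(len(S) - w + 1)]

# Toy data sequence; w = 8 here purely for illustration.
S1 = [math.sin(t / 3.0) for t in range(40)]
T = trail(S1, w=8)
print(len(T))      # 33 points: Len(S) - w + 1 = 40 - 8 + 1
print(len(T[0]))   # 6 features (f = 6), as in the paper's experiments
```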
[Figure 1: (a) the trails C_1 and C_2 of two data sequences in the (F_1, F_2) feature space, divided into sub-trails with their MBRs; (b) the leaf-level MBRs grouped into two higher-level MBRs, MBR1 and MBR2.]

Notice that it is possible that MBRs belonging to the same trail may overlap, as C_2 illustrates.

Thus, we propose to map a data sequence into a set of rectangles in feature space. This yields significant improvements with respect to space, as well as with respect to response time, as we shall see in section 4. Each MBR corresponds to a whole sub-trail, that is, points in feature space that correspond to successive positionings of the sliding window on the data sequences. For each such MBR we have to store:

- t_start and t_end, which are the offsets of the first and last such positionings;
- a unique identifier for the data sequence (sequence id); and
- the extent of the MBR in each dimension (F1low, F1high, F2low, F2high, ...).

These MBRs can subsequently be stored in a spatial access method. We have used R*-trees [9], in which case these MBRs are recursively grouped into parent MBRs, grandparent MBRs, etc. Figure 1(b) shows how the eight leaf-level MBRs of Figure 1(a) will be grouped to form two MBRs at the next higher level, assuming a fanout of 4 (i.e., at most 4 items per non-leaf node). Note that the higher-level MBRs may contain leaf-level MBRs from different data sequences. For example, in Figure 1(b) we remark how the left-side MBR1 contains a part of the south-east curve C_2. Figure 2 shows the structure of a leaf node and a non-leaf node. Notice that the non-leaf nodes do not need to carry information about sequence ids or offsets (t_start and t_end).
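A minimal sketch of the record kept for each sub-trail; the field names are illustrative (the text above specifies only the contents: the two offsets, the sequence identifier, and the MBR extents per dimension):

```python
def subtrail_mbr(points, seq_id, t_start):
    """Build the stored record for one sub-trail: offsets of its first and
    last window positionings, the sequence identifier, and the MBR extent
    (low, high) in every feature dimension."""
    f = len(points[0])
    lows  = [min(p[d] for p in points) for d in range(f)]
    highs = [max(p[d] for p in points) for d in range(f)]
    return {
        "t_start": t_start,
        "t_end": t_start + len(points) - 1,
        "sequence_id": seq_id,
        "mbr": list(zip(lows, highs)),   # (F1low, F1high), (F2low, F2high), ...
    }

# Three consecutive trail points in an f = 2 feature space:
sub_trail = [[0.10, 0.20], [0.12, 0.18], [0.11, 0.25]]
rec = subtrail_mbr(sub_trail, seq_id=1, t_start=0)
print(rec["t_end"])   # 2
print(rec["mbr"])     # [(0.1, 0.12), (0.18, 0.25)]
```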
- Queries: how to handle queries, and especially the ones that are longer than w.

These are the topics of the next two subsections, respectively.

Method            Description
`SequentialScan'  Sequential scan of the whole database.
`I-naive'         Search using an `ST-index' with 1 point per sub-trail.
`I-fixed'         Search using an `ST-index' with a fixed number of points per sub-trail.
`I-adaptive'      Search using an `ST-index' with a variable number of points per sub-trail.

Table 2: Summary of searching methods and descriptions

[Figure 3: Packing points P_1, ..., P_9 using (a) a fixed heuristic (sub-trail size = 3), and (b) an adaptive heuristic.]

3.2 Insertion - Methods to divide trails into sub-trails

As we saw before, each data sequence is mapped into a `trail' in feature space. The question that arises, then, is: how should we optimally divide a trail in feature space into sub-trails, and eventually MBRs, so that the number of disk accesses is minimized? A first idea would be to pack points into sub-trails according to a pre-determined, fixed number (e.g., 50). However, there is no justifiable way to decide the optimal value of this constant. Another idea would be to use a simple function of the length of the stored sequence for this sub-trail size (e.g., \sqrt{Len(S)}). However, both heuristics may lead to poor results. Figure 3 illustrates the problem of having a pre-determined sub-trail size. It shows a trail with 9 points, and it assumes that the sub-trail length is 3 (i.e., = \sqrt{9}). The resulting MBRs (Figure 3(a)) are not as good as the MBRs shown in Figure 3(b). We collectively refer to all the above heuristics as the `I-fixed' method, because they use an index with fixed-length sub-trails. Clearly, the `I-naive' method is a special case of the `I-fixed' method, with the sub-trail length set to 1.

Thus we are looking for a method that is able to adapt to the distribution of the points of the trail. We propose to group points into sub-trails by means of an `adaptive' heuristic, which is based on a greedy algorithm. The algorithm uses a cost function, which tries to estimate the number of disk accesses for each of the options. The resulting indexing method will be called `I-adaptive'. This is the last of the four alternatives we have introduced. Table 2 lists all of them, along with a brief description of each.

To complete the description of the `I-adaptive' method, we have to define a cost function and the concept of the marginal cost of a point. In [19] we developed a formula which, given the sides \vec{L} = (L_1, L_2, \ldots, L_n) of the n-dimensional MBR of a node in an R-tree, estimates the average number of disk accesses DA(\vec{L}) that this node will contribute for the average range query:

    DA(\vec{L}) = \prod_{i=1}^{n} (L_i + 0.5)    (10)

The formula assumes that the address space has been normalized to the unit hyper-cube ([0, 1)^n). We use the expected number of disk accesses DA() as the cost function. The marginal cost (mc) of a point is defined as follows: consider a sub-trail of k points with an MBR of sides L_1, \ldots, L_n. Then the marginal cost of each point in this sub-trail is

    mc = DA(\vec{L}) / k    (11)
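The cost function (10), the marginal cost (11), and the greedy `adaptive' grouping rule (start a new sub-trail whenever the next point would increase the marginal cost) can be sketched as follows; the sample trail is illustrative:

```python
def DA(mbr):
    """Eq. (10): estimated disk accesses for a node whose MBR has sides L_i,
    assuming the address space is normalized to the unit hyper-cube."""
    cost = 1.0
    for low, high in mbr:
        cost *= (high - low) + 0.5
    return cost

def marginal_cost(points):
    """Eq. (11): the MBR's estimated cost divided equally among its k points."""
    dims = range(len(points[0]))
    mbr = [(min(p[d] for p in points), max(p[d] for p in points)) for d in dims]
    return DA(mbr) / len(points)

def divide_to_subtrails(trail):
    """Greedy heuristic: start a new sub-trail whenever adding the next
    point would increase the marginal cost of the current sub-trail."""
    subtrails = [[trail[0]]]
    for point in trail[1:]:
        current = subtrails[-1]
        if marginal_cost(current + [point]) > marginal_cost(current):
            subtrails.append([point])   # start another sub-trail
        else:
            current.append(point)       # include it in the current sub-trail
    return subtrails

# A trail with a tight cluster followed by a far-away cluster (f = 2):
trail = [[0.0, 0.0], [0.01, 0.01], [0.02, 0.0], [0.9, 0.9], [0.91, 0.92]]
parts = divide_to_subtrails(trail)
print([len(p) for p in parts])   # [3, 2]: the jump starts a new sub-trail
```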
That is, we divide the cost of this MBR equally among the contained points. The algorithm is then as follows:

/* Algorithm Divide-to-Subtrails */
Assign the first point of the trail to a (trivial) sub-trail
FOR each successive point
    IF it increases the marginal cost of the current sub-trail
        THEN start another sub-trail
        ELSE include it in the current sub-trail

For two sequences S and Q of the same length l, a match of the full sequences implies a match of every pair of corresponding subsequences:

    D(S, Q) \le \epsilon \Rightarrow D(S[i:j], Q[i:j]) \le \epsilon \qquad (1 \le i \le j \le l)    (12)

Proof: Since D() is the Euclidean distance, we have

    D(S, Q) \le \epsilon \Rightarrow \sum_{k=1}^{l} (S[k] - Q[k])^2 \le \epsilon^2    (13)

Since

    \sum_{k=i}^{j} (S[k] - Q[k])^2 \le \sum_{k=1}^{l} (S[k] - Q[k])^2    (14)

it follows that D(S[i:j], Q[i:j]) \le \epsilon. \qed
Suppose, by way of contradiction, that the query Q is broken into p pieces q_i of length w (with s_i the corresponding pieces of S), and that every piece misses, i.e., D(q_i, s_i) > \epsilon / \sqrt{p} for all i. Then we have that

    \forall i \; (D(q_i, s_i) > \epsilon / \sqrt{p}), \quad 0 \le i \le p - 1 \Rightarrow    (19)

    \forall i \; \left( \sum_{j=iw+1}^{(i+1)w} (Q[j] - S[j])^2 > \epsilon^2 / p \right) \Rightarrow    (20)

    \sum_{j=1}^{pw} (Q[j] - S[j])^2 > p \cdot \epsilon^2 / p = \epsilon^2    (21)

or

    D(Q, S) > \epsilon    (22)

which contradicts the hypothesis. \qed

The searching algorithm that uses Lemma 3 will be called `MultiPiece' search. It works as follows:

Algorithm `Search Long (MultiPiece)':

- the query sequence Q is broken into p sub-queries, which correspond to p spheres in feature space with radius \epsilon / \sqrt{p};
- we use our `ST-index' to retrieve the sub-trails whose MBRs intersect at least one of the sub-query regions;
- then, we examine the corresponding subsequences of the data sequences, to discard the false alarms.

Next, we compare the two search algorithms (`PrefixSearch' and `MultiPiece') with respect to the volume they require in feature space. The volume of an f-dimensional sphere of radius \epsilon is given by

    K \epsilon^f    (23)

where K is a constant (K = \pi for a 2-dimensional space, etc.). This is exactly the volume of the `PrefixSearch' algorithm. The `MultiPiece' algorithm yields p spheres, each of volume proportional to

    K (\epsilon / \sqrt{p})^f    (24)

for a total volume of

    K \, p \, \epsilon^f / p^{f/2} = K \epsilon^f / p^{f/2 - 1}    (25)

This means that the proposed `MultiPiece' search method is likely to produce fewer false alarms, and therefore better response time, than the `PrefixSearch' method whenever we have f > 2 features.

4 Experiments

We implemented the `ST-index' method using the `adaptive' heuristic as described in section 3, and we ran experiments on a stock prices database of 329,000 points, obtained from sfi.santafe.edu. Each point was a real number, having a size of 4 bytes. Figure 4(a) shows a sample of 250 points of such a stock price sequence. The system is written in `C' under AIX, on an IBM RS/6000. Based on the results of [2], we used only the first 3 frequencies of the DFT; thus our feature space has f = 6 dimensions (the real and imaginary parts of each retained DFT coefficient). Figure 4(b) illustrates a trail in feature space: it is a 2-dimensional `phase' plot of the amplitude of the 0-th DFT coefficient vs. the amplitude of the 1-st DFT coefficient. Figure 4(c) similarly plots the amplitudes of the 1-st and 2-nd DFT coefficients. For both figures the window size w was 512 points. The smooth evolution of both curves justifies our method of grouping successive points of the feature space in MBRs. An R*-tree [9] was used to store the MBRs of each sub-trail in feature space.

We carried out experiments to measure the performance of the most promising method, `I-adaptive'. For each setting, we asked queries with variable selectivities, and we measured the wall-clock time on a dedicated machine. More specifically, query sequences were generated by taking random offsets into the data sequence and obtaining subsequences of length Len(Q) from those offsets. For each such query sequence, a tolerance \epsilon was specified and the query was run with that tolerance. This method was followed in order to eliminate any bias in the results that might be caused by the index structure, which is not uniform at the leaf level. Unless otherwise specified, in all the experiments we used w = 512 and Len(Q) = w.

We carried out four groups of experiments, as follows:

(a) Comparison of the proposed `I-adaptive' method against the `I-naive' method (the method whose sub-trails each consist of one point).

(b) Experiments to compare the response time of our method (`I-adaptive') with sequential scanning, for queries of length w.

(c) Experiments with queries longer than w.

(d) Experiments with a larger, synthetic data set, to see whether the superiority of our method holds for other datasets.

Comparison with the `I-naive' method. Figure 5 illustrates the index size as a function of the length of the sub-trails for the three alternatives (`I-naive', `I-fixed' and `I-adaptive'). Our method requires only 5Kb, while the `I-naive' method requires 24 MB. The `I-fixed' method gives varying results, according to the length of its sub-trails. The large size of the index for the `I-naive' method hurts its search performance as well: in our experiments,
the `I-naive' method was approximately two times slower than sequentially scanning the entire database.

[Figure 4: (a) a sample of 250 points of a stock-price sequence (amplitude vs. time); (b) a trail in feature space: amplitude of the 1-st DFT coefficient vs. amplitude of the 0-th; (c) amplitude of the 2-nd DFT coefficient vs. amplitude of the 1-st.]

Figure 5: Index space vs. the average sub-trail length, for the `I-fixed' and `I-adaptive' methods (log-log scales).

Figure 6: Relative wall clock time (T_s/T_r) vs. selectivity, in log-log scale (Len(Q) = w = 512 points).

Response time - `Short' queries. We start examining our method's response time using the shortest acceptable queries, that is, queries of length equal to the window size (Len(Q) = w). We used 512 points for Len(Q) and w. Figure 6 gives the relative response time of the sequential scanning method (T_s) vs. our index-assisted method (T_r, where r stands for `R-tree'), by counting the actual wall-clock time for each method. The horizontal axis is the selectivity; both axes are in logarithmic scales. The query selectivity varies up to 10%, which is fairly large in comparison with our 329,000-point time-series database. We see that our method achieves from 3 up to 100 times better response time, for selectivities in the range from 10^{-4} to 10%.

We carried out similar experiments for Len(Q) = w = 256, 128 etc., and we observed similar behavior and similar savings; thus, we omit those plots for brevity. Our conclusion is that our method consistently achieves large savings over sequential scanning.

Response time - longer queries. Next we examine the relative performance of the two methods for queries that are longer than w. As explained in the previous section, in these cases we have to split the query and merge the results (the `MultiPiece' method). As illustrated in Figure 7, again the proposed `I-adaptive' method outperforms sequential scanning, from 2 to 40 times.

Synthetic data. In Figure 8 we examine our technique's performance against a time-series database consisting of 500,000 points, produced by a random-walk method. These points were generated with a starting value of 1.5 and a step increment of .001 on each step. Again we remark that our method outperforms sequential scanning, from approximately 100 times down to 10 times, for selectivities up to 10%.
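To illustrate the `MultiPiece' policy used for the longer queries above: the query is broken into p pieces, each probed with the reduced radius \epsilon/\sqrt{p} of Lemma 3, and the surviving candidate offsets are verified against the full query. In this Python sketch the `ST-index' probe is simulated by a linear scan over window positions, and the data are toy values; a real implementation would search the R*-tree of sub-trail MBRs instead:

```python
import math

def dist(a, b):
    """Euclidean distance between equal-length sequences."""
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))

def multipiece_match(Q, S, w, eps):
    """Sketch of `MultiPiece': Q (length p*w) is broken into p sub-queries,
    each probed with radius eps/sqrt(p); candidate offsets are then verified
    against the full query to discard the false alarms."""
    p = len(Q) // w
    radius = eps / math.sqrt(p)
    candidates = set()
    for i in range(p):
        q_piece = Q[i * w:(i + 1) * w]
        for k in range(len(S) - w + 1):           # simulated index probe
            if dist(q_piece, S[k:k + w]) <= radius:
                candidates.add(k - i * w)         # offset of the full match
    # Post-processing: verify each candidate against the whole query.
    return sorted(k for k in candidates
                  if 0 <= k <= len(S) - len(Q)
                  and dist(Q, S[k:k + len(Q)]) <= eps)

S = [0.0] * 10 + [1.0, 2.0, 3.0, 4.0] + [0.0] * 10
Q = [1.0, 2.0, 3.0, 4.0]                      # Len(Q) = p*w with w = 2, p = 2
print(multipiece_match(Q, S, w=2, eps=0.5))   # [10]
```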
Figure 8: Relative wall clock time vs. selectivity for random-walk data, in log-log scale (Len(Q) = w = 512 points).

5 Conclusions

We have presented the detailed design of a method that efficiently handles approximate (and exact) queries for subsequence matching on a stored collection of data sequences. This method generalizes the work in [2], which examined the `whole matching' case (i.e., all queries and all the sequences in the time-series database had to have the same length). The idea in the proposed method is to map a data sequence into a set of boxes in feature space; subsequently, these can be stored in any spatial access method, such as the R*-tree.

The main contribution is that we have designed in detail the first, to our knowledge, indexing method for subsequence matching. The method has the following desirable features:

- it achieves orders of magnitude savings over sequential scanning, as shown by our experiments on real data;
- it requires small space overhead;
- it is dynamic; and
- it is provably `correct', that is, it never misses qualifying subsequences (Lemmas 1-3).

Notice that the proposed method can be used with any set of feature-extraction functions (in addition to the DFT), as well as with any spatial access method that handles rectangles.

References

[4] Rakesh Agrawal, Tomasz Imielinski, and Arun Swami. Mining association rules between sets of items in large databases. Proc. ACM SIGMOD, pages 207-216, May 1993.

[5] K.K. Al-Taha, R.T. Snodgrass, and M.D. Soo. Bibliography on spatiotemporal databases. ACM SIGMOD Record, 22(1):59-67, March 1993.

[6] S.F. Altschul, W. Gish, W. Miller, E.W. Myers, and D.J. Lipman. A basic local alignment search tool. Journal of Molecular Biology, 1990.

[7] Manish Arya, William Cody, Christos Faloutsos, Joel Richardson, and Arthur Toga. QBISM: a prototype 3-d medical image database system. IEEE Data Engineering Bulletin, 16(1):38-42, March 1993.

[8] Manish Arya, William Cody, Christos Faloutsos, Joel Richardson, and Arthur Toga. QBISM: extending a DBMS to support 3d medical images. Tenth Int. Conf. on Data Engineering (ICDE), February 1994. (to appear).
[9] N. Beckmann, H.-P. Kriegel, R. Schneider, and B. Seeger. The R*-tree: an efficient and robust access method for points and rectangles. ACM SIGMOD, pages 322-331, May 1990.

[10] Christopher Chatfield. The Analysis of Time Series: an Introduction. Chapman and Hall, London & New York, 1984. Third Edition.

[11] Committee on Physical, Mathematical and Engineering Sciences. Grand Challenges: High Performance Computing and Communications. National Science Foundation, 1992. The FY 1992 U.S. Research and Development Program.

[12] Robert D. Edwards and John Magee. Technical Analysis of Stock Trends. John Magee, Springfield, Massachusetts, 1966. 5th Edition, second printing.

[13] C. Faloutsos. Access methods for text. ACM Computing Surveys, 17(1):49-74, March 1985.

[14] C. Faloutsos, W. Equitz, M. Flickner, W. Niblack, D. Petkovic, and R. Barber. Efficient and effective querying by image content. Journal of Intelligent Information Systems, 1993. (to appear).

[15] A. Guttman. R-trees: a dynamic index structure for spatial searching. Proc. ACM SIGMOD, pages 47-57, June 1984.

[16] Richard Wesley Hamming. Digital Filters. Prentice-Hall Signal Processing Series, Englewood Cliffs, N.J., 1977.

[17] H.V. Jagadish. Spatial search with polyhedra. Proc. Sixth IEEE Int'l Conf. on Data Engineering, February 1990.

[18] H.V. Jagadish. A retrieval technique for similar shapes. Proc. ACM SIGMOD Conf., pages 208-217, May 1991.

[19] Ibrahim Kamel and Christos Faloutsos. On packing R-trees. Second Int. Conf. on Information and Knowledge Management (CIKM), November 1993.

[20] B. Mandelbrot. Fractal Geometry of Nature. W.H. Freeman, New York, 1977.

[21] Wayne Niblack, Ron Barber, Will Equitz, Myron Flickner, Eduardo Glasman, Dragutin Petkovic, Peter Yanker, Christos Faloutsos, and Gabriel Taubin. The QBIC project: querying images by content using color, texture and shape. SPIE 1993 Intl. Symposium on Electronic Imaging: Science and Technology, Conf. 1908, Storage and Retrieval for Image and Video Databases, February 1993. Also available as IBM Research Report RJ 9203 (81511), Feb. 1, 1993, Computer Science.

[22] J. Nievergelt, H. Hinterberger, and K.C. Sevcik. The grid file: an adaptable, symmetric multikey file structure. ACM TODS, 9(1):38-71, March 1984.

[23] Alan Victor Oppenheim and Ronald W. Schafer. Digital Signal Processing. Prentice-Hall, Englewood Cliffs, N.J., 1975.

[24] J. Orenstein. Spatial query processing in an object-oriented database system. Proc. ACM SIGMOD, pages 326-336, May 1986.

[25] Mary Beth Ruskai, Gregory Beylkin, Ronald Coifman, Ingrid Daubechies, Stephane Mallat, Yves Meyer, and Louise Raphael. Wavelets and Their Applications. Jones and Bartlett Publishers, Boston, MA, 1992.

[26] H. Samet. The Design and Analysis of Spatial Data Structures. Addison-Wesley, 1989.

[27] Manfred Schroeder. Fractals, Chaos, Power Laws: Minutes From an Infinite Paradise. W.H. Freeman and Company, New York, 1991.

[28] T. Sellis, N. Roussopoulos, and C. Faloutsos. The R+-tree: a dynamic index for multi-dimensional objects. In Proc. 13th International Conference on VLDB, pages 507-518, England, September 1987. Also available as SRC-TR-87-32, UMIACS-TR-87-3, CS-TR-1795.

[29] R. Stam and Richard Snodgrass. A bibliography on temporal databases. IEEE Bulletin on Data Engineering, 11(4), December 1988.

[30] Dimitris Vassiliadis. The input-state space approach to the prediction of auroral geomagnetic activity from solar wind variables. Int. Workshop on Applications of Artificial Intelligence in Solar Terrestrial Physics, September 1993.

[31] Gregory K. Wallace. The JPEG still picture compression standard. CACM, 34(4):31-44, April 1991.