Study of Supporting Sequences in DBMSs - Data Model, Query Language, and Storage Management
Ling Lin
Recommended citation:
Ling Lin. Study of Supporting Sequences in DBMSs - Data Model, Query Language, and Storage Management. Linköping Electronic Articles in Computer and Information Science, Vol. 3(1998): No. 4. http://www.ep.liu.se/ea/cis/1998/004/. Feb. 17, 1998.
Abstract

Many real-life applications require data that are inherently sequential. Sequential data exist in many domains such as temporal databases, execution monitors, trigger mechanisms, and list processing.
Traditional database systems have not paid special attention to sequence data, which results in tedious query expressions and poor performance. This report summarizes recent research on supporting sequence data in DBMSs, covering issues such as data model, query language, query optimization, and storage management. The sequence database system SEQ is described.
Ling Lin
Engineering Database and System Laboratory
Department of Computer and Information Science
Linköping Universitet
Linköping, Sweden
1 Introduction
Many real-life applications require data that are inherently sequential. Sequential data exist in many domains such as temporal databases, execution monitors, trigger mechanisms, and list processing. Examples of sequence data include stock prices in business applications, temperature readings in scientific measurements, and event sequences in automatic control.
Traditional database systems are based on the relational model, which treats tables as sets, not sequences. Consequently, expressing sequence queries is tedious and execution is very inefficient [13]. Here is an example:
A weather monitoring system records information about various meteorological phenomena, such as volcano eruptions and earthquakes. These event sequences are ordered by time. Now we ask the query:
• For which volcano eruptions was the strength of the most recent earthquake greater than 7.0?
If the sequential semantics of the data is exploited, the two sequences can be scanned in parallel in time order: for each volcano eruption, only the most recent earthquake needs to be remembered, and it is checked whether the strength was greater than 7.0, possibly generating an answer. The query can therefore be processed with a single scan of the two sequences, using very little memory. The key to such optimization is the sequentiality of the data and the query.
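To make the single-scan evaluation concrete, here is a small Python sketch; the sequence representation and the function name are my own, purely for illustration, and the data are hypothetical.

# Merge-style, single-scan processing of the query: for each volcano
# eruption, report it if the most recent earthquake before it had
# strength > 7.0. Both inputs are assumed to be lists of tuples already
# ordered by time: (time, strength) and (time, volcano_name).

def eruptions_after_strong_quake(earthquakes, eruptions, threshold=7.0):
    result = []
    i = 0                      # cursor into the earthquake sequence
    last_quake = None          # most recent earthquake seen so far
    for t_eruption, volcano in eruptions:
        # Advance the earthquake cursor up to the eruption time.
        while i < len(earthquakes) and earthquakes[i][0] <= t_eruption:
            last_quake = earthquakes[i]
            i += 1
        if last_quake is not None and last_quake[1] > threshold:
            result.append((t_eruption, volcano))
    return result

# Only very little state (one earthquake) is kept in memory:
quakes = [(1, 6.5), (4, 7.3), (9, 5.0)]
erupts = [(2, "Etna"), (5, "Hekla"), (10, "Fuji")]
print(eruptions_after_strong_quake(quakes, erupts))   # [(5, 'Hekla')]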
[13] points out that sequence data need to be modelled as an abstract data type. Special operators such as sub-sequence selection, and aggregate functions such as sum, max, min, and moving average, should be associated with the data type. More importantly, the ordered semantics of sequences should be utilized in query optimization (e.g., stream processing) and storage management (e.g., clustering).
The ordering domain can be composed of any kind of ordered data such as integers, time stamps, etc. Each element in the ordering domain is called a position. Records can be of any data type such as floating-point values, strings, or even relational tables. Different records can be mapped to the same position, but every record can only be mapped to one position (i.e., a many-to-one relationship from records to positions).
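As a deliberately simplified illustration of this model, the Python sketch below represents a sequence as a mapping from positions to the records at each position; the class and its methods are my own naming, not the SEQ interface.

# Sketch of the data model described above: an ordered domain of positions,
# each record mapped to exactly one position, several records possibly
# sharing a position, and positions with no records acting as "holes".
# Class and method names are hypothetical, not the SEQ API.

class Sequence:
    def __init__(self):
        self._at = {}                        # position -> list of records

    def insert(self, position, record):
        self._at.setdefault(position, []).append(record)

    def records_at(self, position):
        return self._at.get(position, [])    # [] means a hole at this position

    def positions(self):
        return sorted(self._at)              # the occupied positions, in order

temps = Sequence()
temps.insert(1, 20.5)
temps.insert(1, 20.7)          # two records may share the same position
print(temps.records_at(2))     # []: position 2 is a hole (sparse sequence)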
Notice that there can be “holes” in the ordering domain, which results in sparse sequences. Sparse sequences correspond to real-life sequences where some measurement values are missing.
Operations over sequences include: 1) transform operators (apply a function fn to each record in the sequence); 2) binary operators (e.g., joining two sequences); 3) offset operators (e.g., shifts in the position or record domain); and 4) aggregate operators (e.g., moving average, max, min).
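To make these operator classes concrete, here are minimal Python stand-ins over sequences represented as lists of (position, value) pairs; they illustrate the concepts only and are not the SEQ operators themselves.

# Plain-Python illustrations of the four operator classes listed above.

def transform(seq, fn):
    """1) Transform operator: apply fn to every record."""
    return [(p, fn(v)) for p, v in seq]

def join(seq1, seq2):
    """2) Binary operator: positional join on matching positions."""
    d2 = dict(seq2)
    return [(p, (v, d2[p])) for p, v in seq1 if p in d2]

def offset(seq, k):
    """3) Offset operator: shift every record k positions later."""
    return [(p + k, v) for p, v in seq]

def moving_avg(seq, window):
    """4) Aggregate operator: moving average over the last `window` records."""
    out = []
    for i in range(len(seq)):
        vals = [v for _, v in seq[max(0, i - window + 1):i + 1]]
        out.append((seq[i][0], sum(vals) / len(vals)))
    return out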
[15] used the efficiency of the “scan” operator over a sequence as the criterion for a good physical implementation and concluded that “compressed array” was the best storage choice. The rest of the experiments are based on the compressed-array storage implementation.
The following examples illustrate sub-sequence selection, moving average, and zooming. Two stock price sequences Stock1 and Stock2 are used in the examples. Both sequences have the same schema: {time: Hour, high: Double, low: Double, volume: Integer} and are both ordered by time.
• Estimate the monetary value of Stock1 traded in each hour when the low
price fell below 50.
• Find the 24-hour moving average of the difference between the prices of the two stocks.
• Zoom:
PROJECT min(A.volume)
FROM Stock1 A
ZOOM days
The first example selects part of the sequence based on the condition that the low price was less than 50. The second example applies a 24-hour moving average over the whole sequence. The third example demonstrates the zooming operation (zooming from hours to days).
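For readers who prefer code to prose, the sketch below spells out what the first and third examples compute, in plain Python over a list of hourly records; the function names and record layout are my own, not SEQUIN.

# Illustrative Python versions of the selection underlying Example 1 and of
# the zoom in Example 3. A stock sequence is assumed to be a list of dicts
# with keys time (hour number), high, low, and volume.

def select_low_below(stock, limit=50.0):
    """Selection part of Example 1: hours where the low price fell below limit."""
    return [r for r in stock if r["low"] < limit]

def zoom_min_volume(stock, hours_per_day=24):
    """Example 3: zoom from hours to days, keeping min(volume) per day."""
    days = {}
    for r in stock:
        day = r["time"] // hours_per_day
        days[day] = min(days.get(day, r["volume"]), r["volume"])
    return sorted(days.items())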
Here I would like to point out something that is important and also related to my research. Notice that the first example could be executed much more efficiently if the IP-index [10] were available. Suppose that the IP-index is built on the “low” attribute of the stock sequence. Then the sub-sequences that satisfy “low < 50” can be constructed quickly, and the projection can be applied to the returned sub-sequences instead of to the whole sequence. Currently, the only way to process the first query in the SEQ system is to scan the whole sequence and apply the projection at the positions where “low < 50” is satisfied. This is yet another example showing that an inverse index [10] is very important in sequence query processing.
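The essential idea of such an inverse index, reduced to its simplest possible form, is sketched below; this is emphatically not the IP-index structure of [10], only an illustration of answering a value condition without a full sequential scan.

# Toy value-to-position index (NOT the IP-index): a sorted (value, position)
# list plus binary search maps a condition like "low < 50" to the positions
# that satisfy it, without scanning the whole sequence.

import bisect

class NaiveInverseIndex:
    def __init__(self, seq):                      # seq: list of (position, value)
        self._pairs = sorted((v, p) for p, v in seq)
        self._values = [v for v, _ in self._pairs]

    def positions_below(self, threshold):
        """Positions whose value is < threshold, found via binary search."""
        end = bisect.bisect_left(self._values, threshold)
        return sorted(p for _, p in self._pairs[:end])

idx = NaiveInverseIndex([(1, 51.0), (2, 49.5), (3, 48.0)])
print(idx.positions_below(50.0))   # [2, 3]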
For forward queries [10], they use weighted binary search to find the record at position i; there is no index in the inverse direction of the kind the IP-index provides.
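The following Python sketch shows one way such a weighted (interpolation-style) binary search on positions could look; it is my own illustration, not the actual SEQ code.

# Interpolation-style search over a sorted list of positions: probe where the
# target is expected to lie instead of always probing the middle.

def weighted_search(positions, target):
    """Return the index of `target` in the sorted list `positions`, or -1."""
    lo, hi = 0, len(positions) - 1
    while lo <= hi and positions[lo] <= target <= positions[hi]:
        if positions[hi] == positions[lo]:
            probe = lo
        else:  # interpolate the expected location of the target
            frac = (target - positions[lo]) / (positions[hi] - positions[lo])
            probe = lo + int(frac * (hi - lo))
        if positions[probe] == target:
            return probe
        if positions[probe] < target:
            lo = probe + 1
        else:
            hi = probe - 1
    return -1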
The SQL parser passes the SEQUIN sub-query to the SEQUIN parser. The SQL optimizer is called on the outer query block, and the SEQUIN optimizer is called on the nested query block. There is currently no optimization performed across query blocks belonging to different E-ADTs.
Notice again that this query would be executed more efficiently if the IP-index [10] were available. Suppose that the IP-index is built on the 24-hour moving average of the high price. Then the number of hours when this moving average was greater than 100 can be computed very quickly, since the IP-index stores cardinality information [11]. The counting can be accomplished efficiently without even going through all the positions that satisfy the condition, which is extremely important when the resulting sequences are large.
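To see why counting can be so much cheaper than enumerating, consider the toy Python sketch below; the IP-index keeps its cardinality information in its own structure [11], so this sorted-array version only illustrates the principle.

# Counting, rather than enumerating, the positions above a threshold:
# with a sorted array of values, the count is one binary search (O(log n)).

import bisect

def count_above(values_sorted, threshold):
    """Number of values strictly greater than threshold."""
    return len(values_sorted) - bisect.bisect_right(values_sorted, threshold)

avg_high = sorted([96.0, 101.5, 103.2, 99.9, 110.0])   # hypothetical data
print(count_above(avg_high, 100.0))                    # 3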
It is interesting to compare the SEQ system with Illustra [8]. Illustra supports sequences (more specifically, time series) as ADTs with a collection of methods. The above example would be expressed in Illustra as follows:
SELECT S.name,
       count(filter("time > 3500",
             filter("high > 100",
                   mov_avg(-23, 0,
                         project("time,high", S.stock_history)))))
FROM Stocks S;
Until now I have covered the main points in [13], [14], and [15] about supporting sequences in database systems. In the next section I will discuss storage management for large objects in disk-based database systems. The reason for discussing storage management for large objects is that sequences usually grow very large in real-life applications, which leads to the question of how to store these large sequences on disk in such a way that 1) random access to any position is fast, and 2) the dynamically growing nature of sequences is supported well.
The Starburst long field manager [9] is designed for moving the whole object fast (in fewer disk block accesses) rather than for random access to part of an object. It supports fast sequential read/write. For updates, only appending to and trimming at the end of the object are supported. The approach used is “buddy segments” and bitmap encoding. For details see [9].
It can be seen from the above discussion that the EXODUS storage management for large objects is the best choice for sequence implementation. This is because 1) it supports objects that grow dynamically (due to the B+-tree implementation); and 2) access or update in the middle of the object is as efficient as access or update at the end of the object. The second property is important for sequence data since operations on sequences often need random access. The Starburst long field manager is more suitable for managing other kinds of large objects, such as images, audio, and video, where operations over those objects require more sequential access than random access.
In SHORE [3] there is a persistent data type named “sequence”. The
rest of the section will discuss the implementation of this data type.
4.2.1 Implementation
When a sequence in main memory outgrows its allocated space, SHORE allocates another memory area of double the size of the old one and copies the old sequence to the newly allocated space. In this way the sequence grows, by repeated doubling, to any size required.
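A minimal Python sketch of this doubling strategy (the class is mine, written only to make the copy-on-overflow behaviour explicit) might look as follows.

# Grow-by-doubling: when the buffer is full, allocate twice the space and
# copy the old contents over, similar to how dynamic arrays grow in many
# languages.

class GrowableSequence:
    def __init__(self, capacity=4):
        self._buf = [None] * capacity    # pre-allocated storage
        self._len = 0                    # number of records actually stored

    def append(self, record):
        if self._len == len(self._buf):              # buffer is full:
            new_buf = [None] * (2 * len(self._buf))  # allocate double the space
            new_buf[:self._len] = self._buf          # copy the old sequence over
            self._buf = new_buf
        self._buf[self._len] = record
        self._len += 1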
When it is time to write the sequence from main memory to disk, the SHORE storage manager checks whether the sequence fits in one disk page. If so, the sequence is represented as a (page #, slot #) pair, the same way as a record stored on disk. If the sequence occupies multiple pages, it is represented as a B+-tree index on byte positions within the sequence, plus a collection of leaf (data) blocks. The size of a leaf block can be set to 1 to 4 contiguous disk pages. An example of a large sequence stored on disk is shown in FIGURE 2.
[FIGURE 2. A large sequence stored on disk: the object header (OID), internal B+-tree pages keyed on byte counts, leaf (data) blocks B1, B2, B3 stored on pages P1, P2, P3, and a chunk descriptor for a chunk C of leaf blocks.]

4.2.2 Performance
Random access to any position in a large sequence is efficient. This is because the level of the B+-tree is very low: no more than 3 levels are needed to hold objects of up to 8MB-4GB.
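The lookup that makes this possible can be pictured with the small Python sketch below; the node layout is a simplification for illustration, not the exact SHORE/EXODUS page format.

# Descend a byte-count tree: each internal node lists (byte_count, child)
# pairs, so a lookup subtracts counts until it reaches the leaf block that
# holds the requested byte position. Only a few nodes are touched.

def find_leaf(node, byte_pos):
    """Return (leaf_block, offset) for the leaf holding byte_pos."""
    while isinstance(node, list):            # internal node: [(byte_count, child), ...]
        for count, child in node:
            if byte_pos < count:
                node = child                 # descend into this child
                break
            byte_pos -= count                # skip past this child's bytes
        else:
            raise IndexError("byte position beyond the end of the object")
    return node, byte_pos                    # leaf reached (here: a bytes object)

# Two leaf blocks of 120 and 162 bytes under a single internal page:
tree = [(120, b"A" * 120), (162, b"B" * 162)]
block, offset = find_leaf(tree, 130)
print(block[:1], offset)                     # b'B' 10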
5 Conclusions
This report summarizes what I have studied recently on supporting large sequences in DBMSs, covering issues such as data model, query language, and storage implementation. The sequence database system SEQ was introduced, and the storage management of EXODUS (SHORE) for large dynamic objects was presented. I also pointed out some issues related to my own research work.
References
[1] M. Astrahan et al.: “System R: Relational Approach to Database Management”, ACM TODS, Vol. 1, No. 2, June 1976.
[2] F. Bancilhon, C. Delobel, and P. Kanellakis (eds.): “Building an Object-Oriented Database System: The Story of O2”, Morgan Kaufmann Publishers, 1992.
[3] M. J. Carey, D. J. DeWitt, M. J. Franklin, N. E. Hall, M. L. McAuliffe, J. F. Naughton, D. T. Schuh, M. H. Solomon, C. K. Tan, O. G. Tsatalos, S. J. White, and M. J. Zwilling: “Shoring Up Persistent Applications”, in Proc. of the 1994 ACM SIGMOD Conf. on the Management of Data, Minneapolis, MN, May 1994.
[4] M. J. Carey, D. J. DeWitt, J. E. Richardson, and E. J. Shekita: “Storage Management for Objects in EXODUS”, in “Object-Oriented Concepts, Databases, and Applications”, W. Kim and F. Lochovsky (eds.), Addison-Wesley Publishing Co., 1989.
[5] M. J. Carey, D. J. DeWitt, J. E. Richardson, and E. J. Shekita: “Object and File Management in the EXODUS Extensible Database System”, in Proc. of the 12th VLDB Conf., Kyoto, Japan, 1986.
[6] H.-T. Chou et al.: “Design and Implementation of the Wisconsin Storage System”, in Software Practice and Experience, Vol. 15, No. 10, Oct. 1985.
[7] D. J. DeWitt, N. Kabra, J. Luo, J. M. Patel, and J. Yu: “Client-Server Paradise”, in Proc. of the VLDB Conf., Santiago, Chile, 1994.
[8] Illustra Information Technologies, Inc.: Illustra User’s Guide, June 1994.
[9] T. J. Lehman and B. G. Lindsay: “The Starburst Long Field Manager”, in Proc. of the 15th VLDB Conf., Amsterdam, 1989.
[10] L. Lin, T. Risch, M. Sköld, and D. Badal: “Indexing Values of Time Sequences”, in Proc. of the 5th International Conference on Information and Knowledge Management, pp. 223-232, Rockville, Maryland, Nov. 1996.
[11] L. Lin: “A Value-Based Indexing Technique For Time Sequences”, Lic. Thesis No. 597, Linköping University, Jan. 1997, ISBN 91-7871-888-0.