Nested Intervals Tree Encoding in SQL

Vadim Tropashko
Oracle Corp.

SIGMOD Record, Vol. 34, No. 2, June 2005, pp. 47-52

Abstract

Nested Intervals generalize Nested Sets. They are immune to the hierarchy reorganization problem. They allow answering ancestor path hierarchical queries algorithmically, without accessing the stored hierarchy relation.

1 Introduction

There are several SQL techniques for querying graph structures in general, and trees in particular [2]. They can be classified into 2 major categories: Hierarchical/recursive SQL extensions and Tree encodings. This article focuses on tree encodings.

Tree encoding methods themselves can be split into 2 groups: Materialized Path and Nested Sets.

Materialized Path is a nearly ubiquitous encoding, where each tree node is labeled with the path from the node to the root. UNIX global filenames are a well-known showcase for this idea. A Materialized Path can be represented either as a character string of unique sibling identifiers (concatenated with some separator), or enveloped into a user defined type [5].

Querying trees with the Materialized Path technique doesn't appear especially elegant. It implies either string matching like this

    select e1.ename from emp e1, emp e2
    where e2.path like e1.path || '%'
    and e2.ename = 'FORD'

or leveraging complex data types that are the realm of Object-Relational Databases. The alternative tree encoding, Nested Sets [2], labels each node with just a pair of integers. The ancestor-descendant relationship is reflected by the subset relation between intervals of integers, which provides a very intuitive base for hierarchical queries.

Although Nested Sets are certainly appealing to many database developers, they have 2 fundamental disadvantages:

1. The encoding is volatile. Roughly half of the tree nodes have to be relabeled whenever a new node is inserted.

2. Querying ranges is asymmetric from a performance perspective. It is easy to answer whether a point falls inside some interval, but it is hard to index a set of intervals that contain a given point. For nested sets this translates into a difficulty answering queries about a node's ancestors.

[6] introduced Nested Intervals that generalize Nested Sets. Since the Nested Sets encoding with integers admits only finite gaps for new node insertions, it is natural to use a dense domain such as the rational numbers. One particular encoding schema with Dyadic rational numbers was developed in the rest of that article, and was a subject of further improvements in the follow-up articles. The Dyadic rational encoding has many nice theoretical properties, and essentially is a numeric reflection of Materialized Path. It has, however, one significant flaw from a practical perspective. Dyadic fractions utilize the domain of integer numbers rather uneconomically, so that numeric overflow prevents tree scaling to any significant size.

In general, Nested Intervals allow a certain freedom in choosing a particular encoding schema. [7] developed an alternative encoding with Farey fractions. The development continued in [8].

This article expands the perspective. It demonstrates why both methods are natural choices, and describes the mapping between those tree encodings. It goes on to explore different ways of establishing interval structure. The major result is introducing Path Matrices and exposing their properties.

A similar idea of leveraging the Stern-Brocot tree is briefly mentioned in [1]. The article, however,
contains just a hint, while pursuing some other venue. [3] is, perhaps, the earliest reference in the database literature referring to a Continued Fractions encoding. The manuscript is unavailable to the author to draw a detailed comparison with his method.

2 Nested Intervals Queries

Nested Intervals encode each tree node with a pair of numbers, head and tail. The interval for a child node is always contained within the parent interval. With this labeling the transitive closure can be queried like this

    select e1.ename, e2.ename
    from emp e1, emp e2
    where e2.head >= e1.head
    and e2.head < e1.tail

Next, the subtree of all the descendants of a node can be found by just restricting the above view with a single table predicate

    select e2.ename from emp e1, emp e2
    where e2.head >= e1.head
    and e2.head < e1.tail
    and e1.ename = 'SCOTT'

The ancestor path can be queried symmetrically

    select e1.ename from emp e1, emp e2
    where e2.head >= e1.head
    and e2.head < e1.tail
    and e2.ename = 'SCOTT'

There is a subtle problem with the last query, however. Finding all the intervals that cover a given point is difficult. Although there are specialized indexing schemes like R-Tree, none of them is as universally accepted as B-Tree.

Compare this to the descendants query, assuming that the subtree of SCOTT's subordinates is small. The execution path in this case is very efficient: first, the e1 record is fetched by the unique index, and then all the e2 records are fetched by an index range scan.

The details of the Nested Intervals encoding are developed in the next sections. The encoding is algorithmic. Given a child node label, the parent encoding can be calculated, not queried. Therefore, the whole path to the root node can be calculated. Hence, if we know the tree node encodings for all nodes on the path, then the nodes themselves can be efficiently queried in the database.

3 Interval Halving

The easiest way to nest intervals is splitting the parent interval into two halves. If we start with the points 0 and 1 and continue halving the intervals iteratively, what kind of numbers would be produced on the interval boundaries? Clearly, the ones whose denominator is a power of 2, or simply dyadic fractions [4].

    0/1                             1/1
                    1/2
            1/4             3/4
        1/8     3/8     5/8     7/8

    Fig.1. Dyadic Fractions at Conway tree.

When splitting the interval [head,tail] into two, the point on the boundary is the average (head+tail)/2. Alternatively, we could have chosen the mediant:

$$\frac{head\_numer + tail\_numer}{head\_denom + tail\_denom}$$

If we start with the points 0 and 1 and continue on, then the Stern-Brocot tree of Farey fractions would be produced.

    0/1                             1/1
                    1/2
            1/3             2/3
        1/4     2/5     3/5     3/4

    Fig.2. Farey fractions at Stern-Brocot tree.

The bijection between the Dyadic and Stern-Brocot trees is defined by the Minkowski Question Mark function ?: [0,1] → [0,1] [9].



If x has binary expansion .00...011...100...011...1..., where there are a zeros in the first block, then b ones in the second, then c zeros, and so on, then ?(x) is the (simple) continued fraction

$$?(x) = \cfrac{1}{a+1+\cfrac{1}{b+\cfrac{1}{c+\cdots}}}$$

Thus, for example, if x = 1/4, its two binary expansions .0100000... and .00111111... yield the two expressions

$$\cfrac{1}{1+1+\cfrac{1}{1+\cfrac{1}{\infty}}} = \cfrac{1}{2+1+\cfrac{1}{\infty}}$$

Therefore, ?(1/4)=1/3. Note that the node 1/4 is positioned in the Dyadic tree on Fig.1 in the same place where the node 1/3 is in the Stern-Brocot tree on Fig.2. (In the more common convention the Question Mark function denotes the inverse of this map, so that ?(1/3)=1/4.)

4 Nested Interval Structure

In the previous section we developed two alternative but isomorphic systems for generating interval boundary points. What intervals should we consider? Clearly, including all possible intervals into our system would be too much. In the Farey case (Fig.2), for example, the interval [1/3,1/2] would have at least two parents: [1/3,2/3] and [0/1,1/1].

What if we limit the scope to only those intervals that correspond to the edges at Fig.1? If we consider solid and dashed lines, then there still would be too many intervals. Consider the interval [1/3,2/5] (Fig.2). How many siblings does it have? Well, no more than one: [2/5,1/2]. Indeed, no other interval has [1/3,1/2] as a parent.

If we consider solid lines only (with two additional convenience intervals at the top), then we'll get the system of Nested Intervals shown at Fig.3.

    0/1                                                 1/1
                              1/2
                  1/3                     2/3
            1/4         2/5         3/5         3/4
         1/5   2/7   3/8   3/7   4/7   5/8   5/7   4/5

    Fig.3. Simple Farey interval structure.

The Dyadic Nested Interval structure is isomorphic to Fig.3; we omit the picture in order to save space. The reason why we preferred the Farey over the Dyadic case will become evident in the last section. It will also become apparent why it's called "simple".

The other possible way to introduce a Nested Interval structure is shown at Fig.4, this time with dyadic encoding.

    0/1                                                         1/1
                                  1/2
                  1/4                             3/4
          1/8             3/8             5/8             7/8
      1/16   3/16    5/16    7/16    9/16   11/16   13/16   15/16

    Fig.4. Monotonic dyadic interval structure.

Algorithms for navigating Dyadic Nested Intervals are almost obvious:

1. Younger sibling [head,tail] encoding is:

$$\left[\frac{2\,head\_numer + 1}{2\,head\_denom},\ \frac{2\,tail\_numer + 1}{2\,tail\_denom}\right]$$

2. Older sibling [head,tail]:

$$\left[\frac{head\_numer - 1}{head\_denom},\ \frac{tail\_numer - 1}{tail\_denom}\right]$$



(Be careful, however, when applying this rule to the first child.)

3. Parent of the first child:

$$\left[\frac{head\_numer}{head\_denom},\ \frac{tail\_numer + 1}{tail\_denom}\right]$$

Farey Intervals are a little bit more sophisticated.

5 The Path Matrix

Let's study the simple Farey interval structure (Fig.3) in more detail. Consider the interval [5/7,3/4]. It is the first child of [2/3,3/4]. Then, [2/3,3/4] is the second child of [1/2,1/1]. Finally, [1/2,1/1] is the first child of [0/1,1/1]. Therefore, the materialized path encoding of [5/7,3/4] is 1.2.1. However, we have 4 intervals, i.e. 4 nodes, while the materialized path has length 3. How can that be?

Could the interval [0/1,1/1] be considered as somebody else's child too? Well, yes and no. The obvious parent candidate is [0/1,1/0]. Then, the interval [1/1,2/1] is the second child of [0/1,1/0]! The amended materialized path encoding for [5/7,3/4] is 1.1.2.1 and, by the way, we are also able to find Farey interval encodings for materialized paths beginning with natural numbers other than 1.

Here is the formal procedure converting a Farey encoding into a materialized path. Start with the Farey interval written as a 2x2 matrix:

$$\begin{pmatrix} 3 & 5 \\ 4 & 7 \end{pmatrix}$$

Note that we switched the fractions. The purpose is to keep the highest integer in the lower right corner. Next, write

$$5 = 3\cdot\lfloor 5/3\rfloor + 5 \bmod 3 = 3\cdot 1 + 2$$
$$7 = 4\cdot\lfloor 7/4\rfloor + 7 \bmod 4 = 4\cdot 1 + 3$$

The integer division result (which is 1, the same in both cases) is the first element of the materialized path encoding. We add the column with the remainders to the left

$$\begin{pmatrix} 2 & 3 & 5 \\ 3 & 4 & 7 \end{pmatrix}$$

The matrix on the left

$$\begin{pmatrix} 2 & 3 \\ 3 & 4 \end{pmatrix}$$

corresponds to the interval [2/3,3/4], the parent of our original interval.

Continuing to the left we get

$$\begin{pmatrix} 0 & 1 & 2 & 3 & 5 \\ 1 & 1 & 3 & 4 & 7 \end{pmatrix}$$

together with the sequence of integer division results 1, 1, 2, and 1. We stop as soon as a zero appears in the top left corner.

Next, we can expand this number wall up. The rule is the same but applied to rows instead of columns. In our example, we write

$$7 = 5\cdot\lfloor 7/5\rfloor + 7 \bmod 5 = 5\cdot 1 + 2$$
$$4 = 3\cdot\lfloor 4/3\rfloor + 4 \bmod 3 = 3\cdot 1 + 1$$
$$3 = 2\cdot\lfloor 3/2\rfloor + 3 \bmod 2 = 2\cdot 1 + 1$$
$$1 = 1\cdot\lfloor 1/1\rfloor + 1 \bmod 1 = 1\cdot 1 + 0$$

Adding the row 0,1,1,2 at the top results in

$$\begin{array}{ccccc}
  & 0 & 1 & 1 & 2 \\
0 & 1 & 2 & 3 & 5 \\
1 & 1 & 3 & 4 & 7
\end{array}$$

Continuing this process, we get the following number wall

$$\begin{array}{ccccc}
  &   &   & 0 & 1 \\
  &   & 0 & 1 & 1 \\
  & 0 & 1 & 1 & 2 \\
0 & 1 & 2 & 3 & 5 \\
1 & 1 & 3 & 4 & 7
\end{array}$$

Note that all 2x2 sub-matrices of this wall have determinant 1 or -1. This property enforces a unique way of completing the wall to a square matrix.
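
The column-by-column reduction described above can be sketched in a few lines of Python. The function name farey_to_path and its return shape are illustrative, not the article's. One assumption is labeled in the code: the article's first step has both rows giving the same integer division result, and when the two rows disagree (as in the step from columns 1,1 and 2,3 to column 0,1) this sketch takes the smaller quotient, which reproduces the article's numbers.

```python
def farey_to_path(top, bottom):
    """Reduce a 2x2 Farey matrix column by column.

    `top` and `bottom` are the matrix rows, e.g. (3, 5) and (4, 7) for the
    interval [5/7, 3/4]. Returns the integer division results in the order
    they are produced, plus the two-row number wall.
    """
    (a, b), (c, d) = top, bottom
    columns = [(b, d), (a, c)]           # collected right to left
    quotients = []
    while a != 0:
        # Assumption of this sketch: if the rows disagree, use the smaller
        # quotient so that both remainders stay non-negative.
        q = min(b // a, d // c)
        quotients.append(q)
        # The remainders become the new left column; the old left column
        # becomes the new right column.
        a, b, c, d = b - q * a, a, d - q * c, c
        columns.append((a, c))
    # A zero has appeared in the top left corner: one last division
    # on the bottom row, then stop.
    quotients.append(d // c)
    columns.reverse()
    wall = ([t for t, _ in columns], [u for _, u in columns])
    return quotients, wall


quotients, wall = farey_to_path((3, 5), (4, 7))
print(quotients)
print(wall[0])
print(wall[1])
```

For the interval [5/7,3/4] the intermediate matrices visited by the loop include the parent matrix with rows 2,3 and 3,4, in line with the walk-through above.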



Completed to a square, the wall becomes

$$\begin{pmatrix}
1 & 0 & 1 & 0 & 1 \\
0 & 1 & 0 & 1 & 1 \\
1 & 0 & 1 & 1 & 2 \\
0 & 1 & 2 & 3 & 5 \\
1 & 1 & 3 & 4 & 7
\end{pmatrix}$$

We refer to the number wall that we have just built as the Path Matrix. It enjoys many nice properties.

1. The numbers on the main antidiagonal are all 1s.
2. The sequence of numbers below the main antidiagonal is the materialized path. The path is oriented from right to left.
3. Adjacent 2x2 sub-matrices can be multiplied as shown on Fig.5

    [Fig.5: three copies of the Path Matrix above, one per case.]

    Fig.5. Multiplying adjacent matrices.

The matrix identity in the middle case on Fig.5, for example, is

$$\begin{pmatrix} 1 & 2 \\ 1 & 3 \end{pmatrix} \cdot \begin{pmatrix} 0 & 1 \\ 1 & 1 \end{pmatrix} = \begin{pmatrix} 2 & 3 \\ 3 & 4 \end{pmatrix}$$

Multiplying overlapping matrices, similar to the last case

$$\begin{pmatrix} 0 & 1 \\ 1 & 1 \end{pmatrix} \cdot \begin{pmatrix} 0 & 1 \\ 1 & 2 \end{pmatrix} = \begin{pmatrix} 1 & 2 \\ 1 & 3 \end{pmatrix}$$

works for matrices on the main antidiagonal only.

Iterative application of the matrix multiplication property gives rise to the following matrix decomposition

$$\begin{pmatrix} 0 & 1 \\ 1 & 1 \end{pmatrix} \cdot \begin{pmatrix} 0 & 1 \\ 1 & 2 \end{pmatrix} \cdot \begin{pmatrix} 0 & 1 \\ 1 & 1 \end{pmatrix} \cdot \begin{pmatrix} 0 & 1 \\ 1 & 1 \end{pmatrix} = \begin{pmatrix} 3 & 5 \\ 4 & 7 \end{pmatrix}$$

Each of the matrices on the left side corresponds to an elementary fragment of the materialized path 1.1.2.1. Since these elementary matrices all have determinant -1, their product always has determinant -1 or 1, the property that we noticed earlier.

The determinant property allows filling in the numbers in the Path Matrix in the other direction. Suppose we know the materialized path and want to calculate the corresponding Farey interval. One way is multiplying elementary matrices, leveraging the above matrix decomposition identity. Alternatively, we can start with a partially filled in Path Matrix. By properties 1 and 2 we have

$$\begin{array}{ccccc}
1 & 0 & 1 & 0 & 1 \\
0 & 1 & 0 & 1 & 1 \\
1 & 0 & 1 & 1 &   \\
0 & 1 & 2 &   &   \\
1 & 1 &   &   &
\end{array}$$

We fill in the empty positions as follows. Select a 2x2 matrix that has 3 elements defined and the 4th element empty. Fill in the empty position to satisfy the determinant property. The sign of the determinant is alternating. It is negative if the matrix is positioned at an even distance from the main antidiagonal, and positive otherwise. In our case, the matrix is just one step away from the position at the main antidiagonal. Therefore, the value x at the empty position is determined by requiring that the determinant of the selected matrix equals 1.
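
Both routes from a materialized path to its Farey matrix can be sketched in Python. The names fill_path_matrix and elementary_product are illustrative. The sketch makes three assumptions not stated in the article: the path has at least two elements, only the cells on and below the main antidiagonal are filled (the upper left corner is left as None), and the order of factors in elementary_product is read off the decomposition shown for 1.1.2.1.

```python
def fill_path_matrix(path):
    """Fill the lower right part of the Path Matrix using the determinant rule."""
    n = len(path) + 1
    if n < 3:
        raise ValueError("this sketch expects a path with at least two elements")
    m = [[None] * n for _ in range(n)]
    for i in range(n):
        m[i][n - 1 - i] = 1                  # property 1: antidiagonal of 1s
    for i in range(1, n):
        m[i][n - i] = path[i - 1]            # property 2: path just below it
    # Walk the remaining antidiagonals towards the lower right corner.
    for total in range(n + 1, 2 * n - 1):
        # One step away from the antidiagonal the determinant is +1,
        # two steps away it is -1, and so on.
        det = 1 if (total - n) % 2 == 1 else -1
        for i in range(total - n + 1, n):
            j = total - i
            # Solve m[i-1][j-1] * x - m[i-1][j] * m[i][j-1] = det for x.
            m[i][j] = (det + m[i - 1][j] * m[i][j - 1]) // m[i - 1][j - 1]
    return m


def elementary_product(path):
    """Multiply the elementary matrices [[0, 1], [1, p]], last path element first."""
    result = [[1, 0], [0, 1]]
    for p in reversed(path):
        e = [[0, 1], [1, p]]
        result = [[sum(result[r][k] * e[k][c] for k in range(2))
                   for c in range(2)] for r in range(2)]
    return result


matrix = fill_path_matrix([1, 1, 2, 1])
corner = [row[-2:] for row in matrix[-2:]]   # lower right 2x2 block
print(corner)
print(elementary_product([1, 1, 2, 1]))
```

The columns of the lower right block, read as numerator over denominator, give the two interval boundaries.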



In our example the determinant equation is

$$1\cdot x - 1\cdot 2 = 1$$

hence x=3, as expected.

After all the empty positions are filled, we can grab the 2x2 Farey interval matrix at the lower right corner.

The final important property of the Path Matrix is that matrix transposition corresponds to materialized path inversion.

6 Continued Fractions

[8] suggests one more perspective on Farey interval encoding. The materialized path 1.1.2.1 can be naturally written as the simple continued fraction

$$\cfrac{1}{1+\cfrac{1}{1+\cfrac{1}{2+\cfrac{1}{1+x}}}}$$

which can be simplified into the Moebius function

$$\frac{4+3x}{7+5x}$$

Here the familiar 2x2 matrix from our example can be recognized.

Simple continued fractions have the somewhat irritating feature that increasing any denominator either increases the value of the number or decreases it, depending on the parity of the position. Reversed (or additive) continued fractions

$$\cfrac{1}{1+1-\cfrac{1}{1+1-\cfrac{1}{2+1-\cfrac{1}{1+1-x}}}}$$

are monotonic. The path matrix theory for the additive continued fractions mimics the classic case described above. One of the distinguishing features of additive continued fractions is that all the matrices have negative entries in the second column, and determinant equal to 1. There is no alternation anymore: the interval encoding of the younger child always precedes the older one. (In the simple continued fractions case this was true for the odd levels, and reversely true for the even ones.) Finally, additive continued fractions map into the monotonic Farey interval structure.

References

[1] D. Aioanei, A. Malinaru. General trees persisted in relational databases. http://www.codeproject.com/cs/database/persisting_trees.asp?print=true

[2] J. Celko. Joe Celko's Trees and Hierarchies in SQL for Smarties. Morgan Kaufmann.

[3] P. Ciaccia, D. Maio, and P. Tiberio. A method for hierarchy processing in relational systems. Information Systems, 14(2):93-105, 1989.

[4] J. Conway. On Numbers and Games. New York: Academic Press, Inc.

[5] J. Roy. 2003. Using the Node Data Type to Solve Problems with Hierarchies in DB2 Universal Database. http://www106.ibm.com/developerworks/db2/library/techarticle/0302roy/0302roy.html

[6] V. Tropashko. Trees in SQL: Nested Sets and Materialized Path. http://www.dbazine.com/tropashko4.shtml

[7] V. Tropashko. Nested Intervals with Farey Fractions. http://arxiv.org/html/cs.DB/0401014

[8] V. Tropashko. Nested Intervals Tree Encoding with Continued Fractions. http://arxiv.org/pdf/cs.DB/0402051

[9] L. Vepstas. The Minkowski Question Mark and the Modular Group SL(2,Z). http://www.linas.org/math/chap-minkowski/chap-minkowski.html
