
H-0245 (H0611-006) November 12, 2006

Computer Science

IBM Research Report


B-trees, Shadowing, and Clones

Ohad Rodeh
IBM Research Division
Haifa Research Laboratory
Mt. Carmel 31905
Haifa, Israel

Research Division
Almaden - Austin - Beijing - Haifa - India - T. J. Watson - Tokyo - Zurich

LIMITED DISTRIBUTION NOTICE: This report has been submitted for publication outside of IBM and will probably be copyrighted if accepted for publication. It has been issued as a Research Report for early dissemination of its contents. In view of the transfer of copyright to the outside publisher, its distribution outside of IBM prior to publication should be limited to peer communications and specific requests. After outside publication, requests should be filled only by reprints or legally obtained copies of the article (e.g., payment of royalties). Copies may be requested from IBM T. J. Watson Research Center, P.O. Box 218, Yorktown Heights, NY 10598 USA (email: [email protected]). Some reports are available on the internet at http://domino.watson.ibm.com/library/CyberDig.nsf/home.
B-trees, Shadowing, and Clones

Ohad Rodeh, IBM Research

Abstract

B-trees are used by many file-systems to represent files and directories. They provide guaranteed logarithmic-time key-search, insert, and remove. Shadowing, or copy-on-write, is used by other file-systems to implement snapshots, crash-recovery, write-batching, and RAID. Serious difficulties arise when trying to use b-trees and shadowing in a single system.

This paper is about a set of b-tree algorithms that respects shadowing, achieves good concurrency, and implements cloning (writeable snapshots). Our cloning algorithm is efficient and allows the creation of an essentially unlimited number of clones. These algorithms were used in an experimental object-disk.

We believe that this work is applicable not only to object-disks but also to other file-systems.

1 Introduction

B-trees [18] are used by several file systems [3, 20, 10, 9] to represent files and directories. Compared to traditional i-nodes [14], b-trees offer guaranteed logarithmic-time key-search, insert, and remove. Furthermore, b-trees can represent sparse files well.

Shadowing is a technique used by some file systems to ensure atomic updates to persistent data-structures [2, 6, 11, 15, 22]. It is a powerful mechanism that has been used to implement snapshots, crash-recovery, write-batching, and RAID. The basic scheme is to look at the file system as a large tree made up of fixed-sized pages. Shadowing means that to update an on-disk page, the entire page is read into memory, modified, and later written to disk at an alternate location. When a page is shadowed its location on disk changes; this creates a need to update (and shadow) the immediate ancestor of the page with the new address. Shadowing propagates up to the file system root. Figure 1 shows an initial file system with root A that contains seven nodes. After leaf node C is modified, a complete path to the root is shadowed, creating a new tree rooted at A'. Nodes A, B, and C become unreachable and will later be deallocated.

Figure 1: Modifying a leaf requires shadowing up to the root.

In order to support snapshots the file system allows having more than a single root. Each root node points to a tree that represents a valid image of the file system. For example, if we were to decide to perform a snapshot prior to modifying C, then A would have been preserved as the root of the snapshot. Only upon the deletion of the snapshot would it be deallocated. The upshot is that pages can be shared between many snapshots; indeed, whole subtrees can be shared between snapshots.

This work was performed as part of research into building an object-disk storage device (OSD) [21, 4]. Roughly speaking, an OSD is a primitive file system that is exported through a network interface using a standardized protocol. It was desirable to (1) use b-trees to implement the persistent OSD data-structures and (2) use shadowing for update and snapshots. This would combine the good properties of both techniques: logarithmic-access data-structures coupled with simple logging, crash-recovery, and snapshots. However, we ran into serious difficulties when trying to use these techniques together.

The classic persistent recoverable b-tree, as described in the literature [5, 8], is updated using a bottom-up procedure. Modifications are applied to leaf nodes. Rarely,
leaf-nodes split or merge, in which case changes propagate to the next level up. This can occur recursively and changes can propagate high up into the tree. Leaves are chained together to facilitate re-balancing operations and range lookups. There are good concurrency schemes allowing multiple threads to update a tree; the best one is currently b-link trees [17].

The main issues when trying to apply shadowing to the classic b-tree are:

Leaf chaining: In a regular b-tree leaves are chained together. This is used for tree rebalancing and range lookups. In a b-tree that is updated using copy-on-write, leaves cannot be linked together. For example, Figure 2 shows a tree whose right-most leaf node is C and where the leaves are linked from left to right. If C is updated, the entire tree needs to be shadowed. Without leaf-pointers only C, B, and A require shadowing.

Figure 2: A tree whose leaves are chained together. The right-most leaf is C and B is its immediate ancestor. If C is modified, the entire tree has to be shadowed.

Without links between leaves much of the b-tree literature becomes inapplicable.

Concurrency: In a regular b-tree, in order to add a key to a leaf L, in most cases only L needs to be locked and updated. When using shadowing, every change propagates up to the root. This requires exclusive locking of top nodes, making them contention points. Shadowing also excludes b-link trees, because b-link trees rely on in-place modification of nodes as a means to delay split operations.

Modifying a single path: Regular b-trees shuffle keys between neighboring leaf nodes for re-balancing purposes after a remove-key operation. When using copy-on-write this habit could be expensive. For example, Figure 3 shows a tree with leaf node A and neighboring nodes R and L. A key is removed from node A and the modified path includes 3 nodes (shadowed in the figure). If keys from node L were moved into A, then an additional tree-path would need to be shadowed. It is better to shuffle keys from R.

Figure 3: Removing a key and the effects of re-balancing and shadowing.

This work describes the first b-tree construction that can coexist with shadowing while providing good concurrency. This is a fundamental result because b-trees and shadowing are basic file system techniques.

Our cloning algorithm improves upon the state of the art. We support a large number of clones and allow good concurrency when accessing multiple clones that share blocks. Cloning is a fundamental user requirement; supporting it efficiently is important in today's file systems.

The rest of this paper is organized as follows: Section 2 is about related work, Section 3 describes the experimental object-disk, Section 4 discusses recoverability, Section 5 describes the basic algorithms, Section 6 describes cloning, Section 7 describes the run-time system, Section 8 shows performance, Section 9 discusses future work, and Section 10 summarizes.

2 Related work

There is an alternate style of copy-on-write used in some databases [12] that is beyond the scope of this paper. The only form of shadowing discussed here is the one described in Section 1.

There are few papers that discuss concurrency together with recoverability of b-trees; see [5, 8]. The challenge in constructing an algorithm that achieves both goals is that recoverability severely constrains concurrency. Furthermore, we found no published papers on concurrency, recoverability, and b-trees with the added constraint of shadowing.

This work makes use of top-down b-trees; these were first described in [13], [25], and [23].

Some file-systems use b-trees to represent directories [3, 20, 9]. One of the difficulties in handling directory entries is that, unlike b-tree values in the OSD, they are variable size. We believe that the b-trees described here can be adapted to handle variable-size values.
It is possible to use disk-extents instead of fixed-sized blocks [3, 20, 9]. We have experimented with using extents in the OSD. The resulting algorithms are similar in flavor to those reported here and we do not describe them for brevity.

The WAFL system [6] has a cloning algorithm that is closest to ours. Although the basic WAFL paper discusses read-only snapshots, the same ideas can be used to create clones. WAFL has two main limitations which we improve upon:

1. WAFL is limited to 32 snapshots.

2. In WAFL the free-space bits of all blocks that belong to a newly created snapshot need to be set to 1 as part of snapshot creation.

In our algorithm:

1. The limit is 2^32 - 1 clones for the same space cost as WAFL.

2. Only the children of the root of the clone are involved in snapshot creation. Free-space map operations are performed gradually through the lifetime of the clone.

There is a long-standing debate whether it is better to shadow or to write in place. The wider discussion is beyond the scope of this paper. This work is about a b-tree technique that works well with shadowing.

3 Object-disk (OSD)

An object-disk, according to the SNIA/T10 specification [21, 4], is essentially a primitive file system that is exported through a network interface using a standardized protocol. The OSD contains objects, which have 64-bit names, and an object-catalog that indexes them. There are no directories in an OSD. Objects are much like files in a regular file system except that they are likely to be sparse. The important commands are: create, delete, read, and write object. Snapshots are also supported. There is a command that creates a writeable snapshot of the OSD object-system.

Our group built an experimental object-disk. The two main types of persistent data-structures in our OSD are objects and the object-catalog. B-trees were a natural fit for these two data structures.

The design point for the OSD was that it would be part of a storage controller. Such systems typically have severe limits on memory and CPU cycles. Therefore, it was important to try to maintain a small footprint. Using a generic database with full-fledged transactions was not possible.

Figure 4: An object-catalog that points to objects A and B.

Space is managed in 4KB blocks, also called pages. Disk addresses are represented by 64 bits to accommodate large disks. An object contains data blocks that are aggregated using a b-tree. The b-tree maps offsets in the object to data pages on disk. The object-catalog (OCAT) maps an object-id to the root page of the b-tree for that object; the OCAT is also a b-tree. Figure 4 shows an example where the catalog points to two objects: A and B. The b-tree index nodes are laid out on 4KB meta-data pages.

The b-trees representing objects and the catalog both map 64-bit keys to 64-bit values; more generally, fixed-sized keys to fixed-sized values.

Section 5 describes the b-tree algorithms for create, delete, lookup-key, remove-key, and insert-key. Here, we show how several object-disk commands are mapped to these operations.

A create-object(B) command is implemented by:

  * B-root = lookup-key(OCAT-root, B)
  * if (B-root != 0)
      - return obj-already-exists
    else
      - B-root = create-tree()
      - insert-key(OCAT-root, B, B-root)

A delete-object(B) command is implemented by:

  * B-root = lookup-key(OCAT-root, B)
  * if (B-root == 0)
      - return obj-does-not-exist
    else
      - remove-key(OCAT-root, B)
      - delete-tree(B-root)
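To make this mapping concrete, the small sketch below models the object-catalog and each object's offset map as plain Python dictionaries standing in for the two kinds of b-trees. It mirrors the pseudocode above but is only illustrative: it ignores shadowing, logging, and concurrency, and the names are ours, not the paper's implementation.

    def create_object(ocat, oid):
        # lookup-key(OCAT-root, B)
        if oid in ocat:
            return "obj-already-exists"
        # create-tree() followed by insert-key(OCAT-root, B, B-root)
        ocat[oid] = {}
        return "ok"

    def delete_object(ocat, oid):
        # lookup-key(OCAT-root, B)
        if oid not in ocat:
            return "obj-does-not-exist"
        # remove-key(OCAT-root, B) and delete-tree(B-root)
        del ocat[oid]
        return "ok"

    # Example:
    ocat = {}
    assert create_object(ocat, 42) == "ok"
    assert create_object(ocat, 42) == "obj-already-exists"
    assert delete_object(ocat, 42) == "ok"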
A read(B, ofs, len=4KB) command, when the offset is 4KB-aligned, is implemented by:

  * B-root = lookup-key(OCAT-root, B)
  * if (B-root == 0)
      - return obj-does-not-exist
    else
      - addr = lookup-key(B-root, ofs)
      - if (addr == 0), data = zeros
      - otherwise, read data from addr
      - send data to client

A write(B, ofs, len=4KB, data) command, when the offset is 4KB-aligned, is implemented by:

  * B-root = lookup-key(OCAT-root, B)
  * if (B-root == 0)
      - return obj-does-not-exist
    else
      - addr = allocate 4KB on disk
      - write data to disk at addr
      - insert-key(B-root, ofs, addr)

In order to support writable snapshots on the OSD we chose to support cloning at the b-tree level. To clone a b-tree means to create a writable copy of it that allows all operations: lookup, insert, remove, and delete. This is described in Section 6.

The expected workload is mostly read/write commands; create/delete commands are less frequent. Clone commands are assumed to be infrequent. In terms of b-tree operations this translates into a high frequency of lookup-key/insert-key and a low frequency of remove-key/create/delete.

The OSD handles multiple commands concurrently. When a command arrives, the runtime system checks if there are enough resources to execute it. If so, the resources are reserved and the command is logged and executed; otherwise, the command is rejected. In order to simplify the runtime system we chose not to abort or roll back commands. Therefore, it is crucial to compute in advance a worst-case estimate of each command's resource usage and to practice deadlock avoidance. The two important resources are memory-pages and disk-pages. The amount of memory, depending on configuration, can be very low, so the memory usage of each command should be low.

This design means that the b-tree implementation has to:

1. Have good concurrency

2. Work well with shadowing

3. Use deadlock avoidance

4. Have guaranteed bounds on the disk-space and number of memory-pages required per b-tree operation

Section 5 goes into how such a b-tree is constructed.

4 Recoverability

Shadowing file systems ensure recoverability by taking periodic checkpoints and logging commands in between. A checkpoint includes the entire file system tree; once a checkpoint is successfully written to disk the previous one can be deleted. If a crash occurs the file system goes back to the last complete checkpoint and replays the log.

For example, Figure 5(a) shows an initial tree. Figure 5(b) shows a set of modifications marked in gray. Figure 5(c) shows the situation after the checkpoint has been committed and unreferenced pages have been deallocated.

Figure 5: Checkpoints. (a) Initial file system tree; (b) modifications; (c) new checkpoint.

The process of writing a checkpoint is efficient because modifications can be batched and written sequentially to disk. If the system crashes while writing a checkpoint no harm is done; the previous checkpoint remains intact.

Command logging is attractive because it combines into a single log-entry a set of possibly complex modifications to the file system.

This combination of checkpointing and logging allows
an important optimization for the shadow-page primitive. When a page belonging to a checkpoint is first shadowed, a cached copy of it is created and held in memory. All modifications to the page can be performed on the cached shadow copy. Assuming there is enough memory, the dirty page can be held until the next checkpoint. Even if the page needs to be swapped out, it can be written to the shadow location and then paged to/from disk. This means that additional shadows need not be created.

The OSD implements cloning. Checkpoints are simply clones that users cannot access. Clones are described in Section 6.

5 Base algorithms

The variant of b-trees used here is known as b+-trees. In a b+-tree, leaf nodes contain key-data pairs and index nodes contain mappings between keys and child nodes; see Figure 6. The tree is composed of individual nodes where a node takes up 4KB of disk-space. The internal structure of a node is based on [7]. There are no links between leaves.

Figure 6: Three types of nodes in a tree; A is a root node, B is an index node, and C is a leaf node. Leaves are not linked together.

Shadowing a page can cause the allocation of a page on disk. When a leaf is modified, all the nodes on the path to the root need to be shadowed. If trees are unbalanced then the depth can vary depending on the leaf. One leaf might cause the modification of 10 nodes, another only 2. Here, all tree operations maintain a perfectly balanced tree; the distance from all leaves to the root is the same.

The b-trees use a minimum-key rule. If node N1 has a child node N2, then the key in N1 pointing to N2 is smaller than or equal to the minimum key of N2. For example, Figure 7 shows an example where integers are the keys. In this diagram and throughout this document, the data values that should appear at the leaf nodes are omitted for simplicity.

Figure 7: A b-tree with two levels.

B-trees are normally described as having between b and 2b − 1 entries per node. Here, these constraints are relaxed and nodes may contain between b and 2b + 1 entries, where b ≥ 2. For performance reasons it is desirable to increase the upper bound to 3b; however, in this section we limit ourselves to the range [b .. 2b + 1].

A pro-active approach to rebalancing is used. When a node with 2b + 1 entries is encountered during an insert-key operation, it is split. When a node with b entries is found during a remove-key operation, it is fixed. Fixing means either moving keys into it or merging it with a neighbor node so that it will have more than b keys. Pro-active fix/split simplifies tree-modification algorithms as well as locking protocols because it prevents modifications from propagating up the tree. However, care should be taken to avoid excessive split/fix activity. If the tree constraints were b and 2b − 1, then a node with 2b − 1 entries could never be split into two legal nodes. Furthermore, even if the constraints were b and 2b, a node with 2b entries would split into two nodes of size b which would immediately need to be merged back together. Therefore, the constraints are set further apart, enlarging the legal set of values. In all the examples in this section b = 2 and the set of legal values is [2 .. 5].

During the descent through the tree, lock-coupling [19] is used. Lock coupling (or crabbing) means locking children before unlocking the parent. This ensures the validity of the tree-path that a task is traversing without pre-locking the entire path. Crabbing is deadlock free.

When performing modifying operations, such as insert/remove key, each node on the path to the leaf is shadowed during the descent through the tree. This combines locking, preparatory operations, and shadowing into one downward traversal.

5.1 Create

In order to create a new b-tree, a root page is allocated and formatted. The root page is special: it can contain zero to 2b + 1 entries. All other nodes have to contain at least b entries. Figure 8 presents a tree that contains a root node with 2 entries.

Figure 8: A b-tree containing only a root.
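The occupancy rules above can be made concrete with a small sketch. The Node class below is only an in-memory stand-in for illustration, not the 4KB on-disk format of [7]; it shows the legal range [b .. 2b+1] and the pro-active split/fix tests used during the downward traversals described in the following subsections.

    B = 2   # as in the examples of this section; the legal range is [2 .. 5]

    class Node:
        def __init__(self, is_leaf, entries=None):
            self.is_leaf = is_leaf
            # sorted (key, value) pairs in a leaf, (key, child) pairs in an index node
            self.entries = entries if entries is not None else []

        def needs_split(self):
            # met during an insert-key descent: node is at the upper bound 2b+1
            return len(self.entries) == 2 * B + 1

        def needs_fix(self):
            # met during a remove-key descent: node is at the lower bound b
            return len(self.entries) == B

        def split(self):
            # a node with 2b+1 entries always yields two legal nodes (>= b each)
            mid = len(self.entries) // 2
            return (Node(self.is_leaf, self.entries[:mid]),
                    Node(self.is_leaf, self.entries[mid:]))

    # Example: a node with 2b+1 = 5 entries must be split pro-actively.
    n = Node(True, [(k, None) for k in (3, 6, 9, 15, 20)])
    assert n.needs_split()
    left, right = n.split()
    assert len(left.entries) >= B and len(right.entries) >= B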
5.2 Delete

In order to erase a tree, it is traversed and all the nodes and data are deallocated. A recursive post-order traversal is used.

An example of the post-order delete pass is shown in Figure 9. A tree with eight nodes is deleted.

Figure 9: Deleting a tree. (a) Tree to be deleted; (b) a post-order traversal.

5.3 Insert key

Insert-key is implemented with a pro-active split policy. On the way down to a leaf, each full index node is split. This ensures that inserting into a leaf will, at most, split the leaf. During the descent lock-coupling is used. Locks are taken in exclusive mode. This ensures the validity of the tree-path that a task is traversing.

Figure 10 shows an example where key 8 is added to a tree. Node [3, 6, 9, 15, 20] is split into [3, 6, 9] and [15, 20] on the way down to leaf [6, 7]. Gray nodes have been shadowed.

Figure 10: Inserting key 8 into a tree. Gray nodes have been shadowed. The root node has been split and a level was added to the tree.

5.4 Lookup key

Lookup for a key is performed by an iterative descent through the tree using lock-coupling. Locks are taken in shared mode.

5.5 Remove key

Remove-key is implemented with a pro-active merge policy. On the way down to a leaf, each node with a minimal amount of keys is fixed, making sure it will have at least b + 1 keys. This guarantees that removing a key from the leaf will, at worst, affect its immediate ancestor. During the descent in the tree lock-coupling is used. Locks are taken in exclusive mode.

For example, Figure 11 shows a remove-key operation that fixes index-node [3, 6] by merging it with its sibling [15, 20].

Figure 11: Removing key 10 from a tree. Gray nodes have been shadowed. The two children of the root were merged and the root node was replaced.

5.6 Resource analysis

This section analyzes the requirements of each operation in terms of memory and disk-pages.

For insert/remove key, three memory pages are needed in the worst case; this happens if a node needs to be split or fixed during the downward traversal. The number of modified disk-pages can be 2 × tree-depth. Since the tree
is balanced, the tree depth is, in the worst case, equal to log_b(N), where N is the number of keys in the tree. In practice, a tree of depth 6 is more than enough to cover huge objects and object-catalogs.

For lookup-key two memory-pages are needed. For lookup-range three memory-pages are needed.

5.7 Comparison to standard b-tree

A top-down b-tree with bounds [b, 2b + 1] is no worse than a bottom-up b-tree with bounds [b + 1, 2b]. The intuition is that a more aggressive top-down algorithm would never allow nodes with b or 2b + 1 entries; such nodes would be immediately split or fixed. This means that each node would contain between b + 1 and 2b entries. This is, more or less, equivalent to a bottom-up b-tree with b' = b + 1.

In practice, the bounds on the number of entries in a node are expanded to [b, 3b]. This improves performance because it means that there are few spurious cases of split/merge. The average capacity of a node is around 2b.

6 Clones

This section builds upon Section 5 and adds the necessary modifications so that clones will work.

There are several desirable properties in a cloning algorithm. Assume Tp is a b-tree and Tq is a clone of Tp; then:

• Space efficiency: Tp and Tq should, as much as possible, share common pages.

• Speed: creating Tq from Tp should take little time and overhead.

• Number of clones: it should be possible to clone Tp many times.

• Clones as first-class citizens: it should be possible to clone Tq.

A trivial algorithm for cloning a tree is copying it wholesale. However, this provides neither space-efficiency nor speed. The method proposed here does not copy the entire tree and has the desired properties.

The main idea is to use a free-space map that maintains a reference count (ref-count) per block. The ref-count records how many times a page is pointed to. A zero ref-count means that a block is free. Essentially, instead of copying a tree, the ref-counts of all its nodes are incremented by one. This means that all nodes belong to two trees instead of one; they are all shared. However, instead of making a pass over the entire tree and incrementing the counters during the clone operation, this is done in a lazy fashion.

Throughout the examples in this section trees Tp and Tq are used. Tree Tp has root P and tree Tq has root node Q. Nodes whose ref-count has changed are marked with diagonals; modified nodes are colored in light gray. In order to better visualize the algorithms, reference counters are drawn inside nodes. This can be misleading: the ref-counters are physically located in the free-space maps.

6.1 Create

The algorithm for cloning a tree Tp is:

1. Copy the root-node of Tp into a new root.

2. Increment the free-space counters of each of the children of the root by one.

An example of cloning is shown in Figure 12. Tree Tp contains seven nodes and Tq is created as a clone of Tp by copying the root P to Q. Both roots point to the shared children: B and C. The reference counters for B and C are incremented to 2.

Notice that in Figure 12(II) nodes D, E, G, and H have a ref-count of one although they belong to two trees. This is an example of lazy reference counting.

6.2 Lookup

The lookup-key and lookup-range algorithms are unaffected by the modification to the free-space maps.

6.3 Insert-key and Remove-key

Changing the way the free-space map works impacts the insert-key and remove-key algorithms. It turns out that a subtle change is sufficient to get them to work well with free-space ref-counts.

Before modifying a page, it is "marked dirty". This lets the run-time system know that the page is about to be modified and gives it a chance to shadow the page if necessary.

Without clones, the only requirement of the mark-dirty operation is to check whether the page belongs to the previous checkpoint; if so, the page must be shadowed. Otherwise, it can be modified in place. With clones, this is more subtle. The following procedure is followed when marking-dirty a clean page N:

1. If the reference count is 1, nothing special is needed. This is no different than without cloning.
Figure 12: Cloning a b-tree. (I) Initial tree Tp; (II) creating a clone Tq.

2. If the ref-count is greater than 1 and page N is relocated from address L1 to address L2, the ref-count for L1 is decremented and the ref-count for L2 is made 1. The ref-count of each of N's children is incremented by 1.

For example, Figure 13 shows two trees, Tp and Tq, that start out sharing all their nodes except the root. Initially, all nodes are clean. A key is inserted into leaf node H. This means that a downward traversal is performed and nodes Q, C, and H are shadowed. In stage (II) node Q is shadowed. Its ref-count is one, so nothing special is needed. In stage (III) node C is shadowed; this splits C into two versions, one belonging to Tp and the other to Tq, each with a ref-count of 1. The children of C are nodes H and G; their ref-count is incremented to two. In stage (IV) node H is shadowed; this splits H into two separate versions, each with ref-count 1.

Performing the mark-dirty in this fashion allows delaying the ref-count operations. For example, in Figure 13(I) node C starts out with a ref-count of two. At the end of the insert operation there are two versions of C, each with a ref-count of 1. Node G starts out with a ref-count of 1, because it is shared indirectly between Tp and Tq. At the end of the operation, it has a ref-count of two because it is pointed to directly from nodes in Tp and Tq.

This modification to the mark-dirty primitive gets the insert-key and remove-key algorithms to work.

6.4 Delete

The delete algorithm is also affected by the free-space ref-counts. Without cloning, a post-order traversal is made over the tree and all nodes are deallocated. In order to take ref-counts into account, a modification has to be made. Assume tree Tp is being deleted and that during the downward part of the post-order traversal node N is reached:

1. If the ref-count of N is greater than 1, then decrement the ref-count and stop the downward traversal. The node is shared with other trees.

2. If the ref-count of N is one, then it belongs only to Tp. Continue the downward traversal and on the way back up deallocate N.

Figure 14 shows an example where Tq and Tp are two trees that share some of their nodes. Tree Tq is deleted. This frees nodes Q, X, and Z and reduces the ref-count on nodes C and Y to 1.

6.5 Resource and performance analysis

The modifications made to the basic algorithms do not add b-tree node accesses. This means that the worst-case estimate on the number of memory-pages and number of disk-blocks used per operation remains unchanged. The number of free-space accesses increases. This has the potential of significantly impacting performance.

Several observations make this unlikely:

• Once sharing is broken for a page and it belongs to a single tree, there are no additional ref-count costs associated with it.

• If a page is dirty and remains in memory, no additional checking is needed.

• The vast majority of b-tree pages are leaves. Leaves have no children and therefore do not incur additional overhead.

A major cost of free-space counters is the increased size of the free-space map. Instead of keeping a bit per block like most file systems, a counter is needed. If 32-bit counters are used then the map grows by a factor of 32. This also allows supporting up to 2^32 clones. The WAFL file system [6] uses 32 bits per block in its free-space map and it is reputed to have good performance. This gives the author reason to believe that this issue can be negotiated.
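To make the lazy reference counting of Sections 6.1 through 6.4 concrete, here is a small illustrative sketch that reuses the Node stand-in shown in Section 5. The refcount dictionary stands in for the per-block counters in the free-space map; disk addresses, checkpoints, and locking are ignored, and the caller of mark_dirty is responsible for re-pointing the parent at the returned copy. This is only a sketch of the technique, not the paper's implementation.

    refcount = {}   # stand-in for the per-block counters in the free-space map

    def clone(root):
        # Section 6.1: copy only the root; each child of the root gains one reference.
        new_root = Node(root.is_leaf, list(root.entries))
        refcount[new_root] = 1
        if not root.is_leaf:
            for _, child in root.entries:
                refcount[child] = refcount.get(child, 1) + 1
        return new_root

    def mark_dirty(node):
        # Section 6.3: break sharing on first modification.
        if refcount.get(node, 1) == 1:
            return node                       # private: may be modified (or shadowed) in place
        refcount[node] -= 1                   # the shared original loses one owner
        copy = Node(node.is_leaf, list(node.entries))
        refcount[copy] = 1
        if not copy.is_leaf:
            for _, child in copy.entries:     # children are now pointed to by both versions
                refcount[child] = refcount.get(child, 1) + 1
        return copy

    def delete_tree(node):
        # Section 6.4: stop at shared nodes, free private ones on the way back up.
        if refcount.get(node, 1) > 1:
            refcount[node] -= 1
            return
        if not node.is_leaf:
            for _, child in node.entries:
                delete_tree(child)
        refcount.pop(node, None)              # deallocate N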
Figure 13: Inserting into a leaf node breaks sharing across the entire path. (I) Initial trees Tp and Tq; (II) shadow Q; (III) shadow C; (IV) shadow H.

Figure 14: Deleting a b-tree rooted at Q. (I) Initial trees Tp and Tq; (II) after deleting Tq.

The test framework used in this work includes a free-space map that resides in memory. This does not allow a serious attempt to investigate the costs of a large free-space map. Furthermore, even a relatively large b-tree that takes up a gigabyte of disk-space can be represented by a 1MB free-space map that can be held in memory. Therefore, investigating this issue remains future work.

Concurrency remains unaffected by ref-counts. Sharing on any node that requires modification is quickly broken and each clone gets its own version.

7 The run-time system

A minimal run-time system was created for the b-tree. The rationale is to focus on the tree algorithms themselves rather than on any fancy footwork that can be performed by a log-structured file-system.

The b-tree is split into 4KB pages that are paged to/from disk. A page-cache is situated between the b-tree and the disk; it can cache clean and dirty pages. A simple clock scheme is implemented; no attempt is made to coalesce pages written to disk into large writes, and no pre-fetching is performed. In order to shadow a page P, the page is first read from disk and put into the cache. As long as P stays in cache it can be modified in memory. Once there is memory pressure, P is written to disk. If P belongs to the old checkpoint, it has to be written to an alternate location; otherwise, it can be written in place. This way, the cache absorbs much of the overhead of shadowing, especially for heavily modified pages.

The free-space was implemented with a simple
in-memory map. There is a ref-count per block. This was done to eliminate any noise generated by the particulars of the OSD free-space component.

A log was not used; it is assumed that the OSD protects all b-tree operations through logical logging of commands.

A special threading package was used; it is similar to [1]. The idea is to use a single operating-system thread, the main-thread, to run all the complex code: caching, free-space, b-tree, command logic, etc. Separate operating-system threads perform the heavy lifting: networking and IO. The main-thread executes multiple light-weight tasks. Tasks are much like regular threads except that they are non-preemptive and cannot perform regular system-calls. A task yields the CPU either voluntarily or when it performs an IO. In the experimental setup for this work most of the OSD code has been eliminated; the upshot is that only the main-thread is executed along with the IO threads. This limits any b-tree code to execute on a single CPU. While the b-tree algorithms themselves are thread-safe for any threading package, they are limited here to execute on a single CPU.

This system does not contain any kernel code. It was built and tested on a Linux operating system with an Intel processor.

8 Performance

The OSD was built to be part of a storage controller. It was specified to be able to manage terabytes of disk space with gigabytes of memory. Most of the memory was to be used for caching customer data; most of the CPU cycles were to be spent on networking and IO. The b-tree was assumed to reside mostly on disk, with frequently accessed pages in memory. The b-tree code was to use little CPU.

In order to achieve good performance the b-tree had to:

1. Work well when most of the tree is not in memory

2. Use little CPU

3. Get good concurrency from the disk subsystem

In this section we show that the algorithms, indeed, achieve these goals.

In [24] there was a prediction that top-down algorithms would not work well. This is because every tree modification has to exclusively lock the root and one of its children, which creates a serialization point. We found that not to be a problem in practice. What happens is that the root and all of its children are almost always cached in memory; therefore, the time it takes to pass the root and its immediate children is very small.

In the experiments reported in this section the entries are of size 16 bytes: 8 bytes for a key and 8 bytes for data. A 4KB node can contain up to 235 such entries.

The test-bed used in the experiments was a single machine connected to a DS4400 disk controller through Fiber-Channel. The machine was a dual-CPU Xeon (Pentium 4) 2.4GHz with 2GB of memory. It ran a Linux 2.6.9 operating system. The b-tree was laid out on a virtual LUN taken from a DS4400 controller. The LUN is a RAID5 in an 8+P pattern. Strip width is 64KB; this means that a full stripe is 8 × 64KB = 512KB. Read and write caching is disabled.

The trees created in the experiments were spread across a 1GB area on disk. Table 1 shows the IO-performance of the disk subsystem for such an area. Three workloads were used: (1) read a random page, (2) write a random page, (3) read and write a random page. When using a single thread, a 4KB write takes 18 milliseconds; this is due to the RAID-5 penalty for short writes. A short write requires 2 reads and 2 writes. A 4KB read takes about 5 milliseconds. Reading a random 4KB page and then writing it back to disk takes 24 milliseconds. When using 10 threads throughput improves by a factor of six.

    #threads   op.     time per op. (ms)   ops per second
    10         read    N/A                 1283
               write   N/A                 421
               R+W     N/A                 311
    1          read    4.8                 207
               write   18.3                68
               R+W     24.6                41

Table 1: Basic disk-subsystem capabilities. Three workloads were used: (1) read a random page, (2) write a random page, (3) read and write a random page. Using 10 threads increases the number of operations per second by a factor of six.

Therefore, large trees with about 64,000 leaves were used to empirically assess performance. It turned out that the only way to quickly build such large trees was through an append-only workload. The even numbers {0, 2, 4, ...} were chosen as keys; they were inserted sequentially into the tree.

Two base-trees were used, T235 and T150. The number of keys in a node is between b and 3b. T235 has a maximal fanout of 235 entries and b is equal to 235/3 = 78. T150 has a maximal fanout of 150 and b is equal to 150/3 = 50. A node can hold more than 150 entries; therefore, this limit is artificially enforced by wasting some of the space in a page.

    T235
      Maximal fanout: 235
      Legal #entries: 78 .. 235
      Contains: 7520000 keys and 64827 nodes (64273 leaves, 554 index-nodes)
      Tree depth: 4
      Root degree: 4
      Index node average fanout: 117
      Leaf node average capacity: 117

    T150
      Maximal fanout: 150
      Legal #entries: 50 .. 150
      Contains: 4800000 keys and 64864 nodes (63999 leaves, 865 index-nodes)
      Tree depth: 4
      Root degree: 11
      Index node average fanout: 75
      Leaf node average capacity: 75

T235 is representative of the OSD catalog. T150 is representative of a tree where the key-value pairs take up 20 bytes instead of 16 bytes. This is an approximation of a tree that holds disk-extents. Both T235 and T150 have an average occupancy of 50%. This is caused by the append-only workload used to create them. When using append, the right edge of the tree keeps splitting, leaving behind half-full nodes.

A set of experiments starts by creating a base-tree of a specific fanout and flushing it to disk. A special procedure is used. A clone q is made of the base tree. For read-only workloads 1000 random lookup-key operations are performed. For other workloads the clone is aged by performing 1000 random insert-key/remove-key operations. Then, the actual workload is applied to q. At the end the clone is deleted. This procedure ensures that the base tree, which took a very long time to create, isn't damaged and can be used for the next experiment. Each measurement is performed five times and results are averaged. The standard deviation for all the experiments reported here was 1% of the average or less.

For each experiment the number of cache-pages is fixed at initialization time to be some percentage of the total number of pages in the tree. This ratio is called the in-memory percentage.

Our b-tree construction is novel and there are no existing data-structures to compare it against. Therefore, we compare it to the ideal performance that could be achieved with a data-structure that could somehow locate leaf nodes without incurring the overheads of an indexing structure. This would allow devoting the entire cache to leaf nodes. To compute ideal performance we assumed that the CPU was infinitely fast.

8.1 Effect of the in-memory percentage on performance

The in-memory percentage has a profound effect on performance. A pure random lookup-key workload was run against T235 with in-memory ratios of 100%, 50%, 10%, 5%, and 2%. Each experiment included 30000 random lookup-key operations and throughput per second was calculated. If the in-memory percentage is x then, under ideal performance, x of the workload is absorbed by the cache and the rest of the workload reaches the disk; throughput per second would be 1283 × 1/(1−x). Table 2 summarizes the results.

    Tree   % in-memory   1 task   10 tasks   ideal
    T235   100           91354    91705      ∞
           50            393      2431       2566
           10            219      1374       1425
           5             207      1306       1350
           2             197      1230       1309

Table 2: Throughput results, measured in operations per second. A pure random lookup-key workload is applied to T235.

When the entire tree is in memory there is no difference in performance between ten tasks and one. This is because all tasks share a single CPU, and it is 100% utilized. When memory percentages drop, disk parallelism comes into play. For the other percentages a speedup of about x6 is achieved.

Performance with 10 tasks is very close to ideal performance, except for the case where the entire tree is in memory. There, it is hard to compete with an infinitely fast CPU.

Performance is logarithmic with respect to cache size. This is because the clock algorithm is able to keep all the index nodes for T235 in memory. This means that operations like lookup/remove/insert-key access, in most cases, one on-disk leaf page.

Performance differences between 10%, 5%, and 2% were very small; therefore, for the rest of the experiments we focused on the 5% case.

8.2 Latency

There are four operations whose latency was measured: lookup-key, insert-key, remove-key, and append-key. In order to measure the latency of operation x, an experiment was performed where x was executed 30000 times and total elapsed time was measured. The latency per operation was computed as the average. Operations were performed with randomly chosen keys.

Table 3 shows the latency of the b-tree operations on the two trees. The cost of a lookup is close to the cost of a single disk read. An insert-key requires reading a leaf from disk and modifying it. The dirty page is later flushed to disk. The average cost is therefore a disk-read and a disk-write, or about 24ms. The performance of remove-key is about the same as an insert-key; the algorithms are very similar. Append always costs 12us because the pages it operates on are always cached.

    Tree   Lookup   Insert   Remove-key   Append
    T235   4.780    24.175   24.437       0.012
    T150   4.839    24.567   24.372       0.012

Table 3: Latency for single-key operations in milliseconds.

8.3 Throughput

Throughput was measured using four workloads taken from [24]: Search-100, Search-80, Modify, and Insert. Each workload is a combination of single-key operations. Search-100 is the most read-intensive; it performs 100% lookup. Search-80 mixes some updates with the lookup workload; it performs 80% lookups, 10% remove-key, and 10% add-key. Modify is an update-mostly workload; it performs 20% lookup, 40% remove-key, and 40% add-key. Insert is an update-only workload; it performs 100% insert-key. Table 4 summarizes the workloads.

                 lookup   insert   remove
    Search-100   100%     0%       0%
    Search-80    80%      10%      10%
    Modify       20%      40%      40%
    Insert       0%       100%     0%

Table 4: The four different workloads.

Each operation was performed 30000 times and throughput per second was calculated. Five such experiments were performed and averaged. The throughput test compared running a workload using one task with the same workload executed concurrently with ten tasks. CPU utilization throughout all the tests was about 1%; the tests were all IO bound.

Table 5 shows ideal performance and the results for a single task and for ten tasks. There is little difference in performance between T235 and T150; this is because the caching algorithm is able to place all the index nodes in cache. The throughput gain in all cases is x6 or slightly better.

In the Search-100 workload each lookup-key translates into a disk-read for the leaf node. This means that ideal throughput is 1283 × 1/0.95 = 1350 requests per second. Actual performance is within 3% of ideal.

In the Insert workload each insert-key request is translated, roughly, into a single disk-read and a single disk-write of a leaf. This means that ideal throughput is 311 × 1/0.95 = 327. Surprisingly, actual performance exceeds ideal performance by about 10%. This is because we are using a write-back cache. After each experiment about 2000 dirty leaf nodes remain in cache and the cost of writing them to disk is not accounted for. This unfairly disadvantages the computation of ideal performance.

The Modify and Search-80 workloads are somewhere in the middle between Insert and Search-100. Overall, the b-tree performs no worse than 4% below ideal.

    Tree    #tasks   Src-100   Src-80   Modify   Insert
    T235    10       1307      763      407      359
            1        209       104      47       41
    T150    10       1284      752      407      357
            1        206       102      47       40
    Ideal            1350      798      384      327

Table 5: Throughput results, measured in operations per second.

8.4 Append

The performance of append has very different characteristics than the performance of other operations. It is instructive to examine a 100% append workload. The base trees, T235 and T150, were built using a single task that appended to them. The time to create the trees and the throughput in append operations per second are shown in Table 6. The in-memory percentage was 5%.

    Tree   #keys     Total time (sec)   append ops/sec
    T235   7520000   1565.1             4800
    T150   4800000   1564.8             3069

Table 6: Append throughput results when building trees T235 and T150.

These throughput numbers are higher by two orders of magnitude compared with other workloads run with a single task. This is because append has very good locality; it needs only the nodes at the right edge of the tree. If they are all in memory then append can be performed at CPU speed. Once in a while, a split is needed which requires, in most cases, one additional page. Overall, there are very few IOs needed to perform this workload.

8.5 Performance impact of checkpoints

During a checkpoint all dirty pages must first be written to disk before they are reused. It is not possible to continue modifying a dirty page that is memory-resident; it must be evicted to disk first in order to create a consistent checkpoint.

In terms of the performance of an ongoing workload, the worst case occurs when all memory-resident pages are dirty at the beginning of a checkpoint. The best case occurs when all memory-resident pages are clean. Then, the checkpoint occurs immediately, at essentially no cost.

In order to assess performance, the throughput tests were run against T235. After 20% of the workload was complete, that is, after 6000 operations, a checkpoint was initiated. Table 7 shows performance for tree T235 with 10 tasks. The first row shows results when running a checkpoint. The second row shows base results, for ease of reference.

For the Search-100 workload there was virtually no degradation. This is because there are no dirty pages to destage. Other workloads suffer between 3% and 10% degradation in performance.

                 Src-100   Src-80   Modify   Insert
    checkpoint   1302      697      388      346
    base         1307      763      407      359

Table 7: Throughput results when a checkpoint is performed during the workload. The in-memory percentage is 5%; the tree is T235.

8.6 Performance for clones

In order to assess the performance of cloning, a special test was performed. Two clones of the base tree are created, p and q. Both clones are aged by performing 1000/2 = 500 operations on them. Finally, 30000/2 = 15000 operations are performed against each clone.

Table 8 shows performance for tree T235 with 10 tasks. The first row shows results with 2 clones. The second row shows base results, for ease of reference.

                Src-100   Src-80   Modify   Insert
    2 clones    1303      733      394      350
    base        1307      763      407      359

Table 8: Throughput results with T235 and ten tasks. The in-memory percentage is 5%. Measurements are in operations per second.

There is little performance degradation when using clones. The clock algorithm is quite successful in placing the index nodes for both clones into the cache. This also shows that concurrency is good even when using clones.

9 Future work

Several issues that can have a significant impact on performance have not been studied here:

• Space allocation

• Write-batching

• More sophisticated caching algorithms, for example, ARC [16]

We believe each of these issues merits further study.

10 Summary

B-trees are an important data-structure used in many file-systems. Shadowing is a powerful technique for updating file-system data-structures.

This paper has shown how to use shadowing to update b-trees and get the benefits of both algorithms: snapshots, recoverability, concurrency, and logarithmic lookup and update. The algorithms are efficient and they make good use of the disk subsystem.

Although our testbed was an object-disk, we believe the ideas are applicable to other file-systems.
References

[1] A. Adya, J. Howell, M. Theimer, W. Bolosky, and J. Douceur. Cooperative Task Management without Manual Stack Management or, Event-driven Programming is Not the Opposite of Threaded Programming. In USENIX Annual Technical Conference, June 2002.

[2] A. Borg, W. Blau, W. Graetsch, F. Herrmann, and W. Oberle. Fault Tolerance under Unix. ACM Transactions on Computer Systems, February 1989.

[3] A. Sweeny, D. Doucette, W. Hu, C. Anderson, M. Nishimoto, and G. Peck. Scalability in the XFS File System. In USENIX, 1996.

[4] SNIA Storage Networking Industry Association. OSD: Object Based Storage Devices Technical Work Group. http://www.snia.org/tech_activities/workgroups/osd/.

[5] C. Mohan and F. Levine. ARIES/IM: An Efficient and High Concurrency Index Management Method Using Write-Ahead Logging. In ACM SIGMOD International Conference on Management of Data, pages 371-380, 1992.

[6] D. Hitz, J. Lau, and M. Malcolm. File System Design for an NFS File Server Appliance. In USENIX, 1994.

[7] D. Lomet. The Evolution of Effective B-tree: Page Organization and Techniques: A Personal Account. In SIGMOD Record, 2001.

[8] D. Lomet and B. Salzberg. Access Method Concurrency with Recovery. In ACM SIGMOD International Conference on Management of Data, pages 351-360, 1992.

[9] H. Reiser. ReiserFS. http://www.namesys.com/.

[10] J. Menon, D. Pease, R. Rees, L. Duyanovich, and B. Hillsberg. IBM Storage Tank: A Heterogeneous Scalable SAN File-System. IBM Systems Journal, 42(2):250-267, 2003.

[11] J. Ousterhout and F. Douglis. Beating the I/O Bottleneck: A Case for Log-Structured File Systems. In ACM SIGOPS, January 1989.

[12] J. Rosenberg, F. Henskens, A. Brown, R. Morrison, and D. Munro. Stability in a Persistent Store Based on a Large Virtual Memory. Security and Persistence, pages 229-245, 1990.

[13] L. Guibas and R. Sedgewick. A Dichromatic Framework for Balanced Trees. In Nineteenth Annual Symposium on Foundations of Computer Science, 1978.

[14] M. McKusick, W. Joy, S. Leffler, and R. Fabry. A Fast File System for Unix. ACM Transactions on Computer Systems, 1984.

[15] M. Rosenblum and J. Ousterhout. The Design and Implementation of a Log-Structured File System. ACM Transactions on Computer Systems, 10(1):26-52, 1992.

[16] N. Megiddo and D. S. Modha. ARC: A Self-Tuning, Low Overhead Replacement Cache. In USENIX File and Storage Technologies (FAST), March 2003.

[17] P. Lehman and S. Yao. Efficient Locking for Concurrent Operations on B-Trees. ACM Transactions on Database Systems, 6(4):650-670, 1981.

[18] R. Bayer and E. McCreight. Organization and Maintenance of Large Ordered Indices. Acta Informatica, pages 173-189, 1972.

[19] R. Bayer and M. Schkolnick. Concurrency of Operations on B-trees. Acta Informatica, 9:1-21, 1977.

[20] S. Best. Journaling File Systems. Linux Magazine, October 2002.

[21] Object Based Storage Devices Command Set (OSD). http://www.t10.org/drafts.htm. T10 Working Draft.

[22] V. Henson, M. Ahrens, and J. Bonwick. Automatic Performance Tuning in the Zettabyte File System. In File and Storage Technologies (FAST), work-in-progress report, 2003.

[23] V. Lanin and D. Shasha. A Symmetric Concurrent B-tree Algorithm. In Fall Joint Computer Conference, 1986.

[24] V. Srinivasan and M. Carey. Performance of B+ Tree Concurrency Control Algorithms. VLDB Journal, 2(4):361-406, January 1993.

[25] Y. Mond and Y. Raz. Concurrency Control in B+-trees Databases Using Preparatory Operations. In Eleventh International Conference on Very Large Data Bases, 1985.

