0% found this document useful (0 votes)
59 views36 pages

How TokuDB Fractal Tree Databases Work Presentation

- B-trees are commonly used for database indexes but are slow for high-entropy inserts due to random disk I/O - Appending to a file is fast for inserts as it achieves disk bandwidth, but requires a full table scan for lookups - Fractal trees aim to provide good performance for both inserts and queries by organizing data into log N arrays similar to an append-only structure - Inserting into a simplified fractal tree involves adding elements to arrays and merging arrays as they fill up, allowing sequential disk writes and fast inserts like an append-only structure

Uploaded by

conys voms
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
59 views36 pages

How TokuDB Fractal Tree Databases Work Presentation

- B-trees are commonly used for database indexes but are slow for high-entropy inserts due to random disk I/O - Appending to a file is fast for inserts as it achieves disk bandwidth, but requires a full table scan for lookups - Fractal trees aim to provide good performance for both inserts and queries by organizing data into log N arrays similar to an append-only structure - Inserting into a simplified fractal tree involves adding elements to arrays and merging arrays as they fill up, allowing sequential disk writes and fast inserts like an append-only structure

Uploaded by

conys voms
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 36

How TokuDB Fractal Tree Indexes Work TM

Bradley C. Kuszmaul

MySQL UC 2010—How Fractal Trees Work 1


More Information

You can download this talk and others at

http://tokutek.com/technology

MySQL UC 2010—How Fractal Trees Work 2


B-Trees are Everywhere

B-Trees show up in
database indexes (such
as MyISAM and
InnoDB), file systems
(such as XFS), and many
other storage systems.

MySQL UC 2010—How Fractal Trees Work 3


B-Trees are Fast at Sequential Inserts
B
In Memory

B B

···

···

Insertions are
into this leaf node

• One disk I/O per leaf (which contains many rows).


• Sequential disk I/O.
• Performance is limited by disk bandwidth.

MySQL UC 2010—How Fractal Trees Work 4


B-Trees are Slow for High-Entropy Inserts
B
In Memory

B B

···

···

• Most nodes are not in main memory.


• Most insertions require a random disk I/O.
• Performance is limited by disk head movement.
• Only 100’s of inserts/s/disk (≤ 0.2% of disk
bandwidth).

MySQL UC 2010—How Fractal Trees Work 5


New B-Trees Run Fast Range QueriesB

B B

···

···

Range Scan

• In newly created B-trees, the leaf nodes are often


laid out sequentially on disk.
• Can get near 100% of disk bandwidth.
• About 100MB/s per disk.

MySQL UC 2010—How Fractal Trees Work 6


Aged B-Trees Run Slow Range Queries B

B B

···

··· ···

Leaf Blocks Scattered Over Disk

• In aged trees, the leaf blocks end up scattered over


disk.
• For 16KB nodes, as little as 1.6% of disk
bandwidth.
• About 16KB/s per disk.
MySQL UC 2010—How Fractal Trees Work 7
Append-to-file Beats B-Trees at Insertions
Here’s a data structure that is very fast for insertions:
5 4 2 7 9 4
Write next key here

Write to the end of a file.


Pros:
• Achieve disk bandwidth even for random keys.
Cons:
• Looking up anything requires a table scan.

MySQL UC 2010—How Fractal Trees Work 8


Append-to-file Beats B-Trees at Insertions
Here’s a data structure that is very fast for insertions:
5 4 2 7 9 4
Write next key here

Write to the end of a file.


Pros:
• Achieve disk bandwidth even for random keys.
Cons:
• Looking up anything requires a table scan.

MySQL UC 2010—How Fractal Trees Work 9


A Performance Tradeoff?
Structure Inserts Point Queries Range Queries
B-Tree Horrible Good Good (young)
Append Wonderful Horrible Horrible
Fractal Tree Good Good Good
• B-trees are good at lookup, but bad at insert.
• Append-to-file is good at insert, but bad at lookup.
• Is there a data structure that is about as good as a
B-tree for lookup, but has insertion performance
closer to append?
Yes, Fractal Trees!
MySQL UC 2010—How Fractal Trees Work 10
A Performance Tradeoff?
Structure Inserts Point Queries Range Queries
B-Tree Horrible Good Good (young)
Append Wonderful Horrible Horrible
Fractal Tree Good Good Good
• B-trees are good at lookup, but bad at insert.
• Append-to-file is good at insert, but bad at lookup.
• Is there a data structure that is about as good as a
B-tree for lookup, but has insertion performance
closer to append?
Yes, Fractal Trees!
MySQL UC 2010—How Fractal Trees Work 11
An Algorithmic Performance Model
To analyze performance we use the Disk-Access
Machine (DAM) model. [Aggrawal, Vitter 88]
• Two levels of memory.
• Two parameters: block size B, and
memory size M.
• The game: Minimize the number
of block transfers. Don’t worry
about CPU cycles.

MySQL UC 2010—How Fractal Trees Work 12


Theoretical Results
Structure Insert Point Query
   
log N log N
B-Tree O O
log B log B
   
1 N
Append O O
B B
   
log N log N
Fractal Tree O O
B1−ε ε log B1−ε

MySQL UC 2010—How Fractal Trees Work 13


Example of Insertion Cost
30
• 1 billion 128-byte rows. N = 2 ; log(N) = 30.
• 1MB block holds 8192 rows. B = 8192; log B = 13.
   
log N 30
B-Tree: O = O ≈3
log B 13
   
log N 30
Fractal Tree: O =O ≈ 0.003.
B 8192
Fractal Trees use << 1 disk I/O per insertion.

MySQL UC 2010—How Fractal Trees Work 14


A Simplified Fractal Tree

5 10 • log N arrays, one array for


each power of two.
3 6 8 12 17 23 26 30
• Each array is completely
full or empty.
• Each array is sorted.

MySQL UC 2010—How Fractal Trees Work 15


Example (4 elements)
If there are 4 elements in our fractal tree, the structure
looks like this:

12 17 23 30

MySQL UC 2010—How Fractal Trees Work 16


If there are 10 elements in our fractal tree, the
structure might look like this:

5 10

3 6 8 12 17 23 26 30

But there is some freedom.


• Each array is full or empty, so the 2-array and the
8-array must be full.
• However, which elements go where isn’t
completely specified.

MySQL UC 2010—How Fractal Trees Work 17


Searching in a Simplified Fractal Tree
• Idea: Perform a binary
search in each array.
5 10 • Pros: It works. It’s faster
than a table scan.
3 6 8 12 17 23 26 30 • Cons: It’s slower than a
2
B-tree at O(log N) block
transfers.

Let’s put search aside, and consider insert.

MySQL UC 2010—How Fractal Trees Work 18


Inserting in a Simplified Fractal Tree

5 10

3 6 8 12 17 23 26 30

Add another array of each size for temporary storage.

At the beginning of each step, the temporary arrays


are empty.

MySQL UC 2010—How Fractal Trees Work 19


Insert 15
To insert 15, there is only one place to put it: In the
1-array.
15
5 10

3 6 8 12 17 23 26 30

MySQL UC 2010—How Fractal Trees Work 20


Insert 7
To insert 7, no space in the 1-array. Put it in the temp
1-array.
15 7
5 10

3 6 8 12 17 23 26 30

Then merge the two 1-arrays to make a new 2-array.

5 10 7 15

3 6 8 12 17 23 26 30
MySQL UC 2010—How Fractal Trees Work 21
Not done inserting 7

5 10 7 15

3 6 8 12 17 23 26 30

Must merge the 2-arrays to make a 4-array.

5 7 10 15
3 6 8 12 17 23 26 30

MySQL UC 2010—How Fractal Trees Work 22


An Insert Can Cause Many Merges

9 5 10 2 18 33 40 3 6 8 12 17 23 26 30

31
9 5 10 2 18 33 40 3 6 8 12 17 23 26 30

9 31
5 10 2 18 33 40 3 6 8 12 17 23 26 30

5 9 10 31
2 18 33 40 3 6 8 12 17 23 26 30

2 5 9 10 18 31 33 40
3 6 8 12 17 23 26 30

2 3 5 6 8 9 10 12 17 18 23 26 30 31 33 40

MySQL UC 2010—How Fractal Trees Work 23


Analysis of Insertion into Simplified
Fractal Tree
• Cost to merge 2 arrays of size
5 7 10 15
X is O(X/B) block I/Os.
3 6 8 12 17 23 26 30 Merge is very I/O efficient.

• Cost per element to merge is O(1/B) since O(X)


elements were merged.
• Max # of times each element is merged is O(log N).
 
log N
• Average insert cost is O .
B
MySQL UC 2010—How Fractal Trees Work 24
Improving Worst-Case Insertion
Although the average cost of a merge is low,
occasionally we merge a lot of stuff.

3 6 8 12 17 23 26 30 4 7 9 19 20 21 27 29

Idea: A separate thread merges arrays. An insert


returns quickly.
Lemma: As long as we merge Ω(log N) elements for
every insertion, the merge thread won’t fall behind.
MySQL UC 2010—How Fractal Trees Work 25
Speeding up Search

2
At log N, search is too expensive.
5 7 10 15
Now let’s shave a factor of log N.
3 6 8 12 17 23 26 30

The idea: Having searched an array for a row, we


know where that row would belong in the array. We
can gain information about where the row belongs in
the next array

MySQL UC 2010—How Fractal Trees Work 26


Forward Pointers
9
Each element gets a forward
2 14
pointer to where that element
5 7 13 25 goes in the next array using
3 6 8 12 17 23 26 30 Fractional Cascading. [Chazelle, Guibas 1986]

If you are careful, you can arrange for forward


pointers to land frequently (separated by at most a
constant). Search becomes O(log N) levels, each
looking at a constant number of elements, for
O(log N) I/Os.

MySQL UC 2010—How Fractal Trees Work 27


Industrial-Grade Fractal Trees
A real implementation, like TokuDB, must deal with
• Variable-sized rows;
• Deletions as well as insertions;
• Transactions, logging, and ACID-compliant crash
recovery;
• Must optimize sequential inserts more;
• Better search cost: O(logB N), not O(log2 N);
• Compression; and
• Multithreading.

MySQL UC 2010—How Fractal Trees Work 28


iiBench Insert Benchmark

iiBench was developed by us and Mark Callaghan to


measure insert performance.
Percona took these measurements about a year ago.
MySQL UC 2010—How Fractal Trees Work 29
iiBench on SSD

TokuDB on rotating disk beats InnoDB on SSD.


MySQL UC 2010—How Fractal Trees Work 30
Disk Size and Price Technology Trends
• SSD is getting cheaper.
• Rotating disk is getting cheaper faster. Seagate
indicates that 67TB drives will be here in 2017.
• Moore’s law for silicon lithography is slower over
the next decade than Moore’s law for rotating disks.
Conclusion: big data stored on disk isn’t going away
any time soon.
Fractal Tree indexes are good on disk.
TokuDB speedups do not try to keep indexes in main
memory. We realize the disk’s performance potential.
MySQL UC 2010—How Fractal Trees Work 31
Speed Trends
• Bandwidth off a rotating disk will hit about
500MB/s.
• Seek time will not change much.
Conclusion: Scaling with bandwidth is good. Scaling
with seek time is bad.
Fractal Tree indexes scale with bandwidth.
Unlike B-trees, Fractal Tree indexes can consume
many CPU cycles.

MySQL UC 2010—How Fractal Trees Work 32


Power Trends
• Big disks are much more power efficient per byte
stored than little disks.
• Making good use of disk bandwidth offers further
power savings.

Fractal Tree indexes can use 1/100th the power of


B-trees.

MySQL UC 2010—How Fractal Trees Work 33


CPU Trends
• CPU power will grow dramatically inside servers
over the next few years. 100-core machines are
around the corner. 1000-core machines are on the
horizon.
• Memory bandwidth will also increase.
• I/O bus bandwidth will also grow.
Conclusion: Scale-up machines will be impressive.

Fractal Tree indexes will make good use of cores.

MySQL UC 2010—How Fractal Trees Work 34


Customers
TokuDB has been generally available since February,
and you can read customer success stories at
tokutek.com

Others

MySQL UC 2010—How Fractal Trees Work 35


The Future
• Fractal Tree indexes dominate B-trees theoretically.
• Fractal Tree indexes ride the right technology
trends.
• In the future, all storage systems will use Fractal
Tree indexes.
• The future is in MySQL now! Fractal Tree indexes
are available for MySQL, with transactions and
recovery, from
tokutek.com

MySQL UC 2010—How Fractal Trees Work 36

You might also like