Re: [WIP] Effective storage of duplicates in B-tree index. - Mailing list pgsql-hackers
From | Anastasia Lubennikova |
---|---|
Subject | Re: [WIP] Effective storage of duplicates in B-tree index. |
Date | |
Msg-id | [email protected] Whole thread Raw |
In response to | [PROPOSAL] Effective storage of duplicates in B-tree index. (Anastasia Lubennikova <[email protected]>) |
Responses |
Re: [WIP] Effective storage of duplicates in B-tree index.
|
List | pgsql-hackers |
31.08.2015 10:41, Anastasia Lubennikova:
I'd like to share the progress of my work. So here is a WIP patch.
It provides effective duplicate handling using posting lists the same way as GIN does it.
Layout of the tuples on the page is changed in the following way:
before:
TID (ip_blkid, ip_posid) + key, TID (ip_blkid, ip_posid) + key, TID (ip_blkid, ip_posid) + key
with patch:
TID (N item pointers, posting list offset) + key, TID (ip_blkid, ip_posid), TID (ip_blkid, ip_posid), TID (ip_blkid, ip_posid)
It seems that backward compatibility works well without any changes. But I haven't tested it properly yet.
Here are some test results. They are obtained by test functions test_btbuild and test_ginbuild, which you can find in attached sql file.
i - number of distinct values in the index. So i=1 means that all rows have the same key, and i=10000000 means that all keys are different.
The other columns contain the index size (MB).
You can note that the last row contains the same index sizes for B-tree, which is quite logical - there is no compression if all the keys are distinct.
Other cases looks really nice to me.
Next thing to say is that I haven't implemented posting list compression yet. So there is still potential to decrease size of compressed btree.
I'm almost sure, there are still some tiny bugs and missed functions, but on the whole, the patch is ready for testing.
I'd like to get a feedback about the patch testing on some real datasets. Any bug reports and suggestions are welcome.
Here is a couple of useful queries to inspect the data inside the index pages:
create extension pageinspect;
select * from bt_metap('idx');
select bt.* from generate_series(1,1) as n, lateral bt_page_stats('idx', n) as bt;
select n, bt.* from generate_series(1,1) as n, lateral bt_page_items('idx', n) as bt;
And at last, the list of items I'm going to complete in the near future:
1. Add storage_parameter 'enable_compression' for btree access method which specifies whether the index handles duplicates. default is 'off'
2. Bring back microvacuum functionality for compressed indexes.
3. Improve insertion speed. Insertions became significantly slower with compressed btree, which is obviously not what we do want.
4. Clean the code and comments, add related documentation.
Hi, hackers!
I'm going to begin work on effective storage of duplicate keys in B-tree index.
The main idea is to implement posting lists and posting trees for B-tree index pages as it's already done for GIN.
In a nutshell, effective storing of duplicates in GIN is organised as follows.
Index stores single index tuple for each unique key. That index tuple points to posting list which contains pointers to heap tuples (TIDs). If too many rows having the same key, multiple pages are allocated for the TIDs and these constitute so called posting tree.
You can find wonderful detailed descriptions in gin readme and articles.
It also makes possible to apply compression algorithm to posting list/tree and significantly decrease index size. Read more in presentation (part 1).
Now new B-tree index tuple must be inserted for each table row that we index.
It can possibly cause page split. Because of MVCC even unique index could contain duplicates.
Storing duplicates in posting list/tree helps to avoid superfluous splits.
I'd like to share the progress of my work. So here is a WIP patch.
It provides effective duplicate handling using posting lists the same way as GIN does it.
Layout of the tuples on the page is changed in the following way:
before:
TID (ip_blkid, ip_posid) + key, TID (ip_blkid, ip_posid) + key, TID (ip_blkid, ip_posid) + key
with patch:
TID (N item pointers, posting list offset) + key, TID (ip_blkid, ip_posid), TID (ip_blkid, ip_posid), TID (ip_blkid, ip_posid)
It seems that backward compatibility works well without any changes. But I haven't tested it properly yet.
Here are some test results. They are obtained by test functions test_btbuild and test_ginbuild, which you can find in attached sql file.
i - number of distinct values in the index. So i=1 means that all rows have the same key, and i=10000000 means that all keys are different.
The other columns contain the index size (MB).
i | B-tree Old | B-tree New | GIN |
1 | 214,234375 | 87,7109375 | 10,2109375 |
10 | 214,234375 | 87,7109375 | 10,71875 |
100 | 214,234375 | 87,4375 | 15,640625 |
1000 | 214,234375 | 86,2578125 | 31,296875 |
10000 | 214,234375 | 78,421875 | 104,3046875 |
100000 | 214,234375 | 65,359375 | 49,078125 |
1000000 | 214,234375 | 90,140625 | 106,8203125 |
10000000 | 214,234375 | 214,234375 | 534,0625 |
You can note that the last row contains the same index sizes for B-tree, which is quite logical - there is no compression if all the keys are distinct.
Other cases looks really nice to me.
Next thing to say is that I haven't implemented posting list compression yet. So there is still potential to decrease size of compressed btree.
I'm almost sure, there are still some tiny bugs and missed functions, but on the whole, the patch is ready for testing.
I'd like to get a feedback about the patch testing on some real datasets. Any bug reports and suggestions are welcome.
Here is a couple of useful queries to inspect the data inside the index pages:
create extension pageinspect;
select * from bt_metap('idx');
select bt.* from generate_series(1,1) as n, lateral bt_page_stats('idx', n) as bt;
select n, bt.* from generate_series(1,1) as n, lateral bt_page_items('idx', n) as bt;
And at last, the list of items I'm going to complete in the near future:
1. Add storage_parameter 'enable_compression' for btree access method which specifies whether the index handles duplicates. default is 'off'
2. Bring back microvacuum functionality for compressed indexes.
3. Improve insertion speed. Insertions became significantly slower with compressed btree, which is obviously not what we do want.
4. Clean the code and comments, add related documentation.
-- Anastasia Lubennikova Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Attachment
pgsql-hackers by date: