Commit 0d861bb

Add deduplication to nbtree.

Deduplication reduces the storage overhead of duplicates in indexes that use the standard nbtree index access method. The deduplication process is applied lazily, after the point where opportunistic deletion of LP_DEAD-marked index tuples occurs. Deduplication is only applied at the point where a leaf page split would otherwise be required. New posting list tuples are formed by merging together existing duplicate tuples. The physical representation of the items on an nbtree leaf page is made more space efficient by deduplication, but the logical contents of the page are not changed. Even unique indexes make use of deduplication as a way of controlling bloat from duplicates whose TIDs point to different versions of the same logical table row.

The lazy approach taken by nbtree has significant advantages over a GIN style eager approach. Most individual inserts of index tuples have exactly the same overhead as before. The extra overhead of deduplication is amortized across insertions, just like the overhead of page splits. The key space of indexes works in the same way as it has since commit dd299df (the commit that made heap TID a tiebreaker column).

Testing has shown that nbtree deduplication can generally make indexes with about 10 or 15 tuples for each distinct key value about 2.5X - 4X smaller, even with single column integer indexes (e.g., an index on a referencing column that accompanies a foreign key). The final size of single column nbtree indexes comes close to the final size of a similar contrib/btree_gin index, at least in cases where GIN's posting list compression isn't very effective. This can significantly improve transaction throughput, and significantly reduce the cost of vacuuming indexes.

A new index storage parameter (deduplicate_items) controls the use of deduplication. The default setting is 'on', so all new B-Tree indexes automatically use deduplication where possible. This decision will be reviewed at the end of the Postgres 13 beta period.

There is a regression of approximately 2% of transaction throughput with synthetic workloads that consist of append-only inserts into a table with several non-unique indexes, where all indexes have few or no repeated values. The underlying issue is that cycles are wasted on unsuccessful attempts at deduplicating items in non-unique indexes. There doesn't seem to be a way around it short of disabling deduplication entirely. Note that deduplication of items in unique indexes is fairly well targeted in general, which avoids the problem there (we can use a special heuristic to trigger deduplication passes in unique indexes, since we're specifically targeting "version bloat").

Bump XLOG_PAGE_MAGIC because xl_btree_vacuum changed.

No bump in BTREE_VERSION, since the representation of posting list tuples works in a way that's backwards compatible with version 4 indexes (i.e. indexes built on PostgreSQL 12). However, users must still REINDEX a pg_upgrade'd index to use deduplication, regardless of the Postgres version they've upgraded from. This is the only way to set the new nbtree metapage flag indicating that deduplication is generally safe.

Author: Anastasia Lubennikova, Peter Geoghegan
Reviewed-By: Peter Geoghegan, Heikki Linnakangas
Discussion:
https://postgr.es/m/[email protected]
https://postgr.es/m/[email protected]
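The ordering described in the commit message (plain insert if the item fits, then opportunistic LP_DEAD deletion, then a deduplication pass, and a page split only as a last resort) can be pictured with a rough C sketch. This is not the nbtree insertion code: `DemoPage`, `DemoAction`, and `demo_make_room` are hypothetical names, and the page is reduced to a few byte counters. In particular, it assumes the space each step can reclaim is known up front, which a real page would have to discover by scanning its items.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical outcome of trying to place one new item on a leaf page. */
typedef enum DemoAction
{
    DEMO_INSERT_DIRECTLY,           /* item fits as-is */
    DEMO_INSERT_AFTER_DEAD_CLEANUP, /* fits once LP_DEAD items are removed */
    DEMO_INSERT_AFTER_DEDUP,        /* fits once duplicates are merged */
    DEMO_SPLIT_PAGE                 /* nothing helped: split the page */
} DemoAction;

/* Hypothetical, heavily simplified leaf page model. */
typedef struct DemoPage
{
    size_t free_bytes;    /* free space available right now */
    size_t dead_bytes;    /* space reclaimable from LP_DEAD-marked items */
    size_t dedup_savings; /* space reclaimable by forming posting lists */
    bool   dedup_enabled; /* stand-in for deduplicate_items = on */
} DemoPage;

/*
 * Decide how to make room for a new item, trying the cheapest option
 * first.  Deduplication is attempted only when a split would otherwise
 * be required, which is what makes the approach lazy.
 */
DemoAction
demo_make_room(DemoPage *page, size_t newitem_bytes)
{
    assert(page != NULL);

    if (page->free_bytes >= newitem_bytes)
        return DEMO_INSERT_DIRECTLY;

    /* Step 1: opportunistic deletion of LP_DEAD-marked items. */
    page->free_bytes += page->dead_bytes;
    page->dead_bytes = 0;
    if (page->free_bytes >= newitem_bytes)
        return DEMO_INSERT_AFTER_DEAD_CLEANUP;

    /* Step 2: deduplication pass, only if enabled for this index. */
    if (page->dedup_enabled)
    {
        page->free_bytes += page->dedup_savings;
        page->dedup_savings = 0;
        if (page->free_bytes >= newitem_bytes)
            return DEMO_INSERT_AFTER_DEDUP;
    }

    /* Step 3: no cheaper option left. */
    return DEMO_SPLIT_PAGE;
}
```

In this model, most inserts return from the first check and pay no extra cost, which mirrors the claim that the overhead of deduplication is amortized across insertions in the same way as page splits.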
1 parent 612a1ab commit 0d861bb

28 files changed, +3553 -332 lines changed

contrib/amcheck/verify_nbtree.c

+188-43
Large diffs are not rendered by default.

doc/src/sgml/btree.sgml

+199-2
@@ -557,11 +557,208 @@ equalimage(<replaceable>opcintype</replaceable> <type>oid</type>) returns bool
 <sect1 id="btree-implementation">
 <title>Implementation</title>
 
+<para>
+This section covers B-Tree index implementation details that may be
+of use to advanced users. See
+<filename>src/backend/access/nbtree/README</filename> in the source
+distribution for a much more detailed, internals-focused description
+of the B-Tree implementation.
+</para>
+<sect2 id="btree-structure">
+<title>B-Tree Structure</title>
+<para>
+<productname>PostgreSQL</productname> B-Tree indexes are
+multi-level tree structures, where each level of the tree can be
+used as a doubly-linked list of pages. A single metapage is stored
+in a fixed position at the start of the first segment file of the
+index. All other pages are either leaf pages or internal pages.
+Leaf pages are the pages on the lowest level of the tree. All
+other levels consist of internal pages. Each leaf page contains
+tuples that point to table rows. Each internal page contains
+tuples that point to the next level down in the tree. Typically,
+over 99% of all pages are leaf pages. Both internal pages and leaf
+pages use the standard page format described in <xref
+linkend="storage-page-layout"/>.
+</para>
+<para>
+New leaf pages are added to a B-Tree index when an existing leaf
+page cannot fit an incoming tuple. A <firstterm>page
+split</firstterm> operation makes room for items that originally
+belonged on the overflowing page by moving a portion of the items
+to a new page. Page splits must also insert a new
+<firstterm>downlink</firstterm> to the new page in the parent page,
+which may cause the parent to split in turn. Page splits
+<quote>cascade upwards</quote> in a recursive fashion. When the
+root page finally cannot fit a new downlink, a <firstterm>root page
+split</firstterm> operation takes place. This adds a new level to
+the tree structure by creating a new root page that is one level
+above the original root page.
+</para>
+</sect2>
+
+<sect2 id="btree-deduplication">
+<title>Deduplication</title>
+<para>
+A duplicate is a leaf page tuple (a tuple that points to a table
+row) where <emphasis>all</emphasis> indexed key columns have values
+that match corresponding column values from at least one other leaf
+page tuple that's close by in the same index. Duplicate tuples are
+quite common in practice. B-Tree indexes can use a special,
+space-efficient representation for duplicates when an optional
+technique is enabled: <firstterm>deduplication</firstterm>.
+</para>
+<para>
+Deduplication works by periodically merging groups of duplicate
+tuples together, forming a single posting list tuple for each
+group. The column key value(s) only appear once in this
+representation. This is followed by a sorted array of
+<acronym>TID</acronym>s that point to rows in the table. This
+significantly reduces the storage size of indexes where each value
+(or each distinct combination of column values) appears several
+times on average. The latency of queries can be reduced
+significantly. Overall query throughput may increase
+significantly. The overhead of routine index vacuuming may also be
+reduced significantly.
+</para>
+<note>
+<para>
+While NULL is generally not considered to be equal to any other
+value, including NULL, NULL is nevertheless treated as just
+another value from the domain of indexed values by the B-Tree
+implementation (except when enforcing uniqueness in a unique
+index). B-Tree deduplication is therefore just as effective with
+<quote>duplicates</quote> that contain a NULL value.
+</para>
+</note>
+<para>
+The deduplication process occurs lazily, when a new item is
+inserted that cannot fit on an existing leaf page. This prevents
+(or at least delays) leaf page splits. Unlike GIN posting list
+tuples, B-Tree posting list tuples do not need to expand every time
+a new duplicate is inserted; they are merely an alternative
+physical representation of the original logical contents of the
+leaf page. This design prioritizes consistent performance with
+mixed read-write workloads. Most client applications will at least
+see a moderate performance benefit from using deduplication.
+Deduplication is enabled by default.
+</para>
+<para>
+Write-heavy workloads that don't benefit from deduplication due to
+having few or no duplicate values in indexes will incur a small,
+fixed performance penalty (unless deduplication is explicitly
+disabled). The <literal>deduplicate_items</literal> storage
+parameter can be used to disable deduplication within individual
+indexes. There is never any performance penalty with read-only
+workloads, since reading posting list tuples is at least as
+efficient as reading the standard tuple representation. Disabling
+deduplication isn't usually helpful.
+</para>
+<para>
+B-Tree indexes are not directly aware that under MVCC, there might
+be multiple extant versions of the same logical table row; to an
+index, each tuple is an independent object that needs its own index
+entry. Thus, an update of a row always creates all-new index
+entries for the row, even if the key values did not change. Some
+workloads suffer from index bloat caused by these
+implementation-level version duplicates (this is typically a
+problem for <command>UPDATE</command>-heavy workloads that cannot
+apply the <acronym>HOT</acronym> optimization due to modifying at
+least one indexed column). B-Tree deduplication does not
+distinguish between these implementation-level version duplicates
+and conventional duplicates. Deduplication can nevertheless help
+with controlling index bloat caused by implementation-level version
+churn.
+</para>
+<tip>
+<para>
+A special heuristic is applied to determine whether a
+deduplication pass in a unique index should take place. It can
+often skip straight to splitting a leaf page, avoiding a
+performance penalty from wasting cycles on unhelpful deduplication
+passes. If you're concerned about the overhead of deduplication,
+consider setting <literal>deduplicate_items = off</literal>
+selectively. Leaving deduplication enabled in unique indexes has
+little downside.
+</para>
+</tip>
+<para>
+Deduplication cannot be used in all cases due to
+implementation-level restrictions. Deduplication safety is
+determined when <command>CREATE INDEX</command> or
+<command>REINDEX</command> run.
+</para>
+<para>
+Note that deduplication is deemed unsafe and cannot be used in the
+following cases involving semantically significant differences
+among equal datums:
+</para>
+<para>
+<itemizedlist>
+<listitem>
+<para>
+<type>text</type>, <type>varchar</type>, and <type>char</type>
+cannot use deduplication when a
+<emphasis>nondeterministic</emphasis> collation is used. Case
+and accent differences must be preserved among equal datums.
+</para>
+</listitem>
+
+<listitem>
+<para>
+<type>numeric</type> cannot use deduplication. Numeric display
+scale must be preserved among equal datums.
+</para>
+</listitem>
+
+<listitem>
+<para>
+<type>jsonb</type> cannot use deduplication, since the
+<type>jsonb</type> B-Tree operator class uses
+<type>numeric</type> internally.
+</para>
+</listitem>
+
+<listitem>
+<para>
+<type>float4</type> and <type>float8</type> cannot use
+deduplication. These types have distinct representations for
+<literal>-0</literal> and <literal>0</literal>, which are
+nevertheless considered equal. This difference must be
+preserved.
+</para>
+</listitem>
+</itemizedlist>
+</para>
+<para>
+There is one further implementation-level restriction that may be
+lifted in a future version of
+<productname>PostgreSQL</productname>:
+</para>
+<para>
+<itemizedlist>
+<listitem>
+<para>
+Container types (such as composite types, arrays, or range
+types) cannot use deduplication.
+</para>
+</listitem>
+</itemizedlist>
+</para>
+<para>
+There is one further implementation-level restriction that applies
+regardless of the operator class or collation used:
+</para>
 <para>
-An introduction to the btree index implementation can be found in
-<filename>src/backend/access/nbtree/README</filename>.
+<itemizedlist>
+<listitem>
+<para>
+<literal>INCLUDE</literal> indexes can never use deduplication.
+</para>
+</listitem>
+</itemizedlist>
 </para>
 
+</sect2>
 </sect1>
 
 </chapter>
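The merging step that the new Deduplication section describes (key stored once, followed by a sorted array of TIDs) can be pictured with a small C sketch. It is not the nbtree code: `DemoTid`, `DemoTuple`, `DemoPosting`, and `demo_dedup_pass` are hypothetical names, keys are plain integers, and alignment and any size limit on posting lists are ignored. It assumes the input is already sorted by key and then by TID, so each group's TIDs form a contiguous, sorted run.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a heap TID: (block number, offset number). */
typedef struct DemoTid
{
    unsigned       block;
    unsigned short offset;
} DemoTid;

/* Hypothetical plain leaf tuple: one key value and one TID. */
typedef struct DemoTuple
{
    int     key;
    DemoTid tid;
} DemoTuple;

/*
 * Hypothetical posting list tuple: the key is stored once, and the TIDs
 * are the run items[first .. first + ntids - 1] of the sorted input.
 */
typedef struct DemoPosting
{
    int    key;
    size_t first;
    size_t ntids;
} DemoPosting;

/*
 * Merge adjacent duplicates into posting lists.  "out" must have room
 * for nitems entries.  Returns the number of posting lists formed.
 */
size_t
demo_dedup_pass(const DemoTuple *items, size_t nitems, DemoPosting *out)
{
    size_t ngroups = 0;

    assert(nitems == 0 || (items != NULL && out != NULL));

    for (size_t i = 0; i < nitems; i++)
    {
        if (ngroups > 0 && out[ngroups - 1].key == items[i].key)
        {
            /* Same key as the pending group: only the TID is new. */
            out[ngroups - 1].ntids++;
        }
        else
        {
            /* New key value: start a new posting list. */
            out[ngroups].key = items[i].key;
            out[ngroups].first = i;
            out[ngroups].ntids = 1;
            ngroups++;
        }
    }
    return ngroups;
}
```

The logical contents are unchanged by the pass: every input TID is still reachable through exactly one posting list, and only the repeated key values have been dropped.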

doc/src/sgml/charset.sgml

+5-4
@@ -928,10 +928,11 @@ CREATE COLLATION ignore_accents (provider = icu, locale = 'und-u-ks-level1-kc-tr
 nondeterministic collations give a more <quote>correct</quote> behavior,
 especially when considering the full power of Unicode and its many
 special cases, they also have some drawbacks. Foremost, their use leads
-to a performance penalty. Also, certain operations are not possible with
-nondeterministic collations, such as pattern matching operations.
-Therefore, they should be used only in cases where they are specifically
-wanted.
+to a performance penalty. Note, in particular, that B-tree cannot use
+deduplication with indexes that use a nondeterministic collation. Also,
+certain operations are not possible with nondeterministic collations,
+such as pattern matching operations. Therefore, they should be used
+only in cases where they are specifically wanted.
 </para>
 </sect3>
 </sect2>

doc/src/sgml/citext.sgml

+4-3
@@ -233,9 +233,10 @@ SELECT * FROM users WHERE nick = 'Larry';
 <para>
 <type>citext</type> is not as efficient as <type>text</type> because the
 operator functions and the B-tree comparison functions must make copies
-of the data and convert it to lower case for comparisons. It is,
-however, slightly more efficient than using <function>lower</function> to get
-case-insensitive matching.
+of the data and convert it to lower case for comparisons. Also, only
+<type>text</type> can support B-Tree deduplication. However,
+<type>citext</type> is slightly more efficient than using
+<function>lower</function> to get case-insensitive matching.
 </para>
 </listitem>
 

doc/src/sgml/func.sgml

+5-4
@@ -16561,10 +16561,11 @@ AND
 rows. Two rows might have a different binary representation even
 though comparisons of the two rows with the equality operator is true.
 The ordering of rows under these comparison operators is deterministic
-but not otherwise meaningful. These operators are used internally for
-materialized views and might be useful for other specialized purposes
-such as replication but are not intended to be generally useful for
-writing queries.
+but not otherwise meaningful. These operators are used internally
+for materialized views and might be useful for other specialized
+purposes such as replication and B-Tree deduplication (see <xref
+linkend="btree-deduplication"/>). They are not intended to be
+generally useful for writing queries, though.
 </para>
 </sect2>
 </sect1>
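The gap between operator equality and binary equality that this hunk refers to is easy to reproduce for the floating-point case named in the B-Tree documentation, where `-0` and `0` compare as equal but are stored differently. A minimal C sketch, assuming IEEE 754 doubles; `demo_equal` and `demo_image_equal` are hypothetical helpers, not PostgreSQL functions:

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Equality as an "=" operator would see it. */
static bool
demo_equal(double a, double b)
{
    return a == b;
}

/* Image equality: the two values are identical byte for byte. */
static bool
demo_image_equal(double a, double b)
{
    return memcmp(&a, &b, sizeof(double)) == 0;
}
```

With these helpers, `demo_equal(-0.0, 0.0)` holds while `demo_image_equal(-0.0, 0.0)` does not. Merging two such keys into one posting list would keep only one of the two representations, which is why the documentation says the difference must be preserved.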

doc/src/sgml/ref/create_index.sgml

+40-4
@@ -171,6 +171,8 @@ CREATE [ UNIQUE ] INDEX [ CONCURRENTLY ] [ [ IF NOT EXISTS ] <replaceable class=
 maximum size allowed for the index type, data insertion will fail.
 In any case, non-key columns duplicate data from the index's table
 and bloat the size of the index, thus potentially slowing searches.
+Furthermore, B-tree deduplication is never used with indexes
+that have a non-key column.
 </para>
 
 <para>
@@ -393,10 +395,39 @@ CREATE [ UNIQUE ] INDEX [ CONCURRENTLY ] [ [ IF NOT EXISTS ] <replaceable class=
 </variablelist>
 
 <para>
-B-tree indexes additionally accept this parameter:
+B-tree indexes also accept these parameters:
 </para>
 
 <variablelist>
+<varlistentry id="index-reloption-deduplication" xreflabel="deduplicate_items">
+<term><literal>deduplicate_items</literal>
+<indexterm>
+<primary><varname>deduplicate_items</varname></primary>
+<secondary>storage parameter</secondary>
+</indexterm>
+</term>
+<listitem>
+<para>
+Controls usage of the B-tree deduplication technique described
+in <xref linkend="btree-deduplication"/>. Set to
+<literal>ON</literal> or <literal>OFF</literal> to enable or
+disable the optimization. (Alternative spellings of
+<literal>ON</literal> and <literal>OFF</literal> are allowed as
+described in <xref linkend="config-setting"/>.) The default is
+<literal>ON</literal>.
+</para>
+
+<note>
+<para>
+Turning <literal>deduplicate_items</literal> off via
+<command>ALTER INDEX</command> prevents future insertions from
+triggering deduplication, but does not in itself make existing
+posting list tuples use the standard tuple representation.
+</para>
+</note>
+</listitem>
+</varlistentry>
+
 <varlistentry id="index-reloption-vacuum-cleanup-index-scale-factor" xreflabel="vacuum_cleanup_index_scale_factor">
 <term><literal>vacuum_cleanup_index_scale_factor</literal>
 <indexterm>
@@ -451,9 +482,7 @@ CREATE [ UNIQUE ] INDEX [ CONCURRENTLY ] [ [ IF NOT EXISTS ] <replaceable class=
 This setting controls usage of the fast update technique described in
 <xref linkend="gin-fast-update"/>. It is a Boolean parameter:
 <literal>ON</literal> enables fast update, <literal>OFF</literal> disables it.
-(Alternative spellings of <literal>ON</literal> and <literal>OFF</literal> are
-allowed as described in <xref linkend="config-setting"/>.) The
-default is <literal>ON</literal>.
+The default is <literal>ON</literal>.
 </para>
 
 <note>
@@ -805,6 +834,13 @@ CREATE UNIQUE INDEX title_idx ON films (title) INCLUDE (director, rating);
 </programlisting>
 </para>
 
+<para>
+To create a B-Tree index with deduplication disabled:
+<programlisting>
+CREATE INDEX title_idx ON films (title) WITH (deduplicate_items = off);
+</programlisting>
+</para>
+
 <para>
 To create an index on the expression <literal>lower(title)</literal>,
 allowing efficient case-insensitive searches:

src/backend/access/common/reloptions.c

+10
@@ -158,6 +158,16 @@ static relopt_bool boolRelOpts[] =
 		},
 		true
 	},
+	{
+		{
+			"deduplicate_items",
+			"Enables \"deduplicate items\" feature for this btree index",
+			RELOPT_KIND_BTREE,
+			ShareUpdateExclusiveLock	/* since it applies only to later
+										 * inserts */
+		},
+		true
+	},
 	/* list terminator */
 	{{NULL}}
 };
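The `boolRelOpts[]` array in this hunk is a static table closed by a `{{NULL}}` sentinel entry, and the new `deduplicate_items` entry carries a default of `true`. The sketch below shows, under simplifying assumptions, how a sentinel-terminated table of this shape can be scanned by name. `DemoBoolOpt`, `demo_bool_opts`, and `demo_bool_default` are hypothetical; the real reloptions machinery uses richer structs (option kind, lock level, description) and is more involved than a linear lookup.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, cut-down boolean option entry: name plus default. */
typedef struct DemoBoolOpt
{
    const char *name;
    bool        default_val;
} DemoBoolOpt;

/* Sample table; a NULL name marks the end, like the list terminator. */
static const DemoBoolOpt demo_bool_opts[] = {
    {"fastupdate", true},
    {"deduplicate_items", true},
    {NULL, false}
};

/*
 * Look up an option's default by name.  Sets *found to report whether
 * the name exists; the return value is meaningful only when it does.
 */
bool
demo_bool_default(const char *name, bool *found)
{
    assert(name != NULL && found != NULL);

    for (const DemoBoolOpt *opt = demo_bool_opts; opt->name != NULL; opt++)
    {
        if (strcmp(opt->name, name) == 0)
        {
            *found = true;
            return opt->default_val;
        }
    }
    *found = false;
    return false;
}
```

A sentinel entry lets new options be appended, as this commit does, without maintaining a separate element count.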

src/backend/access/index/genam.c

+4
@@ -276,6 +276,10 @@ BuildIndexValueDescription(Relation indexRelation,
 /*
  * Get the latestRemovedXid from the table entries pointed at by the index
  * tuples being deleted.
+ *
+ * Note: index access methods that don't consistently use the standard
+ * IndexTuple + heap TID item pointer representation will need to provide
+ * their own version of this function.
  */
 TransactionId
 index_compute_xid_horizon_for_tuples(Relation irel,

src/backend/access/nbtree/Makefile

+1
@@ -14,6 +14,7 @@ include $(top_builddir)/src/Makefile.global
 
 OBJS = \
 	nbtcompare.o \
+	nbtdedup.o \
 	nbtinsert.o \
 	nbtpage.o \
 	nbtree.o \
