postgrespro · Jan 13, 2021
diff --git a/‎doc/src/sgml/btree.sgml
Lines changed: 124 additions & 28 deletions b/‎doc/src/sgml/btree.sgml
Lines changed: 124 additions & 28 deletions
diff --git a/‎doc/src/sgml/ref/create_index.sgml
Lines changed: 29 additions & 7 deletions b/‎doc/src/sgml/ref/create_index.sgml
Lines changed: 29 additions & 7 deletions
@@ -629,6 +629,109 @@ options(<replaceable>relopts</replaceable> <type>local_relopts *</type>) returns
   </para>
  </sect2>
 
+ <sect2 id="btree-deletion">
+  <title>Bottom-up index deletion</title>
+  <para>
+   B-Tree indexes are not directly aware that under MVCC, there might
+   be multiple extant versions of the same logical table row; to an
+   index, each tuple is an independent object that needs its own index
+   entry.  <quote>Version churn</quote> tuples may sometimes
+   accumulate and adversely affect query latency and throughput.  This
+   typically occurs with <command>UPDATE</command>-heavy workloads
+   where most individual updates cannot apply the
+   <acronym>HOT</acronym> optimization.  Changing the value of only
+   one column covered by one index during an <command>UPDATE</command>
+   <emphasis>always</emphasis> necessitates a new set of index tuples
+   &mdash; one for <emphasis>each and every</emphasis> index on the
+   table.  Note in particular that this includes indexes that were not
+   <quote>logically modified</quote> by the <command>UPDATE</command>.
+   All indexes will need a successor physical index tuple that points
+   to the latest version in the table.  Each new tuple within each
+   index will generally need to coexist with the original
+   <quote>updated</quote> tuple for a short period of time (typically
+   until shortly after the <command>UPDATE</command> transaction
+   commits).
+  </para>
+  <para>
+   B-Tree indexes incrementally delete version churn index tuples by
+   performing <firstterm>bottom-up index deletion</firstterm> passes.
+   Each deletion pass is triggered in reaction to an anticipated
+   <quote>version churn page split</quote>.  This only happens with
+   indexes that are not logically modified by
+   <command>UPDATE</command> statements, where concentrated build up
+   of obsolete versions in particular pages would occur otherwise.  A
+   page split will usually be avoided, though it's possible that
+   certain implementation-level heuristics will fail to identify and
+   delete even one garbage index tuple (in which case a page split or
+   deduplication pass resolves the issue of an incoming new tuple not
+   fitting on a leaf page).  The worst case number of versions that
+   any index scan must traverse (for any single logical row) is an
+   important contributor to overall system responsiveness and
+   throughput.  A bottom-up index deletion pass targets suspected
+   garbage tuples in a single leaf page based on
+   <emphasis>qualitative</emphasis> distinctions involving logical
+   rows and versions.  This contrasts with the <quote>top-down</quote>
+   index cleanup performed by autovacuum workers, which is triggered
+   when certain <emphasis>quantitative</emphasis> table-level
+   thresholds are exceeded (see <xref linkend="autovacuum"/>).
+  </para>
+  <note>
+   <para>
+    Not all deletion operations that are performed within B-Tree
+    indexes are bottom-up deletion operations.  There is a distinct
+    category of index tuple deletion: <firstterm>simple index tuple
+     deletion</firstterm>.  This is a deferred maintenance operation
+    that deletes index tuples that are known to be safe to delete
+    (those whose item identifier's <literal>LP_DEAD</literal> bit is
+    already set).  Like bottom-up index deletion, simple index
+    deletion takes place at the point that a page split is anticipated
+    as a way of avoiding the split.
+   </para>
+   <para>
+    Simple deletion is opportunistic in the sense that it can only
+    take place when recent index scans set the
+    <literal>LP_DEAD</literal> bits of affected items in passing.
+    Prior to <productname>PostgreSQL</productname> 14, the only
+    category of B-Tree deletion was simple deletion.  The main
+    differences between it and bottom-up deletion are that only the
+    former is opportunistically driven by the activity of passing
+    index scans, while only the latter specifically targets version
+    churn from <command>UPDATE</command>s that do not logically modify
+    indexed columns.
+   </para>
+  </note>
+  <para>
+   Bottom-up index deletion performs the vast majority of all garbage
+   index tuple cleanup for particular indexes with certain workloads.
+   This is expected with any B-Tree index that is subject to
+   significant version churn from <command>UPDATE</command>s that
+   rarely or never logically modify the columns that the index covers.
+   The average and worst case number of versions per logical row can
+   be kept low purely through targeted incremental deletion passes.
+   It's quite possible that the on-disk size of certain indexes will
+   never increase by even one single page/block despite
+   <emphasis>constant</emphasis> version churn from
+   <command>UPDATE</command>s.  Even then, an exhaustive <quote>clean
+    sweep</quote> by a <command>VACUUM</command> operation (typically
+   run in an autovacuum worker process) will eventually be required as
+   a part of <emphasis>collective</emphasis> cleanup of the table and
+   each of its indexes.
+  </para>
+  <para>
+   Unlike <command>VACUUM</command>, bottom-up index deletion does not
+   provide any strong guarantees about how old the oldest garbage
+   index tuple may be.  No index can be permitted to retain
+   <quote>floating garbage</quote> index tuples that became dead prior
+   to a conservative cutoff point shared by the table and all of its
+   indexes collectively.  This fundamental table-level invariant makes
+   it safe to recycle table <acronym>TID</acronym>s.  This is how it
+   is possible for distinct logical rows to reuse the same table
+   <acronym>TID</acronym> over time (though this can never happen with
+   two logical rows whose lifetimes span the same
+   <command>VACUUM</command> cycle).
+  </para>
+ </sect2>
+
  <sect2 id="btree-deduplication">
   <title>Deduplication</title>
   <para>
@@ -666,15 +769,17 @@ options(<replaceable>relopts</replaceable> <type>local_relopts *</type>) returns
   </note>
   <para>
    The deduplication process occurs lazily, when a new item is
-   inserted that cannot fit on an existing leaf page.  This prevents
-   (or at least delays) leaf page splits.  Unlike GIN posting list
-   tuples, B-Tree posting list tuples do not need to expand every time
-   a new duplicate is inserted; they are merely an alternative
-   physical representation of the original logical contents of the
-   leaf page.  This design prioritizes consistent performance with
-   mixed read-write workloads.  Most client applications will at least
-   see a moderate performance benefit from using deduplication.
-   Deduplication is enabled by default.
+   inserted that cannot fit on an existing leaf page, though only when
+   index tuple deletion could not free sufficient space for the new
+   item (typically deletion is briefly considered and then skipped
+   over).  Unlike GIN posting list tuples, B-Tree posting list tuples
+   do not need to expand every time a new duplicate is inserted; they
+   are merely an alternative physical representation of the original
+   logical contents of the leaf page.  This design prioritizes
+   consistent performance with mixed read-write workloads.  Most
+   client applications will at least see a moderate performance
+   benefit from using deduplication.  Deduplication is enabled by
+   default.
   </para>
   <para>
    <command>CREATE INDEX</command> and <command>REINDEX</command>
@@ -702,25 +807,16 @@ options(<replaceable>relopts</replaceable> <type>local_relopts *</type>) returns
    deduplication isn't usually helpful.
   </para>
   <para>
-   B-Tree indexes are not directly aware that under MVCC, there might
-   be multiple extant versions of the same logical table row; to an
-   index, each tuple is an independent object that needs its own index
-   entry.  <quote>Version duplicates</quote> may sometimes accumulate
-   and adversely affect query latency and throughput.  This typically
-   occurs with <command>UPDATE</command>-heavy workloads where most
-   individual updates cannot apply the <acronym>HOT</acronym>
-   optimization (often because at least one indexed column gets
-   modified, necessitating a new set of index tuple versions &mdash;
-   one new tuple for <emphasis>each and every</emphasis> index).  In
-   effect, B-Tree deduplication ameliorates index bloat caused by
-   version churn.  Note that even the tuples from a unique index are
-   not necessarily <emphasis>physically</emphasis> unique when stored
-   on disk due to version churn.  The deduplication optimization is
-   selectively applied within unique indexes.  It targets those pages
-   that appear to have version duplicates.  The high level goal is to
-   give <command>VACUUM</command> more time to run before an
-   <quote>unnecessary</quote> page split caused by version churn can
-   take place.
+   It is sometimes possible for unique indexes (as well as unique
+   constraints) to use deduplication.  This allows leaf pages to
+   temporarily <quote>absorb</quote> extra version churn duplicates.
+   Deduplication in unique indexes augments bottom-up index deletion,
+   especially in cases where a long-running transactions holds a
+   snapshot that blocks garbage collection.  The goal is to buy time
+   for the bottom-up index deletion strategy to become effective
+   again.  Delaying page splits until a single long-running
+   transaction naturally goes away can allow a bottom-up deletion pass
+   to succeed where an earlier deletion pass failed.
   </para>
   <tip>
    <para>
 
@@ -386,17 +386,39 @@ CREATE [ UNIQUE ] INDEX [ CONCURRENTLY ] [ [ IF NOT EXISTS ] <replaceable class=
      <para>
       The fillfactor for an index is a percentage that determines how full
       the index method will try to pack index pages.  For B-trees, leaf pages
-      are filled to this percentage during initial index build, and also
+      are filled to this percentage during initial index builds, and also
       when extending the index at the right (adding new largest key values).
       If pages
       subsequently become completely full, they will be split, leading to
-      gradual degradation in the index's efficiency.  B-trees use a default
+      fragmentation of the on-disk index structure.  B-trees use a default
       fillfactor of 90, but any integer value from 10 to 100 can be selected.
-      If the table is static then fillfactor 100 is best to minimize the
-      index's physical size, but for heavily updated tables a smaller
-      fillfactor is better to minimize the need for page splits.  The
-      other index methods use fillfactor in different but roughly analogous
-      ways; the default fillfactor varies between methods.
+     </para>
+     <para>
+      B-tree indexes on tables where many inserts and/or updates are
+      anticipated can benefit from lower fillfactor settings at
+      <command>CREATE INDEX</command> time (following bulk loading into the
+      table).  Values in the range of 50 - 90 can usefully <quote>smooth
+       out</quote> the <emphasis>rate</emphasis> of page splits during the
+      early life of the B-tree index (lowering fillfactor like this may even
+      lower the absolute number of page splits, though this effect is highly
+      workload dependent).  The B-tree bottom-up index deletion technique
+      described in <xref linkend="btree-deletion"/> is dependent on having
+      some <quote>extra</quote> space on pages to store <quote>extra</quote>
+      tuple versions, and so can be affected by fillfactor (though the effect
+      is usually not significant).
+     </para>
+     <para>
+      In other specific cases it might be useful to increase fillfactor to
+      100 at <command>CREATE INDEX</command> time as a way of maximizing
+      space utilization.  You should only consider this when you are
+      completely sure that the table is static (i.e. that it will never be
+      affected by either inserts or updates).  A fillfactor setting of 100
+      otherwise risks <emphasis>harming</emphasis> performance: even a few
+      updates or inserts will cause a sudden flood of page splits.
+     </para>
+     <para>
+      The other index methods use fillfactor in different but roughly
+      analogous ways; the default fillfactor varies between methods.
      </para>
     </listitem>
    </varlistentry>