32
PostgreSQL
Ioannis Alagiannis (Swisscom AG)
Renata Borovica-Gajic (University of Melbourne, AU)1
This chapter is released under the Creative Commons Attribution-NonCommercial-ShareAlike (CC BY-NC-SA)
4.0 International License. To view a copy of this license, visit https://creativecommons.org/licenses/by-nc-sa/4.0.
For any use beyond those covered by this license, obtain permission by emailing [email protected].
make better use of its features. It would also be particularly useful for students and
developers who wish to add functionality to the PostgreSQL system, by modifying its
source code.
The standard distribution of PostgreSQL comes with command-line tools for admin-
istering the database. However, there is a wide range of commercial and open-source
graphical administration and design tools that support PostgreSQL. Software develop-
ers may also access PostgreSQL through a comprehensive set of programming inter-
faces.
[Figure 32.1: the PostgreSQL system architecture, showing the processes included in the standard PostgreSQL distribution. A client application communicates through a client interface library with the postmaster daemon, which handles the initial connection request and authentication, and with a PostgreSQL backend server process, to which it submits SQL queries and from which it receives results. The server processes use shared in-memory disk buffers and shared tables, and read and write disk storage through the kernel's disk buffers.]
The postmaster is responsible for handling the initial client
connections. For this, it constantly listens for new connections on a known port. When
it receives a connection request, the postmaster first performs initialization steps such
as user authentication, and then assigns an idle backend server process (or spawns a
new one if required) to handle the new client. After this initial connection, the client
interacts only with the backend server process, submitting queries and receiving query
results. As long as the client connection is active, the assigned backend server pro-
cess is dedicated to only that client connection. Thus, PostgreSQL uses a process-per-
connection model for the backend server.
The backend server process is responsible for executing the queries submitted by
the client by performing the necessary query-execution steps, including parsing, opti-
mization, and execution. Each backend server process can handle only a single query
at a time. An application that desires to execute more than one query in parallel must
maintain multiple connections to the server. At any given time there may be multiple
clients connected to the system, and thus multiple backend server processes may be
executing concurrently.
In addition to the postmaster and the backend server processes, PostgreSQL uses
several background worker processes to perform data-management tasks, including the
background writer, the statistics collector, the write-ahead log (WAL) writer, and the
checkpointer processes. The background writer process periodically writes dirty pages
from the shared buffers to persistent storage. The statistics collector process continuously
gathers statistics about table accesses and the number of rows in tables. The WAL writer
process periodically flushes the WAL data to persistent storage, while the checkpointer
process performs database checkpoints to speed up recovery. These background processes
are initiated by the postmaster process.
When it comes to memory management in PostgreSQL, we can identify two different
categories: (a) local memory and (b) shared memory. Each backend process allocates
local memory for its own tasks, such as query processing (e.g., internal sort operations,
hash tables, and temporary tables) and maintenance operations (e.g., vacuum, create
index).
On the other hand, the in-memory buffer pool is placed in shared memory, so that
all the processes, including backend server processes and background processes can
access it. Shared memory is also used to store lock tables and other data that must be
shared by server and background processes.
Due to the use of shared memory as the inter-process communication medium,
a PostgreSQL server should run on a single shared-memory machine; a single-server
site cannot be executed across multiple machines that do not share memory, without
the assistance of third-party packages. However, it is possible to build a shared-nothing
parallel database system with an instance of PostgreSQL running on each node; in
fact, several commercial parallel database systems have been built with exactly this
architecture.
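As a rough illustration, the size of the shared buffer pool and the per-backend local memory areas described above are controlled by configuration parameters such as the following; the values shown are arbitrary examples rather than recommendations.

alter system set shared_buffers = '4GB';   -- shared in-memory buffer pool (takes effect after a restart)
set work_mem = '64MB';                     -- local memory per sort or hash operation in a backend
set maintenance_work_mem = '512MB';        -- local memory for maintenance tasks such as vacuum and create index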
32.3 Storage and Indexing
PostgreSQL’s approach to data layout and storage has the goals of (1) a simple and
clean implementation and (2) ease of administration. As a step toward these goals,
PostgreSQL relies on file-system files for data storage (also referred to as “cooked”
files), instead of handling the physical layout of data on raw disk partitions by itself.
PostgreSQL maintains a list of directories in the file hierarchy to use for storage; these
directories are referred to as tablespaces. Each PostgreSQL installation is initialized
with a default tablespace, and additional tablespaces may be added at any time. When
creating a table, index, or entire database, the user may specify a tablespace in which
to store the related files. It is particularly useful to create multiple tablespaces if they
reside on different physical devices, so that tablespaces on the faster devices may be
dedicated to data that are accessed more frequently. Moreover, data that are stored on
separate disks may be accessed in parallel more efficiently.
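For illustration, a tablespace on a faster device can be created and dedicated to frequently accessed data; the directory path and object names below are assumptions.

create tablespace fastspace location '/ssd1/pgdata';
create index instructor_name_idx on instructor (name) tablespace fastspace;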
The design of the PostgreSQL storage system potentially leads to some perfor-
mance limitations, due to clashes between PostgreSQL and the file system. The use
of cooked file systems results in double buffering, where blocks are first fetched from
disk into the file system’s cache (in kernel space), and are then copied to PostgreSQL’s
buffer pool. Performance can also be limited by the fact that PostgreSQL stores data in
8-KB blocks, which may not match the block size used by the kernel. It is possible to
change the PostgreSQL block size when the server is installed, but this may have unde-
sired consequences: small blocks limit the ability of PostgreSQL to store large tuples
efficiently, while large blocks are wasteful when a small region of a file is accessed.
On the other hand, modern enterprises increasingly use external storage systems,
such as network-attached storage and storage-area networks, instead of disks attached
to servers. Such storage systems are administered and tuned for performance sepa-
rately from the database. PostgreSQL may directly leverage these technologies because
of its reliance on “cooked” file systems. For most applications, the performance reduc-
tion due to the use of “cooked” file systems is minimal, and is justified by the ease of
administration and management, and the simplicity of implementation.
32.3.1 Tables
The primary unit of storage in PostgreSQL is a table. In PostgreSQL, tuples in a table are
stored in heap files. These files use a form of the standard slotted-page format (Section
13.2.2). The PostgreSQL slotted-page format is shown in Figure 32.2. In each page, the
page header is followed by an array of line pointers (also referred to as item identifiers). A
line pointer contains the offset (relative to the start of the page) and length of a specific
tuple in the page. The actual tuples are stored from the end of the page to simplify
insertions. When a new item is added to the page, if all existing line pointers are in use,
a new line pointer is allocated from the beginning of the unallocated space (pd_lower),
while the item itself is stored at the end of the unallocated space (pd_upper).
[Figure 32.2: the PostgreSQL slotted-page format, showing the page header, the array of line pointers linp1 ... linpn, the unallocated space delimited by pd_lower and pd_upper, and the tuples stored from the end of the page.]
A record in a heap file is identified by its tuple ID (TID). The TID consists of a
4-byte block ID which specifies the page in the file containing the tuple and a 2-byte
slot ID. The slot ID is an index into the line pointer array that in turn is used to access
the tuple.
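The TID of each visible tuple version can be inspected through the ctid system column; for example:

select ctid, name from instructor;
-- each ctid value is a (block ID, slot ID) pair, such as (0,1)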
Due to the multi-version concurrency control (MVCC) technique used by
PostgreSQL, there may be multiple versions of a tuple, each with an associated start
and end time for validity. Delete operations do not immediately delete tuples, and up-
date operations do not directly update tuples. Instead, deletion of a tuple initially just
updates the end-time for validity, while an update of a tuple creates a new version of
the existing tuple; the old version has its validity end-time set to just before the validity
start-time of the new version.
Old versions of tuples that are no longer visible to any transaction are physically
deleted later; deletion causes holes to be formed in a page. The indirection of accessing
tuples through the line pointer array permits the compaction of such holes by moving
the tuples, without affecting the TID of the tuple.
The length of a physical tuple is limited by the size of a data page, which makes
it difficult to store very long tuples. When PostgreSQL encounters a large tuple that
cannot fit in a page, it tries to “TOAST” individual large attributes, that is, compress
the attribute or break it up into smaller pieces. In some cases, “toasting” an attribute
may be accomplished by compressing the value. If compression does not shrink the
tuple enough to fit in the page (as is often the case), the data in the toasted attribute is
replaced by a reference to the attribute value; the attribute value is stored outside the
page in an associated TOAST table. Large attribute values are split into smaller chunks;
the chunk size is chosen such that four chunks can fit in a page. Each chunk is stored
as a separate row in the associated TOAST table. An index on the combination of the
identifier of a toasted attribute with the sequence number of each chunk allows efficient
retrieval of the values. Only data types with a variable-length representation support
TOAST, to avoid imposing the overhead on data types that cannot produce large field
values. The toasted attribute size is limited to 1 GB.
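As a sketch (the table and column names are assumptions), the TOAST strategy of a column can be adjusted, and the associated TOAST table can be located through the system catalogs:

alter table document alter column body set storage external;  -- store out of line, without compression
select reltoastrelid::regclass from pg_class where relname = 'document';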
32.3.2 Indices
A PostgreSQL index is a data structure that provides a dynamic mapping from search
predicates to sequences of tuple IDs from a particular table. The returned tuples are
intended to match the search predicate, although in some cases the predicate must
be rechecked on the actual tuples, since the index may return a superset of matching
tuples. PostgreSQL supports several different index types, including indices that are
based on user-extensible access methods. All the index types in PostgreSQL currently
use the slotted-page format described in Section 32.3.1 to organize the data within an
index page.
2 As is conventional in the industry, the term B-tree is used in place of B+-tree, and should be interpreted as referring
to the B+-tree data structure.
Examples of indices built using the GiST interface include R-trees, as well as less
conventional indices for multidimensional cubes and full-text search.
The GiST interface requires the access-method implementer to implement only
certain operations on the data type being indexed, and to specify how to arrange those
data values in the GiST tree. New GiST access methods can be implemented by creating
an operator class. Each GiST operator class may have a different set of strategies based
on the search predicates implemented by the index. There are five support functions
that an index operator class for GiST must provide, such as for testing set membership,
for splitting sets of entries on page overflows, and for computing the cost of inserting a new
entry. GiST also allows four optional support functions, such as for supporting
ordered scans, or for allowing the index to contain a different type than the data type on
which it is built. An index built on the GiST interface may be lossy, meaning that such
an index might produce false matches; in that case, records fetched by the index need
to have the index predicate rechecked, and some of the fetched records may fail the
predicate.
PostgreSQL provides several index methods implemented using GiST, such as
indices for multidimensional cubes and for storing key-value pairs. The original
PostgreSQL implementation of R-trees was replaced by GiST operator classes which
allowed R-trees to take advantage of the write-ahead logging and concurrency capabili-
ties provided by the GiST index. The original R-tree implementation did not have these
features, illustrating the benefits of using the GiST index template to implement specific
indices.
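For example, the contrib cube extension provides a GiST operator class for multidimensional cubes; the table and index names below are chosen for illustration.

create extension cube;
create table points (id serial primary key, pos cube);
create index points_pos_idx on points using gist (pos);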
Space-partitioned GiST (SP-GiST): Space-partitioned GiST indices support partitioned
search trees, which facilitate disk-based implementations of a wide range of non-balanced
data structures, such as quad-trees, k-d trees, and radix trees (tries). These data struc-
tures are designed for in-memory usage, with small node sizes and long paths in the tree,
and thus cannot directly be used to implement disk-based indices. SP-GiST maps search
tree nodes to disk pages in such a way that a search requires accessing only a few disk
pages, even if it traverses a larger number of tree nodes. Thus, SP-GiST permits the effi-
cient disk-based implementation of index structures originally designed for in-memory
use. Similar to GiST, SP-GiST provides an interface with a high level of abstraction that
allows the development of custom indices by providing appropriate access methods.
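As a small example (the index name is an assumption), the built-in radix-tree operator class of SP-GiST can index a text column of the department relation:

create index department_building_idx on department using spgist (building);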
Generalized Inverted Index (GIN): The GIN index is designed for speeding up queries
on multi-valued elements, such as text documents, JSON structures and arrays. A GIN
stores a set of (key, posting list) pairs, where a posting list is a set of row IDs in which
the key occurs. The same row ID might appear in multiple posting lists. Queries can
specify multiple keys; for example, with words as keys, GIN can be used to implement
keyword indices.
GIN, like GiST, provides extensibility by allowing an index implementor to specify
custom “strategies” for specific data types; the strategies specify, for example, how keys
are extracted from indexed items and from query conditions, and how to determine
whether a row that contains some of the key values in a query actually satisfies the
query.
During query execution, GIN efficiently identifies index keys that overlap the search
key, and computes a bitmap indicating which searched-for elements are members of the
index key. To do so, GIN uses support functions that extract members from a set and
support functions that compare individual members. Another support function decides
if the search predicate is satisfied, based on the bitmap and the original predicate. If
the search predicate cannot be resolved without the full indexed attribute, the deci-
sion function must report a possible match and the predicate must be rechecked after
retrieving the data item.
Each key is stored only once in a GIN index, making GIN suitable for situations where
many indexed items contain each key. However, updates are slower on GIN indices,
making them better suited for relatively static data, while GiST indices are preferred for
workloads with frequent updates.
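A common use of GIN is keyword search over text documents; the table and column names below are assumptions.

create index documents_body_idx on documents using gin (to_tsvector('english', body));
select * from documents
where to_tsvector('english', body) @@ to_tsquery('database & index');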
Block Range Index (BRIN): BRIN indices are designed for indexing very large
datasets that are naturally clustered on some specific attribute(s), with timestamps being
a natural example. Many modern applications generate datasets with such charac-
or humidity might have a timestamp column based on the time when each measure-
ment was collected. In most cases, tuples for older measurements will appear earlier in
the table storing the data. BRIN indices can speed up lookups and analytical queries
with aggregates while requiring significantly less storage space than a typical B-tree
index.
BRIN indices store some summary information for a group of pages that are physi-
cally adjacent in a table (block range). The summary data that a BRIN index will store
depends on the operator class selected for each column of the index. For example, data
types having a linear sort order can have operator classes that store the minimum and
maximum value within each block range.
A BRIN index exploits the summary information stored for each block range to
return tuples from only within the qualifying block ranges based on the query condi-
tions. For example, in case of a table with a BRIN index on a timestamp, if the table is
filtered by the timestamp, the BRIN is scanned to identify the block ranges that might
have qualifying values and a list of block ranges is returned. The decision of whether
a block range is to be selected or not is based on the summary information that the
BRIN maintains for the given block range, such as the min and max value for times-
tamps. The selected block ranges might contain false matches; that is, a block range may
contain items that do not satisfy the query condition. Therefore, the query executor
re-evaluates the predicates on tuples in the selected block ranges and discards tuples
that do not satisfy the predicate.
A BRIN index is typically very small, and scanning the index thus adds a very small
overhead compared to a sequential scan, but may help in avoiding scanning large parts
of a table that are found to not contain any matching tuples. The size of a BRIN index is
determined by the size of the relation divided by the size of the block range. The smaller
the block range, the larger the index, but at the same time the summary data stored can
be more precise, and thus more data blocks can potentially be skipped during an index
scan.
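As a sketch (the table, column, and block-range size are assumptions), a BRIN index on a timestamp column of an append-mostly table can be created as follows:

create index measurements_ts_brin on measurements using brin (ts)
  with (pages_per_range = 64);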
• Multicolumn indices: These are useful for conjuncts of predicates over multiple
columns of a table. Multicolumn indices are supported for B-tree, GiST, GIN, and
BRIN indices, and up to 32 columns can be specified in the index.
• Unique indices: Unique and primary-key constraints can be enforced by using
unique indices in PostgreSQL. Only B-tree indices may be defined as being unique.
PostgreSQL automatically creates a unique index when a unique constraint or pri-
mary key is defined for a table.
• Covering indices: A covering index is an index that includes additional attributes
that are not part of the search key (as described earlier in Section 14.6). Such
extra attributes can be added to allow a frequently used query to be answered
using only the index, without accessing the underlying relation. Plans that use
an index without accessing the underlying relation are called index-only plans; the
implementation of index-only plans in PostgreSQL is described in Section 32.3.2.4.
PostgreSQL uses the include clause to specify the extra attributes to be included in
the index. A covering index can enhance query performance; however, the index
build time increases since more data must be written, and the index becomes
larger since the non-search attributes duplicate data from the original table.
Currently, only B-tree indices support covering indices.
• Indices on expressions: In PostgreSQL, it is possible to create indices on arbitrary
scalar expressions of columns, and not just specific columns, of a table. An exam-
ple is to support case-insensitive comparisons by defining an index on the expres-
sion lower(column). A query with the predicate lower(column) = 'value' cannot
be efficiently evaluated using a regular B-tree index since matching records may
appear at multiple locations in the index; on the other hand, an index on the ex-
pression lower(column) can be used to efficiently evaluate the query since all such
records would map to the same index key.
Indices on expressions have a higher maintenance cost (i.e., slower inserts and
updates), but they are useful when retrieval speed is more important than update speed.
• Operator classes: The specific comparison functions used to build, maintain, and
use an index on a column are tied to the data type of that column. Each data type
has associated with it a default operator class that identifies the actual operators
that would normally be used for the data type. While the default operator class
is sufficient for most uses, some data types might possess multiple “meaningful”
A create index statement on the instructor relation is executed by scanning the relation to find row versions that
might be visible to a future transaction, then sorting their index attributes and building
the index structure. During this process, the building transaction holds a lock on the
instructor relation that prevents concurrent insert, delete, and update statements. Once
the process is finished, the index is ready to use and the table lock is released.
The lock acquired by the create index command may present a major inconve-
nience for some applications where it is difficult to suspend updates while the index is
built. For these cases, PostgreSQL provides the create index concurrently variant, which
allows concurrent updates during index construction. This is achieved by a more com-
plex construction algorithm that scans the base table twice. The first table scan builds
an initial version of the index, in a way similar to normal index construction described
above. This index may be missing tuples if the table was concurrently updated; how-
ever, the index is well formed, so it is flagged as being ready for insertions. Finally, the
algorithm scans the table a second time and inserts all tuples it finds that still need to
be indexed. This scan may also miss concurrently updated tuples, but the algorithm
synchronizes with other transactions to guarantee that tuples that are updated during
the second scan will be added to the index by the updating transaction. Hence, the
index is ready to use after the second table scan. Since this two-pass approach can be
expensive, the plain create index command is preferred if it is easy to suspend table
updates temporarily.
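For example, an index on the dept_name attribute of instructor (the index name is an assumption) can be built without blocking concurrent updates:

create index concurrently instructor_dept_idx on instructor (dept_name);
-- cannot be executed inside a transaction block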
allow efficient sequential access to tuples in the key order, avoiding the expensive random
access that would otherwise be required by secondary-index-based access. An index-
only scan can be used provided all attributes required by the query are contained in the
index key or in the covering attributes.
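As an illustration of a covering index enabling an index-only scan (the index name is an assumption):

create index instructor_dept_name_idx on instructor (dept_name) include (name);
select name from instructor where dept_name = 'Physics';
-- the query above can be answered by an index-only scan, without visiting the heap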
32.3.3 Partitioning
Table partitioning in PostgreSQL allows a table to be split into smaller physical pieces,
based on the value of partitioning attributes. Partitioning can be quite beneficial in cer-
tain scenarios; for example, it can improve query performance when the query includes
predicates on the partitioning attributes, and the matching tuples are in a single parti-
tion or a small number of partitions. Table partitioning can also reduce the overhead
of bulk loading and deletion in some cases by adding or removing partitions without
modifying existing partitions. Partitioning can also make maintenance operations such
as VACUUM and REINDEX faster. Further, indices on the partitions are smaller than
an index on the whole table, and thus are more likely to fit in memory. Partitioning a rela-
tion is a good idea as long as most queries that access the relation include predicates
on the partitioning attributes. Otherwise, the overhead of accessing multiple partitions
can slow down query processing to some extent.
As of version 11, PostgreSQL supports three types of table partitioning:
1. Range Partitioning: The table is partitioned into ranges (e.g., date ranges) defined
by a key column or set of columns. The range of values in each partition is as-
signed based on some partitioning expression. The ranges should be contiguous
and non-overlapping.
2. List Partitioning: The table is partitioned by explicitly listing the set of discrete
values that should appear in each partition.
3. Hash Partitioning: The tuples are distributed across different partitions according
to a hash of the partition key. Hash partitioning is useful when there is no natural
partitioning key and no detailed knowledge of the data distribution.
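The partition definitions below assume a parent takes relation that has been declared as range-partitioned on its year attribute, along the following lines (the column list is taken from the university schema and is shown here only for completeness):

create table takes
  (ID varchar(5), course_id varchar(8), sec_id varchar(8),
   semester varchar(6), year numeric(4,0), grade varchar(2))
  partition by range (year);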
create table takes_till_2017 partition of takes for values from (1900) to (2017);
create table takes_2017 partition of takes for values from (2017) to (2018);
create table takes_2018 partition of takes for values from (2018) to (2019);
create table takes_2019 partition of takes for values from (2019) to (2020);
create table takes_from_2020 partition of takes for values from (2020) to (2100);
New tuples are routed to the proper partitions according to the selected partition key.
Partition key ranges must not overlap, and there must be a partition defined for each
valid key value. The query planner of PostgreSQL can exploit the partitioning informa-
tion to eliminate unnecessary partition accesses during query processing.
Each partition as above is a normal PostgreSQL table, and it is possible to specify
a tablespace and storage parameters for each partition separately. Partitions may have
their own indexes, constraints and default values, distinct from those of other partitions
of the same table. However, there is no support for foreign keys referencing partitioned
tables, or for exclusion constraints3 spanning all partitions.
Turning a table into a partitioned table or vice versa is not supported; however,
it is possible to add a regular or partitioned table containing data as a partition of a
partitioned table, or remove a partition from a partitioned table turning it into a stand-
alone table.
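For instance, using the partitions defined above, a partition can be detached into a stand-alone table and later re-attached:

alter table takes detach partition takes_till_2017;  -- becomes a stand-alone table
alter table takes attach partition takes_till_2017
  for values from (1900) to (2017);                  -- attach it again as a partition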
32.4 Query Processing and Optimization

When PostgreSQL receives a query, the query is first parsed into an internal represen-
tation, which goes through a series of transformations, resulting in a query plan that is
used by the executor to process the query.
3 Exclusion constraints in PostgreSQL allow a constraint on each row that can involve other rows; for example, such a
constraint can specify that there is no other row with the same key value, or there is no other row with an overlap-
ping range. Efficient implementation of exclusion constraints requires the availability of appropriate indices. See the
PostgreSQL manuals for more details.
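A sketch of such an exclusion constraint, preventing overlapping bookings of the same room (the table and column names are assumptions; the btree_gist extension supplies the index support needed for equality comparisons on room):

create extension btree_gist;
create table room_booking
  (room varchar(10),
   during tsrange,
   exclude using gist (room with =, during with &&));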
The rewrite phase first deals with all update, delete, and insert statements by firing
all appropriate rules. Such statements might be complicated and contain select clauses.
Subsequently, all the remaining rules involving only select statements are fired. Since
the firing of a rule may cause the query to be rewritten to a form that may require
another rule to be fired, the rules are repeatedly checked on each form of the rewritten
query until a fixed point is reached and no more rules need to be fired.
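Views are one example: a view is implemented by a rule, so a select on the view below (the view name is chosen for illustration) is rewritten into a select on the underlying instructor table.

create view physics_instructors as
  select ID, name, salary from instructor where dept_name = 'Physics';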
• Standard planner: The standard planner uses the bottom-up dynamic-program-
ming algorithm for join order optimization, which we saw earlier in Section 16.4.1,
and which is often referred to as the System R optimization algorithm.
• Genetic query optimizer: When the number of tables in a query block is very
large, System R’s dynamic programming algorithm becomes very expensive. Un-
like many commercial systems, which fall back to greedy or rule-based techniques,
PostgreSQL uses a more radical approach: a genetic algorithm that was developed
initially to solve traveling-salesman problems. There exists anecdotal evidence of
the successful use of genetic query optimization in production systems for queries
with around 45 tables.
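The choice between the two planners is controlled by configuration parameters; the values shown are the usual defaults.

set geqo = on;            -- enable the genetic query optimizer
set geqo_threshold = 12;  -- use it only for queries joining at least this many tables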
Since the planner operates in a bottom-up fashion on query blocks, it is able to per-
form certain transformations on the query plan as it is being built. One example is the
common subquery-to-join transformation that is present in many commercial systems
(usually implemented in the rewrite phase). When PostgreSQL encounters a noncorre-
lated subquery (such as one caused by a query on a view), it is generally possible to pull
up the planned subquery and merge it into the upper-level query block. PostgreSQL is
able to decorrelate many classes of correlated subqueries, but there are other classes
of queries that it is not able to decorrelate. (Decorrelation is described in more detail
in Section 16.4.4.)
The query optimization phase results in a query plan, which is a tree of relational
operators. Each operator represents a specific operation on one or more sets of tu-
ples. The operators can be unary (for example, sort, aggregation), binary (for example,
nested-loop join), or n-ary (for example, set union).
Crucial to the cost model is an accurate estimate of the total number of tuples
that will be processed at each operator in the plan. These estimates are inferred by the
optimizer on the basis of statistics that are maintained on each relation in the system.
These statistics include the total number of tuples for each relation and average tuple
size. PostgreSQL also maintains statistics about each column of a relation, such as the
column cardinality (that is, the number of distinct values in the column), a list of most
common values in the table and the number of occurrences of each common value,
and a histogram that divides the column’s values into groups of equal population. In
addition, PostgreSQL also maintains a statistical correlation between the physical and
logical row orderings of a column’s values— this indicates the cost of an index scan to
retrieve tuples that pass predicates on the column. The DBA must ensure that these
statistics are current by running the analyze command periodically.
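For example, the statistics for the instructor relation can be refreshed and then inspected through the pg_stats view:

analyze instructor;
select attname, n_distinct, most_common_vals, correlation
from pg_stats
where tablename = 'instructor';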
1. Access methods: Access methods are operators that are used to retrieve data from
disk, and include sequential scans of heap files, index scans, and bitmap index
scans.
• Sequential scans: The tuples of a relation are scanned sequentially from the
first to last blocks of the file. Each tuple is returned to the caller only if it is
“visible” according to the transaction isolation rules (Section 32.5.1.1).
• Index scans: Given a search condition such as a range or equality predicate,
an index scan returns a set of matching tuples from the associated heap file.
In a typical case, the operator processes one tuple at a time, starting by read-
ing an entry from the index and then fetching the corresponding tuple from
the heap file. This can result in a random page fetch for each tuple in the
worst case. The cost of accessing the heap file can be alleviated if an index-
only scan is used that allows for retrieving data directly from the index (see
Section 32.3.2.4 for more details).
• Bitmap index scans: A bitmap index scan reduces the danger of excessive
random page fetches in index scans (an example plan is sketched after this
list). To do so, processing of tuples is done in two phases.
a. The first phase reads all index entries and populates a bitmap that con-
tains one bit per heap page; the tuple ID retrieved from the index scan is
used to set the bit of the corresponding page.
b. The second phase fetches heap pages whose bit is set, scanning the
bitmap in sequential order. This guarantees that each heap page is ac-
cessed only once, and increases the chance of sequential page fetches.
Once a heap page is fetched, the index predicate is rechecked on all the
tuples in the page, since a page whose bit is set may well contain tuples
that do not satisfy the index predicate.
Moreover, bitmaps from multiple indexes can be merged and intersected to
evaluate complex Boolean predicates before accessing the heap.
2. Join methods: PostgreSQL supports three join methods: sorted merge joins,
nested-loop joins (including index-nested loop variants for accessing the inner
relation using an index), and a hybrid hash join.
3. Sort: Small relations are sorted in memory using quicksort, while larger relations
are sorted using an external sort algorithm. Initially, the input tuples are stored in
an unsorted array as long as there is available working memory for the sort op-
eration. If all the tuples fit in memory, the array is sorted using quicksort, and
the sorted tuples can be accessed by sequentially scanning the array. Otherwise,
the input is divided into sorted runs by using replacement selection; replacement
selection uses a priority tree implemented as a heap, and can generate sorted
runs that are bigger than the available memory. The sorted runs are stored in
temporary files and then merged using a polyphase merge.
4. Aggregation: Grouped aggregation in PostgreSQL can be either sort-based or
hash-based. When the estimated number of distinct groups is very large, the for-
mer is used; otherwise, an in-memory hash-based approach is preferred.
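As a minimal illustration of a bitmap scan (the index on salary is an assumption, and the plan is abbreviated), a moderately selective range predicate may produce a plan of the following shape:

explain select * from instructor where salary between 60000 and 70000;
--   Bitmap Heap Scan on instructor
--     Recheck Cond: ((salary >= 60000) AND (salary <= 70000))
--     ->  Bitmap Index Scan on instructor_salary_idx
--           Index Cond: ((salary >= 60000) AND (salary <= 70000))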
4 See https://www.postgresql.org/docs/current/parallel-query.html.
As of PostgreSQL 11, parallel query plans will not be used if the transaction isolation level is
set to serializable, although this may be fixed in future versions.
When a parallel query operator is used, the master backend process coordinates
the parallel execution. It is responsible for spawning the required number of workers
and for executing the non-parallel parts of the plan, while also contributing to the parallel
execution as one of the workers. The planner determines the number of background
workers that will be used to process the child plan of the Gather node.
A parallel query plan includes a Gather or Gather Merge node which has exactly
one child plan. This child plan is the part of the plan that will be executed in parallel.
If the root node is Gather or Gather Merge, then the whole query can be executed in
parallel. The master backend executes the Gather or Gather Merge node. The Gather
node is responsible for retrieving the tuples generated by the background workers. A
Gather Merge node is used when the parallel part of the plan produces tuples in sorted
order. The background workers and the master backend process communicate through
the shared memory area.
PostgreSQL has parallel-aware flavors for the basic query operations. It supports
three types of parallel scans; namely, parallel sequential scan, parallel bitmap heap scan
and parallel index/index-only scan (only for B-tree indexes). PostgreSQL also supports
parallel versions of nested loop, hash and merge joins. In a join operation, at least
one of the tables is scanned by multiple background workers. Each background worker
additionally scans the inner table of the join and then forwards the computed tuples to
the master backend coordinator process. For nested-loop join and merge join the inner
side of the join is always non-parallel.
PostgreSQL can also generate parallel plans for aggregation operations. In this case,
the aggregation happens in two steps: (a) each background worker produces a partial
result for a subset of the data, and (b) the partial results are collected to the master
backend process which computes the final result using the partial results generated by
the workers.
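Parallel execution is governed by configuration parameters such as the following; the values are illustrative.

set max_parallel_workers_per_gather = 4;   -- workers that may be assigned to one Gather node
set min_parallel_table_scan_size = '8MB';  -- smallest table considered for a parallel scan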
Interpreted evaluation of query plans provides significant flexibility by hiding many low-level details. However, that flexibility
can incur a performance hit due to the interpretation overhead (e.g., function calls, un-
predictable branches, high number of instructions). For example, computing the qual-
ifying tuples for any arbitrary SQL expression requires evaluating predicates that can
use any of the supported SQL data types (e.g., integer, double); the predicate evaluation
function must be able to handle all these data types, and also handle sub-expressions
containing other operators. The evaluation function is, in effect, an interpreter that
processes the execution plan. With JIT compilation, a generic interpreted program can
be compiled at query execution time into a native-code program that is tailored for the
specific data types used in a particular expression. The resultant compiled code can
execute significantly faster than the original interpreted function.
As of PostgreSQL 11, PostgreSQL exploits JIT compilation to accelerate expression
evaluation and tuple deforming (which is explained shortly). Those operations were
chosen because they are executed very frequently (per tuple) and therefore have a high
cost for analytics queries that process large amounts of data. PostgreSQL accelerates
expression evaluation (i.e., the code path used to evaluate WHERE clause predicates,
expressions in target lists, aggregates and projections) by generating tailored code to
each case, depending on the data types of the attributes. Tuple deforming is the process
of transforming an on-disk tuple into its in-memory representation. JIT compilation
in PostgreSQL creates a transformation function specific to the table layout and the
columns to be extracted. JIT compilation may be added for other operations in future
releases.
JIT compilation is primarily beneficial for long-running CPU-bound queries. For
short-running queries the overhead of performing JIT compilation and optimizations
can be higher than the savings in execution time. PostgreSQL selects whether JIT opti-
mizations will be applied during planning, based on whether the estimated query cost
is above some threshold.
PostgreSQL uses LLVM to perform JIT compilation; LLVM allows systems to gen-
erate device-independent assembly language code, which is then optimized and com-
piled to machine code specific to the hardware platform. As of PostgreSQL 11, JIT
compilation support is not enabled by default, and is used only when PostgreSQL is
built using the --with-llvm option. The LLVM-dependent code is loaded on demand
from a shared library.
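JIT behavior is controlled by settings such as the following (the cost thresholds shown correspond to the usual defaults):

set jit = on;
set jit_above_cost = 100000;           -- use JIT only for plans whose estimated cost exceeds this value
set jit_optimize_above_cost = 500000;  -- apply the more expensive LLVM optimizations above this cost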
• Dirty read. The transaction reads values written by another transaction that has
not committed yet.
• Nonrepeatable read. A transaction reads the same object twice during execution
and finds a different value the second time, although the transaction has not
changed the value in the meantime.
• Phantom read. A transaction re-executes a query returning a set of rows that sat-
isfy a search condition and finds that the set of rows satisfying the condition has
changed as a result of another recently committed transaction.
• Serialization anomaly. A successfully committed group of transactions is inconsis-
tent with all possible orderings of running those transactions one at a time.
Each of the above phenomena violates transaction isolation, and hence violates serial-
izability. Figure 32.3 shows the definition of the four SQL isolation levels specified in
the SQL standard— read uncommitted, read committed, repeatable read, and serializ-
able— in terms of these phenomena. In PostgreSQL the user can select any of the four
transaction isolation levels (using the command set transaction); however, PostgreSQL
implements only three distinct isolation levels. A request to set transaction isolation
level to read uncommitted is treated the same as a request to set the isolation level to
read committed. The default isolation level is read committed.
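The isolation level of an individual transaction can be chosen as follows:

begin;
set transaction isolation level repeatable read;
select budget from department where dept_name = 'Physics';
commit;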
Multiversion concurrency control in PostgreSQL determines which versions of a row in a table are valid within the context of a given statement or transaction.
A transaction determines tuple visibility based on a database snapshot that is chosen
before executing a command.
A tuple is visible for a transaction T if the following two conditions hold:
1. The tuple was created by a transaction that committed before transaction T took
its snapshot.
2. Updates to the tuple (if any) were executed by a transaction that
• aborted, or
• started running after T took its snapshot, or
• was still active when T took its snapshot.
• Each tuple in a table has a header with three fields: xmin, which contains the
transaction ID of the transaction that created the tuple and which is therefore also
called the creation-transaction ID; xmax, which contains the transaction ID of the
replacing/deleting transaction (or null if not deleted/replaced) and which is also
referred to as the expire-transaction ID; and a forward link to new versions of the
same logical row, if there are any.
• A SnapshotData data structure is created either at transaction start time or at query
start time, depending on the isolation level (described in more detail below). Its
main purpose is to decide whether a tuple is visible to the current command. The
SnapshotData stores information about the state of transactions at the time it is
created, which includes a list of active transactions and xmax, a value equal to 1
+ the highest ID of any transaction that has started so far. The value xmax serves
as a “cutoff” for transactions that may be considered visible.
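The xmin and xmax header fields of the visible version of each row can be inspected through system columns of the same names; for example:

select xmin, xmax, dept_name, budget from department;
-- xmax is shown as 0 for versions that have not been deleted or replaced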
Figure 32.4 illustrates some of this state information through a simple example
involving a database with only one table, the department table from Figure 32.5. The
department table has three columns, the name of the department, the building where the
department is located, and the budget of the department. Figure 32.4 shows a fragment
of the department table containing only the (versions of) the row corresponding to the
Physics department. The tuple headers indicate that the row was originally created by
transaction 100, and later updated by transaction 102 and transaction 106. Figure 32.4
also shows a fragment of the corresponding pg_xact file. On the basis of the pg_xact
file, transactions 100 and 102 are committed, while transactions 104 and 106 are in
progress.
Given the above state information, the two conditions for tuple visibility can be checked
directly against the xmin and xmax fields in the tuple headers, the pg_xact file, and the
SnapshotData, as the following example illustrates.
Consider the example database in Figure 32.4 and assume that the SnapshotData
used by transaction 104 simply uses 103 as the cutoff transaction ID xmax and does
not show any earlier transactions to be active. In this case, the only version of the row
corresponding to the Physics department that is visible to transaction 104 is the second
version in the table, created by transaction 102. The first version, created by transaction
100, is not visible, since it violates condition 2: The expire-transaction ID of this tuple is
102, which corresponds to a transaction that is not aborted and that has a transaction
ID less than or equal to 103. The third version of the Physics tuple is not visible, since
it was created by transaction 106, which has a transaction ID larger than transaction
103, implying that this version had not been committed at the time SnapshotData was
created. Moreover, transaction 106 is still in progress, which violates another one of the
conditions. The second version of the row meets all the conditions for tuple visibility.
The details of how PostgreSQL MVCC interacts with the execution of SQL state-
ments depends on whether the statement is an insert, select, update, or delete statement.
The simplest case is an insert statement, which may simply create a new tuple based
on the data in the statement, initialize the tuple header (the creation ID), and insert
the new tuple into the table. Unlike two-phase locking, this does not require any inter-
action with the concurrency-control protocol unless the insertion needs to be checked
for integrity conditions, such as uniqueness or foreign key constraints.
When the system executes a select, update, or delete statement the interaction with
the MVCC protocol depends on the isolation level specified by the application. If the
isolation level is read committed, the processing of a new statement begins with creating
a new SnapshotData data structure (independent of whether the statement starts a new
transaction or is part of an existing transaction). Next, the system identifies target tuples,
that is, the tuples that are visible with respect to the SnapshotData and that match the
search criteria of the statement. In the case of a select statement, the set of target tuples
make up the result of the query.
In the case of an update or delete statement in read committed mode, the snapshot
isolation protocol used by PostgreSQL requires an extra step after identifying the target
tuples and before the actual update or delete operation can take place. The reason is
that visibility of a tuple ensures only that the tuple has been created by a transaction that
committed before the update/delete statement in question started. However, it is possi-
ble that, since query start, this tuple has been updated or deleted by another concur-
rently executing transaction. This can be detected by looking at the expire-transaction
ID of the tuple. If the expire-transaction ID corresponds to a transaction that is still
in progress, it is necessary to wait for the completion of this transaction first. If the
transaction aborts, the update or delete statement can proceed and perform the actual
modification. If the transaction commits, the search criteria of the statement need to
be evaluated again, and only if the tuple still meets these criteria can the row be mod-
ified. If the row is to be deleted, the main step is to update the expire-transaction ID
of the old tuple. A row update also performs this step, and additionally creates a new
version of the row, sets its creation-transaction ID, and sets the forward link of the old
tuple to reference the new tuple.
Going back to the example from Figure 32.4, transaction 104, which consists of a
select statement only, identifies the second version of the Physics row as a target tuple
and returns it immediately. If transaction 104 were an update statement instead, for
example, trying to increment the budget of the Physics department by some amount, it
would have to wait for transaction 106 to complete. It would then re-evaluate the search
condition and, only if it is still met, proceed with its update.
Using the protocol described above for update and delete statements provides only
the read-committed isolation level. Serializability can be violated in several ways. First,
nonrepeatable reads are possible. Since each query within a transaction may see a
different snapshot of the database, a query in a transaction might see the effects of an
update command completed in the meantime that were not visible to earlier queries
within the same transaction. Following the same line of thought, phantom reads are
possible when a relation is modified between queries.
In order to provide the PostgreSQL serializable isolation level, PostgreSQL MVCC
eliminates violations of serializability in two ways: First, when it is determining tuple
visibility, all queries within a transaction use a snapshot as of the start of the transac-
tion, rather than the start of the individual query. This way successive queries within a
transaction always see the same data.
Second, the way updates and deletes are processed is different in serializable mode
compared to read-committed mode. As in read-committed mode, transactions wait af-
ter identifying a visible target row that meets the search condition and is currently
updated or deleted by another concurrent transaction. If the concurrent transaction
that executes the update or delete aborts, the waiting transaction can proceed with
its own update. However, if the concurrent transaction commits, there is no way for
PostgreSQL to ensure serializability for the waiting transaction. Therefore, the waiting
transaction is rolled back and returns the following error message: "could not serialize
access due to read/write dependencies among transactions". It is up to the applica-
tion to handle an error message like the above appropriately, by aborting the current
transaction and restarting the entire transaction from the beginning.
Further, to ensure serializability, the serializable snapshot-isolation technique
(which is used when the isolation level is set to serializable) tracks read-write conflicts
among concurrent transactions and aborts transactions whose commit could otherwise
result in a non-serializable execution.
The use of MVCC has several implications:

1. An extra burden is placed on the storage system, since it needs to maintain dif-
ferent versions of tuples.
2. The development of concurrent applications takes some extra care, since
PostgreSQL MVCC can lead to subtle, but important, differences in how concur-
rent transactions behave, compared to systems where standard two-phase locking
is used.
3. The performance of MVCC depends on the characteristics of the workload run-
ning on it.
In general, versions of tuples are freed up by the vacuum process of PostgreSQL. The
vacuum process can be initiated by a command, but PostgreSQL employs a background
process to vacuum tables automatically. The vacuum process first scans the heap, and
whenever it finds a tuple version that cannot be accessed by any current/future trans-
action, it marks the tuple as “dead”. The vacuum process then scans all indices of the
relation, and removes any entries that point to dead tuples. Finally, it rescans the heap,
physically deleting tuple versions that were marked as dead earlier.
PostgreSQL also supports a more aggressive form of tuple reclaiming in cases where
the creation of a version does not affect the attributes used in indices, and further the
old and new tuple versions are on the same page. In this case no index entry is created
for the new tuple version, but instead a link is added from the old tuple version in the
heap page to the new tuple version (which is also on the same heap page). An index
lookup will first find the old version, and if it determines that the version is not visible
to the transaction, the version chain is followed to find the appropriate version. When
the old version is no longer visible to any transaction, the space for the old version can
be reclaimed in the heap page by some clever data structure tricks within the page,
without touching the index.
The vacuum command offers two modes of operation: Plain vacuum simply iden-
tifies tuples that are not needed, and makes their space available for reuse. This form
of the command executes as described above, and can operate in parallel with normal
reading and writing of the table. Vacuum full does more extensive processing, includ-
ing moving of tuples across blocks to try to compact the table to the minimum number
of disk blocks. This form is much slower and requires an exclusive lock on each table
while it is being processed.
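For example (autovacuum normally makes manual invocation unnecessary):

vacuum verbose instructor;  -- plain vacuum: marks dead versions and frees their space for reuse
vacuum full instructor;     -- rewrites the table to compact it; takes an exclusive lock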
PostgreSQL’s approach to concurrency control performs best for workloads con-
taining many more reads than updates, since in this case there is a very low chance
that two updates will conflict and force a transaction to roll back. Two-phase locking
may be more efficient for some update-intensive workloads, but this depends on many
factors, such as the length of transactions and the frequency of deadlocks.
only locks of the first three types. These three lock types are compatible with each
other, since MVCC takes care of protecting these operations against each other. DML
commands acquire these locks only for protection against DDL commands.
While their main purpose is providing PostgreSQL internal concurrency control for
DDL commands, all locks in Figure 32.6 can also be acquired explicitly by PostgreSQL
applications through the lock table command.
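A sketch of an explicit table-level lock request (the lock mode is one of those listed in Figure 32.6):

begin;
lock table instructor in share row exclusive mode;
-- statements that must not run concurrently with conflicting commands
commit;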
Locks are recorded in a lock table that is implemented as a shared-memory hash
table keyed by a signature that identifies the object being locked. If a transaction wants
to acquire a lock on an object that is held by another transaction in a conflicting
mode, it needs to wait until the lock is released. Lock waits are implemented through
semaphores, each of which is associated with a unique transaction. When waiting for
a lock, a transaction actually waits on the semaphore associated with the transaction
holding the lock. Once the lock holder releases the lock, it signals the waiting transactions through the semaphore.
32.5.2 Recovery
PostgreSQL employs write-ahead log (WAL) based recovery to ensure atomicity and
durability. The approach is similar to the standard recovery techniques; however, re-
covery in PostgreSQL is simplified in some ways because of the MVCC protocol.
Under PostgreSQL, recovery does not have to undo the effects of aborted trans-
actions: an aborting transaction makes an entry in the pg_xact file, recording the fact
that it is aborting. Consequently, all versions of rows it leaves behind will never be
visible to any other transactions. The only case where this approach could potentially
lead to problems is when a transaction aborts because of a crash of the corresponding
PostgreSQL process and the PostgreSQL process does not have a chance to create the
pg_xact entry before the crash. PostgreSQL handles this as follows: Before checking the
status of a transaction in the pg_xact file, PostgreSQL checks whether the transaction
is running on any of the PostgreSQL processes. If no PostgreSQL process is currently
running the transaction, but the pg_xact file shows the transaction as still running, it
is safe to assume that the transaction crashed, and the transaction's pg_xact entry is
updated to "aborted".
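WAL and checkpoint behavior can be tuned through configuration parameters; the values below are illustrative rather than recommendations.

alter system set max_wal_size = '2GB';          -- WAL volume that may accumulate between checkpoints
alter system set checkpoint_timeout = '10min';  -- maximum time between automatic checkpoints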
The window of potential data loss is shorter (typically below a second) than with
file-based log shipping.
PostgreSQL can also operate using synchronous replication. In this case, each com-
mit of a write transaction waits until confirmation is received that the commit has been
written to the WAL on disk of both the primary and secondary server. Even though this
approach increases the confidence that the data of a transaction commit will be avail-
able, commit processing is slower. Further, data loss is still possible if both primary and
secondary servers crash at the same time. For read-only transactions and transaction
rollbacks, there is no need to wait for the response from the secondary servers.
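A sketch of enabling synchronous replication on the primary (the standby name is an assumption):

alter system set synchronous_standby_names = 'standby1';
alter system set synchronous_commit = on;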
In addition to physical replication, PostgreSQL also supports logical replication.
Logical replication allows for fine-grained control over data replication by replicat-
ing logical data modifications from the WAL based on a replication identity (usually
a primary key). Physical replication, on the other hand, is based on exact block ad-
dresses and byte-by-byte replication. Logical replication can be enabled by setting
the wal_level configuration parameter to logical.
Logical replication is implemented using a publish-and-subscribe model in which
one or more subscribers subscribe to one or more publications (changes generated
from a table or a group of tables). The server responsible for sending the changes is
called the publisher, while the server that subscribes to the changes is called the subscriber.
When logical replication is enabled, the subscriber receives a snapshot of the data on
the publisher database. Then, each change that happens on the publisher is identified
and sent to the subscriber using streaming replication. The subscriber is responsible for
applying the changes in the same order as the publisher, to guarantee consistency. Typical
use cases for logical replication include replicating data between different platforms
or different major versions of PostgreSQL, sharing a subset of the database between
different groups of users, sending incremental changes in a single database to subscribers,
and consolidating multiple databases into a single one.
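A minimal logical-replication setup might look as follows; the connection string, publication, and subscription names are assumptions, and the publisher must run with wal_level set to logical.

-- on the publisher:
create publication takes_pub for table takes;
-- on the subscriber:
create subscription takes_sub
  connection 'host=publisher.example.com dbname=university'
  publication takes_pub;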
32.6 SQL Variations and Extensions

The current version of PostgreSQL supports almost all entry-level SQL-92 features, as
well as many of the intermediate- and full-level features. It also supports many SQL:1999
and SQL:2003 features, including most object-relational features and the SQL/XML fea-
tures for parsed XML. In fact, some features of the current SQL standard (such as
arrays, functions, and inheritance) were pioneered by PostgreSQL.
• Base types: Base types are also known as abstract data types; that is, modules that
encapsulate both state and a set of operations. These are implemented below the
SQL level, typically in a language such as C (see Section 32.6.2.1). Examples are
int4 (already included in PostgreSQL) or complex (included as an optional exten-
sion type). A base type may represent either an individual scalar value or a variable-
length array of values. For each scalar type that exists in a database, PostgreSQL
automatically creates an array type that holds values of the same scalar type.
• Composite types: These correspond to table rows; that is, they are a list of field
names and their respective types. A composite type is created implicitly whenever
a table is created, but users may also construct them explicitly.
• Domains: A domain type is defined by coupling a base type with a constraint that
values of the type must satisfy. Values of the domain type and the associated base
type may be used interchangeably, provided that the constraint is satisfied. A do-
main may also have an optional default value, whose meaning is similar to the
default value of a table column.
• Enumerated types: These are similar to enum types used in programming languages
such as C and Java. An enumerated type is essentially a fixed list of named values.
In PostgreSQL, enumerated types may be converted to the textual representation
of their name, but this conversion must be specified explicitly in some cases to
ensure type safety. For instance, values of different enumerated types may not be
compared without explicit conversion to compatible types.
• Pseudotypes: Currently, PostgreSQL supports the following pseudotypes:
any, anyarray, anyelement, anyenum, anynonarray, cstring, internal, opaque, language_handler, record, trigger, and void. These cannot be used in composite types
(and thus cannot be used for table columns), but can be used as argument and
return types of user-defined functions.
• Polymorphic types: Four of the pseudotypes, namely anyelement, anyarray, anynonarray, and anyenum, are collectively known as polymorphic. Functions with arguments of
these types (correspondingly called polymorphic functions) may operate on any ac-
tual type. PostgreSQL has a simple type-resolution scheme that requires that: (1)
in any particular invocation of a polymorphic function, all occurrences of a poly-
morphic type must be bound to the same actual type (that is, a function defined
as f (anyelement, anyelement) may operate only on pairs of the same actual type),
and (2) if the return type is polymorphic, then at least one of the arguments must
be of the same polymorphic type.
• JSON types: PostgreSQL provides the json and jsonb data types for storing JSON data. A value of type json is stored as an exact copy of the input text and must be reparsed on each execution, whereas jsonb data is stored in a decomposed binary format that makes it slightly slower to input due to added conversion overhead, but significantly faster to process while supporting indexing capabilities.
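As a brief illustration of some of these categories, the following declarations sketch a domain and an enumerated type (the names and the constraint are chosen only for illustration):

create domain positive_int as integer check (value > 0);
create type mood as enum ('sad', 'neutral', 'happy');

A positive_int value can be used wherever an integer is expected, as long as the constraint is satisfied, while two mood values can be compared with each other but not with values of a different enumerated type without an explicit conversion.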
32.6.2 Extensibility
Like most relational database systems, PostgreSQL stores information about databases,
tables, columns, and so forth, in what are commonly known as system catalogs, which
appear to the user as normal tables. Other relational database systems are typically
extended by changing hard-coded procedures in the source code or by loading special
extension modules written by the vendor.
Unlike most relational database systems, PostgreSQL goes one step further and
stores much more information in its catalogs: not only information about tables and
columns, but also information about data types, functions, access methods, and so
on. Therefore, PostgreSQL makes it easy for users to extend the system and facilitates rapid prototyping of new applications and storage structures. PostgreSQL can also incorporate
user-written code into the server, through dynamic loading of shared objects. This pro-
vides an alternative approach to writing extensions that can be used when catalog-based
extensions are not sufficient.
Furthermore, the contrib module of the PostgreSQL distribution includes numer-
ous user functions (for example, array iterators, fuzzy string matching, cryptographic
functions), base types (for example, encrypted passwords, ISBN/ISSNs, n-dimensional
cubes) and index extensions (for example, RD-trees,5 indexing for hierarchical labels).
Thanks to the open nature of PostgreSQL, there is a large community of PostgreSQL
professionals and enthusiasts who also actively extend PostgreSQL. Extension types are
identical in functionality to the built-in types; the latter are simply already linked into
the server and preregistered in the system catalog. Similarly, this is the only difference
between built-in and extension functions.
32.6.2.1 Types
PostgreSQL allows users to define composite types, enumeration types, and even new
base types. A composite-type definition is similar to a table definition (in fact, the lat-
ter implicitly does the former). Stand-alone composite types are typically useful for
function arguments. For example, a definition along the following lines (the attribute list shown is only illustrative):
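create type city_t as (
    name   varchar(80),   -- illustrative attributes
    state  char(2)
);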
allows functions to accept and return city_t tuples, even if there is no table that explicitly
contains rows of this type.
5 RD-trees are designed to index sets of items, and support set-containment queries such as finding all sets that contain a given query set.
The next step is to define functions to read and write values of the new type in text
format. Subsequently, the new type can be registered using a statement of the following form:
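create type complex (
    internallength = 16,   -- assumes two 8-byte floating-point fields
    input = complex_in,
    output = complex_out
);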
assuming that the text I/O functions have been registered as complex_in and complex_out.
The user has the option of defining binary I/O functions as well (for more efficient data
dumping). Extension types can be used like the existing base types of PostgreSQL. In
fact, their only difference is that the extension types are dynamically loaded and linked
into the server. Furthermore, indices may be extended easily to handle new base types;
see Section 32.6.2.3.
32.6.2.2 Functions
PostgreSQL allows users to define functions that are stored and executed on the server.
PostgreSQL also supports function overloading (that is, functions may be declared by
using the same name but with arguments of different types). Functions can be written as
plain SQL statements, or in several procedural languages (covered in Section 32.6.2.4).
Finally, PostgreSQL has an application programmer interface for adding functions writ-
ten in C (explained in this section).
User-defined functions can be written in C (or a language with compatible calling
conventions, such as C++). The actual coding conventions are essentially the same for
dynamically loaded, user-defined functions, as well as for internal functions (which
are statically linked into the server). Hence, the standard internal function library is a
rich source of coding examples for user-defined C functions. Once the shared library
containing the function has been created, a declaration such as the following registers
it on the server:
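create function complex_out(complex) returns cstring
    as 'complex'          -- hypothetical shared-library file name
    language c immutable strict;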
The entry point to the shared object file is assumed to be the same as the SQL function name (here, complex_out), unless otherwise specified.
The example here continues the one from Section 32.6.2.1. The application programming interface hides most of PostgreSQL's internal details. Hence, the actual C code for the above text output function of complex values is quite simple; a sketch follows (it assumes that Complex is a C struct with two double fields, x and y, and that the postgres.h and fmgr.h headers have been included):
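PG_FUNCTION_INFO_V1(complex_out);      /* version-1 calling convention */

Datum
complex_out(PG_FUNCTION_ARGS)
{
    /* Complex is assumed to be a struct with two double fields, x and y */
    Complex    *complex = (Complex *) PG_GETARG_POINTER(0);
    char       *result;

    /* palloc() allocates memory in the current PostgreSQL memory context */
    result = (char *) palloc(100);
    snprintf(result, 100, "(%g,%g)", complex->x, complex->y);
    PG_RETURN_CSTRING(result);
}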
The first line declares the function complex_out using the version-1 calling convention, and the following lines implement the output function. The code uses several PostgreSQL-specific constructs, such as the
palloc() function, which dynamically allocates memory controlled by PostgreSQL’s
memory manager.
Aggregate functions in PostgreSQL operate by updating a state value via a state
transition function that is called for each tuple value in the aggregation group. For
example, the state for the avg operator consists of the running sum and the count
of values. As each tuple arrives, the transition function simply adds its value to the running sum and increments the count by one. Optionally, a final function may be called
to compute the return value based on the state information. For example, the final
function for avg would simply divide the running sum by the count and return it.
Thus, defining a new aggregate function (referred to as a user-defined aggregate function) is as simple as defining these functions. For the complex type example, if complex_add is a user-defined function that takes two complex arguments and returns their sum, then the sum aggregate operator can be extended to complex numbers using the simple
declaration:
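create aggregate sum (complex)
(
    sfunc = complex_add,   -- state transition function
    stype = complex        -- state value type
);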
Note the use of function overloading: PostgreSQL will call the appropriate sum aggre-
gate function, on the basis of the actual type of its argument during invocation. The
stype is the state value type. In this case, a final function is unnecessary, since the return
value is the state value itself (that is, the running sum in both cases).
User-defined functions can also be invoked by using operator syntax. Beyond sim-
ple “syntactic sugar” for function invocation, operator declarations can also provide
hints to the query optimizer in order to improve performance. These hints may include
information about commutativity, restriction and join selectivity estimation, and vari-
ous other properties related to join algorithms.
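For example, an addition operator on complex values might be declared as follows (a sketch; complex_add is the function introduced above, and the commutator clause tells the optimizer that the operator is commutative):

create operator + (
    leftarg = complex,
    rightarg = complex,
    function = complex_add,
    commutator = +
);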
The index structures of PostgreSQL can be extended to handle new base types by defining an operator class, which identifies two kinds of components:
• Index-method strategies: These are a set of operators that can be used as qualifiers in where clauses. The particular set depends on the index type. For example, B-tree indices can retrieve ranges of objects, so the set consists of five operators (<, <=, =, >=, and >), all of which can appear in a where clause involving a B-tree index; a hash index, in contrast, allows only equality testing.
• Index-method support routines: The above set of operators is typically not sufficient
for the operation of the index. For example, a hash index requires a function to
compute the hash value for each object.
For example, if suitable functions and operators have been defined to compare the magnitude of complex numbers (see Section 32.6.2.1), then we can make such objects indexable by a declaration of the following form:
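A sketch of such a declaration follows; it assumes that the comparison operators <, <=, =, >=, and > have been defined on complex values (based on their absolute value), together with a three-way comparison support function, here called complex_abs_cmp (the operator-class and function names are illustrative):

create operator class complex_abs_ops
    default for type complex using btree as
        operator  1  < ,
        operator  2  <= ,
        operator  3  = ,
        operator  4  >= ,
        operator  5  > ,
        function  1  complex_abs_cmp(complex, complex);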
The operator statements define the strategy methods and the function statements define
the support methods.
Foreign data wrappers (FDW) allow a user to connect with external data sources to
transparently query data that reside outside of PostgreSQL, as if the data were part
of an existing table in a database. PostgreSQL implements FDWs to provide SQL/MED
(“Management of External Data”) functionality. SQL/MED is an extension of the ANSI
SQL standard specification that defines types that allow a database management system
to access external data. FDWs can be a powerful tool both for data migration and data
analysis scenarios.
Today, there are a number of FDWs that enable PostgreSQL to access different re-
mote stores, such as other relational databases supporting SQL, key-value (NoSQL)
sources, and flat files; however, most of them are implemented as PostgreSQL exten-
sions and are not officially supported. PostgreSQL provides two FDW modules:
• file_fdw: The file_fdw module allows users to create foreign tables for data files in the server's file system, or to specify commands to be executed on the server whose output is read. Access is read-only, and the data file or command output must be in a format compatible with the copy from command. This includes csv files, text files with one row per line and columns separated by a user-specified delimiter character, and a PostgreSQL-specific binary format.
• postgres_fdw: The postgres_fdw module is used to access remote tables stored in external PostgreSQL servers. Using postgres_fdw, foreign tables are updatable as long as the required privileges are set. When a query references a remote table, postgres_fdw opens a transaction on the remote server that is committed or aborted when the local transaction commits or aborts. The remote transaction uses the serializable isolation level when the local transaction has the serializable isolation level; otherwise it uses the repeatable read isolation level. This ensures that a query performing multiple table scans on the remote server will get snapshot-consistent results for all the scans.
Instead of fetching all the required data from the remote database and computing the query locally, postgres_fdw tries to reduce the amount of data transferred from foreign servers. Queries are optimized by sending to the remote server for execution those where clauses that use only built-in data types, operators, and functions, and by retrieving only the table columns that are needed for correct query execution. Similarly, when a join operation is performed between foreign tables on the same foreign server, postgres_fdw pushes down the join operation to the remote server and retrieves only the results, unless the optimizer estimates that it is more efficient to fetch rows from each table individually.
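As a minimal sketch of how postgres_fdw is set up (the server, connection, and schema names below are hypothetical):

create extension postgres_fdw;

create server remote_sales
    foreign data wrapper postgres_fdw
    options (host 'remote.example.com', dbname 'sales');

create user mapping for current_user
    server remote_sales
    options (user 'remote_user', password 'secret');

import foreign schema public
    from server remote_sales into local_schema;

After these statements, the tables of the remote public schema can be queried (and, given the required privileges, updated) through the corresponding foreign tables created in local_schema.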
32.8 PostgreSQL Internals for Developers
This section is targeted at developers and researchers who plan to extend the PostgreSQL source code to implement some desired functionality. It provides pointers on how to install PostgreSQL from source code, navigate the source code, and understand some basic PostgreSQL data structures and concepts, as a first step toward adding new functionality to PostgreSQL. The section pays particular attention to the region-based memory manager of PostgreSQL, the structure of nodes in a query plan, and the key functions that are invoked during the processing of a query. It also explains the internal representation of values in the form of Datum structures, and the various data structures used to represent tuples. Finally, the section describes the error-handling mechanisms of PostgreSQL and offers advice on the steps required when adding new functionality. This section can also serve as a reference source for key concepts whose understanding is necessary when changing the source code of PostgreSQL. For more development information we encourage the readers to refer to the PostgreSQL development wiki.6
32.8.1.1 Requirements
The following software packages are required for a successful build: GNU make, an ISO/ANSI C compiler (such as a recent version of gcc), the readline and zlib libraries, and, when building from a Git checkout, Flex, Bison, and Perl.
6 https://wiki.postgresql.org/wiki/Development_information
The configure script sets up files for building the server, utilities, and all client applications, by default under /usr/local/pgsql; to specify an alternative directory you should run configure with the command-line option --prefix=PATH, where PATH is the directory where you wish to install PostgreSQL.7
In addition to the --prefix option, other frequently used options include --enable-debug, --enable-depend, and --enable-cassert, which enable debugging; it is important to use these options to help you debug code that you create in PostgreSQL. The --enable-debug option enables a build with debugging symbols (-g), the --enable-depend option turns on automatic dependency tracking, while the --enable-cassert option enables assertion checks (used for debugging).
Further, it is recommended that you set the environment variable CFLAGS to
the value -O0 (the letter “O” followed by a zero) to turn off compiler optimiza-
tion entirely. This option reduces compilation time and improves debugging in-
formation. Thus, the following commands can be used to configure PostgreSQL
to support debugging:
export CFLAGS=-O0
./configure --prefix=PATH --enable-debug --enable-depend --enable-cassert
where PATH is the path for installing the files. The CFLAGS variable can alternatively be set on the command line by adding the option CFLAGS='-O0' to the configure command above.
2. Build: To start the build, type either of the following commands:
make
make all
This will build the core of PostgreSQL. For a complete build that includes the
documentation as well as all additional modules (the contrib directory) type:
make world
7 More details about the command-line options of configure can be found at: https://www.postgresql.org/docs/current/install-procedure.html.
3. Install: To install the compiled files, type:
make install
This step will install files into the default directory or the directory specified with the --prefix command-line option provided in Step 1.
For a full installation (including the documentation and the contribution modules) type:
make install-world
1. Create a directory to hold the PostgreSQL data tree by executing the following command in the bash console:
mkdir DATA_PATH
where DATA_PATH is a directory on disk where PostgreSQL will hold its data.
2. Create a PostgreSQL cluster by executing:
PATH/bin/initdb -D DATA_PATH
where PATH is the installation directory (specified in the ./configure call), and DATA_PATH is the data directory path.
A database cluster is a collection of databases that are managed by a single server
instance. The initdb program creates the directories in which the database data will be stored, generates the shared catalog tables (the tables that belong to the whole cluster rather than to any particular database), and creates the template1 (a template for generating new databases) and postgres databases. The postgres database is a default database available for use by all users and any third-party applications.
3. Start up the PostgreSQL server by executing:
PATH/bin/postgres -D DATA_PATH >logfile 2>&1 &
By default, the server listens for connections on port 5432. Alternatively, the server can be run on a different port, specified by using the flag -p. This port should then be specified by all client applications (e.g., createdb and psql, discussed next).
To run postgres on a different port type:
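PATH/bin/postgres -D DATA_PATH -p PORT >logfile 2>&1 &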
where PORT is an alternative port number between 1024 and 65535 that is not currently used by any other application on your computer.
The postgres command can also be called in single-user mode. This mode is par-
ticularly useful for debugging or disaster recovery. When invoked in single-user
mode from the shell, the user can enter queries and the results will be printed to
the screen, but in a form that is more useful for developers than end users. In the
single-user mode, the session user will be set to the user with ID 1, and implicit
superuser powers are granted to this user. This user does not actually have to
exist, so the single-user mode can be used to manually recover from certain kinds
of accidental damage to the system catalogs. To run the postgres server in single-user mode, type:
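PATH/bin/postgres --single -D DATA_PATH DBNAME
where DBNAME is the name of the database to connect to (for example, the postgres database).
4. Create a database named test by executing:
PATH/bin/createdb -p PORT test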
where PORT is the port on which the server is running; the port specification can
be omitted if the default port (5432) is being used.
After this step, in addition to template1 and postgres databases, the database
named test will be placed in the cluster as well. You can use any other name in
place of test.
5. Log in to the database using the psql command:
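PATH/bin/psql -p PORT test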
Now you can create tables, insert some data and run queries over this database.
When debugging, it is frequently useful to run SQL commands directly from the
command line or read them from a file. This can be achieved by specifying the
options -c or -f. To execute a specific command you can use:
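PATH/bin/psql -p PORT -c "COMMAND" test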
where COMMAND is the command you wish to run, which is typically enclosed
in double quotes.
To read and execute SQL statements from a file you can use:
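PATH/bin/psql -p PORT -f FILENAME test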
where FILENAME is the name of the file containing SQL commands. If a file has multiple statements, they need to be separated by semicolons.
When either -c or -f is specified, psql does not read commands from standard input; instead it terminates after processing all the -c and -f options in sequence.
The top level of the PostgreSQL source tree is organized into the following directories:

Directory            Description
config               Configuration system for driving the build
contrib              Source code for contribution modules (extensions)
doc                  Documentation
src/backend          PostgreSQL server (backend)
src/bin              psql, pg_dump, initdb, pg_upgrade, and other front-end utilities
src/common           Code common to the front ends and the backend
src/fe_utils         Code useful for multiple front-end utilities
src/include          Header files for PostgreSQL
src/include/catalog  Definition of the PostgreSQL catalog tables
src/interfaces       Interfaces to PostgreSQL, including libpq and ecpg
src/pl               Core procedural languages (plpgsql, plperl, plpython, tcl)
src/port             Platform-specific hacks
src/test             Regression tests
src/timezone         Timezone code from IANA
src/tools            Developer tools (including pgindent)
src/tutorial         SQL tutorial scripts
The src/backend directory, which contains the PostgreSQL server itself, is further organized into the following subdirectories:

Directory      Description
access         Methods for accessing different types of data (e.g., heap, hash, btree, gist/gin)
bootstrap      Routines for running PostgreSQL in "bootstrap" mode (used by initdb)
catalog        Routines used for modifying objects in the PostgreSQL catalog
commands       User-level DDL/SQL commands (e.g., create, alter, vacuum, analyze, copy)
executor       The executor runs queries after they have been planned and optimized
foreign        Handles foreign data wrappers, user mappings, etc.
jit            Provides independent just-in-time (JIT) compilation infrastructure
lib            General-purpose data structures used in the backend (e.g., binary heaps, Bloom filters)
libpq          Code for the wire protocol (e.g., authentication and encryption)
main           The main() routine, which determines how the PostgreSQL backend process will start and starts the right subsystem
nodes          Generalized Node structures in PostgreSQL, and functions to manipulate nodes (e.g., copy, compare, print)
optimizer      The optimizer implements the costing system and generates a plan for the executor
parser         The parser parses the submitted queries
partitioning   Common code for declarative partitioning in PostgreSQL
po             Translations of backend messages to other languages
port           Backend-specific platform-specific hacks
postmaster     The main PostgreSQL process that always runs, answers requests, and hands off connections
regex          Henry Spencer's regex library
replication    Backend components to support replication, and the shipping and reading of WAL logs
rewrite        Query rewrite engine used with rules
snowball       Snowball stemming used with full-text search
statistics     Extended statistics system (CREATE STATISTICS)
storage        The storage layer handles file I/O and deals with pages and buffers
tcop           The traffic cop receives the actual queries and runs them
tsearch        Full-text search engine
utils          Various backend utility components, such as the caching system and the memory manager
Some of the widely used system catalog tables are pg_class, which stores information about tables, indices, and other relations; pg_attribute, which describes the columns of each relation; pg_proc, which stores information about functions and procedures; and pg_type, which describes data types.
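For example, a query over pg_class such as the following (a sketch; the condition on relkind restricts the output to ordinary tables):

select relname from pg_class where relkind = 'r';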
will print the names of all tables in the database. System views in the pg_catalog schema can be recognized by their plural suffix (e.g., pg_tables or pg_indexes).
Many of the internal data structures of the backend, such as parse trees and plan trees, are built out of nodes. A node is declared as follows:
typedef struct {
NodeTag type;
} Node;
The first field of any Node is NodeTag, which is used to determine a Node’s spe-
cific type at run-time. Each node consists of a type, plus appropriate data. It is partic-
ularly important to understand the node type system when adding new features, such
as new access path, or new execution operator. Important functions related to nodes
are: makeNode() for creating a new node, IsA() which is a macro for run-time type
testing, equal() for testing the equality of two nodes, copyObject() for a deep copy of
a node (which should make a copy of the tree rooted at that node), nodeToString() to
serialize a node to text (which is useful for printing the node and tree structure), and
stringToNode() for deserializing a node from text.
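As a small illustrative fragment (Alias is simply an arbitrary existing node type, chosen here only for the example):

Alias *a = makeNode(Alias);        /* allocate an Alias node; sets a->type = T_Alias */

if (IsA(a, Alias))                 /* run-time type test on the NodeTag */
{
    Alias *b = copyObject(a);      /* deep copy of the node (and any subtree) */
    Assert(equal(a, b));           /* structural equality test */
    elog(DEBUG1, "node: %s", nodeToString(a));   /* serialize the node to text */
}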
An important thing to remember when modifying or creating a new node type is to update these functions (especially equal() and copyObject(), which can be found in equalfuncs.c and copyfuncs.c in the /CODE/nodes/ directory). For serialization and deserialization, /nodes/outfuncs.c and /nodes/readfuncs.c need to be modified as well. For example, a copy function for a hypothetical TestNode type could be written as follows:
static TestNode *
copyTestNode(const TestNode *from)
{
    TestNode   *newnode = makeNode(TestNode);

    /*
     * Copy the remainder of the node fields (if any), typically using the
     * COPY_* macros of copyfuncs.c; somefield is a hypothetical field.
     */
    COPY_SCALAR_FIELD(somefield);

    return newnode;
}
As a general note, there may be other places in the code where we might need to
inform PostgreSQL about our new node type. The safest way to make sure no place in
the code has been overlooked is to search (e.g., using grep) for references to one or
two similar existing node types to find all the places where they appear in the code.
Node types also form supertype/subtype hierarchies; a pointer to a supertype can be safely cast into the corresponding subtype (again by checking the NodeTag value). The following code snippet shows an example of casting a supertype into a subtype by using the castNode macro:
static TupleTableSlot *
ExecSeqScan(PlanState *pstate)
{
/* Cast a PlanState (supertype) into a SeqScanState (subtype) */
SeqScanState *node = castNode(SeqScanState, pstate);
...
}
32.8.6 Datum
Datum is a generic data type used to store the internal representation of a single value of
any SQL data type that can be stored in a PostgreSQL table. It is defined in postgres.h.
A Datum contains either a value of a pass-by-value type or a pointer to a value of a
pass-by-reference type. The code using the Datum has to know which type it is, since
the Datum itself does not contain that information. Usually, C code will work with a
value in a native representation, and then convert to or from a Datum in order to pass
the value through data-type-independent interfaces.
There are a number of macros to cast a Datum to and from one of the specific data
types. For instance:
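int32   ival = 42;
Datum   d    = Int32GetDatum(ival);          /* pass-by-value: the value itself is stored in the Datum */
int32   back = DatumGetInt32(d);             /* convert back to a native int32 */

text   *t    = cstring_to_text("hello");     /* a pass-by-reference (varlena) value */
Datum   dt   = PointerGetDatum(t);           /* the Datum holds a pointer to the value */
text   *t2   = (text *) DatumGetPointer(dt);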
32.8.7 Tuple
Datums are used extensively to represent values in tuples. A tuple comprises a sequence of Datums. HeapTupleData (defined in include/access/htup.h) is an in-memory data structure that points to a tuple. It contains the length of the tuple and a pointer to the tuple header. The structure definition is as follows:
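typedef struct HeapTupleData
{
    uint32           t_len;        /* length of *t_data */
    ItemPointerData  t_self;       /* SELF, the tuple's location on disk */
    Oid              t_tableOid;   /* table the tuple came from */
    HeapTupleHeader  t_data;       /* pointer to the tuple header and data */
} HeapTupleData;

typedef HeapTupleData *HeapTuple;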
The t_len field contains the tuple length; the value of this field should always be valid, except in the pointer-to-nothing case. The t_self pointer is a pointer to an item within a disk page of a known file. It consists of a block ID (a unique identifier of a block) and an offset within the block. The t_self and t_tableOid (the ID of the table the tuple belongs to) values should be valid if the HeapTupleData points to a disk buffer, or if it represents a copy of a tuple on disk. They should be explicitly set to invalid in tuples that do not correspond to tables in the database.
There are several ways in which the pointer t_data can point to a tuple: it may point directly into a disk buffer, it may point to a separately palloc'd copy of the tuple, the tuple data may be allocated in the same palloc'd chunk as the HeapTupleData structure itself, or t_data may be NULL (the pointer-to-nothing case). Tuples are handed around the executor in tuple table slots, whose behavior is determined by a set of function pointers. These function pointers are redefined for different types of tuples, such as HeapTuple, MinimalTuple, BufferHeapTuple, and VirtualTuple.

The following call trace summarizes the key functions that are invoked during the processing of a simple query:

PostgresMain()
  exec_simple_query()
    pg_parse_query()
      raw_parser() – calling the parser
    pg_analyze_and_rewrite()
      parse_analyze() – calling the parser (analyzer)
      pg_rewrite_query()
        QueryRewrite() – calling the rewriter
          RewriteQuery()
    pg_plan_queries()
      pg_plan_query()
        planner() – calling the optimizer
          create_plan()
    PortalRun()
      PortalRunSelect()
        ExecutorRun()
          ExecutePlan() – calling the executor
            ExecProcNode()
              – uses the demand-driven pipeline execution model
      or
      ProcessUtility() – calling utilities

Parser: The exec_simple_query() routine first calls pg_parse_query(), which invokes raw_parser() to transform the query string into a list of raw parse trees – each parse tree representing a different command, since a query may contain multiple select statements separated by semicolons.
Each parse tree is then individually analyzed and rewritten. This is achieved by calling pg_analyze_and_rewrite() from the exec_simple_query() routine. For a given raw parse tree, the pg_analyze_and_rewrite() routine performs parse analysis and applies rule rewriting (combining parsing and rewriting), returning a list of Query nodes as a result (since one query can be expanded into several queries as a result of this process). The first routine that pg_analyze_and_rewrite() invokes is parse_analyze() (located in /parser/analyze.c) to obtain a Query node for the given raw parse tree.
Rewriter: The rule-rewrite system is triggered after the parser. It takes the output of the parser (one Query tree) together with the defined rewrite rules, and creates zero or more Query trees as a result. Typical examples of rewrite rules are replacing the use of a view with its definition, or populating procedural fields. The parse_analyze() call from the parser is thus followed by pg_rewrite_query() to perform rewriting. The pg_rewrite_query() routine invokes the QueryRewrite() routine (located in /rewrite/rewriteHandler.c), which is the primary module of the query rewriter. This method in turn makes a recursive call to RewriteQuery(), where rewrite rules are repeatedly applied as long as some rule is applicable.
Optimizer: After pg_analyze_and_rewrite() finishes, producing a list of Query nodes as output, the pg_plan_queries() routine is invoked to generate plans for all the nodes in the Query list. Each Query node is optimized by calling pg_plan_query(), which in turn invokes planner() (located in /plan/planner.c), the main entry point of the optimizer. The planner() routine invokes the create_plan() routine to create the best query plan for a given path, returning a Plan as output. Finally, the planner routine creates a PlannedStmt node to be fed to the executor.
Executor: Once the best plan is found for each Query node, the exec_simple_query() routine calls PortalRun(). A portal, previously created in the initialization step (discussed in the next section), represents the execution state of a query. For each individual statement, PortalRun() in turn invokes ExecutorRun() through PortalRunSelect() in the case of queries, or ProcessUtility() in the case of utility commands. Both ExecutorRun() and ProcessUtility() accept a PlannedStmt node; the only difference is that for a utility call the commandType attribute of the node is set to CMD_UTILITY.
The ExecutorRun() routine, defined in execMain.c, is the main routine of the executor module; it invokes ExecutePlan(), which processes the query plan by calling ExecProcNode() for each individual node in the plan, applying the demand-driven pipelining (iterator) model (see Section 15.7.2.1 for more details).
The following call trace summarizes how memory contexts are created, used, and released during the execution of a query:

CreateQueryDesc()
ExecutorStart()
  CreateExecutorState() – creates per-query context
  switch to per-query context
  InitPlan()
    ExecInitNode() – recursively scans plan tree
    CreateExprContext() – creates per-tuple context
    ExecInitExpr()
ExecutorRun()
  ExecutePlan()
    ExecProcNode() – recursively called in per-query context
    ExecEvalExpr() – called in per-tuple context
    ResetExprContext() – to free per-tuple memory
ExecutorFinish()
  ExecPostprocessPlan()
  AfterTriggerEndQuery()
ExecutorEnd()
  ExecEndPlan()
    ExecEndNode() – recursively releases resources
  FreeExecutorState() – frees per-query context and child contexts
FreeQueryDesc()
During initialization, the ExecutorStart() routine creates the per-query memory context via CreateExecutorState(), and InitPlan() calls ExecInitNode() to recursively initialize each node in the plan tree. The context is switched from the per-query context into the per-tuple context for each invocation of the ExecEvalExpr() routine. Upon exit from this routine, ResetExprContext() is invoked; this is a macro that invokes the MemoryContextReset() routine to release all the space allocated within the per-tuple context.
Cleanup: The ExecutorFinish() routine must be called after ExecutorRun(), and before
ExecutorEnd(). This routine performs cleanup actions such as calling ExecPostpro-
cessPlan() to allow plan nodes to execute required actions before the shutdown, and
AfterTriggerEndQuery() to invoke all AFTER IMMEDIATE trigger events.
The ExecutorEnd() routine must be called at the end of execution. This routine
invokes ExecEndPlan() which in turn calls ExecEndNode() to recursively release all
resources. FreeExecutorState() frees up the per-query context and consequently all
of its child contexts (e.g., per-tuple contexts) if they have not been released already.
Finally, FreeQueryDesc() from tcop/pquery.c frees the query descriptor created by
CreateQueryDesc().
This fine level of control through different contexts coupled with palloc() and
pfree() routines ensures that memory leaks rarely happen in the backend.
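As a brief sketch of how backend code typically uses these facilities (the context variable per_tuple_context is illustrative):

MemoryContext oldcxt;

/* switch into a shorter-lived context, e.g., a per-tuple context */
oldcxt = MemoryContextSwitchTo(per_tuple_context);

char *buf = (char *) palloc(128);    /* allocated in per_tuple_context */
/* ... use buf ... */

MemoryContextSwitchTo(oldcxt);       /* buf is reclaimed when per_tuple_context is reset */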
Prior to adding a desired functionality, the behavior of the feature should be dis-
cussed in depth, with a special focus on corner cases. Corner cases are frequently over-
looked and result in a substantial debugging overhead after the feature has been im-
plemented. Another important aspect is understanding the relationship between the
desired feature and other parts of PostgreSQL. Typical examples would include (but
are not limited to) the changes to the system catalog, or the parser.
PostgreSQL has a great community where developers can ask questions, and
questions are usually answered promptly. The web page https://www.postgresql.org/
developer/ provides links to a variety of resources that are useful for PostgreSQL devel-
opers. The [email protected] mailing list is targeted for developers, and
database administrators (DBAs) who have a question or problem when using Post-
greSQL. The [email protected] mailing list is targeted for developers to
submit and discuss patches, or for bug reports or issues with unreleased versions (e.g.
development snapshots, beta or release candidates), and for discussion about database
internals. Finally, the mailing list [email protected] is a great starting point
for all new developers, with a group of people who answer even basic questions.
Bibliographical Notes
Parts of this chapter are based on a previous version of the chapter, authored by Anas-
tasia Ailamaki, Sailesh Krishnamurthy, Spiros Papadimitriou, Bianca Schroeder, Karl
Schnaitter, and Gavin Sherry, which was published in the 6th edition of this textbook.
There is extensive online documentation of PostgreSQL at www.postgresql.org.
This Web site is the authoritative source for information on new releases of PostgreSQL,
which occur on a frequent basis. Until PostgreSQL version 8, the only way to run
PostgreSQL under Microsoft Windows was by using Cygwin. Cygwin is a Linux-like
environment that allows rebuilding of Linux applications from source to run under
Windows. Details are at www.cygwin.com. Books on PostgreSQL include [Schonig
(2018)], [Maymala (2015)] and [Chauhan and Kumar (2017)]. Rules as used in
PostgreSQL are presented in [Stonebraker et al. (1990)]. Many tools and extensions
for PostgreSQL are documented by the pgFoundry at www.pgfoundry.org. These in-
clude the pgtcl library and the pgAccess administration tool mentioned in this chapter.
The pgAdmin tool is described on the Web at www.pgadmin.org. Additional details re-
garding the database-design tools TOra and PostgreSQL Maestro can be found at tora.sourceforge.net and https://www.sqlmaestro.com/products/postgresql/maestro/,
respectively.
The serializable snapshot isolation protocol used in PostgreSQL is described
in [Ports and Grittner (2012)].
An open-source alternative to PostgreSQL is MySQL, which is available for non-
commercial use under the GNU General Public License. MySQL may be embedded in
commercial software that does not have freely distributed source code, but this requires
a special license to be purchased. Comparisons between the most recent versions of
the two systems are readily available on the Web.
Bibliography
[Chauhan and Kumar (2017)] C. Chauhan and D. Kumar, PostgreSQL High Performance
Cookbook, Packt Publishing (2017).
[Maymala (2015)] J. Maymala, PostgreSQL for Data Architects, Packt Publishing (2015).
[Ports and Grittner (2012)] D. R. K. Ports and K. Grittner, “Serializable Snapshot Isolation
in PostgreSQL”, Proceedings of the VLDB Endowment, Volume 5, Number 12 (2012), pages
1850–1861.
[Schonig (2018)] H.-J. Schonig, Mastering PostgreSQL 11, Packt Publishing (2018).
[Stonebraker et al. (1990)] M. Stonebraker, A. Jhingran, J. Goh, and S. Potamianos, "On Rules, Procedures, Caching and Views in Data Base Systems", In Proc. of the ACM SIGMOD Conf. on Management of Data (1990), pages 281–290.