PostgreSQL
What is PostgreSQL?
PostgreSQL is an object-relational database management system
(ORDBMS) developed at the University of California at Berkeley
Computer Science Department. POSTGRES pioneered many
concepts that only became available in some commercial
database systems much later. PostgreSQL is an open-source
descendant of this original Berkeley code. It supports a large part
of the SQL standard and offers many modern features:
• complex queries
• foreign keys
• triggers
• updatable views
• transactional integrity
• multiversion concurrency control
What is PostgreSQL?
Also, PostgreSQL can be extended by the user in many ways,
for example by adding new
• data types
• functions
• operators
• aggregate functions
• index methods
• procedural languages
Architectural Fundamentals
• Config files
  • postgresql.conf: PostgreSQL configuration file
  • pg_hba.conf: PostgreSQL client authentication configuration file
  • pg_ident.conf: PostgreSQL user name maps
• Data directory (PGDATA), e.g. /var/lib/postgresql/13/main
  Specifies the directory to use for data storage. This parameter can only be set at server start.
• Environment variables
  • PATH=/usr/local/pgsql/bin:$PATH
  • PGDATA=/var/lib/postgresql/13/main
  • MANPATH=/usr/local/pgsql/share/man:$MANPATH
Connecting to a Server
[Figure: client/server architecture. A client or middle tier connects over TCP/IP to the listener of the database server. In memory, the server runs several background processes: background writer, WAL writer, checkpointer, autovacuum launcher, stats collector, and logical replication launcher. Data is stored on disk under /var/lib/postgresql/13/main.]
Database Physical Storage
Traditionally, the configuration and data files used by a database cluster are stored together
within the cluster's data directory, commonly referred to as PGDATA (after the name of the
environment variable that can be used to define it). A common location for PGDATA is
/var/lib/pgsql/data. Multiple clusters, managed by different server instances, can exist on the
same machine.
Contents of PGDATA
Item Description
PG_VERSION A file containing the major version number of PostgreSQL
base Subdirectory containing per-database subdirectories
current_logfiles File recording the log file(s) currently written to by the logging collector
global Subdirectory containing cluster-wide tables, such as pg_database
pg_commit_ts Subdirectory containing transaction commit timestamp data
pg_dynshmem Subdirectory containing files used by the dynamic shared memory subsystem
pg_logical Subdirectory containing status data for logical decoding
pg_multixact Subdirectory containing multitransaction status data
(used for shared row locks)
pg_notify Subdirectory containing LISTEN/NOTIFY status data
pg_replslot Subdirectory containing replication slot data
pg_serial Subdirectory containing information about committed serializable transactions
pg_snapshots Subdirectory containing exported snapshots
pg_stat Subdirectory containing permanent files for the statistics subsystem
pg_stat_tmp Subdirectory containing temporary files for the statistics subsystem
pg_subtrans Subdirectory containing subtransaction status data
pg_tblspc Subdirectory containing symbolic links to tablespaces
pg_twophase Subdirectory containing state files for prepared transactions
pg_wal Subdirectory containing WAL (Write Ahead Log) files
pg_xact Subdirectory containing transaction commit status data
postgresql.auto.conf A file used for storing configuration parameters that are set by ALTER SYSTEM
postmaster.opts A file recording the command-line options the server was last started with
postmaster.pid A lock file recording the current postmaster process ID (PID), cluster data
directory path, postmaster start timestamp, port number, Unix-domain socket
directory path (could be empty), first valid listen_address (IP address or *, or
empty if not listening on TCP), and shared memory segment ID (this file is not
present after server shutdown)
For each database in the cluster there is a subdirectory within PGDATA/base, named after the
database's OID in pg_database. This subdirectory is the default location for the database's files;
in particular, its system catalogs are stored there.
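To map those per-database subdirectory names back to databases, you can query the OIDs directly; for example:

```sql
SELECT oid, datname FROM pg_database;
```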
createdb
createdb — create a new PostgreSQL database
Synopsis createdb [connection-option...] [option...] [dbname [description]]
Description createdb creates a new PostgreSQL database. Normally, the database user who
executes this command becomes the owner of the new database. However, a different owner
can be specified via the -O option, if the executing user has appropriate privileges. createdb is
a wrapper around the SQL command CREATE DATABASE. There is no effective difference
between creating databases via this utility and via other methods for accessing the server.
Options createdb accepts the following command-line arguments:
dbname Specifies the name of the database to be created. The name must be unique among
all PostgreSQL databases in this cluster. The default is to create a database with the same
name as the current system user.
description Specifies a comment to be associated with the newly created database.
-D tablespace --tablespace=tablespace Specifies the default tablespace for the database.
(This name is processed as a double-quoted identifier.)
createdb
Examples To create the database demo using the default database server:
$ createdb demo

To create the database demo using the server on host eden, port 5000, using the template0
template database, here is the command-line command and the underlying SQL command:
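Per the PostgreSQL documentation's version of this example, the command is:

```shell
$ createdb -p 5000 -h eden -T template0 demo
```

with the underlying SQL command being CREATE DATABASE demo TEMPLATE template0;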
Synopsis

CREATE USER name [ [ WITH ] option [ ... ] ]

where option can be:

    SUPERUSER | NOSUPERUSER | CREATEDB | NOCREATEDB | CREATEROLE | NOCREATEROLE
    | INHERIT | NOINHERIT | LOGIN | NOLOGIN | REPLICATION | NOREPLICATION
    | BYPASSRLS | NOBYPASSRLS | CONNECTION LIMIT connlimit
    | [ ENCRYPTED ] PASSWORD 'password' | PASSWORD NULL | VALID UNTIL 'timestamp'
    | IN ROLE role_name [, ...] | IN GROUP role_name [, ...] | ROLE role_name [, ...]
    | ADMIN role_name [, ...] | USER role_name [, ...] | SYSID uid
CREATE USER
Description CREATE USER is now an alias for CREATE ROLE. The only
difference is that when the command is spelled CREATE USER, LOGIN is
assumed by default, whereas NOLOGIN is assumed when the command
is spelled CREATE ROLE.
Compatibility The CREATE USER statement is a PostgreSQL extension.
The SQL standard leaves the definition of users to the implementation.
CREATE ROLE
CREATE ROLE name [ [ WITH ] option [ ... ] ]
where option can be:
SUPERUSER | NOSUPERUSER | CREATEDB | NOCREATEDB
| CREATEROLE | NOCREATEROLE | INHERIT | NOINHERIT |
LOGIN | NOLOGIN | REPLICATION | NOREPLICATION |
BYPASSRLS | NOBYPASSRLS | CONNECTION LIMIT
connlimit | [ ENCRYPTED ] PASSWORD 'password' |
PASSWORD NULL | VALID UNTIL 'timestamp' | IN ROLE
role_name [, ...] | IN GROUP role_name [, ...] | ROLE
role_name [, ...] | ADMIN role_name [, ...] | USER
role_name [, ...] | SYSID uid
CREATE ROLE
CREATE ROLE adds a new role to a PostgreSQL database
cluster. A role is an entity that can own database objects and
have database privileges; a role can be considered a “user”,
a “group”, or both depending on how it is used. Refer to
Chapter 21 and Chapter 20 for information about managing
users and authentication. You must have CREATEROLE
privilege or be a database superuser to use this command.
Note that roles are defined at the database cluster level, and
so are valid in all databases in the cluster.
CREATE ROLE
Examples:
Create a role that can log in, but don't give it a password:
Create a role with a password that is valid until the end of 2021. After
one second has ticked in 2022, the password is no longer valid.
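The two examples above correspond, per the PostgreSQL documentation (the role names and password are the documentation's placeholders), to:

```sql
-- A role that can log in, with no password:
CREATE ROLE jonathan LOGIN;

-- A role whose password is valid until the end of 2021;
-- after one second has ticked in 2022 it is no longer valid:
CREATE ROLE davide WITH LOGIN PASSWORD 'jw8s0F4' VALID UNTIL '2022-01-01';
```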
Users
[Figure: three users (Jenny, David, Rachel) hold different privileges, such as INSERT and SELECT, on the employees table.]
GRANT
GRANT — define access privileges
Synopsis
GRANT { { SELECT | INSERT | UPDATE | DELETE | TRUNCATE | REFERENCES | TRIGGER } [, ...] | ALL [ PRIVILEGES ] }
    ON { [ TABLE ] table_name [, ...] | ALL TABLES IN SCHEMA schema_name [, ...] }
    TO role_specification [, ...] [ WITH GRANT OPTION ]

GRANT { { SELECT | INSERT | UPDATE | REFERENCES } ( column_name [, ...] ) [, ...] | ALL [ PRIVILEGES ] ( column_name [, ...] ) }
    ON [ TABLE ] table_name [, ...]
    TO role_specification [, ...] [ WITH GRANT OPTION ]

GRANT { { USAGE | SELECT | UPDATE } [, ...] | ALL [ PRIVILEGES ] }
    ON { SEQUENCE sequence_name [, ...] | ALL SEQUENCES IN SCHEMA schema_name [, ...] }
    TO role_specification [, ...] [ WITH GRANT OPTION ]

GRANT { { CREATE | CONNECT | TEMPORARY | TEMP } [, ...] | ALL [ PRIVILEGES ] }
    ON DATABASE database_name [, ...]
    TO role_specification [, ...] [ WITH GRANT OPTION ]

GRANT { USAGE | ALL [ PRIVILEGES ] }
    ON DOMAIN domain_name [, ...]
    TO role_specification [, ...] [ WITH GRANT OPTION ]

GRANT { USAGE | ALL [ PRIVILEGES ] }
    ON FOREIGN DATA WRAPPER fdw_name [, ...]
    TO role_specification [, ...] [ WITH GRANT OPTION ]

GRANT { USAGE | ALL [ PRIVILEGES ] }
    ON FOREIGN SERVER server_name [, ...]
    TO role_specification [, ...] [ WITH GRANT OPTION ]

GRANT { EXECUTE | ALL [ PRIVILEGES ] }
    ON { { FUNCTION | PROCEDURE | ROUTINE } routine_name [ ( [ [ argmode ] [ arg_name ] arg_type [, ...] ] ) ] [, ...]
         | ALL { FUNCTIONS | PROCEDURES | ROUTINES } IN SCHEMA schema_name [, ...] }
    TO role_specification [, ...] [ WITH GRANT OPTION ]

GRANT { USAGE | ALL [ PRIVILEGES ] }
    ON LANGUAGE lang_name [, ...]
    TO role_specification [, ...] [ WITH GRANT OPTION ]

GRANT { { SELECT | UPDATE } [, ...] | ALL [ PRIVILEGES ] }
GRANT
GRANT on Roles This variant of the GRANT command grants membership in a role to
one or more other roles. Membership in a role is significant because it conveys the
privileges granted to a role to each of its members. If WITH ADMIN OPTION is specified,
the member can in turn grant membership in the role to others, and revoke
membership in the role as well. Without the admin option, ordinary users cannot do
that. A role is not considered to hold WITH ADMIN OPTION on itself, but it may grant or
revoke membership in itself from a database session where the session user matches
the role. Database superusers can grant or revoke membership in any role to anyone.
Roles having CREATEROLE privilege can grant or revoke membership in any role that is
not a superuser. If GRANTED BY is specified, the grant is recorded as having been done
by the specified role. Only database superusers may use this option, except when it
names the same role executing the command. Unlike the case with privileges,
membership in a role cannot be granted to PUBLIC. Note also that this form of the
command does not allow the noise word GROUP in role_specification.
GRANT
Examples
Note that while the above will indeed grant all privileges if
executed by a superuser or the owner of kinds, when executed by
someone else it will only grant those permissions for which that
user holds grant options. Grant membership in role admins
to user joe:
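The two examples the text refers to look like this in the PostgreSQL documentation (manuel and joe are the documentation's example names):

```sql
-- Grant all privileges on table kinds:
GRANT ALL PRIVILEGES ON kinds TO manuel;

-- Grant membership in role admins to user joe:
GRANT admins TO joe;
```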
• Explicit Locking
PostgreSQL provides various lock modes to control concurrent
access to data in tables. These modes can be used for
application-controlled locking in situations where MVCC does not give the
desired behavior. Also, most PostgreSQL commands automatically
acquire locks of appropriate modes to ensure that referenced tables
are not dropped or modified in incompatible ways while the
command executes. (For example, TRUNCATE cannot safely be
executed concurrently with other operations on the same table, so it
obtains an exclusive lock on the table to enforce that.) To examine a
list of the currently outstanding locks in a database server, use the
pg_locks system view. For more information on monitoring the
status of the lock manager subsystem, refer to Chapter 27.
Locking
• Table-Level Locks
The list below shows the available lock modes and the contexts in which they
are used automatically by PostgreSQL. You can also acquire any of these
locks explicitly with the command LOCK. Remember that all of these lock
modes are table-level locks, even if the name contains the word “row”; the
names of the lock modes are historical. To some extent the names reflect the
typical usage of each lock mode — but the semantics are all the same. The
only real difference between one lock mode and another is the set of lock
modes with which each conflicts (see Table 13.2). Two transactions cannot
hold locks of conflicting modes on the same table at the same time.
(However, a transaction never conflicts with itself. For example, it might
acquire ACCESS EXCLUSIVE lock and later acquire ACCESS SHARE lock on the
same table.) Non-conflicting lock modes can be held concurrently by many
transactions. Notice in particular that some lock modes are self-conflicting
(for example, an ACCESS EXCLUSIVE lock cannot be held by more than one
transaction at a time) while others are not self-conflicting (for example, an
ACCESS SHARE lock can be held by multiple transactions).
Table-Level Lock Modes
• ACCESS SHARE
• ROW SHARE
• ROW EXCLUSIVE
• SHARE UPDATE EXCLUSIVE
• SHARE
• SHARE ROW EXCLUSIVE
• EXCLUSIVE
• ACCESS EXCLUSIVE
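Any of these modes can be taken explicitly with the LOCK command; a minimal sketch (the table name employee is illustrative):

```sql
BEGIN;
-- SHARE mode blocks concurrent data changes but still
-- allows concurrent readers:
LOCK TABLE employee IN SHARE MODE;
-- ... statements that need a stable view of the table ...
COMMIT;  -- all locks are released at transaction end
```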
Data Concurrency
At 09:00:00, many transactions run concurrently, each updating a different row:

Transaction 1: UPDATE hr.employee SET salary=salary+100 WHERE employee_id=100;
Transaction 2: UPDATE hr.employee SET salary=salary+100 WHERE employee_id=101;
Transaction 3: UPDATE hr.employee SET salary=salary+100 WHERE employee_id=102;
...
Transaction x: UPDATE hr.employee SET salary=salary+100 WHERE employee_id=xxx;
DML Locks

Transaction 1 (6-Mar-21 8:00):
SQL> UPDATE employee
  2  SET salary=salary*1.1
  3  WHERE employee_id=107;
1 row updated.

Transaction 2 (6-Mar-21 8:00):
SQL> UPDATE employee
  2  SET salary=salary*1.1
  3  WHERE employee_id=106;
1 row updated.
Lock Conflicts

9:00:00
  Transaction 1: BEGIN TRANSACTION;
                 UPDATE employee SET salary=salary+100 WHERE employee_id=100;  (1 row updated)
  Transaction 2: BEGIN TRANSACTION;
                 UPDATE employee SET salary=salary+100 WHERE employee_id=101;  (1 row updated)
9:00:05
  Transaction 1: UPDATE employee SET COMMISSION_PCT=0.2 WHERE employee_id=101;
                 Session waits, enqueued due to lock conflict.
  Transaction 2: SELECT sum(salary) FROM employee;  returns SUM(SALARY) = 692634
16:30:00
  Transaction 1: session still waiting!
  Transaction 2: many selects, inserts, updates, and deletes during the last 7.5 hours,
                 but no commits or rollbacks!
16:30:01
  Transaction 2: commit;
  Transaction 1: 1 row updated. Session continues.
  Both sessions: END TRANSACTION;
Possible Causes of Lock Conflicts
• Uncommitted changes
• Long-running transactions
• Unnecessarily high locking levels
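The SQL-dump discussion that follows is built around pg_dump's simplest invocation, which, per the PostgreSQL documentation, writes a script to standard output (dbname stands for the database being backed up):

```shell
pg_dump dbname > dumpfile
```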
SQL Dump
As you see, pg_dump writes its result to the standard output. We will see below how
this can be useful. While the above command creates a text file, pg_dump can create
files in other formats that allow for parallelism and more fine-grained control of object
restoration. pg_dump is a regular PostgreSQL client application (albeit a particularly
clever one). This means that you can perform this backup procedure from any remote
host that has access to the database. But remember that pg_dump does not operate
with special permissions. In particular, it must have read access to all tables that
you want to back up, so in order to back up the entire database you almost
always have to run it as a database superuser. (If you do not have sufficient
privileges to back up the entire database, you can still back up portions of the
database to which you do have access using options such as -n schema or -t table.)
Continuous Archiving and Point-in-Time Recovery (PITR)
• It is not necessary to replay the WAL entries all the way to the end. We could stop
the replay at any point and have a consistent snapshot of the database as it was at
that time. Thus, this technique supports point-in-time recovery: it is possible to
restore the database to its state at any time since your base backup was taken.
• If we continuously feed the series of WAL files to another machine that has been
loaded with the same base backup file, we have a warm standby system: at any point
we can bring up the second machine and it will have a nearly-current copy of the
database.
NOTE: pg_dump and pg_dumpall do not produce file-system-level backups
and cannot be used as part of a continuous-archiving solution. Such dumps
are logical and do not contain enough information to be used by WAL replay.
As with the plain file-system-backup technique, this method can only
support restoration of an entire database cluster, not a subset. Also, it
requires a lot of archival storage: the base backup might be bulky, and a
busy system will generate many megabytes of WAL traffic that have to be
archived. Still, it is the preferred backup technique in many situations where
high reliability is needed. To recover successfully using continuous archiving
(also called “online backup” by many database vendors), you need a
continuous sequence of archived WAL files that extends back at least as far
as the start time of your backup. So to get started, you should set up and
test your procedure for archiving WAL files before you take your first base
backup. Accordingly, we first discuss the mechanics of archiving WAL files.
Setting Up WAL Archiving
In an abstract sense, a running PostgreSQL system produces
an indefinitely long sequence of WAL records. The system
physically divides this sequence into WAL segment files,
which are normally 16MB apiece (although the segment size
can be altered during initdb). The segment files are given
numeric names that reflect their position in the abstract WAL
sequence. When not using WAL archiving, the system
normally creates just a few segment files and then “recycles”
them by renaming no-longer-needed segment files to higher
segment numbers. It's assumed that segment files whose
contents precede the last checkpoint are no longer of
interest and can be recycled.
Setting Up WAL Archiving
When archiving WAL data, we need to capture the contents of each
segment file once it is filled, and save that data somewhere before the
segment file is recycled for reuse. Depending on the application and
the available hardware, there could be many different ways of “saving
the data somewhere”: we could copy the segment files to an NFS-
mounted directory on another machine, write them onto a tape drive
(ensuring that you have a way of identifying the original name of each
file), or batch them together and burn them onto CDs, or something
else entirely. To provide the database administrator with flexibility,
PostgreSQL tries not to make any assumptions about how the archiving
will be done. Instead, PostgreSQL lets the administrator specify a shell
command to be executed to copy a completed segment file to
wherever it needs to go. The command could be as simple as a cp, or it
could invoke a complex shell script — it's all up to you.
Setting Up WAL Archiving
To enable WAL archiving, set the wal_level configuration
parameter to replica or higher, archive_mode to on, and
specify the shell command to use in the archive_command
configuration parameter. In practice these settings will
always be placed in the postgresql.conf file. In
archive_command, %p is replaced by the path name of the
file to archive, while %f is replaced by only the file name.
(The path name is relative to the current working directory,
i.e., the cluster's data directory.) Use %% if you need to
embed an actual % character in the command. The simplest
useful command is something like:
archive_command = 'test ! -f /mnt/server/archivedir/%f && cp %p /mnt/server/archivedir/%f'  # Unix
archive_command = 'copy "%p" "C:\\server\\archivedir\\%f"'                                  # Windows
Tablespaces
Tablespaces in PostgreSQL allow database administrators to define locations in the file system where the
files representing database objects can be stored. Once created, a tablespace can be referred to by name
when creating database objects.
By using tablespaces, an administrator can control the disk layout of a PostgreSQL installation. This is
useful in at least two ways. First, if the partition or volume on which the cluster was initialized runs out of
space and cannot be extended, a tablespace can be created on a different partition and used until the
system can be reconfigured.
Second, tablespaces allow an administrator to use knowledge of the usage pattern of database objects to
optimize performance. For example, an index which is very heavily used can be placed on a very fast, highly
available disk, such as an expensive solid state device. At the same time a table storing archived data which
is rarely used or not performance critical could be stored on a less expensive, slower disk system.
The location must be an existing, empty directory that is owned by the PostgreSQL operating system user.
All objects subsequently created within the tablespace will be stored in files underneath this directory. The
location must not be on removable or transient storage, as the cluster might fail to function if the tablespace
is missing or lost.
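A sketch of creating and using a tablespace (the directory path is hypothetical and must already exist, be empty, and be owned by the PostgreSQL operating system user):

```sql
CREATE TABLESPACE fastspace LOCATION '/ssd1/postgresql/data';

-- Place a new table in that tablespace:
CREATE TABLE foo (i int) TABLESPACE fastspace;
```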
To determine the set of existing tablespaces, examine the pg_tablespace system catalog, for
example
SELECT spcname FROM pg_tablespace;
Indexes
Suppose we have a table similar to this:
CREATE TABLE test1 (
id integer,
content varchar
);
and the application issues many queries of the form:
SELECT content FROM test1 WHERE id = constant;
With no advance preparation, the system would have to scan the entire test1 table, row by row, to find all matching entries. If there
are many rows in test1 and only a few rows (perhaps zero or one) that would be returned by such a query, this is clearly an
inefficient method. But if the system has been instructed to maintain an index on the id column, it can use a more efficient method for
locating matching rows. For instance, it might only have to walk a few levels deep into a search tree.
A similar approach is used in most non-fiction books: terms and concepts that are frequently looked up by readers are collected in
an alphabetic index at the end of the book. The interested reader can scan the index relatively quickly and flip to the appropriate
page(s), rather than having to read the entire book to find the material of interest. Just as it is the task of the author to anticipate the
items that readers are likely to look up, it is the task of the database programmer to foresee which indexes will be useful.
The following command can be used to create an index on the id column, as discussed:
CREATE INDEX test1_id_index ON test1 (id);
The name test1_id_index can be chosen freely, but you should pick something that enables you
to remember later what the index was for.
To remove an index, use the DROP INDEX command. Indexes can be added to and removed from tables at any time.
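For example, to drop the index created above:

```sql
DROP INDEX test1_id_index;
```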
Indexes
Once an index is created, no further intervention is required: the system will update the index when the table is modified, and it will
use the index in queries when it thinks doing so would be more efficient than a sequential table scan. But you might have to run the
ANALYZE command regularly to update statistics to allow the query planner to make educated decisions. See Chapter 14 for
information about how to find out whether an index is used and when and why the planner might choose not to use an
index.
Indexes can also benefit UPDATE and DELETE commands with search conditions. Indexes can moreover be used in join searches.
Thus, an index defined on a column that is part of a join condition can also significantly speed up queries with joins.
Creating an index on a large table can take a long time. By default, PostgreSQL allows reads (SELECT statements) to occur on the
table in parallel with index creation, but writes (INSERT, UPDATE, DELETE) are blocked until the index build is finished. In
production environments this is often unacceptable. It is possible to allow writes to occur in parallel with index creation, but there are
several caveats to be aware of — for more information see Building Indexes Concurrently.
After an index is created, the system has to keep it synchronized with the table. This adds overhead to data manipulation operations.
Therefore indexes that are seldom or never used in queries should be removed.
Index Types
PostgreSQL provides several index types: B-tree, Hash, GiST, SP-GiST, GIN and BRIN. Each index type uses
a different algorithm that is best suited to different types of queries. By default, the CREATE INDEX command
creates B-tree indexes, which fit the most common situations.
B-trees can handle equality and range queries on data that can be sorted into some ordering. In particular, the
PostgreSQL query planner will consider using a B-tree index whenever an indexed column is involved in a
comparison using one of these operators:
<
<=
=
>=
>
Constructs equivalent to combinations of these operators, such as BETWEEN and IN, can also be
implemented with a B-tree index search. Also, an IS NULL or IS NOT NULL condition on an index column can
be used with a B-tree index.
The optimizer can also use a B-tree index for queries involving the pattern matching operators LIKE and ~ if
the pattern is a constant and is anchored to the beginning of the string — for example, col LIKE 'foo%' or col ~
'^foo', but not col LIKE '%bar'. However, if your database does not use the C locale you will need to create the
index with a special operator class to support indexing of pattern-matching queries; see Section 11.10 below. It
is also possible to use B-tree indexes for ILIKE and ~*, but only if the pattern starts with non-alphabetic
characters, i.e., characters that are not affected by upper/lower case conversion.
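A sketch of such a pattern-ready index on the example table above (varchar_pattern_ops is the operator class for varchar columns):

```sql
-- Supports content LIKE 'foo%' lookups in non-C locales:
CREATE INDEX test1_content_pattern_idx
    ON test1 (content varchar_pattern_ops);
```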
B-tree indexes can also be used to retrieve data in sorted order. This is not always faster than a simple scan
and sort, but it is often helpful.
Index Types
Hash indexes can only handle simple equality comparisons. The query planner will consider using a hash
index whenever an indexed column is involved in a comparison using the = operator. The following command
is used to create a hash index:
CREATE INDEX name ON table USING HASH (column);
GiST indexes are not a single kind of index, but rather an infrastructure within which many different indexing
strategies can be implemented. Accordingly, the particular operators with which a GiST index can be used vary
depending on the indexing strategy (the operator class). As an example, the standard distribution of
PostgreSQL includes GiST operator classes for several two-dimensional geometric data types, which support
indexed queries using these operators:
<
&<
&>
>>
<<|
&<|
|&>
|>>
@>
<@
~=
&&
Index Types
(See Section 9.11 for the meaning of these operators.) The GiST operator classes included in the standard
distribution are documented in Table 64.1. Many other GiST operator classes are available in the contrib
collection or as separate projects. For more information see Chapter 64.
GiST indexes are also capable of optimizing “nearest-neighbor” searches, such as
SELECT * FROM places ORDER BY location <-> point '(101,456)' LIMIT 10;
which finds the ten places closest to a given target point. The ability to do this is again dependent on the
particular operator class being used. In Table 64.1, operators that can be used in this way are listed in the
column “Ordering Operators”.
SP-GiST indexes, like GiST indexes, offer an infrastructure that supports various kinds of searches.
SP-GiST permits implementation of a wide range of different non-balanced disk-based data structures, such
as quadtrees, k-d trees, and radix trees (tries). As an example, the standard distribution of PostgreSQL
includes SP-GiST operator classes for two-dimensional points.
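A sketch using a hypothetical table of two-dimensional points:

```sql
CREATE TABLE pts (p point);
CREATE INDEX pts_spgist_idx ON pts USING spgist (p);
```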
Performance Tips
Query performance can be affected by many things. Some of these can be controlled by the user, while others
are fundamental to the underlying design of the system. This chapter provides some hints about
understanding and tuning PostgreSQL performance.
Using EXPLAIN
PostgreSQL devises a query plan for each query it receives. Choosing the right plan to match the query
structure and the properties of the data is absolutely critical for good performance, so the system includes a
complex planner that tries to choose good plans. You can use the EXPLAIN command to see what query plan
the planner creates for any query. Plan-reading is an art that requires some experience to master, but this
section attempts to cover the basics.
Examples in this section are drawn from the regression test database after doing a VACUUM ANALYZE, using
9.3 development sources. You should be able to get similar results if you try the examples yourself, but your
estimated costs and row counts might vary slightly because ANALYZE's statistics are random samples rather
than exact, and because costs are inherently somewhat platform-dependent.
The examples use EXPLAIN's default “text” output format, which is compact and convenient for humans to
read. If you want to feed EXPLAIN's output to a program for further analysis, you should use one of its
machine-readable output formats (XML, JSON, or YAML) instead.
Performance Tips
EXPLAIN Basics
The structure of a query plan is a tree of plan nodes. Nodes at the bottom level of the tree are scan nodes:
they return raw rows from a table. There are different types of scan nodes for different table access methods:
sequential scans, index scans, and bitmap index scans. There are also non-table row sources, such as
VALUES clauses and set-returning functions in FROM, which have their own scan node types. If the query
requires joining, aggregation, sorting, or other operations on the raw rows, then there will be additional nodes
above the scan nodes to perform these operations. Again, there is usually more than one possible way to do
these operations, so different node types can appear here too.
The output of EXPLAIN has one line for each node in the plan tree, showing the basic node type plus the cost
estimates that the planner made for the execution of that plan node. Additional lines might appear, indented
from the node's summary line, to show additional properties of the node. The very first line (the summary line
for the topmost node) has the estimated total execution cost for the plan; it is this number that the planner
seeks to minimize.
Here is a trivial example, just to show what the output looks like (in the documentation this is EXPLAIN SELECT * FROM tenk1;):
QUERY PLAN
-------------------------------------------------------------
Seq Scan on tenk1 (cost=0.00..458.00 rows=10000 width=244)
Performance Tips
Since this query has no WHERE clause, it must scan all the rows of the table, so the planner has chosen to
use a simple sequential scan plan. The numbers that are quoted in parentheses are (left to right):
• Estimated start-up cost. This is the time expended before the output phase can begin, e.g., time to do the
sorting in a sort node.
• Estimated total cost. This is stated on the assumption that the plan node is run to completion, i.e., all
available rows are retrieved. In practice a node's parent node might stop short of reading all available rows
(see the LIMIT example below).
• Estimated number of rows output by this plan node. Again, the node is assumed to be run to completion.
• Estimated average width of rows output by this plan node (in bytes).
The costs are measured in arbitrary units determined by the planner's cost parameters (see Section 19.7.2).
Traditional practice is to measure the costs in units of disk page fetches; that is, seq_page_cost is
conventionally set to 1.0 and the other cost parameters are set relative to that. The examples in this section
are run with the default cost parameters.
It's important to understand that the cost of an upper-level node includes the cost of all its child nodes.
It's also important to realize that the cost only reflects things that the planner cares about. In particular, the
cost does not consider the time spent transmitting result rows to the client, which could be an important factor
in the real elapsed time; but the planner ignores it because it cannot change it by altering the plan. (Every
correct plan will output the same row set, we trust.)
Performance Tips
The rows value is a little tricky because it is not the number of rows processed or scanned by the plan node,
but rather the number emitted by the node. This is often less than the number scanned, as a result of filtering
by any WHERE-clause conditions that are being applied at the node. Ideally the top-level rows estimate will
approximate the number of rows actually returned, updated, or deleted by the query.
If you do:
SELECT relpages, reltuples FROM pg_class WHERE relname = 'tenk1';
you will find that tenk1 has 358 disk pages and 10000 rows. The estimated cost is computed as (disk pages
read * seq_page_cost) + (rows scanned * cpu_tuple_cost). By default, seq_page_cost is 1.0 and
cpu_tuple_cost is 0.01, so the estimated cost is (358 * 1.0) + (10000 * 0.01) = 458.
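The arithmetic above can be checked directly; a minimal sketch of the sequential-scan cost formula (plain Python, using the page and row counts quoted in the text):

```python
# Seq Scan cost model sketch:
#   cost = pages * seq_page_cost + rows * cpu_tuple_cost
# (counts and default parameter values as quoted in the text)
def seq_scan_cost(pages, rows, seq_page_cost=1.0, cpu_tuple_cost=0.01):
    return pages * seq_page_cost + rows * cpu_tuple_cost

print(seq_scan_cost(358, 10000))  # 458.0, matching the plan's total cost
```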
Performance Tips
Now let's modify the query to add a WHERE condition:
EXPLAIN SELECT * FROM tenk1 WHERE unique1 < 7000;
QUERY PLAN
------------------------------------------------------------
Seq Scan on tenk1 (cost=0.00..483.00 rows=7001 width=244)
Filter: (unique1 < 7000)
Notice that the EXPLAIN output shows the WHERE clause being applied as a “filter” condition attached to the
Seq Scan plan node. This means that the plan node checks the condition for each row it scans, and outputs
only the ones that pass the condition. The estimate of output rows has been reduced because of the WHERE
clause. However, the scan will still have to visit all 10000 rows, so the cost hasn't decreased; in fact it has
gone up a bit (by 10000 * cpu_operator_cost, to be exact) to reflect the extra CPU time spent checking the
WHERE condition.
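That increase can be reproduced with the same arithmetic; cpu_operator_cost defaults to 0.0025, so evaluating the filter on all 10000 scanned rows adds 25 cost units:

```python
# A WHERE filter doesn't reduce the pages read by a Seq Scan; it adds one
# operator evaluation per scanned row (cpu_operator_cost defaults to 0.0025).
base_cost = 358 * 1.0 + 10000 * 0.01        # 458.0, plan without the filter
filtered_cost = base_cost + 10000 * 0.0025  # 483.0, plan with the filter
print(filtered_cost)
```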
The actual number of rows this query would select is 7000, but the rows estimate is only approximate.
If you try to duplicate this experiment, you will probably get a slightly different estimate; moreover, it can
change after each ANALYZE command, because the statistics produced by ANALYZE are taken from a
randomized sample of the table.
Now, let's make the condition more restrictive:
Performance Tips
EXPLAIN SELECT * FROM tenk1 WHERE unique1 < 100;
QUERY PLAN
------------------------------------------------------------------------------
Bitmap Heap Scan on tenk1 (cost=5.07..229.20 rows=101 width=244)
Recheck Cond: (unique1 < 100)
-> Bitmap Index Scan on tenk1_unique1 (cost=0.00..5.04 rows=101 width=0)
Index Cond: (unique1 < 100)
Here the planner has decided to use a two-step plan: the child plan node visits an index to find the locations of
rows matching the index condition, and then the upper plan node actually fetches those rows from the table
itself. Fetching rows separately is much more expensive than reading them sequentially, but because not all
the pages of the table have to be visited, this is still cheaper than a sequential scan. (The reason for using two
plan levels is that the upper plan node sorts the row locations identified by the index into physical order before
reading them, to minimize the cost of separate fetches. The “bitmap” mentioned in the node names is the
mechanism that does the sorting.)
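The two-step idea can be sketched in Python. This is only a conceptual model with hypothetical row locations; PostgreSQL's actual bitmap is a compressed per-page data structure, not a sorted list:

```python
# Sketch of a bitmap heap scan: gather matching row locations (TIDs) from
# the index, sort them into physical (page, offset) order, then fetch pages
# in order so each heap page is visited at most once.
def bitmap_fetch(index_hits, heap):
    visited_pages = []
    rows = []
    for page, offset in sorted(index_hits):  # the "bitmap" ordering step
        if page not in visited_pages:
            visited_pages.append(page)       # each page fetched only once
        rows.append(heap[page][offset])
    return rows, len(visited_pages)

# hypothetical heap: page -> offset -> row
heap = {0: {0: 'a', 1: 'b'}, 1: {0: 'c'}, 2: {0: 'd'}}
rows, pages_read = bitmap_fetch([(2, 0), (0, 1), (0, 0)], heap)
print(rows, pages_read)  # ['a', 'b', 'd'] 2
```

Even though the index returned the hits out of order, the heap is read in physical order and page 0 is fetched once for its two matching rows.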
Performance Tips
EXPLAIN SELECT * FROM tenk1 WHERE unique1 < 100 AND stringu1 = 'xxx';
QUERY PLAN
------------------------------------------------------------------------------
Bitmap Heap Scan on tenk1 (cost=5.04..229.43 rows=1 width=244)
Recheck Cond: (unique1 < 100)
Filter: (stringu1 = 'xxx'::name)
-> Bitmap Index Scan on tenk1_unique1 (cost=0.00..5.04 rows=101 width=0)
Index Cond: (unique1 < 100)
The added condition stringu1 = 'xxx' reduces the output row count estimate, but not the cost because we still
have to visit the same set of rows. Notice that the stringu1 clause cannot be applied as an index condition,
since this index is only on the unique1 column. Instead it is applied as a filter on the rows retrieved by the
index. Thus the cost has actually gone up slightly to reflect this extra checking.
Performance Tips
EXPLAIN ANALYZE
It is possible to check the accuracy of the planner's estimates by using EXPLAIN's ANALYZE option.
With this option, EXPLAIN actually executes the query, and then displays the true row counts and true run time
accumulated within each plan node, along with the same estimates that a plain EXPLAIN shows. For example,
we might get a result like this:
EXPLAIN ANALYZE SELECT *
FROM tenk1 t1, tenk2 t2
WHERE t1.unique1 < 10 AND t1.unique2 = t2.unique2;
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------
Nested Loop (cost=4.65..118.62 rows=10 width=488) (actual time=0.128..0.377 rows=10 loops=1)
-> Bitmap Heap Scan on tenk1 t1 (cost=4.36..39.47 rows=10 width=244) (actual time=0.057..0.121 rows=10 loops=1)
Recheck Cond: (unique1 < 10)
-> Bitmap Index Scan on tenk1_unique1 (cost=0.00..4.36 rows=10 width=0) (actual time=0.024..0.024 rows=10 loops=1)
Index Cond: (unique1 < 10)
-> Index Scan using tenk2_unique2 on tenk2 t2 (cost=0.29..7.91 rows=1 width=244) (actual time=0.021..0.022 rows=1 loops=10)
Index Cond: (unique2 = t1.unique2)
Planning time: 0.181 ms
Performance Tips
Note that the “actual time” values are in milliseconds of real time, whereas the cost estimates are expressed in
arbitrary units; so they are unlikely to match up. The thing that's usually most important to look for is whether
the estimated row counts are reasonably close to reality. In this example the estimates were all dead-on, but
that's quite unusual in practice.
In some query plans, it is possible for a subplan node to be executed more than once. For example, the inner
index scan will be executed once per outer row in the above nested-loop plan. In such cases, the loops value
reports the total number of executions of the node, and the actual time and rows values shown are averages
per-execution. This is done to make the numbers comparable with the way that the cost estimates are shown.
Multiply by the loops value to get the total time actually spent in the node.
In the above example, we spent a total of 0.220 milliseconds executing the index scans on tenk2.
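Using the numbers reported for the tenk2 inner index scan above, the per-execution averages multiply back out to totals:

```python
# "actual time" and "rows" are per-execution averages in EXPLAIN ANALYZE
# output; multiply by the loops value to recover totals.
# Numbers from the tenk2 index scan above: rows=1, loops=10, time up to 0.022 ms.
loops = 10
time_per_loop_ms = 0.022  # upper bound of the node's actual time range
rows_per_loop = 1

total_time_ms = time_per_loop_ms * loops  # roughly 0.22 ms spent in the node
total_rows = rows_per_loop * loops        # 10 rows produced in total
print(total_time_ms, total_rows)
```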
Performance Tips
Statistics Used by the Planner
Single-Column Statistics
As we saw in the previous section, the query planner needs to estimate the number of rows retrieved by a
query in order to make good choices of query plans. This section provides a quick look at the statistics that the
system uses for these estimates.
One component of the statistics is the total number of entries in each table and index, as well as the number of
disk blocks occupied by each table and index. This information is kept in the table pg_class, in the columns
reltuples and relpages. We can look at it with queries similar to this one:
SELECT relname, relkind, reltuples, relpages
FROM pg_class
WHERE relname LIKE 'tenk1%';
Here we can see that tenk1 contains 10000 rows, as do its indexes, but the indexes are (unsurprisingly) much
smaller than the table.
For efficiency reasons, reltuples and relpages are not updated on-the-fly, and so they usually contain somewhat
out-of-date values. They are updated by VACUUM, ANALYZE, and a few DDL commands such as CREATE
INDEX. A VACUUM or ANALYZE operation that does not scan the entire table (which is commonly the case)
will incrementally update the reltuples count on the basis of the part of the table it did scan, resulting in an
approximate value. In any case, the planner will scale the values it finds in pg_class to match the current
physical table size, thus obtaining a closer approximation.
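A sketch of that scaling, under the stated assumption that row density has not changed since the statistics were last gathered (the function name and example numbers are illustrative, not a planner API):

```python
# The planner scales the possibly stale reltuples by the ratio of the
# table's current page count to relpages, assuming unchanged row density.
def scaled_reltuples(reltuples, relpages, current_pages):
    if relpages == 0:
        return reltuples  # no density information; use the stored count
    return reltuples * current_pages / relpages

# e.g. stats say 10000 rows in 358 pages; the table has since grown to 400 pages
print(scaled_reltuples(10000, 358, 400))  # about 11173 estimated rows
```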
Performance Tips
Most queries retrieve only a fraction of the rows in a table, due to WHERE clauses that restrict the rows to be
examined. The planner thus needs to make an estimate of the selectivity of WHERE clauses, that is, the
fraction of rows that match each condition in the WHERE clause. The information used for this task is stored in
the pg_statistic system catalog. Entries in pg_statistic are updated by the ANALYZE and VACUUM ANALYZE
commands, and are always approximate even when freshly updated.
Rather than look at pg_statistic directly, it's better to look at its view pg_stats when examining the statistics
manually. pg_stats is designed to be more easily readable. Furthermore, pg_stats is readable by all, whereas
pg_statistic is only readable by a superuser. (This prevents unprivileged users from learning something about
the contents of other people's tables from the statistics. The pg_stats view is restricted to show only rows about
tables that the current user can read.) For example, we might do:
SELECT attname, n_distinct, most_common_vals
FROM pg_stats
WHERE tablename = 'tenk1';
Extended Statistics
It is common to see slow queries running bad execution plans because multiple columns used in the query
clauses are correlated. The planner normally assumes that multiple conditions are independent of each other,
an assumption that does not hold when column values are correlated. Regular statistics, because of their per-
individual-column nature, cannot capture any knowledge about cross-column correlation. However, PostgreSQL
has the ability to compute multivariate statistics, which can capture such information. Because the number of
possible column combinations is very large, it's impractical to compute multivariate statistics automatically.
Instead, extended statistics objects, more often called just statistics objects, can be created to instruct the
server to obtain statistics across interesting sets of columns.
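The independence assumption can be illustrated numerically. The selectivities below are hypothetical, but the shape of the error is exactly what extended statistics are designed to correct:

```python
# Why correlated columns mislead the planner: per-column selectivities are
# multiplied as if the conditions were independent.
n_rows = 100_000
sel_city = 0.01  # fraction of rows matching city = 'X' (hypothetical)
sel_zip = 0.01   # fraction of rows matching zip = 'Y' (hypothetical)

# Planner's estimate for "city = 'X' AND zip = 'Y'" under independence:
est_rows = n_rows * sel_city * sel_zip  # about 10 rows

# If zip is functionally determined by city (perfect correlation), the
# true result is every row matching the city condition:
true_rows = n_rows * sel_city           # 1000 rows -- a 100x underestimate
print(est_rows, true_rows)
```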
Performance Tips
Statistics objects are created using the CREATE STATISTICS command. Creation of such an object merely
creates a catalog entry expressing interest in the statistics. Actual data collection is performed by ANALYZE
(either a manual command, or background auto-analyze). The collected values can be examined in the
pg_statistic_ext_data catalog.
ANALYZE computes extended statistics based on the same sample of table rows that it takes for computing
regular single-column statistics. Since the sample size is increased by increasing the statistics target for the
table or any of its columns (as described in the previous section), a larger statistics target will normally result in
more accurate extended statistics, as well as more time spent calculating them.
The following subsections describe the kinds of extended statistics that are currently supported.