PostgreSQL 11.17 Documentation
Legal Notice
PostgreSQL is Copyright © 1996-2022 by the PostgreSQL Global Development Group.
Permission to use, copy, modify, and distribute this software and its documentation for any purpose, without fee, and without a written
agreement is hereby granted, provided that the above copyright notice and this paragraph and the following two paragraphs appear in all copies.
IN NO EVENT SHALL THE UNIVERSITY OF CALIFORNIA BE LIABLE TO ANY PARTY FOR DIRECT, INDIRECT, SPECIAL,
INCIDENTAL, OR CONSEQUENTIAL DAMAGES, INCLUDING LOST PROFITS, ARISING OUT OF THE USE OF THIS SOFTWARE
AND ITS DOCUMENTATION, EVEN IF THE UNIVERSITY OF CALIFORNIA HAS BEEN ADVISED OF THE POSSIBILITY OF
SUCH DAMAGE.
THE UNIVERSITY OF CALIFORNIA SPECIFICALLY DISCLAIMS ANY WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE. THE SOFTWARE PROVIDED
HEREUNDER IS ON AN “AS-IS” BASIS, AND THE UNIVERSITY OF CALIFORNIA HAS NO OBLIGATIONS TO PROVIDE
MAINTENANCE, SUPPORT, UPDATES, ENHANCEMENTS, OR MODIFICATIONS.
Table of Contents
Preface ..................................................................................................................... xxx
1. What is PostgreSQL? ...................................................................................... xxx
2. A Brief History of PostgreSQL ......................................................................... xxx
2.1. The Berkeley POSTGRES Project ......................................................... xxxi
2.2. Postgres95 ......................................................................................... xxxi
2.3. PostgreSQL ....................................................................................... xxxii
3. Conventions ................................................................................................. xxxii
4. Further Information ....................................................................................... xxxii
5. Bug Reporting Guidelines ............................................................................. xxxiii
5.1. Identifying Bugs ............................................................................... xxxiii
5.2. What to Report ................................................................................. xxxiv
5.3. Where to Report Bugs ........................................................................ xxxv
I. Tutorial .................................................................................................................... 1
1. Getting Started .................................................................................................. 3
1.1. Installation ............................................................................................. 3
1.2. Architectural Fundamentals ....................................................................... 3
1.3. Creating a Database ................................................................................. 3
1.4. Accessing a Database .............................................................................. 5
2. The SQL Language ............................................................................................ 7
2.1. Introduction ............................................................................................ 7
2.2. Concepts ................................................................................................ 7
2.3. Creating a New Table .............................................................................. 7
2.4. Populating a Table With Rows .................................................................. 8
2.5. Querying a Table .................................................................................... 9
2.6. Joins Between Tables ............................................................................. 11
2.7. Aggregate Functions .............................................................................. 13
2.8. Updates ............................................................................................... 15
2.9. Deletions .............................................................................................. 15
3. Advanced Features ........................................................................................... 16
3.1. Introduction .......................................................................................... 16
3.2. Views .................................................................................................. 16
3.3. Foreign Keys ........................................................................................ 16
3.4. Transactions ......................................................................................... 17
3.5. Window Functions ................................................................................. 19
3.6. Inheritance ........................................................................................... 22
3.7. Conclusion ........................................................................................... 23
II. The SQL Language ................................................................................................. 24
4. SQL Syntax .................................................................................................... 32
4.1. Lexical Structure ................................................................................... 32
4.2. Value Expressions ................................................................................. 41
4.3. Calling Functions .................................................................................. 55
5. Data Definition ................................................................................................ 58
5.1. Table Basics ......................................................................................... 58
5.2. Default Values ...................................................................................... 59
5.3. Constraints ........................................................................................... 60
5.4. System Columns ................................................................................... 68
5.5. Modifying Tables .................................................................................. 69
5.6. Privileges ............................................................................................. 72
5.7. Row Security Policies ............................................................................ 73
5.8. Schemas ............................................................................................... 79
5.9. Inheritance ........................................................................................... 84
5.10. Table Partitioning ................................................................................ 87
5.11. Foreign Data ..................................................................................... 101
5.12. Other Database Objects ....................................................................... 101
5.13. Dependency Tracking ......................................................................... 101
List of Figures
60.1. Structured Diagram of a Genetic Algorithm .......................................................... 2207
List of Tables
4.1. Backslash Escape Sequences ................................................................................... 35
4.2. Operator Precedence (highest to lowest) .................................................................... 40
8.1. Data Types ......................................................................................................... 135
8.2. Numeric Types .................................................................................................... 136
8.3. Monetary Types .................................................................................................. 141
8.4. Character Types .................................................................................................. 142
8.5. Special Character Types ........................................................................................ 143
8.6. Binary Data Types ............................................................................................... 144
8.7. bytea Literal Escaped Octets ............................................................................... 145
8.8. bytea Output Escaped Octets ............................................................................... 145
8.9. Date/Time Types ................................................................................................. 146
8.10. Date Input ......................................................................................................... 147
8.11. Time Input ........................................................................................................ 148
8.12. Time Zone Input ................................................................................................ 148
8.13. Special Date/Time Inputs ..................................................................................... 150
8.14. Date/Time Output Styles ..................................................................................... 151
8.15. Date Order Conventions ...................................................................................... 151
8.16. ISO 8601 Interval Unit Abbreviations .................................................................... 153
8.17. Interval Input ..................................................................................................... 154
8.18. Interval Output Style Examples ............................................................................ 155
8.19. Boolean Data Type ............................................................................................. 155
8.20. Geometric Types ................................................................................................ 158
8.21. Network Address Types ...................................................................................... 161
8.22. cidr Type Input Examples ................................................................................. 161
8.23. JSON primitive types and corresponding PostgreSQL types ....................................... 170
8.24. Object Identifier Types ....................................................................................... 198
8.25. Pseudo-Types .................................................................................................... 200
9.1. Comparison Operators .......................................................................................... 202
9.2. Comparison Predicates .......................................................................................... 203
9.3. Comparison Functions .......................................................................................... 205
9.4. Mathematical Operators ........................................................................................ 205
9.5. Mathematical Functions ........................................................................................ 206
9.6. Random Functions ............................................................................................... 208
9.7. Trigonometric Functions ....................................................................................... 208
9.8. SQL String Functions and Operators ....................................................................... 209
9.9. Other String Functions .......................................................................................... 210
9.10. Built-in Conversions ........................................................................................... 217
9.11. SQL Binary String Functions and Operators ........................................................... 223
9.12. Other Binary String Functions .............................................................................. 224
9.13. Bit String Operators ........................................................................................... 226
9.14. Regular Expression Match Operators ..................................................................... 229
9.15. Regular Expression Atoms ................................................................................... 233
9.16. Regular Expression Quantifiers ............................................................................. 234
9.17. Regular Expression Constraints ............................................................................ 234
9.18. Regular Expression Character-entry Escapes ........................................................... 236
9.19. Regular Expression Class-shorthand Escapes ........................................................... 237
9.20. Regular Expression Constraint Escapes .................................................................. 237
9.21. Regular Expression Back References ..................................................................... 238
9.22. ARE Embedded-option Letters ............................................................................. 238
9.23. Formatting Functions .......................................................................................... 242
9.24. Template Patterns for Date/Time Formatting ........................................................... 243
9.25. Template Pattern Modifiers for Date/Time Formatting .............................................. 244
9.26. Template Patterns for Numeric Formatting ............................................................. 246
9.27. Template Pattern Modifiers for Numeric Formatting ................................................. 247
9.28. to_char Examples ........................................................................................... 248
List of Examples
8.1. Using the Character Types .................................................................................... 143
8.2. Using the boolean Type ..................................................................................... 156
8.3. Using the Bit String Types .................................................................................... 163
9.1. XSLT Stylesheet for Converting SQL/XML Output to HTML ..................................... 289
10.1. Factorial Operator Type Resolution ....................................................................... 367
10.2. String Concatenation Operator Type Resolution ....................................................... 368
10.3. Absolute-Value and Negation Operator Type Resolution ........................................... 368
10.4. Array Inclusion Operator Type Resolution .............................................................. 369
10.5. Custom Operator on a Domain Type ..................................................................... 369
10.6. Rounding Function Argument Type Resolution ....................................................... 372
10.7. Variadic Function Resolution ............................................................................... 372
10.8. Substring Function Type Resolution ...................................................................... 373
10.9. character Storage Type Conversion .................................................................. 374
10.10. Type Resolution with Underspecified Types in a Union ........................................... 375
10.11. Type Resolution in a Simple Union ..................................................................... 375
10.12. Type Resolution in a Transposed Union ............................................................... 376
10.13. Type Resolution in a Nested Union ..................................................................... 376
11.1. Setting up a Partial Index to Exclude Common Values .............................................. 385
11.2. Setting up a Partial Index to Exclude Uninteresting Values ........................................ 385
11.3. Setting up a Partial Unique Index ......................................................................... 386
11.4. Do Not Use Partial Indexes as a Substitute for Partitioning ........................................ 387
20.1. Example pg_hba.conf Entries .......................................................................... 611
20.2. An Example pg_ident.conf File ..................................................................... 614
34.1. libpq Example Program 1 .................................................................................... 852
34.2. libpq Example Program 2 .................................................................................... 855
34.3. libpq Example Program 3 .................................................................................... 858
35.1. Large Objects with libpq Example Program ............................................................ 869
36.1. Example SQLDA Program .................................................................................. 921
36.2. ECPG Program Accessing Large Objects ............................................................... 935
42.1. Manual Installation of PL/Perl ............................................................................ 1173
43.1. Quoting Values In Dynamic Queries .................................................................... 1188
43.2. Exceptions with UPDATE/INSERT ...................................................................... 1203
43.3. A PL/pgSQL Trigger Function ............................................................................ 1217
43.4. A PL/pgSQL Trigger Function For Auditing ......................................................... 1218
43.5. A PL/pgSQL View Trigger Function For Auditing ................................................. 1219
43.6. A PL/pgSQL Trigger Function For Maintaining A Summary Table ............................ 1220
43.7. Auditing with Transition Tables .......................................................................... 1222
43.8. A PL/pgSQL Event Trigger Function ................................................................... 1224
43.9. Porting a Simple Function from PL/SQL to PL/pgSQL ............................................ 1231
43.10. Porting a Function that Creates Another Function from PL/SQL to PL/pgSQL ............ 1232
43.11. Porting a Procedure With String Manipulation and OUT Parameters from PL/SQL to
PL/pgSQL ............................................................................................................... 1233
43.12. Porting a Procedure from PL/SQL to PL/pgSQL .................................................. 1235
F.1. Create a Foreign Table for PostgreSQL CSV Logs ................................................... 2516
Preface
This book is the official documentation of PostgreSQL. It has been written by the PostgreSQL devel-
opers and other volunteers in parallel to the development of the PostgreSQL software. It describes all
the functionality that the current version of PostgreSQL officially supports.
To make the large amount of information about PostgreSQL manageable, this book has been organized
in several parts. Each part is targeted at a different class of users, or at users in different stages of their
PostgreSQL experience:
• Part I is an informal introduction for new users.
• Part II documents the SQL query language environment, including data types and functions, as well
as user-level performance tuning. Every PostgreSQL user should read this.
• Part III describes the installation and administration of the server. Everyone who runs a PostgreSQL
server, be it for private use or for others, should read this part.
• Part IV describes the programming interfaces for PostgreSQL client programs.
• Part V contains information for advanced users about the extensibility capabilities of the server.
Topics include user-defined data types and functions.
• Part VI contains reference information about SQL commands, client and server programs. This part
supports the other parts with structured information sorted by command or program.
• Part VII contains assorted information that might be of use to PostgreSQL developers.
1. What is PostgreSQL?
PostgreSQL is an object-relational database management system (ORDBMS) based on POSTGRES,
Version 4.2, developed at the University of California at Berkeley Computer Science Department.
POSTGRES pioneered many concepts that only became available in some commercial database sys-
tems much later.
PostgreSQL is an open-source descendant of this original Berkeley code. It supports a large part of
the SQL standard and offers many modern features:
• complex queries
• foreign keys
• triggers
• updatable views
• transactional integrity
• multiversion concurrency control
Also, PostgreSQL can be extended by the user in many ways, for example by adding new
• data types
• functions
• operators
• aggregate functions
• index methods
• procedural languages
And because of the liberal license, PostgreSQL can be used, modified, and distributed by anyone free
of charge for any purpose, be it private, commercial, or academic.
2. A Brief History of PostgreSQL
The object-relational database management system now known as PostgreSQL is derived from the
POSTGRES package written at the University of California at Berkeley. With over two decades of de-
velopment behind it, PostgreSQL is now the most advanced open-source database available anywhere.
2.1. The Berkeley POSTGRES Project
POSTGRES has undergone several major releases since then. The first “demoware” system became
operational in 1987 and was shown at the 1988 ACM-SIGMOD Conference. Version 1, described in
[ston90a], was released to a few external users in June 1989. In response to a critique of the first rule
system ([ston89]), the rule system was redesigned ([ston90b]), and Version 2 was released in June
1990 with the new rule system. Version 3 appeared in 1991 and added support for multiple storage
managers, an improved query executor, and a rewritten rule system. For the most part, subsequent
releases until Postgres95 (see below) focused on portability and reliability.
POSTGRES has been used to implement many different research and production applications. These
include: a financial data analysis system, a jet engine performance monitoring package, an aster-
oid tracking database, a medical information database, and several geographic information systems.
POSTGRES has also been used as an educational tool at several universities. Finally, Illustra Information
Technologies (later merged into Informix [2], which is now owned by IBM [3]) picked up the code
and commercialized it. In late 1992, POSTGRES became the primary data manager for the Sequoia
2000 scientific computing project [4].
The size of the external user community nearly doubled during 1993. It became increasingly obvious
that maintenance of the prototype code and support was taking up large amounts of time that should
have been devoted to database research. In an effort to reduce this support burden, the Berkeley POST-
GRES project officially ended with Version 4.2.
2.2. Postgres95
In 1994, Andrew Yu and Jolly Chen added an SQL language interpreter to POSTGRES. Under a new
name, Postgres95 was subsequently released to the web to find its own way in the world as an open-
source descendant of the original POSTGRES Berkeley code.
Postgres95 code was completely ANSI C and trimmed in size by 25%. Many internal changes im-
proved performance and maintainability. Postgres95 release 1.0.x ran about 30-50% faster on the Wis-
consin Benchmark compared to POSTGRES, Version 4.2. Apart from bug fixes, the following were
the major enhancements:
• The query language PostQUEL was replaced with SQL (implemented in the server). (Interface li-
brary libpq was named after PostQUEL.) Subqueries were not supported until PostgreSQL (see be-
low), but they could be imitated in Postgres95 with user-defined SQL functions. Aggregate func-
tions were re-implemented. Support for the GROUP BY query clause was also added.
• A new program (psql) was provided for interactive SQL queries, which used GNU Readline. This
largely superseded the old monitor program.
• A new front-end library, libpgtcl, supported Tcl-based clients. A sample shell, pgtclsh, pro-
vided new Tcl commands to interface Tcl programs with the Postgres95 server.
[2] https://www.ibm.com/analytics/informix
[3] https://www.ibm.com/
[4] http://meteora.ucsd.edu/s2k/s2k_home.html
• The large-object interface was overhauled. The inversion large objects were the only mechanism
for storing large objects. (The inversion file system was removed.)
• The instance-level rule system was removed. Rules were still available as rewrite rules.
• A short tutorial introducing regular SQL features as well as those of Postgres95 was distributed
with the source code.
• GNU make (instead of BSD make) was used for the build. Also, Postgres95 could be compiled with
an unpatched GCC (data alignment of doubles was fixed).
2.3. PostgreSQL
By 1996, it became clear that the name “Postgres95” would not stand the test of time. We chose a new
name, PostgreSQL, to reflect the relationship between the original POSTGRES and the more recent
versions with SQL capability. At the same time, we set the version numbering to start at 6.0, putting
the numbers back into the sequence originally begun by the Berkeley POSTGRES project.
Many people continue to refer to PostgreSQL as “Postgres” (now rarely in all capital letters) because
of tradition or because it is easier to pronounce. This usage is widely accepted as a nickname or alias.
The emphasis during development of Postgres95 was on identifying and understanding existing prob-
lems in the server code. With PostgreSQL, the emphasis has shifted to augmenting features and capa-
bilities, although work continues in all areas.
Details about what has happened in PostgreSQL since then can be found in Appendix E.
3. Conventions
The following conventions are used in the synopsis of a command: brackets ([ and ]) indicate optional
parts. Braces ({ and }) and vertical lines (|) indicate that you must choose one alternative. Dots (...)
mean that the preceding element can be repeated. All other symbols, including parentheses, should
be taken literally.
Where it enhances the clarity, SQL commands are preceded by the prompt =>, and shell commands
are preceded by the prompt $. Normally, prompts are not shown, though.
An administrator is generally a person who is in charge of installing and running the server. A user
could be anyone who is using, or wants to use, any part of the PostgreSQL system. These terms
should not be interpreted too narrowly; this book does not have fixed presumptions about system
administration procedures.
4. Further Information
Besides the documentation, that is, this book, there are other resources about PostgreSQL:
Wiki
The PostgreSQL wiki [5] contains the project's FAQ [6] (Frequently Asked Questions) list, TODO [7]
list, and detailed information about many more topics.
Web Site
The PostgreSQL web site [8] carries details on the latest release and other information to make your
work or play with PostgreSQL more productive.
[5] https://wiki.postgresql.org
[6] https://wiki.postgresql.org/wiki/Frequently_Asked_Questions
[7] https://wiki.postgresql.org/wiki/Todo
[8] https://www.postgresql.org
Mailing Lists
The mailing lists are a good place to have your questions answered, to share experiences with
other users, and to contact the developers. Consult the PostgreSQL web site for details.
Yourself!
PostgreSQL is an open-source project. As such, it depends on the user community for ongoing
support. As you begin to use PostgreSQL, you will rely on others for help, either through the
documentation or through the mailing lists. Consider contributing your knowledge back. Read
the mailing lists and answer questions. If you learn something which is not in the documentation,
write it up and contribute it. If you add features to the code, contribute them.
5. Bug Reporting Guidelines
When you find a bug in PostgreSQL we want to hear about it.
The following suggestions are intended to assist you in forming bug reports that can be handled in an
effective fashion. No one is required to follow them but doing so tends to be to everyone's advantage.
We cannot promise to fix every bug right away. If the bug is obvious, critical, or affects a lot of users,
chances are good that someone will look into it. It could also happen that we tell you to update to
a newer version to see if the bug happens there. Or we might decide that the bug cannot be fixed
before some major rewrite we might be planning is done. Or perhaps it is simply too hard and there are
more important things on the agenda. If you need help immediately, consider obtaining a commercial
support contract.
5.1. Identifying Bugs
Before you report a bug, please read and re-read the documentation to verify that you can really do
whatever it is you are trying. If a program does something different from what the documentation says,
that is a bug. That might include, but is not limited to, the following circumstances:
• A program terminates with a fatal signal or an operating system error message that would point to
a problem in the program. (A counterexample might be a “disk full” message, since you have to
fix that yourself.)
• A program accepts invalid input without a notice or error message. But keep in mind that your idea
of invalid input might be our idea of an extension or compatibility with traditional practice.
• PostgreSQL fails to compile, build, or install according to the instructions on supported platforms.
Here “program” refers to any executable, not only the backend process.
Being slow or resource-hogging is not necessarily a bug. Read the documentation or ask on one of
the mailing lists for help in tuning your applications. Failing to comply to the SQL standard is not
necessarily a bug either, unless compliance for the specific feature is explicitly claimed.
Before you continue, check on the TODO list and in the FAQ to see if your bug is already known.
If you cannot decode the information on the TODO list, report your problem. The least we can do is
make the TODO list clearer.
5.2. What to Report
The following items should be contained in every bug report:
• The exact sequence of steps from program start-up necessary to reproduce the problem. This should
be self-contained; it is not enough to send in a bare SELECT statement without the preceding CREATE
TABLE and INSERT statements, if the output should depend on the data in the tables. We
do not have the time to reverse-engineer your database schema, and if we are supposed to make up
our own data we would probably miss the problem.
The best format for a test case for SQL-related problems is a file that can be run through the psql
frontend that shows the problem. (Be sure to not have anything in your ~/.psqlrc start-up file.)
An easy way to create this file is to use pg_dump to dump out the table declarations and data needed
to set the scene, then add the problem query. You are encouraged to minimize the size of your
example, but this is not absolutely necessary. If the bug is reproducible, we will find it either way.
If your application uses some other client interface, such as PHP, then please try to isolate the
offending queries. We will probably not set up a web server to reproduce your problem. In any case
remember to provide the exact input files; do not guess that the problem happens for “large files”
or “midsize databases”, etc. since this information is too inexact to be of use.
• The output you got. Please do not say that it “didn't work” or “crashed”. If there is an error message,
show it, even if you do not understand it. If the program terminates with an operating system error,
say which. If nothing at all happens, say so. Even if the result of your test case is a program crash
or otherwise obvious it might not happen on our platform. The easiest thing is to copy the output
from the terminal, if possible.
Note
If you are reporting an error message, please obtain the most verbose form of the message.
In psql, say \set VERBOSITY verbose beforehand. If you are extracting the message
from the server log, set the run-time parameter log_error_verbosity to verbose so that all
details are logged.
Note
In case of fatal errors, the error message reported by the client might not contain all the
information available. Please also look at the log output of the database server. If you do
not keep your server's log output, this would be a good time to start doing so.
• The output you expected is very important to state. If you just write “This command gives me that
output.” or “This is not what I expected.”, we might run it ourselves, scan the output, and think it
looks OK and is exactly what we expected. We should not have to spend the time to decode the
exact semantics behind your commands. Especially refrain from merely saying that “This is not
what SQL says/Oracle does.” Digging out the correct behavior from SQL is not a fun undertaking,
nor do we all know how all the other relational databases out there behave. (If your problem is a
program crash, you can obviously omit this item.)
• Any command line options and other start-up options, including any relevant environment variables
or configuration files that you changed from the default. Again, please provide exact information.
If you are using a prepackaged distribution that starts the database server at boot time, you should
try to find out how that is done.
• The PostgreSQL version. You can run the command SELECT version(); to find out the version
of the server you are connected to. Most executable programs also support a --version option;
at least postgres --version and psql --version should work. If the function or the
options do not exist then your version is more than old enough to warrant an upgrade. If you run a
prepackaged version, such as RPMs, say so, including any subversion the package might have. If
you are talking about a Git snapshot, mention that, including the commit hash.
If your version is older than 11.17 we will almost certainly tell you to upgrade. There are many bug
fixes and improvements in each new release, so it is quite possible that a bug you have encountered
in an older release of PostgreSQL has already been fixed. We can only provide limited support
for sites using older releases of PostgreSQL; if you require more than we can provide, consider
acquiring a commercial support contract.
• Platform information. This includes the kernel name and version, C library, processor, memory
information, and so on. In most cases it is sufficient to report the vendor and version, but do not
assume everyone knows what exactly “Debian” contains or that everyone runs on x86_64. If you
have installation problems then information about the toolchain on your machine (compiler, make,
and so on) is also necessary.
Do not be afraid if your bug report becomes rather lengthy. That is a fact of life. It is better to report
everything the first time than us having to squeeze the facts out of you. On the other hand, if your
input files are huge, it is fair to ask first whether somebody is interested in looking into it. Here is an
article [9] that outlines some more tips on reporting bugs.
Do not spend all your time to figure out which changes in the input make the problem go away. This
will probably not help solve it. If it turns out that the bug cannot be fixed right away, you will still
have time to find and share your work-around. Also, once again, do not waste your time guessing why
the bug exists. We will find that out soon enough.
When writing a bug report, please avoid confusing terminology. The software package in total is
called “PostgreSQL”, sometimes “Postgres” for short. If you are specifically talking about the backend
process, mention that, do not just say “PostgreSQL crashes”. A crash of a single backend process
is quite different from a crash of the parent “postgres” process; please don't say “the server crashed”
when you mean a single backend process went down, nor vice versa. Also, client programs such as the
interactive frontend “psql” are completely separate from the backend. Please try to be specific about
whether the problem is on the client or server side.
5.3. Where to Report Bugs
In general, send bug reports to the bug report mailing list at <pgsql-bugs@lists.postgresql.org>. You
are requested to use a descriptive subject for your email message, perhaps parts of the error message.
Another method is to fill in the bug report web form available at the project's web site [10]. Entering
a bug report this way causes it to be mailed to the <pgsql-bugs@lists.postgresql.org>
mailing list.
[9] https://www.chiark.greenend.org.uk/~sgtatham/bugs.html
[10] https://www.postgresql.org/
If your bug report has security implications and you'd prefer that it not become immediately visible
in public archives, don't send it to pgsql-bugs. Security issues can be reported privately to
<security@postgresql.org>.
Do not send bug reports to any of the user mailing lists, such as <pgsql-sql@lists.postgresql.org>
or <pgsql-general@lists.postgresql.org>. These mailing lists are for
answering user questions, and their subscribers normally do not wish to receive bug reports. More
importantly, they are unlikely to fix them.
Also, please do not send reports to the developers' mailing list <pgsql-hackers@lists.postgresql.org>.
This list is for discussing the development of PostgreSQL, and it would be nice if we
could keep the bug reports separate. We might choose to take up a discussion about your bug report
on pgsql-hackers, if the problem needs more review.
If you have a problem with the documentation, the best place to report it is the documentation mailing
list <[email protected]>. Please be specific about what part of the docu-
mentation you are unhappy with.
Note
Due to the unfortunate amount of spam going around, all of the above lists will be moderated
unless you are subscribed. That means there will be some delay before the email is delivered.
If you wish to subscribe to the lists, please visit https://lists.postgresql.org/ for instructions.
Part I. Tutorial
Welcome to the PostgreSQL Tutorial. The following few chapters are intended to give a simple introduction to
PostgreSQL, relational database concepts, and the SQL language to those who are new to any one of these aspects.
We only assume some general knowledge about how to use computers. No particular Unix or programming ex-
perience is required. This part is mainly intended to give you some hands-on experience with important aspects
of the PostgreSQL system. It makes no attempt to be a complete or thorough treatment of the topics it covers.
After you have worked through this tutorial you might want to move on to reading Part II to gain a more formal
knowledge of the SQL language, or Part IV for information about developing applications for PostgreSQL. Those
who set up and manage their own server should also read Part III.
Table of Contents
1. Getting Started .......................................................................................................... 3
1.1. Installation ..................................................................................................... 3
1.2. Architectural Fundamentals ............................................................................... 3
1.3. Creating a Database ......................................................................................... 3
1.4. Accessing a Database ...................................................................................... 5
2. The SQL Language .................................................................................................... 7
2.1. Introduction .................................................................................................... 7
2.2. Concepts ........................................................................................................ 7
2.3. Creating a New Table ...................................................................................... 7
2.4. Populating a Table With Rows .......................................................................... 8
2.5. Querying a Table ............................................................................................ 9
2.6. Joins Between Tables ..................................................................................... 11
2.7. Aggregate Functions ...................................................................................... 13
2.8. Updates ....................................................................................................... 15
2.9. Deletions ...................................................................................................... 15
3. Advanced Features ................................................................................................... 16
3.1. Introduction .................................................................................................. 16
3.2. Views .......................................................................................................... 16
3.3. Foreign Keys ................................................................................................ 16
3.4. Transactions ................................................................................................. 17
3.5. Window Functions ......................................................................................... 19
3.6. Inheritance ................................................................................................... 22
3.7. Conclusion ................................................................................................... 23
Chapter 1. Getting Started
1.1. Installation
Before you can use PostgreSQL you need to install it, of course. It is possible that PostgreSQL is
already installed at your site, either because it was included in your operating system distribution
or because the system administrator already installed it. If that is the case, you should obtain infor-
mation from the operating system documentation or your system administrator about how to access
PostgreSQL.
If you are not sure whether PostgreSQL is already available or whether you can use it for your exper-
imentation then you can install it yourself. Doing so is not hard and it can be a good exercise. Post-
greSQL can be installed by any unprivileged user; no superuser (root) access is required.
If you are installing PostgreSQL yourself, then refer to Chapter 16 for instructions on installation,
and return to this guide when the installation is complete. Be sure to follow closely the section about
setting up the appropriate environment variables.
If your site administrator has not set things up in the default way, you might have some more work to
do. For example, if the database server machine is a remote machine, you will need to set the PGHOST
environment variable to the name of the database server machine. The environment variable PGPORT
might also have to be set. The bottom line is this: if you try to start an application program and it
complains that it cannot connect to the database, you should consult your site administrator or, if
that is you, the documentation to make sure that your environment is properly set up. If you did not
understand the preceding paragraph then read the next section.
1.2. Architectural Fundamentals
In database jargon, PostgreSQL uses a client/server model. A PostgreSQL session consists of the
following cooperating processes (programs):
• A server process, which manages the database files, accepts connections to the database from client
applications, and performs database actions on behalf of the clients. The database server program
is called postgres.
• The user's client (frontend) application that wants to perform database operations. Client applica-
tions can be very diverse in nature: a client could be a text-oriented tool, a graphical application, a
web server that accesses the database to display web pages, or a specialized database maintenance
tool. Some client applications are supplied with the PostgreSQL distribution; most are developed
by users.
As is typical of client/server applications, the client and the server can be on different hosts. In that
case they communicate over a TCP/IP network connection. You should keep this in mind, because
the files that can be accessed on a client machine might not be accessible (or might only be accessible
using a different file name) on the database server machine.
The PostgreSQL server can handle multiple concurrent connections from clients. To achieve this it
starts (“forks”) a new process for each connection. From that point on, the client and the new serv-
er process communicate without intervention by the original postgres process. Thus, the master
server process is always running, waiting for client connections, whereas client and associated server
processes come and go. (All of this is of course invisible to the user. We only mention it here for
completeness.)
1.3. Creating a Database
The first test to see whether you can access the database server is to try to create a database. A running
PostgreSQL server can manage many databases. Typically, a separate database is used for each project
or for each user.
Possibly, your site administrator has already created a database for your use. In that case you can omit
this step and skip ahead to the next section.
To create a new database, in this example named mydb, you use the following command:
$ createdb mydb
If this produces no response then this step was successful and you can skip over the remainder of
this section.
If instead you see a message saying that the createdb command was not found, then PostgreSQL was
not installed properly. Either it was not installed at all or your shell's search path was not set to include
it. Try calling the command with an absolute path instead:
$ /usr/local/pgsql/bin/createdb mydb
The path at your site might be different. Contact your site administrator or check the installation in-
structions to correct the situation.
Another possible response is an error message about a failed connection. This means that the server
was not started, or it was not started where createdb expected it. Again, check the installation
instructions or consult the administrator.
Yet another possibility is an error message saying that a database role does not exist, where your own
login name is mentioned. This will happen if the administrator has not created a
PostgreSQL user account for you. (PostgreSQL user accounts are distinct from operating system user
accounts.) If you are the administrator, see Chapter 21 for help creating accounts. You will need to
become the operating system user under which PostgreSQL was installed (usually postgres) to
create the first user account. It could also be that you were assigned a PostgreSQL user name that is
different from your operating system user name; in that case you need to use the -U switch or set the
PGUSER environment variable to specify your PostgreSQL user name.
If you have a user account but it does not have the privileges required to create a database, you will
see an error message saying that permission to create the database was denied.
Not every user has authorization to create new databases. If PostgreSQL refuses to create databases
for you then the site administrator needs to grant you permission to create databases. Consult your
site administrator if this occurs. If you installed PostgreSQL yourself then you should log in for the
purposes of this tutorial under the user account that you started the server as.
You can also create databases with other names. PostgreSQL allows you to create any number of
databases at a given site. Database names must have an alphabetic first character and are limited to
63 bytes in length. A convenient choice is to create a database with the same name as your current
user name. Many tools assume that database name as the default, so it can save you some typing. To
create that database, simply type:
$ createdb
If you do not want to use your database anymore you can remove it. For example, if you are the owner
(creator) of the database mydb, you can destroy it using the following command:
$ dropdb mydb
(For this command, the database name does not default to the user account name. You always need to
specify it.) This action physically removes all files associated with the database and cannot be undone,
so this should only be done with a great deal of forethought.
More about createdb and dropdb can be found in createdb and dropdb respectively.
1.4. Accessing a Database
Once you have created a database, you can access it by:
• Running the PostgreSQL interactive terminal program, called psql, which allows you to interac-
tively enter, edit, and execute SQL commands.
• Using an existing graphical frontend tool like pgAdmin or an office suite with ODBC or JDBC
support to create and manipulate a database. These possibilities are not covered in this tutorial.
• Writing a custom application, using one of the several available language bindings. These possibil-
ities are discussed further in Part IV.
You probably want to start up psql to try the examples in this tutorial. It can be activated for the
mydb database by typing the command:
$ psql mydb
If you do not supply the database name then it will default to your user account name. You already
discovered this scheme in the previous section using createdb.
In psql, you will be greeted with the following message:
psql (11.17)
Type "help" for help.
mydb=>
The last line could also be:
mydb=#
That would mean you are a database superuser, which is most likely the case if you installed the
PostgreSQL instance yourself. Being a superuser means that you are not subject to access controls.
For the purposes of this tutorial that is not important.
If you encounter problems starting psql then go back to the previous section. The diagnostics of
createdb and psql are similar, and if the former worked the latter should work as well.
The last line printed out by psql is the prompt, and it indicates that psql is listening to you and that
you can type SQL queries into a work space maintained by psql. Try out these commands:
mydb=> SELECT 2 + 2;
?column?
----------
4
(1 row)
The psql program has a number of internal commands that are not SQL commands. They begin with
the backslash character, “\”. For example, you can get help on the syntax of various PostgreSQL SQL
commands by typing:
mydb=> \h
To get out of psql, type:
mydb=> \q
and psql will quit and return you to your command shell. (For more internal commands, type \? at
the psql prompt.) The full capabilities of psql are documented in psql. In this tutorial we will not
use these features explicitly, but you can use them yourself when it is helpful.
Chapter 2. The SQL Language
2.1. Introduction
This chapter provides an overview of how to use SQL to perform simple operations. This tutorial is
only intended to give you an introduction and is in no way a complete tutorial on SQL. Numerous
books have been written on SQL, including [melt93] and [date97]. You should be aware that some
PostgreSQL language features are extensions to the standard.
In the examples that follow, we assume that you have created a database named mydb, as described
in the previous chapter, and have been able to start psql.
Examples in this manual can also be found in the PostgreSQL source distribution in the directory
src/tutorial/. (Binary distributions of PostgreSQL might not provide those files.) To use those
files, first change to that directory and run make:
$ cd .../src/tutorial
$ make
This creates the scripts and compiles the C files containing user-defined functions and types. Then,
to start the tutorial, do the following:
$ psql -s mydb
...
mydb=> \i basics.sql
The \i command reads in commands from the specified file. psql's -s option puts you in single step
mode which pauses before sending each statement to the server. The commands used in this section
are in the file basics.sql.
2.2. Concepts
PostgreSQL is a relational database management system (RDBMS). That means it is a system for
managing data stored in relations. Relation is essentially a mathematical term for table. The notion
of storing data in tables is so commonplace today that it might seem inherently obvious, but there
are a number of other ways of organizing databases. Files and directories on Unix-like operating sys-
tems form an example of a hierarchical database. A more modern development is the object-oriented
database.
Each table is a named collection of rows. Each row of a given table has the same set of named
columns, and each column is of a specific data type. Whereas columns have a fixed order in each row,
it is important to remember that SQL does not guarantee the order of the rows within the table in any
way (although they can be explicitly sorted for display).
Tables are grouped into databases, and a collection of databases managed by a single PostgreSQL
server instance constitutes a database cluster.
2.3. Creating a New Table
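You can create a new table by specifying the table name, along with all column names and their types.
The weather table used in the rest of this tutorial can be defined roughly as follows (the comments
are just annotations):
CREATE TABLE weather (
    city            varchar(80),
    temp_lo         int,           -- low temperature
    temp_hi         int,           -- high temperature
    prcp            real,          -- precipitation
    date            date
);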
You can enter this into psql with the line breaks. psql will recognize that the command is not
terminated until the semicolon.
White space (i.e., spaces, tabs, and newlines) can be used freely in SQL commands. That means you
can type the command aligned differently than above, or even all on one line. Two dashes (“--”)
introduce comments. Whatever follows them is ignored up to the end of the line. SQL is case insen-
sitive about key words and identifiers, except when identifiers are double-quoted to preserve the case
(not done above).
varchar(80) specifies a data type that can store arbitrary character strings up to 80 characters
in length. int is the normal integer type. real is a type for storing single precision floating-point
numbers. date should be self-explanatory. (Yes, the column of type date is also named date. This
might be convenient or confusing — you choose.)
PostgreSQL supports the standard SQL types int, smallint, real, double precision,
char(N), varchar(N), date, time, timestamp, and interval, as well as other types of
general utility and a rich set of geometric types. PostgreSQL can be customized with an arbitrary
number of user-defined data types. Consequently, type names are not key words in the syntax, except
where required to support special cases in the SQL standard.
The second example will store cities and their associated geographical location:
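CREATE TABLE cities (
    name            varchar(80),
    location        point
);
The point type is an example of a PostgreSQL-specific data type.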
Finally, it should be mentioned that if you don't need a table any longer or want to recreate it differently
you can remove it using the following command:
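DROP TABLE tablename;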
2.4. Populating a Table With Rows
The INSERT statement is used to populate a table with rows. Note that all data types use rather obvious
input formats. Constants that are not simple numeric values usually must be surrounded by single
quotes ('), as in the example below. The date type is actually quite flexible in what it accepts, but
for this tutorial we will stick to the unambiguous format shown here.
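For example, one row of weather data could be inserted like this (the values are only illustrative):
INSERT INTO weather VALUES ('San Francisco', 46, 50, 0.25, '1994-11-27');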
The syntax used so far requires you to remember the order of the columns. An alternative syntax allows
you to list the columns explicitly:
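INSERT INTO weather (city, temp_lo, temp_hi, prcp, date)
    VALUES ('San Francisco', 43, 57, 0.0, '1994-11-29');   -- sample values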
You can list the columns in a different order if you wish or even omit some columns, e.g., if the
precipitation is unknown:
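INSERT INTO weather (date, city, temp_hi, temp_lo)
    VALUES ('1994-11-29', 'Hayward', 54, 37);   -- sample values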
Many developers consider explicitly listing the columns better style than relying on the order implic-
itly.
Please enter all the commands shown above so you have some data to work with in the following
sections.
You could also have used COPY to load large amounts of data from flat-text files. This is usually
faster because the COPY command is optimized for this application while allowing less flexibility than
INSERT. An example would be:
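COPY weather FROM '/home/user/weather.txt';   -- the path is only an example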
where the file name for the source file must be available on the machine running the backend process,
not the client, since the backend process reads the file directly. You can read more about the COPY
command in COPY.
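2.5. Querying a Table
To retrieve data from a table, the table is queried, using an SQL SELECT statement. For example, to
retrieve all the rows of table weather, type:
SELECT * FROM weather;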
Here * is a shorthand for “all columns”. So the same result would be had with:
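SELECT city, temp_lo, temp_hi, prcp, date FROM weather;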
You can write expressions, not just simple column references, in the select list. For example, you can
do:
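SELECT city, (temp_hi+temp_lo)/2 AS temp_avg, date FROM weather;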
Notice how the AS clause is used to relabel the output column. (The AS clause is optional.)
A query can be “qualified” by adding a WHERE clause that specifies which rows are wanted. The
WHERE clause contains a Boolean (truth value) expression, and only rows for which the Boolean
expression is true are returned. The usual Boolean operators (AND, OR, and NOT) are allowed in the
qualification. For example, the following retrieves the weather of San Francisco on rainy days:
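SELECT * FROM weather
    WHERE city = 'San Francisco' AND prcp > 0.0;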
The result contains just the San Francisco rows that have nonzero precipitation.
You can request that the results of a query be returned in sorted order:
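SELECT * FROM weather
    ORDER BY city;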
In this example, the sort order isn't fully specified, and so you might get the San Francisco rows in
either order. But you'd always get a fully determined ordering if you do:
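SELECT * FROM weather
    ORDER BY city, temp_lo;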
You can request that duplicate rows be removed from the result of a query:
SELECT DISTINCT city
    FROM weather;

city
---------------
Hayward
San Francisco
(2 rows)
Here again, the result row ordering might vary. You can ensure consistent results by using DISTINCT
and ORDER BY together: [2]
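A sketch of the combined query:
SELECT DISTINCT city
    FROM weather
    ORDER BY city;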
2.6. Joins Between Tables
Queries can access multiple tables at once; such queries are called join queries. For example, to list all the weather records together with the location of the associated city, we need to compare the city column of each row of the weather table with the name column of all rows in the cities table, and select the pairs of rows where these values match; the query below does exactly that.
Note
This is only a conceptual model. The join is usually performed in a more efficient manner than
actually comparing each possible pair of rows, but this is invisible to the user.
SELECT *
FROM weather, cities
WHERE city = name;
• There is no result row for the city of Hayward. This is because there is no matching entry in the
cities table for Hayward, so the join ignores the unmatched rows in the weather table. We
will see shortly how this can be fixed.
[2] In some database systems, including older versions of PostgreSQL, the implementation of DISTINCT automatically orders the rows and so ORDER BY is unnecessary. But this is not required by the SQL standard, and current PostgreSQL does not guarantee that DISTINCT causes the rows to be ordered.
• There are two columns containing the city name. This is correct because the lists of columns from
the weather and cities tables are concatenated. In practice this is undesirable, though, so you
will probably want to list the output columns explicitly rather than using *:
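A sketch of the join with an explicit output column list:
SELECT city, temp_lo, temp_hi, prcp, date, location
    FROM weather, cities
    WHERE city = name;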
Exercise: Attempt to determine the semantics of this query when the WHERE clause is omitted.
Since the columns all had different names, the parser automatically found which table they belong
to. If there were duplicate column names in the two tables you'd need to qualify the column names
to show which one you meant, as in:
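A sketch of the fully qualified form:
SELECT weather.city, weather.temp_lo, weather.temp_hi,
       weather.prcp, weather.date, cities.location
    FROM weather, cities
    WHERE cities.name = weather.city;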
It is widely considered good style to qualify all column names in a join query, so that the query won't
fail if a duplicate column name is later added to one of the tables.
Join queries of the kind seen thus far can also be written in this alternative form:
SELECT *
FROM weather INNER JOIN cities ON (weather.city = cities.name);
This syntax is not as commonly used as the one above, but we show it here to help you understand
the following topics.
Now we will figure out how we can get the Hayward records back in. What we want the query to do
is to scan the weather table and for each row to find the matching cities row(s). If no matching
row is found we want some “empty values” to be substituted for the cities table's columns. This
kind of query is called an outer join. (The joins we have seen so far are inner joins.) The command
looks like this:
SELECT *
FROM weather LEFT OUTER JOIN cities ON (weather.city =
cities.name);
This query is called a left outer join because the table mentioned on the left of the join operator will
have each of its rows in the output at least once, whereas the table on the right will only have those
rows output that match some row of the left table. When outputting a left-table row for which there is
no right-table match, empty (null) values are substituted for the right-table columns.
Exercise: There are also right outer joins and full outer joins. Try to find out what those do.
We can also join a table against itself. This is called a self join. As an example, suppose we wish to
find all the weather records that are in the temperature range of other weather records. So we need to
compare the temp_lo and temp_hi columns of each weather row to the temp_lo and temp_hi columns of all other weather rows. We can do this with the following query:
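A sketch of the self join, using two aliases of the weather table:
SELECT W1.city, W1.temp_lo AS low, W1.temp_hi AS high,
       W2.city, W2.temp_lo AS low, W2.temp_hi AS high
    FROM weather W1, weather W2
    WHERE W1.temp_lo < W2.temp_lo AND W1.temp_hi > W2.temp_hi;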
Here we have relabeled the weather table as W1 and W2 to be able to distinguish the left and right side
of the join. You can also use these kinds of aliases in other queries to save some typing, e.g.:
SELECT *
FROM weather w, cities c
WHERE w.city = c.name;
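A sketch of the aggregate query that would produce the output below; max computes a single result (here, the highest low-temperature reading) from all the input rows:
SELECT max(temp_lo) FROM weather;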
max
-----
46
(1 row)
If we wanted to know what city (or cities) that reading occurred in, we might try:
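A sketch of the attempt the text has in mind (this query is intentionally invalid):
SELECT city FROM weather WHERE temp_lo = max(temp_lo);     -- WRONG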
but this will not work since the aggregate max cannot be used in the WHERE clause. (This restriction
exists because the WHERE clause determines which rows will be included in the aggregate calculation;
so obviously it has to be evaluated before aggregate functions are computed.) However, as is often the
case the query can be restated to accomplish the desired result, here by using a subquery:
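A sketch of the subquery form:
SELECT city FROM weather
    WHERE temp_lo = (SELECT max(temp_lo) FROM weather);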
city
---------------
San Francisco
(1 row)
This is OK because the subquery is an independent computation that computes its own aggregate
separately from what is happening in the outer query.
Aggregates are also very useful in combination with GROUP BY clauses. For example, we can get
the maximum low temperature observed in each city with:
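A sketch of the GROUP BY query:
SELECT city, max(temp_lo)
    FROM weather
    GROUP BY city;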
city | max
---------------+-----
Hayward | 37
San Francisco | 46
(2 rows)
which gives us one output row per city. Each aggregate result is computed over the table rows matching
that city. We can filter these grouped rows using HAVING:
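A sketch of the HAVING form:
SELECT city, max(temp_lo)
    FROM weather
    GROUP BY city
    HAVING max(temp_lo) < 40;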
city | max
---------+-----
Hayward | 37
(1 row)
which gives us the same results for only the cities that have all temp_lo values below 40. Finally,
if we only care about cities whose names begin with “S”, we might do:
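A sketch of the query, adding a WHERE restriction on the city name:
SELECT city, max(temp_lo)
    FROM weather
    WHERE city LIKE 'S%'
    GROUP BY city
    HAVING max(temp_lo) < 40;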
(The LIKE operator does pattern matching and is explained in Section 9.7.)
It is important to understand the interaction between aggregates and SQL's WHERE and HAVING claus-
es. The fundamental difference between WHERE and HAVING is this: WHERE selects input rows before
groups and aggregates are computed (thus, it controls which rows go into the aggregate computation),
whereas HAVING selects group rows after groups and aggregates are computed. Thus, the WHERE
clause must not contain aggregate functions; it makes no sense to try to use an aggregate to determine
which rows will be inputs to the aggregates. On the other hand, the HAVING clause always contains
aggregate functions. (Strictly speaking, you are allowed to write a HAVING clause that doesn't use
aggregates, but it's seldom useful. The same condition could be used more efficiently at the WHERE
stage.)
In the previous example, we can apply the city name restriction in WHERE, since it needs no aggregate.
This is more efficient than adding the restriction to HAVING, because we avoid doing the grouping
and aggregate calculations for all rows that fail the WHERE check.
2.8. Updates
You can update existing rows using the UPDATE command. Suppose you discover the temperature
readings are all off by 2 degrees after November 28. You can correct the data as follows:
UPDATE weather
SET temp_hi = temp_hi - 2, temp_lo = temp_lo - 2
WHERE date > '1994-11-28';
2.9. Deletions
Rows can be removed from a table using the DELETE command. Suppose you are no longer interested
in the weather of Hayward. Then you can do the following to delete those rows from the table:
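A sketch of the command:
DELETE FROM weather WHERE city = 'Hayward';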
Without a qualification, DELETE will remove all rows from the given table, leaving it empty. The
system will not request confirmation before doing this!
Chapter 3. Advanced Features
3.1. Introduction
In the previous chapter we have covered the basics of using SQL to store and access your data in
PostgreSQL. We will now discuss some more advanced features of SQL that simplify management
and prevent loss or corruption of your data. Finally, we will look at some PostgreSQL extensions.
This chapter will on occasion refer to examples found in Chapter 2 to change or improve them, so
it will be useful to have read that chapter. Some examples from this chapter can also be found in
advanced.sql in the tutorial directory. This file also contains some sample data to load, which is
not repeated here. (Refer to Section 2.1 for how to use the file.)
3.2. Views
Refer back to the queries in Section 2.6. Suppose the combined listing of weather records and city
location is of particular interest to your application, but you do not want to type the query each time
you need it. You can create a view over the query, which gives a name to the query that you can refer
to like an ordinary table:
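A sketch of such a view over the combined weather/cities query:
CREATE VIEW myview AS
    SELECT city, temp_lo, temp_hi, prcp, date, location
        FROM weather, cities
        WHERE city = name;

SELECT * FROM myview;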
Making liberal use of views is a key aspect of good SQL database design. Views allow you to en-
capsulate the details of the structure of your tables, which might change as your application evolves,
behind consistent interfaces.
Views can be used in almost any place a real table can be used. Building views upon other views is
not uncommon.
3.3. Foreign Keys
Recall the weather and cities tables from Chapter 2. Suppose you want to make sure that no one can insert rows in the weather table that do not have a matching entry in the cities table. This is called maintaining the referential integrity of your data. The new declaration of the tables would look like this:
CREATE TABLE cities (
        city     varchar(80) primary key,
        location point
);

CREATE TABLE weather (
        city      varchar(80) references cities(city),
        temp_lo   int,
        temp_hi   int,
        prcp      real,
        date      date
);
The behavior of foreign keys can be finely tuned to your application. We will not go beyond this simple
example in this tutorial, but just refer you to Chapter 5 for more information. Making correct use of
foreign keys will definitely improve the quality of your database applications, so you are strongly
encouraged to learn about them.
3.4. Transactions
Transactions are a fundamental concept of all database systems. The essential point of a transaction is
that it bundles multiple steps into a single, all-or-nothing operation. The intermediate states between
the steps are not visible to other concurrent transactions, and if some failure occurs that prevents the
transaction from completing, then none of the steps affect the database at all.
For example, consider a bank database that contains balances for various customer accounts, as well as
total deposit balances for branches. Suppose that we want to record a payment of $100.00 from Alice's
account to Bob's account. Simplifying outrageously, the SQL commands for this might look like:
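A sketch of the sequence of updates, assuming accounts and branches tables with name, balance, and branch_name columns:
UPDATE accounts SET balance = balance - 100.00
    WHERE name = 'Alice';
UPDATE branches SET balance = balance - 100.00
    WHERE name = (SELECT branch_name FROM accounts WHERE name = 'Alice');
UPDATE accounts SET balance = balance + 100.00
    WHERE name = 'Bob';
UPDATE branches SET balance = balance + 100.00
    WHERE name = (SELECT branch_name FROM accounts WHERE name = 'Bob');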
The details of these commands are not important here; the important point is that there are several
separate updates involved to accomplish this rather simple operation. Our bank's officers will want to
be assured that either all these updates happen, or none of them happen. It would certainly not do for
a system failure to result in Bob receiving $100.00 that was not debited from Alice. Nor would Alice
long remain a happy customer if she was debited without Bob being credited. We need a guarantee
that if something goes wrong partway through the operation, none of the steps executed so far will
take effect. Grouping the updates into a transaction gives us this guarantee. A transaction is said to be
atomic: from the point of view of other transactions, it either happens completely or not at all.
We also want a guarantee that once a transaction is completed and acknowledged by the database
system, it has indeed been permanently recorded and won't be lost even if a crash ensues shortly
thereafter. For example, if we are recording a cash withdrawal by Bob, we do not want any chance that
the debit to his account will disappear in a crash just after he walks out the bank door. A transactional
database guarantees that all the updates made by a transaction are logged in permanent storage (i.e.,
on disk) before the transaction is reported complete.
Another important property of transactional databases is closely related to the notion of atomic up-
dates: when multiple transactions are running concurrently, each one should not be able to see the
incomplete changes made by others. For example, if one transaction is busy totalling all the branch
balances, it would not do for it to include the debit from Alice's branch but not the credit to Bob's
branch, nor vice versa. So transactions must be all-or-nothing not only in terms of their permanent
effect on the database, but also in terms of their visibility as they happen. The updates made so far by
an open transaction are invisible to other transactions until the transaction completes, whereupon all
the updates become visible simultaneously.
In PostgreSQL, a transaction is set up by surrounding the SQL commands of the transaction with
BEGIN and COMMIT commands. So our banking transaction would actually look like:
BEGIN;
UPDATE accounts SET balance = balance - 100.00
WHERE name = 'Alice';
-- etc etc
COMMIT;
If, partway through the transaction, we decide we do not want to commit (perhaps we just noticed that
Alice's balance went negative), we can issue the command ROLLBACK instead of COMMIT, and all
our updates so far will be canceled.
PostgreSQL actually treats every SQL statement as being executed within a transaction. If you do not
issue a BEGIN command, then each individual statement has an implicit BEGIN and (if successful)
COMMIT wrapped around it. A group of statements surrounded by BEGIN and COMMIT is sometimes
called a transaction block.
Note
Some client libraries issue BEGIN and COMMIT commands automatically, so that you might
get the effect of transaction blocks without asking. Check the documentation for the interface
you are using.
It's possible to control the statements in a transaction in a more granular fashion through the use of
savepoints. Savepoints allow you to selectively discard parts of the transaction, while committing the
rest. After defining a savepoint with SAVEPOINT, you can if needed roll back to the savepoint with
ROLLBACK TO. All the transaction's database changes between defining the savepoint and rolling
back to it are discarded, but changes earlier than the savepoint are kept.
After rolling back to a savepoint, it continues to be defined, so you can roll back to it several times.
Conversely, if you are sure you won't need to roll back to a particular savepoint again, it can be
released, so the system can free some resources. Keep in mind that either releasing or rolling back to
a savepoint will automatically release all savepoints that were defined after it.
All this is happening within the transaction block, so none of it is visible to other database sessions.
When and if you commit the transaction block, the committed actions become visible as a unit to other
sessions, while the rolled-back actions never become visible at all.
Remembering the bank database, suppose we debit $100.00 from Alice's account, and credit Bob's
account, only to find later that we should have credited Wally's account. We could do it using save-
points like this:
BEGIN;
UPDATE accounts SET balance = balance - 100.00
WHERE name = 'Alice';
SAVEPOINT my_savepoint;
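-- continuation (assumed): credit Bob, then change our minds and credit Wally instead
UPDATE accounts SET balance = balance + 100.00
    WHERE name = 'Bob';
-- oops ... forget that and use Wally's account
ROLLBACK TO my_savepoint;
UPDATE accounts SET balance = balance + 100.00
    WHERE name = 'Wally';
COMMIT;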
This example is, of course, oversimplified, but there's a lot of control possible in a transaction block
through the use of savepoints. Moreover, ROLLBACK TO is the only way to regain control of a
transaction block that was put in aborted state by the system due to an error, short of rolling it back
completely and starting again.
3.5. Window Functions
A window function performs a calculation across a set of table rows that are somehow related to the current row. This is comparable to the type of calculation that can be done with an aggregate function, but unlike an aggregate a window function does not cause rows to become grouped into a single output row; the rows retain their separate identities. Here is an example that shows how to compare each employee's salary with the average salary in his or her department:
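A sketch of such a query, assuming an empsalary table with depname, empno, and salary columns:
SELECT depname, empno, salary,
       avg(salary) OVER (PARTITION BY depname)
    FROM empsalary;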
The first three output columns come directly from the table empsalary, and there is one output row
for each row in the table. The fourth column represents an average taken across all the table rows
that have the same depname value as the current row. (This actually is the same function as the
non-window avg aggregate, but the OVER clause causes it to be treated as a window function and
computed across the window frame.)
A window function call always contains an OVER clause directly following the window function's
name and argument(s). This is what syntactically distinguishes it from a normal function or non-
window aggregate. The OVER clause determines exactly how the rows of the query are split up for
processing by the window function. The PARTITION BY clause within OVER divides the rows into
groups, or partitions, that share the same values of the PARTITION BY expression(s). For each row,
the window function is computed across the rows that fall into the same partition as the current row.
You can also control the order in which rows are processed by window functions using ORDER BY
within OVER. (The window ORDER BY does not even have to match the order in which the rows are
output.) Here is an example:
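A sketch, ranking employees within each department by salary:
SELECT depname, empno, salary,
       rank() OVER (PARTITION BY depname ORDER BY salary DESC)
    FROM empsalary;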
As shown here, the rank function produces a numerical rank for each distinct ORDER BY value in
the current row's partition, using the order defined by the ORDER BY clause. rank needs no explicit
parameter, because its behavior is entirely determined by the OVER clause.
The rows considered by a window function are those of the “virtual table” produced by the query's
FROM clause as filtered by its WHERE, GROUP BY, and HAVING clauses if any. For example, a row
removed because it does not meet the WHERE condition is not seen by any window function. A query
can contain multiple window functions that slice up the data in different ways using different OVER
clauses, but they all act on the same collection of rows defined by this virtual table.
We already saw that ORDER BY can be omitted if the ordering of rows is not important. It is also
possible to omit PARTITION BY, in which case there is a single partition containing all rows.
There is another important concept associated with window functions: for each row, there is a set of
rows within its partition called its window frame. Some window functions act only on the rows of the
window frame, rather than of the whole partition. By default, if ORDER BY is supplied then the frame
consists of all rows from the start of the partition up through the current row, plus any following rows
that are equal to the current row according to the ORDER BY clause. When ORDER BY is omitted the
default frame consists of all rows in the partition. [1] Here is an example using sum:
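A sketch of a query that would produce the output below:
SELECT salary, sum(salary) OVER () FROM empsalary;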
salary | sum
--------+-------
5200 | 47100
5000 | 47100
3500 | 47100
4800 | 47100
3900 | 47100
4200 | 47100
4500 | 47100
4800 | 47100
6000 | 47100
5200 | 47100
(10 rows)
[1] There are options to define the window frame in other ways, but this tutorial does not cover them. See Section 4.2.8 for details.
Above, since there is no ORDER BY in the OVER clause, the window frame is the same as the partition,
which for lack of PARTITION BY is the whole table; in other words each sum is taken over the
whole table and so we get the same result for each output row. But if we add an ORDER BY clause,
we get very different results:
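A sketch of the modified query:
SELECT salary, sum(salary) OVER (ORDER BY salary) FROM empsalary;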
salary | sum
--------+-------
3500 | 3500
3900 | 7400
4200 | 11600
4500 | 16100
4800 | 25700
4800 | 25700
5000 | 30700
5200 | 41100
5200 | 41100
6000 | 47100
(10 rows)
Here the sum is taken from the first (lowest) salary up through the current one, including any duplicates
of the current one (notice the results for the duplicated salaries).
Window functions are permitted only in the SELECT list and the ORDER BY clause of the query.
They are forbidden elsewhere, such as in GROUP BY, HAVING and WHERE clauses. This is because
they logically execute after the processing of those clauses. Also, window functions execute after
non-window aggregate functions. This means it is valid to include an aggregate function call in the
arguments of a window function, but not vice versa.
If there is a need to filter or group rows after the window calculations are performed, you can use a
sub-select. For example:
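A sketch of such a sub-select, assuming the empsalary table used above:
SELECT depname, empno, salary
FROM
  (SELECT depname, empno, salary,
          rank() OVER (PARTITION BY depname ORDER BY salary DESC, empno) AS pos
     FROM empsalary
  ) AS ss
WHERE pos < 3;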
The above query only shows the rows from the inner query having rank less than 3.
When a query involves multiple window functions, it is possible to write out each one with a separate
OVER clause, but this is duplicative and error-prone if the same windowing behavior is wanted for
several functions. Instead, each windowing behavior can be named in a WINDOW clause and then
referenced in OVER. For example:
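A sketch of a named WINDOW clause shared by two window functions:
SELECT sum(salary) OVER w, avg(salary) OVER w
    FROM empsalary
    WINDOW w AS (PARTITION BY depname ORDER BY salary DESC);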
More details about window functions can be found in Section 4.2.8, Section 9.21, Section 7.2.5, and
the SELECT reference page.
3.6. Inheritance
Inheritance is a concept from object-oriented databases. It opens up interesting new possibilities of
database design.
Let's create two tables: A table cities and a table capitals. Naturally, capitals are also cities,
so you want some way to show the capitals implicitly when you list all cities. If you're really clever
you might invent some scheme like this:
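A sketch of such a scheme, with separate tables for capitals and non-capitals glued together by a view:
CREATE TABLE capitals (
    name       text,
    population real,
    elevation  int,    -- in feet
    state      char(2)
);

CREATE TABLE non_capitals (
    name       text,
    population real,
    elevation  int     -- in feet
);

CREATE VIEW cities AS
    SELECT name, population, elevation FROM capitals
        UNION
    SELECT name, population, elevation FROM non_capitals;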
This works OK as far as querying goes, but it gets ugly when you need to update several rows, for
one thing.
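A better solution, sketched here, is to have capitals inherit from cities, which is what the next paragraph describes:
CREATE TABLE cities (
    name       text,
    population real,
    elevation  int     -- in feet
);

CREATE TABLE capitals (
    state      char(2)
) INHERITS (cities);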
In this case, a row of capitals inherits all columns (name, population, and elevation) from
its parent, cities. The type of the column name is text, a native PostgreSQL type for variable
length character strings. The capitals table has an additional column, state, which shows its
state abbreviation. In PostgreSQL, a table can inherit from zero or more other tables.
For example, the following query finds the names of all cities, including state capitals, that are located
at an elevation over 500 feet:
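A sketch of the query:
SELECT name, elevation
    FROM cities
    WHERE elevation > 500;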
which returns:
name | elevation
-----------+-----------
Las Vegas | 2174
Mariposa | 1953
Madison | 845
(3 rows)
On the other hand, the following query finds all the cities that are not state capitals and are situated
at an elevation over 500 feet:
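A sketch of the query, using ONLY to exclude descendant tables:
SELECT name, elevation
    FROM ONLY cities
    WHERE elevation > 500;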
name | elevation
-----------+-----------
Las Vegas | 2174
Mariposa | 1953
(2 rows)
Here the ONLY before cities indicates that the query should be run over only the cities table, and
not tables below cities in the inheritance hierarchy. Many of the commands that we have already
discussed — SELECT, UPDATE, and DELETE — support this ONLY notation.
Note
Although inheritance is frequently useful, it has not been integrated with unique constraints or
foreign keys, which limits its usefulness. See Section 5.9 for more detail.
3.7. Conclusion
PostgreSQL has many features not touched upon in this tutorial introduction, which has been oriented
toward newer users of SQL. These features are discussed in more detail in the remainder of this book.
If you feel you need more introductory material, please visit the PostgreSQL web site (https://www.postgresql.org) for links to more resources.
Part II. The SQL Language
This part describes the use of the SQL language in PostgreSQL. We start with describing the general syntax of
SQL, then explain how to create the structures to hold data, how to populate the database, and how to query it. The
middle part lists the available data types and functions for use in SQL commands. The rest treats several aspects
that are important for tuning a database for optimal performance.
The information in this part is arranged so that a novice user can follow it start to end to gain a full understanding
of the topics without having to refer forward too many times. The chapters are intended to be self-contained, so
that advanced users can read the chapters individually as they choose. The information in this part is presented in
a narrative fashion in topical units. Readers looking for a complete description of a particular command should
see Part VI.
Readers of this part should know how to connect to a PostgreSQL database and issue SQL commands. Readers
that are unfamiliar with these issues are encouraged to read Part I first. SQL commands are typically entered using
the PostgreSQL interactive terminal psql, but other programs that have similar functionality can be used as well.
Table of Contents
4. SQL Syntax ............................................................................................................ 32
4.1. Lexical Structure ........................................................................................... 32
4.1.1. Identifiers and Key Words .................................................................... 32
4.1.2. Constants ........................................................................................... 34
4.1.3. Operators ........................................................................................... 38
4.1.4. Special Characters ............................................................................... 39
4.1.5. Comments ......................................................................................... 39
4.1.6. Operator Precedence ............................................................................ 40
4.2. Value Expressions ......................................................................................... 41
4.2.1. Column References ............................................................................. 42
4.2.2. Positional Parameters ........................................................................... 42
4.2.3. Subscripts .......................................................................................... 42
4.2.4. Field Selection .................................................................................... 43
4.2.5. Operator Invocations ........................................................................... 43
4.2.6. Function Calls .................................................................................... 44
4.2.7. Aggregate Expressions ......................................................................... 44
4.2.8. Window Function Calls ........................................................................ 46
4.2.9. Type Casts ......................................................................................... 49
4.2.10. Collation Expressions ......................................................................... 50
4.2.11. Scalar Subqueries .............................................................................. 50
4.2.12. Array Constructors ............................................................................ 51
4.2.13. Row Constructors .............................................................................. 52
4.2.14. Expression Evaluation Rules ............................................................... 54
4.3. Calling Functions .......................................................................................... 55
4.3.1. Using Positional Notation ..................................................................... 55
4.3.2. Using Named Notation ......................................................................... 56
4.3.3. Using Mixed Notation ......................................................................... 57
5. Data Definition ........................................................................................................ 58
5.1. Table Basics ................................................................................................. 58
5.2. Default Values .............................................................................................. 59
5.3. Constraints ................................................................................................... 60
5.3.1. Check Constraints ............................................................................... 60
5.3.2. Not-Null Constraints ............................................................................ 62
5.3.3. Unique Constraints .............................................................................. 63
5.3.4. Primary Keys ..................................................................................... 64
5.3.5. Foreign Keys ...................................................................................... 65
5.3.6. Exclusion Constraints .......................................................................... 68
5.4. System Columns ........................................................................................... 68
5.5. Modifying Tables .......................................................................................... 69
5.5.1. Adding a Column ............................................................................... 70
5.5.2. Removing a Column ............................................................................ 70
5.5.3. Adding a Constraint ............................................................................ 70
5.5.4. Removing a Constraint ........................................................................ 71
5.5.5. Changing a Column's Default Value ....................................................... 71
5.5.6. Changing a Column's Data Type ............................................................ 71
5.5.7. Renaming a Column ............................................................................ 72
5.5.8. Renaming a Table ............................................................................... 72
5.6. Privileges ..................................................................................................... 72
5.7. Row Security Policies .................................................................................... 73
5.8. Schemas ....................................................................................................... 79
5.8.1. Creating a Schema .............................................................................. 80
5.8.2. The Public Schema ............................................................................. 80
5.8.3. The Schema Search Path ...................................................................... 81
5.8.4. Schemas and Privileges ........................................................................ 82
5.8.5. The System Catalog Schema ................................................................. 82
Chapter 4. SQL Syntax
This chapter describes the syntax of SQL. It forms the foundation for understanding the following
chapters which will go into detail about how SQL commands are applied to define and modify data.
We also advise users who are already familiar with SQL to read this chapter carefully because it
contains several rules and concepts that are implemented inconsistently among SQL databases or that
are specific to PostgreSQL.
4.1. Lexical Structure
SQL input consists of a sequence of commands. A command is composed of a sequence of tokens, terminated by a semicolon (";") or the end of the input stream. Which tokens are valid depends on the syntax of the particular command.
A token can be a key word, an identifier, a quoted identifier, a literal (or constant), or a special character
symbol. Tokens are normally separated by whitespace (space, tab, newline), but need not be if there
is no ambiguity (which is generally only the case if a special character is adjacent to some other token
type).
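For example, the following is (syntactically) valid SQL input; the exact statements are a sketch of the kind of example the next paragraphs refer to:
SELECT * FROM MY_TABLE;
UPDATE MY_TABLE SET A = 5;
INSERT INTO MY_TABLE VALUES (3, 'hi there');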
This is a sequence of three commands, one per line (although this is not required; more than one
command can be on a line, and commands can usefully be split across lines).
Additionally, comments can occur in SQL input. They are not tokens, they are effectively equivalent
to whitespace.
The SQL syntax is not very consistent regarding what tokens identify commands and which are
operands or parameters. The first few tokens are generally the command name, so in the above exam-
ple we would usually speak of a “SELECT”, an “UPDATE”, and an “INSERT” command. But for
instance the UPDATE command always requires a SET token to appear in a certain position, and this
particular variation of INSERT also requires a VALUES in order to be complete. The precise syntax
rules for each command are described in Part VI.
4.1.1. Identifiers and Key Words
Tokens such as SELECT, UPDATE, or VALUES in the example above are examples of key words, that is, words that have a fixed meaning in the SQL language; other tokens are identifiers, which name tables, columns, or other database objects.
SQL identifiers and key words must begin with a letter (a-z, but also letters with diacritical marks
and non-Latin letters) or an underscore (_). Subsequent characters in an identifier or key word can be
letters, underscores, digits (0-9), or dollar signs ($). Note that dollar signs are not allowed in identifiers
according to the letter of the SQL standard, so their use might render applications less portable. The
SQL standard will not define a key word that contains digits or starts or ends with an underscore, so
identifiers of this form are safe against possible conflict with future extensions of the standard.
The system uses no more than NAMEDATALEN-1 bytes of an identifier; longer names can be written
in commands, but they will be truncated. By default, NAMEDATALEN is 64 so the maximum identifier
length is 63 bytes. If this limit is problematic, it can be raised by changing the NAMEDATALEN constant
in src/include/pg_config_manual.h.
A convention often used is to write key words in upper case and names in lower case, e.g.:
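For example (a sketch):
UPDATE my_table SET a = 5;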
There is a second kind of identifier: the delimited identifier or quoted identifier. It is formed by
enclosing an arbitrary sequence of characters in double-quotes ("). A delimited identifier is always
an identifier, never a key word. So "select" could be used to refer to a column or table named
“select”, whereas an unquoted select would be taken as a key word and would therefore provoke
a parse error when used where a table or column name is expected. The example can be written with
quoted identifiers like this:
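A sketch of the quoted form of the previous example:
UPDATE "my_table" SET "a" = 5;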
Quoted identifiers can contain any character, except the character with code zero. (To include a double
quote, write two double quotes.) This allows constructing table or column names that would otherwise
not be possible, such as ones containing spaces or ampersands. The length limitation still applies.
A variant of quoted identifiers allows including escaped Unicode characters identified by their code
points. This variant starts with U& (upper or lower case U followed by ampersand) immediately before
the opening double quote, without any spaces in between, for example U&"foo". (Note that this
creates an ambiguity with the operator &. Use spaces around the operator to avoid this problem.) Inside
the quotes, Unicode characters can be specified in escaped form by writing a backslash followed by
the four-digit hexadecimal code point number or alternatively a backslash followed by a plus sign
followed by a six-digit hexadecimal code point number. For example, the identifier "data" could
be written as
U&"d\0061t\+000061"
The following less trivial example writes the Russian word “slon” (elephant) in Cyrillic letters:
U&"\0441\043B\043E\043D"
If a different escape character than backslash is desired, it can be specified using the UESCAPE clause
after the string, for example:
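A sketch, using ! as the escape character:
U&"d!0061t!+000061" UESCAPE '!'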
The escape character can be any single character other than a hexadecimal digit, the plus sign, a single
quote, a double quote, or a whitespace character. Note that the escape character is written in single
quotes, not double quotes.
The Unicode escape syntax works only when the server encoding is UTF8. When other server en-
codings are used, only code points in the ASCII range (up to \007F) can be specified. Both the 4-
digit and the 6-digit form can be used to specify UTF-16 surrogate pairs to compose characters with
code points larger than U+FFFF, although the availability of the 6-digit form technically makes this
unnecessary. (Surrogate pairs are not stored directly, but combined into a single code point that is then
encoded in UTF-8.)
Quoting an identifier also makes it case-sensitive, whereas unquoted names are always folded to lower
case. For example, the identifiers FOO, foo, and "foo" are considered the same by PostgreSQL, but
"Foo" and "FOO" are different from these three and each other. (The folding of unquoted names to
lower case in PostgreSQL is incompatible with the SQL standard, which says that unquoted names
should be folded to upper case. Thus, foo should be equivalent to "FOO" not "foo" according to
the standard. If you want to write portable applications you are advised to always quote a particular
name or never quote it.)
4.1.2. Constants
There are three kinds of implicitly-typed constants in PostgreSQL: strings, bit strings, and numbers.
Constants can also be specified with explicit types, which can enable more accurate representation and
more efficient handling by the system. These alternatives are discussed in the following subsections.
Two string constants that are only separated by whitespace with at least one newline are concatenated
and effectively treated as if the string had been written as one constant. For example:
SELECT 'foo'
'bar';
is equivalent to:
SELECT 'foobar';
but:
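That is, writing the two constants on the same line, as in this sketch:
SELECT 'foo'      'bar';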
is not valid syntax. (This slightly bizarre behavior is specified by SQL; PostgreSQL is following the
standard.)
Any other character following a backslash is taken literally. Thus, to include a backslash character,
write two backslashes (\\). Also, a single quote can be included in an escape string by writing \',
in addition to the normal way of ''.
It is your responsibility that the byte sequences you create, especially when using the octal or hexa-
decimal escapes, compose valid characters in the server character set encoding. When the server en-
coding is UTF-8, then the Unicode escapes or the alternative Unicode escape syntax, explained in
Section 4.1.2.3, should be used instead. (The alternative would be doing the UTF-8 encoding by hand
and writing out the bytes, which would be very cumbersome.)
The Unicode escape syntax works fully only when the server encoding is UTF8. When other server
encodings are used, only code points in the ASCII range (up to \u007F) can be specified. Both the
4-digit and the 8-digit form can be used to specify UTF-16 surrogate pairs to compose characters
with code points larger than U+FFFF, although the availability of the 8-digit form technically makes
this unnecessary. (When surrogate pairs are used when the server encoding is UTF8, they are first
combined into a single code point that is then encoded in UTF-8.)
Caution
If the configuration parameter standard_conforming_strings is off, then PostgreSQL recog-
nizes backslash escapes in both regular and escape string constants. However, as of Post-
greSQL 9.1, the default is on, meaning that backslash escapes are recognized only in es-
cape string constants. This behavior is more standards-compliant, but might break applications
which rely on the historical behavior, where backslash escapes were always recognized. As
a workaround, you can set this parameter to off, but it is better to migrate away from using
backslash escapes. If you need to use a backslash escape to represent a special character, write
the string constant with an E.
or alternatively a backslash followed by a plus sign followed by a six-digit hexadecimal code point
number. For example, the string 'data' could be written as
U&'d\0061t\+000061'
The following less trivial example writes the Russian word “slon” (elephant) in Cyrillic letters:
U&'\0441\043B\043E\043D'
If a different escape character than backslash is desired, it can be specified using the UESCAPE clause
after the string, for example:
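A sketch, using ! as the escape character:
U&'d!0061t!+000061' UESCAPE '!'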
The escape character can be any single character other than a hexadecimal digit, the plus sign, a single
quote, a double quote, or a whitespace character.
The Unicode escape syntax works only when the server encoding is UTF8. When other server encod-
ings are used, only code points in the ASCII range (up to \007F) can be specified. Both the 4-digit
and the 6-digit form can be used to specify UTF-16 surrogate pairs to compose characters with code
points larger than U+FFFF, although the availability of the 6-digit form technically makes this unnec-
essary. (When surrogate pairs are used when the server encoding is UTF8, they are first combined into
a single code point that is then encoded in UTF-8.)
Also, the Unicode escape syntax for string constants only works when the configuration parameter
standard_conforming_strings is turned on. This is because otherwise this syntax could confuse clients
that parse the SQL statements to the point that it could lead to SQL injections and similar security
issues. If the parameter is set to off, this syntax will be rejected with an error message.
$$Dianne's horse$$
$SomeTag$Dianne's horse$SomeTag$
Notice that inside the dollar-quoted string, single quotes can be used without needing to be escaped.
Indeed, no characters inside a dollar-quoted string are ever escaped: the string content is always written
literally. Backslashes are not special, and neither are dollar signs, unless they are part of a sequence
matching the opening tag.
It is possible to nest dollar-quoted string constants by choosing different tags at each nesting level.
This is most commonly used in writing function definitions. For example:
$function$
BEGIN
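    -- continuation assumed; the body tests its argument against a nested dollar-quoted regular expression
    RETURN ($1 ~ $q$[\t\r\n\v\\]$q$);
END;
$function$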
The tag, if any, of a dollar-quoted string follows the same rules as an unquoted identifier, except that it
cannot contain a dollar sign. Tags are case sensitive, so $tag$String content$tag$ is correct,
but $TAG$String content$tag$ is not.
A dollar-quoted string that follows a keyword or identifier must be separated from it by whitespace;
otherwise the dollar quoting delimiter would be taken as part of the preceding identifier.
Dollar quoting is not part of the SQL standard, but it is often a more convenient way to write com-
plicated string literals than the standard-compliant single quote syntax. It is particularly useful when
representing string constants inside other constants, as is often needed in procedural function defini-
tions. With single-quote syntax, each backslash in the above example would have to be written as four
backslashes, which would be reduced to two backslashes in parsing the original string constant, and
then to one when the inner string constant is re-parsed during function execution.
Alternatively, bit-string constants can be specified in hexadecimal notation, using a leading X (upper
or lower case), e.g., X'1FF'. This notation is equivalent to a bit-string constant with four binary digits
for each hexadecimal digit.
Both forms of bit-string constant can be continued across lines in the same way as regular string
constants. Dollar quoting cannot be used in a bit-string constant.
digits
digits.[digits][e[+-]digits]
[digits].digits[e[+-]digits]
digitse[+-]digits
where digits is one or more decimal digits (0 through 9). At least one digit must be before or after the
decimal point, if one is used. At least one digit must follow the exponent marker (e), if one is present.
There cannot be any spaces or other characters embedded in the constant. Note that any leading plus
or minus sign is not actually considered part of the constant; it is an operator applied to the constant.
These are some examples of valid numeric constants:
42
3.5
4.
.001
5e2
1.925e-3
A numeric constant that contains neither a decimal point nor an exponent is initially presumed to be
type integer if its value fits in type integer (32 bits); otherwise it is presumed to be type bigint
if its value fits in type bigint (64 bits); otherwise it is taken to be type numeric. Constants that
contain decimal points and/or exponents are always initially presumed to be type numeric.
The initially assigned data type of a numeric constant is just a starting point for the type resolution
algorithms. In most cases the constant will be automatically coerced to the most appropriate type de-
pending on context. When necessary, you can force a numeric value to be interpreted as a specific data
type by casting it. For example, you can force a numeric value to be treated as type real (float4)
by writing:
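A sketch of the two cast styles:
REAL '1.23'  -- string style
1.23::REAL   -- PostgreSQL (historical) style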
These are actually just special cases of the general casting notations discussed next.
type 'string'
'string'::type
CAST ( 'string' AS type )
The string constant's text is passed to the input conversion routine for the type called type. The result
is a constant of the indicated type. The explicit type cast can be omitted if there is no ambiguity as to
the type the constant must be (for example, when it is assigned directly to a table column), in which
case it is automatically coerced.
The string constant can be written using either regular SQL notation or dollar-quoting.
It is also possible to specify a type coercion using a function-like syntax:
typename ( 'string' )
but not all type names can be used in this way; see Section 4.2.9 for details.
The ::, CAST(), and function-call syntaxes can also be used to specify run-time type conver-
sions of arbitrary expressions, as discussed in Section 4.2.9. To avoid syntactic ambiguity, the type
'string' syntax can only be used to specify the type of a simple literal constant. Another restriction
on the type 'string' syntax is that it does not work for array types; use :: or CAST() to specify
the type of an array constant.
The CAST() syntax conforms to SQL. The type 'string' syntax is a generalization of the
standard: SQL specifies this syntax only for a few data types, but PostgreSQL allows it for all types.
The syntax with :: is historical PostgreSQL usage, as is the function-call syntax.
4.1.3. Operators
An operator name is a sequence of up to NAMEDATALEN-1 (63 by default) characters from the fol-
lowing list:
+-*/<>=~!@#%^&|`?
• -- and /* cannot appear anywhere in an operator name, since they will be taken as the start of
a comment.
• A multiple-character operator name cannot end in + or -, unless the name also contains at least
one of these characters:
~!@#%^&|`?
For example, @- is an allowed operator name, but *- is not. This restriction allows PostgreSQL to
parse SQL-compliant queries without requiring spaces between tokens.
When working with non-SQL-standard operator names, you will usually need to separate adjacent
operators with spaces to avoid ambiguity. For example, if you have defined a left unary operator named
@, you cannot write X*@Y; you must write X* @Y to ensure that PostgreSQL reads it as two operator
names not one.
4.1.4. Special Characters
Some characters that are not alphanumeric have a special meaning that is different from being an operator:
• A dollar sign ($) followed by digits is used to represent a positional parameter in the body of a
function definition or a prepared statement. In other contexts the dollar sign can be part of an iden-
tifier or a dollar-quoted string constant.
• Parentheses (()) have their usual meaning to group expressions and enforce precedence. In some
cases parentheses are required as part of the fixed syntax of a particular SQL command.
• Brackets ([]) are used to select the elements of an array. See Section 8.15 for more information
on arrays.
• Commas (,) are used in some syntactical constructs to separate the elements of a list.
• The semicolon (;) terminates an SQL command. It cannot appear anywhere within a command,
except within a string constant or quoted identifier.
• The colon (:) is used to select “slices” from arrays. (See Section 8.15.) In certain SQL dialects
(such as Embedded SQL), the colon is used to prefix variable names.
• The asterisk (*) is used in some contexts to denote all the fields of a table row or composite value.
It also has a special meaning when used as the argument of an aggregate function, namely that the
aggregate does not require any explicit parameter.
• The period (.) is used in numeric constants, and to separate schema, table, and column names.
4.1.5. Comments
A comment is a sequence of characters beginning with double dashes and extending to the end of
the line, e.g.:
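A sketch of such a comment:
-- This is a standard SQL comment
Alternatively, C-style block comments can be used, as in the example that follows.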
/* multiline comment
* with nesting: /* nested block comment */
*/
where the comment begins with /* and extends to the matching occurrence of */. These block com-
ments nest, as specified in the SQL standard but unlike C, so that one can comment out larger blocks
of code that might contain existing block comments.
A comment is removed from the input stream before further syntax analysis and is effectively replaced
by whitespace.
4.1.6. Operator Precedence
Most operators have the same precedence and are left-associative; the precedence and associativity of the operators is hard-wired into the parser. You will sometimes need to add parentheses when using combinations of binary and unary operators.
For instance:
SELECT 5 ! - 6;
will be parsed as
SELECT 5 ! (- 6);
because the parser has no idea — until it is too late — that ! is defined as a postfix operator, not an
infix one. To get the desired behavior in this case, you must write:
SELECT (5 !) - 6;
Note that the operator precedence rules also apply to user-defined operators that have the same names
as the built-in operators mentioned above. For example, if you define a “+” operator for some custom
data type it will have the same precedence as the built-in “+” operator, no matter what yours does.
When a schema-qualified operator name is used in the OPERATOR syntax, as for example in:
SELECT 3 OPERATOR(pg_catalog.+) 4;
the OPERATOR construct is taken to have the default precedence shown in Table 4.2 for “any other
operator”. This is true no matter which specific operator appears inside OPERATOR().
Note
PostgreSQL versions before 9.5 used slightly different operator precedence rules. In particular,
<= >= and <> used to be treated as generic operators; IS tests used to have higher priority; and
NOT BETWEEN and related constructs acted inconsistently, being taken in some cases as hav-
ing the precedence of NOT rather than BETWEEN. These rules were changed for better com-
pliance with the SQL standard and to reduce confusion from inconsistent treatment of logical-
ly equivalent constructs. In most cases, these changes will result in no behavioral change, or
perhaps in “no such operator” failures which can be resolved by adding parentheses. However
there are corner cases in which a query might change behavior without any parsing error being
reported. If you are concerned about whether these changes have silently broken something,
you can test your application with the configuration parameter operator_precedence_warning
turned on to see if any warnings are logged.
4.2. Value Expressions
Value expressions are used in a variety of contexts, such as in the target list of the SELECT command, as new column values in INSERT or UPDATE, or in search conditions in a number of commands. A value expression is one of the following:
• A constant or literal value
• A column reference
• A positional parameter reference, in the body of a function definition or prepared statement
• A subscripted expression
• A field selection expression
• An operator invocation
• A function call
• An aggregate expression
• A window function call
• A type cast
• A collation expression
• A scalar subquery
• An array constructor
• A row constructor
• Another value expression in parentheses (used to group subexpressions and override precedence)
In addition to this list, there are a number of constructs that can be classified as an expression but do
not follow any general syntax rules. These generally have the semantics of a function or operator and
are explained in the appropriate location in Chapter 9. An example is the IS NULL clause.
We have already discussed constants in Section 4.1.2. The following sections discuss the remaining
options.
4.2.1. Column References
A column can be referenced in the form:
correlation.columnname
correlation is the name of a table (possibly qualified with a schema name), or an alias for a table
defined by means of a FROM clause. The correlation name and separating dot can be omitted if the
column name is unique across all the tables being used in the current query. (See also Chapter 7.)
4.2.2. Positional Parameters
A positional parameter reference is used to indicate a value that is supplied externally to an SQL statement, for example in SQL function definitions and prepared queries. It is written as follows:
$number
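For example (a sketch, assuming a dept table whose row type is also named dept):
CREATE FUNCTION dept(text) RETURNS dept
    AS $$ SELECT * FROM dept WHERE name = $1 $$
    LANGUAGE SQL;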
Here the $1 references the value of the first function argument whenever the function is invoked.
4.2.3. Subscripts
If an expression yields a value of an array type, then a specific element of the array value can be
extracted by writing
expression[subscript]
expression[lower_subscript:upper_subscript]
(Here, the brackets [ ] are meant to appear literally.) Each subscript is itself an expression, which
will be rounded to the nearest integer value.
In general the array expression must be parenthesized, but the parentheses can be omitted when
the expression to be subscripted is just a column reference or positional parameter. Also, multiple
subscripts can be concatenated when the original array is multidimensional. For example:
mytable.arraycolumn[4]
mytable.two_d_column[17][34]
$1[10:42]
(arrayfunction(a,b))[42]
The parentheses in the last example are required. See Section 8.15 for more about arrays.
4.2.4. Field Selection
If an expression yields a value of a composite type (row type), then a specific field of the row can be extracted by writing
expression.fieldname
In general the row expression must be parenthesized, but the parentheses can be omitted when the
expression to be selected from is just a table reference or positional parameter. For example:
mytable.mycolumn
$1.somecolumn
(rowfunction(a,b)).col3
(Thus, a qualified column reference is actually just a special case of the field selection syntax.) An
important special case is extracting a field from a table column that is of a composite type:
(compositecol).somefield
(mytable.compositecol).somefield
The parentheses are required here to show that compositecol is a column name not a table name,
or that mytable is a table name not a schema name in the second case.
You can ask for all fields of a composite value by writing .*:
(compositecol).*
This notation behaves differently depending on context; see Section 8.16.5 for details.
4.2.5. Operator Invocations
There are three possible syntaxes for an operator invocation:
expression operator expression (binary infix operator)
operator expression (unary prefix operator)
expression operator (unary postfix operator)
where the operator token follows the syntax rules of Section 4.1.3, or is one of the key words AND,
OR, and NOT, or is a qualified operator name in the form:
OPERATOR(schema.operatorname)
Which particular operators exist and whether they are unary or binary depends on what operators have
been defined by the system or the user. Chapter 9 describes the built-in operators.
4.2.6. Function Calls
The syntax for a function call is the name of a function (possibly qualified with a schema name), followed by its argument list enclosed in parentheses:
function_name ([expression [, expression ... ]] )
For example, the following computes the square root of 2:
sqrt(2)
The list of built-in functions is in Chapter 9. Other functions can be added by the user.
When issuing queries in a database where some users mistrust other users, observe security precautions
from Section 10.3 when writing function calls.
The arguments can optionally have names attached. See Section 4.3 for details.
Note
A function that takes a single argument of composite type can optionally be called using field-
selection syntax, and conversely field selection can be written in functional style. That is, the
notations col(table) and table.col are interchangeable. This behavior is not SQL-
standard but is provided in PostgreSQL because it allows use of functions to emulate “com-
puted fields”. For more information see Section 8.16.5.
4.2.7. Aggregate Expressions
An aggregate expression represents the application of an aggregate function across the rows selected by a query. An aggregate function reduces multiple inputs to a single output value, such as the sum or average of the inputs. The syntax of an aggregate expression is one of the following:
aggregate_name (expression [ , ... ] [ order_by_clause ] ) [ FILTER ( WHERE filter_clause ) ]
aggregate_name (ALL expression [ , ... ] [ order_by_clause ] ) [ FILTER ( WHERE filter_clause ) ]
aggregate_name (DISTINCT expression [ , ... ] [ order_by_clause ] ) [ FILTER ( WHERE filter_clause ) ]
aggregate_name ( * ) [ FILTER ( WHERE filter_clause ) ]
aggregate_name ( [ expression [ , ... ] ] ) WITHIN GROUP ( order_by_clause ) [ FILTER ( WHERE filter_clause ) ]
where aggregate_name is a previously defined aggregate (possibly qualified with a schema name)
and expression is any value expression that does not itself contain an aggregate expression or
a window function call. The optional order_by_clause and filter_clause are described
below.
The first form of aggregate expression invokes the aggregate once for each input row. The second
form is the same as the first, since ALL is the default. The third form invokes the aggregate once for
each distinct value of the expression (or distinct set of values, for multiple expressions) found in the
input rows. The fourth form invokes the aggregate once for each input row; since no particular input
value is specified, it is generally only useful for the count(*) aggregate function. The last form is
used with ordered-set aggregate functions, which are described below.
Most aggregate functions ignore null inputs, so that rows in which one or more of the expression(s)
yield null are discarded. This can be assumed to be true, unless otherwise specified, for all built-in
aggregates.
For example, count(*) yields the total number of input rows; count(f1) yields the number of
input rows in which f1 is non-null, since count ignores nulls; and count(distinct f1) yields
the number of distinct non-null values of f1.
Ordinarily, the input rows are fed to the aggregate function in an unspecified order. In many cases
this does not matter; for example, min produces the same result no matter what order it receives the
inputs in. However, some aggregate functions (such as array_agg and string_agg) produce
results that depend on the ordering of the input rows. When using such an aggregate, the optional
order_by_clause can be used to specify the desired ordering. The order_by_clause has
the same syntax as for a query-level ORDER BY clause, as described in Section 7.5, except that its
expressions are always just expressions and cannot be output-column names or numbers. For example:
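A sketch, where mytable and its columns a and b are hypothetical:
SELECT array_agg(a ORDER BY b DESC) FROM mytable;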
When dealing with multiple-argument aggregate functions, note that the ORDER BY clause goes after
all the aggregate arguments. For example, write this:
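A sketch of the correct form (mytable and column a are hypothetical):
SELECT string_agg(a, ',' ORDER BY a) FROM mytable;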
not this:
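The incorrect form places the delimiter after ORDER BY, as in this sketch:
SELECT string_agg(a ORDER BY a, ',') FROM mytable;  -- incorrect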
The latter is syntactically valid, but it represents a call of a single-argument aggregate function with
two ORDER BY keys (the second one being rather useless since it's a constant).
Note
The ability to specify both DISTINCT and ORDER BY in an aggregate function is a PostgreSQL extension.
Placing ORDER BY within the aggregate's regular argument list, as described so far, is used when
ordering the input rows for general-purpose and statistical aggregates, for which ordering is optional. There is a subclass of aggregate functions called ordered-set aggregates for which an order_by_clause is required, usually because the aggregate's computation is only sensible in terms
of a specific ordering of its input rows. Typical examples of ordered-set aggregates include rank
and percentile calculations. For an ordered-set aggregate, the order_by_clause is written inside
WITHIN GROUP (...), as shown in the final syntax alternative above. The expressions in the
order_by_clause are evaluated once per input row just like regular aggregate arguments, sorted
as per the order_by_clause's requirements, and fed to the aggregate function as input arguments.
(This is unlike the case for a non-WITHIN GROUP order_by_clause, which is not treated as
argument(s) to the aggregate function.) The argument expressions preceding WITHIN GROUP, if
any, are called direct arguments to distinguish them from the aggregated arguments listed in the order_by_clause. Unlike regular aggregate arguments, direct arguments are evaluated only once
per aggregate call, not once per input row. This means that they can contain variables only if those
variables are grouped by GROUP BY; this restriction is the same as if the direct arguments were not
inside an aggregate expression at all. Direct arguments are typically used for things like percentile
fractions, which only make sense as a single value per aggregation calculation. The direct argument
list can be empty; in this case, write just () not (*). (PostgreSQL will actually accept either spelling,
but only the first way conforms to the SQL standard.)
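An example of an ordered-set aggregate call (a sketch, assuming a households table with an income column):
SELECT percentile_cont(0.5) WITHIN GROUP (ORDER BY income) FROM households;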
which obtains the 50th percentile, or median, value of the income column from table households.
Here, 0.5 is a direct argument; it would make no sense for the percentile fraction to be a value varying
across rows.
If FILTER is specified, then only the input rows for which the filter_clause evaluates to true
are fed to the aggregate function; other rows are discarded. For example:
SELECT
count(*) AS unfiltered,
count(*) FILTER (WHERE i < 5) AS filtered
FROM generate_series(1,10) AS s(i);
 unfiltered | filtered
------------+----------
         10 |        4
(1 row)
The predefined aggregate functions are described in Section 9.20. Other aggregate functions can be
added by the user.
An aggregate expression can only appear in the result list or HAVING clause of a SELECT command.
It is forbidden in other clauses, such as WHERE, because those clauses are logically evaluated before
the results of aggregates are formed.
When an aggregate expression appears in a subquery (see Section 4.2.11 and Section 9.22), the aggre-
gate is normally evaluated over the rows of the subquery. But an exception occurs if the aggregate's
arguments (and filter_clause if any) contain only outer-level variables: the aggregate then be-
longs to the nearest such outer level, and is evaluated over the rows of that query. The aggregate ex-
pression as a whole is then an outer reference for the subquery it appears in, and acts as a constant over
any one evaluation of that subquery. The restriction about appearing only in the result list or HAVING
clause applies with respect to the query level that the aggregate belongs to.
A window function call applies an aggregate-like function over some portion of the rows selected by a query. Its syntax is one of the following:

function_name ([expression [, expression ... ]]) [ FILTER ( WHERE filter_clause ) ] OVER window_name
function_name ([expression [, expression ... ]]) [ FILTER ( WHERE filter_clause ) ] OVER ( window_definition )
function_name ( * ) [ FILTER ( WHERE filter_clause ) ] OVER window_name
function_name ( * ) [ FILTER ( WHERE filter_clause ) ] OVER ( window_definition )

where window_definition has the syntax

[ existing_window_name ]
[ PARTITION BY expression [, ...] ]
[ ORDER BY expression [ ASC | DESC | USING operator ] [ NULLS { FIRST | LAST } ] [, ...] ]
[ frame_clause ]

The optional frame_clause can be one of

{ RANGE | ROWS | GROUPS } frame_start [ frame_exclusion ]
{ RANGE | ROWS | GROUPS } BETWEEN frame_start AND frame_end [ frame_exclusion ]

where frame_start and frame_end can be one of

UNBOUNDED PRECEDING
offset PRECEDING
CURRENT ROW
offset FOLLOWING
UNBOUNDED FOLLOWING

and frame_exclusion can be one of

EXCLUDE CURRENT ROW
EXCLUDE GROUP
EXCLUDE TIES
EXCLUDE NO OTHERS
Here, expression represents any value expression that does not itself contain window function
calls.
window_name is a reference to a named window specification defined in the query's WINDOW clause.
Alternatively, a full window_definition can be given within parentheses, using the same syntax
as for defining a named window in the WINDOW clause; see the SELECT reference page for details. It's
worth pointing out that OVER wname is not exactly equivalent to OVER (wname ...); the latter
implies copying and modifying the window definition, and will be rejected if the referenced window
specification includes a frame clause.
The PARTITION BY clause groups the rows of the query into partitions, which are processed sepa-
rately by the window function. PARTITION BY works similarly to a query-level GROUP BY clause,
except that its expressions are always just expressions and cannot be output-column names or num-
bers. Without PARTITION BY, all rows produced by the query are treated as a single partition. The
ORDER BY clause determines the order in which the rows of a partition are processed by the window
function. It works similarly to a query-level ORDER BY clause, but likewise cannot use output-column
names or numbers. Without ORDER BY, rows are processed in an unspecified order.
The frame_clause specifies the set of rows constituting the window frame, which is a subset of
the current partition, for those window functions that act on the frame instead of the whole partition.
The set of rows in the frame can vary depending on which row is the current row. The frame can be
specified in RANGE, ROWS or GROUPS mode; in each case, it runs from the frame_start to the
frame_end. If frame_end is omitted, the end defaults to CURRENT ROW.
A frame_start of UNBOUNDED PRECEDING means that the frame starts with the first row of
the partition, and similarly a frame_end of UNBOUNDED FOLLOWING means that the frame ends
with the last row of the partition.
In RANGE or GROUPS mode, a frame_start of CURRENT ROW means the frame starts with the
current row's first peer row (a row that the window's ORDER BY clause sorts as equivalent to the
current row), while a frame_end of CURRENT ROW means the frame ends with the current row's
last peer row. In ROWS mode, CURRENT ROW simply means the current row.
In the offset PRECEDING and offset FOLLOWING frame options, the offset must be an
expression not containing any variables, aggregate functions, or window functions. The meaning of
the offset depends on the frame mode:
• In ROWS mode, the offset must yield a non-null, non-negative integer, and the option means that
the frame starts or ends the specified number of rows before or after the current row.
• In GROUPS mode, the offset again must yield a non-null, non-negative integer, and the option
means that the frame starts or ends the specified number of peer groups before or after the current
row's peer group, where a peer group is a set of rows that are equivalent in the ORDER BY ordering.
(There must be an ORDER BY clause in the window definition to use GROUPS mode.)
• In RANGE mode, these options require that the ORDER BY clause specify exactly one column. The
offset specifies the maximum difference between the value of that column in the current row and
its value in preceding or following rows of the frame. The data type of the offset expression varies
depending on the data type of the ordering column. For numeric ordering columns it is typically
of the same type as the ordering column, but for datetime ordering columns it is an interval.
For example, if the ordering column is of type date or timestamp, one could write RANGE
BETWEEN '1 day' PRECEDING AND '10 days' FOLLOWING. The offset is still
required to be non-null and non-negative, though the meaning of “non-negative” depends on its
data type.
In any case, the distance to the end of the frame is limited by the distance to the end of the partition,
so that for rows near the partition ends the frame might contain fewer rows than elsewhere.
Notice that in both ROWS and GROUPS mode, 0 PRECEDING and 0 FOLLOWING are equivalent to
CURRENT ROW. This normally holds in RANGE mode as well, for an appropriate data-type-specific
meaning of “zero”.
The frame_exclusion option allows rows around the current row to be excluded from the frame,
even if they would be included according to the frame start and frame end options. EXCLUDE CUR-
RENT ROW excludes the current row from the frame. EXCLUDE GROUP excludes the current row and
its ordering peers from the frame. EXCLUDE TIES excludes any peers of the current row from the
frame, but not the current row itself. EXCLUDE NO OTHERS simply specifies explicitly the default
behavior of not excluding the current row or its peers.
The default framing option is RANGE UNBOUNDED PRECEDING, which is the same as RANGE
BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW. With ORDER BY, this sets the frame
to be all rows from the partition start up through the current row's last ORDER BY peer. Without
ORDER BY, this means all rows of the partition are included in the window frame, since all rows
become peers of the current row.
If FILTER is specified, then only the input rows for which the filter_clause evaluates to true
are fed to the window function; other rows are discarded. Only window functions that are aggregates
accept a FILTER clause.
The built-in window functions are described in Table 9.57. Other window functions can be added by
the user. Also, any built-in or user-defined general-purpose or statistical aggregate can be used as a
window function. (Ordered-set and hypothetical-set aggregates cannot presently be used as window
functions.)
The syntaxes using * are used for calling parameter-less aggregate functions as window functions, for
example count(*) OVER (PARTITION BY x ORDER BY y). The asterisk (*) is customar-
ily not used for window-specific functions. Window-specific functions do not allow DISTINCT or
ORDER BY to be used within the function argument list.
Window function calls are permitted only in the SELECT list and the ORDER BY clause of the query.
More information about window functions can be found in Section 3.5, Section 9.21, and Section 7.2.5.
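As a brief illustration, a sketch assuming a table empsalary with columns depname, empno, and salary (the names are illustrative):

SELECT depname, empno, salary,
       rank() OVER (PARTITION BY depname ORDER BY salary DESC)
    FROM empsalary;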
A type cast specifies a conversion from one data type to another. It can be written in either of two equivalent forms:

CAST ( expression AS type )
expression::type

The CAST syntax conforms to SQL; the syntax with :: is historical PostgreSQL usage.
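A minimal sketch of the two forms:

SELECT CAST(42 AS float8);
SELECT 42::float8;   -- the same conversion, PostgreSQL-style syntax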
When a cast is applied to a value expression of a known type, it represents a run-time type conversion.
The cast will succeed only if a suitable type conversion operation has been defined. Notice that this
is subtly different from the use of casts with constants, as shown in Section 4.1.2.7. A cast applied
to an unadorned string literal represents the initial assignment of a type to a literal constant value,
and so it will succeed for any type (if the contents of the string literal are acceptable input syntax for
the data type).
An explicit type cast can usually be omitted if there is no ambiguity as to the type that a value expres-
sion must produce (for example, when it is assigned to a table column); the system will automatically
apply a type cast in such cases. However, automatic casting is only done for casts that are marked “OK
to apply implicitly” in the system catalogs. Other casts must be invoked with explicit casting syntax.
This restriction is intended to prevent surprising conversions from being applied silently.
It is also possible to specify a type cast using a function-like syntax:

typename ( expression )
However, this only works for types whose names are also valid as function names. For example, dou-
ble precision cannot be used this way, but the equivalent float8 can. Also, the names in-
terval, time, and timestamp can only be used in this fashion if they are double-quoted, because
of syntactic conflicts. Therefore, the use of the function-like cast syntax leads to inconsistencies and
should probably be avoided.
Note
The function-like syntax is in fact just a function call. When one of the two standard cast
syntaxes is used to do a run-time conversion, it will internally invoke a registered function
to perform the conversion. By convention, these conversion functions have the same name as
their output type, and thus the “function-like syntax” is nothing more than a direct invocation of
the underlying conversion function. Obviously, this is not something that a portable application
should rely on. For further details see CREATE CAST.
The COLLATE clause overrides the collation of an expression. It is appended to the expression it applies to:

expr COLLATE collation

where collation is a possibly schema-qualified identifier. The COLLATE clause binds tighter than
operators; parentheses can be used when necessary.
If no collation is explicitly specified, the database system either derives a collation from the columns
involved in the expression, or it defaults to the default collation of the database if no column is involved
in the expression.
The two common uses of the COLLATE clause are overriding the sort order in an ORDER BY clause,
for example:
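(a sketch; tbl and its columns are illustrative)

SELECT a, b, c FROM tbl ORDER BY a COLLATE "C";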
and overriding the collation of a function or operator call that has locale-sensitive results, for example:
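SELECT * FROM tbl WHERE a > 'foo' COLLATE "C";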
Note that in the latter case the COLLATE clause is attached to an input argument of the operator we
wish to affect. It doesn't matter which argument of the operator or function call the COLLATE clause is
attached to, because the collation that is applied by the operator or function is derived by considering
all arguments, and an explicit COLLATE clause will override the collations of all other arguments.
(Attaching non-matching COLLATE clauses to more than one argument, however, is an error. For
more details see Section 23.2.) Thus, this gives the same result as the previous example:
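SELECT * FROM tbl WHERE a COLLATE "C" > 'foo';

But this is an error:

SELECT * FROM tbl WHERE (a > 'foo') COLLATE "C";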
because it attempts to apply a collation to the result of the > operator, which is of the non-collatable
data type boolean.
A scalar subquery is an ordinary SELECT query in parentheses that returns exactly one row with one column; the query is executed and the single returned value is used in the surrounding value expression. For example, the following finds the largest city population in each state:
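(a sketch; the states and cities tables, with their name, state, and pop columns, are illustrative)

SELECT name, (SELECT max(pop) FROM cities WHERE cities.state = states.name)
    FROM states;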
SELECT ARRAY[1,2,3+4];
array
---------
{1,2,7}
(1 row)
By default, the array element type is the common type of the member expressions, determined using
the same rules as for UNION or CASE constructs (see Section 10.5). You can override this by explicitly
casting the array constructor to the desired type, for example:
SELECT ARRAY[1,2,22.7]::integer[];
array
----------
{1,2,23}
(1 row)
This has the same effect as casting each expression to the array element type individually. For more
on casting, see Section 4.2.9.
Multidimensional array values can be built by nesting array constructors. In the inner constructors, the
key word ARRAY can be omitted. For example, these produce the same result:
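SELECT ARRAY[ARRAY[1,2], ARRAY[3,4]];
     array
---------------
 {{1,2},{3,4}}
(1 row)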
SELECT ARRAY[[1,2],[3,4]];
array
---------------
{{1,2},{3,4}}
(1 row)
Since multidimensional arrays must be rectangular, inner constructors at the same level must produce
sub-arrays of identical dimensions. Any cast applied to the outer ARRAY constructor propagates au-
tomatically to all the inner constructors.
Multidimensional array constructor elements can be anything yielding an array of the proper kind, not
only a sub-ARRAY construct. For example:
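(a sketch, assuming a table arr with two integer-array columns f1 and f2)

CREATE TABLE arr(f1 int[], f2 int[]);

INSERT INTO arr VALUES (ARRAY[[1,2],[3,4]], ARRAY[[5,6],[7,8]]);

SELECT ARRAY[f1, f2, '{{9,10},{11,12}}'::int[]] FROM arr;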
You can construct an empty array, but since it's impossible to have an array with no type, you must
explicitly cast your empty array to the desired type. For example:
SELECT ARRAY[]::integer[];
array
-------
{}
(1 row)
It is also possible to construct an array from the results of a subquery. In this form, the array construc-
tor is written with the key word ARRAY followed by a parenthesized (not bracketed) subquery. For
example:
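A sketch, using the system catalog pg_proc (the query returns a one-dimensional array of oid values):

SELECT ARRAY(SELECT oid FROM pg_proc WHERE proname LIKE 'bytea%');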
The subquery must return a single column. If the subquery's output column is of a non-array type,
the resulting one-dimensional array will have an element for each row in the subquery result, with an
element type matching that of the subquery's output column. If the subquery's output column is of an
array type, the result will be an array of the same type but one higher dimension; in this case all the
subquery rows must yield arrays of identical dimensionality, else the result would not be rectangular.
The subscripts of an array value built with ARRAY always begin with one. For more information about
arrays, see Section 8.15.
A row constructor is an expression that builds a row value (also called a composite value) from values for its member fields; it consists of the key word ROW, a left parenthesis, zero or more expressions separated by commas, and a right parenthesis, for example ROW(1, 2.5, 'this is a test'). The key word ROW is optional when there is more than one expression in the list.
A row constructor can include the syntax rowvalue.*, which will be expanded to a list of the
elements of the row value, just as occurs when the .* syntax is used at the top level of a SELECT list
(see Section 8.16.5). For example, if table t has columns f1 and f2, these are the same:
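SELECT ROW(t.*, 42) FROM t;
SELECT ROW(t.f1, t.f2, 42) FROM t;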
Note
Before PostgreSQL 8.2, the .* syntax was not expanded in row constructors, so that writing
ROW(t.*, 42) created a two-field row whose first field was another row value. The new
behavior is usually more useful. If you need the old behavior of nested row values, write the
inner row value without .*, for instance ROW(t, 42).
By default, the value created by a ROW expression is of an anonymous record type. If necessary, it can
be cast to a named composite type — either the row type of a table, or a composite type created with
CREATE TYPE AS. An explicit cast might be needed to avoid ambiguity. For example:
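A minimal sketch (mytable stands for a table's row type and myrowtype for a composite type created with CREATE TYPE AS; the names and field layouts are illustrative):

SELECT ROW(1, 2.5, 'this is a test')::mytable;
SELECT CAST(ROW(11, 'this is a test', 2.5) AS myrowtype);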
Row constructors can be used to build composite values to be stored in a composite-type table column,
or to be passed to a function that accepts a composite parameter. Also, it is possible to compare two
row values or test a row with IS NULL or IS NOT NULL, for example:
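SELECT ROW(1, 2.5, 'this is a test') = ROW(1, 3, 'not the same');

SELECT ROW(t.*) IS NULL FROM t;  -- detect all-null rows (t is an illustrative table)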
For more detail see Section 9.23. Row constructors can also be used in connection with subqueries,
as discussed in Section 9.22.
The order of evaluation of subexpressions is not defined; in particular, the inputs of an operator or function are not necessarily evaluated left-to-right or in any other fixed order. Furthermore, if the result of an expression can be determined by evaluating only some parts of it, then other subexpressions might not be evaluated at all. For instance, if one wrote:
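SELECT true OR somefunc();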
then somefunc() would (probably) not be called at all. The same would be the case if one wrote:
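SELECT somefunc() OR true;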
Note that this is not the same as the left-to-right “short-circuiting” of Boolean operators that is found
in some programming languages.
As a consequence, it is unwise to use functions with side effects as part of complex expressions. It is
particularly dangerous to rely on side effects or evaluation order in WHERE and HAVING clauses, since
those clauses are extensively reprocessed as part of developing an execution plan. Boolean expressions
(AND/OR/NOT combinations) in those clauses can be reorganized in any manner allowed by the laws
of Boolean algebra.
When it is essential to force evaluation order, a CASE construct (see Section 9.17) can be used. For
example, this is an untrustworthy way of trying to avoid division by zero in a WHERE clause:
SELECT ... WHERE CASE WHEN x > 0 THEN y/x > 1.5 ELSE false END;
A CASE construct used in this fashion will defeat optimization attempts, so it should only be done
when necessary. (In this particular example, it would be better to sidestep the problem by writing y
> 1.5*x instead.)
CASE is not a cure-all for such issues, however. One limitation of the technique illustrated above is
that it does not prevent early evaluation of constant subexpressions. As described in Section 38.7,
functions and operators marked IMMUTABLE can be evaluated when the query is planned rather than
when it is executed. Thus for example
SELECT CASE WHEN x > 0 THEN x ELSE 1/0 END FROM tab;
is likely to result in a division-by-zero failure due to the planner trying to simplify the constant subex-
pression, even if every row in the table has x > 0 so that the ELSE arm would never be entered
at run time.
While that particular example might seem silly, related cases that don't obviously involve constants
can occur in queries executed within functions, since the values of function arguments and local vari-
ables can be inserted into queries as constants for planning purposes. Within PL/pgSQL functions, for
example, using an IF-THEN-ELSE statement to protect a risky computation is much safer than just
nesting it in a CASE expression.
Another limitation of the same kind is that a CASE cannot prevent evaluation of an aggregate expres-
sion contained within it, because aggregate expressions are computed before other expressions in a
SELECT list or HAVING clause are considered. For example, the following query can cause a divi-
sion-by-zero error despite seemingly having protected against it:
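(a sketch; the departments table and its employees and expenses columns are illustrative)

SELECT CASE WHEN min(employees) > 0
            THEN avg(expenses / employees)
       END
    FROM departments;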
The min() and avg() aggregates are computed concurrently over all the input rows, so if any row
has employees equal to zero, the division-by-zero error will occur before there is any opportunity
to test the result of min(). Instead, use a WHERE or FILTER clause to prevent problematic input
rows from reaching an aggregate function in the first place.
PostgreSQL allows functions that have named parameters to be called using either positional or named notation. In positional notation, arguments are written in the same order as the parameters are defined in the function declaration; in named notation, each argument is matched to a parameter by name (written as name => value) and the arguments can be given in any order. In either notation, parameters that have default values given in the function declaration need not be written in the call at all. But this is particularly useful in named notation, since any combination of parameters can be omitted; while in positional notation parameters can only be omitted from right to left.
PostgreSQL also supports mixed notation, which combines positional and named notation. In this case,
positional parameters are written first and named parameters appear after them.
The following examples will illustrate the usage of all three notations, using the following function
definition:
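A sketch of such a function, matching the shape the examples below assume (two text parameters a and b, plus a boolean uppercase parameter defaulting to false):

CREATE FUNCTION concat_lower_or_upper(a text, b text, uppercase boolean DEFAULT false)
RETURNS text
AS
$$
 SELECT CASE
        WHEN $3 THEN UPPER($1 || ' ' || $2)
        ELSE LOWER($1 || ' ' || $2)
        END;
$$
LANGUAGE SQL IMMUTABLE STRICT;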
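Using positional notation:

SELECT concat_lower_or_upper('Hello', 'World', true);
 concat_lower_or_upper
------------------------
 HELLO WORLD
(1 row)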
All arguments are specified in order. The result is upper case since uppercase is specified as true.
Another example is:
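SELECT concat_lower_or_upper('Hello', 'World');
 concat_lower_or_upper
-----------------------
 hello world
(1 row)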
Here, the uppercase parameter is omitted, so it receives its default value of false, resulting in
lower case output. In positional notation, arguments can be omitted from right to left so long as they
have defaults.
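In named notation, the same call looks like this:

SELECT concat_lower_or_upper(a => 'Hello', b => 'World');
 concat_lower_or_upper
-----------------------
 hello world
(1 row)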
Again, the argument uppercase was omitted so it is set to false implicitly. One advantage of
using named notation is that the arguments may be specified in any order, for example:
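SELECT concat_lower_or_upper(a => 'Hello', uppercase => true, b => 'World');
 concat_lower_or_upper
------------------------
 HELLO WORLD
(1 row)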
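For example, using mixed notation:

SELECT concat_lower_or_upper('Hello', 'World', uppercase => true);
 concat_lower_or_upper
------------------------
 HELLO WORLD
(1 row)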
In the above query, the arguments a and b are specified positionally, while uppercase is specified
by name. In this example, that adds little except documentation. With a more complex function having
numerous parameters that have default values, named or mixed notation can save a great deal of writing
and reduce chances for error.
Note
Named and mixed call notations currently cannot be used when calling an aggregate function
(but they do work when an aggregate function is used as a window function).
Chapter 5. Data Definition
This chapter covers how one creates the database structures that will hold one's data. In a relational
database, the raw data is stored in tables, so the majority of this chapter is devoted to explaining how
tables are created and modified and what features are available to control what data is stored in the
tables. Subsequently, we discuss how tables can be organized into schemas, and how privileges can
be assigned to tables. Finally, we will briefly look at other features that affect the data storage, such
as inheritance, table partitioning, views, functions, and triggers.
Each column has a data type. The data type constrains the set of possible values that can be assigned to
a column and assigns semantics to the data stored in the column so that it can be used for computations.
For instance, a column declared to be of a numerical type will not accept arbitrary text strings, and
the data stored in such a column can be used for mathematical computations. By contrast, a column
declared to be of a character string type will accept almost any kind of data but it does not lend itself
to mathematical calculations, although other operations such as string concatenation are available.
PostgreSQL includes a sizable set of built-in data types that fit many applications. Users can also
define their own data types. Most built-in data types have obvious names and semantics, so we defer
a detailed explanation to Chapter 8. Some of the frequently used data types are integer for whole
numbers, numeric for possibly fractional numbers, text for character strings, date for dates,
time for time-of-day values, and timestamp for values containing both date and time.
To create a table, you use the aptly named CREATE TABLE command. In this command you specify
at least a name for the new table, the names of the columns and the data type of each column. For
example:
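CREATE TABLE my_first_table (
    first_column text,
    second_column integer
);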
This creates a table named my_first_table with two columns. The first column is named
first_column and has a data type of text; the second column has the name second_column
and the type integer. The table and column names follow the identifier syntax explained in Sec-
tion 4.1.1. The type names are usually also identifiers, but there are some exceptions. Note that the
column list is comma-separated and surrounded by parentheses.
Of course, the previous example was heavily contrived. Normally, you would give names to your
tables and columns that convey what kind of data they store. So let's look at a more realistic example:
CREATE TABLE products (
    product_no integer,
    name text,
    price numeric
);
(The numeric type can store fractional components, as would be typical of monetary amounts.)
Tip
When you create many interrelated tables it is wise to choose a consistent naming pattern for
the tables and columns. For instance, there is a choice of using singular or plural nouns for
table names, both of which are favored by some theorist or other.
There is a limit on how many columns a table can contain. Depending on the column types, it is
between 250 and 1600. However, defining a table with anywhere near this many columns is highly
unusual and often a questionable design.
If you no longer need a table, you can remove it using the DROP TABLE command. For example:
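DROP TABLE my_first_table;
DROP TABLE products;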
Attempting to drop a table that does not exist is an error. Nevertheless, it is common in SQL script files
to unconditionally try to drop each table before creating it, ignoring any error messages, so that the
script works whether or not the table exists. (If you like, you can use the DROP TABLE IF EXISTS
variant to avoid the error messages, but this is not standard SQL.)
If you need to modify a table that already exists, see Section 5.5 later in this chapter.
With the tools discussed so far you can create fully functional tables. The remainder of this chapter is
concerned with adding features to the table definition to ensure data integrity, security, or convenience.
If you are eager to fill your tables with data now you can skip ahead to Chapter 6 and read the rest
of this chapter later.
If no default value is declared explicitly, the default value is the null value. This usually makes sense
because a null value can be considered to represent unknown data.
In a table definition, default values are listed after the column data type. For example:
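CREATE TABLE products (
    product_no integer,
    name text,
    price numeric DEFAULT 9.99
);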
The default value can be an expression, which will be evaluated whenever the default value is inserted
(not when the table is created). A common example is for a timestamp column to have a default of
CURRENT_TIMESTAMP, so that it gets set to the time of row insertion. Another common example is
generating a “serial number” for each row. In PostgreSQL this is typically done by something like:
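CREATE TABLE products (
    product_no integer DEFAULT nextval('products_product_no_seq'),
    ...
);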
where the nextval() function supplies successive values from a sequence object (see Section 9.16).
This arrangement is sufficiently common that there's a special shorthand for it:
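CREATE TABLE products (
    product_no SERIAL,
    ...
);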
5.3. Constraints
Data types are a way to limit the kind of data that can be stored in a table. For many applications,
however, the constraint they provide is too coarse. For example, a column containing a product price
should probably only accept positive values. But there is no standard data type that accepts only pos-
itive numbers. Another issue is that you might want to constrain column data with respect to other
columns or rows. For example, in a table containing product information, there should be only one
row for each product number.
To that end, SQL allows you to define constraints on columns and tables. Constraints give you as
much control over the data in your tables as you wish. If a user attempts to store data in a column
that would violate a constraint, an error is raised. This applies even if the value came from the default
value definition.
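For instance, to require positive product prices, you could declare a check constraint on the price column:

CREATE TABLE products (
    product_no integer,
    name text,
    price numeric CHECK (price > 0)
);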
As you see, the constraint definition comes after the data type, just like default value definitions.
Default values and constraints can be listed in any order. A check constraint consists of the key word
CHECK followed by an expression in parentheses. The check constraint expression should involve the
column thus constrained, otherwise the constraint would not make too much sense.
You can also give the constraint a separate name. This clarifies error messages and allows you to refer
to the constraint when you need to change it. The syntax is:
CREATE TABLE products (
    product_no integer,
    name text,
    price numeric CONSTRAINT positive_price CHECK (price > 0)
);
So, to specify a named constraint, use the key word CONSTRAINT followed by an identifier followed
by the constraint definition. (If you don't specify a constraint name in this way, the system chooses
a name for you.)
A check constraint can also refer to several columns. Say you store a regular price and a discounted
price, and you want to ensure that the discounted price is lower than the regular price:
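CREATE TABLE products (
    product_no integer,
    name text,
    price numeric CHECK (price > 0),
    discounted_price numeric CHECK (discounted_price > 0),
    CHECK (price > discounted_price)
);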
The first two constraints should look familiar. The third one uses a new syntax. It is not attached to a
particular column, instead it appears as a separate item in the comma-separated column list. Column
definitions and these constraint definitions can be listed in mixed order.
We say that the first two constraints are column constraints, whereas the third one is a table constraint
because it is written separately from any one column definition. Column constraints can also be written
as table constraints, while the reverse is not necessarily possible, since a column constraint is supposed
to refer to only the column it is attached to. (PostgreSQL doesn't enforce that rule, but you should
follow it if you want your table definitions to work with other database systems.) The above example
could also be written with each check as a separate table constraint, or with the conditions on discounted_price combined into a single table-level CHECK clause; it is a matter of taste.
Names can be assigned to table constraints in the same way as column constraints:
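CREATE TABLE products (
    product_no integer,
    name text,
    price numeric,
    CHECK (price > 0),
    discounted_price numeric,
    CHECK (discounted_price > 0),
    CONSTRAINT valid_discount CHECK (price > discounted_price)
);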
It should be noted that a check constraint is satisfied if the check expression evaluates to true or the
null value. Since most expressions will evaluate to the null value if any operand is null, they will not
prevent null values in the constrained columns. To ensure that a column does not contain null values,
the not-null constraint described in the next section can be used.
Note
PostgreSQL does not support CHECK constraints that reference table data other than the new
or updated row being checked. While a CHECK constraint that violates this rule may appear
to work in simple tests, it cannot guarantee that the database will not reach a state in which
the constraint condition is false (due to subsequent changes of the other row(s) involved). This
would cause a database dump and restore to fail. The restore could fail even when the complete
database state is consistent with the constraint, due to rows not being loaded in an order that
will satisfy the constraint. If possible, use UNIQUE, EXCLUDE, or FOREIGN KEY constraints
to express cross-row and cross-table restrictions.
If what you desire is a one-time check against other rows at row insertion, rather than a con-
tinuously-maintained consistency guarantee, a custom trigger can be used to implement that.
(This approach avoids the dump/restore problem because pg_dump does not reinstall triggers
until after restoring data, so that the check will not be enforced during a dump/restore.)
Note
PostgreSQL assumes that CHECK constraints' conditions are immutable, that is, they will al-
ways give the same result for the same input row. This assumption is what justifies examin-
ing CHECK constraints only when rows are inserted or updated, and not at other times. (The
warning above about not referencing other table data is really a special case of this restriction.)
A not-null constraint simply specifies that a column must not assume the null value; it is always written as a column constraint. A not-null constraint is functionally equivalent to creating a check constraint CHECK (column_name IS NOT NULL), but in PostgreSQL creating an explicit not-null constraint is more efficient. The drawback is that you cannot give
explicit names to not-null constraints created this way.
Of course, a column can have more than one constraint. Just write the constraints one after another:
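CREATE TABLE products (
    product_no integer NOT NULL,
    name text NOT NULL,
    price numeric NOT NULL CHECK (price > 0)
);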
The order doesn't matter. It does not necessarily determine in which order the constraints are checked.
The NOT NULL constraint has an inverse: the NULL constraint. This does not mean that the column
must be null, which would surely be useless. Instead, this simply selects the default behavior that the
column might be null. The NULL constraint is not present in the SQL standard and should not be used
in portable applications. (It was only added to PostgreSQL to be compatible with some other database
systems.) Some users, however, like it because it makes it easy to toggle the constraint in a script file.
For example, you could start with:
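CREATE TABLE products (
    product_no integer NULL,
    name text NULL,
    price numeric NULL
);

and then insert the NOT key word where desired.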
Tip
In most database designs the majority of columns should be marked not null.
To define a unique constraint for a group of columns, write it as a table constraint with the column
names separated by commas:
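CREATE TABLE example (
    a integer,
    b integer,
    c integer,
    UNIQUE (a, c)
);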
This specifies that the combination of values in the indicated columns is unique across the whole table,
though any one of the columns need not be (and ordinarily isn't) unique.
You can assign your own name for a unique constraint, in the usual way:
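CREATE TABLE products (
    product_no integer CONSTRAINT must_be_different UNIQUE,
    name text,
    price numeric
);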
Adding a unique constraint will automatically create a unique B-tree index on the column or group of
columns listed in the constraint. A uniqueness restriction covering only some rows cannot be written
as a unique constraint, but it is possible to enforce such a restriction by creating a unique partial index.
In general, a unique constraint is violated if there is more than one row in the table where the values of
all of the columns included in the constraint are equal. However, two null values are never considered
equal in this comparison. That means even in the presence of a unique constraint it is possible to
store duplicate rows that contain a null value in at least one of the constrained columns. This behavior
conforms to the SQL standard, but we have heard that other SQL databases might not follow this rule.
So be careful when developing applications that are intended to be portable.
A primary key constraint indicates that a column, or group of columns, can be used as a unique identifier for rows in the table; this requires that the values be both unique and not null. Primary keys can span more than one column; the syntax is similar to unique constraints:
CREATE TABLE example (
    a integer,
    b integer,
    c integer,
    PRIMARY KEY (a, c)
);
Adding a primary key will automatically create a unique B-tree index on the column or group of
columns listed in the primary key, and will force the column(s) to be marked NOT NULL.
A table can have at most one primary key. (There can be any number of unique and not-null constraints,
which are functionally almost the same thing, but only one can be identified as the primary key.)
Relational database theory dictates that every table must have a primary key. This rule is not enforced
by PostgreSQL, but it is usually best to follow it.
Primary keys are useful both for documentation purposes and for client applications. For example, a
GUI application that allows modifying row values probably needs to know the primary key of a table
to be able to identify rows uniquely. There are also various ways in which the database system makes
use of a primary key if one has been declared; for example, the primary key defines the default target
column(s) for foreign keys referencing its table.
Say you have the product table that we have used several times already:
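CREATE TABLE products (
    product_no integer PRIMARY KEY,
    name text,
    price numeric
);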
Let's also assume you have a table storing orders of those products. We want to ensure that the orders
table only contains orders of products that actually exist. So we define a foreign key constraint in the
orders table that references the products table:
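CREATE TABLE orders (
    order_id integer PRIMARY KEY,
    product_no integer REFERENCES products (product_no),
    quantity integer
);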
Now it is impossible to create orders with non-NULL product_no entries that do not appear in
the products table.
We say that in this situation the orders table is the referencing table and the products table is the
referenced table. Similarly, there are referencing and referenced columns.
You can also shorten the above command to

CREATE TABLE orders (
    order_id integer PRIMARY KEY,
    product_no integer REFERENCES products,
    quantity integer
);
because in absence of a column list the primary key of the referenced table is used as the referenced
column(s).
You can assign your own name for a foreign key constraint, in the usual way.
A foreign key can also constrain and reference a group of columns. As usual, it then needs to be written
in table constraint form. Here is a contrived syntax example:
CREATE TABLE t1 (
a integer PRIMARY KEY,
b integer,
c integer,
FOREIGN KEY (b, c) REFERENCES other_table (c1, c2)
);
Of course, the number and type of the constrained columns need to match the number and type of
the referenced columns.
Sometimes it is useful for the “other table” of a foreign key constraint to be the same table; this is
called a self-referential foreign key. For example, if you want rows of a table to represent nodes of
a tree structure, you could write
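CREATE TABLE tree (
    node_id integer PRIMARY KEY,
    parent_id integer REFERENCES tree,
    name text,
    ...
);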
A top-level node would have NULL parent_id, but non-NULL parent_id entries would be
constrained to reference valid rows of the table.
A table can have more than one foreign key constraint. This is used to implement many-to-many
relationships between tables. Say you have tables about products and orders, but now you want to
allow one order to contain possibly many products (which the structure above did not allow). You
could use this table structure:
CREATE TABLE products (
    product_no integer PRIMARY KEY,
    name text,
    price numeric
);

CREATE TABLE orders (
    order_id integer PRIMARY KEY,
    shipping_address text,
    ...
);

CREATE TABLE order_items (
    product_no integer REFERENCES products,
    order_id integer REFERENCES orders,
    quantity integer,
    PRIMARY KEY (product_no, order_id)
);
Notice that the primary key overlaps with the foreign keys in the last table.
We know that the foreign keys disallow creation of orders that do not relate to any products. But what
if a product is removed after an order is created that references it? SQL allows you to handle that as
well. Intuitively, we have a few options: disallow deleting a referenced product, delete the orders as well, or something else.
To illustrate this, let's implement the following policy on the many-to-many relationship example above: when someone wants to remove a product that is still referenced by an order (via order_items), we disallow it. If someone removes an order, the order items are removed as well:
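CREATE TABLE order_items (
    product_no integer REFERENCES products ON DELETE RESTRICT,
    order_id integer REFERENCES orders ON DELETE CASCADE,
    quantity integer,
    PRIMARY KEY (product_no, order_id)
);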
Restricting and cascading deletes are the two most common options. RESTRICT prevents deletion of
a referenced row. NO ACTION means that if any referencing rows still exist when the constraint is
checked, an error is raised; this is the default behavior if you do not specify anything. (The essential
difference between these two choices is that NO ACTION allows the check to be deferred until later
in the transaction, whereas RESTRICT does not.) CASCADE specifies that when a referenced row is
deleted, row(s) referencing it should be automatically deleted as well. There are two other options:
SET NULL and SET DEFAULT. These cause the referencing column(s) in the referencing row(s) to
be set to nulls or their default values, respectively, when the referenced row is deleted. Note that these
do not excuse you from observing any constraints. For example, if an action specifies SET DEFAULT
but the default value would not satisfy the foreign key constraint, the operation will fail.
Analogous to ON DELETE there is also ON UPDATE which is invoked when a referenced column is
changed (updated). The possible actions are the same. In this case, CASCADE means that the updated
values of the referenced column(s) should be copied into the referencing row(s).
Normally, a referencing row need not satisfy the foreign key constraint if any of its referencing
columns are null. If MATCH FULL is added to the foreign key declaration, a referencing row escapes
satisfying the constraint only if all its referencing columns are null (so a mix of null and non-null
values is guaranteed to fail a MATCH FULL constraint). If you don't want referencing rows to be able
to avoid satisfying the foreign key constraint, declare the referencing column(s) as NOT NULL.
A foreign key must reference columns that either are a primary key or form a unique constraint. This
means that the referenced columns always have an index (the one underlying the primary key or unique
constraint); so checks on whether a referencing row has a match will be efficient. Since a DELETE
of a row from the referenced table or an UPDATE of a referenced column will require a scan of the
referencing table for rows matching the old value, it is often a good idea to index the referencing
columns too. Because this is not always needed, and there are many choices available on how to
index, declaration of a foreign key constraint does not automatically create an index on the referencing
columns.
More information about updating and deleting data is in Chapter 6. Also see the description of foreign
key constraint syntax in the reference documentation for CREATE TABLE.
Exclusion constraints ensure that if any two rows are compared on the specified columns or expressions using the specified operators, at least one of these operator comparisons will return false or null. See also CREATE TABLE ... CONSTRAINT ... EXCLUDE for details.
Adding an exclusion constraint will automatically create an index of the type specified in the constraint declaration.
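A minimal sketch of the syntax, using the circle type and its overlap operator &&:

CREATE TABLE circles (
    c circle,
    EXCLUDE USING gist (c WITH &&)
);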
Every table has several system columns that are implicitly defined by the system; their names cannot be used as names of user-defined columns. The system columns include the following.
oid
The object identifier (object ID) of a row. This column is only present if the table was created
using WITH OIDS, or if the default_with_oids configuration variable was set at the time. This
column is of type oid (same name as the column); see Section 8.19 for more information about
the type.
tableoid
The OID of the table containing this row. This column is particularly handy for queries that select
from inheritance hierarchies (see Section 5.9), since without it, it's difficult to tell which individual
table a row came from. The tableoid can be joined against the oid column of pg_class
to obtain the table name.
xmin
The identity (transaction ID) of the inserting transaction for this row version. (A row version is an
individual state of a row; each update of a row creates a new row version for the same logical row.)
cmin
The command identifier (starting at zero) within the inserting transaction.
xmax
The identity (transaction ID) of the deleting transaction, or zero for an undeleted row version. It
is possible for this column to be nonzero in a visible row version. That usually indicates that the
deleting transaction hasn't committed yet, or that an attempted deletion was rolled back.
cmax
The command identifier within the deleting transaction, or zero.
ctid
The physical location of the row version within its table. Note that although the ctid can be
used to locate the row version very quickly, a row's ctid will change if it is updated or moved
by VACUUM FULL. Therefore ctid is useless as a long-term row identifier. The OID, or even
better a user-defined serial number, should be used to identify logical rows.
OIDs are 32-bit quantities and are assigned from a single cluster-wide counter. In a large or long-lived
database, it is possible for the counter to wrap around. Hence, it is bad practice to assume that OIDs
are unique, unless you take steps to ensure that this is the case. If you need to identify the rows in
a table, using a sequence generator is strongly recommended. However, OIDs can be used as well,
provided that a few additional precautions are taken:
• A unique constraint should be created on the OID column of each table for which the OID will be
used to identify rows. When such a unique constraint (or unique index) exists, the system takes care
not to generate an OID matching an already-existing row. (Of course, this is only possible if the
table contains fewer than 2^32 (4 billion) rows, and in practice the table size had better be much less
than that, or performance might suffer.)
• OIDs should never be assumed to be unique across tables; use the combination of tableoid and
row OID if you need a database-wide identifier.
• Of course, the tables in question must be created WITH OIDS. As of PostgreSQL 8.1, WITHOUT
OIDS is the default.
Transaction identifiers are also 32-bit quantities. In a long-lived database it is possible for transaction
IDs to wrap around. This is not a fatal problem given appropriate maintenance procedures; see Chap-
ter 24 for details. It is unwise, however, to depend on the uniqueness of transaction IDs over the long
term (more than one billion transactions).
Command identifiers are also 32-bit quantities. This creates a hard limit of 2^32 (4 billion) SQL commands within a single transaction. In practice this limit is not a problem — note that the limit is on
the number of SQL commands, not the number of rows processed. Also, only commands that actually
modify the database contents will consume a command identifier.
When a table's definition needs to change after it has been created and filled with data, you do not have to drop and re-create it; PostgreSQL provides a family of commands to modify existing tables. You can:
• Add columns
• Remove columns
• Add constraints
• Remove constraints
• Change default values
• Change column data types
• Rename columns
• Rename tables
All these actions are performed using the ALTER TABLE command, whose reference page contains
details beyond those given here.
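To add a column, use a command like this (continuing the products example):

ALTER TABLE products ADD COLUMN description text;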
The new column is initially filled with whatever default value is given (null if you don't specify a
DEFAULT clause).
Tip
From PostgreSQL 11, adding a column with a constant default value no longer means that
each row of the table needs to be updated when the ALTER TABLE statement is executed.
Instead, the default value will be returned the next time the row is accessed, and applied when
the table is rewritten, making the ALTER TABLE very fast even on large tables.
However, if the default value is volatile (e.g., clock_timestamp()) each row will need
to be updated with the value calculated at the time ALTER TABLE is executed. To avoid a
potentially lengthy update operation, particularly if you intend to fill the column with mostly
nondefault values anyway, it may be preferable to add the column with no default, insert the
correct values using UPDATE, and then add any desired default as described below.
You can also define constraints on the column at the same time, using the usual syntax:
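ALTER TABLE products ADD COLUMN description text CHECK (description <> '');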
In fact all the options that can be applied to a column description in CREATE TABLE can be used here.
Keep in mind however that the default value must satisfy the given constraints, or the ADD will fail.
Alternatively, you can add constraints later (see below) after you've filled in the new column correctly.
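To remove a column, use a command like:

ALTER TABLE products DROP COLUMN description;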
Whatever data was in the column disappears. Table constraints involving the column are dropped, too.
However, if the column is referenced by a foreign key constraint of another table, PostgreSQL will
not silently drop that constraint. You can authorize dropping everything that depends on the column
by adding CASCADE:
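ALTER TABLE products DROP COLUMN description CASCADE;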
See Section 5.13 for a description of the general mechanism behind this.
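To add a constraint, the table constraint syntax is used, for example:

ALTER TABLE products ADD CHECK (name <> '');
ALTER TABLE products ADD CONSTRAINT some_name UNIQUE (product_no);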
To add a not-null constraint, which cannot be written as a table constraint, use this syntax:
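ALTER TABLE products ALTER COLUMN product_no SET NOT NULL;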
The constraint will be checked immediately, so the table data must satisfy the constraint before it can
be added.
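To remove a constraint you need to know its name; then use a command like:

ALTER TABLE products DROP CONSTRAINT some_name;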
(If you are dealing with a generated constraint name like $2, don't forget that you'll need to dou-
ble-quote it to make it a valid identifier.)
As with dropping a column, you need to add CASCADE if you want to drop a constraint that something
else depends on. An example is that a foreign key constraint depends on a unique or primary key
constraint on the referenced column(s).
This works the same for all constraint types except not-null constraints. To drop a not null constraint
use:
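ALTER TABLE products ALTER COLUMN product_no DROP NOT NULL;

To set a new default for a column, use a command like:

ALTER TABLE products ALTER COLUMN price SET DEFAULT 7.77;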
Note that this doesn't affect any existing rows in the table, it just changes the default for future INSERT
commands.
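To remove any default value, use:

ALTER TABLE products ALTER COLUMN price DROP DEFAULT;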
This is effectively the same as setting the default to null. As a consequence, it is not an error to drop
a default where one hadn't been defined, because the default is implicitly the null value.
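To convert a column to a different data type, use a command like:

ALTER TABLE products ALTER COLUMN price TYPE numeric(10,2);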
This will succeed only if each existing entry in the column can be converted to the new type by an
implicit cast. If a more complex conversion is needed, you can add a USING clause that specifies how
to compute the new values from the old.
PostgreSQL will attempt to convert the column's default value (if any) to the new type, as well as
any constraints that involve the column. But these conversions might fail, or might produce surprising
results. It's often best to drop any constraints on the column before altering its type, and then add back
suitably modified constraints afterwards.
5.6. Privileges
When an object is created, it is assigned an owner. The owner is normally the role that executed the
creation statement. For most kinds of objects, the initial state is that only the owner (or a superuser)
can do anything with the object. To allow other roles to use it, privileges must be granted.
There are different kinds of privileges: SELECT, INSERT, UPDATE, DELETE, TRUNCATE, REF-
ERENCES, TRIGGER, CREATE, CONNECT, TEMPORARY, EXECUTE, and USAGE. The privileges
applicable to a particular object vary depending on the object's type (table, function, etc). For com-
plete information on the different types of privileges supported by PostgreSQL, refer to the GRANT
reference page. The following sections and chapters will also show you how those privileges are used.
The right to modify or destroy an object is always the privilege of the owner only.
An object can be assigned to a new owner with an ALTER command of the appropriate kind for the
object, e.g., ALTER TABLE. Superusers can always do this; ordinary roles can only do it if they
are both the current owner of the object (or a member of the owning role) and a member of the new
owning role.
To assign privileges, the GRANT command is used. For example, if joe is an existing role, and ac-
counts is an existing table, the privilege to update the table can be granted with:
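GRANT UPDATE ON accounts TO joe;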
Writing ALL in place of a specific privilege grants all privileges that are relevant for the object type.
The special “role” name PUBLIC can be used to grant a privilege to every role on the system. Also,
“group” roles can be set up to help manage privileges when there are many users of a database —
for details see Chapter 21.
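To revoke a privilege, use the fittingly named REVOKE command:

REVOKE ALL ON accounts FROM PUBLIC;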
The special privileges of the object owner (i.e., the right to do DROP, GRANT, REVOKE, etc.) are
always implicit in being the owner, and cannot be granted or revoked. But the object owner can choose
to revoke their own ordinary privileges, for example to make a table read-only for themselves as well
as others.
Ordinarily, only the object's owner (or a superuser) can grant or revoke privileges on an object. How-
ever, it is possible to grant a privilege “with grant option”, which gives the recipient the right to grant
it in turn to others. If the grant option is subsequently revoked then all who received the privilege from
that recipient (directly or through a chain of grants) will lose the privilege. For details see the GRANT
and REVOKE reference pages.
When row security is enabled on a table (with ALTER TABLE ... ENABLE ROW LEVEL SECURI-
TY), all normal access to the table for selecting rows or modifying rows must be allowed by a row
security policy. (However, the table's owner is typically not subject to row security policies.) If no
policy exists for the table, a default-deny policy is used, meaning that no rows are visible or can be
modified. Operations that apply to the whole table, such as TRUNCATE and REFERENCES, are not
subject to row security.
Row security policies can be specific to commands, or to roles, or to both. A policy can be specified
to apply to ALL commands, or to SELECT, INSERT, UPDATE, or DELETE. Multiple roles can be
assigned to a given policy, and normal role membership and inheritance rules apply.
To specify which rows are visible or modifiable according to a policy, an expression is required that
returns a Boolean result. This expression will be evaluated for each row prior to any conditions or
functions coming from the user's query. (The only exceptions to this rule are leakproof functions,
which are guaranteed to not leak information; the optimizer may choose to apply such functions ahead
of the row-security check.) Rows for which the expression does not return true will not be processed.
Separate expressions may be specified to provide independent control over the rows which are visible
and the rows which are allowed to be modified. Policy expressions are run as part of the query and
with the privileges of the user running the query, although security-definer functions can be used to
access data not available to the calling user.
Superusers and roles with the BYPASSRLS attribute always bypass the row security system when
accessing a table. Table owners normally bypass row security as well, though a table owner can choose
to be subject to row security with ALTER TABLE ... FORCE ROW LEVEL SECURITY.
Enabling and disabling row security, as well as adding policies to a table, is always the privilege of
the table owner only.
Policies are created using the CREATE POLICY command, altered using the ALTER POLICY com-
mand, and dropped using the DROP POLICY command. To enable and disable row security for a
given table, use the ALTER TABLE command.
Each policy has a name and multiple policies can be defined for a table. As policies are table-specific,
each policy for a table must have a unique name. Different tables may have policies with the same
name.
When multiple policies apply to a given query, they are combined using either OR (for permissive
policies, which are the default) or using AND (for restrictive policies). This is similar to the rule that a
given role has the privileges of all roles that they are a member of. Permissive vs. restrictive policies
are discussed further below.
As a simple example, here is how to create a policy on the account relation to allow only members
of the managers role to access rows, and only rows of their accounts:
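CREATE TABLE accounts (manager text, company text, contact_email text);

ALTER TABLE accounts ENABLE ROW LEVEL SECURITY;

CREATE POLICY account_managers ON accounts TO managers
    USING (manager = current_user);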
The policy above implicitly provides a WITH CHECK clause identical to its USING clause, so that
the constraint applies both to rows selected by a command (so a manager cannot SELECT, UPDATE,
or DELETE existing rows belonging to a different manager) and to rows modified by a command (so
rows belonging to a different manager cannot be created via INSERT or UPDATE).
If no role is specified, or the special user name PUBLIC is used, then the policy applies to all users
on the system. To allow all users to access only their own row in a users table, a simple policy
can be used:
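CREATE POLICY user_policy ON users
    USING (user_name = current_user);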
To use a different policy for rows that are being added to the table compared to those rows that are
visible, multiple policies can be combined. This pair of policies would allow all users to view all rows
in the users table, but only modify their own:
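CREATE POLICY user_sel_policy ON users
    FOR SELECT
    USING (true);
CREATE POLICY user_mod_policy ON users
    USING (user_name = current_user);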
In a SELECT command, these two policies are combined using OR, with the net effect being that all
rows can be selected. In other command types, only the second policy applies, so that the effects are
the same as before.
Row security can also be disabled with the ALTER TABLE command. Disabling row security does
not remove any policies that are defined on the table; they are simply ignored. Then all rows in the
table are visible and modifiable, subject to the standard SQL privileges system.
Below is a larger example of how this feature can be used in production environments. The table
passwd emulates a Unix password file:
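An abridged sketch of that table (the full example also includes columns such as home_phone and extra_info, plus roles and column-level grants, which are omitted here):

CREATE TABLE passwd (
    user_name text UNIQUE NOT NULL,
    pwhash    text,
    uid       int  PRIMARY KEY,
    gid       int  NOT NULL,
    real_name text NOT NULL,
    home_dir  text NOT NULL,
    shell     text NOT NULL
);

-- Be sure to enable row level security on the table
ALTER TABLE passwd ENABLE ROW LEVEL SECURITY;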
-- Create policies
-- Administrator can see all rows and add any rows
CREATE POLICY admin_all ON passwd TO admin USING (true) WITH CHECK
(true);
-- Normal users can view all rows
CREATE POLICY all_view ON passwd FOR SELECT USING (true);
-- Normal users can update their own records, but
-- limit which shells a normal user is allowed to set
CREATE POLICY user_mod ON passwd FOR UPDATE
USING (current_user = user_name)
WITH CHECK (
current_user = user_name AND
shell IN ('/bin/bash','/bin/sh','/bin/dash','/bin/zsh','/bin/tcsh')
);
As with any security settings, it's important to test and ensure that the system is behaving as expected. Connecting as each of the roles above and attempting a few SELECTs and UPDATEs demonstrates that the permission system is working properly.
All of the policies constructed thus far have been permissive policies, meaning that when multiple
policies are applied they are combined using the “OR” Boolean operator. While permissive policies
can be constructed to only allow access to rows in the intended cases, it can be simpler to combine
permissive policies with restrictive policies (which the records must pass and which are combined
using the “AND” Boolean operator). Building on the example above, we add a restrictive policy to
require the administrator to be connected over a local Unix socket to access the records of the passwd
table:
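CREATE POLICY admin_local_only ON passwd AS RESTRICTIVE TO admin
    USING (pg_catalog.inet_client_addr() IS NULL);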
We can then see that an administrator connecting over a network will not see any records, due to the restrictive policy.
Referential integrity checks, such as unique or primary key constraints and foreign key references,
always bypass row security to ensure that data integrity is maintained. Care must be taken when de-
veloping schemas and row level policies to avoid “covert channel” leaks of information through such
referential integrity checks.
In some contexts it is important to be sure that row security is not being applied. For example, when
taking a backup, it could be disastrous if row security silently caused some rows to be omitted from
the backup. In such a situation, you can set the row_security configuration parameter to off. This
does not in itself bypass row security; what it does is throw an error if any query's results would get
filtered by a policy. The reason for the error can then be investigated and fixed.
In the examples above, the policy expressions consider only the current values in the row to be ac-
cessed or updated. This is the simplest and best-performing case; when possible, it's best to design row
security applications to work this way. If it is necessary to consult other rows or other tables to make
a policy decision, that can be accomplished using sub-SELECTs, or functions that contain SELECTs,
in the policy expressions. Be aware however that such accesses can create race conditions that could
allow information leakage if care is not taken. As an example, consider the following table design:
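An abridged sketch of that design (grants and sample data are omitted; a row's group_id in information is the minimum privilege level needed to see it):

CREATE TABLE users (
    user_name text PRIMARY KEY,
    group_id  int NOT NULL    -- the user's privilege level
);

CREATE TABLE information (
    info      text,
    group_id  int NOT NULL    -- privilege level required to see the row
);

ALTER TABLE information ENABLE ROW LEVEL SECURITY;

-- a row is visible to, and updatable by, users whose privilege level is at
-- least the row's group_id
CREATE POLICY fp_s ON information FOR SELECT
    USING (group_id <= (SELECT group_id FROM users WHERE user_name = current_user));
CREATE POLICY fp_u ON information FOR UPDATE
    USING (group_id <= (SELECT group_id FROM users WHERE user_name = current_user));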
Now suppose that alice wishes to change the “slightly secret” information, but decides that mal-
lory should not be trusted with the new content of that row, so she does:
BEGIN;
UPDATE users SET group_id = 1 WHERE user_name = 'mallory';
UPDATE information SET info = 'secret from mallory' WHERE group_id
= 2;
COMMIT;
That looks safe; there is no window wherein mallory should be able to see the “secret from mallory”
string. However, there is a race condition here. If mallory is concurrently doing, say,
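SELECT * FROM information WHERE group_id = 2 FOR UPDATE;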
and her transaction is in READ COMMITTED mode, it is possible for her to see “secret from mallory”.
That happens if her transaction reaches the information row just after alice's does. It blocks
waiting for alice's transaction to commit, then fetches the updated row contents thanks to the FOR
UPDATE clause. However, it does not fetch an updated row for the implicit SELECT from users,
because that sub-SELECT did not have FOR UPDATE; instead the users row is read with the snap-
shot taken at the start of the query. Therefore, the policy expression tests the old value of mallory's
privilege level and allows her to see the updated row.
There are several ways around this problem. One simple answer is to use SELECT ... FOR
SHARE in sub-SELECTs in row security policies. However, that requires granting UPDATE privilege
on the referenced table (here users) to the affected users, which might be undesirable. (But another
row security policy could be applied to prevent them from actually exercising that privilege; or the
sub-SELECT could be embedded into a security definer function.) Also, heavy concurrent use of row
share locks on the referenced table could pose a performance problem, especially if updates of it are
frequent. Another solution, practical if updates of the referenced table are infrequent, is to take an
ACCESS EXCLUSIVE lock on the referenced table when updating it, so that no concurrent transac-
tions could be examining old row values. Or one could just wait for all concurrent transactions to end
after committing an update of the referenced table and before making changes that rely on the new
security situation.
5.8. Schemas
A PostgreSQL database cluster contains one or more named databases. Roles and a few other object
types are shared across the entire cluster. A client connection to the server can only access data in a
single database, the one specified in the connection request.
Note
Users of a cluster do not necessarily have the privilege to access every database in the cluster.
Sharing of role names means that there cannot be different roles named, say, joe in two
databases in the same cluster; but the system can be configured to allow joe access to only
some of the databases.
A database contains one or more named schemas, which in turn contain tables. Schemas also contain
other kinds of named objects, including data types, functions, and operators. The same object name
can be used in different schemas without conflict; for example, both schema1 and myschema can
contain tables named mytable. Unlike databases, schemas are not rigidly separated: a user can access
objects in any of the schemas in the database they are connected to, if they have privileges to do so.
There are several reasons why one might want to use schemas:
• To allow many users to use one database without interfering with each other.
• To organize database objects into logical groups to make them more manageable.
• Third-party applications can be put into separate schemas so they do not collide with the names
of other objects.
Schemas are analogous to directories at the operating system level, except that schemas cannot be
nested.
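To create a schema, use the CREATE SCHEMA command, giving the schema a name of your choice; for example:

CREATE SCHEMA myschema;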
To create or access objects in a schema, write a qualified name consisting of the schema name and
table name separated by a dot:
schema.table
This works anywhere a table name is expected, including the table modification commands and the
data access commands discussed in the following chapters. (For brevity we will speak of tables only,
but the same ideas apply to other kinds of named objects, such as types and functions.)
Actually, the even more general syntax

database.schema.table

can be used too, but at present this is just for pro forma compliance with the SQL standard. If you
write a database name, it must be the same as the database you are connected to.
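For example, to create a table in the new schema:

CREATE TABLE myschema.mytable (
    ...
);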
To drop a schema if it's empty (all objects in it have been dropped), use:
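DROP SCHEMA myschema;

To drop a schema including all contained objects, use:

DROP SCHEMA myschema CASCADE;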
See Section 5.13 for a description of the general mechanism behind this.
Often you will want to create a schema owned by someone else (since this is one of the ways to restrict
the activities of your users to well-defined namespaces). The syntax for that is:
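CREATE SCHEMA schema_name AUTHORIZATION user_name;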
You can even omit the schema name, in which case the schema name will be the same as the user
name. See Section 5.8.6 for how this can be useful.
Schema names beginning with pg_ are reserved for system purposes and cannot be created by users.
In the previous sections we created tables without specifying any schema names. By default such
tables (and other objects) are automatically put into a schema named “public”. Every new database
contains such a schema. Thus, the following two commands are equivalent:
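CREATE TABLE products ( ... );

-- and:
CREATE TABLE public.products ( ... );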
The ability to create like-named objects in different schemas complicates writing a query that refer-
ences precisely the same objects every time. It also opens up the potential for users to change the be-
havior of other users' queries, maliciously or accidentally. Due to the prevalence of unqualified names
in queries and their use in PostgreSQL internals, adding a schema to search_path effectively trusts
all users having CREATE privilege on that schema. When you run an ordinary query, a malicious user
able to create objects in a schema of your search path can take control and execute arbitrary SQL
functions as though you executed them.
The first schema named in the search path is called the current schema. Aside from being the first
schema searched, it is also the schema in which new tables will be created if the CREATE TABLE
command does not specify a schema name.
To show the current search path, use the following command:

SHOW search_path;

In the default setup this returns:
search_path
--------------
"$user", public
The first element specifies that a schema with the same name as the current user is to be searched.
If no such schema exists, the entry is ignored. The second element refers to the public schema that
we have seen already.
The first schema in the search path that exists is the default location for creating new objects. That
is the reason that by default objects are created in the public schema. When objects are referenced
in any other context without schema qualification (table modification, data modification, or query
commands) the search path is traversed until a matching object is found. Therefore, in the default
configuration, any unqualified access again can only refer to the public schema.
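To put our new schema in the path, we can use:

SET search_path TO myschema, public;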
(We omit the $user here because we have no immediate need for it.) And then we can access the
table without schema qualification:
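-- the unqualified name now refers to myschema.mytable
DROP TABLE mytable;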
Also, since myschema is the first element in the path, new objects would by default be created in it.
Then we no longer have access to the public schema without explicit qualification. There is nothing
special about the public schema except that it exists by default. It can be dropped, too.
See also Section 9.25 for other ways to manipulate the schema search path.
The search path works in the same way for data type names, function names, and operator names as it
does for table names. Data type and function names can be qualified in exactly the same way as table
names. If you need to write a qualified operator name in an expression, there is a special provision:
you must write
OPERATOR(schema.operator)
This is needed to avoid syntactic ambiguity. An example is:

SELECT 3 OPERATOR(pg_catalog.+) 4;
In practice one usually relies on the search path for operators, so as not to have to write anything so
ugly as that.
A user can also be allowed to create objects in someone else's schema. To allow that, the CREATE
privilege on the schema needs to be granted. Note that by default, everyone has CREATE and USAGE
privileges on the schema public. This allows all users that are able to connect to a given database
to create objects in its public schema. Some usage patterns call for revoking that privilege:
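REVOKE CREATE ON SCHEMA public FROM PUBLIC;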
(The first “public” is the schema, the second “public” means “every user”. In the first sense it is an
identifier, in the second sense it is a key word, hence the different capitalization; recall the guidelines
from Section 4.1.1.)
In addition to public and user-created schemas, each database contains a pg_catalog schema, which
contains the system tables and all the built-in data types, functions, and operators. pg_catalog is always
effectively part of the search path; if it is not named explicitly in the path then it is implicitly searched
before searching the path's schemas. This ensures that built-in names will always be findable. However,
you can explicitly place pg_catalog at the end of your search path if you prefer to have user-defined
names override built-in names.
Since system table names begin with pg_, it is best to avoid such names to ensure that you won't suffer
a conflict if some future version defines a system table named the same as your table. (With the default
search path, an unqualified reference to your table name would then be resolved as the system table
instead.) System tables will continue to follow the convention of having names beginning with pg_,
so that they will not conflict with unqualified user-table names so long as users avoid the pg_ prefix.
Schemas can be used to organize your data in many ways; the options include the following patterns.
• Constrain ordinary users to user-private schemas. To implement this, issue REVOKE CREATE ON
SCHEMA public FROM PUBLIC, and create a schema for each user with the same name as
that user. Recall that the default search path starts with $user, which resolves to the user name.
Therefore, if each user has a separate schema, they access their own schemas by default. After
adopting this pattern in a database where untrusted users had already logged in, consider auditing
the public schema for objects named like objects in schema pg_catalog. This pattern is a secure
schema usage pattern unless an untrusted user is the database owner or holds the CREATEROLE
privilege, in which case no secure schema usage pattern exists.
• Remove the public schema from the default search path, by modifying postgresql.conf or
by issuing ALTER ROLE ALL SET search_path = "$user". Everyone retains the
ability to create objects in the public schema, but only qualified names will choose those objects.
While qualified table references are fine, calls to functions in the public schema will be unsafe or
unreliable. If you create functions or extensions in the public schema, use the first pattern instead.
Otherwise, like the first pattern, this is secure unless an untrusted user is the database owner or
holds the CREATEROLE privilege.
• Keep the default. All users access the public schema implicitly. This simulates the situation where
schemas are not available at all, giving a smooth transition from the non-schema-aware world.
However, this is never a secure pattern. It is acceptable only when the database has a single user
or a few mutually-trusting users.
For any pattern, to install shared applications (tables to be used by everyone, additional functions pro-
vided by third parties, etc.), put them into separate schemas. Remember to grant appropriate privileges
to allow the other users to access them. Users can then refer to these additional objects by qualifying
the names with a schema name, or they can put the additional schemas into their search path, as they
choose.
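For instance, a third-party application could be installed in a schema of its own (the schema name here is illustrative):

CREATE SCHEMA addons;
GRANT USAGE ON SCHEMA addons TO PUBLIC;
GRANT SELECT ON ALL TABLES IN SCHEMA addons TO PUBLIC;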
5.8.7. Portability
In the SQL standard, the notion of objects in the same schema being owned by different users does
not exist. Moreover, some implementations do not allow you to create schemas that have a different
name than their owner. In fact, the concepts of schema and user are nearly equivalent in a database
system that implements only the basic schema support specified in the standard. Therefore, many users
consider qualified names to really consist of user_name.table_name. This is how PostgreSQL
will effectively behave if you create a per-user schema for every user.
Also, there is no concept of a public schema in the SQL standard. For maximum conformance to
the standard, you should not use the public schema.
Of course, some SQL database systems might not implement schemas at all, or provide namespace
support by allowing (possibly limited) cross-database access. If you need to work with those systems,
then maximum portability would be achieved by not using schemas at all.
5.9. Inheritance
PostgreSQL implements table inheritance, which can be a useful tool for database designers.
(SQL:1999 and later define a type inheritance feature, which differs in many respects from the features
described here.)
Let's start with an example: suppose we are trying to build a data model for cities. Each state has many
cities, but only one capital. We want to be able to quickly retrieve the capital city for any particular
state. This can be done by creating two tables, one for state capitals and one for cities that are not
capitals. However, what happens when we want to ask for data about a city, regardless of whether it is
a capital or not? The inheritance feature can help to resolve this problem. We define the capitals
table so that it inherits from cities:
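-- (the column types shown are illustrative)
CREATE TABLE cities (
    name            text,
    population      float,
    elevation       int     -- in feet
);

CREATE TABLE capitals (
    state           char(2)
) INHERITS (cities);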
In this case, the capitals table inherits all the columns of its parent table, cities. State capitals
also have an extra column, state, that shows their state.
In PostgreSQL, a table can inherit from zero or more other tables, and a query can reference either
all rows of a table or all rows of a table plus all of its descendant tables. The latter behavior is the
default. For example, the following query finds the names of all cities, including state capitals, that
are located at an elevation over 500 feet:
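SELECT name, elevation
    FROM cities
    WHERE elevation > 500;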
Given the sample data from the PostgreSQL tutorial (see Section 2.1), this returns:
name | elevation
-----------+-----------
Las Vegas | 2174
Mariposa | 1953
Madison | 845
On the other hand, the following query finds all the cities that are not state capitals and are situated
at an elevation over 500 feet:
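SELECT name, elevation
    FROM ONLY cities
    WHERE elevation > 500;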
   name    | elevation
-----------+-----------
 Las Vegas |      2174
 Mariposa  |      1953
Here the ONLY keyword indicates that the query should apply only to cities, and not any tables
below cities in the inheritance hierarchy. Many of the commands that we have already discussed
— SELECT, UPDATE and DELETE — support the ONLY keyword.
You can also write the table name with a trailing * to explicitly specify that descendant tables are
included:
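SELECT name, elevation
    FROM cities*
    WHERE elevation > 500;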
Writing * is not necessary, since this behavior is always the default. However, this syntax is still
supported for compatibility with older releases where the default could be changed.
In some cases you might wish to know which table a particular row originated from. There is a system
column called tableoid in each table which can tell you the originating table:
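SELECT c.tableoid, c.name, c.elevation
    FROM cities c
    WHERE c.elevation > 500;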
which returns each row's numeric tableoid along with its name and elevation.
(If you try to reproduce this example, you will probably get different numeric OIDs.) By doing a join
with pg_class you can see the actual table names:
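SELECT p.relname, c.name, c.elevation
    FROM cities c, pg_class p
    WHERE c.elevation > 500 AND c.tableoid = p.oid;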
which returns the name of each row's originating table (cities or capitals) instead of a numeric OID.
Another way to get the same effect is to use the regclass alias type, which will print the table OID
symbolically:
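SELECT c.tableoid::regclass, c.name, c.elevation
    FROM cities c
    WHERE c.elevation > 500;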
Inheritance does not automatically propagate data from INSERT or COPY commands to other tables
in the inheritance hierarchy. In our example, the following INSERT statement will fail:
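INSERT INTO cities (name, population, elevation, state)
    VALUES ('Albany', NULL, NULL, 'NY');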
We might hope that the data would somehow be routed to the capitals table, but this does not
happen: INSERT always inserts into exactly the table specified. In some cases it is possible to redirect
the insertion using a rule (see Chapter 41). However that does not help for the above case because
the cities table does not contain the column state, and so the command will be rejected before
the rule can be applied.
All check constraints and not-null constraints on a parent table are automatically inherited by its chil-
dren, unless explicitly specified otherwise with NO INHERIT clauses. Other types of constraints
(unique, primary key, and foreign key constraints) are not inherited.
A table can inherit from more than one parent table, in which case it has the union of the columns
defined by the parent tables. Any columns declared in the child table's definition are added to these.
If the same column name appears in multiple parent tables, or in both a parent table and the child's
definition, then these columns are “merged” so that there is only one such column in the child table. To
be merged, columns must have the same data types, else an error is raised. Inheritable check constraints
and not-null constraints are merged in a similar fashion. Thus, for example, a merged column will be
marked not-null if any one of the column definitions it came from is marked not-null. Check constraints
are merged if they have the same name, and the merge will fail if their conditions are different.
Table inheritance is typically established when the child table is created, using the INHERITS clause
of the CREATE TABLE statement. Alternatively, a table which is already defined in a compatible
way can have a new parent relationship added, using the INHERIT variant of ALTER TABLE. To do
this the new child table must already include columns with the same names and types as the columns
of the parent. It must also include check constraints with the same names and check expressions as
those of the parent. Similarly an inheritance link can be removed from a child using the NO INHERIT
variant of ALTER TABLE. Dynamically adding and removing inheritance links like this can be useful
when the inheritance relationship is being used for table partitioning (see Section 5.10).
One convenient way to create a compatible table that will later be made a new child is to use the
LIKE clause in CREATE TABLE. This creates a new table with the same columns as the source table.
If there are any CHECK constraints defined on the source table, the INCLUDING CONSTRAINTS
option to LIKE should be specified, as the new child must have constraints matching the parent to
be considered compatible.
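For example (the new table's name is illustrative):

CREATE TABLE cities_new (LIKE cities INCLUDING CONSTRAINTS);
ALTER TABLE cities_new INHERIT cities;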
A parent table cannot be dropped while any of its children remain. Neither can columns or check
constraints of child tables be dropped or altered if they are inherited from any parent tables. If you
wish to remove a table and all of its descendants, one easy way is to drop the parent table with the
CASCADE option (see Section 5.13).
ALTER TABLE will propagate any changes in column data definitions and check constraints down
the inheritance hierarchy. Again, dropping columns that are depended on by other tables is only pos-
sible when using the CASCADE option. ALTER TABLE follows the same rules for duplicate column
merging and rejection that apply during CREATE TABLE.
Inherited queries perform access permission checks on the parent table only. Thus, for example, grant-
ing UPDATE permission on the cities table implies permission to update rows in the capitals
table as well, when they are accessed through cities. This preserves the appearance that the data is
(also) in the parent table. But the capitals table could not be updated directly without an additional
grant. Two exceptions to this rule are TRUNCATE and LOCK TABLE, where permissions on the child
tables are always checked, whether they are processed directly or recursively via those commands
performed on the parent table.
In a similar way, the parent table's row security policies (see Section 5.7) are applied to rows coming
from child tables during an inherited query. A child table's policies, if any, are applied only when it
is the table explicitly named in the query; and in that case, any policies attached to its parent(s) are
ignored.
Foreign tables (see Section 5.11) can also be part of inheritance hierarchies, either as parent or child
tables, just as regular tables can be. If a foreign table is part of an inheritance hierarchy then any
operations not supported by the foreign table are not supported on the whole hierarchy either.
5.9.1. Caveats
Note that not all SQL commands are able to work on inheritance hierarchies. Commands that are used
for data querying, data modification, or schema modification (e.g., SELECT, UPDATE, DELETE, most
variants of ALTER TABLE, but not INSERT or ALTER TABLE ... RENAME) typically default
to including child tables and support the ONLY notation to exclude them. Commands that do database
maintenance and tuning (e.g., REINDEX, VACUUM) typically only work on individual, physical tables
and do not support recursing over inheritance hierarchies. The respective behavior of each individual
command is documented in its reference page (SQL Commands).
A serious limitation of the inheritance feature is that indexes (including unique constraints) and foreign
key constraints only apply to single tables, not to their inheritance children. This is true on both the
referencing and referenced sides of a foreign key constraint. Thus, in the terms of the above example:
• If we declared cities.name to be UNIQUE or a PRIMARY KEY, this would not stop the cap-
itals table from having rows with names duplicating rows in cities. And those duplicate rows
would by default show up in queries from cities. In fact, by default capitals would have no
unique constraint at all, and so could contain multiple rows with the same name. You could add a
unique constraint to capitals, but this would not prevent duplication compared to cities.
• Similarly, if we were to specify that cities.name REFERENCES some other table, this constraint
would not automatically propagate to capitals. In this case you could work around it by manually
adding the same REFERENCES constraint to capitals.
• Specifying that another table's column REFERENCES cities(name) would allow the other
table to contain city names, but not capital names. There is no good workaround for this case.
Some functionality not implemented for inheritance hierarchies is implemented for declarative parti-
tioning. Considerable care is needed in deciding whether partitioning with legacy inheritance is useful
for your application.
5.10.1. Overview
Partitioning refers to splitting what is logically one large table into smaller physical pieces. Partitioning
can provide several benefits:
• Query performance can be improved dramatically in certain situations, particularly when most of
the heavily accessed rows of the table are in a single partition or a small number of partitions.
Partitioning effectively substitutes for the upper tree levels of indexes, making it more likely that
the heavily-used parts of the indexes fit in memory.
• When queries or updates access a large percentage of a single partition, performance can be im-
proved by using a sequential scan of that partition instead of using an index, which would require
random-access reads scattered across the whole table.
• Bulk loads and deletes can be accomplished by adding or removing partitions, if the usage pattern is
accounted for in the partitioning design. Dropping an individual partition using DROP TABLE, or
doing ALTER TABLE DETACH PARTITION, is far faster than a bulk operation. These commands
also entirely avoid the VACUUM overhead caused by a bulk DELETE.
These benefits will normally be worthwhile only when a table would otherwise be very large. The
exact point at which a table will benefit from partitioning depends on the application, although a rule
of thumb is that the size of the table should exceed the physical memory of the database server.
PostgreSQL offers built-in support for the following forms of partitioning:

Range Partitioning
The table is partitioned into “ranges” defined by a key column or set of columns, with no overlap
between the ranges of values assigned to different partitions. For example, one might partition by
date ranges, or by ranges of identifiers for particular business objects. Each range's bounds are
understood as being inclusive at the lower end and exclusive at the upper end. For example, if
one partition's range is from 1 to 10, and the next one's range is from 10 to 20, then value 10
belongs to the second partition not the first.
List Partitioning
The table is partitioned by explicitly listing which key value(s) appear in each partition.
Hash Partitioning
The table is partitioned by specifying a modulus and a remainder for each partition. Each partition
will hold the rows for which the hash value of the partition key divided by the specified modulus
will produce the specified remainder.
If your application needs to use other forms of partitioning not listed above, alternative methods such
as inheritance and UNION ALL views can be used instead. Such methods offer flexibility but do not
have some of the performance benefits of built-in declarative partitioning.
The partitioned table itself is a “virtual” table having no storage of its own. Instead, the storage belongs
to partitions, which are otherwise-ordinary tables associated with the partitioned table. Each partition
stores a subset of the data as defined by its partition bounds. All rows inserted into a partitioned table
will be routed to the appropriate one of the partitions based on the values of the partition key column(s).
Updating the partition key of a row will cause it to be moved into a different partition if it no longer
satisfies the partition bounds of its original partition.
It is not possible to turn a regular table into a partitioned table or vice versa. However, it is possible to
add an existing regular or partitioned table as a partition of a partitioned table, or remove a partition
from a partitioned table turning it into a standalone table; this can simplify and speed up many main-
tenance processes. See ALTER TABLE to learn more about the ATTACH PARTITION and DETACH
PARTITION sub-commands.
Partitions can also be foreign tables, although considerable care is needed because it is then the user's
responsibility that the contents of the foreign table satisfy the partitioning rule. There are some other
restrictions as well. See CREATE FOREIGN TABLE for more information.
5.10.2.1. Example
Suppose we are constructing a database for a large ice cream company. The company measures peak
temperatures every day as well as ice cream sales in each region. Conceptually, we want a table like:
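CREATE TABLE measurement (
    city_id         int not null,
    logdate         date not null,
    peaktemp        int,
    unitsales       int
);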
We know that most queries will access just the last week's, month's or quarter's data, since the main
use of this table will be to prepare online reports for management. To reduce the amount of old data
that needs to be stored, we decide to keep only the most recent 3 years worth of data. At the beginning
of each month we will remove the oldest month's data. In this situation we can use partitioning to help
us meet all of our different requirements for the measurements table.
1. Create the measurement table as a partitioned table by specifying the PARTITION BY clause,
which includes the partitioning method (RANGE in this case) and the list of column(s) to use as
the partition key.
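For example:

CREATE TABLE measurement (
    city_id         int not null,
    logdate         date not null,
    peaktemp        int,
    unitsales       int
) PARTITION BY RANGE (logdate);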
2. Create partitions. Each partition's definition must specify bounds that correspond to the partitioning
method and partition key of the parent. Partitions thus created are in every way normal PostgreSQL
tables (or, possibly, foreign tables). It is possible to specify a tablespace and storage parameters for
each partition separately.
For our example, each partition should hold one month's worth of data, to match the requirement
of deleting one month's data at a time. So the commands might look like:
...
CREATE TABLE measurement_y2007m11 PARTITION OF measurement
    FOR VALUES FROM ('2007-11-01') TO ('2007-12-01');

-- the newest partition additionally sets a storage parameter and a
-- tablespace (its name and bounds here are reconstructed illustratively)
CREATE TABLE measurement_y2008m01 PARTITION OF measurement
    FOR VALUES FROM ('2008-01-01') TO ('2008-02-01')
    WITH (parallel_workers = 4)
    TABLESPACE fasttablespace;
(Recall that adjacent partitions can share a bound value, since range upper bounds are treated as
exclusive bounds.)
If you wish to implement sub-partitioning, again specify the PARTITION BY clause in the com-
mands used to create individual partitions, for example:
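CREATE TABLE measurement_y2006m02 PARTITION OF measurement
    FOR VALUES FROM ('2006-02-01') TO ('2006-03-01')
    PARTITION BY RANGE (peaktemp);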
Inserting data into the parent table that does not map to one of the existing partitions will cause an
error; an appropriate partition must be added manually.
It is not necessary to manually create table constraints describing the partition boundary conditions
for partitions. Such constraints will be created automatically.
3. Create an index on the key column(s), as well as any other indexes you might want, on the par-
titioned table. (The key index is not strictly necessary, but in most scenarios it is helpful.) This
automatically creates a matching index on each partition, and any partitions you create or attach
later will also have such an index. An index or unique constraint declared on a partitioned table
is “virtual” in the same way that the partitioned table is: the actual data is in child indexes on the
individual partition tables.
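For example:

CREATE INDEX ON measurement (logdate);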
In the above example we would be creating a new partition each month, so it might be wise to write
a script that generates the required DDL automatically.
The simplest option for removing old data is to drop the partition that is no longer necessary:
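DROP TABLE measurement_y2006m02;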
This can very quickly delete millions of records because it doesn't have to individually delete every
record. Note however that the above command requires taking an ACCESS EXCLUSIVE lock on
the parent table.
Another option that is often preferable is to remove the partition from the partitioned table but retain
access to it as a table in its own right:
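ALTER TABLE measurement DETACH PARTITION measurement_y2006m02;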
This allows further operations to be performed on the data before it is dropped. For example, this is
often a useful time to back up the data using COPY, pg_dump, or similar tools. It might also be a useful
time to aggregate data into smaller formats, perform other data manipulations, or run reports.
Similarly we can add a new partition to handle new data. We can create an empty partition in the
partitioned table just as the original partitions were created above:
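CREATE TABLE measurement_y2008m02 PARTITION OF measurement
    FOR VALUES FROM ('2008-02-01') TO ('2008-03-01');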
As an alternative, it is sometimes more convenient to create the new table outside the partition struc-
ture, and make it a proper partition later. This allows new data to be loaded, checked, and transformed
prior to it appearing in the partitioned table. The CREATE TABLE ... LIKE option is helpful to
avoid tediously repeating the parent table's definition:
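CREATE TABLE measurement_y2008m02
    (LIKE measurement INCLUDING DEFAULTS INCLUDING CONSTRAINTS);

ALTER TABLE measurement_y2008m02 ADD CONSTRAINT y2008m02
    CHECK ( logdate >= DATE '2008-02-01' AND logdate < DATE '2008-03-01' );

-- load, check, and transform the data here, then attach it:
ALTER TABLE measurement ATTACH PARTITION measurement_y2008m02
    FOR VALUES FROM ('2008-02-01') TO ('2008-03-01');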
Before running the ATTACH PARTITION command, it is recommended to create a CHECK constraint
on the table to be attached that matches the expected partition constraint, as illustrated above. That
way, the system will be able to skip the scan which is otherwise needed to validate the implicit partition
constraint. Without the CHECK constraint, the table will be scanned to validate the partition constraint
while holding an ACCESS EXCLUSIVE lock on the parent table. It is recommended to drop the now-
redundant CHECK constraint after ATTACH PARTITION is finished.
As explained above, it is possible to create indexes on partitioned tables so that they are applied au-
tomatically to the entire hierarchy. This is very convenient, as not only will the existing partitions
become indexed, but also any partitions that are created in the future will. One limitation is that it's
not possible to use the CONCURRENTLY qualifier when creating such a partitioned index. To avoid
long lock times, it is possible to use CREATE INDEX ON ONLY the partitioned table; such an index
is marked invalid, and the partitions do not get the index applied automatically. The indexes on parti-
tions can be created individually using CONCURRENTLY, and then attached to the index on the parent
using ALTER INDEX .. ATTACH PARTITION. Once indexes for all partitions are attached to
the parent index, the parent index is marked valid automatically. Example:
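-- (the index names here are illustrative)
CREATE INDEX measurement_usls_idx ON ONLY measurement (unitsales);

CREATE INDEX CONCURRENTLY measurement_usls_200602_idx
    ON measurement_y2006m02 (unitsales);
ALTER INDEX measurement_usls_idx
    ATTACH PARTITION measurement_usls_200602_idx;

-- repeat the two commands above for each remaining partition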
This technique can be used with UNIQUE and PRIMARY KEY constraints too; the indexes are created
implicitly when the constraint is created. Example:
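-- (the attached index names below are the system-generated defaults,
-- shown for illustration)
ALTER TABLE ONLY measurement ADD UNIQUE (city_id, logdate);

ALTER TABLE measurement_y2006m02 ADD UNIQUE (city_id, logdate);
ALTER INDEX measurement_city_id_logdate_key
    ATTACH PARTITION measurement_y2006m02_city_id_logdate_key;

-- repeat for each remaining partition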
5.10.2.3. Limitations
The following limitations apply to partitioned tables:
• Unique constraints (and hence primary keys) on partitioned tables must include all the partition key
columns. This limitation exists because the individual indexes making up the constraint can only
directly enforce uniqueness within their own partitions; therefore, the partition structure itself must
guarantee that there are not duplicates in different partitions.
• There is no way to create an exclusion constraint spanning the whole partitioned table. It is only
possible to put such a constraint on each leaf partition individually. Again, this limitation stems
from not being able to enforce cross-partition restrictions.
• While primary keys are supported on partitioned tables, foreign keys referencing partitioned tables
are not supported. (Foreign key references from a partitioned table to some other table are support-
ed.)
• BEFORE ROW triggers, if necessary, must be defined on individual partitions, not the partitioned
table.
• Mixing temporary and permanent relations in the same partition tree is not allowed. Hence, if the
partitioned table is permanent, so must be its partitions and likewise if the partitioned table is tem-
porary. When using temporary relations, all members of the partition tree have to be from the same
session.
Individual partitions are linked to their partitioned table using inheritance behind-the-scenes. However,
it is not possible to use all of the generic features of inheritance with declaratively partitioned tables
or their partitions, as discussed below. Notably, a partition cannot have any parents other than the
partitioned table it is a partition of, nor can a table inherit from both a partitioned table and a regular
table. That means partitioned tables and their partitions never share an inheritance hierarchy with
regular tables.
Since a partition hierarchy consisting of the partitioned table and its partitions is still an inheritance
hierarchy, tableoid and all the normal rules of inheritance apply as described in Section 5.9, with
a few exceptions:
• Partitions cannot have columns that are not present in the parent. It is not possible to specify columns
when creating partitions with CREATE TABLE, nor is it possible to add columns to partitions
after-the-fact using ALTER TABLE. Tables may be added as a partition with ALTER TABLE ...
ATTACH PARTITION only if their columns exactly match the parent, including any oid column.
• Both CHECK and NOT NULL constraints of a partitioned table are always inherited by all its parti-
tions. CHECK constraints that are marked NO INHERIT are not allowed to be created on partitioned
tables. You cannot drop a NOT NULL constraint on a partition's column if the same constraint is
present in the parent table.
• Using ONLY to add or drop a constraint on only the partitioned table is supported as long as there
are no partitions. Once partitions exist, using ONLY will result in an error. Instead, constraints on
the partitions themselves can be added and (if they are not present in the parent table) dropped.
• As a partitioned table does not have any data itself, attempts to use TRUNCATE ONLY on a parti-
tioned table will always return an error.
• For declarative partitioning, partitions must have exactly the same set of columns as the partitioned
table, whereas with table inheritance, child tables may have extra columns not present in the parent.
• Declarative partitioning only supports range, list and hash partitioning, whereas table inheritance
allows data to be divided in a manner of the user's choosing. (Note, however, that if constraint
exclusion is unable to prune child tables effectively, query performance might be poor.)
• Some operations require a stronger lock when using declarative partitioning than when using table
inheritance. For example, adding or removing a partition to or from a partitioned table requires tak-
ing an ACCESS EXCLUSIVE lock on the parent table, whereas a SHARE UPDATE EXCLUSIVE
lock is enough in the case of regular inheritance.
5.10.3.1. Example
This example builds a partitioning structure equivalent to the declarative partitioning example above.
Use the following steps:
1. Create the “master” table, from which all of the “child” tables will inherit. This table will contain
no data. Do not define any check constraints on this table, unless you intend them to be applied
equally to all child tables. There is no point in defining any indexes or unique constraints on it,
either. For our example, the master table is the measurement table as originally defined:
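CREATE TABLE measurement (
    city_id         int not null,
    logdate         date not null,
    peaktemp        int,
    unitsales       int
);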
2. Create several “child” tables that each inherit from the master table. Normally, these tables will not
add any columns to the set inherited from the master.
3. Add non-overlapping table constraints to the child tables to define the allowed key values in each.
Typical examples would be:

CHECK ( x = 1 )
CHECK ( county IN ( 'Oxfordshire', 'Buckinghamshire', 'Warwickshire' ))
CHECK ( outletID >= 100 AND outletID < 200 )
Ensure that the constraints guarantee that there is no overlap between the key values permitted in
different child tables. A common mistake is to set up range constraints like:
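CHECK ( outletID BETWEEN 100 AND 200 )
CHECK ( outletID BETWEEN 200 AND 300 )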
This is wrong since it is not clear which child table the key value 200 belongs in. Instead, ranges
should be defined in this style:
...
CREATE TABLE measurement_y2007m11 (
    CHECK ( logdate >= DATE '2007-11-01' AND logdate < DATE '2007-12-01' )
) INHERITS (measurement);
4. For each child table, create an index on the key column(s), as well as any other indexes you might
want.
5. We want our application to be able to say INSERT INTO measurement ... and have the data
be redirected into the appropriate child table. We can arrange that by attaching a suitable trigger
function to the master table. If data will be added only to the latest child, we can use a very simple
trigger function:

CREATE OR REPLACE FUNCTION measurement_insert_trigger()
RETURNS TRIGGER AS $$
BEGIN
    INSERT INTO measurement_y2008m01 VALUES (NEW.*);
    RETURN NULL;
END;
$$
LANGUAGE plpgsql;
After creating the function, we create a trigger which calls the trigger function:
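CREATE TRIGGER insert_measurement_trigger
    BEFORE INSERT ON measurement
    FOR EACH ROW EXECUTE FUNCTION measurement_insert_trigger();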
We must redefine the trigger function each month so that it always inserts into the current child
table. The trigger definition does not need to be updated, however.
We might want to insert data and have the server automatically locate the child table into which
the row should be added. We could do this with a more complex trigger function, for example:
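-- a sketch: one branch per child table; each test must exactly match that
-- child's CHECK constraint (add further ELSIF branches as needed)
CREATE OR REPLACE FUNCTION measurement_insert_trigger()
RETURNS TRIGGER AS $$
BEGIN
    IF ( NEW.logdate >= DATE '2006-02-01' AND
         NEW.logdate < DATE '2006-03-01' ) THEN
        INSERT INTO measurement_y2006m02 VALUES (NEW.*);
    ELSIF ( NEW.logdate >= DATE '2008-01-01' AND
            NEW.logdate < DATE '2008-02-01' ) THEN
        INSERT INTO measurement_y2008m01 VALUES (NEW.*);
    ELSE
        RAISE EXCEPTION 'Date out of range. Fix the measurement_insert_trigger() function!';
    END IF;
    RETURN NULL;
END;
$$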
LANGUAGE plpgsql;
The trigger definition is the same as before. Note that each IF test must exactly match the CHECK
constraint for its child table.
While this function is more complex than the single-month case, it doesn't need to be updated as
often, since branches can be added in advance of being needed.
Note
In practice, it might be best to check the newest child first, if most inserts go into that child.
For simplicity, we have shown the trigger's tests in the same order as in other parts of this
example.
A different approach to redirecting inserts into the appropriate child table is to set up rules, instead
of a trigger, on the master table. For example:
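-- one such rule is needed for each child table; the rule name is illustrative
CREATE RULE measurement_insert_y2006m02 AS
ON INSERT TO measurement WHERE
    ( logdate >= DATE '2006-02-01' AND logdate < DATE '2006-03-01' )
DO INSTEAD
    INSERT INTO measurement_y2006m02 VALUES (NEW.*);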
A rule has significantly more overhead than a trigger, but the overhead is paid once per query rather
than once per row, so this method might be advantageous for bulk-insert situations. In most cases,
however, the trigger method will offer better performance.
Be aware that COPY ignores rules. If you want to use COPY to insert data, you'll need to copy into
the correct child table rather than directly into the master. COPY does fire triggers, so you can use
it normally if you use the trigger approach.
Another disadvantage of the rule approach is that there is no simple way to force an error if the set
of rules doesn't cover the insertion date; the data will silently go into the master table instead.
6. Ensure that the constraint_exclusion configuration parameter is not disabled in post-
gresql.conf; otherwise child tables may be accessed unnecessarily.
As we can see, a complex table hierarchy could require a substantial amount of DDL. In the above
example we would be creating a new child table each month, so it might be wise to write a script that
generates the required DDL automatically.
To remove the child table from the inheritance hierarchy but retain access to it as a table in its
own right:
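ALTER TABLE measurement_y2006m02 NO INHERIT measurement;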
To add a new child table to handle new data, create an empty child table just as the original children
were created above:
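CREATE TABLE measurement_y2008m02 (
    CHECK ( logdate >= DATE '2008-02-01' AND logdate < DATE '2008-03-01' )
) INHERITS (measurement);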
Alternatively, one may want to create and populate the new child table before adding it to the table
hierarchy. This could allow data to be loaded, checked, and transformed before being made visible
to queries on the parent table.
5.10.3.3. Caveats
The following caveats apply to partitioning implemented using inheritance:
• There is no automatic way to verify that all of the CHECK constraints are mutually exclusive. It is
safer to create code that generates child tables and creates and/or modifies associated objects than
to write each by hand.
• The schemes shown here assume that the values of a row's key column(s) never change, or at least do
not change enough to require it to move to another partition. An UPDATE that attempts to do that will
fail because of the CHECK constraints. If you need to handle such cases, you can put suitable update
triggers on the child tables, but it makes management of the structure much more complicated.
• If you are using manual VACUUM or ANALYZE commands, don't forget that you need to run them
on each child table individually. A command like:

ANALYZE measurement;

will only process the master table.
• INSERT statements with ON CONFLICT clauses are unlikely to work as expected, as the ON
CONFLICT action is only taken in case of unique violations on the specified target relation, not
its child relations.
• Triggers or rules will be needed to route rows to the desired child table, unless the application is
explicitly aware of the partitioning scheme. Triggers may be complicated to write, and will be much
slower than the tuple routing performed internally by declarative partitioning.
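Partition pruning is a query optimization technique that improves performance for declaratively partitioned tables. As an example, consider a query such as the following on the measurement table (the exact query shown is illustrative):

SET enable_partition_pruning = on;            -- the default
SELECT count(*) FROM measurement WHERE logdate >= DATE '2008-01-01';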
Without partition pruning, the above query would scan each of the partitions of the measurement
table. With partition pruning enabled, the planner will examine the definition of each partition and
prove that the partition need not be scanned because it could not contain any rows meeting the query's
WHERE clause. When the planner can prove this, it excludes (prunes) the partition from the query plan.
By using the EXPLAIN command and the enable_partition_pruning configuration parameter, it's pos-
sible to show the difference between a plan for which partitions have been pruned and one for which
they have not. A typical unoptimized plan for this type of table setup is an Append over scans of every
partition of measurement. Some or all of the partitions might use index scans instead of full-table
sequential scans, but the point here is that there is no need to scan the older partitions at all to answer
this query. When we enable partition pruning, we get a significantly cheaper plan, reading only the
partitions whose bounds can match the WHERE clause, that will deliver the same answer.
Note that partition pruning is driven only by the constraints defined implicitly by the partition keys,
not by the presence of indexes. Therefore it isn't necessary to define indexes on the key columns.
Whether an index needs to be created for a given partition depends on whether you expect that queries
that scan the partition will generally scan a large part of the partition or just a small part. An index
will be helpful in the latter case but not the former.
Partition pruning can be performed not only during the planning of a given query, but also during its
execution. This is useful as it can allow more partitions to be pruned when clauses contain expressions
whose values are not known at query planning time; for example, parameters defined in a PREPARE
statement, using a value obtained from a subquery or using a parameterized value on the inner side of
a nested loop join. Partition pruning during execution can be performed at any of the following times:
• During initialization of the query plan. Partition pruning can be performed here for parameter values
which are known during the initialization phase of execution. Partitions which are pruned during
this stage will not show up in the query's EXPLAIN or EXPLAIN ANALYZE. It is possible to de-
termine the number of partitions which were removed during this phase by observing the “Subplans
Removed” property in the EXPLAIN output.
• During actual execution of the query plan. Partition pruning may also be performed here to remove
partitions using values which are only known during actual query execution. This includes values
from subqueries and values from execution-time parameters such as those from parameterized nest-
ed loop joins. Since the value of these parameters may change many times during the execution of
the query, partition pruning is performed whenever one of the execution parameters being used by
partition pruning changes. Determining if partitions were pruned during this phase requires careful
inspection of the loops property in the EXPLAIN ANALYZE output. Subplans corresponding to
different partitions may have different values for it depending on how many times each of them
was pruned during execution. Some may be shown as (never executed) if they were pruned
every time.
Note
Execution-time partition pruning currently only occurs for the Append node type, not for
MergeAppend or ModifyTable nodes. That is likely to be changed in a future release of
PostgreSQL.
Constraint exclusion works in a very similar way to partition pruning, except that it uses each table's
CHECK constraints — which gives it its name — whereas partition pruning uses the table's partition
bounds, which exist only in the case of declarative partitioning. Another difference is that constraint
exclusion is only applied at plan time; there is no attempt to remove partitions at execution time.
The fact that constraint exclusion uses CHECK constraints, which makes it slow compared to partition
pruning, can sometimes be used as an advantage: because constraints can be defined even on declar-
atively-partitioned tables, in addition to their internal partition bounds, constraint exclusion may be
able to elide additional partitions from the query plan.
The default (and recommended) setting of constraint_exclusion is neither on nor off, but an inter-
mediate setting called partition, which causes the technique to be applied only to queries that are
likely to be working on inheritance partitioned tables. The on setting causes the planner to examine
CHECK constraints in all queries, even simple ones that are unlikely to benefit.
The following caveats apply to constraint exclusion:
• Constraint exclusion is only applied during query planning, unlike partition pruning, which can also
be applied during query execution.
• Constraint exclusion only works when the query's WHERE clause contains constants (or externally
supplied parameters). For example, a comparison against a non-immutable function such as CUR-
RENT_TIMESTAMP cannot be optimized, since the planner cannot know which child table the
function's value might fall into at run time.
• Keep the partitioning constraints simple, else the planner may not be able to prove that child tables
might not need to be visited. Use simple equality conditions for list partitioning, or simple range
tests for range partitioning, as illustrated in the preceding examples. A good rule of thumb is that
partitioning constraints should contain only comparisons of the partitioning column(s) to constants
using B-tree-indexable operators, because only B-tree-indexable column(s) are allowed in the par-
tition key.
• All constraints on all children of the parent table are examined during constraint exclusion, so large
numbers of children are likely to increase query planning time considerably. So the legacy inheri-
tance based partitioning will work well with up to perhaps a hundred child tables; don't try to use
many thousands of children.
One of the most critical design decisions will be the column or columns by which you partition your
data. Often the best choice will be to partition by the column or set of columns which most commonly
appear in WHERE clauses of queries being executed on the partitioned table. WHERE clauses that are
compatible with the partition bound constraints can be used to prune unneeded partitions. However,
you may be forced into making other decisions by requirements for the PRIMARY KEY or a UNIQUE
constraint. Removal of unwanted data is also a factor to consider when planning your partitioning
strategy. An entire partition can be detached fairly quickly, so it may be beneficial to design the par-
tition strategy in such a way that all data to be removed at once is located in a single partition.
Choosing the target number of partitions that the table should be divided into is also a critical decision
to make. Not having enough partitions may mean that indexes remain too large and that data locality
remains poor which could result in low cache hit ratios. However, dividing the table into too many
partitions can also cause issues. Too many partitions can mean longer query planning times and higher
memory consumption during both query planning and execution, as further described below. When
choosing how to partition your table, it's also important to consider what changes may occur in the
future. For example, if you choose to have one partition per customer and you currently have a small
number of large customers, consider the implications if in several years you instead find yourself with
a large number of small customers. In this case, it may be better to choose to partition by HASH and
choose a reasonable number of partitions rather than trying to partition by LIST and hoping that the
number of customers does not increase beyond what it is practical to partition the data by.
Sub-partitioning can be useful to further divide partitions that are expected to become larger than other
partitions. Another option is to use range partitioning with multiple columns in the partition key. Either
of these can easily lead to excessive numbers of partitions, so restraint is advisable.
It is important to consider the overhead of partitioning during query planning and execution. The
query planner is generally able to handle partition hierarchies with up to a few hundred partitions
fairly well, provided that typical queries allow the query planner to prune all but a small number of
partitions. Planning times become longer and memory consumption becomes higher as more partitions
are added. This is particularly true for the UPDATE and DELETE commands. Another reason to be
concerned about having a large number of partitions is that the server's memory consumption may
grow significantly over time, especially if many sessions touch large numbers of partitions. That's
because each partition requires its metadata to be loaded into the local memory of each session that
touches it.
With data warehouse type workloads, it can make sense to use a larger number of partitions than with
an OLTP type workload. Generally, in data warehouses, query planning time is less of a concern as
the majority of processing time is spent during query execution. With either of these two types of
workload, it is important to make the right decisions early, as re-partitioning large quantities of data
can be painfully slow. Simulations of the intended workload are often beneficial for optimizing the
partitioning strategy. Never just assume that more partitions are better than fewer partitions, nor vice-
versa.
Foreign data is accessed with help from a foreign data wrapper. A foreign data wrapper is a library
that can communicate with an external data source, hiding the details of connecting to the data source
and obtaining data from it. There are some foreign data wrappers available as contrib modules; see
Appendix F. Other kinds of foreign data wrappers might be found as third party products. If none of
the existing foreign data wrappers suit your needs, you can write your own; see Chapter 57.
To access foreign data, you need to create a foreign server object, which defines how to connect to
a particular external data source according to the set of options used by its supporting foreign data
wrapper. Then you need to create one or more foreign tables, which define the structure of the remote
data. A foreign table can be used in queries just like a normal table, but a foreign table has no storage
in the PostgreSQL server. Whenever it is used, PostgreSQL asks the foreign data wrapper to fetch data
from the external source, or transmit data to the external source in the case of update commands.
Accessing remote data may require authenticating to the external data source. This information can
be provided by a user mapping, which can provide additional data such as user names and passwords
based on the current PostgreSQL role.
For additional information, see CREATE FOREIGN DATA WRAPPER, CREATE SERVER, CRE-
ATE USER MAPPING, CREATE FOREIGN TABLE, and IMPORT FOREIGN SCHEMA.
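For instance, a remote PostgreSQL database could be accessed via the postgres_fdw contrib module along these lines (the server, mapping, and table names are illustrative):

CREATE EXTENSION postgres_fdw;

CREATE SERVER remote_server
    FOREIGN DATA WRAPPER postgres_fdw
    OPTIONS (host 'remote.example.com', dbname 'otherdb');

CREATE USER MAPPING FOR CURRENT_USER
    SERVER remote_server
    OPTIONS (user 'remote_user', password 'secret');

CREATE FOREIGN TABLE remote_accounts (
    id   integer,
    name text
) SERVER remote_server;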
Tables are not the only kind of object that can be created in a database. Many other kinds of objects can
be created to make the use and management of the data more efficient or convenient, among them:
• Views
• Functions and operators
• Data types and domains
• Triggers and rewrite rules
To ensure the integrity of the entire database structure, PostgreSQL makes sure that you cannot drop
objects that other objects still depend on. For example, attempting to drop the products table we con-
sidered in Section 5.3.5, with the orders table depending on it, would result in an error message like
this:
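DROP TABLE products;

ERROR:  cannot drop table products because other objects depend on it
DETAIL:  constraint orders_product_no_fkey on table orders depends on table products
HINT:  Use DROP ... CASCADE to drop the dependent objects too.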
The error message contains a useful hint: if you do not want to bother deleting all the dependent objects
individually, you can run:
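DROP TABLE products CASCADE;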
and all the dependent objects will be removed, as will any objects that depend on them, recursively.
In this case, it doesn't remove the orders table, it only removes the foreign key constraint. It stops
there because nothing depends on the foreign key constraint. (If you want to check what DROP ...
CASCADE will do, run DROP without CASCADE and read the DETAIL output.)
Almost all DROP commands in PostgreSQL support specifying CASCADE. Of course, the nature of
the possible dependencies varies with the type of the object. You can also write RESTRICT instead
of CASCADE to get the default behavior, which is to prevent dropping objects that any other objects
depend on.
Note
According to the SQL standard, specifying either RESTRICT or CASCADE is required in
a DROP command. No database system actually enforces that rule, but whether the default
behavior is RESTRICT or CASCADE varies across systems.
If a DROP command lists multiple objects, CASCADE is only required when there are dependencies
outside the specified group. For example, when saying DROP TABLE tab1, tab2 the existence
of a foreign key referencing tab1 from tab2 would not mean that CASCADE is needed to succeed.
For user-defined functions, PostgreSQL tracks dependencies associated with a function's external-
ly-visible properties, such as its argument and result types, but not dependencies that could only be
known by examining the function body. As an example, consider this situation:
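CREATE TYPE rainbow AS ENUM ('red', 'orange', 'yellow',
                             'green', 'blue', 'purple');

CREATE TABLE my_colors (color rainbow, note text);

CREATE FUNCTION get_color_note (rainbow) RETURNS text AS
    'SELECT note FROM my_colors WHERE color = $1'
    LANGUAGE SQL;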
(See Section 38.5 for an explanation of SQL-language functions.) PostgreSQL will be aware that the
get_color_note function depends on the rainbow type: dropping the type would force dropping
the function, because its argument type would no longer be defined. But PostgreSQL will not consider
get_color_note to depend on the my_colors table, and so will not drop the function if the table
is dropped. While there are disadvantages to this approach, there are also benefits. The function is still
valid in some sense if the table is missing, though executing it would cause an error; creating a new
table of the same name would allow the function to work again.
Chapter 6. Data Manipulation
The previous chapter discussed how to create tables and other structures to hold your data. Now it is
time to fill the tables with data. This chapter covers how to insert, update, and delete table data. The
chapter after this will finally explain how to extract your long-lost data from the database.
To create a new row, use the INSERT command. The command requires the table name and column
values. For example, consider the products table from Chapter 5:
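CREATE TABLE products (
    product_no integer,
    name text,
    price numeric
);

An example command to insert a row would be:

INSERT INTO products VALUES (1, 'Cheese', 9.99);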
The data values are listed in the order in which the columns appear in the table, separated by commas.
Usually, the data values will be literals (constants), but scalar expressions are also allowed.
The above syntax has the drawback that you need to know the order of the columns in the table. To
avoid this you can also list the columns explicitly. For example, both of the following commands have
the same effect as the one above:
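INSERT INTO products (product_no, name, price) VALUES (1, 'Cheese', 9.99);
INSERT INTO products (name, price, product_no) VALUES ('Cheese', 9.99, 1);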
Many users consider it good practice to always list the column names.
If you don't have values for all the columns, you can omit some of them. In that case, the columns will
be filled with their default values. For example:
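INSERT INTO products (product_no, name) VALUES (1, 'Cheese');
INSERT INTO products VALUES (1, 'Cheese');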
The second form is a PostgreSQL extension. It fills the columns from the left with as many values as
are given, and the rest will be defaulted.
For clarity, you can also request default values explicitly, for individual columns or for the entire row:
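INSERT INTO products (product_no, name, price) VALUES (1, 'Cheese', DEFAULT);
INSERT INTO products DEFAULT VALUES;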
It is also possible to insert the result of a query (which might be no rows, one row, or many rows):
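-- new_products is a hypothetical staging table
INSERT INTO products (product_no, name, price)
    SELECT product_no, name, price FROM new_products
    WHERE release_date = 'today';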
This provides the full power of the SQL query mechanism (Chapter 7) for computing the rows to be
inserted.
Tip
When inserting a lot of data at the same time, consider using the COPY command. It is not
as flexible as the INSERT command, but is more efficient. Refer to Section 14.4 for more
information on improving bulk loading performance.
To update existing rows, use the UPDATE command. This requires three pieces of information: the
name of the table and column to update, the new value of the column, and which row(s) to update.
Recall from Chapter 5 that SQL does not, in general, provide a unique identifier for rows. Therefore it
is not always possible to directly specify which row to update. Instead, you specify which conditions
a row must meet in order to be updated. Only if you have a primary key in the table (independent
of whether you declared it or not) can you reliably address individual rows by choosing a condition
that matches the primary key. Graphical database access tools rely on this fact to allow you to update
rows individually.
For example, this command updates all products that have a price of 5 to have a price of 10:
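UPDATE products SET price = 10 WHERE price = 5;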
This might cause zero, one, or many rows to be updated. It is not an error to attempt an update that
does not match any rows.
Let's look at that command in detail. First is the key word UPDATE followed by the table name. As
usual, the table name can be schema-qualified, otherwise it is looked up in the path. Next is the key
word SET followed by the column name, an equal sign, and the new column value. The new column
value can be any scalar expression, not just a constant. For example, if you want to raise the price of
all products by 10% you could use:
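UPDATE products SET price = price * 1.10;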
As you see, the expression for the new value can refer to the existing value(s) in the row. We also left
out the WHERE clause. If it is omitted, it means that all rows in the table are updated. If it is present,
only those rows that match the WHERE condition are updated. Note that the equals sign in the SET
clause is an assignment while the one in the WHERE clause is a comparison, but this does not create any
ambiguity. Of course, the WHERE condition does not have to be an equality test. Many other operators
are available (see Chapter 9). But the expression needs to evaluate to a Boolean result.
You can update more than one column in an UPDATE command by listing more than one assignment
in the SET clause. For example:
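-- a, b, and c are columns of a hypothetical table mytable
UPDATE mytable SET a = 5, b = 3, c = 1 WHERE a > 0;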
You use the DELETE command to remove rows; the syntax is very similar to the UPDATE command.
For instance, to remove all rows from the products table that have a price of 10, use:
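DELETE FROM products WHERE price = 10;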
Sometimes it is useful to obtain data from modified rows while they are being manipulated. The
INSERT, UPDATE, and DELETE commands all have an optional RETURNING clause that supports
this, avoiding the need for an extra query to collect the data. The allowed contents of a RETURNING
clause are the same as a SELECT command's output list (see
Section 7.3). It can contain column names of the command's target table, or value expressions using
those columns. A common shorthand is RETURNING *, which selects all columns of the target table
in order.
In an INSERT, the data available to RETURNING is the row as it was inserted. This is not so useful in
trivial inserts, since it would just repeat the data provided by the client. But it can be very handy when
relying on computed default values. For example, when using a serial column to provide unique
identifiers, RETURNING can return the ID assigned to a new row:
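A sketch, using a hypothetical users table with a serial id column:

CREATE TABLE users (firstname text, lastname text, id serial primary key);

INSERT INTO users (firstname, lastname) VALUES ('Joe', 'Cool') RETURNING id;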
The RETURNING clause is also very useful with INSERT ... SELECT.
In an UPDATE, the data available to RETURNING is the new content of the modified row. For example:
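UPDATE products SET price = price * 1.10
  WHERE price <= 99.99
  RETURNING name, price AS new_price;   -- name and price as in the products example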
In a DELETE, the data available to RETURNING is the content of the deleted row. For example:
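DELETE FROM products
  WHERE obsoletion_date = 'today'
  RETURNING *;   -- obsoletion_date is an illustrative column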
If there are triggers (Chapter 39) on the target table, the data available to RETURNING is the row as
modified by the triggers. Thus, inspecting columns computed by triggers is another common use-case
for RETURNING.
Chapter 7. Queries
The previous chapters explained how to create tables, how to fill them with data, and how to manipulate
that data. Now we finally discuss how to retrieve the data from the database.
7.1. Overview
The process of retrieving or the command to retrieve data from a database is called a query. In SQL
the SELECT command is used to specify queries. The general syntax of the SELECT command is
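(a sketch of the overall structure; the complete grammar appears on the SELECT reference page)

[WITH with_queries] SELECT select_list FROM table_expression [sort_specification]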
The following sections describe the details of the select list, the table expression, and the sort specifi-
cation. WITH queries are treated last since they are an advanced feature.
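The simplest kind of query has the form:

SELECT * FROM table1;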
Assuming that there is a table called table1, this command would retrieve all rows and all user-
defined columns from table1. (The method of retrieval depends on the client application. For ex-
ample, the psql program will display an ASCII-art table on the screen, while client libraries will offer
functions to extract individual values from the query result.) The select list specification * means all
columns that the table expression happens to provide. A select list can also select a subset of the avail-
able columns or make calculations using the columns. For example, if table1 has columns named
a, b, and c (and perhaps others) you can make the following query:
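SELECT a, b + c FROM table1;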
(assuming that b and c are of a numerical data type). See Section 7.3 for more details.
FROM table1 is a simple kind of table expression: it reads just one table. In general, table expres-
sions can be complex constructs of base tables, joins, and subqueries. But you can also omit the table
expression entirely and use the SELECT command as a calculator:
SELECT 3 * 4;
This is more useful if the expressions in the select list return varying results. For example, you could
call a function this way:
SELECT random();
The optional WHERE, GROUP BY, and HAVING clauses in the table expression specify a pipeline of
successive transformations performed on the table derived in the FROM clause. All these transforma-
tions produce a virtual table that provides the rows that are passed to the select list to compute the
output rows of the query.
A table reference can be a table name (possibly schema-qualified), or a derived table such as a sub-
query, a JOIN construct, or complex combinations of these. If more than one table reference is listed
in the FROM clause, the tables are cross-joined (that is, the Cartesian product of their rows is formed;
see below). The result of the FROM list is an intermediate virtual table that can then be subject to trans-
formations by the WHERE, GROUP BY, and HAVING clauses and is finally the result of the overall
table expression.
When a table reference names a table that is the parent of a table inheritance hierarchy, the table
reference produces rows of not only that table but all of its descendant tables, unless the key word
ONLY precedes the table name. However, the reference produces only the columns that appear in the
named table — any columns added in subtables are ignored.
Instead of writing ONLY before the table name, you can write * after the table name to explicitly
specify that descendant tables are included. There is no real reason to use this syntax any more, be-
cause searching descendant tables is now always the default behavior. However, it is supported for
compatibility with older releases.
A joined table is a table derived from two other (real or derived) tables according to the rules of the particular join type. Inner, outer, and cross joins are available. The general syntax of a joined table is

T1 join_type T2 [ join_condition ]
Joins of all types can be chained together, or nested: either or both T1 and T2 can be joined tables.
Parentheses can be used around JOIN clauses to control the join order. In the absence of parentheses,
JOIN clauses nest left-to-right.
Join Types
Cross join
T1 CROSS JOIN T2
For every possible combination of rows from T1 and T2 (i.e., a Cartesian product), the joined
table will contain a row consisting of all columns in T1 followed by all columns in T2. If the
tables have N and M rows respectively, the joined table will have N * M rows. FROM T1 CROSS JOIN T2 is equivalent to FROM T1 INNER JOIN T2 ON TRUE (see below). It is also equivalent to FROM T1, T2.
Note
This latter equivalence does not hold exactly when more than two tables appear, because
JOIN binds more tightly than comma. For example FROM T1 CROSS JOIN T2
INNER JOIN T3 ON condition is not the same as FROM T1, T2 INNER JOIN
T3 ON condition because the condition can reference T1 in the first case but
not the second.
Qualified joins
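The general forms are (a sketch; bracketed items are optional):

T1 { [ INNER ] | { LEFT | RIGHT | FULL } [ OUTER ] } JOIN T2 ON boolean_expression
T1 { [ INNER ] | { LEFT | RIGHT | FULL } [ OUTER ] } JOIN T2 USING ( join column list )
T1 NATURAL { [ INNER ] | { LEFT | RIGHT | FULL } [ OUTER ] } JOIN T2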
The words INNER and OUTER are optional in all forms. INNER is the default; LEFT, RIGHT,
and FULL imply an outer join.
The join condition is specified in the ON or USING clause, or implicitly by the word NATURAL.
The join condition determines which rows from the two source tables are considered to “match”,
as explained in detail below.
INNER JOIN
For each row R1 of T1, the joined table has a row for each row in T2 that satisfies the join
condition with R1.
LEFT OUTER JOIN
First, an inner join is performed. Then, for each row in T1 that does not satisfy the join
condition with any row in T2, a joined row is added with null values in columns of T2. Thus,
the joined table always has at least one row for each row in T1.
RIGHT OUTER JOIN
First, an inner join is performed. Then, for each row in T2 that does not satisfy the join
condition with any row in T1, a joined row is added with null values in columns of T1. This
is the converse of a left join: the result table will always have a row for each row in T2.
FULL OUTER JOIN
First, an inner join is performed. Then, for each row in T1 that does not satisfy the join
condition with any row in T2, a joined row is added with null values in columns of T2. Also,
for each row of T2 that does not satisfy the join condition with any row in T1, a joined row
with null values in the columns of T1 is added.
The ON clause is the most general kind of join condition: it takes a Boolean value expression of
the same kind as is used in a WHERE clause. A pair of rows from T1 and T2 match if the ON
expression evaluates to true.
The USING clause is a shorthand that allows you to take advantage of the specific situation where
both sides of the join use the same name for the joining column(s). It takes a comma-separated
list of the shared column names and forms a join condition that includes an equality comparison
for each one. For example, joining T1 and T2 with USING (a, b) produces the join condition
ON T1.a = T2.a AND T1.b = T2.b.
Furthermore, the output of JOIN USING suppresses redundant columns: there is no need to print
both of the matched columns, since they must have equal values. While JOIN ON produces all
columns from T1 followed by all columns from T2, JOIN USING produces one output column
for each of the listed column pairs (in the listed order), followed by any remaining columns from
T1, followed by any remaining columns from T2.
Finally, NATURAL is a shorthand form of USING: it forms a USING list consisting of all column
names that appear in both input tables. As with USING, these columns appear only once in the
output table. If there are no common column names, NATURAL JOIN behaves like JOIN ...
ON TRUE, producing a cross-product join.
Note
USING is reasonably safe from column changes in the joined relations since only the listed
columns are combined. NATURAL is considerably more risky since any schema changes
to either relation that cause a new matching column name to be present will cause the join
to combine that new column as well.
To put this together, assume we have tables t1:

num | name
-----+------
1 | a
2 | b
3 | c
and t2:
num | value
-----+-------
1 | xxx
3 | yyy
5 | zzz
Then, for example, an inner join on the common column,

SELECT * FROM t1 INNER JOIN t2 USING (num);

gives:

 num | name | value
-----+------+-------
   1 | a    | xxx
   3 | c    | yyy
(2 rows)
The join condition specified with ON can also contain conditions that do not relate directly to the join.
This can prove useful for some queries but needs to be thought out carefully. For example:
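For instance, using the t1 and t2 tables shown above:

SELECT * FROM t1 LEFT JOIN t2 ON t1.num = t2.num AND t2.value = 'xxx';

 num | name | num | value
-----+------+-----+-------
   1 | a    |   1 | xxx
   2 | b    |     |
   3 | c    |     |
(3 rows)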
Notice that placing the restriction in the WHERE clause produces a different result:
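(same tables as above)

SELECT * FROM t1 LEFT JOIN t2 ON t1.num = t2.num WHERE t2.value = 'xxx';

 num | name | num | value
-----+------+-----+-------
   1 | a    |   1 | xxx
(1 row)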
This is because a restriction placed in the ON clause is processed before the join, while a restriction
placed in the WHERE clause is processed after the join. That does not matter with inner joins, but it
matters a lot with outer joins.
A temporary name can be given to tables and complex table references, to be used for references to the derived table in the rest of the query; this is called a table alias. To create a table alias, write either table_reference AS alias or simply table_reference alias.
A typical application of table aliases is to assign short identifiers to long table names to keep the join
clauses readable. For example:
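SELECT * FROM some_very_long_table_name s
  JOIN another_fairly_long_name a ON s.id = a.num;   -- table and column names are illustrative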
The alias becomes the new name of the table reference so far as the current query is concerned — it
is not allowed to refer to the table by the original name elsewhere in the query. Thus, this is not valid:
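SELECT * FROM my_table AS m WHERE my_table.a > 5;   -- wrong: the table must be referred to as m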
Table aliases are mainly for notational convenience, but it is necessary to use them when joining a
table to itself, e.g.:
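For instance, with a hypothetical people table:

SELECT * FROM people AS mother JOIN people AS child
  ON mother.id = child.mother_id;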
Additionally, an alias is required if the table reference is a subquery (see Section 7.2.1.3).
Parentheses are used to resolve ambiguities. In the following example, the first statement assigns the
alias b to the second instance of my_table, but the second statement assigns the alias to the result
of the join:
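SELECT * FROM my_table AS a CROSS JOIN my_table AS b ...   -- b names the second my_table
SELECT * FROM (my_table AS a CROSS JOIN my_table) AS b ... -- b names the join result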
Another form of table aliasing gives temporary names to the columns of the table, as well as the table
itself:
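The general form is:

FROM table_reference [AS] alias ( column1 [, column2 [, ...]] )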
If fewer column aliases are specified than the actual table has columns, the remaining columns are not
renamed. This syntax is especially useful for self-joins or subqueries.
When an alias is applied to the output of a JOIN clause, the alias hides the original name(s) within
the JOIN. For example:
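SELECT a.* FROM (my_table AS a JOIN your_table AS b ON ...) AS c   -- a is hidden by c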
is not valid; the table alias a is not visible outside the alias c.
7.2.1.3. Subqueries
Subqueries specifying a derived table must be enclosed in parentheses and must be assigned a table
alias name (as in Section 7.2.1.2). For example:
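FROM (SELECT * FROM table1) AS alias_name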
This example is equivalent to FROM table1 AS alias_name. More interesting cases, which
cannot be reduced to a plain join, arise when the subquery involves grouping or aggregation.
A subquery in FROM can also be a bare VALUES list. Again, a table alias is required. Assigning alias names to the columns of the VALUES list is optional,
but is good practice. For more information see Section 7.7.
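For instance, a VALUES list used as a derived table might look like this (the names and values are illustrative):

FROM (VALUES ('anne', 'smith'), ('bob', 'jones'), ('joe', 'blow'))
     AS names(first, last)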
Table functions may also be combined using the ROWS FROM syntax, with the results returned in
parallel columns; the number of result rows in this case is that of the largest function result, with
smaller results padded with null values to match.
If the WITH ORDINALITY clause is specified, an additional column of type bigint will be added
to the function result columns. This column numbers the rows of the function result set, starting from
1. (This is a generalization of the SQL-standard syntax for UNNEST ... WITH ORDINALITY.)
By default, the ordinal column is called ordinality, but a different column name can be assigned
to it using an AS clause.
The special table function UNNEST may be called with any number of array parameters, and it returns
a corresponding number of columns, as if UNNEST (Section 9.18) had been called on each parameter
separately and combined using the ROWS FROM construct.
If no table_alias is specified, the function name is used as the table name; in the case of a ROWS
FROM() construct, the first function's name is used.
If column aliases are not supplied, then for a function returning a base data type, the column name is
also the same as the function name. For a function returning a composite type, the result columns get
the names of the individual attributes of the type.
Some examples:
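For instance, a sketch using a hypothetical table foo and a set-returning function getfoo:

CREATE TABLE foo (fooid int, foosubid int, fooname text);

CREATE FUNCTION getfoo(int) RETURNS SETOF foo AS $$
    SELECT * FROM foo WHERE fooid = $1;
$$ LANGUAGE SQL;

SELECT * FROM getfoo(1) AS t1;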
In some cases it is useful to define table functions that can return different column sets depending on
how they are invoked. To support this, the table function can be declared as returning the pseudo-type
record with no OUT parameters. When such a function is used in a query, the expected row structure
must be specified in the query itself, so that the system can know how to parse and plan the query.
This syntax looks like:
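(a sketch of the accepted forms)

function_call [AS] alias (column_definition [, ... ])
function_call AS [alias] (column_definition [, ... ])
ROWS FROM( ... function_call AS (column_definition [, ... ]) [, ... ] )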
When not using the ROWS FROM() syntax, the column_definition list replaces the column
alias list that could otherwise be attached to the FROM item; the names in the column definitions serve
as column aliases. When using the ROWS FROM() syntax, a column_definition list can be
attached to each member function separately; or if there is only one member function and no WITH
ORDINALITY clause, a column_definition list can be written in place of a column alias list
following ROWS FROM().
SELECT *
    FROM dblink('dbname=mydb', 'SELECT proname, prosrc FROM pg_proc')
      AS t1(proname name, prosrc text)                 -- 'dbname=mydb' is an illustrative connection string
    WHERE proname LIKE 'bytea%';
The dblink function (part of the dblink module) executes a remote query. It is declared to return
record since it might be used for any kind of query. The actual column set must be specified in the
calling query so that the parser knows, for example, what * should expand to.
SELECT *
FROM ROWS FROM
(
json_to_recordset('[{"a":40,"b":"foo"},
{"a":"100","b":"bar"}]')
AS (a INTEGER, b TEXT),
generate_series(1, 3)
) AS x (p, q, s)
ORDER BY p;
p | q | s
-----+-----+---
40 | foo | 1
100 | bar | 2
| | 3
It joins two functions into a single FROM target. json_to_recordset() is instructed to return
two columns, the first integer and the second text. The result of generate_series() is used
directly. The ORDER BY clause sorts the column values as integers.
Subqueries appearing in FROM can be preceded by the key word LATERAL. This allows them to reference columns provided by preceding FROM items. (Without LATERAL, each subquery is evaluated independently and so cannot cross-reference any other FROM item.) Table functions appearing in FROM can also be preceded by the key word LATERAL, but for functions
the key word is optional; the function's arguments can contain references to columns provided by
preceding FROM items in any case.
A LATERAL item can appear at top level in the FROM list, or within a JOIN tree. In the latter case it
can also refer to any items that are on the left-hand side of a JOIN that it is on the right-hand side of.
When a FROM item contains LATERAL cross-references, evaluation proceeds as follows: for each
row of the FROM item providing the cross-referenced column(s), or set of rows of multiple FROM
items providing the columns, the LATERAL item is evaluated using that row or row set's values of
the columns. The resulting row(s) are joined as usual with the rows they were computed from. This is
repeated for each row or set of rows from the column source table(s).
A trivial example of LATERAL is FROM foo, LATERAL (SELECT * FROM bar WHERE bar.id = foo.bar_id) ss. This is not especially useful since it has exactly the same result as the more conventional FROM foo, bar WHERE bar.id = foo.bar_id. (Here foo and bar are illustrative table names.)
LATERAL is primarily useful when the cross-referenced column is necessary for computing the row(s)
to be joined. A common application is providing an argument value for a set-returning function. For
example, supposing that vertices(polygon) returns the set of vertices of a polygon, we could
identify close-together vertices of polygons stored in a table with:
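A sketch, assuming a polygons table with columns id and poly:

SELECT p1.id, p2.id, v1, v2
FROM polygons p1, polygons p2,
     LATERAL vertices(p1.poly) v1,
     LATERAL vertices(p2.poly) v2
WHERE (v1 <-> v2) < 10 AND p1.id != p2.id;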
or in several other equivalent formulations. (As already mentioned, the LATERAL key word is unnec-
essary in this example, but we use it for clarity.)
It is often particularly handy to LEFT JOIN to a LATERAL subquery, so that source rows will appear
in the result even if the LATERAL subquery produces no rows for them. For example, if get_prod-
uct_names() returns the names of products made by a manufacturer, but some manufacturers in
our table currently produce no products, we could find out which ones those are like this:
SELECT m.name
FROM manufacturers m LEFT JOIN LATERAL get_product_names(m.id)
pname ON true
WHERE pname IS NULL;
The syntax of the WHERE clause is

WHERE search_condition
where search_condition is any value expression (see Section 4.2) that returns a value of type
boolean.
After the processing of the FROM clause is done, each row of the derived virtual table is checked
against the search condition. If the result of the condition is true, the row is kept in the output table,
otherwise (i.e., if the result is false or null) it is discarded. The search condition typically references
at least one column of the table generated in the FROM clause; this is not required, but otherwise the
WHERE clause will be fairly useless.
Note
The join condition of an inner join can be written either in the WHERE clause or in the JOIN
clause. For example, these table expressions are equivalent:
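(In these sketches, a and b are illustrative tables that share a column id; b also has a column val.)

FROM a, b WHERE a.id = b.id AND b.val > 5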
and:
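FROM a INNER JOIN b ON (a.id = b.id) WHERE b.val > 5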
or perhaps even:
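FROM a NATURAL JOIN b WHERE b.val > 5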
Which one of these you use is mainly a matter of style. The JOIN syntax in the FROM clause
is probably not as portable to other SQL database management systems, even though it is in
the SQL standard. For outer joins there is no choice: they must be done in the FROM clause.
The ON or USING clause of an outer join is not equivalent to a WHERE condition, because
it results in the addition of rows (for unmatched input rows) as well as the removal of rows
in the final result.
Here are some examples of WHERE clauses:

SELECT ... FROM fdt WHERE c1 > 5

SELECT ... FROM fdt WHERE c1 BETWEEN (SELECT c3 FROM t2 WHERE c2 = fdt.c1 + 10) AND 100

SELECT ... FROM fdt WHERE EXISTS (SELECT c1 FROM t2 WHERE c2 > fdt.c1)
fdt is the table derived in the FROM clause. Rows that do not meet the search condition of the WHERE
clause are eliminated from fdt. Notice the use of scalar subqueries as value expressions. Just like any
other query, the subqueries can employ complex table expressions. Notice also how fdt is referenced
in the subqueries. Qualifying c1 as fdt.c1 is only necessary if c1 is also the name of a column
in the derived input table of the subquery. But qualifying the column name adds clarity even when
it is not needed. This example shows how the column naming scope of an outer query extends into
its inner queries.
After passing the WHERE filter, the derived input table might be subject to grouping, using the GROUP BY clause, and elimination of group rows using the HAVING clause. The basic syntax is:

SELECT select_list
FROM ...
[WHERE ...]
GROUP BY grouping_column_reference
[, grouping_column_reference]...
The GROUP BY Clause is used to group together those rows in a table that have the same values
in all the columns listed. The order in which the columns are listed does not matter. The effect is to
combine each set of rows having common values into one group row that represents all rows in the
group. This is done to eliminate redundancy in the output and/or compute aggregates that apply to
these groups. For instance:
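For instance, with an illustrative table test1 containing columns x and y:

=> SELECT * FROM test1;
 x | y
---+---
 a | 3
 c | 2
 b | 5
 a | 1
(4 rows)

=> SELECT x FROM test1 GROUP BY x;
 x
---
 a
 b
 c
(3 rows)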
In the second query, we could not have written SELECT * FROM test1 GROUP BY x, because
there is no single value for the column y that could be associated with each group. The grouped-by
columns can be referenced in the select list since they have a single value in each group.
In general, if a table is grouped, columns that are not listed in GROUP BY cannot be referenced except
in aggregate expressions. An example with aggregate expressions is:
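Continuing the illustrative test1 data above:

=> SELECT x, sum(y) FROM test1 GROUP BY x;
 x | sum
---+-----
 a |   4
 b |   5
 c |   2
(3 rows)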
Here sum is an aggregate function that computes a single value over the entire group. More information
about the available aggregate functions can be found in Section 9.20.
Tip
Grouping without aggregate expressions effectively calculates the set of distinct values in a
column. This can also be achieved using the DISTINCT clause (see Section 7.3.3).
Here is another example: it calculates the total sales for each product (rather than the total sales of
all products):
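A sketch, assuming a sales table with columns product_id and units alongside the products table:

SELECT product_id, p.name, (sum(s.units) * p.price) AS sales
  FROM products p LEFT JOIN sales s USING (product_id)
  GROUP BY product_id, p.name, p.price;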
In this example, the columns product_id, p.name, and p.price must be in the GROUP BY
clause since they are referenced in the query select list (but see below). The column s.units does
not have to be in the GROUP BY list since it is only used in an aggregate expression (sum(...)),
which represents the sales of a product. For each product, the query returns a summary row about all
sales of the product.
If the products table is set up so that, say, product_id is the primary key, then it would be enough to
group by product_id in the above example, since name and price would be functionally dependent
on the product ID, and so there would be no ambiguity about which name and price value to return
for each product ID group.
In strict SQL, GROUP BY can only group by columns of the source table but PostgreSQL extends
this to also allow GROUP BY to group by columns in the select list. Grouping by value expressions
instead of simple column names is also allowed.
If a table has been grouped using GROUP BY, but only certain groups are of interest, the HAVING
clause can be used, much like a WHERE clause, to eliminate groups from the result. The syntax is:
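SELECT select_list FROM ... [WHERE ...] GROUP BY ... HAVING boolean_expression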
Expressions in the HAVING clause can refer both to grouped expressions and to ungrouped expressions
(which necessarily involve an aggregate function).
Example:
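A sketch, extending the products and sales tables above with cost and date columns:

SELECT product_id, p.name, (sum(s.units) * (p.price - p.cost)) AS profit
    FROM products p LEFT JOIN sales s USING (product_id)
    WHERE s.date > CURRENT_DATE - INTERVAL '4 weeks'
    GROUP BY product_id, p.name, p.price, p.cost
    HAVING sum(p.price * s.units) > 5000;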
In the example above, the WHERE clause is selecting rows by a column that is not grouped (the ex-
pression is only true for sales during the last four weeks), while the HAVING clause restricts the output
to groups with total gross sales over 5000. Note that the aggregate expressions do not necessarily need
to be the same in all parts of the query.
If a query contains aggregate function calls, but no GROUP BY clause, grouping still occurs: the result
is a single group row (or perhaps no rows at all, if the single row is then eliminated by HAVING).
The same is true if it contains a HAVING clause, even without any aggregate function calls or GROUP
BY clause.
More complex grouping operations than those described above are possible using the concept of group-
ing sets. The data selected by the FROM and WHERE clauses is grouped separately by each specified
grouping set, aggregates computed for each group just as for simple GROUP BY clauses, and then
the results returned. For example:
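An illustrative sketch, using a hypothetical items_sold table:

=> SELECT * FROM items_sold;
 brand | size | sales
-------+------+-------
 Foo   | L    |    10
 Foo   | M    |    20
 Bar   | M    |    15
 Bar   | L    |     5
(4 rows)

=> SELECT brand, size, sum(sales) FROM items_sold GROUP BY GROUPING SETS ((brand), (size), ());
 brand | size | sum
-------+------+-----
 Foo   |      |  30
 Bar   |      |  20
       | L    |  15
       | M    |  35
       |      |  50
(5 rows)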
Each sublist of GROUPING SETS may specify zero or more columns or expressions and is interpreted
the same way as though it were directly in the GROUP BY clause. An empty grouping set means that
all rows are aggregated down to a single group (which is output even if no input rows were present),
as described above for the case of aggregate functions with no GROUP BY clause.
References to the grouping columns or expressions are replaced by null values in result rows for
grouping sets in which those columns do not appear. To distinguish which grouping a particular output
row resulted from, see Table 9.56.
A shorthand notation is provided for specifying two common types of grouping set. A clause of the
form
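ROLLUP ( e1, e2, e3, ... )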
represents the given list of expressions and all prefixes of the list including the empty list; thus it is
equivalent to
GROUPING SETS (
( e1, e2, e3, ... ),
...
( e1, e2 ),
( e1 ),
( )
)
This is commonly used for analysis over hierarchical data; e.g., total salary by department, division,
and company-wide total.
A clause of the form CUBE ( e1, e2, ... ) represents the given list and all of its possible subsets (i.e., the power set). Thus
CUBE ( a, b, c )
is equivalent to
GROUPING SETS (
( a, b, c ),
( a, b ),
( a, c ),
( a ),
( b, c ),
( b ),
( c ),
( )
)
The individual elements of a CUBE or ROLLUP clause may be either individual expressions, or sublists
of elements in parentheses. In the latter case, the sublists are treated as single units for the purposes
of generating the individual grouping sets. For example:
CUBE ( (a, b), (c, d) ) is equivalent to
GROUPING SETS (
( a, b, c, d ),
( a, b ),
( c, d ),
( )
)
and
ROLLUP ( a, (b, c), d ) is equivalent to
GROUPING SETS (
( a, b, c, d ),
( a, b, c ),
( a ),
( )
)
The CUBE and ROLLUP constructs can be used either directly in the GROUP BY clause, or nested
inside a GROUPING SETS clause. If one GROUPING SETS clause is nested inside another, the
effect is the same as if all the elements of the inner clause had been written directly in the outer clause.
If multiple grouping items are specified in a single GROUP BY clause, then the final list of grouping
sets is the cross product of the individual items. For example:
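(an illustrative combination; a through e are placeholder columns)

GROUP BY a, CUBE (b, c), GROUPING SETS ((d), (e))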
is equivalent to
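GROUP BY GROUPING SETS (
  (a, b, c, d), (a, b, c, e),
  (a, b, d),    (a, b, e),
  (a, c, d),    (a, c, e),
  (a, d),       (a, e)
)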
Note
The construct (a, b) is normally recognized in expressions as a row constructor. Within the
GROUP BY clause, this does not apply at the top levels of expressions, and (a, b) is parsed
as a list of expressions as described above. If for some reason you need a row constructor in
a grouping expression, use ROW(a, b).
When multiple window functions are used, all the window functions having syntactically equivalent
PARTITION BY and ORDER BY clauses in their window definitions are guaranteed to be evaluated
in a single pass over the data. Therefore they will see the same sort ordering, even if the ORDER BY
does not uniquely determine an ordering. However, no guarantees are made about the evaluation of
functions having different PARTITION BY or ORDER BY specifications. (In such cases a sort step is
typically required between the passes of window function evaluations, and the sort is not guaranteed
to preserve ordering of rows that its ORDER BY sees as equivalent.)
Currently, window functions always require presorted data, and so the query output will be ordered
according to one or another of the window functions' PARTITION BY/ORDER BY clauses. It is not
recommended to rely on this, however. Use an explicit top-level ORDER BY clause if you want to be
sure the results are sorted in a particular way.
The simplest select list is *, which emits all columns; otherwise, a select list is a comma-separated list of value expressions, as in SELECT a, b, c FROM .... The column names a, b, and c are either the actual names of the columns of tables referenced in the
FROM clause, or the aliases given to them as explained in Section 7.2.1.2. The name space available
in the select list is the same as in the WHERE clause, unless grouping is used, in which case it is the
same as in the HAVING clause.
If more than one table has a column of the same name, the table name must also be given, as in:
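SELECT tbl1.a, tbl2.a, tbl1.b FROM ...   -- tbl1 and tbl2 are illustrative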
When working with multiple tables, it can also be useful to ask for all the columns of a particular table:
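SELECT tbl1.*, tbl2.a FROM ...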
If an arbitrary value expression is used in the select list, it conceptually adds a new virtual column to
the returned table. The value expression is evaluated once for each result row, with the row's values
substituted for any column references. But the expressions in the select list do not have to reference
any columns in the table expression of the FROM clause; they can be constant arithmetic expressions,
for instance.
If no output column name is specified using AS, the system assigns a default column name. For simple
column references, this is the name of the referenced column. For function calls, this is the name of
the function. For complex expressions, the system will generate a generic name.
The AS keyword is optional, but only if the new column name does not match any PostgreSQL key-
word (see Appendix C). To avoid an accidental match to a keyword, you can double-quote the column
name. For example, VALUE is a keyword, so this does not work:
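SELECT a value, b + c AS sum FROM ...   -- fails: value collides with the keyword

But this does:

SELECT a "value", b + c AS sum FROM ...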
For protection against possible future keyword additions, it is recommended that you always either
write AS or double-quote the output column name.
Note
The naming of output columns here is different from that done in the FROM clause (see Sec-
tion 7.2.1.2). It is possible to rename the same column twice, but the name assigned in the
select list is the one that will be passed on.
7.3.3. DISTINCT
After the select list has been processed, the result table can optionally be subject to the elimination of
duplicate rows. The DISTINCT key word is written directly after SELECT to specify this:
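SELECT DISTINCT select_list ...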
(Instead of DISTINCT the key word ALL can be used to specify the default behavior of retaining
all rows.)
Obviously, two rows are considered distinct if they differ in at least one column value. Null values
are considered equal in this comparison.
Alternatively, an arbitrary expression can determine what rows are to be considered distinct:
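SELECT DISTINCT ON (expression [, expression ...]) select_list ...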
Here expression is an arbitrary value expression that is evaluated for all rows. A set of rows for
which all the expressions are equal are considered duplicates, and only the first row of the set is kept
in the output. Note that the “first row” of a set is unpredictable unless the query is sorted on enough
columns to guarantee a unique ordering of the rows arriving at the DISTINCT filter. (DISTINCT
ON processing occurs after ORDER BY sorting.)
The DISTINCT ON clause is not part of the SQL standard and is sometimes considered bad style
because of the potentially indeterminate nature of its results. With judicious use of GROUP BY and
subqueries in FROM, this construct can be avoided, but it is often the most convenient alternative.
The results of two queries can be combined using the set operations union, intersection, and difference. The syntax is query1 UNION [ALL] query2, query1 INTERSECT [ALL] query2, or query1 EXCEPT [ALL] query2, where query1 and query2 are queries that can use any of the features discussed up to this point.
UNION effectively appends the result of query2 to the result of query1 (although there is no guar-
antee that this is the order in which the rows are actually returned). Furthermore, it eliminates duplicate
rows from its result, in the same way as DISTINCT, unless UNION ALL is used.
INTERSECT returns all rows that are both in the result of query1 and in the result of query2.
Duplicate rows are eliminated unless INTERSECT ALL is used.
EXCEPT returns all rows that are in the result of query1 but not in the result of query2. (This is
sometimes called the difference between two queries.) Again, duplicates are eliminated unless EX-
CEPT ALL is used.
In order to calculate the union, intersection, or difference of two queries, the two queries must be
“union compatible”, which means that they return the same number of columns and the corresponding
columns have compatible data types, as described in Section 10.5.
Set operations can be combined; for example, query1 UNION query2 EXCEPT query3 is equivalent to (query1 UNION query2) EXCEPT query3.
As shown here, you can use parentheses to control the order of evaluation. Without parentheses,
UNION and EXCEPT associate left-to-right, but INTERSECT binds more tightly than those two op-
erators. Thus, a UNION b INTERSECT c means a UNION (b INTERSECT c).
You can also surround an individual query with parentheses. This is important if the query needs
to use any of the clauses discussed in following sections, such as LIMIT. Without parentheses, you'll
get a syntax error, or else the clause will be understood as applying to the output of the set operation
rather than one of its inputs. For example, in SELECT a FROM b UNION SELECT x FROM y LIMIT 10 the LIMIT is understood as applying to the output of the whole UNION, not only to the second SELECT; to limit just the second input, write SELECT a FROM b UNION (SELECT x FROM y LIMIT 10).
After a query has produced an output table (after the select list has been processed) it can optionally be sorted. The ORDER BY clause specifies the sort order:

SELECT select_list
FROM table_expression
ORDER BY sort_expression1 [ASC | DESC] [NULLS { FIRST | LAST }]
[, sort_expression2 [ASC | DESC] [NULLS { FIRST |
LAST }] ...]
The sort expression(s) can be any expression that would be valid in the query's select list. An example
is:
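SELECT a, b FROM table1 ORDER BY a + b, c;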
When more than one expression is specified, the later values are used to sort rows that are equal
according to the earlier values. Each expression can be followed by an optional ASC or DESC keyword
to set the sort direction to ascending or descending. ASC order is the default. Ascending order puts
smaller values first, where “smaller” is defined in terms of the < operator. Similarly, descending order
is determined with the > operator. (Actually, PostgreSQL uses the default B-tree operator class for the expression's data type to determine the sort ordering for ASC and DESC. Conventionally, data types will be set up so that the < and > operators correspond to this sort ordering, but a user-defined data type's designer could choose to do something different.)
The NULLS FIRST and NULLS LAST options can be used to determine whether nulls appear before
or after non-null values in the sort ordering. By default, null values sort as if larger than any non-null
value; that is, NULLS FIRST is the default for DESC order, and NULLS LAST otherwise.
Note that the ordering options are considered independently for each sort column. For example ORDER
BY x, y DESC means ORDER BY x ASC, y DESC, which is not the same as ORDER BY
x DESC, y DESC.
A sort_expression can also be the column label or number of an output column, as in:
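SELECT a + b AS sum, c FROM table1 ORDER BY sum;
SELECT a, max(b) FROM table1 GROUP BY a ORDER BY 1;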
both of which sort by the first output column. Note that an output column name has to stand alone,
that is, it cannot be used in an expression — for example, this is not correct:
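SELECT a + b AS sum, c FROM table1 ORDER BY sum + c;   -- wrong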
This restriction is made to reduce ambiguity. There is still ambiguity if an ORDER BY item is a simple
name that could match either an output column name or a column from the table expression. The
output column is used in such cases. This would only cause confusion if you use AS to rename an
output column to match some other table column's name.
ORDER BY can be applied to the result of a UNION, INTERSECT, or EXCEPT combination, but in
this case it is only permitted to sort by output column names or numbers, not by expressions.
LIMIT and OFFSET allow you to retrieve just a portion of the rows that are generated by the rest of the query:

SELECT select_list
FROM table_expression
[ ORDER BY ... ]
[ LIMIT { number | ALL } ] [ OFFSET number ]
If a limit count is given, no more than that many rows will be returned (but possibly fewer, if the
query itself yields fewer rows). LIMIT ALL is the same as omitting the LIMIT clause, as is LIMIT
with a NULL argument.
OFFSET says to skip that many rows before beginning to return rows. OFFSET 0 is the same as
omitting the OFFSET clause, as is OFFSET with a NULL argument.
If both OFFSET and LIMIT appear, then OFFSET rows are skipped before starting to count the
LIMIT rows that are returned.
When using LIMIT, it is important to use an ORDER BY clause that constrains the result rows into a
unique order. Otherwise you will get an unpredictable subset of the query's rows. You might be asking
for the tenth through twentieth rows, but tenth through twentieth in what ordering? The ordering is
unknown, unless you specified ORDER BY.
The query optimizer takes LIMIT into account when generating query plans, so you are very likely
to get different plans (yielding different row orders) depending on what you give for LIMIT and
OFFSET. Thus, using different LIMIT/OFFSET values to select different subsets of a query result
will give inconsistent results unless you enforce a predictable result ordering with ORDER BY. This
is not a bug; it is an inherent consequence of the fact that SQL does not promise to deliver the results
of a query in any particular order unless ORDER BY is used to constrain the order.
The rows skipped by an OFFSET clause still have to be computed inside the server; therefore a large
OFFSET might be inefficient.
VALUES provides a way to generate a “constant table” that can be used in a query without having to actually create and populate a table on disk. The syntax is VALUES ( expression [, ...] ) [, ...]. Each parenthesized list of expressions generates a row in the table. The lists must all have the same
number of elements (i.e., the number of columns in the table), and corresponding entries in each
list must have compatible data types. The actual data type assigned to each column of the result is
determined using the same rules as for UNION (see Section 10.5).
As an example:
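VALUES (1, 'one'), (2, 'two'), (3, 'three');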
will return a table of two columns and three rows. It's effectively equivalent to:
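SELECT 1 AS column1, 'one' AS column2
UNION ALL
SELECT 2, 'two'
UNION ALL
SELECT 3, 'three';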
By default, PostgreSQL assigns the names column1, column2, etc. to the columns of a VALUES
table. The column names are not specified by the SQL standard and different database systems do it
differently, so it's usually better to override the default names with a table alias list, like this:
=> SELECT * FROM (VALUES (1, 'one'), (2, 'two'), (3, 'three')) AS t
(num,letter);
num | letter
-----+--------
1 | one
2 | two
3 | three
(3 rows)
Syntactically, VALUES followed by expression lists is treated as equivalent to SELECT select_list FROM table_expression, and can appear anywhere a SELECT can. For example, you can use it as part of a UNION, or attach a
sort_specification (ORDER BY, LIMIT, and/or OFFSET) to it. VALUES is most commonly
used as the data source in an INSERT command, and next most commonly as a subquery.
WITH provides a way to write auxiliary statements for use in a larger query. These statements, which
are often referred to as Common Table Expressions or CTEs, can be thought of as defining temporary
tables that exist just for one query. Each auxiliary statement in a WITH clause can be a SELECT,
INSERT, UPDATE, or DELETE; and the WITH clause itself is attached to a primary statement that
can also be a SELECT, INSERT, UPDATE, or DELETE.
The basic value of SELECT in WITH is to break down complicated queries into simpler parts. An example is:

WITH regional_sales AS (
SELECT region, SUM(amount) AS total_sales
FROM orders
GROUP BY region
), top_regions AS (
SELECT region
FROM regional_sales
WHERE total_sales > (SELECT SUM(total_sales)/10 FROM
regional_sales)
)
SELECT region,
product,
SUM(quantity) AS product_units,
SUM(amount) AS product_sales
FROM orders
WHERE region IN (SELECT region FROM top_regions)
GROUP BY region, product;
which displays per-product sales totals in only the top sales regions. The WITH clause defines two
auxiliary statements named regional_sales and top_regions, where the output of region-
al_sales is used in top_regions and the output of top_regions is used in the primary
SELECT query. This example could have been written without WITH, but we'd have needed two levels
of nested sub-SELECTs. It's a bit easier to follow this way.
The optional RECURSIVE modifier changes WITH from a mere syntactic convenience into a feature
that accomplishes things not otherwise possible in standard SQL. Using RECURSIVE, a WITH query
can refer to its own output. A very simple example is this query to sum the integers from 1 through 100:
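WITH RECURSIVE t(n) AS (
    VALUES (1)
  UNION ALL
    SELECT n+1 FROM t WHERE n < 100
)
SELECT sum(n) FROM t;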
The general form of a recursive WITH query is always a non-recursive term, then UNION (or UNION
ALL), then a recursive term, where only the recursive term can contain a reference to the query's own
output. Such a query is executed as follows:
1. Evaluate the non-recursive term. For UNION (but not UNION ALL), discard duplicate rows. Include all remaining rows in the result of the recursive query, and also place them in a temporary working table.
2. So long as the working table is not empty, repeat these steps:
a. Evaluate the recursive term, substituting the current contents of the working table for the
recursive self-reference. For UNION (but not UNION ALL), discard duplicate rows and
rows that duplicate any previous result row. Include all remaining rows in the result of the
recursive query, and also place them in a temporary intermediate table.
b. Replace the contents of the working table with the contents of the intermediate table, then
empty the intermediate table.
Note
Strictly speaking, this process is iteration not recursion, but RECURSIVE is the terminology
chosen by the SQL standards committee.
In the example above, the working table has just a single row in each step, and it takes on the values
from 1 through 100 in successive steps. In the 100th step, there is no output because of the WHERE
clause, and so the query terminates.
Recursive queries are typically used to deal with hierarchical or tree-structured data. A useful example
is this query to find all the direct and indirect sub-parts of a product, given only a table that shows
immediate inclusions:
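A sketch, assuming a parts table with columns part, sub_part, and quantity:

WITH RECURSIVE included_parts(sub_part, part, quantity) AS (
    SELECT sub_part, part, quantity FROM parts WHERE part = 'our_product'
  UNION ALL
    SELECT p.sub_part, p.part, p.quantity
    FROM included_parts pr, parts p
    WHERE p.part = pr.sub_part
)
SELECT sub_part, SUM(quantity) AS total_quantity
FROM included_parts
GROUP BY sub_part;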
When working with recursive queries it is important to be sure that the recursive part of the query will
eventually return no tuples, or else the query will loop indefinitely. Sometimes, using UNION instead
of UNION ALL can accomplish this by discarding rows that duplicate previous output rows. However,
often a cycle does not involve output rows that are completely duplicate: it may be necessary to check
just one or a few fields to see if the same point has been reached before. The standard method for
handling such situations is to compute an array of the already-visited values. For example, consider
the following query that searches a table graph using a link field:
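A sketch, assuming graph has columns id, link, and data:

WITH RECURSIVE search_graph(id, link, data, depth) AS (
    SELECT g.id, g.link, g.data, 1
    FROM graph g
  UNION ALL
    SELECT g.id, g.link, g.data, sg.depth + 1
    FROM graph g, search_graph sg
    WHERE g.id = sg.link
)
SELECT * FROM search_graph;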
This query will loop if the link relationships contain cycles. Because we require a “depth” output,
just changing UNION ALL to UNION would not eliminate the looping. Instead we need to recognize
whether we have reached the same row again while following a particular path of links. We add two
columns path and cycle to the loop-prone query:
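WITH RECURSIVE search_graph(id, link, data, depth, path, cycle) AS (
    SELECT g.id, g.link, g.data, 1,
      ARRAY[g.id],
      false
    FROM graph g       -- same illustrative graph table as above
  UNION ALL
    SELECT g.id, g.link, g.data, sg.depth + 1,
      path || g.id,
      g.id = ANY(path)
    FROM graph g, search_graph sg
    WHERE g.id = sg.link AND NOT cycle
)
SELECT * FROM search_graph;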
Aside from preventing cycles, the array value is often useful in its own right as representing the “path”
taken to reach any particular row.
In the general case where more than one field needs to be checked to recognize a cycle, use an array
of rows. For example, if we needed to compare fields f1 and f2:
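WITH RECURSIVE search_graph(id, link, data, depth, path, cycle) AS (
    SELECT g.id, g.link, g.data, 1,
      ARRAY[ROW(g.f1, g.f2)],
      false
    FROM graph g
  UNION ALL
    SELECT g.id, g.link, g.data, sg.depth + 1,
      path || ROW(g.f1, g.f2),
      ROW(g.f1, g.f2) = ANY(path)
    FROM graph g, search_graph sg
    WHERE g.id = sg.link AND NOT cycle
)
SELECT * FROM search_graph;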
Tip
Omit the ROW() syntax in the common case where only one field needs to be checked to
recognize a cycle. This allows a simple array rather than a composite-type array to be used,
gaining efficiency.
Tip
The recursive query evaluation algorithm produces its output in breadth-first search order. You
can display the results in depth-first search order by making the outer query ORDER BY a
“path” column constructed in this way.
A helpful trick for testing queries when you are not certain if they might loop is to place a LIMIT in
the parent query. For example, this query would loop forever without the LIMIT:
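WITH RECURSIVE t(n) AS (
    SELECT 1
  UNION ALL
    SELECT n+1 FROM t
)
SELECT n FROM t LIMIT 100;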
This works because PostgreSQL's implementation evaluates only as many rows of a WITH query as
are actually fetched by the parent query. Using this trick in production is not recommended, because
other systems might work differently. Also, it usually won't work if you make the outer query sort the
recursive query's results or join them to some other table, because in such cases the outer query will
usually try to fetch all of the WITH query's output anyway.
A useful property of WITH queries is that they are evaluated only once per execution of the parent
query, even if they are referred to more than once by the parent query or sibling WITH queries. Thus,
expensive calculations that are needed in multiple places can be placed within a WITH query to avoid
redundant work. Another possible application is to prevent unwanted multiple evaluations of func-
tions with side-effects. However, the other side of this coin is that the optimizer is less able to push
restrictions from the parent query down into a WITH query than an ordinary subquery. The WITH
query will generally be evaluated as written, without suppression of rows that the parent query might
discard afterwards. (But, as mentioned above, evaluation might stop early if the reference(s) to the
query demand only a limited number of rows.)
The examples above only show WITH being used with SELECT, but it can be attached in the same
way to INSERT, UPDATE, or DELETE. In each case it effectively provides temporary table(s) that
can be referred to in the main command.
WITH moved_rows AS (
DELETE FROM products
WHERE
"date" >= '2010-10-01' AND
"date" < '2010-11-01'
RETURNING *
)
INSERT INTO products_log
SELECT * FROM moved_rows;
This query effectively moves rows from products to products_log. The DELETE in WITH
deletes the specified rows from products, returning their contents by means of its RETURNING
clause; and then the primary query reads that output and inserts it into products_log.
A fine point of the above example is that the WITH clause is attached to the INSERT, not the sub-
SELECT within the INSERT. This is necessary because data-modifying statements are only allowed
in WITH clauses that are attached to the top-level statement. However, normal WITH visibility rules
apply, so it is possible to refer to the WITH statement's output from the sub-SELECT.
Data-modifying statements in WITH usually have RETURNING clauses (see Section 6.4), as shown
in the example above. It is the output of the RETURNING clause, not the target table of the data-mod-
ifying statement, that forms the temporary table that can be referred to by the rest of the query. If a
data-modifying statement in WITH lacks a RETURNING clause, then it forms no temporary table and
cannot be referred to in the rest of the query. Such a statement will be executed nonetheless. A not-
particularly-useful example is:
WITH t AS (
DELETE FROM foo
)
DELETE FROM bar;
This example would remove all rows from tables foo and bar. The number of affected rows reported
to the client would only include rows removed from bar.
Recursive self-references in data-modifying statements are not allowed. In some cases it is possible
to work around this limitation by referring to the output of a recursive WITH, for example:
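A sketch, using the parts table assumed earlier:

WITH RECURSIVE included_parts(sub_part, part) AS (
    SELECT sub_part, part FROM parts WHERE part = 'our_product'
  UNION ALL
    SELECT p.sub_part, p.part
    FROM included_parts pr, parts p
    WHERE p.part = pr.sub_part
)
DELETE FROM parts
  WHERE part IN (SELECT part FROM included_parts);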
This query would remove all direct and indirect subparts of a product.
Data-modifying statements in WITH are executed exactly once, and always to completion, indepen-
dently of whether the primary query reads all (or indeed any) of their output. Notice that this is differ-
ent from the rule for SELECT in WITH: as stated in the previous section, execution of a SELECT is
carried only as far as the primary query demands its output.
The sub-statements in WITH are executed concurrently with each other and with the main query.
Therefore, when using data-modifying statements in WITH, the order in which the specified updates
actually happen is unpredictable. All the statements are executed with the same snapshot (see Chap-
ter 13), so they cannot “see” one another's effects on the target tables. This alleviates the effects of the
unpredictability of the actual order of row updates, and means that RETURNING data is the only way
to communicate changes between different WITH sub-statements and the main query. An example of
this is that in
WITH t AS (
UPDATE products SET price = price * 1.05
RETURNING *
)
SELECT * FROM products;
the outer SELECT would return the original prices before the action of the UPDATE, while in
WITH t AS (
UPDATE products SET price = price * 1.05
RETURNING *
)
SELECT * FROM t;
the outer SELECT would return the updated data.

Trying to update the same row twice in a single statement is not supported. Only one of the modifica-
tions takes place, but it is not easy (and sometimes not possible) to reliably predict which one. This also
applies to deleting a row that was already updated in the same statement: only the update is performed.
Therefore you should generally avoid trying to modify a single row twice in a single statement. In
particular avoid writing WITH sub-statements that could affect the same rows changed by the main
statement or a sibling sub-statement. The effects of such a statement will not be predictable.
At present, any table used as the target of a data-modifying statement in WITH must not have a con-
ditional rule, nor an ALSO rule, nor an INSTEAD rule that expands to multiple statements.
Chapter 8. Data Types
PostgreSQL has a rich set of native data types available to users. Users can add new types to Post-
greSQL using the CREATE TYPE command.
Table 8.1 shows all the built-in general-purpose data types. Most of the alternative names listed in
the “Aliases” column are the names used internally by PostgreSQL for historical reasons. In addition,
some internally used or deprecated types are available, but are not listed here.
Compatibility
The following types (or spellings thereof) are specified by SQL: bigint, bit, bit vary-
ing, boolean, char, character varying, character, varchar, date, dou-
ble precision, integer, interval, numeric, decimal, real, smallint,
time (with or without time zone), timestamp (with or without time zone), xml.
Each data type has an external representation determined by its input and output functions. Many of the
built-in types have obvious external formats. However, several types are either unique to PostgreSQL,
such as geometric paths, or have several possible formats, such as the date and time types. Some of the
input and output functions are not invertible, i.e., the result of an output function might lose accuracy
when compared to the original input.
The syntax of constants for the numeric types is described in Section 4.1.2. The numeric types have a
full set of corresponding arithmetic operators and functions. Refer to Chapter 9 for more information.
The following sections describe the types in detail.
The type integer is the common choice, as it offers the best balance between range, storage size, and
performance. The smallint type is generally only used if disk space is at a premium. The bigint
type is designed to be used when the range of the integer type is insufficient.
SQL only specifies the integer types integer (or int), smallint, and bigint. The type names
int2, int4, and int8 are extensions, which are also used by some other SQL database systems.
We use the following terms below: The precision of a numeric is the total count of significant digits
in the whole number, that is, the number of digits to both sides of the decimal point. The scale of a
numeric is the count of decimal digits in the fractional part, to the right of the decimal point. So the
number 23.5141 has a precision of 6 and a scale of 4. Integers can be considered to have a scale of zero.
Both the maximum precision and the maximum scale of a numeric column can be configured. To
declare a column of type numeric use the syntax:
NUMERIC(precision, scale)

The precision must be positive, the scale zero or positive. Alternatively:

NUMERIC(precision)

selects a scale of 0. Specifying:

NUMERIC
without any precision or scale creates a column in which numeric values of any precision and scale
can be stored, up to the implementation limit on precision. A column of this kind will not coerce input
values to any particular scale, whereas numeric columns with a declared scale will coerce input
values to that scale. (The SQL standard requires a default scale of 0, i.e., coercion to integer precision.
We find this a bit useless. If you're concerned about portability, always specify the precision and scale
explicitly.)
Note
The maximum allowed precision when explicitly specified in the type declaration is 1000;
NUMERIC without a specified precision is subject to the limits described in Table 8.2.
If the scale of a value to be stored is greater than the declared scale of the column, the system will
round the value to the specified number of fractional digits. Then, if the number of digits to the left of
the decimal point exceeds the declared precision minus the declared scale, an error is raised.
Numeric values are physically stored without any extra leading or trailing zeroes. Thus, the declared
precision and scale of a column are maximums, not fixed allocations. (In this sense the numeric
type is more akin to varchar(n) than to char(n).) The actual storage requirement is two bytes
for each group of four decimal digits, plus three to eight bytes overhead.
In addition to ordinary numeric values, the numeric type allows the special value NaN, meaning
“not-a-number”. Any operation on NaN yields another NaN. When writing this value as a constant in
an SQL command, you must put quotes around it, for example UPDATE table SET x = 'NaN'.
On input, the string NaN is recognized in a case-insensitive manner.
Note
In most implementations of the “not-a-number” concept, NaN is not considered equal to any
other numeric value (including NaN). In order to allow numeric values to be sorted and used
in tree-based indexes, PostgreSQL treats NaN values as equal, and greater than all non-NaN
values.
The types decimal and numeric are equivalent. Both types are part of the SQL standard.
When rounding values, the numeric type rounds ties away from zero, while (on most machines) the
real and double precision types round ties to the nearest even number. For example:
SELECT x,
round(x::numeric) AS num_round,
round(x::double precision) AS dbl_round
FROM generate_series(-3.5, 3.5, 1) as x;
The data types real and double precision are inexact, variable-precision numeric types. In practice, these types are usually implementations of IEEE Standard 754 for binary floating-point arithmetic (single and double precision, respectively), to the extent that the underlying processor, operating system, and compiler support it. Inexact means that some values cannot be converted exactly to the internal format and are stored as
approximations, so that storing and retrieving a value might show slight discrepancies. Managing these
errors and how they propagate through calculations is the subject of an entire branch of mathematics
and computer science and will not be discussed here, except for the following points:
• If you require exact storage and calculations (such as for monetary amounts), use the numeric
type instead.
• If you want to do complicated calculations with these types for anything important, especially if
you rely on certain behavior in boundary cases (infinity, underflow), you should evaluate the im-
plementation carefully.
• Comparing two floating-point values for equality might not always work as expected.
On most platforms, the real type has a range of at least 1E-37 to 1E+37 with a precision of at least
6 decimal digits. The double precision type typically has a range of around 1E-307 to 1E+308
with a precision of at least 15 digits. Values that are too large or too small will cause an error. Rounding
might take place if the precision of an input number is too high. Numbers too close to zero that are
not representable as distinct from zero will cause an underflow error.
Note
The extra_float_digits setting controls the number of extra significant digits included when a
floating point value is converted to text for output. With the default value of 0, the output is
the same on every platform supported by PostgreSQL. Increasing it will produce output that
more accurately represents the stored value, but may be unportable.
In addition to ordinary numeric values, the floating-point types have several special values:
Infinity
-Infinity
NaN
These represent the IEEE 754 special values “infinity”, “negative infinity”, and “not-a-number”, re-
spectively. (On a machine whose floating-point arithmetic does not follow IEEE 754, these values
will probably not work as expected.) When writing these values as constants in an SQL command,
you must put quotes around them, for example UPDATE table SET x = '-Infinity'. On
input, these strings are recognized in a case-insensitive manner.
Note
IEEE754 specifies that NaN should not compare equal to any other floating-point value (in-
cluding NaN). In order to allow floating-point values to be sorted and used in tree-based in-
dexes, PostgreSQL treats NaN values as equal, and greater than all non-NaN values.
PostgreSQL also supports the SQL-standard notations float and float(p) for specifying inexact
numeric types. Here, p specifies the minimum acceptable precision in binary digits. PostgreSQL ac-
cepts float(1) to float(24) as selecting the real type, while float(25) to float(53)
select double precision. Values of p outside the allowed range draw an error. float with no
precision specified is taken to mean double precision.
Note
The assumption that real and double precision have exactly 24 and 53 bits in the
mantissa respectively is correct for IEEE-standard floating point implementations. On non-
IEEE platforms it might be off a little, but for simplicity the same ranges of p are used on
all platforms.
The data types smallserial, serial and bigserial are not true types, but merely a notation-
al convenience for creating unique identifier columns (similar to the AUTO_INCREMENT property
supported by some other databases). In the current implementation, specifying:
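CREATE TABLE tablename (
    colname SERIAL
);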
is equivalent to specifying:
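CREATE SEQUENCE tablename_colname_seq AS integer;
CREATE TABLE tablename (
    colname integer NOT NULL DEFAULT nextval('tablename_colname_seq')
);
ALTER SEQUENCE tablename_colname_seq OWNED BY tablename.colname;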
Thus, we have created an integer column and arranged for its default values to be assigned from a
sequence generator. A NOT NULL constraint is applied to ensure that a null value cannot be inserted.
(In most cases you would also want to attach a UNIQUE or PRIMARY KEY constraint to prevent
duplicate values from being inserted by accident, but this is not automatic.) Lastly, the sequence is
marked as “owned by” the column, so that it will be dropped if the column or table is dropped.
Note
Because smallserial, serial and bigserial are implemented using sequences, there
may be "holes" or gaps in the sequence of values which appears in the column, even if no rows
are ever deleted. A value allocated from the sequence is still "used up" even if a row containing
that value is never successfully inserted into the table column. This may happen, for example,
if the inserting transaction rolls back. See nextval() in Section 9.16 for details.
To insert the next value of the sequence into the serial column, specify that the serial column
should be assigned its default value. This can be done either by excluding the column from the list of
columns in the INSERT statement, or through the use of the DEFAULT key word.
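For instance, continuing the hypothetical tablename above:

INSERT INTO tablename (colname) VALUES (DEFAULT);
INSERT INTO tablename DEFAULT VALUES;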
The type names serial and serial4 are equivalent: both create integer columns. The type
names bigserial and serial8 work the same way, except that they create a bigint column.
bigserial should be used if you anticipate the use of more than 2^31 identifiers over the lifetime of
the table. The type names smallserial and serial2 also work the same way, except that they
create a smallint column.
The sequence created for a serial column is automatically dropped when the owning column is
dropped. You can drop the sequence without dropping the column, but this will force removal of the
column default expression.
The money type stores a currency amount with a fixed fractional precision, determined by the database's lc_monetary setting. Input is accepted in a variety of formats, including integer and floating-point literals, as well as typical currency formatting such as '$1,000.00'. Since the output of this data type is locale-sensitive, it might not work to load money data into a
database that has a different setting of lc_monetary. To avoid problems, before restoring a dump
into a new database make sure lc_monetary has the same or equivalent value as in the database
that was dumped.
Values of the numeric, int, and bigint data types can be cast to money. Conversion from the
real and double precision data types can be done by casting to numeric first, for example:
SELECT '12.34'::float8::numeric::money;
However, this is not recommended. Floating point numbers should not be used to handle money due
to the potential for rounding errors.
A money value can be cast to numeric without loss of precision. Conversion to other types could
potentially lose precision, and must also be done in two stages:
SELECT '52093.89'::money::numeric::float8;
Division of a money value by an integer value is performed with truncation of the fractional part
towards zero. To get a rounded result, divide by a floating-point value, or cast the money value to
numeric before dividing and back to money afterwards. (The latter is preferable to avoid risking
precision loss.) When a money value is divided by another money value, the result is double pre-
cision (i.e., a pure number, not money); the currency units cancel each other out in the division.
SQL defines two primary character types: character varying(n) and character(n), where
n is a positive integer. Both of these types can store strings up to n characters (not bytes) in length. An
attempt to store a longer string into a column of these types will result in an error, unless the excess
characters are all spaces, in which case the string will be truncated to the maximum length. (This
somewhat bizarre exception is required by the SQL standard.) If the string to be stored is shorter than
the declared length, values of type character will be space-padded; values of type character
varying will simply store the shorter string.
The notations varchar(n) and char(n) are aliases for character varying(n) and
character(n), respectively. character without length specifier is equivalent to character(1).
If character varying is used without length specifier, the type accepts strings of any size. The
latter is a PostgreSQL extension.
In addition, PostgreSQL provides the text type, which stores strings of any length. Although the type
text is not in the SQL standard, several other SQL database management systems have it as well.
Values of type character are physically padded with spaces to the specified width n, and are stored
and displayed that way. However, trailing spaces are treated as semantically insignificant and disre-
garded when comparing two values of type character. In collations where whitespace is signifi-
cant, this behavior can produce unexpected results; for example SELECT 'a '::CHAR(2)
collate "C" < E'a\n'::CHAR(2) returns true, even though C locale would consider a space
to be greater than a newline. Trailing spaces are removed when converting a character value to
one of the other string types. Note that trailing spaces are semantically significant in character
varying and text values, and when using pattern matching, that is LIKE and regular expressions.
The characters that can be stored in any of these data types are determined by the database character set,
which is selected when the database is created. Regardless of the specific character set, the character
with code zero (sometimes called NUL) cannot be stored. For more information refer to Section 23.3.
The storage requirement for a short string (up to 126 bytes) is 1 byte plus the actual string, which
includes the space padding in the case of character. Longer strings have 4 bytes of overhead instead
of 1. Long strings are compressed by the system automatically, so the physical requirement on disk
might be less. Very long values are also stored in background tables so that they do not interfere with
rapid access to shorter column values. In any case, the longest possible character string that can be
stored is about 1 GB. (The maximum value that will be allowed for n in the data type declaration is less
than that. It wouldn't be useful to change this because with multibyte character encodings the number
of characters and bytes can be quite different. If you desire to store long strings with no specific upper
limit, use text or character varying without a length specifier, rather than making up an
arbitrary length limit.)
Tip
There is no performance difference among these three types, apart from increased storage
space when using the blank-padded type, and a few extra CPU cycles to check the length
when storing into a length-constrained column. While character(n) has performance ad-
vantages in some other database systems, there is no such advantage in PostgreSQL; in fact
character(n) is usually the slowest of the three because of its additional storage costs. In
most situations text or character varying should be used instead.
Refer to Section 4.1.2.1 for information about the syntax of string literals, and to Chapter 9 for infor-
mation about available operators and functions.
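The two result sets shown below evidently come from statements along these lines (a reconstruction;
the table names test1 and test2 and the column names follow the output headers):
CREATE TABLE test1 (a character(4));
INSERT INTO test1 VALUES ('ok');
SELECT a, char_length(a) FROM test1;                -- produces the first result below

CREATE TABLE test2 (b varchar(5));
INSERT INTO test2 VALUES ('ok');
INSERT INTO test2 VALUES ('good      ');            -- excess trailing spaces are allowed and truncated
INSERT INTO test2 VALUES ('too long'::varchar(5));  -- explicit truncation
SELECT b, char_length(b) FROM test2;                -- produces the second result below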
  a   | char_length
------+-------------
 ok   |           2

   b   | char_length
-------+-------------
 ok    |           2
 good  |           5
 too l |           5
There are two other fixed-length character types in PostgreSQL, shown in Table 8.5. The name type
exists only for the storage of identifiers in the internal system catalogs and is not intended for use
by the general user. Its length is currently defined as 64 bytes (63 usable characters plus terminator)
but should be referenced using the constant NAMEDATALEN in C source code. The length is set at
compile time (and is therefore adjustable for special uses); the default maximum length might change
in a future release. The type "char" (note the quotes) is different from char(1) in that it only uses
one byte of storage. It is internally used in the system catalogs as a simplistic enumeration type.
A binary string is a sequence of octets (or bytes). Binary strings are distinguished from character
strings in two ways. First, binary strings specifically allow storing octets of value zero and other “non-
printable” octets (usually, octets outside the decimal range 32 to 126). Character strings disallow zero
octets, and also disallow any other octet values and sequences of octet values that are invalid according
to the database's selected character set encoding. Second, operations on binary strings process the
actual bytes, whereas the processing of character strings depends on locale settings. In short, binary
strings are appropriate for storing data that the programmer thinks of as “raw bytes”, whereas character
strings are appropriate for storing text.
The bytea type supports two formats for input and output: “hex” format and PostgreSQL's histori-
cal “escape” format. Both of these are always accepted on input. The output format depends on the
configuration parameter bytea_output; the default is hex. (Note that the hex format was introduced in
PostgreSQL 9.0; earlier versions and some tools don't understand it.)
The SQL standard defines a different binary string type, called BLOB or BINARY LARGE OBJECT.
The input format is different from bytea, but the provided functions and operators are mostly the
same.
Example:
SELECT '\xDEADBEEF';
When entering bytea values in escape format, octets of certain values must be escaped, while all
octet values can be escaped. In general, to escape an octet, convert it into its three-digit octal value and
precede it by a backslash. Backslash itself (octet decimal value 92) can alternatively be represented
by double backslashes. Table 8.7 shows the characters that must be escaped, and gives the alternative
escape sequences where applicable.
The requirement to escape non-printable octets varies depending on locale settings. In some instances
you can get away with leaving them unescaped.
The reason that single quotes must be doubled, as shown in Table 8.7, is that this is true for any string
literal in a SQL command. The generic string-literal parser consumes the outermost single quotes and
reduces any pair of single quotes to one data character. What the bytea input function sees is just
one single quote, which it treats as a plain data character. However, the bytea input function treats
backslashes as special, and the other behaviors shown in Table 8.7 are implemented by that function.
In some contexts, backslashes must be doubled compared to what is shown above, because the generic
string-literal parser will also reduce pairs of backslashes to one data character; see Section 4.1.2.1.
Bytea octets are output in hex format by default. If you change bytea_output to escape, “non-
printable” octets are converted to their equivalent three-digit octal value and preceded by one back-
slash. Most “printable” octets are output by their standard representation in the client character set, e.g.:
The octet with decimal value 92 (backslash) is doubled in the output. Details are in Table 8.8.
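A brief illustration (the input bytes are chosen arbitrarily for the example):
SET bytea_output = 'escape';
SELECT '\x5c41'::bytea;     -- bytes: backslash, then 'A'; displayed as \\A in escape format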
Depending on the front end to PostgreSQL you use, you might have additional work to do in terms of
escaping and unescaping bytea strings. For example, you might also have to escape line feeds and
carriage returns if your interface automatically translates these.
Note
The SQL standard requires that writing just timestamp be equivalent to timestamp
without time zone, and PostgreSQL honors that behavior. timestamptz is accepted
as an abbreviation for timestamp with time zone; this is a PostgreSQL extension.
time, timestamp, and interval accept an optional precision value p which specifies the number
of fractional digits retained in the seconds field. By default, there is no explicit bound on precision.
The allowed range of p is from 0 to 6.
The interval type has an additional option, which is to restrict the set of stored fields by writing
one of these phrases:
YEAR
MONTH
DAY
HOUR
MINUTE
SECOND
YEAR TO MONTH
DAY TO HOUR
DAY TO MINUTE
DAY TO SECOND
HOUR TO MINUTE
HOUR TO SECOND
MINUTE TO SECOND
Note that if both fields and p are specified, the fields must include SECOND, since the precision
applies only to the seconds.
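For example (a sketch; the displayed result rounds the seconds to two fractional digits):
SELECT INTERVAL '1 day 02:03:04.5678' DAY TO SECOND(2);   -- 1 day 02:03:04.57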
The type time with time zone is defined by the SQL standard, but the definition exhibits prop-
erties which lead to questionable usefulness. In most cases, a combination of date, time, time-
stamp without time zone, and timestamp with time zone should provide a complete
range of date/time functionality required by any application.
The types abstime and reltime are lower precision types which are used internally. You are
discouraged from using these types in applications; these internal types might disappear in a future
release.
PostgreSQL is more flexible in handling date/time input than the SQL standard requires. See Appen-
dix B for the exact parsing rules of date/time input and for the recognized text fields including months,
days of the week, and time zones.
Remember that any date or time literal input needs to be enclosed in single quotes, like text strings.
Refer to Section 4.1.2.7 for more information. SQL requires the following syntax:
type [ (p) ] 'value'
where p is an optional precision specification giving the number of fractional digits in the seconds
field. Precision can be specified for time, timestamp, and interval types, and can range from
0 to 6. If no precision is specified in a constant specification, it defaults to the precision of the literal
value (but not more than 6 digits).
8.5.1.1. Dates
Table 8.10 shows some possible inputs for the date type.
Example             Description
08-Jan-1999         January 8 in any mode
99-Jan-08           January 8 in YMD mode, else error
08-Jan-99           January 8, except error in YMD mode
Jan-08-99           January 8, except error in YMD mode
19990108            ISO 8601; January 8, 1999 in any mode
990108              ISO 8601; January 8, 1999 in any mode
1999.008            year and day of year
J2451187            Julian date
January 8, 99 BC    year 99 BC
8.5.1.2. Times
The time-of-day types are time [ (p) ] without time zone and time [ (p) ] with
time zone. time alone is equivalent to time without time zone.
Valid input for these types consists of a time of day followed by an optional time zone. (See Table 8.11
and Table 8.12.) If a time zone is specified in the input for time without time zone, it is silently
ignored. You can also specify a date but it will be ignored, except when you use a time zone name
that involves a daylight-savings rule, such as America/New_York. In this case specifying the date
is required in order to determine whether standard or daylight-savings time applies. The appropriate
time zone offset is recorded in the time with time zone value.
Example              Description
America/New_York     Full time zone name
PST8PDT              POSIX-style time zone specification
-8:00:00             UTC offset for PST
-8:00                UTC offset for PST (ISO 8601 extended format)
-800                 UTC offset for PST (ISO 8601 basic format)
-8                   UTC offset for PST (ISO 8601 basic format)
zulu                 Military abbreviation for UTC
z                    Short form of zulu (also in ISO 8601)
Refer to Section 8.5.3 for more information on how to specify time zones.
Valid input for the timestamp types consists of the concatenation of a date and a time, followed by
an optional time zone. Thus, both:
1999-01-08 04:05:06
and:
1999-01-08 04:05:06 -8:00
are valid values, which follow the ISO 8601 standard. In addition, the common format:
January 8 04:05:06 1999 PST
is supported.
The SQL standard differentiates timestamp without time zone and timestamp with
time zone literals by the presence of a “+” or “-” symbol and time zone offset after the time. Hence,
according to the standard,
TIMESTAMP '2004-10-19 10:23:54+02'
is a timestamp with time zone. PostgreSQL never examines the content of a literal string
before determining its type, and therefore will treat both of the above as timestamp without
time zone. To ensure that a literal is treated as timestamp with time zone, give it the
correct explicit type:
TIMESTAMP WITH TIME ZONE '2004-10-19 10:23:54+02'
In a literal that has been determined to be timestamp without time zone, PostgreSQL will
silently ignore any time zone indication. That is, the resulting value is derived from the date/time fields
in the input value, and is not adjusted for time zone.
For timestamp with time zone, the internally stored value is always in UTC (Universal
Coordinated Time, traditionally known as Greenwich Mean Time, GMT). An input value that has an
explicit time zone specified is converted to UTC using the appropriate offset for that time zone. If
no time zone is stated in the input string, then it is assumed to be in the time zone indicated by the
system's TimeZone parameter, and is converted to UTC using the offset for the timezone zone.
When a timestamp with time zone value is output, it is always converted from UTC to the
current timezone zone, and displayed as local time in that zone. To see the time in another time
zone, either change timezone or use the AT TIME ZONE construct (see Section 9.9.3).
Conversions between timestamp without time zone and timestamp with time zone
normally assume that the timestamp without time zone value should be taken or given as
timezone local time. A different time zone can be specified for the conversion using AT TIME
ZONE.
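For example (a sketch):
SELECT TIMESTAMP WITH TIME ZONE '2004-10-19 10:23:54+02' AT TIME ZONE 'America/New_York';
-- yields a timestamp without time zone, here 2004-10-19 04:23:54 (Eastern Daylight Time)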
The following SQL-compatible functions can also be used to obtain the current time value for the cor-
responding data type: CURRENT_DATE, CURRENT_TIME, CURRENT_TIMESTAMP, LOCALTIME,
LOCALTIMESTAMP. (See Section 9.9.4.) Note that these are SQL functions and are not recognized
in data input strings.
Caution
While the input strings now, today, tomorrow, and yesterday are fine to use in inter-
active SQL commands, they can have surprising behavior when the command is saved to be
executed later, for example in prepared statements, views, and function definitions. The string
can be converted to a specific time value that continues to be used long after it becomes stale.
Use one of the SQL functions instead in such contexts. For example, CURRENT_DATE + 1
is safer than 'tomorrow'::date.
The output format of the date/time types can be set to one of the four styles ISO 8601, SQL (Ingres),
traditional POSTGRES (Unix date format), or German. The output is generally only the date or time
part in accordance with the given examples. However, the POSTGRES
style outputs date-only values in ISO format.
Note
ISO 8601 specifies the use of uppercase letter T to separate the date and time. PostgreSQL
accepts that format on input, but on output it uses a space rather than T, as shown above. This
is for readability and for consistency with RFC 3339 as well as some other database systems.
In the SQL and POSTGRES styles, day appears before month if DMY field ordering has been spec-
ified, otherwise month appears before day. (See Section 8.5.1 for how this setting also affects inter-
pretation of input values.) Table 8.15 shows examples.
In the ISO style, the time zone is always shown as a signed numeric offset from UTC, with positive
sign used for zones east of Greenwich. The offset will be shown as hh (hours only) if it is an integral
number of hours, else as hh:mm if it is an integral number of minutes, else as hh:mm:ss. (The third case
is not possible with any modern time zone standard, but it can appear when working with timestamps
that predate the adoption of standardized time zones.) In the other date styles, the time zone is shown
as an alphabetic abbreviation if one is in common use in the current zone. Otherwise it appears as a
signed numeric offset in ISO 8601 basic format (hh or hhmm).
The date/time style can be selected by the user using the SET datestyle command, the DateStyle
parameter in the postgresql.conf configuration file, or the PGDATESTYLE environment vari-
able on the server or client.
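For example (a sketch; the exact rendering follows the style tables referenced above):
SET datestyle TO 'SQL, DMY';
SELECT timestamp '1997-12-17 07:37:16';   -- displayed roughly as 17/12/1997 07:37:16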
The formatting function to_char (see Section 9.8) is also available as a more flexible way to format
date/time output.
Time zones and time-zone conventions are influenced by political decisions, not just geometry, and remain
prone to arbitrary changes, particularly with respect to daylight-savings rules. PostgreSQL uses the
widely-used IANA (Olson) time zone database for information about historical time zone rules. For
times in the future, the assumption is that the latest known rules for a given time zone will continue
to be observed indefinitely far into the future.
PostgreSQL endeavors to be compatible with the SQL standard definitions for typical usage. However,
the SQL standard has an odd mix of date and time types and capabilities. Two obvious problems are:
• Although the date type cannot have an associated time zone, the time type can. Time zones in
the real world have little meaning unless associated with a date as well as a time, since the offset
can vary through the year with daylight-saving time boundaries.
• The default time zone is specified as a constant numeric offset from UTC. It is therefore impossible
to adapt to daylight-saving time when doing date/time arithmetic across DST boundaries.
To address these difficulties, we recommend using date/time types that contain both date and time
when using time zones. We do not recommend using the type time with time zone (though
it is supported by PostgreSQL for legacy applications and for compliance with the SQL standard).
PostgreSQL assumes your local time zone for any type containing only date or time.
All timezone-aware dates and times are stored internally in UTC. They are converted to local time in
the zone specified by the TimeZone configuration parameter before being displayed to the client.
• A full time zone name, for example America/New_York. The recognized time zone names are
listed in the pg_timezone_names view (see Section 52.90). PostgreSQL uses the widely-used
IANA time zone data for this purpose, so the same time zone names are also recognized by other
software.
• A time zone abbreviation, for example PST. Such a specification merely defines a particular offset
from UTC, in contrast to full time zone names which can imply a set of daylight savings transition
rules as well. The recognized abbreviations are listed in the pg_timezone_abbrevs view (see
Section 52.89). You cannot set the configuration parameters TimeZone or log_timezone to a time
zone abbreviation, but you can use abbreviations in date/time input values and with the AT TIME
ZONE operator.
• In addition to the timezone names and abbreviations, PostgreSQL will accept POSIX-style time
zone specifications, as described in Section B.5. This option is not normally preferable to using a
named time zone, but it may be necessary if no suitable IANA time zone entry is available.
In short, this is the difference between abbreviations and full names: abbreviations represent a specific
offset from UTC, whereas many of the full names imply a local daylight-savings time rule, and so have
two possible UTC offsets. As an example, 2014-06-04 12:00 America/New_York represents
noon local time in New York, which for this particular date was Eastern Daylight Time (UTC-4). So
2014-06-04 12:00 EDT specifies that same time instant. But 2014-06-04 12:00 EST
specifies noon Eastern Standard Time (UTC-5), regardless of whether daylight savings was nominally
in effect on that date.
To complicate matters, some jurisdictions have used the same timezone abbreviation to mean different
UTC offsets at different times; for example, in Moscow MSK has meant UTC+3 in some years and
UTC+4 in others. PostgreSQL interprets such abbreviations according to whatever they meant (or had
most recently meant) on the specified date; but, as with the EST example above, this is not necessarily
the same as local civil time on that date.
In all cases, timezone names and abbreviations are recognized case-insensitively. (This is a change
from PostgreSQL versions prior to 8.2, which were case-sensitive in some contexts but not others.)
Neither timezone names nor abbreviations are hard-wired into the server; they are obtained from con-
figuration files stored under .../share/timezone/ and .../share/timezonesets/ of
the installation directory (see Section B.4).
The TimeZone configuration parameter can be set in the file postgresql.conf, or in any of the
other standard ways described in Chapter 19. There are also some special ways to set it:
• The SQL command SET TIME ZONE sets the time zone for the session. This is an alternative
spelling of SET TIMEZONE TO with a more SQL-spec-compatible syntax.
• The PGTZ environment variable is used by libpq clients to send a SET TIME ZONE command
to the server upon connection.
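For example (a sketch):
SET TIME ZONE 'America/New_York';
SHOW TimeZone;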
Quantities of days, hours, minutes, and seconds can be specified without explicit unit markings. For
example, '1 12:59:10' is read the same as '1 day 12 hours 59 min 10 sec'. Also,
a combination of years and months can be specified with a dash; for example '200-10' is read the
same as '200 years 10 months'. (These shorter forms are in fact the only ones allowed by the
SQL standard, and are used for output when IntervalStyle is set to sql_standard.)
Interval values can also be written as ISO 8601 time intervals, using either the “format with designa-
tors” of the standard's section 4.4.3.2 or the “alternative format” of section 4.4.3.3. The format with
designators looks like this:
P quantity unit [ quantity unit ... ] [ T [ quantity unit ... ] ]
The string must start with a P, and may include a T that introduces the time-of-day units. The available
unit abbreviations are given in Table 8.16. Units may be omitted, and may be specified in any order,
but units smaller than a day must appear after T. In particular, the meaning of M depends on whether
it is before or after T.
Alternatively, an ISO 8601 interval can be written in the “alternative format”:
P [ years-months-days ] [ T hours:minutes:seconds ]
the string must begin with P, and a T separates the date and time parts of the interval. The values are
given as numbers similar to ISO 8601 dates.
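Both notations can express the same interval, for example (an illustrative sketch):
SELECT INTERVAL 'P1Y2M3DT4H5M6S';        -- format with designators
SELECT INTERVAL 'P0001-02-03T04:05:06';  -- alternative format; same value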
When writing an interval constant with a fields specification, or when assigning a string to an in-
terval column that was defined with a fields specification, the interpretation of unmarked quantities
depends on the fields. For example INTERVAL '1' YEAR is read as 1 year, whereas
INTERVAL '1' means 1 second. Also, field values “to the right” of the least significant field allowed by the
fields specification are silently discarded. For example, writing INTERVAL '1 day 2:03:04'
HOUR TO MINUTE results in dropping the seconds field, but not the day field.
According to the SQL standard all fields of an interval value must have the same sign, so a leading
negative sign applies to all fields; for example the negative sign in the interval literal '-1 2:03:04'
applies to both the days and hour/minute/second parts. PostgreSQL allows the fields to have different
signs, and traditionally treats each field in the textual representation as independently signed, so that
the hour/minute/second part is considered positive in this example. If IntervalStyle is set to
sql_standard then a leading sign is considered to apply to all fields (but only if no additional
signs appear). Otherwise the traditional PostgreSQL interpretation is used. To avoid ambiguity, it's
recommended to attach an explicit sign to each field if any field is negative.
Field values can have fractional parts: for example, '1.5 weeks' or '01:02:03.45'. However,
because interval internally stores only three integer units (months, days, microseconds), fractional
units must be spilled to smaller units. Fractional parts of units greater than months are truncated to be
an integer number of months, e.g. '1.5 years' becomes '1 year 6 mons'. Fractional parts
of weeks and days are computed to be an integer number of days and microseconds, assuming 30 days
per month and 24 hours per day, e.g., '1.75 months' becomes 1 mon 22 days 12:00:00.
Only seconds will ever be shown as fractional on output.
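For example (the results follow directly from the rules just stated):
SELECT INTERVAL '1.5 years';     -- 1 year 6 mons
SELECT INTERVAL '1.75 months';   -- 1 mon 22 days 12:00:00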
Internally interval values are stored as months, days, and microseconds. This is done because
the number of days in a month varies, and a day can have 23 or 25 hours if a daylight savings time
adjustment is involved. The months and days fields are integers while the microseconds field can
store fractional seconds. Because intervals are usually created from constant strings or timestamp
subtraction, this storage method works well in most cases, but can cause unexpected results:
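For instance (a sketch of the kind of surprise meant here):
SELECT EXTRACT(hours FROM '80 minutes'::interval);   -- 1
SELECT EXTRACT(days FROM '80 hours'::interval);      -- 0, because days are a separate field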
Functions justify_days and justify_hours are available for adjusting days and hours that
overflow their normal ranges.
The sql_standard style produces output that conforms to the SQL standard's specification for
interval literal strings, if the interval value meets the standard's restrictions (either year-month only or
day-time only, with no mixing of positive and negative components). Otherwise the output looks like
a standard year-month literal string followed by a day-time literal string, with explicit signs added to
disambiguate mixed-sign intervals.
The output of the postgres style matches the output of PostgreSQL releases prior to 8.4 when the
DateStyle parameter was set to ISO.
The output of the postgres_verbose style matches the output of PostgreSQL releases prior to
8.4 when the DateStyle parameter was set to non-ISO output.
The output of the iso_8601 style matches the “format with designators” described in section 4.4.3.2
of the ISO 8601 standard.
Boolean constants can be represented in SQL queries by the SQL key words TRUE, FALSE, and NULL.
The datatype input function for type boolean accepts these string representations for the “true” state:
true
yes
on
1
and these representations for the “false” state:
false
no
off
0
Unique prefixes of these strings are also accepted, for example t or n. Leading or trailing whitespace
is ignored, and case does not matter.
The datatype output function for type boolean always emits either t or f, as shown in Example 8.2.
The key words TRUE and FALSE are the preferred (SQL-compliant) method for writing Boolean
constants in SQL queries. But you can also use the string representations by following the generic
string-literal constant syntax described in Section 4.1.2.7, for example 'yes'::boolean.
Note that the parser automatically understands that TRUE and FALSE are of type boolean, but this
is not so for NULL because that can have any type. So in some contexts you might have to cast NULL
to boolean explicitly, for example NULL::boolean. Conversely, the cast can be omitted from a
string-literal Boolean value in contexts where the parser can deduce that the literal must be of type
boolean.
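A short sketch in the spirit of the example referenced above (the table name is illustrative):
CREATE TABLE bool_demo (a boolean, b text);
INSERT INTO bool_demo VALUES (TRUE, 'sic est');
INSERT INTO bool_demo VALUES (FALSE, 'non est');
SELECT * FROM bool_demo WHERE a;   -- a boolean column can be used directly as a predicate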
Once created, the enum type can be used in table and function definitions much like any other type:
CREATE TABLE person (
    name text,
    current_mood mood
);
INSERT INTO person VALUES ('Moe', 'happy');
SELECT * FROM person WHERE current_mood = 'happy';
name | current_mood
------+--------------
Moe | happy
(1 row)
8.7.2. Ordering
The ordering of the values in an enum type is the order in which the values were listed when the
type was created. All standard comparison operators and related aggregate functions are supported
for enums. For example:
SELECT name
FROM person
WHERE current_mood = (SELECT MIN(current_mood) FROM person);
name
-------
Larry
(1 row)
Values of different enum types cannot be compared directly. If you really need to do something like
that, you can either write a custom operator or add explicit casts to your query:
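For example (a sketch; the happiness type is hypothetical, and mood is the enum type used above):
CREATE TYPE happiness AS ENUM ('happy', 'very happy', 'ecstatic');
SELECT 'happy'::mood::text = 'happy'::happiness::text;   -- compare via text casts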
Although enum types are primarily intended for static sets of values, there is support for adding new
values to an existing enum type, and for renaming values (see ALTER TYPE). Existing values cannot
be removed from an enum type, nor can the sort ordering of such values be changed, short of dropping
and re-creating the enum type.
An enum value occupies four bytes on disk. The length of an enum value's textual label is limited by
the NAMEDATALEN setting compiled into PostgreSQL; in standard builds this means at most 63 bytes.
The translations from internal enum values to textual labels are kept in the system catalog pg_enum.
Querying this catalog directly can be useful.
A rich set of functions and operators is available to perform various geometric operations such as
scaling, translation, rotation, and determining intersections. They are explained in Section 9.11.
8.8.1. Points
Points are the fundamental two-dimensional building block for geometric types. Values of type point
are specified using either of the following syntaxes:
( x , y )
x , y
8.8.2. Lines
Lines are represented by the linear equation Ax + By + C = 0, where A and B are not both zero. Values
of type line are input and output in the following form:
{ A, B, C }
Alternatively, any of the following forms can be used for line input:
[ ( x1 , y1 ) , ( x2 , y2 ) ]
( ( x1 , y1 ) , ( x2 , y2 ) )
( x1 , y1 ) , ( x2 , y2 )
x1 , y1 , x2 , y2
where (x1,y1) and (x2,y2) are two different points on the line.
8.8.3. Line Segments
Line segments are represented by pairs of points that are the endpoints of the segment. Values of type
lseg are specified using any of the following syntaxes:
[ ( x1 , y1 ) , ( x2 , y2 ) ]
( ( x1 , y1 ) , ( x2 , y2 ) )
( x1 , y1 ) , ( x2 , y2 )
x1 , y1 , x2 , y2
where (x1,y1) and (x2,y2) are the end points of the line segment.
8.8.4. Boxes
Boxes are represented by pairs of points that are opposite corners of the box. Values of type box are
specified using any of the following syntaxes:
( ( x1 , y1 ) , ( x2 , y2 ) )
( x1 , y1 ) , ( x2 , y2 )
x1 , y1 , x2 , y2
where (x1,y1) and (x2,y2) are any two opposite corners of the box.
Any two opposite corners can be supplied on input, but the values will be reordered as needed to store
the upper right and lower left corners, in that order.
8.8.5. Paths
Paths are represented by lists of connected points. Paths can be open, where the first and last points in
the list are considered not connected, or closed, where the first and last points are considered connected.
Values of type path are specified using any of the following syntaxes:
[ ( x1 , y1 ) , ... , ( xn , yn ) ]
( ( x1 , y1 ) , ... , ( xn , yn ) )
( x1 , y1 ) , ... , ( xn , yn )
( x1 , y1 , ... , xn , yn )
x1 , y1 , ... , xn , yn
where the points are the end points of the line segments comprising the path. Square brackets ([])
indicate an open path, while parentheses (()) indicate a closed path. When the outermost parentheses
are omitted, as in the third through fifth syntaxes, a closed path is assumed.
8.8.6. Polygons
Polygons are represented by lists of points (the vertexes of the polygon). Polygons are very similar to
closed paths, but are stored differently and have their own set of support routines.
Values of type polygon are specified using any of the following syntaxes:
( ( x1 , y1 ) , ... , ( xn , yn ) )
( x1 , y1 ) , ... , ( xn , yn )
( x1 , y1 , ... , xn , yn )
x1 , y1 , ... , xn , yn
where the points are the end points of the line segments comprising the boundary of the polygon.
8.8.7. Circles
Circles are represented by a center point and radius. Values of type circle are specified using any
of the following syntaxes:
< ( x , y ) , r >
( ( x , y ) , r )
( x , y ) , r
x , y , r
where (x,y) is the center point and r is the radius of the circle.
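A few literals of these types side by side (an illustrative sketch):
SELECT point '(1,2)', lseg '[(0,0),(1,1)]', box '((0,0),(2,2))', circle '<(0,0),5>';
SELECT box '((0,0),(2,2))' @> point '(1,1)';   -- a containment test using a geometric operator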
When sorting inet or cidr data types, IPv4 addresses will always sort before IPv6 addresses, in-
cluding IPv4 addresses encapsulated or mapped to IPv6 addresses, such as ::10.2.3.4 or ::ffff:10.4.3.2.
8.9.1. inet
The inet type holds an IPv4 or IPv6 host address, and optionally its subnet, all in one field. The subnet
is represented by the number of network address bits present in the host address (the “netmask”). If
the netmask is 32 and the address is IPv4, then the value does not indicate a subnet, only a single host.
In IPv6, the address length is 128 bits, so 128 bits specify a unique host address. Note that if you want
to accept only networks, you should use the cidr type rather than inet.
The input format for this type is address/y where address is an IPv4 or IPv6 address and y is
the number of bits in the netmask. If the /y portion is missing, the netmask is 32 for IPv4 and 128
for IPv6, so the value represents just a single host. On display, the /y portion is suppressed if the
netmask specifies a single host.
8.9.2. cidr
The cidr type holds an IPv4 or IPv6 network specification. Input and output formats follow Classless
Internet Domain Routing conventions. The format for specifying networks is address/y where
address is the network represented as an IPv4 or IPv6 address, and y is the number of bits in the
netmask. If y is omitted, it is calculated using assumptions from the older classful network numbering
system, except it will be at least large enough to include all of the octets written in the input. It is an
error to specify a network address that has bits set to the right of the specified netmask.
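For example (a sketch):
SELECT '192.168.100.128/25'::inet;   -- a host address together with its subnet
SELECT '192.168/24'::cidr;           -- 192.168.0.0/24
SELECT '10.1.2.3/32'::cidr;          -- a single-host network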
Tip
If you do not like the output format for inet or cidr values, try the functions host, text,
and abbrev.
8.9.4. macaddr
The macaddr type stores MAC addresses, known for example from Ethernet card hardware addresses
(although MAC addresses are used for other purposes as well). Input is accepted in the following
formats:
'08:00:2b:01:02:03'
'08-00-2b-01-02-03'
'08002b:010203'
'08002b-010203'
'0800.2b01.0203'
'0800-2b01-0203'
'08002b010203'
These examples would all specify the same address. Upper and lower case is accepted for the digits
a through f. Output is always in the first of the forms shown.
IEEE Standard 802-2001 specifies the second form shown (with hyphens) as the canonical form for
MAC addresses, and specifies the first form (with colons) as used with bit-reversed, MSB-first nota-
tion, so that 08-00-2b-01-02-03 = 10:00:D4:80:40:C0. This convention is widely ignored nowadays,
and it is relevant only for obsolete network protocols (such as Token Ring). PostgreSQL makes no
provisions for bit reversal; all accepted formats use the canonical LSB order.
The remaining five input formats are not part of any standard.
8.9.5. macaddr8
The macaddr8 type stores MAC addresses in EUI-64 format, known for example from Ethernet
card hardware addresses (although MAC addresses are used for other purposes as well). This type
can accept both 6 and 8 byte length MAC addresses and stores them in 8 byte length format. MAC
addresses given in 6 byte format will be stored in 8 byte length format with the 4th and 5th bytes set
to FF and FE, respectively. Note that IPv6 uses a modified EUI-64 format where the 7th bit should
be set to one after the conversion from EUI-48. The function macaddr8_set7bit is provided to
make this change. Generally speaking, any input which is comprised of pairs of hex digits (on byte
boundaries), optionally separated consistently by one of ':', '-' or '.', is accepted. The number
of hex digits must be either 16 (8 bytes) or 12 (6 bytes). Leading and trailing whitespace is ignored.
The following are examples of input formats that are accepted:
'08:00:2b:01:02:03:04:05'
'08-00-2b-01-02-03-04-05'
'08002b:0102030405'
'08002b-0102030405'
'0800.2b01.0203.0405'
'0800-2b01-0203-0405'
'08002b01:02030405'
'08002b0102030405'
These examples would all specify the same address. Upper and lower case is accepted for the digits
a through f. Output is always in the first of the forms shown. The last six input formats that are
mentioned above are not part of any standard. To convert a traditional 48 bit MAC address in EUI-48
format to modified EUI-64 format to be included as the host portion of an IPv6 address, use
macaddr8_set7bit as shown:
SELECT macaddr8_set7bit('08:00:2b:01:02:03');
macaddr8_set7bit
-------------------------
0a:00:2b:ff:fe:01:02:03
(1 row)
bit type data must match the length n exactly; it is an error to attempt to store shorter or longer bit
strings. bit varying data is of variable length up to the maximum length n; longer strings will
be rejected. Writing bit without a length is equivalent to bit(1), while bit varying without
a length specification means unlimited length.
Note
If one explicitly casts a bit-string value to bit(n), it will be truncated or zero-padded on the
right to be exactly n bits, without raising an error. Similarly, if one explicitly casts a bit-string
value to bit varying(n), it will be truncated on the right if it is more than n bits.
Refer to Section 4.1.2.5 for information about the syntax of bit string constants. Bit-logical operators
and string manipulation functions are available; see Section 9.6.
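The result shown below evidently comes from statements along these lines (a reconstruction; the
table and column names are inferred from the output):
CREATE TABLE test (a BIT(3), b BIT VARYING(5));
INSERT INTO test VALUES (B'101', B'00');
INSERT INTO test VALUES (B'10', B'101');          -- fails: length 2 does not match type bit(3)
INSERT INTO test VALUES (B'10'::bit(3), B'101');  -- explicit cast zero-pads to 100
SELECT * FROM test;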
a | b
-----+-----
101 | 00
100 | 101
A bit string value requires 1 byte for each group of 8 bits, plus 5 or 8 bytes overhead depending on
the length of the string (but long values may be compressed or moved out-of-line, as explained in
Section 8.3 for character strings).
8.11.1. tsvector
A tsvector value is a sorted list of distinct lexemes, which are words that have been normalized
to merge different variants of the same word (see Chapter 12 for details). Sorting and duplicate-elim-
ination are done automatically during input, as shown in this example:
SELECT 'a fat cat sat on a mat and ate a fat rat'::tsvector;
tsvector
----------------------------------------------------
'a' 'and' 'ate' 'cat' 'fat' 'mat' 'on' 'rat' 'sat'
(Dollar-quoted string literals can be used to avoid the confusion of having to double quote marks
within such literals.) Embedded quotes and backslashes must be doubled. Optionally, integer positions
can be attached to lexemes:
SELECT 'a:1 fat:2 cat:3 sat:4 on:5 a:6 mat:7 and:8 ate:9 a:10
fat:11 rat:12'::tsvector;
tsvector
-------------------------------------------------------------------------------
'a':1,6,10 'and':8 'ate':9 'cat':3 'fat':2,11 'mat':7 'on':5
'rat':12 'sat':4
A position normally indicates the source word's location in the document. Positional information can
be used for proximity ranking. Position values can range from 1 to 16383; larger numbers are silently
set to 16383. Duplicate positions for the same lexeme are discarded.
Lexemes that have positions can further be labeled with a weight, which can be A, B, C, or D. D is the
default and hence is not shown on output:
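For example (a reconstruction of the usual weighted-positions example):
SELECT 'a:1A fat:2B,4C cat:5D'::tsvector;
          tsvector
----------------------------
 'a':1A 'cat':5 'fat':2B,4C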
Weights are typically used to reflect document structure, for example by marking title words differ-
ently from body words. Text search ranking functions can assign different priorities to the different
weight markers.
It is important to understand that the tsvector type itself does not perform any word normalization;
it assumes the words it is given are normalized appropriately for the application. For example, for
most English-text-searching applications, words such as “The Fat Rats” would be considered non-normalized,
but tsvector doesn't care. Raw document text should usually be passed through to_tsvector
to normalize the words appropriately for searching:
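For example (a sketch; the outputs are approximate reconstructions):
SELECT 'The Fat Rats'::tsvector;
      tsvector
--------------------
 'Fat' 'Rats' 'The'

SELECT to_tsvector('english', 'The Fat Rats');
   to_tsvector
-----------------
 'fat':2 'rat':3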
8.11.2. tsquery
A tsquery value stores lexemes that are to be searched for, and can combine them using the Boolean
operators & (AND), | (OR), and ! (NOT), as well as the phrase search operator <-> (FOLLOWED
BY). There is also a variant <N> of the FOLLOWED BY operator, where N is an integer constant that
specifies the distance between the two lexemes being searched for. <-> is equivalent to <1>.
Parentheses can be used to enforce grouping of these operators. In the absence of parentheses, ! (NOT)
binds most tightly, <-> (FOLLOWED BY) next most tightly, then & (AND), with | (OR) binding
the least tightly.
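The output shown below matches a query such as (reconstructed):
SELECT 'fat & rat & ! cat'::tsquery;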
tsquery
------------------------
'fat' & 'rat' & !'cat'
Optionally, lexemes in a tsquery can be labeled with one or more weight letters, which restricts
them to match only tsvector lexemes with one of those weights. Also, lexemes in a tsquery can
be labeled with * to specify prefix matching:
SELECT 'super:*'::tsquery;
tsquery
-----------
'super':*
This query will match any word in a tsvector that begins with “super”.
Quoting rules for lexemes are the same as described previously for lexemes in tsvector; and, as with
tsvector, any required normalization of words must be done before converting to the tsquery
type. The to_tsquery function is convenient for performing such normalization:
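For example (a sketch):
SELECT to_tsquery('Fat:ab & Cats');
    to_tsquery
------------------
 'fat':AB & 'cat'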
Note that to_tsquery will process prefixes in the same way as other words, which means this
comparison returns true:
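For example (a sketch; both words reduce to lexemes beginning with the prefix searched for):
SELECT to_tsvector('postgraduate') @@ to_tsquery('postgres:*');
 ?column?
----------
 t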
The uuid data type stores Universally Unique Identifiers (UUIDs). A UUID is written in its standard
form as a sequence of lower-case hexadecimal digits in several groups separated by hyphens, for example:
a0eebc99-9c0b-4ef8-bb6d-6bb9bd380a11
PostgreSQL also accepts the following alternative forms for input: use of upper-case digits, the stan-
dard format surrounded by braces, omitting some or all hyphens, adding a hyphen after any group of
four digits. Examples are:
A0EEBC99-9C0B-4EF8-BB6D-6BB9BD380A11
{a0eebc99-9c0b-4ef8-bb6d-6bb9bd380a11}
a0eebc999c0b4ef8bb6d6bb9bd380a11
a0ee-bc99-9c0b-4ef8-bb6d-6bb9-bd38-0a11
{a0eebc99-9c0b4ef8-bb6d6bb9-bd380a11}
PostgreSQL provides storage and comparison functions for UUIDs, but the core database does not
include any function for generating UUIDs, because no single algorithm is well suited for every ap-
plication. The uuid-ossp module provides functions that implement several standard algorithms. The
pgcrypto module also provides a generation function for random UUIDs. Alternatively, UUIDs could
be generated by client applications or other libraries invoked through a server-side function.
The xml type can store well-formed “documents”, as defined by the XML standard, as well as “con-
tent” fragments, which are defined by reference to the more permissive “document node”1 of the
XQuery and XPath data model. Roughly, this means that content fragments can have more than one
top-level element or character node. The expression xmlvalue IS DOCUMENT can be used to
evaluate whether a particular xml value is a full document or only a content fragment.
Limits and compatibility notes for the xml data type can be found in Section D.3.
Examples:
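A sketch of the SQL-standard XMLPARSE syntax that the surrounding text refers to:
XMLPARSE (DOCUMENT '<?xml version="1.0"?><book><title>Manual</title><chapter>...</chapter></book>')
XMLPARSE (CONTENT 'abc<foo>bar</foo><bar>foo</bar>')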
While XMLPARSE is the only way to convert character strings into XML values according to the SQL standard,
the PostgreSQL-specific syntaxes:
1 https://www.w3.org/TR/2010/REC-xpath-datamodel-20101214/#DocumentNode
xml '<foo>bar</foo>'
'<foo>bar</foo>'::xml
can also be used.
The xml type does not validate input values against a document type declaration (DTD), even when
the input value specifies a DTD. There is also currently no built-in support for validating against other
XML schema languages such as XML Schema.
The inverse operation, producing a character string value from xml, uses the function xmlserialize:
XMLSERIALIZE ( { DOCUMENT | CONTENT } value AS type )
type can be character, character varying, or text (or an alias for one of those). Again,
according to the SQL standard, this is the only way to convert between type xml and character types,
but PostgreSQL also allows you to simply cast the value.
When a character string value is cast to or from type xml without going through XMLPARSE or
XMLSERIALIZE, respectively, the choice of DOCUMENT versus CONTENT is determined by the “XML
option” session configuration parameter, which can be set using the standard command:
SET XML OPTION { DOCUMENT | CONTENT };
When using binary mode to pass query parameters to the server and query results back to the client, no
encoding conversion is performed, so the situation is different. In this case, an encoding declaration
in the XML data will be observed, and if it is absent, the data will be assumed to be in UTF-8 (as
required by the XML standard; note that PostgreSQL does not support UTF-16). On output, data will
have an encoding declaration specifying the client encoding, unless the client encoding is UTF-8, in
which case it will be omitted.
Needless to say, processing XML data with PostgreSQL will be less error-prone and more efficient if
the XML data encoding, client encoding, and server encoding are the same. Since XML data is inter-
nally processed in UTF-8, computations will be most efficient if the server encoding is also UTF-8.
Caution
Some XML-related functions may not work at all on non-ASCII data when the server encoding
is not UTF-8. This is known to be an issue for xmltable() and xpath() in particular.
Since there are no comparison operators for the xml data type, it is not possible to create an index
directly on a column of this type. If speedy searches in XML data are desired, possible workarounds
include casting the expression to a character string type and indexing that, or indexing an XPath ex-
pression. Of course, the actual query would have to be adjusted to search by the indexed expression.
The text-search functionality in PostgreSQL can also be used to speed up full-document searches of
XML data. The necessary preprocessing support is, however, not yet available in the PostgreSQL
distribution.
There are two JSON data types: json and jsonb. They accept almost identical sets of values as input.
The major practical difference is one of efficiency. The json data type stores an exact copy of the
input text, which processing functions must reparse on each execution; while jsonb data is stored in
a decomposed binary format that makes it slightly slower to input due to added conversion overhead,
but significantly faster to process, since no reparsing is needed. jsonb also supports indexing, which
can be a significant advantage.
Because the json type stores an exact copy of the input text, it will preserve semantically-insignificant
white space between tokens, as well as the order of keys within JSON objects. Also, if a JSON object
within the value contains the same key more than once, all the key/value pairs are kept. (The processing
functions consider the last value as the operative one.) By contrast, jsonb does not preserve white
space, does not preserve the order of object keys, and does not keep duplicate object keys. If duplicate
keys are specified in the input, only the last value is kept.
In general, most applications should prefer to store JSON data as jsonb, unless there are quite spe-
cialized needs, such as legacy assumptions about ordering of object keys.
PostgreSQL allows only one character set encoding per database. It is therefore not possible for the
JSON types to conform rigidly to the JSON specification unless the database encoding is UTF8. At-
tempts to directly include characters that cannot be represented in the database encoding will fail; con-
versely, characters that can be represented in the database encoding but not in UTF8 will be allowed.
RFC 7159 permits JSON strings to contain Unicode escape sequences denoted by \uXXXX. In the in-
put function for the json type, Unicode escapes are allowed regardless of the database encoding, and
2 https://tools.ietf.org/html/rfc7159
are checked only for syntactic correctness (that is, that four hex digits follow \u). However, the input
function for jsonb is stricter: it disallows Unicode escapes for non-ASCII characters (those above U
+007F) unless the database encoding is UTF8. The jsonb type also rejects \u0000 (because that
cannot be represented in PostgreSQL's text type), and it insists that any use of Unicode surrogate
pairs to designate characters outside the Unicode Basic Multilingual Plane be correct. Valid Unicode
escapes are converted to the equivalent ASCII or UTF8 character for storage; this includes folding
surrogate pairs into a single character.
Note
Many of the JSON processing functions described in Section 9.15 will convert Unicode es-
capes to regular characters, and will therefore throw the same types of errors just described
even if their input is of type json not jsonb. The fact that the json input function does
not make these checks may be considered a historical artifact, although it does allow for sim-
ple storage (without processing) of JSON Unicode escapes in a non-UTF8 database encoding.
In general, it is best to avoid mixing Unicode escapes in JSON with a non-UTF8 database
encoding, if possible.
When converting textual JSON input into jsonb, the primitive types described by RFC 7159 are
effectively mapped onto native PostgreSQL types, as shown in Table 8.23. Therefore, there are some
minor additional constraints on what constitutes valid jsonb data that do not apply to the json type,
nor to JSON in the abstract, corresponding to limits on what can be represented by the underlying data
type. Notably, jsonb will reject numbers that are outside the range of the PostgreSQL numeric
data type, while json will not. Such implementation-defined restrictions are permitted by RFC 7159.
However, in practice such problems are far more likely to occur in other implementations, as it is
common to represent JSON's number primitive type as IEEE 754 double precision floating point
(which RFC 7159 explicitly anticipates and allows for). When using JSON as an interchange format
with such systems, the danger of losing numeric precision compared to data originally stored by Post-
greSQL should be considered.
Conversely, as noted in the table there are some minor restrictions on the input format of JSON prim-
itive types that do not apply to the corresponding PostgreSQL types.
SELECT '5'::json;
As previously stated, when a JSON value is input and then printed without any additional processing,
json outputs the same text that was input, while jsonb does not preserve semantically-insignificant
details such as whitespace. For example, note the differences here:
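For example (a sketch; the column formatting is approximate):
SELECT '{"bar": "baz", "balance": 7.77, "active":false}'::json;
                      json
-------------------------------------------------
 {"bar": "baz", "balance": 7.77, "active":false}

SELECT '{"bar": "baz", "balance": 7.77, "active":false}'::jsonb;
                      jsonb
--------------------------------------------------
 {"bar": "baz", "active": false, "balance": 7.77}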
One semantically-insignificant detail worth noting is that in jsonb, numbers will be printed according
to the behavior of the underlying numeric type. In practice this means that numbers entered with E
notation will be printed without it, for example:
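A sketch (note that jsonb also keeps the trailing fractional zero here):
SELECT '{"reading": 1.230e-5}'::json, '{"reading": 1.230e-5}'::jsonb;
         json          |          jsonb
-----------------------+-------------------------
 {"reading": 1.230e-5} | {"reading": 0.00001230}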
However, jsonb will preserve trailing fractional zeroes, as seen in this example, even though those
are semantically insignificant for purposes such as equality checks.
JSON data is subject to the same concurrency-control considerations as any other data type when
stored in a table. Although storing large documents is practicable, keep in mind that any update ac-
quires a row-level lock on the whole row. Consider limiting JSON documents to a manageable size
in order to decrease lock contention among updating transactions. Ideally, JSON documents should
each represent an atomic datum that business rules dictate cannot reasonably be further subdivided
into smaller datums that could be modified independently.
-- The array on the right side is contained within the one on the
left:
SELECT '[1, 2, 3]'::jsonb @> '[1, 3]'::jsonb;
The general principle is that the contained object must match the containing object as to structure and
data contents, possibly after discarding some non-matching array elements or object key/value pairs
from the containing object. But remember that the order of array elements is not significant when
doing a containment match, and duplicate array elements are effectively considered only once.
As a special exception to the general principle that the structures must match, an array may contain
a primitive value:
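For example (a sketch):
-- This array contains the primitive string value:
SELECT '["foo", "bar"]'::jsonb @> '"bar"'::jsonb;   -- true

-- The exception is not reciprocal; non-containment is reported here:
SELECT '"bar"'::jsonb @> '["bar"]'::jsonb;          -- false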
jsonb also has an existence operator, which is a variation on the theme of containment: it tests
whether a string (given as a text value) appears as an object key or array element at the top level of
the jsonb value. These examples return true except as noted:
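A few sketches (comments mark the false cases):
SELECT '["foo", "bar", "baz"]'::jsonb ? 'bar';     -- true: array element at the top level
SELECT '{"foo": "bar"}'::jsonb ? 'foo';            -- true: object key at the top level
SELECT '{"foo": "bar"}'::jsonb ? 'bar';            -- false: object values are not considered
SELECT '{"foo": {"bar": "baz"}}'::jsonb ? 'bar';   -- false: only the top level is checked
SELECT '"foo"'::jsonb ? 'foo';                     -- true: matches a primitive JSON string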
JSON objects are better suited than arrays for testing containment or existence when there are many
keys or elements involved, because unlike arrays they are internally optimized for searching, and do
not need to be searched linearly.
Tip
Because JSON containment is nested, an appropriate query can skip explicit selection of sub-
objects. As an example, suppose that we have a doc column containing objects at the top level,
with most objects containing tags fields that contain arrays of sub-objects. This query finds
entries in which sub-objects containing both "term":"paris" and "term":"food" ap-
pear, while ignoring any such keys outside the tags array:
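A sketch of such a query (the websites table name is assumed for illustration):
SELECT doc->'site_name' FROM websites
  WHERE doc @> '{"tags":[{"term":"paris"}, {"term":"food"}]}';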
(One could instead AND together a separate @> containment test for each term,
but that approach is less flexible, and often less efficient as well.)
On the other hand, the JSON existence operator is not nested: it will only look for the specified
key or array element at top level of the JSON value.
The various containment and existence operators, along with all other JSON operators and functions
are documented in Section 9.15.
The default GIN operator class for jsonb supports queries with top-level key-exists operators ?, ?&
and ?| operators and path/value-exists operator @>. (For details of the semantics that these operators
implement, see Table 9.44.) An example of creating an index with this operator class is:
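For example (a sketch; idxgin is an illustrative name, and api/jdoc are the table and column used in
the example further below):
CREATE INDEX idxgin ON api USING GIN (jdoc);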
The non-default GIN operator class jsonb_path_ops supports indexing the @> operator only. An
example of creating an index with this operator class is:
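Again a sketch, using the same table and an illustrative index name:
CREATE INDEX idxginp ON api USING GIN (jdoc jsonb_path_ops);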
Consider the example of a table that stores JSON documents retrieved from a third-party web service,
with a documented schema definition. A typical document is:
{
"guid": "9c36adc1-7fb5-4d5b-83b4-90356a46061a",
"name": "Angela Barton",
"is_active": true,
"company": "Magnafone",
"address": "178 Howard Place, Gulf, Washington, 702",
"registered": "2009-11-07T08:53:22 +08:00",
"latitude": 19.793713,
"longitude": 86.513373,
"tags": [
"enim",
"aliquip",
"qui"
]
}
We store these documents in a table named api, in a jsonb column named jdoc. If a GIN index is
created on this column, queries like the following can make use of the index:
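For example (a reconstruction):
-- Find documents in which the key "company" has value "Magnafone"
SELECT jdoc->'guid', jdoc->'name' FROM api WHERE jdoc @> '{"company": "Magnafone"}';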
However, the index could not be used for queries like the following, because though the operator ? is
indexable, it is not applied directly to the indexed column jdoc:
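For example (a reconstruction):
-- Find documents in which the key "tags" contains key or array element "qui"
SELECT jdoc->'guid', jdoc->'name' FROM api WHERE jdoc -> 'tags' ? 'qui';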
Still, with appropriate use of expression indexes, the above query can use an index. If querying for
particular items within the "tags" key is common, defining an index like this may be worthwhile:
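For example (idxgintags is an illustrative name):
CREATE INDEX idxgintags ON api USING GIN ((jdoc -> 'tags'));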
Now, the WHERE clause jdoc -> 'tags' ? 'qui' will be recognized as an application of the
indexable operator ? to the indexed expression jdoc -> 'tags'. (More information on expression
indexes can be found in Section 11.7.)
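Another way to find the same documents is to exploit containment, for example (a sketch):
-- Find documents in which the key "tags" contains array element "qui"
SELECT jdoc->'guid', jdoc->'name' FROM api WHERE jdoc @> '{"tags": ["qui"]}';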
A simple GIN index on the jdoc column can support this query. But note that such an index will
store copies of every key and value in the jdoc column, whereas the expression index of the previous
example stores only data found under the tags key. While the simple-index approach is far more
flexible (since it supports queries about any key), targeted expression indexes are likely to be smaller
and faster to search than a simple index.
Although the jsonb_path_ops operator class supports only queries with the @> operator, it has
notable performance advantages over the default operator class jsonb_ops. A jsonb_path_ops
index is usually much smaller than a jsonb_ops index over the same data, and the specificity of
searches is better, particularly when queries contain keys that appear frequently in the data. Therefore
search operations typically perform better than with the default operator class.
The technical difference between a jsonb_ops and a jsonb_path_ops GIN index is that the
former creates independent index items for each key and value in the data, while the latter creates
index items only for each value in the data. Basically, each jsonb_path_ops index item is a
hash of the value and the key(s) leading to it; for example to index {"foo": {"bar": "baz"}},
a single index item would be created incorporating all three of foo, bar, and baz into the hash
value. Thus a containment query looking for this structure would result in an extremely specific index
search; but there is no way at all to find out whether foo appears as a key. On the other hand, a
jsonb_ops index would create three index items representing foo, bar, and baz separately; then
to do the containment query, it would look for rows containing all three of these items. While GIN
indexes can perform such an AND search fairly efficiently, it will still be less specific and slower
than the equivalent jsonb_path_ops search, especially if there are a very large number of rows
containing any single one of the three index items.
A disadvantage of the jsonb_path_ops approach is that it produces no index entries for JSON
structures not containing any values, such as {"a": {}}. If a search for documents containing such
a structure is requested, it will require a full-index scan, which is quite slow. jsonb_path_ops is
therefore ill-suited for applications that often perform such searches.
jsonb also supports btree and hash indexes. These are usually useful only if it's important to
check equality of complete JSON documents. The btree ordering for jsonb datums is seldom of
great interest, but for completeness it is:
Object > Array > Boolean > Number > String > Null
Note that object keys are compared in their storage order; in particular, since shorter keys are stored
before longer keys, this can lead to results that might be unintuitive, such as:
Similarly, arrays with equal numbers of elements are compared in the order:
element-1, element-2 ...
Primitive JSON values are compared using the same comparison rules as for the underlying Post-
greSQL data type. Strings are compared using the default database collation.
8.14.5. Transforms
Additional extensions are available that implement transforms for the jsonb type for different pro-
cedural languages.
The extensions for PL/Perl are called jsonb_plperl and jsonb_plperlu. If you use them,
jsonb values are mapped to Perl arrays, hashes, and scalars, as appropriate.
The extensions for PL/Python are called jsonb_plpythonu, jsonb_plpython2u, and json-
b_plpython3u (see Section 46.1 for the PL/Python naming convention). If you use them, jsonb
values are mapped to Python dictionaries, lists, and scalars, as appropriate.
8.15. Arrays
PostgreSQL allows columns of a table to be defined as variable-length multidimensional arrays. Arrays
of any built-in or user-defined base type, enum type, composite type, range type, or domain can be
created.
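For example, a table with array columns might be declared along these lines (sal_emp and its columns are the names used in the rest of this section):
CREATE TABLE sal_emp (
    name            text,
    pay_by_quarter  integer[],
    schedule        text[][]
);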
As shown, an array data type is named by appending square brackets ([]) to the data type name of
the array elements. The above command will create a table named sal_emp with a column of type
text (name), a one-dimensional array of type integer (pay_by_quarter), which represents
the employee's salary by quarter, and a two-dimensional array of text (schedule), which repre-
sents the employee's weekly schedule.
The syntax for CREATE TABLE allows the exact size of arrays to be specified, for example:
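For instance (a sketch; the size is accepted but, as noted below, not enforced):
CREATE TABLE tictactoe (
    squares   integer[3][3]
);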
However, the current implementation ignores any supplied array size limits, i.e., the behavior is the
same as for arrays of unspecified length.
The current implementation does not enforce the declared number of dimensions either. Arrays of
a particular element type are all considered to be of the same type, regardless of size or number of
dimensions. So, declaring the array size or number of dimensions in CREATE TABLE is simply
documentation; it does not affect run-time behavior.
An alternative syntax, which conforms to the SQL standard by using the keyword ARRAY, can be used
for one-dimensional arrays. pay_by_quarter could have been defined as:
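That is, something like:
pay_by_quarter  integer ARRAY[4]
Or, with no array size specified:
pay_by_quarter  integer ARRAY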
As before, however, PostgreSQL does not enforce the size restriction in any case.
To write an array value as a literal constant, enclose the element values within curly braces and separate them by the delimiter character, i.e., '{ val1 delim val2 delim ... }', where delim is the delimiter character for the type, as recorded in its pg_type entry. Among the standard data types provided in the PostgreSQL distribution, all use a comma (,), except for type box which uses a semicolon (;). Each val is either a constant of the array element type, or a subarray.
An example of an array constant is:
'{{1,2,3},{4,5,6},{7,8,9}}'
To set an element of an array constant to NULL, write NULL for the element value. (Any upper- or
lower-case variant of NULL will do.) If you want an actual string value “NULL”, you must put double
quotes around it.
(These kinds of array constants are actually only a special case of the generic type constants discussed
in Section 4.1.2.7. The constant is initially treated as a string and passed to the array input conversion
routine. An explicit type specification might be necessary.)
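The outputs shown later in this section assume sample rows along these lines:
INSERT INTO sal_emp
    VALUES ('Bill',
            '{10000, 10000, 10000, 10000}',
            '{{"meeting", "lunch"}, {"training", "presentation"}}');

INSERT INTO sal_emp
    VALUES ('Carol',
            '{20000, 25000, 25000, 25000}',
            '{{"breakfast", "consulting"}, {"meeting", "lunch"}}');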
Multidimensional arrays must have matching extents for each dimension. A mismatch causes an error,
for example:
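For instance, an INSERT with ragged sub-arrays fails (a sketch against the sal_emp table above):
INSERT INTO sal_emp
    VALUES ('Bill',
            '{10000, 10000, 10000, 10000}',
            '{{"meeting", "lunch"}, {"meeting"}}');
-- fails: the sub-arrays do not have matching lengths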
Notice that the array elements are ordinary SQL constants or expressions; for instance, string literals
are single quoted, instead of double quoted as they would be in an array literal. The ARRAY constructor
syntax is discussed in more detail in Section 4.2.12.
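For example, this query retrieves the names of employees whose pay changed in the second quarter (given the sample rows above):
SELECT name FROM sal_emp WHERE pay_by_quarter[1] <> pay_by_quarter[2];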
name
-------
Carol
(1 row)
The array subscript numbers are written within square brackets. By default PostgreSQL uses a one-
based numbering convention for arrays, that is, an array of n elements starts with array[1] and
ends with array[n].
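For example, this query retrieves the third-quarter pay of all employees:
SELECT pay_by_quarter[3] FROM sal_emp;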
pay_by_quarter
----------------
10000
25000
(2 rows)
We can also access arbitrary rectangular slices of an array, or subarrays. An array slice is denoted by
writing lower-bound:upper-bound for one or more array dimensions. For example, this query
retrieves the first item on Bill's schedule for the first two days of the week:
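A query along these lines:
SELECT schedule[1:2][1:1] FROM sal_emp WHERE name = 'Bill';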
schedule
------------------------
{{meeting},{training}}
(1 row)
If any dimension is written as a slice, i.e., contains a colon, then all dimensions are treated as slices.
Any dimension that has only a single number (no colon) is treated as being from 1 to the number
specified. For example, [2] is treated as [1:2], as in this example:
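For instance:
SELECT schedule[1:2][2] FROM sal_emp WHERE name = 'Bill';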
schedule
-------------------------------------------
{{meeting,lunch},{training,presentation}}
(1 row)
To avoid confusion with the non-slice case, it's best to use slice syntax for all dimensions, e.g., [1:2]
[1:1], not [2][1:1].
It is possible to omit the lower-bound and/or upper-bound of a slice specifier; the missing
bound is replaced by the lower or upper limit of the array's subscripts. For example:
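For instance, omitting the lower bound of the first dimension and the upper bound of the second:
SELECT schedule[:2][2:] FROM sal_emp WHERE name = 'Bill';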
schedule
------------------------
{{lunch},{presentation}}
(1 row)
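Or, with both bounds of the first dimension omitted:
SELECT schedule[:][1:1] FROM sal_emp WHERE name = 'Bill';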
schedule
------------------------
{{meeting},{training}}
(1 row)
An array subscript expression will return null if either the array itself or any of the subscript expressions
are null. Also, null is returned if a subscript is outside the array bounds (this case does not raise
an error). For example, if schedule currently has the dimensions [1:3][1:2] then referencing
schedule[3][3] yields NULL. Similarly, an array reference with the wrong number of subscripts
yields a null rather than an error.
An array slice expression likewise yields null if the array itself or any of the subscript expressions are
null. However, in other cases such as selecting an array slice that is completely outside the current array
bounds, a slice expression yields an empty (zero-dimensional) array instead of null. (This does not
match non-slice behavior and is done for historical reasons.) If the requested slice partially overlaps
the array bounds, then it is silently reduced to just the overlapping region instead of returning null.
The current dimensions of any array value can be retrieved with the array_dims function:
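For instance:
SELECT array_dims(schedule) FROM sal_emp WHERE name = 'Carol';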
array_dims
------------
[1:2][1:2]
(1 row)
array_dims produces a text result, which is convenient for people to read but perhaps incon-
venient for programs. Dimensions can also be retrieved with array_upper and array_lower,
which return the upper and lower bound of a specified array dimension, respectively:
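For instance:
SELECT array_upper(schedule, 1) FROM sal_emp WHERE name = 'Carol';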
array_upper
-------------
2
(1 row)
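array_length returns the length of a specified array dimension, for instance:
SELECT array_length(schedule, 1) FROM sal_emp WHERE name = 'Carol';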
array_length
--------------
2
(1 row)
cardinality returns the total number of elements in an array across all dimensions. It is effectively
the number of rows a call to unnest would yield:
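For instance:
SELECT cardinality(schedule) FROM sal_emp WHERE name = 'Carol';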
cardinality
-------------
4
(1 row)
An array value can be replaced completely, updated at a single element, or updated in a slice:
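For example (sketches against the sal_emp table):
-- replace the whole array value:
UPDATE sal_emp SET pay_by_quarter = '{25000,25000,27000,27000}'
    WHERE name = 'Carol';

-- or using the ARRAY expression syntax:
UPDATE sal_emp SET pay_by_quarter = ARRAY[25000,25000,27000,27000]
    WHERE name = 'Carol';

-- update a single element:
UPDATE sal_emp SET pay_by_quarter[4] = 15000
    WHERE name = 'Bill';

-- update a slice:
UPDATE sal_emp SET pay_by_quarter[1:2] = '{27000,27000}'
    WHERE name = 'Carol';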
The slice syntaxes with omitted lower-bound and/or upper-bound can be used too, but only
when updating an array value that is not NULL or zero-dimensional (otherwise, there is no existing
subscript limit to substitute).
A stored array value can be enlarged by assigning to elements not already present. Any positions be-
tween those previously present and the newly assigned elements will be filled with nulls. For exam-
ple, if array myarray currently has 4 elements, it will have six elements after an update that assigns
to myarray[6]; myarray[5] will contain null. Currently, enlargement in this fashion is only al-
lowed for one-dimensional arrays, not multidimensional arrays.
Subscripted assignment allows creation of arrays that do not use one-based subscripts. For example
one might assign to myarray[-2:7] to create an array with subscript values from -2 to 7.
New array values can also be constructed using the concatenation operator, ||:
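For instance (illustrative values):
SELECT ARRAY[1,2] || ARRAY[3,4];              -- {1,2,3,4}
SELECT ARRAY[5,6] || ARRAY[[1,2],[3,4]];      -- {{5,6},{1,2},{3,4}}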
The concatenation operator allows a single element to be pushed onto the beginning or end of a one-
dimensional array. It also accepts two N-dimensional arrays, or an N-dimensional and an N+1-dimen-
sional array.
When a single element is pushed onto either the beginning or end of a one-dimensional array, the
result is an array with the same lower bound subscript as the array operand. For example:
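For instance (the second array literal uses an explicit lower bound of 0):
SELECT array_dims(1 || '[0:1]={2,3}'::int[]);   -- [0:2]
SELECT array_dims(ARRAY[1,2] || 3);             -- [1:3]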
When two arrays with an equal number of dimensions are concatenated, the result retains the lower
bound subscript of the left-hand operand's outer dimension. The result is an array comprising every
element of the left-hand operand followed by every element of the right-hand operand. For example:
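For instance:
SELECT array_dims(ARRAY[1,2] || ARRAY[3,4,5]);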
array_dims
------------
[1:5]
(1 row)
When an N-dimensional array is pushed onto the beginning or end of an N+1-dimensional array, the
result is analogous to the element-array case above. Each N-dimensional sub-array is essentially an
element of the N+1-dimensional array's outer dimension. For example:
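For instance:
SELECT array_dims(ARRAY[1,2] || ARRAY[[3,4],[5,6]]);   -- [1:3][1:2]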
An array can also be constructed with the functions array_prepend, array_append, or array_cat. The first two only support one-dimensional arrays, but array_cat supports multidimensional arrays. In simple cases, the concatenation operator discussed above is preferred over direct use of these functions. However, because the concatenation operator is overloaded to serve all three cases, there are situations where use of one of the functions is helpful to avoid ambiguity. For example, consider:
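A sketch of the ambiguity (results given as comments):
SELECT ARRAY[1, 2] || '{3, 4}';            -- the untyped literal is taken as an array: {1,2,3,4}
SELECT ARRAY[1, 2] || '7';                 -- so is this one, causing an error
SELECT ARRAY[1, 2] || NULL;                -- so is an undecorated NULL: {1,2}
SELECT array_append(ARRAY[1, 2], NULL);    -- this might have been meant: {1,2,NULL}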
In the examples above, the parser sees an integer array on one side of the concatenation operator,
and a constant of undetermined type on the other. The heuristic it uses to resolve the constant's type
is to assume it's of the same type as the operator's other input — in this case, integer array. So the
concatenation operator is presumed to represent array_cat, not array_append. When that's the
wrong choice, it could be fixed by casting the constant to the array's element type; but explicit use of
array_append might be a preferable solution.
To search for a value in an array, each element must be checked; this can be done manually if you know the size of the array, by testing every subscript position in turn. However, this quickly becomes tedious for large arrays, and it is not helpful if the size of the array is unknown. An alternative method is described in Section 9.23. The manual search can be replaced by:
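For example, to find employees with a quarter of pay equal to 10000:
SELECT * FROM sal_emp WHERE 10000 = ANY (pay_by_quarter);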
In addition, you can find rows where the array has all values equal to 10000 with:
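For instance:
SELECT * FROM sal_emp WHERE 10000 = ALL (pay_by_quarter);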
Alternatively, the generate_subscripts function can be used. For example:
SELECT * FROM
   (SELECT pay_by_quarter,
           generate_subscripts(pay_by_quarter, 1) AS s
      FROM sal_emp) AS foo
 WHERE pay_by_quarter[s] = 10000;
You can also search an array using the && operator, which checks whether the left operand overlaps
with the right operand. For instance:
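For instance, to find rows whose pay_by_quarter array shares any element with the given array:
SELECT * FROM sal_emp WHERE pay_by_quarter && ARRAY[10000];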
This and other array operators are further described in Section 9.18. Such a search can be accelerated by an appropriate index, as described in Section 11.2.
You can also search for specific values in an array using the array_position and array_po-
sitions functions. The former returns the subscript of the first occurrence of a value in an array;
the latter returns an array with the subscripts of all occurrences of the value in the array. For example:
SELECT
array_position(ARRAY['sun','mon','tue','wed','thu','fri','sat'],
'mon');
array_position
----------------
2
(1 row)
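array_positions, by contrast, returns every matching subscript; for instance:
SELECT array_positions(ARRAY['A','A','B','A'], 'A');   -- {1,2,4}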
Tip
Arrays are not sets; searching for specific array elements can be a sign of database misdesign.
Consider using a separate table with a row for each item that would be an array element. This
will be easier to search, and is likely to scale better for a large number of elements.
The array output routine will put double quotes around element values if they are empty strings, con-
tain curly braces, delimiter characters, double quotes, backslashes, or white space, or match the word
NULL. Double quotes and backslashes embedded in element values will be backslash-escaped. For
numeric data types it is safe to assume that double quotes will never appear, but for textual data types
one should be prepared to cope with either the presence or absence of quotes.
By default, the lower bound index value of an array's dimensions is set to one. To represent arrays
with other lower bounds, the array subscript ranges can be specified explicitly before writing the array
contents. This decoration consists of square brackets ([]) around each array dimension's lower and
upper bounds, with a colon (:) delimiter character in between. The array dimension decoration is
followed by an equal sign (=). For example:
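For example (the subquery supplies an array literal with explicit dimension decoration):
SELECT f1[1][-2][3] AS e1, f1[1][-1][5] AS e2
    FROM (SELECT '[1:1][-2:-1][3:5]={{{1,2,3},{4,5,6}}}'::int[] AS f1) AS ss;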
e1 | e2
----+----
1 | 6
(1 row)
The array output routine will include explicit dimensions in its result only when there are one or more
lower bounds different from one.
If the value written for an element is NULL (in any case variant), the element is taken to be NULL.
The presence of any quotes or backslashes disables this and allows the literal string value “NULL”
to be entered. Also, for backward compatibility with pre-8.2 versions of PostgreSQL, the array_nulls
configuration parameter can be turned off to suppress recognition of NULL as a NULL.
As shown previously, when writing an array value you can use double quotes around any individual
array element. You must do so if the element value would otherwise confuse the array-value parser.
For example, elements containing curly braces, commas (or the data type's delimiter character), dou-
ble quotes, backslashes, or leading or trailing whitespace must be double-quoted. Empty strings and
strings matching the word NULL must be quoted, too. To put a double quote or backslash in a quoted
array element value, precede it with a backslash. Alternatively, you can avoid quotes and use back-
slash-escaping to protect all data characters that would otherwise be taken as array syntax.
You can add whitespace before a left brace or after a right brace. You can also add whitespace before
or after any individual item string. In all of these cases the whitespace will be ignored. However,
whitespace within double-quoted elements, or surrounded on both sides by non-whitespace characters
of an element, is not ignored.
Tip
The ARRAY constructor syntax (see Section 4.2.12) is often easier to work with than the ar-
ray-literal syntax when writing array values in SQL commands. In ARRAY, individual element
values are written the same way they would be written when not members of an array.
8.16. Composite Types
A composite type represents the structure of a row or record; it is essentially just a list of field names and their data types. PostgreSQL allows composite types to be used in many of the same ways that simple types can be used. For example, a composite type can be defined like this:
CREATE TYPE inventory_item AS (
    name            text,
    supplier_id     integer,
    price           numeric
);
The syntax is comparable to CREATE TABLE, except that only field names and types can be specified;
no constraints (such as NOT NULL) can presently be included. Note that the AS keyword is essential;
without it, the system will think a different kind of CREATE TYPE command is meant, and you will
get odd syntax errors.
Having defined the type, we can use it to create tables or functions that work with inventory_item values.
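For example, the on_hand table referenced later in this section might be created and populated along these lines (a sketch; a function taking an inventory_item argument could be defined similarly):
CREATE TABLE on_hand (
    item   inventory_item,
    count  integer
);

INSERT INTO on_hand VALUES (ROW('fuzzy dice', 42, 1.99), 1000);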
Whenever you create a table, a composite type is also automatically created, with the same name as
the table, to represent the table's row type. For example, had we said:
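For instance, a definition along these lines (the suppliers table and the specific constraints are illustrative):
CREATE TABLE inventory_item (
    name            text,
    supplier_id     integer REFERENCES suppliers,
    price           numeric CHECK (price > 0)
);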
then the same inventory_item composite type shown above would come into being as a byprod-
uct, and could be used just as above. Note however an important restriction of the current implemen-
tation: since no constraints are associated with a composite type, the constraints shown in the table
definition do not apply to values of the composite type outside the table. (To work around this, cre-
ate a domain over the composite type, and apply the desired constraints as CHECK constraints of the
domain.)
To write a composite value as a literal constant, enclose the field values within parentheses and separate them by commas. You can put double quotes around any field value, and must do so if it contains commas or parentheses. (More details appear below.) An example is:
'("fuzzy dice",42,1.99)'
which would be a valid value of the inventory_item type defined above. To make a field be
NULL, write no characters at all in its position in the list. For example, this constant specifies a NULL
third field:
'("fuzzy dice",42,)'
If you want an empty string rather than NULL, write double quotes:
'("",42,)'
Here the first field is a non-NULL empty string, the third is NULL.
(These constants are actually only a special case of the generic type constants discussed in Sec-
tion 4.1.2.7. The constant is initially treated as a string and passed to the composite-type input con-
version routine. An explicit type specification might be necessary to tell which type to convert the
constant to.)
The ROW expression syntax can also be used to construct composite values. In most cases this is
considerably simpler to use than the string-literal syntax since you don't have to worry about multiple
layers of quoting. We already used this method above:
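For instance, row values like these:
ROW('fuzzy dice', 42, 1.99)
ROW('', 42, NULL)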
The ROW keyword is actually optional as long as you have more than one field in the expression,
so these can be simplified to:
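That is (the same illustrative values):
('fuzzy dice', 42, 1.99)
('', 42, NULL)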
To access a field of a composite column, one writes a dot and the field name, much as when selecting a field from a table name. In fact, it is so much like selecting from a table name that parentheses are often needed to keep from confusing the parser. For example, one might try to select a subfield from the on_hand table by writing item.name in the select list; this will not work, since the name item is taken to be a table name, not a column name of on_hand, per SQL syntax rules. The composite column must be parenthesized instead.
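A sketch of the parenthesized form:
SELECT (item).name FROM on_hand WHERE (item).price > 9.99;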
If you need to use the table name as well (for instance in a multitable query), the table-qualified reference is parenthesized in the same way.
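For instance:
SELECT (on_hand.item).name FROM on_hand WHERE (on_hand.item).price > 9.99;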
Now the parenthesized object is correctly interpreted as a reference to the item column, and then the
subfield can be selected from it.
Similar syntactic issues apply whenever you select a field from a composite value. For instance, to
select just one field from the result of a function that returns a composite value, you'd need to write
something like:
The special field name * means “all fields”, as further explained in Section 8.16.5.
There is also special syntax for inserting and updating composite columns (see the sketch below). When inserting or updating a whole composite column, the row value can be written either with or without the ROW keyword; we could have done it either way.
We can also update an individual subfield of a composite column. Notice that we don't need to (and indeed cannot) put parentheses around the column name appearing just after SET, but we do need parentheses when referencing the same column in the expression to the right of the equal sign.
Finally, individual subfields can be specified as targets of an INSERT. Had we not supplied values for all the subfields of the column, the remaining subfields would have been filled with null values.
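A sketch of these syntaxes, assuming a table mytab with a composite column complex_col that has fields r and i (illustrative names):
-- inserting or updating a whole column, without or with ROW:
INSERT INTO mytab (complex_col) VALUES ((1.1, 2.2));
UPDATE mytab SET complex_col = ROW(1.1, 2.2);

-- updating an individual subfield:
UPDATE mytab SET complex_col.r = (complex_col).r + 1;

-- inserting by targeting individual subfields:
INSERT INTO mytab (complex_col.r, complex_col.i) VALUES (1.1, 2.2);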
In PostgreSQL, a reference to a table name (or alias) in a query is effectively a reference to the com-
posite value of the table's current row. For example, if we had a table inventory_item as shown
above, we could write:
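A minimal example:
SELECT c FROM inventory_item c;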
This query produces a single composite-valued column, so we might get output like:
c
------------------------
("fuzzy dice",42,1.99)
(1 row)
Note however that simple names are matched to column names before table names, so this example
works only because there is no column named c in the query's tables.
When we instead apply .* to the table alias (see the sketch below), then, according to the SQL standard, we should get the contents of the table expanded into separate columns:
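A sketch of such a query, which would yield name, supplier_id, and price as separate output columns:
SELECT c.* FROM inventory_item c;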
PostgreSQL will apply this expansion behavior to any composite-valued expression, although as
shown above, you need to write parentheses around the value that .* is applied to whenever it's not a
simple table name. For example, if myfunc() is a function returning a composite type with columns
a, b, and c, then these two queries have the same result:
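For instance (x being a column of some_table, as in the Tip below):
SELECT (myfunc(x)).* FROM some_table;
SELECT (myfunc(x)).a, (myfunc(x)).b, (myfunc(x)).c FROM some_table;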
Tip
PostgreSQL handles column expansion by actually transforming the first form into the second.
So, in this example, myfunc() would get invoked three times per row with either syntax. If
it's an expensive function you may wish to avoid that, which you can do with a query like:
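A sketch of such a query:
SELECT m.* FROM some_table, LATERAL myfunc(x) AS m;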
Placing the function in a LATERAL FROM item keeps it from being invoked more than once per
row. m.* is still expanded into m.a, m.b, m.c, but now those variables are just references
to the output of the FROM item. (The LATERAL keyword is optional here, but we show it to
clarify that the function is getting x from some_table.)
The composite_value.* syntax results in column expansion of this kind when it appears at the
top level of a SELECT output list, a RETURNING list in INSERT/UPDATE/DELETE, a VALUES
clause, or a row constructor. In all other contexts (including when nested inside one of those con-
structs), attaching .* to a composite value does not change the value, since it means “all columns”
and so the same composite value is produced again. For example, if somefunc() accepts a com-
posite-valued argument, these queries are the same:
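For instance:
SELECT somefunc(c.*) FROM inventory_item c;
SELECT somefunc(c) FROM inventory_item c;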
In both cases, the current row of inventory_item is passed to the function as a single compos-
ite-valued argument. Even though .* does nothing in such cases, using it is good style, since it makes
clear that a composite value is intended. In particular, the parser will consider c in c.* to refer to a
table name or alias, not to a column name, so that there is no ambiguity; whereas without .*, it is not
clear whether c means a table name or a column name, and in fact the column-name interpretation
will be preferred if there is a column named c.
Another example demonstrating these concepts is that all these queries mean the same thing:
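For example, queries along these lines:
SELECT * FROM inventory_item c ORDER BY c;
SELECT * FROM inventory_item c ORDER BY c.*;
SELECT * FROM inventory_item c ORDER BY ROW(c.*);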
All of these ORDER BY clauses specify the row's composite value, resulting in sorting the rows ac-
cording to the rules described in Section 9.23.6. However, if inventory_item contained a column
named c, the first case would be different from the others, as it would mean to sort by that column
only. Given the column names previously shown, these queries are also equivalent to those above:
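Namely:
SELECT * FROM inventory_item c ORDER BY ROW(c.name, c.supplier_id, c.price);
SELECT * FROM inventory_item c ORDER BY (c.name, c.supplier_id, c.price);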
(The last case uses a row constructor with the key word ROW omitted.)
Another special syntactical behavior associated with composite values is that we can use functional
notation for extracting a field of a composite value. The simple way to explain this is that the notations
field(table) and table.field are interchangeable. For example, these queries are equiva-
lent:
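For example:
SELECT c.name FROM inventory_item c WHERE c.price > 1000;
SELECT name(c) FROM inventory_item c WHERE price(c) > 1000;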
Moreover, if we have a function that accepts a single argument of a composite type, we can call it
with either notation. These queries are all equivalent:
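For example (somefunc being a function taking one inventory_item argument):
SELECT somefunc(c) FROM inventory_item c;
SELECT somefunc(c.*) FROM inventory_item c;
SELECT c.somefunc FROM inventory_item c;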
This equivalence between functional notation and field notation makes it possible to use functions on
composite types to implement “computed fields”. An application using the last query above wouldn't
need to be directly aware that somefunc isn't a real column of the table.
Tip
Because of this behavior, it's unwise to give a function that takes a single composite-type
argument the same name as any of the fields of that composite type. If there is ambiguity, the
field-name interpretation will be chosen if field-name syntax is used, while the function will
be chosen if function-call syntax is used. However, PostgreSQL versions before 11 always
chose the field-name interpretation, unless the syntax of the call required it to be a function
call. One way to force the function interpretation in older versions is to schema-qualify the
function name, that is, write schema.func(compositevalue).
Whitespace outside the parentheses of a composite constant is ignored, but within the parentheses it is considered part of the field value, and might or might not be significant depending on the input conversion rules for the field data type. For example, in:
'( 42)'
the whitespace will be ignored if the field type is integer, but not if it is text.
As shown previously, when writing a composite value you can write double quotes around any indi-
vidual field value. You must do so if the field value would otherwise confuse the composite-value
parser. In particular, fields containing parentheses, commas, double quotes, or backslashes must be
double-quoted. To put a double quote or backslash in a quoted composite field value, precede it with
a backslash. (Also, a pair of double quotes within a double-quoted field value is taken to represent a
double quote character, analogously to the rules for single quotes in SQL literal strings.) Alternatively,
you can avoid quoting and use backslash-escaping to protect all data characters that would otherwise
be taken as composite syntax.
A completely empty field value (no characters at all between the commas or parentheses) represents
a NULL. To write a value that is an empty string rather than NULL, write "".
The composite output routine will put double quotes around field values if they are empty strings or
contain parentheses, commas, double quotes, backslashes, or white space. (Doing so for white space
is not essential, but aids legibility.) Double quotes and backslashes embedded in field values will be
doubled.
Note
Remember that what you write in an SQL command will first be interpreted as a string literal,
and then as a composite. This doubles the number of backslashes you need (assuming escape
string syntax is used). For example, to insert a text field containing a double quote and a
backslash in a composite value, you'd need to write:
The string-literal processor removes one level of backslashes, so that what arrives at the com-
posite-value parser looks like ("\"\\"). In turn, the string fed to the text data type's input
routine becomes "\. (If we were working with a data type whose input routine also treated
backslashes specially, bytea for example, we might need as many as eight backslashes in
the command to get one backslash into the stored composite field.) Dollar quoting (see Sec-
tion 4.1.2.4) can be used to avoid the need to double backslashes.
Tip
The ROW constructor syntax is usually easier to work with than the composite-literal syntax
when writing composite values in SQL commands. In ROW, individual field values are written
the same way they would be written when not members of a composite.
8.17. Range Types
Range types are data types representing a range of values of some element type (called the range's
subtype). For instance, ranges of timestamp might be used to represent the ranges of time that a
meeting room is reserved. In this case the data type is tsrange (short for “timestamp range”), and
timestamp is the subtype. The subtype must have a total order so that it is well-defined whether
element values are within, before, or after a range of values.
Range types are useful because they represent many element values in a single range value, and be-
cause concepts such as overlapping ranges can be expressed clearly. The use of time and date ranges
for scheduling purposes is the clearest example; but price ranges, measurement ranges from an instru-
ment, and so forth can also be useful.
In addition, you can define your own range types; see CREATE TYPE for more information.
8.17.2. Examples
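The examples in this section assume a reservation table along these lines (a sketch; tsrange is a range over timestamp without time zone):
CREATE TABLE reservation (room int, during tsrange);
INSERT INTO reservation VALUES
    (1108, '[2010-01-01 14:30, 2010-01-01 15:30)');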
-- Containment
SELECT int4range(10, 20) @> 3;
-- Overlaps
SELECT numrange(11.1, 22.2) && numrange(20.0, 30.0);
See Table 9.50 and Table 9.51 for complete lists of operators and functions on range types.
Every non-empty range has two bounds, the lower bound and the upper bound. All points between these values are included in the range. An inclusive bound means that the boundary point itself is included in the range as well, while an exclusive bound means that the boundary point is not included in the range.
In the text form of a range, an inclusive lower bound is represented by “[” while an exclusive lower
bound is represented by “(”. Likewise, an inclusive upper bound is represented by “]”, while an
exclusive upper bound is represented by “)”. (See Section 8.17.5 for more details.)
The functions lower_inc and upper_inc test the inclusivity of the lower and upper bounds of
a range value, respectively.
Element types that have the notion of “infinity” can use them as explicit bound values. For example, with timestamp ranges, [today,infinity) excludes the special timestamp value infinity, while [today,infinity] includes it, as do [today,) and [today,].
The functions lower_inf and upper_inf test for infinite lower and upper bounds of a range,
respectively.
8.17.5. Range Input/Output
The input for a range value must follow one of the following patterns:
(lower-bound,upper-bound)
(lower-bound,upper-bound]
[lower-bound,upper-bound)
[lower-bound,upper-bound]
empty
The parentheses or brackets indicate whether the lower and upper bounds are exclusive or inclusive,
as described previously. Notice that the final pattern is empty, which represents an empty range (a
range that contains no points).
The lower-bound may be either a string that is valid input for the subtype, or empty to indicate
no lower bound. Likewise, upper-bound may be either a string that is valid input for the subtype,
or empty to indicate no upper bound.
Each bound value can be quoted using " (double quote) characters. This is necessary if the bound
value contains parentheses, brackets, commas, double quotes, or backslashes, since these characters
would otherwise be taken as part of the range syntax. To put a double quote or backslash in a quoted
bound value, precede it with a backslash. (Also, a pair of double quotes within a double-quoted bound
value is taken to represent a double quote character, analogously to the rules for single quotes in SQL
literal strings.) Alternatively, you can avoid quoting and use backslash-escaping to protect all data
characters that would otherwise be taken as range syntax. Also, to write a bound value that is an empty
string, write "", since writing nothing means an infinite bound.
Whitespace is allowed before and after the range value, but any whitespace between the parentheses
or brackets is taken as part of the lower or upper bound value. (Depending on the element type, it
might or might not be significant.)
Note
These rules are very similar to those for writing field values in composite-type literals. See
Section 8.16.6 for additional commentary.
Each range type also has a constructor function with the same name as the range type. Using the constructor function is frequently more convenient than writing a range literal constant, since it avoids the need for extra quoting of the bound values. The constructor function accepts two or three arguments; the optional third argument specifies the inclusivity or exclusivity of the bounds. Examples:
-- The full form is: lower bound, upper bound, and text argument indicating
-- inclusivity/exclusivity of bounds.
SELECT numrange(1.0, 14.0, '(]');
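A couple of further illustrative calls:
-- If the third argument is omitted, '[)' is assumed.
SELECT numrange(1.0, 14.0);

-- Using NULL for either bound causes the range to be unbounded on that side.
SELECT numrange(NULL, 2.2);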
A discrete range type is one whose element type has a well-defined “step”, such as integer or date, so that two elements can be said to be adjacent when there are no valid values between them. This contrasts with continuous ranges, where it is always (or almost always) possible to identify other element values between two given values. For example, a range over the numeric type is continuous, as is a range over timestamp. (Even though timestamp has limited precision, and so could theoretically be treated as discrete, it's better to consider it continuous since the step size is normally not of interest.)
Another way to think about a discrete range type is that there is a clear idea of a “next” or “previous”
value for each element value. Knowing that, it is possible to convert between inclusive and exclusive
representations of a range's bounds, by choosing the next or previous element value instead of the one
originally given. For example, in an integer range type [4,8] and (3,9) denote the same set of
values; but this would not be so for a range over numeric.
A discrete range type should have a canonicalization function that is aware of the desired step size for
the element type. The canonicalization function is charged with converting equivalent values of the
range type to have identical representations, in particular consistently inclusive or exclusive bounds.
If a canonicalization function is not specified, then ranges with different formatting will always be
treated as unequal, even though they might represent the same set of values in reality.
The built-in range types int4range, int8range, and daterange all use a canonical form that
includes the lower bound and excludes the upper bound; that is, [). User-defined range types can use
other conventions, however.
Users can also define their own range types. The most common reason to do so is to use ranges over a subtype not provided among the built-in range types, such as float8. Because float8 has no meaningful “step”, no canonicalization function is defined in the sketch below.
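A minimal definition along these lines (float8mi is the existing function underlying the float8 minus operator, as noted below):
CREATE TYPE floatrange AS RANGE (
    subtype = float8,
    subtype_diff = float8mi
);

SELECT '[1.234, 5.678]'::floatrange;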
Defining your own range type also allows you to specify a different subtype B-tree operator class or
collation to use, so as to change the sort ordering that determines which values fall into a given range.
If the subtype is considered to have discrete rather than continuous values, the CREATE TYPE com-
mand should specify a canonical function. The canonicalization function takes an input range val-
ue, and must return an equivalent range value that may have different bounds and formatting. The
canonical output for two ranges that represent the same set of values, for example the integer ranges
[1, 7] and [1, 8), must be identical. It doesn't matter which representation you choose to be the
canonical one, so long as two equivalent values with different formattings are always mapped to the
same value with the same formatting. In addition to adjusting the inclusive/exclusive bounds format, a
canonicalization function might round off boundary values, in case the desired step size is larger than
what the subtype is capable of storing. For instance, a range type over timestamp could be defined
to have a step size of an hour, in which case the canonicalization function would need to round off
bounds that weren't a multiple of an hour, or perhaps throw an error instead.
In addition, any range type that is meant to be used with GiST or SP-GiST indexes should define a sub-
type difference, or subtype_diff, function. (The index will still work without subtype_diff,
but it is likely to be considerably less efficient than if a difference function is provided.) The subtype
difference function takes two input values of the subtype, and returns their difference (i.e., X minus
Y) represented as a float8 value. In our example above, the function float8mi that underlies the
regular float8 minus operator can be used; but for any other subtype, some type conversion would
be necessary. Some creative thought about how to represent differences as numbers might be needed,
too. To the greatest extent possible, the subtype_diff function should agree with the sort ordering
implied by the selected operator class and collation; that is, its result should be positive whenever its
first argument is greater than its second according to the sort ordering.
See CREATE TYPE for more information about creating range types.
8.17.9. Indexing
GiST and SP-GiST indexes can be created for table columns of range types. For instance, to create
a GiST index:
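For example (using the reservation table sketched earlier):
CREATE INDEX reservation_idx ON reservation USING GIST (during);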
A GiST or SP-GiST index can accelerate queries involving these range operators: =, &&, <@, @>, <<,
>>, -|-, &<, and &> (see Table 9.50 for more information).
In addition, B-tree and hash indexes can be created for table columns of range types. For these index
types, basically the only useful range operation is equality. There is a B-tree sort ordering defined for
range values, with corresponding < and > operators, but the ordering is rather arbitrary and not usually
useful in the real world. Range types' B-tree and hash support is primarily meant to allow sorting and
hashing internally in queries, rather than creation of actual indexes.
While UNIQUE is a natural constraint for scalar values, it is usually unsuitable for range types; instead, an exclusion constraint is often more appropriate. Exclusion constraints allow the specification of constraints such as “non-overlapping” on a range type, and such a constraint will prevent any overlapping values from existing in the table at the same time:
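A sketch, using a simplified reservation table (the constraint is enforced through an implicitly created GiST index):
CREATE TABLE reservation (
    during tsrange,
    EXCLUDE USING GIST (during WITH &&)
);

INSERT INTO reservation VALUES ('[2010-01-01 11:30, 2010-01-01 15:00)');
-- a second, overlapping reservation fails with an exclusion-constraint violation:
INSERT INTO reservation VALUES ('[2010-01-01 14:45, 2010-01-01 15:45)');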
You can use the btree_gist extension to define exclusion constraints on plain scalar data types,
which can then be combined with range exclusions for maximum flexibility. For example, after
btree_gist is installed, the following constraint will reject overlapping ranges only if the meeting
room numbers are equal:
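A sketch (room_reservation is an illustrative table name):
CREATE EXTENSION btree_gist;
CREATE TABLE room_reservation (
    room    text,
    during  tsrange,
    EXCLUDE USING GIST (room WITH =, during WITH &&)
);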
8.18. Domain Types
A domain is a user-defined data type that is based on another underlying type. Optionally, it can have constraints that restrict its valid values to a subset of what the underlying type would allow. For example, we could create a domain over integers that accepts only positive integers:
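A sketch (posint and mytable are the names referenced in the following paragraph):
CREATE DOMAIN posint AS integer CHECK (VALUE > 0);
CREATE TABLE mytable (id posint);
INSERT INTO mytable VALUES (1);    -- works
INSERT INTO mytable VALUES (-1);   -- fails: violates the domain's CHECK constraint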
When an operator or function of the underlying type is applied to a domain value, the domain is
automatically down-cast to the underlying type. Thus, for example, the result of mytable.id - 1 is
considered to be of type integer not posint. We could write (mytable.id - 1)::posint
to cast the result back to posint, causing the domain's constraints to be rechecked. In this case, that
would result in an error if the expression had been applied to an id value of 1. Assigning a value of
the underlying type to a field or variable of the domain type is allowed without writing an explicit
cast, but the domain's constraints will be checked.
8.19. Object Identifier Types
Object identifiers (OIDs) are used internally by PostgreSQL as primary keys for various system tables. Type oid represents an object identifier; there are also several alias types for oid, such as regclass, regproc, regprocedure, regoper, regoperator, regtype, and regrole.
The oid type is currently implemented as an unsigned four-byte integer. Therefore, it is not large
enough to provide database-wide uniqueness in large databases, or even in large individual tables. So,
using a user-created table's OID column as a primary key is discouraged. OIDs are best used only for
references to system tables.
The oid type itself has few operations beyond comparison. It can be cast to integer, however, and
then manipulated using the standard integer operators. (Beware of possible signed-versus-unsigned
confusion if you do this.)
The OID alias types have no operations of their own except for specialized input and output routines.
These routines are able to accept and display symbolic names for system objects, rather than the raw
numeric value that type oid would use. The alias types allow simplified lookup of OID values for
objects. For example, to examine the pg_attribute rows related to a table mytable, one could
write:
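A query along these lines (mytable being the table of interest):
SELECT * FROM pg_attribute WHERE attrelid = 'mytable'::regclass;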
rather than:
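Something like:
SELECT * FROM pg_attribute
 WHERE attrelid = (SELECT oid FROM pg_class WHERE relname = 'mytable');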
While that doesn't look all that bad by itself, it's still oversimplified. A far more complicated sub-
select would be needed to select the right OID if there are multiple tables named mytable in different
schemas. The regclass input converter handles the table lookup according to the schema search path setting, and so it does the “right thing” automatically. Similarly, casting a table's OID to regclass
is handy for symbolic display of a numeric OID.
All of the OID alias types for objects grouped by namespace accept schema-qualified names, and will
display schema-qualified names on output if the object would not be found in the current search path
without being qualified. The regproc and regoper alias types will only accept input names that are
unique (not overloaded), so they are of limited use; for most uses regprocedure or regoperator
are more appropriate. For regoperator, unary operators are identified by writing NONE for the
unused operand.
An additional property of most of the OID alias types is the creation of dependencies. If a constant
of one of these types appears in a stored expression (such as a column default expression or view),
it creates a dependency on the referenced object. For example, if a column has a default expres-
sion nextval('my_seq'::regclass), PostgreSQL understands that the default expression de-
pends on the sequence my_seq; the system will not let the sequence be dropped without first remov-
ing the default expression. The only exception to this behavior is regrole; constants of this type are not allowed in such expressions.
Note
The OID alias types do not completely follow transaction isolation rules. The planner also
treats them as simple constants, which may result in sub-optimal planning.
Another identifier type used by the system is xid, or transaction (abbreviated xact) identifier. This is
the data type of the system columns xmin and xmax. Transaction identifiers are 32-bit quantities.
A third identifier type used by the system is cid, or command identifier. This is the data type of the
system columns cmin and cmax. Command identifiers are also 32-bit quantities.
A final identifier type used by the system is tid, or tuple identifier (row identifier). This is the data
type of the system column ctid. A tuple ID is a pair (block number, tuple index within block) that
identifies the physical location of the row within its table.
8.20. pg_lsn Type
The pg_lsn data type can be used to store LSN (Log Sequence Number) data, which is a pointer to a location in the write-ahead log.
Internally, an LSN is a 64-bit integer, representing a byte position in the write-ahead log stream. It is
printed as two hexadecimal numbers of up to 8 digits each, separated by a slash; for example, 16/
B374D848. The pg_lsn type supports the standard comparison operators, like = and >. Two LSNs
can be subtracted using the - operator; the result is the number of bytes separating those write-ahead
log locations.
8.21. Pseudo-Types
The PostgreSQL type system contains a number of special-purpose entries that are collectively called
pseudo-types. A pseudo-type cannot be used as a column data type, but it can be used to declare a
function's argument or result type. Each of the available pseudo-types is useful in situations where a
function's behavior does not correspond to simply taking or returning a value of a specific SQL data
type. Table 8.25 lists the existing pseudo-types.
Functions coded in C (whether built-in or dynamically loaded) can be declared to accept or return any
of these pseudo data types. It is up to the function author to ensure that the function will behave safely
when a pseudo-type is used as an argument type.
Functions coded in procedural languages can use pseudo-types only as allowed by their implemen-
tation languages. At present most procedural languages forbid use of a pseudo-type as an argument
type, and allow only void and record as a result type (plus trigger or event_trigger when
the function is used as a trigger or event trigger). Some also support polymorphic functions using the
types anyelement, anyarray, anynonarray, anyenum, and anyrange.
The internal pseudo-type is used to declare functions that are meant only to be called internally
by the database system, and not by direct invocation in an SQL query. If a function has at least one
internal-type argument then it cannot be called from SQL. To preserve the type safety of this
restriction it is important to follow this coding rule: do not create any function that is declared to return
internal unless it has at least one internal argument.
Chapter 9. Functions and Operators
PostgreSQL provides a large number of functions and operators for the built-in data types. Users can
also define their own functions and operators, as described in Part V. The psql commands \df and
\do can be used to list all available functions and operators, respectively.
If you are concerned about portability then note that most of the functions and operators described
in this chapter, with the exception of the most trivial arithmetic and comparison operators and some
explicitly marked functions, are not specified by the SQL standard. Some of this extended function-
ality is present in other SQL database management systems, and in many cases this functionality is
compatible and consistent between the various implementations. This chapter is also not exhaustive;
additional functions appear in relevant sections of the manual.
9.1. Logical Operators
The usual logical operators are available:
AND
OR
NOT
SQL uses a three-valued logic system with true, false, and null, which represents “unknown”. Observe the following truth tables:
a     | b     | a AND b | a OR b
------+-------+---------+--------
TRUE  | TRUE  | TRUE    | TRUE
TRUE  | FALSE | FALSE   | TRUE
TRUE  | NULL  | NULL    | TRUE
FALSE | FALSE | FALSE   | FALSE
FALSE | NULL  | FALSE   | NULL
NULL  | NULL  | NULL    | NULL
a     | NOT a
------+------
TRUE  | FALSE
FALSE | TRUE
NULL  | NULL
The operators AND and OR are commutative, that is, you can switch the left and right operand without
affecting the result. But see Section 4.2.14 for more information about the order of evaluation of
subexpressions.
9.2. Comparison Functions and Operators
The usual comparison operators are available:
Operator | Description
<        | less than
>        | greater than
<=       | less than or equal to
>=       | greater than or equal to
=        | equal
<> or != | not equal
Note
The != operator is converted to <> in the parser stage. It is not possible to implement != and
<> operators that do different things.
Comparison operators are available for all relevant data types. All comparison operators are binary
operators that return values of type boolean; expressions like 1 < 2 < 3 are not valid (because
there is no < operator to compare a Boolean value with 3).
There are also some comparison predicates, as shown in Table 9.2. These behave much like operators,
but have special syntax mandated by the SQL standard.
a BETWEEN x AND y
is equivalent to
a >= x AND a <= y
Notice that BETWEEN treats the endpoint values as included in the range. NOT BETWEEN does the
opposite comparison:
a NOT BETWEEN x AND y
is equivalent to
a < x OR a > y
BETWEEN SYMMETRIC is like BETWEEN except there is no requirement that the argument to the
left of AND be less than or equal to the argument on the right. If it is not, those two arguments are
automatically swapped, so that a nonempty range is always implied.
Ordinary comparison operators yield null (signifying “unknown”), not true or false, when either input
is null. For example, 7 = NULL yields null, as does 7 <> NULL. When this behavior is not suitable,
use the IS [ NOT ] DISTINCT FROM predicates:
a IS DISTINCT FROM b
a IS NOT DISTINCT FROM b
For non-null inputs, IS DISTINCT FROM is the same as the <> operator. However, if both inputs
are null it returns false, and if only one input is null it returns true. Similarly, IS NOT DISTINCT
FROM is identical to = for non-null inputs, but it returns true when both inputs are null, and false when
only one input is null. Thus, these predicates effectively act as though null were a normal data value,
rather than “unknown”.
expression IS NULL
expression IS NOT NULL
expression ISNULL
expression NOTNULL
Do not write expression = NULL because NULL is not “equal to” NULL. (The null value repre-
sents an unknown value, and it is not known whether two unknown values are equal.)
Tip
Some applications might expect that expression = NULL returns true if expression
evaluates to the null value. It is highly recommended that these applications be modified to
comply with the SQL standard. However, if that cannot be done the transform_null_equals
configuration variable is available. If it is enabled, PostgreSQL will convert x = NULL
clauses to x IS NULL.
If the expression is row-valued, then IS NULL is true when the row expression itself is null
or when all the row's fields are null, while IS NOT NULL is true when the row expression itself
is non-null and all the row's fields are non-null. Because of this behavior, IS NULL and IS NOT
NULL do not always return inverse results for row-valued expressions; in particular, a row-valued
expression that contains both null and non-null fields will return false for both tests. In some cases,
it may be preferable to write row IS DISTINCT FROM NULL or row IS NOT DISTINCT
FROM NULL, which will simply check whether the overall row value is null without any additional
tests on the row fields.
Boolean values can also be tested using the predicates
boolean_expression IS TRUE
boolean_expression IS NOT TRUE
boolean_expression IS FALSE
boolean_expression IS NOT FALSE
boolean_expression IS UNKNOWN
boolean_expression IS NOT UNKNOWN
These will always return true or false, never a null value, even when the operand is null. A null input
is treated as the logical value “unknown”. Notice that IS UNKNOWN and IS NOT UNKNOWN are
effectively the same as IS NULL and IS NOT NULL, respectively, except that the input expression
must be of Boolean type.
9.3. Mathematical Functions and Operators
Mathematical operators are provided for many PostgreSQL types; for types without standard mathematical conventions (e.g., date/time types) the actual behavior is described in subsequent sections. Table 9.4 shows the available mathematical operators.
The bitwise operators work only on integral data types and are also available for the bit string types bit and bit varying, as shown in Table 9.13.
Table 9.5 shows the available mathematical functions. In the table, dp indicates double preci-
sion. Many of these functions are provided in multiple forms with different argument types. Except
where noted, any given form of a function returns the same data type as its argument. The functions
working with double precision data are mostly implemented on top of the host system's C
library; accuracy and behavior in boundary cases can therefore vary depending on the host system.
The characteristics of the values returned by random() depend on the system implementation. It is not suitable for cryptographic applications; see the pgcrypto module for an alternative.
Finally, Table 9.7 shows the available trigonometric functions. All trigonometric functions take argu-
ments and return values of type double precision. Each of the trigonometric functions comes
in two variants, one that measures angles in radians and one that measures angles in degrees.
Note
Another way to work with angles measured in degrees is to use the unit transformation func-
tions radians() and degrees() shown earlier. However, using the degree-based trigono-
metric functions is preferred, as that way avoids round-off error for special cases such as
sind(30).
9.4. String Functions and Operators
This section describes functions and operators for examining and manipulating string values. Strings in this context include values of the types character, character varying, and text.
SQL defines some string functions that use key words, rather than commas, to separate arguments. Details are in Table 9.8. PostgreSQL also provides versions of these functions that use the regular function invocation syntax (see Table 9.9).
Note
Before PostgreSQL 8.3, these functions would silently accept values of several non-string data
types as well, due to the presence of implicit coercions from those data types to text. Those
coercions have been removed because they frequently caused surprising behaviors. However,
the string concatenation operator (||) still accepts non-string input, so long as at least one
input is of a string type, as shown in Table 9.8. For other cases, insert an explicit coercion to
text if you need to duplicate the previous behavior.
Additional string manipulation functions are available and are listed in Table 9.9. Some of them are
used internally to implement the SQL-standard string functions listed in Table 9.8.
The concat, concat_ws and format functions are variadic, so it is possible to pass the values to
be concatenated or formatted as an array marked with the VARIADIC keyword (see Section 38.5.5).
The array's elements are treated as if they were separate ordinary arguments to the function. If the
variadic array argument is NULL, concat and concat_ws return NULL, but format treats a
NULL as a zero-element array.
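For instance (illustrative values):
SELECT concat_ws(',', 'a', 'b', 'c');                    -- a,b,c
SELECT concat_ws(',', VARIADIC ARRAY['a', 'b', 'c']);    -- a,b,c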
9.4.1. format
The function format produces output formatted according to a format string, in a style similar to
the C function sprintf.
formatstr is a format string that specifies how the result should be formatted. Text in the format
string is copied directly to the result, except where format specifiers are used. Format specifiers act
as placeholders in the string, defining how subsequent function arguments should be formatted and
inserted into the result. Each formatarg argument is converted to text according to the usual output
rules for its data type, and then formatted and inserted into the result string according to the format
specifier(s).
%[position][flags][width]type
position (optional)
A string of the form n$ where n is the index of the argument to print. Index 1 means the first
argument after formatstr. If the position is omitted, the default is to use the next argument
in sequence.
flags (optional)
Additional options controlling how the format specifier's output is formatted. Currently the only
supported flag is a minus sign (-) which will cause the format specifier's output to be left-justified.
This has no effect unless the width field is also specified.
width (optional)
Specifies the minimum number of characters to use to display the format specifier's output. The
output is padded on the left or right (depending on the - flag) with spaces as needed to fill the
width. A too-small width does not cause truncation of the output, but is simply ignored. The width
may be specified using any of the following: a positive integer; an asterisk (*) to use the next
function argument as the width; or a string of the form *n$ to use the nth function argument
as the width.
If the width comes from a function argument, that argument is consumed before the argument that
is used for the format specifier's value. If the width argument is negative, the result is left aligned
(as if the - flag had been specified) within a field of length abs(width).
type (required)
The type of format conversion to use to produce the format specifier's output. The following types
are supported:
• s formats the argument value as a simple string. A null value is treated as an empty string.
• I treats the argument value as an SQL identifier, double-quoting it if necessary. It is an error for the value to be null (equivalent to quote_ident).
• L quotes the argument value as an SQL literal. A null value is displayed as the string NULL, without quotes (equivalent to quote_nullable).
In addition to the format specifiers described above, the special sequence %% may be used to output
a literal % character.
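Some basic examples (values are illustrative):
SELECT format('Hello %s', 'World');
Result: Hello World

SELECT format('Testing %s, %s, %s, %%', 'one', 'two', 'three');
Result: Testing one, two, three, %

SELECT format('INSERT INTO %I VALUES(%L)', 'Foo bar', E'O\'Reilly');
Result: INSERT INTO "Foo bar" VALUES('O''Reilly')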
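The width and left-justification options, for instance, can be exercised like this (illustrative values):
SELECT format('|%-10s|', 'foo');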
Result: |foo       |
Unlike the standard C function sprintf, PostgreSQL's format function allows format specifiers
with and without position fields to be mixed in the same format string. A format specifier without
a position field always uses the next argument after the last argument consumed. In addition, the
format function does not require all function arguments to be used in the format string. For example:
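For example (arguments are illustrative):
SELECT format('Testing %3$s, %2$s, %s', 'one', 'two', 'three');
Result: Testing three, two, three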
The %I and %L format specifiers are particularly useful for safely constructing dynamic SQL state-
ments. See Example 43.1.
9.5. Binary String Functions and Operators
This section describes functions and operators for examining and manipulating values of type bytea.
SQL defines some string functions that use key words, rather than commas, to separate arguments.
Details are in Table 9.11. PostgreSQL also provides versions of these functions that use the regular
function invocation syntax (see Table 9.12).
Note
The sample results shown on this page assume that the server parameter bytea_output is
set to escape (the traditional PostgreSQL format).
Additional binary string manipulation functions are available and are listed in Table 9.12. Some of
them are used internally to implement the SQL-standard string functions listed in Table 9.11.
get_byte and set_byte number the first byte of a binary string as byte 0. get_bit and
set_bit number bits from the right within each byte; for example bit 0 is the least significant bit of
the first byte, and bit 15 is the most significant bit of the second byte.
Note that for historic reasons, the function md5 returns a hex-encoded value of type text whereas the
SHA-2 functions return type bytea. Use the functions encode and decode to convert between the
two, for example encode(sha256('abc'), 'hex') to get a hex-encoded text representation.
See also the aggregate function string_agg in Section 9.20 and the large object functions in Sec-
tion 35.4.
9.6. Bit String Functions and Operators
This section describes functions and operators for examining and manipulating bit strings, that is, values of the types bit and bit varying.
The following SQL-standard functions work on bit strings as well as character strings: length,
bit_length, octet_length, position, substring, overlay.
The following functions work on bit strings as well as binary strings: get_bit, set_bit. When
working with a bit string, these functions number the first (leftmost) bit of the string as bit 0.
In addition, it is possible to cast integral values to and from type bit. Some examples:
44::bit(10) 0000101100
44::bit(3) 100
cast(-44 as bit(12)) 111111010100
'1110'::bit(4)::integer 14
Note that casting to just “bit” means casting to bit(1), and so will deliver only the least significant
bit of the integer.
Note
Casting an integer to bit(n) copies the rightmost n bits. Casting an integer to a bit string
width wider than the integer itself will sign-extend on the left.
9.7. Pattern Matching
There are three separate approaches to pattern matching provided by PostgreSQL: the traditional SQL LIKE operator, the more recent SIMILAR TO operator (added in SQL:1999), and POSIX-style regular expressions.
Tip
If you have pattern matching needs that go beyond this, consider writing a user-defined func-
tion in Perl or Tcl.
Caution
While most regular-expression searches can be executed very quickly, regular expressions can
be contrived that take arbitrary amounts of time and memory to process. Be wary of accepting
regular-expression search patterns from hostile sources. If you must do so, it is advisable to
impose a statement timeout.
Searches using SIMILAR TO patterns have the same security hazards, since SIMILAR TO
provides many of the same capabilities as POSIX-style regular expressions.
LIKE searches, being much simpler than the other two options, are safer to use with possi-
bly-hostile pattern sources.
9.7.1. LIKE
The LIKE expression returns true if the string matches the supplied pattern. (As expected, the
NOT LIKE expression returns false if LIKE returns true, and vice versa. An equivalent expression
is NOT (string LIKE pattern).)
If pattern does not contain percent signs or underscores, then the pattern only represents the string
itself; in that case LIKE acts like the equals operator. An underscore (_) in pattern stands for
(matches) any single character; a percent sign (%) matches any sequence of zero or more characters.
Some examples:
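Some representative comparisons:
'abc' LIKE 'abc'    true
'abc' LIKE 'a%'     true
'abc' LIKE '_b_'    true
'abc' LIKE 'c'      false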
LIKE pattern matching always covers the entire string. Therefore, if it's desired to match a sequence
anywhere within a string, the pattern must start and end with a percent sign.
To match a literal underscore or percent sign without matching other characters, the respective char-
acter in pattern must be preceded by the escape character. The default escape character is the back-
slash but a different one can be selected by using the ESCAPE clause. To match the escape character
itself, write two escape characters.
Note
If you have standard_conforming_strings turned off, any backslashes you write in literal string
constants will need to be doubled. See Section 4.1.2.1 for more information.
It's also possible to select no escape character by writing ESCAPE ''. This effectively disables
the escape mechanism, which makes it impossible to turn off the special meaning of underscore and
percent signs in the pattern.
The key word ILIKE can be used instead of LIKE to make the match case-insensitive according to
the active locale. This is not in the SQL standard but is a PostgreSQL extension.
The operator ~~ is equivalent to LIKE, and ~~* corresponds to ILIKE. There are also !~~ and !~~* operators that represent NOT LIKE and NOT ILIKE, respectively. All of these operators are
PostgreSQL-specific. You may see these operator names in EXPLAIN output and similar places, since
the parser actually translates LIKE et al. to these operators.
The phrases LIKE, ILIKE, NOT LIKE, and NOT ILIKE are generally treated as operators in
PostgreSQL syntax; for example they can be used in expression operator ANY (subquery)
constructs, although an ESCAPE clause cannot be included there. In some obscure cases it may be
necessary to use the underlying operator names instead.
There is also the prefix operator ^@ and corresponding starts_with function which covers cases
when only searching by beginning of the string is needed.
9.7.2. SIMILAR TO Regular Expressions
The SIMILAR TO operator returns true or false depending on whether its pattern matches the given
string. It is similar to LIKE, except that it interprets the pattern using the SQL standard's definition of a
regular expression. SQL regular expressions are a curious cross between LIKE notation and common
regular expression notation.
Like LIKE, the SIMILAR TO operator succeeds only if its pattern matches the entire string; this is
unlike common regular expression behavior where the pattern can match any part of the string. Also
like LIKE, SIMILAR TO uses _ and % as wildcard characters denoting any single character and any
string, respectively (these are comparable to . and .* in POSIX regular expressions).
In addition to these facilities borrowed from LIKE, SIMILAR TO supports these pattern-matching
metacharacters borrowed from POSIX regular expressions:
• | denotes alternation (either of two alternatives).
• * denotes repetition of the previous item zero or more times.
• + denotes repetition of the previous item one or more times.
• ? denotes repetition of the previous item zero or one time.
• {m} denotes repetition of the previous item exactly m times.
• {m,} denotes repetition of the previous item m or more times.
• {m,n} denotes repetition of the previous item at least m and not more than n times.
• Parentheses () can be used to group items into a single logical item.
• A bracket expression [...] specifies a character class, just as in POSIX regular expressions.
Notice that the period (.) is not a metacharacter for SIMILAR TO.
As with LIKE, a backslash disables the special meaning of any of these metacharacters; or a different
escape character can be specified with ESCAPE.
Some examples:
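Representative comparisons:
'abc' SIMILAR TO 'abc'        true
'abc' SIMILAR TO 'a'          false
'abc' SIMILAR TO '%(b|d)%'    true
'abc' SIMILAR TO '(b|c)%'     false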
The substring function with three parameters, substring(string from pattern for
escape-character), provides extraction of a substring that matches an SQL regular expression
pattern. As with SIMILAR TO, the specified pattern must match the entire data string, or else the
function fails and returns null. To indicate the part of the pattern that should be returned on success,
the pattern must contain two occurrences of the escape character followed by a double quote ("). The
text matching the portion of the pattern between these markers is returned.
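For example, using # as the escape character (the pattern must match the whole string, and the part
between the #" markers is returned):

substring('foobar' from '%#"o_b#"%' for '#')   oob
substring('foobar' from '#"o_b#"%' for '#')    NULL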
POSIX regular expressions provide a more powerful means for pattern matching than the LIKE and
SIMILAR TO operators. Many Unix tools such as egrep, sed, or awk use a pattern matching
language that is similar to the one described here.
Some examples:
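For instance, using the ~ operator, which tests whether a POSIX regular expression matches any part
of the string:

'abc' ~ 'abc'    true
'abc' ~ '^a'     true
'abc' ~ '(b|d)'  true
'abc' ~ '^(b|c)' false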
The substring function with two parameters, substring(string from pattern), pro-
vides extraction of a substring that matches a POSIX regular expression pattern. It returns null if there
is no match, otherwise the portion of the text that matched the pattern. But if the pattern contains any
parentheses, the portion of the text that matched the first parenthesized subexpression (the one whose
left parenthesis comes first) is returned. You can put parentheses around the whole expression if you
want to use parentheses within it without triggering this exception. If you need parentheses in the pat-
tern before the subexpression you want to extract, see the non-capturing parentheses described below.
Some examples:
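For instance:

substring('foobar' from 'o.b')     oob
substring('foobar' from 'o(.)b')   o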
The regexp_replace function provides substitution of new text for substrings that match POSIX
regular expression patterns. It has the syntax regexp_replace(source, pattern, replace-
ment [, flags ]). The source string is returned unchanged if there is no match to the pattern.
If there is a match, the source string is returned with the replacement string substituted for the
matching substring. The replacement string can contain \n, where n is 1 through 9, to indicate that
the source substring matching the n'th parenthesized subexpression of the pattern should be inserted,
and it can contain \& to indicate that the substring matching the entire pattern should be inserted.
Write \\ if you need to put a literal backslash in the replacement text. The flags parameter is an
optional text string containing zero or more single-letter flags that change the function's behavior. Flag
i specifies case-insensitive matching, while flag g specifies replacement of each matching substring
rather than only the first one. Supported flags (though not g) are described in Table 9.22.
Some examples:
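For instance (the third call uses a capture group referenced as \1 in the replacement):

regexp_replace('foobarbaz', 'b..', 'X')            fooXbaz
regexp_replace('foobarbaz', 'b..', 'X', 'g')       fooXX
regexp_replace('foobarbaz', 'b(..)', 'X\1Y', 'g')  fooXarYXazY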
The regexp_match function returns a text array of captured substring(s) resulting from the first
match of a POSIX regular expression pattern to a string. It has the syntax regexp_match(string,
pattern [, flags ]). If there is no match, the result is NULL. If a match is found, and the pattern
contains no parenthesized subexpressions, then the result is a single-element text array containing the
substring matching the whole pattern. If a match is found, and the pattern contains parenthesized
subexpressions, then the result is a text array whose n'th element is the substring matching the n'th
parenthesized subexpression of the pattern (not counting “non-capturing” parentheses; see below
for details). The flags parameter is an optional text string containing zero or more single-letter flags
that change the function's behavior. Supported flags are described in Table 9.22.
Some examples:
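For instance:

SELECT regexp_match('foobarbequebaz', 'bar.*que');
 regexp_match
--------------
 {barbeque}
(1 row)

SELECT regexp_match('foobarbequebaz', '(bar)(beque)');
 regexp_match
--------------
 {bar,beque}
(1 row)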
In the common case where you just want the whole matching substring or NULL for no match, write
something like this (a representative query consistent with the result shown):

SELECT (regexp_match('foobarbequebaz', 'bar.*que'))[1];
 regexp_match
--------------
 barbeque
(1 row)
The regexp_matches function returns a set of text arrays of captured substring(s) resulting from
matching a POSIX regular expression pattern to a string. It has the same syntax as regexp_match.
This function returns no rows if there is no match, one row if there is a match and the g flag is not given,
or N rows if there are N matches and the g flag is given. Each returned row is a text array containing the
whole matched substring or the substrings matching parenthesized subexpressions of the pattern,
just as described above for regexp_match. regexp_matches accepts all the flags shown in
Table 9.22, plus the g flag which commands it to return all matches, not just the first one.
Some examples:
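For instance:

SELECT regexp_matches('foo', 'not there');
 regexp_matches
----------------
(0 rows)

SELECT regexp_matches('foobarbequebazilbarfbonk', '(b[^b]+)(b[^b]+)', 'g');
 regexp_matches
----------------
 {bar,beque}
 {bazil,barf}
(2 rows)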
Tip
In most cases regexp_matches() should be used with the g flag, since if you only want
the first match, it's easier and more efficient to use regexp_match(). However, regexp_match()
only exists in PostgreSQL version 10 and up. When working in older versions,
a common trick is to place a regexp_matches() call in a sub-select, for example:
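-- a sketch assuming a hypothetical table tab with columns col1 and col2
SELECT col1, (SELECT regexp_matches(col2, '(bar)(beque)')) FROM tab;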
This produces a text array if there's a match, or NULL if not, the same as regexp_match()
would do. Without the sub-select, this query would produce no output at all for table rows
without a match, which is typically not the desired behavior.
The regexp_split_to_table function splits a string using a POSIX regular expression pattern
as a delimiter. It has the syntax regexp_split_to_table(string, pattern [, flags ]). If
there is no match to the pattern, the function returns the string. If there is at least one match,
for each match it returns the text from the end of the last match (or the beginning of the string) to
the beginning of the match. When there are no more matches, it returns the text from the end of the
last match to the end of the string. The flags parameter is an optional text string containing zero or
more single-letter flags that change the function's behavior. regexp_split_to_table supports
the flags described in Table 9.22.
Some examples:
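For illustration (the second query uses a pattern that can also match zero characters):

SELECT foo FROM regexp_split_to_table('the quick brown fox', '\s+') AS foo;
  foo
-------
 the
 quick
 brown
 fox
(4 rows)

SELECT foo FROM regexp_split_to_table('the quick brown fox', '\s*') AS foo;
 foo
-----
 t
 h
 e
 q
 u
 i
 c
 k
 b
 r
 o
 w
 n
 f
 o
 x
(16 rows)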
As the last example demonstrates, the regexp split functions ignore zero-length matches that occur
at the start or end of the string or immediately after a previous match. This is contrary to the strict
definition of regexp matching that is implemented by regexp_match and regexp_matches, but
is usually the most convenient behavior in practice. Other software systems such as Perl use similar
definitions.
Regular expressions (REs), as defined in POSIX 1003.2, come in two forms: extended REs or EREs
(roughly those of egrep), and basic REs or BREs (roughly those of ed). PostgreSQL supports both
forms, and also implements some extensions that are not in the POSIX standard, but have become
widely used due to their availability in programming languages such as Perl and Tcl. REs using these
non-POSIX extensions are called advanced REs or AREs in this documentation. AREs are almost an
exact superset of EREs, but BREs have several notational incompatibilities (as well as being much
more limited). We first describe the ARE and ERE forms, noting features that apply only to AREs,
and then describe how BREs differ.
Note
PostgreSQL always initially presumes that a regular expression follows the ARE rules. How-
ever, the more limited ERE or BRE rules can be chosen by prepending an embedded option
to the RE pattern, as described in Section 9.7.3.4. This can be useful for compatibility with
applications that expect exactly the POSIX 1003.2 rules.
A regular expression is defined as one or more branches, separated by |. It matches anything that
matches one of the branches.
A branch is zero or more quantified atoms or constraints, concatenated. It matches a match for the
first, followed by a match for the second, etc; an empty branch matches the empty string.
A quantified atom is an atom possibly followed by a single quantifier. Without a quantifier, it matches
a match for the atom. With a quantifier, it can match some number of matches of the atom. An atom
can be any of the possibilities shown in Table 9.15. The possible quantifiers and their meanings are
shown in Table 9.16.
A constraint matches an empty string, but matches only when specific conditions are met. A constraint
can be used where an atom could be used, except it cannot be followed by a quantifier. The simple
constraints are shown in Table 9.17; some more constraints are described later.
Atom Description
x where x is a single character with no other significance, matches that character
Note
If you have standard_conforming_strings turned off, any backslashes you write in literal string
constants will need to be doubled. See Section 4.1.2.1 for more information.
The forms using {...} are known as bounds. The numbers m and n within a bound are unsigned
decimal integers with permissible values from 0 to 255 inclusive.
Non-greedy quantifiers (available in AREs only) match the same possibilities as their corresponding
normal (greedy) counterparts, but prefer the smallest number rather than the largest number of matches.
See Section 9.7.3.5 for more detail.
Note
A quantifier cannot immediately follow another quantifier, e.g., ** is invalid. A quantifier
cannot begin an expression or subexpression or follow ^ or |.
Constraint Description
(?=re) positive lookahead matches at any point where a substring matching re begins (AREs only)
(?!re) negative lookahead matches at any point where no substring matching re begins (AREs only)
(?<=re) positive lookbehind matches at any point where a substring matching re ends (AREs only)
(?<!re) negative lookbehind matches at any point where no substring matching re ends (AREs only)
Lookahead and lookbehind constraints cannot contain back references (see Section 9.7.3.3), and all
parentheses within them are considered non-capturing.
A bracket expression is a list of characters enclosed in []; it normally matches any single character
from the list (or, if the list begins with ^, any single character not in the rest of the list). To include
a literal ] in the list, make it the first character (after ^, if that is used). To include a
literal -, make it the first or last character, or the second endpoint of a range. To use a literal - as
the first endpoint of a range, enclose it in [. and .] to make it a collating element (see below).
With the exception of these characters, some combinations using [ (see next paragraphs), and escapes
(AREs only), all other special characters lose their special significance within a bracket expression.
In particular, \ is not special when following ERE or BRE rules, though it is special (as introducing
an escape) in AREs.
Within a bracket expression, a collating element (a character, a multiple-character sequence that col-
lates as if it were a single character, or a collating-sequence name for either) enclosed in [. and .]
stands for the sequence of characters of that collating element. The sequence is treated as a single
element of the bracket expression's list. This allows a bracket expression containing a multiple-char-
acter collating element to match more than one character, e.g., if the collating sequence includes a ch
collating element, then the RE [[.ch.]]*c matches the first five characters of chchcc.
Note
PostgreSQL currently does not support multi-character collating elements. This information
describes possible future behavior.
Within a bracket expression, a collating element enclosed in [= and =] is an equivalence class, stand-
ing for the sequences of characters of all collating elements equivalent to that one, including itself. (If
there are no other equivalent collating elements, the treatment is as if the enclosing delimiters were [.
and .].) For example, if o and ^ are the members of an equivalence class, then [[=o=]], [[=^=]],
and [o^] are all synonymous. An equivalence class cannot be an endpoint of a range.
Within a bracket expression, the name of a character class enclosed in [: and :] stands for the list of
all characters belonging to that class. Standard character class names are: alnum, alpha, blank,
cntrl, digit, graph, lower, print, punct, space, upper, xdigit. These stand for the
character classes defined in ctype. A locale can provide others. A character class cannot be used as
an endpoint of a range.
There are two special cases of bracket expressions: the bracket expressions [[:<:]] and [[:>:]]
are constraints, matching empty strings at the beginning and end of a word respectively. A word is
defined as a sequence of word characters that is neither preceded nor followed by word characters. A
word character is an alnum character (as defined by ctype) or an underscore. This is an extension,
compatible with but not specified by POSIX 1003.2, and should be used with caution in software
intended to be portable to other systems. The constraint escapes described below are usually preferable;
they are no more standard, but are easier to type.
Character-entry escapes exist to make it easier to specify non-printing and other inconvenient char-
acters in REs. They are shown in Table 9.18.
Class-shorthand escapes provide shorthands for certain commonly-used character classes. They are
shown in Table 9.19.
A constraint escape is a constraint, matching the empty string if specific conditions are met, written
as an escape. They are shown in Table 9.20.
A back reference (\n) matches the same string matched by the previous parenthesized subexpression
specified by the number n (see Table 9.21). For example, ([bc])\1 matches bb or cc but not bc
or cb. The subexpression must entirely precede the back reference in the RE. Subexpressions are
numbered in the order of their leading parentheses. Non-capturing parentheses do not define subex-
pressions.
Escape Description
\xhhh the character whose hexadecimal value is 0xhhh (a single character no matter how many hexadecimal digits are used)
\0 the character whose value is 0 (the null byte)
\xy (where xy is exactly two octal digits, and is not a back reference) the character whose octal value is 0xy
\xyz (where xyz is exactly three octal digits, and is not a back reference) the character whose octal value is 0xyz
Hexadecimal digits are 0-9, a-f, and A-F. Octal digits are 0-7.
Numeric character-entry escapes specifying values outside the ASCII range (0-127) have meanings
dependent on the database encoding. When the encoding is UTF-8, escape values are equivalent to
Unicode code points, for example \u1234 means the character U+1234. For other multibyte encod-
ings, character-entry escapes usually just specify the concatenation of the byte values for the character.
If the escape value does not correspond to any legal character in the database encoding, no error will
be raised, but it will never match any data.
The character-entry escapes are always taken as ordinary characters. For example, \135 is ] in ASCII,
but \135 does not terminate a bracket expression.
Within bracket expressions, \d, \s, and \w lose their outer brackets, and \D, \S, and \W are illegal.
(So, for example, [a-c\d] is equivalent to [a-c[:digit:]]. Also, [a-c\D], which is equiv-
alent to [a-c^[:digit:]], is illegal.)
A word is defined as in the specification of [[:<:]] and [[:>:]] above. Constraint escapes are
illegal within bracket expressions.
Note
There is an inherent ambiguity between octal character-entry escapes and back references,
which is resolved by the following heuristics, as hinted at above. A leading zero always indi-
cates an octal escape. A single non-zero digit, not followed by another digit, is always taken as
a back reference. A multi-digit sequence not starting with a zero is taken as a back reference
if it comes after a suitable subexpression (i.e., the number is in the legal range for a back ref-
erence), and otherwise is taken as octal.
An RE can begin with one of two special director prefixes. If an RE begins with ***:, the rest of
the RE is taken as an ARE. (This normally has no effect in PostgreSQL, since REs are assumed to be
AREs; but it does have an effect if ERE or BRE mode had been specified by the flags parameter
to a regex function.) If an RE begins with ***=, the rest of the RE is taken to be a literal string, with
all characters considered ordinary characters.
An ARE can begin with embedded options: a sequence (?xyz) (where xyz is one or more alpha-
betic characters) specifies options affecting the rest of the RE. These options override any previously
determined options — in particular, they can override the case-sensitivity behavior implied by a regex
operator, or the flags parameter to a regex function. The available option letters are shown in Ta-
ble 9.22. Note that these same option letters are used in the flags parameters of regex functions.
Option Description
s non-newline-sensitive matching (default)
t tight syntax (default; see below)
w inverse partial newline-sensitive (“weird”) matching (see Section 9.7.3.5)
x expanded syntax (see below)
Embedded options take effect at the ) terminating the sequence. They can appear only at the start of
an ARE (after the ***: director if any).
In addition to the usual (tight) RE syntax, in which all characters are significant, there is an expanded
syntax, available by specifying the embedded x option. In the expanded syntax, white-space characters
in the RE are ignored, as are all characters between a # and the following newline (or the end of the
RE). This permits paragraphing and commenting a complex RE. There are three exceptions to that
basic rule:
• a white-space character or # preceded by \ is retained
• white space or # within a bracket expression is retained
• white space and comments cannot appear within multi-character symbols, such as (?:
For this purpose, white-space characters are blank, tab, newline, and any character that belongs to the
space character class.
Finally, in an ARE, outside bracket expressions, the sequence (?#ttt) (where ttt is any text not
containing a )) is a comment, completely ignored. Again, this is not allowed between the characters of
multi-character symbols, like (?:. Such comments are more a historical artifact than a useful facility,
and their use is deprecated; use the expanded syntax instead.
None of these metasyntax extensions is available if an initial ***= director has specified that the user's
input be treated as a literal string rather than as an RE.
• Most atoms, and all constraints, have no greediness attribute (because they cannot match variable
amounts of text anyway).
• A quantified atom with a fixed-repetition quantifier ({m} or {m}?) has the same greediness (pos-
sibly none) as the atom itself.
• A quantified atom with other normal quantifiers (including {m,n} with m equal to n) is greedy
(prefers longest match).
• A quantified atom with a non-greedy quantifier (including {m,n}? with m equal to n) is non-greedy
(prefers shortest match).
• A branch — that is, an RE that has no top-level | operator — has the same greediness as the first
quantified atom in it that has a greediness attribute.
The above rules associate greediness attributes not only with individual quantified atoms, but with
branches and entire REs that contain quantified atoms. What that means is that the matching is done in
such a way that the branch, or whole RE, matches the longest or shortest possible substring as a whole.
Once the length of the entire match is determined, the part of it that matches any particular subexpres-
sion is determined on the basis of the greediness attribute of that subexpression, with subexpressions
starting earlier in the RE taking priority over ones starting later.
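For example, matching against the illustrative string XY1234Z:

SELECT SUBSTRING('XY1234Z', 'Y*([0-9]{1,3})');
Result: 123

SELECT SUBSTRING('XY1234Z', 'Y*?([0-9]{1,3})');
Result: 1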
In the first case, the RE as a whole is greedy because Y* is greedy. It can match beginning at the Y,
and it matches the longest possible string starting there, i.e., Y123. The output is the parenthesized
part of that, or 123. In the second case, the RE as a whole is non-greedy because Y*? is non-greedy.
It can match beginning at the Y, and it matches the shortest possible string starting there, i.e., Y1.
The subexpression [0-9]{1,3} is greedy but it cannot change the decision as to the overall match
length; so it is forced to match just 1.
In short, when an RE contains both greedy and non-greedy subexpressions, the total match length is
either as long as possible or as short as possible, according to the attribute assigned to the whole RE.
The attributes assigned to the subexpressions only affect how much of that match they are allowed
to “eat” relative to each other.
The quantifiers {1,1} and {1,1}? can be used to force greediness or non-greediness, respectively,
on a subexpression or a whole RE. This is useful when you need the whole RE to have a greediness
attribute different from what's deduced from its elements. As an example, suppose that we are trying
to separate a string containing some digits into the digits and the parts before and after them. We might
try to do that like this:
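(The input string abc01234xyz here and in the next two queries is an illustrative choice.)

SELECT regexp_match('abc01234xyz', '(.*)(\d+)(.*)');
Result: {abc0123,4,xyz}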
That didn't work: the first .* is greedy so it “eats” as much as it can, leaving the \d+ to match at the
last possible place, the last digit. We might try to fix that by making it non-greedy:
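SELECT regexp_match('abc01234xyz', '(.*?)(\d+)(.*)');
Result: {abc,0,""}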
That didn't work either, because now the RE as a whole is non-greedy and so it ends the overall match
as soon as possible. We can get what we want by forcing the RE as a whole to be greedy:
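SELECT regexp_match('abc01234xyz', '(?:(.*?)(\d+)(.*)){1,1}');
Result: {abc,01234,xyz}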
Controlling the RE's overall greediness separately from its components' greediness allows great flex-
ibility in handling variable-length patterns.
When deciding what is a longer or shorter match, match lengths are measured in characters, not collat-
ing elements. An empty string is considered longer than no match at all. For example: bb* matches the
three middle characters of abbbc; (week|wee)(night|knights) matches all ten characters
of weeknights; when (.*).* is matched against abc the parenthesized subexpression matches
all three characters; and when (a*)* is matched against bc both the whole RE and the parenthesized
subexpression match an empty string.
If case-independent matching is specified, the effect is much as if all case distinctions had vanished
from the alphabet. When an alphabetic that exists in multiple cases appears as an ordinary character
outside a bracket expression, it is effectively transformed into a bracket expression containing both
cases, e.g., x becomes [xX]. When it appears inside a bracket expression, all case counterparts of it
are added to the bracket expression, e.g., [x] becomes [xX] and [^x] becomes [^xX].
If newline-sensitive matching is specified, . and bracket expressions using ^ will never match the
newline character (so that matches will never cross newlines unless the RE explicitly arranges it) and
^ and $ will match the empty string after and before a newline respectively, in addition to matching at
beginning and end of string respectively. But the ARE escapes \A and \Z continue to match beginning
or end of string only.
If partial newline-sensitive matching is specified, this affects . and bracket expressions as with new-
line-sensitive matching, but not ^ and $.
If inverse partial newline-sensitive matching is specified, this affects ^ and $ as with newline-sensitive
matching, but not . and bracket expressions. This isn't very useful but is provided for symmetry.
The only feature of AREs that is actually incompatible with POSIX EREs is that \ does not lose its
special significance inside bracket expressions. All other ARE features use syntax which is illegal or
has undefined or unspecified effects in POSIX EREs; the *** syntax of directors likewise is outside
the POSIX syntax for both BREs and EREs.
Many of the ARE extensions are borrowed from Perl, but some have been changed to clean them up,
and a few Perl extensions are not present. Incompatibilities of note include \b, \B, the lack of spe-
cial treatment for a trailing newline, the addition of complemented bracket expressions to the things
affected by newline-sensitive matching, the restrictions on parentheses and back references in looka-
head/lookbehind constraints, and the longest/shortest-match (rather than first-match) matching seman-
tics.
Two significant incompatibilities exist between AREs and the ERE syntax recognized by pre-7.4 re-
leases of PostgreSQL:
• In AREs, \ followed by an alphanumeric character is either an escape or an error, while in previous
releases, it was just another way of writing the alphanumeric. This should not be much of a problem
because there was no reason to write such a sequence in earlier releases.
• In AREs, \ remains a special character within [], so a literal \ within a bracket expression must
be written \\.
BREs differ from EREs in several respects. In BREs, |, +, and ? are ordinary characters and there is
no equivalent for their functionality. The delimiters for bounds are \{ and \}, with { and } by themselves
ordinary characters. The parentheses for nested subexpressions are \( and \), with ( and ) by themselves
ordinary characters. ^ is an ordinary character except at the beginning of the RE or the beginning of a
parenthesized subexpression, $ is an ordinary character except at the end of the RE or the end of a
parenthesized subexpression, and * is an ordinary character if it appears at the beginning of the RE or
the beginning of a parenthesized subexpression (after a possible leading ^).
Finally, single-digit back references are available, and \< and \> are synonyms for [[:<:]] and
[[:>:]] respectively; no other escapes are available in BREs.
Note
There is also a single-argument to_timestamp function; see Table 9.30.
Tip
to_timestamp and to_date exist to handle input formats that cannot be converted by
simple casting. For most standard date/time formats, simply casting the source string to the
required data type works, and is much easier. Similarly, to_number is unnecessary for stan-
dard numeric representations.
In a to_char output template string, there are certain patterns that are recognized and replaced with
appropriately-formatted data based on the given value. Any text that is not a template pattern is simply
copied verbatim. Similarly, in an input template string (for the other functions), template patterns
identify the values to be supplied by the input data string. If there are characters in the template string
that are not template patterns, the corresponding characters in the input data string are simply skipped
over (whether or not they are equal to the template string characters).
Table 9.24 shows the template patterns available for formatting date and time values.
Pattern Description
day full lower case day name (blank-padded to 9 chars)
DY abbreviated upper case day name (3 chars in English, localized lengths vary)
Dy abbreviated capitalized day name (3 chars in English, localized lengths vary)
dy abbreviated lower case day name (3 chars in English, localized lengths vary)
DDD day of year (001-366)
IDDD day of ISO 8601 week-numbering year (001-371; day 1 of the year is Monday of the first ISO week)
DD day of month (01-31)
D day of the week, Sunday (1) to Saturday (7)
ID ISO 8601 day of the week, Monday (1) to Sunday (7)
W week of month (1-5) (the first week starts on the first day of the month)
WW week number of year (1-53) (the first week starts on the first day of the year)
IW week number of ISO 8601 week-numbering year (01-53; the first Thursday of the year is in week 1)
CC century (2 digits) (the twenty-first century starts on 2001-01-01)
J Julian Date (integer days since November 24, 4714 BC at local midnight; see Section B.7)
Q quarter
RM month in upper case Roman numerals (I-XII; I=January)
rm month in lower case Roman numerals (i-xii; i=January)
TZ upper case time-zone abbreviation (only supported in to_char)
tz lower case time-zone abbreviation (only supported in to_char)
TZH time-zone hours
TZM time-zone minutes
OF time-zone offset from UTC (only supported in to_char)
Modifiers can be applied to any template pattern to alter its behavior. For example, FMMonth is the
Month pattern with the FM modifier. Table 9.25 shows the modifier patterns for date/time formatting.
• FM suppresses leading zeroes and trailing blanks that would otherwise be added to make the output
of a pattern be fixed-width. In PostgreSQL, FM modifies only the next specification, while in Oracle
FM affects all subsequent specifications, and repeated FM modifiers toggle fill mode on and off.
• TM does not include trailing blanks. to_timestamp and to_date ignore the TM modifier.
• to_timestamp and to_date skip multiple blank spaces in the input string unless the FX option
is used. For example, to_timestamp('2000    JUN', 'YYYY MON') works, but
to_timestamp('2000    JUN', 'FXYYYY MON') returns an error because to_timestamp
expects one space only. FX must be specified as the first item in the template.
• Ordinary text is allowed in to_char templates and will be output literally. You can put a substring
in double quotes to force it to be interpreted as literal text even if it contains template patterns.
For example, in '"Hello Year "YYYY', the YYYY will be replaced by the year data, but
the single Y in Year will not be. In to_date, to_number, and to_timestamp, literal text
and double-quoted strings result in skipping the number of characters contained in the string; for
example "XX" skips two input characters (whether or not they are XX).
• If you want to have a double quote in the output you must precede it with a backslash, for example
'\"YYYY Month\"'. Backslashes are not otherwise special outside of double-quoted strings.
Within a double-quoted string, a backslash causes the next character to be taken literally, whatever
it is (but this has no special effect unless the next character is a double quote or another backslash).
• In to_timestamp and to_date, if the year format specification is less than four digits, e.g.,
YYY, and the supplied year is less than four digits, the year will be adjusted to be nearest to the year
2020, e.g., 95 becomes 1995.
• In to_timestamp and to_date, negative years are treated as signifying BC. If you write both
a negative year and an explicit BC field, you get AD again. An input of year zero is treated as 1 BC.
• In to_timestamp and to_date, the YYYY conversion has a restriction when process-
ing years with more than 4 digits. You must use some non-digit character or template after
YYYY, otherwise the year is always interpreted as 4 digits. For example (with the year 20000):
to_date('200001131', 'YYYYMMDD') will be interpreted as a 4-digit year; instead use
a non-digit separator after the year, like to_date('20000-1131', 'YYYY-MMDD') or
to_date('20000Nov31', 'YYYYMonDD').
• In to_timestamp and to_date, the CC (century) field is accepted but ignored if there is a
YYY, YYYY or Y,YYY field. If CC is used with YY or Y then the result is computed as that year
in the specified century. If the century is specified but the year is not, the first year of the century
is assumed.
• In to_timestamp and to_date, weekday names or numbers (DAY, D, and related field types)
are accepted but are ignored for purposes of computing the result. The same is true for quarter (Q)
fields.
• In to_timestamp and to_date, an ISO 8601 week-numbering date (as distinct from a Grego-
rian date) can be specified in one of two ways:
• Year, week number, and weekday: for example to_date('2006-42-4', 'IYYY-IW-ID') returns
the date 2006-10-19. If you omit the weekday it is assumed to be 1 (Monday).
• Year and day of year: for example to_date('2006-291', 'IYYY-IDDD') also returns
2006-10-19.
Attempting to enter a date using a mixture of ISO 8601 week-numbering fields and Gregorian date
fields is nonsensical, and will cause an error. In the context of an ISO 8601 week-numbering year,
the concept of a “month” or “day of month” has no meaning. In the context of a Gregorian year,
the ISO week has no meaning.
Caution
While to_date will reject a mixture of Gregorian and ISO week-numbering date fields,
to_char will not, since output format specifications like YYYY-MM-DD (IYYY-IDDD)
can be useful. But avoid writing something like IYYY-MM-DD; that would yield surprising
results near the start of the year. (See Section 9.9.1 for more information.)
• In to_timestamp, millisecond (MS) or microsecond (US) fields are used as the seconds digits
after the decimal point. For example to_timestamp('12.3', 'SS.MS') is not 3 millisec-
onds, but 300, because the conversion treats it as 12 + 0.3 seconds. So, for the format SS.MS, the
input values 12.3, 12.30, and 12.300 specify the same number of milliseconds. To get three
milliseconds, one must write 12.003, which the conversion treats as 12 + 0.003 = 12.003 seconds.
• to_char(interval) formats HH and HH12 as shown on a 12-hour clock, for example zero
hours and 36 hours both output as 12, while HH24 outputs the full hour value, which can exceed
23 in an interval value.
Table 9.26 shows the template patterns available for formatting numeric values.
Pattern Description
PL plus sign in specified position (if number > 0)
SG plus/minus sign in specified position
RN Roman numeral (input between 1 and 3999)
TH or th ordinal number suffix
V shift specified number of digits (see notes)
EEEE exponent for scientific notation
• 0 specifies a digit position that will always be printed, even if it contains a leading/trailing zero. 9
also specifies a digit position, but if it is a leading zero then it will be replaced by a space, while
if it is a trailing zero and fill mode is specified then it will be deleted. (For to_number(), these
two pattern characters are equivalent.)
• The pattern characters S, L, D, and G represent the sign, currency symbol, decimal point, and thou-
sands separator characters defined by the current locale (see lc_monetary and lc_numeric). The pat-
tern characters period and comma represent those exact characters, with the meanings of decimal
point and thousands separator, regardless of locale.
• If no explicit provision is made for a sign in to_char()'s pattern, one column will be reserved
for the sign, and it will be anchored to (appear just left of) the number. If S appears just left of some
9's, it will likewise be anchored to the number.
• A sign formatted using SG, PL, or MI is not anchored to the number; for example, to_char(-12,
'MI9999') produces '- 12' but to_char(-12, 'S9999') produces ' -12'. (The
Oracle implementation does not allow the use of MI before 9, but rather requires that 9 precede MI.)
• TH does not convert values less than zero and does not convert fractional numbers.
• In to_number, if non-data template patterns such as L or TH are used, the corresponding number
of input characters are skipped, whether or not they match the template pattern, unless they are data
characters (that is, digits, sign, decimal point, or comma). For example, TH would skip two non-
data characters.
• V with to_char multiplies the input values by 10^n, where n is the number of digits following
V. V with to_number divides in a similar manner. to_char and to_number do not support
the use of V combined with a decimal point (e.g., 99.9V99 is not allowed).
• EEEE (scientific notation) cannot be used in combination with any of the other formatting patterns
or modifiers other than digit and decimal point patterns, and must be at the end of the format string
(e.g., 9.99EEEE is a valid pattern).
Certain modifiers can be applied to any template pattern to alter its behavior. For example, FM99.99
is the 99.99 pattern with the FM modifier. Table 9.27 shows the modifier patterns for numeric for-
matting.
Table 9.28 shows some examples of the use of the to_char function.
Expression Result
to_char(12, '99V999') ' 12000'
to_char(12.4, '99V999') ' 12400'
to_char(12.45, '99V9') ' 125'
to_char(0.0004859, '9.99EEEE') ' 4.86e-04'
In addition, the usual comparison operators shown in Table 9.1 are available for the date/time types.
Dates and timestamps (with or without time zone) are all comparable, while times (with or without
time zone) and intervals can only be compared to other values of the same data type. When comparing
a timestamp without time zone to a timestamp with time zone, the former value is assumed to be
given in the time zone specified by the TimeZone configuration parameter, and is rotated to UTC for
comparison to the latter value (which is already in UTC internally). Similarly, a date value is assumed
to represent midnight in the TimeZone zone when comparing it to a timestamp.
All the functions and operators described below that take time or timestamp inputs actually come
in two variants: one that takes time with time zone or timestamp with time zone, and
one that takes time without time zone or timestamp without time zone. For brevity,
these variants are not shown separately. Also, the + and * operators come in commutative pairs (for
example both date + integer and integer + date); we show only one of each such pair.
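The OVERLAPS operator is written as follows (a brief synopsis and sample query, shown here for
context; each start, end, or length is a date/time or interval value):

(start1, end1) OVERLAPS (start2, end2)
(start1, length1) OVERLAPS (start2, length2)

SELECT (DATE '2001-02-16', DATE '2001-12-21') OVERLAPS
       (DATE '2001-10-30', DATE '2002-10-30');
Result: true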
This expression yields true when two time periods (defined by their endpoints) overlap, false when
they do not overlap. The endpoints can be specified as pairs of dates, times, or time stamps; or as a
date, time, or time stamp followed by an interval. When a pair of values is provided, either the start or
the end can be written first; OVERLAPS automatically takes the earlier value of the pair as the start.
Each time period is considered to represent the half-open interval start <= time < end, unless
start and end are equal in which case it represents that single time instant. This means for instance
that two time periods with only an endpoint in common do not overlap.
When adding an interval value to (or subtracting an interval value from) a timestamp
with time zone value, the days component advances or decrements the date of the timestamp
with time zone by the indicated number of days, keeping the time of day the same. Across day-
light saving time changes (when the session time zone is set to a time zone that recognizes DST), this
means interval '1 day' does not necessarily equal interval '24 hours'. For example,
with the session time zone set to America/Denver:
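(Illustrative queries; the timestamps are chosen to straddle the DST transition discussed below.)

SET TIME ZONE 'America/Denver';

SELECT timestamp with time zone '2005-04-02 12:00:00-07' + interval '1 day';
Result: 2005-04-03 12:00:00-06

SELECT timestamp with time zone '2005-04-02 12:00:00-07' + interval '24 hours';
Result: 2005-04-03 13:00:00-06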
This happens because an hour was skipped due to a change in daylight saving time at 2005-04-03
02:00:00 in time zone America/Denver.
Note there can be ambiguity in the months field returned by age because different months have
different numbers of days. PostgreSQL's approach uses the month from the earlier of the two dates
when calculating partial months. For example, age('2004-06-01', '2004-04-30') uses
April to yield 1 mon 1 day, while using May would yield 1 mon 2 days because May has
31 days, while April has only 30.
Subtraction of dates and timestamps can also be complex. One conceptually simple way to perform
subtraction is to convert each value to a number of seconds using EXTRACT(EPOCH FROM ...),
then subtract the results; this produces the number of seconds between the two values. This will adjust
for the number of days in each month, timezone changes, and daylight saving time adjustments. Sub-
traction of date or timestamp values with the “-” operator returns the number of days (24-hours) and
hours/minutes/seconds between the values, making the same adjustments. The age function returns
years, months, days, and hours/minutes/seconds, performing field-by-field subtraction and then ad-
justing for negative field values. The following queries illustrate the differences in these approaches.
The sample results were produced with timezone = 'US/Eastern'; there is a daylight saving
time change between the two dates used:
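(A sketch of the approaches described above, using two illustrative timestamps:)

SELECT EXTRACT(EPOCH FROM timestamptz '2013-07-01 12:00:00') -
       EXTRACT(EPOCH FROM timestamptz '2013-03-01 12:00:00');
Result: 10537200

SELECT (EXTRACT(EPOCH FROM timestamptz '2013-07-01 12:00:00') -
        EXTRACT(EPOCH FROM timestamptz '2013-03-01 12:00:00'))
        / 60 / 60 / 24;
Result: 121.958333333333

SELECT timestamptz '2013-07-01 12:00:00' - timestamptz '2013-03-01 12:00:00';
Result: 121 days 23:00:00

SELECT age(timestamptz '2013-07-01 12:00:00', timestamptz '2013-03-01 12:00:00');
Result: 4 mons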
The extract function retrieves subfields such as year or hour from date/time values. source must
be a value expression of type timestamp, time, or interval. (Expressions of type date are
cast to timestamp and can therefore be used as well.) field is an identifier or string that selects
what field to extract from the source value. The extract function returns values of type double
precision. The following are valid field names:
century
The century
The first century starts at 0001-01-01 00:00:00 AD, although they did not know it at the time.
This definition applies to all Gregorian calendar countries. There is no century number 0, you
go from -1 century to 1 century. If you disagree with this, please write your complaint to: Pope,
Cathedral Saint-Peter of Roma, Vatican.
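For example:

SELECT EXTRACT(CENTURY FROM TIMESTAMP '2000-12-16 12:21:13');
Result: 20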
day
For timestamp values, the day (of the month) field (1 - 31); for interval values, the number of days
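For example:

SELECT EXTRACT(DAY FROM TIMESTAMP '2001-02-16 20:38:40');
Result: 16

SELECT EXTRACT(DAY FROM INTERVAL '40 days 1 minute');
Result: 40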
decade
The year field divided by 10
dow
The day of the week as Sunday (0) to Saturday (6)
Note that extract's day of the week numbering differs from that of the to_char(..., 'D')
function.
doy
epoch
For timestamp with time zone values, the number of seconds since 1970-01-01 00:00:00
UTC (negative for timestamps before that); for date and timestamp values, the nominal num-
ber of seconds since 1970-01-01 00:00:00, without regard to timezone or daylight-savings rules;
for interval values, the total number of seconds in the interval
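For example:

SELECT EXTRACT(EPOCH FROM TIMESTAMP WITH TIME ZONE '2001-02-16 20:38:40.12-08');
Result: 982384720.12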
You can convert an epoch value back to a timestamp with time zone with to_time-
stamp:
SELECT to_timestamp(982384720.12);
Result: 2001-02-17 04:38:40.12+00
Beware that applying to_timestamp to an epoch extracted from a date or timestamp value
could produce a misleading result: the result will effectively assume that the original value had
been given in UTC, which might not be the case.
hour
isodow
The day of the week as Monday (1) to Sunday (7)
This is identical to dow except for Sunday. This matches the ISO 8601 day of the week numbering.
isoyear
The ISO 8601 week-numbering year that the date falls in (not applicable to intervals)
Each ISO 8601 week-numbering year begins with the Monday of the week containing the 4th of
January, so in early January or late December the ISO year may be different from the Gregorian
year. See the week field for more information.
julian
The Julian Date corresponding to the date or timestamp (not applicable to intervals). Timestamps
that are not local midnight result in a fractional value. See Section B.7 for more information.
microseconds
The seconds field, including fractional parts, multiplied by 1 000 000; note that this includes full
seconds
millennium
The millennium
Years in the 1900s are in the second millennium. The third millennium started January 1, 2001.
milliseconds
The seconds field, including fractional parts, multiplied by 1000. Note that this includes full sec-
onds.
minute
month
For timestamp values, the number of the month within the year (1 - 12); for interval values,
the number of months, modulo 12 (0 - 11)
quarter
second
The seconds field, including fractional parts (0 - 59, or 60 if leap seconds are implemented by the
operating system)
timezone
The time zone offset from UTC, measured in seconds. Positive values correspond to time zones
east of UTC, negative values to zones west of UTC. (Technically, PostgreSQL does not use UTC
because leap seconds are not handled.)
timezone_hour
The hour component of the time zone offset
timezone_minute
The minute component of the time zone offset
week
The number of the ISO 8601 week-numbering week of the year. By definition, ISO weeks start
on Mondays and the first week of a year contains January 4 of that year. In other words, the first
Thursday of a year is in week 1 of that year.
In the ISO week-numbering system, it is possible for early-January dates to be part of the 52nd
or 53rd week of the previous year, and for late-December dates to be part of the first week of the
next year. For example, 2005-01-01 is part of the 53rd week of year 2004, and 2006-01-01
is part of the 52nd week of year 2005, while 2012-12-31 is part of the first week of 2013. It's
recommended to use the isoyear field together with week to get consistent results.
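For example:

SELECT EXTRACT(WEEK FROM TIMESTAMP '2001-02-16 20:38:40');
Result: 7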
year
The year field. Keep in mind there is no 0 AD, so subtracting BC years from AD years should
be done with care.
Note
When the input value is +/-Infinity, extract returns +/-Infinity for monotonically-increasing
fields (epoch, julian, year, isoyear, decade, century, and millennium). For
other fields, NULL is returned. PostgreSQL versions before 9.6 returned zero for all cases of
infinite input.
The extract function is primarily intended for computational processing. For formatting date/time
values for display, see Section 9.8.
The date_part function is modeled on the traditional Ingres equivalent to the SQL-standard func-
tion extract:
date_part('field', source)
Note that here the field parameter needs to be a string value, not a name. The valid field names for
date_part are the same as for extract.
9.9.2. date_trunc
The function date_trunc is conceptually similar to the trunc function for numbers.
date_trunc('field', source)
source is a value expression of type timestamp or interval. (Values of type date and time
are cast automatically to timestamp or interval, respectively.) field selects to which precision
to truncate the input value. The return value is of type timestamp or interval with all fields that
are less significant than the selected one set to zero (or one, for day and month).
microseconds
milliseconds
second
minute
hour
day
week
month
quarter
year
decade
century
millennium
Examples:
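SELECT date_trunc('hour', TIMESTAMP '2001-02-16 20:38:40');
Result: 2001-02-16 20:00:00

SELECT date_trunc('year', TIMESTAMP '2001-02-16 20:38:40');
Result: 2001-01-01 00:00:00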
In AT TIME ZONE expressions, the desired time zone zone can be specified either as a text string
(e.g., 'America/Los_Angeles') or as an interval (e.g., INTERVAL '-08:00'). In the text case, a time
zone name can be specified in any of the ways described in Section 8.5.3.
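Examples (assuming the session TimeZone is set to America/Los_Angeles; under a different setting the
displayed results would differ):

SELECT TIMESTAMP '2001-02-16 20:38:40' AT TIME ZONE 'America/Denver';
Result: 2001-02-16 19:38:40-08

SELECT TIMESTAMP WITH TIME ZONE '2001-02-16 20:38:40-05' AT TIME ZONE 'America/Denver';
Result: 2001-02-16 18:38:40

SELECT TIMESTAMP '2001-02-16 20:38:40' AT TIME ZONE 'Asia/Tokyo' AT TIME ZONE 'America/Chicago';
Result: 2001-02-16 05:38:40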
The first example adds a time zone to a value that lacks it, and displays the value using the current
TimeZone setting. The second example shifts the time stamp with time zone value to the specified
time zone, and returns the value without a time zone. This allows storage and display of values dif-
ferent from the current TimeZone setting. The third example converts Tokyo time to Chicago time.
Converting time values to other time zones uses the currently active time zone rules since no date is
supplied.
CURRENT_DATE
CURRENT_TIME
CURRENT_TIMESTAMP
CURRENT_TIME(precision)
CURRENT_TIMESTAMP(precision)
LOCALTIME
LOCALTIMESTAMP
LOCALTIME(precision)
LOCALTIMESTAMP(precision)
CURRENT_TIME and CURRENT_TIMESTAMP deliver values with time zone; LOCALTIME and LO-
CALTIMESTAMP deliver values without time zone.
Some examples:
SELECT CURRENT_TIME;
Result: 14:39:53.662522-05
SELECT CURRENT_DATE;
Result: 2001-12-23
SELECT CURRENT_TIMESTAMP;
Result: 2001-12-23 14:39:53.662522-05
SELECT CURRENT_TIMESTAMP(2);
Result: 2001-12-23 14:39:53.66-05
SELECT LOCALTIMESTAMP;
Result: 2001-12-23 14:39:53.662522
Since these functions return the start time of the current transaction, their values do not change during
the transaction. This is considered a feature: the intent is to allow a single transaction to have a con-
sistent notion of the “current” time, so that multiple modifications within the same transaction bear
the same time stamp.
Note
Other database systems might advance these values more frequently.
PostgreSQL also provides functions that return the start time of the current statement, as well as the
actual current time at the instant the function is called. The complete list of non-SQL-standard time
functions is:
transaction_timestamp()
statement_timestamp()
clock_timestamp()
timeofday()
now()
All the date/time data types also accept the special literal value now to specify the current date and time
(again, interpreted as the transaction start time). Thus, the following three all return the same result:
SELECT CURRENT_TIMESTAMP;
SELECT now();
SELECT TIMESTAMP 'now'; -- but see tip below
Tip
Do not use the third form when specifying a value to be evaluated later, for example in a
DEFAULT clause for a table column. The system will convert now to a timestamp as soon
as the constant is parsed, so that when the default value is needed, the time of the table creation
would be used! The first two forms will not be evaluated until the default value is used, because
they are function calls. Thus they will give the desired behavior of defaulting to the time of
row insertion. (See also Section 8.5.1.4.)
pg_sleep(seconds)
pg_sleep_for(interval)
pg_sleep makes the current session's process sleep until seconds seconds have elapsed. sec-
onds is a value of type double precision, so fractional-second delays can be specified.
pg_sleep_for is a convenience function for larger sleep times specified as an interval.
pg_sleep_until is a convenience function for when a specific wake-up time is desired. For ex-
ample:
SELECT pg_sleep(1.5);
SELECT pg_sleep_for('5 minutes');
SELECT pg_sleep_until('tomorrow 03:00');
Note
The effective resolution of the sleep interval is platform-specific; 0.01 seconds is a common
value. The sleep delay will be at least as long as specified. It might be longer depending on
factors such as server load. In particular, pg_sleep_until is not guaranteed to wake up
exactly at the specified time, but it will not wake up any earlier.
Warning
Make sure that your session does not hold more locks than necessary when calling pg_sleep
or its variants. Otherwise other sessions might have to wait for your sleeping process, slowing
down the entire system.
Notice that except for the two-argument form of enum_range, these functions disregard the specific
value passed to them; they care only about its declared data type. Either null or a specific value of the
type can be passed, with the same result. It is more common to apply these functions to a table column
or function argument than to a hardwired type name as suggested by the examples.
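A brief sketch, using a hypothetical enum type:

CREATE TYPE rainbow AS ENUM ('red', 'orange', 'yellow', 'green', 'blue', 'purple');

SELECT enum_first(null::rainbow);
Result: red

SELECT enum_range(null::rainbow);
Result: {red,orange,yellow,green,blue,purple}

SELECT enum_range('orange'::rainbow, 'green'::rainbow);
Result: {orange,yellow,green}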
Caution
Note that the “same as” operator, ~=, represents the usual notion of equality for the point,
box, polygon, and circle types. Some of these types also have an = operator, but =
compares for equal areas only. The other scalar comparison operators (<= and so on) likewise
compare areas for these types.
Note
Before PostgreSQL 8.2, the containment operators @> and <@ were respectively called ~ and
@. These names are still available, but are deprecated and will eventually be removed.
It is possible to access the two component numbers of a point as though the point were an array with
indexes 0 and 1. For example, if t.p is a point column then SELECT p[0] FROM t retrieves
the X coordinate and UPDATE t SET p[1] = ... changes the Y coordinate. In the same way,
a value of type box or lseg can be treated as an array of two point values.
The area function works for the types box, circle, and path. The area function only works
on the path data type if the points in the path are non-intersecting. For example, the path
'((0,0),(0,1),(2,1),(2,2),(1,2),(1,0),(0,0))'::PATH will not work; howev-
er, the following visually identical path '((0,0),(0,1),(1,1),(1,2),(2,2),(2,1),
(1,1),(1,0),(0,0))'::PATH will work. If the concept of an intersecting versus non-intersect-
ing path is confusing, draw both of the above paths side by side on a piece of graph paper.
Table 9.37 shows the functions available for use with the cidr and inet types. The abbrev, host,
and text functions are primarily intended to offer alternative display formats.
Any cidr value can be cast to inet implicitly or explicitly; therefore, the functions shown above
as operating on inet also work on cidr values. (Where there are separate functions for inet and
cidr, it is because the behavior should be different for the two cases.) Also, it is permitted to cast
an inet value to cidr. When this is done, any bits to the right of the netmask are silently zeroed
to create a valid cidr value. In addition, you can cast a text value to inet or cidr using normal
casting syntax: for example, inet(expression) or colname::cidr.
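For instance:

SELECT abbrev(cidr '10.1.0.0/16');
Result: 10.1/16

SELECT host(inet '192.168.1.5/24');
Result: 192.168.1.5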
Table 9.38 shows the functions available for use with the macaddr type. The function
trunc(macaddr) returns a MAC address with the last 3 bytes set to zero. This can be used to
associate the remaining prefix with a manufacturer.
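For example:

SELECT trunc(macaddr '12:34:56:78:90:ab');
Result: 12:34:56:00:00:00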
The macaddr type also supports the standard relational operators (>, <=, etc.) for lexicographical
ordering, and the bitwise arithmetic operators (~, & and |) for NOT, AND and OR.
Table 9.39 shows the functions available for use with the macaddr8 type. The function
trunc(macaddr8) returns a MAC address with the last 5 bytes set to zero. This can be used to
associate the remaining prefix with a manufacturer.
The macaddr8 type also supports the standard relational operators (>, <=, etc.) for ordering, and the
bitwise arithmetic operators (~, & and |) for NOT, AND and OR.
Note
The tsquery containment operators consider only the lexemes listed in the two queries,
ignoring the combining operators.
In addition to the operators shown in the table, the ordinary B-tree comparison operators (=, <, etc) are
defined for types tsvector and tsquery. These are not very useful for text searching but allow,
for example, unique indexes to be built on columns of these types.
Note
All the text search functions that accept an optional regconfig argument will use the con-
figuration specified by default_text_search_config when that argument is omitted.
The functions in Table 9.42 are listed separately because they are not usually used in everyday text
searching operations. They are helpful for development and debugging of new text search configura-
tions.
Use of most of these functions requires PostgreSQL to have been built with configure --with-libxml.
9.14.1.1. xmlcomment
xmlcomment(text)
The function xmlcomment creates an XML value containing an XML comment with the specified
text as content. The text cannot contain “--” or end with a “-” so that the resulting construct is a valid
XML comment. If the argument is null, the result is null.
Example:
SELECT xmlcomment('hello');
xmlcomment
--------------
<!--hello-->
9.14.1.2. xmlconcat
xmlconcat(xml[, ...])
The function xmlconcat concatenates a list of individual XML values to create a single value con-
taining an XML content fragment. Null values are omitted; the result is only null if there are no non-
null arguments.
Example (a query that produces this output):

SELECT xmlconcat('<abc/>', '<bar>foo</bar>');

      xmlconcat
----------------------
 <abc/><bar>foo</bar>
XML declarations, if present, are combined as follows. If all argument values have the same XML
version declaration, that version is used in the result, else no version is used. If all argument values
have the standalone declaration value “yes”, then that value is used in the result. If all argument values
have a standalone declaration value and at least one is “no”, then that is used in the result. Else the
result will have no standalone declaration. If the result is determined to require a standalone declaration
but no version declaration, a version declaration with version 1.0 will be used because XML requires
an XML declaration to contain a version declaration. Encoding declarations are ignored and removed
in all cases.
Example (a query that produces this output):

SELECT xmlconcat('<?xml version="1.1"?><foo/>',
                 '<?xml version="1.1" standalone="no"?><bar/>');

             xmlconcat
-----------------------------------
 <?xml version="1.1"?><foo/><bar/>
9.14.1.3. xmlelement
The xmlelement expression produces an XML element with the given name, attributes, and content.
Examples (each query shown is one that produces the output beneath it; the third uses current_date,
so the attribute value reflects the date of execution):

SELECT xmlelement(name foo);

 xmlelement
------------
 <foo/>

SELECT xmlelement(name foo, xmlattributes('xyz' as bar));

    xmlelement
------------------
 <foo bar="xyz"/>

SELECT xmlelement(name foo, xmlattributes(current_date as bar), 'cont', 'ent');

             xmlelement
-------------------------------------
 <foo bar="2007-01-26">content</foo>
Element and attribute names that are not valid XML names are escaped by replacing the offending
characters by the sequence _xHHHH_, where HHHH is the character's Unicode codepoint in hexadec-
imal notation. For example:
SELECT xmlelement(name "foo$bar", xmlattributes('xyz' as "a&b"));

            xmlelement
----------------------------------
 <foo_x0024_bar a_x0026_b="xyz"/>
An explicit attribute name need not be specified if the attribute value is a column reference, in which
case the column's name will be used as the attribute name by default. In other cases, the attribute must
be given an explicit name. So this example is valid:
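-- a sketch assuming a hypothetical table with xml columns a and b
CREATE TABLE test (a xml, b xml);
SELECT xmlelement(name test, xmlattributes(a, b)) FROM test;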
Element content, if specified, will be formatted according to its data type. If the content is itself of
type xml, complex XML documents can be constructed. For example:
SELECT xmlelement(name foo, xmlattributes('xyz' as bar),
                  xmlelement(name abc),
                  xmlcomment('test'),
                  xmlelement(name xyz));

                  xmlelement
----------------------------------------------
 <foo bar="xyz"><abc/><!--test--><xyz/></foo>
Content of other types will be formatted into valid XML character data. This means in particular
that the characters <, >, and & will be converted to entities. Binary data (data type bytea) will
be represented in base64 or hex encoding, depending on the setting of the configuration parameter
xmlbinary. The particular behavior for individual data types is expected to evolve in order to align the
PostgreSQL mappings with those specified in SQL:2006 and later, as discussed in Section D.3.1.3.
9.14.1.4. xmlforest
The xmlforest expression produces an XML forest (sequence) of elements using the given names
and content.
Examples (the queries shown are representative ones that produce these outputs):

SELECT xmlforest('abc' AS foo, 123 AS bar);

          xmlforest
------------------------------
 <foo>abc</foo><bar>123</bar>

SELECT xmlforest(table_name, column_name)
FROM information_schema.columns
WHERE table_schema = 'pg_catalog';

                               xmlforest
------------------------------------------------------------------------
 <table_name>pg_authid</table_name><column_name>rolname</column_name>
 <table_name>pg_authid</table_name><column_name>rolsuper</column_name>
 ...
As seen in the second example, the element name can be omitted if the content value is a column
reference, in which case the column name is used by default. Otherwise, a name must be specified.
Element names that are not valid XML names are escaped as shown for xmlelement above. Simi-
larly, content data is escaped to make valid XML content, unless it is already of type xml.
Note that XML forests are not valid XML documents if they consist of more than one element, so it
might be useful to wrap xmlforest expressions in xmlelement.
9.14.1.5. xmlpi
The xmlpi expression creates an XML processing instruction. The content, if present, must not con-
tain the character sequence ?>.
Example:
SELECT xmlpi(name php, 'echo "hello world";');

            xmlpi
-----------------------------
 <?php echo "hello world";?>
9.14.1.6. xmlroot
The xmlroot expression alters the properties of the root node of an XML value. If a version is
specified, it replaces the value in the root node's version declaration; if a standalone setting is specified,
it replaces the value in the root node's standalone declaration.
Example (a query that produces this output):

SELECT xmlroot(xmlparse(document '<?xml version="1.1"?><content>abc</content>'),
               version '1.0', standalone yes);

                xmlroot
----------------------------------------
 <?xml version="1.0" standalone="yes"?>
 <content>abc</content>
9.14.1.7. xmlagg
xmlagg(xml)
The function xmlagg is, unlike the other functions described here, an aggregate function. It concate-
nates the input values to the aggregate function call, much like xmlconcat does, except that con-
catenation occurs across rows rather than across expressions in a single row. See Section 9.20 for
additional information about aggregate functions.
Example:

CREATE TABLE test (y int, x xml);
INSERT INTO test VALUES (1, '<foo>abc</foo>');
INSERT INTO test VALUES (2, '<bar/>');
SELECT xmlagg(x) FROM test;
        xmlagg
----------------------
 <foo>abc</foo><bar/>
To determine the order of the concatenation, an ORDER BY clause may be added to the aggregate call
as described in Section 4.2.7. For example:

SELECT xmlagg(x ORDER BY y DESC) FROM test;
        xmlagg
----------------------
 <bar/><foo>abc</foo>
The following non-standard approach used to be recommended in previous versions, and may still be
useful in specific cases:

SELECT xmlagg(x) FROM (SELECT x FROM test ORDER BY y DESC) AS tab;
        xmlagg
----------------------
 <bar/><foo>abc</foo>
9.14.2.1. IS DOCUMENT
xml IS DOCUMENT
The expression IS DOCUMENT returns true if the argument XML value is a proper XML document,
false if it is not (that is, it is a content fragment), or null if the argument is null. See Section 8.13 about
the difference between documents and content fragments.
9.14.2.2. IS NOT DOCUMENT

xml IS NOT DOCUMENT

The expression IS NOT DOCUMENT returns false if the argument XML value is a proper XML
document, true if it is not (that is, it is a content fragment), or null if the argument is null.
9.14.2.3. XMLEXISTS
The function xmlexists evaluates an XPath 1.0 expression (the first argument), with the passed
XML value as its context item. The function returns false if the result of that evaluation yields an
empty node-set, true if it yields any other value. The function returns null if any argument is null. A
nonnull value passed as the context item must be an XML document, not a content fragment or any
non-XML value.
Example:

SELECT xmlexists('//town[text() = ''Toronto'']'
                 PASSING BY REF '<towns><town>Toronto</town><town>Ottawa</town></towns>');

 xmlexists
------------
 t
(1 row)
The BY REF clauses are accepted in PostgreSQL, but are ignored, as discussed in Section D.3.2. In
the SQL standard, the xmlexists function evaluates an expression in the XML Query language,
but PostgreSQL allows only an XPath 1.0 expression, as discussed in Section D.3.1.
9.14.2.4. xml_is_well_formed
xml_is_well_formed(text)
xml_is_well_formed_document(text)
xml_is_well_formed_content(text)
These functions check whether a text string is well-formed XML, returning a Boolean re-
sult. xml_is_well_formed_document checks for a well-formed document, while xm-
l_is_well_formed_content checks for well-formed content. xml_is_well_formed does
the former if the xmloption configuration parameter is set to DOCUMENT, or the latter if it is set to
CONTENT. This means that xml_is_well_formed is useful for seeing whether a simple cast to
type xml will succeed, whereas the other two functions are useful for seeing whether the correspond-
ing variants of XMLPARSE will succeed.
Examples:
SELECT xml_is_well_formed('<abc/>');
xml_is_well_formed
--------------------
t
(1 row)
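For instance (a sketch; the namespace URI is arbitrary), a document whose end tag uses an undeclared
prefix is rejected:

SELECT xml_is_well_formed_document('<pg:foo xmlns:pg="http://postgresql.org/stuff">bar</pg:foo>');
 xml_is_well_formed_document
-----------------------------
 t
(1 row)

SELECT xml_is_well_formed_document('<pg:foo xmlns:pg="http://postgresql.org/stuff">bar</my:foo>');
 xml_is_well_formed_document
-----------------------------
 f
(1 row)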
The last example shows that the checks include whether namespaces are correctly matched.
9.14.3.1. xpath
xpath(xpath, xml [, nsarray])

The function xpath evaluates the XPath 1.0 expression xpath (a text value) against the XML
value xml. It returns an array of XML values corresponding to the node-set produced by the XPath
expression. If the XPath expression returns a scalar value rather than a node-set, a single-element array
is returned.
The second argument must be a well-formed XML document. In particular, it must have a single root
node element.
The optional third argument of the function is an array of namespace mappings. This array should be
a two-dimensional text array with the length of the second axis being equal to 2 (i.e., it should be
an array of arrays, each of which consists of exactly 2 elements). The first element of each array entry
is the namespace name (alias), the second the namespace URI. It is not required that aliases provided
in this array be the same as those being used in the XML document itself (in other words, both in the
XML document and in the xpath function context, aliases are local).
Example:

SELECT xpath('/my:a/text()', '<my:a xmlns:my="http://example.com">test</my:a>',
             ARRAY[ARRAY['my', 'http://example.com']]);

 xpath
--------
 {test}
(1 row)

To deal with default (anonymous) namespaces, do something like this:

SELECT xpath('//mydefns:b/text()', '<a xmlns="http://example.com"><b>test</b></a>',
             ARRAY[ARRAY['mydefns', 'http://example.com']]);

 xpath
--------
 {test}
(1 row)
9.14.3.2. xpath_exists
The function xpath_exists is a specialized form of the xpath function. Instead of returning the
individual XML values that satisfy the XPath 1.0 expression, this function returns a Boolean indicating
whether the query was satisfied or not (specifically, whether it produced any value other than an empty
node-set). This function is equivalent to the XMLEXISTS predicate, except that it also offers support
for a namespace mapping argument.
Example:

SELECT xpath_exists('/my:a/text()', '<my:a xmlns:my="http://example.com">test</my:a>',
                    ARRAY[ARRAY['my', 'http://example.com']]);

 xpath_exists
--------------
 t
(1 row)
9.14.3.3. xmltable
The xmltable function produces a table based on the given XML value, an XPath filter to extract
rows, and a set of column definitions.
The optional XMLNAMESPACES clause is a comma-separated list of namespaces. It specifies the XML
namespaces used in the document and their aliases. A default namespace specification is not currently
supported.
The required row_expression argument is an XPath 1.0 expression that is evaluated, passing the
document_expression as its context item, to obtain a set of XML nodes. These nodes are what
xmltable transforms into output rows. No rows will be produced if the document_expression
is null, nor if the row_expression produces an empty node-set or any value other than a node-set.
document_expression provides the context item for the row_expression. It must be a well-
formed XML document; fragments/forests are not accepted. The BY REF clause is accepted but ig-
nored, as discussed in Section D.3.2. In the SQL standard, the xmltable function evaluates expres-
sions in the XML Query language, but PostgreSQL allows only XPath 1.0 expressions, as discussed
in Section D.3.1.
The mandatory COLUMNS clause specifies the list of columns in the output table. Each entry describes
a single column. See the syntax summary above for the format. The column name and type are required;
the path, default and nullability clauses are optional.
A column marked FOR ORDINALITY will be populated with row numbers, starting with 1, in the
order of nodes retrieved from the row_expression's result node-set. At most one column may be
marked FOR ORDINALITY.
Note
XPath 1.0 does not specify an order for nodes in a node-set, so code that relies on a particular
order of the results will be implementation-dependent. Details can be found in Section D.3.1.2.
The column_expression for a column is an XPath 1.0 expression that is evaluated for each row,
with the current node from the row_expression result as its context item, to find the value of the
column. If no column_expression is given, then the column name is used as an implicit path.
If a column's XPath expression returns a non-XML value (limited to string, boolean, or double in
XPath 1.0) and the column has a PostgreSQL type other than xml, the column will be set as if by
assigning the value's string representation to the PostgreSQL type. In this release, an XPath boolean or
double result must be explicitly cast to string (that is, the XPath 1.0 string function wrapped around
the original column expression); PostgreSQL can then successfully assign the string to an SQL result
column of boolean or double type. These conversion rules differ from those of the SQL standard, as
discussed in Section D.3.1.3.
In this release, SQL result columns of xml type, or column XPath expressions evaluating to an XML
type, regardless of the output column SQL type, are handled as described in Section D.3.2; the behavior
changes significantly in PostgreSQL 12.
If the path expression returns an empty node-set (typically, when it does not match) for a given row, the
column will be set to NULL, unless a default_expression is specified; then the value resulting
from evaluating that expression is used.
Columns may be marked NOT NULL. If the column_expression for a NOT NULL column does
not match anything and there is no DEFAULT or the default_expression also evaluates to null,
an error is reported.
Examples:
SELECT xmltable.*
  FROM xmldata,
       XMLTABLE('//ROWS/ROW'
                PASSING data
                COLUMNS id int PATH '@id',
                        ordinality FOR ORDINALITY,
                        "COUNTRY_NAME" text,
                        country_id text PATH 'COUNTRY_ID',
                        size_sq_km float PATH 'SIZE[@unit = "sq_km"]',
                        size_other text PATH
                            'concat(SIZE[@unit!="sq_km"], " ", SIZE[@unit!="sq_km"]/@unit)',
                        premier_name text PATH 'PREMIER_NAME' DEFAULT 'not specified');
The following example shows concatenation of multiple text() nodes, usage of the column name as
XPath filter, and the treatment of whitespace, XML comments and processing instructions:
SELECT xmltable.*
  FROM xmlelements, XMLTABLE('/root' PASSING data COLUMNS element text);

        element
----------------------
 Hello2a2 bbbCC
The following example illustrates how the XMLNAMESPACES clause can be used to specify a list of
namespaces used in the XML document as well as in the XPath expressions:
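A sketch of such a query (the namespace URI and element names are illustrative):

WITH xmldata(data) AS (VALUES (
  '<ex:root xmlns:ex="http://example.com/ns"><ex:item id="1"/><ex:item id="2"/></ex:root>'::xml))
SELECT xmltable.*
  FROM xmldata,
       XMLTABLE(XMLNAMESPACES('http://example.com/ns' AS ns),
                '/ns:root/ns:item'
                PASSING data
                COLUMNS id int PATH '@id');

 id
----
  1
  2
(2 rows)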
table_to_xml maps the content of the named table, passed as parameter tbl. The regclass
type accepts strings identifying tables using the usual notation, including optional schema qualifi-
cations and double quotes. query_to_xml executes the query whose text is passed as parameter
query and maps the result set. cursor_to_xml fetches the indicated number of rows from the
cursor specified by the parameter cursor. This variant is recommended if large tables have to be
mapped, because each function builds up its entire result value in memory.
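For reference, a sketch of these functions' signatures (each returns xml) and a typical call:

table_to_xml(tbl regclass, nulls boolean, tableforest boolean, targetns text)
query_to_xml(query text, nulls boolean, tableforest boolean, targetns text)
cursor_to_xml(cursor refcursor, count int, nulls boolean, tableforest boolean, targetns text)

SELECT table_to_xml('pg_authid'::regclass, true, false, '');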
If tableforest is false, then the resulting XML document looks like this:
<tablename>
<row>
<columnname1>data</columnname1>
<columnname2>data</columnname2>
</row>
<row>
...
</row>
...
</tablename>
If tableforest is true, the result is an XML content fragment that looks like this:
<tablename>
<columnname1>data</columnname1>
<columnname2>data</columnname2>
</tablename>
<tablename>
...
</tablename>
...
If no table name is available, that is, when mapping a query or a cursor, the string table is used in
the first format, row in the second format.
The choice between these formats is up to the user. The first format is a proper XML document,
which will be important in many applications. The second format tends to be more useful in the cur-
sor_to_xml function if the result values are to be reassembled into one document later on. The
functions for producing XML content discussed above, in particular xmlelement, can be used to
alter the results to taste.
The data values are mapped in the same way as described for the function xmlelement above.
The parameter nulls determines whether null values should be included in the output. If true, null
values in columns are represented as:
<columnname xsi:nil="true"/>
where xsi is the XML namespace prefix for XML Schema Instance. An appropriate namespace de-
claration will be added to the result value. If false, columns containing null values are simply omitted
from the output.
The parameter targetns specifies the desired XML namespace of the result. If no particular name-
space is wanted, an empty string should be passed.
The following functions return XML Schema documents describing the mappings performed by the
corresponding functions above:
It is essential that the same parameters are passed in order to obtain matching XML data mappings
and XML Schema documents.
The following functions produce XML data mappings and the corresponding XML Schema in one
document (or forest), linked together. They can be useful where self-contained and self-describing
results are wanted:
In addition, the following functions are available to produce analogous mappings of entire schemas
or the entire current database:
Note that these potentially produce a lot of data, which needs to be built up in memory. When request-
ing content mappings of large schemas or databases, it might be worthwhile to consider mapping the
tables separately instead, possibly even through a cursor.
<schemaname>
table1-mapping
table2-mapping
...
</schemaname>
where the format of a table mapping depends on the tableforest parameter as explained above.
<dbname>
<schema1name>
...
</schema1name>
<schema2name>
...
</schema2name>
...
</dbname>
As an example of using the output produced by these functions, Example 9.1 shows an XSLT
stylesheet that converts the output of table_to_xml_and_xmlschema to an HTML document
containing a tabular rendition of the table data. In a similar manner, the results from these functions
can be converted into other XML-based formats.
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xsd="http://www.w3.org/2001/XMLSchema"
    xmlns="http://www.w3.org/1999/xhtml"
>

  <xsl:output method="xml"
      doctype-system="http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"
      doctype-public="-//W3C/DTD XHTML 1.0 Strict//EN"
      indent="yes"/>
<xsl:template match="/*">
<xsl:variable name="schema" select="//xsd:schema"/>
<xsl:variable name="tabletypename"
select="$schema/
xsd:element[@name=name(current())]/@type"/>
<xsl:variable name="rowtypename"
select="$schema/xsd:complexType[@name=
$tabletypename]/xsd:sequence/xsd:element[@name='row']/@type"/>
<html>
<head>
<title><xsl:value-of select="name(current())"/></title>
</head>
<body>
<table>
<tr>
<xsl:for-each select="$schema/xsd:complexType[@name=
$rowtypename]/xsd:sequence/xsd:element/@name">
<th><xsl:value-of select="."/></th>
</xsl:for-each>
</tr>
<xsl:for-each select="row">
<tr>
<xsl:for-each select="*">
<td><xsl:value-of select="."/></td>
</xsl:for-each>
</tr>
</xsl:for-each>
</table>
</body>
</html>
</xsl:template>
</xsl:stylesheet>
Note
There are parallel variants of these operators for both the json and jsonb types. The field/
element/path extraction operators return the same type as their left-hand input (either json
or jsonb), except for those specified as returning text, which coerce the value to text. The
field/element/path extraction operators return NULL, rather than failing, if the JSON input
does not have the right structure to match the request; for example if no such element exists.
The field/element/path extraction operators that accept integer JSON array subscripts all sup-
port negative subscripting from the end of arrays.
The standard comparison operators shown in Table 9.1 are available for jsonb, but not for json.
They follow the ordering rules for B-tree operations outlined at Section 8.14.4.
Some further operators also exist only for jsonb, as shown in Table 9.44. Many of these operators
can be indexed by jsonb operator classes. For a full description of jsonb containment and existence
semantics, see Section 8.14.3. Section 8.14.4 describes how these operators can be used to effectively
index jsonb data.
Note
The || operator concatenates two JSON objects by generating an object containing the union
of their keys, taking the second object's value when there are duplicate keys. All other cases
produce a JSON array: first, any non-array input is converted into a single-element array, and
then the two arrays are concatenated. It does not operate recursively; only the top-level array
or object structure is merged.
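A brief sketch of both cases:

SELECT '{"a": 1}'::jsonb || '{"b": 2, "a": 3}'::jsonb;  -- {"a": 3, "b": 2}
SELECT '["a", "b"]'::jsonb || '"c"'::jsonb;             -- ["a", "b", "c"]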
Table 9.45 shows the functions that are available for creating json and jsonb values. (There are no
jsonb equivalents of the row_to_json and array_to_json functions; however, the to_jsonb
function supplies much the same functionality.)
Note
array_to_json and row_to_json have the same behavior as to_json except for of-
fering a pretty-printing option. The behavior described for to_json likewise applies to each
individual value converted by the other JSON creation functions.
Note
The hstore extension has a cast from hstore to json, so that hstore values converted
via the JSON creation functions will be represented as JSON objects, not as primitive string
values.
Table 9.46 shows the functions that are available for processing json and jsonb values.
jsonb_extract_path(from_json jsonb, VARIADIC path_elems text[])
    Returns the JSON value pointed to by path_elems (equivalent to the #> operator).

json_extract_path_text(from_json json, VARIADIC path_elems text[])
jsonb_extract_path_text(from_json jsonb, VARIADIC path_elems text[])
    Return type: text. Returns the JSON value pointed to by path_elems as text (equivalent to
    the #>> operator).
    Example: json_extract_path_text('{"f2":{"f3":1},"f4":{"f5":99,"f6":"foo"}}', 'f4', 'f6')
    returns foo

json_object_keys(json)
    Return type: setof text. Returns the set of keys in the outermost JSON object.
    Example: json_object_keys('{"f1":"abc","f2":{"f3":"a","f4":"b"}}') returns

     json_object_keys
    ------------------
     f1
     f2

jsonb_pretty(from_json jsonb)
    Return type: text. Returns from_json as indented JSON text.
    Example: jsonb_pretty('[{"f1":1,"f2":null},2,null,3]') returns

    [
        {
            "f1": 1,
            "f2": null
        },
        2,
        null,
        3
    ]
Note
Many of these functions and operators will convert Unicode escapes in JSON strings to the
appropriate single character. This is a non-issue if the input is type jsonb, because the con-
version was already done; but for json input, this may result in throwing an error, as noted
in Section 8.14.
Note
The functions json[b]_populate_record, json[b]_populate_recordset,
json[b]_to_record and json[b]_to_recordset operate on a JSON object, or ar-
ray of objects, and extract the values associated with keys whose names match column names
of the output row type. Object fields that do not correspond to any output column name are
ignored, and output columns that do not match any object field will be filled with nulls. To
convert a JSON value to the SQL type of an output column, the following rules are applied
in sequence:
• If the output column is a composite (row) type, and the JSON value is a JSON object, the
fields of the object are converted to columns of the output row type by recursive application
of these rules.
• Likewise, if the output column is an array type and the JSON value is a JSON array, the
elements of the JSON array are converted to elements of the output array by recursive ap-
plication of these rules.
• Otherwise, if the JSON value is a string literal, the contents of the string are fed to the input
conversion function for the column's data type.
• Otherwise, the ordinary text representation of the JSON value is fed to the input conversion
function for the column's data type.
While the examples for these functions use constants, the typical use would be to reference a
table in the FROM clause and use one of its json or jsonb columns as an argument to the
function. Extracted key values can then be referenced in other parts of the query, like WHERE
clauses and target lists. Extracting multiple values in this way can improve performance over
extracting them separately with per-key operators.
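A minimal sketch using one of these functions with an inline constant (in practice the argument
would usually be a json or jsonb column):

SELECT *
  FROM json_to_record('{"a": 1, "b": "foo", "x": true}')
       AS t(a int, b text, c boolean);

 a |  b  | c
---+-----+---
 1 | foo |
(1 row)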
Note
All items of the path parameter of jsonb_set, as well as of jsonb_insert, except the last item
must be present in the target. If create_missing is false, all items of the path parameter
of jsonb_set must be present. If these conditions are not met the target is returned unchanged.
If the last path item is an object key, it will be created if it is absent and given the new value.
If the last path item is an array index, if it is positive the item to set is found by counting from
the left, and if negative by counting from the right: -1 designates the rightmost element, and
so on. If the item is out of the range -array_length .. array_length - 1, and create_missing is
true, the new value is added at the beginning of the array if the item is negative, and at the
end of the array if it is positive.
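A brief sketch of these path rules:

-- replace the element at array index 1 (the path's last item is an array index)
SELECT jsonb_set('{"a": [1, 2, 3]}'::jsonb, '{a,1}', '"x"');    -- {"a": [1, "x", 3]}

-- index out of range with create_missing defaulting to true: appended at the end
SELECT jsonb_set('{"a": [1, 2, 3]}'::jsonb, '{a,10}', '"x"');   -- {"a": [1, 2, 3, "x"]}

-- intermediate path item "b" is missing, so the target is returned unchanged
SELECT jsonb_set('{"a": [1, 2, 3]}'::jsonb, '{b,0}', '"x"');    -- {"a": [1, 2, 3]}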
Note
The json_typeof function's null return value should not be confused with a SQL NULL.
While calling json_typeof('null'::json) will return null, calling json_type-
of(NULL::json) will return a SQL NULL.
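A short sketch of the distinction:

SELECT json_typeof('null'::json);        -- returns the text value 'null'
SELECT json_typeof(NULL::json) IS NULL;  -- returns true: the result is a SQL NULL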
Note
If the argument to json_strip_nulls contains duplicate field names in any object, the
result could be semantically somewhat different, depending on the order in which they occur.
This is not an issue for jsonb_strip_nulls since jsonb values never have duplicate
object field names.
See also Section 9.20 for the aggregate function json_agg which aggregates record values as JSON,
and the aggregate function json_object_agg which aggregates pairs of values into a JSON object,
and their jsonb equivalents, jsonb_agg and jsonb_object_agg.
This section describes functions for operating on sequence objects, also called sequence generators or
just sequences. Sequence objects are special single-row tables created with CREATE SEQUENCE.
Sequence objects are commonly used to generate unique identifiers for rows of a table. The sequence
functions, listed in Table 9.47, provide simple, multiuser-safe methods for obtaining successive se-
quence values from sequence objects.
Note
Before PostgreSQL 8.1, the arguments of the sequence functions were of type text, not
regclass, and the above-described conversion from a text string to an OID value would
happen at run time during each call. For backward compatibility, this facility still exists, but
internally it is now handled as an implicit coercion from text to regclass before the func-
tion is invoked.
When you write the argument of a sequence function as an unadorned literal string, it becomes
a constant of type regclass. Since this is really just an OID, it will track the originally iden-
tified sequence despite later renaming, schema reassignment, etc. This “early binding” behav-
ior is usually desirable for sequence references in column defaults and views. But sometimes
you might want “late binding” where the sequence reference is resolved at run time. To get late-
binding behavior, force the constant to be stored as a text constant instead of regclass:
nextval('foo'::text)              -- foo is looked up at run time
Note that late binding was the only behavior supported in PostgreSQL releases before 8.1, so
you might need to do this to preserve the semantics of old applications.
nextval
Advance the sequence object to its next value and return that value. This is done atomically: even
if multiple sessions execute nextval concurrently, each will safely receive a distinct sequence
value.
If a sequence object has been created with default parameters, successive nextval calls will
return successive values beginning with 1. Other behaviors can be obtained by using special pa-
rameters in the CREATE SEQUENCE command; see its command reference page for more in-
formation.
currval
Return the value most recently obtained by nextval for this sequence in the current session.
(An error is reported if nextval has never been called for this sequence in this session.) Because
this is returning a session-local value, it gives a predictable answer whether or not other sessions
have executed nextval since the current session did.
lastval
Return the value most recently returned by nextval in the current session. This function is
identical to currval, except that instead of taking the sequence name as an argument it refers to
whichever sequence nextval was most recently applied to in the current session. It is an error
to call lastval if nextval has not yet been called in the current session.
This function requires USAGE or SELECT privilege on the last used sequence.
setval
Reset the sequence object's counter value. The two-parameter form sets the sequence's
last_value field to the specified value and sets its is_called field to true, meaning that
the next nextval will advance the sequence before returning a value. The value reported by
currval is also set to the specified value. In the three-parameter form, is_called can be set
to either true or false. true has the same effect as the two-parameter form. If it is set to
false, the next nextval will return exactly the specified value, and sequence advancement
commences with the following nextval. Furthermore, the value reported by currval is not
changed in this case. For example,
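-- a sketch, assuming a sequence named foo created with default parameters:
SELECT setval('foo', 42);            -- next nextval will return 43
SELECT setval('foo', 42, true);      -- same as above
SELECT setval('foo', 42, false);     -- next nextval will return exactly 42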
The result returned by setval is just the value of its second argument.
Caution
To avoid blocking concurrent transactions that obtain numbers from the same sequence, the
value obtained by nextval is not reclaimed for re-use if the calling transaction later aborts.
This means that transaction aborts or database crashes can result in gaps in the sequence of
assigned values. That can happen without a transaction abort, too. For example an INSERT
with an ON CONFLICT clause will compute the to-be-inserted tuple, including doing any
required nextval calls, before detecting any conflict that would cause it to follow the ON
CONFLICT rule instead. Thus, PostgreSQL sequence objects cannot be used to obtain “gap-
less” sequences.
Likewise, sequence state changes made by setval are immediately visible to other transac-
tions, and are not undone if the calling transaction rolls back.
If the database cluster crashes before committing a transaction containing a nextval or set-
val call, the sequence state change might not have made its way to persistent storage, so that
it is uncertain whether the sequence will have its original or updated state after the cluster
restarts. This is harmless for usage of the sequence within the database, since other effects of
uncommitted transactions will not be visible either. However, if you wish to use a sequence
value for persistent outside-the-database purposes, make sure that the nextval call has been
committed before doing so.
Tip
If your needs go beyond the capabilities of these conditional expressions, you might want to
consider writing a server-side function in a more expressive programming language.
9.17.1. CASE
The SQL CASE expression is a generic conditional expression, similar to if/else statements in other
programming languages:

CASE WHEN condition THEN result
     [WHEN ...]
     [ELSE result]
END
CASE clauses can be used wherever an expression is valid. Each condition is an expression that
returns a boolean result. If the condition's result is true, the value of the CASE expression is the
result that follows the condition, and the remainder of the CASE expression is not processed. If the
condition's result is not true, any subsequent WHEN clauses are examined in the same manner. If no
WHEN condition yields true, the value of the CASE expression is the result of the ELSE clause.
If the ELSE clause is omitted and no condition is true, the result is null.
An example:
SELECT * FROM test;

 a
---
 1
 2
 3
SELECT a,
CASE WHEN a=1 THEN 'one'
WHEN a=2 THEN 'two'
ELSE 'other'
END
FROM test;
a | case
---+-------
1 | one
2 | two
3 | other
The data types of all the result expressions must be convertible to a single output type. See Sec-
tion 10.5 for more details.
There is a “simple” form of CASE expression that is a variant of the general form above:
CASE expression
WHEN value THEN result
[WHEN ...]
[ELSE result]
END
The first expression is computed, then compared to each of the value expressions in the WHEN
clauses until one is found that is equal to it. If no match is found, the result of the ELSE clause (or
a null value) is returned. This is similar to the switch statement in C.
The example above can be written using the simple CASE syntax:
SELECT a,
CASE a WHEN 1 THEN 'one'
WHEN 2 THEN 'two'
ELSE 'other'
END
FROM test;
a | case
---+-------
1 | one
2 | two
3 | other
A CASE expression does not evaluate any subexpressions that are not needed to determine the result.
For example, this is a possible way of avoiding a division-by-zero failure:
SELECT ... WHERE CASE WHEN x <> 0 THEN y/x > 1.5 ELSE false END;
Note
As described in Section 4.2.14, there are various situations in which subexpressions of an
expression are evaluated at different times, so that the principle that “CASE evaluates only
necessary subexpressions” is not ironclad. For example a constant 1/0 subexpression will
usually result in a division-by-zero failure at planning time, even if it's within a CASE arm that
would never be entered at run time.
9.17.2. COALESCE
COALESCE(value [, ...])
The COALESCE function returns the first of its arguments that is not null. Null is returned only if all
arguments are null. It is often used to substitute a default value for null values when data is retrieved
for display, for example:

SELECT COALESCE(description, short_description, '(none)') ...
The arguments must all be convertible to a common data type, which will be the type of the result
(see Section 10.5 for details).
Like a CASE expression, COALESCE only evaluates the arguments that are needed to determine the
result; that is, arguments to the right of the first non-null argument are not evaluated. This SQL-
standard function provides capabilities similar to NVL and IFNULL, which are used in some other
database systems.
9.17.3. NULLIF
NULLIF(value1, value2)
The NULLIF function returns a null value if value1 equals value2; otherwise it returns value1.
This can be used to perform the inverse operation of the COALESCE example given above:

SELECT NULLIF(value, '(none)') ...
In this example, if value is (none), null is returned, otherwise the value of value is returned.
The two arguments must be of comparable types. To be specific, they are compared exactly as if you
had written value1 = value2, so there must be a suitable = operator available.
The result has the same type as the first argument — but there is a subtlety. What is actually returned is
the first argument of the implied = operator, and in some cases that will have been promoted to match
the second argument's type. For example, NULLIF(1, 2.2) yields numeric, because there is no
integer = numeric operator, only numeric = numeric.
GREATEST(value [, ...])
LEAST(value [, ...])
The GREATEST and LEAST functions select the largest or smallest value from a list of any number of
expressions. The expressions must all be convertible to a common data type, which will be the type of
the result (see Section 10.5 for details). NULL values in the list are ignored. The result will be NULL
only if all the expressions evaluate to NULL.
Note that GREATEST and LEAST are not in the SQL standard, but are a common extension. Some
other databases make them return NULL if any argument is NULL, rather than only when all are
NULL.
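A short illustration:

SELECT GREATEST(1, 2, NULL);      -- 2, since the NULL is ignored
SELECT LEAST(1, 2, NULL);         -- 1
SELECT LEAST(NULL::int, NULL);    -- NULL, since all arguments are NULL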
The array ordering operators (<, >=, etc) compare the array contents element-by-element, using the
default B-tree comparison function for the element data type, and sort based on the first difference. In
multidimensional arrays the elements are visited in row-major order (last subscript varies most rapid-
ly). If the contents of two arrays are equal but the dimensionality is different, the first difference in the
dimensionality information determines the sort order. (This is a change from versions of PostgreSQL
prior to 8.2: older versions would claim that two arrays with the same contents were equal, even if the
number of dimensions or subscript ranges were different.)
The array containment operators (<@ and @>) consider one array to be contained in another one if
each of its elements appears in the other one. Duplicates are not treated specially, thus ARRAY[1]
and ARRAY[1,1] are each considered to contain the other.
See Section 8.15 for more details about array operator behavior. See Section 11.2 for more details
about which operators support indexed operations.
Table 9.49 shows the functions available for use with array types. See Section 8.15 for more informa-
tion and examples of the use of these functions.
unnest(anyarray)
    Return type: setof anyelement. Expands an array to a set of rows.
    Example: unnest(ARRAY[1,2]) returns

     1
     2
    (2 rows)

unnest(anyarray, anyarray [, ...])
    Return type: setof anyelement, anyelement [, ...]. Expands multiple arrays (possibly of
    different types) to a set of rows. This form is only allowed in the FROM clause; see
    Section 7.2.1.4.
    Example: unnest(ARRAY[1,2], ARRAY['foo','bar','baz']) returns

     1    foo
     2    bar
     NULL baz
    (3 rows)
In array_positions, NULL is returned only if the array is NULL; if the value is not found in the
array, an empty array is returned instead.
In string_to_array, if the delimiter parameter is NULL, each character in the input string will
become a separate element in the resulting array. If the delimiter is an empty string, then the entire
input string is returned as a one-element array. Otherwise the input string is split at each occurrence
of the delimiter string.
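A short sketch of the three delimiter cases:

SELECT string_to_array('abc', NULL);     -- {a,b,c}
SELECT string_to_array('abc', '');       -- {abc}
SELECT string_to_array('a,b,c', ',');    -- {a,b,c}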
Note
There are two differences in the behavior of string_to_array from pre-9.1 versions of
PostgreSQL. First, it will return an empty (zero-element) array rather than NULL when the
input string is of zero length. Second, if the delimiter string is NULL, the function splits the
input into individual characters, rather than returning NULL as before.
See also Section 9.20 about the aggregate function array_agg for use with arrays.
The simple comparison operators <, >, <=, and >= compare the lower bounds first, and only if those
are equal, compare the upper bounds. These comparisons are not usually very useful for ranges, but
are provided to allow B-tree indexes to be constructed on ranges.
The left-of/right-of/adjacent operators always return false when an empty range is involved; that is,
an empty range is not considered to be either before or after any other range.
The union and difference operators will fail if the resulting range would need to contain two disjoint
sub-ranges, as such a range cannot be represented.
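A brief sketch with integer ranges:

SELECT int4range(10, 20) @> 15;                 -- t: containment
SELECT int4range(5, 15) + int4range(10, 20);    -- [5,20): union of overlapping ranges
SELECT int4range(5, 15) - int4range(10, 20);    -- [5,10): difference
SELECT int4range(1, 5) + int4range(10, 20);     -- raises an error: the result would be disjoint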
Table 9.51 shows the functions available for use with range types.
The lower and upper functions return null if the range is empty or the requested bound is infinite.
The lower_inc, upper_inc, lower_inf, and upper_inf functions all return false for an
empty range.
It should be noted that except for count, these functions return a null value when no rows are selected.
In particular, sum of no rows returns null, not zero as one might expect, and array_agg returns
null rather than an empty array when there are no input rows. The coalesce function can be used
to substitute zero or an empty array for null when necessary.
Aggregate functions which support Partial Mode are eligible to participate in various optimizations,
such as parallel aggregation.
Note
Boolean aggregates bool_and and bool_or correspond to standard SQL aggregates
every and any or some. As for any and some, it seems that there is an ambiguity built
into the standard syntax:

SELECT b1 = ANY((SELECT b2 FROM t2 ...)) FROM t1 ...;
Here ANY can be considered either as introducing a subquery, or as being an aggregate func-
tion, if the subquery returns one row with a Boolean value. Thus the standard name cannot
be given to these aggregates.
Note
Users accustomed to working with other SQL database management systems might be disap-
pointed by the performance of the count aggregate when it is applied to the entire table. A
query like:

SELECT count(*) FROM sometable;

will require effort proportional to the size of the table: PostgreSQL will need to scan either the
entire table or the entirety of an index which includes all rows in the table.
Alternatively, the input values for an ordering-sensitive aggregate can be supplied from a sorted
subquery, for example:

SELECT xmlagg(x) FROM (SELECT x FROM test ORDER BY y DESC) AS tab;

Beware that this approach can fail if the outer query level contains additional processing, such as a
join, because that might cause the subquery's output to be reordered before the aggregate is computed.
Table 9.53 shows aggregate functions typically used in statistical analysis. (These are separated out
merely to avoid cluttering the listing of more-commonly-used aggregates.) Where the description men-
tions N, it means the number of input rows for which all the input expressions are non-null. In all cases,
null is returned if the computation is meaningless, for example when N is zero.
Table 9.54 shows some aggregate functions that use the ordered-set aggregate syntax. These functions
are sometimes referred to as “inverse distribution” functions.
All the aggregates listed in Table 9.54 ignore null values in their sorted input. For those that take a
fraction parameter, the fraction value must be between 0 and 1; an error is thrown if not. However,
a null fraction value simply produces a null result.
Each of the aggregates listed in Table 9.55 is associated with a window function of the same name
defined in Section 9.21. In each case, the aggregate result is the value that the associated window
function would have returned for the “hypothetical” row constructed from args, if such a row had
been added to the sorted group of rows computed from the sorted_args.
For each of these hypothetical-set aggregates, the list of direct arguments given in args must match
the number and types of the aggregated arguments given in sorted_args. Unlike most built-in
aggregates, these aggregates are not strict, that is they do not drop input rows containing nulls. Null
values sort according to the rule specified in the ORDER BY clause.
Grouping operations are used in conjunction with grouping sets (see Section 7.2.4) to distinguish result
rows. The arguments to the GROUPING operation are not actually evaluated, but they must match
exactly expressions given in the GROUP BY clause of the associated query level. Bits are assigned with
the rightmost argument being the least-significant bit; each bit is 0 if the corresponding expression
is included in the grouping criteria of the grouping set generating the result row, and 1 if it is not.
For example:
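A sketch, assuming a table items_sold(make, model, sales):

SELECT make, model, GROUPING(make, model), sum(sales)
  FROM items_sold
 GROUP BY ROLLUP(make, model);

Here GROUPING(make, model) is 0 for fully grouped rows, 1 (binary 01) for per-make subtotal rows
in which model is not grouped, and 3 (binary 11) for the grand-total row.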
The built-in window functions are listed in Table 9.57. Note that these functions must be invoked
using window function syntax, i.e., an OVER clause is required.
In addition to these functions, any built-in or user-defined general-purpose or statistical aggregate (i.e.,
not ordered-set or hypothetical-set aggregates) can be used as a window function; see Section 9.20
for a list of the built-in aggregates. Aggregate functions act as window functions only when an OVER
clause follows the call; otherwise they act as non-window aggregates and return a single row for the
entire set.
All of the functions listed in Table 9.57 depend on the sort ordering specified by the ORDER BY clause
of the associated window definition. Rows that are not distinct when considering only the ORDER BY
columns are said to be peers. The four ranking functions (including cume_dist) are defined so that
they give the same answer for all peer rows.
Note that first_value, last_value, and nth_value consider only the rows within the “win-
dow frame”, which by default contains the rows from the start of the partition through the last peer
of the current row. This is likely to give unhelpful results for last_value and sometimes also
nth_value. You can redefine the frame by adding a suitable frame specification (RANGE, ROWS or
GROUPS) to the OVER clause. See Section 4.2.8 for more information about frame specifications.
When an aggregate function is used as a window function, it aggregates over the rows within the
current row's window frame. An aggregate used with ORDER BY and the default window frame
definition produces a “running sum” type of behavior, which may or may not be what's wanted. To
obtain aggregation over the whole partition, omit ORDER BY or use ROWS BETWEEN UNBOUNDED
PRECEDING AND UNBOUNDED FOLLOWING. Other frame specifications can be used to obtain
other effects.
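A self-contained sketch of the difference:

SELECT x,
       sum(x) OVER (ORDER BY x)  AS running_sum,
       sum(x) OVER ()            AS total
  FROM generate_series(1, 4) AS t(x);

 x | running_sum | total
---+-------------+-------
 1 |           1 |    10
 2 |           3 |    10
 3 |           6 |    10
 4 |          10 |    10
(4 rows)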
Note
The SQL standard defines a RESPECT NULLS or IGNORE NULLS option for lead, lag,
first_value, last_value, and nth_value. This is not implemented in PostgreSQL:
the behavior is always the same as the standard's default, namely RESPECT NULLS. Likewise,
the standard's FROM FIRST or FROM LAST option for nth_value is not implemented:
only the default FROM FIRST behavior is supported. (You can achieve the result of FROM
LAST by reversing the ORDER BY ordering.)
cume_dist computes the fraction of partition rows that are less than or equal to the current row and
its peers, while percent_rank computes the fraction of partition rows that are less than the current
row, assuming the current row does not exist in the partition.
9.22.1. EXISTS
EXISTS (subquery)
The argument of EXISTS is an arbitrary SELECT statement, or subquery. The subquery is evaluated
to determine whether it returns any rows. If it returns at least one row, the result of EXISTS is “true”;
if the subquery returns no rows, the result of EXISTS is “false”.
The subquery can refer to variables from the surrounding query, which will act as constants during
any one evaluation of the subquery.
The subquery will generally only be executed long enough to determine whether at least one row is
returned, not all the way to completion. It is unwise to write a subquery that has side effects (such as
calling sequence functions); whether the side effects occur might be unpredictable.
Since the result depends only on whether any rows are returned, and not on the contents of those rows,
the output list of the subquery is normally unimportant. A common coding convention is to write all
EXISTS tests in the form EXISTS(SELECT 1 WHERE ...). There are exceptions to this rule
however, such as subqueries that use INTERSECT.
This simple example is like an inner join on col2, but it produces at most one output row for each
tab1 row, even if there are several matching tab2 rows:
SELECT col1
FROM tab1
WHERE EXISTS (SELECT 1 FROM tab2 WHERE col2 = tab1.col2);
9.22.2. IN
expression IN (subquery)
The right-hand side is a parenthesized subquery, which must return exactly one column. The left-hand
expression is evaluated and compared to each row of the subquery result. The result of IN is “true”
if any equal subquery row is found. The result is “false” if no equal row is found (including the case
where the subquery returns no rows).
Note that if the left-hand expression yields null, or if there are no equal right-hand values and at
least one right-hand row yields null, the result of the IN construct will be null, not false. This is in
accordance with SQL's normal rules for Boolean combinations of null values.
As with EXISTS, it's unwise to assume that the subquery will be evaluated completely.
row_constructor IN (subquery)
The left-hand side of this form of IN is a row constructor, as described in Section 4.2.13. The right-
hand side is a parenthesized subquery, which must return exactly as many columns as there are ex-
pressions in the left-hand row. The left-hand expressions are evaluated and compared row-wise to
each row of the subquery result. The result of IN is “true” if any equal subquery row is found. The
result is “false” if no equal row is found (including the case where the subquery returns no rows).
As usual, null values in the rows are combined per the normal rules of SQL Boolean expressions.
Two rows are considered equal if all their corresponding members are non-null and equal; the rows
are unequal if any corresponding members are non-null and unequal; otherwise the result of that row
comparison is unknown (null). If all the per-row results are either unequal or null, with at least one
null, then the result of IN is null.
9.22.3. NOT IN
expression NOT IN (subquery)
The right-hand side is a parenthesized subquery, which must return exactly one column. The left-hand
expression is evaluated and compared to each row of the subquery result. The result of NOT IN is
“true” if only unequal subquery rows are found (including the case where the subquery returns no
rows). The result is “false” if any equal row is found.
Note that if the left-hand expression yields null, or if there are no equal right-hand values and at least
one right-hand row yields null, the result of the NOT IN construct will be null, not true. This is in
accordance with SQL's normal rules for Boolean combinations of null values.
As with EXISTS, it's unwise to assume that the subquery will be evaluated completely.
row_constructor NOT IN (subquery)

The left-hand side of this form of NOT IN is a row constructor, as described in Section 4.2.13. The
right-hand side is a parenthesized subquery, which must return exactly as many columns as there are
expressions in the left-hand row. The left-hand expressions are evaluated and compared row-wise to
each row of the subquery result. The result of NOT IN is “true” if only unequal subquery rows are
found (including the case where the subquery returns no rows). The result is “false” if any equal row
is found.
As usual, null values in the rows are combined per the normal rules of SQL Boolean expressions.
Two rows are considered equal if all their corresponding members are non-null and equal; the rows
are unequal if any corresponding members are non-null and unequal; otherwise the result of that row
comparison is unknown (null). If all the per-row results are either unequal or null, with at least one
null, then the result of NOT IN is null.
9.22.4. ANY/SOME

expression operator ANY (subquery)
expression operator SOME (subquery)
The right-hand side is a parenthesized subquery, which must return exactly one column. The left-hand
expression is evaluated and compared to each row of the subquery result using the given operator,
which must yield a Boolean result. The result of ANY is “true” if any true result is obtained. The result
is “false” if no true result is found (including the case where the subquery returns no rows).
Note that if there are no successes and at least one right-hand row yields null for the operator's result,
the result of the ANY construct will be null, not false. This is in accordance with SQL's normal rules
for Boolean combinations of null values.
As with EXISTS, it's unwise to assume that the subquery will be evaluated completely.
row_constructor operator ANY (subquery)
row_constructor operator SOME (subquery)

The left-hand side of this form of ANY is a row constructor, as described in Section 4.2.13. The right-
hand side is a parenthesized subquery, which must return exactly as many columns as there are ex-
pressions in the left-hand row. The left-hand expressions are evaluated and compared row-wise to
each row of the subquery result, using the given operator. The result of ANY is “true” if the com-
parison returns true for any subquery row. The result is “false” if the comparison returns false for
every subquery row (including the case where the subquery returns no rows). The result is NULL if
no comparison with a subquery row returns true, and at least one comparison returns NULL.
See Section 9.23.5 for details about the meaning of a row constructor comparison.
9.22.5. ALL

expression operator ALL (subquery)
The right-hand side is a parenthesized subquery, which must return exactly one column. The left-hand
expression is evaluated and compared to each row of the subquery result using the given operator,
which must yield a Boolean result. The result of ALL is “true” if all rows yield true (including the case
where the subquery returns no rows). The result is “false” if any false result is found. The result is
NULL if no comparison with a subquery row returns false, and at least one comparison returns NULL.
As with EXISTS, it's unwise to assume that the subquery will be evaluated completely.
row_constructor operator ALL (subquery)

The left-hand side of this form of ALL is a row constructor, as described in Section 4.2.13. The right-
hand side is a parenthesized subquery, which must return exactly as many columns as there are ex-
pressions in the left-hand row. The left-hand expressions are evaluated and compared row-wise to each
row of the subquery result, using the given operator. The result of ALL is “true” if the comparison
returns true for all subquery rows (including the case where the subquery returns no rows). The result
is “false” if the comparison returns false for any subquery row. The result is NULL if no comparison
with a subquery row returns false, and at least one comparison returns NULL.
See Section 9.23.5 for details about the meaning of a row constructor comparison.
9.22.6. Single-row Comparison

row_constructor operator (subquery)

The left-hand side is a row constructor, as described in Section 4.2.13. The right-hand side is a paren-
thesized subquery, which must return exactly as many columns as there are expressions in the left-
hand row. Furthermore, the subquery cannot return more than one row. (If it returns zero rows, the
result is taken to be null.) The left-hand side is evaluated and compared row-wise to the single sub-
query result row.
See Section 9.23.5 for details about the meaning of a row constructor comparison.
9.23.1. IN
expression IN (value [, ...])
The right-hand side is a parenthesized list of scalar expressions. The result is “true” if the left-hand
expression's result is equal to any of the right-hand expressions. This is a shorthand notation for
expression = value1
OR
expression = value2
OR
...
Note that if the left-hand expression yields null, or if there are no equal right-hand values and at least
one right-hand expression yields null, the result of the IN construct will be null, not false. This is in
accordance with SQL's normal rules for Boolean combinations of null values.
9.23.2. NOT IN
expression NOT IN (value [, ...])
The right-hand side is a parenthesized list of scalar expressions. The result is “true” if the left-hand
expression's result is unequal to all of the right-hand expressions. This is a shorthand notation for
expression <> value1
AND
expression <> value2
AND
...
Note that if the left-hand expression yields null, or if there are no equal right-hand values and at least
one right-hand expression yields null, the result of the NOT IN construct will be null, not true as
one might naively expect. This is in accordance with SQL's normal rules for Boolean combinations
of null values.
Tip
x NOT IN y is equivalent to NOT (x IN y) in all cases. However, null values are much
more likely to trip up the novice when working with NOT IN than when working with IN. It
is best to express your condition positively if possible.
9.23.3. ANY/SOME (array)

expression operator ANY (array expression)
expression operator SOME (array expression)

The right-hand side is a parenthesized expression, which must yield an array value. The left-hand
expression is evaluated and compared to each element of the array using the given operator, which
must yield a Boolean result. The result of ANY is “true” if any true result is obtained. The result is
“false” if no true result is found (including the case where the array has zero elements).
If the array expression yields a null array, the result of ANY will be null. If the left-hand expression
yields null, the result of ANY is ordinarily null (though a non-strict comparison operator could possibly
yield a different result). Also, if the right-hand array contains any null elements and no true compar-
ison result is obtained, the result of ANY will be null, not false (again, assuming a strict comparison
operator). This is in accordance with SQL's normal rules for Boolean combinations of null values.
9.23.4. ALL (array)

expression operator ALL (array expression)

The right-hand side is a parenthesized expression, which must yield an array value. The left-hand
expression is evaluated and compared to each element of the array using the given operator, which
must yield a Boolean result. The result of ALL is “true” if all comparisons yield true (including the
case where the array has zero elements). The result is “false” if any false result is found.
If the array expression yields a null array, the result of ALL will be null. If the left-hand expression
yields null, the result of ALL is ordinarily null (though a non-strict comparison operator could possibly
yield a different result). Also, if the right-hand array contains any null elements and no false compar-
ison result is obtained, the result of ALL will be null, not true (again, assuming a strict comparison
operator). This is in accordance with SQL's normal rules for Boolean combinations of null values.
9.23.5. Row Constructor Comparison

row_constructor operator row_constructor

Each side is a row constructor, as described in Section 4.2.13. The two row values must have the
same number of fields. Each side is evaluated and they are compared row-wise. Row constructor
comparisons are allowed when the operator is =, <>, <, <=, > or >=. Every row element must be
of a type which has a default B-tree operator class or the attempted comparison may generate an error.
Note
Errors related to the number or types of elements might not occur if the comparison is resolved
using earlier columns.
The = and <> cases work slightly differently from the others. Two rows are considered equal if all their
corresponding members are non-null and equal; the rows are unequal if any corresponding members
are non-null and unequal; otherwise the result of the row comparison is unknown (null).
For the <, <=, > and >= cases, the row elements are compared left-to-right, stopping as soon as an
unequal or null pair of elements is found. If either of this pair of elements is null, the result of the row
comparison is unknown (null); otherwise comparison of this pair of elements determines the result.
For example, ROW(1,2,NULL) < ROW(1,3,0) yields true, not null, because the third pair of
elements are not considered.
Note
Prior to PostgreSQL 8.2, the <, <=, > and >= cases were not handled per SQL specification.
A comparison like ROW(a,b) < ROW(c,d) was implemented as a < c AND b < d
whereas the correct behavior is equivalent to a < c OR (a = c AND b < d).
row_constructor IS DISTINCT FROM row_constructor

This construct is similar to a <> row comparison, but it does not yield null for null inputs. Instead, any
null value is considered unequal to (distinct from) any non-null value, and any two nulls are considered
equal (not distinct). Thus the result will either be true or false, never null.
row_constructor IS NOT DISTINCT FROM row_constructor

This construct is similar to a = row comparison, but it does not yield null for null inputs. Instead, any
null value is considered unequal to (distinct from) any non-null value, and any two nulls are considered
equal (not distinct). Thus the result will always be either true or false, never null.
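A short sketch of the difference from ordinary row comparison:

SELECT ROW(1, NULL) = ROW(1, NULL);                     -- NULL (unknown)
SELECT ROW(1, NULL) IS NOT DISTINCT FROM ROW(1, NULL);  -- true
SELECT ROW(1, NULL) IS DISTINCT FROM ROW(2, NULL);      -- true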
9.23.6. Composite Type Comparison

record operator record

The SQL specification requires row-wise comparison to return NULL if the result depends on com-
paring two NULL values or a NULL and a non-NULL. PostgreSQL does this only when comparing
the results of two row constructors (as in Section 9.23.5) or comparing a row constructor to the output
of a subquery (as in Section 9.22). In other contexts where two composite-type values are compared,
two NULL field values are considered equal, and a NULL is considered larger than a non-NULL. This
is necessary in order to have consistent sorting and indexing behavior for composite types.
Each side is evaluated and they are compared row-wise. Composite type comparisons are allowed
when the operator is =, <>, <, <=, > or >=, or has semantics similar to one of these. (To be specific,
an operator can be a row comparison operator if it is a member of a B-tree operator class, or is the
negator of the = member of a B-tree operator class.) The default behavior of the above operators is
the same as for IS [ NOT ] DISTINCT FROM for row constructors (see Section 9.23.5).
To support matching of rows which include elements without a default B-tree operator class, the fol-
lowing operators are defined for composite type comparison: *=, *<>, *<, *<=, *>, and *>=. These
operators compare the internal binary representation of the two rows. Two rows might have a differ-
ent binary representation even though comparisons of the two rows with the equality operator is true.
The ordering of rows under these comparison operators is deterministic but not otherwise meaningful.
These operators are used internally for materialized views and might be useful for other specialized
purposes such as replication but are not intended to be generally useful for writing queries.
When step is positive, zero rows are returned if start is greater than stop. Conversely, when
step is negative, zero rows are returned if start is less than stop. Zero rows are also returned for
NULL inputs. It is an error for step to be zero. Some examples follow:
SELECT * FROM generate_series(1.1, 4, 1.3);
 generate_series
-----------------
             1.1
             2.4
             3.7
(3 rows)
generate_subscripts is a convenience function that generates the set of valid subscripts for the
specified dimension of the given array. Zero rows are returned for arrays that do not have the requested
dimension, or for NULL arrays (but valid subscripts are returned for NULL array elements). Some
examples follow:
-- basic usage
SELECT generate_subscripts('{NULL,1,NULL,2}'::int[], 1) AS s;
s
---
1
2
3
4
(4 rows)
-- unnest a 2D array
CREATE OR REPLACE FUNCTION unnest2(anyarray)
RETURNS SETOF anyelement AS $$
select $1[i][j]
from generate_subscripts($1,1) g1(i),
generate_subscripts($1,2) g2(j);
$$ LANGUAGE sql IMMUTABLE;
CREATE FUNCTION
SELECT * FROM unnest2(ARRAY[[1,2],[3,4]]);
unnest2
---------
1
2
3
4
(4 rows)
When a function in the FROM clause is suffixed by WITH ORDINALITY, a bigint column is
appended to the output which starts from 1 and increments by 1 for each row of the function's output.
This is most useful in the case of set returning functions such as unnest().
For example:

SELECT * FROM pg_ls_dir('.') WITH ORDINALITY AS t(ls, n);
      ls       | n
---------------+----
 ...           | ...
 pg_snapshots  | 13
 pg_multixact  | 14
 PG_VERSION    | 15
 pg_wal        | 16
 pg_hba.conf   | 17
 pg_stat_tmp   | 18
 pg_subtrans   | 19
(19 rows)
In addition to the functions listed in this section, there are a number of functions related to the statistics
system that also provide system information. See Section 28.2.2 for more information.
Note
current_catalog, current_role, current_schema, current_user, ses-
sion_user, and user have special syntactic status in SQL: they must be called with-
out trailing parentheses. (In PostgreSQL, parentheses can optionally be used with curren-
t_schema, but not with the others.)
The session_user is normally the user who initiated the current database connection; but supe-
rusers can change this setting with SET SESSION AUTHORIZATION. The current_user is the
user identifier that is applicable for permission checking. Normally it is equal to the session user, but
it can be changed with SET ROLE. It also changes during the execution of functions with the attribute
SECURITY DEFINER. In Unix parlance, the session user is the “real user” and the current user
is the “effective user”. current_role and user are synonyms for current_user. (The SQL
standard draws a distinction between current_role and current_user, but PostgreSQL does
not, since it unifies users and roles into a single kind of entity.)
current_schema returns the name of the schema that is first in the search path (or a null value if
the search path is empty). This is the schema that will be used for any tables or other named objects that
are created without specifying a target schema. current_schemas(boolean) returns an array
of the names of all schemas presently in the search path. The Boolean option determines whether or not
implicitly included system schemas such as pg_catalog are included in the returned search path.
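For instance, the following query (a sketch) shows both the schema that would receive newly created objects and the full effective search path including implicitly searched system schemas:
SELECT current_schema, current_schemas(true);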
Note
The search path can be altered at run time. The command is:
SET search_path TO schema [, schema, ...]
inet_client_addr returns the IP address of the current client, and inet_client_port re-
turns the port number. inet_server_addr returns the IP address on which the server accepted the
current connection, and inet_server_port returns the port number. All these functions return
NULL if the current connection is via a Unix-domain socket.
pg_blocking_pids returns an array of the process IDs of the sessions that are blocking the server
process with the specified process ID, or an empty array if there is no such server process or it is not
blocked. One server process blocks another if it either holds a lock that conflicts with the blocked
process's lock request (hard block), or is waiting for a lock that would conflict with the blocked
process's lock request and is ahead of it in the wait queue (soft block). When using parallel queries the
result always lists client-visible process IDs (that is, pg_backend_pid results) even if the actual
lock is held or awaited by a child worker process. As a result of that, there may be duplicated PIDs in
the result. Also note that when a prepared transaction holds a conflicting lock, it will be represented
by a zero process ID in the result of this function. Frequent calls to this function could have some
impact on database performance, because it needs exclusive access to the lock manager's shared state
for a short time.
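For instance, a session can check whether anything is currently blocking it; with no lock contention the result is simply an empty array:
SELECT pg_blocking_pids(pg_backend_pid());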
pg_conf_load_time returns the timestamp with time zone when the server configura-
tion files were last loaded. (If the current session was alive at the time, this will be the time when
the session itself re-read the configuration files, so the reading will vary a little in different sessions.
Otherwise it is the time when the postmaster process re-read the configuration files.)
pg_current_logfile returns, as text, the path of the log file(s) currently in use by the logging
collector. The path includes the log_directory directory and the log file name. Log collection must
be enabled or the return value is NULL. When multiple log files exist, each in a different format,
pg_current_logfile called without arguments returns the path of the file having the first format
found in the ordered list: stderr, csvlog. NULL is returned when no log file has any of these formats.
To request a specific file format, supply either csvlog or stderr, as text, as the value of the optional parameter. The return value is NULL when the requested log format is not a configured log_destination. pg_current_logfile reflects the contents of the current_logfiles file.
pg_my_temp_schema returns the OID of the current session's temporary schema, or zero if it has
none (because it has not created any temporary tables). pg_is_other_temp_schema returns true
if the given OID is the OID of another session's temporary schema. (This can be useful, for example,
to exclude other sessions' temporary tables from a catalog display.)
pg_postmaster_start_time returns the timestamp with time zone when the server
started.
version returns a string describing the PostgreSQL server's version. You can also get this informa-
tion from server_version or for a machine-readable version, server_version_num. Software develop-
ers should use server_version_num (available since 8.2) or PQserverVersion instead of
parsing the text version.
Table 9.61 lists functions that allow the user to query object access privileges programmatically. See
Section 5.6 for more information about privileges.
has_table_privilege checks whether a user can access a table in a particular way. The user can
be specified by name, by OID (pg_authid.oid), public to indicate the PUBLIC pseudo-role, or
if the argument is omitted current_user is assumed. The table can be specified by name or by OID.
(Thus, there are actually six variants of has_table_privilege, which can be distinguished by the
number and types of their arguments.) When specifying by name, the name can be schema-qualified
if necessary. The desired access privilege type is specified by a text string, which must evaluate to
one of the values SELECT, INSERT, UPDATE, DELETE, TRUNCATE, REFERENCES, or TRIGGER.
Optionally, WITH GRANT OPTION can be added to a privilege type to test whether the privilege
is held with grant option. Also, multiple privilege types can be listed separated by commas, in which
case the result will be true if any of the listed privileges is held. (Case of the privilege string is not
significant, and extra whitespace is allowed between but not within privilege names.) Some examples:
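For instance (the table myschema.mytable and the role joe are illustrative names only):
SELECT has_table_privilege('myschema.mytable', 'select');
SELECT has_table_privilege('joe', 'mytable', 'INSERT, SELECT WITH GRANT OPTION');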
has_any_column_privilege checks whether a user can access any column of a table in a par-
ticular way. Its argument possibilities are analogous to has_table_privilege, except that the
desired access privilege type must evaluate to some combination of SELECT, INSERT, UPDATE, or
REFERENCES. Note that having any of these privileges at the table level implicitly grants it for each
column of the table, so has_any_column_privilege will always return true if has_table_privilege does for the same arguments. But has_any_column_privilege also succeeds if there is a column-level grant of the privilege for at least one column.
has_column_privilege checks whether a user can access a column in a particular way. Its ar-
gument possibilities are analogous to has_table_privilege, with the addition that the column
can be specified either by name or attribute number. The desired access privilege type must evaluate to
some combination of SELECT, INSERT, UPDATE, or REFERENCES. Note that having any of these
privileges at the table level implicitly grants it for each column of the table.
has_function_privilege checks whether a user can access a function in a particular way. Its
argument possibilities are analogous to has_table_privilege. When specifying a function by
a text string rather than by OID, the allowed input is the same as for the regprocedure data type
(see Section 8.19). The desired access privilege type must evaluate to EXECUTE. An example is:
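For instance (joeuser and myfunc are illustrative names; the function is identified by its regprocedure-style signature):
SELECT has_function_privilege('joeuser', 'myfunc(int, text)', 'execute');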
has_schema_privilege checks whether a user can access a schema in a particular way. Its ar-
gument possibilities are analogous to has_table_privilege. The desired access privilege type
must evaluate to some combination of CREATE or USAGE.
has_server_privilege checks whether a user can access a foreign server in a particular way.
Its argument possibilities are analogous to has_table_privilege. The desired access privilege
type must evaluate to USAGE.
has_type_privilege checks whether a user can access a type in a particular way. Its argument
possibilities are analogous to has_table_privilege. When specifying a type by a text string
rather than by OID, the allowed input is the same as for the regtype data type (see Section 8.19).
The desired access privilege type must evaluate to USAGE.
pg_has_role checks whether a user can access a role in a particular way. Its argument possibilities
are analogous to has_table_privilege, except that public is not allowed as a user name. The
desired access privilege type must evaluate to some combination of MEMBER or USAGE. MEMBER
denotes direct or indirect membership in the role (that is, the right to do SET ROLE), while USAGE
denotes whether the privileges of the role are immediately available without doing SET ROLE.
row_security_active checks whether row level security is active for the specified table in the
context of the current_user and environment. The table can be specified by name or by OID.
Table 9.62 shows functions that determine whether a certain object is visible in the current schema
search path. For example, a table is said to be visible if its containing schema is in the search path
and no table of the same name appears earlier in the search path. This is equivalent to the statement
that the table can be referenced by name without explicit schema qualification. To list the names of
all visible tables:
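One possible query, sketched here, combines pg_class with pg_table_is_visible (note that it lists every visible relation recorded in pg_class, so views and indexes appear too):
SELECT relname FROM pg_class WHERE pg_table_is_visible(oid);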
Each function performs the visibility check for one type of database object. Note that pg_table_is_visible can also be used with views, materialized views, indexes, sequences and foreign tables; pg_function_is_visible can also be used with procedures and aggregates;
pg_type_is_visible can also be used with domains. For functions and operators, an object in
the search path is visible if there is no object of the same name and argument data type(s) earlier in
the path. For operator classes, both name and associated index access method are considered.
All these functions require object OIDs to identify the object to be checked. If you want to test an
object by name, it is convenient to use the OID alias types (regclass, regtype, regprocedure,
regoperator, regconfig, or regdictionary), for example:
SELECT pg_type_is_visible('myschema.widget'::regtype);
Note that it would not make much sense to test a non-schema-qualified type name in this way — if
the name can be recognized at all, it must be visible.
Table 9.63 lists functions that extract information from the system catalogs.
format_type returns the SQL name of a data type that is identified by its type OID and possibly a
type modifier. Pass NULL for the type modifier if no specific modifier is known.
pg_get_keywords returns a set of records describing the SQL keywords recognized by the server.
The word column contains the keyword. The catcode column contains a category code: U for
unreserved, C for column name, T for type or function name, or R for reserved. The catdesc column
contains a possibly-localized string describing the category.
pg_get_serial_sequence returns the name of the sequence associated with a column, or NULL
if no sequence is associated with the column. If the column is an identity column, the associated se-
quence is the sequence internally created for the identity column. For columns created using one of the
serial types (serial, smallserial, bigserial), it is the sequence created for that serial column
definition. In the latter case, this association can be modified or removed with ALTER SEQUENCE
OWNED BY. (The function probably should have been called pg_get_owned_sequence; its cur-
rent name reflects the fact that it has typically been used with serial or bigserial columns.)
The first input parameter is a table name with optional schema, and the second parameter is a column
name. Because the first parameter is potentially a schema and table, it is not treated as a double-quoted
identifier, meaning it is lower cased by default, while the second parameter, being just a column name,
is treated as double-quoted and has its case preserved. The function returns a value suitably formatted
for passing to sequence functions (see Section 9.16). A typical use is in reading the current value of
a sequence for an identity or serial column, for example:
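For instance (sometable and its id column are illustrative names):
SELECT currval(pg_get_serial_sequence('sometable', 'id'));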
pg_typeof returns the OID of the data type of the value that is passed to it. This can be helpful
for troubleshooting or dynamically constructing SQL queries. The function is declared as returning
regtype, which is an OID alias type (see Section 8.19); this means that it is the same as an OID for
comparison purposes but displays as a type name. For example:
SELECT pg_typeof(33);
pg_typeof
-----------
integer
(1 row)
The expression collation for returns the collation of the value that is passed to it. Example:
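A minimal sketch, using the "C" collation because it exists in every installation:
SELECT collation for ('foo' COLLATE "C");
 pg_collation_for
------------------
 "C"
(1 row)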
The value might be quoted and schema-qualified. If no collation is derived for the argument expression,
then a null value is returned. If the argument is not of a collatable data type, then an error is raised.
Table 9.67 lists functions related to database object identification and addressing.
pg_identify_object returns a row containing enough information to uniquely identify the data-
base object specified by catalog OID, object OID and sub-object ID. This information is intended to
be machine-readable, and is never translated. type identifies the type of database object; schema is
the schema name that the object belongs in, or NULL for object types that do not belong to schemas;
name is the name of the object, quoted if necessary, if the name (along with schema name, if perti-
nent) is sufficient to uniquely identify the object, otherwise NULL; identity is the complete object
identity, with the precise format depending on object type, and each name within the format being
schema-qualified and quoted as necessary.
The functions shown in Table 9.68 extract comments previously stored with the COMMENT com-
mand. A null value is returned if no comment could be found for the specified parameters.
col_description returns the comment for a table column, which is specified by the OID of its
table and its column number. (obj_description cannot be used for table columns since columns
do not have OIDs of their own.)
The two-parameter form of obj_description returns the comment for a database object spec-
ified by its OID and the name of the containing system catalog. For example, obj_description(123456,'pg_class') would retrieve the comment for the table with OID 123456. The
one-parameter form of obj_description requires only the object OID. It is deprecated since there
is no guarantee that OIDs are unique across different system catalogs; therefore, the wrong comment
might be returned.
shobj_description is used just like obj_description except it is used for retrieving com-
ments on shared objects. Some system catalogs are global to all databases within each cluster, and the
descriptions for objects in them are stored globally as well.
The functions shown in Table 9.69 provide server transaction information in an exportable form. The
main use of these functions is to determine which transactions were committed between two snapshots.
The internal transaction ID type (xid) is 32 bits wide and wraps around every 4 billion transactions.
However, these functions export a 64-bit format that is extended with an “epoch” counter so it will
not wrap around during the life of an installation. The data type used by these functions, txid_snapshot, stores information about transaction ID visibility at a particular moment in time. Its components are described in Table 9.70.
txid_status(bigint) reports the commit status of a recent transaction. Applications may use
it to determine whether a transaction committed or aborted when the application and database server
become disconnected while a COMMIT is in progress. The status of a transaction will be reported as
either in progress, committed, or aborted, provided that the transaction is recent enough
that the system retains the commit status of that transaction. If it is old enough that no references to that
transaction survive in the system and the commit status information has been discarded, this function
will return NULL. Note that prepared transactions are reported as in progress; applications must
check pg_prepared_xacts if they need to determine whether the txid is a prepared transaction.
The functions shown in Table 9.71 provide information about transactions that have been already
committed. These functions mainly provide information about when the transactions were committed.
They only provide useful data when the track_commit_timestamp configuration option is enabled, and only for transactions that were committed after it was enabled.
The functions shown in Table 9.72 print information initialized during initdb, such as the catalog
version. They also show information about write-ahead logging and checkpoint processing. This in-
formation is cluster-wide, and not specific to any one database. They provide most of the same infor-
mation, from the same source, as pg_controldata, although in a form better suited to SQL functions.
The function current_setting yields the current value of the setting setting_name. It cor-
responds to the SQL command SHOW. An example:
SELECT current_setting('datestyle');
current_setting
-----------------
ISO, MDY
(1 row)
set_config sets the parameter setting_name to new_value. If is_local is true, the new
value will only apply to the current transaction. If you want the new value to apply for the current
session, use false instead. The function corresponds to the SQL command SET. An example:
SELECT set_config('log_statement_stats', 'off', false);
 set_config
------------
 off
(1 row)
pg_reload_conf sends a SIGHUP signal to the server, causing configuration files to be reloaded
by all server processes.
pg_rotate_logfile signals the log-file manager to switch to a new output file immediately. This
works only when the built-in log collector is running, since otherwise there is no log-file manager
subprocess.
pg_start_backup accepts an arbitrary user-defined label for the backup. (Typically this would
be the name under which the backup dump file will be stored.) When used in exclusive mode, the
function writes a backup label file (backup_label) and, if there are any links in the pg_tblspc/
directory, a tablespace map file (tablespace_map) into the database cluster's data directory, per-
forms a checkpoint, and then returns the backup's starting write-ahead log location as text. The user
can ignore this result value, but it is provided in case it is useful. When used in non-exclusive mode,
the contents of these files are instead returned by the pg_stop_backup function, and should be
written to the backup by the caller.
There is an optional second parameter of type boolean. If true, it specifies executing pg_start_backup as quickly as possible. This forces an immediate checkpoint which will cause a spike
in I/O operations, slowing any concurrently executing queries.
In an exclusive backup, pg_stop_backup removes the label file and, if it exists, the tablespace_map file created by pg_start_backup. In a non-exclusive backup, the contents of the
backup_label and tablespace_map are returned in the result of the function, and should be
written to files in the backup (and not in the data directory). There is an optional second parameter of
type boolean. If false, the pg_stop_backup will return immediately after the backup is complet-
ed without waiting for WAL to be archived. This behavior is only useful for backup software which
independently monitors WAL archiving. Otherwise, WAL required to make the backup consistent
might be missing and make the backup useless. When this parameter is set to true, pg_stop_backup will wait for WAL to be archived when archiving is enabled; on the standby, this means that it
will wait only when archive_mode = always. If write activity on the primary is low, it may be
useful to run pg_switch_wal on the primary in order to trigger an immediate segment switch.
When executed on a primary, the function also creates a backup history file in the write-ahead log
archive area. The history file includes the label given to pg_start_backup, the starting and ending
write-ahead log locations for the backup, and the starting and ending times of the backup. The return
value is the backup's ending write-ahead log location (which again can be ignored). After recording
the ending location, the current write-ahead log insertion point is automatically advanced to the next
write-ahead log file, so that the ending write-ahead log file can be archived immediately to complete
the backup.
pg_switch_wal moves to the next write-ahead log file, allowing the current file to be archived
(assuming you are using continuous archiving). The return value is the ending write-ahead log location
+ 1 within the just-completed write-ahead log file. If there has been no write-ahead log activity since
the last write-ahead log switch, pg_switch_wal does nothing and returns the start location of the
write-ahead log file currently in use.
pg_create_restore_point creates a named write-ahead log record that can be used as recovery
target, and returns the corresponding write-ahead log location. The given name can then be used with
recovery_target_name to specify the point up to which recovery will proceed. Avoid creating multiple
restore points with the same name, since recovery will stop at the first one whose name matches the
recovery target.
pg_current_wal_lsn displays the current write-ahead log write location in the same format used
by the above functions. Similarly, pg_current_wal_insert_lsn displays the current write-
ahead log insertion location and pg_current_wal_flush_lsn displays the current write-ahead
log flush location. The insertion location is the “logical” end of the write-ahead log at any instant,
while the write location is the end of what has actually been written out from the server's internal
buffers and flush location is the location guaranteed to be written to durable storage. The write location
is the end of what can be examined from outside the server, and is usually what you want if you are
interested in archiving partially-complete write-ahead log files. The insertion and flush locations are
made available primarily for server debugging purposes. These are both read-only operations and do
not require superuser permissions.
You can use pg_walfile_name_offset to extract the corresponding write-ahead log file name
and byte offset from the results of any of the above functions. For example:
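A sketch of such a call, here feeding it the current write location (this only works on a primary, since pg_current_wal_lsn cannot be executed during recovery; the resulting file name and offset vary by installation):
SELECT * FROM pg_walfile_name_offset(pg_current_wal_lsn());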
Similarly, pg_walfile_name extracts just the write-ahead log file name. When the given write-
ahead log location is exactly at a write-ahead log file boundary, both these functions return the name
of the preceding write-ahead log file. This is usually the desired behavior for managing write-ahead
log archiving behavior, since the preceding file is the last one that currently needs to be archived.
pg_wal_lsn_diff calculates the difference in bytes between two write-ahead log locations. It can
be used with pg_stat_replication or some functions shown in Table 9.79 to get the replication
lag.
For details about proper usage of these functions, see Section 25.3.
The functions shown in Table 9.81 control the progress of recovery. These functions may be executed
only during recovery.
While recovery is paused no further database changes are applied. If in hot standby, all new queries
will see the same consistent snapshot of the database, and no further query conflicts will be generated
until recovery is resumed.
If streaming replication is disabled, the paused state may continue indefinitely without problem. While
streaming replication is in progress WAL records will continue to be received, which will eventually
fill available disk space, depending upon the duration of the pause, the rate of WAL generation and
available disk space.
To solve this problem, PostgreSQL allows a transaction to export the snapshot it is using. As long
as the exporting transaction remains open, other transactions can import its snapshot, and thereby be
guaranteed that they see exactly the same view of the database that the first transaction sees. But note
that any database changes made by any one of these transactions remain invisible to the other transac-
tions, as is usual for changes made by uncommitted transactions. So the transactions are synchronized
with respect to pre-existing data, but act normally for changes they make themselves.
Snapshots are exported with the pg_export_snapshot function, shown in Table 9.82, and im-
ported with the SET TRANSACTION command.
The function pg_export_snapshot saves the current snapshot and returns a text string identi-
fying the snapshot. This string must be passed (outside the database) to clients that want to import the
snapshot. The snapshot is available for import only until the end of the transaction that exported it. A
transaction can export more than one snapshot, if needed. Note that doing so is only useful in READ
COMMITTED transactions, since in REPEATABLE READ and higher isolation levels, transactions use
the same snapshot throughout their lifetime. Once a transaction has exported any snapshots, it cannot
be prepared with PREPARE TRANSACTION.
Many of these functions have equivalent commands in the replication protocol; see Section 53.4.
The functions described in Section 9.26.3, Section 9.26.4, and Section 9.26.5 are also relevant for
replication.
pg_column_size shows the space used to store any individual data value.
pg_total_relation_size accepts the OID or name of a table or toast table, and returns the
total on-disk space used for that table, including all associated indexes. This function is equivalent to
pg_table_size + pg_indexes_size.
pg_table_size accepts the OID or name of a table and returns the disk space needed for that table,
exclusive of indexes. (TOAST space, free space map, and visibility map are included.)
pg_indexes_size accepts the OID or name of a table and returns the total disk space used by all
the indexes attached to that table.
pg_relation_size accepts the OID or name of a table, index or toast table, and returns the on-
disk size in bytes of one fork of that relation. (Note that for most purposes it is more convenient to
use the higher-level functions pg_total_relation_size or pg_table_size, which sum the
sizes of all forks.) With one argument, it returns the size of the main data fork of the relation. The
second argument can be provided to specify which fork to examine:
• 'main' returns the size of the main data fork of the relation.
• 'fsm' returns the size of the Free Space Map (see Section 69.3) associated with the relation.
• 'vm' returns the size of the Visibility Map (see Section 69.4) associated with the relation.
• 'init' returns the size of the initialization fork, if any, associated with the relation.
pg_size_pretty can be used to format the result of one of the other functions in a human-readable
way, using bytes, kB, MB, GB or TB as appropriate.
pg_size_bytes can be used to get the size in bytes from a string in human-readable format. The
input may have units of bytes, kB, MB, GB or TB, and is parsed case-insensitively. If no units are
specified, bytes are assumed.
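For instance, the two functions are near inverses of each other (the second call returns 10485760):
SELECT pg_size_pretty(pg_database_size(current_database()));
SELECT pg_size_bytes('10 MB');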
Note
The units kB, MB, GB and TB used by the functions pg_size_pretty and
pg_size_bytes are defined using powers of 2 rather than powers of 10, so 1kB is 1024 bytes, 1MB is 1024² = 1048576 bytes, and so on.
The functions above that operate on tables or indexes accept a regclass argument, which is simply
the OID of the table or index in the pg_class system catalog. You do not have to look up the OID
by hand, however, since the regclass data type's input converter will do the work for you. Just
write the table name enclosed in single quotes so that it looks like a literal constant. For compatibility
with the handling of ordinary SQL names, the string will be converted to lower case unless it contains
double quotes around the table name.
If an OID that does not represent an existing object is passed as argument to one of the above functions,
NULL is returned.
The functions shown in Table 9.85 assist in identifying the specific disk files associated with database
objects.
pg_relation_filenode accepts the OID or name of a table, index, sequence, or toast table, and
returns the “filenode” number currently assigned to it. The filenode is the base component of the file
name(s) used for the relation (see Section 69.1 for more information). For most tables the result is
the same as pg_class.relfilenode, but for certain system catalogs relfilenode is zero and
this function must be used to get the correct value. The function returns NULL if passed a relation
that does not have storage, such as a view.
brin_summarize_new_values accepts the OID or name of a BRIN index and inspects the index
to find page ranges in the base table that are not currently summarized by the index; for any such
range it creates a new summary index tuple by scanning the table pages. It returns the number of new
page range summaries that were inserted into the index. brin_summarize_range does the same,
except it only summarizes the range that covers the given block number.
gin_clean_pending_list accepts the OID or name of a GIN index and cleans up the pending
list of the specified index by moving entries in it to the main GIN data structure in bulk. It returns the
number of pages removed from the pending list. Note that if the argument is a GIN index built with
the fastupdate option disabled, no cleanup happens and the return value is 0, because the index
doesn't have a pending list. Please see Section 66.4.1 and Section 66.5 for details of the pending list
and fastupdate option.
Note that granting users the EXECUTE privilege on pg_read_file(), or related functions, allows
them the ability to read any file on the server which the database can read and that those reads bypass all
in-database privilege checks. This means that, among other things, a user with this access is able to read
the contents of the pg_authid table where authentication information is contained, as well as read
any file in the database. Therefore, granting access to these functions should be carefully considered.
Some of these functions take an optional missing_ok parameter, which specifies the behavior when
the file or directory does not exist. If true, the function returns NULL (except pg_ls_dir, which
returns an empty result set). If false, an error is raised. The default is false.
pg_ls_dir returns the names of all files (and directories and other special files) in the specified
directory. The include_dot_dirs parameter indicates whether “.” and “..” are included in the result set. The default is to exclude them (false), but including them can be useful when missing_ok is true, to distinguish an empty directory from a non-existent directory.
pg_ls_logdir returns the name, size, and last modified time (mtime) of each file in the log direc-
tory. By default, only superusers and members of the pg_monitor role can use this function. Access
may be granted to others using GRANT. Filenames beginning with a dot, directories, and other special
files are not shown.
pg_ls_waldir returns the name, size, and last modified time (mtime) of each file in the write ahead
log (WAL) directory. By default only superusers and members of the pg_monitor role can use this
function. Access may be granted to others using GRANT. Filenames beginning with a dot, directories,
and other special files are not shown.
pg_read_file returns part of a text file, starting at the given offset, returning at most length
bytes (less if the end of file is reached first). If offset is negative, it is relative to the end of the
file. If offset and length are omitted, the entire file is returned. The bytes read from the file are
interpreted as a string in the server encoding; an error is thrown if they are not valid in that encoding.
SELECT convert_from(pg_read_binary_file('file_in_utf8.txt'),
'UTF8');
pg_stat_file returns a record containing the file size, last accessed time stamp, last modified time
stamp, last file status change time stamp (Unix platforms only), file creation time stamp (Windows
only), and a boolean indicating if it is a directory. Typical usages include:
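For example ('filename' is a placeholder for a path the server can read, relative to the data directory):
SELECT size FROM pg_stat_file('filename');
SELECT (pg_stat_file('filename')).modification;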
pg_advisory_unlock_all will release all session level advisory locks held by the current ses-
sion. (This function is implicitly invoked at session end, even if the client disconnects ungracefully.)
Ideally, you should avoid running updates that don't actually change the data in the record.
Redundant updates can cost considerable unnecessary time, especially if there are lots of indexes to
alter, and space in dead rows that will eventually have to be vacuumed. However, detecting such
situations in client code is not always easy, or even possible, and writing expressions to detect them
can be error-prone. An alternative is to use suppress_redundant_updates_trigger, which
will skip updates that don't change the data. You should use this with care, however. The trigger takes
a small but non-trivial time for each record, so if most of the records affected by an update are actually
changed, use of this trigger will actually make the update run slower.
In most cases, you would want to fire this trigger last for each row. Bearing in mind that triggers fire
in name order, you would then choose a trigger name that comes after the name of any other trigger
you might have on the table.
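A sketch of such a trigger definition (tablename is a placeholder; the leading “z” in the trigger name is only there to make it sort, and therefore fire, last):
CREATE TRIGGER z_min_update
BEFORE UPDATE ON tablename
FOR EACH ROW EXECUTE PROCEDURE suppress_redundant_updates_trigger();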
Chapter 10. Type Conversion
SQL statements can, intentionally or not, require the mixing of different data types in the same ex-
pression. PostgreSQL has extensive facilities for evaluating mixed-type expressions.
In many cases a user does not need to understand the details of the type conversion mechanism. How-
ever, implicit conversions done by PostgreSQL can affect the results of a query. When necessary,
these results can be tailored by using explicit type conversion.
This chapter introduces the PostgreSQL type conversion mechanisms and conventions. Refer to the
relevant sections in Chapter 8 and Chapter 9 for more information on specific data types and allowed
functions and operators.
10.1. Overview
SQL is a strongly typed language. That is, every data item has an associated data type which deter-
mines its behavior and allowed usage. PostgreSQL has an extensible type system that is more general
and flexible than other SQL implementations. Hence, most type conversion behavior in PostgreSQL
is governed by general rules rather than by ad hoc heuristics. This allows the use of mixed-type ex-
pressions even with user-defined types.
The PostgreSQL scanner/parser divides lexical elements into five fundamental categories: integers,
non-integer numbers, strings, identifiers, and key words. Constants of most non-numeric types are
first classified as strings. The SQL language definition allows specifying type names with strings, and
this mechanism can be used in PostgreSQL to start the parser down the correct path. For example,
the query:
SELECT text 'Origin' AS "label", point '(0,0)' AS "value";
 label  | value
--------+-------
 Origin | (0,0)
(1 row)
has two literal constants, of type text and point. If a type is not specified for a string literal, then
the placeholder type unknown is assigned initially, to be resolved in later stages as described below.
There are four fundamental SQL constructs requiring distinct type conversion rules in the PostgreSQL
parser:
Function calls
Much of the PostgreSQL type system is built around a rich set of functions. Functions can have
one or more arguments. Since PostgreSQL permits function overloading, the function name alone
does not uniquely identify the function to be called; the parser must select the right function based
on the data types of the supplied arguments.
Operators
PostgreSQL allows expressions with prefix and postfix unary (one-argument) operators, as well
as binary (two-argument) operators. Like functions, operators can be overloaded, so the same
problem of selecting the right operator exists.
Value Storage
SQL INSERT and UPDATE statements place the results of expressions into a table. The expres-
sions in the statement must be matched up with, and perhaps converted to, the types of the target
columns.
UNION, CASE, and related constructs
Since all query results from a unionized SELECT statement must appear in a single set of columns,
the types of the results of each SELECT clause must be matched up and converted to a uniform
set. Similarly, the result expressions of a CASE construct must be converted to a common type
so that the CASE expression as a whole has a known output type. Some other constructs, such as
ARRAY[] and the GREATEST and LEAST functions, likewise require determination of a com-
mon type for several subexpressions.
The system catalogs store information about which conversions, or casts, exist between which data
types, and how to perform those conversions. Additional casts can be added by the user with the
CREATE CAST command. (This is usually done in conjunction with defining new data types. The
set of casts between built-in types has been carefully crafted and is best not altered.)
An additional heuristic provided by the parser allows improved determination of the proper casting
behavior among groups of types that have implicit casts. Data types are divided into several basic
type categories, including boolean, numeric, string, bitstring, datetime, timespan,
geometric, network, and user-defined. (For a list see Table 52.63; but note it is also possible
to create custom type categories.) Within each category there can be one or more preferred types,
which are preferred when there is a choice of possible types. With careful selection of preferred types
and available implicit casts, it is possible to ensure that ambiguous expressions (those with multiple
candidate parsing solutions) can be resolved in a useful way.
All type conversion rules are designed with several principles in mind:
• There should be no extra overhead in the parser or executor if a query does not need implicit type
conversion. That is, if a query is well-formed and the types already match, then the query should
execute without spending extra time in the parser and without introducing unnecessary implicit
conversion calls in the query.
• Additionally, if a query usually requires an implicit conversion for a function, and the user then defines a new function with the correct argument types, the parser should use this new function and no longer do an implicit conversion to call the old function.
10.2. Operators
The specific operator that is referenced by an operator expression is determined using the following
procedure. Note that this procedure is indirectly affected by the precedence of the operators involved,
since that will determine which sub-expressions are taken to be the inputs of which operators. See
Section 4.1.6 for more information.
1. Select the operators to be considered from the pg_operator system catalog. If a non-schema-
qualified operator name was used (the usual case), the operators considered are those with the
matching name and argument count that are visible in the current search path (see Section 5.8.3).
If a qualified operator name was given, only operators in the specified schema are considered.
• (Optional) If the search path finds multiple operators with identical argument types, only
the one appearing earliest in the path is considered. Operators with different argument types
are considered on an equal footing regardless of search path position.
2. Check for an operator accepting exactly the input argument types. If one exists (there can be
only one exact match in the set of operators considered), use it. Lack of an exact match creates a
security hazard when calling, via qualified name 1 (not typical), any operator found in a schema
that permits untrusted users to create objects. In such situations, cast arguments to force an exact
match.
a. (Optional) If one argument of a binary operator invocation is of the unknown type, then
assume it is the same type as the other argument for this check. Invocations involving two
unknown inputs, or a unary operator with an unknown input, will never find a match at
this step.
b. (Optional) If one argument of a binary operator invocation is of the unknown type and
the other is of a domain type, next check to see if there is an operator accepting exactly the
domain's base type on both sides; if so, use it.
3. Look for the best match.
a. Discard candidate operators for which the input types do not match and cannot be converted
(using an implicit conversion) to match. unknown literals are assumed to be convertible to
anything for this purpose. If only one candidate remains, use it; else continue to the next step.
b. If any input argument is of a domain type, treat it as being of the domain's base type for
all subsequent steps. This ensures that domains act like their base types for purposes of
ambiguous-operator resolution.
c. Run through all candidates and keep those with the most exact matches on input types. Keep
all candidates if none have exact matches. If only one candidate remains, use it; else continue
to the next step.
d. Run through all candidates and keep those that accept preferred types (of the input data
type's type category) at the most positions where type conversion will be required. Keep all
candidates if none accept preferred types. If only one candidate remains, use it; else continue
to the next step.
e. If any input arguments are unknown, check the type categories accepted at those argu-
ment positions by the remaining candidates. At each position, select the string category
if any candidate accepts that category. (This bias towards string is appropriate since an un-
known-type literal looks like a string.) Otherwise, if all the remaining candidates accept the
same type category, select that category; otherwise fail because the correct choice cannot
be deduced without more clues. Now discard candidates that do not accept the selected type
category. Furthermore, if any candidate accepts a preferred type in that category, discard
candidates that accept non-preferred types for that argument. Keep all candidates if none
survive these tests. If only one candidate remains, use it; else continue to the next step.
f. If there are both unknown and known-type arguments, and all the known-type arguments
have the same type, assume that the unknown arguments are also of that type, and check
which candidates can accept that type at the unknown-argument positions. If exactly one
candidate passes this test, use it. Otherwise, fail.
Some examples follow.
There is only one factorial operator (postfix !) defined in the standard catalog, and it takes an argument of type bigint. The scanner assigns an initial type of integer to the argument in this query expression:
SELECT 40 ! AS "40 factorial";
                   40 factorial
--------------------------------------------------
 815915283247897734345611269596115894272000000000
(1 row)
So the parser does a type conversion on the operand and the query is equivalent to:
SELECT CAST(40 AS bigint) ! AS "40 factorial";
Here is a concatenation of one value of known type (text) with one unspecified-type literal:
SELECT text 'abc' || 'def' AS "text and unknown";
 text and unknown
------------------
 abcdef
(1 row)
In this case the parser looks to see if there is an operator taking text for both arguments. Since there is, it assumes that the second argument should be interpreted as type text.
Here is a concatenation of two values of unspecified types:
SELECT 'abc' || 'def' AS "unspecified";
 unspecified
-------------
 abcdef
(1 row)
In this case there is no initial hint for which type to use, since no types are specified in the query. So,
the parser looks for all candidate operators and finds that there are candidates accepting both string-
category and bit-string-category inputs. Since string category is preferred when available, that category
is selected, and then the preferred type for strings, text, is used as the specific type to resolve the
unknown-type literals as.
The PostgreSQL operator catalog has several entries for the prefix operator @, all of which implement absolute-value operations for various numeric data types. One of these entries is for type float8, which is the preferred type in the numeric category. Therefore, PostgreSQL will use that entry when faced with an unknown input:
SELECT @ '-4.5' AS "abs";
 abs
-----
 4.5
(1 row)
Here the system has implicitly resolved the unknown-type literal as type float8 before applying the chosen operator. We can verify that float8 and not some other type was used:
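One way to check, sketched here with pg_typeof (described with the system information functions earlier), is to ask for the resolved type of such an expression; double precision is the SQL name for float8:
SELECT pg_typeof(@ '-4.5');
 pg_typeof
------------------
 double precision
(1 row)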
On the other hand, the prefix operator ~ (bitwise negation) is defined only for integer data types, not for float8. So, if we try a similar case with ~, we get:
SELECT ~ '20' AS "negation";
ERROR:  operator is not unique: ~ "unknown"
HINT:  Could not choose a best candidate operator. You might need to add explicit type casts.
This happens because the system cannot decide which of the several possible ~ operators should be
preferred. We can help it out with an explicit cast:
SELECT ~ CAST('20' AS int8) AS "negation";
 negation
----------
 -21
(1 row)
Here is another example of resolving an operator with one known and one unknown input:
SELECT array[1,2] <@ '{1,2,3}' AS "is subset";
 is subset
-----------
 t
(1 row)
The PostgreSQL operator catalog has several entries for the infix operator <@, but the only two that
could possibly accept an integer array on the left-hand side are array inclusion (anyarray <@ anyarray) and range inclusion (anyelement <@ anyrange). Since none of these polymorphic
pseudo-types (see Section 8.21) are considered preferred, the parser cannot resolve the ambiguity on
that basis. However, Step 3.f tells it to assume that the unknown-type literal is of the same type as the
other input, that is, integer array. Now only one of the two operators can match, so array inclusion is
selected. (Had range inclusion been selected, we would have gotten an error, because the string does
not have the right format to be a range literal.)
Users sometimes try to declare operators applying just to a domain type. This is possible but is not
nearly as useful as it might seem, because the operator resolution rules are designed to select operators
applying to the domain's base type. As an example consider
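A sketch of such a setup (all names here — mytext, mytext_eq_text, mytable — are illustrative, and the operator's implementation is deliberately trivial):
CREATE DOMAIN mytext AS text;
CREATE FUNCTION mytext_eq_text (mytext, text) RETURNS boolean
    AS 'SELECT $1::text = $2' LANGUAGE sql;
CREATE OPERATOR = (procedure = mytext_eq_text, leftarg = mytext, rightarg = text);
CREATE TABLE mytable (val mytext);

SELECT * FROM mytable WHERE val = 'foo';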
This query will not use the custom operator. The parser will first see if there is a mytext = mytext
operator (Step 2.a), which there is not; then it will consider the domain's base type text, and see if
there is a text = text operator (Step 2.b), which there is; so it resolves the unknown-type literal
as text and uses the text = text operator. The only way to get the custom operator to be used
is to explicitly cast the literal:
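Continuing the illustrative sketch above, the cast might look like this:
SELECT * FROM mytable WHERE val = text 'foo';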
That way, the mytext = text operator is found immediately according to the exact-match rule. If the
best-match rules are reached, they actively discriminate against operators on domain types. If they did
not, such an operator would create too many ambiguous-operator failures, because the casting rules
always consider a domain as castable to or from its base type, and so the domain operator would be
considered usable in all the same cases as a similarly-named operator on the base type.
10.3. Functions
The specific function that is referenced by a function call is determined using the following procedure.
1. Select the functions to be considered from the pg_proc system catalog. If a non-schema-qual-
ified function name was used, the functions considered are those with the matching name and
argument count that are visible in the current search path (see Section 5.8.3). If a qualified func-
tion name was given, only functions in the specified schema are considered.
a. (Optional) If the search path finds multiple functions of identical argument types, only the
one appearing earliest in the path is considered. Functions of different argument types are
considered on an equal footing regardless of search path position.
b. (Optional) If a function is declared with a VARIADIC array parameter, and the call does
not use the VARIADIC keyword, then the function is treated as if the array parameter were
replaced by one or more occurrences of its element type, as needed to match the call. After
such expansion the function might have effective argument types identical to some non-
variadic function. In that case the function appearing earlier in the search path is used, or if
the two functions are in the same schema, the non-variadic one is preferred.
This creates a security hazard when calling, via qualified name 2, a variadic function found
in a schema that permits untrusted users to create objects. A malicious user can take control
and execute arbitrary SQL functions as though you executed them. Substitute a call bearing
the VARIADIC keyword, which bypasses this hazard. Calls populating VARIADIC "any"
parameters often have no equivalent formulation containing the VARIADIC keyword. To
issue those calls safely, the function's schema must permit only trusted users to create ob-
jects.
c. (Optional) Functions that have default values for parameters are considered to match any
call that omits zero or more of the defaultable parameter positions. If more than one such
function matches a call, the one appearing earliest in the search path is used. If there are
two or more such functions in the same schema with identical parameter types in the non-
2
The hazard does not arise with a non-schema-qualified name, because a search path containing schemas that permit untrusted users to create
objects is not a secure schema usage pattern.
defaulted positions (which is possible if they have different sets of defaultable parameters),
the system will not be able to determine which to prefer, and so an “ambiguous function
call” error will result if no better match to the call can be found.
This creates an availability hazard when calling, via qualified name2, any function found
in a schema that permits untrusted users to create objects. A malicious user can create a
function with the name of an existing function, replicating that function's parameters and
appending novel parameters having default values. This precludes new calls to the original
function. To forestall this hazard, place functions in schemas that permit only trusted users
to create objects.
2. Check for a function accepting exactly the input argument types. If one exists (there can be only
one exact match in the set of functions considered), use it. Lack of an exact match creates a
security hazard when calling, via qualified name2, a function found in a schema that permits
untrusted users to create objects. In such situations, cast arguments to force an exact match. (Cases
involving unknown will never find a match at this step.)
3. If no exact match is found, see if the function call appears to be a special type conversion request.
This happens if the function call has just one argument and the function name is the same as
the (internal) name of some data type. Furthermore, the function argument must be either an
unknown-type literal, or a type that is binary-coercible to the named data type, or a type that could
be converted to the named data type by applying that type's I/O functions (that is, the conversion
is either to or from one of the standard string types). When these conditions are met, the function
call is treated as a form of CAST specification. 3
4. Look for the best match.
a. Discard candidate functions for which the input types do not match and cannot be converted
(using an implicit conversion) to match. unknown literals are assumed to be convertible to
anything for this purpose. If only one candidate remains, use it; else continue to the next step.
b. If any input argument is of a domain type, treat it as being of the domain's base type for
all subsequent steps. This ensures that domains act like their base types for purposes of
ambiguous-function resolution.
c. Run through all candidates and keep those with the most exact matches on input types. Keep
all candidates if none have exact matches. If only one candidate remains, use it; else continue
to the next step.
d. Run through all candidates and keep those that accept preferred types (of the input data
type's type category) at the most positions where type conversion will be required. Keep all
candidates if none accept preferred types. If only one candidate remains, use it; else continue
to the next step.
e. If any input arguments are unknown, check the type categories accepted at those argu-
ment positions by the remaining candidates. At each position, select the string category
if any candidate accepts that category. (This bias towards string is appropriate since an un-
known-type literal looks like a string.) Otherwise, if all the remaining candidates accept the
same type category, select that category; otherwise fail because the correct choice cannot
be deduced without more clues. Now discard candidates that do not accept the selected type
category. Furthermore, if any candidate accepts a preferred type in that category, discard
candidates that accept non-preferred types for that argument. Keep all candidates if none
survive these tests. If only one candidate remains, use it; else continue to the next step.
f. If there are both unknown and known-type arguments, and all the known-type arguments
have the same type, assume that the unknown arguments are also of that type, and check
3
The reason for this step is to support function-style cast specifications in cases where there is not an actual cast function. If there is a cast
function, it is conventionally named after its output type, and so there is no need to have a special case. See CREATE CAST for additional
commentary.
which candidates can accept that type at the unknown-argument positions. If exactly one
candidate passes this test, use it. Otherwise, fail.
Note that the “best match” rules are identical for operator and function type resolution. Some examples
follow.
There is only one round function that takes two arguments; it takes a first argument of type numeric and a second argument of type integer. So the following query automatically converts the integer first argument to numeric:
SELECT round(4, 4);
 round
--------
 4.0000
(1 row)
Since numeric constants with decimal points are initially assigned the type numeric, the following
query will require no type conversion and therefore might be slightly more efficient:
SELECT round(4.0, 4);
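The discussion that follows assumes a variadic function roughly like this sketch (the function bodies simply return distinct constants so the calls below can be told apart):
CREATE FUNCTION public.variadic_example(VARIADIC numeric[]) RETURNS int
  LANGUAGE sql AS 'SELECT 1';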
This function accepts, but does not require, the VARIADIC keyword. It tolerates both integer and
numeric arguments:
SELECT public.variadic_example(0),
public.variadic_example(0.0),
public.variadic_example(VARIADIC array[0.0]);
variadic_example | variadic_example | variadic_example
------------------+------------------+------------------
1 | 1 | 1
(1 row)
However, the first and second calls will prefer more-specific functions, if available:
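For instance, with these additional (still illustrative) functions in place, the more specific signatures win for the first two calls:
CREATE FUNCTION public.variadic_example(numeric) RETURNS int
  LANGUAGE sql AS 'SELECT 2';
CREATE FUNCTION public.variadic_example(int) RETURNS int
  LANGUAGE sql AS 'SELECT 3';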
SELECT public.variadic_example(0),
public.variadic_example(0.0),
public.variadic_example(VARIADIC array[0.0]);
variadic_example | variadic_example | variadic_example
------------------+------------------+------------------
3 | 2 | 1
(1 row)
Given the default configuration and only the first function existing, the first and second calls are
insecure. Any user could intercept them by creating the second or third function. By matching the
argument type exactly and using the VARIADIC keyword, the third call is secure.
There are several substr functions, one of which takes types text and integer. If called with a string constant of unspecified type, the system chooses the candidate function that accepts an argument of the preferred category string (namely of type text):
SELECT substr('1234', 3);
 substr
--------
 34
(1 row)
If the string is declared to be of type varchar, as might be the case if it comes from a table, then
the parser will try to convert it to become text:
SELECT substr(varchar '1234', 3);
 substr
--------
 34
(1 row)
Note
The parser learns from the pg_cast catalog that text and varchar are binary-compatible,
meaning that one can be passed to a function that accepts the other without doing any physical
conversion. Therefore, no type conversion call is really inserted in this case.
And, if the function is called with an argument of type integer, the parser will try to convert that
to text:
SELECT substr(1234, 3);
ERROR:  function substr(integer, integer) does not exist
HINT:  No function matches the given name and argument types. You might need to add explicit type casts.
This does not work because integer does not have an implicit cast to text. An explicit cast will
work, however:
SELECT substr(CAST (1234 AS text), 3);
 substr
--------
 34
(1 row)
10.4. Value Storage
Values to be inserted into a table are converted to the destination column's data type according to the following steps.
1. Check for an exact match with the target type.
2. Otherwise, try to convert the expression to the target type. This is possible if an assignment cast
between the two types is registered in the pg_cast catalog (see CREATE CAST). Alternatively,
if the expression is an unknown-type literal, the contents of the literal string will be fed to the
input conversion routine for the target type.
3. Check to see if there is a sizing cast for the target type. A sizing cast is a cast from that type to
itself. If one is found in the pg_cast catalog, apply it to the expression before storing into the
destination column. The implementation function for such a cast always takes an extra parameter
of type integer, which receives the destination column's atttypmod value (typically its
declared length, although the interpretation of atttypmod varies for different data types), and
it may take a third boolean parameter that says whether the cast is explicit or implicit. The
cast function is responsible for applying any length-dependent semantics such as size checking
or truncation.
For example, storing the result of concatenating two unknown-type literals into a character(20) column:
CREATE TABLE vv (v character(20));
INSERT INTO vv SELECT 'abc' || 'def';
SELECT v, octet_length(v) FROM vv;
          v           | octet_length
----------------------+--------------
 abcdef               |           20
(1 row)
What has really happened here is that the two unknown literals are resolved to text by default,
allowing the || operator to be resolved as text concatenation. Then the text result of the operator
is converted to bpchar (“blank-padded char”, the internal name of the character data type) to
match the target column type. (Since the conversion from text to bpchar is binary-coercible, this
conversion does not insert any real function call.) Finally, the sizing function bpchar(bpchar,
integer, boolean) is found in the system catalog and applied to the operator's result and the
stored column length. This type-specific function performs the required length check and addition of
padding spaces.
2. If any input is of a domain type, treat it as being of the domain's base type for all subsequent
steps.[4]
3. If all inputs are of type unknown, resolve as type text (the preferred type of the string category).
Otherwise, unknown inputs are ignored for the purposes of the remaining rules.
4. If the non-unknown inputs are not all of the same type category, fail.
5. Select the first non-unknown input type as the candidate type, then consider each other non-un-
known input type, left to right.[5] If the candidate type can be implicitly converted to the other type,
but not vice-versa, select the other type as the new candidate type. Then continue considering the
remaining inputs. If, at any stage of this process, a preferred type is selected, stop considering
additional inputs.
6. Convert all inputs to the final candidate type. Fail if there is not an implicit conversion from a
given input type to the candidate type.
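For example, in a query of this shape (an illustrative reconstruction) the unknown literal 'b' is resolved to type text to match the other arm, giving the result shown below:

SELECT text 'a' AS "text" UNION SELECT 'b';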
text
------
a
b
(2 rows)
[4] Somewhat like the treatment of domain inputs for operators and functions, this behavior allows a domain type to be preserved through a
UNION or similar construct, so long as the user is careful to ensure that all inputs are implicitly or explicitly of that exact type. Otherwise
the domain's base type will be used.
[5] For historical reasons, CASE treats its ELSE clause (if any) as the “first” input, with the THEN clause(s) considered after that. In all other
cases, “left to right” means the order in which the expressions appear in the query text.
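A second example (an illustrative reconstruction) mixes a numeric literal with an integer and produces the result shown below:

SELECT 1.2 AS "numeric" UNION SELECT 1;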
numeric
---------
1
1.2
(2 rows)
The literal 1.2 is of type numeric, and the integer value 1 can be cast implicitly to numeric,
so that type is used.
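A third example (again an illustrative reconstruction) mixes an integer with a value of type real and produces the result shown below:

SELECT 1 AS "real" UNION SELECT CAST('2.2' AS REAL);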
real
------
1
2.2
(2 rows)
Here, since type real cannot be implicitly cast to integer, but integer can be implicitly cast
to real, the union result type is resolved as real.
This failure occurs because PostgreSQL treats multiple UNIONs as a nest of pairwise operations; that
is, this input is the same as
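Sketching both the kind of query under discussion and its pairwise-nested equivalent (a reconstruction for illustration):

SELECT NULL UNION SELECT NULL UNION SELECT 1;
-- ERROR:  UNION types text and integer cannot be matched

(SELECT NULL UNION SELECT NULL) UNION SELECT 1;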
The inner UNION is resolved as emitting type text, according to the rules given above. Then the
outer UNION has inputs of types text and integer, leading to the observed error. The problem
can be fixed by ensuring that the leftmost UNION has at least one input of the desired result type.
INTERSECT and EXCEPT operations are likewise resolved pairwise. However, the other constructs
described in this section consider all of their inputs in one resolution step.
When an unspecified-type literal appears as a simple output column of a SELECT (for example, SELECT 'Hello World';), there is nothing to identify what type the string literal should be taken as. In this situation PostgreSQL will fall back to resolving the literal's type as text.
When the SELECT is one arm of a UNION (or INTERSECT or EXCEPT) construct, or when it appears
within INSERT ... SELECT, this rule is not applied since rules given in preceding sections take
precedence. The type of an unspecified-type literal can be taken from the other UNION arm in the first
case, or from the destination column in the second case.
RETURNING lists are treated the same as SELECT output lists for this purpose.
Note
Prior to PostgreSQL 10, this rule did not exist, and unspecified-type literals in a SELECT out-
put list were left as type unknown. That had assorted bad consequences, so it's been changed.
Chapter 11. Indexes
Indexes are a common way to enhance database performance. An index allows the database server to
find and retrieve specific rows much faster than it could do without an index. But indexes also add
overhead to the database system as a whole, so they should be used sensibly.
11.1. Introduction
Suppose we have a table similar to this:
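A minimal sketch (the column names are illustrative):

CREATE TABLE test1 (
    id integer,
    content varchar
);

and suppose the application issues many queries of the form

SELECT content FROM test1 WHERE id = constant;

where constant stands for some specific value.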
With no advance preparation, the system would have to scan the entire test1 table, row by row, to
find all matching entries. If there are many rows in test1 and only a few rows (perhaps zero or one)
that would be returned by such a query, this is clearly an inefficient method. But if the system has
been instructed to maintain an index on the id column, it can use a more efficient method for locating
matching rows. For instance, it might only have to walk a few levels deep into a search tree.
A similar approach is used in most non-fiction books: terms and concepts that are frequently looked
up by readers are collected in an alphabetic index at the end of the book. The interested reader can scan
the index relatively quickly and flip to the appropriate page(s), rather than having to read the entire
book to find the material of interest. Just as it is the task of the author to anticipate the items that readers
are likely to look up, it is the task of the database programmer to foresee which indexes will be useful.
The following command can be used to create an index on the id column, as discussed:
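CREATE INDEX test1_id_index ON test1 (id);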
The name test1_id_index can be chosen freely, but you should pick something that enables you
to remember later what the index was for.
To remove an index, use the DROP INDEX command. Indexes can be added to and removed from
tables at any time.
Once an index is created, no further intervention is required: the system will update the index when the
table is modified, and it will use the index in queries when it thinks doing so would be more efficient
than a sequential table scan. But you might have to run the ANALYZE command regularly to update
statistics to allow the query planner to make educated decisions. See Chapter 14 for information about
how to find out whether an index is used and when and why the planner might choose not to use an
index.
Indexes can also benefit UPDATE and DELETE commands with search conditions. Indexes can more-
over be used in join searches. Thus, an index defined on a column that is part of a join condition can
also significantly speed up queries with joins.
Creating an index on a large table can take a long time. By default, PostgreSQL allows reads (SELECT
statements) to occur on the table in parallel with index creation, but writes (INSERT, UPDATE,
DELETE) are blocked until the index build is finished. In production environments this is often unac-
ceptable. It is possible to allow writes to occur in parallel with index creation, but there are several
caveats to be aware of — for more information see Building Indexes Concurrently.
After an index is created, the system has to keep it synchronized with the table. This adds overhead
to data manipulation operations. Therefore indexes that are seldom or never used in queries should
be removed.
B-trees can handle equality and range queries on data that can be sorted into some ordering. In
particular, the PostgreSQL query planner will consider using a B-tree index whenever an indexed
column is involved in a comparison using one of these operators:
<
<=
=
>=
>
Constructs equivalent to combinations of these operators, such as BETWEEN and IN, can also be
implemented with a B-tree index search. Also, an IS NULL or IS NOT NULL condition on an index
column can be used with a B-tree index.
The optimizer can also use a B-tree index for queries involving the pattern matching operators LIKE
and ~ if the pattern is a constant and is anchored to the beginning of the string — for example, col
LIKE 'foo%' or col ~ '^foo', but not col LIKE '%bar'. However, if your database
does not use the C locale you will need to create the index with a special operator class to support
indexing of pattern-matching queries; see Section 11.10 below. It is also possible to use B-tree indexes
for ILIKE and ~*, but only if the pattern starts with non-alphabetic characters, i.e., characters that
are not affected by upper/lower case conversion.
B-tree indexes can also be used to retrieve data in sorted order. This is not always faster than a simple
scan and sort, but it is often helpful.
Hash indexes can only handle simple equality comparisons. The query planner will consider using
a hash index whenever an indexed column is involved in a comparison using the = operator. The
following command is used to create a hash index:
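(a sketch; the table and index names are illustrative)

CREATE INDEX test1_id_hash ON test1 USING HASH (id);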
GiST indexes are not a single kind of index, but rather an infrastructure within which many different
indexing strategies can be implemented. Accordingly, the particular operators with which a GiST index
can be used vary depending on the indexing strategy (the operator class). As an example, the standard
distribution of PostgreSQL includes GiST operator classes for several two-dimensional geometric data
types, which support indexed queries using these operators:
<<
&<
&>
>>
<<|
&<|
|&>
|>>
@>
<@
~=
&&
(See Section 9.11 for the meaning of these operators.) The GiST operator classes included in the
standard distribution are documented in Table 64.1. Many other GiST operator classes are available
in the contrib collection or as separate projects. For more information see Chapter 64.
GiST indexes are also capable of optimizing “nearest-neighbor” searches, such as the query sketched below, which finds the ten places closest to a given target point. The ability to do this is again dependent on
the particular operator class being used. In Table 64.1, operators that can be used in this way are listed
in the column “Ordering Operators”.
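A sketch of such a query (the table, column, and target point are illustrative; <-> here is the geometric distance operator):

SELECT * FROM places ORDER BY location <-> point '(101,456)' LIMIT 10;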
SP-GiST indexes, like GiST indexes, offer an infrastructure that supports various kinds of search-
es. SP-GiST permits implementation of a wide range of different non-balanced disk-based data struc-
tures, such as quadtrees, k-d trees, and radix trees (tries). As an example, the standard distribution of
PostgreSQL includes SP-GiST operator classes for two-dimensional points, which support indexed
queries using these operators:
<<
>>
~=
<@
<^
>^
(See Section 9.11 for the meaning of these operators.) The SP-GiST operator classes included in the
standard distribution are documented in Table 65.1. For more information see Chapter 65.
GIN indexes are “inverted indexes” which are appropriate for data values that contain multiple com-
ponent values, such as arrays. An inverted index contains a separate entry for each component value,
and can efficiently handle queries that test for the presence of specific component values.
Like GiST and SP-GiST, GIN can support many different user-defined indexing strategies, and the
particular operators with which a GIN index can be used vary depending on the indexing strategy. As
an example, the standard distribution of PostgreSQL includes a GIN operator class for arrays, which
supports indexed queries using these operators:
<@
@>
=
&&
(See Section 9.18 for the meaning of these operators.) The GIN operator classes included in the stan-
dard distribution are documented in Table 66.1. Many other GIN operator classes are available in the
contrib collection or as separate projects. For more information see Chapter 66.
BRIN indexes (a shorthand for Block Range INdexes) store summaries about the values stored in
consecutive physical block ranges of a table. Like GiST, SP-GiST and GIN, BRIN can support many
different indexing strategies, and the particular operators with which a BRIN index can be used vary
depending on the indexing strategy. For data types that have a linear sort order, the indexed data
corresponds to the minimum and maximum values of the values in the column for each block range.
This supports indexed queries using these operators:
<
<=
=
>=
>
The BRIN operator classes included in the standard distribution are documented in Table 67.1. For
more information see Chapter 67.
An index can also be defined on more than one column of a table. For example, suppose you have a table with integer columns major and minor and a varchar column name (say, you keep your /dev directory in a database...) and you frequently issue queries like:
SELECT name FROM test2 WHERE major = constant AND minor = constant;
then it might be appropriate to define an index on the columns major and minor together, e.g.:
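(a sketch, assuming a table such as test2 (major int, minor int, name varchar); the index name is illustrative)

CREATE INDEX test2_mm_idx ON test2 (major, minor);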
Currently, only the B-tree, GiST, GIN, and BRIN index types support multicolumn indexes. Up to 32
columns can be specified. (This limit can be altered when building PostgreSQL; see the file
pg_config_manual.h.)
A multicolumn B-tree index can be used with query conditions that involve any subset of the index's
columns, but the index is most efficient when there are constraints on the leading (leftmost) columns.
The exact rule is that equality constraints on leading columns, plus any inequality constraints on the
first column that does not have an equality constraint, will be used to limit the portion of the index
that is scanned. Constraints on columns to the right of these columns are checked in the index, so they
save visits to the table proper, but they do not reduce the portion of the index that has to be scanned.
For example, given an index on (a, b, c) and a query condition WHERE a = 5 AND b >=
42 AND c < 77, the index would have to be scanned from the first entry with a = 5 and b = 42 up
through the last entry with a = 5. Index entries with c >= 77 would be skipped, but they'd still have to
be scanned through. This index could in principle be used for queries that have constraints on b and/
or c with no constraint on a — but the entire index would have to be scanned, so in most cases the
planner would prefer a sequential table scan over using the index.
A multicolumn GiST index can be used with query conditions that involve any subset of the index's
columns. Conditions on additional columns restrict the entries returned by the index, but the condition
on the first column is the most important one for determining how much of the index needs to be
scanned. A GiST index will be relatively ineffective if its first column has only a few distinct values,
even if there are many distinct values in additional columns.
A multicolumn GIN index can be used with query conditions that involve any subset of the index's
columns. Unlike B-tree or GiST, index search effectiveness is the same regardless of which index
column(s) the query conditions use.
A multicolumn BRIN index can be used with query conditions that involve any subset of the index's
columns. Like GIN and unlike B-tree or GiST, index search effectiveness is the same regardless of
which index column(s) the query conditions use. The only reason to have multiple BRIN indexes
instead of one multicolumn BRIN index on a single table is to have a different pages_per_range
storage parameter.
Of course, each column must be used with operators appropriate to the index type; clauses that involve
other operators will not be considered.
Multicolumn indexes should be used sparingly. In most situations, an index on a single column is
sufficient and saves space and time. Indexes with more than three columns are unlikely to be helpful
unless the usage of the table is extremely stylized. See also Section 11.5 and Section 11.9 for some
discussion of the merits of different index configurations.
The planner will consider satisfying an ORDER BY specification either by scanning an available index
that matches the specification, or by scanning the table in physical order and doing an explicit sort.
For a query that requires scanning a large fraction of the table, an explicit sort is likely to be faster than
using an index because it requires less disk I/O due to following a sequential access pattern. Indexes
are more useful when only a few rows need be fetched. An important special case is ORDER BY in
combination with LIMIT n: an explicit sort will have to process all the data to identify the first n
rows, but if there is an index matching the ORDER BY, the first n rows can be retrieved directly,
without scanning the remainder at all.
By default, B-tree indexes store their entries in ascending order with nulls last. This means that a
forward scan of an index on column x produces output satisfying ORDER BY x (or more verbosely,
ORDER BY x ASC NULLS LAST). The index can also be scanned backward, producing output
satisfying ORDER BY x DESC (or more verbosely, ORDER BY x DESC NULLS FIRST, since
NULLS FIRST is the default for ORDER BY DESC).
You can adjust the ordering of a B-tree index by including the options ASC, DESC, NULLS FIRST,
and/or NULLS LAST when creating the index; for example:
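(sketches with illustrative table and index names)

CREATE INDEX test2_info_nulls_low ON test2 (info NULLS FIRST);
CREATE INDEX test3_desc_index ON test3 (id DESC NULLS LAST);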
An index stored in ascending order with nulls first can satisfy either ORDER BY x ASC NULLS
FIRST or ORDER BY x DESC NULLS LAST depending on which direction it is scanned in.
You might wonder why bother providing all four options, when two options together with the possi-
bility of backward scan would cover all the variants of ORDER BY. In single-column indexes the
options are indeed redundant, but in multicolumn indexes they can be useful. Consider a two-column
index on (x, y): this can satisfy ORDER BY x, y if we scan forward, or ORDER BY x DESC,
y DESC if we scan backward. But it might be that the application frequently needs to use ORDER BY
x ASC, y DESC. There is no way to get that ordering from a plain index, but it is possible if the
index is defined as (x ASC, y DESC) or (x DESC, y ASC).
Obviously, indexes with non-default sort orderings are a fairly specialized feature, but sometimes they
can produce tremendous speedups for certain queries. Whether it's worth maintaining such an index
depends on how often you use queries that require a special sort ordering.
Fortunately, PostgreSQL has the ability to combine multiple indexes (including multiple uses of the
same index) to handle cases that cannot be implemented by single index scans. The system can form
AND and OR conditions across several index scans. For example, a query like WHERE x = 42 OR
x = 47 OR x = 53 OR x = 99 could be broken down into four separate scans of an index
on x, each scan using one of the query clauses. The results of these scans are then ORed together
to produce the result. Another example is that if we have separate indexes on x and y, one possible
implementation of a query like WHERE x = 5 AND y = 6 is to use each index with the appropriate
query clause and then AND together the index results to identify the result rows.
To combine multiple indexes, the system scans each needed index and prepares a bitmap in memory
giving the locations of table rows that are reported as matching that index's conditions. The bitmaps
are then ANDed and ORed together as needed by the query. Finally, the actual table rows are visited
and returned. The table rows are visited in physical order, because that is how the bitmap is laid out;
this means that any ordering of the original indexes is lost, and so a separate sort step will be needed if
the query has an ORDER BY clause. For this reason, and because each additional index scan adds extra
time, the planner will sometimes choose to use a simple index scan even though additional indexes
are available that could have been used as well.
In all but the simplest applications, there are various combinations of indexes that might be useful,
and the database developer must make trade-offs to decide which indexes to provide. Sometimes
multicolumn indexes are best, but sometimes it's better to create separate indexes and rely on the
index-combination feature. For example, if your workload includes a mix of queries that sometimes
involve only column x, sometimes only column y, and sometimes both columns, you might choose to
create two separate indexes on x and y, relying on index combination to process the queries that use
both columns. You could also create a multicolumn index on (x, y). This index would typically
be more efficient than index combination for queries involving both columns, but as discussed in
Section 11.3, it would be almost useless for queries involving only y, so it should not be the only
index. A combination of the multicolumn index and a separate index on y would serve reasonably
well. For queries involving only x, the multicolumn index could be used, though it would be larger
and hence slower than an index on x alone. The last alternative is to create all three indexes, but this
is probably only reasonable if the table is searched much more often than it is updated and all three
types of query are common. If one of the types of query is much less common than the others, you'd
probably settle for creating just the two indexes that best match the common types.
When an index is declared unique, multiple table rows with equal indexed values are not allowed. Null
values are not considered equal. A multicolumn unique index will only reject cases where all indexed
columns are equal in multiple rows.
PostgreSQL automatically creates a unique index when a unique constraint or primary key is defined
for a table. The index covers the columns that make up the primary key or unique constraint (a multi-
column index, if appropriate), and is the mechanism that enforces the constraint.
Note
There's no need to manually create indexes on unique columns; doing so would just duplicate
the automatically-created index.
For example, a common way to do case-insensitive comparisons is to use the lower function. Such a query can use an index if one has been defined on the result of the lower(col1) function:
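A sketch (the table, column, and index names are illustrative):

SELECT * FROM test1 WHERE lower(col1) = 'value';

CREATE INDEX test1_lower_col1_idx ON test1 (lower(col1));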
If we were to declare this index UNIQUE, it would prevent creation of rows whose col1 values differ
only in case, as well as rows whose col1 values are actually identical. Thus, indexes on expressions
can be used to enforce constraints that are not definable as simple unique constraints.
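As another example (the second example referred to just below; the names are illustrative), an index on a computed expression such as a concatenation of two columns requires extra parentheses:

SELECT * FROM people WHERE (first_name || ' ' || last_name) = 'John Smith';

CREATE INDEX people_names ON people ((first_name || ' ' || last_name));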
The syntax of the CREATE INDEX command normally requires writing parentheses around index
expressions, as shown in the second example. The parentheses can be omitted when the expression is
just a function call, as in the first example.
Index expressions are relatively expensive to maintain, because the derived expression(s) must be
computed for each row insertion and non-HOT update. However, the index expressions are not recom-
puted during an indexed search, since they are already stored in the index. In both examples above,
the system sees the query as just WHERE indexedcolumn = 'constant' and so the speed
of the search is equivalent to any other simple index query. Thus, indexes on expressions are useful
when retrieval speed is more important than insertion and update speed.
One major reason for using a partial index is to avoid indexing common values. Since a query searching
for a common value (one that accounts for more than a few percent of all the table rows) will not use
the index anyway, there is no point in keeping those rows in the index at all. This reduces the size of
the index, which will speed up those queries that do use the index. It will also speed up many table
update operations because the index does not need to be updated in all cases. Example 11.1 shows a
possible application of this idea.
To create a partial index that suits our example, use a command such as this:
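(a sketch, assuming an access_log table with a client_ip column of type inet and an excluded range of common internal addresses; the names and addresses are illustrative)

CREATE INDEX access_log_client_ip_ix ON access_log (client_ip)
WHERE NOT (client_ip > inet '192.168.100.0' AND
           client_ip < inet '192.168.100.255');

Of the two queries below, the first can use this index, since its address falls outside the excluded range; the second cannot, because it looks for an address inside that range.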
SELECT *
FROM access_log
WHERE url = '/index.html' AND client_ip = inet '212.78.10.32';
SELECT *
FROM access_log
WHERE client_ip = inet '192.168.100.23';
Observe that this kind of partial index requires that the common values be predetermined, so such
partial indexes are best used for data distributions that do not change. The indexes can be recreated
occasionally to adjust for new data distributions, but this adds maintenance effort.
Another possible use for a partial index is to exclude values from the index that the typical query
workload is not interested in; this is shown in Example 11.2. This results in the same advantages as
listed above, but it prevents the “uninteresting” values from being accessed via that index, even if
an index scan might be profitable in that case. Obviously, setting up partial indexes for this kind of
scenario will require a lot of care and experimentation.
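In Example 11.2 the index would be created along these lines (a sketch consistent with the queries below; the index name is illustrative), and a query such as the first one below can then use it:

CREATE INDEX orders_unbilled_index ON orders (order_nr)
    WHERE billed is not true;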
SELECT * FROM orders WHERE billed is not true AND order_nr < 10000;
However, the index can also be used in queries that do not involve order_nr at all, e.g.:
SELECT * FROM orders WHERE billed is not true AND amount > 5000.00;
This is not as efficient as a partial index on the amount column would be, since the system has to
scan the entire index. Yet, if there are relatively few unbilled orders, using this partial index just to
find the unbilled orders could be a win.
Example 11.2 also illustrates that the indexed column and the column used in the predicate do not
need to match. PostgreSQL supports partial indexes with arbitrary predicates, so long as only columns
of the table being indexed are involved. However, keep in mind that the predicate must match the
conditions used in the queries that are supposed to benefit from the index. To be precise, a partial
index can be used in a query only if the system can recognize that the WHERE condition of the query
mathematically implies the predicate of the index. PostgreSQL does not have a sophisticated theorem
prover that can recognize mathematically equivalent expressions that are written in different forms.
(Not only is such a general theorem prover extremely difficult to create, it would probably be too slow
to be of any real use.) The system can recognize simple inequality implications, for example “x <
1” implies “x < 2”; otherwise the predicate condition must exactly match part of the query's WHERE
condition or the index will not be recognized as usable. Matching takes place at query planning time,
not at run time. As a result, parameterized query clauses do not work with a partial index. For example
a prepared query with a parameter might specify “x < ?” which will never imply “x < 2” for all possible
values of the parameter.
A third possible use for partial indexes does not require the index to be used in queries at all. The idea
here is to create a unique index over a subset of a table, as in Example 11.3. This enforces uniqueness
among the rows that satisfy the index predicate, without constraining those that do not.
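A sketch of Example 11.3 (the column types and names are illustrative):

CREATE TABLE tests (
    subject text,
    target text,
    success boolean
);

CREATE UNIQUE INDEX tests_success_constraint ON tests (subject, target)
    WHERE success;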
This is a particularly efficient approach when there are few successful tests and many unsuccessful
ones. It is also possible to allow only one null in a column by creating a unique partial index with
an IS NULL restriction.
Finally, a partial index can also be used to override the system's query plan choices. Also, data sets
with peculiar distributions might cause the system to use an index when it really should not. In that
case the index can be set up so that it is not available for the offending query. Normally, PostgreSQL
makes reasonable choices about index usage (e.g., it avoids them when retrieving common values, so
the earlier example really only saves index size; it is not required to avoid index usage), and grossly
incorrect plan choices are cause for a bug report.
Keep in mind that setting up a partial index indicates that you know at least as much as the query
planner knows, in particular you know when an index might be profitable. Forming this knowledge
requires experience and understanding of how indexes in PostgreSQL work. In most cases, the ad-
vantage of a partial index over a regular index will be minimal. There are cases where they are quite
counterproductive, as in Example 11.4.
This is a bad idea! Almost certainly, you'll be better off with a single non-partial index, declared like
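(a sketch; the table and column names are illustrative, with category as the leading column)

CREATE INDEX mytable_cat_data ON mytable (category, data);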
(Put the category column first, for the reasons described in Section 11.3.) While a search in this larger
index might have to descend through a couple more tree levels than a search in a smaller index, that's
almost certainly going to be cheaper than the planner effort needed to select the appropriate one of the
partial indexes. The core of the problem is that the system does not understand the relationship among
the partial indexes, and will laboriously test each one to see if it's applicable to the current query.
If your table is large enough that a single index really is a bad idea, you should look into using parti-
tioning instead (see Section 5.10). With that mechanism, the system does understand that the tables
and indexes are non-overlapping, so far better performance is possible.
More information about partial indexes can be found in [ston89b], [olson93], and [seshadri95].
To avoid the heap accesses that an ordinary index scan performs for every matching row, PostgreSQL supports index-only scans, which can answer queries
from an index alone without any heap access. The basic idea is to return values directly out of each
index entry instead of consulting the associated heap entry. There are two fundamental restrictions on
when this method can be used:
1. The index type must support index-only scans. B-tree indexes always do. GiST and SP-GiST in-
dexes support index-only scans for some operator classes but not others. Other index types have
no support. The underlying requirement is that the index must physically store, or else be able to
reconstruct, the original data value for each index entry. As a counterexample, GIN indexes cannot
support index-only scans because each index entry typically holds only part of the original data
value.
2. The query must reference only columns stored in the index. For example, given an index on columns
x and y of a table that also has a column z, these queries could use index-only scans:
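(sketches assuming a hypothetical table tab(x, y, z) with an index on (x, y))

SELECT x, y FROM tab WHERE x = 'key';
SELECT x FROM tab WHERE x = 'key' AND y < 42;

-- whereas a query that also references z, such as
--   SELECT x, z FROM tab WHERE x = 'key';
-- could not, because z is not stored in the index.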
(Expression indexes and partial indexes complicate this rule, as discussed below.)
If these two fundamental requirements are met, then all the data values required by the query are
available from the index, so an index-only scan is physically possible. But there is an additional re-
quirement for any table scan in PostgreSQL: it must verify that each retrieved row be “visible” to
the query's MVCC snapshot, as discussed in Chapter 13. Visibility information is not stored in index
entries, only in heap entries; so at first glance it would seem that every row retrieval would require
a heap access anyway. And this is indeed the case, if the table row has been modified recently. How-
ever, for seldom-changing data there is a way around this problem. PostgreSQL tracks, for each page
in a table's heap, whether all rows stored in that page are old enough to be visible to all current and
future transactions. This information is stored in a bit in the table's visibility map. An index-only scan,
after finding a candidate index entry, checks the visibility map bit for the corresponding heap page. If
it's set, the row is known visible and so the data can be returned with no further work. If it's not set,
the heap entry must be visited to find out whether it's visible, so no performance advantage is gained
over a standard index scan. Even in the successful case, this approach trades visibility map accesses
for heap accesses; but since the visibility map is four orders of magnitude smaller than the heap it
describes, far less physical I/O is needed to access it. In most situations the visibility map remains
cached in memory all the time.
In short, while an index-only scan is possible given the two fundamental requirements, it will be a
win only if a significant fraction of the table's heap pages have their all-visible map bits set. But tables
in which a large fraction of the rows are unchanging are common enough to make this type of scan
very useful in practice.
To make effective use of the index-only scan feature, you might choose to create a covering index,
which is an index specifically designed to include the columns needed by a particular type of query
that you run frequently. Since queries typically need to retrieve more columns than just the ones they
search on, PostgreSQL allows you to create an index in which some columns are just “payload” and
are not part of the search key. This is done by adding an INCLUDE clause listing the extra columns.
For example, if you commonly run queries that search on a column x but also need to retrieve a column y,
the traditional approach to speeding up such queries would be to create an index on x only. However,
an index that additionally carries y as payload, as in the sketch below,
could handle these queries as index-only scans, because y can be obtained from the index without
visiting the heap.
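A minimal sketch of this pattern, using hypothetical table and index names:

-- the query retrieves y but searches only on x
SELECT y FROM tab WHERE x = 'key';

-- a covering index: x is the search key, y is carried as non-key payload
CREATE INDEX tab_x_y ON tab (x) INCLUDE (y);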
Because column y is not part of the index's search key, it does not have to be of a data type that the
index can handle; it's merely stored in the index and is not interpreted by the index machinery. Also,
if the index is declared UNIQUE (that is, created as a unique version of the covering index sketched above),
the uniqueness condition applies to just column x, not to the combination of x and y. (An INCLUDE
clause can also be written in UNIQUE and PRIMARY KEY constraints, providing alternative syntax
for setting up an index like this.)
It's wise to be conservative about adding non-key payload columns to an index, especially wide
columns. If an index tuple exceeds the maximum size allowed for the index type, data insertion will
fail. In any case, non-key columns duplicate data from the index's table and bloat the size of the index,
thus potentially slowing searches. And remember that there is little point in including payload columns
in an index unless the table changes slowly enough that an index-only scan is likely to not need to ac-
cess the heap. If the heap tuple must be visited anyway, it costs nothing more to get the column's value
from there. Other restrictions are that expressions are not currently supported as included columns,
and that only B-tree indexes currently support included columns.
Before PostgreSQL had the INCLUDE feature, people sometimes made covering indexes by writing
the payload columns as ordinary index columns, that is, writing an ordinary multicolumn index on (x, y),
even though they had no intention of ever using y as part of a WHERE clause. This works fine as
long as the extra columns are trailing columns; making them be leading columns is unwise for the
reasons explained in Section 11.3. However, this method doesn't support the case where you want the
index to enforce uniqueness on the key column(s). Also, explicitly marking non-searchable columns
as INCLUDE columns makes the index slightly smaller, because such columns need not be stored in
upper B-tree levels.
In principle, index-only scans can be used with expression indexes. For example, given an index on
f(x) where x is a table column, it should be possible to execute a query such as SELECT f(x) FROM tab WHERE f(x) < 1
as an index-only scan; and this is very attractive if f() is an expensive-to-compute function. However,
PostgreSQL's planner is currently not very smart about such cases. It considers a query to be potentially
executable by index-only scan only when all columns needed by the query are available from the
index. In this example, x is not needed except in the context f(x), but the planner does not notice
that and concludes that an index-only scan is not possible. If an index-only scan seems sufficiently
worthwhile, this can be worked around by creating the index on f(x) with x listed as an INCLUDE column.
An additional caveat, if the goal is to avoid recalculating f(x), is that the planner won't necessarily
match uses of f(x) that aren't in indexable WHERE clauses to the index column. It will usually get this
right in simple queries such as shown above, but not in queries that involve joins. These deficiencies
may be remedied in future versions of PostgreSQL.
Partial indexes also have interesting interactions with index-only scans. Consider the partial index
shown in Example 11.3:
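Recalling the sketch given earlier (the names are illustrative), the index and a query one might hope to answer with an index-only scan look like this:

CREATE UNIQUE INDEX tests_success_constraint ON tests (subject, target)
    WHERE success;

SELECT target FROM tests WHERE subject = 'some-subject' AND success;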
But there's a problem: the WHERE clause refers to success which is not available as a result column
of the index. Nonetheless, an index-only scan is possible because the plan does not need to recheck
that part of the WHERE clause at run time: all entries found in the index necessarily have success
= true so this need not be explicitly checked in the plan. PostgreSQL versions 9.6 and later will
recognize such cases and allow index-only scans to be generated, but older versions will not.
The operator class identifies the operators to be used by the index for that column. For example, a B-
tree index on the type int4 would use the int4_ops class; this operator class includes comparison
functions for values of type int4. In practice the default operator class for the column's data type is
usually sufficient. The main reason for having operator classes is that for some data types, there could
be more than one meaningful index behavior. For example, we might want to sort a complex-number
data type either by absolute value or by real part. We could do this by defining two operator classes for
the data type and then selecting the proper class when making an index. The operator class determines
the basic sort ordering (which can then be modified by adding sort options COLLATE, ASC/DESC
and/or NULLS FIRST/NULLS LAST).
There are also some built-in operator classes besides the default ones; notably, the pattern-matching operator classes of the form xxx_pattern_ops (such as text_pattern_ops and varchar_pattern_ops) support B-tree indexing of anchored pattern-matching queries when the database does not use the C locale.
Note that you should also create an index with the default operator class if you want queries involv-
ing ordinary <, <=, >, or >= comparisons to use an index. Such queries cannot use the
xxx_pattern_ops operator classes. (Ordinary equality comparisons can use these operator classes, how-
ever.) It is possible to create multiple indexes on the same column with different operator classes.
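For instance (illustrative names; varchar_pattern_ops is one of the built-in pattern-matching operator classes):

CREATE INDEX test_index ON test_table (col varchar_pattern_ops);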
If you do use the C locale, you do not need the xxx_pattern_ops operator classes, because an
index with the default operator class is usable for pattern-matching queries in the C locale.
An operator class is actually just a subset of a larger structure called an operator family. In cases where
several data types have similar behaviors, it is frequently useful to define cross-data-type operators
and allow these to work with indexes. To do this, the operator classes for each of the types must be
grouped into the same operator family. The cross-type operators are members of the family, but are
not associated with any single class within the family.
A query over the system catalogs can show the operator family that each operator class belongs to, and a further query can list all defined operator families together with all the operators included in each family:
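Sketches of both queries (they use the system catalogs pg_am, pg_opclass, pg_opfamily, and pg_amop; the exact column list shown is illustrative):

-- operator classes and the family each belongs to
SELECT am.amname AS index_method,
       opc.opcname AS opclass_name,
       opf.opfname AS opfamily_name
FROM pg_am am
     JOIN pg_opclass opc ON opc.opcmethod = am.oid
     JOIN pg_opfamily opf ON opf.oid = opc.opcfamily
ORDER BY index_method, opclass_name;

-- operator families and the operators each contains
SELECT am.amname AS index_method,
       opf.opfname AS opfamily_name,
       amop.amopopr::regoperator AS opfamily_operator
FROM pg_am am
     JOIN pg_opfamily opf ON opf.opfmethod = am.oid
     JOIN pg_amop amop ON amop.amopfamily = opf.oid
ORDER BY index_method, opfamily_name, opfamily_operator;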
An index automatically uses the collation of the underlying column. So a query that compares the indexed column using the column's own collation could use the index, because the comparison will by default use the collation of the column. However, such an index cannot accelerate queries that involve some other collation. So if queries using another collation, say "y", are also of interest, an additional index that specifies the "y" collation can be created, as in the sketch below.
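A sketch of the whole pattern ("x" and "y" stand for actual collation names; the table and index names are illustrative):

CREATE TABLE test1c (
    id integer,
    content varchar COLLATE "x"
);

CREATE INDEX test1c_content_index ON test1c (content);

-- usable for comparisons under the column's own collation:
SELECT * FROM test1c WHERE content > 'constant';

-- an additional index for queries under collation "y":
CREATE INDEX test1c_content_y_index ON test1c (content COLLATE "y");

SELECT * FROM test1c WHERE content > 'constant' COLLATE "y";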
It is difficult to formulate a general procedure for determining which indexes to create. There are a
number of typical cases that have been shown in the examples throughout the previous sections. A
good deal of experimentation is often necessary. The rest of this section gives some tips for that:
• Always run ANALYZE first. This command collects statistics about the distribution of the values
in the table. This information is required to estimate the number of rows returned by a query, which
is needed by the planner to assign realistic costs to each possible query plan. In the absence of any real
statistics, some default values are assumed, which are almost certain to be inaccurate. Examining an
application's index usage without having run ANALYZE is therefore a lost cause. See Section 24.1.3
and Section 24.1.6 for more information.
• Use real data for experimentation. Using test data for setting up indexes will tell you what indexes
you need for the test data, but that is all.
It is especially fatal to use very small test data sets. While selecting 1000 out of 100000 rows could
be a candidate for an index, selecting 1 out of 100 rows will hardly be, because the 100 rows probably
fit within a single disk page, and there is no plan that can beat sequentially fetching 1 disk page.
Also be careful when making up test data, which is often unavoidable when the application is not
yet in production. Values that are very similar, completely random, or inserted in sorted order will
skew the statistics away from the distribution that real data would have.
• When indexes are not used, it can be useful for testing to force their use. There are run-time para-
meters that can turn off various plan types (see Section 19.7.1). For instance, turning off sequen-
tial scans (enable_seqscan) and nested-loop joins (enable_nestloop), which are the most
basic plans, will force the system to use a different plan. If the system still chooses a sequential scan
or nested-loop join then there is probably a more fundamental reason why the index is not being
used; for example, the query condition does not match the index. (What kind of query can use what
kind of index is explained in the previous sections.)
• If forcing index usage does use the index, then there are two possibilities: Either the system is
right and using the index is indeed not appropriate, or the cost estimates of the query plans are
not reflecting reality. So you should time your query with and without indexes. The EXPLAIN
ANALYZE command can be useful here.
• If it turns out that the cost estimates are wrong, there are, again, two possibilities. The total cost
is computed from the per-row costs of each plan node times the selectivity estimate of the plan
node. The costs estimated for the plan nodes can be adjusted via run-time parameters (described
in Section 19.7.2). An inaccurate selectivity estimate is due to insufficient statistics. It might be
possible to improve this by tuning the statistics-gathering parameters (see ALTER TABLE).
If you do not succeed in adjusting the costs to be more appropriate, then you might have to resort
to forcing index usage explicitly. You might also want to contact the PostgreSQL developers to
examine the issue.
Chapter 12. Full Text Search
12.1. Introduction
Full Text Searching (or just text search) provides the capability to identify natural-language documents
that satisfy a query, and optionally to sort them by relevance to the query. The most common type
of search is to find all documents containing given query terms and return them in order of their
similarity to the query. Notions of query and similarity are very flexible and depend on the
specific application. The simplest search considers query as a set of words and similarity as
the frequency of query words in the document.
Textual search operators have existed in databases for years. PostgreSQL has ~, ~*, LIKE, and ILIKE
operators for textual data types, but they lack many essential properties required by modern informa-
tion systems:
• There is no linguistic support, even for English. Regular expressions are not sufficient because
they cannot easily handle derived words, e.g., satisfies and satisfy. You might miss docu-
ments that contain satisfies, although you probably would like to find them when searching
for satisfy. It is possible to use OR to search for multiple derived forms, but this is tedious and
error-prone (some words can have several thousand derivatives).
• They provide no ordering (ranking) of search results, which makes them ineffective when thousands
of matching documents are found.
• They tend to be slow because there is no index support, so they must process all documents for
every search.
Full text indexing allows documents to be preprocessed and an index saved for later rapid searching.
Preprocessing includes:
• Parsing documents into tokens. It is useful to identify various classes of tokens, e.g., numbers, words, complex words, email addresses, so that they can be processed differently. In principle token classes depend on the specific application, but for most purposes it is adequate to use a predefined set of classes. PostgreSQL uses a parser to perform this step. A standard parser is provided, and custom parsers can be created for specific needs.
• Converting tokens into lexemes. A lexeme is a string, just like a token, but it has been normalized so that different forms of the same word are made alike. For example, normalization almost always includes folding upper-case letters to lower-case, and often involves removal of suffixes (such as s or es in English). This allows searches to find variant forms of the same word, without tediously entering all the possible variants. Also, this step typically eliminates stop words, which are words that are so common that they are useless for searching. (In short, then, tokens are raw fragments of the document text, while lexemes are words that are believed useful for indexing and searching.) PostgreSQL uses dictionaries to perform this step. Various standard dictionaries are provided, and custom ones can be created for specific needs.
• Storing preprocessed documents optimized for searching. For example, each document can be represented as a sorted array of normalized lexemes. Along with the lexemes it is often desirable to store positional information to use for proximity ranking, so that a document that contains a more “dense” region of query words is assigned a higher rank than one with scattered query words.
Dictionaries allow fine-grained control over how tokens are normalized. With appropriate dictionaries,
you can:
• Map different variations of a word to a canonical form using Snowball stemmer rules.
A data type tsvector is provided for storing preprocessed documents, along with a type tsquery
for representing processed queries (Section 8.11). There are many functions and operators available
for these data types (Section 9.13), the most important of which is the match operator @@, which we
introduce in Section 12.1.2. Full text searches can be accelerated using indexes (Section 12.9).
For searches within PostgreSQL, a document is normally a textual field within a row of a database
table, or possibly a combination (concatenation) of such fields, perhaps stored in several tables or
obtained dynamically. In other words, a document can be constructed from different parts for indexing
and it might not be stored anywhere as a whole. For example:
SELECT title || ' ' || author || ' ' || abstract || ' ' || body AS document
FROM messages
WHERE mid = 12;

SELECT m.title || ' ' || m.author || ' ' || m.abstract || ' ' || d.body AS document
FROM messages m, docs d
WHERE m.mid = d.did AND m.mid = 12;
Note
Actually, in these example queries, coalesce should be used to prevent a single NULL at-
tribute from causing a NULL result for the whole document.
Another possibility is to store the documents as simple text files in the file system. In this case, the
database can be used to store the full text index and to execute searches, and some unique identifier
can be used to retrieve the document from the file system. However, retrieving files from outside the
database requires superuser permissions or special function support, so this is usually less convenient
than keeping all the data inside PostgreSQL. Also, keeping everything inside the database allows easy
access to document metadata to assist in indexing and display.
For text search purposes, each document must be reduced to the preprocessed tsvector format.
Searching and ranking are performed entirely on the tsvector representation of a document — the
original text need only be retrieved when the document has been selected for display to a user. We
therefore often speak of the tsvector as being the document, but of course it is only a compact
representation of the full document.
SELECT 'a fat cat sat on a mat and ate a fat rat'::tsvector @@ 'cat & rat'::tsquery;
 ?column?
----------
 t

SELECT 'fat & cow'::tsquery @@ 'a fat cat sat on a mat and ate a fat rat'::tsvector;
 ?column?
----------
 f
As the above example suggests, a tsquery is not just raw text, any more than a tsvector is.
A tsquery contains search terms, which must be already-normalized lexemes, and may combine
multiple terms using AND, OR, NOT, and FOLLOWED BY operators. (For syntax details see Sec-
tion 8.11.2.) There are functions to_tsquery, plainto_tsquery, and phraseto_tsquery
that are helpful in converting user-written text into a proper tsquery, primarily by normalizing
words appearing in the text. Similarly, to_tsvector is used to parse and normalize a document
string. So in practice a text search match would look more like this:
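(illustrative examples)

SELECT to_tsvector('fat cats ate fat rats') @@ to_tsquery('fat & rat');
 ?column?
----------
 t

whereas the match fails if written against a plain tsvector cast,

SELECT 'fat cats ate fat rats'::tsvector @@ to_tsquery('fat & rat');
 ?column?
----------
 f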
since here no normalization of the word rats will occur. The elements of a tsvector are lexemes,
which are assumed already normalized, so rats does not match rat.
The @@ operator also supports text input, allowing explicit conversion of a text string to tsvector
or tsquery to be skipped in simple cases. The variants available are:
tsvector @@ tsquery
tsquery @@ tsvector
text @@ tsquery
text @@ text
The first two of these we saw already. The form text @@ tsquery is equivalent to to_tsvec-
tor(x) @@ y. The form text @@ text is equivalent to to_tsvector(x) @@ plainto_tsquery(y).
Within a tsquery, the & (AND) operator specifies that both its arguments must appear in the docu-
ment to have a match. Similarly, the | (OR) operator specifies that at least one of its arguments must
appear, while the ! (NOT) operator specifies that its argument must not appear in order to have a
match. For example, the query fat & ! rat matches documents that contain fat but not rat.
Searching for phrases is possible with the help of the <-> (FOLLOWED BY) tsquery operator,
which matches only if its arguments have matches that are adjacent and in the given order. For ex-
ample:
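(an illustrative example)

SELECT to_tsvector('fatal error') @@ to_tsquery('fatal <-> error');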
?column?
----------
t
There is a more general version of the FOLLOWED BY operator having the form <N>, where N is
an integer standing for the difference between the positions of the matching lexemes. <1> is the same
as <->, while <2> allows exactly one other lexeme to appear between the matches, and so on. The
phraseto_tsquery function makes use of this operator to construct a tsquery that can match
a multi-word phrase when some of the words are stop words. For example:
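(illustrative examples, assuming the english configuration is in effect)

SELECT phraseto_tsquery('cats ate rats');
       phraseto_tsquery
-------------------------------
 'cat' <-> 'ate' <-> 'rat'

SELECT phraseto_tsquery('the cats ate the rats');
       phraseto_tsquery
-------------------------------
 'cat' <-> 'ate' <2> 'rat'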
A special case that's sometimes useful is that <0> can be used to require that two patterns match the
same word.
Parentheses can be used to control nesting of the tsquery operators. Without parentheses, | binds
least tightly, then &, then <->, and ! most tightly.
It's worth noticing that the AND/OR/NOT operators mean something subtly different when they are
within the arguments of a FOLLOWED BY operator than when they are not, because within FOL-
LOWED BY the exact position of the match is significant. For example, normally !x matches only
documents that do not contain x anywhere. But !x <-> y matches y if it is not immediately after
an x; an occurrence of x elsewhere in the document does not prevent a match. Another example is
that x & y normally only requires that x and y both appear somewhere in the document, but (x &
y) <-> z requires x and y to match at the same place, immediately before a z. Thus this query
behaves differently from x <-> z & y <-> z, which will match a document containing two
separate sequences x z and y z. (This specific query is useless as written, since x and y could not
match at the same place; but with more complex situations such as prefix-match patterns, a query of
this form could be useful.)
12.1.3. Configurations
The above are all simple text search examples. As mentioned before, full text search functionality
includes the ability to do many more things: skip indexing certain words (stop words), process syn-
onyms, and use sophisticated parsing, e.g., parse based on more than just white space. This function-
ality is controlled by text search configurations. PostgreSQL comes with predefined configurations
for many languages, and you can easily create your own configurations. (psql's \dF command shows
all available configurations.)
Each text search function that depends on a configuration has an optional regconfig argument,
so that the configuration to use can be specified explicitly. default_text_search_config is
used only when this argument is omitted.
To make it easier to build custom text search configurations, a configuration is built up from simpler
database objects. PostgreSQL's text search facility provides four types of configuration-related data-
base objects:
• Text search parsers break documents into tokens and classify each token (for example, as words
or numbers).
• Text search dictionaries convert tokens to normalized form and reject stop words.
• Text search templates provide the functions underlying dictionaries. (A dictionary simply specifies
a template and a set of parameters for the template.)
• Text search configurations select a parser and a set of dictionaries to use to normalize the tokens
produced by the parser.
Text search parsers and templates are built from low-level C functions; therefore it requires C pro-
gramming ability to develop new ones, and superuser privileges to install one into a database. (There
are examples of add-on parsers and templates in the contrib/ area of the PostgreSQL distribution.)
Since dictionaries and configurations just parameterize and connect together some underlying parsers
and templates, no special privilege is needed to create a new dictionary or configuration. Examples of
creating custom dictionaries and configurations appear later in this chapter.
SELECT title
FROM pgweb
WHERE to_tsvector('english', body) @@ to_tsquery('english',
'friend');
This will also find related words such as friends and friendly, since all these are reduced to
the same normalized lexeme.
The query above specifies that the english configuration is to be used to parse and normalize the
strings. Alternatively we could omit the configuration parameters:
SELECT title
FROM pgweb
WHERE to_tsvector(body) @@ to_tsquery('friend');
A more complex example is to select the ten most recent documents that contain create and table
in the title or body:
SELECT title
FROM pgweb
WHERE to_tsvector(title || ' ' || body) @@ to_tsquery('create & table')
ORDER BY last_mod_date DESC
LIMIT 10;
For clarity we omitted the coalesce function calls which would be needed to find rows that contain
NULL in one of the two fields.
Although these queries will work without an index, most applications will find this approach too slow,
except perhaps for occasional ad-hoc searches. Practical use of text searching usually requires creating
an index.
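For example, a GIN expression index of this shape (a sketch; GIN is the usual choice for text search, see Section 12.9) can be created to speed up such searches:

CREATE INDEX pgweb_idx ON pgweb USING GIN (to_tsvector('english', body));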
Notice that the 2-argument version of to_tsvector is used. Only text search functions that speci-
fy a configuration name can be used in expression indexes (Section 11.7). This is because the index
contents must be unaffected by default_text_search_config. If they were affected, the index contents
might be inconsistent because different entries could contain tsvectors that were created with dif-
ferent text search configurations, and there would be no way to guess which was which. It would be
impossible to dump and restore such an index correctly.
Because the two-argument version of to_tsvector was used in the index above, only a query
reference that uses the 2-argument version of to_tsvector with the same configuration name will
use that index. That is, WHERE to_tsvector('english', body) @@ 'a & b' can use
the index, but WHERE to_tsvector(body) @@ 'a & b' cannot. This ensures that an index
will be used only with the same configuration used to create the index entries.
It is possible to set up more complex expression indexes wherein the configuration name is specified
by another column, e.g.:
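(a sketch; the index name is illustrative)

CREATE INDEX pgweb_idx ON pgweb USING GIN (to_tsvector(config_name, body));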
where config_name is a column in the pgweb table. This allows mixed configurations in the same
index while recording which configuration was used for each index entry. This would be useful, for
example, if the document collection contained documents in different languages. Again, queries that
are meant to use the index must be phrased to match, e.g., WHERE to_tsvector(config_name,
body) @@ 'a & b'.
Another approach is to create a separate tsvector column to hold the output of to_tsvector.
This example is a concatenation of title and body, using coalesce to ensure that one field will
still be indexed when the other is NULL:
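A sketch of the setup (the column and index names are illustrative); the index created last makes the query below fast:

ALTER TABLE pgweb
    ADD COLUMN textsearchable_index_col tsvector;

UPDATE pgweb SET textsearchable_index_col =
    to_tsvector('english', coalesce(title,'') || ' ' || coalesce(body,''));

CREATE INDEX textsearch_idx ON pgweb USING GIN (textsearchable_index_col);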
SELECT title
FROM pgweb
WHERE textsearchable_index_col @@ to_tsquery('create & table')
ORDER BY last_mod_date DESC
LIMIT 10;
When using a separate column to store the tsvector representation, it is necessary to create a trigger
to keep the tsvector column current anytime title or body changes. Section 12.4.3 explains
how to do that.
One advantage of the separate-column approach over an expression index is that it is not necessary to
explicitly specify the text search configuration in queries in order to make use of the index. As shown
in the example above, the query can depend on default_text_search_config. Another ad-
vantage is that searches will be faster, since it will not be necessary to redo the to_tsvector calls
to verify index matches. (This is more important when using a GiST index than a GIN index; see
Section 12.9.) The expression-index approach is simpler to set up, however, and it requires less disk
space since the tsvector representation is not stored explicitly.
to_tsvector parses a textual document into tokens, reduces the tokens to lexemes, and returns a
tsvector which lists the lexemes together with their positions in the document. The document is
processed according to the specified or default text search configuration. Here is a simple example:
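(an illustrative example, using the english configuration)

SELECT to_tsvector('english', 'a fat cat sat on a mat - it ate a fat rats');
                  to_tsvector
-----------------------------------------------------
 'ate':9 'cat':3 'fat':2,11 'mat':7 'rat':12 'sat':4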
In the example above we see that the resulting tsvector does not contain the words a, on, or it,
the word rats became rat, and the punctuation sign - was ignored.
The to_tsvector function internally calls a parser which breaks the document text into tokens and
assigns a type to each token. For each token, a list of dictionaries (Section 12.6) is consulted, where the
list can vary depending on the token type. The first dictionary that recognizes the token emits one or
more normalized lexemes to represent the token. For example, rats became rat because one of the
dictionaries recognized that the word rats is a plural form of rat. Some words are recognized as stop
words (Section 12.6.1), which causes them to be ignored since they occur too frequently to be useful
in searching. In our example these are a, on, and it. If no dictionary in the list recognizes the token
then it is also ignored. In this example that happened to the punctuation sign - because there are in fact
no dictionaries assigned for its token type (Space symbols), meaning space tokens will never be
indexed. The choices of parser, dictionaries and which types of tokens to index are determined by the
selected text search configuration (Section 12.7). It is possible to have many different configurations in
the same database, and predefined configurations are available for various languages. In our example
we used the default configuration english for the English language.
The function setweight can be used to label the entries of a tsvector with a given weight, where
a weight is one of the letters A, B, C, or D. This is typically used to mark entries coming from different
parts of a document, such as title versus body. Later, this information can be used for ranking of search
results.
UPDATE tt SET ti =
setweight(to_tsvector(coalesce(title,'')), 'A') ||
setweight(to_tsvector(coalesce(keyword,'')), 'B') ||
setweight(to_tsvector(coalesce(abstract,'')), 'C') ||
setweight(to_tsvector(coalesce(body,'')), 'D');
Here we have used setweight to label the source of each lexeme in the finished tsvector, and
then merged the labeled tsvector values using the tsvector concatenation operator ||. (Sec-
tion 12.4.1 gives details about these operations.)
to_tsquery creates a tsquery value from querytext, which must consist of single tokens
separated by the tsquery operators & (AND), | (OR), ! (NOT), and <-> (FOLLOWED BY),
possibly grouped using parentheses. In other words, the input to to_tsquery must already follow
the general rules for tsquery input, as described in Section 8.11.2. The difference is that while basic
tsquery input takes the tokens at face value, to_tsquery normalizes each token into a lexeme
using the specified or default configuration, and discards any tokens that are stop words according to
the configuration. For example:
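SELECT to_tsquery('english', 'The & Fat & Rats');
-- result: 'fat' & 'rat'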
As in basic tsquery input, weight(s) can be attached to each lexeme to restrict it to match only
tsvector lexemes of those weight(s). For example:
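SELECT to_tsquery('english', 'Fat | Rats:AB');
-- result: 'fat' | 'rat':AB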
Also, * can be attached to a lexeme to request prefix matching. Such a lexeme will match any word in a tsvector that begins with the given string.
to_tsquery can also accept single-quoted phrases. This is primarily useful when the configuration
includes a thesaurus dictionary that may trigger on such phrases. In the example below, a thesaurus
contains the rule supernovae stars : sn:
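-- a sketch, assuming that thesaurus rule is active in the default configuration
SELECT to_tsquery('''supernovae stars'' & !crab');
-- result: 'sn' & !'crab'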
Without quotes, to_tsquery will generate a syntax error for tokens that are not separated by an
AND, OR, or FOLLOWED BY operator.
plainto_tsquery transforms the unformatted text querytext to a tsquery value. The text is
parsed and normalized much as for to_tsvector, then the & (AND) tsquery operator is inserted
between surviving words.
Example:
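SELECT plainto_tsquery('english', 'The Fat Rats');
-- result: 'fat' & 'rat'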
Note that plainto_tsquery will not recognize tsquery operators, weight labels, or pre-
fix-match labels in its input:
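SELECT plainto_tsquery('english', 'The Fat & Rats:C');
-- result: 'fat' & 'rat' & 'c'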
Here, all the input punctuation was discarded as being space symbols.
phraseto_tsquery behaves much like plainto_tsquery, except that it inserts the <->
(FOLLOWED BY) operator between surviving words instead of the & (AND) operator. Also, stop
words are not simply discarded, but are accounted for by inserting <N> operators rather than <->
operators. This function is useful when searching for exact lexeme sequences, since the FOLLOWED
BY operators check lexeme order not just the presence of all the lexemes.
Example:
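SELECT phraseto_tsquery('english', 'The Fat Rats');
-- result: 'fat' <-> 'rat'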
Like plainto_tsquery, the phraseto_tsquery function will not recognize tsquery operators, weight labels, or prefix-match labels in its input.
websearch_to_tsquery creates a tsquery value from unformatted text, using an alternative syntax in which simple text is a valid query and certain operators are recognized:
• unquoted text: text not inside quote marks will be converted to terms separated by & operators,
as if processed by plainto_tsquery.
• "quoted text": text inside quote marks will be converted to terms separated by <-> operators,
as if processed by phraseto_tsquery.
• OR: logical or will be converted to the | operator.
• -: the logical not operator, converted to the ! operator.
Examples:
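-- illustrative sketches; results shown approximately
SELECT websearch_to_tsquery('english', 'The fat rats');
-- result: 'fat' & 'rat'

SELECT websearch_to_tsquery('english', '"supernovae stars" -crab');
-- result: 'supernova' <-> 'star' & !'crab'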
The ts_rank_cd function computes the cover density ranking for the given document vector and query,
as described in Clarke, Cormack, and Tudhope's "Relevance Ranking for One to Three Term
Queries" in the journal "Information Processing and Management", 1999. Cover density is similar
to ts_rank ranking except that the proximity of matching lexemes to each other is taken into
consideration.
This function requires lexeme positional information to perform its calculation. Therefore, it ig-
nores any “stripped” lexemes in the tsvector. If there are no unstripped lexemes in the input,
the result will be zero. (See Section 12.4.1 for more information about the strip function and
positional information in tsvectors.)
For both these functions (ts_rank and ts_rank_cd), the optional weights argument offers the ability to weigh word instances
more or less heavily depending on how they are labeled. The weight arrays specify how heavily to
weigh each category of word, in the order {D-weight, C-weight, B-weight, A-weight}. If no weights are provided, the defaults {0.1, 0.2, 0.4, 1.0} are used.
Typically weights are used to mark words from special areas of the document, like the title or an initial
abstract, so they can be treated with more or less importance than words in the document body.
Since a longer document has a greater chance of containing a query term it is reasonable to take into
account document size, e.g., a hundred-word document with five instances of a search word is probably
more relevant than a thousand-word document with five instances. Both ranking functions take an
integer normalization option that specifies whether and how a document's length should impact
its rank. The integer option controls several behaviors, so it is a bit mask: you can specify one or more
behaviors using | (for example, 2|4). The recognized option values are:
• 0 (the default) ignores the document length
• 1 divides the rank by 1 + the logarithm of the document length
• 2 divides the rank by the document length
• 4 divides the rank by the mean harmonic distance between extents (only implemented by ts_rank_cd)
• 8 divides the rank by the number of unique words in the document
• 16 divides the rank by 1 + the logarithm of the number of unique words in the document
• 32 divides the rank by itself + 1
If more than one flag bit is specified, the transformations are applied in the order listed.
It is important to note that the ranking functions do not use any global information, so it is impossible
to produce a fair normalization to 1% or 100% as sometimes desired. Normalization option 32 (rank/
(rank+1)) can be applied to scale all ranks into the range zero to one, but of course this is just a
cosmetic change; it will not affect the ordering of the search results.
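For instance, ranked output might be selected like this (a sketch; the apod table and its textsearch tsvector column are illustrative names):
SELECT title, ts_rank_cd(textsearch, query) AS rank
FROM apod, to_tsquery('neutrino | (dark & matter)') query
WHERE query @@ textsearch
ORDER BY rank DESC
LIMIT 10;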
Ranking can be expensive since it requires consulting the tsvector of each matching document,
which can be I/O bound and therefore slow. Unfortunately, it is almost impossible to avoid since
practical queries often result in large numbers of matches.
ts_headline accepts a document along with a query, and returns an excerpt from the document
in which terms from the query are highlighted. The configuration to be used to parse the document
can be specified by config; if config is omitted, the default_text_search_config con-
figuration is used.
If an options string is specified it must consist of a comma-separated list of one or more op-
tion=value pairs. The available options are:
• MaxWords, MinWords (integers): these numbers determine the longest and shortest headlines to
output. The default values are 35 and 15.
• ShortWord (integer): words of this length or less will be dropped at the start and end of a headline,
unless they are query terms. The default value of three eliminates common English articles.
• HighlightAll (boolean): if true the whole document will be used as the headline, ignoring
the preceding three parameters. The default is false.
• MaxFragments (integer): maximum number of text fragments to display. The default value of
zero selects a non-fragment-based headline generation method. A value greater than zero selects
fragment-based headline generation (see below).
• StartSel, StopSel (strings): the strings with which to delimit query words appearing in the
document, to distinguish them from other excerpted words. The default values are “<b>” and “</b>”, which can be suitable for HTML output.
• FragmentDelimiter (string): When more than one fragment is displayed, the fragments will
be separated by this string. The default is “ ... ”.
These option names are recognized case-insensitively. You must double-quote string values if they
contain spaces or commas.
In non-fragment-based headline generation, ts_headline locates matches for the given query
and chooses a single one to display, preferring matches that have more query words within the allowed
headline length. In fragment-based headline generation, ts_headline locates the query matches
and splits each match into “fragments” of no more than MaxWords words each, preferring fragments
with more query words, and when possible “stretching” fragments to include surrounding words. The
fragment-based mode is thus more useful when the query matches span large sections of the document,
or when it's desirable to display multiple matches. In either mode, if no query matches can be identified,
then a single fragment of the first MinWords words in the document will be displayed.
For example:
SELECT ts_headline('english',
'The most common type of search
is to find all documents containing given query terms
and return them in order of their similarity to the
query.',
to_tsquery('english', 'query & similarity'));
ts_headline
------------------------------------------------------------
containing given <b>query</b> terms +
and return them in order of their <b>similarity</b> to the+
<b>query</b>.
SELECT ts_headline('english',
'Search terms may occur
many times in a document,
requiring ranking of the search matches to decide which
occurrences to display in the result.',
to_tsquery('english', 'search & term'),
'MaxFragments=10, MaxWords=7, MinWords=3, StartSel=<<,
StopSel=>>');
ts_headline
------------------------------------------------------------
<<Search>> <<terms>> may occur +
many times ... ranking of the <<search>> matches to decide
ts_headline uses the original document, not a tsvector summary, so it can be slow and should
be used with care.
tsvector || tsvector
The tsvector concatenation operator returns a vector which combines the lexemes and posi-
tional information of the two vectors given as arguments. Positions and weight labels are retained
during the concatenation. Positions appearing in the right-hand vector are offset by the largest
position mentioned in the left-hand vector, so that the result is nearly equivalent to the result
of performing to_tsvector on the concatenation of the two original document strings. (The
equivalence is not exact, because any stop-words removed from the end of the left-hand argument
will not affect the result, whereas they would have affected the positions of the lexemes in the
right-hand argument if textual concatenation were used.)
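For example:
SELECT 'a:1 b:2'::tsvector || 'c:1 d:2 b:3'::tsvector;
-- result: 'a':1 'b':2,5 'c':3 'd':4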
One advantage of using concatenation in the vector form, rather than concatenating text before
applying to_tsvector, is that you can use different configurations to parse different sections
of the document. Also, because the setweight function marks all lexemes of the given vector
the same way, it is necessary to parse the text and do setweight before concatenating if you
want to label different parts of the document with different weights.
setweight returns a copy of the input vector in which every position has been labeled with the
given weight, either A, B, C, or D. (D is the default for new vectors and as such is not displayed on
output.) These labels are retained when vectors are concatenated, allowing words from different
parts of a document to be weighted differently by ranking functions.
Note that weight labels apply to positions, not lexemes. If the input vector has been stripped of
positions then setweight does nothing.
The strip function returns a vector that lists the same lexemes as the given vector, but lacks any position or weight
information. The result is usually much smaller than an unstripped vector, but it is also less useful.
Relevance ranking does not work as well on stripped vectors as unstripped ones. Also, the <->
(FOLLOWED BY) tsquery operator will never match stripped input, since it cannot determine
the distance between lexeme occurrences.
tsquery || tsquery
Returns the OR-combination of the two given queries.
!! tsquery
Returns the negation (NOT) of the given query.
tsquery <-> tsquery
Returns a query that searches for a match to the first given query immediately followed by a match
to the second given query, using the <-> (FOLLOWED BY) tsquery operator.
tsquery_phrase(query1, query2, distance)
Returns a query that searches for a match to the first given query followed by a match to the second
given query at a distance of exactly distance lexemes, using the <N> tsquery operator. For
example:
SELECT tsquery_phrase(to_tsquery('fat'), to_tsquery('cat'), 10);
  tsquery_phrase
------------------
 'fat' <10> 'cat'
numnode(query) returns the number of nodes (lexemes plus operators) in a tsquery. This function is useful
to determine if the query is meaningful (returns > 0), or contains only stop words (returns 0).
Examples:
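SELECT numnode('foo & bar'::tsquery);
-- result: 3

SELECT numnode(plainto_tsquery('the any'));
-- result: 0 (with a notice that the query contains only stop words)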
querytree(query) returns the portion of a tsquery that can be used for searching an index. This function is useful
for detecting unindexable queries, for example those containing only stop words or only negated
terms. For example:
SELECT querytree(to_tsquery('!defined'));
querytree
-----------
This form of ts_rewrite simply applies a single rewrite rule: target is replaced by sub-
stitute wherever it appears in query. For example:
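SELECT ts_rewrite('a & b'::tsquery, 'a'::tsquery, 'c'::tsquery);
-- result: 'b' & 'c'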
This form of ts_rewrite accepts a starting query and a SQL select command, which is
given as a text string. The select must yield two columns of tsquery type. For each row of
the select result, occurrences of the first column value (the target) are replaced by the second
column value (the substitute) within the current query value. For example:
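-- a sketch of a rewrite-rule table and its use
CREATE TABLE aliases (t tsquery PRIMARY KEY, s tsquery);
INSERT INTO aliases VALUES('a', 'c');

SELECT ts_rewrite('a & b'::tsquery, 'SELECT t, s FROM aliases');
-- result: 'b' & 'c'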
Note that when multiple rewrite rules are applied in this way, the order of application can be
important; so in practice you will want the source query to ORDER BY some ordering key.
Let's consider a real-life astronomical example. We'll expand query supernovae using table-driven
rewriting rules:
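-- a sketch, reusing the aliases table from above for the astronomy rule
INSERT INTO aliases VALUES(to_tsquery('supernovae'), to_tsquery('supernovae|sn'));

SELECT ts_rewrite(to_tsquery('supernovae & crab'), 'SELECT * FROM aliases');
-- result (approximately): 'crab' & ( 'supernova' | 'sn' )

-- the rewriting rules can then be changed just by updating the table: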
UPDATE aliases
SET s = to_tsquery('supernovae|sn & !nebulae')
WHERE t = to_tsquery('supernovae');
Rewriting can be slow when there are many rewriting rules, since it checks every rule for a possible
match. To filter out obvious non-candidate rules we can use the containment operators for the ts-
query type. In the example below, we select only those rules which might match the original query:
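-- a sketch, using the aliases table from above
SELECT ts_rewrite('a & b'::tsquery,
                  'SELECT t, s FROM aliases WHERE ''a & b''::tsquery @> t');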
These trigger functions automatically compute a tsvector column from one or more textual
columns, under the control of parameters specified in the CREATE TRIGGER command. An example
of their use is:
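A sketch of such a setup (the table and trigger names are illustrative):
CREATE TABLE messages (
    title text,
    body  text,
    tsv   tsvector
);

CREATE TRIGGER tsvectorupdate BEFORE INSERT OR UPDATE
ON messages FOR EACH ROW EXECUTE PROCEDURE
tsvector_update_trigger(tsv, 'pg_catalog.english', title, body);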
Having created this trigger, any change in title or body will automatically be reflected into tsv,
without the application having to worry about it.
The first trigger argument must be the name of the tsvector column to be updated. The second
argument specifies the text search configuration to be used to perform the conversion. For tsvec-
tor_update_trigger, the configuration name is simply given as the second trigger argument. It
must be schema-qualified as shown above, so that the trigger behavior will not change with changes in
search_path. For tsvector_update_trigger_column, the second trigger argument is the
name of another table column, which must be of type regconfig. This allows a per-row selection
of configuration to be made. The remaining argument(s) are the names of textual columns (of type
text, varchar, or char). These will be included in the document in the order given. NULL values
will be skipped (but the other columns will still be indexed).
A limitation of these built-in triggers is that they treat all the input columns alike. To process columns
differently — for example, to weight title differently from body — it is necessary to write a custom
trigger. Here is an example using PL/pgSQL as the trigger language:
CREATE FUNCTION messages_trigger() RETURNS trigger AS $$
begin
  new.tsv :=
     setweight(to_tsvector('pg_catalog.english', coalesce(new.title,'')), 'A') ||
     setweight(to_tsvector('pg_catalog.english', coalesce(new.body,'')), 'D');
  return new;
end
$$ LANGUAGE plpgsql;
Keep in mind that it is important to specify the configuration name explicitly when creating tsvec-
tor values inside triggers, so that the column's contents will not be affected by changes to de-
fault_text_search_config. Failure to do this is likely to lead to problems such as search
results changing after a dump and restore.
sqlquery is a text value containing an SQL query which must return a single tsvector column.
ts_stat executes the query and returns statistics about each distinct lexeme (word) contained in the
tsvector data. The columns returned are word (the value of a lexeme), ndoc (the number of documents the word occurred in), and nentry (the total number of occurrences of the word).
If weights is supplied, only occurrences having one of those weights are counted.
For example, to find the ten most frequent words in a document collection:
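-- a sketch; the apod table and its vector column are illustrative names
SELECT * FROM ts_stat('SELECT vector FROM apod')
ORDER BY nentry DESC, ndoc DESC, word
LIMIT 10;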
12.5. Parsers
Text search parsers are responsible for splitting raw document text into tokens and identifying each
token's type, where the set of possible types is defined by the parser itself. Note that a parser does
not modify the text at all — it simply identifies plausible word boundaries. Because of this limited
scope, there is less need for application-specific custom parsers than there is for custom dictionaries.
At present PostgreSQL provides just one built-in parser, which has been found to be useful for a wide
range of applications.
The built-in parser is named pg_catalog.default. It recognizes 23 token types, shown in Ta-
ble 12.1.
Note
The parser's notion of a “letter” is determined by the database's locale setting, specifically
lc_ctype. Words containing only the basic ASCII letters are reported as a separate token
type, since it is sometimes useful to distinguish them. In most European languages, token types
word and asciiword should be treated alike.
email does not support all valid email characters as defined by RFC 5322. Specifically,
the only non-alphanumeric characters supported for email user names are period, dash, and
underscore.
It is possible for the parser to produce overlapping tokens from the same piece of text. As an example,
a hyphenated word will be reported both as the entire word and as each component:
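-- illustrative sketch using ts_debug (described later in this chapter); output abridged
SELECT alias, token FROM ts_debug('english', 'foo-bar-beta1');
-- among the rows returned: numhword "foo-bar-beta1",
-- hword_asciipart "foo" and "bar", hword_numpart "beta1"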
This behavior is desirable since it allows searches to work for both the whole compound word and for its components.
12.6. Dictionaries
Dictionaries are used to eliminate words that should not be considered in a search (stop words), and
to normalize words so that different derived forms of the same word will match. A successfully nor-
malized word is called a lexeme. Aside from improving search quality, normalization and removal of
stop words reduce the size of the tsvector representation of a document, thereby improving per-
formance. Normalization does not always have linguistic meaning and usually depends on application
semantics.
• Linguistic - Ispell dictionaries try to reduce input words to a normalized form; stemmer dictionaries
remove word endings
• URL locations can be canonicalized to make equivalent URLs match:
• http://www.pgsql.ru/db/mw/index.html
• http://www.pgsql.ru/db/mw/
• http://www.pgsql.ru/db/../db/mw/index.html
• Color names can be replaced by their hexadecimal values, e.g., red, green, blue, magenta
-> FF0000, 00FF00, 0000FF, FF00FF
• If indexing numbers, we can remove some fractional digits to reduce the range of possible numbers,
so for example 3.14159265359, 3.1415926, 3.14 will be the same after normalization if only two
digits are kept after the decimal point.
A dictionary is a program that accepts a token as input and returns:
• an array of lexemes if the input token is known to the dictionary (notice that one token can produce
more than one lexeme)
• a single lexeme with the TSL_FILTER flag set, to replace the original token with a new token to
be passed to subsequent dictionaries (a dictionary that does this is called a filtering dictionary)
• an empty array if the dictionary knows the token, but it is a stop word
• NULL if the dictionary does not recognize the input token
PostgreSQL provides predefined dictionaries for many languages. There are also several predefined
templates that can be used to create new dictionaries with custom parameters. Each predefined dictio-
nary template is described below. If no existing template is suitable, it is possible to create new ones;
see the contrib/ area of the PostgreSQL distribution for examples.
A text search configuration binds a parser together with a set of dictionaries to process the parser's
output tokens. For each token type that the parser can return, a separate list of dictionaries is specified
by the configuration. When a token of that type is found by the parser, each dictionary in the list is
consulted in turn, until some dictionary recognizes it as a known word. If it is identified as a stop
word, or if no dictionary recognizes the token, it will be discarded and not indexed or searched for.
Normally, the first dictionary that returns a non-NULL output determines the result, and any remaining
dictionaries are not consulted; but a filtering dictionary can replace the given word with a modified
word, which is then passed to subsequent dictionaries.
The general rule for configuring a list of dictionaries is to place first the most narrow, most specific
dictionary, then the more general dictionaries, finishing with a very general dictionary, like a Snowball
stemmer or simple, which recognizes everything. For example, for an astronomy-specific search
(astro_en configuration) one could bind token type asciiword (ASCII word) to a synonym
dictionary of astronomical terms, a general English dictionary and a Snowball English stemmer:
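A sketch of such a mapping (the dictionary names are illustrative and must already exist):
ALTER TEXT SEARCH CONFIGURATION astro_en
    ADD MAPPING FOR asciiword WITH astrosyn, english_ispell, english_stem;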
A filtering dictionary can be placed anywhere in the list, except at the end where it'd be useless.
Filtering dictionaries are useful to partially normalize words to simplify the task of later dictionaries.
For example, a filtering dictionary could be used to remove accents from accented letters, as is done
by the unaccent module.
For example, to_tsvector('english', 'in the list of stop words') produces 'list':3 'stop':5 'word':6; the missing positions 1, 2, and 4 are because of stop words. Ranks calculated for documents with and
without stop words are quite different:
SELECT ts_rank_cd (to_tsvector('english', 'in the list of stop words'),
                   to_tsquery('list & stop'));
 ts_rank_cd
------------
       0.05

SELECT ts_rank_cd (to_tsvector('english', 'list stop words'),
                   to_tsquery('list & stop'));
 ts_rank_cd
------------
        0.1
It is up to the specific dictionary how it treats stop words. For example, ispell dictionaries first
normalize words and then look at the list of stop words, while Snowball stemmers first check the
list of stop words. The reason for the different behavior is an attempt to decrease noise.
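For instance, a dictionary based on the built-in simple template might be defined like this (the dictionary name is illustrative):
CREATE TEXT SEARCH DICTIONARY public.simple_dict (
    TEMPLATE = pg_catalog.simple,
    STOPWORDS = english
);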
Here, english is the base name of a file of stop words. The file's full name will be
$SHAREDIR/tsearch_data/english.stop, where $SHAREDIR means the PostgreSQL in-
stallation's shared-data directory, often /usr/local/share/postgresql (use pg_config
--sharedir to determine it if you're not sure). The file format is simply a list of words, one per
line. Blank lines and trailing spaces are ignored, and upper case is folded to lower case, but no other
processing is done on the file contents.
We can also choose to return NULL, instead of the lower-cased word, if it is not found in the stop words
file. This behavior is selected by setting the dictionary's Accept parameter to false. Continuing
the example:
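-- a sketch, continuing with the dictionary defined above
ALTER TEXT SEARCH DICTIONARY public.simple_dict ( Accept = false );

SELECT ts_lexize('public.simple_dict', 'YeS');
-- result: NULL (the token is passed on to the next dictionary)

SELECT ts_lexize('public.simple_dict', 'The');
-- result: {}  (recognized as a stop word)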
With the default setting of Accept = true, it is only useful to place a simple dictionary at the end
of a list of dictionaries, since it will never pass on any token to a following dictionary. Conversely,
Accept = false is only useful when there is at least one following dictionary.
Caution
Most types of dictionaries rely on configuration files, such as files of stop words. These files
must be stored in UTF-8 encoding. They will be translated to the actual database encoding, if
that is different, when they are read into the server.
Caution
Normally, a database session will read a dictionary configuration file only once, when it is first
used within the session. If you modify a configuration file and want to force existing sessions to
pick up the new contents, issue an ALTER TEXT SEARCH DICTIONARY command on the
dictionary. This can be a “dummy” update that doesn't actually change any parameter values.
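A synonym dictionary might be defined like this (a sketch; the dictionary name is illustrative):
CREATE TEXT SEARCH DICTIONARY my_synonym (
    TEMPLATE = synonym,
    SYNONYMS = my_synonyms
);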
The only parameter required by the synonym template is SYNONYMS, which is the base name of its
configuration file — my_synonyms in the above example. The file's full name will be $SHAREDIR/tsearch_data/my_synonyms.syn (where $SHAREDIR means the PostgreSQL installation's shared-data directory).
The synonym template also has an optional parameter CaseSensitive, which defaults to false.
When CaseSensitive is false, words in the synonym file are folded to lower case, as are input
tokens. When it is true, words and tokens are not folded to lower case, but are compared as-is.
An asterisk (*) can be placed at the end of a synonym in the configuration file. This indicates that
the synonym is a prefix. The asterisk is ignored when the entry is used in to_tsvector(), but
when it is used in to_tsquery(), the result will be a query item with the prefix match marker
(see Section 12.3.2). For example, suppose we have these entries in $SHAREDIR/tsearch_da-
ta/synonym_sample.syn:
postgres pgsql
postgresql pgsql
postgre pgsql
gogle googl
indices index*
Basically a thesaurus dictionary replaces all non-preferred terms by one preferred term and, option-
ally, preserves the original terms for indexing as well. PostgreSQL's current implementation of the
thesaurus dictionary is an extension of the synonym dictionary with added phrase support. A thesaurus
dictionary requires a configuration file of the following format:
# this is a comment
sample word(s) : indexed word(s)
more sample word(s) : more indexed word(s)
...
where the colon (:) symbol acts as a delimiter between a phrase and its replacement.
The thesaurus dictionary chooses the longest match if there are multiple phrases matching the input,
and ties are broken by using the last definition.
Specific stop words recognized by the subdictionary cannot be specified; instead use ? to mark the
location where any stop word can appear. For example, assuming that a and the are stop words
according to the subdictionary:
? one ? two : swsw
matches a one the two and the one a two; both would be replaced by swsw.
Since a thesaurus dictionary has the capability to recognize phrases it must remember its state and
interact with the parser. A thesaurus dictionary uses these assignments to check if it should handle the
next word or stop accumulation. The thesaurus dictionary must be configured carefully. For example, if
the thesaurus dictionary is assigned to handle only the asciiword token, then a thesaurus dictionary
definition like one 7 will not work since token type uint is not assigned to the thesaurus dictionary.
Caution
Thesauruses are used during indexing so any change in the thesaurus dictionary's parameters
requires reindexing. For most other dictionary types, small changes, such as adding or removing
stop words, do not force reindexing.
CREATE TEXT SEARCH DICTIONARY thesaurus_simple (
    TEMPLATE = thesaurus,
DictFile = mythesaurus,
Dictionary = pg_catalog.english_stem
);
Here, thesaurus_simple is the new dictionary's name, mythesaurus is the base name of the thesaurus configuration file (its full name will be $SHAREDIR/tsearch_data/mythesaurus.ths), and pg_catalog.english_stem is the subdictionary (here, a Snowball English stemmer) used for thesaurus normalization.
Now it is possible to bind the thesaurus dictionary thesaurus_simple to the desired token types
in a configuration, for example:
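-- a sketch; the configuration being altered is illustrative
ALTER TEXT SEARCH CONFIGURATION english
    ALTER MAPPING FOR asciiword, asciihword, hword_asciipart
    WITH thesaurus_simple;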
Consider a simple astronomical thesaurus configuration file containing, for example:
supernovae stars : sn
crab nebulae : crab
Below we create a dictionary and bind some token types to an astronomical thesaurus and English
stemmer:
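A sketch of such a setup (the dictionary name and its DictFile are illustrative, and any configuration could be altered):
CREATE TEXT SEARCH DICTIONARY thesaurus_astro (
    TEMPLATE = thesaurus,
    DictFile = thesaurus_astro,
    Dictionary = english_stem
);

ALTER TEXT SEARCH CONFIGURATION english
    ALTER MAPPING FOR asciiword, asciihword, hword_asciipart
    WITH thesaurus_astro, english_stem;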
Now we can see how it works. ts_lexize is not very useful for testing a thesaurus, because it treats
its input as a single token. Instead we can use plainto_tsquery and to_tsvector which will
break their input strings into multiple tokens:
SELECT to_tsvector('supernova star');
 to_tsvector
-------------
 'sn':1
To index the original phrase as well as the substitute, just include it in the right-hand part of the
definition:
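For instance, the thesaurus entry could be written as:
supernovae stars : sn supernovae stars
With this rule, plainto_tsquery('supernova star') would produce 'sn' & 'supernova' & 'star' rather than just 'sn'.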
The standard PostgreSQL distribution does not include any Ispell configuration files. Dictionaries for a
large number of languages are available from Ispell. Also, some more modern dictionary file formats
are supported — MySpell (OO < 2.0.1) and Hunspell (OO >= 2.0.2). A large list of dictionaries is
available on the OpenOffice Wiki.
• download dictionary configuration files. OpenOffice extension files have the .oxt extension. It is
necessary to extract .aff and .dic files, change extensions to .affix and .dict. For some
dictionary files it is also necessary to convert characters to the UTF-8 encoding (this can be done, for
example, with iconv for a Norwegian language dictionary).
CREATE TEXT SEARCH DICTIONARY english_hunspell (
    TEMPLATE = ispell,
    DictFile = en_us,
    AffFile = en_us,
    Stopwords = english);
Here, DictFile, AffFile, and StopWords specify the base names of the dictionary, affixes, and
stop-words files. The stop-words file has the same format explained above for the simple dictionary
type. The format of the other files is not specified here but is available from the above-mentioned
web sites.
Ispell dictionaries usually recognize a limited set of words, so they should be followed by another
broader dictionary; for example, a Snowball dictionary, which recognizes everything.
prefixes
flag *A:
. > RE # As in enter > reenter
suffixes
flag T:
E > ST # As in late > latest
[^AEIOU]Y > -Y,IEST # As in dirty > dirtiest
[AEIOU]Y > EST # As in gray > grayest
[^EY] > EST # As in small > smallest
lapse/ADGRS
lard/DGRS
large/PRTY
lark/MRS
In the .dict file each basic word is listed with its applicable affix classes, in the format:
basic_form/affix_class_name
In the .affix file every affix flag is described in the following format:
condition > [-stripping_letters,] adding_letters
Here, condition has a format similar to the format of regular expressions. It can use groupings [...]
and [^...]. For example, [AEIOU]Y means that the last letter of the word is "y" and the penulti-
mate letter is "a", "e", "i", "o" or "u". [^EY] means that the last letter is neither "e" nor "y".
Ispell dictionaries support splitting compound words, a useful feature. Notice that the affix file should
specify a special flag using the compoundwords controlled statement that marks dictionary
words that can participate in compound formation:
compoundwords controlled z
SELECT ts_lexize('norwegian_ispell',
'overbuljongterningpakkmesterassistent');
{over,buljong,terning,pakk,mester,assistent}
SELECT ts_lexize('norwegian_ispell', 'sjokoladefabrikk');
{sjokoladefabrikk,sjokolade,fabrikk}
MySpell format is a subset of Hunspell. The .affix file of Hunspell has the following structure:
PFX A Y 1
PFX A 0 re .
SFX T N 4
SFX T 0 st e
SFX T y iest [^aeiou]y
SFX T 0 est [aeiou]y
SFX T 0 est [^ey]
The first line of an affix class is the header; the fields of an affix rule (parameter name, flag, characters to strip, affix to add, and condition) are listed after the header. A .dict file in this format looks like:
larder/M
lardy/RT
large/RSPMYT
largehearted
Note
MySpell does not support compound words. Hunspell has sophisticated support for compound
words. At present, PostgreSQL implements only the basic compound word operations of Hun-
spell.
A Snowball dictionary recognizes everything, whether or not it is able to simplify the word, so it
should be placed at the end of the dictionary list. It is useless to have it before any other dictionary
because a token will never pass through it to the next dictionary. (See https://snowballstem.org/ for
more information about the Snowball stemmers.)
Several predefined text search configurations are available, and you can create custom configurations
easily. To facilitate management of text search objects, a set of SQL commands is available, and there
are several psql commands that display information about text search objects (Section 12.10).
As an example we will create a configuration pg, starting by duplicating the built-in english con-
figuration:
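A sketch of this step:
CREATE TEXT SEARCH CONFIGURATION public.pg ( COPY = pg_catalog.english );
We will use a PostgreSQL-specific synonym list stored, for example, in $SHAREDIR/tsearch_data/pg_dict.syn, with contents like: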
postgres pg
pgsql pg
postgresql pg
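This list can be turned into a synonym dictionary, for example:
CREATE TEXT SEARCH DICTIONARY pg_dict (
    TEMPLATE = synonym,
    SYNONYMS = pg_dict
);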
Next we register the Ispell dictionary english_ispell, which has its own configuration files:
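A sketch of this step (the file base names are assumptions and must exist under $SHAREDIR/tsearch_data):
CREATE TEXT SEARCH DICTIONARY english_ispell (
    TEMPLATE = ispell,
    DictFile = english,
    AffFile = english,
    StopWords = english
);

Now we can set up the mappings for words in the configuration pg:
ALTER TEXT SEARCH CONFIGURATION pg
    ALTER MAPPING FOR asciiword, asciihword, hword_asciipart,
                      word, hword, hword_part
    WITH pg_dict, english_ispell, english_stem;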
We choose not to index or search some token types that the built-in configuration does handle:
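For example:
ALTER TEXT SEARCH CONFIGURATION pg
    DROP MAPPING FOR email, url, url_path, sfloat, float;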
The next step is to set the session to use the new configuration, which was created in the public
schema:
=> \dF
List of text search configurations
Schema | Name | Description
---------+------+-------------
public | pg |
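-- make the new configuration the session default (assumed intermediate step)
SET default_text_search_config = 'public.pg';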
SHOW default_text_search_config;
default_text_search_config
----------------------------
public.pg
ts_debug displays information about every token of document as produced by the parser and
processed by the configured dictionaries. It uses the configuration specified by config, or de-
fault_text_search_config if that argument is omitted.
ts_debug returns one row for each token identified in the text by the parser. The columns returned
are
For a more extensive demonstration, we first create a public.english configuration and Ispell
dictionary for the English language:
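A sketch of the setup (the wide ts_debug output itself is omitted here):
CREATE TEXT SEARCH CONFIGURATION public.english ( COPY = pg_catalog.english );

CREATE TEXT SEARCH DICTIONARY english_ispell (
    TEMPLATE = ispell,
    DictFile = english,
    AffFile = english,
    StopWords = english
);

ALTER TEXT SEARCH CONFIGURATION public.english
    ALTER MAPPING FOR asciiword WITH english_ispell, english_stem;

SELECT * FROM ts_debug('public.english', 'The Brightest supernovaes');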
In this example, the word Brightest was recognized by the parser as an ASCII word (alias
asciiword). For this token type the dictionary list is english_ispell and english_stem.
The word was recognized by english_ispell, which reduced it to the noun bright. The word
supernovaes is unknown to the english_ispell dictionary so it was passed to the next dic-
tionary, and, fortunately, was recognized (in fact, english_stem is a Snowball dictionary which
recognizes everything; that is why it was placed at the end of the dictionary list).
The word The was recognized by the english_ispell dictionary as a stop word (Section 12.6.1)
and will not be indexed. The spaces are discarded too, since the configuration provides no dictionaries
at all for them.
You can reduce the width of the output by explicitly specifying which columns you want to see:
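For example:
SELECT alias, token, dictionary, lexemes
FROM ts_debug('public.english', 'The Brightest supernovaes');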
ts_parse parses the given document and returns a series of records, one for each token produced
by parsing. Each record includes a tokid showing the assigned token type and a token which is
the text of the token. For example:
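-- illustrative sketch; each row pairs a token type id with the token text
SELECT * FROM ts_parse('default', '123 - a number');
-- returns rows such as (22, '123'), (12, ' '), (12, '-'), (1, 'a'), (12, ' '), (1, 'number')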
ts_token_type returns a table which describes each type of token the specified parser can recog-
nize. For each token type, the table gives the integer tokid that the parser uses to label a token of that
type, the alias that names the token type in configuration commands, and a short description.
For example:
SELECT * FROM ts_token_type('default');
 tokid | alias           | description
-------+-----------------+------------------------------------------
 1 | asciiword | Word, all ASCII
2 | word | Word, all letters
3 | numword | Word, letters and digits
4 | email | Email address
5 | url | URL
6 | host | Host
7 | sfloat | Scientific notation
8 | version | Version number
9 | hword_numpart | Hyphenated word part, letters and digits
10 | hword_part | Hyphenated word part, all letters
11 | hword_asciipart | Hyphenated word part, all ASCII
12 | blank | Space symbols
13 | tag | XML tag
14 | protocol | Protocol head
15 | numhword | Hyphenated word, letters and digits
16 | asciihword | Hyphenated word, all ASCII
17 | hword | Hyphenated word, all letters
18 | url_path | URL path
19 | file | File or path name
20 | float | Decimal notation
21 | int | Signed integer
22 | uint | Unsigned integer
23 | entity | XML entity
ts_lexize returns an array of lexemes if the input token is known to the dictionary, or an empty
array if the token is known to the dictionary but it is a stop word, or NULL if it is an unknown word.
Examples:
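SELECT ts_lexize('english_stem', 'stars');
-- result: {star}

SELECT ts_lexize('english_stem', 'a');
-- result: {}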
Note
The ts_lexize function expects a single token, not text. Here is a case where this can be
confusing:
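-- a sketch, assuming the thesaurus_astro dictionary sketched earlier is installed
SELECT ts_lexize('thesaurus_astro', 'supernovae stars') IS NULL;
-- result: t, because ts_lexize passes the whole string as a single token,
-- which the thesaurus does not recognize without parsing it into words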
CREATE INDEX name ON table USING GIN (column);
Creates a GIN (Generalized Inverted Index)-based index. The column must be of tsvector type.
CREATE INDEX name ON table USING GIST (column);
Creates a GiST (Generalized Search Tree)-based index. The column can be of tsvector or tsquery type.
GIN indexes are the preferred text search index type. As inverted indexes, they contain an index entry
for each word (lexeme), with a compressed list of matching locations. Multi-word searches can find
the first match, then use the index to remove rows that are lacking additional words. GIN indexes store
only the words (lexemes) of tsvector values, and not their weight labels. Thus a table row recheck
is needed when using a query that involves weights.
A GiST index is lossy, meaning that the index might produce false matches, and it is necessary to
check the actual table row to eliminate such false matches. (PostgreSQL does this automatically when
needed.) GiST indexes are lossy because each document is represented in the index by a fixed-length
signature. The signature is generated by hashing each word into a single bit in an n-bit string, with all
these bits OR-ed together to produce an n-bit document signature. When two words hash to the same
bit position there will be a false match. If all words in the query have matches (real or false) then the
table row must be retrieved to see if the match is correct.
Lossiness causes performance degradation due to unnecessary fetches of table records that turn out
to be false matches. Since random access to table records is slow, this limits the usefulness of GiST
indexes. The likelihood of false matches depends on several factors, in particular the number of unique
words, so using dictionaries to reduce this number is recommended.
Note that GIN index build time can often be improved by increasing maintenance_work_mem, while
GiST index build time is not sensitive to that parameter.
Partitioning of big collections and the proper use of GIN and GiST indexes allows the implementation
of very fast searches with online update. Partitioning can be done at the database level using table
inheritance, or by distributing documents over servers and collecting external search results, e.g., via
Foreign Data access. The latter is possible because ranking functions use only local information.
The following psql commands display information about text search objects:
\dF{d,p,t}[+] [PATTERN]
The optional parameter PATTERN can be the name of a text search object, optionally schema-qualified.
If PATTERN is omitted then information about all visible objects will be displayed. PATTERN can be a
regular expression and can provide separate patterns for the schema and object names. For example,
\dF *fulltext* lists visible configurations whose names contain “fulltext”, while \dF *.fulltext* also
searches all schemas.
\dF[+] [PATTERN]
List text search configurations (add + for more detail).
\dFd[+] [PATTERN]
List text search dictionaries (add + for more detail).
=> \dFd
List of text search dictionaries
Schema | Name |
Description
------------+-----------------
+-----------------------------------------------------------
pg_catalog | danish_stem | snowball stemmer for danish
language
pg_catalog | dutch_stem | snowball stemmer for dutch
language
pg_catalog | english_stem | snowball stemmer for english
language
pg_catalog | finnish_stem | snowball stemmer for finnish
language
pg_catalog | french_stem | snowball stemmer for french
language
pg_catalog | german_stem | snowball stemmer for german
language
pg_catalog | hungarian_stem | snowball stemmer for hungarian
language
pg_catalog | italian_stem | snowball stemmer for italian
language
pg_catalog | norwegian_stem | snowball stemmer for norwegian
language
pg_catalog | portuguese_stem | snowball stemmer for portuguese
language
pg_catalog | romanian_stem | snowball stemmer for romanian
language
pg_catalog | russian_stem | snowball stemmer for russian
language
pg_catalog | simple | simple dictionary: just lower
case and check for stopword
pg_catalog | spanish_stem | snowball stemmer for spanish
language
pg_catalog | swedish_stem | snowball stemmer for swedish
language
pg_catalog | turkish_stem | snowball stemmer for turkish
language
\dFp[+] [PATTERN]
List text search parsers (add + for more detail).
=> \dFp
List of text search parsers
Schema | Name | Description
------------+---------+---------------------
pg_catalog | default | default word parser
=> \dFp+
Text search parser "pg_catalog.default"
Method | Function | Description
-----------------+----------------+-------------
Start parse | prsd_start |
\dFt[+] [PATTERN]
List text search templates (add + for more detail).
=> \dFt
List of text search templates
Schema | Name | Description
------------+-----------
+-----------------------------------------------------------
pg_catalog | ispell | ispell dictionary
pg_catalog | simple | simple dictionary: just lower case and
check for stopword
pg_catalog | snowball | snowball stemmer
pg_catalog | synonym | synonym dictionary: replace word by
its synonym
pg_catalog | thesaurus | thesaurus dictionary: phrase by phrase
substitution
12.11. Limitations
The current limitations of PostgreSQL's text search features are:
• The length of each lexeme must be less than 2 kilobytes
• The length of a tsvector (lexemes + positions) must be less than 1 megabyte
• The number of lexemes must be less than 2^64
• Position values in tsvector must be greater than 0 and no more than 16,383
• The match distance in a <N> (FOLLOWED BY) tsquery operator cannot be more than 16,384
• No more than 256 positions per lexeme
• The number of nodes (lexemes plus operators) in a tsquery must be less than 32,768
For comparison, the PostgreSQL 8.1 documentation contained 10,441 unique words, a total of 335,420
words, and the most frequent word “postgresql” was mentioned 6,127 times in 655 documents.
Another example — the PostgreSQL mailing list archives contained 910,989 unique words with
57,491,343 lexemes in 461,020 messages.
Chapter 13. Concurrency Control
This chapter describes the behavior of the PostgreSQL database system when two or more sessions
try to access the same data at the same time. The goals in that situation are to allow efficient access for
all sessions while maintaining strict data integrity. Every developer of database applications should
be familiar with the topics covered in this chapter.
13.1. Introduction
PostgreSQL provides a rich set of tools for developers to manage concurrent access to data. Internally,
data consistency is maintained by using a multiversion model (Multiversion Concurrency Control,
MVCC). This means that each SQL statement sees a snapshot of data (a database version) as it was
some time ago, regardless of the current state of the underlying data. This prevents statements from
viewing inconsistent data produced by concurrent transactions performing updates on the same data
rows, providing transaction isolation for each database session. MVCC, by eschewing the locking
methodologies of traditional database systems, minimizes lock contention in order to allow for rea-
sonable performance in multiuser environments.
The main advantage of using the MVCC model of concurrency control rather than locking is that
in MVCC locks acquired for querying (reading) data do not conflict with locks acquired for writing
data, and so reading never blocks writing and writing never blocks reading. PostgreSQL maintains
this guarantee even when providing the strictest level of transaction isolation through the use of an
innovative Serializable Snapshot Isolation (SSI) level.
Table- and row-level locking facilities are also available in PostgreSQL for applications which don't
generally need full transaction isolation and prefer to explicitly manage particular points of conflict.
However, proper use of MVCC will generally provide better performance than locks. In addition,
application-defined advisory locks provide a mechanism for acquiring locks that are not tied to a single
transaction.
Transaction isolation levels are described in terms of which of the following phenomena they prevent between concurrent transactions:
dirty read
A transaction reads data written by a concurrent uncommitted transaction.
nonrepeatable read
A transaction re-reads data it has previously read and finds that data has been modified by another
transaction (that committed since the initial read).
phantom read
A transaction re-executes a query returning a set of rows that satisfy a search condition and finds
that the set of rows satisfying the condition has changed due to another recently-committed trans-
action.
serialization anomaly
The result of successfully committing a group of transactions is inconsistent with all possible
orderings of running those transactions one at a time.
The SQL standard and PostgreSQL-implemented transaction isolation levels are described in Ta-
ble 13.1.
In PostgreSQL, you can request any of the four standard transaction isolation levels, but internally only
three distinct isolation levels are implemented, i.e., PostgreSQL's Read Uncommitted mode behaves
like Read Committed. This is because it is the only sensible way to map the standard isolation levels
to PostgreSQL's multiversion concurrency control architecture.
The table also shows that PostgreSQL's Repeatable Read implementation does not allow phantom
reads. This is acceptable under the SQL standard because the standard specifies which anomalies must
not occur at certain isolation levels; higher guarantees are acceptable. The behavior of the available
isolation levels is detailed in the following subsections.
To set the transaction isolation level of a transaction, use the command SET TRANSACTION.
Important
Some PostgreSQL data types and functions have special rules regarding transactional behav-
ior. In particular, changes made to a sequence (and therefore the counter of a column declared
using serial) are immediately visible to all other transactions and are not rolled back if the
transaction that made the changes aborts. See Section 9.16 and Section 8.1.4.
UPDATE, DELETE, SELECT FOR UPDATE, and SELECT FOR SHARE commands behave the same
as SELECT in terms of searching for target rows: they will only find target rows that were committed
as of the command start time. However, such a target row might have already been updated (or deleted
or locked) by another concurrent transaction by the time it is found. In this case, the would-be updater
will wait for the first updating transaction to commit or roll back (if it is still in progress). If the first
updater rolls back, then its effects are negated and the second updater can proceed with updating the
originally found row. If the first updater commits, the second updater will ignore the row if the first
updater deleted it, otherwise it will attempt to apply its operation to the updated version of the row.
The search condition of the command (the WHERE clause) is re-evaluated to see if the updated version
of the row still matches the search condition. If so, the second updater proceeds with its operation
using the updated version of the row. In the case of SELECT FOR UPDATE and SELECT FOR
SHARE, this means it is the updated version of the row that is locked and returned to the client.
INSERT with an ON CONFLICT DO UPDATE clause behaves similarly. In Read Committed mode,
each row proposed for insertion will either insert or update. Unless there are unrelated errors, one of
those two outcomes is guaranteed. If a conflict originates in another transaction whose effects are not
yet visible to the INSERT, the UPDATE clause will affect that row, even though possibly no version
of that row is conventionally visible to the command.
INSERT with an ON CONFLICT DO NOTHING clause may have insertion not proceed for a row due
to the outcome of another transaction whose effects are not visible to the INSERT snapshot. Again,
this is only the case in Read Committed mode.
Because of the above rules, it is possible for an updating command to see an inconsistent snapshot:
it can see the effects of concurrent updating commands on the same rows it is trying to update, but
it does not see effects of those commands on other rows in the database. This behavior makes Read
Committed mode unsuitable for commands that involve complex search conditions; however, it is just
right for simpler cases. For example, consider updating bank balances with transactions like:
BEGIN;
UPDATE accounts SET balance = balance + 100.00 WHERE acctnum =
12345;
UPDATE accounts SET balance = balance - 100.00 WHERE acctnum =
7534;
COMMIT;
If two such transactions concurrently try to change the balance of account 12345, we clearly want the
second transaction to start with the updated version of the account's row. Because each command is
affecting only a predetermined row, letting it see the updated version of the row does not create any
troublesome inconsistency.
More complex usage can produce undesirable results in Read Committed mode. For example, con-
sider a DELETE command operating on data that is being both added and removed from its restric-
tion criteria by another command, e.g., assume website is a two-row table with website.hits
equaling 9 and 10:
BEGIN;
UPDATE website SET hits = hits + 1;
-- run from another session: DELETE FROM website WHERE hits = 10;
COMMIT;
The DELETE will have no effect even though there is a website.hits = 10 row before and
after the UPDATE. This occurs because the pre-update row value 9 is skipped, and when the UPDATE
completes and DELETE obtains a lock, the new row value is no longer 10 but 11, which no longer
matches the criteria.
Because Read Committed mode starts each command with a new snapshot that includes all transac-
tions committed up to that instant, subsequent commands in the same transaction will see the effects
of the committed concurrent transaction in any case. The point at issue above is whether or not a single
command sees an absolutely consistent view of the database.
The partial transaction isolation provided by Read Committed mode is adequate for many applications,
and this mode is fast and simple to use; however, it is not sufficient for all cases. Applications that
do complex queries and updates might require a more rigorously consistent view of the database than
Read Committed mode provides.
This level is different from Read Committed in that a query in a repeatable read transaction sees a
snapshot as of the start of the first non-transaction-control statement in the transaction, not as of the
start of the current statement within the transaction. Thus, successive SELECT commands within a
single transaction see the same data, i.e., they do not see changes made by other transactions that
committed after their own transaction started.
Applications using this level must be prepared to retry transactions due to serialization failures.
UPDATE, DELETE, SELECT FOR UPDATE, and SELECT FOR SHARE commands behave the same
as SELECT in terms of searching for target rows: they will only find target rows that were committed
as of the transaction start time. However, such a target row might have already been updated (or deleted
or locked) by another concurrent transaction by the time it is found. In this case, the repeatable read
transaction will wait for the first updating transaction to commit or roll back (if it is still in progress). If
the first updater rolls back, then its effects are negated and the repeatable read transaction can proceed
with updating the originally found row. But if the first updater commits (and actually updated or
deleted the row, not just locked it) then the repeatable read transaction will be rolled back with the
message
because a repeatable read transaction cannot modify or lock rows changed by other transactions after
the repeatable read transaction began.
When an application receives this error message, it should abort the current transaction and retry the
whole transaction from the beginning. The second time through, the transaction will see the previous-
ly-committed change as part of its initial view of the database, so there is no logical conflict in using
the new version of the row as the starting point for the new transaction's update.
Note that only updating transactions might need to be retried; read-only transactions will never have
serialization conflicts.
The Repeatable Read mode provides a rigorous guarantee that each transaction sees a completely stable
view of the database. However, this view will not necessarily always be consistent with some serial
(one at a time) execution of concurrent transactions of the same level. For example, even a read only
transaction at this level may see a control record updated to show that a batch has been completed but
not see one of the detail records which is logically part of the batch because it read an earlier revision
of the control record. Attempts to enforce business rules by transactions running at this isolation level
are not likely to work correctly without careful use of explicit locks to block conflicting transactions.
The Repeatable Read isolation level is implemented using a technique known in academic database
literature and in some other database products as Snapshot Isolation. Differences in behavior and
performance may be observed when compared with systems that use a traditional locking technique
that reduces concurrency. Some other systems may even offer Repeatable Read and Snapshot Isolation
as distinct isolation levels with different behavior. The permitted phenomena that distinguish the two
techniques were not formalized by database researchers until after the SQL standard was developed,
and are outside the scope of this manual. For a full treatment, please see [berenson95].
Note
Prior to PostgreSQL version 9.1, a request for the Serializable transaction isolation level pro-
vided exactly the same behavior described here. To retain the legacy Serializable behavior,
Repeatable Read should now be requested.
As a concrete example, consider a table mytab, initially containing:
 class | value
-------+-------
1 | 10
1 | 20
2 | 100
2 | 200
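Suppose that serializable transaction A computes (a sketch; the table name mytab matches the sample data above):
SELECT SUM(value) FROM mytab WHERE class = 1;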
and then inserts the result (30) as the value in a new row with class = 2. Concurrently, serializable
transaction B computes:
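SELECT SUM(value) FROM mytab WHERE class = 2;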
and obtains the result 300, which it inserts in a new row with class = 1. Then both transactions
try to commit. If either transaction were running at the Repeatable Read isolation level, both would
be allowed to commit; but since there is no serial order of execution consistent with the result, using
Serializable transactions will allow one transaction to commit and will roll the other back with this
message:
ERROR:  could not serialize access due to read/write dependencies among transactions
This is because if A had executed before B, B would have computed the sum 330, not 300, and similarly
the other order would have resulted in a different sum computed by A.
When relying on Serializable transactions to prevent anomalies, it is important that any data read from
a permanent user table not be considered valid until the transaction which read it has successfully
committed. This is true even for read-only transactions, except that data read within a deferrable read-
only transaction is known to be valid as soon as it is read, because such a transaction waits until it
can acquire a snapshot guaranteed to be free from such problems before starting to read any data. In
all other cases applications must not depend on results read during a transaction that later aborted;
instead, they should retry the transaction until it succeeds.
To guarantee true serializability PostgreSQL uses predicate locking, which means that it keeps locks
which allow it to determine when a write would have had an impact on the result of a previous read
from a concurrent transaction, had it run first. In PostgreSQL these locks do not cause any blocking and
therefore can not play any part in causing a deadlock. They are used to identify and flag dependencies
among concurrent Serializable transactions which in certain combinations can lead to serialization
anomalies. In contrast, a Read Committed or Repeatable Read transaction which wants to ensure data
consistency may need to take out a lock on an entire table, which could block other users attempting
to use that table, or it may use SELECT FOR UPDATE or SELECT FOR SHARE which not only
can block other transactions but cause disk access.
Predicate locks in PostgreSQL, like in most other database systems, are based on data actually accessed
by a transaction. These will show up in the pg_locks system view with a mode of SIReadLock.
The particular locks acquired during execution of a query will depend on the plan used by the query,
and multiple finer-grained locks (e.g., tuple locks) may be combined into fewer coarser-grained locks
(e.g., page locks) during the course of the transaction to prevent exhaustion of the memory used to
track the locks. A READ ONLY transaction may be able to release its SIRead locks before completion,
if it detects that no conflicts can still occur which could lead to a serialization anomaly. In fact, READ
ONLY transactions will often be able to establish that fact at startup and avoid taking any predicate
locks. If you explicitly request a SERIALIZABLE READ ONLY DEFERRABLE transaction, it
will block until it can establish this fact. (This is the only case where Serializable transactions block
but Repeatable Read transactions don't.) On the other hand, SIRead locks often need to be kept past
transaction commit, until overlapping read write transactions complete.
Consistent use of Serializable transactions can simplify development. The guarantee that any set of
successfully committed concurrent Serializable transactions will have the same effect as if they were
run one at a time means that if you can demonstrate that a single transaction, as written, will do the
right thing when run by itself, you can have confidence that it will do the right thing in any mix of
Serializable transactions, even without any information about what those other transactions might do,
or it will not successfully commit. It is important that an environment which uses this technique have
a generalized way of handling serialization failures (which always return with a SQLSTATE value
of '40001'), because it will be very hard to predict exactly which transactions might contribute to the
read/write dependencies and need to be rolled back to prevent serialization anomalies. The monitoring
of read/write dependencies has a cost, as does the restart of transactions which are terminated with a
serialization failure, but balanced against the cost and blocking involved in use of explicit locks and
SELECT FOR UPDATE or SELECT FOR SHARE, Serializable transactions are the best performance
choice for some environments.
While PostgreSQL's Serializable transaction isolation level only allows concurrent transactions to
commit if it can prove there is a serial order of execution that would produce the same effect, it doesn't
always prevent errors from being raised that would not occur in true serial execution. In particular,
it is possible to see unique constraint violations caused by conflicts with overlapping Serializable
transactions even after explicitly checking that the key isn't present before attempting to insert it. This
can be avoided by making sure that all Serializable transactions that insert potentially conflicting keys
explicitly check if they can do so first. For example, imagine an application that asks the user for a
new key and then checks that it doesn't exist already by trying to select it first, or generates a new key
by selecting the maximum existing key and adding one. If some Serializable transactions insert new
keys directly without following this protocol, unique constraints violations might be reported even in
cases where they could not occur in a serial execution of the concurrent transactions.
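A minimal sketch of that check-first protocol (the table mytab, the column keycol, and the key value are assumed names, not part of the original):
BEGIN ISOLATION LEVEL SERIALIZABLE;
SELECT 1 FROM mytab WHERE keycol = 42;
-- insert only if the SELECT above returned no row
INSERT INTO mytab (keycol) VALUES (42);
COMMIT;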
For optimal performance when relying on Serializable transactions for concurrency control, these
issues should be considered:
• Control the number of active connections, using a connection pool if needed. This is always an
important performance consideration, but it can be particularly important in a busy system using
Serializable transactions.
• Don't put more into a single transaction than needed for integrity purposes.
• Don't leave connections dangling “idle in transaction” longer than necessary. The configuration
parameter idle_in_transaction_session_timeout may be used to automatically disconnect lingering
sessions.
• Eliminate explicit locks, SELECT FOR UPDATE, and SELECT FOR SHARE where no longer
needed due to the protections automatically provided by Serializable transactions.
• When the system is forced to combine multiple page-level predicate locks into a single relation-level predicate lock because the predicate lock table is short of memory, an increase in the rate of serialization failures may occur. You can avoid this by increasing max_pred_locks_per_transaction, max_pred_locks_per_relation, and/or max_pred_locks_per_page.
• A sequential scan will always necessitate a relation-level predicate lock. This can result in an increased rate of serialization failures. It may be helpful to encourage the use of index scans by reducing random_page_cost and/or increasing cpu_tuple_cost. Be sure to weigh any decrease in transaction rollbacks and restarts against any overall change in query execution time (see the example after this list).
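A hedged sketch of the kinds of adjustments described in the last two items; the values shown are illustrative, not recommendations:
ALTER SYSTEM SET max_pred_locks_per_transaction = 128;  -- takes effect at the next server restart
ALTER SYSTEM SET random_page_cost = 2.0;
SELECT pg_reload_conf();  -- reload the configuration so the new random_page_cost default applies without a restart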
The Serializable isolation level is implemented using a technique known in academic database litera-
ture as Serializable Snapshot Isolation, which builds on Snapshot Isolation by adding checks for seri-
alization anomalies. Some differences in behavior and performance may be observed when compared
with other systems that use a traditional locking technique. Please see [ports12] for detailed informa-
tion.
To examine a list of the currently outstanding locks in a database server, use the pg_locks system
view. For more information on monitoring the status of the lock manager subsystem, refer to Chap-
ter 28.
13.3. Explicit Locking
13.3.1. Table-Level Locks
The list below shows the available lock modes and the contexts in which they are used automatically by PostgreSQL. You can also acquire any of these locks explicitly with the command LOCK.
ACCESS SHARE (AccessShareLock)
Conflicts with the ACCESS EXCLUSIVE lock mode only.
The SELECT command acquires a lock of this mode on referenced tables. In general, any query that only reads a table and does not modify it will acquire this lock mode.
ROW SHARE (RowShareLock)
Conflicts with the EXCLUSIVE and ACCESS EXCLUSIVE lock modes.
The SELECT FOR UPDATE and SELECT FOR SHARE commands acquire a lock of this mode on the target table(s) (in addition to ACCESS SHARE locks on any other tables that are referenced but not selected FOR UPDATE/FOR SHARE).
ROW EXCLUSIVE (RowExclusiveLock)
Conflicts with the SHARE, SHARE ROW EXCLUSIVE, EXCLUSIVE, and ACCESS EXCLUSIVE lock modes.
The commands UPDATE, DELETE, and INSERT acquire this lock mode on the target table (in
addition to ACCESS SHARE locks on any other referenced tables). In general, this lock mode
will be acquired by any command that modifies data in a table.
SHARE UPDATE EXCLUSIVE (ShareUpdateExclusiveLock)
Conflicts with the SHARE UPDATE EXCLUSIVE, SHARE, SHARE ROW EXCLUSIVE, EXCLUSIVE, and ACCESS EXCLUSIVE lock modes. This mode protects a table against concurrent schema changes and VACUUM runs.
Acquired by VACUUM (without FULL), ANALYZE, and CREATE INDEX CONCURRENTLY, among other commands.
SHARE (ShareLock)
Conflicts with the ROW EXCLUSIVE, SHARE UPDATE EXCLUSIVE, SHARE ROW EXCLUSIVE, EXCLUSIVE, and ACCESS EXCLUSIVE lock modes. This mode protects a table against concurrent data changes.
Acquired by CREATE INDEX (without CONCURRENTLY).
SHARE ROW EXCLUSIVE (ShareRowExclusiveLock)
Conflicts with the ROW EXCLUSIVE, SHARE UPDATE EXCLUSIVE, SHARE, SHARE ROW EXCLUSIVE, EXCLUSIVE, and ACCESS EXCLUSIVE lock modes. This mode protects a table against concurrent data changes, and is self-exclusive so that only one session can hold it at a time.
Acquired by CREATE COLLATION, CREATE TRIGGER, and many forms of ALTER TABLE
(see ALTER TABLE).
EXCLUSIVE (ExclusiveLock)
Conflicts with the ROW SHARE, ROW EXCLUSIVE, SHARE UPDATE EXCLUSIVE, SHARE, SHARE ROW EXCLUSIVE, EXCLUSIVE, and ACCESS EXCLUSIVE lock modes. This mode allows only concurrent ACCESS SHARE locks, i.e., only reads from the table can proceed in parallel with a transaction holding this lock mode.
Acquired by REFRESH MATERIALIZED VIEW CONCURRENTLY.
ACCESS EXCLUSIVE (AccessExclusiveLock)
Conflicts with locks of all modes (ACCESS SHARE, ROW SHARE, ROW EXCLUSIVE, SHARE UPDATE EXCLUSIVE, SHARE, SHARE ROW EXCLUSIVE, EXCLUSIVE, and ACCESS EXCLUSIVE). This mode guarantees that the holder is the only transaction accessing the table in any way.
Acquired by the DROP TABLE, TRUNCATE, REINDEX, CLUSTER, VACUUM FULL, and REFRESH MATERIALIZED VIEW (without CONCURRENTLY) commands. Many forms of ALTER TABLE also acquire a lock at this level. This is also the default lock mode for LOCK TABLE statements that do not specify a mode explicitly.
Tip
Only an ACCESS EXCLUSIVE lock blocks a SELECT (without FOR UPDATE/SHARE)
statement.
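For example, a transaction that needs to shut out all concurrent access to a table (table name assumed) can issue:
BEGIN;
LOCK TABLE mytable IN ACCESS EXCLUSIVE MODE;  -- ACCESS EXCLUSIVE is the default, so the mode clause could be omitted
-- ... work with the table ...
COMMIT;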
Once acquired, a lock is normally held until the end of the transaction. But if a lock is acquired after
establishing a savepoint, the lock is released immediately if the savepoint is rolled back to. This is
consistent with the principle that ROLLBACK cancels all effects of the commands since the savepoint.
The same holds for locks acquired within a PL/pgSQL exception block: an error escape from the block
releases locks acquired within it.
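A brief sketch of that behavior, with an assumed table name:
BEGIN;
SAVEPOINT s;
LOCK TABLE mytable IN EXCLUSIVE MODE;
ROLLBACK TO SAVEPOINT s;  -- the EXCLUSIVE lock taken after the savepoint is released here
COMMIT;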
13.3.2. Row-Level Locks
In addition to table-level locks, there are row-level locks. A transaction can hold conflicting row-level locks on the same row, even in different subtransactions, but other than that, two transactions can never hold conflicting locks on the same row. Row-level locks do not affect data querying; they block only writers and lockers to the same row. Row-level locks are released at transaction end or during savepoint rollback, just like table-level locks.
FOR UPDATE
FOR UPDATE causes the rows retrieved by the SELECT statement to be locked as though for
update. This prevents them from being locked, modified or deleted by other transactions until
the current transaction ends. That is, other transactions that attempt UPDATE, DELETE, SELECT
FOR UPDATE, SELECT FOR NO KEY UPDATE, SELECT FOR SHARE or SELECT FOR
KEY SHARE of these rows will be blocked until the current transaction ends; conversely, SELECT
FOR UPDATE will wait for a concurrent transaction that has run any of those commands on the
same row, and will then lock and return the updated row (or no row, if the row was deleted). Within
a REPEATABLE READ or SERIALIZABLE transaction, however, an error will be thrown if a
row to be locked has changed since the transaction started. For further discussion see Section 13.4.
The FOR UPDATE lock mode is also acquired by any DELETE on a row, and also by an UPDATE
that modifies the values of certain columns. Currently, the set of columns considered for the
UPDATE case are those that have a unique index on them that can be used in a foreign key (so
partial indexes and expressional indexes are not considered), but this may change in the future.
FOR NO KEY UPDATE
Behaves similarly to FOR UPDATE, except that the lock acquired is weaker: this lock will not
block SELECT FOR KEY SHARE commands that attempt to acquire a lock on the same rows.
This lock mode is also acquired by any UPDATE that does not acquire a FOR UPDATE lock.
FOR SHARE
Behaves similarly to FOR NO KEY UPDATE, except that it acquires a shared lock rather than
exclusive lock on each retrieved row. A shared lock blocks other transactions from performing
UPDATE, DELETE, SELECT FOR UPDATE or SELECT FOR NO KEY UPDATE on these
rows, but it does not prevent them from performing SELECT FOR SHARE or SELECT FOR
KEY SHARE.
FOR KEY SHARE
Behaves similarly to FOR SHARE, except that the lock is weaker: SELECT FOR UPDATE is
blocked, but not SELECT FOR NO KEY UPDATE. A key-shared lock blocks other transactions
from performing DELETE or any UPDATE that changes the key values, but not other UPDATE,
and neither does it prevent SELECT FOR NO KEY UPDATE, SELECT FOR SHARE, or
SELECT FOR KEY SHARE.
PostgreSQL doesn't remember any information about modified rows in memory, so there is no limit
on the number of rows locked at one time. However, locking a row might cause a disk write, e.g.,
SELECT FOR UPDATE modifies selected rows to mark them locked, and so will result in disk writes.
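For example, the strongest and weakest of these modes can be requested explicitly (table and column names assumed):
SELECT * FROM accounts WHERE acctnum = 12345 FOR UPDATE;
SELECT * FROM accounts WHERE acctnum = 12345 FOR KEY SHARE;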
13.3.4. Deadlocks
The use of explicit locking can increase the likelihood of deadlocks, wherein two (or more) transac-
tions each hold locks that the other wants. For example, if transaction 1 acquires an exclusive lock
on table A and then tries to acquire an exclusive lock on table B, while transaction 2 has already
exclusive-locked table B and now wants an exclusive lock on table A, then neither one can proceed.
PostgreSQL automatically detects deadlock situations and resolves them by aborting one of the trans-
actions involved, allowing the other(s) to complete. (Exactly which transaction will be aborted is dif-
ficult to predict and should not be relied upon.)
Note that deadlocks can also occur as the result of row-level locks (and thus, they can occur even if
explicit locking is not used). Consider the case in which two concurrent transactions modify a table.
The first transaction executes:
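(The statements in this example are reconstructed sketches; the accounts table, the acctnum column, and the account numbers are assumed names and values.)
UPDATE accounts SET balance = balance + 100.00 WHERE acctnum = 11111;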
This acquires a row-level lock on the row with the specified account number. Then, the second transaction executes:
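UPDATE accounts SET balance = balance + 100.00 WHERE acctnum = 22222;  -- same assumed names as above
UPDATE accounts SET balance = balance + 100.00 WHERE acctnum = 11111;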
The first UPDATE statement successfully acquires a row-level lock on the specified row, so it succeeds
in updating that row. However, the second UPDATE statement finds that the row it is attempting to
update has already been locked, so it waits for the transaction that acquired the lock to complete.
Transaction two is now waiting on transaction one to complete before it continues execution. Now,
transaction one executes:
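UPDATE accounts SET balance = balance + 100.00 WHERE acctnum = 22222;  -- same assumed names as above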
Transaction one attempts to acquire a row-level lock on the specified row, but it cannot: transaction
two already holds such a lock. So it waits for transaction two to complete. Thus, transaction one is
blocked on transaction two, and transaction two is blocked on transaction one: a deadlock condition.
PostgreSQL will detect this situation and abort one of the transactions.
The best defense against deadlocks is generally to avoid them by being certain that all applications
using a database acquire locks on multiple objects in a consistent order. In the example above, if both
transactions had updated the rows in the same order, no deadlock would have occurred. One should
also ensure that the first lock acquired on an object in a transaction is the most restrictive mode that
will be needed for that object. If it is not feasible to verify this in advance, then deadlocks can be
handled on-the-fly by retrying transactions that abort due to deadlocks.
So long as no deadlock situation is detected, a transaction seeking either a table-level or row-level lock
will wait indefinitely for conflicting locks to be released. This means it is a bad idea for applications
to hold transactions open for long periods of time (e.g., while waiting for user input).
13.3.5. Advisory Locks
PostgreSQL provides a means for creating locks that have application-defined meanings. These are called advisory locks, because the system does not enforce their use; it is up to the application to use them correctly.
There are two ways to acquire an advisory lock in PostgreSQL: at session level or at transaction level.
Once acquired at session level, an advisory lock is held until explicitly released or the session ends.
Unlike standard lock requests, session-level advisory lock requests do not honor transaction semantics:
a lock acquired during a transaction that is later rolled back will still be held following the rollback,
and likewise an unlock is effective even if the calling transaction fails later. A lock can be acquired
multiple times by its owning process; for each completed lock request there must be a corresponding
unlock request before the lock is actually released. Transaction-level lock requests, on the other hand,
behave more like regular lock requests: they are automatically released at the end of the transaction,
and there is no explicit unlock operation. This behavior is often more convenient than the session-level
behavior for short-term usage of an advisory lock. Session-level and transaction-level lock requests
for the same advisory lock identifier will block each other in the expected way. If a session already
holds a given advisory lock, additional requests by it will always succeed, even if other sessions are
awaiting the lock; this statement is true regardless of whether the existing lock hold and new request
are at session level or transaction level.
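For example (the lock key 12345 is arbitrary):
SELECT pg_advisory_lock(12345);       -- session level; held until pg_advisory_unlock or session end
SELECT pg_advisory_unlock(12345);
BEGIN;
SELECT pg_advisory_xact_lock(12345);  -- transaction level; released automatically at COMMIT or ROLLBACK
COMMIT;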
Like all locks in PostgreSQL, a complete list of advisory locks currently held by any session can be
found in the pg_locks system view.
Both advisory locks and regular locks are stored in a shared memory pool whose size is defined by the
configuration variables max_locks_per_transaction and max_connections. Care must be taken not to
exhaust this memory or the server will be unable to grant any locks at all. This imposes an upper limit
on the number of advisory locks grantable by the server, typically in the tens to hundreds of thousands
depending on how the server is configured.
In certain cases using advisory locking methods, especially in queries involving explicit ordering and
LIMIT clauses, care must be taken to control the locks acquired because of the order in which SQL
expressions are evaluated. For example:
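The queries below are a reconstruction of the kind of statements this caution is about; the table foo and its id column are assumed names:
SELECT pg_advisory_lock(id) FROM foo WHERE id = 12345;            -- ok
SELECT pg_advisory_lock(id) FROM foo WHERE id > 12345 LIMIT 100;  -- danger!
SELECT pg_advisory_lock(q.id) FROM
(
  SELECT id FROM foo WHERE id > 12345 LIMIT 100
) q;                                                              -- ok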
In the above queries, the second form is dangerous because the LIMIT is not guaranteed to be applied
before the locking function is executed. This might cause some locks to be acquired that the application
was not expecting, and hence would fail to release (until it ends the session). From the point of view
of the application, such locks would be dangling, although still viewable in pg_locks.
The functions provided to manipulate advisory locks are described in Section 9.26.10.
13.4. Data Consistency Checks at the Application Level
While a Repeatable Read transaction has a stable view of the data throughout its execution, there is
a subtle issue with using MVCC snapshots for data consistency checks, involving something known
as read/write conflicts. If one transaction writes data and a concurrent transaction attempts to read
the same data (whether before or after the write), it cannot see the work of the other transaction. The
reader then appears to have executed first regardless of which started first or which committed first.
If that is as far as it goes, there is no problem, but if the reader also writes data which is read by
a concurrent transaction there is now a transaction which appears to have run before either of the
previously mentioned transactions. If the transaction which appears to have executed last actually
commits first, it is very easy for a cycle to appear in a graph of the order of execution of the transactions.
When such a cycle appears, integrity checks will not work correctly without some help.
As mentioned in Section 13.2.3, Serializable transactions are just Repeatable Read transactions which
add nonblocking monitoring for dangerous patterns of read/write conflicts. When a pattern is detected
which could cause a cycle in the apparent order of execution, one of the transactions involved is rolled
back to break the cycle.
13.4.1. Enforcing Consistency With Serializable Transactions
If the Serializable transaction isolation level is used for all writes and for all reads which need a consistent view of the data, no other effort is required to ensure consistency. When using this technique, you can avoid placing an unnecessary burden on application programmers by routing the application's database access through a framework which automatically retries transactions that are rolled back with a serialization failure. It may be a good idea to set default_transaction_isolation to serializable. It would also be wise to take some action to ensure that no other transaction isolation level is used, either inadvertently or to subvert integrity checks, through checks of the transaction isolation level in triggers.
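For example, the default could be set for a whole database (database name assumed):
ALTER DATABASE mydb SET default_transaction_isolation = 'serializable';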
Warning
This level of integrity protection using Serializable transactions does not yet extend to hot
standby mode (Section 26.5). Because of that, those using hot standby may want to use Repeatable Read and explicit locking on the master.
13.4.2. Enforcing Consistency With Explicit Blocking Locks
When non-serializable writes are possible, to ensure the current validity of a row and protect it against concurrent updates one must use SELECT FOR UPDATE, SELECT FOR SHARE, or an appropriate
LOCK TABLE statement. (SELECT FOR UPDATE and SELECT FOR SHARE lock just the returned
rows against concurrent updates, while LOCK TABLE locks the whole table.) This should be taken
into account when porting applications to PostgreSQL from other environments.
Also of note to those converting from other environments is the fact that SELECT FOR UPDATE
does not ensure that a concurrent transaction will not update or delete a selected row. To do that in
PostgreSQL you must actually update the row, even if no values need to be changed. SELECT FOR
UPDATE temporarily blocks other transactions from acquiring the same lock or executing an UPDATE
or DELETE which would affect the locked row, but once the transaction holding this lock commits or
rolls back, a blocked transaction will proceed with the conflicting operation unless an actual UPDATE
of the row was performed while the lock was held.
Global validity checks require extra thought under non-serializable MVCC. For example, a banking
application might wish to check that the sum of all credits in one table equals the sum of debits in
another table, when both tables are being actively updated. Comparing the results of two successive
SELECT sum(...) commands will not work reliably in Read Committed mode, since the second
query will likely include the results of transactions not counted by the first. Doing the two sums in a
single repeatable read transaction will give an accurate picture of only the effects of transactions that
committed before the repeatable read transaction started — but one might legitimately wonder whether
the answer is still relevant by the time it is delivered. If the repeatable read transaction itself applied
some changes before trying to make the consistency check, the usefulness of the check becomes even
more debatable, since now it includes some but not all post-transaction-start changes. In such cases
a careful person might wish to lock all tables needed for the check, in order to get an indisputable
picture of current reality. A SHARE mode (or higher) lock guarantees that there are no uncommitted
changes in the locked table, other than those of the current transaction.
Note also that if one is relying on explicit locking to prevent concurrent changes, one should either
use Read Committed mode, or in Repeatable Read mode be careful to obtain locks before performing
queries. A lock obtained by a repeatable read transaction guarantees that no other transactions modi-
fying the table are still running, but if the snapshot seen by the transaction predates obtaining the lock,
it might predate some now-committed changes in the table. A repeatable read transaction's snapshot
is actually frozen at the start of its first query or data-modification command (SELECT, INSERT,
UPDATE, or DELETE), so it is possible to obtain locks explicitly before the snapshot is frozen.
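A minimal sketch of that ordering, with assumed table and column names:
BEGIN ISOLATION LEVEL REPEATABLE READ;
LOCK TABLE accounts IN SHARE MODE;   -- taken before the first query, so before the snapshot is frozen
SELECT sum(balance) FROM accounts;
COMMIT;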
13.5. Caveats
Some DDL commands, currently only TRUNCATE and the table-rewriting forms of ALTER TABLE,
are not MVCC-safe. This means that after the truncation or rewrite commits, the table will appear
empty to concurrent transactions, if they are using a snapshot taken before the DDL command com-
mitted. This will only be an issue for a transaction that did not access the table in question before the
DDL command started — any transaction that has done so would hold at least an ACCESS SHARE
table lock, which would block the DDL command until that transaction completes. So these commands
will not cause any apparent inconsistency in the table contents for successive queries on the target
table, but they could cause visible inconsistency between the contents of the target table and other
tables in the database.
Support for the Serializable transaction isolation level has not yet been added to Hot Standby replica-
tion targets (described in Section 26.5). The strictest isolation level currently supported in hot standby
mode is Repeatable Read. While performing all permanent database writes within Serializable trans-
actions on the master will ensure that all standbys will eventually reach a consistent state, a Repeatable
Read transaction run on the standby can sometimes see a transient state that is inconsistent with any
serial execution of the transactions on the master.
Internal access to the system catalogs is not done using the isolation level of the current transaction.
This means that newly created database objects such as tables are visible to concurrent Repeatable
Read and Serializable transactions, even though the rows they contain are not. In contrast, queries
that explicitly examine the system catalogs don't see rows representing concurrently created database
objects, in the higher isolation levels.
13.6. Locking and Indexes
Though PostgreSQL provides nonblocking read/write access to table data, nonblocking read/write access is not currently offered for every index access method implemented in PostgreSQL. The various index types are handled as follows:
B-tree and GiST indexes
Short-term share/exclusive page-level locks are used for read/write access. Locks are released
immediately after each index row is fetched or inserted. These index types provide the highest
concurrency without deadlock conditions.
Hash indexes
Share/exclusive hash-bucket-level locks are used for read/write access. Locks are released after
the whole bucket is processed. Bucket-level locks provide better concurrency than index-level
ones, but deadlock is possible since the locks are held longer than one index operation.
GIN indexes
Short-term share/exclusive page-level locks are used for read/write access. Locks are released
immediately after each index row is fetched or inserted. But note that insertion of a GIN-indexed
value usually produces several index key insertions per row, so GIN might do substantial work
for a single value's insertion.
Currently, B-tree indexes offer the best performance for concurrent applications; since they also have
more features than hash indexes, they are the recommended index type for concurrent applications
that need to index scalar data. When dealing with non-scalar data, B-trees are not useful, and GiST,
SP-GiST or GIN indexes should be used instead.
Chapter 14. Performance Tips
Query performance can be affected by many things. Some of these can be controlled by the user, while
others are fundamental to the underlying design of the system. This chapter provides some hints about
understanding and tuning PostgreSQL performance.
Examples in this section are drawn from the regression test database after doing a VACUUM ANALYZE,
using 9.3 development sources. You should be able to get similar results if you try the examples
yourself, but your estimated costs and row counts might vary slightly because ANALYZE's statistics
are random samples rather than exact, and because costs are inherently somewhat platform-dependent.
The examples use EXPLAIN's default “text” output format, which is compact and convenient for
humans to read. If you want to feed EXPLAIN's output to a program for further analysis, you should
use one of its machine-readable output formats (XML, JSON, or YAML) instead.
Here is a trivial example, just to show what the output looks like:
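A plain query on the tenk1 regression table, with no WHERE clause, produces it:
EXPLAIN SELECT * FROM tenk1;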
QUERY PLAN
-------------------------------------------------------------
Seq Scan on tenk1 (cost=0.00..458.00 rows=10000 width=244)
Since this query has no WHERE clause, it must scan all the rows of the table, so the planner has chosen
to use a simple sequential scan plan. The numbers that are quoted in parentheses are (left to right):
• Estimated start-up cost. This is the time expended before the output phase can begin, e.g., time to
do the sorting in a sort node.
• Estimated total cost. This is stated on the assumption that the plan node is run to completion, i.e.,
all available rows are retrieved. In practice a node's parent node might stop short of reading all
available rows (see the LIMIT example below).
• Estimated number of rows output by this plan node. Again, the node is assumed to be run to com-
pletion.
• Estimated average width of rows output by this plan node (in bytes).
The costs are measured in arbitrary units determined by the planner's cost parameters (see Section 19.7.2). Traditional practice is to measure the costs in units of disk page fetches; that is, seq_page_cost is conventionally set to 1.0 and the other cost parameters are set relative to that. The examples in this section are run with the default cost parameters.
It's important to understand that the cost of an upper-level node includes the cost of all its child nodes.
It's also important to realize that the cost only reflects things that the planner cares about. In particular,
the cost does not consider the time spent transmitting result rows to the client, which could be an
important factor in the real elapsed time; but the planner ignores it because it cannot change it by
altering the plan. (Every correct plan will output the same row set, we trust.)
The rows value is a little tricky because it is not the number of rows processed or scanned by the plan
node, but rather the number emitted by the node. This is often less than the number scanned, as a result
of filtering by any WHERE-clause conditions that are being applied at the node. Ideally the top-level
rows estimate will approximate the number of rows actually returned, updated, or deleted by the query.
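Returning to the trivial example, the same unqualified query yields:
EXPLAIN SELECT * FROM tenk1;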
QUERY PLAN
-------------------------------------------------------------
Seq Scan on tenk1 (cost=0.00..458.00 rows=10000 width=244)
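These estimates come from the table's size statistics; if you run a query such as
SELECT relpages, reltuples FROM pg_class WHERE relname = 'tenk1';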
you will find that tenk1 has 358 disk pages and 10000 rows. The estimated cost is computed as (disk
pages read * seq_page_cost) + (rows scanned * cpu_tuple_cost). By default, seq_page_cost is
1.0 and cpu_tuple_cost is 0.01, so the estimated cost is (358 * 1.0) + (10000 * 0.01) = 458.
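Now let's add a WHERE condition to the query:
EXPLAIN SELECT * FROM tenk1 WHERE unique1 < 7000;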
QUERY PLAN
------------------------------------------------------------
Seq Scan on tenk1 (cost=0.00..483.00 rows=7001 width=244)
Filter: (unique1 < 7000)
Notice that the EXPLAIN output shows the WHERE clause being applied as a “filter” condition attached
to the Seq Scan plan node. This means that the plan node checks the condition for each row it scans, and
outputs only the ones that pass the condition. The estimate of output rows has been reduced because
of the WHERE clause. However, the scan will still have to visit all 10000 rows, so the cost hasn't
decreased; in fact it has gone up a bit (by 10000 * cpu_operator_cost, to be exact) to reflect the extra
CPU time spent checking the WHERE condition.
The actual number of rows this query would select is 7000, but the rows estimate is only approximate.
If you try to duplicate this experiment, you will probably get a slightly different estimate; moreover,
it can change after each ANALYZE command, because the statistics produced by ANALYZE are taken
from a randomized sample of the table.
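If the condition is made much more selective, the plan changes:
EXPLAIN SELECT * FROM tenk1 WHERE unique1 < 100;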
QUERY PLAN
------------------------------------------------------------------------------
Bitmap Heap Scan on tenk1 (cost=5.07..229.20 rows=101 width=244)
Recheck Cond: (unique1 < 100)
-> Bitmap Index Scan on tenk1_unique1 (cost=0.00..5.04
rows=101 width=0)
Index Cond: (unique1 < 100)
Here the planner has decided to use a two-step plan: the child plan node visits an index to find the lo-
cations of rows matching the index condition, and then the upper plan node actually fetches those rows
from the table itself. Fetching rows separately is much more expensive than reading them sequentially,
but because not all the pages of the table have to be visited, this is still cheaper than a sequential scan.
(The reason for using two plan levels is that the upper plan node sorts the row locations identified
by the index into physical order before reading them, to minimize the cost of separate fetches. The
“bitmap” mentioned in the node names is the mechanism that does the sorting.)
EXPLAIN SELECT * FROM tenk1 WHERE unique1 < 100 AND stringu1 =
'xxx';
QUERY PLAN
------------------------------------------------------------------------------
Bitmap Heap Scan on tenk1 (cost=5.04..229.43 rows=1 width=244)
Recheck Cond: (unique1 < 100)
Filter: (stringu1 = 'xxx'::name)
-> Bitmap Index Scan on tenk1_unique1 (cost=0.00..5.04
rows=101 width=0)
Index Cond: (unique1 < 100)
The added condition stringu1 = 'xxx' reduces the output row count estimate, but not the cost
because we still have to visit the same set of rows. Notice that the stringu1 clause cannot be applied
as an index condition, since this index is only on the unique1 column. Instead it is applied as a
filter on the rows retrieved by the index. Thus the cost has actually gone up slightly to reflect this
extra checking.
In some cases the planner will prefer a “simple” index scan plan:
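For example, with an equality condition on the indexed column:
EXPLAIN SELECT * FROM tenk1 WHERE unique1 = 42;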
QUERY PLAN
-----------------------------------------------------------------------------
Index Scan using tenk1_unique1 on tenk1 (cost=0.29..8.30 rows=1
width=244)
Index Cond: (unique1 = 42)
In this type of plan the table rows are fetched in index order, which makes them even more expensive
to read, but there are so few that the extra cost of sorting the row locations is not worth it. You'll most
often see this plan type for queries that fetch just a single row. It's also often used for queries that have
an ORDER BY condition that matches the index order, because then no extra sorting step is needed
to satisfy the ORDER BY.
If there are separate indexes on several of the columns referenced in WHERE, the planner might choose
to use an AND or OR combination of the indexes:
EXPLAIN SELECT * FROM tenk1 WHERE unique1 < 100 AND unique2 > 9000;
QUERY PLAN
--------------------------------------------------------------------------------
Bitmap Heap Scan on tenk1 (cost=25.08..60.21 rows=10 width=244)
Recheck Cond: ((unique1 < 100) AND (unique2 > 9000))
-> BitmapAnd (cost=25.08..25.08 rows=10 width=0)
-> Bitmap Index Scan on tenk1_unique1 (cost=0.00..5.04
rows=101 width=0)
Index Cond: (unique1 < 100)
-> Bitmap Index Scan on tenk1_unique2 (cost=0.00..19.78
rows=999 width=0)
Index Cond: (unique2 > 9000)
But this requires visiting both indexes, so it's not necessarily a win compared to using just one index
and treating the other condition as a filter. If you vary the ranges involved you'll see the plan change
accordingly.
EXPLAIN SELECT * FROM tenk1 WHERE unique1 < 100 AND unique2 > 9000
LIMIT 2;
QUERY PLAN
--------------------------------------------------------------------------------
Limit (cost=0.29..14.48 rows=2 width=244)
-> Index Scan using tenk1_unique2 on tenk1 (cost=0.29..71.27
rows=10 width=244)
Index Cond: (unique2 > 9000)
Filter: (unique1 < 100)
This is the same query as above, but we added a LIMIT so that not all the rows need be retrieved,
and the planner changed its mind about what to do. Notice that the total cost and row count of the
Index Scan node are shown as if it were run to completion. However, the Limit node is expected to
stop after retrieving only a fifth of those rows, so its total cost is only a fifth as much, and that's the
actual estimated cost of the query. This plan is preferred over adding a Limit node to the previous plan
because the Limit could not avoid paying the startup cost of the bitmap scan, so the total cost would
be something over 25 units with that approach.
Let's try joining two tables, using the columns we have been discussing:
EXPLAIN SELECT *
FROM tenk1 t1, tenk2 t2
WHERE t1.unique1 < 10 AND t1.unique2 = t2.unique2;
QUERY PLAN
--------------------------------------------------------------------------------
Nested Loop (cost=4.65..118.62 rows=10 width=488)
-> Bitmap Heap Scan on tenk1 t1 (cost=4.36..39.47 rows=10
width=244)
Recheck Cond: (unique1 < 10)
-> Bitmap Index Scan on tenk1_unique1 (cost=0.00..4.36
rows=10 width=0)
Index Cond: (unique1 < 10)
-> Index Scan using tenk2_unique2 on tenk2 t2 (cost=0.29..7.91
rows=1 width=244)
In this plan, we have a nested-loop join node with two table scans as inputs, or children. The indentation
of the node summary lines reflects the plan tree structure. The join's first, or “outer”, child is a bitmap
scan similar to those we saw before. Its cost and row count are the same as we'd get from SELECT ...
WHERE unique1 < 10 because we are applying the WHERE clause unique1 < 10 at that node.
The t1.unique2 = t2.unique2 clause is not relevant yet, so it doesn't affect the row count
of the outer scan. The nested-loop join node will run its second, or “inner” child once for each row
obtained from the outer child. Column values from the current outer row can be plugged into the inner
scan; here, the t1.unique2 value from the outer row is available, so we get a plan and costs similar
to what we saw above for a simple SELECT ... WHERE t2.unique2 = constant case. (The
estimated cost is actually a bit lower than what was seen above, as a result of caching that's expected
to occur during the repeated index scans on t2.) The costs of the loop node are then set on the basis
of the cost of the outer scan, plus one repetition of the inner scan for each outer row (10 * 7.91, here),
plus a little CPU time for join processing.
In this example the join's output row count is the same as the product of the two scans' row counts,
but that's not true in all cases because there can be additional WHERE clauses that mention both tables
and so can only be applied at the join point, not to either input scan. Here's an example:
EXPLAIN SELECT *
FROM tenk1 t1, tenk2 t2
WHERE t1.unique1 < 10 AND t2.unique2 < 10 AND t1.hundred <
t2.hundred;
QUERY PLAN
--------------------------------------------------------------------------------
Nested Loop (cost=4.65..49.46 rows=33 width=488)
Join Filter: (t1.hundred < t2.hundred)
-> Bitmap Heap Scan on tenk1 t1 (cost=4.36..39.47 rows=10
width=244)
Recheck Cond: (unique1 < 10)
-> Bitmap Index Scan on tenk1_unique1 (cost=0.00..4.36
rows=10 width=0)
Index Cond: (unique1 < 10)
-> Materialize (cost=0.29..8.51 rows=10 width=244)
-> Index Scan using tenk2_unique2 on tenk2 t2
(cost=0.29..8.46 rows=10 width=244)
Index Cond: (unique2 < 10)
The condition t1.hundred < t2.hundred can't be tested in the tenk2_unique2 index, so
it's applied at the join node. This reduces the estimated output row count of the join node, but does
not change either input scan.
Notice that here the planner has chosen to “materialize” the inner relation of the join, by putting a
Materialize plan node atop it. This means that the t2 index scan will be done just once, even though
the nested-loop join node needs to read that data ten times, once for each row from the outer relation.
The Materialize node saves the data in memory as it's read, and then returns the data from memory
on each subsequent pass.
When dealing with outer joins, you might see join plan nodes with both “Join Filter” and plain “Filter”
conditions attached. Join Filter conditions come from the outer join's ON clause, so a row that fails
the Join Filter condition could still get emitted as a null-extended row. But a plain Filter condition is
applied after the outer-join rules and so acts to remove rows unconditionally. In an inner join there is
no semantic difference between these types of filters.
If we change the query's selectivity a bit, we might get a very different join plan:
EXPLAIN SELECT *
FROM tenk1 t1, tenk2 t2
WHERE t1.unique1 < 100 AND t1.unique2 = t2.unique2;
QUERY PLAN
--------------------------------------------------------------------------------
Hash Join (cost=230.47..713.98 rows=101 width=488)
Hash Cond: (t2.unique2 = t1.unique2)
-> Seq Scan on tenk2 t2 (cost=0.00..445.00 rows=10000
width=244)
-> Hash (cost=229.20..229.20 rows=101 width=244)
-> Bitmap Heap Scan on tenk1 t1 (cost=5.07..229.20
rows=101 width=244)
Recheck Cond: (unique1 < 100)
-> Bitmap Index Scan on tenk1_unique1
(cost=0.00..5.04 rows=101 width=0)
Index Cond: (unique1 < 100)
Here, the planner has chosen to use a hash join, in which rows of one table are entered into an in-
memory hash table, after which the other table is scanned and the hash table is probed for matches to
each row. Again note how the indentation reflects the plan structure: the bitmap scan on tenk1 is the
input to the Hash node, which constructs the hash table. That's then returned to the Hash Join node,
which reads rows from its outer child plan and searches the hash table for each one.
EXPLAIN SELECT *
FROM tenk1 t1, onek t2
WHERE t1.unique1 < 100 AND t1.unique2 = t2.unique2;
QUERY PLAN
--------------------------------------------------------------------------------
Merge Join (cost=198.11..268.19 rows=10 width=488)
Merge Cond: (t1.unique2 = t2.unique2)
-> Index Scan using tenk1_unique2 on tenk1 t1
(cost=0.29..656.28 rows=101 width=244)
Filter: (unique1 < 100)
-> Sort (cost=197.83..200.33 rows=1000 width=244)
Sort Key: t2.unique2
-> Seq Scan on onek t2 (cost=0.00..148.00 rows=1000
width=244)
Merge join requires its input data to be sorted on the join keys. In this plan the tenk1 data is sorted
by using an index scan to visit the rows in the correct order, but a sequential scan and sort is preferred
for onek, because there are many more rows to be visited in that table. (Sequential-scan-and-sort
frequently beats an index scan for sorting many rows, because of the nonsequential disk access required
by the index scan.)
One way to look at variant plans is to force the planner to disregard whatever strategy it thought was the
cheapest, using the enable/disable flags described in Section 19.7.1. (This is a crude tool, but useful.
See also Section 14.3.) For example, if we're unconvinced that sequential-scan-and-sort is the best
way to deal with table onek in the previous example, we could try
SET enable_sort = off;
EXPLAIN SELECT *
FROM tenk1 t1, onek t2
WHERE t1.unique1 < 100 AND t1.unique2 = t2.unique2;
QUERY PLAN
--------------------------------------------------------------------------------
Merge Join (cost=0.56..292.65 rows=10 width=488)
Merge Cond: (t1.unique2 = t2.unique2)
-> Index Scan using tenk1_unique2 on tenk1 t1
(cost=0.29..656.28 rows=101 width=244)
Filter: (unique1 < 100)
-> Index Scan using onek_unique2 on onek t2 (cost=0.28..224.79
rows=1000 width=244)
which shows that the planner thinks that sorting onek by index-scanning is about 12% more expensive
than sequential-scan-and-sort. Of course, the next question is whether it's right about that. We can
investigate that using EXPLAIN ANALYZE, as discussed below.
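14.1.2. EXPLAIN ANALYZE
It is possible to check the accuracy of the planner's estimates by using EXPLAIN's ANALYZE option, which actually executes the query and reports the true row counts and run times alongside the estimates. For example, for the nested-loop join examined earlier:
EXPLAIN ANALYZE SELECT *
FROM tenk1 t1, tenk2 t2
WHERE t1.unique1 < 10 AND t1.unique2 = t2.unique2;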
QUERY PLAN
--------------------------------------------------------------------------------
Nested Loop (cost=4.65..118.62 rows=10 width=488) (actual
time=0.128..0.377 rows=10 loops=1)
-> Bitmap Heap Scan on tenk1 t1 (cost=4.36..39.47 rows=10
width=244) (actual time=0.057..0.121 rows=10 loops=1)
Recheck Cond: (unique1 < 10)
-> Bitmap Index Scan on tenk1_unique1 (cost=0.00..4.36
rows=10 width=0) (actual time=0.024..0.024 rows=10 loops=1)
Index Cond: (unique1 < 10)
-> Index Scan using tenk2_unique2 on tenk2 t2 (cost=0.29..7.91
rows=1 width=244) (actual time=0.021..0.022 rows=1 loops=10)
Index Cond: (unique2 = t1.unique2)
Planning time: 0.181 ms
Execution time: 0.501 ms
Note that the “actual time” values are in milliseconds of real time, whereas the cost estimates are
expressed in arbitrary units; so they are unlikely to match up. The thing that's usually most important
to look for is whether the estimated row counts are reasonably close to reality. In this example the
estimates were all dead-on, but that's quite unusual in practice.
In some query plans, it is possible for a subplan node to be executed more than once. For example, the
inner index scan will be executed once per outer row in the above nested-loop plan. In such cases, the
loops value reports the total number of executions of the node, and the actual time and rows values
shown are averages per-execution. This is done to make the numbers comparable with the way that the
cost estimates are shown. Multiply by the loops value to get the total time actually spent in the node.
In the above example, we spent a total of 0.220 milliseconds executing the index scans on tenk2.
In some cases EXPLAIN ANALYZE shows additional execution statistics beyond the plan node exe-
cution times and row counts. For example, Sort and Hash nodes provide extra information:
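For example, sorting the output of the earlier hash join (the query form is inferred from the plan shown below):
EXPLAIN ANALYZE SELECT *
FROM tenk1 t1, tenk2 t2
WHERE t1.unique1 < 100 AND t1.unique2 = t2.unique2 ORDER BY t1.fivethous;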
QUERY PLAN
--------------------------------------------------------------------------------
Sort (cost=717.34..717.59 rows=101 width=488) (actual
time=7.761..7.774 rows=100 loops=1)
Sort Key: t1.fivethous
Sort Method: quicksort Memory: 77kB
-> Hash Join (cost=230.47..713.98 rows=101 width=488) (actual
time=0.711..7.427 rows=100 loops=1)
Hash Cond: (t2.unique2 = t1.unique2)
-> Seq Scan on tenk2 t2 (cost=0.00..445.00 rows=10000
width=244) (actual time=0.007..2.583 rows=10000 loops=1)
-> Hash (cost=229.20..229.20 rows=101 width=244) (actual
time=0.659..0.659 rows=100 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 28kB
-> Bitmap Heap Scan on tenk1 t1 (cost=5.07..229.20
rows=101 width=244) (actual time=0.080..0.526 rows=100 loops=1)
Recheck Cond: (unique1 < 100)
-> Bitmap Index Scan on tenk1_unique1
(cost=0.00..5.04 rows=101 width=0) (actual time=0.049..0.049
rows=100 loops=1)
Index Cond: (unique1 < 100)
Planning time: 0.194 ms
Execution time: 8.008 ms
The Sort node shows the sort method used (in particular, whether the sort was in-memory or on-disk)
and the amount of memory or disk space needed. The Hash node shows the number of hash buckets
and batches as well as the peak amount of memory used for the hash table. (If the number of batches
exceeds one, there will also be disk space usage involved, but that is not shown.)
Another type of extra information is the number of rows removed by a filter condition:
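For example:
EXPLAIN ANALYZE SELECT * FROM tenk1 WHERE ten < 7;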
QUERY PLAN
--------------------------------------------------------------------------------
Seq Scan on tenk1 (cost=0.00..483.00 rows=7000 width=244) (actual
time=0.016..5.107 rows=7000 loops=1)
Filter: (ten < 7)
Rows Removed by Filter: 3000
Planning time: 0.083 ms
Execution time: 5.905 ms
These counts can be particularly valuable for filter conditions applied at join nodes. The “Rows Re-
moved” line only appears when at least one scanned row, or potential join pair in the case of a join
node, is rejected by the filter condition.
A case similar to filter conditions occurs with “lossy” index scans. For example, consider this search
for polygons containing a specific point:
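For example, a containment test of this form (the point literal matches the filter shown below):
EXPLAIN ANALYZE SELECT * FROM polygon_tbl WHERE f1 @> polygon '(0.5, 2.0)';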
QUERY PLAN
--------------------------------------------------------------------------------
Seq Scan on polygon_tbl (cost=0.00..1.05 rows=1 width=32) (actual
time=0.044..0.044 rows=0 loops=1)
Filter: (f1 @> '((0.5,2))'::polygon)
Rows Removed by Filter: 4
Planning time: 0.040 ms
Execution time: 0.083 ms
The planner thinks (quite correctly) that this sample table is too small to bother with an index scan,
so we have a plain sequential scan in which all the rows got rejected by the filter condition. But if we
force an index scan to be used, we see:
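One way to do that is to disable sequential scans for the session and repeat the query:
SET enable_seqscan TO off;
EXPLAIN ANALYZE SELECT * FROM polygon_tbl WHERE f1 @> polygon '(0.5, 2.0)';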
QUERY PLAN
--------------------------------------------------------------------------------
Index Scan using gpolygonind on polygon_tbl (cost=0.13..8.15
rows=1 width=32) (actual time=0.062..0.062 rows=0 loops=1)
Index Cond: (f1 @> '((0.5,2))'::polygon)
Rows Removed by Index Recheck: 1
Planning time: 0.034 ms
Execution time: 0.144 ms
Here we can see that the index returned one candidate row, which was then rejected by a recheck
of the index condition. This happens because a GiST index is “lossy” for polygon containment tests:
it actually returns the rows with polygons that overlap the target, and then we have to do the exact
containment test on those rows.
EXPLAIN has a BUFFERS option that can be used with ANALYZE to get even more run time statistics:
EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM tenk1 WHERE unique1 < 100
AND unique2 > 9000;
QUERY PLAN
--------------------------------------------------------------------------------
Bitmap Heap Scan on tenk1 (cost=25.08..60.21 rows=10 width=244)
(actual time=0.323..0.342 rows=10 loops=1)
Recheck Cond: ((unique1 < 100) AND (unique2 > 9000))
Buffers: shared hit=15
-> BitmapAnd (cost=25.08..25.08 rows=10 width=0) (actual
time=0.309..0.309 rows=0 loops=1)
Buffers: shared hit=7
-> Bitmap Index Scan on tenk1_unique1 (cost=0.00..5.04
rows=101 width=0) (actual time=0.043..0.043 rows=100 loops=1)
Index Cond: (unique1 < 100)
Buffers: shared hit=2
-> Bitmap Index Scan on tenk1_unique2 (cost=0.00..19.78
rows=999 width=0) (actual time=0.227..0.227 rows=999 loops=1)
Index Cond: (unique2 > 9000)
Buffers: shared hit=5
Planning time: 0.088 ms
The numbers provided by BUFFERS help to identify which parts of the query are the most I/O-inten-
sive.
Keep in mind that because EXPLAIN ANALYZE actually runs the query, any side-effects will happen
as usual, even though whatever results the query might output are discarded in favor of printing the
EXPLAIN data. If you want to analyze a data-modifying query without changing your tables, you can
roll the command back afterwards, for example:
BEGIN;
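For example (the particular SET expression is only illustrative of an update touching the selected rows):
EXPLAIN ANALYZE UPDATE tenk1 SET hundred = hundred + 1 WHERE unique1 < 100;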
QUERY PLAN
--------------------------------------------------------------------------------
Update on tenk1 (cost=5.07..229.46 rows=101 width=250) (actual
time=14.628..14.628 rows=0 loops=1)
-> Bitmap Heap Scan on tenk1 (cost=5.07..229.46 rows=101
width=250) (actual time=0.101..0.439 rows=100 loops=1)
Recheck Cond: (unique1 < 100)
-> Bitmap Index Scan on tenk1_unique1 (cost=0.00..5.04
rows=101 width=0) (actual time=0.043..0.043 rows=100 loops=1)
Index Cond: (unique1 < 100)
Planning time: 0.079 ms
Execution time: 14.727 ms
ROLLBACK;
As seen in this example, when the query is an INSERT, UPDATE, or DELETE command, the actual
work of applying the table changes is done by a top-level Insert, Update, or Delete plan node. The
plan nodes underneath this node perform the work of locating the old rows and/or computing the new
data. So above, we see the same sort of bitmap table scan we've seen already, and its output is fed to an
Update node that stores the updated rows. It's worth noting that although the data-modifying node can
take a considerable amount of run time (here, it's consuming the lion's share of the time), the planner
does not currently add anything to the cost estimates to account for that work. That's because the work
to be done is the same for every correct query plan, so it doesn't affect planning decisions.
When an UPDATE or DELETE command affects an inheritance hierarchy, the output might look like
this:
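A hedged sketch of the kind of command involved; the parent table, its three children, and the columns f1 and f2 are assumed names:
EXPLAIN UPDATE parent SET f2 = f2 + 1 WHERE f1 = 101;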
In this example the Update node needs to consider three child tables as well as the originally-mentioned
parent table. So there are four input scanning subplans, one per table. For clarity, the Update node is
annotated to show the specific target tables that will be updated, in the same order as the corresponding
subplans. (These annotations are new as of PostgreSQL 9.5; in prior versions the reader had to intuit
the target tables by inspecting the subplans.)
The Planning time shown by EXPLAIN ANALYZE is the time it took to generate the query plan
from the parsed query and optimize it. It does not include parsing or rewriting.
The Execution time shown by EXPLAIN ANALYZE includes executor start-up and shut-down
time, as well as the time to run any triggers that are fired, but it does not include parsing, rewriting, or
planning time. Time spent executing BEFORE triggers, if any, is included in the time for the related
Insert, Update, or Delete node; but time spent executing AFTER triggers is not counted there because
AFTER triggers are fired after completion of the whole plan. The total time spent in each trigger
(either BEFORE or AFTER) is also shown separately. Note that deferred constraint triggers will not
be executed until end of transaction and are thus not considered at all by EXPLAIN ANALYZE.
14.1.3. Caveats
There are two significant ways in which run times measured by EXPLAIN ANALYZE can deviate
from normal execution of the same query. First, since no output rows are delivered to the client, network transmission costs and I/O conversion costs are not included. Second, the measurement overhead added by EXPLAIN ANALYZE can be significant, especially on machines with slow gettimeofday() operating-system calls. You can use the pg_test_timing tool to measure the overhead of timing on your system.
on your system.
EXPLAIN results should not be extrapolated to situations much different from the one you are actually
testing; for example, results on a toy-sized table cannot be assumed to apply to large tables. The
planner's cost estimates are not linear and so it might choose a different plan for a larger or smaller
table. An extreme example is that on a table that only occupies one disk page, you'll nearly always get
a sequential scan plan whether indexes are available or not. The planner realizes that it's going to take
one disk page read to process the table in any case, so there's no value in expending additional page
reads to look at an index. (We saw this happening in the polygon_tbl example above.)
There are cases in which the actual and estimated values won't match up well, but nothing is really
wrong. One such case occurs when plan node execution is stopped short by a LIMIT or similar effect.
For example, in the LIMIT query we used before,
EXPLAIN ANALYZE SELECT * FROM tenk1 WHERE unique1 < 100 AND unique2
> 9000 LIMIT 2;
QUERY PLAN
--------------------------------------------------------------------------------
Limit (cost=0.29..14.71 rows=2 width=244) (actual
time=0.177..0.249 rows=2 loops=1)
-> Index Scan using tenk1_unique2 on tenk1 (cost=0.29..72.42
rows=10 width=244) (actual time=0.174..0.244 rows=2 loops=1)
Index Cond: (unique2 > 9000)
Filter: (unique1 < 100)
Rows Removed by Filter: 287
Planning time: 0.096 ms
Execution time: 0.336 ms
the estimated cost and row count for the Index Scan node are shown as though it were run to comple-
tion. But in reality the Limit node stopped requesting rows after it got two, so the actual row count is
only 2 and the run time is less than the cost estimate would suggest. This is not an estimation error,
only a discrepancy in the way the estimates and true values are displayed.
Merge joins also have measurement artifacts that can confuse the unwary. A merge join will stop
reading one input if it's exhausted the other input and the next key value in the one input is greater
than the last key value of the other input; in such a case there can be no more matches and so no need
to scan the rest of the first input. This results in not reading all of one child, with results like those
mentioned for LIMIT. Also, if the outer (first) child contains rows with duplicate key values, the
inner (second) child is backed up and rescanned for the portion of its rows matching that key value.
EXPLAIN ANALYZE counts these repeated emissions of the same inner rows as if they were real
additional rows. When there are many outer duplicates, the reported actual row count for the inner child
plan node can be significantly larger than the number of rows that are actually in the inner relation.
BitmapAnd and BitmapOr nodes always report their actual row counts as zero, due to implementation
limitations.
Normally, EXPLAIN will display every plan node created by the planner. However, there are cases
where the executor can determine that certain nodes need not be executed because they cannot produce
any rows, based on parameter values that were not available at planning time. (Currently this can only
happen for child nodes of an Append node that is scanning a partitioned table.) When this happens,
those plan nodes are omitted from the EXPLAIN output and a Subplans Removed: N annotation
appears instead.
One component of the statistics is the total number of entries in each table and index, as well as
the number of disk blocks occupied by each table and index. This information is kept in the table
pg_class, in the columns reltuples and relpages. We can look at it with queries similar to
this one:
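For example:
SELECT relname, relkind, reltuples, relpages
FROM pg_class
WHERE relname LIKE 'tenk1%';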
Here we can see that tenk1 contains 10000 rows, as do its indexes, but the indexes are (unsurpris-
ingly) much smaller than the table.
For efficiency reasons, reltuples and relpages are not updated on-the-fly, and so they usual-
ly contain somewhat out-of-date values. They are updated by VACUUM, ANALYZE, and a few DDL
commands such as CREATE INDEX. A VACUUM or ANALYZE operation that does not scan the entire
table (which is commonly the case) will incrementally update the reltuples count on the basis
of the part of the table it did scan, resulting in an approximate value. In any case, the planner will
scale the values it finds in pg_class to match the current physical table size, thus obtaining a closer
approximation.
Most queries retrieve only a fraction of the rows in a table, due to WHERE clauses that restrict the rows
to be examined. The planner thus needs to make an estimate of the selectivity of WHERE clauses, that
is, the fraction of rows that match each condition in the WHERE clause. The information used for this
task is stored in the pg_statistic system catalog. Entries in pg_statistic are updated by
the ANALYZE and VACUUM ANALYZE commands, and are always approximate even when freshly
updated.
Rather than look at pg_statistic directly, it's better to look at its view pg_stats when examining the statistics manually. pg_stats is designed to be more easily readable. Furthermore, pg_stats is readable by all, whereas pg_statistic is only readable by a superuser. (This prevents
unprivileged users from learning something about the contents of other people's tables from the sta-
tistics. The pg_stats view is restricted to show only rows about tables that the current user can
read.) For example, we might do:
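One such query (the exact column list is illustrative):
SELECT attname, inherited, n_distinct, most_common_vals
FROM pg_stats
WHERE tablename = 'road';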
Note that two rows are displayed for the same column, one corresponding to the complete inheritance
hierarchy starting at the road table (inherited=t), and another one including only the road table
itself (inherited=f).
The amount of information stored in pg_statistic by ANALYZE, in particular the maximum number of entries in the most_common_vals and histogram_bounds arrays for each column, can be set on a column-by-column basis using the ALTER TABLE SET STATISTICS command, or globally by setting the default_statistics_target configuration variable. The default limit is presently
100 entries. Raising the limit might allow more accurate planner estimates to be made, particularly for
columns with irregular data distributions, at the price of consuming more space in pg_statistic
and slightly more time to compute the estimates. Conversely, a lower limit might be sufficient for
columns with simple data distributions.
Further details about the planner's use of statistics can be found in Chapter 71.
Because the number of possible column combinations is very large, it's impractical to compute multivariate statistics automatically. Instead, extended statistics objects, more often called just statistics
objects, can be created to instruct the server to obtain statistics across interesting sets of columns.
Statistics objects are created using the CREATE STATISTICS command. Creation of such an object
merely creates a catalog entry expressing interest in the statistics. Actual data collection is performed
by ANALYZE (either a manual command, or background auto-analyze). The collected values can be
examined in the pg_statistic_ext catalog.
ANALYZE computes extended statistics based on the same sample of table rows that it takes for com-
puting regular single-column statistics. Since the sample size is increased by increasing the statistics
target for the table or any of its columns (as described in the previous section), a larger statistics target
will normally result in more accurate extended statistics, as well as more time spent calculating them.
The following subsections describe the kinds of extended statistics that are currently supported.
The existence of functional dependencies directly affects the accuracy of estimates in certain queries.
If a query contains conditions on both the independent and the dependent column(s), the conditions on
the dependent columns do not further reduce the result size; but without knowledge of the functional
dependency, the query planner will assume that the conditions are independent, resulting in underes-
timating the result size.
To inform the planner about functional dependencies, ANALYZE can collect measurements of cross-
column dependency. Assessing the degree of dependency between all sets of columns would be pro-
hibitively expensive, so data collection is limited to those groups of columns appearing together in a
statistics object defined with the dependencies option. It is advisable to create dependencies
statistics only for column groups that are strongly correlated, to avoid unnecessary overhead in both
ANALYZE and later query planning.
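A sketch of such a statistics object, assuming a zipcodes table with zip and city columns as in the
discussion that follows (the object name stts is arbitrary):
CREATE STATISTICS stts (dependencies) ON zip, city FROM zipcodes;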
ANALYZE zipcodes;
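The collected dependency coefficients, which the next paragraph interprets, can then be examined
with a query along these lines (again using the assumed object name stts):
SELECT stxname, stxkeys, stxdependencies
  FROM pg_statistic_ext
 WHERE stxname = 'stts';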
Here it can be seen that column 1 (zip code) fully determines column 5 (city) so the coefficient is 1.0,
while city only determines zip code about 42% of the time, meaning that there are many cities (58%)
that are represented by more than a single ZIP code.
When computing the selectivity for a query involving functionally dependent columns, the planner
adjusts the per-condition selectivity estimates using the dependency coefficients so as not to produce
an underestimate.
When estimating with functional dependencies, the planner assumes that conditions on the involved
columns are compatible and hence redundant. If they are incompatible, the correct estimate would be
zero rows, but that possibility is not considered. For example, given a query like
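the following (a sketch using the zipcodes example; the particular city and ZIP values are illustrative
and assumed to belong together):
SELECT * FROM zipcodes WHERE city = 'San Francisco' AND zip = '94105';  -- assumed matching pair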
the planner will disregard the city clause as not changing the selectivity, which is correct. However,
it will make the same assumption about
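a query of the following shape, where the values are assumed not to belong together (again, a sketch
with illustrative values):
SELECT * FROM zipcodes WHERE city = 'San Francisco' AND zip = '90210';  -- assumed mismatched pair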
even though there will really be zero rows satisfying this query. Functional dependency statistics do
not provide enough information to conclude that, however.
In many practical situations, this assumption is satisfied; for example, there might be a GUI in
the application that only allows selecting compatible city and ZIP code values to use in a query. But
if that's not the case, functional dependencies may not be a viable option.
To improve such estimates, ANALYZE can collect n-distinct statistics for groups of columns. As be-
fore, it's impractical to do this for every possible column grouping, so data is collected only for those
groups of columns appearing together in a statistics object defined with the ndistinct option. Data
will be collected for each possible combination of two or more columns from the set of listed columns.
Continuing the previous example, the n-distinct counts in a table of ZIP codes might look like the
following:
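A sketch of such a statistics object, assuming the zipcodes table has zip, state, and city columns (the
object name stts2 is arbitrary):
CREATE STATISTICS stts2 (ndistinct) ON zip, state, city FROM zipcodes;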
ANALYZE zipcodes;
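The collected group counts, interpreted in the next paragraph, can then be inspected with a query
along these lines (again using the assumed object name stts2):
SELECT stxkeys AS k, stxndistinct AS nd
  FROM pg_statistic_ext
 WHERE stxname = 'stts2';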
This indicates that there are three combinations of columns that have 33178 distinct values: ZIP code
and state; ZIP code and city; and ZIP code, city and state (the fact that they are all equal is expected
given that ZIP code alone is unique in this table). On the other hand, the combination of city and state
has only 27435 distinct values.
It's advisable to create ndistinct statistics objects only on combinations of columns that are actu-
ally used for grouping, and for which misestimation of the number of groups is resulting in bad plans.
Otherwise, the ANALYZE cycles are just wasted.
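It is possible to constrain the planner's choice of join order with explicit JOIN syntax. In a simple
join query of roughly this shape (a sketch; the table and column names are chosen to match the WHERE
conditions discussed below):
SELECT * FROM a, b, c WHERE a.id = b.id AND b.ref = c.id;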
the planner is free to join the given tables in any order. For example, it could generate a query plan
that joins A to B, using the WHERE condition a.id = b.id, and then joins C to this joined table,
using the other WHERE condition. Or it could join B to C and then join A to that result. Or it could join
A to C and then join them with B — but that would be inefficient, since the full Cartesian product of
A and C would have to be formed, there being no applicable condition in the WHERE clause to allow
optimization of the join. (All joins in the PostgreSQL executor happen between two input tables, so
it's necessary to build up the result in one or another of these fashions.) The important point is that
these different join possibilities give semantically equivalent results but might have hugely different
execution costs. Therefore, the planner will explore all of them to try to find the most efficient query
plan.
When a query only involves two or three tables, there aren't many join orders to worry about. But the
number of possible join orders grows exponentially as the number of tables expands. Beyond ten or so
input tables it's no longer practical to do an exhaustive search of all the possibilities, and even for six
or seven tables planning might take an annoyingly long time. When there are too many input tables,
the PostgreSQL planner will switch from exhaustive search to a genetic probabilistic search through
a limited number of possibilities. (The switch-over threshold is set by the geqo_threshold run-time
parameter.) The genetic search takes less time, but it won't necessarily find the best possible plan.
When the query involves outer joins, the planner has less freedom than it does for plain (inner) joins.
For example, consider:
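A sketch of such a query (the column names are illustrative, following the pattern of the earlier
example):
SELECT * FROM a LEFT JOIN (b JOIN c ON (b.ref = c.id)) ON (a.id = b.id);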
Although this query's restrictions are superficially similar to the previous example, the semantics are
different because a row must be emitted for each row of A that has no matching row in the join of B
and C. Therefore the planner has no choice of join order here: it must join B to C and then join A to
that result. Accordingly, this query takes less time to plan than the previous query. In other cases, the
planner might be able to determine that more than one join order is safe. For example, given:
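a query of roughly this form (again a sketch with illustrative column names):
SELECT * FROM a LEFT JOIN b ON (a.bid = b.id) LEFT JOIN c ON (a.cid = c.id);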
it is valid to join A to either B or C first. Currently, only FULL JOIN completely constrains the join
order. Most practical cases involving LEFT JOIN or RIGHT JOIN can be rearranged to some extent.
Explicit inner join syntax (INNER JOIN, CROSS JOIN, or unadorned JOIN) is semantically the
same as listing the input relations in FROM, so it does not constrain the join order.
Even though most kinds of JOIN don't completely constrain the join order, it is possible to instruct
the PostgreSQL query planner to treat all JOIN clauses as constraining the join order anyway. For
example, these three queries are logically equivalent:
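Three such formulations might look like this (a sketch; the tables and join conditions are illustrative):
SELECT * FROM a, b, c WHERE a.id = b.id AND b.ref = c.id;
SELECT * FROM a CROSS JOIN b CROSS JOIN c WHERE a.id = b.id AND b.ref = c.id;
SELECT * FROM a JOIN (b JOIN c ON (b.ref = c.id)) ON (a.id = b.id);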
But if we tell the planner to honor the JOIN order, the second and third take less time to plan than
the first. This effect is not worth worrying about for only three tables, but it can be a lifesaver with
many tables.
To force the planner to follow the join order laid out by explicit JOINs, set the join_collapse_limit
run-time parameter to 1. (Other possible values are discussed below.)
You do not need to constrain the join order completely in order to cut search time, because it's OK to
use JOIN operators within items of a plain FROM list. For example, consider:
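A sketch (the SET is shown for concreteness, and the WHERE conditions are omitted as immaterial
to the join-order question):
SET join_collapse_limit = 1;
SELECT * FROM a CROSS JOIN b, c, d, e WHERE ...;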
With join_collapse_limit = 1, this forces the planner to join A to B before joining them to
other tables, but doesn't constrain its choices otherwise. In this example, the number of possible join
orders is reduced by a factor of 5.
Constraining the planner's search in this way is a useful technique both for reducing planning time and
for directing the planner to a good query plan. If the planner chooses a bad join order by default, you
can force it to choose a better order via JOIN syntax — assuming that you know of a better order,
that is. Experimentation is recommended.
A closely related issue that affects planning time is collapsing of subqueries into their parent query.
For example, consider:
SELECT *
FROM x, y,
(SELECT * FROM a, b, c WHERE something) AS ss
WHERE somethingelse;
This situation might arise from use of a view that contains a join; the view's SELECT rule will be
inserted in place of the view reference, yielding a query much like the above. Normally, the planner
will try to collapse the subquery into the parent, yielding:
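a flattened query of roughly this shape (a sketch built from the placeholders in the example above):
SELECT * FROM x, y, a, b, c WHERE something AND somethingelse;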
This usually results in a better plan than planning the subquery separately. (For example, the outer
WHERE conditions might be such that joining X to A first eliminates many rows of A, thus avoiding
the need to form the full logical output of the subquery.) But at the same time, we have increased the
planning time; here, we have a five-way join problem replacing two separate three-way join problems.
Because of the exponential growth of the number of possibilities, this makes a big difference. The
planner tries to avoid getting stuck in huge join search problems by not collapsing a subquery if more
than from_collapse_limit FROM items would result in the parent query. You can trade off
planning time against quality of plan by adjusting this run-time parameter up or down.
from_collapse_limit and join_collapse_limit are similarly named because they do almost the same
thing: one controls when the planner will “flatten out” subqueries, and the other controls when it
will flatten out explicit joins. Typically you would either set join_collapse_limit equal to
from_collapse_limit (so that explicit joins and subqueries act similarly) or set join_col-
lapse_limit to 1 (if you want to control join order with explicit joins). But you might set them
differently if you are trying to fine-tune the trade-off between planning time and run time.
If you cannot use COPY, it might help to use PREPARE to create a prepared INSERT statement, and
then use EXECUTE as many times as required. This avoids some of the overhead of repeatedly parsing
and planning INSERT. Different interfaces provide this facility in different ways; look for “prepared
statements” in the interface documentation.
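A minimal sketch at the SQL level, assuming a hypothetical table items(id int, name text):
PREPARE ins (int, text) AS
    INSERT INTO items (id, name) VALUES ($1, $2);  -- parsed and planned once
EXECUTE ins(1, 'one');                             -- then executed as often as needed
EXECUTE ins(2, 'two');
DEALLOCATE ins;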
Note that loading a large number of rows using COPY is almost always faster than using INSERT,
even if PREPARE is used and multiple insertions are batched into a single transaction.
COPY is fastest when used within the same transaction as an earlier CREATE TABLE or TRUNCATE
command. In such cases no WAL needs to be written, because in case of an error, the files contain-
ing the newly loaded data will be removed anyway. However, this consideration only applies when
wal_level is minimal and the table is not partitioned; in all other cases, the commands must write WAL.
If you are adding large amounts of data to an existing table, it might be a win to drop the indexes, load
the table, and then recreate the indexes. Of course, the database performance for other users might
suffer during the time the indexes are missing. One should also think twice before dropping a unique
index, since the error checking afforded by the unique constraint will be lost while the index is missing.
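A sketch of that pattern, assuming a hypothetical table items with an index items_name_idx on its
name column (the file path is likewise illustrative):
DROP INDEX items_name_idx;
COPY items FROM '/path/to/items.csv' WITH (FORMAT csv);  -- bulk load without index maintenance
CREATE INDEX items_name_idx ON items (name);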
What's more, when you load data into a table with existing foreign key constraints, each new row
requires an entry in the server's list of pending trigger events (since it is the firing of a trigger that
checks the row's foreign key constraint). Loading many millions of rows can cause the trigger event
queue to overflow available memory, leading to intolerable swapping or even outright failure of the
command. Therefore it may be necessary, not just desirable, to drop and re-apply foreign keys when
loading large amounts of data. If temporarily removing the constraint isn't acceptable, the only other
recourse may be to split up the load operation into smaller transactions.
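A sketch of temporarily removing a foreign key, with hypothetical table, column, and constraint names:
ALTER TABLE orders DROP CONSTRAINT orders_customer_id_fkey;
COPY orders FROM '/path/to/orders.csv' WITH (FORMAT csv);
ALTER TABLE orders ADD CONSTRAINT orders_customer_id_fkey
    FOREIGN KEY (customer_id) REFERENCES customers (id);   -- re-checks every loaded row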
When loading large amounts of data into an installation that uses WAL archiving or streaming
replication, it can be faster to disable them during the load by setting wal_level to minimal,
archive_mode to off, and max_wal_senders to zero (changing these settings requires a server
restart), and to take a fresh base backup afterwards. Aside from avoiding the time for the archiver
or WAL sender to process the WAL data, doing this
will actually make certain commands faster, because they are designed not to write WAL at all if
wal_level is minimal. (They can guarantee crash safety more cheaply by doing an fsync at the
end than by writing WAL.) This applies to the following commands:
• CREATE INDEX (and variants such as ALTER TABLE ADD PRIMARY KEY)
• CLUSTER
• COPY FROM, when the target table has been created or truncated earlier in the same transaction
By default, pg_dump uses COPY, and when it is generating a complete schema-and-data dump, it is
careful to load data before creating indexes and foreign keys. So in this case several guidelines are
handled automatically. What is left for you to do is to:
• Set appropriate (i.e., larger than normal) values for maintenance_work_mem and max_w-
al_size.
• If using WAL archiving or streaming replication, consider disabling them during the restore. To do
that, set archive_mode to off, wal_level to minimal, and max_wal_senders to zero
before loading the dump. Afterwards, set them back to the right values and take a fresh base backup.
• Experiment with the parallel dump and restore modes of both pg_dump and pg_restore and find the
optimal number of concurrent jobs to use. Dumping and restoring in parallel by means of the -j
option should give you significantly higher performance than the serial mode.
• Consider whether the whole dump should be restored as a single transaction. To do that, pass the
-1 or --single-transaction command-line option to psql or pg_restore. When using this
mode, even the smallest of errors will roll back the entire restore, possibly discarding many hours
of processing. Depending on how interrelated the data is, that might seem preferable to manual
cleanup, or not. COPY commands will run fastest if you use a single transaction and have WAL
archiving turned off.
• If multiple CPUs are available in the database server, consider using pg_restore's --jobs option.
This allows concurrent data loading and index creation.
A data-only dump will still use COPY, but it does not drop or recreate indexes, and it does not normally
touch foreign keys. 1 So when loading a data-only dump, it is up to you to drop and recreate indexes
and foreign keys if you wish to use those techniques. It's still useful to increase max_wal_size
while loading the data, but don't bother increasing maintenance_work_mem; rather, you'd do that
while manually recreating indexes and foreign keys afterwards. And don't forget to ANALYZE when
you're done; see Section 24.1.3 and Section 24.1.6 for more information.
• Place the database cluster's data directory in a memory-backed file system (i.e., RAM disk). This
eliminates all database disk I/O, but limits data storage to the amount of available memory (and
perhaps swap).
• Turn off synchronous_commit; there might be no need to force WAL writes to disk on every com-
mit. This setting does risk transaction loss (though not data corruption) in case of a crash of the
database.
• Turn off full_page_writes; there is no need to guard against partial page writes.
• Increase max_wal_size and checkpoint_timeout; this reduces the frequency of checkpoints, but in-
creases the storage requirements of /pg_wal.
• Create unlogged tables to avoid WAL writes, though it makes the tables non-crash-safe.
1
You can get the effect of disabling foreign keys by using the --disable-triggers option — but realize that that eliminates, rather than
just postpones, foreign key validation, and so it is possible to insert bad data if you use it.
Chapter 15. Parallel Query
PostgreSQL can devise query plans that can leverage multiple CPUs in order to answer queries faster.
This feature is known as parallel query. Many queries cannot benefit from parallel query, either due
to limitations of the current implementation or because there is no imaginable query plan that is any
faster than the serial query plan. However, for queries that can benefit, the speedup from parallel query
is often very significant. Many queries can run more than twice as fast when using parallel query, and
some queries can run four times faster or even more. Queries that touch a large amount of data but
return only a few rows to the user will typically benefit most. This chapter explains some details of
how parallel query works and in which situations it can be used so that users who wish to make use
of it can understand what to expect.
When the optimizer determines that parallel query is the fastest strategy for a particular query, it
creates a plan containing a Gather or Gather Merge node, as in this example (the EXPLAIN statement
shown is a sketch consistent with the plan output below):
EXPLAIN SELECT * FROM pgbench_accounts WHERE filler LIKE '%x%';
                                     QUERY PLAN
-------------------------------------------------------------------------------------
 Gather  (cost=1000.00..217018.43 rows=1 width=97)
   Workers Planned: 2
   ->  Parallel Seq Scan on pgbench_accounts  (cost=0.00..216018.33 rows=1 width=97)
         Filter: (filler ~~ '%x%'::text)
(4 rows)
In all cases, the Gather or Gather Merge node will have exactly one child plan, which is the
portion of the plan that will be executed in parallel. If the Gather or Gather Merge node is at
the very top of the plan tree, then the entire query will execute in parallel. If it is somewhere else in
the plan tree, then only the portion of the plan below it will run in parallel. In the example above, the
query accesses only one table, so there is only one plan node other than the Gather node itself; since
that plan node is a child of the Gather node, it will run in parallel.
Using EXPLAIN, you can see the number of workers chosen by the planner. When the Gather node
is reached during query execution, the process that is implementing the user's session will request a
number of background worker processes equal to the number of workers chosen by the planner. The
number of background workers that the planner will consider using is limited to at most max_paral-
lel_workers_per_gather. The total number of background workers that can exist at any one time is
limited by both max_worker_processes and max_parallel_workers. Therefore, it is possible for a par-
allel query to run with fewer workers than planned, or even with no workers at all. The optimal plan
may depend on the number of workers that are available, so this can result in poor query performance.
If this occurrence is frequent, consider increasing max_worker_processes and max_paral-
lel_workers so that more workers can be run simultaneously or alternatively reducing max_par-
allel_workers_per_gather so that the planner requests fewer workers.
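A sketch of adjusting these settings (the values shown are purely illustrative, not recommendations):
ALTER SYSTEM SET max_worker_processes = 16;    -- takes effect only after a server restart
ALTER SYSTEM SET max_parallel_workers = 16;
SET max_parallel_workers_per_gather = 2;       -- can also be set per session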
Every background worker process that is successfully started for a given parallel query will execute
the parallel portion of the plan. The leader will also execute that portion of the plan, but it has an
additional responsibility: it must also read all of the tuples generated by the workers. When the parallel
portion of the plan generates only a small number of tuples, the leader will often behave very much
like an additional worker, speeding up query execution. Conversely, when the parallel portion of the
plan generates a large number of tuples, the leader may be almost entirely occupied with reading the
tuples generated by the workers and performing any further processing steps that are required by plan
nodes above the level of the Gather node or Gather Merge node. In such cases, the leader will
do very little of the work of executing the parallel portion of the plan.
When the node at the top of the parallel portion of the plan is Gather Merge rather than Gather,
it indicates that each process executing the parallel portion of the plan is producing tuples in sorted
order, and that the leader is performing an order-preserving merge. In contrast, Gather reads tuples
from the workers in whatever order is convenient, destroying any sort order that may have existed.
For any parallel query plans to be generated at all, the following settings must be configured as indicated:
• max_parallel_workers_per_gather must be set to a value that is greater than zero. This is a special
case of the more general principle that no more workers should be used than the number configured
via max_parallel_workers_per_gather.
• dynamic_shared_memory_type must be set to a value other than none. Parallel query requires
dynamic shared memory in order to pass data between cooperating processes.
In addition, the system must not be running in single-user mode. Since the entire database system
runs as a single process in this situation, no background workers will be available.
Even when it is in general possible for parallel query plans to be generated, the planner will not generate
them for a given query if any of the following are true:
• The query writes any data or locks any database rows. If a query contains a data-modifying oper-
ation either at the top level or within a CTE, no parallel plans for that query will be generated.
As an exception, the commands CREATE TABLE ... AS, SELECT INTO, and CREATE
MATERIALIZED VIEW that create a new table and populate it can use a parallel plan.
• The query might be suspended during execution. In any situation in which the system thinks that
partial or incremental execution might occur, no parallel plan is generated. For example, a cursor
created using DECLARE CURSOR will never use a parallel plan. Similarly, a PL/pgSQL loop of
the form FOR x IN query LOOP .. END LOOP will never use a parallel plan, because the
parallel query system is unable to verify that the code in the loop is safe to execute while parallel
query is active.
• The query uses any function marked PARALLEL UNSAFE. Most system-defined functions are
PARALLEL SAFE, but user-defined functions are marked PARALLEL UNSAFE by default. See
the discussion of parallel safety in Section 15.4.
• The query is running inside of another query that is already parallel. For example, if a function
called by a parallel query issues an SQL query itself, that query will never use a parallel plan. This
is a limitation of the current implementation, but it may not be desirable to remove this limitation,
since it could result in a single query using a very large number of processes.
• The transaction isolation level is serializable. This is a limitation of the current implementation.
Even when a parallel query plan is generated for a particular query, there are several circumstances
under which it will be impossible to execute that plan in parallel at execution time. If this occurs, the
leader will execute the portion of the plan below the Gather node entirely by itself, almost as if the
Gather node were not present. This will happen if any of the following conditions are met:
• No background workers can be obtained because of the limitation that the total number of back-
ground workers cannot exceed max_worker_processes.
• No background workers can be obtained because of the limitation that the total number of back-
ground workers launched for purposes of parallel query cannot exceed max_parallel_workers.
• The client sends an Execute message with a non-zero fetch count. See the discussion of the extended
query protocol. Since libpq currently provides no way to send such a message, this can only occur
when using a client that does not rely on libpq. If this is a frequent occurrence, it may be a good
idea to set max_parallel_workers_per_gather to zero in sessions where it is likely, so as to avoid
generating query plans that may be suboptimal when run serially.
• The transaction isolation level is serializable. This situation does not normally arise, because parallel
query plans are not generated when the transaction isolation level is serializable. However, it can
happen if the transaction isolation level is changed to serializable after the plan is generated and
before it is executed.
• In a parallel sequential scan, the table's blocks will be divided among the cooperating processes.
Blocks are handed out one at a time, so that access to the table remains sequential.
• In a parallel bitmap heap scan, one process is chosen as the leader. That process performs a scan
of one or more indexes and builds a bitmap indicating which table blocks need to be visited. These
blocks are then divided among the cooperating processes as in a parallel sequential scan. In other
words, the heap scan is performed in parallel, but the underlying index scan is not.
• In a parallel index scan or parallel index-only scan, the cooperating processes take turns reading
data from the index. Currently, parallel index scans are supported only for btree indexes. Each
process will claim a single index block and will scan and return all tuples referenced by that block;
other processes can at the same time be returning tuples from a different index block. The results
of a parallel btree scan are returned in sorted order within each worker process.
Other scan types, such as scans of non-btree indexes, may support parallel scans in the future.
• In a nested loop join, the inner side is always non-parallel. Although it is executed in full, this is
efficient if the inner side is an index scan, because the outer tuples and thus the loops that look up
values in the index are divided over the cooperating processes.
• In a merge join, the inner side is always a non-parallel plan and therefore executed in full. This
may be inefficient, especially if a sort must be performed, because the work and resulting data are
duplicated in every cooperating process.
• In a hash join (without the "parallel" prefix), the inner side is executed in full by every cooperating
process to build identical copies of the hash table. This may be inefficient if the hash table is large
or the plan is expensive. In a parallel hash join, the inner side is a parallel hash that divides the
work of building a shared hash table over the cooperating processes.
PostgreSQL supports parallel aggregation in two stages: each participating process first performs a
Partial Aggregate step over the rows it sees, and the leader then combines those partial results in a
Finalize Aggregate node. Because the Finalize Aggregate node runs on the leader process, queries that produce a rela-
tively large number of groups in comparison to the number of input rows will appear less favorable
to the query planner. For example, in the worst-case scenario the number of groups seen by the Fi-
nalize Aggregate node could be as many as the number of input rows that were seen by all
worker processes in the Partial Aggregate stage. For such cases, there is clearly going to be
no performance benefit to using parallel aggregation. The query planner takes this into account during
the planning process and is unlikely to choose parallel aggregate in this scenario.
Parallel aggregation is not supported in all situations. Each aggregate must be safe for parallelism and
must have a combine function. If the aggregate has a transition state of type internal, it must have
serialization and deserialization functions. See CREATE AGGREGATE for more details. Parallel
aggregation is not supported if any aggregate function call contains a DISTINCT or ORDER BY clause,
nor for ordered-set aggregates, nor when the query involves GROUPING SETS. It
can only be used when all joins involved in the query are also part of the parallel portion of the plan.
When an Append node is used in a parallel plan, each process will execute the child plans in the order
in which they appear, so that all participating processes cooperate to execute the first child plan until it
is complete and then move to the second plan at around the same time. When a Parallel Append
is used instead, the executor will instead spread out the participating processes as evenly as possible
across its child plans, so that multiple child plans are executed simultaneously. This avoids contention,
and also avoids paying the startup cost of a child plan in those processes that never execute it.
Also, unlike a regular Append node, which can only have partial children when used within a paral-
lel plan, a Parallel Append node can have both partial and non-partial child plans. Non-partial
children will be scanned by only a single process, since scanning them more than once would pro-
duce duplicate results. Plans that involve appending multiple results sets can therefore achieve coarse-
grained parallelism even when efficient partial plans are not available. For example, consider a query
against a partitioned table that can only be implemented efficiently by using an index that does not
support parallel scans. The planner might choose a Parallel Append of regular Index Scan
plans; each individual index scan would have to be executed to completion by a single process, but
different scans could be performed at the same time by different processes.
If a query that is expected to benefit from parallelism does not produce a parallel plan even with very
small values of the relevant cost settings, parallel_setup_cost and parallel_tuple_cost (e.g., after setting
them both to zero), there may be some reason why the query planner is unable to generate a parallel
plan for your query. See Section 15.2 and Section 15.4 for information on why this may be the case.
When executing a parallel plan, you can use EXPLAIN (ANALYZE, VERBOSE) to display per-
worker statistics for each plan node. This may be useful in determining whether the work is being
evenly distributed between all plan nodes and more generally in understanding the performance char-
acteristics of the plan.
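A sketch of such an invocation, reusing the pgbench_accounts example from earlier in this chapter:
EXPLAIN (ANALYZE, VERBOSE)
    SELECT count(*) FROM pgbench_accounts WHERE filler LIKE '%x%';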
Certain operations are always parallel restricted; these include:
• Scans of foreign tables, unless the foreign data wrapper has an IsForeignScanParallelSafe
API that indicates otherwise.
Functions and aggregates must be marked PARALLEL UNSAFE if they write to the database, access
sequences, change the transaction state even temporarily (e.g., a PL/pgSQL function that establishes
an EXCEPTION block to catch errors), or make persistent changes to settings. Similarly, functions
must be marked PARALLEL RESTRICTED if they access temporary tables, client connection state,
cursors, prepared statements, or miscellaneous backend-local state that the system cannot synchronize
across workers. For example, setseed and random are parallel restricted for this last reason.
If a function executed within a parallel worker acquires locks that are not held by the leader, for
example by querying a table not referenced in the query, those locks will be released at worker exit, not
end of transaction. If you write a function that does this, and this behavior difference is important to
you, mark such functions as PARALLEL RESTRICTED to ensure that they execute only in the leader.
Note that the query planner does not consider deferring the evaluation of parallel-restricted functions or
aggregates involved in the query in order to obtain a superior plan. So, for example, if a WHERE clause
applied to a particular table is parallel restricted, the query planner will not consider performing a scan
of that table in the parallel portion of a plan. In some cases, it would be possible (and perhaps even
efficient) to include the scan of that table in the parallel portion of the query and defer the evaluation of
the WHERE clause so that it happens above the Gather node. However, the planner does not do this.
Part III. Server Administration
This part covers topics that are of interest to a PostgreSQL database administrator. This includes installation of
the software, setup and configuration of the server, management of users and databases, and maintenance tasks.
Anyone who runs a PostgreSQL server, even for personal use, but especially in production, should be familiar
with the topics covered in this part.
The information in this part is arranged approximately in the order in which a new user should read it. But the
chapters are self-contained and can be read individually as desired. The information in this part is presented in
a narrative fashion in topical units. Readers looking for a complete description of a particular command should
see Part VI.
The first few chapters are written so they can be understood without prerequisite knowledge, so new users who
need to set up their own server can begin their exploration with this part. The rest of this part is about tuning and
management; that material assumes that the reader is familiar with the general use of the PostgreSQL database
system. Readers are encouraged to look at Part I and Part II for additional information.
Table of Contents
16. Installation from Source Code ................................................................................. 482
16.1. Short Version ............................................................................................ 482
16.2. Requirements ............................................................................................. 482
16.3. Getting The Source .................................................................................... 484
16.4. Installation Procedure ................................................................................. 484
16.5. Post-Installation Setup ................................................................................. 499
16.5.1. Shared Libraries .............................................................................. 499
16.5.2. Environment Variables ..................................................................... 500
16.6. Supported Platforms ................................................................................... 500
16.7. Platform-specific Notes ............................................................................... 501
16.7.1. AIX ............................................................................................... 501
16.7.2. Cygwin .......................................................................................... 504
16.7.3. HP-UX .......................................................................................... 504
16.7.4. macOS ........................................................................................... 505
16.7.5. MinGW/Native Windows .................................................................. 506
16.7.6. Solaris ........................................................................................... 506
17. Installation from Source Code on Windows ............................................................... 509
17.1. Building with Visual C++ or the Microsoft Windows SDK ................................ 509
17.1.1. Requirements .................................................................................. 510
17.1.2. Special Considerations for 64-bit Windows .......................................... 511
17.1.3. Building ......................................................................................... 512
17.1.4. Cleaning and Installing ..................................................................... 512
17.1.5. Running the Regression Tests ............................................................ 513
17.1.6. Building the Documentation .............................................................. 513
18. Server Setup and Operation .................................................................................... 515
18.1. The PostgreSQL User Account ..................................................................... 515
18.2. Creating a Database Cluster ......................................................................... 515
18.2.1. Use of Secondary File Systems .......................................................... 516
18.2.2. Use of Network File Systems ............................................................. 517
18.3. Starting the Database Server ........................................................................ 517
18.3.1. Server Start-up Failures .................................................................... 519
18.3.2. Client Connection Problems .............................................................. 520
18.4. Managing Kernel Resources ......................................................................... 520
18.4.1. Shared Memory and Semaphores ........................................................ 520
18.4.2. systemd RemoveIPC ........................................................................ 526
18.4.3. Resource Limits .............................................................................. 526
18.4.4. Linux Memory Overcommit .............................................................. 527
18.4.5. Linux Huge Pages ........................................................................... 529
18.5. Shutting Down the Server ............................................................................ 529
18.6. Upgrading a PostgreSQL Cluster .................................................................. 530
18.6.1. Upgrading Data via pg_dumpall ......................................................... 531
18.6.2. Upgrading Data via pg_upgrade ......................................................... 533
18.6.3. Upgrading Data via Replication .......................................................... 533
18.7. Preventing Server Spoofing .......................................................................... 533
18.8. Encryption Options ..................................................................................... 533
18.9. Secure TCP/IP Connections with SSL ............................................................ 534
18.9.1. Basic Setup .................................................................................... 535
18.9.2. OpenSSL Configuration .................................................................... 535
18.9.3. Using Client Certificates ................................................................... 536
18.9.4. SSL Server File Usage ..................................................................... 536
18.9.5. Creating Certificates ......................................................................... 536
18.10. Secure TCP/IP Connections with SSH Tunnels .............................................. 538
18.11. Registering Event Log on Windows ............................................................. 539
19. Server Configuration ............................................................................................. 540
19.1. Setting Parameters ...................................................................................... 540
Chapter 16. Installation from Source
Code
This chapter describes the installation of PostgreSQL using the source code distribution. (If you are
installing a pre-packaged distribution, such as an RPM or Debian package, ignore this chapter and
read the packager's instructions instead.)
16.2. Requirements
In general, a modern Unix-compatible platform should be able to run PostgreSQL. The platforms
that had received specific testing at the time of release are listed in Section 16.6 below. In the doc
subdirectory of the distribution there are several platform-specific FAQ documents you might wish
to consult if you are having trouble.
The following software packages are required for building PostgreSQL:
• GNU make version 3.80 or newer is required; other make programs or older GNU make versions
will not work. (GNU make is sometimes installed under the name gmake.) To test for GNU make
enter:
make --version
• You need an ISO/ANSI C compiler (at least C89-compliant). Recent versions of GCC are recom-
mended, but PostgreSQL is known to build using a wide variety of compilers from different vendors.
• tar is required to unpack the source distribution, in addition to either gzip or bzip2.
• The GNU Readline library is used by default. It allows psql (the PostgreSQL command line SQL
interpreter) to remember each command you type, and allows you to use arrow keys to recall and edit
previous commands. This is very helpful and is strongly recommended. If you don't want to use it
then you must specify the --without-readline option to configure. As an alternative, you
can often use the BSD-licensed libedit library, originally developed on NetBSD. The libedit
library is GNU Readline-compatible and is used if libreadline is not found, or if --with-
libedit-preferred is used as an option to configure. If you are using a package-based
Linux distribution, be aware that you need both the readline and readline-devel packages,
if those are separate in your distribution.
• The zlib compression library is used by default. If you don't want to use it then you must specify
the --without-zlib option to configure. Using this option disables support for compressed
archives in pg_dump and pg_restore.
The following packages are optional. They are not required in the default configuration, but they are
needed when certain build options are enabled, as explained below:
• To build the server programming language PL/Perl you need a full Perl installation, including the
libperl library and the header files. The minimum required version is Perl 5.8.3. Since PL/Perl
will be a shared library, the libperl library must be a shared library also on most platforms.
This appears to be the default in recent Perl versions, but it was not in earlier versions, and in any
case it is the choice of whomever installed Perl at your site. configure will fail if building PL/
Perl is selected but it cannot find a shared libperl. In that case, you will have to rebuild and
install Perl manually to be able to build PL/Perl. During the configuration process for Perl, request
a shared library.
If you intend to make more than incidental use of PL/Perl, you should ensure that the Perl installation
was built with the usemultiplicity option enabled (perl -V will show whether this is the
case).
• To build the PL/Python server programming language, you need a Python installation with the
header files and the sysconfig module. The minimum required version is Python 2.7. Python 3 is
supported if it's version 3.2 or later; but see Section 46.1 when using Python 3.
Since PL/Python will be a shared library, the libpython library must be a shared library also on
most platforms. This is not the case in a default Python installation built from source, but a shared
library is available in many operating system distributions. configure will fail if building PL/
Python is selected but it cannot find a shared libpython. That might mean that you either have
to install additional packages or rebuild (part of) your Python installation to provide this shared
library. When building from source, run Python's configure with the --enable-shared flag.
• To build the PL/Tcl procedural language, you of course need a Tcl installation. The minimum re-
quired version is Tcl 8.4.
• To enable Native Language Support (NLS), that is, the ability to display a program's messages in a
language other than English, you need an implementation of the Gettext API. Some operating sys-
tems have this built-in (e.g., Linux, NetBSD, Solaris), for other systems you can download an add-
on package from http://www.gnu.org/software/gettext/. If you are using the Gettext implementation
in the GNU C library then you will additionally need the GNU Gettext package for some utility
programs. For any of the other implementations you will not need it.
• You need OpenSSL, if you want to support encrypted client connections. The minimum required
version is 0.9.8.
• You need Kerberos, OpenLDAP, and/or PAM, if you want to support authentication using those
services.
• To build the PostgreSQL documentation, there is a separate set of requirements; see Section J.2.
If you are building from a Git tree instead of using a released source package, or if you want to do
server development, you also need the following packages:
• Flex and Bison are needed to build from a Git checkout, or if you changed the actual scanner
and parser definition files. If you need them, be sure to get Flex 2.5.31 or later and Bison 1.875 or
later. Other lex and yacc programs cannot be used.
• Perl 5.8.3 or later is needed to build from a Git checkout, or if you changed the input files for any
of the build steps that use Perl scripts. If building on Windows you will need Perl in any case. Perl
is also required to run some test suites.
If you need to get a GNU package, you can find it at your local GNU mirror site (see https://
www.gnu.org/prep/ftp for a list) or at ftp://ftp.gnu.org/gnu/.
Also check that you have sufficient disk space. You will need about 100 MB for the source tree during
compilation and about 20 MB for the installation directory. An empty database cluster takes about 35
MB; databases take about five times the amount of space that a flat text file with the same data would
take. If you are going to run the regression tests you will temporarily need up to an extra 150 MB.
Use the df command to check free disk space.
After you have obtained the source archive, unpack it:
gunzip postgresql-11.17.tar.gz
tar xf postgresql-11.17.tar
(Use bunzip2 instead of gunzip if you have the .bz2 file.) This will create a directory post-
gresql-11.17 under the current directory with the PostgreSQL sources. Change into that directory
for the rest of the installation procedure.
You can also get the source directly from the version control repository, see Appendix I.
The first step of the installation procedure is to configure the source tree for your system and
choose the options you would like. This is done by running the configure script. For a default
installation simply enter:
./configure
This script will run a number of tests to determine values for various system dependent variables
and detect any quirks of your operating system, and finally will create several files in the build
tree to record what it found. You can also run configure in a directory outside the source tree,
if you want to keep the build directory separate. This procedure is also called a VPATH build.
Here's how:
mkdir build_dir
cd build_dir
/path/to/source/tree/configure [options go here]
make
The default configuration will build the server and utilities, as well as all client applications and
interfaces that require only a C compiler. All files will be installed under /usr/local/pgsql
by default.
You can customize the build and installation process by supplying one or more of the following
command line options to configure:
--prefix=PREFIX
Install all files under the directory PREFIX instead of /usr/local/pgsql. The actual
files will be installed into various subdirectories; no files will ever be installed directly into
the PREFIX directory.
If you have special needs, you can also customize the individual subdirectories with the fol-
lowing options. However, if you leave these with their defaults, the installation will be relo-
catable, meaning you can move the directory after installation. (The man and doc locations
are not affected by this.)
For relocatable installs, you might want to use configure's --disable-rpath option.
Also, you will need to tell the operating system how to find the shared libraries.
--exec-prefix=EXEC-PREFIX
You can install architecture-dependent files under a different prefix, EXEC-PREFIX, than
what PREFIX was set to. This can be useful to share architecture-independent files between
hosts. If you omit this, then EXEC-PREFIX is set equal to PREFIX and both architec-
ture-dependent and independent files will be installed under the same tree, which is probably
what you want.
--bindir=DIRECTORY
Specifies the directory for executable programs. The default is EXEC-PREFIX/bin, which
normally means /usr/local/pgsql/bin.
--sysconfdir=DIRECTORY
Sets the directory for various configuration files. The default is PREFIX/etc.
--libdir=DIRECTORY
Sets the location to install libraries and dynamically loadable modules. The default is EX-
EC-PREFIX/lib.
--includedir=DIRECTORY
Sets the directory for installing C and C++ header files. The default is PREFIX/include.
--datarootdir=DIRECTORY
Sets the root directory for various types of read-only data files. This only sets the default for
some of the following options. The default is PREFIX/share.
--datadir=DIRECTORY
Sets the directory for read-only data files used by the installed programs. The default is
DATAROOTDIR. Note that this has nothing to do with where your database files will be
placed.
--localedir=DIRECTORY
Sets the directory for installing locale data, in particular message translation catalog files.
The default is DATAROOTDIR/locale.
--mandir=DIRECTORY
The man pages that come with PostgreSQL will be installed under this directory, in their
respective manx subdirectories. The default is DATAROOTDIR/man.
--docdir=DIRECTORY
Sets the root directory for installing documentation files, except “man” pages. This only sets
the default for the following options. The default value for this option is DATAROOTDIR/
doc/postgresql.
--htmldir=DIRECTORY
The HTML-formatted documentation for PostgreSQL will be installed under this directory.
The default is DATAROOTDIR.
Note
Care has been taken to make it possible to install PostgreSQL into shared installation
locations (such as /usr/local/include) without interfering with the namespace
of the rest of the system. First, the string “/postgresql” is automatically appended
to datadir, sysconfdir, and docdir, unless the fully expanded directory name
already contains the string “postgres” or “pgsql”. For example, if you choose /usr/local
as prefix, the documentation will be installed in /usr/local/doc/postgresql, but if the
prefix is /opt/postgres, then it will be in /opt/postgres/doc.
--with-extra-version=STRING
Append STRING to the PostgreSQL version number. You can use this, for example, to mark
binaries built from unreleased Git snapshots or containing custom patches with an extra
version string such as a git describe identifier or a distribution package release number.
--with-includes=DIRECTORIES
DIRECTORIES is a colon-separated list of directories that will be added to the list the com-
piler searches for header files. If you have optional packages (such as GNU Readline) in-
stalled in a non-standard location, you have to use this option and probably also the corre-
sponding --with-libraries option.
Example: --with-includes=/opt/gnu/include:/usr/sup/include.
--with-libraries=DIRECTORIES
DIRECTORIES is a colon-separated list of directories to search for libraries. You will prob-
ably have to use this option (and the corresponding --with-includes option) if you
have packages installed in non-standard locations.
Example: --with-libraries=/opt/gnu/lib:/usr/sup/lib.
--enable-nls[=LANGUAGES]
Enables Native Language Support (NLS), that is, the ability to display a program's messages
in a language other than English. LANGUAGES is an optional space-separated list of codes
of the languages that you want supported, for example --enable-nls='de fr'. (The
intersection between your list and the set of actually provided translations will be computed
automatically.) If you do not specify a list, then all available translations are installed.
To use this option, you will need an implementation of the Gettext API; see above.
--with-pgport=NUMBER
Set NUMBER as the default port number for server and clients. The default is 5432. The port
can always be changed later on, but if you specify it here then both server and clients will
have the same default compiled in, which can be very convenient. Usually the only good
reason to select a non-default value is if you intend to run multiple PostgreSQL servers on
the same machine.
--with-perl
Build the PL/Perl server-side language.
--with-python
Build the PL/Python server-side language.
--with-tcl
Build the PL/Tcl server-side language.
--with-tclconfig=DIRECTORY
Tcl installs the file tclConfig.sh, which contains configuration information needed to
build modules interfacing to Tcl. This file is normally found automatically at a well-known
location, but if you want to use a different version of Tcl you can specify the directory in
which to look for it.
--with-gssapi
Build with support for GSSAPI authentication. On many systems, the GSSAPI (usually a
part of the Kerberos installation) system is not installed in a location that is searched by
default (e.g., /usr/include, /usr/lib), so you must use the options --with-in-
cludes and --with-libraries in addition to this option. configure will check for
the required header files and libraries to make sure that your GSSAPI installation is suffi-
cient before proceeding.
--with-krb-srvnam=NAME
The default name of the Kerberos service principal used by GSSAPI. postgres is the
default. There's usually no reason to change this unless you have a Windows environment,
in which case it must be set to upper case POSTGRES.
--with-llvm
Build with support for LLVM based JIT compilation (see Chapter 32). This requires the
LLVM library to be installed. The minimum required version of LLVM is currently 3.9.
llvm-config will be used to find the required compilation options. llvm-config, and
then llvm-config-$major-$minor for all supported versions, will be searched on
PATH. If that would not yield the correct binary, use LLVM_CONFIG to specify a path to
the correct llvm-config, for example by passing LLVM_CONFIG='/path/to/llvm/bin/llvm-config'
on the configure command line.
LLVM support requires a compatible clang compiler (specified, if necessary, using the
CLANG environment variable), and a working C++ compiler (specified, if necessary, using
the CXX environment variable).
--with-icu
Build with support for the ICU library. This requires the ICU4C package to be installed. The
minimum required version of ICU4C is currently 4.2.
By default, pkg-config will be used to find the required compilation options. This is sup-
ported for ICU4C version 4.6 and later. For older versions, or if pkg-config is not available,
the variables ICU_CFLAGS and ICU_LIBS can be specified to configure explicitly.
(If ICU4C is in the default search path for the compiler, then you still need to specify a
nonempty string in order to avoid use of pkg-config, for example, ICU_CFLAGS=' '.)
--with-openssl
Build with support for SSL (encrypted) connections. This requires the OpenSSL package to
be installed. configure will check for the required header files and libraries to make sure
that your OpenSSL installation is sufficient before proceeding.
--with-pam
Build with PAM (Pluggable Authentication Modules) support.
--with-bsd-auth
Build with BSD Authentication support. (The BSD Authentication framework is currently
only available on OpenBSD.)
--with-ldap
Build with LDAP support for authentication and connection parameter lookup (see Sec-
tion 34.17 and Section 20.10 for more information). On Unix, this requires the OpenLDAP
package to be installed. On Windows, the default WinLDAP library is used. configure
will check for the required header files and libraries to make sure that your OpenLDAP in-
stallation is sufficient before proceeding.
--with-systemd
Build with support for systemd service notifications. This improves integration if the server
binary is started under systemd but has no impact otherwise; see Section 18.3 for more in-
formation. libsystemd and the associated header files need to be installed to be able to use
this option.
--without-readline
Prevents use of the Readline library (and libedit as well). This option disables command-line
editing and history in psql, so it is not recommended.
--with-libedit-preferred
Favors the use of the BSD-licensed libedit library rather than GPL-licensed Readline. This
option is significant only if you have both libraries installed; the default in that case is to
use Readline.
--with-bonjour
Build with Bonjour support. This requires Bonjour support in your operating system. Rec-
ommended on macOS.
--with-uuid=LIBRARY
Build the uuid-ossp module (which provides functions to generate UUIDs), using the spec-
ified UUID library. LIBRARY must be one of:
• bsd to use the UUID functions found in FreeBSD, NetBSD, and some other BSD-derived
systems
• e2fs to use the UUID library created by the e2fsprogs project; this library is present
in most Linux systems and in macOS, and can be obtained for other platforms as well
• ossp to use the OSSP UUID library 1
--with-ossp-uuid
Obsolete equivalent of --with-uuid=ossp.
--with-libxml
Build with libxml2, enabling SQL/XML support. Libxml2 version 2.6.23 or later is required
for this feature.
To detect the required compiler and linker options, PostgreSQL will query pkg-config,
if that is installed and knows about libxml2. Otherwise the program xml2-config, which
is installed by libxml2, will be used if it is found. Use of pkg-config is preferred, because
it can deal with multi-architecture installations better.
To use a libxml2 installation that is in an unusual location, you can set pkg-config-related
environment variables (see its documentation), or set the environment variable XML2_CON-
FIG to point to the xml2-config program belonging to the libxml2 installation, or set the
variables XML2_CFLAGS and XML2_LIBS. (If pkg-config is installed, then to override
1
http://www.ossp.org/pkg/lib/uuid/
its idea of where libxml2 is you must either set XML2_CONFIG or set both XML2_CFLAGS
and XML2_LIBS to nonempty strings.)
--with-libxslt
Use libxslt when building the xml2 module. xml2 relies on this library to perform XSL
transformations of XML.
--disable-float4-byval
Disable passing float4 values “by value”, causing them to be passed “by reference” instead.
This option costs performance, but may be needed for compatibility with old user-defined
functions that are written in C and use the “version 0” calling convention. A better long-term
solution is to update any such functions to use the “version 1” calling convention.
--disable-float8-byval
Disable passing float8 values “by value”, causing them to be passed “by reference” instead.
This option costs performance, but may be needed for compatibility with old user-defined
functions that are written in C and use the “version 0” calling convention. A better long-term
solution is to update any such functions to use the “version 1” calling convention. Note that
this option affects not only float8, but also int8 and some related types such as timestamp.
On 32-bit platforms, --disable-float8-byval is the default and it is not allowed to
select --enable-float8-byval.
--with-segsize=SEGSIZE
Set the segment size, in gigabytes. Large tables are divided into multiple operating-system
files, each of size equal to the segment size. This avoids problems with file size limits that
exist on many platforms. The default segment size, 1 gigabyte, is safe on all supported plat-
forms. If your operating system has “largefile” support (which most do, nowadays), you
can use a larger segment size. This can be helpful to reduce the number of file descriptors
consumed when working with very large tables. But be careful not to select a value larger
than is supported by your platform and the file systems you intend to use. Other tools you
might wish to use, such as tar, could also set limits on the usable file size. It is recommended,
though not absolutely required, that this value be a power of 2. Note that changing this value
requires an initdb.
--with-blocksize=BLOCKSIZE
Set the block size, in kilobytes. This is the unit of storage and I/O within tables. The default,
8 kilobytes, is suitable for most situations; but other values may be useful in special cases.
The value must be a power of 2 between 1 and 32 (kilobytes). Note that changing this value
requires an initdb.
--with-wal-blocksize=BLOCKSIZE
Set the WAL block size, in kilobytes. This is the unit of storage and I/O within the WAL
log. The default, 8 kilobytes, is suitable for most situations; but other values may be useful
in special cases. The value must be a power of 2 between 1 and 64 (kilobytes). Note that
changing this value requires an initdb.
--disable-spinlocks
Allow the build to succeed even if PostgreSQL has no CPU spinlock support for the platform.
The lack of spinlock support will result in poor performance; therefore, this option should
only be used if the build aborts and informs you that the platform lacks spinlock support.
If this option is required to build PostgreSQL on your platform, please report the problem
to the PostgreSQL developers.
--disable-strong-random
Allow the build to succeed even if PostgreSQL has no support for strong random numbers
on the platform. A source of random numbers is needed for some authentication protocols,
as well as some routines in the pgcrypto module. --disable-strong-random disables
functionality that requires cryptographically strong random numbers, and substitutes a weak
pseudo-random-number-generator for the generation of authentication salt values and query
cancel keys. It may make authentication less secure.
--disable-thread-safety
Disable the thread-safety of client libraries. This prevents concurrent threads in libpq and
ECPG programs from safely controlling their private connection handles.
--with-system-tzdata=DIRECTORY
PostgreSQL includes its own time zone database, which it requires for date and time op-
erations. This time zone database is in fact compatible with the IANA time zone database
provided by many operating systems such as FreeBSD, Linux, and Solaris, so it would be
redundant to install it again. When this option is used, the system-supplied time zone data-
base in DIRECTORY is used instead of the one included in the PostgreSQL source distribu-
tion. DIRECTORY must be specified as an absolute path. /usr/share/zoneinfo is a
likely directory on some operating systems. Note that the installation routine will not detect
mismatching or erroneous time zone data. If you use this option, you are advised to run the
regression tests to verify that the time zone data you have pointed to works correctly with
PostgreSQL.
This option is mainly aimed at binary package distributors who know their target operating
system well. The main advantage of using this option is that the PostgreSQL package won't
need to be upgraded whenever any of the many local daylight-saving time rules change.
Another advantage is that PostgreSQL can be cross-compiled more straightforwardly if the
time zone database files do not need to be built during the installation.
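For example, on most Linux systems the operating system's own zoneinfo tree can be used:
./configure --with-system-tzdata=/usr/share/zoneinfo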
--without-zlib
Prevents use of the Zlib library. This disables support for compressed archives in pg_dump
and pg_restore. This option is only intended for those rare systems where this library is not
available.
--enable-debug
Compiles all programs and libraries with debugging symbols. This means that you can run
the programs in a debugger to analyze problems. This enlarges the size of the installed ex-
ecutables considerably, and on non-GCC compilers it usually also disables compiler opti-
mization, causing slowdowns. However, having the symbols available is extremely helpful
for dealing with any problems that might arise. Currently, this option is recommended for
production installations only if you use GCC. But you should always have it on if you are
doing development work or running a beta version.
--enable-coverage
If using GCC, all programs and libraries are compiled with code coverage testing instrumen-
tation. When run, they generate files in the build directory with code coverage metrics. See
Section 33.5 for more information. This option is for use only with GCC and when doing
development work.
--enable-profiling
If using GCC, all programs and libraries are compiled so they can be profiled. On backend
exit, a subdirectory will be created that contains the gmon.out file for use in profiling. This
option is for use only with GCC and when doing development work.
--enable-cassert
Enables assertion checks in the server, which test for many “cannot happen” conditions.
This is invaluable for code development purposes, but the tests can slow down the server
significantly. Also, having the tests turned on won't necessarily enhance the stability of your
server! The assertion checks are not categorized for severity, and so what might be a rela-
tively harmless bug will still lead to server restarts if it triggers an assertion failure. This
option is not recommended for production use, but you should have it on for development
work or when running a beta version.
--enable-depend
Enables automatic dependency tracking. With this option, the makefiles are set up so that
all affected object files will be rebuilt when any header file is changed. This is useful if you
are doing development work, but is just wasted overhead if you intend only to compile once
and install. At present, this option only works with GCC.
--enable-dtrace
Compiles PostgreSQL with support for the dynamic tracing tool DTrace. See Section 28.5
for more information.
To point to the dtrace program, the environment variable DTRACE can be set. This will
often be necessary because dtrace is typically installed under /usr/sbin, which might
not be in the path.
Extra command-line options for the dtrace program can be specified in the environment
variable DTRACEFLAGS. On Solaris, to include DTrace support in a 64-bit binary, you must
specify DTRACEFLAGS="-64" to configure. For example, using the GCC compiler:
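./configure CC='gcc -m64' --enable-dtrace DTRACEFLAGS='-64' ...
(The -m64 and -64 flags shown assume a 64-bit target; adjust them to your environment.)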
--enable-tap-tests
Enable tests using the Perl TAP tools. This requires a Perl installation and the Perl module
IPC::Run. See Section 33.4 for more information.
If you prefer a C compiler different from the one configure picks, you can set the environment
variable CC to the program of your choice. By default, configure will pick gcc if available,
else the platform's default (usually cc). Similarly, you can override the default compiler flags if
needed with the CFLAGS variable.
You can specify environment variables on the configure command line, for example:
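./configure CC=/opt/bin/gcc CFLAGS='-O2 -pipe'
(The compiler path and flags shown here are only illustrative.)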
Here is a list of the significant variables that can be set in this manner:
BISON
Bison program
CC
C compiler
CFLAGS
options to pass to the C compiler
CLANG
path to clang program used to process source code for inlining when compiling with --
with-llvm
CPP
C preprocessor
CPPFLAGS
options to pass to the C preprocessor
CXX
C++ compiler
CXXFLAGS
options to pass to the C++ compiler
DTRACE
location of the dtrace program
DTRACEFLAGS
options to pass to the dtrace program
FLEX
Flex program
LDFLAGS
options to use when linking either executables or shared libraries
LDFLAGS_EX
additional options for linking executables only
LDFLAGS_SL
additional options for linking shared libraries only
LLVM_CONFIG
llvm-config program used to locate the LLVM installation
MSGFMT
msgfmt program for native language support
PERL
Perl interpreter program. This will be used to determine the dependencies for building PL/
Perl. The default is perl.
PYTHON
Python interpreter program. This will be used to determine the dependencies for building
PL/Python. Also, whether Python 2 or 3 is specified here (or otherwise implicitly chosen)
determines which variant of the PL/Python language becomes available. See Section 46.1
for more information. If this is not set, the following are probed in this order: python
python3 python2.
TCLSH
Tcl interpreter program. This will be used to determine the dependencies for building PL/
Tcl, and it will be substituted into Tcl scripts.
XML2_CONFIG
xml2-config program used to locate the libxml2 installation
Sometimes it is useful to add compiler flags after-the-fact to the set that were chosen by con-
figure. An important example is that gcc's -Werror option cannot be included in the CFLAGS
passed to configure, because it will break many of configure's built-in tests. To add such
flags, include them in the COPT environment variable while running make. The contents of COPT
are added to both the CFLAGS and LDFLAGS options set up by configure. For example, you
could do
make COPT='-Werror'
or
export COPT='-Werror'
make
Note
When developing code inside the server, it is recommended to use the configure op-
tions --enable-cassert (which turns on many run-time error checks) and --enable-debug
(which improves the usefulness of debugging tools).
If using GCC, it is best to build with an optimization level of at least -O1, because using
no optimization (-O0) disables some important compiler warnings (such as the use of
uninitialized variables). However, non-zero optimization levels can complicate debug-
ging because stepping through compiled code will usually not match up one-to-one with
source code lines. If you get confused while trying to debug optimized code, recompile
the specific files of interest with -O0. An easy way to do this is by passing an option to
make: make PROFILE=-O0 file.o.
The COPT and PROFILE environment variables are actually handled identically by the
PostgreSQL makefiles. Which to use is a matter of preference, but a common habit among
developers is to use PROFILE for one-time flag adjustments, while COPT might be kept
set all the time.
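Putting these recommendations together, a development build might be configured along these
lines (the exact flags are a matter of taste, not a requirement):
./configure --enable-cassert --enable-debug CFLAGS=-O1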
2. Build
To start the build, type either of:
make
make all
(Remember to use GNU make.) The build will take a few minutes depending on your hardware.
The last line displayed should be:
All of PostgreSQL successfully made. Ready to install.
If you want to build everything that can be built, including the documentation (HTML and man
pages), and the additional modules (contrib), type instead:
make world
If you want to build everything that can be built, including the additional modules (contrib),
but without the documentation, type instead:
make world-bin
If you want to invoke the build from another makefile rather than manually, you must unset
MAKELEVEL or set it to zero, for instance like this:
build-postgresql:
$(MAKE) -C postgresql MAKELEVEL=0 all
Failure to do that can lead to strange error messages, typically about missing header files.
3. Regression Tests
If you want to test the newly built server before you install it, you can run the regression tests at
this point. The regression tests are a test suite to verify that PostgreSQL runs on your machine
in the way the developers expected it to. Type:
make check
(This won't work as root; do it as an unprivileged user.) See Chapter 33 for detailed information
about interpreting the test results. You can repeat this test at any later time by issuing the same
command.
4. Installing the Files
Note
If you are upgrading an existing system be sure to read Section 18.6, which has instruc-
tions about upgrading a cluster.
To install PostgreSQL enter:
make install
This will install files into the directories that were specified in Step 1. Make sure that you have
appropriate permissions to write into that area. Normally you need to do this step as root. Alter-
natively, you can create the target directories in advance and arrange for appropriate permissions
to be granted.
To install the documentation (HTML and man pages), enter:
make install-docs
If you built the world above, type instead:
make install-world
(This also installs the documentation.)
If you built the world without the documentation above, type instead:
make install-world-bin
You can use make install-strip instead of make install to strip the executable files
and libraries as they are installed. This will save some space. If you built with debugging support,
stripping will effectively remove the debugging support, so it should only be done if debugging
is no longer needed. install-strip tries to do a reasonable job saving space, but it does not
have perfect knowledge of how to strip every unneeded byte from an executable file, so if you
want to save all the disk space you possibly can, you will have to do manual work.
The standard installation provides all the header files needed for client application development
as well as for server-side program development, such as custom functions or data types written
in C. (Prior to PostgreSQL 8.0, a separate make install-all-headers command was
needed for the latter, but this step has been folded into the standard install.)
Client-only installation: If you want to install only the client applications and interface li-
braries, then you can use these commands:
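make -C src/bin install
make -C src/include install
make -C src/interfaces install
make -C doc install
(Run these from the top of the source tree.)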
src/bin has a few binaries for server-only use, but they are small.
Uninstallation: To undo the installation use the command make uninstall. However, this
will not remove any created directories.
Cleaning: After the installation you can free disk space by removing the built files from the source
tree with the command make clean. This will preserve the files made by the configure program,
so that you can rebuild everything with make later on. To reset the source tree to the state in which
it was distributed, use make distclean. If you are going to build for several platforms within the
same source tree you must do this and re-configure for each platform. (Alternatively, use a separate
build tree for each platform, so that the source tree remains unmodified.)
If you perform a build and then discover that your configure options were wrong, or if you change
anything that configure investigates (for example, software upgrades), then it's a good idea to do
make distclean before reconfiguring and rebuilding. Without this, your changes in configuration
choices might not propagate everywhere they need to.
The method to set the shared library search path varies between platforms, but the most widely-used
method is to set the environment variable LD_LIBRARY_PATH like so: In Bourne shells (sh, ksh,
bash, zsh):
LD_LIBRARY_PATH=/usr/local/pgsql/lib
export LD_LIBRARY_PATH
or in csh or tcsh:
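setenv LD_LIBRARY_PATH /usr/local/pgsql/lib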
Replace /usr/local/pgsql/lib with whatever you set --libdir to in Step 1. You should
put these commands into a shell start-up file such as /etc/profile or ~/.bash_profile.
Some good information about the caveats associated with this method can be found at http://xahlee.in-
fo/UnixResource_dir/_/ldpath.html.
On some systems it might be preferable to set the environment variable LD_RUN_PATH before build-
ing.
On Cygwin, put the library directory in the PATH or move the .dll files into the bin directory.
If in doubt, refer to the manual pages of your system (perhaps ld.so or rld). If you later get a
message like:
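psql: error in loading shared libraries
libpq.so.2.1: cannot open shared object file: No such file or directory
then this step was necessary. Simply take care of it then.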
If you are on Linux and you have root access, you can run:
/sbin/ldconfig /usr/local/pgsql/lib
(or equivalent directory) after installation to enable the run-time linker to find the shared libraries
faster. Refer to the manual page of ldconfig for more information. On FreeBSD, NetBSD, and
OpenBSD the command is:
/sbin/ldconfig -m /usr/local/pgsql/lib
If you installed into /usr/local/pgsql or some other location that is not searched for programs
by default, you should add /usr/local/pgsql/bin (or whatever you set --bindir to in Step 1)
into your PATH. To do this, add the following to your shell start-up file, such as ~/.bash_profile
(or /etc/profile, if you want it to affect all users):
PATH=/usr/local/pgsql/bin:$PATH
export PATH
To enable your system to find the man documentation, you need to add lines like the following to a
shell start-up file unless you installed into a location that is searched by default:
MANPATH=/usr/local/pgsql/share/man:$MANPATH
export MANPATH
The environment variables PGHOST and PGPORT specify to client applications the host and port of
the database server, overriding the compiled-in defaults. If you are going to run client applications
remotely then it is convenient if every user that plans to use the database sets PGHOST. This is not re-
quired, however; the settings can be communicated via command line options to most client programs.
In general, PostgreSQL can be expected to work on these CPU architectures: x86, x86_64, IA64,
PowerPC, PowerPC 64, S/390, S/390x, Sparc, Sparc 64, ARM, MIPS, MIPSEL, and PA-RISC. Code
support exists for M68K, M32R, and VAX, but these architectures are not known to have been tested
recently. It is often possible to build on an unsupported CPU type by configuring with
--disable-spinlocks, but performance will be poor.
PostgreSQL can be expected to work on these operating systems: Linux (all recent distributions),
Windows (Win2000 SP4 and later), FreeBSD, OpenBSD, NetBSD, macOS, AIX, HP/UX, and Solaris.
Other Unix-like systems may also work but are not currently being tested. In most cases, all CPU
architectures supported by a given operating system will work. Look in Section 16.7 below to see if
there is information specific to your operating system, particularly if using an older system.
If you have installation problems on a platform that is known to be supported according to recent
build farm (https://buildfarm.postgresql.org/) results, please report it to
<[email protected]>. If you are interested
in porting PostgreSQL to a new platform, <[email protected]> is the
appropriate place to discuss that.
Platforms that are not covered here have no known platform-specific installation issues.
16.7.1. AIX
PostgreSQL works on AIX, but getting it installed properly can be challenging. AIX versions from
4.3.3 to 6.1 are considered supported. You can use GCC or the native IBM compiler xlc. In general,
using recent versions of AIX and PostgreSQL helps. Check the build farm for up to date information
about which versions of AIX are known to work.
The minimum recommended fix levels for supported AIX versions are:
AIX 4.3.3
AIX 5.1
AIX 5.2
AIX 5.3
Technology Level 7
AIX 6.1
Base Level
To check your current fix level, use oslevel -r in AIX 4.3.3 to AIX 5.2 ML 7, or oslevel -s
in later versions.
Use the following configure flags in addition to your own if you have installed Readline or libz
in /usr/local: --with-includes=/usr/local/include --with-libraries=/usr/local/lib.
You will want to use a version of GCC subsequent to 3.3.2, particularly if you use a prepackaged
version. We had good success with 4.0.1. Problems with earlier versions seem to have more to do with
the way IBM packaged GCC than with actual issues with GCC, so that if you compile GCC yourself,
you might well have success with an earlier version of GCC.
The problem was reported to IBM, and is recorded as bug report PMR29657. If you upgrade to mainte-
nance level 5300-03 or later, that will include this fix. A quick workaround is to alter _SS_MAXSIZE
to 1025 in /usr/include/sys/socket.h. In either case, recompile PostgreSQL once you have
the corrected header file.
When implementing PostgreSQL version 8.1 on AIX 5.3, we periodically ran into problems where
the statistics collector would “mysteriously” not come up successfully. This appears to be the result
of unexpected behavior in the IPv6 implementation. It looks like PostgreSQL and IPv6 do not play
very well together on AIX 5.3.
Here is a suggestion on how to fix the problem:
• Delete the IPv6 address for localhost:
(as root)
# ifconfig lo0 inet6 ::1/0 delete
• Remove IPv6 from net services. The file /etc/netsvc.conf on AIX is roughly equivalent to
/etc/nsswitch.conf on Solaris/Linux. The default, on AIX, is thus:
hosts=local,bind
That needs to be replaced with:
hosts=local4,bind4
to deactivate searching for IPv6 addresses.
Warning
This is really a workaround for problems relating to immaturity of IPv6 support, which im-
proved visibly during the course of AIX 5.3 releases. It has worked with AIX version 5.3, but
does not represent an elegant solution to the problem. It has been reported that this workaround
is not only unnecessary, but causes problems on AIX 6.1, where IPv6 support has become
more mature.
Another example is out of memory errors in the PostgreSQL server logs, with every memory allocation
near or greater than 256 MB failing.
The overall cause of all these problems is the default bittedness and memory model used by the server
process. By default, all binaries built on AIX are 32-bit. This does not depend upon hardware type or
kernel in use. These 32-bit processes are limited to 4 GB of memory laid out in 256 MB segments
using one of a few models. The default allows for less than 256 MB in the heap as it shares a single
segment with the stack.
In the case of the plperl example, above, check your umask and the permissions of the binaries in
your PostgreSQL installation. The binaries involved in that example were 32-bit and installed as mode
750 instead of 755. Due to the permissions being set in this fashion, only the owner or a member of
the possessing group can load the library. Since it isn't world-readable, the loader places the object
into the process' heap instead of the shared library segments where it would otherwise be placed.
The “ideal” solution for this is to use a 64-bit build of PostgreSQL, but that is not always practical,
because systems with 32-bit processors can build, but not run, 64-bit binaries.
If a 32-bit binary is desired, set LDR_CNTRL to MAXDATA=0xn0000000, where 1 <= n <= 8, before
starting the PostgreSQL server, and try different values and postgresql.conf settings to find a
configuration that works satisfactorily. This use of LDR_CNTRL tells AIX that you want the server
to have MAXDATA bytes set aside for the heap, allocated in 256 MB segments. When you find a
workable configuration, ldedit can be used to modify the binaries so that they default to using
the desired heap size. PostgreSQL can also be rebuilt, passing configure LDFLAGS="-Wl,-
bmaxdata:0xn0000000" to achieve the same effect.
For a 64-bit build, set OBJECT_MODE to 64 and pass CC="gcc -maix64" and
LDFLAGS="-Wl,-bbigtoc" to configure. (Options for xlc might differ.) If you omit the export
of OBJECT_MODE, your build may fail with linker errors. When OBJECT_MODE is set, it tells AIX's
build utilities such as ar, as, and ld what type of objects to default to handling.
By default, overcommit of paging space can happen. While we have not seen this occur, AIX will
kill processes when it runs out of memory and the overcommit is accessed. The closest to this that we
have seen is fork failing because the system decided that there was not enough memory for another
process. Like many other parts of AIX, the paging space allocation method and out-of-memory kill is
configurable on a system- or process-wide basis if this becomes a problem.
16.7.2. Cygwin
PostgreSQL can be built using Cygwin, a Linux-like environment for Windows, but that method is
inferior to the native Windows build (see Chapter 17) and running a server under Cygwin is no longer
recommended.
When building from source, proceed according to the normal installation procedure (i.e., ./configure;
make; etc.), noting the following Cygwin-specific differences:
• Set your path to use the Cygwin bin directory before the Windows utilities. This will help prevent
problems with compilation.
• The adduser command is not supported; use the appropriate user management application on
Windows NT, 2000, or XP. Otherwise, skip this step.
• The su command is not supported; use ssh to simulate su on Windows NT, 2000, or XP. Otherwise,
skip this step.
• Start cygserver for shared memory support. To do this, enter the command /usr/sbin/
cygserver &. This program needs to be running anytime you start the PostgreSQL server or
initialize a database cluster (initdb). The default cygserver configuration may need to be
changed (e.g., increase SEMMNS) to prevent PostgreSQL from failing due to a lack of system re-
sources.
• Building might fail on some systems where a locale other than C is in use. To fix this, set the locale
to C by doing export LANG=C.utf8 before building, and then set it back to the previous setting
after you have installed PostgreSQL.
• The parallel regression tests (make check) can generate spurious regression test failures due to
overflowing the listen() backlog queue which causes connection refused errors or hangs. You
can limit the number of connections using the make variable MAX_CONNECTIONS thus:
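make MAX_CONNECTIONS=5 check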
It is possible to install cygserver and the PostgreSQL server as Windows NT services. For infor-
mation on how to do this, please refer to the README document included with the PostgreSQL binary
package on Cygwin. It is installed in the directory /usr/share/doc/Cygwin.
16.7.3. HP-UX
PostgreSQL 7.3+ should work on Series 700/800 PA-RISC machines running HP-UX 10.X or 11.X,
given appropriate system patch levels and build tools. At least one developer routinely tests on HP-
UX 10.20, and we have reports of successful installations on HP-UX 11.00 and 11.11.
Aside from the PostgreSQL source distribution, you will need GNU make (HP's make will not do),
and either GCC or HP's full ANSI C compiler. If you intend to build from Git sources rather than a
distribution tarball, you will also need Flex (GNU lex) and Bison (GNU yacc). We also recommend
making sure you are fairly up-to-date on HP patches. At a minimum, if you are building 64 bit binaries
on HP-UX 11.11 you may need PHSS_30966 (11.11) or a successor patch; otherwise initdb may
hang.
On general principles you should be current on libc and ld/dld patches, as well as compiler patches
if you are using HP's C compiler. See HP's support sites such as ftp://us-ffs.external.hp.com/ for free
copies of their latest patches.
If you are building on a PA-RISC 2.0 machine and want to have 64-bit binaries using GCC, you must
use a GCC 64-bit version.
If you are building on a PA-RISC 2.0 machine and want the compiled binaries to run on PA-RISC 1.1
machines you will need to specify +DAportable in CFLAGS.
If you are building on a HP-UX Itanium machine, you will need the latest HP ANSI C compiler with
its dependent patch or successor patches.
If you have both HP's C compiler and GCC's, then you might want to explicitly select the compiler
to use when you run configure:
./configure CC=cc
for HP's C compiler, or
./configure CC=gcc
for GCC. If you omit this setting, then configure will pick gcc if it has a choice.
The default install target location is /usr/local/pgsql, which you might want to change to some-
thing under /opt. If so, use the --prefix switch to configure.
In the regression tests, there might be some low-order-digit differences in the geometry tests, which
vary depending on which compiler and math library versions you use. Any other error is cause for
suspicion.
16.7.4. macOS
To build PostgreSQL from source on macOS, you will need to install Apple's command line developer
tools, which can be done by issuing
xcode-select --install
(note that this will pop up a GUI dialog window for confirmation). You may or may not wish to also
install Xcode.
On recent macOS releases, it's necessary to embed the “sysroot” path in the include switches used to
find some system header files. This results in the outputs of the configure script varying depending on
which SDK version was used during configure. That shouldn't pose any problem in simple scenarios,
but if you are trying to do something like building an extension on a different machine than the server
code was built on, you may need to force use of a different sysroot path. To do that, set PG_SYSROOT,
for example
make PG_SYSROOT=/desired/path all
To find out the appropriate path on your machine, run
xcrun --show-sdk-path
Note that building an extension using a different sysroot version than was used to build the core server
is not really recommended; in the worst case it could result in hard-to-debug ABI inconsistencies.
You can also select a non-default sysroot path when configuring, by specifying PG_SYSROOT to
configure:
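./configure ... PG_SYSROOT=/desired/path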
This would primarily be useful to cross-compile for some other macOS version. There is no guarantee
that the resulting executables will run on the current host.
To suppress the sysroot options altogether, use
./configure ... PG_SYSROOT=none
(any nonexistent pathname will work). This might be useful if you wish to build with a non-Apple
compiler, but beware that that case is not tested or supported by the PostgreSQL developers.
macOS's “System Integrity Protection” (SIP) feature breaks make check, because it prevents pass-
ing the needed setting of DYLD_LIBRARY_PATH down to the executables being tested. You can
work around that by doing make install before make check. Most Postgres developers just
turn off SIP, though.
16.7.5. MinGW/Native Windows
The native Windows port requires a 32 or 64-bit version of Windows 2000 or later. Earlier operating
systems do not have sufficient infrastructure (but Cygwin may be used on those). MinGW, the Unix-
like build tools, and MSYS, a collection of Unix tools required to run shell scripts like configure,
can be downloaded from http://www.mingw.org/. Neither is required to run the resulting binaries; they
are needed only for creating the binaries.
To build 64 bit binaries using MinGW, install the 64 bit tool set from https://mingw-w64.org/, put
its bin directory in the PATH, and run configure with the --host=x86_64-w64-mingw32
option.
After you have everything installed, it is suggested that you run psql under CMD.EXE, as the MSYS
console has buffering issues.
16.7.6. Solaris
PostgreSQL is well-supported on Solaris. The more up to date your operating system, the fewer issues
you will experience; details below.
LIBOBJS =
to read
LIBOBJS = snprintf.o
(There might be other files already listed in this variable. Order does not matter.) Then build as usual.
If you do not have a reason to use 64-bit binaries on SPARC, prefer the 32-bit version. The 64-bit
operations are slower and 64-bit binaries are slower than the 32-bit variants. On the other hand,
32-bit code on the AMD64 CPU family is not native, which is why 32-bit code is significantly slower
on that CPU family.
If the linking of the postgres executable aborts with an error message about undefined probe symbols,
your DTrace installation is too old to handle probes in static functions. You need Solaris 10u4 or newer.
Chapter 17. Installation from Source
Code on Windows
It is recommended that most users download the binary distribution for Windows, available as a graph-
ical installer package from the PostgreSQL website. Building from source is only intended for people
developing PostgreSQL or extensions.
There are several different ways of building PostgreSQL on Windows. The simplest way to build with
Microsoft tools is to install Visual Studio 2022 and use the included compiler. It is also possible to
build with the full Microsoft Visual C++ 2005 to 2022. In some cases that requires the installation of
the Windows SDK in addition to the compiler.
It is also possible to build PostgreSQL using the GNU compiler tools provided by MinGW, or using
Cygwin for older versions of Windows.
Building using MinGW or Cygwin uses the normal build system, see Chapter 16 and the specific
notes in Section 16.7.5 and Section 16.7.2. To produce native 64 bit binaries in these environments,
use the tools from MinGW-w64. These tools can also be used to cross-compile for 32 bit and 64 bit
Windows targets on other hosts, such as Linux and macOS. Cygwin is not recommended for running
a production server, and it should only be used for running on older versions of Windows where the
native build does not work, such as Windows 98. The official binaries are built using Visual Studio.
Native builds of psql don't support command line editing. The Cygwin build does support command
line editing, so it should be used where psql is needed for interactive use on Windows.
Both 32-bit and 64-bit builds are possible with the Microsoft Compiler suite. 32-bit PostgreSQL builds
are possible with Visual Studio 2005 to Visual Studio 2022, as well as standalone Windows SDK
releases 6.0 to 10. 64-bit PostgreSQL builds are supported with Microsoft Windows SDK version
6.0a to 10 or Visual Studio 2008 and above. Compilation is supported down to Windows XP and
Windows Server 2003 when building with Visual Studio 2005 to Visual Studio 2013. Building with
Visual Studio 2015 is supported down to Windows Vista and Windows Server 2008. Building with
Visual Studio 2017 to Visual Studio 2022 is supported down to Windows 7 SP1 and Windows Server
2008 R2 SP1.
The tools for building using Visual C++ or Platform SDK are in the src/tools/msvc directory.
When building, make sure there are no tools from MinGW or Cygwin present in your system PATH.
Also, make sure you have all the required Visual C++ tools available in the PATH. In Visual Studio,
start the Visual Studio Command Prompt. If you wish to build a 64-bit version, you must use the 64-
bit version of the command, and vice versa. In the Microsoft Windows SDK, start the CMD shell
listed under the SDK on the Start Menu. In recent SDK versions you can change the targeted CPU
architecture, build type, and target OS by using the setenv command, e.g., setenv /x86 /release /xp
to target Windows XP or later with a 32-bit release build. See /? for other options
to setenv. All commands should be run from the src\tools\msvc directory.
Before you build, you may need to edit the file config.pl to reflect any configuration options
you want to change, or the paths to any third party libraries to use. The complete configuration is
determined by first reading and parsing the file config_default.pl, and then applying any changes
from config.pl. For example, to specify the location of your Python installation, put the following
in config.pl:
$config->{python} = 'c:\python26';
You only need to specify those parameters that are different from what's in config_default.pl.
If you need to set any other environment variables, create a file called buildenv.pl and put the
required commands there. For example, to add the path for bison when it's not in the PATH, create
a file containing:
$ENV{PATH}=$ENV{PATH} . ';c:\some\where\bison\bin';
To pass additional command line arguments to the Visual Studio build command (msbuild or vcbuild):
$ENV{MSBFLAGS}="/m";
17.1.1. Requirements
The following additional products are required to build PostgreSQL. Use the config.pl file to
specify which directories the libraries are available in.
Microsoft Windows SDK
If your build environment doesn't ship with a supported version of the Microsoft Windows SDK
it is recommended that you upgrade to the latest version (currently version 10), available for
download from https://www.microsoft.com/download.
You must always include the Windows Headers and Libraries part of the SDK. If you install a
Windows SDK including the Visual C++ Compilers, you don't need Visual Studio to build. Note
that as of Version 8.0a the Windows SDK no longer ships with a complete command-line build
environment.
ActiveState Perl
ActiveState Perl is required to run the build generation scripts. MinGW or Cygwin Perl will not
work. It must also be present in the PATH. Binaries can be downloaded from
https://www.activestate.com (Note: version 5.8.3 or later is required, the free Standard Distribution is sufficient).
The following additional products are not required to get started, but are required to build the complete
package. Use the config.pl file to specify which directories the libraries are available in.
ActiveState TCL
Required for building PL/Tcl (Note: version 8.4 is required, the free Standard Distribution is
sufficient).
Bison and Flex
Bison and Flex are required to build from Git, but not required when building from a release file.
Only Bison 1.875 or versions 2.2 and later will work. Flex must be version 2.5.31 or later.
Both Bison and Flex are included in the msys tool suite, available from
http://www.mingw.org/wiki/MSYS as part of the MinGW compiler suite.
You will need to add the directory containing flex.exe and bison.exe to the PATH envi-
ronment variable in buildenv.pl unless they are already in PATH. In the case of MinGW, the
directory is the \msys\1.0\bin subdirectory of your MinGW installation directory.
Note
The Bison distribution from GnuWin32 appears to have a bug that causes Bison to mal-
function when installed in a directory with spaces in the name, such as the default location
on English installations C:\Program Files\GnuWin32. Consider installing into C:
\GnuWin32 or use the NTFS short name path to GnuWin32 in your PATH environment
setting (e.g., C:\PROGRA~1\GnuWin32).
Note
The obsolete winflex binaries distributed on the PostgreSQL FTP site and referenced
in older documentation will fail with “flex: fatal internal error, exec failed” on 64-bit
Windows hosts. Use Flex from MSYS instead.
Diff
Diff is required to run the regression tests, and can be downloaded from
http://gnuwin32.sourceforge.net.
Gettext
Gettext is required to build with NLS support, and can be downloaded from
http://gnuwin32.sourceforge.net. Note that binaries, dependencies and developer files are all needed.
MIT Kerberos
Required for GSSAPI authentication support. MIT Kerberos can be downloaded from
https://web.mit.edu/Kerberos/dist/index.html.
OpenSSL
Required for SSL support.
ossp-uuid
Required for UUID-OSSP support (contrib only). Source can be downloaded from
http://www.ossp.org/pkg/lib/uuid/.
Python
Required for building PL/Python.
zlib
Required for compression support in pg_dump and pg_restore. Binaries can be downloaded from
https://www.zlib.net.
17.1.2. Special Considerations for 64-Bit Windows
Mixing 32- and 64-bit versions in the same build tree is not supported. The build system will auto-
matically detect if it's running in a 32- or 64-bit environment, and build PostgreSQL accordingly. For
this reason, it is important to start the correct command prompt before building.
To use a server-side third party library such as python or OpenSSL, this library must also be 64-bit.
There is no support for loading a 32-bit library in a 64-bit server. Several of the third party libraries
that PostgreSQL supports may only be available in 32-bit versions, in which case they cannot be used
with 64-bit PostgreSQL.
17.1.3. Building
To build all of PostgreSQL in release configuration (the default), run the command:
build
To build all of PostgreSQL in debug configuration, run the command:
build DEBUG
To build just a single project, for example psql, run the commands:
build psql
build DEBUG psql
To change the default build configuration to debug, put the following in the buildenv.pl file:
$ENV{CONFIG}="Debug";
It is also possible to build from inside the Visual Studio GUI. In this case, you need to run:
perl mkvcbuild.pl
from the command prompt, and then open the generated pgsql.sln (in the root directory of the
source tree) in Visual Studio.
By default, all files are written into a subdirectory of the debug or release directories. To install
these files using the standard layout, and also generate the files required to initialize and use the data-
base, run the command:
install c:\destination\directory
If you want to install only the client applications and interface libraries, then you can use these com-
mands:
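install c:\destination\directory client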
To run the regression tests, run one of the following commands from the src\tools\msvc directory:
vcregress check
vcregress installcheck
vcregress plcheck
vcregress contribcheck
vcregress modulescheck
vcregress ecpgcheck
vcregress isolationcheck
vcregress bincheck
vcregress recoverycheck
vcregress upgradecheck
To change the schedule used (default is parallel), append it to the command line like:
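vcregress check serial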
For more information about the regression tests, see Chapter 33.
Running the regression tests on client programs, with vcregress bincheck, or on recovery tests,
with vcregress recoverycheck, requires an additional Perl module to be installed:
IPC::Run
As of this writing, IPC::Run is not included in the ActiveState Perl installation, nor in the
ActiveState Perl Package Manager (PPM) library. To install, download the IPC-Run-<version>.tar.gz
source archive from CPAN, at https://metacpan.org/release/IPC-Run, and uncompress. Edit the
buildenv.pl file, and add a PERL5LIB variable to point to the lib subdirectory from the
extracted archive. For example:
$ENV{PERL5LIB}=$ENV{PERL5LIB} . ';c:\IPC-Run-0.94\lib';
Some of the TAP tests depend on a set of external commands that would optionally trigger tests related
to them. Each one of those variables can be set or unset in buildenv.pl:
GZIP_PROGRAM
Path to a gzip command. The default is gzip, which is the command found in PATH.
TAR
Path to a tar command. The default is tar, which is the command found in PATH.
Building the documentation requires a few additional tools, installed under a common root directory:
OpenJade 1.3.1-2
Edit the buildenv.pl file, and add a variable for the location of the root directory, for example:
$ENV{DOCROOT}='c:\docbook';
To build the documentation, run the command builddoc.bat. Note that this will actually run the
build twice, in order to generate the indexes. The generated HTML files will be in doc\src\sgml.
Chapter 18. Server Setup and
Operation
This chapter discusses how to set up and run the database server and its interactions with the operating
system.
To add a Unix user account to your system, look for a command useradd or adduser. The user
name postgres is often used, and is assumed throughout this book, but you can use another name if
you like.
In file system terms, a database cluster is a single directory under which all data will be stored. We
call this the data directory or data area. It is completely up to you where you choose to store your
data. There is no default, although locations such as /usr/local/pgsql/data or /var/lib/
pgsql/data are popular. To initialize a database cluster, use the command initdb, which is installed
with PostgreSQL. The desired file system location of your database cluster is indicated by the -D
option, for example:
$ initdb -D /usr/local/pgsql/data
Note that you must execute this command while logged into the PostgreSQL user account, which is
described in the previous section.
Tip
As an alternative to the -D option, you can set the environment variable PGDATA.
Alternatively, you can run initdb via the pg_ctl program like so:
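$ pg_ctl -D /usr/local/pgsql/data initdb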
This may be more intuitive if you are using pg_ctl for starting and stopping the server (see Sec-
tion 18.3), so that pg_ctl would be the sole command you use for managing the database server
instance.
initdb will attempt to create the directory you specify if it does not already exist. Of course, this will
fail if initdb does not have permissions to write in the parent directory. It's generally recommendable
that the PostgreSQL user own not just the data directory but its parent directory as well, so that this
should not be a problem. If the desired parent directory doesn't exist either, you will need to create it
first, using root privileges if the grandparent directory isn't writable. So the process might look like this:
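root# mkdir /usr/local/pgsql
root# chown postgres /usr/local/pgsql
root# su postgres
postgres$ initdb -D /usr/local/pgsql/data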
initdb will refuse to run if the data directory exists and already contains files; this is to prevent
accidentally overwriting an existing installation.
Because the data directory contains all the data stored in the database, it is essential that it be secured
from unauthorized access. initdb therefore revokes access permissions from everyone but the Post-
greSQL user, and optionally, group. Group access, when enabled, is read-only. This allows an unpriv-
ileged user in the same group as the cluster owner to take a backup of the cluster data or perform other
operations that only require read access.
Note that enabling or disabling group access on an existing cluster requires the cluster to be shut down
and the appropriate mode to be set on all directories and files before restarting PostgreSQL. Otherwise,
a mix of modes might exist in the data directory. For clusters that allow access only by the owner, the
appropriate modes are 0700 for directories and 0600 for files. For clusters that also allow reads by
the group, the appropriate modes are 0750 for directories and 0640 for files.
However, while the directory contents are secure, the default client authentication setup allows any
local user to connect to the database and even become the database superuser. If you do not trust
other local users, we recommend you use one of initdb's -W, --pwprompt or --pwfile options
to assign a password to the database superuser. Also, specify -A md5 or -A password so that
the default trust authentication mode is not used; or modify the generated pg_hba.conf file
after running initdb, but before you start the server for the first time. (Other reasonable approaches
include using peer authentication or file system permissions to restrict connections. See Chapter 20
for more information.)
initdb also initializes the default locale for the database cluster. Normally, it will just take the
locale settings in the environment and apply them to the initialized database. It is possible to specify a
different locale for the database; more information about that can be found in Section 23.1. The default
sort order used within the particular database cluster is set by initdb, and while you can create new
databases using different sort order, the order used in the template databases that initdb creates cannot
be changed without dropping and recreating them. There is also a performance impact for using locales
other than C or POSIX. Therefore, it is important to make this choice correctly the first time.
initdb also sets the default character set encoding for the database cluster. Normally this should be
chosen to match the locale setting. For details see Section 23.3.
Non-C and non-POSIX locales rely on the operating system's collation library for character set order-
ing. This controls the ordering of keys stored in indexes. For this reason, a cluster cannot switch to an
incompatible collation library version, either through snapshot restore, binary streaming replication,
a different operating system, or an operating system upgrade.
Storage Area Networks (SAN) typically use communication protocols other than NFS, and may or may
not be subject to hazards of this sort. It's advisable to consult the vendor's documentation concerning
data consistency guarantees. PostgreSQL cannot be more reliable than the file system it's using.
The database server program is called postgres. The simplest way to start the server is:
$ postgres -D /usr/local/pgsql/data
which will leave the server running in the foreground. This must be done while logged into the Post-
greSQL user account. Without -D, the server will try to use the data directory named by the environ-
ment variable PGDATA. If that variable is not provided either, it will fail.
Normally it is better to start postgres in the background. For this, use the usual Unix shell syntax:
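$ postgres -D /usr/local/pgsql/data >logfile 2>&1 &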
It is important to store the server's stdout and stderr output somewhere, as shown above. It will help
for auditing purposes and to diagnose problems. (See Section 24.3 for a more thorough discussion of
log file handling.)
The postgres program also takes a number of other command-line options. For more information,
see the postgres reference page and Chapter 19 below.
This shell syntax can get tedious quickly. Therefore the wrapper program pg_ctl is provided to simplify
some tasks. For example:
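$ pg_ctl start -l logfile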
will start the server in the background and put the output into the named log file. The -D option has
the same meaning here as for postgres. pg_ctl is also capable of stopping the server.
Normally, you will want to start the database server when the computer boots. Autostart scripts are
operating-system-specific. There are a few distributed with PostgreSQL in the contrib/start-
scripts directory. Installing one will require root privileges.
Different systems have different conventions for starting up daemons at boot time. Many systems have
a file /etc/rc.local or /etc/rc.d/rc.local. Others use init.d or rc.d directories.
Whatever you do, the server must be run by the PostgreSQL user account and not by root or any other
user. Therefore you probably should form your commands using su postgres -c '...'. For
example:
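su postgres -c '/usr/local/pgsql/bin/pg_ctl start -D /usr/local/pgsql/data -l serverlog'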
Here are a few more operating-system-specific suggestions. (In each case be sure to use the proper
installation directory and user name where we show generic values.)
• On OpenBSD, add the following lines to the file /etc/rc.local:
if [ -x /usr/local/pgsql/bin/pg_ctl -a -x /usr/local/pgsql/bin/
postgres ]; then
su -l postgres -c '/usr/local/pgsql/bin/pg_ctl start -s -l /
var/postgresql/log -D /usr/local/pgsql/data'
echo -n ' postgresql'
fi
When using systemd, you can use the following service unit file (e.g., at /etc/systemd/sys-
tem/postgresql.service):
[Unit]
Description=PostgreSQL database server
Documentation=man:postgres(1)
[Service]
Type=notify
User=postgres
ExecStart=/usr/local/pgsql/bin/postgres -D /usr/local/pgsql/data
ExecReload=/bin/kill -HUP $MAINPID
KillMode=mixed
KillSignal=SIGINT
TimeoutSec=infinity
[Install]
WantedBy=multi-user.target
Using Type=notify requires that the server binary was built with configure --with-
systemd.
Consider carefully the timeout setting. systemd has a default timeout of 90 seconds as of this writing
and will kill a process that does not report readiness within that time. But a PostgreSQL server
that might have to perform crash recovery at startup could take much longer to become ready. The
suggested value of infinity disables the timeout logic.
• On NetBSD, use either the FreeBSD or Linux start scripts, depending on preference.
• On Solaris, create a file called /etc/init.d/postgresql that contains the following line:
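/usr/local/pgsql/bin/pg_ctl start -l logfile -D /usr/local/pgsql/data
Then, create a symbolic link to it in /etc/rc3.d as S99postgresql.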
While the server is running, its PID is stored in the file postmaster.pid in the data directory. This
is used to prevent multiple server instances from running in the same data directory and can also be
used for shutting down the server.
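If the server fails to start, the startup log typically reports something along these lines (the
address and port shown are examples):
LOG: could not bind IPv4 address "127.0.0.1": Address already in use
HINT: Is another postmaster already running on port 5432? If not,
wait a few seconds and retry.
FATAL: could not create any TCP/IP sockets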
This usually means just what it suggests: you tried to start another server on the same port where one
is already running. However, if the kernel error message is not Address already in use or
some variant of that, there might be a different problem. For example, trying to start a server on a
reserved port number might draw something like:
$ postgres -p 666
LOG: could not bind IPv4 address "127.0.0.1": Permission denied
HINT: Is another postmaster already running on port 666? If not,
wait a few seconds and retry.
FATAL: could not create any TCP/IP sockets
A message like:
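FATAL: could not create shared memory segment: Invalid argument
DETAIL: Failed system call was shmget(key=5440001, size=4011376640, 0600).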
probably means your kernel's limit on the size of shared memory is smaller than the work area Post-
greSQL is trying to create (4011376640 bytes in this example). Or it could mean that you do not have
System-V-style shared memory support configured into your kernel at all. As a temporary workaround,
you can try starting the server with a smaller-than-normal number of buffers (shared_buffers). You
will eventually want to reconfigure your kernel to increase the allowed shared memory size. You might
also see this message when trying to start multiple servers on the same machine, if their total space
requested exceeds the kernel limit.
An error like:
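FATAL: could not create semaphores: No space left on device
DETAIL: Failed system call was semget(5440126, 17, 03600).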
does not mean you've run out of disk space. It means your kernel's limit on the number of System
V semaphores is smaller than the number PostgreSQL wants to create. As above, you might be able
to work around the problem by starting the server with a reduced number of allowed connections
(max_connections), but you'll eventually want to increase the kernel limit.
If you get an “illegal system call” error, it is likely that shared memory or semaphores are not supported
in your kernel at all. In that case your only option is to reconfigure the kernel to enable these features.
Details about configuring System V IPC facilities are given in Section 18.4.1.
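A client-side connection failure over TCP/IP typically looks something like this (the host name
and port are examples):
psql: could not connect to server: Connection refused
        Is the server running on host "server.joe.com" and accepting
        TCP/IP connections on port 5432?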
This is the generic “I couldn't find a server to talk to” failure. It looks like the above when TCP/IP
communication is attempted. A common mistake is to forget to configure the server to allow TCP/
IP connections.
Alternatively, you'll get this when attempting Unix-domain socket communication to a local server:
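psql: could not connect to server: No such file or directory
        Is the server running locally and accepting
        connections on Unix domain socket "/tmp/.s.PGSQL.5432"?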
The last line is useful in verifying that the client is trying to connect to the right place. If there is in fact
no server running there, the kernel error message will typically be either Connection refused
or No such file or directory, as illustrated. (It is important to realize that Connection
refused in this context does not mean that the server got your connection request and rejected it.
That case will produce a different message, as shown in Section 20.15.) Other error messages such
as Connection timed out might indicate more fundamental problems, like lack of network
connectivity.
The complete lack of these facilities is usually manifested by an “Illegal system call” error upon server
start. In that case there is no alternative but to reconfigure your kernel. PostgreSQL won't work without
them. This situation is rare, however, among modern operating systems.
Upon starting the server, PostgreSQL normally allocates a very small amount of System V shared
memory, as well as a much larger amount of POSIX (mmap) shared memory. In addition a significant
number of semaphores, which can be either System V or POSIX style, are created at server startup.
Currently, POSIX semaphores are used on Linux and FreeBSD systems while other platforms use
System V semaphores.
Note
Prior to PostgreSQL 9.3, only System V shared memory was used, so the amount of System
V shared memory required to start the server was much larger. If you are running an older
version of the server, please consult the documentation for your server version.
System V IPC features are typically constrained by system-wide allocation limits. When PostgreSQL
exceeds one of these limits, the server will refuse to start and should leave an instructive error message
describing the problem and what to do about it. (See also Section 18.3.1.) The relevant kernel para-
meters are named consistently across different systems; Table 18.1 gives an overview. The methods
to set them, however, vary. Suggestions for some platforms are given below.
PostgreSQL requires a few bytes of System V shared memory (typically 48 bytes, on 64-bit platforms)
for each copy of the server. On most modern operating systems, this amount can easily be allocated.
However, if you are running many copies of the server, or if other applications are also using System
V shared memory, it may be necessary to increase SHMALL, which is the total amount of System
V shared memory system-wide. Note that SHMALL is measured in pages rather than bytes on many
systems.
Less likely to cause problems is the minimum size for shared memory segments (SHMMIN), which
should be at most approximately 32 bytes for PostgreSQL (it is usually just 1). The maximum number
of segments system-wide (SHMMNI) or per-process (SHMSEG) are unlikely to cause a problem unless
your system has them set to zero.
When using System V semaphores, PostgreSQL uses one semaphore per allowed connection
(max_connections), allowed autovacuum worker process (autovacuum_max_workers) and allowed
background process (max_worker_processes), in sets of 16. Each such set will also contain a 17th
semaphore which contains a “magic number”, to detect collision with semaphore sets used by other
applications. The maximum number of semaphores in the system is set by SEMMNS, which conse-
quently must be at least as high as max_connections plus autovacuum_max_workers plus
max_worker_processes, plus one extra for each 16 allowed connections plus workers (see the
formula in Table 18.1). The parameter SEMMNI determines the limit on the number of semaphore sets
that can exist on the system at one time. Hence this parameter must be at least
ceil((max_connections + autovacuum_max_workers + max_worker_processes + 5) / 16).
Lowering the number of allowed connections is a temporary workaround for failures, which are usu-
ally confusingly worded “No space left on device”, from the function semget.
In some cases it might also be necessary to increase SEMMAP to be at least on the order of SEMMNS.
If the system has this parameter (many do not), it defines the size of the semaphore resource map, in
which each contiguous block of available semaphores needs an entry. When a semaphore set is freed it
is either added to an existing entry that is adjacent to the freed block or it is registered under a new map
entry. If the map is full, the freed semaphores get lost (until reboot). Fragmentation of the semaphore
space could over time lead to fewer available semaphores than there should be.
Various other settings related to “semaphore undo”, such as SEMMNU and SEMUME, do not affect
PostgreSQL.
When using POSIX semaphores, the number of semaphores needed is the same as for System V, that
is one semaphore per allowed connection (max_connections), allowed autovacuum worker process
(autovacuum_max_workers) and allowed background process (max_worker_processes). On the plat-
forms where this option is preferred, there is no specific kernel limit on the number of POSIX sem-
aphores.
AIX
At least as of version 5.1, it should not be necessary to do any special configuration for such
parameters as SHMMAX, as it appears this is configured to allow all memory to be used as shared
memory. That is the sort of configuration commonly used for other databases such as DB/2.
FreeBSD
The default IPC settings can be changed using the sysctl or loader interfaces. The following
parameters can be set using sysctl:
# sysctl kern.ipc.shmall=32768
# sysctl kern.ipc.shmmax=134217728
These semaphore-related settings are read-only as far as sysctl is concerned, but can be set in
/boot/loader.conf:
kern.ipc.semmni=256
kern.ipc.semmns=512
After modifying that file, a reboot is required for the new settings to take effect.
You might also want to configure your kernel to lock shared memory into RAM and pre-
vent it from being paged out to swap. This can be accomplished using the sysctl setting
kern.ipc.shm_use_phys.
FreeBSD versions before 4.0 work like old OpenBSD (see below).
NetBSD
In NetBSD 5.0 and later, IPC parameters can be adjusted using sysctl, for example:
# sysctl -w kern.ipc.semmni=100
You might also want to configure your kernel to lock shared memory into RAM and pre-
vent it from being paged out to swap. This can be accomplished using the sysctl setting
kern.ipc.shm_use_phys.
NetBSD versions before 5.0 work like old OpenBSD (see below), except that kernel parameters
should be set with the keyword options not option.
OpenBSD
In OpenBSD 3.3 and later, IPC parameters can be adjusted using sysctl, for example:
# sysctl kern.seminfo.semmni=100
In older OpenBSD versions, you will need to build a custom kernel to change the IPC parameters.
Make sure that the options SYSVSHM and SYSVSEM are enabled, too. (They are by default.) The
following shows an example of how to set the various parameters in the kernel configuration file:
option SYSVSHM
option SHMMAXPGS=4096
option SHMSEG=256
option SYSVSEM
option SEMMNI=256
option SEMMNS=512
option SEMMNU=256
HP-UX
The default settings tend to suffice for normal installations. On HP-UX 10, the factory default for
SEMMNS is 128, which might be too low for larger database sites.
IPC parameters can be set in the System Administration Manager (SAM) under Kernel Configu-
ration → Configurable Parameters. Choose Create A New Kernel when you're done.
Linux
The default maximum segment size is 32 MB, and the default maximum total size is 2097152
pages. A page is almost always 4096 bytes except in unusual kernel configurations with “huge
pages” (use getconf PAGE_SIZE to verify).
The shared memory size settings can be changed via the sysctl interface. For example, to allow
16 GB:
$ sysctl -w kernel.shmmax=17179869184
$ sysctl -w kernel.shmall=4194304
In addition these settings can be preserved between reboots in the file /etc/sysctl.conf.
Doing that is highly recommended.
Ancient distributions might not have the sysctl program, but equivalent changes can be made
by manipulating the /proc file system:
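$ echo 17179869184 >/proc/sys/kernel/shmmax
$ echo 4194304 >/proc/sys/kernel/shmall
(These values mirror the sysctl example above.)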
The remaining defaults are quite generously sized, and usually do not require changes.
macOS
The recommended method for configuring shared memory in macOS is to create a file named /etc/sysctl.conf, containing variable assignments such as:
kern.sysv.shmmax=4194304
kern.sysv.shmmin=1
kern.sysv.shmmni=32
kern.sysv.shmseg=8
kern.sysv.shmall=1024
Note that in some macOS versions, all five shared-memory parameters must be set in /etc/
sysctl.conf, else the values will be ignored.
Beware that recent releases of macOS ignore attempts to set SHMMAX to a value that isn't an exact
multiple of 4096.
In older macOS versions, you will need to reboot to have changes in the shared memory parame-
ters take effect. As of 10.5 it is possible to change all but SHMMNI on the fly, using sysctl. But
it's still best to set up your preferred values via /etc/sysctl.conf, so that the values will
be kept across reboots.
The file /etc/sysctl.conf is only honored in macOS 10.3.9 and later. If you are running a
previous 10.3.x release, you must edit the file /etc/rc and change the values in the following
commands:
sysctl -w kern.sysv.shmmax
sysctl -w kern.sysv.shmmin
sysctl -w kern.sysv.shmmni
sysctl -w kern.sysv.shmseg
sysctl -w kern.sysv.shmall
Note that /etc/rc is usually overwritten by macOS system updates, so you should expect to
have to redo these edits after each update.
In macOS 10.2 and earlier, instead edit these commands in the file /System/Library/StartupItems/SystemTuning/SystemTuning.
Solaris
In older versions of Solaris (through Solaris 9), the relevant settings can be changed in /etc/system, for example:
set shmsys:shminfo_shmmax=0x2000000
set shmsys:shminfo_shmmin=1
set shmsys:shminfo_shmmni=256
set shmsys:shminfo_shmseg=256
set semsys:seminfo_semmap=256
set semsys:seminfo_semmni=512
set semsys:seminfo_semmns=512
set semsys:seminfo_semmsl=32
You need to reboot for the changes to take effect. See also http://sunsite.uakom.sk/sunworldon-
line/swol-09-1997/swol-09-insidesolaris.html for information on shared memory under older ver-
sions of Solaris.
In Solaris 10 and later, and OpenSolaris, the default shared memory and semaphore settings are
good enough for most PostgreSQL applications. Solaris now defaults to a SHMMAX of one-quarter
of system RAM. To further adjust this setting, use a project setting associated with the postgres
user. For example, run the following as root:
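projadd -c "PostgreSQL DB User" \
  -K "project.max-shm-memory=(privileged,8GB,deny)" -U postgres -G postgres user.postgres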
This command adds the user.postgres project and sets the shared memory maximum for
the postgres user to 8GB, and takes effect the next time that user logs in, or when you restart
PostgreSQL (not reload). The above assumes that PostgreSQL is run by the postgres user in
the postgres group. No server reboot is required.
Other recommended kernel setting changes for database servers which will have a large number
of connections are:
project.max-shm-ids=(priv,32768,deny)
project.max-sem-ids=(priv,4096,deny)
project.max-msg-ids=(priv,4096,deny)
Additionally, if you are running PostgreSQL inside a zone, you may need to raise the zone re-
source usage limits as well. See "Chapter 2: Projects and Tasks" in the System Administrator's
Guide for more information on projects and prctl.
When systemd is in use, the setting RemoveIPC in logind.conf controls whether IPC objects are removed when a user
fully logs out. System users are exempt. This setting defaults to on in stock systemd, but some operating
system distributions default it to off.
A typical observed effect when this setting is on is that shared memory objects used for parallel query
execution are removed at apparently random times, leading to errors and warnings when the server
attempts to open and remove them.
Different types of IPC objects (shared memory vs. semaphores, System V vs. POSIX) are treated
slightly differently by systemd, so one might observe that some IPC resources are not removed in the
same way as others. But it is not advisable to rely on these subtle differences.
A “user logging out” might happen as part of a maintenance job or manually when an administrator
logs in as the postgres user or something similar, so it is hard to prevent in general.
What is a “system user” is determined at systemd compile time from the SYS_UID_MAX setting in
/etc/login.defs.
Packaging and deployment scripts should be careful to create the postgres user as a system user
by using useradd -r, adduser --system, or equivalent.
Alternatively, if the user account was created incorrectly or cannot be changed, it is recommended
to set
RemoveIPC=no
in /etc/systemd/logind.conf or another appropriate configuration file.
Caution
At least one of these two things has to be ensured, or the PostgreSQL server will be very
unreliable.
Unix-like operating systems enforce various kinds of resource limits that might interfere with the operation of your PostgreSQL server; of particular importance are limits on the number of processes per user, the number of open files per process, and the amount of memory available to each process. Each of these has a “hard” and a “soft” limit. The soft limit is what actually counts but it can be
changed by the user up to the hard limit. The hard limit can only be changed by the root user. The
system call setrlimit is responsible for setting these parameters. The shell's built-in command
ulimit (Bourne shells) or limit (csh) is used to control the resource limits from the command line.
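As a rough sketch (the exact option letters vary between shells and platforms), a Bourne-compatible shell could raise its soft limit on open files, up to the hard limit, with something like:
ulimit -n 4096     # raise the soft open-files limit for this shell and its children
ulimit -H -n       # show the hard limit, which only root can raise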
On BSD-derived systems the file /etc/login.conf controls the various resource limits set during
login. See the operating system documentation for details. The relevant parameters are maxproc,
openfiles, and datasize. For example:
default:\
...
:datasize-cur=256M:\
:maxproc-cur=256:\
:openfiles-cur=256:\
...
(-cur is the soft limit. Append -max to set the hard limit.)
• On Linux /proc/sys/fs/file-max determines the maximum number of open files that the
kernel will support. It can be changed by writing a different number into the file or by adding
an assignment in /etc/sysctl.conf. The maximum limit of files per process is fixed at the
time the kernel is compiled; see /usr/src/linux/Documentation/proc.txt for more
information.
The PostgreSQL server uses one process per connection so you should provide for at least as many
processes as allowed connections, in addition to what you need for the rest of your system. This is
usually not a problem but if you run several servers on one machine things might get tight.
The factory default limit on open files is often set to “socially friendly” values that allow many users
to coexist on a machine without using an inappropriate fraction of the system resources. If you run
many servers on a machine this is perhaps what you want, but on dedicated servers you might want
to raise this limit.
On the other side of the coin, some systems allow individual processes to open large numbers of
files; if more than a few processes do so then the system-wide limit can easily be exceeded. If you
find this happening, and you do not want to alter the system-wide limit, you can set PostgreSQL's
max_files_per_process configuration parameter to limit the consumption of open files.
If the kernel's memory overcommit behavior causes the out-of-memory (OOM) killer to terminate the
postmaster, you will see a kernel message that looks like this (consult your system documentation
and configuration on where to look for such a message):
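Out of Memory: Killed process 12345 (postgres).
(The exact wording, and the process ID shown — 12345 here is only a placeholder — vary between kernel versions.)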
This indicates that the postgres process has been terminated due to memory pressure. Although
existing database connections will continue to function normally, no new connections will be accepted.
To recover, PostgreSQL will need to be restarted.
One way to avoid this problem is to run PostgreSQL on a machine where you can be sure that other
processes will not run the machine out of memory. If memory is tight, increasing the swap space of
the operating system can help avoid the problem, because the out-of-memory (OOM) killer is invoked
only when physical memory and swap space are exhausted.
If PostgreSQL itself is the cause of the system running out of memory, you can avoid the problem
by changing your configuration. In some cases, it may help to lower memory-related configuration
parameters, particularly shared_buffers and work_mem. In other cases, the problem may be
caused by allowing too many connections to the database server itself. In many cases, it may be better
to reduce max_connections and instead make use of external connection-pooling software.
On Linux 2.6 and later, it is possible to modify the kernel's behavior so that it will not “overcommit”
memory. Although this setting will not prevent the OOM killer [1] from being invoked altogether, it will
lower the chances significantly and will therefore lead to more robust system behavior. This is done
by selecting strict overcommit mode via sysctl:
sysctl -w vm.overcommit_memory=2
or placing an equivalent entry in /etc/sysctl.conf. You might also wish to modify the related
setting vm.overcommit_ratio. For details see the kernel documentation file https://www.kernel.org/doc/Documentation/vm/overcommit-accounting.
Another approach, which can be used with or without altering vm.overcommit_memory, is to set the postmaster's process-specific OOM score adjustment to the “disable” value (-1000), so that the OOM killer will not target the postmaster itself. The simplest way to do this is to write that value to /proc/self/oom_score_adj in the postmaster's startup script just before invoking the postmaster. Note that this action must be done
as root, or it will have no effect; so a root-owned startup script is the easiest place to do it. If you do this,
you should also set these environment variables in the startup script before invoking the postmaster:
export PG_OOM_ADJUST_FILE=/proc/self/oom_score_adj
export PG_OOM_ADJUST_VALUE=0
These settings will cause postmaster child processes to run with the normal OOM score adjustment
of zero, so that the OOM killer can still target them at need. You could use some other value for
PG_OOM_ADJUST_VALUE if you want the child processes to run with some other OOM score ad-
justment. (PG_OOM_ADJUST_VALUE can also be omitted, in which case it defaults to zero.) If you
do not set PG_OOM_ADJUST_FILE, the child processes will run with the same OOM score adjust-
ment as the postmaster, which is unwise since the whole point is to ensure that the postmaster has a
preferential setting.
Older Linux kernels do not offer /proc/self/oom_score_adj, but may have a previous version
of the same functionality called /proc/self/oom_adj. This works the same except the disable
value is -17 not -1000.
Note
Some vendors' Linux 2.4 kernels are reported to have early versions of the 2.6 overcommit
sysctl parameter. However, setting vm.overcommit_memory to 2 on a 2.4 kernel that
does not have the relevant code will make things worse, not better. It is recommended that you
inspect the actual kernel source code (see the function vm_enough_memory in the file mm/
mmap.c) to verify what is supported in your kernel before you try this in a 2.4 installation.
The presence of the overcommit-accounting documentation file should not be taken as
evidence that the feature is there. If in any doubt, consult a kernel expert or your kernel vendor.
[1] https://lwn.net/Articles/104179/
To estimate the number of huge pages needed, start PostgreSQL without huge pages enabled and determine the postmaster's anonymous shared memory segment size, as well as the system's default huge page size, using the /proc file system. For example:
$ head -1 $PGDATA/postmaster.pid
4170
$ pmap 4170 | awk '/rw-s/ && /zero/ {print $2}'
6490428K
$ grep ^Hugepagesize /proc/meminfo
Hugepagesize: 2048 kB
6490428 / 2048 gives approximately 3169.154, so in this example we need at least 3170 huge
pages, which we can set with:
$ sysctl -w vm.nr_hugepages=3170
A larger setting would be appropriate if other programs on the machine also need huge pages. Don't
forget to add this setting to /etc/sysctl.conf so that it will be reapplied after reboots.
Sometimes the kernel is not able to allocate the desired number of huge pages immediately, so it might
be necessary to repeat the command or to reboot. (Immediately after a reboot, most of the machine's
memory should be available to convert into huge pages.) To verify the huge page allocation situation,
use:
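$ grep Huge /proc/meminfo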
It may also be necessary to give the database server's operating system user permission to use huge
pages by setting vm.hugetlb_shm_group via sysctl, and/or give permission to lock memory with
ulimit -l.
The default behavior for huge pages in PostgreSQL is to use them when possible and to fall back
to normal pages when failing. To enforce the use of huge pages, you can set huge_pages to on in
postgresql.conf. Note that with this setting PostgreSQL will fail to start if not enough huge
pages are available.
For a detailed description of the Linux huge pages feature have a look at https://www.kernel.org/doc/
Documentation/vm/hugetlbpage.txt.
SIGTERM
This is the Smart Shutdown mode. After receiving SIGTERM, the server disallows new connec-
tions, but lets existing sessions end their work normally. It shuts down only after all of the sessions
terminate. If the server is in online backup mode, it additionally waits until online backup mode
is no longer active. While backup mode is active, new connections will still be allowed, but only
to superusers (this exception allows a superuser to connect to terminate online backup mode). If
the server is in recovery when a smart shutdown is requested, recovery and streaming replication
will be stopped only after all regular sessions have terminated.
SIGINT
This is the Fast Shutdown mode. The server disallows new connections and sends all existing
server processes SIGTERM, which will cause them to abort their current transactions and exit
promptly. It then waits for all server processes to exit and finally shuts down. If the server is in
online backup mode, backup mode will be terminated, rendering the backup useless.
SIGQUIT
This is the Immediate Shutdown mode. The server will send SIGQUIT to all child processes and
wait for them to terminate. If any do not terminate within 5 seconds, they will be sent SIGKILL.
The master server process exits as soon as all child processes have exited, without doing normal
database shutdown processing. This will lead to recovery (by replaying the WAL log) upon next
start-up. This is recommended only in emergencies.
The pg_ctl program provides a convenient interface for sending these signals to shut down the server.
Alternatively, you can send the signal directly using kill on non-Windows systems. The PID of the
postgres process can be found using the ps program, or from the file postmaster.pid in the
data directory. For example, to do a fast shutdown:
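$ kill -INT `head -1 /usr/local/pgsql/data/postmaster.pid`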
Important
It is best not to use SIGKILL to shut down the server. Doing so will prevent the server from
releasing shared memory and semaphores, which might then have to be done manually before a
new server can be started. Furthermore, SIGKILL kills the postgres process without letting
it relay the signal to its subprocesses, so it will be necessary to kill the individual subprocesses
by hand as well.
To terminate an individual session while allowing other sessions to continue, use pg_terminate_backend() (see Table 9.78) or send a SIGTERM signal to the child process associated with
the session.
Current PostgreSQL version numbers consist of a major and a minor version number. For example, in
the version number 10.1, the 10 is the major version number and the 1 is the minor version number,
meaning this would be the first minor release of the major release 10. For releases before PostgreSQL
version 10.0, version numbers consist of three numbers, for example, 9.5.3. In those cases, the major
version consists of the first two digit groups of the version number, e.g., 9.5, and the minor version is
the third number, e.g., 3, meaning this would be the third minor release of the major release 9.5.
Minor releases never change the internal storage format and are always compatible with earlier and
later minor releases of the same major version number. For example, version 10.1 is compatible with
version 10.0 and version 10.6. Similarly, for example, 9.5.3 is compatible with 9.5.0, 9.5.1, and 9.5.6.
To update between compatible versions, you simply replace the executables while the server is down
and restart the server. The data directory remains unchanged — minor upgrades are that simple.
For major releases of PostgreSQL, the internal data storage format is subject to change, thus compli-
cating upgrades. The traditional method for moving data to a new major version is to dump and restore
the database, though this can be slow. A faster method is pg_upgrade. Replication methods are also
available, as discussed below.
New major versions also typically introduce some user-visible incompatibilities, so application pro-
gramming changes might be required. All user-visible changes are listed in the release notes (Appen-
dix E); pay particular attention to the section labeled "Migration". Though you can upgrade from one
major version to another without upgrading to intervening versions, you should read the major release
notes of all intervening versions.
Cautious users will want to test their client applications on the new version before switching over
fully; therefore, it's often a good idea to set up concurrent installations of old and new versions. When
testing a PostgreSQL major upgrade, consider the following categories of possible changes:
Administration
The capabilities available for administrators to monitor and control the server often change and
improve in each major release.
SQL
Typically this includes new SQL command capabilities and not changes in behavior, unless specif-
ically mentioned in the release notes.
Library API
Typically libraries like libpq only add new functionality, again unless mentioned in the release
notes.
System Catalogs
Changes in the system catalogs usually only affect database management tools that query them directly.
Server C-language API
This involves changes in the backend function API, which is written in the C programming language. Such changes affect code that references backend functions deep inside the server.
It is recommended that you use the pg_dump and pg_dumpall programs from the newer version of
PostgreSQL, to take advantage of enhancements that might have been made in these programs. Current
releases of the dump programs can read data from any server version back to 8.0.
These instructions assume that your existing installation is under the /usr/local/pgsql directo-
ry, and that the data area is in /usr/local/pgsql/data. Substitute your paths appropriately.
1. If making a backup, make sure that your database is not being updated. This does not affect the
integrity of the backup, but the changed data would of course not be included. If necessary, edit
the permissions in the file /usr/local/pgsql/data/pg_hba.conf (or equivalent) to
disallow access from everyone except you. See Chapter 20 for additional information on access
control.
To make the backup, you can use the pg_dumpall command from the version you are currently
running; see Section 25.1.2 for more details. For best results, however, try to use the pg_dumpall
command from PostgreSQL 11.17, since this version contains bug fixes and improvements over
older versions. While this advice might seem idiosyncratic since you haven't installed the new
version yet, it is advisable to follow it if you plan to install the new version in parallel with the
old version. In that case you can complete the installation normally and transfer the data later.
This will also decrease the downtime.
2. Shut down the server:
pg_ctl stop
On systems that have PostgreSQL started at boot time, there is probably a start-up file that will
accomplish the same thing. For example, on a Red Hat Linux system one might find that this
works:
/etc/rc.d/init.d/postgresql stop
See Chapter 18 for details about starting and stopping the server.
3. If restoring from backup, rename or delete the old installation directory if it is not version-specific.
It is a good idea to rename the directory, rather than delete it, in case you have trouble and need
to revert to it. Keep in mind the directory might consume significant disk space. To rename the
directory, use a command like this:
mv /usr/local/pgsql /usr/local/pgsql.old
(Be sure to move the directory as a single unit so relative paths remain unchanged.)
5. Create a new database cluster if needed. Remember that you must execute these commands while
logged in to the special database user account (which you already have if you are upgrading).
/usr/local/pgsql/bin/initdb -D /usr/local/pgsql/data
7. Start the database server, again using the special database user account:
/usr/local/pgsql/bin/postgres -D /usr/local/pgsql/data
The least downtime can be achieved by installing the new server in a different directory and running
both the old and the new servers in parallel, on different ports. Then you can use something like:
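pg_dumpall -p 5432 | psql -d postgres -p 5433
to transfer your data (here assuming the old server runs on port 5432 and the new one on port 5433).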
This method of upgrading can be performed using the built-in logical replication facilities as well as
using external logical replication systems such as pglogical, Slony, Londiste, and Bucardo.
One way to prevent spoofing of local connections is to use a Unix domain socket directory
(unix_socket_directories) that has write permission only for a trusted local user. This prevents a ma-
licious user from creating their own socket file in that directory. If you are concerned that some ap-
plications might still reference /tmp for the socket file and hence be vulnerable to spoofing, during
operating system startup create a symbolic link /tmp/.s.PGSQL.5432 that points to the relocat-
ed socket file. You also might need to modify your /tmp cleanup script to prevent removal of the
symbolic link.
Another option for local connections is for clients to use requirepeer to specify the required
owner of the server process connected to the socket.
To prevent spoofing on TCP connections, the best solution is to use SSL certificates and make sure
that clients check the server's certificate. To do that, the server must be configured to accept only
hostssl connections (Section 20.1) and have SSL key and certificate files (Section 18.9). The TCP
client must connect using sslmode=verify-ca or verify-full and have the appropriate root
certificate file installed (Section 34.18.1).
Password Encryption
Database user passwords are stored as hashes (determined by the setting password_encryption), so
the administrator cannot determine the actual password assigned to the user. If SCRAM or MD5
encryption is used for client authentication, the unencrypted password is never even temporarily
present on the server, because the client encrypts it before it is sent across the network. SCRAM
is preferred, because it is an Internet standard and is more secure than the PostgreSQL-specific
MD5 authentication protocol.
The pgcrypto module allows certain fields to be stored encrypted. This is useful if only some of
the data is sensitive. The client supplies the decryption key and the data is decrypted on the server
and then sent to the client.
The decrypted data and the decryption key are present on the server for a brief time while it is
being decrypted and communicated between the client and server. This presents a brief moment
where the data and keys can be intercepted by someone with complete access to the database
server, such as the system administrator.
Storage encryption can be performed at the file system level or the block level. Linux file system
encryption options include eCryptfs and EncFS, while FreeBSD uses PEFS. Block level or full
disk encryption options include dm-crypt + LUKS on Linux and GEOM modules geli and gbde
on FreeBSD. Many other operating systems support this functionality, including Windows.
This mechanism prevents unencrypted data from being read from the drives if the drives or the
entire computer is stolen. This does not protect against attacks while the file system is mounted,
because when mounted, the operating system provides an unencrypted view of the data. However,
to mount the file system, you need some way for the encryption key to be passed to the operating
system, and sometimes the key is stored somewhere on the host that mounts the disk.
SSL connections encrypt all data sent across the network: the password, the queries, and the data
returned. The pg_hba.conf file allows administrators to specify which hosts can use non-
encrypted connections (host) and which require SSL-encrypted connections (hostssl). Also,
clients can specify that they connect to servers only via SSL. Stunnel or SSH can also be used
to encrypt transmissions.
It is possible for both the client and server to provide SSL certificates to each other. It takes some
extra configuration on each side, but this provides stronger verification of identity than the mere
use of passwords. It prevents a computer from pretending to be the server just long enough to
read the password sent by the client. It also helps prevent “man in the middle” attacks where a
computer between the client and server pretends to be the server and reads and passes all data
between the client and server.
Client-Side Encryption
If the system administrator for the server's machine cannot be trusted, it is necessary for the client
to encrypt the data; this way, unencrypted data never appears on the database server. Data is
encrypted on the client before being sent to the server, and database results have to be decrypted
on the client before being used.
To start in SSL mode, files containing the server certificate and private key must exist. By default,
these files are expected to be named server.crt and server.key, respectively, in the server's
data directory, but other names and locations can be specified using the configuration parameters
ssl_cert_file and ssl_key_file.
On Unix systems, the permissions on server.key must disallow any access to world or group;
achieve this by the command chmod 0600 server.key. Alternatively, the file can be owned by
root and have group read access (that is, 0640 permissions). That setup is intended for installations
where certificate and key files are managed by the operating system. The user under which the Post-
greSQL server runs should then be made a member of the group that has access to those certificate
and key files.
If the data directory allows group read access then certificate files may need to be located outside of
the data directory in order to conform to the security requirements outlined above. Generally, group
access is enabled to allow an unprivileged user to back up the database, and in that case the backup
software will not be able to read the certificate files and will likely error.
If the private key is protected with a passphrase, the server will prompt for the passphrase and will not
start until it has been entered. Using a passphrase by default disables the ability to change the server's
SSL configuration without a server restart, but see ssl_passphrase_command_supports_reload. Fur-
thermore, passphrase-protected private keys cannot be used at all on Windows.
The first certificate in server.crt must be the server's certificate because it must match the serv-
er's private key. The certificates of “intermediate” certificate authorities can also be appended to the
file. Doing this avoids the necessity of storing intermediate certificates on clients, assuming the root
and intermediate certificates were created with v3_ca extensions. (This sets the certificate's basic
constraint of CA to true.) This allows easier expiration of intermediate certificates.
It is not necessary to add the root certificate to server.crt. Instead, clients must have the root
certificate of the server's certificate chain.
OpenSSL supports a wide range of ciphers and authentication algorithms, of varying strength. While a
list of ciphers can be specified in the OpenSSL configuration file, you can specify ciphers specifically
for use by the database server by modifying ssl_ciphers in postgresql.conf.
Note
It is possible to have authentication without encryption overhead by using NULL-SHA or
NULL-MD5 ciphers. However, a man-in-the-middle could read and pass communications be-
tween client and server. Also, encryption overhead is minimal compared to the overhead of
authentication. For these reasons NULL ciphers are not recommended.
Intermediate certificates that chain up to existing root certificates can also appear in the ssl_ca_file
file if you wish to avoid storing them on clients (assuming the root and intermediate certificates were
created with v3_ca extensions). Certificate Revocation List (CRL) entries are also checked if the
parameter ssl_crl_file is set.
The clientcert authentication option is available for all authentication methods, but only in
pg_hba.conf lines specified as hostssl. When clientcert is not specified or is set to 0, the
server will still verify any presented client certificates against its CA file, if one is configured — but
it will not insist that a client certificate be presented.
If you are setting up client certificates, you may wish to use the cert authentication method, so that the
certificates control user authentication as well as providing connection security. See Section 20.12 for
details. (It is not necessary to specify clientcert=1 explicitly when using the cert authentication
method.)
The server reads these SSL-related files at server start and whenever the server configuration is reloaded. On
Windows systems, they are also re-read whenever a new backend process is spawned for a new client
connection.
If an error in these files is detected at server start, the server will refuse to start. But if an error is
detected during a configuration reload, the files are ignored and the old SSL configuration continues
to be used. On Windows systems, if an error in these files is detected at backend start, that backend
will be unable to establish an SSL connection. In all these cases, the error condition is reported in
the server log.
To create a quick self-signed certificate for the server, valid for 365 days, use the following OpenSSL command, replacing dbhost.yourdomain.com with the server's host name:
openssl req -new -x509 -days 365 -nodes -text -out server.crt \
-keyout server.key -subj "/CN=dbhost.yourdomain.com"
Then do:
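chmod og-rwx server.key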
because the server will reject the file if its permissions are more liberal than this. For more details on
how to create your server private key and certificate, refer to the OpenSSL documentation.
While a self-signed certificate can be used for testing, a certificate signed by a certificate authority
(CA) (usually an enterprise-wide root CA) should be used in production.
To create a server certificate whose identity can be validated by clients, first create a certificate signing
request (CSR) and a public/private key file:
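openssl req -new -nodes -text -out root.csr \
  -keyout root.key -subj "/CN=root.yourdomain.com"
chmod og-rwx root.key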
Then, sign the request with the key to create a root certificate authority (using the default OpenSSL
configuration file location on Linux):
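openssl x509 -req -in root.csr -text -days 3650 \
  -extfile /etc/ssl/openssl.cnf -extensions v3_ca \
  -signkey root.key -out root.crt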
Finally, create a server certificate signed by the new root certificate authority:
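openssl req -new -nodes -text -out server.csr \
  -keyout server.key -subj "/CN=dbhost.yourdomain.com"
chmod og-rwx server.key
openssl x509 -req -in server.csr -text -days 365 \
  -CA root.crt -CAkey root.key -CAcreateserial \
  -out server.crt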
server.crt and server.key should be stored on the server, and root.crt should be stored
on the client so the client can verify that the server's leaf certificate was signed by its trusted root
certificate. root.key should be stored offline for use in creating future certificates.
# root
openssl req -new -nodes -text -out root.csr \
-keyout root.key -subj "/CN=root.yourdomain.com"
chmod og-rwx root.key
openssl x509 -req -in root.csr -text -days 3650 \
-extfile /etc/ssl/openssl.cnf -extensions v3_ca \
-signkey root.key -out root.crt
# intermediate
openssl req -new -nodes -text -out intermediate.csr \
-keyout intermediate.key -subj "/CN=intermediate.yourdomain.com"
chmod og-rwx intermediate.key
# leaf
openssl req -new -nodes -text -out server.csr \
-keyout server.key -subj "/CN=dbhost.yourdomain.com"
chmod og-rwx server.key
openssl x509 -req -in server.csr -text -days 365 \
-CA intermediate.crt -CAkey intermediate.key -CAcreateserial \
-out server.crt
server.crt and intermediate.crt should be concatenated into a certificate file bundle and
stored on the server. server.key should also be stored on the server. root.crt should be stored
on the client so the client can verify that the server's leaf certificate was signed by a chain of certificates
linked to its trusted root certificate. root.key and intermediate.key should be stored offline
for use in creating future certificates.
First make sure that an SSH server is running properly on the same machine as the PostgreSQL server
and that you can log in using ssh as some user; you then can establish a secure tunnel to the remote
server. A secure tunnel listens on a local port and forwards all traffic to a port on the remote machine.
Traffic sent to the remote port can arrive on its localhost address, or a different bind address if
desired; it does not appear as coming from your local machine. This command creates a secure tunnel
from the client machine to the remote machine foo.com:
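ssh -L 63333:localhost:5432 [email protected]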
The first number in the -L argument, 63333, is the local port number of the tunnel; it can be any
unused port. (IANA reserves ports 49152 through 65535 for private use.) The name or IP address
after this is the remote bind address you are connecting to, i.e., localhost, which is the default.
The second number, 5432, is the remote end of the tunnel, e.g., the port number your database server
is using. In order to connect to the database server using this tunnel, you connect to port 63333 on
the local machine:
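psql -h localhost -p 63333 postgres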
To the database server it will then look as though you are user joe on host foo.com connecting to
the localhost bind address, and it will use whatever authentication procedure was configured for
connections by that user to that bind address. Note that the server will not think the connection is SSL-
encrypted, since in fact it is not encrypted between the SSH server and the PostgreSQL server. This
should not pose any extra security risk because they are on the same machine.
In order for the tunnel setup to succeed you must be allowed to connect via ssh as [email protected],
just as if you had attempted to use ssh to create a terminal session.
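You could instead have set up the port forwarding as, for example,
ssh -L 63333:foo.com:5432 [email protected]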
but then the database server will see the connection as coming in on its foo.com bind address, which
is not opened by the default setting listen_addresses = 'localhost'. This is usually not
what you want.
If you have to “hop” to the database server via some login host, one possible setup could look like this:
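ssh -L 63333:db.foo.com:5432 [email protected]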
Note that this way the connection from shell.foo.com to db.foo.com will not be encrypted
by the SSH tunnel. SSH offers quite a few configuration possibilities when the network is restricted
in various ways. Please refer to the SSH documentation for details.
Tip
Several other applications exist that can provide secure tunnels using a procedure similar in
concept to the one just described.
To register a Windows event log library with the operating system, issue this command:
regsvr32 pgsql_library_directory/pgevent.dll
This creates registry entries used by the event viewer, under the default event source named Post-
greSQL.
To specify a different event source name (see event_source), use the /n and /i options:
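regsvr32 /n /i:event_source_name pgsql_library_directory/pgevent.dll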
To unregister the event log library from the operating system, issue this command:
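regsvr32 /u [/i:event_source_name] pgsql_library_directory/pgevent.dll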
Note
To enable event logging in the database server, modify log_destination to include eventlog
in postgresql.conf.
Chapter 19. Server Configuration
There are many configuration parameters that affect the behavior of the database system. In the first
section of this chapter we describe how to interact with configuration parameters. The subsequent
sections discuss each parameter in detail.
• Boolean: Values can be written as on, off, true, false, yes, no, 1, 0 (all case-insensitive)
or any unambiguous prefix of one of these.
• String: In general, enclose the value in single quotes, doubling any single quotes within the value.
Quotes can usually be omitted if the value is a simple number or identifier, however.
• Numeric (integer and floating point): A decimal point is permitted only for floating-point parame-
ters. Do not use thousands separators. Quotes are not required.
• Numeric with Unit: Some numeric parameters have an implicit unit, because they describe quanti-
ties of memory or time. The unit might be bytes, kilobytes, blocks (typically eight kilobytes), mil-
liseconds, seconds, or minutes. An unadorned numeric value for one of these settings will use the
setting's default unit, which can be learned from pg_settings.unit. For convenience, settings
can be given with a unit specified explicitly, for example '120 ms' for a time value, and they
will be converted to whatever the parameter's actual unit is. Note that the value must be written as a
string (with quotes) to use this feature. The unit name is case-sensitive, and there can be whitespace
between the numeric value and the unit.
• Valid memory units are B (bytes), kB (kilobytes), MB (megabytes), GB (gigabytes), and TB (ter-
abytes). The multiplier for memory units is 1024, not 1000.
• Valid time units are ms (milliseconds), s (seconds), min (minutes), h (hours), and d (days).
• Enumerated: Enumerated-type parameters are written in the same way as string parameters, but are
restricted to have one of a limited set of values. The values allowable for such a parameter can be
found from pg_settings.enumvals. Enum parameter values are case-insensitive.
# This is a comment
log_connections = yes
log_destination = 'syslog'
search_path = '"$user", public'
shared_buffers = 128MB
One parameter is specified per line. The equal sign between name and value is optional. Whitespace
is insignificant (except within a quoted parameter value) and blank lines are ignored. Hash marks (#)
designate the remainder of the line as a comment. Parameter values that are not simple identifiers or
numbers must be single-quoted. To embed a single quote in a parameter value, write either two quotes
(preferred) or backslash-quote. If the file contains multiple entries for the same parameter, all but the
last one are ignored.
Parameters set in this way provide default values for the cluster. The settings seen by active sessions
will be these values unless they are overridden. The following sections describe ways in which the
administrator or user can override these defaults.
The configuration file is reread whenever the main server process receives a SIGHUP signal; this
signal is most easily sent by running pg_ctl reload from the command line or by calling the SQL
function pg_reload_conf(). The main server process also propagates this signal to all currently
running server processes, so that existing sessions also adopt the new values (this will happen after
they complete any currently-executing client command). Alternatively, you can send the signal to a
single server process directly. Some parameters can only be set at server start; any changes to their
entries in the configuration file will be ignored until the server is restarted. Invalid parameter settings
in the configuration file are likewise ignored (but logged) during SIGHUP processing.
In addition to postgresql.conf, the data directory contains postgresql.auto.conf, which has the same format as postgresql.conf but should never be edited by hand. This file holds settings provided through the ALTER SYSTEM command; it is read whenever postgresql.conf is, and its settings take effect in the same way, overriding those in postgresql.conf.
External tools may also modify postgresql.auto.conf. It is not recommended to do this while
the server is running, since a concurrent ALTER SYSTEM command could overwrite such changes.
Such tools might simply append new settings to the end, or they might choose to remove duplicate
settings and/or comments (as ALTER SYSTEM will).
The system view pg_file_settings can be helpful for pre-testing changes to the configuration
files, or for diagnosing problems if a SIGHUP signal did not have the desired effects.
• The ALTER DATABASE command allows global settings to be overridden on a per-database basis.
• The ALTER ROLE command allows both global and per-database settings to be overridden with
user-specific values.
Values set with ALTER DATABASE and ALTER ROLE are applied only when starting a fresh data-
base session. They override values obtained from the configuration files or server command line, and
constitute defaults for the rest of the session. Note that some settings cannot be changed after server
start, and so cannot be set with these commands (or the ones listed below).
Once a client is connected to the database, PostgreSQL provides two additional SQL commands (and
equivalent functions) to interact with session-local configuration settings:
• The SHOW command allows inspection of the current value of all parameters. The corresponding
function is current_setting(setting_name text).
• The SET command allows modification of the current value of those parameters that can be set
locally to a session; it has no effect on other sessions. The corresponding function is set_config(setting_name, new_value, is_local).
In addition, the system view pg_settings can be used to view and change session-local values:
• Querying this view is similar to using SHOW ALL but provides more detail. It is also more flexible,
since it's possible to specify filter conditions or join against other relations.
• Using UPDATE on this view, specifically updating the setting column, is the equivalent of
issuing SET commands. For example, the equivalent of
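SET configuration_parameter TO DEFAULT;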
is:
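UPDATE pg_settings SET setting = reset_val WHERE name = 'configuration_parameter';
(configuration_parameter here stands for any session-settable parameter name.)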
• During server startup, parameter settings can be passed to the postgres command via the -c
command-line parameter. For example,
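postgres -c log_connections=yes -c log_destination='syslog'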
Settings provided in this way override those set via postgresql.conf or ALTER SYSTEM, so
they cannot be changed globally without restarting the server.
• When starting a client session via libpq, parameter settings can be specified using the PGOPTIONS
environment variable. Settings established in this way constitute defaults for the life of the session,
but do not affect other sessions. For historical reasons, the format of PGOPTIONS is similar to that
used when launching the postgres command; specifically, the -c flag must be specified. For
example,
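env PGOPTIONS='-c geqo=off' psql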
Other clients and libraries might provide their own mechanisms, via the shell or otherwise, that
allow the user to alter session settings without direct use of SQL commands.
In addition to individual parameter settings, the postgresql.conf file can contain include direc-
tives, which specify another file to read and process as if it were inserted into the configuration file at
this point. This feature allows a configuration file to be divided into physically separate parts. Include
directives simply look like:
include 'filename'
If the file name is not an absolute path, it is taken as relative to the directory containing the referencing
configuration file. Inclusions can be nested.
There is also an include_if_exists directive, which acts the same as the include directive,
except when the referenced file does not exist or cannot be read. A regular include will consider
this an error condition, but include_if_exists merely logs a message and continues processing
the referencing configuration file.
The postgresql.conf file can also contain include_dir directives, which specify an entire
directory of configuration files to include. These look like
include_dir 'directory'
Non-absolute directory names are taken as relative to the directory containing the referencing config-
uration file. Within the specified directory, only non-directory files whose names end with the suffix
.conf will be included. File names that start with the . character are also ignored, to prevent mis-
takes since such files are hidden on some platforms. Multiple files within an include directory are
processed in file name order (according to C locale rules, i.e., numbers before letters, and uppercase
letters before lowercase ones).
Include files or directories can be used to logically separate portions of the database configuration,
rather than having a single large postgresql.conf file. Consider a company that has two database
servers, each with a different amount of memory. There are likely elements of the configuration both
will share, for things such as logging. But memory-related parameters on the server will vary between
the two. And there might be server specific customizations, too. One way to manage this situation is
to break the custom configuration changes for your site into three files. You could add this to the end
of your postgresql.conf file to include them:
include 'shared.conf'
include 'memory.conf'
include 'server.conf'
All systems would have the same shared.conf. Each server with a particular amount of memory
could share the same memory.conf; you might have one for all servers with 8GB of RAM, another
for those having 16GB. And finally server.conf could have truly server-specific configuration
information in it.
Another possibility is to create a configuration file directory and put this information into files there.
For example, a conf.d directory could be referenced at the end of postgresql.conf:
include_dir 'conf.d'
Then you could name the files in the conf.d directory like this:
00shared.conf
01memory.conf
02server.conf
This naming convention establishes a clear order in which these files will be loaded. This is important
because only the last setting encountered for a particular parameter while the server is reading con-
figuration files will be used. In this example, something set in conf.d/02server.conf would
override a value set in conf.d/01memory.conf.
You might instead use this approach to naming the files descriptively:
00shared.conf
01memory-8GB.conf
02server-foo.conf
This sort of arrangement gives a unique name for each configuration file variation. This can help
eliminate ambiguity when several servers have their configurations all stored in one place, such as
in a version control repository. (Storing database configuration files under version control is another
good practice to consider.)
data_directory (string)
Specifies the directory to use for data storage. This parameter can only be set at server start.
config_file (string)
Specifies the main server configuration file (customarily called postgresql.conf). This pa-
rameter can only be set on the postgres command line.
hba_file (string)
Specifies the configuration file for host-based authentication (customarily called pg_hba.conf). This parameter can only be set at server start.
ident_file (string)
Specifies the configuration file for user name mapping (customarily called pg_ident.conf).
This parameter can only be set at server start. See also Section 20.2.
external_pid_file (string)
Specifies the name of an additional process-ID (PID) file that the server should create for use by
server administration programs. This parameter can only be set at server start.
In a default installation, none of the above parameters are set explicitly. Instead, the data directory is
specified by the -D command-line option or the PGDATA environment variable, and the configuration
files are all found within the data directory.
If you wish to keep the configuration files elsewhere than the data directory, the postgres -D com-
mand-line option or PGDATA environment variable must point to the directory containing the config-
uration files, and the data_directory parameter must be set in postgresql.conf (or on the
command line) to show where the data directory is actually located. Notice that data_directo-
ry overrides -D and PGDATA for the location of the data directory, but not for the location of the
configuration files.
If you wish, you can specify the configuration file names and locations individually using the para-
meters config_file, hba_file and/or ident_file. config_file can only be specified
on the postgres command line, but the others can be set within the main configuration file. If all
three parameters plus data_directory are explicitly set, then it is not necessary to specify -D
or PGDATA.
When setting any of these parameters, a relative path will be interpreted with respect to the directory
in which postgres is started.
listen_addresses (string)
Specifies the TCP/IP address(es) on which the server is to listen for connections from client ap-
plications. The value takes the form of a comma-separated list of host names and/or numeric IP
addresses. The special entry * corresponds to all available IP interfaces. The entry 0.0.0.0
allows listening for all IPv4 addresses and :: allows listening for all IPv6 addresses. If the list
is empty, the server does not listen on any IP interface at all, in which case only Unix-domain
sockets can be used to connect to it. The default value is localhost, which allows only local TCP/IP
“loopback” connections to be made. While client authentication (Chapter 20) allows fine-grained
control over who can access the server, listen_addresses controls which interfaces accept
connection attempts, which can help prevent repeated malicious connection requests on insecure
network interfaces. This parameter can only be set at server start.
port (integer)
The TCP port the server listens on; 5432 by default. Note that the same port number is used for
all IP addresses the server listens on. This parameter can only be set at server start.
max_connections (integer)
Determines the maximum number of concurrent connections to the database server. The default
is typically 100 connections, but might be less if your kernel settings will not support it (as deter-
mined during initdb). This parameter can only be set at server start.
When running a standby server, you must set this parameter to the same or higher value than on
the master server. Otherwise, queries will not be allowed in the standby server.
superuser_reserved_connections (integer)
Determines the number of connection “slots” that are reserved for connections by PostgreSQL
superusers. At most max_connections connections can ever be active simultaneously. Whenever the number of active concurrent connections is at least max_connections minus superuser_reserved_connections, new connections will be accepted only for superusers,
and no new replication connections will be accepted.
The default value is three connections. The value must be less than max_connections minus
max_wal_senders. This parameter can only be set at server start.
unix_socket_directories (string)
Specifies the directory of the Unix-domain socket(s) on which the server is to listen for connec-
tions from client applications. Multiple sockets can be created by listing multiple directories sep-
arated by commas. Whitespace between entries is ignored; surround a directory name with double
quotes if you need to include whitespace or commas in the name. An empty value specifies not
listening on any Unix-domain sockets, in which case only TCP/IP sockets can be used to connect
to the server. The default value is normally /tmp, but that can be changed at build time. This
parameter can only be set at server start.
In addition to the socket file itself, which is named .s.PGSQL.nnnn where nnnn is the server's
port number, an ordinary file named .s.PGSQL.nnnn.lock will be created in each of the
unix_socket_directories directories. Neither file should ever be removed manually.
This parameter is irrelevant on Windows, which does not have Unix-domain sockets.
unix_socket_group (string)
Sets the owning group of the Unix-domain socket(s). (The owning user of the sockets is always the
user that starts the server.) In combination with the parameter unix_socket_permissions
this can be used as an additional access control mechanism for Unix-domain connections. By
default this is the empty string, which uses the default group of the server user. This parameter
can only be set at server start.
This parameter is irrelevant on Windows, which does not have Unix-domain sockets.
unix_socket_permissions (integer)
Sets the access permissions of the Unix-domain socket(s). Unix-domain sockets use the usual
Unix file system permission set. The parameter value is expected to be a numeric mode specified
in the format accepted by the chmod and umask system calls. (To use the customary octal format
the number must start with a 0 (zero).)
The default permissions are 0777, meaning anyone can connect. Reasonable alternatives are
0770 (only user and group, see also unix_socket_group) and 0700 (only user). (Note that
for a Unix-domain socket, only write permission matters, so there is no point in setting or revoking
read or execute permissions.)
This access control mechanism is independent of the one described in Chapter 20.
This parameter is irrelevant on systems, notably Solaris as of Solaris 10, that ignore socket per-
missions entirely. There, one can achieve a similar effect by pointing unix_socket_directories to a directory having search permission limited to the desired audience. This parameter
is also irrelevant on Windows, which does not have Unix-domain sockets.
bonjour (boolean)
Enables advertising the server's existence via Bonjour. The default is off. This parameter can only
be set at server start.
bonjour_name (string)
Specifies the Bonjour service name. The computer name is used if this parameter is set to the
empty string '' (which is the default). This parameter is ignored if the server was not compiled
with Bonjour support. This parameter can only be set at server start.
tcp_keepalives_idle (integer)
Specifies the number of seconds of inactivity after which TCP should send a keepalive message
to the client. A value of 0 uses the system default. This parameter is supported only on systems
that support TCP_KEEPIDLE or an equivalent socket option, and on Windows; on other systems,
it must be zero. In sessions connected via a Unix-domain socket, this parameter is ignored and
always reads as zero.
Note
On Windows, a value of 0 will set this parameter to 2 hours, since Windows does not
provide a way to read the system default value.
tcp_keepalives_interval (integer)
Specifies the number of seconds after which a TCP keepalive message that is not acknowledged
by the client should be retransmitted. A value of 0 uses the system default. This parameter is
supported only on systems that support TCP_KEEPINTVL or an equivalent socket option, and
on Windows; on other systems, it must be zero. In sessions connected via a Unix-domain socket,
this parameter is ignored and always reads as zero.
Note
On Windows, a value of 0 will set this parameter to 1 second, since Windows does not
provide a way to read the system default value.
tcp_keepalives_count (integer)
Specifies the number of TCP keepalives that can be lost before the server's connection to the client
is considered dead. A value of 0 uses the system default. This parameter is supported only on
systems that support TCP_KEEPCNT or an equivalent socket option; on other systems, it must
be zero. In sessions connected via a Unix-domain socket, this parameter is ignored and always
reads as zero.
Note
This parameter is not supported on Windows, and must be zero.
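The following sketch shows one way these three keepalive settings might be tuned for connections
that pass through a connection-dropping firewall; the numbers are illustrative assumptions, not
recommendations:
ALTER SYSTEM SET tcp_keepalives_idle = 60;      -- first probe after 60 seconds of inactivity
ALTER SYSTEM SET tcp_keepalives_interval = 10;  -- retransmit unanswered probes every 10 seconds
ALTER SYSTEM SET tcp_keepalives_count = 5;      -- declare the connection dead after 5 lost probes
SELECT pg_reload_conf();                        -- new sessions pick up the changed defaults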
19.3.2. Authentication
authentication_timeout (integer)
Maximum time to complete client authentication, in seconds. If a would-be client has not com-
pleted the authentication protocol in this much time, the server closes the connection. This pre-
vents hung clients from occupying a connection indefinitely. The default is one minute (1m). This
parameter can only be set in the postgresql.conf file or on the server command line.
password_encryption (enum)
When a password is specified in CREATE ROLE or ALTER ROLE, this parameter determines the
algorithm to use to encrypt the password. The default value is md5, which stores the password as
an MD5 hash (on is also accepted as an alias for md5). Setting this parameter to scram-sha-256
will encrypt the password with SCRAM-SHA-256.
Note that older clients might lack support for the SCRAM authentication mechanism, and hence
not work with passwords encrypted with SCRAM-SHA-256. See Section 20.5 for more details.
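A minimal sketch of switching newly set passwords to SCRAM; the role name app_user is hypothetical,
and the parameter could equally be set in postgresql.conf for the whole server:
SET password_encryption = 'scram-sha-256';     -- affects passwords set in this session
ALTER ROLE app_user PASSWORD 'new-password';   -- stored as a SCRAM-SHA-256 verifier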
krb_server_keyfile (string)
Sets the location of the Kerberos server key file. See Section 20.6 for details. This parameter can
only be set in the postgresql.conf file or on the server command line.
krb_caseins_users (boolean)
Sets whether GSSAPI user names should be treated case-insensitively. The default is off (case
sensitive). This parameter can only be set in the postgresql.conf file or on the server com-
mand line.
db_user_namespace (boolean)
This parameter enables per-database user names. It is off by default. This parameter can only be
set in the postgresql.conf file or on the server command line.
If this is on, you should create users as username@dbname. When username is passed by a
connecting client, @ and the database name are appended to the user name and that database-spe-
cific user name is looked up by the server. Note that when you create users with names containing
@ within the SQL environment, you will need to quote the user name.
With this parameter enabled, you can still create ordinary global users. Simply append @ when
specifying the user name in the client, e.g., joe@. The @ will be stripped off before the user name
is looked up by the server.
db_user_namespace causes the client's and server's user name representation to differ. Au-
thentication checks are always done with the server's user name so authentication methods must
be configured for the server's user name, not the client's. Because md5 uses the user name as salt
on both the client and server, md5 cannot be used with db_user_namespace.
Note
This feature is intended as a temporary measure until a complete solution is found. At that
time, this option will be removed.
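Assuming db_user_namespace is on, user creation might look like the following sketch; the database
and role names are hypothetical:
-- A user who may connect only to database "salesdb" (note the quoting required for the @):
CREATE ROLE "joe@salesdb" LOGIN PASSWORD 'secret';
-- An ordinary global user; the client connects as "alice@" and the trailing @ is stripped:
CREATE ROLE alice LOGIN PASSWORD 'secret';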
19.3.3. SSL
See Section 18.9 for more information about setting up SSL.
ssl (boolean)
Enables SSL connections. This parameter can only be set in the postgresql.conf file or on
the server command line. The default is off.
ssl_ca_file (string)
Specifies the name of the file containing the SSL server certificate authority (CA). Relative paths
are relative to the data directory. This parameter can only be set in the postgresql.conf file
or on the server command line. The default is empty, meaning no CA file is loaded, and client
certificate verification is not performed.
ssl_cert_file (string)
Specifies the name of the file containing the SSL server certificate. Relative paths are relative
to the data directory. This parameter can only be set in the postgresql.conf file or on the
server command line. The default is server.crt.
ssl_crl_file (string)
Specifies the name of the file containing the SSL client certificate revocation list (CRL). Relative
paths are relative to the data directory. This parameter can only be set in the postgresql.conf
file or on the server command line. The default is empty, meaning no CRL file is loaded.
ssl_key_file (string)
Specifies the name of the file containing the SSL server private key. Relative paths are relative
to the data directory. This parameter can only be set in the postgresql.conf file or on the
server command line. The default is server.key.
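For illustration, a basic SSL setup using the default file names might be configured as below;
root.crt is an assumed CA file name, and supplying it also enables client certificate verification as
described above:
ALTER SYSTEM SET ssl = on;
ALTER SYSTEM SET ssl_cert_file = 'server.crt';   -- relative to the data directory
ALTER SYSTEM SET ssl_key_file = 'server.key';
ALTER SYSTEM SET ssl_ca_file = 'root.crt';       -- optional; turns on client certificate verification
SELECT pg_reload_conf();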
ssl_ciphers (string)
Specifies a list of SSL cipher suites that are allowed to be used by SSL connections. See the
ciphers manual page in the OpenSSL package for the syntax of this setting and a list of supported
values. Only connections using TLS version 1.2 and lower are affected. There is currently no
setting that controls the cipher choices used by TLS version 1.3 connections. The default value is
HIGH:MEDIUM:+3DES:!aNULL. The default is usually a reasonable choice unless you have
specific security requirements.
This parameter can only be set in the postgresql.conf file or on the server command line.
HIGH
Cipher suites that use ciphers from the HIGH group (e.g., AES, Camellia, 3DES)
MEDIUM
Cipher suites that use ciphers from the MEDIUM group (e.g., RC4, SEED)
+3DES
The OpenSSL default order for HIGH is problematic because it orders 3DES higher than
AES128. This is wrong because 3DES offers less security than AES128, and it is also much
slower. +3DES reorders it after all other HIGH and MEDIUM ciphers.
!aNULL
Disables anonymous cipher suites that do no authentication. Such cipher suites are vulnerable
to man-in-the-middle attacks and therefore should not be used.
Available cipher suite details will vary across OpenSSL versions. Use the command openssl
ciphers -v 'HIGH:MEDIUM:+3DES:!aNULL' to see actual details for the currently in-
stalled OpenSSL version. Note that this list is filtered at run time based on the server key type.
ssl_prefer_server_ciphers (boolean)
Specifies whether to use the server's SSL cipher preferences, rather than the client's. This para-
meter can only be set in the postgresql.conf file or on the server command line. The default
is true.
Older PostgreSQL versions do not have this setting and always use the client's preferences. This
setting is mainly for backward compatibility with those versions. Using the server's preferences
is usually better because it is more likely that the server is appropriately configured.
ssl_ecdh_curve (string)
Specifies the name of the curve to use in ECDH key exchange. It needs to be supported by all
clients that connect. It does not need to be the same curve used by the server's Elliptic Curve key.
This parameter can only be set in the postgresql.conf file or on the server command line.
The default is prime256v1.
OpenSSL names for the most common curves are: prime256v1 (NIST P-256), secp384r1
(NIST P-384), secp521r1 (NIST P-521). The full list of available curves can be shown with the
command openssl ecparam -list_curves. Not all of them are usable in TLS though.
ssl_dh_params_file (string)
Specifies the name of the file containing Diffie-Hellman parameters used for so-called ephemeral
DH family of SSL ciphers. The default is empty, in which case the compiled-in default DH parameters
are used. Using custom DH parameters reduces the exposure if an attacker manages to crack the
well-known compiled-in DH parameters. You can create your own DH parameters file with the
command openssl dhparam -out dhparams.pem 2048.
This parameter can only be set in the postgresql.conf file or on the server command line.
ssl_passphrase_command (string)
Sets an external command to be invoked when a passphrase for decrypting an SSL file such as
a private key needs to be obtained. By default, this parameter is empty, which means the built-
in prompting mechanism is used.
The command must print the passphrase to the standard output and exit with code 0. In the parame-
ter value, %p is replaced by a prompt string. (Write %% for a literal %.) Note that the prompt string
will probably contain whitespace, so be sure to quote adequately. A single newline is stripped
from the end of the output if present.
The command does not actually have to prompt the user for a passphrase. It can read it from a
file, obtain it from a keychain facility, or similar. It is up to the user to make sure the chosen
mechanism is adequately secure.
This parameter can only be set in the postgresql.conf file or on the server command line.
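One hypothetical (and deliberately simple) passphrase command reads the passphrase from a root-only
file; the path is an assumption, and a real deployment should weigh the security of wherever the
passphrase is stored:
ALTER SYSTEM SET ssl_passphrase_command = 'cat /etc/postgresql/server-key.passphrase';
SELECT pg_reload_conf();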
ssl_passphrase_command_supports_reload (boolean)
This parameter determines whether the passphrase command set by ssl_passphrase_command will
also be called during a configuration reload if a key file needs a passphrase. If this parameter is off
(the default), then ssl_passphrase_command is ignored during a reload and the SSL configuration
will not be reloaded if a passphrase is needed. That setting is appropriate for a command that requires
a TTY for prompting, which might not be available when the server is running. Setting this parameter
to on might be appropriate, for example, if the passphrase is obtained from a file. This parameter can
only be set in the postgresql.conf file or on the server command line.
19.4. Resource Consumption
19.4.1. Memory
shared_buffers (integer)
Sets the amount of memory the database server uses for shared memory buffers. The default is
typically 128 megabytes (128MB), but might be less if your kernel settings will not support it
(as determined during initdb). This setting must be at least 128 kilobytes. (Non-default values
of BLCKSZ change the minimum.) However, settings significantly higher than the minimum are
usually needed for good performance. This parameter can only be set at server start.
If you have a dedicated database server with 1GB or more of RAM, a reasonable starting val-
ue for shared_buffers is 25% of the memory in your system. There are some workloads
where even larger settings for shared_buffers are effective, but because PostgreSQL al-
so relies on the operating system cache, it is unlikely that an allocation of more than 40%
of RAM to shared_buffers will work better than a smaller amount. Larger settings for
shared_buffers usually require a corresponding increase in max_wal_size, in order to
spread out the process of writing large quantities of new or changed data over a longer period
of time.
On systems with less than 1GB of RAM, a smaller percentage of RAM is appropriate, so as to
leave adequate space for the operating system.
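As a sketch of the 25% guideline above, a dedicated server with 16 GB of RAM (a hypothetical machine)
might start with:
ALTER SYSTEM SET shared_buffers = '4GB';  -- roughly 25% of 16 GB; takes effect at the next server start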
huge_pages (enum)
Controls whether huge pages are requested for the main shared memory area. Valid values are
try (the default), on, and off. With huge_pages set to try, the server will try to request
huge pages, but fall back to the default if that fails. With on, failure to request huge pages will
prevent the server from starting up. With off, huge pages will not be requested.
At present, this setting is supported only on Linux and Windows. The setting is ignored on other
systems when set to try.
The use of huge pages results in smaller page tables and less CPU time spent on memory man-
agement, increasing performance. For more details about using huge pages on Linux, see Sec-
tion 18.4.5.
Huge pages are known as large pages on Windows. To use them, you need to assign the user
right Lock Pages in Memory to the Windows user account that runs PostgreSQL. You can use
Windows Group Policy tool (gpedit.msc) to assign the user right Lock Pages in Memory. To start
the database server on the command prompt as a standalone process, not as a Windows service,
the command prompt must be run as an administrator or User Access Control (UAC) must be
disabled. When the UAC is enabled, the normal command prompt revokes the user right Lock
Pages in Memory when started.
Note that this setting only affects the main shared memory area. Operating systems such as Linux,
FreeBSD, and Illumos can also use huge pages (also known as “super” pages or “large” pages)
automatically for normal memory allocation, without an explicit request from PostgreSQL. On
Linux, this is called “transparent huge pages” (THP). That feature has been known to cause per-
formance degradation with PostgreSQL for some users on some Linux versions, so its use is cur-
rently discouraged (unlike explicit use of huge_pages).
temp_buffers (integer)
Sets the maximum number of temporary buffers used by each database session. These are ses-
sion-local buffers used only for access to temporary tables. The default is eight megabytes (8MB).
The setting can be changed within individual sessions, but only before the first use of temporary
tables within the session; subsequent attempts to change the value will have no effect on that
session.
A session will allocate temporary buffers as needed up to the limit given by temp_buffers.
The cost of setting a large value in sessions that do not actually need many temporary buffers
is only a buffer descriptor, or about 64 bytes, per increment in temp_buffers. However if a
buffer is actually used an additional 8192 bytes will be consumed for it (or in general, BLCKSZ
bytes).
max_prepared_transactions (integer)
Sets the maximum number of transactions that can be in the “prepared” state simultaneously (see
PREPARE TRANSACTION). Setting this parameter to zero (which is the default) disables the
prepared-transaction feature. This parameter can only be set at server start.
If you are not planning to use prepared transactions, this parameter should be set to zero to pre-
vent accidental creation of prepared transactions. If you are using prepared transactions, you will
probably want max_prepared_transactions to be at least as large as max_connections,
so that every session can have a prepared transaction pending.
When running a standby server, you must set this parameter to the same or higher value than on
the master server. Otherwise, queries will not be allowed in the standby server.
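If two-phase commit is actually in use, one reasonable sketch is to match max_connections so that
every session can hold a prepared transaction; the value 100 below assumes max_connections = 100:
ALTER SYSTEM SET max_prepared_transactions = 100;  -- takes effect at the next server start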
work_mem (integer)
Specifies the amount of memory to be used by internal sort operations and hash tables before writ-
ing to temporary disk files. The value defaults to four megabytes (4MB). Note that for a complex
query, several sort or hash operations might be running in parallel; each operation will be allowed
to use as much memory as this value specifies before it starts to write data into temporary files.
Also, several running sessions could be doing such operations concurrently. Therefore, the total
memory used could be many times the value of work_mem; it is necessary to keep this fact in
mind when choosing the value. Sort operations are used for ORDER BY, DISTINCT, and merge
joins. Hash tables are used in hash joins, hash-based aggregation, and hash-based processing of
IN subqueries.
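To make the multiplication above concrete: with a hypothetical 100 concurrent sessions, each running
plans containing up to 3 sort or hash nodes, a setting of 32MB could in the worst case allow roughly
100 × 3 × 32MB ≈ 9.4 GB of memory to be used. work_mem can also be raised per session for
known-heavy queries, for example:
SET work_mem = '256MB';  -- session-local increase for a large reporting query (value is illustrative)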
maintenance_work_mem (integer)
Specifies the maximum amount of memory to be used by maintenance operations, such as VAC-
UUM, CREATE INDEX, and ALTER TABLE ADD FOREIGN KEY. It defaults to 64 megabytes
(64MB). Since only one of these operations can be executed at a time by a database session, and
an installation normally doesn't have many of them running concurrently, it's safe to set this value
significantly larger than work_mem. Larger settings might improve performance for vacuuming
and for restoring database dumps.
Note that when autovacuum runs, up to autovacuum_max_workers times this memory may be
allocated, so be careful not to set the default value too high. It may be useful to control for this
by separately setting autovacuum_work_mem.
Note that for the collection of dead tuple identifiers, VACUUM is only able to utilize up to a max-
imum of 1GB of memory.
autovacuum_work_mem (integer)
Specifies the maximum amount of memory to be used by each autovacuum worker process. It
defaults to -1, indicating that the value of maintenance_work_mem should be used instead. The
setting has no effect on the behavior of VACUUM when run in other contexts. This parameter can
only be set in the postgresql.conf file or on the server command line.
For the collection of dead tuple identifiers, autovacuum is only able to utilize up to a maximum of
1GB of memory, so setting autovacuum_work_mem to a value higher than that has no effect
on the number of dead tuples that autovacuum can collect while scanning a table.
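An illustrative combination, assuming ample RAM, gives manual maintenance commands a generous
budget while capping each autovacuum worker separately; both values are assumptions:
ALTER SYSTEM SET maintenance_work_mem = '512MB';  -- used by VACUUM, CREATE INDEX, etc.
ALTER SYSTEM SET autovacuum_work_mem = '128MB';   -- per autovacuum worker; overrides the value above
SELECT pg_reload_conf();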
max_stack_depth (integer)
Specifies the maximum safe depth of the server's execution stack. The ideal setting for this para-
meter is the actual stack size limit enforced by the kernel (as set by ulimit -s or local equiv-
alent), less a safety margin of a megabyte or so. The safety margin is needed because the stack
depth is not checked in every routine in the server, but only in key potentially-recursive routines
such as expression evaluation. The default setting is two megabytes (2MB), which is conserva-
tively small and unlikely to risk crashes. However, it might be too small to allow execution of
complex functions. Only superusers can change this setting.
Setting max_stack_depth higher than the actual kernel limit will mean that a runaway recur-
sive function can crash an individual backend process. On platforms where PostgreSQL can de-
termine the kernel limit, the server will not allow this variable to be set to an unsafe value. How-
ever, not all platforms provide the information, so caution is recommended in selecting a value.
dynamic_shared_memory_type (enum)
Specifies the dynamic shared memory implementation that the server should use. Possible values
are posix (for POSIX shared memory allocated using shm_open), sysv (for System V shared
memory allocated via shmget), windows (for Windows shared memory), mmap (to simulate
shared memory using memory-mapped files stored in the data directory), and none (to disable
this feature). Not all values are supported on all platforms; the first supported option is the default
for that platform. The use of the mmap option, which is not the default on any platform, is generally
discouraged because the operating system may write modified pages back to disk repeatedly,
increasing system I/O load; however, it may be useful for debugging, when the pg_dynshmem
directory is stored on a RAM disk, or when other shared memory facilities are not available.
19.4.2. Disk
temp_file_limit (integer)
Specifies the maximum amount of disk space that a process can use for temporary files, such as
sort and hash temporary files, or the storage file for a held cursor. A transaction attempting to
exceed this limit will be canceled. The value is specified in kilobytes, and -1 (the default) means
no limit. Only superusers can change this setting.
This setting constrains the total space used at any instant by all temporary files used by a given
PostgreSQL process. It should be noted that disk space used for explicit temporary tables, as
opposed to temporary files used behind-the-scenes in query execution, does not count against this
limit.
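A sketch of capping runaway temporary-file usage; the limit chosen here is an assumption:
ALTER SYSTEM SET temp_file_limit = '10GB';  -- per-process cap on temporary file space
SELECT pg_reload_conf();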
max_files_per_process (integer)
Sets the maximum number of simultaneously open files allowed to each server subprocess. The
default is one thousand files. If the kernel is enforcing a safe per-process limit, you don't need
to worry about this setting. But on some platforms (notably, most BSD systems), the kernel will
allow individual processes to open many more files than the system can actually support if many
processes all try to open that many files. If you find yourself seeing “Too many open files” failures,
try reducing this setting. This parameter can only be set at server start.
19.4.4. Cost-based Vacuum Delay
During the execution of VACUUM and ANALYZE commands, the system maintains an internal counter
that keeps track of the estimated cost of the various I/O operations that are performed. When the
accumulated cost reaches a limit (specified by vacuum_cost_limit), the process performing the
operation sleeps for a short period of time, as specified by vacuum_cost_delay, then resets the
counter and continues execution.
The intent of this feature is to allow administrators to reduce the I/O impact of these commands on
concurrent database activity. There are many situations where it is not important that maintenance
commands like VACUUM and ANALYZE finish quickly; however, it is usually very important that
these commands do not significantly interfere with the ability of the system to perform other database
operations. Cost-based vacuum delay provides a way for administrators to achieve this.
This feature is disabled by default for manually issued VACUUM commands. To enable it, set the
vacuum_cost_delay variable to a nonzero value.
vacuum_cost_delay (integer)
The length of time, in milliseconds, that the process will sleep when the cost limit has been ex-
ceeded. The default value is zero, which disables the cost-based vacuum delay feature. Positive
values enable cost-based vacuuming. Note that on many systems, the effective resolution of sleep
delays is 10 milliseconds; setting vacuum_cost_delay to a value that is not a multiple of 10
might have the same results as setting it to the next higher multiple of 10.
When using cost-based vacuuming, appropriate values for vacuum_cost_delay are usually
quite small, perhaps 10 or 20 milliseconds. Adjusting vacuum's resource consumption is best done
by changing the other vacuum cost parameters.
vacuum_cost_page_hit (integer)
The estimated cost for vacuuming a buffer found in the shared buffer cache. It represents the cost
to lock the buffer pool, lookup the shared hash table and scan the content of the page. The default
value is one.
vacuum_cost_page_miss (integer)
The estimated cost for vacuuming a buffer that has to be read from disk. This represents the effort
to lock the buffer pool, lookup the shared hash table, read the desired block in from the disk and
scan its content. The default value is 10.
vacuum_cost_page_dirty (integer)
The estimated cost charged when vacuum modifies a block that was previously clean. It represents
the extra I/O required to flush the dirty block out to disk again. The default value is 20.
vacuum_cost_limit (integer)
The accumulated cost that will cause the vacuuming process to sleep. The default value is 200.
Note
There are certain operations that hold critical locks and should therefore complete as quickly
as possible. Cost-based vacuum delays do not occur during such operations. Therefore it is
possible that the cost accumulates far higher than the specified limit. To avoid uselessly long
delays in such cases, the actual delay is calculated as vacuum_cost_delay * accumu-
lated_balance / vacuum_cost_limit with a maximum of vacuum_cost_delay
* 4.
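Putting the parameters above together, a gentler manual vacuum might be run as in the following
sketch; the table name and the delay value are assumptions:
SET vacuum_cost_delay = 20;    -- enable cost-based delay: sleep 20 ms whenever the cost budget is exhausted
VACUUM ANALYZE my_table;       -- "my_table" is a hypothetical table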
19.4.5. Background Writer
bgwriter_delay (integer)
Specifies the delay between activity rounds for the background writer. In each round the writer
issues writes for some number of dirty buffers (controllable by the following parameters). It then
sleeps for bgwriter_delay milliseconds, and repeats. When there are no dirty buffers in the
buffer pool, though, it goes into a longer sleep regardless of bgwriter_delay. The default
value is 200 milliseconds (200ms). Note that on many systems, the effective resolution of sleep
delays is 10 milliseconds; setting bgwriter_delay to a value that is not a multiple of 10 might
have the same results as setting it to the next higher multiple of 10. This parameter can only be
set in the postgresql.conf file or on the server command line.
bgwriter_lru_maxpages (integer)
In each round, no more than this many buffers will be written by the background writer. Setting
this to zero disables background writing. (Note that checkpoints, which are managed by a separate,
dedicated auxiliary process, are unaffected.) The default value is 100 buffers. This parameter can
only be set in the postgresql.conf file or on the server command line.
The number of dirty buffers written in each round is based on the number of new buffers that have
been needed by server processes during recent rounds. The average recent need is multiplied by
bgwriter_lru_multiplier to arrive at an estimate of the number of buffers that will be needed
during the next round; dirty buffers are written until that many clean, reusable buffers are available.
bgwriter_flush_after (integer)
Whenever more than bgwriter_flush_after bytes have been written by the background
writer, attempt to force the OS to issue these writes to the underlying storage. Doing so will limit
the amount of dirty data in the kernel's page cache, reducing the likelihood of stalls when an
fsync is issued at the end of a checkpoint, or when the OS writes data back in larger batches
in the background. Often that will result in greatly reduced transaction latency, but there also are
some cases, especially with workloads that are bigger than shared_buffers, but smaller than the
OS's page cache, where performance might degrade. This setting may have no effect on some
platforms. The valid range is between 0, which disables forced writeback, and 2MB. The default
is 512kB on Linux, 0 elsewhere. (If BLCKSZ is not 8kB, the default and maximum values scale
proportionally to it.) This parameter can only be set in the postgresql.conf file or on the
server command line.
19.4.6. Asynchronous Behavior
effective_io_concurrency (integer)
Sets the number of concurrent disk I/O operations that PostgreSQL expects can be executed si-
multaneously. Raising this value will increase the number of I/O operations that any individual
PostgreSQL session attempts to initiate in parallel. The allowed range is 1 to 1000, or zero to dis-
able issuance of asynchronous I/O requests. Currently, this setting only affects bitmap heap scans.
For magnetic drives, a good starting point for this setting is the number of separate drives com-
prising a RAID 0 stripe or RAID 1 mirror being used for the database. (For RAID 5 the parity
drive should not be counted.) However, if the database is often busy with multiple queries issued
in concurrent sessions, lower values may be sufficient to keep the disk array busy. A value high-
er than needed to keep the disks busy will only result in extra CPU overhead. SSDs and other
memory-based storage can often process many concurrent requests, so the best value might be
in the hundreds.
The default is 1 on supported systems, otherwise 0. This value can be overridden for tables in a
particular tablespace by setting the tablespace parameter of the same name (see ALTER TABLES-
PACE).
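Two illustrative settings, assuming either a four-drive striped array or SSD storage; both numbers are
assumptions:
ALTER SYSTEM SET effective_io_concurrency = 4;       -- e.g., four spinning drives in a stripe
-- ALTER SYSTEM SET effective_io_concurrency = 200;  -- SSDs can often sustain far more concurrent requests
SELECT pg_reload_conf();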
max_worker_processes (integer)
Sets the maximum number of background processes that the system can support. This parameter
can only be set at server start. The default is 8.
When running a standby server, you must set this parameter to the same or higher value than on
the master server. Otherwise, queries will not be allowed in the standby server.
max_parallel_workers_per_gather (integer)
Sets the maximum number of workers that can be started by a single Gather or Gather Merge
node. Parallel workers are taken from the pool of processes established by max_worker_process-
es, limited by max_parallel_workers. Note that the requested number of workers may not actually
be available at run time. If this occurs, the plan will run with fewer workers than expected, which
may be inefficient. The default value is 2. Setting this value to 0 disables parallel query execution.
Note that parallel queries may consume very substantially more resources than non-parallel
queries, because each worker process is a completely separate process which has roughly the
same impact on the system as an additional user session. This should be taken into account when
choosing a value for this setting, as well as when configuring other settings that control resource
utilization, such as work_mem. Resource limits such as work_mem are applied individually to
each worker, which means the total utilization may be much higher across all processes than it
would normally be for any single process. For example, a parallel query using 4 workers may use
up to 5 times as much CPU time, memory, I/O bandwidth, and so forth as a query which uses
no workers at all.
max_parallel_maintenance_workers (integer)
Sets the maximum number of parallel workers that can be started by a single utility command.
Currently, the only parallel utility command that supports the use of parallel workers is CRE-
ATE INDEX, and only when building a B-tree index. Parallel workers are taken from the pool of
processes established by max_worker_processes, limited by max_parallel_workers. Note that the
requested number of workers may not actually be available at run time. If this occurs, the utility
operation will run with fewer workers than expected. The default value is 2. Setting this value to
0 disables the use of parallel workers by utility commands.
Note that parallel utility commands should not consume substantially more memory than equiva-
lent non-parallel operations. This strategy differs from that of parallel query, where resource lim-
its generally apply per worker process. Parallel utility commands treat the resource limit main-
tenance_work_mem as a limit to be applied to the entire utility command, regardless of the
number of parallel worker processes. However, parallel utility commands may still consume sub-
stantially more CPU resources and I/O bandwidth.
max_parallel_workers (integer)
Sets the maximum number of workers that the system can support for parallel operations. The
default value is 8. When increasing or decreasing this value, consider also adjusting max_paral-
lel_maintenance_workers and max_parallel_workers_per_gather. Also, note that a setting for this
value which is higher than max_worker_processes will have no effect, since parallel workers are
taken from the pool of worker processes established by that setting.
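A possible starting configuration for a hypothetical 8-core machine, keeping the per-gather limit well
below the global worker pool; all values are assumptions:
ALTER SYSTEM SET max_worker_processes = 16;             -- takes effect at the next server start
ALTER SYSTEM SET max_parallel_workers = 8;
ALTER SYSTEM SET max_parallel_workers_per_gather = 4;
ALTER SYSTEM SET max_parallel_maintenance_workers = 4;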
backend_flush_after (integer)
Whenever more than backend_flush_after bytes have been written by a single backend,
attempt to force the OS to issue these writes to the underlying storage. Doing so will limit the
amount of dirty data in the kernel's page cache, reducing the likelihood of stalls when an fsync
is issued at the end of a checkpoint, or when the OS writes data back in larger batches in the
background. Often that will result in greatly reduced transaction latency, but there also are some
cases, especially with workloads that are bigger than shared_buffers, but smaller than the OS's
page cache, where performance might degrade. This setting may have no effect on some platforms.
The valid range is between 0, which disables forced writeback, and 2MB. The default is 0, i.e.,
no forced writeback. (If BLCKSZ is not 8kB, the maximum value scales proportionally to it.)
old_snapshot_threshold (integer)
Sets the minimum time that a snapshot can be used without risk of a snapshot too old error
occurring when using the snapshot. This parameter can only be set at server start.
Beyond the threshold, old data may be vacuumed away. This can help prevent bloat in the face
of snapshots which remain in use for a long time. To prevent incorrect results due to cleanup of
data which would otherwise be visible to the snapshot, an error is generated when the snapshot is
older than this threshold and the snapshot is used to read a page which has been modified since
the snapshot was built.
A value of -1 disables this feature, and is the default. Useful values for production work probably
range from a small number of hours to a few days. The setting will be coerced to a granularity of
minutes, and small numbers (such as 0 or 1min) are only allowed because they may sometimes
be useful for testing. While a setting as high as 60d is allowed, please note that in many workloads
extreme bloat or transaction ID wraparound may occur in much shorter time frames.
When this feature is enabled, freed space at the end of a relation cannot be released to the oper-
ating system, since that could remove information needed to detect the snapshot too old
condition. All space allocated to a relation remains associated with that relation for reuse only
within that relation unless explicitly freed (for example, with VACUUM FULL).
This setting does not attempt to guarantee that an error will be generated under any particular
circumstances. In fact, if the correct results can be generated from (for example) a cursor which has
materialized a result set, no error will be generated even if the underlying rows in the referenced
table have been vacuumed away. Some tables cannot safely be vacuumed early, and so will not
be affected by this setting, such as system catalogs. For such tables this setting will neither reduce
bloat nor create a possibility of a snapshot too old error on scanning.
19.5. Write Ahead Log
19.5.1. Settings
wal_level (enum)
wal_level determines how much information is written to the WAL. The default value is
replica, which writes enough data to support WAL archiving and replication, including run-
ning read-only queries on a standby server. minimal removes all logging except the information
required to recover from a crash or immediate shutdown. Finally, logical adds information
necessary to support logical decoding. Each level includes the information logged at all lower
levels. This parameter can only be set at server start.
In minimal level, WAL-logging of some bulk operations can be safely skipped, which can make
those operations much faster (see Section 14.4.7). Operations in which this optimization can be
applied include:
CREATE TABLE AS
CREATE INDEX
CLUSTER
COPY into tables that were created or truncated in the same transaction
But minimal WAL does not contain enough information to reconstruct the data from a base backup
and the WAL logs, so replica or higher must be used to enable WAL archiving (archive_mode)
and streaming replication.
In logical level, the same information is logged as with replica, plus information needed
to allow extracting logical change sets from the WAL. Using a level of logical will increase
the WAL volume, particularly if many tables are configured for REPLICA IDENTITY FULL
and many UPDATE and DELETE statements are executed.
In releases prior to 9.6, this parameter also allowed the values archive and hot_standby.
These are still accepted but mapped to replica.
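For example, to prepare a server for WAL archiving or streaming replication one might set the
following (a restart is required):
ALTER SYSTEM SET wal_level = 'replica';  -- use 'logical' instead if logical decoding is needed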
fsync (boolean)
If this parameter is on, the PostgreSQL server will try to make sure that updates are phys-
ically written to disk, by issuing fsync() system calls or various equivalent methods (see
wal_sync_method). This ensures that the database cluster can recover to a consistent state after
an operating system or hardware crash.
While turning off fsync is often a performance benefit, this can result in unrecoverable data
corruption in the event of a power failure or system crash. Thus it is only advisable to turn off
fsync if you can easily recreate your entire database from external data.
Examples of safe circumstances for turning off fsync include the initial loading of a new data-
base cluster from a backup file, using a database cluster for processing a batch of data after which
the database will be thrown away and recreated, or for a read-only database clone which gets
recreated frequently and is not used for failover. High quality hardware alone is not a sufficient
justification for turning off fsync.
For reliable recovery when changing fsync off to on, it is necessary to force all modified buffers
in the kernel to durable storage. This can be done while the cluster is shutdown or while fsync
is on by running initdb --sync-only, running sync, unmounting the file system, or re-
booting the server.
In many situations, turning off synchronous_commit for noncritical transactions can provide much
of the potential performance benefit of turning off fsync, without the attendant risks of data
corruption.
fsync can only be set in the postgresql.conf file or on the server command line. If you
turn this parameter off, also consider turning off full_page_writes.
synchronous_commit (enum)
Specifies how much WAL processing must complete before the database server returns a
“success” indication to the client. Valid values are remote_apply, on (the default), re-
mote_write, local, and off.
When set to remote_apply, commits will wait until replies from the current synchronous
standby(s) indicate they have received the commit record of the transaction and applied it, so
that it has become visible to queries on the standby(s), and also written to durable storage on
the standbys. This will cause much larger commit delays than previous settings since it waits for
WAL replay. When set to on, commits wait until replies from the current synchronous standby(s)
indicate they have received the commit record of the transaction and flushed it to durable storage.
This ensures the transaction will not be lost unless both the primary and all synchronous standbys
suffer corruption of their database storage. When set to remote_write, commits will wait until
replies from the current synchronous standby(s) indicate they have received the commit record
of the transaction and written it to their file systems. This setting ensures data preservation if a
standby instance of PostgreSQL crashes, but not if the standby suffers an operating-system-level
crash because the data has not necessarily reached durable storage on the standby. The setting
local causes commits to wait for local flush to disk, but not for replication. This is usually not
desirable when synchronous replication is in use, but is provided for completeness.
This parameter can be changed at any time; the behavior for any one transaction is determined by
the setting in effect when it commits. It is therefore possible, and useful, to have some transactions
commit synchronously and others asynchronously. For example, to make a single multistatement
transaction commit asynchronously when the default is the opposite, issue SET LOCAL syn-
chronous_commit TO OFF within the transaction.
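The SET LOCAL example mentioned above, written out as a sketch; the table is hypothetical:
BEGIN;
SET LOCAL synchronous_commit TO OFF;                        -- only this transaction commits asynchronously
INSERT INTO audit_log (message) VALUES ('batch finished');  -- "audit_log" is a hypothetical table
COMMIT;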
wal_sync_method (enum)
Method used for forcing WAL updates out to disk. If fsync is off then this setting is irrelevant,
since WAL file updates will not be forced out at all. Possible values are:
open_datasync (write WAL files with open() option O_DSYNC)
fdatasync (call fdatasync() at each commit)
fsync (call fsync() at each commit)
fsync_writethrough (call fsync() at each commit, forcing write-through of any disk write cache)
open_sync (write WAL files with open() option O_SYNC)
The open_* options also use O_DIRECT if available. Not all of these choices are available on
all platforms. The default is the first method in the above list that is supported by the platform,
except that fdatasync is the default on Linux and FreeBSD. The default is not necessarily
ideal; it might be necessary to change this setting or other aspects of your system configuration
in order to create a crash-safe configuration or achieve optimal performance. These aspects are
discussed in Section 30.1. This parameter can only be set in the postgresql.conf file or on
the server command line.
full_page_writes (boolean)
When this parameter is on, the PostgreSQL server writes the entire content of each disk page
to WAL during the first modification of that page after a checkpoint. This is needed because a
page write that is in process during an operating system crash might be only partially completed,
leading to an on-disk page that contains a mix of old and new data. The row-level change data
normally stored in WAL will not be enough to completely restore such a page during post-crash
recovery. Storing the full page image guarantees that the page can be correctly restored, but at
the price of increasing the amount of data that must be written to WAL. (Because WAL replay
always starts from a checkpoint, it is sufficient to do this during the first change of each page
after a checkpoint. Therefore, one way to reduce the cost of full-page writes is to increase the
checkpoint interval parameters.)
Turning this parameter off speeds normal operation, but might lead to either unrecoverable data
corruption, or silent data corruption, after a system failure. The risks are similar to turning off
fsync, though smaller, and it should be turned off only based on the same circumstances rec-
ommended for that parameter.
Turning off this parameter does not affect use of WAL archiving for point-in-time recovery
(PITR) (see Section 25.3).
This parameter can only be set in the postgresql.conf file or on the server command line.
The default is on.
wal_log_hints (boolean)
When this parameter is on, the PostgreSQL server writes the entire content of each disk page to
WAL during the first modification of that page after a checkpoint, even for non-critical modifi-
cations of so-called hint bits.
If data checksums are enabled, hint bit updates are always WAL-logged and this setting is ignored.
You can use this setting to test how much extra WAL-logging would occur if your database had
data checksums enabled.
This parameter can only be set at server start. The default value is off.
wal_compression (boolean)
When this parameter is on, the PostgreSQL server compresses a full page image written to WAL
when full_page_writes is on or during a base backup. A compressed page image will be decom-
pressed during WAL replay. The default value is off. Only superusers can change this setting.
Turning this parameter on can reduce the WAL volume without increasing the risk of unrecover-
able data corruption, but at the cost of some extra CPU spent on the compression during WAL
logging and on the decompression during WAL replay.
wal_buffers (integer)
The amount of shared memory used for WAL data that has not yet been written to disk. The default
setting of -1 selects a size equal to 1/32nd (about 3%) of shared_buffers, but not less than 64kB
nor more than the size of one WAL segment, typically 16MB. This value can be set manually
if the automatic choice is too large or too small, but any positive value less than 32kB will be
treated as 32kB. This parameter can only be set at server start.
The contents of the WAL buffers are written out to disk at every transaction commit, so extremely
large values are unlikely to provide a significant benefit. However, setting this value to at least a
few megabytes can improve write performance on a busy server where many clients are commit-
ting at once. The auto-tuning selected by the default setting of -1 should give reasonable results
in most cases.
wal_writer_delay (integer)
Specifies how often the WAL writer flushes WAL. After flushing WAL it sleeps for
wal_writer_delay milliseconds, unless woken up by an asynchronously committing trans-
action. If the last flush happened less than wal_writer_delay milliseconds ago and less
than wal_writer_flush_after bytes of WAL have been produced since, then WAL is
only written to the operating system, not flushed to disk. The default value is 200 milliseconds
(200ms). Note that on many systems, the effective resolution of sleep delays is 10 milliseconds;
setting wal_writer_delay to a value that is not a multiple of 10 might have the same re-
sults as setting it to the next higher multiple of 10. This parameter can only be set in the post-
gresql.conf file or on the server command line.
wal_writer_flush_after (integer)
Specifies how often the WAL writer flushes WAL. If the last flush happened less than
wal_writer_delay milliseconds ago and less than wal_writer_flush_after bytes of
WAL have been produced since, then WAL is only written to the operating system, not flushed
to disk. If wal_writer_flush_after is set to 0 then WAL data is flushed immediately.
The default is 1MB. This parameter can only be set in the postgresql.conf file or on the
server command line.
commit_delay (integer)
commit_delay adds a time delay, measured in microseconds, before a WAL flush is initiated.
This can improve group commit throughput by allowing a larger number of transactions to commit
via a single WAL flush, if system load is high enough that additional transactions become ready
to commit within the given interval. However, it also increases latency by up to commit_de-
lay microseconds for each WAL flush. Because the delay is just wasted if no other transactions
become ready to commit, a delay is only performed if at least commit_siblings other trans-
actions are active when a flush is about to be initiated. Also, no delays are performed if fsync
is disabled. The default commit_delay is zero (no delay). Only superusers can change this
setting.
In PostgreSQL releases prior to 9.3, commit_delay behaved differently and was much less
effective: it affected only commits, rather than all WAL flushes, and waited for the entire config-
ured delay even if the WAL flush was completed sooner. Beginning in PostgreSQL 9.3, the first
process that becomes ready to flush waits for the configured interval, while subsequent processes
wait only until the leader completes the flush operation.
commit_siblings (integer)
Minimum number of concurrent open transactions to require before performing the com-
mit_delay delay. A larger value makes it more probable that at least one other transaction will
become ready to commit during the delay interval. The default is five transactions.
19.5.2. Checkpoints
checkpoint_timeout (integer)
Maximum time between automatic WAL checkpoints, in seconds. The valid range is between
30 seconds and one day. The default is five minutes (5min). Increasing this parameter can in-
crease the amount of time needed for crash recovery. This parameter can only be set in the post-
gresql.conf file or on the server command line.
checkpoint_completion_target (floating point)
Specifies the target of checkpoint completion, as a fraction of total time between checkpoints. The
default is 0.5. This parameter can only be set in the postgresql.conf file or on the server
command line.
checkpoint_flush_after (integer)
Whenever more than checkpoint_flush_after bytes have been written while performing
a checkpoint, attempt to force the OS to issue these writes to the underlying storage. Doing so will
limit the amount of dirty data in the kernel's page cache, reducing the likelihood of stalls when an
fsync is issued at the end of the checkpoint, or when the OS writes data back in larger batches
in the background. Often that will result in greatly reduced transaction latency, but there also are
some cases, especially with workloads that are bigger than shared_buffers, but smaller than the
OS's page cache, where performance might degrade. This setting may have no effect on some
platforms. The valid range is between 0, which disables forced writeback, and 2MB. The default
is 256kB on Linux, 0 elsewhere. (If BLCKSZ is not 8kB, the default and maximum values scale
proportionally to it.) This parameter can only be set in the postgresql.conf file or on the
server command line.
checkpoint_warning (integer)
Write a message to the server log if checkpoints caused by the filling of WAL segment files happen
closer together than this many seconds (which suggests that max_wal_size ought to be raised).
The default is 30 seconds (30s). Zero disables the warning. No warnings will be generated if
checkpoint_timeout is less than checkpoint_warning. This parameter can only be
set in the postgresql.conf file or on the server command line.
max_wal_size (integer)
Maximum size to let the WAL grow during automatic checkpoints. This is a soft limit; WAL
size can exceed max_wal_size under special circumstances, like under heavy load, a failing
archive_command, or a high wal_keep_segments setting. The default is 1 GB. Increas-
ing this parameter can increase the amount of time needed for crash recovery. This parameter can
only be set in the postgresql.conf file or on the server command line.
min_wal_size (integer)
As long as WAL disk usage stays below this setting, old WAL files are always recycled for future
use at a checkpoint, rather than removed. This can be used to ensure that enough WAL space
is reserved to handle spikes in WAL usage, for example when running large batch jobs. The
default is 80 MB. This parameter can only be set in the postgresql.conf file or on the server
command line.
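An illustrative set of checkpoint settings for a write-heavy server, spacing checkpoints further apart
and spreading their I/O over a larger fraction of the interval; all values are assumptions:
ALTER SYSTEM SET checkpoint_timeout = '15min';
ALTER SYSTEM SET max_wal_size = '4GB';
ALTER SYSTEM SET min_wal_size = '1GB';
ALTER SYSTEM SET checkpoint_completion_target = 0.9;
SELECT pg_reload_conf();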
19.5.3. Archiving
archive_mode (enum)
When archive_mode is enabled, completed WAL segments are sent to archive storage by
setting archive_command. In addition to off, which disables archiving, there are two modes: on and always.
During normal operation, there is no difference between the two modes, but when set to always
the WAL archiver is enabled also during archive recovery or standby mode. In always mode,
all files restored from the archive or streamed with streaming replication will be archived (again).
See Section 26.2.9 for details.
archive_command (string)
The local shell command to execute to archive a completed WAL file segment. Any %p in the
string is replaced by the path name of the file to archive, and any %f is replaced by only the
file name. (The path name is relative to the working directory of the server, i.e., the cluster's
data directory.) Use %% to embed an actual % character in the command. It is important for the
command to return a zero exit status only if it succeeds. For more information see Section 25.3.1.
This parameter can only be set in the postgresql.conf file or on the server command line.
It is ignored unless archive_mode was enabled at server start. If archive_command is
an empty string (the default) while archive_mode is enabled, WAL archiving is temporarily
disabled, but the server continues to accumulate WAL segment files in the expectation that a
command will soon be provided. Setting archive_command to a command that does nothing
but return true, e.g., /bin/true (REM on Windows), effectively disables archiving, but also
breaks the chain of WAL files needed for archive recovery, so it should only be used in unusual
circumstances.
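A minimal sketch of a file-copy archive command; the destination directory is an assumption, and
production systems usually use a purpose-built archiving tool instead:
ALTER SYSTEM SET archive_mode = on;  -- takes effect at the next server start
ALTER SYSTEM SET archive_command = 'test ! -f /mnt/wal_archive/%f && cp %p /mnt/wal_archive/%f';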
archive_timeout (integer)
The archive_command is only invoked for completed WAL segments. Hence, if your server gen-
erates little WAL traffic (or has slack periods where it does so), there could be a long delay be-
tween the completion of a transaction and its safe recording in archive storage. To limit how
old unarchived data can be, you can set archive_timeout to force the server to switch to
a new WAL segment file periodically. When this parameter is greater than zero, the server will
switch to a new segment file whenever this many seconds have elapsed since the last segment
file switch, and there has been any database activity, including a single checkpoint (checkpoints
are skipped if there is no database activity). Note that archived files that are closed early due to
a forced switch are still the same length as completely full files. Therefore, it is unwise to use
a very short archive_timeout — it will bloat your archive storage. archive_timeout
settings of a minute or so are usually reasonable. You should consider using streaming replication,
instead of archiving, if you want data to be copied off the master server more quickly than that.
This parameter can only be set in the postgresql.conf file or on the server command line.
19.6. Replication
These settings control the behavior of the built-in streaming replication feature (see Section 26.2.5).
Servers will be either a master or a standby server. Masters can send data, while standbys are always
receivers of replicated data. When cascading replication (see Section 26.2.7) is used, standby servers
can also be senders, as well as receivers. Parameters are mainly for sending and standby servers,
though some parameters have meaning only on the master server. Settings may vary across the cluster
without problems if that is required.
max_wal_senders (integer)
Specifies the maximum number of concurrent connections from standby servers or streaming base
backup clients (i.e., the maximum number of simultaneously running WAL sender processes). The
default is 10. The value 0 means replication is disabled. WAL sender processes count towards the
total number of connections, so this parameter's value must be less than max_connections minus
superuser_reserved_connections. Abrupt streaming client disconnection might leave an orphaned
connection slot behind until a timeout is reached, so this parameter should be set slightly higher
than the maximum number of expected clients so disconnected clients can immediately reconnect.
This parameter can only be set at server start. Also, wal_level must be set to replica or
higher to allow connections from standby servers.
max_replication_slots (integer)
Specifies the maximum number of replication slots (see Section 26.2.6) that the server can sup-
port. The default is 10. This parameter can only be set at server start. Setting it to a lower value
than the number of currently existing replication slots will prevent the server from starting. Also,
wal_level must be set to replica or higher to allow replication slots to be used.
On the subscriber side, specifies how many replication origins (see Chapter 50) can be tracked
simultaneously, effectively limiting how many logical replication subscriptions can be created on
the server. Setting it to a lower value than the current number of tracked replication origins (reflected
in pg_replication_origin_status, not pg_replication_origin) will prevent the server from starting.
wal_keep_segments (integer)
Specifies the minimum number of past log file segments kept in the pg_wal directory, in
case a standby server needs to fetch them for streaming replication. Each segment is normally
16 megabytes. If a standby server connected to the sending server falls behind by more than
wal_keep_segments segments, the sending server might remove a WAL segment still need-
ed by the standby, in which case the replication connection will be terminated. Downstream con-
nections will also eventually fail as a result. (However, the standby server can recover by fetching
the segment from archive, if WAL archiving is in use.)
This sets only the minimum number of segments retained in pg_wal; the system might need to
retain more segments for WAL archival or to recover from a checkpoint. If wal_keep_seg-
ments is zero (the default), the system doesn't keep any extra segments for standby purposes,
so the number of old WAL segments available to standby servers is a function of the location
of the previous checkpoint and status of WAL archiving. This parameter can only be set in the
postgresql.conf file or on the server command line.
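For example, to keep roughly 1 GB of recent WAL (64 segments of 16 MB each) available for standbys
that fall behind, a hypothetical setting is:
ALTER SYSTEM SET wal_keep_segments = 64;
SELECT pg_reload_conf();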
wal_sender_timeout (integer)
Terminate replication connections that are inactive longer than the specified number of millisec-
onds. This is useful for the sending server to detect a standby crash or network outage. A value of
zero disables the timeout mechanism. This parameter can only be set in the postgresql.conf
file or on the server command line. The default value is 60 seconds.
track_commit_timestamp (boolean)
Record commit time of transactions. This parameter can only be set in postgresql.conf file
or on the server command line. The default value is off.
synchronous_standby_names (string)
Specifies a list of standby servers that can support synchronous replication, as described in Sec-
tion 26.2.8. There will be one or more active synchronous standbys; transactions waiting for
commit will be allowed to proceed after these standby servers confirm receipt of their data. The
synchronous standbys will be those whose names appear in this list, and that are both currently
connected and streaming data in real-time (as shown by a state of streaming in the pg_s-
tat_replication view). Specifying more than one synchronous standby can allow for very
high availability and protection against data loss.
The name of a standby server for this purpose is the application_name setting of the stand-
by, as set in the standby's connection information. In case of a physical replication standby, this
should be set in the primary_conninfo setting in recovery.conf; the default is
walreceiver. For logical replication, this can be set in the connection information of the subscrip-
tion, and it defaults to the subscription name. For other replication stream consumers, consult
their documentation.
This parameter specifies a list of standby servers using any of the following syntaxes:
[FIRST] num_sync ( standby_name [, ...] )
ANY num_sync ( standby_name [, ...] )
standby_name [, ...]
where num_sync is the number of synchronous standbys that transactions need to wait for replies
from, and standby_name is the name of a standby server. FIRST and ANY specify the method
to choose synchronous standbys from the listed servers.
The keyword FIRST, coupled with num_sync, specifies a priority-based synchronous replica-
tion and makes transaction commits wait until their WAL records are replicated to num_sync
synchronous standbys chosen based on their priorities. For example, a setting of FIRST 3 (s1,
s2, s3, s4) will cause each commit to wait for replies from three higher-priority standbys
chosen from standby servers s1, s2, s3 and s4. The standbys whose names appear earlier in
the list are given higher priority and will be considered as synchronous. Other standby servers
appearing later in this list represent potential synchronous standbys. If any of the current synchro-
nous standbys disconnects for whatever reason, it will be replaced immediately with the next-
highest-priority standby. The keyword FIRST is optional.
The keyword ANY, coupled with num_sync, specifies a quorum-based synchronous replication
and makes transaction commits wait until their WAL records are replicated to at least num_sync
listed standbys. For example, a setting of ANY 3 (s1, s2, s3, s4) will cause each commit
to proceed as soon as at least any three standbys of s1, s2, s3 and s4 reply.
FIRST and ANY are case-insensitive. If these keywords are used as the name of a standby server,
its standby_name must be double-quoted.
The third syntax was used before PostgreSQL version 9.6 and is still supported. It's the same as
the first syntax with FIRST and num_sync equal to 1. For example, FIRST 1 (s1, s2)
and s1, s2 have the same meaning: either s1 or s2 is chosen as a synchronous standby.
There is no mechanism to enforce uniqueness of standby names. In case of duplicates one of the
matching standbys will be considered as higher priority, though exactly which one is indetermi-
nate.
Note
Each standby_name should have the form of a valid SQL identifier, unless it is *. You
can use double-quoting if necessary. But note that standby_names are compared to
standby application names case-insensitively, whether double-quoted or not.
If no synchronous standby names are specified here, then synchronous replication is not enabled
and transaction commits will not wait for replication. This is the default configuration. Even
when synchronous replication is enabled, individual transactions can be configured not to wait
for replication by setting the synchronous_commit parameter to local or off.
This parameter can only be set in the postgresql.conf file or on the server command line.
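To illustrate, a minimal postgresql.conf sketch using the standby names from the examples above
(the names s1, s2 and s3 are placeholders for your standbys' application_name settings):
# priority-based: wait for the two highest-priority connected standbys
synchronous_standby_names = 'FIRST 2 (s1, s2, s3)'
# quorum-based alternative: wait for replies from any two of the three
# synchronous_standby_names = 'ANY 2 (s1, s2, s3)'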
vacuum_defer_cleanup_age (integer)
Specifies the number of transactions by which VACUUM and HOT updates will defer cleanup
of dead row versions. The default is zero transactions, meaning that dead row versions can be
removed as soon as possible, that is, as soon as they are no longer visible to any open transaction.
You may wish to set this to a non-zero value on a primary server that is supporting hot standby
servers, as described in Section 26.5. This allows more time for queries on the standby to complete
without incurring conflicts due to early cleanup of rows. However, since the value is measured
in terms of number of write transactions occurring on the primary server, it is difficult to predict
just how much additional grace time will be made available to standby queries. This parameter
can only be set in the postgresql.conf file or on the server command line.
This does not prevent cleanup of dead rows which have reached the age specified by old_s-
napshot_threshold.
hot_standby (boolean)
Specifies whether or not you can connect and run queries during recovery, as described in Sec-
tion 26.5. The default value is on. This parameter can only be set at server start. It only has effect
during archive recovery or in standby mode.
max_standby_archive_delay (integer)
When Hot Standby is active, this parameter determines how long the standby server should wait
before canceling standby queries that conflict with about-to-be-applied WAL entries, as described
in Section 26.5.2. max_standby_archive_delay applies when WAL data is being read
from WAL archive (and is therefore not current). The default is 30 seconds. Units are millisec-
onds if not specified. A value of -1 allows the standby to wait forever for conflicting queries
to complete. This parameter can only be set in the postgresql.conf file or on the server
command line.
Note that max_standby_archive_delay is not the same as the maximum length of time
a query can run before cancellation; rather it is the maximum total time allowed to apply any
one WAL segment's data. Thus, if one query has resulted in significant delay earlier in the WAL
segment, subsequent conflicting queries will have much less grace time.
max_standby_streaming_delay (integer)
When Hot Standby is active, this parameter determines how long the standby server should wait
before canceling standby queries that conflict with about-to-be-applied WAL entries, as described
in Section 26.5.2. max_standby_streaming_delay applies when WAL data is being re-
ceived via streaming replication. The default is 30 seconds. Units are milliseconds if not speci-
fied. A value of -1 allows the standby to wait forever for conflicting queries to complete. This
parameter can only be set in the postgresql.conf file or on the server command line.
Note that max_standby_streaming_delay is not the same as the maximum length of time
a query can run before cancellation; rather it is the maximum total time allowed to apply WAL data
once it has been received from the primary server. Thus, if one query has resulted in significant
delay, subsequent conflicting queries will have much less grace time until the standby server has
caught up again.
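As a sketch, a standby's postgresql.conf might set both delays explicitly; the values shown here are
simply the defaults described above:
max_standby_archive_delay = 30s
max_standby_streaming_delay = 30s
# a value of -1 would instead let conflicting standby queries run to completion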
wal_receiver_status_interval (integer)
Specifies the minimum frequency for the WAL receiver process on the standby to send informa-
tion about replication progress to the primary or upstream standby, where it can be seen using
the pg_stat_replication view. The standby will report the last write-ahead log location
it has written, the last position it has flushed to disk, and the last position it has applied. This
parameter's value is the maximum interval, in seconds, between reports. Updates are sent each
time the write or flush positions change, or at least as often as specified by this parameter. Thus,
the apply position may lag slightly behind the true position. Setting this parameter to zero disables
status updates completely. This parameter can only be set in the postgresql.conf file or on
the server command line. The default value is 10 seconds.
hot_standby_feedback (boolean)
Specifies whether or not a hot standby will send feedback to the primary or upstream standby
about queries currently executing on the standby. This parameter can be used to eliminate query
cancels caused by cleanup records, but can cause database bloat on the primary for some work-
loads. Feedback messages will not be sent more frequently than once per wal_receiver_s-
tatus_interval. The default value is off. This parameter can only be set in the post-
gresql.conf file or on the server command line.
If cascaded replication is in use the feedback is passed upstream until it eventually reaches the
primary. Standbys make no use of the feedback they receive other than to pass it upstream.
This setting does not override the behavior of old_snapshot_threshold on the primary; a
snapshot on the standby which exceeds the primary's age threshold can become invalid, resulting
in cancellation of transactions on the standby. This is because old_snapshot_threshold is
intended to provide an absolute limit on the time which dead rows can contribute to bloat, which
would otherwise be violated because of the configuration of a standby.
wal_receiver_timeout (integer)
Terminate replication connections that are inactive longer than the specified number of millisec-
onds. This is useful for the receiving standby server to detect a primary node crash or network
outage. A value of zero disables the timeout mechanism. This parameter can only be set in the
postgresql.conf file or on the server command line. The default value is 60 seconds.
wal_retrieve_retry_interval (integer)
Specify how long the standby server should wait when WAL data is not available from any sources
(streaming replication, local pg_wal or WAL archive) before retrying to retrieve WAL data.
This parameter can only be set in the postgresql.conf file or on the server command line.
The default value is 5 seconds. Units are milliseconds if not specified.
This parameter is useful in configurations where a node in recovery needs to control the amount
of time to wait for new WAL data to be available. For example, in archive recovery, it is possible
to make the recovery more responsive in the detection of a new WAL log file by reducing the
value of this parameter. On a system with low WAL activity, increasing it reduces the number of
requests needed to access the WAL archive, which is useful, for example, in cloud environments
where the number of times the infrastructure is accessed is taken into account.
19.6.4. Subscribers
These settings control the behavior of a logical replication subscriber. Their values on the publisher
are irrelevant.
max_logical_replication_workers (integer)
Specifies the maximum number of logical replication workers. This includes both apply workers
and table synchronization workers.
Logical replication workers are taken from the pool defined by max_worker_processes.
The default value is 4. This parameter can only be set at server start.
max_sync_workers_per_subscription (integer)
Maximum number of synchronization workers per subscription. This parameter controls the
amount of parallelism of the initial data copy during the subscription initialization or when new
tables are added.
The synchronization workers are taken from the pool defined by max_logical_replica-
tion_workers.
The default value is 2. This parameter can only be set in the postgresql.conf file or on the
server command line.
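As a sketch, a subscriber's postgresql.conf might reserve workers as follows; the logical replication
workers come out of the max_worker_processes pool, so that setting (assumed here to be 8) must be
large enough to accommodate them:
max_worker_processes = 8
max_logical_replication_workers = 4
max_sync_workers_per_subscription = 2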
19.7. Query Planning
19.7.1. Planner Method Configuration
enable_bitmapscan (boolean)
Enables or disables the query planner's use of bitmap-scan plan types. The default is on.
enable_gathermerge (boolean)
Enables or disables the query planner's use of gather merge plan types. The default is on.
enable_hashagg (boolean)
Enables or disables the query planner's use of hashed aggregation plan types. The default is on.
enable_hashjoin (boolean)
Enables or disables the query planner's use of hash-join plan types. The default is on.
enable_indexscan (boolean)
Enables or disables the query planner's use of index-scan plan types. The default is on.
enable_indexonlyscan (boolean)
Enables or disables the query planner's use of index-only-scan plan types (see Section 11.9). The
default is on.
enable_material (boolean)
Enables or disables the query planner's use of materialization. It is impossible to suppress mate-
rialization entirely, but turning this variable off prevents the planner from inserting materialize
nodes except in cases where it is required for correctness. The default is on.
enable_mergejoin (boolean)
Enables or disables the query planner's use of merge-join plan types. The default is on.
enable_nestloop (boolean)
Enables or disables the query planner's use of nested-loop join plans. It is impossible to suppress
nested-loop joins entirely, but turning this variable off discourages the planner from using one if
there are other methods available. The default is on.
enable_parallel_append (boolean)
Enables or disables the query planner's use of parallel-aware append plan types. The default is on.
enable_parallel_hash (boolean)
Enables or disables the query planner's use of hash-join plan types with parallel hash. Has no
effect if hash-join plans are not also enabled. The default is on.
enable_partition_pruning (boolean)
Enables or disables the query planner's ability to eliminate a partitioned table's partitions from
query plans. This also controls the planner's ability to generate query plans which allow the
query executor to remove (ignore) partitions during query execution. The default is on. See Sec-
tion 5.10.4 for details.
enable_partitionwise_join (boolean)
Enables or disables the query planner's use of partitionwise join, which allows a join between
partitioned tables to be performed by joining the matching partitions. Partitionwise join currently
applies only when the join conditions include all the partition keys, which must be of the same
data type and have exactly matching sets of child partitions. Because partitionwise join planning
can use significantly more CPU time and memory during planning, the default is off.
enable_partitionwise_aggregate (boolean)
Enables or disables the query planner's use of partitionwise grouping or aggregation, which allows
grouping or aggregation on partitioned tables to be performed separately for each partition. If the
GROUP BY clause does not include the partition keys, only partial aggregation can be performed
on a per-partition basis, and finalization must be performed later. Because partitionwise grouping
or aggregation can use significantly more CPU time and memory during planning, the default
is off.
enable_seqscan (boolean)
Enables or disables the query planner's use of sequential scan plan types. It is impossible to sup-
press sequential scans entirely, but turning this variable off discourages the planner from using
one if there are other methods available. The default is on.
enable_sort (boolean)
Enables or disables the query planner's use of explicit sort steps. It is impossible to suppress
explicit sorts entirely, but turning this variable off discourages the planner from using one if there
are other methods available. The default is on.
enable_tidscan (boolean)
Enables or disables the query planner's use of TID scan plan types. The default is on.
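Because these planner method settings can be changed per session with SET, a common way to test an
alternative plan is to toggle one of them temporarily. A sketch, using hypothetical tables t1 and t2:
SET enable_nestloop = off;                    -- discourage nested-loop joins for this session
EXPLAIN SELECT * FROM t1 JOIN t2 USING (id);  -- inspect the plan chosen without nested loops
RESET enable_nestloop;                        -- return to the default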
19.7.2. Planner Cost Constants
Note
Unfortunately, there is no well-defined method for determining ideal values for the cost vari-
ables. They are best treated as averages over the entire mix of queries that a particular instal-
lation will receive. This means that changing them on the basis of just a few experiments is
very risky.
seq_page_cost (floating point)
Sets the planner's estimate of the cost of a disk page fetch that is part of a series of sequential
fetches. The default is 1.0. This value can be overridden for tables and indexes in a particular
tablespace by setting the tablespace parameter of the same name (see ALTER TABLESPACE).
random_page_cost (floating point)
Sets the planner's estimate of the cost of a non-sequentially-fetched disk page. The default is
4.0. This value can be overridden for tables and indexes in a particular tablespace by setting the
tablespace parameter of the same name (see ALTER TABLESPACE).
Reducing this value relative to seq_page_cost will cause the system to prefer index scans;
raising it will make index scans look relatively more expensive. You can raise or lower both values
together to change the importance of disk I/O costs relative to CPU costs, which are described
by the following parameters.
Random access to mechanical disk storage is normally much more expensive than four times
sequential access. However, a lower default is used (4.0) because the majority of random accesses
to disk, such as indexed reads, are assumed to be in cache. The default value can be thought of
as modeling random access as 40 times slower than sequential, while expecting 90% of random
reads to be cached.
If you believe a 90% cache rate is an incorrect assumption for your workload, you can increase
random_page_cost to better reflect the true cost of random storage reads. Correspondingly, if your
data is likely to be completely in cache, such as when the database is smaller than the total server
memory, decreasing random_page_cost can be appropriate. Storage that has a low random read
cost relative to sequential, e.g., solid-state drives, might also be better modeled with a lower value
for random_page_cost, e.g., 1.1.
Tip
Although the system will let you set random_page_cost to less than se-
q_page_cost, it is not physically sensible to do so. However, setting them equal makes
sense if the database is entirely cached in RAM, since in that case there is no penalty for
touching pages out of sequence. Also, in a heavily-cached database you should lower both
values relative to the CPU parameters, since the cost of fetching a page already in RAM
is much smaller than it would normally be.
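The per-tablespace override mentioned above is set with ALTER TABLESPACE; for example, for a
hypothetical tablespace ssd_space backed by solid-state storage:
ALTER TABLESPACE ssd_space SET (random_page_cost = 1.1);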
cpu_tuple_cost (floating point)
Sets the planner's estimate of the cost of processing each row during a query. The default is 0.01.
cpu_index_tuple_cost (floating point)
Sets the planner's estimate of the cost of processing each index entry during an index scan. The
default is 0.005.
cpu_operator_cost (floating point)
Sets the planner's estimate of the cost of processing each operator or function executed during a
query. The default is 0.0025.
parallel_setup_cost (floating point)
Sets the planner's estimate of the cost of launching parallel worker processes. The default is 1000.
parallel_tuple_cost (floating point)
Sets the planner's estimate of the cost of transferring one tuple from a parallel worker process to
another process. The default is 0.1.
min_parallel_table_scan_size (integer)
Sets the minimum amount of table data that must be scanned in order for a parallel scan to be
considered. For a parallel sequential scan, the amount of table data scanned is always equal to the
size of the table, but when indexes are used the amount of table data scanned will normally be
less. The default is 8 megabytes (8MB).
min_parallel_index_scan_size (integer)
Sets the minimum amount of index data that must be scanned in order for a parallel scan to be
considered. Note that a parallel index scan typically won't touch the entire index; it is the number
of pages which the planner believes will actually be touched by the scan which is relevant. The
default is 512 kilobytes (512kB).
effective_cache_size (integer)
Sets the planner's assumption about the effective size of the disk cache that is available to a single
query. This is factored into estimates of the cost of using an index; a higher value makes it more
likely index scans will be used, a lower value makes it more likely sequential scans will be used.
When setting this parameter you should consider both PostgreSQL's shared buffers and the portion
of the kernel's disk cache that will be used for PostgreSQL data files, though some data might exist
in both places. Also, take into account the expected number of concurrent queries on different
tables, since they will have to share the available space. This parameter has no effect on the size
of shared memory allocated by PostgreSQL, nor does it reserve kernel disk cache; it is used only
for estimation purposes. The system also does not assume data remains in the disk cache between
queries. The default is 4 gigabytes (4GB).
jit_above_cost (floating point)
Sets the query cost above which JIT compilation is activated, if enabled (see Chapter 32). Per-
forming JIT costs planning time but can accelerate query execution. Setting this to -1 disables
JIT compilation. The default is 100000.
jit_inline_above_cost (floating point)
Sets the query cost above which JIT compilation attempts to inline functions and operators. In-
lining adds planning time, but can improve execution speed. It is not meaningful to set this to less
than jit_above_cost. Setting this to -1 disables inlining. The default is 500000.
jit_optimize_above_cost (floating point)
Sets the query cost above which JIT compilation applies expensive optimizations. Such op-
timization adds planning time, but can improve execution speed. It is not meaningful to set
this to less than jit_above_cost, and it is unlikely to be beneficial to set it to more than
jit_inline_above_cost. Setting this to -1 disables expensive optimizations. The default
is 500000.
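Taken together, a postgresql.conf sketch that turns JIT on while keeping the cost thresholds at the
defaults described above might look like this:
jit = on                          # JIT compilation is disabled by default in PostgreSQL 11
jit_above_cost = 100000
jit_inline_above_cost = 500000
jit_optimize_above_cost = 500000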
19.7.3. Genetic Query Optimizer
The genetic query optimizer (GEQO) is an algorithm that does query planning using heuristic searching.
This can reduce planning time for complex queries (those joining many relations), at the cost of pro-
ducing plans that are sometimes inferior to those found by the normal exhaustive-search algorithm.
For more information see Chapter 60.
geqo (boolean)
Enables or disables genetic query optimization. This is on by default. It is usually best not to turn
it off in production; the geqo_threshold variable provides more granular control of GEQO.
geqo_threshold (integer)
Use genetic query optimization to plan queries with at least this many FROM items involved. (Note
that a FULL OUTER JOIN construct counts as only one FROM item.) The default is 12. For
simpler queries it is usually best to use the regular, exhaustive-search planner, but for queries with
many tables the exhaustive search takes too long, often longer than the penalty of executing a
suboptimal plan. Thus, a threshold on the size of the query is a convenient way to manage use
of GEQO.
geqo_effort (integer)
Controls the trade-off between planning time and query plan quality in GEQO. This variable must
be an integer in the range from 1 to 10. The default value is five. Larger values increase the time
spent doing query planning, but also increase the likelihood that an efficient query plan will be
chosen.
geqo_effort doesn't actually do anything directly; it is only used to compute the default values
for the other variables that influence GEQO behavior (described below). If you prefer, you can
set the other parameters by hand instead.
geqo_pool_size (integer)
Controls the pool size used by GEQO, that is the number of individuals in the genetic population.
It must be at least two, and useful values are typically 100 to 1000. If it is set to zero (the default
setting) then a suitable value is chosen based on geqo_effort and the number of tables in
the query.
geqo_generations (integer)
Controls the number of generations used by GEQO, that is the number of iterations of the algo-
rithm. It must be at least one, and useful values are in the same range as the pool size. If it is set
to zero (the default setting) then a suitable value is chosen based on geqo_pool_size.
geqo_selection_bias (floating point)
Controls the selection bias used by GEQO. The selection bias is the selective pressure within the
population. Values can be from 1.50 to 2.00; the latter is the default.
geqo_seed (floating point)
Controls the initial value of the random number generator used by GEQO to select random paths
through the join order search space. The value can range from zero (the default) to one. Varying
the value changes the set of join paths explored, and may result in a better or worse best path
being found.
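As a sketch, the GEQO parameters are usually left at their defaults and steered only through
geqo_effort; the values below are simply the defaults described above:
geqo = on
geqo_threshold = 12
geqo_effort = 5          # pool size, generations, etc. are derived from this value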
19.7.4. Other Planner Options
default_statistics_target (integer)
Sets the default statistics target for table columns without a column-specific target set via ALTER
TABLE SET STATISTICS. Larger values increase the time needed to do ANALYZE, but might
improve the quality of the planner's estimates. The default is 100. For more information on the
use of statistics by the PostgreSQL query planner, refer to Section 14.2.
constraint_exclusion (enum)
Controls the query planner's use of table constraints to optimize queries. The allowed values of
constraint_exclusion are on (examine constraints for all tables), off (never examine
constraints), and partition (examine constraints only for inheritance child tables and UNION
ALL subqueries). partition is the default setting. It is often used with traditional inheritance
trees to improve performance.
When this parameter allows it for a particular table, the planner compares query conditions with
the table's CHECK constraints, and omits scanning tables for which the conditions contradict the
constraints. For example:
CREATE TABLE parent(key integer, ...);
CREATE TABLE child1000(check (key between 1000 and 1999)) INHERITS(parent);
CREATE TABLE child2000(check (key between 2000 and 2999)) INHERITS(parent);
...
SELECT * FROM parent WHERE key = 2400;
With constraint exclusion enabled, this SELECT will not scan child1000 at all, improving
performance.
Currently, constraint exclusion is enabled by default only for cases that are often used to imple-
ment table partitioning via inheritance trees. Turning it on for all tables imposes extra planning
overhead that is quite noticeable on simple queries, and most often will yield no benefit for simple
queries. If you have no tables that are partitioned using traditional inheritance, you might prefer
to turn it off entirely. (Note that the equivalent feature for partitioned tables is controlled by a
separate parameter, enable_partition_pruning.)
Refer to Section 5.10.5 for more information on using constraint exclusion to implement parti-
tioning.
cursor_tuple_fraction (floating point)
Sets the planner's estimate of the fraction of a cursor's rows that will be retrieved. The default is
0.1. Smaller values of this setting bias the planner towards using “fast start” plans for cursors,
which will retrieve the first few rows quickly while perhaps taking a long time to fetch all rows.
Larger values put more emphasis on the total estimated time. At the maximum setting of 1.0,
cursors are planned exactly like regular queries, considering only the total estimated time and not
how soon the first rows might be delivered.
from_collapse_limit (integer)
The planner will merge sub-queries into upper queries if the resulting FROM list would have no
more than this many items. Smaller values reduce planning time but might yield inferior query
plans. The default is eight. For more information see Section 14.3.
Setting this value to geqo_threshold or more may trigger use of the GEQO planner, resulting in
non-optimal plans. See Section 19.7.3.
jit (boolean)
Determines whether JIT compilation may be used by PostgreSQL, if available (see Chapter 32).
The default is off.
join_collapse_limit (integer)
The planner will rewrite explicit JOIN constructs (except FULL JOINs) into lists of FROM items
whenever a list of no more than this many items would result. Smaller values reduce planning
time but might yield inferior query plans.
By default, this variable is set the same as from_collapse_limit, which is appropriate for
most uses. Setting it to 1 prevents any reordering of explicit JOINs. Thus, the explicit join order
specified in the query will be the actual order in which the relations are joined. Because the query
planner does not always choose the optimal join order, advanced users can elect to temporarily
set this variable to 1, and then specify the join order they desire explicitly. For more information
see Section 14.3.
Setting this value to geqo_threshold or more may trigger use of the GEQO planner, resulting in
non-optimal plans. See Section 19.7.3.
parallel_leader_participation (boolean)
Allows the leader process to execute the query plan under Gather and Gather Merge nodes
instead of waiting for worker processes. The default is on. Setting this value to off reduces the
likelihood that workers will become blocked because the leader is not reading tuples fast enough,
but requires the leader process to wait for worker processes to start up before the first tuples can
be produced. The degree to which the leader can help or hinder performance depends on the plan
type, number of workers and query duration.
force_parallel_mode (enum)
Allows the use of parallel queries for testing purposes even in cases where no performance benefit
is expected. The allowed values of force_parallel_mode are off (use parallel mode only
when it is expected to improve performance), on (force parallel query for all queries for which it
is thought to be safe), and regress (like on, but with additional behavior changes as explained
below).
More specifically, setting this value to on will add a Gather node to the top of any query plan
for which this appears to be safe, so that the query runs inside of a parallel worker. Even when
a parallel worker is not available or cannot be used, operations such as starting a subtransaction
that would be prohibited in a parallel query context will be prohibited unless the planner believes
that this will cause the query to fail. If failures or unexpected results occur when this option is set,
some functions used by the query may need to be marked PARALLEL UNSAFE (or, possibly,
PARALLEL RESTRICTED).
Setting this value to regress has all of the same effects as setting it to on plus some additional
effects that are intended to facilitate automated regression testing. Normally, messages from a
parallel worker include a context line indicating that, but a setting of regress suppresses this
line so that the output is the same as in non-parallel execution. Also, the Gather nodes added
to plans by this setting are hidden in EXPLAIN output so that the output matches what would be
obtained if this setting were turned off.
19.8. Error Reporting and Logging
19.8.1. Where To Log
log_destination (string)
PostgreSQL supports several methods for logging server messages, including stderr, csvlog and
syslog. On Windows, eventlog is also supported. Set this parameter to a list of desired log desti-
nations separated by commas. The default is to log to stderr only. This parameter can only be set
in the postgresql.conf file or on the server command line.
If csvlog is included in log_destination, log entries are output in “comma separated value”
(CSV) format, which is convenient for loading logs into programs. See Section 19.8.4 for details.
logging_collector must be enabled to generate CSV-format log output.
When either stderr or csvlog are included, the file current_logfiles is created to record
the location of the log file(s) currently in use by the logging collector and the associated logging
destination. This provides a convenient way to find the logs currently in use by the instance. Here
is an example of this file's content:
stderr log/postgresql.log
csvlog log/postgresql.csv
current_logfiles is recreated when a new log file is created as an effect of rotation, and
when log_destination is reloaded. It is removed when neither stderr nor csvlog are included
in log_destination, and when the logging collector is disabled.
Note
On most Unix systems, you will need to alter the configuration of your system's syslog
daemon in order to make use of the syslog option for log_destination. PostgreSQL
can log to syslog facilities LOCAL0 through LOCAL7 (see syslog_facility), but the default
syslog configuration on most platforms will discard all such messages. You will need to
add something like:
local0.* /var/log/postgresql
to the syslog daemon's configuration file to make it work.
On Windows, when you use the eventlog option for log_destination, you should
register an event source and its library with the operating system so that the Windows
Event Viewer can display event log messages cleanly. See Section 18.11 for details.
logging_collector (boolean)
This parameter enables the logging collector, which is a background process that captures log
messages sent to stderr and redirects them into log files. This approach is often more useful than
logging to syslog, since some types of messages might not appear in syslog output. (One common
example is dynamic-linker failure messages; another is error messages produced by scripts such
as archive_command.) This parameter can only be set at server start.
Note
It is possible to log to stderr without using the logging collector; the log messages will
just go to wherever the server's stderr is directed. However, that method is only suitable
for low log volumes, since it provides no convenient way to rotate log files. Also, on
some platforms not using the logging collector can result in lost or garbled log output,
because multiple processes writing concurrently to the same log file can overwrite each
other's output.
Note
The logging collector is designed to never lose messages. This means that in case of ex-
tremely high load, server processes could be blocked while trying to send additional log
messages when the collector has fallen behind. In contrast, syslog prefers to drop mes-
sages if it cannot write them, which means it may fail to log some messages in such cases
but it will not block the rest of the system.
log_directory (string)
When logging_collector is enabled, this parameter determines the directory in which log
files will be created. It can be specified as an absolute path, or relative to the cluster data directory.
This parameter can only be set in the postgresql.conf file or on the server command line.
The default is log.
log_filename (string)
When logging_collector is enabled, this parameter sets the file names of the created log
files. The value is treated as a strftime pattern, so %-escapes can be used to specify time-
varying file names. (Note that if there are any time-zone-dependent %-escapes, the computation is
done in the zone specified by log_timezone.) The supported %-escapes are similar to those listed in
the Open Group's strftime specification. Note that the system's strftime is not used directly,
so platform-specific (nonstandard) extensions do not work. The default is postgresql-%Y-
%m-%d_%H%M%S.log.
If you specify a file name without escapes, you should plan to use a log rotation utility to avoid
eventually filling the entire disk. In releases prior to 8.4, if no % escapes were present, PostgreSQL
would append the epoch of the new log file's creation time, but this is no longer the case.
This parameter can only be set in the postgresql.conf file or on the server command line.
log_file_mode (integer)
On Unix systems this parameter sets the permissions for log files when logging_collector
is enabled. (On Microsoft Windows this parameter is ignored.) The parameter value is expected
to be a numeric mode specified in the format accepted by the chmod and umask system calls.
(To use the customary octal format the number must start with a 0 (zero).)
The default permissions are 0600, meaning only the server owner can read or write the log files.
The other commonly useful setting is 0640, allowing members of the owner's group to read the
files. Note however that to make use of such a setting, you'll need to alter log_directory to store
the files somewhere outside the cluster data directory. In any case, it's unwise to make the log
files world-readable, since they might contain sensitive data.
This parameter can only be set in the postgresql.conf file or on the server command line.
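For example, to let members of the server owner's group read the log files while keeping them outside
the data directory (the directory shown is hypothetical):
log_directory = '/var/log/postgresql'
log_file_mode = 0640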
log_rotation_age (integer)
When logging_collector is enabled, this parameter determines the maximum lifetime of an
individual log file. After this many minutes have elapsed, a new log file will be created. Set to
zero to disable time-based creation of new log files. This parameter can only be set in the
postgresql.conf file or on the server command line.
log_rotation_size (integer)
When logging_collector is enabled, this parameter determines the maximum size of an
individual log file. After this many kilobytes have been emitted into a log file, a new log file will
be created. Set to zero to disable size-based creation of new log files. This parameter can only be
set in the postgresql.conf file or on the server command line.
log_truncate_on_rotation (boolean)
When logging_collector is enabled, this parameter will cause PostgreSQL to truncate (overwrite),
rather than append to, any existing log file of the same name. However, truncation will occur only
when a new file is being opened due to time-based rotation, not during server startup or size-based
rotation. When off, pre-existing files will be appended to in all cases. This parameter can only be
set in the postgresql.conf file or on the server command line.
Example: To keep 7 days of logs, one log file per day named server_log.Mon,
server_log.Tue, etc, and automatically overwrite last week's log with this week's log,
set log_filename to server_log.%a, log_truncate_on_rotation to on, and
log_rotation_age to 1440.
Example: To keep 24 hours of logs, one log file per hour, but also rotate sooner if the log file
size exceeds 1GB, set log_filename to server_log.%H%M, log_truncate_on_ro-
tation to on, log_rotation_age to 60, and log_rotation_size to 1000000. In-
cluding %M in log_filename allows any size-driven rotations that might occur to select a file
name different from the hour's initial file name.
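The first example above corresponds to a postgresql.conf fragment along these lines:
log_filename = 'server_log.%a'
log_truncate_on_rotation = on
log_rotation_age = 1440            # minutes, i.e., one day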
syslog_facility (enum)
When logging to syslog is enabled, this parameter determines the syslog “facility” to be used.
You can choose from LOCAL0, LOCAL1, LOCAL2, LOCAL3, LOCAL4, LOCAL5, LOCAL6,
LOCAL7; the default is LOCAL0. See also the documentation of your system's syslog daemon.
This parameter can only be set in the postgresql.conf file or on the server command line.
syslog_ident (string)
When logging to syslog is enabled, this parameter determines the program name used to identify
PostgreSQL messages in syslog logs. The default is postgres. This parameter can only be set
in the postgresql.conf file or on the server command line.
syslog_sequence_numbers (boolean)
When logging to syslog and this is on (the default), then each message will be prefixed by an
increasing sequence number (such as [2]). This circumvents the “--- last message repeated N
times ---” suppression that many syslog implementations perform by default. In more modern
syslog implementations, repeated message suppression can be configured (for example, $Re-
peatedMsgReduction in rsyslog), so this might not be necessary. Also, you could turn this
off if you actually want to suppress repeated messages.
This parameter can only be set in the postgresql.conf file or on the server command line.
syslog_split_messages (boolean)
When logging to syslog is enabled, this parameter determines how messages are delivered to
syslog. When on (the default), messages are split by lines, and long lines are split so that they will
fit into 1024 bytes, which is a typical size limit for traditional syslog implementations. When off,
PostgreSQL server log messages are delivered to the syslog service as is, and it is up to the syslog
service to cope with the potentially bulky messages.
If syslog is ultimately logging to a text file, then the effect will be the same either way, and it is best
to leave the setting on, since most syslog implementations either cannot handle large messages
or would need to be specially configured to handle them. But if syslog is ultimately writing into
some other medium, it might be necessary or more useful to keep messages logically together.
This parameter can only be set in the postgresql.conf file or on the server command line.
event_source (string)
When logging to event log is enabled, this parameter determines the program name used to identify
PostgreSQL messages in the log. The default is PostgreSQL. This parameter can only be set
in the postgresql.conf file or on the server command line.
19.8.2. When To Log
log_min_messages (enum)
Controls which message levels are written to the server log. Valid values are DEBUG5, DEBUG4,
DEBUG3, DEBUG2, DEBUG1, INFO, NOTICE, WARNING, ERROR, LOG, FATAL, and PANIC.
Each level includes all the levels that follow it. The later the level, the fewer messages are sent to
the log. The default is WARNING. Note that LOG has a different rank here than in client_min_mes-
sages. Only superusers can change this setting.
log_min_error_statement (enum)
Controls which SQL statements that cause an error condition are recorded in the server log. The
current SQL statement is included in the log entry for any message of the specified severity
or higher. Valid values are DEBUG5, DEBUG4, DEBUG3, DEBUG2, DEBUG1, INFO, NOTICE,
WARNING, ERROR, LOG, FATAL, and PANIC. The default is ERROR, which means statements
causing errors, log messages, fatal errors, or panics will be logged. To effectively turn off logging
of failing statements, set this parameter to PANIC. Only superusers can change this setting.
log_min_duration_statement (integer)
Causes the duration of each completed statement to be logged if the statement ran for at least the
specified number of milliseconds. Setting this to zero prints all statement durations. Minus-one
(the default) disables logging statement durations. For example, if you set it to 250ms then all
SQL statements that run 250ms or longer will be logged. Enabling this parameter can be helpful in
tracking down unoptimized queries in your applications. Only superusers can change this setting.
For clients using extended query protocol, durations of the Parse, Bind, and Execute steps are
logged independently.
Note
When using this option together with log_statement, the text of statements that are logged
because of log_statement will not be repeated in the duration log message. If you are
not using syslog, it is recommended that you log the PID or session ID using log_line_pre-
fix so that you can link the statement message to the later duration message using the
process ID or session ID.
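A sketch combining that advice: log statements that run for a quarter second or longer, and put the
process ID in the prefix so duration messages can be matched to their statements:
log_min_duration_statement = 250
log_line_prefix = '%m [%p] '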
Table 19.2 explains the message severity levels used by PostgreSQL. If logging output is sent to syslog
or Windows' eventlog, the severity levels are translated as shown in the table.
19.8.3. What To Log
application_name (string)
The application_name can be any string of less than NAMEDATALEN characters (64 characters
in a standard build). It is typically set by an application upon connection to the server. The
name will be displayed in the pg_stat_activity view and included in CSV log entries. It
can also be included in regular log entries via the log_line_prefix parameter. Only printable ASCII
characters may be used in the application_name value. Other characters will be replaced
with question marks (?).
debug_print_parse (boolean)
debug_print_rewritten (boolean)
debug_print_plan (boolean)
These parameters enable various debugging output to be emitted. When set, they print the resulting
parse tree, the query rewriter output, or the execution plan for each executed query. These mes-
sages are emitted at LOG message level, so by default they will appear in the server log but will not
be sent to the client. You can change that by adjusting client_min_messages and/or log_min_mes-
sages. These parameters are off by default.
debug_pretty_print (boolean)
When set, debug_pretty_print indents the messages produced by debug_print_parse,
debug_print_rewritten, or debug_print_plan. This results in more readable but much
longer output than the “compact” format used when it is off. It is on by default.
log_checkpoints (boolean)
Causes checkpoints and restartpoints to be logged in the server log. Some statistics are included
in the log messages, including the number of buffers written and the time spent writing them.
This parameter can only be set in the postgresql.conf file or on the server command line.
The default is off.
log_connections (boolean)
Causes each attempted connection to the server to be logged, as well as successful completion of
client authentication. Only superusers can change this parameter at session start, and it cannot be
changed at all within a session. The default is off.
Note
Some client programs, like psql, attempt to connect twice while determining if a password
is required, so duplicate “connection received” messages do not necessarily indicate a
problem.
log_disconnections (boolean)
Causes session terminations to be logged. The log output provides information similar to
log_connections, plus the duration of the session. Only superusers can change this parame-
ter at session start, and it cannot be changed at all within a session. The default is off.
log_duration (boolean)
Causes the duration of every completed statement to be logged. The default is off. Only supe-
rusers can change this setting.
For clients using extended query protocol, durations of the Parse, Bind, and Execute steps are
logged independently.
Note
The difference between setting this option and setting log_min_duration_statement to ze-
ro is that exceeding log_min_duration_statement forces the text of the query to
be logged, but this option doesn't. Thus, if log_duration is on and log_min_du-
ration_statement has a positive value, all durations are logged but the query text
is included only for statements exceeding the threshold. This behavior can be useful for
gathering statistics in high-load installations.
log_error_verbosity (enum)
Controls the amount of detail written in the server log for each message that is logged. Valid values
are TERSE, DEFAULT, and VERBOSE, each adding more fields to displayed messages. TERSE
excludes the logging of DETAIL, HINT, QUERY, and CONTEXT error information. VERBOSE
output includes the SQLSTATE error code (see also Appendix A) and the source code file name,
function name, and line number that generated the error. Only superusers can change this setting.
log_hostname (boolean)
By default, connection log messages only show the IP address of the connecting host. Turning
this parameter on causes logging of the host name as well. Note that depending on your host name
resolution setup this might impose a non-negligible performance penalty. This parameter can only
be set in the postgresql.conf file or on the server command line.
log_line_prefix (string)
This is a printf-style string that is output at the beginning of each log line. % characters begin
“escape sequences” that are replaced with status information as outlined below. Unrecognized
escapes are ignored. Other characters are copied straight to the log line. Some escapes are only
recognized by session processes, and will be treated as empty by background processes such as the
main server process. Status information may be aligned either left or right by specifying a numeric
literal after the % and before the option. A negative value will cause the status information to be
padded on the right with spaces to give it a minimum width, whereas a positive value will pad on
the left. Padding can be useful to aid human readability in log files. This parameter can only be
set in the postgresql.conf file or on the server command line. The default is '%m [%p]
' which logs a time stamp and the process ID.
The %c escape prints a quasi-unique session identifier, consisting of two 4-byte hexadecimal
numbers (without leading zeros) separated by a dot. The numbers are the process start time and
the process ID, so %c can also be used as a space saving way of printing those items. For example,
to generate the session identifier from pg_stat_activity, use this query:
SELECT to_hex(trunc(EXTRACT(EPOCH FROM backend_start))::integer) || '.' ||
       to_hex(pid)
FROM pg_stat_activity;
Tip
If you set a nonempty value for log_line_prefix, you should usually make its last
character be a space, to provide visual separation from the rest of the log line. A punctu-
ation character can be used too.
Tip
Syslog produces its own time stamp and process ID information, so you probably do not
want to include those escapes if you are logging to syslog.
Tip
The %q escape is useful when including information that is only available in session (back-
end) context like user or database name. For example:
log_line_prefix = '%m [%p] %q%u@%d/%a '
log_lock_waits (boolean)
Controls whether a log message is produced when a session waits longer than deadlock_timeout
to acquire a lock. This is useful in determining if lock waits are causing poor performance. The
default is off. Only superusers can change this setting.
log_statement (enum)
Controls which SQL statements are logged. Valid values are none (off), ddl, mod, and all
(all statements). ddl logs all data definition statements, such as CREATE, ALTER, and DROP
statements. mod logs all ddl statements, plus data-modifying statements such as INSERT, UP-
DATE, DELETE, TRUNCATE, and COPY FROM. PREPARE, EXECUTE, and EXPLAIN ANA-
LYZE statements are also logged if their contained command is of an appropriate type. For clients
using extended query protocol, logging occurs when an Execute message is received, and values
of the Bind parameters are included (with any embedded single-quote marks doubled).
Note
Statements that contain simple syntax errors are not logged even by the log_state-
ment = all setting, because the log message is emitted only after basic parsing has
been done to determine the statement type. In the case of extended query protocol, this
setting likewise does not log statements that fail before the Execute phase (i.e., during
parse analysis or planning). Set log_min_error_statement to ERROR (or lower)
to log such statements.
log_replication_commands (boolean)
Causes each replication command to be logged in the server log. See Section 53.4 for more in-
formation about replication commands. The default value is off. Only superusers can change this
setting.
log_temp_files (integer)
Controls logging of temporary file names and sizes. Temporary files can be created for sorts,
hashes, and temporary query results. A log entry is made for each temporary file when it is deleted.
A value of zero logs all temporary file information, while positive values log only files whose
size is greater than or equal to the specified number of kilobytes. The default setting is -1, which
disables such logging. Only superusers can change this setting.
log_timezone (string)
Sets the time zone used for timestamps written in the server log. Unlike TimeZone, this value
is cluster-wide, so that all sessions will report timestamps consistently. The built-in default is
GMT, but that is typically overridden in postgresql.conf; initdb will install a setting there
corresponding to its system environment. See Section 8.5.3 for more information. This parameter
can only be set in the postgresql.conf file or on the server command line.
location text,
application_name text,
PRIMARY KEY (session_id, session_line_num)
);
To import a log file into this table, use the COPY FROM command:
COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
It is also possible to access the file as a foreign table, using the supplied file_fdw module.
There are a few things you need to do to simplify importing CSV log files:
1. Set log_filename and log_rotation_age to provide a consistent, predictable naming scheme
for your log files. This lets you predict what the file name will be and know when an individual log
file is complete and therefore ready to be imported.
2. Set log_rotation_size to 0 to disable size-based log rotation, as it makes the log file name
difficult to predict.
3. Set log_truncate_on_rotation to on so that old log data isn't mixed with the new in the
same file.
4. The table definition above includes a primary key specification. This is useful to protect against
accidentally importing the same information twice. The COPY command commits all of the data it
imports at one time, so any error will cause the entire import to fail. If you import a partial log file
and later import the file again when it is complete, the primary key violation will cause the import
to fail. Wait until the log is complete and closed before importing. This procedure will also protect
against accidentally importing a partial line that hasn't been completely written, which would also
cause COPY to fail.
19.8.5. Process Title
cluster_name (string)
Sets the cluster name that appears in the process title for all server processes in this cluster. The
name can be any string of less than NAMEDATALEN characters (64 characters in a standard build).
Only printable ASCII characters may be used in the cluster_name value. Other characters
will be replaced with question marks (?). No name is shown if this parameter is set to the empty
string '' (which is the default). This parameter can only be set at server start.
update_process_title (boolean)
Enables updating of the process title every time a new SQL command is received by the server.
This setting defaults to on on most platforms, but it defaults to off on Windows due to that
platform's larger overhead for updating the process title. Only superusers can change this setting.
19.9. Run-time Statistics
19.9.1. Query and Index Statistics Collector
track_activities (boolean)
Enables the collection of information on the currently executing command of each session, along
with the time when that command began execution. This parameter is on by default. Note that
even when enabled, this information is not visible to all users, only to superusers, roles with
privileges of the pg_read_all_stats role and the user owning the sessions being reported
on (including sessions belonging to a role they have the privileges of), so it should not represent
a security risk. Only superusers can change this setting.
track_activity_query_size (integer)
Specifies the number of bytes reserved to track the currently executing command for each active
session, for the pg_stat_activity.query field. The default value is 1024. This parameter
can only be set at server start.
track_counts (boolean)
Enables collection of statistics on database activity. This parameter is on by default, because the
autovacuum daemon needs the collected information. Only superusers can change this setting.
track_io_timing (boolean)
Enables timing of database I/O calls. This parameter is off by default, because it will repeatedly
query the operating system for the current time, which may cause significant overhead on some
platforms. You can use the pg_test_timing tool to measure the overhead of timing on your system.
I/O timing information is displayed in pg_stat_database, in the output of EXPLAIN when the
BUFFERS option is used, and by pg_stat_statements. Only superusers can change this setting.
track_functions (enum)
Enables tracking of function call counts and time used. Specify pl to track only procedural-lan-
guage functions, all to also track SQL and C language functions. The default is none, which
disables function statistics tracking. Only superusers can change this setting.
Note
SQL-language functions that are simple enough to be “inlined” into the calling query will
not be tracked, regardless of this setting.
stats_temp_directory (string)
Sets the directory to store temporary statistics data in. This can be a path relative to the data direc-
tory or an absolute path. The default is pg_stat_tmp. Pointing this at a RAM-based file system
will decrease physical I/O requirements and can lead to improved performance. This parameter
can only be set in the postgresql.conf file or on the server command line.
19.9.2. Statistics Monitoring
log_statement_stats (boolean)
log_parser_stats (boolean)
log_planner_stats (boolean)
log_executor_stats (boolean)
For each query, output performance statistics of the respective module to the server log. This
is a crude profiling instrument, similar to the Unix getrusage() operating system facility.
log_statement_stats reports total statement statistics, while the others report per-module
statistics. log_statement_stats cannot be enabled together with any of the per-module
options. All of these options are disabled by default. Only superusers can change these settings.
19.10. Automatic Vacuuming
autovacuum (boolean)
Controls whether the server should run the autovacuum launcher daemon. This is on by default;
however, track_counts must also be enabled for autovacuum to work. This parameter can only
be set in the postgresql.conf file or on the server command line; however, autovacuuming
can be disabled for individual tables by changing table storage parameters.
Note that even when this parameter is disabled, the system will launch autovacuum processes if
necessary to prevent transaction ID wraparound. See Section 24.1.5 for more information.
log_autovacuum_min_duration (integer)
Causes each action executed by autovacuum to be logged if it ran for at least the specified number
of milliseconds. Setting this to zero logs all autovacuum actions. Minus-one (the default) disables
logging autovacuum actions. For example, if you set this to 250ms then all automatic vacuums
and analyzes that run 250ms or longer will be logged. In addition, when this parameter is set to
any value other than -1, a message will be logged if an autovacuum action is skipped due to
a conflicting lock or a concurrently dropped relation. Enabling this parameter can be helpful in
tracking autovacuum activity. This parameter can only be set in the postgresql.conf file or
on the server command line; but the setting can be overridden for individual tables by changing
table storage parameters.
autovacuum_max_workers (integer)
Specifies the maximum number of autovacuum processes (other than the autovacuum launcher)
that may be running at any one time. The default is three. This parameter can only be set at server
start.
autovacuum_naptime (integer)
Specifies the minimum delay between autovacuum runs on any given database. In each round
the daemon examines the database and issues VACUUM and ANALYZE commands as needed for
tables in that database. The delay is measured in seconds, and the default is one minute (1min).
This parameter can only be set in the postgresql.conf file or on the server command line.
autovacuum_vacuum_threshold (integer)
Specifies the minimum number of updated or deleted tuples needed to trigger a VACUUM in any
one table. The default is 50 tuples. This parameter can only be set in the postgresql.conf
file or on the server command line; but the setting can be overridden for individual tables by
changing table storage parameters.
autovacuum_analyze_threshold (integer)
Specifies the minimum number of inserted, updated or deleted tuples needed to trigger an AN-
ALYZE in any one table. The default is 50 tuples. This parameter can only be set in the post-
gresql.conf file or on the server command line; but the setting can be overridden for indi-
vidual tables by changing table storage parameters.
autovacuum_freeze_max_age (integer)
Specifies the maximum age (in transactions) that a table's pg_class.relfrozenxid field
can attain before a VACUUM operation is forced to prevent transaction ID wraparound within the
table. Note that the system will launch autovacuum processes to prevent wraparound even when
autovacuum is otherwise disabled.
Vacuum also allows removal of old files from the pg_xact subdirectory, which is why the
default is a relatively low 200 million transactions. This parameter can only be set at server start,
but the setting can be reduced for individual tables by changing table storage parameters. For
more information see Section 24.1.5.
autovacuum_multixact_freeze_max_age (integer)
Specifies the maximum age (in multixacts) that a table's pg_class.relminmxid field can at-
tain before a VACUUM operation is forced to prevent multixact ID wraparound within the table.
Note that the system will launch autovacuum processes to prevent wraparound even when auto-
vacuum is otherwise disabled.
Vacuuming multixacts also allows removal of old files from the pg_multixact/members
and pg_multixact/offsets subdirectories, which is why the default is a relatively low 400
million multixacts. This parameter can only be set at server start, but the setting can be reduced for
individual tables by changing table storage parameters. For more information see Section 24.1.5.1.
autovacuum_vacuum_cost_delay (integer)
Specifies the cost delay value that will be used in automatic VACUUM operations. If -1 is specified,
the regular vacuum_cost_delay value will be used. The default value is 20 milliseconds. This
parameter can only be set in the postgresql.conf file or on the server command line; but
the setting can be overridden for individual tables by changing table storage parameters.
autovacuum_vacuum_cost_limit (integer)
Specifies the cost limit value that will be used in automatic VACUUM operations. If -1 is specified
(which is the default), the regular vacuum_cost_limit value will be used. Note that the value is
distributed proportionally among the running autovacuum workers, if there is more than one, so
that the sum of the limits for each worker does not exceed the value of this variable. This parameter
can only be set in the postgresql.conf file or on the server command line; but the setting
can be overridden for individual tables by changing table storage parameters.
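Several of these settings can also be overridden for individual tables through storage parameters, for
example (the table name big_events is hypothetical):
ALTER TABLE big_events SET (autovacuum_enabled = off);
ALTER TABLE big_events SET (autovacuum_vacuum_cost_delay = 10);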
19.11. Client Connection Defaults
19.11.1. Statement Behavior
client_min_messages (enum)
Controls which message levels are sent to the client. Valid values are DEBUG5, DEBUG4,
DEBUG3, DEBUG2, DEBUG1, LOG, NOTICE, WARNING, and ERROR. Each level includes all the
levels that follow it. The later the level, the fewer messages are sent. The default is NOTICE. Note
that LOG has a different rank here than in log_min_messages.
search_path (string)
This variable specifies the order in which schemas are searched when an object (table, data type,
function, etc.) is referenced by a simple name with no schema specified. When there are objects of
identical names in different schemas, the one found first in the search path is used. An object that
is not in any of the schemas in the search path can only be referenced by specifying its containing
schema with a qualified (dotted) name.
The value for search_path must be a comma-separated list of schema names. Any name that
is not an existing schema, or is a schema for which the user does not have USAGE permission,
is silently ignored.
If one of the list items is the special name $user, then the schema having the name returned by
CURRENT_USER is substituted, if there is such a schema and the user has USAGE permission
for it. (If not, $user is ignored.)
The system catalog schema, pg_catalog, is always searched, whether it is mentioned in the
path or not. If it is mentioned in the path then it will be searched in the specified order. If pg_cat-
alog is not in the path then it will be searched before searching any of the path items.
When objects are created without specifying a particular target schema, they will be placed in the
first valid schema named in search_path. An error is reported if the search path is empty.
The default value for this parameter is "$user", public. This setting supports shared use of
a database (where no users have private schemas, and all share use of public), private per-user
schemas, and combinations of these. Other effects can be obtained by altering the default search
path setting, either globally or per-user.
For more information on schema handling, see Section 5.8. In particular, the default configuration
is suitable only when the database has a single user or a few mutually-trusting users.
The current effective value of the search path can be examined via the SQL function cur-
rent_schemas (see Section 9.25). This is not quite the same as examining the value of
search_path, since current_schemas shows how the items appearing in search_path
were resolved.
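As a brief illustration (the schema name myschema is hypothetical and must already exist for object creation to target it):
SET search_path TO myschema, public;
SHOW search_path;              -- myschema, public
SELECT current_schemas(true);  -- also shows the implicitly searched pg_catalog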
row_security (boolean)
This variable controls whether to raise an error in lieu of applying a row security policy. When
set to on, policies apply normally. When set to off, queries fail which would otherwise apply
at least one policy. The default is on. Change to off where limited row visibility could cause
incorrect results; for example, pg_dump makes that change by default. This variable has no ef-
fect on roles which bypass every row security policy, to wit, superusers and roles with the BY-
PASSRLS attribute.
default_tablespace (string)
This variable specifies the default tablespace in which to create objects (tables and indexes) when
a CREATE command does not explicitly specify a tablespace.
The value is either the name of a tablespace, or an empty string to specify using the default table-
space of the current database. If the value does not match the name of any existing tablespace,
PostgreSQL will automatically use the default tablespace of the current database. If a nondefault
tablespace is specified, the user must have CREATE privilege for it, or creation attempts will fail.
This variable is not used for temporary tables; for them, temp_tablespaces is consulted instead.
This variable is also not used when creating databases. By default, a new database inherits its
tablespace setting from the template database it is copied from.
temp_tablespaces (string)
This variable specifies tablespaces in which to create temporary objects (temp tables and indexes
on temp tables) when a CREATE command does not explicitly specify a tablespace. Temporary
files for purposes such as sorting large data sets are also created in these tablespaces.
The value is a list of names of tablespaces. When there is more than one name in the list, Post-
greSQL chooses a random member of the list each time a temporary object is to be created; ex-
cept that within a transaction, successively created temporary objects are placed in successive
tablespaces from the list. If the selected element of the list is an empty string, PostgreSQL will
automatically use the default tablespace of the current database instead.
The default value is an empty string, which results in all temporary objects being created in the
default tablespace of the current database.
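A minimal sketch, assuming tablespaces named fastdisk, scratch1, and scratch2 have already been created:
SET default_tablespace = 'fastdisk';        -- ordinary tables and indexes go to fastdisk
SET temp_tablespaces = scratch1, scratch2;  -- temporary objects are spread across the two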
check_function_bodies (boolean)
This parameter is normally on. When set to off, it disables validation of the function body string
during CREATE FUNCTION. Disabling validation avoids side effects of the validation process
and avoids false positives due to problems such as forward references. Set this parameter to off
before loading functions on behalf of other users; pg_dump does so automatically.
default_transaction_isolation (enum)
Each SQL transaction has an isolation level, which can be either “read uncommitted”, “read com-
mitted”, “repeatable read”, or “serializable”. This parameter controls the default isolation level of
each new transaction. The default is “read committed”.
default_transaction_read_only (boolean)
A read-only SQL transaction cannot alter non-temporary tables. This parameter controls the de-
fault read-only status of each new transaction. The default is off (read/write).
default_transaction_deferrable (boolean)
When running at the serializable isolation level, a deferrable read-only SQL transaction
may be delayed before it is allowed to proceed. However, once it begins executing it does not
incur any of the overhead required to ensure serializability; so serialization code will have no
reason to force it to abort because of concurrent updates, making this option suitable for long-
running read-only transactions.
This parameter controls the default deferrable status of each new transaction. It currently has no
effect on read-write transactions or those operating at isolation levels lower than serializ-
able. The default is off.
transaction_isolation (enum)
This parameter reflects the current transaction's isolation level. At the beginning of each trans-
action, it is set to the current value of default_transaction_isolation. Any subsequent attempt to
change it is equivalent to a SET TRANSACTION command.
transaction_read_only (boolean)
This parameter reflects the current transaction's read-only status. At the beginning of each trans-
action, it is set to the current value of default_transaction_read_only. Any subsequent attempt to
change it is equivalent to a SET TRANSACTION command.
transaction_deferrable (boolean)
This parameter reflects the current transaction's deferrability status. At the beginning of each
transaction, it is set to the current value of default_transaction_deferrable. Any subsequent attempt
to change it is equivalent to a SET TRANSACTION command.
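A short sketch of how the session default and the per-transaction parameters interact:
SET default_transaction_isolation = 'repeatable read';
BEGIN;
SHOW transaction_isolation;    -- repeatable read
COMMIT;
BEGIN TRANSACTION ISOLATION LEVEL SERIALIZABLE READ ONLY DEFERRABLE;
SHOW transaction_deferrable;   -- on
COMMIT;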
session_replication_role (enum)
Controls firing of replication-related triggers and rules for the current session. Setting this variable
requires superuser privilege and results in discarding any previously cached query plans. Possible
values are origin (the default), replica and local.
The intended use of this setting is that logical replication systems set it to replica when they
are applying replicated changes. The effect of that will be that triggers and rules (that have not
been altered from their default configuration) will not fire on the replica. See the ALTER TABLE
clauses ENABLE TRIGGER and ENABLE RULE for more information.
PostgreSQL treats the settings origin and local the same internally. Third-party replication
systems may use these two values for their internal purposes, for example using local to des-
ignate a session whose changes should not be replicated.
Since foreign keys are implemented as triggers, setting this parameter to replica also disables
all foreign key checks, which can leave data in an inconsistent state if improperly used.
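For example, a replication apply process (running with superuser privilege) might bracket its work like this:
SET session_replication_role = replica;   -- ordinary triggers, rules, and FK checks are skipped
-- ... apply replicated changes here ...
SET session_replication_role = DEFAULT;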
statement_timeout (integer)
Abort any statement that takes more than the specified number of milliseconds, starting from the
time the command arrives at the server from the client. If log_min_error_statement is
set to ERROR or lower, the statement that timed out will also be logged. A value of zero (the
default) turns this off.
lock_timeout (integer)
Abort any statement that waits longer than the specified number of milliseconds while attempting
to acquire a lock on a table, index, row, or other database object. The time limit applies separately
to each lock acquisition attempt. The limit applies both to explicit locking requests (such as LOCK
TABLE, or SELECT FOR UPDATE without NOWAIT) and to implicitly-acquired locks. A value
of zero (the default) turns this off.
Unlike statement_timeout, this timeout can only occur while waiting for locks. Note that
if statement_timeout is nonzero, it is rather pointless to set lock_timeout to the same
or larger value, since the statement timeout would always trigger first. If log_min_error_s-
tatement is set to ERROR or lower, the statement that timed out will be logged.
idle_in_transaction_session_timeout (integer)
Terminate any session with an open transaction that has been idle for longer than the specified
duration in milliseconds. This allows any locks held by that session to be released and the con-
nection slot to be reused; it also allows tuples visible only to this transaction to be vacuumed. See
Section 24.1 for more details about this.
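These timeouts are often set per session or per role rather than globally; a sketch (the role name app_rw is hypothetical):
SET statement_timeout = '30s';
SET lock_timeout = '5s';
ALTER ROLE app_rw SET idle_in_transaction_session_timeout = '10min';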
vacuum_freeze_table_age (integer)
VACUUM performs an aggressive scan if the table's pg_class.relfrozenxid field has reached the age specified by this setting. An aggressive scan differs from a regular VACUUM in that it visits every page that might contain unfrozen XIDs or MXIDs, not just those that might contain dead tuples. The default is 150 million transactions. Although users can set this value anywhere from zero to two billion, VACUUM will silently limit the effective value to 95% of autovacuum_freeze_max_age, so that a periodic manual VACUUM has a chance to run before an anti-wraparound autovacuum is launched for the table. For more information see Section 24.1.5.
vacuum_freeze_min_age (integer)
Specifies the cutoff age (in transactions) that VACUUM should use to decide whether to freeze row
versions while scanning a table. The default is 50 million transactions. Although users can set this
value anywhere from zero to one billion, VACUUM will silently limit the effective value to half the
value of autovacuum_freeze_max_age, so that there is not an unreasonably short time between
forced autovacuums. For more information see Section 24.1.5.
vacuum_multixact_freeze_table_age (integer)
VACUUM performs an aggressive scan if the table's pg_class.relminmxid field has reached
the age specified by this setting. An aggressive scan differs from a regular VACUUM in that it visits
every page that might contain unfrozen XIDs or MXIDs, not just those that might contain dead
tuples. The default is 150 million multixacts. Although users can set this value anywhere from
zero to two billion, VACUUM will silently limit the effective value to 95% of autovacuum_multixact_freeze_max_age, so that a periodic manual VACUUM has a chance to run before an anti-wraparound autovacuum is launched for the table. For more information see Section 24.1.5.1.
vacuum_multixact_freeze_min_age (integer)
Specifies the cutoff age (in multixacts) that VACUUM should use to decide whether to replace
multixact IDs with a newer transaction ID or multixact ID while scanning a table. The default is 5
million multixacts. Although users can set this value anywhere from zero to one billion, VACUUM
will silently limit the effective value to half the value of autovacuum_multixact_freeze_max_age,
so that there is not an unreasonably short time between forced autovacuums. For more information
see Section 24.1.5.1.
vacuum_cleanup_index_scale_factor (floating point)
Specifies the fraction of the total number of heap tuples counted in the previous statistics collection
that can be inserted without incurring an index scan at the VACUUM cleanup stage. This setting
currently applies to B-tree indexes only.
If no tuples were deleted from the heap, B-tree indexes are still scanned at the VACUUM cleanup
stage when at least one of the following conditions is met: the index statistics are stale, or the
index contains deleted pages that can be recycled during cleanup. Index statistics are considered
to be stale if the number of newly inserted tuples exceeds the vacuum_cleanup_index_s-
cale_factor fraction of the total number of heap tuples detected by the previous statistics col-
lection. The total number of heap tuples is stored in the index meta-page. Note that the meta-page
does not include this data until VACUUM finds no dead tuples, so B-tree index scan at the cleanup
stage can only be skipped if the second and subsequent VACUUM cycles detect no dead tuples.
bytea_output (enum)
Sets the output format for values of type bytea. Valid values are hex (the default) and escape
(the traditional PostgreSQL format). See Section 8.4 for more information. The bytea type al-
ways accepts both formats on input, regardless of this setting.
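A quick illustration of the two output formats:
SET bytea_output = 'hex';
SELECT '\x414243'::bytea;    -- \x414243
SET bytea_output = 'escape';
SELECT '\x414243'::bytea;    -- ABC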
xmlbinary (enum)
Sets how binary values are to be encoded in XML. This applies for example when bytea val-
ues are converted to XML by the functions xmlelement or xmlforest. Possible values are
base64 and hex, which are both defined in the XML Schema standard. The default is base64.
For further information about XML-related functions, see Section 9.14.
The actual choice here is mostly a matter of taste, constrained only by possible restrictions in
client applications. Both methods support all possible values, although the hex encoding will be
somewhat larger than the base64 encoding.
xmloption (enum)
Sets whether DOCUMENT or CONTENT is implicit when converting between XML and character
string values. See Section 8.13 for a description of this. Valid values are DOCUMENT and CON-
TENT. The default is CONTENT.
gin_pending_list_limit (integer)
Sets the maximum size of the GIN pending list which is used when fastupdate is enabled.
If the list grows larger than this maximum size, it is cleaned up by moving the entries in it to
the main GIN data structure in bulk. The default is four megabytes (4MB). This setting can be
overridden for individual GIN indexes by changing index storage parameters. See Section 66.4.1
and Section 66.5 for more information.
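The per-index override mentioned above takes the form of an index storage parameter, for example (the index name my_gin_idx is hypothetical; the value is taken in kilobytes):
ALTER INDEX my_gin_idx SET (gin_pending_list_limit = 2048);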
DateStyle (string)
Sets the display format for date and time values, as well as the rules for interpreting ambiguous
date input values. For historical reasons, this variable contains two independent components: the
output format specification (ISO, Postgres, SQL, or German) and the input/output specifica-
tion for year/month/day ordering (DMY, MDY, or YMD). These can be set separately or together.
The keywords Euro and European are synonyms for DMY; the keywords US, NonEuro, and
NonEuropean are synonyms for MDY. See Section 8.5 for more information. The built-in de-
fault is ISO, MDY, but initdb will initialize the configuration file with a setting that corresponds
to the behavior of the chosen lc_time locale.
IntervalStyle (enum)
Sets the display format for interval values. The value sql_standard will produce output
matching SQL standard interval literals. The value postgres (which is the default) will pro-
duce output matching PostgreSQL releases prior to 8.4 when the DateStyle parameter was set to
ISO. The value postgres_verbose will produce output matching PostgreSQL releases prior
to 8.4 when the DateStyle parameter was set to non-ISO output. The value iso_8601 will
produce output matching the time interval “format with designators” defined in section 4.4.3.2
of ISO 8601.
The IntervalStyle parameter also affects the interpretation of ambiguous interval input. See
Section 8.5.4 for more information.
TimeZone (string)
Sets the time zone for displaying and interpreting time stamps. The built-in default is GMT, but that
is typically overridden in postgresql.conf; initdb will install a setting there corresponding
to its system environment. See Section 8.5.3 for more information.
timezone_abbreviations (string)
Sets the collection of time zone abbreviations that will be accepted by the server for datetime
input. The default is 'Default', which is a collection that works in most of the world; there
are also 'Australia' and 'India', and other collections can be defined for a particular
installation. See Section B.4 for more information.
extra_float_digits (integer)
This parameter adjusts the number of digits displayed for floating-point values, including
float4, float8, and geometric data types. The parameter value is added to the standard num-
ber of digits (FLT_DIG or DBL_DIG as appropriate). The value can be set as high as 3, to include
partially-significant digits; this is especially useful for dumping float data that needs to be restored
exactly. Or it can be set negative to suppress unwanted digits. See also Section 8.1.3.
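For example:
SET extra_float_digits = 0;
SELECT 0.1::float8;    -- 0.1
SET extra_float_digits = 3;
SELECT 0.1::float8;    -- e.g., 0.100000000000000006 (exact digits vary by platform and version)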
client_encoding (string)
Sets the client-side encoding (character set). The default is to use the database encoding. The
character sets supported by the PostgreSQL server are described in Section 23.3.1.
lc_messages (string)
Sets the language in which messages are displayed. Acceptable values are system-dependent; see
Section 23.1 for more information. If this variable is set to the empty string (which is the default)
then the value is inherited from the execution environment of the server in a system-dependent
way.
On some systems, this locale category does not exist. Setting this variable will still work, but there
will be no effect. Also, there is a chance that no translated messages for the desired language exist.
In that case you will continue to see the English messages.
Only superusers can change this setting, because it affects the messages sent to the server log as
well as to the client, and an improper value might obscure the readability of the server logs.
lc_monetary (string)
Sets the locale to use for formatting monetary amounts, for example with the to_char family
of functions. Acceptable values are system-dependent; see Section 23.1 for more information. If
this variable is set to the empty string (which is the default) then the value is inherited from the
execution environment of the server in a system-dependent way.
lc_numeric (string)
Sets the locale to use for formatting numbers, for example with the to_char family of functions.
Acceptable values are system-dependent; see Section 23.1 for more information. If this variable
is set to the empty string (which is the default) then the value is inherited from the execution
environment of the server in a system-dependent way.
lc_time (string)
Sets the locale to use for formatting dates and times, for example with the to_char family of
functions. Acceptable values are system-dependent; see Section 23.1 for more information. If
this variable is set to the empty string (which is the default) then the value is inherited from the
execution environment of the server in a system-dependent way.
default_text_search_config (string)
Selects the text search configuration that is used by those variants of the text search functions
that do not have an explicit argument specifying the configuration. See Chapter 12 for further
information. The built-in default is pg_catalog.simple, but initdb will initialize the config-
uration file with a setting that corresponds to the chosen lc_ctype locale, if a configuration
matching that locale can be identified.
Several settings are available for preloading shared libraries into the server, in order to load additional functionality or achieve performance benefits.
PostgreSQL procedural language libraries can be preloaded in this way, typically by using the syntax
'$libdir/plXXX' where XXX is pgsql, perl, tcl, or python.
Only shared libraries specifically intended to be used with PostgreSQL can be loaded this way. Every
PostgreSQL-supported library has a “magic block” that is checked to guarantee compatibility. For
this reason, non-PostgreSQL libraries cannot be loaded in this way. You might be able to use operat-
ing-system facilities such as LD_PRELOAD for that.
In general, refer to the documentation of a specific module for the recommended way to load that
module.
local_preload_libraries (string)
This variable specifies one or more shared libraries that are to be preloaded at connection start.
It contains a comma-separated list of library names, where each name is interpreted as for the
LOAD command. Whitespace between entries is ignored; surround a library name with double
quotes if you need to include whitespace or commas in the name. The parameter value only takes
effect at the start of the connection. Subsequent changes have no effect. If a specified library is
not found, the connection attempt will fail.
This option can be set by any user. Because of that, the libraries that can be loaded are restricted to
those appearing in the plugins subdirectory of the installation's standard library directory. (It is
the database administrator's responsibility to ensure that only “safe” libraries are installed there.)
Entries in local_preload_libraries can specify this directory explicitly, for example
$libdir/plugins/mylib, or just specify the library name — mylib would have the same
effect as $libdir/plugins/mylib.
The intent of this feature is to allow unprivileged users to load debugging or performance-mea-
surement libraries into specific sessions without requiring an explicit LOAD command. To that
end, it would be typical to set this parameter using the PGOPTIONS environment variable on the
client or by using ALTER ROLE SET.
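To make that concrete, here is a sketch of both approaches; the library name mylib, role dev_user, and database mydb are placeholders, and the first line is a client-side shell invocation while the second is SQL:
PGOPTIONS="-c local_preload_libraries=mylib" psql -d mydb
ALTER ROLE dev_user SET local_preload_libraries = 'mylib';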
However, unless a module is specifically designed to be used in this way by non-superusers, this
is usually not the right setting to use. Look at session_preload_libraries instead.
session_preload_libraries (string)
This variable specifies one or more shared libraries that are to be preloaded at connection start.
It contains a comma-separated list of library names, where each name is interpreted as for the
LOAD command. Whitespace between entries is ignored; surround a library name with double
quotes if you need to include whitespace or commas in the name. The parameter value only takes
effect at the start of the connection. Subsequent changes have no effect. If a specified library is
not found, the connection attempt will fail. Only superusers can change this setting.
shared_preload_libraries (string)
This variable specifies one or more shared libraries to be preloaded at server start. It contains
a comma-separated list of library names, where each name is interpreted as for the LOAD com-
mand. Whitespace between entries is ignored; surround a library name with double quotes if you
need to include whitespace or commas in the name. This parameter can only be set at server start.
If a specified library is not found, the server will fail to start.
Some libraries need to perform certain operations that can only take place at postmaster start,
such as allocating shared memory, reserving light-weight locks, or starting background workers.
Those libraries must be loaded at server start through this parameter. See the documentation of
each library for details.
Other libraries can also be preloaded. By preloading a shared library, the library startup time is
avoided when the library is first used. However, the time to start each new server process might
increase slightly, even if that process never uses the library. So this parameter is recommended
only for libraries that will be used in most sessions. Also, changing this parameter requires a
server restart, so this is not the right setting to use for short-term debugging tasks, say. Use ses-
sion_preload_libraries for that instead.
Note
On Windows hosts, preloading a library at server start will not reduce the time required to
start each new server process; each server process will re-load all preload libraries. How-
ever, shared_preload_libraries is still useful on Windows hosts for libraries
that need to perform operations at postmaster start time.
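As an illustration, a postgresql.conf entry preloading two commonly used modules might read (assuming both are installed):
shared_preload_libraries = 'pg_stat_statements, auto_explain'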
jit_provider (string)
This variable is the name of the JIT provider library to be used (see Section 32.4.2). The default
is llvmjit. This parameter can only be set at server start.
If set to a non-existent library, JIT will not be available, but no error will be raised. This allows
JIT support to be installed separately from the main PostgreSQL package.
dynamic_library_path (string)
If a dynamically loadable module needs to be opened and the file name specified in the CREATE
FUNCTION or LOAD command does not have a directory component (i.e., the name does not
contain a slash), the system will search this path for the required file.
The value for dynamic_library_path must be a list of absolute directory paths separated
by colons (or semi-colons on Windows). If a list element starts with the special string $libdir,
the compiled-in PostgreSQL package library directory is substituted for $libdir; this is where
the modules provided by the standard PostgreSQL distribution are installed. (Use pg_config
--pkglibdir to find out the name of this directory.) For example:
dynamic_library_path = '/usr/local/lib/postgresql:/home/my_project/lib:$libdir'
or, in a Windows environment:
dynamic_library_path = 'C:\tools\postgresql;H:\my_project\lib;$libdir'
The default value for this parameter is '$libdir'. If the value is set to an empty string, the
automatic path search is turned off.
This parameter can be changed at run time by superusers, but a setting done that way will only
persist until the end of the client connection, so this method should be reserved for development
purposes. The recommended way to set this parameter is in the postgresql.conf configu-
ration file.
gin_fuzzy_search_limit (integer)
Soft upper limit of the size of the set returned by GIN index scans. For more information see
Section 66.5.
deadlock_timeout (integer)
This is the amount of time, in milliseconds, to wait on a lock before checking to see if there is
a deadlock condition. The check for deadlock is relatively expensive, so the server doesn't run
it every time it waits for a lock. We optimistically assume that deadlocks are not common in
production applications and just wait on the lock for a while before checking for a deadlock.
Increasing this value reduces the amount of time wasted in needless deadlock checks, but slows
down reporting of real deadlock errors. The default is one second (1s), which is probably about
the smallest value you would want in practice. On a heavily loaded server you might want to
raise it. Ideally the setting should exceed your typical transaction time, so as to improve the odds
that a lock will be released before the waiter decides to check for deadlock. Only superusers can
change this setting.
When log_lock_waits is set, this parameter also determines the length of time to wait before a log
message is issued about the lock wait. If you are trying to investigate locking delays you might
want to set a shorter than normal deadlock_timeout.
max_locks_per_transaction (integer)
The shared lock table tracks locks on max_locks_per_transaction * (max_connections + max_prepared_transactions) objects (e.g., tables); hence, no more than this many distinct objects can be locked at any one time. This parameter controls the average number of object locks
allocated for each transaction; individual transactions can lock more objects as long as the locks
of all transactions fit in the lock table. This is not the number of rows that can be locked; that
value is unlimited. The default, 64, has historically proven sufficient, but you might need to raise
this value if you have queries that touch many different tables in a single transaction, e.g., query
of a parent table with many children. This parameter can only be set at server start.
When running a standby server, you must set this parameter to a value at least as high as on the
master server. Otherwise, queries will not be allowed in the standby server.
max_pred_locks_per_transaction (integer)
The shared predicate lock table tracks locks on max_pred_locks_per_transaction * (max_connections + max_prepared_transactions) objects (e.g., tables); hence, no more than this many distinct objects can be locked at any one time. This parameter controls the average number of object locks allocated for each transaction; individual transactions can lock more objects as long as the locks of all transactions fit in the lock table. This is not the number of rows that can be locked; that value is unlimited. The default, 64, has historically proven sufficient, but you might need to raise this value if you have clients that touch many different tables in a single serializable transaction. This parameter can only be set at server start.
max_pred_locks_per_relation (integer)
This controls how many pages or tuples of a single relation can be predicate-locked before the lock
is promoted to covering the whole relation. Values greater than or equal to zero mean an absolute
limit, while negative values mean max_pred_locks_per_transaction divided by the absolute value
of this setting. The default is -2, which keeps the behavior from previous versions of PostgreSQL.
This parameter can only be set in the postgresql.conf file or on the server command line.
max_pred_locks_per_page (integer)
This controls how many rows on a single page can be predicate-locked before the lock is pro-
moted to covering the whole page. The default is 2. This parameter can only be set in the post-
gresql.conf file or on the server command line.
array_nulls (boolean)
This controls whether the array input parser recognizes unquoted NULL as specifying a null array
element. By default, this is on, allowing array values containing null values to be entered. How-
ever, PostgreSQL versions before 8.2 did not support null values in arrays, and therefore would
treat NULL as specifying a normal array element with the string value “NULL”. For backward
compatibility with applications that require the old behavior, this variable can be turned off.
Note that it is possible to create array values containing null values even when this variable is off.
backslash_quote (enum)
This controls whether a quote mark can be represented by \' in a string literal. The preferred,
SQL-standard way to represent a quote mark is by doubling it ('') but PostgreSQL has histori-
cally also accepted \'. However, use of \' creates security risks because in some client character
set encodings, there are multibyte characters in which the last byte is numerically equivalent to
ASCII \. If client-side code does escaping incorrectly then a SQL-injection attack is possible.
This risk can be prevented by making the server reject queries in which a quote mark appears to be
escaped by a backslash. The allowed values of backslash_quote are on (allow \' always),
off (reject always), and safe_encoding (allow only if client encoding does not allow ASCII
\ within a multibyte character). safe_encoding is the default setting.
Note that in a standard-conforming string literal, \ just means \ anyway. This parameter only af-
fects the handling of non-standard-conforming literals, including escape string syntax (E'...').
default_with_oids (boolean)
This controls whether CREATE TABLE and CREATE TABLE AS include an OID column in
newly-created tables, if neither WITH OIDS nor WITHOUT OIDS is specified. It also determines
whether OIDs will be included in tables created by SELECT INTO. The parameter is off by
default; in PostgreSQL 8.0 and earlier, it was on by default.
The use of OIDs in user tables is considered deprecated, so most installations should leave this
variable disabled. Applications that require OIDs for a particular table should specify WITH
OIDS when creating the table. This variable can be enabled for compatibility with old applica-
tions that do not follow this behavior.
escape_string_warning (boolean)
When on, a warning is issued if a backslash (\) appears in an ordinary string literal ('...'
syntax) and standard_conforming_strings is off. The default is on.
Applications that wish to use backslash as escape should be modified to use escape string syntax
(E'...'), because the default behavior of ordinary strings is now to treat backslash as an ordi-
nary character, per SQL standard. This variable can be enabled to help locate code that needs to
be changed.
lo_compat_privileges (boolean)
In PostgreSQL releases prior to 9.0, large objects did not have access privileges and were, there-
fore, always readable and writable by all users. Setting this variable to on disables the new privi-
lege checks, for compatibility with prior releases. The default is off. Only superusers can change
this setting.
Setting this variable does not disable all security checks related to large objects — only those for
which the default behavior has changed in PostgreSQL 9.0.
operator_precedence_warning (boolean)
When on, the parser will emit a warning for any construct that might have changed meanings
since PostgreSQL 9.4 as a result of changes in operator precedence. This is useful for auditing
applications to see if precedence changes have broken anything; but it is not meant to be kept
turned on in production, since it will warn about some perfectly valid, standard-compliant SQL
code. The default is off.
quote_all_identifiers (boolean)
When the database generates SQL, force all identifiers to be quoted, even if they are not (cur-
rently) keywords. This will affect the output of EXPLAIN as well as the results of functions
like pg_get_viewdef. See also the --quote-all-identifiers option of pg_dump and
pg_dumpall.
standard_conforming_strings (boolean)
This controls whether ordinary string literals ('...') treat backslashes literally, as specified in
the SQL standard. Beginning in PostgreSQL 9.1, the default is on (prior releases defaulted to
off). Applications can check this parameter to determine how string literals will be processed.
The presence of this parameter can also be taken as an indication that the escape string syntax
(E'...') is supported. Escape string syntax (Section 4.1.2.2) should be used if an application
desires backslashes to be treated as escape characters.
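A small illustration of the difference between the two syntaxes when this parameter is on:
SELECT 'a\nb';     -- backslash is an ordinary character: a\nb
SELECT E'a\nb';    -- escape string syntax: a, newline, b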
synchronize_seqscans (boolean)
This allows sequential scans of large tables to synchronize with each other, so that concurrent
scans read the same block at about the same time and hence share the I/O workload. When this
is enabled, a scan might start in the middle of the table and then “wrap around” the end to cover
all rows, so as to synchronize with the activity of scans already in progress. This can result in
unpredictable changes in the row ordering returned by queries that have no ORDER BY clause.
Setting this parameter to off ensures the pre-8.3 behavior in which a sequential scan always
starts from the beginning of the table. The default is on.
transform_null_equals (boolean)
When on, expressions of the form expr = NULL (or NULL = expr) are treated as expr
IS NULL, that is, they return true if expr evaluates to the null value, and false otherwise. The
correct SQL-spec-compliant behavior of expr = NULL is to always return null (unknown).
Therefore this parameter defaults to off.
However, filtered forms in Microsoft Access generate queries that appear to use expr = NULL
to test for null values, so if you use that interface to access the database you might want to turn
this option on. Since expressions of the form expr = NULL always return the null value (using
the SQL standard interpretation), they are not very useful and do not appear often in normal
applications so this option does little harm in practice. But new users are frequently confused
about the semantics of expressions involving null values, so this option is off by default.
Note that this option only affects the exact form = NULL, not other comparison operators or other
expressions that are computationally equivalent to some expression involving the equals operator
(such as IN). Thus, this option is not a general fix for bad programming.
exit_on_error (boolean)
If true, any error will terminate the current session. By default, this is set to false, so that only
FATAL errors will terminate the session.
restart_after_crash (boolean)
When set to true, which is the default, PostgreSQL will automatically reinitialize after a backend
crash. Leaving this value set to true is normally the best way to maximize the availability of
the database. However, in some circumstances, such as when PostgreSQL is being invoked by
clusterware, it may be useful to disable the restart so that the clusterware can gain control and
take any actions it deems appropriate.
This parameter can only be set in the postgresql.conf file or on the server command line.
data_sync_retry (boolean)
When set to false, which is the default, PostgreSQL will raise a PANIC-level error on failure to
flush modified data files to the filesystem. This causes the database server to crash. This parameter
can only be set at server start.
On some operating systems, the status of data in the kernel's page cache is unknown after a write-
back failure. In some cases it might have been entirely forgotten, making it unsafe to retry; the
second attempt may be reported as successful, when in fact the data has been lost. In these circum-
stances, the only way to avoid data loss is to recover from the WAL after any failure is reported,
preferably after investigating the root cause of the failure and replacing any faulty hardware.
If set to true, PostgreSQL will instead report an error but continue to run so that the data flushing
operation can be retried in a later checkpoint. Only set it to true after investigating the operating
system's treatment of buffered data in case of write-back failure.
block_size (integer)
Reports the size of a disk block. It is determined by the value of BLCKSZ when building the
server. The default value is 8192 bytes. The meaning of some configuration variables (such as
shared_buffers) is influenced by block_size. See Section 19.4 for information.
data_checksums (boolean)
Reports whether data checksums are enabled for this cluster. See data checksums for more infor-
mation.
data_directory_mode (integer)
On Unix systems this parameter reports the permissions of the data directory (defined by data_directory) at startup. (On Microsoft Windows this parameter will always display 0700.) See group
access for more information.
debug_assertions (boolean)
Reports whether PostgreSQL has been built with assertions enabled. That is the case if the macro
USE_ASSERT_CHECKING is defined when PostgreSQL is built (accomplished e.g., by the
configure option --enable-cassert). By default PostgreSQL is built without assertions.
integer_datetimes (boolean)
Reports whether PostgreSQL was built with support for 64-bit-integer dates and times. As of
PostgreSQL 10, this is always on.
lc_collate (string)
Reports the locale in which sorting of textual data is done. See Section 23.1 for more information.
This value is determined when a database is created.
lc_ctype (string)
Reports the locale that determines character classifications. See Section 23.1 for more informa-
tion. This value is determined when a database is created. Ordinarily this will be the same as
lc_collate, but for special applications it might be set differently.
max_function_args (integer)
Reports the maximum number of function arguments. It is determined by the value of FUNC_MAX_ARGS when building the server. The default value is 100 arguments.
max_identifier_length (integer)
Reports the maximum identifier length. It is determined as one less than the value of NAMEDATALEN when building the server. The default value of NAMEDATALEN is 64; therefore the default maximum identifier length is 63 bytes.
max_index_keys (integer)
Reports the maximum number of index keys. It is determined by the value of INDEX_MAX_KEYS
when building the server. The default value is 32 keys.
segment_size (integer)
Reports the number of blocks (pages) that can be stored within a file segment. It is determined by
the value of RELSEG_SIZE when building the server. The maximum size of a segment file in
bytes is equal to segment_size multiplied by block_size; by default this is 1GB.
server_encoding (string)
Reports the database encoding (character set). It is determined when the database is created. Or-
dinarily, clients need only be concerned with the value of client_encoding.
server_version (string)
Reports the version number of the server. It is determined by the value of PG_VERSION when
building the server.
server_version_num (integer)
Reports the version number of the server as an integer. It is determined by the value of
PG_VERSION_NUM when building the server.
wal_block_size (integer)
Reports the size of a WAL disk block. It is determined by the value of XLOG_BLCKSZ when
building the server. The default value is 8192 bytes.
wal_segment_size (integer)
Reports the size of write ahead log segments. The default value is 16MB. See Section 30.4 for
more information.
Custom options have two-part names: an extension name, then a dot, then the parameter name proper,
much like qualified names in SQL. An example is plpgsql.variable_conflict.
Because custom options may need to be set in processes that have not loaded the relevant extension
module, PostgreSQL will accept a setting for any two-part parameter name. Such variables are treated
as placeholders and have no function until the module that defines them is loaded. When an extension
module is loaded, it will add its variable definitions, convert any placeholder values according to those
definitions, and issue warnings for any unrecognized placeholders that begin with its extension name.
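For example, a placeholder can be set and read back even before the extension that defines it is loaded (the name myext.batch_size is purely illustrative):
SET myext.batch_size = '500';
SHOW myext.batch_size;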
The following parameters are intended for work on the PostgreSQL source code, and in some cases to assist with recovery of severely damaged databases. There should be no reason to use them on a production database. As such, they have been excluded from the sample postgresql.conf file.
Note that many of these parameters require special source compilation flags to work at all.
allow_in_place_tablespaces (boolean)
Allows tablespaces to be created as directories inside pg_tblspc, when an empty location string
is provided to the CREATE TABLESPACE command. This is intended to allow testing replication
scenarios where primary and standby servers are running on the same machine. Such directories
are likely to confuse backup tools that expect to find only symbolic links in that location. Only
superusers can change this setting.
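A testing-only sketch (the tablespace name is hypothetical; requires superuser):
SET allow_in_place_tablespaces = on;
CREATE TABLESPACE regress_tblspc LOCATION '';   -- created as a directory inside pg_tblspc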
allow_system_table_mods (boolean)
Allows modification of the structure of system tables. This is used by initdb. This parameter
can only be set at server start.
ignore_system_indexes (boolean)
Ignore system indexes when reading system tables (but still update the indexes when modifying
the tables). This is useful when recovering from damaged system indexes. This parameter cannot
be changed after session start.
post_auth_delay (integer)
If nonzero, a delay of this many seconds occurs when a new server process is started, after it
conducts the authentication procedure. This is intended to give developers an opportunity to attach
to the server process with a debugger. This parameter cannot be changed after session start.
pre_auth_delay (integer)
If nonzero, a delay of this many seconds occurs just after a new server process is forked, before
it conducts the authentication procedure. This is intended to give developers an opportunity to
attach to the server process with a debugger to trace down misbehavior in authentication. This
parameter can only be set in the postgresql.conf file or on the server command line.
trace_notify (boolean)
Generates a great amount of debugging output for the LISTEN and NOTIFY commands. clien-
t_min_messages or log_min_messages must be DEBUG1 or lower to send this output to the client
or server logs, respectively.
trace_recovery_messages (enum)
Enables logging of recovery-related debugging output that otherwise would not be logged. This
parameter allows the user to override the normal setting of log_min_messages, but only for spe-
cific messages. This is intended for use in debugging Hot Standby. Valid values are DEBUG5,
DEBUG4, DEBUG3, DEBUG2, DEBUG1, and LOG. The default, LOG, does not affect logging de-
cisions at all. The other values cause recovery-related debug messages of that priority or higher
to be logged as though they had LOG priority; for common settings of log_min_messages
this results in unconditionally sending them to the server log. This parameter can only be set in
the postgresql.conf file or on the server command line.
trace_sort (boolean)
If on, emit information about resource usage during sort operations. This parameter is only
available if the TRACE_SORT macro was defined when PostgreSQL was compiled. (However,
TRACE_SORT is currently defined by default.)
trace_locks (boolean)
If on, emit information about lock usage. Information dumped includes the type of lock operation,
the type of lock and the unique identifier of the object being locked or unlocked. Also included
are bit masks for the lock types already granted on this object as well as for the lock types awaited
on this object. For each lock type a count of the number of granted locks and waiting locks is also
dumped as well as the totals.
This parameter is only available if the LOCK_DEBUG macro was defined when PostgreSQL was
compiled.
trace_lwlocks (boolean)
If on, emit information about lightweight lock usage. Lightweight locks are intended primarily to
provide mutual exclusion of access to shared-memory data structures.
This parameter is only available if the LOCK_DEBUG macro was defined when PostgreSQL was
compiled.
trace_userlocks (boolean)
If on, emit information about user lock usage. Output is the same as for trace_locks, only
for advisory locks.
This parameter is only available if the LOCK_DEBUG macro was defined when PostgreSQL was
compiled.
trace_lock_oidmin (integer)
If set, do not trace locks for tables below this OID (used to avoid output on system tables).
This parameter is only available if the LOCK_DEBUG macro was defined when PostgreSQL was
compiled.
trace_lock_table (integer)
Unconditionally trace locks on this table (OID).
This parameter is only available if the LOCK_DEBUG macro was defined when PostgreSQL was
compiled.
debug_deadlocks (boolean)
If set, dumps information about all current locks when a deadlock timeout occurs.
This parameter is only available if the LOCK_DEBUG macro was defined when PostgreSQL was
compiled.
log_btree_build_stats (boolean)
If set, logs system resource usage statistics (memory and CPU) on various B-tree operations.
This parameter is only available if the BTREE_BUILD_STATS macro was defined when Post-
greSQL was compiled.
wal_consistency_checking (string)
This parameter is intended to be used to check for bugs in the WAL redo routines. When enabled,
full-page images of any buffers modified in conjunction with the WAL record are added to the
record. If the record is subsequently replayed, the system will first apply each record and then
test whether the buffers modified by the record match the stored images. In certain cases (such as
hint bits), minor variations are acceptable, and will be ignored. Any unexpected differences will
result in a fatal error, terminating recovery.
The default value of this setting is the empty string, which disables the feature. It can be set
to all to check all records, or to a comma-separated list of resource managers to check only
records originating from those resource managers. Currently, the supported resource managers
are heap, heap2, btree, hash, gin, gist, sequence, spgist, brin, and generic.
Only superusers can change this setting.
wal_debug (boolean)
If on, emit WAL-related debugging output. This parameter is only available if the WAL_DEBUG
macro was defined when PostgreSQL was compiled.
ignore_checksum_failure (boolean)
Detection of a checksum failure during a read normally causes PostgreSQL to report an error,
aborting the current transaction. Setting ignore_checksum_failure to on causes the sys-
tem to ignore the failure (but still report a warning), and continue processing. This behavior may
cause crashes, propagate or hide corruption, or other serious problems. However, it may allow
you to get past the error and retrieve undamaged tuples that might still be present in the table if
the block header is still sane. If the header is corrupt an error will be reported even if this option
is enabled. The default setting is off, and it can only be changed by a superuser.
zero_damaged_pages (boolean)
Detection of a damaged page header normally causes PostgreSQL to report an error, aborting the
current transaction. Setting zero_damaged_pages to on causes the system to instead report
a warning, zero out the damaged page in memory, and continue processing. This behavior will
destroy data, namely all the rows on the damaged page. However, it does allow you to get past the
error and retrieve rows from any undamaged pages that might be present in the table. It is useful
for recovering data if corruption has occurred due to a hardware or software error. You should
generally not set this on until you have given up hope of recovering data from the damaged pages
of a table. Zeroed-out pages are not forced to disk so it is recommended to recreate the table or
the index before turning this parameter off again. The default setting is off, and it can only be
changed by a superuser.
jit_debugging_support (boolean)
If LLVM has the required functionality, register generated functions with GDB. This makes de-
bugging easier. The default setting is off. This parameter can only be set at server start.
jit_dump_bitcode (boolean)
Writes the generated LLVM IR out to the file system, inside data_directory. This is only useful
for working on the internals of the JIT implementation. The default setting is off. This parameter
can only be changed by a superuser.
jit_expressions (boolean)
Determines whether expressions are JIT compiled, when JIT compilation is activated (see Sec-
tion 32.2). The default is on.
jit_profiling_support (boolean)
If LLVM has the required functionality, emit the data needed to allow perf to profile functions
generated by JIT. This writes out files to $HOME/.debug/jit/; the user is responsible for
performing cleanup when desired. The default setting is off. This parameter can only be set at
server start.
jit_tuple_deforming (boolean)
Determines whether tuple deforming is JIT compiled, when JIT compilation is activated (see
Section 32.2). The default is on.
Chapter 20. Client Authentication
When a client application connects to the database server, it specifies which PostgreSQL database user
name it wants to connect as, much the same way one logs into a Unix computer as a particular user.
Within the SQL environment the active database user name determines access privileges to database
objects — see Chapter 21 for more information. Therefore, it is essential to restrict which database
users can connect.
Note
As explained in Chapter 21, PostgreSQL actually does privilege management in terms of
“roles”. In this chapter, we consistently use database user to mean “role with the LOGIN priv-
ilege”.
Authentication is the process by which the database server establishes the identity of the client, and
by extension determines whether the client application (or the user who runs the client application) is
permitted to connect with the database user name that was requested.
PostgreSQL offers a number of different client authentication methods. The method used to authenti-
cate a particular client connection can be selected on the basis of (client) host address, database, and
user.
PostgreSQL database user names are logically separate from user names of the operating system in
which the server runs. If all the users of a particular server also have accounts on the server's machine,
it makes sense to assign database user names that match their operating system user names. However,
a server that accepts remote connections might have many database users who have no local operating
system account, and in such cases there need be no connection between database user names and OS
user names.
The general format of the pg_hba.conf file is a set of records, one per line. Blank lines are ignored,
as is any text after the # comment character. Records cannot be continued across lines. A record is
made up of a number of fields which are separated by spaces and/or tabs. Fields can contain white space
if the field value is double-quoted. Quoting one of the keywords in a database, user, or address field
(e.g., all or replication) makes the word lose its special meaning, and just match a database,
user, or host with that name.
Each record specifies a connection type, a client IP address range (if relevant for the connection type),
a database name, a user name, and the authentication method to be used for connections matching
these parameters. The first record with a matching connection type, client address, requested database,
and user name is used to perform authentication. There is no “fall-through” or “backup”: if one record
is chosen and the authentication fails, subsequent records are not considered. If no record matches,
access is denied.
local
This record matches connection attempts using Unix-domain sockets. Without a record of this
type, Unix-domain socket connections are disallowed.
host
This record matches connection attempts made using TCP/IP. host records match either SSL
or non-SSL connection attempts.
Note
Remote TCP/IP connections will not be possible unless the server is started with an ap-
propriate value for the listen_addresses configuration parameter, since the default behav-
ior is to listen for TCP/IP connections only on the local loopback address localhost.
hostssl
This record matches connection attempts made using TCP/IP, but only when the connection is
made with SSL encryption.
To make use of this option the server must be built with SSL support. Furthermore, SSL must
be enabled by setting the ssl configuration parameter (see Section 18.9 for more information).
Otherwise, the hostssl record is ignored except for logging a warning that it cannot match any
connections.
hostnossl
This record type has the opposite behavior of hostssl; it only matches connection attempts
made over TCP/IP that do not use SSL.
database
Specifies which database name(s) this record matches. The value all specifies that it matches
all databases. The value sameuser specifies that the record matches if the requested database
has the same name as the requested user. The value samerole specifies that the requested user
must be a member of the role with the same name as the requested database. (samegroup is an
obsolete but still accepted spelling of samerole.) Superusers are not considered to be members
of a role for the purposes of samerole unless they are explicitly members of the role, directly or
indirectly, and not just by virtue of being a superuser. The value replication specifies that the
record matches if a physical replication connection is requested (note that replication connections
do not specify any particular database). Otherwise, this is the name of a specific PostgreSQL
database. Multiple database names can be supplied by separating them with commas. A separate
file containing database names can be specified by preceding the file name with @.
user
Specifies which database user name(s) this record matches. The value all specifies that it match-
es all users. Otherwise, this is either the name of a specific database user, or a group name pre-
ceded by +. (Recall that there is no real distinction between users and groups in PostgreSQL; a
+ mark really means “match any of the roles that are directly or indirectly members of this role”,
while a name without a + mark matches only that specific role.) For this purpose, a superuser is
only considered to be a member of a role if they are explicitly a member of the role, directly or
indirectly, and not just by virtue of being a superuser. Multiple user names can be supplied by
separating them with commas. A separate file containing user names can be specified by preced-
ing the file name with @.
address
Specifies the client machine address(es) that this record matches. This field can contain either a
host name, an IP address range, or one of the special key words mentioned below.
An IP address range is specified using standard numeric notation for the range's starting address,
then a slash (/) and a CIDR mask length. The mask length indicates the number of high-order
bits of the client IP address that must match. Bits to the right of this should be zero in the given
IP address. There must not be any white space between the IP address, the /, and the CIDR mask
length.
Typical examples of an IPv4 address range specified this way are 172.20.143.89/32 for a
single host, or 172.20.143.0/24 for a small network, or 10.6.0.0/16 for a larger one.
An IPv6 address range might look like ::1/128 for a single host (in this case the IPv6 loopback
address) or fe80::7a31:c1ff:0000:0000/96 for a small network. 0.0.0.0/0 repre-
sents all IPv4 addresses, and ::0/0 represents all IPv6 addresses. To specify a single host, use
a mask length of 32 for IPv4 or 128 for IPv6. In a network address, do not omit trailing zeroes.
An entry given in IPv4 format will match only IPv4 connections, and an entry given in IPv6
format will match only IPv6 connections, even if the represented address is in the IPv4-in-IPv6
range. Note that entries in IPv6 format will be rejected if the system's C library does not have
support for IPv6 addresses.
You can also write all to match any IP address, samehost to match any of the server's own IP
addresses, or samenet to match any address in any subnet that the server is directly connected to.
If a host name is specified (anything that is not an IP address range or a special key word is
treated as a host name), that name is compared with the result of a reverse name resolution of
the client's IP address (e.g., reverse DNS lookup, if DNS is used). Host name comparisons are
case insensitive. If there is a match, then a forward name resolution (e.g., forward DNS lookup)
is performed on the host name to check whether any of the addresses it resolves to are equal to
the client's IP address. If both directions match, then the entry is considered to match. (The host
name that is used in pg_hba.conf should be the one that address-to-name resolution of the
client's IP address returns, otherwise the line won't be matched. Some host name databases allow
associating an IP address with multiple host names, but the operating system will only return one
host name when asked to resolve an IP address.)
A host name specification that starts with a dot (.) matches a suffix of the actual host name. So
.example.com would match foo.example.com (but not just example.com).
When host names are specified in pg_hba.conf, you should make sure that name resolution
is reasonably fast. It can be of advantage to set up a local name resolution cache such as nscd.
Also, you may wish to enable the configuration parameter log_hostname to see the client's
host name instead of the IP address in the log.
Note
Users sometimes wonder why host names are handled in this seemingly complicated way,
with two name resolutions including a reverse lookup of the client's IP address. This com-
plicates use of the feature in case the client's reverse DNS entry is not set up or yields
some undesirable host name. It is done primarily for efficiency: this way, a connection
attempt requires at most two resolver lookups, one reverse and one forward. If there is a
resolver problem with some address, it becomes only that client's problem. A hypothetical
alternative implementation that only did forward lookups would have to resolve every
host name mentioned in pg_hba.conf during every connection attempt. That could be
quite slow if many names are listed. And if there is a resolver problem with one of the
host names, it becomes everyone's problem.
Also, a reverse lookup is necessary to implement the suffix matching feature, because the
actual client host name needs to be known in order to match it against the pattern.
Note that this behavior is consistent with other popular implementations of host name-
based access control, such as the Apache HTTP Server and TCP Wrappers.
IP-address
IP-mask
These two fields can be used as an alternative to the IP-address/mask-length notation. Instead of specifying the mask length, the actual mask is given in a separate column. For example, 255.0.0.0 represents an IPv4 CIDR mask length of 8, and 255.255.255.255 represents a CIDR mask length of 32.
These fields only apply to host, hostssl, and hostnossl records.
auth-method
Specifies the authentication method to use when a connection matches this record. The possible
choices are summarized here; details are in Section 20.3. All the options are lower case and treated
case sensitively, so even acronyms like ldap must be specified as lower case.
trust
Allow the connection unconditionally. This method allows anyone that can connect to the
PostgreSQL database server to login as any PostgreSQL user they wish, without the need for
a password or any other authentication. See Section 20.4 for details.
reject
Reject the connection unconditionally. This is useful for “filtering out” certain hosts from a
group, for example a reject line could block a specific host from connecting, while a later
line allows the remaining hosts in a specific network to connect.
scram-sha-256
Perform SCRAM-SHA-256 authentication to verify the user's password. See Section 20.5
for details.
md5
Perform SCRAM-SHA-256 or MD5 authentication to verify the user's password. See Sec-
tion 20.5 for details.
password
Require the client to supply an unencrypted password for authentication. Since the password
is sent in clear text over the network, this should not be used on untrusted networks. See
Section 20.5 for details.
gss
Use GSSAPI to authenticate the user. This is only available for TCP/IP connections. See
Section 20.6 for details.
sspi
Use SSPI to authenticate the user. This is only available on Windows. See Section 20.7 for
details.
ident
Obtain the operating system user name of the client by contacting the ident server on the
client and check if it matches the requested database user name. Ident authentication can only
be used on TCP/IP connections. When specified for local connections, peer authentication
will be used instead. See Section 20.8 for details.
peer
Obtain the client's operating system user name from the operating system and check if it
matches the requested database user name. This is only available for local connections. See
Section 20.9 for details.
ldap
Authenticate using an LDAP server. See Section 20.10 for details.
radius
Authenticate using a RADIUS server. See Section 20.11 for details.
cert
Authenticate using SSL client certificates. See Section 20.12 for details.
pam
Authenticate using the Pluggable Authentication Modules (PAM) service provided by the
operating system. See Section 20.13 for details.
bsd
Authenticate using the BSD Authentication service provided by the operating system. See
Section 20.14 for details.
auth-options
After the auth-method field, there can be field(s) of the form name=value that specify op-
tions for the authentication method. Details about which options are available for which authen-
tication methods appear below.
In addition to the method-specific options listed below, there is one method-independent authen-
tication option clientcert, which can be specified in any hostssl record. When set to 1,
this option requires the client to present a valid (trusted) SSL certificate, in addition to the other
requirements of the authentication method.
Files included by @ constructs are read as lists of names, which can be separated by either whitespace
or commas. Comments are introduced by #, just as in pg_hba.conf, and nested @ constructs are
allowed. Unless the file name following @ is an absolute path, it is taken to be relative to the directory
containing the referencing file.
Since the pg_hba.conf records are examined sequentially for each connection attempt, the order of
the records is significant. Typically, earlier records will have tight connection match parameters and
weaker authentication methods, while later records will have looser match parameters and stronger
authentication methods. For example, one might wish to use trust authentication for local TCP/IP
connections but require a password for remote TCP/IP connections. In this case a record specifying
trust authentication for connections from 127.0.0.1 would appear before a record specifying pass-
word authentication for a wider range of allowed client IP addresses.
The pg_hba.conf file is read on start-up and when the main server process receives a SIGHUP
signal. If you edit the file on an active system, you will need to signal the postmaster (using pg_ctl
reload, calling the SQL function pg_reload_conf(), or using kill -HUP) to make it re-
read the file.
Note
The preceding statement is not true on Microsoft Windows: there, any changes in the pg_h-
ba.conf file are immediately applied by subsequent new connections.
The system view pg_hba_file_rules can be helpful for pre-testing changes to the pg_h-
ba.conf file, or for diagnosing problems if loading of the file did not have the desired effects. Rows
in the view with non-null error fields indicate problems in the corresponding lines of the file.
Tip
To connect to a particular database, a user must not only pass the pg_hba.conf checks, but
must have the CONNECT privilege for the database. If you wish to restrict which users can
connect to which databases, it's usually easier to control this by granting/revoking CONNECT
privilege than to put the rules in pg_hba.conf entries.
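For instance, connection rights to a particular database could be managed like this (the database and role names sales and appuser are purely illustrative):
REVOKE CONNECT ON DATABASE sales FROM PUBLIC;
GRANT CONNECT ON DATABASE sales TO appuser;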
Some examples of pg_hba.conf entries are shown in Example 20.1. See the next section for details
on the different authentication methods.
# The same using a host name (would typically cover both IPv4 and IPv6).
#
# TYPE  DATABASE        USER            ADDRESS                 METHOD
host    all             all             localhost               trust
# If these are the only three lines for local connections, they will
# allow local users to connect only to their own databases (databases
# with the same name as their database user name) except for administrators
# and members of role "support", who can connect to all databases.  The file
# $PGDATA/admins contains a list of names of administrators.  Passwords
# are required in all cases.
#
# TYPE  DATABASE        USER            ADDRESS                 METHOD
local   sameuser        all                                     md5
local   all             @admins                                 md5
local   all             +support                                md5

# The last two lines above can be combined into a single line:
local   all             @admins,+support                        md5

# The database column can also use lists and file names:
local   db1,db2,@demodbs  all                                   md5
User name maps are defined in the ident map file, which by default is named pg_ident.conf and
is stored in the cluster's data directory. (It is possible to place the map file elsewhere, however; see the
ident_file configuration parameter.) The ident map file contains lines of the general form:

map-name system-username database-username
Comments and whitespace are handled in the same way as in pg_hba.conf. The map-name is
an arbitrary name that will be used to refer to this mapping in pg_hba.conf. The other two fields
specify an operating system user name and a matching database user name. The same map-name can
be used repeatedly to specify multiple user-mappings within a single map.
There is no restriction regarding how many database users a given operating system user can corre-
spond to, nor vice versa. Thus, entries in a map should be thought of as meaning “this operating system
user is allowed to connect as this database user”, rather than implying that they are equivalent. The
connection will be allowed if there is any map entry that pairs the user name obtained from the external
authentication system with the database user name that the user has requested to connect as.
If the system-username field starts with a slash (/), the remainder of the field is treated as a
regular expression. (See Section 9.7.3.1 for details of PostgreSQL's regular expression syntax.) The
regular expression can include a single capture, or parenthesized subexpression, which can then be
referenced in the database-username field as \1 (backslash-one). This allows the mapping of
multiple user names in a single line, which is particularly useful for simple syntax substitutions. For
example, these entries
mymap /^(.*)@mydomain\.com$ \1
mymap /^(.*)@otherdomain\.com$ guest
will remove the domain part for users with system user names that end with @mydomain.com, and
allow any user whose system name ends with @otherdomain.com to log in as guest.
Tip
Keep in mind that by default, a regular expression can match just part of a string. It's usually
wise to use ^ and $, as shown in the above example, to force the match to be to the entire
system user name.
The pg_ident.conf file is read on start-up and when the main server process receives a SIGHUP
signal. If you edit the file on an active system, you will need to signal the postmaster (using pg_ctl
reload, calling the SQL function pg_reload_conf(), or using kill -HUP) to make it re-
read the file.
A pg_ident.conf file that could be used in conjunction with the pg_hba.conf file in Exam-
ple 20.1 is shown in Example 20.2. In this example, anyone logged in to a machine on the 192.168
network that does not have the operating system user name bryanh, ann, or robert would not be
granted access. Unix user robert would only be allowed access when he tries to connect as Post-
greSQL user bob, not as robert or anyone else. ann would only be allowed to connect as ann.
User bryanh would be allowed to connect as either bryanh or as guest1.
• Trust authentication, which simply trusts that users are who they say they are.
• Ident authentication, which relies on an “Identification Protocol” (RFC 1413) service on the client's
machine. (On local Unix-socket connections, this is treated as peer authentication.)
• Peer authentication, which relies on operating system facilities to identify the process at the other
end of a local connection. This is not supported for remote connections.
• Certificate authentication, which requires an SSL connection and authenticates users by checking
the SSL certificate they send.
• BSD authentication, which relies on the BSD Authentication framework (currently available only
on OpenBSD).
Peer authentication is usually recommended for local connections, though trust authentication might
be sufficient in some circumstances. Password authentication is the easiest choice for remote connec-
tions. All the other options require some kind of external security infrastructure (usually an authenti-
cation server or a certificate authority for issuing SSL certificates), or are platform-specific.
The following sections describe each of these authentication methods in more detail.
trust authentication is appropriate and very convenient for local connections on a single-user work-
station. It is usually not appropriate by itself on a multiuser machine. However, you might be able
to use trust even on a multiuser machine, if you restrict access to the server's Unix-domain socket
file using file-system permissions. To do this, set the unix_socket_permissions (and possibly
unix_socket_group) configuration parameters as described in Section 19.3. Or you could set
the unix_socket_directories configuration parameter to place the socket file in a suitably
restricted directory.
Setting file-system permissions only helps for Unix-socket connections. Local TCP/IP connections
are not restricted by file-system permissions. Therefore, if you want to use file-system permissions
for local security, remove the host ... 127.0.0.1 ... line from pg_hba.conf, or change
it to a non-trust authentication method.
trust authentication is only suitable for TCP/IP connections if you trust every user on every machine
that is allowed to connect to the server by the pg_hba.conf lines that specify trust. It is seldom
reasonable to use trust for any TCP/IP connections other than those from localhost (127.0.0.1).
There are several password-based authentication methods. These methods operate similarly but differ
in how the users' passwords are stored on the server and how the password provided by a client is
sent across the connection.
scram-sha-256
This is the most secure of the currently provided methods, but it is not supported by older client
libraries.
md5
The method md5 uses a custom, less secure challenge-response mechanism. It prevents password
sniffing and avoids storing passwords on the server in plain text but provides no protection if an
attacker manages to steal the password hash from the server. Also, the MD5 hash algorithm is
nowadays no longer considered secure against determined attacks.
To ease transition from the md5 method to the newer SCRAM method, if md5 is specified as a
method in pg_hba.conf but the user's password on the server is encrypted for SCRAM (see
below), then SCRAM-based authentication will automatically be chosen instead.
password
The method password sends the password in clear-text and is therefore vulnerable to password
“sniffing” attacks. It should always be avoided if possible. If the connection is protected by SSL
encryption then password can be used safely, though. (Though SSL certificate authentication
might be a better choice if one is depending on using SSL).
PostgreSQL database passwords are separate from operating system user passwords. The password
for each database user is stored in the pg_authid system catalog. Passwords can be managed with
the SQL commands CREATE ROLE and ALTER ROLE, e.g., CREATE ROLE foo WITH LOGIN
PASSWORD 'secret', or the psql command \password. If no password has been set up for a
user, the stored password is null and password authentication will always fail for that user.
The availability of the different password-based authentication methods depends on how a user's pass-
word on the server is encrypted (or hashed, more accurately). This is controlled by the configuration
parameter password_encryption at the time the password is set. If a password was encrypted using
the scram-sha-256 setting, then it can be used for the authentication methods scram-sha-256
and password (but password transmission will be in plain text in the latter case). The authentication
method specification md5 will automatically switch to using the scram-sha-256 method in this
case, as explained above, so it will also work. If a password was encrypted using the md5 setting, then
it can be used only for the md5 and password authentication method specifications (again, with the
password transmitted in plain text in the latter case). (Previous PostgreSQL releases supported storing
the password on the server in plain text. This is no longer possible.) To check the currently stored
password hashes, see the system catalog pg_authid.
To upgrade an existing installation from md5 to scram-sha-256, after having ensured that all client
libraries in use are new enough to support SCRAM, set password_encryption = 'scram-
sha-256' in postgresql.conf, make all users set new passwords, and change the authentica-
tion method specifications in pg_hba.conf to scram-sha-256.
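A minimal sketch of that procedure follows; the user name alice and the pg_hba.conf line are illustrative, and ALTER SYSTEM is just one way to change the setting (editing postgresql.conf directly works equally well).

-- switch the hashing method used for newly set passwords, then reload:
ALTER SYSTEM SET password_encryption = 'scram-sha-256';
SELECT pg_reload_conf();

-- have each user set a new password so it is stored as a SCRAM hash,
-- for example in psql:
--   \password alice

-- finally, replace md5 with scram-sha-256 in pg_hba.conf, for example:
--   host    all    all    192.168.12.0/24    scram-sha-256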
GSSAPI is an industry-standard protocol for secure authentication defined in RFC 2743. PostgreSQL
supports GSSAPI with Kerberos authentication according to RFC 1964. GSSAPI provides automatic
authentication (single sign-on) for systems that support it. The authentication itself is secure, but the
data sent over the database connection will be sent unencrypted unless SSL is used.
GSSAPI support has to be enabled when PostgreSQL is built; see Chapter 16 for more information.
When GSSAPI uses Kerberos, it uses a standard principal in the format servicename/host-
name@realm. The PostgreSQL server will accept any principal that is included in the keytab used
by the server, but care needs to be taken to specify the correct principal details when making the
connection from the client using the krbsrvname connection parameter. (See also Section 34.1.2.)
The installation default can be changed from the default postgres at build time using ./config-
ure --with-krb-srvnam=whatever. In most environments, this parameter never needs to be
changed. Some Kerberos implementations might require a different service name, such as Microsoft
Active Directory which requires the service name to be in upper case (POSTGRES).
hostname is the fully qualified host name of the server machine. The service principal's realm is the
preferred realm of the server machine.
Client principals can be mapped to different PostgreSQL database user names with pg_iden-
t.conf. For example, pgusername@realm could be mapped to just pgusername. Alternative-
ly, you can use the full username@realm principal as the role name in PostgreSQL without any
mapping.
PostgreSQL also supports a parameter to strip the realm from the principal. This method is support-
ed for backwards compatibility and is strongly discouraged as it is then impossible to distinguish
different users with the same user name but coming from different realms. To enable this, set in-
clude_realm to 0. For simple single-realm installations, doing that combined with setting the kr-
b_realm parameter (which checks that the principal's realm matches exactly what is in the kr-
b_realm parameter) is still secure; but this is a less capable approach compared to specifying an
explicit mapping in pg_ident.conf.
Make sure that your server keytab file is readable (and preferably only readable, not writable) by
the PostgreSQL server account. (See also Section 18.1.) The location of the key file is specified by
the krb_server_keyfile configuration parameter. The default is /usr/local/pgsql/etc/kr-
b5.keytab (or whatever directory was specified as sysconfdir at build time). For security rea-
sons, it is recommended to use a separate keytab just for the PostgreSQL server rather than opening
up permissions on the system keytab file.
The keytab file is generated by the Kerberos software; see the Kerberos documentation for details.
The following example is for MIT-compatible Kerberos 5 implementations:
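A sketch for MIT-compatible implementations, assuming a server principal postgres/server.my.domain.org (substitute your own host name and realm):

kadmin% ank -randkey postgres/server.my.domain.org
kadmin% ktadd -k krb5.keytab postgres/server.my.domain.org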
When connecting to the database make sure you have a ticket for a principal matching the requested
database user name. For example, for database user name fred, principal fred@EXAMPLE.COM would be able to connect. To also allow principal fred/users.example.com@EXAMPLE.COM,
use a user name map, as described in Section 20.2.
include_realm
If set to 0, the realm name from the authenticated user principal is stripped off before being passed
through the user name mapping (Section 20.2). This is discouraged and is primarily available for
backwards compatibility, as it is not secure in multi-realm environments unless krb_realm is
also used. It is recommended to leave include_realm set to the default (1) and to provide an
explicit mapping in pg_ident.conf to convert principal names to PostgreSQL user names.
map
Allows for mapping between system and database user names. See Section 20.2 for details.
For a GSSAPI/Kerberos principal, such as username@EXAMPLE.COM (or, less commonly, username/hostbased@EXAMPLE.COM), the user name used for mapping is username@EXAMPLE.COM (or username/hostbased@EXAMPLE.COM, respectively), unless include_realm has been set to 0, in which case username (or username/hostbased) is what is seen as the system user name when mapping.
krb_realm
Sets the realm to match user principal names against. If this parameter is set, only users of that
realm will be accepted. If it is not set, users of any realm can connect, subject to whatever user
name mapping is done.
When using Kerberos authentication, SSPI works the same way GSSAPI does; see Section 20.6 for
details.
include_realm
If set to 0, the realm name from the authenticated user principal is stripped off before being passed
through the user name mapping (Section 20.2). This is discouraged and is primarily available for
backwards compatibility, as it is not secure in multi-realm environments unless krb_realm is
also used. It is recommended to leave include_realm set to the default (1) and to provide an
explicit mapping in pg_ident.conf to convert principal names to PostgreSQL user names.
compat_realm
If set to 1, the domain's SAM-compatible name (also known as the NetBIOS name) is used for the
include_realm option. This is the default. If set to 0, the true realm name from the Kerberos
user principal name is used.
Do not disable this option unless your server runs under a domain account (this includes virtual
service accounts on a domain member system) and all clients authenticating through SSPI are
also using domain accounts, or authentication will fail.
upn_username
If this option is enabled along with compat_realm, the user name from the Kerberos UPN is
used for authentication. If it is disabled (the default), the SAM-compatible user name is used. By
default, these two names are identical for new user accounts.
Note that libpq uses the SAM-compatible name if no explicit user name is specified. If you use
libpq or a driver based on it, you should leave this option disabled or explicitly specify user name
in the connection string.
map
Allows for mapping between system and database user names. See Section 20.2 for details. For an SSPI/Kerberos principal, such as username@EXAMPLE.COM (or, less commonly, username/hostbased@EXAMPLE.COM), the user name used for mapping is username@EXAMPLE.COM (or username/hostbased@EXAMPLE.COM, respectively), unless include_realm has been set to 0, in which case username (or username/hostbased) is what is seen as the system user name when mapping.
krb_realm
Sets the realm to match user principal names against. If this parameter is set, only users of that
realm will be accepted. If it is not set, users of any realm can connect, subject to whatever user
name mapping is done.
Note
When ident is specified for a local (non-TCP/IP) connection, peer authentication (see Sec-
tion 20.9) will be used instead.
map
Allows for mapping between system and database user names. See Section 20.2 for details.
The “Identification Protocol” is described in RFC 1413. Virtually every Unix-like operating system
ships with an ident server that listens on TCP port 113 by default. The basic functionality of an ident
server is to answer questions like “What user initiated the connection that goes out of your port X
and connects to my port Y?”. Since PostgreSQL knows both X and Y when a physical connection is
established, it can interrogate the ident server on the host of the connecting client and can theoretically
determine the operating system user for any given connection.
The drawback of this procedure is that it depends on the integrity of the client: if the client machine
is untrusted or compromised, an attacker could run just about any program on port 113 and return any
user name they choose. This authentication method is therefore only appropriate for closed networks
where each client machine is under tight control and where the database and system administrators
operate in close contact. In other words, you must trust the machine running the ident server. Heed the warning:

The Identification Protocol is not intended as an authorization or access control protocol.
—RFC 1413
Some ident servers have a nonstandard option that causes the returned user name to be encrypted,
using a key that only the originating machine's administrator knows. This option must not be used
when using the ident server with PostgreSQL, since PostgreSQL does not have any way to decrypt
the returned string to determine the actual user name.
map
Allows for mapping between system and database user names. See Section 20.2 for details.
Peer authentication is only available on operating systems providing the getpeereid() function,
the SO_PEERCRED socket parameter, or similar mechanisms. Currently that includes Linux, most
flavors of BSD including macOS, and Solaris.
LDAP authentication can operate in two modes. In the first mode, which we will call the simple bind
mode, the server will bind to the distinguished name constructed as prefix username suffix.
Typically, the prefix parameter is used to specify cn=, or DOMAIN\ in an Active Directory en-
vironment. suffix is used to specify the remaining part of the DN in a non-Active Directory envi-
ronment.
In the second mode, which we will call the search+bind mode, the server first binds to the LDAP di-
rectory with a fixed user name and password, specified with ldapbinddn and ldapbindpasswd,
and performs a search for the user trying to log in to the database. If no user and password is configured,
an anonymous bind will be attempted to the directory. The search will be performed over the subtree
at ldapbasedn, and will try to do an exact match of the attribute specified in ldapsearchat-
tribute. Once the user has been found in this search, the server disconnects and re-binds to the
directory as this user, using the password specified by the client, to verify that the login is correct. This
mode is the same as that used by LDAP authentication schemes in other software, such as Apache
mod_authnz_ldap and pam_ldap. This method allows for significantly more flexibility in where
the user objects are located in the directory, but will cause two separate connections to the LDAP
server to be made.
ldapserver
Names or IP addresses of LDAP servers to connect to. Multiple servers may be specified, sepa-
rated by spaces.
ldapport
Port number on LDAP server to connect to. If no port is specified, the LDAP library's default
port setting will be used.
ldapscheme
Set to ldaps to use LDAPS. This is a non-standard way of using LDAP over SSL, supported by
some LDAP server implementations. See also the ldaptls option for an alternative.
ldaptls
Set to 1 to make the connection between PostgreSQL and the LDAP server use TLS encryption.
This uses the StartTLS operation per RFC 4513. See also the ldapscheme option for an
alternative.
Note that using ldapscheme or ldaptls only encrypts the traffic between the PostgreSQL server
and the LDAP server. The connection between the PostgreSQL server and the PostgreSQL client will
still be unencrypted unless SSL is used there as well.
ldapprefix
String to prepend to the user name when forming the DN to bind as, when doing simple bind
authentication.
ldapsuffix
String to append to the user name when forming the DN to bind as, when doing simple bind
authentication.
ldapbasedn
Root DN to begin the search for the user in, when doing search+bind authentication.
ldapbinddn
DN of user to bind to the directory with to perform the search when doing search+bind authen-
tication.
ldapbindpasswd
Password for user to bind to the directory with to perform the search when doing search+bind
authentication.
ldapsearchattribute
Attribute to match against the user name in the search when doing search+bind authentication. If
no attribute is specified, the uid attribute will be used.
ldapsearchfilter
The search filter to use when doing search+bind authentication. Occurrences of $user-
name will be replaced with the user name. This allows for more flexible search filters than
ldapsearchattribute.
ldapurl
An RFC 4516 LDAP URL. This is an alternative way to write some of the other LDAP options
in a more compact and standard form. The format is
ldap[s]://host[:port]/basedn[?[attribute][?[scope][?[filter]]]]
scope must be one of base, one, sub, typically the last. (The default is base, which is nor-
mally not useful in this application.) attribute can nominate a single attribute, in which case
it is used as a value for ldapsearchattribute. If attribute is empty then filter can
be used as a value for ldapsearchfilter.
The URL scheme ldaps chooses the LDAPS method for making LDAP connections over
SSL, equivalent to using ldapscheme=ldaps. To use encrypted LDAP connections using the
StartTLS operation, use the normal URL scheme ldap and specify the ldaptls option in
addition to ldapurl.
LDAP URLs are currently only supported with OpenLDAP, not on Windows.
It is an error to mix configuration options for simple bind with options for search+bind.
When using search+bind mode, the search can be performed using a single attribute specified
with ldapsearchattribute, or using a custom search filter specified with ldapsearch-
filter. Specifying ldapsearchattribute=foo is equivalent to specifying ldapsearch-
filter="(foo=$username)". If neither option is specified the default is ldapsearchat-
tribute=uid.
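For instance, a simple-bind entry along the following lines (server name and DN components are illustrative) produces the behavior described next:

host ... ldap ldapserver=ldap.example.net ldapprefix="cn=" ldapsuffix=", dc=example, dc=net"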
When a connection to the database server as database user someuser is requested, PostgreSQL will
attempt to bind to the LDAP server using the DN cn=someuser, dc=example, dc=net and
the password provided by the client. If that connection succeeds, the database access is granted.
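A search+bind entry, by contrast, might look like this (again with illustrative names):

host ... ldap ldapserver=ldap.example.net ldapbasedn="dc=example, dc=net" ldapsearchattribute=uid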
When a connection to the database server as database user someuser is requested, PostgreSQL will
attempt to bind anonymously (since ldapbinddn was not specified) to the LDAP server, perform a
search for (uid=someuser) under the specified base DN. If an entry is found, it will then attempt
to bind using that found information and the password supplied by the client. If that second connection
succeeds, the database access is granted.
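The same search+bind configuration can be written more compactly as an LDAP URL, for example:

host ... ldap ldapurl="ldap://ldap.example.net/dc=example,dc=net?uid?sub"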
Some other software that supports authentication against LDAP uses the same URL format, so it will
be easier to share the configuration.
Tip
Since LDAP often uses commas and spaces to separate the different parts of a DN, it is often
necessary to use double-quoted parameter values when configuring LDAP options, as shown
in the examples.
When using RADIUS authentication, an Access Request message will be sent to the configured
RADIUS server. This request will be of type Authenticate Only, and include parameters for
user name, password (encrypted) and NAS Identifier. The request will be encrypted using
a secret shared with the server. The RADIUS server will respond to this request with either Access
Accept or Access Reject. There is no support for RADIUS accounting.
Multiple RADIUS servers can be specified, in which case they will be tried sequentially. If a negative
response is received from a server, the authentication will fail. If no response is received, the next
server in the list will be tried. To specify multiple servers, separate the server names with commas and
surround the list with double quotes. If multiple servers are specified, the other RADIUS options can
also be given as comma-separated lists, to provide individual values for each server. They can also be
specified as a single value, in which case that value will apply to all servers.
radiusservers
The DNS names or IP addresses of the RADIUS servers to connect to. This parameter is required.
radiussecrets
The shared secrets used when talking securely to the RADIUS servers. This must have exactly
the same value on the PostgreSQL and RADIUS servers. It is recommended that this be a string
of at least 16 characters. This parameter is required.
Note
The encryption vector used will only be cryptographically strong if PostgreSQL is built
with support for OpenSSL. In other cases, the transmission to the RADIUS server should
only be considered obfuscated, not secured, and external security measures should be
applied if necessary.
radiusports
The port numbers to connect to on the RADIUS servers. If no port is specified, the default
RADIUS port (1812) will be used.
radiusidentifiers
The strings to be used as NAS Identifier in the RADIUS requests. This parameter can be
used, for example, to identify which database cluster the user is attempting to connect to, which
can be useful for policy matching on the RADIUS server. If no identifier is specified, the default
postgresql will be used.
If it is necessary to have a comma or whitespace in a RADIUS parameter value, that can be done by
putting double quotes around the value, but it is tedious because two layers of double-quoting are now
required. An example of putting whitespace into RADIUS secret strings is:
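The following line (server names and secrets are placeholders) shows the doubled quoting that is required:

host ... radius radiusservers="server1,server2" radiussecrets="""secret one"",""secret two"""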
This authentication method uses SSL client certificates to perform authentication, and is therefore only available for SSL connections. When using this method, the server will require that the client provide a valid, trusted certificate. No password prompt will be sent to the client. The cn (Common Name) attribute of the certificate will be compared to the requested database user name, and if they match the login will be allowed. User name mapping can be used to allow cn to be different from the database user name.
The following configuration options are supported for SSL certificate authentication:
map
Allows for mapping between system and database user names. See Section 20.2 for details.
pamservice
PAM service name. The default service name is postgresql.
pam_use_hostname
Determines whether the remote IP address or the host name is provided to PAM modules through
the PAM_RHOST item. By default, the IP address is used. Set this option to 1 to use the resolved
host name instead. Host name resolution can lead to login delays. (Most PAM configurations
don't use this information, so it is only necessary to consider this setting if a PAM configuration
was specifically created to make use of it.)
Note
If PAM is set up to read /etc/shadow, authentication will fail because the PostgreSQL
server is started by a non-root user. However, this is not an issue when PAM is configured to
use LDAP or other authentication methods.
BSD Authentication in PostgreSQL uses the auth-postgresql login type and authenticates with
the postgresql login class if that's defined in login.conf. By default that login class does not
exist, and PostgreSQL will use the default login class.
Note
To use BSD Authentication, the PostgreSQL user account (that is, the operating system user
running the server) must first be added to the auth group. The auth group exists by default
on OpenBSD systems.
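Authentication failures and related problems generally show up as server error messages like the following (the host, user, and database names shown here are examples):

FATAL:  no pg_hba.conf entry for host "123.123.123.123", user "andym", database "testdb"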
This is what you are most likely to get if you succeed in contacting the server, but it does not want to
talk to you. As the message suggests, the server refused the connection request because it found no
matching entry in its pg_hba.conf configuration file.
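A failure at the authentication step itself typically looks like this (the user name is an example):

FATAL:  password authentication failed for user "andym"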
Messages like this indicate that you contacted the server, and it is willing to talk to you, but not
until you pass the authorization method specified in the pg_hba.conf file. Check the password
you are providing, or check your Kerberos or ident software if the complaint mentions one of those
authentication types.
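Another common message is (the database name is an example):

FATAL:  database "testdb" does not exist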
The database you are trying to connect to does not exist. Note that if you do not specify a database
name, it defaults to the database user name, which might or might not be the right thing.
Tip
The server log might contain more information about an authentication failure than is reported
to the client. If you are confused about the reason for a failure, check the server log.
Chapter 21. Database Roles
PostgreSQL manages database access permissions using the concept of roles. A role can be thought of
as either a database user, or a group of database users, depending on how the role is set up. Roles can
own database objects (for example, tables and functions) and can assign privileges on those objects to
other roles to control who has access to which objects. Furthermore, it is possible to grant membership
in a role to another role, thus allowing the member role to use privileges assigned to another role.
The concept of roles subsumes the concepts of “users” and “groups”. In PostgreSQL versions before
8.1, users and groups were distinct kinds of entities, but now there are only roles. Any role can act
as a user, a group, or both.
This chapter describes how to create and manage roles. More information about the effects of role
privileges on various database objects can be found in Section 5.6.
Roles are created using the CREATE ROLE SQL command:

CREATE ROLE name;

name follows the rules for SQL identifiers: either unadorned without special characters, or double-quoted. (In practice, you will usually want to add additional options, such as LOGIN, to the command. More details appear below.) To remove an existing role, use the analogous DROP ROLE command:

DROP ROLE name;
For convenience, the programs createuser and dropuser are provided as wrappers around these SQL
commands that can be called from the shell command line:
createuser name
dropuser name
To determine the set of existing roles, examine the pg_roles system catalog, for example:

SELECT rolname FROM pg_roles;
The psql program's \du meta-command is also useful for listing the existing roles.
In order to bootstrap the database system, a freshly initialized system always contains one predefined
role. This role is always a “superuser”, and by default (unless altered when running initdb) it will
have the same name as the operating system user that initialized the database cluster. Customarily,
this role will be named postgres. In order to create more roles you first have to connect as this
initial role.
Every connection to the database server is made using the name of some particular role, and this role
determines the initial access privileges for commands issued in that connection. The role name to use
for a particular database connection is indicated by the client that is initiating the connection request
in an application-specific fashion. For example, the psql program uses the -U command line option
to indicate the role to connect as. Many applications assume the name of the current operating system
user by default (including createuser and psql). Therefore it is often convenient to maintain a
naming correspondence between roles and operating system users.
The set of database roles a given client connection can connect as is determined by the client authen-
tication setup, as explained in Chapter 20. (Thus, a client is not limited to connect as the role matching
its operating system user, just as a person's login name need not match his or her real name.) Since the
role identity determines the set of privileges available to a connected client, it is important to carefully
configure privileges when setting up a multiuser environment.
login privilege
Only roles that have the LOGIN attribute can be used as the initial role name for a database
connection. A role with the LOGIN attribute can be considered the same as a “database user”. To
create a role with login privilege, use either:

CREATE ROLE name LOGIN;
CREATE USER name;
(CREATE USER is equivalent to CREATE ROLE except that CREATE USER includes LOGIN
by default, while CREATE ROLE does not.)
superuser status
A database superuser bypasses all permission checks, except the right to log in. This is a dangerous
privilege and should not be used carelessly; it is best to do most of your work as a role that is not
a superuser. To create a new database superuser, use CREATE ROLE name SUPERUSER. You
must do this as a role that is already a superuser.
database creation
A role must be explicitly given permission to create databases (except for superusers, since those
bypass all permission checks). To create such a role, use CREATE ROLE name CREATEDB.
role creation
A role must be explicitly given permission to create more roles (except for superusers, since those
bypass all permission checks). To create such a role, use CREATE ROLE name CREATEROLE.
A role with CREATEROLE privilege can alter and drop other roles, too, as well as grant or revoke
membership in them. However, to create, alter, drop, or change membership of a superuser role,
superuser status is required; CREATEROLE is insufficient for that.
initiating replication
A role must explicitly be given permission to initiate streaming replication (except for superusers,
since those bypass all permission checks). A role used for streaming replication must have LOGIN
permission as well. To create such a role, use CREATE ROLE name REPLICATION LOGIN.
password
A password is only significant if the client authentication method requires the user to supply a
password when connecting to the database. The password and md5 authentication methods
make use of passwords. Database passwords are separate from operating system passwords. Spec-
ify a password upon role creation with CREATE ROLE name PASSWORD 'string'.
A role's attributes can be modified after creation with ALTER ROLE. See the reference pages for the
CREATE ROLE and ALTER ROLE commands for details.
Tip
It is good practice to create a role that has the CREATEDB and CREATEROLE privileges, but is
not a superuser, and then use this role for all routine management of databases and roles. This
approach avoids the dangers of operating as a superuser for tasks that do not really require it.
A role can also have role-specific defaults for many of the run-time configuration settings described
in Chapter 19. For example, if for some reason you want to disable index scans (hint: not a good idea)
anytime you connect, you can use:

ALTER ROLE myname SET enable_indexscan TO off;
This will save the setting (but not set it immediately). In subsequent connections by this role it will
appear as though SET enable_indexscan TO off had been executed just before the session
started. You can still alter this setting during the session; it will only be the default. To remove a role-
specific default setting, use ALTER ROLE rolename RESET varname. Note that role-specific
defaults attached to roles without LOGIN privilege are fairly useless, since they will never be invoked.
It is frequently convenient to group users together to ease management of privileges: that way, privileges can be granted to, or revoked from, a group as a whole. In PostgreSQL this is done by creating a role that represents the group, and then granting membership in the group role to individual user roles. To set up a group role, first create the role:

CREATE ROLE name;

Typically a role being used as a group would not have the LOGIN attribute, though you can set it if you wish.
Once the group role exists, you can add and remove members using the GRANT and REVOKE com-
mands:

GRANT group_role TO role1, ... ;
REVOKE group_role FROM role1, ... ;
You can grant membership to other group roles, too (since there isn't really any distinction between
group roles and non-group roles). The database will not let you set up circular membership loops.
Also, it is not permitted to grant membership in a role to PUBLIC.
The members of a group role can use the privileges of the role in two ways. First, every member
of a group can explicitly do SET ROLE to temporarily “become” the group role. In this state, the
database session has access to the privileges of the group role rather than the original login role, and any
database objects created are considered owned by the group role not the login role. Second, member
roles that have the INHERIT attribute automatically have use of the privileges of roles of which they
are members, including any privileges inherited by those roles. As an example, suppose we have done:

CREATE ROLE joe LOGIN INHERIT;
CREATE ROLE admin NOINHERIT;
CREATE ROLE wheel NOINHERIT;
GRANT admin TO joe;
GRANT wheel TO admin;
Immediately after connecting as role joe, a database session will have use of privileges granted di-
rectly to joe plus any privileges granted to admin, because joe “inherits” admin's privileges.
However, privileges granted to wheel are not available, because even though joe is indirectly a
member of wheel, the membership is via admin which has the NOINHERIT attribute. After:

SET ROLE admin;
the session would have use of only those privileges granted to admin, and not those granted to joe.
After:

SET ROLE wheel;
the session would have use of only those privileges granted to wheel, and not those granted to either
joe or admin. The original privilege state can be restored with any of:

SET ROLE joe;
SET ROLE NONE;
RESET ROLE;
Note
The SET ROLE command always allows selecting any role that the original login role is
directly or indirectly a member of. Thus, in the above example, it is not necessary to become
admin before becoming wheel.
Note
In the SQL standard, there is a clear distinction between users and roles, and users do not
automatically inherit privileges while roles do. This behavior can be obtained in PostgreSQL
by giving roles being used as SQL roles the INHERIT attribute, while giving roles being used
as SQL users the NOINHERIT attribute. However, PostgreSQL defaults to giving all roles the
INHERIT attribute, for backward compatibility with pre-8.1 releases in which users always
had use of permissions granted to groups they were members of.
The role attributes LOGIN, SUPERUSER, CREATEDB, and CREATEROLE can be thought of as spe-
cial privileges, but they are never inherited as ordinary privileges on database objects are. You must
actually SET ROLE to a specific role having one of these attributes in order to make use of the at-
tribute. Continuing the above example, we might choose to grant CREATEDB and CREATEROLE to
the admin role. Then a session connecting as role joe would not have these privileges immediately,
only after doing SET ROLE admin.
To destroy a group role, use DROP ROLE:

DROP ROLE name;

Any memberships in the group role are automatically revoked (but the member roles are not otherwise affected).
Ownership of objects can be transferred one at a time using ALTER commands, for example:

ALTER TABLE bobs_table OWNER TO alice;
Alternatively, the REASSIGN OWNED command can be used to reassign ownership of all objects
owned by the role-to-be-dropped to a single other role. Because REASSIGN OWNED cannot access
objects in other databases, it is necessary to run it in each database that contains objects owned by the
role. (Note that the first such REASSIGN OWNED will change the ownership of any shared-across-
databases objects, that is databases or tablespaces, that are owned by the role-to-be-dropped.)
Once any valuable objects have been transferred to new owners, any remaining objects owned by the
role-to-be-dropped can be dropped with the DROP OWNED command. Again, this command cannot
access objects in other databases, so it is necessary to run it in each database that contains objects
owned by the role. Also, DROP OWNED will not drop entire databases or tablespaces, so it is necessary
to do that manually if the role owns any databases or tablespaces that have not been transferred to
new owners.
DROP OWNED also takes care of removing any privileges granted to the target role for objects that do
not belong to it. Because REASSIGN OWNED does not touch such objects, it's typically necessary to
run both REASSIGN OWNED and DROP OWNED (in that order!) to fully remove the dependencies
of a role to be dropped.
In short then, the most general recipe for removing a role that has been used to own objects is:

REASSIGN OWNED BY doomed_role TO successor_role;
DROP OWNED BY doomed_role;
-- repeat the above commands in each database of the cluster
DROP ROLE doomed_role;
When not all owned objects are to be transferred to the same successor owner, it's best to handle the
exceptions manually and then perform the above steps to mop up.
If DROP ROLE is attempted while dependent objects still remain, it will issue messages identifying
which objects need to be reassigned or dropped.
The default roles are described in Table 21.1. Note that the specific permissions for each of the default
roles may change in the future as additional capabilities are added. Administrators should monitor the
release notes for changes.
The pg_signal_backend role is intended to allow administrators to enable trusted, but non-su-
peruser, roles to send signals to other backends. Currently this role enables sending of signals for can-
celing a query on another backend or terminating its session. A user granted this role cannot however
send signals to a backend owned by a superuser. See Section 9.26.2.
Care should be taken when granting these roles to ensure they are only used where needed and with
the understanding that these roles grant access to privileged information.
Administrators can grant access to these roles to users using the GRANT command, for example:

GRANT pg_signal_backend TO admin_user;
Functions run inside the backend server process with the operating system permissions of the database
server daemon. If the programming language used for the function allows unchecked memory access-
es, it is possible to change the server's internal data structures. Hence, among many other things, such
functions can circumvent any system access controls. Function languages that allow such access are
considered “untrusted”, and PostgreSQL allows only superusers to create functions written in those
languages.
Chapter 22. Managing Databases
Every instance of a running PostgreSQL server manages one or more databases. Databases are there-
fore the topmost hierarchical level for organizing SQL objects (“database objects”). This chapter de-
scribes the properties of databases, and how to create, manage, and destroy them.
22.1. Overview
A small number of objects, like role, database, and tablespace names, are defined at the cluster level
and stored in the pg_global tablespace. Inside the cluster are multiple databases, which are isolated
from each other but can access cluster-level objects. Inside each database are multiple schemas, which
contain objects like tables and functions. So the full hierarchy is: cluster, database, schema, table (or
some other kind of object, such as a function).
When connecting to the database server, a client must specify the database name in its connection
request. It is not possible to access more than one database per connection. However, clients can open
multiple connections to the same database, or different databases. Database-level security has two
components: access control (see Section 20.1), managed at the connection level, and authorization
control (see Section 5.6), managed via the grant system. Foreign data wrappers (see postgres_fdw)
allow for objects within one database to act as proxies for objects in other database or clusters. The
older dblink module (see dblink) provides a similar capability. By default, all users can connect to all
databases using all connection methods.
If one PostgreSQL server cluster is planned to contain unrelated projects or users that should be, for
the most part, unaware of each other, it is recommended to put them into separate databases and adjust
authorizations and access controls accordingly. If the projects or users are interrelated, and thus should
be able to use each other's resources, they should be put in the same database but probably into separate
schemas; this provides a modular structure with namespace isolation and authorization control. More
information about managing schemas is in Section 5.8.
While multiple databases can be created within a single cluster, it is advised to consider carefully
whether the benefits outweigh the risks and limitations. In particular, consider the impact that having a shared WAL (see Chapter 30) has on backup and recovery options. While individual databases in the cluster
are isolated when considered from the user's perspective, they are closely bound from the database
administrator's point-of-view.
Databases are created with the CREATE DATABASE command (see Section 22.2) and destroyed
with the DROP DATABASE command (see Section 22.5). To determine the set of existing databases,
examine the pg_database system catalog, for example:

SELECT datname FROM pg_database;
The psql program's \l meta-command and -l command-line option are also useful for listing the
existing databases.
Note
The SQL standard calls databases “catalogs”, but there is no difference in practice.
Databases are created with the SQL command CREATE DATABASE:

CREATE DATABASE name;

where name follows the usual rules for SQL identifiers. The current role automatically becomes the owner of the new database. It is the privilege of the owner of a database to remove it later (which also removes all the objects in it, even if they have a different owner).
The creation of databases is a restricted operation. See Section 21.2 for how to grant permission.
Since you need to be connected to the database server in order to execute the CREATE DATABASE
command, the question remains how the first database at any given site can be created. The first
database is always created by the initdb command when the data storage area is initialized. (See
Section 18.2.) This database is called postgres. So to create the first “ordinary” database you can
connect to postgres.
A second database, template1, is also created during database cluster initialization. Whenever a
new database is created within the cluster, template1 is essentially cloned. This means that any
changes you make in template1 are propagated to all subsequently created databases. Because of
this, avoid creating objects in template1 unless you want them propagated to every newly created
database. More details appear in Section 22.3.
As a convenience, there is a program you can execute from the shell to create new databases, cre-
atedb.
createdb dbname
createdb does no magic. It connects to the postgres database and issues the CREATE DATA-
BASE command, exactly as described above. The createdb reference page contains the invocation de-
tails. Note that createdb without any arguments will create a database with the current user name.
Note
Chapter 20 contains information about how to restrict who can connect to a given database.
Sometimes you want to create a database for someone else, and have them become the owner of
the new database, so they can configure and manage it themselves. To achieve that, use one of the
following commands:

CREATE DATABASE dbname OWNER rolename;

from the SQL environment, or:

createdb -O rolename dbname

from the shell. Only the superuser is allowed to create a database for someone else (that is, for a role you are not a member of).
There is a second standard system database named template0. This database contains the same data
as the initial contents of template1, that is, only the standard objects predefined by your version
of PostgreSQL. template0 should never be changed after the database cluster has been initialized.
By instructing CREATE DATABASE to copy template0 instead of template1, you can create a
“virgin” user database that contains none of the site-local additions in template1. This is particu-
larly handy when restoring a pg_dump dump: the dump script should be restored in a virgin database
to ensure that one recreates the correct contents of the dumped database, without conflicting with ob-
jects that might have been added to template1 later on.
Another common reason for copying template0 instead of template1 is that new encoding and
locale settings can be specified when copying template0, whereas a copy of template1 must use
the same settings it does. This is because template1 might contain encoding-specific or locale-spe-
cific data, while template0 is known not to.
It is possible to create additional template databases, and indeed one can copy any database in a cluster
by specifying its name as the template for CREATE DATABASE. It is important to understand, how-
ever, that this is not (yet) intended as a general-purpose “COPY DATABASE” facility. The principal
limitation is that no other sessions can be connected to the source database while it is being copied.
CREATE DATABASE will fail if any other connection exists when it starts; during the copy operation,
new connections to the source database are prevented.
Two useful flags exist in pg_database for each database: the columns datistemplate and
datallowconn. datistemplate can be set to indicate that a database is intended as a template
for CREATE DATABASE. If this flag is set, the database can be cloned by any user with CREATEDB
privileges; if it is not set, only superusers and the owner of the database can clone it. If datallow-
conn is false, then no new connections to that database will be allowed (but existing sessions are not
terminated simply by setting the flag false). The template0 database is normally marked datal-
lowconn = false to prevent its modification. Both template0 and template1 should always
be marked with datistemplate = true.
Note
template1 and template0 do not have any special status beyond the fact that the name
template1 is the default source database name for CREATE DATABASE. For example, one
could drop template1 and recreate it from template0 without any ill effects. This course
of action might be advisable if one has carelessly added a bunch of junk in template1. (To
delete template1, it must have pg_database.datistemplate = false.)
The postgres database is also created when a database cluster is initialized. This database
is meant as a default database for users and applications to connect to. It is simply a copy of
template1 and can be dropped and recreated if necessary.
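As an illustration, recreating template1 from template0 could be done roughly like this (run as a superuser while no other sessions are connected to template1; ALTER DATABASE ... IS_TEMPLATE adjusts the datistemplate flag mentioned above):

ALTER DATABASE template1 IS_TEMPLATE false;
DROP DATABASE template1;
CREATE DATABASE template1 TEMPLATE template0;
ALTER DATABASE template1 IS_TEMPLATE true;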
For example, if for some reason you want to disable the GEQO optimizer for a given database, you'd
ordinarily have to either disable it for all databases or make sure that every connecting client is careful
to issue SET geqo TO off. To make this setting the default within a particular database, you can
execute the command:

ALTER DATABASE mydb SET geqo TO off;
This will save the setting (but not set it immediately). In subsequent connections to this database it
will appear as though SET geqo TO off; had been executed just before the session started. Note
that users can still alter this setting during their sessions; it will only be the default. To undo any such
setting, use ALTER DATABASE dbname RESET varname.
Databases are destroyed with the command DROP DATABASE:

DROP DATABASE name;

Only the owner of the database, or a superuser, can drop a database. Dropping a database removes all objects that were contained within the database. The destruction of a database cannot be undone.
You cannot execute the DROP DATABASE command while connected to the victim database. You
can, however, be connected to any other database, including the template1 database. template1
would be the only option for dropping the last user database of a given cluster.
For convenience, there is also a shell program to drop databases, dropdb:

dropdb dbname
(Unlike createdb, it is not the default action to drop the database with the current user name.)
22.6. Tablespaces
Tablespaces in PostgreSQL allow database administrators to define locations in the file system where
the files representing database objects can be stored. Once created, a tablespace can be referred to by
name when creating database objects.
By using tablespaces, an administrator can control the disk layout of a PostgreSQL installation. This
is useful in at least two ways. First, if the partition or volume on which the cluster was initialized runs
out of space and cannot be extended, a tablespace can be created on a different partition and used until
the system can be reconfigured.
Second, tablespaces allow an administrator to use knowledge of the usage pattern of database objects
to optimize performance. For example, an index which is very heavily used can be placed on a very
fast, highly available disk, such as an expensive solid state device. At the same time a table storing
archived data which is rarely used or not performance critical could be stored on a less expensive,
slower disk system.
Warning
Even though located outside the main PostgreSQL data directory, tablespaces are an integral
part of the database cluster and cannot be treated as an autonomous collection of data files.
They are dependent on metadata contained in the main data directory, and therefore cannot
be attached to a different database cluster or backed up individually. Similarly, if you lose a
tablespace (file deletion, disk failure, etc), the database cluster might become unreadable or
unable to start. Placing a tablespace on a temporary file system like a RAM disk risks the
reliability of the entire cluster.
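To define a tablespace, use the CREATE TABLESPACE command, for example (the tablespace name fastspace and the directory path here are only placeholders):
CREATE TABLESPACE fastspace LOCATION '/ssd1/postgresql/data';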
The location must be an existing, empty directory that is owned by the PostgreSQL operating system
user. All objects subsequently created within the tablespace will be stored in files underneath this
directory. The location must not be on removable or transient storage, as the cluster might fail to
function if the tablespace is missing or lost.
Note
There is usually not much point in making more than one tablespace per logical file system,
since you cannot control the location of individual files within a logical file system. However,
PostgreSQL does not enforce any such limitation, and indeed it is not directly aware of the file
system boundaries on your system. It just stores files in the directories you tell it to use.
Creation of the tablespace itself must be done as a database superuser, but after that you can allow
ordinary database users to use it. To do that, grant them the CREATE privilege on it.
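For example (assuming a tablespace fastspace as sketched above and a role named joe, both placeholders):
GRANT CREATE ON TABLESPACE fastspace TO joe;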
Tables, indexes, and entire databases can be assigned to particular tablespaces. To do so, a user with
the CREATE privilege on a given tablespace must pass the tablespace name as a parameter to the
relevant command. For example, the following creates a table in the tablespace space1:
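-- a sketch; the table definition itself is arbitrary
CREATE TABLE foo(i int) TABLESPACE space1;
Alternatively, the default_tablespace parameter can be used:
SET default_tablespace = space1;
CREATE TABLE foo(i int);
When default_tablespace is set to anything but an empty string, it supplies an implicit TABLESPACE clause for CREATE TABLE and CREATE INDEX commands issued within the session.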
There is also a temp_tablespaces parameter, which determines the placement of temporary tables and
indexes, as well as temporary files that are used for purposes such as sorting large data sets. This can
be a list of tablespace names, rather than only one, so that the load associated with temporary objects
can be spread over multiple tablespaces. A random member of the list is picked each time a temporary
object is to be created.
The tablespace associated with a database is used to store the system catalogs of that database. Fur-
thermore, it is the default tablespace used for tables, indexes, and temporary files created within the
database, if no TABLESPACE clause is given and no other selection is specified by default_ta-
blespace or temp_tablespaces (as appropriate). If a database is created without specifying a
tablespace for it, it uses the same tablespace as the template database it is copied from.
Two tablespaces are automatically created when the database cluster is initialized. The pg_global
tablespace is used for shared system catalogs. The pg_default tablespace is the default tablespace
of the template1 and template0 databases (and, therefore, will be the default tablespace for
other databases as well, unless overridden by a TABLESPACE clause in CREATE DATABASE).
Once created, a tablespace can be used from any database, provided the requesting user has sufficient
privilege. This means that a tablespace cannot be dropped until all objects in all databases using the
tablespace have been removed.
To determine the set of existing tablespaces, examine the pg_tablespace system catalog, for
example
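SELECT spcname FROM pg_tablespace;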
The psql program's \db meta-command is also useful for listing the existing tablespaces.
PostgreSQL makes use of symbolic links to simplify the implementation of tablespaces. This means
that tablespaces can be used only on systems that support symbolic links.
The directory $PGDATA/pg_tblspc contains symbolic links that point to each of the non-built-in
tablespaces defined in the cluster. Although not recommended, it is possible to adjust the tablespace
layout by hand by redefining these links. Under no circumstances perform this operation while the
server is running. Note that in PostgreSQL 9.1 and earlier you will also need to update the pg_ta-
blespace catalog with the new locations. (If you do not, pg_dump will continue to output the old
tablespace locations.)
Chapter 23. Localization
This chapter describes the available localization features from the point of view of the administrator.
PostgreSQL supports two localization facilities:
• Using the locale features of the operating system to provide locale-specific collation order, number
formatting, translated messages, and other aspects. This is covered in Section 23.1 and Section 23.2.
• Providing a number of different character sets to support storing text in all kinds of languages, and
providing character set translation between client and server. This is covered in Section 23.3.
23.1. Locale Support
23.1.1. Overview
Locale support is automatically initialized when a database cluster is created using initdb. initdb
will initialize the database cluster with the locale setting of its execution environment by default, so if
your system is already set to use the locale that you want in your database cluster then there is nothing
else you need to do. If you want to use a different locale (or you are not sure which locale your system
is set to), you can instruct initdb exactly which locale to use by specifying the --locale option.
For example:
initdb --locale=sv_SE
This example for Unix systems sets the locale to Swedish (sv) as spoken in Sweden (SE). Other
possibilities might include en_US (U.S. English) and fr_CA (French Canadian). If more than one
character set can be used for a locale then the specifications can take the form language_terri-
tory.codeset. For example, fr_BE.UTF-8 represents the French language (fr) as spoken in
Belgium (BE), with a UTF-8 character set encoding.
What locales are available on your system under what names depends on what was provided by the
operating system vendor and what was installed. On most Unix systems, the command locale -
a will provide a list of available locales. Windows uses more verbose locale names, such as Ger-
man_Germany or Swedish_Sweden.1252, but the principles are the same.
Occasionally it is useful to mix rules from several locales, e.g., use English collation rules but Spanish
messages. To support that, a set of locale subcategories exist that control only certain aspects of the
localization rules:
• LC_COLLATE: String sort order
• LC_CTYPE: Character classification (What is a letter? Its upper-case equivalent?)
• LC_MESSAGES: Language of messages
• LC_MONETARY: Formatting of currency amounts
• LC_NUMERIC: Formatting of numbers
• LC_TIME: Formatting of dates and times
The category names translate into names of initdb options to override the locale choice for a specific
category. For instance, to set the locale to French Canadian, but use U.S. rules for formatting currency,
use initdb --locale=fr_CA --lc-monetary=en_US.
If you want the system to behave as if it had no locale support, use the special locale name C, or
equivalently POSIX.
Some locale categories must have their values fixed when the database is created. You can use differ-
ent settings for different databases, but once a database is created, you cannot change them for that
database anymore. LC_COLLATE and LC_CTYPE are these categories. They affect the sort order of
indexes, so they must be kept fixed, or indexes on text columns would become corrupt. (But you can
alleviate this restriction using collations, as discussed in Section 23.2.) The default values for these
categories are determined when initdb is run, and those values are used when new databases are
created, unless specified otherwise in the CREATE DATABASE command.
The other locale categories can be changed whenever desired by setting the server configuration
parameters that have the same name as the locale categories (see Section 19.11.2 for details). The
values that are chosen by initdb are actually only written into the configuration file post-
gresql.conf to serve as defaults when the server is started. If you remove these assignments from
postgresql.conf then the server will inherit the settings from its execution environment.
Note that the locale behavior of the server is determined by the environment variables seen by the
server, not by the environment of any client. Therefore, be careful to configure the correct locale
settings before starting the server. A consequence of this is that if client and server are set up in different
locales, messages might appear in different languages depending on where they originated.
Note
When we speak of inheriting the locale from the execution environment, this means the fol-
lowing on most operating systems: For a given locale category, say the collation, the follow-
ing environment variables are consulted in this order until one is found to be set: LC_ALL,
LC_COLLATE (or the variable corresponding to the respective category), LANG. If none of
these environment variables are set then the locale defaults to C.
Some message localization libraries also look at the environment variable LANGUAGE which
overrides all other locale settings for the purpose of setting the language of messages. If in
doubt, please refer to the documentation of your operating system, in particular the documen-
tation about gettext.
To enable messages to be translated to the user's preferred language, NLS must have been selected at
build time (configure --enable-nls). All other locale support is built in automatically.
23.1.2. Behavior
The locale settings influence the following SQL features:
• Sort order in queries using ORDER BY or the standard comparison operators on textual data
• Pattern matching operators (LIKE, SIMILAR TO, and POSIX-style regular expressions); locales
affect both case insensitive matching and the classification of characters by character-class regular
expressions
The drawback of using locales other than C or POSIX in PostgreSQL is its performance impact. It
slows character handling and prevents ordinary indexes from being used by LIKE. For this reason use
locales only if you actually need them.
As a workaround to allow PostgreSQL to use indexes with LIKE clauses under a non-C locale, several
custom operator classes exist. These allow the creation of an index that performs a strict character-by-
character comparison, ignoring locale comparison rules. Refer to Section 11.10 for more information.
Another approach is to create indexes using the C collation, as discussed in Section 23.2.
23.1.3. Problems
If locale support doesn't work according to the explanation above, check that the locale support in your
operating system is correctly configured. To check what locales are installed on your system, you can
use the command locale -a if your operating system provides it.
Check that PostgreSQL is actually using the locale that you think it is. The LC_COLLATE and LC_C-
TYPE settings are determined when a database is created, and cannot be changed except by creating
a new database. Other locale settings including LC_MESSAGES and LC_MONETARY are initially de-
termined by the environment the server is started in, but can be changed on-the-fly. You can check
the active locale settings using the SHOW command.
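For example:
SHOW lc_collate;
SHOW lc_ctype;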
The directory src/test/locale in the source distribution contains a test suite for PostgreSQL's
locale support.
Client applications that handle server-side errors by parsing the text of the error message will obviously
have problems when the server's messages are in a different language. Authors of such applications
are advised to make use of the error code scheme instead.
Maintaining catalogs of message translations requires the on-going efforts of many volunteers that
want to see PostgreSQL speak their preferred language well. If messages in your language are currently
not available or not fully translated, your assistance would be appreciated. If you want to help, refer
to Chapter 55 or write to the developers' mailing list.
23.2. Collation Support
23.2.1. Concepts
Conceptually, every expression of a collatable data type has a collation. (The built-in collatable data
types are text, varchar, and char. User-defined base types can also be marked collatable, and
of course a domain over a collatable data type is collatable.) If the expression is a column reference,
the collation of the expression is the defined collation of the column. If the expression is a constant,
the collation is the default collation of the data type of the constant. The collation of a more complex
expression is derived from the collations of its inputs, as described below.
The collation of an expression can be the “default” collation, which means the locale settings defined
for the database. It is also possible for an expression's collation to be indeterminate. In such cases,
ordering operations and other operations that need to know the collation will fail.
When the database system has to perform an ordering or a character classification, it uses the collation
of the input expression. This happens, for example, with ORDER BY clauses and function or operator
calls such as <. The collation to apply for an ORDER BY clause is simply the collation of the sort
key. The collation to apply for a function or operator call is derived from the arguments, as described
below. In addition to comparison operators, collations are taken into account by functions that convert
between lower and upper case letters, such as lower, upper, and initcap; by pattern matching
operators; and by to_char and related functions.
For a function or operator call, the collation that is derived by examining the argument collations is
used at run time for performing the specified operation. If the result of the function or operator call is of
a collatable data type, the collation is also used at parse time as the defined collation of the function or
operator expression, in case there is a surrounding expression that requires knowledge of its collation.
The collation derivation of an expression can be implicit or explicit. This distinction affects how col-
lations are combined when multiple different collations appear in an expression. An explicit collation
derivation occurs when a COLLATE clause is used; all other collation derivations are implicit. When
multiple collations need to be combined, for example in a function call, the following rules are used:
1. If any input expression has an explicit collation derivation, then all explicitly derived collations
among the input expressions must be the same, otherwise an error is raised. If any explicitly derived
collation is present, that is the result of the collation combination.
2. Otherwise, all input expressions must have the same implicit collation derivation or the default
collation. If any non-default collation is present, that is the result of the collation combination.
Otherwise, the result is the default collation.
3. If there are conflicting non-default implicit collations among the input expressions, then the com-
bination is deemed to have indeterminate collation. This is not an error condition unless the par-
ticular function being invoked requires knowledge of the collation it should apply. If it does, an
error will be raised at run-time.
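For example, these rules can be illustrated with a table definition along the following lines (the table and column names, and the specific collations de_DE and es_ES, are assumptions chosen to match the cases discussed below; the collation names actually available depend on the operating system):
CREATE TABLE test1 (
    a text COLLATE "de_DE",
    b text COLLATE "es_ES"
);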
Then in
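SELECT a < 'foo' FROM test1;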
the < comparison is performed according to de_DE rules, because the expression combines an im-
plicitly derived collation with the default collation. But in
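SELECT a < ('foo' COLLATE "fr_FR") FROM test1;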
the comparison is performed using fr_FR rules, because the explicit collation derivation overrides
the implicit one. Furthermore, given
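SELECT a < b FROM test1;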
the parser cannot determine which collation to apply, since the a and b columns have conflicting
implicit collations. Since the < operator does need to know which collation to use, this will result in an
error. The error can be resolved by attaching an explicit collation specifier to either input expression,
thus:
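SELECT a < b COLLATE "de_DE" FROM test1;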
or equivalently
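SELECT a COLLATE "de_DE" < b FROM test1;
On the other hand, the structurally similar case
SELECT a || b FROM test1;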
does not result in an error, because the || operator does not care about collations: its result is the
same regardless of the collation.
The collation assigned to a function or operator's combined input expressions is also considered to
apply to the function or operator's result, if the function or operator delivers a result of a collatable
data type. So, in
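SELECT * FROM test1 ORDER BY a || 'foo';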
the ordering will be done according to de_DE rules. But this query:
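SELECT * FROM test1 ORDER BY a || b;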
results in an error, because even though the || operator doesn't need to know a collation, the ORDER
BY clause does. As before, the conflict can be resolved with an explicit collation specifier:
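SELECT * FROM test1 ORDER BY a || b COLLATE "fr_FR";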
A collation object provided by libc maps to a combination of LC_COLLATE and LC_CTYPE set-
tings, as accepted by the setlocale() system library call. (As the name would suggest, the main
purpose of a collation is to set LC_COLLATE, which controls the sort order. But it is rarely necessary
in practice to have an LC_CTYPE setting that is different from LC_COLLATE, so it is more conve-
nient to collect these under one concept than to create another infrastructure for setting LC_CTYPE
per expression.) Also, a libc collation is tied to a character set encoding (see Section 23.3). The same
collation name may exist for different encodings.
A collation object provided by icu maps to a named collator provided by the ICU library. ICU does not
support separate “collate” and “ctype” settings, so they are always the same. Also, ICU collations are
independent of the encoding, so there is always only one ICU collation of a given name in a database.
Additionally, the SQL standard collation name ucs_basic is available for encoding UTF8. It is
equivalent to C and sorts by Unicode code point.
To inspect the currently available locales, use the query SELECT * FROM pg_collation, or
the command \dOS+ in psql.
The default set of collations provided by libc map directly to the locales installed in the operating
system, which can be listed using the command locale -a. In case a libc collation is needed
that has different values for LC_COLLATE and LC_CTYPE, or if new locales are installed in the
operating system after the database system was initialized, then a new collation may be created using
the CREATE COLLATION command. New operating system locales can also be imported en masse
using the pg_import_system_collations() function.
Within any particular database, only collations that use that database's encoding are of interest. Oth-
er entries in pg_collation are ignored. Thus, a stripped collation name such as de_DE can be
considered unique within a given database even though it would not be unique globally. Use of the
stripped collation names is recommended, since it will make one fewer thing you need to change if
you decide to change to another database encoding. Note however that the default, C, and POSIX
collations can be used regardless of the database encoding.
PostgreSQL considers distinct collation objects to be incompatible even when they have identical
properties. Thus for example,
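SELECT a COLLATE "C" < b COLLATE "POSIX" FROM test1;
(Here a and b are text columns as in the earlier test1 example.)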
will draw an error even though the C and POSIX collations have identical behaviors. Mixing stripped
and non-stripped collation names is therefore not recommended.
If PostgreSQL was built with ICU support, collations for ICU locales are also predefined, for example:
de-x-icu
    German collation, default variant
de-AT-x-icu
    German collation for Austria, default variant
    (There are also, say, de-DE-x-icu or de-CH-x-icu, but as of this writing, they are equivalent to de-x-icu.)
und-x-icu (for “undefined”)
    ICU “root” collation. Use this to get a reasonable language-agnostic sort order.
Some (less frequently used) encodings are not supported by ICU. When the database encoding is one
of these, ICU collation entries in pg_collation are ignored. Attempting to use one will draw an
error along the lines of “collation "de-x-icu" for encoding "WIN874" does not exist”.
The standard and predefined collations are in the schema pg_catalog, like all predefined objects.
User-defined collations should be created in user schemas. This also ensures that they are saved by
pg_dump.
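If the standard and predefined collations are not sufficient, a new collation object can be created with the CREATE COLLATION command. A minimal sketch for a libc-based collation, assuming the operating system provides a locale named de_DE (the object name german is arbitrary):
CREATE COLLATION german (provider = libc, locale = 'de_DE');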
The exact values that are acceptable for the locale clause in this command depend on the operating
system. On Unix-like systems, the command locale -a will show a list.
Since the predefined libc collations already include all collations defined in the operating system when
the database instance is initialized, it is not often necessary to manually create new ones. Reasons
might be if a different naming system is desired (in which case see also Section 23.2.2.3.3) or if
the operating system has been upgraded to provide new locale definitions (in which case see also
pg_import_system_collations()).
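For ICU, sketches of two equivalent ways to define a German phone-book-order collation might look like this (the object names and locale strings are illustrative):
CREATE COLLATION "de-u-co-phonebk-x-icu" (provider = icu, locale = 'de-u-co-phonebk');
CREATE COLLATION german_phonebook (provider = icu, locale = 'de@collation=phonebook');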
The first example selects the ICU locale using a “language tag” per BCP 47. The second example
uses the traditional ICU-specific locale syntax. The first style is preferred going forward, but it
is not supported by older ICU versions.
Note that you can name the collation objects in the SQL environment anything you want. In this
example, we follow the naming style that the predefined collations use, which in turn also follow
BCP 47, but that is not required for user-defined collations.
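Further examples follow the same pattern; the collation names and locale strings below are illustrative assumptions rather than required spellings.
CREATE COLLATION "und-u-co-emoji-x-icu" (provider = icu, locale = 'und-u-co-emoji');
CREATE COLLATION "@collation=emoji" (provider = icu, locale = '@collation=emoji');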
Root collation with Emoji collation type, per Unicode Technical Standard #51
Observe how in the traditional ICU locale naming system, the root locale is selected by an empty
string.
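-- illustrative; uses the CLDR reordering key kr
CREATE COLLATION greekfirst (provider = icu, locale = 'und-u-kr-grek-latn');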
Sort Greek letters before Latin ones. (The default is Latin before Greek.)
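-- illustrative; uses the CLDR case-first key kf
CREATE COLLATION upperfirst (provider = icu, locale = 'und-u-kf-upper');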
Sort upper-case letters before lower-case letters. (The default is lower-case letters first.)
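-- illustrative; uses the CLDR numeric-ordering key kn
CREATE COLLATION digitsbyvalue (provider = icu, locale = 'und-u-kn-true');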
Numeric ordering, sorts sequences of digits by their numeric value, for example: A-21 < A-123
(also known as natural sort).
See Unicode Technical Standard #35 and BCP 47 for details. The list of possible collation types (co
subtag) can be found in the CLDR repository.
Note that while this system allows creating collations that “ignore case” or “ignore accents” or similar
(using the ks key), PostgreSQL does not at the moment allow such collations to act in a truly case- or
accent-insensitive manner. Any strings that compare equal according to the collation but are not byte-
wise equal will be sorted according to their byte values.
Note
By design, ICU will accept almost any string as a locale name and match it to the closest locale
it can provide, using the fallback procedure described in its documentation. Thus, there will be
no direct feedback if a collation specification is composed using features that the given ICU
installation does not actually support. It is therefore recommended to create application-level
test cases to check that the collation definitions satisfy one's requirements.
23.3. Character Set Support
The character set support in PostgreSQL allows you to store text in a variety of character sets (also
called encodings), including single-byte character sets such as the ISO 8859 series and multiple-byte
character sets such as EUC (Extended Unix Code), UTF-8, and Mule internal code. All supported
character sets can be used transparently by clients, but a few are not supported for use within the
server (that is, as a server-side encoding). The default character set is selected while initializing your
PostgreSQL database cluster using initdb. It can be overridden when you create a database, so you
can have multiple databases each with a different character set.
An important restriction, however, is that each database's character set must be compatible with the
database's LC_CTYPE (character classification) and LC_COLLATE (string sort order) locale settings.
For C or POSIX locale, any character set is allowed, but for other libc-provided locales there is only
one character set that will work correctly. (On Windows, however, UTF-8 encoding can be used with
any locale.) If you have ICU support configured, ICU-provided locales can be used with most but not
all server-side encodings.
Not all client APIs support all the listed character sets. For example, the PostgreSQL JDBC driver
does not support MULE_INTERNAL, LATIN6, LATIN8, and LATIN10.
The SQL_ASCII setting behaves considerably differently from the other settings. When the server
character set is SQL_ASCII, the server interprets byte values 0-127 according to the ASCII standard,
while byte values 128-255 are taken as uninterpreted characters. No encoding conversion will be done
when the setting is SQL_ASCII. Thus, this setting is not so much a declaration that a specific encoding
is in use, as a declaration of ignorance about the encoding. In most cases, if you are working with any
non-ASCII data, it is unwise to use the SQL_ASCII setting because PostgreSQL will be unable to
help you by converting or validating non-ASCII characters.
initdb defines the default character set (encoding) for a PostgreSQL cluster. For example,
initdb -E EUC_JP
sets the default character set to EUC_JP (Extended Unix Code for Japanese). You can use --en-
coding instead of -E if you prefer longer option strings. If no -E or --encoding option is given,
initdb attempts to determine the appropriate encoding to use based on the specified or default locale.
You can specify a non-default encoding at database creation time, provided that the encoding is com-
patible with the selected locale:
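For example (assuming a Korean locale named ko_KR.euckr is installed on the system):
createdb -T template0 -E EUC_KR --lc-collate=ko_KR.euckr --lc-ctype=ko_KR.euckr korean
CREATE DATABASE korean WITH ENCODING 'EUC_KR' LC_COLLATE='ko_KR.euckr' LC_CTYPE='ko_KR.euckr' TEMPLATE=template0;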
Notice that the above commands specify copying the template0 database. When copying any oth-
er database, the encoding and locale settings cannot be changed from those of the source database,
because that might result in corrupt data. For more information see Section 22.3.
The encoding for a database is stored in the system catalog pg_database. You can see it by using
the psql -l option or the \l command.
$ psql -l
List of databases
Name | Owner | Encoding | Collation | Ctype |
Access Privileges
-----------+----------+-----------+-------------+-------------
+-------------------------------------
clocaledb | hlinnaka | SQL_ASCII | C | C |
englishdb | hlinnaka | UTF8 | en_GB.UTF8 | en_GB.UTF8 |
japanese | hlinnaka | UTF8 | ja_JP.UTF8 | ja_JP.UTF8 |
korean | hlinnaka | EUC_KR | ko_KR.euckr | ko_KR.euckr |
postgres | hlinnaka | UTF8 | fi_FI.UTF8 | fi_FI.UTF8 |
template0 | hlinnaka | UTF8 | fi_FI.UTF8 | fi_FI.UTF8 |
{=c/hlinnaka,hlinnaka=CTc/hlinnaka}
template1 | hlinnaka | UTF8 | fi_FI.UTF8 | fi_FI.UTF8 |
{=c/hlinnaka,hlinnaka=CTc/hlinnaka}
(7 rows)
Important
On most modern operating systems, PostgreSQL can determine which character set is implied
by the LC_CTYPE setting, and it will enforce that only the matching database encoding is
used. On older systems it is your responsibility to ensure that you use the encoding expected
by the locale you have selected. A mistake in this area is likely to lead to strange behavior of
locale-dependent operations such as sorting.
PostgreSQL will allow superusers to create databases with SQL_ASCII encoding even when
LC_CTYPE is not C or POSIX. As noted above, SQL_ASCII does not enforce that the data
stored in the database has any particular encoding, and so this choice poses risks of locale-de-
pendent misbehavior. Using this combination of settings is deprecated and may someday be
forbidden altogether.
To enable automatic character set conversion, you have to tell PostgreSQL the character set (encoding)
you would like to use in the client. There are several ways to accomplish this:
• Using the \encoding command in psql. \encoding allows you to change client encoding on
the fly. For example, to change the encoding to SJIS, type:
\encoding SJIS
• Using SET client_encoding TO. Setting the client encoding can be done with this SQL
command:
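SET CLIENT_ENCODING TO 'SJIS';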
Also you can use the standard SQL syntax SET NAMES for this purpose:
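SET NAMES 'SJIS';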
To query the current client encoding:
SHOW client_encoding;
To return to the default encoding:
RESET client_encoding;
• Using the configuration variable client_encoding. If the client_encoding variable is set, that
client encoding is automatically selected when a connection to the server is made. (This can subse-
quently be overridden using any of the other methods mentioned above.)
If the conversion of a particular character is not possible — suppose you chose EUC_JP for the server
and LATIN1 for the client, and some Japanese characters are returned that do not have a representation
in LATIN1 — an error is reported.
If the client character set is defined as SQL_ASCII, encoding conversion is disabled, regardless of
the server's character set. Just as for the server, use of SQL_ASCII is unwise unless you are working
with all-ASCII data.
Chapter 24. Routine Database
Maintenance Tasks
PostgreSQL, like any database software, requires that certain tasks be performed regularly to achieve
optimum performance. The tasks discussed here are required, but they are repetitive in nature and
can easily be automated using standard tools such as cron scripts or Windows' Task Scheduler. It is
the database administrator's responsibility to set up appropriate scripts, and to check that they execute
successfully.
One obvious maintenance task is the creation of backup copies of the data on a regular schedule.
Without a recent backup, you have no chance of recovery after a catastrophe (disk failure, fire, mis-
takenly dropping a critical table, etc.). The backup and recovery mechanisms available in PostgreSQL
are discussed at length in Chapter 25.
The other main category of maintenance task is periodic “vacuuming” of the database. This activity
is discussed in Section 24.1. Closely related to this is updating the statistics that will be used by the
query planner, as discussed in Section 24.1.3.
Another task that might need periodic attention is log file management. This is discussed in Sec-
tion 24.3.
check_postgres (https://bucardo.org/check_postgres/) is available for monitoring database health and reporting unusual conditions. check_postgres integrates with Nagios and MRTG, but can be run standalone too.
PostgreSQL's VACUUM command has to process each table on a regular basis for several reasons:
1. To recover or reuse disk space occupied by updated or deleted rows.
2. To update data statistics used by the PostgreSQL query planner.
3. To update the visibility map, which speeds up index-only scans.
4. To protect against loss of very old data due to transaction ID wraparound or multixact ID wraparound.
Each of these reasons dictates performing VACUUM operations of varying frequency and scope, as
explained in the following subsections.
There are two variants of VACUUM: standard VACUUM and VACUUM FULL. VACUUM FULL can
reclaim more disk space but runs much more slowly. Also, the standard form of VACUUM can run in
parallel with production database operations. (Commands such as SELECT, INSERT, UPDATE, and
DELETE will continue to function normally, though you will not be able to modify the definition of a
table with commands such as ALTER TABLE while it is being vacuumed.) VACUUM FULL requires
an ACCESS EXCLUSIVE lock on the table it is working on, and therefore cannot be done in parallel
with other use of the table. Generally, therefore, administrators should strive to use standard VACUUM
and avoid VACUUM FULL.
VACUUM creates a substantial amount of I/O traffic, which can cause poor performance for other active
sessions. There are configuration parameters that can be adjusted to reduce the performance impact
of background vacuuming — see Section 19.4.4.
The standard form of VACUUM removes dead row versions in tables and indexes and marks the space
available for future reuse. However, it will not return the space to the operating system, except in
the special case where one or more pages at the end of a table become entirely free and an exclusive
table lock can be easily obtained. In contrast, VACUUM FULL actively compacts tables by writing a
complete new version of the table file with no dead space. This minimizes the size of the table, but
can take a long time. It also requires extra disk space for the new copy of the table, until the operation
completes.
The usual goal of routine vacuuming is to do standard VACUUMs often enough to avoid needing VAC-
UUM FULL. The autovacuum daemon attempts to work this way, and in fact will never issue VACUUM
FULL. In this approach, the idea is not to keep tables at their minimum size, but to maintain steady-
state usage of disk space: each table occupies space equivalent to its minimum size plus however much
space gets used up between vacuum runs. Although VACUUM FULL can be used to shrink a table
back to its minimum size and return the disk space to the operating system, there is not much point in
this if the table will just grow again in the future. Thus, moderately-frequent standard VACUUM runs
are a better approach than infrequent VACUUM FULL runs for maintaining heavily-updated tables.
Some administrators prefer to schedule vacuuming themselves, for example doing all the work at night
when load is low. The difficulty with doing vacuuming according to a fixed schedule is that if a table
has an unexpected spike in update activity, it may get bloated to the point that VACUUM FULL is
really necessary to reclaim space. Using the autovacuum daemon alleviates this problem, since the
daemon schedules vacuuming dynamically in response to update activity. It is unwise to disable the
daemon completely unless you have an extremely predictable workload. One possible compromise
is to set the daemon's parameters so that it will only react to unusually heavy update activity, thus
keeping things from getting out of hand, while scheduled VACUUMs are expected to do the bulk of
the work when the load is typical.
For those not using autovacuum, a typical approach is to schedule a database-wide VACUUM once a
day during a low-usage period, supplemented by more frequent vacuuming of heavily-updated tables
as necessary. (Some installations with extremely high update rates vacuum their busiest tables as often
as once every few minutes.) If you have multiple databases in a cluster, don't forget to VACUUM each
one; the program vacuumdb might be helpful.
Tip
Plain VACUUM may not be satisfactory when a table contains large numbers of dead row ver-
sions as a result of massive update or delete activity. If you have such a table and you need
to reclaim the excess disk space it occupies, you will need to use VACUUM FULL, or alterna-
tively CLUSTER or one of the table-rewriting variants of ALTER TABLE. These commands
rewrite an entire new copy of the table and build new indexes for it. All these options require
an ACCESS EXCLUSIVE lock. Note that they also temporarily use extra disk space approx-
imately equal to the size of the table, since the old copies of the table and indexes can't be
released until the new ones are complete.
Tip
If you have a table whose entire contents are deleted on a periodic basis, consider doing it
with TRUNCATE rather than using DELETE followed by VACUUM. TRUNCATE removes the
entire content of the table immediately, without requiring a subsequent VACUUM or VACUUM
FULL to reclaim the now-unused disk space. The disadvantage is that strict MVCC semantics
are violated.
The autovacuum daemon, if enabled, will automatically issue ANALYZE commands whenever the
content of a table has changed sufficiently. However, administrators might prefer to rely on manual-
ly-scheduled ANALYZE operations, particularly if it is known that update activity on a table will not
affect the statistics of “interesting” columns. The daemon schedules ANALYZE strictly as a function
of the number of rows inserted or updated; it has no knowledge of whether that will lead to meaningful
statistical changes.
Tuples changed in partitions and inheritance children do not trigger analyze on the parent table. If the
parent table is empty or rarely changed, it may never be processed by autovacuum, and the statistics
for the inheritance tree as a whole won't be collected. It is necessary to run ANALYZE on the parent
table manually in order to keep the statistics up to date.
As with vacuuming for space recovery, frequent updates of statistics are more useful for heavily-up-
dated tables than for seldom-updated ones. But even for a heavily-updated table, there might be no
need for statistics updates if the statistical distribution of the data is not changing much. A simple
rule of thumb is to think about how much the minimum and maximum values of the columns in the
table change. For example, a timestamp column that contains the time of row update will have a
constantly-increasing maximum value as rows are added and updated; such a column will probably
need more frequent statistics updates than, say, a column containing URLs for pages accessed on a
website. The URL column might receive changes just as often, but the statistical distribution of its
values probably changes relatively slowly.
It is possible to run ANALYZE on specific tables and even just specific columns of a table, so the
flexibility exists to update some statistics more frequently than others if your application requires it.
In practice, however, it is usually best to just analyze the entire database, because it is a fast operation.
ANALYZE uses a statistically random sampling of the rows of a table rather than reading every single
row.
Tip
Although per-column tweaking of ANALYZE frequency might not be very productive, you
might find it worthwhile to do per-column adjustment of the level of detail of the statistics
collected by ANALYZE. Columns that are heavily used in WHERE clauses and have highly
irregular data distributions might require a finer-grain data histogram than other columns. See
ALTER TABLE SET STATISTICS, or change the database-wide default using the de-
fault_statistics_target configuration parameter.
Also, by default there is limited information available about the selectivity of functions. How-
ever, if you create an expression index that uses a function call, useful statistics will be gath-
ered about the function, which can greatly improve query plans that use the expression index.
Tip
The autovacuum daemon does not issue ANALYZE commands for foreign tables, since it has
no means of determining how often that might be useful. If your queries require statistics
on foreign tables for proper planning, it's a good idea to run manually-managed ANALYZE
commands on those tables on a suitable schedule.
Tip
The autovacuum daemon does not issue ANALYZE commands for partitioned tables. Inheri-
tance parents will only be analyzed if the parent itself is changed - changes to child tables do
not trigger autoanalyze on the parent table. If your queries require statistics on parent tables
for proper planning, it is necessary to periodically run a manual ANALYZE on those tables to
keep the statistics up to date.
VACUUM maintains a visibility map for each table to keep track of which pages contain only tuples that are known to be visible to all active transactions. This provides two benefits. First, VACUUM itself can skip such pages on the next run, since there is nothing to clean up.
Second, it allows PostgreSQL to answer some queries using only the index, without reference to the
underlying table. Since PostgreSQL indexes don't contain tuple visibility information, a normal index
scan fetches the heap tuple for each matching index entry, to check whether it should be seen by the
current transaction. An index-only scan, on the other hand, checks the visibility map first. If it's known
that all tuples on the page are visible, the heap fetch can be skipped. This is most useful on large data
sets where the visibility map can prevent disk accesses. The visibility map is vastly smaller than the
heap, so it can easily be cached even when the heap is very large.
The reason that periodic vacuuming solves the problem is that VACUUM will mark rows as frozen,
indicating that they were inserted by a transaction that committed sufficiently far in the past that
the effects of the inserting transaction are certain to be visible to all current and future transactions.
Normal XIDs are compared using modulo-2^32 arithmetic. This means that for every normal XID,
there are two billion XIDs that are “older” and two billion that are “newer”; another way to say it
is that the normal XID space is circular with no endpoint. Therefore, once a row version has been
created with a particular normal XID, the row version will appear to be “in the past” for the next two
billion transactions, no matter which normal XID we are talking about. If the row version still exists
after more than two billion transactions, it will suddenly appear to be in the future. To prevent this,
PostgreSQL reserves a special XID, FrozenTransactionId, which does not follow the normal
XID comparison rules and is always considered older than every normal XID. Frozen row versions
are treated as if the inserting XID were FrozenTransactionId, so that they will appear to be “in
the past” to all normal transactions regardless of wraparound issues, and so such row versions will be
valid until deleted, no matter how long that is.
Note
In PostgreSQL versions before 9.4, freezing was implemented by actually replacing a row's
insertion XID with FrozenTransactionId, which was visible in the row's xmin system
column. Newer versions just set a flag bit, preserving the row's original xmin for possible
forensic use. However, rows with xmin equal to FrozenTransactionId (2) may still be
found in databases pg_upgrade'd from pre-9.4 versions.
Also, system catalogs may contain rows with xmin equal to BootstrapTransactionId
(1), indicating that they were inserted during the first phase of initdb. Like FrozenTrans-
actionId, this special XID is treated as older than every normal XID.
vacuum_freeze_min_age controls how old an XID value has to be before rows bearing that XID will
be frozen. Increasing this setting may avoid unnecessary work if the rows that would otherwise be
frozen will soon be modified again, but decreasing this setting increases the number of transactions
that can elapse before the table must be vacuumed again.
VACUUM uses the visibility map to determine which pages of a table must be scanned. Normally, it will
skip pages that don't have any dead row versions even if those pages might still have row versions with
old XID values. Therefore, normal VACUUMs won't always freeze every old row version in the table.
Periodically, VACUUM will perform an aggressive vacuum, skipping only those pages which contain
neither dead rows nor any unfrozen XID or MXID values. vacuum_freeze_table_age controls when
VACUUM does that: all-visible but not all-frozen pages are scanned if the number of transactions that
have passed since the last such scan is greater than vacuum_freeze_table_age minus vacu-
um_freeze_min_age. Setting vacuum_freeze_table_age to 0 forces VACUUM to use this
more aggressive strategy for all scans.
The maximum time that a table can go unvacuumed is two billion transactions minus the vacu-
um_freeze_min_age value at the time of the last aggressive vacuum. If it were to go unvacuumed
for longer than that, data loss could result. To ensure that this does not happen, autovacuum is invoked
on any table that might contain unfrozen rows with XIDs older than the age specified by the configu-
ration parameter autovacuum_freeze_max_age. (This will happen even if autovacuum is disabled.)
This implies that if a table is not otherwise vacuumed, autovacuum will be invoked on it approximately
once every autovacuum_freeze_max_age minus vacuum_freeze_min_age transactions.
For tables that are regularly vacuumed for space reclamation purposes, this is of little importance.
However, for static tables (including tables that receive inserts, but no updates or deletes), there is
no need to vacuum for space reclamation, so it can be useful to try to maximize the interval between
forced autovacuums on very large static tables. Obviously one can do this either by increasing au-
tovacuum_freeze_max_age or decreasing vacuum_freeze_min_age.
VACUUM silently limits the effective value of vacuum_freeze_table_age to 95% of autovacuum_freeze_max_age; there is no point in setting it higher, because an anti-wraparound autovacuum would be triggered at that point anyway, and the 0.95 multiplier leaves some breathing room to
run a manual VACUUM before that happens. As a rule of thumb, vacuum_freeze_table_age
should be set to a value somewhat below autovacuum_freeze_max_age, leaving enough gap
so that a regularly scheduled VACUUM or an autovacuum triggered by normal delete and update ac-
tivity is run in that window. Setting it too close could lead to anti-wraparound autovacuums, even
though the table was recently vacuumed to reclaim space, whereas lower values lead to more frequent
aggressive vacuuming.
To track the age of the oldest unfrozen XIDs in a database, VACUUM stores XID statistics in the sys-
tem tables pg_class and pg_database. In particular, the relfrozenxid column of a table's
pg_class row contains the freeze cutoff XID that was used by the last aggressive VACUUM for
that table. All rows inserted by transactions with XIDs older than this cutoff XID are guaranteed to
have been frozen. Similarly, the datfrozenxid column of a database's pg_database row is a
lower bound on the unfrozen XIDs appearing in that database — it is just the minimum of the per-
table relfrozenxid values within the database. A convenient way to examine this information is
to execute queries such as:
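One sketch of such queries, using only standard system catalog columns:
SELECT c.oid::regclass AS table_name,
       greatest(age(c.relfrozenxid), age(t.relfrozenxid)) AS age
FROM pg_class c
LEFT JOIN pg_class t ON c.reltoastrelid = t.oid
WHERE c.relkind IN ('r', 'm');

SELECT datname, age(datfrozenxid) FROM pg_database;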
The age column measures the number of transactions from the cutoff XID to the current transaction's
XID.
VACUUM normally only scans pages that have been modified since the last vacuum, but rel-
frozenxid can only be advanced when every page of the table that might contain unfrozen XIDs is
scanned. This happens when relfrozenxid is more than vacuum_freeze_table_age trans-
actions old, when VACUUM's FREEZE option is used, or when all pages that are not already all-frozen
happen to require vacuuming to remove dead row versions. When VACUUM scans every page in the
table that is not already all-frozen, it should set age(relfrozenxid) to a value just a little more
than the vacuum_freeze_min_age setting that was used (more by the number of transactions
started since the VACUUM started). If no relfrozenxid-advancing VACUUM is issued on the table
until autovacuum_freeze_max_age is reached, an autovacuum will soon be forced for the table.
If for some reason autovacuum fails to clear old XIDs from a table, the system will begin to emit
warning messages like this when the database's oldest XIDs reach eleven million transactions from
the wraparound point:
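WARNING:  database "mydb" must be vacuumed within 10985967 transactions
HINT:  To avoid a database shutdown, execute a database-wide VACUUM in that database.
(The database name and the remaining-transaction count shown here are illustrative.)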
(A manual VACUUM should fix the problem, as suggested by the hint; but note that the VACUUM must
be performed by a superuser, else it will fail to process system catalogs and thus not be able to advance
the database's datfrozenxid.) If these warnings are ignored, the system will shut down and refuse
to start any new transactions once there are fewer than 1 million transactions left until wraparound:
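ERROR:  database is not accepting commands to avoid wraparound data loss in database "mydb"
HINT:  Stop the postmaster and vacuum that database in single-user mode.
(The exact wording may vary between versions; the database name is illustrative.)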
The 1-million-transaction safety margin exists to let the administrator recover without data loss, by
manually executing the required VACUUM commands. However, since the system will not execute
commands once it has gone into the safety shutdown mode, the only way to do this is to stop the server
and start the server in single-user mode to execute VACUUM. The shutdown mode is not enforced in
single-user mode. See the postgres reference page for details about using single-user mode.
Whenever VACUUM scans any part of a table, it will replace any multixact ID it encounters which
is older than vacuum_multixact_freeze_min_age by a different value, which can be the zero value,
a single transaction ID, or a newer multixact ID. For each table, pg_class.relminmxid stores
the oldest possible multixact ID still appearing in any tuple of that table. If this value is older than
vacuum_multixact_freeze_table_age, an aggressive vacuum is forced. As discussed in the previous
section, an aggressive vacuum means that only those pages which are known to be all-frozen will be
skipped. mxid_age() can be used on pg_class.relminmxid to find its age.
Aggressive VACUUM scans, regardless of what causes them, enable advancing the value for that table.
Eventually, as all tables in all databases are scanned and their oldest multixact values are advanced,
on-disk storage for older multixacts can be removed.
As a safety device, an aggressive vacuum scan will occur for any table whose multixact-age is greater
than autovacuum_multixact_freeze_max_age. Aggressive vacuum scans will also occur progressively
for all tables, starting with those that have the oldest multixact-age, if the amount of used member
storage space exceeds 50% of the addressable storage space. Both of these kinds of ag-
gressive scans will occur even if autovacuum is nominally disabled.
In the default configuration, autovacuuming is enabled and the related configuration parameters are
appropriately set.
The “autovacuum daemon” actually consists of multiple processes. There is a persistent daemon
process, called the autovacuum launcher, which is in charge of starting autovacuum worker processes
for all databases. The launcher will distribute the work across time, attempting to start one worker
within each database every autovacuum_naptime seconds. (Therefore, if the installation has N data-
bases, a new worker will be launched every autovacuum_naptime/N seconds.) A maximum of
autovacuum_max_workers worker processes are allowed to run at the same time. If there are more
than autovacuum_max_workers databases to be processed, the next database will be processed
as soon as the first worker finishes. Each worker process will check each table within its database and
execute VACUUM and/or ANALYZE as needed. log_autovacuum_min_duration can be set to monitor
autovacuum workers' activity.
If several large tables all become eligible for vacuuming in a short amount of time, all autovacuum
workers might become occupied with vacuuming those tables for a long period. This would result in
other tables and databases not being vacuumed until a worker becomes available. There is no limit on
how many workers might be in a single database, but workers do try to avoid repeating work that has
already been done by other workers. Note that the number of running workers does not count towards
max_connections or superuser_reserved_connections limits.
Tables whose relfrozenxid value is more than autovacuum_freeze_max_age transactions old are
always vacuumed (this also applies to those tables whose freeze max age has been modified via storage
parameters; see below). Otherwise, if the number of tuples obsoleted since the last VACUUM exceeds
the “vacuum threshold”, the table is vacuumed. The vacuum threshold is defined as:
vacuum threshold = vacuum base threshold + vacuum scale factor * number of tuples
where the vacuum base threshold is autovacuum_vacuum_threshold, the vacuum scale factor is auto-
vacuum_vacuum_scale_factor, and the number of tuples is pg_class.reltuples. The number of
obsolete tuples is obtained from the statistics collector; it is a semi-accurate count updated by each
UPDATE and DELETE operation. (It is only semi-accurate because some information might be lost
under heavy load.) If the relfrozenxid value of the table is more than vacuum_freeze_ta-
ble_age transactions old, an aggressive vacuum is performed to freeze old tuples and advance rel-
frozenxid; otherwise, only pages that have been modified since the last vacuum are scanned.
For analyze, a similar condition is used: the analyze threshold, defined as
analyze threshold = analyze base threshold + analyze scale factor * number of tuples
is compared to the total number of tuples inserted, updated, or deleted since the last ANALYZE.
Partitioned tables are not processed by autovacuum. Statistics should be collected by running a manual
ANALYZE when it is first populated, and again whenever the distribution of data in its partitions
changes significantly.
Temporary tables cannot be accessed by autovacuum. Therefore, appropriate vacuum and analyze
operations should be performed via session SQL commands.
The default thresholds and scale factors are taken from postgresql.conf, but it is possible to
override them (and many other autovacuum control parameters) on a per-table basis; see Storage Pa-
rameters for more information. If a setting has been changed via a table's storage parameters, that
value is used when processing that table; otherwise the global settings are used. See Section 19.10 for
more details on the global settings.
When multiple workers are running, the autovacuum cost delay parameters (see Section 19.4.4) are
“balanced” among all the running workers, so that the total I/O impact on the system is the same regard-
less of the number of workers actually running. However, any workers processing tables whose per-
table autovacuum_vacuum_cost_delay or autovacuum_vacuum_cost_limit storage
parameters have been set are not considered in the balancing algorithm.
Autovacuum workers generally don't block other commands. If a process attempts to acquire a lock
that conflicts with the SHARE UPDATE EXCLUSIVE lock held by autovacuum, lock acquisition
will interrupt the autovacuum. For conflicting lock modes, see Table 13.2. However, if the autovacu-
um is running to prevent transaction ID wraparound (i.e., the autovacuum query name in the pg_s-
tat_activity view ends with (to prevent wraparound)), the autovacuum is not auto-
matically interrupted.
Warning
Regularly running commands that acquire locks conflicting with a SHARE UPDATE EX-
CLUSIVE lock (e.g., ANALYZE) can effectively prevent autovacuums from ever completing.
B-tree index pages that have become completely empty are reclaimed for re-use. However, there is
still a possibility of inefficient use of space: if all but a few index keys on a page have been deleted,
the page remains allocated. Therefore, a usage pattern in which most, but not all, keys in each range
are eventually deleted will see poor use of space. For such usage patterns, periodic reindexing is
recommended.
The potential for bloat in non-B-tree indexes has not been well researched. It is a good idea to period-
ically monitor the index's physical size when using any non-B-tree index type.
Also, for B-tree indexes, a freshly-constructed index is slightly faster to access than one that has been
updated many times because logically adjacent pages are usually also physically adjacent in a newly
built index. (This consideration does not apply to non-B-tree indexes.) It might be worthwhile to
reindex periodically just to improve access speed.
REINDEX can be used safely and easily in all cases. But since the command requires an exclusive
table lock, it is often preferable to execute an index rebuild with a sequence of creation and replacement
steps. Index types that support CREATE INDEX with the CONCURRENTLY option can instead be
recreated that way. If that is successful and the resulting index is valid, the original index can then be
replaced by the newly built one using a combination of ALTER INDEX and DROP INDEX. When an
index is used to enforce uniqueness or other constraints, ALTER TABLE might be necessary to swap
the existing constraint with one enforced by the new index. Review this alternate multistep rebuild
approach carefully before using it as there are limitations on which indexes can be reindexed this way,
and errors must be handled.
If you simply direct the stderr of postgres into a file, you will have log output, but the only way
to truncate the log file is to stop and restart the server. This might be acceptable if you are using
PostgreSQL in a development environment, but few production servers would find this behavior ac-
ceptable.
A better approach is to send the server's stderr output to some type of log rotation program. There
is a built-in log rotation facility, which you can use by setting the configuration parameter log-
ging_collector to true in postgresql.conf. The control parameters for this program are
described in Section 19.8.1. You can also use this approach to capture the log data in machine readable
CSV (comma-separated values) format.
Alternatively, you might prefer to use an external log rotation program if you have one that you are
already using with other server software. For example, the rotatelogs tool included in the Apache
distribution can be used with PostgreSQL. To do this, just pipe the server's stderr output to the desired
program. If you start the server with pg_ctl, then stderr is already redirected to stdout, so you just
need a pipe command, for example:
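For example (assuming rotatelogs is available and /var/log/pgsql_log is a writable location):
pg_ctl start | rotatelogs /var/log/pgsql_log 86400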
Another production-grade approach to managing log output is to send it to syslog and let syslog deal
with file rotation. To do this, set the configuration parameter log_destination to syslog (to log
to syslog only) in postgresql.conf. Then you can send a SIGHUP signal to the syslog daemon
whenever you want to force it to start writing a new log file. If you want to automate log rotation, the
logrotate program can be configured to work with log files from syslog.
On many systems, however, syslog is not very reliable, particularly with large log messages; it might
truncate or drop messages just when you need them the most. Also, on Linux, syslog will flush each
message to disk, yielding poor performance. (You can use a “-” at the start of the file name in the
syslog configuration file to disable syncing.)
Note that all the solutions described above take care of starting new log files at configurable intervals,
but they do not handle deletion of old, no-longer-useful log files. You will probably want to set up a
batch job to periodically delete old log files. Another possibility is to configure the rotation program
so that old log files are overwritten cyclically.
pgBadger (https://pgbadger.darold.net/) is an external project that does sophisticated log file analysis.
check_postgres (https://bucardo.org/check_postgres/) provides Nagios alerts when important messages
appear in the log files, as well as detection of many other extraordinary conditions.
Chapter 25. Backup and Restore
As with everything that contains valuable data, PostgreSQL databases should be backed up regularly.
While the procedure is essentially simple, it is important to have a clear understanding of the under-
lying techniques and assumptions.
There are three fundamentally different approaches to backing up PostgreSQL data:
• SQL dump
• File system level backup
• Continuous archiving
Each has its own strengths and weaknesses; each is discussed in turn in the following sections.
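The SQL-dump method generates a text file of SQL commands that, when fed back to the server, recreates the database in the state it was in at dump time. The basic form of the command (dbname and dumpfile are placeholder names) is:
pg_dump dbname > dumpfile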
As you see, pg_dump writes its result to the standard output. We will see below how this can be useful.
While the above command creates a text file, pg_dump can create files in other formats that allow for
parallelism and more fine-grained control of object restoration.
pg_dump is a regular PostgreSQL client application (albeit a particularly clever one). This means that
you can perform this backup procedure from any remote host that has access to the database. But
remember that pg_dump does not operate with special permissions. In particular, it must have read
access to all tables that you want to back up, so in order to back up the entire database you almost
always have to run it as a database superuser. (If you do not have sufficient privileges to back up
the entire database, you can still back up portions of the database to which you do have access using
options such as -n schema or -t table.)
To specify which database server pg_dump should contact, use the command line options -h host
and -p port. The default host is the local host or whatever your PGHOST environment variable
specifies. Similarly, the default port is indicated by the PGPORT environment variable or, failing
that, by the compiled-in default. (Conveniently, the server will normally have the same compiled-in
default.)
Like any other PostgreSQL client application, pg_dump will by default connect with the database user
name that is equal to the current operating system user name. To override this, either specify the -U
option or set the environment variable PGUSER. Remember that pg_dump connections are subject to
the normal client authentication mechanisms (which are described in Chapter 20).
An important advantage of pg_dump over the other backup methods described later is that pg_dump's
output can generally be re-loaded into newer versions of PostgreSQL, whereas file-level backups and
continuous archiving are both extremely server-version-specific. pg_dump is also the only method
that will work when transferring a database to a different machine architecture, such as going from
a 32-bit to a 64-bit server.
Dumps created by pg_dump are internally consistent, meaning, the dump represents a snapshot of
the database at the time pg_dump began running. pg_dump does not block other operations on the
database while it is working. (Exceptions are those operations that need to operate with an exclusive
lock, such as most forms of ALTER TABLE.)
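Text-format dumps are read back in with psql; the general form of the restore command (again using placeholder names) is:
psql dbname < dumpfile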
where dumpfile is the file output by the pg_dump command. The database dbname will not be
created by this command, so you must create it yourself from template0 before executing psql
(e.g., with createdb -T template0 dbname). psql supports options similar to pg_dump for
specifying the database server to connect to and the user name to use. See the psql reference page for
more information. Non-text file dumps are restored using the pg_restore utility.
Before restoring an SQL dump, all the users who own objects or were granted permissions on objects in
the dumped database must already exist. If they do not, the restore will fail to recreate the objects with
the original ownership and/or permissions. (Sometimes this is what you want, but usually it is not.)
By default, the psql script will continue to execute after an SQL error is encountered. You might wish
to run psql with the ON_ERROR_STOP variable set to alter that behavior and have psql exit with an
exit status of 3 if an SQL error occurs:
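(One possible invocation, using the same placeholder names:)
psql --set ON_ERROR_STOP=on dbname < dumpfile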
Either way, you will only have a partially restored database. Alternatively, you can specify that the
whole dump should be restored as a single transaction, so the restore is either fully completed or
fully rolled back. This mode can be specified by passing the -1 or --single-transaction
command-line options to psql. When using this mode, be aware that even a minor error can roll back
a restore that has already run for many hours. However, that might still be preferable to manually
cleaning up a complex database after a partially restored dump.
The ability of pg_dump and psql to write to or read from pipes makes it possible to dump a database
directly from one server to another, for example:
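(The host names here are placeholders for the source and destination servers.)
pg_dump -h host1 dbname | psql -h host2 dbname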
Important
The dumps produced by pg_dump are relative to template0. This means that any languages,
procedures, etc. added via template1 will also be dumped by pg_dump. As a result, when
restoring, if you are using a customized template1, you must create the empty database
from template0, as in the example above.
After restoring a backup, it is wise to run ANALYZE on each database so the query optimizer has
useful statistics; see Section 24.1.3 and Section 24.1.6 for more information. For more advice on how
to load large amounts of data into PostgreSQL efficiently, refer to Section 14.4.
pg_dump dumps only a single database at a time, and it does not dump information about roles or tablespaces (because those are cluster-wide rather than per-database). To support convenient dumping of the entire contents of a database cluster, the pg_dumpall program is provided. pg_dumpall backs
up each database in a given cluster, and also preserves cluster-wide data such as role and tablespace
definitions. The basic usage of this command is:
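(A sketch; dumpfile is a placeholder file name.)
pg_dumpall > dumpfile
The resulting dump can be restored with psql, for example:
psql -f dumpfile postgres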
(Actually, you can specify any existing database name to start from, but if you are loading into an
empty cluster then postgres should usually be used.) It is always necessary to have database supe-
ruser access when restoring a pg_dumpall dump, as that is required to restore the role and tablespace
information. If you use tablespaces, make sure that the tablespace paths in the dump are appropriate
for the new installation.
pg_dumpall works by emitting commands to re-create roles, tablespaces, and empty databases, then
invoking pg_dump for each database. This means that while each database will be internally consistent,
the snapshots of different databases are not synchronized.
Cluster-wide data can be dumped alone using the pg_dumpall --globals-only option. This is
necessary to fully back up the cluster if running the pg_dump command on individual databases.
Use compressed dumps. You can use your favorite compression program, for example gzip:
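(The commands in this and the following reload examples are sketches; dbname and filename are placeholders.)
pg_dump dbname | gzip > filename.gz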
Reload with:
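gunzip -c filename.gz | psql dbname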
or:
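cat filename.gz | gunzip | psql dbname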
Use split. The split command allows you to split the output into smaller files that are accept-
able in size to the underlying file system. For example, to make 2 gigabyte chunks:
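(A sketch; filename is a placeholder prefix for the chunk files.)
pg_dump dbname | split -b 2G - filename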
Reload with:
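cat filename* | psql dbname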
Use pg_dump's custom dump format. If PostgreSQL was built on a system with the zlib com-
pression library installed, the custom dump format will compress data as it writes it to the output file.
This will produce dump file sizes similar to using gzip, but it has the added advantage that tables can
be restored selectively. The following command dumps a database using the custom dump format:
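(A sketch; filename is a placeholder for the output file.)
pg_dump -Fc dbname > filename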
A custom-format dump is not a script for psql, but instead must be restored with pg_restore, for ex-
ample:
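(Again with placeholder names:)
pg_restore -d dbname filename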
For very large databases, you might need to combine split with one of the other two approaches.
Use pg_dump's parallel dump feature. To speed up the dump of a large database, you can use
pg_dump's parallel mode. This will dump multiple tables at the same time. You can control the degree
of parallelism with the -j parameter. Parallel dumps are only supported for the "directory" archive
format.
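(A sketch; num is the desired number of parallel jobs and out.dir is a placeholder output directory.)
pg_dump -j num -F d -f out.dir dbname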
You can use pg_restore -j to restore a dump in parallel. This will work for any archive of either
the "custom" or the "directory" archive mode, whether or not it has been created with pg_dump -j.
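An alternative strategy, discussed next, is a file-system-level backup: directly copying the files that PostgreSQL uses to store the data. A minimal sketch, assuming the default data directory location used elsewhere in this chapter, is:
tar -cf backup.tar /usr/local/pgsql/data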
There are two restrictions, however, which make this method impractical, or at least inferior to the
pg_dump method:
1. The database server must be shut down in order to get a usable backup. Half-way measures such
as disallowing all connections will not work (in part because tar and similar tools do not take an
atomic snapshot of the state of the file system, but also because of internal buffering within the
server). Information about stopping the server can be found in Section 18.5. Needless to say, you
also need to shut down the server before restoring the data.
2. If you have dug into the details of the file system layout of the database, you might be tempted to
try to back up or restore only certain individual tables or databases from their respective files or
directories. This will not work because the information contained in these files is not usable without
the commit log files, pg_xact/*, which contain the commit status of all transactions. A table file
is only usable with this information. Of course it is also impossible to restore only a table and the
associated pg_xact data because that would render all other tables in the database cluster useless.
So file system backups only work for complete backup and restoration of an entire database cluster.
An alternative file-system backup approach is to make a “consistent snapshot” of the data directory, if
the file system supports that functionality (and you are willing to trust that it is implemented correctly).
The typical procedure is to make a “frozen snapshot” of the volume containing the database, then copy
the whole data directory (not just parts, see above) from the snapshot to a backup device, then release
the frozen snapshot. This will work even while the database server is running. However, a backup
created in this way saves the database files in a state as if the database server was not properly shut
down; therefore, when you start the database server on the backed-up data, it will think the previous
server instance crashed and will replay the WAL log. This is not a problem; just be aware of it (and
be sure to include the WAL files in your backup). You can perform a CHECKPOINT before taking
the snapshot to reduce recovery time.
If your database is spread across multiple file systems, there might not be any way to obtain exact-
ly-simultaneous frozen snapshots of all the volumes. For example, if your data files and WAL log
are on different disks, or if tablespaces are on different file systems, it might not be possible to use
snapshot backup because the snapshots must be simultaneous. Read your file system documentation
very carefully before trusting the consistent-snapshot technique in such situations.
If simultaneous snapshots are not possible, one option is to shut down the database server long enough
to establish all the frozen snapshots. Another option is to perform a continuous archiving base backup
(Section 25.3.2) because such backups are immune to file system changes during the backup. This
requires enabling continuous archiving just during the backup process; restore is done using continu-
ous archive recovery (Section 25.3.4).
Another option is to use rsync to perform a file system backup. This is done by first running rsync
while the database server is running, then shutting down the database server long enough to do an
rsync --checksum. (--checksum is necessary because rsync only has file modification-time
granularity of one second.) The second rsync will be quicker than the first, because it has relatively
little data to transfer, and the end result will be consistent because the server was down. This method
allows a file system backup to be performed with minimal downtime.
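A sketch of this procedure (the source and destination paths, and the use of rsync's archive mode -a, are assumptions to adapt to your layout):
rsync -a /var/lib/pgsql/data/ /backup/pgdata/
# stop the database server, then repeat with checksums to catch the remaining changes:
rsync -a --checksum /var/lib/pgsql/data/ /backup/pgdata/
# restart the database server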
Note that a file system backup will typically be larger than an SQL dump. (pg_dump does not need
to dump the contents of indexes for example, just the commands to recreate them.) However, taking
a file system backup might be faster.
A third approach is to combine a file-system-level base backup with continuous archiving of the write-ahead log (WAL), replaying the archived WAL files to bring the backup up to date. This approach has some significant benefits:
• We do not need a perfectly consistent file system backup as the starting point. Any internal inconsistency in the backup will be corrected by log replay (this is not significantly different from what
happens during crash recovery). So we do not need a file system snapshot capability, just tar or a
similar archiving tool.
• Since we can combine an indefinitely long sequence of WAL files for replay, continuous backup
can be achieved simply by continuing to archive the WAL files. This is particularly valuable for
large databases, where it might not be convenient to take a full backup frequently.
• It is not necessary to replay the WAL entries all the way to the end. We could stop the replay at
any point and have a consistent snapshot of the database as it was at that time. Thus, this technique
supports point-in-time recovery: it is possible to restore the database to its state at any time since
your base backup was taken.
• If we continuously feed the series of WAL files to another machine that has been loaded with the
same base backup file, we have a warm standby system: at any point we can bring up the second
machine and it will have a nearly-current copy of the database.
Note
pg_dump and pg_dumpall do not produce file-system-level backups and cannot be used as
part of a continuous-archiving solution. Such dumps are logical and do not contain enough
information to be used by WAL replay.
As with the plain file-system-backup technique, this method can only support restoration of an entire
database cluster, not a subset. Also, it requires a lot of archival storage: the base backup might be
bulky, and a busy system will generate many megabytes of WAL traffic that have to be archived. Still,
it is the preferred backup technique in many situations where high reliability is needed.
To recover successfully using continuous archiving (also called “online backup” by many database
vendors), you need a continuous sequence of archived WAL files that extends back at least as far as
the start time of your backup. So to get started, you should set up and test your procedure for archiving
WAL files before you take your first base backup. Accordingly, we first discuss the mechanics of
archiving WAL files.
When archiving WAL data, we need to capture the contents of each segment file once it is filled, and
save that data somewhere before the segment file is recycled for reuse. Depending on the application
and the available hardware, there could be many different ways of “saving the data somewhere”: we
could copy the segment files to an NFS-mounted directory on another machine, write them onto a
tape drive (ensuring that you have a way of identifying the original name of each file), or batch them
together and burn them onto CDs, or something else entirely. To provide the database administrator
with flexibility, PostgreSQL tries not to make any assumptions about how the archiving will be done.
Instead, PostgreSQL lets the administrator specify a shell command to be executed to copy a completed
segment file to wherever it needs to go. The command could be as simple as a cp, or it could invoke
a complex shell script — it's all up to you.
To enable WAL archiving, set the wal_level configuration parameter to replica or higher,
archive_mode to on, and specify the shell command to use in the archive_command configura-
tion parameter. In practice these settings will always be placed in the postgresql.conf file. In
archive_command, %p is replaced by the path name of the file to archive, while %f is replaced by
only the file name. (The path name is relative to the current working directory, i.e., the cluster's data
directory.) Use %% if you need to embed an actual % character in the command. The simplest useful
command is something like:
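(A sketch for Unix platforms, matching the expanded example shown below:)
archive_command = 'test ! -f /mnt/server/archivedir/%f && cp %p /mnt/server/archivedir/%f'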
which will copy archivable WAL segments to the directory /mnt/server/archivedir. (This
is an example, not a recommendation, and might not work on all platforms.) After the %p and %f
parameters have been replaced, the actual command executed might look like this:
test ! -f /mnt/server/archivedir/00000001000000A900000065 && cp pg_wal/00000001000000A900000065 /mnt/server/archivedir/00000001000000A900000065
The archive command will be executed under the ownership of the same user that the PostgreSQL
server is running as. Since the series of WAL files being archived contains effectively everything
in your database, you will want to be sure that the archived data is protected from prying eyes; for
example, archive into a directory that does not have group or world read access.
It is important that the archive command return zero exit status if and only if it succeeds. Upon getting
a zero result, PostgreSQL will assume that the file has been successfully archived, and will remove or
recycle it. However, a nonzero status tells PostgreSQL that the file was not archived; it will try again
periodically until it succeeds.
The archive command should generally be designed to refuse to overwrite any pre-existing archive
file. This is an important safety feature to preserve the integrity of your archive in case of administrator
error (such as sending the output of two different servers to the same archive directory).
It is advisable to test your proposed archive command to ensure that it indeed does not overwrite an
existing file, and that it returns nonzero status in this case. The example command above for Unix
ensures this by including a separate test step. On some Unix platforms, cp has switches such as -i
that can be used to do the same thing less verbosely, but you should not rely on these without verifying
that the right exit status is returned. (In particular, GNU cp will return status zero when -i is used
and the target file already exists, which is not the desired behavior.)
While designing your archiving setup, consider what will happen if the archive command fails repeat-
edly because some aspect requires operator intervention or the archive runs out of space. For example,
this could occur if you write to tape without an autochanger; when the tape fills, nothing further can
be archived until the tape is swapped. You should ensure that any error condition or request to a hu-
man operator is reported appropriately so that the situation can be resolved reasonably quickly. The
pg_wal/ directory will continue to fill with WAL segment files until the situation is resolved. (If
the file system containing pg_wal/ fills up, PostgreSQL will do a PANIC shutdown. No committed
transactions will be lost, but the database will remain offline until you free some space.)
The speed of the archiving command is unimportant as long as it can keep up with the average rate
at which your server generates WAL data. Normal operation continues even if the archiving process
falls a little behind. If archiving falls significantly behind, this will increase the amount of data that
would be lost in the event of a disaster. It will also mean that the pg_wal/ directory will contain
large numbers of not-yet-archived segment files, which could eventually exceed available disk space.
You are advised to monitor the archiving process to ensure that it is working as you intend.
In writing your archive command, you should assume that the file names to be archived can be up to 64
characters long and can contain any combination of ASCII letters, digits, and dots. It is not necessary
to preserve the original relative path (%p) but it is necessary to preserve the file name (%f).
Note that although WAL archiving will allow you to restore any modifications made to the data in
your PostgreSQL database, it will not restore changes made to configuration files (that is,
postgresql.conf, pg_hba.conf and pg_ident.conf), since those are edited manually rather
than through SQL operations. You might wish to keep the configuration files in a location that will
be backed up by your regular file system backup procedures. See Section 19.2 for how to relocate
the configuration files.
The archive command is only invoked on completed WAL segments. Hence, if your server generates
only little WAL traffic (or has slack periods where it does so), there could be a long delay between the
completion of a transaction and its safe recording in archive storage. To put a limit on how old unar-
chived data can be, you can set archive_timeout to force the server to switch to a new WAL segment
file at least that often. Note that archived files that are archived early due to a forced switch are still
the same length as completely full files. It is therefore unwise to set a very short
archive_timeout — it will bloat your archive storage. archive_timeout settings of a minute or so are usually
reasonable.
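For example, in postgresql.conf (the exact value is illustrative):
archive_timeout = 60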
Also, you can force a segment switch manually with pg_switch_wal if you want to ensure that
a just-finished transaction is archived as soon as possible. Other utility functions related to WAL
management are listed in Table 9.79.
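For instance, to force a segment switch immediately:
SELECT pg_switch_wal();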
When wal_level is minimal some SQL commands are optimized to avoid WAL logging, as
described in Section 14.4.7. If archiving or streaming replication were turned on during execution of
one of these statements, WAL would not contain enough information for archive recovery. (Crash
recovery is unaffected.) For this reason, wal_level can only be changed at server start. However,
archive_command can be changed with a configuration file reload. If you wish to temporarily stop
archiving, one way to do it is to set archive_command to the empty string (''). This will cause
WAL files to accumulate in pg_wal/ until a working archive_command is re-established.
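The easiest way to take a base backup is with the pg_basebackup tool; a minimal sketch (the target directory is an assumption) is:
pg_basebackup -D /path/to/backupdir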
It is not necessary to be concerned about the amount of time it takes to make a base backup. How-
ever, if you normally run the server with full_page_writes disabled, you might notice a drop
in performance while the backup runs since full_page_writes is effectively forced on during
backup mode.
To make use of the backup, you will need to keep all the WAL segment files generated during
and after the file system backup. To aid you in doing this, the base backup process creates a back-
up history file that is immediately stored into the WAL archive area. This file is named after the
first WAL segment file that you need for the file system backup. For example, if the starting WAL
file is 0000000100001234000055CD the backup history file will be named something like
0000000100001234000055CD.007C9330.backup. (The second part of the file name stands
for an exact position within the WAL file, and can ordinarily be ignored.) Once you have safely
archived the file system backup and the WAL segment files used during the backup (as specified in the
backup history file), all archived WAL segments with names numerically less are no longer needed
to recover the file system backup and can be deleted. However, you should consider keeping several
backup sets to be absolutely certain that you can recover your data.
The backup history file is just a small text file. It contains the label string you gave to pg_basebackup,
as well as the starting and ending times and WAL segments of the backup. If you used the label to
identify the associated dump file, then the archived history file is enough to tell you which dump file
to restore.
Since you have to keep around all the archived WAL files back to your last base backup, the interval
between base backups should usually be chosen based on how much storage you want to expend on
archived WAL files. You should also consider how long you are prepared to spend recovering, if
recovery should be necessary — the system will have to replay all those WAL segments, and that
could take a while if it has been a long time since the last base backup.
Low level base backups can be made in a non-exclusive or an exclusive way. The non-exclusive
method is recommended and the exclusive one is deprecated and will eventually be removed.
1. Ensure that WAL archiving is enabled and working.
2. Connect to the server (it does not matter which database) as a user with rights to run pg_start_backup
(superuser, or a user who has been granted EXECUTE on the function) and issue the command:
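-- the second and third arguments are explained in the paragraphs below
SELECT pg_start_backup('label', false, false);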
where label is any string you want to use to uniquely identify this backup operation. The connec-
tion calling pg_start_backup must be maintained until the end of the backup, or the backup
will be automatically aborted.
By default, pg_start_backup can take a long time to finish. This is because it performs a check-
point, and the I/O required for the checkpoint will be spread out over a significant period of time,
by default half your inter-checkpoint interval (see the configuration parameter
checkpoint_completion_target). This is usually what you want, because it minimizes the impact on query processing.
If you want to start the backup as soon as possible, change the second parameter to true, which
will issue an immediate checkpoint using as much I/O as available.
The third parameter being false tells pg_start_backup to initiate a non-exclusive base back-
up.
3. Perform the backup, using any convenient file-system-backup tool such as tar or cpio (not pg_dump
or pg_dumpall). It is neither necessary nor desirable to stop normal operation of the database while
you do this. See Section 25.3.3.3 for things to consider during this backup.
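4. In the same connection as before, issue the command:
SELECT * FROM pg_stop_backup(false, true);
(This is the non-exclusive form of pg_stop_backup; its return values and the wait_for_archive behavior are described below.)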
This terminates backup mode. On a primary, it also performs an automatic switch to the next WAL
segment. On a standby, it is not possible to automatically switch WAL segments, so you may wish
to run pg_switch_wal on the primary to perform a manual switch. The reason for the switch is
to arrange for the last WAL segment file written during the backup interval to be ready to archive.
The pg_stop_backup will return one row with three values. The second of these fields should
be written to a file named backup_label in the root directory of the backup. The third field
should be written to a file named tablespace_map unless the field is empty. These files are
vital to the backup working and must be written byte for byte without modification, which may
require opening the file in binary mode.
5. Once the WAL segment files active during the backup are archived, you are done. The file identified
by pg_stop_backup's first return value is the last segment that is required to form a complete set
of backup files. On a primary, if archive_mode is enabled and the wait_for_archive pa-
rameter is true, pg_stop_backup does not return until the last segment has been archived. On
a standby, archive_mode must be set to always in order for pg_stop_backup to wait. Archiving
of these files happens automatically since you have already configured archive_command. In
most cases this happens quickly, but you are advised to monitor your archive system to ensure there
are no delays. If the archive process has fallen behind because of failures of the archive command, it
will keep retrying until the archive succeeds and the backup is complete. If you wish to place a time
limit on the execution of pg_stop_backup, set an appropriate statement_timeout value,
but make note that if pg_stop_backup terminates because of this your backup may not be valid.
If the backup process monitors and ensures that all WAL segment files required for the backup are
successfully archived then the wait_for_archive parameter (which defaults to true) can be
set to false to have pg_stop_backup return as soon as the stop backup record is written to the
WAL. By default, pg_stop_backup will wait until all WAL has been archived, which can take
some time. This option must be used with caution: if WAL archiving is not monitored correctly
then the backup might not include all of the WAL files and will therefore be incomplete and not
able to be restored.
The process for an exclusive backup is mostly the same as for a non-exclusive one, but it differs in a few key steps:
1. Ensure that WAL archiving is enabled and working.
2. Connect to the server (it does not matter which database) as a user with rights to run pg_start_backup
(superuser, or a user who has been granted EXECUTE on the function) and issue the command:
SELECT pg_start_backup('label');
where label is any string you want to use to uniquely identify this backup operation. pg_s-
tart_backup creates a backup label file, called backup_label, in the cluster directory with
information about your backup, including the start time and label string. The function also creates
a tablespace map file, called tablespace_map, in the cluster directory with information about
tablespace symbolic links in pg_tblspc/ if one or more such link is present. Both files are crit-
ical to the integrity of the backup, should you need to restore from it.
By default, pg_start_backup can take a long time to finish. This is because it performs a check-
point, and the I/O required for the checkpoint will be spread out over a significant period of time,
by default half your inter-checkpoint interval (see the configuration parameter
checkpoint_completion_target). This is usually what you want, because it minimizes the impact on query processing.
If you want to start the backup as soon as possible, use:
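-- the second argument set to true requests an immediate checkpoint
SELECT pg_start_backup('label', true);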
3. Perform the backup, using any convenient file-system-backup tool such as tar or cpio (not pg_dump
or pg_dumpall). It is neither necessary nor desirable to stop normal operation of the database while
you do this. See Section 25.3.3.3 for things to consider during this backup.
Note that if the server crashes during the backup it may not be possible to restart until the
backup_label file has been manually deleted from the PGDATA directory.
4. Again connect to the database as a user with rights to run pg_stop_backup (superuser, or a user
who has been granted EXECUTE on the function), and issue the command:
SELECT pg_stop_backup();
This function terminates backup mode and performs an automatic switch to the next WAL segment.
The reason for the switch is to arrange for the last WAL segment written during the backup interval
to be ready to archive.
5. Once the WAL segment files active during the backup are archived, you are done. The file identified
by pg_stop_backup's result is the last segment that is required to form a complete set of backup
files. If archive_mode is enabled, pg_stop_backup does not return until the last segment has
been archived. Archiving of these files happens automatically since you have already configured
archive_command. In most cases this happens quickly, but you are advised to monitor your
archive system to ensure there are no delays. If the archive process has fallen behind because of
failures of the archive command, it will keep retrying until the archive succeeds and the backup
is complete. If you wish to place a time limit on the execution of pg_stop_backup, set an
appropriate statement_timeout value, but make note that if pg_stop_backup terminates
because of this your backup may not be valid.
Be certain that your backup includes all of the files under the database cluster directory (e.g., /usr/
local/pgsql/data). If you are using tablespaces that do not reside underneath this directory,
be careful to include them as well (and be sure that your backup archives symbolic links as links,
otherwise the restore will corrupt your tablespaces).
You should, however, omit from the backup the files within the cluster's pg_wal/ subdirectory. This
slight adjustment is worthwhile because it reduces the risk of mistakes when restoring. This is easy to
arrange if pg_wal/ is a symbolic link pointing to someplace outside the cluster directory, which is a
common setup anyway for performance reasons. You might also want to exclude postmaster.pid
and postmaster.opts, which record information about the running postmaster, not about the
postmaster which will eventually use this backup. (These files can confuse pg_ctl.)
It is often a good idea to also omit from the backup the files within the cluster's pg_replslot/
directory, so that replication slots that exist on the master do not become part of the backup. Otherwise,
the subsequent use of the backup to create a standby may result in indefinite retention of WAL files on
the standby, and possibly bloat on the master if hot standby feedback is enabled, because the clients
that are using those replication slots will still be connecting to and updating the slots on the master,
not the standby. Even if the backup is only intended for use in creating a new master, copying the
replication slots isn't expected to be particularly useful, since the contents of those slots will likely be
badly out of date by the time the new master comes on line.
Any file or directory beginning with pgsql_tmp can be omitted from the backup. These files are
removed on postmaster start and the directories will be recreated as needed.
pg_internal.init files can be omitted from the backup whenever a file of that name is found.
These files contain relation cache data that is always rebuilt when recovering.
The backup label file includes the label string you gave to pg_start_backup, as well as the time at
which pg_start_backup was run, and the name of the starting WAL file. In case of confusion it is
therefore possible to look inside a backup file and determine exactly which backup session the dump
file came from. The tablespace map file includes the symbolic link names as they exist in the directory
pg_tblspc/ and the full path of each symbolic link. These files are not merely for your information;
their presence and contents are critical to the proper operation of the system's recovery process.
It is also possible to make a backup while the server is stopped. In this case, you obviously cannot use
pg_start_backup or pg_stop_backup, and you will therefore be left to your own devices to
keep track of which backup is which and how far back the associated WAL files go. It is generally
better to follow the continuous archiving procedure above.
When you need to recover from your backup, the procedure is as follows:
1. Stop the server, if it's running.
2. If you have the space to do so, copy the whole cluster data directory and any tablespaces to a
temporary location in case you need them later. Note that this precaution will require that you have
enough free space on your system to hold two copies of your existing database. If you do not have
enough space, you should at least save the contents of the cluster's pg_wal subdirectory, as it
might contain logs which were not archived before the system went down.
3. Remove all existing files and subdirectories under the cluster data directory and under the root
directories of any tablespaces you are using.
4. Restore the database files from your file system backup. Be sure that they are restored with the right
ownership (the database system user, not root!) and with the right permissions. If you are using
tablespaces, you should verify that the symbolic links in pg_tblspc/ were correctly restored.
5. Remove any files present in pg_wal/; these came from the file system backup and are therefore
probably obsolete rather than current. If you didn't archive pg_wal/ at all, then recreate it with
proper permissions, being careful to ensure that you re-establish it as a symbolic link if you had
it set up that way before.
6. If you have unarchived WAL segment files that you saved in step 2, copy them into pg_wal/.
(It is best to copy them, not move them, so you still have the unmodified files if a problem occurs
and you have to start over.)
7. Create a recovery command file recovery.conf in the cluster data directory (see Chapter 27).
You might also want to temporarily modify pg_hba.conf to prevent ordinary users from con-
necting until you are sure the recovery was successful.
8. Start the server. The server will go into recovery mode and proceed to read through the archived
WAL files it needs. Should the recovery be terminated because of an external error, the server can
simply be restarted and it will continue recovery. Upon completion of the recovery process, the
server will rename recovery.conf to recovery.done (to prevent accidentally re-entering
recovery mode later) and then commence normal database operations.
9. Inspect the contents of the database to ensure you have recovered to the desired state. If not, return
to step 1. If all is well, allow your users to connect by restoring pg_hba.conf to normal.
The key part of all this is to set up a recovery configuration file that describes how you want to recover
and how far the recovery should run. You can use recovery.conf.sample (normally located in
the installation's share/ directory) as a prototype. The one thing that you absolutely must specify
in recovery.conf is the restore_command, which tells PostgreSQL how to retrieve archived
WAL file segments. Like the archive_command, this is a shell command string. It can contain
%f, which is replaced by the name of the desired log file, and %p, which is replaced by the path name
to copy the log file to. (The path name is relative to the current working directory, i.e., the cluster's
data directory.) Write %% if you need to embed an actual % character in the command. The simplest
useful command is something like:
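(A sketch matching the archive location used earlier in this chapter:)
restore_command = 'cp /mnt/server/archivedir/%f %p'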
which will copy previously archived WAL segments from the directory
/mnt/server/archivedir. Of course, you can use something much more complicated, perhaps even a shell
script that requests the operator to mount an appropriate tape.
It is important that the command return nonzero exit status on failure. The command will be called
requesting files that are not present in the archive; it must return nonzero when so asked. This is not an
error condition. An exception is that if the command was terminated by a signal (other than SIGTERM,
which is used as part of a database server shutdown) or an error by the shell (such as command not
found), then recovery will abort and the server will not start up.
Not all of the requested files will be WAL segment files; you should also expect requests for files with
a suffix of .history. Also be aware that the base name of the %p path will be different from %f;
do not expect them to be interchangeable.
WAL segments that cannot be found in the archive will be sought in pg_wal/; this allows use of
recent un-archived segments. However, segments that are available from the archive will be used in
preference to files in pg_wal/.
Normally, recovery will proceed through all available WAL segments, thereby restoring the database
to the current point in time (or as close as possible given the available WAL segments). Therefore, a
normal recovery will end with a “file not found” message, the exact text of the error message depending
upon your choice of restore_command. You may also see an error message at the start of recovery
for a file named something like 00000001.history. This is also normal and does not indicate a
problem in simple recovery situations; see Section 25.3.5 for discussion.
If you want to recover to some previous point in time (say, right before the junior DBA dropped
your main transaction table), just specify the required stopping point in recovery.conf. You can
specify the stop point, known as the “recovery target”, either by date/time, named restore point or
by completion of a specific transaction ID. As of this writing only the date/time and named restore
point options are very usable, since there are no tools to help you identify with any accuracy which
transaction ID to use.
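For example, recovery.conf might contain (the timestamp is purely illustrative):
recovery_target_time = '2021-01-01 12:00:00'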
Note
The stop point must be after the ending time of the base backup, i.e., the end time of
pg_stop_backup. You cannot use a base backup to recover to a time when that backup
was in progress. (To recover to such a time, you must go back to your previous base backup
and roll forward from there.)
If recovery finds corrupted WAL data, recovery will halt at that point and the server will not start. In
such a case the recovery process could be re-run from the beginning, specifying a “recovery target”
before the point of corruption so that recovery can complete normally. If recovery fails for an external
reason, such as a system crash or if the WAL archive has become inaccessible, then the recovery can
simply be restarted and it will restart almost from where it failed. Recovery restart works much like
checkpointing in normal operation: the server periodically forces all its state to disk, and then updates
the pg_control file to indicate that the already-processed WAL data need not be scanned again.
25.3.5. Timelines
The ability to restore the database to a previous point in time creates some complexities that are akin
to science-fiction stories about time travel and parallel universes. For example, in the original history
of the database, suppose you dropped a critical table at 5:15PM on Tuesday evening, but didn't realize
your mistake until Wednesday noon. Unfazed, you get out your backup, restore to the point-in-time
5:14PM Tuesday evening, and are up and running. In this history of the database universe, you never
dropped the table. But suppose you later realize this wasn't such a great idea, and would like to return
to sometime Wednesday morning in the original history. You won't be able to if, while your database
was up-and-running, it overwrote some of the WAL segment files that led up to the time you now
wish you could get back to. Thus, to avoid this, you need to distinguish the series of WAL records
generated after you've done a point-in-time recovery from those that were generated in the original
database history.
To deal with this problem, PostgreSQL has a notion of timelines. Whenever an archive recovery com-
pletes, a new timeline is created to identify the series of WAL records generated after that recovery.
The timeline ID number is part of WAL segment file names so a new timeline does not overwrite the
WAL data generated by previous timelines. It is in fact possible to archive many different timelines.
While that might seem like a useless feature, it's often a lifesaver. Consider the situation where you
aren't quite sure what point-in-time to recover to, and so have to do several point-in-time recoveries
by trial and error until you find the best place to branch off from the old history. Without timelines
this process would soon generate an unmanageable mess. With timelines, you can recover to any prior
state, including states in timeline branches that you abandoned earlier.
Every time a new timeline is created, PostgreSQL creates a “timeline history” file that shows which
timeline it branched off from and when. These history files are necessary to allow the system to pick the
right WAL segment files when recovering from an archive that contains multiple timelines. Therefore,
they are archived into the WAL archive area just like WAL segment files. The history files are just
small text files, so it's cheap and appropriate to keep them around indefinitely (unlike the segment
files which are large). You can, if you like, add comments to a history file to record your own notes
about how and why this particular timeline was created. Such comments will be especially valuable
when you have a thicket of different timelines as a result of experimentation.
The default behavior of recovery is to recover along the same timeline that was current when the base
backup was taken. If you wish to recover into some child timeline (that is, you want to return to some
state that was itself generated after a recovery attempt), you need to specify the target timeline ID in
recovery.conf. You cannot recover into timelines that branched off earlier than the base backup.
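For example, in recovery.conf (the timeline ID shown is hypothetical; the special value latest selects the newest timeline found in the archive):
recovery_target_timeline = '3'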
As with base backups, the easiest way to produce a standalone hot backup is to use the pg_basebackup
tool. If you include the -X parameter when calling it, all the write-ahead log required to use the backup
will be included in the backup automatically, and no special action is required to restore the backup.
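(A sketch; the target directory is an assumption, and -X stream is one of the accepted methods.)
pg_basebackup -X stream -D /path/to/backupdir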
If more flexibility in copying the backup files is needed, a lower level process can be used for stand-
alone hot backups as well. To prepare for low level standalone hot backups, make sure wal_level
is set to replica or higher, archive_mode to on, and set up an archive_command that per-
forms archiving only when a switch file exists. For example:
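(A sketch that pairs with the switch file used in the script below; the paths are illustrative.)
archive_command = 'test ! -f /var/lib/pgsql/backup_in_progress || (test ! -f /var/lib/pgsql/archive/%f && cp %p /var/lib/pgsql/archive/%f)'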
With this preparation, a backup can be taken using a script like the following:
touch /var/lib/pgsql/backup_in_progress
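# The remaining steps are a sketch; the tar paths, archive directory, and 'hot_backup' label are illustrative.
psql -c "select pg_start_backup('hot_backup');"
tar -cf /var/lib/pgsql/backup.tar /var/lib/pgsql/data/
psql -c "select pg_stop_backup();"
rm /var/lib/pgsql/backup_in_progress
tar -rf /var/lib/pgsql/backup.tar /var/lib/pgsql/archive/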
Using a separate script file is advisable any time you want to use more than a single command in the
archiving process. This allows all complexity to be managed within the script, which can be written
in a popular scripting language such as bash or perl.
Examples of requirements that might be handled within such a script include:
• Batching WAL files so that they are transferred every three hours, rather than one at a time
Tip
When using an archive_command script, it's desirable to enable logging_collector. Any
messages written to stderr from the script will then appear in the database server log, allowing
complex configurations to be diagnosed easily if they fail.
25.3.7. Caveats
At this writing, there are several limitations of the continuous archiving technique. These will probably
be fixed in future releases:
• If a CREATE DATABASE command is executed while a base backup is being taken, and then
the template database that the CREATE DATABASE copied is modified while the base backup is
still in progress, it is possible that recovery will cause those modifications to be propagated into the
created database as well. This is of course undesirable. To avoid this risk, it is best not to modify
any template databases while taking a base backup.
• CREATE TABLESPACE commands are WAL-logged with the literal absolute path, and will there-
fore be replayed as tablespace creations with the same absolute path. This might be undesirable if
the log is being replayed on a different machine. It can be dangerous even if the log is being replayed
on the same machine, but into a new data directory: the replay will still overwrite the contents of
the original tablespace. To avoid potential gotchas of this sort, the best practice is to take a new
base backup after creating or dropping tablespaces.
It should also be noted that the default WAL format is fairly bulky since it includes many disk page
snapshots. These page snapshots are designed to support crash recovery, since we might need to fix
partially-written disk pages. Depending on your system hardware and software, the risk of partial
writes might be small enough to ignore, in which case you can significantly reduce the total volume of
archived logs by turning off page snapshots using the full_page_writes parameter. (Read the notes and
warnings in Chapter 30 before you do so.) Turning off page snapshots does not prevent use of the logs
for PITR operations. An area for future development is to compress archived WAL data by removing
unnecessary page copies even when full_page_writes is on. In the meantime, administrators
might wish to reduce the number of page snapshots included in WAL by increasing the checkpoint
interval parameters as much as feasible.
Chapter 26. High Availability, Load
Balancing, and Replication
Database servers can work together to allow a second server to take over quickly if the primary server
fails (high availability), or to allow several computers to serve the same data (load balancing). Ideal-
ly, database servers could work together seamlessly. Web servers serving static web pages can be
combined quite easily by merely load-balancing web requests to multiple machines. In fact, read-only
database servers can be combined relatively easily too. Unfortunately, most database servers have a
read/write mix of requests, and read/write servers are much harder to combine. This is because though
read-only data needs to be placed on each server only once, a write to any server has to be propagated
to all servers so that future read requests to those servers return consistent results.
This synchronization problem is the fundamental difficulty for servers working together. Because
there is no single solution that eliminates the impact of the sync problem for all use cases, there are
multiple solutions. Each solution addresses this problem in a different way, and minimizes its impact
for a specific workload.
Some solutions deal with synchronization by allowing only one server to modify the data. Servers that
can modify data are called read/write, master or primary servers. Servers that track changes in the
master are called standby or secondary servers. A standby server that cannot be connected to until it
is promoted to a master server is called a warm standby server, and one that can accept connections
and serves read-only queries is called a hot standby server.
Some solutions are synchronous, meaning that a data-modifying transaction is not considered com-
mitted until all servers have committed the transaction. This guarantees that a failover will not lose any
data and that all load-balanced servers will return consistent results no matter which server is queried.
In contrast, asynchronous solutions allow some delay between the time of a commit and its propaga-
tion to the other servers, opening the possibility that some transactions might be lost in the switch
to a backup server, and that load balanced servers might return slightly stale results. Asynchronous
communication is used when synchronous would be too slow.
Solutions can also be categorized by their granularity. Some solutions can deal only with an entire
database server, while others allow control at the per-table or per-database level.
Performance must be considered in any choice. There is usually a trade-off between functionality and
performance. For example, a fully synchronous solution over a slow network might cut performance
by more than half, while an asynchronous one might have a minimal performance impact.
The remainder of this section outlines various failover, replication, and load balancing solutions.
Shared Disk Failover
Shared disk failover avoids synchronization overhead by having only one copy of the database.
It uses a single disk array that is shared by multiple servers. If the main database server fails, the
standby server is able to mount and start the database as though it were recovering from a database
crash. This allows rapid failover with no data loss.
Shared hardware functionality is common in network storage devices. Using a network file system
is also possible, though care must be taken that the file system has full POSIX behavior (see
Section 18.2.2). One significant limitation of this method is that if the shared disk array fails or
becomes corrupt, the primary and standby servers are both nonfunctional. Another issue is that
the standby server should never access the shared storage while the primary server is running.
File System Replication
A modified version of shared hardware functionality is file system replication, where all changes
to a file system are mirrored to a file system residing on another computer. The only restriction
is that the mirroring must be done in a way that ensures the standby server has a consistent copy
of the file system — specifically, writes to the standby must be done in the same order as those
on the master. DRBD is a popular file system replication solution for Linux.
Write-Ahead Log Shipping
Warm and hot standby servers can be kept current by reading a stream of write-ahead log (WAL)
records. If the main server fails, the standby contains almost all of the data of the main server, and
can be quickly made the new master database server. This can be synchronous or asynchronous
and can only be done for the entire database server.
A standby server can be implemented using file-based log shipping (Section 26.2) or streaming
replication (see Section 26.2.5), or a combination of both. For information on hot standby, see
Section 26.5.
Logical Replication
Logical replication allows a database server to send a stream of data modifications to another
server. PostgreSQL logical replication constructs a stream of logical data modifications from
the WAL. Logical replication allows the data changes from individual tables to be replicated.
Logical replication doesn't require a particular server to be designated as a master or a replica
but allows data to flow in multiple directions. For more information on logical replication, see
Chapter 31. Through the logical decoding interface (Chapter 49), third-party extensions can also
provide similar functionality.
Trigger-Based Master-Standby Replication
A master-standby replication setup sends all data modification queries to the master server. The
master server asynchronously sends data changes to the standby server. The standby can answer
read-only queries while the master server is running. The standby server is ideal for data ware-
house queries.
Slony-I is an example of this type of replication, with per-table granularity, and support for mul-
tiple standby servers. Because it updates the standby server asynchronously (in batches), there is
possible data loss during fail over.
Statement-Based Replication Middleware
With statement-based replication middleware, a program intercepts every SQL query and sends
it to one or all servers. Each server operates independently. Read-write queries must be sent to all
servers, so that every server receives any changes. But read-only queries can be sent to just one
server, allowing the read workload to be distributed among them.
Asynchronous Multimaster Replication
For servers that are not regularly connected or have slow communication links, like laptops or
remote servers, keeping data consistent among servers is a challenge. Using asynchronous mul-
timaster replication, each server works independently, and periodically communicates with the
other servers to identify conflicting transactions. The conflicts can be resolved by users or conflict
resolution rules. Bucardo is an example of this type of replication.
Synchronous Multimaster Replication
In synchronous multimaster replication, each server can accept write requests, and modified data
is transmitted from the original server to every other server before each transaction commits.
Heavy write activity can cause excessive locking and commit delays, leading to poor performance.
Read requests can be sent to any server. Some implementations use shared disk to reduce the
communication overhead. Synchronous multimaster replication is best for mostly read workloads,
though its big advantage is that any server can accept write requests — there is no need to partition
workloads between master and standby servers, and because the data changes are sent from one
server to another, there is no problem with non-deterministic functions like random().
PostgreSQL does not offer this type of replication, though PostgreSQL two-phase commit (PRE-
PARE TRANSACTION and COMMIT PREPARED) can be used to implement this in applica-
tion code or middleware.
Commercial Solutions
Because PostgreSQL is open source and easily extended, a number of companies have taken
PostgreSQL and created commercial closed-source solutions with unique failover, replication,
and load balancing capabilities.
Table 26.1 summarizes the capabilities of the various solutions listed above.
Table 26.1. High Availability, Load Balancing, and Replication Feature Matrix
Most common implementations: NAS (Shared Disk Failover); DRBD (File System Replication); built-in streaming replication (Write-Ahead Log Shipping); built-in logical replication, pglogical (Logical Replication); Londiste, Slony (Trigger-Based Master-Standby Replication); pgpool-II (Statement-Based Replication Middleware); Bucardo (Asynchronous Multimaster Replication)
Communication method: shared disk (Shared Disk Failover); disk blocks (File System Replication); WAL (Write-Ahead Log Shipping); logical decoding (Logical Replication); table rows (Trigger-Based Master-Standby Replication); SQL (Statement-Based Replication Middleware); table rows (Asynchronous Multimaster Replication); table rows and row locks (Synchronous Multimaster Replication)
No special hardware required: all solutions except Shared Disk Failover
Allows multiple master servers: Logical Replication; Statement-Based Replication Middleware; Asynchronous Multimaster Replication; Synchronous Multimaster Replication
No master server overhead: Shared Disk Failover; File System Replication; Write-Ahead Log Shipping; Statement-Based Replication Middleware
No waiting for multiple servers: Shared Disk Failover; Trigger-Based Master-Standby Replication; Asynchronous Multimaster Replication; Write-Ahead Log Shipping and Logical Replication with sync off
Master failure will never lose data: Shared Disk Failover; File System Replication; Statement-Based Replication Middleware; Synchronous Multimaster Replication; Write-Ahead Log Shipping and Logical Replication with sync on
Replicas accept read-only queries: Logical Replication; Trigger-Based Master-Standby Replication; Statement-Based Replication Middleware; Asynchronous Multimaster Replication; Synchronous Multimaster Replication; Write-Ahead Log Shipping with hot standby
Per-table granularity: Logical Replication; Trigger-Based Master-Standby Replication; Asynchronous Multimaster Replication; Synchronous Multimaster Replication
No conflict resolution necessary: Shared Disk Failover; File System Replication; Write-Ahead Log Shipping; Trigger-Based Master-Standby Replication; Statement-Based Replication Middleware; Synchronous Multimaster Replication
There are a few solutions that do not fit into the above categories:
Data Partitioning
Data partitioning splits tables into data sets. Each set can be modified by only one server. For
example, data can be partitioned by offices, e.g., London and Paris, with a server in each office.
If queries combining London and Paris data are necessary, an application can query both servers,
or master/standby replication can be used to keep a read-only copy of the other office's data on
each server.
Multiple-Server Parallel Query Execution
Many of the above solutions allow multiple servers to handle multiple queries, but none allow a
single query to use multiple servers to complete faster. This solution allows multiple servers to
work concurrently on a single query. It is usually accomplished by splitting the data among servers
and having each server execute its part of the query and return results to a central server where
they are combined and returned to the user. This can be implemented using the PL/Proxy tool set.
26.2. Log-Shipping Standby Servers
The primary and standby server work together to provide this capability, though the servers are on-
ly loosely coupled. The primary server operates in continuous archiving mode, while each standby
server operates in continuous recovery mode, reading the WAL files from the primary. No changes
to the database tables are required to enable this capability, so it offers low administration overhead
compared to some other replication solutions. This configuration also has relatively low performance
impact on the primary server.
Directly moving WAL records from one database server to another is typically described as log ship-
ping. PostgreSQL implements file-based log shipping by transferring WAL records one file (WAL
segment) at a time. WAL files (16MB) can be shipped easily and cheaply over any distance, whether
it be to an adjacent system, another system at the same site, or another system on the far side of the
globe. The bandwidth required for this technique varies according to the transaction rate of the primary
server. Record-based log shipping is more granular and streams WAL changes incrementally over a
network connection (see Section 26.2.5).
It should be noted that log shipping is asynchronous, i.e., the WAL records are shipped after transaction
commit. As a result, there is a window for data loss should the primary server suffer a catastrophic
failure; transactions not yet shipped will be lost. The size of the data loss window in file-based log
shipping can be limited by use of the archive_timeout parameter, which can be set as low as a
few seconds. However such a low setting will substantially increase the bandwidth required for file
shipping. Streaming replication (see Section 26.2.5) allows a much smaller window of data loss.
Recovery performance is sufficiently good that the standby will typically be only moments away from
full availability once it has been activated. As a result, this is called a warm standby configuration
which offers high availability. Restoring a server from an archived base backup and rollforward will
take considerably longer, so that technique only offers a solution for disaster recovery, not high avail-
ability. A standby server can also be used for read-only queries, in which case it is called a Hot Standby
server. See Section 26.5 for more information.
26.2.1. Planning
It is usually wise to create the primary and standby servers so that they are as similar as possible, at least
from the perspective of the database server. In particular, the path names associated with tablespaces
will be passed across unmodified, so both primary and standby servers must have the same mount
paths for tablespaces if that feature is used. Keep in mind that if CREATE TABLESPACE is executed
on the primary, any new mount point needed for it must be created on the primary and all standby
servers before the command is executed. Hardware need not be exactly the same, but experience shows
that maintaining two identical systems is easier than maintaining two dissimilar ones over the lifetime
of the application and system. In any case the hardware architecture must be the same — shipping
from, say, a 32-bit to a 64-bit system will not work.
In general, log shipping between servers running different major PostgreSQL release levels is not
possible. It is the policy of the PostgreSQL Global Development Group not to make changes to disk
formats during minor release upgrades, so it is likely that running different minor release levels on
primary and standby servers will work successfully. However, no formal support for that is offered
and you are advised to keep primary and standby servers at the same release level as much as possible.
When updating to a new minor release, the safest policy is to update the standby servers first — a new
minor release is more likely to be able to read WAL files from a previous minor release than vice versa.
At startup, the standby begins by restoring all WAL available in the archive location, calling re-
store_command. Once it reaches the end of WAL available there and restore_command fails,
it tries to restore any WAL available in the pg_wal directory. If that fails, and streaming replication
has been configured, the standby tries to connect to the primary server and start streaming WAL from
the last valid record found in archive or pg_wal. If that fails or streaming replication is not config-
ured, or if the connection is later disconnected, the standby goes back to step 1 and tries to restore
the file from the archive again. This loop of retries from the archive, pg_wal, and via streaming
replication goes on until the server is stopped or failover is triggered by a trigger file.
Standby mode is exited and the server switches to normal operation when pg_ctl promote is run
or a trigger file is found (trigger_file). Before failover, any WAL immediately available in the
archive or in pg_wal will be restored, but no attempt is made to connect to the master.
If you want to use streaming replication, set up authentication on the primary server to allow replication
connections from the standby server(s); that is, create a role and provide a suitable entry or entries in
pg_hba.conf with the database field set to replication. Also ensure max_wal_senders is
set to a sufficiently large value in the configuration file of the primary server. If replication slots will
be used, ensure that max_replication_slots is set sufficiently high as well.
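As a minimal sketch (the values here are illustrative, not recommendations), the relevant settings in the primary's postgresql.conf might look like this:
max_wal_senders = 10
max_replication_slots = 10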
Take a base backup as described in Section 25.3.2 to bootstrap the standby server.
Note
Do not use pg_standby or similar tools with the built-in standby mode described here. re-
store_command should return immediately if the file does not exist; the server will retry
the command again if necessary. See Section 26.4 for using tools like pg_standby.
If you want to use streaming replication, fill in primary_conninfo with a libpq connection string,
including the host name (or IP address) and any additional details needed to connect to the primary
server. If the primary needs a password for authentication, the password needs to be specified in
primary_conninfo as well.
If you're setting up the standby server for high availability purposes, set up WAL archiving, connec-
tions and authentication like the primary server, because the standby server will work as a primary
server after failover.
If you're using a WAL archive, its size can be minimized using the archive_cleanup_command para-
meter to remove files that are no longer required by the standby server. The pg_archivecleanup utility
is designed specifically to be used with archive_cleanup_command in typical single-standby
configurations, see pg_archivecleanup. Note however, that if you're using the archive for backup pur-
poses, you need to retain files needed to recover from at least the latest base backup, even if they're
no longer needed by the standby.
A simple example of a recovery.conf for this configuration is:
standby_mode = 'on'
primary_conninfo = 'host=192.168.1.50 port=5432 user=foo password=foopass'
restore_command = 'cp /path/to/archive/%f %p'
archive_cleanup_command = 'pg_archivecleanup /path/to/archive %r'
You can have any number of standby servers, but if you use streaming replication, make sure you set
max_wal_senders high enough in the primary to allow them to be connected simultaneously.
Streaming replication is asynchronous by default (see Section 26.2.8), in which case there is a small
delay between committing a transaction in the primary and the changes becoming visible in the stand-
by. This delay is however much smaller than with file-based log shipping, typically under one sec-
ond assuming the standby is powerful enough to keep up with the load. With streaming replication,
archive_timeout is not required to reduce the data loss window.
If you use streaming replication without file-based continuous archiving, the server might recycle
old WAL segments before the standby has received them. If this occurs, the standby will need to
be reinitialized from a new base backup. You can avoid this by setting wal_keep_segments to
a value large enough to ensure that WAL segments are not recycled too early, or by configuring a
replication slot for the standby. If you set up a WAL archive that's accessible from the standby, these
solutions are not required, since the standby can always use the archive to catch up provided it retains
enough segments.
To use streaming replication, set up a file-based log-shipping standby server as described in Sec-
tion 26.2. The step that turns a file-based log-shipping standby into a streaming-replication standby
is setting the primary_conninfo parameter in the recovery.conf file to point to the primary serv-
er. Set listen_addresses and authentication options (see pg_hba.conf) on the primary so that the
standby server can connect to the replication pseudo-database on the primary server (see Sec-
tion 26.2.5.1).
On systems that support the keepalive socket option, setting tcp_keepalives_idle, tcp_keepalives_in-
terval and tcp_keepalives_count helps the primary promptly notice a broken connection.
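For example (the values are illustrative), the keepalive parameters might be set in the primary's postgresql.conf as follows:
tcp_keepalives_idle = 60
tcp_keepalives_interval = 10
tcp_keepalives_count = 6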
Set the maximum number of concurrent connections from the standby servers (see max_wal_senders
for details).
When the standby is started and primary_conninfo is set correctly, the standby will connect to
the primary after replaying all WAL files available in the archive. If the connection is established
successfully, you will see a walreceiver process in the standby, and a corresponding walsender process
in the primary.
26.2.5.1. Authentication
It is very important that the access privileges for replication be set up so that only trusted users can
read the WAL stream, because it is easy to extract privileged information from it. Standby servers
must authenticate to the primary as a superuser or an account that has the REPLICATION privilege.
It is recommended to create a dedicated user account with REPLICATION and LOGIN privileges for
replication. While REPLICATION privilege gives very high permissions, it does not allow the user
to modify any data on the primary system, which the SUPERUSER privilege does.
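For instance, a dedicated replication role of the kind described here could be created on the primary with a command along these lines (the role name and password match the examples that follow):
CREATE ROLE foo WITH REPLICATION LOGIN PASSWORD 'foopass';
The pg_hba.conf entry below then allows this role to connect from the standby's address for replication: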
# TYPE  DATABASE        USER            ADDRESS                 METHOD
host    replication     foo             192.168.1.100/32        md5
The host name and port number of the primary, connection user name, and password are specified
in the recovery.conf file. The password can also be set in the ~/.pgpass file on the standby
(specify replication in the database field). For example, if the primary is running on host IP
192.168.1.50, port 5432, the account name for replication is foo, and the password is foopass,
the administrator can add the following line to the recovery.conf file on the standby:
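primary_conninfo = 'host=192.168.1.50 port=5432 user=foo password=foopass'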
26.2.5.2. Monitoring
An important health indicator of streaming replication is the amount of WAL records generated in the
primary, but not yet applied in the standby. You can calculate this lag by comparing the current WAL
write location on the primary with the last WAL location received by the standby. These locations can
be retrieved using pg_current_wal_lsn on the primary and pg_last_wal_receive_lsn
on the standby, respectively (see Table 9.79 and Table 9.80 for details). The last WAL receive location
in the standby is also displayed in the process status of the WAL receiver process, displayed using the
ps command (see Section 28.1 for details).
You can retrieve a list of WAL sender processes via the pg_stat_replication view. Large differences be-
tween pg_current_wal_lsn and the view's sent_lsn field might indicate that the master serv-
er is under heavy load, while differences between sent_lsn and pg_last_wal_receive_lsn
on the standby might indicate network delay, or that the standby is under heavy load.
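A minimal sketch of such a check, run on the primary (the columns are those of the pg_stat_replication view):
SELECT application_name, state, sent_lsn, write_lsn, flush_lsn, replay_lsn,
       pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS replay_lag_bytes
FROM pg_stat_replication;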
On a hot standby, the status of the WAL receiver process can be retrieved via the pg_stat_wal_receiver
view. A large difference between pg_last_wal_replay_lsn and the view's received_lsn
indicates that WAL is being received faster than it can be replayed.
In lieu of using replication slots, it is possible to prevent the removal of old WAL segments using
wal_keep_segments, or by storing the segments in an archive using archive_command. However,
these methods often result in retaining more WAL segments than required, whereas replication slots
retain only the number of segments known to be needed. An advantage of these methods is that they
bound the space requirement for pg_wal; there is currently no way to do this using replication slots.
Existing replication slots and their state can be seen in the pg_replication_slots view.
Slots can be created and dropped either via the streaming replication protocol (see Section 53.4) or
via SQL functions (see Section 9.26.6).
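For instance, the slot used in the configuration below could be created and inspected on the primary with:
SELECT * FROM pg_create_physical_replication_slot('node_a_slot');
SELECT slot_name, slot_type, active FROM pg_replication_slots;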
To configure the standby to use this slot, primary_slot_name should be configured in the stand-
by's recovery.conf. Here is a simple example:
standby_mode = 'on'
primary_conninfo = 'host=192.168.1.50 port=5432 user=foo password=foopass'
primary_slot_name = 'node_a_slot'
A standby acting as both a receiver and a sender is known as a cascading standby. Standbys that are
more directly connected to the master are known as upstream servers, while those standby servers
further away are downstream servers. Cascading replication does not place limits on the number or
arrangement of downstream servers, though each standby connects to only one upstream server which
eventually links to a single master/primary server.
A cascading standby sends not only WAL records received from the master but also those restored
from the archive. So even if the replication connection in some upstream connection is terminated,
streaming replication continues downstream for as long as new WAL records are available.
Cascading replication is currently asynchronous. Synchronous replication (see Section 26.2.8) settings
have no effect on cascading replication at present.
If an upstream standby server is promoted to become the new master, downstream servers will continue
to stream from the new master if recovery_target_timeline is set to 'latest'.
To use cascading replication, set up the cascading standby so that it can accept replication connections
(that is, set max_wal_senders and hot_standby, and configure host-based authentication). You will
also need to set primary_conninfo in the downstream standby to point to the cascading standby.
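A minimal sketch, assuming the cascading standby is reachable at the hypothetical host name standby1.example.com and reusing the replication user from the earlier examples:
# postgresql.conf on the cascading standby
max_wal_senders = 10
hot_standby = on
# recovery.conf on the downstream standby
standby_mode = 'on'
primary_conninfo = 'host=standby1.example.com port=5432 user=foo password=foopass'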
26.2.8. Synchronous Replication
PostgreSQL streaming replication is asynchronous by default. If the primary server crashes then some
transactions that were committed may not have been replicated to the standby server, causing data
loss. The amount of data loss is proportional to the replication delay at the time of failover.
Synchronous replication offers the ability to confirm that all changes made by a transaction have been
transferred to one or more synchronous standby servers. This extends the standard level of durability
offered by a transaction commit. This level of protection is referred to as 2-safe replication in computer
science theory, and group-1-safe (group-safe and 1-safe) when synchronous_commit is set to
remote_write.
When requesting synchronous replication, each commit of a write transaction will wait until confir-
mation is received that the commit has been written to the write-ahead log on disk of both the primary
and standby server. The only possibility that data can be lost is if both the primary and the standby
suffer crashes at the same time. This can provide a much higher level of durability, though only if the
sysadmin is cautious about the placement and management of the two servers. Waiting for confirma-
tion increases the user's confidence that the changes will not be lost in the event of server crashes but
it also necessarily increases the response time for the requesting transaction. The minimum wait time
is the round-trip time between primary and standby.
Read only transactions and transaction rollbacks need not wait for replies from standby servers. Sub-
transaction commits do not wait for responses from standby servers, only top-level commits. Long
running actions such as data loading or index building do not wait until the very final commit message.
All two-phase commit actions require commit waits, including both prepare and commit.
After a commit record has been written to disk on the primary, the WAL record is then sent to the
standby. The standby sends reply messages each time a new batch of WAL data is written to disk,
unless wal_receiver_status_interval is set to zero on the standby. In the case that syn-
chronous_commit is set to remote_apply, the standby sends reply messages when the commit
record is replayed, making the transaction visible. If the standby is chosen as a synchronous standby,
according to the setting of synchronous_standby_names on the primary, the reply messages
from that standby will be considered along with those from other synchronous standbys to decide
when to release transactions waiting for confirmation that the commit record has been received. These
parameters allow the administrator to specify which standby servers should be synchronous standbys.
Note that the configuration of synchronous replication is mainly on the master. Named standbys must
be directly connected to the master; the master knows nothing about downstream standby servers us-
ing cascaded replication.
Setting synchronous_commit to remote_write will cause each commit to wait for confirma-
tion that the standby has received the commit record and written it out to its own operating system,
but not for the data to be flushed to disk on the standby. This setting provides a weaker guarantee of
durability than on does: the standby could lose the data in the event of an operating system crash,
though not a PostgreSQL crash. However, it's a useful setting in practice because it can decrease the
response time for the transaction. Data loss could only occur if both the primary and the standby crash
and the database of the primary gets corrupted at the same time.
Setting synchronous_commit to remote_apply will cause each commit to wait until the cur-
rent synchronous standbys report that they have replayed the transaction, making it visible to user
queries. In simple cases, this allows for load balancing with causal consistency.
Users will stop waiting if a fast shutdown is requested. However, as when using asynchronous repli-
cation, the server will not fully shut down until all outstanding WAL records are transferred to the
currently connected standby servers.
The method FIRST specifies a priority-based synchronous replication and makes transaction commits
wait until their WAL records are replicated to the requested number of synchronous standbys chosen
based on their priorities. The standbys whose names appear earlier in the list are given higher priority
and will be considered as synchronous. Other standby servers appearing later in this list represent
potential synchronous standbys. If any of the current synchronous standbys disconnects for whatever
reason, it will be replaced immediately with the next-highest-priority standby.
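For instance, a synchronous_standby_names setting matching the scenario described next would be:
synchronous_standby_names = 'FIRST 2 (s1, s2, s3)'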
In this example, if four standby servers s1, s2, s3 and s4 are running, the two standbys s1 and s2
will be chosen as synchronous standbys because their names appear early in the list of standby names.
s3 is a potential synchronous standby and will take over the role of synchronous standby when either
of s1 or s2 fails. s4 is an asynchronous standby since its name is not in the list.
The method ANY specifies a quorum-based synchronous replication and makes transaction commits
wait until their WAL records are replicated to at least the requested number of synchronous standbys
in the list.
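Similarly, a quorum-based setting matching the scenario described next would be:
synchronous_standby_names = 'ANY 2 (s1, s2, s3)'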
In this example, if four standby servers s1, s2, s3 and s4 are running, transaction commits will wait
for replies from at least any two standbys of s1, s2 and s3. s4 is an asynchronous standby since
its name is not in the list.
The synchronous states of standby servers can be viewed using the pg_stat_replication view.
PostgreSQL allows the application developer to specify the durability level required via replication.
This can be specified for the system overall, though it can also be specified for specific users or
connections, or even individual transactions.
For example, 10% of an application's changes might be important customer details, while the other
90% might be less important data that the business can more easily survive losing, such as chat
messages between users.
With synchronous replication options specified at the application level (on the primary) we can offer
synchronous replication for the most important changes, without slowing down the bulk of the total
workload. Application level options are an important and practical tool for allowing the benefits of
synchronous replication for high performance applications.
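As an illustrative sketch (the role name is hypothetical), the durability level can be chosen per transaction or per role like this:
-- Wait for the synchronous standby(s) to apply this particular commit
BEGIN;
SET LOCAL synchronous_commit = remote_apply;
-- ... apply the important customer change here ...
COMMIT;
-- Let a role that writes only low-value data commit without waiting for the standby
ALTER ROLE chat_writer SET synchronous_commit = local;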
You should consider that the network bandwidth must be higher than the rate of generation of WAL
data.
The best solution for high availability is to ensure you keep as many synchronous standbys as re-
quested. This can be achieved by naming multiple potential synchronous standbys using synchro-
nous_standby_names.
In a priority-based synchronous replication, the standbys whose names appear earlier in the list will
be used as synchronous standbys. Standbys listed after these will take over the role of synchronous
standby if one of current ones should fail.
In a quorum-based synchronous replication, all the standbys appearing in the list will be used as can-
didates for synchronous standbys. Even if one of them should fail, the other standbys will keep per-
forming the role of candidates of synchronous standby.
When a standby first attaches to the primary, it will not yet be properly synchronized. This is described
as catchup mode. Once the lag between standby and primary reaches zero for the first time we move
to real-time streaming state. The catch-up duration may be long immediately after the standby has
been created. If the standby is shut down, then the catch-up period will increase according to the length
of time the standby has been down. The standby is only able to become a synchronous standby once it
has reached streaming state. This state can be viewed using the pg_stat_replication view.
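For example, the state of each standby can be checked on the primary with:
SELECT application_name, state, sync_state FROM pg_stat_replication;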
If the primary restarts while commits are waiting for acknowledgement, those waiting transactions will
be marked fully committed once the primary database recovers. There is no way to be certain that all
standbys have received all outstanding WAL data at time of the crash of the primary. Some transactions
may not show as committed on the standby, even though they show as committed on the primary. The
guarantee we offer is that the application will not receive explicit acknowledgement of the successful
commit of a transaction until the WAL data is known to be safely received by all the synchronous
standbys.
If you really cannot keep as many synchronous standbys as requested then you should decrease the
number of synchronous standbys that transaction commits must wait for responses from in synchro-
nous_standby_names (or disable it) and reload the configuration file on the primary server.
If the primary is isolated from remaining standby servers you should fail over to the best candidate
of those other remaining standby servers.
If you need to re-create a standby server while transactions are waiting, make sure that the commands
pg_start_backup() and pg_stop_backup() are run in a session with synchronous_commit = off,
otherwise those requests will wait forever for the standby to appear.
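A minimal sketch of such a session (the backup label is arbitrary):
SET synchronous_commit = off;
SELECT pg_start_backup('rebuild_standby');
-- ... copy the data directory to the new standby ...
SELECT pg_stop_backup();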
If archive_mode is set to on, the archiver is not enabled during recovery or standby mode. If the
standby server is promoted, it will start archiving after the promotion, but will not archive any WAL
or timeline history files that it did not generate itself. To get a complete series of WAL files in the
archive, you must ensure that all WAL is archived, before it reaches the standby. This is inherently
true with file-based log shipping, as the standby can only restore files that are found in the archive, but
not if streaming replication is enabled. When a server is not in recovery mode, there is no difference
between on and always modes.
26.3. Failover
If the primary server fails then the standby server should begin failover procedures.
If the standby server fails then no failover need take place. If the standby server can be restarted,
even some time later, then the recovery process can also be restarted immediately, taking advantage of
restartable recovery. If the standby server cannot be restarted, then a full new standby server instance
should be created.
If the primary server fails and the standby server becomes the new primary, and then the old primary
restarts, you must have a mechanism for informing the old primary that it is no longer the primary.
This is sometimes known as STONITH (Shoot The Other Node In The Head), which is necessary
to avoid situations where both systems think they are the primary, which will lead to confusion and
ultimately data loss.
Many failover systems use just two systems, the primary and the standby, connected by some kind of
heartbeat mechanism to continually verify the connectivity between the two and the viability of the
primary. It is also possible to use a third system (called a witness server) to prevent some cases of
inappropriate failover, but the additional complexity might not be worthwhile unless it is set up with
sufficient care and rigorous testing.
PostgreSQL does not provide the system software required to identify a failure on the primary and
notify the standby database server. Many such tools exist and are well integrated with the operating
system facilities required for successful failover, such as IP address migration.
Once failover to the standby occurs, there is only a single server in operation. This is known as a
degenerate state. The former standby is now the primary, but the former primary is down and might
stay down. To return to normal operation, a standby server must be recreated, either on the former
primary system when it comes up, or on a third, possibly new, system. The pg_rewind utility can
be used to speed up this process on large clusters. Once complete, the primary and standby can be
considered to have switched roles. Some people choose to use a third server to provide backup for
the new primary until the new standby server is recreated, though clearly this complicates the system
configuration and operational processes.
So, switching from primary to standby server can be fast but requires some time to re-prepare the
failover cluster. Regular switching from primary to standby is useful, since it allows regular downtime
on each system for maintenance. This also serves as a test of the failover mechanism to ensure that it
will really work when you need it. Written administration procedures are advised.
To trigger failover of a log-shipping standby server, run pg_ctl promote or create a trigger file
with the file name and path specified by the trigger_file setting in recovery.conf. If you're
planning to use pg_ctl promote to fail over, trigger_file is not required. If you're setting
up the reporting servers that are only used to offload read-only queries from the primary, not for high
availability purposes, you don't need to promote it.
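For example, the trigger file could be declared in recovery.conf on the standby like this (the path is illustrative); creating that file, e.g. with touch, then initiates promotion:
trigger_file = '/tmp/postgresql.trigger.5432'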
26.4. Alternative Method for Log Shipping
Note that in this mode, the server will apply WAL one file at a time, so if you use the standby server for
queries (see Hot Standby), there is a delay between an action in the master and when the action becomes
visible in the standby, corresponding to the time it takes to fill up the WAL file. archive_timeout
can be used to make that delay shorter. Also note that you can't combine streaming replication with
this method.
The operations that occur on both primary and standby servers are normal continuous archiving and
recovery tasks. The only point of contact between the two database servers is the archive of WAL
files that both share: primary writing to the archive, standby reading from the archive. Care must be
taken to ensure that WAL archives from separate primary servers do not become mixed together or
confused. The archive need not be large if it is only required for standby operation.
The magic that makes the two loosely coupled servers work together is simply a restore_command
used on the standby that, when asked for the next WAL file, waits for it to become available from the
primary. The restore_command is specified in the recovery.conf file on the standby server.
Normal recovery processing would request a file from the WAL archive, reporting failure if the file
was unavailable. For standby processing it is normal for the next WAL file to be unavailable, so the
standby must wait for it to appear. For files ending in .history there is no need to wait, and a non-
zero return code must be returned. A waiting restore_command can be written as a custom script
that loops after polling for the existence of the next WAL file. There must also be some way to trigger
failover, which should interrupt the restore_command, break the loop and return a file-not-found
error to the standby server. This ends recovery and the standby will then come up as a normal server.
triggered = false;
while (!NextWALFileReady() && !triggered)
{
    sleep(100000L);         /* wait for ~0.1 sec */
    if (CheckForExternalTrigger())
        triggered = true;
}
if (!triggered)
    CopyWALFileForRecovery();
The method for triggering failover is an important part of planning and design. One potential option is
the restore_command command. It is executed once for each WAL file, but the process running
the restore_command is created and dies for each file, so there is no daemon or server process,
and signals or a signal handler cannot be used. Therefore, the restore_command is not suitable
to trigger failover. It is possible to use a simple timeout facility, especially if used in conjunction
with a known archive_timeout setting on the primary. However, this is somewhat error prone
since a network problem or busy primary server might be sufficient to initiate failover. A notification
mechanism such as the explicit creation of a trigger file is ideal, if this can be arranged.
26.4.1. Implementation
The short procedure for configuring a standby server using this alternative method is as follows. For
full details of each step, refer to previous sections as noted.
1. Set up primary and standby systems as nearly identical as possible, including two identical copies
of PostgreSQL at the same release level.
2. Set up continuous archiving from the primary to a WAL archive directory on the standby server.
Ensure that archive_mode, archive_command and archive_timeout are set appropriately on the
primary (see Section 25.3.1).
3. Make a base backup of the primary server (see Section 25.3.2), and load this data onto the standby.
4. Begin recovery on the standby server from the local WAL archive, using a recovery.conf that
specifies a restore_command that waits as described previously (see Section 25.3.4).
Recovery treats the WAL archive as read-only, so once a WAL file has been copied to the standby
system it can be copied to tape at the same time as it is being read by the standby database server.
Thus, running a standby server for high availability can be performed at the same time as files are
stored for longer term disaster recovery purposes.
For testing purposes, it is possible to run both primary and standby servers on the same system. This
does not provide any worthwhile improvement in server robustness, nor would it be described as HA.
An external program can call the pg_walfile_name_offset() function (see Section 9.26) to
find out the file name and the exact byte offset within it of the current end of WAL. It can then access
the WAL file directly and copy the data from the last known end of WAL through the current end
over to the standby servers. With this approach, the window for data loss is the polling cycle time of
the copying program, which can be very small, and there is no wasted bandwidth from forcing par-
tially-used segment files to be archived. Note that the standby servers' restore_command scripts
can only deal with whole WAL files, so the incrementally copied data is not ordinarily made available
to the standby servers. It is of use only when the primary dies — then the last partial WAL file is
fed to the standby before allowing it to come up. The correct implementation of this process requires
cooperation of the restore_command script with the data copying program.
Starting with PostgreSQL version 9.0, you can use streaming replication (see Section 26.2.5) to
achieve the same benefits with less effort.
26.5. Hot Standby
Running queries in hot standby mode is similar to normal query operation, though there are several
usage and administrative differences explained below.
The data on the standby takes some time to arrive from the primary server so there will be a measurable
delay between primary and standby. Running the same query nearly simultaneously on both primary
and standby might therefore return differing results. We say that data on the standby is eventually
consistent with the primary. Once the commit record for a transaction is replayed on the standby, the
changes made by that transaction will be visible to any new snapshots taken on the standby. Snapshots
may be taken at the start of each query or at the start of each transaction, depending on the current
transaction isolation level. For more details, see Section 13.2.
Transactions started during hot standby may issue the following commands:
• LOCK TABLE, though only when explicitly in one of these modes: ACCESS SHARE, ROW SHARE
or ROW EXCLUSIVE.
• UNLISTEN
Transactions started during hot standby will never be assigned a transaction ID and cannot write to
the system write-ahead log. Therefore, the following actions will produce error messages:
• Data Manipulation Language (DML) - INSERT, UPDATE, DELETE, COPY FROM, TRUNCATE.
Note that there are no allowed actions that result in a trigger being executed during recovery. This
restriction applies even to temporary tables, because table rows cannot be read or written without
assigning a transaction ID, which is currently not possible in a Hot Standby environment.
• Data Definition Language (DDL) - CREATE, DROP, ALTER, COMMENT. This restriction applies
even to temporary tables, because carrying out these operations would require updating the system
catalog tables.
• SELECT ... FOR SHARE | UPDATE, because row locks cannot be taken without updating
the underlying data files.
• LOCK that explicitly requests a mode higher than ROW EXCLUSIVE MODE.
• LISTEN, NOTIFY
In normal operation, “read-only” transactions are allowed to use LISTEN and NOTIFY, so Hot Stand-
by sessions operate under slightly tighter restrictions than ordinary read-only sessions. It is possible
that some of these restrictions might be loosened in a future release.
During hot standby, the parameter transaction_read_only is always true and may not be
changed. But as long as no attempt is made to modify the database, connections during hot standby
will act much like any other database connection. If failover or switchover occurs, the database will
switch to normal processing mode. Sessions will remain connected while the server changes mode.
Once hot standby finishes, it will be possible to initiate read-write transactions (even from a session
begun during hot standby).
Users will be able to tell whether their session is read-only by issuing SHOW transac-
tion_read_only. In addition, a set of functions (Table 9.80) allow users to access information
about the standby server. These allow you to write programs that are aware of the current state of the
database. These can be used to monitor the progress of recovery, or to allow you to write complex
programs that restore the database to particular states.
There are also additional types of conflict that can occur with Hot Standby. These conflicts are hard
conflicts in the sense that queries might need to be canceled and, in some cases, sessions disconnect-
ed to resolve them. The user is provided with several ways to handle these conflicts. Conflict cases
include:
• Access Exclusive locks taken on the primary server, including both explicit LOCK commands and
various DDL actions, conflict with table accesses in standby queries.
• Dropping a tablespace on the primary conflicts with standby queries using that tablespace for tem-
porary work files.
• Dropping a database on the primary conflicts with sessions connected to that database on the stand-
by.
• Application of a vacuum cleanup record from WAL conflicts with standby transactions whose snap-
shots can still “see” any of the rows to be removed.
• Application of a vacuum cleanup record from WAL conflicts with queries accessing the target page
on the standby, whether or not the data to be removed is visible.
On the primary server, these cases simply result in waiting; and the user might choose to cancel either
of the conflicting actions. However, on the standby there is no choice: the WAL-logged action already
occurred on the primary so the standby must not fail to apply it. Furthermore, allowing WAL applica-
tion to wait indefinitely may be very undesirable, because the standby's state will become increasingly
far behind the primary's. Therefore, a mechanism is provided to forcibly cancel standby queries that
conflict with to-be-applied WAL records.
An example of the problem situation is an administrator on the primary server running DROP TABLE
on a table that is currently being queried on the standby server. Clearly the standby query cannot
continue if the DROP TABLE is applied on the standby. If this situation occurred on the primary, the
DROP TABLE would wait until the other query had finished. But when DROP TABLE is run on the
primary, the primary doesn't have information about what queries are running on the standby, so it will
not wait for any such standby queries. The WAL change records come through to the standby while
the standby query is still running, causing a conflict. The standby server must either delay application
of the WAL records (and everything after them, too) or else cancel the conflicting query so that the
DROP TABLE can be applied.
When a conflicting query is short, it's typically desirable to allow it to complete by delaying WAL
application for a little bit; but a long delay in WAL application is usually not desirable. So the can-
cel mechanism has parameters, max_standby_archive_delay and max_standby_streaming_delay, that
define the maximum allowed delay in WAL application. Conflicting queries will be canceled once it
has taken longer than the relevant delay setting to apply any newly-received WAL data. There are two
parameters so that different delay values can be specified for the case of reading WAL data from an
archive (i.e., initial recovery from a base backup or “catching up” a standby server that has fallen far
behind) versus reading WAL data via streaming replication.
In a standby server that exists primarily for high availability, it's best to set the delay parameters
relatively short, so that the server cannot fall far behind the primary due to delays caused by standby
queries. However, if the standby server is meant for executing long-running queries, then a high or even
infinite delay value may be preferable. Keep in mind however that a long-running query could cause
other sessions on the standby server to not see recent changes on the primary, if it delays application
of WAL records.
Canceled queries may be retried immediately (after beginning a new transaction, of course). Since
query cancellation depends on the nature of the WAL records being replayed, a query that was canceled
may well succeed if it is executed again.
Keep in mind that the delay parameters are compared to the elapsed time since the WAL data was
received by the standby server. Thus, the grace period allowed to any one query on the standby is
never more than the delay parameter, and could be considerably less if the standby has already fallen
behind as a result of waiting for previous queries to complete, or as a result of being unable to keep
up with a heavy update load.
The most common reason for conflict between standby queries and WAL replay is “early cleanup”.
Normally, PostgreSQL allows cleanup of old row versions when there are no transactions that need
to see them to ensure correct visibility of data according to MVCC rules. However, this rule can only
be applied for transactions executing on the master. So it is possible that cleanup on the master will
remove row versions that are still visible to a transaction on the standby.
Experienced users should note that both row version cleanup and row version freezing will potentially
conflict with standby queries. Running a manual VACUUM FREEZE is likely to cause conflicts even
on tables with no updated or deleted rows.
Users should be clear that tables that are regularly and heavily updated on the primary server will
quickly cause cancellation of longer running queries on the standby. In such cases the setting of a
finite value for max_standby_streaming_delay can be considered similar to setting statement_timeout.
Another option is to increase vacuum_defer_cleanup_age on the primary server, so that dead rows will
not be cleaned up as quickly as they normally would be. This will allow more time for queries to exe-
cute before they are canceled on the standby, without having to set a high max_standby_stream-
ing_delay. However it is difficult to guarantee any specific execution-time window with this ap-
proach, since vacuum_defer_cleanup_age is measured in transactions executed on the primary
server.
The number of query cancels and the reason for them can be viewed using the pg_stat_data-
base_conflicts system view on the standby server. The pg_stat_database system view
also contains summary information.
Consistency information is recorded once per checkpoint on the primary. It is not possible to enable
hot standby when reading WAL written during a period when wal_level was not set to replica
or logical on the primary. Reaching a consistent state can also be delayed in the presence of both
of these conditions:
• A write transaction has more than 64 subtransactions
• Very long-lived write transactions
If you are running file-based log shipping ("warm standby"), you might need to wait until the next
WAL file arrives, which could be as long as the archive_timeout setting on the primary.
The setting of some parameters on the standby will need reconfiguration if they have been changed
on the primary. For these parameters, the value on the standby must be equal to or greater than the
value on the primary. Therefore, if you want to increase these values, you should do so on all standby
servers first, before applying the changes to the primary server. Conversely, if you want to decrease
these values, you should do so on the primary server first, before applying the changes to all standby
servers. If these parameters are not set high enough then the standby will refuse to start. Higher values
can then be supplied and the server restarted to begin recovery again. These parameters are:
• max_connections
• max_prepared_transactions
• max_locks_per_transaction
• max_worker_processes
It is important that the administrator select appropriate settings for max_standby_archive_delay and
max_standby_streaming_delay. The best choices vary depending on business priorities. For example
if the server is primarily tasked as a High Availability server, then you will want low delay settings,
perhaps even zero, though that is a very aggressive setting. If the standby server is tasked as an addi-
tional server for decision support queries then it might be acceptable to set the maximum delay values
to many hours, or even -1 which means wait forever for queries to complete.
Transaction status "hint bits" written on the primary are not WAL-logged, so data on the standby will
likely re-write the hints again on the standby. Thus, the standby server will still perform disk writes
even though all users are read-only; no changes occur to the data values themselves. Users will still
write large sort temporary files and re-generate relcache info files, so no part of the database is truly
read-only during hot standby mode. Note also that writes to remote databases using dblink module,
and other operations outside the database using PL functions will still be possible, even though the
transaction is read-only locally.
The following types of administration commands are not accepted during recovery mode:
• Data Definition Language (DDL) - e.g., CREATE INDEX
• Privilege and Ownership - GRANT, REVOKE, REASSIGN
• Maintenance commands - ANALYZE, VACUUM, CLUSTER, REINDEX
Again, note that some of these commands are actually allowed during "read only" mode transactions
on the primary.
As a result, you cannot create additional indexes that exist solely on the standby, nor statistics that
exist solely on the standby. If these administration commands are needed, they should be executed on
the primary, and eventually those changes will propagate to the standby.
pg_locks will show locks held by backends, as normal. pg_locks also shows a virtual transaction
managed by the Startup process that owns all AccessExclusiveLocks held by transactions being
replayed by recovery. Note that the Startup process does not acquire locks to make database changes,
and thus locks other than AccessExclusiveLocks do not show in pg_locks for the Startup
process; they are just presumed to exist.
The Nagios plugin check_pgsql will work, because the simple information it checks for exists. The
check_postgres monitoring script will also work, though some reported values could give different or
confusing results. For example, last vacuum time will not be maintained, since no vacuum occurs on
the standby. Vacuums running on the primary do still send their changes to the standby.
WAL file control commands will not work during recovery, e.g., pg_start_backup,
pg_switch_wal etc.
Advisory locks work normally in recovery, including deadlock detection. Note that advisory locks
are never WAL logged, so it is impossible for an advisory lock on either the primary or the standby
to conflict with WAL replay. Nor is it possible to acquire an advisory lock on the primary and have
it initiate a similar advisory lock on the standby. Advisory locks relate only to the server on which
they are acquired.
Trigger-based replication systems such as Slony, Londiste and Bucardo won't run on the standby at
all, though they will run happily on the primary server as long as the changes are not sent to standby
servers to be applied. WAL replay is not trigger-based so you cannot relay from the standby to any
system that requires additional database writes or relies on the use of triggers.
New OIDs cannot be assigned, though some UUID generators may still work as long as they do not
rely on writing new status to the database.
Currently, temporary table creation is not allowed during read only transactions, so in some cases
existing scripts will not run correctly. This restriction might be relaxed in a later release. This is both
a SQL standard compliance issue and a technical issue.
DROP TABLESPACE can only succeed if the tablespace is empty. Some standby users may be actively
using the tablespace via their temp_tablespaces parameter. If there are temporary files in the
tablespace, all active queries are canceled to ensure that temporary files are removed, so the tablespace
can be removed and WAL replay can continue.
Running DROP DATABASE or ALTER DATABASE ... SET TABLESPACE on the primary will
generate a WAL entry that will cause all users connected to that database on the standby to be forcibly
disconnected. This action occurs immediately, whatever the setting of max_standby_stream-
ing_delay. Note that ALTER DATABASE ... RENAME does not disconnect users, which in
most cases will go unnoticed, though it might in some cases confuse a program that depends in
some way upon the database name.
In normal (non-recovery) mode, if you issue DROP USER or DROP ROLE for a role with login
capability while that user is still connected then nothing happens to the connected user - they remain
connected. The user cannot reconnect however. This behavior applies in recovery also, so a DROP
USER on the primary does not disconnect that user on the standby.
The statistics collector is active during recovery. All scans, reads, blocks, index usage, etc., will be
recorded normally on the standby. Replayed actions will not duplicate their effects on primary, so re-
playing an insert will not increment the Inserts column of pg_stat_user_tables. The stats file is deleted
at the start of recovery, so stats from primary and standby will differ; this is considered a feature,
not a bug.
Autovacuum is not active during recovery. It will start normally at the end of recovery.
The checkpointer process and the background writer process are active during recovery. The check-
pointer process will perform restartpoints (similar to checkpoints on the primary) and the background
writer process will perform normal block cleaning activities. This can include updates of the hint bit
information stored on the standby server. The CHECKPOINT command is accepted during recovery,
though it performs a restartpoint rather than a new checkpoint.
26.5.5. Caveats
There are several limitations of Hot Standby. These can and probably will be fixed in future releases:
• Full knowledge of running transactions is required before snapshots can be taken. Transactions that
use large numbers of subtransactions (currently greater than 64) will delay the start of read only
connections until the completion of the longest running write transaction. If this situation occurs,
explanatory messages will be sent to the server log.
• Valid starting points for standby queries are generated at each checkpoint on the master. If the
standby is shut down while the master is in a shutdown state, it might not be possible to re-enter Hot
Standby until the primary is started up, so that it generates further starting points in the WAL logs.
This situation isn't a problem in the most common situations where it might happen. Generally, if
the primary is shut down and not available anymore, that's likely due to a serious failure that requires
the standby being converted to operate as the new primary anyway. And in situations where the
primary is being intentionally taken down, coordinating to make sure the standby becomes the new
primary smoothly is also standard procedure.
• The Serializable transaction isolation level is not yet available in hot standby. (See Section 13.2.3
and Section 13.4.1 for details.) An attempt to set a transaction to the serializable isolation level in
hot standby mode will generate an error.
Chapter 27. Recovery Configuration
This chapter describes the settings available in the recovery.conf file. They apply only for the
duration of the recovery. They must be reset for any subsequent recovery you wish to perform. They
cannot be changed once recovery has begun.
Settings in recovery.conf are specified in the format name = 'value'. One parameter is
specified per line. Hash marks (#) designate the rest of the line as a comment. To embed a single quote
in a parameter value, write two quotes ('').
restore_command (string)
The local shell command to execute to retrieve an archived segment of the WAL file series. This
parameter is required for archive recovery, but optional for streaming replication. Any %f in the
string is replaced by the name of the file to retrieve from the archive, and any %p is replaced by the
copy destination path name on the server. (The path name is relative to the current working di-
rectory, i.e., the cluster's data directory.) Any %r is replaced by the name of the file containing the
last valid restart point. That is the earliest file that must be kept to allow a restore to be restartable,
so this information can be used to truncate the archive to just the minimum required to support
restarting from the current restore. %r is typically only used by warm-standby configurations (see
Section 26.2). Write %% to embed an actual % character.
It is important for the command to return a zero exit status only if it succeeds. The command
will be asked for file names that are not present in the archive; it must return nonzero when so
asked. Examples:
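(The archive locations shown here are illustrative.)
restore_command = 'cp /mnt/server/archivedir/%f "%p"'
restore_command = 'copy "C:\\server\\archivedir\\%f" "%p"'  # Windows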
An exception is that if the command was terminated by a signal (other than SIGTERM, which is
used as part of a database server shutdown) or an error by the shell (such as command not found),
then recovery will abort and the server will not start up.
archive_cleanup_command (string)
This optional parameter specifies a shell command that will be executed at every restartpoint.
The purpose of archive_cleanup_command is to provide a mechanism for cleaning up old
archived WAL files that are no longer needed by the standby server. Any %r is replaced by the
name of the file containing the last valid restart point. That is the earliest file that must be kept to
allow a restore to be restartable, and so all files earlier than %r may be safely removed. This in-
formation can be used to truncate the archive to just the minimum required to support restart from
the current restore. The pg_archivecleanup module is often used in archive_cleanup_com-
mand for single-standby configurations, for example:
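archive_cleanup_command = 'pg_archivecleanup /mnt/server/archivedir %r'
(The archive directory path is illustrative.)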
Note however that if multiple standby servers are restoring from the same archive directory, you
will need to ensure that you do not delete WAL files until they are no longer needed by any of the
servers. archive_cleanup_command would typically be used in a warm-standby configu-
ration (see Section 26.2). Write %% to embed an actual % character in the command.
If the command returns a nonzero exit status then a warning log message will be written. An
exception is that if the command was terminated by a signal or an error by the shell (such as
command not found), a fatal error will be raised.
recovery_end_command (string)
This parameter specifies a shell command that will be executed once only at the end of recov-
ery. This parameter is optional. The purpose of the recovery_end_command is to provide a
mechanism for cleanup following replication or recovery. Any %r is replaced by the name of the
file containing the last valid restart point, like in archive_cleanup_command.
If the command returns a nonzero exit status then a warning log message will be written and the
database will proceed to start up anyway. An exception is that if the command was terminated
by a signal or an error by the shell (such as command not found), the database will not proceed
with startup.
recovery_target = 'immediate'
This parameter specifies that recovery should end as soon as a consistent state is reached, i.e.,
as early as possible. When restoring from an online backup, this means the point where taking
the backup ended.
Technically, this is a string parameter, but 'immediate' is currently the only allowed value.
recovery_target_name (string)
This parameter specifies the named restore point (created with pg_create_restore_point()) to which recovery will proceed.
recovery_target_time (timestamp)
This parameter specifies the time stamp up to which recovery will proceed. The precise stopping
point is also influenced by recovery_target_inclusive.
recovery_target_xid (string)
This parameter specifies the transaction ID up to which recovery will proceed. Keep in mind that
while transaction IDs are assigned sequentially at transaction start, transactions can complete in
a different numeric order. The transactions that will be recovered are those that committed before
(and optionally including) the specified one. The precise stopping point is also influenced by
recovery_target_inclusive.
recovery_target_lsn (pg_lsn)
This parameter specifies the LSN of the write-ahead log location up to which recovery will pro-
ceed. The precise stopping point is also influenced by recovery_target_inclusive. This parameter
is parsed using the system data type pg_lsn.
The following options further specify the recovery target, and affect what happens when the target
is reached:
recovery_target_inclusive (boolean)
Specifies whether to stop just after the specified recovery target (true), or just before the recovery target (false). Applies when recovery_target_lsn, recovery_target_time, or recovery_target_xid is specified. This setting controls whether transactions having exactly the target WAL location (LSN), commit time, or transaction ID, respectively, will be included in the recovery. Default is true.
recovery_target_timeline (string)
Specifies recovering into a particular timeline. The default is to recover along the same timeline
that was current when the base backup was taken. Setting this to latest recovers to the latest
timeline found in the archive, which is useful in a standby server. Other than that you only need
to set this parameter in complex re-recovery situations, where you need to return to a state that
itself was reached after a point-in-time recovery. See Section 25.3.5 for discussion.
recovery_target_action (enum)
Specifies what action the server should take once the recovery target is reached. The default
is pause, which means recovery will be paused. promote means the recovery process will
finish and the server will start to accept connections. Finally shutdown will stop the server after
reaching the recovery target.
The intended use of the pause setting is to allow queries to be executed against the database
to check if this recovery target is the most desirable point for recovery. The paused state can be
resumed by using pg_wal_replay_resume() (see Table 9.81), which then causes recovery
to end. If this recovery target is not the desired stopping point, then shut down the server, change
the recovery target settings to a later target and restart to continue recovery.
The shutdown setting is useful to have the instance ready at the exact replay point desired.
The instance will still be able to replay more WAL records (and in fact will have to replay WAL
records since the last checkpoint next time it is started).
This setting has no effect if no recovery target is set. If hot_standby is not enabled, a setting of
pause will act the same as shutdown.
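As an illustration, a recovery.conf for point-in-time recovery might combine the settings above as follows (a sketch only; the archive path and timestamp are assumptions):
restore_command = 'cp /mnt/server/archivedir/%f "%p"'
recovery_target_time = '2018-10-22 14:30:00'
recovery_target_inclusive = false
recovery_target_action = 'promote'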
standby_mode (boolean)
Specifies whether to start the PostgreSQL server as a standby. If this parameter is on, the server
will not stop recovery when the end of archived WAL is reached, but will keep trying to continue
recovery by fetching new WAL segments using restore_command and/or by connecting to
the primary server as specified by the primary_conninfo setting.
primary_conninfo (string)
Specifies a connection string to be used for the standby server to connect with the primary. This
string is in the format described in Section 34.1.1. If any option is unspecified in this string,
then the corresponding environment variable (see Section 34.14) is checked. If the environment
variable is not set either, then defaults are used.
The connection string should specify the host name (or address) of the primary server, as well
as the port number if it is not the same as the standby server's default. Also specify a user name
corresponding to a suitably-privileged role on the primary (see Section 26.2.5.1). A password
needs to be provided too, if the primary demands password authentication. It can be provided in the primary_conninfo string, or in a separate ~/.pgpass file on the standby server (use replication as the database name). Do not specify a database name in the primary_conninfo string.
primary_slot_name (string)
Optionally specifies an existing replication slot to be used when connecting to the primary via
streaming replication to control resource removal on the upstream node (see Section 26.2.6). This
setting has no effect if primary_conninfo is not set.
trigger_file (string)
Specifies a trigger file whose presence ends recovery in the standby. Even if this value is not
set, you can still promote the standby using pg_ctl promote. This setting has no effect if
standby_mode is off.
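Putting these settings together, a minimal recovery.conf for a streaming standby could look like this (host, user, password, slot name, and trigger path are illustrative assumptions):
standby_mode = 'on'
primary_conninfo = 'host=primary.example.com port=5432 user=replicator password=secret'
primary_slot_name = 'standby1'
restore_command = 'cp /mnt/server/archivedir/%f "%p"'
trigger_file = '/var/lib/postgresql/11/main/promote.trigger'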
recovery_min_apply_delay (integer)
By default, a standby server restores WAL records from the primary as soon as possible. It may be
useful to have a time-delayed copy of the data, offering opportunities to correct data loss errors.
This parameter allows you to delay recovery by a fixed period of time, measured in milliseconds
if no unit is specified. For example, if you set this parameter to 5min, the standby will replay
each transaction commit only when the system time on the standby is at least five minutes past
the commit time reported by the master.
It is possible that the replication delay between servers exceeds the value of this parameter, in
which case no delay is added. Note that the delay is calculated between the WAL time stamp as
written on master and the current time on the standby. Delays in transfer because of network lag or
cascading replication configurations may reduce the actual wait time significantly. If the system
clocks on master and standby are not synchronized, this may lead to recovery applying records
earlier than expected; but that is not a major issue because useful settings of this parameter are
much larger than typical time deviations between servers.
The delay occurs only on WAL records for transaction commits. Other records are replayed as
quickly as possible, which is not a problem because MVCC visibility rules ensure their effects
are not visible until the corresponding commit record is applied.
The delay occurs once the database in recovery has reached a consistent state, until the standby
is promoted or triggered. After that the standby will end recovery without further waiting.
This parameter is intended for use with streaming replication deployments; however, if the para-
meter is specified it will be honored in all cases. hot_standby_feedback will be delayed
by use of this feature which could lead to bloat on the master; use both together with care.
Warning
Synchronous replication is affected by this setting when synchronous_commit is set
to remote_apply; every COMMIT will need to wait to be applied.
Chapter 28. Monitoring Database Activity
A database administrator frequently wonders, “What is the system doing right now?” This chapter
discusses how to find that out.
Several tools are available for monitoring database activity and analyzing performance. Most of this
chapter is devoted to describing PostgreSQL's statistics collector, but one should not neglect regular
Unix monitoring programs such as ps, top, iostat, and vmstat. Also, once one has identified a
poorly-performing query, further investigation might be needed using PostgreSQL's EXPLAIN com-
mand. Section 14.1 discusses EXPLAIN and other methods for understanding the behavior of an in-
dividual query.
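On most Unix platforms, a listing such as the following can be obtained with ps (illustrative output; the user names, database names, process IDs, and paths shown are assumptions used in the discussion that follows):
$ ps auxww | grep ^postgres
postgres  15551  0.0  0.1  57536  7132 pts/0  S  18:02  0:00 postgres -D /usr/local/pgsql/data
postgres  15554  0.0  0.0  57536  1184 pts/0  S  18:02  0:00 postgres: background writer
postgres  15555  0.0  0.0  57536   916 pts/0  S  18:02  0:00 postgres: checkpointer
postgres  15556  0.0  0.0  57536   916 pts/0  S  18:02  0:00 postgres: walwriter
postgres  15557  0.0  0.0  58504  2244 pts/0  S  18:02  0:00 postgres: autovacuum launcher
postgres  15558  0.0  0.0  17512  1068 pts/0  S  18:02  0:00 postgres: stats collector
postgres  15606  0.0  0.0  58772  3052 pts/0  S  18:07  0:00 postgres: tgl regression [local] SELECT waiting
postgres  15610  0.0  0.0  58772  3056 pts/0  S  18:07  0:00 postgres: tgl regression [local] idle in transaction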
(The appropriate invocation of ps varies across different platforms, as do the details of what is shown.
This example is from a recent Linux system.) The first process listed here is the master server process.
The command arguments shown for it are the same ones used when it was launched. The next five
processes are background worker processes automatically launched by the master process. (The “stats
collector” process will not be present if you have set the system not to start the statistics collector;
likewise the “autovacuum launcher” process can be disabled.) Each of the remaining processes is a
server process handling one client connection. Each such process sets its command line display in the form
postgres: user database host activity
The user, database, and (client) host items remain the same for the life of the client connection, but
the activity indicator changes. The activity can be idle (i.e., waiting for a client command), idle
in transaction (waiting for client inside a BEGIN block), or a command type name such as
SELECT. Also, waiting is appended if the server process is presently waiting on a lock held by
another session. In the above example we can infer that process 15606 is waiting for process 15610 to
complete its transaction and thereby release some lock. (Process 15610 must be the blocker, because
there is no other active session. In more complicated cases it would be necessary to look into the
pg_locks system view to determine who is blocking whom.)
If cluster_name has been configured, the cluster name will also be shown in ps output:
$ ps aux|grep server1
postgres 27093 0.0 0.0 30096 2752 ? Ss 11:34 0:00
postgres: server1: background writer
...
If you have turned off update_process_title then the activity indicator is not updated; the process title
is set only once when a new process is launched. On some platforms this saves a measurable amount
of per-command overhead; on others it's insignificant.
Tip
Solaris requires special handling. You must use /usr/ucb/ps, rather than /bin/ps. You
also must use two w flags, not just one. In addition, your original invocation of the postgres
command must have a shorter ps status display than that provided by each server process.
If you fail to do all three things, the ps output for each server process will be the original
postgres command line.
PostgreSQL also supports reporting dynamic information about exactly what is going on in the system
right now, such as the exact command currently being executed by other server processes, and which
other connections exist in the system. This facility is independent of the collector process.
The parameter track_activities enables monitoring of the current command being executed by any
server process.
The parameter track_counts controls whether statistics are collected about table and index accesses.
The parameter track_io_timing enables monitoring of block read and write times.
Normally these parameters are set in postgresql.conf so that they apply to all server processes,
but it is possible to turn them on or off in individual sessions using the SET command. (To prevent
ordinary users from hiding their activity from the administrator, only superusers are allowed to change
these parameters with SET.)
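For example, a superuser could enable I/O timing for the current session only (a sketch; requires superuser privileges):
SET track_io_timing = on;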
The statistics collector transmits the collected information to other PostgreSQL processes through
temporary files. These files are stored in the directory named by the stats_temp_directory parameter,
pg_stat_tmp by default. For better performance, stats_temp_directory can be pointed at
a RAM-based file system, decreasing physical I/O requirements. When the server shuts down cleanly,
a permanent copy of the statistics data is stored in the pg_stat subdirectory, so that statistics can
be retained across server restarts. When recovery is performed at server start (e.g., after immediate
shutdown, server crash, and point-in-time recovery), all statistics counters are reset.
When using the statistics to monitor collected data, it is important to realize that the information does
not update instantaneously. Each individual server process transmits new statistical counts to the col-
lector just before going idle; so a query or transaction still in progress does not affect the displayed
totals. Also, the collector itself emits a new report at most once per PGSTAT_STAT_INTERVAL mil-
liseconds (500 ms unless altered while building the server). So the displayed information lags behind
actual activity. However, current-query information collected by track_activities is always
up-to-date.
Another important point is that when a server process is asked to display any of these statistics, it first
fetches the most recent report emitted by the collector process and then continues to use this snapshot
for all statistical views and functions until the end of its current transaction. So the statistics will
show static information as long as you continue the current transaction. Similarly, information about
the current queries of all sessions is collected when any such information is first requested within a
transaction, and the same information will be displayed throughout the transaction. This is a feature,
not a bug, because it allows you to perform several queries on the statistics and correlate the results
without worrying that the numbers are changing underneath you. But if you want to see new results
with each query, be sure to do the queries outside any transaction block. Alternatively, you can invoke
pg_stat_clear_snapshot(), which will discard the current transaction's statistics snapshot (if
any). The next use of statistical information will cause a new snapshot to be fetched.
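For instance, within a transaction block you might force a fresh snapshot between two readings (a sketch):
BEGIN;
SELECT numbackends FROM pg_stat_database WHERE datname = current_database();
-- ... later in the same transaction, discard the cached snapshot:
SELECT pg_stat_clear_snapshot();
SELECT numbackends FROM pg_stat_database WHERE datname = current_database();
COMMIT;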
A transaction can also see its own statistics (as yet untransmitted to the collector) in the views pg_stat_xact_all_tables, pg_stat_xact_sys_tables, pg_stat_xact_user_tables, and pg_stat_xact_user_functions. These numbers do not act as stated above; instead they update continuously throughout the transaction.
The per-index statistics are particularly useful to determine which indexes are being used and how
effective they are.
The pg_statio_ views are primarily useful to determine the effectiveness of the buffer cache.
When the number of actual disk reads is much smaller than the number of buffer hits, then the cache is
satisfying most read requests without invoking a kernel call. However, these statistics do not give the
entire story: due to the way in which PostgreSQL handles disk I/O, data that is not in the PostgreSQL
buffer cache might still reside in the kernel's I/O cache, and might therefore still be fetched without
requiring a physical read. Users interested in obtaining more detailed information on PostgreSQL I/O
behavior are advised to use the PostgreSQL statistics collector in combination with operating system
utilities that allow insight into the kernel's handling of I/O.
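For example, a rough per-table buffer-cache hit ratio can be computed from pg_statio_user_tables (a sketch):
SELECT relname,
       heap_blks_read,
       heap_blks_hit,
       round(heap_blks_hit::numeric / nullif(heap_blks_hit + heap_blks_read, 0), 3) AS hit_ratio
FROM pg_statio_user_tables
ORDER BY heap_blks_read + heap_blks_hit DESC
LIMIT 10;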
Some of the values that can appear in the state column of the pg_stat_activity view (described below) are:
• idle in transaction: The backend is in a transaction, but is not currently executing a query.
• idle in transaction (aborted): This state is similar to idle in transaction, except one of the statements in the transaction caused an error.
• fastpath function call: The backend is executing a fast-path function.
The pg_stat_activity view will have one row per server process, showing information related
to the current activity of that process.
Note
The wait_event and state columns are independent. If a backend is in the active state,
it may or may not be waiting on some event. If the state is active and wait_event
is non-null, it means that a query is being executed, but is being blocked somewhere in the
system.
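For example, to see which backends are currently executing queries and what, if anything, they are waiting for (a sketch):
SELECT pid, usename, state, wait_event_type, wait_event, query
FROM pg_stat_activity
WHERE state = 'active';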
Note
For tranches registered by extensions, the name is specified by the extension and will be displayed as wait_event. It is quite possible that a user has registered the tranche in only some of the backends (by having the allocation in dynamic shared memory), in which case other backends won't have that information; extension is displayed for such cases.
The pg_stat_replication view will contain one row per WAL sender process, showing statis-
tics about replication to that sender's connected standby server. Only directly connected standbys are
listed; no information is available about downstream standby servers.
The lag times reported in the pg_stat_replication view are measurements of the time taken
for recent WAL to be written, flushed and replayed and for the sender to know about it. These times
represent the commit delay that was (or would have been) introduced by each synchronous commit
level, if the remote server was configured as a synchronous standby. For an asynchronous standby, the
replay_lag column approximates the delay before recent transactions became visible to queries. If
the standby server has entirely caught up with the sending server and there is no more WAL activity,
the most recently measured lag times will continue to be displayed for a short time and then show
NULL.
Lag times work automatically for physical replication. Logical decoding plugins may optionally emit
tracking messages; if they do not, the tracking mechanism will simply display NULL lag.
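For example, the per-standby lag can be inspected with a query such as the following (a sketch):
SELECT application_name, client_addr, state,
       write_lag, flush_lag, replay_lag
FROM pg_stat_replication;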
Note
The reported lag times are not predictions of how long it will take for the standby to catch up
with the sending server assuming the current rate of replay. Such a system would show similar
times while new WAL is being generated, but would differ when the sender becomes idle.
In particular, when the standby has caught up completely, pg_stat_replication shows
the time taken to write, flush and replay the most recent reported WAL location rather than
zero as some users might expect. This is consistent with the goal of measuring synchronous
commit and transaction visibility delays for recent write transactions. To reduce confusion for
users expecting a different model of lag, the lag columns revert to NULL after a short time on
a fully replayed idle system. Monitoring systems should choose whether to represent this as missing data, as zero, or to continue displaying the last known value.
The pg_stat_wal_receiver view will contain only one row, showing statistics about the WAL
receiver from that receiver's connected server.
The pg_stat_subscription view will contain one row per subscription for the main worker (with null PID if the worker is not running), and additional rows for the workers handling the initial data copy of the subscribed tables.
The pg_stat_ssl view will contain one row per backend or WAL sender process, showing statistics about SSL usage on this connection. It can be joined to pg_stat_activity or pg_stat_replication on the pid column to get more details about the connection.
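For example, to list SSL details per connected session (a sketch):
SELECT a.pid, a.usename, a.client_addr, s.ssl, s.version, s.cipher
FROM pg_stat_activity a
JOIN pg_stat_ssl s ON s.pid = a.pid;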
The pg_stat_archiver view will always have a single row, containing data about the archiver
process of the cluster.
The pg_stat_bgwriter view will always have a single row, containing global data for the cluster.
The pg_stat_database view will contain one row for each database in the cluster, showing data-
base-wide statistics.
The pg_stat_database_conflicts view will contain one row per database, showing data-
base-wide statistics about query cancels occurring due to conflicts with recovery on standby servers.
This view will only contain information on standby servers, since conflicts do not occur on master
servers.
The pg_stat_all_tables view will contain one row for each table in the current database (including TOAST tables), showing statistics about accesses to that specific table. The pg_stat_user_tables and pg_stat_sys_tables views contain the same information, but filtered to only show user and system tables respectively.
The pg_stat_all_indexes view will contain one row for each index in the current database, showing statistics about accesses to that specific index. The pg_stat_user_indexes and pg_stat_sys_indexes views contain the same information, but filtered to only show user and system indexes respectively.
Indexes can be used by simple index scans, “bitmap” index scans, and the optimizer. In a bitmap scan
the output of several indexes can be combined via AND or OR rules, so it is difficult to associate
individual heap row fetches with specific indexes when a bitmap scan is used. Therefore, a bitmap scan
increments the pg_stat_all_indexes.idx_tup_read count(s) for the index(es) it uses, and
it increments the pg_stat_all_tables.idx_tup_fetch count for the table, but it does not
affect pg_stat_all_indexes.idx_tup_fetch. The optimizer also accesses indexes to check
for supplied constants whose values are outside the recorded range of the optimizer statistics because
the optimizer statistics might be stale.
Note
The idx_tup_read and idx_tup_fetch counts can be different even without any use of
bitmap scans, because idx_tup_read counts index entries retrieved from the index while
idx_tup_fetch counts live rows fetched from the table. The latter will be less if any dead
or not-yet-committed rows are fetched using the index, or if any heap fetches are avoided by
means of an index-only scan.
The pg_statio_all_tables view will contain one row for each table in the current database (including TOAST tables), showing statistics about I/O on that specific table. The pg_statio_user_tables and pg_statio_sys_tables views contain the same information, but filtered to only show user and system tables respectively.
The pg_statio_all_indexes view will contain one row for each index in the current database, showing statistics about I/O on that specific index. The pg_statio_user_indexes and pg_statio_sys_indexes views contain the same information, but filtered to only show user and system indexes respectively.
The pg_statio_all_sequences view will contain one row for each sequence in the current
database, showing statistics about I/O on that specific sequence.
The pg_stat_user_functions view will contain one row for each tracked function, showing
statistics about executions of that function. The track_functions parameter controls exactly which func-
tions are tracked.
• View all the locks currently outstanding, all the locks on relations in a particular database, all the
locks on a particular relation, or all the locks held by a particular PostgreSQL session.
• Determine the relation in the current database with the most ungranted locks (which might be a
source of contention among database clients).
• Determine the effect of lock contention on overall database performance, as well as the extent to
which contention varies with overall database traffic.
Details of the pg_locks view appear in Section 52.73. For more information on locking and man-
aging concurrency with PostgreSQL, refer to Chapter 13.
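For example, sessions blocked behind another session's lock can be found with pg_blocking_pids() (a sketch):
SELECT pid, pg_blocking_pids(pid) AS blocked_by, wait_event_type, query
FROM pg_stat_activity
WHERE cardinality(pg_blocking_pids(pid)) > 0;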
The phases of VACUUM are:
initializing: VACUUM is preparing to begin scanning the heap. This phase is expected to be very brief.
scanning heap: VACUUM is currently scanning the heap. It will prune and defragment each page if required, and possibly perform freezing activity. The heap_blks_scanned column can be used to monitor the progress of the scan.
vacuuming indexes: VACUUM is currently vacuuming the indexes. If a table has any indexes, this will happen at least once per vacuum, after the heap has been completely scanned. It may happen multiple times per vacuum if maintenance_work_mem (or, in the case of autovacuum, autovacuum_work_mem if set) is insufficient to store the number of dead tuples found.
vacuuming heap: VACUUM is currently vacuuming the heap. Vacuuming the heap is distinct from scanning the heap, and occurs after each instance of vacuuming indexes. If heap_blks_scanned is less than heap_blks_total, the system will return to scanning the heap after this phase is completed; otherwise, it will begin cleaning up indexes after this phase is completed.
cleaning up indexes: VACUUM is currently cleaning up indexes. This occurs after the heap has been completely scanned and all vacuuming of the indexes and the heap has been completed.
truncating heap: VACUUM is currently truncating the heap so as to return empty pages at the end of the relation to the operating system. This occurs after cleaning up indexes.
performing final cleanup: VACUUM is performing final cleanup. During this phase, VACUUM will vacuum the free space map, update statistics in pg_class, and report statistics to the statistics collector. When this phase is completed, VACUUM will end.
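These phases can be observed for a running VACUUM in the pg_stat_progress_vacuum view, for example (a sketch):
SELECT pid, datname, relid::regclass AS table_name, phase,
       heap_blks_scanned, heap_blks_total
FROM pg_stat_progress_vacuum;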
A number of probes or trace points are already inserted into the source code. These probes are intended
to be used by database developers and administrators. By default the probes are not compiled into
PostgreSQL; the user needs to explicitly tell the configure script to make the probes available.
Currently, the DTrace (https://en.wikipedia.org/wiki/DTrace) utility is supported, which, at the time of this writing, is available on Solaris, macOS, FreeBSD, NetBSD, and Oracle Linux. The SystemTap (https://sourceware.org/systemtap/) project for Linux provides a DTrace equivalent and can also be used. Supporting other dynamic tracing utilities is theoretically possible by changing the definitions for the macros in src/include/utils/probes.h.
The following DTrace script illustrates the use of the transaction probes to count transaction starts, aborts, and commits:
#!/usr/sbin/dtrace -qs
postgresql$1:::transaction-start
{
@start["Start"] = count();
self->ts = timestamp;
}
postgresql$1:::transaction-abort
{
@abort["Abort"] = count();
}
postgresql$1:::transaction-commit
/self->ts/
{
@commit["Commit"] = count();
@time["Total time (ns)"] = sum(timestamp - self->ts);
self->ts=0;
}
Start 71
Commit 70
Total time (ns) 2312105013
Note
SystemTap uses a different notation for trace scripts than DTrace does, even though the un-
derlying trace points are compatible. One point worth noting is that at this writing, SystemTap
scripts must reference probe names using double underscores in place of hyphens. This is ex-
pected to be fixed in future SystemTap releases.
You should remember that DTrace scripts need to be carefully written and debugged, otherwise the
trace information collected might be meaningless. In most cases where problems are found it is the
instrumentation that is at fault, not the underlying system. When discussing information found using
dynamic tracing, be sure to enclose the script used to allow that too to be checked and discussed.
1. Decide on probe names and data to be made available through the probes
2. Add the probe definitions to src/backend/utils/probes.d
3. Include pg_trace.h if it is not already present in the module(s) containing the probe points,
and insert TRACE_POSTGRESQL probe macros at the desired locations in the source code
Example: Here is an example of how you would add a probe to trace all new transactions by
transaction ID.
1. Decide that the probe will be named transaction-start and requires a parameter of type
LocalTransactionId
2. Add the probe definition to src/backend/utils/probes.d:
probe transaction__start(LocalTransactionId);
Note the use of the double underline in the probe name. In a DTrace script using the probe, the
double underline needs to be replaced with a hyphen, so transaction-start is the name
to document for users.
3. Include pg_trace.h in the module containing the probe point, and insert the corresponding TRACE_POSTGRESQL macro at the desired location in the source code:
TRACE_POSTGRESQL_TRANSACTION_START(vxid.localTransactionId);
4. After recompiling and running the new binary, check that your newly added probe is available
by executing the following DTrace command. You should see similar output:
There are a few things to be careful about when adding trace macros to the C code:
• You should take care that the data types specified for a probe's parameters match the data types of
the variables used in the macro. Otherwise, you will get compilation errors.
• On most platforms, the arguments of a trace macro may be evaluated even when no tracing is being done, so avoid placing expensive function calls in the arguments. If such a call is unavoidable, consider protecting the macro with a check that the probe is actually enabled:
if (TRACE_POSTGRESQL_TRANSACTION_START_ENABLED())
TRACE_POSTGRESQL_TRANSACTION_START(some_function(...));
Chapter 29. Monitoring Disk Usage
This chapter discusses how to monitor the disk usage of a PostgreSQL database system.
You can monitor disk space in three ways: using the SQL functions listed in Table 9.84, using the
oid2name module, or using manual inspection of the system catalogs. The SQL functions are the
easiest to use and are generally recommended. The remainder of this section shows how to do it by
inspection of the system catalogs.
Using psql on a recently vacuumed or analyzed database, you can issue queries to see the disk usage
of any table:
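For example, assuming a table named customer (the table name is an assumption), a query like this produces output of the following form:
SELECT pg_relation_filepath(oid), relpages
FROM pg_class
WHERE relname = 'customer';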
pg_relation_filepath | relpages
----------------------+----------
base/16384/16806 | 60
(1 row)
Each page is typically 8 kilobytes. (Remember, relpages is only updated by VACUUM, ANALYZE,
and a few DDL commands such as CREATE INDEX.) The file path name is of interest if you want
to examine the table's disk file directly.
To show the space used by TOAST tables, use a query like the following:
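A query of roughly this form (a sketch, again assuming a table named customer) yields the output shown below:
SELECT relname, relpages
FROM pg_class,
     (SELECT reltoastrelid
      FROM pg_class
      WHERE relname = 'customer') AS ss
WHERE oid = ss.reltoastrelid OR
      oid = (SELECT indexrelid
             FROM pg_index
             WHERE indrelid = ss.reltoastrelid)
ORDER BY relname;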
relname | relpages
----------------------+----------
pg_toast_16806 | 0
pg_toast_16806_index | 1
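Index sizes can be displayed in a similar way (a sketch, again using the hypothetical customer table), producing output like the following:
SELECT c2.relname, c2.relpages
FROM pg_class c, pg_class c2, pg_index i
WHERE c.relname = 'customer' AND
      c.oid = i.indrelid AND
      c2.oid = i.indexrelid
ORDER BY c2.relname;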
relname | relpages
----------------------+----------
customer_id_indexdex | 26
It is easy to find your largest tables and indexes using this information:
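For instance (a sketch):
SELECT relname, relpages
FROM pg_class
ORDER BY relpages DESC
LIMIT 10;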
relname | relpages
----------------------+----------
bigtable | 3290
customer | 3144
If you cannot free up additional space on the disk by deleting other things, you can move some of the
database files to other file systems by making use of tablespaces. See Section 22.6 for more information
about that.
Tip
Some file systems perform badly when they are almost full, so do not wait until the disk is
completely full to take action.
If your system supports per-user disk quotas, then the database will naturally be subject to whatever
quota is placed on the user the server runs as. Exceeding the quota will have the same bad effects as
running out of disk space entirely.
Chapter 30. Reliability and the Write-Ahead Log
This chapter explains how the Write-Ahead Log is used to obtain efficient, reliable operation.
30.1. Reliability
Reliability is an important property of any serious database system, and PostgreSQL does everything
possible to guarantee reliable operation. One aspect of reliable operation is that all data recorded by
a committed transaction should be stored in a nonvolatile area that is safe from power loss, operating
system failure, and hardware failure (except failure of the nonvolatile area itself, of course). Success-
fully writing the data to the computer's permanent storage (disk drive or equivalent) ordinarily meets
this requirement. In fact, even if a computer is fatally damaged, if the disk drives survive they can be
moved to another computer with similar hardware and all committed transactions will remain intact.
While forcing data to the disk platters periodically might seem like a simple operation, it is not. Be-
cause disk drives are dramatically slower than main memory and CPUs, several layers of caching exist
between the computer's main memory and the disk platters. First, there is the operating system's buffer
cache, which caches frequently requested disk blocks and combines disk writes. Fortunately, all op-
erating systems give applications a way to force writes from the buffer cache to disk, and PostgreSQL
uses those features. (See the wal_sync_method parameter to adjust how this is done.)
Next, there might be a cache in the disk drive controller; this is particularly common on RAID con-
troller cards. Some of these caches are write-through, meaning writes are sent to the drive as soon as
they arrive. Others are write-back, meaning data is sent to the drive at some later time. Such caches
can be a reliability hazard because the memory in the disk controller cache is volatile, and will lose
its contents in a power failure. Better controller cards have battery-backup units (BBUs), meaning
the card has a battery that maintains power to the cache in case of system power loss. After power is
restored the data will be written to the disk drives.
And finally, most disk drives have caches. Some are write-through while some are write-back, and
the same concerns about data loss exist for write-back drive caches as for disk controller caches.
Consumer-grade IDE and SATA drives are particularly likely to have write-back caches that will not
survive a power failure. Many solid-state drives (SSD) also have volatile write-back caches.
These caches can typically be disabled; however, the method for doing this varies by operating system
and drive type:
• On Linux, IDE and SATA drives can be queried using hdparm -I; write caching is enabled if there is a * next to Write cache. hdparm -W 0 can be used to turn off write caching. SCSI drives can be queried using sdparm. Use sdparm --get=WCE to check whether the write cache is enabled and sdparm --clear=WCE to disable it.
• On FreeBSD, IDE drives can be queried using atacontrol and write caching turned off us-
ing hw.ata.wc=0 in /boot/loader.conf; SCSI drives can be queried using camcontrol
identify, and the write cache both queried and changed using sdparm when available.
• On Solaris, the disk write cache is controlled by format -e. (The Solaris ZFS file system is safe
with disk write-cache enabled because it issues its own disk cache flush commands.)
Recent SATA drives (those following ATAPI-6 or later) offer a drive cache flush command (FLUSH
CACHE EXT), while SCSI drives have long supported a similar command SYNCHRONIZE CACHE.
These commands are not directly accessible to PostgreSQL, but some file systems (e.g., ZFS, ext4) can
use them to flush data to the platters on write-back-enabled drives. Unfortunately, such file systems
behave suboptimally when combined with battery-backup unit (BBU) disk controllers. In such setups,
the synchronize command forces all data from the controller cache to the disks, eliminating much of
the benefit of the BBU. You can run the pg_test_fsync program to see if you are affected. If you are
affected, the performance benefits of the BBU can be regained by turning off write barriers in the file
system or reconfiguring the disk controller, if that is an option. If write barriers are turned off, make
sure the battery remains functional; a faulty battery can potentially lead to data loss. Hopefully file
system and disk controller designers will eventually address this suboptimal behavior.
When the operating system sends a write request to the storage hardware, there is little it can do to
make sure the data has arrived at a truly non-volatile storage area. Rather, it is the administrator's
responsibility to make certain that all storage components ensure integrity for both data and file-system
metadata. Avoid disk controllers that have non-battery-backed write caches. At the drive level, disable
write-back caching if the drive cannot guarantee the data will be written before shutdown. If you use
SSDs, be aware that many of these do not honor cache flush commands by default. You can test for reliable I/O subsystem behavior using diskchecker.pl (https://brad.livejournal.com/2116715.html).
Another risk of data loss is posed by the disk platter write operations themselves. Disk platters are
divided into sectors, commonly 512 bytes each. Every physical read or write operation processes a
whole sector. When a write request arrives at the drive, it might be for some multiple of 512 bytes
(PostgreSQL typically writes 8192 bytes, or 16 sectors, at a time), and the process of writing could fail
due to power loss at any time, meaning some of the 512-byte sectors were written while others were not.
To guard against such failures, PostgreSQL periodically writes full page images to permanent WAL
storage before modifying the actual page on disk. By doing this, during crash recovery PostgreSQL can
restore partially-written pages from WAL. If you have file-system software that prevents partial page
writes (e.g., ZFS), you can turn off this page imaging by turning off the full_page_writes parameter.
Battery-Backed Unit (BBU) disk controllers do not prevent partial page writes unless they guarantee
that data is written to the BBU as full (8kB) pages.
PostgreSQL also protects against some kinds of data corruption on storage devices that may occur
because of hardware errors or media failure over time, such as reading/writing garbage data.
• Each individual record in a WAL file is protected by a CRC-32 (32-bit) check that allows us to tell
if record contents are correct. The CRC value is set when we write each WAL record and checked
during crash recovery, archive recovery and replication.
• Data pages are not currently checksummed by default, though full page images recorded in WAL
records will be protected; see initdb for details about enabling data page checksums.
• Temporary data files used in larger SQL queries for sorts, materializations and intermediate results
are not currently checksummed, nor will WAL records be written for changes to those files.
PostgreSQL does not protect against correctable memory errors and it is assumed you will operate
using RAM that uses industry standard Error Correcting Codes (ECC) or better protection.
Tip
Because WAL restores database file contents after a crash, journaled file systems are not nec-
essary for reliable storage of the data files or WAL files. In fact, journaling overhead can re-
duce performance, especially if journaling causes file system data to be flushed to disk. Fortu-
nately, data flushing during journaling can often be disabled with a file system mount option,
e.g., data=writeback on a Linux ext3 file system. Journaled file systems do improve boot
speed after a crash.
Using WAL results in a significantly reduced number of disk writes, because only the log file needs
to be flushed to disk to guarantee that a transaction is committed, rather than every data file changed
by the transaction. The log file is written sequentially, and so the cost of syncing the log is much
less than the cost of flushing the data pages. This is especially true for servers handling many small
transactions touching different parts of the data store. Furthermore, when the server is processing many
small concurrent transactions, one fsync of the log file may suffice to commit many transactions.
WAL also makes it possible to support on-line backup and point-in-time recovery, as described in
Section 25.3. By archiving the WAL data we can support reverting to any time instant covered by the
available WAL data: we simply install a prior physical backup of the database, and replay the WAL
log just as far as the desired time. What's more, the physical backup doesn't have to be an instantaneous
snapshot of the database state — if it is made over some period of time, then replaying the WAL log
for that period will fix any internal inconsistencies.
As described in the previous section, transaction commit is normally synchronous: the server waits for
the transaction's WAL records to be flushed to permanent storage before returning a success indication
to the client. The client is therefore guaranteed that a transaction reported to be committed will be
preserved, even in the event of a server crash immediately after. However, for short transactions this
delay is a major component of the total transaction time. Selecting asynchronous commit mode means
that the server returns success as soon as the transaction is logically completed, before the WAL
records it generated have actually made their way to disk. This can provide a significant boost in
throughput for small transactions.
Asynchronous commit introduces the risk of data loss. There is a short time window between the report
of transaction completion to the client and the time that the transaction is truly committed (that is, it is
guaranteed not to be lost if the server crashes). Thus asynchronous commit should not be used if the
client will take external actions relying on the assumption that the transaction will be remembered.
As an example, a bank would certainly not use asynchronous commit for a transaction recording an
ATM's dispensing of cash. But in many scenarios, such as event logging, there is no need for a strong
guarantee of this kind.
The risk that is taken by using asynchronous commit is of data loss, not data corruption. If the database
should crash, it will recover by replaying WAL up to the last record that was flushed. The database will
therefore be restored to a self-consistent state, but any transactions that were not yet flushed to disk
will not be reflected in that state. The net effect is therefore loss of the last few transactions. Because
the transactions are replayed in commit order, no inconsistency can be introduced — for example, if
transaction B made changes relying on the effects of a previous transaction A, it is not possible for
A's effects to be lost while B's effects are preserved.
The user can select the commit mode of each transaction, so that it is possible to have both synchro-
nous and asynchronous commit transactions running concurrently. This allows flexible trade-offs be-
tween performance and certainty of transaction durability. The commit mode is controlled by the user-
settable parameter synchronous_commit, which can be changed in any of the ways that a configura-
tion parameter can be set. The mode used for any one transaction depends on the value of synchro-
nous_commit when transaction commit begins.
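For example, a single transaction or an entire session can opt out of the synchronous flush (a sketch; event_log is a hypothetical table):
-- per-transaction:
BEGIN;
SET LOCAL synchronous_commit TO off;
INSERT INTO event_log (message) VALUES ('something happened');
COMMIT;
-- or per-session:
SET synchronous_commit TO off;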
Certain utility commands, for instance DROP TABLE, are forced to commit synchronously regardless
of the setting of synchronous_commit. This is to ensure consistency between the server's file
system and the logical state of the database. The commands supporting two-phase commit, such as
PREPARE TRANSACTION, are also always synchronous.
If the database crashes during the risk window between an asynchronous commit and the writing of
the transaction's WAL records, then changes made during that transaction will be lost. The duration
of the risk window is limited because a background process (the “WAL writer”) flushes unwritten
WAL records to disk every wal_writer_delay milliseconds. The actual maximum duration of the risk
window is three times wal_writer_delay because the WAL writer is designed to favor writing
whole pages at a time during busy periods.
Caution
An immediate-mode shutdown is equivalent to a server crash, and will therefore cause loss of
any unflushed asynchronous commits.
Asynchronous commit provides behavior different from setting fsync = off. fsync is a server-wide
setting that will alter the behavior of all transactions. It disables all logic within PostgreSQL that
attempts to synchronize writes to different portions of the database, and therefore a system crash (that
is, a hardware or operating system crash, not a failure of PostgreSQL itself) could result in arbitrarily
bad corruption of the database state. In many scenarios, asynchronous commit provides most of the
performance improvement that could be obtained by turning off fsync, but without the risk of data
corruption.
commit_delay also sounds very similar to asynchronous commit, but it is actually a synchronous com-
mit method (in fact, commit_delay is ignored during an asynchronous commit). commit_delay
causes a delay just before a transaction flushes WAL to disk, in the hope that a single flush executed by
one such transaction can also serve other transactions committing at about the same time. The setting
can be thought of as a way of increasing the time window in which transactions can join a group about
to participate in a single flush, to amortize the cost of the flush among multiple transactions.
Checkpoints are points in the sequence of transactions at which it is guaranteed that the heap and index
data files have been updated with all information written before that checkpoint. At checkpoint time,
all dirty data pages are flushed to disk and a special checkpoint record is written to the log file. (The
change records were previously flushed to the WAL files.) In the event of a crash, the crash recovery
procedure looks at the latest checkpoint record to determine the point in the log (known as the redo
record) from which it should start the REDO operation. Any changes made to data files before that
point are guaranteed to be already on disk. Hence, after a checkpoint, log segments preceding the
one containing the redo record are no longer needed and can be recycled or removed. (When WAL
archiving is being done, the log segments must be archived before being recycled or removed.)
The checkpoint requirement of flushing all dirty data pages to disk can cause a significant I/O load.
For this reason, checkpoint activity is throttled so that I/O begins at checkpoint start and completes
before the next checkpoint is due to start; this minimizes performance degradation during checkpoints.
The server's checkpointer process automatically performs a checkpoint every so often. A checkpoint
is begun every checkpoint_timeout seconds, or if max_wal_size is about to be exceeded, whichever
comes first. The default settings are 5 minutes and 1 GB, respectively. If no WAL has been written
since the previous checkpoint, new checkpoints will be skipped even if checkpoint_timeout
has passed. (If WAL archiving is being used and you want to put a lower limit on how often files are
archived in order to bound potential data loss, you should adjust the archive_timeout parameter rather
than the checkpoint parameters.) It is also possible to force a checkpoint by using the SQL command
CHECKPOINT.
Checkpoints are fairly expensive, first because they require writing out all currently dirty buffers, and
second because they result in extra subsequent WAL traffic as discussed above. It is therefore wise
to set the checkpointing parameters high enough so that checkpoints don't happen too often. As a
simple sanity check on your checkpointing parameters, you can set the checkpoint_warning parameter.
If checkpoints happen closer together than checkpoint_warning seconds, a message will be
output to the server log recommending increasing max_wal_size. Occasional appearance of such
a message is not cause for alarm, but if it appears often then the checkpoint control parameters should
be increased. Bulk operations such as large COPY transfers might cause a number of such warnings
to appear if you have not set max_wal_size high enough.
To avoid flooding the I/O system with a burst of page writes, writing dirty buffers during a checkpoint
is spread over a period of time. That period is controlled by checkpoint_completion_target, which is
given as a fraction of the checkpoint interval. The I/O rate is adjusted so that the checkpoint finish-
es when the given fraction of checkpoint_timeout seconds have elapsed, or before max_wal_size is exceeded, whichever is sooner. With the default value of 0.5, PostgreSQL can be expected to complete each checkpoint in about half the time before the next checkpoint starts. On a system
that's very close to maximum I/O throughput during normal operation, you might want to increase
checkpoint_completion_target to reduce the I/O load from checkpoints. The disadvantage
of this is that prolonging checkpoints affects recovery time, because more WAL segments will need
to be kept around for possible use in recovery. Although checkpoint_completion_target
can be set as high as 1.0, it is best to keep it less than that (perhaps 0.9 at most) since checkpoints
include some other activities besides writing dirty buffers. A setting of 1.0 is quite likely to result in
checkpoints not being completed on time, which would result in performance loss due to unexpected
variation in the number of WAL segments needed.
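For instance, on a write-heavy system one might spread checkpoints out with settings like these in postgresql.conf (the values are illustrative only and should be tuned for the workload):
checkpoint_timeout = 15min
max_wal_size = 4GB
checkpoint_completion_target = 0.9
checkpoint_warning = 30s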
On Linux and POSIX platforms, checkpoint_flush_after allows you to ask the OS to flush pages written by the checkpoint to disk after a configurable number of bytes. Otherwise, these pages
may be kept in the OS's page cache, inducing a stall when fsync is issued at the end of a checkpoint.
This setting will often help to reduce transaction latency, but it also can have an adverse effect on
performance; particularly for workloads that are bigger than shared_buffers, but smaller than the OS's
page cache.
The number of WAL segment files in the pg_wal directory depends on min_wal_size, max_wal_size, and the amount of WAL generated in previous checkpoint cycles. When old log segment
files are no longer needed, they are removed or recycled (that is, renamed to become future segments in
the numbered sequence). If, due to a short-term peak of log output rate, max_wal_size is exceeded,
the unneeded segment files will be removed until the system gets back under this limit. Below that
limit, the system recycles enough WAL files to cover the estimated need until the next checkpoint, and
removes the rest. The estimate is based on a moving average of the number of WAL files used in pre-
vious checkpoint cycles. The moving average is increased immediately if the actual usage exceeds the
estimate, so it accommodates peak usage rather than average usage to some extent. min_wal_size
puts a minimum on the amount of WAL files recycled for future usage; that much WAL is always
recycled for future use, even if the system is idle and the WAL usage estimate suggests that little
WAL is needed.
In archive recovery or standby mode, the server periodically performs restartpoints, which are similar
to checkpoints in normal operation: the server forces all its state to disk, updates the pg_control
file to indicate that the already-processed WAL data need not be scanned again, and then recycles
any old log segment files in the pg_wal directory. Restartpoints can't be performed more frequently
than checkpoints in the master because restartpoints can only be performed at checkpoint records.
A restartpoint is triggered when a checkpoint record is reached if at least checkpoint_timeout
seconds have passed since the last restartpoint, or if WAL size is about to exceed max_wal_size.
However, because of limitations on when a restartpoint can be performed, max_wal_size is often
exceeded during recovery, by up to one checkpoint cycle's worth of WAL. (max_wal_size is never
a hard limit anyway, so you should always leave plenty of headroom to avoid running out of disk
space.)
There are two commonly used internal WAL functions: XLogInsertRecord and XLogFlush.
XLogInsertRecord is used to place a new record into the WAL buffers in shared memory. If there
is no space for the new record, XLogInsertRecord will have to write (move to kernel cache) a
few filled WAL buffers. This is undesirable because XLogInsertRecord is used on every data-
base low level modification (for example, row insertion) at a time when an exclusive lock is held on
affected data pages, so the operation needs to be as fast as possible. What is worse, writing WAL
buffers might also force the creation of a new log segment, which takes even more time. Normally,
WAL buffers should be written and flushed by an XLogFlush request, which is made, for the most
part, at transaction commit time to ensure that transaction records are flushed to permanent storage.
On systems with high log output, XLogFlush requests might not occur often enough to prevent
XLogInsertRecord from having to do writes. On such systems one should increase the number
of WAL buffers by modifying the wal_buffers parameter. When full_page_writes is set and the sys-
tem is very busy, setting wal_buffers higher will help smooth response times during the period
immediately following each checkpoint.
The commit_delay parameter defines for how many microseconds a group commit leader process will
sleep after acquiring a lock within XLogFlush, while group commit followers queue up behind the
leader. This delay allows other server processes to add their commit records to the WAL buffers so that
all of them will be flushed by the leader's eventual sync operation. No sleep will occur if fsync is not
enabled, or if fewer than commit_siblings other sessions are currently in active transactions; this avoids
sleeping when it's unlikely that any other session will commit soon. Note that on some platforms,
the resolution of a sleep request is ten milliseconds, so that any nonzero commit_delay setting
between 1 and 10000 microseconds would have the same effect. Note also that on some platforms,
sleep operations may take slightly longer than requested by the parameter.
Since the purpose of commit_delay is to allow the cost of each flush operation to be amortized
across concurrently committing transactions (potentially at the expense of transaction latency), it is
necessary to quantify that cost before the setting can be chosen intelligently. The higher that cost is,
the more effective commit_delay is expected to be in increasing transaction throughput, up to a
point. The pg_test_fsync program can be used to measure the average time in microseconds that a
single WAL flush operation takes. A value of half of the average time the program reports it takes
to flush after a single 8kB write operation is often the most effective setting for commit_delay,
so this value is recommended as the starting point to use when optimizing for a particular workload.
While tuning commit_delay is particularly useful when the WAL log is stored on high-latency
rotating disks, benefits can be significant even on storage media with very fast sync times, such as
solid-state drives or RAID arrays with a battery-backed write cache; but this should definitely be tested
against a representative workload. Higher values of commit_siblings should be used in such
cases, whereas smaller commit_siblings values are often helpful on higher latency media. Note
that it is quite possible that a setting of commit_delay that is too high can increase transaction
latency by so much that total transaction throughput suffers.
When commit_delay is set to zero (the default), it is still possible for a form of group commit
to occur, but each group will consist only of sessions that reach the point where they need to flush
their commit records during the window in which the previous flush operation (if any) is occurring.
At higher client counts a “gangway effect” tends to occur, so that the effects of group commit become
significant even when commit_delay is zero, and thus explicitly setting commit_delay tends to
help less. Setting commit_delay can only help when (1) there are some concurrently committing
transactions, and (2) throughput is limited to some degree by commit rate; but with high rotational
latency this setting can be effective in increasing transaction throughput with as few as two clients
(that is, a single committing client with one sibling transaction).
The wal_sync_method parameter determines how PostgreSQL will ask the kernel to force WAL up-
dates out to disk. All the options should be the same in terms of reliability, with the exception of
fsync_writethrough, which can sometimes force a flush of the disk cache even when other op-
tions do not do so. However, it's quite platform-specific which one will be the fastest. You can test
the speeds of different options using the pg_test_fsync program. Note that this parameter is irrelevant
if fsync has been turned off.
Enabling the wal_debug configuration parameter (provided that PostgreSQL has been compiled with
support for it) will result in each XLogInsertRecord and XLogFlush WAL call being logged
to the server log. This option might be replaced by a more general mechanism in the future.
WAL records are appended to the WAL logs as each new record is written. The insert position is de-
scribed by a Log Sequence Number (LSN) that is a byte offset into the logs, increasing monotonically
with each new record. LSN values are returned as the datatype pg_lsn. Values can be compared to
calculate the volume of WAL data that separates them, so they are used to measure the progress of
replication and recovery.
WAL logs are stored in the directory pg_wal under the data directory, as a set of segment files,
normally each 16 MB in size (but the size can be changed by altering the --wal-segsize initdb
option). Each segment is divided into pages, normally 8 kB each (this size can be changed via the --
with-wal-blocksize configure option). The log record headers are described in access/xlo-
grecord.h; the record content is dependent on the type of event that is being logged. Segment
files are given ever-increasing numbers as names, starting at 000000010000000000000001. The
numbers do not wrap, but it will take a very, very long time to exhaust the available stock of numbers.
It is advantageous if the log is located on a different disk from the main database files. This can be
achieved by moving the pg_wal directory to another location (while the server is shut down, of
course) and creating a symbolic link from the original location in the main data directory to the new
location.
759
Reliability and the Write-Ahead Log
The aim of WAL is to ensure that the log is written before database records are altered, but this can
be subverted by disk drives that falsely report a successful write to the kernel, when in fact they have
only cached the data and not yet stored it on the disk. A power failure in such a situation might lead
to irrecoverable data corruption. Administrators should try to ensure that disks holding PostgreSQL's
WAL log files do not make such false reports. (See Section 30.1.)
After a checkpoint has been made and the log flushed, the checkpoint's position is saved in the file
pg_control. Therefore, at the start of recovery, the server first reads pg_control and then the
checkpoint record; then it performs the REDO operation by scanning forward from the log location
indicated in the checkpoint record. Because the entire content of data pages is saved in the log on
the first page modification after a checkpoint (assuming full_page_writes is not disabled), all pages
changed since the checkpoint will be restored to a consistent state.
To deal with the case where pg_control is corrupt, we should support the possibility of scanning
existing log segments in reverse order — newest to oldest — in order to find the latest checkpoint.
This has not been implemented yet. pg_control is small enough (less than one disk page) that it
is not subject to partial-write problems, and as of this writing there have been no reports of database
failures due solely to the inability to read pg_control itself. So while it is theoretically a weak spot,
pg_control does not seem to be a problem in practice.
760
Chapter 31. Logical Replication
Logical replication is a method of replicating data objects and their changes, based upon their replica-
tion identity (usually a primary key). We use the term logical in contrast to physical replication, which
uses exact block addresses and byte-by-byte replication. PostgreSQL supports both mechanisms con-
currently, see Chapter 26. Logical replication allows fine-grained control over both data replication
and security.
Logical replication uses a publish and subscribe model with one or more subscribers subscribing to one
or more publications on a publisher node. Subscribers pull data from the publications they subscribe to
and may subsequently re-publish data to allow cascading replication or more complex configurations.
Logical replication of a table typically starts with taking a snapshot of the data on the publisher data-
base and copying that to the subscriber. Once that is done, the changes on the publisher are sent to the
subscriber as they occur in real-time. The subscriber applies the data in the same order as the publish-
er so that transactional consistency is guaranteed for publications within a single subscription. This
method of data replication is sometimes referred to as transactional replication.
• Consolidating multiple databases into a single one (for example for analytical purposes).
• Replicating between PostgreSQL instances on different platforms (for example Linux to Windows)
The subscriber database behaves in the same way as any other PostgreSQL instance and can be used
as a publisher for other databases by defining its own publications. When the subscriber is treated as
read-only by application, there will be no conflicts from a single subscription. On the other hand, if
there are other writes done either by an application or by other subscribers to the same set of tables,
conflicts can arise.
31.1. Publication
A publication can be defined on any physical replication master. The node where a publication is
defined is referred to as publisher. A publication is a set of changes generated from a table or a group
of tables, and might also be described as a change set or replication set. Each publication exists in
only one database.
Publications are different from schemas and do not affect how the table is accessed. Each table can
be added to multiple publications if needed. Publications may currently only contain tables. Objects
must be added explicitly, except when a publication is created for ALL TABLES.
Publications can choose to limit the changes they produce to any combination of INSERT, UPDATE,
DELETE, and TRUNCATE, similar to how triggers are fired by particular event types. By default, all
operation types are replicated.
A published table must have a “replica identity” configured in order to be able to replicate UPDATE
and DELETE operations, so that appropriate rows to update or delete can be identified on the subscriber
761
Logical Replication
side. By default, this is the primary key, if there is one. Another unique index (with certain additional
requirements) can also be set to be the replica identity. If the table does not have any suitable key, then
it can be set to replica identity “full”, which means the entire row becomes the key. This, however,
is very inefficient and should only be used as a fallback if no other solution is possible. If a replica
identity other than “full” is set on the publisher side, a replica identity comprising the same or fewer
columns must also be set on the subscriber side. See REPLICA IDENTITY for details on how to
set the replica identity. If a table without a replica identity is added to a publication that replicates
UPDATE or DELETE operations then subsequent UPDATE or DELETE operations will cause an error
on the publisher. INSERT operations can proceed regardless of any replica identity.
A publication is created using the CREATE PUBLICATION command and may later be altered or
dropped using corresponding commands.
The individual tables can be added and removed dynamically using ALTER PUBLICATION. Both the
ADD TABLE and DROP TABLE operations are transactional; so the table will start or stop replicating
at the correct snapshot once the transaction has committed.
31.2. Subscription
A subscription is the downstream side of logical replication. The node where a subscription is defined
is referred to as the subscriber. A subscription defines the connection to another database and set of
publications (one or more) to which it wants to subscribe.
The subscriber database behaves in the same way as any other PostgreSQL instance and can be used
as a publisher for other databases by defining its own publications.
A subscriber node may have multiple subscriptions if desired. It is possible to define multiple sub-
scriptions between a single publisher-subscriber pair, in which case care must be taken to ensure that
the subscribed publication objects don't overlap.
Each subscription will receive changes via one replication slot (see Section 26.2.6). Additional tem-
porary replication slots may be required for the initial data synchronization of pre-existing table data.
A logical replication subscription can be a standby for synchronous replication (see Section 26.2.8).
The standby name is by default the subscription name. An alternative name can be specified as ap-
plication_name in the connection information of the subscription.
Subscriptions are dumped by pg_dump if the current user is a superuser. Otherwise a warning is
written and subscriptions are skipped, because non-superusers cannot read all subscription information
from the pg_subscription catalog.
The subscription is added using CREATE SUBSCRIPTION and can be stopped/resumed at any time
using the ALTER SUBSCRIPTION command and removed using DROP SUBSCRIPTION.
When a subscription is dropped and recreated, the synchronization information is lost. This means
that the data has to be resynchronized afterwards.
The schema definitions are not replicated, and the published tables must exist on the subscriber. Only
regular tables may be the target of replication. For example, you can't replicate to a view.
The tables are matched between the publisher and the subscriber using the fully qualified table name.
Replication to differently-named tables on the subscriber is not supported.
Columns of a table are also matched by name. The order of columns in the subscriber table does not
need to match that of the publisher. The data types of the columns do not need to match, as long as the
text representation of the data can be converted to the target type. For example, you can replicate from
a column of type integer to a column of type bigint. The target table can also have additional
762
Logical Replication
columns not provided by the published table. Any such columns will be filled with the default value
as specified in the definition of the target table.
• When creating a subscription, the replication slot already exists. In that case, the subscription can
be created using the create_slot = false option to associate with the existing slot.
• When creating a subscription, the remote host is not reachable or in an unclear state. In that case,
the subscription can be created using the connect = false option. The remote host will then
not be contacted at all. This is what pg_dump uses. The remote replication slot will then have to be
created manually before the subscription can be activated.
• When dropping a subscription, the replication slot should be kept. This could be useful when the
subscriber database is being moved to a different host and will be activated from there. In that case,
disassociate the slot from the subscription using ALTER SUBSCRIPTION before attempting to
drop the subscription.
• When dropping a subscription, the remote host is not reachable. In that case, disassociate the slot
from the subscription using ALTER SUBSCRIPTION before attempting to drop the subscription.
If the remote database instance no longer exists, no further action is then necessary. If, however, the
remote database instance is just unreachable, the replication slot should then be dropped manually;
otherwise it would continue to reserve WAL and might eventually cause the disk to fill up. Such
cases should be carefully investigated.
31.3. Conflicts
Logical replication behaves similarly to normal DML operations in that the data will be updated even if
it was changed locally on the subscriber node. If incoming data violates any constraints the replication
will stop. This is referred to as a conflict. When replicating UPDATE or DELETE operations, missing
data will not produce a conflict and such operations will simply be skipped.
A conflict will produce an error and will stop the replication; it must be resolved manually by the user.
Details about the conflict can be found in the subscriber's server log.
The resolution can be done either by changing data on the subscriber so that it does not conflict with
the incoming change or by skipping the transaction that conflicts with the existing data. The trans-
action can be skipped by calling the pg_replication_origin_advance() function with a
node_name corresponding to the subscription name, and a position. The current position of origins
can be seen in the pg_replication_origin_status system view.
31.4. Restrictions
Logical replication currently has the following restrictions or missing functionality. These might be
addressed in future releases.
• The database schema and DDL commands are not replicated. The initial schema can be copied by
hand using pg_dump --schema-only. Subsequent schema changes would need to be kept in
sync manually. (Note, however, that there is no need for the schemas to be absolutely the same
on both sides.) Logical replication is robust when schema definitions change in a live database:
When the schema is changed on the publisher and replicated data starts arriving at the subscriber but
763
Logical Replication
does not fit into the table schema, replication will error until the schema is updated. In many cases,
intermittent errors can be avoided by applying additive schema changes to the subscriber first.
• Sequence data is not replicated. The data in serial or identity columns backed by sequences will of
course be replicated as part of the table, but the sequence itself would still show the start value on
the subscriber. If the subscriber is used as a read-only database, then this should typically not be a
problem. If, however, some kind of switchover or failover to the subscriber database is intended,
then the sequences would need to be updated to the latest values, either by copying the current data
from the publisher (perhaps using pg_dump) or by determining a sufficiently high value from the
tables themselves.
• Replication of TRUNCATE commands is supported, but some care must be taken when truncating
groups of tables connected by foreign keys. When replicating a truncate action, the subscriber will
truncate the same group of tables that was truncated on the publisher, either explicitly specified or
implicitly collected via CASCADE, minus tables that are not part of the subscription. This will work
correctly if all affected tables are part of the same subscription. But if some tables to be truncated
on the subscriber have foreign-key links to tables that are not part of the same (or any) subscription,
then the application of the truncate action on the subscriber will fail.
• Large objects (see Chapter 35) are not replicated. There is no workaround for that, other than storing
data in normal tables.
• Replication is only possible from base tables to base tables. That is, the tables on the publication and
on the subscription side must be normal tables, not views, materialized views, partition root tables,
or foreign tables. In the case of partitions, you can therefore replicate a partition hierarchy one-to-
one, but you cannot currently replicate to a differently partitioned setup. Attempts to replicate tables
other than base tables will result in an error.
31.5. Architecture
Logical replication starts by copying a snapshot of the data on the publisher database. Once that is
done, changes on the publisher are sent to the subscriber as they occur in real time. The subscriber
applies data in the order in which commits were made on the publisher so that transactional consistency
is guaranteed for the publications within any single subscription.
Logical replication is built with an architecture similar to physical streaming replication (see Sec-
tion 26.2.5). It is implemented by “walsender” and “apply” processes. The walsender process starts
logical decoding (described in Chapter 49) of the WAL and loads the standard logical decoding plugin
(pgoutput). The plugin transforms the changes read from WAL to the logical replication protocol (see
Section 53.5) and filters the data according to the publication specification. The data is then continu-
ously transferred using the streaming replication protocol to the apply worker, which maps the data to
local tables and applies the individual changes as they are received, in correct transactional order.
The apply process on the subscriber database always runs with session_replication_role
set to replica, which produces the usual effects on triggers and constraints.
The logical replication apply process currently only fires row triggers, not statement triggers. The
initial table synchronization, however, is implemented like a COPY command and thus fires both row
and statement triggers for INSERT.
764
Logical Replication
31.6. Monitoring
Because logical replication is based on a similar architecture as physical streaming replication, the
monitoring on a publication node is similar to monitoring of a physical replication master (see Sec-
tion 26.2.5.2).
Normally, there is a single apply process running for an enabled subscription. A disabled subscription
or a crashed subscription will have zero rows in this view. If the initial data synchronization of any
table is in progress, there will be additional workers for the tables being synchronized.
31.7. Security
A user able to modify the schema of subscriber-side tables can execute arbitrary code as a superuser.
Limit ownership and TRIGGER privilege on such tables to roles that superusers trust. Moreover, if
untrusted users can create tables, use only publications that list tables explicitly. That is to say, create
a subscription FOR ALL TABLES only when superusers trust every user permitted to create a non-
temp table on the publisher or the subscriber.
The role used for the replication connection must have the REPLICATION attribute (or be a supe-
ruser). If the role lacks SUPERUSER and BYPASSRLS, publisher row security policies can execute.
If the role does not trust all table owners, include options=-crow_security=off in the con-
nection string; if a table owner then adds a row security policy, that setting will cause replication to
halt rather than execute the policy. Access for the role must be configured in pg_hba.conf and it
must have the LOGIN attribute.
In order to be able to copy the initial table data, the role used for the replication connection must have
the SELECT privilege on a published table (or be a superuser).
To create a publication, the user must have the CREATE privilege in the database.
To add tables to a publication, the user must have ownership rights on the table. To create a publication
that publishes all tables automatically, the user must be a superuser.
The subscription apply process will run in the local database with the privileges of a superuser.
Privileges are only checked once at the start of a replication connection. They are not re-checked as
each change record is read from the publisher, nor are they re-checked for each change when applied.
The subscriber also requires the max_replication_slots be set to configure how many repli-
cation origins can be tracked. In this case it should be set to at least the number of subscriptions
that will be added to the subscriber. max_logical_replication_workers must be set to at
least the number of subscriptions, again plus some reserve for the table synchronization. Additionally
the max_worker_processes may need to be adjusted to accommodate for replication workers,
765
Logical Replication
wal_level = logical
The other required settings have default values that are sufficient for a basic setup.
pg_hba.conf needs to be adjusted to allow replication (the values here depend on your actual
network configuration and user you want to use for connecting):
The above will start the replication process, which synchronizes the initial table contents of the tables
users and departments and then starts replicating incremental changes to those tables.
766
Chapter 32. Just-in-Time Compilation
(JIT)
This chapter explains what just-in-time compilation is, and how it can be configured in PostgreSQL.
PostgreSQL has builtin support to perform JIT compilation using LLVM1 when PostgreSQL is built
with --with-llvm.
Expression evaluation is used to evaluate WHERE clauses, target lists, aggregates and projections. It
can be accelerated by generating code specific to each case.
Tuple deforming is the process of transforming an on-disk tuple (see Section 69.6.1) into its in-memory
representation. It can be accelerated by creating a function specific to the table layout and the number
of columns to be extracted.
32.1.2. Inlining
PostgreSQL is very extensible and allows new data types, functions, operators and other database
objects to be defined; see Chapter 38. In fact the built-in objects are implemented using nearly the
same mechanisms. This extensibility implies some overhead, for example due to function calls (see
Section 38.3). To reduce that overhead, JIT compilation can inline the bodies of small functions into
the expressions using them. That allows a significant percentage of the overhead to be optimized away.
32.1.3. Optimization
LLVM has support for optimizing generated code. Some of the optimizations are cheap enough to
be performed whenever JIT is used, while others are only beneficial for longer-running queries. See
https://llvm.org/docs/Passes.html#transform-passes for more details about optimizations.
To determine whether JIT compilation should be used, the total estimated cost of a query (see Chap-
ter 71 and Section 19.7.2) is used. The estimated cost of the query will be compared with the setting of
1
https://llvm.org/
767
Just-in-Time Compilation (JIT)
jit_above_cost. If the cost is higher, JIT compilation will be performed. Two further decisions are then
needed. Firstly, if the estimated cost is more than the setting of jit_inline_above_cost, short functions
and operators used in the query will be inlined. Secondly, if the estimated cost is more than the set-
ting of jit_optimize_above_cost, expensive optimizations are applied to improve the generated code.
Each of these options increases the JIT compilation overhead, but can reduce query execution time
considerably.
These cost-based decisions will be made at plan time, not execution time. This means that when pre-
pared statements are in use, and a generic plan is used (see PREPARE), the values of the configuration
parameters in effect at prepare time control the decisions, not the settings at execution time.
Note
If jit is set to off, or if no JIT implementation is available (for example because the server was
compiled without --with-llvm), JIT will not be performed, even if it would be beneficial
based on the above criteria. Setting jit to off has effects at both plan and execution time.
EXPLAIN can be used to see whether JIT is used or not. As an example, here is a query that is not
using JIT:
Given the cost of the plan, it is entirely reasonable that no JIT was used; the cost of JIT would have
been bigger than the potential savings. Adjusting the cost limits will lead to JIT use:
As visible here, JIT was used, but inlining and expensive optimization were not. If jit_in-
line_above_cost or jit_optimize_above_cost were also lowered, that would change.
768
Just-in-Time Compilation (JIT)
32.3. Configuration
The configuration variable jit determines whether JIT compilation is enabled or disabled. If it is en-
abled, the configuration variables jit_above_cost, jit_inline_above_cost, and jit_optimize_above_cost
determine whether JIT compilation is performed for a query, and how much effort is spent doing so.
jit_provider determines which JIT implementation is used. It is rarely required to be changed. See
Section 32.4.2.
For development and debugging purposes a few additional configuration parameters exist, as described
in Section 19.17.
32.4. Extensibility
32.4.1. Inlining Support for Extensions
PostgreSQL's JIT implementation can inline the bodies of functions of types C and internal, as
well as operators based on such functions. To do so for functions in extensions, the definitions of those
functions need to be made available. When using PGXS to build an extension against a server that has
been compiled with LLVM JIT support, the relevant files will be built and installed automatically.
Note
For functions built into PostgreSQL itself, the bitcode is installed into $pkglibdir/bit-
code/postgres.
struct JitProviderCallbacks
{
JitProviderResetAfterErrorCB reset_after_error;
JitProviderReleaseContextCB release_context;
JitProviderCompileExprCB compile_expr;
};
769
Chapter 33. Regression Tests
The regression tests are a comprehensive set of tests for the SQL implementation in PostgreSQL. They
test standard SQL operations as well as the extended capabilities of PostgreSQL.
make check
in the top-level directory. (Or you can change to src/test/regress and run the command there.)
At the end you should see something like:
=======================
All 115 tests passed.
=======================
or otherwise a note about which tests failed. See Section 33.2 below before assuming that a “failure”
represents a serious problem.
Because this test method runs a temporary server, it will not work if you did the build as the root user,
since the server will not start as root. Recommended procedure is not to do the build as root, or else
to perform testing after completing the installation.
If you have configured PostgreSQL to install into a location where an older PostgreSQL installation
already exists, and you perform make check before installing the new version, you might find
that the tests fail because the new programs try to use the already-installed shared libraries. (Typical
symptoms are complaints about undefined symbols.) If you wish to run the tests before overwriting the
old installation, you'll need to build with configure --disable-rpath. It is not recommended
that you use this option for the final installation, however.
The parallel regression test starts quite a few processes under your user ID. Presently, the maximum
concurrency is twenty parallel test scripts, which means forty processes: there's a server process and a
psql process for each test script. So if your system enforces a per-user limit on the number of processes,
make sure this limit is at least fifty or so, else you might get random-seeming failures in the parallel
test. If you are not in a position to raise the limit, you can cut down the degree of parallelism by setting
the MAX_CONNECTIONS parameter. For example:
770
Regression Tests
make installcheck
make installcheck-parallel
The tests will expect to contact the server at the local host and the default port number, unless directed
otherwise by PGHOST and PGPORT environment variables. The tests will be run in a database named
regression; any existing database by this name will be dropped.
The tests will also transiently create some cluster-wide objects, such as roles and tablespaces. These
objects will have names beginning with regress_. Beware of using installcheck mode in
installations that have any actual users or tablespaces named that way.
To run all test suites applicable to the modules that have been selected to be built, including the core
tests, type one of these commands at the top of the build tree:
make check-world
make installcheck-world
These commands run the tests using temporary servers or an already-installed server, respectively, just
as previously explained for make check and make installcheck. Other considerations are
the same as previously explained for each method. Note that make check-world builds a separate
temporary installation tree for each tested module, so it requires a great deal more time and disk space
than make installcheck-world.
Alternatively, you can run individual test suites by typing make check or make installcheck
in the appropriate subdirectory of the build tree. Keep in mind that make installcheck assumes
you've installed the relevant module(s), not only the core server.
• Regression tests for optional procedural languages (other than PL/pgSQL, which is tested by the
core tests). These are located under src/pl.
• Regression tests for contrib modules, located under contrib. Not all contrib modules have
tests.
771
Regression Tests
When using installcheck mode, these tests will destroy any existing databases named pl_re-
gression, contrib_regression, isolation_regression, ecpg1_regression, or
ecpg2_regression, as well as regression.
The TAP-based tests are run only when PostgreSQL was configured with the option --en-
able-tap-tests. This is recommended for development, but can be omitted if there is no suitable
Perl installation.
Some test suites are not run by default, either because they are not secure to run on a multiuser system
or because they require special software. You can decide which test suites to run additionally by setting
the make or environment variable PG_TEST_EXTRA to a whitespace-separated list, for example:
kerberos
Runs the test suite under src/test/kerberos. This requires an MIT Kerberos installation
and opens TCP/IP listen sockets.
ldap
Runs the test suite under src/test/ldap. This requires an OpenLDAP installation and opens
TCP/IP listen sockets.
ssl
Runs the test suite under src/test/ssl. This opens TCP/IP listen sockets.
Tests for features that are not supported by the current build configuration are not run even if they are
mentioned in PG_TEST_EXTRA.
For implementation reasons, setting LC_ALL does not work for this purpose; all the other locale-re-
lated environment variables do work.
When testing against an existing installation, the locale is determined by the existing database cluster
and cannot be set separately for the test run.
You can also choose the database encoding explicitly by setting the variable ENCODING, for example:
Setting the database encoding this way typically only makes sense if the locale is C; otherwise the
encoding is chosen automatically from the locale, and specifying an encoding that does not match the
locale will result in an error.
The database encoding can be set for tests against either a temporary or an existing installation, though
in the latter case it must be compatible with the installation's locale.
772
Regression Tests
To run the Hot Standby tests, first create a database called regression on the primary:
Now arrange for the default database connection to be to the standby server under test (for example,
by setting the PGHOST and PGPORT environment variables). Finally, run make standbycheck
in the regression directory:
cd src/test/regress
make standbycheck
Some extreme behaviors can also be generated on the primary using the script src/test/
regress/sql/hs_primary_extremes.sql to allow the behavior of the standby to be tested.
773
Regression Tests
reported as “failed”, always examine the differences between expected and actual results; you might
find that the differences are not significant. Nonetheless, we still strive to maintain accurate reference
files across all supported platforms, so it can be expected that all tests pass.
The actual outputs of the regression tests are in files in the src/test/regress/results direc-
tory. The test script uses diff to compare each output file against the reference outputs stored in
the src/test/regress/expected directory. Any differences are saved for your inspection in
src/test/regress/regression.diffs. (When running a test suite other than the core tests,
these files of course appear in the relevant subdirectory, not src/test/regress.)
If you don't like the diff options that are used by default, set the environment variable PG_RE-
GRESS_DIFF_OPTS, for instance PG_REGRESS_DIFF_OPTS='-u'. (Or you can run diff
yourself, if you prefer.)
If for some reason a particular platform generates a “failure” for a given test, but inspection of the
output convinces you that the result is valid, you can add a new comparison file to silence the failure
report in future test runs. See Section 33.3 for details.
To run the tests in a different locale when using the temporary-installation method, pass the appropriate
locale-related environment variables on the make command line, for example:
(The regression test driver unsets LC_ALL, so it does not work to choose the locale using that variable.)
To use no locale, either unset all locale-related environment variables (or set them to C) or use the
following special invocation:
When running the tests against an existing installation, the locale setup is determined by the existing
installation. To change it, initialize the database cluster with a different locale by passing the appro-
priate options to initdb.
In general, it is advisable to try to run the regression tests in the locale setup that is wanted for pro-
duction use, as this will exercise the locale- and encoding-related code portions that will actually be
used in production. Depending on the operating system environment, you might get failures, but then
you will at least know what locale-specific behaviors to expect when running real applications.
774
Regression Tests
tests are not run with that time zone setting. The regression test driver sets environment variable PGTZ
to PST8PDT, which normally ensures proper results.
Some systems display minus zero as -0, while others just show 0.
Some systems signal errors from pow() and exp() differently from the mechanism expected by the
current PostgreSQL code.
Therefore, if you see an ordering difference, it's not something to worry about, unless the query does
have an ORDER BY that your result is violating. However, please report it anyway, so that we can add
an ORDER BY to that particular query to eliminate the bogus “failure” in future releases.
You might wonder why we don't order all the regression test queries explicitly to get rid of this issue
once and for all. The reason is that that would make the regression tests less useful, not more, since
they'd tend to exercise query plan types that produce ordered results to the exclusion of those that don't.
On platforms supporting getrlimit(), the server should automatically choose a safe value of
max_stack_depth; so unless you've manually overridden this setting, a failure of this kind is a
reportable bug.
should produce only one or a few lines of differences. You need not worry unless the random test
fails repeatedly.
775
Regression Tests
The first mechanism allows comparison files to be selected for specific platforms. There is a mapping
file, src/test/regress/resultmap, that defines which comparison file to use for each plat-
form. To eliminate bogus test “failures” for a particular platform, you first choose or make a variant
result file, and then add a line to the resultmap file.
testname:output:platformpattern=comparisonfilename
The test name is just the name of the particular regression test module. The output value indicates
which output file to check. For the standard regression tests, this is always out. The value corresponds
to the file extension of the output file. The platform pattern is a pattern in the style of the Unix tool
expr (that is, a regular expression with an implicit ^ anchor at the start). It is matched against the
platform name as printed by config.guess. The comparison file name is the base name of the
substitute result comparison file.
For example: some systems interpret very small floating-point values as zero, rather than reporting an
underflow error. This causes a few differences in the float8 regression test. Therefore, we provide a
variant comparison file, float8-small-is-zero.out, which includes the results to be expect-
ed on these systems. To silence the bogus “failure” message on OpenBSD platforms, resultmap
includes:
float8:out:i.86-.*-openbsd=float8-small-is-zero.out
which will trigger on any machine where the output of config.guess matches i.86-.*-
openbsd. Other lines in resultmap select the variant comparison file for other platforms where
it's appropriate.
The second selection mechanism for variant comparison files is much more automatic: it simply uses
the “best match” among several supplied comparison files. The regression test driver script consid-
ers both the standard comparison file for a test, testname.out, and variant files named test-
name_digit.out (where the digit is any single digit 0-9). If any such file is an exact match,
the test is considered to pass; otherwise, the one that generates the shortest diff is used to create the
failure report. (If resultmap includes an entry for the particular test, then the base testname is
the substitute name given in resultmap.)
For example, for the char test, the comparison file char.out contains results that are expected
in the C and POSIX locales, while the file char_1.out contains results sorted as they appear in
many other locales.
The best-match mechanism was devised to cope with locale-dependent results, but it can be used in any
situation where the test results cannot be predicted easily from the platform name alone. A limitation
of this mechanism is that the test driver cannot tell which variant is actually “correct” for the current
776
Regression Tests
environment; it will just pick the variant that seems to work best. Therefore it is safest to use this
mechanism only for variant results that you are willing to consider equally valid in all contexts.
The make variable PROVE_TESTS can be used to define a whitespace-separated list of paths relative
to the Makefile invoking prove to run the specified subset of tests instead of the default t/*.pl.
For example:
The TAP tests require the Perl module IPC::Run. This module is available from CPAN or an
operating system package. They also require PostgreSQL to be configured with the option --en-
able-tap-tests.
Then point your HTML browser to coverage/index.html. The make commands also work in
subdirectories.
If you don't have lcov or prefer text output over an HTML report, you can also run
make coverage
instead of make coverage-html, which will produce .gcov output files for each source file
relevant to the test. (make coverage and make coverage-html will overwrite each other's
files, so mixing them might be confusing.)
make coverage-clean
777
Part IV. Client Interfaces
This part describes the client programming interfaces distributed with PostgreSQL. Each of these chapters can be
read independently. Note that there are many other programming interfaces for client programs that are distributed
separately and contain their own documentation (Appendix H lists some of the more popular ones). Readers of
this part should be familiar with using SQL commands to manipulate and query the database (see Part II) and of
course with the programming language that the interface uses.
Table of Contents
34. libpq - C Library .................................................................................................. 783
34.1. Database Connection Control Functions ......................................................... 783
34.1.1. Connection Strings ........................................................................... 790
34.1.2. Parameter Key Words ...................................................................... 792
34.2. Connection Status Functions ........................................................................ 796
34.3. Command Execution Functions .................................................................... 802
34.3.1. Main Functions ............................................................................... 802
34.3.2. Retrieving Query Result Information ................................................... 810
34.3.3. Retrieving Other Result Information .................................................... 814
34.3.4. Escaping Strings for Inclusion in SQL Commands ................................. 815
34.4. Asynchronous Command Processing .............................................................. 818
34.5. Retrieving Query Results Row-By-Row ......................................................... 822
34.6. Canceling Queries in Progress ...................................................................... 823
34.7. The Fast-Path Interface ............................................................................... 824
34.8. Asynchronous Notification ........................................................................... 825
34.9. Functions Associated with the COPY Command ............................................... 826
34.9.1. Functions for Sending COPY Data ...................................................... 827
34.9.2. Functions for Receiving COPY Data .................................................... 827
34.9.3. Obsolete Functions for COPY ............................................................. 828
34.10. Control Functions ..................................................................................... 830
34.11. Miscellaneous Functions ............................................................................ 832
34.12. Notice Processing ..................................................................................... 835
34.13. Event System ........................................................................................... 836
34.13.1. Event Types .................................................................................. 836
34.13.2. Event Callback Procedure ................................................................ 838
34.13.3. Event Support Functions ................................................................. 839
34.13.4. Event Example .............................................................................. 840
34.14. Environment Variables .............................................................................. 842
34.15. The Password File .................................................................................... 844
34.16. The Connection Service File ....................................................................... 844
34.17. LDAP Lookup of Connection Parameters ..................................................... 845
34.18. SSL Support ............................................................................................ 846
34.18.1. Client Verification of Server Certificates ............................................ 846
34.18.2. Client Certificates .......................................................................... 847
34.18.3. Protection Provided in Different Modes ............................................. 847
34.18.4. SSL Client File Usage .................................................................... 849
34.18.5. SSL Library Initialization ................................................................ 849
34.19. Behavior in Threaded Programs .................................................................. 850
34.20. Building libpq Programs ............................................................................ 850
34.21. Example Programs .................................................................................... 852
35. Large Objects ...................................................................................................... 863
35.1. Introduction ............................................................................................... 863
35.2. Implementation Features .............................................................................. 863
35.3. Client Interfaces ......................................................................................... 863
35.3.1. Creating a Large Object .................................................................... 864
35.3.2. Importing a Large Object .................................................................. 864
35.3.3. Exporting a Large Object .................................................................. 865
35.3.4. Opening an Existing Large Object ...................................................... 865
35.3.5. Writing Data to a Large Object .......................................................... 865
35.3.6. Reading Data from a Large Object ...................................................... 866
35.3.7. Seeking in a Large Object ................................................................. 866
35.3.8. Obtaining the Seek Position of a Large Object ...................................... 866
35.3.9. Truncating a Large Object ................................................................. 867
35.3.10. Closing a Large Object Descriptor .................................................... 867
35.3.11. Removing a Large Object ................................................................ 867
779
Client Interfaces
780
Client Interfaces
781
Client Interfaces
782
Chapter 34. libpq - C Library
libpq is the C application programmer's interface to PostgreSQL. libpq is a set of library functions
that allow client programs to pass queries to the PostgreSQL backend server and to receive the results
of these queries.
libpq is also the underlying engine for several other PostgreSQL application interfaces, including those
written for C++, Perl, Python, Tcl and ECPG. So some aspects of libpq's behavior will be important
to you if you use one of those packages. In particular, Section 34.14, Section 34.15 and Section 34.18
describe behavior that is visible to the user of any application that uses libpq.
Some short programs are included at the end of this chapter (Section 34.21) to show how to write
programs that use libpq. There are also several complete examples of libpq applications in the directory
src/test/examples in the source code distribution.
Client programs that use libpq must include the header file libpq-fe.h and must link with the
libpq library.
Warning
If untrusted users have access to a database that has not adopted a secure schema usage pattern,
begin each session by removing publicly-writable schemas from search_path. One can
set parameter key word options to value -csearch_path=. Alternately, one can issue
PQexec(conn, "SELECT pg_catalog.set_config('search_path', '',
false)") after connecting. This consideration is not specific to libpq; it applies to every
interface for executing arbitrary SQL commands.
Warning
On Unix, forking a process with open libpq connections can lead to unpredictable results be-
cause the parent and child processes share the same sockets and operating system resources.
For this reason, such usage is not recommended, though doing an exec from the child process
to load a new executable is safe.
PQconnectdbParams
783
libpq - C Library
This function opens a new database connection using the parameters taken from two NULL-ter-
minated arrays. The first, keywords, is defined as an array of strings, each one being a key
word. The second, values, gives the value for each key word. Unlike PQsetdbLogin below,
the parameter set can be extended without changing the function signature, so use of this function
(or its nonblocking analogs PQconnectStartParams and PQconnectPoll) is preferred
for new application programming.
The currently recognized parameter key words are listed in Section 34.1.2.
The passed arrays can be empty to use all default parameters, or can contain one or more parameter
settings. They must be matched in length. Processing will stop at the first NULL entry in the
keywords array. Also, if the values entry associated with a non-NULL keywords entry is
NULL or an empty string, that entry is ignored and processing continues with the next pair of
array entries.
When expand_dbname is non-zero, the value for the first dbname key word is checked to see if
it is a connection string. If so, it is “expanded” into the individual connection parameters extracted
from the string. The value is considered to be a connection string, rather than just a database
name, if it contains an equal sign (=) or it begins with a URI scheme designator. (More details
on connection string formats appear in Section 34.1.1.) Only the first occurrence of dbname is
treated in this way; any subsequent dbname parameter is processed as a plain database name.
In general the parameter arrays are processed from start to end. If any key word is repeated,
the last value (that is not NULL or empty) is used. This rule applies in particular when a key
word found in a connection string conflicts with one appearing in the keywords array. Thus,
the programmer may determine whether array entries can override or be overridden by values
taken from a connection string. Array entries appearing before an expanded dbname entry can
be overridden by fields of the connection string, and in turn those fields are overridden by array
entries appearing after dbname (but, again, only if those entries supply non-empty values).
After processing all the array entries and any expanded connection string, any connection para-
meters that remain unset are filled with default values. If an unset parameter's corresponding en-
vironment variable (see Section 34.14) is set, its value is used. If the environment variable is not
set either, then the parameter's built-in default value is used.
PQconnectdb
This function opens a new database connection using the parameters taken from the string con-
ninfo.
The passed string can be empty to use all default parameters, or it can contain one or more para-
meter settings separated by whitespace, or it can contain a URI. See Section 34.1.1 for details.
PQsetdbLogin
784
libpq - C Library
This is the predecessor of PQconnectdb with a fixed set of parameters. It has the same func-
tionality except that the missing parameters will always take on default values. Write NULL or an
empty string for any one of the fixed parameters that is to be defaulted.
If the dbName contains an = sign or has a valid connection URI prefix, it is taken as a conninfo
string in exactly the same way as if it had been passed to PQconnectdb, and the remaining
parameters are then applied as specified for PQconnectdbParams.
PQsetdb
This is a macro that calls PQsetdbLogin with null pointers for the login and pwd parameters.
It is provided for backward compatibility with very old programs.
PQconnectStartParams
PQconnectStart
PQconnectPoll
These three functions are used to open a connection to a database server such that your applica-
tion's thread of execution is not blocked on remote I/O whilst doing so. The point of this approach
is that the waits for I/O to complete can occur in the application's main loop, rather than down
inside PQconnectdbParams or PQconnectdb, and so the application can manage this op-
eration in parallel with other activities.
With PQconnectStartParams, the database connection is made using the parameters taken
from the keywords and values arrays, and controlled by expand_dbname, as described
above for PQconnectdbParams.
With PQconnectStart, the database connection is made using the parameters taken from the
string conninfo as described above for PQconnectdb.
• The hostaddr parameter must be used appropriately to prevent DNS queries from being
made. See the documentation of this parameter in Section 34.1.2 for details.
• If you call PQtrace, ensure that the stream object into which you trace will not block.
785
libpq - C Library
• You must ensure that the socket is in the appropriate state before calling PQconnectPoll,
as described below.
At any time during connection, the status of the connection can be checked by calling PQsta-
tus. If this call returns CONNECTION_BAD, then the connection procedure has failed; if the
call returns CONNECTION_OK, then the connection is ready. Both of these states are equally
detectable from the return value of PQconnectPoll, described above. Other states might also
occur during (and only during) an asynchronous connection procedure. These indicate the current
stage of the connection procedure and might be useful to provide feedback to the user for exam-
ple. These statuses are:
CONNECTION_STARTED
CONNECTION_MADE
CONNECTION_AWAITING_RESPONSE
CONNECTION_AUTH_OK
CONNECTION_SSL_STARTUP
CONNECTION_SETENV
CONNECTION_CHECK_WRITABLE
CONNECTION_CONSUME
786
libpq - C Library
Note that, although these constants will remain (in order to maintain compatibility), an application
should never rely upon these occurring in a particular order, or at all, or on the status always being
one of these documented values. An application might do something like this:
switch(PQstatus(conn))
{
case CONNECTION_STARTED:
feedback = "Connecting...";
break;
case CONNECTION_MADE:
feedback = "Connected to server...";
break;
.
.
.
default:
feedback = "Connecting...";
}
PQconndefaults
PQconninfoOption *PQconndefaults(void);
typedef struct
{
char *keyword; /* The keyword of the option */
char *envvar; /* Fallback environment variable name */
char *compiled; /* Fallback compiled in default value */
char *val; /* Option's current value, or NULL */
char *label; /* Label for field in connect dialog */
char *dispchar; /* Indicates how to display this field
in a connect dialog. Values are:
"" Display entered value as is
"*" Password field - hide value
"D" Debug option - don't show by
default */
int dispsize; /* Field size in characters for dialog */
} PQconninfoOption;
Returns a connection options array. This can be used to determine all possible PQconnectdb
options and their current default values. The return value points to an array of PQconninfoOp-
tion structures, which ends with an entry having a null keyword pointer. The null pointer
is returned if memory could not be allocated. Note that the current default values (val fields)
will depend on environment variables and other context. A missing or invalid service file will be
silently ignored. Callers must treat the connection options data as read-only.
787
libpq - C Library
After processing the options array, free it by passing it to PQconninfoFree. If this is not done,
a small amount of memory is leaked for each call to PQconndefaults.
PQconninfo
Returns a connection options array. This can be used to determine all possible PQconnectdb
options and the values that were used to connect to the server. The return value points to an array
of PQconninfoOption structures, which ends with an entry having a null keyword pointer.
All notes above for PQconndefaults also apply to the result of PQconninfo.
PQconninfoParse
Parses a connection string and returns the resulting options as an array; or returns NULL if there
is a problem with the connection string. This function can be used to extract the PQconnectdb
options in the provided connection string. The return value points to an array of PQconnin-
foOption structures, which ends with an entry having a null keyword pointer.
All legal options will be present in the result array, but the PQconninfoOption for any option
not present in the connection string will have val set to NULL; default values are not inserted.
If errmsg is not NULL, then *errmsg is set to NULL on success, else to a malloc'd error
string explaining the problem. (It is also possible for *errmsg to be set to NULL and the function
to return NULL; this indicates an out-of-memory condition.)
After processing the options array, free it by passing it to PQconninfoFree. If this is not done,
some memory is leaked for each call to PQconninfoParse. Conversely, if an error occurs and
errmsg is not NULL, be sure to free the error string using PQfreemem.
PQfinish
Closes the connection to the server. Also frees memory used by the PGconn object.
Note that even if the server connection attempt fails (as indicated by PQstatus), the application
should call PQfinish to free the memory used by the PGconn object. The PGconn pointer
must not be used again after PQfinish has been called.
PQreset
This function will close the connection to the server and attempt to establish a new connection,
using all the same parameters previously used. This might be useful for error recovery if a working
connection is lost.
788
libpq - C Library
PQresetStart
PQresetPoll
These functions will close the connection to the server and attempt to establish a new connection,
using all the same parameters previously used. This can be useful for error recovery if a working
connection is lost. They differ from PQreset (above) in that they act in a nonblocking manner.
These functions suffer from the same restrictions as PQconnectStartParams, PQconnec-
tStart and PQconnectPoll.
To initiate a connection reset, call PQresetStart. If it returns 0, the reset has failed. If it
returns 1, poll the reset using PQresetPoll in exactly the same way as you would create the
connection using PQconnectPoll.
PQpingParams
PQpingParams reports the status of the server. It accepts connection parameters identical to
those of PQconnectdbParams, described above. It is not necessary to supply correct user
name, password, or database name values to obtain the server status; however, if incorrect values
are provided, the server will log a failed connection attempt.
PQPING_OK
PQPING_REJECT
The server is running but is in a state that disallows connections (startup, shutdown, or crash
recovery).
PQPING_NO_RESPONSE
The server could not be contacted. This might indicate that the server is not running, or that
there is something wrong with the given connection parameters (for example, wrong port
number), or that there is a network connectivity problem (for example, a firewall blocking
the connection request).
PQPING_NO_ATTEMPT
No attempt was made to contact the server, because the supplied parameters were obviously
incorrect or there was some client-side problem (for example, out of memory).
PQping
PQping reports the status of the server. It accepts connection parameters identical to those of
PQconnectdb, described above. It is not necessary to supply correct user name, password, or
database name values to obtain the server status; however, if incorrect values are provided, the
server will log a failed connection attempt.
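For example, a small program along these lines could probe a server before attempting a full connection; the connection string is only illustrative.
#include <stdio.h>
#include <libpq-fe.h>

int
main(void)
{
    PGPing rc = PQping("host=localhost port=5432 connect_timeout=3");

    switch (rc)
    {
        case PQPING_OK:
            printf("server is running and accepting connections\n");
            break;
        case PQPING_REJECT:
            printf("server is running but rejecting connections\n");
            break;
        case PQPING_NO_RESPONSE:
            printf("server could not be contacted\n");
            break;
        case PQPING_NO_ATTEMPT:
            printf("no attempt made (bad parameters or client-side problem)\n");
            break;
    }
    return 0;
}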
Connection strings can be written either as keyword/value pairs or as URIs. Example of the keyword/value
form:
host=localhost port=5432 dbname=mydb connect_timeout=10
The general form for a connection URI is:
postgresql://[userspec@][hostspec][/dbname][?paramspec]
where userspec is:
user[:password]
and hostspec is:
[host][:port][,...]
and paramspec is:
name=value[&...]
The URI scheme designator can be either postgresql:// or postgres://. Each of the remain-
ing URI parts is optional. The following examples illustrate valid URI syntax:
postgresql://
postgresql://localhost
postgresql://localhost:5433
postgresql://localhost/mydb
postgresql://user@localhost
postgresql://user:secret@localhost
postgresql://other@localhost/otherdb?
connect_timeout=10&application_name=myapp
postgresql://host1:123,host2:456/somedb?
target_session_attrs=any&application_name=myapp
Values that would normally appear in the hierarchical part of the URI can alternatively be given as
named parameters. For example:
postgresql:///mydb?host=localhost&port=5433
All named parameters must match key words listed in Section 34.1.2, except that for compatibility
with JDBC connection URIs, instances of ssl=true are translated into sslmode=require.
Percent-encoding may be used to include symbols with special meaning in any of the URI parts, e.g.,
replace = with %3D.
The host part may be either a host name or an IP address. To specify an IPv6 address, enclose it in
square brackets:
postgresql://[2001:db8::1234]/database
The host part is interpreted as described for the parameter host. In particular, a Unix-domain socket
connection is chosen if the host part is either empty or looks like an absolute path name, otherwise a
TCP/IP connection is initiated. Note, however, that the slash is a reserved character in the hierarchical
part of the URI. So, to specify a non-standard Unix-domain socket directory, either omit the host part
of the URI and specify the host as a named parameter, or percent-encode the path in the host part of
the URI:
postgresql:///dbname?host=/var/lib/postgresql
postgresql://%2Fvar%2Flib%2Fpostgresql/dbname
It is possible to specify multiple host components, each with an optional port component, in a sin-
gle URI. A URI of the form postgresql://host1:port1,host2:port2,host3:port3/
is equivalent to a connection string of the form host=host1,host2,host3
port=port1,port2,port3. As further described below, each host will be tried in turn until a
connection is successfully established.
In the connection URI format, you can list multiple host:port pairs separated by commas in the
host component of the URI.
In either format, a single host name can translate to multiple network addresses. A common example
of this is a host that has both an IPv4 and an IPv6 address.
When multiple hosts are specified, or when a single host name is translated to multiple addresses, all
the hosts and addresses will be tried in order, until one succeeds. If none of the hosts can be reached,
the connection fails. If a connection is established successfully, but authentication fails, the remaining
hosts in the list are not tried.
If a password file is used, you can have different passwords for different hosts. All the other connection
options are the same for every host in the list; it is not possible to, for example, specify different
usernames for different hosts.
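As an illustration, code along the following lines could be used to connect with a multi-host URI; the host names, user, and database are placeholders.
#include <stdio.h>
#include <libpq-fe.h>

int
main(void)
{
    /* each host is tried in turn until a connection is established */
    PGconn *conn = PQconnectdb(
        "postgresql://myuser@db1.example.com:5432,db2.example.com:5433/mydb"
        "?connect_timeout=5&target_session_attrs=any");

    if (PQstatus(conn) != CONNECTION_OK)
        fprintf(stderr, "connection failed: %s", PQerrorMessage(conn));
    else
        printf("connected to %s port %s\n", PQhost(conn), PQport(conn));

    PQfinish(conn);
    return 0;
}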
host
Name of host to connect to. If a host name begins with a slash, it specifies Unix-domain commu-
nication rather than TCP/IP communication; the value is the name of the directory in which the
socket file is stored. The default behavior when host is not specified, or is empty, is to connect
to a Unix-domain socket in /tmp (or whatever socket directory was specified when PostgreSQL
was built). On machines without Unix-domain sockets, the default is to connect to localhost.
A comma-separated list of host names is also accepted, in which case each host name in the list
is tried in order; an empty item in the list selects the default behavior as explained above. See
Section 34.1.1.3 for details.
hostaddr
Numeric IP address of host to connect to. This should be in the standard IPv4 address format,
e.g., 172.28.40.9. If your machine supports IPv6, you can also use those addresses. TCP/IP
communication is always used when a nonempty string is specified for this parameter.
Using hostaddr instead of host allows the application to avoid a host name look-up, which
might be important in applications with time constraints. However, a host name is required for
GSSAPI or SSPI authentication methods, as well as for verify-full SSL certificate verifica-
tion. The following rules are used:
• If host is specified without hostaddr, a host name lookup occurs. (When using PQcon-
nectPoll, the lookup occurs when PQconnectPoll first considers this host name, and it
may cause PQconnectPoll to block for a significant amount of time.)
• If hostaddr is specified without host, the value for hostaddr gives the server network
address. The connection attempt will fail if the authentication method requires a host name.
• If both host and hostaddr are specified, the value for hostaddr gives the server network
address. The value for host is ignored unless the authentication method requires it, in which
case it will be used as the host name.
Note that authentication is likely to fail if host is not the name of the server at network address
hostaddr. Also, when both host and hostaddr are specified, host is used to identify the
connection in a password file (see Section 34.15).
A comma-separated list of hostaddr values is also accepted, in which case each host in the list
is tried in order. An empty item in the list causes the corresponding host name to be used, or the
default host name if that is empty as well. See Section 34.1.1.3 for details.
Without either a host name or host address, libpq will connect using a local Unix-domain socket;
or on machines without Unix-domain sockets, it will attempt to connect to localhost.
port
Port number to connect to at the server host, or socket file name extension for Unix-domain con-
nections. If multiple hosts were given in the host or hostaddr parameters, this parameter may
specify a comma-separated list of ports of the same length as the host list, or it may specify a sin-
gle port number to be used for all hosts. An empty string, or an empty item in a comma-separated
list, specifies the default port number established when PostgreSQL was built.
dbname
The database name. Defaults to be the same as the user name. In certain contexts, the value is
checked for extended formats; see Section 34.1.1 for more details on those.
user
PostgreSQL user name to connect as. Defaults to be the same as the operating system name of
the user running the application.
password
Password to be used if the server demands password authentication.
passfile
Specifies the name of the file used to store passwords (see Section 34.15). Defaults to ~/.pg-
pass, or %APPDATA%\postgresql\pgpass.conf on Microsoft Windows. (No error is
reported if this file does not exist.)
connect_timeout
Maximum wait for connection, in seconds (write as a decimal integer, e.g., 10). Zero, negative,
or not specified means wait indefinitely. The minimum allowed timeout is 2 seconds, therefore
a value of 1 is interpreted as 2. This timeout applies separately to each host name or IP address.
For example, if you specify two hosts and connect_timeout is 5, each host will time out if
no connection is made within 5 seconds, so the total time spent waiting for a connection might
be up to 10 seconds.
client_encoding
This sets the client_encoding configuration parameter for this connection. In addition to the
values accepted by the corresponding server option, you can use auto to determine the right en-
coding from the current locale in the client (LC_CTYPE environment variable on Unix systems).
options
Specifies command-line options to send to the server at connection start. For example, setting
this to -c geqo=off sets the session's value of the geqo parameter to off. Spaces within
this string are considered to separate command-line arguments, unless escaped with a backslash
(\); write \\ to represent a literal backslash. For a detailed discussion of the available options,
consult Chapter 19.
application_name
Specifies a value for the application_name configuration parameter.
fallback_application_name
Specifies a fallback value for the application_name configuration parameter. This value will be
used if no value has been given for application_name via a connection parameter or the
PGAPPNAME environment variable. Specifying a fallback name is useful in generic utility pro-
grams that wish to set a default application name but allow it to be overridden by the user.
keepalives
Controls whether client-side TCP keepalives are used. The default value is 1, meaning on, but
you can change this to 0, meaning off, if keepalives are not wanted. This parameter is ignored for
connections made via a Unix-domain socket.
keepalives_idle
Controls the number of seconds of inactivity after which TCP should send a keepalive message
to the server. A value of zero uses the system default. This parameter is ignored for connections
made via a Unix-domain socket, or if keepalives are disabled. It is only supported on systems
where TCP_KEEPIDLE or an equivalent socket option is available, and on Windows; on other
systems, it has no effect.
keepalives_interval
Controls the number of seconds after which a TCP keepalive message that is not acknowledged
by the server should be retransmitted. A value of zero uses the system default. This parameter is
ignored for connections made via a Unix-domain socket, or if keepalives are disabled. It is only
supported on systems where TCP_KEEPINTVL or an equivalent socket option is available, and
on Windows; on other systems, it has no effect.
keepalives_count
Controls the number of TCP keepalives that can be lost before the client's connection to the server
is considered dead. A value of zero uses the system default. This parameter is ignored for con-
nections made via a Unix-domain socket, or if keepalives are disabled. It is only supported on
systems where TCP_KEEPCNT or an equivalent socket option is available; on other systems, it
has no effect.
tty
Ignored (formerly, this specified where to send server debug output).
replication
This option determines whether the connection should use the replication protocol instead of
the normal protocol. This is what PostgreSQL replication connections as well as tools such as
pg_basebackup use internally, but it can also be used by third-party applications. For a description
of the replication protocol, consult Section 53.4.
database
The connection goes into logical replication mode, connecting to the database specified in
the dbname parameter.
In physical or logical replication mode, only the simple query protocol can be used.
sslmode
This option determines whether or with what priority a secure SSL TCP/IP connection will be
negotiated with the server. There are six modes:
disable
only try a non-SSL connection
allow
first try a non-SSL connection; if that fails, try an SSL connection
prefer (default)
first try an SSL connection; if that fails, try a non-SSL connection
require
only try an SSL connection. If a root CA file is present, verify the certificate in the same way
as if verify-ca was specified
verify-ca
only try an SSL connection, and verify that the server certificate is issued by a trusted cer-
tificate authority (CA)
verify-full
only try an SSL connection, verify that the server certificate is issued by a trusted CA and
that the requested server host name matches that in the certificate
See Section 34.18 for a detailed description of how these options work.
sslmode is ignored for Unix domain socket communication. If PostgreSQL is compiled without
SSL support, using options require, verify-ca, or verify-full will cause an error,
while options allow and prefer will be accepted but libpq will not actually attempt an SSL
connection.
requiressl
If set to 1, an SSL connection to the server is required (this is equivalent to sslmode require).
libpq will then refuse to connect if the server does not accept an SSL connection. If set to 0 (de-
fault), libpq will negotiate the connection type with the server (equivalent to sslmode prefer).
This option is only available if PostgreSQL is compiled with SSL support.
sslcompression
If set to 1, data sent over SSL connections will be compressed. If set to 0, compression will be
disabled. The default is 0. This parameter is ignored if a connection without SSL is made.
SSL compression is nowadays considered insecure and its use is no longer recommended.
OpenSSL 1.1.0 disables compression by default, and many operating system distributions disable
it in prior versions as well, so setting this parameter to on will not have any effect if the server
does not accept compression. On the other hand, OpenSSL before 1.0.0 does not support disabling
compression, so this parameter is ignored with those versions, and whether compression is used
depends on the server.
If security is not a primary concern, compression can improve throughput if the network is the bot-
tleneck. Disabling compression can improve response time and throughput if CPU performance
is the limiting factor.
sslcert
This parameter specifies the file name of the client SSL certificate, replacing the default
~/.postgresql/postgresql.crt. This parameter is ignored if an SSL connection is not
made.
sslkey
This parameter specifies the location for the secret key used for the client certificate. It can
either specify a file name that will be used instead of the default ~/.postgresql/post-
gresql.key, or it can specify a key obtained from an external “engine” (engines are OpenSSL
loadable modules). An external engine specification should consist of a colon-separated engine
name and an engine-specific key identifier. This parameter is ignored if an SSL connection is
not made.
sslrootcert
This parameter specifies the name of a file containing SSL certificate authority (CA) certificate(s).
If the file exists, the server's certificate will be verified to be signed by one of these authorities.
The default is ~/.postgresql/root.crt.
sslcrl
This parameter specifies the file name of the SSL server certificate revocation list (CRL). Certifi-
cates listed in this file, if it exists, will be rejected while attempting to authenticate the server's
certificate. The default is ~/.postgresql/root.crl.
requirepeer
This parameter specifies the operating-system user name of the server, for example re-
quirepeer=postgres. When making a Unix-domain socket connection, if this parameter is
set, the client checks at the beginning of the connection that the server process is running under
the specified user name; if it is not, the connection is aborted with an error. This parameter can
be used to provide server authentication similar to that available with SSL certificates on TCP/
IP connections. (Note that if the Unix-domain socket is in /tmp or another publicly writable
location, any user could start a server listening there. Use this parameter to ensure that you are
connected to a server run by a trusted user.) This option is only supported on platforms for which
the peer authentication method is implemented; see Section 20.9.
krbsrvname
Kerberos service name to use when authenticating with GSSAPI. This must match the service
name specified in the server configuration for Kerberos authentication to succeed. (See also Sec-
tion 20.6.)
gsslib
GSS library to use for GSSAPI authentication. Currently this is disregarded except on Windows
builds that include both GSSAPI and SSPI support. In that case, set this to gssapi to cause libpq
to use the GSSAPI library for authentication instead of the default SSPI.
service
Service name to use for additional parameters. It specifies a service name in pg_ser-
vice.conf that holds additional connection parameters. This allows applications to specify
only a service name so connection parameters can be centrally maintained. See Section 34.16.
target_session_attrs
If this parameter is set to read-write, only a connection in which read-write transactions are
accepted by default is considered acceptable. The query SHOW transaction_read_only
will be sent upon any successful connection; if it returns on, the connection will be closed. If
multiple hosts were specified in the connection string, any remaining servers will be tried just
as if the connection attempt had failed. The default value of this parameter, any, regards all
connections as acceptable.
Tip
libpq application programmers should be careful to maintain the PGconn abstraction. Use
the accessor functions described below to get at the contents of PGconn. Reference to internal
PGconn fields using libpq-int.h is not recommended because they are subject to change
in the future.
The following functions return parameter values established at connection. These values are fixed for
the life of the connection. If a multi-host connection string is used, the values of PQhost, PQport,
and PQpass can change if a new connection is established using the same PGconn object. Other
values are fixed for the lifetime of the PGconn object.
PQdb
Returns the database name of the connection.
PQuser
Returns the user name of the connection.
PQpass
Returns the password of the connection.
PQpass will return either the password specified in the connection parameters, or if there was
none and the password was obtained from the password file, it will return that. In the latter case, if
multiple hosts were specified in the connection parameters, it is not possible to rely on the result
of PQpass until the connection is established. The status of the connection can be checked using
the function PQstatus.
PQhost
Returns the server host name of the active connection. This can be a host name, an IP address, or
a directory path if the connection is via Unix socket. (The path case can be distinguished because
it will always be an absolute path, beginning with /.)
If the connection parameters specified both host and hostaddr, then PQhost will return the
host information. If only hostaddr was specified, then that is returned. If multiple hosts were
specified in the connection parameters, PQhost returns the host actually connected to.
PQhost returns NULL if the conn argument is NULL. Otherwise, if there is an error producing
the host information (perhaps if the connection has not been fully established or there was an
error), it returns an empty string.
If multiple hosts were specified in the connection parameters, it is not possible to rely on the result
of PQhost until the connection is established. The status of the connection can be checked using
the function PQstatus.
PQport
Returns the port of the active connection.
If multiple ports were specified in the connection parameters, PQport returns the port actually
connected to.
PQport returns NULL if the conn argument is NULL. Otherwise, if there is an error producing
the port information (perhaps if the connection has not been fully established or there was an
error), it returns an empty string.
If multiple ports were specified in the connection parameters, it is not possible to rely on the result
of PQport until the connection is established. The status of the connection can be checked using
the function PQstatus.
PQtty
Returns the debug TTY of the connection. (This is obsolete, since the server no longer pays at-
tention to the TTY setting, but the function remains for backward compatibility.)
PQoptions
Returns the command-line options passed in the connection request.
The following functions return status data that can change as operations are executed on the PGconn
object.
PQstatus
Returns the status of the connection.
The status can be one of a number of values. However, only two of these are seen outside of an
asynchronous connection procedure: CONNECTION_OK and CONNECTION_BAD. A good con-
nection to the database has the status CONNECTION_OK. A failed connection attempt is signaled
by status CONNECTION_BAD. Ordinarily, an OK status will remain so until PQfinish, but a
communications failure might result in the status changing to CONNECTION_BAD prematurely.
In that case the application could try to recover by calling PQreset.
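For example, an application could attempt such a recovery with code along these lines; this is only a sketch, and a real application would normally limit and log its retry attempts.
#include <libpq-fe.h>

/*
 * Check the connection and try to recover after a communications failure.
 * Returns 1 if the connection is usable afterwards, 0 otherwise.
 */
static int
ensure_connection(PGconn *conn)
{
    if (PQstatus(conn) == CONNECTION_OK)
        return 1;

    /* close the bad connection and retry with the same parameters */
    PQreset(conn);

    return PQstatus(conn) == CONNECTION_OK;
}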
PQtransactionStatus
Returns the current in-transaction status of the server.
PQparameterStatus
Looks up a current parameter setting of the server.
Certain parameter values are reported by the server automatically at connection startup or when-
ever their values change. PQparameterStatus can be used to interrogate these settings. It
returns the current value of a parameter if known, or NULL if the parameter is not known.
Pre-3.0-protocol servers do not report parameter settings, but libpq includes logic to obtain val-
ues for server_version and client_encoding anyway. Applications are encouraged
to use PQparameterStatus rather than ad hoc code to determine these values. (Beware
however that on a pre-3.0 connection, changing client_encoding via SET after connection
startup will not be reflected by PQparameterStatus.) For server_version, see also
PQserverVersion, which returns the information in a numeric form that is much easier to
compare against.
Although the returned pointer is declared const, it in fact points to mutable storage associated
with the PGconn structure. It is unwise to assume the pointer will remain valid across queries.
PQprotocolVersion
Interrogates the frontend/backend protocol being used.
Applications might wish to use this function to determine whether certain features are supported.
Currently, the possible values are 2 (2.0 protocol), 3 (3.0 protocol), or zero (connection bad). The
protocol version will not change after connection startup is complete, but it could theoretically
change during a connection reset. The 3.0 protocol will normally be used when communicating
with PostgreSQL 7.4 or later servers; pre-7.4 servers support only protocol 2.0. (Protocol 1.0 is
obsolete and not supported by libpq.)
PQserverVersion
Returns an integer representing the server version.
Applications might use this function to determine the version of the database server they are
connected to. The result is formed by multiplying the server's major version number by 10000
and adding the minor version number. For example, version 10.1 will be returned as 100001, and
version 11.0 will be returned as 110000. Zero is returned if the connection is bad.
Prior to major version 10, PostgreSQL used three-part version numbers in which the first two
parts together represented the major version. For those versions, PQserverVersion uses two
digits for each part; for example version 9.1.5 will be returned as 90105, and version 9.2.0 will
be returned as 90200.
Therefore, for purposes of determining feature compatibility, applications should divide the result
of PQserverVersion by 100 not 10000 to determine a logical major version number. In all
release series, only the last two digits differ between minor releases (bug-fix releases).
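For example, a version check along the following lines could be used; the thresholds tested are arbitrary.
#include <stdio.h>
#include <libpq-fe.h>

static void
report_server_version(PGconn *conn)
{
    int v = PQserverVersion(conn);      /* e.g., 110005 for 11.5, 90624 for 9.6.24 */

    if (v == 0)
    {
        fprintf(stderr, "bad connection\n");
        return;
    }

    /* divide by 100, not 10000, to obtain the logical major version */
    printf("logical major version %d, minor release %d\n", v / 100, v % 100);

    if (v >= 110000)
        printf("server is version 11 or later\n");
}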
PQerrorMessage
Returns the error message most recently generated by an operation on the connection.
Nearly all libpq functions will set a message for PQerrorMessage if they fail. Note that by
libpq convention, a nonempty PQerrorMessage result can consist of multiple lines, and will
include a trailing newline. The caller should not free the result directly. It will be freed when the
associated PGconn handle is passed to PQfinish. The result string should not be expected to
remain the same across operations on the PGconn structure.
PQsocket
Obtains the file descriptor number of the connection socket to the server. A valid descriptor will
be greater than or equal to 0; a result of -1 indicates that no server connection is currently open.
(This will not change during normal operation, but could change during connection setup or reset.)
PQbackendPID
Returns the process ID (PID) of the backend process handling this connection.
The backend PID is useful for debugging purposes and for comparison to NOTIFY messages
(which include the PID of the notifying backend process). Note that the PID belongs to a process
executing on the database server host, not the local host!
PQconnectionNeedsPassword
Returns true (1) if the connection authentication method required a password, but none was avail-
able. Returns false (0) if not.
This function can be applied after a failed connection attempt to decide whether to prompt the
user for a password.
PQconnectionUsedPassword
Returns true (1) if the connection authentication method used a password. Returns false (0) if not.
This function can be applied after either a failed or successful connection attempt to detect whether
the server demanded a password.
The following functions return information related to SSL. This information usually doesn't change
after a connection is established.
PQsslInUse
Returns true (1) if the connection uses SSL, false (0) if not.
PQsslAttribute
Returns SSL-related information about the connection.
The list of available attributes varies depending on the SSL library being used, and the type of
connection. If an attribute is not available, returns NULL.
library
Name of the SSL implementation in use. (Currently, only "OpenSSL" is implemented.)
protocol
SSL/TLS version in use. Common values are "TLSv1", "TLSv1.1" and "TLSv1.2",
but an implementation may return other strings if some other protocol is used.
key_bits
Number of key bits used by the encryption algorithm.
cipher
A short name of the ciphersuite used, e.g., "DHE-RSA-DES-CBC3-SHA". The names are
specific to each SSL implementation.
compression
If SSL compression is in use, returns the name of the compression algorithm, or "on" if
compression is used but the algorithm is not known. If compression is not in use, returns "off".
PQsslAttributeNames
Returns an array of available SSL attribute names. The array is terminated by a NULL pointer.
PQsslStruct
Returns a pointer to an SSL-implementation-specific object describing the connection.
The struct(s) available depend on the SSL implementation in use. For OpenSSL, there is one
struct, available under the name "OpenSSL", and it returns a pointer to the OpenSSL SSL struct.
To use this function, code along the following lines could be used:
#include <libpq-fe.h>
#include <openssl/ssl.h>
...
SSL *ssl;
dbconn = PQconnectdb(...);
...
ssl = PQsslStruct(dbconn, "OpenSSL");
if (ssl)
{
    /* use OpenSSL functions to inspect ssl */
}
This structure can be used to verify encryption levels, check server certificates, and more. Refer
to the OpenSSL documentation for information about this structure.
PQgetssl
Returns the SSL structure used in the connection, or null if SSL is not in use.
PQexec
Submits a command to the server and waits for the result.
Returns a PGresult pointer or possibly a null pointer. A non-null pointer will generally be returned
except in out-of-memory conditions or serious errors such as inability to send the command
to the server. The PQresultStatus function should be called to check the return value for any
errors (including the value of a null pointer, in which case it will return PGRES_FATAL_ERROR).
Use PQerrorMessage to get more information about such errors.
The command string can include multiple SQL commands (separated by semicolons). Multiple queries
sent in a single PQexec call are processed in a single transaction, unless there are explicit BEGIN/
COMMIT commands included in the query string to divide it into multiple transactions. (See Sec-
tion 53.2.2.1 for more details about how the server handles multi-query strings.) Note however that the
returned PGresult structure describes only the result of the last command executed from the string.
Should one of the commands fail, processing of the string stops with it and the returned PGresult
describes the error condition.
PQexecParams
Submits a command to the server and waits for the result, with the ability to pass parameters
separately from the SQL command text.
PQexecParams is like PQexec, but offers additional functionality: parameter values can be
specified separately from the command string proper, and query results can be requested in either
text or binary format. PQexecParams is supported only in protocol 3.0 and later connections;
it will fail when using protocol 2.0.
conn
The connection object to send the command through.
command
The SQL command string to be executed. If parameters are used, they are referred to in the
command string as $1, $2, etc.
nParams
The number of parameters supplied; it is the length of the arrays paramTypes[], paramValues[],
paramLengths[], and paramFormats[]. (The array pointers can be NULL when nParams
is zero.)
paramTypes[]
Specifies, by OID, the data types to be assigned to the parameter symbols. If paramTypes
is NULL, or any particular element in the array is zero, the server infers a data type for the
parameter symbol in the same way it would do for an untyped literal string.
paramValues[]
Specifies the actual values of the parameters. A null pointer in this array means the corre-
sponding parameter is null; otherwise the pointer points to a zero-terminated text string (for
text format) or binary data in the format expected by the server (for binary format).
paramLengths[]
Specifies the actual data lengths of binary-format parameters. It is ignored for null parame-
ters and text-format parameters. The array pointer can be null when there are no binary pa-
rameters.
paramFormats[]
Specifies whether parameters are text (put a zero in the array entry for the corresponding
parameter) or binary (put a one in the array entry for the corresponding parameter). If the
array pointer is null then all parameters are presumed to be text strings.
Values passed in binary format require knowledge of the internal representation expect-
ed by the backend. For example, integers must be passed in network byte order. Pass-
ing numeric values requires knowledge of the server storage format, as implemented
in src/backend/utils/adt/numeric.c::numeric_send() and src/back-
end/utils/adt/numeric.c::numeric_recv().
resultFormat
Specify zero to obtain results in text format, or one to obtain results in binary format. (There
is not currently a provision to obtain different result columns in different formats, although
that is possible in the underlying protocol.)
The primary advantage of PQexecParams over PQexec is that parameter values can be separated
from the command string, thus avoiding the need for tedious and error-prone quoting and escaping.
Unlike PQexec, PQexecParams allows at most one SQL command in the given string. (There can
be semicolons in it, but not more than one nonempty command.) This is a limitation of the underlying
protocol, but has some usefulness as an extra defense against SQL-injection attacks.
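A minimal sketch of passing a single text-format parameter could look like the following; the table and column names are hypothetical.
#include <stdio.h>
#include <libpq-fe.h>

static void
lookup_employee(PGconn *conn, const char *name)
{
    /* one text-format parameter, results requested in text format */
    const char *paramValues[1] = { name };

    PGresult *res = PQexecParams(conn,
                                 "SELECT id, salary FROM employees WHERE name = $1",
                                 1,        /* nParams */
                                 NULL,     /* let the server infer the parameter type */
                                 paramValues,
                                 NULL,     /* paramLengths: not needed for text */
                                 NULL,     /* paramFormats: all text */
                                 0);       /* resultFormat: text */

    if (PQresultStatus(res) != PGRES_TUPLES_OK)
        fprintf(stderr, "query failed: %s", PQerrorMessage(conn));
    else
        for (int i = 0; i < PQntuples(res); i++)
            printf("%s  %s\n", PQgetvalue(res, i, 0), PQgetvalue(res, i, 1));

    PQclear(res);
}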
Tip
Specifying parameter types via OIDs is tedious, particularly if you prefer not to hard-wire
particular OID values into your program. However, you can avoid doing so even in cases
where the server by itself cannot determine the type of the parameter, or chooses a different
type than you want. In the SQL command text, attach an explicit cast to the parameter symbol
to show what data type you will send. For example:
SELECT * FROM mytable WHERE x = $1::bigint;
This forces parameter $1 to be treated as bigint; by default it would be assigned the same type as x.
PQprepare
Submits a request to create a prepared statement with the given parameters, and waits for com-
pletion.
PQprepare creates a prepared statement for later execution with PQexecPrepared. This
feature allows commands to be executed repeatedly without being parsed and planned each time;
see PREPARE for details. PQprepare is supported only in protocol 3.0 and later connections;
it will fail when using protocol 2.0.
The function creates a prepared statement named stmtName from the query string, which must
contain a single SQL command. stmtName can be "" to create an unnamed statement, in which
case any pre-existing unnamed statement is automatically replaced; otherwise it is an error if the
statement name is already defined in the current session. If any parameters are used, they are
referred to in the query as $1, $2, etc. nParams is the number of parameters for which types are
pre-specified in the array paramTypes[]. (The array pointer can be NULL when nParams is
zero.) paramTypes[] specifies, by OID, the data types to be assigned to the parameter symbols.
If paramTypes is NULL, or any particular element in the array is zero, the server assigns a data
type to the parameter symbol in the same way it would do for an untyped literal string. Also,
the query can use parameter symbols with numbers higher than nParams; data types will be
inferred for these symbols as well. (See PQdescribePrepared for a means to find out what
data types were inferred.)
As with PQexec, the result is normally a PGresult object whose contents indicate server-side
success or failure. A null result indicates out-of-memory or inability to send the command at all.
Use PQerrorMessage to get more information about such errors.
Prepared statements for use with PQexecPrepared can also be created by executing SQL PRE-
PARE statements. Also, although there is no libpq function for deleting a prepared statement, the SQL
DEALLOCATE statement can be used for that purpose.
PQexecPrepared
Sends a request to execute a prepared statement with given parameters, and waits for the result.
The parameters are identical to PQexecParams, except that the name of a prepared statement
is given instead of a query string, and the paramTypes[] parameter is not present (it is not
needed since the prepared statement's parameter types were determined when it was created).
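A sketch of preparing a statement once and executing it repeatedly could look like this; the statement name, table, and data are hypothetical.
#include <stdio.h>
#include <libpq-fe.h>

static void
insert_items(PGconn *conn, const char *values[], int nvalues)
{
    /* prepare once; the parameter type is inferred by the server */
    PGresult *res = PQprepare(conn, "ins_item",
                              "INSERT INTO items (name) VALUES ($1)",
                              1, NULL);
    if (PQresultStatus(res) != PGRES_COMMAND_OK)
        fprintf(stderr, "prepare failed: %s", PQerrorMessage(conn));
    PQclear(res);

    /* execute many times without re-parsing and re-planning */
    for (int i = 0; i < nvalues; i++)
    {
        const char *paramValues[1] = { values[i] };

        res = PQexecPrepared(conn, "ins_item", 1,
                             paramValues, NULL, NULL, 0);
        if (PQresultStatus(res) != PGRES_COMMAND_OK)
            fprintf(stderr, "insert failed: %s", PQerrorMessage(conn));
        PQclear(res);
    }
}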
PQdescribePrepared
Submits a request to obtain information about the specified prepared statement, and waits for
completion.
stmtName can be "" or NULL to reference the unnamed statement, otherwise it must be the
name of an existing prepared statement. On success, a PGresult with status PGRES_COM-
MAND_OK is returned. The functions PQnparams and PQparamtype can be applied to this
PGresult to obtain information about the parameters of the prepared statement, and the func-
tions PQnfields, PQfname, PQftype, etc provide information about the result columns (if
any) of the statement.
PQdescribePortal
Submits a request to obtain information about the specified portal, and waits for completion.
portalName can be "" or NULL to reference the unnamed portal, otherwise it must be the name
of an existing portal. On success, a PGresult with status PGRES_COMMAND_OK is returned.
The functions PQnfields, PQfname, PQftype, etc can be applied to the PGresult to obtain
information about the result columns (if any) of the portal.
The PGresult structure encapsulates the result returned by the server. libpq application program-
mers should be careful to maintain the PGresult abstraction. Use the accessor functions below to
get at the contents of PGresult. Avoid directly referencing the fields of the PGresult structure
because they are subject to change in the future.
PQresultStatus
Returns the result status of the command. The result status can be one of the following values:
PGRES_EMPTY_QUERY
The string sent to the server was empty.
PGRES_COMMAND_OK
Successful completion of a command returning no data.
PGRES_TUPLES_OK
Successful completion of a command returning data (such as a SELECT or SHOW).
PGRES_COPY_OUT
Copy Out (from server) data transfer started.
PGRES_COPY_IN
Copy In (to server) data transfer started.
PGRES_BAD_RESPONSE
The server's response was not understood.
PGRES_NONFATAL_ERROR
A nonfatal error (a notice or warning) occurred.
PGRES_FATAL_ERROR
A fatal error occurred.
PGRES_COPY_BOTH
Copy In/Out (to and from server) data transfer started. This feature is currently used only for
streaming replication, so this status should not occur in ordinary applications.
PGRES_SINGLE_TUPLE
The PGresult contains a single result tuple from the current command. This status occurs
only when single-row mode has been selected for the query (see Section 34.5).
PQresStatus
Converts the enumerated type returned by PQresultStatus into a string constant describing
the status code. The caller should not free the result.
PQresultErrorMessage
Returns the error message associated with the command, or an empty string if there was no error.
If there was an error, the returned string will include a trailing newline. The caller should not free
the result directly. It will be freed when the associated PGresult handle is passed to PQclear.
PQresultVerboseErrorMessage
Returns a reformatted version of the error message associated with a PGresult object.
char *PQresultVerboseErrorMessage(const PGresult *res,
                                  PGVerbosity verbosity,
                                  PGContextVisibility show_context);
In some situations a client might wish to obtain a more detailed version of a previously-reported
error. PQresultVerboseErrorMessage addresses this need by computing the message that
would have been produced by PQresultErrorMessage if the specified verbosity settings
had been in effect for the connection when the given PGresult was generated. If the PGresult
is not an error result, “PGresult is not an error result” is reported instead. The returned string
includes a trailing newline.
Unlike most other functions for extracting data from a PGresult, the result of this function
is a freshly allocated string. The caller must free it using PQfreemem() when the string is no
longer needed.
PQresultErrorField
Returns an individual field of an error report.
fieldcode is an error field identifier; see the symbols listed below. NULL is returned if the
PGresult is not an error or warning result, or does not include the specified field. Field values
will normally not include a trailing newline. The caller should not free the result directly. It will
be freed when the associated PGresult handle is passed to PQclear.
PG_DIAG_SEVERITY
The severity; the field contents are ERROR, FATAL, or PANIC (in an error message), or
WARNING, NOTICE, DEBUG, INFO, or LOG (in a notice message), or a localized translation
of one of these. Always present.
PG_DIAG_SEVERITY_NONLOCALIZED
The severity; the field contents are ERROR, FATAL, or PANIC (in an error message), or
WARNING, NOTICE, DEBUG, INFO, or LOG (in a notice message). This is identical to the
PG_DIAG_SEVERITY field except that the contents are never localized. This is present only
in reports generated by PostgreSQL versions 9.6 and later.
PG_DIAG_SQLSTATE
The SQLSTATE code for the error. The SQLSTATE code identifies the type of error that has
occurred; it can be used by front-end applications to perform specific operations (such as error
handling) in response to a particular database error. For a list of the possible SQLSTATE
codes, see Appendix A. This field is not localizable, and is always present.
PG_DIAG_MESSAGE_PRIMARY
The primary human-readable error message (typically one line). Always present.
PG_DIAG_MESSAGE_DETAIL
Detail: an optional secondary error message carrying more detail about the problem. Might
run to multiple lines.
PG_DIAG_MESSAGE_HINT
Hint: an optional suggestion what to do about the problem. This is intended to differ from
detail in that it offers advice (potentially inappropriate) rather than hard facts. Might run to
multiple lines.
PG_DIAG_STATEMENT_POSITION
A string containing a decimal integer indicating an error cursor position as an index into
the original statement string. The first character has index 1, and positions are measured in
characters not bytes.
PG_DIAG_INTERNAL_POSITION
This is defined the same as the PG_DIAG_STATEMENT_POSITION field, but it is used when
the cursor position refers to an internally generated command rather than the one submitted
by the client. The PG_DIAG_INTERNAL_QUERY field will always appear when this field
appears.
PG_DIAG_INTERNAL_QUERY
The text of a failed internally-generated command. This could be, for example, a SQL query
issued by a PL/pgSQL function.
PG_DIAG_CONTEXT
An indication of the context in which the error occurred. Presently this includes a call stack
traceback of active procedural language functions and internally-generated queries. The trace
is one entry per line, most recent first.
PG_DIAG_SCHEMA_NAME
If the error was associated with a specific database object, the name of the schema containing
that object, if any.
PG_DIAG_TABLE_NAME
If the error was associated with a specific table, the name of the table. (Refer to the schema
name field for the name of the table's schema.)
PG_DIAG_COLUMN_NAME
If the error was associated with a specific table column, the name of the column. (Refer to
the schema and table name fields to identify the table.)
PG_DIAG_DATATYPE_NAME
If the error was associated with a specific data type, the name of the data type. (Refer to the
schema name field for the name of the data type's schema.)
PG_DIAG_CONSTRAINT_NAME
If the error was associated with a specific constraint, the name of the constraint. Refer to
fields listed above for the associated table or domain. (For this purpose, indexes are treated
as constraints, even if they weren't created with constraint syntax.)
PG_DIAG_SOURCE_FILE
The file name of the source-code location where the error was reported.
PG_DIAG_SOURCE_LINE
The line number of the source-code location where the error was reported.
PG_DIAG_SOURCE_FUNCTION
The name of the source-code function reporting the error.
Note
The fields for schema name, table name, column name, data type name, and constraint
name are supplied only for a limited number of error types; see Appendix A. Do not
assume that the presence of any of these fields guarantees the presence of another field.
Core error sources observe the interrelationships noted above, but user-defined functions
may use these fields in other ways. In the same vein, do not assume that these fields denote
contemporary objects in the current database.
The client is responsible for formatting displayed information to meet its needs; in particular it
should break long lines as needed. Newline characters appearing in the error message fields should
be treated as paragraph breaks, not line breaks.
Errors generated internally by libpq will have severity and primary message, but typically no other
fields. Errors returned by a pre-3.0-protocol server will include severity and primary message,
and sometimes a detail message, but no other fields.
Note that error fields are only available from PGresult objects, not PGconn objects; there is
no PQerrorField function.
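For example, a client could use PQresultErrorField to react to a specific SQLSTATE along the following lines; the SQLSTATE tested for is unique_violation.
#include <string.h>
#include <libpq-fe.h>

/*
 * Returns 1 if the failed result reports a unique-constraint violation
 * (SQLSTATE 23505), 0 otherwise.
 */
static int
is_unique_violation(const PGresult *res)
{
    const char *sqlstate = PQresultErrorField(res, PG_DIAG_SQLSTATE);

    return sqlstate != NULL && strcmp(sqlstate, "23505") == 0;
}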
PQclear
Frees the storage associated with a PGresult. Every command result should be freed via PQ-
clear when it is no longer needed.
You can keep a PGresult object around for as long as you need it; it does not go away when
you issue a new command, nor even if you close the connection. To get rid of it, you must call
PQclear. Failure to do this will result in memory leaks in your application.
PQntuples
Returns the number of rows (tuples) in the query result. (Note that PGresult objects are limited
to no more than INT_MAX rows, so an int result is sufficient.)
PQnfields
Returns the number of columns (fields) in each row of the query result.
PQfname
Returns the column name associated with the given column number. Column numbers start at
0. The caller should not free the result directly. It will be freed when the associated PGresult
handle is passed to PQclear.
PQfnumber
Returns the column number associated with the given column name.
The given name is treated like an identifier in an SQL command, that is, it is downcased unless
double-quoted. For example, given a query result generated from the SQL command:
SELECT 1 AS FOO, 2 AS "BAR";
we would have the results:
PQfname(res, 0)              foo
PQfname(res, 1)              BAR
PQfnumber(res, "FOO")        0
PQfnumber(res, "foo")        0
PQfnumber(res, "BAR")        -1
PQfnumber(res, "\"BAR\"")    1
PQftable
Returns the OID of the table from which the given column was fetched. Column numbers start at 0.
InvalidOid is returned if the column number is out of range, or if the specified column is not
a simple reference to a table column, or when using pre-3.0 protocol. You can query the system
table pg_class to determine exactly which table is referenced.
The type Oid and the constant InvalidOid will be defined when you include the libpq header
file. They will both be some integer type.
PQftablecol
Returns the column number (within its table) of the column making up the specified query result
column. Query-result column numbers start at 0, but table columns have nonzero numbers.
Zero is returned if the column number is out of range, or if the specified column is not a simple
reference to a table column, or when using pre-3.0 protocol.
PQfformat
Returns the format code indicating the format of the given column. Column numbers start at 0.
Format code zero indicates textual data representation, while format code one indicates binary
representation. (Other codes are reserved for future definition.)
PQftype
Returns the data type associated with the given column number. The integer returned is the internal
OID number of the type. Column numbers start at 0.
You can query the system table pg_type to obtain the names and properties of the various data
types. The OIDs of the built-in data types are defined in the file include/server/cata-
log/pg_type_d.h in the install directory.
PQfmod
Returns the type modifier of the column associated with the given column number. Column num-
bers start at 0.
The interpretation of modifier values is type-specific; they typically indicate precision or size
limits. The value -1 is used to indicate “no information available”. Most data types do not use
modifiers, in which case the value is always -1.
PQfsize
Returns the size in bytes of the column associated with the given column number. Column num-
bers start at 0.
PQfsize returns the space allocated for this column in a database row, in other words the size
of the server's internal representation of the data type. (Accordingly, it is not really very useful to
clients.) A negative value indicates the data type is variable-length.
PQbinaryTuples
Returns 1 if the PGresult contains binary data and 0 if it contains text data.
This function is deprecated (except for its use in connection with COPY), because it is possible for
a single PGresult to contain text data in some columns and binary data in others. PQfformat
is preferred. PQbinaryTuples returns 1 only if all columns of the result are binary (format 1).
PQgetvalue
Returns a single field value of one row of a PGresult. Row and column numbers start at 0. The
caller should not free the result directly. It will be freed when the associated PGresult handle
is passed to PQclear.
For data in text format, the value returned by PQgetvalue is a null-terminated character string
representation of the field value. For data in binary format, the value is in the binary representation
determined by the data type's typsend and typreceive functions. (The value is actually
followed by a zero byte in this case too, but that is not ordinarily useful, since the value is likely
to contain embedded nulls.)
An empty string is returned if the field value is null. See PQgetisnull to distinguish null values
from empty-string values.
The pointer returned by PQgetvalue points to storage that is part of the PGresult structure.
One should not modify the data it points to, and one must explicitly copy the data into other
storage if it is to be used past the lifetime of the PGresult structure itself.
PQgetisnull
Tests a field for a null value. Row and column numbers start at 0.
This function returns 1 if the field is null and 0 if it contains a non-null value. (Note that PQget-
value will return an empty string, not a null pointer, for a null field.)
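Putting these functions together, a sketch that prints an arbitrary text-format result could look as follows:
#include <stdio.h>
#include <libpq-fe.h>

static void
print_result(const PGresult *res)
{
    int nrows = PQntuples(res);
    int ncols = PQnfields(res);

    /* column headers */
    for (int j = 0; j < ncols; j++)
        printf("%-15s", PQfname(res, j));
    printf("\n");

    /* one line per row; print NULL for null fields */
    for (int i = 0; i < nrows; i++)
    {
        for (int j = 0; j < ncols; j++)
            printf("%-15s", PQgetisnull(res, i, j) ? "NULL" : PQgetvalue(res, i, j));
        printf("\n");
    }
}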
PQgetlength
Returns the actual length of a field value in bytes. Row and column numbers start at 0.
This is the actual data length for the particular data value, that is, the size of the object pointed to
by PQgetvalue. For text data format this is the same as strlen(). For binary format this is
essential information. Note that one should not rely on PQfsize to obtain the actual data length.
PQnparams
Returns the number of parameters of a prepared statement.
This function is only useful when inspecting the result of PQdescribePrepared. For other
types of queries it will return zero.
PQparamtype
Returns the data type of the indicated statement parameter. Parameter numbers start at 0.
This function is only useful when inspecting the result of PQdescribePrepared. For other
types of queries it will return zero.
PQprint
Prints out all the rows and, optionally, the column names to the specified output stream.
This function was formerly used by psql to print query results, but this is no longer the case. Note
that it assumes all the data is in text format.
PQcmdStatus
Returns the command status tag from the SQL command that generated the PGresult.
Commonly this is just the name of the command, but it might include additional data such as the
number of rows processed. The caller should not free the result directly. It will be freed when the
associated PGresult handle is passed to PQclear.
PQcmdTuples
Returns the number of rows affected by the SQL command.
This function returns a string containing the number of rows affected by the SQL statement that
generated the PGresult. This function can only be used following the execution of a SELECT,
CREATE TABLE AS, INSERT, UPDATE, DELETE, MOVE, FETCH, or COPY statement, or an
EXECUTE of a prepared query that contains an INSERT, UPDATE, or DELETE statement. If the
command that generated the PGresult was anything else, PQcmdTuples returns an empty
string. The caller should not free the return value directly. It will be freed when the associated
PGresult handle is passed to PQclear.
PQoidValue
Returns the OID of the inserted row, if the SQL command was an INSERT that inserted exactly
one row into a table that has OIDs, or an EXECUTE of a prepared query containing a suitable
INSERT statement. Otherwise, this function returns InvalidOid. This function will also return
InvalidOid if the table affected by the INSERT statement does not contain OIDs.
PQoidStatus
This function is deprecated in favor of PQoidValue and is not thread-safe. It returns a string
with the OID of the inserted row, while PQoidValue returns the OID value.
PQescapeLiteral
PQescapeLiteral escapes a string for use within an SQL command. This is useful when
inserting data values as literal constants in SQL commands. Certain characters (such as quotes
and backslashes) must be escaped to prevent them from being interpreted specially by the SQL
parser. PQescapeLiteral performs this operation.
PQescapeLiteral returns an escaped version of the str parameter in memory allocated with
malloc(). This memory should be freed using PQfreemem() when the result is no longer
needed. A terminating zero byte is not required, and should not be counted in length. (If a
terminating zero byte is found before length bytes are processed, PQescapeLiteral stops
at the zero; the behavior is thus rather like strncpy.) The return string has all special characters
replaced so that they can be properly processed by the PostgreSQL string literal parser. A termi-
nating zero byte is also added. The single quotes that must surround PostgreSQL string literals
are included in the result string.
On error, PQescapeLiteral returns NULL and a suitable message is stored in the conn object.
Tip
It is especially important to do proper escaping when handling strings that were received
from an untrustworthy source. Otherwise there is a security risk: you are vulnerable to
“SQL injection” attacks wherein unwanted SQL commands are fed to your database.
Note that it is neither necessary nor correct to do escaping when a data value is passed as a separate
parameter in PQexecParams or its sibling routines.
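As an illustration, a command could be built with PQescapeLiteral along these lines; the table name is hypothetical and the fixed-size buffer is used only for brevity.
#include <stdio.h>
#include <string.h>
#include <libpq-fe.h>

static void
insert_comment(PGconn *conn, const char *untrusted_text)
{
    /* the escaped result already includes the surrounding single quotes */
    char *lit = PQescapeLiteral(conn, untrusted_text, strlen(untrusted_text));

    if (lit == NULL)
    {
        fprintf(stderr, "escaping failed: %s", PQerrorMessage(conn));
        return;
    }

    char query[1024];                    /* fixed size for brevity only */
    snprintf(query, sizeof(query), "INSERT INTO comments (body) VALUES (%s)", lit);
    PQfreemem(lit);                      /* free the malloc'd escaped string */

    PGresult *res = PQexec(conn, query);
    if (PQresultStatus(res) != PGRES_COMMAND_OK)
        fprintf(stderr, "insert failed: %s", PQerrorMessage(conn));
    PQclear(res);
}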
PQescapeIdentifier
PQescapeIdentifier escapes a string for use as an SQL identifier, such as a table, column,
or function name. This is useful when a user-supplied identifier might contain special characters
that would otherwise not be interpreted as part of the identifier by the SQL parser, or when the
identifier might contain upper case characters whose case should be preserved.
On error, PQescapeIdentifier returns NULL and a suitable message is stored in the conn
object.
Tip
As with string literals, to prevent SQL injection attacks, SQL identifiers must be escaped
when they are received from an untrustworthy source.
PQescapeStringConn
PQescapeStringConn escapes string literals, much like PQescapeLiteral. Unlike PQescapeLit-
eral, the caller is responsible for providing an appropriately sized output buffer (at least one more
byte than twice the length of the source string), and the result does not include the surrounding
single quotes.
If the error parameter is not NULL, then *error is set to zero on success, nonzero on er-
ror. Presently the only possible error conditions involve invalid multibyte encoding in the source
string. The output string is still generated on error, but it can be expected that the server will
reject it as malformed. On error, a suitable message is stored in the conn object, whether or not
error is NULL.
PQescapeStringConn returns the number of bytes written to to, not including the terminat-
ing zero byte.
PQescapeString
The only difference from PQescapeStringConn is that PQescapeString does not take
PGconn or error parameters. Because of this, it cannot adjust its behavior depending on the
connection properties (such as character encoding) and therefore it might give the wrong results.
Also, it has no way to report error conditions.
PQescapeString can be used safely in client programs that work with only one PostgreSQL
connection at a time (in this case it can find out what it needs to know “behind the scenes”). In
other contexts it is a security hazard and should be avoided in favor of PQescapeStringConn.
PQescapeByteaConn
Escapes binary data for use within an SQL command with the type bytea. As with
PQescapeStringConn, this is only used when inserting data directly into an SQL command
string.
Certain byte values must be escaped when used as part of a bytea literal in an SQL statement.
PQescapeByteaConn escapes bytes using either hex encoding or backslash escaping. See
Section 8.4 for more information.
The from parameter points to the first byte of the string that is to be escaped, and the
from_length parameter gives the number of bytes in this binary string. (A terminating zero
byte is neither necessary nor counted.) The to_length parameter points to a variable that will
hold the resultant escaped string length. This result string length includes the terminating zero
byte of the result.
PQescapeByteaConn returns an escaped version of the from parameter binary string in mem-
ory allocated with malloc(). This memory should be freed using PQfreemem() when the
result is no longer needed. The return string has all special characters replaced so that they can
be properly processed by the PostgreSQL string literal parser, and the bytea input function.
A terminating zero byte is also added. The single quotes that must surround PostgreSQL string
literals are not part of the result string.
On error, a null pointer is returned, and a suitable error message is stored in the conn object.
Currently, the only possible error is insufficient memory for the result string.
PQescapeBytea
The only difference from PQescapeByteaConn is that PQescapeBytea does not take a
PGconn parameter. Because of this, PQescapeBytea can only be used safely in client pro-
grams that use a single PostgreSQL connection at a time (in this case it can find out what it needs
to know “behind the scenes”). It might give the wrong results if used in programs that use multiple
database connections (use PQescapeByteaConn in such cases).
PQunescapeBytea
Converts a string representation of binary data into binary data — the reverse of PQescape-
Bytea. This is needed when retrieving bytea data in text format, but not when retrieving it
in binary format.
The from parameter points to a string such as might be returned by PQgetvalue when applied
to a bytea column. PQunescapeBytea converts this string representation into its binary rep-
resentation. It returns a pointer to a buffer allocated with malloc(), or NULL on error, and puts
the size of the buffer in to_length. The result must be freed using PQfreemem when it is
no longer needed.
This conversion is not exactly the inverse of PQescapeBytea, because the string is not expected
to be “escaped” when received from PQgetvalue. In particular this means there is no need for
string quoting considerations, and so no need for a PGconn parameter.
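For example, a bytea column retrieved in text format could be converted back to binary along these lines; the query and table are illustrative.
#include <stdio.h>
#include <libpq-fe.h>

static void
fetch_image(PGconn *conn)
{
    PGresult *res = PQexec(conn, "SELECT data FROM images WHERE id = 1");

    if (PQresultStatus(res) == PGRES_TUPLES_OK && PQntuples(res) == 1)
    {
        size_t len;
        /* PQgetvalue returns the text (escaped) representation of the bytea value */
        unsigned char *bytes =
            PQunescapeBytea((const unsigned char *) PQgetvalue(res, 0, 0), &len);

        if (bytes != NULL)
        {
            printf("got %zu bytes of binary data\n", len);
            PQfreemem(bytes);            /* buffer was allocated with malloc() */
        }
    }
    PQclear(res);
}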
The PQexec function is adequate for submitting commands in normal, synchronous applications. It
has a few deficiencies, however, that can be of importance to some users:
• PQexec waits for the command to be completed. The application might have other work to do
(such as maintaining a user interface), in which case it won't want to block waiting for the response.
• Since the execution of the client application is suspended while it waits for the result, it is hard for
the application to decide that it would like to try to cancel the ongoing command. (It can be done
from a signal handler, but not otherwise.)
• PQexec can return only one PGresult structure. If the submitted command string contains mul-
tiple SQL commands, all but the last PGresult are discarded by PQexec.
• PQexec always collects the command's entire result, buffering it in a single PGresult. While
this simplifies error-handling logic for the application, it can be impractical for results containing
many rows.
Applications that do not like these limitations can instead use the underlying functions that
PQexec is built from: PQsendQuery and PQgetResult. There are also PQsendQuery-
Params, PQsendPrepare, PQsendQueryPrepared, PQsendDescribePrepared, and
PQsendDescribePortal, which can be used with PQgetResult to duplicate the functionali-
ty of PQexecParams, PQprepare, PQexecPrepared, PQdescribePrepared, and PQde-
scribePortal respectively.
PQsendQuery
Submits a command to the server without waiting for the result(s). 1 is returned if the command
was successfully dispatched and 0 if not (in which case, use PQerrorMessage to get more
information about the failure).
After successfully calling PQsendQuery, call PQgetResult one or more times to obtain the
results. PQsendQuery cannot be called again (on the same connection) until PQgetResult
has returned a null pointer, indicating that the command is done.
PQsendQueryParams
Submits a command and separate parameters to the server without waiting for the result(s).
This is equivalent to PQsendQuery except that query parameters can be specified separately
from the query string. The function's parameters are handled identically to PQexecParams. Like
PQexecParams, it will not work on 2.0-protocol connections, and it allows only one command
in the query string.
PQsendPrepare
Sends a request to create a prepared statement with the given parameters, without waiting for
completion.
This is an asynchronous version of PQprepare: it returns 1 if it was able to dispatch the re-
quest, and 0 if not. After a successful call, call PQgetResult to determine whether the server
successfully created the prepared statement. The function's parameters are handled identically to
PQprepare. Like PQprepare, it will not work on 2.0-protocol connections.
PQsendQueryPrepared
Sends a request to execute a prepared statement with given parameters, without waiting for the
result(s).
PQsendDescribePrepared
Submits a request to obtain information about the specified prepared statement, without waiting
for completion.
PQsendDescribePortal
Submits a request to obtain information about the specified portal, without waiting for completion.
PQgetResult
Waits for the next result from a prior PQsendQuery, PQsendQueryParams, PQsendPrepare,
PQsendQueryPrepared, PQsendDescribePrepared, or PQsendDescribePortal call,
and returns it. A null pointer is returned when the command is complete
and there will be no more results.
PQgetResult must be called repeatedly until it returns a null pointer, indicating that the
command is done. (If called when no command is active, PQgetResult will just return a null
pointer at once.) Each non-null result from PQgetResult should be processed using the same
PGresult accessor functions previously described. Don't forget to free each result object with
PQclear when done with it. Note that PQgetResult will block only if a command is active and
the necessary response data has not yet been read by PQconsumeInput.
Note
Even when PQresultStatus indicates a fatal error, PQgetResult should be called
until it returns a null pointer, to allow libpq to process the error information completely.
Using PQsendQuery and PQgetResult solves one of PQexec's problems: If a command string
contains multiple SQL commands, the results of those commands can be obtained individually. (This
allows a simple form of overlapped processing, by the way: the client can be handling the results of
one command while the server is still working on later queries in the same command string.)
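As a hedged illustration of the calling pattern just described (assuming conn is an already-established connection and sql is the command string to run):

#include <stdio.h>
#include <libpq-fe.h>

static int
run_async(PGconn *conn, const char *sql)
{
    PGresult *res;

    if (!PQsendQuery(conn, sql))
    {
        fprintf(stderr, "dispatch failed: %s", PQerrorMessage(conn));
        return -1;
    }

    /* Keep calling PQgetResult until it returns NULL, even after an error. */
    while ((res = PQgetResult(conn)) != NULL)
    {
        if (PQresultStatus(res) == PGRES_TUPLES_OK)
            printf("%d row(s)\n", PQntuples(res));
        else if (PQresultStatus(res) == PGRES_FATAL_ERROR)
            fprintf(stderr, "%s", PQresultErrorMessage(res));
        PQclear(res);
    }
    return 0;
}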
Another frequently-desired feature that can be obtained with PQsendQuery and PQgetResult is
retrieving large query results a row at a time. This is discussed in Section 34.5.
By itself, calling PQgetResult will still cause the client to block until the server completes the next
SQL command. This can be avoided by proper use of two more functions:
PQconsumeInput
PQconsumeInput normally returns 1 indicating “no error”, but returns 0 if there was some kind
of trouble (in which case PQerrorMessage can be consulted). Note that the result does not say
whether any input data was actually collected. After calling PQconsumeInput, the application
can check PQisBusy and/or PQnotifies to see if their state has changed.
PQconsumeInput can be called even if the application is not prepared to deal with a result or
notification just yet. The function will read available data and save it in a buffer, thereby causing a
select() read-ready indication to go away. The application can thus use PQconsumeInput
to clear the select() condition immediately, and then examine the results at leisure.
PQisBusy
Returns 1 if a command is busy, that is, PQgetResult would block waiting for input. A 0 return
indicates that PQgetResult can be called with assurance of not blocking.
PQisBusy will not itself attempt to read data from the server; therefore PQconsumeInput
must be invoked first, or the busy state will never end.
A typical application using these functions will have a main loop that uses select() or poll()
to wait for all the conditions that it must respond to. One of the conditions will be input available
from the server, which in terms of select() means readable data on the file descriptor identified
by PQsocket. When the main loop detects input ready, it should call PQconsumeInput to read
the input. It can then call PQisBusy, followed by PQgetResult if PQisBusy returns false (0).
It can also call PQnotifies to detect NOTIFY messages (see Section 34.8).
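A minimal sketch of such a main loop follows; it assumes a command has already been dispatched with PQsendQuery on conn, and it discards results and notifications for brevity:

#include <stdio.h>
#include <sys/select.h>
#include <libpq-fe.h>

static void
drain_results(PGconn *conn)
{
    int sock = PQsocket(conn);

    if (sock < 0)
        return;                         /* connection is bad */

    for (;;)
    {
        fd_set      readable;
        PGresult   *res;
        PGnotify   *note;

        FD_ZERO(&readable);
        FD_SET(sock, &readable);
        if (select(sock + 1, &readable, NULL, NULL, NULL) < 0)
            return;                     /* select() failed */

        if (!PQconsumeInput(conn))      /* absorb whatever has arrived */
            return;

        while (!PQisBusy(conn))         /* PQgetResult cannot block now */
        {
            res = PQgetResult(conn);
            if (res == NULL)
                return;                 /* command is complete */
            PQclear(res);               /* a real application would use it */
        }

        while ((note = PQnotifies(conn)) != NULL)
            PQfreemem(note);            /* handle NOTIFY messages here */
    }
}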
A client that uses PQsendQuery/PQgetResult can also attempt to cancel a command that is still
being processed by the server; see Section 34.6. But regardless of the return value of PQcancel, the
application must continue with the normal result-reading sequence using PQgetResult. A success-
ful cancellation will simply cause the command to terminate sooner than it would have otherwise.
By using the functions described above, it is possible to avoid blocking while waiting for input from
the database server. However, it is still possible that the application will block waiting to send output to
the server. This is relatively uncommon but can happen if very long SQL commands or data values are
sent. (It is much more probable if the application sends data via COPY IN, however.) To prevent this
possibility and achieve completely nonblocking database operation, the following additional functions
can be used.
PQsetnonblocking
Sets the state of the connection to nonblocking if arg is 1, or blocking if arg is 0. Returns 0
if OK, -1 if error.
Note that PQexec does not honor nonblocking mode; if it is called, it will act in blocking fashion
anyway.
PQisnonblocking
PQflush
Attempts to flush any queued output data to the server. Returns 0 if successful (or if the send
queue is empty), -1 if it failed for some reason, or 1 if it was unable to send all the data in the send
queue yet (this case can only occur if the connection is nonblocking).
After sending any command or data on a nonblocking connection, call PQflush. If it returns 1, wait
for the socket to become read- or write-ready. If it becomes write-ready, call PQflush again. If it
becomes read-ready, call PQconsumeInput, then call PQflush again. Repeat until PQflush re-
turns 0. (It is necessary to check for read-ready and drain the input with PQconsumeInput, because
the server can block trying to send us data, e.g., NOTICE messages, and won't read our data until we
read its.) Once PQflush returns 0, wait for the socket to be read-ready and then read the response
as described above.
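A minimal sketch of that flush loop, under the assumption that conn has been put into nonblocking mode and a command has just been sent:

#include <sys/select.h>
#include <libpq-fe.h>

static int
flush_pending_output(PGconn *conn)
{
    int sock = PQsocket(conn);

    for (;;)
    {
        fd_set  rset,
                wset;
        int     r = PQflush(conn);

        if (r == 0)
            return 0;           /* everything queued has been sent */
        if (r < 0)
            return -1;          /* error; consult PQerrorMessage */

        /* r == 1: wait until the socket is read- or write-ready */
        FD_ZERO(&rset);
        FD_ZERO(&wset);
        FD_SET(sock, &rset);
        FD_SET(sock, &wset);
        if (select(sock + 1, &rset, &wset, NULL, NULL) < 0)
            return -1;
        /* drain server output so the server does not block trying to send to us */
        if (FD_ISSET(sock, &rset) && !PQconsumeInput(conn))
            return -1;
    }
}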
PQsetSingleRowMode
This function can only be called immediately after PQsendQuery or one of its sibling functions,
before any other operation on the connection such as PQconsumeInput or PQgetResult. If
called at the correct time, the function activates single-row mode for the current query and returns
1. Otherwise the mode stays unchanged and the function returns 0. In any case, the mode reverts
to normal after completion of the current query.
Caution
While processing a query, the server may return some rows and then encounter an error, caus-
ing the query to be aborted. Ordinarily, libpq discards any such rows and reports only the error.
But in single-row mode, those rows will have already been returned to the application. Hence, the
application will see some PGRES_SINGLE_TUPLE PGresult objects followed by a
PGRES_FATAL_ERROR object. For proper transactional behavior, the application must be designed
to either discard or undo whatever has been done with the previously-processed rows, if the query
ultimately fails.
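As a hedged illustration of single-row mode (the query text is an arbitrary example and error handling is abbreviated):

#include <stdio.h>
#include <libpq-fe.h>

static void
fetch_row_by_row(PGconn *conn)
{
    PGresult *res;

    if (!PQsendQuery(conn, "SELECT generate_series(1, 1000000)"))
        return;
    if (!PQsetSingleRowMode(conn))
        fprintf(stderr, "could not enable single-row mode\n");

    while ((res = PQgetResult(conn)) != NULL)
    {
        if (PQresultStatus(res) == PGRES_SINGLE_TUPLE)
            printf("%s\n", PQgetvalue(res, 0, 0));  /* exactly one row */
        else if (PQresultStatus(res) == PGRES_FATAL_ERROR)
            fprintf(stderr, "%s", PQresultErrorMessage(res));
        /* PGRES_TUPLES_OK marks the (empty) end of the result set */
        PQclear(res);
    }
}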
PQgetCancel
Creates a data structure containing the information needed to cancel a command issued through
a particular database connection.
PQgetCancel creates a PGcancel object given a PGconn connection object. It will return
NULL if the given conn is NULL or an invalid connection. The PGcancel object is an opaque
structure that is not meant to be accessed directly by the application; it can only be passed to
PQcancel or PQfreeCancel.
PQfreeCancel
PQcancel
The return value is 1 if the cancel request was successfully dispatched and 0 if not. If not, errbuf
is filled with an explanatory error message. errbuf must be a char array of size errbufsize
(the recommended size is 256 bytes).
Successful dispatch is no guarantee that the request will have any effect, however. If the cancel-
lation is effective, the current command will terminate early and return an error result. If the can-
cellation fails (say, because the server was already done processing the command), then there will
be no visible result at all.
PQcancel can safely be invoked from a signal handler, if the errbuf is a local variable in the
signal handler. The PGcancel object is read-only as far as PQcancel is concerned, so it can
also be invoked from a thread that is separate from the one manipulating the PGconn object.
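A minimal sketch of issuing a cancel request for the command currently running on conn (the normal PQgetResult sequence must still follow):

#include <stdio.h>
#include <libpq-fe.h>

static void
cancel_current_command(PGconn *conn)
{
    char        errbuf[256];
    PGcancel   *cancel = PQgetCancel(conn);

    if (cancel == NULL)
        return;                     /* connection was NULL or invalid */
    if (!PQcancel(cancel, errbuf, sizeof(errbuf)))
        fprintf(stderr, "cancel request failed: %s\n", errbuf);
    PQfreeCancel(cancel);
    /* The application must still call PQgetResult until it returns NULL. */
}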
PQrequestCancel
Requests that the server abandon processing of the current command. It operates directly on the
PGconn object, and in case of failure stores the error message in the PGconn object (whence it
can be retrieved by PQerrorMessage). Although the functionality is the same, this approach
creates hazards for multiple-thread programs and signal handlers, since it is possible that over-
writing the PGconn's error message will mess up the operation currently in progress on the con-
nection.
Tip
This interface is somewhat obsolete, as one can achieve similar performance and greater func-
tionality by setting up a prepared statement to define the function call. Then, executing the
statement with binary transmission of parameters and results substitutes for a fast-path func-
tion call.
The function PQfn requests execution of a server function via the fast-path interface:
typedef struct
{
int len;
int isint;
union
{
int *ptr;
int integer;
} u;
} PQArgBlock;
The fnid argument is the OID of the function to be executed. args and nargs define the para-
meters to be passed to the function; they must match the declared function argument list. When the
isint field of a parameter structure is true, the u.integer value is sent to the server as an inte-
ger of the indicated length (this must be 2 or 4 bytes); proper byte-swapping occurs. When isint
is false, the indicated number of bytes at *u.ptr are sent with no processing; the data must be in
the format expected by the server for binary transmission of the function's argument data type. (The
declaration of u.ptr as being of type int * is historical; it would be better to consider it void
*.) result_buf points to the buffer in which to place the function's return value. The caller must
have allocated sufficient space to store the return value. (There is no check!) The actual result length
in bytes will be returned in the integer pointed to by result_len. If a 2- or 4-byte integer result
is expected, set result_is_int to 1, otherwise set it to 0. Setting result_is_int to 1 causes
libpq to byte-swap the value if necessary, so that it is delivered as a proper int value for the client
machine; note that a 4-byte integer is delivered into *result_buf for either allowed result size.
When result_is_int is 0, the binary-format byte string sent by the server is returned unmodified.
(In this case it's better to consider result_buf as being of type void *.)
PQfn always returns a valid PGresult pointer, with status PGRES_COMMAND_OK for success or
PGRES_FATAL_ERROR if some problem was encountered. The result status should be checked be-
fore the result is used. The caller is responsible for freeing the PGresult with PQclear when it
is no longer needed.
To pass a NULL argument to the function, set the len field of that parameter structure to -1; the
isint and u fields are then irrelevant. (But this works only in protocol 3.0 and later connections.)
If the function returns NULL, *result_len is set to -1, and *result_buf is not modified.
(This works only in protocol 3.0 and later connections; in protocol 2.0, neither *result_len nor
*result_buf are modified.)
Note that it is not possible to handle set-valued results when using this interface. Also, the function
must be a plain function, not an aggregate, window function, or procedure.
libpq applications submit LISTEN, UNLISTEN, and NOTIFY commands as ordinary SQL com-
mands. The arrival of NOTIFY messages can subsequently be detected by calling PQnotifies.
The function PQnotifies returns the next notification from a list of unhandled notification mes-
sages received from the server. It returns a null pointer if there are no pending notifications. Once a
notification is returned from PQnotifies, it is considered handled and will be removed from the
list of notifications.
After processing a PGnotify object returned by PQnotifies, be sure to free it with PQfreemem.
It is sufficient to free the PGnotify pointer; the relname and extra fields do not represent sep-
arate allocations. (The names of these fields are historical; in particular, channel names need not have
anything to do with relation names.)
Example 34.2 gives a sample program that illustrates the use of asynchronous notification.
PQnotifies does not actually read data from the server; it just returns messages previously absorbed
by another libpq function. In ancient releases of libpq, the only way to ensure timely receipt of NOTIFY
messages was to constantly submit commands, even empty ones, and then check PQnotifies after
each PQexec. While this still works, it is deprecated as a waste of processing power.
A better way to check for NOTIFY messages when you have no useful commands to execute is to
call PQconsumeInput, then check PQnotifies. You can use select() to wait for data to
arrive from the server, thereby using no CPU power unless there is something to do. (See PQsocket
to obtain the file descriptor number to use with select().) Note that this will work OK whether
you submit commands with PQsendQuery/PQgetResult or simply use PQexec. You should,
however, remember to check PQnotifies after each PQgetResult or PQexec, to see if any
notifications came in during the processing of the command.
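As an illustrative sketch (the channel name my_channel is an arbitrary assumption), here is a blocking LISTEN loop built from PQsocket, select(), PQconsumeInput, and PQnotifies:

#include <stdio.h>
#include <sys/select.h>
#include <libpq-fe.h>

static void
listen_loop(PGconn *conn)
{
    PGresult *res = PQexec(conn, "LISTEN my_channel");

    if (PQresultStatus(res) != PGRES_COMMAND_OK)
        fprintf(stderr, "LISTEN failed: %s", PQerrorMessage(conn));
    PQclear(res);

    for (;;)
    {
        int         sock = PQsocket(conn);
        fd_set      input;
        PGnotify   *note;

        FD_ZERO(&input);
        FD_SET(sock, &input);
        if (select(sock + 1, &input, NULL, NULL, NULL) < 0)
            break;                  /* select() failed */

        PQconsumeInput(conn);
        while ((note = PQnotifies(conn)) != NULL)
        {
            printf("notification \"%s\" from backend PID %d\n",
                   note->relname, note->be_pid);
            PQfreemem(note);
        }
    }
}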
The overall process is that the application first issues the SQL COPY command via PQexec or one of
the equivalent functions. The response to this (if there is no error in the command) will be a
PGresult object bearing a status code of PGRES_COPY_OUT or PGRES_COPY_IN (depending on the
specified copy direction). The application should then use the functions of this section to receive
or transmit data rows. When the data transfer is complete, another PGresult object is returned to
indicate success or failure of the transfer. Its status will be PGRES_COMMAND_OK for success or
PGRES_FATAL_ERROR if some problem was encountered. At this point further SQL commands can
be issued via PQexec. (It is not possible to execute other SQL commands using the same connection
while the COPY operation is in progress.)
If a COPY command is issued via PQexec in a string that could contain additional commands, the
application must continue fetching results via PQgetResult after completing the COPY sequence.
Only when PQgetResult returns NULL is it certain that the PQexec command string is done and
it is safe to issue more commands.
The functions of this section should be executed only after obtaining a result status of
PGRES_COPY_OUT or PGRES_COPY_IN from PQexec or PQgetResult.
A PGresult object bearing one of these status values carries some additional data about the COPY
operation that is starting. This additional data is available using functions that are also used in con-
nection with query results:
PQnfields
PQbinaryTuples
0 indicates the overall copy format is textual (rows separated by newlines, columns separated
by separator characters, etc). 1 indicates the overall copy format is binary. See COPY for more
information.
PQfformat
Returns the format code (0 for text, 1 for binary) associated with each column of the copy oper-
ation. The per-column format codes will always be zero when the overall copy format is textual,
but the binary format can support both text and binary columns. (However, as of the current im-
plementation of COPY, only binary columns appear in a binary copy; so the per-column formats
always match the overall format at present.)
Note
These additional data values are only available when using protocol 3.0. When using protocol
2.0, all these functions will return 0.
PQputCopyData
Transmits the COPY data in the specified buffer, of length nbytes, to the server. The result
is 1 if the data was queued, zero if it was not queued because of full buffers (this will only happen
in nonblocking mode), or -1 if an error occurred. (Use PQerrorMessage to retrieve details if
the return value is -1. If the value is zero, wait for write-ready and try again.)
The application can divide the COPY data stream into buffer loads of any convenient size. Buffer-
load boundaries have no semantic significance when sending. The contents of the data stream
must match the data format expected by the COPY command; see COPY for details.
PQputCopyEnd
Ends the COPY_IN operation successfully if errormsg is NULL. If errormsg is not NULL
then the COPY is forced to fail, with the string pointed to by errormsg used as the error message.
(One should not assume that this exact error message will come back from the server, however,
as the server might have already failed the COPY for its own reasons. Also note that the option to
force failure does not work when using pre-3.0-protocol connections.)
The result is 1 if the termination message was sent; or in nonblocking mode, this may only indi-
cate that the termination message was successfully queued. (In nonblocking mode, to be certain
that the data has been sent, you should next wait for write-ready and call PQflush, repeating
until it returns zero.) Zero indicates that the function could not queue the termination message
because of full buffers; this will only happen in nonblocking mode. (In this case, wait for write-
ready and try the PQputCopyEnd call again.) If a hard error occurs, -1 is returned; you can use
PQerrorMessage to retrieve details.
After successfully calling PQputCopyEnd, call PQgetResult to obtain the final result status
of the COPY command. One can wait for this result to be available in the usual way. Then return
to normal operation.
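A minimal sketch of a text-format COPY IN sequence; the table name hypothetical_tab and its two columns are assumptions for illustration only:

#include <stdio.h>
#include <string.h>
#include <libpq-fe.h>

static int
copy_in_example(PGconn *conn)
{
    PGresult   *res = PQexec(conn, "COPY hypothetical_tab FROM STDIN");
    const char *rows = "1\tone\n2\ttwo\n";

    if (PQresultStatus(res) != PGRES_COPY_IN)
    {
        fprintf(stderr, "COPY failed: %s", PQerrorMessage(conn));
        PQclear(res);
        return -1;
    }
    PQclear(res);

    if (PQputCopyData(conn, rows, (int) strlen(rows)) != 1 ||
        PQputCopyEnd(conn, NULL) != 1)
        return -1;                  /* see PQerrorMessage for details */

    /* The final status of the COPY comes back as another PGresult. */
    res = PQgetResult(conn);
    if (PQresultStatus(res) != PGRES_COMMAND_OK)
        fprintf(stderr, "COPY ended badly: %s", PQresultErrorMessage(res));
    PQclear(res);
    return 0;
}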
PQgetCopyData
Attempts to obtain another row of data from the server during a COPY. Data is always returned
one data row at a time; if only a partial row is available, it is not returned. Successful return of a
data row involves allocating a chunk of memory to hold the data. The buffer parameter must be
non-NULL. *buffer is set to point to the allocated memory, or to NULL in cases where no buffer
is returned. A non-NULL result buffer should be freed using PQfreemem when no longer needed.
When a row is successfully returned, the return value is the number of data bytes in the row (this
will always be greater than zero). The returned string is always null-terminated, though this is
probably only useful for textual COPY. A result of zero indicates that the COPY is still in progress,
but no row is yet available (this is only possible when async is true). A result of -1 indicates that
the COPY is done. A result of -2 indicates that an error occurred (consult PQerrorMessage
for the reason).
When async is true (not zero), PQgetCopyData will not block waiting for input; it will return
zero if the COPY is still in progress but no complete row is available. (In this case wait for read-
ready and then call PQconsumeInput before calling PQgetCopyData again.) When async
is false (zero), PQgetCopyData will block until data is available or the operation completes.
After PQgetCopyData returns -1, call PQgetResult to obtain the final result status of the
COPY command. One can wait for this result to be available in the usual way. Then return to
normal operation.
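A minimal sketch of draining a COPY OUT stream in blocking mode, assuming the PGRES_COPY_OUT result has already been obtained and cleared:

#include <stdio.h>
#include <libpq-fe.h>

static void
copy_out_example(PGconn *conn)
{
    char   *buf;
    int     len;

    while ((len = PQgetCopyData(conn, &buf, 0)) > 0)
    {
        fwrite(buf, 1, len, stdout);    /* exactly one data row per call */
        PQfreemem(buf);
    }
    if (len == -2)
        fprintf(stderr, "COPY error: %s", PQerrorMessage(conn));

    /* len == -1: the COPY is done; fetch the final result status. */
    PQclear(PQgetResult(conn));
}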
PQgetline
Reads a newline-terminated line of characters (transmitted by the server) into a buffer string of
size length.
This function copies up to length-1 characters into the buffer and converts the terminating
newline into a zero byte. PQgetline returns EOF at the end of input, 0 if the entire line has
been read, and 1 if the buffer is full but the terminating newline has not yet been read.
Note that the application must check to see if a new line consists of the two characters \., which
indicates that the server has finished sending the results of the COPY command. If the application
might receive lines that are more than length-1 characters long, care is needed to be sure it
recognizes the \. line correctly (and does not, for example, mistake the end of a long data line
for a terminator line).
PQgetlineAsync
Reads a row of COPY data (transmitted by the server) into a buffer without blocking.
int PQgetlineAsync(PGconn *conn,
                   char *buffer,
                   int bufsize);
This function is similar to PQgetline, but it can be used by applications that must read COPY
data asynchronously, that is, without blocking. Having issued the COPY command and gotten
a PGRES_COPY_OUT response, the application should call PQconsumeInput and
PQgetlineAsync until the end-of-data signal is detected.
On each call, PQgetlineAsync will return data if a complete data row is available in libpq's
input buffer. Otherwise, no data is returned until the rest of the row arrives. The function returns
-1 if the end-of-copy-data marker has been recognized, or 0 if no data is available, or a positive
number giving the number of bytes of data returned. If -1 is returned, the caller must next call
PQendcopy, and then return to normal processing.
The data returned will not extend beyond a data-row boundary. If possible a whole row will be
returned at one time. But if the buffer offered by the caller is too small to hold a row sent by the
server, then a partial data row will be returned. With textual data this can be detected by testing
whether the last returned byte is \n or not. (In a binary COPY, actual parsing of the COPY data
format will be needed to make the equivalent determination.) The returned string is not null-
terminated. (If you want to add a terminating null, be sure to pass a bufsize one smaller than
the room actually available.)
PQputline
Sends a null-terminated string to the server. Returns 0 if OK and EOF if unable to send the string.
The COPY data stream sent by a series of calls to PQputline has the same format as that returned
by PQgetlineAsync, except that applications are not obliged to send exactly one data row per
PQputline call; it is okay to send a partial line or multiple lines per call.
Note
Before PostgreSQL protocol 3.0, it was necessary for the application to explicitly send
the two characters \. as a final line to indicate to the server that it had finished sending
COPY data. While this still works, it is deprecated and the special meaning of \. can
be expected to be removed in a future release. It is sufficient to call PQendcopy after
having sent the actual data.
PQputnbytes
Sends a non-null-terminated string to the server. Returns 0 if OK and EOF if unable to send the
string.
This is exactly like PQputline, except that the data buffer need not be null-terminated since
the number of bytes to send is specified directly. Use this procedure when sending binary data.
PQendcopy
This function waits until the server has finished the copying. It should either be issued when the
last string has been sent to the server using PQputline or when the last string has been received
from the server using PQgetline. It must be issued or the server will get “out of sync” with
the client. Upon return from this function, the server is ready to receive the next SQL command.
The return value is 0 on successful completion, nonzero otherwise. (Use PQerrorMessage to
retrieve details if the return value is nonzero.)
Older applications are likely to submit a COPY via PQexec and assume that the transaction is
done after PQendcopy. This will work correctly only if the COPY is the only SQL command
in the command string.
PQclientEncoding
Note that it returns the encoding ID, not a symbolic string such as EUC_JP. If unsuccessful, it
returns -1. To convert an encoding ID to an encoding name, you can use:

char *pg_encoding_to_char(int encoding_id);
PQsetClientEncoding
conn is a connection to the server, and encoding is the encoding you want to use. If the function
successfully sets the encoding, it returns 0, otherwise -1. The current encoding for this connection
can be determined by using PQclientEncoding.
PQsetErrorVerbosity
typedef enum
{
PQERRORS_TERSE,
PQERRORS_DEFAULT,
PQERRORS_VERBOSE
} PGVerbosity;
PQsetErrorVerbosity sets the verbosity mode, returning the connection's previous setting.
In TERSE mode, returned messages include severity, primary text, and position only; this will
normally fit on a single line. The default mode produces messages that include the above plus any
detail, hint, or context fields (these might span multiple lines). The VERBOSE mode includes all
available fields. Changing the verbosity does not affect the messages available from
already-existing PGresult objects, only subsequently-created ones. (But see
PQresultVerboseErrorMessage if you want to print a previous error with a different verbosity.)
PQsetErrorContextVisibility
typedef enum
{
PQSHOW_CONTEXT_NEVER,
PQSHOW_CONTEXT_ERRORS,
PQSHOW_CONTEXT_ALWAYS
} PGContextVisibility;
PQtrace
Note
On Windows, if the libpq library and an application are compiled with different flags, this
function call will crash the application because the internal representations of the FILE
pointers differ. Specifically, multithreaded/single-threaded, release/debug, and
static/dynamic flags should be the same for the library and all applications using that library.
PQuntrace
PQfreemem
PQconninfoFree
A simple PQfreemem will not do for this, since the array contains references to subsidiary
strings.
PQencryptPasswordConn
This function is intended to be used by client applications that wish to send commands like ALTER
USER joe PASSWORD 'pwd'. It is good practice not to send the original cleartext password
in such a command, because it might be exposed in command logs, activity displays, and so on.
Instead, use this function to convert the password to encrypted form before it is sent.
The passwd and user arguments are the cleartext password, and the SQL name of the user it
is for. algorithm specifies the encryption algorithm to use to encrypt the password. Currently
supported algorithms are md5 and scram-sha-256 (on and off are also accepted as aliases
for md5, for compatibility with older server versions). Note that support for scram-sha-256
was introduced in PostgreSQL version 10, and will not work correctly with older server versions.
If algorithm is NULL, this function will query the server for the current value of the
password_encryption setting. That can block, and will fail if the current transaction is aborted, or
if the connection is busy executing another query. If you wish to use the default algorithm for
the server but want to avoid blocking, query password_encryption yourself before calling
PQencryptPasswordConn, and pass that value as the algorithm.
The return value is a string allocated by malloc. The caller can assume the string doesn't contain
any special characters that would require escaping. Use PQfreemem to free the result when done
with it. On error, returns NULL, and a suitable message is stored in the connection object.
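As a hedged sketch of the intended usage (the role-name handling is simplified; a real application would also quote the user name, for example with PQescapeIdentifier):

#include <stdio.h>
#include <libpq-fe.h>

static int
set_password(PGconn *conn, const char *user, const char *newpw)
{
    char        query[512];
    char       *encrypted = PQencryptPasswordConn(conn, newpw, user, NULL);
    PGresult   *res;

    if (encrypted == NULL)
    {
        fprintf(stderr, "%s", PQerrorMessage(conn));
        return -1;
    }
    /* The encrypted string contains no characters that need escaping. */
    snprintf(query, sizeof(query), "ALTER USER %s PASSWORD '%s'",
             user, encrypted);
    PQfreemem(encrypted);

    res = PQexec(conn, query);
    if (PQresultStatus(res) != PGRES_COMMAND_OK)
        fprintf(stderr, "%s", PQerrorMessage(conn));
    PQclear(res);
    return 0;
}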
PQencryptPassword
PQmakeEmptyPGresult
This is libpq's internal function to allocate and initialize an empty PGresult object. This func-
tion returns NULL if memory could not be allocated. It is exported because some applications find
it useful to generate result objects (particularly objects with error status) themselves. If conn is
not null and status indicates an error, the current error message of the specified connection
is copied into the PGresult. Also, if conn is not null, any event procedures registered in the
connection are copied into the PGresult. (They do not get PGEVT_RESULTCREATE calls, but
see PQfireResultCreateEvents.) Note that PQclear should eventually be called on the
object, just as with a PGresult returned by libpq itself.
PQfireResultCreateEvents
Fires a PGEVT_RESULTCREATE event (see Section 34.13) for each event procedure registered
in the PGresult object. Returns non-zero for success, zero if any event procedure fails.
The conn argument is passed through to event procedures but not used directly. It can be NULL
if the event procedures won't use it.
The main reason that this function is separate from PQmakeEmptyPGresult is that it is often
appropriate to create a PGresult and fill it with data before invoking the event procedures.
PQcopyResult
Makes a copy of a PGresult object. The copy is not linked to the source result in any way
and PQclear must be called when the copy is no longer needed. If the function fails, NULL is
returned.
This is not intended to make an exact copy. The returned result is always put into
PGRES_TUPLES_OK status, and does not copy any error message in the source. (It does copy the command
status string, however.) The flags argument determines what else is copied. It is a bitwise OR
of several flags. PG_COPYRES_ATTRS specifies copying the source result's attributes (column
definitions). PG_COPYRES_TUPLES specifies copying the source result's tuples. (This implies
copying the attributes, too.) PG_COPYRES_NOTICEHOOKS specifies copying the source result's
notify hooks. PG_COPYRES_EVENTS specifies copying the source result's events. (But any in-
stance data associated with the source is not copied.)
PQsetResultAttrs
The provided attDescs are copied into the result. If the attDescs pointer is NULL or
numAttributes is less than one, the request is ignored and the function succeeds. If res already
contains attributes, the function will fail. If the function fails, the return value is zero. If the func-
tion succeeds, the return value is non-zero.
PQsetvalue
The function will automatically grow the result's internal tuples array as needed. However, the
tup_num argument must be less than or equal to PQntuples, meaning this function can only
grow the tuples array one tuple at a time. But any field of any existing tuple can be modified in
any order. If a value at field_num already exists, it will be overwritten. If len is -1 or value
is NULL, the field value will be set to an SQL null value. The value is copied into the result's
private storage, thus is no longer needed after the function returns. If the function fails, the return
value is zero. If the function succeeds, the return value is non-zero.
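A minimal sketch that combines PQmakeEmptyPGresult, PQsetResultAttrs, and PQsetvalue to build a one-row, one-column result; the column definition values used here (text type OID 25, typlen -1) are illustrative assumptions:

#include <string.h>
#include <libpq-fe.h>

static PGresult *
make_synthetic_result(void)
{
    PGresult    *res = PQmakeEmptyPGresult(NULL, PGRES_TUPLES_OK);
    PGresAttDesc col;

    if (res == NULL)
        return NULL;

    memset(&col, 0, sizeof(col));
    col.name = "message";
    col.typid = 25;             /* assumed: OID of the text type */
    col.typlen = -1;            /* varlena */
    col.format = 0;             /* text format */

    if (!PQsetResultAttrs(res, 1, &col) ||
        !PQsetvalue(res, 0, 0, "hello", 5))
    {
        PQclear(res);
        return NULL;
    }
    return res;                 /* caller must eventually PQclear() it */
}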
PQresultAlloc
Any memory allocated with this function will be freed when res is cleared. If the function fails,
the return value is NULL. The result is guaranteed to be adequately aligned for any type of data,
just as for malloc.
PQlibVersion
int PQlibVersion(void);
The result of this function can be used to determine, at run time, whether specific functionality
is available in the currently loaded version of libpq. The function can be used, for example, to
determine which connection options are available in PQconnectdb.
The result is formed by multiplying the library's major version number by 10000 and adding the
minor version number. For example, version 10.1 will be returned as 100001, and version 11.0
will be returned as 110000.
Prior to major version 10, PostgreSQL used three-part version numbers in which the first two parts
together represented the major version. For those versions, PQlibVersion uses two digits for
each part; for example version 9.1.5 will be returned as 90105, and version 9.2.0 will be returned
as 90200.
Therefore, for purposes of determining feature compatibility, applications should divide the result
of PQlibVersion by 100 not 10000 to determine a logical major version number. In all release
series, only the last two digits differ between minor releases (bug-fix releases).
Note
This function appeared in PostgreSQL version 9.1, so it cannot be used to detect required
functionality in earlier versions, since calling it will create a link dependency on version
9.1 or later.
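A minimal sketch of the version arithmetic described above:

#include <stdio.h>
#include <libpq-fe.h>

int
main(void)
{
    int ver = PQlibVersion();       /* e.g., 110005 for 11.5 */
    int logical_major = ver / 100;  /* 1100 for 11.x, 904 for 9.4.x */

    printf("libpq %d (logical major version %d)\n", ver, logical_major);
    return 0;
}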
For historical reasons, there are two levels of notice handling, called the notice receiver and notice
processor. The default behavior is for the notice receiver to format the notice and pass a string to the
notice processor for printing. However, an application that chooses to provide its own notice receiver
will typically ignore the notice processor layer and just do all the work in the notice receiver.
The function PQsetNoticeReceiver sets or examines the current notice receiver for a connec-
tion object. Similarly, PQsetNoticeProcessor sets or examines the current notice processor.
typedef void (*PQnoticeReceiver) (void *arg, const PGresult *res);

PQnoticeReceiver
PQsetNoticeReceiver(PGconn *conn,
                    PQnoticeReceiver proc,
                    void *arg);

typedef void (*PQnoticeProcessor) (void *arg, const char *message);

PQnoticeProcessor
PQsetNoticeProcessor(PGconn *conn,
                     PQnoticeProcessor proc,
                     void *arg);
Each of these functions returns the previous notice receiver or processor function pointer, and sets the
new value. If you supply a null function pointer, no action is taken, but the current pointer is returned.
When a notice or warning message is received from the server, or generated internally by libpq, the no-
tice receiver function is called. It is passed the message in the form of a PGRES_NONFATAL_ERROR
PGresult. (This allows the receiver to extract individual fields using PQresultErrorField,
or obtain a complete preformatted message using PQresultErrorMessage or
PQresultVerboseErrorMessage.) The same void pointer passed to PQsetNoticeReceiver is also passed.
(This pointer can be used to access application-specific state if needed.)
The default notice receiver simply extracts the message (using PQresultErrorMessage) and
passes it to the notice processor.
The notice processor is responsible for handling a notice or warning message given in text form. It is
passed the string text of the message (including a trailing newline), plus a void pointer that is the same
one passed to PQsetNoticeProcessor. (This pointer can be used to access application-specific state
if needed.) The default notice processor is simply:
static void
defaultNoticeProcessor(void *arg, const char *message)
{
fprintf(stderr, "%s", message);
}
Once you have set a notice receiver or processor, you should expect that that function could be called
as long as either the PGconn object or PGresult objects made from it exist. At creation of a
PGresult, the PGconn's current notice handling pointers are copied into the PGresult for possible use
by functions like PQgetvalue.
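As a hedged example of replacing the notice processor (the tag string passed through arg is an arbitrary assumption):

#include <stdio.h>
#include <libpq-fe.h>

static void
tagged_notice_processor(void *arg, const char *message)
{
    /* message already ends with a newline */
    fprintf(stderr, "[%s] %s", (const char *) arg, message);
}

static void
install_notice_processor(PGconn *conn)
{
    /* The previous processor is returned and could be saved and chained to. */
    PQsetNoticeProcessor(conn, tagged_notice_processor, "myapp");
}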
Each registered event handler is associated with two pieces of data, known to libpq only as opaque
void * pointers. There is a passthrough pointer that is provided by the application when the event
handler is registered with a PGconn. The passthrough pointer never changes for the life of the
PGconn and all PGresults generated from it; so if used, it must point to long-lived data. In addition
there is an instance data pointer, which starts out NULL in every PGconn and PGresult. This pointer
can be manipulated using the PQinstanceData, PQsetInstanceData, PQresultInstanceData
and PQsetResultInstanceData functions. Note that unlike the passthrough
pointer, instance data of a PGconn is not automatically inherited by PGresults created from it.
libpq does not know what passthrough and instance data pointers point to (if anything) and will never
attempt to free them — that is the responsibility of the event handler.
PGEVT_REGISTER
The register event occurs when PQregisterEventProc is called. It is the ideal time to ini-
tialize any instanceData an event procedure may need. Only one register event will be fired
per event handler per connection. If the event procedure fails, the registration is aborted.
typedef struct
{
PGconn *conn;
} PGEventRegister;
PGEVT_CONNRESET
The connection reset event is fired on completion of PQreset or PQresetPoll. In both cases,
the event is only fired if the reset was successful. If the event procedure fails, the entire connection
reset will fail; the PGconn is put into CONNECTION_BAD status and PQresetPoll will return
PGRES_POLLING_FAILED.
typedef struct
{
PGconn *conn;
} PGEventConnReset;
PGEVT_CONNDESTROY
The connection destroy event is fired in response to PQfinish. It is the event procedure's re-
sponsibility to properly clean up its event data as libpq has no ability to manage this memory.
Failure to clean up will lead to memory leaks.
typedef struct
{
PGconn *conn;
} PGEventConnDestroy;
PGEVT_RESULTCREATE
The result creation event is fired in response to any query execution function that generates a
result, including PQgetResult. This event will only be fired after the result has been created
successfully.
typedef struct
{
PGconn *conn;
PGresult *result;
} PGEventResultCreate;
PGEVT_RESULTCOPY
The result copy event is fired in response to PQcopyResult. This event will only be
fired after the copy is complete. Only event procedures that have successfully handled the
PGEVT_RESULTCREATE or PGEVT_RESULTCOPY event for the source result will receive
PGEVT_RESULTCOPY events.
typedef struct
{
const PGresult *src;
PGresult *dest;
} PGEventResultCopy;
PGEVT_RESULTDESTROY
The result destroy event is fired in response to a PQclear. It is the event procedure's responsi-
bility to properly clean up its event data as libpq has no ability to manage this memory. Failure
to clean up will lead to memory leaks.
typedef struct
{
PGresult *result;
} PGEventResultDestroy;
PGEventProc is a typedef for a pointer to an event procedure, that is, the user callback function
that receives events from libpq. The signature of an event procedure must be

int eventproc(PGEventId evtId, void *evtInfo, void *passThrough)
The evtId parameter indicates which PGEVT event occurred. The evtInfo pointer must
be cast to the appropriate structure type to obtain further information about the event. The
passThrough parameter is the pointer provided to PQregisterEventProc when the event
procedure was registered. The function should return a non-zero value if it succeeds and zero if
it fails.
A particular event procedure can be registered only once in any PGconn. This is because the
address of the procedure is used as a lookup key to identify the associated instance data.
Caution
On Windows, functions can have two different addresses: one visible from outside a DLL
and another visible from inside the DLL. One should be careful that only one of these
addresses is used with libpq's event-procedure functions, else confusion will result. The
simplest rule for writing code that will work is to ensure that event procedures are declared
static. If the procedure's address must be available outside its own source file, expose
a separate function to return the address.
An event procedure must be registered once on each PGconn you want to receive events about.
There is no limit, other than memory, on the number of event procedures that can be registered
with a connection. The function returns a non-zero value if it succeeds and zero if it fails.
The proc argument will be called when a libpq event is fired. Its memory address is also used
to lookup instanceData. The name argument is used to refer to the event procedure in error
messages. This value cannot be NULL or a zero-length string. The name string is copied into the
PGconn, so what is passed need not be long-lived. The passThrough pointer is passed to the
proc whenever an event occurs. This argument can be NULL.
PQsetInstanceData
Sets the connection conn's instanceData for procedure proc to data. This returns non-
zero for success and zero for failure. (Failure is only possible if proc has not been properly
registered in conn.)
PQinstanceData
Returns the connection conn's instanceData associated with procedure proc, or NULL if
there is none.
PQresultSetInstanceData
Sets the result's instanceData for proc to data. This returns non-zero for success and zero
for failure. (Failure is only possible if proc has not been properly registered in the result.)
PQresultInstanceData
Returns the result's instanceData associated with proc, or NULL if there is none.
/* The instanceData */
typedef struct
{
int n;
char *str;
} mydata;
/* PGEventProc */
static int myEventProc(PGEventId evtId, void *evtInfo, void *passThrough);
int
main(void)
{
mydata *data;
PGresult *res;
    PGconn *conn = PQconnectdb("dbname=postgres options=-csearch_path=");
if (PQstatus(conn) != CONNECTION_OK)
{
fprintf(stderr, "Connection to database failed: %s",
PQerrorMessage(conn));
PQfinish(conn);
return 1;
}
return 0;
}
static int
myEventProc(PGEventId evtId, void *evtInfo, void *passThrough)
{
switch (evtId)
{
case PGEVT_REGISTER:
{
PGEventRegister *e = (PGEventRegister *)evtInfo;
mydata *data = get_mydata(e->conn);
case PGEVT_CONNRESET:
{
PGEventConnReset *e = (PGEventConnReset *)evtInfo;
mydata *data = PQinstanceData(e->conn, myEventProc);
if (data)
memset(data, 0, sizeof(mydata));
break;
}
case PGEVT_CONNDESTROY:
{
PGEventConnDestroy *e = (PGEventConnDestroy *)evtInfo;
mydata *data = PQinstanceData(e->conn, myEventProc);
case PGEVT_RESULTCREATE:
{
                PGEventResultCreate *e = (PGEventResultCreate *) evtInfo;
                mydata *conn_data = PQinstanceData(e->conn, myEventProc);
mydata *res_data = dup_mydata(conn_data);
case PGEVT_RESULTCOPY:
{
PGEventResultCopy *e = (PGEventResultCopy *)evtInfo;
                mydata *src_data = PQresultInstanceData(e->src, myEventProc);
mydata *dest_data = dup_mydata(src_data);
case PGEVT_RESULTDESTROY:
{
                PGEventResultDestroy *e = (PGEventResultDestroy *) evtInfo;
                mydata *data = PQresultInstanceData(e->result, myEventProc);
• PGHOSTADDR behaves the same as the hostaddr connection parameter. This can be set instead of
or in addition to PGHOST to avoid DNS lookup overhead.
• PGPASSWORD behaves the same as the password connection parameter. Use of this environment
variable is not recommended for security reasons, as some operating systems allow non-root users to
see process environment variables via ps; instead consider using a password file (see Section 34.15).
• PGSERVICEFILE specifies the name of the per-user connection service file (see Section 34.16).
Defaults to ~/.pg_service.conf, or %APPDATA%\postgresql\.pg_service.conf
on Microsoft Windows.
• PGREQUIRESSL behaves the same as the requiressl connection parameter. This environment vari-
able is deprecated in favor of the PGSSLMODE variable; setting both variables suppresses the effect
of this one.
The following environment variables can be used to specify default behavior for each PostgreSQL
session. (See also the ALTER ROLE and ALTER DATABASE commands for ways to set default
behavior on a per-user or per-database basis.)
• PGDATESTYLE sets the default style of date/time representation. (Equivalent to SET
datestyle TO ....)
• PGTZ sets the default time zone. (Equivalent to SET timezone TO ....)
• PGGEQO sets the default mode for the genetic query optimizer. (Equivalent to SET geqo
TO ....)
Refer to the SQL command SET for information on correct values for these environment variables.
The following environment variables determine internal behavior of libpq; they override compiled-in
defaults.
• PGSYSCONFDIR sets the directory containing the pg_service.conf file and in a future ver-
sion possibly other system-wide configuration files.
• PGLOCALEDIR sets the directory containing the locale files for message localization.
hostname:port:database:username:password
(You can add a reminder comment to the file by copying the line above and preceding it with #.) Each
of the first four fields can be a literal value, or *, which matches anything. The password field from the
first line that matches the current connection parameters will be used. (Therefore, put more-specific
entries first when you are using wildcards.) If an entry needs to contain : or \, escape this character
with \. The host name field is matched to the host connection parameter if that is specified, otherwise
to the hostaddr parameter if that is specified; if neither are given then the host name localhost
is searched for. The host name localhost is also searched for when the connection is a Unix-
domain socket connection and the host parameter matches libpq's default socket directory path. In a
standby server, a database field of replication matches streaming replication connections made
to the master server. The database field is of limited usefulness otherwise, because users have the same
password for all databases in the same cluster.
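As an illustration (all values hypothetical), a password file containing the single line

somehost:5432:mydb:appuser:secretpw

would supply the password secretpw whenever a connection is made to database mydb on host somehost, port 5432, as user appuser.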
On Unix systems, the permissions on a password file must disallow any access to world or group;
achieve this by a command such as chmod 0600 ~/.pgpass. If the permissions are less strict
than this, the file will be ignored. On Microsoft Windows, it is assumed that the file is stored in a
directory that is secure, so no special permissions check is made.
Service names can be defined in either a per-user service file or a system-wide file. If the same service
name exists in both the user and the system file, the user file takes precedence. By default, the per-user
service file is named ~/.pg_service.conf. On Microsoft Windows, it is named
%APPDATA%\postgresql\.pg_service.conf (where %APPDATA% refers to the Application Data
subdirectory in the user's profile). A different file name can be specified by setting the environment variable
PGSERVICEFILE. The system-wide file is named pg_service.conf. By default it is sought in
the etc directory of the PostgreSQL installation (use pg_config --sysconfdir to identify this
directory precisely). Another directory, but not a different file name, can be specified by setting the
environment variable PGSYSCONFDIR.
Either service file uses an “INI file” format where the section name is the service name and the para-
meters are connection parameters; see Section 34.1.2 for a list. For example:
# comment
[mydb]
host=somehost
port=5433
user=admin
Connection parameters obtained from a service file are combined with parameters obtained from other
sources. A service file setting overrides the corresponding environment variable, and in turn can be
overridden by a value given directly in the connection string. For example, using the above service
file, a connection string service=mydb port=5434 will use host somehost, port 5434, user
admin, and other parameters as set by environment variables or built-in defaults.
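A minimal sketch of opening such a connection from C, using the [mydb] stanza shown above and overriding the port:

#include <stdio.h>
#include <libpq-fe.h>

int
main(void)
{
    /* host=somehost and user=admin come from the service file; the port is overridden */
    PGconn *conn = PQconnectdb("service=mydb port=5434");

    if (PQstatus(conn) != CONNECTION_OK)
        fprintf(stderr, "connection failed: %s", PQerrorMessage(conn));
    PQfinish(conn);
    return 0;
}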
LDAP connection parameter lookup uses the connection service file pg_service.conf (see Sec-
tion 34.16). A line in a pg_service.conf stanza that starts with ldap:// will be recognized
as an LDAP URL and an LDAP query will be performed. The result must be a list of keyword =
value pairs which will be used to set connection options. The URL must conform to RFC 1959 and
be of the form
ldap://[hostname[:port]]/search_base?attribute?search_scope?filter
A sample LDAP entry that has been created with the LDIF file
version:1
dn:cn=mydatabase,dc=mycompany,dc=com
changetype:add
objectclass:top
objectclass:device
cn:mydatabase
description:host=dbserver.mycompany.com
description:port=5439
description:dbname=mydb
description:user=mydb_user
description:sslmode=require
might be queried with the following LDAP URL:
ldap://ldap.mycompany.com/dc=mycompany,dc=com?description?one?(cn=mydatabase)
You can also mix regular service file entries with LDAP lookups. A complete example for a stanza
in pg_service.conf would be:
# only host and port are stored in LDAP, specify dbname and user explicitly
[customerdb]
dbname=customer
user=appuser
ldap://ldap.acme.com/cn=dbserver,cn=hosts?pgconnectinfo?base?(objectclass=*)
libpq reads the system-wide OpenSSL configuration file. By default, this file is named
openssl.cnf and is located in the directory reported by openssl version -d. This default can be
overridden by setting environment variable OPENSSL_CONF to the name of the desired configuration file.
To allow the client to verify the identity of the server, place a root certificate on the client and a leaf
certificate signed by the root certificate on the server. To allow the server to verify the identity of the
client, place a root certificate on the server and a leaf certificate signed by the root certificate on the
client. One or more intermediate certificates (usually stored with the leaf certificate) can also be used
to link the leaf certificate to the root certificate.
Once a chain of trust has been established, there are two ways for the client to validate the leaf cer-
tificate sent by the server. If the parameter sslmode is set to verify-ca, libpq will verify that the
server is trustworthy by checking the certificate chain up to the root certificate stored on the client.
If sslmode is set to verify-full, libpq will also verify that the server host name matches the
name stored in the server certificate. The SSL connection will fail if the server certificate cannot be
verified. verify-full is recommended in most security-sensitive environments.
In verify-full mode, the host name is matched against the certificate's Subject Alternative Name
attribute(s), or against the Common Name attribute if no Subject Alternative Name of type dNSName
is present. If the certificate's name attribute starts with an asterisk (*), the asterisk will be treated as a
wildcard, which will match all characters except a dot (.). This means the certificate will not match
subdomains. If the connection is made using an IP address instead of a host name, the IP address will
be matched (without doing any DNS lookups).
To allow server certificate verification, one or more root certificates must be placed in the file
~/.postgresql/root.crt in the user's home directory. (On Microsoft Windows the file is
named %APPDATA%\postgresql\root.crt.) Intermediate certificates should also be added to
the file if they are needed to link the certificate chain sent by the server to the root certificates stored
on the client.
Certificate Revocation List (CRL) entries are also checked if the file ~/.postgresql/root.crl
exists (%APPDATA%\postgresql\root.crl on Microsoft Windows).
The location of the root certificate file and the CRL can be changed by setting the connection parame-
ters sslrootcert and sslcrl or the environment variables PGSSLROOTCERT and PGSSLCRL.
Note
For backwards compatibility with earlier versions of PostgreSQL, if a root CA file exists, the
behavior of sslmode=require will be the same as that of verify-ca, meaning the serv-
er certificate is validated against the CA. Relying on this behavior is discouraged, and appli-
cations that need certificate validation should always use verify-ca or verify-full.
On Unix systems, the permissions on the private key file must disallow any access to world or group;
achieve this by a command such as chmod 0600 ~/.postgresql/postgresql.key. Al-
ternatively, the file can be owned by root and have group read access (that is, 0640 permissions).
That setup is intended for installations where certificate and key files are managed by the operating
system. The user of libpq should then be made a member of the group that has access to those certifi-
cate and key files. (On Microsoft Windows, there is no file permissions check, since the
%APPDATA%\postgresql directory is presumed secure.)
The first certificate in postgresql.crt must be the client's certificate because it must match the
client's private key. “Intermediate” certificates can be optionally appended to the file — doing so
avoids requiring storage of intermediate certificates on the server (ssl_ca_file).
Eavesdropping
If a third party can examine the network traffic between the client and the server, it can read both
connection information (including the user name and password) and the data that is passed. SSL
uses encryption to prevent this.
Man-in-the-middle (MITM)
If a third party can modify the data while passing between the client and server, it can pretend to
be the server and therefore see and modify data even if it is encrypted. The third party can then
forward the connection information and data to the original server, making it impossible to detect
this attack. Common vectors to do this include DNS poisoning and address hijacking, whereby the
client is directed to a different server than intended. There are also several other attack methods
that can accomplish this. SSL uses certificate verification to prevent this, by authenticating the
server to the client.
Impersonation
If a third party can pretend to be an authorized client, it can simply access data it should not have
access to. Typically this can happen through insecure password management. SSL uses client
certificates to prevent this, by making sure that only holders of valid certificates can access the
server.
For a connection to be known secure, SSL usage must be configured on both the client and the server
before the connection is made. If it is only configured on the server, the client may end up sending
sensitive information (e.g., passwords) before it knows that the server requires high security. In libpq,
secure connections can be ensured by setting the sslmode parameter to verify-full or
verify-ca, and providing the system with a root certificate to verify against. This is analogous to using
an https URL for encrypted web browsing.
Once the server has been authenticated, the client can pass sensitive data. This means that up until this
point, the client does not need to know if certificates will be used for authentication, making it safe
to specify that only in the server configuration.
All SSL options carry overhead in the form of encryption and key-exchange, so there is a trade-off
that has to be made between performance and security. Table 34.1 illustrates the risks the different
sslmode values protect against, and what statement they make about security and overhead.
The difference between verify-ca and verify-full depends on the policy of the root CA. If a
public CA is used, verify-ca allows connections to a server that somebody else may have registered
with the CA. In this case, verify-full should always be used. If a local CA is used, or even a self-
signed certificate, using verify-ca often provides enough protection.
The default value for sslmode is prefer. As is shown in the table, this makes no sense from a
security point of view, and it only promises performance overhead if possible. It is only provided as
the default for backward compatibility, and is not recommended in secure deployments.
PQinitOpenSSL
When do_ssl is non-zero, libpq will initialize the OpenSSL library before first opening a data-
base connection. When do_crypto is non-zero, the libcrypto library will be initialized. By
default (if PQinitOpenSSL is not called), both libraries are initialized. When SSL support is
not compiled in, this function is present but does nothing.
If your application uses and initializes either OpenSSL or its underlying libcrypto library,
you must call this function with zeroes for the appropriate parameter(s) before first opening a
database connection. Also be sure that you have done that initialization before opening a database
connection.
PQinitSSL
PQinitSSL has been present since PostgreSQL 8.0, while PQinitOpenSSL was added in
PostgreSQL 8.4, so PQinitSSL might be preferable for applications that need to work with
older versions of libpq.
PQisthreadsafe
int PQisthreadsafe();
One thread restriction is that no two threads attempt to manipulate the same PGconn object at the
same time. In particular, you cannot issue concurrent commands from different threads through the
same connection object. (If you need to run concurrent commands, use multiple connections.)
PGresult objects are normally read-only after creation, and so can be passed around freely between
threads. However, if you use any of the PGresult-modifying functions described in Section 34.11
or Section 34.13, it's up to you to avoid concurrent operations on the same PGresult, too.
The deprecated functions PQrequestCancel and PQoidStatus are not thread-safe and should
not be used in multithread programs. PQrequestCancel can be replaced by PQcancel.
PQoidStatus can be replaced by PQoidValue.
If you are using Kerberos inside your application (in addition to inside libpq), you will need to do
locking around Kerberos calls because Kerberos functions are not thread-safe. See function
PQregisterThreadLock in the libpq source code for a way to do cooperative locking between libpq
and your application.
#include <libpq-fe.h>
If you failed to do that then you will normally get error messages from your compiler similar to:
• Point your compiler to the directory where the PostgreSQL header files were installed, by supplying
the -Idirectory option to your compiler. (In some cases the compiler will look into the directory
in question by default, so you can omit this option.) For instance, your compile command line could
look like:
cc -c -I/usr/local/pgsql/include testprog.c
If you are using makefiles then add the option to the CPPFLAGS variable:
CPPFLAGS += -I/usr/local/pgsql/include
If there is any chance that your program might be compiled by other users then you should not
hardcode the directory location like that. Instead, you can run the utility pg_config to find out
where the header files are on the local system:
$ pg_config --includedir
/usr/local/include
Note that this prints only the directory path; you still have to put -I in front of it yourself.
Failure to specify the correct option to the compiler will result in an error message such as
libpq-fe.h: No such file or directory.
• When linking the final program, specify the option -lpq so that the libpq library gets pulled in,
as well as the option -Ldirectory to point the compiler to the directory where the libpq library
resides. (Again, the compiler will search some directories by default.) For maximum portability,
put the -L option before the -lpq option. For example:

cc -o testprog testprog1.o testprog2.o -L/usr/local/pgsql/lib -lpq

You can find out the library directory using pg_config as well:
$ pg_config --libdir
/usr/local/pgsql/lib
Note again that this prints only the directory path, not a complete -L option. If the linker reports
undefined references to libpq functions (such as PQsetdbLogin), it means you forgot the -L option
or did not specify the right directory; if it cannot find -lpq at all, you forgot -lpq.
/*
* src/test/examples/testlibpq.c
*
*
* testlibpq.c
*
* Test the C version of libpq, the PostgreSQL frontend library.
*/
#include <stdio.h>
#include <stdlib.h>
#include "libpq-fe.h"
static void
exit_nicely(PGconn *conn)
{
PQfinish(conn);
exit(1);
}
int
main(int argc, char **argv)
{
const char *conninfo;
PGconn *conn;
PGresult *res;
int nFields;
int i,
j;
/*
* If the user supplies a parameter on the command line, use it as the
* conninfo string; otherwise default to setting dbname=postgres and using
* environment variables or defaults for all other connection parameters.
*/
if (argc > 1)
conninfo = argv[1];
else
conninfo = "dbname = postgres";
/*
* Should PQclear PGresult whenever it is no longer needed to avoid memory
* leaks
*/
PQclear(res);
/*
* Our test case here involves using a cursor, for which we must be inside
* a transaction block. We could do the whole thing with a single
* PQexec() of "select * from pg_database", but that's too trivial to make
* a good example.
*/
PQclear(res);
/*
* Fetch rows from pg_database, the system catalog of databases
*/
res = PQexec(conn, "DECLARE myportal CURSOR FOR select * from
pg_database");
if (PQresultStatus(res) != PGRES_COMMAND_OK)
{
fprintf(stderr, "DECLARE CURSOR failed: %s",
PQerrorMessage(conn));
PQclear(res);
exit_nicely(conn);
}
PQclear(res);
PQclear(res);
/* close the portal ... we don't bother to check for errors ... */
res = PQexec(conn, "CLOSE myportal");
PQclear(res);
return 0;
}
/*
* src/test/examples/testlibpq2.c
*
*
* testlibpq2.c
* Test of the asynchronous notification interface
*
* Start this program, then from psql in another window do
* NOTIFY TBL2;
* Repeat four times to get this program to exit.
*
* Or, if you want to get fancy, try this:
* populate a database with the following commands
* (provided in src/test/examples/testlibpq2.sql):
*
* CREATE SCHEMA TESTLIBPQ2;
* SET search_path = TESTLIBPQ2;
* CREATE TABLE TBL1 (i int4);
* CREATE TABLE TBL2 (i int4);
* CREATE RULE r1 AS ON INSERT TO TBL1 DO
* (INSERT INTO TBL2 VALUES (new.i); NOTIFY TBL2);
*
* Start this program, then from psql do this four times:
*
* INSERT INTO TESTLIBPQ2.TBL1 VALUES (10);
*/
#ifdef WIN32
#include <windows.h>
#endif
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <errno.h>
#include <sys/time.h>
#include <sys/types.h>
#ifdef HAVE_SYS_SELECT_H
#include <sys/select.h>
#endif
#include "libpq-fe.h"
static void
exit_nicely(PGconn *conn)
{
PQfinish(conn);
exit(1);
}
int
main(int argc, char **argv)
{
const char *conninfo;
PGconn *conn;
PGresult *res;
PGnotify *notify;
int nnotifies;
/*
* If the user supplies a parameter on the command line, use it as the
* conninfo string; otherwise default to setting dbname=postgres and using
* environment variables or defaults for all other connection parameters.
*/
if (argc > 1)
conninfo = argv[1];
else
conninfo = "dbname = postgres";
/*
* Should PQclear PGresult whenever it is no longer needed to avoid memory
* leaks
*/
PQclear(res);
/*
* Issue LISTEN command to enable notifications from the rule's NOTIFY.
*/
res = PQexec(conn, "LISTEN TBL2");
if (PQresultStatus(res) != PGRES_COMMAND_OK)
{
fprintf(stderr, "LISTEN command failed: %s",
PQerrorMessage(conn));
PQclear(res);
exit_nicely(conn);
}
PQclear(res);
sock = PQsocket(conn);
if (sock < 0)
break; /* shouldn't happen */
FD_ZERO(&input_mask);
FD_SET(sock, &input_mask);
fprintf(stderr, "Done.\n");
return 0;
}
/*
* src/test/examples/testlibpq3.c
*
*
* testlibpq3.c
* Test out-of-line parameters and binary I/O.
*
* Before running this, populate a database with the following commands
* (provided in src/test/examples/testlibpq3.sql):
*
* CREATE SCHEMA testlibpq3;
* SET search_path = testlibpq3;
* CREATE TABLE test1 (i int4, t text, b bytea);
* INSERT INTO test1 values (1, 'joe''s place', '\\000\\001\\002\\003\\004');
* INSERT INTO test1 values (2, 'ho there', '\\004\\003\\002\\001\\000');
*
* The expected output is:
*
* tuple 0: got
* i = (4 bytes) 1
* t = (11 bytes) 'joe's place'
* b = (5 bytes) \000\001\002\003\004
*
* tuple 1: got
* i = (4 bytes) 2
* t = (8 bytes) 'ho there'
* b = (5 bytes) \004\003\002\001\000
*/
#ifdef WIN32
#include <windows.h>
#endif
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <string.h>
#include <sys/types.h>
#include "libpq-fe.h"
/* for ntohl/htonl */
#include <netinet/in.h>
#include <arpa/inet.h>
static void
exit_nicely(PGconn *conn)
{
PQfinish(conn);
exit(1);
}
/*
* This function prints a query result that is a binary-format fetch from
* a table defined as in the comment above. We split it out because the
* main() function uses it twice.
*/
static void
show_binary_results(PGresult *res)
{
int i,
j;
int i_fnum,
t_fnum,
b_fnum;
/*
* The binary representation of INT4 is in network byte order, which
* we'd better coerce to the local byte order.
*/
ival = ntohl(*((uint32_t *) iptr));
/*
* The binary representation of TEXT is, well, text, and since libpq
* was nice enough to append a zero byte to it, it'll work just fine
* as a C string.
*
* The binary representation of BYTEA is a bunch of bytes, which could
* include embedded nulls so we have to pay attention to field length.
*/
blen = PQgetlength(res, i, b_fnum);
int
main(int argc, char **argv)
{
const char *conninfo;
PGconn *conn;
PGresult *res;
const char *paramValues[1];
int paramLengths[1];
int paramFormats[1];
uint32_t binaryIntVal;
/*
* If the user supplies a parameter on the command line, use it as the
* conninfo string; otherwise default to setting dbname=postgres and using
* environment variables or defaults for all other connection parameters.
*/
if (argc > 1)
conninfo = argv[1];
else
conninfo = "dbname = postgres";
/*
* The point of this program is to illustrate use of PQexecParams() with
* out-of-line parameters, as well as binary transmission of data.
*
* This first example transmits the parameters as text, but receives the
* results in binary format. By using out-of-line parameters we can avoid
* a lot of tedious mucking about with quoting and escaping, even though
* the data is text. Notice how we don't have to do anything special with
* the quote mark in the parameter value.
*/
res = PQexecParams(conn,
"SELECT * FROM test1 WHERE t = $1",
1,      /* one param */
NULL,   /* let the backend deduce param type */
paramValues,
NULL,   /* don't need param lengths since text */
NULL,   /* default to all text params */
1);     /* ask for binary results */
if (PQresultStatus(res) != PGRES_TUPLES_OK)
{
fprintf(stderr, "SELECT failed: %s", PQerrorMessage(conn));
PQclear(res);
exit_nicely(conn);
}
show_binary_results(res);
PQclear(res);
/*
* In this second example we transmit an integer parameter in binary form,
* and again retrieve the results in binary form.
*
* Although we tell PQexecParams we are letting the backend deduce
* parameter type, we really force the decision by casting the parameter
* symbol in the query text. This is a good safety measure when sending
* binary parameters.
*/
res = PQexecParams(conn,
"SELECT * FROM test1 WHERE i = $1::int4",
1,      /* one param */
NULL,   /* let the backend deduce param type */
paramValues,
paramLengths,
paramFormats,
1);     /* ask for binary results */
if (PQresultStatus(res) != PGRES_TUPLES_OK)
{
fprintf(stderr, "SELECT failed: %s", PQerrorMessage(conn));
PQclear(res);
exit_nicely(conn);
}
show_binary_results(res);
PQclear(res);
return 0;
}
Chapter 35. Large Objects
PostgreSQL has a large object facility, which provides stream-style access to user data that is stored
in a special large-object structure. Streaming access is useful when working with data values that are
too large to manipulate conveniently as a whole.
This chapter describes the implementation and the programming and query language interfaces to
PostgreSQL large object data. We use the libpq C library for the examples in this chapter, but most
programming interfaces native to PostgreSQL support equivalent functionality. Other interfaces might
use the large object interface internally to provide generic support for large values. This is not described
here.
35.1. Introduction
All large objects are stored in a single system table named pg_largeobject. Each large object
also has an entry in the system table pg_largeobject_metadata. Large objects can be created,
modified, and deleted using a read/write API that is similar to standard operations on files.
PostgreSQL also supports a storage system called “TOAST”, which automatically stores values larger
than a single database page into a secondary storage area per table. This makes the large object facility
partially obsolete. One remaining advantage of the large object facility is that it allows values up to
4 TB in size, whereas TOASTed fields can be at most 1 GB. Also, reading and updating portions of
a large object can be done efficiently, while most operations on a TOASTed field will read or write
the whole value as a unit.
The chunks stored for a large object do not have to be contiguous. For example, if an application
opens a new large object, seeks to offset 1000000, and writes a few bytes there, this does not result in
allocation of 1000000 bytes worth of storage; only of chunks covering the range of data bytes actually
written. A read operation will, however, read out zeroes for any unallocated locations preceding the
last existing chunk. This corresponds to the common behavior of “sparsely allocated” files in Unix
file systems.
As of PostgreSQL 9.0, large objects have an owner and a set of access permissions, which can be
managed using GRANT and REVOKE. SELECT privileges are required to read a large object, and
UPDATE privileges are required to write or truncate it. Only the large object's owner (or a database
superuser) can delete, comment on, or change the owner of a large object. To adjust this behavior for
compatibility with prior releases, see the lo_compat_privileges run-time parameter.
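For example, read access on a particular large object could be granted to a role like this (the OID
24528 and the role webuser are illustrative):

GRANT SELECT ON LARGE OBJECT 24528 TO webuser;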
All large object manipulation using these functions must take place within an SQL transaction block,
since large object file descriptors are only valid for the duration of a transaction.
If an error occurs while executing any one of these functions, the function will return an
otherwise-impossible value, typically 0 or -1. A message describing the error is stored in the connection object and
can be retrieved with PQerrorMessage.
Client applications that use these functions should include the header file libpq/libpq-fs.h and
link with the libpq library.
To create a new large object, call

Oid lo_creat(PGconn *conn, int mode);

The return value is the OID that was assigned to the new large object, or
InvalidOid (zero) on failure. mode is unused and ignored as of PostgreSQL 8.1; however, for
backward compatibility with earlier releases it is best to set it to INV_READ, INV_WRITE, or IN-
V_READ | INV_WRITE. (These symbolic constants are defined in the header file libpq/libpq-
fs.h.)
An example:
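(A minimal call, assuming conn is an open connection:)

Oid lobjId = lo_creat(conn, INV_READ | INV_WRITE);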
The function

Oid lo_create(PGconn *conn, Oid lobjId);

also creates a new large object. The OID to be assigned can be specified by lobjId; if so, failure
occurs if that OID is already in use for some large object. If lobjId is InvalidOid (zero) then
lo_create assigns an unused OID (this is the same behavior as lo_creat). The return value is
the OID that was assigned to the new large object, or InvalidOid (zero) on failure.
lo_create is new as of PostgreSQL 8.1; if this function is run against an older server version, it
will fail and return InvalidOid.
An example:
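(A sketch; desired_oid is a hypothetical variable holding the OID you want the new object to have:)

Oid lobjId = lo_create(conn, desired_oid);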
To import an operating system file as a large object, call

Oid lo_import(PGconn *conn, const char *filename);

filename specifies the operating system name of the file to be imported as a large object. The return
value is the OID that was assigned to the new large object, or InvalidOid (zero) on failure. Note
that the file is read by the client interface library, not by the server; so it must exist in the client file
system and be readable by the client application.
The function

Oid lo_import_with_oid(PGconn *conn, const char *filename, Oid lobjId);

also imports a new large object. The OID to be assigned can be specified by lobjId; if so, failure
occurs if that OID is already in use for some large object. If lobjId is InvalidOid (zero) then
lo_import_with_oid assigns an unused OID (this is the same behavior as lo_import). The
return value is the OID that was assigned to the new large object, or InvalidOid (zero) on failure.
To export a large object into an operating system file, call

int lo_export(PGconn *conn, Oid lobjId, const char *filename);

The lobjId argument specifies the OID of the large object to export and the filename argument
specifies the operating system name of the file. Note that the file is written by the client interface
library, not by the server. Returns 1 on success, -1 on failure.
To open an existing large object for reading or writing, call

int lo_open(PGconn *conn, Oid lobjId, int mode);

The lobjId argument specifies the OID of the large object to open. The mode bits control whether
the object is opened for reading (INV_READ), writing (INV_WRITE), or both. (These symbolic con-
stants are defined in the header file libpq/libpq-fs.h.) lo_open returns a (non-negative) large
object descriptor for later use in lo_read, lo_write, lo_lseek, lo_lseek64, lo_tell,
lo_tell64, lo_truncate, lo_truncate64, and lo_close. The descriptor is only valid for
the duration of the current transaction. On failure, -1 is returned.
The server currently does not distinguish between modes INV_WRITE and INV_READ | IN-
V_WRITE: you are allowed to read from the descriptor in either case. However there is a significant
difference between these modes and INV_READ alone: with INV_READ you cannot write on the
descriptor, and the data read from it will reflect the contents of the large object at the time of the
transaction snapshot that was active when lo_open was executed, regardless of later writes by this
or other transactions. Reading from a descriptor opened with INV_WRITE returns data that reflects
all writes of other committed transactions as well as writes of the current transaction. This is similar
to the behavior of REPEATABLE READ versus READ COMMITTED transaction modes for ordinary
SQL SELECT commands.
lo_open will fail if SELECT privilege is not available for the large object, or if INV_WRITE is
specified and UPDATE privilege is not available. (Prior to PostgreSQL 11, these privilege checks were
instead performed at the first actual read or write call using the descriptor.) These privilege checks can
be disabled with the lo_compat_privileges run-time parameter.
An example:
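(A minimal sketch; lobjId is assumed to hold the OID of an existing large object:)

int lobj_fd = lo_open(conn, lobjId, INV_READ | INV_WRITE);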
int lo_write(PGconn *conn, int fd, const char *buf, size_t len);
writes len bytes from buf (which must be of size len) to large object descriptor fd. The fd ar-
gument must have been returned by a previous lo_open. The number of bytes actually written is
returned (in the current implementation, this will always equal len unless there is an error). In the
event of an error, the return value is -1.
Although the len parameter is declared as size_t, this function will reject length values larger than
INT_MAX. In practice, it's best to transfer data in chunks of at most a few megabytes anyway.
The function

int lo_read(PGconn *conn, int fd, char *buf, size_t len);

reads up to len bytes from large object descriptor fd into buf (which must be of size len). The
fd argument must have been returned by a previous lo_open. The number of bytes actually read
is returned; this will be less than len if the end of the large object is reached first. In the event of
an error, the return value is -1.
Although the len parameter is declared as size_t, this function will reject length values larger than
INT_MAX. In practice, it's best to transfer data in chunks of at most a few megabytes anyway.
To change the current read or write location associated with a large object descriptor, call

int lo_lseek(PGconn *conn, int fd, int offset, int whence);

This function moves the current location pointer for the large object descriptor identified by fd to the
new location specified by offset. The valid values for whence are SEEK_SET (seek from object
start), SEEK_CUR (seek from current position), and SEEK_END (seek from object end). The return
value is the new location pointer, or -1 on error.
When dealing with large objects that might exceed 2GB in size, instead use

pg_int64 lo_lseek64(PGconn *conn, int fd, pg_int64 offset, int whence);
This function has the same behavior as lo_lseek, but it can accept an offset larger than 2GB
and/or deliver a result larger than 2GB. Note that lo_lseek will fail if the new location pointer
would be greater than 2GB.
lo_lseek64 is new as of PostgreSQL 9.3. If this function is run against an older server version,
it will fail and return -1.
To obtain the current read or write location of a large object descriptor, call

int lo_tell(PGconn *conn, int fd);

When dealing with large objects that might exceed 2GB in size, instead use

pg_int64 lo_tell64(PGconn *conn, int fd);

This function has the same behavior as lo_tell, but it can deliver a result larger than 2GB. Note
that lo_tell will fail if the current read/write location is greater than 2GB.
lo_tell64 is new as of PostgreSQL 9.3. If this function is run against an older server version, it
will fail and return -1.
To truncate a large object to a given length, call

int lo_truncate(PGconn *conn, int fd, size_t len);

This function truncates the large object descriptor fd to length len. The fd argument must have been
returned by a previous lo_open. If len is greater than the large object's current length, the large
object is extended to the specified length with null bytes ('\0'). On success, lo_truncate returns
zero. On error, the return value is -1.
Although the len parameter is declared as size_t, lo_truncate will reject length values larger
than INT_MAX.
When dealing with large objects that might exceed 2GB in size, instead use

int lo_truncate64(PGconn *conn, int fd, pg_int64 len);
This function has the same behavior as lo_truncate, but it can accept a len value exceeding 2GB.
lo_truncate is new as of PostgreSQL 8.3; if this function is run against an older server version,
it will fail and return -1.
lo_truncate64 is new as of PostgreSQL 9.3; if this function is run against an older server version,
it will fail and return -1.
A large object descriptor can be closed by calling

int lo_close(PGconn *conn, int fd);

where fd is a large object descriptor returned by lo_open. On success, lo_close returns zero.
On error, the return value is -1.
Any large object descriptors that remain open at the end of a transaction will be closed automatically.
To remove a large object from the database, call

int lo_unlink(PGconn *conn, Oid lobjId);

The lobjId argument specifies the OID of the large object to remove. Returns 1 if successful, -1
on failure.
There are additional server-side functions corresponding to each of the client-side functions described
earlier; indeed, for the most part the client-side functions are simply interfaces to the equivalent serv-
er-side functions. The ones just as convenient to call via SQL commands are lo_creat, lo_cre-
ate, lo_unlink, lo_import, and lo_export. Here are examples of their use:
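For instance, the image table used below might be created and loaded like this (taken to have a text
column name and an oid column raster):

CREATE TABLE image (
    name        text,
    raster      oid
);

INSERT INTO image (name, raster)
    VALUES ('beautiful image', lo_import('/etc/motd'));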
INSERT INTO image (name, raster)   -- same as above, but specify OID to use
    VALUES ('beautiful image', lo_import('/etc/motd', 68583));
The server-side lo_import and lo_export functions behave considerably differently from their
client-side analogs. These two functions read and write files in the server's file system, using the
permissions of the database's owning user. Therefore, by default their use is restricted to superusers.
In contrast, the client-side import and export functions read and write files in the client's file system,
using the permissions of the client program. The client-side functions do not require any database
privileges, except the privilege to read or write the large object in question.
Caution
It is possible to GRANT use of the server-side lo_import and lo_export functions to
non-superusers, but careful consideration of the security implications is required. A malicious
user of such privileges could easily parlay them into becoming superuser (for example by
rewriting server configuration files), or could attack the rest of the server's file system without
bothering to obtain database superuser privileges as such. Access to roles having such privilege
must therefore be guarded just as carefully as access to superuser roles. Nonetheless, if use
of server-side lo_import or lo_export is needed for some routine task, it's safer to use
a role with such privileges than one with full superuser privileges, as that helps to reduce the
risk of damage from accidental errors.
The functionality of lo_read and lo_write is also available via server-side calls, but the names of
the server-side functions differ from the client side interfaces in that they do not contain underscores.
You must call these functions as loread and lowrite.
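For example, the first bytes of a large object could be read in SQL like this (the OID 24528 is
illustrative, and 262144 is the numeric value of INV_READ, since the symbolic constants from
libpq/libpq-fs.h are not available in SQL):

BEGIN;
SELECT loread(lo_open(24528, 262144), 32);
COMMIT;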
/*
*-------------------------------------------------------------------------
*
* testlo.c
* test using large objects with libpq
*
* Portions Copyright (c) 1996-2018, PostgreSQL Global Development Group
* Portions Copyright (c) 1994, Regents of the University of
California
*
*
* IDENTIFICATION
* src/test/examples/testlo.c
*
*-------------------------------------------------------------------------
*/
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include "libpq-fe.h"
#include "libpq/libpq-fs.h"
/*
* importFile -
*    import file "in_filename" into database as large object "lobjOid"
*
*/
static Oid
importFile(PGconn *conn, char *filename)
{
Oid lobjId;
int lobj_fd;
char buf[BUFSIZE];
int nbytes,
tmp;
int fd;
/*
* open the file to be read in
*/
fd = open(filename, O_RDONLY, 0666);
if (fd < 0)
{ /* error */
fprintf(stderr, "cannot open unix file\"%s\"\n", filename);
}
/*
* create the large object
*/
lobjId = lo_creat(conn, INV_READ | INV_WRITE);
if (lobjId == 0)
fprintf(stderr, "cannot create large object");

lobj_fd = lo_open(conn, lobjId, INV_READ | INV_WRITE);
/*
* read in from the Unix file and write to the inversion file
*/
while ((nbytes = read(fd, buf, BUFSIZE)) > 0)
{
tmp = lo_write(conn, lobj_fd, buf, nbytes);
if (tmp < nbytes)
fprintf(stderr, "error while reading \"%s\"",
filename);
}
close(fd);
lo_close(conn, lobj_fd);
return lobjId;
}
static void
pickout(PGconn *conn, Oid lobjId, int start, int len)
{
int lobj_fd;
char *buf;
int nbytes;
int nread;
nread = 0;
static void
overwrite(PGconn *conn, Oid lobjId, int start, int len)
{
int lobj_fd;
char *buf;
int nbytes;
int nwritten;
int i;
nwritten = 0;
while (len - nwritten > 0)
{
nbytes = lo_write(conn, lobj_fd, buf + nwritten, len - nwritten);
nwritten += nbytes;
if (nbytes <= 0)
{
fprintf(stderr, "\nWRITE FAILED!\n");
break;
}
}
free(buf);
fprintf(stderr, "\n");
lo_close(conn, lobj_fd);
}
/*
* exportFile -
* export large object "lobjOid" to file "out_filename"
*
*/
static void
exportFile(PGconn *conn, Oid lobjId, char *filename)
{
int lobj_fd;
char buf[BUFSIZE];
int nbytes,
tmp;
int fd;
/*
* open the large object
*/
lobj_fd = lo_open(conn, lobjId, INV_READ);
if (lobj_fd < 0)
fprintf(stderr, "cannot open large object %u", lobjId);
/*
* open the file to be written to
*/
fd = open(filename, O_CREAT | O_WRONLY | O_TRUNC, 0666);
if (fd < 0)
{ /* error */
fprintf(stderr, "cannot open unix file\"%s\"",
filename);
}
/*
* read in from the inversion file and write to the Unix file
*/
while ((nbytes = lo_read(conn, lobj_fd, buf, BUFSIZE)) > 0)
{
tmp = write(fd, buf, nbytes);
if (tmp < nbytes)
{
fprintf(stderr, "error while writing \"%s\"",
filename);
}
}
lo_close(conn, lobj_fd);
close(fd);
return;
}
static void
exit_nicely(PGconn *conn)
{
PQfinish(conn);
exit(1);
}
int
main(int argc, char **argv)
{
char *in_filename,
*out_filename;
char *database;
Oid lobjOid;
PGconn *conn;
PGresult *res;
if (argc != 4)
{
fprintf(stderr, "Usage: %s database_name in_filename
out_filename\n",
argv[0]);
exit(1);
}
database = argv[1];
in_filename = argv[2];
out_filename = argv[3];
/*
* set up the connection
*/
conn = PQsetdb(NULL, NULL, NULL, NULL, database);
Chapter 36. ECPG - Embedded SQL in
C
This chapter describes the embedded SQL package for PostgreSQL. It was written by Linus Tolke and
Michael Meskes. Originally it was written to work with C. It also works with C++, but it does not
recognize all C++ constructs yet.
This documentation is quite incomplete. But since this interface is standardized, additional information
can be found in many resources about SQL.
Embedded SQL has advantages over other methods for handling SQL commands from C code. First,
it takes care of the tedious passing of information to and from variables in your C program. Second,
the SQL code in the program is checked at build time for syntactical correctness. Third, embedded
SQL in C is specified in the SQL standard and supported by many other SQL database systems. The
PostgreSQL implementation is designed to match this standard as much as possible, and it is usually
possible to port embedded SQL programs written for other SQL databases to PostgreSQL with relative
ease.
As already stated, programs written for the embedded SQL interface are normal C programs with
special code inserted to perform database-related actions. This special code always has the form:

EXEC SQL ...;
These statements syntactically take the place of a C statement. Depending on the particular statement,
they can appear at the global level or within a function. Embedded SQL statements follow the case-
sensitivity rules of normal SQL code, and not those of C. Also they allow nested C-style comments
that are part of the SQL standard. The C part of the program, however, follows the C standard of not
accepting nested comments.
• dbname[@hostname][:port]
• tcp:postgresql://hostname[:port][/dbname][?options]
• unix:postgresql://hostname[:port][/dbname][?options]
• a reference to a character variable containing one of the above forms (see examples)
• DEFAULT
If you specify the connection target literally (that is, not through a variable reference) and you don't
quote the value, then the case-insensitivity rules of normal SQL are applied. In that case you can also
double-quote the individual parameters separately as needed. In practice, it is probably less error-prone
to use a (single-quoted) string literal or a variable reference. The connection target DEFAULT initiates
a connection to the default database under the default user name. No separate user name or connection
name can be specified in that case.
• username
• username/password
As above, the parameters username and password can be an SQL identifier, an SQL string literal,
or a reference to a character variable.
If the connection target includes any options, those consist of keyword=value specifications
separated by ampersands (&). The allowed key words are the same ones recognized by libpq (see
Section 34.1.2). Spaces are ignored before any keyword or value, though not within or after one.
Note that there is no way to write & within a value.
The connection-name is used to handle multiple connections in one program. It can be omitted
if a program uses only one connection. The most recently opened connection becomes the current
connection, which is used by default when an SQL statement is to be executed (see later in this chapter).
If untrusted users have access to a database that has not adopted a secure schema usage pattern,
begin each session by removing publicly-writable schemas from search_path. For example,
add options=-c search_path= to options, or issue EXEC SQL SELECT pg_cata-
log.set_config('search_path', '', false); after connecting. This consideration is
not specific to ECPG; it applies to every interface for executing arbitrary SQL commands.
...
EXEC SQL CONNECT TO :target USER :user USING :passwd;
/* or EXEC SQL CONNECT TO :target USER :user/:passwd; */
The last form makes use of the variant referred to above as character variable reference. You will see
in later sections how C variables can be used in SQL statements when you prefix them with a colon.
Be advised that the format of the connection target is not specified in the SQL standard. So if you want
to develop portable applications, you might want to use something based on the last example above
to encapsulate the connection target string somewhere.
The first option is to explicitly choose a connection for each SQL statement, for example:

EXEC SQL AT connection-name SELECT ...;
This option is particularly suitable if the application needs to use several connections in mixed order.
If your application uses multiple threads of execution, they cannot share a connection concurrently.
You must either explicitly control access to the connection (using mutexes) or use a connection for
each thread.
The second option is to execute a statement to switch the current connection. That statement is:

EXEC SQL SET CONNECTION connection-name;
This option is particularly convenient if many statements are to be executed on the same connection.
#include <stdio.h>
int
main()
{
EXEC SQL CONNECT TO testdb1 AS con1 USER testuser;
EXEC SQL SELECT pg_catalog.set_config('search_path', '', false); EXEC SQL COMMIT;
EXEC SQL CONNECT TO testdb2 AS con2 USER testuser;
EXEC SQL SELECT pg_catalog.set_config('search_path', '', false); EXEC SQL COMMIT;
EXEC SQL CONNECT TO testdb3 AS con3 USER testuser;
EXEC SQL SELECT pg_catalog.set_config('search_path', '', false); EXEC SQL COMMIT;
• connection-name
• DEFAULT
• CURRENT
• ALL
It is good style that an application always explicitly disconnect from every connection it opened.
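For example, any of these forms can be used (the connection name con1 is illustrative):

EXEC SQL DISCONNECT;          /* close the current connection */
EXEC SQL DISCONNECT con1;     /* close a named connection */
EXEC SQL DISCONNECT ALL;      /* close all open connections */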
Inserting rows:
EXEC SQL INSERT INTO foo (number, ascii) VALUES (9999, 'doodad');
EXEC SQL COMMIT;
Deleting rows:

EXEC SQL DELETE FROM foo WHERE number = 9999;
EXEC SQL COMMIT;

Updates:

EXEC SQL UPDATE foo SET ascii = 'foobar' WHERE number = 9999;
EXEC SQL COMMIT;
SELECT statements that return a single result row can also be executed using EXEC SQL directly.
To handle result sets with multiple rows, an application has to use a cursor; see Section 36.3.2 below.
(As a special case, an application can fetch multiple rows at once into an array host variable; see
Section 36.4.4.3.1.)
Single-row select:
EXEC SQL SELECT foo INTO :FooBar FROM table1 WHERE ascii =
'doodad';
The tokens of the form :something are host variables, that is, they refer to variables in the C
program. They are explained in Section 36.4.
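Retrieving a result set with a cursor looks roughly like this (a sketch; the host variables :FooBar and
:DooDad are assumed to be declared in a declare section with suitable types):

EXEC SQL DECLARE foo_bar CURSOR FOR
    SELECT number, ascii FROM foo
    ORDER BY ascii;
EXEC SQL OPEN foo_bar;
EXEC SQL FETCH foo_bar INTO :FooBar, :DooDad;
...
EXEC SQL CLOSE foo_bar;
EXEC SQL COMMIT;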
For more details about declaration of the cursor, see DECLARE, and see FETCH for FETCH command
details.
Note
The ECPG DECLARE command does not actually cause a statement to be sent to the Post-
greSQL backend. The cursor is opened in the backend (using the backend's DECLARE com-
mand) at the point when the OPEN command is executed.
The statement is prepared using the command PREPARE. For the values that are not known yet, use
the placeholder “?”:
EXEC SQL PREPARE stmt1 FROM "SELECT oid, datname FROM pg_database WHERE oid = ?";
If a statement returns a single row, the application can call EXECUTE after PREPARE to execute the
statement, supplying the actual values for the placeholders with a USING clause:
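A sketch, assuming host variables dboid and dbname declared with suitable types to receive the
result columns:

EXEC SQL EXECUTE stmt1 INTO :dboid, :dbname USING 1;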
If a statement returns multiple rows, the application can use a cursor declared based on the prepared
statement. To bind input parameters, the cursor must be opened with a USING clause:
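A sketch along the same lines, reusing stmt1 and the host variables from above:

EXEC SQL DECLARE foo_bar CURSOR FOR stmt1;

/* when end of result set reached, break out of while loop */
EXEC SQL WHENEVER NOT FOUND DO BREAK;

EXEC SQL OPEN foo_bar USING 1;
while (1)
{
    EXEC SQL FETCH NEXT FROM foo_bar INTO :dboid, :dbname;
    ...
}
EXEC SQL CLOSE foo_bar;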
When you don't need the prepared statement anymore, you should deallocate it:

EXEC SQL DEALLOCATE PREPARE name;
For more details about PREPARE, see PREPARE. Also see Section 36.5 for more details about using
placeholders and input parameters.
Another way to exchange values between PostgreSQL backends and ECPG applications is the use of
SQL descriptors, described in Section 36.7.
36.4.1. Overview
Passing data between the C program and the SQL statements is particularly simple in embedded SQL.
Instead of having the program paste the data into the statement, which entails various complications,
such as properly quoting the value, you can simply write the name of a C variable into the SQL
statement, prefixed by a colon. For example:

EXEC SQL INSERT INTO sometable VALUES (:v1, 'foo', :v2);
This statement refers to two C variables named v1 and v2 and also uses a regular SQL string literal,
to illustrate that you are not restricted to use one kind of data or the other.
This style of inserting C variables in SQL statements works anywhere a value expression is expected
in an SQL statement.
Between the EXEC SQL BEGIN DECLARE SECTION; and EXEC SQL END DECLARE SECTION;
lines, there must be normal C variable declarations, such as:
int x = 4;
char foo[16], bar[16];
As you can see, you can optionally assign an initial value to the variable. The variable's scope is
determined by the location of its declaring section within the program. You can also declare variables
with the following syntax which implicitly creates a declare section:

EXEC SQL int i = 4;
The declarations are also echoed to the output file as normal C variables, so there's no need to declare
them again. Variables that are not intended to be used in SQL commands can be declared normally
outside these special sections.
The definition of a structure or union also must be listed inside a DECLARE section. Otherwise the
preprocessor cannot handle these types since it does not know the definition.
Here is an example:
/*
* assume this table:
* CREATE TABLE test1 (a int, b varchar(50));
*/
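A sketch of matching host variables and an INSERT for that table (the variable names v1 and v2 are
illustrative):

EXEC SQL BEGIN DECLARE SECTION;
int v1;
VARCHAR v2[50];
EXEC SQL END DECLARE SECTION;

EXEC SQL INSERT INTO test1 (a, b) VALUES (:v1, :v2);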
...
So the INTO clause appears between the select list and the FROM clause. The number of elements in
the select list and the list after INTO (also called the target list) must be equal.
...
...
do
{
...
EXEC SQL FETCH NEXT FROM foo INTO :v1, :v2;
...
} while (...);
Here the INTO clause appears after all the normal clauses.
In this respect, there are two kinds of data types: Some simple PostgreSQL data types, such as in-
teger and text, can be read and written by the application directly. Other PostgreSQL data types,
such as timestamp and numeric can only be accessed through special library functions; see Sec-
tion 36.4.4.2.
Table 36.1 shows which PostgreSQL data types correspond to which C data types. When you wish
to send or receive a value of a given PostgreSQL data type, you should declare a C variable of the
corresponding C data type in the declare section.
Table 36.1. Mapping Between PostgreSQL Data Types and C Variable Types
PostgreSQL data type Host variable type
smallint short
integer int
bigint long long int
decimal decimal (accessible only through the pgtypes library; see Section 36.4.4.2)
numeric numeric (accessible only through the pgtypes library; see Section 36.4.4.2)
real float
One way is using char[], an array of char, which is the most common way to handle character
data in C. For example:

char str[50];

Note that you have to take care of the length yourself. If you use this host variable as the target variable
of a query which returns a string with more than 49 characters, a buffer overflow occurs.
The other way is using the VARCHAR type, which is a special type provided by ECPG. The definition
on an array of type VARCHAR is converted into a named struct for every variable. A declaration like:
VARCHAR var[180];
is converted into:

struct varchar_var { int len; char arr[180]; } var;
The member arr hosts the string including a terminating zero byte. Thus, to store a string in a VAR-
CHAR host variable, the host variable has to be declared with the length including the zero byte ter-
minator. The member len holds the length of the string stored in the arr without the terminating
zero byte. When a host variable is used as input for a query, if strlen(arr) and len are different,
the shorter one is used.
VARCHAR can be written in upper or lower case, but not in mixed case.
char and VARCHAR host variables can also hold values of other SQL types, which will be stored
in their string forms.
The following subsections describe these special data types. For more details about pgtypes library
functions, see Section 36.6.
First, the program has to include the header file for the timestamp type:
#include <pgtypes_timestamp.h>
And after reading a value into the host variable, process it using pgtypes library functions. In the
following example, the timestamp value is converted into text (ASCII) form with the
PGTYPEStimestamp_to_asc() function:
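A minimal sketch producing output like the line below:

EXEC SQL BEGIN DECLARE SECTION;
timestamp ts;
EXEC SQL END DECLARE SECTION;

EXEC SQL SELECT now()::timestamp INTO :ts;
printf("ts = %s\n", PGTYPEStimestamp_to_asc(ts));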
ts = 2010-06-27 18:03:56.949343
In addition, the DATE type can be handled in the same way. The program has to include
pgtypes_date.h, declare a host variable of the date type, and convert a DATE value into text form
using the PGTYPESdate_to_asc() function. For more details about the pgtypes library functions,
see Section 36.6.
36.4.4.2.2. interval
The handling of the interval type is also similar to the timestamp and date types. It is required,
however, to allocate memory for an interval type value explicitly. In other words, the memory
space for the variable has to be allocated in the heap memory, not in the stack memory.
#include <stdio.h>
#include <stdlib.h>
#include <pgtypes_interval.h>
int
main(void)
{
EXEC SQL BEGIN DECLARE SECTION;
interval *in;
EXEC SQL END DECLARE SECTION;
in = PGTYPESinterval_new();
EXEC SQL SELECT '1 min'::interval INTO :in;
printf("interval = %s\n", PGTYPESinterval_to_asc(in));
PGTYPESinterval_free(in);
No functions are provided specifically for the decimal type. An application has to convert it to a
numeric variable using a pgtypes library function to do further processing.
#include <stdio.h>
#include <stdlib.h>
#include <pgtypes_numeric.h>
int
main(void)
{
EXEC SQL BEGIN DECLARE SECTION;
numeric *num;
numeric *num2;
decimal *dec;
EXEC SQL END DECLARE SECTION;
num = PGTYPESnumeric_new();
dec = PGTYPESdecimal_new();
PGTYPESnumeric_free(num2);
PGTYPESdecimal_free(dec);
PGTYPESnumeric_free(num);
36.4.4.3.1. Arrays
There are two use cases for arrays as host variables. The first is a way to store some text string in
char[] or VARCHAR[], as explained in Section 36.4.4.1. The second use case is to retrieve multiple
rows from a query result without using a cursor. Without an array, to process a query result consisting
of multiple rows, it is required to use a cursor and the FETCH command. But with array host variables,
multiple rows can be received at once. The length of the array has to be defined to be able to accom-
modate all rows, otherwise a buffer overflow will likely occur.
The following example scans the pg_database system table and shows all OIDs and names of the
available databases:
int
main(void)
{
EXEC SQL BEGIN DECLARE SECTION;
int dbid[8];
char dbname[8][16];
int i;
EXEC SQL END DECLARE SECTION;
This example shows the following result. (The exact values depend on local circumstances.)
oid=1, dbname=template1
oid=11510, dbname=template0
oid=11511, dbname=postgres
oid=313780, dbname=testdb
oid=0, dbname=
oid=0, dbname=
oid=0, dbname=
36.4.4.3.2. Structures
A structure whose member names match the column names of a query result, can be used to retrieve
multiple columns at once. The structure enables handling multiple column values in a single host
variable.
The following example retrieves OIDs, names, and sizes of the available databases from the
pg_database system table, using the pg_database_size() function. In this example, a
structure variable dbinfo_t with members whose names match each column in the SELECT result
is used to retrieve one result row without putting multiple host variables in the FETCH statement.
dbinfo_t dbval;
EXEC SQL END DECLARE SECTION;
memset(&dbval, 0, sizeof(dbinfo_t));
while (1)
{
/* Fetch multiple columns into one structure. */
EXEC SQL FETCH FROM cur1 INTO :dbval;
This example shows the following result. (The exact values depend on local circumstances.)
Structure host variables “absorb” as many columns as the structure has fields. Additional columns can
be assigned to other host variables. For example, the above program could also be restructured like
this, with the size variable outside the structure:
dbinfo_t dbval;
long long int size;
EXEC SQL END DECLARE SECTION;
memset(&dbval, 0, sizeof(dbinfo_t));
while (1)
{
/* Fetch multiple columns into one structure. */
EXEC SQL FETCH FROM cur1 INTO :dbval, :size;
36.4.4.3.3. Typedefs
Use the typedef keyword to map new types to already existing types.
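For example (the type names are illustrative):

EXEC SQL BEGIN DECLARE SECTION;
typedef char mychartype[40];
typedef long serial_t;
EXEC SQL END DECLARE SECTION;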
36.4.4.3.4. Pointers
You can declare pointers to the most common types. Note however that you cannot use pointers as
target variables of queries without auto-allocation. See Section 36.7 for more information on auto-al-
location.
36.4.5.1. Arrays
Multi-dimensional SQL-level arrays are not directly supported in ECPG. One-dimensional SQL-level
arrays can be mapped into C array host variables and vice-versa. However, when creating a statement
ecpg does not know the types of the columns, so that it cannot check if a C array is input into a corre-
sponding SQL-level array. When processing the output of a SQL statement, ecpg has the necessary
information and thus checks if both are arrays.
If a query accesses elements of an array separately, then this avoids the use of arrays in ECPG. Then,
a host variable with a type that can be mapped to the element type should be used. For example, if a
column type is array of integer, a host variable of type int can be used. Also if the element type
is varchar or text, a host variable of type char[] or VARCHAR[] can be used.
CREATE TABLE t3 (
ii integer[]
);
The following example program retrieves the 4th element of the array and stores it into a host variable
of type int:
EXEC SQL DECLARE cur1 CURSOR FOR SELECT ii[4] FROM t3;
while (1)
{
EXEC SQL FETCH FROM cur1 INTO :ii ;
printf("ii=%d\n", ii);
}
ii=4
To map multiple elements of an array column to elements of an array host variable, each element of
the array column and each element of the host variable array have to be managed separately, for example:
EXEC SQL DECLARE cur1 CURSOR FOR SELECT ii[1], ii[2], ii[3], ii[4]
FROM t3;
EXEC SQL OPEN cur1;
while (1)
{
EXEC SQL FETCH FROM cur1
INTO :ii_a[0], :ii_a[1], :ii_a[2], :ii_a[3];
...
}
while (1)
{
/* WRONG */
EXEC SQL FETCH FROM cur1 INTO :ii_a;
...
}
would not work correctly in this case, because you cannot map an array type column to an array host
variable directly.
Another workaround is to store arrays in their external string representation in host variables of type
char[] or VARCHAR[]. For more details about this representation, see Section 8.15.2. Note that
this means that the array cannot be accessed naturally as an array in the host program (without further
processing that parses the text representation).
For the following examples, assume a composite type comp_t with an integer field intval and a
character-string field textval, and a table t4 whose single column compval is of type comp_t.
The most obvious solution is to access each attribute separately. The following program retrieves data
from the example table by selecting each attribute of the type comp_t separately:
while (1)
{
/* Fetch each element of the composite type column into host
variables. */
EXEC SQL FETCH FROM cur1 INTO :intval, :textval;
To enhance this example, the host variables to store values in the FETCH command can be gathered
into one structure. For more details about the host variable in the structure form, see Section 36.4.4.3.2.
To switch to the structure, the example can be modified as below. The two host variables, intval and
textval, become members of the comp_t structure, and the structure is specified on the FETCH
command.
comp_t compval;
EXEC SQL END DECLARE SECTION;
while (1)
{
/* Put all values in the SELECT list into one structure. */
EXEC SQL FETCH FROM cur1 INTO :compval;
Although a structure is used in the FETCH command, the attribute names in the SELECT clause are
specified one by one. This can be enhanced by using a * to ask for all attributes of the composite
type value.
...
EXEC SQL DECLARE cur1 CURSOR FOR SELECT (compval).* FROM t4;
EXEC SQL OPEN cur1;
while (1)
{
/* Put all values in the SELECT list into one structure. */
EXEC SQL FETCH FROM cur1 INTO :compval;
This way, composite types can be mapped into structures almost seamlessly, even though ECPG does
not understand the composite type itself.
Finally, it is also possible to store composite type values in their external string representation in host
variables of type char[] or VARCHAR[]. But that way, it is not easily possible to access the fields
of the value from the host program.
Here is an example using the data type complex from the example in Section 38.12. The external
string representation of that type is (%f,%f), as defined by the complex_in() and
complex_out() functions in Section 38.12. The following example inserts the complex type
values (1,1) and (3,3) into the columns a and b, and then selects them from the table.
while (1)
{
EXEC SQL FETCH FROM cur1 INTO :a, :b;
printf("a=%s, b=%s\n", a.arr, b.arr);
}
a=(1,1), b=(3,3)
Another workaround is to avoid the direct use of user-defined types in ECPG and instead create
a function or cast that converts between the user-defined type and a primitive type that ECPG can
handle. Note, however, that type casts, especially implicit ones, should be introduced into the type
system very carefully.
For example,
a = 1;
b = 2;
c = 3;
d = 4;
36.4.6. Indicators
The examples above do not handle null values. In fact, the retrieval examples will raise an error if
they fetch a null value from the database. To be able to pass null values to the database or retrieve
null values from the database, you need to append a second host variable specification to each host
variable that contains data. This second host variable is called the indicator and contains a flag that
tells whether the datum is null, in which case the value of the real host variable is ignored. Here is an
example that handles the retrieval of null values correctly:
EXEC SQL BEGIN DECLARE SECTION;
VARCHAR val;
int val_ind;
EXEC SQL END DECLARE SECTION;

...

EXEC SQL SELECT b INTO :val :val_ind FROM test1;
The indicator variable val_ind will be zero if the value was not null, and it will be negative if the
value was null. (See Section 36.16 to enable Oracle-specific behavior.)
The indicator has another function: if the indicator value is positive, it means that the value is not null,
but it was truncated when it was stored in the host variable.
EXECUTE IMMEDIATE can be used for SQL statements that do not return a result set (e.g., DDL,
INSERT, UPDATE, DELETE). You cannot execute statements that retrieve data (e.g., SELECT) this
way. The next section describes how to do that.
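A sketch, reusing the foo table from the earlier examples and a string host variable stmt:

EXEC SQL BEGIN DECLARE SECTION;
const char *stmt = "UPDATE foo SET ascii = 'changed' WHERE number = 9999";
EXEC SQL END DECLARE SECTION;

EXEC SQL EXECUTE IMMEDIATE :stmt;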
A more powerful way is to prepare a statement once and then execute specific versions of it by
substituting parameters. When preparing the statement, write question marks where you want to
substitute parameters later. For example:
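A sketch, along the same lines as the example in Section 36.3.3:

EXEC SQL PREPARE mystmt FROM "SELECT oid, datname FROM pg_database WHERE oid = ?";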
When you don't need the prepared statement anymore, you should deallocate it:

EXEC SQL DEALLOCATE PREPARE name;
An EXECUTE command can have an INTO clause, a USING clause, both, or neither.
If a query is expected to return more than one result row, a cursor should be used, as in the following
example. (See Section 36.3.2 for more details about the cursor.)
while (1)
{
EXEC SQL FETCH cursor1 INTO :dbaname,:datname;
printf("dbaname=%s, datname=%s\n", dbaname, datname);
}
PGTYPESdate_today(&date1);
EXEC SQL SELECT started, duration INTO :ts1, :iv1 FROM datetbl
WHERE d=:date1;
PGTYPEStimestamp_add_interval(&ts1, &iv1, &tsout);
out = PGTYPEStimestamp_to_asc(&tsout);
printf("Started + duration: %s\n", out);
PGTYPESchar_free(out);
The following functions can be used to work with the numeric type:
PGTYPESnumeric_new
numeric *PGTYPESnumeric_new(void);
PGTYPESnumeric_free
PGTYPESnumeric_from_asc
Valid formats are for example: -2, .794, +3.44, 592.49E07 or -32.84e-4. If the value
could be parsed successfully, a valid pointer is returned, else the NULL pointer. At the moment
ECPG always parses the complete string and so it currently does not support to store the address
of the first invalid character in *endptr. You can safely set endptr to NULL.
PGTYPESnumeric_to_asc
Returns a pointer to a string allocated by malloc that contains the string representation of the
numeric type num.
The numeric value will be printed with dscale decimal digits, with rounding applied if neces-
sary. The result must be freed with PGTYPESchar_free().
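A minimal sketch combining these functions (error handling omitted):

numeric *n = PGTYPESnumeric_from_asc("592.49", NULL);
char *s = PGTYPESnumeric_to_asc(n, 2);
printf("%s\n", s);          /* prints 592.49 */
PGTYPESchar_free(s);
PGTYPESnumeric_free(n);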
PGTYPESnumeric_add
The function adds the variables var1 and var2 into the result variable result. The function
returns 0 on success and -1 in case of error.
PGTYPESnumeric_sub
Subtract two numeric variables and return the result in a third one.
The function subtracts the variable var2 from the variable var1. The result of the operation is
stored in the variable result. The function returns 0 on success and -1 in case of error.
PGTYPESnumeric_mul
Multiply two numeric variables and return the result in a third one.
The function multiplies the variables var1 and var2. The result of the operation is stored in the
variable result. The function returns 0 on success and -1 in case of error.
PGTYPESnumeric_div
Divide two numeric variables and return the result in a third one.
The function divides the variables var1 by var2. The result of the operation is stored in the
variable result. The function returns 0 on success and -1 in case of error.
PGTYPESnumeric_cmp
This function compares two numeric variables. In case of error, INT_MAX is returned. On success,
the function returns one of three possible results:
PGTYPESnumeric_from_int
This function accepts a variable of type signed int and stores it in the numeric variable var. Upon
success, 0 is returned and -1 in case of a failure.
PGTYPESnumeric_from_long
This function accepts a variable of type signed long int and stores it in the numeric variable var.
Upon success, 0 is returned and -1 in case of a failure.
PGTYPESnumeric_copy
This function copies over the value of the variable that src points to into the variable that dst
points to. It returns 0 on success and -1 if an error occurs.
PGTYPESnumeric_from_double
This function accepts a variable of type double and stores the result in the variable that dst points
to. It returns 0 on success and -1 if an error occurs.
PGTYPESnumeric_to_double
The function converts the numeric value from the variable that nv points to into the double vari-
able that dp points to. It returns 0 on success and -1 if an error occurs, including overflow. On
overflow, the global variable errno will be set to PGTYPES_NUM_OVERFLOW additionally.
PGTYPESnumeric_to_int
The function converts the numeric value from the variable that nv points to into the integer vari-
able that ip points to. It returns 0 on success and -1 if an error occurs, including overflow. On
overflow, the global variable errno will be set to PGTYPES_NUM_OVERFLOW additionally.
PGTYPESnumeric_to_long
The function converts the numeric value from the variable that nv points to into the long integer
variable that lp points to. It returns 0 on success and -1 if an error occurs, including overflow. On
overflow, the global variable errno will be set to PGTYPES_NUM_OVERFLOW additionally.
PGTYPESnumeric_to_decimal
The function converts the numeric value from the variable that src points to into the decimal
variable that dst points to. It returns 0 on success and -1 if an error occurs, including overflow.
On overflow, the global variable errno will be set to PGTYPES_NUM_OVERFLOW additionally.
PGTYPESnumeric_from_decimal
The function converts the decimal value from the variable that src points to into the numeric
variable that dst points to. It returns 0 on success and -1 if an error occurs. Since the decimal
type is implemented as a limited version of the numeric type, overflow cannot occur with this
conversion.
The following functions can be used to work with the date type:
PGTYPESdate_from_timestamp
The function receives a timestamp as its only argument and returns the extracted date part from
this timestamp.
PGTYPESdate_from_asc
The function receives a C char* string str and a pointer to a C char* string endptr. At the
moment ECPG always parses the complete string and so it currently does not support to store the
address of the first invalid character in *endptr. You can safely set endptr to NULL.
Note that the function always assumes MDY-formatted dates and there is currently no variable
to change that within ECPG.
Input Result
1999.008 year and day of year
J2451187 Julian day
January 8, 99 BC year 99 before the Common Era
PGTYPESdate_to_asc
The function receives the date dDate as its only parameter. It will output the date in the form
1999-01-18, i.e., in the YYYY-MM-DD format. The result must be freed with PGTYPE-
Schar_free().
PGTYPESdate_julmdy
Extract the values for the day, the month and the year from a variable of type date.
The function receives the date d and a pointer to an array of 3 integer values mdy. The variable
name indicates the sequential order: mdy[0] will be set to contain the number of the month,
mdy[1] will be set to the value of the day and mdy[2] will contain the year.
PGTYPESdate_mdyjul
Create a date value from an array of 3 integers that specify the day, the month and the year of
the date.
The function receives the array of the 3 integers (mdy) as its first argument and as its second
argument a pointer to a variable of type date that should hold the result of the operation.
PGTYPESdate_dayofweek
Return a number representing the day of the week for a date value.
The function receives the date variable d as its only argument and returns an integer that indicates
the day of the week for this date.
• 0 - Sunday
• 1 - Monday
• 2 - Tuesday
• 3 - Wednesday
• 4 - Thursday
• 5 - Friday
• 6 - Saturday
PGTYPESdate_today
The function receives a pointer to a date variable (d) that it sets to the current date.
PGTYPESdate_fmt_asc
Convert a variable of type date to its textual representation using a format mask.
The function receives the date to convert (dDate), the format mask (fmtstring) and the string
that will hold the textual representation of the date (outbuf).
The following literals are the field specifiers you can use:
• dd - The number of the day of the month.
• ddd - The abbreviated name of the day of the week.
• mm - The number of the month of the year.
• mmm - The abbreviated name of the month.
• yy - The number of the year as a two-digit number.
• yyyy - The number of the year as a four-digit number.
Table 36.3 indicates a few possible formats. This will give you an idea of how to use this function.
All output lines are based on the same date: November 23, 1959.
Format Result
(ddd) mmm. dd, yyyy (Mon) Nov. 23, 1959
PGTYPESdate_defmt_asc
The function receives a pointer to the date value that should hold the result of the operation (d),
the format mask to use for parsing the date (fmt) and the C char* string containing the textual
representation of the date (str). The textual representation is expected to match the format mask.
However you do not need to have a 1:1 mapping of the string to the format mask. The function
only analyzes the sequential order and looks for the literals yy or yyyy that indicate the position
of the year, mm to indicate the position of the month and dd to indicate the position of the day.
Table 36.4 indicates a few possible formats. This will give you an idea of how to use this function.
The following functions can be used to work with the timestamp type:
PGTYPEStimestamp_from_asc
The function receives the string to parse (str) and a pointer to a C char* (endptr). At the
moment ECPG always parses the complete string and so it currently does not support to store the
address of the first invalid character in *endptr. You can safely set endptr to NULL.
In general, the input string can contain any combination of an allowed date specification, a white-
space character and an allowed time specification. Note that time zones are not supported by
ECPG. It can parse them but does not apply any calculation as the PostgreSQL server does for
example. Timezone specifiers are silently discarded.
PGTYPEStimestamp_to_asc
The function receives the timestamp tstamp as its only argument and returns an allocated string
that contains the textual representation of the timestamp. The result must be freed with PGTYPE-
Schar_free().
PGTYPEStimestamp_current
The function retrieves the current timestamp and saves it into the timestamp variable that ts
points to.
PGTYPEStimestamp_fmt_asc
The function receives a pointer to the timestamp to convert as its first argument (ts), a pointer
to the output buffer (output), the maximal length that has been allocated for the output buffer
(str_len) and the format mask to use for the conversion (fmtstr).
Upon success, the function returns 0 and a negative value if an error occurred.
You can use the following format specifiers for the format mask. The format specifiers are the
same ones that are used in the strftime function in libc. Any non-format specifier will be
copied into the output buffer.
• %C - is replaced by (year / 100) as decimal number; single digits are preceded by a zero.
• %D - is equivalent to %m/%d/%y.
• %E* %O* - POSIX locale extensions. The sequences %Ec %EC %Ex %EX %Ey %EY %Od %Oe
%OH %OI %Om %OM %OS %Ou %OU %OV %Ow %OW %Oy are supposed to provide alternative
representations.
Additionally, %OB is implemented to represent alternative month names (used standalone, without
the day mentioned).
• %e - is replaced by the day of month as a decimal number (1-31); single digits are preceded
by a blank.
• %F - is equivalent to %Y-%m-%d.
• %G - is replaced by a year as a decimal number with century. This year is the one that contains
the greater part of the week (Monday as the first day of the week).
• %g - is replaced by the same year as in %G, but as a decimal number without century (00-99).
• %k - is replaced by the hour (24-hour clock) as a decimal number (0-23); single digits are
preceded by a blank.
• %l - is replaced by the hour (12-hour clock) as a decimal number (1-12); single digits are
preceded by a blank.
• %n - is replaced by a newline.
• %R - is equivalent to %H:%M.
• %T - is equivalent to %H:%M:%S
• %t - is replaced by a tab.
• %U - is replaced by the week number of the year (Sunday as the first day of the week) as a
decimal number (00-53).
• %u - is replaced by the weekday (Monday as the first day of the week) as a decimal number
(1-7).
• %V - is replaced by the week number of the year (Monday as the first day of the week) as a
decimal number (01-53). If the week containing January 1 has four or more days in the new
year, then it is week 1; otherwise it is the last week of the previous year, and the next week
is week 1.
• %v - is equivalent to %e-%b-%Y.
• %W - is replaced by the week number of the year (Monday as the first day of the week) as a
decimal number (00-53).
• %w - is replaced by the weekday (Sunday as the first day of the week) as a decimal number (0-6).
• %z - is replaced by the time zone offset from UTC; a leading plus sign stands for east of UTC,
a minus sign for west of UTC, hours and minutes follow with two digits each and no delimiter
between them (common form for RFC 822 date headers).
• %-* - GNU libc extension. Do not do any padding when performing numerical outputs.
• %% - is replaced by %.
PGTYPEStimestamp_sub
Subtract one timestamp from another one and save the result in a variable of type interval.
The function will subtract the timestamp variable that ts2 points to from the timestamp variable
that ts1 points to and will store the result in the interval variable that iv points to.
The function returns 0 on success and a negative value if an error occurred.
PGTYPEStimestamp_defmt_asc
Parse a timestamp value from its textual representation using a formatting mask.
The function receives the textual representation of a timestamp in the variable str as well as the
formatting mask to use in the variable fmt. The result will be stored in the variable that d points to.
If the formatting mask fmt is NULL, the function will fall back to the default formatting mask
which is %Y-%m-%d %H:%M:%S.
PGTYPEStimestamp_add_interval
The function receives a pointer to a timestamp variable tin and a pointer to an interval variable
span. It adds the interval to the timestamp and saves the resulting timestamp in the variable that
tout points to.
The function returns 0 on success and a negative value if an error occurred.
PGTYPEStimestamp_sub_interval
The function subtracts the interval variable that span points to from the timestamp variable that
tin points to and saves the result into the variable that tout points to.
The function returns 0 on success and a negative value if an error occurred.
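A combined sketch of the two functions (assuming pgtypes_timestamp.h and pgtypes_interval.h, and that the interval string is accepted by PGTYPESinterval_from_asc):
timestamp tin = PGTYPEStimestamp_from_asc("2019-07-04 00:00:00", NULL);
interval *span = PGTYPESinterval_from_asc("1 day 12 hours", NULL);
timestamp tout;
PGTYPEStimestamp_add_interval(&tin, span, &tout); /* 2019-07-05 12:00:00 */
PGTYPEStimestamp_sub_interval(&tout, span, &tin); /* back to 2019-07-04 00:00:00 */
PGTYPESinterval_free(span);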
The following functions can be used to work with the interval type:
PGTYPESinterval_new
interval *PGTYPESinterval_new(void);
PGTYPESinterval_free
PGTYPESinterval_from_asc
The function parses the input string str and returns a pointer to an allocated interval variable. At
the moment ECPG always parses the complete string, so it currently does not support storing
the address of the first invalid character in *endptr. You can safely set endptr to NULL.
PGTYPESinterval_to_asc
The function converts the interval variable that span points to into a C char*. The output looks
like this example: @ 1 day 12 hours 59 mins 10 secs. The result must be freed
with PGTYPESchar_free().
PGTYPESinterval_copy
The function copies the interval variable that intvlsrc points to into the variable that
intvldest points to. Note that you need to allocate the memory for the destination variable
beforehand.
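A short sketch of parsing, printing, and freeing an interval (assuming pgtypes_interval.h and stdio.h):
interval *span = PGTYPESinterval_from_asc("@ 1 day 12 hours 59 mins 10 secs", NULL);
char *text = PGTYPESinterval_to_asc(span);
printf("%s\n", text);
PGTYPESchar_free(text);
PGTYPESinterval_free(span);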
The following functions can be used to work with the decimal type and are not only contained in the
libcompat library.
PGTYPESdecimal_new
decimal *PGTYPESdecimal_new(void);
PGTYPESdecimal_free
PGTYPES_NUM_BAD_NUMERIC
An argument should contain a numeric variable (or point to a numeric variable) but in fact its in-
memory representation was invalid.
PGTYPES_NUM_OVERFLOW
An overflow occurred. Since the numeric type can deal with almost arbitrary precision, converting
a numeric variable into other types might cause overflow.
PGTYPES_NUM_UNDERFLOW
An underflow occurred. Since the numeric type can deal with almost arbitrary precision, convert-
ing a numeric variable into other types might cause underflow.
PGTYPES_NUM_DIVIDE_ZERO
PGTYPES_DATE_BAD_DATE
PGTYPES_DATE_ERR_EARGS
PGTYPES_DATE_ERR_ENOSHORTDATE
An invalid token in the input string was found by the PGTYPESdate_defmt_asc function.
PGTYPES_INTVL_BAD_INTERVAL
PGTYPES_DATE_ERR_ENOTDMY
PGTYPES_DATE_BAD_DAY
An invalid day of the month value was found by the PGTYPESdate_defmt_asc function.
PGTYPES_DATE_BAD_MONTH
PGTYPES_TS_BAD_TIMESTAMP
PGTYPES_TS_ERR_EINFTIME
An infinite timestamp value was encountered in a context that cannot handle it.
PGTYPESInvalidTimestamp
A value of type timestamp representing an invalid time stamp. This is returned by the function
PGTYPEStimestamp_from_asc on parse error. Note that due to the internal representation
of the timestamp data type, PGTYPESInvalidTimestamp is also a valid timestamp at the
same time. It is set to 1899-12-31 23:59:59. In order to detect errors, make sure that your
application does not only test for PGTYPESInvalidTimestamp but also for errno != 0
after each call to PGTYPEStimestamp_from_asc.
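A minimal sketch of that check (errno comes from errno.h; input stands for any string to be parsed):
errno = 0;
ts = PGTYPEStimestamp_from_asc(input, NULL);
if (errno != 0)
{
    /* parse error: ts is PGTYPESInvalidTimestamp */
}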
Before you can use an SQL descriptor area, you need to allocate one:
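(in its simplest form, as described on the ALLOCATE DESCRIPTOR reference page later in this chapter:)
EXEC SQL ALLOCATE DESCRIPTOR identifier;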
The identifier serves as the “variable name” of the descriptor area. When you don't need the descriptor
anymore, you should deallocate it:
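(correspondingly:)
EXEC SQL DEALLOCATE DESCRIPTOR identifier;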
To use a descriptor area, specify it as the storage target in an INTO clause, instead of listing host
variables:
EXEC SQL FETCH NEXT FROM mycursor INTO SQL DESCRIPTOR mydesc;
If the result set is empty, the Descriptor Area will still contain the metadata from the query, i.e., the
field names.
For not yet executed prepared queries, the DESCRIBE statement can be used to get the metadata of
the result set:
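(an illustrative sketch; stmt1 stands for the name of a prepared statement and mydesc for an allocated descriptor:)
EXEC SQL DESCRIBE stmt1 INTO SQL DESCRIPTOR mydesc;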
Before PostgreSQL 9.0, the SQL keyword was optional, so using DESCRIPTOR and SQL DESCRIPTOR
produced named SQL Descriptor Areas. Now it is mandatory; omitting the SQL keyword produces
SQLDA Descriptor Areas, see Section 36.7.2.
In DESCRIBE and FETCH statements, the INTO and USING keywords can be used similarly: they
produce the result set and the metadata in a Descriptor Area.
Now how do you get the data out of the descriptor area? You can think of the descriptor area as a
structure with named fields. To retrieve the value of a field from the header and store it into a host
variable, use the following command:
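(the general form, with name, hostvar, and field as placeholders:)
EXEC SQL GET DESCRIPTOR name :hostvar = field;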
Currently, there is only one header field defined: COUNT, which tells how many item descriptor areas
exist (that is, how many columns are contained in the result). The host variable needs to be of an
integer type. To get a field from the item descriptor area, use the following command:
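(the general form, with name, num, hostvar, and field as placeholders:)
EXEC SQL GET DESCRIPTOR name VALUE num :hostvar = field;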
num can be a literal integer or a host variable containing an integer. Possible fields are:
CARDINALITY (integer)
DATA
actual data item (therefore, the data type of this field depends on the query)
DATETIME_INTERVAL_CODE (integer)
When TYPE is 9, DATETIME_INTERVAL_CODE will have a value of 1 for DATE, 2 for TIME,
3 for TIMESTAMP, 4 for TIME WITH TIME ZONE, or 5 for TIMESTAMP WITH TIME ZONE.
DATETIME_INTERVAL_PRECISION (integer)
not implemented
INDICATOR (integer)
KEY_MEMBER (integer)
not implemented
LENGTH (integer)
NAME (string)
NULLABLE (integer)
not implemented
OCTET_LENGTH (integer)
PRECISION (integer)
RETURNED_LENGTH (integer)
RETURNED_OCTET_LENGTH (integer)
SCALE (integer)
TYPE (integer)
In EXECUTE, DECLARE and OPEN statements, the effect of the INTO and USING keywords are
different. A Descriptor Area can also be manually built to provide the input parameters for a query
or a cursor and USING SQL DESCRIPTOR name is the way to pass the input parameters into a
parameterized query. The statement to build a named SQL Descriptor Area is below:
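(the general form, with name, num, field, and hostvar as placeholders; see also the SET DESCRIPTOR reference page:)
EXEC SQL SET DESCRIPTOR name VALUE num field = :hostvar;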
PostgreSQL supports retrieving more than one record in one FETCH statement; storing the data in
host variables in this case assumes that the variable is an array. A multi-row fetch can also target an SQLDA descriptor, for example:
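(an illustrative sketch, assuming a cursor mycursor and an SQLDA pointer declared as sqlda_t *mysqlda:)
EXEC SQL FETCH 3 FROM mycursor INTO DESCRIPTOR mysqlda;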
Note that the SQL keyword is omitted. The paragraphs about the use cases of the INTO and
USING keywords in Section 36.7.1 also apply here with an addition. In a DESCRIBE statement the
DESCRIPTOR keyword can be completely omitted if the INTO keyword is used:
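(illustrative; prepared_statement and mysqlda are placeholder names:)
EXEC SQL DESCRIBE prepared_statement INTO mysqlda;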
3. Declare an SQLDA for the input parameters, and initialize them (memory allocation, parameter
settings).
5. Fetch rows from the cursor, and store them into an output SQLDA.
6. Read values from the output SQLDA into the host variables (with conversion if necessary).
Tip
PostgreSQL's SQLDA has a similar data structure to the one in IBM DB2 Universal Database,
so some technical information on DB2's SQLDA could help understanding PostgreSQL's one
better.
struct sqlda_struct
{
char sqldaid[8];
long sqldabc;
short sqln;
short sqld;
struct sqlda_struct *desc_next;
struct sqlvar_struct sqlvar[1];
};
sqldaid
sqldabc
sqln
It contains the number of input parameters for a parameterized query in case it's passed into OPEN,
DECLARE or EXECUTE statements using the USING keyword. In case it's used as output of
SELECT, EXECUTE or FETCH statements, its value is the same as that of the sqld member.
sqld
desc_next
If the query returns more than one record, multiple linked SQLDA structures are returned, and
desc_next holds a pointer to the next entry in the list.
sqlvar
struct sqlvar_struct
{
short sqltype;
short sqllen;
char *sqldata;
short *sqlind;
struct sqlname sqlname;
};
sqltype
Contains the type identifier of the field. For values, see enum ECPGttype in ecpgtype.h.
sqllen
Contains the binary length of the field. e.g., 4 bytes for ECPGt_int.
sqldata
Points to the data. The format of the data is described in Section 36.4.4.
sqlind
sqlname
#define NAMEDATALEN 64
struct sqlname
{
short length;
char data[NAMEDATALEN];
};
length
data
3. Check the number of records in the result set by looking at sqln, a member of the sqlda_t
structure.
4. Get the values of each column from sqlvar[0], sqlvar[1], etc., members of the sqlda_t
structure.
5. Go to next row (sqlda_t structure) by following the desc_next pointer, a member of the
sqlda_t structure.
sqlda_t *sqlda1;
sqlda_t *cur_sqlda;
Inside the loop, run another loop to retrieve each column data (sqlvar_t structure) of the row.
To get a column value, check the sqltype value, a member of the sqlvar_t structure. Then,
switch to an appropriate way, depending on the column type, to copy data from the sqlvar field
to a host variable.
char var_buf[1024];
switch (v.sqltype)
{
case ECPGt_char:
memset(&var_buf, 0, sizeof(var_buf));
memcpy(&var_buf, sqldata, (sizeof(var_buf) <= sqllen ?
sizeof(var_buf) - 1 : sqllen));
break;
...
}
3. Allocate memory area (as sqlda_t structure) for the input SQLDA.
Here is an example.
Next, allocate memory for an SQLDA, and set the number of input parameters in sqln, a member
variable of the sqlda_t structure. When two or more input parameters are required for the prepared
query, the application has to allocate additional memory space which is calculated by (nr. of params
- 1) * sizeof(sqlvar_t). The example shown here allocates memory space for two input parameters.
sqlda_t *sqlda2;
After memory allocation, store the parameter values into the sqlvar[] array. (This is the same array
used for retrieving column values when the SQLDA is receiving a result set.) In this example, the
input parameters are "postgres", having a string type, and 1, having an integer type.
sqlda2->sqlvar[0].sqltype = ECPGt_char;
sqlda2->sqlvar[0].sqldata = "postgres";
sqlda2->sqlvar[0].sqllen = 8;
int intval = 1;
sqlda2->sqlvar[1].sqltype = ECPGt_int;
sqlda2->sqlvar[1].sqldata = (char *) &intval;
sqlda2->sqlvar[1].sqllen = sizeof(intval);
By opening a cursor and specifying the SQLDA that was set up beforehand, the input parameters are
passed to the prepared statement.
Finally, after using input SQLDAs, the allocated memory space must be freed explicitly, unlike SQL-
DAs used for receiving query results.
free(sqlda2);
This application joins two system tables, pg_database and pg_stat_database, on the database OID, and
also fetches and shows the database statistics that are retrieved by two input parameters (the database
name postgres, and the OID 1).
Next, connect to the database, prepare a statement, and declare a cursor for the prepared statement.
int
main(void)
{
EXEC SQL BEGIN DECLARE SECTION;
char query[1024] = "SELECT d.oid,* FROM pg_database d,
pg_stat_database s WHERE d.oid=s.datid AND ( d.datname=? OR
d.oid=? )";
EXEC SQL END DECLARE SECTION;
Next, put some values in the input SQLDA for the input parameters. Allocate memory for the input
SQLDA, and set the number of input parameters to sqln. Store type, value, and value length into
sqltype, sqldata, and sqllen in the sqlvar structure.
sqlda2->sqlvar[0].sqltype = ECPGt_char;
sqlda2->sqlvar[0].sqldata = "postgres";
sqlda2->sqlvar[0].sqllen = 8;
intval = 1;
sqlda2->sqlvar[1].sqltype = ECPGt_int;
sqlda2->sqlvar[1].sqldata = (char *)&intval;
sqlda2->sqlvar[1].sqllen = sizeof(intval);
After setting up the input SQLDA, open a cursor with the input SQLDA.
Fetch rows into the output SQLDA from the opened cursor. (Generally, you have to call FETCH
repeatedly in the loop, to fetch all rows in the result set.)
while (1)
{
sqlda_t *cur_sqlda;
Next, retrieve the fetched records from the SQLDA, by following the linked list of the sqlda_t
structure.
cur_sqlda != NULL ;
cur_sqlda = cur_sqlda->desc_next)
{
...
Read each column in the first record. The number of columns is stored in sqld, and the actual data of
the first column is stored in sqlvar[0], both members of the sqlda_t structure.
Now, the column data is stored in the variable v. Copy every datum into host variables, looking at
v.sqltype for the type of the column.
switch (v.sqltype) {
int intval;
double doubleval;
unsigned long long int longlongval;
case ECPGt_char:
memset(&var_buf, 0, sizeof(var_buf));
memcpy(&var_buf, sqldata, (sizeof(var_buf) <=
sqllen ? sizeof(var_buf)-1 : sqllen));
break;
...
default:
...
}
Close the cursor after processing all of the records, and disconnect from the database.
#include <stdlib.h>
#include <string.h>
#include <stdio.h>
#include <unistd.h>
int
main(void)
{
EXEC SQL BEGIN DECLARE SECTION;
char query[1024] = "SELECT d.oid,* FROM pg_database d,
pg_stat_database s WHERE d.oid=s.datid AND ( d.datname=? OR
d.oid=? )";
int intval;
unsigned long long int longlongval;
EXEC SQL END DECLARE SECTION;
sqlda2->sqlvar[0].sqltype = ECPGt_char;
sqlda2->sqlvar[0].sqldata = "postgres";
sqlda2->sqlvar[0].sqllen = 8;
intval = 1;
sqlda2->sqlvar[1].sqltype = ECPGt_int;
sqlda2->sqlvar[1].sqldata = (char *) &intval;
sqlda2->sqlvar[1].sqllen = sizeof(intval);
while (1)
{
sqlda_t *cur_sqlda;
strncpy(name_buf, v.sqlname.data,
v.sqlname.length);
name_buf[v.sqlname.length] = '\0';
switch (v.sqltype)
{
case ECPGt_char:
memset(&var_buf, 0, sizeof(var_buf));
memcpy(&var_buf, sqldata,
(sizeof(var_buf)<=sqllen ? sizeof(var_buf)-1 : sqllen) );
break;
default:
{
int i;
memset(var_buf, 0, sizeof(var_buf));
for (i = 0; i < sqllen; i++)
{
char tmpbuf[16];
snprintf(tmpbuf, sizeof(tmpbuf), "%02x ", (unsigned char) sqldata[i]);
strncat(var_buf, tmpbuf, sizeof(var_buf));
}
}
break;
}
printf("\n");
}
}
return 0;
}
The output of this example should look something like the following (some numbers will vary).
oid = 1 (type: 1)
datname = template1 (type: 1)
datdba = 10 (type: 1)
encoding = 0 (type: 5)
datistemplate = t (type: 1)
datallowconn = t (type: 1)
datconnlimit = -1 (type: 5)
datlastsysoid = 11510 (type: 1)
datfrozenxid = 379 (type: 1)
dattablespace = 1663 (type: 1)
datconfig = (type: 1)
datacl = {=c/uptime,uptime=CTc/uptime} (type: 1)
datid = 1 (type: 1)
datname = template1 (type: 1)
numbackends = 0 (type: 5)
xact_commit = 113606 (type: 9)
xact_rollback = 0 (type: 9)
blks_read = 130 (type: 9)
blks_hit = 7341714 (type: 9)
tup_returned = 38262679 (type: 9)
tup_fetched = 1836281 (type: 9)
tup_inserted = 0 (type: 9)
tup_updated = 0 (type: 9)
tup_deleted = 0 (type: 9)
numbackends = 0 (type: 5)
xact_commit = 221069 (type: 9)
xact_rollback = 18 (type: 9)
blks_read = 1176 (type: 9)
blks_hit = 13943750 (type: 9)
tup_returned = 77410091 (type: 9)
tup_fetched = 3253694 (type: 9)
tup_inserted = 0 (type: 9)
tup_updated = 0 (type: 9)
tup_deleted = 0 (type: 9)
• Callbacks can be configured to handle warning and error conditions using the WHENEVER com-
mand.
• Detailed information about the error or warning can be obtained from the sqlca variable.
SQLERROR
The specified action is called whenever an error occurs during the execution of an SQL statement.
SQLWARNING
The specified action is called whenever a warning occurs during the execution of an SQL state-
ment.
NOT FOUND
The specified action is called whenever an SQL statement retrieves or affects zero rows. (This
condition is not an error, but you might be interested in handling it specially.)
CONTINUE
This effectively means that the condition is ignored. This is the default.
GOTO label
GO TO label
SQLPRINT
Print a message to standard error. This is useful for simple programs or during prototyping. The
details of the message cannot be configured.
STOP
DO BREAK
Execute the C statement break. This should only be used in loops or switch statements.
DO CONTINUE
Execute the C statement continue. This should only be used in loop statements. If executed,
it will cause the flow of control to return to the top of the loop.
Call the specified C functions with the specified arguments. (This use is different from the mean-
ing of CALL and DO in the normal PostgreSQL grammar.)
The SQL standard only provides for the actions CONTINUE and GOTO (and GO TO).
Here is an example that you might want to use in a simple program. It prints a simple message when
a warning occurs and aborts the program when an error happens:
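(a minimal sketch using the SQLPRINT and STOP actions described above:)
EXEC SQL WHENEVER SQLWARNING SQLPRINT;
EXEC SQL WHENEVER SQLERROR STOP;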
The statement EXEC SQL WHENEVER is a directive of the SQL preprocessor, not a C statement. The
error or warning actions that it sets apply to all embedded SQL statements that appear below the point
where the handler is set, unless a different action was set for the same condition between the first EXEC
SQL WHENEVER and the SQL statement causing the condition, regardless of the flow of control in
the C program. So neither of the two following C program excerpts will have the desired effect:
/*
* WRONG
*/
int main(int argc, char *argv[])
{
...
if (verbose) {
EXEC SQL WHENEVER SQLWARNING SQLPRINT;
}
...
EXEC SQL SELECT ...;
...
}
/*
* WRONG
*/
int main(int argc, char *argv[])
{
...
set_error_handler();
...
EXEC SQL SELECT ...;
...
}
36.8.2. sqlca
For more powerful error handling, the embedded SQL interface provides a global variable with the
name sqlca (SQL communication area) that has the following structure:
struct
{
char sqlcaid[8];
long sqlabc;
long sqlcode;
struct
{
int sqlerrml;
char sqlerrmc[SQLERRMC_LEN];
} sqlerrm;
char sqlerrp[8];
long sqlerrd[6];
char sqlwarn[8];
char sqlstate[5];
} sqlca;
(In a multithreaded program, every thread automatically gets its own copy of sqlca. This works
similarly to the handling of the standard C global variable errno.)
sqlca covers both warnings and errors. If multiple warnings or errors occur during the execution of
a statement, then sqlca will only contain information about the last one.
If no error occurred in the last SQL statement, sqlca.sqlcode will be 0 and sqlca.sqlstate
will be "00000". If a warning or error occurred, then sqlca.sqlcode will be negative and
sqlca.sqlstate will be different from "00000". A positive sqlca.sqlcode indicates a harmless
condition, such as that the last query returned zero rows. sqlcode and sqlstate are two different
error code schemes; details appear below.
If the last SQL statement was successful, then sqlca.sqlerrd[1] contains the OID of the
processed row, if applicable, and sqlca.sqlerrd[2] contains the number of processed or returned
rows, if applicable to the command.
In case of a warning, sqlca.sqlwarn[2] is set to W. (In all other cases, it is set to something
different from W.) If sqlca.sqlwarn[1] is set to W, then a value was truncated when it was stored
in a host variable. sqlca.sqlwarn[0] is set to W if any of the other elements are set to indicate
a warning.
The fields sqlcaid, sqlabc, sqlerrp, and the remaining elements of sqlerrd and sqlwarn
currently contain no useful information.
The structure sqlca is not defined in the SQL standard, but is implemented in several other SQL
database systems. The definitions are similar at the core, but if you want to write portable applications,
then you should investigate the different implementations carefully.
Here is one example that combines the use of WHENEVER and sqlca, printing out the contents of
sqlca when an error occurs. This is perhaps useful for debugging or prototyping applications, before
installing a more “user-friendly” error handler.
void
print_sqlca()
{
fprintf(stderr, "==== sqlca ====\n");
fprintf(stderr, "sqlcode: %ld\n", sqlca.sqlcode);
fprintf(stderr, "sqlerrm.sqlerrml: %d\n",
sqlca.sqlerrm.sqlerrml);
fprintf(stderr, "sqlerrm.sqlerrmc: %s\n",
sqlca.sqlerrm.sqlerrmc);
fprintf(stderr, "sqlerrd: %ld %ld %ld %ld %ld %ld\n",
sqlca.sqlerrd[0],sqlca.sqlerrd[1],sqlca.sqlerrd[2],
sqlca.sqlerrd[3],sqlca.sqlerrd[4],sqlca.sqlerrd[5]);
fprintf(stderr, "sqlwarn: %d %d %d %d %d %d %d %d\n",
sqlca.sqlwarn[0], sqlca.sqlwarn[1], sqlca.sqlwarn[2],
sqlca.sqlwarn[3], sqlca.sqlwarn[4], sqlca.sqlwarn[5],
sqlca.sqlwarn[6], sqlca.sqlwarn[7]);
fprintf(stderr, "sqlstate: %5s\n", sqlca.sqlstate);
fprintf(stderr, "===============\n");
}
The result could look as follows (here an error due to a misspelled table name):
SQLSTATE is a five-character array. The five characters contain digits or upper-case letters that rep-
resent codes of various error and warning conditions. SQLSTATE has a hierarchical scheme: the first
two characters indicate the general class of the condition, the last three characters indicate a subclass
of the general condition. A successful state is indicated by the code 00000. The SQLSTATE codes are
for the most part defined in the SQL standard. The PostgreSQL server natively supports SQLSTATE
error codes; therefore a high degree of consistency can be achieved by using this error code scheme
throughout all applications. For further information see Appendix A.
SQLCODE, the deprecated error code scheme, is a simple integer. A value of 0 indicates success, a
positive value indicates success with additional information, a negative value indicates an error. The
SQL standard only defines the positive value +100, which indicates that the last command returned
or affected zero rows, and no specific negative values. Therefore, this scheme can only achieve poor
portability and does not have a hierarchical code assignment. Historically, the embedded SQL proces-
sor for PostgreSQL has assigned some specific SQLCODE values for its use, which are listed below
with their numeric value and their symbolic name. Remember that these are not portable to other SQL
implementations. To simplify the porting of applications to the SQLSTATE scheme, the correspond-
ing SQLSTATE is also listed. There is, however, no one-to-one or one-to-many mapping between
the two schemes (indeed it is many-to-many), so you should consult the global SQLSTATE listing in
Appendix A in each case.
0 (ECPG_NO_ERROR)
100 (ECPG_NOT_FOUND)
This is a harmless condition indicating that the last command retrieved or processed zero rows,
or that you are at the end of the cursor. (SQLSTATE 02000)
When processing a cursor in a loop, you could use this code as a way to detect when to abort
the loop, like this:
while (1)
{
EXEC SQL FETCH ... ;
if (sqlca.sqlcode == ECPG_NOT_FOUND)
break;
}
But WHENEVER NOT FOUND DO BREAK effectively does this internally, so there is usually
no advantage in writing this out explicitly.
-12 (ECPG_OUT_OF_MEMORY)
Indicates that your virtual memory is exhausted. The numeric value is defined as -ENOMEM.
(SQLSTATE YE001)
-200 (ECPG_UNSUPPORTED)
Indicates the preprocessor has generated something that the library does not know about. Perhaps
you are running incompatible versions of the preprocessor and the library. (SQLSTATE YE002)
-201 (ECPG_TOO_MANY_ARGUMENTS)
This means that the command specified more host variables than the command expected. (SQLSTATE
07001 or 07002)
-202 (ECPG_TOO_FEW_ARGUMENTS)
This means that the command specified fewer host variables than the command expected. (SQLSTATE
07001 or 07002)
-203 (ECPG_TOO_MANY_MATCHES)
This means a query has returned multiple rows but the statement was only prepared to store one
result row (for example, because the specified variables are not arrays). (SQLSTATE 21000)
-204 (ECPG_INT_FORMAT)
The host variable is of type int and the datum in the database is of a different type and contains
a value that cannot be interpreted as an int. The library uses strtol() for this conversion.
(SQLSTATE 42804)
-205 (ECPG_UINT_FORMAT)
The host variable is of type unsigned int and the datum in the database is of a different
type and contains a value that cannot be interpreted as an unsigned int. The library uses
strtoul() for this conversion. (SQLSTATE 42804)
-206 (ECPG_FLOAT_FORMAT)
The host variable is of type float and the datum in the database is of another type and contains
a value that cannot be interpreted as a float. The library uses strtod() for this conversion.
(SQLSTATE 42804)
-207 (ECPG_NUMERIC_FORMAT)
The host variable is of type numeric and the datum in the database is of another type and contains
a value that cannot be interpreted as a numeric value. (SQLSTATE 42804)
-208 (ECPG_INTERVAL_FORMAT)
The host variable is of type interval and the datum in the database is of another type and
contains a value that cannot be interpreted as an interval value. (SQLSTATE 42804)
-209 (ECPG_DATE_FORMAT)
The host variable is of type date and the datum in the database is of another type and contains
a value that cannot be interpreted as a date value. (SQLSTATE 42804)
-210 (ECPG_TIMESTAMP_FORMAT)
The host variable is of type timestamp and the datum in the database is of another type and
contains a value that cannot be interpreted as a timestamp value. (SQLSTATE 42804)
-211 (ECPG_CONVERT_BOOL)
This means the host variable is of type bool and the datum in the database is neither 't' nor
'f'. (SQLSTATE 42804)
-212 (ECPG_EMPTY)
The statement sent to the PostgreSQL server was empty. (This cannot normally happen in an
embedded SQL program, so it might point to an internal error.) (SQLSTATE YE002)
-213 (ECPG_MISSING_INDICATOR)
A null value was returned and no null indicator variable was supplied. (SQLSTATE 22002)
-214 (ECPG_NO_ARRAY)
An ordinary variable was used in a place that requires an array. (SQLSTATE 42804)
-215 (ECPG_DATA_NOT_ARRAY)
The database returned an ordinary variable in a place that requires an array value. (SQLSTATE
42804)
-216 (ECPG_ARRAY_INSERT)
The value could not be inserted into the array. (SQLSTATE 42804)
-220 (ECPG_NO_CONN)
The program tried to access a connection that does not exist. (SQLSTATE 08003)
-221 (ECPG_NOT_CONN)
The program tried to access a connection that does exist but is not open. (This is an internal error.)
(SQLSTATE YE002)
-230 (ECPG_INVALID_STMT)
The statement you are trying to use has not been prepared. (SQLSTATE 26000)
-239 (ECPG_INFORMIX_DUPLICATE_KEY)
Duplicate key error, violation of unique constraint (Informix compatibility mode). (SQLSTATE
23505)
-240 (ECPG_UNKNOWN_DESCRIPTOR)
The descriptor specified was not found. The statement you are trying to use has not been prepared.
(SQLSTATE 33000)
-241 (ECPG_INVALID_DESCRIPTOR_INDEX)
-242 (ECPG_UNKNOWN_DESCRIPTOR_ITEM)
An invalid descriptor item was requested. (This is an internal error.) (SQLSTATE YE002)
-243 (ECPG_VAR_NOT_NUMERIC)
During the execution of a dynamic statement, the database returned a numeric value and the host
variable was not numeric. (SQLSTATE 07006)
-244 (ECPG_VAR_NOT_CHAR)
During the execution of a dynamic statement, the database returned a non-numeric value and the
host variable was numeric. (SQLSTATE 07006)
-284 (ECPG_INFORMIX_SUBSELECT_NOT_ONE)
A result of the subquery is not single row (Informix compatibility mode). (SQLSTATE 21000)
-400 (ECPG_PGSQL)
Some error caused by the PostgreSQL server. The message contains the error message from the
PostgreSQL server.
-401 (ECPG_TRANS)
The PostgreSQL server signaled that we cannot start, commit, or roll back the transaction. (SQLSTATE
08007)
-402 (ECPG_CONNECT)
The connection attempt to the database did not succeed. (SQLSTATE 08001)
-403 (ECPG_DUPLICATE_KEY)
-404 (ECPG_SUBSELECT_NOT_ONE)
-602 (ECPG_WARNING_UNKNOWN_PORTAL)
-603 (ECPG_WARNING_IN_TRANSACTION)
-604 (ECPG_WARNING_NO_TRANSACTION)
-605 (ECPG_WARNING_PORTAL_EXISTS)
The embedded SQL preprocessor will look for a file named filename.h, preprocess it, and include
it in the resulting C output. Thus, embedded SQL statements in the included file are handled correctly.
The ecpg preprocessor will search for a file in several directories in the following order:
• current directory
• /usr/local/include
• /usr/include
But when EXEC SQL INCLUDE "filename" is used, only the current directory is searched.
In each directory, the preprocessor will first look for the file name as given, and if not found will
append .h to the file name and try again (unless the specified file name already has that suffix).
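Note, however, that EXEC SQL INCLUDE is not the same as: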
#include <filename.h>
because this file would not be subject to SQL command preprocessing. Naturally, you can continue
to use the C #include directive to include other header files.
Note
The include file name is case-sensitive, even though the rest of the EXEC SQL INCLUDE
command follows the normal SQL case-sensitivity rules.
Of course you can continue to use the C versions #define and #undef in your embedded SQL
program. The difference is where your defined values get evaluated. If you use EXEC SQL DEFINE
then the ecpg preprocessor evaluates the defines and substitutes the values. For example if you write:
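(an illustrative definition and use; Tbl and col are placeholder names:)
EXEC SQL DEFINE MYNUMBER 12;
...
EXEC SQL UPDATE Tbl SET col = MYNUMBER;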
then ecpg will already do the substitution and your C compiler will never see any name or identifier
MYNUMBER. Note that you cannot use #define for a constant that you are going to use in an embed-
ded SQL query because in this case the embedded SQL precompiler is not able to see this declaration.
Checks a name and processes subsequent lines if name has been created with EXEC SQL de-
fine name.
Checks a name and processes subsequent lines if name has not been created with EXEC SQL
define name.
Starts processing an alternative section to a section introduced by either EXEC SQL ifdef
name or EXEC SQL ifndef name.
Checks name and starts an alternative section if name has been created with EXEC SQL de-
fine name.
Example:
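(an illustrative sketch of the conditional directives, choosing a time zone setting depending on which names were defined:)
EXEC SQL ifndef TZVAR;
EXEC SQL SET TIMEZONE TO 'GMT';
EXEC SQL elif TZNAME;
EXEC SQL SET TIMEZONE TO TZNAME;
EXEC SQL else;
EXEC SQL SET TIMEZONE TO TZVAR;
EXEC SQL endif;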
The preprocessor program is called ecpg and is included in a normal PostgreSQL installation. Em-
bedded SQL programs are typically named with an extension .pgc. If you have a program file called
prog1.pgc, you can preprocess it by simply calling:
ecpg prog1.pgc
This will create a file called prog1.c. If your input files do not follow the suggested naming pattern,
you can specify the output file explicitly using the -o option.
cc -c prog1.c
The generated C source files include header files from the PostgreSQL installation, so if you installed
PostgreSQL in a location that is not searched by default, you have to add an option such as -I/usr/
local/pgsql/include to the compilation command line.
To link an embedded SQL program, you need to include the libecpg library, like so:
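(one possible invocation, assuming the object file prog1.o produced above:)
cc -o myprog prog1.o -lecpg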
Again, you might have to add an option like -L/usr/local/pgsql/lib to that command line.
You can use pg_config or pkg-config with package name libecpg to get the paths for your
installation.
If you manage the build process of a larger project using make, it might be convenient to include the
following implicit rule to your makefiles:
ECPG = ecpg
%.c: %.pgc
$(ECPG) $<
The ecpg library is thread-safe by default. However, you might need to use some threading com-
mand-line options to compile your client code.
• ECPGdebug(int on, FILE *stream) turns on debug logging if called with the first argu-
ment non-zero. Debug logging is done on stream. The log contains all SQL statements with all
the input variables inserted, and the results from the PostgreSQL server. This can be very useful
when searching for errors in your SQL statements.
Note
On Windows, if the ecpg libraries and an application are compiled with different flags, this
function call will crash the application because the internal representation of the FILE
pointers differs. Specifically, multithreaded/single-threaded, release/debug, and static/dynamic
flags should be the same for the library and all applications using that library.
Note
It is a bad idea to manipulate database connection handles made from ecpg directly with
libpq routines.
For more details about the ECPGget_PGconn(), see Section 36.11. For information about the large
object function interface, see Chapter 35.
Large object functions have to be called in a transaction block, so when autocommit is off, BEGIN
commands have to be issued explicitly.
Example 36.2 shows an example program that illustrates how to create, write, and read a large object
in an ECPG application.
#include <stdio.h>
#include <stdlib.h>
#include <libpq-fe.h>
#include <libpq/libpq-fs.h>
int
main(void)
{
PGconn *conn;
Oid loid;
int fd;
char buf[256];
int buflen = 256;
char buf2[256];
int rc;
memset(buf, 1, buflen);
conn = ECPGget_PGconn("con1");
printf("conn = %p\n", conn);
/* create */
loid = lo_create(conn, 0);
if (loid < 0)
printf("lo_create() failed: %s", PQerrorMessage(conn));
/* write test */
fd = lo_open(conn, loid, INV_READ|INV_WRITE);
if (fd < 0)
printf("lo_open() failed: %s", PQerrorMessage(conn));
rc = lo_close(conn, fd);
if (rc < 0)
printf("lo_close() failed: %s", PQerrorMessage(conn));
/* read test */
fd = lo_open(conn, loid, INV_READ);
if (fd < 0)
printf("lo_open() failed: %s", PQerrorMessage(conn));
rc = lo_close(conn, fd);
if (rc < 0)
printf("lo_close() failed: %s", PQerrorMessage(conn));
/* check */
rc = memcmp(buf, buf2, buflen);
printf("memcmp() = %d\n", rc);
/* cleanup */
rc = lo_unlink(conn, loid);
if (rc < 0)
printf("lo_unlink() failed: %s", PQerrorMessage(conn));
The ecpg preprocessor takes an input file written in C (or something like C) and embedded SQL
commands, converts the embedded SQL commands into C language chunks, and finally generates a
.c file. The header file declarations of the library functions used by the C language chunks that ecpg
generates are wrapped in extern "C" { ... } blocks when used under C++, so they should
work seamlessly in C++.
In general, however, the ecpg preprocessor only understands C; it does not handle the special syntax
and reserved words of the C++ language. So, some embedded SQL code written in C++ application
code that uses complicated features specific to C++ might fail to be preprocessed correctly or might
not work as expected.
A safe way to use the embedded SQL code in a C++ application is hiding the ECPG calls in a C
module, which the C++ application code calls into to access the database, and linking that together
with the rest of the C++ code. See Section 36.13.2 about that.
For example, in the following case, the ecpg preprocessor cannot find any declaration for the variable
dbname in the test method, so an error will occur.
class TestCpp
{
EXEC SQL BEGIN DECLARE SECTION;
char dbname[1024];
EXEC SQL END DECLARE SECTION;
public:
TestCpp();
void test();
~TestCpp();
};
TestCpp::TestCpp()
{
EXEC SQL CONNECT TO testdb1;
EXEC SQL SELECT pg_catalog.set_config('search_path', '', false);
EXEC SQL COMMIT;
}
void TestCpp::test()
{
EXEC SQL SELECT current_database() INTO :dbname;
printf("current_database = %s\n", dbname);
}
TestCpp::~TestCpp()
{
EXEC SQL DISCONNECT ALL;
}
ecpg test_cpp.pgc
test_cpp.pgc:28: ERROR: variable "dbname" is not declared
To avoid this scope issue, the test method could be modified to use a local variable as intermedi-
ate storage. But this approach is only a poor workaround, because it uglifies the code and reduces
performance.
void TestCpp::test()
{
EXEC SQL BEGIN DECLARE SECTION;
char tmp[1024];
EXEC SQL END DECLARE SECTION;
Three kinds of files have to be created: a C file (*.pgc), a header file, and a C++ file:
test_mod.pgc
#include "test_mod.h"
#include <stdio.h>
void
db_connect()
{
EXEC SQL CONNECT TO testdb1;
EXEC SQL SELECT pg_catalog.set_config('search_path', '', false);
EXEC SQL COMMIT;
}
void
db_test()
{
EXEC SQL BEGIN DECLARE SECTION;
char dbname[1024];
EXEC SQL END DECLARE SECTION;
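/* Body reconstructed as a sketch mirroring the test() method shown
   earlier; the listing is truncated at this point. */
EXEC SQL SELECT current_database() INTO :dbname;
printf("current_database = %s\n", dbname);
}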
void
db_disconnect()
{
EXEC SQL DISCONNECT ALL;
}
test_mod.h
A header file with declarations of the functions in the C module (test_mod.pgc). It is included
by test_cpp.cpp. This file has to have an extern "C" block around the declarations,
because it will be linked from the C++ module.
#ifdef __cplusplus
extern "C" {
#endif
void db_connect();
void db_test();
void db_disconnect();
#ifdef __cplusplus
}
#endif
test_cpp.cpp
The main code for the application, including the main routine, and in this example a C++ class.
#include "test_mod.h"
class TestCpp
{
public:
TestCpp();
void test();
~TestCpp();
};
TestCpp::TestCpp()
{
db_connect();
}
void
TestCpp::test()
{
db_test();
}
TestCpp::~TestCpp()
{
db_disconnect();
}
int
main(void)
{
TestCpp *t = new TestCpp();
t->test();
return 0;
}
To build the application, proceed as follows. Convert test_mod.pgc into test_mod.c by run-
ning ecpg, and generate test_mod.o by compiling test_mod.c with the C compiler:
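(one possible sequence of commands; the C++ part is compiled analogously with the C++ compiler:)
ecpg -o test_mod.c test_mod.pgc
cc -c test_mod.c -o test_mod.o
c++ -c test_cpp.cpp -o test_cpp.o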
Finally, link these object files, test_cpp.o and test_mod.o, into one executable, using the C
++ compiler driver:
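(for example:)
c++ -o test_cpp test_cpp.o test_mod.o -lecpg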
ALLOCATE DESCRIPTOR
ALLOCATE DESCRIPTOR — allocate an SQL descriptor area
Synopsis
Description
ALLOCATE DESCRIPTOR allocates a new named SQL descriptor area, which can be used to ex-
change data between the PostgreSQL server and the host program.
Descriptor areas should be freed after use using the DEALLOCATE DESCRIPTOR command.
Parameters
name
The name of the SQL descriptor, case sensitive. This can be an SQL identifier or a host variable.
Examples
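(for example, to allocate a descriptor named mydesc:)
EXEC SQL ALLOCATE DESCRIPTOR mydesc;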
Compatibility
ALLOCATE DESCRIPTOR is specified in the SQL standard.
See Also
DEALLOCATE DESCRIPTOR, GET DESCRIPTOR, SET DESCRIPTOR
CONNECT
CONNECT — establish a database connection
Synopsis
Description
The CONNECT command establishes a connection between the client and the PostgreSQL server.
Parameters
connection_target
connection_target specifies the target server of the connection in one of several forms.
host variable
host variable of type char[] or VARCHAR[] containing a value in one of the above forms
connection_name
An optional identifier for the connection, so that it can be referred to in other commands. This
can be an SQL identifier or a host variable.
connection_user
This parameter can also specify a user name and password, using one of the forms
user_name/password, user_name IDENTIFIED BY password, or user_name
USING password.
User name and password can be SQL identifiers, string constants, or host variables.
DEFAULT
Examples
Here are several variants for specifying connection parameters:
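(a few illustrative forms; the database, connection, and user names are placeholders:)
EXEC SQL CONNECT TO mydb@sql.mydomain.com;
EXEC SQL CONNECT TO tcp:postgresql://localhost:5432/testdb AS con1 USER testuser;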
Here is an example program that illustrates the use of host variables to specify connection parameters:
int
main(void)
{
EXEC SQL BEGIN DECLARE SECTION;
char *dbname = "testdb"; /* database name */
char *user = "testuser"; /* connection user name */
char *connection = "tcp:postgresql://localhost:5432/testdb";
/* connection string */
char ver[256]; /* buffer to store the version
string */
EXEC SQL END DECLARE SECTION;
ECPGdebug(1, stderr);
return 0;
}
Compatibility
CONNECT is specified in the SQL standard, but the format of the connection parameters is implemen-
tation-specific.
See Also
DISCONNECT, SET CONNECTION
DEALLOCATE DESCRIPTOR
DEALLOCATE DESCRIPTOR — deallocate an SQL descriptor area
Synopsis
Description
DEALLOCATE DESCRIPTOR deallocates a named SQL descriptor area.
Parameters
name
The name of the descriptor which is going to be deallocated. It is case sensitive. This can be an
SQL identifier or a host variable.
Examples
Compatibility
DEALLOCATE DESCRIPTOR is specified in the SQL standard.
See Also
ALLOCATE DESCRIPTOR, GET DESCRIPTOR, SET DESCRIPTOR
DECLARE
DECLARE — define a cursor
Synopsis
Description
DECLARE declares a cursor for iterating over the result set of a prepared statement. This command
has slightly different semantics from the direct SQL command DECLARE: Whereas the latter executes
a query and prepares the result set for retrieval, this embedded SQL command merely declares a name
as a “loop variable” for iterating over the result set of a query; the actual execution happens when the
cursor is opened with the OPEN command.
Parameters
cursor_name
A cursor name, case sensitive. This can be an SQL identifier or a host variable.
prepared_name
query
A SELECT or VALUES command which will provide the rows to be returned by the cursor.
Examples
Examples declaring a cursor for a query:
Compatibility
DECLARE is specified in the SQL standard.
See Also
OPEN, CLOSE, DECLARE
DESCRIBE
DESCRIBE — obtain information about a prepared statement or result set
Synopsis
Description
DESCRIBE retrieves metadata information about the result columns contained in a prepared statement,
without actually fetching a row.
Parameters
prepared_name
The name of a prepared statement. This can be an SQL identifier or a host variable.
descriptor_name
sqlda_name
Examples
Compatibility
DESCRIBE is specified in the SQL standard.
See Also
ALLOCATE DESCRIPTOR, GET DESCRIPTOR
DISCONNECT
DISCONNECT — terminate a database connection
Synopsis
DISCONNECT connection_name
DISCONNECT [ CURRENT ]
DISCONNECT DEFAULT
DISCONNECT ALL
Description
DISCONNECT closes a connection (or all connections) to the database.
Parameters
connection_name
CURRENT
Close the “current” connection, which is either the most recently opened connection, or the con-
nection set by the SET CONNECTION command. This is also the default if no argument is given
to the DISCONNECT command.
DEFAULT
ALL
Examples
int
main(void)
{
EXEC SQL CONNECT TO testdb AS DEFAULT USER testuser;
EXEC SQL CONNECT TO testdb AS con1 USER testuser;
EXEC SQL CONNECT TO testdb AS con2 USER testuser;
EXEC SQL CONNECT TO testdb AS con3 USER testuser;
return 0;
}
Compatibility
DISCONNECT is specified in the SQL standard.
See Also
CONNECT, SET CONNECTION
EXECUTE IMMEDIATE
EXECUTE IMMEDIATE — dynamically prepare and execute a statement
Synopsis
Description
EXECUTE IMMEDIATE immediately prepares and executes a dynamically specified SQL statement,
without retrieving result rows.
Parameters
string
Examples
Here is an example that executes an INSERT statement using EXECUTE IMMEDIATE and a host
variable named command:
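(an illustrative sketch; command is assumed to be a char array host variable and test a placeholder table:)
sprintf(command, "INSERT INTO test (i, j) VALUES (420, 42)");
EXEC SQL EXECUTE IMMEDIATE :command;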
Compatibility
EXECUTE IMMEDIATE is specified in the SQL standard.
GET DESCRIPTOR
GET DESCRIPTOR — get information from an SQL descriptor area
Synopsis
Description
GET DESCRIPTOR retrieves information about a query result set from an SQL descriptor area and
stores it into host variables. A descriptor area is typically populated using FETCH or SELECT before
using this command to transfer the information into host language variables.
This command has two forms: The first form retrieves descriptor “header” items, which apply to the
result set in its entirety. One example is the row count. The second form, which requires the column
number as additional parameter, retrieves information about a particular column. Examples are the
column name and the actual column value.
Parameters
descriptor_name
A descriptor name.
descriptor_header_item
A token identifying which header information item to retrieve. Only COUNT, to get the number
of columns in the result set, is currently supported.
column_number
The number of the column about which information is to be retrieved. The count starts at 1.
descriptor_item
A token identifying which item of information about a column to retrieve. See Section 36.7.1 for
a list of supported items.
cvariable
A host variable that will receive the data retrieved from the descriptor area.
Examples
An example to retrieve the number of columns in a result set:
int
main(void)
{
EXEC SQL BEGIN DECLARE SECTION;
int d_count;
char d_data[1024];
int d_returned_octet_length;
EXEC SQL END DECLARE SECTION;
/* Closing */
EXEC SQL CLOSE cur;
EXEC SQL COMMIT;
return 0;
}
When the example is executed, the result will look like this:
d_count = 1
d_returned_octet_length = 6
d_data = testdb
Compatibility
GET DESCRIPTOR is specified in the SQL standard.
See Also
ALLOCATE DESCRIPTOR, SET DESCRIPTOR
OPEN
OPEN — open a dynamic cursor
Synopsis
OPEN cursor_name
OPEN cursor_name USING value [, ... ]
OPEN cursor_name USING SQL DESCRIPTOR descriptor_name
Description
OPEN opens a cursor and optionally binds actual values to the placeholders in the cursor's declaration.
The cursor must previously have been declared with the DECLARE command. The execution of OPEN
causes the query to start executing on the server.
Parameters
cursor_name
The name of the cursor to be opened. This can be an SQL identifier or a host variable.
value
A value to be bound to a placeholder in the cursor. This can be an SQL constant, a host variable,
or a host variable with indicator.
descriptor_name
The name of a descriptor containing values to be bound to the placeholders in the cursor. This
can be an SQL identifier or a host variable.
Examples
Compatibility
OPEN is specified in the SQL standard.
See Also
DECLARE, CLOSE
PREPARE
PREPARE — prepare a statement for execution
Synopsis
Description
PREPARE prepares a statement dynamically specified as a string for execution. This is different from
the direct SQL statement PREPARE, which can also be used in embedded programs. The EXECUTE
command is used to execute either kind of prepared statement.
Parameters
prepared_name
string
A literal C string or a host variable containing a preparable statement, one of SELECT,
INSERT, UPDATE, or DELETE.
Examples
EXEC SQL EXECUTE foo USING SQL DESCRIPTOR indesc INTO SQL
DESCRIPTOR outdesc;
Compatibility
PREPARE is specified in the SQL standard.
See Also
EXECUTE
SET AUTOCOMMIT
SET AUTOCOMMIT — set the autocommit behavior of the current session
Synopsis
Description
SET AUTOCOMMIT sets the autocommit behavior of the current database session. By default, em-
bedded SQL programs are not in autocommit mode, so COMMIT needs to be issued explicitly when
desired. This command can change the session to autocommit mode, where each individual statement
is committed implicitly.
Compatibility
SET AUTOCOMMIT is an extension of PostgreSQL ECPG.
SET CONNECTION
SET CONNECTION — select a database connection
Synopsis
Description
SET CONNECTION sets the “current” database connection, which is the one that all commands use
unless overridden.
Parameters
connection_name
DEFAULT
Examples
Compatibility
SET CONNECTION is specified in the SQL standard.
See Also
CONNECT, DISCONNECT
SET DESCRIPTOR
SET DESCRIPTOR — set information in an SQL descriptor area
Synopsis
Description
SET DESCRIPTOR populates an SQL descriptor area with values. The descriptor area is then typi-
cally used to bind parameters in a prepared query execution.
This command has two forms: The first form applies to the descriptor “header”, which is independent
of a particular datum. The second form assigns values to particular datums, identified by number.
Parameters
descriptor_name
A descriptor name.
descriptor_header_item
A token identifying which header information item to set. Only COUNT, to set the number of
descriptor items, is currently supported.
number
descriptor_item
A token identifying which item of information to set in the descriptor. See Section 36.7.1 for a
list of supported items.
value
A value to store into the descriptor item. This can be an SQL constant or a host variable.
Examples
Compatibility
SET DESCRIPTOR is specified in the SQL standard.
See Also
ALLOCATE DESCRIPTOR, GET DESCRIPTOR
TYPE
TYPE — define a new data type
Synopsis
Description
The TYPE command defines a new C type. It is equivalent to putting a typedef into a declare section.
This command is only recognized when ecpg is run with the -c option.
Parameters
type_name
The name for the new type. It must be a valid C type name.
ctype
A C type specification.
Examples
};
int
main(void)
{
EXEC SQL BEGIN DECLARE SECTION;
tt t;
tt_ind t_ind;
EXEC SQL END DECLARE SECTION;
return 0;
}
t.v = testdb
t.i = 256
t_ind.v_ind = 0
t_ind.i_ind = 0
Compatibility
The TYPE command is a PostgreSQL extension.
VAR
VAR — define a variable
Synopsis
Description
The VAR command assigns a new C data type to a host variable. The host variable must be previously
declared in a declare section.
Parameters
varname
A C variable name.
ctype
A C type specification.
Examples
Compatibility
The VAR command is a PostgreSQL extension.
WHENEVER
WHENEVER — specify the action to be taken when an SQL statement causes a specific class con-
dition to be raised
Synopsis
Description
Defines a behavior that is invoked in the special cases (rows not found, SQL warnings, or errors)
in the result of SQL execution.
Parameters
See Section 36.8.1 for a description of the parameters.
Examples
A typical application is the use of WHENEVER NOT FOUND DO BREAK to handle looping through
result sets:
int
main(void)
{
EXEC SQL CONNECT TO testdb AS con1;
EXEC SQL SELECT pg_catalog.set_config('search_path', '', false);
EXEC SQL COMMIT;
EXEC SQL ALLOCATE DESCRIPTOR d;
EXEC SQL DECLARE cur CURSOR FOR SELECT current_database(),
'hoge', 256;
EXEC SQL OPEN cur;
while (1)
{
EXEC SQL FETCH NEXT FROM cur INTO SQL DESCRIPTOR d;
...
}
return 0;
}
Compatibility
WHENEVER is specified in the SQL standard, but most of the actions are PostgreSQL extensions.
$int j = 3;
$CONNECT TO :dbname;
$CREATE TABLE test(i INT PRIMARY KEY, j INT);
$INSERT INTO test(i, j) VALUES (7, :j);
$COMMIT;
Note
There must not be any white space between the $ and a following preprocessor directive, that
is, include, define, ifdef, etc. Otherwise, the preprocessor will parse the token as a
host variable.
When linking programs that use this compatibility mode, remember to link against libcompat that
is shipped with ECPG.
Besides the previously explained syntactic sugar, the Informix compatibility mode ports some func-
tions for input, output and transformation of data as well as embedded SQL statements known from
E/SQL to ECPG.
Informix compatibility mode is closely connected to the pgtypeslib library of ECPG. pgtypeslib maps
SQL data types to data types within the C host program and most of the additional functions of the
Informix compatibility mode allow you to operate on those C host program types. Note, however, that
the extent of the compatibility is limited. It does not try to copy Informix behavior; it allows you to do
more or less the same operations and gives you functions that have the same name and the same basic
behavior, but it is no drop-in replacement if you are using Informix at the moment. Moreover, some
of the data types are different. For example, PostgreSQL's datetime and interval types do not know
about ranges such as YEAR TO MINUTE, so you won't find support in ECPG for that either.
This statement closes the current connection. In fact, this is a synonym for ECPG's DISCONNECT
CURRENT:
FREE cursor_name
Due to the differences in how ECPG works compared to Informix's ESQL/C (i.e., which steps are
purely grammar transformations and which steps rely on the underlying run-time library), there is
no FREE cursor_name statement in ECPG. This is because in ECPG, DECLARE CURSOR
doesn't translate to a function call into the run-time library that uses the cursor name. This
means that there's no run-time bookkeeping of SQL cursors in the ECPG run-time library, only
in the PostgreSQL server.
FREE statement_name
struct sqlvar_compat
{
short sqltype;
int sqllen;
char *sqldata;
short *sqlind;
char *sqlname;
char *sqlformat;
short sqlitype;
short sqlilen;
char *sqlidata;
int sqlxid;
char *sqltypename;
short sqltypelen;
short sqlownerlen;
short sqlsourcetype;
char *sqlownername;
int sqlsourceid;
char *sqlilongdata;
int sqlflags;
void *sqlreserved;
};
struct sqlda_compat
{
short sqld;
struct sqlvar_compat *sqlvar;
char desc_name[19];
short desc_occ;
struct sqlda_compat *desc_next;
void *reserved;
};
sqld
sqlvar
desc_name
desc_occ
desc_next
Pointer to the next SQLDA structure if the result set contains more than one record.
reserved
The per-field properties are below; they are stored in the sqlvar array:
sqltype
sqllen
sqldata
Pointer to the field data. The pointer is of char * type, the data pointed by it is in a binary
format. Example:
int intval;
switch (sqldata->sqlvar[i].sqltype)
{
case SQLINTEGER:
intval = *(int *)sqldata->sqlvar[i].sqldata;
break;
...
}
sqlind
Pointer to the NULL indicator. If returned by DESCRIBE or FETCH it is always a valid
pointer. If used as input for EXECUTE ... USING sqlda;, a NULL pointer value means
that the value for this field is non-NULL. Otherwise, a valid pointer must be supplied and sqlitype has to be
properly set. Example:
if (*(int2 *)sqldata->sqlvar[i].sqlind != 0)
printf("value is NULL\n");
sqlname
sqlformat
sqlitype
Type of the NULL indicator data. It's always SQLSMINT when returning data from the server.
When the SQLDA is used for a parameterized query, the data is treated according to the set type.
sqlilen
sqlxid
sqltypename
sqltypelen
sqlownerlen
sqlsourcetype
sqlownername
sqlsourceid
sqlflags
sqlreserved
Unused.
sqlilongdata
Example:
...
while (1)
{
EXEC SQL FETCH mycursor USING sqlda;
}
For more information, see the sqlda.h header and the src/interfaces/ecpg/test/compat_informix/sqlda.pgc
regression test.
decadd
The function receives a pointer to the first operand of type decimal (arg1), a pointer to the second
operand of type decimal (arg2) and a pointer to a value of type decimal that will contain the sum
(sum). On success, the function returns 0. ECPG_INFORMIX_NUM_OVERFLOW is returned in
case of overflow and ECPG_INFORMIX_NUM_UNDERFLOW in case of underflow. -1 is returned
for other failures and errno is set to the respective errno number of the pgtypeslib.
deccmp
The function receives a pointer to the first decimal value (arg1), a pointer to the second decimal
value (arg2) and returns an integer value that indicates which is the bigger value.
• 1, if the value that arg1 points to is bigger than the value that arg2 points to
• -1, if the value that arg1 points to is smaller than the value that arg2 points to
• 0, if the value that arg1 points to and the value that arg2 points to are equal
deccopy
The function receives a pointer to the decimal value that should be copied as the first argument
(src) and a pointer to the target structure of type decimal (target) as the second argument.
deccvasc
The function receives a pointer to the string that contains the string representation of the number
to be converted (cp) as well as its length (len). np is a pointer to the decimal value that saves the
result of the operation.
Valid formats are for example: -2, .794, +3.44, 592.49E07 or -32.84e-4.
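For example, parsing a user-supplied string and checking the outcome could look like the following
sketch (same assumed headers as above; negative return codes other than the overflow and underflow
constants are treated here simply as parse failures):

#include <stdio.h>
#include <string.h>
#include <pgtypes_numeric.h>
#include <ecpg_informix.h>

int
main(void)
{
    char    input[] = "592.49E07";
    decimal value;
    int     rc;

    rc = deccvasc(input, (int) strlen(input), &value);
    if (rc == 0)
        printf("parsed OK\n");
    else if (rc == ECPG_INFORMIX_NUM_OVERFLOW || rc == ECPG_INFORMIX_NUM_UNDERFLOW)
        printf("value out of range\n");
    else
        printf("not a valid number (code %d)\n", rc);

    return 0;
}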
deccvdbl
The function receives the variable of type double that should be converted as its first argument
(dbl). As the second argument (np), the function receives a pointer to the decimal variable that
should hold the result of the operation.
The function returns 0 on success and a negative value if the conversion failed.
deccvint
The function receives the variable of type int that should be converted as its first argument (in).
As the second argument (np), the function receives a pointer to the decimal variable that should
hold the result of the operation.
The function returns 0 on success and a negative value if the conversion failed.
deccvlong
The function receives the variable of type long that should be converted as its first argument
(lng). As the second argument (np), the function receives a pointer to the decimal variable that
should hold the result of the operation.
The function returns 0 on success and a negative value if the conversion failed.
decdiv
The function receives pointers to the variables that are the first (n1) and the second (n2) operands
and calculates n1/n2. result is a pointer to the variable that should hold the result of the op-
eration.
On success, 0 is returned and a negative value if the division fails. If overflow or under-
flow occurred, the function returns ECPG_INFORMIX_NUM_OVERFLOW or ECPG_INFOR-
MIX_NUM_UNDERFLOW respectively. If an attempt to divide by zero is observed, the function
returns ECPG_INFORMIX_DIVIDE_ZERO.
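A sketch of a guarded division, again assuming the compatibility headers mentioned above:

#include <stdio.h>
#include <pgtypes_numeric.h>
#include <ecpg_informix.h>

int
main(void)
{
    decimal n1, n2, result;
    char    buf[64];
    int     rc;

    deccvint(10, &n1);
    deccvint(0, &n2);           /* deliberately provoke a division by zero */

    rc = decdiv(&n1, &n2, &result);
    if (rc == ECPG_INFORMIX_DIVIDE_ZERO)
        printf("division by zero\n");
    else if (rc == 0)
    {
        dectoasc(&result, buf, (int) sizeof(buf), 2);
        printf("result = %s\n", buf);
    }
    else
        printf("decdiv failed (code %d)\n", rc);

    return 0;
}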
decmul
The function receives pointers to the variables that are the first (n1) and the second (n2) operands
and calculates n1*n2. result is a pointer to the variable that should hold the result of the
operation.
On success, 0 is returned and a negative value if the multiplication fails. If overflow or under-
flow occurred, the function returns ECPG_INFORMIX_NUM_OVERFLOW or ECPG_INFOR-
MIX_NUM_UNDERFLOW respectively.
decsub
The function receives pointers to the variables that are the first (n1) and the second (n2) operands
and calculates n1-n2. result is a pointer to the variable that should hold the result of the
operation.
On success, 0 is returned and a negative value if the subtraction fails. If overflow or under-
flow occurred, the function returns ECPG_INFORMIX_NUM_OVERFLOW or ECPG_INFOR-
MIX_NUM_UNDERFLOW respectively.
dectoasc
The function receives a pointer to a variable of type decimal (np) that it converts to its textual
representation. cp is the buffer that should hold the result of the operation. The parameter right
specifies how many digits right of the decimal point should be included in the output. The result
will be rounded to this number of decimal digits. Setting right to -1 indicates that all available
decimal digits should be included in the output. If the length of the output buffer, which is indicated
by len, is not sufficient to hold the textual representation including the trailing zero byte, only
a single * character is stored in the result and -1 is returned.
The function returns either -1 if the buffer cp was too small or ECPG_INFOR-
MIX_OUT_OF_MEMORY if memory was exhausted.
dectodbl
The function receives a pointer to the decimal value to convert (np) and a pointer to the double
variable that should hold the result of the operation (dblp).
dectoint
The function receives a pointer to the decimal value to convert (np) and a pointer to the integer
variable that should hold the result of the operation (ip).
On success, 0 is returned and a negative value if the conversion failed. If an overflow occurred,
ECPG_INFORMIX_NUM_OVERFLOW is returned.
Note that the ECPG implementation differs from the Informix implementation. Informix limits an
integer to the range from -32767 to 32767, while the limits in the ECPG implementation depend
on the architecture (INT_MIN .. INT_MAX).
dectolong
The function receives a pointer to the decimal value to convert (np) and a pointer to the long
variable that should hold the result of the operation (lngp).
On success, 0 is returned and a negative value if the conversion failed. If an overflow occurred,
ECPG_INFORMIX_NUM_OVERFLOW is returned.
Note that the ECPG implementation differs from the Informix implementation. Informix limits
a long integer to the range from -2,147,483,647 to 2,147,483,647, while the limits in the ECPG
implementation depend on the architecture (-LONG_MAX .. LONG_MAX).
rdatestr
The function receives two arguments, the first one is the date to convert (d) and the second one is
a pointer to the target string. The output format is always yyyy-mm-dd, so you need to allocate
at least 11 bytes (including the zero-byte terminator) for the string.
Note that ECPG's implementation differs from the Informix implementation. In Informix the for-
mat can be influenced by setting environment variables. In ECPG however, you cannot change
the output format.
rstrdate
The function receives the textual representation of the date to convert (str) and a pointer to a
variable of type date (d). This function does not allow you to specify a format mask. It uses the
default format mask of Informix which is mm/dd/yyyy. Internally, this function is implemented
by means of rdefmtdate. Therefore, rstrdate is not faster and if you have the choice you
should opt for rdefmtdate which allows you to specify the format mask explicitly.
rtoday
The function receives a pointer to a date variable (d) that it sets to the current date.
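Putting rtoday and rdatestr together, a minimal sketch might look like this (pgtypes_date.h is assumed
to provide the date type, and a return value of 0 is assumed to indicate success):

#include <stdio.h>
#include <pgtypes_date.h>      /* the date type (assumed header name) */
#include <ecpg_informix.h>

int
main(void)
{
    date d;
    char buf[20];              /* at least 11 bytes are required */

    rtoday(&d);                /* d = current date */
    if (rdatestr(d, buf) == 0)
        printf("today is %s\n", buf);   /* e.g., 2018-10-18 */

    return 0;
}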
rjulmdy
Extract the values for the day, the month and the year from a variable of type date.
The function receives the date d and a pointer to an array of 3 short integer values mdy. The
variable name indicates the sequential order: mdy[0] will be set to contain the number of the
month, mdy[1] will be set to the value of the day and mdy[2] will contain the year.
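For example (same assumed headers as above):

#include <stdio.h>
#include <pgtypes_date.h>
#include <ecpg_informix.h>

int
main(void)
{
    date  d;
    short mdy[3];

    rtoday(&d);
    if (rjulmdy(d, mdy) == 0)
        printf("month=%d day=%d year=%d\n", mdy[0], mdy[1], mdy[2]);

    return 0;
}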
rdefmtdate
The function receives a pointer to the date value that should hold the result of the operation (d),
the format mask to use for parsing the date (fmt) and the C char* string containing the textual
representation of the date (str). The textual representation is expected to match the format mask.
However, you do not need to have a 1:1 mapping of the string to the format mask. The function
only analyzes the sequential order and looks for the literals yy or yyyy that indicate the position
of the year, mm to indicate the position of the month and dd to indicate the position of the day.
• ECPG_INFORMIX_ENOTDMY - The format string did not correctly indicate the sequential
order of year, month and day.
rfmtdate
Convert a variable of type date to its textual representation using a format mask.
The function receives the date to convert (d), the format mask (fmt) and the string that will hold
the textual representation of the date (str).
Internally this function uses the PGTYPESdate_fmt_asc function, see the reference there for
examples.
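The following sketch parses a date with one mask and re-formats it with another (headers as assumed
above; the concrete masks and the expected output are only an illustration):

#include <stdio.h>
#include <pgtypes_date.h>
#include <ecpg_informix.h>

int
main(void)
{
    date d;
    char out[64];

    /* Parse "12/25/2018" according to the mask mm/dd/yyyy ... */
    if (rdefmtdate(&d, "mm/dd/yyyy", "12/25/2018") != 0)
    {
        printf("could not parse date\n");
        return 1;
    }

    /* ... and write it back out with a different mask. */
    if (rfmtdate(d, "yyyy-mm-dd", out) == 0)
        printf("%s\n", out);            /* expected: 2018-12-25 */

    return 0;
}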
rmdyjul
Create a date value from an array of 3 short integers that specify the day, the month and the year
of the date.
The function receives the array of the 3 short integers (mdy) and a pointer to a variable of type
date that should hold the result of the operation.
rdayofweek
Return a number representing the day of the week for a date value.
The function receives the date variable d as its only argument and returns an integer that indicates
the day of the week for this date.
• 0 - Sunday
• 1 - Monday
• 2 - Tuesday
• 3 - Wednesday
• 4 - Thursday
• 5 - Friday
• 6 - Saturday
dtcurrent
The function retrieves the current timestamp and saves it into the timestamp variable that ts
points to.
dtcvasc
The function receives the string to parse (str) and a pointer to the timestamp variable that should
hold the result of the operation (ts).
Internally this function uses the PGTYPEStimestamp_from_asc function. See the reference
there for a table with example inputs.
dtcvfmtasc
Parses a timestamp from its textual representation using a format mask into a timestamp variable.
The function receives the string to parse (inbuf), the format mask to use (fmtstr) and a pointer
to the timestamp variable that should hold the result of the operation (dtvalue).
dtsub
Subtract one timestamp from another and return a variable of type interval.
The function will subtract the timestamp variable that ts2 points to from the timestamp variable
that ts1 points to and will store the result in the interval variable that iv points to.
Upon success, the function returns 0 and a negative value if an error occurred.
dttoasc
The function receives a pointer to the timestamp variable to convert (ts) and the string that should
hold the result of the operation (output). It converts ts to its textual representation according
to the SQL standard, which is YYYY-MM-DD HH:MM:SS.
Upon success, the function returns 0 and a negative value if an error occurred.
dttofmtasc
The function receives a pointer to the timestamp to convert as its first argument (ts), a pointer
to the output buffer (output), the maximal length that has been allocated for the output buffer
(str_len) and the format mask to use for the conversion (fmtstr).
Upon success, the function returns 0 and a negative value if an error occurred.
Internally, this function uses the PGTYPEStimestamp_fmt_asc function. See the reference
there for information on what format mask specifiers can be used.
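For instance, formatting the current timestamp could look like this sketch (pgtypes_timestamp.h is
assumed to provide the timestamp type, and the format mask uses the strftime-style specifiers accepted
by PGTYPEStimestamp_fmt_asc):

#include <stdio.h>
#include <pgtypes_timestamp.h> /* the timestamp type (assumed header name) */
#include <ecpg_informix.h>

int
main(void)
{
    timestamp ts;
    char      out[64];

    dtcurrent(&ts);            /* ts = current timestamp */

    if (dttofmtasc(&ts, out, (int) sizeof(out), "%Y-%m-%d %H:%M:%S") == 0)
        printf("now: %s\n", out);

    return 0;
}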
intoasc
The function receives a pointer to the interval variable to convert (i) and the string that should
hold the result of the operation (str). It converts i to its textual representation according to the
SQL standard, which is YYYY-MM-DD HH:MM:SS.
Upon success, the function returns 0 and a negative value if an error occurred.
rfmtlong
Convert a long integer value to its textual representation using a format mask.
The function receives the long value lng_val, the format mask fmt and a pointer to the output
buffer outbuf. It converts the long value according to the format mask to its textual represen-
tation.
The format mask can be composed of the following format-specifying characters (a usage sketch follows the list):
• & (ampersand) - if this position would be blank otherwise, fill it with a zero.
• , (comma) - group numbers of four or more digits into groups of three digits separated by a
comma.
• . (period) - this character separates the whole-number part of the number from the fractional
part.
• ( - this replaces the minus sign in front of the negative number. The minus sign will not appear.
• ) - this character replaces the minus and is printed behind the negative value.
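A usage sketch (the mask below uses only the characters listed above; the exact output depends on the
mask semantics, so treat the expected value as an assumption):

#include <stdio.h>
#include <ecpg_informix.h>

int
main(void)
{
    char out[32];              /* at least as long as the mask plus terminator */

    if (rfmtlong(1234567L, "&&,&&&,&&&", out) == 0)
        printf("formatted: %s\n", out);   /* something like 01,234,567 */

    return 0;
}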
rupshift
The function receives a pointer to the string and transforms every lower case character to upper
case.
byleng
The function expects a fixed-length string as its first argument (str) and its length as its second
argument (len). It returns the number of significant characters, that is the length of the string
without trailing blanks.
ldchar
The function receives the fixed-length string to copy (src), its length (len) and a pointer to the
destination memory (dest). Note that you need to reserve at least len+1 bytes for the string
that dest points to. The function copies at most len bytes to the new location (less if the source
string has trailing blanks) and adds the null-terminator.
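A small sketch combining byleng and ldchar (headers as assumed above):

#include <stdio.h>
#include <ecpg_informix.h>

int
main(void)
{
    /* A fixed-length, blank-padded value as it might come from a CHAR(10)
     * column: "abc" followed by seven trailing blanks, no terminator. */
    char fixed[10] = {'a', 'b', 'c', ' ', ' ', ' ', ' ', ' ', ' ', ' '};
    char dest[11];             /* len + 1 bytes for the null terminator */

    printf("significant characters: %d\n", byleng(fixed, 10));  /* 3 */

    ldchar(fixed, 10, dest);   /* copies "abc" and adds the terminator */
    printf("copied: \"%s\"\n", dest);

    return 0;
}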
rgetmsg
rtypalign
rtypmsize
rtypwidth
rsetnull
The function receives an integer that indicates the type of the variable and a pointer to the variable
itself that is cast to a C char* pointer.
risnull
The function receives the type of the variable to test (t) as well as a pointer to this variable (ptr).
Note that the latter needs to be cast to a char*. See the function rsetnull for a list of possible
variable types.
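As a sketch, marking a C int as SQL NULL and testing it again could look like this; the type code
CINTTYPE is assumed to come from the compatibility library's sqltypes.h (the header and constant
names are assumptions):

#include <stdio.h>
#include <sqltypes.h>          /* CINTTYPE and friends (assumed header name) */
#include <ecpg_informix.h>

int
main(void)
{
    int i = 0;

    rsetnull(CINTTYPE, (char *) &i);    /* mark i as SQL NULL */

    if (risnull(CINTTYPE, (char *) &i))
        printf("i is NULL\n");
    else
        printf("i = %d\n", i);

    return 0;
}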
ECPG_INFORMIX_NUM_OVERFLOW
Functions return this value if an overflow occurred in a calculation.
ECPG_INFORMIX_NUM_UNDERFLOW
Functions return this value if an underflow occurred in a calculation.
ECPG_INFORMIX_DIVIDE_ZERO
Functions return this value if an attempt to divide by zero is observed. Internally it is defined as
-1202 (the Informix definition).
ECPG_INFORMIX_BAD_YEAR
Functions return this value if a bad value for a year was found while parsing a date. Internally it
is defined as -1204 (the Informix definition).
ECPG_INFORMIX_BAD_MONTH
Functions return this value if a bad value for a month was found while parsing a date. Internally
it is defined as -1205 (the Informix definition).
ECPG_INFORMIX_BAD_DAY
Functions return this value if a bad value for a day was found while parsing a date. Internally it
is defined as -1206 (the Informix definition).
ECPG_INFORMIX_ENOSHORTDATE
Functions return this value if a parsing routine needs a short date representation but did not get
the date string in the right length. Internally it is defined as -1209 (the Informix definition).
ECPG_INFORMIX_DATE_CONVERT
Functions return this value if an error occurred during date formatting. Internally it is defined as
-1210 (the Informix definition).
ECPG_INFORMIX_OUT_OF_MEMORY
Functions return this value if memory was exhausted during their operation. Internally it is defined
as -1211 (the Informix definition).
ECPG_INFORMIX_ENOTDMY
Functions return this value if a parsing routine was supposed to get a format mask (like mmddyy)
but not all fields were listed correctly. Internally it is defined as -1212 (the Informix definition).
ECPG_INFORMIX_BAD_NUMERIC
Functions return this value either if a parsing routine cannot parse the textual representation for
a numeric value because it contains errors or if a routine cannot complete a calculation involving
numeric variables because at least one of the numeric variables is invalid. Internally it is defined
as -1213 (the Informix definition).
ECPG_INFORMIX_BAD_EXPONENT
Functions return this value if a parsing routine cannot parse an exponent. Internally it is defined
as -1216 (the Informix definition).
ECPG_INFORMIX_BAD_DATE
Functions return this value if a parsing routine cannot parse a date. Internally it is defined as -1218
(the Informix definition).
ECPG_INFORMIX_EXTRA_CHARS
Functions return this value if a parsing routine is passed extra characters it cannot parse. Internally
it is defined as -1264 (the Informix definition).
• Pad character arrays receiving character string types with trailing spaces to the specified length
• Zero byte terminate these character arrays, and set the indicator variable if truncation occurs
• Set the null indicator to -1 when character arrays receive empty character string types
36.17. Internals
This section explains how ECPG works internally. This information can occasionally be useful to help
users understand how to use ECPG.
The first four lines written by ecpg to the output are fixed lines. Two are comments and two are
include lines necessary to interface to the library. Then the preprocessor reads through the file and
writes output. Normally it just echoes everything to the output.
When it sees an EXEC SQL statement, it intervenes and changes it. The command starts with EXEC
SQL and ends with ;. Everything in between is treated as an SQL statement and parsed for variable
substitution.
Variable substitution occurs when a symbol starts with a colon (:). The variable with that name is
looked up among the variables that were previously declared within an EXEC SQL DECLARE section.
The most important function in the library is ECPGdo, which takes care of executing most commands.
It takes a variable number of arguments. This can easily add up to 50 or so arguments, and we hope
this will not be a problem on any platform.
A line number
This is the line number of the original line; used in error messages only.
A string
This is the SQL command that is to be issued. It is modified by the input variables, i.e., the
variables that were not known at compile time but are to be entered in the command. Where the
variables should go, the string contains ?.
Input variables
Every input variable causes ten arguments to be created. (See below.)
ECPGt_EOIT
An enum telling that there are no more input variables.
Output variables
Every output variable causes ten arguments to be created. (See below.) These variables are filled
by the function.
ECPGt_EORT
An enum telling that there are no more variables.
For every variable that is part of the SQL command, the function gets ten arguments:
1. The type as a special symbol.
2. A pointer to the value or a pointer to the pointer.
3. The size of the variable if it is a char or varchar.
4. The number of elements in the array (for array fetches).
5. The offset to the next element in the array (for array fetches).
6. The type of the indicator variable as a special symbol.
7. A pointer to the indicator variable.
8. 0
9. The number of elements in the indicator array (for array fetches).
10. The offset to the next element in the indicator array (for array fetches).
Note that not all SQL commands are treated in this way. For instance, an open cursor statement like:
EXEC SQL OPEN cursor;
is not copied to the output. Instead, the cursor's DECLARE command is used at the position of the
OPEN command because it indeed opens the cursor.
Here is a complete example describing the output of the preprocessor of a file foo.pgc (details might
change with each particular version of the preprocessor):
EXEC SQL BEGIN DECLARE SECTION;
int index;
int result;
EXEC SQL END DECLARE SECTION;
...
EXEC SQL SELECT res INTO :result FROM mytable WHERE index = :index;
is translated into:
#line 1 "foo.pgc"
int index;
int result;
/* exec sql end declare section */
...
ECPGdo(__LINE__, NULL, "SELECT res FROM mytable WHERE index = ?  ",
ECPGt_int,&(index),1L,1L,sizeof(int),
ECPGt_NO_INDICATOR, NULL , 0L, 0L, 0L, ECPGt_EOIT,
ECPGt_int,&(result),1L,1L,sizeof(int),
ECPGt_NO_INDICATOR, NULL , 0L, 0L, 0L, ECPGt_EORT);
#line 147 "foo.pgc"
(The indentation here is added for readability and not something the preprocessor does.)
Chapter 37. The Information Schema
The information schema consists of a set of views that contain information about the objects defined
in the current database. The information schema is defined in the SQL standard and can therefore
be expected to be portable and remain stable — unlike the system catalogs, which are specific to
PostgreSQL and are modeled after implementation concerns. The information schema views do not,
however, contain information about PostgreSQL-specific features; to inquire about those you need to
query the system catalogs or other PostgreSQL-specific views.
Note
When querying the database for constraint information, it is possible for a standard-compli-
ant query that expects to return one row to return several. This is because the SQL standard
requires constraint names to be unique within a schema, but PostgreSQL does not enforce
this restriction. PostgreSQL's automatically generated constraint names avoid duplicates in the
same schema, but users can specify such duplicate names.
This problem can appear when querying information schema views such as check_con-
straint_routine_usage, check_constraints, domain_constraints, and
referential_constraints. Some other views have similar issues but contain the ta-
ble name to help distinguish duplicate rows, e.g., constraint_column_usage, con-
straint_table_usage, table_constraints.
By default, the information schema is not in the schema search path, so you need to access all objects
in it through qualified names. Since the names of some of the objects in the information schema are
generic names that might occur in user applications, you should be careful if you want to put the
information schema in the path.
cardinal_number
A nonnegative integer.
character_data
A character string (without specific maximum length).
sql_identifier
A character string. This type is used for SQL identifiers; the type character_data is used
for any other kind of text data.
time_stamp
A domain over the type timestamp with time zone.
yes_or_no
A character string domain that contains either YES or NO. This is used to represent Boolean (true/
false) data in the information schema. (The information schema was invented before the type
boolean was added to the SQL standard, so this convention is necessary to keep the information
schema backward compatible.)
Every column in the information schema has one of these five types.
37.3. information_schema_catalog_name
information_schema_catalog_name is a table that always contains one row and one column
containing the name of the current database (current catalog, in SQL terminology).
37.4. administrable_role_authorizations
The view administrable_role_authorizations identifies all roles that the current user
has the admin option for.
37.5. applicable_roles
The view applicable_roles identifies all roles whose privileges the current user can use. This
means there is some chain of role grants from the current user to the role in question. The current user
itself is also an applicable role. The set of applicable roles is generally used for permission checking.
37.6. attributes
The view attributes contains information about the attributes of composite data types defined in
the database. (Note that the view does not give information about table columns, which are sometimes
called attributes in PostgreSQL contexts.) Only those attributes are shown that the current user has
access to (by way of being the owner of or having some privilege on the type).
See also under Section 37.16, a similarly structured view, for further information on some of the
columns.
37.7. character_sets
The view character_sets identifies the character sets available in the current database. Since
PostgreSQL does not support multiple character sets within one database, this view only shows one,
which is the database encoding.
Take note of how the following terms are used in the SQL standard:
character repertoire
An abstract collection of characters, for example UNICODE, UCS, or LATIN1. Not exposed as
an SQL object, but visible in this view.
character encoding form
An encoding of some character repertoire. Most older character repertoires only use one encoding
form, and so there are no separate names for them (e.g., LATIN1 is an encoding form applicable
to the LATIN1 repertoire). But for example Unicode has the encoding forms UTF8, UTF16, etc.
(not all supported by PostgreSQL). Encoding forms are not exposed as an SQL object, but are
visible in this view.
character set
A named SQL object that identifies a character repertoire, a character encoding, and a default
collation. A predefined character set would typically have the same name as an encoding form,
but users could define other names. For example, the character set UTF8 would typically identify
the character repertoire UCS, encoding form UTF8, and some default collation.
You can think of an “encoding” in PostgreSQL either as a character set or a character encoding form.
They will have the same name, and there can only be one in one database.
37.8. check_constraint_routine_usage
The view check_constraint_routine_usage identifies routines (functions and procedures)
that are used by a check constraint. Only those routines are shown that are owned by a currently
enabled role.
37.9. check_constraints
The view check_constraints contains all check constraints, either defined on a table or on a
domain, that are owned by a currently enabled role. (The owner of the table or domain is the owner
of the constraint.)
37.10. collations
The view collations contains the collations available in the current database.
37.11. collation_character_set_applicability
The view collation_character_set_applicability identifies which character set the
available collations are applicable to. In PostgreSQL, there is only one character set per database (see
explanation in Section 37.7), so this view does not provide much useful information.
37.12. column_domain_usage
The view column_domain_usage identifies all columns (of a table or a view) that make use of
some domain defined in the current database and owned by a currently enabled role.
37.13. column_options
The view column_options contains all the options defined for foreign table columns in the current
database. Only those foreign table columns are shown that the current user has access to (by way of
being the owner or having some privilege).
37.14. column_privileges
The view column_privileges identifies all privileges granted on columns to a currently enabled
role or by a currently enabled role. There is one row for each combination of column, grantor, and
grantee.
If a privilege has been granted on an entire table, it will show up in this view as a grant for each
column, but only for the privilege types where column granularity is possible: SELECT, INSERT,
UPDATE, REFERENCES.
37.15. column_udt_usage
The view column_udt_usage identifies all columns that use data types owned by a currently
enabled role. Note that in PostgreSQL, built-in data types behave like user-defined types, so they are
included here as well. See also Section 37.16 for details.
37.16. columns
The view columns contains information about all table columns (or view columns) in the database.
System columns (oid, etc.) are not included. Only those columns are shown that the current user has
access to (by way of being the owner or having some privilege).
Since data types can be defined in a variety of ways in SQL, and PostgreSQL contains additional
ways to define data types, their representation in the information schema can be somewhat difficult.
The column data_type is supposed to identify the underlying built-in type of the column. In
PostgreSQL, this means that the type is defined in the system catalog schema pg_catalog. This column
might be useful if the application can handle the well-known built-in types specially (for example, for-
mat the numeric types differently or use the data in the precision columns). The columns udt_name,
udt_schema, and udt_catalog always identify the underlying data type of the column, even if
the column is based on a domain. (Since PostgreSQL treats built-in types like user-defined types, built-
in types appear here as well. This is an extension of the SQL standard.) These columns should be used
if an application wants to process data differently according to the type, because in that case it wouldn't
matter if the column is really based on a domain. If the column is based on a domain, the identity of
the domain is stored in the columns domain_name, domain_schema, and domain_catalog.
If you want to pair up columns with their associated data types and treat domains as separate types,
you could write coalesce(domain_name, udt_name), etc.
37.17. constraint_column_usage
The view constraint_column_usage identifies all columns in the current database that are
used by some constraint. Only those columns are shown that are contained in a table owned by a
currently enabled role. For a check constraint, this view identifies the columns that are used in the
check expression. For a foreign key constraint, this view identifies the columns that the foreign key
references. For a unique or primary key constraint, this view identifies the constrained columns.
37.18. constraint_table_usage
The view constraint_table_usage identifies all tables in the current database that are used
by some constraint and are owned by a currently enabled role. (This is different from the view ta-
ble_constraints, which identifies all table constraints along with the table they are defined on.)
For a foreign key constraint, this view identifies the table that the foreign key references. For a unique
or primary key constraint, this view simply identifies the table the constraint belongs to. Check con-
straints and not-null constraints are not included in this view.
37.19. data_type_privileges
The view data_type_privileges identifies all data type descriptors that the current user has
access to, by way of being the owner of the described object or having some privilege for it. A data
type descriptor is generated whenever a data type is used in the definition of a table column, a domain,
or a function (as parameter or return type) and stores some information about how the data type is used
in that instance (for example, the declared maximum length, if applicable). Each data type descriptor
is assigned an arbitrary identifier that is unique among the data type descriptor identifiers assigned for
one object (table, domain, function). This view is probably not useful for applications, but it is used
to define some other views in the information schema.
37.20. domain_constraints
The view domain_constraints contains all constraints belonging to domains defined in the cur-
rent database. Only those domains are shown that the current user has access to (by way of being the
owner or having some privilege).
37.21. domain_udt_usage
The view domain_udt_usage identifies all domains that are based on data types owned by a
currently enabled role. Note that in PostgreSQL, built-in data types behave like user-defined types,
so they are included here as well.
37.22. domains
The view domains contains all domains defined in the current database. Only those domains are
shown that the current user has access to (by way of being the owner or having some privilege).
37.23. element_types
The view element_types contains the data type descriptors of the elements of arrays. When a
table column, composite-type attribute, domain, function parameter, or function return value is defined
to be of an array type, the respective information schema view only contains ARRAY in the column
data_type. To obtain information on the element type of the array, you can join the respective view
with this view. For example, to show the columns of a table with data types and array element types,
if applicable, you could do:
This view only includes objects that the current user has access to, by way of being the owner or
having some privilege.
37.24. enabled_roles
The view enabled_roles identifies the currently “enabled roles”. The enabled roles are recursively
defined as the current user together with all roles that have been granted to the enabled roles with
automatic inheritance. In other words, these are all roles of which the current user is a member, directly
or indirectly, through automatic inheritance.
For permission checking, the set of “applicable roles” is used, which can be broader than the set
of enabled roles. So it is generally better to use the view applicable_roles instead of this one;
see Section 37.5 for details on the applicable_roles view.
37.25. foreign_data_wrapper_options
The view foreign_data_wrapper_options contains all the options defined for foreign-data
wrappers in the current database. Only those foreign-data wrappers are shown that the current user
has access to (by way of being the owner or having some privilege).
37.26. foreign_data_wrappers
The view foreign_data_wrappers contains all foreign-data wrappers defined in the current
database. Only those foreign-data wrappers are shown that the current user has access to (by way of
being the owner or having some privilege).
37.27. foreign_server_options
The view foreign_server_options contains all the options defined for foreign servers in the
current database. Only those foreign servers are shown that the current user has access to (by way of
being the owner or having some privilege).
37.28. foreign_servers
The view foreign_servers contains all foreign servers defined in the current database. Only
those foreign servers are shown that the current user has access to (by way of being the owner or
having some privilege).
37.29. foreign_table_options
The view foreign_table_options contains all the options defined for foreign tables in the
current database. Only those foreign tables are shown that the current user has access to (by way of
being the owner or having some privilege).
37.30. foreign_tables
The view foreign_tables contains all foreign tables defined in the current database. Only those
foreign tables are shown that the current user has access to (by way of being the owner or having
some privilege).
37.31. key_column_usage
The view key_column_usage identifies all columns in the current database that are restricted by
some unique, primary key, or foreign key constraint. Check constraints are not included in this view.
Only those columns are shown that the current user has access to, by way of being the owner or having
some privilege.
37.32. parameters
The view parameters contains information about the parameters (arguments) of all functions in
the current database. Only those functions are shown that the current user has access to (by way of
being the owner or having some privilege).
37.33. referential_constraints
The view referential_constraints contains all referential (foreign key) constraints in the
current database. Only those constraints are shown for which the current user has write access to the
referencing table (by way of being the owner or having some privilege other than SELECT).
37.34. role_column_grants
The view role_column_grants identifies all privileges granted on columns where the grantor or
grantee is a currently enabled role. Further information can be found under column_privileges.
The only effective difference between this view and column_privileges is that this view omits
columns that have been made accessible to the current user by way of a grant to PUBLIC.
37.35. role_routine_grants
The view role_routine_grants identifies all privileges granted on functions where the grantor
or grantee is a currently enabled role. Further information can be found under routine_privi-
leges. The only effective difference between this view and routine_privileges is that this
view omits functions that have been made accessible to the current user by way of a grant to PUBLIC.
37.36. role_table_grants
The view role_table_grants identifies all privileges granted on tables or views where the
grantor or grantee is a currently enabled role. Further information can be found under table_priv-
ileges. The only effective difference between this view and table_privileges is that this
view omits tables that have been made accessible to the current user by way of a grant to PUBLIC.
37.37. role_udt_grants
The view role_udt_grants is intended to identify USAGE privileges granted on user-defined
types where the grantor or grantee is a currently enabled role. Further information can be found under
udt_privileges. The only effective difference between this view and udt_privileges is
that this view omits objects that have been made accessible to the current user by way of a grant to
PUBLIC. Since data types do not have real privileges in PostgreSQL, but only an implicit grant to
PUBLIC, this view is empty.
37.38. role_usage_grants
The view role_usage_grants identifies USAGE privileges granted on various kinds of objects
where the grantor or grantee is a currently enabled role. Further information can be found under us-
age_privileges. The only effective difference between this view and usage_privileges is
that this view omits objects that have been made accessible to the current user by way of a grant to
PUBLIC.
37.39. routine_privileges
The view routine_privileges identifies all privileges granted on functions to a currently en-
abled role or by a currently enabled role. There is one row for each combination of function, grantor,
and grantee.
37.40. routines
The view routines contains all functions and procedures in the current database. Only those func-
tions and procedures are shown that the current user has access to (by way of being the owner or
having some privilege).
37.41. schemata
The view schemata contains all schemas in the current database that the current user has access to
(by way of being the owner or having some privilege).
37.42. sequences
The view sequences contains all sequences defined in the current database. Only those sequences
are shown that the current user has access to (by way of being the owner or having some privilege).
Note that in accordance with the SQL standard, the start, minimum, maximum, and increment values
are returned as character strings.
37.43. sql_features
The table sql_features contains information about which formal features defined in the SQL
standard are supported by PostgreSQL. This is the same information that is presented in Appendix D.
There you can also find some additional background information.
37.44. sql_implementation_info
The table sql_implementation_info contains information about various aspects that are left
implementation-defined by the SQL standard. This information is primarily intended for use in the
context of the ODBC interface; users of other interfaces will probably find this information to be of
little use. For this reason, the individual implementation information items are not described here; you
will find them in the description of the ODBC interface.
37.45. sql_languages
The table sql_languages contains one row for each SQL language binding that is supported by
PostgreSQL. PostgreSQL supports direct SQL and embedded SQL in C; that is all you will learn from
this table.
This table was removed from the SQL standard in SQL:2008, so there are no entries referring to
standards later than SQL:2003.
37.46. sql_packages
The table sql_packages contains information about which feature packages defined in the SQL
standard are supported by PostgreSQL. Refer to Appendix D for background information on feature
packages.
37.47. sql_parts
The table sql_parts contains information about which of the several parts of the SQL standard are
supported by PostgreSQL.
37.48. sql_sizing
The table sql_sizing contains information about various size limits and maximum values in Post-
greSQL. This information is primarily intended for use in the context of the ODBC interface; users of
other interfaces will probably find this information to be of little use. For this reason, the individual
sizing items are not described here; you will find them in the description of the ODBC interface.
37.49. sql_sizing_profiles
The table sql_sizing_profiles contains information about the sql_sizing values that are
required by various profiles of the SQL standard. PostgreSQL does not track any SQL profiles, so
this table is empty.
37.50. table_constraints
The view table_constraints contains all constraints belonging to tables that the current user
owns or has some privilege other than SELECT on.
37.51. table_privileges
The view table_privileges identifies all privileges granted on tables or views to a currently
enabled role or by a currently enabled role. There is one row for each combination of table, grantor,
and grantee.
37.52. tables
The view tables contains all tables and views defined in the current database. Only those tables
and views are shown that the current user has access to (by way of being the owner or having some
privilege).
37.53. transforms
The view transforms contains information about the transforms defined in the current database.
More precisely, it contains a row for each function contained in a transform (the “from SQL” or “to
SQL” function).
37.54. triggered_update_columns
For triggers in the current database that specify a column list (like UPDATE OF column1, col-
umn2), the view triggered_update_columns identifies these columns. Triggers that do not
specify a column list are not included in this view. Only those columns are shown that the current user
owns or has some privilege other than SELECT on.
37.55. triggers
The view triggers contains all triggers defined in the current database on tables and views that the
current user owns or has some privilege other than SELECT on.
Triggers in PostgreSQL have two incompatibilities with the SQL standard that affect the representation
in the information schema. First, trigger names are local to each table in PostgreSQL, rather than being
independent schema objects. Therefore there can be duplicate trigger names defined in one schema,
so long as they belong to different tables. (trigger_catalog and trigger_schema are really
the values pertaining to the table that the trigger is defined on.) Second, triggers can be defined to fire
on multiple events in PostgreSQL (e.g., ON INSERT OR UPDATE), whereas the SQL standard only
allows one. If a trigger is defined to fire on multiple events, it is represented as multiple rows in the
information schema, one for each type of event. As a consequence of these two issues, the primary key
of the view triggers is really (trigger_catalog, trigger_schema, event_objec-
t_table, trigger_name, event_manipulation) instead of (trigger_catalog,
trigger_schema, trigger_name), which is what the SQL standard specifies. Nonetheless,
if you define your triggers in a manner that conforms with the SQL standard (trigger names unique in
the schema and only one event type per trigger), this will not affect you.
Note
Prior to PostgreSQL 9.1, this view's columns action_timing, action_ref-
erence_old_table, action_reference_new_table, action_refer-
ence_old_row, and action_reference_new_row were named condition_tim-
ing, condition_reference_old_table, condition_reference_new_ta-
ble, condition_reference_old_row, and condition_reference_new_row
respectively. That was how they were named in the SQL:1999 standard. The new naming con-
forms to SQL:2003 and later.
37.56. udt_privileges
The view udt_privileges identifies USAGE privileges granted on user-defined types to a cur-
rently enabled role or by a currently enabled role. There is one row for each combination of type,
grantor, and grantee. This view shows only composite types (see under Section 37.58 for why); see
Section 37.57 for domain privileges.
37.57. usage_privileges
The view usage_privileges identifies USAGE privileges granted on various kinds of objects
to a currently enabled role or by a currently enabled role. In PostgreSQL, this currently applies to
collations, domains, foreign-data wrappers, foreign servers, and sequences. There is one row for each
combination of object, grantor, and grantee.
Since collations do not have real privileges in PostgreSQL, this view shows implicit non-grantable
USAGE privileges granted by the owner to PUBLIC for all collations. The other object types, however,
show real privileges.
In PostgreSQL, sequences also support SELECT and UPDATE privileges in addition to the USAGE
privilege. These are nonstandard and therefore not visible in the information schema.
37.58. user_defined_types
The view user_defined_types currently contains all composite types defined in the current
database. Only those types are shown that the current user has access to (by way of being the owner
or having some privilege).
SQL knows about two kinds of user-defined types: structured types (also known as composite types in
PostgreSQL) and distinct types (not implemented in PostgreSQL). To be future-proof, use the column
user_defined_type_category to differentiate between these. Other user-defined types such
as base types and enums, which are PostgreSQL extensions, are not shown here. For domains, see
Section 37.22 instead.
37.59. user_mapping_options
The view user_mapping_options contains all the options defined for user mappings in the cur-
rent database. Only those user mappings are shown where the current user has access to the corre-
sponding foreign server (by way of being the owner or having some privilege).
37.60. user_mappings
The view user_mappings contains all user mappings defined in the current database. Only those
user mappings are shown where the current user has access to the corresponding foreign server (by
way of being the owner or having some privilege).
37.61. view_column_usage
The view view_column_usage identifies all columns that are used in the query expression of a
view (the SELECT statement that defines the view). A column is only included if the table that contains
the column is owned by a currently enabled role.
Note
Columns of system tables are not included. This should be fixed sometime.
37.62. view_routine_usage
The view view_routine_usage identifies all routines (functions and procedures) that are used
in the query expression of a view (the SELECT statement that defines the view). A routine is only
included if that routine is owned by a currently enabled role.
37.63. view_table_usage
The view view_table_usage identifies all tables that are used in the query expression of a view
(the SELECT statement that defines the view). A table is only included if that table is owned by a
currently enabled role.
Note
System tables are not included. This should be fixed sometime.
37.64. views
The view views contains all views defined in the current database. Only those views are shown that
the current user has access to (by way of being the owner or having some privilege).
Part V. Server Programming
This part is about extending the server functionality with user-defined functions, data types, triggers, etc. These
are advanced topics which should probably be approached only after all the other user documentation about Post-
greSQL has been understood. Later chapters in this part describe the server-side programming languages avail-
able in the PostgreSQL distribution as well as general issues concerning server-side programming languages. It
is essential to read at least the earlier sections of Chapter 38 (covering functions) before diving into the material
about server-side programming languages.
Table of Contents
38. Extending SQL ................................................................................................... 1041
38.1. How Extensibility Works ........................................................................... 1041
38.2. The PostgreSQL Type System .................................................................... 1041
38.2.1. Base Types ................................................................................... 1041
38.2.2. Container Types ............................................................................. 1041
38.2.3. Domains ....................................................................................... 1042
38.2.4. Pseudo-Types ................................................................................ 1042
38.2.5. Polymorphic Types ........................................................................ 1042
38.3. User-defined Functions .............................................................................. 1043
38.4. User-defined Procedures ............................................................................ 1043
38.5. Query Language (SQL) Functions ............................................................... 1044
38.5.1. Arguments for SQL Functions .......................................................... 1045
38.5.2. SQL Functions on Base Types ......................................................... 1045
38.5.3. SQL Functions on Composite Types .................................................. 1047
38.5.4. SQL Functions with Output Parameters .............................................. 1050
38.5.5. SQL Functions with Variable Numbers of Arguments ........................... 1051
38.5.6. SQL Functions with Default Values for Arguments .............................. 1052
38.5.7. SQL Functions as Table Sources ....................................................... 1053
38.5.8. SQL Functions Returning Sets .......................................................... 1053
38.5.9. SQL Functions Returning TABLE ..................................................... 1057
38.5.10. Polymorphic SQL Functions ........................................................... 1057
38.5.11. SQL Functions with Collations ....................................................... 1059
38.6. Function Overloading ................................................................................ 1059
38.7. Function Volatility Categories .................................................................... 1060
38.8. Procedural Language Functions ................................................................... 1062
38.9. Internal Functions ..................................................................................... 1062
38.10. C-Language Functions ............................................................................. 1062
38.10.1. Dynamic Loading ......................................................................... 1062
38.10.2. Base Types in C-Language Functions ............................................... 1064
38.10.3. Version 1 Calling Conventions ....................................................... 1066
38.10.4. Writing Code ............................................................................... 1070
38.10.5. Compiling and Linking Dynamically-loaded Functions ........................ 1070
38.10.6. Composite-type Arguments ............................................................ 1072
38.10.7. Returning Rows (Composite Types) ................................................. 1073
38.10.8. Returning Sets ............................................................................. 1075
38.10.9. Polymorphic Arguments and Return Types ....................................... 1081
38.10.10. Transform Functions ................................................................... 1082
38.10.11. Shared Memory and LWLocks ...................................................... 1083
38.10.12. Using C++ for Extensibility .......................................................... 1083
38.11. User-defined Aggregates .......................................................................... 1084
38.11.1. Moving-Aggregate Mode ............................................................... 1085
38.11.2. Polymorphic and Variadic Aggregates .............................................. 1087
38.11.3. Ordered-Set Aggregates ................................................................. 1088
38.11.4. Partial Aggregation ....................................................................... 1090
38.11.5. Support Functions for Aggregates .................................................... 1090
38.12. User-defined Types ................................................................................. 1091
38.12.1. TOAST Considerations .................................................................. 1094
38.13. User-defined Operators ............................................................................ 1095
38.14. Operator Optimization Information ............................................................ 1096
38.14.1. COMMUTATOR ............................................................................. 1096
38.14.2. NEGATOR ................................................................................... 1097
38.14.3. RESTRICT ................................................................................. 1097
38.14.4. JOIN ......................................................................................... 1098
38.14.5. HASHES ..................................................................................... 1098
38.14.6. MERGES ..................................................................................... 1099
Chapter 38. Extending SQL
In the sections that follow, we will discuss how you can extend the PostgreSQL SQL query language
by adding:
• functions (starting in Section 38.3)
• aggregates (starting in Section 38.11)
• data types (starting in Section 38.12)
• operators (starting in Section 38.13)
The PostgreSQL server can moreover incorporate user-written code into itself through dynamic load-
ing. That is, the user can specify an object code file (e.g., a shared library) that implements a new type
or function, and PostgreSQL will load it as required. Code written in SQL is even more trivial to add
to the server. This ability to modify its operation “on the fly” makes PostgreSQL uniquely suited for
rapid prototyping of new applications and storage structures.
Enumerated (enum) types can be considered as a subcategory of base types. The main difference is
that they can be created using just SQL commands, without any low-level programming. Refer to
Section 8.7 for more information.
Arrays can hold multiple values that are all of the same type. An array type is automatically created
for each base type, composite type, range type, and domain type. But there are no arrays of arrays. So
far as the type system is concerned, multi-dimensional arrays are the same as one-dimensional arrays.
Refer to Section 8.15 for more information.
Composite types, or row types, are created whenever the user creates a table. It is also possible to use
CREATE TYPE to define a “stand-alone” composite type with no associated table. A composite type
is simply a list of types with associated field names. A value of a composite type is a row or record
of field values. Refer to Section 8.16 for more information.
A range type can hold two values of the same type, which are the lower and upper bounds of the
range. Range types are user-created, although a few built-in ones exist. Refer to Section 8.17 for more
information.
38.2.3. Domains
A domain is based on a particular underlying type and for many purposes is interchangeable with
its underlying type. However, a domain can have constraints that restrict its valid values to a subset
of what the underlying type would allow. Domains are created using the SQL command CREATE
DOMAIN. Refer to Section 8.18 for more information.
38.2.4. Pseudo-Types
There are a few “pseudo-types” for special purposes. Pseudo-types cannot appear as columns of tables
or components of container types, but they can be used to declare the argument and result types of
functions. This provides a mechanism within the type system to identify special classes of functions.
Table 8.25 lists the existing pseudo-types.
38.2.5. Polymorphic Types
Polymorphic arguments and results are tied to each other and are resolved to a specific data type
when a query calling a polymorphic function is parsed. Each position (either argument or return value)
declared as anyelement is allowed to have any specific actual data type, but in any given call
they must all be the same actual type. Each position declared as anyarray can have any array data
type, but similarly they must all be the same type. And similarly, positions declared as anyrange
must all be the same range type. Furthermore, if there are positions declared anyarray and others
declared anyelement, the actual array type in the anyarray positions must be an array whose
elements are the same type appearing in the anyelement positions. Similarly, if there are positions
declared anyrange and others declared anyelement or anyarray, the actual range type in the
anyrange positions must be a range whose subtype is the same type appearing in the anyelement
positions and the same as the element type of the anyarray positions. anynonarray is treated
exactly the same as anyelement, but adds the additional constraint that the actual type must not
be an array type. anyenum is treated exactly the same as anyelement, but adds the additional
constraint that the actual type must be an enum type.
Thus, when more than one argument position is declared with a polymorphic type, the net effect is
that only certain combinations of actual argument types are allowed. For example, a function declared
as equal(anyelement, anyelement) will take any two input values, so long as they are of
the same data type.
When the return value of a function is declared as a polymorphic type, there must be at least one
argument position that is also polymorphic, and the actual data type supplied as the argument deter-
mines the actual result type for that call. For example, if there were not already an array subscripting
mechanism, one could define a function that implements subscripting as subscript(anyarray,
integer) returns anyelement. This declaration constrains the actual first argument to be
an array type, and allows the parser to infer the correct result type from the actual first argument's
type. Another example is that a function declared as f(anyarray) returns anyenum will only
accept arrays of enum types.
In most cases, the parser can infer the actual data type for a polymorphic result type from arguments
that are of a different polymorphic type; for example anyarray can be deduced from anyelement
or vice versa. The exception is that a polymorphic result of type anyrange requires an argument of
type anyrange; it cannot be deduced from anyarray or anyelement arguments. This is because
there could be multiple range types with the same subtype.
Note that anynonarray and anyenum do not represent separate type variables; they are the
same type as anyelement, just with an additional constraint. For example, declaring a function as
f(anyelement, anyenum) is equivalent to declaring it as f(anyenum, anyenum): both
actual arguments have to be the same enum type.
A variadic function (one taking a variable number of arguments, as in Section 38.5.5) can be polymor-
phic: this is accomplished by declaring its last parameter as VARIADIC anyarray. For purposes of
argument matching and determining the actual result type, such a function behaves the same as if you
had written the appropriate number of anynonarray parameters.
PostgreSQL provides four kinds of functions:
• query language functions (functions written in SQL) (Section 38.5)
• procedural language functions (functions written in, for example, PL/pgSQL or PL/Tcl) (Section 38.8)
• internal functions (Section 38.9)
• C-language functions (Section 38.10)
Every kind of function can take base types, composite types, or combinations of these as arguments
(parameters). In addition, every kind of function can return a base type or a composite type. Functions
can also be defined to return sets of base or composite values.
Many kinds of functions can take or return certain pseudo-types (such as polymorphic types), but the
available facilities vary. Consult the description of each kind of function for more details.
It's easiest to define SQL functions, so we'll start by discussing those. Most of the concepts presented
for SQL functions will carry over to the other types of functions.
Throughout this chapter, it can be useful to look at the reference page of the CREATE FUNCTION
command to understand the examples better. Some examples from this chapter can be found in func-
s.sql and funcs.c in the src/tutorial directory in the PostgreSQL source distribution.
A procedure is a database object similar to a function. The differences are:
• Procedures are defined with the CREATE PROCEDURE command, not CREATE FUNCTION.
• Procedures do not return a function value; hence CREATE PROCEDURE lacks a RETURNS clause.
However, procedures can instead return data to their callers via output parameters.
• While a function is called as part of a query or DML command, a procedure is called in isolation
using the CALL command.
• A procedure can commit or roll back transactions during its execution (then automatically beginning
a new transaction), so long as the invoking CALL command is not part of an explicit transaction
block. A function cannot do that.
• Certain function attributes, such as strictness, don't apply to procedures. Those attributes control
how the function is used in a query, which isn't relevant to procedures.
The explanations in the following sections about how to define user-defined functions apply to pro-
cedures as well, except for the points made above.
Collectively, functions and procedures are also known as routines. There are commands such as AL-
TER ROUTINE and DROP ROUTINE that can operate on functions and procedures without having
to know which kind it is. Note, however, that there is no CREATE ROUTINE command.
SQL functions execute an arbitrary list of SQL statements, returning the result of the last query in
the list; in the simple (non-set) case, the first row of the last query's result will be returned.
Alternatively, an SQL function can be declared to return a set (that is, multiple rows) by specifying
the function's return type as SETOF sometype, or equivalently by declaring it as RETURNS TA-
BLE(columns). In this case all rows of the last query's result are returned. Further details appear
below.
The body of an SQL function must be a list of SQL statements separated by semicolons. A semicolon
after the last statement is optional. Unless the function is declared to return void, the last statement
must be a SELECT, or an INSERT, UPDATE, or DELETE that has a RETURNING clause.
Any collection of commands in the SQL language can be packaged together and defined as a function.
Besides SELECT queries, the commands can include data modification queries (INSERT, UPDATE,
and DELETE), as well as other SQL commands. (You cannot use transaction control commands, e.g.,
COMMIT, SAVEPOINT, and some utility commands, e.g., VACUUM, in SQL functions.) However, the
final command must be a SELECT or have a RETURNING clause that returns whatever is specified as
the function's return type. Alternatively, if you want to define a SQL function that performs actions but
has no useful value to return, you can define it as returning void. For example, this function removes
rows with negative salaries from the emp table:
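A definition consistent with this description might look as follows (a sketch; the function body is
assumed, not quoted from elsewhere in this chapter):
CREATE FUNCTION clean_emp() RETURNS void AS $$
    -- remove all rows with a negative salary
    DELETE FROM emp WHERE salary < 0;
$$ LANGUAGE SQL;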
SELECT clean_emp();
clean_emp
-----------
(1 row)
Note
The entire body of a SQL function is parsed before any of it is executed. While a SQL func-
tion can contain commands that alter the system catalogs (e.g., CREATE TABLE), the effects
of such commands will not be visible during parse analysis of later commands in the func-
tion. Thus, for example, CREATE TABLE foo (...); INSERT INTO foo
VALUES(...); will not work as desired if packaged up into a single SQL function, since foo
won't exist yet when the INSERT command is parsed. It's recommended to use PL/pgSQL
instead of a SQL function in this type of situation.
The syntax of the CREATE FUNCTION command requires the function body to be written as a string
constant. It is usually most convenient to use dollar quoting (see Section 4.1.2.4) for the string con-
stant. If you choose to use regular single-quoted string constant syntax, you must double single quote
marks (') and backslashes (\) (assuming escape string syntax) in the body of the function (see Sec-
tion 4.1.2.1).
To use a name, declare the function argument as having a name, and then just write that name in the
function body. If the argument name is the same as any column name in the current SQL command
within the function, the column name will take precedence. To override this, qualify the argument
name with the name of the function itself, that is function_name.argument_name. (If this
would conflict with a qualified column name, again the column name wins. You can avoid the ambi-
guity by choosing a different alias for the table within the SQL command.)
In the older numeric approach, arguments are referenced using the syntax $n: $1 refers to the first
input argument, $2 to the second, and so on. This will work whether or not the particular argument
was declared with a name.
SQL function arguments can only be used as data values, not as identifiers. Thus for example this
is reasonable:
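The contrast being drawn is roughly the following (mytable is a hypothetical table used only for
illustration):
INSERT INTO mytable VALUES ($1);        -- argument used as a data value: fine
-- INSERT INTO $1 VALUES (42);          -- argument used as an identifier: not allowed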
Note
The ability to use names to reference SQL function arguments was added in PostgreSQL 9.2.
Functions to be used in older servers must use the $n notation.
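The one() function called below might be defined like this (a sketch; note the column alias result,
which the following paragraph comments on):
CREATE FUNCTION one() RETURNS integer AS $$
    SELECT 1 AS result;
$$ LANGUAGE SQL;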
SELECT one();
one
-----
1
Notice that we defined a column alias within the function body for the result of the function (with the
name result), but this column alias is not visible outside the function. Hence, the result is labeled
one instead of result.
It is almost as easy to define SQL functions that take base types as arguments:
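For instance, an add_em function with named arguments could be sketched as:
CREATE FUNCTION add_em(x integer, y integer) RETURNS integer AS $$
    SELECT x + y;
$$ LANGUAGE SQL;

SELECT add_em(1, 2) AS answer;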
answer
--------
3
Alternatively, we could dispense with names for the arguments and use numbers:
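A numbered-argument version might look like this (sketch):
CREATE FUNCTION add_em(integer, integer) RETURNS integer AS $$
    SELECT $1 + $2;
$$ LANGUAGE SQL;

SELECT add_em(1, 2) AS answer;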
answer
--------
3
Here is a more useful function, which might be used to debit a bank account:
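A plausible form of the function discussed in the next paragraph, assuming a bank table with
accountno and balance columns, is:
CREATE FUNCTION tf1 (accountno integer, debit numeric) RETURNS numeric AS $$
    UPDATE bank
        SET balance = balance - debit
        WHERE accountno = tf1.accountno;
    SELECT 1;
$$ LANGUAGE SQL;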
In this example, we chose the name accountno for the first argument, but this is the same as the
name of a column in the bank table. Within the UPDATE command, accountno refers to the column
bank.accountno, so tf1.accountno must be used to refer to the argument. We could of course
avoid this by using a different name for the argument.
In practice one would probably like a more useful result from the function than a constant 1, so a more
likely definition is:
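perhaps along these lines (a sketch):
CREATE FUNCTION tf1 (accountno integer, debit numeric) RETURNS numeric AS $$
    UPDATE bank
        SET balance = balance - debit
        WHERE accountno = tf1.accountno;
    SELECT balance FROM bank WHERE accountno = tf1.accountno;
$$ LANGUAGE SQL;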
which adjusts the balance and returns the new balance. The same thing could be done in one command
using RETURNING:
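for example (a sketch):
CREATE FUNCTION tf1 (accountno integer, debit numeric) RETURNS numeric AS $$
    UPDATE bank
        SET balance = balance - debit
        WHERE accountno = tf1.accountno
    RETURNING balance;
$$ LANGUAGE SQL;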
A SQL function must return exactly its declared result type. This may require inserting an explicit
cast. For example, suppose we wanted the previous add_em function to return type float8 instead.
This won't work:
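that is, a definition of roughly this shape (a sketch):
CREATE FUNCTION add_em(integer, integer) RETURNS float8 AS $$
    -- fails: the SELECT yields integer, but float8 is declared
    SELECT $1 + $2;
$$ LANGUAGE SQL;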
even though in other contexts PostgreSQL would be willing to insert an implicit cast to convert in-
teger to float8. We need to write it as
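something like the following, with an explicit cast (a sketch):
CREATE FUNCTION add_em(integer, integer) RETURNS float8 AS $$
    SELECT ($1 + $2)::float8;
$$ LANGUAGE SQL;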
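The composite-argument example that produces the output below might be reconstructed like this (a
sketch; the emp table's columns name, salary, age, and cubicle are taken from the surrounding
examples, and the data values are assumed):
CREATE FUNCTION double_salary(emp) RETURNS numeric AS $$
    SELECT $1.salary * 2 AS salary;
$$ LANGUAGE SQL;

SELECT name, double_salary(emp.*) AS dream
    FROM emp
    WHERE emp.cubicle ~= point '(2,1)';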
name | dream
------+-------
Bill | 8400
Notice the use of the syntax $1.salary to select one field of the argument row value. Also notice
how the calling SELECT command uses table_name.* to select the entire current row of a table
as a composite value. The table row can alternatively be referenced using just the table name, like this:
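that is, roughly:
SELECT name, double_salary(emp) AS dream
    FROM emp
    WHERE emp.cubicle ~= point '(2,1)';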
but this usage is deprecated since it's easy to get confused. (See Section 8.16.5 for details about these
two notations for the composite value of a table row.)
Sometimes it is handy to construct a composite argument value on-the-fly. This can be done with the
ROW construct. For example, we could adjust the data being passed to the function:
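perhaps like this (a sketch):
SELECT name, double_salary(ROW(name, salary*1.1, age, cubicle)) AS dream
    FROM emp;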
It is also possible to build a function that returns a composite type. This is an example of a function
that returns a single emp row:
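A definition matching the result shown later in this section might be (a sketch):
CREATE FUNCTION new_emp() RETURNS emp AS $$
    SELECT text 'None' AS name,
           1000.0 AS salary,
           25 AS age,
           point '(2,2)' AS cubicle;
$$ LANGUAGE SQL;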
In this example we have specified each of the attributes with a constant value, but any computation
could have been substituted for these constants.
• The select list order in the query must be exactly the same as that in which the columns appear in
the table associated with the composite type. (Naming the columns, as we did above, is irrelevant
to the system.)
• We must ensure each expression's type matches the corresponding column of the composite type,
inserting a cast if necessary. Otherwise we'll get errors like this:
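for instance, if the first column were produced as varchar rather than text, a message roughly like
this (wording approximate):
ERROR:  function declared to return emp returns varchar instead of text at column 1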
As with the base-type case, the function will not insert any casts automatically.
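A different way to define the same function, which the next paragraph comments on, might be sketched
as:
CREATE FUNCTION new_emp() RETURNS emp AS $$
    SELECT ROW('None', 1000.0, 25, '(2,2)')::emp;
$$ LANGUAGE SQL;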
Here we wrote a SELECT that returns just a single column of the correct composite type. This isn't
really better in this situation, but it is a handy alternative in some cases — for example, if we need
to compute the result by calling another function that returns the desired composite value. Another
example is that if we are trying to write a function that returns a domain over composite, rather than a
plain composite type, it is always necessary to write it as returning a single column, since there is no
other way to produce a value that is exactly of the domain type.
SELECT new_emp();
new_emp
--------------------------
(None,1000.0,25,"(2,2)")
When you use a function that returns a composite type, you might want only one field (attribute) from
its result. You can do that with syntax like this:
SELECT (new_emp()).name;
name
------
None
The extra parentheses are needed to keep the parser from getting confused. If you try to do it without
them, you get something like this:
SELECT new_emp().name;
ERROR: syntax error at or near "."
LINE 1: SELECT new_emp().name;
^
Another option is to use functional notation for extracting an attribute:
SELECT name(new_emp());
name
------
None
As explained in Section 8.16.5, the field notation and functional notation are equivalent.
Another way to use a function returning a composite type is to pass the result to another function that
accepts the correct row type as input:
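for example a getname function of this shape (a sketch):
CREATE FUNCTION getname(emp) RETURNS text AS $$
    SELECT $1.name;
$$ LANGUAGE SQL;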
SELECT getname(new_emp());
getname
---------
None
(1 row)
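The discussion now turns to output parameters. An add_em variant declared with an OUT parameter
might look like this (a sketch consistent with the call shown next):
CREATE FUNCTION add_em (IN x int, IN y int, OUT sum int)
AS 'SELECT x + y'
LANGUAGE SQL;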
SELECT add_em(3,7);
add_em
--------
10
(1 row)
This is not essentially different from the version of add_em shown in Section 38.5.2. The real value
of output parameters is that they provide a convenient way of defining functions that return several
columns. For example,
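a function returning both a sum and a product could be sketched as follows (the signature matches
the DROP FUNCTION commands shown later in this section):
CREATE FUNCTION sum_n_product (x int, y int, OUT sum int, OUT product int)
AS 'SELECT x + y, x * y'
LANGUAGE SQL;

SELECT * FROM sum_n_product(11, 42);
 sum | product
-----+---------
  53 |     462
(1 row)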
What has essentially happened here is that we have created an anonymous composite type for the
result of the function. The above example has the same end result as
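roughly the following, using a separately defined composite type (a sketch):
CREATE TYPE sum_prod AS (sum int, product int);

CREATE FUNCTION sum_n_product (int, int) RETURNS sum_prod
AS 'SELECT $1 + $2, $1 * $2'
LANGUAGE SQL;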
but not having to bother with the separate composite type definition is often handy. Notice that the
names attached to the output parameters are not just decoration, but determine the column names of
the anonymous composite type. (If you omit a name for an output parameter, the system will choose
a name on its own.)
Notice that output parameters are not included in the calling argument list when invoking such a
function from SQL. This is because PostgreSQL considers only the input parameters to define the
function's calling signature. That means also that only the input parameters matter when referencing
the function for purposes such as dropping it. We could drop the above function with either of
DROP FUNCTION sum_n_product (x int, y int, OUT sum int, OUT product
int);
DROP FUNCTION sum_n_product (int, int);
Parameters can be marked as IN (the default), OUT, INOUT, or VARIADIC. An INOUT parameter
serves as both an input parameter (part of the calling argument list) and an output parameter (part
of the result record type). VARIADIC parameters are input parameters, but are treated specially as
described next.
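The variadic discussion that follows refers to a function of this general shape (a sketch; the
parameter name arr and the sample values are assumptions):
CREATE FUNCTION mleast(VARIADIC arr numeric[]) RETURNS numeric AS $$
    SELECT min(arr[i]) FROM generate_subscripts(arr, 1) g(i);
$$ LANGUAGE SQL;

SELECT mleast(10, -1, 5, 4.4);
 mleast
--------
     -1
(1 row)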
Effectively, all the actual arguments at or beyond the VARIADIC position are gathered up into a one-
dimensional array, as if you had written
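(a sketch of the call being described, reusing the mleast example above)
SELECT mleast(ARRAY[10, -1, 5, 4.4]);    -- does not match the variadic declaration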
You can't actually write that, though — or at least, it will not match this function definition. A para-
meter marked VARIADIC matches one or more occurrences of its element type, not of its own type.
Sometimes it is useful to pass an already-constructed array to a variadic function; you can do this
by specifying VARIADIC in the call, as in mleast(VARIADIC ARRAY[10, -1, 5, 4.4]). This prevents
expansion of the function's variadic parameter into its element type, thereby allowing the array
argument value to match normally. VARIADIC can only be attached to the last actual argument
of a function call.
Specifying VARIADIC in the call is also the only way to pass an empty array to a variadic function,
for example:
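(a sketch, again using the mleast example)
SELECT mleast(VARIADIC ARRAY[]::numeric[]);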
Simply writing SELECT mleast() does not work because a variadic parameter must match at least
one actual argument. (You could define a second function also named mleast, with no parameters,
if you wanted to allow such calls.)
The array element parameters generated from a variadic parameter are treated as not having any names
of their own. This means it is not possible to call a variadic function using named arguments (Sec-
tion 4.3), except when you specify VARIADIC. For example, this will work:
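assuming the mleast sketch above, whose variadic parameter is named arr, a call like the first of
these would be accepted while the others would not:
SELECT mleast(VARIADIC arr => ARRAY[10, -1, 5, 4.4]);   -- works
-- SELECT mleast(arr => 10);                            -- rejected
-- SELECT mleast(arr => ARRAY[10, -1, 5, 4.4]);         -- rejected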
For example:
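a function with default values for some of its arguments, along these lines (a sketch consistent
with the result shown next), illustrates the behavior:
CREATE FUNCTION foo(a int, b int DEFAULT 2, c int DEFAULT 3)
RETURNS int
LANGUAGE SQL
AS $$
    SELECT $1 + $2 + $3;
$$;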
SELECT foo(10);
foo
-----
15
(1 row)
The = sign can also be used in place of the key word DEFAULT.
Here is an example:
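The example being discussed might be reconstructed as follows (a sketch; the table foo, its sample
contents, and the getfoo function are assumptions consistent with the surrounding text):
CREATE TABLE foo (fooid int, foosubid int, fooname text);
INSERT INTO foo VALUES (1, 1, 'Joe'), (1, 2, 'Ed'), (2, 1, 'Mary');  -- sample data assumed

CREATE FUNCTION getfoo(int) RETURNS foo AS $$
    SELECT * FROM foo WHERE fooid = $1;
$$ LANGUAGE SQL;

SELECT *, upper(fooname) FROM getfoo(1) AS t1;
 fooid | foosubid | fooname | upper
-------+----------+---------+-------
     1 |        1 | Joe     | JOE
(1 row)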
As the example shows, we can work with the columns of the function's result just the same as if they
were columns of a regular table.
Note that we only got one row out of the function. This is because we did not use SETOF. That is
described in the next section.
This feature is normally used when calling the function in the FROM clause. In this case each row
returned by the function becomes a row of the table seen by the query. For example, assume that table
foo has the same contents as above, and we say:
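for instance a SETOF variant of getfoo (defined after dropping the earlier, non-SETOF version; a
sketch):
CREATE FUNCTION getfoo(int) RETURNS SETOF foo AS $$
    SELECT * FROM foo WHERE fooid = $1;
$$ LANGUAGE SQL;

SELECT * FROM getfoo(1) AS t1;
 fooid | foosubid | fooname
-------+----------+---------
     1 |        1 | Joe
     1 |        2 | Ed
(2 rows)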
It is also possible to return multiple rows with the columns defined by output parameters, like this:
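for example (a sketch; the table tab and its contents are assumptions introduced only for this
illustration):
CREATE TABLE tab (y int, z int);
INSERT INTO tab VALUES (1, 2), (3, 4), (5, 6), (7, 8);   -- sample data assumed

CREATE FUNCTION sum_n_product_with_tab (x int, OUT sum int, OUT product int)
RETURNS SETOF record
AS $$
    SELECT $1 + tab.y, $1 * tab.y FROM tab;
$$ LANGUAGE SQL;

SELECT * FROM sum_n_product_with_tab(10);
 sum | product
-----+---------
  11 |      10
  13 |      30
  15 |      50
  17 |      70
(4 rows)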
The key point here is that you must write RETURNS SETOF record to indicate that the function
returns multiple rows instead of just one. If there is only one output parameter, write that parameter's
type instead of record.
It is frequently useful to construct a query's result by invoking a set-returning function multiple times,
with the parameters for each invocation coming from successive rows of a table or subquery. The
preferred way to do this is to use the LATERAL key word, which is described in Section 7.2.1.5. Here
is an example using a set-returning function to enumerate elements of a tree structure:
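The example relies on a small table of tree nodes and a listchildren function; definitions along
these lines would support the queries shown below (a sketch — the table name, its contents, and the
omitted LATERAL query are assumptions):
CREATE TABLE nodes (name text, parent text);
INSERT INTO nodes VALUES ('Top', NULL), ('Child1', 'Top'), ('Child2', 'Top'),
    ('Child3', 'Top'), ('SubChild1', 'Child1'), ('SubChild2', 'Child1');  -- data assumed

CREATE FUNCTION listchildren(text) RETURNS SETOF text AS $$
    SELECT name FROM nodes WHERE parent = $1
$$ LANGUAGE SQL STABLE;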
Child3
(3 rows)
This example does not do anything that we couldn't have done with a simple join, but in more complex
calculations the option to put some of the work into a function can be quite convenient.
Functions returning sets can also be called in the select list of a query. For each row that the query
generates by itself, the set-returning function is invoked, and an output row is generated for each
element of the function's result set. The previous example could also be done with queries like these:
SELECT listchildren('Top');
listchildren
--------------
Child1
Child2
Child3
(3 rows)
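The "last SELECT" referred to next places the set-returning call in the select list; with the node
data sketched above, it might look like:
SELECT name, listchildren(name) FROM nodes;
    name    | listchildren
------------+--------------
 Top        | Child1
 Top        | Child2
 Top        | Child3
 Child1     | SubChild1
 Child1     | SubChild2
(5 rows)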
In the last SELECT, notice that no output row appears for Child2, Child3, etc. This happens
because listchildren returns an empty set for those arguments, so no result rows are generated.
This is the same behavior as we got from an inner join to the function result when using the LATERAL
syntax.
PostgreSQL's behavior for a set-returning function in a query's select list is almost exactly the same as
if the set-returning function had been written in a LATERAL FROM-clause item instead. For example,
calling generate_series(1,5) in the select list of a query over a table tab is almost equivalent to
writing the same call as a LATERAL FROM-clause item with an alias g (see the sketch after this
paragraph).
It would be exactly the same, except that in this specific example, the planner could choose to put g on
the outside of the nestloop join, since g has no actual lateral dependency on tab. That would result in
a different output row order. Set-returning functions in the select list are always evaluated as though
they are on the inside of a nestloop join with the rest of the FROM clause, so that the function(s) are
run to completion before the next row from the FROM clause is considered.
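Concretely, the pair of queries being compared might look like this (a sketch; tab here stands for
any table with a column x):
SELECT x, generate_series(1,5) AS g FROM tab;

SELECT x, g FROM tab, LATERAL generate_series(1,5) AS g;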
If there is more than one set-returning function in the query's select list, the behavior is similar to what
you get from putting the functions into a single LATERAL ROWS FROM( ... ) FROM-clause item.
For each row from the underlying query, there is an output row using the first result from each function,
then an output row using the second result, and so on. If some of the set-returning functions produce
fewer outputs than others, null values are substituted for the missing data, so that the total number of
rows emitted for one underlying row is the same as for the set-returning function that produced the
most outputs. Thus the set-returning functions run “in lockstep” until they are all exhausted, and then
execution continues with the next underlying row.
Set-returning functions can be nested in a select list, although that is not allowed in FROM-clause items.
In such cases, each level of nesting is treated separately, as though it were a separate LATERAL ROWS
FROM( ... ) item. For example, in
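a query of roughly this shape (a sketch; srf1 through srf5 stand for arbitrary set-returning
functions, and tab for a table with columns x, y, and z):
SELECT srf1(srf2(x), srf3(y)), srf4(srf5(z)) FROM tab;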
the set-returning functions srf2, srf3, and srf5 would be run in lockstep for each row of tab,
and then srf1 and srf4 would be applied in lockstep to each row produced by the lower functions.
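The caveat described next concerns conditional constructs; the situation corresponds to a query
along these lines (a sketch):
SELECT x, CASE WHEN x > 0 THEN generate_series(1, 5) ELSE 0 END FROM tab;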
It might seem that this should produce five repetitions of input rows that have x > 0, and a single
repetition of those that do not; but actually, because generate_series(1, 5) would be run in
an implicit LATERAL FROM item before the CASE expression is ever evaluated, it would produce five
repetitions of every input row. To reduce confusion, such cases produce a parse-time error instead.
Note
If a function's last command is INSERT, UPDATE, or DELETE with RETURNING, that com-
mand will always be executed to completion, even if the function is not declared with SETOF
or the calling query does not fetch all the result rows. Any extra rows produced by the RE-
TURNING clause are silently dropped, but the commanded table modifications still happen
(and are all completed before returning from the function).
Note
Before PostgreSQL 10, putting more than one set-returning function in the same select list
did not behave very sensibly unless they always produced equal numbers of rows. Otherwise,
what you got was a number of output rows equal to the least common multiple of the numbers
of rows produced by the set-returning functions. Also, nested set-returning functions did not
work as described above; instead, a set-returning function could have at most one set-returning
argument, and each nest of set-returning functions was run independently. Also, condition-
al execution (set-returning functions inside CASE etc) was previously allowed, complicating
things even more. Use of the LATERAL syntax is recommended when writing queries that
need to work in older PostgreSQL versions, because that will give consistent results across
different versions. If you have a query that relies on conditional execution of a set-returning
function, you may be able to fix it by moving the conditional test into a custom set-returning
function; that is, a select-list expression such as CASE WHEN x > 0 THEN generate_series(1, 5)
ELSE 0 END could become a call to a function that performs the same test internally and returns
the appropriate set of rows.
For example, the preceding sum-and-product example could also be done this way:
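perhaps like this (a sketch; sum_n_product_with_tab and tab are as in the earlier set-returning
example):
CREATE FUNCTION sum_n_product_with_tab (x int)
RETURNS TABLE (sum int, product int) AS $$
    SELECT $1 + tab.y, $1 * tab.y FROM tab;
$$ LANGUAGE SQL;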
It is not allowed to use explicit OUT or INOUT parameters with the RETURNS TABLE notation —
you must put all the output columns in the TABLE list.
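The paragraph below discusses polymorphic SQL functions; the example it refers to might be
reconstructed as (a sketch):
CREATE FUNCTION make_array(anyelement, anyelement) RETURNS anyarray AS $$
    SELECT ARRAY[$1, $2];
$$ LANGUAGE SQL;

SELECT make_array(1, 2) AS intarray, make_array('a'::text, 'b') AS textarray;
 intarray | textarray
----------+-----------
 {1,2}    | {a,b}
(1 row)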
Notice the use of the typecast 'a'::text to specify that the argument is of type text. This is
required if the argument is just a string literal, since otherwise it would be treated as type unknown,
and array of unknown is not a valid type. Without the typecast, you will get errors like this:
ERROR: could not determine polymorphic type because input has type
"unknown"
It is permitted to have polymorphic arguments with a fixed return type, but the converse is not. For
example:
Polymorphism can be used with functions that have output arguments. For example:
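A polymorphic function with output parameters, and the variadic anyleast function that the collation
discussion below relies on, might be sketched as follows (bodies assumed):
CREATE FUNCTION dup (f1 anyelement, OUT f2 anyelement, OUT f3 anyarray)
AS 'select $1, array[$1,$1]' LANGUAGE SQL;

SELECT * FROM dup(22);
 f2 |   f3
----+---------
 22 | {22,22}
(1 row)

CREATE FUNCTION anyleast (VARIADIC anyarray) RETURNS anyelement AS $$
    SELECT min($1[i]) FROM generate_subscripts($1, 1) g(i);
$$ LANGUAGE SQL;

SELECT anyleast('abc'::text, 'def');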
anyleast
----------
abc
(1 row)
For example, with the anyleast function shown above, the result of SELECT anyleast('abc'::text, 'ABC')
will depend on the database's default collation. In C locale the result will be ABC, but in many other
locales it will be abc. The collation to use can be forced by adding a COLLATE clause to any of the
arguments, for example anyleast('abc'::text, 'ABC' COLLATE "C").
Alternatively, if you wish a function to operate with a particular collation regardless of what it is called
with, insert COLLATE clauses as needed in the function definition. This version of anyleast would
always use en_US locale to compare strings:
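for instance (a sketch, assuming the en_US collation is available):
CREATE FUNCTION anyleast (VARIADIC anyarray) RETURNS anyelement AS $$
    SELECT min($1[i] COLLATE "en_US") FROM generate_subscripts($1, 1) g(i);
$$ LANGUAGE SQL;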
But note that this will throw an error if applied to a non-collatable data type.
If no common collation can be identified among the actual arguments, then a SQL function treats
its parameters as having their data types' default collation (which is usually the database's default
collation, but could be different for parameters of domain types).
The behavior of collatable parameters can be thought of as a limited form of polymorphism, applicable
only to textual data types.
More than one function can be defined with the same SQL name, so long as the arguments they take are
different; in other words, function names can be overloaded. (This capability entails security
precautions when calling functions in databases where some users mistrust other users; see
Section 10.3.) When a query is executed, the server will determine which function to call from
the data types and the number of the provided arguments. Overloading can also be used to simulate
functions with a variable number of arguments, up to a finite maximum number.
When creating a family of overloaded functions, one should be careful not to create ambiguities. For
instance, given the functions:
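for instance a pair of declarations of this shape (a sketch; return types and bodies omitted):
CREATE FUNCTION test(int, real) RETURNS ...
CREATE FUNCTION test(smallint, double precision) RETURNS ...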
it is not immediately clear which function would be called with some trivial input like test(1,
1.5). The currently implemented resolution rules are described in Chapter 10, but it is unwise to
design a system that subtly relies on this behavior.
A function that takes a single argument of a composite type should generally not have the same name
as any attribute (field) of that type. Recall that attribute(table) is considered equivalent to
table.attribute. In the case that there is an ambiguity between a function on a composite type
and an attribute of the composite type, the attribute will always be used. It is possible to override that
choice by schema-qualifying the function name (that is, schema.func(table) ) but it's better
to avoid the problem by not choosing conflicting names.
Another possible conflict is between variadic and non-variadic functions. For instance, it is possible to
create both foo(numeric) and foo(VARIADIC numeric[]). In this case it is unclear which
one should be matched to a call providing a single numeric argument, such as foo(10.1). The rule
is that the function appearing earlier in the search path is used, or if the two functions are in the same
schema, the non-variadic one is preferred.
When overloading C-language functions, there is an additional constraint: The C name of each function
in the family of overloaded functions must be different from the C names of all other functions, either
internal or dynamically loaded. If this rule is violated, the behavior is not portable. You might get a
run-time linker error, or one of the functions will get called (usually the internal one). The alternative
form of the AS clause for the SQL CREATE FUNCTION command decouples the SQL function name
from the function name in the C source code. For instance:
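for instance declarations along these lines (a sketch; filename stands for the shared library file):
CREATE FUNCTION test(int) RETURNS int
    AS 'filename', 'test_1arg'
    LANGUAGE C;
CREATE FUNCTION test(int, int) RETURNS int
    AS 'filename', 'test_2arg'
    LANGUAGE C;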
The names of the C functions here reflect one of many possible conventions.
• A VOLATILE function can do anything, including modifying the database. It can return different
results on successive calls with the same arguments. The optimizer makes no assumptions about
the behavior of such functions. A query using a volatile function will re-evaluate the function at
every row where its value is needed.
• A STABLE function cannot modify the database and is guaranteed to return the same results given
the same arguments for all rows within a single statement. This category allows the optimizer to
optimize multiple calls of the function to a single call. In particular, it is safe to use an expression
containing such a function in an index scan condition. (Since an index scan will evaluate the
comparison value only once, not once at each row, it is not valid to use a VOLATILE function in an
index scan condition.)
• An IMMUTABLE function cannot modify the database and is guaranteed to return the same results
given the same arguments forever. This category allows the optimizer to pre-evaluate the function
when a query calls it with constant arguments. For example, a query like SELECT ... WHERE
x = 2 + 2 can be simplified on sight to SELECT ... WHERE x = 4, because the function
underlying the integer addition operator is marked IMMUTABLE.
For best optimization results, you should label your functions with the strictest volatility category that
is valid for them.
Any function with side-effects must be labeled VOLATILE, so that calls to it cannot be optimized
away. Even a function with no side-effects needs to be labeled VOLATILE if its value can change
within a single query; some examples are random(), currval(), timeofday().
There is relatively little difference between STABLE and IMMUTABLE categories when considering
simple interactive queries that are planned and immediately executed: it doesn't matter a lot whether
a function is executed once during planning or once during query execution startup. But there is a big
difference if the plan is saved and reused later. Labeling a function IMMUTABLE when it really isn't
might allow it to be prematurely folded to a constant during planning, resulting in a stale value being
re-used during subsequent uses of the plan. This is a hazard when using prepared statements or when
using function languages that cache plans (such as PL/pgSQL).
For functions written in SQL or in any of the standard procedural languages, there is a second important
property determined by the volatility category, namely the visibility of any data changes that have been
made by the SQL command that is calling the function. A VOLATILE function will see such changes,
a STABLE or IMMUTABLE function will not. This behavior is implemented using the snapshotting
behavior of MVCC (see Chapter 13): STABLE and IMMUTABLE functions use a snapshot established
as of the start of the calling query, whereas VOLATILE functions obtain a fresh snapshot at the start
of each query they execute.
Note
Functions written in C can manage snapshots however they want, but it's usually a good idea
to make C functions work this way too.
Because of this snapshotting behavior, a function containing only SELECT commands can safely be
marked STABLE, even if it selects from tables that might be undergoing modifications by concurrent
queries. PostgreSQL will execute all commands of a STABLE function using the snapshot established
for the calling query, and so it will see a fixed view of the database throughout that query.
The same snapshotting behavior is used for SELECT commands within IMMUTABLE functions. It
is generally unwise to select from database tables within an IMMUTABLE function at all, since the
immutability will be broken if the table contents ever change. However, PostgreSQL does not enforce
that you do not do that.
A common error is to label a function IMMUTABLE when its results depend on a configuration para-
meter. For example, a function that manipulates timestamps might well have results that depend on
the TimeZone setting. For safety, such functions should be labeled STABLE instead.
Note
PostgreSQL requires that STABLE and IMMUTABLE functions contain no SQL commands
other than SELECT to prevent data modification. (This is not a completely bulletproof test,
since such functions could still call VOLATILE functions that modify the database. If you do
that, you will find that the STABLE or IMMUTABLE function does not notice the database
changes applied by the called function, since they are hidden from its snapshot.)
Normally, all internal functions present in the server are declared during the initialization of the data-
base cluster (see Section 18.2), but a user could use CREATE FUNCTION to create additional alias
names for an internal function. Internal functions are declared in CREATE FUNCTION with language
name internal. For instance, to create an alias for the sqrt function:
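a declaration roughly like this (a sketch; dsqrt is the C name of the internal square-root function):
CREATE FUNCTION square_root(double precision) RETURNS double precision
    AS 'dsqrt'
    LANGUAGE internal
    STRICT;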
Note
Not all “predefined” functions are “internal” in the above sense. Some predefined functions
are written in SQL.
Currently only one calling convention is used for C functions (“version 1”). Support for that calling
convention is indicated by writing a PG_FUNCTION_INFO_V1() macro call for the function, as
illustrated below.
The CREATE FUNCTION command for a user-defined C function must therefore specify two pieces of information for the
function: the name of the loadable object file, and the C name (link symbol) of the specific function
to call within that object file. If the C name is not explicitly specified then it is assumed to be the same
as the SQL function name.
The following algorithm is used to locate the shared object file based on the name given in the CREATE
FUNCTION command:
1. If the name is an absolute path, the given file is loaded.
2. If the name starts with the string $libdir, that part is replaced by the PostgreSQL package library
directory name, which is determined at build time.
3. If the name does not contain a directory part, the file is searched for in the path specified by the
configuration variable dynamic_library_path.
4. Otherwise (the file was not found in the path, or it contains a non-absolute directory part), the
dynamic loader will try to take the name as given, which will most likely fail. (It is unreliable to
depend on the current working directory.)
If this sequence does not work, the platform-specific shared library file name extension (often .so)
is appended to the given name and this sequence is tried again. If that fails as well, the load will fail.
It is recommended to locate shared libraries either relative to $libdir or through the dynamic library
path. This simplifies version upgrades if the new installation is at a different location. The actual di-
rectory that $libdir stands for can be found out with the command pg_config --pkglibdir.
The user ID the PostgreSQL server runs as must be able to traverse the path to the file you intend to
load. Making the file or a higher-level directory not readable and/or not executable by the postgres
user is a common mistake.
In any case, the file name that is given in the CREATE FUNCTION command is recorded literally in
the system catalogs, so if the file needs to be loaded again the same procedure is applied.
Note
PostgreSQL will not compile a C function automatically. The object file must be compiled
before it is referenced in a CREATE FUNCTION command. See Section 38.10.5 for additional
information.
To ensure that a dynamically loaded object file is not loaded into an incompatible server, PostgreSQL
checks that the file contains a “magic block” with the appropriate contents. This allows the server to
detect obvious incompatibilities, such as code compiled for a different major version of PostgreSQL.
To include a magic block, write this in one (and only one) of the module source files, after having
included the header fmgr.h:
PG_MODULE_MAGIC;
After it is used for the first time, a dynamically loaded object file is retained in memory. Future calls
in the same session to the function(s) in that file will only incur the small overhead of a symbol table
lookup. If you need to force a reload of an object file, for example after recompiling it, begin a fresh
session.
Optionally, a dynamically loaded file can contain initialization and finalization functions. If the file
includes a function named _PG_init, that function will be called immediately after loading the file.
The function receives no parameters and should return void. If the file includes a function named
_PG_fini, that function will be called immediately before unloading the file. Likewise, the function
receives no parameters and should return void. Note that _PG_fini will only be called during an
unload of the file, not during process termination. (Presently, unloads are disabled and will never
occur, but this may change in the future.)
By-value types can only be 1, 2, or 4 bytes in length (also 8 bytes, if sizeof(Datum) is 8 on your
machine). You should be careful to define your types such that they will be the same size (in bytes) on
all architectures. For example, the long type is dangerous because it is 4 bytes on some machines and
8 bytes on others, whereas int type is 4 bytes on most Unix machines. A reasonable implementation
of the int4 type on Unix machines might be:
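for instance (a sketch):
/* 4-byte integer, passed by value */
typedef int int4;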
(The actual PostgreSQL C code calls this type int32, because it is a convention in C that intXX
means XX bits. Note therefore also that the C type int8 is 1 byte in size. The SQL type int8 is
called int64 in C. See also Table 38.1.)
On the other hand, fixed-length types of any size can be passed by-reference. For example, here is a
sample implementation of a PostgreSQL type:
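for instance a point type of this shape (a sketch):
/* 16-byte structure, passed by reference */
typedef struct
{
    double  x, y;
} Point;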
Only pointers to such types can be used when passing them in and out of PostgreSQL functions. To
return a value of such a type, allocate the right amount of memory with palloc, fill in the allocated
memory, and return a pointer to it. (Also, if you just want to return the same value as one of your input
arguments that's of the same data type, you can skip the extra palloc and just return the pointer to
the input value.)
Finally, all variable-length types must also be passed by reference. All variable-length types must
begin with an opaque length field of exactly 4 bytes, which will be set by SET_VARSIZE; never set
this field directly! All data to be stored within that type must be located in the memory immediately
following that length field. The length field contains the total length of the structure, that is, it includes
the size of the length field itself.
Another important point is to avoid leaving any uninitialized bits within data type values; for exam-
ple, take care to zero out any alignment padding bytes that might be present in structs. Without this,
logically-equivalent constants of your data type might be seen as unequal by the planner, leading to
inefficient (though not incorrect) plans.
Warning
Never modify the contents of a pass-by-reference input value. If you do so you are likely to
corrupt on-disk data, since the pointer you are given might point directly into a disk buffer.
The sole exception to this rule is explained in Section 38.11.
typedef struct {
int32 length;
char data[FLEXIBLE_ARRAY_MEMBER];
} text;
The [FLEXIBLE_ARRAY_MEMBER] notation means that the actual length of the data part is not
specified by this declaration.
When manipulating variable-length types, we must be careful to allocate the correct amount of memory
and set the length field correctly. For example, if we wanted to store 40 bytes in a text structure,
we might use a code fragment like this:
#include "postgres.h"
...
char buffer[40]; /* our source data */
...
text *destination = (text *) palloc(VARHDRSZ + 40);
SET_VARSIZE(destination, VARHDRSZ + 40);
memcpy(destination->data, buffer, 40);
...
VARHDRSZ is the same as sizeof(int32), but it's considered good style to use the macro
VARHDRSZ to refer to the size of the overhead for a variable-length type. Also, the length field must
be set using the SET_VARSIZE macro, not by simple assignment.
Table 38.1 shows the C types corresponding to many of the built-in SQL data types of PostgreSQL.
The “Defined In” column gives the header file that needs to be included to get the type definition.
(The actual definition might be in a different file that is included by the listed file. It is recommended
that users stick to the defined interface.) Note that you should always include postgres.h first in
any source file of server code, because it declares a number of things that you will need anyway, and
because including other headers first can cause portability issues.
Now that we've gone over all of the possible structures for base types, we can show some examples
of real functions.
Every version-1 function must be declared as
Datum funcname(PG_FUNCTION_ARGS)
In addition, the macro call
PG_FUNCTION_INFO_V1(funcname);
must appear in the same source file. (Conventionally, it's written just before the function itself.) This
macro call is not needed for internal-language functions, since PostgreSQL assumes that all inter-
nal functions use the version-1 convention. It is, however, required for dynamically-loaded functions.
In a version-1 function, each actual argument is fetched using a PG_GETARG_xxx() macro that cor-
responds to the argument's data type. (In non-strict functions there needs to be a previous check about
argument null-ness using PG_ARGISNULL(); see below.) The result is returned using a PG_RE-
TURN_xxx() macro for the return type. PG_GETARG_xxx() takes as its argument the number of
the function argument to fetch, where the count starts at 0. PG_RETURN_xxx() takes as its argument
the actual value to return.
#include "postgres.h"
#include <string.h>
#include "fmgr.h"
#include "utils/geo_decls.h"
PG_MODULE_MAGIC;
/* by value */
PG_FUNCTION_INFO_V1(add_one);
Datum
add_one(PG_FUNCTION_ARGS)
{
int32 arg = PG_GETARG_INT32(0);
PG_RETURN_INT32(arg + 1);
}
PG_FUNCTION_INFO_V1(add_one_float8);
Datum
add_one_float8(PG_FUNCTION_ARGS)
{
/* The macros for FLOAT8 hide its pass-by-reference nature. */
float8 arg = PG_GETARG_FLOAT8(0);
PG_RETURN_FLOAT8(arg + 1.0);
}
PG_FUNCTION_INFO_V1(makepoint);
Datum
makepoint(PG_FUNCTION_ARGS)
{
/* Here, the pass-by-reference nature of Point is not hidden.
*/
Point *pointx = PG_GETARG_POINT_P(0);
Point *pointy = PG_GETARG_POINT_P(1);
Point *new_point = (Point *) palloc(sizeof(Point));
new_point->x = pointx->x;
new_point->y = pointy->y;
PG_RETURN_POINT_P(new_point);
}
PG_FUNCTION_INFO_V1(copytext);
Datum
copytext(PG_FUNCTION_ARGS)
{
text *t = PG_GETARG_TEXT_PP(0);
/*
* VARSIZE_ANY_EXHDR is the size of the struct in bytes, minus
the
* VARHDRSZ or VARHDRSZ_SHORT of its header. Construct the
copy with a
* full-length header.
*/
text *new_t = (text *) palloc(VARSIZE_ANY_EXHDR(t) +
VARHDRSZ);
SET_VARSIZE(new_t, VARSIZE_ANY_EXHDR(t) + VARHDRSZ);
/*
* VARDATA is a pointer to the data region of the new struct.
The source
* could be a short datum, so retrieve its data through
VARDATA_ANY.
*/
memcpy((void *) VARDATA(new_t), /* destination */
(void *) VARDATA_ANY(t), /* source */
VARSIZE_ANY_EXHDR(t)); /* how many bytes */
PG_RETURN_TEXT_P(new_t);
}
PG_FUNCTION_INFO_V1(concat_text);
Datum
concat_text(PG_FUNCTION_ARGS)
{
text *arg1 = PG_GETARG_TEXT_PP(0);
text *arg2 = PG_GETARG_TEXT_PP(1);
int32 arg1_size = VARSIZE_ANY_EXHDR(arg1);
int32 arg2_size = VARSIZE_ANY_EXHDR(arg2);
int32 new_text_size = arg1_size + arg2_size + VARHDRSZ;
text *new_text = (text *) palloc(new_text_size);
SET_VARSIZE(new_text, new_text_size);
memcpy(VARDATA(new_text), VARDATA_ANY(arg1), arg1_size);
memcpy(VARDATA(new_text) + arg1_size, VARDATA_ANY(arg2),
arg2_size);
PG_RETURN_TEXT_P(new_text);
}
Supposing that the above code has been prepared in file funcs.c and compiled into a shared object,
we could define the functions to PostgreSQL with commands like this:
CREATE FUNCTION add_one(integer) RETURNS integer
    AS 'DIRECTORY/funcs', 'add_one'
LANGUAGE C STRICT;
Here, DIRECTORY stands for the directory of the shared library file (for instance the PostgreSQL
tutorial directory, which contains the code for the examples used in this section). (Better style would
be to use just 'funcs' in the AS clause, after having added DIRECTORY to the search path. In any
case, we can omit the system-specific extension for a shared library, commonly .so.)
Notice that we have specified the functions as “strict”, meaning that the system should automatically
assume a null result if any input value is null. By doing this, we avoid having to check for null inputs in
the function code. Without this, we'd have to check for null values explicitly, using PG_ARGISNUL-
L().
The macro PG_ARGISNULL(n) allows a function to test whether each input is null. (Of course,
doing this is only necessary in functions not declared “strict”.) As with the PG_GETARG_xxx()
macros, the input arguments are counted beginning at zero. Note that one should refrain from executing
PG_GETARG_xxx() until one has verified that the argument isn't null. To return a null result, execute
PG_RETURN_NULL(); this works in both strict and nonstrict functions.
At first glance, the version-1 coding conventions might appear to be just pointless obscurantism, com-
pared to using plain C calling conventions. They do however allow us to deal with NULLable argu-
ments/return values, and “toasted” (compressed or out-of-line) values.
Other options provided by the version-1 interface are two variants of the PG_GETARG_xxx()
macros. The first of these, PG_GETARG_xxx_COPY(), guarantees to return a copy of the spec-
ified argument that is safe for writing into. (The normal macros will sometimes return a point-
er to a value that is physically stored in a table, which must not be written to. Using the
PG_GETARG_xxx_COPY() macros guarantees a writable result.) The second variant consists of the
PG_GETARG_xxx_SLICE() macros which take three arguments. The first is the number of the
function argument (as above). The second and third are the offset and length of the segment to be
returned. Offsets are counted from zero, and a negative length requests that the remainder of the value
be returned. These macros provide more efficient access to parts of large values in the case where they
have storage type “external”. (The storage type of a column can be specified using ALTER TABLE
tablename ALTER COLUMN colname SET STORAGE storagetype. storagetype is
one of plain, external, extended, or main.)
Finally, the version-1 function call conventions make it possible to return set results (Section 38.10.8)
and implement trigger functions (Chapter 39) and procedural-language call handlers (Chapter 56). For
more details see src/backend/utils/fmgr/README in the source distribution.
The basic rules for writing and building C functions are as follows:
• Use pg_config --includedir-server to find out where the PostgreSQL server header
files are installed on your system (or the system that your users will be running on).
• Compiling and linking your code so that it can be dynamically loaded into PostgreSQL always re-
quires special flags. See Section 38.10.5 for a detailed explanation of how to do it for your particular
operating system.
• Remember to define a “magic block” for your shared library, as described in Section 38.10.1.
• When allocating memory, use the PostgreSQL functions palloc and pfree instead of the corre-
sponding C library functions malloc and free. The memory allocated by palloc will be freed
automatically at the end of each transaction, preventing memory leaks.
• Always zero the bytes of your structures using memset (or allocate them with palloc0 in the
first place). Even if you assign to each field of your structure, there might be alignment padding
(holes in the structure) that contain garbage values. Without this, it's difficult to support hash indexes
or hash joins, as you must pick out only the significant bits of your data structure to compute a
hash. The planner also sometimes relies on comparing constants via bitwise equality, so you can
get undesirable planning results if logically-equivalent values aren't bitwise equal.
• Most of the internal PostgreSQL types are declared in postgres.h, while the function manager
interfaces (PG_FUNCTION_ARGS, etc.) are in fmgr.h, so you will need to include at least these
two files. For portability reasons it's best to include postgres.h first, before any other system or
user header files. Including postgres.h will also include elog.h and palloc.h for you.
• Symbol names defined within object files must not conflict with each other or with symbols defined
in the PostgreSQL server executable. You will have to rename your functions or variables if you
get error messages to this effect.
For information beyond what is contained in this section you should read the documentation of your
operating system, in particular the manual pages for the C compiler, cc, and the link editor, ld. In
addition, the PostgreSQL source code contains several working examples in the contrib directory.
If you rely on these examples you will make your modules dependent on the availability of the Post-
greSQL source code, however.
Creating shared libraries is generally analogous to linking executables: first the source files are com-
piled into object files, then the object files are linked together. The object files need to be created as
position-independent code (PIC), which conceptually means that they can be placed at an arbitrary
location in memory when they are loaded by the executable. (Object files intended for executables are
usually not compiled that way.) The command to link a shared library contains special flags to distin-
guish it from linking an executable (at least in theory — on some systems the practice is much uglier).
In the following examples we assume that your source code is in a file foo.c and we will create a
shared library foo.so. The intermediate object file will be called foo.o unless otherwise noted. A
shared library can contain more than one object file, but we only use one here.
FreeBSD
The compiler flag to create PIC is -fPIC. To create shared libraries the compiler flag is -
shared.
HP-UX
The compiler flag of the system compiler to create PIC is +z. When using GCC it's -fPIC. The
linker flag for shared libraries is -b. So:
cc +z -c foo.c
or, when using GCC:
gcc -fPIC -c foo.c
and then:
ld -b -o foo.sl foo.o
HP-UX uses the extension .sl for shared libraries, unlike most other systems.
Linux
The compiler flag to create PIC is -fPIC. The compiler flag to create a shared library is -
shared. A complete example looks like this:
cc -fPIC -c foo.c
cc -shared -o foo.so foo.o
macOS
cc -c foo.c
cc -bundle -flat_namespace -undefined suppress -o foo.so foo.o
NetBSD
The compiler flag to create PIC is -fPIC. For ELF systems, the compiler with the flag -shared
is used to link shared libraries. On the older non-ELF systems, ld -Bshareable is used.
OpenBSD
The compiler flag to create PIC is -fPIC. ld -Bshareable is used to link shared libraries.
Solaris
The compiler flag to create PIC is -KPIC with the Sun compiler and -fPIC with GCC. To link
shared libraries, the compiler option is -G with either compiler or alternatively -shared with
GCC.
cc -KPIC -c foo.c
cc -G -o foo.so foo.o
or, with GCC:
gcc -fPIC -c foo.c
gcc -G -o foo.so foo.o
Tip
If this is too complicated for you, you should consider using GNU Libtool, which hides the
platform differences behind a uniform interface.
The resulting shared library file can then be loaded into PostgreSQL. When specifying the file name
to the CREATE FUNCTION command, one must give it the name of the shared library file, not the
intermediate object file. Note that the system's standard shared-library extension (usually .so or .sl)
can be omitted from the CREATE FUNCTION command, and normally should be omitted for best
portability.
Refer back to Section 38.10.1 about where the server expects to find the shared library files.
#include "postgres.h"
#include "executor/executor.h" /* for GetAttributeByName() */
PG_MODULE_MAGIC;
PG_FUNCTION_INFO_V1(c_overpaid);
Datum
c_overpaid(PG_FUNCTION_ARGS)
{
HeapTupleHeader t = PG_GETARG_HEAPTUPLEHEADER(0);
int32 limit = PG_GETARG_INT32(1);
bool isnull;
Datum salary;

salary = GetAttributeByName(t, "salary", &isnull);
if (isnull)
    PG_RETURN_BOOL(false);
/* Alternatively, we might prefer to do PG_RETURN_NULL() for a null salary. */

PG_RETURN_BOOL(DatumGetInt32(salary) > limit);
}
GetAttributeByName is the PostgreSQL system function that returns attributes out of the spec-
ified row. It has three arguments: the argument of type HeapTupleHeader passed into the func-
tion, the name of the desired attribute, and a return parameter that tells whether the attribute is null.
GetAttributeByName returns a Datum value that you can convert to the proper data type by
using the appropriate DatumGetXXX() macro. Note that the return value is meaningless if the null
flag is set; always check the null flag before trying to do anything with the result.
There is also GetAttributeByNum, which selects the target attribute by column number instead
of name.
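The SQL-level declaration that the next sentence refers to might look like this (a sketch; DIRECTORY
stands for the shared library's directory, and the query is only an illustration):
CREATE FUNCTION c_overpaid(emp, integer) RETURNS boolean
    AS 'DIRECTORY/funcs', 'c_overpaid'
    LANGUAGE C STRICT;

SELECT name, c_overpaid(emp, 1500) AS overpaid
    FROM emp
    WHERE name = 'Bill' OR name = 'Sam';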
Notice we have used STRICT so that we did not have to check whether the input arguments were
NULL.
#include "funcapi.h"
There are two ways you can build a composite data value (henceforth a “tuple”): you can build it from
an array of Datum values, or from an array of C strings that can be passed to the input conversion
functions of the tuple's column data types. In either case, you first need to obtain or construct a
TupleDesc descriptor for the tuple structure. When working with Datums, you pass the TupleDesc
to BlessTupleDesc, and then call heap_form_tuple for each row. When working with C
strings, you pass the TupleDesc to TupleDescGetAttInMetadata, and then call BuildTu-
pleFromCStrings for each row. In the case of a function returning a set of tuples, the setup steps
can all be done once during the first call of the function.
Several helper functions are available for setting up the needed TupleDesc. The recommended way
to do this in most functions returning composite values is to call
get_call_result_type(fcinfo, &resultTypeId, &resultTupleDesc)
passing the same fcinfo struct passed to the calling function itself. (This of course requires that
you use the version-1 calling conventions.) resultTypeId can be specified as NULL or as the
address of a local variable to receive the function's result type OID. resultTupleDesc should be
the address of a local TupleDesc variable. Check that the result is TYPEFUNC_COMPOSITE; if so,
resultTupleDesc has been filled with the needed TupleDesc. (If it is not, you can report an
error along the lines of “function returning record called in context that cannot accept type record”.)
Tip
get_call_result_type can resolve the actual type of a polymorphic function result; so
it is useful in functions that return scalar polymorphic results, not only functions that return
composites. The resultTypeId output is primarily useful for functions returning polymor-
phic scalars.
Note
get_call_result_type has a sibling get_expr_result_type, which can be used
to resolve the expected output type for a function call represented by an expression tree. This
can be used when trying to determine the result type from outside the function itself. There is
also get_func_result_type, which can be used when only the function's OID is avail-
able. However these functions are not able to deal with functions declared to return record,
and get_func_result_type cannot resolve polymorphic types, so you should preferen-
tially use get_call_result_type.
You can also call TypeGetTupleDesc(typeoid, colaliases) to get a TupleDesc based on a type OID. This can be used to get a TupleDesc for a base or
composite type. It will not work for a function that returns record, however, and it cannot resolve
polymorphic types.
Once you have the TupleDesc, call BlessTupleDesc on it if you plan to work with Datums, or TupleDescGetAttInMetadata if you plan to work with C strings. If you are writing a function returning set, you can save the results
of these functions in the FuncCallContext structure — use the tuple_desc or attinmeta
field respectively.
Then call BuildTupleFromCStrings(attinmeta, values) to build a HeapTuple given user data in C string form. values is an array of C strings, one for
each attribute of the return row. Each C string should be in the form expected by the input function
of the attribute data type. In order to return a null value for one of the attributes, the corresponding
pointer in the values array should be set to NULL. This function will need to be called again for
each row you return.
Once you have built a tuple to return from your function, it must be converted into a Datum. Use:
HeapTupleGetDatum(HeapTuple tuple)
to convert a HeapTuple into a valid Datum. This Datum can be returned directly if you intend to
return just a single row, or it can be used as the current return value in a set-returning function.
When using ValuePerCall mode, it is important to remember that the query is not guaranteed to be
run to completion; that is, due to options such as LIMIT, the executor might stop making calls to the
set-returning function before all rows have been fetched. This means it is not safe to perform cleanup
activities in the last call, because that might not ever happen. It's recommended to use Materialize
mode for functions that need access to external resources, such as file descriptors.
The remainder of this section documents a set of helper macros that are commonly used (though
not required to be used) for SRFs using ValuePerCall mode. Additional details about Materialize
mode can be found in src/backend/utils/fmgr/README. Also, the contrib modules in
the PostgreSQL source distribution contain many examples of SRFs using both ValuePerCall and
Materialize mode.
To use the ValuePerCall support macros described here, include funcapi.h. These macros work
with a structure FuncCallContext that contains the state that needs to be saved across calls. Within
the calling SRF, fcinfo->flinfo->fn_extra is used to hold a pointer to FuncCallContext
across calls. The macros automatically fill that field on first use, and expect to find the same pointer
there on subsequent uses.
/*
* OPTIONAL maximum number of calls
*
* max_calls is here for convenience only and setting it is
optional.
* If not set, you must provide alternative means to know when
the
* function is done.
*/
uint64 max_calls;
/*
* OPTIONAL pointer to result slot
*
* This is obsolete and only present for backward
compatibility, viz,
* user-defined SRFs that use the deprecated
TupleDescGetSlot().
*/
TupleTableSlot *slot;
/*
* OPTIONAL pointer to miscellaneous user-provided context
information
*
* user_fctx is for use as a pointer to your own data to retain
* arbitrary context information between calls of your
function.
*/
void *user_fctx;
/*
* OPTIONAL pointer to struct containing attribute type input
metadata
*
* attinmeta is for use when returning tuples (i.e., composite
data types)
/*
* memory context used for structures that must live for
multiple calls
*
* multi_call_memory_ctx is set by SRF_FIRSTCALL_INIT() for
you, and used
* by SRF_RETURN_DONE() for cleanup. It is the most appropriate
memory
* context for any memory that is to be reused across multiple
calls
* of the SRF.
*/
MemoryContext multi_call_memory_ctx;
/*
* OPTIONAL pointer to struct containing tuple description
*
* tuple_desc is for use when returning tuples (i.e., composite
data types)
* and is only needed if you are going to build the tuples with
* heap_form_tuple() rather than with BuildTupleFromCStrings().
Note that
* the TupleDesc pointer stored here should usually have been
run through
* BlessTupleDesc() first.
*/
TupleDesc tuple_desc;
} FuncCallContext;
SRF_IS_FIRSTCALL()
Use this to determine if your function is being called for the first or a subsequent time. On the first
call (only), call:
SRF_FIRSTCALL_INIT()
to initialize the FuncCallContext. On every function call, including the first, call:
SRF_PERCALL_SETUP()
to properly set up for using the FuncCallContext. If your function has data to return in the current
call, use:
SRF_RETURN_NEXT(funcctx, result)
to return it to the caller. (result must be of type Datum, either a single value or a tuple prepared as
described above.) Finally, when your function is finished returning data, use:
SRF_RETURN_DONE(funcctx)
The memory context that is current when the SRF is called is a transient context that will be cleared
between calls. This means that you do not need to call pfree on everything you allocated using pal-
loc; it will go away anyway. However, if you want to allocate any data structures to live across calls,
you need to put them somewhere else. The memory context referenced by multi_call_memo-
ry_ctx is a suitable location for any data that needs to survive until the SRF is finished running.
In most cases, this means that you should switch into multi_call_memory_ctx while doing the
first-call setup. Use funcctx->user_fctx to hold a pointer to any such cross-call data structures.
(Data you allocate in multi_call_memory_ctx will go away automatically when the query ends,
so it is not necessary to free that data manually, either.)
Warning
While the actual arguments to the function remain unchanged between calls, if you detoast
the argument values (which is normally done transparently by the PG_GETARG_xxx macro)
in the transient context then the detoasted copies will be freed on each cycle. Accordingly,
if you keep references to such values in your user_fctx, you must either copy them into
the multi_call_memory_ctx after detoasting, or ensure that you detoast the values only
in that context.
Datum
my_set_returning_function(PG_FUNCTION_ARGS)
{
    FuncCallContext *funcctx;
    Datum            result;
    further declarations as needed

    if (SRF_IS_FIRSTCALL())
    {
        MemoryContext oldcontext;

        funcctx = SRF_FIRSTCALL_INIT();
        oldcontext = MemoryContextSwitchTo(funcctx->multi_call_memory_ctx);
        /* One-time setup code appears here: */
        user code
        if returning composite
            build TupleDesc, and perhaps AttInMetadata
        endif returning composite
        user code
        MemoryContextSwitchTo(oldcontext);
    }

    /* Each-time setup code appears here: */
    user code
    funcctx = SRF_PERCALL_SETUP();
    user code

    /* this is just one way we might test whether we are done: */
    if (funcctx->call_cntr < funcctx->max_calls)
    {
        /* Here we want to return another item: */
        user code
        obtain result Datum
        SRF_RETURN_NEXT(funcctx, result);
    }
    else
    {
        /* Here we are done returning items and just need to clean up: */
        user code
        SRF_RETURN_DONE(funcctx);
    }
}

A complete example of a simple SRF returning a composite type looks like:
PG_FUNCTION_INFO_V1(retcomposite);

Datum
retcomposite(PG_FUNCTION_ARGS)
{
    FuncCallContext *funcctx;
    int              call_cntr;
    int              max_calls;
    TupleDesc        tupdesc;
    AttInMetadata   *attinmeta;

    /* stuff done only on the first call of the function */
    if (SRF_IS_FIRSTCALL())
    {
        MemoryContext oldcontext;

        /* create a function context for cross-call persistence */
        funcctx = SRF_FIRSTCALL_INIT();
        /* switch to memory context appropriate for multiple function calls */
        oldcontext = MemoryContextSwitchTo(funcctx->multi_call_memory_ctx);
        /* total number of tuples to be returned */
        funcctx->max_calls = PG_GETARG_UINT32(0);
        /* build a tuple descriptor for our result type */
        if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
            ereport(ERROR,
                    (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
                     errmsg("function returning record called in context "
                            "that cannot accept type record")));
        /* generate attribute metadata for building tuples from C strings */
        attinmeta = TupleDescGetAttInMetadata(tupdesc);
        funcctx->attinmeta = attinmeta;
        MemoryContextSwitchTo(oldcontext);
    }

    /* stuff done on every call of the function */
    funcctx = SRF_PERCALL_SETUP();

    call_cntr = funcctx->call_cntr;
    max_calls = funcctx->max_calls;
    attinmeta = funcctx->attinmeta;

    if (call_cntr < max_calls)    /* do when there is more left to send */
    {
        char      **values;
        HeapTuple   tuple;
        Datum       result;

        /*
         * Prepare a values array for building the returned tuple.
         * This should be an array of C strings which will
         * be processed later by the type input functions.
         */
        values = (char **) palloc(3 * sizeof(char *));
        values[0] = (char *) palloc(16 * sizeof(char));
        values[1] = (char *) palloc(16 * sizeof(char));
        values[2] = (char *) palloc(16 * sizeof(char));
        snprintf(values[0], 16, "%d", 1 * PG_GETARG_INT32(1));
        snprintf(values[1], 16, "%d", 2 * PG_GETARG_INT32(1));
        snprintf(values[2], 16, "%d", 3 * PG_GETARG_INT32(1));

        /* build a tuple and turn it into a datum */
        tuple = BuildTupleFromCStrings(attinmeta, values);
        result = HeapTupleGetDatum(tuple);

        SRF_RETURN_NEXT(funcctx, result);
    }
    else    /* do when there is no more left */
    {
        SRF_RETURN_DONE(funcctx);
    }
}
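One way to declare this function to SQL is with OUT parameters (a sketch, with 'filename' standing
for the path of the shared library):

CREATE OR REPLACE FUNCTION retcomposite(IN integer, IN integer,
    OUT f1 integer, OUT f2 integer, OUT f3 integer)
    RETURNS SETOF record
    AS 'filename', 'retcomposite'
    LANGUAGE C IMMUTABLE STRICT;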
Notice that in this method the output type of the function is formally an anonymous record type.
For example, suppose we want to write a function to accept a single element of any type, and return
a one-dimensional array of that type:
PG_FUNCTION_INFO_V1(make_array);

Datum
make_array(PG_FUNCTION_ARGS)
{
    ArrayType  *result;
    Oid         element_type = get_fn_expr_argtype(fcinfo->flinfo, 0);
    Datum       element;
    bool        isnull;
    int16       typlen;
    bool        typbyval;
    char        typalign;
    int         ndims;
    int         dims[MAXDIM];
    int         lbs[MAXDIM];

    if (!OidIsValid(element_type))
        elog(ERROR, "could not determine data type of input");

    /* get the provided element, being careful in case it's NULL */
    isnull = PG_ARGISNULL(0);
    if (isnull)
        element = (Datum) 0;
    else
        element = PG_GETARG_DATUM(0);

    /* we have one dimension, one element, with lower bound 1 */
    ndims = 1;
    dims[0] = 1;
    lbs[0] = 1;

    /* get required info about the element type */
    get_typlenbyvalalign(element_type, &typlen, &typbyval, &typalign);

    /* now build the array */
    result = construct_md_array(&element, &isnull, ndims, dims, lbs,
                                element_type, typlen, typbyval, typalign);

    PG_RETURN_ARRAYTYPE_P(result);
}
There is a variant of polymorphism that is only available to C-language functions: they can be declared
to take parameters of type "any". (Note that this type name must be double-quoted, since it's also a
SQL reserved word.) This works like anyelement except that it does not constrain different "any"
arguments to be the same type, nor do they help determine the function's result type. A C-language
function can also declare its final parameter to be VARIADIC "any". This will match one or more
actual arguments of any type (not necessarily the same type). These arguments will not be gathered into
an array as happens with normal variadic functions; they will just be passed to the function separately.
The PG_NARGS() macro and the methods described above must be used to determine the number of
actual arguments and their types when using this feature. Also, users of such a function might wish
to use the VARIADIC keyword in their function call, with the expectation that the function would
treat the array elements as separate arguments. The function itself must implement that behavior if
wanted, after using get_fn_expr_variadic to detect that the actual argument was marked with
VARIADIC.
If the transform function proves that a simplified expression tree can substitute for all possible concrete
calls represented thereby, build and return that simplified expression. Otherwise, return a NULL pointer
(not a SQL null).
We make no guarantee that PostgreSQL will never call the primary function in cases that the transform
function could simplify. Ensure rigorous equivalence between the simplified expression and an actual
call to the primary function.
Currently, this facility is not exposed to users at the SQL level because of security concerns, so it is
only practical to use for optimizing built-in functions.
Named LWLocks are reserved by calling RequestNamedLWLockTranche(tranche_name,
num_lwlocks) from _PG_init. This will ensure that an array of num_lwlocks LWLocks is available under the
name tranche_name. Use GetNamedLWLockTranche to get a pointer to this array.
To avoid possible race-conditions, each backend should use the LWLock AddinShmemInitLock
when connecting to and initializing its allocation of shared memory, as shown here:
if (!ptr)
{
    bool    found;

    LWLockAcquire(AddinShmemInitLock, LW_EXCLUSIVE);
    ptr = ShmemInitStruct("my struct name", size, &found);
    if (!found)
    {
        initialize contents of shmem area;
        acquire any requested LWLocks using:
        ptr->locks = GetNamedLWLockTranche("my tranche name");
    }
    LWLockRelease(AddinShmemInitLock);
}
• All functions accessed by the backend must present a C interface to the backend; these C functions
can then call C++ functions. For example, extern C linkage is required for backend-accessed
functions. This is also necessary for any functions that are passed as pointers between the backend
and C++ code.
• Free memory using the appropriate deallocation method. For example, most backend memory is
allocated using palloc(), so use pfree() to free it. Using C++ delete in such cases will fail.
• Prevent exceptions from propagating into the C code (use a catch-all block at the top level of all
extern C functions). This is necessary even if the C++ code does not explicitly throw any ex-
ceptions, because events like out-of-memory can still throw exceptions. Any exceptions must be
caught and appropriate errors passed back to the C interface. If possible, compile C++ with -fno-
exceptions to eliminate exceptions entirely; in such cases, you must check for failures in your
C++ code, e.g., check for NULL returned by new().
• If calling backend functions from C++ code, be sure that the C++ call stack contains only plain old
data structures (POD). This is necessary because backend errors generate a distant longjmp()
that does not properly unroll a C++ call stack with non-POD objects.
In summary, it is best to place C++ code behind a wall of extern C functions that interface to the
backend, and avoid exception, memory, and call stack leakage.
Thus, in addition to the argument and result data types seen by a user of the aggregate, there is an
internal state-value data type that might be different from both the argument and result types.
If we define an aggregate that does not use a final function, we have an aggregate that computes a
running function of the column values from each row. sum is an example of this kind of aggregate.
sum starts at zero and always adds the current row's value to its running total. For example, if we
want to make a sum aggregate to work on a data type for complex numbers, we only need the addition
function for that data type. The aggregate definition would be:
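(A sketch, assuming an addition function complex_add for the complex type and, for the query below,
a hypothetical table test_complex with a column a of type complex:)

CREATE AGGREGATE sum (complex)
(
    sfunc = complex_add,
    stype = complex,
    initcond = '(0,0)'
);

We could then use the aggregate like this:

SELECT sum(a) FROM test_complex;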
sum
-----------
(34,53.9)
(Notice that we are relying on function overloading: there is more than one aggregate named sum, but
PostgreSQL can figure out which kind of sum applies to a column of type complex.)
The above definition of sum will return zero (the initial state value) if there are no nonnull input
values. Perhaps we want to return null in that case instead — the SQL standard expects sum to behave
that way. We can do this simply by omitting the initcond phrase, so that the initial state value is
null. Ordinarily this would mean that the sfunc would need to check for a null state-value input. But
for sum and some other simple aggregates like max and min, it is sufficient to insert the first nonnull
input value into the state variable and then start applying the transition function at the second nonnull
input value. PostgreSQL will do that automatically if the initial state value is null and the transition
function is marked “strict” (i.e., not to be called for null inputs).
Another bit of default behavior for a “strict” transition function is that the previous state value is
retained unchanged whenever a null input value is encountered. Thus, null values are ignored. If you
need some other behavior for null inputs, do not declare your transition function as strict; instead code
it to test for null inputs and do whatever is needed.
avg (average) is a more complex example of an aggregate. It requires two pieces of running state:
the sum of the inputs and the count of the number of inputs. The final result is obtained by dividing
these quantities. Average is typically implemented by using an array as the state value. For example,
the built-in implementation of avg(float8) looks like:
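(A sketch of the built-in declaration, using the float8_accum and float8_avg support functions:)

CREATE AGGREGATE avg (float8)
(
    sfunc = float8_accum,
    stype = float8[],
    finalfunc = float8_avg,
    initcond = '{0,0,0}'
);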
Note
float8_accum requires a three-element array, not just two elements, because it accumulates
the sum of squares as well as the sum and count of the inputs. This is so that it can be used
for some other aggregates as well as avg.
Aggregate function calls in SQL allow DISTINCT and ORDER BY options that control which rows
are fed to the aggregate's transition function and in what order. These options are implemented behind
the scenes and are not the concern of the aggregate's support functions.
Aggregate functions can optionally support moving-aggregate mode, in which the aggregate provides an
inverse transition function that allows rows to be removed from the running state value as they exit a
moving window frame. Without an inverse transition function, the window function mechanism must
recalculate the aggregate from scratch each time the frame starting point moves, resulting in run time
proportional to the number of input rows multiplied by the average frame length. With an inverse
transition function, the run time is only proportional to the number of input rows.
The inverse transition function is passed the current state value and the aggregate input value(s) for
the earliest row included in the current state. It must reconstruct what the state value would have been
if the given input row had never been aggregated, but only the rows following it. This sometimes
requires that the forward transition function keep more state than is needed for plain aggregation mode.
Therefore, the moving-aggregate mode uses a completely separate implementation from the plain
mode: it has its own state data type, its own forward transition function, and its own final function
if needed. These can be the same as the plain mode's data type and functions, if there is no need for
extra state.
As an example, we could extend the sum aggregate given above to support moving-aggregate mode
like this:
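(A sketch, assuming a subtraction function complex_sub to serve as the inverse of complex_add:)

CREATE AGGREGATE sum (complex)
(
    sfunc = complex_add,
    stype = complex,
    initcond = '(0,0)',
    msfunc = complex_add,
    minvfunc = complex_sub,
    mstype = complex,
    minitcond = '(0,0)'
);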
The parameters whose names begin with m define the moving-aggregate implementation. Except for
the inverse transition function minvfunc, they correspond to the plain-aggregate parameters without
m.
The forward transition function for moving-aggregate mode is not allowed to return null as the new
state value. If the inverse transition function returns null, this is taken as an indication that the inverse
function cannot reverse the state calculation for this particular input, and so the aggregate calculation
will be redone from scratch for the current frame starting position. This convention allows moving-ag-
gregate mode to be used in situations where there are some infrequent cases that are impractical to
reverse out of the running state value. The inverse transition function can “punt” on these cases, and
yet still come out ahead so long as it can work for most cases. As an example, an aggregate working
with floating-point numbers might choose to punt when a NaN (not a number) input has to be removed
from the running state value.
When writing moving-aggregate support functions, it is important to be sure that the inverse transition
function can reconstruct the correct state value exactly. Otherwise there might be user-visible differ-
ences in results depending on whether the moving-aggregate mode is used. An example of an aggregate
for which adding an inverse transition function seems easy at first, yet where this requirement cannot
be met is sum over float4 or float8 inputs. A naive declaration of sum(float8) could be
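the following sketch, built from the standard float8pl and float8mi functions and named
unsafe_sum to match the query below:

CREATE AGGREGATE unsafe_sum (float8)
(
    stype = float8,
    sfunc = float8pl,
    mstype = float8,
    msfunc = float8pl,
    minvfunc = float8mi
);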
This aggregate, however, can give wildly different results than it would have without the inverse
transition function. For example, consider
SELECT
  unsafe_sum(x) OVER (ORDER BY n ROWS BETWEEN CURRENT ROW AND 1 FOLLOWING)
FROM (VALUES (1, 1.0e20::float8),
             (2, 1.0::float8)) AS v (n,x);
This query returns 0 as its second result, rather than the expected answer of 1. The cause is the limited
precision of floating-point values: adding 1 to 1e20 results in 1e20 again, and so subtracting 1e20
from that yields 0, not 1. Note that this is a limitation of floating-point arithmetic in general, not a
limitation of PostgreSQL.
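As an example of an aggregate with a polymorphic state type, consider this sketch of an array_accum
aggregate that collects its inputs into an array, built from the standard array_append function (the output
shown below comes from this aggregate):

CREATE AGGREGATE array_accum (anyelement)
(
    sfunc = array_append,
    stype = anyarray,
    initcond = '{}'
);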
Here, the actual state type for any given aggregate call is the array type having the actual input type as
elements. The behavior of the aggregate is to concatenate all the inputs into an array of that type. (Note:
the built-in aggregate array_agg provides similar functionality, with better performance than this
definition would have.)
Here's the output using two different actual data types as arguments:
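First, collecting the column names of pg_tablespace (a sketch of a query assuming the array_accum
definition above):

SELECT attrelid::regclass, array_accum(attname)
    FROM pg_attribute
    WHERE attnum > 0 AND attrelid = 'pg_tablespace'::regclass
    GROUP BY attrelid;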
attrelid | array_accum
---------------+---------------------------------------
pg_tablespace | {spcname,spcowner,spcacl,spcoptions}
(1 row)
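and second, collecting the column data types instead (again a sketch):

SELECT attrelid::regclass, array_accum(atttypid::regtype)
    FROM pg_attribute
    WHERE attnum > 0 AND attrelid = 'pg_tablespace'::regclass
    GROUP BY attrelid;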
attrelid | array_accum
---------------+---------------------------
pg_tablespace | {name,oid,aclitem[],text[]}
(1 row)
Ordinarily, an aggregate function with a polymorphic result type has a polymorphic state type, as in
the above example. This is necessary because otherwise the final function cannot be declared sensibly:
it would need to have a polymorphic result type but no polymorphic argument type, which CREATE
FUNCTION will reject on the grounds that the result type cannot be deduced from a call. But sometimes
it is inconvenient to use a polymorphic state type. The most common case is where the aggregate
support functions are to be written in C and the state type should be declared as internal because
there is no SQL-level equivalent for it. To address this case, it is possible to declare the final function
as taking extra “dummy” arguments that match the input arguments of the aggregate. Such dummy
arguments are always passed as null values since no specific value is available when the final function
is called. Their only use is to allow a polymorphic final function's result type to be connected to
the aggregate's input type(s). For example, the definition of the built-in aggregate array_agg is
equivalent to
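the following (a sketch in terms of the internal array_agg_transfn and array_agg_finalfn
support functions):

CREATE AGGREGATE array_agg (anynonarray)
(
    sfunc = array_agg_transfn,
    stype = internal,
    finalfunc = array_agg_finalfn,
    finalfunc_extra
);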
Here, the finalfunc_extra option specifies that the final function receives, in addition to the
state value, extra dummy argument(s) corresponding to the aggregate's input argument(s). The extra
anynonarray argument allows the declaration of array_agg_finalfn to be valid.
An aggregate function can be made to accept a varying number of arguments by declaring its last ar-
gument as a VARIADIC array, in much the same fashion as for regular functions; see Section 38.5.5.
The aggregate's transition function(s) must have the same array type as their last argument. The tran-
sition function(s) typically would also be marked VARIADIC, but this is not strictly required.
Note
Variadic aggregates are easily misused in connection with the ORDER BY option (see Sec-
tion 4.2.7), since the parser cannot tell whether the wrong number of actual arguments have
been given in such a combination. Keep in mind that everything to the right of ORDER BY is
a sort key, not an argument to the aggregate. For example, in a call written as
myaggregate(a ORDER BY a, b, c) (where myaggregate is a hypothetical variadic aggregate),
the parser will see this as a single aggregate function argument and three sort keys. However,
the user might have intended myaggregate(a, b, c ORDER BY a), that is, three arguments and
a single sort key.
For the same reason, it's wise to think twice before creating aggregate functions with the same
names and different numbers of regular arguments.
The aggregates we have been describing so far are “normal” aggregates. PostgreSQL also supports
ordered-set aggregates, which differ from normal aggregates in two key ways. First, in addition to
ordinary aggregated arguments that are evaluated once per input row, an ordered-set aggregate can
have “direct” arguments that are evaluated only once per aggregation operation. Second, the syntax
for the ordinary aggregated arguments specifies a sort ordering for them explicitly. An ordered-set
aggregate is usually used to implement a computation that depends on a specific row ordering, for
instance rank or percentile, so that the sort ordering is a required aspect of any call. For example, the
built-in definition of percentile_disc is equivalent to:
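(A sketch in terms of the internal ordered_set_transition and percentile_disc_final
support functions:)

CREATE AGGREGATE percentile_disc (float8 ORDER BY anyelement)
(
    sfunc = ordered_set_transition,
    stype = internal,
    finalfunc = percentile_disc_final,
    finalfunc_extra
);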
This aggregate takes a float8 direct argument (the percentile fraction) and an aggregated input that
can be of any sortable data type. It could be used to obtain a median household income like this:
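(A sketch, assuming a hypothetical households table with an income column:)

SELECT percentile_disc(0.5) WITHIN GROUP (ORDER BY income) FROM households;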
Here, 0.5 is a direct argument; it would make no sense for the percentile fraction to be a value varying
across rows.
Unlike the case for normal aggregates, the sorting of input rows for an ordered-set aggregate is not
done behind the scenes, but is the responsibility of the aggregate's support functions. The typical im-
plementation approach is to keep a reference to a “tuplesort” object in the aggregate's state value,
feed the incoming rows into that object, and then complete the sorting and read out the data in the
final function. This design allows the final function to perform special operations such as injecting
additional “hypothetical” rows into the data to be sorted. While normal aggregates can often be im-
plemented with support functions written in PL/pgSQL or another PL language, ordered-set aggre-
gates generally have to be written in C, since their state values aren't definable as any SQL data type.
(In the above example, notice that the state value is declared as type internal — this is typical.)
Also, because the final function performs the sort, it is not possible to continue adding input rows
by executing the transition function again later. This means the final function is not READ_ONLY; it
must be declared in CREATE AGGREGATE as READ_WRITE, or as SHAREABLE if it's possible for
additional final-function calls to make use of the already-sorted state.
The state transition function for an ordered-set aggregate receives the current state value plus the ag-
gregated input values for each row, and returns the updated state value. This is the same definition as
for normal aggregates, but note that the direct arguments (if any) are not provided. The final function
receives the last state value, the values of the direct arguments if any, and (if finalfunc_extra is
specified) null values corresponding to the aggregated input(s). As with normal aggregates, final-
func_extra is only really useful if the aggregate is polymorphic; then the extra dummy argumen-
t(s) are needed to connect the final function's result type to the aggregate's input type(s).
Currently, ordered-set aggregates cannot be used as window functions, and therefore there is no need
for them to support moving-aggregate mode.
To support partial aggregation, the aggregate definition must provide a combine function, which takes
two values of the aggregate's state type (representing the results of aggregating over two subsets of
the input rows) and produces a new value of the state type, representing what the state would have
been after aggregating over the combination of those sets of rows. It is unspecified what the relative
order of the input rows from the two sets would have been. This means that it's usually impossible to
define a useful combine function for aggregates that are sensitive to input row order.
As simple examples, MAX and MIN aggregates can be made to support partial aggregation by specify-
ing the combine function as the same greater-of-two or lesser-of-two comparison function that is used
as their transition function. SUM aggregates just need an addition function as combine function. (Again,
this is the same as their transition function, unless the state value is wider than the input data type.)
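For instance, the sum(complex) aggregate sketched earlier could support partial aggregation by reusing
its addition function as the combine function (a sketch, assuming the complex_add function used before):

CREATE AGGREGATE sum (complex)
(
    sfunc = complex_add,
    stype = complex,
    combinefunc = complex_add
);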
The combine function is treated much like a transition function that happens to take a value of the
state type, not of the underlying input type, as its second argument. In particular, the rules for dealing
with null values and strict functions are similar. Also, if the aggregate definition specifies a non-null
initcond, keep in mind that that will be used not only as the initial state for each partial aggregation
run, but also as the initial state for the combine function, which will be called to combine each partial
result into that state.
If the aggregate's state type is declared as internal, it is the combine function's responsibility that
its result is allocated in the correct memory context for aggregate state values. This means in particular
that when the first input is NULL it's invalid to simply return the second input, as that value will be in
the wrong context and will not have sufficient lifespan.
When the aggregate's state type is declared as internal, it is usually also appropriate for the ag-
gregate definition to provide a serialization function and a deserialization function, which allow such
a state value to be copied from one process to another. Without these functions, parallel aggregation
cannot be performed, and future applications such as local/remote aggregation will probably not work
either.
A serialization function must take a single argument of type internal and return a result of type
bytea, which represents the state value packaged up into a flat blob of bytes. Conversely, a deserial-
ization function reverses that conversion. It must take two arguments of types bytea and internal,
and return a result of type internal. (The second argument is unused and is always zero, but it is
required for type-safety reasons.) The result of the deserialization function should simply be allocated
in the current memory context, as unlike the combine function's result, it is not long-lived.
Worth noting also is that for an aggregate to be executed in parallel, the aggregate itself must be
marked PARALLEL SAFE. The parallel-safety markings on its support functions are not consulted.
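For illustration, a parallel-capable re-implementation of the average of numeric values could be declared
along these lines, reusing built-in support functions (the name myavg is just a placeholder):

CREATE AGGREGATE myavg (numeric)
(
    stype = internal,
    sfunc = numeric_avg_accum,
    finalfunc = numeric_avg,
    combinefunc = numeric_avg_combine,
    serialfunc = numeric_avg_serialize,
    deserialfunc = numeric_avg_deserialize,
    parallel = safe
);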
A C-language aggregate support function can detect whether it is being called as part of an aggregate
evaluation by calling AggCheckCallContext, for example:

if (AggCheckCallContext(fcinfo, NULL))
One reason for checking this is that when it is true, the first input must be a temporary state value and
can therefore safely be modified in-place rather than allocating a new copy. See int8inc() for an
example. (While aggregate transition functions are always allowed to modify the transition value in-
place, aggregate final functions are generally discouraged from doing so; if they do so, the behavior
must be declared when creating the aggregate. See CREATE AGGREGATE for more detail.)
The second argument of AggCheckCallContext can be used to retrieve the memory context in
which aggregate state values are being kept. This is useful for transition functions that wish to use
“expanded” objects (see Section 38.12.1) as their state values. On first call, the transition function
should return an expanded object whose memory context is a child of the aggregate state context, and
then keep returning the same expanded object on subsequent calls. See array_append() for an
example. (array_append() is not the transition function of any built-in aggregate, but it is written
to behave efficiently when used as transition function of a custom aggregate.)
The examples in this section can be found in complex.sql and complex.c in the src/tuto-
rial directory of the source distribution. See the README file in that directory for instructions about
running the examples.
A user-defined type must always have input and output functions. These functions determine how
the type appears in strings (for input by the user and output to the user) and how the type is organized
in memory. The input function takes a null-terminated character string as its argument and returns the
internal (in memory) representation of the type. The output function takes the internal representation
of the type as argument and returns a null-terminated character string. If we want to do anything
more with the type than merely store it, we must provide additional functions to implement whatever
operations we'd like to have for the type.
Suppose we want to define a type complex that represents complex numbers. A natural way to
represent a complex number in memory is a C structure containing two double fields, x and y, holding
the real and imaginary parts. We will need to make this a pass-by-reference type, since such a structure
is too large to fit into a single Datum value.
As the external string representation of the type, we choose a string of the form (x,y).
The input and output functions are usually not hard to write, especially the output function. But when
defining the external string representation of the type, remember that you must eventually write a
complete and robust parser for that representation as your input function. For instance:
PG_FUNCTION_INFO_V1(complex_in);

Datum
complex_in(PG_FUNCTION_ARGS)
{
    char       *str = PG_GETARG_CSTRING(0);
    double      x,
                y;
    Complex    *result;

    if (sscanf(str, " ( %lf , %lf )", &x, &y) != 2)
        ereport(ERROR,
                (errcode(ERRCODE_INVALID_TEXT_REPRESENTATION),
                 errmsg("invalid input syntax for complex: \"%s\"", str)));

    result = (Complex *) palloc(sizeof(Complex));
    result->x = x;
    result->y = y;
    PG_RETURN_POINTER(result);
}

The output function can simply be:

PG_FUNCTION_INFO_V1(complex_out);

Datum
complex_out(PG_FUNCTION_ARGS)
{
    Complex    *complex = (Complex *) PG_GETARG_POINTER(0);
    char       *result;

    result = psprintf("(%g,%g)", complex->x, complex->y);
    PG_RETURN_CSTRING(result);
}
You should be careful to make the input and output functions inverses of each other. If you do not,
you will have severe problems when you need to dump your data into a file and then read it back in.
This is a particularly common problem when floating-point numbers are involved.
Optionally, a user-defined type can provide binary input and output routines. Binary I/O is normally
faster but less portable than textual I/O. As with textual I/O, it is up to you to define exactly what
the external binary representation is. Most of the built-in data types try to provide a machine-indepen-
dent binary representation. For complex, we will piggy-back on the binary I/O converters for type
float8:
PG_FUNCTION_INFO_V1(complex_recv);
Datum
complex_recv(PG_FUNCTION_ARGS)
{
StringInfo buf = (StringInfo) PG_GETARG_POINTER(0);
Complex *result;
result = (Complex *) palloc(sizeof(Complex));
result->x = pq_getmsgfloat8(buf);
result->y = pq_getmsgfloat8(buf);
PG_RETURN_POINTER(result);
}
PG_FUNCTION_INFO_V1(complex_send);
Datum
complex_send(PG_FUNCTION_ARGS)
{
Complex *complex = (Complex *) PG_GETARG_POINTER(0);
StringInfoData buf;
pq_begintypsend(&buf);
pq_sendfloat8(&buf, complex->x);
pq_sendfloat8(&buf, complex->y);
PG_RETURN_BYTEA_P(pq_endtypsend(&buf));
}
Once we have written the I/O functions and compiled them into a shared library, we can define the
complex type in SQL. First we declare it as a shell type:
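That is, a bare CREATE TYPE with no parameters:

CREATE TYPE complex;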
This serves as a placeholder that allows us to reference the type while defining its I/O functions. Now
we can define the I/O functions:
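(A sketch follows; here 'filename' is a placeholder for the path of the shared library.)

CREATE FUNCTION complex_in(cstring)
    RETURNS complex
    AS 'filename'
    LANGUAGE C IMMUTABLE STRICT;

CREATE FUNCTION complex_out(complex)
    RETURNS cstring
    AS 'filename'
    LANGUAGE C IMMUTABLE STRICT;

CREATE FUNCTION complex_recv(internal)
    RETURNS complex
    AS 'filename'
    LANGUAGE C IMMUTABLE STRICT;

CREATE FUNCTION complex_send(complex)
    RETURNS bytea
    AS 'filename'
    LANGUAGE C IMMUTABLE STRICT;

Finally, we can provide the full definition of the data type (16 bytes being the size of the two double
fields):

CREATE TYPE complex (
    internallength = 16,
    input = complex_in,
    output = complex_out,
    receive = complex_recv,
    send = complex_send,
    alignment = double
);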
When you define a new base type, PostgreSQL automatically provides support for arrays of that
type. The array type typically has the same name as the base type with the underscore character (_)
prepended.
Once the data type exists, we can declare additional functions to provide useful operations on the data
type. Operators can then be defined atop the functions, and if needed, operator classes can be created
to support indexing of the data type. These additional layers are discussed in following sections.
If the internal representation of the data type is variable-length, the internal representation must follow
the standard layout for variable-length data: the first four bytes must be a char[4] field which is
never accessed directly (customarily named vl_len_). You must use the SET_VARSIZE() macro
to store the total size of the datum (including the length field itself) in this field and VARSIZE() to
retrieve it. (These macros exist because the length field may be encoded depending on platform.)
For further details see the description of the CREATE TYPE command.
To support TOAST storage, the C functions operating on the data type must always be careful to un-
pack any toasted values they are handed by using PG_DETOAST_DATUM. (This detail is customarily
hidden by defining type-specific GETARG_DATATYPE_P macros.) Then, when running the CREATE
TYPE command, specify the internal length as variable and select some appropriate storage option
other than plain.
If data alignment is unimportant (either just for a specific function or because the data type speci-
fies byte alignment anyway) then it's possible to avoid some of the overhead of PG_DETOAST_DA-
TUM. You can use PG_DETOAST_DATUM_PACKED instead (customarily hidden by defining a
GETARG_DATATYPE_PP macro) and use the macros VARSIZE_ANY_EXHDR and VARDATA_ANY
to access a potentially-packed datum. Again, the data returned by these macros is not aligned
even if the data type definition specifies an alignment. If the alignment is important you must go
through the regular PG_DETOAST_DATUM interface.
Note
Older code frequently declares vl_len_ as an int32 field instead of char[4]. This is OK
as long as the struct definition has other fields that have at least int32 alignment. But it is
dangerous to use such a struct definition when working with a potentially unaligned datum;
the compiler may take it as license to assume the datum actually is aligned, leading to core
dumps on architectures that are strict about alignment.
Another feature that's enabled by TOAST support is the possibility of having an expanded in-memory
data representation that is more convenient to work with than the format that is stored on disk. The
regular or “flat” varlena storage format is ultimately just a blob of bytes; it cannot for example contain
pointers, since it may get copied to other locations in memory. For complex data types, the flat format
may be quite expensive to work with, so PostgreSQL provides a way to “expand” the flat format into
a representation that is more suited to computation, and then pass that format in-memory between
functions of the data type.
To use expanded storage, a data type must define an expanded format that follows the rules given
in src/include/utils/expandeddatum.h, and provide functions to “expand” a flat varlena
value into expanded format and “flatten” the expanded format back to the regular varlena represen-
tation. Then ensure that all C functions for the data type can accept either representation, possibly
by converting one into the other immediately upon receipt. This does not require fixing all existing
functions for the data type at once, because the standard PG_DETOAST_DATUM macro is defined to
convert expanded inputs into regular flat format. Therefore, existing functions that work with the flat
varlena format will continue to work, though slightly inefficiently, with expanded inputs; they need
not be converted until and unless better performance is important.
C functions that know how to work with an expanded representation typically fall into two categories:
those that can only handle expanded format, and those that can handle either expanded or flat varlena
inputs. The former are easier to write but may be less efficient overall, because converting a flat input to
expanded form for use by a single function may cost more than is saved by operating on the expanded
format. When only expanded format need be handled, conversion of flat inputs to expanded form
can be hidden inside an argument-fetching macro, so that the function appears no more complex than
one working with traditional varlena input. To handle both types of input, write an argument-fetching
function that will detoast external, short-header, and compressed varlena inputs, but not expanded
inputs. Such a function can be defined as returning a pointer to a union of the flat varlena format and
the expanded format. Callers can use the VARATT_IS_EXPANDED_HEADER() macro to determine
which format they received.
The TOAST infrastructure not only allows regular varlena values to be distinguished from expanded
values, but also distinguishes “read-write” and “read-only” pointers to expanded values. C functions
that only need to examine an expanded value, or will only change it in safe and non-semantically-visi-
ble ways, need not care which type of pointer they receive. C functions that produce a modified version
of an input value are allowed to modify an expanded input value in-place if they receive a read-write
pointer, but must not modify the input if they receive a read-only pointer; in that case they have to copy
the value first, producing a new value to modify. A C function that has constructed a new expanded
value should always return a read-write pointer to it. Also, a C function that is modifying a read-write
expanded value in-place should take care to leave the value in a sane state if it fails partway through.
For examples of working with expanded values, see the standard array infrastructure, particularly
src/backend/utils/adt/array_expanded.c.
PostgreSQL supports left unary, right unary, and binary operators. Operators can be overloaded; that
is, the same operator name can be used for different operators that have different numbers and types
of operands. When a query is executed, the system determines the operator to call from the number
and types of the provided operands.
Here is an example of creating an operator for adding two complex numbers. We assume we've already
created the definition of type complex (see Section 38.12). First we need a function that does the
work, then we can define the operator:
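A sketch of the supporting function declaration ('filename' again standing for the shared library path):

CREATE FUNCTION complex_add(complex, complex)
    RETURNS complex
    AS 'filename'
    LANGUAGE C IMMUTABLE STRICT;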
CREATE OPERATOR + (
leftarg = complex,
rightarg = complex,
function = complex_add,
commutator = +
);
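A query exercising the new operator might look like this (assuming a hypothetical table test_complex
with complex columns a and b):

SELECT (a + b) AS c FROM test_complex;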
c
-----------------
(5.2,6.05)
(133.42,144.95)
We've shown how to create a binary operator here. To create unary operators, just omit one of left-
arg (for left unary) or rightarg (for right unary). The function clause and the argument clauses
are the only required items in CREATE OPERATOR. The commutator clause shown in the example
is an optional hint to the query optimizer. Further details about commutator and other optimizer
hints appear in the next section.
Additional optimization clauses might be added in future versions of PostgreSQL. The ones described
here are all the ones that release 11.17 understands.
38.14.1. COMMUTATOR
The COMMUTATOR clause, if provided, names an operator that is the commutator of the operator being
defined. We say that operator A is the commutator of operator B if (x A y) equals (y B x) for all
possible input values x, y. Notice that B is also the commutator of A. For example, operators < and >
for a particular data type are usually each others' commutators, and operator + is usually commutative
with itself. But operator - is usually not commutative with anything.
The left operand type of a commutable operator is the same as the right operand type of its commutator,
and vice versa. So the name of the commutator operator is all that PostgreSQL needs to be given to
look up the commutator, and that's all that needs to be provided in the COMMUTATOR clause.
It's critical to provide commutator information for operators that will be used in indexes and join
clauses, because this allows the query optimizer to “flip around” such a clause to the forms needed for
different plan types. For example, consider a query with a WHERE clause like tab1.x = tab2.y,
where tab1.x and tab2.y are of a user-defined type, and suppose that tab2.y is indexed. The
optimizer cannot generate an index scan unless it can determine how to flip the clause around to
tab2.y = tab1.x, because the index-scan machinery expects to see the indexed column on the
left of the operator it is given. PostgreSQL will not simply assume that this is a valid transformation
— the creator of the = operator must specify that it is valid, by marking the operator with commutator
information.
When you are defining a self-commutative operator, you just do it. When you are defining a pair of
commutative operators, things are a little trickier: how can the first one to be defined refer to the other
one, which you haven't defined yet? There are two solutions to this problem:
• One way is to omit the COMMUTATOR clause in the first operator that you define, and then provide
one in the second operator's definition. Since PostgreSQL knows that commutative operators come
in pairs, when it sees the second definition it will automatically go back and fill in the missing
COMMUTATOR clause in the first definition.
• The other, more straightforward way is just to include COMMUTATOR clauses in both definitions.
When PostgreSQL processes the first definition and realizes that COMMUTATOR refers to a nonex-
istent operator, the system will make a dummy entry for that operator in the system catalog. This
dummy entry will have valid data only for the operator name, left and right operand types, and
result type, since that's all that PostgreSQL can deduce at this point. The first operator's catalog
entry will link to this dummy entry. Later, when you define the second operator, the system updates
the dummy entry with the additional information from the second definition. If you try to use the
dummy operator before it's been filled in, you'll just get an error message.
38.14.2. NEGATOR
The NEGATOR clause, if provided, names an operator that is the negator of the operator being defined.
We say that operator A is the negator of operator B if both return Boolean results and (x A y) equals
NOT (x B y) for all possible inputs x, y. Notice that B is also the negator of A. For example, < and
>= are a negator pair for most data types. An operator can never validly be its own negator.
Unlike commutators, a pair of unary operators could validly be marked as each other's negators; that
would mean (A x) equals NOT (B x) for all x, or the equivalent for right unary operators.
An operator's negator must have the same left and/or right operand types as the operator to be defined,
so just as with COMMUTATOR, only the operator name need be given in the NEGATOR clause.
Providing a negator is very helpful to the query optimizer since it allows expressions like NOT (x
= y) to be simplified into x <> y. This comes up more often than you might think, because NOT
operations can be inserted as a consequence of other rearrangements.
Pairs of negator operators can be defined using the same methods explained above for commutator
pairs.
38.14.3. RESTRICT
The RESTRICT clause, if provided, names a restriction selectivity estimation function for the operator.
(Note that this is a function name, not an operator name.) RESTRICT clauses only make sense for
binary operators that return boolean. The idea behind a restriction selectivity estimator is to guess
what fraction of the rows in a table will satisfy a WHERE-clause condition of the form:
column OP constant
for the current operator and a particular constant value. This assists the optimizer by giving it some
idea of how many rows will be eliminated by WHERE clauses that have this form. (What happens if
the constant is on the left, you might be wondering? Well, that's one of the things that COMMUTATOR
is for...)
Writing new restriction selectivity estimation functions is far beyond the scope of this chapter, but
fortunately you can usually just use one of the system's standard estimators for many of your own
operators. These are the standard restriction estimators:
eqsel for =
neqsel for <>
scalarltsel for <
scalarlesel for <=
scalargtsel for >
scalargesel for >=
You can frequently get away with using either eqsel or neqsel for operators that have very high
or very low selectivity, even if they aren't really equality or inequality. For example, the approxi-
mate-equality geometric operators use eqsel on the assumption that they'll usually only match a
small fraction of the entries in a table.
You can use scalarltsel, scalarlesel, scalargtsel and scalargesel for compar-
isons on data types that have some sensible means of being converted into numeric scalars for range
comparisons. If possible, add the data type to those understood by the function convert_to_s-
calar() in src/backend/utils/adt/selfuncs.c. (Eventually, this function should be
replaced by per-data-type functions identified through a column of the pg_type system catalog; but
that hasn't happened yet.) If you do not do this, things will still work, but the optimizer's estimates
won't be as good as they could be.
There are additional selectivity estimation functions designed for geometric operators in src/back-
end/utils/adt/geo_selfuncs.c: areasel, positionsel, and contsel. At this writ-
ing these are just stubs, but you might want to use them (or even better, improve them) anyway.
38.14.4. JOIN
The JOIN clause, if provided, names a join selectivity estimation function for the operator. (Note that
this is a function name, not an operator name.) JOIN clauses only make sense for binary operators
that return boolean. The idea behind a join selectivity estimator is to guess what fraction of the rows
in a pair of tables will satisfy a WHERE-clause condition of the form:
table1.column1 OP table2.column2
for the current operator. As with the RESTRICT clause, this helps the optimizer very substantially by
letting it figure out which of several possible join sequences is likely to take the least work.
As before, this chapter will make no attempt to explain how to write a join selectivity estimator func-
tion, but will just suggest that you use one of the standard estimators if one is applicable:
eqjoinsel for =
neqjoinsel for <>
scalarltjoinsel for <
scalarlejoinsel for <=
scalargtjoinsel for >
scalargejoinsel for >=
areajoinsel for 2D area-based comparisons
positionjoinsel for 2D position-based comparisons
contjoinsel for 2D containment-based comparisons
38.14.5. HASHES
The HASHES clause, if present, tells the system that it is permissible to use the hash join method for
a join based on this operator. HASHES only makes sense for a binary operator that returns boolean,
and in practice the operator must represent equality for some data type or pair of data types.
The assumption underlying hash join is that the join operator can only return true for pairs of left and
right values that hash to the same hash code. If two values get put in different hash buckets, the join
will never compare them at all, implicitly assuming that the result of the join operator must be false.
So it never makes sense to specify HASHES for operators that do not represent some form of equality.
In most cases it is only practical to support hashing for operators that take the same data type on both
sides. However, sometimes it is possible to design compatible hash functions for two or more data
types; that is, functions that will generate the same hash codes for “equal” values, even though the
values have different representations. For example, it's fairly simple to arrange this property when
hashing integers of different widths.
To be marked HASHES, the join operator must appear in a hash index operator family. This is not
enforced when you create the operator, since of course the referencing operator family couldn't exist
yet. But attempts to use the operator in hash joins will fail at run time if no such operator family exists.
The system needs the operator family to find the data-type-specific hash function(s) for the operator's
input data type(s). Of course, you must also create suitable hash functions before you can create the
operator family.
Care should be exercised when preparing a hash function, because there are machine-dependent ways
in which it might fail to do the right thing. For example, if your data type is a structure in which there
might be uninteresting pad bits, you cannot simply pass the whole structure to hash_any. (Unless
you write your other operators and functions to ensure that the unused bits are always zero, which is
the recommended strategy.) Another example is that on machines that meet the IEEE floating-point
standard, negative zero and positive zero are different values (different bit patterns) but they are de-
fined to compare equal. If a float value might contain negative zero then extra steps are needed to
ensure it generates the same hash value as positive zero.
A hash-joinable operator must have a commutator (itself if the two operand data types are the same, or
a related equality operator if they are different) that appears in the same operator family. If this is not
the case, planner errors might occur when the operator is used. Also, it is a good idea (but not strictly
required) for a hash operator family that supports multiple data types to provide equality operators for
every combination of the data types; this allows better optimization.
Note
The function underlying a hash-joinable operator must be marked immutable or stable. If it is
volatile, the system will never attempt to use the operator for a hash join.
Note
If a hash-joinable operator has an underlying function that is marked strict, the function must
also be complete: that is, it should return true or false, never null, for any two nonnull inputs.
If this rule is not followed, hash-optimization of IN operations might generate wrong results.
(Specifically, IN might return false where the correct answer according to the standard would
be null; or it might yield an error complaining that it wasn't prepared for a null result.)
38.14.6. MERGES
The MERGES clause, if present, tells the system that it is permissible to use the merge-join method for
a join based on this operator. MERGES only makes sense for a binary operator that returns boolean,
and in practice the operator must represent equality for some data type or pair of data types.
Merge join is based on the idea of sorting the left- and right-hand tables into order and then scanning
them in parallel. So, both data types must be capable of being fully ordered, and the join operator
must be one that can only succeed for pairs of values that fall at the “same place” in the sort order.
In practice this means that the join operator must behave like equality. But it is possible to merge-
join two distinct data types so long as they are logically compatible. For example, the smallint-
versus-integer equality operator is merge-joinable. We only need sorting operators that will bring
both data types into a logically compatible sequence.
To be marked MERGES, the join operator must appear as an equality member of a btree index
operator family. This is not enforced when you create the operator, since of course the referencing
operator family couldn't exist yet. But the operator will not actually be used for merge joins unless a
matching operator family can be found. The MERGES flag thus acts as a hint to the planner that it's
worth looking for a matching operator family.
A merge-joinable operator must have a commutator (itself if the two operand data types are the same,
or a related equality operator if they are different) that appears in the same operator family. If this is
not the case, planner errors might occur when the operator is used. Also, it is a good idea (but not
strictly required) for a btree operator family that supports multiple data types to provide equality
operators for every combination of the data types; this allows better optimization.
Note
The function underlying a merge-joinable operator must be marked immutable or stable. If it
is volatile, the system will never attempt to use the operator for a merge join.
Operator classes can be grouped into operator families to show the relationships between semantically
compatible classes. When only a single data type is involved, an operator class is sufficient, so we'll
focus on that case first and then return to operator families.
The routines for an index method do not directly know anything about the data types that the index
method will operate on. Instead, an operator class identifies the set of operations that the index method
needs to use to work with a particular data type. Operator classes are so called because one thing they
specify is the set of WHERE-clause operators that can be used with an index (i.e., can be converted
into an index-scan qualification). An operator class can also specify some support function that are
needed by the internal operations of the index method, but do not directly correspond to any WHERE-
clause operator that can be used with the index.
It is possible to define multiple operator classes for the same data type and index method. By doing
this, multiple sets of indexing semantics can be defined for a single data type. For example, a B-tree
index requires a sort ordering to be defined for each data type it works on. It might be useful for a
complex-number data type to have one B-tree operator class that sorts the data by complex absolute
value, another that sorts by real part, and so on. Typically, one of the operator classes will be deemed
most commonly useful and will be marked as the default operator class for that data type and index
method.
The same operator class name can be used for several different index methods (for example, both
B-tree and hash index methods have operator classes named int4_ops), but each such class is an
independent entity and must be defined separately.
The operators associated with an operator class are identified by “strategy numbers”, which serve to
identify the semantics of each operator within the context of its operator class. For example, B-trees
impose a strict ordering on keys, lesser to greater, and so operators like “less than” and “greater than
or equal to” are interesting with respect to a B-tree. Because PostgreSQL allows the user to define
operators, PostgreSQL cannot look at the name of an operator (e.g., < or >=) and tell what kind of
comparison it is. Instead, the index method defines a set of “strategies”, which can be thought of as
generalized operators. Each operator class specifies which actual operator corresponds to each strategy
for a particular data type and interpretation of the index semantics.
The B-tree index method defines five strategies, shown in Table 38.2.
Hash indexes support only equality comparisons, and so they use only one strategy, shown in Ta-
ble 38.3.
GiST indexes are more flexible: they do not have a fixed set of strategies at all. Instead, the “consis-
tency” support routine of each particular GiST operator class interprets the strategy numbers however
it likes. As an example, several of the built-in GiST index operator classes index two-dimensional
geometric objects, providing the “R-tree” strategies shown in Table 38.4. Four of these are true two-
dimensional tests (overlaps, same, contains, contained by); four of them consider only the X direction;
and the other four provide the same tests in the Y direction.
SP-GiST indexes are similar to GiST indexes in flexibility: they don't have a fixed set of strategies.
Instead the support routines of each operator class interpret the strategy numbers according to the
operator class's definition. As an example, the strategy numbers used by the built-in operator classes
for points are shown in Table 38.5.
GIN indexes are similar to GiST and SP-GiST indexes, in that they don't have a fixed set of strategies
either. Instead the support routines of each operator class interpret the strategy numbers according to
the operator class's definition. As an example, the strategy numbers used by the built-in operator class
for arrays are shown in Table 38.6.
BRIN indexes are similar to GiST, SP-GiST and GIN indexes in that they don't have a fixed set of
strategies either. Instead the support routines of each operator class interpret the strategy numbers
according to the operator class's definition. As an example, the strategy numbers used by the built-in
Minmax operator classes are shown in Table 38.7.
Notice that all the operators listed above return Boolean values. In practice, all operators defined as
index method search operators must return type boolean, since they must appear at the top level of a
WHERE clause to be used with an index. (Some index access methods also support ordering operators,
which typically don't return Boolean values; that feature is discussed in Section 38.15.7.)
Strategies aren't usually enough information for the system to figure out how to use an index; in practice,
the index methods require additional support routines in order to work. For example, the B-tree index
method must be able to compare two keys and determine whether one is greater than, equal to, or less
than the other, and the hash index method must be able to compute hash codes for
key values. These operations do not correspond to operators used in qualifications in SQL commands;
they are administrative routines used by the index methods, internally.
Just as with strategies, the operator class identifies which specific functions should play each of these
roles for a given data type and semantic interpretation. The index method defines the set of functions
it needs, and the operator class identifies the correct functions to use by assigning them to the “support
function numbers” specified by the index method.
B-trees require a comparison support function, and allow two additional support functions to be sup-
plied at the operator class author's option, as shown in Table 38.8. The requirements for these support
functions are explained further in Section 63.3.
Hash indexes require one support function, and allow a second one to be supplied at the operator class
author's option, as shown in Table 38.9.
GiST indexes have nine support functions, two of which are optional, as shown in Table 38.10. (For
more information see Chapter 64.)
SP-GiST indexes require five support functions, as shown in Table 38.11. (For more information see
Chapter 65.)
GIN indexes have six support functions, three of which are optional, as shown in Table 38.12. (For
more information see Chapter 66.)
BRIN indexes have four basic support functions, as shown in Table 38.13; those basic functions may
require additional support functions to be provided. (For more information see Section 67.3.)
Unlike search operators, support functions return whichever data type the particular index method
expects; for example in the case of the comparison function for B-trees, a signed integer. The number
and types of the arguments to each support function are likewise dependent on the index method. For
B-tree and hash the comparison and hashing support functions take the same input data types as do
the operators included in the operator class, but this is not the case for most GiST, SP-GiST, GIN,
and BRIN support functions.
38.15.4. An Example
Now that we have seen the ideas, here is the promised example of creating a new operator class.
(You can find a working copy of this example in src/tutorial/complex.c and src/tuto-
rial/complex.sql in the source distribution.) The operator class encapsulates operators that sort
complex numbers in absolute value order, so we choose the name complex_abs_ops. First, we
need a set of operators. The procedure for defining operators was discussed in Section 38.13. For an
operator class on B-trees, the operators we require are: absolute-value less-than (strategy 1), absolute-value
less-than-or-equal (strategy 2), absolute-value equal (strategy 3), absolute-value greater-than-or-equal
(strategy 4), and absolute-value greater-than (strategy 5).
The least error-prone way to define a related set of comparison operators is to write the B-tree com-
parison support function first, and then write the other functions as one-line wrappers around the sup-
port function. This reduces the odds of getting inconsistent results for corner cases. Following this
approach, we first write:
#define Mag(c)  ((c)->x*(c)->x + (c)->y*(c)->y)

static int
complex_abs_cmp_internal(Complex *a, Complex *b)
{
    double      amag = Mag(a),
                bmag = Mag(b);

    if (amag < bmag)
        return -1;
    if (amag > bmag)
        return 1;
    return 0;
}

PG_FUNCTION_INFO_V1(complex_abs_lt);

Datum
complex_abs_lt(PG_FUNCTION_ARGS)
{
    Complex    *a = (Complex *) PG_GETARG_POINTER(0);
    Complex    *b = (Complex *) PG_GETARG_POINTER(1);

    PG_RETURN_BOOL(complex_abs_cmp_internal(a, b) < 0);
}
The other four functions differ only in how they compare the internal function's result to zero.
Next we declare the functions and the operators based on the functions to SQL:
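A sketch for the less-than case follows ('filename' is a placeholder for the shared library path); the other
four operators are declared the same way, with the appropriate strategy, commutator, negator, and
selectivity entries:

CREATE FUNCTION complex_abs_lt(complex, complex)
    RETURNS bool
    AS 'filename'
    LANGUAGE C IMMUTABLE STRICT;

CREATE OPERATOR < (
    leftarg = complex,
    rightarg = complex,
    function = complex_abs_lt,
    commutator = > ,
    negator = >= ,
    restrict = scalarltsel,
    join = scalarltjoinsel
);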
It is important to specify the correct commutator and negator operators, as well as suitable restriction
and join selectivity functions, otherwise the optimizer will be unable to make effective use of the index.
Other things worth noting are happening here:
• There can only be one operator named, say, = and taking type complex for both operands. In this
case we don't have any other operator = for complex, but if we were building a practical data
type we'd probably want = to be the ordinary equality operation for complex numbers (and not
the equality of the absolute values). In that case, we'd need to use some other operator name for
complex_abs_eq.
• Although PostgreSQL can cope with functions having the same SQL name as long as they have
different argument data types, C can only cope with one global function having a given name. So
we shouldn't name the C function something simple like abs_eq. Usually it's a good practice to
include the data type name in the C function name, so as not to conflict with functions for other
data types.
• We could have made the SQL name of the function abs_eq, relying on PostgreSQL to distinguish
it by argument data types from any other SQL function of the same name. To keep the example
simple, we make the function have the same names at the C level and SQL level.
The next step is the registration of the support routine required by B-trees. The example C code that
implements this is in the same file that contains the operator functions. This is how we declare the
function:
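(Again a sketch, with 'filename' standing for the shared library path:)

CREATE FUNCTION complex_abs_cmp(complex, complex)
    RETURNS integer
    AS 'filename'
    LANGUAGE C IMMUTABLE STRICT;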
Now that we have the required operators and support routine, we can finally create the operator class:
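(A sketch, tying the five operators and the support function to the B-tree strategy and support numbers:)

CREATE OPERATOR CLASS complex_abs_ops
    DEFAULT FOR TYPE complex USING btree AS
        OPERATOR        1       < ,
        OPERATOR        2       <= ,
        OPERATOR        3       = ,
        OPERATOR        4       >= ,
        OPERATOR        5       > ,
        FUNCTION        1       complex_abs_cmp(complex, complex);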
And we're done! It should now be possible to create and use B-tree indexes on complex columns.
The operator entries could have been written more verbosely, for example as OPERATOR 1 < (complex, complex), but there is no need to do so when the operators take the same data type we are defining the operator
class for.
The above example assumes that you want to make this new operator class the default B-tree operator
class for the complex data type. If you don't, just leave out the word DEFAULT.
38.15.5. Operator Classes and Operator Families
So far we have implicitly assumed that an operator class deals with only one data type. While there can be only one indexed data type, it is often useful to index operations that compare the indexed column to values of other data types.
To handle these needs, PostgreSQL uses the concept of an operator family. An operator family contains
one or more operator classes, and can also contain indexable operators and corresponding support
functions that belong to the family as a whole but not to any single class within the family. We say
that such operators and functions are “loose” within the family, as opposed to being bound into a
specific class. Typically each operator class contains single-data-type operators while cross-data-type
operators are loose in the family.
All the operators and functions in an operator family must have compatible semantics, where the
compatibility requirements are set by the index method. You might therefore wonder why bother to
single out particular subsets of the family as operator classes; and indeed for many purposes the class
divisions are irrelevant and the family is the only interesting grouping. The reason for defining operator
classes is that they specify how much of the family is needed to support any particular index. If there
is an index using an operator class, then that operator class cannot be dropped without dropping the
index — but other parts of the operator family, namely other operator classes and loose operators,
could be dropped. Thus, an operator class should be specified to contain the minimum set of operators
and functions that are reasonably needed to work with an index on a specific data type, and then related
but non-essential operators can be added as loose members of the operator family.
As an example, PostgreSQL has a built-in B-tree operator family integer_ops, which includes
operator classes int8_ops, int4_ops, and int2_ops for indexes on bigint (int8), inte-
ger (int4), and smallint (int2) columns respectively. The family also contains cross-data-type
comparison operators allowing any two of these types to be compared, so that an index on one of these
types can be searched using a comparison value of another type. The family could be duplicated by
these definitions:
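An abridged sketch of such definitions, showing only one of the three operator classes and a few of the loose cross-data-type members (the complete family covers int2, int4, and int8 with all five strategies for every type combination; btint4cmp and btint48cmp are built-in comparison functions; running this against a stock installation would of course conflict with the existing built-in objects, so it is purely illustrative):

CREATE OPERATOR FAMILY integer_ops USING btree;

CREATE OPERATOR CLASS int4_ops
DEFAULT FOR TYPE int4 USING btree FAMILY integer_ops AS
  -- single-data-type members live in the class
  OPERATOR 1 < ,
  OPERATOR 2 <= ,
  OPERATOR 3 = ,
  OPERATOR 4 >= ,
  OPERATOR 5 > ,
  FUNCTION 1 btint4cmp(int4, int4);

-- cross-data-type members are added loose in the family
ALTER OPERATOR FAMILY integer_ops USING btree ADD
  OPERATOR 1 < (int4, int8),
  OPERATOR 3 = (int4, int8),
  OPERATOR 5 > (int4, int8),
  FUNCTION 1 btint48cmp(int4, int8);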
Notice that this definition “overloads” the operator strategy and support function numbers: each num-
ber occurs multiple times within the family. This is allowed so long as each instance of a particular
number has distinct input data types. The instances that have both input types equal to an operator
class's input type are the primary operators and support functions for that operator class, and in most
cases should be declared as part of the operator class rather than as loose members of the family.
In a B-tree operator family, all the operators in the family must sort compatibly, as is specified in
detail in Section 63.2. For each operator in the family there must be a support function having the
same two input data types as the operator. It is recommended that a family be complete, i.e., for each
combination of data types, all operators are included. Each operator class should include just the non-
cross-type operators and support function for its data type.
To build a multiple-data-type hash operator family, compatible hash support functions must be created
for each data type supported by the family. Here compatibility means that the functions are guaranteed
to return the same hash code for any two values that are considered equal by the family's equality
operators, even when the values are of different types. This is usually difficult to accomplish when the
types have different physical representations, but it can be done in some cases. Furthermore, casting
a value from one data type represented in the operator family to another data type also represented
in the operator family via an implicit or binary coercion cast must not change the computed hash
value. Notice that there is only one support function per data type, not one per equality operator. It
is recommended that a family be complete, i.e., provide an equality operator for each combination
of data types. Each operator class should include just the non-cross-type equality operator and the
support function for its data type.
GiST, SP-GiST, and GIN indexes do not have any explicit notion of cross-data-type operations. The
set of operators supported is just whatever the primary support functions for a given operator class
can handle.
In BRIN, the requirements depend on the framework that provides the operator classes. For operator
classes based on minmax, the behavior required is the same as for B-tree operator families: all the
operators in the family must sort compatibly, and casts must not change the associated sort ordering.
Note
Prior to PostgreSQL 8.3, there was no concept of operator families, and so any cross-data-type
operators intended to be used with an index had to be bound directly into the index's operator
class. While this approach still works, it is deprecated because it makes an index's dependen-
cies too broad, and because the planner can handle cross-data-type comparisons more effec-
tively when both data types have operators in the same operator family.
38.15.6. System Dependencies on Operator Classes
PostgreSQL uses operator classes to infer the properties of operators in more ways than just whether they can be used with indexes. Therefore, you might want to create operator classes even if you have no intention of indexing any columns of your data type.
In particular, there are SQL features such as ORDER BY and DISTINCT that require comparison and
sorting of values. To implement these features on a user-defined data type, PostgreSQL looks for the
default B-tree operator class for the data type. The “equals” member of this operator class defines the
system's notion of equality of values for GROUP BY and DISTINCT, and the sort ordering imposed
by the operator class defines the default ORDER BY ordering.
If there is no default B-tree operator class for a data type, the system will look for a default hash
operator class. But since that kind of operator class only provides equality, it is only able to support
grouping not sorting.
When there is no default operator class for a data type, you will get errors like “could not identify an
ordering operator” if you try to use these SQL features with the data type.
Note
In PostgreSQL versions before 7.4, sorting and grouping operations would implicitly use op-
erators named =, <, and >. The new behavior of relying on default operator classes avoids
having to make any assumption about the behavior of operators with particular names.
Sorting by a non-default B-tree operator class is possible by specifying the class's less-than operator
in a USING option, for example
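(a sketch using the less-than operator of the built-in text_pattern_ops class; mytable and somecol are placeholder names for a table with a text column)

SELECT * FROM mytable ORDER BY somecol USING ~<~;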
Alternatively, specifying the class's greater-than operator in USING selects a descending-order sort.
Comparison of arrays of a user-defined type also relies on the semantics defined by the type's default
B-tree operator class. If there is no default B-tree operator class, but there is a default hash operator
class, then array equality is supported, but not ordering comparisons.
Another SQL feature that requires even more data-type-specific knowledge is the RANGE offset
PRECEDING/FOLLOWING framing option for window functions (see Section 4.2.8). For a query such
as
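(a sketch; mytable and its numeric column x are placeholder names)

SELECT sum(x) OVER (ORDER BY x RANGE BETWEEN 5 PRECEDING AND 10 FOLLOWING)
    FROM mytable;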
it is not sufficient to know how to order by x; the database must also understand how to “subtract
5” or “add 10” to the current row's value of x to identify the bounds of the current window frame.
Comparing the resulting bounds to other rows' values of x is possible using the comparison operators
provided by the B-tree operator class that defines the ORDER BY ordering — but addition and sub-
traction operators are not part of the operator class, so which ones should be used? Hard-wiring that
choice would be undesirable, because different sort orders (different B-tree operator classes) might
need different behavior. Therefore, a B-tree operator class can specify an in_range support function
that encapsulates the addition and subtraction behaviors that make sense for its sort order. It can even
provide more than one in_range support function, in case there is more than one data type that makes
sense to use as the offset in RANGE clauses. If the B-tree operator class associated with the window's
ORDER BY clause does not have a matching in_range support function, the RANGE offset PRE-
CEDING/FOLLOWING option is not supported.
Another important point is that an equality operator that appears in a hash operator family is a candidate
for hash joins, hash aggregation, and related optimizations. The hash operator family is essential here
since it identifies the hash function(s) to use.
38.15.7. Ordering Operators
Some index access methods (currently, only GiST) support ordering operators in addition to the search operators discussed so far. An ordering operator does not restrict the set of rows an index scan returns, but instead determines their sort order; for example, the query SELECT * FROM places ORDER BY location <-> point '(101,456)' LIMIT 10;
finds the ten places closest to a given target point. A GiST index on the location column can do this
efficiently because <-> is an ordering operator.
While search operators have to return Boolean results, ordering operators usually return some other
type, such as float or numeric for distances. This type is normally not the same as the data type being
indexed. To avoid hard-wiring assumptions about the behavior of different data types, the definition of
an ordering operator is required to name a B-tree operator family that specifies the sort ordering of the
result data type. As was stated in the previous section, B-tree operator families define PostgreSQL's
notion of ordering, so this is a natural representation. Since the point <-> operator returns float8,
it could be specified in an operator class creation command like this:
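(a sketch of the relevant member entry, as it could appear in the CREATE OPERATOR CLASS command for a GiST operator class on point)

OPERATOR 15    <-> (point, point) FOR ORDER BY float_ops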
where float_ops is the built-in operator family that includes operations on float8. This decla-
ration states that the index is able to return rows in order of increasing values of the <-> operator.
38.15.8. Special Features of Operator Classes
There are two special features of operator classes that we have not discussed yet, mainly because they are not useful with the most commonly used index methods.
Normally, declaring an operator as a member of an operator class (or family) means that the index
method can retrieve exactly the set of rows that satisfy a WHERE condition using the operator. For
example:
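(a hypothetical query; mytable and integer_column are placeholder names)

SELECT * FROM mytable WHERE integer_column < 4;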
can be satisfied exactly by a B-tree index on the integer column. But there are cases where an index
is useful as an inexact guide to the matching rows. For example, if a GiST index stores only bound-
ing boxes for geometric objects, then it cannot exactly satisfy a WHERE condition that tests overlap
between nonrectangular objects such as polygons. Yet we could use the index to find objects whose
bounding box overlaps the bounding box of the target object, and then do the exact overlap test only on
the objects found by the index. If this scenario applies, the index is said to be “lossy” for the operator.
Lossy index searches are implemented by having the index method return a recheck flag when a row
might or might not really satisfy the query condition. The core system will then test the original query
condition on the retrieved row to see whether it should be returned as a valid match. This approach
works if the index is guaranteed to return all the required rows, plus perhaps some additional rows,
which can be eliminated by performing the original operator invocation. The index methods that sup-
port lossy searches (currently, GiST, SP-GiST and GIN) allow the support functions of individual
operator classes to set the recheck flag, and so this is essentially an operator-class feature.
Consider again the situation where we are storing in the index only the bounding box of a complex
object such as a polygon. In this case there's not much value in storing the whole polygon in the index
entry — we might as well store just a simpler object of type box. This situation is expressed by the
STORAGE option in CREATE OPERATOR CLASS: we'd write something like:
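A heavily abridged sketch; a real definition must also list the class's operators and GiST support functions where the comment stands:

CREATE OPERATOR CLASS polygon_ops
    DEFAULT FOR TYPE polygon USING gist AS
        -- operator and support function entries go here
        STORAGE box;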
At present, only the GiST, GIN and BRIN index methods support a STORAGE type that's different
from the column data type. The GiST compress and decompress support routines must deal with
data-type conversion when STORAGE is used. In GIN, the STORAGE type identifies the type of the
“key” values, which normally is different from the type of the indexed column — for example, an
operator class for integer-array columns might have keys that are just integers. The GIN extract-
Value and extractQuery support routines are responsible for extracting keys from indexed val-
ues. BRIN is similar to GIN: the STORAGE type identifies the type of the stored summary values, and
operator classes' support procedures are responsible for interpreting the summary values correctly.
38.16. Packaging Related Objects into an Extension
A useful extension to PostgreSQL typically includes multiple SQL objects; for example, a new data type will require new functions, new operators, and probably new index operator classes. It is helpful to collect all these objects into a single package, which PostgreSQL calls an extension. To build an extension you need at least a script file containing the SQL commands to create the extension's objects, and a control file specifying a few basic properties of the extension itself.
The main advantage of using an extension, rather than just running the SQL script to load a bunch
of “loose” objects into your database, is that PostgreSQL will then understand that the objects of the
extension go together. You can drop all the objects with a single DROP EXTENSION command (no
need to maintain a separate “uninstall” script). Even more useful, pg_dump knows that it should not
dump the individual member objects of the extension — it will just include a CREATE EXTENSION
command in dumps, instead. This vastly simplifies migration to a new version of the extension that
might contain more or different objects than the old version. Note however that you must have the
extension's control, script, and other files available when loading such a dump into a new database.
PostgreSQL will not let you drop an individual object contained in an extension, except by dropping
the whole extension. Also, while you can change the definition of an extension member object (for
example, via CREATE OR REPLACE FUNCTION for a function), bear in mind that the modified
definition will not be dumped by pg_dump. Such a change is usually only sensible if you concurrently
make the same change in the extension's script file. (But there are special provisions for tables con-
taining configuration data; see Section 38.16.3.) In production situations, it's generally better to create
an extension update script to perform changes to extension member objects.
The extension script may set privileges on objects that are part of the extension, using GRANT and
REVOKE statements. The final set of privileges for each object (if any are set) will be stored in the
pg_init_privs system catalog. When pg_dump is used, the CREATE EXTENSION command
will be included in the dump, followed by the set of GRANT and REVOKE statements necessary to set
the privileges on the objects to what they were at the time the dump was taken.
PostgreSQL does not currently support extension scripts issuing CREATE POLICY or SECURITY
LABEL statements. These are expected to be set after the extension has been created. All RLS policies
and security labels on extension objects will be included in dumps created by pg_dump.
The extension mechanism also has provisions for packaging modification scripts that adjust the de-
finitions of the SQL objects contained in an extension. For example, if version 1.1 of an extension
adds one function and changes the body of another function compared to 1.0, the extension author
can provide an update script that makes just those two changes. The ALTER EXTENSION UPDATE
command can then be used to apply these changes and track which version of the extension is actually
installed in a given database.
The kinds of SQL objects that can be members of an extension are shown in the description of AL-
TER EXTENSION. Notably, objects that are database-cluster-wide, such as databases, roles, and ta-
blespaces, cannot be extension members since an extension is only known within one database. (Al-
though an extension script is not prohibited from creating such objects, if it does so they will not be
tracked as part of the extension.) Also notice that while a table can be a member of an extension, its
subsidiary objects such as indexes are not directly considered members of the extension. Another im-
portant point is that schemas can belong to extensions, but not vice versa: an extension as such has an
unqualified name and does not exist “within” any schema. The extension's member objects, however,
will belong to schemas whenever appropriate for their object types. It may or may not be appropriate
for an extension to own the schema(s) its member objects are within.
If an extension's script creates any temporary objects (such as temp tables), those objects are treated as
extension members for the remainder of the current session, but are automatically dropped at session
end, as any temporary object would be. This is an exception to the rule that extension member objects
cannot be dropped without dropping the whole extension.
38.16.1. Extension Files
The CREATE EXTENSION command relies on a control file for each extension, which must be named the same as the extension with a suffix of .control and placed in the installation's SHAREDIR/extension directory. There must also be at least one SQL script file, following the naming pattern extension--version.sql (for example, foo--1.0.sql for version 1.0 of extension foo). By default, the script file(s) are also placed in the SHAREDIR/extension directory, but the control file can specify a different directory for them.
The file format for an extension control file is the same as for the postgresql.conf file, namely a
list of parameter_name = value assignments, one per line. Blank lines and comments introduced
by # are allowed. Be sure to quote any value that is not a single word or number.
directory (string)
The directory containing the extension's SQL script file(s). Unless an absolute path is given, the
name is relative to the installation's SHAREDIR directory. The default behavior is equivalent to
specifying directory = 'extension'.
default_version (string)
The default version of the extension (the one that will be installed if no version is specified in
CREATE EXTENSION). Although this can be omitted, that will result in CREATE EXTENSION
failing if no VERSION option appears, so you generally don't want to do that.
comment (string)
A comment (any string) about the extension. The comment is applied when initially creating an
extension, but not during extension updates (since that might override user-added comments).
Alternatively, the extension's comment can be set by writing a COMMENT command in the script
file.
encoding (string)
The character set encoding used by the script file(s). This should be specified if the script files
contain any non-ASCII characters. Otherwise the files will be assumed to be in the database
encoding.
module_pathname (string)
The value of this parameter will be substituted for each occurrence of MODULE_PATHNAME
in the script file(s). If it is not set, no substitution is made. Typically, this is set to $lib-
dir/shared_library_name and then MODULE_PATHNAME is used in CREATE FUNC-
TION commands for C-language functions, so that the script files do not need to hard-wire the
name of the shared library.
requires (string)
A list of names of extensions that this extension depends on, for example requires = 'foo,
bar'. Those extensions must be installed before this one can be installed.
superuser (boolean)
If this parameter is true (which is the default), only superusers can create the extension or update
it to a new version. If it is set to false, just the privileges required to execute the commands in
the installation or update script are required.
relocatable (boolean)
An extension is relocatable if it is possible to move its contained objects into a different schema
after initial creation of the extension. The default is false, i.e., the extension is not relocatable.
See Section 38.16.2 for more information.
schema (string)
This parameter can only be set for non-relocatable extensions. It forces the extension to be loaded
into exactly the named schema and not any other. The schema parameter is consulted only when
initially creating an extension, not during extension updates. See Section 38.16.2 for more infor-
mation.
In addition to the primary control file extension.control, an extension can have secondary
control files named in the style extension--version.control. If supplied, these must be
located in the script file directory. Secondary control files follow the same format as the primary
control file. Any parameters set in a secondary control file override the primary control file when
installing or updating to that version of the extension. However, the parameters directory and
default_version cannot be set in a secondary control file.
An extension's SQL script files can contain any SQL commands, except for transaction control com-
mands (BEGIN, COMMIT, etc) and commands that cannot be executed inside a transaction block (such
as VACUUM). This is because the script files are implicitly executed within a transaction block.
An extension's SQL script files can also contain lines beginning with \echo, which will be ignored
(treated as comments) by the extension mechanism. This provision is commonly used to throw an error
if the script file is fed to psql rather than being loaded via CREATE EXTENSION (see example script
in Section 38.16.7). Without that, users might accidentally load the extension's contents as “loose”
objects rather than as an extension, a state of affairs that's a bit tedious to recover from.
While the script files can contain any characters allowed by the specified encoding, control files should
contain only plain ASCII, because there is no way for PostgreSQL to know what encoding a control
file is in. In practice this is only an issue if you want to use non-ASCII characters in the extension's
comment. Recommended practice in that case is to not use the control file comment parameter, but
instead use COMMENT ON EXTENSION within a script file to set the comment.
38.16.2. Extension Relocatability
Users often want to load the objects contained in an extension into a different schema than the extension's author had in mind. There are three supported levels of relocatability:
• A fully relocatable extension can be moved into another schema at any time, even after it's been
loaded into a database. This is done with the ALTER EXTENSION SET SCHEMA command,
which automatically renames all the member objects into the new schema. Normally, this is only
possible if the extension contains no internal assumptions about what schema any of its objects are
in. Also, the extension's objects must all be in one schema to begin with (ignoring objects that do
not belong to any schema, such as procedural languages). Mark a fully relocatable extension by
setting relocatable = true in its control file.
• An extension might be relocatable during installation but not afterwards. This is typically the case
if the extension's script file needs to reference the target schema explicitly, for example in set-
ting search_path properties for SQL functions. For such an extension, set relocatable =
false in its control file, and use @extschema@ to refer to the target schema in the script file. All
occurrences of this string will be replaced by the actual target schema's name before the script is
executed. The user can set the target schema using the SCHEMA option of CREATE EXTENSION.
• If the extension does not support relocation at all, set relocatable = false in its control
file, and also set schema to the name of the intended target schema. This will prevent use of
the SCHEMA option of CREATE EXTENSION, unless it specifies the same schema named in the
control file. This choice is typically necessary if the extension contains internal assumptions about
schema names that can't be replaced by uses of @extschema@. The @extschema@ substitution
mechanism is available in this case too, although it is of limited use since the schema name is
determined by the control file.
In all cases, the script file will be executed with search_path initially set to point to the target schema;
that is, CREATE EXTENSION does the equivalent of this:
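(in essence; @extschema@ stands for the target schema's name)

SET LOCAL search_path TO @extschema@;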
This allows the objects created by the script file to go into the target schema. The script file can
change search_path if it wishes, but that is generally undesirable. search_path is restored to
its previous setting upon completion of CREATE EXTENSION.
The target schema is determined by the schema parameter in the control file if that is given, other-
wise by the SCHEMA option of CREATE EXTENSION if that is given, otherwise the current default
object creation schema (the first one in the caller's search_path). When the control file schema
parameter is used, the target schema will be created if it doesn't already exist, but in the other two
cases it must already exist.
If any prerequisite extensions are listed in requires in the control file, their target schemas are added
to the initial setting of search_path, following the new extension's target schema. This allows their
objects to be visible to the new extension's script file.
For security, pg_temp is automatically appended to the end of search_path in all cases.
Although a non-relocatable extension can contain objects spread across multiple schemas, it is usually
desirable to place all the objects meant for external use into a single schema, which is considered
the extension's target schema. Such an arrangement works conveniently with the default setting of
search_path during creation of dependent extensions.
38.16.3. Extension Configuration Tables
Some extensions include configuration tables, which contain data that might be added or changed by the user after installation of the extension. Ordinarily, if a table is part of an extension, neither its definition nor its content will be dumped by pg_dump. But that behavior is undesirable for a configuration table: any data changes made by the user need to be included in dumps, or the extension will behave differently after a dump and reload.
To solve this problem, an extension's script file can mark a table or a sequence it has created as a configuration relation, which will cause pg_dump to include the table's or the sequence's contents (not its definition) in dumps. To do that, call the function pg_extension_config_dump(regclass, text) after creating the table or the sequence, for example
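A sketch with hypothetical object names (my_config and my_config_seq are placeholders):

CREATE TABLE my_config (key text, value text);
CREATE SEQUENCE my_config_seq;

SELECT pg_catalog.pg_extension_config_dump('my_config', '');
SELECT pg_catalog.pg_extension_config_dump('my_config_seq', '');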
Any number of tables or sequences can be marked this way. Sequences associated with serial or
bigserial columns can be marked as well.
When the second argument of pg_extension_config_dump is an empty string, the entire con-
tents of the table are dumped by pg_dump. This is usually only correct if the table is initially empty as
created by the extension script. If there is a mixture of initial data and user-provided data in the table,
the second argument of pg_extension_config_dump provides a WHERE condition that selects
the data to be dumped. For example, you might do
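(again with hypothetical names; the filter column standard_entry is the one referred to below)

CREATE TABLE my_config (key text, value text, standard_entry boolean);

SELECT pg_catalog.pg_extension_config_dump('my_config',
                                            'WHERE NOT standard_entry');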
and then make sure that standard_entry is true only in the rows created by the extension's script.
More complicated situations, such as initially-provided rows that might be modified by users, can
be handled by creating triggers on the configuration table to ensure that modified rows are marked
correctly.
You can alter the filter condition associated with a configuration table by calling pg_exten-
sion_config_dump again. (This would typically be useful in an extension update script.) The on-
ly way to mark a table as no longer a configuration table is to dissociate it from the extension with
ALTER EXTENSION ... DROP TABLE.
Note that foreign key relationships between these tables will dictate the order in which the tables
are dumped out by pg_dump. Specifically, pg_dump will attempt to dump the referenced-by table
before the referencing table. As the foreign key relationships are set up at CREATE EXTENSION
time (prior to data being loaded into the tables) circular dependencies are not supported. When circular
dependencies exist, the data will still be dumped out but the dump will not be able to be restored
directly and user intervention will be required.
Sequences associated with serial or bigserial columns need to be directly marked to dump
their state. Marking their parent relation is not enough for this purpose.
38.16.4. Extension Updates
One advantage of the extension mechanism is that it provides convenient ways to manage updates to the SQL commands that define an extension's objects. To distribute an update, the author provides update scripts named following the pattern extension--old_version--target_version.sql (for example, foo--1.0--1.1.sql contains the commands to modify version 1.0 of extension foo into version 1.1).
Given that a suitable update script is available, the command ALTER EXTENSION UPDATE will
update an installed extension to the specified new version. The update script is run in the same envi-
ronment that CREATE EXTENSION provides for installation scripts: in particular, search_path
is set up in the same way, and any new objects created by the script are automatically added to the
extension. Also, if the script chooses to drop extension member objects, they are automatically disso-
ciated from the extension.
If an extension has secondary control files, the control parameters that are used for an update script
are those associated with the script's target (new) version.
The update mechanism can be used to solve an important special case: converting a “loose” collection
of objects into an extension. Before the extension mechanism was added to PostgreSQL (in 9.1), many
people wrote extension modules that simply created assorted unpackaged objects. Given an existing
database containing such objects, how can we convert the objects into a properly packaged extension?
Dropping them and then doing a plain CREATE EXTENSION is one way, but it's not desirable if the
objects have dependencies (for example, if there are table columns of a data type created by the ex-
tension). The way to fix this situation is to create an empty extension, then use ALTER EXTENSION
ADD to attach each pre-existing object to the extension, then finally create any new objects that are in
the current extension version but were not in the unpackaged release. CREATE EXTENSION supports
this case with its FROM old_version option, which causes it to not run the normal installation script
for the target version, but instead the update script named extension--old_version--tar-
get_version.sql. The choice of the dummy version name to use as old_version is up to the
extension author, though unpackaged is a common convention. If you have multiple prior versions
you need to be able to update into extension style, use multiple dummy version names to identify them.
ALTER EXTENSION is able to execute sequences of update script files to achieve a requested update.
For example, if only foo--1.0--1.1.sql and foo--1.1--2.0.sql are available, ALTER
EXTENSION will apply them in sequence if an update to version 2.0 is requested when 1.0 is
currently installed.
PostgreSQL doesn't assume anything about the properties of version names: for example, it does not
know whether 1.1 follows 1.0. It just matches up the available version names and follows the path
that requires applying the fewest update scripts. (A version name can actually be any string that doesn't
contain -- or leading or trailing -.)
If you want to see which update paths are available for an extension, use the function pg_extension_update_paths(), passing the extension's name.
This shows each pair of distinct known version names for the specified extension, together with the
update path sequence that would be taken to get from the source version to the target version, or NULL
if there is no available update path. The path is shown in textual form with -- separators. You can
use regexp_split_to_array(path,'--') if you prefer an array format.
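For example, to ask how a hypothetical extension foo could be brought from version 1.0 to 2.0:

SELECT path FROM pg_extension_update_paths('foo')
WHERE source = '1.0' AND target = '2.0';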
38.16.5. Installing Extensions Using Update Scripts
An extension that has been around for a while will probably exist in several versions, for which the author will need update scripts anyway. Conveniently, CREATE EXTENSION can also install a version for which there is no stand-alone installation script, by running the installation script for some other version and then applying the chain of update scripts leading to the requested version. (As with ALTER EXTENSION UPDATE, if multiple pathways are available then the shortest is preferred.) Arranging an extension's script files in this style
can reduce the amount of maintenance effort needed to produce small updates.
If you use secondary (version-specific) control files with an extension maintained in this style, keep
in mind that each version needs a control file even if it has no stand-alone installation script, as
that control file will determine how the implicit update to that version is performed. For example, if
foo--1.0.control specifies requires = 'bar' but foo's other control files do not, the
extension's dependency on bar will be dropped when updating from 1.0 to another version.
38.16.6. Security Considerations for Extensions
Widely-distributed extensions should assume little about the database they occupy, so it is appropriate to write the functions they provide in a secure style that cannot be compromised by search-path-based attacks.
An extension that has the superuser property set to true must also consider security hazards for the
actions taken within its installation and update scripts. It is not terribly difficult for a malicious user
to create trojan-horse objects that will compromise later execution of a carelessly-written extension
script, allowing that user to acquire superuser privileges.
Advice about writing functions securely is provided in Section 38.16.6.1 below, and advice about
writing installation scripts securely is provided in Section 38.16.6.2.
38.16.6.1. Security Considerations for Extension Functions
The CREATE FUNCTION reference page contains advice about writing SECURITY DEFINER
functions safely. It's good practice to apply those techniques for any function provided by an extension,
since the function might be called by a high-privilege user.
If you cannot set the search_path to contain only secure schemas, assume that each unqualified
name could resolve to an object that a malicious user has defined. Beware of constructs that depend
on search_path implicitly; for example, IN and CASE expression WHEN always select an
operator using the search path. In their place, use OPERATOR(schema.=) ANY and CASE WHEN
expression.
A general-purpose extension usually should not assume that it's been installed into a secure schema,
which means that even schema-qualified references to its own objects are not entirely risk-free. For
example, if the extension has defined a function myschema.myfunc(bigint) then a call such as
myschema.myfunc(42) could be captured by a hostile function myschema.myfunc(inte-
ger). Be careful that the data types of function and operator parameters exactly match the declared
argument types, using explicit casts where necessary.
38.16.6.2. Security Considerations for Extension Scripts
DDL commands such as CREATE FUNCTION and CREATE OPERATOR CLASS are generally secure, but beware of any command having a general-purpose expression as a component. For example,
CREATE VIEW needs to be vetted, as does a DEFAULT expression in CREATE FUNCTION.
Sometimes an extension script might need to execute general-purpose SQL, for example to make
catalog adjustments that aren't possible via DDL. Be careful to execute such commands with a secure
search_path; do not trust the path provided by CREATE/ALTER EXTENSION to be secure. Best
practice is to temporarily set search_path to 'pg_catalog, pg_temp' and insert references
to the extension's installation schema explicitly where needed. (This practice might also be helpful
for creating views.) Examples can be found in the contrib modules in the PostgreSQL source code
distribution.
Cross-extension references are extremely difficult to make fully secure, partially because of uncer-
tainty about which schema the other extension is in. The hazards are reduced if both extensions are
installed in the same schema, because then a hostile object cannot be placed ahead of the referenced
extension in the installation-time search_path. However, no mechanism currently exists to require
that.
38.16.7. Extension Example
Here is a complete example of an SQL-only extension: a two-element composite type called pair that can store any two text values. The control file pair.control looks like this:
# pair extension
comment = 'A key/value pair data type'
default_version = '1.0'
# cannot be relocatable because of use of @extschema@
relocatable = false
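The accompanying script file, pair--1.0.sql, creates the type and a constructor; a sketch, simplified from the actual contrib-style example:

-- complain if the script is sourced in psql, rather than via CREATE EXTENSION
\echo Use "CREATE EXTENSION pair" to load this file. \quit

CREATE TYPE pair AS ( k text, v text );

CREATE FUNCTION pair(text, text)
RETURNS pair LANGUAGE SQL
AS 'SELECT ROW($1, $2)::@extschema@.pair;';

CREATE OPERATOR ~> (LEFTARG = text, RIGHTARG = text, PROCEDURE = pair);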
While you hardly need a makefile to install these two files into the correct directory, you could use
a Makefile containing this:
EXTENSION = pair
DATA = pair--1.0.sql
PG_CONFIG = pg_config
PGXS := $(shell $(PG_CONFIG) --pgxs)
include $(PGXS)
This makefile relies on PGXS, which is described in Section 38.17. The command make install
will install the control and script files into the correct directory as reported by pg_config.
Once the files are installed, use the CREATE EXTENSION command to load the objects into any
particular database.
38.17. Extension Building Infrastructure
If you are thinking about distributing your PostgreSQL extension modules, setting up a portable build system for them can be fairly difficult. Therefore the PostgreSQL installation provides a build infrastructure for extensions, called PGXS, so that simple extension modules can be built against an already-installed server.
To use the PGXS infrastructure for your extension, you must write a simple makefile. In the makefile,
you need to set some variables and include the global PGXS makefile. Here is an example that builds
an extension module named isbn_issn, consisting of a shared library containing some C code, an
extension control file, a SQL script, an include file (only needed if other modules might need to access
the extension functions without going via SQL), and a documentation text file:
MODULES = isbn_issn
EXTENSION = isbn_issn
DATA = isbn_issn--1.0.sql
DOCS = README.isbn_issn
HEADERS_isbn_issn = isbn_issn.h
PG_CONFIG = pg_config
PGXS := $(shell $(PG_CONFIG) --pgxs)
include $(PGXS)
The last three lines should always be the same. Earlier in the file, you assign variables or add custom
make rules.
MODULES
list of shared-library objects to be built from source files with same stem (do not include library
suffixes in this list)
MODULE_big
a shared library to build from multiple source files (list object files in OBJS)
PROGRAM
an executable program to build (list object files in OBJS)
EXTENSION
extension name(s); for each name you must provide an extension.control file, which will
be installed into prefix/share/extension
MODULEDIR
subdirectory of prefix/share into which DATA and DOCS files should be installed (if not
set, default is extension if EXTENSION is set, or contrib if not)
DATA
random files to install into prefix/share/$MODULEDIR
DATA_built
random files to install into prefix/share/$MODULEDIR, which need to be built first
DATA_TSEARCH
random files to install under prefix/share/tsearch_data
DOCS
random files to install under prefix/doc/$MODULEDIR
HEADERS
HEADERS_built
files to install (after building, in the case of HEADERS_built) into a module-specific subdirectory of prefix/include/server
Unlike DATA_built, files in HEADERS_built are not removed by the clean target; if you
want them removed, also add them to EXTRA_CLEAN or add your own rules to do it.
HEADERS_$MODULE
HEADERS_built_$MODULE
files to install (after building, if specified) into a module-specific subdirectory of prefix/include/server; here $MODULE must be a module name used in MODULES or MODULE_big.
It is legal to use both variables for the same module, or any combination, unless you have
two module names in the MODULES list that differ only by the presence of a prefix built_,
which would cause ambiguity. In that (hopefully unlikely) case, you should use only the HEAD-
ERS_built_$MODULE variables.
SCRIPTS
script files (not binaries) to install into prefix/bin
SCRIPTS_built
script files (not binaries) to install into prefix/bin, which need to be built first
REGRESS
list of regression test cases (without suffix), see below
REGRESS_OPTS
additional options to pass to pg_regress
NO_INSTALLCHECK
don't define an installcheck target, useful e.g., if tests require special configuration, or don't
use pg_regress
EXTRA_CLEAN
extra files to remove in make clean
PG_CPPFLAGS
will be prepended to CPPFLAGS
PG_CFLAGS
will be appended to CFLAGS
PG_CXXFLAGS
will be appended to CXXFLAGS
PG_LDFLAGS
will be prepended to LDFLAGS
PG_LIBS
will be added to the PROGRAM link line
SHLIB_LINK
will be added to the MODULE_big link line
PG_CONFIG
path to pg_config program for the PostgreSQL installation to build against (typically just
pg_config to use the first one in your PATH)
Put this makefile as Makefile in the directory which holds your extension. Then you can do make
to compile, and then make install to install your module. By default, the extension is compiled
and installed for the PostgreSQL installation that corresponds to the first pg_config program found
in your PATH. You can use a different installation by setting PG_CONFIG to point to its pg_config
program, either within the makefile or on the make command line.
You can also run make in a directory outside the source tree of your extension, if you want to keep
the build directory separate. This procedure is also called a VPATH build. Here's how:
mkdir build_dir
cd build_dir
make -f /path/to/extension/source/tree/Makefile
make -f /path/to/extension/source/tree/Makefile install
Alternatively, you can set up a directory for a VPATH build in a similar way to how it is done for the
core code. One way to do this is using the core script config/prep_buildtree. Once this has
been done you can build by setting the make variable VPATH like this:
make VPATH=/path/to/extension/source/tree
make VPATH=/path/to/extension/source/tree install
The scripts listed in the REGRESS variable are used for regression testing of your module, which can
be invoked by make installcheck after doing make install. For this to work you must
have a running PostgreSQL server. The script files listed in REGRESS must appear in a subdirectory
named sql/ in your extension's directory. These files must have extension .sql, which must not
be included in the REGRESS list in the makefile. For each test there should also be a file containing
the expected output in a subdirectory named expected/, with the same stem and extension .out.
make installcheck executes each test script with psql, and compares the resulting output to the
matching expected file. Any differences will be written to the file regression.diffs in diff
-c format. Note that trying to run a test that is missing its expected file will be reported as “trouble”,
so make sure you have all expected files.
Tip
The easiest way to create the expected files is to create empty files, then do a test run (which
will of course report differences). Inspect the actual result files found in the results/ di-
rectory, then copy them to expected/ if they match what you expect from the test.
Chapter 39. Triggers
This chapter provides general information about writing trigger functions. Trigger functions can be
written in most of the available procedural languages, including PL/pgSQL (Chapter 43), PL/Tcl
(Chapter 44), PL/Perl (Chapter 45), and PL/Python (Chapter 46). After reading this chapter, you should
consult the chapter for your favorite procedural language to find out the language-specific details of
writing a trigger in it.
It is also possible to write a trigger function in C, although most people find it easier to use one of the
procedural languages. It is not currently possible to write a trigger function in the plain SQL function
language.
On tables and foreign tables, triggers can be defined to execute either before or after any INSERT,
UPDATE, or DELETE operation, either once per modified row, or once per SQL statement. UPDATE
triggers can moreover be set to fire only if certain columns are mentioned in the SET clause of the
UPDATE statement. Triggers can also fire for TRUNCATE statements. If a trigger event occurs, the
trigger's function is called at the appropriate time to handle the event.
On views, triggers can be defined to execute instead of INSERT, UPDATE, or DELETE operations.
Such INSTEAD OF triggers are fired once for each row that needs to be modified in the view. It is the
responsibility of the trigger's function to perform the necessary modifications to the view's underlying
base table(s) and, where appropriate, return the modified row as it will appear in the view. Triggers
on views can also be defined to execute once per SQL statement, before or after INSERT, UPDATE,
or DELETE operations. However, such triggers are fired only if there is also an INSTEAD OF trigger
on the view. Otherwise, any statement targeting the view must be rewritten into a statement affecting
its underlying base table(s), and then the triggers that will be fired are the ones attached to the base
table(s).
The trigger function must be defined before the trigger itself can be created. The trigger function must
be declared as a function taking no arguments and returning type trigger. (The trigger function
receives its input through a specially-passed TriggerData structure, not in the form of ordinary
function arguments.)
Once a suitable trigger function has been created, the trigger is established with CREATE TRIGGER.
The same trigger function can be used for multiple triggers.
PostgreSQL offers both per-row triggers and per-statement triggers. With a per-row trigger, the trigger
function is invoked once for each row that is affected by the statement that fired the trigger. In contrast,
a per-statement trigger is invoked only once when an appropriate statement is executed, regardless of
the number of rows affected by that statement. In particular, a statement that affects zero rows will
still result in the execution of any applicable per-statement triggers. These two types of triggers are
sometimes called row-level triggers and statement-level triggers, respectively. Triggers on TRUNCATE
may only be defined at statement level, not per-row.
Triggers are also classified according to whether they fire before, after, or instead of the operation.
These are referred to as BEFORE triggers, AFTER triggers, and INSTEAD OF triggers respectively.
Statement-level BEFORE triggers naturally fire before the statement starts to do anything, while state-
ment-level AFTER triggers fire at the very end of the statement. These types of triggers may be defined
on tables, views, or foreign tables. Row-level BEFORE triggers fire immediately before a particular
row is operated on, while row-level AFTER triggers fire at the end of the statement (but before any
statement-level AFTER triggers). These types of triggers may only be defined on tables and foreign
tables, not views; BEFORE row-level triggers may not be defined on partitioned tables. INSTEAD OF
triggers may only be defined on views, and only at row level; they fire immediately as each row in
the view is identified as needing to be operated on.
A statement that targets a parent table in an inheritance or partitioning hierarchy does not cause the
statement-level triggers of affected child tables to be fired; only the parent table's statement-level
triggers are fired. However, row-level triggers of any affected child tables will be fired.
If an INSERT contains an ON CONFLICT DO UPDATE clause, it is possible that the effects of row-
level BEFORE INSERT triggers and row-level BEFORE UPDATE triggers can both be applied in a way
that is apparent from the final state of the updated row, if an EXCLUDED column is referenced. There
need not be an EXCLUDED column reference for both sets of row-level BEFORE triggers to execute,
though. The possibility of surprising outcomes should be considered when there are both BEFORE
INSERT and BEFORE UPDATE row-level triggers that change a row being inserted/updated (this can
be problematic even if the modifications are more or less equivalent, if they're not also idempotent).
Note that statement-level UPDATE triggers are executed when ON CONFLICT DO UPDATE is spec-
ified, regardless of whether or not any rows were affected by the UPDATE (and regardless of whether
the alternative UPDATE path was ever taken). An INSERT with an ON CONFLICT DO UPDATE
clause will execute statement-level BEFORE INSERT triggers first, then statement-level BEFORE
UPDATE triggers, followed by statement-level AFTER UPDATE triggers and finally statement-level
AFTER INSERT triggers.
If an UPDATE on a partitioned table causes a row to move to another partition, it will be performed
as a DELETE from the original partition followed by an INSERT into the new partition. In this case,
all row-level BEFORE UPDATE triggers and all row-level BEFORE DELETE triggers are fired on the
original partition. Then all row-level BEFORE INSERT triggers are fired on the destination partition.
The possibility of surprising outcomes should be considered when all these triggers affect the row
being moved. As far as AFTER ROW triggers are concerned, AFTER DELETE and AFTER INSERT
triggers are applied; but AFTER UPDATE triggers are not applied because the UPDATE has been
converted to a DELETE and an INSERT. As far as statement-level triggers are concerned, none of
the DELETE or INSERT triggers are fired, even if row movement occurs; only the UPDATE triggers
defined on the target table used in the UPDATE statement will be fired.
Trigger functions invoked by per-statement triggers should always return NULL. Trigger functions
invoked by per-row triggers can return a table row (a value of type HeapTuple) to the calling ex-
ecutor, if they choose. A row-level trigger fired before an operation has the following choices:
• It can return NULL to skip the operation for the current row. This instructs the executor to not
perform the row-level operation that invoked the trigger (the insertion, modification, or deletion of
a particular table row).
• For row-level INSERT and UPDATE triggers only, the returned row becomes the row that will be
inserted or will replace the row being updated. This allows the trigger function to modify the row
being inserted or updated.
A row-level BEFORE trigger that does not intend to cause either of these behaviors must be careful to
return as its result the same row that was passed in (that is, the NEW row for INSERT and UPDATE
triggers, the OLD row for DELETE triggers).
A row-level INSTEAD OF trigger should either return NULL to indicate that it did not modify any
data from the view's underlying base tables, or it should return the view row that was passed in (the
NEW row for INSERT and UPDATE operations, or the OLD row for DELETE operations). A nonnull
return value is used to signal that the trigger performed the necessary data modifications in the view.
This will cause the count of the number of rows affected by the command to be incremented. For
INSERT and UPDATE operations only, the trigger may modify the NEW row before returning it. This
will change the data returned by INSERT RETURNING or UPDATE RETURNING, and is useful
when the view will not show exactly the same data that was provided.
The return value is ignored for row-level triggers fired after an operation, and so they can return NULL.
If more than one trigger is defined for the same event on the same relation, the triggers will be fired
in alphabetical order by trigger name. In the case of BEFORE and INSTEAD OF triggers, the possi-
bly-modified row returned by each trigger becomes the input to the next trigger. If any BEFORE or
INSTEAD OF trigger returns NULL, the operation is abandoned for that row and subsequent triggers
are not fired (for that row).
A trigger definition can also specify a Boolean WHEN condition, which will be tested to see whether
the trigger should be fired. In row-level triggers the WHEN condition can examine the old and/or new
values of columns of the row. (Statement-level triggers can also have WHEN conditions, although the
feature is not so useful for them.) In a BEFORE trigger, the WHEN condition is evaluated just before
the function is or would be executed, so using WHEN is not materially different from testing the same
condition at the beginning of the trigger function. However, in an AFTER trigger, the WHEN condition
is evaluated just after the row update occurs, and it determines whether an event is queued to fire
the trigger at the end of statement. So when an AFTER trigger's WHEN condition does not return true,
it is not necessary to queue an event nor to re-fetch the row at end of statement. This can result in
significant speedups in statements that modify many rows, if the trigger only needs to be fired for a
few of the rows. INSTEAD OF triggers do not support WHEN conditions.
Typically, row-level BEFORE triggers are used for checking or modifying the data that will be inserted
or updated. For example, a BEFORE trigger might be used to insert the current time into a timestamp
column, or to check that two elements of the row are consistent. Row-level AFTER triggers are most
sensibly used to propagate the updates to other tables, or make consistency checks against other tables.
The reason for this division of labor is that an AFTER trigger can be certain it is seeing the final value
of the row, while a BEFORE trigger cannot; there might be other BEFORE triggers firing after it. If
you have no specific reason to make a trigger BEFORE or AFTER, the BEFORE case is more efficient,
since the information about the operation doesn't have to be saved until end of statement.
If a trigger function executes SQL commands then these commands might fire triggers again. This is
known as cascading triggers. There is no direct limitation on the number of cascade levels. It is possible
for cascades to cause a recursive invocation of the same trigger; for example, an INSERT trigger might
execute a command that inserts an additional row into the same table, causing the INSERT trigger to
be fired again. It is the trigger programmer's responsibility to avoid infinite recursion in such scenarios.
When a trigger is being defined, arguments can be specified for it. The purpose of including arguments
in the trigger definition is to allow different triggers with similar requirements to call the same function.
As an example, there could be a generalized trigger function that takes as its arguments two column
names and puts the current user in one and the current time stamp in the other. Properly written, this
trigger function would be independent of the specific table it is triggering on. So the same function
could be used for INSERT events on any table with suitable columns, to automatically track creation
of records in a transaction table for example. It could also be used to track last-update events if defined
as an UPDATE trigger.
Each programming language that supports triggers has its own method for making the trigger input
data available to the trigger function. This input data includes the type of trigger event (e.g., INSERT
or UPDATE) as well as any arguments that were listed in CREATE TRIGGER. For a row-level trigger,
the input data also includes the NEW row for INSERT and UPDATE triggers, and/or the OLD row for
UPDATE and DELETE triggers.
By default, statement-level triggers do not have any way to examine the individual row(s) modified
by the statement. But an AFTER STATEMENT trigger can request that transition tables be created to
make the sets of affected rows available to the trigger. AFTER ROW triggers can also request transition
tables, so that they can see the total changes in the table as well as the change in the individual row
they are currently being fired for. The method for examining the transition tables again depends on
the programming language that is being used, but the typical approach is to make the transition tables
act like read-only temporary tables that can be accessed by SQL commands issued within the trigger
function.
If a trigger function executes SQL commands that access the table the trigger is for, the following data visibility rules determine whether those commands see the data change the trigger is fired for:
• Statement-level triggers follow simple visibility rules: none of the changes made by a statement are
visible to statement-level BEFORE triggers, whereas all modifications are visible to statement-level
AFTER triggers.
• The data change (insertion, update, or deletion) causing the trigger to fire is naturally not visible to
SQL commands executed in a row-level BEFORE trigger, because it hasn't happened yet.
• However, SQL commands executed in a row-level BEFORE trigger will see the effects of data
changes for rows previously processed in the same outer command. This requires caution, since the
ordering of these change events is not in general predictable; a SQL command that affects multiple
rows can visit the rows in any order.
• Similarly, a row-level INSTEAD OF trigger will see the effects of data changes made by previous
firings of INSTEAD OF triggers in the same outer command.
• When a row-level AFTER trigger is fired, all data changes made by the outer command are already
complete, and are visible to the invoked trigger function.
If your trigger function is written in any of the standard procedural languages, then the above state-
ments apply only if the function is declared VOLATILE. Functions that are declared STABLE or IM-
MUTABLE will not see changes made by the calling command in any case.
Further information about data visibility rules can be found in Section 47.5. The example in Sec-
tion 39.4 contains a demonstration of these rules.
When a function is called by the trigger manager, it is not passed any normal arguments, but it is passed
a “context” pointer pointing to a TriggerData structure. C functions can check whether they were
called from the trigger manager or not by executing the macro:
CALLED_AS_TRIGGER(fcinfo)
If this returns true, then it is safe to cast fcinfo->context to type TriggerData * and make use
of the pointed-to TriggerData structure. The function must not alter the TriggerData structure
or any of the data it points to.
The TriggerData structure (defined in commands/trigger.h) has the following members:
type
Always T_TriggerData.
tg_event
Describes the event for which the function is called. You can use the following macros to examine
tg_event:
TRIGGER_FIRED_BEFORE(tg_event)
TRIGGER_FIRED_AFTER(tg_event)
TRIGGER_FIRED_INSTEAD(tg_event)
TRIGGER_FIRED_FOR_ROW(tg_event)
TRIGGER_FIRED_FOR_STATEMENT(tg_event)
TRIGGER_FIRED_BY_INSERT(tg_event)
TRIGGER_FIRED_BY_UPDATE(tg_event)
TRIGGER_FIRED_BY_DELETE(tg_event)
TRIGGER_FIRED_BY_TRUNCATE(tg_event)
tg_relation
A pointer to a structure describing the relation that the trigger fired for. Look at utils/rel.h
for details about this structure. The most interesting things are tg_relation->rd_att (descriptor of the relation tuples) and tg_relation->rd_rel->relname (relation name; the type is not char* but NameData; use SPI_getrelname(tg_relation) if you need a copy of the name).
tg_trigtuple
A pointer to the row for which the trigger was fired. This is the row being inserted, updated,
or deleted. If this trigger was fired for an INSERT or DELETE then this is what you should
return from the function if you don't want to replace the row with a different one (in the case of
INSERT) or skip the operation. For triggers on foreign tables, values of system columns herein
are unspecified.
tg_newtuple
A pointer to the new version of the row, if the trigger was fired for an UPDATE, and NULL if it
is for an INSERT or a DELETE. This is what you have to return from the function if the event is
an UPDATE and you don't want to replace this row by a different one or skip the operation. For
triggers on foreign tables, values of system columns herein are unspecified.
tg_trigger
A pointer to a structure of type Trigger, defined in utils/reltrigger.h, where tgname is the trigger's name, tgnargs is the number of arguments in tgargs, and
tgargs is an array of pointers to the arguments specified in the CREATE TRIGGER statement.
The other members are for internal use only.
tg_trigtuplebuf
The buffer containing tg_trigtuple, or InvalidBuffer if there is no such tuple or it is not stored in a disk buffer.
tg_newtuplebuf
The buffer containing tg_newtuple, or InvalidBuffer if there is no such tuple or it is not stored in a disk buffer.
tg_oldtable
A pointer to a structure of type Tuplestorestate containing zero or more rows in the format
specified by tg_relation, or a NULL pointer if there is no OLD TABLE transition relation.
tg_newtable
A pointer to a structure of type Tuplestorestate containing zero or more rows in the format
specified by tg_relation, or a NULL pointer if there is no NEW TABLE transition relation.
To allow queries issued through SPI to reference transition tables, see SPI_register_trigger_data.
A trigger function must return either a HeapTuple pointer or a NULL pointer (not an SQL null value,
that is, do not set isNull true). Be careful to return either tg_trigtuple or tg_newtuple, as
appropriate, if you don't want to modify the row being operated on.
Here is a very simple example of a trigger function written in C. The function trigf reports the number of rows in the table ttest and skips the actual operation if the command attempts to insert a null value into the column x. (So the trigger acts as a not-null constraint but doesn't abort the transaction.) This is the source code of the trigger function:
#include "postgres.h"
#include "fmgr.h"
#include "executor/spi.h" /* this is what you need to work
with SPI */
#include "commands/trigger.h" /* ... triggers ... */
#include "utils/rel.h" /* ... and relations */
PG_MODULE_MAGIC;
PG_FUNCTION_INFO_V1(trigf);
Datum
trigf(PG_FUNCTION_ARGS)
{
TriggerData *trigdata = (TriggerData *) fcinfo->context;
TupleDesc tupdesc;
HeapTuple rettuple;
char *when;
bool checknull = false;
bool isnull;
int ret, i;
    /* make sure it's called as a trigger at all */
    if (!CALLED_AS_TRIGGER(fcinfo))
        elog(ERROR, "trigf: not called by trigger manager");

    /* tuple to return to executor */
    if (TRIGGER_FIRED_BY_UPDATE(trigdata->tg_event))
        rettuple = trigdata->tg_newtuple;
    else
        rettuple = trigdata->tg_trigtuple;

    /* check for null values */
    if (!TRIGGER_FIRED_BY_DELETE(trigdata->tg_event)
        && TRIGGER_FIRED_BEFORE(trigdata->tg_event))
        checknull = true;

    if (TRIGGER_FIRED_BEFORE(trigdata->tg_event))
        when = "before";
    else
        when = "after ";

    tupdesc = trigdata->tg_relation->rd_att;

    /* connect to SPI manager */
    if ((ret = SPI_connect()) < 0)
        elog(ERROR, "trigf (fired %s): SPI_connect returned %d", when, ret);

    /* get number of rows in table ttest */
    ret = SPI_exec("SELECT count(*) FROM ttest", 0);

    if (ret < 0)
        elog(ERROR, "trigf (fired %s): SPI_exec returned %d", when, ret);

    /* count(*) returns int8, so be careful to convert */
    i = DatumGetInt64(SPI_getbinval(SPI_tuptable->vals[0],
                                    SPI_tuptable->tupdesc,
                                    1,
                                    &isnull));

    elog(INFO, "trigf (fired %s): there are %d rows in ttest", when, i);

    SPI_finish();
if (checknull)
{
SPI_getbinval(rettuple, tupdesc, 1, &isnull);
if (isnull)
rettuple = NULL;
}
return PointerGetDatum(rettuple);
}
After you have compiled the source code (see Section 38.10.5), declare the function and the triggers:
CREATE FUNCTION trigf() RETURNS trigger
    AS 'filename'
    LANGUAGE C;

CREATE TRIGGER tbefore BEFORE INSERT OR UPDATE OR DELETE ON ttest
    FOR EACH ROW EXECUTE PROCEDURE trigf();

CREATE TRIGGER tafter AFTER INSERT OR UPDATE OR DELETE ON ttest
    FOR EACH ROW EXECUTE PROCEDURE trigf();
Now you can test the operation of the trigger; after exercising it with a few INSERT, UPDATE, and DELETE commands, a final SELECT * FROM ttest shows the remaining rows:
x
---
1
4
(2 rows)
Chapter 40. Event Triggers
To supplement the trigger mechanism discussed in Chapter 39, PostgreSQL also provides event trig-
gers. Unlike regular triggers, which are attached to a single table and capture only DML events, event
triggers are global to a particular database and are capable of capturing DDL events.
Like regular triggers, event triggers can be written in any procedural language that includes event
trigger support, or in C, but not in plain SQL.
The ddl_command_start event occurs just before the execution of a CREATE, ALTER, DROP,
SECURITY LABEL, COMMENT, GRANT or REVOKE command. No check whether the affected object
exists or doesn't exist is performed before the event trigger fires. As an exception, however, this event
does not occur for DDL commands targeting shared objects — databases, roles, and tablespaces —
or for commands targeting event triggers themselves. The event trigger mechanism does not support
these object types. ddl_command_start also occurs just before the execution of a SELECT INTO
command, since this is equivalent to CREATE TABLE AS.
The ddl_command_end event occurs just after the execution of this same set of commands. To
obtain more details on the DDL operations that took place, use the set-returning function pg_event_trigger_ddl_commands() from the ddl_command_end event trigger code (see Section 9.28). Note that the trigger fires after the actions have taken place (but before the transaction
commits), and thus the system catalogs can be read as already changed.
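As a sketch (the function and trigger names here are illustrative, not part of the examples in this chapter), a ddl_command_end trigger written in PL/pgSQL could report the commands it sees:

CREATE FUNCTION log_ddl() RETURNS event_trigger
LANGUAGE plpgsql AS $$
DECLARE
    cmd record;
BEGIN
    -- pg_event_trigger_ddl_commands() is only callable from ddl_command_end triggers
    FOR cmd IN SELECT * FROM pg_event_trigger_ddl_commands()
    LOOP
        RAISE NOTICE 'executed % on %', cmd.command_tag, cmd.object_identity;
    END LOOP;
END;
$$;

CREATE EVENT TRIGGER log_ddl_commands ON ddl_command_end
    EXECUTE PROCEDURE log_ddl();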
The sql_drop event occurs just before the ddl_command_end event trigger for any operation
that drops database objects. To list the objects that have been dropped, use the set-returning function
pg_event_trigger_dropped_objects() from the sql_drop event trigger code (see Section 9.28). Note that the trigger is executed after the objects have been deleted from the system catalogs, so it's not possible to look them up anymore.
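Similarly, a sketch of an sql_drop trigger that reports the objects being dropped (names are illustrative):

CREATE FUNCTION log_drops() RETURNS event_trigger
LANGUAGE plpgsql AS $$
DECLARE
    obj record;
BEGIN
    -- pg_event_trigger_dropped_objects() is only callable from sql_drop triggers
    FOR obj IN SELECT * FROM pg_event_trigger_dropped_objects()
    LOOP
        RAISE NOTICE 'dropped % %', obj.object_type, obj.object_identity;
    END LOOP;
END;
$$;

CREATE EVENT TRIGGER log_dropped_objects ON sql_drop
    EXECUTE PROCEDURE log_drops();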
The table_rewrite event occurs just before a table is rewritten by some actions of the commands
ALTER TABLE and ALTER TYPE. While other control statements are available to rewrite a table,
like CLUSTER and VACUUM, the table_rewrite event is not triggered by them.
Event triggers (like other functions) cannot be executed in an aborted transaction. Thus, if a DDL
command fails with an error, any associated ddl_command_end triggers will not be executed. Con-
versely, if a ddl_command_start trigger fails with an error, no further event triggers will fire, and
no attempt will be made to execute the command itself. Similarly, if a ddl_command_end trigger
fails with an error, the effects of the DDL statement will be rolled back, just as they would be in any
other case where the containing transaction aborts.
For a complete list of commands supported by the event trigger mechanism, see Section 40.2.
Event triggers are created using the command CREATE EVENT TRIGGER. In order to create an
event trigger, you must first create a function with the special return type event_trigger. This
function need not (and may not) return a value; the return type serves merely as a signal that the
function is to be invoked as an event trigger.
If more than one event trigger is defined for a particular event, they will fire in alphabetical order by
trigger name.
A trigger definition can also specify a WHEN condition so that, for example, a ddl_command_start
trigger can be fired only for particular commands which the user wishes to intercept. A common use
of such triggers is to restrict the range of DDL operations which users may perform.
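For instance (a sketch; abort_any_command is assumed to be an event_trigger-returning function, in the spirit of the noddl example shown later in this chapter):

CREATE EVENT TRIGGER no_drop_tables ON ddl_command_start
    WHEN TAG IN ('DROP TABLE', 'DROP VIEW')
    EXECUTE PROCEDURE abort_any_command();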
Event trigger functions must use the “version 1” function manager interface.
When a function is called by the event trigger manager, it is not passed any normal arguments, but it
is passed a “context” pointer pointing to a EventTriggerData structure. C functions can check
whether they were called from the event trigger manager or not by executing the macro:
CALLED_AS_EVENT_TRIGGER(fcinfo)
If this returns true, then it is safe to cast fcinfo->context to type EventTriggerData * and make use of the pointed-to EventTriggerData structure. The function must not alter the EventTriggerData structure or any of the data it points to.
type
Always T_EventTriggerData.
event
Describes the event for which the function is called, one of "ddl_command_start",
"ddl_command_end", "sql_drop", "table_rewrite". See Section 40.1 for the mean-
ing of these events.
parsetree
A pointer to the parse tree of the command. Check the PostgreSQL source code for details. The
parse tree structure is subject to change without notice.
tag
The command tag associated with the event for which the event trigger is run, for example "CREATE FUNCTION".
An event trigger function must return a NULL pointer (not an SQL null value, that is, do not set
isNull true).
The function noddl raises an exception each time it is called. The event trigger definition associated
the function with the ddl_command_start event. The effect is that all DDL commands (with the
exceptions mentioned in Section 40.1) are prevented from running.
#include "postgres.h"
#include "commands/event_trigger.h"
PG_MODULE_MAGIC;
PG_FUNCTION_INFO_V1(noddl);
Datum
noddl(PG_FUNCTION_ARGS)
{
EventTriggerData *trigdata;

    if (!CALLED_AS_EVENT_TRIGGER(fcinfo))  /* internal error */
        elog(ERROR, "not fired by event trigger manager");

    trigdata = (EventTriggerData *) fcinfo->context;

    ereport(ERROR,
(errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
errmsg("command \"%s\" denied", trigdata->tag)));
PG_RETURN_NULL();
}
After you have compiled the source code (see Section 38.10.5), declare the function and the trigger:

CREATE FUNCTION noddl() RETURNS event_trigger
    AS 'noddl' LANGUAGE C;

CREATE EVENT TRIGGER noddl ON ddl_command_start
    EXECUTE PROCEDURE noddl();
=# \dy
List of event triggers
Name | Event | Owner | Enabled | Function | Tags
-------+-------------------+-------+---------+----------+------
noddl | ddl_command_start | dim | enabled | noddl |
(1 row)
In this situation, in order to be able to run some DDL commands when you need to do so, you have
to either drop the event trigger or disable it. It can be convenient to disable the trigger for only the
duration of a transaction:
BEGIN;
ALTER EVENT TRIGGER noddl DISABLE;
CREATE TABLE foo (id serial);
ALTER EVENT TRIGGER noddl ENABLE;
COMMIT;
(Recall that DDL commands on event triggers themselves are not affected by event triggers.)
Chapter 41. The Rule System
This chapter discusses the rule system in PostgreSQL. Production rule systems are conceptually sim-
ple, but there are many subtle points involved in actually using them.
Some other database systems define active database rules, which are usually stored procedures and
triggers. In PostgreSQL, these can be implemented using functions and triggers as well.
The rule system (more precisely speaking, the query rewrite rule system) is totally different from
stored procedures and triggers. It modifies queries to take rules into consideration, and then passes the
modified query to the query planner for planning and execution. It is very powerful, and can be used
for many things such as query language procedures, views, and versions. The theoretical foundations
and the power of this rule system are also discussed in [ston90b] and [ong90].
The rule system is located between the parser and the planner. It takes the output of the parser, one
query tree, and the user-defined rewrite rules, which are also query trees with some extra information,
and creates zero or more query trees as result. So its input and output are always things the parser itself
could have produced and thus, anything it sees is basically representable as an SQL statement.
Now what is a query tree? It is an internal representation of an SQL statement where the single
parts that it is built from are stored separately. These query trees can be shown in the server log
if you set the configuration parameters debug_print_parse, debug_print_rewritten,
or debug_print_plan. The rule actions are also stored as query trees, in the system catalog
pg_rewrite. They are not formatted like the log output, but they contain exactly the same infor-
mation.
Reading a raw query tree requires some experience. But since SQL representations of query trees are
sufficient to understand the rule system, this chapter will not teach how to read them.
When reading the SQL representations of the query trees in this chapter it is necessary to be able to
identify the parts the statement is broken into when it is in the query tree structure. The parts of a
query tree are
the command type
This is a simple value telling which command (SELECT, INSERT, UPDATE, DELETE) produced
the query tree.
the range table
The range table is a list of relations that are used in the query. In a SELECT statement these are
the relations given after the FROM key word.
Every range table entry identifies a table or view and tells by which name it is called in the other
parts of the query. In the query tree, the range table entries are referenced by number rather than
by name, so here it doesn't matter if there are duplicate names as it would in an SQL statement.
This can happen after the range tables of rules have been merged in. The examples in this chapter
will not have this situation.
the result relation
This is an index into the range table that identifies the relation where the results of the query go.
SELECT queries don't have a result relation. (The special case of SELECT INTO is mostly iden-
tical to CREATE TABLE followed by INSERT ... SELECT, and is not discussed separately
here.)
For INSERT, UPDATE, and DELETE commands, the result relation is the table (or view!) where
the changes are to take effect.
the target list
The target list is a list of expressions that define the result of the query. In the case of a SELECT,
these expressions are the ones that build the final output of the query. They correspond to the
expressions between the key words SELECT and FROM. (* is just an abbreviation for all the
column names of a relation. It is expanded by the parser into the individual columns, so the rule
system never sees it.)
DELETE commands don't need a normal target list because they don't produce any result. Instead,
the planner adds a special CTID entry to the empty target list, to allow the executor to find the
row to be deleted. (CTID is added when the result relation is an ordinary table. If it is a view, a
whole-row variable is added instead, by the rule system, as described in Section 41.2.4.)
For INSERT commands, the target list describes the new rows that should go into the result
relation. It consists of the expressions in the VALUES clause or the ones from the SELECT clause
in INSERT ... SELECT. The first step of the rewrite process adds target list entries for any
columns that were not assigned to by the original command but have defaults. Any remaining
columns (with neither a given value nor a default) will be filled in by the planner with a constant
null expression.
For UPDATE commands, the target list describes the new rows that should replace the old ones.
In the rule system, it contains just the expressions from the SET column = expression part
of the command. The planner will handle missing columns by inserting expressions that copy the
values from the old row into the new one. Just as for DELETE, a CTID or whole-row variable is
added so that the executor can identify the old row to be updated.
Every entry in the target list contains an expression that can be a constant value, a variable pointing
to a column of one of the relations in the range table, a parameter, or an expression tree made of
function calls, constants, variables, operators, etc.
the qualification
The query's qualification is an expression much like one of those contained in the target list en-
tries. The result value of this expression is a Boolean that tells whether the operation (INSERT,
UPDATE, DELETE, or SELECT) for the final result row should be executed or not. It corresponds
to the WHERE clause of an SQL statement.
the join tree
The query's join tree shows the structure of the FROM clause. For a simple query like
SELECT ... FROM a, b, c, the join tree is just a list of the FROM items, because we are
allowed to join them in any order. But when JOIN expressions, particularly outer joins, are used,
we have to join in the order shown by the joins. In that case, the join tree shows the structure
of the JOIN expressions. The restrictions associated with particular JOIN clauses (from ON or
USING expressions) are stored as qualification expressions attached to those join-tree nodes. It
turns out to be convenient to store the top-level WHERE expression as a qualification attached
to the top-level join-tree item, too. So really the join tree represents both the FROM and WHERE
clauses of a SELECT.
the others
The other parts of the query tree like the ORDER BY clause aren't of interest here. The rule system
substitutes some entries there while applying rules, but that doesn't have much to do with the
fundamentals of the rule system.
Views in PostgreSQL are implemented using the rule system. In fact, there is essentially no difference
between:

CREATE VIEW myview AS SELECT * FROM mytab;

compared against the two commands:

CREATE TABLE myview (same column list as mytab);
CREATE RULE "_RETURN" AS ON SELECT TO myview DO INSTEAD
    SELECT * FROM mytab;

because this is exactly what the CREATE VIEW command does internally. This has some side effects.
One of them is that the information about a view in the PostgreSQL system catalogs is exactly the
same as it is for a table. So for the parser, there is absolutely no difference between a table and a view.
They are the same thing: relations.
Currently, there can be only one action in an ON SELECT rule, and it must be an unconditional
SELECT action that is INSTEAD. This restriction was required to make rules safe enough to open
them for ordinary users, and it restricts ON SELECT rules to act like views.
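The rewrite rule stored for a view can be examined through the pg_rules system view; for instance (a small sketch, using the myview example above):

SELECT rulename, definition FROM pg_rules WHERE tablename = 'myview';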
The examples for this chapter are two join views that do some calculations and some more views using
them in turn. One of the two first views is customized later by adding rules for INSERT, UPDATE,
and DELETE operations so that the final result will be a view that behaves like a real table with some
magic functionality. This is not such a simple example to start from and this makes things harder to
get into. But it's better to have one example that covers all the points discussed step by step rather than
having many different ones that might mix up in mind.
The real tables we need in the first two rule system descriptions are these:
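A sketch of table definitions consistent with the column names used later in this chapter (the exact definitions shown here are illustrative):

CREATE TABLE shoe_data (
    shoename text,
    sh_avail integer,
    slcolor  text,
    slminlen real,
    slmaxlen real,
    slunit   text
);

CREATE TABLE shoelace_data (
    sl_name  text,
    sl_avail integer,
    sl_color text,
    sl_len   real,
    sl_unit  text
);

CREATE TABLE unit (
    un_name text,
    un_fact real
);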
The CREATE VIEW command for the shoelace view (which is the simplest one we have) will create
a relation shoelace and an entry in pg_rewrite that tells that there is a rewrite rule that must be
applied whenever the relation shoelace is referenced in a query's range table. The rule has no rule
qualification (discussed later, with the non-SELECT rules, since SELECT rules currently cannot have
them) and it is INSTEAD. Note that rule qualifications are not the same as query qualifications. The
action of our rule has a query qualification. The action of the rule is one query tree that is a copy of
the SELECT statement in the view creation command.
Note
The two extra range table entries for NEW and OLD that you can see in the pg_rewrite entry
aren't of interest for SELECT rules.
Now we populate unit, shoe_data and shoelace_data and run a simple query on a view:
This is the simplest SELECT you can do on our views, so we take this opportunity to explain the
basics of view rules. The SELECT * FROM shoelace was interpreted by the parser and produced
the query tree:
and this is given to the rule system. The rule system walks through the range table and checks if there
are rules for any relation. When processing the range table entry for shoelace (the only one up to
now) it finds the _RETURN rule with the query tree:
To expand the view, the rewriter simply creates a subquery range-table entry containing the rule's
action query tree, and substitutes this range table entry for the original one that referenced the view.
The resulting rewritten query tree is almost the same as if you had typed:
There is one difference however: the subquery's range table has two extra entries shoelace old
and shoelace new. These entries don't participate directly in the query, since they aren't referenced
by the subquery's join tree or target list. The rewriter uses them to store the access privilege check
information that was originally present in the range-table entry that referenced the view. In this way,
the executor will still check that the user has proper privileges to access the view, even though there's
no direct use of the view in the rewritten query.
That was the first rule applied. The rule system will continue checking the remaining range-table
entries in the top query (in this example there are no more), and it will recursively check the range-
table entries in the added subquery to see if any of them reference views. (But it won't expand old
or new — otherwise we'd have infinite recursion!) In this example, there are no rewrite rules for
shoelace_data or unit, so rewriting is complete and the above is the final result given to the
planner.
Now we want to write a query that finds out for which shoes currently in the store we have the matching
shoelaces (color and length) and where the total number of exactly matching pairs is greater than or
equal to two.
The first rule applied will be the one for the shoe_ready view and it results in the query tree:
Similarly, the rules for shoe and shoelace are substituted into the range table of the subquery,
leading to a three-level final query tree:
This might look inefficient, but the planner will collapse this into a single-level query tree by “pulling
up” the subqueries, and then it will plan the joins just as if we'd written them out manually. So col-
lapsing the query tree is an optimization that the rewrite system doesn't have to concern itself with.
There are only a few differences between a query tree for a SELECT and one for any other command.
Obviously, they have a different command type and for a command other than a SELECT, the result
relation points to the range-table entry where the result should go. Everything else is absolutely the
same. So having two tables t1 and t2 with columns a and b, the query trees for the two statements:
SELECT t2.b FROM t1, t2 WHERE t1.a = t2.a;

UPDATE t1 SET b = t2.b FROM t2 WHERE t1.a = t2.a;

are nearly identical. In particular:
• The range tables contain entries for the tables t1 and t2.
• The target lists contain one variable that points to column b of the range table entry for table t2.
• The qualification expressions compare the columns a of both range-table entries for equality.
The consequence is that both query trees result in similar execution plans: they are both joins over the two tables. For the UPDATE the missing columns from t1 are added to the target list by the planner and the final query tree will read as:

UPDATE t1 SET a = t1.a, b = t2.b FROM t2 WHERE t1.a = t2.a;

and thus the executor's run over the join will produce exactly the same result set as:

SELECT t1.a, t2.b FROM t1, t2 WHERE t1.a = t2.a;
But there is a little problem in UPDATE: the part of the executor plan that does the join does not care
what the results from the join are meant for. It just produces a result set of rows. The fact that one is a
SELECT command and the other is an UPDATE is handled higher up in the executor, where it knows
that this is an UPDATE, and it knows that this result should go into table t1. But which of the rows
that are there has to be replaced by the new row?
To resolve this problem, another entry is added to the target list in UPDATE (and also in DELETE)
statements: the current tuple ID (CTID). This is a system column containing the file block number
and position in the block for the row. Knowing the table, the CTID can be used to retrieve the original
row of t1 to be updated. After adding the CTID to the target list, the query actually looks like:

SELECT t1.a, t2.b, t1.ctid FROM t1, t2 WHERE t1.a = t2.a;
Now another detail of PostgreSQL enters the stage. Old table rows aren't overwritten, and this is why
ROLLBACK is fast. In an UPDATE, the new result row is inserted into the table (after stripping the
CTID) and in the row header of the old row, which the CTID pointed to, the cmax and xmax entries
are set to the current command counter and current transaction ID. Thus the old row is hidden, and
after the transaction commits the vacuum cleaner can eventually remove the dead row.
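As a small illustration (a sketch, reusing the t1 example table from above), these system columns can be inspected directly:

SELECT ctid, xmin, xmax, a, b FROM t1;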
Knowing all that, we can simply apply view rules in absolutely the same way to any command. There
is no difference.
The benefit of implementing views with the rule system is that the planner has all the information
about which tables have to be scanned plus the relationships between these tables plus the restrictive
qualifications from the views plus the qualifications from the original query in one single query tree.
And this is still the situation when the original query is already a join over views. The planner has
to decide which is the best path to execute the query, and the more information the planner has, the
better this decision can be. And the rule system as implemented in PostgreSQL ensures that this is all
information available about the query up to that point.
If the subquery selects from a single base relation and is simple enough, the rewriter can automatically
replace the subquery with the underlying base relation so that the INSERT, UPDATE, or DELETE
is applied to the base relation in the appropriate way. Views that are “simple enough” for this are
called automatically updatable. For detailed information on the kinds of view that can be automatically
updated, see CREATE VIEW.
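A minimal sketch of an automatically updatable view (the table and view names here are illustrative, not taken from the shoelace example):

CREATE TABLE accounts (id integer PRIMARY KEY, balance numeric);

CREATE VIEW active_accounts AS
    SELECT id, balance FROM accounts WHERE balance > 0;

-- This statement is rewritten into an UPDATE on accounts:
UPDATE active_accounts SET balance = balance - 10 WHERE id = 1;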
Alternatively, the operation may be handled by a user-provided INSTEAD OF trigger on the view.
Rewriting works slightly differently in this case. For INSERT, the rewriter does nothing at all with
the view, leaving it as the result relation for the query. For UPDATE and DELETE, it's still necessary
to expand the view query to produce the “old” rows that the command will attempt to update or delete.
So the view is expanded as normal, but another unexpanded range-table entry is added to the query
to represent the view in its capacity as the result relation.
The problem that now arises is how to identify the rows to be updated in the view. Recall that when
the result relation is a table, a special CTID entry is added to the target list to identify the physical
locations of the rows to be updated. This does not work if the result relation is a view, because a view
does not have any CTID, since its rows do not have actual physical locations. Instead, for an UPDATE
or DELETE operation, a special wholerow entry is added to the target list, which expands to include
all columns from the view. The executor uses this value to supply the “old” row to the INSTEAD OF
trigger. It is up to the trigger to work out what to update based on the old and new row values.
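Continuing the illustrative accounts/active_accounts sketch above, an INSTEAD OF trigger for UPDATE could look like this (the function and trigger names are assumptions):

CREATE FUNCTION active_accounts_upd() RETURNS trigger
LANGUAGE plpgsql AS $$
BEGIN
    -- OLD and NEW carry whole view rows supplied by the executor
    UPDATE accounts SET balance = NEW.balance WHERE id = OLD.id;
    RETURN NEW;
END;
$$;

CREATE TRIGGER active_accounts_upd
    INSTEAD OF UPDATE ON active_accounts
    FOR EACH ROW EXECUTE PROCEDURE active_accounts_upd();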
Another possibility is for the user to define INSTEAD rules that specify substitute actions for INSERT,
UPDATE, and DELETE commands on a view. These rules will rewrite the command, typically into a
command that updates one or more tables, rather than views. That is the topic of Section 41.4.
Note that rules are evaluated first, rewriting the original query before it is planned and executed.
Therefore, if a view has INSTEAD OF triggers as well as rules on INSERT, UPDATE, or DELETE,
then the rules will be evaluated first, and depending on the result, the triggers may not be used at all.
Automatic rewriting of an INSERT, UPDATE, or DELETE query on a simple view is always tried
last. Therefore, if a view has rules or triggers, they will override the default behavior of automatically
updatable views.
If there are no INSTEAD rules or INSTEAD OF triggers for the view, and the rewriter cannot au-
tomatically rewrite the query as an update on the underlying base relation, an error will be thrown
because the executor cannot update a view as such.
Materialized views in PostgreSQL use the rule system like views do, but persist the results in a table-like form. The main differences between:

CREATE MATERIALIZED VIEW mymatview AS SELECT * FROM mytab;

and:

CREATE TABLE mymatview AS SELECT * FROM mytab;
are that the materialized view cannot subsequently be directly updated and that the query used to create
the materialized view is stored in exactly the same way that a view's query is stored, so that fresh data
can be generated for the materialized view with:

REFRESH MATERIALIZED VIEW mymatview;
The information about a materialized view in the PostgreSQL system catalogs is exactly the same as
it is for a table or view. So for the parser, a materialized view is a relation, just like a table or a view.
When a materialized view is referenced in a query, the data is returned directly from the materialized
view, like from a table; the rule is only used for populating the materialized view.
While access to the data stored in a materialized view is often much faster than accessing the underlying
tables directly or through a view, the data is not always current; yet sometimes current data is not
needed. Consider a table which records sales:
If people want to be able to quickly graph historical sales data, they might want to summarize, and
they may not care about the incomplete data for the current date:
This materialized view might be useful for displaying a graph in a dashboard created for salespeople. A job could be scheduled to refresh the summary each night with a REFRESH MATERIALIZED VIEW statement, for example as sketched below:
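A sketch of the kind of definitions meant here (the invoice table, the sales_summary view, and their columns are illustrative):

CREATE TABLE invoice (
    invoice_no   integer PRIMARY KEY,
    seller_no    integer,        -- ID of salesperson
    invoice_date date,           -- date of sale
    invoice_amt  numeric(13,2)   -- amount of sale
);

CREATE MATERIALIZED VIEW sales_summary AS
    SELECT seller_no, invoice_date, sum(invoice_amt)::numeric(13,2) AS sales_amt
      FROM invoice
     WHERE invoice_date < CURRENT_DATE
     GROUP BY seller_no, invoice_date;

CREATE UNIQUE INDEX sales_summary_seller ON sales_summary (seller_no, invoice_date);

-- nightly refresh job:
REFRESH MATERIALIZED VIEW sales_summary;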
Another use for a materialized view is to allow faster access to data brought across from a remote
system through a foreign data wrapper. A simple example using file_fdw is below, with timings,
but since this is using cache on the local system the performance difference compared to access to a
remote system would usually be greater than shown here. Notice we are also exploiting the ability to
put an index on the materialized view, whereas file_fdw does not support indexes; this advantage
might not apply for other sorts of foreign data access.
Setup:
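A sketch of the kind of setup meant here, assuming the file_fdw and pg_trgm extensions and a local word list (the names and the file path are illustrative):

CREATE EXTENSION file_fdw;
CREATE SERVER local_file FOREIGN DATA WRAPPER file_fdw;

CREATE FOREIGN TABLE words (word text NOT NULL)
    SERVER local_file
    OPTIONS (filename '/usr/share/dict/words');

CREATE MATERIALIZED VIEW wrd AS SELECT * FROM words;
CREATE UNIQUE INDEX wrd_word ON wrd (word);

CREATE EXTENSION pg_trgm;
CREATE INDEX wrd_trgm ON wrd USING gist (word gist_trgm_ops);
VACUUM ANALYZE wrd;

Checking either the foreign table words or the materialized view wrd for a misspelled word, for example with SELECT count(*) FROM wrd WHERE word = 'caterpiler';, returns no match: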
count
-------
0
(1 row)
Either way, the word is spelled wrong, so let's look for what we might have wanted. Again using
file_fdw:
SELECT word FROM words ORDER BY word <-> 'caterpiler' LIMIT 10;
word
---------------
cater
caterpillar
Caterpillar
caterpillars
caterpillar's
Caterpillar's
caterer
caterer's
caters
catered
(10 rows)
If you can tolerate periodic update of the remote data to the local database, the performance benefit
can be substantial.
Second, they don't modify the query tree in place. Instead they create zero or more new query trees
and can throw away the original one.
Caution
In many cases, tasks that could be performed by rules on INSERT/UPDATE/DELETE are
better done with triggers. Triggers are notationally a bit more complicated, but their semantics
are much simpler to understand. Rules tend to have surprising results when the original query
contains volatile functions: volatile functions may get executed more times than expected in
the process of carrying out the rules.
Also, there are some cases that are not supported by these types of rules at all, notably including
WITH clauses in the original query and multiple-assignment sub-SELECTs in the SET list of
UPDATE queries. This is because copying these constructs into a rule query would result in
multiple evaluations of the sub-query, contrary to the express intent of the query's author.
Keep the syntax:

CREATE [ OR REPLACE ] RULE name AS ON event
    TO table [ WHERE condition ]
    DO [ ALSO | INSTEAD ] { NOTHING | command | ( command ; command ... ) }

in mind. In the following, update rules means rules that are defined on INSERT, UPDATE, or DELETE.
Update rules get applied by the rule system when the result relation and the command type of a query
tree are equal to the object and event given in the CREATE RULE command. For update rules, the rule
system creates a list of query trees. Initially the query-tree list is empty. There can be zero (NOTHING
key word), one, or multiple actions. To simplify, we will look at a rule with one action. This rule can
have a qualification or not and it can be INSTEAD or ALSO (the default).
What is a rule qualification? It is a restriction that tells when the actions of the rule should be done and
when not. This qualification can only reference the pseudorelations NEW and/or OLD, which basically
represent the relation that was given as object (but with a special meaning).
So we have three cases that produce the following query trees for a one-action rule.
No qualification, with either ALSO or INSTEAD
the query tree from the rule action with the original query tree's qualification added
Qualification given and ALSO
the query tree from the rule action with the rule qualification and the original query tree's qualification added
Qualification given and INSTEAD
the query tree from the rule action with the rule qualification and the original query tree's qualification; and the original query tree with the negated rule qualification added
Finally, if the rule is ALSO, the unchanged original query tree is added to the list. Since only qualified
INSTEAD rules already add the original query tree, we end up with either one or two output query
trees for a rule with one action.
For ON INSERT rules, the original query (if not suppressed by INSTEAD) is done before any ac-
tions added by rules. This allows the actions to see the inserted row(s). But for ON UPDATE and ON
DELETE rules, the original query is done after the actions added by rules. This ensures that the actions
can see the to-be-updated or to-be-deleted rows; otherwise, the actions might do nothing because they
find no rows matching their qualifications.
The query trees generated from rule actions are thrown into the rewrite system again, and maybe more
rules get applied resulting in additional or fewer query trees. So a rule's actions must have either a
different command type or a different result relation than the rule itself is on, otherwise this recursive
process will end up in an infinite loop. (Recursive expansion of a rule will be detected and reported
as an error.)
The query trees found in the actions of the pg_rewrite system catalog are only templates. Since
they can reference the range-table entries for NEW and OLD, some substitutions have to be made before
they can be used. For any reference to NEW, the target list of the original query is searched for a
corresponding entry. If found, that entry's expression replaces the reference. Otherwise, NEW means
the same as OLD (for an UPDATE) or is replaced by a null value (for an INSERT). Any reference to
OLD is replaced by a reference to the range-table entry that is the result relation.
After the system is done applying update rules, it applies view rules to the produced query tree(s).
Views cannot insert new update actions so there is no need to apply update rules to the output of view
rewriting.
That's what we expected. What happened in the background is the following. The parser created the
query tree:
There is a rule log_shoelace that is ON UPDATE with the rule qualification expression:
NEW.sl_avail <> OLD.sl_avail

and the action:

INSERT INTO shoelace_log VALUES (
       new.sl_name, new.sl_avail,
       current_user, current_timestamp )
  FROM shoelace_data new, shoelace_data old;
(This looks a little strange since you cannot normally write INSERT ... VALUES ... FROM. The
FROM clause here is just to indicate that there are range-table entries in the query tree for new and old.
These are needed so that they can be referenced by variables in the INSERT command's query tree.)
The rule is a qualified ALSO rule, so the rule system has to return two query trees: the modified rule
action and the original query tree. In step 1, the range table of the original query is incorporated into
the rule's action query tree. This results in:
In step 2, the rule qualification is added to it, so the result set is restricted to rows where sl_avail
changes:
(This looks even stranger, since INSERT ... VALUES doesn't have a WHERE clause either, but
the planner and executor will have no difficulty with it. They need to support this same functionality
anyway for INSERT ... SELECT.)
In step 3, the original query tree's qualification is added, restricting the result set further to only the
rows that would have been touched by the original query:
Step 4 replaces references to NEW by the target list entries from the original query tree or by the
matching variable references from the result relation:
shoelace_data shoelace_data
WHERE 6 <> old.sl_avail
AND shoelace_data.sl_name = 'sl7';
That's it. Since the rule is ALSO, we also output the original query tree. In short, the output from the
rule system is a list of two query trees that correspond to these statements:
These are executed in this order, and that is exactly what the rule was meant to do.
The substitutions and the added qualifications ensure that, if the original query would be, say:
no log entry would get written. In that case, the original query tree does not contain a target list entry
for sl_avail, so NEW.sl_avail will get replaced by shoelace_data.sl_avail. Thus, the
extra command generated by the rule is:
It will also work if the original query modifies multiple rows. So if someone issued the command:
four rows in fact get updated (sl1, sl2, sl3, and sl4). But sl3 already has sl_avail = 0. In
this case, the original query tree's qualification is different and that results in the extra query tree:
being generated by the rule. This query tree will surely insert three new log entries. And that's ab-
solutely correct.
Here we can see why it is important that the original query tree is executed last. If the UPDATE had
been executed first, all the rows would have already been set to zero, so the logging INSERT would
not find any row where 0 <> shoelace_data.sl_avail.
If someone now tries to do any of these operations on the view relation shoe, the rule system will
apply these rules. Since the rules have no actions and are INSTEAD, the resulting list of query trees
will be empty and the whole query will become nothing because there is nothing left to be optimized
or executed after the rule system is done with it.
A more sophisticated way to use the rule system is to create rules that rewrite the query tree into one
that does the right operation on the real tables. To do that on the shoelace view, we create the
following rules:
CREATE RULE shoelace_del AS ON DELETE TO shoelace
    DO INSTEAD
DELETE FROM shoelace_data
WHERE sl_name = OLD.sl_name;
If you want to support RETURNING queries on the view, you need to make the rules include RETURNING clauses that compute the view rows. This is usually pretty trivial for views on a single table,
but it's a bit tedious for join views such as shoelace. An example for the insert case is:
Note that this one rule supports both INSERT and INSERT RETURNING queries on the view — the
RETURNING clause is simply ignored for INSERT.
Now assume that once in a while, a pack of shoelaces arrives at the shop and a big parts list along
with it. But you don't want to manually update the shoelace view every time. Instead we set up
two little tables: one where you can insert the items from the part list, and one with a special trick.
The creation commands for these are:
Now you can fill the table shoelace_arrive with the data from the parts list:
arr_name | arr_quant
----------+-----------
sl3 | 10
sl6 | 20
sl8 | 20
(3 rows)
It's a long way from the one INSERT ... SELECT to these results. And the description of the
query-tree transformation will be the last in this chapter. First, there is the parser's output:

INSERT INTO shoelace_ok
SELECT shoelace_arrive.arr_name, shoelace_arrive.arr_quant
  FROM shoelace_arrive shoelace_arrive, shoelace_ok shoelace_ok;
Now the first rule shoelace_ok_ins is applied and turns this into:
UPDATE shoelace
SET sl_avail = shoelace.sl_avail + shoelace_arrive.arr_quant
FROM shoelace_arrive shoelace_arrive, shoelace_ok shoelace_ok,
shoelace_ok old, shoelace_ok new,
shoelace shoelace
WHERE shoelace.sl_name = shoelace_arrive.arr_name;
and throws away the original INSERT on shoelace_ok. This rewritten query is passed to the rule
system again, and the second applied rule shoelace_upd produces:
UPDATE shoelace_data
SET sl_name = shoelace.sl_name,
sl_avail = shoelace.sl_avail + shoelace_arrive.arr_quant,
sl_color = shoelace.sl_color,
sl_len = shoelace.sl_len,
sl_unit = shoelace.sl_unit
FROM shoelace_arrive shoelace_arrive, shoelace_ok shoelace_ok,
shoelace_ok old, shoelace_ok new,
shoelace shoelace, shoelace old,
shoelace new, shoelace_data shoelace_data
WHERE shoelace.sl_name = shoelace_arrive.arr_name
AND shoelace_data.sl_name = shoelace.sl_name;
Again it's an INSTEAD rule and the previous query tree is trashed. Note that this query still uses the
view shoelace. But the rule system isn't finished with this step, so it continues and applies the
_RETURN rule on it, and we get:
UPDATE shoelace_data
SET sl_name = s.sl_name,
sl_avail = s.sl_avail + shoelace_arrive.arr_quant,
sl_color = s.sl_color,
sl_len = s.sl_len,
sl_unit = s.sl_unit
FROM shoelace_arrive shoelace_arrive, shoelace_ok shoelace_ok,
shoelace_ok old, shoelace_ok new,
shoelace shoelace, shoelace old,
shoelace new, shoelace_data shoelace_data,
shoelace old, shoelace new,
shoelace_data s, unit u
WHERE s.sl_name = shoelace_arrive.arr_name
AND shoelace_data.sl_name = s.sl_name;
Finally, the rule log_shoelace gets applied, producing the extra query tree:
shoelace_log shoelace_log
WHERE s.sl_name = shoelace_arrive.arr_name
AND shoelace_data.sl_name = s.sl_name
AND (s.sl_avail + shoelace_arrive.arr_quant) <> s.sl_avail;
After that the rule system runs out of rules and returns the generated query trees.
So we end up with two final query trees that are equivalent to the SQL statements:
UPDATE shoelace_data
SET sl_avail = shoelace_data.sl_avail +
shoelace_arrive.arr_quant
FROM shoelace_arrive shoelace_arrive,
shoelace_data shoelace_data,
shoelace_data s
WHERE s.sl_name = shoelace_arrive.arr_name
AND shoelace_data.sl_name = s.sl_name;
The result is that data coming from one relation, inserted into another, changed into updates on a third, changed into updating a fourth, plus logging that final update in a fifth, gets reduced to two queries.
There is a little detail that's a bit ugly. Looking at the two queries, it turns out that the shoelace_data relation appears twice in the range table where it could definitely be reduced to one. The planner does not handle it, and so the execution plan for the rule system's output of the INSERT will be
Nested Loop
-> Merge Join
-> Seq Scan
-> Sort
-> Seq Scan on s
-> Seq Scan
-> Sort
-> Seq Scan on shoelace_arrive
-> Seq Scan on shoelace_data
Merge Join
-> Seq Scan
-> Sort
-> Seq Scan on s
-> Seq Scan
-> Sort
-> Seq Scan on shoelace_arrive
which produces exactly the same entries in the log table. Thus, the rule system caused one extra scan
on the table shoelace_data that is absolutely not necessary. And the same redundant scan is done
once more in the UPDATE. But it was a really hard job to make that all possible at all.
Now we make a final demonstration of the PostgreSQL rule system and its power. Say you add some
shoelaces with extraordinary colors to your database:
We would like to make a view to check which shoelace entries do not fit any shoe in color. The
view for this is:
Now we want to set it up so that mismatching shoelaces that are not in stock are deleted from the
database. To make it a little harder for PostgreSQL, we don't delete them directly. Instead we create one
more view:
Voilà:
A DELETE on a view, with a subquery qualification that in total uses 4 nesting/joined views, where
one of them itself has a subquery qualification containing a view and where calculated view columns
are used, gets rewritten into one single query tree that deletes the requested data from a real table.
There are probably only a few situations out in the real world where such a construct is necessary. But
it makes you feel comfortable that it works.
Rewrite rules don't have a separate owner. The owner of a relation (table or view) is automatically the
owner of the rewrite rules that are defined for it. The PostgreSQL rule system changes the behavior of
the default access control system. Relations that are used due to rules get checked against the privileges
of the rule owner, not the user invoking the rule. This means that users only need the required privileges
for the tables/views that are explicitly named in their queries.
For example: A user has a list of phone numbers where some of them are private, the others are of
interest for the assistant of the office. The user can construct the following:
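A sketch of what such a setup could look like (the column names are illustrative, and assistant is a role name assumed here):

CREATE TABLE phone_data (person text, phone text, private boolean);

CREATE VIEW phone_number AS
    SELECT person, CASE WHEN NOT private THEN phone END AS phone
    FROM phone_data;

GRANT SELECT ON phone_number TO assistant;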
Nobody except that user (and the database superusers) can access the phone_data table. But be-
cause of the GRANT, the assistant can run a SELECT on the phone_number view. The rule system
will rewrite the SELECT from phone_number into a SELECT from phone_data. Since the user
is the owner of phone_number and therefore the owner of the rule, the read access to phone_data is now checked against the user's privileges and the query is permitted. The check for accessing
phone_number is also performed, but this is done against the invoking user, so nobody but the user
and the assistant can use it.
The privileges are checked rule by rule. So the assistant is for now the only one who can see the public
phone numbers. But the assistant can set up another view and grant access to that to the public. Then,
anyone can see the phone_number data through the assistant's view. What the assistant cannot do is
to create a view that directly accesses phone_data. (Actually the assistant can, but it will not work
since every access will be denied during the permission checks.) And as soon as the user notices that the
assistant opened their phone_number view, the user can revoke the assistant's access. Immediately,
any access to the assistant's view would fail.
One might think that this rule-by-rule checking is a security hole, but in fact it isn't. But if it did not
work this way, the assistant could set up a table with the same columns as phone_number and copy
the data to there once per day. Then it's the assistant's own data and the assistant can grant access to
everyone they want. A GRANT command means, “I trust you”. If someone you trust does the thing
above, it's time to think it over and then use REVOKE.
Note that while views can be used to hide the contents of certain columns using the technique shown above, they cannot be used to reliably conceal the data in unseen rows unless the security_barrier flag has been set. For example, the following view is insecure:

CREATE VIEW phone_number AS
    SELECT person, phone FROM phone_data WHERE phone NOT LIKE '412%';
This view might seem secure, since the rule system will rewrite any SELECT from phone_number
into a SELECT from phone_data and add the qualification that only entries where phone does not
begin with 412 are wanted. But if the user can create their own functions, it is not difficult to convince
the planner to execute the user-defined function prior to the NOT LIKE expression. For example:

CREATE FUNCTION tricky(text, text) RETURNS bool AS $$
BEGIN
    RAISE NOTICE '% => %', $1, $2;
    RETURN true;
END;
$$ LANGUAGE plpgsql COST 0.0000000000000000000001;

SELECT * FROM phone_number WHERE tricky(person, phone);
Every person and phone number in the phone_data table will be printed as a NOTICE, because
the planner will choose to execute the inexpensive tricky function before the more expensive NOT
LIKE. Even if the user is prevented from defining new functions, built-in functions can be used in
similar attacks. (For example, most casting functions include their input values in the error messages
they produce.)
Similar considerations apply to update rules. In the examples of the previous section, the owner of the
tables in the example database could grant the privileges SELECT, INSERT, UPDATE, and DELETE
on the shoelace view to someone else, but only SELECT on shoelace_log. The rule action to
write log entries will still be executed successfully, and that other user could see the log entries. But
they could not create fake entries, nor could they manipulate or remove existing ones. In this case,
there is no possibility of subverting the rules by convincing the planner to alter the order of operations,
because the only rule which references shoelace_log is an unqualified INSERT. This might not
be true in more complex scenarios.
When it is necessary for a view to provide row level security, the security_barrier attribute
should be applied to the view. This prevents maliciously-chosen functions and operators from being
passed values from rows until after the view has done its work. For example, if the view shown above had been created like this, it would be secure:

CREATE VIEW phone_number WITH (security_barrier) AS
    SELECT person, phone FROM phone_data WHERE phone NOT LIKE '412%';
Views created with the security_barrier option may perform far worse than views created without
this option. In general, there is no way to avoid this: the fastest possible plan must be rejected if it may
compromise security. For this reason, this option is not enabled by default.
The query planner has more flexibility when dealing with functions that have no side effects. Such
functions are referred to as LEAKPROOF, and include many simple, commonly used operators, such
as many equality operators. The query planner can safely allow such functions to be evaluated at any
point in the query execution process, since invoking them on rows invisible to the user will not leak
any information about the unseen rows. Further, functions which do not take arguments or which are
not passed any arguments from the security barrier view do not have to be marked as LEAKPROOF to
be pushed down, as they never receive data from the view. In contrast, a function that might throw an
error depending on the values received as arguments (such as one that throws an error in the event of
overflow or division by zero) is not leak-proof, and could provide significant information about the
unseen rows if applied before the security view's row filters.
It is important to understand that even a view created with the security_barrier option is in-
tended to be secure only in the limited sense that the contents of the invisible tuples will not be passed
to possibly-insecure functions. The user may well have other means of making inferences about the
unseen data; for example, they can see the query plan using EXPLAIN, or measure the run time of
queries against the view. A malicious attacker might be able to infer something about the amount of
unseen data, or even gain some information about the data distribution or most common values (since
these things may affect the run time of the plan; or even, since they are also reflected in the optimizer
statistics, the choice of plan). If these types of "covert channel" attacks are of concern, it is probably
unwise to grant any access to the data at all.
• If there is no unconditional INSTEAD rule for the query, then the originally given query will be ex-
ecuted, and its command status will be returned as usual. (But note that if there were any conditional
INSTEAD rules, the negation of their qualifications will have been added to the original query. This
might reduce the number of rows it processes, and if so the reported status will be affected.)
• If there is any unconditional INSTEAD rule for the query, then the original query will not be exe-
cuted at all. In this case, the server will return the command status for the last query that was inserted
by an INSTEAD rule (conditional or unconditional) and is of the same command type (INSERT,
UPDATE, or DELETE) as the original query. If no query meeting those requirements is added by
any rule, then the returned command status shows the original query type and zeroes for the row-
count and OID fields.
The programmer can ensure that any desired INSTEAD rule is the one that sets the command status
in the second case, by giving it the alphabetically last rule name among the active rules, so that it
gets applied last.
In this chapter, we focused on using rules to update views. All of the update rule examples in this
chapter can also be implemented using INSTEAD OF triggers on the views. Writing such triggers is
often easier than writing rules, particularly if complex logic is required to perform the update.
For the things that can be implemented by both, which is best depends on the usage of the database.
A trigger is fired once for each affected row. A rule modifies the query or generates an additional
query. So if many rows are affected in one statement, a rule issuing one extra command is likely to
be faster than a trigger that is called for every single row and must re-determine what to do many
times. However, the trigger approach is conceptually far simpler than the rule approach, and is easier
for novices to get right.
Here we show an example of how the choice of rules versus triggers plays out in one situation. There
are two tables:
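A sketch of the two tables and the indexes referred to below (the definitions are illustrative; only the index names appear in the plans that follow):

CREATE TABLE computer (
    hostname     text,
    manufacturer text
);

CREATE TABLE software (
    hostname text,
    package  text
);

CREATE UNIQUE INDEX comp_hostidx ON computer (hostname);
CREATE INDEX comp_manufidx ON computer (manufacturer);
CREATE UNIQUE INDEX soft_hostidx ON software (hostname);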
Both tables have many thousands of rows and the indexes on hostname are unique. The rule or
trigger should implement a constraint that deletes rows from software that reference a deleted
computer. The trigger would use this command:

DELETE FROM software WHERE hostname = $1;
Since the trigger is called for each individual row deleted from computer, it can prepare and save the
plan for this command and pass the hostname value in the parameter. The rule would be written as:

CREATE RULE computer_del AS ON DELETE TO computer
    DO DELETE FROM software WHERE hostname = OLD.hostname;

Now we look at different types of deletes. In the case of a:

DELETE FROM computer WHERE hostname = 'mypc.local.net';
the table computer is scanned by index (fast), and the command issued by the trigger would also
use an index scan (also fast). The extra command from the rule would be:

DELETE FROM software WHERE computer.hostname = 'mypc.local.net'
                       AND software.hostname = computer.hostname;
Since there are appropriate indexes set up, the planner will create a plan of
Nestloop
-> Index Scan using comp_hostidx on computer
-> Index Scan using soft_hostidx on software
So there would be not that much difference in speed between the trigger and the rule implementation.
With the next delete we want to get rid of all the 2000 computers where the hostname starts with
old. There are two possible commands to do that. One is:
DELETE FROM computer WHERE hostname >= 'old' AND hostname < 'ole';

The command added by the rule will be:

DELETE FROM software WHERE computer.hostname >= 'old' AND computer.hostname < 'ole'
                       AND software.hostname = computer.hostname;

with the plan

Hash Join
  ->  Seq Scan on software
  ->  Hash
    ->  Index Scan using comp_hostidx on computer

The other possible command is:

DELETE FROM computer WHERE hostname ~ '^old';
which results in the following executing plan for the command added by the rule:
Nestloop
-> Index Scan using comp_hostidx on computer
-> Index Scan using soft_hostidx on software
This shows that the planner does not realize that the qualification for hostname in computer could
also be used for an index scan on software when there are multiple qualification expressions com-
bined with AND, which is what it does in the regular-expression version of the command. The trigger
will get invoked once for each of the 2000 old computers that have to be deleted, and that will result in
one index scan over computer and 2000 index scans over software. The rule implementation will
do it with two commands that use indexes. And it depends on the overall size of the table software
whether the rule will still be faster in the sequential scan situation. 2000 command executions from
the trigger over the SPI manager take some time, even if all the index blocks will soon be in the cache.
The last command we look at is:

DELETE FROM computer WHERE manufacturer = 'bim';

Again this could result in many rows to be deleted from computer. So the trigger will again run many commands through the executor. The command generated by the rule will be:

DELETE FROM software WHERE computer.manufacturer = 'bim'
                       AND software.hostname = computer.hostname;
The plan for that command will again be the nested loop over two index scans, only using a different
index on computer:
Nestloop
-> Index Scan using comp_manufidx on computer
-> Index Scan using soft_hostidx on software
In any of these cases, the extra commands from the rule system will be more or less independent from
the number of affected rows in a command.
In summary, rules will only be significantly slower than triggers if their actions result in large and badly qualified joins, a situation where the planner fails.
Chapter 42. Procedural Languages
PostgreSQL allows user-defined functions to be written in other languages besides SQL and C. These
other languages are generically called procedural languages (PLs). For a function written in a proce-
dural language, the database server has no built-in knowledge about how to interpret the function's
source text. Instead, the task is passed to a special handler that knows the details of the language.
The handler could either do all the work of parsing, syntax analysis, execution, etc. itself, or it could
serve as “glue” between PostgreSQL and an existing implementation of a programming language. The
handler itself is a C language function compiled into a shared object and loaded on demand, just like
any other C function.
There are currently four procedural languages available in the standard PostgreSQL distribution: PL/
pgSQL (Chapter 43), PL/Tcl (Chapter 44), PL/Perl (Chapter 45), and PL/Python (Chapter 46). There
are additional procedural languages available that are not included in the core distribution. Appendix H
has information about finding them. In addition other languages can be defined by users; the basics of
developing a new procedural language are covered in Chapter 56.
For the languages supplied with the standard distribution, it is only necessary to execute CREATE
EXTENSION language_name to install the language into the current database. The manual pro-
cedure described below is only recommended for installing languages that have not been packaged
as extensions.
1. The shared object for the language handler must be compiled and installed into an appropriate
library directory. This works in the same way as building and installing modules with regular
user-defined C functions does; see Section 38.10.5. Often, the language handler will depend on
an external library that provides the actual programming language engine; if so, that must be
installed as well.
The special return type of language_handler tells the database system that this function
does not return one of the defined SQL data types and is not directly usable in SQL statements.
3. Optionally, the language handler can provide an “inline” handler function that executes anonymous code blocks (DO commands) written in this language. If an inline handler function is provided by the language, declare it with a command like

CREATE FUNCTION inline_function_name(internal)
    RETURNS void
    AS 'path-to-shared-object'
    LANGUAGE C;
4. Optionally, the language handler can provide a “validator” function that checks a function definition for correctness without actually executing it. The validator function is called by CREATE FUNCTION if it exists. If a validator function is provided by the language, declare it with a command like

CREATE FUNCTION validator_function_name(oid)
    RETURNS void
    AS 'path-to-shared-object'
    LANGUAGE C;

5. Finally, the PL must be declared with the command

CREATE [ TRUSTED ] LANGUAGE language_name
    HANDLER handler_function_name
    [ INLINE inline_function_name ]
    [ VALIDATOR validator_function_name ];
The optional key word TRUSTED specifies that the language does not grant access to data that the
user would not otherwise have. Trusted languages are designed for ordinary database users (those
without superuser privilege) and allow them to safely create functions and procedures. Since PL
functions are executed inside the database server, the TRUSTED flag should only be given for
languages that do not allow access to database server internals or the file system. The languages
PL/pgSQL, PL/Tcl, and PL/Perl are considered trusted; the languages PL/TclU, PL/PerlU, and
PL/PythonU are designed to provide unlimited functionality and should not be marked trusted.
Example 42.1 shows how the manual installation procedure would work with the language PL/Perl.
The following command tells the database server where to find the shared object for the PL/Perl language's call handler function:

CREATE FUNCTION plperl_call_handler() RETURNS language_handler
    AS '$libdir/plperl' LANGUAGE C;

PL/Perl has an inline handler function and a validator function, so we declare those too:

CREATE FUNCTION plperl_inline_handler(internal) RETURNS void
    AS '$libdir/plperl' LANGUAGE C;

CREATE FUNCTION plperl_validator(oid) RETURNS void
    AS '$libdir/plperl' LANGUAGE C STRICT;

The command:

CREATE TRUSTED LANGUAGE plperl
    HANDLER plperl_call_handler
    INLINE plperl_inline_handler
    VALIDATOR plperl_validator;
then defines that the previously declared functions should be invoked for functions and procedures
where the language attribute is plperl.
In a default PostgreSQL installation, the handler for the PL/pgSQL language is built and installed into
the “library” directory; furthermore, the PL/pgSQL language itself is installed in all databases. If Tcl
support is configured in, the handlers for PL/Tcl and PL/TclU are built and installed in the library
directory, but the language itself is not installed in any database by default. Likewise, the PL/Perl and
PL/PerlU handlers are built and installed if Perl support is configured, and the PL/PythonU handler is
installed if Python support is configured, but these languages are not installed by default.
Chapter 43. PL/pgSQL - SQL
Procedural Language
43.1. Overview
PL/pgSQL is a loadable procedural language for the PostgreSQL database system. The design goals
of PL/pgSQL were to create a loadable procedural language that
• can be used to create functions and trigger procedures,
• adds control structures to the SQL language,
• can perform complex computations,
• inherits all user-defined types, functions, and operators,
• can be defined to be trusted by the server,
• is easy to use.
Functions created with PL/pgSQL can be used anywhere that built-in functions could be used. For
example, it is possible to create complex conditional computation functions and later use them to
define operators or use them in index expressions.
In PostgreSQL 9.0 and later, PL/pgSQL is installed by default. However it is still a loadable module,
so especially security-conscious administrators could choose to remove it.
SQL is the language PostgreSQL and most other relational databases use as query language. It's portable and easy to learn. But every SQL statement must be executed individually by the database server. That means that your client application must send each query to the database server, wait for it to
be processed, receive and process the results, do some computation, then send further queries to the
server. All this incurs interprocess communication and will also incur network overhead if your client
is on a different machine than the database server.
With PL/pgSQL you can group a block of computation and a series of queries inside the database
server, thus having the power of a procedural language and the ease of use of SQL, but with consid-
erable savings of client/server communication overhead.
• Extra round trips between client and server are eliminated
• Intermediate results that the client does not need do not have to be marshaled or transferred between server and client
• Multiple rounds of query parsing can be avoided
This can result in a considerable performance increase as compared to an application that does not
use stored functions.
Also, with PL/pgSQL you can use all the data types, operators and functions of SQL.
PL/pgSQL functions can accept as arguments any scalar or array data type supported by the server,
and they can return a result of any of these types. They can also accept or return any composite
type (row type) specified by name. It is also possible to declare a PL/pgSQL function as accepting
record, which means that any composite type will do as input, or as returning record, which
means that the result is a row type whose columns are determined by specification in the calling query,
as discussed in Section 7.2.1.4.
PL/pgSQL functions can be declared to accept a variable number of arguments by using the
VARIADIC marker. This works exactly the same way as for SQL functions, as discussed in Sec-
tion 38.5.5.
PL/pgSQL functions can also be declared to accept and return the polymorphic types anyelement,
anyarray, anynonarray, anyenum, and anyrange. The actual data types handled by a poly-
morphic function can vary from call to call, as discussed in Section 38.2.5. An example is shown in
Section 43.3.1.
PL/pgSQL functions can also be declared to return a “set” (or table) of any data type that can be
returned as a single instance. Such a function generates its output by executing RETURN NEXT for
each desired element of the result set, or by using RETURN QUERY to output the result of evaluating
a query.
Finally, a PL/pgSQL function can be declared to return void if it has no useful return value. (Alter-
natively, it could be written as a procedure in that case.)
PL/pgSQL functions can also be declared with output parameters in place of an explicit specification of
the return type. This does not add any fundamental capability to the language, but it is often convenient,
especially for returning multiple values. The RETURNS TABLE notation can also be used in place
of RETURNS SETOF.
The function body is simply a string literal so far as CREATE FUNCTION is concerned. It is often
helpful to use dollar quoting (see Section 4.1.2.4) to write the function body, rather than the normal
single quote syntax. Without dollar quoting, any single quotes or backslashes in the function body
must be escaped by doubling them. Almost all the examples in this chapter use dollar-quoted literals
for their function bodies.
PL/pgSQL is a block-structured language. The complete text of a function body must be a block. A
block is defined as:
[ <<label>> ]
[ DECLARE
declarations ]
BEGIN
statements
END [ label ];
Each declaration and each statement within a block is terminated by a semicolon. A block that appears
within another block must have a semicolon after END, as shown above; however the final END that
concludes a function body does not require a semicolon.
Tip
A common mistake is to write a semicolon immediately after BEGIN. This is incorrect and
will result in a syntax error.
A label is only needed if you want to identify the block for use in an EXIT statement, or to qualify
the names of the variables declared in the block. If a label is given after END, it must match the label
at the block's beginning.
All key words are case-insensitive. Identifiers are implicitly converted to lower case unless dou-
ble-quoted, just as they are in ordinary SQL commands.
Comments work the same way in PL/pgSQL code as in ordinary SQL. A double dash (--) starts a
comment that extends to the end of the line. A /* starts a block comment that extends to the matching
occurrence of */. Block comments nest.
Any statement in the statement section of a block can be a subblock. Subblocks can be used for logical
grouping or to localize variables to a small group of statements. Variables declared in a subblock mask
any similarly-named variables of outer blocks for the duration of the subblock; but you can access the
outer variables anyway if you qualify their names with their block's label. For example:
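A sketch of such a function (the name somefunc and the quantity values are chosen for illustration):
CREATE FUNCTION somefunc() RETURNS integer AS $$
<< outerblock >>
DECLARE
    quantity integer := 30;
BEGIN
    RAISE NOTICE 'Quantity here is %', quantity;  -- Prints 30
    quantity := 50;
    --
    -- Create a subblock
    --
    DECLARE
        quantity integer := 80;
    BEGIN
        RAISE NOTICE 'Quantity here is %', quantity;  -- Prints 80
        RAISE NOTICE 'Outer quantity here is %', outerblock.quantity;  -- Prints 50
    END;

    RAISE NOTICE 'Quantity here is %', quantity;  -- Prints 50
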
RETURN quantity;
END;
$$ LANGUAGE plpgsql;
Note
There is actually a hidden “outer block” surrounding the body of any PL/pgSQL function. This
block provides the declarations of the function's parameters (if any), as well as some special
variables such as FOUND (see Section 43.5.5). The outer block is labeled with the function's
name, meaning that parameters and special variables can be qualified with the function's name.
It is important not to confuse the use of BEGIN/END for grouping statements in PL/pgSQL with the
similarly-named SQL commands for transaction control. PL/pgSQL's BEGIN/END are only for group-
ing; they do not start or end a transaction. See Section 43.8 for information on managing transactions
in PL/pgSQL. Also, a block containing an EXCEPTION clause effectively forms a subtransaction that
can be rolled back without affecting the outer transaction. For more about that see Section 43.6.8.
43.3. Declarations
All variables used in a block must be declared in the declarations section of the block. (The only ex-
ceptions are that the loop variable of a FOR loop iterating over a range of integer values is automat-
ically declared as an integer variable, and likewise the loop variable of a FOR loop iterating over a
cursor's result is automatically declared as a record variable.)
PL/pgSQL variables can have any SQL data type, such as integer, varchar, and char.
user_id integer;
quantity numeric(5);
url varchar;
myrow tablename%ROWTYPE;
myfield tablename.columnname%TYPE;
arow RECORD;
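The general syntax of a variable declaration is:
name [ CONSTANT ] type [ COLLATE collation_name ] [ NOT NULL ] [ { DEFAULT | := | = } expression ];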
The DEFAULT clause, if given, specifies the initial value assigned to the variable when the block is
entered. If the DEFAULT clause is not given then the variable is initialized to the SQL null value. The
CONSTANT option prevents the variable from being assigned to after initialization, so that its value
will remain constant for the duration of the block. The COLLATE option specifies a collation to use
for the variable (see Section 43.3.6). If NOT NULL is specified, an assignment of a null value results
in a run-time error. All variables declared as NOT NULL must have a nonnull default value specified.
Equal (=) can be used instead of PL/SQL-compliant :=.
A variable's default value is evaluated and assigned to the variable each time the block is entered (not
just once per function call). So, for example, assigning now() to a variable of type timestamp
causes the variable to have the time of the current function call, not the time when the function was
precompiled.
Examples:
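(illustrative declarations; the values are arbitrary)
quantity integer DEFAULT 32;
url varchar := 'http://mysite.com';
user_id CONSTANT integer := 10;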
There are two ways to create an alias. The preferred way is to give a name to the parameter in the
CREATE FUNCTION command, for example:
CREATE FUNCTION sales_tax(subtotal real) RETURNS real AS $$
BEGIN
    RETURN subtotal * 0.06;
END;
$$ LANGUAGE plpgsql;
The other way is to explicitly declare an alias, using the declaration syntax
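newname ALIAS FOR oldname;
For instance, a sketch of the sales_tax example using this approach:
CREATE FUNCTION sales_tax(real) RETURNS real AS $$
DECLARE
    subtotal ALIAS FOR $1;
BEGIN
    RETURN subtotal * 0.06;
END;
$$ LANGUAGE plpgsql;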
Note
These two examples are not perfectly equivalent. In the first case, subtotal could be refer-
enced as sales_tax.subtotal, but in the second case it could not. (Had we attached a
label to the inner block, subtotal could be qualified with that label, instead.)
When a PL/pgSQL function is declared with output parameters, the output parameters are given $n
names and optional aliases in just the same way as the normal input parameters. An output parameter
is effectively a variable that starts out NULL; it should be assigned to during the execution of the
function. The final value of the parameter is what is returned. For instance, the sales-tax example could
also be done this way:
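(a sketch, assuming an output parameter named tax)
CREATE FUNCTION sales_tax(subtotal real, OUT tax real) AS $$
BEGIN
    tax := subtotal * 0.06;
END;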
$$ LANGUAGE plpgsql;
Notice that we omitted RETURNS real — we could have included it, but it would be redundant.
Output parameters are most useful when returning multiple values. A trivial example is:
CREATE FUNCTION sum_n_product(x int, y int, OUT sum int, OUT prod int) AS $$
BEGIN
    sum := x + y;
    prod := x * y;
END;
$$ LANGUAGE plpgsql;
As discussed in Section 38.5.4, this effectively creates an anonymous record type for the function's
results. If a RETURNS clause is given, it must say RETURNS record.
Another way to declare a PL/pgSQL function is with RETURNS TABLE, for example:
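(a sketch; the table and column names are illustrative)
CREATE FUNCTION extended_sales(p_itemno int)
RETURNS TABLE(quantity int, total numeric) AS $$
BEGIN
    RETURN QUERY SELECT s.quantity, s.quantity * s.price FROM sales AS s
                 WHERE s.itemno = p_itemno;
END;
$$ LANGUAGE plpgsql;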
This is exactly equivalent to declaring one or more OUT parameters and specifying RETURNS SETOF
sometype.
When the return type of a PL/pgSQL function is declared as a polymorphic type (anyelement,
anyarray, anynonarray, anyenum, or anyrange), a special parameter $0 is created. Its data
type is the actual return type of the function, as deduced from the actual input types (see Section 38.2.5).
This allows the function to access its actual return type as shown in Section 43.3.3. $0 is initialized to
null and can be modified by the function, so it can be used to hold the return value if desired, though
that is not required. $0 can also be given an alias. For example, this function works on any data type
that has a + operator:
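(a sketch; add_three_values is an illustrative name)
CREATE FUNCTION add_three_values(v1 anyelement, v2 anyelement, v3 anyelement)
RETURNS anyelement AS $$
DECLARE
    result ALIAS FOR $0;
BEGIN
    result := v1 + v2 + v3;
    RETURN result;
END;
$$ LANGUAGE plpgsql;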
The same effect can be obtained by declaring one or more output parameters as polymorphic types.
In this case the special $0 parameter is not used; the output parameters themselves serve the same
purpose. For example:
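(the same sketch, recast with a polymorphic output parameter)
CREATE FUNCTION add_three_values(v1 anyelement, v2 anyelement, v3 anyelement,
                                 OUT sum anyelement) AS $$
BEGIN
    sum := v1 + v2 + v3;
END;
$$ LANGUAGE plpgsql;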
43.3.2. ALIAS
The ALIAS syntax is more general than is suggested in the previous section: you can declare an alias
for any variable, not just function parameters. The main practical use for this is to assign a different
name for variables with predetermined names, such as NEW or OLD within a trigger function.
Examples:
DECLARE
prior ALIAS FOR old;
updated ALIAS FOR new;
Since ALIAS creates two different ways to name the same object, unrestricted use can be confusing.
It's best to use it only for the purpose of overriding predetermined names.
variable%TYPE
%TYPE provides the data type of a variable or table column. You can use this to declare variables that
will hold database values. For example, let's say you have a column named user_id in your users
table. To declare a variable with the same data type as users.user_id you write:
user_id users.user_id%TYPE;
By using %TYPE you don't need to know the data type of the structure you are referencing, and most
importantly, if the data type of the referenced item changes in the future (for instance: you change the
type of user_id from integer to real), you might not need to change your function definition.
%TYPE is particularly valuable in polymorphic functions, since the data types needed for internal
variables can change from one call to the next. Appropriate variables can be created by applying
%TYPE to the function's arguments or result placeholders.
name table_name%ROWTYPE;
name composite_type_name;
A variable of a composite type is called a row variable (or row-type variable). Such a variable can
hold a whole row of a SELECT or FOR query result, so long as that query's column set matches the
declared type of the variable. The individual fields of the row value are accessed using the usual dot
notation, for example rowvar.field.
A row variable can be declared to have the same type as the rows of an existing table or view, by
using the table_name%ROWTYPE notation; or it can be declared by giving a composite type's name.
(Since every table has an associated composite type of the same name, it actually does not matter in
PostgreSQL whether you write %ROWTYPE or not. But the form with %ROWTYPE is more portable.)
Parameters to a function can be composite types (complete table rows). In that case, the corresponding
identifier $n will be a row variable, and fields can be selected from it, for example $1.user_id.
Here is an example of using composite types. table1 and table2 are existing tables having at least
the mentioned fields:
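(a sketch; the field names f1, f3, f5, and f7 are placeholders)
CREATE FUNCTION merge_fields(t_row table1) RETURNS text AS $$
DECLARE
    t2_row table2%ROWTYPE;
BEGIN
    SELECT * INTO t2_row FROM table2 WHERE ... ;
    RETURN t_row.f1 || t2_row.f3 || t_row.f5 || t2_row.f7;
END;
$$ LANGUAGE plpgsql;

SELECT merge_fields(t.*) FROM table1 t WHERE ... ;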
name RECORD;
Record variables are similar to row-type variables, but they have no predefined structure. They take
on the actual row structure of the row they are assigned during a SELECT or FOR command. The
substructure of a record variable can change each time it is assigned to. A consequence of this is that
until a record variable is first assigned to, it has no substructure, and any attempt to access a field in
it will draw a run-time error.
Note that RECORD is not a true data type, only a placeholder. One should also realize that when a
PL/pgSQL function is declared to return type record, this is not quite the same concept as a record
variable, even though such a function might use a record variable to hold its result. In both cases the
actual row structure is unknown when the function is written, but for a function returning record the
actual structure is determined when the calling query is parsed, whereas a record variable can change
its row structure on-the-fly.
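When a PL/pgSQL function has one or more parameters of collatable data types, a collation is
identified for each function call depending on the collations assigned to the actual arguments. A
sketch of the example under discussion (table1 and its two text columns are assumed):
CREATE FUNCTION less_than(a text, b text) RETURNS boolean AS $$
BEGIN
    RETURN a < b;
END;
$$ LANGUAGE plpgsql;

SELECT less_than(text_field_1, text_field_2) FROM table1;
SELECT less_than(text_field_1, text_field_2 COLLATE "C") FROM table1;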
The first use of less_than will use the common collation of text_field_1 and text_field_2
for the comparison, while the second use will use C collation.
Furthermore, the identified collation is also assumed as the collation of any local variables that are of
collatable types. Thus this function would not work any differently if it were written as
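(the same sketch with explicitly declared local variables)
CREATE FUNCTION less_than(a text, b text) RETURNS boolean AS $$
DECLARE
    local_a text := a;
    local_b text := b;
BEGIN
    RETURN local_a < local_b;
END;
$$ LANGUAGE plpgsql;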
If there are no parameters of collatable data types, or no common collation can be identified for them,
then parameters and local variables use the default collation of their data type (which is usually the
database's default collation, but could be different for variables of domain types).
A local variable of a collatable data type can have a different collation associated with it by including
the COLLATE option in its declaration, for example
DECLARE
local_a text COLLATE "en_US";
This option overrides the collation that would otherwise be given to the variable according to the rules
above.
Also, of course explicit COLLATE clauses can be written inside a function if it is desired to force a
particular collation to be used in a particular operation. For example,
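(a sketch forcing the "C" collation in the comparison)
CREATE FUNCTION less_than_c(a text, b text) RETURNS boolean AS $$
BEGIN
    RETURN a < b COLLATE "C";
END;
$$ LANGUAGE plpgsql;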
This overrides the collations associated with the table columns, parameters, or local variables used in
the expression, just as would happen in a plain SQL command.
43.4. Expressions
All expressions used in PL/pgSQL statements are processed using the server's main SQL executor.
For example, when you write a PL/pgSQL statement like
IF expression THEN ...
PL/pgSQL will evaluate the expression by feeding a query like
SELECT expression
to the main SQL engine. While forming the SELECT command, any occurrences of PL/pgSQL vari-
able names are replaced by parameters, as discussed in detail in Section 43.11.1. This allows the query
plan for the SELECT to be prepared just once and then reused for subsequent evaluations with differ-
ent values of the variables. Thus, what really happens on first use of an expression is essentially a
PREPARE command. For example, if we have declared two integer variables x and y, and we write
IF x < y THEN ...
what happens behind the scenes is equivalent to
PREPARE statement_name(integer, integer) AS SELECT $1 < $2;
and then this prepared statement is EXECUTEd for each execution of the IF statement, with the current
values of the PL/pgSQL variables supplied as parameter values. Normally these details are not
important to a PL/pgSQL user, but they are useful to know when trying to diagnose a problem. More
information appears in Section 43.11.2.
43.5.1. Assignment
An assignment of a value to a PL/pgSQL variable is written as:
variable { := | = } expression;
As explained previously, the expression in such a statement is evaluated by means of an SQL SELECT
command sent to the main database engine. The expression must yield a single value (possibly a
row value, if the variable is a row or record variable). The target variable can be a simple variable
(optionally qualified with a block name), a field of a row or record variable, or an element of an array
that is a simple variable or field. Equal (=) can be used instead of PL/SQL-compliant :=.
If the expression's result data type doesn't match the variable's data type, the value will be coerced as
though by an assignment cast (see Section 10.4). If no assignment cast is known for the pair of data
types involved, the PL/pgSQL interpreter will attempt to convert the result value textually, that is by
applying the result type's output function followed by the variable type's input function. Note that this
could result in run-time errors generated by the input function, if the string form of the result value
is not acceptable to the input function.
Examples:
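(illustrative assignments; the variable and record names are placeholders)
tax := subtotal * 0.06;
my_record.user_id := 20;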
Any PL/pgSQL variable name appearing in the command text is treated as a parameter, and then the
current value of the variable is provided as the parameter value at run time. This is exactly like the
processing described earlier for expressions; for details see Section 43.11.1.
When executing a SQL command in this way, PL/pgSQL may cache and re-use the execution plan
for the command, as discussed in Section 43.11.2.
Sometimes it is useful to evaluate an expression or SELECT query but discard the result, for example
when calling a function that has side-effects but no useful result value. To do this in PL/pgSQL, use
the PERFORM statement:
PERFORM query;
This executes query and discards the result. Write the query the same way you would write an
SQL SELECT command, but replace the initial keyword SELECT with PERFORM. For WITH queries,
use PERFORM and then place the query in parentheses. (In this case, the query can only return one
row.) PL/pgSQL variables will be substituted into the query just as for commands that return no result,
and the plan is cached in the same way. Also, the special variable FOUND is set to true if the query
produced at least one row, or false if it produced no rows (see Section 43.5.5).
Note
One might expect that writing SELECT directly would accomplish this result, but at present
the only accepted way to do it is PERFORM. A SQL command that can return rows, such as
SELECT, will be rejected as an error unless it has an INTO clause as discussed in the next
section.
An example:
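(a sketch; create_mv and my_query are placeholders for a function with side effects and a saved query string)
PERFORM create_mv('cs_session_page_requests_mv', my_query);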
where target can be a record variable, a row variable, or a comma-separated list of simple variables
and record/row fields. PL/pgSQL variables will be substituted into the rest of the query, and the plan
is cached, just as described above for commands that do not return rows. This works for SELECT,
INSERT/UPDATE/DELETE with RETURNING, and utility commands that return row-set results (such
as EXPLAIN). Except for the INTO clause, the SQL command is the same as it would be written
outside PL/pgSQL.
Tip
Note that this interpretation of SELECT with INTO is quite different from PostgreSQL's reg-
ular SELECT INTO command, wherein the INTO target is a newly created table. If you want
to create a table from a SELECT result inside a PL/pgSQL function, use the syntax CREATE
TABLE ... AS SELECT.
If a row or a variable list is used as target, the query's result columns must exactly match the structure
of the target as to number and data types, or else a run-time error occurs. When a record variable is
the target, it automatically configures itself to the row type of the query result columns.
The INTO clause can appear almost anywhere in the SQL command. Customarily it is written either
just before or just after the list of select_expressions in a SELECT command, or at the end of
the command for other command types. It is recommended that you follow this convention in case the
PL/pgSQL parser becomes stricter in future versions.
If STRICT is not specified in the INTO clause, then target will be set to the first row returned by
the query, or to nulls if the query returned no rows. (Note that “the first row” is not well-defined unless
you've used ORDER BY.) Any result rows after the first row are discarded. You can check the special
FOUND variable (see Section 43.5.5) to determine whether a row was returned:
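(a sketch using the same emp table as the STRICT example below)
SELECT * INTO myrec FROM emp WHERE empname = myname;
IF NOT FOUND THEN
    RAISE EXCEPTION 'employee % not found', myname;
END IF;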
If the STRICT option is specified, the query must return exactly one row or a run-time error will be
reported, either NO_DATA_FOUND (no rows) or TOO_MANY_ROWS (more than one row). You can
use an exception block if you wish to catch the error, for example:
BEGIN
SELECT * INTO STRICT myrec FROM emp WHERE empname = myname;
EXCEPTION
WHEN NO_DATA_FOUND THEN
RAISE EXCEPTION 'employee % not found', myname;
WHEN TOO_MANY_ROWS THEN
RAISE EXCEPTION 'employee % not unique', myname;
END;
For INSERT/UPDATE/DELETE with RETURNING, PL/pgSQL reports an error for more than one
returned row, even when STRICT is not specified. This is because there is no option such as ORDER
BY with which to determine which affected row should be returned.
If print_strict_params is enabled for the function, then when an error is thrown because the
requirements of STRICT are not met, the DETAIL part of the error message will include information
about the parameters passed to the query. You can change the print_strict_params setting
for all functions by setting plpgsql.print_strict_params, though only subsequent function
compilations will be affected. You can also enable it on a per-function basis by using a compiler
option, for example:
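(a sketch; get_userid and the users table are illustrative)
CREATE FUNCTION get_userid(username text) RETURNS int AS $$
#print_strict_params on
DECLARE
    userid int;
BEGIN
    SELECT users.userid INTO STRICT userid
        FROM users WHERE users.username = get_userid.username;
    RETURN userid;
END;
$$ LANGUAGE plpgsql;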
Note
The STRICT option matches the behavior of Oracle PL/SQL's SELECT INTO and related
statements.
To handle cases where you need to process multiple result rows from a SQL query, see Section 43.6.6.
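Dynamic commands are executed with the EXECUTE statement, whose general form is:
EXECUTE command-string [ INTO [STRICT] target ] [ USING expression [, ... ] ];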
where command-string is an expression yielding a string (of type text) containing the command
to be executed. The optional target is a record variable, a row variable, or a comma-separated list
of simple variables and record/row fields, into which the results of the command will be stored. The
optional USING expressions supply values to be inserted into the command.
No substitution of PL/pgSQL variables is done on the computed command string. Any required vari-
able values must be inserted in the command string as it is constructed; or you can use parameters
as described below.
Also, there is no plan caching for commands executed via EXECUTE. Instead, the command is always
planned each time the statement is run. Thus the command string can be dynamically created within
the function to perform actions on different tables and columns.
The INTO clause specifies where the results of a SQL command returning rows should be assigned.
If a row or variable list is provided, it must exactly match the structure of the query's results (when a
record variable is used, it will configure itself to match the result structure automatically). If multiple
rows are returned, only the first will be assigned to the INTO variable. If no rows are returned, NULL
is assigned to the INTO variable(s). If no INTO clause is specified, the query results are discarded.
If the STRICT option is given, an error is reported unless the query produces exactly one row.
The command string can use parameter values, which are referenced in the command as $1, $2,
etc. These symbols refer to values supplied in the USING clause. This method is often preferable to
inserting data values into the command string as text: it avoids run-time overhead of converting the
values to text and back, and it is much less prone to SQL-injection attacks since there is no need for
quoting or escaping. An example is:
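(a sketch; mytable and the variable c are placeholders)
EXECUTE 'SELECT count(*) FROM mytable WHERE inserted_by = $1 AND inserted <= $2'
   INTO c
   USING checked_user, checked_date;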
Note that parameter symbols can only be used for data values — if you want to use dynamically deter-
mined table or column names, you must insert them into the command string textually. For example,
if the preceding query needed to be done against a dynamically selected table, you could do this:
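(a sketch; tabname holds the table name to be substituted)
EXECUTE 'SELECT count(*) FROM '
    || quote_ident(tabname)
    || ' WHERE inserted_by = $1 AND inserted <= $2'
   INTO c
   USING checked_user, checked_date;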
A cleaner approach is to use format()'s %I specification for table or column names (strings sepa-
rated by a newline are concatenated):
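(a sketch continuing the same count query)
EXECUTE format('SELECT count(*) FROM %I '
   'WHERE inserted_by = $1 AND inserted <= $2', tabname)
   INTO c
   USING checked_user, checked_date;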
Another restriction on parameter symbols is that they only work in SELECT, INSERT, UPDATE, and
DELETE commands. In other statement types (generically called utility statements), you must insert
values textually even if they are just data values.
An EXECUTE with a simple constant command string and some USING parameters, as in the first
example above, is functionally equivalent to just writing the command directly in PL/pgSQL and
allowing replacement of PL/pgSQL variables to happen automatically. The important difference is
that EXECUTE will re-plan the command on each execution, generating a plan that is specific to the
current parameter values; whereas PL/pgSQL may otherwise create a generic plan and cache it for re-
use. In situations where the best plan depends strongly on the parameter values, it can be helpful to
use EXECUTE to positively ensure that a generic plan is not selected.
SELECT INTO is not currently supported within EXECUTE; instead, execute a plain SELECT com-
mand and specify INTO as part of the EXECUTE itself.
Note
The PL/pgSQL EXECUTE statement is not related to the EXECUTE SQL statement supported
by the PostgreSQL server. The server's EXECUTE statement cannot be used directly within
PL/pgSQL functions (and is not needed).
When working with dynamic commands you will often have to handle escaping of single quotes. The
recommended method for quoting fixed text in your function body is dollar quoting. (If you have
legacy code that does not use dollar quoting, please refer to the overview in Section 43.12.1, which
can save you some effort when translating said code to a more reasonable scheme.)
Dynamic values require careful handling since they might contain quote characters. An example using
format() (this assumes that you are dollar quoting the function body so quote marks need not be
doubled):
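(a sketch; tbl, colname, newvalue, and keyvalue are placeholders)
EXECUTE format('UPDATE tbl SET %I = $1 '
   'WHERE key = $2', colname) USING newvalue, keyvalue;
It is also possible to call the quoting functions directly: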
EXECUTE 'UPDATE tbl SET '
        || quote_ident(colname)
        || ' = '
        || quote_literal(newvalue)
        || ' WHERE key = '
        || quote_literal(keyvalue);
This example demonstrates the use of the quote_ident and quote_literal functions (see
Section 9.4). For safety, expressions containing column or table identifiers should be passed through
quote_ident before insertion in a dynamic query. Expressions containing values that should be
literal strings in the constructed command should be passed through quote_literal. These func-
tions take the appropriate steps to return the input text enclosed in double or single quotes respectively,
with any embedded special characters properly escaped.
Because quote_literal is labeled STRICT, it will always return null when called with a null
argument. In the above example, if newvalue or keyvalue were null, the entire dynamic query
string would become null, leading to an error from EXECUTE. You can avoid this problem by using the
quote_nullable function, which works the same as quote_literal except that when called
with a null argument it returns the string NULL. For example,
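(the same sketch, made null-safe)
EXECUTE 'UPDATE tbl SET '
        || quote_ident(colname)
        || ' = '
        || quote_nullable(newvalue)
        || ' WHERE key = '
        || quote_nullable(keyvalue);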
If you are dealing with values that might be null, you should usually use quote_nullable in place
of quote_literal.
As always, care must be taken to ensure that null values in a query do not deliver unintended results.
For example the WHERE clause
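'WHERE key = ' || quote_nullable(keyvalue)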
will never succeed if keyvalue is null, because the result of using the equality operator = with a
null operand is always null. If you wish null to work like an ordinary key value, you would need to
rewrite the above as
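'WHERE key IS NOT DISTINCT FROM ' || quote_nullable(keyvalue)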
(At present, IS NOT DISTINCT FROM is handled much less efficiently than =, so don't do this
unless you must. See Section 9.2 for more information on nulls and IS DISTINCT.)
Note that dollar quoting is only useful for quoting fixed text. It would be a very bad idea to try to
write this example as:
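(a sketch of the problematic approach)
EXECUTE 'UPDATE tbl SET '
        || quote_ident(colname)
        || ' = $$'
        || newvalue
        || '$$ WHERE key = '
        || quote_literal(keyvalue);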
because it would break if the contents of newvalue happened to contain $$. The same objection
would apply to any other dollar-quoting delimiter you might pick. So, to safely quote text that is not
known in advance, you must use quote_literal, quote_nullable, or quote_ident, as
appropriate.
Dynamic SQL statements can also be safely constructed using the format function (see Sec-
tion 9.4.1). For example:
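(a sketch; %I quotes an identifier and %L quotes a literal)
EXECUTE format('UPDATE tbl SET %I = %L '
   'WHERE key = %L', colname, newvalue, keyvalue);
The format function can also be used in conjunction with the USING clause:
EXECUTE format('UPDATE tbl SET %I = $1 '
   'WHERE key = $2', colname)
   USING newvalue, keyvalue;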
This form is better because the variables are handled in their native data type format, rather than
unconditionally converting them to text and quoting them via %L. It is also more efficient.
A much larger example of a dynamic command and EXECUTE can be seen in Example 43.10, which
builds and executes a CREATE FUNCTION command to define a new function.
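Result status information is obtained with the GET DIAGNOSTICS command, which has the general
form:
GET [ CURRENT ] DIAGNOSTICS variable { = | := } item [ , ... ];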
This command allows retrieval of system status indicators. CURRENT is a noise word (but see also
GET STACKED DIAGNOSTICS in Section 43.6.8.1). Each item is a key word identifying a status
value to be assigned to the specified variable (which should be of the right data type to receive it).
The currently available status items are shown in Table 43.1. Colon-equal (:=) can be used instead
of the SQL-standard = token. An example:
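(a sketch; integer_var is a placeholder variable)
GET DIAGNOSTICS integer_var = ROW_COUNT;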
The second method to determine the effects of a command is to check the special variable named
FOUND, which is of type boolean. FOUND starts out false within each PL/pgSQL function call. It
is set by each of the following types of statements:
• A SELECT INTO statement sets FOUND true if a row is assigned, false if no row is returned.
• A PERFORM statement sets FOUND true if it produces (and discards) one or more rows, false if no
row is produced.
• UPDATE, INSERT, and DELETE statements set FOUND true if at least one row is affected, false
if no row is affected.
• A FETCH statement sets FOUND true if it returns a row, false if no row is returned.
• A MOVE statement sets FOUND true if it successfully repositions the cursor, false otherwise.
• A FOR or FOREACH statement sets FOUND true if it iterates one or more times, else false. FOUND is
set this way when the loop exits; inside the execution of the loop, FOUND is not modified by the loop
statement, although it might be changed by the execution of other statements within the loop body.
• RETURN QUERY and RETURN QUERY EXECUTE statements set FOUND true if the query returns
at least one row, false if no row is returned.
Other PL/pgSQL statements do not change the state of FOUND. Note in particular that EXECUTE
changes the output of GET DIAGNOSTICS, but does not change FOUND.
FOUND is a local variable within each PL/pgSQL function; any changes to it affect only the current
function.
Sometimes a placeholder statement that does nothing is useful, for example to indicate that one branch
of an IF/THEN/ELSE chain is deliberately empty. For this purpose, use the NULL statement:
NULL;
For example, the following two fragments of code are equivalent:
BEGIN
y := x / 0;
EXCEPTION
WHEN division_by_zero THEN
NULL; -- ignore the error
END;
BEGIN
y := x / 0;
EXCEPTION
WHEN division_by_zero THEN -- ignore the error
END;
Note
In Oracle's PL/SQL, empty statement lists are not allowed, and so NULL statements are re-
quired for situations such as this. PL/pgSQL allows you to just write nothing, instead.
43.6.1.1. RETURN
RETURN expression;
RETURN with an expression terminates the function and returns the value of expression to the
caller. This form is used for PL/pgSQL functions that do not return a set.
In a function that returns a scalar type, the expression's result will automatically be cast into the func-
tion's return type as described for assignments. But to return a composite (row) value, you must write
an expression delivering exactly the requested column set. This may require use of explicit casting.
If you declared the function with output parameters, write just RETURN with no expression. The cur-
rent values of the output parameter variables will be returned.
If you declared the function to return void, a RETURN statement can be used to exit the function
early; but do not write an expression following RETURN.
The return value of a function cannot be left undefined. If control reaches the end of the top-level
block of the function without hitting a RETURN statement, a run-time error will occur. This restriction
does not apply to functions with output parameters and functions returning void, however. In those
cases a RETURN statement is automatically executed if the top-level block finishes.
Some examples:
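(sketches; the variable names are placeholders)
-- functions returning a scalar type
RETURN 1 + 2;
RETURN scalar_var;

-- functions returning a composite type
RETURN composite_type_var;
RETURN (1, 2, 'three'::text);  -- must cast columns to correct types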
When a PL/pgSQL function is declared to return SETOF sometype, the procedure to follow is
slightly different. In that case, the individual items to return are specified by a sequence of RETURN
NEXT or RETURN QUERY commands, and then a final RETURN command with no argument is used
to indicate that the function has finished executing. RETURN NEXT can be used with both scalar
and composite data types; with a composite result type, an entire “table” of results will be returned.
RETURN QUERY appends the results of executing a query to the function's result set. RETURN NEXT
and RETURN QUERY can be freely intermixed in a single set-returning function, in which case their
results will be concatenated.
RETURN NEXT and RETURN QUERY do not actually return from the function — they simply append
zero or more rows to the function's result set. Execution then continues with the next statement in the
PL/pgSQL function. As successive RETURN NEXT or RETURN QUERY commands are executed,
the result set is built up. A final RETURN, which should have no argument, causes control to exit the
function (or you can just let control reach the end of the function).
RETURN QUERY has a variant RETURN QUERY EXECUTE, which specifies the query to be executed
dynamically. Parameter expressions can be inserted into the computed query string via USING, in just
the same way as in the EXECUTE command.
If you declared the function with output parameters, write just RETURN NEXT with no expression. On
each execution, the current values of the output parameter variable(s) will be saved for eventual return
as a row of the result. Note that you must declare the function as returning SETOF record when
there are multiple output parameters, or SETOF sometype when there is just one output parameter
of type sometype, in order to create a set-returning function with output parameters.
RETURN;
END;
$BODY$
LANGUAGE plpgsql;
Note
The current implementation of RETURN NEXT and RETURN QUERY stores the entire result
set before returning from the function, as discussed above. That means that if a PL/pgSQL
function produces a very large result set, performance might be poor: data will be written
to disk to avoid memory exhaustion, but the function itself will not return until the entire
result set has been generated. A future version of PL/pgSQL might allow users to define set-
returning functions that do not have this limitation. Currently, the point at which data begins
being written to disk is controlled by the work_mem configuration variable. Administrators
who have sufficient memory to store larger result sets in memory should consider increasing
this parameter.
If the procedure has output parameters, the final values of the output parameter variables will be
returned to the caller.
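For example, the DO block below assumes a procedure along these lines, which triples its INOUT
argument:
CREATE PROCEDURE triple(INOUT x int)
LANGUAGE plpgsql
AS $$
BEGIN
    x := x * 3;
END;
$$;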
DO $$
DECLARE myvar int := 5;
BEGIN
CALL triple(myvar);
RAISE NOTICE 'myvar = %', myvar; -- prints 15
END;
$$;
43.6.4. Conditionals
IF and CASE statements let you execute alternative commands based on certain conditions. PL/pgSQL
has three forms of IF:
• IF ... THEN ... END IF
• IF ... THEN ... ELSE ... END IF
• IF ... THEN ... ELSIF ... THEN ... ELSE ... END IF
and two forms of CASE:
• CASE ... WHEN ... THEN ... ELSE ... END CASE
• CASE WHEN ... THEN ... ELSE ... END CASE
43.6.4.1. IF-THEN
IF boolean-expression THEN
statements
END IF;
IF-THEN statements are the simplest form of IF. The statements between THEN and END IF will
be executed if the condition is true. Otherwise, they are skipped.
Example:
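(a sketch; the table and variables are placeholders)
IF v_user_id <> 0 THEN
    UPDATE users SET email = v_email WHERE user_id = v_user_id;
END IF;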
43.6.4.2. IF-THEN-ELSE
IF boolean-expression THEN
statements
ELSE
statements
END IF;
IF-THEN-ELSE statements add to IF-THEN by letting you specify an alternative set of statements
that should be executed if the condition is not true. (Note this includes the case where the condition
evaluates to NULL.)
Examples:
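(a sketch; the table and variables are placeholders)
IF v_count > 0 THEN
    INSERT INTO users_count (count) VALUES (v_count);
    RETURN 't';
ELSE
    RETURN 'f';
END IF;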
43.6.4.3. IF-THEN-ELSIF
IF boolean-expression THEN
statements
[ ELSIF boolean-expression THEN
statements
[ ELSIF boolean-expression THEN
statements
...
]
]
[ ELSE
statements ]
END IF;
Sometimes there are more than just two alternatives. IF-THEN-ELSIF provides a convenient method
of checking several alternatives in turn. The IF conditions are tested successively until the first one that
is true is found. Then the associated statement(s) are executed, after which control passes to the next
statement after END IF. (Any subsequent IF conditions are not tested.) If none of the IF conditions
is true, then the ELSE block (if any) is executed.
Here is an example:
IF number = 0 THEN
result := 'zero';
ELSIF number > 0 THEN
result := 'positive';
ELSIF number < 0 THEN
result := 'negative';
ELSE
-- hmm, the only other possibility is that number is null
result := 'NULL';
END IF;
An alternative way of accomplishing the same task is to nest IF-THEN-ELSE statements, as in the
following example:
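(a sketch; demo_row and pretty_sex are placeholders)
IF demo_row.sex = 'm' THEN
    pretty_sex := 'man';
ELSE
    IF demo_row.sex = 'f' THEN
        pretty_sex := 'woman';
    END IF;
END IF;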
However, this method requires writing a matching END IF for each IF, so it is much more cumber-
some than using ELSIF when there are many alternatives.
CASE search-expression
WHEN expression [, expression [ ... ]] THEN
statements
[ WHEN expression [, expression [ ... ]] THEN
statements
... ]
[ ELSE
statements ]
END CASE;
The simple form of CASE provides conditional execution based on equality of operands. The
search-expression is evaluated (once) and successively compared to each expression in
the WHEN clauses. If a match is found, then the corresponding statements are executed, and then
control passes to the next statement after END CASE. (Subsequent WHEN expressions are not evalu-
ated.) If no match is found, the ELSE statements are executed; but if ELSE is not present, then
a CASE_NOT_FOUND exception is raised.
CASE x
WHEN 1, 2 THEN
msg := 'one or two';
ELSE
msg := 'other value than one or two';
END CASE;
CASE
WHEN boolean-expression THEN
statements
[ WHEN boolean-expression THEN
statements
... ]
[ ELSE
statements ]
END CASE;
The searched form of CASE provides conditional execution based on truth of Boolean expressions.
Each WHEN clause's boolean-expression is evaluated in turn, until one is found that yields
true. Then the corresponding statements are executed, and then control passes to the next state-
ment after END CASE. (Subsequent WHEN expressions are not evaluated.) If no true result is found,
the ELSE statements are executed; but if ELSE is not present, then a CASE_NOT_FOUND excep-
tion is raised.
Here is an example:
CASE
WHEN x BETWEEN 0 AND 10 THEN
msg := 'value is between zero and ten';
WHEN x BETWEEN 11 AND 20 THEN
msg := 'value is between eleven and twenty';
END CASE;
This form of CASE is entirely equivalent to IF-THEN-ELSIF, except for the rule that reaching an
omitted ELSE clause results in an error rather than doing nothing.
43.6.5.1. LOOP
[ <<label>> ]
LOOP
statements
END LOOP [ label ];
LOOP defines an unconditional loop that is repeated indefinitely until terminated by an EXIT or RE-
TURN statement. The optional label can be used by EXIT and CONTINUE statements within nested
loops to specify which loop those statements refer to.
43.6.5.2. EXIT
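The general form is:
EXIT [ label ] [ WHEN boolean-expression ];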
If no label is given, the innermost loop is terminated and the statement following END LOOP is
executed next. If label is given, it must be the label of the current or some outer level of nested loop
or block. Then the named loop or block is terminated and control continues with the statement after
the loop's/block's corresponding END.
If WHEN is specified, the loop exit occurs only if boolean-expression is true. Otherwise, control
passes to the statement after EXIT.
EXIT can be used with all types of loops; it is not limited to use with unconditional loops.
When used with a BEGIN block, EXIT passes control to the next statement after the end of the block.
Note that a label must be used for this purpose; an unlabeled EXIT is never considered to match a
BEGIN block. (This is a change from pre-8.4 releases of PostgreSQL, which would allow an unlabeled
EXIT to match a BEGIN block.)
Examples:
LOOP
-- some computations
IF count > 0 THEN
EXIT; -- exit loop
END IF;
END LOOP;
LOOP
-- some computations
EXIT WHEN count > 0; -- same result as previous example
END LOOP;
<<ablock>>
BEGIN
-- some computations
IF stocks > 100000 THEN
EXIT ablock; -- causes exit from the BEGIN block
END IF;
-- computations here will be skipped when stocks > 100000
END;
43.6.5.3. CONTINUE
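The general form is:
CONTINUE [ label ] [ WHEN boolean-expression ];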
If no label is given, the next iteration of the innermost loop is begun. That is, all statements remaining
in the loop body are skipped, and control returns to the loop control expression (if any) to determine
whether another loop iteration is needed. If label is present, it specifies the label of the loop whose
execution will be continued.
If WHEN is specified, the next iteration of the loop is begun only if boolean-expression is true.
Otherwise, control passes to the statement after CONTINUE.
CONTINUE can be used with all types of loops; it is not limited to use with unconditional loops.
Examples:
LOOP
-- some computations
EXIT WHEN count > 100;
CONTINUE WHEN count < 50;
-- some computations for count IN [50 .. 100]
END LOOP;
43.6.5.4. WHILE
[ <<label>> ]
WHILE boolean-expression LOOP
statements
END LOOP [ label ];
For example:
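(sketches; the variables are placeholders)
WHILE amount_owed > 0 AND gift_certificate_balance > 0 LOOP
    -- some computations here
END LOOP;

WHILE NOT done LOOP
    -- some computations here
END LOOP;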
[ <<label>> ]
FOR name IN [ REVERSE ] expression .. expression [ BY expression ]
LOOP
statements
END LOOP [ label ];
This form of FOR creates a loop that iterates over a range of integer values. The variable name is
automatically defined as type integer and exists only inside the loop (any existing definition of the
variable name is ignored within the loop). The two expressions giving the lower and upper bound of
the range are evaluated once when entering the loop. If the BY clause isn't specified the iteration step
is 1, otherwise it's the value specified in the BY clause, which again is evaluated once on loop entry.
If REVERSE is specified then the step value is subtracted, rather than added, after each iteration.
If the lower bound is greater than the upper bound (or less than, in the REVERSE case), the loop body
is not executed at all. No error is raised.
If a label is attached to the FOR loop then the integer loop variable can be referenced with a qualified
name, using that label.
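Some sketches of integer FOR loops (the loop bodies are elided):
FOR i IN 1..10 LOOP
    -- i will take on the values 1,2,3,4,5,6,7,8,9,10 within the loop
END LOOP;

FOR i IN REVERSE 10..1 LOOP
    -- i will take on the values 10,9,8,7,6,5,4,3,2,1 within the loop
END LOOP;

FOR i IN REVERSE 10..1 BY 2 LOOP
    -- i will take on the values 10,8,6,4,2 within the loop
END LOOP;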
[ <<label>> ]
FOR target IN query LOOP
statements
END LOOP [ label ];
The target is a record variable, row variable, or comma-separated list of scalar variables. The tar-
get is successively assigned each row resulting from the query and the loop body is executed for
each row. Here is an example:
FOR mviews IN
SELECT n.nspname AS mv_schema,
c.relname AS mv_name,
pg_catalog.pg_get_userbyid(c.relowner) AS owner
FROM pg_catalog.pg_class c
LEFT JOIN pg_catalog.pg_namespace n ON (n.oid = c.relnamespace)
WHERE c.relkind = 'm'
ORDER BY 1
    LOOP
        -- Now "mviews" has one record with information about the materialized view
        RAISE NOTICE 'Refreshing materialized view %.%.% (owner: %)...',
                     quote_ident(mviews.mv_schema),
                     quote_ident(mviews.mv_name),
                     quote_ident(mviews.owner);
EXECUTE format('REFRESH MATERIALIZED VIEW %I.%I',
mviews.mv_schema, mviews.mv_name);
END LOOP;
If the loop is terminated by an EXIT statement, the last assigned row value is still accessible after
the loop.
The query used in this type of FOR statement can be any SQL command that returns rows to the
caller: SELECT is the most common case, but you can also use INSERT, UPDATE, or DELETE with
a RETURNING clause. Some utility commands such as EXPLAIN will work too.
PL/pgSQL variables are substituted into the query text, and the query plan is cached for possible re-
use, as discussed in detail in Section 43.11.1 and Section 43.11.2.
[ <<label>> ]
FOR target IN EXECUTE text_expression [ USING expression [, ... ] ]
LOOP
statements
END LOOP [ label ];
This is like the previous form, except that the source query is specified as a string expression, which
is evaluated and replanned on each entry to the FOR loop. This allows the programmer to choose the
speed of a preplanned query or the flexibility of a dynamic query, just as with a plain EXECUTE state-
ment. As with EXECUTE, parameter values can be inserted into the dynamic command via USING.
Another way to specify the query whose results should be iterated through is to declare it as a cursor.
This is described in Section 43.7.4.
[ <<label>> ]
FOREACH target [ SLICE number ] IN ARRAY expression LOOP
statements
END LOOP [ label ];
Without SLICE, or if SLICE 0 is specified, the loop iterates through individual elements of the array
produced by evaluating the expression. The target variable is assigned each element value in
sequence, and the loop body is executed for each element. Here is an example of looping through the
elements of an integer array:
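(a sketch; the function sums the elements of its integer-array argument)
CREATE FUNCTION sum(int[]) RETURNS int8 AS $$
DECLARE
s int8 := 0;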
x int;
BEGIN
FOREACH x IN ARRAY $1
LOOP
s := s + x;
END LOOP;
RETURN s;
END;
$$ LANGUAGE plpgsql;
The elements are visited in storage order, regardless of the number of array dimensions. Although the
target is usually just a single variable, it can be a list of variables when looping through an array
of composite values (records). In that case, for each array element, the variables are assigned from
successive columns of the composite value.
With a positive SLICE value, FOREACH iterates through slices of the array rather than single elements.
The SLICE value must be an integer constant not larger than the number of dimensions of the array.
The target variable must be an array, and it receives successive slices of the array value, where
each slice is of the number of dimensions specified by SLICE. Here is an example of iterating through
one-dimensional slices:
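(a sketch of such a function, matching the call shown below)
CREATE FUNCTION scan_rows(int[]) RETURNS void AS $$
DECLARE
    x int[];
BEGIN
    FOREACH x SLICE 1 IN ARRAY $1
    LOOP
        RAISE NOTICE 'row = %', x;
    END LOOP;
END;
$$ LANGUAGE plpgsql;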
SELECT scan_rows(ARRAY[[1,2,3],[4,5,6],[7,8,9],[10,11,12]]);
[ <<label>> ]
[ DECLARE
declarations ]
BEGIN
statements
EXCEPTION
WHEN condition [ OR condition ... ] THEN
handler_statements
[ WHEN condition [ OR condition ... ] THEN
handler_statements
... ]
END;
If no error occurs, this form of block simply executes all the statements, and then control passes
to the next statement after END. But if an error occurs within the statements, further processing
of the statements is abandoned, and control passes to the EXCEPTION list. The list is searched
for the first condition matching the error that occurred. If a match is found, the corresponding
handler_statements are executed, and then control passes to the next statement after END. If
no match is found, the error propagates out as though the EXCEPTION clause were not there at all: the
error can be caught by an enclosing block with EXCEPTION, or if there is none it aborts processing
of the function.
The condition names can be any of those shown in Appendix A. A category name matches
any error within its category. The special condition name OTHERS matches every error type except
QUERY_CANCELED and ASSERT_FAILURE. (It is possible, but often unwise, to trap those two er-
ror types by name.) Condition names are not case-sensitive. Also, an error condition can be specified
by SQLSTATE code; for example these are equivalent:
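WHEN division_by_zero THEN ...
WHEN SQLSTATE '22012' THEN ...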
If a new error occurs within the selected handler_statements, it cannot be caught by this EX-
CEPTION clause, but is propagated out. A surrounding EXCEPTION clause could catch it.
When an error is caught by an EXCEPTION clause, the local variables of the PL/pgSQL function
remain as they were when the error occurred, but all changes to persistent database state within the
block are rolled back. As an example, consider this fragment:
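(a sketch reconstructed from the description that follows; mytab is a placeholder table)
INSERT INTO mytab(firstname, lastname) VALUES('Tom', 'Jones');
BEGIN
    UPDATE mytab SET firstname = 'Joe' WHERE lastname = 'Jones';
    x := x + 1;
    y := x / 0;
EXCEPTION
    WHEN division_by_zero THEN
        RAISE NOTICE 'caught division_by_zero';
        RETURN x;
END;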
When control reaches the assignment to y, it will fail with a division_by_zero error. This will
be caught by the EXCEPTION clause. The value returned in the RETURN statement will be the incre-
mented value of x, but the effects of the UPDATE command will have been rolled back. The INSERT
command preceding the block is not rolled back, however, so the end result is that the database con-
tains Tom Jones not Joe Jones.
Tip
A block containing an EXCEPTION clause is significantly more expensive to enter and exit
than a block without one. Therefore, don't use EXCEPTION without need.
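A typical use is an UPDATE/INSERT retry loop of roughly this shape (db and merge_db are
illustrative names):
CREATE TABLE db (a INT PRIMARY KEY, b TEXT);

CREATE FUNCTION merge_db(key INT, data TEXT) RETURNS VOID AS $$
BEGIN
    LOOP
        -- first try to update the key
        UPDATE db SET b = data WHERE a = key;
        IF found THEN
            RETURN;
        END IF;
        -- not there, so try to insert the key
        -- if someone else inserts the same key concurrently,
        -- we could get a unique-key failure
        BEGIN
            INSERT INTO db(a,b) VALUES (key, data);
            RETURN;
        EXCEPTION WHEN unique_violation THEN
            -- do nothing, and loop to try the UPDATE again
        END;
    END LOOP;
END;
$$ LANGUAGE plpgsql;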
This coding assumes the unique_violation error is caused by the INSERT, and not by, say, an
INSERT in a trigger function on the table. It might also misbehave if there is more than one unique
index on the table, since it will retry the operation regardless of which index caused the error. More
safety could be had by using the features discussed next to check that the trapped error was the one
expected.
Within an exception handler, the special variable SQLSTATE contains the error code that corresponds
to the exception that was raised (refer to Table A.1 for a list of possible error codes). The special
variable SQLERRM contains the error message associated with the exception. These variables are un-
defined outside exception handlers.
Within an exception handler, one may also retrieve information about the current exception by using
the GET STACKED DIAGNOSTICS command, which has the form:
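GET STACKED DIAGNOSTICS variable { = | := } item [ , ... ];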
Each item is a key word identifying a status value to be assigned to the specified variable (which
should be of the right data type to receive it). The currently available status items are shown in Ta-
ble 43.2.
If the exception did not set a value for an item, an empty string will be returned.
Here is an example:
DECLARE
text_var1 text;
text_var2 text;
text_var3 text;
BEGIN
-- some processing which might cause an exception
...
EXCEPTION WHEN OTHERS THEN
GET STACKED DIAGNOSTICS text_var1 = MESSAGE_TEXT,
text_var2 = PG_EXCEPTION_DETAIL,
text_var3 = PG_EXCEPTION_HINT;
END;
$$ LANGUAGE plpgsql;
SELECT outer_func();
GET STACKED DIAGNOSTICS ... PG_EXCEPTION_CONTEXT returns the same sort of stack
trace, but describing the location at which an error was detected, rather than the current location.
43.7. Cursors
Rather than executing a whole query at once, it is possible to set up a cursor that encapsulates the
query, and then read the query result a few rows at a time. One reason for doing this is to avoid
memory overrun when the result contains a large number of rows. (However, PL/pgSQL users do not
normally need to worry about that, since FOR loops automatically use a cursor internally to avoid
memory problems.) A more interesting usage is to return a reference to a cursor that a function has
created, allowing the caller to read the rows. This provides an efficient way to return large row sets
from functions.
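All access to cursors in PL/pgSQL goes through cursor variables, which are always of the special
data type refcursor. One way to create a cursor variable is just to declare it as a variable of type
refcursor. Another way is to use the cursor declaration syntax, which in general is:
name [ [ NO ] SCROLL ] CURSOR [ ( arguments ) ] FOR query;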
(FOR can be replaced by IS for Oracle compatibility.) If SCROLL is specified, the cursor will be
capable of scrolling backward; if NO SCROLL is specified, backward fetches will be rejected; if neither
specification appears, it is query-dependent whether backward fetches will be allowed. arguments,
if specified, is a comma-separated list of pairs name datatype that define names to be replaced by
parameter values in the given query. The actual values to substitute for these names will be specified
later, when the cursor is opened.
Some examples:
DECLARE
    curs1 refcursor;
    curs2 CURSOR FOR SELECT * FROM tenk1;
    curs3 CURSOR (key integer) FOR SELECT * FROM tenk1 WHERE unique1 = key;
All three of these variables have the data type refcursor, but the first can be used with any query,
while the second has a fully specified query already bound to it, and the last has a parameterized query
bound to it. (key will be replaced by an integer parameter value when the cursor is opened.) The
variable curs1 is said to be unbound since it is not bound to any particular query.
The SCROLL option cannot be used when the cursor's query uses FOR UPDATE/SHARE. Also, it is
best to use NO SCROLL with a query that involves volatile functions. The implementation of SCROLL
assumes that re-reading the query's output will give consistent results, which a volatile function might
not do.
Note
Bound cursor variables can also be used without explicitly opening the cursor, via the FOR
statement described in Section 43.7.4.
The cursor variable is opened and given the specified query to execute. The cursor cannot be open
already, and it must have been declared as an unbound cursor variable (that is, as a simple refcursor
variable). The query must be a SELECT, or something else that returns rows (such as EXPLAIN). The
query is treated in the same way as other SQL commands in PL/pgSQL: PL/pgSQL variable names are
substituted, and the query plan is cached for possible reuse. When a PL/pgSQL variable is substituted
into the cursor query, the value that is substituted is the one it has at the time of the OPEN; subsequent
changes to the variable will not affect the cursor's behavior. The SCROLL and NO SCROLL options
have the same meanings as for a bound cursor.
An example:
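(a sketch; foo and mykey are placeholders)
OPEN curs1 FOR SELECT * FROM foo WHERE key = mykey;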
The cursor variable is opened and given the specified query to execute. The cursor cannot be open
already, and it must have been declared as an unbound cursor variable (that is, as a simple refcur-
sor variable). The query is specified as a string expression, in the same way as in the EXECUTE
command. As usual, this gives flexibility so the query plan can vary from one run to the next (see
Section 43.11.2), and it also means that variable substitution is not done on the command string. As
with EXECUTE, parameter values can be inserted into the dynamic command via format() and
USING. The SCROLL and NO SCROLL options have the same meanings as for a bound cursor.
An example:
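(a sketch; tabname and keyvalue are placeholders)
OPEN curs1 FOR EXECUTE format('SELECT * FROM %I WHERE col1 = $1', tabname)
    USING keyvalue;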
In this example, the table name is inserted into the query via format(). The comparison value for
col1 is inserted via a USING parameter, so it needs no quoting.
This form of OPEN is used to open a cursor variable whose query was bound to it when it was declared.
The cursor cannot be open already. A list of actual argument value expressions must appear if and
only if the cursor was declared to take arguments. These values will be substituted in the query.
The query plan for a bound cursor is always considered cacheable; there is no equivalent of EXECUTE
in this case. Notice that SCROLL and NO SCROLL cannot be specified in OPEN, as the cursor's
scrolling behavior was already determined.
Argument values can be passed using either positional or named notation. In positional notation, all
arguments are specified in order. In named notation, each argument's name is specified using := to
separate it from the argument expression. Similar to calling functions, described in Section 4.3, it is
also allowed to mix positional and named notation.
OPEN curs2;
OPEN curs3(42);
OPEN curs3(key := 42);
Because variable substitution is done on a bound cursor's query, there are really two ways to pass
values into the cursor: either with an explicit argument to OPEN, or implicitly by referencing a PL/
pgSQL variable in the query. However, only variables declared before the bound cursor was declared
will be substituted into it. In either case the value to be passed is determined at the time of the OPEN.
For example, another way to get the same effect as the curs3 example above is
DECLARE
key integer;
curs4 CURSOR FOR SELECT * FROM tenk1 WHERE unique1 = key;
BEGIN
key := 42;
OPEN curs4;
These manipulations need not occur in the same function that opened the cursor to begin with. You
can return a refcursor value out of a function and let the caller operate on the cursor. (Internally, a
refcursor value is simply the string name of a so-called portal containing the active query for the
cursor. This name can be passed around, assigned to other refcursor variables, and so on, without
disturbing the portal.)
All portals are implicitly closed at transaction end. Therefore a refcursor value is usable to refer-
ence an open cursor only until the end of the transaction.
43.7.3.1. FETCH
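The general form is:
FETCH [ direction { FROM | IN } ] cursor INTO target;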
FETCH retrieves the next row from the cursor into a target, which might be a row variable, a record
variable, or a comma-separated list of simple variables, just like SELECT INTO. If there is no next
row, the target is set to NULL(s). As with SELECT INTO, the special variable FOUND can be checked
to see whether a row was obtained or not.
The direction clause can be any of the variants allowed in the SQL FETCH command except the
ones that can fetch more than one row; namely, it can be NEXT, PRIOR, FIRST, LAST, ABSOLUTE
count, RELATIVE count, FORWARD, or BACKWARD. Omitting direction is the same as spec-
ifying NEXT. In the forms using a count, the count can be any integer-valued expression (unlike
the SQL FETCH command, which only allows an integer constant). direction values that require
moving backward are likely to fail unless the cursor was declared or opened with the SCROLL option.
cursor must be the name of a refcursor variable that references an open cursor portal.
Examples:
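(sketches; the cursor and variable names are placeholders)
FETCH curs1 INTO rowvar;
FETCH curs2 INTO foo, bar, baz;
FETCH LAST FROM curs3 INTO x, y;
FETCH RELATIVE -2 FROM curs4 INTO x;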
43.7.3.2. MOVE
MOVE repositions a cursor without retrieving any data. MOVE works exactly like the FETCH command,
except it only repositions the cursor and does not return the row moved to. As with SELECT INTO,
the special variable FOUND can be checked to see whether there was a next row to move to.
Examples:
MOVE curs1;
MOVE LAST FROM curs3;
MOVE RELATIVE -2 FROM curs4;
MOVE FORWARD 2 FROM curs4;
43.7.3.3. UPDATE/DELETE WHERE CURRENT OF
When a cursor is positioned on a table row, that row can be updated or deleted using the cursor to
identify the row. There are restrictions on what the cursor's query can be (in particular, no grouping)
and it's best to use FOR UPDATE in the cursor. For more information see the DECLARE reference
page.
An example:
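-- both statements assume that curs1 is positioned on a row of mytable (names are placeholders)
UPDATE mytable SET amount = amount + 1 WHERE CURRENT OF curs1;
DELETE FROM mytable WHERE CURRENT OF curs1;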
43.7.3.4. CLOSE
CLOSE cursor;
CLOSE closes the portal underlying an open cursor. This can be used to release resources earlier than
end of transaction, or to free up the cursor variable to be opened again.
An example:
CLOSE curs1;
The portal name used for a cursor can be specified by the programmer or automatically generated.
To specify a portal name, simply assign a string to the refcursor variable before opening it. The
string value of the refcursor variable will be used by OPEN as the name of the underlying portal.
However, if the refcursor variable is null, OPEN automatically generates a name that does not
conflict with any existing portal, and assigns it to the refcursor variable.
Note
A bound cursor variable is initialized to the string value representing its name, so that the
portal name is the same as the cursor variable name, unless the programmer overrides it by
assignment before opening the cursor. But an unbound cursor variable defaults to the null
value initially, so it will receive an automatically-generated unique name, unless overridden.
The following example shows one way a cursor name can be supplied by the caller:
BEGIN;
SELECT reffunc('funccursor');
reffunc2
--------------------
<unnamed cursor 1>
(1 row)
The following example shows one way to return multiple cursors from a single function:
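A sketch of such a function (the table names table_1 and table_2 are placeholders):

CREATE FUNCTION myfunc(refcursor, refcursor) RETURNS SETOF refcursor AS $$
BEGIN
    OPEN $1 FOR SELECT * FROM table_1;
    RETURN NEXT $1;
    OPEN $2 FOR SELECT * FROM table_2;
    RETURN NEXT $2;
END;
$$ LANGUAGE plpgsql;

-- the caller must be inside a transaction to use the returned cursors
BEGIN;
SELECT * FROM myfunc('a', 'b');
FETCH ALL FROM a;
FETCH ALL FROM b;
COMMIT;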
43.7.4. Looping through a Cursor's Result
There is a variant of the FOR statement that allows iterating through the rows returned by a cursor.
The syntax is:

[ <<label>> ]
FOR recordvar IN bound_cursorvar [ ( [ argument_name := ] argument_value [, ...] ) ] LOOP
    statements
END LOOP [ label ];
The cursor variable must have been bound to some query when it was declared, and it cannot be
open already. The FOR statement automatically opens the cursor, and it closes the cursor again when
the loop exits. A list of actual argument value expressions must appear if and only if the cursor was
declared to take arguments. These values will be substituted in the query, in just the same way as
during an OPEN (see Section 43.7.2.3).
The variable recordvar is automatically defined as type record and exists only inside the loop
(any existing definition of the variable name is ignored within the loop). Each row returned by the
cursor is successively assigned to this record variable and the loop body is executed.
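In procedures invoked by CALL, and in anonymous DO blocks, transactions can be committed or rolled
back from within the PL/pgSQL code. A minimal sketch (the table test1 is assumed to have an integer
column a):

CREATE PROCEDURE transaction_test1()
LANGUAGE plpgsql
AS $$
BEGIN
    FOR i IN 0..9 LOOP
        INSERT INTO test1 (a) VALUES (i);
        IF i % 2 = 0 THEN
            COMMIT;
        ELSE
            ROLLBACK;
        END IF;
    END LOOP;
END;
$$;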
CALL transaction_test1();
Transaction control is only possible in CALL or DO invocations from the top level or nested CALL
or DO invocations without any other intervening command. For example, if the call stack is CALL
proc1() → CALL proc2() → CALL proc3(), then the second and third procedures can
perform transaction control actions. But if the call stack is CALL proc1() → SELECT func2()
→ CALL proc3(), then the last procedure cannot do transaction control, because of the SELECT
in between.
END;
$$;
CALL transaction_test2();
Normally, cursors are automatically closed at transaction commit. However, a cursor created as part
of a loop like this is automatically converted to a holdable cursor by the first COMMIT or ROLLBACK.
That means that the cursor is fully evaluated at the first COMMIT or ROLLBACK rather than row by
row. The cursor is still removed automatically after the loop, so this is mostly invisible to the user.
Transaction commands are not allowed in cursor loops driven by commands that are not read-only
(for example UPDATE ... RETURNING).
The level option specifies the error severity. Allowed levels are DEBUG, LOG, INFO, NOTICE,
WARNING, and EXCEPTION, with EXCEPTION being the default. EXCEPTION raises an error
(which normally aborts the current transaction); the other levels only generate messages of different
priority levels. Whether messages of a particular priority are reported to the client, written to the server
log, or both is controlled by the log_min_messages and client_min_messages configuration variables.
See Chapter 19 for more information.
After level if any, you can specify a format string (which must be a simple string literal, not an
expression). The format string specifies the error message text to be reported. The format string can be
followed by optional argument expressions to be inserted into the message. Inside the format string,
% is replaced by the string representation of the next optional argument's value. Write %% to emit a
literal %. The number of arguments must match the number of % placeholders in the format string, or
an error is raised during the compilation of the function.
In this example, the value of v_job_id will replace the % in the string:
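RAISE NOTICE 'Calling cs_create_job(%)', v_job_id;   -- the message text here is only illustrative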
You can attach additional information to the error report by writing USING followed by option =
expression items. Each expression can be any string-valued expression. The allowed option
key words are:
MESSAGE
Sets the error message text. This option can't be used in the form of RAISE that includes a format
string before USING.
DETAIL
Supplies an error detail message.
HINT
Supplies a hint message.
ERRCODE
Specifies the error code (SQLSTATE) to report, either by condition name, as shown in Appendix A,
or directly as a five-character SQLSTATE code.
COLUMN
CONSTRAINT
DATATYPE
TABLE
SCHEMA
Supplies the name of a related object.
This example will abort the transaction with the given error message and hint:
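RAISE EXCEPTION 'Nonexistent ID --> %', user_id      -- user_id is an illustrative variable
      USING HINT = 'Please check your user ID';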
There is a second RAISE syntax in which the main argument is the condition name or SQLSTATE
to be reported, for example:
RAISE division_by_zero;
RAISE SQLSTATE '22012';
In this syntax, USING can be used to supply a custom error message, detail, or hint. Another way to
do the earlier example is
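RAISE unique_violation USING MESSAGE = 'Duplicate user ID: ' || user_id;   -- user_id as in the example above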
Still another variant is to write RAISE USING or RAISE level USING and put everything else
into the USING list.
The last variant of RAISE has no parameters at all. This form can only be used inside a BEGIN block's
EXCEPTION clause; it causes the error currently being handled to be re-thrown.
Note
Before PostgreSQL 9.1, RAISE without parameters was interpreted as re-throwing the error
from the block containing the active exception handler. Thus an EXCEPTION clause nested
within that handler could not catch it, even if the RAISE was within the nested EXCEPTION
clause's block. This was deemed surprising as well as being incompatible with Oracle's PL/SQL.
If neither a condition name nor a SQLSTATE is specified in a RAISE EXCEPTION command, the default
is to use ERRCODE_RAISE_EXCEPTION (P0001). If no message text is specified, the default is to
use the condition name or SQLSTATE as message text.
Note
When specifying an error code by SQLSTATE code, you are not limited to the predefined
error codes, but can select any error code consisting of five digits and/or upper-case ASCII
letters, other than 00000. It is recommended that you avoid throwing error codes that end in
three zeroes, because these are category codes and can only be trapped by trapping the whole
category.
The condition is a Boolean expression that is expected to always evaluate to true; if it does, the
ASSERT statement does nothing further. If the result is false or null, then an ASSERT_FAILURE
exception is raised. (If an error occurs while evaluating the condition, it is reported as a normal
error.)
If the optional message is provided, it is an expression whose result (if not null) replaces the default
error message text “assertion failed”, should the condition fail. The message expression is not
evaluated in the normal case where the assertion succeeds.
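For example, a function body might verify an expected invariant like this (a fragment; the table name is
illustrative):

ASSERT NOT EXISTS (SELECT FROM employees WHERE salary < 0),
       'negative salary detected';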
Note that ASSERT is meant for detecting program bugs, not for reporting ordinary error conditions.
Use the RAISE statement, described above, for that.
When a PL/pgSQL function is called as a trigger, several special variables are created automatically
in the top-level block. They are:
NEW
Data type RECORD; variable holding the new database row for INSERT/UPDATE operations in
row-level triggers. This variable is null in statement-level triggers and for DELETE operations.
OLD
Data type RECORD; variable holding the old database row for UPDATE/DELETE operations in
row-level triggers. This variable is null in statement-level triggers and for INSERT operations.
TG_NAME
Data type name; variable that contains the name of the trigger actually fired.
TG_WHEN
Data type text; a string of BEFORE, AFTER, or INSTEAD OF, depending on the trigger's
definition.
TG_LEVEL
Data type text; a string of either ROW or STATEMENT depending on the trigger's definition.
TG_OP
Data type text; a string of INSERT, UPDATE, DELETE, or TRUNCATE telling for which op-
eration the trigger was fired.
TG_RELID
Data type oid; the object ID of the table that caused the trigger invocation.
TG_RELNAME
Data type name; the name of the table that caused the trigger invocation. This is now deprecated,
and could disappear in a future release. Use TG_TABLE_NAME instead.
TG_TABLE_NAME
Data type name; the name of the table that caused the trigger invocation.
TG_TABLE_SCHEMA
Data type name; the name of the schema of the table that caused the trigger invocation.
TG_NARGS
Data type integer; the number of arguments given to the trigger function in the CREATE
TRIGGER statement.
TG_ARGV[]
Data type array of text; the arguments from the CREATE TRIGGER statement. The index
counts from 0. Invalid indexes (less than 0 or greater than or equal to tg_nargs) result in a
null value.
A trigger function must return either NULL or a record/row value having exactly the structure of the
table the trigger was fired for.
Row-level triggers fired BEFORE can return null to signal the trigger manager to skip the rest of the
operation for this row (i.e., subsequent triggers are not fired, and the INSERT/UPDATE/DELETE does
not occur for this row). If a nonnull value is returned then the operation proceeds with that row value.
Returning a row value different from the original value of NEW alters the row that will be inserted or
updated. Thus, if the trigger function wants the triggering action to succeed normally without altering
the row value, NEW (or a value equal thereto) has to be returned. To alter the row to be stored, it is
possible to replace single values directly in NEW and return the modified NEW, or to build a complete
new record/row to return. In the case of a before-trigger on DELETE, the returned value has no direct
effect, but it has to be nonnull to allow the trigger action to proceed. Note that NEW is null in DELETE
triggers, so returning that is usually not sensible. The usual idiom in DELETE triggers is to return OLD.
INSTEAD OF triggers (which are always row-level triggers, and may only be used on views) can
return null to signal that they did not perform any updates, and that the rest of the operation for this
row should be skipped (i.e., subsequent triggers are not fired, and the row is not counted in the rows-
affected status for the surrounding INSERT/UPDATE/DELETE). Otherwise a nonnull value should
be returned, to signal that the trigger performed the requested operation. For INSERT and UPDATE
operations, the return value should be NEW, which the trigger function may modify to support INSERT
RETURNING and UPDATE RETURNING (this will also affect the row value passed to any subsequent
triggers, or passed to a special EXCLUDED alias reference within an INSERT statement with an ON
CONFLICT DO UPDATE clause). For DELETE operations, the return value should be OLD.
The return value of a row-level trigger fired AFTER or a statement-level trigger fired BEFORE or
AFTER is always ignored; it might as well be null. However, any of these types of triggers might still
abort the entire operation by raising an error.
Another way to log changes to a table involves creating a new table that holds a row for each insert,
update, or delete that occurs. This approach can be thought of as auditing changes to a table. Exam-
ple 43.4 shows an example of an audit trigger function in PL/pgSQL.
A variation of the previous example uses a view joining the main table to the audit table, to show
when each entry was last modified. This approach still records the full audit trail of changes to the
table, but also presents a simplified view of the audit trail, showing just the last modified timestamp
derived from the audit trail for each entry. Example 43.5 shows an example of an audit trigger on a
view in PL/pgSQL.
This example uses a trigger on the view to make it updatable, and ensure that any insert, update or
delete of a row in the view is recorded (i.e., audited) in the emp_audit table. The current time and
user name are recorded, together with the type of operation performed, and the view displays the last
modified time of each row.
        OLD.last_updated = now();
        INSERT INTO emp_audit VALUES('D', user, OLD.*);
        RETURN OLD;
    ELSIF (TG_OP = 'UPDATE') THEN
        UPDATE emp SET salary = NEW.salary WHERE empname = OLD.empname;
        IF NOT FOUND THEN RETURN NULL; END IF;

        NEW.last_updated = now();
        INSERT INTO emp_audit VALUES('U', user, NEW.*);
        RETURN NEW;
    ELSIF (TG_OP = 'INSERT') THEN
        INSERT INTO emp VALUES(NEW.empname, NEW.salary);

        NEW.last_updated = now();
        INSERT INTO emp_audit VALUES('I', user, NEW.*);
        RETURN NEW;
    END IF;
END;
$$ LANGUAGE plpgsql;
One use of triggers is to maintain a summary table of another table. The resulting summary can be used
in place of the original table for certain queries — often with vastly reduced run times. This technique
is commonly used in Data Warehousing, where the tables of measured or observed data (called fact
tables) might be extremely large. Example 43.6 shows an example of a trigger function in PL/pgSQL
that maintains a summary table for a fact table in a data warehouse.
--
-- Main tables - time dimension and sales fact.
--
CREATE TABLE time_dimension (
time_key integer NOT NULL,
day_of_week integer NOT NULL,
day_of_month integer NOT NULL,
month integer NOT NULL,
quarter integer NOT NULL,
year integer NOT NULL
);
CREATE UNIQUE INDEX time_dimension_key ON time_dimension(time_key);
--
-- Summary table - sales by time.
--
CREATE TABLE sales_summary_bytime (
time_key integer NOT NULL,
amount_sold numeric(15,2) NOT NULL,
units_sold numeric(12) NOT NULL,
amount_cost numeric(15,2) NOT NULL
);
CREATE UNIQUE INDEX sales_summary_bytime_key ON sales_summary_bytime(time_key);
--
-- Function and trigger to amend summarized column(s) on UPDATE, INSERT, DELETE.
--
CREATE OR REPLACE FUNCTION maint_sales_summary_bytime() RETURNS TRIGGER
AS $maint_sales_summary_bytime$
DECLARE
delta_time_key integer;
delta_amount_sold numeric(15,2);
delta_units_sold numeric(12);
delta_amount_cost numeric(15,2);
BEGIN
    IF (TG_OP = 'DELETE') THEN
        delta_time_key = OLD.time_key;
        delta_amount_sold = -1 * OLD.amount_sold;
        delta_units_sold = -1 * OLD.units_sold;
        delta_amount_cost = -1 * OLD.amount_cost;
    ELSIF (TG_OP = 'UPDATE') THEN
        delta_time_key = OLD.time_key;
        delta_amount_sold = NEW.amount_sold - OLD.amount_sold;
        delta_units_sold = NEW.units_sold - OLD.units_sold;
        delta_amount_cost = NEW.amount_cost - OLD.amount_cost;
    ELSIF (TG_OP = 'INSERT') THEN
        delta_time_key = NEW.time_key;
        delta_amount_sold = NEW.amount_sold;
        delta_units_sold = NEW.units_sold;
        delta_amount_cost = NEW.amount_cost;
    END IF;
BEGIN
INSERT INTO sales_summary_bytime (
time_key,
amount_sold,
units_sold,
amount_cost)
VALUES (
delta_time_key,
delta_amount_sold,
delta_units_sold,
delta_amount_cost
);
EXIT insert_update;
EXCEPTION
WHEN UNIQUE_VIOLATION THEN
-- do nothing
END;
END LOOP insert_update;
RETURN NULL;
END;
$maint_sales_summary_bytime$ LANGUAGE plpgsql;
AFTER triggers can also make use of transition tables to inspect the entire set of rows changed by
the triggering statement. The CREATE TRIGGER command assigns names to one or both transition
tables, and then the function can refer to those names as though they were read-only temporary tables.
Example 43.7 shows an example.
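A minimal sketch (the emp and emp_audit tables are assumptions; their column layout here is illustrative):

-- assumed tables: emp(empname text, salary integer)
-- and emp_audit(operation char(1), userid text, empname text, salary integer)
CREATE OR REPLACE FUNCTION process_emp_audit() RETURNS trigger AS $$
BEGIN
    -- new_table is the transition table named in the CREATE TRIGGER below
    INSERT INTO emp_audit
        SELECT 'I', current_user, n.* FROM new_table AS n;
    RETURN NULL;   -- the result is ignored for AFTER triggers
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER emp_audit_ins
    AFTER INSERT ON emp
    REFERENCING NEW TABLE AS new_table
    FOR EACH STATEMENT EXECUTE PROCEDURE process_emp_audit();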
When a PL/pgSQL function is called as an event trigger, several special variables are created auto-
matically in the top-level block. They are:
TG_EVENT
Data type text; a string representing the event the trigger is fired for.
TG_TAG
Data type text; variable that contains the command tag for which the trigger is fired.
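Within the body of a PL/pgSQL function, a name appearing in an SQL statement can refer either to a
database object or to one of the function's variables. Consider, for instance, a statement like

INSERT INTO foo (foo) VALUES (foo);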
The first occurrence of foo must syntactically be a table name, so it will not be substituted, even if
the function has a variable named foo. The second occurrence must be the name of a column of the
table, so it will not be substituted either. Only the third occurrence is a candidate to be a reference
to the function's variable.
Note
PostgreSQL versions before 9.0 would try to substitute the variable in all three cases, leading
to syntax errors.
Since the names of variables are syntactically no different from the names of table columns, there can
be ambiguity in statements that also refer to tables: is a given name meant to refer to a table column,
or a variable? Let's change the previous example to
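INSERT INTO dest (col) SELECT foo + bar FROM src;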
Here, dest and src must be table names, and col must be a column of dest, but foo and bar
might reasonably be either variables of the function or columns of src.
By default, PL/pgSQL will report an error if a name in a SQL statement could refer to either a variable
or a table column. You can fix such a problem by renaming the variable or column, or by qualifying
the ambiguous reference, or by telling PL/pgSQL which interpretation to prefer.
The simplest solution is to rename the variable or column. A common coding rule is to use a different
naming convention for PL/pgSQL variables than you use for column names. For example, if you
consistently name function variables v_something while none of your column names start with
v_, no conflicts will occur.
Alternatively you can qualify ambiguous references to make them clear. In the above example, src.foo
would be an unambiguous reference to the table column. To create an unambiguous reference
to a variable, declare it in a labeled block and use the block's label (see Section 43.2). For example,
<<block>>
DECLARE
foo int;
BEGIN
foo := ...;
INSERT INTO dest (col) SELECT block.foo + bar FROM src;
Here block.foo means the variable even if there is a column foo in src. Function parameters,
as well as special variables such as FOUND, can be qualified by the function's name, because they are
implicitly declared in an outer block labeled with the function's name.
Sometimes it is impractical to fix all the ambiguous references in a large body of PL/pgSQL code. In
such cases you can specify that PL/pgSQL should resolve ambiguous references as the variable (which
is compatible with PL/pgSQL's behavior before PostgreSQL 9.0), or as the table column (which is
compatible with some other systems such as Oracle).
To change this behavior on a system-wide basis, set the configuration parameter
plpgsql.variable_conflict to one of error, use_variable, or use_column (where error is the
factory default). This parameter affects subsequent compilations of statements in PL/pgSQL functions,
but not statements already compiled in the current session. Because changing this setting can cause
unexpected changes in the behavior of PL/pgSQL functions, it can only be changed by a superuser.
You can also set the behavior on a function-by-function basis, by inserting one of these special com-
mands at the start of the function text:
#variable_conflict error
#variable_conflict use_variable
#variable_conflict use_column
These commands affect only the function they are written in, and override the setting of
plpgsql.variable_conflict. An example is
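-- a sketch: the users table is assumed to have columns id, comment, and last_modified
CREATE FUNCTION stamp_user(id int, comment text) RETURNS void AS $$
    #variable_conflict use_variable
    DECLARE
        curtime timestamp := now();
    BEGIN
        UPDATE users SET last_modified = curtime, comment = comment
          WHERE users.id = id;
    END;
$$ LANGUAGE plpgsql;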
In the UPDATE command, curtime, comment, and id will refer to the function's variable and pa-
rameters whether or not users has columns of those names. Notice that we had to qualify the refer-
ence to users.id in the WHERE clause to make it refer to the table column. But we did not have to
qualify the reference to comment as a target in the UPDATE list, because syntactically that must be
a column of users. We could write the same function without depending on the variable_conflict
setting in this way:
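-- the same sketch, using a block label instead of relying on the variable_conflict setting
CREATE OR REPLACE FUNCTION stamp_user(id int, comment text) RETURNS void AS $$
    << fn >>
    DECLARE
        curtime timestamp := now();
    BEGIN
        UPDATE users SET last_modified = fn.curtime, comment = stamp_user.comment
          WHERE users.id = stamp_user.id;
    END;
$$ LANGUAGE plpgsql;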
Variable substitution does not happen in the command string given to EXECUTE or one of its variants.
If you need to insert a varying value into such a command, do so as part of constructing the string
value, or use USING, as illustrated in Section 43.5.4.
Variable substitution currently works only in SELECT, INSERT, UPDATE, and DELETE commands,
because the main SQL engine allows query parameters only in these commands. To use a non-constant
name or value in other statement types (generically called utility statements), you must construct the
utility statement as a string and EXECUTE it.
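For instance (a sketch; the variable tabname is assumed to hold a table name):

EXECUTE format('TRUNCATE TABLE %I', tabname);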
As each expression and SQL command is first executed in the function, the PL/pgSQL interpreter
parses and analyzes the command to create a prepared statement, using the SPI manager's SPI_pre-
pare function. Subsequent visits to that expression or command reuse the prepared statement. Thus,
a function with conditional code paths that are seldom visited will never incur the overhead of analyz-
ing those commands that are never executed within the current session. A disadvantage is that errors
in a specific expression or command cannot be detected until that part of the function is reached in
execution. (Trivial syntax errors will be detected during the initial parsing pass, but anything deeper
will not be detected until execution.)
PL/pgSQL (or more precisely, the SPI manager) can furthermore attempt to cache the execution plan
associated with any particular prepared statement. If a cached plan is not used, then a fresh execution
plan is generated on each visit to the statement, and the current parameter values (that is, PL/pgSQL
variable values) can be used to optimize the selected plan. If the statement has no parameters, or is
executed many times, the SPI manager will consider creating a generic plan that is not dependent on
specific parameter values, and caching that for re-use. Typically this will happen only if the execution
plan is not very sensitive to the values of the PL/pgSQL variables referenced in it. If it is, generating
a plan each time is a net win. See PREPARE for more information about the behavior of prepared
statements.
Because PL/pgSQL saves prepared statements and sometimes execution plans in this way, SQL com-
mands that appear directly in a PL/pgSQL function must refer to the same tables and columns on every
execution; that is, you cannot use a parameter as the name of a table or column in an SQL command.
To get around this restriction, you can construct dynamic commands using the PL/pgSQL EXECUTE
statement — at the price of performing new parse analysis and constructing a new execution plan on
every execution.
The mutable nature of record variables presents another problem in this connection. When fields of
a record variable are used in expressions or statements, the data types of the fields must not change
from one call of the function to the next, since each expression will be analyzed using the data type
that is present when the expression is first reached. EXECUTE can be used to get around this problem
when necessary.
If the same function is used as a trigger for more than one table, PL/pgSQL prepares and caches
statements independently for each such table — that is, there is a cache for each trigger function and
table combination, not just for each function. This alleviates some of the problems with varying data
types; for instance, a trigger function will be able to work successfully with a column named key
even if it happens to have different types in different tables.
Likewise, functions having polymorphic argument types have a separate statement cache for each
combination of actual argument types they have been invoked for, so that data type differences do not
cause unexpected failures.
Statement caching can sometimes have surprising effects on the interpretation of time-sensitive values.
For example, there is a difference between what these two functions do:
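(a sketch, assuming a table logtable with a text column and a timestamp column)

CREATE FUNCTION logfunc1(logtxt text) RETURNS void AS $$
    BEGIN
        INSERT INTO logtable VALUES (logtxt, 'now');
    END;
$$ LANGUAGE plpgsql;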
and:
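CREATE FUNCTION logfunc2(logtxt text) RETURNS void AS $$
    DECLARE
        curtime timestamp;
    BEGIN
        curtime := 'now';   -- same assumed logtable as above
        INSERT INTO logtable VALUES (logtxt, curtime);
    END;
$$ LANGUAGE plpgsql;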
In the case of logfunc1, the PostgreSQL main parser knows when analyzing the INSERT that the
string 'now' should be interpreted as timestamp, because the target column of logtable is of
that type. Thus, 'now' will be converted to a timestamp constant when the INSERT is analyzed,
and then used in all invocations of logfunc1 during the lifetime of the session. Needless to say,
this isn't what the programmer wanted. A better idea is to use the now() or current_timestamp
function.
In the case of logfunc2, the PostgreSQL main parser does not know what type 'now' should
become and therefore it returns a data value of type text containing the string now. During the
ensuing assignment to the local variable curtime, the PL/pgSQL interpreter casts this string to the
timestamp type by calling the textout and timestamp_in functions for the conversion. So,
the computed time stamp is updated on each execution as the programmer expects. Even though this
happens to work as expected, it's not terribly efficient, so use of the now() function would still be
a better idea.
While running psql, you can load or reload such a function definition file with:
\i filename.sql
Another good way to develop in PL/pgSQL is with a GUI database access tool that facilitates devel-
opment in a procedural language. One example of such a tool is pgAdmin, although others exist. These
tools often provide convenient features such as escaping single quotes and making it easier to recreate
and debug functions.
Within this, you might use quote marks for simple literal strings in SQL commands and $$ to delimit
fragments of SQL commands that you are assembling as strings. If you need to quote text that includes
$$, you could use $Q$, and so on.
The following chart shows what you have to do when writing quote marks without dollar quoting. It
might be useful when translating pre-dollar quoting code into something more comprehensible.
1 quotation mark
To begin and end the function body itself.
Anywhere within a single-quoted function body, quote marks must appear in pairs.
2 quotation marks
For string literals inside the function body, for example:
a_output := ''Blah'';
SELECT * FROM users WHERE f_name=''foobar'';
In the dollar-quoting approach, you'd just write:
a_output := 'Blah';
SELECT * FROM users WHERE f_name='foobar';
which is exactly what the PL/pgSQL parser would see in either case.
4 quotation marks
When you need a single quotation mark in a string constant inside the function body, for example:
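a_output := a_output || '' AND name LIKE ''''foobar'''' AND xyz''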
The value actually appended to a_output would be: AND name LIKE 'foobar' AND
xyz.
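In the dollar-quoting approach, you would instead write
a_output := a_output || $$ AND name LIKE 'foobar' AND xyz$$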
being careful that any dollar-quote delimiters around this are not just $$.
6 quotation marks
When a single quotation mark in a string inside the function body is adjacent to the end of that
string constant, for example:
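a_output := a_output || '' AND name LIKE ''''foobar''''''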
The value appended to a_output would then be: AND name LIKE 'foobar'.
10 quotation marks
When you want two single quotation marks in a string constant (which accounts for 8 quotation
marks) and this is adjacent to the end of that string constant (2 more). You will probably only need
that if you are writing a function that generates other functions, as in Example 43.10. For example:
where we assume we only need to put single quote marks into a_output, because it will be
re-quoted before use.
These additional checks are enabled through the configuration variables plpgsql.extra_warn-
ings for warnings and plpgsql.extra_errors for errors. Both can be set either to a com-
ma-separated list of checks, "none" or "all". The default is "none". Currently the list of avail-
able checks includes only one:
shadowed_variables
Checks if a declaration shadows a previously defined variable.
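For example, with the warning enabled, a shadowed declaration is reported at function creation time
(a sketch; the function and its names are illustrative):

SET plpgsql.extra_warnings TO 'shadowed_variables';

CREATE FUNCTION shadow_test(f1 int) RETURNS int AS $$
DECLARE
    f1 int;   -- shadows the parameter f1 and provokes the warning
BEGIN
    RETURN f1;
END;
$$ LANGUAGE plpgsql;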
PL/pgSQL is similar to PL/SQL in many aspects. It is a block-structured, imperative language, and all
variables have to be declared. Assignments, loops, and conditionals are similar. The main differences
you should keep in mind when porting from PL/SQL to PL/pgSQL are:
• If a name used in a SQL command could be either a column name of a table or a reference to
a variable of the function, PL/SQL treats it as a column name. This corresponds to PL/pgSQL's
plpgsql.variable_conflict = use_column behavior, which is not the default, as explained
in Section 43.11.1. It's often best to avoid such ambiguities in the first place, but if you
have to port a large amount of code that depends on this behavior, setting variable_conflict
may be the best solution.
• In PostgreSQL the function body must be written as a string literal. Therefore you need to use dollar
quoting or escape single quotes in the function body. (See Section 43.12.1.)
• Data type names often need translation. For example, in Oracle string values are commonly declared
as being of type varchar2, which is a non-SQL-standard type. In PostgreSQL, use type varchar
or text instead. Similarly, replace type number with numeric, or use some other numeric data
type if there's a more appropriate one.
• Since there are no packages, there are no package-level variables either. This is somewhat annoying.
You can keep per-session state in temporary tables instead.
• Integer FOR loops with REVERSE work differently: PL/SQL counts down from the second num-
ber to the first, while PL/pgSQL counts down from the first number to the second, requiring the
loop bounds to be swapped when porting. This incompatibility is unfortunate but is unlikely to be
changed. (See Section 43.6.5.5.)
• FOR loops over queries (other than cursors) also work differently: the target variable(s) must have
been declared, whereas PL/SQL always declares them implicitly. An advantage of this is that the
variable values are still accessible after the loop exits.
• There are various notational differences for the use of cursor variables.
Let's go through this function and see the differences compared to PL/pgSQL:
• The type name varchar2 has to be changed to varchar or text. In the examples in this section,
we'll use varchar, but text is often a better choice if you do not need specific string length limits.
• The RETURN key word in the function prototype (not the function body) becomes RETURNS in
PostgreSQL. Also, IS becomes AS, and you need to add a LANGUAGE clause because PL/pgSQL
is not the only possible function language.
• In PostgreSQL, the function body is considered to be a string literal, so you need to use quote marks
or dollar quotes around it. This substitutes for the terminating / in the Oracle approach.
• The show errors command does not exist in PostgreSQL, and is not needed since errors are
reported automatically.
Example 43.10 shows how to port a function that creates another function and how to handle the
ensuing quoting problems.
Example 43.10. Porting a Function that Creates Another Function from PL/SQL
to PL/pgSQL
The following procedure grabs rows from a SELECT statement and builds a large function with the
results in IF statements, for the sake of efficiency.
    func_cmd :=
        'CREATE OR REPLACE FUNCTION cs_find_referrer_type(v_host varchar,
                                                          v_domain varchar,
                                                          v_url varchar)
            RETURNS varchar AS '
        || quote_literal(func_body)
        || ' LANGUAGE plpgsql;' ;

    EXECUTE func_cmd;
END;
$func$ LANGUAGE plpgsql;
Notice how the body of the function is built separately and passed through quote_literal to
double any quote marks in it. This technique is needed because we cannot safely use dollar quoting
for defining the new function: we do not know for sure what strings will be interpolated from the
referrer_key.key_string field. (We are assuming here that referrer_key.kind can
be trusted to always be host, domain, or url, but referrer_key.key_string might be
anything, in particular it might contain dollar signs.) This function is actually an improvement on the
Oracle original, because it will not generate broken code when referrer_key.key_string or
referrer_key.referrer_type contain quote marks.
Example 43.11 shows how to port a function with OUT parameters and string manipulation. Post-
greSQL does not have a built-in instr function, but you can create one using a combination of other
functions. In Section 43.13.3 there is a PL/pgSQL implementation of instr that you can use to make
your porting easier.
IF a_pos1 = 0 THEN
RETURN;
END IF;
a_pos2 := instr(v_url, '/', a_pos1 + 2);
IF a_pos2 = 0 THEN
v_host := substr(v_url, a_pos1 + 2);
v_path := '/';
RETURN;
END IF;
IF a_pos1 = 0 THEN
v_path := substr(v_url, a_pos2);
RETURN;
END IF;
IF a_pos1 = 0 THEN
RETURN;
END IF;
a_pos2 := instr(v_url, '/', a_pos1 + 2);
IF a_pos2 = 0 THEN
v_host := substr(v_url, a_pos1 + 2);
v_path := '/';
RETURN;
END IF;
IF a_pos1 = 0 THEN
v_path := substr(v_url, a_pos2);
RETURN;
END IF;
Example 43.12 shows how to port a procedure that uses numerous features that are specific to Oracle.
BEGIN
INSERT INTO cs_jobs (job_id, start_stamp) VALUES (v_job_id,
now());
EXCEPTION
WHEN dup_val_on_index THEN NULL; -- don't worry if it already exists
END;
COMMIT;
END;
/
show errors
BEGIN
INSERT INTO cs_jobs (job_id, start_stamp) VALUES (v_job_id,
now());
EXCEPTION
WHEN unique_violation THEN -- 2
-- don't worry if it already exists
END;
COMMIT;
END;
$$ LANGUAGE plpgsql;
1 The syntax of RAISE is considerably different from Oracle's statement, although the basic case
RAISE exception_name works similarly.
2 The exception names supported by PL/pgSQL are different from Oracle's. The set of built-in
exception names is much larger (see Appendix A). There is not currently a way to declare user-
defined exception names, although you can throw user-chosen SQLSTATE values instead.
BEGIN
SAVEPOINT s1;
... code here ...
EXCEPTION
WHEN ... THEN
ROLLBACK TO s1;
... code here ...
WHEN ... THEN
ROLLBACK TO s1;
... code here ...
END;
If you are translating an Oracle procedure that uses SAVEPOINT and ROLLBACK TO in this style,
your task is easy: just omit the SAVEPOINT and ROLLBACK TO. If you have a procedure that uses
SAVEPOINT and ROLLBACK TO in a different way then some actual thought will be required.
43.13.2.2. EXECUTE
The PL/pgSQL version of EXECUTE works similarly to the PL/SQL version, but you have to remem-
ber to use quote_literal and quote_ident as described in Section 43.5.4. Constructs of the
type EXECUTE 'SELECT * FROM $1'; will not work reliably unless you use these functions.
When making use of these optimization attributes, your CREATE FUNCTION statement might look
something like this:
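For example (a sketch; the attributes in question here are taken to be STRICT and IMMUTABLE, as
used by the instr functions below):

CREATE FUNCTION add_three(x integer) RETURNS integer AS $$
BEGIN
    RETURN x + 3;
END;
$$ LANGUAGE plpgsql STRICT IMMUTABLE;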
43.13.3. Appendix
This section contains the code for a set of Oracle-compatible instr functions that you can use to
simplify your porting efforts.
--
-- instr functions that mimic Oracle's counterpart
-- Syntax: instr(string1, string2 [, n [, m]])
-- where [] denotes optional parameters.
--
-- Search string1, beginning at the nth character, for the mth occurrence
-- of string2.  If n is negative, search backwards, starting at the abs(n)'th
-- character from the end of string1.
-- If n is not passed, assume 1 (search starts at first character).
-- If m is not passed, assume 1 (find first occurrence).
-- Returns starting index of string2 in string1, or 0 if string2 is not found.
--
END;
$$ LANGUAGE plpgsql STRICT IMMUTABLE;
IF pos = 0 THEN
RETURN 0;
ELSE
RETURN pos + beg_index - 1;
END IF;
ELSIF beg_index < 0 THEN
ss_length := char_length(string_to_search_for);
length := char_length(string);
beg := length + 1 + beg_index;
beg := beg - 1;
END LOOP;
RETURN 0;
ELSE
RETURN 0;
END IF;
END;
$$ LANGUAGE plpgsql STRICT IMMUTABLE;
RETURN beg;
ELSIF beg_index < 0 THEN
ss_length := char_length(string_to_search_for);
length := char_length(string);
beg := length + 1 + beg_index;
beg := beg - 1;
END LOOP;
RETURN 0;
ELSE
RETURN 0;
END IF;
END;
$$ LANGUAGE plpgsql STRICT IMMUTABLE;
Chapter 44. PL/Tcl - Tcl Procedural
Language
PL/Tcl is a loadable procedural language for the PostgreSQL database system that enables the Tcl
language to be used to write PostgreSQL functions and procedures.
44.1. Overview
PL/Tcl offers most of the capabilities a function writer has in the C language, with a few restrictions,
and with the addition of the powerful string processing libraries that are available for Tcl.
One compelling restriction is that everything is executed from within the safety of the context of a
Tcl interpreter. In addition to the limited command set of safe Tcl, only a few commands are available
to access the database via SPI and to raise messages via elog(). PL/Tcl provides no way to access
internals of the database server or to gain OS-level access under the permissions of the PostgreSQL
server process, as a C function can do. Thus, unprivileged database users can be trusted to use this
language; it does not give them unlimited authority.
The other notable implementation restriction is that Tcl functions cannot be used to create input/output
functions for new data types.
Sometimes it is desirable to write Tcl functions that are not restricted to safe Tcl. For example, one
might want a Tcl function that sends email. To handle these cases, there is a variant of PL/Tcl called
PL/TclU (for untrusted Tcl). This is exactly the same language except that a full Tcl interpreter is
used. If PL/TclU is used, it must be installed as an untrusted procedural language so that only database
superusers can create functions in it. The writer of a PL/TclU function must take care that the function
cannot be used to do anything unwanted, since it will be able to do anything that could be done by a
user logged in as the database administrator.
The shared object code for the PL/Tcl and PL/TclU call handlers is automatically built and installed in
the PostgreSQL library directory if Tcl support is specified in the configuration step of the installation
procedure. To install PL/Tcl and/or PL/TclU in a particular database, use the CREATE EXTENSION
command, for example CREATE EXTENSION pltcl or CREATE EXTENSION pltclu.
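To create a function in the PL/Tcl language, use the standard CREATE FUNCTION syntax; roughly:

CREATE FUNCTION funcname (argument-types) RETURNS return-type AS $$
    # PL/Tcl function body
$$ LANGUAGE pltcl;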
PL/TclU is the same, except that the language has to be specified as pltclu.
The body of the function is simply a piece of Tcl script. When the function is called, the argument
values are passed to the Tcl script as variables named 1 ... n. The result is returned from the Tcl code in
the usual way, with a return statement. In a procedure, the return value from the Tcl code is ignored.
For example, a function returning the greater of two integer values could be defined as:
CREATE FUNCTION tcl_max(integer, integer) RETURNS integer AS $$
    if {$1 > $2} {return $1}
    return $2
$$ LANGUAGE pltcl STRICT;
Note the clause STRICT, which saves us from having to think about null input values: if a null value
is passed, the function will not be called at all, but will just return a null result automatically.
In a nonstrict function, if the actual value of an argument is null, the corresponding $n variable will be
set to an empty string. To detect whether a particular argument is null, use the function argisnull.
For example, suppose that we wanted tcl_max with one null and one nonnull argument to return
the nonnull argument, rather than null:
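CREATE OR REPLACE FUNCTION tcl_max(integer, integer) RETURNS integer AS $$
    # a sketch of the null-aware version
    if {[argisnull 1]} {
        if {[argisnull 2]} { return_null }
        return $2
    }
    if {[argisnull 2]} { return $1 }
    if {$1 > $2} {return $1}
    return $2
$$ LANGUAGE pltcl;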
As shown above, to return a null value from a PL/Tcl function, execute return_null. This can be
done whether the function is strict or not.
Composite-type arguments are passed to the function as Tcl arrays. The element names of the array
are the attribute names of the composite type. If an attribute in the passed row has the null value, it
will not appear in the array. Here is an example:
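-- assume an illustrative composite type: employee(name text, salary integer, age integer)
CREATE FUNCTION overpaid(employee) RETURNS boolean AS $$
    if {200000.0 < $1(salary)} {
        return "t"
    }
    if {$1(age) < 30 && 100000.0 < $1(salary)} {
        return "t"
    }
    return "f"
$$ LANGUAGE pltcl;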
PL/Tcl functions can return composite-type results, too. To do this, the Tcl code must return a list of
column name/value pairs matching the expected result type. Any column names omitted from the list
are returned as nulls, and an error is raised if there are unexpected column names. Here is an example:
CREATE FUNCTION square_cube(in int, out squared int, out cubed int)
AS $$
return [list squared [expr {$1 * $1}] cubed [expr {$1 * $1 *
$1}]]
$$ LANGUAGE pltcl;
Output arguments of procedures are returned in the same way, for example:
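-- a sketch of a procedure with output arguments
CREATE PROCEDURE tcl_triple(INOUT a integer, INOUT b integer) AS $$
    return [list a [expr {$1 * 3}] b [expr {$2 * 3}]]
$$ LANGUAGE pltcl;

CALL tcl_triple(5, 10);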
Tip
The result list can be made from an array representation of the desired tuple with the array
get Tcl command. For example:
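-- assumes the same illustrative employee type as above
CREATE FUNCTION raise_pay(employee, delta int) RETURNS employee AS $$
    set 1(salary) [expr {$1(salary) + $2}]
    return [array get 1]
$$ LANGUAGE pltcl;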
PL/Tcl functions can return sets. To do this, the Tcl code should call return_next once per row
to be returned, passing either the appropriate value when returning a scalar type, or a list of column
name/value pairs when returning a composite type. Here is an example returning a scalar type:
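CREATE FUNCTION sequence(int, int) RETURNS SETOF int AS $$
    for {set i $1} {$i < $2} {incr i} {
        return_next $i
    }
$$ LANGUAGE pltcl;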
For security reasons, PL/Tcl executes functions called by any one SQL role in a separate Tcl interpreter
for that role. This prevents accidental or malicious interference by one user with the behavior of another
user's PL/Tcl functions. Each such interpreter will have its own values for any “global” Tcl variables.
Thus, two PL/Tcl functions will share the same global variables if and only if they are executed by the
same SQL role. In an application wherein a single session executes code under multiple SQL roles (via
SECURITY DEFINER functions, use of SET ROLE, etc) you may need to take explicit steps to ensure
that PL/Tcl functions can share data. To do that, make sure that functions that should communicate
are owned by the same user, and mark them SECURITY DEFINER. You must of course take care
that such functions can't be used to do anything unintended.
All PL/TclU functions used in a session execute in the same Tcl interpreter, which of course is distinct
from the interpreter(s) used for PL/Tcl functions. So global data is automatically shared between PL/
TclU functions. This is not considered a security risk because all PL/TclU functions execute at the
same trust level, namely that of a database superuser.
To help protect PL/Tcl functions from unintentionally interfering with each other, a global array is
made available to each function via the upvar command. The global name of this variable is the
function's internal name, and the local name is GD. It is recommended that GD be used for persistent
private data of a function. Use regular Tcl global variables only for values that you specifically intend
to be shared among multiple functions. (Note that the GD arrays are only global within a particular
interpreter, so they do not bypass the security restrictions mentioned above.)
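The following commands are available to access the database from the body of a PL/Tcl function. The
first of these, spi_exec, has roughly this form:

spi_exec ?-count n? ?-array name? command ?loop-body?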
Executes an SQL command given as a string. An error in the command causes an error to be
raised. Otherwise, the return value of spi_exec is the number of rows processed (selected,
inserted, updated, or deleted) by the command, or zero if the command is a utility statement. In
addition, if the command is a SELECT statement, the values of the selected columns are placed
in Tcl variables as described below.
The optional -count value tells spi_exec the maximum number of rows to process in the
command. The effect of this is comparable to setting up a query as a cursor and then saying
FETCH n.
If the command is a SELECT statement, the values of the result columns are placed into Tcl
variables named after the columns. If the -array option is given, the column values are instead
stored into elements of the named associative array, with the column names used as array indexes.
In addition, the current row number within the result (counting from zero) is stored into the array
element named “.tupno”, unless that name is in use as a column name in the result.
If the command is a SELECT statement and no loop-body script is given, then only the first
row of results is stored into Tcl variables or array elements; remaining rows, if any, are ignored.
No storing occurs if the query returns no rows. (This case can be detected by checking the result
of spi_exec.) For example:
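spi_exec "SELECT count(*) AS cnt FROM pg_proc"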
will set the Tcl variable $cnt to the number of rows in the pg_proc system catalog.
If the optional loop-body argument is given, it is a piece of Tcl script that is executed once for
each row in the query result. (loop-body is ignored if the given command is not a SELECT.)
The values of the current row's columns are stored into Tcl variables or array elements before
each iteration. For example:
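spi_exec -array C "SELECT * FROM pg_class" {
    elog DEBUG "have table $C(relname)"
}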
will print a log message for every row of pg_class. This feature works similarly to other Tcl
looping constructs; in particular continue and break work in the usual way inside the loop
body.
If a column of a query result is null, the target variable for it is “unset” rather than being set.
Prepares and saves a query plan for later execution. The saved plan will be retained for the life
of the current session.
The query can use parameters, that is, placeholders for values to be supplied whenever the plan is
actually executed. In the query string, refer to parameters by the symbols $1 ... $n. If the query
uses parameters, the names of the parameter types must be given as a Tcl list. (Write an empty
list for typelist if no parameters are used.)
The return value from spi_prepare is a query ID to be used in subsequent calls to spi_ex-
ecp. See spi_execp for an example.
The optional value for -nulls is a string of spaces and 'n' characters telling spi_execp
which of the parameters are null values. If given, it must have exactly the same length as the
value-list. If it is not given, all the parameter values are nonnull.
Except for the way in which the query and its parameters are specified, spi_execp works just
like spi_exec. The -count, -array, and loop-body options are the same, and so is the
result value.
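For instance, a function using a saved plan might look like this sketch (the table t1 and its column num
are assumptions):

CREATE FUNCTION t1_count(integer, integer) RETURNS integer AS $$
    if {![ info exists GD(plan) ]} {
        # prepare the saved plan on the first call
        set GD(plan) [ spi_prepare \
                "SELECT count(*) AS cnt FROM t1 WHERE num >= \$1 AND num <= \$2" \
                [ list int4 int4 ] ]
    }
    spi_execp -count 1 $GD(plan) [ list $1 $2 ]
    return $cnt
$$ LANGUAGE pltcl;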
We need backslashes inside the query string given to spi_prepare to ensure that the $n mark-
ers will be passed through to spi_prepare as-is, and not replaced by Tcl variable substitution.
spi_lastoid
Returns the OID of the row inserted by the last spi_exec or spi_execp, if the command was
a single-row INSERT and the modified table contained OIDs. (If not, you get zero.)
subtransaction command
The Tcl script contained in command is executed within a SQL subtransaction. If the script returns
an error, that entire subtransaction is rolled back before returning the error out to the surrounding
Tcl code. See Section 44.9 for more details and an example.
quote string
Doubles all occurrences of single quote and backslash characters in the given string. This can be
used to safely quote strings that are to be inserted into SQL commands given to spi_exec or
spi_prepare. For example, think about an SQL command string like:
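"SELECT '$val'"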
where the Tcl variable val actually contains doesn't. This would result in the final command
string:
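SELECT 'doesn't'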
which would cause a parse error during spi_exec or spi_prepare. To work properly, the
submitted command should contain:
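SELECT 'doesn''t'

which can be formed in PL/Tcl by writing the command string as "SELECT '[quote $val]'".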
One advantage of spi_execp is that you don't have to quote parameter values like this, since
the parameters are never parsed as part of an SQL command string.
Emits a log or error message. Possible levels are DEBUG, LOG, INFO, NOTICE, WARNING, ERROR,
and FATAL. ERROR raises an error condition; if this is not trapped by the surrounding Tcl
code, the error propagates out to the calling query, causing the current transaction or subtransac-
tion to be aborted. This is effectively the same as the Tcl error command. FATAL aborts the
transaction and causes the current session to shut down. (There is probably no good reason to use
this error level in PL/Tcl functions, but it's provided for completeness.) The other levels only gen-
erate messages of different priority levels. Whether messages of a particular priority are reported
to the client, written to the server log, or both is controlled by the log_min_messages and
client_min_messages configuration variables. See Chapter 19 and Section 44.8 for more information.
The information from the trigger manager is passed to the function body in the following variables:
$TG_name
The name of the trigger from the CREATE TRIGGER statement.
$TG_relid
The object ID of the table that caused the trigger function to be invoked.
$TG_table_name
The name of the table that caused the trigger function to be invoked.
$TG_table_schema
The schema of the table that caused the trigger function to be invoked.
$TG_relatts
A Tcl list of the table column names, prefixed with an empty list element. So looking up a column
name in the list with Tcl's lsearch command returns the element's number starting with 1 for
the first column, the same way the columns are customarily numbered in PostgreSQL. (Empty
list elements also appear in the positions of columns that have been dropped, so that the attribute
numbering is correct for columns to their right.)
$TG_when
The string BEFORE, AFTER, or INSTEAD OF, depending on the type of trigger event.
$TG_level
The string ROW or STATEMENT depending on the type of trigger event.
$TG_op
The string INSERT, UPDATE, DELETE, or TRUNCATE depending on the type of trigger event.
$NEW
An associative array containing the values of the new table row for INSERT or UPDATE actions,
or empty for DELETE. The array is indexed by column name. Columns that are null will not
appear in the array. This is not set for statement-level triggers.
$OLD
An associative array containing the values of the old table row for UPDATE or DELETE actions,
or empty for INSERT. The array is indexed by column name. Columns that are null will not
appear in the array. This is not set for statement-level triggers.
$args
A Tcl list of the arguments to the function as given in the CREATE TRIGGER statement. These
arguments are also accessible as $1 ... $n in the function body.
The return value from a trigger function can be one of the strings OK or SKIP, or a list of column name/
value pairs. If the return value is OK, the operation (INSERT/UPDATE/DELETE) that fired the trigger
will proceed normally. SKIP tells the trigger manager to silently suppress the operation for this row.
If a list is returned, it tells PL/Tcl to return a modified row to the trigger manager; the contents of the
modified row are specified by the column names and values in the list. Any columns not mentioned in
the list are set to null. Returning a modified row is only meaningful for row-level BEFORE INSERT or
UPDATE triggers, for which the modified row will be inserted instead of the one given in $NEW; or for
row-level INSTEAD OF INSERT or UPDATE triggers where the returned row is used as the source
data for INSERT RETURNING or UPDATE RETURNING clauses. In row-level BEFORE DELETE
or INSTEAD OF DELETE triggers, returning a modified row has the same effect as returning OK,
that is the operation proceeds. The trigger return value is ignored for all other types of triggers.
Tip
The result list can be made from an array representation of the modified tuple with the array
get Tcl command.
Here's a little example trigger function that forces an integer value in a table to keep track of the
number of updates that are performed on the row. For new rows inserted, the value is initialized to 0
and then incremented on every update operation.
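A sketch of such a trigger function and its use (the table and column names here are illustrative):

CREATE FUNCTION trigfunc_modcount() RETURNS trigger AS $$
    switch $TG_op {
        INSERT {
            set NEW($1) 0
        }
        UPDATE {
            set NEW($1) $OLD($1)
            incr NEW($1)
        }
        default {
            return OK
        }
    }
    return [array get NEW]
$$ LANGUAGE pltcl;

CREATE TABLE mytab (num integer, description text, modcnt integer);

CREATE TRIGGER trig_mytab_modcount BEFORE INSERT OR UPDATE ON mytab
    FOR EACH ROW EXECUTE PROCEDURE trigfunc_modcount('modcnt');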
Notice that the trigger function itself does not know the column name; that's supplied from the trigger
arguments. This lets the trigger function be reused with different tables.
The information from the trigger manager is passed to the function body in the following variables:
$TG_event
The name of the event the trigger is fired for.
$TG_tag
The command tag for which the trigger is fired.
Here's a little example event trigger function that simply raises a NOTICE message each time a
supported command is executed:
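-- a sketch
CREATE OR REPLACE FUNCTION tclsnitch() RETURNS event_trigger AS $$
    elog NOTICE "tclsnitch: $TG_event $TG_tag"
$$ LANGUAGE pltcl;

CREATE EVENT TRIGGER tcl_a_snitch ON ddl_command_start EXECUTE PROCEDURE tclsnitch();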
Tcl code within or called from a PL/Tcl function can raise an error, either by executing some invalid
operation or by generating an error using the Tcl error command or PL/Tcl's elog command. Such
errors can be caught within Tcl using the Tcl catch command. If an error is not caught but is allowed
to propagate out to the top level of execution of the PL/Tcl function, it is reported as a SQL error in
the function's calling query.
Conversely, SQL errors that occur within PL/Tcl's spi_exec, spi_prepare, and spi_execp
commands are reported as Tcl errors, so they are catchable by Tcl's catch command. (Each of these
PL/Tcl commands runs its SQL operation in a subtransaction, which is rolled back on error, so that
any partially-completed operation is automatically cleaned up.) Again, if an error propagates out to
the top level without being caught, it turns back into a SQL error.
Tcl provides an errorCode variable that can represent additional information about an error in
a form that is easy for Tcl programs to interpret. The contents are in Tcl list format, and the first
word identifies the subsystem or library reporting the error; beyond that the contents are left to the
individual subsystem or library. For database errors reported by PL/Tcl commands, the first word
is POSTGRES, the second word is the PostgreSQL version number, and additional words are field
name/value pairs providing detailed information about the error. Fields SQLSTATE, condition,
and message are always supplied (the first two represent the error code and condition name as shown
in Appendix A). Fields that may be present include detail, hint, context, schema, table,
column, datatype, constraint, statement, cursor_position, filename, lineno,
and funcname.
A convenient way to work with PL/Tcl's errorCode information is to load it into an array, so that
the field names become array subscripts. Code for doing that might look like
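# a sketch; sql_command is assumed to hold the command being executed
if {[catch { spi_exec $sql_command }]} {
    if {[lindex $::errorCode 0] == "POSTGRES"} {
        array set errorArray $::errorCode
        if {$errorArray(condition) == "undefined_table"} {
            # deal with a missing table
        } else {
            # deal with some other type of SQL error
        }
    }
}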
} else {
set result "funds transferred successfully"
}
spi_exec "INSERT INTO operations (result) VALUES ('[quote
$result]')"
$$ LANGUAGE pltcl;
If the second UPDATE statement results in an exception being raised, this function will log the failure,
but the result of the first UPDATE will nevertheless be committed. In other words, the funds will be
withdrawn from Joe's account, but will not be transferred to Mary's account. This happens because
each spi_exec is a separate subtransaction, and only one of those subtransactions got rolled back.
To handle such cases, you can wrap multiple database operations in an explicit subtransaction, which
will succeed or roll back as a whole. PL/Tcl provides a subtransaction command to manage
this. We can rewrite our function as:
Note that use of catch is still required for this purpose. Otherwise the error would propagate to the
top level of the function, preventing the desired insertion into the operations table. The sub-
transaction command does not trap errors, it only assures that all database operations executed
inside its scope will be rolled back together when an error is reported.
A rollback of an explicit subtransaction occurs on any error reported by the contained Tcl code, not on-
ly errors originating from database access. Thus a regular Tcl exception raised inside a subtrans-
action command will also cause the subtransaction to be rolled back. However, non-error exits out
of the contained Tcl code (for instance, due to return) do not cause a rollback.
Here is an example:
CREATE PROCEDURE transaction_test1()
LANGUAGE pltcl
AS $$
for {set i 0} {$i < 10} {incr i} {
spi_exec "INSERT INTO test1 (a) VALUES ($i)"
if {$i % 2 == 0} {
commit
} else {
rollback
}
}
$$;
CALL transaction_test1();
pltcl.start_proc (string)
This parameter, if set to a nonempty string, specifies the name (possibly schema-qualified) of a
parameterless PL/Tcl function that is to be executed whenever a new Tcl interpreter is created
for PL/Tcl. Such a function can perform per-session initialization, such as loading additional Tcl
code. A new Tcl interpreter is created when a PL/Tcl function is first executed in a database
session, or when an additional interpreter has to be created because a PL/Tcl function is called
by a new SQL role.
The referenced function must be written in the pltcl language, and must not be marked SECURITY
DEFINER. (These restrictions ensure that it runs in the interpreter it's supposed to initialize.) The
current user must have permission to call it, too.
If the function fails with an error it will abort the function call that caused the new interpreter to be
created and propagate out to the calling query, causing the current transaction or subtransaction
to be aborted. Any actions already done within Tcl won't be undone; however, that interpreter
won't be used again. If the language is used again the initialization will be attempted again within
a fresh Tcl interpreter.
Only superusers can change this setting. Although this setting can be changed within a session,
such changes will not affect Tcl interpreters that have already been created.
pltclu.start_proc (string)
This parameter is exactly like pltcl.start_proc, except that it applies to PL/TclU. The
referenced function must be written in the pltclu language.
Chapter 45. PL/Perl - Perl Procedural
Language
PL/Perl is a loadable procedural language that enables you to write PostgreSQL functions and proce-
dures in the Perl programming language (https://www.perl.org).
The main advantage to using PL/Perl is that this allows use, within stored functions and procedures,
of the manyfold “string munging” operators and functions available for Perl. Parsing complex strings
might be easier using Perl than it is with the string functions and control structures provided in PL/
pgSQL.
Tip
If a language is installed into template1, all subsequently created databases will have the
language installed automatically.
Note
Users of source packages must specially enable the build of PL/Perl during the installation
process. (Refer to Chapter 16 for more information.) Users of binary packages might find PL/
Perl in a separate subpackage.
The body of the function is ordinary Perl code. In fact, the PL/Perl glue code wraps it inside a Perl
subroutine. A PL/Perl function is called in a scalar context, so it can't return a list. You can return non-
scalar values (arrays, records, and sets) by returning a reference, as discussed below.
In a PL/Perl procedure, any return value from the Perl code is ignored.
PL/Perl also supports anonymous code blocks called with the DO statement:
DO $$
# PL/Perl code
$$ LANGUAGE plperl;
An anonymous code block receives no arguments, and whatever value it might return is discarded.
Otherwise it behaves just like a function.
Note
The use of named nested subroutines is dangerous in Perl, especially if they refer to lexi-
cal variables in the enclosing scope. Because a PL/Perl function is wrapped in a subroutine,
any named subroutine you place inside one will be nested. In general, it is far safer to create
anonymous subroutines which you call via a coderef. For more information, see the entries for
Variable "%s" will not stay shared and Variable "%s" is not avail-
able in the perldiag man page, or search the Internet for “perl nested named subroutine”.
The syntax of the CREATE FUNCTION command requires the function body to be written as a string
constant. It is usually most convenient to use dollar quoting (see Section 4.1.2.4) for the string constant.
If you choose to use escape string syntax E'', you must double any single quote marks (') and
backslashes (\) used in the body of the function (see Section 4.1.2.1).
Arguments and results are handled as in any other Perl subroutine: arguments are passed in @_, and a
result value is returned with return or as the last expression evaluated in the function.
For example, a function returning the greater of two integer values could be defined as:
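A minimal sketch (the function name perl_max is illustrative):

CREATE FUNCTION perl_max (integer, integer) RETURNS integer AS $$
    if ($_[0] > $_[1]) { return $_[0]; }
    return $_[1];
$$ LANGUAGE plperl;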
Note
Arguments will be converted from the database's encoding to UTF-8 for use inside PL/Perl,
and then converted from UTF-8 back to the database encoding upon return.
If an SQL null value is passed to a function, the argument value will appear as “undefined” in Perl.
The above function definition will not behave very nicely with null inputs (in fact, it will act as though
they are zeroes). We could add STRICT to the function definition to make PostgreSQL do something
more reasonable: if a null value is passed, the function will not be called at all, but will just return a
null result automatically. Alternatively, we could check for undefined inputs in the function body. For
example, suppose that we wanted perl_max with one null and one nonnull argument to return the
nonnull argument, rather than a null value:
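One way to do that, sketched here, is to test each argument with defined:

CREATE OR REPLACE FUNCTION perl_max (integer, integer) RETURNS integer AS $$
    my ($x, $y) = @_;
    if (not defined $x) {
        return undef if not defined $y;
        return $y;
    }
    return $x if not defined $y;
    return $x if $x > $y;
    return $y;
$$ LANGUAGE plperl;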
As shown above, to return an SQL null value from a PL/Perl function, return an undefined value. This
can be done whether the function is strict or not.
Anything in a function argument that is not a reference is a string, which is in the standard PostgreSQL
external text representation for the relevant data type. In the case of ordinary numeric or text types, Perl
will just do the right thing and the programmer will normally not have to worry about it. However, in
other cases the argument will need to be converted into a form that is more usable in Perl. For example,
the decode_bytea function can be used to convert an argument of type bytea into unescaped
binary.
Similarly, values passed back to PostgreSQL must be in the external text representation format. For
example, the encode_bytea function can be used to escape binary data for a return value of type
bytea.
Perl can return PostgreSQL arrays as references to Perl arrays. Here is an example:
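A sketch matching the query shown below (the function name returns_array is illustrative):

CREATE OR REPLACE FUNCTION returns_array()
RETURNS text[][] AS $$
    return [['a','b'],['c','d']];
$$ LANGUAGE plperl;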
select returns_array();
PostgreSQL arrays passed into a PL/Perl function can likewise be treated as array references. For example:

CREATE OR REPLACE FUNCTION concat_array_elements(text[]) RETURNS TEXT AS $$
    my $arg = shift;
    my $result = "";
    return undef if (!defined $arg);
    # as an array reference
    for (@$arg) {
        $result .= $_;
    }
    return $result;
$$ LANGUAGE plperl;

SELECT concat_array_elements(ARRAY['PL','/','Perl']);
Note
Multidimensional arrays are represented as references to lower-dimensional arrays of refer-
ences in a way common to every Perl programmer.
Composite-type arguments are passed to the function as references to hashes. The keys of the hash are
the attribute names of the composite type. Here is an example:
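A sketch, assuming an employee table whose row type is used as the argument:

CREATE TABLE employee (
    name text,
    basesalary integer,
    bonus integer
);

CREATE FUNCTION empcomp(employee) RETURNS integer AS $$
    my ($emp) = @_;
    return $emp->{basesalary} + $emp->{bonus};
$$ LANGUAGE plperl;

SELECT name, empcomp(employee.*) FROM employee;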
A PL/Perl function can return a composite-type result using the same approach: return a reference to
a hash that has the required attributes. For example:
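A sketch, assuming a composite type testrowperl:

CREATE TYPE testrowperl AS (f1 integer, f2 text, f3 text);

CREATE OR REPLACE FUNCTION perl_row() RETURNS testrowperl AS $$
    return {f2 => 'hello', f1 => 1, f3 => 'world'};
$$ LANGUAGE plperl;

SELECT * FROM perl_row();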
Any columns in the declared result data type that are not present in the hash will be returned as null
values.
PL/Perl functions can also return sets of either scalar or composite types. Usually you'll want to return
rows one at a time, both to speed up startup time and to keep from queuing up the entire result set
in memory. You can do this with return_next as illustrated below. Note that after the last re-
turn_next, you must put either return or (better) return undef.
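A sketch of a set-returning function built with return_next (the function name is illustrative):

CREATE OR REPLACE FUNCTION perl_set_int(int) RETURNS SETOF INTEGER AS $$
    foreach (0..$_[0]) {
        return_next($_);
    }
    return undef;
$$ LANGUAGE plperl;

SELECT * FROM perl_set_int(5);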
For small result sets, you can return a reference to an array that contains either scalars, references to
arrays, or references to hashes for simple types, array types, and composite types, respectively. Here
are some simple examples of returning the entire result set as an array reference:
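Two sketches, one returning a set of scalars and one returning a set of the testrowperl composite type from the earlier sketch:

CREATE OR REPLACE FUNCTION perl_set_int(int) RETURNS SETOF INTEGER AS $$
    return [0..$_[0]];
$$ LANGUAGE plperl;

CREATE OR REPLACE FUNCTION perl_set() RETURNS SETOF testrowperl AS $$
    return [
        { f1 => 1, f2 => 'Hello', f3 => 'World' },
        { f1 => 2, f2 => 'Hello', f3 => 'PostgreSQL' },
        { f1 => 3, f2 => 'Hello', f3 => 'PL/Perl' }
    ];
$$ LANGUAGE plperl;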
If you wish to use the strict pragma with your code you have a few options. For temporary global
use you can SET plperl.use_strict to true. This will affect subsequent compilations of PL/
Perl functions, but not functions already compiled in the current session. For permanent global use
you can set plperl.use_strict to true in the postgresql.conf file.
For permanent use in specific functions, you can simply put
use strict;
at the top of the function body.
The feature pragma is also available to use if your Perl is version 5.10.0 or higher.
spi_exec_query(query [, max-rows])
spi_exec_query executes an SQL command and returns the entire row set as a reference to
an array of hash references. You should only use this command when you know that the result
set will be relatively small. Here is an example of a query (SELECT command) with the optional
maximum number of rows:
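$rv = spi_exec_query('SELECT * FROM my_table', 5);   # my_table is an illustrative name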
This returns up to 5 rows from the table my_table. If my_table has a column my_column,
you can get that value from row $i of the result like this:
$foo = $rv->{rows}[$i]->{my_column};
The total number of rows returned from a SELECT query can be accessed like this:
$nrows = $rv->{processed}
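For a command that does not return rows, for example an INSERT (the table name is again illustrative):
$query = "INSERT INTO my_table VALUES (1, 'test')";
$rv = spi_exec_query($query);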
You can then access the command status (e.g., SPI_OK_INSERT) like this:
$res = $rv->{status};
$nrows = $rv->{processed};
spi_query(command)
spi_fetchrow(cursor)
spi_cursor_close(cursor)
spi_query and spi_fetchrow work together as a pair for row sets which might be large,
or for cases where you wish to return rows as they arrive. spi_fetchrow works only with
spi_query. The following example illustrates how you use them together:
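A sketch of a set-returning function that fetches rows one at a time (the function name and the generate_series query are illustrative):

CREATE OR REPLACE FUNCTION double_values(integer) RETURNS SETOF integer AS $$
    my $n = shift;
    my $cursor = spi_query("SELECT i FROM generate_series(1,$n) AS s(i)");
    while (defined (my $row = spi_fetchrow($cursor))) {
        return_next($row->{i} * 2);
    }
    return undef;
$$ LANGUAGE plperl;

SELECT * FROM double_values(4);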
Normally, spi_fetchrow should be repeated until it returns undef, indicating that there
are no more rows to read. The cursor returned by spi_query is automatically freed when
spi_fetchrow returns undef. If you do not wish to read all the rows, instead call spi_cur-
sor_close to free the cursor. Failure to do so will result in memory leaks.
spi_prepare(query, typelist)
spi_exec_prepared(plan [, attributes] [, arguments])
spi_query_prepared(plan [, arguments])
spi_freeplan(plan)
These commands implement the same functionality for prepared queries. Once a query plan is prepared by a call to spi_prepare, the plan can be used instead of
the string query, either in spi_exec_prepared, where the result is the same as returned
by spi_exec_query, or in spi_query_prepared which returns a cursor exactly as
spi_query does, which can be later passed to spi_fetchrow. The optional second parame-
ter to spi_exec_prepared is a hash reference of attributes; the only attribute currently sup-
ported is limit, which sets the maximum number of rows returned by a query.
The advantage of prepared queries is that it is possible to use one prepared plan for more than one
query execution. After the plan is not needed anymore, it can be freed with spi_freeplan:
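A sketch of three functions matching the calls below; the prepared plan is kept in the session-global %_SHARED hash:

CREATE OR REPLACE FUNCTION init() RETURNS VOID AS $$
    $_SHARED{my_plan} = spi_prepare('SELECT (now() + $1)::date AS now', 'INTERVAL');
$$ LANGUAGE plperl;

CREATE OR REPLACE FUNCTION add_time(INTERVAL) RETURNS TEXT AS $$
    return spi_exec_prepared(
        $_SHARED{my_plan},
        $_[0]
    )->{rows}->[0]->{now};
$$ LANGUAGE plperl;

CREATE OR REPLACE FUNCTION done() RETURNS VOID AS $$
    spi_freeplan($_SHARED{my_plan});
    undef $_SHARED{my_plan};
$$ LANGUAGE plperl;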
SELECT init();
SELECT add_time('1 day'), add_time('2 days'), add_time('3 days');
SELECT done();
Note that the parameters of a prepared query are referenced as $1, $2, $3, etc. in spi_prepare, so
avoid declaring query strings in double quotes, as Perl's interpolation of such variables can easily
lead to hard-to-catch bugs.
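Another sketch, matching the calls and output below, illustrates the optional limit attribute of spi_exec_prepared; a hosts table with an inet column address is assumed:

CREATE OR REPLACE FUNCTION init_hosts_query() RETURNS VOID AS $$
    $_SHARED{plan} = spi_prepare('SELECT * FROM hosts WHERE address << $1', 'inet');
$$ LANGUAGE plperl;

CREATE OR REPLACE FUNCTION query_hosts(inet) RETURNS SETOF hosts AS $$
    return spi_exec_prepared(
        $_SHARED{plan},
        {limit => 2},
        $_[0]
    )->{rows};
$$ LANGUAGE plperl;

CREATE OR REPLACE FUNCTION release_hosts_query() RETURNS VOID AS $$
    spi_freeplan($_SHARED{plan});
    undef $_SHARED{plan};
$$ LANGUAGE plperl;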
SELECT init_hosts_query();
SELECT query_hosts('192.168.1.0/30');
SELECT release_hosts_query();
query_hosts
-----------------
(1,192.168.1.1)
(2,192.168.1.2)
(2 rows)
spi_commit()
spi_rollback()
Commit or roll back the current transaction. This can only be called in a procedure or anonymous
code block (DO command) called from the top level. (Note that it is not possible to run the SQL
commands COMMIT or ROLLBACK via spi_exec_query or similar. It has to be done using
these functions.) After a transaction is ended, a new transaction is automatically started, so there
is no separate function for that.
Here is an example:
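A sketch, assuming a table test1 with an integer column a:

CREATE PROCEDURE transaction_test1()
LANGUAGE plperl
AS $$
foreach my $i (0..9) {
    spi_exec_query("INSERT INTO test1 (a) VALUES ($i)");
    if ($i % 2 == 0) {
        spi_commit();
    } else {
        spi_rollback();
    }
}
$$;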
CALL transaction_test1();
elog(level, msg)
Emit a log or error message. Possible levels are DEBUG, LOG, INFO, NOTICE, WARNING, and
ERROR. ERROR raises an error condition; if this is not trapped by the surrounding Perl code,
the error propagates out to the calling query, causing the current transaction or subtransaction to
be aborted. This is effectively the same as the Perl die command. The other levels only gener-
ate messages of different priority levels. Whether messages of a particular priority are reported
to the client, written to the server log, or both is controlled by the log_min_messages and clien-
t_min_messages configuration variables. See Chapter 19 for more information.
quote_literal(string)
Return the given string suitably quoted to be used as a string literal in an SQL statement string.
Embedded single-quotes and backslashes are properly doubled. Note that quote_literal re-
turns undef on undef input; if the argument might be undef, quote_nullable is often more
suitable.
quote_nullable(string)
Return the given string suitably quoted to be used as a string literal in an SQL statement string;
or, if the argument is undef, return the unquoted string "NULL". Embedded single-quotes and
backslashes are properly doubled.
quote_ident(string)
Return the given string suitably quoted to be used as an identifier in an SQL statement string.
Quotes are added only if necessary (i.e., if the string contains non-identifier characters or would
be case-folded). Embedded quotes are properly doubled.
decode_bytea(string)
Return the unescaped binary data represented by the contents of the given string, which should
be bytea encoded.
encode_bytea(string)
Return the bytea encoded form of the binary data contents of the given string.
encode_array_literal(array)
encode_array_literal(array, delimiter)
Returns the contents of the referenced array as a string in array literal format (see Section 8.15.2).
Returns the argument value unaltered if it's not a reference to an array. The delimiter used between
elements of the array literal defaults to ", " if a delimiter is not specified or is undef.
encode_typed_literal(value, typename)
Converts a Perl variable to the value of the data type passed as a second argument and returns a
string representation of this value. Correctly handles nested arrays and values of composite types.
encode_array_constructor(array)
Returns the contents of the referenced array as a string in array constructor format (see Sec-
tion 4.2.12). Individual values are quoted using quote_nullable. Returns the argument val-
ue, quoted using quote_nullable, if it's not a reference to an array.
looks_like_number(string)
Returns a true value if the content of the given string looks like a number, according to Perl, returns
false otherwise. Returns undef if the argument is undef. Leading and trailing space is ignored.
Inf and Infinity are regarded as numbers.
is_array_ref(argument)
Returns a true value if the given argument may be treated as an array reference, that is, if ref of
the argument is ARRAY or PostgreSQL::InServer::ARRAY. Returns false otherwise.
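PL/Perl also provides the global hash %_SHARED, which can store data, including code references, between function calls for the lifetime of the current session and interpreter. A sketch that stores a quoting code reference and calls it from another function (the names myfuncs, myquote, and use_quote are illustrative):

CREATE OR REPLACE FUNCTION myfuncs() RETURNS void AS $$
    $_SHARED{myquote} = sub {
        my $arg = shift;
        $arg =~ s/(['\\])/\\$1/g;
        return "'$arg'";
    };
$$ LANGUAGE plperl;

SELECT myfuncs();   -- initializes the code reference

CREATE OR REPLACE FUNCTION use_quote(TEXT) RETURNS text AS $$
    my $text_to_quote = shift;
    my $qfunc = $_SHARED{myquote};
    return &$qfunc($text_to_quote);
$$ LANGUAGE plperl;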
(You could have replaced the above with the one-liner return $_SHARED{myquote}-
>($_[0]); at the expense of readability.)
For security reasons, PL/Perl executes functions called by any one SQL role in a separate Perl inter-
preter for that role. This prevents accidental or malicious interference by one user with the behavior of
another user's PL/Perl functions. Each such interpreter has its own value of the %_SHARED variable
and other global state. Thus, two PL/Perl functions will share the same value of %_SHARED if and
only if they are executed by the same SQL role. In an application wherein a single session executes
code under multiple SQL roles (via SECURITY DEFINER functions, use of SET ROLE, etc) you
may need to take explicit steps to ensure that PL/Perl functions can share data via %_SHARED. To
do that, make sure that functions that should communicate are owned by the same user, and mark
them SECURITY DEFINER. You must of course take care that such functions can't be used to do
anything unintended.
Here is an example of a function that will not work because file system operations are not allowed
for security reasons:
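A sketch (the file name is illustrative):

CREATE FUNCTION badfunc() RETURNS integer AS $$
    my $tmpfile = "/tmp/badfile";
    open my $fh, '>', $tmpfile
        or elog(ERROR, qq{could not open the file "$tmpfile": $!});
    print $fh "Testing writing to a file\n";
    close $fh or elog(ERROR, qq{could not close the file "$tmpfile": $!});
    return 1;
$$ LANGUAGE plperl;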
The creation of this function will fail as its use of a forbidden operation will be caught by the validator.
Sometimes it is desirable to write Perl functions that are not restricted. For example, one might want
a Perl function that sends mail. To handle these cases, PL/Perl can also be installed as an “untrusted”
language (usually called PL/PerlU). In this case the full Perl language is available. When installing
the language, the language name plperlu will select the untrusted PL/Perl variant.
The writer of a PL/PerlU function must take care that the function cannot be used to do anything
unwanted, since it will be able to do anything that could be done by a user logged in as the database
administrator. Note that the database system allows only database superusers to create functions in
untrusted languages.
If the above function was created by a superuser using the language plperlu, execution would
succeed.
In the same way, anonymous code blocks written in Perl can use restricted operations if the language
is specified as plperlu rather than plperl, but the caller must be a superuser.
Note
While PL/Perl functions run in a separate Perl interpreter for each SQL role, all PL/PerlU
functions executed in a given session run in a single Perl interpreter (which is not any of the
ones used for PL/Perl functions). This allows PL/PerlU functions to share data freely, but no
communication can occur between PL/Perl and PL/PerlU functions.
Note
Perl cannot support multiple interpreters within one process unless it was built with the appro-
priate flags, namely either usemultiplicity or useithreads. (usemultiplicity
is preferred unless you actually need to use threads. For more details, see the perlembed man
page.) If PL/Perl is used with a copy of Perl that was not built this way, then it is only possible
to have one Perl interpreter per session, and so any one session can only execute either PL/
PerlU functions, or PL/Perl functions that are all called by the same SQL role.
PL/Perl can be used to write trigger functions. In a trigger function, the hash reference $_TD contains information about the current trigger event:
$_TD->{new}{foo}
NEW value of column foo
$_TD->{old}{foo}
OLD value of column foo
$_TD->{name}
Name of the trigger being called
$_TD->{event}
Trigger event: INSERT, UPDATE, DELETE, TRUNCATE, or UNKNOWN
$_TD->{when}
When the trigger was called: BEFORE, AFTER, INSTEAD OF, or UNKNOWN
$_TD->{level}
The trigger level: ROW, STATEMENT, or UNKNOWN
$_TD->{relid}
OID of the table on which the trigger fired
$_TD->{table_name}
Name of the table on which the trigger fired
$_TD->{relname}
Name of the table on which the trigger fired. This has been deprecated, and could be removed in
a future release. Please use $_TD->{table_name} instead.
$_TD->{table_schema}
Name of the schema in which the table on which the trigger fired is located
$_TD->{argc}
Number of arguments of the trigger function
@{$_TD->{args}}
Arguments of the trigger function. Does not exist if $_TD->{argc} is 0.
Row-level triggers can return one of the following:
return;
Execute the operation normally
"SKIP"
Don't execute the operation
"MODIFY"
Indicates that the NEW row was modified by the trigger function
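A sketch of a trigger function illustrating some of the above (the table and trigger names are illustrative):

CREATE TABLE test (
    i int,
    v varchar
);

CREATE OR REPLACE FUNCTION valid_id() RETURNS trigger AS $$
    if (($_TD->{new}{i} >= 100) || ($_TD->{new}{i} <= 0)) {
        return "SKIP";    # skip INSERT/UPDATE command
    } elsif ($_TD->{new}{v} ne "immortal") {
        $_TD->{new}{v} .= "(modified by trigger)";
        return "MODIFY";  # modify row and execute INSERT/UPDATE command
    } else {
        return;           # execute INSERT/UPDATE command
    }
$$ LANGUAGE plperl;

CREATE TRIGGER test_valid_id_trig
    BEFORE INSERT OR UPDATE ON test
    FOR EACH ROW EXECUTE PROCEDURE valid_id();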
In a PL/Perl event trigger function, the hash reference $_TD contains information about the current trigger event:
$_TD->{event}
The name of the event the trigger is fired for
$_TD->{tag}
The command tag for which the trigger is fired
plperl.on_init (string)
Specifies Perl code to be executed when a Perl interpreter is first initialized, before it is specialized
for use by plperl or plperlu. The SPI functions are not available when this code is executed.
If the code fails with an error it will abort the initialization of the interpreter and propagate out to
the calling query, causing the current transaction or subtransaction to be aborted.
The Perl code is limited to a single string. Longer code can be placed into a module and loaded
by the on_init string. Examples:
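Two sketches for postgresql.conf (the file and module names are illustrative):

plperl.on_init = 'require "plperlinit.pl"'
plperl.on_init = 'use lib "/my/app"; use MyApp::PgInit;'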
Any modules loaded by plperl.on_init, either directly or indirectly, will be available for use
by plperl. This may create a security risk. To see what modules have been loaded you can use:
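One way, sketched here, is an anonymous code block that reports the keys of %INC:

DO 'elog(WARNING, join ", ", sort keys %INC)' LANGUAGE plperl;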
Initialization will happen in the postmaster if the plperl library is included in shared_pre-
load_libraries, in which case extra consideration should be given to the risk of destabilizing the
postmaster. The principal reason for making use of this feature is that Perl modules loaded by
plperl.on_init need be loaded only at postmaster start, and will be instantly available with-
out loading overhead in individual database sessions. However, keep in mind that the overhead
is avoided only for the first Perl interpreter used by a database session — either PL/PerlU, or PL/
Perl for the first SQL role that calls a PL/Perl function. Any additional Perl interpreters created in
a database session will have to execute plperl.on_init afresh. Also, on Windows there will
be no savings whatsoever from preloading, since the Perl interpreter created in the postmaster
process does not propagate to child processes.
This parameter can only be set in the postgresql.conf file or on the server command line.
plperl.on_plperl_init (string)
plperl.on_plperlu_init (string)
These parameters specify Perl code to be executed when a Perl interpreter is specialized for
plperl or plperlu respectively. This will happen when a PL/Perl or PL/PerlU function is
first executed in a database session, or when an additional interpreter has to be created because
the other language is called or a PL/Perl function is called by a new SQL role. This follows any
initialization done by plperl.on_init. The SPI functions are not available when this code is
executed. The Perl code in plperl.on_plperl_init is executed after “locking down” the
interpreter, and thus it can only perform trusted operations.
If the code fails with an error it will abort the initialization and propagate out to the calling query,
causing the current transaction or subtransaction to be aborted. Any actions already done within
Perl won't be undone; however, that interpreter won't be used again. If the language is used again
the initialization will be attempted again within a fresh Perl interpreter.
Only superusers can change these settings. Although these settings can be changed within a ses-
sion, such changes will not affect Perl interpreters that have already been used to execute func-
tions.
plperl.use_strict (boolean)
When set true subsequent compilations of PL/Perl functions will have the strict pragma en-
abled. This parameter does not affect functions already compiled in the current session.
• If you are fetching very large data sets using spi_exec_query, you should be aware that these
will all go into memory. You can avoid this by using spi_query/spi_fetchrow as illustrated
earlier.
A similar problem occurs if a set-returning function passes a large set of rows back to PostgreSQL
via return. You can avoid this problem too by instead using return_next for each row re-
turned, as shown previously.
• When a session ends normally, not due to a fatal error, any END blocks that have been defined are
executed. Currently no other actions are performed. Specifically, file handles are not automatically
flushed and objects are not automatically destroyed.
Chapter 46. PL/Python - Python
Procedural Language
The PL/Python procedural language allows PostgreSQL functions and procedures to be written in the
Python language (https://www.python.org).
To install PL/Python in a particular database, use CREATE EXTENSION plpythonu (but see also
Section 46.1).
Tip
If a language is installed into template1, all subsequently created databases will have the
language installed automatically.
PL/Python is only available as an “untrusted” language, meaning it does not offer any way of restricting
what users can do in it and is therefore named plpythonu. A trusted variant plpython might
become available in the future if a secure execution mechanism is developed in Python. The writer
of a function in untrusted PL/Python must take care that the function cannot be used to do anything
unwanted, since it will be able to do anything that could be done by a user logged in as the database
administrator. Only superusers can create functions in untrusted languages such as plpythonu.
Note
Users of source packages must specially enable the build of PL/Python during the installation
process. (Refer to the installation instructions for more information.) Users of binary packages
might find PL/Python in a separate subpackage.
• The PostgreSQL language named plpython2u implements PL/Python based on the Python 2
language variant.
• The PostgreSQL language named plpython3u implements PL/Python based on the Python 3
language variant.
• The language named plpythonu implements PL/Python based on the default Python language
variant, which is currently Python 2. (This default is independent of what any local Python installa-
tions might consider to be their “default”, for example, what /usr/bin/python might be.) The
default will probably be changed to Python 3 in a distant future release of PostgreSQL, depending
on the progress of the migration to Python 3 in the Python community.
This scheme is analogous to the recommendations in PEP 394 (https://www.python.org/dev/peps/pep-0394/) regarding the naming and transitioning of the python command.
It depends on the build configuration or the installed packages whether PL/Python for Python 2 or
Python 3 or both are available.
Tip
The built variant depends on which Python version was found during the installation or which
version was explicitly set using the PYTHON environment variable; see Section 16.4. To make
both variants of PL/Python available in one installation, the source tree has to be configured
and built twice.
• Existing users and users who are currently not interested in Python 3 use the language name
plpythonu and don't have to change anything for the foreseeable future. It is recommended to
gradually “future-proof” the code via migration to Python 2.6/2.7 to simplify the eventual migration
to Python 3.
In practice, many PL/Python functions will migrate to Python 3 with few or no changes.
• Users who know that they have heavily Python 2 dependent code and don't plan to ever change it
can make use of the plpython2u language name. This will continue to work into the very distant
future, until Python 2 support might be completely dropped by PostgreSQL.
• Users who want to dive into Python 3 can use the plpython3u language name, which will keep
working forever by today's standards. In the distant future, when Python 3 might become the default,
they might like to remove the “3” for aesthetic reasons.
• Daredevils, who want to build a Python-3-only operating system environment, can change the con-
tents of pg_pltemplate to make plpythonu be equivalent to plpython3u, keeping in mind
that this would make their installation incompatible with most of the rest of the world.
See also the document What's New In Python 3.0 (https://docs.python.org/3/whatsnew/3.0.html) for more information about porting to Python 3.
It is not allowed to use PL/Python based on Python 2 and PL/Python based on Python 3 in the same
session, because the symbols in the dynamic modules would clash, which could result in crashes of the
PostgreSQL server process. There is a check that prevents mixing Python major versions in a session,
which will abort the session if a mismatch is detected. It is possible, however, to use both PL/Python
variants in the same database, from separate sessions.
The body of a function is simply a Python script. When the function is called, its arguments are passed
as elements of the list args; named arguments are also passed as ordinary variables to the Python
script. Use of named arguments is usually more readable. The result is returned from the Python code
in the usual way, with return or yield (in case of a result-set statement). If you do not provide a
return value, Python returns the default None. PL/Python translates Python's None into the SQL null
value. In a procedure, the result from the Python code must be None (typically achieved by ending
the procedure without a return statement or by using a return statement without argument);
otherwise, an error will be raised.
For example, a function to return the greater of two integers can be defined as:
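A minimal sketch (the function name pymax matches the transformed form shown just below):

CREATE FUNCTION pymax (a integer, b integer)
  RETURNS integer
AS $$
  if a > b:
    return a
  return b
$$ LANGUAGE plpythonu;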
The Python code that is given as the body of the function definition is transformed into a Python
function. For example, the above results in:
def __plpython_procedure_pymax_23456():
    if a > b:
        return a
    return b
The arguments are set as global variables. Because of the scoping rules of Python, this has the subtle
consequence that an argument variable cannot be reassigned inside the function to the value of an
expression that involves the variable name itself, unless the variable is redeclared as global in the
block. For example, the following won't work:
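A sketch (the function name pystrip is illustrative):

CREATE FUNCTION pystrip(x text)
  RETURNS text
AS $$
  x = x.strip()  # error: x is treated as a local variable here
  return x
$$ LANGUAGE plpythonu;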
because assigning to x makes x a local variable for the entire block, and so the x on the right-hand side
of the assignment refers to a not-yet-assigned local variable x, not the PL/Python function parameter.
Using the global statement, this can be made to work:
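A sketch, continuing the pystrip example:

CREATE FUNCTION pystrip(x text)
  RETURNS text
AS $$
  global x
  x = x.strip()  # ok now
  return x
$$ LANGUAGE plpythonu;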
But it is advisable not to rely on this implementation detail of PL/Python. It is better to treat the function
parameters as read-only.
• PostgreSQL smallint and int are converted to Python int. PostgreSQL bigint and oid
are converted to long in Python 2 and to int in Python 3.
• PostgreSQL numeric is converted to Python Decimal. This type is imported from the cdeci-
mal package if that is available. Otherwise, decimal.Decimal from the standard library will
be used. cdecimal is significantly faster than decimal. In Python 3.3 and up, however, cdec-
imal has been integrated into the standard library under the name decimal, so there is no longer
any difference.
• PostgreSQL bytea is converted to Python str in Python 2 and to bytes in Python 3. In Python
2, the string should be treated as a byte sequence without any character encoding.
• All other data types, including the PostgreSQL character string types, are converted to a Python
str. In Python 2, this string will be in the PostgreSQL server encoding; in Python 3, it will be a
Unicode string like all strings.
When a PL/Python function returns, its return value is converted to the function's declared PostgreSQL
return data type as follows:
• When the PostgreSQL return type is boolean, the return value will be evaluated for truth accord-
ing to the Python rules. That is, 0 and empty string are false, but notably 'f' is true.
• When the PostgreSQL return type is bytea, the return value will be converted to a string (Python 2)
or bytes (Python 3) using the respective Python built-ins, with the result being converted to bytea.
• For all other PostgreSQL return types, the return value is converted to a string using the Python built-
in str, and the result is passed to the input function of the PostgreSQL data type. (If the Python
value is a float, it is converted using the repr built-in instead of str, to avoid loss of precision.)
Strings in Python 2 are required to be in the PostgreSQL server encoding when they are passed to
PostgreSQL. Strings that are not valid in the current server encoding will raise an error, but not all
encoding mismatches can be detected, so garbage data can still result when this is not done correctly.
Unicode strings are converted to the correct encoding automatically, so it can be safer and more
convenient to use those. In Python 3, all strings are Unicode strings.
Note that logical mismatches between the declared PostgreSQL return type and the Python data type
of the actual return object are not flagged; the value will be converted in any case.
As shown above, to return an SQL null value from a PL/Python function, return the value None. This
can be done whether the function is strict or not.
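SQL array values are passed in as Python lists, and returning a Python list produces an SQL array; a sketch matching the output shown below:

CREATE FUNCTION return_arr() RETURNS int[] AS $$
return [1, 2, 3, 4, 5]
$$ LANGUAGE plpythonu;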
SELECT return_arr();
return_arr
-------------
{1,2,3,4,5}
(1 row)
Multidimensional arrays are passed into PL/Python as nested Python lists. A 2-dimensional array is a
list of lists, for example. When returning a multi-dimensional SQL array out of a PL/Python function,
the inner lists at each level must all be of the same size. For example:
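A sketch (the function name is illustrative):

CREATE FUNCTION return_multidim_arr() RETURNS int[] AS $$
return [[1, 2, 3], [4, 5, 6]]
$$ LANGUAGE plpythonu;

SELECT return_multidim_arr();   -- {{1,2,3},{4,5,6}}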
Other Python sequences, like tuples, are also accepted for backwards-compatibility with PostgreSQL
versions 9.6 and below, when multi-dimensional arrays were not supported. However, they are always
treated as one-dimensional arrays, because they are ambiguous with composite types. For the same
reason, when a composite type is used in a multi-dimensional array, it must be represented by a tuple,
rather than a list.
Note that in Python, strings are sequences, which can have undesirable effects that might be familiar
to Python programmers:
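A sketch matching the output shown below: returning a plain string for an array result yields an array of its characters.

CREATE FUNCTION return_str_arr() RETURNS varchar[] AS $$
return "hello"
$$ LANGUAGE plpythonu;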
SELECT return_str_arr();
return_str_arr
----------------
{h,e,l,l,o}
(1 row)
There are multiple ways to return row or composite types from a Python function. The following
examples assume we have:
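a composite type such as the following (the type name named_value is assumed by the sketches below):

CREATE TYPE named_value AS (
  name   text,
  value  integer
);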
Sequence type (a tuple or list, but not a set because it is not indexable)
Returned sequence objects must have the same number of items as the composite result type has
fields. The item with index 0 is assigned to the first field of the composite type, 1 to the second
and so on. For example:
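A sketch, using the hypothetical named_value type and a function name make_pair:

CREATE FUNCTION make_pair (name text, value integer)
  RETURNS named_value
AS $$
  return ( name, value )
  # or alternatively, as list: return [ name, value ]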
$$ LANGUAGE plpythonu;
To return a SQL null for any column, insert None at the corresponding position.
When an array of composite types is returned, it cannot be returned as a list, because it is ambigu-
ous whether the Python list represents a composite type, or another array dimension.
Mapping (dictionary)
The value for each result type column is retrieved from the mapping with the column name as
key. Example:
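A sketch, again using the hypothetical named_value type:

CREATE OR REPLACE FUNCTION make_pair (name text, value integer)
  RETURNS named_value
AS $$
  return { "name": name, "value": value }
$$ LANGUAGE plpythonu;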
Any extra dictionary key/value pairs are ignored. Missing keys are treated as errors. To return a
SQL null value for any column, insert None with the corresponding column name as the key.
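Any object providing the needed attribute names works the same way; a sketch, continuing the hypothetical make_pair example (the lines that follow show an even simpler variant):

CREATE OR REPLACE FUNCTION make_pair (name text, value integer)
  RETURNS named_value
AS $$
class named_value:
  def __init__ (self, n, v):
    self.name = n
    self.value = v
return named_value(name, value)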
# or simply
class nv: pass
nv.name = name
nv.value = value
return nv
$$ LANGUAGE plpythonu;
Output parameters of procedures are passed back the same way. For example:
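A sketch (the procedure name is illustrative):

CREATE PROCEDURE python_triple(INOUT a integer, INOUT b integer) AS $$
return (a * 3, b * 3)
$$ LANGUAGE plpythonu;

CALL python_triple(5, 10);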
Generator (yield)
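A sketch of a set-returning generator, assuming a composite type greeting:

CREATE TYPE greeting AS (how text, who text);

CREATE FUNCTION greet (how text)
  RETURNS SETOF greeting
AS $$
for who in [ "World", "PostgreSQL", "PL/Python" ]:
    yield ( how, who )
$$ LANGUAGE plpythonu;

SELECT * FROM greet('hello');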
Set-returning functions with OUT parameters (using RETURNS SETOF record) are also supported.
For example:
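A sketch (the function name is illustrative):

CREATE FUNCTION multiout_simple_setof(n integer, OUT integer, OUT integer)
  RETURNS SETOF record
AS $$
return [(1, 2)] * n
$$ LANGUAGE plpythonu;

SELECT * FROM multiout_simple_setof(3);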
The global dictionary SD is available to store private data between repeated calls to the same function, and the global dictionary GD holds public data available to all Python functions within a session; use GD with care. Each function gets its own execution environment in the Python interpreter, so that global data and function arguments from myfunc are not available to myfunc2. The exception is the data in the GD dictionary, as mentioned above.
DO $$
# PL/Python code
$$ LANGUAGE plpythonu;
An anonymous code block receives no arguments, and whatever value it might return is discarded.
Otherwise it behaves just like a function.
TD["event"]
TD["when"]
TD["level"]
TD["new"]
TD["old"]
For a row-level trigger, one or both of these fields contain the respective trigger rows, depending
on the trigger event.
TD["name"]
1275
PL/Python - Python
Procedural Language
TD["table_name"]
TD["table_schema"]
TD["relid"]
TD["args"]
If the CREATE TRIGGER command included arguments, they are available in TD["args"]
[0] to TD["args"][n-1].
If TD["when"] is BEFORE or INSTEAD OF and TD["level"] is ROW, you can return None or
"OK" from the Python function to indicate the row is unmodified, "SKIP" to abort the event, or if
TD["event"] is INSERT or UPDATE you can return "MODIFY" to indicate you've modified the
new row. Otherwise the return value is ignored.
plpy.execute(query [, max-rows])
Calling plpy.execute with a query string and an optional row limit argument causes that
query to be run and the result to be returned in a result object.
The result object emulates a list or dictionary object. The result object can be accessed by row
number and column name. For example:
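rv = plpy.execute("SELECT * FROM my_table", 5)   # my_table is an illustrative name
This returns up to 5 rows from my_table, and a value of column my_column in row i can then be accessed as: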
foo = rv[i]["my_column"]
The number of rows returned can be obtained using the built-in len function.
nrows()
Returns the number of rows processed by the command. Note that this is not necessarily the
same as the number of rows returned. For example, an UPDATE command will set this value
but won't return any rows (unless RETURNING is used).
status()
The SPI_OK_* return value of the command.
colnames()
coltypes()
coltypmods()
Return a list of column names, list of column type OIDs, and list of type-specific type mod-
ifiers for the columns, respectively.
These methods raise an exception when called on a result object from a command that did
not produce a result set, e.g., UPDATE without RETURNING, or DROP TABLE. But it is OK
to use these methods on a result set containing zero rows.
__str__()
The standard __str__ method is defined so that it is possible for example to debug query
execution results using plpy.debug(rv).
Note that calling plpy.execute will cause the entire result set to be read into memory. On-
ly use that function when you are sure that the result set will be relatively small. If you don't
want to risk excessive memory usage when fetching large results, use plpy.cursor rather than
plpy.execute.
plpy.prepare(query [, argtypes])
plpy.execute(plan [, arguments [, max-rows]])
plpy.prepare prepares the execution plan for a query. It is called with a query string and a
list of parameter types, if you have parameter references in the query. For example:
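plan = plpy.prepare("SELECT last_name FROM my_users WHERE first_name = $1", ["text"])   # illustrative table and column names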
text is the type of the variable you will be passing for $1. The second argument is optional if
you don't want to pass any parameters to the query.
After preparing a statement, you use a variant of the function plpy.execute to run it:
rv = plpy.execute(plan, ["name"], 5)
Pass the plan as the first argument (instead of the query string), and a list of values to substitute
into the query as the second argument. The second argument is optional if the query does not
expect any parameters. The third argument is the optional row limit as before.
Alternatively, you can call the execute method on the plan object:
rv = plan.execute(["name"], 5)
Query parameters and result row fields are converted between PostgreSQL and Python data types
as described in Section 46.3.
When you prepare a plan using the PL/Python module it is automatically saved. Read the SPI
documentation (Chapter 47) for a description of what this means. In order to make effective use
of this across function calls one needs to use one of the persistent storage dictionaries SD or GD
(see Section 46.4). For example:
CREATE FUNCTION usesavedplan() RETURNS trigger AS $$
if "plan" in SD:
    plan = SD["plan"]
else:
    plan = plpy.prepare("SELECT 1")
    SD["plan"] = plan
# rest of function
$$ LANGUAGE plpythonu;
plpy.cursor(query)
plpy.cursor(plan [, arguments])
The plpy.cursor function accepts the same arguments as plpy.execute (except for the
row limit) and returns a cursor object, which allows you to process large result sets in smaller
chunks. As with plpy.execute, either a query string or a plan object along with a list of
arguments can be used, or the cursor function can be called as a method of the plan object.
The cursor object provides a fetch method that accepts an integer parameter and returns a re-
sult object. Each time you call fetch, the returned object will contain the next batch of rows,
never larger than the parameter value. Once all rows are exhausted, fetch starts returning an
empty result object. Cursor objects also provide an iterator interface (see https://docs.python.org/library/stdtypes.html#iterator-types), yielding one row at a time
until all rows are exhausted. Data fetched that way is not returned as result objects, but rather as
dictionaries, each dictionary corresponding to a single result row.
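For instance, a sketch that counts rows via a cursor over a prepared plan (the table largetable and its integer column num are illustrative); the two lines that follow complete the function:

CREATE FUNCTION count_odd_prepared() RETURNS integer AS $$
plan = plpy.prepare("SELECT num FROM largetable WHERE num % $1 <> 0", ["integer"])
rows = list(plpy.cursor(plan, [2]))   # or: rows = list(plan.cursor([2]))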
return len(rows)
$$ LANGUAGE plpythonu;
Cursors are automatically disposed of. But if you want to explicitly release all resources held by
a cursor, use the close method. Once closed, a cursor cannot be fetched from anymore.
Tip
Do not confuse objects created by plpy.cursor with DB-API cursors as defined by
the Python Database API specification (https://www.python.org/dev/peps/pep-0249/). They don't have anything in common except for
the name.
The actual class of the exception being raised corresponds to the specific condition that caused the
error. Refer to Table A.1 for a list of possible conditions. The module plpy.spiexceptions
defines an exception class for each PostgreSQL condition, deriving their names from the condition
name. For instance, division_by_zero becomes DivisionByZero, unique_violation
becomes UniqueViolation, fdw_error becomes FdwError, and so on. Each of these excep-
tion classes inherits from SPIError. This separation makes it easier to handle specific errors, for
instance:
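A sketch, assuming a table fractions with a unique numeric column frac:

CREATE FUNCTION insert_fraction(numerator int, denominator int) RETURNS text AS $$
from plpy import spiexceptions
try:
    plan = plpy.prepare("INSERT INTO fractions (frac) VALUES ($1 / $2)", ["int", "int"])
    plpy.execute(plan, [numerator, denominator])
except spiexceptions.DivisionByZero:
    return "denominator cannot equal zero"
except spiexceptions.UniqueViolation:
    return "already have that fraction"
except plpy.SPIError as e:
    return "other error, SQLSTATE %s" % e.sqlstate
else:
    return "fraction inserted"
$$ LANGUAGE plpythonu;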
Note that because all exceptions from the plpy.spiexceptions module inherit from SPIError,
an except clause handling it will catch any database access error.
As an alternative way of handling different error conditions, you can catch the SPIError exception
and determine the specific error condition inside the except block by looking at the sqlstate
attribute of the exception object. This attribute is a string value containing the “SQLSTATE” error
code. This approach provides approximately the same functionality as catching the individual exception classes from plpy.spiexceptions.
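Consider, for example, a sketch of a function that implements a transfer between two accounts (the accounts and operations tables and the account names are illustrative):

CREATE FUNCTION transfer_funds() RETURNS void AS $$
try:
    plpy.execute("UPDATE accounts SET balance = balance - 100 WHERE account_name = 'joe'")
    plpy.execute("UPDATE accounts SET balance = balance + 100 WHERE account_name = 'mary'")
except plpy.SPIError as e:
    result = "error transferring funds: %s" % e.args
else:
    result = "funds transferred correctly"
plan = plpy.prepare("INSERT INTO operations (result) VALUES ($1)", ["text"])
plpy.execute(plan, [result])
$$ LANGUAGE plpythonu;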
If the second UPDATE statement results in an exception being raised, this function will report the error,
but the result of the first UPDATE will nevertheless be committed. In other words, the funds will be
withdrawn from Joe's account, but will not be transferred to Mary's account.
To avoid such issues, you can wrap your plpy.execute calls in an explicit subtransaction. The
plpy module provides a helper object to manage explicit subtransactions that gets created with the
plpy.subtransaction() function. Objects created by this function implement the context man-
ager interface6. Using explicit subtransactions we can rewrite our function as:
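A sketch, continuing the hypothetical transfer_funds example; the closing line below completes the function:

CREATE FUNCTION transfer_funds2() RETURNS void AS $$
try:
    with plpy.subtransaction():
        plpy.execute("UPDATE accounts SET balance = balance - 100 WHERE account_name = 'joe'")
        plpy.execute("UPDATE accounts SET balance = balance + 100 WHERE account_name = 'mary'")
except plpy.SPIError as e:
    result = "error transferring funds: %s" % e.args
else:
    result = "funds transferred correctly"
plan = plpy.prepare("INSERT INTO operations (result) VALUES ($1)", ["text"])
plpy.execute(plan, [result])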
$$ LANGUAGE plpythonu;
Note that the use of try/except is still required. Otherwise the exception would propagate to the top
of the Python stack and would cause the whole function to abort with a PostgreSQL error, so that the
operations table would not have any row inserted into it. The subtransaction context manager does
not trap errors, it only assures that all database operations executed inside its scope will be atomically
committed or rolled back. A rollback of the subtransaction block occurs on any kind of exception exit,
not only ones caused by errors originating from database access. A regular Python exception raised
inside an explicit subtransaction block would also cause the subtransaction to be rolled back.
Note
Although context managers were implemented in Python 2.5, to use the with syntax in that
version you need to use a future statement. Because of implementation details, however, you
cannot use future statements in PL/Python functions.
To commit the current transaction, call plpy.commit(). To roll back the current transaction, call plpy.rollback(). (Note that it is not possible to run
the SQL commands COMMIT or ROLLBACK via plpy.execute or similar. It has to be done using
these functions.) After a transaction is ended, a new transaction is automatically started, so there is
no separate function for that.
Here is an example:
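A sketch, assuming a table test1 with an integer column a:

CREATE PROCEDURE transaction_test1()
LANGUAGE plpythonu
AS $$
for i in range(0, 10):
    plpy.execute("INSERT INTO test1 (a) VALUES (%d)" % i)
    if i % 2 == 0:
        plpy.commit()
    else:
        plpy.rollback()
$$;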
CALL transaction_test1();
plpy.debug(msg, **kwargs)
plpy.log(msg, **kwargs)
plpy.info(msg, **kwargs)
plpy.notice(msg, **kwargs)
plpy.warning(msg, **kwargs)
plpy.error(msg, **kwargs)
plpy.fatal(msg, **kwargs)
plpy.error and plpy.fatal actually raise a Python exception which, if uncaught, propagates
out to the calling query, causing the current transaction or subtransaction to be aborted. raise
plpy.Error(msg) and raise plpy.Fatal(msg) are equivalent to calling plpy.er-
ror(msg) and plpy.fatal(msg), respectively but the raise form does not allow passing key-
word arguments. The other functions only generate messages of different priority levels. Whether
messages of a particular priority are reported to the client, written to the server log, or both is con-
trolled by the log_min_messages and client_min_messages configuration variables. See Chapter 19
for more information.
The msg argument is given as a positional argument. For backward compatibility, more than one
positional argument can be given. In that case, the string representation of the tuple of positional
arguments becomes the message reported to the client.
The following keyword-only arguments are accepted:
detail
hint
sqlstate
schema_name
table_name
column_name
datatype_name
constraint_name
The string representation of the objects passed as keyword-only arguments is used to enrich the mes-
sages reported to the client. For example:
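A sketch producing output like that shown below:

CREATE FUNCTION raise_custom_exception() RETURNS void AS $$
plpy.error("custom exception message",
           detail="some info about exception",
           hint="hint for users")
$$ LANGUAGE plpythonu;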
=# SELECT raise_custom_exception();
ERROR: plpy.Error: custom exception message
DETAIL: some info about exception
HINT: hint for users
CONTEXT: Traceback (most recent call last):
PL/Python function "raise_custom_exception", line 4, in <module>
hint="hint for users")
PL/Python function "raise_custom_exception"
The following environment variables, when set in the environment of the main PostgreSQL server process, influence the behavior of the Python interpreter used by PL/Python:
• PYTHONHOME
• PYTHONPATH
• PYTHONY2K
• PYTHONOPTIMIZE
• PYTHONDEBUG
• PYTHONVERBOSE
• PYTHONCASEOK
• PYTHONDONTWRITEBYTECODE
• PYTHONIOENCODING
• PYTHONUSERBASE
• PYTHONHASHSEED
(It appears to be a Python implementation detail beyond the control of PL/Python that some of the
environment variables listed on the python man page are only effective in a command-line interpreter
and not an embedded Python interpreter.)
Chapter 47. Server Programming
Interface
The Server Programming Interface (SPI) gives writers of user-defined C functions the ability to run
SQL commands inside their functions or procedures. SPI is a set of interface functions to simplify
access to the parser, planner, and executor. SPI also does some memory management.
Note
The available procedural languages provide various means to execute SQL commands from
functions. Most of these facilities are based on SPI, so this documentation might be of use for
users of those languages as well.
Note that if a command invoked via SPI fails, then control will not be returned to your C function.
Rather, the transaction or subtransaction in which your C function executes will be rolled back. (This
might seem surprising given that the SPI functions mostly have documented error-return conventions.
Those conventions only apply for errors detected within the SPI functions themselves, however.) It
is possible to recover control after an error by establishing your own subtransaction surrounding SPI
calls that might fail.
SPI functions return a nonnegative result on success (either via a returned integer value or in the global
variable SPI_result, as described below). On error, a negative result or NULL will be returned.
Source code files that use SPI must include the header file executor/spi.h.
SPI_connect
SPI_connect, SPI_connect_ext — connect a C function to the SPI manager
Synopsis
int SPI_connect(void)
int SPI_connect_ext(int options)
Description
SPI_connect opens a connection from a C function invocation to the SPI manager. You must call
this function if you want to execute commands through SPI. Some utility SPI functions can be called
from unconnected C functions.
SPI_connect_ext does the same but has an argument that allows passing option flags. Currently,
the following option values are available:
SPI_OPT_NONATOMIC
Sets the SPI connection to be nonatomic, which means that transaction control calls (SPI_com-
mit, SPI_rollback) are allowed. Otherwise, calling those functions will result in an imme-
diate error.
Return Value
SPI_OK_CONNECT
on success
SPI_ERROR_CONNECT
on error
SPI_finish
SPI_finish — disconnect a C function from the SPI manager
Synopsis
int SPI_finish(void)
Description
SPI_finish closes an existing connection to the SPI manager. You must call this function after
completing the SPI operations needed during your C function's current invocation. You do not need
to worry about making this happen, however, if you abort the transaction via elog(ERROR). In that
case SPI will clean itself up automatically.
Return Value
SPI_OK_FINISH
if properly disconnected
SPI_ERROR_UNCONNECTED
if called from an unconnected C function
SPI_execute
SPI_execute — execute a command
Synopsis
int SPI_execute(const char * command, bool read_only, long count)
Description
SPI_execute executes the specified SQL command for count rows. If read_only is true, the
command must be read-only, and execution overhead is somewhat reduced.
If count is zero then the command is executed for all rows that it applies to. If count is greater than
zero, then no more than count rows will be retrieved; execution stops when the count is reached,
much like adding a LIMIT clause to the query. For example,
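SPI_execute("SELECT * FROM foo", true, 5);   /* a sketch; foo and bar below are illustrative table names */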
will retrieve at most 5 rows from the table. Note that such a limit is only effective when the command
actually returns rows. For example,
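SPI_execute("INSERT INTO foo SELECT * FROM bar", false, 5);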
inserts all rows from bar, ignoring the count parameter. However, with
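SPI_execute("INSERT INTO foo SELECT * FROM bar RETURNING *", false, 5);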
at most 5 rows would be inserted, since execution would stop after the fifth RETURNING result row
is retrieved.
You can pass multiple commands in one string; SPI_execute returns the result for the command
executed last. The count limit applies to each command separately (even though only the last result
will actually be returned). The limit is not applied to any hidden commands generated by rules.
When read_only is false, SPI_execute increments the command counter and computes a
new snapshot before executing each command in the string. The snapshot does not actually change
if the current transaction isolation level is SERIALIZABLE or REPEATABLE READ, but in READ
COMMITTED mode the snapshot update allows each command to see the results of newly committed
transactions from other sessions. This is essential for consistent behavior when the commands are
modifying the database.
When read_only is true, SPI_execute does not update either the snapshot or the command
counter, and it allows only plain SELECT commands to appear in the command string. The commands
are executed using the snapshot previously established for the surrounding query. This execution mode
is somewhat faster than the read/write mode due to eliminating per-command overhead. It also allows
genuinely stable functions to be built: since successive executions will all use the same snapshot, there
will be no change in the results.
It is generally unwise to mix read-only and read-write commands within a single function using SPI;
that could result in very confusing behavior, since the read-only queries would not see the results of
any database updates done by the read-write queries.
The actual number of rows for which the (last) command was executed is returned in the
global variable SPI_processed. If the return value of the function is SPI_OK_SELECT,
SPI_OK_INSERT_RETURNING, SPI_OK_DELETE_RETURNING, or SPI_OK_UPDATE_RE-
TURNING, then you can use the global pointer SPITupleTable *SPI_tuptable to access the
result rows. Some utility commands (such as EXPLAIN) also return row sets, and SPI_tuptable
will contain the result in these cases too. Some utility commands (COPY, CREATE TABLE AS) don't
return a row set, so SPI_tuptable is NULL, but they still return the number of rows processed
in SPI_processed.
The structure SPITupleTable is defined thus:
typedef struct
{
    MemoryContext tuptabcxt;    /* memory context of result table */
    uint64        alloced;      /* number of alloced vals */
    uint64        free;         /* number of free vals */
    TupleDesc     tupdesc;      /* row descriptor */
    HeapTuple    *vals;         /* rows */
} SPITupleTable;
vals is an array of pointers to rows. (The number of valid entries is given by SPI_processed.)
tupdesc is a row descriptor which you can pass to SPI functions dealing with rows. tuptabcxt,
alloced, and free are internal fields not intended for use by SPI callers.
SPI_finish frees all SPITupleTables allocated during the current C function. You can free a
particular result table earlier, if you are done with it, by calling SPI_freetuptable.
Arguments
const char * command
string containing the command to execute
bool read_only
true for read-only execution
long count
maximum number of rows to return, or 0 for no limit
Return Value
If the execution of the command was successful then one of the following (nonnegative) values will
be returned:
SPI_OK_SELECT
SPI_OK_SELINTO
SPI_OK_INSERT
SPI_OK_DELETE
SPI_OK_UPDATE
SPI_OK_INSERT_RETURNING
SPI_OK_DELETE_RETURNING
SPI_OK_UPDATE_RETURNING
SPI_OK_UTILITY
SPI_OK_REWRITTEN
if the command was rewritten into another kind of command (e.g., UPDATE became an INSERT)
by a rule.
SPI_ERROR_ARGUMENT
if command is NULL or count is less than 0
SPI_ERROR_COPY
if COPY TO stdout or COPY FROM stdin was attempted
SPI_ERROR_TRANSACTION
if a transaction-manipulation command was attempted (BEGIN, COMMIT, ROLLBACK, SAVEPOINT, PREPARE TRANSACTION, COMMIT PREPARED, ROLLBACK PREPARED, or various variants thereof)
SPI_ERROR_OPUNKNOWN
if the command type is unknown (shouldn't happen)
SPI_ERROR_UNCONNECTED
if called from an unconnected C function
Notes
All SPI query-execution functions set both SPI_processed and SPI_tuptable (just the pointer,
not the contents of the structure). Save these two global variables into local C function variables if
you need to access the result table of SPI_execute or another query-execution function across later
calls.
SPI_exec
SPI_exec — execute a read/write command
Synopsis
int SPI_exec(const char * command, long count)
Description
SPI_exec is the same as SPI_execute, with the latter's read_only parameter always taken
as false.
Arguments
const char * command
long count
Return Value
See SPI_execute.
SPI_execute_with_args
SPI_execute_with_args — execute a command with out-of-line parameters
Synopsis
int SPI_execute_with_args(const char *command,
                          int nargs, Oid *argtypes,
                          Datum *values, const char *nulls,
                          bool read_only, long count)
Description
SPI_execute_with_args executes a command that might include references to externally sup-
plied parameters. The command text refers to a parameter as $n, and the call specifies data types and
values for each such symbol. read_only and count have the same interpretation as in SPI_ex-
ecute.
The main advantage of this routine compared to SPI_execute is that data values can be inserted into
the command without tedious quoting/escaping, and thus with much less risk of SQL-injection attacks.
Arguments
const char * command
command string
int nargs
Oid * argtypes
an array of length nargs, containing the OIDs of the data types of the parameters
Datum * values
an array of length nargs, containing the actual parameter values
const char * nulls
an array of length nargs, describing which parameters are null
If nulls is NULL then SPI_execute_with_args assumes that no parameters are null. Oth-
erwise, each entry of the nulls array should be ' ' if the corresponding parameter value is
non-null, or 'n' if the corresponding parameter value is null. (In the latter case, the actual value
in the corresponding values entry doesn't matter.) Note that nulls is not a text string, just an
array: it does not need a '\0' terminator.
bool read_only
long count
Return Value
The return value is the same as for SPI_execute.
SPI_prepare
SPI_prepare — prepare a statement, without executing it yet
Synopsis
SPIPlanPtr SPI_prepare(const char * command, int nargs, Oid * argtypes)
Description
SPI_prepare creates and returns a prepared statement for the specified command, but doesn't
execute the command. The prepared statement can later be executed repeatedly using SPI_exe-
cute_plan.
A prepared command can be generalized by writing parameters ($1, $2, etc.) in place of what would
be constants in a normal command. The actual values of the parameters are then specified when
SPI_execute_plan is called. This allows the prepared command to be used over a wider range
of situations than would be possible without parameters.
The statement returned by SPI_prepare can be used only in the current invocation of the C function,
since SPI_finish frees memory allocated for such a statement. But the statement can be saved for
longer using the functions SPI_keepplan or SPI_saveplan.
Arguments
const char * command
command string
int nargs
Oid * argtypes
pointer to an array containing the OIDs of the data types of the parameters
Return Value
SPI_prepare returns a non-null pointer to an SPIPlan, which is an opaque struct representing a
prepared statement. On error, NULL will be returned, and SPI_result will be set to one of the same
error codes used by SPI_execute, except that it is set to SPI_ERROR_ARGUMENT if command
is NULL, or if nargs is less than 0, or if nargs is greater than 0 and argtypes is NULL.
Notes
If no parameters are defined, a generic plan will be created at the first use of SPI_execute_plan,
and used for all subsequent executions as well. If there are parameters, the first few uses of SPI_ex-
ecute_plan will generate custom plans that are specific to the supplied parameter values. Af-
ter enough uses of the same prepared statement, SPI_execute_plan will build a generic plan,
and if that is not too much more expensive than the custom plans, it will start using the generic
plan instead of re-planning each time. If this default behavior is unsuitable, you can alter it by pass-
ing the CURSOR_OPT_GENERIC_PLAN or CURSOR_OPT_CUSTOM_PLAN flag to SPI_pre-
pare_cursor, to force use of generic or custom plans respectively.
Although the main point of a prepared statement is to avoid repeated parse analysis and planning
of the statement, PostgreSQL will force re-analysis and re-planning of the statement before using it
whenever database objects used in the statement have undergone definitional (DDL) changes since the
previous use of the prepared statement. Also, if the value of search_path changes from one use to the
next, the statement will be re-parsed using the new search_path. (This latter behavior is new as
of PostgreSQL 9.3.) See PREPARE for more information about the behavior of prepared statements.
SPIPlanPtr is declared as a pointer to an opaque struct type in spi.h. It is unwise to try to ac-
cess its contents directly, as that makes your code much more likely to break in future revisions of
PostgreSQL.
The name SPIPlanPtr is somewhat historical, since the data structure no longer necessarily con-
tains an execution plan.
SPI_prepare_cursor
SPI_prepare_cursor — prepare a statement, without executing it yet
Synopsis
SPIPlanPtr SPI_prepare_cursor(const char * command, int nargs,
                              Oid * argtypes, int cursorOptions)
Description
SPI_prepare_cursor is identical to SPI_prepare, except that it also allows specification of
the planner's “cursor options” parameter. This is a bit mask having the values shown in nodes/
parsenodes.h for the options field of DeclareCursorStmt. SPI_prepare always takes
the cursor options as zero.
Arguments
const char * command
command string
int nargs
Oid * argtypes
pointer to an array containing the OIDs of the data types of the parameters
int cursorOptions
Return Value
SPI_prepare_cursor has the same return conventions as SPI_prepare.
Notes
Useful bits to set in cursorOptions include CURSOR_OPT_SCROLL,
CURSOR_OPT_NO_SCROLL, CURSOR_OPT_FAST_PLAN, CURSOR_OPT_GENERIC_PLAN,
and CURSOR_OPT_CUSTOM_PLAN. Note in particular that CURSOR_OPT_HOLD is ignored.
SPI_prepare_params
SPI_prepare_params — prepare a statement, without executing it yet
Synopsis
SPIPlanPtr SPI_prepare_params(const char * command,
                              ParserSetupHook parserSetup,
                              void * parserSetupArg,
                              int cursorOptions)
Description
SPI_prepare_params creates and returns a prepared statement for the specified command, but
doesn't execute the command. This function is equivalent to SPI_prepare_cursor, with the ad-
dition that the caller can specify parser hook functions to control the parsing of external parameter
references.
Arguments
const char * command
command string
ParserSetupHook parserSetup
void * parserSetupArg
int cursorOptions
Return Value
SPI_prepare_params has the same return conventions as SPI_prepare.
SPI_getargcount
SPI_getargcount — return the number of arguments needed by a statement prepared by SPI_pre-
pare
Synopsis
Description
SPI_getargcount returns the number of arguments needed to execute a statement prepared by
SPI_prepare.
Arguments
SPIPlanPtr plan
Return Value
The count of expected arguments for the plan. If the plan is NULL or invalid, SPI_result is set
to SPI_ERROR_ARGUMENT and -1 is returned.
1298
Server Programming Interface
SPI_getargtypeid
SPI_getargtypeid — return the data type OID for an argument of a statement prepared by SPI_pre-
pare
Synopsis
Description
SPI_getargtypeid returns the OID representing the type for the argIndex'th argument of a
statement prepared by SPI_prepare. First argument is at index zero.
Arguments
SPIPlanPtr plan
int argIndex
Return Value
The type OID of the argument at the given index. If the plan is NULL or invalid, or argIndex is
less than 0 or not less than the number of arguments declared for the plan, SPI_result is set to
SPI_ERROR_ARGUMENT and InvalidOid is returned.
1299
Server Programming Interface
SPI_is_cursor_plan
SPI_is_cursor_plan — return true if a statement prepared by SPI_prepare can be used with
SPI_cursor_open
Synopsis
Description
SPI_is_cursor_plan returns true if a statement prepared by SPI_prepare can be passed
as an argument to SPI_cursor_open, or false if that is not the case. The criteria are that the
plan represents one single command and that this command returns tuples to the caller; for example,
SELECT is allowed unless it contains an INTO clause, and UPDATE is allowed only if it contains a
RETURNING clause.
Arguments
SPIPlanPtr plan
Return Value
true or false to indicate if the plan can produce a cursor or not, with SPI_result set to zero.
If it is not possible to determine the answer (for example, if the plan is NULL or invalid, or if called
when not connected to SPI), then SPI_result is set to a suitable error code and false is returned.
1300
Server Programming Interface
SPI_execute_plan
SPI_execute_plan — execute a statement prepared by SPI_prepare
Synopsis
Description
SPI_execute_plan executes a statement prepared by SPI_prepare or one of its siblings.
read_only and count have the same interpretation as in SPI_execute.
Arguments
SPIPlanPtr plan
Datum * values
An array of actual parameter values. Must have same length as the statement's number of argu-
ments.
An array describing which parameters are null. Must have same length as the statement's number
of arguments.
If nulls is NULL then SPI_execute_plan assumes that no parameters are null. Otherwise,
each entry of the nulls array should be ' ' if the corresponding parameter value is non-null,
or 'n' if the corresponding parameter value is null. (In the latter case, the actual value in the
corresponding values entry doesn't matter.) Note that nulls is not a text string, just an array:
it does not need a '\0' terminator.
bool read_only
long count
Return Value
The return value is the same as for SPI_execute, with the following additional possible error (neg-
ative) results:
SPI_ERROR_ARGUMENT
SPI_ERROR_PARAM
1301
Server Programming Interface
1302
Server Programming Interface
SPI_execute_plan_with_paramlist
SPI_execute_plan_with_paramlist — execute a statement prepared by SPI_prepare
Synopsis
Description
SPI_execute_plan_with_paramlist executes a statement prepared by SPI_prepare.
This function is equivalent to SPI_execute_plan except that information about the parameter
values to be passed to the query is presented differently. The ParamListInfo representation can
be convenient for passing down values that are already available in that format. It also supports use of
dynamic parameter sets via hook functions specified in ParamListInfo.
Arguments
SPIPlanPtr plan
ParamListInfo params
bool read_only
long count
Return Value
The return value is the same as for SPI_execute_plan.
1303
Server Programming Interface
SPI_execp
SPI_execp — execute a statement in read/write mode
Synopsis
Description
SPI_execp is the same as SPI_execute_plan, with the latter's read_only parameter always
taken as false.
Arguments
SPIPlanPtr plan
Datum * values
An array of actual parameter values. Must have same length as the statement's number of argu-
ments.
An array describing which parameters are null. Must have same length as the statement's number
of arguments.
If nulls is NULL then SPI_execp assumes that no parameters are null. Otherwise, each entry
of the nulls array should be ' ' if the corresponding parameter value is non-null, or 'n' if the
corresponding parameter value is null. (In the latter case, the actual value in the corresponding
values entry doesn't matter.) Note that nulls is not a text string, just an array: it does not need
a '\0' terminator.
long count
Return Value
See SPI_execute_plan.
1304
Server Programming Interface
SPI_cursor_open
SPI_cursor_open — set up a cursor using a statement created with SPI_prepare
Synopsis
Description
SPI_cursor_open sets up a cursor (internally, a portal) that will execute a statement prepared
by SPI_prepare. The parameters have the same meanings as the corresponding parameters to
SPI_execute_plan.
Using a cursor instead of executing the statement directly has two benefits. First, the result rows can
be retrieved a few at a time, avoiding memory overrun for queries that return many rows. Second,
a portal can outlive the current C function (it can, in fact, live to the end of the current transaction).
Returning the portal name to the C function's caller provides a way of returning a row set as result.
The passed-in parameter data will be copied into the cursor's portal, so it can be freed while the cursor
still exists.
Arguments
const char * name
SPIPlanPtr plan
Datum * values
An array of actual parameter values. Must have same length as the statement's number of argu-
ments.
An array describing which parameters are null. Must have same length as the statement's number
of arguments.
If nulls is NULL then SPI_cursor_open assumes that no parameters are null. Otherwise,
each entry of the nulls array should be ' ' if the corresponding parameter value is non-null,
or 'n' if the corresponding parameter value is null. (In the latter case, the actual value in the
corresponding values entry doesn't matter.) Note that nulls is not a text string, just an array:
it does not need a '\0' terminator.
bool read_only
Return Value
Pointer to portal containing the cursor. Note there is no error return convention; any error will be
reported via elog.
1305
Server Programming Interface
SPI_cursor_open_with_args
SPI_cursor_open_with_args — set up a cursor using a query and parameters
Synopsis
Description
SPI_cursor_open_with_args sets up a cursor (internally, a portal) that will execute the spec-
ified query. Most of the parameters have the same meanings as the corresponding parameters to
SPI_prepare_cursor and SPI_cursor_open.
For one-time query execution, this function should be preferred over SPI_prepare_cursor fol-
lowed by SPI_cursor_open. If the same command is to be executed with many different parame-
ters, either method might be faster, depending on the cost of re-planning versus the benefit of custom
plans.
The passed-in parameter data will be copied into the cursor's portal, so it can be freed while the cursor
still exists.
Arguments
const char * name
command string
int nargs
Oid * argtypes
an array of length nargs, containing the OIDs of the data types of the parameters
Datum * values
1306
Server Programming Interface
bool read_only
int cursorOptions
Return Value
Pointer to portal containing the cursor. Note there is no error return convention; any error will be
reported via elog.
1307
Server Programming Interface
SPI_cursor_open_with_paramlist
SPI_cursor_open_with_paramlist — set up a cursor using parameters
Synopsis
Description
SPI_cursor_open_with_paramlist sets up a cursor (internally, a portal) that will execute a
statement prepared by SPI_prepare. This function is equivalent to SPI_cursor_open except
that information about the parameter values to be passed to the query is presented differently. The
ParamListInfo representation can be convenient for passing down values that are already avail-
able in that format. It also supports use of dynamic parameter sets via hook functions specified in
ParamListInfo.
The passed-in parameter data will be copied into the cursor's portal, so it can be freed while the cursor
still exists.
Arguments
const char * name
SPIPlanPtr plan
ParamListInfo params
bool read_only
Return Value
Pointer to portal containing the cursor. Note there is no error return convention; any error will be
reported via elog.
1308
Server Programming Interface
SPI_cursor_find
SPI_cursor_find — find an existing cursor by name
Synopsis
Description
SPI_cursor_find finds an existing portal by name. This is primarily useful to resolve a cursor
name returned as text by some other function.
Arguments
const char * name
Return Value
pointer to the portal with the specified name, or NULL if none was found
1309
Server Programming Interface
SPI_cursor_fetch
SPI_cursor_fetch — fetch some rows from a cursor
Synopsis
Description
SPI_cursor_fetch fetches some rows from a cursor. This is equivalent to a subset of the SQL
command FETCH (see SPI_scroll_cursor_fetch for more functionality).
Arguments
Portal portal
bool forward
long count
Return Value
SPI_processed and SPI_tuptable are set as in SPI_execute if successful.
Notes
Fetching backward may fail if the cursor's plan was not created with the CURSOR_OPT_SCROLL
option.
1310
Server Programming Interface
SPI_cursor_move
SPI_cursor_move — move a cursor
Synopsis
Description
SPI_cursor_move skips over some number of rows in a cursor. This is equivalent to a subset of
the SQL command MOVE (see SPI_scroll_cursor_move for more functionality).
Arguments
Portal portal
bool forward
long count
Notes
Moving backward may fail if the cursor's plan was not created with the CURSOR_OPT_SCROLL
option.
1311
Server Programming Interface
SPI_scroll_cursor_fetch
SPI_scroll_cursor_fetch — fetch some rows from a cursor
Synopsis
Description
SPI_scroll_cursor_fetch fetches some rows from a cursor. This is equivalent to the SQL
command FETCH.
Arguments
Portal portal
FetchDirection direction
long count
Return Value
SPI_processed and SPI_tuptable are set as in SPI_execute if successful.
Notes
See the SQL FETCH command for details of the interpretation of the direction and count pa-
rameters.
Direction values other than FETCH_FORWARD may fail if the cursor's plan was not created with the
CURSOR_OPT_SCROLL option.
1312
Server Programming Interface
SPI_scroll_cursor_move
SPI_scroll_cursor_move — move a cursor
Synopsis
Description
SPI_scroll_cursor_move skips over some number of rows in a cursor. This is equivalent to
the SQL command MOVE.
Arguments
Portal portal
FetchDirection direction
long count
Return Value
SPI_processed is set as in SPI_execute if successful. SPI_tuptable is set to NULL, since
no rows are returned by this function.
Notes
See the SQL FETCH command for details of the interpretation of the direction and count pa-
rameters.
Direction values other than FETCH_FORWARD may fail if the cursor's plan was not created with the
CURSOR_OPT_SCROLL option.
1313
Server Programming Interface
SPI_cursor_close
SPI_cursor_close — close a cursor
Synopsis
Description
SPI_cursor_close closes a previously created cursor and releases its portal storage.
All open cursors are closed automatically at the end of a transaction. SPI_cursor_close need
only be invoked if it is desirable to release resources sooner.
Arguments
Portal portal
1314
Server Programming Interface
SPI_keepplan
SPI_keepplan — save a prepared statement
Synopsis
Description
SPI_keepplan saves a passed statement (prepared by SPI_prepare) so that it will not be freed
by SPI_finish nor by the transaction manager. This gives you the ability to reuse prepared state-
ments in the subsequent invocations of your C function in the current session.
Arguments
SPIPlanPtr plan
Return Value
0 on success; SPI_ERROR_ARGUMENT if plan is NULL or invalid
Notes
The passed-in statement is relocated to permanent storage by means of pointer adjustment (no data
copying is required). If you later wish to delete it, use SPI_freeplan on it.
1315
Server Programming Interface
SPI_saveplan
SPI_saveplan — save a prepared statement
Synopsis
Description
SPI_saveplan copies a passed statement (prepared by SPI_prepare) into memory that will
not be freed by SPI_finish nor by the transaction manager, and returns a pointer to the copied
statement. This gives you the ability to reuse prepared statements in the subsequent invocations of
your C function in the current session.
Arguments
SPIPlanPtr plan
Return Value
Pointer to the copied statement; or NULL if unsuccessful. On error, SPI_result is set thus:
SPI_ERROR_ARGUMENT
SPI_ERROR_UNCONNECTED
Notes
The originally passed-in statement is not freed, so you might wish to do SPI_freeplan on it to
avoid leaking memory until SPI_finish.
In most cases, SPI_keepplan is preferred to this function, since it accomplishes largely the same
result without needing to physically copy the prepared statement's data structures.
1316
Server Programming Interface
SPI_register_relation
SPI_register_relation — make an ephemeral named relation available by name in SPI queries
Synopsis
Description
SPI_register_relation makes an ephemeral named relation, with associated information,
available to queries planned and executed through the current SPI connection.
Arguments
EphemeralNamedRelation enr
Return Value
If the execution of the command was successful then the following (nonnegative) value will be re-
turned:
SPI_OK_REL_REGISTER
SPI_ERROR_ARGUMENT
SPI_ERROR_UNCONNECTED
SPI_ERROR_REL_DUPLICATE
if the name specified in the name field of enr is already registered for this connection
1317
Server Programming Interface
SPI_unregister_relation
SPI_unregister_relation — remove an ephemeral named relation from the registry
Synopsis
Description
SPI_unregister_relation removes an ephemeral named relation from the registry for the
current connection.
Arguments
const char * name
Return Value
If the execution of the command was successful then the following (nonnegative) value will be re-
turned:
SPI_OK_REL_UNREGISTER
SPI_ERROR_ARGUMENT
if name is NULL
SPI_ERROR_UNCONNECTED
SPI_ERROR_REL_NOT_FOUND
1318
Server Programming Interface
SPI_register_trigger_data
SPI_register_trigger_data — make ephemeral trigger data available in SPI queries
Synopsis
Description
SPI_register_trigger_data makes any ephemeral relations captured by a trigger available to
queries planned and executed through the current SPI connection. Currently, this means the transition
tables captured by an AFTER trigger defined with a REFERENCING OLD/NEW TABLE AS ...
clause. This function should be called by a PL trigger handler function after connecting.
Arguments
TriggerData *tdata
Return Value
If the execution of the command was successful then the following (nonnegative) value will be re-
turned:
SPI_OK_TD_REGISTER
if the captured trigger data (if any) has been successfully registered
SPI_ERROR_ARGUMENT
if tdata is NULL
SPI_ERROR_UNCONNECTED
SPI_ERROR_REL_DUPLICATE
if the name of any trigger data transient relation is already registered for this connection
All functions described in this section can be used by both connected and unconnected C functions.
1319
Server Programming Interface
SPI_fname
SPI_fname — determine the column name for the specified column number
Synopsis
Description
SPI_fname returns a copy of the column name of the specified column. (You can use pfree to
release the copy of the name when you don't need it anymore.)
Arguments
TupleDesc rowdesc
int colnumber
Return Value
The column name; NULL if colnumber is out of range. SPI_result set to SPI_ERROR_NOAT-
TRIBUTE on error.
1320
Server Programming Interface
SPI_fnumber
SPI_fnumber — determine the column number for the specified column name
Synopsis
Description
SPI_fnumber returns the column number for the column with the specified name.
If colname refers to a system column (e.g., oid) then the appropriate negative column number
will be returned. The caller should be careful to test the return value for exact equality to SPI_ER-
ROR_NOATTRIBUTE to detect an error; testing the result for less than or equal to 0 is not correct
unless system columns should be rejected.
Arguments
TupleDesc rowdesc
column name
Return Value
Column number (count starts at 1 for user-defined columns), or SPI_ERROR_NOATTRIBUTE if the
named column was not found.
1321
Server Programming Interface
SPI_getvalue
SPI_getvalue — return the string value of the specified column
Synopsis
Description
SPI_getvalue returns the string representation of the value of the specified column.
The result is returned in memory allocated using palloc. (You can use pfree to release the memory
when you don't need it anymore.)
Arguments
HeapTuple row
TupleDesc rowdesc
int colnumber
Return Value
Column value, or NULL if the column is null, colnumber is out of range (SPI_result is set to
SPI_ERROR_NOATTRIBUTE), or no output function is available (SPI_result is set to SPI_ER-
ROR_NOOUTFUNC).
1322
Server Programming Interface
SPI_getbinval
SPI_getbinval — return the binary value of the specified column
Synopsis
Description
SPI_getbinval returns the value of the specified column in the internal form (as type Datum).
This function does not allocate new space for the datum. In the case of a pass-by-reference data type,
the return value will be a pointer into the passed row.
Arguments
HeapTuple row
TupleDesc rowdesc
int colnumber
bool * isnull
Return Value
The binary value of the column is returned. The variable pointed to by isnull is set to true if the
column is null, else to false.
1323
Server Programming Interface
SPI_gettype
SPI_gettype — return the data type name of the specified column
Synopsis
Description
SPI_gettype returns a copy of the data type name of the specified column. (You can use pfree
to release the copy of the name when you don't need it anymore.)
Arguments
TupleDesc rowdesc
int colnumber
Return Value
The data type name of the specified column, or NULL on error. SPI_result is set to SPI_ER-
ROR_NOATTRIBUTE on error.
1324
Server Programming Interface
SPI_gettypeid
SPI_gettypeid — return the data type OID of the specified column
Synopsis
Description
SPI_gettypeid returns the OID of the data type of the specified column.
Arguments
TupleDesc rowdesc
int colnumber
Return Value
The OID of the data type of the specified column or InvalidOid on error. On error, SPI_result
is set to SPI_ERROR_NOATTRIBUTE.
1325
Server Programming Interface
SPI_getrelname
SPI_getrelname — return the name of the specified relation
Synopsis
Description
SPI_getrelname returns a copy of the name of the specified relation. (You can use pfree to
release the copy of the name when you don't need it anymore.)
Arguments
Relation rel
input relation
Return Value
The name of the specified relation.
1326
Server Programming Interface
SPI_getnspname
SPI_getnspname — return the namespace of the specified relation
Synopsis
Description
SPI_getnspname returns a copy of the name of the namespace that the specified Relation be-
longs to. This is equivalent to the relation's schema. You should pfree the return value of this func-
tion when you are finished with it.
Arguments
Relation rel
input relation
Return Value
The name of the specified relation's namespace.
1327
Server Programming Interface
SPI_result_code_string
SPI_result_code_string — return error code as string
Synopsis
Description
SPI_result_code_string returns a string representation of the result code returned by various
SPI functions or stored in SPI_result.
Arguments
int code
result code
Return Value
A string representation of the result code.
SPI_connect creates a new memory context and makes it current. SPI_finish restores the pre-
vious current memory context and destroys the context created by SPI_connect. These actions
ensure that transient memory allocations made inside your C function are reclaimed at C function exit,
avoiding memory leakage.
However, if your C function needs to return an object in allocated memory (such as a value of a pass-
by-reference data type), you cannot allocate that memory using palloc, at least not while you are
connected to SPI. If you try, the object will be deallocated by SPI_finish, and your C function
will not work reliably. To solve this problem, use SPI_palloc to allocate memory for your return
object. SPI_palloc allocates memory in the “upper executor context”, that is, the memory context
that was current when SPI_connect was called, which is precisely the right context for a value
returned from your C function. Several of the other utility functions described in this section also
return objects created in the upper executor context.
When SPI_connect is called, the private context of the C function, which is created by SPI_con-
nect, is made the current context. All allocations made by palloc, repalloc, or SPI utility func-
tions (except as described in this section) are made in this context. When a C function disconnects from
the SPI manager (via SPI_finish) the current context is restored to the upper executor context, and
all allocations made in the C function memory context are freed and cannot be used any more.
1328
Server Programming Interface
SPI_palloc
SPI_palloc — allocate memory in the upper executor context
Synopsis
Description
SPI_palloc allocates memory in the upper executor context.
This function can only be used while connected to SPI. Otherwise, it throws an error.
Arguments
Size size
Return Value
pointer to new storage space of the specified size
1329
Server Programming Interface
SPI_repalloc
SPI_repalloc — reallocate memory in the upper executor context
Synopsis
Description
SPI_repalloc changes the size of a memory segment previously allocated using SPI_palloc.
This function is no longer different from plain repalloc. It's kept just for backward compatibility
of existing code.
Arguments
void * pointer
Size size
Return Value
pointer to new storage space of specified size with the contents copied from the existing area
1330
Server Programming Interface
SPI_pfree
SPI_pfree — free memory in the upper executor context
Synopsis
Description
SPI_pfree frees memory previously allocated using SPI_palloc or SPI_repalloc.
This function is no longer different from plain pfree. It's kept just for backward compatibility of
existing code.
Arguments
void * pointer
1331
Server Programming Interface
SPI_copytuple
SPI_copytuple — make a copy of a row in the upper executor context
Synopsis
Description
SPI_copytuple makes a copy of a row in the upper executor context. This is normally used to
return a modified row from a trigger. In a function declared to return a composite type, use SPI_re-
turntuple instead.
This function can only be used while connected to SPI. Otherwise, it returns NULL and sets SPI_re-
sult to SPI_ERROR_UNCONNECTED.
Arguments
HeapTuple row
row to be copied
Return Value
the copied row, or NULL on error (see SPI_result for an error indication)
1332
Server Programming Interface
SPI_returntuple
SPI_returntuple — prepare to return a tuple as a Datum
Synopsis
Description
SPI_returntuple makes a copy of a row in the upper executor context, returning it in the form of
a row type Datum. The returned pointer need only be converted to Datum via PointerGetDatum
before returning.
This function can only be used while connected to SPI. Otherwise, it returns NULL and sets SPI_re-
sult to SPI_ERROR_UNCONNECTED.
Note that this should be used for functions that are declared to return composite types. It is not used
for triggers; use SPI_copytuple for returning a modified row in a trigger.
Arguments
HeapTuple row
row to be copied
TupleDesc rowdesc
descriptor for row (pass the same descriptor each time for most effective caching)
Return Value
HeapTupleHeader pointing to copied row, or NULL on error (see SPI_result for an error in-
dication)
1333
Server Programming Interface
SPI_modifytuple
SPI_modifytuple — create a row by replacing selected fields of a given row
Synopsis
Description
SPI_modifytuple creates a new row by substituting new values for selected columns, copying
the original row's columns at other positions. The input row is not modified. The new row is returned
in the upper executor context.
This function can only be used while connected to SPI. Otherwise, it returns NULL and sets SPI_re-
sult to SPI_ERROR_UNCONNECTED.
Arguments
Relation rel
Used only as the source of the row descriptor for the row. (Passing a relation rather than a row
descriptor is a misfeature.)
HeapTuple row
row to be modified
int ncols
int * colnum
an array of length ncols, containing the numbers of the columns that are to be changed (column
numbers start at 1)
Datum * values
an array of length ncols, containing the new values for the specified columns
If nulls is NULL then SPI_modifytuple assumes that no new values are null. Otherwise,
each entry of the nulls array should be ' ' if the corresponding new value is non-null, or 'n'
if the corresponding new value is null. (In the latter case, the actual value in the corresponding
values entry doesn't matter.) Note that nulls is not a text string, just an array: it does not need
a '\0' terminator.
Return Value
new row with modifications, allocated in the upper executor context, or NULL on error (see SPI_re-
sult for an error indication)
1334
Server Programming Interface
SPI_ERROR_ARGUMENT
if rel is NULL, or if row is NULL, or if ncols is less than or equal to 0, or if colnum is NULL,
or if values is NULL.
SPI_ERROR_NOATTRIBUTE
if colnum contains an invalid column number (less than or equal to 0 or greater than the number
of columns in row)
SPI_ERROR_UNCONNECTED
1335
Server Programming Interface
SPI_freetuple
SPI_freetuple — free a row allocated in the upper executor context
Synopsis
Description
SPI_freetuple frees a row previously allocated in the upper executor context.
This function is no longer different from plain heap_freetuple. It's kept just for backward com-
patibility of existing code.
Arguments
HeapTuple row
row to free
1336
Server Programming Interface
SPI_freetuptable
SPI_freetuptable — free a row set created by SPI_execute or a similar function
Synopsis
Description
SPI_freetuptable frees a row set created by a prior SPI command execution function, such as
SPI_execute. Therefore, this function is often called with the global variable SPI_tuptable
as argument.
This function is useful if an SPI-using C function needs to execute multiple commands and does not
want to keep the results of earlier commands around until it ends. Note that any unfreed row sets will be
freed anyway at SPI_finish. Also, if a subtransaction is started and then aborted within execution
of an SPI-using C function, SPI automatically frees any row sets created while the subtransaction was
running.
Beginning in PostgreSQL 9.3, SPI_freetuptable contains guard logic to protect against dupli-
cate deletion requests for the same row set. In previous releases, duplicate deletions would lead to
crashes.
Arguments
SPITupleTable * tuptable
1337
Server Programming Interface
SPI_freeplan
SPI_freeplan — free a previously saved prepared statement
Synopsis
Description
SPI_freeplan releases a prepared statement previously returned by SPI_prepare or saved by
SPI_keepplan or SPI_saveplan.
Arguments
SPIPlanPtr plan
Return Value
0 on success; SPI_ERROR_ARGUMENT if plan is NULL or invalid
It is not generally safe and sensible to start and end transactions in arbitrary user-defined SQL-callable
functions without taking into account the context in which they are called. For example, a transaction
boundary in the middle of a function that is part of a complex SQL expression that is part of some SQL
command will probably result in obscure internal errors or crashes. The interface functions presented
here are primarily intended to be used by procedural language implementations to support transaction
management in SQL-level procedures that are invoked by the CALL command, taking the context of
the CALL invocation into account. SPI-using procedures implemented in C can implement the same
logic, but the details of that are beyond the scope of this documentation.
1338
Server Programming Interface
SPI_commit
SPI_commit — commit the current transaction
Synopsis
void SPI_commit(void)
Description
SPI_commit commits the current transaction. It is approximately equivalent to running the SQL
command COMMIT. After the transaction is committed, a new transaction is automatically started
using default transaction characteristics, so that the caller can continue using SPI facilities. If there is
a failure during commit, the current transaction is instead rolled back and a new transaction is started,
after which the error is thrown in the usual way.
This function can only be executed if the SPI connection has been set as nonatomic in the call to
SPI_connect_ext.
1339
Server Programming Interface
SPI_rollback
SPI_rollback — abort the current transaction
Synopsis
void SPI_rollback(void)
Description
SPI_rollback rolls back the current transaction. It is approximately equivalent to running the SQL
command ROLLBACK. After the transaction is rolled back, a new transaction is automatically started
using default transaction characteristics, so that the caller can continue using SPI facilities.
This function can only be executed if the SPI connection has been set as nonatomic in the call to
SPI_connect_ext.
1340
Server Programming Interface
SPI_start_transaction
SPI_start_transaction — obsolete function
Synopsis
void SPI_start_transaction(void)
Description
SPI_start_transaction does nothing, and exists only for code compatibility with earlier Post-
greSQL releases. It used to be required after calling SPI_commit or SPI_rollback, but now
those functions start a new transaction automatically.
• During the execution of an SQL command, any data changes made by the command are invisible
to the command itself. For example, in:
• Changes made by a command C are visible to all commands that are started after C, no matter
whether they are started inside C (during the execution of C) or after C is done.
• Commands executed via SPI inside a function called by an SQL command (either an ordinary func-
tion or a trigger) follow one or the other of the above rules depending on the read/write flag passed
to SPI. Commands executed in read-only mode follow the first rule: they cannot see changes of the
calling command. Commands executed in read-write mode follow the second rule: they can see all
changes made so far.
• All standard procedural languages set the SPI read-write mode depending on the volatility attribute
of the function. Commands of STABLE and IMMUTABLE functions are done in read-only mode,
while commands of VOLATILE functions are done in read-write mode. While authors of C func-
tions are able to violate this convention, it's unlikely to be a good idea to do so.
The next section contains an example that illustrates the application of these rules.
47.6. Examples
This section contains a very simple example of SPI usage. The C function execq takes an SQL
command as its first argument and a row count as its second, executes the command using SPI_exec
and returns the number of rows that were processed by the command. You can find more complex
examples for SPI in the source tree in src/test/regress/regress.c and in the spi module.
#include "postgres.h"
#include "executor/spi.h"
#include "utils/builtins.h"
1341
Server Programming Interface
PG_MODULE_MAGIC;
PG_FUNCTION_INFO_V1(execq);
Datum
execq(PG_FUNCTION_ARGS)
{
char *command;
int cnt;
int ret;
uint64 proc;
SPI_connect();
proc = SPI_processed;
/*
* If some rows were fetched, print them via elog(INFO).
*/
if (ret > 0 && SPI_tuptable != NULL)
{
TupleDesc tupdesc = SPI_tuptable->tupdesc;
SPITupleTable *tuptable = SPI_tuptable;
char buf[8192];
uint64 j;
SPI_finish();
pfree(command);
PG_RETURN_INT64(proc);
}
This is how you declare the function after having compiled it into a shared library (details are in
Section 38.10.5.):
1342
Server Programming Interface
AS 'filename'
LANGUAGE C STRICT;
execq
-------
2
(1 row)
execq
-------
3 -- 10 is the max value only, 3 is the real
number of rows
(1 row)
1343
Server Programming Interface
1344
Chapter 48. Background Worker
Processes
PostgreSQL can be extended to run user-supplied code in separate processes. Such processes are start-
ed, stopped and monitored by postgres, which permits them to have a lifetime closely linked to
the server's status. These processes have the option to attach to PostgreSQL's shared memory area and
to connect to databases internally; they can also run multiple transactions serially, just like a regular
client-connected server process. Also, by linking to libpq they can connect to the server and behave
like a regular client application.
Warning
There are considerable robustness and security risks in using background worker processes
because, being written in the C language, they have unrestricted access to data. Administra-
tors wishing to enable modules that include background worker processes should exercise ex-
treme caution. Only carefully audited modules should be permitted to run background worker
processes.
Background workers can be initialized at the time that PostgreSQL is started by including the mod-
ule name in shared_preload_libraries. A module wishing to run a background worker can
register it by calling RegisterBackgroundWorker(BackgroundWorker *worker) from
its _PG_init(). Background workers can also be started after the system is up and running by call-
ing the function RegisterDynamicBackgroundWorker(BackgroundWorker *work-
er, BackgroundWorkerHandle **handle). Unlike RegisterBackgroundWorker,
which can only be called from within the postmaster, RegisterDynamicBackgroundWorker
must be called from a regular backend or another background worker.
bgw_name and bgw_type are strings to be used in log messages, process listings and similar con-
texts. bgw_type should be the same for all background workers of the same type, so that it is possible
to group such workers in a process listing, for example. bgw_name on the other hand can contain
additional information about the specific process. (Typically, the string for bgw_name will contain
the type somehow, but that is not strictly required.)
bgw_flags is a bitwise-or'd bit mask indicating the capabilities that the module wants. Possible
values are:
1345
Background Worker Processes
BGWORKER_SHMEM_ACCESS
Requests shared memory access. Workers without shared memory access cannot access any of
PostgreSQL's shared data structures, such as heavyweight or lightweight locks, shared buffers, or
any custom data structures which the worker itself may wish to create and use.
BGWORKER_BACKEND_DATABASE_CONNECTION
Requests the ability to establish a database connection through which it can later run transactions
and queries. A background worker using BGWORKER_BACKEND_DATABASE_CONNECTION
to connect to a database must also attach shared memory using BGWORKER_SHMEM_ACCESS,
or worker start-up will fail.
bgw_start_time is the server state during which postgres should start the process; it can be
one of BgWorkerStart_PostmasterStart (start as soon as postgres itself has finished its
own initialization; processes requesting this are not eligible for database connections), BgWorkerS-
tart_ConsistentState (start as soon as a consistent state has been reached in a hot standby,
allowing processes to connect to databases and run read-only queries), and BgWorkerStart_Re-
coveryFinished (start as soon as the system has entered normal read-write state). Note the last
two values are equivalent in a server that's not a hot standby. Note that this setting only indicates when
the processes are to be started; they do not stop when a different state is reached.
bgw_restart_time is the interval, in seconds, that postgres should wait before restarting the
process in the event that it crashes. It can be any positive value, or BGW_NEVER_RESTART, indicating
not to restart the process in case of a crash.
bgw_library_name is the name of a library in which the initial entry point for the background
worker should be sought. The named library will be dynamically loaded by the worker process and
bgw_function_name will be used to identify the function to be called. If loading a function from
the core code, this must be set to "postgres".
bgw_main_arg is the Datum argument to the background worker main function. This main func-
tion should take a single argument of type Datum and return void. bgw_main_arg will be passed
as the argument. In addition, the global variable MyBgworkerEntry points to a copy of the Back-
groundWorker structure passed at registration time; the worker may find it helpful to examine this
structure.
On Windows (and anywhere else where EXEC_BACKEND is defined) or in dynamic background work-
ers it is not safe to pass a Datum by reference, only by value. If an argument is required, it is safest to
pass an int32 or other small value and use that as an index into an array allocated in shared memory. If
a value like a cstring or text is passed then the pointer won't be valid from the new background
worker process.
bgw_extra can contain extra data to be passed to the background worker. Unlike bgw_main_arg,
this data is not passed as an argument to the worker's main function, but it can be accessed via My-
BgworkerEntry, as discussed above.
bgw_notify_pid is the PID of a PostgreSQL backend process to which the postmaster should
send SIGUSR1 when the process is started or exits. It should be 0 for workers registered at postmaster
startup time, or when the backend registering the worker does not wish to wait for the worker to start
up. Otherwise, it should be initialized to MyProcPid.
1346
Background Worker Processes
shared catalogs can be accessed. If username is NULL or useroid is InvalidOid, the process
will run as the superuser created during initdb. If BGWORKER_BYPASS_ALLOWCONN is specified
as flags it is possible to bypass the restriction to connect to databases not allowing user connections.
A background worker can only call one of these two functions, and only once. It is not possible to
switch databases.
Signals are initially blocked when control reaches the background worker's main function, and must be
unblocked by it; this is to allow the process to customize its signal handlers, if necessary. Signals can
be unblocked in the new process by calling BackgroundWorkerUnblockSignals and blocked
by calling BackgroundWorkerBlockSignals.
In some cases, a process which registers a background worker may wish to wait for the worker to start
up. This can be accomplished by initializing bgw_notify_pid to MyProcPid and then passing
the BackgroundWorkerHandle * obtained at registration time to WaitForBackground-
WorkerStartup(BackgroundWorkerHandle *handle, pid_t *) function. This func-
tion will block until the postmaster has attempted to start the background worker, or until the post-
master dies. If the background worker is running, the return value will be BGWH_STARTED, and the
PID will be written to the provided address. Otherwise, the return value will be BGWH_STOPPED or
BGWH_POSTMASTER_DIED.
A process can also wait for a background worker to shut down, by using the WaitForBackground-
WorkerShutdown(BackgroundWorkerHandle *handle) function and passing the Back-
groundWorkerHandle * obtained at registration. This function will block until the background
worker exits, or postmaster dies. When the background worker exits, the return value is BGWH_S-
TOPPED, if postmaster dies it will return BGWH_POSTMASTER_DIED.
If a background worker sends asynchronous notifications with the NOTIFY command via the Server
Programming Interface (SPI), it should call ProcessCompletedNotifies explicitly after com-
mitting the enclosing transaction so that any notifications can be delivered. If a background worker
registers to receive asynchronous notifications with the LISTEN through SPI, the worker will log
those notifications, but there is no programmatic way for the worker to intercept and respond to those
notifications.
1347
Chapter 49. Logical Decoding
PostgreSQL provides infrastructure to stream the modifications performed via SQL to external con-
sumers. This functionality can be used for a variety of purposes, including replication solutions and
auditing.
The format in which those changes are streamed is determined by the output plugin used. An example
plugin is provided in the PostgreSQL distribution. Additional plugins can be written to extend the
choice of available formats without modifying any core code. Every output plugin has access to each
individual new row produced by INSERT and the new row version created by UPDATE. Availability of
old row versions for UPDATE and DELETE depends on the configured replica identity (see REPLICA
IDENTITY).
Changes can be consumed either using the streaming replication protocol (see Section 53.4 and Sec-
tion 49.3), or by calling functions via SQL (see Section 49.4). It is also possible to write additional
methods of consuming the output of a replication slot without modifying core code (see Section 49.7).
Before you can use logical decoding, you must set wal_level to logical and max_replication_slots
to at least 1. Then, you should connect to the target database (in the example below, postgres) as
a superuser.
1348
Logical Decoding
postgres=# BEGIN;
postgres=# INSERT INTO data(data) VALUES('1');
postgres=# INSERT INTO data(data) VALUES('2');
postgres=# COMMIT;
-----------+-------
+---------------------------------------------------------
0/BA5A688 | 10298 | BEGIN 10298
0/BA5A6F0 | 10298 | table public.data: INSERT: id[integer]:1
data[text]:'1'
0/BA5A7F8 | 10298 | table public.data: INSERT: id[integer]:2
data[text]:'2'
0/BA5A8A8 | 10298 | COMMIT 10298
(4 rows)
postgres=# -- You can also peek ahead in the change stream without
consuming changes
postgres=# SELECT * FROM
pg_logical_slot_peek_changes('regression_slot', NULL, NULL);
lsn | xid | data
-----------+-------
+---------------------------------------------------------
0/BA5A8E0 | 10299 | BEGIN 10299
0/BA5A8E0 | 10299 | table public.data: INSERT: id[integer]:3
data[text]:'3'
0/BA5A990 | 10299 | COMMIT 10299
(3 rows)
1349
Logical Decoding
-----------+-------
+---------------------------------------------------------
0/BA5A8E0 | 10299 | BEGIN 10299
0/BA5A8E0 | 10299 | table public.data: INSERT: id[integer]:3
data[text]:'3'
0/BA5A990 | 10299 | COMMIT 10299
(3 rows)
-----------+-------
+---------------------------------------------------------
0/BA5A8E0 | 10299 | BEGIN 10299
0/BA5A8E0 | 10299 | table public.data: INSERT: id[integer]:3
data[text]:'3'
0/BA5A990 | 10299 | COMMIT 10299 (at 2017-05-10
12:07:21.272494-04)
(3 rows)
(1 row)
The following example shows how logical decoding is controlled over the streaming replication pro-
tocol, using the program pg_recvlogical included in the PostgreSQL distribution. This requires that
client authentication is set up to allow replication connections (see Section 26.2.5.1) and that max_w-
al_senders is set sufficiently high to allow an additional connection.
1350
Logical Decoding
Logical decoding is the process of extracting all persistent changes to a database's tables into a coher-
ent, easy to understand format which can be interpreted without detailed knowledge of the database's
internal state.
In PostgreSQL, logical decoding is implemented by decoding the contents of the write-ahead log,
which describe changes on a storage level, into an application-specific form such as a stream of tuples
or SQL statements.
Note
PostgreSQL also has streaming replication slots (see Section 26.2.5), but they are used some-
what differently there.
A replication slot has an identifier that is unique across all databases in a PostgreSQL cluster. Slots
persist independently of the connection using them and are crash-safe.
A logical slot will emit each change just once in normal operation. The current position of each slot
is persisted only at checkpoint, so in the case of a crash the slot may return to an earlier LSN, which
will then cause recent changes to be sent again when the server restarts. Logical decoding clients are
responsible for avoiding ill effects from handling the same message more than once. Clients may wish
to record the last LSN they saw when decoding and skip over any repeated data or (when using the
replication protocol) request that decoding start from that LSN rather than letting the server determine
the start point. The Replication Progress Tracking feature is designed for this purpose, refer to repli-
cation origins.
Multiple independent slots may exist for a single database. Each slot has its own state, allowing dif-
ferent consumers to receive changes from different points in the database change stream. For most
applications, a separate slot will be required for each consumer.
A logical replication slot knows nothing about the state of the receiver(s). It's even possible to have
multiple different receivers using the same slot at different times; they'll just get the changes following
on from when the last receiver stopped consuming them. Only one receiver may consume changes
from a slot at any given time.
Caution
Replication slots persist across crashes and know nothing about the state of their consumer(s).
They will prevent removal of required resources even when there is no connection using them.
This consumes storage because neither required WAL nor required rows from the system cat-
alogs can be removed by VACUUM as long as they are required by a replication slot. In extreme
cases this could cause the database to shut down to prevent transaction ID wraparound (see
Section 24.1.5). So if a slot is no longer required it should be dropped.
1351
Logical Decoding
Creation of a snapshot is not always possible. In particular, it will fail when connected to a hot standby.
Applications that do not require snapshot export may suppress it with the NOEXPORT_SNAPSHOT
option.
are used to create, drop, and stream changes from a replication slot, respectively. These commands
are only available over a replication connection; they cannot be used via SQL. See Section 53.4 for
details on these commands.
The command pg_recvlogical can be used to control logical decoding over a streaming replication
connection. (It uses these commands internally.)
Synchronous replication (see Section 26.2.8) is only supported on replication slots used over the
streaming replication interface. The function interface and additional, non-core interfaces do not sup-
port synchronous replication.
1352
Logical Decoding
An output plugin is loaded by dynamically loading a shared library with the output plugin's name as
the library base name. The normal library search path is used to locate the library. To provide the
required output plugin callbacks and to indicate that the library is actually an output plugin it needs to
provide a function named _PG_output_plugin_init. This function is passed a struct that needs
to be filled with the callback function pointers for individual actions.
The begin_cb, change_cb and commit_cb callbacks are required, while startup_cb, fil-
ter_by_origin_cb, truncate_cb, and shutdown_cb are optional. If truncate_cb is not
set but a TRUNCATE is to be decoded, the action will be ignored.
49.6.2. Capabilities
To decode, format and output changes, output plugins can use most of the backend's normal infra-
structure, including calling output functions. Read only access to relations is permitted as long as only
relations are accessed that either have been created by initdb in the pg_catalog schema, or have
been marked as user provided catalog tables using
Any actions leading to transaction ID assignment are prohibited. That, among others, includes writing
to tables, performing DDL changes, and calling txid_current().
Concurrent transactions are decoded in commit order, and only changes belonging to a specific trans-
action are decoded between the begin and commit callbacks. Transactions that were rolled back
1353
Logical Decoding
explicitly or implicitly never get decoded. Successful savepoints are folded into the transaction con-
taining them in the order they were executed within that transaction.
Note
Only transactions that have already safely been flushed to disk will be decoded. That can
lead to a COMMIT not immediately being decoded in a directly following pg_logical_s-
lot_get_changes() when synchronous_commit is set to off.
The is_init parameter will be true when the replication slot is being created and false otherwise.
options points to a struct of options that output plugins can set:
1354
Logical Decoding
ReorderBufferTXN *txn);
The txn parameter contains meta information about the transaction, like the time stamp at which it
has been committed and its XID.
The ctx and txn parameters have the same contents as for the begin_cb and commit_cb call-
backs, but additionally the relation descriptor relation points to the relation the row belongs to and
a struct change describing the row modification are passed in.
Note
Only changes in user defined tables that are not unlogged (see UNLOGGED) and not temporary
(see TEMPORARY or TEMP) can be extracted using logical decoding.
The parameters are analogous to the change_cb callback. However, because TRUNCATE actions
on tables connected by foreign keys need to be executed together, this callback receives an array of
relations instead of just a single one. See the description of the TRUNCATE statement for details.
1355
Logical Decoding
The ctx parameter has the same contents as for the other callbacks. No information but the origin is
available. To signal that changes originating on the passed in node are irrelevant, return true, causing
them to be filtered away; false otherwise. The other callbacks will not be called for transactions and
changes that have been filtered away.
This is useful when implementing cascading or multidirectional replication solutions. Filtering by the
origin allows to prevent replicating the same changes back and forth in such setups. While transactions
and changes also carry information about the origin, filtering via this callback is noticeably more
efficient.
The txn parameter contains meta information about the transaction, like the time stamp at which it has
been committed and its XID. Note however that it can be NULL when the message is non-transactional
and the XID was not assigned yet in the transaction which logged the message. The lsn has WAL
location of the message. The transactional says if the message was sent as transactional or
not. The prefix is arbitrary null-terminated prefix which can be used for identifying interesting
messages for the current plugin. And finally the message parameter holds the actual message of
message_size size.
Extra care should be taken to ensure that the prefix the output plugin considers interesting is unique.
Using name of the extension or the output plugin itself is often a good choice.
The following example shows how to output data to the consumer of an output plugin:
OutputPluginPrepareWrite(ctx, true);
appendStringInfo(ctx->out, "BEGIN %u", txn->xid);
1356
Logical Decoding
OutputPluginWrite(ctx, true);
Note
A synchronous replica receiving changes via logical decoding will work in the scope of a
single database. Since, in contrast to that, synchronous_standby_names currently is
server wide, this means this technique will not work properly if more than one database is
actively used.
49.8.2. Caveats
In synchronous replication setup, a deadlock can happen, if the transaction has locked [user] catalog
tables exclusively. See Section 49.6.2 for information on user catalog tables. This is because logical
decoding of transactions can lock catalog tables to access them. To avoid this users must refrain from
taking an exclusive lock on [user] catalog tables. This can happen in the following ways:
Note that these commands that can cause deadlock apply to not only explicitly indicated system catalog
tables above but also to any other [user] catalog table.
1357
Chapter 50. Replication Progress
Tracking
Replication origins are intended to make it easier to implement logical replication solutions on top of
logical decoding. They provide a solution to two common problems:
• How to change replication behavior based on the origin of a row; for example, to prevent loops in
bi-directional replication setups
Replication origins have just two properties, a name and an OID. The name, which is what should
be used to refer to the origin across systems, is free-form text. It should be used in a way that
makes conflicts between replication origins created by different replication solutions unlikely; e.g., by
prefixing the replication solution's name to it. The OID is used only to avoid having to store the long
version in situations where space efficiency is important. It should never be shared across systems.
One nontrivial part of building a replication solution is to keep track of replay progress in a safe
manner. When the applying process, or the whole cluster, dies, it needs to be possible to find out up to
where data has successfully been replicated. Naive solutions to this, such as updating a row in a table
for every replayed transaction, have problems like run-time overhead and database bloat.
Using the replication origin infrastructure a session can be marked as replaying from a remote node (us-
ing the pg_replication_origin_session_setup() function). Additionally the LSN and
commit time stamp of every source transaction can be configured on a per transaction basis using
pg_replication_origin_xact_setup(). If that's done replication progress will persist in
a crash safe manner. Replay progress for all replication origins can be seen in the pg_replica-
tion_origin_status view. An individual origin's progress, e.g., when resuming replication,
can be acquired using pg_replication_origin_progress() for any origin or pg_repli-
cation_origin_session_progress() for the origin configured in the current session.
In replication topologies more complex than replication from exactly one system to one other system,
another problem can be that it is hard to avoid replicating replayed rows again. That can lead both
to cycles in the replication and inefficiencies. Replication origins provide an optional mechanism to
recognize and prevent that. When configured using the functions referenced in the previous paragraph,
every change and transaction passed to output plugin callbacks (see Section 49.6) generated by the
session is tagged with the replication origin of the generating session. This allows treating them dif-
ferently in the output plugin, e.g., ignoring all but locally-originating rows. Additionally the fil-
ter_by_origin_cb callback can be used to filter the logical decoding change stream based on
the source. While less flexible, filtering via that callback is considerably more efficient than doing it
in the output plugin.
1358
Part VI. Reference
The entries in this Reference are meant to provide in reasonable length an authoritative, complete, and formal
summary about their respective subjects. More information about the use of PostgreSQL, in narrative, tutorial, or
example form, can be found in other parts of this book. See the cross-references listed on each reference page.
1360
Reference
1361
Reference
1362
Reference
1363
SQL Commands
This part contains reference information for the SQL commands supported by PostgreSQL. By “SQL”
the language in general is meant; information about the standards conformance and compatibility of
each command can be found on the respective reference page.
Table of Contents
ABORT .................................................................................................................. 1368
ALTER AGGREGATE ............................................................................................. 1369
ALTER COLLATION .............................................................................................. 1371
ALTER CONVERSION ............................................................................................ 1373
ALTER DATABASE ................................................................................................ 1375
ALTER DEFAULT PRIVILEGES .............................................................................. 1378
ALTER DOMAIN .................................................................................................... 1381
ALTER EVENT TRIGGER ....................................................................................... 1385
ALTER EXTENSION ............................................................................................... 1386
ALTER FOREIGN DATA WRAPPER ........................................................................ 1390
ALTER FOREIGN TABLE ....................................................................................... 1392
ALTER FUNCTION ................................................................................................. 1397
ALTER GROUP ...................................................................................................... 1401
ALTER INDEX ....................................................................................................... 1403
ALTER LANGUAGE ............................................................................................... 1406
ALTER LARGE OBJECT ......................................................................................... 1407
ALTER MATERIALIZED VIEW ............................................................................... 1408
ALTER OPERATOR ................................................................................................ 1410
ALTER OPERATOR CLASS .................................................................................... 1412
ALTER OPERATOR FAMILY .................................................................................. 1413
ALTER POLICY ..................................................................................................... 1417
ALTER PROCEDURE .............................................................................................. 1419
ALTER PUBLICATION ........................................................................................... 1422
ALTER ROLE ......................................................................................................... 1424
ALTER ROUTINE ................................................................................................... 1428
ALTER RULE ......................................................................................................... 1430
ALTER SCHEMA ................................................................................................... 1431
ALTER SEQUENCE ................................................................................................ 1432
ALTER SERVER ..................................................................................................... 1435
ALTER STATISTICS ............................................................................................... 1437
ALTER SUBSCRIPTION .......................................................................................... 1438
ALTER SYSTEM .................................................................................................... 1440
ALTER TABLE ....................................................................................................... 1442
ALTER TABLESPACE ............................................................................................ 1458
ALTER TEXT SEARCH CONFIGURATION .............................................................. 1460
ALTER TEXT SEARCH DICTIONARY ..................................................................... 1462
ALTER TEXT SEARCH PARSER ............................................................................. 1464
ALTER TEXT SEARCH TEMPLATE ........................................................................ 1465
ALTER TRIGGER ................................................................................................... 1466
ALTER TYPE ......................................................................................................... 1468
ALTER USER ......................................................................................................... 1472
ALTER USER MAPPING ......................................................................................... 1473
ALTER VIEW ......................................................................................................... 1474
ANALYZE .............................................................................................................. 1476
BEGIN ................................................................................................................... 1479
CALL ..................................................................................................................... 1481
CHECKPOINT ........................................................................................................ 1482
CLOSE ................................................................................................................... 1483
ABORT
ABORT — abort the current transaction
Synopsis
Description
ABORT rolls back the current transaction and causes all the updates made by the transaction to be
discarded. This command is identical in behavior to the standard SQL command ROLLBACK, and
is present only for historical reasons.
Parameters
WORK
TRANSACTION
Notes
Use COMMIT to successfully terminate a transaction.
Issuing ABORT outside of a transaction block emits a warning and otherwise has no effect.
Examples
To abort all changes:
ABORT;
Compatibility
This command is a PostgreSQL extension present for historical reasons. ROLLBACK is the equivalent
standard SQL command.
See Also
BEGIN, COMMIT, ROLLBACK
ALTER AGGREGATE
ALTER AGGREGATE — change the definition of an aggregate function
Synopsis
ALTER AGGREGATE name ( aggregate_signature ) RENAME TO new_name
ALTER AGGREGATE name ( aggregate_signature ) OWNER TO { new_owner | CURRENT_USER | SESSION_USER }
ALTER AGGREGATE name ( aggregate_signature ) SET SCHEMA new_schema

where aggregate_signature is:

* |
[ argmode ] [ argname ] argtype [ , ... ] |
[ [ argmode ] [ argname ] argtype [ , ... ] ] ORDER BY [ argmode ] [ argname ] argtype [ , ... ]
Description
ALTER AGGREGATE changes the definition of an aggregate function.
You must own the aggregate function to use ALTER AGGREGATE. To change the schema of an
aggregate function, you must also have CREATE privilege on the new schema. To alter the owner,
you must also be a direct or indirect member of the new owning role, and that role must have CREATE
privilege on the aggregate function's schema. (These restrictions enforce that altering the owner doesn't
do anything you couldn't do by dropping and recreating the aggregate function. However, a superuser
can alter ownership of any aggregate function anyway.)
Parameters
name
argmode
argname
The name of an argument. Note that ALTER AGGREGATE does not actually pay any attention
to argument names, since only the argument data types are needed to determine the aggregate
function's identity.
argtype
An input data type on which the aggregate function operates. To reference a zero-argument aggre-
gate function, write * in place of the list of argument specifications. To reference an ordered-set
aggregate function, write ORDER BY between the direct and aggregated argument specifications.
new_name
new_owner
new_schema
Notes
The recommended syntax for referencing an ordered-set aggregate is to write ORDER BY between
the direct and aggregated argument specifications, in the same style as in CREATE AGGREGATE.
However, it will also work to omit ORDER BY and just run the direct and aggregated argument
specifications into a single list. In this abbreviated form, if VARIADIC "any" was used in both the
direct and aggregated argument lists, write VARIADIC "any" only once.
Examples
To rename the aggregate function myavg for type integer to my_average:
To change the owner of the aggregate function myavg for type integer to joe:
To move the ordered-set aggregate mypercentile with direct argument of type float8 and ag-
gregated argument of type integer into schema myschema:
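The three commands described above can be written as:

ALTER AGGREGATE myavg(integer) RENAME TO my_average;

ALTER AGGREGATE myavg(integer) OWNER TO joe;

ALTER AGGREGATE mypercentile(float8 ORDER BY integer) SET SCHEMA myschema;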
Compatibility
There is no ALTER AGGREGATE statement in the SQL standard.
See Also
CREATE AGGREGATE, DROP AGGREGATE
ALTER COLLATION
ALTER COLLATION — change the definition of a collation
Synopsis
Description
ALTER COLLATION changes the definition of a collation.
You must own the collation to use ALTER COLLATION. To alter the owner, you must also be a
direct or indirect member of the new owning role, and that role must have CREATE privilege on the
collation's schema. (These restrictions enforce that altering the owner doesn't do anything you couldn't
do by dropping and recreating the collation. However, a superuser can alter ownership of any collation
anyway.)
Parameters
name
new_name
new_owner
new_schema
REFRESH VERSION
Notes
When using collations provided by the ICU library, the ICU-specific version of the collator is recorded
in the system catalog when the collation object is created. When the collation is used, the current
version is checked against the recorded version, and a warning is issued when there is a mismatch.
A change in collation definitions can lead to corrupt indexes and other problems because the database
system relies on stored objects having a certain sort order. Generally, this should be avoided, but it can
happen in legitimate circumstances, such as when using pg_upgrade to upgrade to server binaries
linked with a newer version of ICU. When this happens, all objects depending on the collation should
be rebuilt, for example, using REINDEX. When that is done, the collation version can be refreshed
using the command ALTER COLLATION ... REFRESH VERSION. This will update the system
catalog to record the current collator version and will make the warning go away. Note that this does
not actually check whether all affected objects have been rebuilt correctly.
The following query can be used to identify all collations in the current database that need to be
refreshed and the objects that depend on them:
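A query along these lines (a sketch based on the pg_depend and pg_collation catalogs and the
pg_collation_actual_version() function) can serve that purpose:

SELECT pg_describe_object(refclassid, refobjid, refobjsubid) AS "Collation",
       pg_describe_object(classid, objid, objsubid) AS "Object"
  FROM pg_depend d JOIN pg_collation c
       ON refclassid = 'pg_collation'::regclass AND refobjid = c.oid
  WHERE c.collversion <> pg_collation_actual_version(c.oid)
  ORDER BY 1, 2;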
Examples
To rename the collation de_DE to german:
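That is:

ALTER COLLATION "de_DE" RENAME TO german;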
Compatibility
There is no ALTER COLLATION statement in the SQL standard.
See Also
CREATE COLLATION, DROP COLLATION
ALTER CONVERSION
ALTER CONVERSION — change the definition of a conversion
Synopsis
Description
ALTER CONVERSION changes the definition of a conversion.
You must own the conversion to use ALTER CONVERSION. To alter the owner, you must also be
a direct or indirect member of the new owning role, and that role must have CREATE privilege on
the conversion's schema. (These restrictions enforce that altering the owner doesn't do anything you
couldn't do by dropping and recreating the conversion. However, a superuser can alter ownership of
any conversion anyway.)
Parameters
name
new_name
new_owner
new_schema
Examples
To rename the conversion iso_8859_1_to_utf8 to latin1_to_unicode:
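That is:

ALTER CONVERSION iso_8859_1_to_utf8 RENAME TO latin1_to_unicode;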
Compatibility
There is no ALTER CONVERSION statement in the SQL standard.
See Also
CREATE CONVERSION, DROP CONVERSION
ALTER DATABASE
ALTER DATABASE — change a database
Synopsis
ALTER DATABASE name [ [ WITH ] option [ ... ] ]

where option can be:

    ALLOW_CONNECTIONS allowconn
    CONNECTION LIMIT connlimit
    IS_TEMPLATE istemplate
Description
ALTER DATABASE changes the attributes of a database.
The first form changes certain per-database settings. (See below for details.) Only the database owner
or a superuser can change these settings.
The second form changes the name of the database. Only the database owner or a superuser can re-
name a database; non-superuser owners must also have the CREATEDB privilege. The current data-
base cannot be renamed. (Connect to a different database if you need to do that.)
The third form changes the owner of the database. To alter the owner, you must own the database
and also be a direct or indirect member of the new owning role, and you must have the CREATEDB
privilege. (Note that superusers have all these privileges automatically.)
The fourth form changes the default tablespace of the database. Only the database owner or a superuser
can do this; you must also have create privilege for the new tablespace. This command physically
moves any tables or indexes in the database's old default tablespace to the new tablespace. The new
default tablespace must be empty for this database, and no one can be connected to the database. Tables
and indexes in non-default tablespaces are unaffected.
The remaining forms change the session default for a run-time configuration variable for a PostgreSQL
database. Whenever a new session is subsequently started in that database, the specified value be-
comes the session default value. The database-specific default overrides whatever setting is present in
postgresql.conf or has been received from the postgres command line. Only the database
owner or a superuser can change the session defaults for a database. Certain variables cannot be set
this way, or can only be set by a superuser.
Parameters
name
allowconn
connlimit
How many concurrent connections can be made to this database. -1 means no limit.
istemplate
If true, then this database can be cloned by any user with CREATEDB privileges; if false, then
only superusers or the owner of the database can clone it.
new_name
new_owner
new_tablespace
configuration_parameter
value
Set this database's session default for the specified configuration parameter to the given value. If
value is DEFAULT or, equivalently, RESET is used, the database-specific setting is removed,
so the system-wide default setting will be inherited in new sessions. Use RESET ALL to clear
all database-specific settings. SET FROM CURRENT saves the session's current value of the
parameter as the database-specific value.
See SET and Chapter 19 for more information about allowed parameter names and values.
Notes
It is also possible to tie a session default to a specific role rather than to a database; see ALTER ROLE.
Role-specific settings override database-specific ones if there is a conflict.
Examples
To disable index scans by default in the database test:
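That is:

ALTER DATABASE test SET enable_indexscan TO off;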
Compatibility
The ALTER DATABASE statement is a PostgreSQL extension.
See Also
CREATE DATABASE, DROP DATABASE, SET, CREATE TABLESPACE
ALTER DEFAULT PRIVILEGES
ALTER DEFAULT PRIVILEGES — define default access privileges
Synopsis
[ CASCADE | RESTRICT ]
Description
ALTER DEFAULT PRIVILEGES allows you to set the privileges that will be applied to objects
created in the future. (It does not affect privileges assigned to already-existing objects.) Currently, only
the privileges for schemas, tables (including views and foreign tables), sequences, functions, and types
(including domains) can be altered. For this command, functions include aggregates and procedures.
The words FUNCTIONS and ROUTINES are equivalent in this command. (ROUTINES is preferred
going forward as the standard term for functions and procedures taken together. In earlier PostgreSQL
releases, only the word FUNCTIONS was allowed. It is not possible to set default privileges for func-
tions and procedures separately.)
You can change default privileges only for objects that will be created by yourself or by roles that you
are a member of. The privileges can be set globally (i.e., for all objects created in the current database),
or just for objects created in specified schemas.
As explained under GRANT, the default privileges for any object type normally grant all grantable per-
missions to the object owner, and may grant some privileges to PUBLIC as well. However, this behav-
ior can be changed by altering the global default privileges with ALTER DEFAULT PRIVILEGES.
Default privileges that are specified per-schema are added to whatever the global default privileges
are for the particular object type. This means you cannot revoke privileges per-schema if they are
granted globally (either by default, or according to a previous ALTER DEFAULT PRIVILEGES
command that did not specify a schema). Per-schema REVOKE is only useful to reverse the effects
of a previous per-schema GRANT.
Parameters
target_role
The name of an existing role of which the current role is a member. If FOR ROLE is omitted,
the current role is assumed.
schema_name
The name of an existing schema. If specified, the default privileges are altered for objects later
created in that schema. If IN SCHEMA is omitted, the global default privileges are altered. IN
SCHEMA is not allowed when setting privileges for schemas, since schemas can't be nested.
role_name
The name of an existing role to grant or revoke privileges for. This parameter, and all the other
parameters in abbreviated_grant_or_revoke, act as described under GRANT or RE-
VOKE, except that one is setting permissions for a whole class of objects rather than specific
named objects.
Notes
Use psql's \ddp command to obtain information about existing assignments of default privileges. The
meaning of the privilege values is the same as explained for \dp under GRANT.
If you wish to drop a role for which the default privileges have been altered, it is necessary to reverse
the changes in its default privileges or use DROP OWNED BY to get rid of the default privileges entry
for the role.
Examples
Grant SELECT privilege to everyone for all tables (and views) you subsequently create in schema
myschema, and allow role webuser to INSERT into them too:
Undo the above, so that subsequently-created tables won't have any more permissions than normal:
Remove the public EXECUTE permission that is normally granted on functions, for all functions
subsequently created by role admin:
Note however that you cannot accomplish that effect with a command limited to a single schema. This
command has no effect, unless it is undoing a matching GRANT:
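Sketches of the commands described above (using the schema and role names from the text; the
single-schema variant is shown against the public schema as an illustration):

ALTER DEFAULT PRIVILEGES IN SCHEMA myschema GRANT SELECT ON TABLES TO PUBLIC;
ALTER DEFAULT PRIVILEGES IN SCHEMA myschema GRANT INSERT ON TABLES TO webuser;

-- undo the above
ALTER DEFAULT PRIVILEGES IN SCHEMA myschema REVOKE SELECT ON TABLES FROM PUBLIC;
ALTER DEFAULT PRIVILEGES IN SCHEMA myschema REVOKE INSERT ON TABLES FROM webuser;

-- remove the public EXECUTE permission on functions created by role admin
ALTER DEFAULT PRIVILEGES FOR ROLE admin REVOKE EXECUTE ON FUNCTIONS FROM PUBLIC;

-- the single-schema variant: has no effect unless it undoes a matching GRANT
ALTER DEFAULT PRIVILEGES IN SCHEMA public REVOKE EXECUTE ON FUNCTIONS FROM PUBLIC;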
That's because per-schema default privileges can only add privileges to the global setting, not remove
privileges granted by it.
Compatibility
There is no ALTER DEFAULT PRIVILEGES statement in the SQL standard.
See Also
GRANT, REVOKE
ALTER DOMAIN
ALTER DOMAIN — change the definition of a domain
Synopsis
Description
ALTER DOMAIN changes the definition of an existing domain. There are several sub-forms:
SET/DROP DEFAULT
These forms set or remove the default value for a domain. Note that defaults only apply to sub-
sequent INSERT commands; they do not affect rows already in a table using the domain.
These forms change whether a domain is marked to allow NULL values or to reject NULL values.
You can only SET NOT NULL when the columns using the domain contain no null values.
This form adds a new constraint to a domain using the same syntax as CREATE DOMAIN. When
a new constraint is added to a domain, all columns using that domain will be checked against
the newly added constraint. These checks can be suppressed by adding the new constraint using
the NOT VALID option; the constraint can later be made valid using ALTER DOMAIN ...
VALIDATE CONSTRAINT. Newly inserted or updated rows are always checked against all con-
straints, even those marked NOT VALID. NOT VALID is only accepted for CHECK constraints.
This form drops constraints on a domain. If IF EXISTS is specified and the constraint does not
exist, no error is thrown. In this case a notice is issued instead.
RENAME CONSTRAINT
VALIDATE CONSTRAINT
This form validates a constraint previously added as NOT VALID, that is, it verifies that all values
in table columns of the domain type satisfy the specified constraint.
OWNER
This form changes the owner of the domain to the specified user.
RENAME
SET SCHEMA
This form changes the schema of the domain. Any constraints associated with the domain are
moved into the new schema as well.
You must own the domain to use ALTER DOMAIN. To change the schema of a domain, you must also
have CREATE privilege on the new schema. To alter the owner, you must also be a direct or indirect
member of the new owning role, and that role must have CREATE privilege on the domain's schema.
(These restrictions enforce that altering the owner doesn't do anything you couldn't do by dropping
and recreating the domain. However, a superuser can alter ownership of any domain anyway.)
Parameters
name
domain_constraint
constraint_name
NOT VALID
CASCADE
Automatically drop objects that depend on the constraint, and in turn all objects that depend on
those objects (see Section 5.13).
RESTRICT
Refuse to drop the constraint if there are any dependent objects. This is the default behavior.
new_name
new_constraint_name
new_owner
new_schema
Notes
Although ALTER DOMAIN ADD CONSTRAINT attempts to verify that existing stored data satisfies
the new constraint, this check is not bulletproof, because the command cannot “see” table rows that
are newly inserted or updated and not yet committed. If there is a hazard that concurrent operations
might insert bad data, the way to proceed is to add the constraint using the NOT VALID option,
commit that command, wait until all transactions started before that commit have finished, and then
issue ALTER DOMAIN VALIDATE CONSTRAINT to search for data violating the constraint. This
method is reliable because once the constraint is committed, all new transactions are guaranteed to
enforce it against new values of the domain type.
Examples
To add a NOT NULL constraint to a domain:
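A minimal sketch, assuming a domain named zipcode:

ALTER DOMAIN zipcode SET NOT NULL;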
Compatibility
ALTER DOMAIN conforms to the SQL standard, except for the OWNER, RENAME, SET SCHEMA,
and VALIDATE CONSTRAINT variants, which are PostgreSQL extensions. The NOT VALID clause
of the ADD CONSTRAINT variant is also a PostgreSQL extension.
See Also
CREATE DOMAIN, DROP DOMAIN
ALTER EVENT TRIGGER
ALTER EVENT TRIGGER — change the definition of an event trigger
Synopsis
Description
ALTER EVENT TRIGGER changes properties of an existing event trigger.
Parameters
name
new_owner
new_name
These forms configure the firing of event triggers. A disabled trigger is still known to the system,
but is not executed when its triggering event occurs. See also session_replication_role.
Compatibility
There is no ALTER EVENT TRIGGER statement in the SQL standard.
See Also
CREATE EVENT TRIGGER, DROP EVENT TRIGGER
ALTER EXTENSION
ALTER EXTENSION — change the definition of an extension
Synopsis
and aggregate_signature is:

* |
[ argmode ] [ argname ] argtype [ , ... ] |
[ [ argmode ] [ argname ] argtype [ , ... ] ] ORDER BY [ argmode ] [ argname ] argtype [ , ... ]
Description
ALTER EXTENSION changes the definition of an installed extension. There are several subforms:
UPDATE
This form updates the extension to a newer version. The extension must supply a suitable update
script (or series of scripts) that can modify the currently-installed version into the requested ver-
sion.
SET SCHEMA
This form moves the extension's objects into another schema. The extension has to be relocatable
for this command to succeed.
ADD member_object
This form adds an existing object to the extension. This is mainly useful in extension update
scripts. The object will subsequently be treated as a member of the extension; notably, it can only
be dropped by dropping the extension.
DROP member_object
This form removes a member object from the extension. This is mainly useful in extension update
scripts. The object is not dropped, only disassociated from the extension.
You must own the extension to use ALTER EXTENSION. The ADD/DROP forms require ownership
of the added/dropped object as well.
Parameters
name
new_version
The desired new version of the extension. This can be written as either an identifier or a string
literal. If not specified, ALTER EXTENSION UPDATE attempts to update to whatever is shown
as the default version in the extension's control file.
new_schema
object_name
aggregate_name
function_name
operator_name
procedure_name
routine_name
The name of an object to be added to or removed from the extension. Names of tables, aggregates,
domains, foreign tables, functions, operators, operator classes, operator families, procedures, rou-
tines, sequences, text search objects, types, and views can be schema-qualified.
source_type
target_type
argmode
The mode of a function, procedure, or aggregate argument: IN, OUT, INOUT, or VARIADIC. If
omitted, the default is IN. Note that ALTER EXTENSION does not actually pay any attention to
OUT arguments, since only the input arguments are needed to determine the function's identity.
So it is sufficient to list the IN, INOUT, and VARIADIC arguments.
argname
The name of a function, procedure, or aggregate argument. Note that ALTER EXTENSION does
not actually pay any attention to argument names, since only the argument data types are needed
to determine the function's identity.
argtype
left_type
right_type
The data type(s) of the operator's arguments (optionally schema-qualified). Write NONE for the
missing argument of a prefix or postfix operator.
PROCEDURAL
type_name
lang_name
Examples
To update the hstore extension to version 2.0:
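That is:

ALTER EXTENSION hstore UPDATE TO '2.0';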
Compatibility
ALTER EXTENSION is a PostgreSQL extension.
See Also
CREATE EXTENSION, DROP EXTENSION
ALTER FOREIGN DATA WRAPPER
ALTER FOREIGN DATA WRAPPER — change the definition of a foreign-data wrapper
Synopsis
Description
ALTER FOREIGN DATA WRAPPER changes the definition of a foreign-data wrapper. The first form
of the command changes the support functions or the generic options of the foreign-data wrapper (at
least one clause is required). The second form changes the owner of the foreign-data wrapper.
Only superusers can alter foreign-data wrappers. Additionally, only superusers can own foreign-data
wrappers.
Parameters
name
HANDLER handler_function
NO HANDLER
This is used to specify that the foreign-data wrapper should no longer have a handler function.
Note that foreign tables that use a foreign-data wrapper with no handler cannot be accessed.
VALIDATOR validator_function
Note that it is possible that pre-existing options of the foreign-data wrapper, or of dependent
servers, user mappings, or foreign tables, are invalid according to the new validator. PostgreSQL
does not check for this. It is up to the user to make sure that these options are correct before using
the modified foreign-data wrapper. However, any options specified in this ALTER FOREIGN
DATA WRAPPER command will be checked using the new validator.
NO VALIDATOR
This is used to specify that the foreign-data wrapper should no longer have a validator function.
Change options for the foreign-data wrapper. ADD, SET, and DROP specify the action to be per-
formed. ADD is assumed if no operation is explicitly specified. Option names must be unique;
names and values are also validated using the foreign data wrapper's validator function, if any.
new_owner
new_name
Examples
Change a foreign-data wrapper dbi, add option foo, drop bar:
ALTER FOREIGN DATA WRAPPER dbi OPTIONS (ADD foo '1', DROP 'bar');
Compatibility
ALTER FOREIGN DATA WRAPPER conforms to ISO/IEC 9075-9 (SQL/MED), except that the
HANDLER, VALIDATOR, OWNER TO, and RENAME clauses are extensions.
See Also
CREATE FOREIGN DATA WRAPPER, DROP FOREIGN DATA WRAPPER
ALTER FOREIGN TABLE
ALTER FOREIGN TABLE — change the definition of a foreign table
Synopsis
Description
ALTER FOREIGN TABLE changes the definition of an existing foreign table. There are several
subforms:
ADD COLUMN
This form adds a new column to the foreign table, using the same syntax as CREATE FOREIGN
TABLE. Unlike the case when adding a column to a regular table, nothing happens to the under-
lying storage: this action simply declares that some new column is now accessible through the
foreign table.
This form drops a column from a foreign table. You will need to say CASCADE if anything outside
the table depends on the column; for example, views. If IF EXISTS is specified and the column
does not exist, no error is thrown. In this case a notice is issued instead.
This form changes the type of a column of a foreign table. Again, this has no effect on any under-
lying storage: this action simply changes the type that PostgreSQL believes the column to have.
SET/DROP DEFAULT
These forms set or remove the default value for a column. Default values only apply in subsequent
INSERT or UPDATE commands; they do not cause rows already in the table to change.
SET STATISTICS
This form sets the per-column statistics-gathering target for subsequent ANALYZE operations.
See the similar form of ALTER TABLE for more details.
This form sets or resets per-attribute options. See the similar form of ALTER TABLE for more
details.
SET STORAGE
This form sets the storage mode for a column. See the similar form of ALTER TABLE for more
details. Note that the storage mode has no effect unless the table's foreign-data wrapper chooses
to pay attention to it.
This form adds a new constraint to a foreign table, using the same syntax as CREATE FOREIGN
TABLE. Currently only CHECK constraints are supported.
Unlike the case when adding a constraint to a regular table, nothing is done to verify the constraint
is correct; rather, this action simply declares that some new condition should be assumed to hold
for all rows in the foreign table. (See the discussion in CREATE FOREIGN TABLE.) If the
constraint is marked NOT VALID, then it isn't assumed to hold, but is only recorded for possible
future use.
VALIDATE CONSTRAINT
This form marks as valid a constraint that was previously marked as NOT VALID. No action is
taken to verify the constraint, but future queries will assume that it holds.
This form drops the specified constraint on a foreign table. If IF EXISTS is specified and the
constraint does not exist, no error is thrown. In this case a notice is issued instead.
These forms configure the firing of trigger(s) belonging to the foreign table. See the similar form
of ALTER TABLE for more details.
This form adds an oid system column to the table (see Section 5.4). It does nothing if the table
already has OIDs. Unless the table's foreign-data wrapper supports OIDs, this column will simply
read as zeroes.
Note that this is not equivalent to ADD COLUMN oid oid; that would add a normal column
that happened to be named oid, not a system column.
This form removes the oid system column from the table. This is exactly equivalent to DROP
COLUMN oid RESTRICT, except that it will not complain if there is already no oid column.
INHERIT parent_table
This form adds the target foreign table as a new child of the specified parent table. See the similar
form of ALTER TABLE for more details.
NO INHERIT parent_table
This form removes the target foreign table from the list of children of the specified parent table.
OWNER
This form changes the owner of the foreign table to the specified user.
Change options for the foreign table or one of its columns. ADD, SET, and DROP specify the
action to be performed. ADD is assumed if no operation is explicitly specified. Duplicate option
names are not allowed (although it's OK for a table option and a column option to have the same
name). Option names and values are also validated using the foreign data wrapper library.
RENAME
The RENAME forms change the name of a foreign table or the name of an individual column in
a foreign table.
SET SCHEMA
All the actions except RENAME and SET SCHEMA can be combined into a list of multiple alterations
to apply in parallel. For example, it is possible to add several columns and/or alter the type of several
columns in a single command.
If the command is written as ALTER FOREIGN TABLE IF EXISTS ... and the foreign table
does not exist, no error is thrown. A notice is issued in this case.
You must own the table to use ALTER FOREIGN TABLE. To change the schema of a foreign table,
you must also have CREATE privilege on the new schema. To alter the owner, you must also be a
direct or indirect member of the new owning role, and that role must have CREATE privilege on the
table's schema. (These restrictions enforce that altering the owner doesn't do anything you couldn't do
by dropping and recreating the table. However, a superuser can alter ownership of any table anyway.)
To add a column or alter a column type, you must also have USAGE privilege on the data type.
Parameters
name
The name (possibly schema-qualified) of an existing foreign table to alter. If ONLY is specified
before the table name, only that table is altered. If ONLY is not specified, the table and all its
descendant tables (if any) are altered. Optionally, * can be specified after the table name to ex-
plicitly indicate that descendant tables are included.
column_name
new_column_name
new_name
data_type
Data type of the new column, or new data type for an existing column.
table_constraint
constraint_name
CASCADE
Automatically drop objects that depend on the dropped column or constraint (for example, views
referencing the column), and in turn all objects that depend on those objects (see Section 5.13).
RESTRICT
Refuse to drop the column or constraint if there are any dependent objects. This is the default
behavior.
trigger_name
ALL
Disable or enable all triggers belonging to the foreign table. (This requires superuser privilege if
any of the triggers are internally generated triggers. The core system does not add such triggers
to foreign tables, but add-on code could do so.)
USER
Disable or enable all triggers belonging to the foreign table except for internally generated triggers.
parent_table
new_owner
new_schema
Notes
The key word COLUMN is noise and can be omitted.
Consistency with the foreign server is not checked when a column is added or removed with ADD
COLUMN or DROP COLUMN, a NOT NULL or CHECK constraint is added, or a column type is changed
with SET DATA TYPE. It is the user's responsibility to ensure that the table definition matches the
remote side.
Examples
To mark a column as not-null:
ALTER FOREIGN TABLE distributors ALTER COLUMN street SET NOT NULL;
Compatibility
The forms ADD, DROP, and SET DATA TYPE conform with the SQL standard. The other forms are
PostgreSQL extensions of the SQL standard. Also, the ability to specify more than one manipulation
in a single ALTER FOREIGN TABLE command is an extension.
ALTER FOREIGN TABLE DROP COLUMN can be used to drop the only column of a foreign table,
leaving a zero-column table. This is an extension of SQL, which disallows zero-column foreign tables.
See Also
CREATE FOREIGN TABLE, DROP FOREIGN TABLE
ALTER FUNCTION
ALTER FUNCTION — change the definition of a function
Synopsis
Description
ALTER FUNCTION changes the definition of a function.
You must own the function to use ALTER FUNCTION. To change a function's schema, you must also
have CREATE privilege on the new schema. To alter the owner, you must also be a direct or indirect
member of the new owning role, and that role must have CREATE privilege on the function's schema.
(These restrictions enforce that altering the owner doesn't do anything you couldn't do by dropping
and recreating the function. However, a superuser can alter ownership of any function anyway.)
Parameters
name
argmode
The mode of an argument: IN, OUT, INOUT, or VARIADIC. If omitted, the default is IN. Note
that ALTER FUNCTION does not actually pay any attention to OUT arguments, since only the
input arguments are needed to determine the function's identity. So it is sufficient to list the IN,
INOUT, and VARIADIC arguments.
argname
The name of an argument. Note that ALTER FUNCTION does not actually pay any attention
to argument names, since only the argument data types are needed to determine the function's
identity.
argtype
new_name
new_owner
The new owner of the function. Note that if the function is marked SECURITY DEFINER, it
will subsequently execute as the new owner.
new_schema
extension_name
CALLED ON NULL INPUT changes the function so that it will be invoked when some or
all of its arguments are null. RETURNS NULL ON NULL INPUT or STRICT changes the
function so that it is not invoked if any of its arguments are null; instead, a null result is assumed
automatically. See CREATE FUNCTION for more information.
IMMUTABLE
STABLE
VOLATILE
Change the volatility of the function to the specified setting. See CREATE FUNCTION for de-
tails.
Change whether the function is a security definer or not. The key word EXTERNAL is ignored for
SQL conformance. See CREATE FUNCTION for more information about this capability.
PARALLEL
Change whether the function is deemed safe for parallelism. See CREATE FUNCTION for de-
tails.
LEAKPROOF
Change whether the function is considered leakproof or not. See CREATE FUNCTION for more
information about this capability.
COST execution_cost
Change the estimated execution cost of the function. See CREATE FUNCTION for more infor-
mation.
ROWS result_rows
Change the estimated number of rows returned by a set-returning function. See CREATE FUNC-
TION for more information.
configuration_parameter
value
Add or change the assignment to be made to a configuration parameter when the function is called.
If value is DEFAULT or, equivalently, RESET is used, the function-local setting is removed, so
that the function executes with the value present in its environment. Use RESET ALL to clear all
function-local settings. SET FROM CURRENT saves the value of the parameter that is current
when ALTER FUNCTION is executed as the value to be applied when the function is entered.
See SET and Chapter 19 for more information about allowed parameter names and values.
RESTRICT
Examples
To rename the function sqrt for type integer to square_root:
To change the owner of the function sqrt for type integer to joe:
To change the schema of the function sqrt for type integer to maths:
To mark the function sqrt for type integer as being dependent on the extension mathlib:
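The corresponding commands, plus a sketch of clearing a function-local search_path setting (the
check_password function in the last example is illustrative), are:

ALTER FUNCTION sqrt(integer) RENAME TO square_root;

ALTER FUNCTION sqrt(integer) OWNER TO joe;

ALTER FUNCTION sqrt(integer) SET SCHEMA maths;

ALTER FUNCTION sqrt(integer) DEPENDS ON EXTENSION mathlib;

ALTER FUNCTION check_password(text) RESET search_path;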
The function will now execute with whatever search path is used by its caller.
Compatibility
This statement is partially compatible with the ALTER FUNCTION statement in the SQL standard.
The standard allows more properties of a function to be modified, but does not provide the ability
to rename a function, make a function a security definer, attach configuration parameter values to
a function, or change the owner, schema, or volatility of a function. The standard also requires the
RESTRICT key word, which is optional in PostgreSQL.
See Also
CREATE FUNCTION, DROP FUNCTION, ALTER PROCEDURE, ALTER ROUTINE
ALTER GROUP
ALTER GROUP — change role name or membership
Synopsis
ALTER GROUP role_specification ADD USER user_name [, ... ]
ALTER GROUP role_specification DROP USER user_name [, ... ]
ALTER GROUP group_name RENAME TO new_name

where role_specification can be:
    role_name | CURRENT_USER | SESSION_USER
Description
ALTER GROUP changes the attributes of a user group. This is an obsolete command, though still
accepted for backwards compatibility, because groups (and users too) have been superseded by the
more general concept of roles.
The first two variants add users to a group or remove them from a group. (Any role can play the part
of either a “user” or a “group” for this purpose.) These variants are effectively equivalent to granting
or revoking membership in the role named as the “group”; so the preferred way to do this is to use
GRANT or REVOKE.
The third variant changes the name of the group. This is exactly equivalent to renaming the role with
ALTER ROLE.
Parameters
group_name
user_name
Users (roles) that are to be added to or removed from the group. The users must already exist;
ALTER GROUP does not create or drop users.
new_name
Examples
Add users to a group:
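For instance (the group and user names are illustrative):

ALTER GROUP staff ADD USER karl, john;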
Compatibility
There is no ALTER GROUP statement in the SQL standard.
See Also
GRANT, REVOKE, ALTER ROLE
ALTER INDEX
ALTER INDEX — change the definition of an index
Synopsis
Description
ALTER INDEX changes the definition of an existing index. There are several subforms:
RENAME
The RENAME form changes the name of the index. If the index is associated with a table constraint
(either UNIQUE, PRIMARY KEY, or EXCLUDE), the constraint is renamed as well. There is no
effect on the stored data.
SET TABLESPACE
This form changes the index's tablespace to the specified tablespace and moves the data file(s)
associated with the index to the new tablespace. To change the tablespace of an index, you must
own the index and have CREATE privilege on the new tablespace. All indexes in the current
database in a tablespace can be moved by using the ALL IN TABLESPACE form, which will
lock all indexes to be moved and then move each one. This form also supports OWNED BY, which
will only move indexes owned by the roles specified. If the NOWAIT option is specified then the
command will fail if it is unable to acquire all of the locks required immediately. Note that system
catalogs will not be moved by this command, use ALTER DATABASE or explicit ALTER INDEX
invocations instead if desired. See also CREATE TABLESPACE.
ATTACH PARTITION
Causes the named index to become attached to the altered index. The named index must be on
a partition of the table containing the index being altered, and have an equivalent definition. An
attached index cannot be dropped by itself, and will automatically be dropped if its parent index
is dropped.
DEPENDS ON EXTENSION
This form marks the index as dependent on the extension, such that if the extension is dropped,
the index will automatically be dropped as well.
This form changes one or more index-method-specific storage parameters for the index. See CRE-
ATE INDEX for details on the available parameters. Note that the index contents will not be
modified immediately by this command; depending on the parameter you might need to rebuild
the index with REINDEX to get the desired effects.
This form resets one or more index-method-specific storage parameters to their defaults. As with
SET, a REINDEX might be needed to update the index entirely.
This form sets the per-column statistics-gathering target for subsequent ANALYZE operations,
though can be used only on index columns that are defined as an expression. Since expressions
lack a unique name, we refer to them using the ordinal number of the index column. The target
can be set in the range 0 to 10000; alternatively, set it to -1 to revert to using the system default
statistics target (default_statistics_target). For more information on the use of statistics by the
PostgreSQL query planner, refer to Section 14.2.
Parameters
IF EXISTS
Do not throw an error if the index does not exist. A notice is issued in this case.
column_number
The ordinal number refers to the ordinal (left-to-right) position of the index column.
name
new_name
tablespace_name
extension_name
storage_parameter
value
The new value for an index-method-specific storage parameter. This might be a number or a word
depending on the parameter.
Notes
These operations are also possible using ALTER TABLE. ALTER INDEX is in fact just an alias for
the forms of ALTER TABLE that apply to indexes.
There was formerly an ALTER INDEX OWNER variant, but this is now ignored (with a warning). An
index cannot have an owner different from its table's owner. Changing the table's owner automatically
changes the index as well.
Examples
To rename an existing index:
To change an index's fill factor (assuming that the index method supports it):
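For instance (the index name is illustrative):

ALTER INDEX distributors RENAME TO suppliers;

ALTER INDEX distributors SET (fillfactor = 75);
REINDEX INDEX distributors;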
Compatibility
ALTER INDEX is a PostgreSQL extension.
See Also
CREATE INDEX, REINDEX
ALTER LANGUAGE
ALTER LANGUAGE — change the definition of a procedural language
Synopsis
Description
ALTER LANGUAGE changes the definition of a procedural language. The only functionality is to
rename the language or assign a new owner. You must be superuser or owner of the language to use
ALTER LANGUAGE.
Parameters
name
Name of a language
new_name
new_owner
Compatibility
There is no ALTER LANGUAGE statement in the SQL standard.
See Also
CREATE LANGUAGE, DROP LANGUAGE
ALTER LARGE OBJECT
ALTER LARGE OBJECT — change the definition of a large object
Synopsis
Description
ALTER LARGE OBJECT changes the definition of a large object.
You must own the large object to use ALTER LARGE OBJECT. To alter the owner, you must also be
a direct or indirect member of the new owning role. (However, a superuser can alter any large object
anyway.) Currently, the only functionality is to assign a new owner, so both restrictions always apply.
Parameters
large_object_oid
new_owner
Compatibility
There is no ALTER LARGE OBJECT statement in the SQL standard.
See Also
Chapter 35
ALTER MATERIALIZED VIEW
ALTER MATERIALIZED VIEW — change the definition of a materialized view
Synopsis
Description
ALTER MATERIALIZED VIEW changes various auxiliary properties of an existing materialized
view.
You must own the materialized view to use ALTER MATERIALIZED VIEW. To change a materi-
alized view's schema, you must also have CREATE privilege on the new schema. To alter the owner,
you must also be a direct or indirect member of the new owning role, and that role must have CREATE
privilege on the materialized view's schema. (These restrictions enforce that altering the owner doesn't
do anything you couldn't do by dropping and recreating the materialized view. However, a superuser
can alter ownership of any view anyway.)
The DEPENDS ON EXTENSION form marks the materialized view as dependent on an extension,
such that the materialized view will automatically be dropped if the extension is dropped.
The statement subforms and actions available for ALTER MATERIALIZED VIEW are a subset of
those available for ALTER TABLE, and have the same meaning when used for materialized views.
See the descriptions for ALTER TABLE for details.
Parameters
name
column_name
extension_name
The name of the extension that the materialized view is to depend on.
new_column_name
new_owner
new_name
new_schema
Examples
To rename the materialized view foo to bar:
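That is:

ALTER MATERIALIZED VIEW foo RENAME TO bar;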
Compatibility
ALTER MATERIALIZED VIEW is a PostgreSQL extension.
See Also
CREATE MATERIALIZED VIEW, DROP MATERIALIZED VIEW, REFRESH MATERIALIZED
VIEW
ALTER OPERATOR
ALTER OPERATOR — change the definition of an operator
Synopsis
Description
ALTER OPERATOR changes the definition of an operator.
You must own the operator to use ALTER OPERATOR. To alter the owner, you must also be a direct
or indirect member of the new owning role, and that role must have CREATE privilege on the opera-
tor's schema. (These restrictions enforce that altering the owner doesn't do anything you couldn't do
by dropping and recreating the operator. However, a superuser can alter ownership of any operator
anyway.)
Parameters
name
left_type
The data type of the operator's left operand; write NONE if the operator has no left operand.
right_type
The data type of the operator's right operand; write NONE if the operator has no right operand.
new_owner
new_schema
res_proc
The restriction selectivity estimator function for this operator; write NONE to remove existing
selectivity estimator.
join_proc
The join selectivity estimator function for this operator; write NONE to remove existing selec-
tivity estimator.
Examples
Change the owner of a custom operator a @@ b for type text:
Change the restriction and join selectivity estimator functions of a custom operator a && b for type
int[]:
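Sketches of these commands (the new owner and the selectivity estimator functions, here the ones
provided by the intarray module, are illustrative):

ALTER OPERATOR @@ (text, text) OWNER TO joe;

ALTER OPERATOR && (int[], int[]) SET (RESTRICT = _int_contsel, JOIN = _int_contjoinsel);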
Compatibility
There is no ALTER OPERATOR statement in the SQL standard.
See Also
CREATE OPERATOR, DROP OPERATOR
ALTER OPERATOR CLASS
ALTER OPERATOR CLASS — change the definition of an operator class
Synopsis
Description
ALTER OPERATOR CLASS changes the definition of an operator class.
You must own the operator class to use ALTER OPERATOR CLASS. To alter the owner, you must also
be a direct or indirect member of the new owning role, and that role must have CREATE privilege on
the operator class's schema. (These restrictions enforce that altering the owner doesn't do anything you
couldn't do by dropping and recreating the operator class. However, a superuser can alter ownership
of any operator class anyway.)
Parameters
name
index_method
new_name
new_owner
new_schema
Compatibility
There is no ALTER OPERATOR CLASS statement in the SQL standard.
See Also
CREATE OPERATOR CLASS, DROP OPERATOR CLASS, ALTER OPERATOR FAMILY
ALTER OPERATOR FAMILY
ALTER OPERATOR FAMILY — change the definition of an operator family
Synopsis
Description
ALTER OPERATOR FAMILY changes the definition of an operator family. You can add operators and
support functions to the family, remove them from the family, or change the family's name or owner.
When operators and support functions are added to a family with ALTER OPERATOR FAMILY, they
are not part of any specific operator class within the family, but are just “loose” within the family. This
indicates that these operators and functions are compatible with the family's semantics, but are not
required for correct functioning of any specific index. (Operators and functions that are so required
should be declared as part of an operator class, instead; see CREATE OPERATOR CLASS.) Post-
greSQL will allow loose members of a family to be dropped from the family at any time, but members
of an operator class cannot be dropped without dropping the whole class and any indexes that depend
on it. Typically, single-data-type operators and functions are part of operator classes because they are
needed to support an index on that specific data type, while cross-data-type operators and functions
are made loose members of the family.
You must be a superuser to use ALTER OPERATOR FAMILY. (This restriction is made because an
erroneous operator family definition could confuse or even crash the server.)
ALTER OPERATOR FAMILY does not presently check whether the operator family definition in-
cludes all the operators and functions required by the index method, nor whether the operators and
functions form a self-consistent set. It is the user's responsibility to define a valid operator family.
Parameters
name
index_method
strategy_number
The index method's strategy number for an operator associated with the operator family.
operator_name
The name (optionally schema-qualified) of an operator associated with the operator family.
op_type
In an OPERATOR clause, the operand data type(s) of the operator, or NONE to signify a left-unary
or right-unary operator. Unlike the comparable syntax in CREATE OPERATOR CLASS, the
operand data types must always be specified.
In an ADD FUNCTION clause, the operand data type(s) the function is intended to support, if
different from the input data type(s) of the function. For B-tree comparison functions and hash
functions it is not necessary to specify op_type since the function's input data type(s) are always
the correct ones to use. For B-tree sort support functions and all functions in GiST, SP-GiST and
GIN operator classes, it is necessary to specify the operand data type(s) the function is to be used
with.
In a DROP FUNCTION clause, the operand data type(s) the function is intended to support must
be specified.
sort_family_name
The name (optionally schema-qualified) of an existing btree operator family that describes the
sort ordering associated with an ordering operator.
If neither FOR SEARCH nor FOR ORDER BY is specified, FOR SEARCH is the default.
support_number
The index method's support function number for a function associated with the operator family.
function_name
The name (optionally schema-qualified) of a function that is an index method support function
for the operator family. If no argument list is specified, the name must be unique in its schema.
argument_type
new_name
new_owner
new_schema
Notes
Notice that the DROP syntax only specifies the “slot” in the operator family, by strategy or support
number and input data type(s). The name of the operator or function occupying the slot is not men-
tioned. Also, for DROP FUNCTION the type(s) to specify are the input data type(s) the function is
intended to support; for GiST, SP-GiST and GIN indexes this might have nothing to do with the actual
input argument types of the function.
Because the index machinery does not check access permissions on functions before using them, in-
cluding a function or operator in an operator family is tantamount to granting public execute permis-
sion on it. This is usually not an issue for the sorts of functions that are useful in an operator family.
The operators should not be defined by SQL functions. A SQL function is likely to be inlined into the
calling query, which will prevent the optimizer from recognizing that the query matches an index.
Before PostgreSQL 8.4, the OPERATOR clause could include a RECHECK option. This is no longer
supported because whether an index operator is “lossy” is now determined on-the-fly at run time. This
allows efficient handling of cases where an operator might or might not be lossy.
Examples
The following example command adds cross-data-type operators and support functions to an operator
family that already contains B-tree operator classes for data types int4 and int2.
ALTER OPERATOR FAMILY integer_ops USING btree ADD

-- int4 vs int2
OPERATOR 1 < (int4, int2) ,
OPERATOR 2 <= (int4, int2) ,
OPERATOR 3 = (int4, int2) ,
OPERATOR 4 >= (int4, int2) ,
OPERATOR 5 > (int4, int2) ,
FUNCTION 1 btint42cmp(int4, int2) ,
-- int2 vs int4
OPERATOR 1 < (int2, int4) ,
OPERATOR 2 <= (int2, int4) ,
OPERATOR 3 = (int2, int4) ,
OPERATOR 4 >= (int2, int4) ,
OPERATOR 5 > (int2, int4) ,
FUNCTION 1 btint24cmp(int2, int4) ;
The following example command removes these entries again:

ALTER OPERATOR FAMILY integer_ops USING btree DROP

-- int4 vs int2
OPERATOR 1 (int4, int2) ,
OPERATOR 2 (int4, int2) ,
OPERATOR 3 (int4, int2) ,
OPERATOR 4 (int4, int2) ,
OPERATOR 5 (int4, int2) ,
FUNCTION 1 (int4, int2) ,
-- int2 vs int4
OPERATOR 1 (int2, int4) ,
OPERATOR 2 (int2, int4) ,
OPERATOR 3 (int2, int4) ,
OPERATOR 4 (int2, int4) ,
OPERATOR 5 (int2, int4) ,
FUNCTION 1 (int2, int4) ;
Compatibility
There is no ALTER OPERATOR FAMILY statement in the SQL standard.
See Also
CREATE OPERATOR FAMILY, DROP OPERATOR FAMILY, CREATE OPERATOR CLASS,
ALTER OPERATOR CLASS, DROP OPERATOR CLASS
ALTER POLICY
ALTER POLICY — change the definition of a row level security policy
Synopsis
Description
ALTER POLICY changes the definition of an existing row-level security policy. Note that ALTER
POLICY only allows the set of roles to which the policy applies and the USING and WITH CHECK
expressions to be modified. To change other properties of a policy, such as the command to which it
applies or whether it is permissive or restrictive, the policy must be dropped and recreated.
To use ALTER POLICY, you must own the table that the policy applies to.
In the second form of ALTER POLICY, the role list, using_expression, and check_expres-
sion are replaced independently if specified. When one of those clauses is omitted, the corresponding
part of the policy is unchanged.
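For instance, a sketch (the policy, table, role, and expression names are illustrative):

ALTER POLICY account_managers ON accounts TO sales_admin USING (manager = current_user);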
Parameters
name
table_name
The name (optionally schema-qualified) of the table that the policy is on.
new_name
role_name
The role(s) to which the policy applies. Multiple roles can be specified at one time. To apply the
policy to all roles, use PUBLIC.
using_expression
The USING expression for the policy. See CREATE POLICY for details.
check_expression
The WITH CHECK expression for the policy. See CREATE POLICY for details.
Compatibility
ALTER POLICY is a PostgreSQL extension.
See Also
CREATE POLICY, DROP POLICY
ALTER PROCEDURE
ALTER PROCEDURE — change the definition of a procedure
Synopsis
Description
ALTER PROCEDURE changes the definition of a procedure.
You must own the procedure to use ALTER PROCEDURE. To change a procedure's schema, you
must also have CREATE privilege on the new schema. To alter the owner, you must also be a direct or
indirect member of the new owning role, and that role must have CREATE privilege on the procedure's
schema. (These restrictions enforce that altering the owner doesn't do anything you couldn't do by
dropping and recreating the procedure. However, a superuser can alter ownership of any procedure
anyway.)
Parameters
name
argmode
argname
The name of an argument. Note that ALTER PROCEDURE does not actually pay any attention
to argument names, since only the argument data types are needed to determine the procedure's
identity.
argtype
new_name
new_owner
The new owner of the procedure. Note that if the procedure is marked SECURITY DEFINER,
it will subsequently execute as the new owner.
new_schema
extension_name
Change whether the procedure is a security definer or not. The key word EXTERNAL is ignored
for SQL conformance. See CREATE PROCEDURE for more information about this capability.
configuration_parameter
value
Add or change the assignment to be made to a configuration parameter when the procedure is
called. If value is DEFAULT or, equivalently, RESET is used, the procedure-local setting is
removed, so that the procedure executes with the value present in its environment. Use RESET
ALL to clear all procedure-local settings. SET FROM CURRENT saves the value of the parame-
ter that is current when ALTER PROCEDURE is executed as the value to be applied when the
procedure is entered.
See SET and Chapter 19 for more information about allowed parameter names and values.
RESTRICT
Examples
To rename the procedure insert_data with two arguments of type integer to
insert_record:
To change the owner of the procedure insert_data with two arguments of type integer to joe:
To change the schema of the procedure insert_data with two arguments of type integer to
accounting:
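The corresponding commands, plus a sketch of clearing a procedure-local search_path setting (the
check_password procedure in the last example is illustrative), are:

ALTER PROCEDURE insert_data(integer, integer) RENAME TO insert_record;

ALTER PROCEDURE insert_data(integer, integer) OWNER TO joe;

ALTER PROCEDURE insert_data(integer, integer) SET SCHEMA accounting;

ALTER PROCEDURE check_password(text) RESET search_path;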
The procedure will now execute with whatever search path is used by its caller.
Compatibility
This statement is partially compatible with the ALTER PROCEDURE statement in the SQL standard.
The standard allows more properties of a procedure to be modified, but does not provide the ability to
rename a procedure, make a procedure a security definer, attach configuration parameter values to a
procedure, or change the owner, schema, or volatility of a procedure. The standard also requires the
RESTRICT key word, which is optional in PostgreSQL.
See Also
CREATE PROCEDURE, DROP PROCEDURE, ALTER FUNCTION, ALTER ROUTINE
ALTER PUBLICATION
ALTER PUBLICATION — change the definition of a publication
Synopsis
Description
The command ALTER PUBLICATION can change the attributes of a publication.
The first three variants change which tables are part of the publication. The SET TABLE clause will
replace the list of tables in the publication with the specified one. The ADD TABLE and DROP TABLE
clauses will add and remove one or more tables from the publication. Note that adding tables to a
publication that is already subscribed to will require an ALTER SUBSCRIPTION ... REFRESH
PUBLICATION action on the subscribing side in order to become effective.
The fourth variant of this command listed in the synopsis can change all of the publication properties
specified in CREATE PUBLICATION. Properties not mentioned in the command retain their previous
settings.
The remaining variants change the owner and the name of the publication.
You must own the publication to use ALTER PUBLICATION. Adding a table to a publication addi-
tionally requires owning that table. To alter the owner, you must also be a direct or indirect member
of the new owning role. The new owner must have CREATE privilege on the database. Also, the new
owner of a FOR ALL TABLES publication must be a superuser. However, a superuser can change
the ownership of a publication regardless of these restrictions.
Parameters
name
table_name
Name of an existing table. If ONLY is specified before the table name, only that table is affected.
If ONLY is not specified, the table and all its descendant tables (if any) are affected. Optionally,
* can be specified after the table name to explicitly indicate that descendant tables are included.
SET ( publication_parameter [= value] [, ... ] )
This clause alters publication parameters originally set by CREATE PUBLICATION. See there
for more information.
new_owner
new_name
Examples
Change the publication to publish only deletes and updates:
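A sketch, assuming a publication named mypublication (a placeholder name):

-- mypublication is a placeholder publication name
ALTER PUBLICATION mypublication SET (publish = 'update, delete');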
Compatibility
ALTER PUBLICATION is a PostgreSQL extension.
See Also
CREATE PUBLICATION, DROP PUBLICATION, CREATE SUBSCRIPTION, ALTER
SUBSCRIPTION
ALTER ROLE
ALTER ROLE — change a database role
Synopsis
ALTER ROLE role_specification [ WITH ] option [ ... ]

where option can be:
      SUPERUSER | NOSUPERUSER
    | CREATEDB | NOCREATEDB
    | CREATEROLE | NOCREATEROLE
    | INHERIT | NOINHERIT
    | LOGIN | NOLOGIN
    | REPLICATION | NOREPLICATION
    | BYPASSRLS | NOBYPASSRLS
    | CONNECTION LIMIT connlimit
    | [ ENCRYPTED ] PASSWORD 'password' | PASSWORD NULL
    | VALID UNTIL 'timestamp'

ALTER ROLE name RENAME TO new_name

ALTER ROLE { role_specification | ALL } [ IN DATABASE database_name ] SET configuration_parameter { TO | = } { value | DEFAULT }
ALTER ROLE { role_specification | ALL } [ IN DATABASE database_name ] SET configuration_parameter FROM CURRENT
ALTER ROLE { role_specification | ALL } [ IN DATABASE database_name ] RESET configuration_parameter
ALTER ROLE { role_specification | ALL } [ IN DATABASE database_name ] RESET ALL

where role_specification can be:
    role_name
  | CURRENT_USER
  | SESSION_USER
Description
ALTER ROLE changes the attributes of a PostgreSQL role.
The first variant of this command listed in the synopsis can change many of the role attributes that
can be specified in CREATE ROLE. (All the possible attributes are covered, except that there are
no options for adding or removing memberships; use GRANT and REVOKE for that.) Attributes not
mentioned in the command retain their previous settings. Database superusers can change any of these
settings for any role. Roles having CREATEROLE privilege can change any of these settings except
SUPERUSER, REPLICATION, and BYPASSRLS; but only for non-superuser and non-replication
roles. Ordinary roles can only change their own password.
The second variant changes the name of the role. Database superusers can rename any role. Roles
having CREATEROLE privilege can rename non-superuser roles. The current session user cannot be
renamed. (Connect as a different user if you need to do that.) Because MD5-encrypted passwords
use the role name as cryptographic salt, renaming a role clears its password if the password is MD5-
encrypted.
The remaining variants change a role's session default for a configuration variable, either for all data-
bases or, when the IN DATABASE clause is specified, only for sessions in the named database. If
ALL is specified instead of a role name, this changes the setting for all roles. Using ALL with IN
DATABASE is effectively the same as using the command ALTER DATABASE ... SET ....
Whenever the role subsequently starts a new session, the specified value becomes the session default,
overriding whatever setting is present in postgresql.conf or has been received from the
postgres command line. This only happens at login time; executing SET ROLE or SET SESSION
AUTHORIZATION does not cause new configuration values to be set. Settings set for all databases are
overridden by database-specific settings attached to a role. Settings for specific databases or specific
roles override settings for all roles.
Superusers can change anyone's session defaults. Roles having CREATEROLE privilege can change
defaults for non-superuser roles. Ordinary roles can only set defaults for themselves. Certain config-
uration variables cannot be set this way, or can only be set if a superuser issues the command. Only
superusers can change a setting for all roles in all databases.
Parameters
name
CURRENT_USER
SESSION_USER
SUPERUSER
NOSUPERUSER
CREATEDB
NOCREATEDB
CREATEROLE
NOCREATEROLE
INHERIT
NOINHERIT
LOGIN
NOLOGIN
REPLICATION
NOREPLICATION
BYPASSRLS
NOBYPASSRLS
CONNECTION LIMIT connlimit
[ ENCRYPTED ] PASSWORD 'password'
PASSWORD NULL
VALID UNTIL 'timestamp'
These clauses alter attributes originally set by CREATE ROLE. For more information, see the
CREATE ROLE reference page.
new_name
database_name
The name of the database the configuration variable should be set in.
configuration_parameter
value
Set this role's session default for the specified configuration parameter to the given value. If val-
ue is DEFAULT or, equivalently, RESET is used, the role-specific variable setting is removed, so
the role will inherit the system-wide default setting in new sessions. Use RESET ALL to clear all
role-specific settings. SET FROM CURRENT saves the session's current value of the parameter
as the role-specific value. If IN DATABASE is specified, the configuration parameter is set or
removed for the given role and database only.
Role-specific variable settings take effect only at login; SET ROLE and SET SESSION
AUTHORIZATION do not process role-specific variable settings.
See SET and Chapter 19 for more information about allowed parameter names and values.
Notes
Use CREATE ROLE to add new roles, and DROP ROLE to remove a role.
ALTER ROLE cannot change a role's memberships. Use GRANT and REVOKE to do that.
Caution must be exercised when specifying an unencrypted password with this command. The pass-
word will be transmitted to the server in cleartext, and it might also be logged in the client's command
history or the server log. psql contains a command \password that can be used to change a role's
password without exposing the cleartext password.
It is also possible to tie a session default to a specific database rather than to a role; see ALTER
DATABASE. If there is a conflict, database-role-specific settings override role-specific ones, which
in turn override database-specific ones.
Examples
Change a role's password:
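For example, with placeholder role name and password:

-- davide and the password string are placeholders
ALTER ROLE davide WITH PASSWORD 'hu8jmn3';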
Change a password expiration date, specifying that the password should expire at midday on 4th May
2015 using the time zone which is one hour ahead of UTC:
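For example, with a placeholder role name:

-- chris is a placeholder role name
ALTER ROLE chris VALID UNTIL 'May 4 12:00:00 2015 +1';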
Give a role the ability to create other roles and new databases:
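For example, with a placeholder role name:

-- miriam is a placeholder role name
ALTER ROLE miriam CREATEROLE CREATEDB;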
Compatibility
The ALTER ROLE statement is a PostgreSQL extension.
See Also
CREATE ROLE, DROP ROLE, ALTER DATABASE, SET
ALTER ROUTINE
ALTER ROUTINE — change the definition of a routine
Synopsis
Description
ALTER ROUTINE changes the definition of a routine, which can be an aggregate function, a nor-
mal function, or a procedure. See under ALTER AGGREGATE, ALTER FUNCTION, and ALTER
PROCEDURE for the description of the parameters, more examples, and further details.
Examples
To rename the routine foo for type integer to foobar:
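For example:

ALTER ROUTINE foo(integer) RENAME TO foobar;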
This command will work independently of whether foo is an aggregate, function, or procedure.
Compatibility
This statement is partially compatible with the ALTER ROUTINE statement in the SQL standard. See
under ALTER FUNCTION and ALTER PROCEDURE for more details. Allowing routine names to
refer to aggregate functions is a PostgreSQL extension.
See Also
ALTER AGGREGATE, ALTER FUNCTION, ALTER PROCEDURE, DROP ROUTINE
ALTER RULE
ALTER RULE — change the definition of a rule
Synopsis
Description
ALTER RULE changes properties of an existing rule. Currently, the only available action is to change
the rule's name.
To use ALTER RULE, you must own the table or view that the rule applies to.
Parameters
name
table_name
The name (optionally schema-qualified) of the table or view that the rule applies to.
new_name
Examples
To rename an existing rule:
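A sketch, with placeholder rule and table names:

-- notify_all, emp, and notify_me are placeholder names
ALTER RULE notify_all ON emp RENAME TO notify_me;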
Compatibility
ALTER RULE is a PostgreSQL language extension, as is the entire query rewrite system.
See Also
CREATE RULE, DROP RULE
ALTER SCHEMA
ALTER SCHEMA — change the definition of a schema
Synopsis
Description
ALTER SCHEMA changes the definition of a schema.
You must own the schema to use ALTER SCHEMA. To rename a schema you must also have the
CREATE privilege for the database. To alter the owner, you must also be a direct or indirect member of
the new owning role, and you must have the CREATE privilege for the database. (Note that superusers
have all these privileges automatically.)
Parameters
name
new_name
The new name of the schema. The new name cannot begin with pg_, as such names are reserved
for system schemas.
new_owner
Compatibility
There is no ALTER SCHEMA statement in the SQL standard.
See Also
CREATE SCHEMA, DROP SCHEMA
ALTER SEQUENCE
ALTER SEQUENCE — change the definition of a sequence generator
Synopsis
Description
ALTER SEQUENCE changes the parameters of an existing sequence generator. Any parameters not
specifically set in the ALTER SEQUENCE command retain their prior settings.
You must own the sequence to use ALTER SEQUENCE. To change a sequence's schema, you must also
have CREATE privilege on the new schema. To alter the owner, you must also be a direct or indirect
member of the new owning role, and that role must have CREATE privilege on the sequence's schema.
(These restrictions enforce that altering the owner doesn't do anything you couldn't do by dropping
and recreating the sequence. However, a superuser can alter ownership of any sequence anyway.)
Parameters
name
IF EXISTS
Do not throw an error if the sequence does not exist. A notice is issued in this case.
data_type
The optional clause AS data_type changes the data type of the sequence. Valid types are
smallint, integer, and bigint.
Changing the data type automatically changes the minimum and maximum values of the sequence
if and only if the previous minimum and maximum values were the minimum or maximum value
of the old data type (in other words, if the sequence had been created using NO MINVALUE or
NO MAXVALUE, implicitly or explicitly). Otherwise, the minimum and maximum values are pre-
served, unless new values are given as part of the same command. If the minimum and maximum
values do not fit into the new data type, an error will be generated.
increment
The clause INCREMENT BY increment is optional. A positive value will make an ascending
sequence, a negative one a descending sequence. If unspecified, the old increment value will be
maintained.
minvalue
NO MINVALUE
The optional clause MINVALUE minvalue determines the minimum value a sequence can
generate. If NO MINVALUE is specified, the defaults of 1 and the minimum value of the data type
for ascending and descending sequences, respectively, will be used. If neither option is specified,
the current minimum value will be maintained.
maxvalue
NO MAXVALUE
The optional clause MAXVALUE maxvalue determines the maximum value for the sequence.
If NO MAXVALUE is specified, the defaults of the maximum value of the data type and -1 for
ascending and descending sequences, respectively, will be used. If neither option is specified, the
current maximum value will be maintained.
start
The optional clause START WITH start changes the recorded start value of the sequence.
This has no effect on the current sequence value; it simply sets the value that future ALTER
SEQUENCE RESTART commands will use.
restart
The optional clause RESTART [ WITH restart ] changes the current value of the sequence.
This is similar to calling the setval function with is_called = false: the specified value
will be returned by the next call of nextval. Writing RESTART with no restart value is
equivalent to supplying the start value that was recorded by CREATE SEQUENCE or last set by
ALTER SEQUENCE START WITH.
cache
The clause CACHE cache enables sequence numbers to be preallocated and stored in memory
for faster access. The minimum value is 1 (only one value can be generated at a time, i.e., no
cache). If unspecified, the old cache value will be maintained.
CYCLE
The optional CYCLE key word can be used to enable the sequence to wrap around when the max-
value or minvalue has been reached by an ascending or descending sequence respectively.
If the limit is reached, the next number generated will be the minvalue or maxvalue, respec-
tively.
NO CYCLE
If the optional NO CYCLE key word is specified, any calls to nextval after the sequence has
reached its maximum value will return an error. If neither CYCLE nor NO CYCLE is specified,
the old cycle behavior will be maintained.
OWNED BY table_name.column_name
OWNED BY NONE
The OWNED BY option causes the sequence to be associated with a specific table column, such that
if that column (or its whole table) is dropped, the sequence will be automatically dropped as well.
If specified, this association replaces any previously specified association for the sequence. The
specified table must have the same owner and be in the same schema as the sequence. Specifying
OWNED BY NONE removes any existing association, making the sequence “free-standing”.
new_owner
new_name
new_schema
Notes
ALTER SEQUENCE will not immediately affect nextval results in backends, other than the current
one, that have preallocated (cached) sequence values. They will use up all cached values prior to notic-
ing the changed sequence generation parameters. The current backend will be affected immediately.
ALTER SEQUENCE does not affect the currval status for the sequence. (Before PostgreSQL 8.3,
it sometimes did.)
ALTER SEQUENCE blocks concurrent nextval, currval, lastval, and setval calls.
For historical reasons, ALTER TABLE can be used with sequences too; but the only variants of ALTER
TABLE that are allowed with sequences are equivalent to the forms shown above.
Examples
Restart a sequence called serial, at 105:
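For example:

ALTER SEQUENCE serial RESTART WITH 105;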
Compatibility
ALTER SEQUENCE conforms to the SQL standard, except for the AS, START WITH, OWNED BY,
OWNER TO, RENAME TO, and SET SCHEMA clauses, which are PostgreSQL extensions.
See Also
CREATE SEQUENCE, DROP SEQUENCE
ALTER SERVER
ALTER SERVER — change the definition of a foreign server
Synopsis
Description
ALTER SERVER changes the definition of a foreign server. The first form changes the server version
string or the generic options of the server (at least one clause is required). The second form changes
the owner of the server.
To alter the server you must be the owner of the server. Additionally, to alter the owner, you must
own the server and also be a direct or indirect member of the new owning role, and you must have
USAGE privilege on the server's foreign-data wrapper. (Note that superusers satisfy all these criteria
automatically.)
Parameters
name
new_version
New server version.
OPTIONS ( [ ADD | SET | DROP ] option ['value'] [, ... ] )
Change options for the server. ADD, SET, and DROP specify the action to be performed. ADD is
assumed if no operation is explicitly specified. Option names must be unique; names and values
are also validated using the server's foreign-data wrapper library.
new_owner
new_name
Examples
Alter server foo, add connection options:
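A sketch; the option names and values shown are placeholders that the server's foreign-data
wrapper would have to accept:

-- host and dbname values are placeholders
ALTER SERVER foo OPTIONS (host 'foo', dbname 'foodb');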
Compatibility
ALTER SERVER conforms to ISO/IEC 9075-9 (SQL/MED). The OWNER TO and RENAME forms
are PostgreSQL extensions.
See Also
CREATE SERVER, DROP SERVER
ALTER STATISTICS
ALTER STATISTICS — change the definition of an extended statistics object
Synopsis
Description
ALTER STATISTICS changes the parameters of an existing extended statistics object. Any para-
meters not specifically set in the ALTER STATISTICS command retain their prior settings.
You must own the statistics object to use ALTER STATISTICS. To change a statistics object's
schema, you must also have CREATE privilege on the new schema. To alter the owner, you must also
be a direct or indirect member of the new owning role, and that role must have CREATE privilege on the
statistics object's schema. (These restrictions enforce that altering the owner doesn't do anything you
couldn't do by dropping and recreating the statistics object. However, a superuser can alter ownership
of any statistics object anyway.)
Parameters
name
new_owner
new_name
new_schema
Compatibility
There is no ALTER STATISTICS command in the SQL standard.
See Also
CREATE STATISTICS, DROP STATISTICS
ALTER SUBSCRIPTION
ALTER SUBSCRIPTION — change the definition of a subscription
Synopsis
Description
ALTER SUBSCRIPTION can change most of the subscription properties that can be specified in
CREATE SUBSCRIPTION.
You must own the subscription to use ALTER SUBSCRIPTION. To alter the owner, you must also be
a direct or indirect member of the new owning role. The new owner has to be a superuser. (Currently,
all subscription owners must be superusers, so the owner checks will be bypassed in practice. But this
might change in the future.)
Parameters
name
CONNECTION 'conninfo'
This clause alters the connection property originally set by CREATE SUBSCRIPTION. See there
for more information.
SET PUBLICATION publication_name [, ...]
Changes the list of subscribed publications. See CREATE SUBSCRIPTION for more information.
By default this command will also act like REFRESH PUBLICATION.
refresh (boolean)
When false, the command will not try to refresh table information. REFRESH
PUBLICATION should then be executed separately. The default is true.
REFRESH PUBLICATION
Fetch missing table information from publisher. This will start replication of tables that were
added to the subscribed-to publications since the last invocation of REFRESH PUBLICATION
or since CREATE SUBSCRIPTION.
refresh_option specifies additional options for the refresh operation. The supported options
are:
copy_data (boolean)
Specifies whether the existing data in the publications that are being subscribed to should
be copied once the replication starts. The default is true. (Previously subscribed tables are
not copied.)
ENABLE
Enables the previously disabled subscription, starting the logical replication worker at the end of
transaction.
DISABLE
Disables the running subscription, stopping the logical replication worker at the end of transaction.
SET ( subscription_parameter [= value] [, ... ] )
This clause alters parameters originally set by CREATE SUBSCRIPTION. See there for more
information. The allowed options are slot_name and synchronous_commit.
new_owner
new_name
Examples
Change the publication subscribed by a subscription to insert_only:
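A sketch, assuming a subscription named mysub (a placeholder name):

-- mysub is a placeholder subscription name
ALTER SUBSCRIPTION mysub SET PUBLICATION insert_only;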
Compatibility
ALTER SUBSCRIPTION is a PostgreSQL extension.
See Also
CREATE SUBSCRIPTION, DROP SUBSCRIPTION, CREATE PUBLICATION, ALTER PUBLICATION
ALTER SYSTEM
ALTER SYSTEM — change a server configuration parameter
Synopsis
Description
ALTER SYSTEM is used for changing server configuration parameters across the entire database
cluster. It can be more convenient than the traditional method of manually editing the
postgresql.conf file. ALTER SYSTEM writes the given parameter setting to the
postgresql.auto.conf file, which is read in addition to postgresql.conf. Setting a parameter to DEFAULT,
or using the RESET variant, removes that configuration entry from the postgresql.auto.conf
file. Use RESET ALL to remove all such configuration entries.
Values set with ALTER SYSTEM will be effective after the next server configuration reload, or after
the next server restart in the case of parameters that can only be changed at server start. A server
configuration reload can be commanded by calling the SQL function pg_reload_conf(), running
pg_ctl reload, or sending a SIGHUP signal to the main server process.
Only superusers can use ALTER SYSTEM. Also, since this command acts directly on the file system
and cannot be rolled back, it is not allowed inside a transaction block or function.
Parameters
configuration_parameter
Name of a settable configuration parameter. Available parameters are documented in Chapter 19.
value
New value of the parameter. Values can be specified as string constants, identifiers, numbers,
or comma-separated lists of these, as appropriate for the particular parameter. DEFAULT can be
written to specify removing the parameter and its value from postgresql.auto.conf.
Notes
This command can't be used to set data_directory, nor parameters that are not allowed in
postgresql.conf (e.g., preset options).
Examples
Set the wal_level:
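For example, setting it to replica (one of the allowed values):

ALTER SYSTEM SET wal_level = replica;
-- wal_level can only be changed at server start, so this takes effect after a restart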
Compatibility
The ALTER SYSTEM statement is a PostgreSQL extension.
See Also
SET, SHOW
ALTER TABLE
ALTER TABLE — change the definition of a table
Synopsis
ALTER TABLE [ IF EXISTS ] [ ONLY ] name [ * ]
action [, ... ]
ALTER TABLE [ IF EXISTS ] [ ONLY ] name [ * ]
RENAME [ COLUMN ] column_name TO new_column_name
ALTER TABLE [ IF EXISTS ] [ ONLY ] name [ * ]
RENAME CONSTRAINT constraint_name TO new_constraint_name
ALTER TABLE [ IF EXISTS ] name
RENAME TO new_name
ALTER TABLE [ IF EXISTS ] name
SET SCHEMA new_schema
ALTER TABLE ALL IN TABLESPACE name [ OWNED BY role_name [, ... ] ]
SET TABLESPACE new_tablespace [ NOWAIT ]
ALTER TABLE [ IF EXISTS ] name
ATTACH PARTITION partition_name { FOR
VALUES partition_bound_spec | DEFAULT }
ALTER TABLE [ IF EXISTS ] name
DETACH PARTITION partition_name
[ CONSTRAINT constraint_name ]
{ NOT NULL |
NULL |
CHECK ( expression ) [ NO INHERIT ] |
DEFAULT default_expr |
GENERATED { ALWAYS | BY DEFAULT } AS IDENTITY
[ ( sequence_options ) ] |
UNIQUE index_parameters |
PRIMARY KEY index_parameters |
REFERENCES reftable [ ( refcolumn ) ] [ MATCH FULL | MATCH
PARTIAL | MATCH SIMPLE ]
[ ON DELETE action ] [ ON UPDATE action ] }
[ DEFERRABLE | NOT DEFERRABLE ] [ INITIALLY DEFERRED | INITIALLY
IMMEDIATE ]
[ CONSTRAINT constraint_name ]
[ CONSTRAINT constraint_name ]
{ UNIQUE | PRIMARY KEY } USING INDEX index_name
[ DEFERRABLE | NOT DEFERRABLE ] [ INITIALLY DEFERRED |
INITIALLY IMMEDIATE ]
Description
ALTER TABLE changes the definition of an existing table. There are several subforms described
below. Note that the lock level required may differ for each subform. An ACCESS EXCLUSIVE lock
is acquired unless explicitly noted. When multiple subcommands are given, the lock acquired will be
the strictest one required by any subcommand.
This form adds a new column to the table, using the same syntax as CREATE TABLE. If IF
NOT EXISTS is specified and a column already exists with this name, no error is thrown.
This form drops a column from a table. Indexes and table constraints involving the column will be
automatically dropped as well. Multivariate statistics referencing the dropped column will also be
removed if the removal of the column would cause the statistics to contain data for only a single
column. You will need to say CASCADE if anything outside the table depends on the column, for
example, foreign key references or views. If IF EXISTS is specified and the column does not
exist, no error is thrown. In this case a notice is issued instead.
This form changes the type of a column of a table. Indexes and simple table constraints involving
the column will be automatically converted to use the new column type by reparsing the originally
supplied expression. The optional COLLATE clause specifies a collation for the new column; if
omitted, the collation is the default for the new column type. The optional USING clause specifies
how to compute the new column value from the old; if omitted, the default conversion is the same
as an assignment cast from old data type to new. A USING clause must be provided if there is no
implicit or assignment cast from old to new type.
SET/DROP DEFAULT
These forms set or remove the default value for a column. Default values only apply in subsequent
INSERT or UPDATE commands; they do not cause rows already in the table to change.
These forms change whether a column is marked to allow null values or to reject null values. You
can only use SET NOT NULL when the column contains no null values.
If this table is a partition, one cannot perform DROP NOT NULL on a column if it is marked
NOT NULL in the parent table. To drop the NOT NULL constraint from all the partitions, perform
DROP NOT NULL on the parent table. Even if there is no NOT NULL constraint on the parent,
such a constraint can still be added to individual partitions, if desired; that is, the children can
disallow nulls even if the parent allows them, but not the other way around.
These forms change whether a column is an identity column or change the generation attribute of
an existing identity column. See CREATE TABLE for details.
If DROP IDENTITY IF EXISTS is specified and the column is not an identity column, no
error is thrown. In this case a notice is issued instead.
SET sequence_option
RESTART
These forms alter the sequence that underlies an existing identity column. sequence_option
is an option supported by ALTER SEQUENCE such as INCREMENT BY.
SET STATISTICS
This form sets the per-column statistics-gathering target for subsequent ANALYZE operations.
The target can be set in the range 0 to 10000; alternatively, set it to -1 to revert to using the system
default statistics target (default_statistics_target). For more information on the use of statistics by
the PostgreSQL query planner, refer to Section 14.2.
This form sets or resets per-attribute options. Currently, the only defined per-attribute options are
n_distinct and n_distinct_inherited, which override the number-of-distinct-values
estimates made by subsequent ANALYZE operations. n_distinct affects the statistics for the
table itself, while n_distinct_inherited affects the statistics gathered for the table plus its
inheritance children. When set to a positive value, ANALYZE will assume that the column contains
exactly the specified number of distinct nonnull values. When set to a negative value, which must
be greater than or equal to -1, ANALYZE will assume that the number of distinct nonnull values
in the column is linear in the size of the table; the exact count is to be computed by multiplying
the estimated table size by the absolute value of the given number. For example, a value of -1
implies that all values in the column are distinct, while a value of -0.5 implies that each value
appears twice on the average. This can be useful when the size of the table changes over time,
since the multiplication by the number of rows in the table is not performed until query planning
time. Specify a value of 0 to revert to estimating the number of distinct values normally. For more
information on the use of statistics by the PostgreSQL query planner, refer to Section 14.2.
SET STORAGE
This form sets the storage mode for a column. This controls whether this column is held inline
or in a secondary TOAST table, and whether the data should be compressed or not. PLAIN must
be used for fixed-length values such as integer and is inline, uncompressed. MAIN is for in-
line, compressible data. EXTERNAL is for external, uncompressed data, and EXTENDED is for
external, compressed data. EXTENDED is the default for most data types that support non-PLAIN
storage. Use of EXTERNAL will make substring operations on very large text and bytea val-
ues run faster, at the penalty of increased storage space. Note that SET STORAGE doesn't itself
change anything in the table; it just sets the strategy to be pursued during future table updates.
See Section 69.2 for more information.
This form adds a new constraint to a table using the same constraint syntax as CREATE TABLE,
plus the option NOT VALID, which is currently only allowed for foreign key and CHECK con-
straints.
Normally, this form will cause a scan of the table to verify that all existing rows in the table
satisfy the new constraint. But if the NOT VALID option is used, this potentially-lengthy scan
is skipped. The constraint will still be enforced against subsequent inserts or updates (that is,
they'll fail unless there is a matching row in the referenced table, in the case of foreign keys,
or they'll fail unless the new row matches the specified check condition). But the database will
not assume that the constraint holds for all rows in the table, until it is validated by using the
VALIDATE CONSTRAINT option. See Notes below for more information about using the NOT
VALID option.
Additional restrictions apply when unique or primary key constraints are added to partitioned
tables; see CREATE TABLE. Also, foreign key constraints on partitioned tables may not be
declared NOT VALID at present.
ADD table_constraint_using_index
This form adds a new PRIMARY KEY or UNIQUE constraint to a table based on an existing
unique index. All the columns of the index will be included in the constraint.
The index cannot have expression columns nor be a partial index. Also, it must be a b-tree index
with default sort ordering. These restrictions ensure that the index is equivalent to one that would
be built by a regular ADD PRIMARY KEY or ADD UNIQUE command.
If PRIMARY KEY is specified, and the index's columns are not already marked NOT NULL,
then this command will attempt to do ALTER COLUMN SET NOT NULL against each such
column. That requires a full table scan to verify the column(s) contain no nulls. In all other cases,
this is a fast operation.
If a constraint name is provided then the index will be renamed to match the constraint name.
Otherwise the constraint will be named the same as the index.
After this command is executed, the index is “owned” by the constraint, in the same way as if the
index had been built by a regular ADD PRIMARY KEY or ADD UNIQUE command. In particular,
dropping the constraint will make the index disappear too.
Note
Adding a constraint using an existing index can be helpful in situations where a new
constraint needs to be added without blocking table updates for a long time. To do that,
create the index using CREATE INDEX CONCURRENTLY, and then install it as an
official constraint using this syntax. See the example below.
ALTER CONSTRAINT
This form alters the attributes of a constraint that was previously created. Currently only foreign
key constraints may be altered.
VALIDATE CONSTRAINT
This form validates a foreign key or check constraint that was previously created as NOT VALID,
by scanning the table to ensure there are no rows for which the constraint is not satisfied. Nothing
happens if the constraint is already marked valid. (See Notes below for an explanation of the
usefulness of this command.)
This form drops the specified constraint on a table, along with any index underlying the constraint.
If IF EXISTS is specified and the constraint does not exist, no error is thrown. In this case a
notice is issued instead.
These forms configure the firing of trigger(s) belonging to the table. A disabled trigger is still
known to the system, but is not executed when its triggering event occurs. For a deferred trigger,
the enable status is checked when the event occurs, not when the trigger function is actually ex-
ecuted. One can disable or enable a single trigger specified by name, or all triggers on the table,
or only user triggers (this option excludes internally generated constraint triggers such as those
that are used to implement foreign key constraints or deferrable uniqueness and exclusion con-
straints). Disabling or enabling internally generated constraint triggers requires superuser priv-
ileges; it should be done with caution since of course the integrity of the constraint cannot be
guaranteed if the triggers are not executed.
The trigger firing mechanism is also affected by the configuration variable
session_replication_role. Simply enabled triggers (the default) will fire when the replication role is “origin” (the
default) or “local”. Triggers configured as ENABLE REPLICA will only fire if the session is in
“replica” mode, and triggers configured as ENABLE ALWAYS will fire regardless of the current
replication role.
The effect of this mechanism is that in the default configuration, triggers do not fire on replicas.
This is useful because if a trigger is used on the origin to propagate data between tables, then
the replication system will also replicate the propagated data, and the trigger should not fire a
second time on the replica, because that would lead to duplication. However, if a trigger is used for
another purpose such as creating external alerts, then it might be appropriate to set it to ENABLE
ALWAYS so that it is also fired on replicas.
These forms configure the firing of rewrite rules belonging to the table. A disabled rule is still
known to the system, but is not applied during query rewriting. The semantics are as for dis-
abled/enabled triggers. This configuration is ignored for ON SELECT rules, which are always ap-
plied in order to keep views working even if the current session is in a non-default replication role.
The rule firing mechanism is also affected by the configuration variable session_replication_role,
analogous to triggers as described above.
These forms control the application of row security policies belonging to the table. If enabled and
no policies exist for the table, then a default-deny policy is applied. Note that policies can exist
for a table even if row level security is disabled; in this case, the policies are NOT applied
and are simply ignored. See also CREATE POLICY.
These forms control the application of row security policies belonging to the table when the user
is the table owner. If enabled, row level security policies will be applied when the user is the table
owner. If disabled (the default) then row level security will not be applied when the user is the
table owner. See also CREATE POLICY.
CLUSTER ON
This form selects the default index for future CLUSTER operations. It does not actually re-cluster
the table.
This form removes the most recently used CLUSTER index specification from the table. This
affects future cluster operations that don't specify an index.
This form adds an oid system column to the table (see Section 5.4). It does nothing if the table
already has OIDs.
Note that this is not equivalent to ADD COLUMN oid oid; that would add a normal column
that happened to be named oid, not a system column.
This form removes the oid system column from the table. This is exactly equivalent to DROP
COLUMN oid RESTRICT, except that it will not complain if there is already no oid column.
SET TABLESPACE
This form changes the table's tablespace to the specified tablespace and moves the data file(s)
associated with the table to the new tablespace. Indexes on the table, if any, are not moved; but
they can be moved separately with additional SET TABLESPACE commands. All tables in the
current database in a tablespace can be moved by using the ALL IN TABLESPACE form, which
will lock all tables to be moved first and then move each one. This form also supports OWNED
BY, which will only move tables owned by the roles specified. If the NOWAIT option is specified
then the command will fail if it is unable to acquire all of the locks required immediately. Note
that system catalogs are not moved by this command; use ALTER DATABASE or explicit ALTER
TABLE invocations instead if desired. The information_schema relations are not considered
part of the system catalogs and will be moved. See also CREATE TABLESPACE.
This form changes the table from unlogged to logged or vice-versa (see UNLOGGED). It cannot
be applied to a temporary table.
This form changes one or more storage parameters for the table. See Storage Parameters for details
on the available parameters. Note that the table contents will not be modified immediately by
this command; depending on the parameter you might need to rewrite the table to get the desired
effects. That can be done with VACUUM FULL, CLUSTER or one of the forms of ALTER
TABLE that forces a table rewrite. For planner related parameters, changes will take effect from
the next time the table is locked so currently executing queries will not be affected.
A SHARE UPDATE EXCLUSIVE lock will be taken for the fillfactor, toast, and autovacuum storage
parameters, as well as the planner parameter parallel_workers.
Note
While CREATE TABLE allows OIDS to be specified in the WITH (storage_para-
meter) syntax, ALTER TABLE does not treat OIDS as a storage parameter. Instead use
the SET WITH OIDS and SET WITHOUT OIDS forms to change OID status.
This form resets one or more storage parameters to their defaults. As with SET, a table rewrite
might be needed to update the table entirely.
INHERIT parent_table
This form adds the target table as a new child of the specified parent table. Subsequently, queries
against the parent will include records of the target table. To be added as a child, the target table
must already contain all the same columns as the parent (it could have additional columns, too).
The columns must have matching data types, and if they have NOT NULL constraints in the parent
then they must also have NOT NULL constraints in the child.
There must also be matching child-table constraints for all CHECK constraints of the par-
ent, except those marked non-inheritable (that is, created with ALTER TABLE ... ADD
CONSTRAINT ... NO INHERIT) in the parent, which are ignored; all child-table con-
straints matched must not be marked non-inheritable. Currently UNIQUE, PRIMARY KEY, and
FOREIGN KEY constraints are not considered, but this might change in the future.
NO INHERIT parent_table
This form removes the target table from the list of children of the specified parent table. Queries
against the parent table will no longer include records drawn from the target table.
OF type_name
This form links the table to a composite type as though CREATE TABLE OF had formed it.
The table's list of column names and types must precisely match that of the composite type; the
presence of an oid system column is permitted to differ. The table must not inherit from any
other table. These restrictions ensure that CREATE TABLE OF would permit an equivalent table
definition.
NOT OF
OWNER TO
This form changes the owner of the table, sequence, view, materialized view, or foreign table to
the specified user.
REPLICA IDENTITY
This form changes the information which is written to the write-ahead log to identify rows which
are updated or deleted. In most cases, the old value of each column is only logged if it differs
from the new value; however, if the old value is stored externally, it is always logged regardless
of whether it changed. This option has no effect except when logical replication is in use.
DEFAULT
Records the old values of the columns of the primary key, if any. This is the default for non-
system tables.
USING INDEX index_name
Records the old values of the columns covered by the named index, which must be unique,
not partial, not deferrable, and include only columns marked NOT NULL. If this index is
dropped, the behavior is the same as NOTHING.
FULL
Records the old values of all columns in the row.
NOTHING
Records no information about the old row. This is the default for system tables.
RENAME
The RENAME forms change the name of a table (or an index, sequence, view, materialized view,
or foreign table), the name of an individual column in a table, or the name of a constraint of the
table. When renaming a constraint that has an underlying index, the index is renamed as well.
There is no effect on the stored data.
SET SCHEMA
This form moves the table into another schema. Associated indexes, constraints, and sequences
owned by table columns are moved as well.
This form attaches an existing table (which might itself be partitioned) as a partition of the target
table. The table can be attached as a partition for specific values using FOR VALUES or as a
default partition by using DEFAULT. For each index in the target table, a corresponding one will
be created in the attached table; or, if an equivalent index already exists, it will be attached to the
target table's index, as if ALTER INDEX ATTACH PARTITION had been executed. Note that
if the existing table is a foreign table, it is currently not allowed to attach the table as a partition
of the target table if there are UNIQUE indexes on the target table. (See also CREATE FOREIGN
TABLE.) For each user-defined row-level trigger that exists in the target table, a corresponding
one is created in the attached table.
A partition using FOR VALUES uses the same syntax for partition_bound_spec as CREATE
TABLE. The partition bound specification must correspond to the partitioning strategy and par-
tition key of the target table. The table to be attached must have all the same columns as the target
table and no more; moreover, the column types must also match. Also, it must have all the NOT
NULL and CHECK constraints of the target table. Currently FOREIGN KEY constraints are not
considered. UNIQUE and PRIMARY KEY constraints from the parent table will be created in the
partition, if they don't already exist. If any of the CHECK constraints of the table being attached
are marked NO INHERIT, the command will fail; such constraints must be recreated without
the NO INHERIT clause.
If the new partition is a regular table, a full table scan is performed to check that existing rows in
the table do not violate the partition constraint. It is possible to avoid this scan by adding a valid
CHECK constraint to the table that allows only rows satisfying the desired partition constraint
before running this command. The CHECK constraint will be used to determine that the table need
not be scanned to validate the partition constraint. This does not work, however, if any of the
partition keys is an expression and the partition does not accept NULL values. If attaching a list
partition that will not accept NULL values, also add a NOT NULL constraint to the partition key
column, unless it's an expression.
If the new partition is a foreign table, nothing is done to verify that all the rows in the foreign
table obey the partition constraint. (See the discussion in CREATE FOREIGN TABLE about
constraints on the foreign table.)
When a table has a default partition, defining a new partition changes the partition constraint for
the default partition. The default partition can't contain any rows that would need to be moved
to the new partition, and will be scanned to verify that none are present. This scan, like the scan
of the new partition, can be avoided if an appropriate CHECK constraint is present. Also like the
scan of the new partition, it is always skipped when the default partition is a foreign table.
This form detaches the specified partition of the target table. The detached partition continues to
exist as a standalone table, but no longer has any ties to the table from which it was detached.
Any indexes that were attached to the target table's indexes are detached. Any triggers that were
created as clones of those in the target table are removed.
All the forms of ALTER TABLE that act on a single table, except RENAME, SET SCHEMA, ATTACH
PARTITION, and DETACH PARTITION can be combined into a list of multiple alterations to be
applied together. For example, it is possible to add several columns and/or alter the type of several
columns in a single command. This is particularly useful with large tables, since only one pass over
the table need be made.
You must own the table to use ALTER TABLE. To change the schema or tablespace of a table, you
must also have CREATE privilege on the new schema or tablespace. To add the table as a new child of
a parent table, you must own the parent table as well. Also, to attach a table as a new partition of the
table, you must own the table being attached. To alter the owner, you must also be a direct or indirect
member of the new owning role, and that role must have CREATE privilege on the table's schema.
(These restrictions enforce that altering the owner doesn't do anything you couldn't do by dropping and
recreating the table. However, a superuser can alter ownership of any table anyway.) To add a column
or alter a column type or use the OF clause, you must also have USAGE privilege on the data type.
Parameters
IF EXISTS
Do not throw an error if the table does not exist. A notice is issued in this case.
name
The name (optionally schema-qualified) of an existing table to alter. If ONLY is specified before
the table name, only that table is altered. If ONLY is not specified, the table and all its descendant
tables (if any) are altered. Optionally, * can be specified after the table name to explicitly indicate
that descendant tables are included.
column_name
new_column_name
new_name
data_type
Data type of the new column, or new data type for an existing column.
table_constraint
constraint_name
CASCADE
Automatically drop objects that depend on the dropped column or constraint (for example, views
referencing the column), and in turn all objects that depend on those objects (see Section 5.13).
RESTRICT
Refuse to drop the column or constraint if there are any dependent objects. This is the default
behavior.
trigger_name
ALL
Disable or enable all triggers belonging to the table. (This requires superuser privilege if any of
the triggers are internally generated constraint triggers such as those that are used to implement
foreign key constraints or deferrable uniqueness and exclusion constraints.)
USER
Disable or enable all triggers belonging to the table except for internally generated constraint
triggers such as those that are used to implement foreign key constraints or deferrable uniqueness
and exclusion constraints.
index_name
storage_parameter
value
The new value for a table storage parameter. This might be a number or a word depending on
the parameter.
parent_table
new_owner
new_tablespace
new_schema
partition_name
The name of the table to attach as a new partition or to detach from this table.
partition_bound_spec
The partition bound specification for a new partition. Refer to CREATE TABLE for more details
on its syntax.
Notes
The key word COLUMN is noise and can be omitted.
When a column is added with ADD COLUMN and a non-volatile DEFAULT is specified, the default
is evaluated at the time of the statement and the result stored in the table's metadata. That value will
be used for the column for all existing rows. If no DEFAULT is specified, NULL is used. In neither
case is a rewrite of the table required.
Adding a column with a volatile DEFAULT or changing the type of an existing column will require
the entire table and its indexes to be rewritten. As an exception, when changing the type of an existing
column, if the USING clause does not change the column contents and the old type is either binary
coercible to the new type or an unconstrained domain over the new type, a table rewrite is not needed;
but any indexes on the affected columns must still be rebuilt. Adding or removing a system oid
column also requires rewriting the entire table. Table and/or index rebuilds may take a significant
amount of time for a large table; and will temporarily require as much as double the disk space.
Adding a CHECK or NOT NULL constraint requires scanning the table to verify that existing rows
meet the constraint, but does not require a table rewrite.
Similarly, when attaching a new partition it may be scanned to verify that existing rows meet the
partition constraint.
The main reason for providing the option to specify multiple changes in a single ALTER TABLE is
that multiple table scans or rewrites can thereby be combined into a single pass over the table.
Scanning a large table to verify a new foreign key or check constraint can take a long time, and other
updates to the table are locked out until the ALTER TABLE ADD CONSTRAINT command is
committed. The main purpose of the NOT VALID constraint option is to reduce the impact of adding
a constraint on concurrent updates. With NOT VALID, the ADD CONSTRAINT command does not
scan the table and can be committed immediately. After that, a VALIDATE CONSTRAINT command
can be issued to verify that existing rows satisfy the constraint. The validation step does not need to
lock out concurrent updates, since it knows that other transactions will be enforcing the constraint for
rows that they insert or update; only pre-existing rows need to be checked. Hence, validation acquires
only a SHARE UPDATE EXCLUSIVE lock on the table being altered. (If the constraint is a foreign
key then a ROW SHARE lock is also required on the table referenced by the constraint.) In addition
to improving concurrency, it can be useful to use NOT VALID and VALIDATE CONSTRAINT in
cases where the table is known to contain pre-existing violations. Once the constraint is in place, no
new violations can be inserted, and the existing problems can be corrected at leisure until VALIDATE
CONSTRAINT finally succeeds.
The DROP COLUMN form does not physically remove the column, but simply makes it invisible to
SQL operations. Subsequent insert and update operations in the table will store a null value for the
column. Thus, dropping a column is quick but it will not immediately reduce the on-disk size of your
table, as the space occupied by the dropped column is not reclaimed. The space will be reclaimed over
time as existing rows are updated. (These statements do not apply when dropping the system oid
column; that is done with an immediate rewrite.)
To force immediate reclamation of space occupied by a dropped column, you can execute one of the
forms of ALTER TABLE that performs a rewrite of the whole table. This results in reconstructing
each row with the dropped column replaced by a null value.
The rewriting forms of ALTER TABLE are not MVCC-safe. After a table rewrite, the table will appear
empty to concurrent transactions, if they are using a snapshot taken before the rewrite occurred. See
Section 13.5 for more details.
The USING option of SET DATA TYPE can actually specify any expression involving the old values
of the row; that is, it can refer to other columns as well as the one being converted. This allows very
general conversions to be done with the SET DATA TYPE syntax. Because of this flexibility, the
USING expression is not applied to the column's default value (if any); the result might not be a
constant expression as required for a default. This means that when there is no implicit or assignment
cast from old to new type, SET DATA TYPE might fail to convert the default even though a USING
clause is supplied. In such cases, drop the default with DROP DEFAULT, perform the ALTER TYPE,
and then use SET DEFAULT to add a suitable new default. Similar considerations apply to indexes
and constraints involving the column.
If a table has any descendant tables, it is not permitted to add, rename, or change the type of a column
in the parent table without doing the same to the descendants. This ensures that the descendants always
have columns matching the parent. Similarly, a CHECK constraint cannot be renamed in the parent
without also renaming it in all descendants, so that CHECK constraints also match between the parent
and its descendants. (That restriction does not apply to index-based constraints, however.) Also, be-
cause selecting from the parent also selects from its descendants, a constraint on the parent cannot
be marked valid unless it is also marked valid for those descendants. In all of these cases, ALTER
TABLE ONLY will be rejected.
A recursive DROP COLUMN operation will remove a descendant table's column only if the descendant
does not inherit that column from any other parents and never had an independent definition of the
column. A nonrecursive DROP COLUMN (i.e., ALTER TABLE ONLY ... DROP COLUMN) never
removes any descendant columns, but instead marks them as independently defined rather than inher-
ited. A nonrecursive DROP COLUMN command will fail for a partitioned table, because all partitions
of a table must have the same columns as the partitioning root.
The actions for identity columns (ADD GENERATED, SET etc., DROP IDENTITY), as well as the
actions TRIGGER, CLUSTER, OWNER, and TABLESPACE never recurse to descendant tables; that
is, they always act as though ONLY were specified. Adding a constraint recurses only for CHECK
constraints that are not marked NO INHERIT.
Refer to CREATE TABLE for a further description of valid parameters. Chapter 5 has further infor-
mation on inheritance.
Examples
To add a column of type varchar to a table:
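A sketch, with placeholder table and column names:

-- distributors and address are placeholder names
ALTER TABLE distributors ADD COLUMN address varchar(30);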
To change an integer column containing Unix timestamps to timestamp with time zone via
a USING clause:
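A sketch, assuming a table foo with an integer column foo_timestamp holding Unix epoch seconds
(placeholder names):

ALTER TABLE foo
    ALTER COLUMN foo_timestamp SET DATA TYPE timestamp with time zone
    USING
        timestamp with time zone 'epoch' + foo_timestamp * interval '1 second';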
The same, when the column has a default expression that won't automatically cast to the new data type:
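Continuing the same placeholder example, dropping and re-adding the default in the same command:

ALTER TABLE foo
    ALTER COLUMN foo_timestamp DROP DEFAULT,
    ALTER COLUMN foo_timestamp TYPE timestamp with time zone
        USING
            timestamp with time zone 'epoch' + foo_timestamp * interval '1 second',
    ALTER COLUMN foo_timestamp SET DEFAULT now();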
To add a foreign key constraint to a table with the least impact on other work:
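A sketch, with placeholder table, column, and constraint names, first adding the constraint as
NOT VALID and then validating it separately:

ALTER TABLE distributors
    ADD CONSTRAINT distfk FOREIGN KEY (address) REFERENCES addresses (address) NOT VALID;
ALTER TABLE distributors VALIDATE CONSTRAINT distfk;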
To add an automatically named primary key constraint to a table, noting that a table can only ever
have one primary key:
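A sketch, with placeholder names:

ALTER TABLE distributors ADD PRIMARY KEY (dist_id);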
To recreate a primary key constraint, without blocking updates while the index is rebuilt:
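A sketch, with placeholder names, building the replacement index concurrently before swapping it in:

CREATE UNIQUE INDEX CONCURRENTLY dist_id_temp_idx ON distributors (dist_id);
ALTER TABLE distributors
    DROP CONSTRAINT distributors_pkey,
    ADD CONSTRAINT distributors_pkey PRIMARY KEY USING INDEX dist_id_temp_idx;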
Compatibility
The forms ADD (without USING INDEX), DROP [COLUMN], DROP IDENTITY, RESTART, SET
DEFAULT, SET DATA TYPE (without USING), SET GENERATED, and SET sequence_option
conform with the SQL standard. The other forms are PostgreSQL extensions of the SQL standard.
Also, the ability to specify more than one manipulation in a single ALTER TABLE command is an
extension.
ALTER TABLE DROP COLUMN can be used to drop the only column of a table, leaving a zero-column
table. This is an extension of SQL, which disallows zero-column tables.
See Also
CREATE TABLE
ALTER TABLESPACE
ALTER TABLESPACE — change the definition of a tablespace
Synopsis
Description
ALTER TABLESPACE can be used to change the definition of a tablespace.
You must own the tablespace to change the definition of a tablespace. To alter the owner, you must
also be a direct or indirect member of the new owning role. (Note that superusers have these privileges
automatically.)
Parameters
name
new_name
The new name of the tablespace. The new name cannot begin with pg_, as such names are re-
served for system tablespaces.
new_owner
tablespace_option
A tablespace parameter to be set or reset. Currently, the only available parameters are se-
q_page_cost, random_page_cost and effective_io_concurrency. Setting either
value for a particular tablespace will override the planner's usual estimate of the cost of reading
pages from tables in that tablespace, as established by the configuration parameters of the same
name (see seq_page_cost, random_page_cost, effective_io_concurrency). This may be useful if
one tablespace is located on a disk which is faster or slower than the remainder of the I/O sub-
system.
Examples
Rename tablespace index_space to fast_raid:
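For example:

ALTER TABLESPACE index_space RENAME TO fast_raid;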
Compatibility
There is no ALTER TABLESPACE statement in the SQL standard.
See Also
CREATE TABLESPACE, DROP TABLESPACE
ALTER TEXT SEARCH CONFIGURATION
ALTER TEXT SEARCH CONFIGURATION — change the definition of a text search configuration
Synopsis
Description
ALTER TEXT SEARCH CONFIGURATION changes the definition of a text search configuration.
You can modify its mappings from token types to dictionaries, or change the configuration's name
or owner.
You must be the owner of the configuration to use ALTER TEXT SEARCH CONFIGURATION.
Parameters
name
token_type
dictionary_name
The name of a text search dictionary to be consulted for the specified token type(s). If multiple
dictionaries are listed, they are consulted in the specified order.
old_dictionary
new_dictionary
new_name
new_owner
new_schema
The ADD MAPPING FOR form installs a list of dictionaries to be consulted for the specified token
type(s); it is an error if there is already a mapping for any of the token types. The ALTER MAPPING
FOR form does the same, but first removing any existing mapping for those token types. The ALTER
MAPPING REPLACE forms substitute new_dictionary for old_dictionary anywhere the
latter appears. This is done for only the specified token types when FOR appears, or for all mappings of
the configuration when it doesn't. The DROP MAPPING form removes all dictionaries for the specified
token type(s), causing tokens of those types to be ignored by the text search configuration. It is an
error if there is no mapping for the token types, unless IF EXISTS appears.
Examples
The following example replaces the english dictionary with the swedish dictionary anywhere
that english is used within my_config.
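A command matching that description:
ALTER TEXT SEARCH CONFIGURATION my_config
    ALTER MAPPING REPLACE english WITH swedish;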
Compatibility
There is no ALTER TEXT SEARCH CONFIGURATION statement in the SQL standard.
See Also
CREATE TEXT SEARCH CONFIGURATION, DROP TEXT SEARCH CONFIGURATION
ALTER TEXT SEARCH DICTIONARY
ALTER TEXT SEARCH DICTIONARY — change the definition of a text search dictionary
Synopsis
Description
ALTER TEXT SEARCH DICTIONARY changes the definition of a text search dictionary. You can
change the dictionary's template-specific options, or change the dictionary's name or owner.
You must be the owner of the dictionary to use ALTER TEXT SEARCH DICTIONARY.
Parameters
name
option
value
The new value to use for a template-specific option. If the equal sign and value are omitted, then
any previous setting for the option is removed from the dictionary, allowing the default to be used.
new_name
new_owner
new_schema
Examples
The following example command changes the stopword list for a Snowball-based dictionary. Other
parameters remain unchanged.
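For illustration, assuming a dictionary named my_dict and a stop-word file named newrussian:
ALTER TEXT SEARCH DICTIONARY my_dict ( StopWords = newrussian );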
The following example command changes the language option to dutch, and removes the stopword
option entirely.
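Continuing with the hypothetical my_dict dictionary:
ALTER TEXT SEARCH DICTIONARY my_dict ( language = dutch, StopWords );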
The following example command “updates” the dictionary's definition without actually changing any-
thing.
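For example, specifying a nonexistent option and then removing it:
ALTER TEXT SEARCH DICTIONARY my_dict ( dummy );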
(The reason this works is that the option removal code doesn't complain if there is no such option.)
This trick is useful when changing configuration files for the dictionary: the ALTER will force existing
database sessions to re-read the configuration files, which otherwise they would never do if they had
read them earlier.
Compatibility
There is no ALTER TEXT SEARCH DICTIONARY statement in the SQL standard.
See Also
CREATE TEXT SEARCH DICTIONARY, DROP TEXT SEARCH DICTIONARY
ALTER TEXT SEARCH PARSER
ALTER TEXT SEARCH PARSER — change the definition of a text search parser
Synopsis
Description
ALTER TEXT SEARCH PARSER changes the definition of a text search parser. Currently, the only
supported functionality is to change the parser's name.
Parameters
name
new_name
new_schema
Compatibility
There is no ALTER TEXT SEARCH PARSER statement in the SQL standard.
See Also
CREATE TEXT SEARCH PARSER, DROP TEXT SEARCH PARSER
ALTER TEXT SEARCH TEMPLATE
ALTER TEXT SEARCH TEMPLATE — change the definition of a text search template
Synopsis
Description
ALTER TEXT SEARCH TEMPLATE changes the definition of a text search template. Currently, the
only supported functionality is to change the template's name.
Parameters
name
new_name
new_schema
Compatibility
There is no ALTER TEXT SEARCH TEMPLATE statement in the SQL standard.
See Also
CREATE TEXT SEARCH TEMPLATE, DROP TEXT SEARCH TEMPLATE
ALTER TRIGGER
ALTER TRIGGER — change the definition of a trigger
Synopsis
Description
ALTER TRIGGER changes properties of an existing trigger. The RENAME clause changes the name of
the given trigger without otherwise changing the trigger definition. The DEPENDS ON EXTENSION
clause marks the trigger as dependent on an extension, such that if the extension is dropped, the trigger
will automatically be dropped as well.
You must own the table on which the trigger acts to be allowed to change its properties.
Parameters
name
table_name
new_name
extension_name
Notes
The ability to temporarily enable or disable a trigger is provided by ALTER TABLE, not by ALTER
TRIGGER, because ALTER TRIGGER has no convenient way to express the option of enabling or
disabling all of a table's triggers at once.
Examples
To rename an existing trigger:
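For example, with illustrative trigger and table names:
ALTER TRIGGER emp_stamp ON emp RENAME TO emp_track_chgs;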
Compatibility
ALTER TRIGGER is a PostgreSQL extension of the SQL standard.
See Also
ALTER TABLE
ALTER TYPE
ALTER TYPE — change the definition of a type
Synopsis
Description
ALTER TYPE changes the definition of an existing type. There are several subforms:
ADD ATTRIBUTE
This form adds a new attribute to a composite type, using the same syntax as CREATE TYPE.
DROP ATTRIBUTE [ IF EXISTS ]
This form drops an attribute from a composite type. If IF EXISTS is specified and the attribute
does not exist, no error is thrown. In this case a notice is issued instead.
OWNER
RENAME
This form changes the name of the type or the name of an individual attribute of a composite type.
SET SCHEMA
This form moves the type into another schema.
ADD VALUE [ IF NOT EXISTS ] [ BEFORE | AFTER ]
This form adds a new value to an enum type. The new value's place in the enum's ordering can
be specified as being BEFORE or AFTER one of the existing values. Otherwise, the new item is
added at the end of the list of values.
If IF NOT EXISTS is specified, it is not an error if the type already contains the new value:
a notice is issued but no other action is taken. Otherwise, an error will occur if the new value is
already present.
RENAME VALUE
This form renames a value of an enum type. The value's place in the enum's ordering is not
affected. An error will occur if the specified value is not present or the new name is already present.
The ADD ATTRIBUTE, DROP ATTRIBUTE, and ALTER ATTRIBUTE actions can be combined
into a list of multiple alterations to apply in parallel. For example, it is possible to add several attributes
and/or alter the type of several attributes in a single command.
You must own the type to use ALTER TYPE. To change the schema of a type, you must also have
CREATE privilege on the new schema. To alter the owner, you must also be a direct or indirect mem-
ber of the new owning role, and that role must have CREATE privilege on the type's schema. (These
restrictions enforce that altering the owner doesn't do anything you couldn't do by dropping and recre-
ating the type. However, a superuser can alter ownership of any type anyway.) To add an attribute or
alter an attribute type, you must also have USAGE privilege on the data type.
Parameters
name
new_name
new_owner
new_schema
attribute_name
new_attribute_name
data_type
The data type of the attribute to add, or the new type of the attribute to alter.
new_enum_value
The new value to be added to an enum type's list of values, or the new name to be given to an
existing value. Like all enum literals, it needs to be quoted.
neighbor_enum_value
The existing enum value that the new value should be added immediately before or after in the
enum type's sort ordering. Like all enum literals, it needs to be quoted.
existing_enum_value
The existing enum value that should be renamed. Like all enum literals, it needs to be quoted.
CASCADE
Automatically propagate the operation to typed tables of the type being altered, and their descen-
dants.
RESTRICT
Refuse the operation if the type being altered is the type of a typed table. This is the default.
Notes
ALTER TYPE ... ADD VALUE (the form that adds a new value to an enum type) cannot be
executed inside a transaction block.
Comparisons involving an added enum value will sometimes be slower than comparisons involving
only original members of the enum type. This will usually only occur if BEFORE or AFTER is used
to set the new value's sort position somewhere other than at the end of the list. However, sometimes it
will happen even though the new value is added at the end (this occurs if the OID counter “wrapped
around” since the original creation of the enum type). The slowdown is usually insignificant; but if
it matters, optimal performance can be regained by dropping and recreating the enum type, or by
dumping and restoring the database.
Examples
To rename a data type:
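For example, with illustrative names:
ALTER TYPE electronic_mail RENAME TO email;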
Compatibility
The variants to add and drop attributes are part of the SQL standard; the other variants are PostgreSQL
extensions.
See Also
CREATE TYPE, DROP TYPE
ALTER USER
ALTER USER — change a database role
Synopsis
SUPERUSER | NOSUPERUSER
| CREATEDB | NOCREATEDB
| CREATEROLE | NOCREATEROLE
| INHERIT | NOINHERIT
| LOGIN | NOLOGIN
| REPLICATION | NOREPLICATION
| BYPASSRLS | NOBYPASSRLS
| CONNECTION LIMIT connlimit
| [ ENCRYPTED ] PASSWORD 'password' | PASSWORD NULL
| VALID UNTIL 'timestamp'
role_name
| CURRENT_USER
| SESSION_USER
Description
ALTER USER is now an alias for ALTER ROLE.
Compatibility
The ALTER USER statement is a PostgreSQL extension. The SQL standard leaves the definition of
users to the implementation.
See Also
ALTER ROLE
ALTER USER MAPPING
ALTER USER MAPPING — change the definition of a user mapping
Synopsis
Description
ALTER USER MAPPING changes the definition of a user mapping.
The owner of a foreign server can alter user mappings for that server for any user. Also, a user can alter
a user mapping for their own user name if USAGE privilege on the server has been granted to the user.
Parameters
user_name
User name of the mapping. CURRENT_USER and USER match the name of the current user.
PUBLIC is used to match all present and future user names in the system.
server_name
OPTIONS ( [ ADD | SET | DROP ] option ['value'] [, ... ] )
Change options for the user mapping. The new options override any previously specified options.
ADD, SET, and DROP specify the action to be performed. ADD is assumed if no operation is ex-
plicitly specified. Option names must be unique; options are also validated by the server's for-
eign-data wrapper.
Examples
Change the password for user mapping bob, server foo:
ALTER USER MAPPING FOR bob SERVER foo OPTIONS (SET password 'public');
Compatibility
ALTER USER MAPPING conforms to ISO/IEC 9075-9 (SQL/MED). There is a subtle syntax issue:
The standard omits the FOR key word. Since both CREATE USER MAPPING and DROP USER
MAPPING use FOR in analogous positions, and IBM DB2 (being the other major SQL/MED imple-
mentation) also requires it for ALTER USER MAPPING, PostgreSQL diverges from the standard
here in the interest of consistency and interoperability.
See Also
CREATE USER MAPPING, DROP USER MAPPING
ALTER VIEW
ALTER VIEW — change the definition of a view
Synopsis
Description
ALTER VIEW changes various auxiliary properties of a view. (If you want to modify the view's
defining query, use CREATE OR REPLACE VIEW.)
You must own the view to use ALTER VIEW. To change a view's schema, you must also have CREATE
privilege on the new schema. To alter the owner, you must also be a direct or indirect member of the
new owning role, and that role must have CREATE privilege on the view's schema. (These restrictions
enforce that altering the owner doesn't do anything you couldn't do by dropping and recreating the
view. However, a superuser can alter ownership of any view anyway.)
Parameters
name
IF EXISTS
Do not throw an error if the view does not exist. A notice is issued in this case.
SET/DROP DEFAULT
These forms set or remove the default value for a column. A view column's default value is
substituted into any INSERT or UPDATE command whose target is the view, before applying any
rules or triggers for the view. The view's default will therefore take precedence over any default
values from underlying relations.
new_owner
new_name
new_schema
check_option (string)
Changes the check option of the view. The value must be local or cascaded.
security_barrier (boolean)
Changes the security-barrier property of the view. The value must be a Boolean value, such
as true or false.
Notes
For historical reasons, ALTER TABLE can be used with views too; but the only variants of ALTER
TABLE that are allowed with views are equivalent to the ones shown above.
Examples
To rename the view foo to bar:
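A command matching that description:
ALTER VIEW foo RENAME TO bar;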
Compatibility
ALTER VIEW is a PostgreSQL extension of the SQL standard.
See Also
CREATE VIEW, DROP VIEW
ANALYZE
ANALYZE — collect statistics about a database
Synopsis
VERBOSE
Description
ANALYZE collects statistics about the contents of tables in the database, and stores the results in
the pg_statistic system catalog. Subsequently, the query planner uses these statistics to help
determine the most efficient execution plans for queries.
Without a table_and_columns list, ANALYZE processes every table and materialized view in the
current database that the current user has permission to analyze. With a list, ANALYZE processes only
those table(s). It is further possible to give a list of column names for a table, in which case only the
statistics for those columns are collected.
When the option list is surrounded by parentheses, the options can be written in any order. The paren-
thesized syntax was added in PostgreSQL 11; the unparenthesized syntax is deprecated.
Parameters
VERBOSE
table_name
The name (possibly schema-qualified) of a specific table to analyze. If omitted, all regular tables,
partitioned tables, and materialized views in the current database are analyzed (but not foreign
tables). If the specified table is a partitioned table, both the inheritance statistics of the partitioned
table as a whole and statistics of the individual partitions are updated.
column_name
Outputs
When VERBOSE is specified, ANALYZE emits progress messages to indicate which table is currently
being processed. Various statistics about the tables are printed as well.
Notes
To analyze a table, one must ordinarily be the table's owner or a superuser. However, database owners
are allowed to analyze all tables in their databases, except shared catalogs. (The restriction for shared
catalogs means that a true database-wide ANALYZE can only be performed by a superuser.) ANALYZE
will skip over any tables that the calling user does not have permission to analyze.
Foreign tables are analyzed only when explicitly selected. Not all foreign data wrappers support AN-
ALYZE. If the table's wrapper does not support ANALYZE, the command prints a warning and does
nothing.
In the default PostgreSQL configuration, the autovacuum daemon (see Section 24.1.6) takes care of
automatic analyzing of tables when they are first loaded with data, and as they change throughout
regular operation. When autovacuum is disabled, it is a good idea to run ANALYZE periodically, or
just after making major changes in the contents of a table. Accurate statistics will help the planner
to choose the most appropriate query plan, and thereby improve the speed of query processing. A
common strategy for read-mostly databases is to run VACUUM and ANALYZE once a day during a
low-usage time of day. (This will not be sufficient if there is heavy update activity.)
ANALYZE requires only a read lock on the target table, so it can run in parallel with other activity
on the table.
The statistics collected by ANALYZE usually include a list of some of the most common values in each
column and a histogram showing the approximate data distribution in each column. One or both of
these can be omitted if ANALYZE deems them uninteresting (for example, in a unique-key column,
there are no common values) or if the column data type does not support the appropriate operators.
There is more information about the statistics in Chapter 24.
For large tables, ANALYZE takes a random sample of the table contents, rather than examining every
row. This allows even very large tables to be analyzed in a small amount of time. Note, however,
that the statistics are only approximate, and will change slightly each time ANALYZE is run, even if
the actual table contents did not change. This might result in small changes in the planner's estimated
costs shown by EXPLAIN. In rare situations, this non-determinism will cause the planner's choices
of query plans to change after ANALYZE is run. To avoid this, raise the amount of statistics collected
by ANALYZE, as described below.
The extent of analysis can be controlled by adjusting the default_statistics_target configuration vari-
able, or on a column-by-column basis by setting the per-column statistics target with ALTER TA-
BLE ... ALTER COLUMN ... SET STATISTICS (see ALTER TABLE). The target value
sets the maximum number of entries in the most-common-value list and the maximum number of bins
in the histogram. The default target value is 100, but this can be adjusted up or down to trade off ac-
curacy of planner estimates against the time taken for ANALYZE and the amount of space occupied in
pg_statistic. In particular, setting the statistics target to zero disables collection of statistics for
that column. It might be useful to do that for columns that are never used as part of the WHERE, GROUP
BY, or ORDER BY clauses of queries, since the planner will have no use for statistics on such columns.
The largest statistics target among the columns being analyzed determines the number of table rows
sampled to prepare the statistics. Increasing the target causes a proportional increase in the time and
space needed to do ANALYZE.
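For instance, a per-column target could be raised like this (table and column names are illustrative):
ALTER TABLE tenk1 ALTER COLUMN unique1 SET STATISTICS 1000;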
One of the values estimated by ANALYZE is the number of distinct values that appear in each column.
Because only a subset of the rows are examined, this estimate can sometimes be quite inaccurate,
even with the largest possible statistics target. If this inaccuracy leads to bad query plans, a more
accurate value can be determined manually and then installed with ALTER TABLE ... ALTER
COLUMN ... SET (n_distinct = ...) (see ALTER TABLE).
If the table being analyzed has one or more children, ANALYZE will gather statistics twice: once on the
rows of the parent table only, and a second time on the rows of the parent table with all of its children.
This second set of statistics is needed when planning queries that traverse the entire inheritance tree.
The autovacuum daemon, however, will only consider inserts or updates on the parent table itself
when deciding whether to trigger an automatic analyze for that table. If that table is rarely inserted
into or updated, the inheritance statistics will not be up to date unless you run ANALYZE manually.
For partitioned tables, ANALYZE gathers statistics by sampling rows from all partitions; in addition,
it will recurse into each partition and update its statistics. Each leaf partition is analyzed only once,
even with multi-level partitioning. No statistics are collected for only the parent table (without data
from its partitions), because with partitioning it's guaranteed to be empty.
By contrast, if the table being analyzed has inheritance children, ANALYZE gathers two sets of statis-
tics: one on the rows of the parent table only, and a second including rows of both the parent table
and all of its children. This second set of statistics is needed when planning queries that process the
inheritance tree as a whole. The child tables themselves are not individually analyzed in this case.
The autovacuum daemon does not process partitioned tables, nor does it process inheritance parents
if only the children are ever modified. It is usually necessary to periodically run a manual ANALYZE
to keep the statistics of the table hierarchy up to date.
If any child tables or partitions are foreign tables whose foreign data wrappers do not support ANA-
LYZE, those tables are ignored while gathering inheritance statistics.
If the table being analyzed is completely empty, ANALYZE will not record new statistics for that table.
Any existing statistics will be retained.
Compatibility
There is no ANALYZE statement in the SQL standard.
See Also
VACUUM, vacuumdb, Section 19.4.4, Section 24.1.6
BEGIN
BEGIN — start a transaction block
Synopsis
Description
BEGIN initiates a transaction block, that is, all statements after a BEGIN command will be executed in
a single transaction until an explicit COMMIT or ROLLBACK is given. By default (without BEGIN),
PostgreSQL executes transactions in “autocommit” mode, that is, each statement is executed in its
own transaction and a commit is implicitly performed at the end of the statement (if execution was
successful, otherwise a rollback is done).
Statements are executed more quickly in a transaction block, because transaction start/commit requires
significant CPU and disk activity. Execution of multiple statements inside a transaction is also useful
to ensure consistency when making several related changes: other sessions will be unable to see the
intermediate states wherein not all the related updates have been done.
If the isolation level, read/write mode, or deferrable mode is specified, the new transaction has those
characteristics, as if SET TRANSACTION was executed.
Parameters
WORK
TRANSACTION
Refer to SET TRANSACTION for information on the meaning of the other parameters to this state-
ment.
Notes
START TRANSACTION has the same functionality as BEGIN.
Issuing BEGIN when already inside a transaction block will provoke a warning message. The state
of the transaction is not affected. To nest transactions within a transaction block, use savepoints (see
SAVEPOINT).
Examples
To begin a transaction block:
BEGIN;
Compatibility
BEGIN is a PostgreSQL language extension. It is equivalent to the SQL-standard command START
TRANSACTION, whose reference page contains additional compatibility information.
Incidentally, the BEGIN key word is used for a different purpose in embedded SQL. You are advised
to be careful about the transaction semantics when porting database applications.
See Also
COMMIT, ROLLBACK, START TRANSACTION, SAVEPOINT
CALL
CALL — invoke a procedure
Synopsis
Description
CALL executes a procedure.
If the procedure has any output parameters, then a result row will be returned, containing the values
of those parameters.
Parameters
name
argument
An input argument for the procedure call. See Section 4.3 for the full details on function and
procedure call syntax, including use of named parameters.
Notes
The user must have EXECUTE privilege on the procedure in order to be allowed to invoke it.
If CALL is executed in a transaction block, then the called procedure cannot execute transaction control
statements. Transaction control statements are only allowed if CALL is executed in its own transaction.
PL/pgSQL handles output parameters in CALL commands differently; see Section 43.6.3.
Examples
CALL do_db_maintenance();
Compatibility
CALL conforms to the SQL standard.
See Also
CREATE PROCEDURE
CHECKPOINT
CHECKPOINT — force a write-ahead log checkpoint
Synopsis
CHECKPOINT
Description
A checkpoint is a point in the write-ahead log sequence at which all data files have been updated to
reflect the information in the log. All data files will be flushed to disk. Refer to Section 30.4 for more
details about what happens during a checkpoint.
The CHECKPOINT command forces an immediate checkpoint when the command is issued, without
waiting for a regular checkpoint scheduled by the system (controlled by the settings in Section 19.5.2).
CHECKPOINT is not intended for use during normal operation.
If executed during recovery, the CHECKPOINT command will force a restartpoint (see Section 30.4)
rather than writing a new checkpoint.
Compatibility
The CHECKPOINT command is a PostgreSQL language extension.
CLOSE
CLOSE — close a cursor
Synopsis
Description
CLOSE frees the resources associated with an open cursor. After the cursor is closed, no subsequent
operations are allowed on it. A cursor should be closed when it is no longer needed.
Every non-holdable open cursor is implicitly closed when a transaction is terminated by COMMIT or
ROLLBACK. A holdable cursor is implicitly closed if the transaction that created it aborts via ROLL-
BACK. If the creating transaction successfully commits, the holdable cursor remains open until an
explicit CLOSE is executed, or the client disconnects.
Parameters
name
ALL
Notes
PostgreSQL does not have an explicit OPEN cursor statement; a cursor is considered open when it is
declared. Use the DECLARE statement to declare a cursor.
You can see all available cursors by querying the pg_cursors system view.
If a cursor is closed after a savepoint which is later rolled back, the CLOSE is not rolled back; that
is, the cursor remains closed.
Examples
Close the cursor liahona:
CLOSE liahona;
Compatibility
CLOSE is fully conforming with the SQL standard. CLOSE ALL is a PostgreSQL extension.
See Also
DECLARE, FETCH, MOVE
CLUSTER
CLUSTER — cluster a table according to an index
Synopsis
Description
CLUSTER instructs PostgreSQL to cluster the table specified by table_name based on the index
specified by index_name. The index must already have been defined on table_name.
When a table is clustered, it is physically reordered based on the index information. Clustering is a
one-time operation: when the table is subsequently updated, the changes are not clustered. That is, no
attempt is made to store new or updated rows according to their index order. (If one wishes, one can
periodically recluster by issuing the command again. Also, setting the table's fillfactor storage
parameter to less than 100% can aid in preserving cluster ordering during updates, since updated rows
are kept on the same page if enough space is available there.)
When a table is clustered, PostgreSQL remembers which index it was clustered by. The form
CLUSTER table_name reclusters the table using the same index as before. You can also use the
CLUSTER or SET WITHOUT CLUSTER forms of ALTER TABLE to set the index to be used for
future cluster operations, or to clear any previous setting.
CLUSTER without any parameter reclusters all the previously-clustered tables in the current database
that the calling user owns, or all such tables if called by a superuser. This form of CLUSTER cannot
be executed inside a transaction block.
When a table is being clustered, an ACCESS EXCLUSIVE lock is acquired on it. This prevents any
other database operations (both reads and writes) from operating on the table until the CLUSTER is
finished.
Parameters
table_name
index_name
VERBOSE
Notes
In cases where you are accessing single rows randomly within a table, the actual order of the data
in the table is unimportant. However, if you tend to access some data more than others, and there is
an index that groups them together, you will benefit from using CLUSTER. If you are requesting a
range of indexed values from a table, or a single indexed value that has multiple rows that match,
CLUSTER will help because once the index identifies the table page for the first row that matches,
all other rows that match are probably already on the same table page, and so you save disk accesses
and speed up the query.
CLUSTER can re-sort the table using either an index scan on the specified index, or (if the index is a
b-tree) a sequential scan followed by sorting. It will attempt to choose the method that will be faster,
based on planner cost parameters and available statistical information.
When an index scan is used, a temporary copy of the table is created that contains the table data in
the index order. Temporary copies of each index on the table are created as well. Therefore, you need
free space on disk at least equal to the sum of the table size and the index sizes.
When a sequential scan and sort is used, a temporary sort file is also created, so that the peak temporary
space requirement is as much as double the table size, plus the index sizes. This method is often faster
than the index scan method, but if the disk space requirement is intolerable, you can disable this choice
by temporarily setting enable_sort to off.
It is advisable to set maintenance_work_mem to a reasonably large value (but not more than the amount
of RAM you can dedicate to the CLUSTER operation) before clustering.
Because the planner records statistics about the ordering of tables, it is advisable to run ANALYZE
on the newly clustered table. Otherwise, the planner might make poor choices of query plans.
Because CLUSTER remembers which indexes are clustered, one can cluster the tables one wants clus-
tered manually the first time, then set up a periodic maintenance script that executes CLUSTER with-
out any parameters, so that the desired tables are periodically reclustered.
Examples
Cluster the table employees on the basis of its index employees_ind:
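A command matching that description:
CLUSTER employees USING employees_ind;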
Cluster the employees table using the same index that was used before:
CLUSTER employees;
Cluster all tables in the database that have previously been clustered:
CLUSTER;
Compatibility
There is no CLUSTER statement in the SQL standard.
The syntax CLUSTER index_name ON table_name is also supported for compatibility with pre-8.3 PostgreSQL versions.
See Also
clusterdb
COMMENT
COMMENT — define or change the comment of an object
Synopsis
COMMENT ON
{
ACCESS METHOD object_name |
AGGREGATE aggregate_name ( aggregate_signature ) |
CAST (source_type AS target_type) |
COLLATION object_name |
COLUMN relation_name.column_name |
CONSTRAINT constraint_name ON table_name |
CONSTRAINT constraint_name ON DOMAIN domain_name |
CONVERSION object_name |
DATABASE object_name |
DOMAIN object_name |
EXTENSION object_name |
EVENT TRIGGER object_name |
FOREIGN DATA WRAPPER object_name |
FOREIGN TABLE object_name |
FUNCTION function_name [ ( [ [ argmode ] [ argname ] argtype
[, ...] ] ) ] |
INDEX object_name |
LARGE OBJECT large_object_oid |
MATERIALIZED VIEW object_name |
OPERATOR operator_name (left_type, right_type) |
OPERATOR CLASS object_name USING index_method |
OPERATOR FAMILY object_name USING index_method |
POLICY policy_name ON table_name |
[ PROCEDURAL ] LANGUAGE object_name |
PROCEDURE procedure_name [ ( [ [ argmode ] [ argname ] argtype
[, ...] ] ) ] |
PUBLICATION object_name |
ROLE object_name |
ROUTINE routine_name [ ( [ [ argmode ] [ argname ] argtype
[, ...] ] ) ] |
RULE rule_name ON table_name |
SCHEMA object_name |
SEQUENCE object_name |
SERVER object_name |
STATISTICS object_name |
SUBSCRIPTION object_name |
TABLE object_name |
TABLESPACE object_name |
TEXT SEARCH CONFIGURATION object_name |
TEXT SEARCH DICTIONARY object_name |
TEXT SEARCH PARSER object_name |
TEXT SEARCH TEMPLATE object_name |
TRANSFORM FOR type_name LANGUAGE lang_name |
TRIGGER trigger_name ON table_name |
TYPE object_name |
VIEW object_name
} IS 'text'
where aggregate_signature is:
* |
[ argmode ] [ argname ] argtype [ , ... ] |
[ [ argmode ] [ argname ] argtype [ , ... ] ] ORDER BY [ argmode ] [ argname ] argtype [ , ... ]
Description
COMMENT stores a comment about a database object.
Only one comment string is stored for each object, so to modify a comment, issue a new COMMENT
command for the same object. To remove a comment, write NULL in place of the text string. Comments
are automatically dropped when their object is dropped.
For most kinds of object, only the object's owner can set the comment. Roles don't have owners, so
the rule for COMMENT ON ROLE is that you must be superuser to comment on a superuser role, or
have the CREATEROLE privilege to comment on non-superuser roles. Likewise, access methods don't
have owners either; you must be superuser to comment on an access method. Of course, a superuser
can comment on anything.
Comments can be viewed using psql's \d family of commands. Other user interfaces to retrieve com-
ments can be built atop the same built-in functions that psql uses, namely obj_description,
col_description, and shobj_description (see Table 9.68).
Parameters
object_name
relation_name.column_name
aggregate_name
constraint_name
function_name
operator_name
policy_name
procedure_name
routine_name
rule_name
trigger_name
The name of the object to be commented. Names of tables, aggregates, collations, conversions,
domains, foreign tables, functions, indexes, operators, operator classes, operator families, proce-
dures, routines, sequences, statistics, text search objects, types, and views can be schema-quali-
fied. When commenting on a column, relation_name must refer to a table, view, composite
type, or foreign table.
table_name
domain_name
When creating a comment on a constraint, a trigger, a rule or a policy these parameters specify
the name of the table or domain on which that object is defined.
source_type
target_type
argmode
The mode of a function, procedure, or aggregate argument: IN, OUT, INOUT, or VARIADIC.
If omitted, the default is IN. Note that COMMENT does not actually pay any attention to OUT
arguments, since only the input arguments are needed to determine the function's identity. So it
is sufficient to list the IN, INOUT, and VARIADIC arguments.
argname
The name of a function, procedure, or aggregate argument. Note that COMMENT does not actually
pay any attention to argument names, since only the argument data types are needed to determine
the function's identity.
argtype
large_object_oid
left_type
right_type
The data type(s) of the operator's arguments (optionally schema-qualified). Write NONE for the
missing argument of a prefix or postfix operator.
PROCEDURAL
type_name
lang_name
text
The new comment, written as a string literal; or NULL to drop the comment.
Notes
There is presently no security mechanism for viewing comments: any user connected to a database
can see all the comments for objects in that database. For shared objects such as databases, roles, and
tablespaces, comments are stored globally so any user connected to any database in the cluster can see
all the comments for shared objects. Therefore, don't put security-critical information in comments.
Examples
Attach a comment to the table mytable:
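For example (the comment text is illustrative):
COMMENT ON TABLE mytable IS 'This is my table.';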
Remove it again:
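As described above, writing NULL removes the comment:
COMMENT ON TABLE mytable IS NULL;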
Compatibility
There is no COMMENT command in the SQL standard.
COMMIT
COMMIT — commit the current transaction
Synopsis
Description
COMMIT commits the current transaction. All changes made by the transaction become visible to
others and are guaranteed to be durable if a crash occurs.
Parameters
WORK
TRANSACTION
Notes
Use ROLLBACK to abort a transaction.
Issuing COMMIT when not inside a transaction does no harm, but it will provoke a warning message.
Examples
To commit the current transaction and make all changes permanent:
COMMIT;
Compatibility
The SQL standard only specifies the two forms COMMIT and COMMIT WORK. Otherwise, this com-
mand is fully conforming.
See Also
BEGIN, ROLLBACK
COMMIT PREPARED
COMMIT PREPARED — commit a transaction that was earlier prepared for two-phase commit
Synopsis
Description
COMMIT PREPARED commits a transaction that is in prepared state.
Parameters
transaction_id
Notes
To commit a prepared transaction, you must be either the same user that executed the transaction
originally, or a superuser. But you do not have to be in the same session that executed the transaction.
This command cannot be executed inside a transaction block. The prepared transaction is committed
immediately.
All currently available prepared transactions are listed in the pg_prepared_xacts system view.
Examples
Commit the transaction identified by the transaction identifier foobar:
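A command matching that description:
COMMIT PREPARED 'foobar';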
Compatibility
COMMIT PREPARED is a PostgreSQL extension. It is intended for use by external transaction man-
agement systems, some of which are covered by standards (such as X/Open XA), but the SQL side
of those systems is not standardized.
See Also
PREPARE TRANSACTION, ROLLBACK PREPARED
COPY
COPY — copy data between a file and a table
Synopsis
FORMAT format_name
OIDS [ boolean ]
FREEZE [ boolean ]
DELIMITER 'delimiter_character'
NULL 'null_string'
HEADER [ boolean ]
QUOTE 'quote_character'
ESCAPE 'escape_character'
FORCE_QUOTE { ( column_name [, ...] ) | * }
FORCE_NOT_NULL ( column_name [, ...] )
FORCE_NULL ( column_name [, ...] )
ENCODING 'encoding_name'
Description
COPY moves data between PostgreSQL tables and standard file-system files. COPY TO copies the
contents of a table to a file, while COPY FROM copies data from a file to a table (appending the data
to whatever is in the table already). COPY TO can also copy the results of a SELECT query.
If a column list is specified, COPY TO copies only the data in the specified columns to the file. For
COPY FROM, each field in the file is inserted, in order, into the specified column. Table columns not
specified in the COPY FROM column list will receive their default values.
COPY with a file name instructs the PostgreSQL server to directly read from or write to a file. The
file must be accessible by the PostgreSQL user (the user ID the server runs as) and the name must
be specified from the viewpoint of the server. When PROGRAM is specified, the server executes the
given command and reads from the standard output of the program, or writes to the standard input of
the program. The command must be specified from the viewpoint of the server, and be executable by
the PostgreSQL user. When STDIN or STDOUT is specified, data is transmitted via the connection
between the client and the server.
Parameters
table_name
column_name
An optional list of columns to be copied. If no column list is specified, all columns of the table
will be copied.
query
A SELECT, VALUES, INSERT, UPDATE or DELETE command whose results are to be copied.
Note that parentheses are required around the query.
For INSERT, UPDATE and DELETE queries a RETURNING clause must be provided, and the
target relation must not have a conditional rule, nor an ALSO rule, nor an INSTEAD rule that
expands to multiple statements.
filename
The path name of the input or output file. An input file name can be an absolute or relative path,
but an output file name must be an absolute path. Windows users might need to use an E'' string
and double any backslashes used in the path name.
PROGRAM
A command to execute. In COPY FROM, the input is read from standard output of the command,
and in COPY TO, the output is written to the standard input of the command.
Note that the command is invoked by the shell, so if you need to pass any arguments to shell
command that come from an untrusted source, you must be careful to strip or escape any special
characters that might have a special meaning for the shell. For security reasons, it is best to use a
fixed command string, or at least avoid passing any user input in it.
STDIN
STDOUT
boolean
Specifies whether the selected option should be turned on or off. You can write TRUE, ON, or 1 to
enable the option, and FALSE, OFF, or 0 to disable it. The boolean value can also be omitted,
in which case TRUE is assumed.
FORMAT
Selects the data format to be read or written: text, csv (Comma Separated Values), or binary.
The default is text.
OIDS
Specifies copying the OID for each row. (An error is raised if OIDS is specified for a table that
does not have OIDs, or in the case of copying a query.)
FREEZE
Requests copying the data with rows already frozen, just as they would be after running the VAC-
UUM FREEZE command. This is intended as a performance option for initial data loading. Rows
will be frozen only if the table being loaded has been created or truncated in the current subtrans-
action, there are no cursors open and there are no older snapshots held by this transaction. It is
currently not possible to perform a COPY FREEZE on a partitioned table.
Note that all other sessions will immediately be able to see the data once it has been successfully
loaded. This violates the normal rules of MVCC visibility and users specifying this option should be
aware of the potential problems this might cause.
DELIMITER
Specifies the character that separates columns within each row (line) of the file. The default is a
tab character in text format, a comma in CSV format. This must be a single one-byte character.
This option is not allowed when using binary format.
NULL
Specifies the string that represents a null value. The default is \N (backslash-N) in text format,
and an unquoted empty string in CSV format. You might prefer an empty string even in text format
for cases where you don't want to distinguish nulls from empty strings. This option is not allowed
when using binary format.
Note
When using COPY FROM, any data item that matches this string will be stored as a null
value, so you should make sure that you use the same string as you used with COPY TO.
HEADER
Specifies that the file contains a header line with the names of each column in the file. On output,
the first line contains the column names from the table, and on input, the first line is ignored. This
option is allowed only when using CSV format.
QUOTE
Specifies the quoting character to be used when a data value is quoted. The default is double-quote.
This must be a single one-byte character. This option is allowed only when using CSV format.
ESCAPE
Specifies the character that should appear before a data character that matches the QUOTE value.
The default is the same as the QUOTE value (so that the quoting character is doubled if it appears
in the data). This must be a single one-byte character. This option is allowed only when using
CSV format.
FORCE_QUOTE
Forces quoting to be used for all non-NULL values in each specified column. NULL output is never
quoted. If * is specified, non-NULL values will be quoted in all columns. This option is allowed
only in COPY TO, and only when using CSV format.
FORCE_NOT_NULL
Do not match the specified columns' values against the null string. In the default case where the
null string is empty, this means that empty values will be read as zero-length strings rather than
nulls, even when they are not quoted. This option is allowed only in COPY FROM, and only when
using CSV format.
FORCE_NULL
Match the specified columns' values against the null string, even if it has been quoted, and if a
match is found set the value to NULL. In the default case where the null string is empty, this
converts a quoted empty string into NULL. This option is allowed only in COPY FROM, and only
when using CSV format.
ENCODING
Specifies that the file is encoded in the encoding_name. If this option is omitted, the current
client encoding is used. See the Notes below for more details.
Outputs
On successful completion, a COPY command returns a command tag of the form
COPY count
Note
psql will print this command tag only if the command was not COPY ... TO STDOUT, or
the equivalent psql meta-command \copy ... to stdout. This is to prevent confusing
the command tag with the data that was just printed.
Notes
COPY TO can be used only with plain tables, not views, and does not copy rows from child tables or
child partitions. For example, COPY table TO copies the same rows as SELECT * FROM ONLY
table. The syntax COPY (SELECT * FROM table) TO ... can be used to dump all of the
rows in an inheritance hierarchy, partitioned table, or view.
COPY FROM can be used with plain, foreign, or partitioned tables or with views that have INSTEAD
OF INSERT triggers.
You must have select privilege on the table whose values are read by COPY TO, and insert privilege
on the table into which values are inserted by COPY FROM. It is sufficient to have column privileges
on the column(s) listed in the command.
If row-level security is enabled for the table, the relevant SELECT policies will apply to COPY ta-
ble TO statements. Currently, COPY FROM is not supported for tables with row-level security. Use
equivalent INSERT statements instead.
Files named in a COPY command are read or written directly by the server, not by the client applica-
tion. Therefore, they must reside on or be accessible to the database server machine, not the client.
They must be accessible to and readable or writable by the PostgreSQL user (the user ID the serv-
er runs as), not the client. Similarly, the command specified with PROGRAM is executed directly by
the server, not by the client application, and must be executable by the PostgreSQL user. COPY naming
a file or command is only allowed to database superusers or users who are granted one of the de-
fault roles pg_read_server_files, pg_write_server_files, or pg_execute_serv-
er_program, since it allows reading or writing any file or running a program that the server has
privileges to access.
Do not confuse COPY with the psql instruction \copy. \copy invokes COPY FROM STDIN or
COPY TO STDOUT, and then fetches/stores the data in a file accessible to the psql client. Thus, file
accessibility and access rights depend on the client rather than the server when \copy is used.
It is recommended that the file name used in COPY always be specified as an absolute path. This is
enforced by the server in the case of COPY TO, but for COPY FROM you do have the option of reading
from a file specified by a relative path. The path will be interpreted relative to the working directory
of the server process (normally the cluster's data directory), not the client's working directory.
Executing a command with PROGRAM might be restricted by the operating system's access control
mechanisms, such as SELinux.
COPY FROM will invoke any triggers and check constraints on the destination table. However, it will
not invoke rules.
For identity columns, the COPY FROM command will always write the column values provided in the
input data, like the INSERT option OVERRIDING SYSTEM VALUE.
COPY input and output is affected by DateStyle. To ensure portability to other PostgreSQL instal-
lations that might use non-default DateStyle settings, DateStyle should be set to ISO before
using COPY TO. It is also a good idea to avoid dumping data with IntervalStyle set to sql_s-
tandard, because negative interval values might be misinterpreted by a server that has a different
setting for IntervalStyle.
Input data is interpreted according to ENCODING option or the current client encoding, and output
data is encoded in ENCODING or the current client encoding, even if the data does not pass through
the client but is read from or written to a file directly by the server.
COPY stops operation at the first error. This should not lead to problems in the event of a COPY TO,
but the target table will already have received earlier rows in a COPY FROM. These rows will not be
visible or accessible, but they still occupy disk space. This might amount to a considerable amount of
wasted disk space if the failure happened well into a large copy operation. You might wish to invoke
VACUUM to recover the wasted space.
FORCE_NULL and FORCE_NOT_NULL can be used simultaneously on the same column. This results
in converting quoted null strings to null values and unquoted null strings to empty strings.
File Formats
Text Format
When the text format is used, the data read or written is a text file with one line per table row.
Columns in a row are separated by the delimiter character. The column values themselves are strings
generated by the output function, or acceptable to the input function, of each attribute's data type. The
specified null string is used in place of columns that are null. COPY FROM will raise an error if any
line of the input file contains more or fewer columns than are expected. If OIDS is specified, the OID
is read or written as the first column, preceding the user data columns.
End of data can be represented by a single line containing just backslash-period (\.). An end-of-data
marker is not necessary when reading from a file, since the end of file serves perfectly well; it is needed
only when copying data to or from client applications using pre-3.0 client protocol.
Backslash characters (\) can be used in the COPY data to quote data characters that might otherwise
be taken as row or column delimiters. In particular, the following characters must be preceded by a
backslash if they appear as part of a column value: backslash itself, newline, carriage return, and the
current delimiter character.
The specified null string is sent by COPY TO without adding any backslashes; conversely, COPY
FROM matches the input against the null string before removing backslashes. Therefore, a null string
such as \N cannot be confused with the actual data value \N (which would be represented as \\N).
Sequence     Represents
\b           Backspace (ASCII 8)
\f           Form feed (ASCII 12)
\n           Newline (ASCII 10)
\r           Carriage return (ASCII 13)
\t           Tab (ASCII 9)
\v           Vertical tab (ASCII 11)
\digits      Backslash followed by one to three octal digits specifies the byte with that numeric code
\xdigits     Backslash x followed by one or two hex digits specifies the byte with that numeric code
Presently, COPY TO will never emit an octal or hex-digits backslash sequence, but it does use the
other sequences listed above for those control characters.
Any other backslashed character that is not mentioned in the above table will be taken to represent
itself. However, beware of adding backslashes unnecessarily, since that might accidentally produce a
string matching the end-of-data marker (\.) or the null string (\N by default). These strings will be
recognized before any other backslash processing is done.
It is strongly recommended that applications generating COPY data convert data newlines and carriage
returns to the \n and \r sequences respectively. At present it is possible to represent a data carriage
return by a backslash and carriage return, and to represent a data newline by a backslash and newline.
However, these representations might not be accepted in future releases. They are also highly vulner-
able to corruption if the COPY file is transferred across different machines (for example, from Unix
to Windows or vice versa).
All backslash sequences are interpreted after encoding conversion. The bytes specified with the octal
and hex-digit backslash sequences must form valid characters in the database encoding.
COPY TO will terminate each row with a Unix-style newline (“\n”). Servers running on Microsoft
Windows instead output carriage return/newline (“\r\n”), but only for COPY to a server file; for
consistency across platforms, COPY TO STDOUT always sends “\n” regardless of server platform.
COPY FROM can handle lines ending with newlines, carriage returns, or carriage return/newlines. To
reduce the risk of error due to un-backslashed newlines or carriage returns that were meant as data,
COPY FROM will complain if the line endings in the input are not all alike.
CSV Format
This format option is used for importing and exporting the Comma Separated Value (CSV) file format
used by many other programs, such as spreadsheets. Instead of the escaping rules used by PostgreSQL's
standard text format, it produces and recognizes the common CSV escaping mechanism.
The values in each record are separated by the DELIMITER character. If the value contains the de-
limiter character, the QUOTE character, the NULL string, a carriage return, or line feed character, then
the whole value is prefixed and suffixed by the QUOTE character, and any occurrence within the value
of a QUOTE character or the ESCAPE character is preceded by the escape character. You can also use
FORCE_QUOTE to force quotes when outputting non-NULL values in specific columns.
The CSV format has no standard way to distinguish a NULL value from an empty string. PostgreSQL's
COPY handles this by quoting. A NULL is output as the NULL parameter string and is not quoted,
while a non-NULL value matching the NULL parameter string is quoted. For example, with the default
settings, a NULL is written as an unquoted empty string, while an empty string data value is written
with double quotes (""). Reading values follows similar rules. You can use FORCE_NOT_NULL to
prevent NULL input comparisons for specific columns. You can also use FORCE_NULL to convert
quoted null string data values to NULL.
Because backslash is not a special character in the CSV format, \., the end-of-data marker, could also
appear as a data value. To avoid any misinterpretation, a \. data value appearing as a lone entry on
a line is automatically quoted on output, and on input, if quoted, is not interpreted as the end-of-data
marker. If you are loading a file created by another application that has a single unquoted column and
might have a value of \., you might need to quote that value in the input file.
Note
In CSV format, all characters are significant. A quoted value surrounded by white space, or
any characters other than DELIMITER, will include those characters. This can cause errors if
you import data from a system that pads CSV lines with white space out to some fixed width. If
such a situation arises you might need to preprocess the CSV file to remove the trailing white
space, before importing the data into PostgreSQL.
Note
CSV format will both recognize and produce CSV files with quoted values containing embed-
ded carriage returns and line feeds. Thus the files are not strictly one line per table row like
text-format files.
Note
Many programs produce strange and occasionally perverse CSV files, so the file format is more
a convention than a standard. Thus you might encounter some files that cannot be imported
using this mechanism, and COPY might produce files that other programs cannot process.
Binary Format
The binary format option causes all data to be stored/read as binary format rather than as text. It is
somewhat faster than the text and CSV formats, but a binary-format file is less portable across machine
architectures and PostgreSQL versions. Also, the binary format is very data type specific; for example
it will not work to output binary data from a smallint column and read it into an integer column,
even though that would work fine in text format.
The binary file format consists of a file header, zero or more tuples containing the row data, and a
file trailer. Headers and data are in network byte order.
Note
PostgreSQL releases before 7.4 used a different binary file format.
File Header
The file header consists of 15 bytes of fixed fields, followed by a variable-length header extension
area. The fixed fields are:
Signature
11-byte sequence PGCOPY\n\377\r\n\0 — note that the zero byte is a required part of the
signature. (The signature is designed to allow easy identification of files that have been munged
by a non-8-bit-clean transfer. This signature will be changed by end-of-line-translation filters,
dropped zero bytes, dropped high bits, or parity changes.)
Flags field
32-bit integer bit mask to denote important aspects of the file format. Bits are numbered from
0 (LSB) to 31 (MSB). Note that this field is stored in network byte order (most significant byte
first), as are all the integer fields used in the file format. Bits 16-31 are reserved to denote critical
file format issues; a reader should abort if it finds an unexpected bit set in this range. Bits 0-15
are reserved to signal backwards-compatible format issues; a reader should simply ignore any
unexpected bits set in this range. Currently only one flag bit is defined, and the rest must be zero:
Bit 16
If 1, OIDs are included in the data; if 0, not.
Header extension area length
32-bit integer, length in bytes of remainder of header, not including self. Currently, this is zero,
and the first tuple follows immediately. Future changes to the format might allow additional data
to be present in the header. A reader should silently skip over any header extension data it does
not know what to do with.
The header extension area is envisioned to contain a sequence of self-identifying chunks. The flags
field is not intended to tell readers what is in the extension area. Specific design of header extension
contents is left for a later release.
This design allows for both backwards-compatible header additions (add header extension chunks, or
set low-order flag bits) and non-backwards-compatible changes (set high-order flag bits to signal such
changes, and add supporting data to the extension area if needed).
Tuples
Each tuple begins with a 16-bit integer count of the number of fields in the tuple. (Presently, all tuples
in a table will have the same count, but that might not always be true.) Then, repeated for each field
in the tuple, there is a 32-bit length word followed by that many bytes of field data. (The length word
does not include itself, and can be zero.) As a special case, -1 indicates a NULL field value. No value
bytes follow in the NULL case.
Presently, all data values in a binary-format file are assumed to be in binary format (format code one).
It is anticipated that a future extension might add a header field that allows per-column format codes
to be specified.
To determine the appropriate binary format for the actual tuple data you should consult the PostgreSQL
source, in particular the *send and *recv functions for each column's data type (typically these
functions are found in the src/backend/utils/adt/ directory of the source distribution).
If OIDs are included in the file, the OID field immediately follows the field-count word. It is a normal
field except that it's not included in the field-count. In particular it has a length word — this will allow
handling of 4-byte vs. 8-byte OIDs without too much pain, and will allow OIDs to be shown as null
if that ever proves desirable.
File Trailer
The file trailer consists of a 16-bit integer word containing -1. This is easily distinguished from a
tuple's field-count word.
A reader should report an error if a field-count word is neither -1 nor the expected number of columns.
This provides an extra check against somehow getting out of sync with the data.
Examples
The following example copies a table to the client using the vertical bar (|) as the field delimiter:
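COPY country TO STDOUT (DELIMITER '|');  -- "country" is an illustrative table name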
To copy into a file just the countries whose names start with 'A':
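COPY (SELECT * FROM country WHERE country_name LIKE 'A%')
    TO '/tmp/a_list_countries.copy';  -- column name and output path are illustrative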
To copy into a compressed file, you can pipe the output through an external compression program:
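COPY country TO PROGRAM 'gzip > /tmp/country_data.gz';  -- program and path are illustrative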
Here is a sample of data suitable for copying into a table from STDIN:
AF AFGHANISTAN
AL ALBANIA
DZ ALGERIA
ZM ZAMBIA
ZW ZIMBABWE
Note that the white space on each line is actually a tab character.
The following is the same data, output in binary format. The data is shown after filtering through the
Unix utility od -c. The table has three columns; the first has type char(2), the second has type
text, and the third has type integer. All the rows have a null value in the third column.
0000000 P G C O P Y \n 377 \r \n \0 \0 \0 \0 \0 \0
0000020 \0 \0 \0 \0 003 \0 \0 \0 002 A F \0 \0 \0 013 A
0000040 F G H A N I S T A N 377 377 377 377 \0 003
0000060 \0 \0 \0 002 A L \0 \0 \0 007 A L B A N I
0000100 A 377 377 377 377 \0 003 \0 \0 \0 002 D Z \0 \0 \0
0000120 007 A L G E R I A 377 377 377 377 \0 003 \0 \0
0000140 \0 002 Z M \0 \0 \0 006 Z A M B I A 377 377
0000160 377 377 \0 003 \0 \0 \0 002 Z W \0 \0 \0 \b Z I
0000200 M B A B W E 377 377 377 377 377 377
Compatibility
There is no COPY statement in the SQL standard.
The following syntax was used before PostgreSQL version 9.0 and is still supported:
Note that in this syntax, BINARY and CSV are treated as independent keywords, not as arguments
of a FORMAT option.
The following syntax was used before PostgreSQL version 7.3 and is still supported:
CREATE ACCESS METHOD
CREATE ACCESS METHOD — define a new access method
Synopsis
Description
CREATE ACCESS METHOD creates a new access method.
Parameters
name
The name of the access method to create.
access_method_type
This clause specifies the type of access method to define. Only INDEX is supported at present.
handler_function
handler_function is the name (possibly schema-qualified) of a previously registered function that represents the access method. The handler function must be declared to take a single argument of type internal, and for an INDEX access method its return type must be index_am_handler.
Examples
Create an index access method heptree with handler function heptree_handler:
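CREATE ACCESS METHOD heptree TYPE INDEX HANDLER heptree_handler;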
Compatibility
CREATE ACCESS METHOD is a PostgreSQL extension.
See Also
DROP ACCESS METHOD, CREATE OPERATOR CLASS, CREATE OPERATOR FAMILY
CREATE AGGREGATE
CREATE AGGREGATE — define a new aggregate function
Synopsis
CREATE AGGREGATE name ( [ argmode ] [ argname ] arg_data_type [ , ... ] ) (
SFUNC = sfunc,
STYPE = state_data_type
[ , SSPACE = state_data_size ]
[ , FINALFUNC = ffunc ]
[ , FINALFUNC_EXTRA ]
[ , FINALFUNC_MODIFY = { READ_ONLY | SHAREABLE | READ_WRITE } ]
[ , COMBINEFUNC = combinefunc ]
[ , SERIALFUNC = serialfunc ]
[ , DESERIALFUNC = deserialfunc ]
[ , INITCOND = initial_condition ]
[ , MSFUNC = msfunc ]
[ , MINVFUNC = minvfunc ]
[ , MSTYPE = mstate_data_type ]
[ , MSSPACE = mstate_data_size ]
[ , MFINALFUNC = mffunc ]
[ , MFINALFUNC_EXTRA ]
[ , MFINALFUNC_MODIFY = { READ_ONLY | SHAREABLE |
READ_WRITE } ]
[ , MINITCOND = minitial_condition ]
[ , SORTOP = sort_operator ]
[ , PARALLEL = { SAFE | RESTRICTED | UNSAFE } ]
)
Description
CREATE AGGREGATE defines a new aggregate function. Some basic and commonly-used aggregate
functions are included with the distribution; they are documented in Section 9.20. If one defines new
types or needs an aggregate function not already provided, then CREATE AGGREGATE can be used
to provide the desired features.
If a schema name is given (for example, CREATE AGGREGATE myschema.myagg ...) then the
aggregate function is created in the specified schema. Otherwise it is created in the current schema.
An aggregate function is identified by its name and input data type(s). Two aggregates in the same
schema can have the same name if they operate on different input types. The name and input data
type(s) of an aggregate must also be distinct from the name and input data type(s) of every ordinary
function in the same schema. This behavior is identical to overloading of ordinary function names
(see CREATE FUNCTION).
A simple aggregate function is made from one or two ordinary functions: a state transition function
sfunc, and an optional final calculation function ffunc. These are used as follows:
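sfunc( internal-state, next-data-values ) ---> next-internal-state
ffunc( internal-state ) ---> aggregate-value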
PostgreSQL creates a temporary variable of data type stype to hold the current internal state of the
aggregate. At each input row, the aggregate argument value(s) are calculated and the state transition
function is invoked with the current state value and the new argument value(s) to calculate a new
internal state value. After all the rows have been processed, the final function is invoked once to
calculate the aggregate's return value. If there is no final function then the ending state value is returned
as-is.
An aggregate function can provide an initial condition, that is, an initial value for the internal state
value. This is specified and stored in the database as a value of type text, but it must be a valid
external representation of a constant of the state value data type. If it is not supplied then the state
value starts out null.
If the state transition function is declared “strict”, then it cannot be called with null inputs. With such
a transition function, aggregate execution behaves as follows. Rows with any null input values are
ignored (the function is not called and the previous state value is retained). If the initial state value
is null, then at the first row with all-nonnull input values, the first argument value replaces the state
value, and the transition function is invoked at each subsequent row with all-nonnull input values.
This is handy for implementing aggregates like max. Note that this behavior is only available when
state_data_type is the same as the first arg_data_type. When these types are different, you
must supply a nonnull initial condition or use a nonstrict transition function.
If the state transition function is not strict, then it will be called unconditionally at each input row, and
must deal with null inputs and null state values for itself. This allows the aggregate author to have full
control over the aggregate's handling of null values.
If the final function is declared “strict”, then it will not be called when the ending state value is null;
instead a null result will be returned automatically. (Of course this is just the normal behavior of strict
functions.) In any case the final function has the option of returning a null value. For example, the
final function for avg returns null when it sees there were zero input rows.
Sometimes it is useful to declare the final function as taking not just the state value, but extra para-
meters corresponding to the aggregate's input values. The main reason for doing this is if the final
function is polymorphic and the state value's data type would be inadequate to pin down the result
type. These extra parameters are always passed as NULL (and so the final function must not be strict
when the FINALFUNC_EXTRA option is used), but nonetheless they are valid parameters. The final
function could for example make use of get_fn_expr_argtype to identify the actual argument
type in the current call.
An aggregate can optionally support moving-aggregate mode, as described in Section 38.11.1. This
requires specifying the MSFUNC, MINVFUNC, and MSTYPE parameters, and optionally the MSS-
PACE, MFINALFUNC, MFINALFUNC_EXTRA, MFINALFUNC_MODIFY, and MINITCOND para-
meters. Except for MINVFUNC, these parameters work like the corresponding simple-aggregate pa-
rameters without M; they define a separate implementation of the aggregate that includes an inverse
transition function.
The syntax with ORDER BY in the parameter list creates a special type of aggregate called an or-
dered-set aggregate; or if HYPOTHETICAL is specified, then a hypothetical-set aggregate is created.
These aggregates operate over groups of sorted values in order-dependent ways, so that specification
of an input sort order is an essential part of a call. Also, they can have direct arguments, which are
arguments that are evaluated only once per aggregation rather than once per input row. Hypotheti-
cal-set aggregates are a subclass of ordered-set aggregates in which some of the direct arguments are
required to match, in number and data types, the aggregated argument columns. This allows the val-
ues of those direct arguments to be added to the collection of aggregate-input rows as an additional
“hypothetical” row.
An aggregate can optionally support partial aggregation, as described in Section 38.11.4. This re-
quires specifying the COMBINEFUNC parameter. If the state_data_type is internal, it's usu-
ally also appropriate to provide the SERIALFUNC and DESERIALFUNC parameters so that parallel
aggregation is possible. Note that the aggregate must also be marked PARALLEL SAFE to enable
parallel aggregation.
Aggregates that behave like MIN or MAX can sometimes be optimized by looking into an index instead
of scanning every input row. If this aggregate can be so optimized, indicate it by specifying a sort
operator. The basic requirement is that the aggregate must yield the first element in the sort ordering
induced by the operator; in other words:
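SELECT agg(col) FROM tab;
must be equivalent to:
SELECT col FROM tab ORDER BY col USING sortop LIMIT 1;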
Further assumptions are that the aggregate ignores null inputs, and that it delivers a null result if and
only if there were no non-null inputs. Ordinarily, a data type's < operator is the proper sort operator
for MIN, and > is the proper sort operator for MAX. Note that the optimization will never actually take
effect unless the specified operator is the “less than” or “greater than” strategy member of a B-tree
index operator class.
To be able to create an aggregate function, you must have USAGE privilege on the argument types,
the state type(s), and the return type, as well as EXECUTE privilege on the supporting functions.
Parameters
name
The name (optionally schema-qualified) of the aggregate function to create.
argmode
The mode of an argument: IN or VARIADIC. (Aggregate functions do not support OUT argu-
ments.) If omitted, the default is IN. Only the last argument can be marked VARIADIC.
argname
The name of an argument. This is currently only useful for documentation purposes. If omitted,
the argument has no name.
arg_data_type
An input data type on which this aggregate function operates. To create a zero-argument aggregate
function, write * in place of the list of argument specifications. (An example of such an aggregate
is count(*).)
base_type
In the old syntax for CREATE AGGREGATE, the input data type is specified by a basetype
parameter rather than being written next to the aggregate name. Note that this syntax allows only
one input parameter. To define a zero-argument aggregate function with this syntax, specify the
basetype as "ANY" (not *). Ordered-set aggregates cannot be defined with the old syntax.
sfunc
The name of the state transition function to be called for each input row. For a normal N-argument
aggregate function, the sfunc must take N+1 arguments, the first being of type state_da-
ta_type and the rest matching the declared input data type(s) of the aggregate. The function
must return a value of type state_data_type. This function takes the current state value and
the current input data value(s), and returns the next state value.
For ordered-set (including hypothetical-set) aggregates, the state transition function receives only
the current state value and the aggregated arguments, not the direct arguments. Otherwise it is
the same.
state_data_type
The data type for the aggregate's state value.
state_data_size
The approximate average size (in bytes) of the aggregate's state value. If this parameter is omitted
or is zero, a default estimate is used based on the state_data_type. The planner uses this
value to estimate the memory required for a grouped aggregate query. The planner will consider
using hash aggregation for such a query only if the hash table is estimated to fit in work_mem;
therefore, large values of this parameter discourage use of hash aggregation.
ffunc
The name of the final function called to compute the aggregate's result after all input rows
have been traversed. For a normal aggregate, this function must take a single argument of type
state_data_type. The return data type of the aggregate is defined as the return type of this
function. If ffunc is not specified, then the ending state value is used as the aggregate's result,
and the return type is state_data_type.
For ordered-set (including hypothetical-set) aggregates, the final function receives not only the
final state value, but also the values of all the direct arguments.
If FINALFUNC_EXTRA is specified, then in addition to the final state value and any direct ar-
guments, the final function receives extra NULL values corresponding to the aggregate's regular
(aggregated) arguments. This is mainly useful to allow correct resolution of the aggregate result
type when a polymorphic aggregate is being defined.
FINALFUNC_MODIFY = { READ_ONLY | SHAREABLE | READ_WRITE }
This option specifies whether the final function is a pure function that does not modify its argu-
ments. READ_ONLY indicates it does not; the other two values indicate that it may change the
transition state value. See Notes below for more detail. The default is READ_ONLY, except for
ordered-set aggregates, for which the default is READ_WRITE.
combinefunc
The combinefunc function may optionally be specified to allow the aggregate function to
support partial aggregation. If provided, the combinefunc must combine two state_da-
ta_type values, each containing the result of aggregation over some subset of the input values,
to produce a new state_data_type that represents the result of aggregating over both sets
of inputs. This function can be thought of as an sfunc, where instead of acting upon an individ-
ual input row and adding it to the running aggregate state, it adds another aggregate state to the
running state.
The combinefunc must be declared as taking two arguments of the state_data_type and
returning a value of the state_data_type. Optionally this function may be “strict”. In this
case the function will not be called when either of the input states are null; the other state will
be taken as the correct result.
serialfunc
An aggregate whose state_data_type is internal can participate in parallel aggregation only if it has a serialfunc, which serializes the aggregate state into a bytea value for transmission to another process. This function must take a single argument of type internal and return type bytea.
deserialfunc
Deserialize a previously serialized aggregate state back into state_data_type. This function
must take two arguments of types bytea and internal, and produce a result of type inter-
nal. (Note: the second, internal argument is unused, but is required for type safety reasons.)
initial_condition
The initial setting for the state value. This must be a string constant in the form accepted for the
data type state_data_type. If not specified, the state value starts out null.
msfunc
The name of the forward state transition function to be called for each input row in moving-ag-
gregate mode. This is exactly like the regular transition function, except that its first argument and
result are of type mstate_data_type, which might be different from state_data_type.
minvfunc
The name of the inverse state transition function to be used in moving-aggregate mode. This
function has the same argument and result types as msfunc, but it is used to remove a value from
the current aggregate state, rather than add a value to it. The inverse transition function must have
the same strictness attribute as the forward state transition function.
mstate_data_type
The data type for the aggregate's state value, when using moving-aggregate mode.
mstate_data_size
The approximate average size (in bytes) of the aggregate's state value, when using moving-ag-
gregate mode. This works the same as state_data_size.
mffunc
The name of the final function called to compute the aggregate's result after all input rows have
been traversed, when using moving-aggregate mode. This works the same as ffunc, except
that its first argument's type is mstate_data_type and extra dummy arguments are speci-
fied by writing MFINALFUNC_EXTRA. The aggregate result type determined by mffunc or
mstate_data_type must match that determined by the aggregate's regular implementation.
MFINALFUNC_MODIFY = { READ_ONLY | SHAREABLE | READ_WRITE }
This option is like FINALFUNC_MODIFY, but it describes the behavior of the moving-aggregate
final function.
minitial_condition
The initial setting for the state value, when using moving-aggregate mode. This works the same
as initial_condition.
sort_operator
The associated sort operator for a MIN- or MAX-like aggregate. This is just an operator name
(possibly schema-qualified). The operator is assumed to have the same input data types as the
aggregate (which must be a single-argument normal aggregate).
PARALLEL = { SAFE | RESTRICTED | UNSAFE }
The meanings of PARALLEL SAFE, PARALLEL RESTRICTED, and PARALLEL UNSAFE are
the same as in CREATE FUNCTION. An aggregate will not be considered for parallelization if it
is marked PARALLEL UNSAFE (which is the default!) or PARALLEL RESTRICTED. Note that
the parallel-safety markings of the aggregate's support functions are not consulted by the planner,
only the marking of the aggregate itself.
HYPOTHETICAL
For ordered-set aggregates only, this flag specifies that the aggregate arguments are to be
processed according to the requirements for hypothetical-set aggregates: that is, the last few di-
rect arguments must match the data types of the aggregated (WITHIN GROUP) arguments. The
HYPOTHETICAL flag has no effect on run-time behavior, only on parse-time resolution of the
data types and collations of the aggregate's arguments.
The parameters of CREATE AGGREGATE can be written in any order, not just the order illustrated
above.
Notes
In parameters that specify support function names, you can write a schema name if needed, for example
SFUNC = public.sum. Do not write argument types there, however — the argument types of the
support functions are determined from other parameters.
Ordinarily, PostgreSQL functions are expected to be true functions that do not modify their input
values. However, an aggregate transition function, when used in the context of an aggregate, is allowed
to cheat and modify its transition-state argument in place. This can provide substantial performance
benefits compared to making a fresh copy of the transition state each time.
Likewise, while an aggregate final function is normally expected not to modify its input values, some-
times it is impractical to avoid modifying the transition-state argument. Such behavior must be de-
clared using the FINALFUNC_MODIFY parameter. The READ_WRITE value indicates that the final
function modifies the transition state in unspecified ways. This value prevents use of the aggregate
as a window function, and it also prevents merging of transition states for aggregate calls that share
the same input values and transition functions. The SHAREABLE value indicates that the transition
function cannot be applied after the final function, but multiple final-function calls can be performed
on the ending transition state value. This value prevents use of the aggregate as a window function,
but it allows merging of transition states. (That is, the optimization of interest here is not applying
the same final function repeatedly, but applying different final functions to the same ending transition
state value. This is allowed as long as none of the final functions are marked READ_WRITE.)
If an aggregate supports moving-aggregate mode, it will improve calculation efficiency when the
aggregate is used as a window function for a window with moving frame start (that is, a frame start
mode other than UNBOUNDED PRECEDING). Conceptually, the forward transition function adds
input values to the aggregate's state when they enter the window frame from the bottom, and the inverse
transition function removes them again when they leave the frame at the top. So, when values are
removed, they are always removed in the same order they were added. Whenever the inverse transition
function is invoked, it will thus receive the earliest added but not yet removed argument value(s). The
inverse transition function can assume that at least one row will remain in the current state after it
removes the oldest row. (When this would not be the case, the window function mechanism simply
starts a fresh aggregation, rather than using the inverse transition function.)
The forward transition function for moving-aggregate mode is not allowed to return NULL as the
new state value. If the inverse transition function returns NULL, this is taken as an indication that
the inverse function cannot reverse the state calculation for this particular input, and so the aggregate
calculation will be redone from scratch for the current frame starting position. This convention allows
moving-aggregate mode to be used in situations where there are some infrequent cases that are im-
practical to reverse out of the running state value.
If no moving-aggregate implementation is supplied, the aggregate can still be used with moving
frames, but PostgreSQL will recompute the whole aggregation whenever the start of the frame moves.
Note that whether or not the aggregate supports moving-aggregate mode, PostgreSQL can handle a
moving frame end without recalculation; this is done by continuing to add new values to the aggre-
gate's state. This is why use of an aggregate as a window function requires that the final function be
read-only: it must not damage the aggregate's state value, so that the aggregation can be continued
even after an aggregate result value has been obtained for one set of frame boundaries.
The syntax for ordered-set aggregates allows VARIADIC to be specified for both the last direct para-
meter and the last aggregated (WITHIN GROUP) parameter. However, the current implementation re-
stricts use of VARIADIC in two ways. First, ordered-set aggregates can only use VARIADIC "any",
not other variadic array types. Second, if the last direct parameter is VARIADIC "any", then there
can be only one aggregated parameter and it must also be VARIADIC "any". (In the representa-
tion used in the system catalogs, these two parameters are merged into a single VARIADIC "any"
item, since pg_proc cannot represent functions with more than one VARIADIC parameter.) If the
aggregate is a hypothetical-set aggregate, the direct arguments that match the VARIADIC "any"
parameter are the hypothetical ones; any preceding parameters represent additional direct arguments
that are not constrained to match the aggregated arguments.
Currently, ordered-set aggregates do not need to support moving-aggregate mode, since they cannot
be used as window functions.
Partial (including parallel) aggregation is currently not supported for ordered-set aggregates. Also, it
will never be used for aggregate calls that include DISTINCT or ORDER BY clauses, since those
semantics cannot be supported during partial aggregation.
Examples
See Section 38.11.
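As a minimal sketch, a sum-like aggregate over integers can be built directly on the built-in addition function int4pl (the aggregate name my_sum is arbitrary):

CREATE AGGREGATE my_sum (integer) (
    SFUNC = int4pl,   -- built-in function implementing integer + integer
    STYPE = integer,
    INITCOND = '0'
);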
Compatibility
CREATE AGGREGATE is a PostgreSQL language extension. The SQL standard does not provide for
user-defined aggregate functions.
See Also
ALTER AGGREGATE, DROP AGGREGATE
CREATE CAST
CREATE CAST — define a new cast
Synopsis
Description
CREATE CAST defines a new cast. A cast specifies how to perform a conversion between two data
types. For example,
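SELECT CAST(42 AS float8);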
converts the integer constant 42 to type float8 by invoking a previously specified function, in this
case float8(int4). (If no suitable cast has been defined, the conversion fails.)
Two types can be binary coercible, which means that the conversion can be performed “for free”
without invoking any function. This requires that corresponding values use the same internal repre-
sentation. For instance, the types text and varchar are binary coercible both ways. Binary co-
ercibility is not necessarily a symmetric relationship. For example, the cast from xml to text can
be performed for free in the present implementation, but the reverse direction requires a function that
performs at least a syntax check. (Two types that are binary coercible both ways are also referred to
as binary compatible.)
You can define a cast as an I/O conversion cast by using the WITH INOUT syntax. An I/O conversion
cast is performed by invoking the output function of the source data type, and passing the resulting
string to the input function of the target data type. In many common cases, this feature avoids the need
to write a separate cast function for conversion. An I/O conversion cast acts the same as a regular
function-based cast; only the implementation is different.
By default, a cast can be invoked only by an explicit cast request, that is an explicit CAST(x AS
typename) or x::typename construct.
If the cast is marked AS ASSIGNMENT then it can be invoked implicitly when assigning a value to a
column of the target data type. For example, supposing that foo.f1 is a column of type text, then:
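INSERT INTO foo (f1) VALUES (42);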
will be allowed if the cast from type integer to type text is marked AS ASSIGNMENT, otherwise
not. (We generally use the term assignment cast to describe this kind of cast.)
If the cast is marked AS IMPLICIT then it can be invoked implicitly in any context, whether assign-
ment or internally in an expression. (We generally use the term implicit cast to describe this kind of
cast.) For example, consider this query:
SELECT 2 + 4.0;
The parser initially marks the constants as being of type integer and numeric respectively. There
is no integer + numeric operator in the system catalogs, but there is a numeric + numeric
operator. The query will therefore succeed if a cast from integer to numeric is available and is
marked AS IMPLICIT — which in fact it is. The parser will apply the implicit cast and resolve the
query as if it had been written
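SELECT CAST ( 2 AS numeric ) + 4.0;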
Now, the catalogs also provide a cast from numeric to integer. If that cast were marked AS
IMPLICIT — which it is not — then the parser would be faced with choosing between the above
interpretation and the alternative of casting the numeric constant to integer and applying the
integer + integer operator. Lacking any knowledge of which choice to prefer, it would give
up and declare the query ambiguous. The fact that only one of the two casts is implicit is the way
in which we teach the parser to prefer resolution of a mixed numeric-and-integer expression as
numeric; there is no built-in knowledge about that.
Note
Sometimes it is necessary for usability or standards-compliance reasons to provide multiple
implicit casts among a set of types, resulting in ambiguity that cannot be avoided as above.
The parser has a fallback heuristic based on type categories and preferred types that can help
to provide desired behavior in such cases. See CREATE TYPE for more information.
To be able to create a cast, you must own the source or the target data type and have USAGE privilege
on the other type. To create a binary-coercible cast, you must be superuser. (This restriction is made
because an erroneous binary-coercible cast conversion can easily crash the server.)
Parameters
source_type
The name of the source data type of the cast.
target_type
The name of the target data type of the cast.
function_name[(argument_type [, ...])]
The function used to perform the cast. The function name can be schema-qualified. If it is not, the
function will be looked up in the schema search path. The function's result data type must match
the target type of the cast. Its arguments are discussed below. If no argument list is specified, the
function name must be unique in its schema.
WITHOUT FUNCTION
Indicates that the source type is binary-coercible to the target type, so no function is required to
perform the cast.
WITH INOUT
Indicates that the cast is an I/O conversion cast, performed by invoking the output function of the
source data type, and passing the resulting string to the input function of the target data type.
AS ASSIGNMENT
Indicates that the cast can be invoked implicitly in assignment contexts.
AS IMPLICIT
Indicates that the cast can be invoked implicitly in any context.
Cast implementation functions can have one to three arguments. The first argument type must be
identical to or binary-coercible from the cast's source type. The second argument, if present, must be
type integer; it receives the type modifier associated with the destination type, or -1 if there is
none. The third argument, if present, must be type boolean; it receives true if the cast is an explicit
cast, false otherwise. (Bizarrely, the SQL standard demands different behaviors for explicit and
implicit casts in some cases. This argument is supplied for functions that must implement such casts.
It is not recommended that you design your own data types so that this matters.)
The return type of a cast function must be identical to or binary-coercible to the cast's target type.
Ordinarily a cast must have different source and target data types. However, it is allowed to declare
a cast with identical source and target types if it has a cast implementation function with more than
one argument. This is used to represent type-specific length coercion functions in the system catalogs.
The named function is used to coerce a value of the type to the type modifier value given by its second
argument.
When a cast has different source and target types and a function that takes more than one argument,
it supports converting from one type to another and applying a length coercion in a single step. When
no such entry is available, coercion to a type that uses a type modifier involves two cast steps, one to
convert between data types and a second to apply the modifier.
A cast to or from a domain type currently has no effect. Casting to or from a domain uses the casts
associated with its underlying type.
Notes
Use DROP CAST to remove user-defined casts.
Remember that if you want to be able to convert types both ways you need to declare casts both ways
explicitly.
It is normally not necessary to create casts between user-defined types and the standard string types
(text, varchar, and char(n), as well as user-defined types that are defined to be in the string
category). PostgreSQL provides automatic I/O conversion casts for that. The automatic casts to string
types are treated as assignment casts, while the automatic casts from string types are explicit-only.
You can override this behavior by declaring your own cast to replace an automatic cast, but usually
the only reason to do so is if you want the conversion to be more easily invokable than the standard
assignment-only or explicit-only setting. Another possible reason is that you want the conversion to
behave differently from the type's I/O function; but that is sufficiently surprising that you should think
twice about whether it's a good idea. (A small number of the built-in types do indeed have different
behaviors for conversions, mostly because of requirements of the SQL standard.)
While not required, it is recommended that you continue to follow this old convention of naming cast
implementation functions after the target data type. Many users are used to being able to cast data
types using a function-style notation, that is typename(x). This notation is in fact nothing more
nor less than a call of the cast implementation function; it is not specially treated as a cast. If your
conversion functions are not named to support this convention then you will have surprised users.
Since PostgreSQL allows overloading of the same function name with different argument types, there
is no difficulty in having multiple conversion functions from different types that all use the target
type's name.
Note
Actually the preceding paragraph is an oversimplification: there are two cases in which a func-
tion-call construct will be treated as a cast request without having matched it to an actual func-
tion. If a function call name(x) does not exactly match any existing function, but name is
the name of a data type and pg_cast provides a binary-coercible cast to this type from the
type of x, then the call will be construed as a binary-coercible cast. This exception is made
so that binary-coercible casts can be invoked using functional syntax, even though they lack
any function. Likewise, if there is no pg_cast entry but the cast would be to or from a string
type, the call will be construed as an I/O conversion cast. This exception allows I/O conversion
casts to be invoked using functional syntax.
Note
There is also an exception to the exception: I/O conversion casts from composite types to
string types cannot be invoked using functional syntax, but must be written in explicit cast
syntax (either CAST or :: notation). This exception was added because after the introduction
of automatically-provided I/O conversion casts, it was found too easy to accidentally invoke
such a cast when a function or column reference was intended.
Examples
To create an assignment cast from type bigint to type int4 using the function int4(bigint):
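CREATE CAST (bigint AS int4) WITH FUNCTION int4(bigint) AS ASSIGNMENT;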
Compatibility
The CREATE CAST command conforms to the SQL standard, except that SQL does not make pro-
visions for binary-coercible types or extra arguments to implementation functions. AS IMPLICIT
is a PostgreSQL extension, too.
See Also
CREATE FUNCTION, CREATE TYPE, DROP CAST
CREATE COLLATION
CREATE COLLATION — define a new collation
Synopsis
Description
CREATE COLLATION defines a new collation using the specified operating system locale settings,
or by copying an existing collation.
To be able to create a collation, you must have CREATE privilege on the destination schema.
Parameters
IF NOT EXISTS
Do not throw an error if a collation with the same name already exists. A notice is issued in this
case. Note that there is no guarantee that the existing collation is anything like the one that would
have been created.
name
The name of the collation. The collation name can be schema-qualified. If it is not, the collation
is defined in the current schema. The collation name must be unique within that schema. (The
system catalogs can contain collations with the same name for other encodings, but these are
ignored if the database encoding does not match.)
locale
This is a shortcut for setting LC_COLLATE and LC_CTYPE at once. If you specify this, you
cannot specify either of those parameters.
lc_collate
Use the specified operating system locale for the LC_COLLATE locale category.
lc_ctype
Use the specified operating system locale for the LC_CTYPE locale category.
provider
Specifies the provider to use for locale services associated with this collation. Possible values
are: icu, libc. libc is the default. The available choices depend on the operating system and
build options.
version
Specifies the version string to store with the collation. Normally, this should be omitted, which
will cause the version to be computed from the actual version of the collation as provided by the
operating system. This option is intended to be used by pg_upgrade for copying the version
from an existing installation.
See also ALTER COLLATION for how to handle collation version mismatches.
existing_collation
The name of an existing collation to copy. The new collation will have the same properties as the
existing one, but it will be an independent object.
Notes
Use DROP COLLATION to remove user-defined collations.
When using the libc collation provider, the locale must be applicable to the current database encod-
ing. See CREATE DATABASE for the precise rules.
Examples
To create a collation from the operating system locale fr_FR.utf8 (assuming the current database
encoding is UTF8):
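CREATE COLLATION french (locale = 'fr_FR.utf8');  -- the collation name is arbitrary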
To create a collation using the ICU provider using German phone book sort order:
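CREATE COLLATION german_phonebook (provider = icu, locale = 'de-u-co-phonebk');  -- the collation name is arbitrary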
Compatibility
There is a CREATE COLLATION statement in the SQL standard, but it is limited to copying an
existing collation. The syntax to create a new collation is a PostgreSQL extension.
See Also
ALTER COLLATION, DROP COLLATION
CREATE CONVERSION
CREATE CONVERSION — define a new encoding conversion
Synopsis
Description
CREATE CONVERSION defines a new conversion between character set encodings. Also, conver-
sions that are marked DEFAULT can be used for automatic encoding conversion between client and
server. For this purpose, two conversions, from encoding A to B and from encoding B to A, must
be defined.
To be able to create a conversion, you must have EXECUTE privilege on the function and CREATE
privilege on the destination schema.
Parameters
DEFAULT
The DEFAULT clause indicates that this conversion is the default for this particular source to
destination encoding. There should be only one default conversion in a schema for each encoding
pair.
name
The name of the conversion. The conversion name can be schema-qualified. If it is not, the con-
version is defined in the current schema. The conversion name must be unique within a schema.
source_encoding
The source encoding name.
dest_encoding
The destination encoding name.
function_name
The function used to perform the conversion. The function name can be schema-qualified. If it is
not, the function will be looked up in the schema search path. The function must have the following signature:
conv_proc(
    integer,   -- source encoding ID
    integer,   -- destination encoding ID
    cstring,   -- source string (null terminated C string)
    internal,  -- destination (fill with a null terminated C string)
    integer    -- source string length
) RETURNS void;
Notes
Use DROP CONVERSION to remove user-defined conversions.
Examples
To create a conversion from encoding UTF8 to LATIN1 using myfunc:
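CREATE CONVERSION myconv FOR 'UTF8' TO 'LATIN1' FROM myfunc;  -- "myconv" is an arbitrary conversion name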
Compatibility
CREATE CONVERSION is a PostgreSQL extension. There is no CREATE CONVERSION statement
in the SQL standard, but a CREATE TRANSLATION statement that is very similar in purpose and
syntax.
See Also
ALTER CONVERSION, CREATE FUNCTION, DROP CONVERSION
CREATE DATABASE
CREATE DATABASE — create a new database
Synopsis
Description
CREATE DATABASE creates a new PostgreSQL database.
To create a database, you must be a superuser or have the special CREATEDB privilege. See CREATE
ROLE.
By default, the new database will be created by cloning the standard system database template1. A
different template can be specified by writing TEMPLATE name. In particular, by writing TEMPLATE
template0, you can create a virgin database containing only the standard objects predefined by
your version of PostgreSQL. This is useful if you wish to avoid copying any installation-local objects
that might have been added to template1.
Parameters
name
The name of a database to create.
user_name
The role name of the user who will own the new database, or DEFAULT to use the default (namely,
the user executing the command). To create a database owned by another role, you must be a
direct or indirect member of that role, or be a superuser.
template
The name of the template from which to create the new database, or DEFAULT to use the default
template (template1).
encoding
Character set encoding to use in the new database. Specify a string constant (e.g.,
'SQL_ASCII'), or an integer encoding number, or DEFAULT to use the default encoding (name-
ly, the encoding of the template database). The character sets supported by the PostgreSQL server
are described in Section 23.3.1. See below for additional restrictions.
lc_collate
Collation order (LC_COLLATE) to use in the new database. This affects the sort order applied to
strings, e.g., in queries with ORDER BY, as well as the order used in indexes on text columns. The
default is to use the collation order of the template database. See below for additional restrictions.
lc_ctype
Character classification (LC_CTYPE) to use in the new database. This affects the categorization
of characters, e.g., lower, upper and digit. The default is to use the character classification of the
template database. See below for additional restrictions.
tablespace_name
The name of the tablespace that will be associated with the new database, or DEFAULT to use
the template database's tablespace. This tablespace will be the default tablespace used for objects
created in this database. See CREATE TABLESPACE for more information.
allowconn
If false then no one can connect to this database. The default is true, allowing connections (except
as restricted by other mechanisms, such as GRANT/REVOKE CONNECT).
connlimit
How many concurrent connections can be made to this database. -1 (the default) means no limit.
istemplate
If true, then this database can be cloned by any user with CREATEDB privileges; if false (the
default), then only superusers or the owner of the database can clone it.
Optional parameters can be written in any order, not only the order illustrated above.
Notes
CREATE DATABASE cannot be executed inside a transaction block.
Errors along the line of “could not initialize database directory” are most likely related to insufficient
permissions on the data directory, a full disk, or other file system problems.
The program createdb is a wrapper program around this command, provided for convenience.
Database-level configuration parameters (set via ALTER DATABASE) and database-level permis-
sions (set via GRANT) are not copied from the template database.
Although it is possible to copy a database other than template1 by specifying its name as the
template, this is not (yet) intended as a general-purpose “COPY DATABASE” facility. The principal
limitation is that no other sessions can be connected to the template database while it is being copied.
CREATE DATABASE will fail if any other connection exists when it starts; otherwise, new connections
to the template database are locked out until CREATE DATABASE completes. See Section 22.3 for
more information.
The character set encoding specified for the new database must be compatible with the chosen locale
settings (LC_COLLATE and LC_CTYPE). If the locale is C (or equivalently POSIX), then all encod-
ings are allowed, but for other locale settings there is only one encoding that will work properly. (On
Windows, however, UTF-8 encoding can be used with any locale.) CREATE DATABASE will allow
superusers to specify SQL_ASCII encoding regardless of the locale settings, but this choice is dep-
recated and may result in misbehavior of character-string functions if data that is not encoding-com-
patible with the locale is stored in the database.
The encoding and locale settings must match those of the template database, except when template0
is used as template. This is because other databases might contain data that does not match the speci-
fied encoding, or might contain indexes whose sort ordering is affected by LC_COLLATE and LC_C-
TYPE. Copying such data would result in a database that is corrupt according to the new settings.
template0, however, is known to not contain any data or indexes that would be affected.
The CONNECTION LIMIT option is only enforced approximately; if two new sessions start at about
the same time when just one connection “slot” remains for the database, it is possible that both will
fail. Also, the limit is not enforced against superusers or background worker processes.
Examples
To create a new database:
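CREATE DATABASE lusiadas;  -- the database name is arbitrary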
To create a database sales owned by user salesapp with a default tablespace of salesspace:
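CREATE DATABASE sales OWNER salesapp TABLESPACE salesspace;

To create a database with a nondefault locale (the database name music and the Swedish locale are illustrative):

CREATE DATABASE music
    LC_COLLATE 'sv_SE.utf8' LC_CTYPE 'sv_SE.utf8'
    TEMPLATE template0;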
In this example, the TEMPLATE template0 clause is required if the specified locale is different
from the one in template1. (If it is not, then specifying the locale explicitly is redundant.)
To create a database music2 with a different locale and a different character set encoding:
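CREATE DATABASE music2
    LC_COLLATE 'sv_SE.iso885915' LC_CTYPE 'sv_SE.iso885915'  -- illustrative locale
    ENCODING LATIN9
    TEMPLATE template0;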
The specified locale and encoding settings must match, or an error will be reported.
Note that locale names are specific to the operating system, so that the above commands might not
work in the same way everywhere.
Compatibility
There is no CREATE DATABASE statement in the SQL standard. Databases are equivalent to catalogs,
whose creation is implementation-defined.
See Also
ALTER DATABASE, DROP DATABASE
CREATE DOMAIN
CREATE DOMAIN — define a new domain
Synopsis
[ CONSTRAINT constraint_name ]
{ NOT NULL | NULL | CHECK (expression) }
Description
CREATE DOMAIN creates a new domain. A domain is essentially a data type with optional constraints
(restrictions on the allowed set of values). The user who defines a domain becomes its owner.
If a schema name is given (for example, CREATE DOMAIN myschema.mydomain ...) then the
domain is created in the specified schema. Otherwise it is created in the current schema. The domain
name must be unique among the types and domains existing in its schema.
Domains are useful for abstracting common constraints on fields into a single location for maintenance.
For example, several tables might contain email address columns, all requiring the same CHECK
constraint to verify the address syntax. Define a domain rather than setting up each table's constraint
individually.
To be able to create a domain, you must have USAGE privilege on the underlying type.
Parameters
name
The name (optionally schema-qualified) of a domain to be created.
data_type
The underlying data type of the domain. This can include array specifiers.
collation
An optional collation for the domain. If no collation is specified, the domain has the same colla-
tion behavior as its underlying data type. The underlying type must be collatable if COLLATE
is specified.
DEFAULT expression
The DEFAULT clause specifies a default value for columns of the domain data type. The value
is any variable-free expression (but subqueries are not allowed). The data type of the default
expression must match the data type of the domain. If no default value is specified, then the default
value is the null value.
The default expression will be used in any insert operation that does not specify a value for the
column. If a default value is defined for a particular column, it overrides any default associated
with the domain. In turn, the domain default overrides any default value associated with the un-
derlying data type.
CONSTRAINT constraint_name
An optional name for a constraint. If not specified, the system generates a name.
NOT NULL
Values of this domain are prevented from being null (but see notes below).
NULL
This clause is only intended for compatibility with nonstandard SQL databases. Its use is discour-
aged in new applications.
CHECK (expression)
CHECK clauses specify integrity constraints or tests which values of the domain must satisfy. Each
constraint must be an expression producing a Boolean result. It should use the key word VALUE
to refer to the value being tested. Expressions evaluating to TRUE or UNKNOWN succeed. If
the expression produces a FALSE result, an error is reported and the value is not allowed to be
converted to the domain type.
Currently, CHECK expressions cannot contain subqueries nor refer to variables other than VALUE.
When a domain has multiple CHECK constraints, they will be tested in alphabetical order by name.
(PostgreSQL versions before 9.5 did not honor any particular firing order for CHECK constraints.)
Notes
Domain constraints, particularly NOT NULL, are checked when converting a value to the domain
type. It is possible for a column that is nominally of the domain type to read as null despite there being
such a constraint. For example, this can happen in an outer-join query, if the domain column is on the
nullable side of the outer join. A more subtle example is
INSERT INTO tab (domcol) VALUES ((SELECT domcol FROM tab WHERE
false));
The empty scalar sub-SELECT will produce a null value that is considered to be of the domain type,
so no further constraint checking is applied to it, and the insertion will succeed.
It is very difficult to avoid such problems, because of SQL's general assumption that a null value is a
valid value of every data type. Best practice therefore is to design a domain's constraints so that a null
value is allowed, and then to apply column NOT NULL constraints to columns of the domain type as
needed, rather than directly to the domain type.
PostgreSQL assumes that CHECK constraints' conditions are immutable, that is, they will always give
the same result for the same input value. This assumption is what justifies examining CHECK con-
straints only when a value is first converted to be of a domain type, and not at other times. (This is
essentially the same as the treatment of table CHECK constraints, as described in Section 5.3.1.)
An example of a common way to break this assumption is to reference a user-defined function in a CHECK expression, and then change the behavior of that function. PostgreSQL does not disallow that, but it will not notice if there are stored values of the domain type that now violate the CHECK constraint. That would cause a subsequent database dump and restore to fail. The recommended way to
handle such a change is to drop the constraint (using ALTER DOMAIN), adjust the function definition,
and re-add the constraint, thereby rechecking it against stored data.
Examples
This example creates the us_postal_code data type and then uses the type in a table definition.
A regular expression test is used to verify that the value looks like a valid US postal code:
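CREATE DOMAIN us_postal_code AS TEXT
CHECK(
   VALUE ~ '^\d{5}$'
   OR VALUE ~ '^\d{5}-\d{4}$'
);

CREATE TABLE us_snail_addy (          -- table and column names are illustrative
   address_id SERIAL PRIMARY KEY,
   street1 TEXT NOT NULL,
   street2 TEXT,
   city TEXT NOT NULL,
   postal us_postal_code NOT NULL
);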
Compatibility
The command CREATE DOMAIN conforms to the SQL standard.
See Also
ALTER DOMAIN, DROP DOMAIN
CREATE EVENT TRIGGER
CREATE EVENT TRIGGER — define a new event trigger
Synopsis
Description
CREATE EVENT TRIGGER creates a new event trigger. Whenever the designated event occurs
and the WHEN condition associated with the trigger, if any, is satisfied, the trigger function will be
executed. For a general introduction to event triggers, see Chapter 40. The user who creates an event
trigger becomes its owner.
Parameters
name
The name to give the new trigger. This name must be unique within the database.
event
The name of the event that triggers a call to the given function. See Section 40.1 for more infor-
mation on event names.
filter_variable
The name of a variable used to filter events. This makes it possible to restrict the firing of the trig-
ger to a subset of the cases in which it is supported. Currently the only supported filter_vari-
able is TAG.
filter_value
A list of values for the associated filter_variable for which the trigger should fire. For
TAG, this means a list of command tags (e.g., 'DROP FUNCTION').
function_name
A user-supplied function that is declared as taking no argument and returning type even-
t_trigger.
In the syntax of CREATE EVENT TRIGGER, the keywords FUNCTION and PROCEDURE are
equivalent, but the referenced function must in any case be a function, not a procedure. The use
of the keyword PROCEDURE here is historical and deprecated.
Notes
Only superusers can create event triggers.
Event triggers are disabled in single-user mode (see postgres). If an erroneous event trigger disables
the database so much that you can't even drop the trigger, restart in single-user mode and you'll be
able to do that.
Examples
Forbid the execution of any DDL command:
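CREATE OR REPLACE FUNCTION abort_any_command()   -- function and trigger names are illustrative
  RETURNS event_trigger
  LANGUAGE plpgsql
AS $$
BEGIN
  RAISE EXCEPTION 'command % is disabled', tg_tag;
END;
$$;

CREATE EVENT TRIGGER abort_ddl ON ddl_command_start
   EXECUTE FUNCTION abort_any_command();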
Compatibility
There is no CREATE EVENT TRIGGER statement in the SQL standard.
See Also
ALTER EVENT TRIGGER, DROP EVENT TRIGGER, CREATE FUNCTION
CREATE EXTENSION
CREATE EXTENSION — install an extension
Synopsis
Description
CREATE EXTENSION loads a new extension into the current database. There must not be an exten-
sion of the same name already loaded.
Loading an extension essentially amounts to running the extension's script file. The script will typically
create new SQL objects such as functions, data types, operators and index support methods. CREATE
EXTENSION additionally records the identities of all the created objects, so that they can be dropped
again if DROP EXTENSION is issued.
Loading an extension requires the same privileges that would be required to create its component
objects. For most extensions this means superuser or database owner privileges are needed. The user
who runs CREATE EXTENSION becomes the owner of the extension for purposes of later privilege
checks, as well as the owner of any objects created by the extension's script.
Parameters
IF NOT EXISTS
Do not throw an error if an extension with the same name already exists. A notice is issued in
this case. Note that there is no guarantee that the existing extension is anything like the one that
would have been created from the currently-available script file.
extension_name
The name of the extension to be installed. PostgreSQL will create the extension using details from
the file SHAREDIR/extension/extension_name.control.
schema_name
The name of the schema in which to install the extension's objects, given that the extension al-
lows its contents to be relocated. The named schema must already exist. If not specified, and
the extension's control file does not specify a schema either, the current default object creation
schema is used.
If the extension specifies a schema parameter in its control file, then that schema cannot be
overridden with a SCHEMA clause. Normally, an error will be raised if a SCHEMA clause is given
and it conflicts with the extension's schema parameter. However, if the CASCADE clause is also
given, then schema_name is ignored when it conflicts. The given schema_name will be used
for installation of any needed extensions that do not specify schema in their control files.
Remember that the extension itself is not considered to be within any schema: extensions have
unqualified names that must be unique database-wide. But objects belonging to the extension can
be within schemas.
version
The version of the extension to install. This can be written as either an identifier or a string literal.
The default version is whatever is specified in the extension's control file.
old_version
FROM old_version must be specified when, and only when, you are attempting to install an
extension that replaces an “old style” module that is just a collection of objects not packaged into
an extension. This option causes CREATE EXTENSION to run an alternative installation script
that absorbs the existing objects into the extension, instead of creating new objects. Be careful
that SCHEMA specifies the schema containing these pre-existing objects.
The value to use for old_version is determined by the extension's author, and might vary if
there is more than one version of the old-style module that can be upgraded into an extension.
For the standard additional modules supplied with pre-9.1 PostgreSQL, use unpackaged for
old_version when updating a module to extension style.
CASCADE
Automatically install any extensions that this extension depends on that are not already installed.
Their dependencies are likewise automatically installed, recursively. The SCHEMA clause, if giv-
en, applies to all extensions that get installed this way. Other options of the statement are not
applied to automatically-installed extensions; in particular, their default versions are always se-
lected.
Notes
Before you can use CREATE EXTENSION to load an extension into a database, the extension's sup-
porting files must be installed. Information about installing the extensions supplied with PostgreSQL
can be found in Additional Supplied Modules.
The extensions currently available for loading can be identified from the pg_available_exten-
sions or pg_available_extension_versions system views.
Caution
Installing an extension as superuser requires trusting that the extension's author wrote the ex-
tension installation script in a secure fashion. It is not terribly difficult for a malicious user to
create trojan-horse objects that will compromise later execution of a carelessly-written exten-
sion script, allowing that user to acquire superuser privileges. However, trojan-horse objects
are only hazardous if they are in the search_path during script execution, meaning that
they are in the extension's installation target schema or in the schema of some extension it
depends on. Therefore, a good rule of thumb when dealing with extensions whose scripts have
not been carefully vetted is to install them only into schemas for which CREATE privilege
has not been and will not be granted to any untrusted users. Likewise for any extensions they
depend on.
The extensions supplied with PostgreSQL are believed to be secure against installation-time
attacks of this sort, except for a few that depend on other extensions. As stated in the docu-
mentation for those extensions, they should be installed into secure schemas, or installed into
the same schemas as the extensions they depend on, or both.
Examples
Install the hstore extension into the current database, placing its objects in schema addons:
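CREATE EXTENSION hstore SCHEMA addons;

To absorb a pre-9.1, unpackaged installation of hstore into an extension (assuming the old objects were installed in schema public):

CREATE EXTENSION hstore SCHEMA public FROM unpackaged;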
Be careful to specify the schema in which you installed the existing hstore objects.
Compatibility
CREATE EXTENSION is a PostgreSQL extension.
See Also
ALTER EXTENSION, DROP EXTENSION
CREATE FOREIGN DATA WRAPPER
CREATE FOREIGN DATA WRAPPER — define a new foreign-data wrapper
Synopsis
CREATE FOREIGN DATA WRAPPER name
[ HANDLER handler_function | NO HANDLER ]
[ VALIDATOR validator_function | NO VALIDATOR ]
[ OPTIONS ( option 'value' [, ... ] ) ]
Description
CREATE FOREIGN DATA WRAPPER creates a new foreign-data wrapper. The user who defines
a foreign-data wrapper becomes its owner.
Parameters
name
The name of the foreign-data wrapper to be created.
HANDLER handler_function
handler_function is the name of a previously registered function that will be called to re-
trieve the execution functions for foreign tables. The handler function must take no arguments,
and its return type must be fdw_handler.
It is possible to create a foreign-data wrapper with no handler function, but foreign tables using
such a wrapper can only be declared, not accessed.
VALIDATOR validator_function
validator_function is the name of a previously registered function that will be called to check the generic options given to the foreign-data wrapper, as well as options for foreign servers, user mappings, and foreign tables using the foreign-data wrapper. If no validator function or NO VALIDATOR is specified, then options will not be checked at creation time.
OPTIONS ( option 'value' [, ... ] )
This clause specifies options for the new foreign-data wrapper. The allowed option names and
values are specific to each foreign data wrapper and are validated using the foreign-data wrapper's
validator function. Option names must be unique.
Notes
PostgreSQL's foreign-data functionality is still under active development. Optimization of queries is
primitive (and mostly left to the wrapper, too). Thus, there is considerable room for future performance
improvements.
Examples
Create a useless foreign-data wrapper dummy:
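CREATE FOREIGN DATA WRAPPER dummy;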
Compatibility
CREATE FOREIGN DATA WRAPPER conforms to ISO/IEC 9075-9 (SQL/MED), with the exception
that the HANDLER and VALIDATOR clauses are extensions and the standard clauses LIBRARY and
LANGUAGE are not implemented in PostgreSQL.
Note, however, that the SQL/MED functionality as a whole is not yet conforming.
See Also
ALTER FOREIGN DATA WRAPPER, DROP FOREIGN DATA WRAPPER, CREATE SERVER,
CREATE USER MAPPING, CREATE FOREIGN TABLE
CREATE FOREIGN TABLE
CREATE FOREIGN TABLE — define a new foreign table
Synopsis
where column_constraint is:
[ CONSTRAINT constraint_name ]
{ NOT NULL |
NULL |
CHECK ( expression ) [ NO INHERIT ] |
DEFAULT default_expr }
and table_constraint is:
[ CONSTRAINT constraint_name ]
CHECK ( expression ) [ NO INHERIT ]
Description
CREATE FOREIGN TABLE creates a new foreign table in the current database. The table will be
owned by the user issuing the command.
If a schema name is given (for example, CREATE FOREIGN TABLE myschema.mytable ...)
then the table is created in the specified schema. Otherwise it is created in the current schema. The
name of the foreign table must be distinct from the name of any other foreign table, table, sequence,
index, view, or materialized view in the same schema.
CREATE FOREIGN TABLE also automatically creates a data type that represents the composite type
corresponding to one row of the foreign table. Therefore, foreign tables cannot have the same name
as any existing data type in the same schema.
To be able to create a foreign table, you must have USAGE privilege on the foreign server, as well as
USAGE privilege on all column types used in the table.
Parameters
IF NOT EXISTS
Do not throw an error if a relation with the same name already exists. A notice is issued in this
case. Note that there is no guarantee that the existing relation is anything like the one that would
have been created.
table_name
The name (optionally schema-qualified) of the table to be created.
column_name
The name of a column to be created in the new table.
data_type
The data type of the column. This can include array specifiers. For more information on the data
types supported by PostgreSQL, refer to Chapter 8.
COLLATE collation
The COLLATE clause assigns a collation to the column (which must be of a collatable data type).
If not specified, the column data type's default collation is used.
INHERITS ( parent_table [, ... ] )
The optional INHERITS clause specifies a list of tables from which the new foreign table auto-
matically inherits all columns. Parent tables can be plain tables or foreign tables. See the similar
form of CREATE TABLE for more details.
PARTITION OF parent_table FOR VALUES partition_bound_spec
This form can be used to create the foreign table as partition of the given parent table with specified
partition bound values. See the similar form of CREATE TABLE for more details. Note that it
is currently not allowed to create the foreign table as a partition of the parent table if there are
UNIQUE indexes on the parent table. (See also ALTER TABLE ATTACH PARTITION.)
CONSTRAINT constraint_name
An optional name for a column or table constraint. If the constraint is violated, the constraint
name is present in error messages, so constraint names like col must be positive can
be used to communicate helpful constraint information to client applications. (Double-quotes are
needed to specify constraint names that contain spaces.) If a constraint name is not specified, the
system generates a name.
NOT NULL
The column is not allowed to contain null values.
NULL
The column is allowed to contain null values. This is the default.
This clause is only provided for compatibility with non-standard SQL databases. Its use is dis-
couraged in new applications.
CHECK ( expression ) [ NO INHERIT ]
The CHECK clause specifies an expression producing a Boolean result which each row in the
foreign table is expected to satisfy; that is, the expression should produce TRUE or UNKNOWN,
never FALSE, for all rows in the foreign table. A check constraint specified as a column constraint
should reference that column's value only, while an expression appearing in a table constraint can
reference multiple columns.
Currently, CHECK expressions cannot contain subqueries nor refer to variables other than columns
of the current row. The system column tableoid may be referenced, but not any other system
column.
DEFAULT default_expr
The DEFAULT clause assigns a default data value for the column whose column definition it
appears within. The value is any variable-free expression (subqueries and cross-references to other
columns in the current table are not allowed). The data type of the default expression must match
the data type of the column.
The default expression will be used in any insert operation that does not specify a value for the
column. If there is no default for a column, then the default is null.
server_name
The name of an existing foreign server to use for the foreign table. For details on defining a server,
see CREATE SERVER.
OPTIONS ( option 'value' [, ... ] )
Options to be associated with the new foreign table or one of its columns. The allowed option
names and values are specific to each foreign data wrapper and are validated using the foreign-data
wrapper's validator function. Duplicate option names are not allowed (although it's OK for a table
option and a column option to have the same name).
Notes
Constraints on foreign tables (such as CHECK or NOT NULL clauses) are not enforced by the core
PostgreSQL system, and most foreign data wrappers do not attempt to enforce them either; that is,
the constraint is simply assumed to hold true. There would be little point in such enforcement since it
would only apply to rows inserted or updated via the foreign table, and not to rows modified by other
means, such as directly on the remote server. Instead, a constraint attached to a foreign table should
represent a constraint that is being enforced by the remote server.
Some special-purpose foreign data wrappers might be the only access mechanism for the data they
access, and in that case it might be appropriate for the foreign data wrapper itself to perform constraint
enforcement. But you should not assume that a wrapper does that unless its documentation says so.
Although PostgreSQL does not attempt to enforce constraints on foreign tables, it does assume that
they are correct for purposes of query optimization. If there are rows visible in the foreign table that
do not satisfy a declared constraint, queries on the table might produce errors or incorrect answers. It
is the user's responsibility to ensure that the constraint definition matches reality.
Caution
When a foreign table is used as a partition of a partitioned table, there is an implicit constraint
that its contents must satisfy the partitioning rule. Again, it is the user's responsibility to ensure
that that is true, which is best done by installing a matching constraint on the remote server.
Within a partitioned table containing foreign-table partitions, an UPDATE that changes the partition
key value can cause a row to be moved from a local partition to a foreign-table partition, provided the
foreign data wrapper supports tuple routing. However it is not currently possible to move a row from
a foreign-table partition to another partition. An UPDATE that would require doing that will fail due
to the partitioning constraint, assuming that that is properly enforced by the remote server.
Examples
Create foreign table films, which will be accessed through the server film_server:
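(The column definitions below are illustrative; any column list appropriate to the remote data can be used.)
CREATE FOREIGN TABLE films (
    code        char(5) NOT NULL,
    title       varchar(40) NOT NULL,
    did         integer NOT NULL,
    date_prod   date,
    kind        varchar(10),
    len         interval hour to minute
)
SERVER film_server;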
Create foreign table measurement_y2016m07, which will be accessed through the server serv-
er_07, as a partition of the range partitioned table measurement:
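(The partition bounds shown here are illustrative.)
CREATE FOREIGN TABLE measurement_y2016m07
    PARTITION OF measurement FOR VALUES FROM ('2016-07-01') TO ('2016-08-01')
    SERVER server_07;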
Compatibility
The CREATE FOREIGN TABLE command largely conforms to the SQL standard; however, much as
with CREATE TABLE, NULL constraints and zero-column foreign tables are permitted. The ability to
specify column default values is also a PostgreSQL extension. Table inheritance, in the form defined
by PostgreSQL, is nonstandard.
See Also
ALTER FOREIGN TABLE, DROP FOREIGN TABLE, CREATE TABLE, CREATE SERVER, IM-
PORT FOREIGN SCHEMA
CREATE FUNCTION
CREATE FUNCTION — define a new function
Synopsis
Description
CREATE FUNCTION defines a new function. CREATE OR REPLACE FUNCTION will either create
a new function, or replace an existing definition. To be able to define a function, the user must have
the USAGE privilege on the language.
If a schema name is included, then the function is created in the specified schema. Otherwise it is
created in the current schema. The name of the new function must not match any existing function or
procedure with the same input argument types in the same schema. However, functions and procedures
of different argument types can share a name (this is called overloading).
To replace the current definition of an existing function, use CREATE OR REPLACE FUNCTION.
It is not possible to change the name or argument types of a function this way (if you tried, you would
actually be creating a new, distinct function). Also, CREATE OR REPLACE FUNCTION will not let
you change the return type of an existing function. To do that, you must drop and recreate the function.
(When using OUT parameters, that means you cannot change the types of any OUT parameters except
by dropping the function.)
When CREATE OR REPLACE FUNCTION is used to replace an existing function, the ownership
and permissions of the function do not change. All other function properties are assigned the values
specified or implied in the command. You must own the function to replace it (this includes being a
member of the owning role).
If you drop and then recreate a function, the new function is not the same entity as the old; you will have
to drop existing rules, views, triggers, etc. that refer to the old function. Use CREATE OR REPLACE
FUNCTION to change a function definition without breaking objects that refer to the function. Also,
ALTER FUNCTION can be used to change most of the auxiliary properties of an existing function.
The user that creates the function becomes the owner of the function.
To be able to create a function, you must have USAGE privilege on the argument types and the return
type.
Parameters
name
The name (optionally schema-qualified) of the function to create.
argmode
The mode of an argument: IN, OUT, INOUT, or VARIADIC. If omitted, the default is IN. Only
OUT arguments can follow a VARIADIC one. Also, OUT and INOUT arguments cannot be used
together with the RETURNS TABLE notation.
argname
The name of an argument. Some languages (including SQL and PL/pgSQL) let you use the name
in the function body. For other languages the name of an input argument is just extra documenta-
tion, so far as the function itself is concerned; but you can use input argument names when calling
a function to improve readability (see Section 4.3). In any case, the name of an output argument
is significant, because it defines the column name in the result row type. (If you omit the name
for an output argument, the system will choose a default column name.)
argtype
The data type(s) of the function's arguments (optionally schema-qualified), if any. The argument
types can be base, composite, or domain types, or can reference the type of a table column.
default_expr
An expression to be used as default value if the parameter is not specified. The expression has
to be coercible to the argument type of the parameter. Only input (including INOUT) parameters
can have a default value. All input parameters following a parameter with a default value must
have default values as well.
rettype
The return data type (optionally schema-qualified). The return type can be a base, composite,
or domain type, or can reference the type of a table column. Depending on the implementation
language it might also be allowed to specify “pseudo-types” such as cstring. If the function is
not supposed to return a value, specify void as the return type.
When there are OUT or INOUT parameters, the RETURNS clause can be omitted. If present, it
must agree with the result type implied by the output parameters: RECORD if there are multiple
output parameters, or the same type as the single output parameter.
The SETOF modifier indicates that the function will return a set of items, rather than a single item.
column_name
The name of an output column in the RETURNS TABLE syntax. This is effectively another way
of declaring a named OUT parameter, except that RETURNS TABLE also implies RETURNS
SETOF.
column_type
The data type of an output column in the RETURNS TABLE syntax.
lang_name
The name of the language that the function is implemented in. It can be sql, c, internal, or
the name of a user-defined procedural language, e.g., plpgsql. Enclosing the name in single
quotes is deprecated and requires matching case.
TRANSFORM { FOR TYPE type_name } [, ... ]
Lists which transforms a call to the function should apply. Transforms convert between SQL
types and language-specific data types; see CREATE TRANSFORM. Procedural language im-
plementations usually have hardcoded knowledge of the built-in types, so those don't need to be
listed here. If a procedural language implementation does not know how to handle a type and no
transform is supplied, it will fall back to a default behavior for converting data types, but this
depends on the implementation.
WINDOW
WINDOW indicates that the function is a window function rather than a plain function. This is
currently only useful for functions written in C. The WINDOW attribute cannot be changed when
replacing an existing function definition.
IMMUTABLE
STABLE
VOLATILE
These attributes inform the query optimizer about the behavior of the function. At most one choice
can be specified. If none of these appear, VOLATILE is the default assumption.
IMMUTABLE indicates that the function cannot modify the database and always returns the same
result when given the same argument values; that is, it does not do database lookups or otherwise
use information not directly present in its argument list. If this option is given, any call of the
function with all-constant arguments can be immediately replaced with the function value.
STABLE indicates that the function cannot modify the database, and that within a single table
scan it will consistently return the same result for the same argument values, but that its result
could change across SQL statements. This is the appropriate selection for functions whose results
depend on database lookups, parameter variables (such as the current time zone), etc. (It is inap-
propriate for AFTER triggers that wish to query rows modified by the current command.) Also
note that the current_timestamp family of functions qualify as stable, since their values do
not change within a transaction.
VOLATILE indicates that the function value can change even within a single table scan, so no
optimizations can be made. Relatively few database functions are volatile in this sense; some
examples are random(), currval(), timeofday(). But note that any function that has
side-effects must be classified volatile, even if its result is quite predictable, to prevent calls from
being optimized away; an example is setval().
LEAKPROOF
LEAKPROOF indicates that the function has no side effects. It reveals no information about its
arguments other than by its return value. For example, a function which throws an error message
for some argument values but not others, or which includes the argument values in any error mes-
sage, is not leakproof. This affects how the system executes queries against views created with the
security_barrier option or tables with row level security enabled. The system will enforce
conditions from security policies and security barrier views before any user-supplied conditions
from the query itself that contain non-leakproof functions, in order to prevent the inadvertent ex-
posure of data. Functions and operators marked as leakproof are assumed to be trustworthy, and
may be executed before conditions from security policies and security barrier views. In addition,
functions which do not take arguments or which are not passed any arguments from the security
barrier view or table do not have to be marked as leakproof to be executed before security condi-
tions. See CREATE VIEW and Section 41.5. This option can only be set by the superuser.
CALLED ON NULL INPUT (the default) indicates that the function will be called normally
when some of its arguments are null. It is then the function author's responsibility to check for
null values if necessary and respond appropriately.
RETURNS NULL ON NULL INPUT or STRICT indicates that the function always returns null
whenever any of its arguments are null. If this parameter is specified, the function is not executed
when there are null arguments; instead a null result is assumed automatically.
SECURITY INVOKER indicates that the function is to be executed with the privileges of the
user that calls it. That is the default. SECURITY DEFINER specifies that the function is to be
executed with the privileges of the user that owns it.
The key word EXTERNAL is allowed for SQL conformance, but it is optional since, unlike in
SQL, this feature applies to all functions not only external ones.
PARALLEL
PARALLEL UNSAFE indicates that the function can't be executed in parallel mode and the pres-
ence of such a function in an SQL statement forces a serial execution plan. This is the default.
PARALLEL RESTRICTED indicates that the function can be executed in parallel mode, but the
execution is restricted to the parallel group leader. PARALLEL SAFE indicates that the function is
safe to run in parallel mode without restriction.
Functions should be labeled parallel unsafe if they modify any database state, or if they make
changes to the transaction such as using sub-transactions, or if they access sequences or attempt to
make persistent changes to settings (e.g., setval). They should be labeled as parallel restricted
if they access temporary tables, client connection state, cursors, prepared statements, or miscella-
neous backend-local state which the system cannot synchronize in parallel mode (e.g., setseed
cannot be executed other than by the group leader because a change made by another process
would not be reflected in the leader). In general, if a function is labeled as being safe when it is
restricted or unsafe, or if it is labeled as being restricted when it is in fact unsafe, it may throw
errors or produce wrong answers when used in a parallel query. C-language functions could in
theory exhibit totally undefined behavior if mislabeled, since there is no way for the system to
protect itself against arbitrary C code, but in most likely cases the result will be no worse than for
any other function. If in doubt, functions should be labeled as UNSAFE, which is the default.
COST execution_cost
A positive number giving the estimated execution cost for the function, in units of
cpu_operator_cost. If the function returns a set, this is the cost per returned row. If the cost is not specified,
1 unit is assumed for C-language and internal functions, and 100 units for functions in all other
languages. Larger values cause the planner to try to avoid evaluating the function more often than
necessary.
ROWS result_rows
A positive number giving the estimated number of rows that the planner should expect the function
to return. This is only allowed when the function is declared to return a set. The default assumption
is 1000 rows.
configuration_parameter
value
The SET clause causes the specified configuration parameter to be set to the specified value when
the function is entered, and then restored to its prior value when the function exits. SET FROM
CURRENT saves the value of the parameter that is current when CREATE FUNCTION is executed
as the value to be applied when the function is entered.
If a SET clause is attached to a function, then the effects of a SET LOCAL command executed
inside the function for the same variable are restricted to the function: the configuration parame-
ter's prior value is still restored at function exit. However, an ordinary SET command (without
LOCAL) overrides the SET clause, much as it would do for a previous SET LOCAL command:
the effects of such a command will persist after function exit, unless the current transaction is
rolled back.
See SET and Chapter 19 for more information about allowed parameter names and values.
definition
A string constant defining the function; the meaning depends on the language. It can be an internal
function name, the path to an object file, an SQL command, or text in a procedural language.
It is often helpful to use dollar quoting (see Section 4.1.2.4) to write the function definition string,
rather than the normal single quote syntax. Without dollar quoting, any single quotes or back-
slashes in the function definition must be escaped by doubling them.
obj_file, link_symbol
This form of the AS clause is used for dynamically loadable C language functions when the func-
tion name in the C language source code is not the same as the name of the SQL function. The
string obj_file is the name of the shared library file containing the compiled C function, and is
interpreted as for the LOAD command. The string link_symbol is the function's link symbol,
that is, the name of the function in the C language source code. If the link symbol is omitted,
it is assumed to be the same as the name of the SQL function being defined. The C names of
all functions must be different, so you must give overloaded C functions different C names (for
example, use the argument types as part of the C names).
When repeated CREATE FUNCTION calls refer to the same object file, the file is only loaded
once per session. To unload and reload the file (perhaps during development), start a new session.
Overloading
PostgreSQL allows function overloading; that is, the same name can be used for several different
functions so long as they have distinct input argument types. Whether or not you use it, this capability
entails security precautions when calling functions in databases where some users mistrust other users;
see Section 10.3.
Two functions are considered the same if they have the same names and input argument types, ignoring
any OUT parameters. Thus for example these declarations conflict:
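CREATE FUNCTION foo(int) ...
CREATE FUNCTION foo(int, out text) ...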
Functions that have different argument type lists will not be considered to conflict at creation time,
but if defaults are provided they might conflict in use. For example, consider
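CREATE FUNCTION foo(int, int default 42) ...
CREATE FUNCTION foo(int, int, int default 44) ...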
A call foo(10) will fail due to the ambiguity about which function should be called.
Notes
The full SQL type syntax is allowed for declaring a function's arguments and return value. However,
parenthesized type modifiers (e.g., the precision field for type numeric) are discarded by CREATE
FUNCTION. Thus for example CREATE FUNCTION foo (varchar(10)) ... is exactly the
same as CREATE FUNCTION foo (varchar) ....
When replacing an existing function with CREATE OR REPLACE FUNCTION, there are restrictions
on changing parameter names. You cannot change the name already assigned to any input parameter
(although you can add names to parameters that had none before). If there is more than one output
parameter, you cannot change the names of the output parameters, because that would change the
column names of the anonymous composite type that describes the function's result. These restrictions
are made to ensure that existing calls of the function do not stop working when it is replaced.
If a function is declared STRICT with a VARIADIC argument, the strictness check tests that the
variadic array as a whole is non-null. The function will still be called if the array has null elements.
Examples
Add two integers using a SQL function:
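CREATE FUNCTION add(integer, integer) RETURNS integer
    AS 'select $1 + $2;'
    LANGUAGE SQL
    IMMUTABLE
    RETURNS NULL ON NULL INPUT;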
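Return a record containing multiple output parameters:
CREATE FUNCTION dup(in int, out f1 int, out f2 text)
    AS $$ SELECT $1, CAST($1 AS text) || ' is text' $$
    LANGUAGE SQL;

SELECT * FROM dup(42);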
You can do the same thing more verbosely with an explicitly named composite type:
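CREATE TYPE dup_result AS (f1 int, f2 text);

CREATE FUNCTION dup(int) RETURNS dup_result
    AS $$ SELECT $1, CAST($1 AS text) || ' is text' $$
    LANGUAGE SQL;

SELECT * FROM dup(42);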
However, a TABLE function is different from the preceding examples, because it actually returns a
set of records, not just one record.
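CREATE FUNCTION dup(int) RETURNS TABLE(f1 int, f2 text)
    AS $$ SELECT $1, CAST($1 AS text) || ' is text' $$
    LANGUAGE SQL;

SELECT * FROM dup(42);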
Because a SECURITY DEFINER function is executed with the privileges of the user that owns it, care
must be taken that it cannot be misused; in particular, search_path should be set to exclude any schemas
writable by untrusted users. The following function illustrates safe usage:
CREATE FUNCTION check_password(uname TEXT, pass TEXT)
RETURNS BOOLEAN AS $$
DECLARE passed BOOLEAN;
BEGIN
        SELECT (pwd = $2) INTO passed
        FROM   pwds WHERE username = $1;
        RETURN passed;
END;
$$ LANGUAGE plpgsql
   SECURITY DEFINER
   -- Set a secure search_path: trusted schema(s), then 'pg_temp'.
   SET search_path = admin, pg_temp;
This function's intention is to access a table admin.pwds. But without the SET clause, or with a
SET clause mentioning only admin, the function could be subverted by creating a temporary table
named pwds.
Before PostgreSQL version 8.3, the SET clause was not available, and so older functions may contain
rather complicated logic to save, set, and restore search_path. The SET clause is far easier to use
for this purpose.
Another point to keep in mind is that by default, execute privilege is granted to PUBLIC for newly
created functions (see GRANT for more information). Frequently you will wish to restrict use of
a security definer function to only some users. To do that, you must revoke the default PUBLIC
privileges and then grant execute privilege selectively. To avoid having a window where the new
function is accessible to all, create it and set the privileges within a single transaction. For example:
BEGIN;
CREATE FUNCTION check_password(uname TEXT, pass TEXT) ... SECURITY
DEFINER;
REVOKE ALL ON FUNCTION check_password(uname TEXT, pass TEXT) FROM
PUBLIC;
GRANT EXECUTE ON FUNCTION check_password(uname TEXT, pass TEXT) TO
admins;
COMMIT;
Compatibility
A CREATE FUNCTION command is defined in the SQL standard. The PostgreSQL version is similar
but not fully compatible. The attributes are not portable, neither are the different available languages.
For compatibility with some other database systems, argmode can be written either before or after
argname. But only the first way is standard-compliant.
For parameter defaults, the SQL standard specifies only the syntax with the DEFAULT key word. The
syntax with = is used in T-SQL and Firebird.
See Also
ALTER FUNCTION, DROP FUNCTION, GRANT, LOAD, REVOKE
CREATE GROUP
CREATE GROUP — define a new database role
Synopsis
CREATE GROUP name [ [ WITH ] option [ ... ] ]
where option can be:
SUPERUSER | NOSUPERUSER
| CREATEDB | NOCREATEDB
| CREATEROLE | NOCREATEROLE
| INHERIT | NOINHERIT
| LOGIN | NOLOGIN
| REPLICATION | NOREPLICATION
| BYPASSRLS | NOBYPASSRLS
| CONNECTION LIMIT connlimit
| [ ENCRYPTED ] PASSWORD 'password' | PASSWORD NULL
| VALID UNTIL 'timestamp'
| IN ROLE role_name [, ...]
| IN GROUP role_name [, ...]
| ROLE role_name [, ...]
| ADMIN role_name [, ...]
| USER role_name [, ...]
| SYSID uid
Description
CREATE GROUP is now an alias for CREATE ROLE.
Compatibility
There is no CREATE GROUP statement in the SQL standard.
See Also
CREATE ROLE
CREATE INDEX
CREATE INDEX — define a new index
Synopsis
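In outline, the command has this general form (the individual items are described under Parameters below):
CREATE [ UNIQUE ] INDEX [ CONCURRENTLY ] [ [ IF NOT EXISTS ] name ] ON [ ONLY ] table_name [ USING method ]
    ( { column_name | ( expression ) } [ COLLATE collation ] [ opclass ] [ ASC | DESC ] [ NULLS { FIRST | LAST } ] [, ...] )
    [ INCLUDE ( column_name [, ...] ) ]
    [ WITH ( storage_parameter = value [, ... ] ) ]
    [ TABLESPACE tablespace_name ]
    [ WHERE predicate ]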
Description
CREATE INDEX constructs an index on the specified column(s) of the specified relation, which can
be a table or a materialized view. Indexes are primarily used to enhance database performance (though
inappropriate use can result in slower performance).
The key field(s) for the index are specified as column names, or alternatively as expressions written
in parentheses. Multiple fields can be specified if the index method supports multicolumn indexes.
An index field can be an expression computed from the values of one or more columns of the table row.
This feature can be used to obtain fast access to data based on some transformation of the basic data.
For example, an index computed on upper(col) would allow the clause WHERE upper(col)
= 'JIM' to use an index.
PostgreSQL provides the index methods B-tree, hash, GiST, SP-GiST, GIN, and BRIN. Users can
also define their own index methods, but that is fairly complicated.
When the WHERE clause is present, a partial index is created. A partial index is an index that contains
entries for only a portion of a table, usually a portion that is more useful for indexing than the rest of the
table. For example, if you have a table that contains both billed and unbilled orders where the unbilled
orders take up a small fraction of the total table and yet that is an often used section, you can improve
performance by creating an index on just that portion. Another possible application is to use WHERE
with UNIQUE to enforce uniqueness over a subset of a table. See Section 11.8 for more discussion.
The expression used in the WHERE clause can refer only to columns of the underlying table, but it can
use all columns, not just the ones being indexed. Presently, subqueries and aggregate expressions are
also forbidden in WHERE. The same restrictions apply to index fields that are expressions.
All functions and operators used in an index definition must be “immutable”, that is, their results must
depend only on their arguments and never on any outside influence (such as the contents of another
table or the current time). This restriction ensures that the behavior of the index is well-defined. To
use a user-defined function in an index expression or WHERE clause, remember to mark the function
immutable when you create it.
Parameters
UNIQUE
Causes the system to check for duplicate values in the table when the index is created (if data
already exist) and each time data is added. Attempts to insert or update data which would result
in duplicate entries will generate an error.
Additional restrictions apply when unique indexes are applied to partitioned tables; see CREATE
TABLE.
CONCURRENTLY
When this option is used, PostgreSQL will build the index without taking any locks that prevent
concurrent inserts, updates, or deletes on the table; whereas a standard index build locks out writes
(but not reads) on the table until it's done. There are several caveats to be aware of when using
this option — see Building Indexes Concurrently.
For temporary tables, CREATE INDEX is always non-concurrent, as no other session can access
them, and non-concurrent index creation is cheaper.
IF NOT EXISTS
Do not throw an error if a relation with the same name already exists. A notice is issued in this
case. Note that there is no guarantee that the existing index is anything like the one that would
have been created. Index name is required when IF NOT EXISTS is specified.
INCLUDE
The optional INCLUDE clause specifies a list of columns which will be included in the index as
non-key columns. A non-key column cannot be used in an index scan search qualification, and
it is disregarded for purposes of any uniqueness or exclusion constraint enforced by the index.
However, an index-only scan can return the contents of non-key columns without having to visit
the index's table, since they are available directly from the index entry. Thus, addition of non-key
columns allows index-only scans to be used for queries that otherwise could not use them.
It's wise to be conservative about adding non-key columns to an index, especially wide columns.
If an index tuple exceeds the maximum size allowed for the index type, data insertion will fail. In
any case, non-key columns duplicate data from the index's table and bloat the size of the index,
thus potentially slowing searches.
Columns listed in the INCLUDE clause don't need appropriate operator classes; the clause can
include columns whose data types don't have operator classes defined for a given access method.
Expressions are not supported as included columns since they cannot be used in index-only scans.
Currently, only the B-tree index access method supports this feature. In B-tree indexes, the values
of columns listed in the INCLUDE clause are included in leaf tuples which correspond to heap
tuples, but are not included in upper-level index entries used for tree navigation.
name
The name of the index to be created. No schema name can be included here; the index is always
created in the same schema as its parent table. If the name is omitted, PostgreSQL chooses a
suitable name based on the parent table's name and the indexed column name(s).
ONLY
Indicates not to recurse creating indexes on partitions, if the table is partitioned. The default is
to recurse.
table_name
The name (possibly schema-qualified) of the table to be indexed.
method
The name of the index method to be used. Choices are btree, hash, gist, spgist, gin, and
brin. The default method is btree.
column_name
The name of a column of the table.
expression
An expression based on one or more columns of the table. The expression usually must be written
with surrounding parentheses, as shown in the syntax. However, the parentheses can be omitted
if the expression has the form of a function call.
collation
The name of the collation to use for the index. By default, the index uses the collation declared for
the column to be indexed or the result collation of the expression to be indexed. Indexes with non-
default collations can be useful for queries that involve expressions using non-default collations.
opclass
The name of an operator class. See below for details.
ASC
Specifies ascending sort order (the default).
DESC
Specifies descending sort order.
NULLS FIRST
Specifies that nulls sort before non-nulls. This is the default when DESC is specified.
NULLS LAST
Specifies that nulls sort after non-nulls. This is the default when DESC is not specified.
storage_parameter
The name of an index-method-specific storage parameter. See Index Storage Parameters for de-
tails.
tablespace_name
The tablespace in which to create the index. If not specified, default_tablespace is consulted, or
temp_tablespaces for indexes on temporary tables.
predicate
The constraint expression for a partial index.
fillfactor
The fillfactor for an index is a percentage that determines how full the index method will try to
pack index pages. For B-trees, leaf pages are filled to this percentage during initial index build,
and also when extending the index at the right (adding new largest key values). If pages subse-
quently become completely full, they will be split, leading to gradual degradation in the index's
efficiency. B-trees use a default fillfactor of 90, but any integer value from 10 to 100 can be se-
lected. If the table is static then fillfactor 100 is best to minimize the index's physical size, but
for heavily updated tables a smaller fillfactor is better to minimize the need for page splits. The
other index methods use fillfactor in different but roughly analogous ways; the default fillfactor
varies between methods.
vacuum_cleanup_index_scale_factor
Per-index value for vacuum_cleanup_index_scale_factor.
buffering
Determines whether the buffering build technique described in Section 64.4.1 is used to build the
index. With OFF it is disabled, with ON it is enabled, and with AUTO it is initially disabled, but
turned on on-the-fly once the index size reaches effective_cache_size. The default is AUTO.
fastupdate
This setting controls usage of the fast update technique described in Section 66.4.1. It is a Boolean
parameter: ON enables fast update, OFF disables it. (Alternative spellings of ON and OFF are
allowed as described in Section 19.1.) The default is ON.
Note
Turning fastupdate off via ALTER INDEX prevents future insertions from going into
the list of pending index entries, but does not in itself flush previous entries. You might
want to VACUUM the table or call gin_clean_pending_list function afterward to
ensure the pending list is emptied.
gin_pending_list_limit
Custom gin_pending_list_limit parameter. This value is specified in kilobytes.
pages_per_range
Defines the number of table blocks that make up one block range for each entry of a BRIN index
(see Section 67.1 for more details). The default is 128.
autosummarize
Defines whether a summarization run is invoked for the previous page range whenever an inser-
tion is detected on the next one.
Building Indexes Concurrently
Creating an index can interfere with regular operation of a database. Normally PostgreSQL locks the
table to be indexed against writes and performs the entire index build with a single scan of the table.
Other transactions can still read the table, but if they try to insert, update, or delete rows in the table
they will block until the index build is finished. This could have a severe effect if the system is a live
production database. Very large tables can take many hours to be indexed, and even for smaller tables,
an index build can lock out writers for periods that are unacceptably long for a production system.
PostgreSQL supports building indexes without locking out writes. This method is invoked by speci-
fying the CONCURRENTLY option of CREATE INDEX. When this option is used, PostgreSQL must
perform two scans of the table, and in addition it must wait for all existing transactions that could
potentially modify or use the index to terminate. Thus this method requires more total work than a
standard index build and takes significantly longer to complete. However, since it allows normal op-
erations to continue while the index is built, this method is useful for adding new indexes in a produc-
tion environment. Of course, the extra CPU and I/O load imposed by the index creation might slow
other operations.
In a concurrent index build, the index is actually entered into the system catalogs in one transaction,
then two table scans occur in two more transactions. Before each table scan, the index build must wait
for existing transactions that have modified the table to terminate. After the second scan, the index
build must wait for any transactions that have a snapshot (see Chapter 13) predating the second scan to
terminate, including transactions used by any phase of concurrent index builds on other tables. Then
finally the index can be marked ready for use, and the CREATE INDEX command terminates. Even
then, however, the index may not be immediately usable for queries: in the worst case, it cannot be
used as long as transactions exist that predate the start of the index build.
If a problem arises while scanning the table, such as a deadlock or a uniqueness violation in a unique
index, the CREATE INDEX command will fail but leave behind an “invalid” index. This index will
be ignored for querying purposes because it might be incomplete; however it will still consume update
overhead. The psql \d command will report such an index as INVALID:
postgres=# \d tab
Table "public.tab"
Column | Type | Collation | Nullable | Default
--------+---------+-----------+----------+---------
col | integer | | |
Indexes:
"idx" btree (col) INVALID
The recommended recovery method in such cases is to drop the index and try again to perform CREATE
INDEX CONCURRENTLY. (Another possibility is to rebuild the index with REINDEX. However,
since REINDEX does not support concurrent builds, this option is unlikely to seem attractive.)
Another caveat when building a unique index concurrently is that the uniqueness constraint is already
being enforced against other transactions when the second table scan begins. This means that constraint
violations could be reported in other queries prior to the index becoming available for use, or even
in cases where the index build eventually fails. Also, if a failure does occur in the second scan, the
“invalid” index continues to enforce its uniqueness constraint afterwards.
Concurrent builds of expression indexes and partial indexes are supported. Errors occurring in the
evaluation of these expressions could cause behavior similar to that described above for unique con-
straint violations.
Regular index builds permit other regular index builds on the same table to occur simultaneously, but
only one concurrent index build can occur on a table at a time. In either case, schema modification
of the table is not allowed while the index is being built. Another difference is that a regular
CREATE INDEX command can be performed within a transaction block, but CREATE INDEX
CONCURRENTLY cannot.
Concurrent builds for indexes on partitioned tables are currently not supported. However, you may
concurrently build the index on each partition individually and then finally create the partitioned index
non-concurrently in order to reduce the time where writes to the partitioned table will be locked out.
In this case, building the partitioned index is a metadata only operation.
Notes
See Chapter 11 for information about when indexes can be used, when they are not used, and in which
particular situations they can be useful.
Currently, only the B-tree, GiST, GIN, and BRIN index methods support multicolumn indexes. Up
to 32 fields can be specified by default. (This limit can be altered when building PostgreSQL.) Only
B-tree currently supports unique indexes.
An operator class can be specified for each column of an index. The operator class identifies the
operators to be used by the index for that column. For example, a B-tree index on four-byte integers
would use the int4_ops class; this operator class includes comparison functions for four-byte inte-
gers. In practice the default operator class for the column's data type is usually sufficient. The main
point of having operator classes is that for some data types, there could be more than one meaningful
ordering. For example, we might want to sort a complex-number data type either by absolute value
or by real part. We could do this by defining two operator classes for the data type and then selecting
the proper class when creating an index. More information about operator classes is in Section 11.10
and in Section 38.15.
When CREATE INDEX is invoked on a partitioned table, the default behavior is to recurse to all
partitions to ensure they all have matching indexes. Each partition is first checked to determine whether
an equivalent index already exists, and if so, that index will become attached as a partition index to the
index being created, which will become its parent index. If no matching index exists, a new index will
be created and automatically attached; the name of the new index in each partition will be determined
as if no index name had been specified in the command. If the ONLY option is specified, no recursion
is done, and the index is marked invalid. (ALTER INDEX ... ATTACH PARTITION marks
the index valid, once all partitions acquire matching indexes.) Note, however, that any partition that
is created in the future using CREATE TABLE ... PARTITION OF will automatically have a
matching index, regardless of whether ONLY is specified.
For index methods that support ordered scans (currently, only B-tree), the optional clauses ASC, DESC,
NULLS FIRST, and/or NULLS LAST can be specified to modify the sort ordering of the index.
Since an ordered index can be scanned either forward or backward, it is not normally useful to create
a single-column DESC index — that sort ordering is already available with a regular index. The value
of these options is that multicolumn indexes can be created that match the sort ordering requested by a
mixed-ordering query, such as SELECT ... ORDER BY x ASC, y DESC. The NULLS options
are useful if you need to support “nulls sort low” behavior, rather than the default “nulls sort high”,
in queries that depend on indexes to avoid sorting steps.
The system regularly collects statistics on all of a table's columns. Newly-created non-expression
indexes can immediately use these statistics to determine an index's usefulness. For new expression
indexes, it is necessary to run ANALYZE or wait for the autovacuum daemon to analyze the table to
generate statistics for these indexes.
For most index methods, the speed of creating an index is dependent on the setting of mainte-
nance_work_mem. Larger values will reduce the time needed for index creation, so long as you don't
make it larger than the amount of memory really available, which would drive the machine into swap-
ping.
PostgreSQL can build indexes while leveraging multiple CPUs in order to process the table rows
faster. This feature is known as parallel index build. For index methods that support building indexes
in parallel (currently, only B-tree), maintenance_work_mem specifies the maximum amount of
memory that can be used by each index build operation as a whole, regardless of how many worker
processes were started. Generally, a cost model automatically determines how many worker processes
should be requested, if any.
Parallel index builds may benefit from increasing maintenance_work_mem where an equivalent
serial index build will see little or no benefit. Note that maintenance_work_mem may influence
the number of worker processes requested, since parallel workers must have at least a 32MB share of
the total maintenance_work_mem budget. There must also be a remaining 32MB share for the
leader process. Increasing max_parallel_maintenance_workers may allow more workers to be used,
which will reduce the time needed for index creation, so long as the index build is not already I/O
bound. Of course, there should also be sufficient CPU capacity that would otherwise lie idle.
Setting a value for parallel_workers via ALTER TABLE directly controls how many paral-
lel worker processes will be requested by a CREATE INDEX against the table. This bypasses the
cost model completely, and prevents maintenance_work_mem from affecting how many parallel
workers are requested. Setting parallel_workers to 0 via ALTER TABLE will disable parallel
index builds on the table in all cases.
Tip
You might want to reset parallel_workers after setting it as part of tuning an index
build. This avoids inadvertent changes to query plans, since parallel_workers affects
all parallel table scans.
While CREATE INDEX with the CONCURRENTLY option supports parallel builds without special
restrictions, only the first table scan is actually performed in parallel.
Prior releases of PostgreSQL also had an R-tree index method. This method has been removed because
it had no significant advantages over the GiST method. If USING rtree is specified, CREATE
INDEX will interpret it as USING gist, to simplify conversion of old databases to GiST.
Examples
To create a unique B-tree index on the column title in the table films:
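CREATE UNIQUE INDEX title_idx ON films (title);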
To create a unique B-tree index on the column title with included columns director and rat-
ing in the table films:
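CREATE UNIQUE INDEX title_idx ON films (title) INCLUDE (director, rating);
To create an index on the expression lower(title), allowing efficient case-insensitive searches:
CREATE INDEX ON films ((lower(title)));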
(In this example we have chosen to omit the index name, so the system will choose a name, typically
films_lower_idx.)
To create an index on the column code in the table films and have the index reside in the tablespace
indexspace:
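CREATE INDEX code_idx ON films (code) TABLESPACE indexspace;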
To create a GiST index on a point attribute so that we can efficiently use box operators on the result
of the conversion function:
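(The table and column names here are illustrative.)
CREATE INDEX pointloc
    ON points USING gist (box(location,location));
SELECT * FROM points
    WHERE box(location,location) && '(0,0),(1,1)'::box;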
Compatibility
CREATE INDEX is a PostgreSQL language extension. There are no provisions for indexes in the
SQL standard.
See Also
ALTER INDEX, DROP INDEX
CREATE LANGUAGE
CREATE LANGUAGE — define a new procedural language
Synopsis
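In outline, the command has one of these general forms (the individual items are described under Parameters below):
CREATE [ OR REPLACE ] [ PROCEDURAL ] LANGUAGE name

CREATE [ OR REPLACE ] [ TRUSTED ] [ PROCEDURAL ] LANGUAGE name
    HANDLER call_handler [ INLINE inline_handler ] [ VALIDATOR valfunction ]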
Description
CREATE LANGUAGE registers a new procedural language with a PostgreSQL database. Subsequently,
functions and procedures can be defined in this new language.
Note
As of PostgreSQL 9.1, most procedural languages have been made into “extensions”, and
should therefore be installed with CREATE EXTENSION not CREATE LANGUAGE. Direct
use of CREATE LANGUAGE should now be confined to extension installation scripts. If you
have a “bare” language in your database, perhaps as a result of an upgrade, you can convert it
to an extension using CREATE EXTENSION langname FROM unpackaged.
CREATE LANGUAGE effectively associates the language name with handler function(s) that are re-
sponsible for executing functions written in the language. Refer to Chapter 56 for more information
about language handlers.
There are two forms of the CREATE LANGUAGE command. In the first form, the user supplies just
the name of the desired language, and the PostgreSQL server consults the pg_pltemplate system
catalog to determine the correct parameters. In the second form, the user supplies the language para-
meters along with the language name. The second form can be used to create a language that is not
defined in pg_pltemplate, but this approach is considered obsolescent.
When the server finds an entry in the pg_pltemplate catalog for the given language name, it
will use the catalog data even if the command includes language parameters. This behavior simplifies
loading of old dump files, which are likely to contain out-of-date information about language support
functions.
Ordinarily, the user must have the PostgreSQL superuser privilege to register a new language. How-
ever, the owner of a database can register a new language within that database if the language is listed
in the pg_pltemplate catalog and is marked as allowed to be created by database owners
(tmpldbacreate is true). The default is that trusted languages can be created by database owners, but
this can be adjusted by superusers by modifying the contents of pg_pltemplate. The creator of a
language becomes its owner and can later drop it, rename it, or assign it to a new owner.
CREATE OR REPLACE LANGUAGE will either create a new language, or replace an existing defi-
nition. If the language already exists, its parameters are updated according to the values specified or
taken from pg_pltemplate, but the language's ownership and permissions settings do not change,
and any existing functions written in the language are assumed to still be valid. In addition to the nor-
mal privilege requirements for creating a language, the user must be superuser or owner of the existing
language. The REPLACE case is mainly meant to be used to ensure that the language exists. If the
language has a pg_pltemplate entry then REPLACE will not actually change anything about an
existing definition, except in the unusual case where the pg_pltemplate entry has been modified
since the language was created.
Parameters
TRUSTED
TRUSTED specifies that the language does not grant access to data that the user would not oth-
erwise have. If this key word is omitted when registering the language, only users with the Post-
greSQL superuser privilege can use this language to create new functions.
PROCEDURAL
This is a noise word.
name
The name of the new procedural language. The name must be unique among the languages in
the database.
HANDLER call_handler
call_handler is the name of a previously registered function that will be called to execute
the procedural language's functions. The call handler for a procedural language must be written
in a compiled language such as C with version 1 call convention and registered with PostgreSQL
as a function taking no arguments and returning the language_handler type, a placeholder
type that is simply used to identify the function as a call handler.
INLINE inline_handler
inline_handler is the name of a previously registered function that will be called to execute
an anonymous code block (DO command) in this language. If no inline_handler function
is specified, the language does not support anonymous code blocks. The handler function must
take one argument of type internal, which will be the DO command's internal representation,
and it will typically return void. The return value of the handler is ignored.
VALIDATOR valfunction
valfunction is the name of a previously registered function that will be called when a new
function in the language is created, to validate the new function. If no validator function is spec-
ified, then a new function will not be checked when it is created. The validator function must
take one argument of type oid, which will be the OID of the to-be-created function, and will
typically return void.
A validator function would typically inspect the function body for syntactical correctness, but it
can also look at other properties of the function, for example if the language cannot handle certain
argument types. To signal an error, the validator function should use the ereport() function.
The return value of the function is ignored.
The TRUSTED option and the support function name(s) are ignored if the server has an entry for the
specified language name in pg_pltemplate.
Notes
Use DROP LANGUAGE to drop procedural languages.
The system catalog pg_language (see Section 52.29) records information about the currently in-
stalled languages. Also, the psql command \dL lists the installed languages.
To create functions in a procedural language, a user must have the USAGE privilege for the language.
By default, USAGE is granted to PUBLIC (i.e., everyone) for trusted languages. This can be revoked
if desired.
Procedural languages are local to individual databases. However, a language can be installed into the
template1 database, which will cause it to be available automatically in all subsequently-created
databases.
The call handler function, the inline handler function (if any), and the validator function (if any) must
already exist if the server does not have an entry for the language in pg_pltemplate. But when
there is an entry, the functions need not already exist; they will be automatically defined if not present
in the database. (This might result in CREATE LANGUAGE failing, if the shared library that imple-
ments the language is not available in the installation.)
In PostgreSQL versions before 7.3, it was necessary to declare handler functions as returning the
placeholder type opaque, rather than language_handler. To support loading of old dump files,
CREATE LANGUAGE will accept a function declared as returning opaque, but it will issue a notice
and change the function's declared return type to language_handler.
Examples
The preferred way of creating any of the standard procedural languages is just:
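CREATE LANGUAGE plperl;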
For a language not known in the pg_pltemplate catalog, a sequence such as this is needed:
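(The language and library names here are illustrative.)
CREATE FUNCTION plsample_call_handler() RETURNS language_handler
    AS '$libdir/plsample'
    LANGUAGE C;
CREATE LANGUAGE plsample
    HANDLER plsample_call_handler;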
Compatibility
CREATE LANGUAGE is a PostgreSQL extension.
See Also
ALTER LANGUAGE, CREATE FUNCTION, DROP LANGUAGE, GRANT, REVOKE
CREATE MATERIALIZED VIEW
CREATE MATERIALIZED VIEW — define a new materialized view
Synopsis
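In outline, the command has this general form (the individual items are described under Parameters below):
CREATE MATERIALIZED VIEW [ IF NOT EXISTS ] table_name
    [ (column_name [, ...] ) ]
    [ WITH ( storage_parameter [= value] [, ... ] ) ]
    [ TABLESPACE tablespace_name ]
    AS query
    [ WITH [ NO ] DATA ]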
Description
CREATE MATERIALIZED VIEW defines a materialized view of a query. The query is executed and
used to populate the view at the time the command is issued (unless WITH NO DATA is used) and
may be refreshed later using REFRESH MATERIALIZED VIEW.
CREATE MATERIALIZED VIEW is similar to CREATE TABLE AS, except that it also remembers
the query used to initialize the view, so that it can be refreshed later upon demand. A materialized
view has many of the same properties as a table, but there is no support for temporary materialized
views or automatic generation of OIDs.
Parameters
IF NOT EXISTS
Do not throw an error if a materialized view with the same name already exists. A notice is issued
in this case. Note that there is no guarantee that the existing materialized view is anything like
the one that would have been created.
table_name
The name (optionally schema-qualified) of the materialized view to be created.
column_name
The name of a column in the new materialized view. If column names are not provided, they are
taken from the output column names of the query.
WITH ( storage_parameter [= value] [, ... ] )
This clause specifies optional storage parameters for the new materialized view; see Storage Pa-
rameters for more information. All parameters supported for CREATE TABLE are also supported
for CREATE MATERIALIZED VIEW with the exception of OIDS. See CREATE TABLE for
more information.
TABLESPACE tablespace_name
The tablespace_name is the name of the tablespace in which the new materialized view is
to be created. If not specified, default_tablespace is consulted.
query
A SELECT, TABLE, or VALUES command. This query will run within a security-restricted
operation; in particular, calls to functions that themselves create temporary tables will fail.
WITH [ NO ] DATA
This clause specifies whether or not the materialized view should be populated at creation time. If
not, the materialized view will be flagged as unscannable and cannot be queried until REFRESH
MATERIALIZED VIEW is used.
Compatibility
CREATE MATERIALIZED VIEW is a PostgreSQL extension.
See Also
ALTER MATERIALIZED VIEW, CREATE TABLE AS, CREATE VIEW, DROP
MATERIALIZED VIEW, REFRESH MATERIALIZED VIEW
CREATE OPERATOR
CREATE OPERATOR — define a new operator
Synopsis
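In outline, the command has this general form (the individual items are described under Parameters below):
CREATE OPERATOR name (
    { FUNCTION | PROCEDURE } = function_name
    [, LEFTARG = left_type ] [, RIGHTARG = right_type ]
    [, COMMUTATOR = com_op ] [, NEGATOR = neg_op ]
    [, RESTRICT = res_proc ] [, JOIN = join_proc ]
    [, HASHES ] [, MERGES ]
)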
Description
CREATE OPERATOR defines a new operator, name. The user who defines an operator becomes its
owner. If a schema name is given then the operator is created in the specified schema. Otherwise it
is created in the current schema.
The operator name is a sequence of up to NAMEDATALEN-1 (63 by default) characters from the fol-
lowing list:
+-*/<>=~!@#%^&|`?
• -- and /* cannot appear anywhere in an operator name, since they will be taken as the start of
a comment.
• A multicharacter operator name cannot end in + or -, unless the name also contains at least one
of these characters:
~!@#%^&|`?
For example, @- is an allowed operator name, but *- is not. This restriction allows PostgreSQL to
parse SQL-compliant commands without requiring spaces between tokens.
• The use of => as an operator name is deprecated. It may be disallowed altogether in a future release.
The operator != is mapped to <> on input, so these two names are always equivalent.
At least one of LEFTARG and RIGHTARG must be defined. For binary operators, both must be de-
fined. For right unary operators, only LEFTARG should be defined, while for left unary operators only
RIGHTARG should be defined.
Note
Right unary, also called postfix, operators are deprecated and will be removed in PostgreSQL
version 14.
The function_name function must have been previously defined using CREATE FUNCTION and
must be defined to accept the correct number of arguments (either one or two) of the indicated types.
In the syntax of CREATE OPERATOR, the keywords FUNCTION and PROCEDURE are equivalent,
but the referenced function must in any case be a function, not a procedure. The use of the keyword
PROCEDURE here is historical and deprecated.
The other clauses specify optional operator optimization clauses. Their meaning is detailed in Sec-
tion 38.14.
To be able to create an operator, you must have USAGE privilege on the argument types and the return
type, as well as EXECUTE privilege on the underlying function. If a commutator or negator operator
is specified, you must own these operators.
Parameters
name
The name of the operator to be defined. See above for allowable characters. The name can be
schema-qualified, for example CREATE OPERATOR myschema.+ (...). If not, then the
operator is created in the current schema. Two operators in the same schema can have the same
name if they operate on different data types. This is called overloading.
function_name
The function used to implement this operator.
left_type
The data type of the operator's left operand, if any. This option would be omitted for a left-unary
operator.
right_type
The data type of the operator's right operand, if any. This option would be omitted for a right-
unary operator.
com_op
The commutator of this operator.
neg_op
The negator of this operator.
res_proc
The restriction selectivity estimator function for this operator.
join_proc
The join selectivity estimator function for this operator.
HASHES
Indicates this operator can support a hash join.
MERGES
Indicates this operator can support a merge join.
To give a schema-qualified operator name in com_op or the other optional arguments, use the OPERATOR() syntax, for example:
COMMUTATOR = OPERATOR(myschema.===) ,
Notes
Refer to Section 38.13 for further information.
It is not possible to specify an operator's lexical precedence in CREATE OPERATOR, because the
parser's precedence behavior is hard-wired. See Section 4.1.6 for precedence details.
The obsolete options SORT1, SORT2, LTCMP, and GTCMP were formerly used to specify the names of
sort operators associated with a merge-joinable operator. This is no longer necessary, since information
about associated operators is found by looking at B-tree operator families instead. If one of these
options is given, it is ignored except for implicitly setting MERGES true.
Use DROP OPERATOR to delete user-defined operators from a database. Use ALTER OPERATOR
to modify operators in a database.
Examples
The following command defines a new operator, area-equality, for the data type box:
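-- Sketch; assumes the referenced support functions (area_equal_function,
-- area_restriction_function, area_join_function) were created beforehand
-- with CREATE FUNCTION.
CREATE OPERATOR === (
    LEFTARG = box,
    RIGHTARG = box,
    FUNCTION = area_equal_function,
    COMMUTATOR = ===,
    NEGATOR = !==,
    RESTRICT = area_restriction_function,
    JOIN = area_join_function,
    HASHES, MERGES
);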
Compatibility
CREATE OPERATOR is a PostgreSQL extension. There are no provisions for user-defined operators
in the SQL standard.
See Also
ALTER OPERATOR, CREATE OPERATOR CLASS, DROP OPERATOR
CREATE OPERATOR CLASS
CREATE OPERATOR CLASS — define a new operator class
Synopsis
Description
CREATE OPERATOR CLASS creates a new operator class. An operator class defines how a par-
ticular data type can be used with an index. The operator class specifies that certain operators will
fill particular roles or “strategies” for this data type and this index method. The operator class also
specifies the support functions to be used by the index method when the operator class is selected for
an index column. All the operators and functions used by an operator class must be defined before
the operator class can be created.
If a schema name is given then the operator class is created in the specified schema. Otherwise it is
created in the current schema. Two operator classes in the same schema can have the same name only
if they are for different index methods.
The user who defines an operator class becomes its owner. Presently, the creating user must be a
superuser. (This restriction is made because an erroneous operator class definition could confuse or
even crash the server.)
CREATE OPERATOR CLASS does not presently check whether the operator class definition includes
all the operators and functions required by the index method, nor whether the operators and functions
form a self-consistent set. It is the user's responsibility to define a valid operator class.
Related operator classes can be grouped into operator families. To add a new operator class to an
existing family, specify the FAMILY option in CREATE OPERATOR CLASS. Without this option,
the new class is placed into a family named the same as the new class (creating that family if it doesn't
already exist).
Parameters
name
The name of the operator class to be created. The name can be schema-qualified.
DEFAULT
If present, the operator class will become the default operator class for its data type. At most one
operator class can be the default for a specific data type and index method.
data_type
index_method
family_name
The name of the existing operator family to add this operator class to. If not specified, a family
named the same as the operator class is used (creating it, if it doesn't already exist).
strategy_number
The index method's strategy number for an operator associated with the operator class.
operator_name
The name (optionally schema-qualified) of an operator associated with the operator class.
op_type
In an OPERATOR clause, the operand data type(s) of the operator, or NONE to signify a left-unary
or right-unary operator. The operand data types can be omitted in the normal case where they are
the same as the operator class's data type.
In a FUNCTION clause, the operand data type(s) the function is intended to support, if different
from the input data type(s) of the function (for B-tree comparison functions and hash functions)
or the class's data type (for B-tree sort support functions and all functions in GiST, SP-GiST, GIN
and BRIN operator classes). These defaults are correct, and so op_type need not be specified in
FUNCTION clauses, except for the case of a B-tree sort support function that is meant to support
cross-data-type comparisons.
sort_family_name
The name (optionally schema-qualified) of an existing btree operator family that describes the
sort ordering associated with an ordering operator.
If neither FOR SEARCH nor FOR ORDER BY is specified, FOR SEARCH is the default.
support_number
The index method's support function number for a function associated with the operator class.
function_name
The name (optionally schema-qualified) of a function that is an index method support function
for the operator class.
argument_type
storage_type
The data type actually stored in the index. Normally this is the same as the column data type, but
some index methods (currently GiST, GIN and BRIN) allow it to be different. The STORAGE
clause must be omitted unless the index method allows a different type to be used. If the column
data_type is specified as anyarray, the storage_type can be declared as anyelement
to indicate that the index entries are members of the element type belonging to the actual array
type that each particular index is created for.
The OPERATOR, FUNCTION, and STORAGE clauses can appear in any order.
Notes
Because the index machinery does not check access permissions on functions before using them, in-
cluding a function or operator in an operator class is tantamount to granting public execute permission
on it. This is usually not an issue for the sorts of functions that are useful in an operator class.
The operators should not be defined by SQL functions. A SQL function is likely to be inlined into the
calling query, which will prevent the optimizer from recognizing that the query matches an index.
Before PostgreSQL 8.4, the OPERATOR clause could include a RECHECK option. This is no longer
supported because whether an index operator is “lossy” is now determined on-the-fly at run time. This
allows efficient handling of cases where an operator might or might not be lossy.
Examples
The following example command defines a GiST index operator class for the data type _int4 (array
of int4). See the intarray module for the complete example.
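-- Abridged sketch: the strategy numbers and g_int_* support functions are
-- those provided by the intarray module, shown here only to illustrate the
-- overall shape of an operator class definition.
CREATE OPERATOR CLASS gist__int_ops
    DEFAULT FOR TYPE _int4 USING gist AS
        OPERATOR  3  &&,
        OPERATOR  6  =  (anyarray, anyarray),
        OPERATOR  7  @>,
        OPERATOR  8  <@,
        FUNCTION  1  g_int_consistent (internal, _int4, smallint, oid, internal),
        FUNCTION  2  g_int_union (internal, internal),
        FUNCTION  3  g_int_compress (internal),
        FUNCTION  4  g_int_decompress (internal),
        FUNCTION  5  g_int_penalty (internal, internal, internal),
        FUNCTION  6  g_int_picksplit (internal, internal),
        FUNCTION  7  g_int_same (_int4, _int4, internal);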
Compatibility
CREATE OPERATOR CLASS is a PostgreSQL extension. There is no CREATE OPERATOR CLASS
statement in the SQL standard.
See Also
ALTER OPERATOR CLASS, DROP OPERATOR CLASS, CREATE OPERATOR FAMILY, AL-
TER OPERATOR FAMILY
CREATE OPERATOR FAMILY
CREATE OPERATOR FAMILY — define a new operator family
Synopsis
Description
CREATE OPERATOR FAMILY creates a new operator family. An operator family defines a collec-
tion of related operator classes, and perhaps some additional operators and support functions that are
compatible with these operator classes but not essential for the functioning of any individual index.
(Operators and functions that are essential to indexes should be grouped within the relevant operator
class, rather than being “loose” in the operator family. Typically, single-data-type operators are bound
to operator classes, while cross-data-type operators can be loose in an operator family containing op-
erator classes for both data types.)
The new operator family is initially empty. It should be populated by issuing subsequent CREATE
OPERATOR CLASS commands to add contained operator classes, and optionally ALTER OPERATOR
FAMILY commands to add “loose” operators and their corresponding support functions.
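As a rough sketch of that workflow (the family name my_ops is invented here, and the members shown are purely illustrative):
CREATE OPERATOR FAMILY my_ops USING btree;

-- Contained operator classes are then added with
-- CREATE OPERATOR CLASS ... FAMILY my_ops, and "loose" cross-data-type
-- members can be added afterwards, for example:
ALTER OPERATOR FAMILY my_ops USING btree ADD
    OPERATOR 3 = (integer, bigint),
    FUNCTION 1 btint48cmp(integer, bigint);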
If a schema name is given then the operator family is created in the specified schema. Otherwise it
is created in the current schema. Two operator families in the same schema can have the same name
only if they are for different index methods.
The user who defines an operator family becomes its owner. Presently, the creating user must be a
superuser. (This restriction is made because an erroneous operator family definition could confuse or
even crash the server.)
Parameters
name
The name of the operator family to be created. The name can be schema-qualified.
index_method
Compatibility
CREATE OPERATOR FAMILY is a PostgreSQL extension. There is no CREATE OPERATOR
FAMILY statement in the SQL standard.
See Also
ALTER OPERATOR FAMILY, DROP OPERATOR FAMILY, CREATE OPERATOR CLASS, AL-
TER OPERATOR CLASS, DROP OPERATOR CLASS
CREATE POLICY
CREATE POLICY — define a new row level security policy for a table
Synopsis
Description
The CREATE POLICY command defines a new row-level security policy for a table. Note that row-
level security must be enabled on the table (using ALTER TABLE ... ENABLE ROW LEVEL
SECURITY) in order for created policies to be applied.
A policy grants the permission to select, insert, update, or delete rows that match the relevant policy
expression. Existing table rows are checked against the expression specified in USING, while new
rows that would be created via INSERT or UPDATE are checked against the expression specified in
WITH CHECK. When a USING expression returns true for a given row then that row is visible to the
user, while if false or null is returned then the row is not visible. When a WITH CHECK expression
returns true for a row then that row is inserted or updated, while if false or null is returned then an
error occurs.
For INSERT and UPDATE statements, WITH CHECK expressions are enforced after BEFORE triggers
are fired, and before any actual data modifications are made. Thus a BEFORE ROW trigger may modify
the data to be inserted, affecting the result of the security policy check. WITH CHECK expressions
are enforced before any other constraints.
Policy names are per-table. Therefore, one policy name can be used for many different tables and have
a definition for each table which is appropriate to that table.
Policies can be applied for specific commands or for specific roles. The default for newly created
policies is that they apply for all commands and roles, unless otherwise specified. Multiple policies
may apply to a single command; see below for more details. Table 241 summarizes how the different
types of policy apply to specific commands.
For policies that can have both USING and WITH CHECK expressions (ALL and UPDATE), if no
WITH CHECK expression is defined, then the USING expression will be used both to determine
which rows are visible (normal USING case) and which new rows will be allowed to be added (WITH
CHECK case).
If row-level security is enabled for a table, but no applicable policies exist, a “default deny” policy is
assumed, so that no rows will be visible or updatable.
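For instance, after enabling row-level security on a hypothetical accounts table, a simple policy might restrict each row to its manager:
ALTER TABLE accounts ENABLE ROW LEVEL SECURITY;

CREATE POLICY account_managers ON accounts
    USING (manager = current_user);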
Parameters
name
The name of the policy to be created. This must be distinct from the name of any other policy
for the table.
table_name
The name (optionally schema-qualified) of the table the policy applies to.
PERMISSIVE
Specify that the policy is to be created as a permissive policy. All permissive policies which
are applicable to a given query will be combined together using the Boolean “OR” operator. By
creating permissive policies, administrators can add to the set of records which can be accessed.
Policies are permissive by default.
RESTRICTIVE
Specify that the policy is to be created as a restrictive policy. All restrictive policies which are
applicable to a given query will be combined together using the Boolean “AND” operator. By
creating restrictive policies, administrators can reduce the set of records which can be accessed
as all restrictive policies must be passed for each record.
Note that there needs to be at least one permissive policy to grant access to records before re-
strictive policies can be usefully used to reduce that access. If only restrictive policies exist, then
no records will be accessible. When a mix of permissive and restrictive policies are present, a
record is only accessible if at least one of the permissive policies passes, in addition to all the
restrictive policies.
command
The command to which the policy applies. Valid options are ALL, SELECT, INSERT, UPDATE,
and DELETE. ALL is the default. See below for specifics regarding how these are applied.
role_name
The role(s) to which the policy is to be applied. The default is PUBLIC, which will apply the
policy to all roles.
using_expression
Any SQL conditional expression (returning boolean). The conditional expression cannot con-
tain any aggregate or window functions. This expression will be added to queries that refer to
the table if row level security is enabled. Rows for which the expression returns true will be vis-
ible. Any rows for which the expression returns false or null will not be visible to the user (in a
SELECT), and will not be available for modification (in an UPDATE or DELETE). Such rows are
silently suppressed; no error is reported.
check_expression
Any SQL conditional expression (returning boolean). The conditional expression cannot con-
tain any aggregate or window functions. This expression will be used in INSERT and UPDATE
queries against the table if row level security is enabled. Only rows for which the expression
evaluates to true will be allowed. An error will be thrown if the expression evaluates to false or
null for any of the records inserted or any of the records that result from the update. Note that
the check_expression is evaluated against the proposed new contents of the row, not the
original contents.
Per-Command Policies
ALL
Using ALL for a policy means that it will apply to all commands, regardless of the type of com-
mand. If an ALL policy exists and more specific policies exist, then both the ALL policy and the
more specific policy (or policies) will be applied. Additionally, ALL policies will be applied to
both the selection side of a query and the modification side, using the USING expression for both
cases if only a USING expression has been defined.
As an example, if an UPDATE is issued, then the ALL policy will be applicable both to what the
UPDATE will be able to select as rows to be updated (applying the USING expression), and to
the resulting updated rows, to check if they are permitted to be added to the table (applying the
WITH CHECK expression, if defined, and the USING expression otherwise). If an INSERT or
UPDATE command attempts to add rows to the table that do not pass the ALL policy's WITH
CHECK expression, the entire command will be aborted.
SELECT
Using SELECT for a policy means that it will apply to SELECT queries and whenever SELECT
permissions are required on the relation the policy is defined for. The result is that only those
records from the relation that pass the SELECT policy will be returned during a SELECT query,
and that queries that require SELECT permissions, such as UPDATE, will also only see those
records that are allowed by the SELECT policy. A SELECT policy cannot have a WITH CHECK
expression, as it only applies in cases where records are being retrieved from the relation.
INSERT
Using INSERT for a policy means that it will apply to INSERT commands. Rows being inserted
that do not pass this policy will result in a policy violation error, and the entire INSERT command
will be aborted. An INSERT policy cannot have a USING expression, as it only applies in cases
where records are being added to the relation.
Note that INSERT with ON CONFLICT DO UPDATE checks INSERT policies' WITH CHECK
expressions only for rows appended to the relation by the INSERT path.
UPDATE
Using UPDATE for a policy means that it will apply to UPDATE, SELECT FOR UPDATE and
SELECT FOR SHARE commands, as well as auxiliary ON CONFLICT DO UPDATE clauses
of INSERT commands. Since UPDATE involves pulling an existing record and replacing it with
a new modified record, UPDATE policies accept both a USING expression and a WITH CHECK
expression. The USING expression determines which records the UPDATE command will see to
operate against, while the WITH CHECK expression defines which modified rows are allowed
to be stored back into the relation.
Any rows whose updated values do not pass the WITH CHECK expression will cause an error,
and the entire command will be aborted. If only a USING clause is specified, then that clause will
be used for both USING and WITH CHECK cases.
Typically an UPDATE command also needs to read data from columns in the relation being up-
dated (e.g., in a WHERE clause or a RETURNING clause, or in an expression on the right hand side
of the SET clause). In this case, SELECT rights are also required on the relation being updated,
and the appropriate SELECT or ALL policies will be applied in addition to the UPDATE policies.
Thus the user must have access to the row(s) being updated through a SELECT or ALL policy in
addition to being granted permission to update the row(s) via an UPDATE or ALL policy.
DELETE
Using DELETE for a policy means that it will apply to DELETE commands. Only rows that pass
this policy will be seen by a DELETE command. There can be rows that are visible through a
SELECT that are not available for deletion, if they do not pass the USING expression for the
DELETE policy.
In most cases a DELETE command also needs to read data from columns in the relation that it
is deleting from (e.g., in a WHERE clause or a RETURNING clause). In this case, SELECT rights
are also required on the relation, and the appropriate SELECT or ALL policies will be applied
in addition to the DELETE policies. Thus the user must have access to the row(s) being deleted
through a SELECT or ALL policy in addition to being granted permission to delete the row(s)
via a DELETE or ALL policy.
A DELETE policy cannot have a WITH CHECK expression, as it only applies in cases where
records are being deleted from the relation, so that there is no new row to check.
When multiple policies of the same command type apply to the same command, then there must
be at least one PERMISSIVE policy granting access to the relation, and all of the RESTRICTIVE
policies must pass. Thus all the PERMISSIVE policy expressions are combined using OR, all the
RESTRICTIVE policy expressions are combined using AND, and the results are combined using AND.
If there are no PERMISSIVE policies, then access is denied.
Note that, for the purposes of combining multiple policies, ALL policies are treated as having the same
type as whichever other type of policy is being applied.
For example, in an UPDATE command requiring both SELECT and UPDATE permissions, if there are
multiple applicable policies of each type, they will be combined as follows:
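expression from RESTRICTIVE SELECT/ALL policy 1
AND
expression from RESTRICTIVE SELECT/ALL policy 2
AND
...
AND
(
  expression from PERMISSIVE SELECT/ALL policy 1
  OR
  expression from PERMISSIVE SELECT/ALL policy 2
  OR
  ...
)
AND
expression from RESTRICTIVE UPDATE/ALL policy 1
AND
expression from RESTRICTIVE UPDATE/ALL policy 2
AND
...
AND
(
  expression from PERMISSIVE UPDATE/ALL policy 1
  OR
  expression from PERMISSIVE UPDATE/ALL policy 2
  OR
  ...
)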
Notes
You must be the owner of a table to create or change policies for it.
While policies will be applied for explicit queries against tables in the database, they are not ap-
plied when the system is performing internal referential integrity checks or validating constraints. This
means there are indirect ways to determine that a given value exists. An example of this is attempting
to insert a duplicate value into a column that is a primary key or has a unique constraint. If the insert
fails then the user can infer that the value already exists. (This example assumes that the user is per-
mitted by policy to insert records which they are not allowed to see.) Another example is where a user
is allowed to insert into a table which references another, otherwise hidden table. Existence can be
determined by the user inserting values into the referencing table, where success would indicate that
the value exists in the referenced table. These issues can be addressed by carefully crafting policies to
prevent users from being able to insert, delete, or update records at all which might possibly indicate
a value they are not otherwise able to see, or by using generated values (e.g., surrogate keys) instead
of keys with external meanings.
Generally, the system will enforce filter conditions imposed using security policies prior to qualifica-
tions that appear in user queries, in order to prevent inadvertent exposure of the protected data to user-
defined functions which might not be trustworthy. However, functions and operators marked by the
system (or the system administrator) as LEAKPROOF may be evaluated before policy expressions, as
they are assumed to be trustworthy.
Since policy expressions are added to the user's query directly, they will be run with the rights of the
user running the overall query. Therefore, users who are using a given policy must be able to access
any tables or functions referenced in the expression or they will simply receive a permission denied
error when attempting to query the table that has row-level security enabled. This does not change
how views work, however. As with normal queries and views, permission checks and policies for the
tables which are referenced by a view will use the view owner's rights and any policies which apply
to the view owner.
Compatibility
CREATE POLICY is a PostgreSQL extension.
See Also
ALTER POLICY, DROP POLICY, ALTER TABLE
CREATE PROCEDURE
CREATE PROCEDURE — define a new procedure
Synopsis
Description
CREATE PROCEDURE defines a new procedure. CREATE OR REPLACE PROCEDURE will either
create a new procedure, or replace an existing definition. To be able to define a procedure, the user
must have the USAGE privilege on the language.
If a schema name is included, then the procedure is created in the specified schema. Otherwise it is
created in the current schema. The name of the new procedure must not match any existing procedure
or function with the same input argument types in the same schema. However, procedures and func-
tions of different argument types can share a name (this is called overloading).
To replace the current definition of an existing procedure, use CREATE OR REPLACE PROCEDURE.
It is not possible to change the name or argument types of a procedure this way (if you tried, you
would actually be creating a new, distinct procedure).
When CREATE OR REPLACE PROCEDURE is used to replace an existing procedure, the ownership
and permissions of the procedure do not change. All other procedure properties are assigned the values
specified or implied in the command. You must own the procedure to replace it (this includes being
a member of the owning role).
The user that creates the procedure becomes the owner of the procedure.
To be able to create a procedure, you must have USAGE privilege on the argument types.
Parameters
name
argmode
The mode of an argument: IN, INOUT, or VARIADIC. If omitted, the default is IN. (OUT argu-
ments are currently not supported for procedures. Use INOUT instead.)
argname
argtype
The data type(s) of the procedure's arguments (optionally schema-qualified), if any. The argument
types can be base, composite, or domain types, or can reference the type of a table column.
default_expr
An expression to be used as default value if the parameter is not specified. The expression has to
be coercible to the argument type of the parameter. All input parameters following a parameter
with a default value must have default values as well.
lang_name
The name of the language that the procedure is implemented in. It can be sql, c, internal,
or the name of a user-defined procedural language, e.g., plpgsql. Enclosing the name in single
quotes is deprecated and requires matching case.
TRANSFORM { FOR TYPE type_name } [, ... ]
Lists which transforms a call to the procedure should apply. Transforms convert between SQL types and language-specific data types; see CREATE TRANSFORM. Procedural language implementations usually have hardcoded knowledge of the built-in types, so those don't need to be listed here. If a procedural language implementation does not know how to handle a type and no transform is supplied, it will fall back to a default behavior for converting data types, but this depends on the implementation.
[EXTERNAL] SECURITY INVOKER
[EXTERNAL] SECURITY DEFINER
SECURITY INVOKER indicates that the procedure is to be executed with the privileges of the user that calls it. That is the default. SECURITY DEFINER specifies that the procedure is to be executed with the privileges of the user that owns it.
The key word EXTERNAL is allowed for SQL conformance, but it is optional since, unlike in SQL, this feature applies to all procedures, not only external ones.
A SECURITY DEFINER procedure cannot execute transaction control statements (for example,
COMMIT and ROLLBACK, depending on the language).
configuration_parameter
value
The SET clause causes the specified configuration parameter to be set to the specified value
when the procedure is entered, and then restored to its prior value when the procedure exits. SET
FROM CURRENT saves the value of the parameter that is current when CREATE PROCEDURE
is executed as the value to be applied when the procedure is entered.
If a SET clause is attached to a procedure, then the effects of a SET LOCAL command execut-
ed inside the procedure for the same variable are restricted to the procedure: the configuration
parameter's prior value is still restored at procedure exit. However, an ordinary SET command
(without LOCAL) overrides the SET clause, much as it would do for a previous SET LOCAL
command: the effects of such a command will persist after procedure exit, unless the current
transaction is rolled back.
If a SET clause is attached to a procedure, then that procedure cannot execute transaction control
statements (for example, COMMIT and ROLLBACK, depending on the language).
See SET and Chapter 19 for more information about allowed parameter names and values.
definition
A string constant defining the procedure; the meaning depends on the language. It can be an
internal procedure name, the path to an object file, an SQL command, or text in a procedural
language.
It is often helpful to use dollar quoting (see Section 4.1.2.4) to write the procedure definition
string, rather than the normal single quote syntax. Without dollar quoting, any single quotes or
backslashes in the procedure definition must be escaped by doubling them.
obj_file, link_symbol
This form of the AS clause is used for dynamically loadable C language procedures when the
procedure name in the C language source code is not the same as the name of the SQL procedure.
The string obj_file is the name of the shared library file containing the compiled C procedure,
and is interpreted as for the LOAD command. The string link_symbol is the procedure's link
symbol, that is, the name of the procedure in the C language source code. If the link symbol is
omitted, it is assumed to be the same as the name of the SQL procedure being defined.
When repeated CREATE PROCEDURE calls refer to the same object file, the file is only loaded
once per session. To unload and reload the file (perhaps during development), start a new session.
Notes
See CREATE FUNCTION for more details on function creation that also apply to procedures.
Examples
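A minimal sketch: a procedure that inserts its two arguments into a hypothetical table tbl, followed by a call to it:
CREATE PROCEDURE insert_data(a integer, b integer)
LANGUAGE SQL
AS $$
INSERT INTO tbl VALUES (a);
INSERT INTO tbl VALUES (b);
$$;

CALL insert_data(1, 2);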
Compatibility
A CREATE PROCEDURE command is defined in the SQL standard. The PostgreSQL version is similar
but not fully compatible. For details see also CREATE FUNCTION.
See Also
ALTER PROCEDURE, DROP PROCEDURE, CALL, CREATE FUNCTION
CREATE PUBLICATION
CREATE PUBLICATION — define a new publication
Synopsis
Description
CREATE PUBLICATION adds a new publication into the current database. The publication name
must be distinct from the name of any existing publication in the current database.
A publication is essentially a group of tables whose data changes are intended to be replicated through
logical replication. See Section 31.1 for details about how publications fit into the logical replication
setup.
Parameters
name
FOR TABLE
Specifies a list of tables to add to the publication. If ONLY is specified before the table name, only
that table is added to the publication. If ONLY is not specified, the table and all its descendant
tables (if any) are added. Optionally, * can be specified after the table name to explicitly indicate
that descendant tables are included.
Only persistent base tables can be part of a publication. Temporary tables, unlogged tables, foreign
tables, materialized views, regular views, and partitioned tables cannot be part of a publication.
To replicate a partitioned table, add the individual partitions to the publication.
FOR ALL TABLES
Marks the publication as one that replicates changes for all tables in the database, including tables created in the future.
WITH ( publication_parameter [= value] [, ... ] )
This clause specifies optional parameters for a publication. The following parameters are supported:
publish (string)
This parameter determines which DML operations will be published by the new publication to the subscribers. The value is a comma-separated list of operations. The allowed operations are insert, update, delete, and truncate. The default is to publish all actions, and so the default value for this option is 'insert, update, delete, truncate'.
Notes
If neither FOR TABLE nor FOR ALL TABLES is specified, then the publication starts out with an
empty set of tables. That is useful if tables are to be added later.
The creation of a publication does not start replication. It only defines a grouping and filtering logic
for future subscribers.
To create a publication, the invoking user must have the CREATE privilege for the current database.
(Of course, superusers bypass this check.)
To add a table to a publication, the invoking user must have ownership rights on the table. The FOR
ALL TABLES clause requires the invoking user to be a superuser.
The tables added to a publication that publishes UPDATE and/or DELETE operations must have RE-
PLICA IDENTITY defined. Otherwise those operations will be disallowed on those tables.
For an INSERT ... ON CONFLICT command, the publication will publish the operation that ac-
tually results from the command. So depending on the outcome, it may be published as either INSERT
or UPDATE, or it may not be published at all.
Examples
Create a publication that publishes all changes in two tables:
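-- users and departments stand in for any two existing tables
CREATE PUBLICATION mypublication FOR TABLE users, departments;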
Compatibility
CREATE PUBLICATION is a PostgreSQL extension.
See Also
ALTER PUBLICATION, DROP PUBLICATION, CREATE SUBSCRIPTION, ALTER
SUBSCRIPTION
CREATE ROLE
CREATE ROLE — define a new database role
Synopsis
CREATE ROLE name [ [ WITH ] option [ ... ] ]
where option can be:
SUPERUSER | NOSUPERUSER
| CREATEDB | NOCREATEDB
| CREATEROLE | NOCREATEROLE
| INHERIT | NOINHERIT
| LOGIN | NOLOGIN
| REPLICATION | NOREPLICATION
| BYPASSRLS | NOBYPASSRLS
| CONNECTION LIMIT connlimit
| [ ENCRYPTED ] PASSWORD 'password' | PASSWORD NULL
| VALID UNTIL 'timestamp'
| IN ROLE role_name [, ...]
| IN GROUP role_name [, ...]
| ROLE role_name [, ...]
| ADMIN role_name [, ...]
| USER role_name [, ...]
| SYSID uid
Description
CREATE ROLE adds a new role to a PostgreSQL database cluster. A role is an entity that can own
database objects and have database privileges; a role can be considered a “user”, a “group”, or both
depending on how it is used. Refer to Chapter 21 and Chapter 20 for information about managing
users and authentication. You must have CREATEROLE privilege or be a database superuser to use
this command.
Note that roles are defined at the database cluster level, and so are valid in all databases in the cluster.
Parameters
name
SUPERUSER
NOSUPERUSER
These clauses determine whether the new role is a “superuser”, who can override all access re-
strictions within the database. Superuser status is dangerous and should be used only when really
needed. You must yourself be a superuser to create a new superuser. If not specified, NOSUPE-
RUSER is the default.
CREATEDB
NOCREATEDB
These clauses define a role's ability to create databases. If CREATEDB is specified, the role being
defined will be allowed to create new databases. Specifying NOCREATEDB will deny a role the
ability to create databases. If not specified, NOCREATEDB is the default.
CREATEROLE
NOCREATEROLE
These clauses determine whether a role will be permitted to create new roles (that is, execute
CREATE ROLE). A role with CREATEROLE privilege can also alter and drop other roles. If not
specified, NOCREATEROLE is the default.
INHERIT
NOINHERIT
These clauses determine whether a role “inherits” the privileges of roles it is a member of. A
role with the INHERIT attribute can automatically use whatever database privileges have been
granted to all roles it is directly or indirectly a member of. Without INHERIT, membership in
another role only grants the ability to SET ROLE to that other role; the privileges of the other
role are only available after having done so. If not specified, INHERIT is the default.
LOGIN
NOLOGIN
These clauses determine whether a role is allowed to log in; that is, whether the role can be
given as the initial session authorization name during client connection. A role having the LOGIN
attribute can be thought of as a user. Roles without this attribute are useful for managing database
privileges, but are not users in the usual sense of the word. If not specified, NOLOGIN is the
default, except when CREATE ROLE is invoked through its alternative spelling CREATE USER.
REPLICATION
NOREPLICATION
These clauses determine whether a role is a replication role. A role must have this attribute (or be
a superuser) in order to be able to connect to the server in replication mode (physical or logical
replication) and in order to be able to create or drop replication slots. A role having the REPLI-
CATION attribute is a very highly privileged role, and should only be used on roles actually used
for replication. If not specified, NOREPLICATION is the default. You must be a superuser to
create a new role having the REPLICATION attribute.
BYPASSRLS
NOBYPASSRLS
These clauses determine whether a role bypasses every row-level security (RLS) policy. NOBY-
PASSRLS is the default. You must be a superuser to create a new role having the BYPASSRLS
attribute.
Note that pg_dump will set row_security to OFF by default, to ensure all contents of a table
are dumped out. If the user running pg_dump does not have appropriate permissions, an error will
be returned. However, superusers and the owner of the table being dumped always bypass RLS.
CONNECTION LIMIT connlimit
If role can log in, this specifies how many concurrent connections the role can make. -1 (the default) means no limit. Note that only normal connections are counted towards this limit. Neither prepared transactions nor background worker connections are counted towards this limit.
[ ENCRYPTED ] PASSWORD 'password'
PASSWORD NULL
Sets the role's password. (A password is only of use for roles having the LOGIN attribute, but you can nonetheless define one for roles without it.) If you do not plan to use password authentication you can omit this option. If no password is specified, the password will be set to null and password authentication will always fail for that user. A null password can optionally be written explicitly as PASSWORD NULL.
Note
Specifying an empty string will also set the password to null, but that was not the case
before PostgreSQL version 10. In earlier versions, an empty string could be used, or not,
depending on the authentication method and the exact version, and libpq would refuse to
use it in any case. To avoid the ambiguity, specifying an empty string should be avoided.
The password is always stored encrypted in the system catalogs. The ENCRYPTED keyword
has no effect, but is accepted for backwards compatibility. The method of encryption is deter-
mined by the configuration parameter password_encryption. If the presented password string
is already in MD5-encrypted or SCRAM-encrypted format, then it is stored as-is regardless of
password_encryption (since the system cannot decrypt the specified encrypted password
string, to encrypt it in a different format). This allows reloading of encrypted passwords during
dump/restore.
VALID UNTIL 'timestamp'
The VALID UNTIL clause sets a date and time after which the role's password is no longer valid. If this clause is omitted the password will be valid for all time.
IN ROLE role_name
The IN ROLE clause lists one or more existing roles to which the new role will be immediately
added as a new member. (Note that there is no option to add the new role as an administrator; use
a separate GRANT command to do that.)
IN GROUP role_name
ROLE role_name
The ROLE clause lists one or more existing roles which are automatically added as members of
the new role. (This in effect makes the new role a “group”.)
ADMIN role_name
The ADMIN clause is like ROLE, but the named roles are added to the new role WITH ADMIN
OPTION, giving them the right to grant membership in this role to others.
USER role_name
SYSID uid
Notes
Use ALTER ROLE to change the attributes of a role, and DROP ROLE to remove a role. All the
attributes specified by CREATE ROLE can be modified by later ALTER ROLE commands.
The preferred way to add and remove members of roles that are being used as groups is to use GRANT
and REVOKE.
The VALID UNTIL clause defines an expiration time for a password only, not for the role per se. In
particular, the expiration time is not enforced when logging in using a non-password-based authenti-
cation method.
The INHERIT attribute governs inheritance of grantable privileges (that is, access privileges for data-
base objects and role memberships). It does not apply to the special role attributes set by CREATE
ROLE and ALTER ROLE. For example, being a member of a role with CREATEDB privilege does not
immediately grant the ability to create databases, even if INHERIT is set; it would be necessary to
become that role via SET ROLE before creating a database.
The INHERIT attribute is the default for reasons of backwards compatibility: in prior releases of
PostgreSQL, users always had access to all privileges of groups they were members of. However,
NOINHERIT provides a closer match to the semantics specified in the SQL standard.
Be careful with the CREATEROLE privilege. There is no concept of inheritance for the privileges of
a CREATEROLE-role. That means that even if a role does not have a certain privilege but is allowed
to create other roles, it can easily create another role with different privileges than its own (except
for creating roles with superuser privileges). For example, if the role “user” has the CREATEROLE
privilege but not the CREATEDB privilege, nonetheless it can create a new role with the CREATEDB
privilege. Therefore, regard roles that have the CREATEROLE privilege as almost-superuser-roles.
PostgreSQL includes a program createuser that has the same functionality as CREATE ROLE (in fact,
it calls this command) but can be run from the command shell.
The CONNECTION LIMIT option is only enforced approximately; if two new sessions start at about
the same time when just one connection “slot” remains for the role, it is possible that both will fail.
Also, the limit is never enforced for superusers.
Caution must be exercised when specifying an unencrypted password with this command. The pass-
word will be transmitted to the server in cleartext, and it might also be logged in the client's command
history or the server log. The command createuser, however, transmits the password encrypted. Also,
psql contains a command \password that can be used to safely change the password later.
Examples
Create a role that can log in, but don't give it a password:
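CREATE ROLE jonathan LOGIN;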
(CREATE USER is the same as CREATE ROLE except that it implies LOGIN.)
Create a role with a password that is valid until the end of 2004. After one second has ticked in 2005,
the password is no longer valid.
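CREATE ROLE miriam WITH LOGIN PASSWORD 'jw8s0F4' VALID UNTIL '2005-01-01';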
Compatibility
The CREATE ROLE statement is in the SQL standard, but the standard only requires the syntax
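CREATE ROLE name [ WITH ADMIN role_name ]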
Multiple initial administrators, and all the other options of CREATE ROLE, are PostgreSQL exten-
sions.
The SQL standard defines the concepts of users and roles, but it regards them as distinct concepts and
leaves all commands defining users to be specified by each database implementation. In PostgreSQL
we have chosen to unify users and roles into a single kind of entity. Roles therefore have many more
optional attributes than they do in the standard.
The behavior specified by the SQL standard is most closely approximated by giving users the NOIN-
HERIT attribute, while roles are given the INHERIT attribute.
See Also
SET ROLE, ALTER ROLE, DROP ROLE, GRANT, REVOKE, createuser
CREATE RULE
CREATE RULE — define a new rewrite rule
Synopsis
Description
CREATE RULE defines a new rule applying to a specified table or view. CREATE OR REPLACE
RULE will either create a new rule, or replace an existing rule of the same name for the same table.
The PostgreSQL rule system allows one to define an alternative action to be performed on insertions,
updates, or deletions in database tables. Roughly speaking, a rule causes additional commands to be
executed when a given command on a given table is executed. Alternatively, an INSTEAD rule can
replace a given command by another, or cause a command not to be executed at all. Rules are used to
implement SQL views as well. It is important to realize that a rule is really a command transformation
mechanism, or command macro. The transformation happens before the execution of the command
starts. If you actually want an operation that fires independently for each physical row, you probably
want to use a trigger, not a rule. More information about the rules system is in Chapter 41.
Presently, ON SELECT rules must be unconditional INSTEAD rules and must have actions that consist
of a single SELECT command. Thus, an ON SELECT rule effectively turns the table into a view,
whose visible contents are the rows returned by the rule's SELECT command rather than whatever had
been stored in the table (if anything). It is considered better style to write a CREATE VIEW command
than to create a real table and define an ON SELECT rule for it.
You can create the illusion of an updatable view by defining ON INSERT, ON UPDATE, and ON
DELETE rules (or any subset of those that's sufficient for your purposes) to replace update actions on
the view with appropriate updates on other tables. If you want to support INSERT RETURNING and
so on, then be sure to put a suitable RETURNING clause into each of these rules.
There is a catch if you try to use conditional rules for complex view updates: there must be an uncon-
ditional INSTEAD rule for each action you wish to allow on the view. If the rule is conditional, or is
not INSTEAD, then the system will still reject attempts to perform the update action, because it thinks
it might end up trying to perform the action on the dummy table of the view in some cases. If you want
to handle all the useful cases in conditional rules, add an unconditional DO INSTEAD NOTHING
rule to ensure that the system understands it will never be called on to update the dummy table. Then
make the conditional rules non-INSTEAD; in the cases where they are applied, they add to the default
INSTEAD NOTHING action. (This method does not currently work to support RETURNING queries,
however.)
Note
A view that is simple enough to be automatically updatable (see CREATE VIEW) does not
require a user-created rule in order to be updatable. While you can create an explicit rule
anyway, the automatic update transformation will generally outperform an explicit rule.
Another alternative worth considering is to use INSTEAD OF triggers (see CREATE TRIG-
GER) in place of rules.
Parameters
name
The name of a rule to create. This must be distinct from the name of any other rule for the same
table. Multiple rules on the same table and same event type are applied in alphabetical name order.
event
The event is one of SELECT, INSERT, UPDATE, or DELETE. Note that an INSERT containing
an ON CONFLICT clause cannot be used on tables that have either INSERT or UPDATE rules.
Consider using an updatable view instead.
table_name
The name (optionally schema-qualified) of the table or view the rule applies to.
condition
Any SQL conditional expression (returning boolean). The condition expression cannot refer to
any tables except NEW and OLD, and cannot contain aggregate functions.
INSTEAD
INSTEAD indicates that the commands should be executed instead of the original command.
ALSO
ALSO indicates that the commands should be executed in addition to the original command.
command
The command or commands that make up the rule action. Valid commands are SELECT,
INSERT, UPDATE, DELETE, or NOTIFY.
Within condition and command, the special table names NEW and OLD can be used to refer to
values in the referenced table. NEW is valid in ON INSERT and ON UPDATE rules to refer to the
new row being inserted or updated. OLD is valid in ON UPDATE and ON DELETE rules to refer to
the existing row being updated or deleted.
Notes
You must be the owner of a table to create or change rules for it.
In a rule for INSERT, UPDATE, or DELETE on a view, you can add a RETURNING clause that emits
the view's columns. This clause will be used to compute the outputs if the rule is triggered by an
INSERT RETURNING, UPDATE RETURNING, or DELETE RETURNING command respective-
ly. When the rule is triggered by a command without RETURNING, the rule's RETURNING clause
will be ignored. The current implementation allows only unconditional INSTEAD rules to contain
RETURNING; furthermore there can be at most one RETURNING clause among all the rules for the
same event. (This ensures that there is only one candidate RETURNING clause to be used to compute
the results.) RETURNING queries on the view will be rejected if there is no RETURNING clause in
any available rule.
It is very important to take care to avoid circular rules. For example, though each of the following
two rule definitions are accepted by PostgreSQL, the SELECT command would cause PostgreSQL to
report an error because of recursive expansion of a rule:
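-- t1 and t2 are two otherwise ordinary tables
CREATE RULE "_RETURN" AS
    ON SELECT TO t1
    DO INSTEAD
        SELECT * FROM t2;

CREATE RULE "_RETURN" AS
    ON SELECT TO t2
    DO INSTEAD
        SELECT * FROM t1;

SELECT * FROM t1;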
Presently, if a rule action contains a NOTIFY command, the NOTIFY command will be executed
unconditionally, that is, the NOTIFY will be issued even if there are not any rows that the rule should
apply to. For example, in:
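CREATE RULE notify_me AS ON UPDATE TO mytable DO ALSO NOTIFY mytable;

UPDATE mytable SET name = 'foo' WHERE id = 42;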
one NOTIFY event will be sent during the UPDATE, whether or not there are any rows that match the
condition id = 42. This is an implementation restriction that might be fixed in future releases.
Compatibility
CREATE RULE is a PostgreSQL language extension, as is the entire query rewrite system.
See Also
ALTER RULE, DROP RULE
CREATE SCHEMA
CREATE SCHEMA — define a new schema
Synopsis
CREATE SCHEMA schema_name [ AUTHORIZATION role_specification ] [ schema_element [ ... ] ]
CREATE SCHEMA AUTHORIZATION role_specification [ schema_element [ ... ] ]
CREATE SCHEMA IF NOT EXISTS schema_name [ AUTHORIZATION role_specification ]
CREATE SCHEMA IF NOT EXISTS AUTHORIZATION role_specification
where role_specification can be:
user_name
| CURRENT_USER
| SESSION_USER
Description
CREATE SCHEMA enters a new schema into the current database. The schema name must be distinct
from the name of any existing schema in the current database.
A schema is essentially a namespace: it contains named objects (tables, data types, functions, and
operators) whose names can duplicate those of other objects existing in other schemas. Named objects
are accessed either by “qualifying” their names with the schema name as a prefix, or by setting a
search path that includes the desired schema(s). A CREATE command specifying an unqualified object
name creates the object in the current schema (the one at the front of the search path, which can be
determined with the function current_schema).
Optionally, CREATE SCHEMA can include subcommands to create objects within the new schema.
The subcommands are treated essentially the same as separate commands issued after creating the
schema, except that if the AUTHORIZATION clause is used, all the created objects will be owned
by that user.
Parameters
schema_name
The name of a schema to be created. If this is omitted, the user_name is used as the schema
name. The name cannot begin with pg_, as such names are reserved for system schemas.
user_name
The role name of the user who will own the new schema. If omitted, defaults to the user executing
the command. To create a schema owned by another role, you must be a direct or indirect member
of that role, or be a superuser.
schema_element
An SQL statement defining an object to be created within the schema. Currently, only CREATE
TABLE, CREATE VIEW, CREATE INDEX, CREATE SEQUENCE, CREATE TRIGGER and
GRANT are accepted as clauses within CREATE SCHEMA. Other kinds of objects may be created
in separate commands after the schema is created.
IF NOT EXISTS
Do nothing (except issuing a notice) if a schema with the same name already exists. schema_el-
ement subcommands cannot be included when this option is used.
Notes
To create a schema, the invoking user must have the CREATE privilege for the current database. (Of
course, superusers bypass this check.)
Examples
Create a schema:
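CREATE SCHEMA myschema;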
Create a schema for user joe; the schema will also be named joe:
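CREATE SCHEMA AUTHORIZATION joe;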
Create a schema named test that will be owned by user joe, unless there already is a schema named
test. (It does not matter whether joe owns the pre-existing schema.)
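CREATE SCHEMA IF NOT EXISTS test AUTHORIZATION joe;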
Compatibility
The SQL standard allows a DEFAULT CHARACTER SET clause in CREATE SCHEMA, as well as
more subcommand types than are presently accepted by PostgreSQL.
The SQL standard specifies that the subcommands in CREATE SCHEMA can appear in any order. The
present PostgreSQL implementation does not handle all cases of forward references in subcommands;
it might sometimes be necessary to reorder the subcommands in order to avoid forward references.
According to the SQL standard, the owner of a schema always owns all objects within it. PostgreSQL
allows schemas to contain objects owned by users other than the schema owner. This can happen only
if the schema owner grants the CREATE privilege on their schema to someone else, or a superuser
chooses to create objects in it.
See Also
ALTER SCHEMA, DROP SCHEMA
CREATE SEQUENCE
CREATE SEQUENCE — define a new sequence generator
Synopsis
Description
CREATE SEQUENCE creates a new sequence number generator. This involves creating and initial-
izing a new special single-row table with the name name. The generator will be owned by the user
issuing the command.
If a schema name is given then the sequence is created in the specified schema. Otherwise it is created
in the current schema. Temporary sequences exist in a special schema, so a schema name cannot be
given when creating a temporary sequence. The sequence name must be distinct from the name of any
other sequence, table, index, view, or foreign table in the same schema.
After a sequence is created, you use the functions nextval, currval, and setval to operate on
the sequence. These functions are documented in Section 9.16.
Although you cannot update a sequence directly, you can use a query like:
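SELECT * FROM name;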
to examine the parameters and current state of a sequence. In particular, the last_value field of
the sequence shows the last value allocated by any session. (Of course, this value might be obsolete
by the time it's printed, if other sessions are actively doing nextval calls.)
Parameters
TEMPORARY or TEMP
If specified, the sequence object is created only for this session, and is automatically dropped on
session exit. Existing permanent sequences with the same name are not visible (in this session)
while the temporary sequence exists, unless they are referenced with schema-qualified names.
IF NOT EXISTS
Do not throw an error if a relation with the same name already exists. A notice is issued in this
case. Note that there is no guarantee that the existing relation is anything like the sequence that
would have been created - it might not even be a sequence.
name
data_type
The optional clause AS data_type specifies the data type of the sequence. Valid types are
smallint, integer, and bigint. bigint is the default. The data type determines the de-
fault minimum and maximum values of the sequence.
increment
The optional clause INCREMENT BY increment specifies which value is added to the cur-
rent sequence value to create a new value. A positive value will make an ascending sequence, a
negative one a descending sequence. The default value is 1.
minvalue
NO MINVALUE
The optional clause MINVALUE minvalue determines the minimum value a sequence can
generate. If this clause is not supplied or NO MINVALUE is specified, then defaults will be used.
The default for an ascending sequence is 1. The default for a descending sequence is the minimum
value of the data type.
maxvalue
NO MAXVALUE
The optional clause MAXVALUE maxvalue determines the maximum value for the sequence.
If this clause is not supplied or NO MAXVALUE is specified, then default values will be used.
The default for an ascending sequence is the maximum value of the data type. The default for a
descending sequence is -1.
start
The optional clause START WITH start allows the sequence to begin anywhere. The default
starting value is minvalue for ascending sequences and maxvalue for descending ones.
cache
The optional clause CACHE cache specifies how many sequence numbers are to be preallocated
and stored in memory for faster access. The minimum value is 1 (only one value can be generated
at a time, i.e., no cache), and this is also the default.
CYCLE
NO CYCLE
The CYCLE option allows the sequence to wrap around when the maxvalue or minvalue has
been reached by an ascending or descending sequence respectively. If the limit is reached, the
next number generated will be the minvalue or maxvalue, respectively.
If NO CYCLE is specified, any calls to nextval after the sequence has reached its maximum
value will return an error. If neither CYCLE nor NO CYCLE is specified, NO CYCLE is the default.
OWNED BY table_name.column_name
OWNED BY NONE
The OWNED BY option causes the sequence to be associated with a specific table column, such
that if that column (or its whole table) is dropped, the sequence will be automatically dropped as
well. The specified table must have the same owner and be in the same schema as the sequence.
OWNED BY NONE, the default, specifies that there is no such association.
Notes
Use DROP SEQUENCE to remove a sequence.
Sequences are based on bigint arithmetic, so the range cannot exceed the range of an eight-byte
integer (-9223372036854775808 to 9223372036854775807).
Because nextval and setval calls are never rolled back, sequence objects cannot be used if “gap-
less” assignment of sequence numbers is needed. It is possible to build gapless assignment by using
exclusive locking of a table containing a counter; but this solution is much more expensive than se-
quence objects, especially if many transactions need sequence numbers concurrently.
Unexpected results might be obtained if a cache setting greater than one is used for a sequence
object that will be used concurrently by multiple sessions. Each session will allocate and cache suc-
cessive sequence values during one access to the sequence object and increase the sequence object's
last_value accordingly. Then, the next cache-1 uses of nextval within that session simply
return the preallocated values without touching the sequence object. So, any numbers allocated but
not used within a session will be lost when that session ends, resulting in “holes” in the sequence.
Furthermore, although multiple sessions are guaranteed to allocate distinct sequence values, the values
might be generated out of sequence when all the sessions are considered. For example, with a cache
setting of 10, session A might reserve values 1..10 and return nextval=1, then session B might
reserve values 11..20 and return nextval=11 before session A has generated nextval=2. Thus,
with a cache setting of one it is safe to assume that nextval values are generated sequentially; with
a cache setting greater than one you should only assume that the nextval values are all distinct, not
that they are generated purely sequentially. Also, last_value will reflect the latest value reserved
by any session, whether or not it has yet been returned by nextval.
Another consideration is that a setval executed on such a sequence will not be noticed by other
sessions until they have used up any preallocated values they have cached.
Examples
Create an ascending sequence called serial, starting at 101:
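The statement for this example is simply:
CREATE SEQUENCE serial START 101;
Select the next number from this sequence: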
SELECT nextval('serial');
nextval
---------
101
SELECT nextval('serial');
nextval
---------
102
Update the sequence value after a COPY FROM:
BEGIN;
COPY distributors FROM 'input_file';
SELECT setval('serial', max(id)) FROM distributors;
END;
Compatibility
CREATE SEQUENCE conforms to the SQL standard, with the following exceptions:
• Obtaining the next value is done using the nextval() function instead of the standard's NEXT
VALUE FOR expression.
See Also
ALTER SEQUENCE, DROP SEQUENCE
CREATE SERVER
CREATE SERVER — define a new foreign server
Synopsis
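A sketch of the general form, assembled from the parameters described below (not a verbatim copy of the grammar):
CREATE SERVER [ IF NOT EXISTS ] server_name
    [ TYPE 'server_type' ] [ VERSION 'server_version' ]
    FOREIGN DATA WRAPPER fdw_name
    [ OPTIONS ( option 'value' [, ... ] ) ]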
Description
CREATE SERVER defines a new foreign server. The user who defines the server becomes its owner.
A foreign server typically encapsulates connection information that a foreign-data wrapper uses to
access an external data resource. Additional user-specific connection information may be specified
by means of user mappings.
Creating a server requires USAGE privilege on the foreign-data wrapper being used.
Parameters
IF NOT EXISTS
Do not throw an error if a server with the same name already exists. A notice is issued in this
case. Note that there is no guarantee that the existing server is anything like the one that would
have been created.
server_name
server_type
server_version
fdw_name
The OPTIONS ( option 'value' [, ... ] ) clause specifies the options for the server. The options
typically define the connection details of the server, but the actual names and values are dependent
on the server's foreign-data wrapper.
Notes
When using the dblink module, a foreign server's name can be used as an argument of the dblink_con-
nect function to indicate the connection parameters. It is necessary to have the USAGE privilege on
the foreign server to be able to use it in this way.
Examples
Create a server myserver that uses the foreign-data wrapper postgres_fdw:
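A minimal sketch (the connection options shown are illustrative and depend on the wrapper):
CREATE SERVER myserver FOREIGN DATA WRAPPER postgres_fdw
    OPTIONS (host 'foo', dbname 'foodb', port '5432');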
Compatibility
CREATE SERVER conforms to ISO/IEC 9075-9 (SQL/MED).
See Also
ALTER SERVER, DROP SERVER, CREATE FOREIGN DATA WRAPPER, CREATE FOREIGN
TABLE, CREATE USER MAPPING
CREATE STATISTICS
CREATE STATISTICS — define extended statistics
Synopsis
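A sketch of the general form, assembled from the parameters described below:
CREATE STATISTICS [ IF NOT EXISTS ] statistics_name
    [ ( statistics_kind [, ... ] ) ]
    ON column_name, column_name [, ... ]
    FROM table_name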
Description
CREATE STATISTICS will create a new extended statistics object tracking data about the specified
table, foreign table or materialized view. The statistics object will be created in the current database
and will be owned by the user issuing the command.
If a schema name is given (for example, CREATE STATISTICS myschema.mystat ...) then
the statistics object is created in the specified schema. Otherwise it is created in the current schema.
The name of the statistics object must be distinct from the name of any other statistics object in the
same schema.
Parameters
IF NOT EXISTS
Do not throw an error if a statistics object with the same name already exists. A notice is issued
in this case. Note that only the name of the statistics object is considered here, not the details of
its definition.
statistics_name
statistics_kind
A statistics kind to be computed in this statistics object. Currently supported kinds are ndis-
tinct, which enables n-distinct statistics, and dependencies, which enables functional de-
pendency statistics. If this clause is omitted, all supported statistics kinds are included in the sta-
tistics object. For more information, see Section 14.2.2 and Section 71.2.
column_name
The name of a table column to be covered by the computed statistics. At least two column names
must be given.
table_name
The name (optionally schema-qualified) of the table containing the column(s) the statistics are
computed on.
Notes
You must be the owner of a table to create a statistics object reading it. Once created, however, the
ownership of the statistics object is independent of the underlying table(s).
Examples
Create table t1 with two functionally dependent columns, i.e., knowledge of a value in the first column
is sufficient for determining the value in the other column. Then functional dependency statistics are
built on those columns:
CREATE TABLE t1 (
a int,
b int
);
ANALYZE t1;
ANALYZE t1;
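A fuller sketch of the intended flow, with an illustrative data set and an assumed statistics object name s1:

-- populate the table with strongly correlated values (illustrative data)
INSERT INTO t1 SELECT i/100, i/500
    FROM generate_series(1, 1000000) s(i);
ANALYZE t1;

-- without extended statistics, the planner treats the two conditions as independent
EXPLAIN ANALYZE SELECT * FROM t1 WHERE (a = 1) AND (b = 0);

-- create functional-dependency statistics on the two columns and re-analyze
CREATE STATISTICS s1 (dependencies) ON a, b FROM t1;
ANALYZE t1;

-- now the row count estimate is much more accurate
EXPLAIN ANALYZE SELECT * FROM t1 WHERE (a = 1) AND (b = 0);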
Without functional-dependency statistics, the planner would assume that the two WHERE conditions
are independent, and would multiply their selectivities together to arrive at a much-too-small row
count estimate. With such statistics, the planner recognizes that the WHERE conditions are redundant
and does not underestimate the row count.
Compatibility
There is no CREATE STATISTICS command in the SQL standard.
See Also
ALTER STATISTICS, DROP STATISTICS
CREATE SUBSCRIPTION
CREATE SUBSCRIPTION — define a new subscription
Synopsis
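A sketch of the general form, assembled from the parameters described below:
CREATE SUBSCRIPTION subscription_name
    CONNECTION 'conninfo'
    PUBLICATION publication_name [, ... ]
    [ WITH ( subscription_parameter [= value] [, ... ] ) ]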
Description
CREATE SUBSCRIPTION adds a new subscription for the current database. The subscription name
must be distinct from the name of any existing subscription in the database.
The subscription represents a replication connection to the publisher. As such this command does not
only add definitions in the local catalogs but also creates a replication slot on the publisher.
A logical replication worker will be started to replicate data for the new subscription at the commit
of the transaction where this command is run.
Additional information about subscriptions and logical replication as a whole is available at Sec-
tion 31.2 and Chapter 31.
Parameters
subscription_name
CONNECTION 'conninfo'
The connection string to the publisher. For details see Section 34.1.1.
PUBLICATION publication_name
Names of the publications on the publisher to subscribe to.
WITH ( subscription_parameter [= value] [, ... ] )
This clause specifies optional parameters for a subscription. The following parameters are supported:
copy_data (boolean)
Specifies whether the existing data in the publications that are being subscribed to should be
copied once the replication starts. The default is true.
create_slot (boolean)
Specifies whether the command should create the replication slot on the publisher. The de-
fault is true.
enabled (boolean)
Specifies whether the subscription should be actively replicating, or whether it should be just
setup but not started yet. The default is true.
slot_name (string)
Name of the replication slot to use. The default behavior is to use the name of the subscription
for the slot name.
When slot_name is set to NONE, there will be no replication slot associated with the sub-
scription. This can be used if the replication slot will be created later manually. Such sub-
scriptions must also have both enabled and create_slot set to false.
synchronous_commit (enum)
The value of this parameter overrides the synchronous_commit setting. The default value is
off.
It is safe to use off for logical replication: If the subscriber loses transactions because of
missing synchronization, the data will be sent again from the publisher.
A different setting might be appropriate when doing synchronous logical replication. The
logical replication workers report the positions of writes and flushes to the publisher, and
when using synchronous replication, the publisher will wait for the actual flush. This means
that setting synchronous_commit for the subscriber to off when the subscription is
used for synchronous replication might increase the latency for COMMIT on the publisher. In
this scenario, it can be advantageous to set synchronous_commit to local or higher.
connect (boolean)
Specifies whether the CREATE SUBSCRIPTION should connect to the publisher at all. Set-
ting this to false will change default values of enabled, create_slot and copy_da-
ta to false.
Since no connection is made when this option is set to false, the tables are not subscribed,
and so after you enable the subscription nothing will be replicated. It is required to run ALTER
SUBSCRIPTION ... REFRESH PUBLICATION in order for tables to be subscribed.
Notes
See Section 31.7 for details on how to configure access control between the subscription and the
publication instance.
When creating a replication slot (the default behavior), CREATE SUBSCRIPTION cannot be exe-
cuted inside a transaction block.
Creating a subscription that connects to the same database cluster (for example, to replicate between
databases in the same cluster or to replicate within the same database) will only succeed if the repli-
cation slot is not created as part of the same command. Otherwise, the CREATE SUBSCRIPTION
call will hang. To make this work, create the replication slot separately (using the function pg_cre-
ate_logical_replication_slot with the plugin name pgoutput) and create the subscrip-
tion using the parameter create_slot = false. This is an implementation restriction that might
be lifted in a future release.
Examples
Create a subscription to a remote server that replicates tables in the publications mypublication
and insert_only and starts replicating immediately on commit:
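A sketch (the connection string is illustrative):
CREATE SUBSCRIPTION mysub
    CONNECTION 'host=192.168.1.50 port=5432 user=foo dbname=foodb'
    PUBLICATION mypublication, insert_only;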
Create a subscription to a remote server that replicates tables in the insert_only publication and
does not start replicating until enabled at a later time.
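A sketch (again with an illustrative connection string), using the enabled parameter described above:
CREATE SUBSCRIPTION mysub
    CONNECTION 'host=192.168.1.50 port=5432 user=foo dbname=foodb'
    PUBLICATION insert_only
    WITH (enabled = false);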
Compatibility
CREATE SUBSCRIPTION is a PostgreSQL extension.
See Also
ALTER SUBSCRIPTION, DROP SUBSCRIPTION, CREATE PUBLICATION, ALTER PUBLI-
CATION
CREATE TABLE
CREATE TABLE — define a new table
Synopsis
CREATE [ [ GLOBAL | LOCAL ] { TEMPORARY | TEMP } | UNLOGGED ] TABLE
[ IF NOT EXISTS ] table_name ( [
{ column_name data_type [ COLLATE collation ] [ column_constraint
[ ... ] ]
| table_constraint
| LIKE source_table [ like_option ... ] }
[, ... ]
] )
[ INHERITS ( parent_table [, ... ] ) ]
[ PARTITION BY { RANGE | LIST | HASH } ( { column_name |
( expression ) } [ COLLATE collation ] [ opclass ] [, ... ] ) ]
[ WITH ( storage_parameter [= value] [, ... ] ) | WITH OIDS |
WITHOUT OIDS ]
[ ON COMMIT { PRESERVE ROWS | DELETE ROWS | DROP } ]
[ TABLESPACE tablespace_name ]
where column_constraint is:

[ CONSTRAINT constraint_name ]
{ NOT NULL |
  NULL |
  CHECK ( expression ) [ NO INHERIT ] |
  DEFAULT default_expr |
  GENERATED { ALWAYS | BY DEFAULT } AS IDENTITY [ ( sequence_options ) ] |
  UNIQUE index_parameters |
  PRIMARY KEY index_parameters |
  REFERENCES reftable [ ( refcolumn ) ]
    [ MATCH FULL | MATCH PARTIAL | MATCH SIMPLE ]
    [ ON DELETE action ] [ ON UPDATE action ] }
[ DEFERRABLE | NOT DEFERRABLE ] [ INITIALLY DEFERRED | INITIALLY IMMEDIATE ]

and table_constraint is:

[ CONSTRAINT constraint_name ]
{ CHECK ( expression ) [ NO INHERIT ] |
  UNIQUE ( column_name [, ... ] ) index_parameters |
  PRIMARY KEY ( column_name [, ... ] ) index_parameters |
  EXCLUDE [ USING index_method ] ( exclude_element WITH operator [, ... ] )
    index_parameters [ WHERE ( predicate ) ] |
  FOREIGN KEY ( column_name [, ... ] ) REFERENCES reftable [ ( refcolumn [, ... ] ) ]
    [ MATCH FULL | MATCH PARTIAL | MATCH SIMPLE ]
    [ ON DELETE action ] [ ON UPDATE action ] }
[ DEFERRABLE | NOT DEFERRABLE ] [ INITIALLY DEFERRED | INITIALLY IMMEDIATE ]
Description
CREATE TABLE will create a new, initially empty table in the current database. The table will be
owned by the user issuing the command.
If a schema name is given (for example, CREATE TABLE myschema.mytable ...) then the
table is created in the specified schema. Otherwise it is created in the current schema. Temporary
tables exist in a special schema, so a schema name cannot be given when creating a temporary table.
The name of the table must be distinct from the name of any other table, sequence, index, view, or
foreign table in the same schema.
CREATE TABLE also automatically creates a data type that represents the composite type correspond-
ing to one row of the table. Therefore, tables cannot have the same name as any existing data type
in the same schema.
The optional constraint clauses specify constraints (tests) that new or updated rows must satisfy for an
insert or update operation to succeed. A constraint is an SQL object that helps define the set of valid
values in the table in various ways.
There are two ways to define constraints: table constraints and column constraints. A column constraint
is defined as part of a column definition. A table constraint definition is not tied to a particular column,
and it can encompass more than one column. Every column constraint can also be written as a table
constraint; a column constraint is only a notational convenience for use when the constraint only
affects one column.
To be able to create a table, you must have USAGE privilege on all column types or the type in the
OF clause, respectively.
Parameters
TEMPORARY or TEMP
If specified, the table is created as a temporary table. Temporary tables are automatically dropped
at the end of a session, or optionally at the end of the current transaction (see ON COMMIT
below). Existing permanent tables with the same name are not visible to the current session while
the temporary table exists, unless they are referenced with schema-qualified names. Any indexes
created on a temporary table are automatically temporary as well.
The autovacuum daemon cannot access and therefore cannot vacuum or analyze temporary tables.
For this reason, appropriate vacuum and analyze operations should be performed via session SQL
commands. For example, if a temporary table is going to be used in complex queries, it is wise
to run ANALYZE on the temporary table after it is populated.
Optionally, GLOBAL or LOCAL can be written before TEMPORARY or TEMP. This presently
makes no difference in PostgreSQL and is deprecated; see Compatibility.
UNLOGGED
If specified, the table is created as an unlogged table. Data written to unlogged tables is not written
to the write-ahead log (see Chapter 30), which makes them considerably faster than ordinary
tables. However, they are not crash-safe: an unlogged table is automatically truncated after a crash
or unclean shutdown. The contents of an unlogged table are also not replicated to standby servers.
Any indexes created on an unlogged table are automatically unlogged as well.
IF NOT EXISTS
Do not throw an error if a relation with the same name already exists. A notice is issued in this
case. Note that there is no guarantee that the existing relation is anything like the one that would
have been created.
table_name
OF type_name
Creates a typed table, which takes its structure from the specified composite type (name optionally
schema-qualified). A typed table is tied to its type; for example the table will be dropped if the
type is dropped (with DROP TYPE ... CASCADE).
When a typed table is created, then the data types of the columns are determined by the underlying
composite type and are not specified by the CREATE TABLE command. But the CREATE TABLE
command can add defaults and constraints to the table and can specify storage parameters.
column_name
data_type
The data type of the column. This can include array specifiers. For more information on the data
types supported by PostgreSQL, refer to Chapter 8.
COLLATE collation
The COLLATE clause assigns a collation to the column (which must be of a collatable data type).
If not specified, the column data type's default collation is used.
The optional INHERITS clause specifies a list of tables from which the new table automatically
inherits all columns. Parent tables can be plain tables or foreign tables.
Use of INHERITS creates a persistent relationship between the new child table and its parent
table(s). Schema modifications to the parent(s) normally propagate to children as well, and by
default the data of the child table is included in scans of the parent(s).
If the same column name exists in more than one parent table, an error is reported unless the data
types of the columns match in each of the parent tables. If there is no conflict, then the duplicate
columns are merged to form a single column in the new table. If the column name list of the new
table contains a column name that is also inherited, the data type must likewise match the inherited
column(s), and the column definitions are merged into one. If the new table explicitly specifies
a default value for the column, this default overrides any defaults from inherited declarations of
the column. Otherwise, any parents that specify default values for the column must all specify the
same default, or an error will be reported.
CHECK constraints are merged in essentially the same way as columns: if multiple parent tables
and/or the new table definition contain identically-named CHECK constraints, these constraints
must all have the same check expression, or an error will be reported. Constraints having the
same name and expression will be merged into one copy. A constraint marked NO INHERIT in
a parent will not be considered. Notice that an unnamed CHECK constraint in the new table will
never be merged, since a unique name will always be chosen for it.
If a column in the parent table is an identity column, that property is not inherited. A column in
the child table can be declared an identity column if desired.
The optional PARTITION BY clause specifies a strategy of partitioning the table. The table
thus created is called a partitioned table. The parenthesized list of columns or expressions forms
the partition key for the table. When using range or hash partitioning, the partition key can in-
clude multiple columns or expressions (up to 32, but this limit can be altered when building Post-
greSQL), but for list partitioning, the partition key must consist of a single column or expression.
Range and list partitioning require a btree operator class, while hash partitioning requires a hash
operator class. If no operator class is specified explicitly, the default operator class of the appro-
priate type will be used; if no default operator class exists, an error will be raised. When hash par-
titioning is used, the operator class used must implement support function 2 (see Section 38.15.3
for details).
A partitioned table is divided into sub-tables (called partitions), which are created using separate
CREATE TABLE commands. The partitioned table is itself empty. A data row inserted into the
table is routed to a partition based on the value of columns or expressions in the partition key. If
no existing partition matches the values in the new row, an error will be reported.
Partitioned tables do not support EXCLUDE constraints; however, you can define these constraints
on individual partitions. Also, while it's possible to define PRIMARY KEY constraints on parti-
tioned tables, creating foreign keys that reference a partitioned table is not yet supported.
PARTITION OF parent_table { FOR VALUES partition_bound_spec | DEFAULT }
Creates the table as a partition of the specified parent table. The table can be created either as a
partition for specific values using FOR VALUES or as a default partition using DEFAULT. Any
indexes, constraints and user-defined row-level triggers that exist in the parent table are cloned
on the new partition.
The partition_bound_spec must correspond to the partitioning method and partition key
of the parent table, and must not overlap with any existing partition of that parent. The form with
IN is used for list partitioning, the form with FROM and TO is used for range partitioning, and the
form with WITH is used for hash partitioning.
When creating a list partition, NULL can be specified to signify that the partition allows the parti-
tion key column to be null. However, there cannot be more than one such list partition for a given
parent table. NULL cannot be specified for range partitions.
When creating a range partition, the lower bound specified with FROM is an inclusive bound,
whereas the upper bound specified with TO is an exclusive bound. That is, the values specified
in the FROM list are valid values of the corresponding partition key columns for this partition,
whereas those in the TO list are not. Note that this statement must be understood according to the
rules of row-wise comparison (Section 9.23.5). For example, given PARTITION BY RANGE
(x,y), a partition bound FROM (1, 2) TO (3, 4) allows x=1 with any y>=2, x=2 with
any non-null y, and x=3 with any y<4.
The special values MINVALUE and MAXVALUE may be used when creating a range partition to
indicate that there is no lower or upper bound on the column's value. For example, a partition
defined using FROM (MINVALUE) TO (10) allows any values less than 10, and a partition
defined using FROM (10) TO (MAXVALUE) allows any values greater than or equal to 10.
When creating a range partition involving more than one column, it can also make sense to use
MAXVALUE as part of the lower bound, and MINVALUE as part of the upper bound. For example,
a partition defined using FROM (0, MAXVALUE) TO (10, MAXVALUE) allows any rows
where the first partition key column is greater than 0 and less than or equal to 10. Similarly, a
partition defined using FROM ('a', MINVALUE) TO ('b', MINVALUE) allows any
rows where the first partition key column starts with "a".
Note that if MINVALUE or MAXVALUE is used for one column of a partitioning bound, the same
value must be used for all subsequent columns. For example, (10, MINVALUE, 0) is not a
valid bound; you should write (10, MINVALUE, MINVALUE).
Also note that some element types, such as timestamp, have a notion of "infinity", which is just
another value that can be stored. This is different from MINVALUE and MAXVALUE, which are
not real values that can be stored, but rather they are ways of saying that the value is unbounded.
MAXVALUE can be thought of as being greater than any other value, including "infinity" and
MINVALUE as being less than any other value, including "minus infinity". Thus the range FROM
('infinity') TO (MAXVALUE) is not an empty range; it allows precisely one value to
be stored — "infinity".
If DEFAULT is specified, the table will be created as the default partition of the parent table. This
option is not available for hash-partitioned tables. A partition key value not fitting into any other
partition of the given parent will be routed to the default partition.
When a table has an existing DEFAULT partition and a new partition is added to it, the default
partition must be scanned to verify that it does not contain any rows which properly belong in the
new partition. If the default partition contains a large number of rows, this may be slow. The scan
will be skipped if the default partition is a foreign table or if it has a constraint which proves that
it cannot contain rows which should be placed in the new partition.
When creating a hash partition, a modulus and remainder must be specified. The modulus must
be a positive integer, and the remainder must be a non-negative integer less than the modulus.
Typically, when initially setting up a hash-partitioned table, you should choose a modulus equal
to the number of partitions and assign every table the same modulus and a different remainder
(see examples, below). However, it is not required that every partition have the same modulus,
only that every modulus which occurs among the partitions of a hash-partitioned table is a factor
of the next larger modulus. This allows the number of partitions to be increased incrementally
without needing to move all the data at once. For example, suppose you have a hash-partitioned
table with 8 partitions, each of which has modulus 8, but find it necessary to increase the number
of partitions to 16. You can detach one of the modulus-8 partitions, create two new modulus-16
partitions covering the same portion of the key space (one with a remainder equal to the remainder
of the detached partition, and the other with a remainder equal to that value plus 8), and repopulate
them with data. You can then repeat this -- perhaps at a later time -- for each modulus-8 partition
until none remain. While this may still involve a large amount of data movement at each step, it
is still better than having to create a whole new table and move all the data at once.
A partition must have the same column names and types as the partitioned table to which it be-
longs. If the parent is specified WITH OIDS then all partitions must have OIDs; the parent's
OID column will be inherited by all partitions just like any other column. Modifications to the
column names or types of a partitioned table, or the addition or removal of an OID column, will
automatically propagate to all partitions. CHECK constraints will be inherited automatically by
every partition, but an individual partition may specify additional CHECK constraints; additional
constraints with the same name and condition as in the parent will be merged with the parent con-
straint. Defaults may be specified separately for each partition. But note that a partition's default
value is not applied when inserting a tuple through a partitioned table.
Rows inserted into a partitioned table will be automatically routed to the correct partition. If no
suitable partition exists, an error will occur.
Operations such as TRUNCATE which normally affect a table and all of its inheritance children
will cascade to all partitions, but may also be performed on an individual partition. Note that
dropping a partition with DROP TABLE requires taking an ACCESS EXCLUSIVE lock on the
parent table.
The LIKE clause specifies a table from which the new table automatically copies all column
names, their data types, and their not-null constraints.
Unlike INHERITS, the new table and original table are completely decoupled after creation is
complete. Changes to the original table will not be applied to the new table, and it is not possible
to include data of the new table in scans of the original table.
Default expressions for the copied column definitions will be copied only if INCLUDING DE-
FAULTS is specified. The default behavior is to exclude default expressions, resulting in the
copied columns in the new table having null defaults. Note that copying defaults that call data-
base-modification functions, such as nextval, may create a functional linkage between the
original and new tables.
Any identity specifications of copied column definitions will only be copied if INCLUDING
IDENTITY is specified. A new sequence is created for each identity column of the new table,
separate from the sequences associated with the old table.
Not-null constraints are always copied to the new table. CHECK constraints will be copied only if
INCLUDING CONSTRAINTS is specified. No distinction is made between column constraints
and table constraints.
Extended statistics are copied to the new table if INCLUDING STATISTICS is specified.
Indexes, PRIMARY KEY, UNIQUE, and EXCLUDE constraints on the original table will be created
on the new table only if INCLUDING INDEXES is specified. Names for the new indexes and
constraints are chosen according to the default rules, regardless of how the originals were named.
(This behavior avoids possible duplicate-name failures for the new indexes.)
STORAGE settings for the copied column definitions will be copied only if INCLUDING STOR-
AGE is specified. The default behavior is to exclude STORAGE settings, resulting in the copied
columns in the new table having type-specific default settings. For more on STORAGE settings,
see Section 69.2.
Comments for the copied columns, constraints, and indexes will be copied only if INCLUDING
COMMENTS is specified. The default behavior is to exclude comments, resulting in the copied
columns and constraints in the new table having no comments.
Note that unlike INHERITS, columns and constraints copied by LIKE are not merged with sim-
ilarly named columns and constraints. If the same name is specified explicitly or in another LIKE
clause, an error is signaled.
The LIKE clause can also be used to copy column definitions from views, foreign tables, or
composite types. Inapplicable options (e.g., INCLUDING INDEXES from a view) are ignored.
CONSTRAINT constraint_name
An optional name for a column or table constraint. If the constraint is violated, the constraint
name is present in error messages, so constraint names like "col must be positive" can
be used to communicate helpful constraint information to client applications. (Double-quotes are
needed to specify constraint names that contain spaces.) If a constraint name is not specified, the
system generates a name.
NOT NULL
NULL
NOT NULL specifies that the column is not allowed to contain null values; NULL specifies that it is
allowed to, which is the default. The NULL clause is only provided for compatibility with non-standard
SQL databases. Its use is discouraged in new applications.
The CHECK clause specifies an expression producing a Boolean result which new or updated
rows must satisfy for an insert or update operation to succeed. Expressions evaluating to TRUE or
UNKNOWN succeed. Should any row of an insert or update operation produce a FALSE result,
an error exception is raised and the insert or update does not alter the database. A check constraint
specified as a column constraint should reference that column's value only, while an expression
appearing in a table constraint can reference multiple columns.
Currently, CHECK expressions cannot contain subqueries nor refer to variables other than columns
of the current row (see Section 5.3.1). The system column tableoid may be referenced, but
not any other system column.
When a table has multiple CHECK constraints, they will be tested for each row in alphabetical
order by name, after checking NOT NULL constraints. (PostgreSQL versions before 9.5 did not
honor any particular firing order for CHECK constraints.)
DEFAULT default_expr
The DEFAULT clause assigns a default data value for the column whose column definition it
appears within. The value is any variable-free expression (subqueries and cross-references to other
columns in the current table are not allowed). The data type of the default expression must match
the data type of the column.
The default expression will be used in any insert operation that does not specify a value for the
column. If there is no default for a column, then the default is null.
GENERATED { ALWAYS | BY DEFAULT } AS IDENTITY [ ( sequence_options ) ]
This clause creates the column as an identity column. It will have an implicit sequence attached
to it and the column in new rows will automatically have values from the sequence assigned to
it. Such a column is implicitly NOT NULL.
The clauses ALWAYS and BY DEFAULT determine how the sequence value is given precedence
over a user-specified value in an INSERT statement. If ALWAYS is specified, a user-specified
value is only accepted if the INSERT statement specifies OVERRIDING SYSTEM VALUE. If BY
DEFAULT is specified, then the user-specified value takes precedence. See INSERT for details.
(In the COPY command, user-specified values are always used regardless of this setting.)
The optional sequence_options clause can be used to override the options of the sequence.
See CREATE SEQUENCE for details.
The UNIQUE constraint specifies that a group of one or more columns of a table can contain only
unique values. The behavior of a unique table constraint is the same as that of a unique column
constraint, with the additional capability to span multiple columns. The constraint therefore en-
forces that any two rows must differ in at least one of these columns.
For the purpose of a unique constraint, null values are not considered equal.
Each unique constraint should name a set of columns that is different from the set of columns
named by any other unique or primary key constraint defined for the table. (Otherwise, redundant
unique constraints will be discarded.)
When establishing a unique constraint for a multi-level partition hierarchy, all the columns in the
partition key of the target partitioned table, as well as those of all its descendant partitioned tables,
must be included in the constraint definition.
Adding a unique constraint will automatically create a unique btree index on the column or group
of columns used in the constraint.
The optional INCLUDE clause adds to that index one or more columns that are simply “pay-
load”: uniqueness is not enforced on them, and the index cannot be searched on the basis of those
columns. However they can be retrieved by an index-only scan. Note that although the constraint
is not enforced on included columns, it still depends on them. Consequently, some operations on
such columns (e.g., DROP COLUMN) can cause cascaded constraint and index deletion.
The PRIMARY KEY constraint specifies that a column or columns of a table can contain only
unique (non-duplicate), nonnull values. Only one primary key can be specified for a table, whether
as a column constraint or a table constraint.
The primary key constraint should name a set of columns that is different from the set of columns
named by any unique constraint defined for the same table. (Otherwise, the unique constraint is
redundant and will be discarded.)
PRIMARY KEY enforces the same data constraints as a combination of UNIQUE and NOT NULL.
However, identifying a set of columns as the primary key also provides metadata about the design
of the schema, since a primary key implies that other tables can rely on this set of columns as a
unique identifier for rows.
When placed on a partitioned table, PRIMARY KEY constraints share the restrictions previously
described for UNIQUE constraints.
Adding a PRIMARY KEY constraint will automatically create a unique btree index on the column
or group of columns used in the constraint.
The optional INCLUDE clause adds to that index one or more columns that are simply “pay-
load”: uniqueness is not enforced on them, and the index cannot be searched on the basis of those
columns. However they can be retrieved by an index-only scan. Note that although the constraint
is not enforced on included columns, it still depends on them. Consequently, some operations on
such columns (e.g., DROP COLUMN) can cause cascaded constraint and index deletion.
The EXCLUDE clause defines an exclusion constraint, which guarantees that if any two rows
are compared on the specified column(s) or expression(s) using the specified operator(s), not all
of these comparisons will return TRUE. If all of the specified operators test for equality, this is
equivalent to a UNIQUE constraint, although an ordinary unique constraint will be faster. How-
ever, exclusion constraints can specify constraints that are more general than simple equality. For
example, you can specify a constraint that no two rows in the table contain overlapping circles
(see Section 8.8) by using the && operator.
Exclusion constraints are implemented using an index, so each specified operator must be asso-
ciated with an appropriate operator class (see Section 11.10) for the index access method in-
dex_method. The operators are required to be commutative. Each exclude_element can
optionally specify an operator class and/or ordering options; these are described fully under CRE-
ATE INDEX.
The access method must support amgettuple (see Chapter 61); at present this means GIN
cannot be used. Although it's allowed, there is little point in using B-tree or hash indexes with
an exclusion constraint, because this does nothing that an ordinary unique constraint doesn't do
better. So in practice the access method will always be GiST or SP-GiST.
The predicate allows you to specify an exclusion constraint on a subset of the table; internally
this creates a partial index. Note that parentheses are required around the predicate.
The REFERENCES (column constraint) and FOREIGN KEY ... REFERENCES (table constraint)
clauses specify a foreign key constraint, which requires that a group of one or more columns
of the new table must only contain values that match values in the referenced column(s) of some
row of the referenced table. If the refcolumn list is omitted, the primary key of the reftable
is used. The referenced columns must be the columns of a non-deferrable unique or primary key
constraint in the referenced table. The user must have REFERENCES permission on the referenced
table (either the whole table, or the specific referenced columns). The addition of a foreign key
constraint requires a SHARE ROW EXCLUSIVE lock on the referenced table. Note that foreign
key constraints cannot be defined between temporary tables and permanent tables. Also note that
while it is possible to define a foreign key on a partitioned table, it is not possible to declare a
foreign key that references a partitioned table.
A value inserted into the referencing column(s) is matched against the values of the referenced
table and referenced columns using the given match type. There are three match types: MATCH
FULL, MATCH PARTIAL, and MATCH SIMPLE (which is the default). MATCH FULL will
not allow one column of a multicolumn foreign key to be null unless all foreign key columns are
null; if they are all null, the row is not required to have a match in the referenced table. MATCH
SIMPLE allows any of the foreign key columns to be null; if any of them are null, the row is
not required to have a match in the referenced table. MATCH PARTIAL is not yet implemented.
(Of course, NOT NULL constraints can be applied to the referencing column(s) to prevent these
cases from arising.)
In addition, when the data in the referenced columns is changed, certain actions are performed on
the data in this table's columns. The ON DELETE clause specifies the action to perform when a
referenced row in the referenced table is being deleted. Likewise, the ON UPDATE clause specifies
the action to perform when a referenced column in the referenced table is being updated to a new
value. If the row is updated, but the referenced column is not actually changed, no action is done.
Referential actions other than the NO ACTION check cannot be deferred, even if the constraint
is declared deferrable. There are the following possible actions for each clause:
NO ACTION
Produce an error indicating that the deletion or update would create a foreign key constraint
violation. If the constraint is deferred, this error will be produced at constraint check time if
there still exist any referencing rows. This is the default action.
RESTRICT
Produce an error indicating that the deletion or update would create a foreign key constraint
violation. This is the same as NO ACTION except that the check is not deferrable.
CASCADE
Delete any rows referencing the deleted row, or update the values of the referencing colum-
n(s) to the new values of the referenced columns, respectively.
SET NULL
Set the referencing column(s) to null.
SET DEFAULT
Set the referencing column(s) to their default values. (There must be a row in the referenced
table matching the default values, if they are not null, or the operation will fail.)
If the referenced column(s) are changed frequently, it might be wise to add an index to the ref-
erencing column(s) so that referential actions associated with the foreign key constraint can be
performed more efficiently.
DEFERRABLE
NOT DEFERRABLE
This controls whether the constraint can be deferred. A constraint that is not deferrable will be
checked immediately after every command. Checking of constraints that are deferrable can be
postponed until the end of the transaction (using the SET CONSTRAINTS command). NOT DE-
FERRABLE is the default. Currently, only UNIQUE, PRIMARY KEY, EXCLUDE, and REFER-
ENCES (foreign key) constraints accept this clause. NOT NULL and CHECK constraints are not
deferrable. Note that deferrable constraints cannot be used as conflict arbitrators in an INSERT
statement that includes an ON CONFLICT DO UPDATE clause.
INITIALLY IMMEDIATE
INITIALLY DEFERRED
If a constraint is deferrable, this clause specifies the default time to check the constraint. If the
constraint is INITIALLY IMMEDIATE, it is checked after each statement. This is the default. If
the constraint is INITIALLY DEFERRED, it is checked only at the end of the transaction. The
constraint check time can be altered with the SET CONSTRAINTS command.
WITH ( storage_parameter [= value] [, ... ] )
This clause specifies optional storage parameters for a table or index; see Storage Parameters for
more information. The WITH clause for a table can also include OIDS=TRUE (or just OIDS)
to specify that rows of the new table should have OIDs (object identifiers) assigned to them,
or OIDS=FALSE to specify that the rows should not have OIDs. If OIDS is not specified, the
default setting depends upon the default_with_oids configuration parameter. (If the new table
inherits from any tables that have OIDs, then OIDS=TRUE is forced even if the command says
OIDS=FALSE.)
If OIDS=FALSE is specified or implied, the new table does not store OIDs and no OID will be as-
signed for a row inserted into it. This is generally considered worthwhile, since it will reduce OID
consumption and thereby postpone the wraparound of the 32-bit OID counter. Once the counter
wraps around, OIDs can no longer be assumed to be unique, which makes them considerably less
useful. In addition, excluding OIDs from a table reduces the space required to store the table on
disk by 4 bytes per row (on most machines), slightly improving performance.
To remove OIDs from a table after it has been created, use ALTER TABLE.
WITH OIDS
WITHOUT OIDS
These are obsolescent syntaxes equivalent to WITH (OIDS) and WITH (OIDS=FALSE),
respectively. If you wish to give both an OIDS setting and storage parameters, you must use the
WITH ( ... ) syntax; see above.
ON COMMIT
The behavior of temporary tables at the end of a transaction block can be controlled using ON
COMMIT. The three options are:
PRESERVE ROWS
No special action is taken at the ends of transactions. This is the default behavior.
DELETE ROWS
All rows in the temporary table will be deleted at the end of each transaction block. Essen-
tially, an automatic TRUNCATE is done at each commit. When used on a partitioned table,
this is not cascaded to its partitions.
DROP
The temporary table will be dropped at the end of the current transaction block. When used
on a partitioned table, this action drops its partitions and when used on tables with inheritance
children, it drops the dependent children.
TABLESPACE tablespace_name
The tablespace_name is the name of the tablespace in which the new table is to be created.
If not specified, default_tablespace is consulted, or temp_tablespaces if the table is temporary.
USING INDEX TABLESPACE tablespace_name
This clause allows selection of the tablespace in which the index associated with a UNIQUE,
PRIMARY KEY, or EXCLUDE constraint will be created. If not specified, default_tablespace is
consulted, or temp_tablespaces if the table is temporary.
Storage Parameters
The WITH clause can specify storage parameters for tables, and for indexes associated with a
UNIQUE, PRIMARY KEY, or EXCLUDE constraint. Storage parameters for indexes are documented
in CREATE INDEX. The storage parameters currently available for tables are listed below. For many
of these parameters, as shown, there is an additional parameter with the same name prefixed with
toast., which controls the behavior of the table's secondary TOAST table, if any (see Section 69.2
for more information about TOAST). If a table parameter value is set and the equivalent toast.
parameter is not, the TOAST table will use the table's parameter value. Specifying these parameters
for partitioned tables is not supported, but you may specify them for individual leaf partitions.
fillfactor (integer)
The fillfactor for a table is a percentage between 10 and 100. 100 (complete packing) is the default.
When a smaller fillfactor is specified, INSERT operations pack table pages only to the indicated
percentage; the remaining space on each page is reserved for updating rows on that page. This
gives UPDATE a chance to place the updated copy of a row on the same page as the original,
which is more efficient than placing it on a different page. For a table whose entries are never
updated, complete packing is the best choice, but in heavily updated tables smaller fillfactors are
appropriate. This parameter cannot be set for TOAST tables.
toast_tuple_target (integer)
The toast_tuple_target specifies the minimum tuple length required before we try to compress and/
or move long column values into TOAST tables, and is also the target length we try to reduce the
length below once toasting begins. This affects columns marked as External (for move), Main (for
compression), or Extended (for both) and applies only to new tuples. There is no effect on existing
rows. By default this parameter is set to allow at least 4 tuples per block, which with the default
blocksize will be 2040 bytes. Valid values are between 128 bytes and the (blocksize - header), by
default 8160 bytes. Changing this value may not be useful for very short or very long rows. Note
that the default setting is often close to optimal, and it is possible that setting this parameter could
have negative effects in some cases. This parameter cannot be set for TOAST tables.
parallel_workers (integer)
This sets the number of workers that should be used to assist a parallel scan of this table. If not
set, the system will determine a value based on the relation size. The actual number of workers
chosen by the planner or by utility statements that use parallel scans may be less, for example due
to the setting of max_worker_processes.
autovacuum_enabled, toast.autovacuum_enabled (boolean)
Enables or disables the autovacuum daemon for a particular table. If true, the autovacuum dae-
mon will perform automatic VACUUM and/or ANALYZE operations on this table following the
rules discussed in Section 24.1.6. If false, this table will not be autovacuumed, except to prevent
transaction ID wraparound. See Section 24.1.5 for more about wraparound prevention. Note that
the autovacuum daemon does not run at all (except to prevent transaction ID wraparound) if the
autovacuum parameter is false; setting individual tables' storage parameters does not override
that. Therefore there is seldom much point in explicitly setting this storage parameter to true,
only to false.
autovacuum_vacuum_scale_factor, toast.autovacuum_vacuum_scale_factor
(floating point)
autovacuum_analyze_threshold (integer)
autovacuum_freeze_min_age, toast.autovacuum_freeze_min_age (integer)
Per-table value for vacuum_freeze_min_age parameter. Note that autovacuum will ignore
per-table autovacuum_freeze_min_age parameters that are larger than half the
system-wide autovacuum_freeze_max_age setting.
autovacuum_freeze_max_age, toast.autovacuum_freeze_max_age (integer)
Per-table value for autovacuum_freeze_max_age parameter. Note that autovacuum will ignore
per-table autovacuum_freeze_max_age parameters that are larger than the system-wide
setting (it can only be set smaller).
autovacuum_multixact_freeze_min_age, toast.autovacuum_multixac-
t_freeze_min_age (integer)
autovacuum_multixact_freeze_max_age, toast.autovacuum_multixac-
t_freeze_max_age (integer)
autovacuum_multixact_freeze_table_age, toast.autovacuum_multixac-
t_freeze_table_age (integer)
user_catalog_table (boolean)
Declare the table as an additional catalog table for purposes of logical replication. See Sec-
tion 49.6.2 for details. This parameter cannot be set for TOAST tables.
Notes
Using OIDs in new applications is not recommended: where possible, using an identity column or
other sequence generator as the table's primary key is preferred. However, if your application does
make use of OIDs to identify specific rows of a table, it is recommended to create a unique constraint
on the oid column of that table, to ensure that OIDs in the table will indeed uniquely identify rows
even after counter wraparound. Avoid assuming that OIDs are unique across tables; if you need a
database-wide unique identifier, use the combination of tableoid and row OID for the purpose.
Tip
The use of OIDS=FALSE is not recommended for tables with no primary key, since without
either an OID or a unique data key, it is difficult to identify specific rows.
PostgreSQL automatically creates an index for each unique constraint and primary key constraint to
enforce uniqueness. Thus, it is not necessary to create an index explicitly for primary key columns.
(See CREATE INDEX for more information.)
Unique constraints and primary keys are not inherited in the current implementation. This makes the
combination of inheritance and unique constraints rather dysfunctional.
A table cannot have more than 1600 columns. (In practice, the effective limit is usually lower because
of tuple-length constraints.)
Examples
Create table films and table distributors:

CREATE TABLE films (
    code        char(5) PRIMARY KEY,
    title       varchar(40) NOT NULL,
    did         integer NOT NULL,
    date_prod   date,
    kind        varchar(10),
    len         interval hour to minute
);
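A matching sketch for the second table (the column list is assumed from the later examples, which use did and name):
CREATE TABLE distributors (
    did     integer PRIMARY KEY,
    name    varchar(40) NOT NULL
);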
Define a unique table constraint for the table films. Unique table constraints can be defined on one
or more columns of the table:
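A sketch (column list abbreviated; the constraint name production is illustrative):
CREATE TABLE films (
    code        char(5),
    title       varchar(40),
    date_prod   date,
    CONSTRAINT production UNIQUE(date_prod)
);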
Define a primary key constraint for table distributors. The following two examples are equiva-
lent, the first using the table constraint syntax, the second the column constraint syntax:
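A sketch of the two equivalent forms:

CREATE TABLE distributors (
    did     integer,
    name    varchar(40),
    PRIMARY KEY(did)
);

CREATE TABLE distributors (
    did     integer PRIMARY KEY,
    name    varchar(40)
);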
Assign a literal constant default value for the column name, arrange for the default value of column
did to be generated by selecting the next value of a sequence object, and make the default value of
modtime be the time at which the row is inserted:
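A sketch (the literal default and the sequence name are illustrative):
CREATE TABLE distributors (
    name      varchar(40) DEFAULT 'Luso Films',
    did       integer DEFAULT nextval('distributors_serial'),
    modtime   timestamp DEFAULT current_timestamp
);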
Define two NOT NULL column constraints on the table distributors, one of which is explicitly
given a name:
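A sketch (the constraint name no_null is illustrative):
CREATE TABLE distributors (
    did     integer CONSTRAINT no_null NOT NULL,
    name    varchar(40) NOT NULL
);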
Create the same table, specifying 70% fill factor for both the table and its unique index:

CREATE TABLE distributors (
    did     integer,
    name    varchar(40),
    UNIQUE(name) WITH (fillfactor=70)
)
WITH (fillfactor=70);
Create table circles with an exclusion constraint that prevents any two circles from overlapping:
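A sketch, using the && overlap operator mentioned earlier:
CREATE TABLE circles (
    c circle,
    EXCLUDE USING gist (c WITH &&)
);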
Create a range partitioned table with multiple columns in the partition key:
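A sketch (table and column names are illustrative):
CREATE TABLE measurement_year_month (
    logdate         date not null,
    peaktemp        int,
    unitsales       int
) PARTITION BY RANGE (EXTRACT(YEAR FROM logdate), EXTRACT(MONTH FROM logdate));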
Create a few partitions of a range partitioned table with multiple columns in the partition key:
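A sketch, assuming the measurement_year_month parent from the previous example:

CREATE TABLE measurement_ym_older
    PARTITION OF measurement_year_month
    FOR VALUES FROM (MINVALUE, MINVALUE) TO (2016, 11);

CREATE TABLE measurement_ym_y2016m11
    PARTITION OF measurement_year_month
    FOR VALUES FROM (2016, 11) TO (2016, 12);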
Create a partition of a list partitioned table that is itself further partitioned and then add a partition to it:
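A sketch, assuming a list-partitioned parent table cities keyed on an initial-letter column and carrying city_id and population columns (all names illustrative):

CREATE TABLE cities_ab
    PARTITION OF cities (
    CONSTRAINT city_id_nonzero CHECK (city_id != 0)
) FOR VALUES IN ('a', 'b') PARTITION BY RANGE (population);

CREATE TABLE cities_ab_10000_to_100000
    PARTITION OF cities_ab FOR VALUES FROM (10000) TO (100000);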
Compatibility
The CREATE TABLE command conforms to the SQL standard, with exceptions listed below.
Temporary Tables
Although the syntax of CREATE TEMPORARY TABLE resembles that of the SQL standard, the effect
is not the same. In the standard, temporary tables are defined just once and automatically exist (starting
with empty contents) in every session that needs them. PostgreSQL instead requires each session to
issue its own CREATE TEMPORARY TABLE command for each temporary table to be used. This
allows different sessions to use the same temporary table name for different purposes, whereas the
standard's approach constrains all instances of a given temporary table name to have the same table
structure.
The standard's definition of the behavior of temporary tables is widely ignored. PostgreSQL's behavior
on this point is similar to that of several other SQL databases.
The SQL standard also distinguishes between global and local temporary tables, where a local tempo-
rary table has a separate set of contents for each SQL module within each session, though its definition
is still shared across sessions. Since PostgreSQL does not support SQL modules, this distinction is
not relevant in PostgreSQL.
For compatibility's sake, PostgreSQL will accept the GLOBAL and LOCAL keywords in a temporary
table declaration, but they currently have no effect. Use of these keywords is discouraged, since future
versions of PostgreSQL might adopt a more standard-compliant interpretation of their meaning.
The ON COMMIT clause for temporary tables also resembles the SQL standard, but has some differ-
ences. If the ON COMMIT clause is omitted, SQL specifies that the default behavior is ON COMMIT
DELETE ROWS. However, the default behavior in PostgreSQL is ON COMMIT PRESERVE ROWS.
The ON COMMIT DROP option does not exist in SQL.
EXCLUDE Constraint
The EXCLUDE constraint type is a PostgreSQL extension.
NULL “Constraint”
The NULL “constraint” (actually a non-constraint) is a PostgreSQL extension to the SQL standard
that is included for compatibility with some other database systems (and for symmetry with the NOT
NULL constraint). Since it is the default for any column, its presence is simply noise.
Constraint Naming
The SQL standard says that table and domain constraints must have names that are unique across the
schema containing the table or domain. PostgreSQL is laxer: it only requires constraint names to be
unique across the constraints attached to a particular table or domain. However, this extra freedom
does not exist for index-based constraints (UNIQUE, PRIMARY KEY, and EXCLUDE constraints),
because the associated index is named the same as the constraint, and index names must be unique
across all relations within the same schema.
Currently, PostgreSQL does not record names for NOT NULL constraints at all, so they are not subject
to the uniqueness restriction. This might change in a future release.
Inheritance
Multiple inheritance via the INHERITS clause is a PostgreSQL language extension. SQL:1999 and
later define single inheritance using a different syntax and different semantics. SQL:1999-style inher-
itance is not yet supported by PostgreSQL.
Zero-column Tables
PostgreSQL allows a table of no columns to be created (for example, CREATE TABLE foo();).
This is an extension from the SQL standard, which does not allow zero-column tables. Zero-column
tables are not in themselves very useful, but disallowing them creates odd special cases for ALTER
TABLE DROP COLUMN, so it seems cleaner to ignore this spec restriction.
LIKE Clause
While a LIKE clause exists in the SQL standard, many of the options that PostgreSQL accepts for it
are not in the standard, and some of the standard's options are not implemented by PostgreSQL.
WITH Clause
The WITH clause is a PostgreSQL extension; neither storage parameters nor OIDs are in the standard.
Tablespaces
The PostgreSQL concept of tablespaces is not part of the standard. Hence, the clauses TABLESPACE
and USING INDEX TABLESPACE are extensions.
Typed Tables
Typed tables implement a subset of the SQL standard. According to the standard, a typed table has
columns corresponding to the underlying composite type as well as one other column that is the “self-
referencing column”. PostgreSQL does not support these self-referencing columns explicitly, but the
same effect can be had using the OID feature.
PARTITION BY Clause
The PARTITION BY clause is a PostgreSQL extension.
PARTITION OF Clause
The PARTITION OF clause is a PostgreSQL extension.
See Also
ALTER TABLE, DROP TABLE, CREATE TABLE AS, CREATE TABLESPACE, CREATE TYPE
CREATE TABLE AS
CREATE TABLE AS — define a new table from the results of a query
Synopsis
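A sketch of the general form, assembled from the parameters described below:
CREATE [ [ GLOBAL | LOCAL ] { TEMPORARY | TEMP } | UNLOGGED ] TABLE
    [ IF NOT EXISTS ] table_name
    [ ( column_name [, ... ] ) ]
    [ WITH ( storage_parameter [= value] [, ... ] ) | WITH OIDS | WITHOUT OIDS ]
    [ ON COMMIT { PRESERVE ROWS | DELETE ROWS | DROP } ]
    [ TABLESPACE tablespace_name ]
    AS query
    [ WITH [ NO ] DATA ]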
Description
CREATE TABLE AS creates a table and fills it with data computed by a SELECT command. The table
columns have the names and data types associated with the output columns of the SELECT (except
that you can override the column names by giving an explicit list of new column names).
CREATE TABLE AS bears some resemblance to creating a view, but it is really quite different: it
creates a new table and evaluates the query just once to fill the new table initially. The new table will
not track subsequent changes to the source tables of the query. In contrast, a view re-evaluates its
defining SELECT statement whenever it is queried.
Parameters
GLOBAL or LOCAL
Ignored for compatibility. Use of these keywords is deprecated; refer to CREATE TABLE for
details.
TEMPORARY or TEMP
If specified, the table is created as a temporary table. Refer to CREATE TABLE for details.
UNLOGGED
If specified, the table is created as an unlogged table. Refer to CREATE TABLE for details.
IF NOT EXISTS
Do not throw an error if a relation with the same name already exists. A notice is issued in this
case. Refer to CREATE TABLE for details.
table_name
column_name
The name of a column in the new table. If column names are not provided, they are taken from
the output column names of the query.
WITH ( storage_parameter [= value] [, ... ] )
This clause specifies optional storage parameters for the new table; see Storage Parameters for
more information. The WITH clause can also include OIDS=TRUE (or just OIDS) to specify that
rows of the new table should have OIDs (object identifiers) assigned to them, or OIDS=FALSE
to specify that the rows should not have OIDs. See CREATE TABLE for more information.
WITH OIDS
WITHOUT OIDS
These are obsolescent syntaxes equivalent to WITH (OIDS) and WITH (OIDS=FALSE),
respectively. If you wish to give both an OIDS setting and storage parameters, you must use the
WITH ( ... ) syntax; see above.
ON COMMIT
The behavior of temporary tables at the end of a transaction block can be controlled using ON
COMMIT. The three options are:
PRESERVE ROWS
No special action is taken at the ends of transactions. This is the default behavior.
DELETE ROWS
All rows in the temporary table will be deleted at the end of each transaction block. Essen-
tially, an automatic TRUNCATE is done at each commit.
DROP
The temporary table will be dropped at the end of the current transaction block.
TABLESPACE tablespace_name
The tablespace_name is the name of the tablespace in which the new table is to be created.
If not specified, default_tablespace is consulted, or temp_tablespaces if the table is temporary.
query
WITH [ NO ] DATA
This clause specifies whether or not the data produced by the query should be copied into the new
table. If not, only the table structure is copied. The default is to copy the data.
Notes
This command is functionally similar to SELECT INTO, but it is preferred since it is less likely to be
confused with other uses of the SELECT INTO syntax. Furthermore, CREATE TABLE AS offers
a superset of the functionality offered by SELECT INTO.
The CREATE TABLE AS command allows the user to explicitly specify whether OIDs should be
included. If the presence of OIDs is not explicitly specified, the default_with_oids configuration vari-
able is used.
Examples
Create a new table films_recent consisting of only recent entries from the table films:
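A statement of roughly this shape does it (the cutoff date is illustrative):

-- copy only rows newer than the chosen cutoff into the new table
CREATE TABLE films_recent AS
  SELECT * FROM films WHERE date_prod >= '2002-01-01';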
To copy a table completely, the short form using the TABLE command can also be used:
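For instance (films2 is an arbitrary name for the copy):

-- TABLE films is shorthand for SELECT * FROM films
CREATE TABLE films2 AS
  TABLE films;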
Create a new temporary table films_recent, consisting of only recent entries from the table
films, using a prepared statement. The new table has OIDs and will be dropped at commit:
PREPARE recentfilms(date) AS
SELECT * FROM films WHERE date_prod > $1;
CREATE TEMP TABLE films_recent WITH (OIDS) ON COMMIT DROP AS
EXECUTE recentfilms('2002-01-01');
Compatibility
CREATE TABLE AS conforms to the SQL standard. The following are nonstandard extensions:
• The standard requires parentheses around the subquery clause; in PostgreSQL, these parentheses
are optional.
• In the standard, the WITH [ NO ] DATA clause is required; in PostgreSQL it is optional.
• PostgreSQL handles temporary tables in a way rather different from the standard; see CREATE
TABLE for details.
• The WITH clause is a PostgreSQL extension; neither storage parameters nor OIDs are in the stan-
dard.
• The PostgreSQL concept of tablespaces is not part of the standard. Hence, the clause TABLESPACE
is an extension.
See Also
CREATE MATERIALIZED VIEW, CREATE TABLE, EXECUTE, SELECT, SELECT INTO,
VALUES
CREATE TABLESPACE
CREATE TABLESPACE — define a new tablespace
Synopsis
Description
CREATE TABLESPACE registers a new cluster-wide tablespace. The tablespace name must be dis-
tinct from the name of any existing tablespace in the database cluster.
A tablespace allows superusers to define an alternative location on the file system where the data files
containing database objects (such as tables and indexes) can reside.
A user with appropriate privileges can pass tablespace_name to CREATE DATABASE, CREATE
TABLE, CREATE INDEX or ADD CONSTRAINT to have the data files for these objects stored within
the specified tablespace.
Warning
A tablespace cannot be used independently of the cluster in which it is defined; see Sec-
tion 22.6.
Parameters
tablespace_name
The name of a tablespace to be created. The name cannot begin with pg_, as such names are
reserved for system tablespaces.
user_name
The name of the user who will own the tablespace. If omitted, defaults to the user executing the
command. Only superusers can create tablespaces, but they can assign ownership of tablespaces
to non-superusers.
directory
The directory that will be used for the tablespace. The directory should be empty and must be
owned by the PostgreSQL system user. The directory must be specified by an absolute path name.
tablespace_option
A tablespace parameter to be set or reset. Currently, the only available parameters are
seq_page_cost, random_page_cost and effective_io_concurrency. Setting any of these
values for a particular tablespace will override the planner's usual estimate of the cost of reading
pages from tables in that tablespace, as established by the configuration parameters of the same
name (see seq_page_cost, random_page_cost, effective_io_concurrency). This may be useful if
one tablespace is located on a disk which is faster or slower than the remainder of the I/O sub-
system.
Notes
Tablespaces are only supported on systems that support symbolic links.
Examples
Create a tablespace dbspace at /data/dbs:
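The statement can be as simple as:

CREATE TABLESPACE dbspace LOCATION '/data/dbs';

The directory must already exist, be empty, and be owned by the PostgreSQL system user.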
Compatibility
CREATE TABLESPACE is a PostgreSQL extension.
See Also
CREATE DATABASE, CREATE TABLE, CREATE INDEX, DROP TABLESPACE, ALTER
TABLESPACE
CREATE TEXT SEARCH CONFIGURATION
CREATE TEXT SEARCH CONFIGURATION — define a new text search configuration
Synopsis
Description
CREATE TEXT SEARCH CONFIGURATION creates a new text search configuration. A text search
configuration specifies a text search parser that can divide a string into tokens, plus dictionaries that
can be used to determine which tokens are of interest for searching.
If only the parser is specified, then the new text search configuration initially has no mappings from
token types to dictionaries, and therefore will ignore all words. Subsequent ALTER TEXT SEARCH
CONFIGURATION commands must be used to create mappings to make the configuration useful.
Alternatively, an existing text search configuration can be copied.
If a schema name is given then the text search configuration is created in the specified schema. Oth-
erwise it is created in the current schema.
The user who defines a text search configuration becomes its owner.
Parameters
name
The name of the text search configuration to be created. The name can be schema-qualified.
parser_name
The name of the text search parser to use for this configuration.
source_config
Notes
The PARSER and COPY options are mutually exclusive, because when an existing configuration is
copied, its parser selection is copied too.
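As an illustration (the configuration name my_english is arbitrary), a new configuration is therefore
created either from a parser or by copying an existing configuration, for example:

-- start from the built-in english configuration; mappings can be adjusted later
CREATE TEXT SEARCH CONFIGURATION my_english ( COPY = pg_catalog.english );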
Compatibility
There is no CREATE TEXT SEARCH CONFIGURATION statement in the SQL standard.
See Also
ALTER TEXT SEARCH CONFIGURATION, DROP TEXT SEARCH CONFIGURATION
CREATE TEXT SEARCH DICTIONARY
CREATE TEXT SEARCH DICTIONARY — define a new text search dictionary
Synopsis
Description
CREATE TEXT SEARCH DICTIONARY creates a new text search dictionary. A text search dictio-
nary specifies a way of recognizing interesting or uninteresting words for searching. A dictionary de-
pends on a text search template, which specifies the functions that actually perform the work. Typical-
ly the dictionary provides some options that control the detailed behavior of the template's functions.
If a schema name is given then the text search dictionary is created in the specified schema. Otherwise
it is created in the current schema.
The user who defines a text search dictionary becomes its owner.
Parameters
name
The name of the text search dictionary to be created. The name can be schema-qualified.
template
The name of the text search template that will define the basic behavior of this dictionary.
option
value
The value to use for a template-specific option. If the value is not a simple identifier or number,
it must be quoted (but you can always quote it, if you wish).
Examples
The following example command creates a Snowball-based dictionary with a nonstandard list of stop
words.
CREATE TEXT SEARCH DICTIONARY my_russian (
    template = snowball,
    language = russian,
    stopwords = myrussian
);
Compatibility
There is no CREATE TEXT SEARCH DICTIONARY statement in the SQL standard.
See Also
ALTER TEXT SEARCH DICTIONARY, DROP TEXT SEARCH DICTIONARY
CREATE TEXT SEARCH PARSER
CREATE TEXT SEARCH PARSER — define a new text search parser
Synopsis
Description
CREATE TEXT SEARCH PARSER creates a new text search parser. A text search parser defines a
method for splitting a text string into tokens and assigning types (categories) to the tokens. A parser
is not particularly useful by itself, but must be bound into a text search configuration along with some
text search dictionaries to be used for searching.
If a schema name is given then the text search parser is created in the specified schema. Otherwise
it is created in the current schema.
You must be a superuser to use CREATE TEXT SEARCH PARSER. (This restriction is made because
an erroneous text search parser definition could confuse or even crash the server.)
Parameters
name
The name of the text search parser to be created. The name can be schema-qualified.
start_function
gettoken_function
end_function
lextypes_function
The name of the lextypes function for the parser (a function that returns information about the
set of token types it produces).
headline_function
The name of the headline function for the parser (a function that summarizes a set of tokens).
The function names can be schema-qualified if necessary. Argument types are not given, since the
argument list for each type of function is predetermined. All except the headline function are required.
The arguments can appear in any order, not only the one shown above.
Compatibility
There is no CREATE TEXT SEARCH PARSER statement in the SQL standard.
See Also
ALTER TEXT SEARCH PARSER, DROP TEXT SEARCH PARSER
CREATE TEXT SEARCH TEMPLATE
CREATE TEXT SEARCH TEMPLATE — define a new text search template
Synopsis
Description
CREATE TEXT SEARCH TEMPLATE creates a new text search template. Text search templates
define the functions that implement text search dictionaries. A template is not useful by itself, but must
be instantiated as a dictionary to be used. The dictionary typically specifies parameters to be given
to the template functions.
If a schema name is given then the text search template is created in the specified schema. Otherwise
it is created in the current schema.
You must be a superuser to use CREATE TEXT SEARCH TEMPLATE. This restriction is made
because an erroneous text search template definition could confuse or even crash the server. The rea-
son for separating templates from dictionaries is that a template encapsulates the “unsafe” aspects of
defining a dictionary. The parameters that can be set when defining a dictionary are safe for unprivi-
leged users to set, and so creating a dictionary need not be a privileged operation.
Parameters
name
The name of the text search template to be created. The name can be schema-qualified.
init_function
lexize_function
The function names can be schema-qualified if necessary. Argument types are not given, since the
argument list for each type of function is predetermined. The lexize function is required, but the init
function is optional.
The arguments can appear in any order, not only the one shown above.
Compatibility
There is no CREATE TEXT SEARCH TEMPLATE statement in the SQL standard.
See Also
ALTER TEXT SEARCH TEMPLATE, DROP TEXT SEARCH TEMPLATE
CREATE TRANSFORM
CREATE TRANSFORM — define a new transform
Synopsis
Description
CREATE TRANSFORM defines a new transform. CREATE OR REPLACE TRANSFORM will either
create a new transform, or replace an existing definition.
A transform specifies how to adapt a data type to a procedural language. For example, when writing
a function in PL/Python using the hstore type, PL/Python has no prior knowledge how to present
hstore values in the Python environment. Language implementations usually default to using the
text representation, but that is inconvenient when, for example, an associative array or a list would
be more appropriate.
A transform specifies two functions:
• A “from SQL” function that converts the type from the SQL environment to the language. This
function will be invoked on the arguments of a function written in the language.
• A “to SQL” function that converts the type from the language to the SQL environment. This function
will be invoked on the return value of a function written in the language.
It is not necessary to provide both of these functions. If one is not specified, the language-specific
default behavior will be used if necessary. (To prevent a transformation in a certain direction from
happening at all, you could also write a transform function that always errors out.)
To be able to create a transform, you must own and have USAGE privilege on the type, have USAGE
privilege on the language, and own and have EXECUTE privilege on the from-SQL and to-SQL func-
tions, if specified.
Parameters
type_name
lang_name
from_sql_function_name[(argument_type [, ...])]
The name of the function for converting the type from the SQL environment to the language. It
must take one argument of type internal and return type internal. The actual argument will
be of the type for the transform, and the function should be coded as if it were. (But it is not allowed
to declare an SQL-level function returning internal without at least one argument of type
internal.) The actual return value will be something specific to the language implementation.
If no argument list is specified, the function name must be unique in its schema.
to_sql_function_name[(argument_type [, ...])]
The name of the function for converting the type from the language to the SQL environment. It
must take one argument of type internal and return the type that is the type for the transform.
The actual argument value will be something specific to the language implementation. If no ar-
gument list is specified, the function name must be unique in its schema.
Notes
Use DROP TRANSFORM to remove transforms.
Examples
To create a transform for type hstore and language plpythonu, first set up the type and the
language:
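A sketch of that setup, assuming conversion functions named hstore_to_plpython and
plpython_to_hstore have already been written in C (the hstore_plpython contrib extension ships a
ready-made transform along these lines):

CREATE EXTENSION hstore;      -- provides the hstore type
CREATE EXTENSION plpythonu;   -- provides the PL/Python language

-- then create the transform itself
CREATE TRANSFORM FOR hstore LANGUAGE plpythonu (
    FROM SQL WITH FUNCTION hstore_to_plpython(internal),
    TO SQL WITH FUNCTION plpython_to_hstore(internal)
);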
The contrib section contains a number of extensions that provide transforms, which can serve as
real-world examples.
Compatibility
This form of CREATE TRANSFORM is a PostgreSQL extension. There is a CREATE TRANSFORM
command in the SQL standard, but it is for adapting data types to client languages. That usage is not
supported by PostgreSQL.
See Also
CREATE FUNCTION, CREATE LANGUAGE, CREATE TYPE, DROP TRANSFORM
CREATE TRIGGER
CREATE TRIGGER — define a new trigger
Synopsis
where event can be one of:
INSERT
UPDATE [ OF column_name [, ... ] ]
DELETE
TRUNCATE
Description
CREATE TRIGGER creates a new trigger. The trigger will be associated with the specified table, view,
or foreign table and will execute the specified function function_name when certain operations
are performed on that table.
The trigger can be specified to fire before the operation is attempted on a row (before constraints are
checked and the INSERT, UPDATE, or DELETE is attempted); or after the operation has completed
(after constraints are checked and the INSERT, UPDATE, or DELETE has completed); or instead of
the operation (in the case of inserts, updates or deletes on a view). If the trigger fires before or instead
of the event, the trigger can skip the operation for the current row, or change the row being inserted
(for INSERT and UPDATE operations only). If the trigger fires after the event, all changes, including
the effects of other triggers, are “visible” to the trigger.
A trigger that is marked FOR EACH ROW is called once for every row that the operation modifies. For
example, a DELETE that affects 10 rows will cause any ON DELETE triggers on the target relation to
be called 10 separate times, once for each deleted row. In contrast, a trigger that is marked FOR EACH
STATEMENT only executes once for any given operation, regardless of how many rows it modifies
(in particular, an operation that modifies zero rows will still result in the execution of any applicable
FOR EACH STATEMENT triggers).
Triggers that are specified to fire INSTEAD OF the trigger event must be marked FOR EACH ROW,
and can only be defined on views. BEFORE and AFTER triggers on a view must be marked as FOR
EACH STATEMENT.
In addition, triggers may be defined to fire for TRUNCATE, though only FOR EACH STATEMENT.
The following table summarizes which types of triggers may be used on tables, views, and foreign
tables:
Also, a trigger definition can specify a Boolean WHEN condition, which will be tested to see whether
the trigger should be fired. In row-level triggers the WHEN condition can examine the old and/or new
values of columns of the row. Statement-level triggers can also have WHEN conditions, although the
feature is not so useful for them since the condition cannot refer to any values in the table.
If multiple triggers of the same kind are defined for the same event, they will be fired in alphabetical
order by name.
When the CONSTRAINT option is specified, this command creates a constraint trigger. This is the
same as a regular trigger except that the timing of the trigger firing can be adjusted using SET CONS-
TRAINTS. Constraint triggers must be AFTER ROW triggers on plain tables (not foreign tables). They
can be fired either at the end of the statement causing the triggering event, or at the end of the contain-
ing transaction; in the latter case they are said to be deferred. A pending deferred-trigger firing can
also be forced to happen immediately by using SET CONSTRAINTS. Constraint triggers are expected
to raise an exception when the constraints they implement are violated.
The REFERENCING option enables collection of transition relations, which are row sets that include
all of the rows inserted, deleted, or modified by the current SQL statement. This feature lets the trigger
see a global view of what the statement did, not just one row at a time. This option is only allowed
for an AFTER trigger that is not a constraint trigger; also, if the trigger is an UPDATE trigger, it must
not specify a column_name list. OLD TABLE may only be specified once, and only for a trigger
that can fire on UPDATE or DELETE; it creates a transition relation containing the before-images of
all rows updated or deleted by the statement. Similarly, NEW TABLE may only be specified once, and
only for a trigger that can fire on UPDATE or INSERT; it creates a transition relation containing the
after-images of all rows updated or inserted by the statement.
SELECT does not modify any rows so you cannot create SELECT triggers. Rules and views may
provide workable solutions to problems that seem to need SELECT triggers.
Parameters
name
The name to give the new trigger. This must be distinct from the name of any other trigger for
the same table. The name cannot be schema-qualified — the trigger inherits the schema of its
table. For a constraint trigger, this is also the name to use when modifying the trigger's behavior
using SET CONSTRAINTS.
BEFORE
AFTER
INSTEAD OF
Determines whether the function is called before, after, or instead of the event. A constraint trigger
can only be specified as AFTER.
event
One of INSERT, UPDATE, DELETE, or TRUNCATE; this specifies the event that will fire the
trigger. Multiple events can be specified using OR, except when transition relations are requested.
For UPDATE events, it is possible to specify a list of columns using this syntax:
UPDATE OF column_name1 [, column_name2 ... ]
The trigger will only fire if at least one of the listed columns is mentioned as a target of the
UPDATE command.
INSTEAD OF UPDATE events do not allow a list of columns. A column list cannot be specified
when requesting transition relations, either.
table_name
The name (optionally schema-qualified) of the table, view, or foreign table the trigger is for.
referenced_table_name
The (possibly schema-qualified) name of another table referenced by the constraint. This option
is used for foreign-key constraints and is not recommended for general use. This can only be
specified for constraint triggers.
DEFERRABLE
NOT DEFERRABLE
INITIALLY IMMEDIATE
INITIALLY DEFERRED
The default timing of the trigger. See the CREATE TABLE documentation for details of these
constraint options. This can only be specified for constraint triggers.
REFERENCING
This keyword immediately precedes the declaration of one or two relation names that provide
access to the transition relations of the triggering statement.
OLD TABLE
NEW TABLE
This clause indicates whether the following relation name is for the before-image transition rela-
tion or the after-image transition relation.
transition_relation_name
The (unqualified) name to be used within the trigger for this transition relation.
FOR EACH ROW
FOR EACH STATEMENT
This specifies whether the trigger function should be fired once for every row affected by the
trigger event, or just once per SQL statement. If neither is specified, FOR EACH STATEMENT
is the default. Constraint triggers can only be specified FOR EACH ROW.
condition
A Boolean expression that determines whether the trigger function will actually be executed. If
WHEN is specified, the function will only be called if the condition returns true. In FOR
EACH ROW triggers, the WHEN condition can refer to columns of the old and/or new row values
by writing OLD.column_name or NEW.column_name respectively. (INSERT triggers cannot
refer to OLD, and DELETE triggers cannot refer to NEW.)
Note that for constraint triggers, evaluation of the WHEN condition is not deferred, but occurs
immediately after the row update operation is performed. If the condition does not evaluate to
true then the trigger is not queued for deferred execution.
function_name
A user-supplied function that is declared as taking no arguments and returning type trigger,
which is executed when the trigger fires.
In the syntax of CREATE TRIGGER, the keywords FUNCTION and PROCEDURE are equivalent,
but the referenced function must in any case be a function, not a procedure. The use of the keyword
PROCEDURE here is historical and deprecated.
arguments
An optional comma-separated list of arguments to be provided to the function when the trigger
is executed. The arguments are literal string constants. Simple names and numeric constants can
be written here, too, but they will all be converted to strings. Please check the description of the
implementation language of the trigger function to find out how these arguments can be accessed
within the function; it might be different from normal function arguments.
Notes
To create a trigger on a table, the user must have the TRIGGER privilege on the table. The user must
also have EXECUTE privilege on the trigger function.
A column-specific trigger (one defined using the UPDATE OF column_name syntax) will fire when
any of its columns are listed as targets in the UPDATE command's SET list. It is possible for a column's
value to change even when the trigger is not fired, because changes made to the row's contents by
BEFORE UPDATE triggers are not considered. Conversely, a command such as UPDATE ... SET
x = x ... will fire a trigger on column x, even though the column's value did not change.
There are a few built-in trigger functions that can be used to solve common problems without having
to write your own trigger code; see Section 9.27.
In a BEFORE trigger, the WHEN condition is evaluated just before the function is or would be executed,
so using WHEN is not materially different from testing the same condition at the beginning of the trigger
function. Note in particular that the NEW row seen by the condition is the current value, as possibly
modified by earlier triggers. Also, a BEFORE trigger's WHEN condition is not allowed to examine the
system columns of the NEW row (such as oid), because those won't have been set yet.
In an AFTER trigger, the WHEN condition is evaluated just after the row update occurs, and it deter-
mines whether an event is queued to fire the trigger at the end of statement. So when an AFTER trig-
ger's WHEN condition does not return true, it is not necessary to queue an event nor to re-fetch the row
at end of statement. This can result in significant speedups in statements that modify many rows, if
the trigger only needs to be fired for a few of the rows.
In some cases it is possible for a single SQL command to fire more than one kind of trigger. For
instance an INSERT with an ON CONFLICT DO UPDATE clause may cause both insert and update
operations, so it will fire both kinds of triggers as needed. The transition relations supplied to triggers
are specific to their event type; thus an INSERT trigger will see only the inserted rows, while an
UPDATE trigger will see only the updated rows.
Row updates or deletions caused by foreign-key enforcement actions, such as ON UPDATE CASCADE
or ON DELETE SET NULL, are treated as part of the SQL command that caused them (note that such
actions are never deferred). Relevant triggers on the affected table will be fired, so that this provides
another way in which a SQL command might fire triggers not directly matching its type. In simple
cases, triggers that request transition relations will see all changes caused in their table by a single
original SQL command as a single transition relation. However, there are cases in which the presence
of an AFTER ROW trigger that requests transition relations will cause the foreign-key enforcement
actions triggered by a single SQL command to be split into multiple steps, each with its own transition
relation(s). In such cases, any statement-level triggers that are present will be fired once per creation
of a transition relation set, ensuring that the triggers see each affected row in a transition relation once
and only once.
Statement-level triggers on a view are fired only if the action on the view is handled by a row-level
INSTEAD OF trigger. If the action is handled by an INSTEAD rule, then whatever statements are
emitted by the rule are executed in place of the original statement naming the view, so that the triggers
that will be fired are those on tables named in the replacement statements. Similarly, if the view is
automatically updatable, then the action is handled by automatically rewriting the statement into an
action on the view's base table, so that the base table's statement-level triggers are the ones that are
fired.
Creating a row-level trigger on a partitioned table will cause identical triggers to be created in all its
existing partitions; and any partitions created or attached later will contain an identical trigger, too.
If the partition is detached from its parent, the trigger is removed. Triggers on partitioned tables may
only be AFTER.
Modifying a partitioned table or a table with inheritance children fires statement-level triggers attached
to the explicitly named table, but not statement-level triggers for its partitions or child tables. In con-
trast, row-level triggers are fired on the rows in affected partitions or child tables, even if they are not
explicitly named in the query. If a statement-level trigger has been defined with transition relations
named by a REFERENCING clause, then before and after images of rows are visible from all affected
partitions or child tables. In the case of inheritance children, the row images include only columns
that are present in the table that the trigger is attached to. Currently, row-level triggers with transition
relations cannot be defined on partitions or inheritance child tables.
In PostgreSQL versions before 7.3, it was necessary to declare trigger functions as returning the place-
holder type opaque, rather than trigger. To support loading of old dump files, CREATE TRIG-
GER will accept a function declared as returning opaque, but it will issue a notice and change the
function's declared return type to trigger.
Examples
Execute the function check_account_update whenever a row of the table accounts is about
to be updated:
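A minimal form of such a trigger could be (the trigger name check_update is arbitrary;
check_account_update is assumed to be an existing function returning trigger):

CREATE TRIGGER check_update
    BEFORE UPDATE ON accounts
    FOR EACH ROW
    EXECUTE FUNCTION check_account_update();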
The same, but only execute the function if column balance is specified as a target in the UPDATE
command:
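One way to write this, reusing the names above, is to add a column list after UPDATE:

CREATE TRIGGER check_update
    BEFORE UPDATE OF balance ON accounts
    FOR EACH ROW
    EXECUTE FUNCTION check_account_update();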
This form only executes the function if column balance has in fact changed value:
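A sketch using a WHEN condition on the old and new values:

CREATE TRIGGER check_update
    BEFORE UPDATE ON accounts
    FOR EACH ROW
    WHEN (OLD.balance IS DISTINCT FROM NEW.balance)
    EXECUTE FUNCTION check_account_update();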
Execute the function view_insert_row for each row to insert rows into the tables underlying a
view:
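For example (my_view is a placeholder view name; view_insert_row is assumed to exist):

CREATE TRIGGER view_insert
    INSTEAD OF INSERT ON my_view
    FOR EACH ROW
    EXECUTE FUNCTION view_insert_row();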
Execute the function check_matching_pairs for each row to confirm that changes are made to
matching pairs at the same time (by the same statement):
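A statement-level AFTER trigger with transition tables could be declared along these lines (the table
name paired_items and the transition relation names are placeholders):

CREATE TRIGGER paired_items_update
    AFTER UPDATE ON paired_items
    REFERENCING NEW TABLE AS newtab OLD TABLE AS oldtab
    FOR EACH STATEMENT
    EXECUTE FUNCTION check_matching_pairs();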
Compatibility
The CREATE TRIGGER statement in PostgreSQL implements a subset of the SQL standard. The
following functionalities are currently missing:
• While transition table names for AFTER triggers are specified using the REFERENCING clause in
the standard way, the row variables used in FOR EACH ROW triggers may not be specified in a
REFERENCING clause. They are available in a manner that is dependent on the language in which
the trigger function is written, but is fixed for any one language. Some languages effectively behave
as though there is a REFERENCING clause containing OLD ROW AS OLD NEW ROW AS NEW.
• The standard allows transition tables to be used with column-specific UPDATE triggers, but then
the set of rows that should be visible in the transition tables depends on the trigger's column list.
This is not currently implemented by PostgreSQL.
• PostgreSQL only allows the execution of a user-defined function for the triggered action. The stan-
dard allows the execution of a number of other SQL commands, such as CREATE TABLE, as the
triggered action. This limitation is not hard to work around by creating a user-defined function that
executes the desired commands.
SQL specifies that multiple triggers should be fired in time-of-creation order. PostgreSQL uses name
order, which was judged to be more convenient.
SQL specifies that BEFORE DELETE triggers on cascaded deletes fire after the cascaded DELETE
completes. The PostgreSQL behavior is for BEFORE DELETE to always fire before the delete action,
even a cascading one. This is considered more consistent. There is also nonstandard behavior if BE-
FORE triggers modify rows or prevent updates during an update that is caused by a referential action.
This can lead to constraint violations or stored data that does not honor the referential constraint.
The ability to specify multiple actions for a single trigger using OR is a PostgreSQL extension of the
SQL standard.
The ability to fire triggers for TRUNCATE is a PostgreSQL extension of the SQL standard, as is the
ability to define statement-level triggers on views.
See Also
ALTER TRIGGER, DROP TRIGGER, CREATE FUNCTION, SET CONSTRAINTS
CREATE TYPE
CREATE TYPE — define a new data type
Synopsis
Description
CREATE TYPE registers a new data type for use in the current database. The user who defines a type
becomes its owner.
If a schema name is given then the type is created in the specified schema. Otherwise it is created in
the current schema. The type name must be distinct from the name of any existing type or domain
in the same schema. (Because tables have associated data types, the type name must also be distinct
from the name of any existing table in the same schema.)
There are five forms of CREATE TYPE, as shown in the syntax synopsis above. They respectively
create a composite type, an enum type, a range type, a base type, or a shell type. The first four of
these are discussed in turn below. A shell type is simply a placeholder for a type to be defined later;
it is created by issuing CREATE TYPE with no parameters except for the type name. Shell types are
needed as forward references when creating range types and base types, as discussed in those sections.
Composite Types
The first form of CREATE TYPE creates a composite type. The composite type is specified by a
list of attribute names and data types. An attribute's collation can be specified too, if its data type is
collatable. A composite type is essentially the same as the row type of a table, but using CREATE
TYPE avoids the need to create an actual table when all that is wanted is to define a type. A stand-
alone composite type is useful, for example, as the argument or return type of a function.
To be able to create a composite type, you must have USAGE privilege on all attribute types.
Enumerated Types
The second form of CREATE TYPE creates an enumerated (enum) type, as described in Section 8.7.
Enum types take a list of quoted labels, each of which must be less than NAMEDATALEN bytes long
(64 bytes in a standard PostgreSQL build). (It is possible to create an enumerated type with zero labels,
but such a type cannot be used to hold values before at least one label is added using ALTER TYPE.)
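A simple enum type can be declared like this (the type name and labels are arbitrary):

CREATE TYPE mood AS ENUM ('sad', 'ok', 'happy');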
Range Types
The third form of CREATE TYPE creates a new range type, as described in Section 8.17.
The range type's subtype can be any type with an associated b-tree operator class (to determine the
ordering of values for the range type). Normally the subtype's default b-tree operator class is used to
determine ordering; to use a non-default operator class, specify its name with subtype_opclass.
If the subtype is collatable, and you want to use a non-default collation in the range's ordering, specify
the desired collation with the collation option.
The optional canonical function must take one argument of the range type being defined, and return
a value of the same type. This is used to convert range values to a canonical form, when applicable.
See Section 8.17.8 for more information. Creating a canonical function is a bit tricky, since it must
be defined before the range type can be declared. To do this, you must first create a shell type, which
is a placeholder type that has no properties except a name and an owner. This is done by issuing the
command CREATE TYPE name, with no additional parameters. Then the function can be declared
using the shell type as argument and result, and finally the range type can be declared using the same
name. This automatically replaces the shell type entry with a valid range type.
The optional subtype_diff function must take two values of the subtype type as argument,
and return a double precision value representing the difference between the two given values.
While this is optional, providing it allows much greater efficiency of GiST indexes on columns of the
range type. See Section 8.17.8 for more information.
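For instance, a range type over float8 could be declared as follows, using the built-in float8mi
subtraction function as the subtype difference function (the type name is arbitrary):

CREATE TYPE float8_range AS RANGE (
    subtype = float8,
    subtype_diff = float8mi
);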
Base Types
The fourth form of CREATE TYPE creates a new base type (scalar type). To create a new base type,
you must be a superuser. (This restriction is made because an erroneous type definition could confuse
or even crash the server.)
The parameters can appear in any order, not only that illustrated above, and most are optional. You
must register two or more functions (using CREATE FUNCTION) before defining the type. The sup-
port functions input_function and output_function are required, while the functions re-
ceive_function, send_function, type_modifier_input_function, type_modi-
fier_output_function and analyze_function are optional. Generally these functions
have to be coded in C or another low-level language.
The input_function converts the type's external textual representation to the internal represen-
tation used by the operators and functions defined for the type. output_function performs the
reverse transformation. The input function can be declared as taking one argument of type cstring,
or as taking three arguments of types cstring, oid, integer. The first argument is the input text
as a C string, the second argument is the type's own OID (except for array types, which instead receive
their element type's OID), and the third is the typmod of the destination column, if known (-1 will be
passed if not). The input function must return a value of the data type itself. Usually, an input function
should be declared STRICT; if it is not, it will be called with a NULL first parameter when reading
a NULL input value. The function must still return NULL in this case, unless it raises an error. (This
case is mainly meant to support domain input functions, which might need to reject NULL inputs.)
The output function must be declared as taking one argument of the new data type. The output function
must return type cstring. Output functions are not invoked for NULL values.
The optional receive_function converts the type's external binary representation to the internal
representation. If this function is not supplied, the type cannot participate in binary input. The bina-
ry representation should be chosen to be cheap to convert to internal form, while being reasonably
portable. (For example, the standard integer data types use network byte order as the external binary
representation, while the internal representation is in the machine's native byte order.) The receive
function should perform adequate checking to ensure that the value is valid. The receive function
can be declared as taking one argument of type internal, or as taking three arguments of types
internal, oid, integer. The first argument is a pointer to a StringInfo buffer holding the
received byte string; the optional arguments are the same as for the text input function. The receive
function must return a value of the data type itself. Usually, a receive function should be declared
STRICT; if it is not, it will be called with a NULL first parameter when reading a NULL input value.
The function must still return NULL in this case, unless it raises an error. (This case is mainly meant
to support domain receive functions, which might need to reject NULL inputs.) Similarly, the optional
send_function converts from the internal representation to the external binary representation. If
this function is not supplied, the type cannot participate in binary output. The send function must be
declared as taking one argument of the new data type. The send function must return type bytea.
Send functions are not invoked for NULL values.
You should at this point be wondering how the input and output functions can be declared to have
results or arguments of the new type, when they have to be created before the new type can be created.
The answer is that the type should first be defined as a shell type, which is a placeholder type that
has no properties except a name and an owner. This is done by issuing the command CREATE TYPE
name, with no additional parameters. Then the C I/O functions can be defined referencing the shell
type. Finally, CREATE TYPE with a full definition replaces the shell entry with a complete, valid
type definition, after which the new type can be used normally.
The optional analyze_function performs type-specific statistics collection for columns of the
data type. By default, ANALYZE will attempt to gather statistics using the type's “equals” and “less-
than” operators, if there is a default b-tree operator class for the type. For non-scalar types this behavior
is likely to be unsuitable, so it can be overridden by specifying a custom analysis function. The analysis
function must be declared to take a single argument of type internal, and return a boolean result.
The detailed API for analysis functions appears in src/include/commands/vacuum.h.
While the details of the new type's internal representation are only known to the I/O functions and other
functions you create to work with the type, there are several properties of the internal representation
that must be declared to PostgreSQL. Foremost of these is internallength. Base data types can
be fixed-length, in which case internallength is a positive integer, or variable-length, indicated
by setting internallength to VARIABLE. (Internally, this is represented by setting typlen to
-1.) The internal representation of all variable-length types must start with a 4-byte integer giving
the total length of this value of the type. (Note that the length field is often encoded, as described in
Section 69.2; it's unwise to access it directly.)
The optional flag PASSEDBYVALUE indicates that values of this data type are passed by value, rather
than by reference. Types passed by value must be fixed-length, and their internal representation cannot
be larger than the size of the Datum type (4 bytes on some machines, 8 bytes on others).
The alignment parameter specifies the storage alignment required for the data type. The allowed
values equate to alignment on 1, 2, 4, or 8 byte boundaries. Note that variable-length types must have
an alignment of at least 4, since they necessarily contain an int4 as their first component.
The storage parameter allows selection of storage strategies for variable-length data types. (Only
plain is allowed for fixed-length types.) plain specifies that data of the type will always be stored
in-line and not compressed. extended specifies that the system will first try to compress a long
data value, and will move the value out of the main table row if it's still too long. external allows
the value to be moved out of the main table, but the system will not try to compress it. main allows
compression, but discourages moving the value out of the main table. (Data items with this storage
strategy might still be moved out of the main table if there is no other way to make a row fit, but they
will be kept in the main table preferentially over extended and external items.)
All storage values other than plain imply that the functions of the data type can handle values that
have been toasted, as described in Section 69.2 and Section 38.12.1. The specific other value given
merely determines the default TOAST storage strategy for columns of a toastable data type; users can
pick other strategies for individual columns using ALTER TABLE SET STORAGE.
The like_type parameter provides an alternative method for specifying the basic representation
properties of a data type: copy them from some existing type. The values of internallength,
passedbyvalue, alignment, and storage are copied from the named type. (It is possible,
though usually undesirable, to override some of these values by specifying them along with the LIKE
clause.) Specifying representation this way is especially useful when the low-level implementation of
the new type “piggybacks” on an existing type in some fashion.
The category and preferred parameters can be used to help control which implicit cast will
be applied in ambiguous situations. Each data type belongs to a category named by a single ASCII
character, and each type is either “preferred” or not within its category. The parser will prefer casting
to preferred types (but only from other types within the same category) when this rule is helpful in
resolving overloaded functions or operators. For more details see Chapter 10. For types that have no
implicit casts to or from any other types, it is sufficient to leave these settings at the defaults. However,
for a group of related types that have implicit casts, it is often helpful to mark them all as belonging
to a category and select one or two of the “most general” types as being preferred within the category.
The category parameter is especially useful when adding a user-defined type to an existing built-in
category, such as the numeric or string types. However, it is also possible to create new entirely-user-
defined type categories. Select any ASCII character other than an upper-case letter to name such a
category.
A default value can be specified, in case a user wants columns of the data type to default to something
other than the null value. Specify the default with the DEFAULT key word. (Such a default can be
overridden by an explicit DEFAULT clause attached to a particular column.)
To indicate that a type is an array, specify the type of the array elements using the ELEMENT key
word. For example, to define an array of 4-byte integers (int4), specify ELEMENT = int4. More
details about array types appear below.
To indicate the delimiter to be used between values in the external representation of arrays of this type,
delimiter can be set to a specific character. The default delimiter is the comma (,). Note that the
delimiter is associated with the array element type, not the array type itself.
If the optional Boolean parameter collatable is true, column definitions and expressions of the
type may carry collation information through use of the COLLATE clause. It is up to the implementa-
tions of the functions operating on the type to actually make use of the collation information; this does
not happen automatically merely by marking the type collatable.
Array Types
Whenever a user-defined type is created, PostgreSQL automatically creates an associated array type,
whose name consists of the element type's name prepended with an underscore, and truncated if nec-
essary to keep it less than NAMEDATALEN bytes long. (If the name so generated collides with an ex-
isting type name, the process is repeated until a non-colliding name is found.) This implicitly-creat-
ed array type is variable length and uses the built-in input and output functions array_in and ar-
ray_out. The array type tracks any changes in its element type's owner or schema, and is dropped
if the element type is.
You might reasonably ask why there is an ELEMENT option, if the system makes the correct array type
automatically. The only case where it's useful to use ELEMENT is when you are making a fixed-length
type that happens to be internally an array of a number of identical things, and you want to allow these
things to be accessed directly by subscripting, in addition to whatever operations you plan to provide
for the type as a whole. For example, type point is represented as just two floating-point numbers,
which can be accessed using point[0] and point[1]. Note that this facility only works for fixed-
length types whose internal form is exactly a sequence of identical fixed-length fields. A subscriptable
variable-length type must have the generalized internal representation used by array_in and ar-
ray_out. For historical reasons (i.e., this is clearly wrong but it's far too late to change it), subscript-
ing of fixed-length array types starts from zero, rather than from one as for variable-length arrays.
Parameters
name
attribute_name
data_type
The name of an existing data type to become a column of the composite type.
collation
The name of an existing collation to be associated with a column of a composite type, or with
a range type.
label
A string literal representing the textual label associated with one value of an enum type.
subtype
The name of the element type that the range type will represent ranges of.
subtype_operator_class
canonical_function
subtype_diff_function
input_function
The name of a function that converts data from the type's external textual form to its internal form.
output_function
The name of a function that converts data from the type's internal form to its external textual form.
receive_function
The name of a function that converts data from the type's external binary form to its internal form.
send_function
The name of a function that converts data from the type's internal form to its external binary form.
type_modifier_input_function
The name of a function that converts an array of modifier(s) for the type into internal form.
type_modifier_output_function
The name of a function that converts the internal form of the type's modifier(s) to external textual
form.
analyze_function
The name of a function that performs statistical analysis for the data type.
internallength
A numeric constant that specifies the length in bytes of the new type's internal representation. The
default assumption is that it is variable-length.
alignment
The storage alignment requirement of the data type. If specified, it must be char, int2, int4,
or double; the default is int4.
storage
The storage strategy for the data type. If specified, must be plain, external, extended, or
main; the default is plain.
like_type
The name of an existing data type that the new type will have the same representation as. The val-
ues of internallength, passedbyvalue, alignment, and storage are copied from
that type, unless overridden by explicit specification elsewhere in this CREATE TYPE command.
category
The category code (a single ASCII character) for this type. The default is 'U' for “user-defined
type”. Other standard category codes can be found in Table 52.63. You may also choose other
ASCII characters in order to create custom categories.
preferred
True if this type is a preferred type within its type category, else false. The default is false. Be
very careful about creating a new preferred type within an existing type category, as this could
cause surprising changes in behavior.
default
The default value for the data type. If this is omitted, the default is null.
element
The type being created is an array; this specifies the type of the array elements.
delimiter
The delimiter character to be used between values in arrays made of this type.
collatable
True if this type's operations can use collation information. The default is false.
Notes
Because there are no restrictions on use of a data type once it's been created, creating a base type or
range type is tantamount to granting public execute permission on the functions mentioned in the type
definition. This is usually not an issue for the sorts of functions that are useful in a type definition. But
you might want to think twice before designing a type in a way that would require “secret” information
to be used while converting it to or from external form.
Before PostgreSQL version 8.3, the name of a generated array type was always exactly the element
type's name with one underscore character (_) prepended. (Type names were therefore restricted in
length to one fewer character than other names.) While this is still usually the case, the array type
name may vary from this in case of maximum-length names or collisions with user type names that
begin with underscore. Writing code that depends on this convention is therefore deprecated. Instead,
use pg_type.typarray to locate the array type associated with a given type.
It may be advisable to avoid using type and table names that begin with underscore. While the server
will change generated array type names to avoid collisions with user-given names, there is still risk
of confusion, particularly with old client software that may assume that type names beginning with
underscores always represent arrays.
Before PostgreSQL version 8.2, the shell-type creation syntax CREATE TYPE name did not exist.
The way to create a new base type was to create its input function first. In this approach, PostgreSQL
will first see the name of the new data type as the return type of the input function. The shell type is
implicitly created in this situation, and then it can be referenced in the definitions of the remaining
I/O functions. This approach still works, but is deprecated and might be disallowed in some future
release. Also, to avoid accidentally cluttering the catalogs with shell types as a result of simple typos
in function definitions, a shell type will only be made this way when the input function is written in C.
In PostgreSQL versions before 7.3, it was customary to avoid creating a shell type at all, by replacing
the functions' forward references to the type name with the placeholder pseudo-type opaque. The
cstring arguments and results also had to be declared as opaque before 7.3. To support loading
of old dump files, CREATE TYPE will accept I/O functions declared using opaque, but it will issue
a notice and change the function declarations to use the correct types.
Examples
This example creates a composite type and uses it in a function definition:
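A sketch of such a pair of definitions (the table foo, with columns fooid and fooname of matching
types, is assumed to exist):

CREATE TYPE compfoo AS (f1 int, f2 text);

CREATE FUNCTION getfoo() RETURNS SETOF compfoo AS $$
    SELECT fooid, fooname FROM foo
$$ LANGUAGE SQL;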
This example creates the base data type box and then uses the type in a table definition:
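A sketch, assuming C-language I/O functions my_box_in_function and my_box_out_function
have already been created and that the internal representation occupies 16 bytes:

CREATE TYPE box (
    INTERNALLENGTH = 16,
    INPUT = my_box_in_function,
    OUTPUT = my_box_out_function
);

CREATE TABLE myboxes (
    id integer,
    description box
);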
If the internal structure of box were an array of four float4 elements, we might instead use:
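Under the same assumptions, the definition would gain an ELEMENT clause:

CREATE TYPE box (
    INTERNALLENGTH = 16,
    INPUT = my_box_in_function,
    OUTPUT = my_box_out_function,
    ELEMENT = float4
);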
which would allow a box value's component numbers to be accessed by subscripting. Otherwise the
type behaves the same as before.
This example creates a large object type and uses it in a table definition:
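A sketch, assuming C functions lo_filein and lo_fileout handle the type's textual I/O:

CREATE TYPE bigobj (
    INPUT = lo_filein,
    OUTPUT = lo_fileout,
    INTERNALLENGTH = VARIABLE
);

CREATE TABLE big_objs (
    id integer,
    obj bigobj
);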
More examples, including suitable input and output functions, are in Section 38.12.
Compatibility
The first form of the CREATE TYPE command, which creates a composite type, conforms to the SQL
standard. The other forms are PostgreSQL extensions. The CREATE TYPE statement in the SQL
standard also defines other forms that are not implemented in PostgreSQL.
The ability to create a composite type with zero attributes is a PostgreSQL-specific deviation from the
standard (analogous to the same case in CREATE TABLE).
See Also
ALTER TYPE, CREATE DOMAIN, CREATE FUNCTION, DROP TYPE
CREATE USER
CREATE USER — define a new database role
Synopsis
SUPERUSER | NOSUPERUSER
| CREATEDB | NOCREATEDB
| CREATEROLE | NOCREATEROLE
| INHERIT | NOINHERIT
| LOGIN | NOLOGIN
| REPLICATION | NOREPLICATION
| BYPASSRLS | NOBYPASSRLS
| CONNECTION LIMIT connlimit
| [ ENCRYPTED ] PASSWORD 'password' | PASSWORD NULL
| VALID UNTIL 'timestamp'
| IN ROLE role_name [, ...]
| IN GROUP role_name [, ...]
| ROLE role_name [, ...]
| ADMIN role_name [, ...]
| USER role_name [, ...]
| SYSID uid
Description
CREATE USER is now an alias for CREATE ROLE. The only difference is that when the command
is spelled CREATE USER, LOGIN is assumed by default, whereas NOLOGIN is assumed when the
command is spelled CREATE ROLE.
Compatibility
The CREATE USER statement is a PostgreSQL extension. The SQL standard leaves the definition
of users to the implementation.
See Also
CREATE ROLE
CREATE USER MAPPING
CREATE USER MAPPING — define a new mapping of a user to a foreign server
Synopsis
Description
CREATE USER MAPPING defines a mapping of a user to a foreign server. A user mapping typically
encapsulates connection information that a foreign-data wrapper uses together with the information
encapsulated by a foreign server to access an external data resource.
The owner of a foreign server can create user mappings for that server for any user. Also, a user can
create a user mapping for their own user name if USAGE privilege on the server has been granted to
the user.
Parameters
IF NOT EXISTS
Do not throw an error if a mapping of the given user to the given foreign server already exists.
A notice is issued in this case. Note that there is no guarantee that the existing user mapping is
anything like the one that would have been created.
user_name
The name of an existing user that is mapped to foreign server. CURRENT_USER and USER match
the name of the current user. When PUBLIC is specified, a so-called public mapping is created
that is used when no user-specific mapping is applicable.
server_name
The name of an existing server for which the user mapping is to be created.
OPTIONS ( option 'value' [, ... ] )
This clause specifies the options of the user mapping. The options typically define the actual user
name and password of the mapping. Option names must be unique. The allowed option names
and values are specific to the server's foreign-data wrapper.
Examples
Create a user mapping for user bob, server foo:
CREATE USER MAPPING FOR bob SERVER foo OPTIONS (user 'bob',
password 'secret');
Compatibility
CREATE USER MAPPING conforms to ISO/IEC 9075-9 (SQL/MED).
See Also
ALTER USER MAPPING, DROP USER MAPPING, CREATE FOREIGN DATA WRAPPER,
CREATE SERVER
CREATE VIEW
CREATE VIEW — define a new view
Synopsis
Description
CREATE VIEW defines a view of a query. The view is not physically materialized. Instead, the query
is run every time the view is referenced in a query.
CREATE OR REPLACE VIEW is similar, but if a view of the same name already exists, it is replaced.
The new query must generate the same columns that were generated by the existing view query (that
is, the same column names in the same order and with the same data types), but it may add additional
columns to the end of the list. The calculations giving rise to the output columns may be completely
different.
If a schema name is given (for example, CREATE VIEW myschema.myview ...) then the view
is created in the specified schema. Otherwise it is created in the current schema. Temporary views
exist in a special schema, so a schema name cannot be given when creating a temporary view. The
name of the view must be distinct from the name of any other view, table, sequence, index or foreign
table in the same schema.
Parameters
TEMPORARY or TEMP
If specified, the view is created as a temporary view. Temporary views are automatically dropped
at the end of the current session. Existing permanent relations with the same name are not visible
to the current session while the temporary view exists, unless they are referenced with schema-
qualified names.
If any of the tables referenced by the view are temporary, the view is created as a temporary view
(whether TEMPORARY is specified or not).
RECURSIVE
Creates a recursive view. Writing CREATE RECURSIVE VIEW name (columns) AS
SELECT ... is equivalent to writing CREATE VIEW name AS WITH RECURSIVE
name (columns) AS (SELECT ...) SELECT columns FROM name. A view column
name list must be specified for a recursive view.
name
column_name
An optional list of names to be used for columns of the view. If not given, the column names are
deduced from the query.
WITH ( view_option_name [= view_option_value] [, ... ] )
This clause specifies optional parameters for a view; the following parameters are supported:
check_option (string)
This parameter may be either local or cascaded, and is equivalent to specifying WITH
[ CASCADED | LOCAL ] CHECK OPTION (see below). This option can be changed
on existing views using ALTER VIEW.
security_barrier (boolean)
This should be used if the view is intended to provide row-level security. See Section 41.5
for full details.
query
A SELECT or VALUES command which will provide the columns and rows of the view.
WITH [ CASCADED | LOCAL ] CHECK OPTION
This option controls the behavior of automatically updatable views. When this option is specified,
INSERT and UPDATE commands on the view will be checked to ensure that new rows satisfy the
view-defining condition (that is, the new rows are checked to ensure that they are visible through
the view). If they are not, the update will be rejected. If the CHECK OPTION is not specified,
INSERT and UPDATE commands on the view are allowed to create rows that are not visible
through the view. The following check options are supported:
LOCAL
New rows are only checked against the conditions defined directly in the view itself. Any
conditions defined on underlying base views are not checked (unless they also specify the
CHECK OPTION).
CASCADED
New rows are checked against the conditions of the view and all underlying base views. If
the CHECK OPTION is specified, and neither LOCAL nor CASCADED is specified, then
CASCADED is assumed.
Note that the CHECK OPTION is only supported on views that are automatically updatable, and do
not have INSTEAD OF triggers or INSTEAD rules. If an automatically updatable view is defined
on top of a base view that has INSTEAD OF triggers, then the LOCAL CHECK OPTION may be
used to check the conditions on the automatically updatable view, but the conditions on the base
view with INSTEAD OF triggers will not be checked (a cascaded check option will not cascade
down to a trigger-updatable view, and any check options defined directly on a trigger-updatable
view will be ignored). If the view or any of its base relations has an INSTEAD rule that causes
the INSERT or UPDATE command to be rewritten, then all check options will be ignored in the
rewritten query, including any checks from automatically updatable views defined on top of the
relation with the INSTEAD rule.
Notes
Use the DROP VIEW statement to drop views.
Be careful that the names and types of the view's columns will be assigned the way you want. For
example:
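One plausible form of such a definition (the view name vista is illustrative):

CREATE VIEW vista AS SELECT 'Hello World';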
is bad form because the column name defaults to ?column?; also, the column data type defaults
to text, which might not be what you wanted. Better style for a string literal in a view's result is
something like:
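(again with an illustrative view name):

CREATE VIEW vista AS SELECT text 'Hello World' AS hello;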
Access to tables referenced in the view is determined by permissions of the view owner. In some cases,
this can be used to provide secure but restricted access to the underlying tables. However, not all views
are secure against tampering; see Section 41.5 for details. Functions called in the view are treated the
same as if they had been called directly from the query using the view. Therefore the user of a view
must have permissions to call all functions used by the view.
When CREATE OR REPLACE VIEW is used on an existing view, only the view's defining SELECT
rule is changed. Other view properties, including ownership, permissions, and non-SELECT rules,
remain unchanged. You must own the view to replace it (this includes being a member of the owning
role).
Updatable Views
Simple views are automatically updatable: the system will allow INSERT, UPDATE and DELETE
statements to be used on the view in the same way as on a regular table. A view is automatically
updatable if it satisfies all of the following conditions:
• The view must have exactly one entry in its FROM list, which must be a table or another updatable
view.
• The view definition must not contain WITH, DISTINCT, GROUP BY, HAVING, LIMIT, or OF-
FSET clauses at the top level.
• The view definition must not contain set operations (UNION, INTERSECT or EXCEPT) at the top
level.
• The view's select list must not contain any aggregates, window functions or set-returning functions.
An automatically updatable view may contain a mix of updatable and non-updatable columns. A col-
umn is updatable if it is a simple reference to an updatable column of the underlying base relation;
otherwise the column is read-only, and an error will be raised if an INSERT or UPDATE statement
attempts to assign a value to it.
If the view is automatically updatable the system will convert any INSERT, UPDATE or DELETE
statement on the view into the corresponding statement on the underlying base relation. INSERT
statements that have an ON CONFLICT UPDATE clause are fully supported.
If an automatically updatable view contains a WHERE condition, the condition restricts which rows
of the base relation are available to be modified by UPDATE and DELETE statements on the view.
However, an UPDATE is allowed to change a row so that it no longer satisfies the WHERE condition,
and thus is no longer visible through the view. Similarly, an INSERT command can potentially insert
base-relation rows that do not satisfy the WHERE condition and thus are not visible through the view
(ON CONFLICT UPDATE may similarly affect an existing row not visible through the view). The
CHECK OPTION may be used to prevent INSERT and UPDATE commands from creating such rows
that are not visible through the view.
If an automatically updatable view is marked with the security_barrier property then all the
view's WHERE conditions (and any conditions using operators which are marked as LEAKPROOF)
will always be evaluated before any conditions that a user of the view has added. See Section 41.5
for full details. Note that, due to this, rows which are not ultimately returned (because they do not
pass the user's WHERE conditions) may still end up being locked. EXPLAIN can be used to see which
conditions are applied at the relation level (and therefore do not lock rows) and which are not.
A more complex view that does not satisfy all these conditions is read-only by default: the system
will not allow an insert, update, or delete on the view. You can get the effect of an updatable view
by creating INSTEAD OF triggers on the view, which must convert attempted inserts, etc. on the
view into appropriate actions on other tables. For more information see CREATE TRIGGER. Another
possibility is to create rules (see CREATE RULE), but in practice triggers are easier to understand
and use correctly.
Note that the user performing the insert, update or delete on the view must have the corresponding
insert, update or delete privilege on the view. In addition the view's owner must have the relevant
privileges on the underlying base relations, but the user performing the update does not need any
permissions on the underlying base relations (see Section 41.5).
Examples
Create a view consisting of all comedy films:
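One way to write it:

CREATE VIEW comedies AS
    SELECT *
    FROM films
    WHERE kind = 'Comedy';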
This will create a view containing the columns that are in the films table at the time of view creation.
Though * was used to create the view, columns added later to the table will not be part of the view.
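Building on that view, a sketch of a view with a LOCAL check option (the view name universal_comedies is illustrative):

CREATE VIEW universal_comedies AS
    SELECT *
    FROM comedies
    WHERE classification = 'U'
    WITH LOCAL CHECK OPTION;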
This will create a view based on the comedies view, showing only films with kind = 'Comedy'
and classification = 'U'. Any attempt to INSERT or UPDATE a row in the view will be
rejected if the new row doesn't have classification = 'U', but the film kind will not be
checked.
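A similar sketch using a cascaded check option (the view name pg_comedies and the classification value are illustrative):

CREATE VIEW pg_comedies AS
    SELECT *
    FROM comedies
    WHERE classification = 'PG'
    WITH CASCADED CHECK OPTION;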
This will create a view that checks both the kind and classification of new rows.
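Create a view with a mix of updatable and non-updatable columns; the country_code_to_name function and the user_ratings table are assumed here purely for illustration:

CREATE VIEW comedies AS
    SELECT f.*,
           country_code_to_name(f.country_code) AS country,   -- assumed helper function
           (SELECT avg(r.rating)
            FROM user_ratings r                                -- assumed table
            WHERE r.film_id = f.id) AS avg_rating
    FROM films f
    WHERE f.kind = 'Comedy';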
This view will support INSERT, UPDATE and DELETE. All the columns from the films table will
be updatable, whereas the computed columns country and avg_rating will be read-only.
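Create a recursive view consisting of the numbers from 1 to 100 (the view name is illustrative):

CREATE RECURSIVE VIEW public.nums_1_100 (n) AS
    VALUES (1)
UNION ALL
    SELECT n+1 FROM nums_1_100 WHERE n < 100;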
Notice that although the recursive view's name is schema-qualified in this CREATE, its internal
self-reference is not schema-qualified. This is because the implicitly-created CTE's name cannot be
schema-qualified.
Compatibility
CREATE OR REPLACE VIEW is a PostgreSQL language extension. So is the concept of a temporary
view. The WITH ( ... ) clause is an extension as well.
See Also
ALTER VIEW, DROP VIEW, CREATE MATERIALIZED VIEW
DEALLOCATE
DEALLOCATE — deallocate a prepared statement
Synopsis
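In outline:

DEALLOCATE [ PREPARE ] { name | ALL }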
Description
DEALLOCATE is used to deallocate a previously prepared SQL statement. If you do not explicitly
deallocate a prepared statement, it is deallocated when the session ends.
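For example, a statement can be prepared, executed, and then deallocated (the statement name fooplan is illustrative):

PREPARE fooplan (integer) AS SELECT $1 + 1;  -- fooplan is an illustrative name
EXECUTE fooplan(41);
DEALLOCATE fooplan;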
Parameters
PREPARE
This key word is ignored.
name
The name of the prepared statement to deallocate.
ALL
Deallocate all prepared statements.
Compatibility
The SQL standard includes a DEALLOCATE statement, but it is only for use in embedded SQL.
See Also
EXECUTE, PREPARE
DECLARE
DECLARE — define a cursor
Synopsis
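In outline:

DECLARE name [ BINARY ] [ INSENSITIVE ] [ [ NO ] SCROLL ]
    CURSOR [ { WITH | WITHOUT } HOLD ] FOR query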
Description
DECLARE allows a user to create cursors, which can be used to retrieve a small number of rows at a
time out of a larger query. After the cursor is created, rows are fetched from it using FETCH.
Note
This page describes usage of cursors at the SQL command level. If you are trying to use cursors
inside a PL/pgSQL function, the rules are different — see Section 43.7.
Parameters
name
BINARY
Causes the cursor to return data in binary rather than in text format.
INSENSITIVE
Indicates that data retrieved from the cursor should be unaffected by updates to the table(s) under-
lying the cursor that occur after the cursor is created. In PostgreSQL, this is the default behavior;
so this key word has no effect and is only accepted for compatibility with the SQL standard.
SCROLL
NO SCROLL
SCROLL specifies that the cursor can be used to retrieve rows in a nonsequential fashion (e.g.,
backward). Depending upon the complexity of the query's execution plan, specifying SCROLL
might impose a performance penalty on the query's execution time. NO SCROLL specifies that the
cursor cannot be used to retrieve rows in a nonsequential fashion. The default is to allow scrolling
in some cases; this is not the same as specifying SCROLL. See Notes for details.
WITH HOLD
WITHOUT HOLD
WITH HOLD specifies that the cursor can continue to be used after the transaction that created
it successfully commits. WITHOUT HOLD specifies that the cursor cannot be used outside of the
transaction that created it. If neither WITHOUT HOLD nor WITH HOLD is specified, WITHOUT
HOLD is the default.
query
A SELECT or VALUES command which will provide the rows to be returned by the cursor.
The key words BINARY, INSENSITIVE, and SCROLL can appear in any order.
Notes
Normal cursors return data in text format, the same as a SELECT would produce. The BINARY option
specifies that the cursor should return data in binary format. This reduces conversion effort for both
the server and client, at the cost of more programmer effort to deal with platform-dependent binary
data formats. As an example, if a query returns a value of one from an integer column, you would get a
string of 1 with a default cursor, whereas with a binary cursor you would get a 4-byte field containing
the internal representation of the value (in big-endian byte order).
Binary cursors should be used carefully. Many applications, including psql, are not prepared to handle
binary cursors and expect data to come back in the text format.
Note
When the client application uses the “extended query” protocol to issue a FETCH command,
the Bind protocol message specifies whether data is to be retrieved in text or binary format.
This choice overrides the way that the cursor is defined. The concept of a binary cursor as such
is thus obsolete when using extended query protocol — any cursor can be treated as either
text or binary.
Unless WITH HOLD is specified, the cursor created by this command can only be used within the
current transaction. Thus, DECLARE without WITH HOLD is useless outside a transaction block: the
cursor would survive only to the completion of the statement. Therefore PostgreSQL reports an error
if such a command is used outside a transaction block. Use BEGIN and COMMIT (or ROLLBACK)
to define a transaction block.
If WITH HOLD is specified and the transaction that created the cursor successfully commits, the
cursor can continue to be accessed by subsequent transactions in the same session. (But if the creating
transaction is aborted, the cursor is removed.) A cursor created with WITH HOLD is closed when
an explicit CLOSE command is issued on it, or the session ends. In the current implementation, the
rows represented by a held cursor are copied into a temporary file or memory area so that they remain
available for subsequent transactions.
WITH HOLD may not be specified when the query includes FOR UPDATE or FOR SHARE.
The SCROLL option should be specified when defining a cursor that will be used to fetch backwards.
This is required by the SQL standard. However, for compatibility with earlier versions, PostgreSQL
will allow backward fetches without SCROLL, if the cursor's query plan is simple enough that no extra
overhead is needed to support it. However, application developers are advised not to rely on using
backward fetches from a cursor that has not been created with SCROLL. If NO SCROLL is specified,
then backward fetches are disallowed in any case.
Backward fetches are also disallowed when the query includes FOR UPDATE or FOR SHARE; there-
fore SCROLL may not be specified in this case.
Caution
Scrollable cursors may give unexpected results if they invoke any volatile functions (see Sec-
tion 38.7). When a previously fetched row is re-fetched, the functions might be re-executed,
perhaps leading to results different from the first time. It's best to specify NO SCROLL for
a query involving volatile functions. If that is not practical, one workaround is to declare the
cursor SCROLL WITH HOLD and commit the transaction before reading any rows from it.
This will force the entire output of the cursor to be materialized in temporary storage, so that
volatile functions are executed exactly once for each row.
If the cursor's query includes FOR UPDATE or FOR SHARE, then returned rows are locked at the
time they are first fetched, in the same way as for a regular SELECT command with these options. In
addition, the returned rows will be the most up-to-date versions; therefore these options provide the
equivalent of what the SQL standard calls a “sensitive cursor”. (Specifying INSENSITIVE together
with FOR UPDATE or FOR SHARE is an error.)
Caution
It is generally recommended to use FOR UPDATE if the cursor is intended to be used with
UPDATE ... WHERE CURRENT OF or DELETE ... WHERE CURRENT OF. Using FOR
UPDATE prevents other sessions from changing the rows between the time they are fetched
and the time they are updated. Without FOR UPDATE, a subsequent WHERE CURRENT OF
command will have no effect if the row was changed since the cursor was created.
Another reason to use FOR UPDATE is that without it, a subsequent WHERE CURRENT
OF might fail if the cursor query does not meet the SQL standard's rules for being “simply
updatable” (in particular, the cursor must reference just one table and not use grouping or
ORDER BY). Cursors that are not simply updatable might work, or might not, depending on
plan choice details; so in the worst case, an application might work in testing and then fail in
production. If FOR UPDATE is specified, the cursor is guaranteed to be updatable.
The main reason not to use FOR UPDATE with WHERE CURRENT OF is if you need the
cursor to be scrollable, or to be insensitive to the subsequent updates (that is, continue to show
the old data). If this is a requirement, pay close heed to the caveats shown above.
The SQL standard only makes provisions for cursors in embedded SQL. The PostgreSQL server does
not implement an OPEN statement for cursors; a cursor is considered to be open when it is declared.
However, ECPG, the embedded SQL preprocessor for PostgreSQL, supports the standard SQL cursor
conventions, including those involving DECLARE and OPEN statements.
You can see all available cursors by querying the pg_cursors system view.
Examples
To declare a cursor:
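For example (the cursor name liahona is illustrative):

DECLARE liahona CURSOR FOR SELECT * FROM films;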
Compatibility
The SQL standard says that it is implementation-dependent whether cursors are sensitive to concurrent
updates of the underlying data by default. In PostgreSQL, cursors are insensitive by default, and can
be made sensitive by specifying FOR UPDATE. Other products may work differently.
The SQL standard allows cursors only in embedded SQL and in modules. PostgreSQL permits cursors
to be used interactively.
See Also
CLOSE, FETCH, MOVE
DELETE
DELETE — delete rows of a table
Synopsis
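In outline:

[ WITH [ RECURSIVE ] with_query [, ...] ]
DELETE FROM [ ONLY ] table_name [ * ] [ [ AS ] alias ]
    [ USING from_item [, ...] ]
    [ WHERE condition | WHERE CURRENT OF cursor_name ]
    [ RETURNING * | output_expression [ [ AS ] output_name ] [, ...] ]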
Description
DELETE deletes rows that satisfy the WHERE clause from the specified table. If the WHERE clause is
absent, the effect is to delete all rows in the table. The result is a valid, but empty table.
Tip
TRUNCATE provides a faster mechanism to remove all rows from a table.
There are two ways to delete rows in a table using information contained in other tables in the data-
base: using sub-selects, or specifying additional tables in the USING clause. Which technique is more
appropriate depends on the specific circumstances.
The optional RETURNING clause causes DELETE to compute and return value(s) based on each row
actually deleted. Any expression using the table's columns, and/or columns of other tables mentioned
in USING, can be computed. The syntax of the RETURNING list is identical to that of the output list
of SELECT.
You must have the DELETE privilege on the table to delete from it, as well as the SELECT privilege
for any table in the USING clause or whose values are read in the condition.
Parameters
with_query
The WITH clause allows you to specify one or more subqueries that can be referenced by name
in the DELETE query. See Section 7.8 and SELECT for details.
table_name
The name (optionally schema-qualified) of the table to delete rows from. If ONLY is specified
before the table name, matching rows are deleted from the named table only. If ONLY is not spec-
ified, matching rows are also deleted from any tables inheriting from the named table. Optionally,
* can be specified after the table name to explicitly indicate that descendant tables are included.
alias
A substitute name for the target table. When an alias is provided, it completely hides the actual
name of the table. For example, given DELETE FROM foo AS f, the remainder of the DELETE
statement must refer to this table as f not foo.
from_item
A table expression allowing columns from other tables to appear in the WHERE condition. This
uses the same syntax as the FROM Clause of a SELECT statement; for example, an alias for the
table name can be specified. Do not repeat the target table as a from_item unless you wish to
set up a self-join (in which case it must appear with an alias in the from_item).
condition
An expression that returns a value of type boolean. Only rows for which this expression returns
true will be deleted.
cursor_name
The name of the cursor to use in a WHERE CURRENT OF condition. The row to be deleted is
the one most recently fetched from this cursor. The cursor must be a non-grouping query on the
DELETE's target table. Note that WHERE CURRENT OF cannot be specified together with a
Boolean condition. See DECLARE for more information about using cursors with WHERE CUR-
RENT OF.
output_expression
An expression to be computed and returned by the DELETE command after each row is deleted.
The expression can use any column names of the table named by table_name or table(s) listed
in USING. Write * to return all columns.
output_name
Outputs
On successful completion, a DELETE command returns a command tag of the form
DELETE count
The count is the number of rows deleted. Note that the number may be less than the number of
rows that matched the condition when deletes were suppressed by a BEFORE DELETE trigger.
If count is 0, no rows were deleted by the query (this is not considered an error).
If the DELETE command contains a RETURNING clause, the result will be similar to that of a SELECT
statement containing the columns and values defined in the RETURNING list, computed over the
row(s) deleted by the command.
Notes
PostgreSQL lets you reference columns of other tables in the WHERE condition by specifying the other
tables in the USING clause. For example, to delete all films produced by a given producer, one can do:
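A sketch of that approach (the producer_id, id, and name columns are assumed for illustration):

DELETE FROM films USING producers
    WHERE producer_id = producers.id AND producers.name = 'foo';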
What is essentially happening here is a join between films and producers, with all successfully
joined films rows being marked for deletion. This syntax is not standard. A more standard way to
do it is:
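(with the same assumed columns):

DELETE FROM films
    WHERE producer_id IN (SELECT id FROM producers WHERE name = 'foo');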
In some cases the join style is easier to write or faster to execute than the sub-select style.
Examples
Delete all films but musicals:
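One way to write this:

DELETE FROM films WHERE kind <> 'Musical';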
Delete the row of tasks on which the cursor c_tasks is currently positioned:
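One way to write this:

DELETE FROM tasks WHERE CURRENT OF c_tasks;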
Compatibility
This command conforms to the SQL standard, except that the USING and RETURNING clauses are
PostgreSQL extensions, as is the ability to use WITH with DELETE.
See Also
TRUNCATE
DISCARD
DISCARD — discard session state
Synopsis
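In outline:

DISCARD { ALL | PLANS | SEQUENCES | TEMPORARY | TEMP }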
Description
DISCARD releases internal resources associated with a database session. This command is useful for
partially or fully resetting the session's state. There are several subcommands to release different types
of resources; the DISCARD ALL variant subsumes all the others, and also resets additional state.
Parameters
PLANS
Releases all cached query plans, forcing re-planning to occur the next time the associated prepared
statement is used.
SEQUENCES
Discards all cached sequence-related state, including currval()/lastval() information and any preallocated sequence values that have not yet been returned by nextval.
TEMPORARY or TEMP
Drops all temporary tables created in the current session.
ALL
Releases all temporary resources associated with the current session and resets the session to its
initial state. Currently, this has the same effect as executing the following sequence of statements:
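Roughly the following, though the exact list may vary between releases:

SET SESSION AUTHORIZATION DEFAULT;
RESET ALL;
DEALLOCATE ALL;
CLOSE ALL;
UNLISTEN *;
SELECT pg_advisory_unlock_all();
DISCARD PLANS;
DISCARD SEQUENCES;
DISCARD TEMP;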
Notes
DISCARD ALL cannot be executed inside a transaction block.
Compatibility
DISCARD is a PostgreSQL extension.
DO
DO — execute an anonymous code block
Synopsis
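In outline:

DO [ LANGUAGE lang_name ] code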
Description
DO executes an anonymous code block, or in other words a transient anonymous function in a proce-
dural language.
The code block is treated as though it were the body of a function with no parameters, returning void.
It is parsed and executed a single time.
The optional LANGUAGE clause can be written either before or after the code block.
Parameters
code
The procedural language code to be executed. This must be specified as a string literal, just as in
CREATE FUNCTION. Use of a dollar-quoted literal is recommended.
lang_name
The name of the procedural language the code is written in. If omitted, the default is plpgsql.
Notes
The procedural language to be used must already have been installed into the current database by
means of CREATE EXTENSION. plpgsql is installed by default, but other languages are not.
The user must have USAGE privilege for the procedural language, or must be a superuser if the lan-
guage is untrusted. This is the same privilege requirement as for creating a function in the language.
If DO is executed in a transaction block, then the procedure code cannot execute transaction control
statements. Transaction control statements are only allowed if DO is executed in its own transaction.
Examples
Grant all privileges on all views in schema public to role webuser:
DO $$DECLARE r record;
BEGIN
    FOR r IN SELECT table_schema, table_name FROM information_schema.tables
             WHERE table_type = 'VIEW' AND table_schema = 'public'
    LOOP
        EXECUTE 'GRANT ALL ON ' || quote_ident(r.table_schema) || '.' || quote_ident(r.table_name) || ' TO webuser';
    END LOOP;
END$$;
Compatibility
There is no DO statement in the SQL standard.
See Also
CREATE LANGUAGE
DROP ACCESS METHOD
DROP ACCESS METHOD — remove an access method
Synopsis
Description
DROP ACCESS METHOD removes an existing access method. Only superusers can drop access
methods.
Parameters
IF EXISTS
Do not throw an error if the access method does not exist. A notice is issued in this case.
name
CASCADE
Automatically drop objects that depend on the access method (such as operator classes, operator
families, and indexes), and in turn all objects that depend on those objects (see Section 5.13).
RESTRICT
Refuse to drop the access method if any objects depend on it. This is the default.
Examples
Drop the access method heptree:
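For example, one could write:

DROP ACCESS METHOD heptree;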
Compatibility
DROP ACCESS METHOD is a PostgreSQL extension.
See Also
CREATE ACCESS METHOD
DROP AGGREGATE
DROP AGGREGATE — remove an aggregate function
Synopsis
DROP AGGREGATE [ IF EXISTS ] name ( aggregate_signature ) [, ...] [ CASCADE | RESTRICT ]

where aggregate_signature is:

* |
[ argmode ] [ argname ] argtype [ , ... ] |
[ [ argmode ] [ argname ] argtype [ , ... ] ] ORDER BY [ argmode ] [ argname ] argtype [ , ... ]
Description
DROP AGGREGATE removes an existing aggregate function. To execute this command the current
user must be the owner of the aggregate function.
Parameters
IF EXISTS
Do not throw an error if the aggregate does not exist. A notice is issued in this case.
name
argmode
argname
The name of an argument. Note that DROP AGGREGATE does not actually pay any attention
to argument names, since only the argument data types are needed to determine the aggregate
function's identity.
argtype
An input data type on which the aggregate function operates. To reference a zero-argument aggre-
gate function, write * in place of the list of argument specifications. To reference an ordered-set
aggregate function, write ORDER BY between the direct and aggregated argument specifications.
CASCADE
Automatically drop objects that depend on the aggregate function (such as views using it), and in
turn all objects that depend on those objects (see Section 5.13).
RESTRICT
Refuse to drop the aggregate function if any objects depend on it. This is the default.
Notes
Alternative syntaxes for referencing ordered-set aggregates are described under ALTER AGGRE-
GATE.
Examples
To remove the aggregate function myavg for type integer:
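For example, one could write:

DROP AGGREGATE myavg(integer);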
To remove the hypothetical-set aggregate function myrank, which takes an arbitrary list of ordering
columns and a matching list of direct arguments:
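One plausible form:

DROP AGGREGATE myrank(VARIADIC "any" ORDER BY VARIADIC "any");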
Compatibility
There is no DROP AGGREGATE statement in the SQL standard.
See Also
ALTER AGGREGATE, CREATE AGGREGATE
DROP CAST
DROP CAST — remove a cast
Synopsis
Description
DROP CAST removes a previously defined cast.
To be able to drop a cast, you must own the source or the target data type. These are the same privileges
that are required to create a cast.
Parameters
IF EXISTS
Do not throw an error if the cast does not exist. A notice is issued in this case.
source_type
target_type
CASCADE
RESTRICT
These key words do not have any effect, since there are no dependencies on casts.
Examples
To drop the cast from type text to type int:
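For example, one could write:

DROP CAST (text AS int);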
Compatibility
The DROP CAST command conforms to the SQL standard.
See Also
CREATE CAST
DROP COLLATION
DROP COLLATION — remove a collation
Synopsis
Description
DROP COLLATION removes a previously defined collation. To be able to drop a collation, you must
own the collation.
Parameters
IF EXISTS
Do not throw an error if the collation does not exist. A notice is issued in this case.
name
CASCADE
Automatically drop objects that depend on the collation, and in turn all objects that depend on
those objects (see Section 5.13).
RESTRICT
Refuse to drop the collation if any objects depend on it. This is the default.
Examples
To drop the collation named german:
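For example, one could write:

DROP COLLATION german;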
Compatibility
The DROP COLLATION command conforms to the SQL standard, apart from the IF EXISTS option,
which is a PostgreSQL extension.
See Also
ALTER COLLATION, CREATE COLLATION
DROP CONVERSION
DROP CONVERSION — remove a conversion
Synopsis
Description
DROP CONVERSION removes a previously defined conversion. To be able to drop a conversion, you
must own the conversion.
Parameters
IF EXISTS
Do not throw an error if the conversion does not exist. A notice is issued in this case.
name
CASCADE
RESTRICT
These key words do not have any effect, since there are no dependencies on conversions.
Examples
To drop the conversion named myname:
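A command along these lines does it:

DROP CONVERSION myname;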
Compatibility
There is no DROP CONVERSION statement in the SQL standard, but there is a DROP TRANSLATION
statement that goes along with the CREATE TRANSLATION statement, which is similar to the CREATE
CONVERSION statement in PostgreSQL.
See Also
ALTER CONVERSION, CREATE CONVERSION
DROP DATABASE
DROP DATABASE — remove a database
Synopsis
Description
DROP DATABASE drops a database. It removes the catalog entries for the database and deletes the
directory containing the data. It can only be executed by the database owner. Also, it cannot be exe-
cuted while you or anyone else are connected to the target database. (Connect to postgres or any
other database to issue this command.)
Parameters
IF EXISTS
Do not throw an error if the database does not exist. A notice is issued in this case.
name
Notes
DROP DATABASE cannot be executed inside a transaction block.
This command cannot be executed while connected to the target database. Thus, it might be more
convenient to use the program dropdb instead, which is a wrapper around this command.
Compatibility
There is no DROP DATABASE statement in the SQL standard.
See Also
CREATE DATABASE
DROP DOMAIN
DROP DOMAIN — remove a domain
Synopsis
Description
DROP DOMAIN removes a domain. Only the owner of a domain can remove it.
Parameters
IF EXISTS
Do not throw an error if the domain does not exist. A notice is issued in this case.
name
CASCADE
Automatically drop objects that depend on the domain (such as table columns), and in turn all
objects that depend on those objects (see Section 5.13).
RESTRICT
Refuse to drop the domain if any objects depend on it. This is the default.
Examples
To remove the domain box:
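For example, one could write:

DROP DOMAIN box;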
Compatibility
This command conforms to the SQL standard, except for the IF EXISTS option, which is a Post-
greSQL extension.
See Also
CREATE DOMAIN, ALTER DOMAIN
DROP EVENT TRIGGER
DROP EVENT TRIGGER — remove an event trigger
Synopsis
Description
DROP EVENT TRIGGER removes an existing event trigger. To execute this command, the current
user must be the owner of the event trigger.
Parameters
IF EXISTS
Do not throw an error if the event trigger does not exist. A notice is issued in this case.
name
CASCADE
Automatically drop objects that depend on the trigger, and in turn all objects that depend on those
objects (see Section 5.13).
RESTRICT
Refuse to drop the trigger if any objects depend on it. This is the default.
Examples
Destroy the trigger snitch:
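For example, one could write:

DROP EVENT TRIGGER snitch;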
Compatibility
There is no DROP EVENT TRIGGER statement in the SQL standard.
See Also
CREATE EVENT TRIGGER, ALTER EVENT TRIGGER
DROP EXTENSION
DROP EXTENSION — remove an extension
Synopsis
Description
DROP EXTENSION removes extensions from the database. Dropping an extension causes its com-
ponent objects to be dropped as well.
Parameters
IF EXISTS
Do not throw an error if the extension does not exist. A notice is issued in this case.
name
CASCADE
Automatically drop objects that depend on the extension, and in turn all objects that depend on
those objects (see Section 5.13).
RESTRICT
Refuse to drop the extension if any objects depend on it (other than its own member objects and
other extensions listed in the same DROP command). This is the default.
Examples
To remove the extension hstore from the current database:
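A command along these lines does it:

DROP EXTENSION hstore;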
This command will fail if any of hstore's objects are in use in the database, for example if any tables
have columns of the hstore type. Add the CASCADE option to forcibly remove those dependent
objects as well.
Compatibility
DROP EXTENSION is a PostgreSQL extension.
See Also
CREATE EXTENSION, ALTER EXTENSION
DROP FOREIGN DATA WRAPPER
DROP FOREIGN DATA WRAPPER — remove a foreign-data wrapper
Synopsis
Description
DROP FOREIGN DATA WRAPPER removes an existing foreign-data wrapper. To execute this
command, the current user must be the owner of the foreign-data wrapper.
Parameters
IF EXISTS
Do not throw an error if the foreign-data wrapper does not exist. A notice is issued in this case.
name
CASCADE
Automatically drop objects that depend on the foreign-data wrapper (such as foreign tables and
servers), and in turn all objects that depend on those objects (see Section 5.13).
RESTRICT
Refuse to drop the foreign-data wrapper if any objects depend on it. This is the default.
Examples
Drop the foreign-data wrapper dbi:
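For example, one could write:

DROP FOREIGN DATA WRAPPER dbi;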
Compatibility
DROP FOREIGN DATA WRAPPER conforms to ISO/IEC 9075-9 (SQL/MED). The IF EXISTS
clause is a PostgreSQL extension.
See Also
CREATE FOREIGN DATA WRAPPER, ALTER FOREIGN DATA WRAPPER
DROP FOREIGN TABLE
DROP FOREIGN TABLE — remove a foreign table
Synopsis
Description
DROP FOREIGN TABLE removes a foreign table. Only the owner of a foreign table can remove it.
Parameters
IF EXISTS
Do not throw an error if the foreign table does not exist. A notice is issued in this case.
name
CASCADE
Automatically drop objects that depend on the foreign table (such as views), and in turn all objects
that depend on those objects (see Section 5.13).
RESTRICT
Refuse to drop the foreign table if any objects depend on it. This is the default.
Examples
To destroy two foreign tables, films and distributors:
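A command along these lines does it:

DROP FOREIGN TABLE films, distributors;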
Compatibility
This command conforms to ISO/IEC 9075-9 (SQL/MED), except that the standard only allows
one foreign table to be dropped per command, and apart from the IF EXISTS option, which is a
PostgreSQL extension.
See Also
ALTER FOREIGN TABLE, CREATE FOREIGN TABLE
DROP FUNCTION
DROP FUNCTION — remove a function
Synopsis
Description
DROP FUNCTION removes the definition of an existing function. To execute this command the user
must be the owner of the function. The argument types to the function must be specified, since several
different functions can exist with the same name and different argument lists.
Parameters
IF EXISTS
Do not throw an error if the function does not exist. A notice is issued in this case.
name
argmode
The mode of an argument: IN, OUT, INOUT, or VARIADIC. If omitted, the default is IN. Note
that DROP FUNCTION does not actually pay any attention to OUT arguments, since only the
input arguments are needed to determine the function's identity. So it is sufficient to list the IN,
INOUT, and VARIADIC arguments.
argname
The name of an argument. Note that DROP FUNCTION does not actually pay any attention
to argument names, since only the argument data types are needed to determine the function's
identity.
argtype
CASCADE
Automatically drop objects that depend on the function (such as operators or triggers), and in turn
all objects that depend on those objects (see Section 5.13).
RESTRICT
Refuse to drop the function if any objects depend on it. This is the default.
Examples
This command removes the square root function:
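For example, one could write:

DROP FUNCTION sqrt(integer);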
If the function name is unique in its schema, it can be referred to without an argument list:
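(the function name below is illustrative):

DROP FUNCTION update_employee_salaries;

Note that this is different from

DROP FUNCTION update_employee_salaries();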
which refers to a function with zero arguments, whereas the first variant can refer to a function with
any number of arguments, including zero, as long as the name is unique.
Compatibility
This command conforms to the SQL standard, with these PostgreSQL extensions:
See Also
CREATE FUNCTION, ALTER FUNCTION, DROP PROCEDURE, DROP ROUTINE
DROP GROUP
DROP GROUP — remove a database role
Synopsis
Description
DROP GROUP is now an alias for DROP ROLE.
Compatibility
There is no DROP GROUP statement in the SQL standard.
See Also
DROP ROLE
DROP INDEX
DROP INDEX — remove an index
Synopsis
Description
DROP INDEX drops an existing index from the database system. To execute this command you must
be the owner of the index.
Parameters
CONCURRENTLY
Drop the index without locking out concurrent selects, inserts, updates, and deletes on the index's
table. A normal DROP INDEX acquires an ACCESS EXCLUSIVE lock on the table, blocking
other accesses until the index drop can be completed. With this option, the command instead waits
until conflicting transactions have completed.
There are several caveats to be aware of when using this option. Only one index name can be
specified, and the CASCADE option is not supported. (Thus, an index that supports a UNIQUE or
PRIMARY KEY constraint cannot be dropped this way.) Also, regular DROP INDEX commands
can be performed within a transaction block, but DROP INDEX CONCURRENTLY cannot. Lastly,
indexes on partitioned tables cannot be dropped using this option.
For temporary tables, DROP INDEX is always non-concurrent, as no other session can access
them, and non-concurrent index drop is cheaper.
IF EXISTS
Do not throw an error if the index does not exist. A notice is issued in this case.
name
CASCADE
Automatically drop objects that depend on the index, and in turn all objects that depend on those
objects (see Section 5.13).
RESTRICT
Refuse to drop the index if any objects depend on it. This is the default.
Examples
This command will remove the index title_idx:
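For example, one could write:

DROP INDEX title_idx;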
Compatibility
DROP INDEX is a PostgreSQL language extension. There are no provisions for indexes in the SQL
standard.
See Also
CREATE INDEX
DROP LANGUAGE
DROP LANGUAGE — remove a procedural language
Synopsis
Description
DROP LANGUAGE removes the definition of a previously registered procedural language. You must
be a superuser or the owner of the language to use DROP LANGUAGE.
Note
As of PostgreSQL 9.1, most procedural languages have been made into “extensions”, and
should therefore be removed with DROP EXTENSION not DROP LANGUAGE.
Parameters
IF EXISTS
Do not throw an error if the language does not exist. A notice is issued in this case.
name
The name of an existing procedural language. For backward compatibility, the name can be en-
closed by single quotes.
CASCADE
Automatically drop objects that depend on the language (such as functions in the language), and
in turn all objects that depend on those objects (see Section 5.13).
RESTRICT
Refuse to drop the language if any objects depend on it. This is the default.
Examples
This command removes the procedural language plsample:
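For example, one could write:

DROP LANGUAGE plsample;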
Compatibility
There is no DROP LANGUAGE statement in the SQL standard.
See Also
ALTER LANGUAGE, CREATE LANGUAGE
DROP MATERIALIZED VIEW
DROP MATERIALIZED VIEW — remove a materialized view
Synopsis
Description
DROP MATERIALIZED VIEW drops an existing materialized view. To execute this command you
must be the owner of the materialized view.
Parameters
IF EXISTS
Do not throw an error if the materialized view does not exist. A notice is issued in this case.
name
CASCADE
Automatically drop objects that depend on the materialized view (such as other materialized
views, or regular views), and in turn all objects that depend on those objects (see Section 5.13).
RESTRICT
Refuse to drop the materialized view if any objects depend on it. This is the default.
Examples
This command will remove the materialized view called order_summary:
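A command along these lines does it:

DROP MATERIALIZED VIEW order_summary;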
Compatibility
DROP MATERIALIZED VIEW is a PostgreSQL extension.
See Also
CREATE MATERIALIZED VIEW, ALTER MATERIALIZED VIEW, REFRESH
MATERIALIZED VIEW
DROP OPERATOR
DROP OPERATOR — remove an operator
Synopsis
Description
DROP OPERATOR drops an existing operator from the database system. To execute this command
you must be the owner of the operator.
Parameters
IF EXISTS
Do not throw an error if the operator does not exist. A notice is issued in this case.
name
left_type
The data type of the operator's left operand; write NONE if the operator has no left operand.
right_type
The data type of the operator's right operand; write NONE if the operator has no right operand.
CASCADE
Automatically drop objects that depend on the operator (such as views using it), and in turn all
objects that depend on those objects (see Section 5.13).
RESTRICT
Refuse to drop the operator if any objects depend on it. This is the default.
Examples
Remove the power operator a^b for type integer:
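For example, one could write:

DROP OPERATOR ^ (integer, integer);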
Remove the left unary bitwise complement operator ~b for type bit:
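One plausible form, with NONE standing for the absent left operand:

DROP OPERATOR ~ (none, bit);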
Compatibility
There is no DROP OPERATOR statement in the SQL standard.
See Also
CREATE OPERATOR, ALTER OPERATOR
DROP OPERATOR CLASS
DROP OPERATOR CLASS — remove an operator class
Synopsis
Description
DROP OPERATOR CLASS drops an existing operator class. To execute this command you must be
the owner of the operator class.
DROP OPERATOR CLASS does not drop any of the operators or functions referenced by the class.
If there are any indexes depending on the operator class, you will need to specify CASCADE for the
drop to complete.
Parameters
IF EXISTS
Do not throw an error if the operator class does not exist. A notice is issued in this case.
name
index_method
The name of the index access method the operator class is for.
CASCADE
Automatically drop objects that depend on the operator class (such as indexes), and in turn all
objects that depend on those objects (see Section 5.13).
RESTRICT
Refuse to drop the operator class if any objects depend on it. This is the default.
Notes
DROP OPERATOR CLASS will not drop the operator family containing the class, even if there is
nothing else left in the family (in particular, in the case where the family was implicitly created by
CREATE OPERATOR CLASS). An empty operator family is harmless, but for the sake of tidiness
you might wish to remove the family with DROP OPERATOR FAMILY; or perhaps better, use DROP
OPERATOR FAMILY in the first place.
Examples
Remove the B-tree operator class widget_ops:
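For example, one could write:

DROP OPERATOR CLASS widget_ops USING btree;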
This command will not succeed if there are any existing indexes that use the operator class. Add
CASCADE to drop such indexes along with the operator class.
Compatibility
There is no DROP OPERATOR CLASS statement in the SQL standard.
See Also
ALTER OPERATOR CLASS, CREATE OPERATOR CLASS, DROP OPERATOR FAMILY
DROP OPERATOR FAMILY
DROP OPERATOR FAMILY — remove an operator family
Synopsis
Description
DROP OPERATOR FAMILY drops an existing operator family. To execute this command you must
be the owner of the operator family.
DROP OPERATOR FAMILY includes dropping any operator classes contained in the family, but
it does not drop any of the operators or functions referenced by the family. If there are any indexes
depending on operator classes within the family, you will need to specify CASCADE for the drop to
complete.
Parameters
IF EXISTS
Do not throw an error if the operator family does not exist. A notice is issued in this case.
name
index_method
The name of the index access method the operator family is for.
CASCADE
Automatically drop objects that depend on the operator family, and in turn all objects that depend
on those objects (see Section 5.13).
RESTRICT
Refuse to drop the operator family if any objects depend on it. This is the default.
Examples
Remove the B-tree operator family float_ops:
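For example, one could write:

DROP OPERATOR FAMILY float_ops USING btree;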
This command will not succeed if there are any existing indexes that use operator classes within the
family. Add CASCADE to drop such indexes along with the operator family.
Compatibility
There is no DROP OPERATOR FAMILY statement in the SQL standard.
See Also
ALTER OPERATOR FAMILY, CREATE OPERATOR FAMILY, ALTER OPERATOR CLASS,
CREATE OPERATOR CLASS, DROP OPERATOR CLASS
DROP OWNED
DROP OWNED — remove database objects owned by a database role
Synopsis
Description
DROP OWNED drops all the objects within the current database that are owned by one of the specified
roles. Any privileges granted to the given roles on objects in the current database or on shared objects
(databases, tablespaces) will also be revoked.
Parameters
name
The name of a role whose objects will be dropped, and whose privileges will be revoked.
CASCADE
Automatically drop objects that depend on the affected objects, and in turn all objects that depend
on those objects (see Section 5.13).
RESTRICT
Refuse to drop the objects owned by a role if any other database objects depend on one of the
affected objects. This is the default.
Notes
DROP OWNED is often used to prepare for the removal of one or more roles. Because DROP OWNED
only affects the objects in the current database, it is usually necessary to execute this command in each
database that contains objects owned by a role that is to be removed.
Using the CASCADE option might make the command recurse to objects owned by other users.
The REASSIGN OWNED command is an alternative that reassigns the ownership of all the database
objects owned by one or more roles. However, REASSIGN OWNED does not deal with privileges
for other objects.
Compatibility
The DROP OWNED command is a PostgreSQL extension.
See Also
REASSIGN OWNED, DROP ROLE
DROP POLICY
DROP POLICY — remove a row level security policy from a table
Synopsis
Description
DROP POLICY removes the specified policy from the table. Note that if the last policy is removed
for a table and the table still has row level security enabled via ALTER TABLE, then the default-deny
policy will be used. ALTER TABLE ... DISABLE ROW LEVEL SECURITY can be used to
disable row level security for a table, whether policies for the table exist or not.
Parameters
IF EXISTS
Do not throw an error if the policy does not exist. A notice is issued in this case.
name
table_name
The name (optionally schema-qualified) of the table that the policy is on.
CASCADE
RESTRICT
These key words do not have any effect, since there are no dependencies on policies.
Examples
To drop the policy called p1 on the table named my_table:
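For example, one could write:

DROP POLICY p1 ON my_table;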
Compatibility
DROP POLICY is a PostgreSQL extension.
See Also
CREATE POLICY, ALTER POLICY
DROP PROCEDURE
DROP PROCEDURE — remove a procedure
Synopsis
Description
DROP PROCEDURE removes the definition of an existing procedure. To execute this command the
user must be the owner of the procedure. The argument types to the procedure must be specified, since
several different procedures can exist with the same name and different argument lists.
Parameters
IF EXISTS
Do not throw an error if the procedure does not exist. A notice is issued in this case.
name
argmode
argname
The name of an argument. Note that DROP PROCEDURE does not actually pay any attention
to argument names, since only the argument data types are needed to determine the procedure's
identity.
argtype
CASCADE
Automatically drop objects that depend on the procedure, and in turn all objects that depend on
those objects (see Section 5.13).
RESTRICT
Refuse to drop the procedure if any objects depend on it. This is the default.
Examples
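To remove a procedure (the procedure name below is illustrative):

DROP PROCEDURE do_db_maintenance();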
Compatibility
This command conforms to the SQL standard, with these PostgreSQL extensions:
See Also
CREATE PROCEDURE, ALTER PROCEDURE, DROP FUNCTION, DROP ROUTINE
DROP PUBLICATION
DROP PUBLICATION — remove a publication
Synopsis
Description
DROP PUBLICATION removes an existing publication from the database.
Parameters
IF EXISTS
Do not throw an error if the publication does not exist. A notice is issued in this case.
name
CASCADE
RESTRICT
These key words do not have any effect, since there are no dependencies on publications.
Examples
Drop a publication:
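For example (the publication name is illustrative):

DROP PUBLICATION mypublication;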
Compatibility
DROP PUBLICATION is a PostgreSQL extension.
See Also
CREATE PUBLICATION, ALTER PUBLICATION
DROP ROLE
DROP ROLE — remove a database role
Synopsis
Description
DROP ROLE removes the specified role(s). To drop a superuser role, you must be a superuser yourself;
to drop non-superuser roles, you must have CREATEROLE privilege.
A role cannot be removed if it is still referenced in any database of the cluster; an error will be raised
if so. Before dropping the role, you must drop all the objects it owns (or reassign their ownership)
and revoke any privileges the role has been granted on other objects. The REASSIGN OWNED and
DROP OWNED commands can be useful for this purpose; see Section 21.4 for more discussion.
However, it is not necessary to remove role memberships involving the role; DROP ROLE automati-
cally revokes any memberships of the target role in other roles, and of other roles in the target role.
The other roles are not dropped nor otherwise affected.
Parameters
IF EXISTS
Do not throw an error if the role does not exist. A notice is issued in this case.
name
Notes
PostgreSQL includes a program dropuser that has the same functionality as this command (in fact, it
calls this command) but can be run from the command shell.
Examples
To drop a role:
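For example (the role name is illustrative):

DROP ROLE jonathan;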
Compatibility
The SQL standard defines DROP ROLE, but it allows only one role to be dropped at a time, and it
specifies different privilege requirements than PostgreSQL uses.
See Also
CREATE ROLE, ALTER ROLE, SET ROLE
DROP ROUTINE
DROP ROUTINE — remove a routine
Synopsis
Description
DROP ROUTINE removes the definition of an existing routine, which can be an aggregate function,
a normal function, or a procedure. See under DROP AGGREGATE, DROP FUNCTION, and DROP
PROCEDURE for the description of the parameters, more examples, and further details.
Examples
To drop the routine foo for type integer:
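For example, one could write:

DROP ROUTINE foo(integer);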
This command will work independent of whether foo is an aggregate, function, or procedure.
Compatibility
This command conforms to the SQL standard, with these PostgreSQL extensions:
See Also
DROP AGGREGATE, DROP FUNCTION, DROP PROCEDURE, ALTER ROUTINE
DROP RULE
DROP RULE — remove a rewrite rule
Synopsis
Description
DROP RULE drops a rewrite rule.
Parameters
IF EXISTS
Do not throw an error if the rule does not exist. A notice is issued in this case.
name
table_name
The name (optionally schema-qualified) of the table or view that the rule applies to.
CASCADE
Automatically drop objects that depend on the rule, and in turn all objects that depend on those
objects (see Section 5.13).
RESTRICT
Refuse to drop the rule if any objects depend on it. This is the default.
Examples
To drop the rewrite rule newrule:
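For example (the table name mytable is illustrative):

DROP RULE newrule ON mytable;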
Compatibility
DROP RULE is a PostgreSQL language extension, as is the entire query rewrite system.
See Also
CREATE RULE, ALTER RULE
DROP SCHEMA
DROP SCHEMA — remove a schema
Synopsis
Description
DROP SCHEMA removes schemas from the database.
A schema can only be dropped by its owner or a superuser. Note that the owner can drop the schema
(and thereby all contained objects) even if they do not own some of the objects within the schema.
Parameters
IF EXISTS
Do not throw an error if the schema does not exist. A notice is issued in this case.
name
CASCADE
Automatically drop objects (tables, functions, etc.) that are contained in the schema, and in turn
all objects that depend on those objects (see Section 5.13).
RESTRICT
Refuse to drop the schema if it contains any objects. This is the default.
Notes
Using the CASCADE option might make the command remove objects in other schemas besides the
one(s) named.
Examples
To remove schema mystuff from the database, along with everything it contains:
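A command along these lines does it:

DROP SCHEMA mystuff CASCADE;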
Compatibility
DROP SCHEMA is fully conforming with the SQL standard, except that the standard only allows one
schema to be dropped per command, and apart from the IF EXISTS option, which is a PostgreSQL
extension.
See Also
ALTER SCHEMA, CREATE SCHEMA
DROP SEQUENCE
DROP SEQUENCE — remove a sequence
Synopsis
Description
DROP SEQUENCE removes sequence number generators. A sequence can only be dropped by its
owner or a superuser.
Parameters
IF EXISTS
Do not throw an error if the sequence does not exist. A notice is issued in this case.
name
CASCADE
Automatically drop objects that depend on the sequence, and in turn all objects that depend on
those objects (see Section 5.13).
RESTRICT
Refuse to drop the sequence if any objects depend on it. This is the default.
Examples
To remove the sequence serial:
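For example, one could write:

DROP SEQUENCE serial;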
Compatibility
DROP SEQUENCE conforms to the SQL standard, except that the standard only allows one sequence
to be dropped per command, and apart from the IF EXISTS option, which is a PostgreSQL extension.
See Also
CREATE SEQUENCE, ALTER SEQUENCE
DROP SERVER
DROP SERVER — remove a foreign server descriptor
Synopsis
Description
DROP SERVER removes an existing foreign server descriptor. To execute this command, the current
user must be the owner of the server.
Parameters
IF EXISTS
Do not throw an error if the server does not exist. A notice is issued in this case.
name
CASCADE
Automatically drop objects that depend on the server (such as user mappings), and in turn all
objects that depend on those objects (see Section 5.13).
RESTRICT
Refuse to drop the server if any objects depend on it. This is the default.
Examples
Drop a server foo if it exists:
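For example, one could write:

DROP SERVER IF EXISTS foo;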
Compatibility
DROP SERVER conforms to ISO/IEC 9075-9 (SQL/MED). The IF EXISTS clause is a PostgreSQL
extension.
See Also
CREATE SERVER, ALTER SERVER
DROP STATISTICS
DROP STATISTICS — remove extended statistics
Synopsis
Description
DROP STATISTICS removes statistics object(s) from the database. Only the statistics object's owner,
the schema owner, or a superuser can drop a statistics object.
Parameters
IF EXISTS
Do not throw an error if the statistics object does not exist. A notice is issued in this case.
name
CASCADE
RESTRICT
These key words do not have any effect, since there are no dependencies on statistics.
Examples
To destroy two statistics objects in different schemas, without failing if they don't exist:
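One plausible form (the statistics object names are illustrative):

DROP STATISTICS IF EXISTS
    accounting.users_uid_creation,
    public.grants_user_role;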
Compatibility
There is no DROP STATISTICS command in the SQL standard.
See Also
ALTER STATISTICS, CREATE STATISTICS
DROP SUBSCRIPTION
DROP SUBSCRIPTION — remove a subscription
Synopsis
Description
DROP SUBSCRIPTION removes a subscription from the database cluster.
DROP SUBSCRIPTION cannot be executed inside a transaction block if the subscription is associated
with a replication slot. (You can use ALTER SUBSCRIPTION to unset the slot.)
Parameters
name
CASCADE
RESTRICT
These key words do not have any effect, since there are no dependencies on subscriptions.
Notes
When dropping a subscription that is associated with a replication slot on the remote host (the normal
state), DROP SUBSCRIPTION will connect to the remote host and try to drop the replication slot as
part of its operation. This is necessary so that the resources allocated for the subscription on the remote
host are released. If this fails, either because the remote host is not reachable or because the remote
replication slot cannot be dropped or does not exist or never existed, the DROP SUBSCRIPTION
command will fail. To proceed in this situation, disassociate the subscription from the replication
slot by executing ALTER SUBSCRIPTION ... SET (slot_name = NONE). After that,
DROP SUBSCRIPTION will no longer attempt any actions on a remote host. Note that if the remote
replication slot still exists, it should then be dropped manually; otherwise it will continue to reserve
WAL and might eventually cause the disk to fill up. See also Section 31.2.1.
If a subscription is associated with a replication slot, then DROP SUBSCRIPTION cannot be executed
inside a transaction block.
Examples
Drop a subscription:
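For example (the subscription name is illustrative):

DROP SUBSCRIPTION mysub;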
Compatibility
DROP SUBSCRIPTION is a PostgreSQL extension.
See Also
CREATE SUBSCRIPTION, ALTER SUBSCRIPTION
DROP TABLE
DROP TABLE — remove a table
Synopsis
Description
DROP TABLE removes tables from the database. Only the table owner, the schema owner, and su-
peruser can drop a table. To empty a table of rows without destroying the table, use DELETE or
TRUNCATE.
DROP TABLE always removes any indexes, rules, triggers, and constraints that exist for the target
table. However, to drop a table that is referenced by a view or a foreign-key constraint of another table,
CASCADE must be specified. (CASCADE will remove a dependent view entirely, but in the foreign-key
case it will only remove the foreign-key constraint, not the other table entirely.)
Parameters
IF EXISTS
Do not throw an error if the table does not exist. A notice is issued in this case.
name
CASCADE
Automatically drop objects that depend on the table (such as views), and in turn all objects that
depend on those objects (see Section 5.13).
RESTRICT
Refuse to drop the table if any objects depend on it. This is the default.
Examples
To destroy two tables, films and distributors:
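A command along these lines does it:

DROP TABLE films, distributors;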
Compatibility
This command conforms to the SQL standard, except that the standard only allows one table to be
dropped per command, and apart from the IF EXISTS option, which is a PostgreSQL extension.
See Also
ALTER TABLE, CREATE TABLE
DROP TABLESPACE
DROP TABLESPACE — remove a tablespace
Synopsis
Description
DROP TABLESPACE removes a tablespace from the system.
A tablespace can only be dropped by its owner or a superuser. The tablespace must be empty of all
database objects before it can be dropped. It is possible that objects in other databases might still
reside in the tablespace even if no objects in the current database are using the tablespace. Also, if the
tablespace is listed in the temp_tablespaces setting of any active session, the DROP might fail due to
temporary files residing in the tablespace.
Parameters
IF EXISTS
Do not throw an error if the tablespace does not exist. A notice is issued in this case.
name
Notes
DROP TABLESPACE cannot be executed inside a transaction block.
Examples
To remove tablespace mystuff from the system:
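For example, one could write:

DROP TABLESPACE mystuff;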
Compatibility
DROP TABLESPACE is a PostgreSQL extension.
See Also
CREATE TABLESPACE, ALTER TABLESPACE
DROP TEXT SEARCH CONFIGURATION
DROP TEXT SEARCH CONFIGURATION — remove a text search configuration
Synopsis
Description
DROP TEXT SEARCH CONFIGURATION drops an existing text search configuration. To execute
this command you must be the owner of the configuration.
Parameters
IF EXISTS
Do not throw an error if the text search configuration does not exist. A notice is issued in this case.
name
CASCADE
Automatically drop objects that depend on the text search configuration, and in turn all objects
that depend on those objects (see Section 5.13).
RESTRICT
Refuse to drop the text search configuration if any objects depend on it. This is the default.
Examples
Remove the text search configuration my_english:
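For example, one could write:

DROP TEXT SEARCH CONFIGURATION my_english;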
This command will not succeed if there are any existing indexes that reference the configuration in
to_tsvector calls. Add CASCADE to drop such indexes along with the text search configuration.
Compatibility
There is no DROP TEXT SEARCH CONFIGURATION statement in the SQL standard.
See Also
ALTER TEXT SEARCH CONFIGURATION, CREATE TEXT SEARCH CONFIGURATION
DROP TEXT SEARCH DICTIONARY
DROP TEXT SEARCH DICTIONARY — remove a text search dictionary
Synopsis
Description
DROP TEXT SEARCH DICTIONARY drops an existing text search dictionary. To execute this
command you must be the owner of the dictionary.
Parameters
IF EXISTS
Do not throw an error if the text search dictionary does not exist. A notice is issued in this case.
name
CASCADE
Automatically drop objects that depend on the text search dictionary, and in turn all objects that
depend on those objects (see Section 5.13).
RESTRICT
Refuse to drop the text search dictionary if any objects depend on it. This is the default.
Examples
Remove the text search dictionary english:
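For example, one could write:

DROP TEXT SEARCH DICTIONARY english;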
This command will not succeed if there are any existing text search configurations that use the dictio-
nary. Add CASCADE to drop such configurations along with the dictionary.
Compatibility
There is no DROP TEXT SEARCH DICTIONARY statement in the SQL standard.
See Also
ALTER TEXT SEARCH DICTIONARY, CREATE TEXT SEARCH DICTIONARY
DROP TEXT SEARCH PARSER
DROP TEXT SEARCH PARSER — remove a text search parser
Synopsis
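DROP TEXT SEARCH PARSER [ IF EXISTS ] name [ CASCADE | RESTRICT ]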
Description
DROP TEXT SEARCH PARSER drops an existing text search parser. You must be a superuser to
use this command.
Parameters
IF EXISTS
Do not throw an error if the text search parser does not exist. A notice is issued in this case.
name
The name (optionally schema-qualified) of an existing text search parser.
CASCADE
Automatically drop objects that depend on the text search parser, and in turn all objects that depend
on those objects (see Section 5.13).
RESTRICT
Refuse to drop the text search parser if any objects depend on it. This is the default.
Examples
Remove the text search parser my_parser:
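DROP TEXT SEARCH PARSER my_parser;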
This command will not succeed if there are any existing text search configurations that use the parser.
Add CASCADE to drop such configurations along with the parser.
Compatibility
There is no DROP TEXT SEARCH PARSER statement in the SQL standard.
See Also
ALTER TEXT SEARCH PARSER, CREATE TEXT SEARCH PARSER
DROP TEXT SEARCH TEMPLATE
DROP TEXT SEARCH TEMPLATE — remove a text search template
Synopsis
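DROP TEXT SEARCH TEMPLATE [ IF EXISTS ] name [ CASCADE | RESTRICT ]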
Description
DROP TEXT SEARCH TEMPLATE drops an existing text search template. You must be a superuser
to use this command.
Parameters
IF EXISTS
Do not throw an error if the text search template does not exist. A notice is issued in this case.
name
The name (optionally schema-qualified) of an existing text search template.
CASCADE
Automatically drop objects that depend on the text search template, and in turn all objects that
depend on those objects (see Section 5.13).
RESTRICT
Refuse to drop the text search template if any objects depend on it. This is the default.
Examples
Remove the text search template thesaurus:
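DROP TEXT SEARCH TEMPLATE thesaurus;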
This command will not succeed if there are any existing text search dictionaries that use the template.
Add CASCADE to drop such dictionaries along with the template.
Compatibility
There is no DROP TEXT SEARCH TEMPLATE statement in the SQL standard.
See Also
ALTER TEXT SEARCH TEMPLATE, CREATE TEXT SEARCH TEMPLATE
DROP TRANSFORM
DROP TRANSFORM — remove a transform
Synopsis
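DROP TRANSFORM [ IF EXISTS ] FOR type_name LANGUAGE lang_name [ CASCADE | RESTRICT ]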
Description
DROP TRANSFORM removes a previously defined transform.
To be able to drop a transform, you must own the type and the language. These are the same privileges
that are required to create a transform.
Parameters
IF EXISTS
Do not throw an error if the transform does not exist. A notice is issued in this case.
type_name
The name of the data type of the transform.
lang_name
The name of the language of the transform.
CASCADE
Automatically drop objects that depend on the transform, and in turn all objects that depend on
those objects (see Section 5.13).
RESTRICT
Refuse to drop the transform if any objects depend on it. This is the default.
Examples
To drop the transform for type hstore and language plpythonu:
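DROP TRANSFORM FOR hstore LANGUAGE plpythonu;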
Compatibility
This form of DROP TRANSFORM is a PostgreSQL extension. See CREATE TRANSFORM for details.
See Also
CREATE TRANSFORM
DROP TRIGGER
DROP TRIGGER — remove a trigger
Synopsis
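DROP TRIGGER [ IF EXISTS ] name ON table_name [ CASCADE | RESTRICT ]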
Description
DROP TRIGGER removes an existing trigger definition. To execute this command, the current user
must be the owner of the table for which the trigger is defined.
Parameters
IF EXISTS
Do not throw an error if the trigger does not exist. A notice is issued in this case.
name
The name of the trigger to remove.
table_name
The name (optionally schema-qualified) of the table for which the trigger is defined.
CASCADE
Automatically drop objects that depend on the trigger, and in turn all objects that depend on those
objects (see Section 5.13).
RESTRICT
Refuse to drop the trigger if any objects depend on it. This is the default.
Examples
Destroy the trigger if_dist_exists on the table films:
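DROP TRIGGER if_dist_exists ON films;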
Compatibility
The DROP TRIGGER statement in PostgreSQL is incompatible with the SQL standard. In the SQL
standard, trigger names are not local to tables, so the command is simply DROP TRIGGER name.
See Also
CREATE TRIGGER
DROP TYPE
DROP TYPE — remove a data type
Synopsis
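DROP TYPE [ IF EXISTS ] name [, ...] [ CASCADE | RESTRICT ]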
Description
DROP TYPE removes a user-defined data type. Only the owner of a type can remove it.
Parameters
IF EXISTS
Do not throw an error if the type does not exist. A notice is issued in this case.
name
The name (optionally schema-qualified) of the data type to remove.
CASCADE
Automatically drop objects that depend on the type (such as table columns, functions, and oper-
ators), and in turn all objects that depend on those objects (see Section 5.13).
RESTRICT
Refuse to drop the type if any objects depend on it. This is the default.
Examples
To remove the data type box:
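DROP TYPE box;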
Compatibility
This command is similar to the corresponding command in the SQL standard, apart from the IF
EXISTS option, which is a PostgreSQL extension. But note that much of the CREATE TYPE com-
mand and the data type extension mechanisms in PostgreSQL differ from the SQL standard.
See Also
ALTER TYPE, CREATE TYPE
DROP USER
DROP USER — remove a database role
Synopsis
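DROP USER [ IF EXISTS ] name [, ...]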
Description
DROP USER is simply an alternate spelling of DROP ROLE.
Compatibility
The DROP USER statement is a PostgreSQL extension. The SQL standard leaves the definition of
users to the implementation.
See Also
DROP ROLE
DROP USER MAPPING
DROP USER MAPPING — remove a user mapping for a foreign server
Synopsis
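DROP USER MAPPING [ IF EXISTS ] FOR { user_name | USER | CURRENT_USER | PUBLIC } SERVER server_name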
Description
DROP USER MAPPING removes an existing user mapping from a foreign server.
The owner of a foreign server can drop user mappings for that server for any user. Also, a user can drop
a user mapping for their own user name if USAGE privilege on the server has been granted to the user.
Parameters
IF EXISTS
Do not throw an error if the user mapping does not exist. A notice is issued in this case.
user_name
User name of the mapping. CURRENT_USER and USER match the name of the current user.
PUBLIC is used to match all present and future user names in the system.
server_name
Server name of the user mapping.
Examples
Drop a user mapping bob, server foo if it exists:
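DROP USER MAPPING IF EXISTS FOR bob SERVER foo;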
Compatibility
DROP USER MAPPING conforms to ISO/IEC 9075-9 (SQL/MED). The IF EXISTS clause is a
PostgreSQL extension.
See Also
CREATE USER MAPPING, ALTER USER MAPPING
DROP VIEW
DROP VIEW — remove a view
Synopsis
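DROP VIEW [ IF EXISTS ] name [, ...] [ CASCADE | RESTRICT ]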
Description
DROP VIEW drops an existing view. To execute this command you must be the owner of the view.
Parameters
IF EXISTS
Do not throw an error if the view does not exist. A notice is issued in this case.
name
The name (optionally schema-qualified) of the view to remove.
CASCADE
Automatically drop objects that depend on the view (such as other views), and in turn all objects
that depend on those objects (see Section 5.13).
RESTRICT
Refuse to drop the view if any objects depend on it. This is the default.
Examples
This command will remove the view called kinds:
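DROP VIEW kinds;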
Compatibility
This command conforms to the SQL standard, except that the standard allows only one view to be
dropped per command, and except for the IF EXISTS option, which is a PostgreSQL extension.
See Also
ALTER VIEW, CREATE VIEW
END
END — commit the current transaction
Synopsis
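END [ WORK | TRANSACTION ]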
Description
END commits the current transaction. All changes made by the transaction become visible to others
and are guaranteed to be durable if a crash occurs. This command is a PostgreSQL extension that is
equivalent to COMMIT.
Parameters
WORK
TRANSACTION
Optional key words. They have no effect.
Notes
Use ROLLBACK to abort a transaction.
Issuing END when not inside a transaction does no harm, but it will provoke a warning message.
Examples
To commit the current transaction and make all changes permanent:
END;
Compatibility
END is a PostgreSQL extension that provides functionality equivalent to COMMIT, which is specified
in the SQL standard.
See Also
BEGIN, COMMIT, ROLLBACK
EXECUTE
EXECUTE — execute a prepared statement
Synopsis
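EXECUTE name [ ( parameter [, ...] ) ]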
Description
EXECUTE is used to execute a previously prepared statement. Since prepared statements only exist
for the duration of a session, the prepared statement must have been created by a PREPARE statement
executed earlier in the current session.
If the PREPARE statement that created the statement specified some parameters, a compatible set of
parameters must be passed to the EXECUTE statement, or else an error is raised. Note that (unlike
functions) prepared statements are not overloaded based on the type or number of their parameters;
the name of a prepared statement must be unique within a database session.
For more information on the creation and usage of prepared statements, see PREPARE.
Parameters
name
The name of the prepared statement to execute.
parameter
The actual value of a parameter to the prepared statement. This must be an expression yielding a
value that is compatible with the data type of this parameter, as was determined when the prepared
statement was created.
Outputs
The command tag returned by EXECUTE is that of the prepared statement, and not EXECUTE.
Examples
Examples are given in the Examples section of the PREPARE documentation.
Compatibility
The SQL standard includes an EXECUTE statement, but it is only for use in embedded SQL. This
version of the EXECUTE statement also uses a somewhat different syntax.
See Also
DEALLOCATE, PREPARE
EXPLAIN
EXPLAIN — show the execution plan of a statement
Synopsis
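EXPLAIN [ ( option [, ...] ) ] statement
EXPLAIN [ ANALYZE ] [ VERBOSE ] statement

where option can be one of: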
ANALYZE [ boolean ]
VERBOSE [ boolean ]
COSTS [ boolean ]
BUFFERS [ boolean ]
TIMING [ boolean ]
SUMMARY [ boolean ]
FORMAT { TEXT | XML | JSON | YAML }
Description
This command displays the execution plan that the PostgreSQL planner generates for the supplied
statement. The execution plan shows how the table(s) referenced by the statement will be scanned —
by plain sequential scan, index scan, etc. — and if multiple tables are referenced, what join algorithms
will be used to bring together the required rows from each input table.
The most critical part of the display is the estimated statement execution cost, which is the planner's
guess at how long it will take to run the statement (measured in cost units that are arbitrary, but
conventionally mean disk page fetches). Actually two numbers are shown: the start-up cost before the
first row can be returned, and the total cost to return all the rows. For most queries the total cost is
what matters, but in contexts such as a subquery in EXISTS, the planner will choose the smallest start-
up cost instead of the smallest total cost (since the executor will stop after getting one row, anyway).
Also, if you limit the number of rows to return with a LIMIT clause, the planner makes an appropriate
interpolation between the endpoint costs to estimate which plan is really the cheapest.
The ANALYZE option causes the statement to be actually executed, not only planned. Then actual run
time statistics are added to the display, including the total elapsed time expended within each plan node
(in milliseconds) and the total number of rows it actually returned. This is useful for seeing whether
the planner's estimates are close to reality.
Important
Keep in mind that the statement is actually executed when the ANALYZE option is used. Al-
though EXPLAIN will discard any output that a SELECT would return, other side effects of
the statement will happen as usual. If you wish to use EXPLAIN ANALYZE on an INSERT,
UPDATE, DELETE, CREATE TABLE AS, or EXECUTE statement without letting the com-
mand affect your data, use this approach:
BEGIN;
EXPLAIN ANALYZE ...;
ROLLBACK;
Only the ANALYZE and VERBOSE options can be specified, and only in that order, without surround-
ing the option list in parentheses. Prior to PostgreSQL 9.0, the unparenthesized syntax was the only
one supported. It is expected that all new options will be supported only in the parenthesized syntax.
Parameters
ANALYZE
Carry out the command and show actual run times and other statistics. This parameter defaults
to FALSE.
VERBOSE
Display additional information regarding the plan. Specifically, include the output column list
for each node in the plan tree, schema-qualify table and function names, always label variables
in expressions with their range table alias, and always print the name of each trigger for which
statistics are displayed. This parameter defaults to FALSE.
COSTS
Include information on the estimated startup and total cost of each plan node, as well as the esti-
mated number of rows and the estimated width of each row. This parameter defaults to TRUE.
BUFFERS
Include information on buffer usage. Specifically, include the number of shared blocks hit, read,
dirtied, and written, the number of local blocks hit, read, dirtied, and written, and the number
of temp blocks read and written. A hit means that a read was avoided because the block was
found already in cache when needed. Shared blocks contain data from regular tables and indexes;
local blocks contain data from temporary tables and indexes; while temp blocks contain short-
term working data used in sorts, hashes, Materialize plan nodes, and similar cases. The number
of blocks dirtied indicates the number of previously unmodified blocks that were changed by
this query; while the number of blocks written indicates the number of previously-dirtied blocks
evicted from cache by this backend during query processing. The number of blocks shown for an
upper-level node includes those used by all its child nodes. In text format, only non-zero values are
printed. This parameter may only be used when ANALYZE is also enabled. It defaults to FALSE.
TIMING
Include actual startup time and time spent in each node in the output. The overhead of repeatedly
reading the system clock can slow down the query significantly on some systems, so it may be
useful to set this parameter to FALSE when only actual row counts, and not exact times, are
needed. Run time of the entire statement is always measured, even when node-level timing is
turned off with this option. This parameter may only be used when ANALYZE is also enabled.
It defaults to TRUE.
SUMMARY
Include summary information (e.g., totaled timing information) after the query plan. Summary
information is included by default when ANALYZE is used; otherwise it is not included by default,
but can be enabled using this option. Planning time in EXPLAIN EXECUTE includes the time
required to fetch the plan from the cache and the time required for re-planning, if necessary.
FORMAT
Specify the output format, which can be TEXT, XML, JSON, or YAML. Non-text output contains
the same information as the text output format, but is easier for programs to parse. This parameter
defaults to TEXT.
boolean
Specifies whether the selected option should be turned on or off. You can write TRUE, ON, or 1 to
enable the option, and FALSE, OFF, or 0 to disable it. The boolean value can also be omitted,
in which case TRUE is assumed.
statement
Any SELECT, INSERT, UPDATE, DELETE, VALUES, EXECUTE, DECLARE, CREATE TABLE
AS, or CREATE MATERIALIZED VIEW AS statement, whose execution plan you wish to see.
Outputs
The command's result is a textual description of the plan selected for the statement, optionally
annotated with execution statistics. Section 14.1 describes the information provided.
Notes
In order to allow the PostgreSQL query planner to make reasonably informed decisions when optimiz-
ing queries, the pg_statistic data should be up-to-date for all tables used in the query. Normally
the autovacuum daemon will take care of that automatically. But if a table has recently had substantial
changes in its contents, you might need to do a manual ANALYZE rather than wait for autovacuum
to catch up with the changes.
In order to measure the run-time cost of each node in the execution plan, the current implementation
of EXPLAIN ANALYZE adds profiling overhead to query execution. As a result, running EXPLAIN
ANALYZE on a query can sometimes take significantly longer than executing the query normally. The
amount of overhead depends on the nature of the query, as well as the platform being used. The worst
case occurs for plan nodes that in themselves require very little time per execution, and on machines
that have relatively slow operating system calls for obtaining the time of day.
Examples
To show the plan for a simple query on a table with a single integer column and 10000 rows:
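EXPLAIN SELECT * FROM foo;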
QUERY PLAN
---------------------------------------------------------
Seq Scan on foo (cost=0.00..155.00 rows=10000 width=4)
(1 row)
If there is an index and we use a query with an indexable WHERE condition, EXPLAIN might show
a different plan:
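EXPLAIN SELECT * FROM foo WHERE i = 4;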
QUERY PLAN
--------------------------------------------------------------
Index Scan using fi on foo (cost=0.00..5.98 rows=1 width=4)
Index Cond: (i = 4)
(2 rows)
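Here is the same query, but with cost estimates suppressed:

EXPLAIN (COSTS FALSE) SELECT * FROM foo WHERE i = 4;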
QUERY PLAN
----------------------------
Index Scan using fi on foo
Index Cond: (i = 4)
(2 rows)
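Here is an example of a query plan for a query using an aggregate function:

EXPLAIN SELECT sum(i) FROM foo WHERE i < 10;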
QUERY PLAN
---------------------------------------------------------------------
Aggregate (cost=23.93..23.93 rows=1 width=4)
-> Index Scan using fi on foo (cost=0.00..23.92 rows=6
width=4)
Here is an example of using EXPLAIN EXECUTE to display the execution plan for a prepared query:
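-- (the prepared statement and parameter values shown here are illustrative)
PREPARE query(int, int) AS SELECT sum(bar) FROM test
    WHERE id > $1 AND id < $2
    GROUP BY foo;
EXPLAIN ANALYZE EXECUTE query(100, 200);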
QUERY PLAN
--------------------------------------------------------------------------------
HashAggregate (cost=9.54..9.54 rows=1 width=8) (actual
time=0.156..0.161 rows=11 loops=1)
Group Key: foo
-> Index Scan using test_pkey on test (cost=0.29..9.29 rows=50
width=8) (actual time=0.039..0.091 rows=99 loops=1)
Index Cond: ((id > $1) AND (id < $2))
Planning time: 0.197 ms
Execution time: 0.225 ms
(6 rows)
Of course, the specific numbers shown here depend on the actual contents of the tables involved. Also
note that the numbers, and even the selected query strategy, might vary between PostgreSQL releases
due to planner improvements. In addition, the ANALYZE command uses random sampling to estimate
data statistics; therefore, it is possible for cost estimates to change after a fresh run of ANALYZE, even
if the actual distribution of data in the table has not changed.
Compatibility
There is no EXPLAIN statement defined in the SQL standard.
See Also
ANALYZE
FETCH
FETCH — retrieve rows from a query using a cursor
Synopsis
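FETCH [ direction [ FROM | IN ] ] cursor_name

where direction can be empty or one of: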
NEXT
PRIOR
FIRST
LAST
ABSOLUTE count
RELATIVE count
count
ALL
FORWARD
FORWARD count
FORWARD ALL
BACKWARD
BACKWARD count
BACKWARD ALL
Description
FETCH retrieves rows using a previously-created cursor.
A cursor has an associated position, which is used by FETCH. The cursor position can be before the
first row of the query result, on any particular row of the result, or after the last row of the result. When
created, a cursor is positioned before the first row. After fetching some rows, the cursor is positioned
on the row most recently retrieved. If FETCH runs off the end of the available rows then the cursor is
left positioned after the last row, or before the first row if fetching backward. FETCH ALL or FETCH
BACKWARD ALL will always leave the cursor positioned after the last row or before the first row.
The forms NEXT, PRIOR, FIRST, LAST, ABSOLUTE, RELATIVE fetch a single row after moving
the cursor appropriately. If there is no such row, an empty result is returned, and the cursor is left
positioned before the first row or after the last row as appropriate.
The forms using FORWARD and BACKWARD retrieve the indicated number of rows moving in the
forward or backward direction, leaving the cursor positioned on the last-returned row (or after/before
all rows, if the count exceeds the number of rows available).
RELATIVE 0, FORWARD 0, and BACKWARD 0 all request fetching the current row without moving
the cursor, that is, re-fetching the most recently fetched row. This will succeed unless the cursor is
positioned before the first row or after the last row; in which case, no row is returned.
Note
This page describes usage of cursors at the SQL command level. If you are trying to use cursors
inside a PL/pgSQL function, the rules are different — see Section 43.7.3.
Parameters
direction
direction defines the fetch direction and number of rows to fetch. It can be one of the fol-
lowing:
NEXT
Fetch the next row. This is the default if direction is omitted.
PRIOR
Fetch the prior row.
FIRST
Fetch the first row of the query (same as ABSOLUTE 1).
LAST
Fetch the last row of the query (same as ABSOLUTE -1).
ABSOLUTE count
Fetch the count'th row of the query, or the abs(count)'th row from the end if count
is negative. Position before first row or after last row if count is out of range; in particular,
ABSOLUTE 0 positions before the first row.
RELATIVE count
Fetch the count'th succeeding row, or the abs(count)'th prior row if count is negative.
RELATIVE 0 re-fetches the current row, if any.
count
Fetch the next count rows (same as FORWARD count).
ALL
Fetch all remaining rows (same as FORWARD ALL).
FORWARD
Fetch the next row (same as NEXT).
FORWARD count
Fetch the next count rows. FORWARD 0 re-fetches the current row.
FORWARD ALL
Fetch all remaining rows.
BACKWARD
Fetch the prior row (same as PRIOR).
BACKWARD count
Fetch the prior count rows (scanning backwards). BACKWARD 0 re-fetches the current row.
BACKWARD ALL
Fetch all prior rows (scanning backwards).
count
count is a possibly-signed integer constant, determining the location or number of rows to fetch.
For FORWARD and BACKWARD cases, specifying a negative count is equivalent to changing the
sense of FORWARD and BACKWARD.
cursor_name
An open cursor's name.
Outputs
On successful completion, a FETCH command returns a command tag of the form
FETCH count
The count is the number of rows fetched (possibly zero). Note that in psql, the command tag will
not actually be displayed, since psql displays the fetched rows instead.
Notes
The cursor should be declared with the SCROLL option if one intends to use any variants of FETCH
other than FETCH NEXT or FETCH FORWARD with a positive count. For simple queries PostgreSQL
will allow backwards fetch from cursors not declared with SCROLL, but this behavior is best not relied
on. If the cursor is declared with NO SCROLL, no backward fetches are allowed.
ABSOLUTE fetches are not any faster than navigating to the desired row with a relative move: the
underlying implementation must traverse all the intermediate rows anyway. Negative absolute fetches
are even worse: the query must be read to the end to find the last row, and then traversed backward
from there. However, rewinding to the start of the query (as with FETCH ABSOLUTE 0) is fast.
DECLARE is used to define a cursor. Use MOVE to change cursor position without retrieving data.
Examples
The following example traverses a table using a cursor:
BEGIN WORK;
-- Set up a cursor:
DECLARE liahona SCROLL CURSOR FOR SELECT * FROM films;
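-- Fetch the first 5 rows in the cursor liahona:
FETCH FORWARD 5 FROM liahona;

-- Fetch the prior row:
FETCH PRIOR FROM liahona;

-- Close the cursor and end the transaction:
CLOSE liahona;
COMMIT WORK;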
Compatibility
The SQL standard defines FETCH for use in embedded SQL only. The variant of FETCH described
here returns the data as if it were a SELECT result rather than placing it in host variables. Other than
this point, FETCH is fully upward-compatible with the SQL standard.
The FETCH forms involving FORWARD and BACKWARD, as well as the forms FETCH count and
FETCH ALL, in which FORWARD is implicit, are PostgreSQL extensions.
The SQL standard allows only FROM preceding the cursor name; the option to use IN, or to leave
them out altogether, is an extension.
See Also
CLOSE, DECLARE, MOVE
GRANT
GRANT — define access privileges
Synopsis
where role_specification can be:

[ GROUP ] role_name
| PUBLIC
| CURRENT_USER
| SESSION_USER
Description
The GRANT command has two basic variants: one that grants privileges on a database object (table,
column, view, foreign table, sequence, database, foreign-data wrapper, foreign server, function, pro-
cedure, procedural language, schema, or tablespace), and one that grants membership in a role. These
variants are similar in many ways, but they are different enough to be described separately.
There is also an option to grant privileges on all objects of the same type within one or more schemas.
This functionality is currently supported only for tables, sequences, functions, and procedures. ALL
TABLES also affects views and foreign tables, just like the specific-object GRANT command. ALL
FUNCTIONS also affects aggregate functions, but not procedures, again just like the specific-object
GRANT command.
The key word PUBLIC indicates that the privileges are to be granted to all roles, including those that
might be created later. PUBLIC can be thought of as an implicitly defined group that always includes
all roles. Any particular role will have the sum of privileges granted directly to it, privileges granted
to any role it is presently a member of, and privileges granted to PUBLIC.
If WITH GRANT OPTION is specified, the recipient of the privilege can in turn grant it to others.
Without a grant option, the recipient cannot do that. Grant options cannot be granted to PUBLIC.
There is no need to grant privileges to the owner of an object (usually the user that created it), as the
owner has all privileges by default. (The owner could, however, choose to revoke some of their own
privileges for safety.)
The right to drop an object, or to alter its definition in any way, is not treated as a grantable privilege; it
is inherent in the owner, and cannot be granted or revoked. (However, a similar effect can be obtained
by granting or revoking membership in the role that owns the object; see below.) The owner implicitly
has all grant options for the object, too.
PostgreSQL grants default privileges on some types of objects to PUBLIC. No privileges are granted
to PUBLIC by default on tables, table columns, sequences, foreign data wrappers, foreign servers,
large objects, schemas, or tablespaces. For other types of objects, the default privileges granted to
PUBLIC are as follows: CONNECT and TEMPORARY (create temporary tables) privileges for data-
bases; EXECUTE privilege for functions and procedures; and USAGE privilege for languages and da-
ta types (including domains). The object owner can, of course, REVOKE both default and expressly
granted privileges. (For maximum security, issue the REVOKE in the same transaction that creates the
object; then there is no window in which another user can use the object.) Also, these initial default
privilege settings can be changed using the ALTER DEFAULT PRIVILEGES command.
SELECT
Allows SELECT from any column, or the specific columns listed, of the specified table, view, or
sequence. Also allows the use of COPY TO. This privilege is also needed to reference existing
column values in UPDATE or DELETE. For sequences, this privilege also allows the use of the
currval function. For large objects, this privilege allows the object to be read.
INSERT
Allows INSERT of a new row into the specified table. If specific columns are listed, only those
columns may be assigned to in the INSERT command (other columns will therefore receive de-
fault values). Also allows COPY FROM.
UPDATE
Allows UPDATE of any column, or the specific columns listed, of the specified table. (In practice,
any nontrivial UPDATE command will require SELECT privilege as well, since it must reference
table columns to determine which rows to update, and/or to compute new values for columns.)
SELECT ... FOR UPDATE and SELECT ... FOR SHARE also require this privilege on
at least one column, in addition to the SELECT privilege. For sequences, this privilege allows
the use of the nextval and setval functions. For large objects, this privilege allows writing
or truncating the object.
DELETE
Allows DELETE of a row from the specified table. (In practice, any nontrivial DELETE command
will require SELECT privilege as well, since it must reference table columns to determine which
rows to delete.)
TRUNCATE
Allows TRUNCATE on the specified table.
REFERENCES
Allows creation of a foreign key constraint referencing the specified table, or specified column(s)
of the table. (See the CREATE TABLE statement.)
TRIGGER
Allows the creation of a trigger on the specified table. (See the CREATE TRIGGER statement.)
CREATE
For databases, allows new schemas and publications to be created within the database.
For schemas, allows new objects to be created within the schema. To rename an existing object,
you must own the object and have this privilege for the containing schema.
For tablespaces, allows tables, indexes, and temporary files to be created within the tablespace,
and allows databases to be created that have the tablespace as their default tablespace. (Note that
revoking this privilege will not alter the placement of existing objects.)
CONNECT
Allows the user to connect to the specified database. This privilege is checked at connection
startup (in addition to checking any restrictions imposed by pg_hba.conf).
TEMPORARY
TEMP
Allows temporary tables to be created while using the specified database.
EXECUTE
Allows the use of the specified function or procedure and the use of any operators that are imple-
mented on top of the function. This is the only type of privilege that is applicable to functions
and procedures. The FUNCTION syntax also works for aggregate functions. Alternatively, use
ROUTINE to refer to a function, aggregate function, or procedure regardless of what it is.
USAGE
For procedural languages, allows the use of the specified language for the creation of functions
in that language. This is the only type of privilege that is applicable to procedural languages.
For schemas, allows access to objects contained in the specified schema (assuming that the objects'
own privilege requirements are also met). Essentially this allows the grantee to “look up” objects
within the schema. Without this permission, it is still possible to see the object names, e.g., by
querying the system tables. Also, after revoking this permission, existing backends might have
statements that have previously performed this lookup, so this is not a completely secure way to
prevent object access.
For sequences, this privilege allows the use of the currval and nextval functions.
For types and domains, this privilege allows the use of the type or domain in the creation of tables,
functions, and other schema objects. (Note that it does not control general “usage” of the type, such
as values of the type appearing in queries. It only prevents objects from being created that depend
on the type. The main purpose of the privilege is controlling which users create dependencies on
a type, which could prevent the owner from changing the type later.)
For foreign-data wrappers, this privilege allows creation of new servers using the foreign-data
wrapper.
For servers, this privilege allows creation of foreign tables using the server. Grantees may also
create, alter, or drop their own user mappings associated with that server.
ALL PRIVILEGES
Grant all of the available privileges at once. The PRIVILEGES key word is optional in Post-
greSQL, though it is required by strict SQL.
The privileges required by other commands are listed on the reference page of the respective command.
GRANT on Roles
This variant of the GRANT command grants membership in a role to one or more other roles. Member-
ship in a role is significant because it conveys the privileges granted to a role to each of its members.
If WITH ADMIN OPTION is specified, the member can in turn grant membership in the role to others,
and revoke membership in the role as well. Without the admin option, ordinary users cannot do that. A
role is not considered to hold WITH ADMIN OPTION on itself, but it may grant or revoke membership
in itself from a database session where the session user matches the role. Database superusers can
grant or revoke membership in any role to anyone. Roles having CREATEROLE privilege can grant
or revoke membership in any role that is not a superuser.
If GRANTED BY is specified, the grant is recorded as having been done by the specified role. Only
database superusers may use this option, except when it names the same role executing the command.
Unlike the case with privileges, membership in a role cannot be granted to PUBLIC. Note also that
this form of the command does not allow the noise word GROUP in role_specification.
Notes
The REVOKE command is used to revoke access privileges.
Since PostgreSQL 8.1, the concepts of users and groups have been unified into a single kind of entity
called a role. It is therefore no longer necessary to use the keyword GROUP to identify whether a
grantee is a user or a group. GROUP is still allowed in the command, but it is a noise word.
A user may perform SELECT, INSERT, etc. on a column if they hold that privilege for either the
specific column or its whole table. Granting the privilege at the table level and then revoking it for
one column will not do what one might wish: the table-level grant is unaffected by a column-level
operation.
When a non-owner of an object attempts to GRANT privileges on the object, the command will fail
outright if the user has no privileges whatsoever on the object. As long as some privilege is available,
the command will proceed, but it will grant only those privileges for which the user has grant options.
The GRANT ALL PRIVILEGES forms will issue a warning message if no grant options are held,
while the other forms will issue a warning if grant options for any of the privileges specifically named
in the command are not held. (In principle these statements apply to the object owner as well, but since
the owner is always treated as holding all grant options, the cases can never occur.)
It should be noted that database superusers can access all objects regardless of object privilege settings.
This is comparable to the rights of root in a Unix system. As with root, it's unwise to operate as
a superuser except when absolutely necessary.
If a superuser chooses to issue a GRANT or REVOKE command, the command is performed as though
it were issued by the owner of the affected object. In particular, privileges granted via such a command
will appear to have been granted by the object owner. (For role membership, the membership appears
to have been granted by the containing role itself.)
GRANT and REVOKE can also be done by a role that is not the owner of the affected object, but is a
member of the role that owns the object, or is a member of a role that holds privileges WITH GRANT
OPTION on the object. In this case the privileges will be recorded as having been granted by the role
that actually owns the object or holds the privileges WITH GRANT OPTION. For example, if table
t1 is owned by role g1, of which role u1 is a member, then u1 can grant privileges on t1 to u2,
but those privileges will appear to have been granted directly by g1. Any other member of role g1
could revoke them later.
If the role executing GRANT holds the required privileges indirectly via more than one role membership
path, it is unspecified which containing role will be recorded as having done the grant. In such cases
it is best practice to use SET ROLE to become the specific role you want to do the GRANT as.
Granting permission on a table does not automatically extend permissions to any sequences used by the
table, including sequences tied to SERIAL columns. Permissions on sequences must be set separately.
Use psql's \dp command to obtain information about existing privileges for tables and columns. For
example:
The entries shown by \dp are interpreted thus:
r -- SELECT ("read")
w -- UPDATE ("write")
a -- INSERT ("append")
d -- DELETE
D -- TRUNCATE
x -- REFERENCES
t -- TRIGGER
X -- EXECUTE
U -- USAGE
C -- CREATE
c -- CONNECT
T -- TEMPORARY
arwdDxt -- ALL PRIVILEGES (for tables, varies for other
objects)
* -- grant option for preceding privilege
The above example display would be seen by user miriam after creating table mytable and doing:
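GRANT SELECT ON mytable TO PUBLIC;
GRANT SELECT, UPDATE, INSERT ON mytable TO admin;
GRANT SELECT (col1), UPDATE (col1) ON mytable TO miriam_rw;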
For non-table objects there are other \d commands that can display their privileges.
If the “Access privileges” column is empty for a given object, it means the object has default privileges
(that is, its privileges column is null). Default privileges always include all privileges for the owner,
and can include some privileges for PUBLIC depending on the object type, as explained above. The
first GRANT or REVOKE on an object will instantiate the default privileges (producing, for example,
{miriam=arwdDxt/miriam}) and then modify them per the specified request. Similarly, entries
are shown in “Column access privileges” only for columns with nondefault privileges. (Note: for this
purpose, “default privileges” always means the built-in default privileges for the object's type. An
object whose privileges have been affected by an ALTER DEFAULT PRIVILEGES command will
always be shown with an explicit privilege entry that includes the effects of the ALTER.)
Notice that the owner's implicit grant options are not marked in the access privileges display. A * will
appear only when grant options have been explicitly granted to someone.
Examples
Grant insert privilege to all users on table films:
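GRANT INSERT ON films TO PUBLIC;

Grant all available privileges to user manuel on view kinds:

GRANT ALL PRIVILEGES ON kinds TO manuel;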
Note that while the above will indeed grant all privileges if executed by a superuser or the owner of
kinds, when executed by someone else it will only grant those permissions for which the someone
else has grant options.
Compatibility
According to the SQL standard, the PRIVILEGES key word in ALL PRIVILEGES is required. The
SQL standard does not support setting the privileges on more than one object per command.
PostgreSQL allows an object owner to revoke their own ordinary privileges: for example, a table
owner can make the table read-only to themselves by revoking their own INSERT, UPDATE, DELETE,
and TRUNCATE privileges. This is not possible according to the SQL standard. The reason is that
PostgreSQL treats the owner's privileges as having been granted by the owner to themselves; therefore
they can revoke them too. In the SQL standard, the owner's privileges are granted by an assumed entity
“_SYSTEM”. Not being “_SYSTEM”, the owner cannot revoke these rights.
According to the SQL standard, grant options can be granted to PUBLIC; PostgreSQL only supports
granting grant options to roles.
The SQL standard allows the GRANTED BY option to be used in all forms of GRANT. PostgreSQL
only supports it when granting role membership, and even then only superusers may use it in nontrivial
ways.
The SQL standard provides for a USAGE privilege on other kinds of objects: character sets, collations,
translations.
In the SQL standard, sequences only have a USAGE privilege, which controls the use of the NEXT
VALUE FOR expression, which is equivalent to the function nextval in PostgreSQL. The sequence
privileges SELECT and UPDATE are PostgreSQL extensions. The application of the sequence USAGE
privilege to the currval function is also a PostgreSQL extension (as is the function itself).
See Also
REVOKE, ALTER DEFAULT PRIVILEGES
IMPORT FOREIGN SCHEMA
IMPORT FOREIGN SCHEMA — import table definitions from a foreign server
Synopsis
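IMPORT FOREIGN SCHEMA remote_schema
    [ { LIMIT TO | EXCEPT } ( table_name [, ...] ) ]
    FROM SERVER server_name
    INTO local_schema [ OPTIONS ( option 'value' [, ...] ) ]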
Description
IMPORT FOREIGN SCHEMA creates foreign tables that represent tables existing on a foreign server.
The new foreign tables will be owned by the user issuing the command and are created with the correct
column definitions and options to match the remote tables.
By default, all tables and views existing in a particular schema on the foreign server are imported.
Optionally, the list of tables can be limited to a specified subset, or specific tables can be excluded.
The new foreign tables are all created in the target schema, which must already exist.
To use IMPORT FOREIGN SCHEMA, the user must have USAGE privilege on the foreign server, as
well as CREATE privilege on the target schema.
Parameters
remote_schema
The remote schema to import from. The specific meaning of a remote schema depends on the
foreign data wrapper in use.
LIMIT TO ( table_name [, ...] )
Import only foreign tables matching one of the given table names. Other tables existing in the
foreign schema will be ignored.
EXCEPT ( table_name [, ...] )
Exclude specified foreign tables from the import. All tables existing in the foreign schema will
be imported except the ones listed here.
server_name
The foreign server to import from.
local_schema
The schema in which the imported foreign tables will be created.
OPTIONS ( option 'value' [, ...] )
Options to be used during the import. The allowed option names and values are specific to each
foreign data wrapper.
Examples
Import table definitions from a remote schema foreign_films on server film_server, creating
the foreign tables in local schema films:
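IMPORT FOREIGN SCHEMA foreign_films
    FROM SERVER film_server INTO films;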
As above, but import only the two tables actors and directors (if they exist):
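IMPORT FOREIGN SCHEMA foreign_films LIMIT TO (actors, directors)
    FROM SERVER film_server INTO films;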
Compatibility
The IMPORT FOREIGN SCHEMA command conforms to the SQL standard, except that the OP-
TIONS clause is a PostgreSQL extension.
See Also
CREATE FOREIGN TABLE, CREATE SERVER
INSERT
INSERT — create new rows in a table
Synopsis
( { index_column_name | ( index_expression
) } [ COLLATE collation ] [ opclass ] [, ...] )
[ WHERE index_predicate ]
ON CONSTRAINT constraint_name
DO NOTHING
DO UPDATE SET { column_name = { expression | DEFAULT } |
( column_name [, ...] ) = [ ROW ]
( { expression | DEFAULT } [, ...] ) |
( column_name [, ...] ) = ( sub-SELECT )
} [, ...]
[ WHERE condition ]
Description
INSERT inserts new rows into a table. One can insert one or more rows specified by value expressions,
or zero or more rows resulting from a query.
The target column names can be listed in any order. If no list of column names is given at all, the
default is all the columns of the table in their declared order; or the first N column names, if there
are only N columns supplied by the VALUES clause or query. The values supplied by the VALUES
clause or query are associated with the explicit or implicit column list left-to-right.
Each column not present in the explicit or implicit column list will be filled with a default value, either
its declared default value or null if there is none.
If the expression for any column is not of the correct data type, automatic type conversion will be
attempted.
INSERT into tables that lack unique indexes will not be blocked by concurrent activity. Tables with
unique indexes might block if concurrent sessions perform actions that lock or modify rows matching
the unique index values being inserted; the details are covered in Section 61.5. ON CONFLICT can
be used to specify an alternative action to raising a unique constraint or exclusion constraint violation
error. (See ON CONFLICT Clause below.)
The optional RETURNING clause causes INSERT to compute and return value(s) based on each row
actually inserted (or updated, if an ON CONFLICT DO UPDATE clause was used). This is primarily
useful for obtaining values that were supplied by defaults, such as a serial sequence number. However,
any expression using the table's columns is allowed. The syntax of the RETURNING list is identical
to that of the output list of SELECT. Only rows that were successfully inserted or updated will be
returned. For example, if a row was locked but not updated because an ON CONFLICT DO UP-
DATE ... WHERE clause condition was not satisfied, the row will not be returned.
You must have INSERT privilege on a table in order to insert into it. If ON CONFLICT DO UPDATE
is present, UPDATE privilege on the table is also required.
If a column list is specified, you only need INSERT privilege on the listed columns. Similarly, when
ON CONFLICT DO UPDATE is specified, you only need UPDATE privilege on the column(s) that
are listed to be updated. However, ON CONFLICT DO UPDATE also requires SELECT privilege on
any column whose values are read in the ON CONFLICT DO UPDATE expressions or condition.
Use of the RETURNING clause requires SELECT privilege on all columns mentioned in RETURNING.
If you use the query clause to insert rows from a query, you of course need to have SELECT privilege
on any table or column used in the query.
Parameters
Inserting
This section covers parameters that may be used when only inserting new rows. Parameters exclusively
used with the ON CONFLICT clause are described separately.
with_query
The WITH clause allows you to specify one or more subqueries that can be referenced by name
in the INSERT query. See Section 7.8 and SELECT for details.
It is possible for the query (SELECT statement) to also contain a WITH clause. In such a case
both sets of with_query can be referenced within the query, but the second one takes prece-
dence since it is more closely nested.
table_name
The name (optionally schema-qualified) of an existing table.
alias
A substitute name for table_name. When an alias is provided, it completely hides the actual
name of the table. This is particularly useful when ON CONFLICT DO UPDATE targets a table
named excluded, since that will otherwise be taken as the name of the special table representing
the row proposed for insertion.
column_name
The name of a column in the table named by table_name. The column name can be qualified
with a subfield name or array subscript, if needed. (Inserting into only some fields of a compos-
ite column leaves the other fields null.) When referencing a column with ON CONFLICT DO
UPDATE, do not include the table's name in the specification of a target column. For example,
INSERT INTO table_name ... ON CONFLICT DO UPDATE SET table_name.col
= 1 is invalid (this follows the general behavior for UPDATE).
OVERRIDING SYSTEM VALUE
Without this clause, it is an error to specify an explicit value (other than DEFAULT) for an identity
column defined as GENERATED ALWAYS. This clause overrides that restriction.
OVERRIDING USER VALUE
If this clause is specified, then any values supplied for identity columns defined as GENERATED
BY DEFAULT are ignored and the default sequence-generated values are applied.
This clause is useful for example when copying values between tables. Writing INSERT INTO
tbl2 OVERRIDING USER VALUE SELECT * FROM tbl1 will copy from tbl1 all
columns that are not identity columns in tbl2 while values for the identity columns in tbl2
will be generated by the sequences associated with tbl2.
DEFAULT VALUES
All columns will be filled with their default values. (An OVERRIDING clause is not permitted
in this form.)
expression
An expression or value to assign to the corresponding column.
DEFAULT
The corresponding column will be filled with its default value.
query
A query (SELECT statement) that supplies the rows to be inserted. Refer to the SELECT statement
for a description of the syntax.
output_expression
An expression to be computed and returned by the INSERT command after each row is inserted
or updated. The expression can use any column names of the table named by table_name.
Write * to return all columns of the inserted or updated row(s).
output_name
A name to use for a returned column.
ON CONFLICT Clause
The optional ON CONFLICT clause specifies an alternative action to raising a unique violation or
exclusion constraint violation error. For each individual row proposed for insertion, either the inser-
tion proceeds, or, if an arbiter constraint or index specified by conflict_target is violated, the
alternative conflict_action is taken. ON CONFLICT DO NOTHING simply avoids inserting
a row as its alternative action. ON CONFLICT DO UPDATE updates the existing row that conflicts
with the row proposed for insertion as its alternative action.
conflict_target can perform unique index inference. When performing inference, it consists of
one or more index_column_name columns and/or index_expression expressions, and an
optional index_predicate. All table_name unique indexes that, without regard to order, con-
tain exactly the conflict_target-specified columns/expressions are inferred (chosen) as arbiter
indexes. If an index_predicate is specified, it must, as a further requirement for inference, satisfy
arbiter indexes. Note that this means a non-partial unique index (a unique index without a predicate)
will be inferred (and thus used by ON CONFLICT) if such an index satisfying every other criteria is
available. If an attempt at inference is unsuccessful, an error is raised.
conflict_target
Specifies which conflicts ON CONFLICT takes the alternative action on by choosing arbiter
indexes. Either performs unique index inference, or names a constraint explicitly. For ON CON-
FLICT DO NOTHING, it is optional to specify a conflict_target; when omitted, conflicts
with all usable constraints (and unique indexes) are handled. For ON CONFLICT DO UPDATE,
a conflict_target must be provided.
conflict_action
conflict_action specifies an alternative ON CONFLICT action. It can be either DO NOTHING, or
a DO UPDATE clause specifying the exact details of the UPDATE action to be performed in case
of a conflict. The SET and WHERE clauses in ON CONFLICT DO UPDATE have access to the
existing row using the table's name (or an alias), and to the row proposed for insertion using the
special excluded table.
Note that the effects of all per-row BEFORE INSERT triggers are reflected in excluded values,
since those effects may have contributed to the row being excluded from insertion.
index_column_name
The name of a table_name column. Used to infer arbiter indexes. Follows CREATE INDEX
format. SELECT privilege on index_column_name is required.
index_expression
Similar to index_column_name, but used to infer expressions on table_name columns appearing
within index definitions (not simple columns). Follows CREATE INDEX format. SELECT privilege
on any column appearing within index_expression is required.
collation
When specified, mandates that the corresponding index_column_name or index_expression use a
particular collation in order to be matched during inference. Typically this is omitted, as collations
usually do not affect whether or not a constraint violation occurs.
opclass
When specified, mandates that the corresponding index_column_name or index_expression use a
particular operator class in order to be matched during inference. Typically this is omitted, as it is
usually sufficient to trust that the defined unique indexes have the pertinent definition of equality.
index_predicate
Used to allow inference of partial unique indexes. Any indexes that satisfy the predicate (which
need not actually be partial indexes) can be inferred. Follows CREATE INDEX format. SELECT
privilege on any column appearing within index_predicate is required.
constraint_name
Explicitly specifies an arbiter constraint by name, rather than inferring a constraint or index.
condition
An expression that returns a value of type boolean. Only rows for which this expression returns
true will be updated, although all rows will be locked when the ON CONFLICT DO UPDATE
action is taken. Note that condition is evaluated last, after a conflict has been identified as
a candidate to update.
Note that exclusion constraints are not supported as arbiters with ON CONFLICT DO UPDATE. In
all cases, only NOT DEFERRABLE constraints and unique indexes are supported as arbiters.
Note that it is currently not supported for the ON CONFLICT DO UPDATE clause of an INSERT
applied to a partitioned table to update the partition key of a conflicting row such that it requires the
row be moved to a new partition.
Tip
It is often preferable to use unique index inference rather than naming a constraint directly
using ON CONFLICT ON CONSTRAINT constraint_name. Inference will continue
to work correctly when the underlying index is replaced by another more or less equivalent
index in an overlapping way, for example when using CREATE UNIQUE INDEX ...
CONCURRENTLY before dropping the index being replaced.
Outputs
On successful completion, an INSERT command returns a command tag of the form
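INSERT oid count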
The count is the number of rows inserted or updated. If count is exactly one, and the target table
has OIDs, then oid is the OID assigned to the inserted row. The single row must have been inserted
rather than updated. Otherwise oid is zero.
If the INSERT command contains a RETURNING clause, the result will be similar to that of a SELECT
statement containing the columns and values defined in the RETURNING list, computed over the
row(s) inserted or updated by the command.
Notes
If the specified table is a partitioned table, each row is routed to the appropriate partition and inserted
into it. If the specified table is a partition, an error will occur if one of the input rows violates the
partition constraint.
Examples
Insert a single row into table films:
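INSERT INTO films VALUES
    ('UA502', 'Bananas', 105, '1971-07-13', 'Comedy', '82 minutes');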
In this example, the len column is omitted and therefore it will have the default value:
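INSERT INTO films (code, title, did, date_prod, kind)
    VALUES ('T_601', 'Yojimbo', 106, '1961-06-16', 'Drama');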
This example uses the DEFAULT clause for the date columns rather than specifying a value:
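INSERT INTO films VALUES
    ('UA502', 'Bananas', 105, DEFAULT, 'Comedy', '82 minutes');
INSERT INTO films (code, title, did, date_prod, kind)
    VALUES ('T_601', 'Yojimbo', 106, DEFAULT, 'Drama');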
This example inserts some rows into table films from a table tmp_films with the same column
layout as films:
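INSERT INTO films SELECT * FROM tmp_films WHERE date_prod < '2004-05-07';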
Insert a single row into table distributors, returning the sequence number generated by the DE-
FAULT clause:
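INSERT INTO distributors (did, dname) VALUES (DEFAULT, 'XYZ Widgets')
    RETURNING did;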
Increment the sales count of the salesperson who manages the account for Acme Corporation, and
record the whole updated row along with current time in a log table:
WITH upd AS (
UPDATE employees SET sales_count = sales_count + 1 WHERE id =
(SELECT sales_person FROM accounts WHERE name = 'Acme
Corporation')
RETURNING *
)
INSERT INTO employees_log SELECT *, current_timestamp FROM upd;
Insert or update new distributors as appropriate. Assumes a unique index has been defined that con-
strains values appearing in the did column. Note that the special excluded table is used to reference
values originally proposed for insertion:
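INSERT INTO distributors (did, dname)
    VALUES (5, 'Gizmo Transglobal'), (6, 'Associated Computing, Inc')
    ON CONFLICT (did) DO UPDATE SET dname = EXCLUDED.dname;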
Insert a distributor, or do nothing for rows proposed for insertion when an existing, excluded row
(a row with a matching constrained column or columns after before row insert triggers fire) exists.
Example assumes a unique index has been defined that constrains values appearing in the did column:
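INSERT INTO distributors (did, dname) VALUES (7, 'Redline GmbH')
    ON CONFLICT (did) DO NOTHING;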
Insert or update new distributors as appropriate. Example assumes a unique index has been defined
that constrains values appearing in the did column. WHERE clause is used to limit the rows actually
updated (any existing row not updated will still be locked, though):
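-- (the zipcode condition shown here is an illustrative WHERE clause)
INSERT INTO distributors AS d (did, dname) VALUES (8, 'Anvil Distribution')
    ON CONFLICT (did) DO UPDATE
    SET dname = EXCLUDED.dname || ' (formerly ' || d.dname || ')'
    WHERE d.zipcode <> '21201';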
Insert new distributor if possible; otherwise DO NOTHING. Example assumes a unique index has been
defined that constrains values appearing in the did column on a subset of rows where the is_active
Boolean column evaluates to true:
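INSERT INTO distributors (did, dname) VALUES (10, 'Conrad International')
    ON CONFLICT (did) WHERE is_active DO NOTHING;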
Compatibility
INSERT conforms to the SQL standard, except that the RETURNING clause is a PostgreSQL exten-
sion, as is the ability to use WITH with INSERT, and the ability to specify an alternative action with
ON CONFLICT. Also, the case in which a column name list is omitted, but not all the columns are
filled from the VALUES clause or query, is disallowed by the standard.
The SQL standard specifies that OVERRIDING SYSTEM VALUE can only be specified if an identity
column that is generated always exists. PostgreSQL allows the clause in any case and ignores it if it
is not applicable.
LISTEN
LISTEN — listen for a notification
Synopsis
LISTEN channel
Description
LISTEN registers the current session as a listener on the notification channel named channel. If the
current session is already registered as a listener for this notification channel, nothing is done.
Whenever the command NOTIFY channel is invoked, either by this session or another one con-
nected to the same database, all the sessions currently listening on that notification channel are noti-
fied, and each will in turn notify its connected client application.
A session can be unregistered for a given notification channel with the UNLISTEN command. A
session's listen registrations are automatically cleared when the session ends.
The method a client application must use to detect notification events depends on which PostgreSQL
application programming interface it uses. With the libpq library, the application issues LISTEN as
an ordinary SQL command, and then must periodically call the function PQnotifies to find out
whether any notification events have been received. Other interfaces such as libpgtcl provide high-
er-level methods for handling notify events; indeed, with libpgtcl the application programmer should
not even issue LISTEN or UNLISTEN directly. See the documentation for the interface you are using
for more details.
NOTIFY contains a more extensive discussion of the use of LISTEN and NOTIFY.
Parameters
channel
Name of the notification channel (any identifier).
Notes
LISTEN takes effect at transaction commit. If LISTEN or UNLISTEN is executed within a transaction
that later rolls back, the set of notification channels being listened to is unchanged.
A transaction that has executed LISTEN cannot be prepared for two-phase commit.
Examples
Configure and execute a listen/notify sequence from psql:
LISTEN virtual;
NOTIFY virtual;
Asynchronous notification "virtual" received from server process
with PID 8448.
Compatibility
There is no LISTEN statement in the SQL standard.
See Also
NOTIFY, UNLISTEN
LOAD
LOAD — load a shared library file
Synopsis
LOAD 'filename'
Description
This command loads a shared library file into the PostgreSQL server's address space. If the file has
been loaded already, the command does nothing. Shared library files that contain C functions are
automatically loaded whenever one of their functions is called. Therefore, an explicit LOAD is usually
only needed to load a library that modifies the server's behavior through “hooks” rather than providing
a set of functions.
The library file name is typically given as just a bare file name, which is sought in the server's library
search path (set by dynamic_library_path). Alternatively it can be given as a full path name. In either
case the platform's standard shared library file name extension may be omitted. See Section 38.10.1
for more information on this topic.
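For example, a library installed in the server's library directory, such as the auto_explain contrib
module, can be loaded by its bare file name:

LOAD 'auto_explain';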
Non-superusers can only apply LOAD to library files located in $libdir/plugins/ — the speci-
fied filename must begin with exactly that string. (It is the database administrator's responsibility
to ensure that only “safe” libraries are installed there.)
Compatibility
LOAD is a PostgreSQL extension.
See Also
CREATE FUNCTION
LOCK
LOCK — lock a table
Synopsis
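LOCK [ TABLE ] [ ONLY ] name [ * ] [, ...] [ IN lockmode MODE ] [ NOWAIT ]

where lockmode is one of:

    ACCESS SHARE | ROW SHARE | ROW EXCLUSIVE | SHARE UPDATE EXCLUSIVE
    | SHARE | SHARE ROW EXCLUSIVE | EXCLUSIVE | ACCESS EXCLUSIVE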
Description
LOCK TABLE obtains a table-level lock, waiting if necessary for any conflicting locks to be released.
If NOWAIT is specified, LOCK TABLE does not wait to acquire the desired lock: if it cannot be
acquired immediately, the command is aborted and an error is emitted. Once obtained, the lock is
held for the remainder of the current transaction. (There is no UNLOCK TABLE command; locks are
always released at transaction end.)
When a view is locked, all relations appearing in the view definition query are also locked recursively
with the same lock mode.
When acquiring locks automatically for commands that reference tables, PostgreSQL always uses the
least restrictive lock mode possible. LOCK TABLE provides for cases when you might need more
restrictive locking. For example, suppose an application runs a transaction at the READ COMMITTED
isolation level and needs to ensure that data in a table remains stable for the duration of the transaction.
To achieve this you could obtain SHARE lock mode over the table before querying. This will prevent
concurrent data changes and ensure subsequent reads of the table see a stable view of committed
data, because SHARE lock mode conflicts with the ROW EXCLUSIVE lock acquired by writers, and
your LOCK TABLE name IN SHARE MODE statement will wait until any concurrent holders
of ROW EXCLUSIVE mode locks commit or roll back. Thus, once you obtain the lock, there are no
uncommitted writes outstanding; furthermore none can begin until you release the lock.
To achieve a similar effect when running a transaction at the REPEATABLE READ or SERIALIZ-
ABLE isolation level, you have to execute the LOCK TABLE statement before executing any SELECT
or data modification statement. A REPEATABLE READ or SERIALIZABLE transaction's view of
data will be frozen when its first SELECT or data modification statement begins. A LOCK TABLE
later in the transaction will still prevent concurrent writes — but it won't ensure that what the transac-
tion reads corresponds to the latest committed values.
If a transaction of this sort is going to change the data in the table, then it should use SHARE ROW
EXCLUSIVE lock mode instead of SHARE mode. This ensures that only one transaction of this type
runs at a time. Without this, a deadlock is possible: two transactions might both acquire SHARE mode,
and then be unable to also acquire ROW EXCLUSIVE mode to actually perform their updates. (Note
that a transaction's own locks never conflict, so a transaction can acquire ROW EXCLUSIVE mode
when it holds SHARE mode — but not if anyone else holds SHARE mode.) To avoid deadlocks, make
sure all transactions acquire locks on the same objects in the same order, and if multiple lock modes are
involved for a single object, then transactions should always acquire the most restrictive mode first.
More information about the lock modes and locking strategies can be found in Section 13.3.
Parameters
name
The name (optionally schema-qualified) of an existing table to lock. If ONLY is specified before
the table name, only that table is locked. If ONLY is not specified, the table and all its descendant
tables (if any) are locked. Optionally, * can be specified after the table name to explicitly indicate
that descendant tables are included.
The command LOCK TABLE a, b; is equivalent to LOCK TABLE a; LOCK TABLE b;.
The tables are locked one-by-one in the order specified in the LOCK TABLE command.
lockmode
The lock mode specifies which locks this lock conflicts with. Lock modes are described in Sec-
tion 13.3.
If no lock mode is specified, then ACCESS EXCLUSIVE, the most restrictive mode, is used.
NOWAIT
Specifies that LOCK TABLE should not wait for any conflicting locks to be released: if the
specified lock(s) cannot be acquired immediately without waiting, the transaction is aborted.
Notes
LOCK TABLE ... IN ACCESS SHARE MODE requires SELECT privileges on the target ta-
ble. LOCK TABLE ... IN ROW EXCLUSIVE MODE requires INSERT, UPDATE, DELETE,
or TRUNCATE privileges on the target table. All other forms of LOCK require table-level UPDATE,
DELETE, or TRUNCATE privileges.
The user performing the lock on the view must have the corresponding privilege on the view. In
addition the view's owner must have the relevant privileges on the underlying base relations, but the
user performing the lock does not need any permissions on the underlying base relations.
LOCK TABLE is useless outside a transaction block: the lock would remain held only to the completion
of the statement. Therefore PostgreSQL reports an error if LOCK is used outside a transaction block.
Use BEGIN and COMMIT (or ROLLBACK) to define a transaction block.
LOCK TABLE only deals with table-level locks, and so the mode names involving ROW are all mis-
nomers. These mode names should generally be read as indicating the intention of the user to acquire
row-level locks within the locked table. Also, ROW EXCLUSIVE mode is a shareable table lock.
Keep in mind that all the lock modes have identical semantics so far as LOCK TABLE is concerned,
differing only in the rules about which modes conflict with which. For information on how to acquire
an actual row-level lock, see Section 13.3.2 and The Locking Clause in the SELECT reference
documentation.
Examples
Obtain a SHARE lock on a primary key table when going to perform inserts into a foreign key table:
BEGIN WORK;
LOCK TABLE films IN SHARE MODE;
SELECT id FROM films
WHERE name = 'Star Wars: Episode I - The Phantom Menace';
-- Do ROLLBACK if record was not returned
INSERT INTO films_user_comments VALUES
(_id_, 'GREAT! I was waiting for it for so long!');
COMMIT WORK;
Take a SHARE ROW EXCLUSIVE lock on a primary key table when going to perform a delete
operation:
BEGIN WORK;
LOCK TABLE films IN SHARE ROW EXCLUSIVE MODE;
DELETE FROM films_user_comments WHERE id IN
(SELECT id FROM films WHERE rating < 5);
DELETE FROM films WHERE rating < 5;
COMMIT WORK;
Compatibility
There is no LOCK TABLE in the SQL standard, which instead uses SET TRANSACTION to speci-
fy concurrency levels on transactions. PostgreSQL supports that too; see SET TRANSACTION for
details.
Except for ACCESS SHARE, ACCESS EXCLUSIVE, and SHARE UPDATE EXCLUSIVE lock
modes, the PostgreSQL lock modes and the LOCK TABLE syntax are compatible with those present
in Oracle.
MOVE
MOVE — position a cursor
Synopsis
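MOVE [ direction [ FROM | IN ] ] cursor_name

where direction can be empty or one of: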
NEXT
PRIOR
FIRST
LAST
ABSOLUTE count
RELATIVE count
count
ALL
FORWARD
FORWARD count
FORWARD ALL
BACKWARD
BACKWARD count
BACKWARD ALL
Description
MOVE repositions a cursor without retrieving any data. MOVE works exactly like the FETCH command,
except it only positions the cursor and does not return rows.
The parameters for the MOVE command are identical to those of the FETCH command; refer to FETCH
for details on syntax and usage.
Outputs
On successful completion, a MOVE command returns a command tag of the form
MOVE count
The count is the number of rows that a FETCH command with the same parameters would have
returned (possibly zero).
Examples
BEGIN WORK;
DECLARE liahona CURSOR FOR SELECT * FROM films;

-- Skip the first 5 rows of the cursor liahona:
MOVE FORWARD 5 IN liahona;

COMMIT WORK;
Compatibility
There is no MOVE statement in the SQL standard.
See Also
CLOSE, DECLARE, FETCH
NOTIFY
NOTIFY — generate a notification
Synopsis
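NOTIFY channel [ , payload ]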
Description
The NOTIFY command sends a notification event together with an optional “payload” string to each
client application that has previously executed LISTEN channel for the specified channel name
in the current database. Notifications are visible to all users.
NOTIFY provides a simple interprocess communication mechanism for a collection of processes ac-
cessing the same PostgreSQL database. A payload string can be sent along with the notification, and
higher-level mechanisms for passing structured data can be built by using tables in the database to
pass additional data from notifier to listener(s).
The information passed to the client for a notification event includes the notification channel name,
the notifying session's server process PID, and the payload string, which is an empty string if it has
not been specified.
It is up to the database designer to define the channel names that will be used in a given database and
what each one means. Commonly, the channel name is the same as the name of some table in the
database, and the notify event essentially means, “I changed this table, take a look at it to see what's
new”. But no such association is enforced by the NOTIFY and LISTEN commands. For example, a
database designer could use several different channel names to signal different sorts of changes to a
single table. Alternatively, the payload string could be used to differentiate various cases.
When NOTIFY is used to signal the occurrence of changes to a particular table, a useful programming
technique is to put the NOTIFY in a statement trigger that is triggered by table updates. In this way,
notification happens automatically when the table is changed, and the application programmer cannot
accidentally forget to do it.
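A minimal sketch of that technique, using an illustrative accounts table and PL/pgSQL trigger function:

CREATE FUNCTION accounts_notify() RETURNS trigger AS $$
BEGIN
    NOTIFY accounts;    -- listeners re-read the accounts table on receipt
    RETURN NULL;        -- the return value is ignored for AFTER statement triggers
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER accounts_notify
    AFTER INSERT OR UPDATE OR DELETE ON accounts
    FOR EACH STATEMENT EXECUTE PROCEDURE accounts_notify();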
NOTIFY interacts with SQL transactions in some important ways. Firstly, if a NOTIFY is executed
inside a transaction, the notify events are not delivered until and unless the transaction is committed.
This is appropriate, since if the transaction is aborted, all the commands within it have had no effect,
including NOTIFY. But it can be disconcerting if one is expecting the notification events to be deliv-
ered immediately. Secondly, if a listening session receives a notification signal while it is within a
transaction, the notification event will not be delivered to its connected client until just after the trans-
action is completed (either committed or aborted). Again, the reasoning is that if a notification were
delivered within a transaction that was later aborted, one would want the notification to be undone
somehow — but the server cannot “take back” a notification once it has sent it to the client. So noti-
fication events are only delivered between transactions. The upshot of this is that applications using
NOTIFY for real-time signaling should try to keep their transactions short.
If the same channel name is signaled multiple times from the same transaction with identical payload
strings, the database server can decide to deliver a single notification only. On the other hand, no-
tifications with distinct payload strings will always be delivered as distinct notifications. Similarly,
notifications from different transactions will never get folded into one notification. Except for drop-
ping later instances of duplicate notifications, NOTIFY guarantees that notifications from the same
transaction get delivered in the order they were sent. It is also guaranteed that messages from different
transactions are delivered in the order in which the transactions committed.
It is common for a client that executes NOTIFY to be listening on the same notification channel itself.
In that case it will get back a notification event, just like all the other listening sessions. Depending on
the application logic, this could result in useless work, for example, reading a database table to find
the same updates that that session just wrote out. It is possible to avoid such extra work by noticing
whether the notifying session's server process PID (supplied in the notification event message) is the
same as one's own session's PID (available from libpq). When they are the same, the notification event
is one's own work bouncing back, and can be ignored.
Parameters
channel

Name of the notification channel to be signaled (any identifier).
payload
The “payload” string to be communicated along with the notification. This must be specified as
a simple string literal. In the default configuration it must be shorter than 8000 bytes. (If binary
data or large amounts of information need to be communicated, it's best to put it in a database
table and send the key of the record.)
Notes
There is a queue that holds notifications that have been sent but not yet processed by all listening
sessions. If this queue becomes full, transactions calling NOTIFY will fail at commit. The queue is
quite large (8GB in a standard installation) and should be sufficiently sized for almost every use case.
However, no cleanup can take place if a session executes LISTEN and then enters a transaction for
a very long time. Once the queue is half full you will see warnings in the log file pointing you to the
session that is preventing cleanup. In this case you should make sure that this session ends its current
transaction so that cleanup can proceed.
The function pg_notification_queue_usage returns the fraction of the queue that is currently
occupied by pending notifications. See Section 9.25 for more information.
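For example, to check how full the queue currently is:

SELECT pg_notification_queue_usage();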
A transaction that has executed NOTIFY cannot be prepared for two-phase commit.
pg_notify
To send a notification you can also use the function pg_notify(text, text). The function takes
the channel name as the first argument and the payload as the second. The function is much easier to
use than the NOTIFY command if you need to work with non-constant channel names and payloads.
Examples
Configure and execute a listen/notify sequence from psql:
LISTEN virtual;
NOTIFY virtual;
Asynchronous notification "virtual" received from server process
with PID 8448.
NOTIFY virtual, 'This is the payload';
Asynchronous notification "virtual" with payload "This is the
payload" received from server process with PID 8448.
LISTEN foo;
SELECT pg_notify('fo' || 'o', 'pay' || 'load');
Compatibility
There is no NOTIFY statement in the SQL standard.
See Also
LISTEN, UNLISTEN
PREPARE
PREPARE — prepare a statement for execution
Synopsis
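PREPARE name [ ( data_type [, ...] ) ] AS statement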
Description
PREPARE creates a prepared statement. A prepared statement is a server-side object that can be used
to optimize performance. When the PREPARE statement is executed, the specified statement is parsed,
analyzed, and rewritten. When an EXECUTE command is subsequently issued, the prepared statement
is planned and executed. This division of labor avoids repetitive parse analysis work, while allowing
the execution plan to depend on the specific parameter values supplied.
Prepared statements can take parameters: values that are substituted into the statement when it is
executed. When creating the prepared statement, refer to parameters by position, using $1, $2, etc.
A corresponding list of parameter data types can optionally be specified. When a parameter's data
type is not specified or is declared as unknown, the type is inferred from the context in which the
parameter is first referenced (if possible). When executing the statement, specify the actual values for
these parameters in the EXECUTE statement. Refer to EXECUTE for more information about that.
Prepared statements only last for the duration of the current database session. When the session ends,
the prepared statement is forgotten, so it must be recreated before being used again. This also means
that a single prepared statement cannot be used by multiple simultaneous database clients; howev-
er, each client can create their own prepared statement to use. Prepared statements can be manually
cleaned up using the DEALLOCATE command.
Prepared statements potentially have the largest performance advantage when a single session is being
used to execute a large number of similar statements. The performance difference will be particularly
significant if the statements are complex to plan or rewrite, e.g., if the query involves a join of many
tables or requires the application of several rules. If the statement is relatively simple to plan and
rewrite but relatively expensive to execute, the performance advantage of prepared statements will
be less noticeable.
Parameters
name
An arbitrary name given to this particular prepared statement. It must be unique within a single
session and is subsequently used to execute or deallocate a previously prepared statement.
data_type
The data type of a parameter to the prepared statement. If the data type of a particular parameter is
unspecified or is specified as unknown, it will be inferred from the context in which the parameter
is first referenced. To refer to the parameters in the prepared statement itself, use $1, $2, etc.
statement

Any SELECT, INSERT, UPDATE, DELETE, or VALUES statement.
Notes
Prepared statements can use generic plans rather than re-planning with each set of supplied EXECUTE
values. This occurs immediately for prepared statements with no parameters; otherwise it occurs only
after five or more executions produce plans whose estimated cost average (including planning over-
head) is more expensive than the generic plan cost estimate. Once a generic plan is chosen, it is used for
the remaining lifetime of the prepared statement. Using EXECUTE values which are rare in columns
with many duplicates can generate custom plans that are so much cheaper than the generic plan, even
after adding planning overhead, that the generic plan might never be used.
A generic plan assumes that each value supplied to EXECUTE is one of the column's distinct values and
that column values are uniformly distributed. For example, if statistics record three distinct column
values, a generic plan assumes a column equality comparison will match 33% of processed rows.
Column statistics also allow generic plans to accurately compute the selectivity of unique columns.
Comparisons on non-uniformly-distributed columns and specification of non-existent values affects
the average plan cost, and hence if and when a generic plan is chosen.
To examine the query plan PostgreSQL is using for a prepared statement, use EXPLAIN, e.g., EX-
PLAIN EXECUTE. If a generic plan is in use, it will contain parameter symbols $n, while a custom
plan will have the supplied parameter values substituted into it. The row estimates in the generic plan
reflect the selectivity computed for the parameters.
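For instance, a sketch using an illustrative prepared statement over the films table that appears elsewhere in these reference pages:

PREPARE filmplan (text) AS SELECT * FROM films WHERE name = $1;
EXPLAIN EXECUTE filmplan('Bananas');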
For more information on query planning and the statistics collected by PostgreSQL for that purpose,
see the ANALYZE documentation.
Although the main point of a prepared statement is to avoid repeated parse analysis and planning
of the statement, PostgreSQL will force re-analysis and re-planning of the statement before using it
whenever database objects used in the statement have undergone definitional (DDL) changes since the
previous use of the prepared statement. Also, if the value of search_path changes from one use to the
next, the statement will be re-parsed using the new search_path. (This latter behavior is new as of
PostgreSQL 9.3.) These rules make use of a prepared statement semantically almost equivalent to re-
submitting the same query text over and over, but with a performance benefit if no object definitions
are changed, especially if the best plan remains the same across uses. An example of a case where the
semantic equivalence is not perfect is that if the statement refers to a table by an unqualified name, and
then a new table of the same name is created in a schema appearing earlier in the search_path, no
automatic re-parse will occur since no object used in the statement changed. However, if some other
change forces a re-parse, the new table will be referenced in subsequent uses.
You can see all prepared statements available in the session by querying the pg_prepared_s-
tatements system view.
Examples
Create a prepared statement for an INSERT statement, and then execute it:
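PREPARE fooplan (int, text, bool, numeric) AS
    INSERT INTO foo VALUES($1, $2, $3, $4);
EXECUTE fooplan(1, 'Hunter Valley', 't', 200.00);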
Create a prepared statement for a SELECT statement, and then execute it:
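PREPARE usrrptplan (int) AS
    SELECT * FROM users u, logs l WHERE u.usrid=$1 AND u.usrid=l.usrid
    AND l.date = $2;
EXECUTE usrrptplan(1, current_date);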
Note that the data type of the second parameter is not specified, so it is inferred from the context in
which $2 is used.
Compatibility
The SQL standard includes a PREPARE statement, but it is only for use in embedded SQL. This
version of the PREPARE statement also uses a somewhat different syntax.
See Also
DEALLOCATE, EXECUTE
PREPARE TRANSACTION
PREPARE TRANSACTION — prepare the current transaction for two-phase commit
Synopsis
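PREPARE TRANSACTION transaction_id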
Description
PREPARE TRANSACTION prepares the current transaction for two-phase commit. After this com-
mand, the transaction is no longer associated with the current session; instead, its state is fully stored
on disk, and there is a very high probability that it can be committed successfully, even if a database
crash occurs before the commit is requested.
Once prepared, a transaction can later be committed or rolled back with COMMIT PREPARED or
ROLLBACK PREPARED, respectively. Those commands can be issued from any session, not only
the one that executed the original transaction.
From the point of view of the issuing session, PREPARE TRANSACTION is not unlike a ROLLBACK
command: after executing it, there is no active current transaction, and the effects of the prepared trans-
action are no longer visible. (The effects will become visible again if the transaction is committed.)
If the PREPARE TRANSACTION command fails for any reason, it becomes a ROLLBACK: the current
transaction is canceled.
Parameters
transaction_id
An arbitrary identifier that later identifies this transaction for COMMIT PREPARED or ROLL-
BACK PREPARED. The identifier must be written as a string literal, and must be less than 200
bytes long. It must not be the same as the identifier used for any currently prepared transaction.
Notes
PREPARE TRANSACTION is not intended for use in applications or interactive sessions. Its purpose
is to allow an external transaction manager to perform atomic global transactions across multiple
databases or other transactional resources. Unless you're writing a transaction manager, you probably
shouldn't be using PREPARE TRANSACTION.
This command must be used inside a transaction block. Use BEGIN to start one.
It is not currently allowed to PREPARE a transaction that has executed any operations involving tem-
porary tables or the session's temporary namespace, created any cursors WITH HOLD, or executed
LISTEN, UNLISTEN, or NOTIFY. Those features are too tightly tied to the current session to be
useful in a transaction to be prepared.
If the transaction modified any run-time parameters with SET (without the LOCAL option), those ef-
fects persist after PREPARE TRANSACTION, and will not be affected by any later COMMIT PRE-
PARED or ROLLBACK PREPARED. Thus, in this one respect PREPARE TRANSACTION acts more
like COMMIT than ROLLBACK.
All currently available prepared transactions are listed in the pg_prepared_xacts system view.
Caution
It is unwise to leave transactions in the prepared state for a long time. This will interfere with
the ability of VACUUM to reclaim storage, and in extreme cases could cause the database to
shut down to prevent transaction ID wraparound (see Section 24.1.5). Keep in mind also that
the transaction continues to hold whatever locks it held. The intended usage of the feature is
that a prepared transaction will normally be committed or rolled back as soon as an external
transaction manager has verified that other databases are also prepared to commit.
If you have not set up an external transaction manager to track prepared transactions and ensure
they get closed out promptly, it is best to keep the prepared-transaction feature disabled by
setting max_prepared_transactions to zero. This will prevent accidental creation of prepared
transactions that might then be forgotten and eventually cause problems.
Examples
Prepare the current transaction for two-phase commit, using foobar as the transaction identifier:
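PREPARE TRANSACTION 'foobar';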
Compatibility
PREPARE TRANSACTION is a PostgreSQL extension. It is intended for use by external transaction
management systems, some of which are covered by standards (such as X/Open XA), but the SQL
side of those systems is not standardized.
See Also
COMMIT PREPARED, ROLLBACK PREPARED
REASSIGN OWNED
REASSIGN OWNED — change the ownership of database objects owned by a database role
Synopsis
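REASSIGN OWNED BY { old_role | CURRENT_USER | SESSION_USER } [, ...]
               TO { new_role | CURRENT_USER | SESSION_USER }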
Description
REASSIGN OWNED instructs the system to change the ownership of database objects owned by any
of the old_roles to new_role.
Parameters
old_role
The name of a role. The ownership of all the objects within the current database, and of all shared
objects (databases, tablespaces), owned by this role will be reassigned to new_role.
new_role
The name of the role that will be made the new owner of the affected objects.
Notes
REASSIGN OWNED is often used to prepare for the removal of one or more roles. Because REASSIGN
OWNED does not affect objects within other databases, it is usually necessary to execute this command
in each database that contains objects owned by a role that is to be removed.
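For example, a typical sequence when retiring a role might look like this (the role names are illustrative):

REASSIGN OWNED BY doomed_role TO successor_role;
DROP OWNED BY doomed_role;
-- repeat the two commands above in each database of the cluster
DROP ROLE doomed_role;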
REASSIGN OWNED requires membership on both the source role(s) and the target role.
The DROP OWNED command is an alternative that simply drops all the database objects owned by
one or more roles.
The REASSIGN OWNED command does not affect any privileges granted to the old_roles on
objects that are not owned by them. Likewise, it does not affect default privileges created with ALTER
DEFAULT PRIVILEGES. Use DROP OWNED to revoke such privileges.
Compatibility
The REASSIGN OWNED command is a PostgreSQL extension.
See Also
DROP OWNED, DROP ROLE, ALTER DATABASE
REFRESH MATERIALIZED VIEW
REFRESH MATERIALIZED VIEW — replace the contents of a materialized view
Synopsis
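REFRESH MATERIALIZED VIEW [ CONCURRENTLY ] name
    [ WITH [ NO ] DATA ]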
Description
REFRESH MATERIALIZED VIEW completely replaces the contents of a materialized view. To
execute this command you must be the owner of the materialized view. The old contents are discarded.
If WITH DATA is specified (or defaults) the backing query is executed to provide the new data, and
the materialized view is left in a scannable state. If WITH NO DATA is specified no new data is
generated and the materialized view is left in an unscannable state.
Parameters
CONCURRENTLY
Refresh the materialized view without locking out concurrent selects on the materialized view.
Without this option a refresh which affects a lot of rows will tend to use fewer resources and
complete more quickly, but could block other connections which are trying to read from the ma-
terialized view. This option may be faster in cases where a small number of rows are affected.
This option is only allowed if there is at least one UNIQUE index on the materialized view which
uses only column names and includes all rows; that is, it must not be an expression index or
include a WHERE clause.
This option may not be used when the materialized view is not already populated.
Even with this option only one REFRESH at a time may run against any one materialized view.
name

The name (optionally schema-qualified) of the materialized view to refresh.
Notes
If there is an ORDER BY clause in the materialized view's defining query, the original contents of
the materialized view will be ordered that way; but REFRESH MATERIALIZED VIEW does not
guarantee to preserve that ordering.
Examples
This command will replace the contents of the materialized view called order_summary using the
query from the materialized view's definition, and leave it in a scannable state:
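REFRESH MATERIALIZED VIEW order_summary;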
This command will free storage associated with the materialized view annual_statistics_ba-
sis and leave it in an unscannable state:
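REFRESH MATERIALIZED VIEW annual_statistics_basis WITH NO DATA;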
Compatibility
REFRESH MATERIALIZED VIEW is a PostgreSQL extension.
See Also
CREATE MATERIALIZED VIEW, ALTER MATERIALIZED VIEW, DROP MATERIALIZED
VIEW
REINDEX
REINDEX — rebuild indexes
Synopsis
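REINDEX [ ( VERBOSE ) ] { INDEX | TABLE | SCHEMA | DATABASE | SYSTEM } name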
Description
REINDEX rebuilds an index using the data stored in the index's table, replacing the old copy of the
index. There are several scenarios in which to use REINDEX:
• An index has become corrupted, and no longer contains valid data. Although in theory this should
never happen, in practice indexes can become corrupted due to software bugs or hardware failures.
REINDEX provides a recovery method.
• An index has become “bloated”, that is it contains many empty or nearly-empty pages. This can
occur with B-tree indexes in PostgreSQL under certain uncommon access patterns. REINDEX pro-
vides a way to reduce the space consumption of the index by writing a new version of the index
without the dead pages. See Section 24.2 for more information.
• You have altered a storage parameter (such as fillfactor) for an index, and wish to ensure that the
change has taken full effect.
• An index build with the CONCURRENTLY option failed, leaving an “invalid” index. Such indexes
are useless but it can be convenient to use REINDEX to rebuild them. Note that REINDEX will not
perform a concurrent build. To build the index without interfering with production you should drop
the index and reissue the CREATE INDEX CONCURRENTLY command.
Parameters
INDEX

Recreate the specified index.
TABLE
Recreate all indexes of the specified table. If the table has a secondary “TOAST” table, that is
reindexed as well.
SCHEMA
Recreate all indexes of the specified schema. If a table of this schema has a secondary “TOAST”
table, that is reindexed as well. Indexes on shared system catalogs are also processed. This form
of REINDEX cannot be executed inside a transaction block.
DATABASE
Recreate all indexes within the current database. Indexes on shared system catalogs are also
processed. This form of REINDEX cannot be executed inside a transaction block.
SYSTEM
Recreate all indexes on system catalogs within the current database. Indexes on shared system
catalogs are included. Indexes on user tables are not processed. This form of REINDEX cannot
be executed inside a transaction block.
name
The name of the specific index, table, or database to be reindexed. Index and table names can be
schema-qualified. Presently, REINDEX DATABASE and REINDEX SYSTEM can only reindex
the current database, so their parameter must match the current database's name.
VERBOSE

Prints a progress report as each index is reindexed.
Notes
If you suspect corruption of an index on a user table, you can simply rebuild that index, or all indexes
on the table, using REINDEX INDEX or REINDEX TABLE.
Things are more difficult if you need to recover from corruption of an index on a system table. In
this case it's important for the system to not have used any of the suspect indexes itself. (Indeed, in
this sort of scenario you might find that server processes are crashing immediately at start-up, due to
reliance on the corrupted indexes.) To recover safely, the server must be started with the -P option,
which prevents it from using indexes for system catalog lookups.
One way to do this is to shut down the server and start a single-user PostgreSQL server with the -P
option included on its command line. Then, REINDEX DATABASE, REINDEX SYSTEM, REINDEX
TABLE, or REINDEX INDEX can be issued, depending on how much you want to reconstruct. If in
doubt, use REINDEX SYSTEM to select reconstruction of all system indexes in the database. Then
quit the single-user server session and restart the regular server. See the postgres reference page for
more information about how to interact with the single-user server interface.
Alternatively, a regular server session can be started with -P included in its command line options.
The method for doing this varies across clients, but in all libpq-based clients, it is possible to set the
PGOPTIONS environment variable to -P before starting the client. Note that while this method does
not require locking out other clients, it might still be wise to prevent other users from connecting to
the damaged database until repairs have been completed.
REINDEX is similar to a drop and recreate of the index in that the index contents are rebuilt from
scratch. However, the locking considerations are rather different. REINDEX locks out writes but not
reads of the index's parent table. It also takes an ACCESS EXCLUSIVE lock on the specific index
being processed, which will block reads that attempt to use that index. In contrast, DROP INDEX
momentarily takes an ACCESS EXCLUSIVE lock on the parent table, blocking both writes and reads.
The subsequent CREATE INDEX locks out writes but not reads; since the index is not there, no read
will attempt to use it, meaning that there will be no blocking but reads might be forced into expensive
sequential scans.
Reindexing a single index or table requires being the owner of that index or table. Reindexing a schema
or database requires being the owner of that schema or database. Note that it is therefore sometimes
possible for non-superusers to rebuild indexes of tables owned by other users. However, as a special
exception, when REINDEX DATABASE, REINDEX SCHEMA or REINDEX SYSTEM is issued by
a non-superuser, indexes on shared catalogs will be skipped unless the user owns the catalog (which
typically won't be the case). Of course, superusers can always reindex anything.
Reindexing partitioned tables or partitioned indexes is not supported. Each individual partition can be
reindexed separately instead.
Examples
Rebuild a single index:
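REINDEX INDEX my_index;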
Rebuild all indexes in a particular database, without trusting the system indexes to be valid already:
$ export PGOPTIONS="-P"
$ psql broken_db
...
broken_db=> REINDEX DATABASE broken_db;
broken_db=> \q
Compatibility
There is no REINDEX command in the SQL standard.
RELEASE SAVEPOINT
RELEASE SAVEPOINT — destroy a previously defined savepoint
Synopsis
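RELEASE [ SAVEPOINT ] savepoint_name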
Description
RELEASE SAVEPOINT destroys a savepoint previously defined in the current transaction.
Destroying a savepoint makes it unavailable as a rollback point, but it has no other user visible behav-
ior. It does not undo the effects of commands executed after the savepoint was established. (To do
that, see ROLLBACK TO SAVEPOINT.) Destroying a savepoint when it is no longer needed allows
the system to reclaim some resources earlier than transaction end.
RELEASE SAVEPOINT also destroys all savepoints that were established after the named savepoint
was established.
Parameters
savepoint_name

The name of the savepoint to destroy.
Notes
Specifying a savepoint name that was not previously defined is an error.
If multiple savepoints have the same name, only the most recently defined unreleased one is released.
Repeated commands will release progressively older savepoints.
Examples
To establish and later destroy a savepoint:
BEGIN;
INSERT INTO table1 VALUES (3);
SAVEPOINT my_savepoint;
INSERT INTO table1 VALUES (4);
RELEASE SAVEPOINT my_savepoint;
COMMIT;
Compatibility
This command conforms to the SQL standard. The standard specifies that the key word SAVEPOINT
is mandatory, but PostgreSQL allows it to be omitted.
See Also
BEGIN, COMMIT, ROLLBACK, ROLLBACK TO SAVEPOINT, SAVEPOINT
RESET
RESET — restore the value of a run-time parameter to the default value
Synopsis
RESET configuration_parameter
RESET ALL
Description
RESET restores run-time parameters to their default values. RESET is an alternative spelling for SET configuration_parameter TO DEFAULT; refer to SET for details.
The default value is defined as the value that the parameter would have had, if no SET had ever been
issued for it in the current session. The actual source of this value might be a compiled-in default, the
configuration file, command-line options, or per-database or per-user default settings. This is subtly
different from defining it as “the value that the parameter had at session start”, because if the value
came from the configuration file, it will be reset to whatever is specified by the configuration file now.
See Chapter 19 for details.
The transactional behavior of RESET is the same as SET: its effects will be undone by transaction
rollback.
Parameters
configuration_parameter
Name of a settable run-time parameter. Available parameters are documented in Chapter 19 and
on the SET reference page.
ALL

Resets all settable run-time parameters to default values.
Examples
Set the timezone configuration variable to its default value:
RESET timezone;
Compatibility
RESET is a PostgreSQL extension.
See Also
SET, SHOW
REVOKE
REVOKE — remove access privileges
Synopsis
REVOKE [ GRANT OPTION FOR ]
{ { SELECT | INSERT | UPDATE | DELETE | TRUNCATE | REFERENCES |
TRIGGER }
[, ...] | ALL [ PRIVILEGES ] }
ON { [ TABLE ] table_name [, ...]
| ALL TABLES IN SCHEMA schema_name [, ...] }
FROM role_specification [, ...]
[ CASCADE | RESTRICT ]
where role_specification can be:

    [ GROUP ] role_name
  | PUBLIC
  | CURRENT_USER
  | SESSION_USER
Description
The REVOKE command revokes previously granted privileges from one or more roles. The key word
PUBLIC refers to the implicitly defined group of all roles.
See the description of the GRANT command for the meaning of the privilege types.
Note that any particular role will have the sum of privileges granted directly to it, privileges granted to
any role it is presently a member of, and privileges granted to PUBLIC. Thus, for example, revoking
SELECT privilege from PUBLIC does not necessarily mean that all roles have lost SELECT privi-
lege on the object: those who have it granted directly or via another role will still have it. Similarly,
revoking SELECT from a user might not prevent that user from using SELECT if PUBLIC or another
membership role still has SELECT rights.
If GRANT OPTION FOR is specified, only the grant option for the privilege is revoked, not the
privilege itself. Otherwise, both the privilege and the grant option are revoked.
If a user holds a privilege with grant option and has granted it to other users then the privileges held
by those other users are called dependent privileges. If the privilege or the grant option held by the
first user is being revoked and dependent privileges exist, those dependent privileges are also revoked
if CASCADE is specified; if it is not, the revoke action will fail. This recursive revocation only affects
privileges that were granted through a chain of users that is traceable to the user that is the subject
of this REVOKE command. Thus, the affected users might effectively keep the privilege if it was also
granted through other users.
When revoking privileges on a table, the corresponding column privileges (if any) are automatically
revoked on each column of the table, as well. On the other hand, if a role has been granted privileges
on a table, then revoking the same privileges from individual columns will have no effect.
When revoking membership in a role, GRANT OPTION is instead called ADMIN OPTION, but the
behavior is similar. This form of the command also allows a GRANTED BY option, but that option
is currently ignored (except for checking the existence of the named role). Note also that this form of
the command does not allow the noise word GROUP in role_specification.
Notes
Use psql's \dp command to display the privileges granted on existing tables and columns. See
GRANT for information about the format. For non-table objects there are other \d commands that
can display their privileges.
A user can only revoke privileges that were granted directly by that user. If, for example, user A has
granted a privilege with grant option to user B, and user B has in turn granted it to user C, then user
A cannot revoke the privilege directly from C. Instead, user A could revoke the grant option from
user B and use the CASCADE option so that the privilege is in turn revoked from user C. For another
example, if both A and B have granted the same privilege to C, A can revoke their own grant but not
B's grant, so C will still effectively have the privilege.
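In the first scenario, user A might issue something like the following (the table and role names are illustrative):

REVOKE GRANT OPTION FOR SELECT ON mytable FROM b CASCADE;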
When a non-owner of an object attempts to REVOKE privileges on the object, the command will fail
outright if the user has no privileges whatsoever on the object. As long as some privilege is available,
the command will proceed, but it will revoke only those privileges for which the user has grant options.
The REVOKE ALL PRIVILEGES forms will issue a warning message if no grant options are held,
while the other forms will issue a warning if grant options for any of the privileges specifically named
in the command are not held. (In principle these statements apply to the object owner as well, but since
the owner is always treated as holding all grant options, the cases can never occur.)
If a superuser chooses to issue a GRANT or REVOKE command, the command is performed as though
it were issued by the owner of the affected object. Since all privileges ultimately come from the object
owner (possibly indirectly via chains of grant options), it is possible for a superuser to revoke all
privileges, but this might require use of CASCADE as stated above.
REVOKE can also be done by a role that is not the owner of the affected object, but is a member of the
role that owns the object, or is a member of a role that holds privileges WITH GRANT OPTION on
the object. In this case the command is performed as though it were issued by the containing role that
actually owns the object or holds the privileges WITH GRANT OPTION. For example, if table t1 is
owned by role g1, of which role u1 is a member, then u1 can revoke privileges on t1 that are recorded
as being granted by g1. This would include grants made by u1 as well as by other members of role g1.
If the role executing REVOKE holds privileges indirectly via more than one role membership path, it
is unspecified which containing role will be used to perform the command. In such cases it is best
practice to use SET ROLE to become the specific role you want to do the REVOKE as. Failure to do
so might lead to revoking privileges other than the ones you intended, or not revoking anything at all.
Examples
Revoke insert privilege for the public on table films:
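REVOKE INSERT ON films FROM PUBLIC;

Revoke all privileges from a user (the table and role names are illustrative):

REVOKE ALL PRIVILEGES ON kinds FROM manuel;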
Note that this actually means “revoke all privileges that I granted”.
Compatibility
The compatibility notes of the GRANT command apply analogously to REVOKE. The keyword
RESTRICT or CASCADE is required according to the standard, but PostgreSQL assumes RESTRICT
by default.
See Also
GRANT
ROLLBACK
ROLLBACK — abort the current transaction
Synopsis
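ROLLBACK [ WORK | TRANSACTION ]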
Description
ROLLBACK rolls back the current transaction and causes all the updates made by the transaction to
be discarded.
Parameters
WORK
TRANSACTION

Optional key words. They have no effect.
Notes
Use COMMIT to successfully terminate a transaction.
Issuing ROLLBACK outside of a transaction block emits a warning and otherwise has no effect.
Examples
To abort all changes:
ROLLBACK;
Compatibility
The SQL standard only specifies the two forms ROLLBACK and ROLLBACK WORK. Otherwise, this
command is fully conforming.
See Also
BEGIN, COMMIT, ROLLBACK TO SAVEPOINT
ROLLBACK PREPARED
ROLLBACK PREPARED — cancel a transaction that was earlier prepared for two-phase commit
Synopsis
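ROLLBACK PREPARED transaction_id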
Description
ROLLBACK PREPARED rolls back a transaction that is in prepared state.
Parameters
transaction_id

The transaction identifier of the transaction that is to be rolled back.
Notes
To roll back a prepared transaction, you must be either the same user that executed the transaction
originally, or a superuser. But you do not have to be in the same session that executed the transaction.
This command cannot be executed inside a transaction block. The prepared transaction is rolled back
immediately.
All currently available prepared transactions are listed in the pg_prepared_xacts system view.
Examples
Roll back the transaction identified by the transaction identifier foobar:
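ROLLBACK PREPARED 'foobar';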
Compatibility
ROLLBACK PREPARED is a PostgreSQL extension. It is intended for use by external transaction
management systems, some of which are covered by standards (such as X/Open XA), but the SQL
side of those systems is not standardized.
See Also
PREPARE TRANSACTION, COMMIT PREPARED
ROLLBACK TO SAVEPOINT
ROLLBACK TO SAVEPOINT — roll back to a savepoint
Synopsis
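ROLLBACK [ WORK | TRANSACTION ] TO [ SAVEPOINT ] savepoint_name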
Description
Roll back all commands that were executed after the savepoint was established. The savepoint remains
valid and can be rolled back to again later, if needed.
ROLLBACK TO SAVEPOINT implicitly destroys all savepoints that were established after the named
savepoint.
Parameters
savepoint_name

The savepoint to roll back to.
Notes
Use RELEASE SAVEPOINT to destroy a savepoint without discarding the effects of commands ex-
ecuted after it was established.
Cursors have somewhat non-transactional behavior with respect to savepoints. Any cursor that is
opened inside a savepoint will be closed when the savepoint is rolled back. If a previously opened
cursor is affected by a FETCH or MOVE command inside a savepoint that is later rolled back, the cursor
remains at the position that FETCH left it pointing to (that is, the cursor motion caused by FETCH is
not rolled back). Closing a cursor is not undone by rolling back, either. However, other side-effects
caused by the cursor's query (such as side-effects of volatile functions called by the query) are rolled
back if they occur during a savepoint that is later rolled back. A cursor whose execution causes a
transaction to abort is put in a cannot-execute state, so while the transaction can be restored using
ROLLBACK TO SAVEPOINT, the cursor can no longer be used.
Examples
To undo the effects of the commands executed after my_savepoint was established:
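ROLLBACK TO SAVEPOINT my_savepoint;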
Cursor positions are not affected by savepoint rollback:

BEGIN;
DECLARE foo CURSOR FOR SELECT 1 UNION SELECT 2;
SAVEPOINT foo;
FETCH 1 FROM foo;
ROLLBACK TO SAVEPOINT foo;
FETCH 1 FROM foo;  -- fetches the second row; the cursor position was not rolled back
CLOSE foo;
COMMIT;
Compatibility
The SQL standard specifies that the key word SAVEPOINT is mandatory, but PostgreSQL and Oracle
allow it to be omitted. SQL allows only WORK, not TRANSACTION, as a noise word after ROLLBACK.
Also, SQL has an optional clause AND [ NO ] CHAIN which is not currently supported by Post-
greSQL. Otherwise, this command conforms to the SQL standard.
See Also
BEGIN, COMMIT, RELEASE SAVEPOINT, ROLLBACK, SAVEPOINT
SAVEPOINT
SAVEPOINT — define a new savepoint within the current transaction
Synopsis
SAVEPOINT savepoint_name
Description
SAVEPOINT establishes a new savepoint within the current transaction.
A savepoint is a special mark inside a transaction that allows all commands that are executed after
it was established to be rolled back, restoring the transaction state to what it was at the time of the
savepoint.
Parameters
savepoint_name
The name to give to the new savepoint. If savepoints with the same name already exist, they will
be inaccessible until newer identically-named savepoints are released.
Notes
Use ROLLBACK TO SAVEPOINT to rollback to a savepoint. Use RELEASE SAVEPOINT to de-
stroy a savepoint, keeping the effects of commands executed after it was established.
Savepoints can only be established when inside a transaction block. There can be multiple savepoints
defined within a transaction.
Examples
To establish a savepoint and later undo the effects of all commands executed after it was established:
BEGIN;
INSERT INTO table1 VALUES (1);
SAVEPOINT my_savepoint;
INSERT INTO table1 VALUES (2);
ROLLBACK TO SAVEPOINT my_savepoint;
INSERT INTO table1 VALUES (3);
COMMIT;
The above transaction will insert the values 1 and 3, but not 2.
BEGIN;
INSERT INTO table1 VALUES (3);
SAVEPOINT my_savepoint;
INSERT INTO table1 VALUES (4);
RELEASE SAVEPOINT my_savepoint;
COMMIT;

The above transaction will insert both 3 and 4.
BEGIN;
INSERT INTO table1 VALUES (1);
SAVEPOINT my_savepoint;
INSERT INTO table1 VALUES (2);
SAVEPOINT my_savepoint;
INSERT INTO table1 VALUES (3);

-- rollback to the second savepoint, undoing the insertion of 3
ROLLBACK TO SAVEPOINT my_savepoint;

-- release the second savepoint, making the first one accessible again
RELEASE SAVEPOINT my_savepoint;

-- rollback to the first savepoint, undoing the insertion of 2
ROLLBACK TO SAVEPOINT my_savepoint;
COMMIT;

The above transaction rolls back the insertion of 3 first and then the insertion of 2, so only the value 1 is inserted.
Compatibility
SQL requires a savepoint to be destroyed automatically when another savepoint with the same name
is established. In PostgreSQL, the old savepoint is kept, though only the more recent one will be used
when rolling back or releasing. (Releasing the newer savepoint with RELEASE SAVEPOINT will
cause the older one to again become accessible to ROLLBACK TO SAVEPOINT and RELEASE
SAVEPOINT.) Otherwise, SAVEPOINT is fully SQL conforming.
See Also
BEGIN, COMMIT, RELEASE SAVEPOINT, ROLLBACK, ROLLBACK TO SAVEPOINT
SECURITY LABEL
SECURITY LABEL — define or change a security label applied to an object
Synopsis
where aggregate_signature is:

    * |
    [ argmode ] [ argname ] argtype [ , ... ] |
    [ [ argmode ] [ argname ] argtype [ , ... ] ]
        ORDER BY [ argmode ] [ argname ] argtype [ , ... ]
Description
SECURITY LABEL applies a security label to a database object. An arbitrary number of security
labels, one per label provider, can be associated with a given database object. Label providers are
loadable modules which register themselves by using the function register_label_provider.
Note
register_label_provider is not an SQL function; it can only be called from C code
loaded into the backend.
The label provider determines whether a given label is valid and whether it is permissible to assign
that label to a given object. The meaning of a given label is likewise at the discretion of the label
provider. PostgreSQL places no restrictions on whether or how a label provider must interpret security
labels; it merely provides a mechanism for storing them. In practice, this facility is intended to allow
integration with label-based mandatory access control (MAC) systems such as SELinux. Such systems
make all access control decisions based on object labels, rather than traditional discretionary access
control (DAC) concepts such as users and groups.
Parameters
object_name
table_name.column_name
aggregate_name
function_name
procedure_name
routine_name
The name of the object to be labeled. Names of tables, aggregates, domains, foreign tables, func-
tions, procedures, routines, sequences, types, and views can be schema-qualified.
provider
The name of the provider with which this label is to be associated. The named provider must be
loaded and must consent to the proposed labeling operation. If exactly one provider is loaded, the
provider name may be omitted for brevity.
argmode
The mode of a function, procedure, or aggregate argument: IN, OUT, INOUT, or VARIADIC. If
omitted, the default is IN. Note that SECURITY LABEL does not actually pay any attention to
OUT arguments, since only the input arguments are needed to determine the function's identity.
So it is sufficient to list the IN, INOUT, and VARIADIC arguments.
argname
The name of a function, procedure, or aggregate argument. Note that SECURITY LABEL does
not actually pay any attention to argument names, since only the argument data types are needed
to determine the function's identity.
argtype

The data type(s) of the function, procedure, or aggregate arguments (optionally schema-qualified), if any.

large_object_oid

The OID of the large object.

PROCEDURAL

This is a noise word.
label
The new security label, written as a string literal; or NULL to drop the security label.
Examples
The following example shows how the security label of a table might be changed.
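SECURITY LABEL FOR selinux ON TABLE mytable
    IS 'system_u:object_r:sepgsql_table_t:s0';

(The provider name, table name, and label shown here are only illustrative; the set of valid labels depends on the label provider in use.)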
Compatibility
There is no SECURITY LABEL command in the SQL standard.
See Also
sepgsql, src/test/modules/dummy_seclabel
SELECT
SELECT, TABLE, WITH — retrieve rows from a table or view
Synopsis
and grouping_element can be one of:

    ( )
    expression
    ( expression [, ...] )
    ROLLUP ( { expression | ( expression [, ...] ) } [, ...] )
    CUBE ( { expression | ( expression [, ...] ) } [, ...] )
    GROUPING SETS ( grouping_element [, ...] )
Description
SELECT retrieves rows from zero or more tables. The general processing of SELECT is as follows:
1. All queries in the WITH list are computed. These effectively serve as temporary tables that can be
referenced in the FROM list. A WITH query that is referenced more than once in FROM is computed
only once. (See WITH Clause below.)
2. All elements in the FROM list are computed. (Each element in the FROM list is a real or virtual
table.) If more than one element is specified in the FROM list, they are cross-joined together. (See
FROM Clause below.)
3. If the WHERE clause is specified, all rows that do not satisfy the condition are eliminated from the
output. (See WHERE Clause below.)
4. If the GROUP BY clause is specified, or if there are aggregate function calls, the output is combined
into groups of rows that match on one or more values, and the results of aggregate functions are
computed. If the HAVING clause is present, it eliminates groups that do not satisfy the given con-
dition. (See GROUP BY Clause and HAVING Clause below.)
5. The actual output rows are computed using the SELECT output expressions for each selected row
or row group. (See SELECT List below.)
6. SELECT DISTINCT eliminates duplicate rows from the result. SELECT DISTINCT ON elim-
inates rows that match on all the specified expressions. SELECT ALL (the default) will return all
candidate rows, including duplicates. (See DISTINCT Clause below.)
7. Using the operators UNION, INTERSECT, and EXCEPT, the output of more than one SELECT
statement can be combined to form a single result set. The UNION operator returns all rows that
are in one or both of the result sets. The INTERSECT operator returns all rows that are strictly in
both result sets. The EXCEPT operator returns the rows that are in the first result set but not in the
second. In all three cases, duplicate rows are eliminated unless ALL is specified. The noise word
DISTINCT can be added to explicitly specify eliminating duplicate rows. Notice that DISTINCT
is the default behavior here, even though ALL is the default for SELECT itself. (See UNION Clause,
INTERSECT Clause, and EXCEPT Clause below.)
8. If the ORDER BY clause is specified, the returned rows are sorted in the specified order. If ORDER
BY is not given, the rows are returned in whatever order the system finds fastest to produce. (See
ORDER BY Clause below.)
9. If the LIMIT (or FETCH FIRST) or OFFSET clause is specified, the SELECT statement only
returns a subset of the result rows. (See LIMIT Clause below.)
10.If FOR UPDATE, FOR NO KEY UPDATE, FOR SHARE or FOR KEY SHARE is specified, the
SELECT statement locks the selected rows against concurrent updates. (See The Locking Clause
below.)
You must have SELECT privilege on each column used in a SELECT command. The use of FOR NO
KEY UPDATE, FOR UPDATE, FOR SHARE or FOR KEY SHARE requires UPDATE privilege as
well (for at least one column of each table so selected).
Parameters
WITH Clause
The WITH clause allows you to specify one or more subqueries that can be referenced by name in
the primary query. The subqueries effectively act as temporary tables or views for the duration of the
primary query. Each subquery can be a SELECT, TABLE, VALUES, INSERT, UPDATE or DELETE
statement. When writing a data-modifying statement (INSERT, UPDATE or DELETE) in WITH, it is
usual to include a RETURNING clause. It is the output of RETURNING, not the underlying table that the
statement modifies, that forms the temporary table that is read by the primary query. If RETURNING
is omitted, the statement is still executed, but it produces no output so it cannot be referenced as a
table by the primary query.
A name (without schema qualification) must be specified for each WITH query. Optionally, a list of
column names can be specified; if this is omitted, the column names are inferred from the subquery.
If RECURSIVE is specified, it allows a SELECT subquery to reference itself by name. Such a subquery
must have the form

non_recursive_term UNION [ ALL | DISTINCT ] recursive_term

where the recursive self-reference must appear on the right-hand side of the UNION. Only one recur-
sive self-reference is permitted per query. Recursive data-modifying statements are not supported, but
you can use the results of a recursive SELECT query in a data-modifying statement. See Section 7.8
for an example.
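A minimal self-contained illustration, which sums the integers from 1 through 100:

WITH RECURSIVE t(n) AS (
    VALUES (1)
  UNION ALL
    SELECT n + 1 FROM t WHERE n < 100
)
SELECT sum(n) FROM t;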
Another effect of RECURSIVE is that WITH queries need not be ordered: a query can reference another
one that is later in the list. (However, circular references, or mutual recursion, are not implemented.)
Without RECURSIVE, WITH queries can only reference sibling WITH queries that are earlier in the
WITH list.
A key property of WITH queries is that they are evaluated only once per execution of the primary query,
even if the primary query refers to them more than once. In particular, data-modifying statements are
guaranteed to be executed once and only once, regardless of whether the primary query reads all or
any of their output.
When there are multiple queries in the WITH clause, RECURSIVE should be written only once, im-
mediately after WITH. It applies to all queries in the WITH clause, though it has no effect on queries
that do not use recursion or forward references.
The primary query and the WITH queries are all (notionally) executed at the same time. This implies
that the effects of a data-modifying statement in WITH cannot be seen from other parts of the query,
other than by reading its RETURNING output. If two such data-modifying statements attempt to modify
the same row, the results are unspecified.
FROM Clause
The FROM clause specifies one or more source tables for the SELECT. If multiple sources are specified,
the result is the Cartesian product (cross join) of all the sources. But usually qualification conditions
are added (via WHERE) to restrict the returned rows to a small subset of the Cartesian product.
table_name
The name (optionally schema-qualified) of an existing table or view. If ONLY is specified before
the table name, only that table is scanned. If ONLY is not specified, the table and all its descendant
tables (if any) are scanned. Optionally, * can be specified after the table name to explicitly indicate
that descendant tables are included.
alias
A substitute name for the FROM item containing the alias. An alias is used for brevity or to elim-
inate ambiguity for self-joins (where the same table is scanned multiple times). When an alias is
provided, it completely hides the actual name of the table or function; for example given FROM
foo AS f, the remainder of the SELECT must refer to this FROM item as f not foo. If an
alias is written, a column alias list can also be written to provide substitute names for one or more
columns of the table.
TABLESAMPLE sampling_method ( argument [, ...] ) [ REPEATABLE ( seed ) ]

A TABLESAMPLE clause after a table_name indicates that the specified sampling_method
should be used to retrieve a subset of the rows in that table. The BERNOULLI and SYSTEM
sampling methods each accept a single argument which is the
fraction of the table to sample, expressed as a percentage between 0 and 100. This argument can
be any real-valued expression. (Other sampling methods might accept more or different argu-
ments.) These two methods each return a randomly-chosen sample of the table that will contain
approximately the specified percentage of the table's rows. The BERNOULLI method scans the
whole table and selects or ignores individual rows independently with the specified probability.
The SYSTEM method does block-level sampling with each block having the specified chance of
being selected; all rows in each selected block are returned. The SYSTEM method is significantly
faster than the BERNOULLI method when small sampling percentages are specified, but it may
return a less-random sample of the table as a result of clustering effects.
The optional REPEATABLE clause specifies a seed number or expression to use for generating
random numbers within the sampling method. The seed value can be any non-null floating-point
value. Two queries that specify the same seed and argument values will select the same sample
of the table, if the table has not been changed meanwhile. But different seed values will usually
produce different samples. If REPEATABLE is not given then a new random sample is selected
for each query, based upon a system-generated seed. Note that some add-on sampling methods
do not accept REPEATABLE, and will always produce new samples on each use.
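For example, the following sketches sample roughly ten percent of a hypothetical orders table; the REPEATABLE seed makes the first query return the same sample on repeated execution:

-- BERNOULLI scans the whole table, choosing rows independently.
SELECT * FROM orders TABLESAMPLE BERNOULLI (10) REPEATABLE (42);
-- SYSTEM picks whole blocks, which is faster but can be less random.
SELECT * FROM orders TABLESAMPLE SYSTEM (10);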
select
A sub-SELECT can appear in the FROM clause. This acts as though its output were created as
a temporary table for the duration of this single SELECT command. Note that the sub-SELECT
must be surrounded by parentheses, and an alias must be provided for it. A VALUES command
can also be used here.
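For example (the payments table is hypothetical); note that both forms need parentheses and an alias:

-- A sub-SELECT acting as a temporary table for this one command:
SELECT s.total
    FROM (SELECT sum(amount) AS total FROM payments) AS s;
-- A VALUES list used in the same position:
SELECT v.id, v.label
    FROM (VALUES (1, 'one'), (2, 'two')) AS v(id, label);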
with_query_name
A WITH query is referenced by writing its name, just as though the query's name were a table
name. (In fact, the WITH query hides any real table of the same name for the purposes of the
primary query. If necessary, you can refer to a real table of the same name by schema-qualifying
the table's name.) An alias can be provided in the same way as for a table.
function_name
Function calls can appear in the FROM clause. (This is especially useful for functions that return
result sets, but any function can be used.) This acts as though the function's output were created
as a temporary table for the duration of this single SELECT command. When the optional WITH
ORDINALITY clause is added to the function call, a new column of type bigint is appended after all
the function's output columns; it numbers the rows of the function's result set, starting from 1.
An alias can be provided in the same way as for a table. If an alias is written, a column alias
list can also be written to provide substitute names for one or more attributes of the function's
composite return type, including the column added by ORDINALITY if present.
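For example, using the built-in set-returning function unnest (the alias names are arbitrary):

-- WITH ORDINALITY appends a bigint column numbering the rows, starting from 1.
SELECT elem, ord
    FROM unnest(ARRAY['a', 'b', 'c']) WITH ORDINALITY AS t(elem, ord);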
Multiple function calls can be combined into a single FROM-clause item by surrounding them
with ROWS FROM( ... ). The output of such an item is the concatenation of the first row
from each function, then the second row from each function, etc. If some of the functions produce
fewer rows than others, null values are substituted for the missing data, so that the total number
of rows returned is always the same as for the function that produced the most rows.
If the function has been defined as returning the record data type, then an alias or the key
word AS must be present, followed by a column definition list in the form ( column_name
data_type [, ... ]). The column definition list must match the actual number and types
of columns returned by the function.
When using the ROWS FROM( ... ) syntax, if one of the functions requires a column de-
finition list, it's preferred to put the column definition list after the function call inside ROWS
FROM( ... ). A column definition list can be placed after the ROWS FROM( ... ) construct
only if there's just a single function and no WITH ORDINALITY clause.
To use ORDINALITY together with a column definition list, you must use the ROWS
FROM( ... ) syntax and put the column definition list inside ROWS FROM( ... ).
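As a small sketch using the built-in generate_series function, ROWS FROM runs the functions in parallel and pads the shorter result with nulls:

-- Produces (1,10,1), (2,11,2), (3,NULL,3); the third column comes from WITH ORDINALITY.
SELECT *
    FROM ROWS FROM (generate_series(1, 3), generate_series(10, 11))
         WITH ORDINALITY AS t(a, b, ord);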
join_type
One of
• [ INNER ] JOIN
• LEFT [ OUTER ] JOIN
• RIGHT [ OUTER ] JOIN
• FULL [ OUTER ] JOIN
• CROSS JOIN
For the INNER and OUTER join types, a join condition must be specified, namely exactly one of
NATURAL, ON join_condition, or USING (join_column [, ...]). See below for
the meaning. For CROSS JOIN, none of these clauses can appear.
A JOIN clause combines two FROM items, which for convenience we will refer to as “tables”,
though in reality they can be any type of FROM item. Use parentheses if necessary to determine
the order of nesting. In the absence of parentheses, JOINs nest left-to-right. In any case JOIN
binds more tightly than the commas separating FROM-list items.
CROSS JOIN and INNER JOIN produce a simple Cartesian product, the same result as you
get from listing the two tables at the top level of FROM, but restricted by the join condition (if
any). CROSS JOIN is equivalent to INNER JOIN ON (TRUE), that is, no rows are removed
by qualification. These join types are just a notational convenience, since they do nothing you
couldn't do with plain FROM and WHERE.
LEFT OUTER JOIN returns all rows in the qualified Cartesian product (i.e., all combined rows
that pass its join condition), plus one copy of each row in the left-hand table for which there was
no right-hand row that passed the join condition. This left-hand row is extended to the full width
of the joined table by inserting null values for the right-hand columns. Note that only the JOIN
clause's own condition is considered while deciding which rows have matches. Outer conditions
are applied afterwards.
Conversely, RIGHT OUTER JOIN returns all the joined rows, plus one row for each unmatched
right-hand row (extended with nulls on the left). This is just a notational convenience, since you
could convert it to a LEFT OUTER JOIN by switching the left and right tables.
FULL OUTER JOIN returns all the joined rows, plus one row for each unmatched left-hand row
(extended with nulls on the right), plus one row for each unmatched right-hand row (extended
with nulls on the left).
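For example (authors and books are hypothetical tables), a left outer join keeps every author even when no book matches:

-- Unmatched authors appear with NULL in b.title.
SELECT a.name, b.title
    FROM authors a
    LEFT OUTER JOIN books b ON b.author_id = a.id;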
ON join_condition

join_condition is an expression resulting in a value of type boolean (similar to a WHERE
clause) that specifies which rows in a join are considered to match.

USING ( join_column [, ...] )

A clause of the form USING ( a, b, ... ) is shorthand for ON left_table.a =
right_table.a AND left_table.b = right_table.b .... Also, USING implies that
only one of each pair of equivalent columns will be included in the join output, not both.

NATURAL
NATURAL is shorthand for a USING list that mentions all columns in the two tables that have
matching names. If there are no common column names, NATURAL is equivalent to ON TRUE.
LATERAL
The LATERAL key word can precede a sub-SELECT FROM item. This allows the sub-SELECT to
refer to columns of FROM items that appear before it in the FROM list. (Without LATERAL, each
sub-SELECT is evaluated independently and so cannot cross-reference any other FROM item.)
LATERAL can also precede a function-call FROM item, but in this case it is a noise word, because
the function expression can refer to earlier FROM items in any case.
A LATERAL item can appear at top level in the FROM list, or within a JOIN tree. In the latter
case it can also refer to any items that are on the left-hand side of a JOIN that it is on the right-
hand side of.
When a FROM item contains LATERAL cross-references, evaluation proceeds as follows: for each
row of the FROM item providing the cross-referenced column(s), or set of rows of multiple FROM
items providing the columns, the LATERAL item is evaluated using that row or row set's values
of the columns. The resulting row(s) are joined as usual with the rows they were computed from.
This is repeated for each row or set of rows from the column source table(s).
The column source table(s) must be INNER or LEFT joined to the LATERAL item, else there
would not be a well-defined set of rows from which to compute each set of rows for the LATERAL
item. Thus, although a construct such as X RIGHT JOIN LATERAL Y is syntactically valid,
it is not actually allowed for Y to reference X.
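For example, a LATERAL sub-SELECT can look up the most recent order for each customer (all table and column names here are hypothetical):

-- The sub-SELECT is re-evaluated for each row of customers;
-- LEFT JOIN ... ON true keeps customers that have no orders.
SELECT c.name, o.total
    FROM customers c
    LEFT JOIN LATERAL (
        SELECT total
        FROM orders
        WHERE orders.customer_id = c.id
        ORDER BY created_at DESC
        LIMIT 1
    ) o ON true;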
WHERE Clause
The optional WHERE clause has the general form
WHERE condition
where condition is any expression that evaluates to a result of type boolean. Any row that does
not satisfy this condition will be eliminated from the output. A row satisfies the condition if it returns
true when the actual row values are substituted for any variable references.
GROUP BY Clause
The optional GROUP BY clause has the general form

GROUP BY grouping_element [, ...]
GROUP BY will condense into a single row all selected rows that share the same values for the grouped
expressions. An expression used inside a grouping_element can be an input column name,
or the name or ordinal number of an output column (SELECT list item), or an arbitrary expression
formed from input-column values. In case of ambiguity, a GROUP BY name will be interpreted as an
input-column name rather than an output column name.
If any of GROUPING SETS, ROLLUP or CUBE are present as grouping elements, then the GROUP
BY clause as a whole defines some number of independent grouping sets. The effect of this is
equivalent to constructing a UNION ALL between subqueries with the individual grouping sets as
their GROUP BY clauses. For further details on the handling of grouping sets see Section 7.2.4.
Aggregate functions, if any are used, are computed across all rows making up each group, producing
a separate value for each group. (If there are aggregate functions but no GROUP BY clause, the query
is treated as having a single group comprising all the selected rows.) The set of rows fed to each
aggregate function can be further filtered by attaching a FILTER clause to the aggregate function call;
see Section 4.2.7 for more information. When a FILTER clause is present, only those rows matching
it are included in the input to that aggregate function.
When GROUP BY is present, or any aggregate functions are present, it is not valid for the SELECT list
expressions to refer to ungrouped columns except within aggregate functions or when the ungrouped
column is functionally dependent on the grouped columns, since there would otherwise be more than
one possible value to return for an ungrouped column. A functional dependency exists if the grouped
columns (or a subset thereof) are the primary key of the table containing the ungrouped column.
Keep in mind that all aggregate functions are evaluated before evaluating any “scalar” expressions in
the HAVING clause or SELECT list. This means that, for example, a CASE expression cannot be used
to skip evaluation of an aggregate function; see Section 4.2.14.
Currently, FOR NO KEY UPDATE, FOR UPDATE, FOR SHARE and FOR KEY SHARE cannot
be specified with GROUP BY.
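For example, a sketch over a hypothetical sales table; ROLLUP (brand, size) produces the grouping sets (brand, size), (brand), and ():

-- One row per (brand, size), one subtotal per brand, and one grand total.
SELECT brand, size, sum(amount)
    FROM sales
    GROUP BY ROLLUP (brand, size);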
HAVING Clause
The optional HAVING clause has the general form
HAVING condition
HAVING eliminates group rows that do not satisfy the condition. HAVING is different from WHERE:
WHERE filters individual rows before the application of GROUP BY, while HAVING filters group rows
created by GROUP BY. Each column referenced in condition must unambiguously reference a
grouping column, unless the reference appears within an aggregate function or the ungrouped column
is functionally dependent on the grouping columns.
The presence of HAVING turns a query into a grouped query even if there is no GROUP BY clause.
This is the same as what happens when the query contains aggregate functions but no GROUP BY
clause. All the selected rows are considered to form a single group, and the SELECT list and HAVING
clause can only reference table columns from within aggregate functions. Such a query will emit a
single row if the HAVING condition is true, zero rows if it is not true.
Currently, FOR NO KEY UPDATE, FOR UPDATE, FOR SHARE and FOR KEY SHARE cannot
be specified with HAVING.
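For example (the employees table and its columns are hypothetical):

-- WHERE filters individual rows before grouping; HAVING filters the resulting groups.
SELECT dept, avg(salary) AS avg_salary
    FROM employees
    WHERE hired_on >= date '2020-01-01'
    GROUP BY dept
    HAVING avg(salary) > 50000;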
WINDOW Clause
The optional WINDOW clause has the general form

WINDOW window_name AS ( window_definition ) [, ...]
where window_name is a name that can be referenced from OVER clauses or subsequent window
definitions, and window_definition is
[ existing_window_name ]
[ PARTITION BY expression [, ...] ]
[ ORDER BY expression [ ASC | DESC | USING operator ] [ NULLS
{ FIRST | LAST } ] [, ...] ]
[ frame_clause ]
The elements of the PARTITION BY list are interpreted in much the same fashion as elements of a
GROUP BY Clause, except that they are always simple expressions and never the name or number of
an output column. Another difference is that these expressions can contain aggregate function calls,
which are not allowed in a regular GROUP BY clause. They are allowed here because windowing
occurs after grouping and aggregation.
Similarly, the elements of the ORDER BY list are interpreted in much the same fashion as elements of
an ORDER BY Clause, except that the expressions are always taken as simple expressions and never
the name or number of an output column.
The optional frame_clause defines the window frame for window functions that depend on the
frame (not all do). The window frame is a set of related rows for each row of the query (called the
current row). The frame_clause can be one of

{ RANGE | ROWS | GROUPS } frame_start [ frame_exclusion ]
{ RANGE | ROWS | GROUPS } BETWEEN frame_start AND frame_end [ frame_exclusion ]

where frame_start and frame_end can be one of

UNBOUNDED PRECEDING
offset PRECEDING
CURRENT ROW
offset FOLLOWING
UNBOUNDED FOLLOWING

and frame_exclusion can be one of

EXCLUDE CURRENT ROW
EXCLUDE GROUP
EXCLUDE TIES
EXCLUDE NO OTHERS
If frame_end is omitted it defaults to CURRENT ROW. Restrictions are that frame_start can-
not be UNBOUNDED FOLLOWING, frame_end cannot be UNBOUNDED PRECEDING, and the
frame_end choice cannot appear earlier in the above list of frame_start and frame_end op-
tions than the frame_start choice does — for example RANGE BETWEEN CURRENT ROW AND
offset PRECEDING is not allowed.
The default framing option is RANGE UNBOUNDED PRECEDING, which is the same as RANGE
BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW; it sets the frame to be all rows from
the partition start up through the current row's last peer (a row that the window's ORDER BY clause
considers equivalent to the current row; all rows are peers if there is no ORDER BY). In general,
UNBOUNDED PRECEDING means that the frame starts with the first row of the partition, and similarly
UNBOUNDED FOLLOWING means that the frame ends with the last row of the partition, regardless of
RANGE, ROWS or GROUPS mode. In ROWS mode, CURRENT ROW means that the frame starts or ends
with the current row; but in RANGE or GROUPS mode it means that the frame starts or ends with the
current row's first or last peer in the ORDER BY ordering. The offset PRECEDING and offset
FOLLOWING options vary in meaning depending on the frame mode. In ROWS mode, the offset is
an integer indicating that the frame starts or ends that many rows before or after the current row. In
GROUPS mode, the offset is an integer indicating that the frame starts or ends that many peer groups
before or after the current row's peer group, where a peer group is a group of rows that are equivalent
according to the window's ORDER BY clause. In RANGE mode, use of an offset option requires
that there be exactly one ORDER BY column in the window definition. Then the frame contains
those rows whose ordering column value is no more than offset less than (for PRECEDING) or
more than (for FOLLOWING) the current row's ordering column value. In these cases the data type
of the offset expression depends on the data type of the ordering column. For numeric ordering
columns it is typically of the same type as the ordering column, but for datetime ordering columns
it is an interval. In all these cases, the value of the offset must be non-null and non-negative.
Also, while the offset does not have to be a simple constant, it cannot contain variables, aggregate
functions, or window functions.
The frame_exclusion option allows rows around the current row to be excluded from the frame,
even if they would be included according to the frame start and frame end options. EXCLUDE CUR-
RENT ROW excludes the current row from the frame. EXCLUDE GROUP excludes the current row and
its ordering peers from the frame. EXCLUDE TIES excludes any peers of the current row from the
frame, but not the current row itself. EXCLUDE NO OTHERS simply specifies explicitly the default
behavior of not excluding the current row or its peers.
Beware that the ROWS mode can produce unpredictable results if the ORDER BY ordering does not
order the rows uniquely. The RANGE and GROUPS modes are designed to ensure that rows that are
peers in the ORDER BY ordering are treated alike: all rows of a given peer group will be in the frame
or excluded from it.
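For example, a centered moving average over a hypothetical measurements table, leaving the current row itself out of each frame:

-- ROWS mode counts physical rows; EXCLUDE CURRENT ROW removes the row being computed.
SELECT recorded_at, reading,
       avg(reading) OVER (ORDER BY recorded_at
                          ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING
                          EXCLUDE CURRENT ROW) AS neighbor_avg
    FROM measurements;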
The purpose of a WINDOW clause is to specify the behavior of window functions appearing in the
query's SELECT List or ORDER BY Clause. These functions can reference the WINDOW clause entries
by name in their OVER clauses. A WINDOW clause entry does not have to be referenced anywhere,
however; if it is not used in the query it is simply ignored. It is possible to use window functions without
any WINDOW clause at all, since a window function call can specify its window definition directly
in its OVER clause. However, the WINDOW clause saves typing when the same window definition is
needed for more than one window function.
Currently, FOR NO KEY UPDATE, FOR UPDATE, FOR SHARE and FOR KEY SHARE cannot
be specified with WINDOW.
Window functions are described in detail in Section 3.5, Section 4.2.8, and Section 7.2.5.
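For example, one named window can be shared by several window functions (the empsalary table is hypothetical):

-- Both window functions reuse the definition of w from the WINDOW clause.
SELECT depname, salary,
       rank() OVER w,
       avg(salary) OVER w
    FROM empsalary
    WINDOW w AS (PARTITION BY depname ORDER BY salary DESC);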
SELECT List
The SELECT list (between the key words SELECT and FROM) specifies expressions that form the
output rows of the SELECT statement. The expressions can (and usually do) refer to columns com-
puted in the FROM clause.
Just as in a table, every output column of a SELECT has a name. In a simple SELECT this name is
just used to label the column for display, but when the SELECT is a sub-query of a larger query, the
name is seen by the larger query as the column name of the virtual table produced by the sub-query. To
specify the name to use for an output column, write AS output_name after the column's expression.
(You can omit AS, but only if the desired output name does not match any PostgreSQL keyword (see
Appendix C). For protection against possible future keyword additions, it is recommended that you
always either write AS or double-quote the output name.) If you do not specify a column name, a name
is chosen automatically by PostgreSQL. If the column's expression is a simple column reference then
the chosen name is the same as that column's name. In more complex cases a function or type name
may be used, or the system may fall back on a generated name such as ?column?.
An output column's name can be used to refer to the column's value in ORDER BY and GROUP BY
clauses, but not in the WHERE or HAVING clauses; there you must write out the expression instead.
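For example (payments is a hypothetical table):

-- The output name "half" can be used in ORDER BY, but WHERE must repeat the expression.
SELECT amount / 2 AS half
    FROM payments
    WHERE amount / 2 > 10
    ORDER BY half;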
Instead of an expression, * can be written in the output list as a shorthand for all the columns of the
selected rows. Also, you can write table_name.* as a shorthand for the columns coming from just
that table. In these cases it is not possible to specify new names with AS; the output column names
will be the same as the table columns' names.
According to the SQL standard, the expressions in the output list should be computed before applying
DISTINCT, ORDER BY, or LIMIT. This is obviously necessary when using DISTINCT, since oth-
erwise it's not clear what values are being made distinct. However, in many cases it is convenient if
output expressions are computed after ORDER BY and LIMIT; particularly if the output list contains
any volatile or expensive functions. With that behavior, the order of function evaluations is more in-
tuitive and there will not be evaluations corresponding to rows that never appear in the output. Post-
greSQL will effectively evaluate output expressions after sorting and limiting, so long as those expres-
sions are not referenced in DISTINCT, ORDER BY or GROUP BY. (As a counterexample, SELECT
f(x) FROM tab ORDER BY 1 clearly must evaluate f(x) before sorting.) Output expressions
that contain set-returning functions are effectively evaluated after sorting and before limiting, so that
LIMIT will act to cut off the output from a set-returning function.
Note
PostgreSQL versions before 9.6 did not provide any guarantees about the timing of evaluation
of output expressions versus sorting and limiting; it depended on the form of the chosen query
plan.
DISTINCT Clause
If SELECT DISTINCT is specified, all duplicate rows are removed from the result set (one row is
kept from each group of duplicates). SELECT ALL specifies the opposite: all rows are kept; that is
the default.
SELECT DISTINCT ON ( expression [, ...] ) keeps only the first row of each set of rows
where the given expressions evaluate to equal. The DISTINCT ON expressions are interpreted using
the same rules as for ORDER BY (see above). Note that the “first row” of each set is unpredictable
unless ORDER BY is used to ensure that the desired row appears first. For example:

SELECT DISTINCT ON (location) location, time, report
    FROM weather_reports
    ORDER BY location, time DESC;
retrieves the most recent weather report for each location. But if we had not used ORDER BY to force
descending order of time values for each location, we'd have gotten a report from an unpredictable
time for each location.
The DISTINCT ON expression(s) must match the leftmost ORDER BY expression(s). The ORDER
BY clause will normally contain additional expression(s) that determine the desired precedence of
rows within each DISTINCT ON group.
Currently, FOR NO KEY UPDATE, FOR UPDATE, FOR SHARE and FOR KEY SHARE cannot
be specified with DISTINCT.
UNION Clause
The UNION clause has this general form:

select_statement UNION [ ALL | DISTINCT ] select_statement
select_statement is any SELECT statement without an ORDER BY, LIMIT, FOR NO KEY
UPDATE, FOR UPDATE, FOR SHARE, or FOR KEY SHARE clause. (ORDER BY and LIMIT can
be attached to a subexpression if it is enclosed in parentheses. Without parentheses, these clauses will
be taken to apply to the result of the UNION, not to its right-hand input expression.)
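For example (both table names are hypothetical), parentheses confine ORDER BY and LIMIT to one arm of the UNION:

-- Without the parentheses, the ORDER BY and LIMIT would apply to the whole UNION result.
(SELECT name FROM new_customers ORDER BY signed_up DESC LIMIT 5)
UNION
SELECT name FROM vip_customers;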