Query Optimization

A customer recently reported slow performance in their development environments and was understandably worried about the impact that they’d see if they went to production with poor query performance.
We got right to work to help them out, and our first stone to turn over was to have them send us their EXPLAIN ANALYZE output for the query. The output showed that the query was wrapping info -> 'dept' in a function called jsonb_array_elements(), which led the query planner to think that it shouldn’t use the index. The fix was simple, and we were able to get the customer back on their way after a rather quick adjustment to their query. Once the customer changed their query to the following, the index started getting scanned:
postgres=# SELECT * FROM org
where 'aa'::text IN (info -> 'dept' ->> 'name');

postgres=# explain SELECT * FROM org
where 'aa'::text IN (info -> 'dept' ->> 'name');
                                 QUERY PLAN
----------------------------------------------------------------------------
 Index Scan using idx_org_dept on org  (cost=0.42..8.44 rows=1 width=1169)
   Index Cond: ('aa'::text = ((info -> 'dept'::text) ->> 'name'::text))
(2 rows)
As we can see, having and using EXPLAIN in your troubleshooting arsenal can be invaluable.
What is EXPLAIN?
EXPLAIN is a keyword that gets prepended to a query to show a user how the query planner plans to execute the given query. Depending on the complexity of the query, it will show the join strategy, the method of extracting data from tables, the estimated number of rows involved in executing the query, and a number of other useful bits of information. Used with ANALYZE, EXPLAIN will also show the time spent executing the query, any sorts and merges that couldn’t be done in-memory, and more. This information is invaluable when it comes to identifying query performance bottlenecks and opportunities, and it helps us understand what information the query planner is working with as it makes its decisions for us.
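In its simplest forms, it looks like this (standard PostgreSQL syntax; substitute your own query for the ellipses):

postgres=# EXPLAIN SELECT ...;
postgres=# EXPLAIN ANALYZE SELECT ...;
postgres=# EXPLAIN (ANALYZE, BUFFERS) SELECT ...;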
A Cost-Based Approach
To the query planner, all the data on disk is basically the same. Determining the fastest way to reach a particular piece of data requires estimating the amount of time it takes to do a full table scan, a merge of two tables, and other operations to get data back to the user. PostgreSQL accomplishes this by assigning costs to each execution task, and these values are derived from the postgresql.conf file (see parameters ending in *_cost or beginning with enable_*). When a query is sent to the database, the query planner calculates the cumulative costs for different execution strategies and selects the optimal plan (which may not necessarily be the one with the lowest cost).
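To see the cost constants and planner toggles in play on your own system, you can query pg_settings (the exact list and values will vary by version and configuration):

postgres=# SELECT name, setting FROM pg_settings
           WHERE name LIKE '%_cost' OR name LIKE 'enable_%';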
$ pgbench -i && psql
<...>
postgres=# EXPLAIN SELECT * FROM pgbench_accounts a JOIN pgbench_branches b ON (a.bid=b.bid) WHERE a.aid < 100000;
                                  QUERY PLAN
--------------------------------------------------------------------------------
 Nested Loop  (cost=0.00..4141.00 rows=99999 width=461)
   Join Filter: (a.bid = b.bid)
   ->  Seq Scan on pgbench_branches b  (cost=0.00..1.01 rows=1 width=364)
   ->  Seq Scan on pgbench_accounts a  (cost=0.00..2890.00 rows=99999 width=97)
         Filter: (aid < 100000)
(5 rows)
Here, we see that the Seq Scan on pgbench_accounts has a cost of 2890 to execute the task. Where does this value come from? If we look at some settings and do the calculations, we find:
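A sketch of the arithmetic, using PostgreSQL’s documented cost formula for a sequential scan with a filter and the default cost settings (the relpages and reltuples values below come from a scale-1 pgbench database and may differ slightly on your system):

postgres=# SELECT relpages, reltuples FROM pg_class WHERE relname = 'pgbench_accounts';
 relpages | reltuples
----------+-----------
     1640 |    100000

-- cost = (relpages * seq_page_cost) + (reltuples * cpu_tuple_cost) + (reltuples * cpu_operator_cost)
--      = (1640 * 1.0) + (100000 * 0.01) + (100000 * 0.0025)
--      = 1640 + 1000 + 250
--      = 2890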
A Note About Statistics
The query planner calculates costs based on statistics stored in pg_statistic (don’t look there--there’s nothing human-readable in it; if you want visibility into the table and row statistics, try looking at pg_stats). If any of these internal statistics are off (e.g., a bloated table, or too many joins causing the Genetic Query Optimizer to kick in), a sub-optimal plan may be selected, leading to poor query performance. Having bad statistics isn’t necessarily a problem--the statistics aren’t always updated in real-time, and much of it depends on PostgreSQL’s internal maintenance. As such, it’s imperative that database maintenance is conducted regularly--this means frequent VACUUM-ing and ANALYZE-ing. Without good statistics, you could end up with something like this:
postgres=# EXPLAIN SELECT * FROM pgbench_history WHERE aid < 100;
                               QUERY PLAN
------------------------------------------------------------------------
 Seq Scan on pgbench_history  (cost=0.00..2346.00 rows=35360 width=50)
   Filter: (aid < 100)
In the example above, the database had gone through a fair amount of activity, and the statistics were inaccurate. With an ANALYZE (not VACUUM ANALYZE or EXPLAIN ANALYZE, but just a plain ANALYZE), the statistics are fixed, and the query planner now chooses an Index Scan:
postgres=# EXPLAIN SELECT * FROM pgbench_history WHERE aid < 100;
                                   QUERY PLAN
---------------------------------------------------------------------------------
 Index Scan using foo on pgbench_history  (cost=0.42..579.09 rows=153 width=50)
   Index Cond: (aid < 100)
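If you’re curious what the planner now believes about the column, pg_stats exposes the refreshed statistics in readable form (an illustrative query; these columns are part of the standard view):

postgres=# SELECT attname, n_distinct, most_common_vals
           FROM pg_stats
           WHERE tablename = 'pgbench_history' AND attname = 'aid';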
How Does EXPLAIN ANALYZE Help?
When an EXPLAIN is prepended to a query, the query plan gets printed, but the query does not get run. We won’t know whether the statistics stored in the database were correct or not, and we won’t know if some operations required expensive I/O instead of fully running in memory. When used with ANALYZE, the query is actually run, and the query plan, along with some under-the-hood activity, is printed out.
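Because ANALYZE really executes the query, be careful with statements that modify data; a common trick is to wrap them in a transaction that you roll back (standard SQL, shown here against a hypothetical accounts table):

postgres=# BEGIN;
postgres=# EXPLAIN ANALYZE UPDATE accounts SET active = false WHERE id < 100;
postgres=# ROLLBACK;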
If we look at the first query above and run EXPLAIN ANALYZE instead of a plain EXPLAIN, we get:
postgres=# EXPLAIN ANALYZE SELECT * FROM pgbench_accounts a JOIN pgbench_branches b ON (a.bid=b.bid) WHERE a.aid < 100000;
                                                      QUERY PLAN
----------------------------------------------------------------------------------------------------------------------
 Nested Loop  (cost=0.00..4141.00 rows=99999 width=461) (actual time=0.039..56.582 rows=99999 loops=1)
   Join Filter: (a.bid = b.bid)
   ->  Seq Scan on pgbench_branches b  (cost=0.00..1.01 rows=1 width=364) (actual time=0.025..0.026 rows=1 loops=1)
   ->  Seq Scan on pgbench_accounts a  (cost=0.00..2890.00 rows=99999 width=97) (actual time=0.008..25.752 rows=99999 loops=1)
         Filter: (aid < 100000)
         Rows Removed by Filter: 1
 Planning Time: 0.306 ms
 Execution Time: 61.031 ms
(8 rows)
You’ll notice here that there’s more information -- actual time and rows, as well as planning and execution times. If we add BUFFERS, like EXPLAIN (ANALYZE, BUFFERS), we’ll even get cache hit/miss statistics in the output:
                                                      QUERY PLAN
----------------------------------------------------------------------------------------------------------------------
 Nested Loop  (cost=0.00..4141.00 rows=99999 width=461) (actual time=0.039..56.582 rows=99999 loops=1)
   Join Filter: (a.bid = b.bid)
   Buffers: shared hit=3 read=1638
   ->  Seq Scan on pgbench_branches b  (cost=0.00..1.01 rows=1 width=364) (actual time=0.025..0.026 rows=1 loops=1)
         Buffers: shared hit=1
   ->  Seq Scan on pgbench_accounts a  (cost=0.00..2890.00 rows=99999 width=97) (actual time=0.008..25.752 rows=99999 loops=1)
         Filter: (aid < 100000)
         Rows Removed by Filter: 1
         Buffers: shared hit=2 read=1638
 Planning Time: 0.306 ms
 Execution Time: 61.031 ms
(8 rows)
Very quickly, you can see that EXPLAIN can be a useful tool for people looking to understand their database performance behaviors.
A Quick Review of Scan Types and Joins
It’s important to know that every join type and scan type has its time and place. Some people see the word “Sequential” in a scan and immediately jump back in fear, without considering whether it would be worthwhile to access the data another way. Take, for example, a table with 2 rows -- it would not make sense to the query planner to scan the index, then go back and retrieve the data from disk, when it could just quickly scan the table and pull the data out without touching the index. In this case, and in the case of most other small-ish tables, it is more efficient to do a sequential scan. To quickly review the join and scan types that PostgreSQL works with:
• Scan Types
  • Sequential Scan
    • Basically a brute-force retrieval from disk
    • Scans the whole table
    • Fast for small tables
  • Index Scan
    • Scans all/some rows in an index; looks up the rows in the heap
    • Causes random seeks, which can be costly for old-school spindle-based disks
    • Faster than a Sequential Scan when extracting a small number of rows from large tables
  • Index Only Scan
    • Scans all/some rows in the index
    • No need to look up rows in the table, because the values we want are already stored in the index itself
  • Bitmap Heap Scan
    • Scans the index, building a bitmap of pages to visit
    • Then looks up only the relevant pages in the table for the desired rows
• Join Types
  • Nested Loop
    • For each row in the outer table, scan for matching rows in the inner table
    • Fast to start; best for small tables
  • Merge Join
    • Zipper-operation on sorted data sets
    • Good for large tables
    • High startup cost if an additional sort is required
  • Hash Join
    • Builds a hash of inner table values, scans the outer table for matches
    • Only usable for equality conditions
    • High startup cost, but fast execution
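To watch the planner switch between these strategies, you can toggle the enable_* parameters mentioned earlier (a session-level experiment only, not something to do in production; the exact plans depend on your data):

postgres=# SET enable_nestloop = off;
postgres=# EXPLAIN SELECT * FROM pgbench_accounts a JOIN pgbench_branches b ON (a.bid = b.bid);
-- with Nested Loop disabled, the planner typically falls back to a Hash Join here
postgres=# RESET enable_nestloop;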
As we can see, every scan type and join type has its place. What’s most important is that the query planner has good statistics to work with, as mentioned earlier.
Often, a quick look at an EXPLAIN plan will expose the problem and give an idea of how to solve it. At EDB Support, we’ve seen many situations where EXPLAIN could help identify things like:
• Inaccurate statistics leading to poor join/scan choices
• Maintenance activity (VACUUM and ANALYZE) not aggressive enough
• Corrupted indexes requiring a REINDEX
• Index definition vs. query mismatch
• work_mem being set too low, preventing in-memory sorts and joins (see the sketch after this list)
• Poor performance due to the join order written into a query
• Improper ORM configuration
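The work_mem case is easy to demonstrate for yourself (a sketch; the sort column is illustrative and the exact output varies by system):

postgres=# SET work_mem = '64kB';
postgres=# EXPLAIN ANALYZE SELECT * FROM pgbench_accounts ORDER BY filler;
-- the plan will report something like "Sort Method: external merge  Disk: ..."
postgres=# SET work_mem = '64MB';
postgres=# EXPLAIN ANALYZE SELECT * FROM pgbench_accounts ORDER BY filler;
-- with enough memory, the same sort runs in memory: "Sort Method: quicksort  Memory: ..."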
EXPLAIN is certainly one of the most valuable tools for anyone working with PostgreSQL, and using it well will save you lots of time!
Query optimization in PostgreSQL involves techniques and strategies to make SQL queries run faster and more efficiently. Here are key aspects and best practices for optimizing queries:
1. Understanding the Query Execution Plan
• EXPLAIN and EXPLAIN ANALYZE: Analyze how PostgreSQL executes a query.

EXPLAIN SELECT * FROM orders WHERE order_date = '2023-01-01';
EXPLAIN ANALYZE SELECT * FROM orders WHERE order_date = '2023-01-01';
2. Using Indexes Effectively
• Single-Column Indexes: Index columns that appear frequently in WHERE clauses.

CREATE INDEX idx_orders_order_date ON orders(order_date);
• Composite Indexes: Index multiple columns used together in queries.
CREATE INDEX idx_orders_customer_date ON orders(customer_id, order_date);
• Partial Indexes: Index a subset of rows based on a condition.
CREATE INDEX idx_active_orders ON orders (order_date) WHERE status = 'active';
• Expression Indexes: Index expressions instead of raw columns.
CREATE INDEX idx_lower_email ON users (LOWER(email));
3. Optimizing Query Design
• Selecting Only Needed Columns (avoid SELECT *): Retrieve just the columns you need.

SELECT id, name FROM users WHERE active = true;
• Filtering Early: Use WHERE clauses to filter data as early as possible.
SELECT id, name FROM users WHERE active = true AND created_at > '2023-01-01';
• Using JOINs Effectively: Ensure proper use of JOINs and indexes on join columns.

SELECT o.id, c.name FROM orders o JOIN customers c ON o.customer_id = c.id WHERE o.order_date > '2023-01-01';
• Avoiding Functions in WHERE Clauses: Avoid using functions on indexed columns in WHERE clauses (or create an expression index, as shown above, so the function form can use an index).

SELECT * FROM users WHERE LOWER(email) = 'user@example.com'; -- not good if email is indexed
SELECT * FROM users WHERE email = 'user@example.com'; -- better
4. Using Subqueries and CTEs
• Common Table Expressions (CTEs): Use CTEs for better readability in complex queries.

WITH recent_orders AS (
  SELECT id, customer_id FROM orders WHERE order_date > '2023-01-01'
)
SELECT ro.id, c.name
FROM recent_orders ro
JOIN customers c ON ro.customer_id = c.id;
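Note that since PostgreSQL 12, a CTE like this is inlined into the main query by default; the MATERIALIZED keyword forces the older fence-like behavior when you actually want the CTE computed once:

WITH recent_orders AS MATERIALIZED (
  SELECT id, customer_id FROM orders WHERE order_date > '2023-01-01'
)
SELECT ro.id, c.name FROM recent_orders ro JOIN customers c ON ro.customer_id = c.id;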
• Subqueries: Use subqueries where appropriate, but beware of performance implications.

SELECT name FROM customers WHERE id IN (SELECT customer_id FROM orders WHERE order_date > '2023-01-01');
5. Optimizing Joins
• Proper Indexing: Ensure indexed columns are used in joins.

SELECT o.id, c.name FROM orders o JOIN customers c ON o.customer_id = c.id;

• Choosing the Right Join Type: Use INNER JOIN, LEFT JOIN, etc., appropriately.

SELECT o.id, c.name FROM orders o INNER JOIN customers c ON o.customer_id = c.id;
6. Optimizing Sorting and Grouping
• Indexes on Sorting Columns: Index columns used in ORDER BY clauses.

CREATE INDEX idx_orders_date ON orders(order_date);
SELECT * FROM orders ORDER BY order_date;

• GROUP BY Optimization: Ensure proper use of indexes and avoid unnecessary grouping.

SELECT customer_id, COUNT(*) FROM orders GROUP BY customer_id;
7. Optimizing Updates and Deletes
• Batch Updates and Deletes: Perform updates and deletes in batches to avoid long transactions. (PostgreSQL's DELETE has no LIMIT clause, so batching is done through a subquery.)

DELETE FROM orders WHERE id IN (
  SELECT id FROM orders WHERE order_date < '2020-01-01' LIMIT 1000
);
• Using WHERE Clauses: Ensure WHERE clauses are used to limit affected rows.

UPDATE users SET active = false WHERE last_login < '2023-01-01';
8. Using Table Partitioning
• Partition Large Tables: Partition tables by range, list, or hash to improve performance.

CREATE TABLE orders (
  id SERIAL,
  order_date DATE,
  amount NUMERIC,
  PRIMARY KEY (id, order_date)  -- a primary key on a partitioned table must include the partition key
) PARTITION BY RANGE (order_date);
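A partitioned parent holds no rows itself, so at least one child partition is needed before inserting (the bounds below are illustrative):

CREATE TABLE orders_2023 PARTITION OF orders
  FOR VALUES FROM ('2023-01-01') TO ('2024-01-01');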
9. Regular Maintenance
• ANALYZE: Keep planner statistics fresh; autovacuum usually handles this, but it can be run manually after large data changes.

ANALYZE;
10. Managing Configuration Parameters
• shared_buffers: Amount of memory dedicated to PostgreSQL for caching data.

shared_buffers = 25% of system memory

• work_mem: Memory used for internal sort operations and hash tables.

work_mem = 4MB (adjust based on workload)

• maintenance_work_mem: Memory used for maintenance tasks like VACUUM and CREATE INDEX.

maintenance_work_mem = 64MB (or higher for large databases)
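These can be inspected and changed from a superuser session rather than by editing postgresql.conf directly (standard commands; pick values appropriate to your hardware):

SHOW work_mem;
ALTER SYSTEM SET work_mem = '16MB';  -- persists to postgresql.auto.conf
SELECT pg_reload_conf();             -- reloads settings that don't require a restart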
11. Utilizing Extensions
• pg_stat_statements: Track and analyze the performance of executed queries (requires shared_preload_libraries = 'pg_stat_statements' and a server restart).

CREATE EXTENSION pg_stat_statements;
SELECT query, total_exec_time, calls FROM pg_stat_statements ORDER BY total_exec_time DESC LIMIT 10;
-- on PostgreSQL 12 and earlier, the column is named total_time
12. Reducing Network Latency
• Proximity: Host your database servers close to your application servers to minimize network latency.
Summary
Query optimization in PostgreSQL involves analyzing execution plans, using indexes effectively, designing efficient queries, managing database configuration, and regularly maintaining and monitoring the database. By following these strategies, you can significantly improve the performance of your PostgreSQL queries.