CENG301 DBMS - Session 12
Asst. Prof. Mustafa YENIAD
[email protected]
PostgreSQL Vacuum
• The term "vacuum" refers to a database maintenance procedure that reclaims storage space
and optimizes database performance.
• Whenever data is inserted, updated, or deleted in a PostgreSQL database, it can generate "dead tuples"
- row versions that have become obsolete and are no longer visible to any transaction.
• In the early 2000s, the developers of PostgreSQL identified a major drawback in the design of their
relational database management system with respect to storage space and transaction speed:
• It turned out that the UPDATE query was becoming an expensive routine.
• UPDATE duplicated the old row and wrote the new data into the copy, which meant that the size of the
database tables was not bounded by any limit! Additionally, deleting a row only MARKED the row as
deleted while the actual data remained untouched (which, incidentally, is what supports data forensics).
• This may sound familiar, since it is what present-day file systems and data recovery software rely on:
data, when deleted, remains intact on the magnetic disk in its raw form, but is hidden from the interface.
However, keeping old row versions was also necessary for older, still-running transactions, so technically
it wasn't right to compromise on transactional integrity. This being sufficient stimulus, the Postgres team
soon introduced the 'vacuum' feature, which literally vacuumed up the deleted rows. However, this was a
manual process and, due to the several parameters involved, it wasn't convenient. Hence, autovacuum
was developed.
• Remember: UPDATE in PostgreSQL performs an insert and a delete. Hence, all the records being
UPDATED have been deleted and inserted back with the new value.
• As mentioned above, every such record that has been deleted but is still taking some space is called a
dead tuple.
• Once there is no dependency on those dead tuples with the already running transactions, the dead tuples
are no longer needed.
• Thus, PostgreSQL runs VACUUM on such tables. VACUUM reclaims the storage occupied by these dead
tuples.
• The VACUUM operation in PostgreSQL identifies and eliminates these dead tuples, freeing up disk space
for future operations.
• In a large-scale datacenter, the tables in the Application Server PostgreSQL database can grow quite large.
Performance can degrade significantly if stale and temporary data are not systematically removed.
• Vacuuming cleans up stale or temporary data in a table, and analyzing refreshes its knowledge of all the
tables for the query planner.
• The "autovacuum_vacuum_threshold" parameter in PostgreSQL determines the minimum number of
updated or deleted tuples required in a table before the autovacuum process is triggered.
• The default is 50 tuples, meaning that if 50 or more tuples are modified in a table, autovacuum will be
triggered. However, you can adjust this value in the PostgreSQL configuration file (postgresql.conf) or by
changing table storage parameters.
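For example, the threshold (and the related scale factor) can be overridden for a single busy table via storage parameters; a sketch assuming a hypothetical orders table:
postgres=# ALTER TABLE orders SET (autovacuum_vacuum_threshold = 1000, autovacuum_vacuum_scale_factor = 0.05);
Such per-table settings take precedence over the global values in postgresql.conf.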
• The default auto-vacuum analyze and vacuum settings are sufficient for a small deployment, but the
percentage thresholds take longer to trigger as the tables grow larger. Performance can degrade significantly
before the auto-vacuum vacuuming and analyzing occurs.
• Autovacuum is one of the background utility processes that starts automatically when you start PostgreSQL.
• To confirm whether the autovacuum daemon is running on Linux, use the command below:
$ ps aux|grep autovacuum|grep -v grep
• Alternatively, the SQL query below can be used to check the status of the autovacuum in the pg_settings:
$ sudo --login -u postgres # switch to the postgres account
postgres@[hostname]:~ $ psql # then access the PostgreSQL prompt immediately
postgres=# SELECT name, setting FROM pg_settings WHERE name LIKE '%autovacuum%';
• The VACUUM command will reclaim storage space occupied by dead tuples.
• VACUUM can be run on its own, or together with the ANALYZE command.
• In the examples below, [tablename] is optional. Without a table specified, VACUUM is run on ALL tables in
the current database that the user has permission to vacuum.
• Plain VACUUM: Frees up space for re-use.
postgres=# VACUUM [tablename];
• Full VACUUM: Locks the database table, and reclaims more space than a plain VACUUM.
postgres=# VACUUM(FULL) [tablename];
• A plain VACUUM (without FULL) simply reclaims space and makes it available for re-use. This form of the command
can operate in parallel with normal reading and writing of the table, as an exclusive lock is not obtained. However,
the extra space is not returned to the operating system (in most cases); it is merely made available for re-use within
the same table.
• VACUUM(FULL) rewrites the entire contents of the table into a new disk file with no extra space, allowing unused
space to be returned to the operating system. This form is much slower and requires an ACCESS EXCLUSIVE lock on
each table while it is being processed, so usage of the table is blocked until it completes. We may consider this an
outage of the table.
• VACUUM(FULL) is useful when a particular table is full of dead rows and not expected to become that big again.
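As a quick sanity check, the on-disk size of a table can be compared before and after the operation; a sketch assuming a hypothetical products table bloated by many deletes:
postgres=# SELECT pg_size_pretty(pg_total_relation_size('products'));
postgres=# VACUUM (FULL) products;
postgres=# SELECT pg_size_pretty(pg_total_relation_size('products'));
If the table held many dead rows, the second size reading should be noticeably smaller.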
• Full VACUUM and ANALYZE: Performs a Full VACUUM and gathers new statistics on query execution paths using ANALYZE:
postgres=# VACUUM(FULL, ANALYZE) [tablename];
• Verbose Full VACUUM and ANALYZE: Performs a Full VACUUM and gathers new statistics on query execution paths using
ANALYZE, with verbose progress output:
postgres=# VACUUM(FULL, ANALYZE, VERBOSE) [tablename];
• ANALYZE gathers statistics for the query planner to create the most efficient query execution paths. Per PostgreSQL
documentation, accurate statistics will help the planner to choose the most appropriate query plan, and thereby improve the
speed of query processing:
postgres=# ANALYZE [tablename];
PostgreSQL Index
• An index is a sorted structure with a pointer from each entry to the corresponding record in the original table where the data is actually stored.
Fundamentally, a database index is a strategically designed data structure that enhances the speed of data retrieval operations on a database table.
• PostgreSQL indexes are effective tools to enhance database performance. Indexes help the database server find specific rows much faster than it
could do without indexes.
• Indexes are special lookup tables that the database search engine can use to speed up data retrieval. Simply put, an index is a pointer to data in a
table. An index in a database is very similar to an index in the back of a book. For example, if you want to reference all pages in a book that
discusses a certain topic, you have to first refer to the index, which lists all topics alphabetically and then refer to one or more specific page
numbers.
• An index helps to speed up SELECT queries and WHERE clauses; however, it slows down data modification with UPDATE and INSERT statements.
Indexes can be created or dropped with no effect on the data. Adding an index can improve query time from minutes to milliseconds.
• Keep in mind: Indexes add write and storage overheads to the database system. Therefore, using them appropriately is very important! This
means you shouldn't create indexes unnecessarily.
• Before moving on you should know that indexes are not perfect and can also be a query performance killer.
• While indexes are good at speeding up the time it takes to find data, they can actually slow down UPDATE, INSERT, or DELETE queries
because the indexes must be updated whenever the table changes.
• As a rule, if your tables are constantly modified with frequent INSERTs and UPDATEs (more often than you read the data), then indexes will
cause performance degradation.
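One way to weigh this trade-off is to compare query plans before and after creating an index; a minimal sketch, assuming a large hypothetical products table:
postgres=# EXPLAIN ANALYZE SELECT * FROM products WHERE product_id = 42; -- sequential scan expected
postgres=# CREATE INDEX index_product_id ON products (product_id);
postgres=# EXPLAIN ANALYZE SELECT * FROM products WHERE product_id = 42; -- index scan expected
EXPLAIN ANALYZE actually executes the query and reports the real runtime, so it makes the speedup (or the lack of one) directly visible.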
• Advantages of Indexes in PostgreSQL
Leveraging indexes in PostgreSQL provides several key advantages:
• Rapid data access: Indexes are instrumental in drastically slashing the time needed to retrieve data, particularly from large tables. Without
indexes, a complete table scan would be required, which can be quite time-consuming.
• Boosted query efficiency: Queries that include conditions in the WHERE clause or require table joins see marked improvements in
performance with indexing. Such queries leverage indexes to quickly pinpoint rows that fulfill the set criteria.
• Reduced disk I/O: As indexes hold a subset of the table's data, disk I/O operations are significantly lessened. This not only speeds up query
execution but also lightens the load on the storage system.
• Data integrity maintenance: Unique indexes act as safeguards against duplicate values within specific columns, thereby maintaining data
integrity by ensuring no two rows have identical values in the designated columns.
Despite PostgreSQL indexes offering remarkable benefits in query performance, it's crucial to judiciously balance these advantages against
potential drawbacks, especially in situations where storage efficiency and write performance are key considerations.
PostgreSQL Index - Common PostgreSQL Index Types
1. B-Tree indexes
The B-tree (Balanced Tree) index stands as the default and most widely employed index type within PostgreSQL. When an
indexed column participates in a comparison employing any of these operators: <, <=, =, >=, >, the query planner will consider
employing a B-tree index.
B-tree indexes can also be effectively utilized with operators like BETWEEN and IN. Furthermore, an IS NULL or IS NOT NULL
condition on an indexed column can be combined with a B-tree index. By default, B-tree indexes organize their entries in
ascending order, with null last. If you wish to gain more insights into index ordering and its potential advantages in specific
scenarios, you can refer to the following link: https://www.postgresql.org/docs/current/indexes-ordering.html
In the following example, a B-tree index is created on the product_id column of the products table. This index enhances the
speed of queries that seek specific product_id values or product_id ranges.
postgres=# CREATE INDEX index_product_id ON products (product_id);
2. Hash indexes
Hash indexes are ideal for equality-based lookups, but they don't support range queries or sorting. They work well with data
types like integers and are usually faster than B-tree indexes for equality checks.
postgres=# CREATE INDEX index_product_id ON products USING HASH (product_id);
3. Composite indexes
A multicolumn index is defined on more than one column of a table.
postgres=# CREATE INDEX index_product_id_name ON products (product_id, product_name);
It is presumed that the product_id column is heavily utilized when retrieving data from the products table. Therefore, we are
giving precedence to the product_id column over the product_name column.
When deciding whether to create a single-column index or a multicolumn index, it's essential to consider the column or
columns that are frequently used in a query's WHERE clause as filter conditions.
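For instance, assuming the composite index above, a filter on the leading column alone can still use the index, while a filter on product_name alone generally cannot; EXPLAIN makes this visible:
postgres=# EXPLAIN SELECT * FROM products WHERE product_id = 1; -- can use the composite index
postgres=# EXPLAIN SELECT * FROM products WHERE product_name = 'Apple'; -- usually cannot use it efficiently
This is why the most frequently filtered column should come first in a multicolumn index.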
4. Partial indexes
A partial index is an index built over a subset of a table; the subset is defined by a conditional expression (called the predicate
of the partial index). The index contains entries only for those table rows that satisfy the predicate.
postgres=# CREATE INDEX index_product_id ON products (product_id) WHERE product_available = 'true';
One common use case for partial indexes is to filter out rows that are irrelevant for most queries. For example, suppose you
have a table of products with a column called product_available, which can be 'true' or 'false'. Most queries only need to
access the available products, so you can create a partial index on that column, where product_available = 'true'.
5. Covering indexes
A covering index allows a user to perform an index-only scan if the select list in the query matches the columns included
in the index. Additional columns can be specified using the INCLUDE keyword.
postgres=# CREATE INDEX index_product_id_name_status ON products (product_id, product_name)
INCLUDE (status);
All the columns specified in the SELECT clause will be retrieved directly from the index pages. In theory, this can significantly
reduce the amount of I/O (input/output) your query requires to access information.
Traditionally, I/O operations represent a significant bottleneck in database systems, so by avoiding access to the heap (table
data) and minimizing the need for multiple I/O operations, PostgreSQL can enhance query performance.
However, it's important to exercise caution when implementing covering indexes. Each column added to the index still takes up
space on disk, and there is an associated cost for maintaining the index, especially when it comes to row updates.
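To verify that an index-only scan is actually chosen (assuming the covering index above and a recently vacuumed table), EXPLAIN can be used:
postgres=# EXPLAIN SELECT product_id, product_name, status FROM products WHERE product_id = 1;
If the plan shows "Index Only Scan", the query is being answered from the index pages without touching the heap.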
6. Block Range Index (BRIN)
BRIN indexes are designed for large tables with sorted data, such as time-series data. They divide the table into logical blocks
where each block contains a range of values. Instead of storing individual index entries for each row, the index stores the
minimum and maximum values within each block, making them smaller in size compared to other index types.
postgres=# CREATE INDEX brin_example_index ON logs USING BRIN (log_date); -- index size, e.g., 521 kB
7. Other indexes
For complex data types like arrays or geometric shapes, use GiST, GIN, or SP-GIST indexes.
PostgreSQL Index - Drop or List Indexes
• Sometimes, you may want to remove an existing index from the database system. To do it, you use the DROP
INDEX statement as follows:
postgres=# DROP INDEX index_name;
• PostgreSQL does not provide a command like SHOW INDEXES to list the index information of a table or database.
However, it does provide you with access to the pg_indexes view so that you can query the index information.
• The following statement lists all indexes of the schema public in the current database:
postgres=# SELECT tablename, indexname FROM pg_indexes WHERE schemaname = 'public';
• Also you can use the \d meta-command to view the index information for a table.
postgres=# \d table_name;
PostgreSQL Reindex
• The REINDEX command rebuilds one or more indices, replacing the previous version of the index. REINDEX can be
used in many scenarios, including the following (from PostgreSQL documentation):
• An index has become corrupted, and no longer contains valid data. Although in theory, this should never happen, in practice
indexes can become corrupted due to software bugs or hardware failures. REINDEX provides a recovery method.
• An index has become "bloated", that is, it contains many empty or nearly-empty pages. This can occur with B-tree indexes in
PostgreSQL under certain uncommon access patterns. REINDEX provides a way to reduce the space consumption of the index by
writing a new version of the index without the dead pages.
• A storage parameter (such as fillfactor) has been changed for an index, and one wishes to ensure that the change has taken full effect.
• An index build with the CONCURRENTLY option failed, leaving an "invalid" index. Such indexes are useless, but it can be convenient
to use REINDEX to rebuild them. Note that REINDEX will not perform a concurrent build. To build the index without interfering with
production, it is necessary to drop the index and reissue the CREATE INDEX CONCURRENTLY command.
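The command itself can target a single index, a whole table, or an entire database; for example (the index, table, and database names here are hypothetical):
postgres=# REINDEX INDEX index_product_id;
postgres=# REINDEX TABLE products;
postgres=# REINDEX DATABASE mydb;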
PostgreSQL Write-Ahead Log (WAL)
• With most installations and packages, 16 MB is the size of the WAL segments. Unless your transaction rate is
through the roof, a 16 MB segment size is good enough.
• As new records are written, they are appended to the WAL logs. Each record's position is defined by a Log
Sequence Number. The Log Sequence Number (LSN) is a unique identifier in the transaction log, representing a
position in the WAL stream. That is, as records are added to the Postgres WAL, their insert positions are described
by the Log Sequence Number.
• You can look at two LSN values and based on the difference, determine how much WAL data lies in between
them. This will let you estimate the advancement of recovery.
• Benefits of WAL in PostgreSQL:
• As only log files are flushed to disk during transactional
commit, it reduces the number of disk writes.
• The cost of syncing your log file is less as the log files are
written sequentially.
• It adds data page consistency.
• Postgres WAL offers on-line backup and point-in-time
recovery.
• Temporary spikes in the WAL file count typically result in a lot of disk I/O and CPU activity, causing
application queries to run slower until things are back to normal.
• Increases in the WAL file count that refuse to come back down have to be dealt with quickly. These can be because of:
• Archival failures: If the archive script fails for a certain WAL file, Postgres will retain it and keep retrying until it
succeeds. In the meantime, new WAL files will keep getting created. Ensure that WAL archival processes are not
broken, and can keep up with the WAL creation rate.
• Replication failures: When using streaming replication if the standby goes offline for extended periods of time,
or if someone forgot to delete the replication slot on the primary, the WAL files can be retained indefinitely.
• Long running transactions: These can prevent checkpoints, and therefore the WAL files have to be retained until
the checkpointer can make progress. Ensure that you have no long running transactions, especially ones
that mutate a lot of data.
• How Can We Monitor / Follow WAL Files?
• Scripts:
• Shell scripts that simply monitor the count of files in the $PGDATA/pg_wal directory, and send the values to your
existing monitoring systems should help you keep track of the WAL file count.
• Existing script-based tools like check_postgres can also collect this information. You should also have a way to
correlate this count with the PostgreSQL activity going on at a specific time.
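A minimal sketch of such a count check, assuming $PGDATA points at the data directory:
$ ls -1 "$PGDATA/pg_wal" | grep -cE '^[0-9A-F]{24}$'
The regular expression matches the 24-hex-digit WAL segment file names and grep -c prints how many there are.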
• Queries:
• PostgreSQL does not have a built-in function or view that directly returns WAL file related information. You can,
however, use this query:
postgres=# SELECT COUNT(*) FROM pg_ls_dir('pg_wal') WHERE pg_ls_dir ~ '^[0-9A-F]{24}';
that does the job of getting the count of WAL files. (Note that you’ll need superuser privileges or explicit GRANTs to do a pg_ls_dir)
• pgmetrics:
• pgmetrics (https://pgmetrics.io) is an open-source tool that can collect and report a lot of PostgreSQL metrics,
including WAL file counts.
• pgDash:
• pgDash (https://pgdash.io/) is a modern, in-depth monitoring solution designed specifically for PostgreSQL
deployments.
• pgDash shows you information and metrics about every aspect of your PostgreSQL database server, collected
using the open-source tool pgmetrics. With pgDash you can correlate WAL file activity at any given time with
the SQL queries that were running at the time, and system-level metrics like CPU and memory usage.
Database Query - JOINS
• A PostgreSQL join is used to combine columns from one (self-join) or more tables based on the values of the common
columns between related tables. The common columns are typically the primary key columns of the first table and the
foreign key columns of the second table.
• PostgreSQL supports: inner join, left join, right join, full outer join, cross join, natural join and a special kind of join
called self-join.
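The join examples that follow assume two small sample tables; a hypothetical setup such as the following produces the row numbers referenced in the explanations:
postgres=# CREATE TABLE basket_a (a INT PRIMARY KEY, fruit_a VARCHAR(100) NOT NULL);
postgres=# CREATE TABLE basket_b (b INT PRIMARY KEY, fruit_b VARCHAR(100) NOT NULL);
postgres=# INSERT INTO basket_a (a, fruit_a) VALUES (1, 'Apple'), (2, 'Orange'), (3, 'Banana'), (4, 'Cucumber');
postgres=# INSERT INTO basket_b (b, fruit_b) VALUES (1, 'Orange'), (2, 'Apple'), (3, 'Watermelon'), (4, 'Pear');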
Database Query - JOINS - Inner Join
• The following statement joins the first table (basket_a) with the second table (basket_b) using an inner join:
postgres=# SELECT a, fruit_a, b, fruit_b
FROM basket_a
INNER JOIN basket_b
ON fruit_a = fruit_b;
• The inner join examines each row in the first table (basket_a).
• It compares the value in the fruit_a column with the value in the fruit_b column of each row in the second table
(basket_b).
• If these values are equal, the inner join creates a new row that contains columns from both tables and adds this new
row to the result set.
Database Query - JOINS - Left Join
• The following statement uses the left join to join the first table (basket_a) with the second table (basket_b). In the left
join context, the first table is called the left table and the second table is called the right table:
postgres=# SELECT a, fruit_a, b, fruit_b
FROM basket_a
LEFT JOIN basket_b
ON fruit_a = fruit_b;
• The left join starts selecting data from the left table. It compares values in the fruit_a column with the values in the
fruit_b column in the basket_b table. If these values are equal, the left join creates a new row that contains columns
of both tables and adds this new row to the result set (see rows #1 and #2 in the result set).
• In case the values are not equal, the left join also creates a new row that contains columns from both tables and adds
it to the result set. However, it fills the columns of the right table (basket_b) with null (both rows #3 and #4 in the
result are null).
Database Query - JOINS - Left Join - only rows from the left table
• To select rows from the left table that do not have matching rows in the right table, you use the left join with a WHERE
clause. For example:
postgres=# SELECT a, fruit_a, b, fruit_b
FROM basket_a
LEFT JOIN basket_b
ON fruit_a = fruit_b
WHERE b IS NULL;
• Note that the LEFT JOIN is the same as the LEFT OUTER JOIN so you can use them interchangeably.
• The following diagram illustrates the left join that returns rows from the left table that do not have matching rows from
the right table:
Database Query - JOINS - Right Join
• The right join is a reversed version of the left join. The right join starts selecting data from the right table. It compares each value
in the fruit_b column of every row in the right table with each value in the fruit_a column of every row in the basket_a table.
• If these values are equal, the right join creates a new row that contains columns from both tables.
• In case these values are not equal, the right join also creates a new row that contains columns from both tables. However, it fills
the columns in the left table with NULL.
• The following statement uses the right join to join the basket_a table with the basket_b table:
postgres=# SELECT a, fruit_a, b, fruit_b
FROM basket_a
RIGHT JOIN basket_b
ON fruit_a = fruit_b;
Database Query - JOINS - Right Join - only rows from the right table
• Similarly, you can get rows from the right table that do not have matching rows from the left table by adding a WHERE clause as
follows:
postgres=# SELECT a, fruit_a, b, fruit_b
FROM basket_a
RIGHT JOIN basket_b
ON fruit_a = fruit_b
WHERE a IS NULL;
• Note that the RIGHT JOIN and RIGHT OUTER JOIN are the same therefore you can use them interchangeably.
• The following diagram illustrates the right join that returns rows from the right table that do not have matching rows in
the left table:
Database Query - JOINS - Full Outer Join
• The full outer join or full join returns a result set that contains all rows from both the left and right tables, with the matching rows
from both sides where available. In case there is no match, the columns of the other table are filled with NULL:
postgres=# SELECT a, fruit_a, b, fruit_b
FROM basket_a
FULL OUTER JOIN basket_b
ON fruit_a = fruit_b;
• The following Venn diagram illustrates the full outer join that returns rows from a table that do not have the
corresponding rows in the other table:
Database Query - JOINS - Briefly
🔹 INNER JOIN: Returns matching rows in both tables
🔹 LEFT JOIN: Returns all records from the left table, and the matching records from the right table
🔹 RIGHT JOIN: Returns all records from the right table, and the matching records from the left table
🔹 FULL OUTER JOIN: Returns all records from both tables, whether or not there is a match
Database Query - Other SQL operators that may be required
• UNION
• INTERSECT
• EXCEPT
• EXPLAIN
• HAVING
• ROLLUP
• ANY
• ALL
• EXISTS
• UPDATE JOIN
• DELETE JOIN
...
and so on... :)
Tutorials & Other Resources
• W3Schools PostgreSQL Documentation: Quickly learn PostgreSQL and test yourself with exercises.
• Tutorials Point PostgreSQL: A full, free online course walking through PostgreSQL, from the basics to advanced administration.
• PostgreSQL Primer for Busy People: A handy single-page resource and reference guide for getting started with PostgreSQL.
• PostgreSQL Tutorial: Learn PostgreSQL and how to get started quickly through practical examples.
• Awesome Postgres: A curated list of awesome PostgreSQL software, libraries, tools and resources.